All Apps and Add-ons

Difference between the format of data sent to Hadoop via Hadoop Data Roll vs. Splunk Hadoop Connect?

saranya_fmr
Communicator

Hi all,

The goal is to move data from Splunk to Hadoop/S3 for longer retention; currently we store data in Splunk for only two months.
We want to send the data to Hadoop so the analytics team can analyse it further with Hadoop techniques like Spark, Hive, etc.

  1. First, what is the difference between the format of the data sent via Hadoop Connect export and via Hadoop Data Roll? I believe Hadoop Connect exports search results, while Hadoop Data Roll sends the raw data journal.gz.
  2. Can I use Hadoop techniques like Hive, Pig, etc. for analytics on the archived data sent to Hadoop via Hadoop Data Roll?
    AND
    Can I use Hadoop techniques like Hive, Pig, etc. for analytics on the archived data exported to Hadoop via Splunk Hadoop Connect?
  3. I came across Splunk Archive Bucket Reader - an additional app for analyzing the archived data with Hadoop applications like Pig, Hive, and Spark.
    Is this app mandatory if I want to analyse the data in Hadoop?


newbie2tech
Communicator

Hi saranya_fmr,

Were you able to get this implemented? What approach did you take? It would be helpful if you shared the details.

Thanks!


rdagan_splunk
Splunk Employee

1) You are correct. Hadoop Data Roll copies the journal.gz, while Hadoop Connect lets you pick CSV, raw events, XML, or JSON.
2) Yes. Hive, Pig, Spark, and Hadoop MapReduce jobs can all be used on files that Splunk sends to Hadoop via either Hadoop Connect or Hadoop Data Roll.
3) The Bucket Reader is needed only for the journal.gz files that Hadoop Data Roll copies to HDFS, and only if you read those files with Hive, Pig, Spark, or Hadoop MapReduce jobs. If you read the journal.gz back through Hadoop Data Roll itself (that is, you let Splunk do the searching), the Bucket Reader is not needed.
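
To make point 2 concrete, here is a minimal PySpark sketch of analyzing a Hadoop Connect export, assuming the export was configured as JSON; the HDFS path and field name below are hypothetical, not part of any real setup:

    # Minimal sketch: analyzing a Hadoop Connect JSON export with Spark.
    # The HDFS path and the "sourcetype" field are assumptions - adjust
    # them to match your actual export configuration.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("splunk-export-analytics").getOrCreate()

    # Hadoop Connect writes plain files (CSV/raw/XML/JSON), so Spark's
    # standard readers work on them directly - no extra app needed.
    events = spark.read.json("hdfs:///splunk_export/web_logs/")

    # Example analysis: event counts per sourcetype.
    events.groupBy("sourcetype").count().show()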


saranya_fmr
Communicator

Hi @rdagan ,

Thank you for your response, but I'm confused by points 2 and 3.
If I send Splunk data to Hadoop via Hadoop Data Roll and use Hadoop techniques like Hive, Spark, etc. as the search engine, that means I DO need the Bucket Reader, am I right?
However, if I'm using Hadoop Connect for the export, I DO NOT need the Bucket Reader?


rdagan_splunk
Splunk Employee

Both of your comments are correct. The Bucket Reader is needed only if you use Hadoop Data Roll and want to read the Splunk-generated raw data file (journal.gz) with Hadoop tools.
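
For that Hadoop Data Roll path, a rough sketch of what reading journal.gz through the Archive Bucket Reader's Hadoop InputFormat could look like from Spark. The fully qualified class names and the archive path layout below are assumptions, not confirmed API - take the real class names from the Splunk Archive Bucket Reader documentation and put the app's JAR on the Spark classpath (e.g. via --jars):

    # Rough sketch only: reading Hadoop Data Roll's journal.gz via the
    # Archive Bucket Reader InputFormat. The class names and archive path
    # are ASSUMPTIONS - check the app's docs for the real ones, and ship
    # the app's JAR with --jars so the classes are on the classpath.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("journal-gz-reader").getOrCreate()
    sc = spark.sparkContext

    # newAPIHadoopFile lets Spark read through any Hadoop InputFormat.
    events = sc.newAPIHadoopFile(
        "hdfs:///splunk_archive/myindex/db_*/rawdata/journal.gz",         # assumed layout
        inputFormatClass="com.splunk.journal.hadoop.JournalInputFormat",  # assumed name
        keyClass="org.apache.hadoop.io.LongWritable",                     # assumed key type
        valueClass="org.apache.hadoop.io.Text",                           # assumed value type
    )

    # Each record is a (key, raw event) pair under the assumed types above.
    print(events.take(5))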
