All Apps and Add-ons

What is the difference between the data format of data sent to Hadoop via Hadoop Data Roll vs. Splunk Hadoop Connect?

Communicator

Hi all,

The goal is to move data from Splunk to Hadoop/S3 for longer retention. Currently we store data in Splunk for only two months.
We want to send the data to Hadoop so the analytics team can analyse it further with Hadoop tools like Spark, Hive, etc.

  1. First, what is the difference between the data format of data exported via Hadoop Connect and data archived via Hadoop Data Roll? I believe Hadoop Connect exports search results, while Hadoop Data Roll sends the raw data (journal.gz).
  2. Can I use Hadoop tools like Hive, Pig, etc. for analytics on the archived data sent to Hadoop via Hadoop Data Roll?
    AND
    Can I use the same Hadoop tools for analytics on the data exported to Hadoop via Splunk Hadoop Connect?

  3. I came across Splunk Archive Bucket Reader, an additional app for analysing archived data with Hadoop applications like Pig, Hive, and Spark.
    Is this app mandatory if I want to analyse the archived data in Hadoop?

1 Solution

Splunk Employee

1) You are correct. Hadoop Data Roll copies the journal.gz, while Hadoop Connect lets you choose CSV, raw events, XML, or JSON.
2) Yes. Hive, Pig, Spark, and Hadoop MapReduce jobs can all be used on files that Splunk sends to Hadoop via either Hadoop Connect or Hadoop Data Roll.
3) The Bucket Reader is needed only for the journal.gz files that Hadoop Data Roll copies to HDFS, and only if you use Hive, Pig, Spark, or Hadoop MapReduce jobs on those files. If you read the archived journal.gz through Splunk itself (using Hadoop Data Roll as the reading engine), the Bucket Reader is not needed.
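To illustrate point 1, here is a small sketch (not from the original thread; the event and field names are purely illustrative) of what the same event looks like in the raw, JSON, and CSV formats that Hadoop Connect can export. Unlike these plain-text formats, journal.gz is Splunk's internal bucket format, which is why Hadoop tools need the Bucket Reader to parse it.

```python
import csv
import io
import json

# A sample Splunk event; the fields shown are purely illustrative.
event = {
    "_time": "2017-06-01T12:00:00",
    "host": "web01",
    "sourcetype": "access_combined",
    "_raw": '127.0.0.1 - - [01/Jun/2017:12:00:00] "GET / HTTP/1.1" 200 1024',
}

# Raw export: just the original event text, one event per line.
raw_line = event["_raw"]

# JSON export: one JSON object per event, directly readable by Hive/Spark.
json_line = json.dumps(event)

# CSV export: a header row plus one row per event.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(event))
writer.writeheader()
writer.writerow(event)
csv_text = buf.getvalue()

print(raw_line)
print(json_line)
print(csv_text)
```

All three outputs are ordinary line-oriented text, so downstream tools can consume them without any Splunk-specific reader.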


Communicator

Hi saranya_fmr,

Were you able to get this implemented? I wanted to hear what approach you took; it would be helpful if you could share the details.

Thanks!



Communicator

Hi @rdagan ,

Thank you for your response, but I'm confused about points 2 and 3.
If I want to send Splunk data to Hadoop via Hadoop Data Roll and use Hadoop tools like Hive, Spark, etc. as the search engine, does that mean I DO need the Bucket Reader?
Am I right?
However, if I'm using Hadoop Connect for the export, I DO NOT need the Bucket Reader?


Splunk Employee

Both of your statements are correct. The Bucket Reader is needed only if you use Hadoop Data Roll and want to use the Splunk-generated raw data file (journal.gz) with Hadoop tools.
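For the Hadoop Data Roll path, the Bucket Reader ships a Hadoop InputFormat/SerDe so Hive can query journal.gz directly. A hedged sketch of what the table definition could look like follows; the SerDe and InputFormat class names below are PLACEHOLDERS, not the actual classes, so take the exact names, column mappings, and archive path from the Splunk Archive Bucket Reader documentation for your version.

```sql
-- Hedged sketch: an external Hive table over buckets that Hadoop Data Roll
-- archived to HDFS. Class names and the LOCATION path are placeholders.
CREATE EXTERNAL TABLE splunk_archive (
  event_time BIGINT,
  host       STRING,
  source     STRING,
  sourcetype STRING,
  raw        STRING
)
ROW FORMAT SERDE 'com.splunk.journal.hive.JournalSerDe'              -- placeholder
STORED AS INPUTFORMAT 'com.splunk.journal.hadoop.JournalInputFormat' -- placeholder
          OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/archive/splunk/main';  -- wherever Hadoop Data Roll writes its buckets
```

With a table like this in place, Hive, and by extension Spark SQL, can run ordinary SQL over the archived events; without the Bucket Reader's classes on the classpath, the journal.gz files are opaque to these tools.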
