
Splunk to Hadoop data export using Hadoop Connect app (BOUNTY offered)

newbie2tech
Communicator

Hi All,

We need to export data from Splunk to Hadoop.

I have come across the Hadoop Connect app and am in the process of setting it up.

My question is about its reliability: my use case is to export pretty much all the data that comes into Splunk (500+ GB/day). Can this app reliably handle that volume? If NOT, what volume can it handle, and do you have any suggestions on an approach for exporting 500+ GB/day?

Also, is there any performance impact from this app running on the search head while exporting this volume of data? Anything we need to be aware of? We plan to schedule the export every 30 minutes or so.

Any other aspect/scenario/limitation we need to be aware of?

Note: the 500 GB/day is already being indexed, so the IT team needs this done from Splunk and NOT from the source. All future data that is onboarded will be sent to both Splunk and Hadoop.

Splunk Version : 6.5.2
Cloudera Enterprise 5.10.1 (hadoop-2.6.0)
Kerberos Secured

Note: Looking for some guidance; will award points for appropriate/working suggestions.

1 Solution

mattymo
Splunk Employee
Splunk Employee

Hi newbie2tech,

From what I understand, Hadoop Connect can do about 1 TB/day per search head. There is no built-in fault tolerance, and it would obviously have a cost in terms of search load and the movement of data to Hadoop.

If you are basically replicating all ingested data over to Hadoop, I would recommend Hadoop data roll as the way to go (included free in Splunk Enterprise 6.5+): http://docs.splunk.com/Documentation/Splunk/6.6.1/Indexer/ArchivingindexestoHadoop

Not only is it going to be more scalable/performant for you (170 GB/hr per indexer, last I heard), it also has built-in redundancy, ensuring a copy of each of your buckets makes it to Hadoop.

It also ensures only a single copy of your buckets is sent to Hadoop, regardless of your replication factor if you are clustering.

While Hadoop Connect has the benefit of letting you choose various output formats, Hadoop data roll transfers Splunk's proprietary journal.gz. However, there is a free app you can use to allow Hadoop tools to read the journal: https://splunkbase.splunk.com/app/2759/

All things considered, I recommend you check out Hadoop Data roll for this use case.
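For reference, Hadoop data roll archiving is configured through provider and archive-index stanzas in indexes.conf (or via the Virtual Indexes UI). Below is a minimal sketch; the provider name, hostnames, paths, and Kerberos principal are all hypothetical, and the exact setting names should be verified against the archiving documentation for your Splunk version:

```ini
# Hypothetical HDFS provider for a Kerberized CDH cluster
[provider:cdh-archive]
vix.family = hadoop
vix.env.JAVA_HOME = /usr/java/latest
vix.env.HADOOP_HOME = /usr/lib/hadoop
vix.fs.default.name = hdfs://namenode.example.com:8020
vix.splunk.home.hdfs = /user/splunk/workdir
vix.kerberos.principal = splunk@EXAMPLE.COM
vix.kerberos.keytab = /etc/security/keytabs/splunk.keytab

# Archive index that copies buckets from the "main" index to HDFS
[main_archive]
vix.provider = cdh-archive
vix.output.buckets.from.indexes = main
vix.output.buckets.older.than = 86400
vix.output.buckets.path = /user/splunk/archive/main
```

Note that although the configuration lives with the cluster, the indexers perform the actual bucket transfer, which is why throughput scales per indexer rather than per search head.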


newbie2tech
Communicator

Thank you mmodestino for your inputs. I will look at these from my end and get back to you, hopefully by tomorrow.


newbie2tech
Communicator

Apologies for the delay in getting back; I had to put this use case on hold due to other priorities, but now I am back to executing it. Thank you for the pointers.

Does the data roll approach delete the data from the Splunk index, or just make a copy, or do we have the option to choose either? I see that we have the ability to read the archived data in Hadoop from Splunk.


mattymo
Splunk Employee
Splunk Employee

You can do both.

You can archive as soon as possible and then let the indexes naturally roll buckets to frozen, since they will remain in S3. Or you can simply roll to S3 as part of your rollToFrozen plan.
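To make the two options concrete: retention inside Splunk is still driven by the index's own frozen settings, while the archive threshold only controls how soon a copy lands in the archive destination. A hedged indexes.conf sketch, assuming a hypothetical provider stanza named `cdh-archive` and illustrative values (check the setting names against your version's docs):

```ini
[main]
# Buckets freeze (and are deleted, absent a coldToFrozen script) after ~90 days
frozenTimePeriodInSecs = 7776000

[main_archive]
vix.provider = cdh-archive
vix.output.buckets.from.indexes = main
# Copy buckets to the archive once they are ~1 day old, well before they freeze
vix.output.buckets.older.than = 86400
vix.output.buckets.path = /user/splunk/archive/main
```

With a short `older.than` threshold, the archive copy exists long before the bucket freezes, so normal frozen-time deletion in Splunk never loses data that hasn't already been archived.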


newbie2tech
Communicator

Hi MModestino,

In your Hadoop data roll suggestion you mentioned "there is a free app that you can use to allow Hadoop tools to read the journal" — how effective is it? Do you know of any implementations? My use case requires sending data to Hadoop so that Hadoop can consume it and do reporting by correlating the transferred data with its own existing data from other sources. With this app, can Hadoop consume the data?

Also, this would mean we need to install Splunk Enterprise and then the Bucket Reader app on the Hadoop cluster, right? Our Kerberized Hadoop cluster has 2 master nodes and 4 data nodes, so should Splunk Enterprise and the Bucket Reader app be installed on just one of the master nodes?


mattymo
Splunk Employee
Splunk Employee

Hit up the dev on the Splunkbase page for more about Bucket Reader, and check it out in your lab to see if it can satisfy your needs.

I would implement both in a lab and see which one you like. Or why not both!? 😉

Again, it comes down to requirements: data roll for accounting for every last drop, IMO.

But if you have well-defined use cases, Hadoop Connect can prep the format of the data you deposit in Hadoop.


newbie2tech
Communicator

Okay MModestino, that is a good suggestion. I will try to do both (hope my Hadoop team agrees to it :)) and take a call from there.


newbie2tech
Communicator

Hi Team,

Any insights? Looking forward to your guidance.
