<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: difference between index archiving (cold to Frozen) and archive to Hadoop via Hadoop Data Roll? in Getting Data In</title>
    <link>https://community.splunk.com/t5/Getting-Data-In/difference-between-index-archiving-cold-to-Frozen-and-archive-to/m-p/362562#M66083</link>
    <description>&lt;P&gt;A)&lt;BR /&gt;
Here is the link to configure your Provider with Kerberos: &lt;A href="https://docs.splunk.com/Documentation/Splunk/7.0.0/HadoopAnalytics/ConfigureKerberosauthentication" target="_blank"&gt;https://docs.splunk.com/Documentation/Splunk/7.0.0/HadoopAnalytics/ConfigureKerberosauthentication&lt;/A&gt;&lt;BR /&gt;
Also, make sure the Kerberos keytab, Hadoop home, and Java home are in exactly the same location on both the Search Head and all the Indexers; otherwise you might see this error: &lt;A href="https://docs.splunk.com/Documentation/Splunk/7.0.0/Indexer/Troubleshoot" target="_blank"&gt;https://docs.splunk.com/Documentation/Splunk/7.0.0/Indexer/Troubleshoot&lt;/A&gt;&lt;BR /&gt;
Hadoop version 2.6 should work without any issues. &lt;/P&gt;

&lt;P&gt;B) &lt;BR /&gt;
The script $SPLUNK_HOME/etc/apps/splunk_archiver/bin/coldToFrozen.sh is not required. Consider coldToFrozen.sh a fallback, not your primary archiving hook: it buys you more time to archive buckets when your system is receiving data faster than normal, or when the archiving storage layer is down. To reduce the need for it, set vix.output.buckets.older.than (a value in seconds) as low as possible for each archived index, so that buckets are archived as quickly as possible.&lt;BR /&gt;
So if, for example, you lower your setting from the equivalent of 60 days to the equivalent of 50 days, you should not need that script at all. &lt;/P&gt;
    <pubDate>Tue, 29 Sep 2020 16:49:30 GMT</pubDate>
    <dc:creator>rdagan_splunk</dc:creator>
    <dc:date>2020-09-29T16:49:30Z</dc:date>
    <item>
      <title>difference between index archiving (cold to Frozen) and archive to Hadoop via Hadoop Data Roll?</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/difference-between-index-archiving-cold-to-Frozen-and-archive-to/m-p/362559#M66080</link>
      <description>&lt;P&gt;I read  splunk docs and understood the below:&lt;BR /&gt;
Splunk Index archiving  from cold to frozen to a particular location can be done either via&lt;/P&gt;

&lt;OL&gt;
&lt;LI&gt;Automatically, by the Splunk indexer&lt;/LI&gt;
&lt;LI&gt;By using a coldToFrozen script (e.g. coldToFrozenExample.py)&lt;/LI&gt;
&lt;/OL&gt;

&lt;P&gt;This will archive data to the directory we specify in indexes.conf.&lt;BR /&gt;
However, it runs into problems in a clustered architecture, because multiple copies of the same bucket get archived, as per the doc below.&lt;/P&gt;
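
&lt;P&gt;For reference, here is a minimal indexes.conf sketch of that cold-to-frozen setup (the index name and paths are placeholders, not from my environment):&lt;/P&gt;

&lt;PRE&gt;
# Option 1: let the indexer archive frozen buckets to a directory automatically
[my_index]
coldToFrozenDir = /mnt/archive/my_index

# Option 2: run a custom archiving script instead
# (mutually exclusive with coldToFrozenDir)
# coldToFrozenScript = "$SPLUNK_HOME/bin/python" "$SPLUNK_HOME/bin/coldToFrozenExample.py"
&lt;/PRE&gt;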

&lt;P&gt;&lt;A href="https://docs.splunk.com/Documentation/Splunk/7.0.0/Indexer/Automatearchiving" target="_blank"&gt;https://docs.splunk.com/Documentation/Splunk/7.0.0/Indexer/Automatearchiving&lt;/A&gt;&lt;/P&gt;

&lt;HR /&gt;

&lt;P&gt;I'm planning to use Hadoop Data Roll to send the Splunk index data to Hadoop for longer retention, where it can be further processed/analysed using Hadoop technologies (Hive, Pig, etc.).&lt;/P&gt;

&lt;P&gt;A) My question is: do I configure Splunk index archiving to Hadoop by following the steps below, i.e. creating a Hadoop provider and updating indexes.conf with the settings below, as per the linked doc?&lt;/P&gt;

&lt;P&gt;[splunk_index_archive]&lt;BR /&gt;
vix.output.buckets.from.indexes&lt;BR /&gt;
vix.output.buckets.older.than&lt;BR /&gt;
vix.output.buckets.path&lt;BR /&gt;
vix.provider &lt;/P&gt;

&lt;P&gt;&lt;A href="https://docs.splunk.com/Documentation/Splunk/7.0.0/Indexer/ConfigureSplunkindexarchivingtoHadoop" target="_blank"&gt;https://docs.splunk.com/Documentation/Splunk/7.0.0/Indexer/ConfigureSplunkindexarchivingtoHadoop&lt;/A&gt; &lt;/P&gt;
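
&lt;P&gt;For example, a minimal sketch of such an archive stanza (the values are placeholders; note that vix.output.buckets.older.than is specified in seconds, e.g. 5184000 seconds = 60 days):&lt;/P&gt;

&lt;PRE&gt;
[splunk_index_archive]
vix.output.buckets.from.indexes = my_index
vix.output.buckets.older.than = 5184000
vix.output.buckets.path = /user/splunk/archive
vix.provider = my_hadoop_provider
&lt;/PRE&gt;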

&lt;P&gt;B) Does Hadoop Data Roll require the coldToFrozenExample.py script to send data to Hadoop?&lt;BR /&gt;
C) Does Hadoop Data Roll tackle the multiple copies issue?&lt;/P&gt;

&lt;HR /&gt;

&lt;P&gt;D) Could someone kindly explain what this doc refers to?&lt;BR /&gt;
&lt;A href="https://docs.splunk.com/Documentation/Splunk/7.0.0/Indexer/SetanarchivescripttoHadoop" target="_blank"&gt;https://docs.splunk.com/Documentation/Splunk/7.0.0/Indexer/SetanarchivescripttoHadoop&lt;/A&gt;&lt;/P&gt;

&lt;P&gt;What is the difference between the steps in my question A) above&lt;BR /&gt;
and using this script to transfer data to Hadoop?&lt;BR /&gt;
Or&lt;BR /&gt;
is it mandatory to use this script to transfer an index to Hadoop?&lt;/P&gt;

&lt;P&gt;I'm really confused; kindly help.&lt;/P&gt;
      <pubDate>Tue, 29 Sep 2020 16:43:55 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/difference-between-index-archiving-cold-to-Frozen-and-archive-to/m-p/362559#M66080</guid>
      <dc:creator>Harishma</dc:creator>
      <dc:date>2020-09-29T16:43:55Z</dc:date>
    </item>
    <item>
      <title>Re: difference between index archiving (cold to Frozen) and archive to Hadoop via Hadoop Data Roll?</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/difference-between-index-archiving-cold-to-Frozen-and-archive-to/m-p/362560#M66081</link>
      <description>&lt;P&gt;A) Exactly. You setup a Provider and VIX to archive the data.  Also, you will need to install Hadoop and Java on all of your Search Heads and Indexers.  The actual copy for the buckets is done from the Indexers.&lt;BR /&gt;
B) Hadoop Data Roll does not need the script coldToFrozenExample.py. &lt;BR /&gt;
C) Yes Hadoop Data Roll will only copy 1 bucket. the other copies will not be moved to HDFS. &lt;BR /&gt;
D) Hadoop Data Roll does not need a script, it just use the flag vix.output.buckets.older.than = seconds to determine if the bucket has to be copied or not.&lt;BR /&gt;&lt;BR /&gt;
"$SPLUNK_HOME/etc/apps/splunk_archiver/bin/coldToFrozen.sh" is used just to prevent buckets from being deleted by Splunk before you copy these buckets to HDFS.&lt;BR /&gt;
Do not confuse this script with this non-HDFS script: $SPLUNK_HOME/bin/coldToFrozenExample.py&lt;/P&gt;</description>
      <pubDate>Tue, 29 Sep 2020 16:44:30 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/difference-between-index-archiving-cold-to-Frozen-and-archive-to/m-p/362560#M66081</guid>
      <dc:creator>rdagan_splunk</dc:creator>
      <dc:date>2020-09-29T16:44:30Z</dc:date>
    </item>
    <item>
      <title>Re: difference between index archiving (cold to Frozen) and archive to Hadoop via Hadoop Data Roll?</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/difference-between-index-archiving-cold-to-Frozen-and-archive-to/m-p/362561#M66082</link>
      <description>&lt;P&gt;Hi  @rdagan ,&lt;/P&gt;

&lt;P&gt;Thank you so much for your response, but I have a few queries to help me understand better.&lt;/P&gt;

&lt;P&gt;A) &lt;/P&gt;

&lt;UL&gt;
&lt;LI&gt;My Hadoop cluster is Kerberos-authenticated, so I need to follow these steps to install the Kerberos client utilities on the Splunk servers as well, right? I ask because this is not mentioned in the Hadoop Data Roll system requirements doc:
&lt;A href="https://docs.splunk.com/Documentation/HadoopConnect/1.2.5/DeployHadoopConnect/Kerberosclientutilities"&gt;https://docs.splunk.com/Documentation/HadoopConnect/1.2.5/DeployHadoopConnect/Kerberosclientutilities&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;Also, our CDH version is Hadoop 2.6.0-cdh5.9.1. Is this supported? I see only up to CDH 5.6 mentioned in the system requirements.&lt;/LI&gt;
&lt;/UL&gt;

&lt;P&gt;D) Our retention policy is 62 days, so if I set&lt;BR /&gt;
vix.output.buckets.older.than to the equivalent of 60 days, the data will be copied to HDFS once it is 60 days old. If that parameter handles this, why is the coldToFrozen.sh script necessary?&lt;BR /&gt;
Does it mean the copying can take time?&lt;BR /&gt;
Is it better/advisable to have this script in place as well?&lt;/P&gt;
      <pubDate>Thu, 16 Nov 2017 12:43:24 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/difference-between-index-archiving-cold-to-Frozen-and-archive-to/m-p/362561#M66082</guid>
      <dc:creator>Harishma</dc:creator>
      <dc:date>2017-11-16T12:43:24Z</dc:date>
    </item>
    <item>
      <title>Re: difference between index archiving (cold to Frozen) and archive to Hadoop via Hadoop Data Roll?</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/difference-between-index-archiving-cold-to-Frozen-and-archive-to/m-p/362562#M66083</link>
      <description>&lt;P&gt;A)&lt;BR /&gt;
Here is the link to configure your Provider with Kerberos: &lt;A href="https://docs.splunk.com/Documentation/Splunk/7.0.0/HadoopAnalytics/ConfigureKerberosauthentication" target="_blank"&gt;https://docs.splunk.com/Documentation/Splunk/7.0.0/HadoopAnalytics/ConfigureKerberosauthentication&lt;/A&gt;&lt;BR /&gt;
Also, make sure the Kerberos keytab, Hadoop home, and Java home are in exactly the same location on both the Search Head and all the Indexers; otherwise you might see this error: &lt;A href="https://docs.splunk.com/Documentation/Splunk/7.0.0/Indexer/Troubleshoot" target="_blank"&gt;https://docs.splunk.com/Documentation/Splunk/7.0.0/Indexer/Troubleshoot&lt;/A&gt;&lt;BR /&gt;
Hadoop version 2.6 should work without any issues. &lt;/P&gt;

&lt;P&gt;B) &lt;BR /&gt;
The script $SPLUNK_HOME/etc/apps/splunk_archiver/bin/coldToFrozen.sh is not required. Consider coldToFrozen.sh a fallback, not your primary archiving hook: it buys you more time to archive buckets when your system is receiving data faster than normal, or when the archiving storage layer is down. To reduce the need for it, set vix.output.buckets.older.than (a value in seconds) as low as possible for each archived index, so that buckets are archived as quickly as possible.&lt;BR /&gt;
So if, for example, you lower your setting from the equivalent of 60 days to the equivalent of 50 days, you should not need that script at all. &lt;/P&gt;
      <pubDate>Tue, 29 Sep 2020 16:49:30 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/difference-between-index-archiving-cold-to-Frozen-and-archive-to/m-p/362562#M66083</guid>
      <dc:creator>rdagan_splunk</dc:creator>
      <dc:date>2020-09-29T16:49:30Z</dc:date>
    </item>
    <item>
      <title>Re: difference between index archiving (cold to Frozen) and archive to Hadoop via Hadoop Data Roll?</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/difference-between-index-archiving-cold-to-Frozen-and-archive-to/m-p/362563#M66084</link>
      <description>&lt;P&gt;Hi @rdagan ,&lt;/P&gt;

&lt;P&gt;Thank you so much for your assistance &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;
      <pubDate>Fri, 17 Nov 2017 08:51:11 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/difference-between-index-archiving-cold-to-Frozen-and-archive-to/m-p/362563#M66084</guid>
      <dc:creator>Harishma</dc:creator>
      <dc:date>2017-11-17T08:51:11Z</dc:date>
    </item>
  </channel>
</rss>

