Getting Data In

Difference between index archiving (cold to frozen) and archiving to Hadoop via Hadoop Data Roll?

Harishma
Communicator

I read the Splunk docs and understood the following:
Splunk index archiving from cold to frozen to a particular location can be done either

  1. Automatically by the Splunk indexer
  2. Or by using a coldToFrozen script such as coldToFrozenExample.py

This archives data to the directory that we specify in indexes.conf.
However, it runs into problems in a clustered architecture, because multiple copies of the same bucket get archived, as per the doc below.

https://docs.splunk.com/Documentation/Splunk/7.0.0/Indexer/Automatearchiving


I'm planning to use Hadoop Data Roll to send the Splunk index data to Hadoop for longer retention, where it can be further processed/analysed using Hadoop technologies (Hive, Pig, etc.).

A) Now my question is: do I configure Splunk index archiving to Hadoop by following the steps below, i.e. creating a Hadoop provider and updating indexes.conf with the settings below, as per the doc below?

[splunk_index_archive]
vix.output.buckets.from.indexes
vix.output.buckets.older.than
vix.output.buckets.path
vix.provider

https://docs.splunk.com/Documentation/Splunk/7.0.0/Indexer/ConfigureSplunkindexarchivingtoHadoop

B) Does Hadoop Data Roll require the coldToFrozenExample.py script to send data to Hadoop?
C) Does Hadoop Data Roll tackle the multiple copies issue?


D) Can someone kindly explain what this doc refers to?
https://docs.splunk.com/Documentation/Splunk/7.0.0/Indexer/SetanarchivescripttoHadoop

What's the difference between the steps in my question A) above
and using this script to transfer data to Hadoop?
Or
is it mandatory to use this script to transfer indexes to Hadoop?

I'm really confused, kindly help.


rdagan_splunk
Splunk Employee

A) Exactly. You set up a Provider and VIX to archive the data. You will also need to install Hadoop and Java on all of your search heads and indexers. The actual copying of the buckets is done from the indexers.
B) Hadoop Data Roll does not need the coldToFrozenExample.py script.
C) Yes. Hadoop Data Roll copies only one copy of each bucket; the other copies are not moved to HDFS.
D) Hadoop Data Roll does not need a script. It just uses the setting vix.output.buckets.older.than = <seconds> to determine whether a bucket has to be copied.

"$SPLUNK_HOME/etc/apps/splunk_archiver/bin/coldToFrozen.sh" is used just to prevent buckets from being deleted by Splunk before you copy these buckets to HDFS.
Do not confuse this script with this non-HDFS script: $SPLUNK_HOME/bin/coldToFrozenExample.py
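
As a rough illustration of A) and D), an indexes.conf for Hadoop Data Roll might look like the sketch below. The vix.output.buckets.* and vix.provider settings are the ones listed in the question. The provider name, source index name, host, paths, and the provider settings shown here are assumptions for illustration only and should be checked against the linked docs.

```ini
# indexes.conf (sketch; all names, hosts and paths are hypothetical)

# Hadoop provider that the archive index points at
[provider:my_hadoop_provider]
vix.family = hadoop
vix.env.JAVA_HOME = /usr/lib/jvm/java
vix.env.HADOOP_HOME = /opt/hadoop
vix.fs.default.name = hdfs://namenode.example.com:8020
vix.splunk.home.hdfs = /user/splunk/workdir

# Archive index: which Splunk indexes to copy, when, and where to
[splunk_index_archive]
vix.provider = my_hadoop_provider
vix.output.buckets.from.indexes = my_index
# Bucket age in seconds: 60 days * 86400 s/day = 5184000 s
vix.output.buckets.older.than = 5184000
vix.output.buckets.path = /user/splunk/archive
```

No coldToFrozen script appears anywhere in this sketch; the age threshold alone decides when a bucket is copied.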


rdagan_splunk
Splunk Employee

A)
Here is the link to configure your Provider with Kerberos: https://docs.splunk.com/Documentation/Splunk/7.0.0/HadoopAnalytics/ConfigureKerberosauthentication
Also, make sure the Kerberos keytab, Hadoop home, and Java home are in exactly the same location on both the search head and all the indexers. Otherwise you might see this error: https://docs.splunk.com/Documentation/Splunk/7.0.0/Indexer/Troubleshoot
Hadoop version 2.6 should work without any issues.
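
For illustration, a Kerberos-enabled provider stanza might look roughly like the sketch below. The provider name, realm, and paths are hypothetical, and the Kerberos setting names are given as I recall them from the linked doc, so please verify them there before use. The point is that the three paths must resolve to the same locations on the search head and on every indexer.

```ini
# indexes.conf provider stanza (sketch; name, realm and paths are hypothetical)
[provider:my_hadoop_provider]
vix.family = hadoop
# These paths should be identical on the search head and all indexers
vix.env.JAVA_HOME = /usr/lib/jvm/java
vix.env.HADOOP_HOME = /opt/hadoop
vix.kerberos.keytab = /etc/security/keytabs/splunk.keytab
# Kerberos identity used to talk to HDFS (assumed setting names, verify in the doc)
vix.kerberos.principal = splunk@EXAMPLE.COM
vix.hadoop.security.authentication = kerberos
```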

B)
The script $SPLUNK_HOME/etc/apps/splunk_archiver/bin/coldToFrozen.sh is not needed. Consider coldToFrozen.sh a fallback, not your primary hook for archiving. It buys you more time when your system is receiving data faster than normal or when the archive storage layer is down, so that you have more time to archive buckets. To help further, for each archive index you can set vix.output.buckets.older.than = <seconds> as low as possible, so that buckets are archived as quickly as possible.
So if, for example, you lower vix.output.buckets.older.than from the equivalent of 60 days to the equivalent of 50 days (the value itself is given in seconds), you should not have any need for that script.
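
Since the setting takes seconds, the day counts above translate as shown in the sketch below. It assumes the 62-day retention mentioned in this thread is set through frozenTimePeriodInSecs on a hypothetical source index named my_index; the archive stanza name is the one from the question.

```ini
# indexes.conf (sketch; my_index is a hypothetical index name)

[my_index]
# Retention: 62 days * 86400 s/day = 5356800 s before buckets are frozen
frozenTimePeriodInSecs = 5356800

[splunk_index_archive]
vix.output.buckets.from.indexes = my_index
# Archive at 50 days: 50 * 86400 = 4320000 s (60 days would be 5184000 s)
vix.output.buckets.older.than = 4320000
```

With these values a bucket becomes eligible for archiving 12 days before it would be frozen, instead of 2 days with the 60-day threshold.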

Harishma
Communicator

Hi @rdagan ,

Thank you very much for your assistance 🙂

Harishma
Communicator

Hi @rdagan ,

Thank you so much for your response. I have a few queries to help me understand better.

A)

D) Our retention policy is 62 days, so if I set
vix.output.buckets.older.than=60, the data will get copied to HDFS once it is 60 days old. So when this parameter can handle that, why is the coldToFrozen.sh script necessary?
Does it mean copying will take time?
Is it better/advisable to have this script in place as well?
