How can I send Splunk cold buckets to S3?
We have an on-premises Splunk deployment and want to send Splunk data to S3 for long-term storage.
I came across Hadoop Data Roll, which sends Splunk data to an S3A filesystem. That looks like something involving Hadoop + S3, which I'm not familiar with, and I'm very new to AWS. I thought Splunk could send data directly to S3 for archival. Isn't that possible?
The documentation says to provide some provider parameters. Can someone please elaborate on this? Does this mean I need to have Hadoop installed on S3?
Attempting to bring this thread up to date.
For Splunk Cloud customers: https://docs.splunk.com/Documentation/SplunkCloud/8.1.2011/Admin/DataSelfStorage
For Splunk Enterprise with SmartStore (S3):
https://docs.splunk.com/Documentation/Splunk/8.1.3/Indexer/SmartStorearchitecture
https://aws.amazon.com/blogs/aws/archive-s3-to-glacier/
For Splunk Enterprise without SmartStore:
No similar feature exists; build your own as previously mentioned.
see also (https://community.splunk.com/t5/All-Apps-and-Add-ons/How-to-put-cold-and-frozen-data-on-s3-in-AWS/)
Starting from a Splunk-provided Python script, I developed coldToFrozenPlusS3Uplaod.py, which encrypts frozen buckets and uploads them to S3.
It can be found here: https://github.com/marboxvel/Encrypt-upload-archived-Splunk-buckets
Take a look at the indexes.conf documentation for Splunk 7.0. There's a new feature (unsupported; hopefully released in 7.1?) built around the remotePath and storageType settings (see the very end of the spec file for an example). It automatically handles S3 and caches data back locally for searching.
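For reference, a minimal sketch of what that configuration looks like, based on the 7.x indexes.conf spec (the volume name, bucket, and path here are placeholders, not values from this thread):

```ini
# Define a remote S3 volume
[volume:remote_store]
storageType = remote
path = s3://your-bucket/splunk-indexes

# Point an index at it; $_index_name expands to the index's own name
[your_index]
remotePath = volume:remote_store/$_index_name
```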
It's still not supported in 7.1. It would be nice to have a remoteFrozenPath in 7.2.
@nickhillscpl
It seems a bit difficult to find details on the API of the Splunk script. How would one set up indexes.conf with a couple of indexes and the cold2frozen.py script?
@kpawar
I have been looking at the Data Roll functionality, but from what you describe it archives not only frozen data but also warm and cold? The documentation also talks about Hunk, which seems to be a legacy thing? Any chance you can clarify?
You don't need to install Hadoop on S3.
You need to install Java and the Hadoop client on your Splunk search head and indexers. Download link for Hadoop client 2.6: http://archive.apache.org/dist/hadoop/core/hadoop-2.6.0/hadoop-2.6.0.tar.gz
Then, on your search head, create a provider by following the steps at https://docs.splunk.com/Documentation/Hunk/6.4.8/Hunk/ArchivingSplunkindexestoS3
After creating the provider, set up the archived index.
Once all this is set up, Splunk will archive data every hour, so any buckets that are warm or cold are sent to AWS S3.
Hi @Anonymous ,
We have a clustered indexer, so if I use the coldToFrozen.py script, it would copy redundant data as well, right?
So by using Hadoop Data Roll I can avoid this duplication of copied data.
I am not using any Hadoop here; I just want to move the data to S3.
But I still don't understand what the HADOOP parameters below refer to.
The doc says the following:
Hadoop Home: /absolute/path/to/apache/hadoop-2.6.0
vix.fs.s3a.access.key:
vix.fs.s3a.secret.key:
vix.env.HADOOP_TOOLS: $HADOOP_HOME/share/hadoop/tools/lib
vix.splunk.jars: $HADOOP_TOOLS/hadoop-aws-2.6.0.jar,$HADOOP_TOOLS/aws-java-sdk-1.7.4.jar,$HADOOP_TOOLS/jackson-databind-2.2.3.jar,$HADOOP_TOOLS/jackson-core-2.2.3.jar,$HADOOP_TOOLS/jackson-annotations-2.2.3.jar
Hadoop Data Roll interacts with AWS S3 using the Hadoop client libraries, so you need to download the Hadoop client first on your search head and indexers.
These Hadoop parameters are required for connecting to S3.
Hadoop Home is the path where the Hadoop client is installed. So if you have the Hadoop client at /opt/hadoop-2.6, you add the below parameter to your provider:
vix.env.HADOOP_HOME = /opt/hadoop-2.6
You also need the below parameter, which gives the path to the Hadoop libs:
vix.env.HADOOP_TOOLS = $HADOOP_HOME/share/hadoop/tools/lib
The below parameter gives the paths to the AWS jars within the Hadoop client location:
vix.splunk.jars = $HADOOP_TOOLS/hadoop-aws-2.6.0.jar,$HADOOP_TOOLS/aws-java-sdk-1.7.4.jar,$HADOOP_TOOLS/jackson-databind-2.2.3.jar,$HADOOP_TOOLS/jackson-core-2.2.3.jar,$HADOOP_TOOLS/jackson-annotations-2.2.3.jar
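Putting those parameters together, a provider stanza in indexes.conf on the search head might look roughly like the sketch below. This is an illustration, not a verbatim config: the stanza name, bucket, key values, and JAVA_HOME path are placeholders, and you should cross-check the exact setting names against the Hunk archiving docs linked earlier.

```ini
[provider:s3-archive]
vix.family = hadoop
# Paths to your local Java and Hadoop client installs (placeholders)
vix.env.JAVA_HOME = /path/to/java
vix.env.HADOOP_HOME = /opt/hadoop-2.6.0
vix.env.HADOOP_TOOLS = $HADOOP_HOME/share/hadoop/tools/lib
# S3 credentials for the s3a filesystem
vix.fs.s3a.access.key = YOUR_ACCESS_KEY
vix.fs.s3a.secret.key = YOUR_SECRET_KEY
# AWS and Jackson jars shipped with the Hadoop client
vix.splunk.jars = $HADOOP_TOOLS/hadoop-aws-2.6.0.jar,$HADOOP_TOOLS/aws-java-sdk-1.7.4.jar,$HADOOP_TOOLS/jackson-databind-2.2.3.jar,$HADOOP_TOOLS/jackson-core-2.2.3.jar,$HADOOP_TOOLS/jackson-annotations-2.2.3.jar
```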
I suppose that means there is no way to use an EC2 role to access the S3 bucket rather than an access key and secret?
Awesome!! I understand now. Thank you so much, @kpawar
Do we use Hadoop Data Roll for sending data to EMR Hadoop as well?
Also, as @ByteFlinger asked: is there no way to access the S3 bucket other than using an access and secret key?
Yes, you can use Hadoop Data Roll for sending data to EMR Hadoop.
Currently, you need to provide an access/secret key to use Hadoop Data Roll with AWS S3. There is no way other than using your AWS access/secret keys.
The whole Hadoop Data Roll stuff seems a bit more complex than the coldToFrozen script.
From what you describe it seems that you are not only archiving frozen data but also warm and cold, so it feels a bit like a backup rather than an archive, is that correct?
Also, the whole Hunk stuff seems to be marked as legacy in the documentation; should one really be using it?
Hunk and Hadoop Data Roll take a few steps to set up, but once set up correctly, it works.
With Hadoop Data Roll, we ensure that a bucket is archived before it is frozen; that's why cold/warm buckets are archived. After a bucket is archived and its retention period is over, the local bucket on the indexer filesystem is deleted, but you still have the archived bucket on the Hadoop filesystem.
Hunk and Hadoop Data Roll are the same thing. If you want to archive data, Hunk/Hadoop Data Roll is a good option, and you could use that.
But that means it is essentially an archive for the warm/cold data, while frozen data is still deleted. If one has an indexer cluster with a minimum replication factor of 2, my understanding is that this would be a bit redundant, and the frozen data is still deleted; is that correct?
I am trying to gauge what the difference would be between the coldToFrozen script and Data Roll.
If one has an indexer cluster, there are multiple copies of each bucket. But during archiving, we archive only one copy, irrespective of how many copies of the bucket exist. So Hadoop does not have redundant data.
And can Amazon Glacier also be used for Data Roll?
I should probably do a bit of reading before posting sometimes. Using Amazon Glacier with Data Roll would make no sense, so disregard the last question. I'm only curious about the stanza.
I have been doing some reading, and I think I understand better how it works now. It seems there is a single stanza for setting up archiving for all indexes, which means the list can get pretty big. Is that correct?
Hey, I wrote some scripts to do this a while ago.
I must confess I have not recently reviewed or tested this, but perhaps it can show you the general direction:
https://github.com/nickhills81/splunkDeepfreeze
In its simplest form, you would update your indexes.conf like this:
[your_index]
homePath = $SPLUNK_HOME/var/lib/splunk/your_index/db
coldPath = $SPLUNK_HOME/var/lib/splunk/your_index/colddb
thawedPath = $SPLUNK_HOME/var/lib/splunk/your_index/thaweddb
maxDataSize = auto_high_volume
# Freeze data after 30 days (inline comments are not valid in .conf files)
frozenTimePeriodInSecs = 2592000
coldToFrozenScript = /opt/splunk/etc/apps/your_app/bin/cold2frozen.py
Note the last two settings.
That will move your buckets to the path set in the cold2frozen script after 30 days.
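For the curious, the general shape of such a script is roughly the sketch below. This is a minimal illustration, not the actual splunkDeepfreeze code: Splunk invokes whatever is set as coldToFrozenScript with the bucket directory path as its only argument, and the script must move the data out before Splunk deletes it. The staging directory and the commented-out S3 upload step are assumptions for illustration.

```python
#!/usr/bin/env python3
# Minimal sketch of a coldToFrozenScript. Splunk calls this with the
# frozen bucket's directory path as the first command-line argument.
import os
import sys
import tarfile

# Local staging area for archived buckets (placeholder path)
ARCHIVE_DIR = "/opt/splunk-frozen"

def freeze_bucket(bucket_path: str, archive_dir: str = ARCHIVE_DIR) -> str:
    """Tar-gzip a frozen bucket directory into archive_dir and return the tarball path."""
    os.makedirs(archive_dir, exist_ok=True)
    bucket_name = os.path.basename(bucket_path.rstrip("/"))
    tar_path = os.path.join(archive_dir, bucket_name + ".tar.gz")
    with tarfile.open(tar_path, "w:gz") as tar:
        tar.add(bucket_path, arcname=bucket_name)
    # From here you could upload the tarball to S3, e.g. with boto3:
    #   boto3.client("s3").upload_file(tar_path, "my-bucket", "frozen/" + bucket_name + ".tar.gz")
    return tar_path

if __name__ == "__main__" and len(sys.argv) == 2:
    freeze_bucket(sys.argv[1])
```

Note that Splunk treats a non-zero exit status as a failure and will retry the bucket later, so a real script should let exceptions propagate rather than swallow them.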