I have an indexer cluster with a minimum replication factor of 2 to prevent data loss. I would like to set up Splunk to archive frozen data to an S3 bucket after the retention period has passed (this will eventually be an S3 Glacier bucket, for minimal cost and reliable storage). This data DOES NOT need to be searchable; it just needs to be available for thawing in the future.
It seems that Splunk provides a few options with advantages and disadvantages so I am trying to understand what would be the best in such a scenario.
Using cold to frozen script
This seems to fit most of the criteria, but it requires a separate disk area to move the frozen data to first. I also have some questions about this method:
What is the API of such a script? I cannot find any information. By that I mean: what arguments, if any, does Splunk supply to the script?
Instead of having the coldToFrozen script move the data and then a separate script move it to S3, as per the recommendation, couldn't one set auto-archiving (coldToFrozenDir) in Splunk and then have the second script move the data from there to S3, thus saving one script?
Hadoop Data Roll
This one seems to be a bit of an oddball. The information on how it works is spread everywhere, and one might think a Hadoop cluster is required, but some information suggests that a Hadoop client on the Splunk server is enough to write to S3. Is this correct? Some other questions:
This is definitely more complicated to set up. Is there a definitive step-by-step guide, with examples, on how to go about it?
It is a bit unclear how this works. Does it archive warm/cold buckets too? Does it archive frozen data at all?
I want Splunk to search warm/cold data from the disks, not from S3, but it is unclear whether that is the case here.
So I am a bit confused about the best way to go here. Setting up the coldToFrozen script feels simple (if I can figure out the API of the script call), but I am willing to get my hands dirty with the Hadoop Data Roll process if that means the archived data in S3 would also be searchable. That is only worthwhile if Splunk still searches hot/warm/cold buckets from the disks and only frozen data from S3 (due to the obvious performance differences).
It is a bit of a long post but any comments and suggestions are more than welcome to try and clarify the issue.
The frozen workflow simply executes a script and passes it a single parameter.
This parameter is the full path to the bucket which is about to move from cold to frozen.
If you wrote a batch/bash script, it could be as simple as one line (pseudo code):
copy $1 /your_frozen/path
The effect of this is that Splunk will call your script like this: yourscript.batsh /path/to/cold/bucket
Your script then effectively runs
copy /path/to/cold/bucket /your_frozen/path
Once the copy operation completes, your script ends, and Splunk deletes its copy from the cold DB.
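For anyone who prefers something closer to runnable, here is a minimal Python sketch of the same idea. It is only an illustration under a few assumptions: the staging path is the same /your_frozen/path placeholder as above, the FROZEN_STAGING_DIR environment variable and the ".partial" suffix are made-up conventions of this sketch, and, as far as I know, Splunk only removes the bucket when the script exits with 0, so every failure returns a non-zero code.

```python
#!/usr/bin/env python3
"""Sketch of a coldToFrozen script: copy the bucket Splunk passes in to a staging area."""
import os
import shutil
import sys

# Placeholder staging location; FROZEN_STAGING_DIR is a hypothetical override.
STAGING_DIR = os.environ.get("FROZEN_STAGING_DIR", "/your_frozen/path")


def stage_bucket(bucket_path, staging_dir):
    """Copy one bucket directory into staging_dir and return the staged path."""
    bucket_path = bucket_path.rstrip("/\\")
    if not os.path.isdir(bucket_path):
        raise ValueError("not a bucket directory: %s" % bucket_path)
    final = os.path.join(staging_dir, os.path.basename(bucket_path))
    if os.path.exists(final):
        return final  # already staged by an earlier call
    partial = final + ".partial"
    os.makedirs(staging_dir, exist_ok=True)
    if os.path.exists(partial):
        shutil.rmtree(partial)  # leftover from an interrupted run
    # Copy under a temporary name, then rename, so that anything reading
    # the staging area never sees a half-copied bucket.
    shutil.copytree(bucket_path, partial)
    os.rename(partial, final)
    return final


def main(bucket_path):
    """Return 0 only when the copy is complete; anything else signals failure."""
    try:
        stage_bucket(bucket_path, STAGING_DIR)
    except Exception as exc:
        sys.stderr.write("staging failed: %s\n" % exc)
        return 1
    return 0


if __name__ == "__main__" and len(sys.argv) == 2:
    # Splunk supplies exactly one argument: the full path to the bucket.
    sys.exit(main(sys.argv[1]))
```

The script never deletes the source bucket itself; that is left to Splunk once the script reports success.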
You are totally correct that you could write one script to handle copying the data to cloud storage. However, you need to be very careful about error handling: if the script accidentally tells Splunk it has finished before the upload has actually completed, Splunk deletes its copy, leaving you in trouble.
It's for this reason my script runs two processes: one to copy the data to a temporary path, leaving Splunk free to delete the cold copy, and then a second process to upload the files to S3 in its own sweet time. The second script cleans up after the upload, and since it is run as a scripted input, you can see the results of the upload in Splunk! 🙂
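A rough sketch of what such a second process could look like in Python follows. This is not my actual script, just an illustration under stated assumptions: the S3 bucket name is hypothetical, the upload shells out to the AWS CLI (`aws s3 cp ... --recursive`), and directories ending in ".partial" are assumed to be copies still in progress. Each bucket is deleted locally only after its upload succeeds, and the key=value lines printed to stdout are what Splunk would index when the script runs as a scripted input.

```python
#!/usr/bin/env python3
"""Sketch of the second process: upload staged buckets to S3, then clean up."""
import os
import shutil
import subprocess

STAGING_DIR = "/your_frozen/path"          # placeholder staging path
S3_PREFIX = "s3://example-frozen-archive"  # hypothetical bucket name


def aws_cli_upload(local_dir, s3_url):
    """Upload a directory with the AWS CLI; raises CalledProcessError on failure."""
    subprocess.run(["aws", "s3", "cp", local_dir, s3_url, "--recursive"], check=True)


def upload_staged_buckets(staging_dir, s3_prefix, upload=aws_cli_upload, log=print):
    """Upload each fully staged bucket; delete the local copy only on success."""
    uploaded = []
    for name in sorted(os.listdir(staging_dir)):
        local_dir = os.path.join(staging_dir, name)
        # Skip stray files and any bucket that is still being copied in.
        if not os.path.isdir(local_dir) or name.endswith(".partial"):
            continue
        try:
            upload(local_dir, "%s/%s" % (s3_prefix, name))
        except Exception as exc:
            # Keep the local copy so the next run can retry it.
            log("status=failed bucket=%s error=%r" % (name, str(exc)))
            continue
        shutil.rmtree(local_dir)
        log("status=uploaded bucket=%s" % name)
        uploaded.append(name)
    return uploaded


if __name__ == "__main__" and os.path.isdir(STAGING_DIR):
    upload_staged_buckets(STAGING_DIR, S3_PREFIX)
```

Passing the upload function in as a parameter keeps the retry and cleanup logic independent of the transfer tool, so the AWS CLI call could be swapped for another client without touching the loop.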
Awesome. Thank you. The fact that Splunk passes a single argument to the script, and what that argument is, does not seem to be documented anywhere, which is exactly the information I was looking for when I mentioned the API of that script hook.