I would like to run Splunk in AWS. I want to send log data to the system and, as Splunk indexes it, have Splunk build a copy of my raw data in Amazon S3. I've looked at the following options, but nothing seems to quite fit:
- Shuttl (https://github.com/splunk/splunk-shuttl): looks good, but nobody is actively developing it.
- Splunk Archiving: this archives Splunk's native bucket files. I could write something to move those files to S3, but it doesn't seem that I could get the raw data out of them? I would guess the raw data is somewhere deep inside Splunk's file format, but is that format likely to change in the future? Also, archiving looks like it works on old data, and I would like to move data to S3 as it is indexed.
- Splunk Hadoop Connect: this seems to export parsed data, not the raw data (although I may have read this wrong).
Have Splunk watch a DIFFERENT directory than the one the files arrive in, and tell it to delete each file once it is done indexing it, like this in inputs.conf:
move_policy = sinkhole
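For what it's worth, sinkhole deletion is configured on a [batch://...] stanza rather than a [monitor://...] one. A minimal inputs.conf sketch, where the path, index, and sourcetype are placeholders for your own values:

```ini
# Batch input: Splunk indexes each file dropped here, then deletes it.
[batch:///opt/staging/splunk-incoming]
move_policy = sinkhole
index = main
sourcetype = my_raw_logs
```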
Then have a cron job that runs continuously and does this:
For every file in the incoming directory:
IF the file is not in the "linked" DB, create a copy/link to MyPath where Splunk is monitoring, then add it to the "linked" DB.
ELSE if the file's copy is gone from the Splunk directory (Splunk is done with it), archive the original to Amazon S3 and remove it from the "linked" DB.
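One pass of that loop could be sketched in Python as below. The function name `sync_incoming`, the in-memory `linked` set, and the `archive` callback are my own placeholders, not anything Splunk provides; in practice `archive` might wrap boto3's `upload_file` and `linked` would need to be persisted between cron runs:

```python
import os
import shutil

def sync_incoming(incoming_dir, monitored_dir, linked, archive):
    """One pass of the cron job: hand new files to Splunk, archive finished ones.

    incoming_dir  -- where log files arrive
    monitored_dir -- the sinkhole directory Splunk watches
    linked        -- set of filenames already handed to Splunk (the "linked" DB)
    archive       -- callable(path) that ships the file to S3 (assumed placeholder)
    """
    for name in os.listdir(incoming_dir):
        src = os.path.join(incoming_dir, name)
        dst = os.path.join(monitored_dir, name)
        if name not in linked:
            # New file: give Splunk a copy and remember it.
            shutil.copy2(src, dst)
            linked.add(name)
        elif not os.path.exists(dst):
            # Splunk's sinkhole deleted its copy, so indexing is done:
            # archive the original and forget the file.
            archive(src)
            os.remove(src)
            linked.discard(name)
```

A crontab entry running this every minute would approximate "firing continuously"; the `linked` set would have to live in something durable (a small SQLite file, say) rather than in memory.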