Hey Splunk Community,
One of my biggest challenges right now is figuring out how to get old data out and focus on the new data coming in. I am wondering if there is a way to use the (| delete) command, or something similar on a schedule, to delete/purge a file from a monitored directory once a new file is added. I have begun playing around with the dedup command to focus on the relevant data, but I think I will run into problems in the future if I can't remove old data from my index (because of changing unique identifiers and using too much storage space in the Splunk cluster).
Example: Day 1 I upload a file with 10,000 events into a monitored directory. Day 2, I pull from the data source and the relevant data is now 9,500 events because 1,000 machines were removed and 500 new ones were added (9,000 is the same, 500 are completely new, 1,000 are no longer relevant).
How can I delete the day 1 file from my host and only look at the events captured on day 2, etc.? Is it possible to run a scheduled task to remove old data sources and only focus on the most recent? (I also don't want the host/index to show 19,500 events, just the 9,500 from day 2.) Is it possible to do this in Splunk Light?
Any thoughts on this issue would be greatly appreciated!
How can I delete the file from day 1 from my host and only look at the events captured on day 2, etc?
The actual file that contains the data can be deleted by using a batch input with move_policy = sinkhole instead of a monitor input. See the inputs.conf documentation for details.
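For example, a sinkhole stanza in inputs.conf might look like this (the directory path and index name here are placeholders; adjust them to your environment). Splunk indexes each file dropped into the directory once and then removes it from disk:

```
[batch:///opt/uploads/machine_reports]
move_policy = sinkhole
index = foo
disabled = 0
```

Note that sinkhole mode deletes the source file after ingestion, so keep a copy elsewhere if you need the raw file later.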
Is it possible to run a scheduled task to remove old data sources and only focus on the most recent? (I also don't want the host/index to show 19,500 events, just the 9,500 from day 2.) You can use the time picker to select "Last 24 hours", or add "_index_earliest=-1d" to your search.
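As a minimal sketch (the index name foo is a placeholder), a search restricted to events indexed within the last day would look like:

```
index=foo _index_earliest=-1d
```

Unlike earliest/latest, which filter on the event timestamp (_time), _index_earliest filters on when the event was actually indexed, which matches the "only the newest file" requirement here.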
Is it possible to do this in Splunk Light? Sure.
The data has timestamps for when the whole file is indexed, and each individual event is indexed at nearly the same time. The older data has timestamps too, from whenever the previous load completed. Is there a way to run a delete command on all events before a certain time?
The delete command doesn't actually delete data; it just makes it unsearchable. So no disk space is recovered, and your issue of too much storage usage will not get resolved with that. Based on your comment above, I think the actual data doesn't have a timestamp written into it, but Splunk is assigning the time of ingestion as _time. If this is correct, you can just configure an appropriate data retention period for this index so that older data will get deleted automatically by Splunk. For selecting the latest data only, you can do something like this (assuming data comes in once a day):
index=foo [| tstats max(_time) as earliest WHERE index=foo earliest=-7d | eval earliest=relative_time(earliest, "@d")]
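The retention approach mentioned above is configured per index in indexes.conf. As a sketch, assuming an index named foo and the default database location, keeping roughly seven days of data would look like:

```
[foo]
homePath   = $SPLUNK_DB/foo/db
coldPath   = $SPLUNK_DB/foo/colddb
thawedPath = $SPLUNK_DB/foo/thaweddb
frozenTimePeriodInSecs = 604800
```

Here frozenTimePeriodInSecs = 604800 (7 days in seconds) causes buckets whose newest event is older than that to be frozen, which by default means deleted. Freezing happens per bucket, so some data may linger slightly past the cutoff until its whole bucket ages out.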
Ok, I think I understand what you mean. However, I am confused about the "as earliest WHERE" part. What is that referring to? Also, is there a way to access the Splunk database (I think it's MongoDB) and delete everything from there before the new files are dropped in? It doesn't need to be real-time; it will run on a schedule, so if I can write a script to clear the DB just before the new file comes in, I think that could work.
MongoDB is only used for the KV store feature. Splunk stores its indexed data in indexes, which are made up of data buckets on disk. See this for more details: http://docs.splunk.com/Documentation/Splunk/6.4.1/Indexer/HowSplunkstoresindexes.
So in the subsearch, I'm finding the latest/max value of _time across the last 7 days of data in the index "foo" (replace it with the index where your data is coming in daily).
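To answer the "as earliest WHERE" confusion: the search spread over multiple lines (index name foo is still a placeholder) is:

```
index=foo
    [| tstats max(_time) as earliest WHERE index=foo earliest=-7d
     | eval earliest=relative_time(earliest, "@d")]
```

The "WHERE index=foo earliest=-7d" part is just the tstats filter (which index, which time window). The rename "as earliest" is the trick: when a subsearch returns a field literally named earliest, Splunk treats its value as the earliest-time modifier for the outer search. The eval snaps that timestamp back to midnight of its day ("@d"), so the outer search effectively becomes "index=foo earliest=<start of the day of the newest event>", i.e. only the most recent day's file.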