Getting Data In

Indexing data with a batch input, why are the sources and data still in the Splunk index after the file is deleted?

New Member

Hey Splunk Community,

I am having some confusion about the [batch] input. I have read the documentation and thought I understood. What I have noticed is that the files dropped into the directory that the [batch://Path] points to are getting removed, but the sources and data are still in the Splunk index. I was thinking that the source of the data would be removed when the file was deleted by batch. Can someone explain what the move_policy sinkhole is actually doing and what it means to add data "destructively"?

I am ideally looking to upload data to the directory, have it index that file, and then remove the data and source from the index (at a certain point after a report is generated or a new file is added to the directory.) I see that the file is removed locally, but the data and source still show up in Splunk, even though the file itself is gone. How can I get rid of the old data?

Thanks!


Communicator

It's a bit older, but...

Removing the file is one thing. This happens at the operating system level (no news I guess).

On the other side you have Splunk: once the data is indexed, you have events. Every event has the filename as the value of the field named "source". You cannot remove this from the index without removing the event itself. It's permanent, at least for as long as the data exists in the index.

You could set a "dummy" source name in inputs.conf if you don't like the original filename to appear in the events.
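A minimal sketch of what that could look like, assuming a batch input. The directory path and the source value below are hypothetical placeholders, not taken from the original question:

```ini
# inputs.conf (sketch; path and source value are made-up examples)
[batch:///opt/data/dropbox]
# batch inputs require move_policy = sinkhole; the file is deleted after it is read
move_policy = sinkhole
# override the "source" field so events do not carry the original filename
source = report_upload
```

Note that this only changes what is stored in the "source" field for newly indexed events. It does not remove anything that is already in the index.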

Also, there's no point in "removing the data". The data is the basis; the index is "only" built on top of it. I don't think (but that's only my view) that the index could exist without the underlying (raw) data. (If someone knows that this view is wrong, please correct me.)


Ultra Champion

If you want the data in this particular index to be short-lived, you can set frozenTimePeriodInSecs for the index to a day or two, based on your needs. Its default appears to be 188697600 seconds (2184 days, or approximately 6 years), according to the post "How to view data retention settings in Splunk".
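As a rough sketch, a dedicated short-lived index could be defined like this. The index name report_tmp is a hypothetical example, and the two-day value is only an illustration:

```ini
# indexes.conf (sketch; "report_tmp" is a made-up index name)
[report_tmp]
homePath   = $SPLUNK_DB/report_tmp/db
coldPath   = $SPLUNK_DB/report_tmp/colddb
thawedPath = $SPLUNK_DB/report_tmp/thaweddb
# 2 days * 86400 seconds per day = 172800 seconds
frozenTimePeriodInSecs = 172800
```

Keep in mind that freezing works on whole buckets, not on single events: a bucket is frozen once its newest event is older than the threshold, so some events can stay searchable a bit longer than the configured period. Frozen buckets are deleted by default unless coldToFrozenDir or coldToFrozenScript is set for the index.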

Revered Legend

The batch type of input is used in cases where
1) You don't want to retain the log file at the source (Splunk reads the file and then deletes it). This way Splunk handles your log folder cleanup.
2) The data is not written continuously to the log file (historical data).
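For illustration, a batch stanza that sends the files to their own index might look like the sketch below. The path, index name, and sourcetype are assumptions made up for this example:

```ini
# inputs.conf (sketch; path, index, and sourcetype are hypothetical)
[batch:///opt/data/dropbox]
# "sinkhole" is what makes the load destructive: Splunk indexes the file, then deletes it from disk
move_policy = sinkhole
# send these events to a separate index so that it can have its own retention settings
index = report_tmp
sourcetype = report_csv
```

"Destructive" here refers only to the file on disk. The events created from the file stay in the index until the retention policy of that index removes them.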

The retention of data in a Splunk index is not linked to how Splunk reads the log file (continuous monitoring versus read and delete). Retention is set at the index level. If you want your old data to be deleted (of course it will depend on the format of the logs and all other factors), you would set the appropriate data retention policy for the index.

Have a look at the following to learn more about data retention:

http://wiki.splunk.com/Deploy:BucketRotationAndRetention
https://answers.splunk.com/answers/121820/data-retention-policy.html
