Knowledge Management

When is it safe to delete a oneshot input file?

bcavagnolo
Explorer

Hello. I have a script that invokes the command-line splunk tool on a single indexer/search head to oneshot-index log files. Is it safe to delete the input log file after splunk oneshot returns with status 0? I ask because the search status webpage shows the number of indexed events ticking upward for a while after the command returns. It seems to work, but I don't want to do it this way if it is not safe.

1 Solution

amrit
Splunk Employee

While it is okay to continue reading from a file that has been deleted (the OS guarantees this), this is not "safe" in that your Splunk instance could be restarted (or crash) before the data is indexed (especially if it is held up due to hardware errors or similar). In such a case, you will have deleted your source file and will not have any way to index the missing data.
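The first point (that reading from a deleted file keeps working) can be demonstrated on POSIX systems, where unlinking a file removes only the directory entry; the open file descriptor keeps the underlying inode alive until it is closed. A minimal sketch:

```python
import os
import tempfile

# On POSIX systems, an open file descriptor keeps the underlying inode
# alive even after the directory entry is unlinked, so reads continue
# to work until the last descriptor is closed.
fd, path = tempfile.mkstemp()
os.write(fd, b"some log data\n")
os.unlink(path)              # delete the file while it is still open
os.lseek(fd, 0, os.SEEK_SET)
data = os.read(fd, 1024)     # still readable after deletion
os.close(fd)                 # now the disk space is actually freed
```

This is exactly why a restart or crash is the dangerous case: once the process holding the descriptor exits, the data is gone for good.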

"lukejadamec" 's suggestion that data is living in the index queue upon return 0 from "splunk add oneshot" is incorrect - this would imply that, for a 5 GB file, we load all 5 GB into memory before returning from the command. Instead, the file is read & indexed in a streaming fashion.

The best way to tell whether a file has been fully indexed is to verify that the eventcount for the file is correct in the index (in other words, do a search source=foo | stats count, or metasearch, or similar). However, this is obviously difficult in the case of multiline events and/or incorrect event parsing settings.

Therefore, the most reliable way to tell whether a oneshot file has been indexed is the following type of heuristic:

1) $ splunk add oneshot foo.log

2) Query the REST API at /services/data/inputs/oneshot and observe the status of the item named foo.log (Bytes Indexed vs. Size)

3) Eventually the file will be fully read and mostly indexed, with the remaining bits sitting in various queues awaiting indexing. Once this condition is hit, foo.log will no longer display in the REST API.

At this point, data should finish indexing quickly - however, there could still be various issues preventing proper indexing, such as running out of disk space, a downed network connection to a downstream indexer, influx of data from other sources, etc. Therefore, the last step is:

4) Run timed searches (perhaps every 30 seconds) checking the eventcount for foo.log, until it stabilizes, meaning the eventcount hasn't changed for a few minutes. At this point, it is reasonable to consider the data fully indexed.
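The stabilization check in step 4 could be scripted along these lines. This is only a sketch: `get_count` is a hypothetical callable that runs a search such as `source=foo.log | stats count` (via the CLI or REST API) and returns the current event count, and the 30-second interval and "few minutes" threshold are the illustrative values from the post:

```python
import time

def wait_for_stable_count(get_count, interval=30, stable_checks=4):
    """Poll an event count until it stops changing for `stable_checks`
    consecutive intervals, then return the final count.

    With interval=30 and stable_checks=4, the count must hold steady
    for two minutes before the data is considered fully indexed.
    """
    last = get_count()
    stable = 0
    while stable < stable_checks:
        time.sleep(interval)
        current = get_count()
        if current == last:
            stable += 1
        else:
            # Count is still moving; reset the stability window.
            stable = 0
            last = current
    return last
```

In practice `get_count` would wrap whatever search mechanism you already use (CLI `splunk search`, the REST search endpoint, or an SDK), which is why it is injected rather than hard-coded here.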


lukejadamec
Super Champion

Party foul.

Amrit's answer could be reworded to reflect the non-paid-for-service spirit of Splunk Answers and to encourage supportive participation from the Splunk Community as a whole. If it were me, I might reword the following comment:

"lukejadamec" 's suggestion that data is living in the index queue upon return 0 from "splunk add oneshot" is incorrect - this would imply that, for a 5 GB file, we load all 5 GB into memory before returning from the command. Instead, the file is read & indexed in a streaming fashion.

To perhaps something like this:

“LukeJadamec”’s answer to the effect that the data is living in the index queue upon “return 0” from “splunk add oneshot” is not technically accurate. While Splunk will respond with a “hey dude, everything is okay (return code 0)” message, the reality is that splunkd actually only connects to the file and begins to “successfully stream” it. So, beware of events that might break that stream, e.g. restarting splunkd on the indexer or restarting the hosting system may cause you to lose data. However, this will only be a problem if the file being indexed is very large, because for files smaller than, say, 1 GB with a network connection of, say, 10 GB/s, you could probably not restart a system fast enough to cause a problem.

Of course, appropriate details should be provided after an explanation of the deficiencies in another poster's comments has been stated along the lines of the example above.


piebob
Splunk Employee

Hi Luke,
I sent you an email last Friday (from rachel@splunk); hopefully it didn't get caught in your spam filters. Let me know!


amrit
Splunk Employee
Splunk Employee

You have my apologies - I did come across as a tool there. Pressed for time, I rushed out that answer and did not proofread, thus missing that it ended up sounding negative. Criticizing community effort was certainly not my intention!


othersider2
New Member

This post appears to be addressing the use of splunk add oneshot. Does the advice change when using splunkforwarder add oneshot?


lukejadamec
Super Champion

Yes, it is safe.
The events are loaded into the index queue when you get the return 0. What you're seeing are events moving from the index queue into the index.

lukejadamec
Super Champion

Where's all the love here?
I got my answer from the docs, and it is accurate unless the indexer (or server) is restarted.
It is always nice to have over achievers in the house, but if you're gonna beat my 95% help down, then at least do it three times so I can get a badge.

amrit
Splunk Employee
Splunk Employee

This answer is not correct - this operation is not technically safe. I will reply in another answer...
