Hi,
I've been troubleshooting a problem where files are occasionally getting missed in Splunk. The app creates a lot of files and a lot of data - they roll over at 50 MB, about every 1-2 minutes. Just today I caught an "unable to open file" message, and when I went on the system the file wasn't there - probably because a cleanup job moves files off on a regular basis. The file in question is over an hour old, so I'm beginning to wonder if Splunk is having a hard time keeping up.
How can we easily validate that the Splunk universal forwarder isn't falling behind? This app has lots of servers and lots of files, so running btool after the fact isn't going to help me (nor will list monitor...). Looking for ideas/thoughts...
Update:
I have noticed that on certain systems the same file keeps getting "removed from queue", which doesn't make sense, as it's still active (and the file is very busy):
04-16-2016 22:44:05.213 -0400 INFO BatchReader - Removed from queue file='/gsysrtpp23/logs/ORS_RTP_Node2_PR/ORS_RTP_Node2_PR.20160416_223009_902.log'.
04-16-2016 22:44:06.202 -0400 INFO BatchReader - Removed from queue file='/gsysrtpp23/logs/ORS_RTP_Node2_PR/ORS_RTP_Node2_PR.20160416_223009_902.log'.
04-16-2016 22:44:07.212 -0400 INFO BatchReader - Removed from queue file='/gsysrtpp23/logs/ORS_RTP_Node2_PR/ORS_RTP_Node2_PR.20160416_223009_902.log'.
04-16-2016 22:44:08.221 -0400 INFO BatchReader - Removed from queue file='/gsysrtpp23/logs/ORS_RTP_Node2_PR/ORS_RTP_Node2_PR.20160416_223009_902.log'.
Thanks!
Along with all of @martin_mueller's good points, consider using sinkhole
which does the housekeeping for you inside of Splunk:
http://docs.splunk.com/Documentation/Splunk/6.4.0/admin/Inputsconf
[batch://<path>]
* One time, destructive input of files in <path>.
* For continuous, non-destructive inputs of files, use monitor instead.
# Additional attributes:
move_policy = sinkhole
* IMPORTANT: This attribute/value pair is required. You *must* include "move_policy = sinkhole" when defining batch inputs.
* This loads the file destructively.
* Do not use the batch input type for files you do not want to consume destructively.
* As long as this is set, Splunk won't keep track of indexed files. Without the "move_policy = sinkhole" setting, it won't load the files destructively and will keep track of them.
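As a sketch, a batch input stanza would look like this in inputs.conf (the path, index, and sourcetype here are made up for illustration - point it at your rolled-file directory, never at files still being written):

```ini
# inputs.conf - destructive, one-time consumption of rolled files
[batch:///var/log/myapp/rolled/*.log]
move_policy = sinkhole    # required for batch inputs; files are deleted after indexing
index = foo
sourcetype = bar
```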
The question indicates files are still actively written to after Splunk sees them for the first time.
Using sinkhole can be a terrible idea for files that are still being written to by the application. Make sure you don't have Splunk trying to pull them out from under your app.
True, this would only be an option if these files appear in their entirety and are not continuously written to.
Hi @woodcock,
I had the same problem and solved it by setting the throughput to maxKBps = 0 (unlimited) in limits.conf.
Can you explain why the throughput limit causes data loss?
Increasing throughput should decrease data loss, not increase it. What do you mean?
I always deploy maxKBps = 0 unless there is some reason not to.
You'll lose data if you rotate the logs away from underneath the forwarder when it can't keep up.
First of all, make sure the forwarder monitors rolled uncompressed files so it has a chance to work off a peak.
Second, make sure there is enough headroom in the thruput limit in limits.conf for peak times. The default on a universal forwarder (maxKBps = 256) is way too low for 50 MB/min.
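As a minimal sketch, raising or removing the limit looks like this in limits.conf on the forwarder (0 means unlimited; deploy it to the forwarders, not the indexers):

```ini
# limits.conf on the universal forwarder
[thruput]
# 0 = unlimited; the UF default of 256 KBps is far below 50 MB/min
maxKBps = 0
```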
To view the current state of the tailing processor, check out http://blogs.splunk.com/2011/01/02/did-i-miss-christmas-2/ - it'll tell you what files are monitored right now, how far into the file Splunk has read, and so on.
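If you'd rather query the forwarder directly, the same tailing-processor state is exposed over splunkd's management port via the inputstatus endpoint (assuming the default port 8089 and your own admin credentials):

```
curl -k -u admin:changeme \
    https://localhost:8089/services/admin/inputstatus/TailingProcessor:FileStatus
```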
To check if files were missed, check your indexed data for gaps. You should not see zeros in a search like this:
| tstats count where index=foo sourcetype=bar source=/gsysrtpp23/logs* by _time span=30s host | timechart sum(count) as count by host
A zero could mean "missing data from that host" or "host did not generate data in those 30 seconds". If you expect a file to cover 1-2 minutes and a file is missing, there should be at least one empty 30-second bucket from that host.
If your data (or file names) has incrementing values you could also search for gaps in those.
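For example, if your file names carry an incrementing counter (as the ..._902.log names above appear to), a quick offline check for gaps might look like this - find_gaps is a hypothetical helper, not part of Splunk:

```python
import re

def find_gaps(filenames):
    """Return (missing_from, missing_to) pairs for gaps in the
    numeric counter embedded in rolled-log file names."""
    # Extract the trailing counter, e.g. ..._902.log -> 902
    seqs = sorted(int(re.search(r"_(\d+)\.log$", f).group(1))
                  for f in filenames)
    gaps = []
    for prev, cur in zip(seqs, seqs[1:]):
        if cur - prev > 1:
            gaps.append((prev + 1, cur - 1))
    return gaps

# Example: file 904 never made it to the index
names = ["app_902.log", "app_903.log", "app_905.log"]
print(find_gaps(names))  # [(904, 904)]
```

You could feed it the distinct source values from a search over the index rather than a directory listing, so it checks what was actually indexed.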
To check indexing delay, run something like this:
| tstats max(_indextime) as maxindextime where index=foo sourcetype=bar source=/gsysrtpp23/logs* by _time span=1s host | eval delay = maxindextime-_time | timechart max(delay) by host
If that approaches minutes, you may be dropping behind significantly depending on how long rolled files remain on disk.