Hi,
I've been troubleshooting a problem where files are occasionally getting missed in Splunk. The app creates a lot of files and a lot of data - they roll over at 50 MB, about every 1-2 minutes. Just today I caught an "unable to open file" message, and when I went on the system the file wasn't there - probably because a cleanup job moves files off on a regular basis. The file in question is over an hour old, so I'm beginning to wonder if Splunk is having a hard time keeping up.
How can we easily validate that the Splunk universal forwarder isn't falling behind? This app has lots of servers and lots of files, so running btool after the fact isn't going to help me (nor will list monitor...). Looking for ideas/thoughts...
Update:
I have noticed that on certain systems the same file keeps getting "removed from queue", which doesn't make sense, as it's still active (and the file is very busy):
04-16-2016 22:44:05.213 -0400 INFO BatchReader - Removed from queue file='/gsysrtpp23/logs/ORS_RTP_Node2_PR/ORS_RTP_Node2_PR.20160416_223009_902.log'.
04-16-2016 22:44:06.202 -0400 INFO BatchReader - Removed from queue file='/gsysrtpp23/logs/ORS_RTP_Node2_PR/ORS_RTP_Node2_PR.20160416_223009_902.log'.
04-16-2016 22:44:07.212 -0400 INFO BatchReader - Removed from queue file='/gsysrtpp23/logs/ORS_RTP_Node2_PR/ORS_RTP_Node2_PR.20160416_223009_902.log'.
04-16-2016 22:44:08.221 -0400 INFO BatchReader - Removed from queue file='/gsysrtpp23/logs/ORS_RTP_Node2_PR/ORS_RTP_Node2_PR.20160416_223009_902.log'.
Thanks!
Along with all of @martin_mueller's good points, consider using sinkhole
which does the housekeeping for you inside of Splunk:
http://docs.splunk.com/Documentation/Splunk/6.4.0/admin/Inputsconf
[batch://<path>]
* One time, destructive input of files in <path>.
* For continuous, non-destructive inputs of files, use monitor instead.
# Additional attributes:
move_policy = sinkhole
* IMPORTANT: This attribute/value pair is required. You *must* include "move_policy = sinkhole" when defining batch inputs.
* This loads the file destructively.
* Do not use the batch input type for files you do not want to consume destructively.
* As long as this is set, Splunk won't keep track of indexed files. Without the "move_policy = sinkhole" setting, it won't load the files destructively and will keep track of them.
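As a sketch, a batch input stanza would look like this in inputs.conf (the path, index, and sourcetype here are made up for illustration - point it at your rolled-file directory, never at files still being written):

```ini
# inputs.conf - destructive, one-time consumption of rolled files
[batch:///var/log/myapp/rolled/*.log]
move_policy = sinkhole    # required for batch inputs; files are deleted after indexing
index = foo
sourcetype = bar
```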
The question indicates files are still actively written to after Splunk sees them for the first time.
Using sinkhole can be a terrible idea for files that are still being written to by the application. Make sure you don't have Splunk trying to pull them out from under your app.
True, this would only be an option if these files appear in their entirety and are not continuously written to.
Hi @woodcock,
I had the same problem and solved it by setting the throughput to maxKBps = 0 (unlimited) in limits.conf.
Can you explain why the throughput limit causes data loss?
Increasing throughput should decrease data loss, not increase it. What do you mean?
I always deploy maxKBps = 0 unless there is some reason not to.
You'll lose data if you rotate the logs away from underneath the forwarder when it can't keep up.
First of all, make sure the forwarder monitors rolled uncompressed files so it has a chance to work off a peak.
Second, make sure there is enough headroom in the thruput limit in limits.conf for peak times. The default on a universal forwarder (maxKBps = 256) is way too low for 50 MB/min.
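As a minimal sketch, raising or removing the limit looks like this in limits.conf on the forwarder (0 means unlimited; deploy it to the forwarders, not the indexers):

```ini
# limits.conf on the universal forwarder
[thruput]
# 0 = unlimited; the UF default of 256 KBps is far below 50 MB/min
maxKBps = 0
```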
To view the current state of the tailing processor, check out http://blogs.splunk.com/2011/01/02/did-i-miss-christmas-2/ - it'll tell you what files are monitored right now, how far into the file Splunk has read, and so on.
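If you'd rather query the forwarder directly, the same tailing-processor state is exposed over splunkd's management port via the inputstatus endpoint (assuming the default port 8089 and your own admin credentials):

```
curl -k -u admin:changeme \
    https://localhost:8089/services/admin/inputstatus/TailingProcessor:FileStatus
```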
To check if files were missed, check your indexed data for gaps. You should not see zeros in a search like this:
| tstats count where index=foo sourcetype=bar source=/gsysrtpp23/logs* by _time span=30s host | timechart sum(count) as count by host
A zero could mean "missing data from that host" or "host did not generate data in those 30 seconds". If you expect a file to cover 1-2 minutes and a file is missing, there should be at least one empty 30-second bucket from that host.
If your data (or file names) has incrementing values you could also search for gaps in those.
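For example, if your file names carry an incrementing counter (as the ..._902.log names above appear to), a quick offline check for gaps might look like this - find_gaps is a hypothetical helper, not part of Splunk:

```python
import re

def find_gaps(filenames):
    """Return (missing_from, missing_to) pairs for gaps in the
    numeric counter embedded in rolled-log file names."""
    # Extract the trailing counter, e.g. ..._902.log -> 902
    seqs = sorted(int(re.search(r"_(\d+)\.log$", f).group(1))
                  for f in filenames)
    gaps = []
    for prev, cur in zip(seqs, seqs[1:]):
        if cur - prev > 1:
            gaps.append((prev + 1, cur - 1))
    return gaps

# Example: file 904 never made it to the index
names = ["app_902.log", "app_903.log", "app_905.log"]
print(find_gaps(names))  # [(904, 904)]
```

You could feed it the distinct source values from a search over the index rather than a directory listing, so it checks what was actually indexed.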
To check indexing delay, run something like this:
| tstats max(_indextime) as maxindextime where index=foo sourcetype=bar source=/gsysrtpp23/logs* by _time span=1s host | eval delay = maxindextime-_time | timechart max(delay) by host
If that approaches minutes, you may be dropping behind significantly depending on how long rolled files remain on disk.