Getting Data In

How to tell if Splunk universal forwarder performance is keeping up and sending all monitored data as expected?

a212830
Champion

Hi,

I've been troubleshooting a problem where files occasionally get missed in Splunk. The app creates a lot of files and a lot of data - they roll over at 50MB, about every 1-2 minutes. Just today, I caught an "unable to open file" message, and when I went onto the system, the file wasn't there - probably because a cleanup job moves files on a regular basis. The file in question is over an hour old, so I'm beginning to wonder if Splunk is having a hard time keeping up.

How can we easily validate that the Splunk universal forwarder isn't falling behind? This app has lots of servers and lots of files, so running btool after the fact isn't going to help me (nor will list monitors...). Looking for ideas/thoughts...

Update:

I have noticed that on certain systems, the same file keeps getting "removed from queue", which doesn't make sense, as it's still active. (And the file is very busy).

04-16-2016 22:44:05.213 -0400 INFO  BatchReader - Removed from queue file='/gsysrtpp23/logs/ORS_RTP_Node2_PR/ORS_RTP_Node2_PR.20160416_223009_902.log'.
04-16-2016 22:44:06.202 -0400 INFO  BatchReader - Removed from queue file='/gsysrtpp23/logs/ORS_RTP_Node2_PR/ORS_RTP_Node2_PR.20160416_223009_902.log'.
04-16-2016 22:44:07.212 -0400 INFO  BatchReader - Removed from queue file='/gsysrtpp23/logs/ORS_RTP_Node2_PR/ORS_RTP_Node2_PR.20160416_223009_902.log'.
04-16-2016 22:44:08.221 -0400 INFO  BatchReader - Removed from queue file='/gsysrtpp23/logs/ORS_RTP_Node2_PR/ORS_RTP_Node2_PR.20160416_223009_902.log'.

Thanks!


woodcock
Esteemed Legend

Along with all of @martin_mueller's good points, consider using a sinkhole (batch) input, which does the housekeeping for you inside Splunk:

http://docs.splunk.com/Documentation/Splunk/6.4.0/admin/Inputsconf

[batch://<path>]
* One time, destructive input of files in <path>.
* For continuous, non-destructive inputs of files, use monitor instead.
# Additional attributes:
move_policy = sinkhole
* IMPORTANT: This attribute/value pair is required. You *must* include "move_policy = sinkhole" when defining batch inputs.
* This loads the file destructively.
* Do not use the batch input type for files you do not want to consume destructively.
* As long as this is set, Splunk won't keep track of indexed files. Without the "move_policy = sinkhole" setting, it won't load the files destructively and will keep track of them.
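
For example, a batch stanza for the directory from the question might look like this (a sketch only - the index and sourcetype values are placeholders, and it assumes the files are complete when they land, since Splunk deletes them after indexing):

[batch:///gsysrtpp23/logs/ORS_RTP_Node2_PR]
move_policy = sinkhole
# placeholders - point these at whatever your deployment actually uses
index = main
sourcetype = ors_rtp
disabled = false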

martin_mueller
SplunkTrust

The question indicates files are still actively written to after Splunk sees them for the first time.


martin_mueller
SplunkTrust

Using sinkhole can be a terrible idea for files still written to by the application. Make sure you don't have Splunk trying to pull them out from under your app.

woodcock
Esteemed Legend

True, this would only be an option if the files appear in their entirety and are not written to continuously.


dailv1808
Path Finder

Hi @woodcock,
I hit the same problem and solved it by setting the throughput limit maxKBps to unlimited in limits.conf.
Can you explain why the throughput limit causes data loss?

https://imgur.com/a/BUmw9z2


woodcock
Esteemed Legend

Increasing throughput should decrease data loss, not increase it. What do you mean?


woodcock
Esteemed Legend

I always deploy maxKBps = 0 unless there is some reason not to.
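
For reference, that is the [thruput] stanza in limits.conf on the forwarder (0 means no cap; the universal forwarder default of 256KBps is far below the roughly 850KBps that 50MB/min works out to):

[thruput]
# 0 removes the bandwidth cap; deploy via your usual config/app mechanism
maxKBps = 0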

martin_mueller
SplunkTrust

You'll lose data if you rotate the logs away from underneath the forwarder when it can't keep up.

martin_mueller
SplunkTrust

First of all, make sure the forwarder monitors rolled uncompressed files so it has a chance to work off a peak.
Second, make sure there is enough headroom in the thruput limit in limits.conf for peak times. The default setting is way too low for 50MB/min.
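
One way to check whether the forwarder is actually pinned at its thruput cap (a sketch - it assumes the forwarder ships its own _internal logs, which is the default) is to chart the thruput metrics from its metrics.log:

index=_internal source=*metrics.log group=thruput name=thruput host=* | timechart span=1m max(instantaneous_kbps) as kbps by host

If a host sits flat at your configured maxKBps during peaks, that forwarder is throttled and will fall behind.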

To view the current state of the tailing processor, check out http://blogs.splunk.com/2011/01/02/did-i-miss-christmas-2/ - it'll tell you what files are monitored right now, how far into the file Splunk has read, and so on.
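
The same information is available straight from the forwarder's management port (8089 by default) via the TailingProcessor:FileStatus endpoint; something like this - host and credentials are placeholders - returns the per-file status:

curl -k -u admin:changeme https://localhost:8089/services/admin/inputstatus/TailingProcessor:FileStatus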
To check if files were missed, check your indexed data for gaps. You should not see zeros in a search like this:

| tstats count where index=foo sourcetype=bar source=/gsysrtpp23/logs* by _time span=30s host | timechart sum(count) as count by host

A zero could mean "missing data from that host" or "host did not generate data in those 30 seconds". If each file covers 1-2 minutes and one is missed, there should be at least one 30-second bucket that's empty for that host.
If your data (or file names) has incrementing values you could also search for gaps in those.

To check indexing delay, run something like this:

| tstats max(_indextime) as maxindextime where index=foo sourcetype=bar source=/gsysrtpp23/logs* by _time span=1s host | eval delay = maxindextime-_time | timechart max(delay) by host

If that approaches minutes, you may be falling behind significantly, depending on how long rolled files remain on disk.
