Getting Data In

How to tell if the Splunk universal forwarder is keeping up and sending all monitored data as expected?

a212830
Champion

Hi,

I've been troubleshooting a problem where files are occasionally getting missed in Splunk. The app creates a lot of files and a lot of data - they roll over at 50MB, about every 1-2 minutes. Just today, I caught an "unable to open file" message, and when I went onto the system, the file wasn't there - probably because a cleanup job moves files off on a regular basis. The file in question is over an hour old, so I'm beginning to wonder if Splunk is having a hard time keeping up.

How can we easily validate that the Splunk universal forwarder isn't falling behind? This app has lots of servers and lots of files, so running btool after the fact isn't going to help me (nor will list monitor...). Looking for ideas/thoughts...

Update:

I have noticed that on certain systems, the same file keeps getting "Removed from queue", which doesn't make sense, as the file is still active (and very busy).

04-16-2016 22:44:05.213 -0400 INFO  BatchReader - Removed from queue file='/gsysrtpp23/logs/ORS_RTP_Node2_PR/ORS_RTP_Node2_PR.20160416_223009_902.log'.
04-16-2016 22:44:06.202 -0400 INFO  BatchReader - Removed from queue file='/gsysrtpp23/logs/ORS_RTP_Node2_PR/ORS_RTP_Node2_PR.20160416_223009_902.log'.
04-16-2016 22:44:07.212 -0400 INFO  BatchReader - Removed from queue file='/gsysrtpp23/logs/ORS_RTP_Node2_PR/ORS_RTP_Node2_PR.20160416_223009_902.log'.
04-16-2016 22:44:08.221 -0400 INFO  BatchReader - Removed from queue file='/gsysrtpp23/logs/ORS_RTP_Node2_PR/ORS_RTP_Node2_PR.20160416_223009_902.log'.
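
For reference, the forwarders ship their _internal logs to the indexers by default, so a search along these lines (just a sketch) surfaces those messages per host:

index=_internal sourcetype=splunkd component=BatchReader "Removed from queue" | timechart span=1m count by host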

Thanks!


woodcock
Esteemed Legend

Along with all of @martin_mueller's good points, consider using a sinkhole (batch) input, which does the housekeeping for you inside Splunk:

http://docs.splunk.com/Documentation/Splunk/6.4.0/admin/Inputsconf

[batch://<path>]
* One time, destructive input of files in <path>.
* For continuous, non-destructive inputs of files, use monitor instead.
# Additional attributes:
move_policy = sinkhole
* IMPORTANT: This attribute/value pair is required. You *must* include  "move_policy = sinkhole" when defining batch inputs.
* This loads the file destructively.
* Do not use the batch input type for files you do not want to consume destructively.
* As long as this is set, Splunk won't keep track of indexed files. Without the "move_policy = sinkhole" setting, it won't load the files destructively and will keep track of them.
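
A minimal example stanza (just a sketch - the path, sourcetype, and index here are illustrative, and the files must be complete before Splunk picks them up):

[batch:///gsysrtpp23/logs/archive]
move_policy = sinkhole
sourcetype = ors_log
index = main
disabled = false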

martin_mueller
SplunkTrust

The question indicates files are still actively written to after Splunk sees them for the first time.


martin_mueller
SplunkTrust

Using sinkhole can be a terrible idea for files that are still being written to by the application. Make sure you don't have Splunk trying to pull them out from under your app.

woodcock
Esteemed Legend

True, this would only be an option if these files are appearing in their entirety and are not continuously written.


dailv1808
Path Finder

Hi @woodcock,
I ran into the same problem and solved it by setting the throughput to unlimited (maxKBps = 0) in limits.conf.
So can you explain why the throughput limit causes data loss?

https://imgur.com/a/BUmw9z2


woodcock
Esteemed Legend

Increasing throughput should decrease data loss, not increase it. What do you mean?


woodcock
Esteemed Legend

I always deploy maxKBps = 0 unless there is some reason not to.
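
On a universal forwarder that is just this in limits.conf (0 removes the bandwidth cap entirely):

[thruput]
maxKBps = 0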

martin_mueller
SplunkTrust

You'll lose data if you rotate the logs away from underneath the forwarder when it can't keep up.

martin_mueller
SplunkTrust

First of all, make sure the forwarder monitors rolled uncompressed files so it has a chance to work off a peak.
Second, make sure there is enough headroom in the thruput limit in limits.conf for peak times. The default setting is way too low for 50MB/min.
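To put rough numbers on it: 50MB per minute is 50*1024/60 ≈ 853KB/s sustained, and a universal forwarder defaults to maxKBps = 256. A limits.conf sketch with headroom (the exact value is illustrative):

[thruput]
maxKBps = 2048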

To view the current state of the tailing processor, check out http://blogs.splunk.com/2011/01/02/did-i-miss-christmas-2/ - it'll tell you what files are monitored right now, how far into the file Splunk has read, and so on.
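If you'd rather query that status endpoint directly on the forwarder, something like this works (a sketch; adjust credentials, and the management port if it isn't the default 8089):

curl -k -u admin https://localhost:8089/services/admin/inputstatus/TailingProcessor:FileStatus
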
To check if files were missed, check your indexed data for gaps. You should not see zeros in a search like this:

| tstats count where index=foo sourcetype=bar source=/gsysrtpp23/logs* by _time span=30s host | timechart sum(count) as count by host

A zero could mean "missing data from that host", or "the host did not generate data in those 30 seconds". If each file covers 1-2 minutes of data and one is missing, there should be at least one 30-second bucket that's empty for that host.
If your data (or your file names) contain incrementing values, you could also search for gaps in those.
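For example, with a hypothetical incrementing sequence_id field in the events (purely illustrative), something like this would flag gaps:

index=foo sourcetype=bar | stats count by sequence_id | sort 0 num(sequence_id) | streamstats current=f last(sequence_id) as prev_id | eval gap=sequence_id-prev_id | where gap>1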

To check indexing delay, run something like this:

| tstats max(_indextime) as maxindextime where index=foo sourcetype=bar source=/gsysrtpp23/logs* by _time span=1s host | eval delay = maxindextime-_time | timechart max(delay) by host

If that approaches minutes, you may be falling behind significantly, depending on how long rolled files remain on disk.
