We have the following search -
base search | eval diff= _indextime - _time | eval capturetime=strftime(_time,"%Y-%m-%d %H:%M:%S") | eval indextime=strftime(_indextime,"%Y-%m-%d %H:%M:%S") | table capturetime indextime diff
We see the following -
So we see a delay of over five hours in indexing. Is there a way to find out where these events "got stuck"? In this case the events are coming from Hadoop servers, and the forwarder processes around half a million files. We would like to know whether the delay is at the forwarder level or on the indexer side.
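To try to narrow down where the lag shows up, one idea (a sketch, using the default host and splunk_server metadata fields and the same base search as above) is to split the latency by forwarder host and by the indexer that received the event -
base search | eval diff = _indextime - _time | stats avg(diff) max(diff) count by host splunk_server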
We had a production change on Wednesday night. On the following day, Thursday, we saw this delay in indexing -
followed by -
On Friday there was no delay (the right column) -
And we have seen this behavior before on other production changes involving these large Hadoop file systems. So I think it takes the forwarder hours to scan this large number of files and index the right information; a day or two later everything is fine. I just checked now and it's perfect. So the delay's time frame lines up with the forwarder bounce.
The thing is - what can we improve on the forwarder to lower this delay after the bounce?
On the forwarder we see -
$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 1033069
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 64000
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1024
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
And thank you @MuS and @somesoni2 for validating that nothing is fundamentally wrong with either the forwarder's configuration or the index queues...
Do you see any helpful information in this Management Console dashboard?
Indexing Pipeline: http://docs.splunk.com/Documentation/Splunk/6.6.0/DMC/IndexingInstance
ok, I see -
$ find . -name "limits.conf" | xargs grep -i maxKBps
./etc/apps/universal_config_forwarder/local/limits.conf:maxKBps = 0
./etc/apps/SplunkUniversalForwarder/default/limits.conf:maxKBps = 256
./etc/system/default/limits.conf:maxKBps = 0
Use this command to show what is actually applied as config:
splunk btool limits list thruput
That is on the forwarder. But by the looks of it, you have no throughput limit active ... Did you check the DMC / MC for any blocked queues?
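If you want to confirm which file each applied setting actually comes from, btool's --debug flag prints the source file next to every line:
splunk btool limits list thruput --debug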
I was late/early on that. Check the various queue sizes to see if there are any high spikes.
index=_internal sourcetype=splunkd source=*metrics.log group=queue | timechart avg(current_size) by name
You can add host=yourUFName to see queue sizes on the UF, and host=Indexer (add more OR conditions to cover all indexers) to see queue sizes on the indexers. You may need to adjust queue sizes based on the results from there. https://answers.splunk.com/answers/38218/universal-forwarder-parsingqueue-kb-size.html
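For example, to look at the forwarder only (a sketch; host=yourUFName is a placeholder for your actual UF hostname, and you'd swap it for your indexer hostnames to see the indexer side):
index=_internal sourcetype=splunkd source=*metrics.log group=queue host=yourUFName | timechart avg(current_size) by name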
The aggQueue is where date parsing and line merging happens. This suggests that there may be an inefficient event parsing configuration in place. What is the sourcetype definition (props.conf on the indexers) you have for the sourcetypes involved?
It means there is no config set up and Splunk has to figure everything out on its own, hence the spikes. I would suggest defining efficient line breaking and event parsing for this data and getting it deployed on the indexers (that would need an indexer restart). I'd hope you see lower latency/queue sizes after that. If you could share some sample raw events, we can suggest some settings.
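As a rough starting point, and without seeing your events, a props.conf stanza on the indexers usually takes this shape (the sourcetype name, line breaker, and timestamp settings below are placeholders you would adjust to your actual data):
[your_hadoop_sourcetype]
SHOULD_LINEMERGE = false
LINE_BREAKER = ([\r\n]+)
TIME_PREFIX = ^
TIME_FORMAT = %Y-%m-%d %H:%M:%S
MAX_TIMESTAMP_LOOKAHEAD = 25
TRUNCATE = 10000
Telling Splunk exactly where each event ends and where the timestamp sits spares the aggQueue from guessing on every event, which is typically where those queue spikes come from.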