We've been having issues with license usage lately: we see sudden spikes in EPS from multiple hosts.
Recently, I found that there is an index time (eval indextime=_indextime) and compared it with the event timestamp (_time).
It is not shown in the image above, but the initial time for both graphs was exactly 00:00:00. There is clearly a lag at the end, though, where the event count by indextime spikes.
So my question is: am I interpreting the graph properly? That is, is the forwarder pushing logs in bursts at certain points instead of streaming them continuously?
If so, what could be causing this lag? We have more than 100 devices reporting to our heavy forwarder. Is that too much for a heavy forwarder to handle? Or is it an issue with the hardware (memory or CPU) of the heavy forwarder or the indexer? Again, similar lag is observed across dozens of our hosts that generate large amounts of logs.
Thanks in advance!
If you have latencies on the order of 10K seconds, then it is almost certainly NOT a horsepower issue with your forwarder (unless you are processing ZIPped files). It is far more likely that you have a time zone issue and Splunk is interpreting timestamps as being hours off from what they really are. In your first chart, the lag never goes below 10K, so this really could be the problem. Try looking at this data:
index=* | eval lagSecs = _indextime - _time | stats min(lagSecs) avg(lagSecs) max(lagSecs) by index, host, sourcetype, date_zone
If your min, max, and avg are roughly the same and >1K or so (or if ever <0), then you almost certainly have a TZ issue. The date_zone field will show "default" if the TZ from the host OS of the Heavy Forwarder is being used, or something else if it is being set by some configuration file in Splunk.
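To make the rule of thumb above concrete, here is a minimal Python sketch of the same check applied to a list of lag values (in seconds) for one host. The function names, the 1000-second threshold, and the 10% tolerance used for "roughly the same" are my own assumptions for illustration, not anything Splunk defines.

```python
from statistics import mean

def summarize_lag(lags):
    """Return (min, avg, max) of lag values in seconds, like the stats search above."""
    return min(lags), mean(lags), max(lags)

def looks_like_tz_offset(lags, min_offset=1000, tolerance=0.1):
    """Heuristic only: flag a probable time zone problem.

    min_offset and tolerance are assumed values chosen for this sketch.
    """
    lo, avg, hi = summarize_lag(lags)
    if lo < 0:
        # Events that appear to be indexed before they happened.
        return True
    # "Roughly the same": the spread is small relative to the average lag.
    roughly_constant = (hi - lo) <= tolerance * abs(avg)
    return roughly_constant and lo > min_offset

steady_eight_hours = [28800, 28805, 28810]   # constant offset of about 8 hours
bursty = [0, 5, 30000]                       # lag that builds up and drains
negative = [-3600, -3590]                    # timestamps in the future

tz_flags = [looks_like_tz_offset(x) for x in (steady_eight_hours, bursty, negative)]
```

A lag that swings between near zero and very large values, like the second sample, does not match the TZ pattern and points toward queueing or throughput instead.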
If you're on Splunk 6.2.x you can run DMC (HeHe, yoo bro....), that's the Distributed Management Console, using this URI
This will provide a nice overview of the pipelines. Maybe you will get hints out of it.
On pre-6.2.x setups, use the S.o.S. app: https://splunkbase.splunk.com/app/748/
Hope this helps ...
We are using Splunk 6.1.x. I've just installed SoS and finished setting it up and... it doesn't look good.
I went into Indexing > Distributed Indexing Performance, checked "Real-time measured indexing rate and latency per Type: index", and found the average latency of this index to be 30K seconds, which is roughly 8 hours.
I also ran the search I found at http://answers.splunk.com/answers/31151/index-performance-issue-high-latency.html:
index=_internal source=*metrics.log blocked
It shows that most of the hosts reporting to this troubled index are exceeding max_size... I guess I'll go change the setting on the forwarder.
What you need to do is graph this to see if there is bunching up of events. My suspicion is that you will see that lagSecs is fairly static and you will have to look elsewhere:
index=* | eval lagSecs = _indextime - _time | eval CombinedSource = host . "/" . sourcetype . "/" . index | timechart avg(lagSecs) by CombinedSource
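For anyone who wants to see what that timechart is computing, here is a rough Python sketch of the same idea: bucket events by their event time and average the lag within each bucket. The function name, the (event_time, index_time) tuple layout, and the sample numbers are made up for illustration.

```python
from collections import defaultdict

def timechart_avg_lag(events, span=3600):
    """Average (index_time - event_time) per time bucket of `span` seconds.

    `events` is a list of (event_time, index_time) pairs in epoch seconds.
    This layout is an assumption of the sketch, not a Splunk export format.
    """
    buckets = defaultdict(list)
    for event_time, index_time in events:
        bucket_start = event_time - (event_time % span)
        buckets[bucket_start].append(index_time - event_time)
    return {start: sum(lags) / len(lags) for start, lags in sorted(buckets.items())}

# Hypothetical sample: small lag in the first hour, growing lag in the second.
sample = [(0, 5), (1800, 1815), (3600, 4600), (5400, 7400)]
chart = timechart_avg_lag(sample)
```

A flat line across buckets suggests a constant offset (for example a TZ problem), while a line that climbs and falls suggests events are queueing up somewhere.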
But watch out for "false lag" - which is often caused by adding new inputs.
For example, on Monday you begin to index the database log files. They were not indexed in the past, and there are many days of historical data. The above calculation of _indextime - _time will yield enormous values because you are indexing data that is not current.
This calculation for lag should only be used when the environment is in a steady state, with all forwarders online and up to date on their indexing, with no new inputs.
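A quick worked example of why backfill inflates the number, as a small Python sketch. The dates are hypothetical and only serve to show the arithmetic.

```python
from datetime import datetime, timezone

def lag_secs(index_time, event_time):
    """Lag in seconds, the same subtraction as _indextime - _time."""
    return (index_time - event_time).total_seconds()

# Hypothetical: a new input is added on a Monday morning and starts
# reading log files that contain several days of history.
indexed_at = datetime(2015, 3, 9, 9, 0, 0, tzinfo=timezone.utc)
historical_event = datetime(2015, 3, 4, 9, 0, 0, tzinfo=timezone.utc)
fresh_event = datetime(2015, 3, 9, 8, 59, 58, tzinfo=timezone.utc)

backfill_lag = lag_secs(indexed_at, historical_event)  # five days of "lag"
fresh_lag = lag_secs(indexed_at, fresh_event)          # a couple of seconds
```

Both events were indexed at the same moment by a perfectly healthy pipeline, yet the historical one reports a lag of five days, which is why the measurement is only meaningful once the backlog has been consumed.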
I've tried your query and found that lagSecs stays at 0, then increases exponentially during general working hours (yes, we are using Splunk for traffic monitoring), and decreases back to 0 after working hours. So it's the same question again: is the heavy forwarder unable to handle the volume of logs coming in, or is it something else?