Getting Data In

What is the source of indexing lag and how to fix it?

Path Finder

Hello,

We've been having issues with license usage lately, where we see sudden spikes in events per second (EPS) from multiple hosts.

Recently, I found the index-time field (eval indextime=_indextime) and used it to compare against the event timestamp (_time).
[screenshot: event count charted by _time vs. _indextime]

It is not shown in the image above, but the initial time for both charts was exactly 00:00:00, and there is certainly a lag at the end, where the event count by index time spikes.
So my question is: am I interpreting the graph properly, i.e. that the forwarder is pushing logs in bursts at certain points instead of streaming them continuously?
If so, what could be causing this lag? We have more than 100 devices reporting to our heavy forwarder. Is that too much for a heavy forwarder to handle? Or is it a hardware issue (memory or CPU) on the heavy forwarder or the indexer? Again, similar lag is observed across dozens of our hosts, which generate large amounts of logs.

Thanks in advance!


Esteemed Legend

If you have latencies on the order of 10K seconds, then it is almost certainly NOT a horsepower issue with your forwarder (unless you are processing ZIPped files). It is far more likely that you have a time zone issue and Splunk is interpreting timestamps as being hours off from what they really are. In your first chart, the lag never goes below 10K seconds, so this really could be the problem. Try looking at this data:

index=* | eval lagSecs = _indextime - _time | stats min(lagSecs) avg(lagSecs) max(lagSecs) by index, host, sourcetype, date_zone

If your min, max, and avg are roughly the same and >1K seconds or so (or if the lag ever goes <0), then you almost certainly have a TZ issue. The date_zone field will show "default" if the TZ from the host OS of the heavy forwarder is being used, or something else if it is being set by a configuration file in Splunk.
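A minimal Python sketch of why a TZ misparse shows up as a large, flat lag (the timestamps and the 9-hour offset below are hypothetical, not from this thread): if a device writes zone-less timestamps and the forwarder applies the wrong zone, _time shifts by a whole-hour offset while _indextime stays correct, so min, avg, and max lag all sit near the same multiple of 3600.

```python
from datetime import datetime, timedelta, timezone

def lag_stats(pairs):
    """pairs: (_indextime, _time) tuples; returns (min, avg, max) lag in seconds."""
    lags = [(it - et).total_seconds() for it, et in pairs]
    return min(lags), sum(lags) / len(lags), max(lags)

# Hypothetical scenario: the device writes "2015-06-01 00:00:0i" in UTC with
# no zone marker, but the parser applies a UTC+9 local zone, so the parsed
# _time lands 9 hours in the past. Indexing itself takes only ~2 seconds.
pairs = []
for i in range(3):
    true_event = datetime(2015, 6, 1, 0, 0, i, tzinfo=timezone.utc)
    parsed_time = true_event - timedelta(hours=9)   # TZ misparse shifts _time
    index_time = true_event + timedelta(seconds=2)  # near-instant indexing
    pairs.append((index_time, parsed_time))

lo, avg, hi = lag_stats(pairs)
print(lo, avg, hi)  # all 32402.0 s: flat lag near a whole-hour multiple
```

The signature to look for is exactly what the answer describes: min ≈ avg ≈ max, all close to N x 3600 seconds.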

SplunkTrust

Hi hcheang,

If you're on Splunk 6.2.x you can run the DMC, that is the Distributed Management Console, using this URI:

/en-US/app/splunk_management_console/indexing_performance_instance

This will give you a nice overview of the pipelines; maybe you will get some hints from it.
On pre-6.2.x setups, use the S.o.S. app: https://splunkbase.splunk.com/app/748/

Hope this helps ...

cheers, MuS

Path Finder

We are using Splunk 6.1.x; I've just installed S.o.S., finished setting it up, and... it doesn't look good.
I went into Indexing > Distributed Indexing Performance, checked the real-time measured indexing rate and latency per index, and found the average latency of this index to be 30K seconds, which is over 8 hours.

I ran the query I found at http://answers.splunk.com/answers/31151/index-performance-issue-high-latency.html, tried index=_internal source=*metrics.log blocked, and found that most of the hosts reporting to this troubled index are exceeding the queue max_size... I guess I'll go change that setting on the forwarder.
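The same check can be sketched outside Splunk. The metrics.log-style lines below are illustrative only (tokens like name= and blocked=true do appear in real queue metrics lines, but treat the exact layout and the host= field as assumptions); the idea is simply to count blocked-queue events per host:

```python
import re
from collections import Counter

# Illustrative metrics.log-style lines (layout is an assumption, not real output)
lines = [
    'INFO Metrics - group=queue, name=parsingqueue, blocked=true, max_size=1000 host=fw01',
    'INFO Metrics - group=queue, name=parsingqueue, blocked=true, max_size=1000 host=fw01',
    'INFO Metrics - group=queue, name=indexqueue, max_size=1000 host=fw02',
]

def blocked_by_host(lines):
    """Count lines reporting a blocked queue, keyed by the host field."""
    counts = Counter()
    for line in lines:
        if 'blocked=true' in line:
            m = re.search(r'host=(\S+)', line)
            if m:
                counts[m.group(1)] += 1
    return counts

print(blocked_by_host(lines))  # Counter({'fw01': 2})
```

Hosts that dominate this count are the ones whose queues are filling faster than the pipeline can drain them.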


Builder

Which setting did you change, and did it fix your problem?


Esteemed Legend

What you need to do is graph this to see if there is bunching up of events. My suspicion is that you will see that lagSecs is fairly static and you will have to look elsewhere:

index=* | eval lagSecs = _indextime - _time | eval CombinedSource = host . "/" . sourcetype . "/" . index | timechart avg(lagSecs) by CombinedSource
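As a rough analogue of that timechart, here is a hedged Python sketch (hypothetical data) that buckets lag by hour of index time and averages it per combined source, which is enough to spot bunching:

```python
from collections import defaultdict

def timechart_avg_lag(events, span=3600):
    """events: (index_time, event_time, combined_source) tuples in epoch
    seconds. Returns {(bucket_start, source): average lag}, mimicking
    `timechart avg(lagSecs) by CombinedSource` with a 1-hour span."""
    acc = defaultdict(lambda: [0.0, 0])
    for idx_t, ev_t, src in events:
        bucket = int(idx_t // span) * span      # floor to the span boundary
        cell = acc[(bucket, src)]
        cell[0] += idx_t - ev_t                 # accumulate lag
        cell[1] += 1                            # and the event count
    return {key: total / n for key, (total, n) in acc.items()}

# Hypothetical: steady 5 s lag in hour 0, bunching (1800 s lag) in hour 1
events = [
    (100, 95, "fw01/syslog/main"),
    (200, 195, "fw01/syslog/main"),
    (3700, 1900, "fw01/syslog/main"),
]
print(timechart_avg_lag(events))
```

If the per-bucket averages stay flat, the lag is a constant offset (pointing back at the TZ theory); if they spike in bursts, events really are bunching up somewhere in the pipeline.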

Legend

But watch out for "false lag" - which is often caused by adding new inputs.

For example, on Monday you begin to index the database log files - they were not indexed in the past, and there are many days of historical data. The above calculation of _indextime - _time will yield enormous values because you are indexing data that is not current.

This calculation for lag should only be used when the environment is in a steady state, with all forwarders online and up to date on their indexing, with no new inputs.
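The false-lag effect is easy to see in a toy Python example (all numbers hypothetical): a freshly backfilled input contributes days of apparent lag even though nothing is slow, so it should be excluded before reading the stats.

```python
now = 1_000_000  # arbitrary epoch reference

# (_indextime, _time) pairs: a newly added input backfilling 3-day-old
# database logs, next to a live input with a genuine 5-second lag.
backfill = [(now + i, now + i - 3 * 86400) for i in range(3)]
live = [(now + i, now + i - 5) for i in range(3)]

lag = lambda pairs: [it - et for it, et in pairs]

print(max(lag(backfill)))  # 259200 s of "lag" that is really just old data
print(max(lag(live)))      # 5 s: the real steady-state lag
```

Mixing the two inputs in one stats run would drag the average up by orders of magnitude without any actual pipeline problem.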


Path Finder

This is what it looks like:

[image link broken]


Esteemed Legend

Your image link does not work.


Path Finder

I've tried your query and found that lagSecs stays at 0 but then increases exponentially during general work hours (yes, we are using Splunk for traffic monitoring) and decreases back to 0 after hours. So it's the same question again: is the heavy forwarder unable to handle the volume of logs coming in, or is it something else?
