We have cases when the indexing delays are up to 15 minutes, it's rare but it happens. In these cases, we see that the indexing queues are at 80 – 100 percent capacity on three of the eight indexers. We see moderate bursts of data in these situations but not major bursts.
These eight indexers use Hitachi G1500 arrays with FMD (flash memory drives).
How can we understand better these situations and hopefully minimize the delays?
the issue could be due to the soucetyping of the data.
May be try going to the "Data Quality" dashboard in the Monitoring Console and check.
It is available in Monitoring console --> Indexing --> Data Inputs --> Data quality
It would should you the issues related to line breaking, time stamping or aggregation.
Solve these issues and see if the indexing delays continue.
We got the following query from our Sales Engineer -
index=_introspection host=host_name earliest=-24h sourcetype="splunk_resource_usage" data.avg_total_ms>0 component=IOstats | timechart span=1m avg("data.avg_total_ms") by host | eval threshold = 10
We have eight indexers and we can easily see that for some indexers we have a couple of milliseconds delay throughout the day, some are up and down the 10 ms threshold and two that consistently reach the 30 or 40 ms level during the day.
Is 10 milliseconds the threshold? If we go over it consistently, does it mean that we have a hardware issue?
How is the data distribution on your indexers also? If you have forwarders that are sticking to the three indexers, it could cause delays as the queues are filled.
The issue is likely with the load balancing setting on your forwarders in outputs.conf
If you have autoLBVolume set, try disabling it. That parameter causes a forwarder to switch to another indexer only after a certain amount of data has been sent.
A more preferred method, especially when you see some indexers receiving more data than others as you describe, is to use autoLBFrequency. This setting will force the forwarder to switch to another indexer after a specified interval (in seconds), rather than amount of data sent. Typically you can get a better distribution with this setting in place, and a shorter interval setting.
See the documentation here for more info: