I have an environment with 4 indexers, 2 search heads, and about 20 intermediate forwarders (running universal forwarders), indexing about 100 GB per day. Endpoint clients forwarding to the intermediates number in the thousands.
Deployment Monitor is reporting that our indexers are overloaded; sometimes all 4 indexers report "Index Queue 95th Percentile As Fraction of Max Queue Size" values of roughly 80-98%.
I went ahead and checked an individual indexer that was reporting as overloaded, using a search on the queues like:
index=_internal source=*metrics.log* group=queue | timechart span=1m perc95(current_size) by name
This shows that the aggregator queue hovers around 500-2000 for a minute, then drops way back down.
The index queue hovers around 700-2000 for a few minutes during this time, then drops back down.
The typing queue and splunktcpin show occasional 1k spikes as well.
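For what it's worth, I also tried comparing current size against the queue maximum rather than looking at raw counts (this assumes the max_size field is reported alongside current_size in the metrics.log queue lines, as it is on our indexers):

```
index=_internal source=*metrics.log* group=queue
| eval fill_pct=round(current_size/max_size*100,1)
| timechart span=1m perc95(fill_pct) by name
```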
Checking system load on the indexers, CPU, memory, and disk activity all seem fairly normal (not overloaded).
The only change we made in the past few months was to switch our intermediate proxies to universal forwarders and to remove the thruput limit in limits.conf (we had some delayed data, so we lifted the cap):
[thruput] maxKBps = 0
Should we put a thruput limit back on but raise it above the default 256? Or would heavy intermediate forwarders be a better way to go?
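If we do go back to a limit, I assume something like this in limits.conf on the intermediate forwarders would do it (1024 is just a guess on my part, four times the old default, not a tested number):

```
# limits.conf on the intermediate forwarders
# maxKBps = 0 means unlimited; a nonzero value caps forwarder throughput in KB/s.
[thruput]
maxKBps = 1024
```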
I will probably file a support ticket on this but would be interested to hear others thoughts on the way we should go.
I have heard a few reports that the "overloaded" notifications are a bit aggressive. This is something that we will consider improving in a future version of Deployment Monitor.
For now, I would not worry about occasional full queues such as you are describing. If they are consistently full, then the stage after the full queue is usually the problem. For example, if your indexing queue is constantly full, your disk is too slow. If your output queue is constantly full, the device on the other end is probably not able to keep up with the output.
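One quick way to tell intermittent congestion from a real bottleneck is to count how often each queue actually reports itself as blocked; metrics.log tags those events with blocked=true. Something along these lines (a sketch, not a tuned search):

```
index=_internal source=*metrics.log* group=queue blocked=true
| timechart span=5m count by name
```

A queue that blocks for a minute here and there is the pipeline absorbing a burst; one that is blocked continuously points at the stage downstream of it.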
I am checking the disk with iostat -k -x 5.
It shows occasional spikes around 2-3k wkB/s that could possibly correspond to the high queues, but they are very brief and go back to normal fast. These machines should be able to handle the 80-100 GB per day, as we have 4 indexers set up with RAID 10... weird...
Intermittently full queues are not a problem - that is just the queue system doing its job of adjusting to congestion further down the pipeline. As I mentioned, the current alert thresholds are perhaps a bit aggressive, and I might tune them down in the future.
Yeah, if you could tune it down somewhat, that would be appreciated. Every once in a while a "red light" goes off for one of our four indexers and then goes away for another 8-10 hours.
Duly noted. Has anyone mentioned how awesome the Splunk community is at providing useful feedback?
araitz, I love the deployment monitor... Thanks for the hard work. And yes, a muzzle on indexer overload would be nice. 😉
In fairness, there are a lot of people who have done a lot more work on it than me. I will pass the love on 🙂