Deployment Architecture

Deployment monitor reporting indexers overloaded

sonicZ
Contributor

Hello,

I have an environment with 4 indexers and 2 search heads, with about 20 intermediate forwarders (running universal forwarders) indexing about 100 GB per day. Endpoint clients forwarding to them number in the thousands.

Deployment monitor is reporting that our indexers are overloaded; sometimes all 4 indexers report "Index Queue 95th Percentile As Fraction of Max Queue Size" values of about 80-98%.

I went ahead and checked an individual indexer that was reporting as overloaded, using a search on the queues like:

index=_internal source=*metrics.log* group=queue | timechart span=1m perc95(current_size) by name

This shows that the aggregator queue hovers around 500-2000 for a minute and then drops way back down.
The indexqueue hovers around 700-2000 for a few minutes during this time, then drops back down.
The typingqueue and splunktcpin queues show occasional 1k spikes as well.
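Since the deployment monitor metric is a fraction of max queue size rather than a raw count, charting the fill ratio directly may line up better with its alerts. A sketch, assuming the group=queue events in your metrics.log carry the current_size_kb and max_size_kb fields:

index=_internal source=*metrics.log* group=queue | eval fill_pct=round(100*current_size_kb/max_size_kb,1) | timechart span=1m perc95(fill_pct) by name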

Checking system load on the indexers, CPU, memory, and disk activity all seem fairly normal (not overloaded).

The only change we made in the past few months was to use universal forwarders on our intermediate proxies and to lift the thruput limit in limits.conf (we had some delayed data, so we removed the throughput cap):

[thruput]
maxKBps = 0

Should we put a thruput limit back on but raise it above 256 (see the sketch below)? Would heavy intermediate forwarders be a better way to go?
I will probably file a support ticket on this, but I would be interested to hear others' thoughts on which way we should go.
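For reference, putting a cap back on would be a forwarder-side limits.conf change along these lines (512 is only a guess at a starting value above the 256 default; it would need tuning against our actual per-forwarder volume):

[thruput]
maxKBps = 512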

1 Solution

araitz
Splunk Employee

I have heard a few reports that the "overloaded" notifications are a bit aggressive. This is something that we will consider improving in a future version of Deployment Monitor.

For now, I would not worry about occasional full queues such as you are describing. If a queue is consistently full, then the stage after the full queue is usually the problem. For example, if your indexing queue is constantly full, your disk is too slow. If your output queue is constantly full, the device on the other end is probably not able to keep up with the output.
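One way to tell intermittent from consistent is to count blocked queue events; metrics.log marks a queue line with blocked=true when the queue fills. Something along these lines (a sketch, adjust the span to taste):

index=_internal source=*metrics.log* group=queue blocked=true | timechart span=10m count by name

An occasionally blocked queue shows sparse spikes; a consistently full queue shows a steady count, and the stage downstream of it is the one to dig into.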

araitz
Splunk Employee

In fairness, there are a lot of people who have done a lot more work than me on it. I will pass the love on 🙂


bensbrowning
Explorer

araitz, I love the deployment monitor... Thanks for the hard work. And yes, a muzzle on indexer overload would be nice. 😉

araitz
Splunk Employee

Duly noted. Has anyone mentioned how awesome the Splunk community is at providing useful feedback?

wmosher
Path Finder

+1 for less aggressive here


tmeader
Contributor

Yeah, if you could tune it down somewhat, that would be appreciated. Every once in a while a "red light" goes off for one of our four indexers, then goes away again for another 8-10 hours.


araitz
Splunk Employee

Intermittently full queues are not a problem - that is just the queue system doing its job of adjusting to congestion further down the pipeline. As I mentioned, the current settings are perhaps a bit aggressive and I might tune them down in the future.


sonicZ
Contributor

I am checking the disk with iostat -k -x 5.
It shows occasional spikes around 2-3k wkB/s that could possibly correspond to the high queues, but they are very brief and go back to normal fast. These machines should be able to handle the 80-100 GB per day, as we have 4 indexers set up with RAID 10... weird.
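In the extended output, the await and %util columns are probably more telling than wkB/s alone; sustained %util near 100 on the index volume would point at the disk, while brief spikes would not. For example, watching just the index volume (the device name here is only an example):

iostat -k -x sdb 5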
