I recently had a Splunk outage. My monitoring software showed plenty of IO, CPU and RAM available, yet forwarders were reporting that the TCP queues were full on the receiving indexers.
I popped into Splunk on Splunk and looked at the fill ratios for all 4 queue stages, which are normally at 0%. The 4th stage, the indexing queue, was maxed out, even though we actually had lower than average throughput. After some poking around I discovered that a set of real-time (RT) dashboards had been created by our NOC and sent out to the general population. Once I disabled RT, the queues went right back to 0%.
The abusive RT dashboards aside, I feel there is some performance tuning I am missing. With plenty of system resources available, I'd like to understand why these queues backed up so badly and what I can do to get better performance out of the indexing queue... ideally without installing 10 more indexers 🙂