Looking at the queues in pipeline order (Parsing Queue - Aggregator Queue - Typing Queue - Index Queue), the full queue farthest to the right most likely holds the root cause, and everything to its left then backs up behind it. Once the Parsing Queue is full, that's when the Forwarders are affected.
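To make that rule of thumb concrete, here is a minimal Python sketch. It assumes you have already pulled a fill percentage for each queue from somewhere (the DMC, for example). The function name, the short queue labels, and the 95% "full" threshold are all made up for illustration; they are not anything Splunk defines.

```python
# Pipeline order, left (closest to the Forwarders) to right (closest to disk).
# These short labels are illustrative, not official Splunk queue names.
QUEUE_ORDER = ["parsing", "aggregator", "typing", "index"]


def likely_bottleneck(fill_pct, full_threshold=95.0):
    """Return the right-most queue at or above the threshold, or None.

    fill_pct maps a queue label to its fill percentage (0-100).
    full_threshold is an assumed cutoff for calling a queue "full".
    """
    # Walk from the right, because queues to the left of the real
    # bottleneck fill up only as a side effect of it.
    for name in reversed(QUEUE_ORDER):
        if fill_pct.get(name, 0.0) >= full_threshold:
            return name
    return None


# Hypothetical readings: every queue is full, so the Index Queue is the suspect.
sample = {"parsing": 100.0, "aggregator": 100.0, "typing": 98.0, "index": 100.0}
print(likely_bottleneck(sample))
```

If only the Parsing and Aggregator queues were full, the same function would point at the Aggregator Queue instead.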
Since the Index Queue is full, the culprit often comes down to read/write issues on the Indexer, because writing to disk is what happens after the Index Queue. These are the questions I would ask of your environment next:
Are you running out of disk space on your Indexers?
Are events coming in from the Forwarders too fast, or in too large a volume, for the IOPS available on the two Indexers you have?
If you take a look at License Usage in the DMC, what are your trends, and how do they compare to the times when the Indexers have a problem?
As a side note, I would definitely look into adding a third Indexer no matter what. You have all of these other supporting Splunk instances around your Index Cluster, but if one Indexer goes down you immediately lose 50% of your capacity. Adding a third will give you that much more headroom. Also, Splunk scales wonderfully horizontally at the Indexer layer. It is cheap and easy to improve all aspects of your environment by adding more Indexers (indexing speed, search speed, disk space, etc.).
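The headroom argument is just arithmetic, sketched below in Python. It assumes load is spread evenly across identical Indexers, which is a simplification, and the function name is only illustrative.

```python
def capacity_lost_pct(indexers, failed=1):
    """Percentage of indexing capacity lost when `failed` Indexers go down.

    Assumes all Indexers are identical and share the load evenly.
    """
    if indexers <= 0 or not 0 <= failed <= indexers:
        raise ValueError("need indexers > 0 and 0 <= failed <= indexers")
    return 100.0 * failed / indexers


# Compare losing one Indexer in clusters of two, three, and four.
for n in (2, 3, 4):
    print(f"{n} Indexers, one down: {capacity_lost_pct(n):.1f}% of capacity lost")
```

With two Indexers a single failure costs half your capacity; with three it costs about a third.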