In my indexer cluster, on the MC under "Indexing>Performance>Indexing Performance: Deployment", I'm noticing that about half of my indexers show close to 100% across queues (from parsing to indexing) and about half show less than 20% across queues (quite a few are at 0% across queues).
My question is, why isn't the data being load balanced from the UFs? If some indexers are full, why is data not being sent to the indexers that have low volume in their queues? I keep getting warnings that forwarding destinations have failed, as if the forwarders are only trying to send to the full indexers. My outputs.conf accounts for all indexers in the cluster, so there must be something else I'm overlooking.
This generally indicates that load balancing on your forwarders is not optimal.
You can set the forwarders to use time-based or data-based load balancing (autoLBFrequency and autoLBVolume in outputs.conf, respectively), and when you see this type of unbalanced behavior you should adjust those settings.
Forwarders switch indexers after a set amount of time or a set amount of data. In a busy cluster, it is good practice to switch on time (seconds) rather than on data (MB).
This way, the forwarders will pick a new indexer every 30 seconds, or whatever interval you pick, rather than getting "stuck" on one indexer until a data threshold is reached.
Usually this problem occurs when there are disk IOPS performance issues. Long-running searches can also cause queues to fill up, as read operations may block the input queue. Check the Monitoring Console -> Resource Usage: Machine dashboard. If you see high I/O bandwidth utilization, it indicates a problem with IOPS. What is the current disk IOPS for the indexers having issues?
Thanks for the info! I'm checking now, and the indexers with nearly full processing queues are between 55% and 60% I/O Bandwidth Utilization. I'm curious about long-running searches; that sounds like it could be a problem.
You can check the Search Activity - Deployment - Indexers dashboard on the Monitoring Console and see if there are long-running or real-time searches running. Also, when the disks are old, the issue is sometimes caused by saturation and slow IOPS performance. You can run some tests to check the IOPS for your indexers. Utilization above 50% is high. Splunk Support can also help you confirm whether it is an IOPS issue if no long-running or real-time searches are running.
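If you would rather check from the search bar, a rough sketch like the one below lists the longest-running completed searches from the audit index. It assumes the default _audit index and its usual total_run_time, user and search_id fields; the 24-hour window and the 20-row limit are arbitrary choices. Searches that are still running will not show up here because they have not logged a completion event yet, so the dashboard is still the better place to spot those.

```
index=_audit action=search info=completed earliest=-24h
| stats max(total_run_time) as run_time_sec by user, search_id
| sort - run_time_sec
| head 20
```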
You need to work out if this is a forwarder problem, or an indexer issue.
Ideally your forwarders will send events evenly across your indexing peers. If for some reason some of your forwarders are not sending events to all peers, you could expect this type of behaviour when your event volume is large (although ideally it should not happen).
Check a sample of your forwarders and see if they are sending events to all peers:
index=_internal host=yourforwardername | stats count by splunk_server (a 15-minute time range should do)
Hopefully you see relatively consistent numbers across all peers.
If you only see results for a subset of peers, verify that the forwarder a) is configured to send data to all peers and b) actually can reach them (check routing, firewalls, etc.).
If you see events on all peers, but significantly more events on a subset of peers, check the load balancing in your UF outputs; volume-based is a good approach.
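To run the same check across every forwarder at once instead of one host at a time, something along these lines may help. It is only a sketch: it counts each host's own _internal events per indexer over the last 15 minutes, so the host list will also include indexers and search heads unless you add a filter for your forwarder naming scheme, and the peers_reached, events and spread names are just labels made up for this example.

```
| tstats count where index=_internal earliest=-15m by host, splunk_server
| stats dc(splunk_server) as peers_reached, sum(count) as events, stdev(count) as spread by host
| sort peers_reached
```

Hosts at the top of the result reached the fewest peers in the window, which makes them the first candidates for an outputs.conf or network check. A large spread relative to events suggests the host reaches all peers but favours some of them.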
If your UFs all appear to be sending data evenly (which is the most likely outcome), you probably have a problem with your indexers.
Try to establish what sets the affected subset apart from the healthy peers: are they all on the same site, do all peers have similar available IOPS and disk space? Does your cluster report any replication issues? Have a look in _internal for hints.
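As one possible starting point in _internal, the search below summarises how full the main pipeline queues have been on each host over the last hour, based on the queue metrics that splunkd writes to metrics.log. Treat it as a sketch: the one-hour window is arbitrary, the result includes every host that logs queue metrics (so you will probably want to add a host filter for your indexers), and avg_fill_pct, max_fill_pct and blocked_samples are names invented for this example.

```
index=_internal source=*metrics.log* group=queue earliest=-60m
    (name=parsingqueue OR name=aggqueue OR name=typingqueue OR name=indexqueue)
| eval fill_pct = round(100 * current_size_kb / max_size_kb, 1)
| stats avg(fill_pct) as avg_fill_pct, max(fill_pct) as max_fill_pct, count(eval(blocked=="true")) as blocked_samples by host, name
| sort - avg_fill_pct
```

If the same peers that look full on the MC also dominate this list, compare which queue fills first on them. A full indexqueue with the earlier queues backing up behind it usually points at disk writes, which fits the IOPS theory discussed above.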
You will need to dig into your indexer logs to look for clues, but start off with forwarders and try to rule them out first.