We have a very large search head cluster (SHC) environment with 6 indexer clusters and more than 1,500 indexers in total across those 6 clusters.
The issue:
- we would add an indexer back to an indexer cluster (e.g. after its hardware had been fixed)
- the indexer would join the cluster again
- the search heads would briefly REMOVE all or almost all indexers (not just the ones in the SAME indexer cluster as the one being added back)
- each search head would then add the indexers back
- most or all of the SHC members would repeat this process, so over a period of many minutes you could have searches that were not searching all possible indexers
For each search head, the window in which the indexers were removed lasted less than a minute, BUT any search that ran during that window would find no indexers, or fewer indexers than it should, to search.
The solution provided by Splunk, which worked for us, is to add a setting to distsearch.conf. (By the way, the setting is not documented and is not in distsearch.conf.spec, so I am told you will get a btool warning.)
[distributedSearch]
useIPAddrAsHost = false
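If you want to confirm the stanza actually landed in a given copy of distsearch.conf, something like the following could work. This is only a rough sketch, not anything provided by Splunk: the function name `get_distsearch_setting` and the sample text are made up for illustration, and it treats the file as plain INI via Python's `configparser`, which does not cover every Splunk .conf feature (for example, settings placed before any stanza header would raise an error).

```python
import configparser


def get_distsearch_setting(conf_text, key="useIPAddrAsHost", stanza="distributedSearch"):
    """Return the raw string value of `key` in `stanza`, or None if absent.

    Hypothetical helper for illustration; it parses the text as simple INI.
    """
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    # Splunk setting names are case-sensitive, so do not lowercase them.
    parser.optionxform = str
    parser.read_string(conf_text)
    if not parser.has_section(stanza):
        return None
    return parser.get(stanza, key, fallback=None)


# Sample content mirroring the stanza from this post.
sample = """
[distributedSearch]
useIPAddrAsHost = false
"""

print(get_distsearch_setting(sample))
```

In practice you would read the text from whichever distsearch.conf you edited on the search head and check that the returned value is `false`.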
I am sharing this solution in case you encounter the same issue.
@burwell Thanks for sharing the info. It seems you are handling a very big infrastructure.