Hey!
I am currently standing up an enterprise Splunk deployment that has a multi-site (2-site) indexer cluster of 8 peers and 2 cluster managers in an HA configuration (load-balanced by an F5). I've noticed that if we have an outage specific to one site, data rightfully continues to be ingested at the site that is still up.
But when service returns to the secondary site, we see a thousand or more fixup tasks (normal, I suppose), but at times they hang and I eventually get replication failures in my health check. The unstable pending/down/up status is usually associated with the peers from the site that went down as they attempt to clean up.
This is still a development environment, so I have the luxury of deleting things without consequence. The only fix I have found to work is deleting all the data from the peers that went down and letting them resync and copy from a clean slate. I'm sure there is a better way to remedy this issue.
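One gentler thing worth trying before wiping the peers is clearing excess bucket copies once fixup settles. This is a sketch using Splunk's standard cluster CLI, run on the cluster manager; the index name is just an example, so verify the exact syntax against the docs for your version:

```
# List bucket copies that exceed your replication/search factor
# (these often linger after a downed site rejoins and fixup runs)
splunk list excess-buckets

# Remove the excess copies for one index ("main" is an example)
splunk remove excess-buckets main
```

If the cluster is stuck on fixup rather than excess copies, this won't help by itself, but it avoids a full resync for the common post-outage cleanup case.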
Can anyone explain, or point me in the direction of, the appropriate solution and the exact cause of this problem?
I've read the "Anomalous bucket issues" page in the Splunk documentation, but roll/resync/delete doesn't quite do enough, and there is no mention of why the failures start to occur. From my understanding, fragmented buckets play a factor when reboots or unexpected outages happen, but how exactly do I regain some stability in my data replication?
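It can also help to see exactly what the cluster manager is still trying to fix up, rather than going bucket-by-bucket from the logs. A sketch using the cluster manager's fixup REST endpoint; the level value and the reason field names here are assumptions from memory, so check the REST API reference for your version:

```
| rest /services/cluster/master/fixup?level=replication_factor splunk_server=local
| table title, initial.reason, latest.reason
```

Run this on the cluster manager itself (or point splunk_server at it). The latest.reason for the stuck entries usually says why a task is hanging, e.g. waiting on a peer that is flapping.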
Do you have cold volume storage restrictions? The offline site may have cold buckets that want to replicate back to the always-on site, which that site may have already removed due to volume utilization restrictions. Do you have any details in your internal logs indicating which buckets are not replicating, or anything special about those specific buckets?
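For reference, the restriction would live in indexes.conf on the peers (pushed from the cluster manager's manager-apps/_cluster, or master-apps on older versions). A minimal sketch of what to look for; the paths and size are illustrative, not your actual values:

```
# indexes.conf (example values)
[volume:cold]
path = /dataCold
# Hard cap: when the volume reaches this size, Splunk freezes (removes)
# the oldest cold buckets to make room -- replicated copies included
maxVolumeDataSizeMB = 500000

[main]
coldPath = volume:cold/main/colddb
```

If the always-on site froze buckets to stay under maxVolumeDataSizeMB while the other site was down, the returning peers can try to replicate copies the cluster no longer agrees exist.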
I may have restrictions, but I'm not sure what they are, as I inherited some configurations from our Splunk PS engineer who started building our architecture, which I subsequently ended up finishing. Where should I look?
As for the internal logs, they are vague. They do tell me which buckets, but there are a lot of them. Nothing stands out to me, though that could be my untrained eye on whether they are all hot/warm/cold.
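One way to make the internal logs less vague is to aggregate the replication errors by bucket. A sketch; the component names are the usual clustering ones, and the rex assumes bucket ids appear as bid=... in splunkd logs, so adjust both to what you actually see in your events:

```
index=_internal sourcetype=splunkd (log_level=ERROR OR log_level=WARN)
    (component=CMMaster OR component=CMPeer OR component=BucketReplicator)
| rex field=_raw "bid=(?<bucket_id>\S+)"
| stats count, values(component) AS components by bucket_id
| sort - count
```

The buckets at the top of that list are the ones to inspect first; the bucket id itself encodes the index name, which tells you whether the problem children are concentrated in one index.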
Leverage your monitoring console as the easy way to check volume sizes on each of your indexers. Ideally the total space should be absolutely identical across all indexers, i.e. 100MB x 8 indexers/site x 2 sites (completely made-up numbers).
| rest /services/data/index-volumes splunk_server=*
Run that SPL on your search head and it will return results for all servers in your search head cluster and indexer cluster. You can add more search terms to narrow down to the indexer level and then transform the results for "/dataHot", "/dataCold", and "_splunk_summaries". Look at the per-server results for used vs. available/total and create a calculated field for %used. Anything above 85% for /dataCold is typically a strong indication you need to expand your storage capacity. Note that "/dataHot" by design runs full before it will roll a bucket over to the /dataCold volume.
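A worked sketch of that %used calculation, using the partitions-space status endpoint (whose capacity/free fields I've seen reported in MB); verify the endpoint and field names against the REST API reference for your version:

```
| rest /services/server/status/partitions-space splunk_server=*
| eval pct_used = round((capacity - free) * 100 / capacity, 1)
| table splunk_server, mount_point, capacity, free, pct_used
| where pct_used > 85
```

Anything this returns for a cold mount point is a candidate for the volume-restriction problem described above: the always-on site pruning cold buckets that the recovering site then fails to replicate.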