Hey! I'm currently standing up an enterprise Splunk deployment with a multi-site (two-site) indexer cluster: 8 peers and 2 cluster managers in an HA configuration (load balanced by F5).

I've noticed that when we have a site-specific outage, data rightfully continues to be ingested at the site that is still up. But when service returns to the affected site, we get a thousand or more fixup tasks (normal, I suppose), and at times they hang and I eventually see replication failures in my health check. The unstable Pending/Down/Up status is usually associated with the peers from the site that went down as they attempt to clean up.

This cluster is still in development, so I have the luxury of deleting things with no consequence. The only fix I've seen work is deleting all the data from the peers that went down and letting them resync and copy everything from a clean slate. I'm sure there is a better way to remedy this.

Can anyone explain, or point me in the direction of, the appropriate solution and the exact cause of this problem? I've read Anomalous bucket issues - Splunk Documentation, but roll, resync, and delete don't quite do enough, and there's no mention of why the failures start occurring in the first place. From my understanding, fragmented buckets play a factor when reboots or unexpected outages happen, but how exactly do I regain stability in my data replication?
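In case it helps with diagnosis, this is roughly how I'm watching the fixup backlog after a site comes back. It's a minimal sketch that assumes the cluster manager's REST API on port 8089 and the cluster/master/fixup endpoint (cluster/manager/fixup on newer releases); the hostname and credentials are placeholders for my environment:

```
# Minimal sketch: count buckets pending fixup per level on the cluster manager.
# Assumes the cluster/master/fixup REST endpoint; host and credentials are placeholders.
import requests

CM = "https://cm.example.internal:8089"   # hypothetical cluster manager host
AUTH = ("admin", "changeme")              # placeholder credentials

# Fixup levels the manager tracks; each call lists buckets still pending at that level.
LEVELS = ["replication_factor", "search_factor", "generation"]

for level in LEVELS:
    resp = requests.get(
        f"{CM}/services/cluster/master/fixup",
        params={"level": level, "output_mode": "json", "count": 0},
        auth=AUTH,
        verify=False,  # dev cluster with self-signed certs
    )
    resp.raise_for_status()
    entries = resp.json().get("entry", [])
    print(f"{level}: {len(entries)} buckets pending fixup")
```

The counts for the recovered site's peers climb and then sit there until the replication errors show up, which is what makes me think something beyond normal fixup is going on.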