Splunk Enterprise

Cause of replication failures in indexer cluster?

doeboy
New Member

Hey!

I am currently standing up a Splunk Enterprise deployment with a two-site indexer cluster of 8 peers and 2 cluster managers in an HA configuration (load-balanced by an F5). I've noticed that during an outage specific to one site, data rightly continues to be ingested at the site that is still up.

But when service returns to the secondary site, we have a thousand or more fixup tasks (normal, I suppose). At times they hang, and eventually I get replication failures in my health check. The peers from the site that went down usually show an unstable pending/down/up status as they attempt to clean up.

This is still developmental, so I have the luxury of deleting things with no consequence. The only fix I have seen work is deleting all the data from the peers that went down and allowing them to resync and copy from a clean slate. I'm sure there is a better way to remedy this issue.

Can anyone explain, or point me toward, the appropriate solution and the exact cause of this problem?

I've read the "Anomalous bucket issues" page in the Splunk documentation, but roll, resync, and delete don't quite do enough, and there is no mention of why the failures start to occur. From my understanding, fragmented buckets play a part after reboots or unexpected outages, but how exactly do I regain some stability in my data replication?


dural_yyz
Builder

Do you have cold volume storage restrictions? The site that was offline may have cold buckets that it wants to replicate back to the always-on site, which that site may already have removed because of volume utilization restrictions. Do your internal logs show which buckets are not replicating, and is there anything special about those specific buckets?
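
If it helps, a rough way to see what state the buckets are in is the dbinspect command. This is only a sketch: it counts buckets and on-disk size per indexer, index, and state (hot, warm, cold). The index=* is a placeholder that you would likely want to narrow to the indexes named in the replication errors, since scanning everything can be slow.

```spl
| dbinspect index=*
| stats count AS buckets sum(sizeOnDiskMB) AS size_mb BY splunk_server index state
| sort splunk_server index state
```

A large difference in cold bucket counts between the two sites for the same index would be consistent with one site having already aged out buckets that the other still holds.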


doeboy
New Member

I may have restrictions, but I'm not sure what they are. I inherited some configurations from our Splunk PS, who started building our architecture, which I subsequently ended up finishing. Where should I look?

As for the internal logs, they are vague. They do tell me which buckets, but there are a lot of them. Nothing stands out to me, but that could be my untrained eye on whether they are all hot/warm/cold.


dural_yyz
Builder

Use your monitoring console as the easy way to check volume sizes on each of your indexers. Ideally the total space should be identical across all indexers, i.e. 100 MB x 8 idx/site x 2 sites (completely made-up numbers).

| rest splunk_server=* /services/data/index-volumes

Run that SPL on your search head and it will return results for all servers in your search head cluster and indexer cluster. You can add more search terms to get down to the indexer level and then transform the results for "/dataHot", "/dataCold", and "_splunk_summaries". Look at the per-server results for used vs. available/total and create a calculated field for %used. Anything above 85% for /dataCold is typically a strong indication that you need to expand your storage capacity. Note that "/dataHot" by design runs full before it rolls a bucket over to the /dataCold volume.
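
The %used calculation could be sketched roughly like this. The field names max_size and total_size (both in MB), and the literal value "infinite" for a volume with no size limit, are what I would expect this endpoint to return, but treat them as assumptions and check the raw output on your version first. The 85 threshold is just the rule of thumb mentioned above.

```spl
| rest splunk_server=* /services/data/index-volumes
| search max_size!="infinite"
| eval pct_used = round(total_size / max_size * 100, 1)
| eval over_85 = if(pct_used > 85, "yes", "no")
| table splunk_server title total_size max_size pct_used over_85
| sort - pct_used
```

Volumes without a numeric limit are dropped by the search line, so they need to be checked against the underlying filesystem instead.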
