Deployment Architecture

Summary Replication

vr2312
Contributor

We have an indexer clustered environment with premium apps such as ES and ITSI running.

We were asked to enable summary_replication on the master, which automatically pushes the configuration to the peer nodes.

The documentation suggests that the replication consumes a large amount of bandwidth the first time and that bandwidth usage then tapers off.
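For context, enabling the setting is a one-line change on the cluster master (a minimal sketch; the `$SPLUNK_HOME` path and the exact `system/local` placement are assumptions based on a typical layout):

```shell
# On the cluster master, add to $SPLUNK_HOME/etc/system/local/server.conf:
#
#   [clustering]
#   mode = master
#   summary_replication = true
#
# A restart of the master is generally needed for the change to take effect:
$SPLUNK_HOME/bin/splunk restart
```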

We are an environment ingesting 2.1 TB of data every day, and ever since we enabled replication we have been observing the following issues:

  1. The network connectivity to our Cloud instances takes a hit, resulting in inaccessibility of the indexers.

  2. It also produces a lot of error messages when running searches (Peer down. Check peer rg)

  3. I can observe the following messages on the indexer, even though it was not restarted:

    05-03-2017 14:11:13.726 -0400 INFO CMMasterProxy - Master is back up!
    05-03-2017 14:14:18.834 -0400 INFO CMMasterProxy - Master is back up!
    05-03-2017 14:16:14.569 -0400 INFO CMMasterProxy - Master is back up!
    05-03-2017 14:23:52.253 -0400 INFO CMMasterProxy - Master is back up!
    05-03-2017 14:25:07.445 -0400 INFO CMMasterProxy - Master is back up!
    05-03-2017 14:30:42.824 -0400 INFO CMMasterProxy - Master is back up!
    05-03-2017 14:35:32.179 -0400 INFO CMMasterProxy - Master is back up!
    05-03-2017 14:37:53.508 -0400 INFO CMMasterProxy - Master is back up!
    05-03-2017 14:43:40.909 -0400 INFO CMMasterProxy - Master is back up!
    05-03-2017 14:48:56.141 -0400 INFO CMMasterProxy - Master is back up!
    05-03-2017 14:56:30.990 -0400 INFO CMMasterProxy - Master is back up!
    05-03-2017 14:57:20.214 -0400 INFO CMMasterProxy - Master is back up!

Are these all related? Should I disable summary replication?

Any inputs?

0 Karma
1 Solution

vr2312
Contributor

So after a lot of pondering and digging deep, we identified an underlying cause that was grabbing more resources from the indexers: a search head that was still replicating but had been left unmanaged. This search head was supposed to be decommissioned but was missed.

Hence that search head, with ES and a variety of other apps installed on it, was adding to the search load and hurting the performance of the indexers.

Once we removed the identified server, the replication jobs completed and indexer performance recovered immediately.

It is strange that this one server caused so many issues; it made us realize that every single instance needs to be accounted for.


0 Karma


dxu_splunk
Splunk Employee

Glad you resolved it - that's pretty strange, but good to hear it's all working! Is summary replication working too? 🙂

0 Karma

vr2312
Contributor

Yes, summary replication came back up again.

How are things with your issue? Was support helpful? @dxu_splunk

0 Karma

dxu_splunk
Splunk Employee

It's possible - what's the trigger condition before the "Master is back up" message? (There will be a corresponding "Master is down" message.)

0 Karma

vr2312
Contributor

Hello @dxu, I am getting errors like this:

CMMasterProxy - Master is down! Make sure pass4SymmKey is matching if master is running.

But I can observe that the pass4SymmKey is the same as when the whole infrastructure was set up, and nothing has changed recently other than enabling summary replication.
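For what it's worth, a sketch of one way to inspect the effective key (note: Splunk re-encrypts pass4SymmKey against each instance's local splunk.secret, so the on-disk ciphertext will differ between instances even when the plaintext matches; comparing the files directly proves nothing):

```shell
# Show the effective [clustering] stanza and which .conf file each value comes from
$SPLUNK_HOME/bin/splunk btool server list clustering --debug

# If in doubt, re-enter the same plaintext key on the master and every peer in
# server.conf ([clustering] pass4SymmKey = <shared key>), then restart each node.
```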

0 Karma

dxu_splunk
Splunk Employee

What are the messages immediately preceding "master is down"? Any errors?

0 Karma

vr2312
Contributor

Major correction: I saw those "Master is back up" messages on the indexer instance, not on the cluster master as I had mentioned before.

The errors preceding it were HttpListener read-timeout errors communicating with the search heads.

I am also seeing these errors from the cluster master this time:

05-03-2017 17:33:10.339 -0400 WARN CMRepJob - _rc=0 statusCode=500 err=No error
05-03-2017 17:33:13.266 -0400 WARN CMRepJob - _rc=0 statusCode=500 err=No error
05-03-2017 17:33:16.496 -0400 WARN CMRepJob - _rc=0 statusCode=500 err=No error
05-03-2017 17:30:48.780 -0400 WARN CMRepJob - _rc=0 statusCode=500 err=No error
05-03-2017 17:30:55.253 -0400 WARN CMPeer - decSummaryRepCount already 0!
05-03-2017 17:31:13.858 -0400 WARN CMMaster - event=removePeerBuckets peer=B6E7FF07-3CB3-4D8A-8D22-F8FD8042AE81 peer_name=walxsplunkidx3d bid=msad~359~3883BDEE-E8F7-4359-9209-7DE85C9FF9CD msg="Bucket is not on any other peer! Removing it."
05-03-2017 17:31:13.858 -0400 WARN CMMaster - event=removePeerBuckets peer=B6E7FF07-3CB3-4D8A-8D22-F8FD8042AE81 peer_name=walxsplunkidx3d bid=msad~362~30B96A42-BB42-4CBC-972C-B1B167E04197 msg="Bucket is not on any other peer! Removing it."
05-03-2017 17:31:13.858 -0400 WARN CMMaster - event=removePeerBuckets peer=B6E7FF07-3CB3-4D8A-8D22-F8FD8042AE81 peer_name=walxsplunkidx3d bid=msad~372~B6E7FF07-3CB3-4D8A-8D22-F8FD8042AE81 msg="Bucket is not on any other peer! Removing it."
05-03-2017 17:30:39.136 -0400 WARN CMMaster - event=removePeerBuckets peer=85FBFE19-9070-4893-B57C-E9762FE90622 peer_name=walxsplunkidx1d bid=msad~370~85FBFE19-9070-4893-B57C-E9762FE90622 msg="Bucket is not on any other peer! Removing it."
05-03-2017 17:30:25.591 -0400 WARN CMRepJob - _rc=0 statusCode=500 err=No error
05-03-2017 17:30:32.639 -0400 WARN CMRepJob - _rc=0 statusCode=500 err=No error

0 Karma

dxu_splunk
Splunk Employee

Ah, strange. Something doesn't look right there - those 500 errors on CMRepJob are causing your peers to re-add themselves to the cluster. Maybe try a cluster master restart and see if there are any improvements? If not, I'd probably disable summary_replication.

0 Karma

vr2312
Contributor

I have disabled summary_replication and have raised a ticket. Let us see what Support responds.
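For anyone following along, turning it off is the mirror of the enable step (a sketch; assumes the setting lives in system/local on the master):

```shell
# On the cluster master, set in $SPLUNK_HOME/etc/system/local/server.conf:
#
#   [clustering]
#   summary_replication = false
#
# then restart the master:
$SPLUNK_HOME/bin/splunk restart
```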

0 Karma