We have an indexer cluster environment, and we run premium apps such as ES and ITSI.
We were asked to enable summary_replication on the master, which would automatically push the configuration to the peer nodes.
The documentation says that replication uses a large amount of bandwidth at first and that the usage then drops off.
Our environment ingests 2.1 TB of data every day, and ever since we enabled replication we have been observing the following issues:
Network connectivity to our cloud instances suffers, which makes the indexers inaccessible.
Searches also return many error messages (Peer down. Check peer rg)
I can also see these messages on the indexer, even though it has not been restarted:
05-03-2017 14:11:13.726 -0400 INFO CMMasterProxy - Master is back up!
05-03-2017 14:14:18.834 -0400 INFO CMMasterProxy - Master is back up!
05-03-2017 14:16:14.569 -0400 INFO CMMasterProxy - Master is back up!
05-03-2017 14:23:52.253 -0400 INFO CMMasterProxy - Master is back up!
05-03-2017 14:25:07.445 -0400 INFO CMMasterProxy - Master is back up!
05-03-2017 14:30:42.824 -0400 INFO CMMasterProxy - Master is back up!
05-03-2017 14:35:32.179 -0400 INFO CMMasterProxy - Master is back up!
05-03-2017 14:37:53.508 -0400 INFO CMMasterProxy - Master is back up!
05-03-2017 14:43:40.909 -0400 INFO CMMasterProxy - Master is back up!
05-03-2017 14:48:56.141 -0400 INFO CMMasterProxy - Master is back up!
05-03-2017 14:56:30.990 -0400 INFO CMMasterProxy - Master is back up!
05-03-2017 14:57:20.214 -0400 INFO CMMasterProxy - Master is back up!
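To put a number on how often the peer loses and regains the master, the messages above can be tallied with a small script. This is only a rough sketch: the function name summarize_flaps and the regular expression are my own assumptions based on the log lines pasted here, not anything shipped with Splunk.

```python
import re
from collections import Counter
from datetime import datetime

# Matches splunkd.log lines such as:
#   05-03-2017 14:11:13.726 -0400 INFO CMMasterProxy - Master is back up!
# The pattern is inferred from the excerpts in this thread (assumption).
LINE_RE = re.compile(
    r"^(?P<ts>\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2}\.\d{3}) [+-]\d{4} "
    r"\w+\s+CMMasterProxy - (?P<msg>Master is (?:back up|down)!)"
)


def summarize_flaps(lines):
    """Count CMMasterProxy up/down messages and the gaps between recoveries.

    Returns (counts, gaps), where counts maps each message to how often it
    occurred and gaps lists the seconds between consecutive
    "Master is back up!" messages.
    """
    counts = Counter()
    recoveries = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # not a CMMasterProxy up/down line
        msg = match.group("msg")
        counts[msg] += 1
        if msg == "Master is back up!":
            recoveries.append(
                datetime.strptime(match.group("ts"), "%m-%d-%Y %H:%M:%S.%f")
            )
    gaps = [
        (later - earlier).total_seconds()
        for earlier, later in zip(recoveries, recoveries[1:])
    ]
    return counts, gaps
```

Feeding it the lines of splunkd.log from an indexer (for example, `summarize_flaps(open(path))`, with path pointing at the log file) gives the number of flaps and how far apart they are, which is easier to compare before and after a configuration change than scrolling through the raw log.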
Are these all related? Should I disable summary replication?
Any inputs?
So after a lot of pondering and digging, we identified an underlying cause that was grabbing more resources from the indexers. There was a search head that had been replicated and then left unmanaged. This search head was supposed to be decommissioned but was missed.
That search head, with both ES and a variety of apps installed on it, was still running searches and hurting the performance of the indexers.
We removed that server, and the replication jobs and indexer performance recovered right away, without a hitch.
Strange that this one server gave us so many issues. It made us realize that every single instance needs to be accounted for.
Glad you resolved it. That's pretty strange, but good to hear it's all working! Is summary replication working too? 🙂
Yes, summary replication got back up again.
How are things with your issue? Was support helpful? @dxu_splunk
It's possible. What's the trigger condition before the "Master is back up" message? (There will be a corresponding "Master is down" message.)
Hello @dxu, I am getting errors like this:
CMMasterProxy - Master is down! Make sure pass4SymmKey is matching if master is running.
But I can see that the pass4SymmKey is the same as when the whole infrastructure was set up, and nothing has changed recently other than enabling summary replication.
What are the preceding messages before it says "Master is down"? Any errors?
Major correction: I saw those "Master is back up" messages on the indexer instance, not on the cluster master as I had mentioned before.
The errors preceding it were about HttpListener - Read Timeout communicating with the search heads.
I am also seeing these errors from the cluster master this time:
05-03-2017 17:33:10.339 -0400 WARN CMRepJob - _rc=0 statusCode=500 err=No error
05-03-2017 17:33:13.266 -0400 WARN CMRepJob - _rc=0 statusCode=500 err=No error
05-03-2017 17:33:16.496 -0400 WARN CMRepJob - _rc=0 statusCode=500 err=No error
05-03-2017 17:30:48.780 -0400 WARN CMRepJob - _rc=0 statusCode=500 err=No error
05-03-2017 17:30:55.253 -0400 WARN CMPeer - decSummaryRepCount already 0!
05-03-2017 17:31:13.858 -0400 WARN CMMaster - event=removePeerBuckets peer=B6E7FF07-3CB3-4D8A-8D22-F8FD8042AE81 peer_name=walxsplunkidx3d bid=msad~359~3883BDEE-E8F7-4359-9209-7DE85C9FF9CD msg="Bucket is not on any other peer! Removing it."
05-03-2017 17:31:13.858 -0400 WARN CMMaster - event=removePeerBuckets peer=B6E7FF07-3CB3-4D8A-8D22-F8FD8042AE81 peer_name=walxsplunkidx3d bid=msad~362~30B96A42-BB42-4CBC-972C-B1B167E04197 msg="Bucket is not on any other peer! Removing it."
05-03-2017 17:31:13.858 -0400 WARN CMMaster - event=removePeerBuckets peer=B6E7FF07-3CB3-4D8A-8D22-F8FD8042AE81 peer_name=walxsplunkidx3d bid=msad~372~B6E7FF07-3CB3-4D8A-8D22-F8FD8042AE81 msg="Bucket is not on any other peer! Removing it."
05-03-2017 17:30:39.136 -0400 WARN CMMaster - event=removePeerBuckets peer=85FBFE19-9070-4893-B57C-E9762FE90622 peer_name=walxsplunkidx1d bid=msad~370~85FBFE19-9070-4893-B57C-E9762FE90622 msg="Bucket is not on any other peer! Removing it."
05-03-2017 17:30:25.591 -0400 WARN CMRepJob - _rc=0 statusCode=500 err=No error
05-03-2017 17:30:32.639 -0400 WARN CMRepJob - _rc=0 statusCode=500 err=No error
Ah, strange. Something doesn't look right there: those 500 errors on CMRepJob are causing your peers to re-add themselves to the cluster. Maybe try a cluster master restart and see if there's any improvement? If not, I'd probably disable summary_replication.
I have disabled summary_replication and have raised a ticket. Let us see what Support responds.
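For reference, summary replication is controlled by the summary_replication setting in the [clustering] stanza of server.conf on the master. A minimal sketch of turning it off, assuming the setting lives in the master's local server.conf (in your deployment it may be in an app's local directory instead):

```
# server.conf on the cluster master (file location is an assumption)
[clustering]
# Set to true to replicate report/data model acceleration summaries
# across peers; false turns summary replication off.
summary_replication = false
```

Check the documentation for your Splunk version on whether the master needs a restart for the change to take effect.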