Cluster Master is unable to Meet Search Factor and Replication Factor, cluster Peer status is flapping Up and Down.

Our master node is still losing sight of indexer nodes. We had a similar problem that affected up to half of our cluster at a time; currently it affects 1 or 2 indexers at a time. The master reports the indexer as down, but the indexer is in fact up. This flapping of the
In total you have 20 indexers, each with around 25K buckets, so the cluster holds 20 * 25,000 = 500,000 buckets.

Re: Cluster Master is unable to Meet Search Factor and Replication Factor, cluster Peer status is flapping Up and Down.

Worked with Splunk Support to resolve this issue; the following steps were recommended.

So there are a lot of buckets for the cluster master to manage. When the bucket count gets high, the cluster master has to do a lot of processing to stay in compliance with the replication factor (RF) and search factor (SF). Splunk works best with fewer, larger buckets.

Here are some of the things that were checked.

1) "maxDataSize = auto" was defined for the index. See http://docs.splunk.com/Documentation/Splunk/latest/Admin/Indexesconf and change the setting to maxDataSize = auto_high_volume. This will reduce the bucket count going forward, although it will not reduce the number of buckets already created.
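For a rough feel of what that change means, here is a minimal Python sketch. It assumes the documented sizes of about 750 MB for auto and 10 GB for auto_high_volume on 64-bit systems, and it ignores every other reason a bucket can roll (restarts, time span, hot bucket limits), so the result is a lower bound and not a prediction. The 5 TB index size is a made-up figure for illustration.

```python
import math

# Approximate bucket size limits in MB on 64-bit systems (assumed from the
# indexes.conf documentation; verify against your Splunk version).
MAX_DATA_SIZE_MB = {"auto": 750, "auto_high_volume": 10 * 1024}

def min_buckets(index_size_mb, setting):
    """Lower bound on bucket count if buckets rolled on size alone."""
    return math.ceil(index_size_mb / MAX_DATA_SIZE_MB[setting])

index_size_mb = 5 * 1024 * 1024  # hypothetical 5 TB index
for setting in MAX_DATA_SIZE_MB:
    print(setting, min_buckets(index_size_mb, setting))
```

Fewer buckets per index means fewer items for the cluster master to track on every service pass.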

The cluster master's splunkd.log shows that since March 31st peers have flapped between "Up" and "Down" three times, as shown below.


04-01-2015 22:34:06.575 +0000 INFO CMPeer - peer=04290883-330C-47CB-A3FB-FF0DE1B52C2D peer_name=cdc-anivia-splunkindexer14 transitioning from=Up to=Down reason="heartbeat or restart timeout=60"
04-01-2015 22:34:14.484 +0000 WARN CMMaster - event=heartbeat guid=04290883-330C-47CB-A3FB-FF0DE1B52C2D msg='signaling Clear-Masks-And-ReAdd (received heartbeat from Down peer)'
[rbal@undiag02:/diags/case_229878/cdc-anivia-splunkmaster10/log]$ grep -i down splunkd.log.*
splunkd.log.3:04-01-2015 00:01:33.667 +0000 INFO CMPeer - peer=30C53796-A152-4A3B-9C46-A7022EAE4DED peer_name=cdc-anivia-splunkindexer2 transitioning from=Up to=Down reason="heartbeat or restart timeout=60"
splunkd.log.3:04-01-2015 00:01:36.388 +0000 WARN CMMaster - event=heartbeat guid=30C53796-A152-4A3B-9C46-A7022EAE4DED msg='signaling Clear-Masks-And-ReAdd (received heartbeat from Down peer)'
splunkd.log.5:03-31-2015 09:06:17.195 +0000 INFO CMPeer - peer=E506C091-30E4-4BAE-997A-9D01E151E259 peer_name=cdc-anivia-splunkindexer10 transitioning from=Up to=Down reason="heartbeat or restart timeout=60"
splunkd.log.5:03-31-2015 09:06:20.012 +0000 WARN CMMaster - event=heartbeat guid=E506C091-30E4-4BAE-997A-9D01E151E259 msg='signaling Clear-Masks-And-ReAdd (received heartbeat from Down peer)'


2) Based on the log messages above, this Up-to-Down transition occurred three times between “03-30-2015 18:45:17.283” and “04-01-2015 22:45:40.221”.
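If you want to do the same count on your own logs, a minimal Python sketch along these lines may help. The regular expression is an assumption based only on the CMPeer lines quoted above (it accepts both "peer_name=" and "peername="), and count_flaps is a hypothetical helper name, not part of Splunk.

```python
import re
from collections import Counter

# Matches CMPeer Up -> Down transition lines as quoted in this thread.
TRANSITION = re.compile(
    r"CMPeer - peer=(?P<guid>[0-9A-F-]+) peer_?name=(?P<name>\S+) "
    r"transitioning from=Up to=Down"
)

def count_flaps(lines):
    """Count Up -> Down transitions per peer name."""
    flaps = Counter()
    for line in lines:
        match = TRANSITION.search(line)
        if match:
            flaps[match.group("name")] += 1
    return flaps

sample = [
    '04-01-2015 22:34:06.575 +0000 INFO CMPeer - peer=04290883-330C-47CB-A3FB-FF0DE1B52C2D peer_name=cdc-anivia-splunkindexer14 transitioning from=Up to=Down reason="heartbeat or restart timeout=60"',
    "04-01-2015 22:34:14.484 +0000 WARN CMMaster - event=heartbeat guid=04290883-330C-47CB-A3FB-FF0DE1B52C2D msg='signaling Clear-Masks-And-ReAdd (received heartbeat from Down peer)'",
    'splunkd.log.3:04-01-2015 00:01:33.667 +0000 INFO CMPeer - peer=30C53796-A152-4A3B-9C46-A7022EAE4DED peer_name=cdc-anivia-splunkindexer2 transitioning from=Up to=Down reason="heartbeat or restart timeout=60"',
    'splunkd.log.5:03-31-2015 09:06:17.195 +0000 INFO CMPeer - peer=E506C091-30E4-4BAE-997A-9D01E151E259 peer_name=cdc-anivia-splunkindexer10 transitioning from=Up to=Down reason="heartbeat or restart timeout=60"',
]

for name, count in sorted(count_flaps(sample).items()):
    print(name, count)
```

Each of the three indexers shows up once here, which matches the "one or two indexers at a time" pattern described in the question more than a single bad host.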

3) Below are the current settings for the Cluster Master:

----------Cluster Master----------

/etc/system/local/server.conf 
[clustering] 
 heartbeat_timeout = 60
 max_peer_build_load = 5 

4) This message can appear when a cluster peer misses its heartbeat. That may happen if the cluster master or the peer is too busy, which is likely when peers have too many buckets.

5) On the Cluster Master, change heartbeat_timeout from 300 (your current setting; the default is 60s) to 600s. This master-side change can be made without a restart via the CLI:

./bin/splunk edit cluster-config -mode master -heartbeat_timeout 600

6) Increase service_interval from 1 sec to 4-5 sec. This requires a restart of the Cluster Master:

- server.conf 
[clustering] 
service_interval = 5 

7) Increase rotatePeriodInSecs on the cluster peers. Change it in indexes.conf as a global setting; this requires a bundle push. The default is 60; 300 could be reasonable for a very busy environment.

indexes.conf (on the indexers)
rotatePeriodInSecs = 180

8) On the peers, set the following configuration; this also requires a bundle push. The default value of heartbeat_period is 1s.
server.conf
[clustering]
heartbeat_period = 10

9) In addition, the ulimit was increased.
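To check a server.conf against the values from steps 5, 6 and 8, something like the following Python sketch could be used. It is only an illustration: read_stanza and pending_changes are hypothetical helpers, the parser handles just simple "key = value" lines, and reading a single file ignores Splunk's configuration layering, so "splunk btool server list clustering" remains the authoritative view of the effective settings.

```python
# Values recommended in steps 5, 6 and 8 above (server.conf, [clustering]).
RECOMMENDED = {
    "master": {"heartbeat_timeout": "600", "service_interval": "5"},
    "peer": {"heartbeat_period": "10"},
}

def read_stanza(conf_text, stanza):
    """Return the key/value pairs of one stanza from .conf-style text."""
    settings, current = {}, None
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
        elif current == stanza and "=" in line:
            key, value = line.split("=", 1)
            settings[key.strip()] = value.strip()
    return settings

def pending_changes(conf_text, role):
    """Map each setting that differs to (current value, recommended value)."""
    current = read_stanza(conf_text, "clustering")
    return {
        key: (current.get(key), wanted)
        for key, wanted in RECOMMENDED[role].items()
        if current.get(key) != wanted
    }

# The Cluster Master settings quoted in step 3.
master_conf = """
[clustering]
 heartbeat_timeout = 60
 max_peer_build_load = 5
"""
print(pending_changes(master_conf, "master"))
```

A value of None in the output means the setting is absent from that file and Splunk is using its default or a value from another configuration layer.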

Re: Cluster Master is unable to Meet Search Factor and Replication Factor, cluster Peer status is flapping Up and Down.

Use the searches below to see the Cluster Master service queue progress. Run them on the Cluster Master, and remember to change the earliest and latest times to cover the period you are trying to review.

index=_internal earliest=12/10/2015:08:00:00 latest=12/11/2015:16:30:00 source=*metrics.log name=cmmaster_executor
OR ( subtask_counts cmmasterendpoints clustermasterpeerscreate )
| timechart span=5m max(current_size) max(jobs_added) max(jobs_finished) count(eval(clustermasterpeerscreate=1)) AS PeerReAdd

index=_internal source=*metrics.log name=cmmaster_service group=subtask_counts
| timechart span=5m max(to_fix_*) as To_Fix_*
| fields - To_Fix_total

Second

index=_internal earliest=12/10/2015:08:00:00 latest=12/11/2015:16:30:00 source=*metrics.log name=cmmaster_service group=subtask_counts
| timechart span=5m max(to_fix_*) as To_Fix_*
| fields - To_Fix_total
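Since the earliest and latest values have to be edited for every review window, a small Python helper can generate them in the MM/DD/YYYY:HH:MM:SS form used in the searches above. This is just a convenience sketch; time_bounds is a hypothetical name, and it assumes the times are already in the time zone the search will run in.

```python
from datetime import datetime

# Time format used by the earliest/latest values in the searches above.
SPLUNK_TIME_FORMAT = "%m/%d/%Y:%H:%M:%S"

def time_bounds(start, end):
    """Build the earliest=/latest= fragment for a search string."""
    return (
        f"earliest={start.strftime(SPLUNK_TIME_FORMAT)} "
        f"latest={end.strftime(SPLUNK_TIME_FORMAT)}"
    )

print(time_bounds(datetime(2015, 12, 10, 8, 0, 0), datetime(2015, 12, 11, 16, 30, 0)))
```

Paste the resulting fragment into the search right after the index term.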
