Deployment Architecture

Cluster master is unable to meet search factor and replication factor; cluster peer status is flapping between Up and Down.

sat94541
Communicator

Our master node is still losing sight of indexer nodes. We had a similar problem that affected up to half of our cluster at a time. Currently it is affecting one or two indexers at a time: the master reports an indexer as down, but the indexer is in fact up. This flapping of the peer status continues.
In total you have 20 indexers, where each indexer has around 25K buckets, so in total you have 20 × 25,000 = 500,000 buckets.

rbal_splunk
Splunk Employee

Use the searches below to see the cluster master service queue progress. These are to be run on the cluster master; remember to change the earliest and latest times to match the window you are trying to review.

index=_internal earliest=12/10/2015:08:00:00 latest=12/11/2015:16:30:00 source=metrics.log name=cmmaster_executor
OR ( subtask_counts cmmaster_endpoints clustermasterpeers_create )
| timechart span=5m max(current_size) max(jobs_added) max(jobs_finished) count(eval(clustermasterpeers_create=1)) AS PeerReAdd

Second search:

index=_internal earliest=12/10/2015:08:00:00 latest=12/11/2015:16:30:00 source=metrics.log name=cmmaster_service group=subtask_counts
| timechart span=5m max(to_fix*) as To_Fix*
| fields - To_Fix_total


rbal_splunk
Splunk Employee

Worked with Splunk Support to resolve this issue; the following steps were recommended.

So there are a lot of buckets for the cluster master to manage. Normally, when the bucket count grows that high, the cluster master has to do a lot of processing to stay in compliance with RF and SF. Splunk works best with fewer but larger buckets.
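To gauge where the bucket count stands, one rough approach is to count warm/cold bucket directories (named db_* and rb_*) on the filesystem of each indexer. This is a sketch using a sample directory layout standing in for the real index path (which is install-specific, typically $SPLUNK_DB):

```shell
# Create a sample layout standing in for $SPLUNK_DB on an indexer
# (paths and names here are illustrative, not from this thread)
mkdir -p SPLUNK_DB/main/db/db_1447000000_1446000000_1 \
         SPLUNK_DB/main/db/db_1448000000_1447000000_2 \
         SPLUNK_DB/main/db/rb_1448000000_1447000000_3

# Count originating (db_*) and replicated (rb_*) bucket directories
find SPLUNK_DB -maxdepth 3 -type d \( -name 'db_*' -o -name 'rb_*' \) | wc -l
# prints 3 for this sample layout
```

A search such as `| dbinspect index=*` gives a more complete view from within Splunk, including hot buckets.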

Here are some of the things that were checked.

1) It was found that maxDataSize = auto was defined for the indexes. Refer to http://docs.splunk.com/Documentation/Splunk/latest/Admin/Indexesconf and change the setting to maxDataSize = auto_high_volume. This will help reduce the bucket count going forward, although it will not reduce the number of buckets already created.
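A sketch of that indexes.conf change (the index name my_index is a placeholder; apply it per high-volume index, or in the global stanza to cover all of them):

```ini
# indexes.conf -- switch a high-volume index to larger buckets.
# This reduces the number of buckets created going forward;
# existing buckets are unaffected.
[my_index]
maxDataSize = auto_high_volume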

The cluster master splunkd.log file shows that since March 31st the peers have flapped between Up and Down three times, as shown below.


04-01-2015 22:34:06.575 +0000 INFO CMPeer - peer=04290883-330C-47CB-A3FB-FF0DE1B52C2D peer_name=cdc-anivia-splunkindexer14 transitioning from=Up to=Down reason="heartbeat or restart timeout=60"
04-01-2015 22:34:14.484 +0000 WARN CMMaster - event=heartbeat guid=04290883-330C-47CB-A3FB-FF0DE1B52C2D msg='signaling Clear-Masks-And-ReAdd (received heartbeat from Down peer)'
[rbal@undiag02:/diags/case_229878/cdc-anivia-splunkmaster1_0/log]$ grep -i down splunkd.log.*
splunkd.log.3:04-01-2015 00:01:33.667 +0000 INFO CMPeer - peer=30C53796-A152-4A3B-9C46-A7022EAE4DED peer_name=cdc-anivia-splunkindexer2 transitioning from=Up to=Down reason="heartbeat or restart timeout=60"
splunkd.log.3:04-01-2015 00:01:36.388 +0000 WARN CMMaster - event=heartbeat guid=30C53796-A152-4A3B-9C46-A7022EAE4DED msg='signaling Clear-Masks-And-ReAdd (received heartbeat from Down peer)'
splunkd.log.5:03-31-2015 09:06:17.195 +0000 INFO CMPeer - peer=E506C091-30E4-4BAE-997A-9D01E151E259 peer_name=cdc-anivia-splunkindexer10 transitioning from=Up to=Down reason="heartbeat or restart timeout=60"
splunkd.log.5:03-31-2015 09:06:20.012 +0000 WARN CMMaster - event=heartbeat guid=E506C091-30E4-4BAE-997A-9D01E151E259 msg='signaling Clear-Masks-And-ReAdd (received heartbeat from Down peer)'


2) Based on the above log entries, this transition appeared three times between “03-30-2015 18:45:17.283” and “04-01-2015 22:45:40.221”.
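The grep shown above can be narrowed to count only the Up-to-Down transitions across rotated logs. A sketch using sample log lines (since the real splunkd.log files live under the install-specific $SPLUNK_HOME/var/log/splunk/):

```shell
# Sample lines standing in for the cluster master's splunkd.log
cat > splunkd.log.sample <<'EOF'
04-01-2015 00:01:33.667 +0000 INFO CMPeer - peer=30C53796 peer_name=idx2 transitioning from=Up to=Down reason="heartbeat or restart timeout=60"
04-01-2015 00:05:00.000 +0000 INFO CMPeer - peer=30C53796 peer_name=idx2 transitioning from=Down to=Up reason="heartbeat received"
03-31-2015 09:06:17.195 +0000 INFO CMPeer - peer=E506C091 peer_name=idx10 transitioning from=Up to=Down reason="heartbeat or restart timeout=60"
EOF

# Count only the Up -> Down transitions
grep -c 'transitioning from=Up to=Down' splunkd.log.sample
# prints 2 for this sample
```

On a real master, run the same grep against splunkd.log* to include the rotated files.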

3) Below are the settings for the cluster master:

----------Cluster Master----------

/etc/system/local/server.conf 
[clustering] 
 heartbeat_timeout = 60
 max_peer_build_load = 5 

4) The error messages seen here can result when a cluster peer misses its heartbeat, which can happen if the cluster master or the cluster peer is too busy; this is likely when peers have too many buckets.

5) On the cluster master, increase heartbeat_timeout (default is 60s) to 600s. This master-side change can be done without a restart via the CLI:

./bin/splunk edit cluster-config -mode master -heartbeat_timeout 600 

6) Increase service_interval from 1 second to 4-5 seconds. This will require a restart of the cluster master:

- server.conf 
[clustering] 
service_interval = 5 

7) Increase rotatePeriodInSecs for the cluster peers. This needs to be changed in indexes.conf as a global setting and will require a bundle push.

- indexes.conf (at the indexers); the default is 60, and 300 could be reasonable for a very busy environment:
rotatePeriodInSecs = 180

8) On the peers you need the following configuration; note this will also need a bundle push. The default value of heartbeat_period is 1s.
server.conf
[clustering]
heartbeat_period = 10

9) In addition, increase the ulimit (max open file descriptors) on the indexers.
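For reference, the current limit can be checked from the shell; the limits.conf lines in the comments are a common illustrative sketch (the splunk user name and the 64000 value are assumptions, not from this thread):

```shell
# Show the max open file descriptors for the current shell/user
ulimit -n

# A typical persistent change goes in /etc/security/limits.conf, e.g.:
#   splunk  soft  nofile  64000
#   splunk  hard  nofile  64000
# (user name and value are illustrative; size them for your environment)
```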
