The cluster has been running on 7.1.6 for months, but in the last few weeks we have been seeing more and more errors like:
08-11-2019 20:48:39.441 +0000 ERROR SHCSlave - event=SHPSlave::handleHeartbeatDone heartbeat failure (reason: failed method=POST path=/services/shcluster/captain/members/7A0DC929-2222-4AE2-B8A2-C87C47085DE6 captain=xxxxxxxxxxx:8089 rc=0 actual_response_code=502 expected_response_code=200 status_line="Read Timeout" socket_error="Read Timeout")
08-11-2019 20:48:39.441 +0000 WARN SHCMasterHTTPProxy - Low Level http request failure err=failed method=POST path=/services/shcluster/captain/members/7A0DC929-2222-4AE2-B8A2-C87C47085DE6 captain=xxxxxxxxxxxxxxxx:8089 rc=0 actual_response_code=502 expected_response_code=200 status_line="Read Timeout" socket_error="Read Timeout"
08-11-2019 20:45:09.149 +0000 WARN SHCMasterHTTPProxy - Low Level http request failure err=failed method=POST path=/services/shcluster/captain/members captain=xxxxxxxxxxxxxxxx:8089 rc=0 actual_response_code=502 expected_response_code=201 status_line="Write Timeout" socket_error="Write Timeout"
08-11-2019 20:45:09.063 +0000 WARN SHCMasterHTTPProxy - Low Level http request failure err=failed method=POST path=/services/shcluster/captain/members captain=xxxxxxxxxxxxxxxx:8089 rc=0 actual_response_code=502 expected_response_code=201 status_line="Write Timeout" socket_error="Write Timeout"
This happens regardless of which member is the captain.
bin/splunk show shcluster-status also fails from time to time, either showing all members as down or returning:
"Failed to proxy call to member https://xxxxxxx:8089.
Encountered some errors while trying to obtain shcluster status."
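When the proxy call fails like this, querying each member's local REST endpoints directly can help isolate which node is actually unreachable. A quick sketch (hostname and credentials are placeholders; member/info works on any member, captain/info is answered by the current captain):
splunk@xxxxxx:~$ curl -ks -u admin:changeme https://localhost:8089/services/shcluster/member/info
splunk@xxxxxx:~$ curl -ks -u admin:changeme https://localhost:8089/services/shcluster/captain/info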
We have an 11-node search head cluster, each member with 28 threads and 128 GB of RAM, running on fully patched Oracle Linux 7.6, all physical servers.
We have high limits set for the splunk user:
splunk@xxxxxxxx:~$ cat /proc/194162/limits
Limit                     Soft Limit   Hard Limit   Units
Max cpu time              unlimited    unlimited    seconds
Max file size             unlimited    unlimited    bytes
Max data size             unlimited    unlimited    bytes
Max stack size            8388608      unlimited    bytes
Max core file size        0            unlimited    bytes
Max resident set          unlimited    unlimited    bytes
Max processes             200000       200000       processes
Max open files            200000       200000       files
Max locked memory         65536        65536        bytes
Max address space         unlimited    unlimited    bytes
Max file locks            unlimited    unlimited    locks
Max pending signals       514519       514519       signals
Max msgqueue size         819200       819200       bytes
Max nice priority         0            0
Max realtime priority     0            0
Max realtime timeout      unlimited    unlimited    us
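(For reference, limits like these are typically set via a limits.d drop-in such as the one below; the file name and values are illustrative, and if splunkd is started by systemd, the unit file's LimitNOFILE/LimitNPROC settings take precedence instead.)
# /etc/security/limits.d/99-splunk.conf (illustrative)
splunk soft nofile 200000
splunk hard nofile 200000
splunk soft nproc  200000
splunk hard nproc  200000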
Other relevant info:
splunk@xxxxxxxxx:~$ bin/splunk btool limits list --debug | egrep -v /opt/splunk/etc/system/default/limits.conf
/opt/splunk/etc/apps/xxxxx_us_all_base_sh/default/limits.conf [scheduler]
/opt/splunk/etc/apps/xxxxx_us_all_base_sh/default/limits.conf max_searches_perc = 75
/opt/splunk/etc/apps/xxxxx_us_all_base_sh/default/limits.conf max_searches_perc.1 = 95
/opt/splunk/etc/apps/xxxxx_us_all_base_sh/default/limits.conf max_searches_perc.1.when = * 01-10 * * *
/opt/splunk/etc/apps/xxxxx_us_all_base_sh/default/limits.conf [search]
/opt/splunk/etc/apps/xxxxx_us_all_base_sh/default/limits.conf base_max_searches = 6
/opt/splunk/etc/apps/xxxxx_us_all_base_sh/default/limits.conf fetch_remote_search_log = disabled
/opt/splunk/etc/apps/xxxxx_us_all_base_sh/default/limits.conf max_chunk_queue_size = 100000000
/opt/splunk/etc/apps/xxxxx_us_all_base_sh/default/limits.conf max_searches_per_cpu = 2
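Equivalently, in limits.conf form, the non-default settings are:
[scheduler]
max_searches_perc = 75
max_searches_perc.1 = 95
max_searches_perc.1.when = * 01-10 * * *

[search]
base_max_searches = 6
fetch_remote_search_log = disabled
max_chunk_queue_size = 100000000
max_searches_per_cpu = 2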
And:
splunk@xxxxxx:~$ cat /opt/splunk/etc/apps/xxxxxxxx/default/server.conf
[clustering]
mode = searchhead
master_uri = clustermaster:ha_primary
multisite = true
[clustermaster:ha_primary]
master_uri = https://xxx.com:8089
pass4SymmKey = xxxxxxx
multisite = true
[shclustering]
scheduling_heuristic = round_robin
captain_is_adhoc_searchhead = true
executor_workers = 50
conf_replication_period = 5
conf_replication_max_pull_count = 1000
conf_replication_max_push_count = 100
[httpServer]
maxSockets = -1
maxThreads = -1
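(To confirm what each member actually resolves for these stanzas, the same btool check works against server.conf, e.g.:)
splunk@xxxxxx:~$ bin/splunk btool server list shclustering --debug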
Appreciate any help or tips. A Splunk support case is open, but it is moving too slowly; we have been living with this issue for weeks now.
We were able to fix all issues by placing these parameters in server.conf:
[shclustering]
conf_replication_period = 10
conf_replication_max_pull_count = 2000
conf_replication_max_push_count = 200
cxn_timeout_raft = 5
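For anyone applying the same change: [shclustering] settings live in server.conf on each member (the deployer does not push server.conf), so a rollout sketch might look like this, assuming default paths (adjust for your install):
# $SPLUNK_HOME/etc/system/local/server.conf on each member
[shclustering]
conf_replication_period = 10
conf_replication_max_pull_count = 2000
conf_replication_max_push_count = 200
cxn_timeout_raft = 5

# then restart members one at a time, or trigger a rolling restart from any member:
splunk@xxxxxx:~$ bin/splunk rolling-restart shcluster-members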