
Search Head Cluster 7.1 presents status_line="Error connecting: Connect Timeout" socket_error="Connect Timeout" frequently


The cluster has been running 7.1.6 for months, but over the last few weeks we have been seeing more and more errors like:

08-11-2019 20:48:39.441 +0000 ERROR SHCSlave - event=SHPSlave::handleHeartbeatDone heartbeat failure (reason: failed method=POST path=/services/shcluster/captain/members/7A0DC929-2222-4AE2-B8A2-C87C47085DE6 captain=xxxxxxxxxxx:8089 rc=0 actual_response_code=502 expected_response_code=200 status_line="Read Timeout" socket_error="Read Timeout")
08-11-2019 20:48:39.441 +0000 WARN  SHCMasterHTTPProxy - Low Level http request failure err=failed method=POST path=/services/shcluster/captain/members/7A0DC929-2222-4AE2-B8A2-C87C47085DE6 captain=xxxxxxxxxxxxxxxx:8089 rc=0 actual_response_code=502 expected_response_code=200 status_line="Read Timeout" socket_error="Read Timeout"
08-11-2019 20:45:09.149 +0000 WARN  SHCMasterHTTPProxy - Low Level http request failure err=failed method=POST path=/services/shcluster/captain/members captain=xxxxxxxxxxxxxxxx:8089 rc=0 actual_response_code=502 expected_response_code=201 status_line="Write Timeout" socket_error="Write Timeout"
08-11-2019 20:45:09.063 +0000 WARN  SHCMasterHTTPProxy - Low Level http request failure err=failed method=POST path=/services/shcluster/captain/members captain=xxxxxxxxxxxxxxxx:8089 rc=0 actual_response_code=502 expected_response_code=201 status_line="Write Timeout" socket_error="Write Timeout"

This happens regardless of which member is the captain.

Running bin/splunk show shcluster-status also fails from time to time, either showing all members as down or showing:

"Failed to proxy call to member https://xxxxxxx:8089. 
Encountered some errors while trying to obtain shcluster status."

We have an 11-node search head cluster; each member is a physical server with 28 CPU threads and 128 GB of RAM, running fully patched Oracle Linux 7.6.

The splunk user has high ulimits:

splunk@xxxxxxxx:~$ cat /proc/194162/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             200000               200000               processes
Max open files            200000               200000               files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       514519               514519               signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us
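As a quick way to re-check those values on any member, the limits of a running process can be read back from /proc. A minimal sketch, which inspects the current shell's PID so it is self-contained; in practice you would substitute splunkd's PID (e.g. via pgrep -o splunkd):

```shell
# Read the soft "Max open files" limit for a process from /proc/<pid>/limits.
# $$ (the current shell) stands in for splunkd's PID here; on a member you
# would use something like: pid=$(pgrep -o splunkd)
pid=$$
awk '/^Max open files/ {print $4}' /proc/${pid}/limits
```

The fourth whitespace-separated field on the "Max open files" row is the soft limit, which is the one splunkd actually runs against.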

Other relevant info:

splunk@xxxxxxxxx:~$ bin/splunk btool limits list --debug | egrep -v /opt/splunk/etc/system/default/limits.conf
/opt/splunk/etc/apps/xxxxx_us_all_base_sh/default/limits.conf [scheduler]
/opt/splunk/etc/apps/xxxxx_us_all_base_sh/default/limits.conf max_searches_perc = 75
/opt/splunk/etc/apps/xxxxx_us_all_base_sh/default/limits.conf max_searches_perc.1 = 95
/opt/splunk/etc/apps/xxxxx_us_all_base_sh/default/limits.conf max_searches_perc.1.when = * 01-10 * * *
/opt/splunk/etc/apps/xxxxx_us_all_base_sh/default/limits.conf [search]
/opt/splunk/etc/apps/xxxxx_us_all_base_sh/default/limits.conf base_max_searches = 6
/opt/splunk/etc/apps/xxxxx_us_all_base_sh/default/limits.conf fetch_remote_search_log = disabled
/opt/splunk/etc/apps/xxxxx_us_all_base_sh/default/limits.conf max_chunk_queue_size = 100000000
/opt/splunk/etc/apps/xxxxx_us_all_base_sh/default/limits.conf max_searches_per_cpu = 2
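For context on what those limits work out to on this hardware: historical search concurrency per member is max_searches_per_cpu * CPUs + base_max_searches, and max_searches_perc carves out the scheduler's share of that. A back-of-the-envelope sketch using the values above (28 threads per member is taken from the question; the integer rounding here is illustrative):

```shell
# Back-of-the-envelope search concurrency math using the limits above.
CPUS=28              # CPU threads per member, from the question
MAX_PER_CPU=2        # max_searches_per_cpu
BASE=6               # base_max_searches
MAX_HIST=$((MAX_PER_CPU * CPUS + BASE))   # total historical search slots
SCHED=$((MAX_HIST * 75 / 100))            # max_searches_perc = 75 (default window)
SCHED_NIGHT=$((MAX_HIST * 95 / 100))      # 95% between 01:00 and 10:00
echo "$MAX_HIST $SCHED $SCHED_NIGHT"
```

So each member allows 62 concurrent historical searches, of which the scheduler may use 46 during the day and 58 overnight.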

And:

splunk@xxxxxx:~$ cat /opt/splunk/etc/apps/xxxxxxxx/default/server.conf
[clustering]
mode = searchhead
master_uri = clustermaster:ha_primary
multisite = true

[clustermaster:ha_primary]
master_uri = https://xxx.com:8089
pass4SymmKey = xxxxxxx
multisite = true

[shclustering]
scheduling_heuristic = round_robin
captain_is_adhoc_searchhead = true
executor_workers = 50
conf_replication_period = 5
conf_replication_max_pull_count = 1000
conf_replication_max_push_count = 100

[httpServer]
maxSockets = -1
maxThreads = -1

Any help or tips would be appreciated. We have a Splunk support case open, but progress has been too slow; we have been living with this issue for weeks now.

1 Solution


We were able to fix all the issues by setting these parameters in server.conf:

[shclustering]
conf_replication_period = 10
conf_replication_max_pull_count = 2000
conf_replication_max_push_count = 200
cxn_timeout_raft = 5
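For anyone applying the same fix, a minimal sketch of staging and sanity-checking the stanza. It writes to a temp file so the snippet is self-contained; on a real member the target would be server.conf under $SPLUNK_HOME/etc/system/local/ (or the relevant app), followed by a restart of the members:

```shell
# Stage the tuned [shclustering] stanza in a scratch file and count the
# settings as a sanity check; the temp file stands in for server.conf here.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
[shclustering]
conf_replication_period = 10
conf_replication_max_pull_count = 2000
conf_replication_max_push_count = 200
cxn_timeout_raft = 5
EOF
grep -c ' = ' "$tmp"   # expect 4 settings staged
# On a real member, after editing server.conf, restart the cluster members,
# e.g. with: splunk rolling-restart shcluster-members
rm -f "$tmp"
```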

