Deployment Architecture

In my Search Head Cluster, why does "show shcluster-status" show captain, but not the members' information?

scheng_splunk
Splunk Employee

After a recent bundle push from deployer to our search head cluster (SHC) members running Splunk Enterprise version 7.2.4, SHC is in a broken state with missing member information:

[splunk@SH1 bin]$ ./splunk show shcluster-status 

Captain: 
dynamic_captain : 1 
elected_captain : Wed Feb 20 19:02:42 2019 
id : 718F33BC-E8A5-4EDB-AFAE-279860226B84 
initialized_flag : 0 
label : SH1
mgmt_uri : https://SH1:8089 
min_peers_joined_flag : 0 
rolling_restart_flag : 0 
service_ready_flag : 0 

Members: 

[splunk@SH2 bin]$ ./splunk show shcluster-status 

Captain: 
dynamic_captain : 1 
elected_captain : Wed Feb 20 19:02:42 2019 
id : 718F33BC-E8A5-4EDB-AFAE-27986022 
initialized_flag : 0 
label : SH1
mgmt_uri : https://SH1:8089 
min_peers_joined_flag : 0 
rolling_restart_flag : 0 
service_ready_flag : 0 

[splunk@SH3 bin]$ ./splunk show shcluster-status 

Captain: 
dynamic_captain : 1 
elected_captain : Wed Feb 20 19:02:42 2019 
id : 718F33BC-E8A5-4EDB-AFAE-279860226B84 
initialized_flag : 0 
label : SH1 
mgmt_uri : https://SH1:8089 
min_peers_joined_flag : 0 
rolling_restart_flag : 0 
service_ready_flag : 0 

Members: 

It appears the election completed successfully, with all members voting SH1 as captain, but the member information could not be updated.

From SHC captain SH1's splunkd.log:

02-20-2019 19:02:53.796 -0600 ERROR SHCRaftConsensus - failed appendEntriesRequest err: uri=https://SH3:8089/services/shcluster/member/consensus/pseudoid/raft_append_entries?output_mode=json, socket_error=Connection refused to https://SH3:8089 
  • Tried the following procedure to clean up RAFT and then bootstrap a static captain, but got the same result: https://docs.splunk.com/Documentation/Splunk/7.2.4/DistSearch/Handleraftissues#Fix_the_entire_cluste...
  • Confirmed each member's serverName is set correctly to its own hostname.

  • Confirmed there is no network issue: each member can reach every other member's management port (8089) with the following curl command:

    curl -s -k https://hostname:8089/services/server/info

  • Also tried increasing the HTTP server's socket and thread limits with the settings below, then restarted Splunk on all members:

    server.conf
    [httpServer]
    maxSockets = 1000000
    maxThreads = 50000

    The issue remains the same: no SHC members are listed under "show shcluster-status", the SHC is still broken, and the KV store cluster cannot be established.
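The per-member reachability check above can be scripted so it runs against every member in one pass. A minimal sketch, assuming the hostnames SH1-SH3 stand in for your own SHC members:

```shell
# Loop over the SHC members (hypothetical hostnames; replace with your own)
# and probe each management port with curl, as in the manual check above.
for host in SH1 SH2 SH3; do
  if curl -s -k --max-time 5 "https://${host}:8089/services/server/info" > /dev/null 2>&1; then
    echo "${host}: mgmt port 8089 reachable"
  else
    echo "${host}: mgmt port 8089 UNREACHABLE"
  fi
done
```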

1 Solution

scheng_splunk
Splunk Employee

This issue is most likely caused by a very large dispatch directory on each SHC member: the large bundle push produces an oversized payload, so the SHC fails to add the members.

You can check splunkd_access.log for 413 (Payload Too Large) errors:

e.g.
x.x.x.x - - [20/Feb/2019:19:32:29.471 -0600] "POST /services/shcluster/captain/members HTTP/1.1" 413 180 - - - 0ms
x.x.x.x - - [20/Feb/2019:19:32:25.024 -0600] "POST /services/shcluster/captain/members HTTP/1.1" 413 180 - - - 0ms
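A quick way to surface entries like the ones above is to grep the access log for 413 responses against the member-add endpoint. A minimal sketch, assuming $SPLUNK_HOME points at your Splunk installation:

```shell
# List HTTP 413 responses on the SHC captain's member-add endpoint.
grep ' 413 ' "$SPLUNK_HOME/var/log/splunk/splunkd_access.log" \
  | grep 'shcluster/captain/members'
```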

Reference:
https://httpstatuses.com/413

The root cause is that the bundle is too large and hits the 2 GB default limit of max_content_length.
To resolve it, set the following in server.conf on all SH members (here raising the limit to 20 GB) and restart Splunk to apply the setting:

[httpServer]
max_content_length=21474836480 
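Since the oversized payload is attributed to the members' dispatch directories, it can also help to measure how large they have grown before and after cleanup. A minimal check, assuming the default dispatch location under $SPLUNK_HOME:

```shell
# Report the total on-disk size of this member's search dispatch directory.
du -sh "$SPLUNK_HOME/var/run/splunk/dispatch"
```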

Reference:

max_content_length = <integer>
* Maximum content length, in bytes.
* HTTP requests over the size specified are rejected.
* This setting exists to avoid allocating an unreasonable amount of memory from web requests.
* In environments where indexers have enormous amounts of RAM, this number can be reasonably increased to handle large quantities of bundle data.
* Default: 2147483648 (2GB)
https://docs.splunk.com/Documentation/Splunk/latest/Admin/Serverconf#Splunkd_HTTP_server_configurati...


