After a recent bundle push from the deployer to our search head cluster (SHC) members running Splunk Enterprise 7.2.4, the SHC is in a broken state with missing member information:
[splunk@SH1 bin]$ ./splunk show shcluster-status
Captain:
dynamic_captain : 1
elected_captain : Wed Feb 20 19:02:42 2019
id : 718F33BC-E8A5-4EDB-AFAE-279860226B84
initialized_flag : 0
label : SH1
mgmt_uri : https://SH1:8089
min_peers_joined_flag : 0
rolling_restart_flag : 0
service_ready_flag : 0
Members:
[splunk@SH2 bin]$ ./splunk show shcluster-status
Captain:
dynamic_captain : 1
elected_captain : Wed Feb 20 19:02:42 2019
id : 718F33BC-E8A5-4EDB-AFAE-279860226B84
initialized_flag : 0
label : SH1
mgmt_uri : https://SH1:8089
min_peers_joined_flag : 0
rolling_restart_flag : 0
service_ready_flag : 0
[splunk@SH3 bin]$ ./splunk show shcluster-status
Captain:
dynamic_captain : 1
elected_captain : Wed Feb 20 19:02:42 2019
id : 718F33BC-E8A5-4EDB-AFAE-279860226B84
initialized_flag : 0
label : SH1
mgmt_uri : https://SH1:8089
min_peers_joined_flag : 0
rolling_restart_flag : 0
service_ready_flag : 0
Members:
It appears the captain election completed successfully, with all members voting for SH1 as captain, but the member information never gets populated.
From SHC captain SH1's splunkd.log:
02-20-2019 19:02:53.796 -0600 ERROR SHCRaftConsensus - failed appendEntriesRequest err: uri=https://SH3:8089/services/shcluster/member/consensus/pseudoid/raft_append_entries?output_mode=json, socket_error=Connection refused to https://SH3:8089
Confirmed that each member has its serverName correctly set to its own host name.
Confirmed there is no network issue, as each member can reach every other member's management port 8089 using the curl command below:
curl -s -k https://hostname:8089/services/server/info
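For example, the same check can be run from every member against each of its peers with a small loop (hostnames SH1, SH2 and SH3 are placeholders matching the members above; substitute your own). Any HTTP status in the response, even a 401 without credentials, shows the port is reachable, while "Connection refused" would reproduce the splunkd.log error shown further down:
for h in SH1 SH2 SH3; do
  echo -n "$h: "
  curl -s -k -o /dev/null -w "%{http_code}\n" "https://$h:8089/services/server/info"
done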
Also tried increasing the HTTP server socket and thread limits with the settings below and restarted Splunk on all members:
server.conf
[httpServer]
maxSockets = 1000000
maxThreads = 50000
The issue remains the same: no SHC members are listed under "show shcluster-status", the SHC stays broken, and the KV store cluster is never established.
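For reference, the KV store cluster state on each member can be checked with the standard CLI command (shown here on SH1):
[splunk@SH1 bin]$ ./splunk show kvstore-status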
This issue is most likely caused by the dispatch directory on each SHC member growing very large after the bundle push, which results in an oversized payload and causes the captain to fail to add the SH members.
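One way to gauge this is to check the size of the dispatch directory on each member. Assuming a default installation layout, it lives under $SPLUNK_HOME/var/run/splunk/dispatch:
du -sh $SPLUNK_HOME/var/run/splunk/dispatch
A result in the multi-gigabyte range on each member is consistent with the oversized payload described below.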
You may also check splunkd_access.log for any 413 PAYLOAD TOO LARGE errors, e.g.:
x.x.x.x - - [20/Feb/2019:19:32:29.471 -0600] "POST /services/shcluster/captain/members HTTP/1.1" 413 180 - - - 0ms
x.x.x.x - - [20/Feb/2019:19:32:25.024 -0600] "POST /services/shcluster/captain/members HTTP/1.1" 413 180 - - - 0ms
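Assuming the default log location under $SPLUNK_HOME/var/log/splunk, a quick way to find such entries is:
grep '" 413 ' $SPLUNK_HOME/var/log/splunk/splunkd_access.log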
Reference:
https://httpstatuses.com/413
The root cause is that the bundle is too large and hits the 2 GB default limit of max_content_length.
To resolve it, set the following in server.conf on all SH members and restart Splunk to apply the setting:
[httpServer]
max_content_length=21474836480
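After editing, the effective value can be verified with btool (to make sure no other configuration layer overrides it) before restarting each member:
$SPLUNK_HOME/bin/splunk btool server list httpServer --debug | grep max_content_length
$SPLUNK_HOME/bin/splunk restart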
Reference:
max_content_length =
* Maximum content length, in bytes.
* HTTP requests over the size specified are rejected.
* This setting exists to avoid allocating an unreasonable amount of memory from web requests.
* In environments where indexers have enormous amounts of RAM, this number can be reasonably increased to handle large quantities of bundle data.
* Default: 2147483648 (2GB)
https://docs.splunk.com/Documentation/Splunk/latest/Admin/Serverconf#Splunkd_HTTP_server_configurati...