Deployment Architecture

In my Search Head Cluster, why does "show shcluster-status" show captain, but not the members' information?

scheng_splunk
Splunk Employee

After a recent bundle push from the deployer to our search head cluster (SHC) members running Splunk Enterprise 7.2.4, the SHC is in a broken state with missing member information:
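(For reference, the push was the standard deployer apply; the target URI and credentials below are illustrative assumptions rather than the exact command used here.)

    # run on the deployer, targeting one SHC member's management port
    splunk apply shcluster-bundle -target https://SH1:8089 -auth admin:<password>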

[splunk@SH1 bin]$ ./splunk show shcluster-status 

Captain: 
dynamic_captain : 1 
elected_captain : Wed Feb 20 19:02:42 2019 
id : 718F33BC-E8A5-4EDB-AFAE-279860226B84 
initialized_flag : 0 
label : SH1
mgmt_uri : https://SH1:8089 
min_peers_joined_flag : 0 
rolling_restart_flag : 0 
service_ready_flag : 0 

Members: 

[splunk@SH2 bin]$ ./splunk show shcluster-status 

Captain: 
dynamic_captain : 1 
elected_captain : Wed Feb 20 19:02:42 2019 
id : 718F33BC-E8A5-4EDB-AFAE-279860226B84 
initialized_flag : 0 
label : SH1
mgmt_uri : https://SH1:8089 
min_peers_joined_flag : 0 
rolling_restart_flag : 0 
service_ready_flag : 0 

[splunk@SH3 bin]$ ./splunk show shcluster-status 

Captain: 
dynamic_captain : 1 
elected_captain : Wed Feb 20 19:02:42 2019 
id : 718F33BC-E8A5-4EDB-AFAE-279860226B84 
initialized_flag : 0 
label : SH1 
mgmt_uri : https://SH1:8089 
min_peers_joined_flag : 0 
rolling_restart_flag : 0 
service_ready_flag : 0 

Members: 

It appears the election completed successfully, with all members voting for SH1 as captain, but the member information just couldn't get updated.

From SHC captain SH1's splunkd.log:

02-20-2019 19:02:53.796 -0600 ERROR SHCRaftConsensus - failed appendEntriesRequest err: uri=https://SH3:8089/services/shcluster/member/consensus/pseudoid/raft_append_entries?output_mode=json, socket_error=Connection refused to https://SH3:8089 
  • Tried the procedure below to clean up Raft and then bootstrap a static captain, but got the same result afterwards (see the sketch below): https://docs.splunk.com/Documentation/Splunk/7.2.4/DistSearch/Handleraftissues#Fix_the_entire_cluste...
  • Confirmed that every member has serverName properly set to its own name.

  • Confirmed there is no network issue, as each member can reach every other member's management port 8089 with the curl command below:

    curl -s -k https://hostname:8089/services/server/info

  • Also tried increasing the HTTP server socket and thread limits with the settings below and restarted Splunk on all members.

    server.conf
    [httpServer]
    maxSockets = 1000000
    maxThreads = 50000

The issue remains the same: none of the SHC members are listed under "show shcluster-status", and the SHC remains broken, with the KV store cluster not established.
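For context, the Raft cleanup and static-captain bootstrap attempted in the first bullet roughly amount to the following (a sketch based on the linked Handleraftissues documentation; SH1 as the intended captain is an assumption here, and the linked doc remains the authoritative procedure):

    # Run on every SHC member to clear stale Raft state
    $SPLUNK_HOME/bin/splunk stop
    $SPLUNK_HOME/bin/splunk clean raft
    $SPLUNK_HOME/bin/splunk start

    # Then designate a static captain, e.g. SH1 in this environment:
    # on SH1:
    $SPLUNK_HOME/bin/splunk edit shcluster-config -mode captain -captain_uri https://SH1:8089 -election false
    # on SH2 and SH3:
    $SPLUNK_HOME/bin/splunk edit shcluster-config -mode member -captain_uri https://SH1:8089 -election false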

1 Solution

scheng_splunk
Splunk Employee

This issue is most likely because the dispatch directory on each of the SHC members grew very large after the large bundle push, leading to a large payload and causing the SHC to fail to add the SH members.

You may check splunkd_access.log to look for any 413 PAYLOAD TOO LARGE error:

e.g.
x.x.x.x - - [20/Feb/2019:19:32:29.471 -0600] "POST /services/shcluster/captain/members HTTP/1.1" 413 180 - - - 0ms
x.x.x.x - - [20/Feb/2019:19:32:25.024 -0600] "POST /services/shcluster/captain/members HTTP/1.1" 413 180 - - - 0ms
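A quick way to spot these on a member is to grep the members endpoint for 413 responses (a sketch, assuming a default $SPLUNK_HOME of /opt/splunk):

    grep ' 413 ' /opt/splunk/var/log/splunk/splunkd_access.log | grep 'shcluster/captain/members'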

Reference:
https://httpstatuses.com/413

The root cause is that the bundle is too large and hits the 2 GB default limit of max_content_length.
To resolve it, you may set the following in server.conf on all the SH members (21474836480 bytes = 20 GB) and restart Splunk to apply the setting:

[httpServer]
max_content_length=21474836480 
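To confirm each member picked up the new value after the restart, you can check with btool (a sketch, assuming $SPLUNK_HOME is set in your shell):

    $SPLUNK_HOME/bin/splunk btool server list httpServer --debug | grep max_content_length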

Reference:

max_content_length =
* Maximum content length, in bytes.
* HTTP requests over the size specified are rejected.
* This setting exists to avoid allocating an unreasonable amount of memory from web requests.
* In environments where indexers have enormous amounts of RAM, this number can be reasonably increased to handle large quantities of bundle data.
* Default: 2147483648 (2GB)
https://docs.splunk.com/Documentation/Splunk/latest/Admin/Serverconf#Splunkd_HTTP_server_configurati...
