We lost one node out of a three-node search head cluster, so we went to static captaincy.
Somewhere along the line, scheduled searches appear to have stopped working. Restarting one of the search heads usually got things going again, but right now the shcluster is a mess.
The one thing that always seems to accompany trouble with this cluster is mgmt_uri showing up as '?' in 'show shcluster-status'.
Right now I have transferred static captaincy to node adculsplunkp6. A 'show shcluster-status' there shows:
 Captain:
    dynamic_captain : 0
    elected_captain : Thu Jan 7 09:52:39 2016
    id : F0214F20-327E-4591-ACC7-A03929CF829F
    initialized_flag : 1
    label : adculsplunkp6
    maintenance_mode : 0
    mgmt_uri : ?
    min_peers_joined_flag : 1
    rolling_restart_flag : 0
    service_ready_flag : 1

 Members:
    adculsplunkp6
        label : adculsplunkp6
        mgmt_uri : ?
        mgmt_uri_alias : https://xx.xx.xx.xxx:8089
        status : Up
    adculsplunkp2
        label : adculsplunkp2
        mgmt_uri : ?
        mgmt_uri_alias : https://xx.xx.xx.xx:8089
        status : Up
On the other (non-captain) node, it still shows a different captain and no members:
 Captain:
    dynamic_captain : 0
    elected_captain : Thu Jan 7 10:01:16 2016
    id : F0214F20-327E-4591-ACC7-A03929CF829F
    initialized_flag : 1
    label : adculsplunkp2
    maintenance_mode : 0
    mgmt_uri : ?
    min_peers_joined_flag : 1
    rolling_restart_flag : 0
    service_ready_flag : 1

 Members:
How do I get the correct mgmt_uris in there so things start behaving again?
You don't really need to go to static captaincy if you still have 2 of the 3 nodes available. Were you doing it as a preventative measure in case you lost another node? If so, it would only cover you if you lost the non-captain.
Did you configure both remaining nodes to use the same static captain? Did you use fully qualified domain names?
You can go back to dynamic captaincy by bootstrapping one of the members (preferably the old static captain), then convert the others.
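For reference, here is a rough sketch of the commands involved. The URIs are hypothetical placeholders, and the order shown (switch each member back to dynamic election, former static captain last, then bootstrap) is the one I recall from the Splunk documentation, so please check the docs for your version before relying on it. The script only prints the commands by default, since each one has to be run on the appropriate member.

```shell
#!/bin/sh
# Dry run by default: RUN=echo prints each command instead of executing it.
# Export RUN="" to really execute (on the member the command belongs to).
RUN="${RUN-echo}"

# Hypothetical placeholder URIs; substitute your members' management URIs.
CAPTAIN_URI="https://sh1.example.com:8089"
MEMBER_URI="https://sh2.example.com:8089"

# Switch one member back to dynamic captain election.
# Run on that member, passing its own management URI.
to_dynamic() {
    $RUN splunk edit shcluster-config -election true -mgmt_uri "$1"
}

# Bootstrap a dynamic captain. Run once, on the member that should
# become captain, passing the comma-separated list of all members.
bootstrap_captain() {
    $RUN splunk bootstrap shcluster-captain -servers_list "$1"
}

to_dynamic "$MEMBER_URI"
to_dynamic "$CAPTAIN_URI"
bootstrap_captain "$CAPTAIN_URI,$MEMBER_URI"
```

You will be prompted for credentials when the commands run for real.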
One last thing: are all your saved searches failing, or only some? If it is only some, it could be because you have fewer cores available to process searches, which would decrease the number of searches you can run concurrently.
I used the same static captain on both, and specified IP addresses (not my choice; the guy who set up the cluster did it that way).
I am going back to static because whenever I call Splunk support with my 2-node cluster, they tell me that I am in an unsupported configuration. That third node is gone, and they don't support 2-node clusters.
The saved searches issue was caused by a 6.3.0 bug. Apparently they started tracking the number of running searches across the cluster and had a bug in it, so eventually the cluster figured everyone was over quota and stopped scheduling searches. Details at https://answers.splunk.com/answers/329518/why-do-scheduled-searches-randomly-stop-running-in.html
The issue here is that, in the case of static captaincy, we read the mgmt_uri from memory. Hence, when the node was restarted, the value was lost and we did not read it back from disk/config. That is why "?" appears in the show shcluster-status output.
This issue has been fixed in 6.4.7 and 6.3.11, so feel free to upgrade your environment.
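Until you can upgrade, a small check like the one below may help spot the condition after a restart. It is only a sketch: has_unknown_mgmt_uri is a hypothetical helper name, and it assumes the status output uses the "mgmt_uri : ?" form shown earlier in this thread.

```shell
#!/bin/sh
# Reads status text on stdin and exits 0 if any "mgmt_uri : ?" line is present.
# mgmt_uri_alias lines do not match, because the pattern requires
# "mgmt_uri" to be followed directly by optional whitespace and a colon.
has_unknown_mgmt_uri() {
    grep -Eq '(^|[[:space:]])mgmt_uri[[:space:]]*:[[:space:]]*\?'
}

# Example use on a search head (not run here):
#   splunk show shcluster-status | has_unknown_mgmt_uri && echo "mgmt_uri is unresolved"
```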