At oh:dark:30 yesterday, one of my search heads took a dirt nap: one member of a four-member cluster.
Thread - ReplicationDataReceiverThread: about to throw a ThreadException: pthread_create: Resource temporarily unavailable
I'm still looking into what to do about that (the limits look good, and it's a 16 CPU, 32 GB server).
However, more concerning is that this crash caused a whole mess of scheduled jobs to be missed. There is no record of them in the _internal or _audit logs; they were just completely, silently skipped.
Splunk 7.0.4
EDIT: Per the first response below, yes it was the captain. I still wouldn't expect it to lose scheduled events however. Maybe I was too optimistic.
Hi,
Is this search head the "captain"? If not, then scheduled searches assigned to the dead member will be carried out by another member or by the captain itself. Since you lost only 1 member of a 4-member cluster, scheduled searches should not be affected. HTH!
Per docs,
Role of the captain
The captain is a cluster member and in that capacity it performs the search activities typical of any cluster member, servicing both ad hoc and scheduled searches. If necessary, you can limit the captain's search activities so that it performs only ad hoc searches and not scheduled searches. See Configure the captain to run ad hoc searches only.
The captain also coordinates activities among all cluster members. Its responsibilities include:
Scheduling jobs. It assigns jobs to members, including itself, based on relative current loads.
Coordinating alerts and alert suppressions across the cluster. The captain tracks each alert but the member running an initiating search fires it.
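If you want to confirm which member currently holds captaincy, a quick check like this should work (it uses the shcluster/captain/info REST endpoint; the field names below are from memory, so adjust if they differ on your version):
| rest /services/shcluster/captain/info splunk_server=local
| table label elected_captain maintenance_mode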
True, it was the captain (verified with index=_internal sourcetype=splunkd shc_captain instance_roles log_level=INFO), but that's still pretty concerning. One of the benefits of clustering is resiliency. If a server goes down and we (effectively) lose data, I'd say it's failing the resiliency test. I guess I'll need to put together a search that shows which scheduled events were silently punted.
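As a first pass, I'll probably start with a sketch like this over the outage window (the time range is a placeholder; the field names are the ones the scheduler log normally carries):
index=_internal sourcetype=scheduler status!="success" earliest=-24h
| stats count by savedsearch_name, status, reason
| sort -count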
Yes. As soon as the captain goes down, the remaining members should start the election process. The new captain needs votes from a majority of members in order to win the election. Run the searches below to get some more intel.
I would say start your investigation on the former captain. Look at $SPLUNK_HOME/var/log/splunk/splunkd_stderr.log to see why Splunk crashed.
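If that file is also being indexed into _internal (it normally is, via the default monitor of $SPLUNK_HOME/var/log/splunk), you can search it from a surviving member as well. The sourcetype name below is an assumption, so verify what your deployment actually assigns, and replace the host placeholder with the former captain:
index=_internal host=<former_captain> sourcetype=splunkd_stderr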
To see whether the voting process initiated and completed, run: index=_internal sourcetype=splunkd SHCRaftConsensus *vote*
If you come across the message below in the results of the above query, then start investigating the server that did not vote (this generally happens when there is a corrupted raft file on that search head).
10-21-2018 04:16:32.558 -0400 INFO SHCRaftConsensus - requestVote done to http://SH:8089 was not OK. Will backoff for certain time
To see if any member has a corrupted raft state: index=_internal sourcetype=splunkd ERROR SHCRaftConsensus corrupted
Example output of the above query:
10-21-2018 03:27:13.711 -0400 ERROR SHCRaftConsensus - failed appendEntriesRequest err: uri=http://SH:8089/services/shcluster/member/consensus/pseudoid/raft_append_entries?output_mode=json, error=500 - \n In handler 'shclustermemberconsensus': Search head clustering: Search head cluster member has a corrupted raft state. to http://SH:8089
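To pin down which member is reporting the corruption, the same events can be summarized per host; this is just a convenience variation on the search above:
index=_internal sourcetype=splunkd ERROR SHCRaftConsensus corrupted
| stats count, latest(_time) as last_seen by host
| eval last_seen=strftime(last_seen, "%F %T")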
Negative, Goatrider. The corrupted-raft search came up empty. I knew the SHC captain was the scheduler, but I also thought there was some recovery in place for when the captain dies unexpectedly. To a) drop dozens of searches on the floor and b) provide no indication of it... hurts. A lot.
SH3 (the SHC captain) went down at about 5:58 due to a reboot (NOC monkey). The raft vote completed and landed on a de-preferred member, SH8, at 5:59:32. By 6:00:25, captaincy had moved to a preferred captain, SH2. When SH3 came back up at about 6:00:00, it had bad limits, which caused the pthread error above, and it died at about 6:30 am.
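A variation on the verification search above helps lay out that captaincy timeline (assuming only the member currently holding captaincy logs shc_captain among its instance_roles):
index=_internal sourcetype=splunkd shc_captain instance_roles log_level=INFO
| stats earliest(_time) as first_seen, latest(_time) as last_seen by host
| eval first_seen=strftime(first_seen, "%F %T"), last_seen=strftime(last_seen, "%F %T")
| sort first_seen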
So by 6:00:25 the cluster had a captain again (SH2), which should have been taking care of scheduled jobs.
Can you run the search below to get details of the skipped searches (it's from an accepted answer)?
index="_internal" sourcetype="scheduler"
| eval scheduled=strftime(scheduled_time, "%Y-%m-%d %H:%M:%S")
| stats values(scheduled) as scheduled
values(savedsearch_name) as search_name
values(status) as status
values(reason) as reason
values(run_time) as run_time
values(dm_node) as dm_node
values(sid) as sid
by _time, savedsearch_name
| sort -scheduled
| table scheduled, search_name, status, reason, run_time
Bueno search. But I've been down that road, and there were only 4 results listed as skipped or delegated_remote_error. Sadly, the Big Deal(tm) searches that were missed are not in the scheduler log at all: not skipped, not errored, not anything.
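If nothing else, I can at least enumerate what should have run and compare it against what the scheduler actually logged. Roughly something like this sketch (the REST fields are as I understand them, the join is a rough draft, and anything whose schedule simply didn't fall inside the window will also show up as missing):
| rest /servicesNS/-/-/saved/searches splunk_server=local
| search is_scheduled=1 disabled=0
| fields title, cron_schedule
| join type=left title
    [ search index=_internal sourcetype=scheduler earliest=-24h
      | stats count as scheduler_entries by savedsearch_name
      | rename savedsearch_name as title ]
| fillnull value=0 scheduler_entries
| where scheduler_entries=0
| table title, cron_schedule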