Is a dead SHC member expected to cause a batch of missed scheduled searches?

twinspop · ‎10-20-2018

At oh:dark:30 yesterday one of my search heads took a dirt nap. 1 in a 4 member cluster.

Thread - ReplicationDataReceiverThread: about to throw a ThreadException: pthread_create: Resource temporarily unavailable

I'm still looking into what to do with that (limits look good, 16 CPU, 32 GB server).
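On Linux, one common cause of pthread_create failing with "Resource temporarily unavailable" (EAGAIN) is the per-user process/thread limit (RLIMIT_NPROC, i.e. ulimit -u). The limits that matter are the ones the running splunkd process inherited, which may differ from what a login shell reports. Below is a minimal sketch for reading them, assuming the usual /proc/<pid>/limits column layout; the function name and the sample values are made up for illustration.

```python
import re

def parse_proc_limits(text):
    """Parse the text of /proc/<pid>/limits into {name: (soft, hard)}."""
    limits = {}
    for line in text.splitlines()[1:]:  # skip the header row
        # Columns are separated by runs of 2+ spaces; limit names
        # themselves contain only single spaces.
        parts = re.split(r"\s{2,}", line.strip())
        if len(parts) >= 3:
            limits[parts[0]] = (parts[1], parts[2])
    return limits

# Hypothetical sample in the /proc/<pid>/limits layout (values invented).
SAMPLE = """Limit                     Soft Limit           Hard Limit           Units
Max processes             1024                 4096                 processes
Max open files            65536                65536                files
"""

limits = parse_proc_limits(SAMPLE)
soft, hard = limits["Max processes"]
print(f"Max processes: soft={soft} hard={hard}")
```

On a live server, the input would be the contents of /proc/<pid>/limits for the splunkd pid rather than the sample string.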

However, more concerning is that this server crash caused a whole mess of scheduled jobs to be missed. There is no record of them in any _internal nor _audit logs. They just completely, silently, were skipped.

7.0.4

EDIT: Per the first response below, yes it was the captain. I still wouldn't expect it to lose scheduled events however. Maybe I was too optimistic.

sudosplunk · ‎10-23-2018

Hi,

Is this search head the "captain"? If not, the scheduled searches on the dead member will be carried out by another member or by the captain itself. Since you lost only 1 member of a 4-member cluster, scheduled searches should not be affected. HTH!

Per docs,

Role of the captain
The captain is a cluster member and in that capacity it performs the search activities typical of any cluster member, servicing both ad hoc and scheduled searches. If necessary, you can limit the captain's search activities so that it performs only ad hoc searches and not scheduled searches. See Configure the captain to run ad hoc searches only.

The captain also coordinates activities among all cluster members. Its responsibilities include:

Scheduling jobs. It assigns jobs to members, including itself, based on relative current loads.
Coordinating alerts and alert suppressions across the cluster. The captain tracks each alert but the member running an initiating search fires it.

twinspop · ‎10-23-2018

True, it was the captain (verified with index=_internal sourcetype=splunkd shc_captain instance_roles log_level=INFO), but that's still pretty concerning. One of the benefits of clustering is resiliency. If a server goes down and we (effectively) lose data, I'd say it's failing the resiliency test. I guess I'll need to put together a search that shows which scheduled events were silently punted.
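Since the skipped runs leave no log entry, one way to find them is to compare the times a search should have run against the times it actually ran. Here is a rough sketch of that comparison, assuming a fixed-interval schedule and a list of observed run times (for example, exported scheduled_time values from the scheduler log). The function name, the interval, and the timestamps are hypothetical; a real cron schedule would need a cron parser instead.

```python
from datetime import datetime, timedelta

def missing_runs(start, end, interval, observed):
    """Return expected run times in [start, end) that have no observed run.

    Assumes a fixed-interval schedule aligned to `start`.
    """
    seen = set(observed)
    missing = []
    t = start
    while t < end:
        if t not in seen:
            missing.append(t)
        t += interval
    return missing

# Hypothetical example: a 5-minute search with one slot never executed.
day = datetime(2018, 10, 19)
observed = [day.replace(hour=5, minute=55), day.replace(hour=6, minute=5)]
gaps = missing_runs(
    start=day.replace(hour=5, minute=55),
    end=day.replace(hour=6, minute=10),
    interval=timedelta(minutes=5),
    observed=observed,
)
for t in gaps:
    print("no run recorded for slot", t.isoformat())
```

With the made-up data above, the only slot without a recorded run is 06:00.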

sudosplunk · ‎10-23-2018

Yes. As soon as the captain goes down, the remaining members should start the election process. The new captain needs votes from the members in order to win the election. Run the searches below to get some more intel.

I would start the investigation from the former captain. Look at $SPLUNK_HOME/var/log/splunk/splunkd_stderr.log to see why Splunk crashed.

To see whether the voting process was initiated and completed, run: index=_internal sourcetype=splunkd SHCRaftConsensus *vote*

If you come across the message below in the results of the above query, start investigating the server that did not vote (this generally happens when there is a corrupted raft file on that search head).

10-21-2018 04:16:32.558 -0400 INFO  SHCRaftConsensus - requestVote done to http://SH:8089 was not OK. Will backoff for certain time

To see whether any member has a corrupted raft state: index=_internal sourcetype=splunkd ERROR SHCRaftConsensus corrupted

Example output of the above query:

10-21-2018 03:27:13.711 -0400 ERROR SHCRaftConsensus - failed appendEntriesRequest err: uri=http://SH:8089/services/shcluster/member/consensus/pseudoid/raft_append_entries?output_mode=json, error=500 - \n In handler 'shclustermemberconsensus': Search head clustering: Search head cluster member has a corrupted raft state. to http://SH:8089
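If the logs have been exported to a file, the same check can be done offline. The sketch below counts, per peer, how often a requestVote was reported as "not OK". It assumes the splunkd log line layout shown in the samples above; the regular expressions and the function name are illustrative, not an official parser.

```python
import re

# Matches lines like:
# 10-21-2018 04:16:32.558 -0400 INFO  SHCRaftConsensus - <message>
LINE_RE = re.compile(
    r"^(?P<ts>\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2}\.\d{3} [+-]\d{4})\s+"
    r"(?P<level>[A-Z]+)\s+(?P<component>\S+) - (?P<message>.*)$"
)
PEER_RE = re.compile(r"requestVote done to (?P<peer>\S+) was not OK")

def failed_vote_peers(lines):
    """Count failed requestVote messages per peer URI."""
    counts = {}
    for line in lines:
        m = LINE_RE.match(line)
        if not m or m.group("component") != "SHCRaftConsensus":
            continue
        p = PEER_RE.search(m.group("message"))
        if p:
            peer = p.group("peer")
            counts[peer] = counts.get(peer, 0) + 1
    return counts

sample = [
    "10-21-2018 04:16:32.558 -0400 INFO  SHCRaftConsensus - "
    "requestVote done to http://SH:8089 was not OK. "
    "Will backoff for certain time",
]
print(failed_vote_peers(sample))
```

A peer that shows up repeatedly here is the one to look at first.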

twinspop · ‎10-23-2018

Negative, Goatrider. The corrupted search didn't hit. I knew the SHC was the scheduler, but I also thought there was some recovery in place for when the captain died unexpectedly. To a) drop dozens of searches on the floor, and b) provide no indication of this ... hurts. A lot.

twinspop · ‎10-23-2018

SH3 (SHC) went down at 5:58-ish due to reboot (NOC monkey). Raft vote completed and landed on a de-preferred member, SH8 at 5:59:32. By 6:00:25, captaincy had moved to a preferred captain, SH2. When SH3 returned at about 6:00:00, it had bad limits, which caused the error above (pthread) and it died at about 6:30am.

sudosplunk · ‎10-23-2018

At 6:00:25, the cluster had a captain (SH2), which should have taken care of the scheduled jobs from that point on.
Can you run the search below to get details of the skipped searches (from an accepted answer):

index="_internal" sourcetype="scheduler"
| eval scheduled=strftime(scheduled_time, "%Y-%m-%d %H:%M:%S")
| stats values(scheduled) as scheduled
        values(savedsearch_name) as search_name
        values(status) as status
        values(reason) as reason
        values(run_time) as run_time
        values(dm_node) as dm_node
        values(sid) as sid
        by _time, savedsearch_name
| sort -scheduled
| table scheduled, search_name, status, reason, run_time

twinspop · ‎10-23-2018

Bueno search. But I've been down that road, and only 4 searches were listed as skipped or delegated_remote_error. Sadly, the Big Deal(tm) searches that were skipped are not in the scheduler log at all: not as skipped, not as errors, not as anything else.
