
Why does the indexer peer node take hours to join the cluster?

mdtoro
Explorer

I have a case where some indexers take 4 to 5 hours to join the cluster. During that time the system shows little to no resource usage (CPU, memory, I/O).

splunkd.log appears to loop through the same entries multiple times. The indexer only continues loading once I see the log entry: Running job=BundleForcefulStateMachineResetJob

After this reset job is run, I quickly see the public key for the master loaded, and the indexer joins the cluster shortly thereafter.

Here is a snippet of the log:

10-13-2022 11:22:02.293 -0700 WARN HttpListener - Socket error from 127.0.0.1:54240 while accessing /servicesNS/splunk-system-user/splunk_archiver/search/jobs: Broken pipe
10-13-2022 11:44:08.721 -0700 INFO PipelineComponent - MetricsManager:probeandreport() took longer than seems reasonable (1103256 milliseconds) in callbackRunnerThread. Might indicate hardware or splunk limitations.
10-13-2022 11:44:24.950 -0700 INFO PipelineComponent - CallbackRunnerThread is unusually busy, this may cause service delays: time_ms=1119484 new=0 null=0 total=56 {'name':'DistributedRestCallerCallback','valid':'1','null':'0','last':'3','time_ms':'0'},{'name':'HTTPAuthManager:timeoutCallback','valid':'1','null':'0','last':'1','time_ms':'0'},{'name':'IndexProcessor:ipCallback-0','valid':'1','null':'0','last':'6','time_ms':'4000'},{'name':'IndexProcessor:ipCallback-1','valid':'1','null':'0','last':'19','time_ms':'4000'},{'name':'IndexProcessor:ipCallback-2','valid':'1','null':'0','last':'30','time_ms':'4062'},{'name':'IndexProcessor:ipCallback-3','valid':'1','null':'0','last':'41','time_ms':'4164'},{'name':'MetricsManager:probeandreport','valid':'1','null':'0','last':'0','time_ms':'1103256'},{'name':'PullBasedPubSubSvr:timerCallback','valid':'1','null':'0','last':'2','time_ms':'0'},{'name':'ThreadedOutputProcessor:timerCallback','valid':'4','null':'0','last':'40','time_ms':'0'},{'name':'triggerCollection','valid':'44','null':'0','last':'55','time_ms':'0'}
10-13-2022 12:00:00.001 -0700 INFO ExecProcessor - setting reschedule_ms=3599999, for command=/opt/splunk/bin/python3.7 /opt/splunk/etc/apps/splunk_instrumentation/bin/instrumentation.py
10-13-2022 12:18:32.106 -0700 WARN DispatchReaper - Failed to read search info for id=1665688686.28
10-13-2022 12:19:02.105 -0700 WARN DispatchReaper - Failed to read search info for id=1665688686.28
10-13-2022 12:19:32.106 -0700 WARN DispatchReaper - Failed to read search info for id=1665688686.28
10-13-2022 12:20:02.105 -0700 WARN DispatchReaper - Failed to read search info for id=1665688686.28
10-13-2022 12:20:30.137 -0700 WARN HttpListener - Socket error from 127.0.0.1:54544 while accessing /servicesNS/splunk-system-user/splunk_archiver/search/jobs: Broken pipe
10-13-2022 12:29:09.955 -0700 INFO PipelineComponent - MetricsManager:probeandreport() took longer than seems reasonable (2182584 milliseconds) in callbackRunnerThread. Might indicate hardware or splunk limitations.
10-13-2022 12:29:25.957 -0700 INFO PipelineComponent - CallbackRunnerThread is unusually busy, this may cause service delays: time_ms=2198585 new=1 null=0 total=57 {'name':'DistributedRestCallerCallback','valid':'1','null':'0','last':'3','time_ms':'0'},{'name':'HTTPAuthManager:timeoutCallback','valid':'1','null':'0','last':'1','time_ms':'0'},{'name':'IndexProcessor:ipCallback-0','valid':'1','null':'0','last':'6','time_ms':'4000'},{'name':'IndexProcessor:ipCallback-1','valid':'1','null':'0','last':'19','time_ms':'4000'},{'name':'IndexProcessor:ipCallback-2','valid':'1','null':'0','last':'30','time_ms':'4000'},{'name':'IndexProcessor:ipCallback-3','valid':'1','null':'0','last':'41','time_ms':'4000'},{'name':'MetricsManager:probeandreport','valid':'1','null':'0','last':'0','time_ms':'2182584'},{'name':'PullBasedPubSubSvr:timerCallback','valid':'1','null':'0','last':'2','time_ms':'0'},{'name':'ThreadedOutputProcessor:timerCallback','valid':'4','null':'0','last':'40','time_ms':'0'},{'name':'triggerCollection','valid':'45','null':'0','last':'56','time_ms':'0'}
10-13-2022 12:46:13.298 -0700 INFO PipelineComponent - MetricsManager:probeandreport() took longer than seems reasonable (496854 milliseconds) in callbackRunnerThread. Might indicate hardware or splunk limitations.
10-13-2022 12:46:13.867 -0700 WARN HttpListener - Socket error from 127.0.0.1:54220 while accessing /services/data/indexes: Broken pipe
10-13-2022 12:46:13.907 -0700 WARN HttpListener - Socket error from 127.0.0.1:54254 while accessing /services/data/indexes: Broken pipe
10-13-2022 12:46:13.931 -0700 WARN HttpListener - Socket error from 127.0.0.1:54560 while accessing /services/data/indexes: Broken pipe
10-13-2022 12:46:13.955 -0700 WARN HttpListener - Socket error from 127.0.0.1:54538 while accessing /services/data/indexes: Broken pipe
10-13-2022 12:46:17.070 -0700 INFO BundleJob - Running job=BundleForcefulStateMachineResetJob
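
For anyone comparing timelines: here is a rough way to bracket the delay from splunkd.log on an affected peer. This is only a sketch; it assumes a default install under /opt/splunk and that the log messages match the ones shown above (the startup banner wording can vary by version).

SPLUNK_HOME=${SPLUNK_HOME:-/opt/splunk}
LOG="$SPLUNK_HOME/var/log/splunk/splunkd.log"

# When this restart began (startup banner written at splunkd launch)
grep "Splunkd starting" "$LOG" | tail -1

# When the forceful bundle reset finally ran (the event that precedes joining)
grep "BundleForcefulStateMachineResetJob" "$LOG" | tail -1

# Key manager activity around loading the manager's public key
grep "KeyManager" "$LOG" | tail -5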


anirban_td
Explorer

Which version of Splunk Enterprise are you on?

IIRC there was a bug in v9.x where an indexer peer node takes hours to join the cluster.

Did you try raising a support case?
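
If you want to confirm the exact build on the manager and every peer, something like the loop below works. It is just an illustration: the host names are placeholders, and it assumes SSH access to a default /opt/splunk install on Linux.

# Placeholder host names; adjust to your deployment
for host in idx01 idx02 idx03 cm01; do
    printf '%s: ' "$host"
    ssh "$host" /opt/splunk/bin/splunk version
done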


mdtoro
Explorer

What is the fix for the slow restart? The indexer peers were at 8.x while the cluster master was at 9.0.0.1. However, we have since upgraded everything to 9.0.0.1 and still see the same behavior.
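
For reference, while one of the slow indexers restarts, its status can be watched from the cluster manager with the standard clustering CLI (shown below for a default /opt/splunk install; the exact output format differs a bit between 8.x and 9.x).

# Run on the cluster manager
/opt/splunk/bin/splunk show cluster-status --verbose
/opt/splunk/bin/splunk list cluster-peers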

 


mdtoro
Explorer

Same behavior on another indexer today.  

Log snippet:

10-19-2022 00:34:52.035 -0700 INFO  KeyManagerSearchPeers - Finished reading public key for peer: /opt....
10-19-2022 00:34:59.332 -0700 WARN  TailReader - Could not send data to output queue (parsingQueue), retrying...
10-19-2022 00:36:00.466 -0700 WARN  TailReader - Could not send data to output queue (parsingQueue), retrying...
10-19-2022 00:36:02.071 -0700 INFO  KeyManagerLocalhost - Checking for localhost key pair
10-19-2022 00:39:27.140 -0700 WARN  TcpInputProc - Stopping all listening ports. Queues blocked for more than 300 seconds
10-19-2022 00:39:27.140 -0700 INFO  TcpInputProc - Stopping IPv4 port 9996
10-19-2022 00:39:27.140 -0700 INFO  TcpInputProc - Stopping IPv4 port 9997
10-19-2022 00:39:37.338 -0700 WARN  TcpInputProc - Stopping all listening ports. Queues blocked for more than 300 seconds
10-19-2022 00:39:55.313 -0700 WARN  TcpInputProc - Stopping all listening ports. Queues blocked for more than 300 seconds
10-19-2022 00:39:57.465 -0700 WARN  TcpInputProc - Stopping all listening ports. Queues blocked for more than 300 seconds
10-19-2022 00:45:32.706 -0700 INFO  PipelineComponent - MetricsManager:probeandreport() took longer than seems reasonable (246751 milliseconds) in callbackRunnerThread. Might indicate hardware or splunk limitations.
10-19-2022 00:48:01.239 -0700 INFO  ServerRoles - Declared role=search_peer.
10-19-2022 00:57:07.531 -0700 INFO  PipelineComponent - MetricsManager:probeandreport() took longer than seems reasonable (168575 milliseconds) in callbackRunnerThread. Might indicate hardware or splunk limitations.
10-19-2022 01:00:00.001 -0700 INFO  ExecProcessor - setting reschedule_ms=3599999, for command=/opt/splunk/bin/python3.7 /opt/splunk/etc/apps/splunk_instrumentation/bin/instrumentation.py
10-19-2022 01:14:27.337 -0700 INFO  PipelineComponent - MetricsManager:probeandreport() took longer than seems reasonable (513382 milliseconds) in callbackRunnerThread. Might indicate hardware or splunk limitations.

Then... lots of activity after Running job=BundleForcefulStateMachineResetJob:

10-19-2022 04:40:58.082 -0700 WARN  HttpListener - Socket error from 127.0.0.1:56904 while accessing /services/data/indexes: Broken pipe
10-19-2022 04:40:59.226 -0700 INFO  BundleJob - Running job=BundleForcefulStateMachineResetJob
10-19-2022 04:41:13.098 -0700 INFO  PipelineComponent - CallbackRunnerThread is unusually busy, this may cause service delays: time_ms=1537142 new=0 null=0 total=57 {'name':'DistributedRestCallerCallback','valid':'1','null':'0','last':'3','time_ms':'0'},{'name':'HTTPAuthManager:timeoutCallback','valid':'1','null':'0','last':'1','time_ms':'0'},{'name':'IndexProcessor:ipCallback-0','valid':'1','null':'0','last':'6','time_ms':'4000'},{'name':'IndexProcessor:ipCallback-1','valid':'1','null':'0','last':'19','time_ms':'4000'},{'name':'IndexProcessor:ipCallback-2','valid':'1','null':'0','last':'30','time_ms':'4000'},{'name':'IndexProcessor:ipCallback-3','valid':'1','null':'0','last':'41','time_ms':'4000'},{'name':'MetricsManager:probeandreport','valid':'1','null':'0','last':'0','time_ms':'1521139'},{'name':'PullBasedPubSubSvr:timerCallback','valid':'1','null':'0','last':'2','time_ms':'0'},{'name':'ThreadedOutputProcessor:timerCallback','valid':'4','null':'0','last':'40','time_ms':'0'},{'name':'triggerCollection','valid':'45','null':'0','last':'56','time_ms':'0'}
10-19-2022 04:45:33.840 -0700 INFO  KeyManagerSearchPeers - Updating public key for search peer
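
To put numbers on the stall, a quick sketch (again assuming the default /opt/splunk log path) that counts the repeated queue-block warnings and brackets the window around the reset job:

LOG=/opt/splunk/var/log/splunk/splunkd.log

# How many times inputs shut down because queues stayed blocked past 300 seconds
grep -c "Queues blocked for more than 300 seconds" "$LOG"

# First and last occurrence of the reset job, to bracket the slow window
grep "BundleForcefulStateMachineResetJob" "$LOG" | head -1
grep "BundleForcefulStateMachineResetJob" "$LOG" | tail -1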
