Getting Data In

Why does the indexer peer node take hours to join the cluster?

mdtoro
Explorer

I have a case where some indexers take 4 to 5 hours to join the cluster. During that time the system shows little to no resource usage (CPU, memory, I/O).

The splunkd.log appears to loop through the same log entries multiple times. The indexer only continues loading once I see the log entry: Running job=BundleForcefulStateMachineResetJob

After this reset job runs, I quickly see the public key for the master loaded, and the indexer joins the cluster shortly thereafter.

Here is a snippet of the log:

10-13-2022 11:22:02.293 -0700 WARN HttpListener - Socket error from 127.0.0.1:54240 while accessing /servicesNS/splunk-system-user/splunk_archiver/search/jobs: Broken pipe
10-13-2022 11:44:08.721 -0700 INFO PipelineComponent - MetricsManager:probeandreport() took longer than seems reasonable (1103256 milliseconds) in callbackRunnerThread. Might indicate hardware or splunk limitations.
10-13-2022 11:44:24.950 -0700 INFO PipelineComponent - CallbackRunnerThread is unusually busy, this may cause service delays: time_ms=1119484 new=0 null=0 total=56 {'name':'DistributedRestCallerCallback','valid':'1','null':'0','last':'3','time_ms':'0'},{'name':'HTTPAuthManager:timeoutCallback','valid':'1','null':'0','last':'1','time_ms':'0'},{'name':'IndexProcessor:ipCallback-0','valid':'1','null':'0','last':'6','time_ms':'4000'},{'name':'IndexProcessor:ipCallback-1','valid':'1','null':'0','last':'19','time_ms':'4000'},{'name':'IndexProcessor:ipCallback-2','valid':'1','null':'0','last':'30','time_ms':'4062'},{'name':'IndexProcessor:ipCallback-3','valid':'1','null':'0','last':'41','time_ms':'4164'},{'name':'MetricsManager:probeandreport','valid':'1','null':'0','last':'0','time_ms':'1103256'},{'name':'PullBasedPubSubSvr:timerCallback','valid':'1','null':'0','last':'2','time_ms':'0'},{'name':'ThreadedOutputProcessor:timerCallback','valid':'4','null':'0','last':'40','time_ms':'0'},{'name':'triggerCollection','valid':'44','null':'0','last':'55','time_ms':'0'}
10-13-2022 12:00:00.001 -0700 INFO ExecProcessor - setting reschedule_ms=3599999, for command=/opt/splunk/bin/python3.7 /opt/splunk/etc/apps/splunk_instrumentation/bin/instrumentation.py
10-13-2022 12:18:32.106 -0700 WARN DispatchReaper - Failed to read search info for id=1665688686.28
10-13-2022 12:19:02.105 -0700 WARN DispatchReaper - Failed to read search info for id=1665688686.28
10-13-2022 12:19:32.106 -0700 WARN DispatchReaper - Failed to read search info for id=1665688686.28
10-13-2022 12:20:02.105 -0700 WARN DispatchReaper - Failed to read search info for id=1665688686.28
10-13-2022 12:20:30.137 -0700 WARN HttpListener - Socket error from 127.0.0.1:54544 while accessing /servicesNS/splunk-system-user/splunk_archiver/search/jobs: Broken pipe
10-13-2022 12:29:09.955 -0700 INFO PipelineComponent - MetricsManager:probeandreport() took longer than seems reasonable (2182584 milliseconds) in callbackRunnerThread. Might indicate hardware or splunk limitations.
10-13-2022 12:29:25.957 -0700 INFO PipelineComponent - CallbackRunnerThread is unusually busy, this may cause service delays: time_ms=2198585 new=1 null=0 total=57 {'name':'DistributedRestCallerCallback','valid':'1','null':'0','last':'3','time_ms':'0'},{'name':'HTTPAuthManager:timeoutCallback','valid':'1','null':'0','last':'1','time_ms':'0'},{'name':'IndexProcessor:ipCallback-0','valid':'1','null':'0','last':'6','time_ms':'4000'},{'name':'IndexProcessor:ipCallback-1','valid':'1','null':'0','last':'19','time_ms':'4000'},{'name':'IndexProcessor:ipCallback-2','valid':'1','null':'0','last':'30','time_ms':'4000'},{'name':'IndexProcessor:ipCallback-3','valid':'1','null':'0','last':'41','time_ms':'4000'},{'name':'MetricsManager:probeandreport','valid':'1','null':'0','last':'0','time_ms':'2182584'},{'name':'PullBasedPubSubSvr:timerCallback','valid':'1','null':'0','last':'2','time_ms':'0'},{'name':'ThreadedOutputProcessor:timerCallback','valid':'4','null':'0','last':'40','time_ms':'0'},{'name':'triggerCollection','valid':'45','null':'0','last':'56','time_ms':'0'}
10-13-2022 12:46:13.298 -0700 INFO PipelineComponent - MetricsManager:probeandreport() took longer than seems reasonable (496854 milliseconds) in callbackRunnerThread. Might indicate hardware or splunk limitations.
10-13-2022 12:46:13.867 -0700 WARN HttpListener - Socket error from 127.0.0.1:54220 while accessing /services/data/indexes: Broken pipe
10-13-2022 12:46:13.907 -0700 WARN HttpListener - Socket error from 127.0.0.1:54254 while accessing /services/data/indexes: Broken pipe
10-13-2022 12:46:13.931 -0700 WARN HttpListener - Socket error from 127.0.0.1:54560 while accessing /services/data/indexes: Broken pipe
10-13-2022 12:46:13.955 -0700 WARN HttpListener - Socket error from 127.0.0.1:54538 while accessing /services/data/indexes: Broken pipe
10-13-2022 12:46:17.070 -0700 INFO BundleJob - Running job=BundleForcefulStateMachineResetJob
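
A quick way to see the gap between restart and the reset job is to grep the relevant markers out of splunkd.log (rough sketch; the path assumes a default /opt/splunk install, and the startup banner text may differ slightly by version):

# Pull the startup banner, the forceful bundle reset, and the peer key events
# into one view so the multi-hour gap is easy to measure.
grep -E "Splunkd starting|BundleForcefulStateMachineResetJob|public key" \
    /opt/splunk/var/log/splunk/splunkd.log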


anirban_td
Explorer

Which version of Splunk Enterprise are you on?

IIRC there was a bug in v9.x where an indexer peer node takes hours to join the cluster.

Did you try raising a support case?
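
For reference, the version a peer is running can be checked directly on the box (assuming a default /opt/splunk install):

# Prints the Splunk Enterprise version and build of this instance.
/opt/splunk/bin/splunk version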


mdtoro
Explorer

What is the fix for the slow restart? The indexer peers were at 8.x while the cluster master was at 9.0.0.1. However, we have since upgraded everything to 9.0.0.1 and still see the same behavior.
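
To double-check that every peer really reports the upgraded build, one option is to query each peer's management port (a sketch; the hostnames and credentials below are placeholders):

# server/info returns the version each instance reports; loop over the peers.
# Hostnames and admin credentials are placeholders.
for host in idx1.example.com idx2.example.com; do
  printf '%s: ' "$host"
  curl -sk -u admin:changeme \
    "https://$host:8089/services/server/info?output_mode=json" |
    python3 -c 'import json,sys; print(json.load(sys.stdin)["entry"][0]["content"]["version"])'
done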

 


mdtoro
Explorer

Same behavior on another indexer today.  

Log snippet:

10-19-2022 00:34:52.035 -0700 INFO  KeyManagerSearchPeers - Finished reading public key for peer: /opt....
10-19-2022 00:34:59.332 -0700 WARN  TailReader - Could not send data to output queue (parsingQueue), retrying...
10-19-2022 00:36:00.466 -0700 WARN  TailReader - Could not send data to output queue (parsingQueue), retrying...
10-19-2022 00:36:02.071 -0700 INFO  KeyManagerLocalhost - Checking for localhost key pair
10-19-2022 00:39:27.140 -0700 WARN  TcpInputProc - Stopping all listening ports. Queues blocked for more than 300 seconds
10-19-2022 00:39:27.140 -0700 INFO  TcpInputProc - Stopping IPv4 port 9996
10-19-2022 00:39:27.140 -0700 INFO  TcpInputProc - Stopping IPv4 port 9997
10-19-2022 00:39:37.338 -0700 WARN  TcpInputProc - Stopping all listening ports. Queues blocked for more than 300 seconds
10-19-2022 00:39:55.313 -0700 WARN  TcpInputProc - Stopping all listening ports. Queues blocked for more than 300 seconds
10-19-2022 00:39:57.465 -0700 WARN  TcpInputProc - Stopping all listening ports. Queues blocked for more than 300 seconds
10-19-2022 00:45:32.706 -0700 INFO  PipelineComponent - MetricsManager:probeandreport() took longer than seems reasonable (246751 milliseconds) in callbackRunnerThread. Might indicate hardware or splunk limitations.
10-19-2022 00:48:01.239 -0700 INFO  ServerRoles - Declared role=search_peer.
10-19-2022 00:57:07.531 -0700 INFO  PipelineComponent - MetricsManager:probeandreport() took longer than seems reasonable (168575 milliseconds) in callbackRunnerThread. Might indicate hardware or splunk limitations.
10-19-2022 01:00:00.001 -0700 INFO  ExecProcessor - setting reschedule_ms=3599999, for command=/opt/splunk/bin/python3.7 /opt/splunk/etc/apps/splunk_instrumentation/bin/instrumentation.py
10-19-2022 01:14:27.337 -0700 INFO  PipelineComponent - MetricsManager:probeandreport() took longer than seems reasonable (513382 milliseconds) in callbackRunnerThread. Might indicate hardware or splunk limitations.

Then lots of activity follows Running job=BundleForcefulStateMachineResetJob:

10-19-2022 04:40:58.082 -0700 WARN  HttpListener - Socket error from 127.0.0.1:56904 while accessing /services/data/indexes: Broken pipe
10-19-2022 04:40:59.226 -0700 INFO  BundleJob - Running job=BundleForcefulStateMachineResetJob
10-19-2022 04:41:13.098 -0700 INFO  PipelineComponent - CallbackRunnerThread is unusually busy,  this may cause service delays: time_ms=1537142 new=0 null=0 total=57 {'name':'DistributedRestCallerCallback','valid':'1','null':'0','last':'3','time_ms':'0'},{'name':'HTTPAuthManager:timeoutCallback','valid':'1','null':'0','last':'1','time_ms':'0'},{'name':'IndexProcessor:ipCallback-0','valid':'1','null':'0','last':'6','time_ms':'4000'},{'name':'IndexProcessor:ipCallback-1','valid':'1','null':'0','last':'19','time_ms':'4000'},{'name':'IndexProcessor:ipCallback-2','valid':'1','null':'0','last':'30','time_ms':'4000'},{'name':'IndexProcessor:ipCallback-3','valid':'1','null':'0','last':'41','time_ms':'4000'},{'name':'MetricsManager:probeandreport','valid':'1','null':'0','last':'0','time_ms':'1521139'},{'name':'PullBasedPubSubSvr:timerCallback','valid':'1','null':'0','last':'2','time_ms':'0'},{'name':'ThreadedOutputProcessor:timerCallback','valid':'4','null':'0','last':'40','time_ms':'0'},{'name':'triggerCollection','valid':'45','null':'0','last':'56','time_ms':'0'}
10-19-2022 04:45:33.840 -0700 INFO  KeyManagerSearchPeers - Updating public key for search peer
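
One way to see which queue is backing up during that window is to search the indexer's own metrics data for blocked queues (a sketch; the time range and credentials are placeholders, adjust to the restart window):

# metrics.log group=queue entries carry blocked=true while a queue is stalled;
# charting by queue name shows which one backs up before the reset job runs.
$SPLUNK_HOME/bin/splunk search \
  'index=_internal source=*metrics.log* group=queue blocked=true earliest=-24h | timechart count by name' \
  -auth admin:changeme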
