I have a case where some indexers take 4 to 5 hours to join the cluster. During that time the host shows little to no resource usage (CPU, memory, I/O).
splunkd.log appears to cycle through the same entries multiple times. The indexer only continues loading once I see this entry: Running job=BundleForcefulStateMachineResetJob
After this reset job is run, the public key for the master is loaded and the indexer joins the cluster shortly thereafter.
Here is a snippet of the log:
10-13-2022 11:22:02.293 -0700 WARN HttpListener - Socket error from 127.0.0.1:54240 while accessing /servicesNS/splunk-system-user/splunk_archiver/search/jobs: Broken pipe
10-13-2022 11:44:08.721 -0700 INFO PipelineComponent - MetricsManager:probeandreport() took longer than seems reasonable (1103256 milliseconds) in callbackRunnerThread. Might indicate hardware or splunk limitations.
10-13-2022 11:44:24.950 -0700 INFO PipelineComponent - CallbackRunnerThread is unusually busy, this may cause service delays: time_ms=1119484 new=0 null=0 total=56 {'name':'DistributedRestCallerCallback','valid':'1','null':'0','last':'3','time_ms':'0'},{'name':'HTTPAuthManager:timeoutCallback','valid':'1','null':'0','last':'1','time_ms':'0'},{'name':'IndexProcessor:ipCallback-0','valid':'1','null':'0','last':'6','time_ms':'4000'},{'name':'IndexProcessor:ipCallback-1','valid':'1','null':'0','last':'19','time_ms':'4000'},{'name':'IndexProcessor:ipCallback-2','valid':'1','null':'0','last':'30','time_ms':'4062'},{'name':'IndexProcessor:ipCallback-3','valid':'1','null':'0','last':'41','time_ms':'4164'},{'name':'MetricsManager:probeandreport','valid':'1','null':'0','last':'0','time_ms':'1103256'},{'name':'PullBasedPubSubSvr:timerCallback','valid':'1','null':'0','last':'2','time_ms':'0'},{'name':'ThreadedOutputProcessor:timerCallback','valid':'4','null':'0','last':'40','time_ms':'0'},{'name':'triggerCollection','valid':'44','null':'0','last':'55','time_ms':'0'}
10-13-2022 12:00:00.001 -0700 INFO ExecProcessor - setting reschedule_ms=3599999, for command=/opt/splunk/bin/python3.7 /opt/splunk/etc/apps/splunk_instrumentation/bin/instrumentation.py
10-13-2022 12:18:32.106 -0700 WARN DispatchReaper - Failed to read search info for id=1665688686.28
10-13-2022 12:19:02.105 -0700 WARN DispatchReaper - Failed to read search info for id=1665688686.28
10-13-2022 12:19:32.106 -0700 WARN DispatchReaper - Failed to read search info for id=1665688686.28
10-13-2022 12:20:02.105 -0700 WARN DispatchReaper - Failed to read search info for id=1665688686.28
10-13-2022 12:20:30.137 -0700 WARN HttpListener - Socket error from 127.0.0.1:54544 while accessing /servicesNS/splunk-system-user/splunk_archiver/search/jobs: Broken pipe
10-13-2022 12:29:09.955 -0700 INFO PipelineComponent - MetricsManager:probeandreport() took longer than seems reasonable (2182584 milliseconds) in callbackRunnerThread. Might indicate hardware or splunk limitations.
10-13-2022 12:29:25.957 -0700 INFO PipelineComponent - CallbackRunnerThread is unusually busy, this may cause service delays: time_ms=2198585 new=1 null=0 total=57 {'name':'DistributedRestCallerCallback','valid':'1','null':'0','last':'3','time_ms':'0'},{'name':'HTTPAuthManager:timeoutCallback','valid':'1','null':'0','last':'1','time_ms':'0'},{'name':'IndexProcessor:ipCallback-0','valid':'1','null':'0','last':'6','time_ms':'4000'},{'name':'IndexProcessor:ipCallback-1','valid':'1','null':'0','last':'19','time_ms':'4000'},{'name':'IndexProcessor:ipCallback-2','valid':'1','null':'0','last':'30','time_ms':'4000'},{'name':'IndexProcessor:ipCallback-3','valid':'1','null':'0','last':'41','time_ms':'4000'},{'name':'MetricsManager:probeandreport','valid':'1','null':'0','last':'0','time_ms':'2182584'},{'name':'PullBasedPubSubSvr:timerCallback','valid':'1','null':'0','last':'2','time_ms':'0'},{'name':'ThreadedOutputProcessor:timerCallback','valid':'4','null':'0','last':'40','time_ms':'0'},{'name':'triggerCollection','valid':'45','null':'0','last':'56','time_ms':'0'}
10-13-2022 12:46:13.298 -0700 INFO PipelineComponent - MetricsManager:probeandreport() took longer than seems reasonable (496854 milliseconds) in callbackRunnerThread. Might indicate hardware or splunk limitations.
10-13-2022 12:46:13.867 -0700 WARN HttpListener - Socket error from 127.0.0.1:54220 while accessing /services/data/indexes: Broken pipe
10-13-2022 12:46:13.907 -0700 WARN HttpListener - Socket error from 127.0.0.1:54254 while accessing /services/data/indexes: Broken pipe
10-13-2022 12:46:13.931 -0700 WARN HttpListener - Socket error from 127.0.0.1:54560 while accessing /services/data/indexes: Broken pipe
10-13-2022 12:46:13.955 -0700 WARN HttpListener - Socket error from 127.0.0.1:54538 while accessing /services/data/indexes: Broken pipe
10-13-2022 12:46:17.070 -0700 INFO BundleJob - Running job=BundleForcefulStateMachineResetJob
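A rough way to put a number on the delay per peer is to diff the first post-restart timestamp in splunkd.log against the BundleForcefulStateMachineResetJob entry. A minimal Python sketch, assuming the default /opt/splunk/var/log/splunk/splunkd.log path and that the log segment you point it at starts at the restart:

    from datetime import datetime

    LOG = "/opt/splunk/var/log/splunk/splunkd.log"   # assumed default location
    FMT = "%m-%d-%Y %H:%M:%S.%f"                     # timestamp format seen in the snippet above

    first_ts = reset_ts = None
    with open(LOG, errors="replace") as f:
        for line in f:
            parts = line.split()
            if len(parts) < 2:
                continue
            try:
                ts = datetime.strptime(parts[0] + " " + parts[1], FMT)
            except ValueError:
                continue                             # skip lines without a leading timestamp
            if first_ts is None:
                first_ts = ts                        # treated as the restart time (assumption)
            if "BundleForcefulStateMachineResetJob" in line:
                reset_ts = ts
                break

    if first_ts and reset_ts:
        print("Delay before reset job:", reset_ts - first_ts)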
Which version of Splunk Enterprise are you on?
IIRC there was a bug in v9.x where an indexer peer node takes hours to join the cluster.
Did you try raising a support case?
What is the fix for the slow restart? The indexer peers were at 8.x while the cluster master was at 9.0.0.1. We have since upgraded everything to 9.0.0.1 but still see the same behavior.
Same behavior on another indexer today.
Log snippet:
10-19-2022 00:34:52.035 -0700 INFO KeyManagerSearchPeers - Finished reading public key for peer: /opt....
10-19-2022 00:34:59.332 -0700 WARN TailReader - Could not send data to output queue (parsingQueue), retrying...
10-19-2022 00:36:00.466 -0700 WARN TailReader - Could not send data to output queue (parsingQueue), retrying...
10-19-2022 00:36:02.071 -0700 INFO KeyManagerLocalhost - Checking for localhost key pair
10-19-2022 00:39:27.140 -0700 WARN TcpInputProc - Stopping all listening ports. Queues blocked for more than 300 seconds
10-19-2022 00:39:27.140 -0700 INFO TcpInputProc - Stopping IPv4 port 9996
10-19-2022 00:39:27.140 -0700 INFO TcpInputProc - Stopping IPv4 port 9997
10-19-2022 00:39:37.338 -0700 WARN TcpInputProc - Stopping all listening ports. Queues blocked for more than 300 seconds
10-19-2022 00:39:55.313 -0700 WARN TcpInputProc - Stopping all listening ports. Queues blocked for more than 300 seconds
10-19-2022 00:39:57.465 -0700 WARN TcpInputProc - Stopping all listening ports. Queues blocked for more than 300 seconds
10-19-2022 00:45:32.706 -0700 INFO PipelineComponent - MetricsManager:probeandreport() took longer than seems reasonable (246751 milliseconds) in callbackRunnerThread. Might indicate hardware or splunk limitations.
10-19-2022 00:48:01.239 -0700 INFO ServerRoles - Declared role=search_peer.
10-19-2022 00:57:07.531 -0700 INFO PipelineComponent - MetricsManager:probeandreport() took longer than seems reasonable (168575 milliseconds) in callbackRunnerThread. Might indicate hardware or splunk limitations.
10-19-2022 01:00:00.001 -0700 INFO ExecProcessor - setting reschedule_ms=3599999, for command=/opt/splunk/bin/python3.7 /opt/splunk/etc/apps/splunk_instrumentation/bin/instrumentation.py
10-19-2022 01:14:27.337 -0700 INFO PipelineComponent - MetricsManager:probeandreport() took longer than seems reasonable (513382 milliseconds) in callbackRunnerThread. Might indicate hardware or splunk limitations.
Then... lots of activity after Running job=BundleForcefulStateMachineResetJob:
10-19-2022 04:40:58.082 -0700 WARN HttpListener - Socket error from 127.0.0.1:56904 while accessing /services/data/indexes: Broken pipe
10-19-2022 04:40:59.226 -0700 INFO BundleJob - Running job=BundleForcefulStateMachineResetJob
10-19-2022 04:41:13.098 -0700 INFO PipelineComponent - CallbackRunnerThread is unusually busy, this may cause service delays: time_ms=1537142 new=0 null=0 total=57 {'name':'DistributedRestCallerCallback','valid':'1','null':'0','last':'3','time_ms':'0'},{'name':'HTTPAuthManager:timeoutCallback','valid':'1','null':'0','last':'1','time_ms':'0'},{'name':'IndexProcessor:ipCallback-0','valid':'1','null':'0','last':'6','time_ms':'4000'},{'name':'IndexProcessor:ipCallback-1','valid':'1','null':'0','last':'19','time_ms':'4000'},{'name':'IndexProcessor:ipCallback-2','valid':'1','null':'0','last':'30','time_ms':'4000'},{'name':'IndexProcessor:ipCallback-3','valid':'1','null':'0','last':'41','time_ms':'4000'},{'name':'MetricsManager:probeandreport','valid':'1','null':'0','last':'0','time_ms':'1521139'},{'name':'PullBasedPubSubSvr:timerCallback','valid':'1','null':'0','last':'2','time_ms':'0'},{'name':'ThreadedOutputProcessor:timerCallback','valid':'4','null':'0','last':'40','time_ms':'0'},{'name':'triggerCollection','valid':'45','null':'0','last':'56','time_ms':'0'}
10-19-2022 04:45:33.840 -0700 INFO KeyManagerSearchPeers - Updating public key for search peer
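To put a rough number on how long callbackRunnerThread sat stalled before the reset job fired, the reported probeandreport delays can be tallied. A minimal sketch, same assumed default log path; it just sums the "took longer than seems reasonable" values up to the reset-job line, so treat the result as approximate:

    import re

    LOG = "/opt/splunk/var/log/splunk/splunkd.log"   # assumed default location
    pattern = re.compile(r"took longer than seems reasonable \((\d+) milliseconds\)")

    total_ms = 0
    with open(LOG, errors="replace") as f:
        for line in f:
            if "BundleForcefulStateMachineResetJob" in line:
                break                                # only count stalls before the reset job
            m = pattern.search(line)
            if m:
                total_ms += int(m.group(1))

    print("Stall time reported before the reset job: %.1f minutes" % (total_ms / 60000.0))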