Something weird started happening in our Splunk environment with the ITSI native saved search service_health_monitor.
This search started getting 100% skipped with the reason: "The maximum number of concurrent running jobs for this historical scheduled search on this cluster has been reached."
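For anyone wanting to reproduce the check, the skip breakdown can be pulled from the scheduler logs with something like this (a sketch assuming the default _internal index and the stock scheduler.log fields):

index=_internal source=*scheduler.log* savedsearch_name="service_health_monitor"
| stats count by status, reason

The status=skipped rows should carry the same "maximum number of concurrent running jobs" reason quoted above.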
So, I checked the jobs section and found that the previous run was stuck at x% < 100, which prevented the next scheduled run from starting. I tried deleting the stuck job so that the next run could proceed, but the next run showed the same behaviour, i.e., it also got stuck partway through.
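Stuck jobs can also be listed outside the UI with a REST search along these lines (a sketch; label assumes the scheduler dispatched the job under the saved search's name, and doneProgress is a 0-to-1 fraction):

| rest /services/search/jobs
| search isDone=0 label="service_health_monitor"
| table sid, dispatchState, doneProgress, runDuration

A job parked at doneProgress < 1 with a runDuration far beyond the schedule interval is the one blocking the next run.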
Inspect Job shows that most of the time was spent on startup.handoff, and below is the tail of savedsearch.log: after the noop processor opens (BEGIN OPEN: Processor=noop), Splunk appears to hang; the next entry arrives roughly seven minutes later.
Please share any insights that could help take the investigation further.
09-07-2020 17:41:05.680 INFO LocalCollector - Final required fields list = Message,_raw,_subsecond,_time,alert_level,alert_severity,app,index,indexed_is_service_max_severity_event,is_service_in_maintenance,itsi_kpi_id,itsi_service_id,kpi,kpiid,prestats_reserved_*,psrsvd_*,scoretype,service,serviceid,source,urgency
09-07-2020 17:41:05.680 INFO UserManager - Unwound user context: splunk-system-user -> NULL
09-07-2020 17:41:05.680 INFO UserManager - Setting user context: splunk-system-user
09-07-2020 17:41:05.680 INFO UserManager - Done setting user context: NULL -> splunk-system-user
09-07-2020 17:41:05.680 INFO UserManager - Unwound user context: splunk-system-user -> NULL
09-07-2020 17:41:06.105 INFO UserManager - Unwound user context: splunk-system-user -> NULL
09-07-2020 17:41:06.105 INFO UserManager - Unwound user context: splunk-system-user -> NULL
09-07-2020 17:41:06.105 INFO UserManager - Unwound user context: splunk-system-user -> NULL
09-07-2020 17:41:06.105 INFO UserManager - Unwound user context: splunk-system-user -> NULL
09-07-2020 17:41:06.105 INFO UserManager - Unwound user context: splunk-system-user -> NULL
09-07-2020 17:41:06.171 INFO ChunkedExternProcessor - Exiting custom search command after getinfo since we are in preview mode:gethealth
09-07-2020 17:41:06.177 INFO SearchOrchestrator - Starting the status control thread.
09-07-2020 17:41:06.177 INFO SearchOrchestrator - Starting phase=1
09-07-2020 17:41:06.177 INFO UserManager - Setting user context: splunk-system-user
09-07-2020 17:41:06.177 INFO UserManager - Setting user context: splunk-system-user
09-07-2020 17:41:06.177 INFO UserManager - Done setting user context: NULL -> splunk-system-user
09-07-2020 17:41:06.177 INFO UserManager - Done setting user context: NULL -> splunk-system-user
09-07-2020 17:41:06.177 INFO ReducePhaseExecutor - Stating phase_1
09-07-2020 17:41:06.177 INFO SearchStatusEnforcer - Enforcing disk quota = 26214400000
09-07-2020 17:41:06.177 INFO PreviewExecutor - Preview Enforcing initialization done
09-07-2020 17:41:06.177 INFO DispatchExecutor - BEGIN OPEN: Processor=stats
09-07-2020 17:41:06.209 INFO ResultsCollationProcessor - Writing remote_event_providers.csv to disk
09-07-2020 17:41:06.209 INFO DispatchExecutor - END OPEN: Processor=stats
09-07-2020 17:41:06.209 INFO DispatchExecutor - BEGIN OPEN: Processor=gethealth
09-07-2020 17:41:06.217 INFO DispatchExecutor - END OPEN: Processor=gethealth
09-07-2020 17:41:06.217 INFO DispatchExecutor - BEGIN OPEN: Processor=noop
09-07-2020 17:48:07.948 INFO ReducePhaseExecutor - ReducePhaseExecutor=1 action=PREVIEW
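For context, the scheduler's recorded run times can be trended to see when the runs started overrunning their schedule. A sketch, again assuming the default _internal scheduler logs (run_time is in seconds):

index=_internal source=*scheduler.log* savedsearch_name="service_health_monitor"
| timechart span=15m max(run_time) AS max_run_time_seconds, count(eval(status=="skipped")) AS skipped_runs

A sustained jump in max_run_time_seconds right before skipped_runs climbs would line up with the stuck-job behaviour described above.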