Something strange started happening in our Splunk environment with the ITSI native saved search: service_health_monitor
This search started getting skipped 100% of the time with the reason: The maximum number of concurrent running jobs for this historical scheduled search on this cluster has been reached
So I checked the Jobs section and found that the search was stuck running at x% < 100, which meant the next scheduled run could not start. I deleted that job so the search could run on the next schedule, but the next run showed the same behaviour, i.e., stuck halfway.
Inspecting the job shows that most of the time was spent on startup.handoff. Below is what I can see at the end of savedsearch.log: after the noop processor (BEGIN OPEN: Processor=noop), Splunk seems to be stuck.
Please provide any insights that could help in investigating further.
09-07-2020 17:41:05.680 INFO LocalCollector - Final required fields list = Message,_raw,_subsecond,_time,alert_level,alert_severity,app,index,indexed_is_service_max_severity_event,is_service_in_maintenance,itsi_kpi_id,itsi_service_id,kpi,kpiid,prestats_reserved_*,psrsvd_*,scoretype,service,serviceid,source,urgency
09-07-2020 17:41:05.680 INFO UserManager - Unwound user context: splunk-system-user -> NULL
09-07-2020 17:41:05.680 INFO UserManager - Setting user context: splunk-system-user
09-07-2020 17:41:05.680 INFO UserManager - Done setting user context: NULL -> splunk-system-user
09-07-2020 17:41:05.680 INFO UserManager - Unwound user context: splunk-system-user -> NULL
09-07-2020 17:41:06.105 INFO UserManager - Unwound user context: splunk-system-user -> NULL
09-07-2020 17:41:06.105 INFO UserManager - Unwound user context: splunk-system-user -> NULL
09-07-2020 17:41:06.105 INFO UserManager - Unwound user context: splunk-system-user -> NULL
09-07-2020 17:41:06.105 INFO UserManager - Unwound user context: splunk-system-user -> NULL
09-07-2020 17:41:06.105 INFO UserManager - Unwound user context: splunk-system-user -> NULL
09-07-2020 17:41:06.171 INFO ChunkedExternProcessor - Exiting custom search command after getinfo since we are in preview mode:gethealth
09-07-2020 17:41:06.177 INFO SearchOrchestrator - Starting the status control thread.
09-07-2020 17:41:06.177 INFO SearchOrchestrator - Starting phase=1
09-07-2020 17:41:06.177 INFO UserManager - Setting user context: splunk-system-user
09-07-2020 17:41:06.177 INFO UserManager - Setting user context: splunk-system-user
09-07-2020 17:41:06.177 INFO UserManager - Done setting user context: NULL -> splunk-system-user
09-07-2020 17:41:06.177 INFO UserManager - Done setting user context: NULL -> splunk-system-user
09-07-2020 17:41:06.177 INFO ReducePhaseExecutor - Stating phase_1
09-07-2020 17:41:06.177 INFO SearchStatusEnforcer - Enforcing disk quota = 26214400000
09-07-2020 17:41:06.177 INFO PreviewExecutor - Preview Enforcing initialization done
09-07-2020 17:41:06.177 INFO DispatchExecutor - BEGIN OPEN: Processor=stats
09-07-2020 17:41:06.209 INFO ResultsCollationProcessor - Writing remote_event_providers.csv to disk
09-07-2020 17:41:06.209 INFO DispatchExecutor - END OPEN: Processor=stats
09-07-2020 17:41:06.209 INFO DispatchExecutor - BEGIN OPEN: Processor=gethealth
09-07-2020 17:41:06.217 INFO DispatchExecutor - END OPEN: Processor=gethealth
09-07-2020 17:41:06.217 INFO DispatchExecutor - BEGIN OPEN: Processor=noop
09-07-2020 17:48:07.948 INFO ReducePhaseExecutor - ReducePhaseExecutor=1 action=PREVIEW
@Nisha18789 did you find a solution to this issue? We have the same issue and are wondering if we can change the cron schedule to run every 10 minutes instead of every minute.
Hi @pagillar, yes, we identified that the issue was caused by some services having Unicode characters in the service name, which was making the service health monitor search get skipped. We identified those services and renamed them, which fixed the issue.
I am not sure how easy it is to identify that in your environment, but here is how we identified it:
Get the list of all service names using a REST API query or a lookup, then use any online Unicode lookup tool to find the offending ones.
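Instead of an online lookup tool, the check can also be scripted. Below is a minimal sketch in Python that flags service names containing non-ASCII (Unicode) characters; the example service names and the idea of feeding it names exported from the ITSI service lookup or REST endpoint are assumptions, not part of the original solution.

```python
# Hypothetical sketch: given a list of service names (e.g. exported from an
# ITSI service lookup or a REST query), report any name that contains
# non-ASCII characters, along with the position and code point of each one.

def find_non_ascii_names(names):
    """Return (name, offenders) pairs, where offenders is a list of
    (index, character, code point) tuples for each non-ASCII character."""
    flagged = []
    for name in names:
        offenders = [(i, ch, hex(ord(ch)))
                     for i, ch in enumerate(name) if ord(ch) > 127]
        if offenders:
            flagged.append((name, offenders))
    return flagged

if __name__ == "__main__":
    # Example input; "payments\u200b_service" hides a zero-width space.
    services = ["web_frontend", "payments\u200b_service", "café_db"]
    for name, offenders in find_non_ascii_names(services):
        print(f"{name!r} contains non-ASCII characters: {offenders}")
```

Checking code points rather than eyeballing the names catches invisible characters such as zero-width spaces, which an online lookup tool would also reveal but which are easy to miss by hand.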
Hope this helps!
@Nisha18789 Thanks