Hi everyone,
I'm looking for some help with a Splunk issue I recently encountered. A user's search job consumed a large amount of disk space, and despite the default job expiration time being 10 minutes, this particular job was not deleted for about 36 hours.
The _internal logs showed this WARN message:
08-17-2025 06:46:13.183 +0000 WARN DispatchManager [4118198 TcpChannelThread] - enforceQuotas: username="[user_name]", search_id="[search_id]" - QUEUED reason="The maximum disk usage quota for this user has been reached.", concurrency_limit="500"
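For reference, a quick way to see which dispatch artifacts are actually holding the disk space is to size the dispatch directory directly. This assumes a default $SPLUNK_HOME; adjust the path if yours differs:

    # Size of each job's dispatch artifact, largest first
    du -sh $SPLUNK_HOME/var/run/splunk/dispatch/* | sort -rh | head -20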
I've heard a theory that if a search job fails midway due to a quota issue, Splunk sometimes marks it as a "zombie" job. These jobs aren't cleaned up by the standard expiration process but are instead handled by the dispatch_cleanup process, which runs much less frequently (e.g., every 36 hours by default).
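While waiting for confirmation, one concrete thing that can be checked is whether Splunk itself flags the job as a zombie, since isZombie and dispatchState are exposed as job properties over REST. This is only a sketch - the management port, credentials, and search_id below are placeholders:

    # Does Splunk consider this job a zombie, and what TTL does it currently carry?
    curl -sk -u admin:changeme \
      "https://localhost:8089/services/search/jobs/<search_id>?output_mode=json" \
      | grep -oE '"(isZombie|dispatchState|ttl|diskUsage)":[^,]*'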
Can anyone confirm if this theory is correct? I haven't been able to find official documentation to validate it. Any insights or documentation links would be greatly appreciated!
Thanks!
Thanks, @tej57. I understand your point, but I'd like to clarify something: why specifically 36 hours? My understanding was that the expiration should have been 10 minutes, so I'm curious how it ended up showing 36 hours.
From what I gather, the search artifacts are governed by the clean_dispatch process, which runs on a default 36-hour interval. That’s why we’re seeing that number. The search itself likely still had its own expiration time (e.g., 10 minutes for a user search), but the cleanup of its artifacts only happens when clean_dispatch runs. So it’s not that the search’s expiration was changed to 36 hours; rather, the artifacts remained until the clean_dispatch cycle, which is set to 36 hours.
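To see which TTL values are actually in effect on the search head, rather than relying on defaults, btool can print the effective [search] stanza of limits.conf along with the file each value comes from. This is a sketch assuming a default install; the ttl-related settings it returns are whatever your configuration defines:

    # Effective [search] settings in limits.conf, filtered to TTL-related values
    $SPLUNK_HOME/bin/splunk btool limits list search --debug | grep -i ttl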
Can you confirm if this is the reason why it’s showing 36 hours instead of 10 minutes? For example, could it be that Splunk marked it as a zombie search and extended the expiration to 36 hours so that clean_dispatch could clean it up?
Yes. This query is initiated by my SOAR playbook, so there is no manual intervention. It runs when a trigger occurs (not on a schedule) and has run successfully before. However, on the day this issue occurred, the query ran three times with varying intervals and durations. The first run completed in around 2 minutes, the second run failed after about 1.5 hours, and the final run was cancelled after approximately 8.5 hours. The size of the artifacts sent to the indexer grew significantly during the second and third executions. The expiration time changed to 36 hours with the final run. Could the 8.5-hour run and the 36-hour expiration time be related?
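For what it's worth, comparing the three artifacts on disk should show which run is still holding the space and when it was last touched. The sids below are placeholders for the three playbook runs:

    # Artifact size and last-modified time for each of the three runs
    du -sh $SPLUNK_HOME/var/run/splunk/dispatch/<sid_run1> \
           $SPLUNK_HOME/var/run/splunk/dispatch/<sid_run2> \
           $SPLUNK_HOME/var/run/splunk/dispatch/<sid_run3>
    ls -ldt $SPLUNK_HOME/var/run/splunk/dispatch/<sid_run1> \
            $SPLUNK_HOME/var/run/splunk/dispatch/<sid_run2> \
            $SPLUNK_HOME/var/run/splunk/dispatch/<sid_run3>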
Hello @RookieSplunker,
Yes, your understanding is correct, and you'll need to clean up the job's dispatch directory manually. I found the following in the Splunk documentation - https://help.splunk.com/en/splunk-enterprise/search/search-manual/9.4/manage-jobs/dispatch-directory...
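For reference, the manual cleanup that page describes can be done with the clean-dispatch CLI command. As I read the doc, it moves artifacts older than the given cutoff into a directory you choose rather than deleting them outright, so you can remove that directory afterwards once you're sure nothing in it is needed. The destination path and the 2-day cutoff below are just examples:

    # Move dispatch artifacts older than 2 days into a holding directory
    mkdir -p /opt/splunk/old-dispatch-jobs
    $SPLUNK_HOME/bin/splunk clean-dispatch /opt/splunk/old-dispatch-jobs/ -2d@d

    # Remove the holding directory once you have confirmed nothing in it is needed
    rm -rf /opt/splunk/old-dispatch-jobs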
Thanks,
Tejas.
---
If the above solution is helpful, an upvote is appreciated!