Sometimes when I review Splunk logs or the Job Inspector, I see that I have searches in a zombie state. What does this mean?
Splunk calls a search a zombie when the search process is no longer running but never explicitly declared that it had finished its work.
Typically this means that the search crashed, but other scenarios exist, such as an unclean shutdown or another class of bug, like an error while writing out the completion state.
Note that this is not the same as a UNIX "zombie" process; the two are only tangentially similar.
We also had "zombie" searches. In our case, they were searches still running on the indexers but no longer on the search head.
This occurs in Splunk 4.3.x with KV_MODE=XML and logs containing invalid XML.
If you have such problems, ask Splunk for the patch release.
I had this issue today with a real-time search (Splunk Enterprise v6.2.0). The search terms weren't being picked up by the real-time search/alert, and it was alerting on terms that we were excluding. After searching Answers and the docs repeatedly, and even Googling the issue, I found nothing.
After much searching, changing of search terms, testing, and gnashing of teeth, I turned on "List in Triggered Alerts" and examined the next alert that came up in the Job Inspector. The search terms I had put in the alert were there, but the search job properties did not contain the changed search terms, so I was finally on a hot trail. I tried changing the search terms several times, but the changes never made it into the search job properties, which are what the search head sends to the indexer.
When I went to the Activity > Jobs menu and looked at that user's running jobs, I saw a number of zombie processes out there. When I looked at them in the Job Inspector, I saw that they held the very search terms that I was trying to change. So, instead of restarting the Splunk instance, I tried finalizing the running (real-time) job/alert. The zombie processes evaporated (and stopped eating my CPU brains!), then the job started back up, and it was using the correct, changed search terms. I changed the search terms a few times after that, and the running job correctly reflected the changes.
I hope this helps someone, as I spent about five hours messing with this, but it was a good lesson learned. I wasn't aware that zombie processes could prevent changes, although it makes sense. I'll have to wield the Job Inspector more often to rid my installation of zombies.
No @jrodman, my answer is not just informational; it is an operational way to correct an issue with zombied search jobs in 6.2.x, and it would be nice if zombie searches were part of the documentation (maybe in the troubleshooting section). I'm not sure why you would put that comment under my answer.
An interesting thing I ran across a few days after my answer above: a way to find zombie jobs in Splunk, via a field called isZombie! I created this search and alert if I get any results.
In essence, it shows whether a search job died for some reason while the search continues. A real-time search, or a long-running, intensive search that keeps going while another kicks off (maybe repeatedly), will certainly cause issues (as I've experienced).
| rest /services/search/jobs | search isZombie > 0 | table author id isDone isFailed isFinalized isPaused isRealTimeSearch isSaved isZombie normalizedSearch request.search request.earliesttime request.latesttime sid title updated
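The same REST endpoint can also be queried from the command line. A minimal sketch: the host, credentials, and response shape below are assumptions (the filtering step is demonstrated against a canned snippet shaped like the endpoint's output_mode=json response, not against a live server):

```shell
# Querying the jobs endpoint directly (host and credentials are assumptions):
#   curl -sk -u admin:changeme \
#     "https://localhost:8089/services/search/jobs?output_mode=json"
#
# The filtering step, demonstrated against a canned response:
response='{"entry": [
  {"content": {"sid": "1423.live", "isZombie": false}},
  {"content": {"sid": "0815.dead", "isZombie": true}}
]}'

# Keep only the sids of entries whose isZombie flag is set.
zombie_sids=$(echo "$response" | python3 -c '
import json, sys
entries = json.load(sys.stdin)["entry"]
print("\n".join(e["content"]["sid"] for e in entries if e["content"]["isZombie"]))
')
echo "$zombie_sids"
```

On a live instance you would feed the curl output into the same filter instead of the canned response.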
Found these Splunk doc links concerning zombied searches:
From the docs, the classic definition of a zombie: "Gets a value that indicates if the process running the current search job is dead, but with the search not finished."
This doc says much the same: http://dev.splunk.com/view/python-sdk/SP-CAAAEE5
From the docs: isZombie — A Boolean that indicates whether the process running the search is dead, but with the search not finished.
Lastly, search for isZombie in the Java SDK doc about Job: http://docs.splunk.com/DocumentationStatic/JavaSDK/1.0/com/splunk/Job.html
In my case the issue was prolonged high RAM usage due to a complex report.
Do note that Splunk recommends 12 GB of RAM.
I am running a test server on AWS Lightsail; it's a Bitnami distro with only 512 MB of RAM, and splunkd's RAM usage exceeds 90% during large searches.
Expanding the swap to 12 GB solved the problem.
Just follow these steps:
1. Turn all swap off: sudo swapoff -a
2. Resize the swap file: sudo dd if=/dev/zero of=/swapfile bs=1M count=1024
3. Make the swap file usable: sudo mkswap /swapfile
4. Turn swap back on: sudo swapon /swapfile
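Note that with bs=1M, dd's count is the file size in megabytes, so the count=1024 above creates a 1 GB file. A dry-run sketch of the arithmetic for the 12 GB mentioned earlier (the /swapfile path and the size are assumptions; this only prints the plan, it does not touch swap):

```shell
# Dry run: print the resize plan for a given swap size.
# swap_gb and /swapfile are assumptions; adjust to your system.
swap_gb=12
count=$((swap_gb * 1024))   # dd uses bs=1M, so count is the size in MB

echo "swapoff -a"                                      # step 1: disable all swap
echo "dd if=/dev/zero of=/swapfile bs=1M count=$count" # step 2: recreate the file
echo "mkswap /swapfile"                                # step 3: format it as swap
echo "swapon /swapfile"                                # step 4: enable it again
```

Run the printed commands with sudo once the size looks right.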
Recommended hardware capacity
The following requirements are accurate for a single-instance installation with light to moderate use. For significant enterprise and distributed deployments, see Capacity Planning.
Platform | Recommended hardware capacity/configuration
Non-Windows platforms | 2x six-core, 2+ GHz CPU, 12 GB RAM, Redundant Array of Independent Disks (RAID) 0 or 1+0, with a 64-bit OS installed.
Windows platforms | 2x six-core, 2+ GHz CPU, 12 GB RAM, RAID 0 or 1+0, with a 64-bit OS installed.