ITSI gives me messages like this frequently:
Refresh queue job: ['5da491033e5b6c4c9510ecf2', '5da491033e5b6c4c9510ecf3'] is stuck. Please confirm and manually delete this job from the queue.
Note: The values inside the brackets vary, but the rest of the message is always the same. I've noticed that some KPI base searches seem to just quit working, which I think is probably related to this message.
I've searched all over the place for instructions and also opened a ticket. It seems that nobody knows the mysterious process referenced in the message to "confirm and manually delete this job from the queue". What am I confirming? How do I delete the job from the queue?
I have no idea what this is telling me or how to perform the recommended action.
Alternatively, if that doesn't work, you can omit the jobID from the curl command and clear the whole refresh queue. Clearing the whole queue does have a caveat: if you had changes that were committed into the queue, you may have to go back and redo them, because the job that was going to commit them to the proper kvstore collection(s) has now been removed.
Alternatively, if the job(s) are no longer in the queue, then they have been removed from the queue naturally and the messages that you are seeing in the UI can be disregarded. I'm looking into whether we have a setting to reduce the number of messages you're receiving for that specific error, or whether this is something we may need to file an enhancement request for. I'll update if I find anything. Cheers!
EDIT: I managed to get a dev's ear about the frequency of the errors. It looks like we run a search in the background once every ~30 minutes specifically to check for any refresh queue job issues. According to the devs, we removed this message in either 4.2 or 4.3 so that the UI didn't get so bogged down with erroneous messages.
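To make the "are the job(s) still in the queue?" check above less manual, here is a minimal Python sketch. It only parses the UI message text and compares the IDs against a list of IDs you have already pulled from the refresh queue yourself. The function names `extract_job_ids` and `still_queued` are hypothetical helpers, not part of ITSI, and the sketch assumes the message always has the bracketed, single-quoted form shown in the question.

```python
import re
from typing import Iterable, List

# Assumption: the message always looks like
#   Refresh queue job: ['<id>', '<id>'] is stuck. ...
MESSAGE_PATTERN = re.compile(r"Refresh queue job:\s*\[(.*?)\]\s*is stuck")


def extract_job_ids(message: str) -> List[str]:
    """Return the job IDs listed inside the brackets of the message."""
    match = MESSAGE_PATTERN.search(message)
    if match is None:
        return []
    # Each ID is wrapped in single quotes inside the brackets.
    return re.findall(r"'([^']+)'", match.group(1))


def still_queued(message_ids: Iterable[str], queued_ids: Iterable[str]) -> List[str]:
    """Return the IDs from the message that are still present in the queue.

    queued_ids is whatever list of IDs you exported from the refresh
    queue; this sketch does not talk to Splunk itself.
    """
    queued = set(queued_ids)
    return [job_id for job_id in message_ids if job_id in queued]
```

If `still_queued` returns an empty list, the jobs named in the message have already left the queue and, per the answer above, the message can be disregarded.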
I have seen errors like this when the OS's filesystem is corrupt. The way to check is to go to the CLI as root and try to delete the files yourself. If you get an error message, research that error. If the filesystem is corrupt, the only option is to stop the server, unmount the drive, and run fsck on it.
I have never faced this issue before, so my suggestion is to delete the job from the CLI or the web interface. The admin role is required to delete the job either way.
The jobs path is $SPLUNK_HOME/var/run/splunk/dispatch/
Check the job inspector to confirm the job SID and find the job. If you are running on Linux, use rm to delete it.
I don't know why, but sometimes when you delete the job from the web interface you can still see it in the job inspector, so I believe it is safer to delete it from the CLI. I hope this helps you delete the job, but I don't have any idea why this happens.
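As a rough illustration of the CLI approach above, here is a small Python sketch that builds the dispatch path for a SID and removes it only when asked to. It assumes the stuck job really does correspond to a search job with a SID under the dispatch directory, and that each SID has its own entry there. The helper names and the `/opt/splunk` fallback for `SPLUNK_HOME` are assumptions for illustration, not anything Splunk ships.

```python
import os
import shutil


def dispatch_dir_for(sid, splunk_home=None):
    """Build $SPLUNK_HOME/var/run/splunk/dispatch/<sid> for a given SID."""
    # Refuse anything that could escape the dispatch directory.
    if not sid or sid in (".", "..") or os.path.basename(sid) != sid:
        raise ValueError("unexpected SID: %r" % (sid,))
    # Assumption: fall back to /opt/splunk when SPLUNK_HOME is not set.
    home = splunk_home or os.environ.get("SPLUNK_HOME", "/opt/splunk")
    return os.path.join(home, "var", "run", "splunk", "dispatch", sid)


def remove_dispatch_dir(sid, splunk_home=None, dry_run=True):
    """Return the job's dispatch path, deleting it only if dry_run is False.

    Returns None when no directory exists for that SID.
    """
    path = dispatch_dir_for(sid, splunk_home)
    if not os.path.isdir(path):
        return None
    if not dry_run:
        shutil.rmtree(path)
    return path
```

Running it with `dry_run=True` first shows which path would be removed, which is a cheap way to confirm the SID from the job inspector before deleting anything.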