Hi,
In one of our indexer clusters which we query from a search head cluster, only one of the indexers is giving this error while running a search. The error I'm getting is:
<indexer_hostname>Search process did not exit cleanly, exit_code=255, description="exited with code 255". Please look in search.log for this peer in the Job Inspector for more info.
When going through search.log for that particular indexer, all I can find is:
INFO DistributedSearchResultCollectionManager - Connecting to peer=<indexer> connectAll 0 connectToSpecificPeer 1
INFO DistributedSearchResultCollectionManager - Successfully created search result collector for peer=<indexer> in 0.002 seconds
And there aren't any ERROR entries in the search.log.
However, I did find some errors in splunkd.log for the same indexer, which are as below:
ERROR DistBundleRestHandler - Problem untarring file: /opt/splunk/var/run/searchpeers/xxx.bundle
WARN DistBundleRestHandler - There was a problem renaming: /opt/splunk/var/run/searchpeers/xxx.tmp -> /opt/splunk/var/run/searchpeers/xxxx: Directory not empty
I have seen some previous answers stating that there might not be enough free space on that particular indexer, but when I checked, there is still 40% free space available.
I couldn't figure out what the problem was, as there were no ERROR entries in the search.log. I'm on Splunk 7.1.3.
Thanks in advance.
The error was solved when I increased the ulimit for open files on the indexers to the recommended 64000; initially it was 4096.
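In case it helps someone else, this is roughly how we checked and raised it (just a sketch; the exact file depends on whether Splunk is started via init or systemd, and it assumes splunkd runs as a user named "splunk"):
# Check the current open-file limit for the user that runs splunkd
su - splunk -c 'ulimit -n'
# For init-style startup, raise it in /etc/security/limits.conf:
#   splunk  soft  nofile  64000
#   splunk  hard  nofile  64000
# For a systemd-managed service, set LimitNOFILE=64000 in the unit (or an override) instead.
# Restart Splunk on the indexer so the new limit takes effect
/opt/splunk/bin/splunk restart
After the restart, the ulimit values splunkd logs at startup in splunkd.log can be a quick way to confirm the new limit was actually picked up.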
We had this error before and it turned out to be IO bound. The search peer's IO throughput was very low, so it was unable to handle the request properly.
Then how did you solve it?
Increase available IO on the host. That may be non-trivial to do unless it's virtual, but that's what we did. Either increase the speed of the disks, add more disks, or decrease other IO load.
We did this after trying several other exit_code=255 fixes (there seem to be many ways to get this error) and finding out that the IOPS on that particular box did not meet the minimum spec.
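If you want to sanity-check the disk yourself before changing hardware, something like this gives a rough picture under real search load (a sketch; iostat comes from the sysstat package, and the device name is just an example):
# Extended per-device stats every 5 seconds; watch r/s + w/s (IOPS) and %util
iostat -x 5
# Narrow it to the device backing your hot/warm volume, e.g. sda (example name)
iostat -x 5 sda
If %util sits near 100% while the combined r/s + w/s stays well below what the storage is rated for, the peer is almost certainly IO bound.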
The IOPS are above the recommended specs, but our ulimit for open files was below the recommendation, and even after increasing it the error is still present. For now I have quarantined the problematic indexers, but I'm looking for a solution so that I can actually resolve the issue.
Do you know any other solutions that might resolve this?
Thanks
Have you confirmed that the permissions are correct on your Search Peer? [edit, originally my response said SH]
It's easily done: accidentally running ./splunk stop && ./splunk start as root. Unfortunately, the next time you restart the service as splunk, the permissions are messed up.
Check that the contents of /opt/splunk/var/run/searchpeers/ are all owned by "splunk" and not "root".
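Something like this is a quick way to spot and fix it (a sketch, assuming the Splunk user and group are both called "splunk" and Splunk lives in /opt/splunk):
# List anything under searchpeers that is not owned by the splunk user
find /opt/splunk/var/run/searchpeers -not -user splunk -ls
# If root-owned files turn up, stop Splunk, fix the ownership, and start it again
/opt/splunk/bin/splunk stop
chown -R splunk:splunk /opt/splunk/var/run/searchpeers
/opt/splunk/bin/splunk start
If the instance was ever started as root, other directories under /opt/splunk may be affected too, in which case a chown -R over the whole install (while Splunk is stopped) is the usual cleanup.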
We haven't restarted any of the search heads recently. The ownership of /opt/splunk/var/run/searchpeers/ is splunk on the search heads and root on the indexers, which I think is how it should be.
Sorry - I meant the Search Peer (not Head) - will amend
Who does Splunk run as on the peers? (best practice suggests it should not be root - in which case, that could very well be your issue)
There are some reasons you might run as root, but that is a separate conversation 🙂
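A quick way to confirm on each peer (a sketch):
# Show which OS user owns the running splunkd processes
ps -eo user,pid,args | grep '[s]plunkd'
# If boot-start was enabled with a specific user, it is recorded as SPLUNK_OS_USER here
grep SPLUNK_OS_USER /opt/splunk/etc/splunk-launch.conf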
I'm not sure what it is best to run as, but all the other indexers, as well as this one, have been running as root from the beginning, which was a long time ago. As of now I'm receiving this error only on one of the indexers in the cluster.
Were you looking at the search log on the offending indexer? The one in $SPLUNK_HOME/var/run/splunk/dispatch//search.log?
No, I was looking on the search head.
I did look for the search.log on the problem indexer; its dispatch folder contains multiple remote-search directories from the various search heads. In those search.log files, this is the common ERROR entry I found:
ERROR dispatchRunner - RunDispatch::runDispatchThread threw error: Application does not exist: <app_name_which_exists>
But the app names in those entries were all different, and all of those apps were created long ago; none of the apps were created recently.
Those logs will get purged after the search has expired. I'd suggest rerunning the search that is causing the problem, getting the sid from the Job Inspector, and then going to find that specific search.log on the indexer. And when you do find it, you may want to make a copy elsewhere, since it will roll eventually.
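Something along these lines once you have the sid (a sketch; <sid> and <searchhead> are placeholders, and on an indexer the remote artifact directory is usually prefixed with remote_):
# On the indexer: find the dispatch directory for that specific search
ls -d /opt/splunk/var/run/splunk/dispatch/*<sid>*
# Copy its search.log somewhere safe before the artifact expires and gets reaped
cp /opt/splunk/var/run/splunk/dispatch/remote_<searchhead>_<sid>/search.log /tmp/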
Currently I've stopped Splunk on that indexer, as it gives the error for every search, not just one particular search. Even if I search an index that is not on that indexer, the search head gives me that error.
Hmm... OK. So you have errors about untarring the search bundle and errors about apps not existing. I wonder if you can look at the search bundle on that indexer and see if the app is in there? $SPLUNK_HOME/var/run/searchpeers, I believe, is the location.
In there are the search bundles from the search heads, which contain all of the config the indexers need to run the searches. If they're not making it there, then maybe the indexer is throwing an error because it doesn't have the context to run the search?
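A sketch of how to check (the bundle name below is a placeholder; a .bundle is just a tar archive, which is why the splunkd.log error above is about untarring it):
# On the indexer: list the replicated bundles and any leftover .tmp directories
ls -l /opt/splunk/var/run/searchpeers/
# Check whether the app actually made it into a given bundle
tar -tf /opt/splunk/var/run/searchpeers/<searchhead>-<epoch>.bundle | grep -i <app_name>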
Yes, all the apps are present in the $SPLUNK_HOME/var/run/searchpeers location. I see the bundle from the deployer, and all the apps present on the search heads are in there too.
Well, I now see that none of the search heads' bundles are on the indexer, and in the deployer's bundle the app for which the error is reported does not exist. So, as you said, maybe the indexer doesn't have the required bundle. What might be the issue, and is there any remediation for this?