I sometimes receive the following error message in my search head pooling (SHP) environment (Splunk 4.3.5) when executing a search:
ERROR: Reached end-of-stream while waiting for more data from peer
I would like to know a few things about this message and how to find the root cause.
This message is raised by the distributed search framework whenever a search peer ceases to respond and/or send data mid-stream.
The most common reason for this is that the remote search process on the search peer reported in the error has crashed.
The next steps to take in this investigation are as follows:
1. Check $SPLUNK_HOME/var/log/splunk for any crash log files corresponding to the time at which you saw the error.
2. Check $SPLUNK_HOME/var/log/splunk/splunkd.log around the time at which the error was observed in the UI for anything relevant recorded by the main splunkd process.
3. Check the dispatch artifact at $SPLUNK_HOME/var/run/splunk/dispatch/remote_$SID. In that artifact, look at the end of the search.log file for any indicators of the root cause of the crash (a script for sweeping these locations is sketched below).
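If you want to script a first pass over these locations, a minimal sketch along the following lines may help. It assumes the usual crash-*.log naming for crash logs, that $SPLUNK_HOME is set in the environment, and that you already know the remote search ID; the time and SID in the example are hypothetical, so adjust them to your deployment.

```python
import glob
import os
from datetime import datetime, timedelta

SPLUNK_HOME = os.environ.get("SPLUNK_HOME", "/opt/splunk")
LOG_DIR = os.path.join(SPLUNK_HOME, "var", "log", "splunk")
DISPATCH_DIR = os.path.join(SPLUNK_HOME, "var", "run", "splunk", "dispatch")

def crash_logs_near(error_time, window_minutes=30):
    """List crash logs whose modification time falls near the time the error was seen.

    Assumes crash logs are written as crash-*.log in var/log/splunk.
    """
    hits = []
    for path in glob.glob(os.path.join(LOG_DIR, "crash-*.log")):
        mtime = datetime.fromtimestamp(os.path.getmtime(path))
        if abs(mtime - error_time) <= timedelta(minutes=window_minutes):
            hits.append(path)
    return hits

def tail_search_log(sid, lines=50):
    """Return the last few lines of search.log from a remote dispatch artifact.

    The remote_<sid> directory name comes from the steps above.
    """
    log_path = os.path.join(DISPATCH_DIR, "remote_" + sid, "search.log")
    with open(log_path, errors="replace") as f:
        return f.readlines()[-lines:]

if __name__ == "__main__":
    when = datetime(2013, 1, 15, 14, 30)  # hypothetical time the error appeared
    print(crash_logs_near(when))
    print("".join(tail_search_log("scheduler__admin__search_12345")))  # hypothetical SID
```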
Unfortunately, this error message is very ambiguous: it really only says that the socket Splunk was listening on was not closed by Splunk itself. How or what closed that socket, even if it was closed cleanly, is something Splunk has no information about.
There are a variety of possible reasons this message can appear, and in a distributed search on older versions of Splunk it can definitely be a red herring; you may wish to double-check the results of the search against the raw events. Because of the wide variety of reasons the socket may close and trigger this warning, it is hard to say what the root cause might be: it ranges from the OS doing some cleanup, to timeout settings, network latency, cluster timeouts, performance limitations, and so on.
Whether the search uses streaming or non-streaming search commands has essentially no effect here, since that is a different kind of "streaming" from the data stream referenced in the error. The only exception is a search performance issue, where making the search more performant may help avoid the error.
Troubleshooting this error should start with checking the metrics for the search that generated the message (audit.log, metrics.log, and the search.log from the dispatch artifact). In other words, find out whether there is a performance issue with the search, as this is one of the most common causes of the socket being closed prematurely and the error appearing; a sketch for spotting blocked queues in metrics.log follows below. Alongside that, splunkd.log will give an indication of what splunkd is doing at the time of the error, but look at events both before and at the time the error occurs, as there may be an underlying issue. Examples include a timeout reaching a search peer, or the splunkd process spending a lot of time in one of its processing queues. If the error happens with every search and is a global problem, focus on finding errors in splunkd; if it occurs only with certain searches, investigate those search jobs.
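As a rough way to spot queue pressure around the time of the error, you can count how often each queue reports itself as blocked in metrics.log. This is only a sketch: it assumes the standard "group=queue, name=<queue>, ..., blocked=true" fields in metrics.log, and you would still want to narrow the file down to the relevant time range.

```python
import os
import re
from collections import Counter

SPLUNK_HOME = os.environ.get("SPLUNK_HOME", "/opt/splunk")
METRICS_LOG = os.path.join(SPLUNK_HOME, "var", "log", "splunk", "metrics.log")

def blocked_queue_counts(path=METRICS_LOG):
    """Count blocked=true occurrences per queue name in metrics.log.

    A queue that is blocked over and over again points at a performance
    bottleneck worth investigating before chasing the socket error itself.
    """
    counts = Counter()
    pattern = re.compile(r"group=queue,\s*name=([^,\s]+).*blocked=true")
    with open(path, errors="replace") as f:
        for line in f:
            m = pattern.search(line)
            if m:
                counts[m.group(1)] += 1
    return counts

if __name__ == "__main__":
    for queue, n in blocked_queue_counts().most_common():
        print(f"{queue}: blocked {n} times")
```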
When opening a case with Splunk Support for this error message, include a diag file and the dispatch folder from the run of the search that caused the warning message to be displayed; a sketch for gathering both is shown below. Depending on the issue, more information may be required, and the support team will request it when needed.
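If it helps to automate gathering that evidence, a minimal sketch along these lines can produce both artifacts. It assumes $SPLUNK_HOME is set and that you know the search ID (SID) of the offending job; the SID shown is hypothetical, and on a search peer the dispatch directory is named remote_<sid> rather than <sid>.

```python
import os
import shutil
import subprocess

SPLUNK_HOME = os.environ.get("SPLUNK_HOME", "/opt/splunk")

def collect_support_bundle(sid, out_dir="/tmp/splunk_case"):
    """Run 'splunk diag' and archive the dispatch artifact for one search.

    Pass the job ID of the search that triggered the warning; use the
    remote_<sid> directory instead when collecting from a search peer.
    """
    os.makedirs(out_dir, exist_ok=True)

    # Run 'splunk diag'; check the command's output for where it writes the diag tarball.
    subprocess.run([os.path.join(SPLUNK_HOME, "bin", "splunk"), "diag"], check=True)

    # Archive the dispatch artifact for the search so it can be attached to the case.
    dispatch = os.path.join(SPLUNK_HOME, "var", "run", "splunk", "dispatch", sid)
    shutil.make_archive(os.path.join(out_dir, sid + "_dispatch"), "gztar", dispatch)

if __name__ == "__main__":
    collect_support_bundle("1358260200.1234")  # hypothetical SID
```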