Splunk Search

Why are we getting error "Timed out waiting for peer XXX", but the search status=success?

Explorer

To monitor if my nightly searches ran properly I'm looking at:

index=_internal sourcetype=scheduler earliest=@d | <few_more_filtering>

but I've just noticed that in case of a receiveTimeout error for one of the involved peers, the "status" field in the resulting events contains the value "success", even if opening the search results from the job list I can see an error:

Timed out waiting for peer XXX. If this occurs frequently, receiveTimeout in distsearch.conf may need to be increased. Search results might be incomplete!

I tried to run a global search like:

 splunk_server=* index=* "Timed out waiting for peer"

But nothing is popping up.

Is there a way to set up an alert in case a search ran, but failed or had any issues? The "status" field doesn't seem to cover the latter scenario...


Re: Why are we getting error "Timed out waiting for peer XXX", but the search status=success?

Communicator

index=* will not give you results from the _internal index. Try:

index=_internal splunk_server=* "Timed out waiting for peer"

Re: Why are we getting error "Timed out waiting for peer XXX", but the search status=success?

Explorer

Yeah, I forgot to say I'd already tried with index=_* too, but nothing there either.


Re: Why are we getting error "Timed out waiting for peer XXX", but the search status=success?

Explorer

As a workaround I'm now checking the messages field from the REST API (i.e. /services/search/jobs); the error messages are available there.
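The workaround above can be sketched in Python. This parses the JSON a /services/search/jobs?output_mode=json call returns and flags jobs whose messages include ERROR entries; the exact field layout can vary between Splunk versions, so the sample payload below is hypothetical and should be checked against your own instance.

```python
import json

def jobs_with_errors(jobs_json: str):
    """Return (sid, [error texts]) pairs for jobs that report ERROR messages."""
    feed = json.loads(jobs_json)
    flagged = []
    for entry in feed.get("entry", []):
        content = entry.get("content", {})
        # Collect the text of every message the job marked as an ERROR.
        errors = [m.get("text", "")
                  for m in content.get("messages", [])
                  if m.get("type") == "ERROR"]
        if errors:
            flagged.append((content.get("sid"), errors))
    return flagged

# Hypothetical sample payload, for illustration only: one job finished
# "successfully" but still carries the peer-timeout error message.
sample = json.dumps({
    "entry": [
        {"content": {"sid": "sched_1", "dispatchState": "DONE",
                     "messages": [{"type": "ERROR",
                                   "text": "Timed out waiting for peer XXX."}]}},
        {"content": {"sid": "sched_2", "dispatchState": "DONE",
                     "messages": []}},
    ]
})

print(jobs_with_errors(sample))
```

A small script like this, run after the nightly schedule window, could feed an alert for exactly the "ran, but had issues" case the status field misses.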

I still think the status field in the scheduler log events should be set to something other than success when something actually went wrong 😉


Re: Why are we getting error "Timed out waiting for peer XXX", but the search status=success?

Splunk Employee

This error occurs when your Search Head sends a search job to a Search Peer (usually one of your Indexers) and the peer does not respond within the default timeout period. The search continues without that peer, which of course probably means some of your events are not returned and the results are incomplete. In my experience the problem can often be cleared simply by restarting the Splunk instance on the Indexer in question, but sometimes you need to dig deeper: something is keeping the Indexer so busy that it cannot reliably respond to search requests even though its Splunk instance is running. Misconfigured or misbehaving load balancers, or other load-shifting equipment sitting between your Search Head and your Indexer peers, can also commonly cause this.
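If raising the timeout is the right fix (rather than addressing whatever is keeping the Indexer busy), the error message itself points at receiveTimeout in distsearch.conf on the Search Head. A minimal sketch, assuming it lives in the [distributedSearch] stanza and takes a value in seconds (check the distsearch.conf spec for your Splunk version before applying):

```
# distsearch.conf on the Search Head
[distributedSearch]
# How long to wait for a peer to return search results before giving up.
receiveTimeout = 600
```

A restart (or a config reload) of the Search Head is typically needed for the change to take effect.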
