We're standing up a new Splunk environment (3 search heads in a cluster, 10 indexers), and for a few of the search-intensive dashboards, we've been getting multiple errors and warnings regarding peers. For example:
Reading error while waiting for peer xxxxx004. Search results might be incomplete!
Reading error while waiting for peer xxxxx005. Search results might be incomplete!
Reading error while waiting for peer xxxxx006. Search results might be incomplete!
Unknown error for peer xxxxx004. Search Results might be incomplete. If this occurs frequently, please check on the peer.
Unable to distribute to peer named xxxx007 at uri=xxxxx:8089 using the uri-scheme=https because peer has status="Down". Please verify uri-scheme, connectivity to the search peer, that the search peer is up, and an adequate level of system resources are available. See the Troubleshooting Manual for more information.
Our Splunk architect has checked multiple times that the peers are up and running, and suspects this is due to sloppy, inefficient SPL (lots of 'dedup' commands); the Splunk PS folks in the room suggested optimizing the code too. It only seems to happen on dashboards with a good number of searches (8-10 queries). But we're worried there are underlying issues causing this, not just SPL, particularly because we don't see these errors on our small dev box.
Note: the data source is relatively small and is fed from Splunk DB Connect. The problem doesn't occur on our single all-in-one dev box, but does occur in our production environment, which has far more resources. For one page, there are 8 total queries, most of them fairly ordinary SPL (some dedups, setting _time from different date-string fields, filtering by date, and timecharts).
This seems to happen most often when two people are viewing the same search-intensive dashboard, or when search-intensive dashboards hit the same indices. E.g., I can trigger it myself by loading two tabs on the same page, or on related pages, at the same time. We've also had a lot of saved searches fail to run when they were scheduled at the same time (I've since spread them out), which suggests in retrospect that the saved searches were hitting a similar issue.
What else could be causing this, and is it something that can be fixed with better SPL? Could brief connectivity problems be making peers appear temporarily down?
Here's a typical query on the pages where this happens:
<query>index=sm9_us source=interaction
| fields INTERACTION_ID ASSIGNMENT TERRITORY OPEN_TIME
| dedup INTERACTION_ID
| search $territory$ ASSIGNMENT="*$assignFilter$*"
| search ASSIGNMENT=$AssignGroup$
| eval _time = `utc_to_local_tz(OPEN_TIME)`
| eval today = relative_time(now(), "+0d@d")
| eval diff = (_time - today) / 86400
| search diff>-30 diff<0
| timechart span=1d count fixedrange=false</query>
<earliest>$time_tok.earliest$</earliest>
<latest>$time_tok.latest$</latest>
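For what it's worth, a query shaped like this can often drop the dedup entirely. The sketch below is untested and makes two assumptions: `utc_to_local_tz` is your own macro, and each event carries a single INTERACTION_ID. It pushes the wildcard assignment filter into the base search (so the indexers discard events earlier) and replaces dedup + count with a distinct count:

```
index=sm9_us source=interaction ASSIGNMENT="*$assignFilter$*"
| fields INTERACTION_ID ASSIGNMENT TERRITORY OPEN_TIME
| search $territory$ ASSIGNMENT=$AssignGroup$
| eval _time = `utc_to_local_tz(OPEN_TIME)`
| where _time >= relative_time(now(), "-30d@d") AND _time < relative_time(now(), "@d")
| timechart span=1d dc(INTERACTION_ID) AS count fixedrange=false
```

One caveat: `dc(INTERACTION_ID)` per day is not strictly identical to a global dedup — an ID that appears on multiple days counts once per day here, but only once total with dedup — so verify the numbers match your intent before swapping it in.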
This smells like a network connectivity issue. If all indexers are on the same network segment/VLAN as the search heads, the next thing I would look at is the open-file ulimit, assuming this is a Linux environment. All Splunk servers should have a very high value (e.g., 64k) for this setting.
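On Linux, the persistent fix is usually an entry in /etc/security/limits.conf (or a drop-in under /etc/security/limits.d/). A sketch, assuming splunkd runs as a user named "splunk" — adjust the user name and value to your environment:

```
# /etc/security/limits.conf -- assumes splunkd runs as user "splunk"
splunk  soft  nofile  65536
splunk  hard  nofile  65536
```

Note that if Splunk is started via a systemd unit, the unit's LimitNOFILE setting takes precedence over PAM limits, so check there too.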
Run the following search:
index=_internal source=*splunkd.log ulimit open files
If any of your indexers or search heads report a low number (under 4k), then you may have found your problem.
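To make that easier to eyeball across 13 servers, you can extract the reported value per host. A sketch — the rex pattern is an assumption about the exact wording of the splunkd.log line, so adjust it to match what your instances actually log:

```
index=_internal source=*splunkd.log "ulimit" "open files"
| rex field=_raw "open files:\s+(?<open_files>\d+)"
| stats latest(open_files) AS open_files BY host
| where open_files < 65536
```

Any host this returns is a candidate for raising the limit.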
Thanks for the feedback - I've sent your response to our architect to look into; they suspect there may be some underlying network connectivity issues.
@wcooper003: Were you able to resolve the issue? If yes, how? I'm particularly interested in the error message on line 7 above.
Our architecture team mostly resolved these. From what I could gather, there were connectivity issues within the cloud hosting environment that led to this. It took months to get resolved; thankfully we were still in a testing phase - big yellow warnings on all the dashboards were troublesome...
Can you please share how the connectivity issues were resolved?
We are in the same situation, and any help would be appreciated.