Deployment Architecture

Splunk distributed search peers not working as expected. There are multiple error logs

umeshagarwal008
Explorer

Hi All,

We have 4 search heads (non-clustered) and 16 search peers (non-clustered). Each search head points to all 16 search peers.

Recently one of our search heads started freezing and no searches would run. We tried disabling and re-enabling the search peers, but the problem remained. While testing, we disabled the first three search peers and searches started working.

Searches work now, but as soon as we enable any one (or all three) of the disabled search peers, the search head freezes again and no searches run.

I have tried restarting the search head and the peers, with no improvement.
I have also deleted and re-added the search peers in the server's config file, still with no improvement.

Below are the error logs I noted on the search head for those peers. All the disabled peers show similar errors:

01-02-2019 02:12:06.146 +0100 WARN DistributedPeerManager - Unable to distribute to peer named
at uri= using the uri-scheme=https because peer has status="Down". Please verify uri-scheme,
connectivity to the search peer, that the search peer is up, and an adequate level of system resources are available.
See the Troubleshooting Manual for more information.

01-02-2019 02:11:35.352 +0100 WARN DistributedPeer - Peer:
Unable to get server info from services/server/info due to:
Connect Timeout; exceeded 10000 milliseconds

01-02-2019 02:10:24.314 +0100 INFO StatusMgr - destHost=, destIp=, destPort=9997,
eventType=connect_fail, publisher=tcpout, sourcePort=8089, statusee=TcpOutputProcessor

01-02-2019 01:38:01.074 +0100 WARN DistributedBundleReplicationManager - replicateDelta: failed for peer=,
uri=,
cur_time=1546386051, cur_checksum=1546386051, prev_time=1546381229, prev_checksum=4121658182606070965,
delta=/opt/splunk/var/run/-1546381229-1546386051.delta

01-02-2019 01:38:01.074 +0100 ERROR DistributedBundleReplicationManager - Reading reply to upload: rv=-2,
Receive from= timed out; exceeded 60sec,
as per=distsearch.conf/[replicationSettings]/sendRcvTimeout

01-02-2019 01:36:04.709 +0100 WARN DistributedPeerManager - Unable to distribute to peer named at
uri because replication was unsuccessful.
replicationStatus Failed failure info: failed_because_BUNDLE_DATA_TRANSMIT_FAILURE

11-07-2018 10:51:07.007 +0100 WARN DistributedPeer - Peer: Unable to get bundle list

11-27-2018 20:12:16.688 +0100 WARN DistributedPeer - Peer:
Unable to get server info from /services/server/info due to: No route to host
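For reference, the timeouts cited in these warnings map to settings in distsearch.conf on the search head. A sketch, assuming defaults are in effect (the [replicationSettings]/sendRcvTimeout name comes straight from the ERROR log above; the other setting names are from the distsearch.conf spec, and all values here are illustrative, not recommendations):

```
[distributedSearch]
# Timeouts (in seconds) for connecting to and exchanging data with search peers.
connectionTimeout = 30
sendTimeout = 60
receiveTimeout = 60

[replicationSettings]
# The 60-second bundle-upload timeout cited in the ERROR log above.
sendRcvTimeout = 120
```

Note that raising timeouts only papers over a slow or unreachable network path; treat this as a diagnostic knob rather than a fix.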

Any help would be much appreciated.

  • Umesh
0 Karma
1 Solution

umeshagarwal008
Explorer

Sorry for being late on this. I was able to solve this by copying the bundles from a working indexer to the non-working indexer.
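For anyone wondering which files these are: on a search peer, replicated knowledge bundles live under $SPLUNK_HOME/var/run/searchpeers/ by default. A minimal sketch of locating the newest bundle on a working peer (the paths and the .bundle extension are Splunk defaults; the broken-peer hostname in the comments is a placeholder):

```shell
#!/bin/sh
# Sketch only: find the most recent knowledge bundle on a working peer.
# SPLUNK_HOME and the searchpeers path below are Splunk defaults; adjust
# for your layout.
SPLUNK_HOME=${SPLUNK_HOME:-/opt/splunk}
BUNDLE_DIR="$SPLUNK_HOME/var/run/searchpeers"

# Newest .bundle file by modification time (empty if none exist).
newest=$(ls -t "$BUNDLE_DIR"/*.bundle 2>/dev/null | head -n 1)
echo "Newest bundle: $newest"

# Then copy it to the same directory on the broken peer and restart
# splunkd there, e.g. (placeholder hostname):
#   scp "$newest" broken-peer:/opt/splunk/var/run/searchpeers/
#   ssh broken-peer /opt/splunk/bin/splunk restart
```

The commented scp/ssh lines show the copy step itself; only run them once you have confirmed the source peer's bundle is current.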


0 Karma


ram254481493
Explorer

Hi, could you please explain the file names of the bundles that you moved?

0 Karma

richgalloway
SplunkTrust

@umeshagarwal008 If your problem is resolved, please accept an answer to help future readers.

---
If this reply helps you, Karma would be appreciated.
0 Karma

BainM
Communicator

This is a big giveaway:

11-27-2018 20:12:16.688 +0100 WARN DistributedPeer - Peer: 
Unable to get server info from /services/server/info due to: No route to host

Check your DNS settings.

0 Karma

harsmarvania57
SplunkTrust

As @lakshman239 mentioned, this looks like a network issue; "Connect Timeout" and "No route to host" errors generally occur when there is a firewall block or a routing problem.

0 Karma

umeshagarwal008
Explorer

02-06-2019 11:14:24.183 +0100 INFO StatusMgr - destHost=, destIp=, destPort=9997, eventType=connect_done, publisher=tcpout, sourcePort=8089, statusee=TcpOutputProcessor

This is the current status.

0 Karma

harsmarvania57
SplunkTrust

Can you please try to telnet from the search head to the indexer on port 8089?

0 Karma

umeshagarwal008
Explorer

Just checked. It's connecting.

0 Karma

umeshagarwal008
Explorer

Trying ...
Connected to (URL)
Escape character is '^]'.
Connection closed by foreign host.

0 Karma

lakshman239
SplunkTrust

Please go through https://docs.splunk.com/Documentation/Splunk/7.2.3/DistSearch/Limittheknowledgebundlesize

https://docs.splunk.com/Documentation/Splunk/latest/DistSearch/Whatsearchheadssend

Sometimes your search head may try to send a lot of large CSV files and app configurations that the indexers don't need. Check your environment and, if you have such a scenario, try to blacklist them. This will help reduce bandwidth usage and remove the bundle replication errors.
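A sketch of such a blacklist in distsearch.conf on the search head (the [replicationBlacklist] stanza is from the distsearch.conf spec; the rule name and file path are examples, and patterns follow Splunk's whitelist/blacklist wildcard rules, where ... matches across directory separators):

```
[replicationBlacklist]
# Example rule: keep a large lookup CSV out of the knowledge bundle.
# The rule name and file name here are illustrative.
excludeBigLookup = apps/.../lookups/huge_lookup.csv
```

Keep in mind that blacklisted files are no longer available to searches running on the peers, so only exclude content the indexers genuinely never need.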

0 Karma

woodcock
Esteemed Legend

Open a case with support.

0 Karma

umeshagarwal008
Explorer

Yes, that's the last option we have.

0 Karma

lakshman239
SplunkTrust

Did you check connectivity from the SH to the indexers on the required ports (e.g. 8089)? Any chance this broke in the recent past, or the servers moved to a different network segment (IP address), so that connectivity takes longer and times out?

0 Karma

BainM
Communicator

As lakshman239 says, we need a little more info. What kind of load do you have on your systems? How are memory and CPU doing on your SHs and indexers? How many searches are running at any one moment? You can check all of this in your DMC or Monitoring Console (Settings -> DMC or Monitoring).

0 Karma

umeshagarwal008
Explorer

Everything looks good in the DMC. After further investigation, it looks like the issue is with bundle replication.
I am trying to copy one set of bundles from a working search peer to a disabled search peer for that search head to see if it works.

0 Karma

umeshagarwal008
Explorer

Sure, let me gather that information and share it with you.

0 Karma