Bundle replication response code 500

coreyf311
Path Finder

We have a Splunk Enterprise 7.0.2 search head cluster. We are seeing errors like the one below.

Search peer splunksh-01 has the following message: bundle replication: problem replicating config (bundle) to search peer 'splunkidx-01', HTTP response code 500 (HTTP/1.1 500 read timeout). read timeout (unknown write error).

Our setup is entirely virtual, running on VMware.
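When this message appears repeatedly, it can help to tally which search peer it names, since a single peer that keeps failing points at that host rather than at the bundle. Below is a minimal Python sketch of that idea. The regular expression is modeled only on the one message quoted above, so the exact wording it expects is an assumption, and the helper names `parse_bundle_error` and `count_failing_peers` are hypothetical.

```python
import re
from collections import Counter

# Pattern modeled on the single message quoted in this thread. Other Splunk
# versions may word the message differently, so treat it as an assumption.
BUNDLE_ERROR = re.compile(
    r"Search peer (?P<reporter>\S+) has the following message: "
    r"bundle replication: problem replicating config \(bundle\) "
    r"to search peer '(?P<peer>[^']+)', "
    r"HTTP response code (?P<code>\d{3})"
)


def parse_bundle_error(message):
    """Return (reporter, peer, http_code), or None if the text does not match."""
    match = BUNDLE_ERROR.search(message)
    if match is None:
        return None
    return match["reporter"], match["peer"], int(match["code"])


def count_failing_peers(messages):
    """Count how often each target peer is named in matching messages."""
    counts = Counter()
    for message in messages:
        parsed = parse_bundle_error(message)
        if parsed is not None:
            counts[parsed[1]] += 1
    return counts


sample = (
    "Search peer splunksh-01 has the following message: bundle replication: "
    "problem replicating config (bundle) to search peer 'splunkidx-01', "
    "HTTP response code 500 (HTTP/1.1 500 read timeout). "
    "read timeout (unknown write error)."
)
```

Calling `count_failing_peers` on a list of collected messages returns a `Counter` keyed by peer name. If one indexer dominates the counts, that host is the first place to look.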

mhouse333
Loves-to-Learn Lots

Causes other than the ones discussed above may also produce this error. Since I identified a unique root cause, I wanted to share it. The last bullet below is what worked for me, but taken together the bullets summarize the recommended steps for getting to the root cause.

  • First, verify that the size of the bundle being sent from the SH is not greater than the bundle size limit on the SH (maxBundleSize in distsearch.conf) or on the indexer (max_content_length in server.conf).
  • Then check for permission/ownership errors on all instances by running “ls -lahR /opt/splunk | grep root”.
  • Then run ./splunk btool check.
  • Then check the CM bundle details and compare whether the latest active bundle on the peers is the same as on the CM.
  • Then run the top command to see whether any process other than Splunk is using a significant percentage of CPU. A newly introduced application could be preventing writes for long periods by locking files. This can be further verified as follows:
    • Run “sudo tcpdump <ipaddressofsourceSH>” on each indexer, then run your search from the SH and check whether traffic from the SH arrives.
    • If it does not, an application in your environment is preventing Splunk from doing what it needs to do, and you need to apply for a Splunk exception for the recently introduced application.
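
The first two checks in the list above can be sketched in Python. This is only an illustration under stated assumptions: it assumes maxBundleSize is expressed in megabytes and max_content_length in bytes (confirm against the spec files for your version), and the helper names `bundle_limit_problems` and `paths_owned_by` are hypothetical.

```python
import os


def bundle_limit_problems(bundle_bytes, max_bundle_size_mb, max_content_length_bytes):
    """Compare a bundle size against both limits and list any that are exceeded.

    Assumption: maxBundleSize is given in MB and max_content_length in bytes.
    """
    problems = []
    if bundle_bytes > max_bundle_size_mb * 1024 * 1024:
        problems.append("bundle exceeds maxBundleSize")
    if bundle_bytes > max_content_length_bytes:
        problems.append("bundle exceeds max_content_length")
    return problems


def paths_owned_by(root_dir, uid=0):
    """Walk root_dir and return the paths owned by the numeric uid (0 is root).

    Rough equivalent of the 'ls -lahR ... | grep root' check, but matching on
    the owner only instead of on any line that contains the word 'root'.
    """
    owned = []
    for dirpath, dirnames, filenames in os.walk(root_dir):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.lstat(path).st_uid == uid:
                    owned.append(path)
            except OSError:
                # The file vanished or is unreadable; skip it.
                continue
    return sorted(owned)
```

For example, `paths_owned_by("/opt/splunk")` lists root-owned files under the install directory, and `bundle_limit_problems(os.path.getsize(bundle_path), limit_mb, limit_bytes)` checks one bundle file, where `bundle_path` and both limits are values you supply from your own environment.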

Rob2520
Communicator

@coreyf311 did you find a resolution to this issue? I have started seeing the same error popping up on and off on my search heads.

stefan_d
Path Finder

Same error here, along with some other timeout errors. Running 7.2.3 on VMware.

How do I prove whether the fault lies in the virtual infrastructure or in the application?

skalliger
SplunkTrust

Hm, this could be either a timeout issue or a bug related to bundle replication (have you thought about upgrading to a higher 7.0.x version?). You could also try setting a higher connectionTimeout under the replicationSettings stanza in your distsearch.conf and see whether that helps.
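
As a rough illustration of that change, the Python sketch below sets connectionTimeout in the replicationSettings stanza of a distsearch.conf given as text. The value 120 is a hypothetical example, not a recommendation, and the helper name `raise_connection_timeout` is made up for this sketch. Note that Python's configparser handles only simple INI-style files and drops comments, so for a real config it is safer to edit the file by hand.

```python
import configparser
import io


def raise_connection_timeout(conf_text, seconds):
    """Return conf_text with [replicationSettings] connectionTimeout set to seconds."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    # Keep setting names exactly as written instead of lowercasing them.
    parser.optionxform = str
    parser.read_string(conf_text)
    if not parser.has_section("replicationSettings"):
        parser.add_section("replicationSettings")
    parser.set("replicationSettings", "connectionTimeout", str(seconds))
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


# Hypothetical starting file and timeout value, for illustration only.
updated = raise_connection_timeout("[replicationSettings]\nmaxBundleSize = 2048\n", 120)
print(updated)
```

Existing settings in the stanza are preserved, and the stanza is created if the file does not have one yet.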

Edit: You could also take a look into this diagnosis doc.

Skalli

coreyf311
Path Finder

I don't know what the exact issue was, but the problem was resolved when we migrated that indexer from one ESXi host to another. The problem was on the ESXi host, but I am not sure of the exact cause, as I am not on the team that supports VMware.