Deployment Architecture

This keeps appearing in my search head: "Problem replicating config (bundle) to search peer x.x.x.x:8089, error while transmitting bundle data."

thomas_forbes
Communicator

I have been receiving the message mentioned above for a few weeks, so I decided to check my splunkd.log file. Two errors appear over and over:

1)

ERROR DistributedBundleReplicationManager - Unexpected problem while uploading bundle:  Unknown write error.  

2)

ERROR DistributedBundleReplicationManager - Unable to upload bundle to peer named "internalsplunkurl" uri = x.x.x.x:8089.  

This is appearing on all four of my peer indexers: two are local and two are geographically separated. I have added some entries to my distsearch.conf file, including [distributedSearch], [replicationSettings], and [replicationWhitelist]. I am not sure what else I can do to fix this issue.
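For context, the distsearch.conf stanzas mentioned above look roughly like this. This is a hypothetical sketch: the server addresses, timeout values, and whitelist pattern are placeholders, not the actual configuration from this environment.

```ini
# distsearch.conf on the search head (placeholder values throughout)

[distributedSearch]
# Peers this search head distributes searches to:
servers = x.x.x.x:8089, y.y.y.y:8089

[replicationSettings]
# Seconds allowed to establish a connection to a peer:
connectionTimeout = 60
# Seconds allowed for sending/receiving bundle data:
sendRcvTimeout = 60

[replicationWhitelist]
# Only files matching these patterns are included in the bundle:
allConf = *.conf
```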

Thanks,
Tom Forbes

1 Solution

thomas_forbes
Communicator

After several hours of searching and testing I was able to figure out what my issue was.

Several weeks back I was working on a search, and in the process of designing it I executed a pretty general search of my indexed data and exported the results to a CSV file. My reasoning was that I wanted to use the exported CSV as a lookup table file. The file was not particularly large, but it was significant enough to cause issues with replication of the knowledge bundle. I followed this link: https://answers.splunk.com/answers/302532/large-lookup-caused-the-replication-bundle-to-fail-1.html and picked up on some verbiage that large CSV files can cause issues with bundle replication. I ended up deleting the file in question, and magically my search head returned to normal and I was able to query my data as expected.

Please reference the comments above for any background information that may be missing from this answer.

In the future, if I plan to execute similar searches that produce CSV files, I plan to blacklist them in my distsearch.conf file to avoid this issue.
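A hedged sketch of what such a blacklist could look like. The stanza key `excludeLookups` and the file path are hypothetical; the real pattern would name the actual exported CSV.

```ini
# distsearch.conf on the search head
[replicationBlacklist]
# Keep large exported CSVs out of the knowledge bundle
# (hypothetical key name and path):
excludeLookups = apps/search/lookups/big_export.csv
```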

Thanks for the input everyone.


mhouse333
Loves-to-Learn

It is possible that things other than what is stated above are causing this error. Since I identified a unique root cause, I wanted to share it with everyone. The last bullet below is what worked for me, but the bullets as a whole summarize the recommended steps for getting to root cause.

  • First verify that the size of the bundle being sent from the SH is not greater than the bundle size limit on the SH (maxBundleSize in distsearch.conf) or on the indexer (max_content_length in server.conf).
  • Then check for permissions/ownership errors on all the instances by running “ls -lahR /opt/splunk | grep root”.
  • Then run ./splunk btool check.
  • Then check the CM bundle details and compare whether the latest active bundle on the peers is the same as on the CM.
  • Then run the top command to see whether any process is using a significant percentage of CPU over Splunk. A newly introduced application could be preventing writes from taking place for long periods because files are locked by that other application. This can be further verified by:
    • Running “sudo tcpdump host <ipaddressofsourceSH>” on each indexer, then attempting your search from the SH and seeing whether the traffic comes across.
    • If that fails, there is an application in your environment preventing Splunk from doing what it needs to do, and you need to apply for a Splunk exception for the recently introduced application.
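The size limits in the first bullet live in two different files on two different tiers. A hedged sketch follows; the values shown are illustrative defaults, not recommendations for this environment.

```ini
# distsearch.conf on the search head:
[replicationSettings]
# Maximum bundle size the SH will push, in MB:
maxBundleSize = 2048

# server.conf on each indexer:
[httpServer]
# Maximum HTTP payload the management port accepts, in bytes (~2 GB):
max_content_length = 2147483648
```

If the bundle on disk is larger than either limit, replication fails before any network issue comes into play.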


lycollicott
Motivator

Are your search heads at the local site and are they on the same subnet? Every search you run will send a bundle to the indexers and those errors indicate that is failing. Make sure that your search heads can connect to both the local and remote indexers on TCP port 8089.

What kind of network is between the search heads and the remote indexers? If it is a VPN tunnel then the servers need to have their MTU set to 1500 or lower to traverse the internet.

I have had these errors before and it has been either the port 8089 or the MTU.
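Both checks can be scripted from the search head. This is a sketch under stated assumptions: `x.x.x.x` stands in for an indexer address, the flags assume Linux `ping` and `nc`, and the script only prints the commands rather than running them against a live host.

```shell
# Hypothetical connectivity checks from the search head (Linux assumed).
INDEXER="x.x.x.x"

# 1) Management-port reachability: bundle uploads go over TCP 8089.
echo "check: nc -zv -w 5 $INDEXER 8089"

# 2) Path-MTU probe: an ICMP payload of MTU - 28 bytes (20-byte IP header
#    + 8-byte ICMP header) with the don't-fragment flag set will fail if
#    the tunnel's path MTU is smaller than the interface MTU.
MTU=1500
PAYLOAD=$((MTU - 28))
echo "check: ping -M do -c 3 -s $PAYLOAD $INDEXER"
```

If the 1472-byte probe fails but a smaller payload (say, 1372) succeeds, the VPN path has a reduced MTU and the server interfaces need to be lowered to match.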

thomas_forbes
Communicator

Thanks for the input.

Both of my indexers located at my main site (my main site includes (1) search head, (2) indexers, (1) master node, and (1) deployment server) are on the same subnet. My remote indexers are on the same subnet as their search head.

My main site search head has no ability to search any indexed data whatsoever, local or remote. My remote search head does have the ability to search data from its set of local indexers and from the indexers at my main site.

Also, interestingly, each indexer has a different number of indexed events available under the search tab. For example, at my remote site, indexer (1) has access to 19,000,000+ indexed events and indexer (2) has access to 33,000,000+ indexed events. At my main site, indexer (1) has access to 3,000,000+ events and indexer (2) has access to 1,600,000,000+ events (this is not a typo either). So the amount of data available to each indexer varies wildly.

I am sure this has to do with large bundle sizes. What I am not sure of is whether the actual issue is bandwidth limitations in my network infrastructure.


morethanyell
Contributor

omg this is it


sjohnson_splunk
Splunk Employee
Splunk Employee

Crank up the logging level to DEBUG for this component: DistributedBundleReplicationManager on one of your indexers. This may give you some additional clues.
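One way to do that without editing log.cfg directly (a sketch; verify the exact category name against your own log.cfg) is a log-local.cfg override, which survives upgrades:

```ini
# $SPLUNK_HOME/etc/log-local.cfg (create the file if it does not exist):
[splunkd]
category.DistributedBundleReplicationManager = DEBUG
```

A restart picks it up; on versions that support it, `./splunk set log-level DistributedBundleReplicationManager -level DEBUG` changes it at runtime instead.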

Perhaps there is a permissions problem with the $SPLUNK_HOME/var/run/splunk/dispatch directory?
