I have been receiving the message mentioned above for a few weeks, so I decided to check my splunkd.log file. Two errors appear over and over:
1)
ERROR DistributedBundleReplicationManager - Unexpected problem while uploading bundle: Unknown write error.
2)
ERROR DistributedBundleReplicationManager - Unable to upload bundle to peer named "internalsplunkurl" uri = x.x.x.x:8089.
This is appearing on all four of my peer indexers; two are local and two are geographically separated. I have added some entries to my distsearch.conf file, including [distributedSearch], [replicationSettings], and [replicationWhitelist]. I am not sure what else I can do to fix this issue.
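For reference, here is a minimal sketch of the kind of distsearch.conf entries I mean. The host names, whitelist pattern, and timeout values below are illustrative assumptions, not the values from my environment:

```ini
# distsearch.conf (on the search head) -- illustrative sketch only
[distributedSearch]
# Hypothetical peer list; replace with your indexers' management URIs.
servers = indexer1.example.com:8089,indexer2.example.com:8089

[replicationSettings]
# Seconds allowed for connecting to a peer and for send/receive operations.
connectionTimeout = 60
sendRcvTimeout = 60

[replicationWhitelist]
# Replicate all .conf files in the bundle (the rule name is arbitrary).
allConf = *.conf
```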
Thanks,
Tom Forbes
It is possible that things other than what is stated below are causing this error. Since I identified a unique root cause, I wanted to share it with everyone. The steps below summarize how I got to the root cause; the last one is what worked for me.
After several hours of searching and testing, I was able to figure out what my issue was.
Several weeks back I was working on a search, and in the process of designing it I executed a fairly general search of my indexed data and exported the results to a CSV file, intending to use that CSV as a lookup table file. The file was not particularly large, but it was significant enough to cause issues with replication of the knowledge bundle. I followed this link (https://answers.splunk.com/answers/302532/large-lookup-caused-the-replication-bundle-to-fail-1.html) and picked up on some verbiage that large CSV files can cause issues with bundle replication. I ended up deleting the file in question, and magically my search head returned to normal and I was able to query my data as expected.
Please reference the comments above for any background information that may be missing from this answer.
In the future, if I execute similar searches that produce CSV files, I plan to blacklist them in my distsearch.conf file to avoid this issue.
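As a sketch of that blacklist (the [replicationBlacklist] stanza syntax is from distsearch.conf; the file name and rule name below are hypothetical):

```ini
# distsearch.conf (on the search head)
[replicationBlacklist]
# Exclude the oversized exported lookup from the knowledge bundle.
# "my_big_export.csv" is a made-up name; match your actual file's path.
noBigExport = apps/search/lookups/my_big_export.csv
```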
Thanks for the input everyone.
Are your search heads at the local site, and are they on the same subnet? Every search you run sends a knowledge bundle to the indexers, and those errors indicate that the upload is failing. Make sure that your search heads can connect to both the local and remote indexers on TCP port 8089.
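A quick way to test that from the search head is a plain TCP connect to each peer's management port. This is a sketch using bash's /dev/tcp pseudo-device; the host names are placeholders for your indexers:

```shell
#!/usr/bin/env bash
# Verify that each indexer's splunkd management port (8089) is reachable
# from the search head. Hostnames below are placeholders.

check_port() {
  # Attempt a TCP connection; succeeds (exit 0) only if the port accepts it.
  local host=$1 port=$2
  timeout 5 bash -c ": < /dev/tcp/$host/$port" 2>/dev/null
}

for peer in indexer1.example.com indexer2.example.com; do
  if check_port "$peer" 8089; then
    echo "$peer: port 8089 reachable"
  else
    echo "$peer: port 8089 NOT reachable"
  fi
done
```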
What kind of network is between the search heads and the remote indexers? If it is a VPN tunnel then the servers need to have their MTU set to 1500 or lower to traverse the internet.
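To sanity-check the MTU along the path, you can send a non-fragmentable ping sized to the MTU you expect. With a standard 1500-byte MTU, the ICMP payload is 1472 bytes after the 20-byte IP and 8-byte ICMP headers, so a "don't fragment" ping of that size should only succeed if the whole path supports 1500. A sketch (Linux iputils syntax; the remote host is a placeholder):

```shell
# Probe path MTU with a "don't fragment" ping (Linux iputils syntax).
MTU=1500
PAYLOAD=$((MTU - 28))   # 20-byte IP header + 8-byte ICMP header
echo "probing with a ${PAYLOAD}-byte payload"
# Placeholder host -- replace with a remote indexer, then uncomment:
# ping -c 3 -M do -s "$PAYLOAD" remote-indexer.example.com
```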
I have had these errors before and it has been either the port 8089 or the MTU.
Thanks for the input.
Both of my indexers located at my main site (my main site includes (1) search head, (2) indexers, (1) master node, and (1) deployment server) are on the same subnet. My remote indexers are on the same subnet as their search head.
My main site search head has no ability to search any indexed data whatsoever, local or remote. My remote search head is able to search data from its set of local indexers and from the indexers at my main site.
Also, interestingly, each indexer has a different number of indexed events available under the search tab. For example, at my remote site, indexer (1) has access to 19,000,000+ indexed events and indexer (2) has access to 33,000,000+ indexed events. At my main site, indexer (1) has access to 3,000,000+ events and indexer (2) has access to 1,600,000,000+ events (this is not a typo). So the amounts of data available to each indexer vary wildly.
I am sure this has to do with large bundle sizes. What I am not sure of is whether the actual issue is a bandwidth problem in my network infrastructure.
omg this is it
Crank up the logging level to DEBUG for this component: DistributedBundleReplicationManager on one of your indexers. This may give you some additional clues.
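One way to do that (a sketch; the path assumes a default install) is to override the channel's level in $SPLUNK_HOME/etc/log-local.cfg and restart splunkd:

```ini
# $SPLUNK_HOME/etc/log-local.cfg -- local overrides of etc/log.cfg
[splunkd]
category.DistributedBundleReplicationManager=DEBUG
```

The same change can usually be made temporarily from the web UI under Settings > Server settings > Server logging, which avoids a restart.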
Perhaps there is a permissions problem with the directory var/run/dispatch?