First, check the peer's splunkd.log for messages logged at the same time as the search-head's DistributedBundleReplicationManager error.
If the peer's splunkd.log contains messages such as:
ERROR DistBundleRestHandler - File users/xxx/yyy/local/props.conf in knowledge bundle is either not in white list or else excluded by black list. Bundle /opt/splunk/var/run/searchpeers/ will be removed
...then a rogue 'distsearch.conf' exists on the peer that does not explicitly whitelist or blacklist any bundle files; as a result, the peer rejects the bundle by default.
To work around this, remove any distsearch.conf (from system/local OR etc/apps/appname/local) on the peers and restart Splunk.
In version 6.1, new functionality was added that allows peers to whitelist/blacklist bundle contents based on locally defined rules (via a local distsearch.conf).
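As an alternative to deleting the file, a minimal peer-side distsearch.conf sketch that explicitly whitelists common bundle contents might look like the following. This assumes the 6.1 peer-side rules use the same [replicationWhitelist] stanza documented in distsearch.conf.spec; verify the stanza and pattern syntax against your version before relying on it.

# Hypothetical etc/system/local/distsearch.conf on the peer (sketch only)
[replicationWhitelist]
allConf = *.conf
allLookups = *.csv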
The culprit for this condition in our case was the leftover pooling.ini.lock file under /etc/pooling/.
Essentially, when a pool member (search-head) validates itself with the pool, a check is performed against pooling.ini and a lock is created by the requesting search-head.
The 'lock' consists of two files under /etc/pooling/.
Occasionally a search-head fails to clean up after itself and leaves its lock behind in that location. Another search-head attempting to execute a search then cannot validate itself against pooling.ini because of the stale lock; by default it waits up to 10 seconds, after which it proceeds to access pooling.ini regardless.
To confirm whether you are hitting this issue, check the following:
1) splunkd.log or btool.log (the message may appear in one or both of these files) will contain messages such as the following; a grep sketch for this check follows the list:
ERROR SearchHeadPoolInfo - Error reading search head pool info: Failed to lock //****/pool/etc/pooling/pooling.ini with code 1, possible reason: No such file or directory
2) In version 5+, the job inspector output for any search job will include, under Execution Costs, the measurement:
startup.handoff = 10000
Note: In pre-5.0 Splunk the issue may still be present, but startup.handoff is not calculated, so it is harder to verify that you have hit the condition. Also, "total run time" in the Job Inspector does NOT include the startup.handoff time.
This value is around 10 seconds, matching the timeout in the current design. In version 5.0.6 the behavior will change (SPL-66563): Splunk will optimistically attempt to open pooling.ini first and only fall back on a file-based lock around pooling.ini.
3) Another simple check is to execute ANY search and measure the time against the wall clock. If it takes around 10 seconds before you see anything in the UI, AND the above two checks are positive, then you have hit this behavior.
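For the first check, a minimal shell sketch, assuming a *nix search-head and that $SPLUNK_HOME points at your Splunk installation:

# Search both logs for the SearchHeadPoolInfo lock error quoted above
grep "Error reading search head pool info" \
    $SPLUNK_HOME/var/log/splunk/splunkd.log \
    $SPLUNK_HOME/var/log/splunk/btool.log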
If you have hit this condition, the workaround is simply removing the leftover lock files from the shared etc/pooling/ directory.
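A hedged cleanup sketch, assuming a *nix host and an illustrative /mnt/pool mount point for the shared pool storage (substitute your actual mount). pooling.ini.lock is the file named above; since the lock consists of two files, inspect the directory first and remove only stale lock files left by a search-head that no longer holds them:

# Inspect the pooled configuration directory for leftover lock files
ls -la /mnt/pool/etc/pooling/
# Remove the stale lock, prompting before deletion
rm -i /mnt/pool/etc/pooling/pooling.ini.lock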
At this time, the scheduled search maintaining the "sos_servers_cache" asset lookup that the Topology view consumes will add any newly-found search peers but will not remove those that no longer respond.
This is a limitation of the current implementation that we plan to improve on in a future release of S.o.S, where we will probably still show the non-responding peers but mark them as such ("missing" or "unresponsive").
In order to get rid of decommissioned search peers, you need to edit the $SPLUNK_HOME/etc/apps/sos/lookups/sos_servers_cache.csv lookup table and manually remove their entries. We also hope to offer a UI-driven method to do this in a future release.
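A minimal shell sketch of that manual edit, assuming a *nix search-head; the peer name "oldpeer01" is a hypothetical placeholder, and backing up the lookup before editing is prudent:

# Back up the lookup, then drop rows that reference the decommissioned peer
cp $SPLUNK_HOME/etc/apps/sos/lookups/sos_servers_cache.csv \
   $SPLUNK_HOME/etc/apps/sos/lookups/sos_servers_cache.csv.bak
grep -v "oldpeer01" $SPLUNK_HOME/etc/apps/sos/lookups/sos_servers_cache.csv.bak \
   > $SPLUNK_HOME/etc/apps/sos/lookups/sos_servers_cache.csv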
It has been observed in other cases that an antivirus scan may be holding the checkpoint file open at the same time that Splunk is attempting to rename it.
Please stop the antivirus and retry.
Splunk 5.0.9 and 6.0 include improvements targeting this particular scenario: the rename attempt will be retried again at a later time.