I am having an issue with the knowledge bundle directory not deleting old bundles. This started after upgrading from 6.3 to 6.5.1. We only have 1 search head; it keeps sending bundles to the directory on the search peer every 2 minutes and does not remove the older ones, so more than the default 5 bundles are being kept. Is there anywhere I can look to find the cause?
This is a 6.5.0 / 6.5.1 bug.
Version 6.5.2 fixes it, as mentioned in the Release Notes: https://docs.splunk.com/Documentation/Splunk/6.5.2/ReleaseNotes/6.5.2
For additional info also see: https://answers.splunk.com/answers/482121/why-is-the-search-head-distributing-entire-knowled.html#an...
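If you want to confirm a peer is hitting this before upgrading, a quick check along these lines (a sketch; paths assume a default $SPLUNK_HOME) shows the version and how far the bundle directory has grown:

#!/bin/sh
# Confirm the peer's Splunk version; the bug affects 6.5.0 / 6.5.1.
$SPLUNK_HOME/bin/splunk version
# Count retained bundle entries; far more than the default of 5 suggests cleanup isn't happening.
ls -d $SPLUNK_HOME/var/run/searchpeers/* 2>/dev/null | wc -l
# Total size of the bundle directory (can take a while once it has grown large).
du -sh $SPLUNK_HOME/var/run/searchpeers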
For anyone interested, here's the shell script that I used to (1) monitor disk usage of the searchpeers folder and (2) clean up any unused artifacts older than 24 hours. I find it frustrating that Splunk isn't "doing the right thing" out of the box, but I'm not above workaround scripts when necessary. So here goes.
Note that I ran into some timing issues on my system due to the massive volume of these files by the time I found and attempted to correct the issue. (The searchpeers folder was over 1TB and took over 12 hours to delete.) Hence the timing checks around the du and the directory-level rm. The other thing of note is that I check the "alive_tokens" in each bundle before deleting it.
I created this to run as a scripted input.
As always, "use at your own risk", "your mileage may vary", etc.
Code listing:

#!/bin/sh
# Monitor disk usage of $SPLUNK_HOME/var/run/searchpeers and remove
# bundles older than MAX_HOURS that no running search still references.
MAX_HOURS=24

test -d $SPLUNK_HOME/var/run/searchpeers || exit

links=$(stat $SPLUNK_HOME/var/run/searchpeers/ -c %h 2>/dev/null)
if [ "${links:-0}" -lt 3 ]
then
    # Empty directory. Not a search peer?... Nothing to do.
    exit
fi

# Collect metrics, timing the collection itself (du can be slow on huge folders).
start=$(date +"%s.%3N")
size_mb=$(du -sm $SPLUNK_HOME/var/run/searchpeers | cut -f1)
full_bundle=$(find $SPLUNK_HOME/var/run/searchpeers -maxdepth 1 -mindepth 1 -type f -name '*.bundle' | wc -l)
delta_bundle=$(find $SPLUNK_HOME/var/run/searchpeers -maxdepth 1 -mindepth 1 -type f -name '*.delta' | wc -l)
tmp_folders=$(find $SPLUNK_HOME/var/run/searchpeers -maxdepth 1 -mindepth 1 -type d -name '*.tmp' | wc -l)
bundle_folders=$(find $SPLUNK_HOME/var/run/searchpeers -maxdepth 1 -mindepth 1 -type d \! -name '*.tmp' | wc -l)
end=$(date +"%s.%3N")
total=$(echo "$end - $start" | bc)
echo "$(date) searchpeers_total_mb=$size_mb full_bundle=$full_bundle delta=$delta_bundle tmp=$tmp_folders bundle_folders=$bundle_folders collection_sec=$total"

if [ $bundle_folders -ge 5 ]
then
    # Do cleanup, if necessary: expanded bundle directories older than MAX_HOURS.
    find $SPLUNK_HOME/var/run/searchpeers -maxdepth 1 -mindepth 1 -type d -mmin +$((MAX_HOURS * 60)) | (
        while read bundle
        do
            # alive_tokens holds one named pipe per search still using the bundle.
            alive_searches=$(find $bundle/alive_tokens -type p 2>/dev/null | wc -l)
            if [ $alive_searches -eq 0 ]
            then
                echo "$(date) Removing old unused bundle: $bundle"
                start=$(date +"%s.%3N")
                rm -rf "$bundle"
                end=$(date +"%s.%3N")
                total=$(echo "$end - $start" | bc)
                echo "$(date) Removed bundle=$(basename $bundle) seconds=$total"
            else
                echo "$(date) Bundle $bundle still in use by $alive_searches searches. Not removing it."
            fi
        done
    )
    # Also remove stale top-level .bundle/.delta files.
    find $SPLUNK_HOME/var/run/searchpeers -maxdepth 1 -mindepth 1 -type f -mmin +$((MAX_HOURS * 60)) | \
    while read f
    do
        echo "$(date) removing old bundle file: $f"
        rm -f "$f"
    done
fi
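For reference, a scripted input stanza to run it would look roughly like this (a sketch; the app name, script path, index, and hourly interval are placeholders to adapt, not necessarily what I used):

[script://$SPLUNK_HOME/etc/apps/searchpeers_cleanup/bin/cleanup_searchpeers.sh]
interval = 3600
sourcetype = searchpeers_cleanup
index = main
disabled = 0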
Cleaning up bundles normally needs a restart of the indexer, hence my questions: do we need to restart the indexer if I use this script? What is the impact of running this script as a daily cron job? What is the downtime?
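For context, I'd run it with a daily crontab entry along these lines (the paths here are placeholders; note that cron doesn't set SPLUNK_HOME, so it has to be exported in the crontab or inside the script):

# m h dom mon dow command
SPLUNK_HOME=/opt/splunk
0 3 * * * $SPLUNK_HOME/etc/scripts/cleanup_searchpeers.sh >> $SPLUNK_HOME/var/log/splunk/cleanup_searchpeers.log 2>&1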
Seems like there should be a real fix for this. Last week we did a 6.5.3 to 6.5.1 upgrade and now we're seeing indexer hangs, a searchpeers folder over 900GB on some of the indexers (it takes over 12 minutes just to run du on the folder!), high indexer load times, and crazy high throttle_optimize index subtask times. Whatever is fundamentally going on here should be addressed in the core product.
BTW, I've noticed a very high number of .tmp folders in the searchpeers directory. There are very few ".delta" files and lots of ".bundle" files. I'm not sure if the issue is that Splunk isn't cleaning up this folder, or if there's a snowball effect where each bundle takes so long to process that Splunk can't keep up. We're seeing bundle times of over 90 seconds (and the default bundle interval is 60 seconds).
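To quantify the backlog on an indexer, a quick breakdown like this works (a sketch; it just counts what has piled up, nothing Splunk-specific):

#!/bin/sh
# Rough breakdown of what is accumulating in the bundle directory.
DIR=$SPLUNK_HOME/var/run/searchpeers
echo "tmp folders:  $(find $DIR -maxdepth 1 -mindepth 1 -type d -name '*.tmp' | wc -l)"
echo "bundle files: $(find $DIR -maxdepth 1 -mindepth 1 -type f -name '*.bundle' | wc -l)"
echo "delta files:  $(find $DIR -maxdepth 1 -mindepth 1 -type f -name '*.delta' | wc -l)"
# Age of the oldest entry hints at whether this is a snowball or a cleanup failure.
echo "oldest entry: $(ls -rt $DIR | head -1)"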
BTW, I'm seeing this issue with bundles from independent SHs as well as from the SHC, and I can reproduce it in multiple deployments.
FYI. From support:
SPL-133450: 6.5+ splunk does full bundle replication everytime - slowing down the system
This will be fixed in Splunk 6.5.2 (should be available by the end of the month); however, if you are interested, we do have an incremental patch to address this issue that is not publicly available.
BTW, this patch does NOT fix the fact that some indexers still have the "searchpeers" folder growing uncontrollably. So I'm posting my own script here for anyone else running into a similar issue.
Is this standalone or in a cluster?
We had an issue in an SHC where app.conf contained the install_source_checksum key, which caused the SHC Deployer to keep sending the app over and over again until it ran out of disk space.
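If you want to check whether any of your apps still carry that key, something like this (a sketch assuming GNU grep and default paths) lists the offenders:

#!/bin/sh
# List app.conf files that set install_source_checksum.
# On an SHC deployer the staged apps live under etc/shcluster/apps;
# on a search head or standalone instance they live under etc/apps.
for d in $SPLUNK_HOME/etc/shcluster/apps $SPLUNK_HOME/etc/apps
do
    [ -d "$d" ] && grep -Rl --include=app.conf 'install_source_checksum' "$d" 2>/dev/null
done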
It's a distributed environment: 2 SHs (one of which is also the Deployer) and 2 peers. I verified that some apps had install_source_checksum and I commented it out and restarted Splunk, but it still seems to keep pushing the bundles to the peer every minute.
I had a similar issue going from 6.3.2 to 6.4.3. It ended up resolving itself after a few days, but I had to do a lot of manual cleanup. My support ticket didn't reveal any answers.