I am having an issue with the knowledge bundle directory not deleting old bundles. This started after upgrading from 6.3 to 6.5.1. We only have 1 search head that keeps sending bundles to the directory on the search peer every 2 minutes and does not remove the older ones. There are more than the default 5 that are kept. Is there any place that I can look to find the cause?
I ended up just writing a PowerShell script to retain the 5 bundles and delete the rest.
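The PowerShell script itself was not posted. For readers on Linux peers, a minimal shell sketch of the same "keep the newest 5, delete the rest" idea could look like the following. The function name `prune_bundles` and its arguments are hypothetical, and the sketch assumes bundle directory names contain no newlines. It does not check whether a bundle is still in use by a running search.

```shell
#!/bin/sh
# Sketch only: keep the newest N bundle directories, remove the older ones.
# prune_bundles is a hypothetical helper, not part of Splunk.
prune_bundles() {
    dir=$1          # searchpeers directory to prune
    keep=${2:-5}    # how many of the newest directories to retain
    # ls -1td lists the subdirectories newest first (with a trailing slash);
    # tail skips the first $keep lines, leaving only the older directories.
    ls -1td -- "$dir"/*/ 2>/dev/null | tail -n +"$((keep + 1))" |
    while IFS= read -r bundle
    do
        echo "Removing old bundle: $bundle"
        rm -rf -- "$bundle"
    done
}
```

It might be called as `prune_bundles "$SPLUNK_HOME/var/run/searchpeers" 5`, ideally after a dry run with the `rm` line commented out.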
This is a 6.5.0 / 6.5.1 bug.
Version 6.5.2 fixes it, as mentioned in the release notes: https://docs.splunk.com/Documentation/Splunk/6.5.2/ReleaseNotes/6.5.2
For additional info also see: https://answers.splunk.com/answers/482121/why-is-the-search-head-distributing-entire-knowled.html#an...
For anyone interested, here's the shell script that I used to (1) monitor disk usage of the searchpeers folder and (2) clean up any unused artifacts older than 24 hours. I find it frustrating that Splunk isn't "doing the right thing" out of the box. But I'm not above workaround scripts when necessary. So here goes.
Note that I ran into some timing issues on my system due to the massive volume of these files by the time I found and attempted to correct the issue. (The searchpeers folder was over 1TB and took over 12 hours to delete.) Hence the time checks around the du and the directory-related rm. The other thing of note is that I check the "alive_tokens" directory in each bundle before deleting it.
I created this to run as a scripted input.
As always, "use at your own risk" and "your mileage may vary", ...
Code listing for searchpeers.sh:
#!/bin/sh
# Relies on GNU stat, date, and find (Linux).
MAX_HOURS=24
# Age threshold in minutes. $((...)) is POSIX; the bash-only $[...] fails under dash.
MAX_MINUTES=$((MAX_HOURS * 60))
test -d "$SPLUNK_HOME/var/run/searchpeers" || exit
# A hard-link count below 3 means no subdirectories (on filesystems that count them this way).
links=$(stat -c %h "$SPLUNK_HOME/var/run/searchpeers/" 2>/dev/null)
if [ "${links:-0}" -lt 3 ]
then
    # Empty directory. Not a search peer?... Nothing to do.
    exit
fi
# Collect usage metrics and time how long the collection takes.
start=$(date +"%s.%3N")
size_mb=$(du -sm "$SPLUNK_HOME/var/run/searchpeers" | cut -f1)
full_bundle=$(find "$SPLUNK_HOME/var/run/searchpeers" -maxdepth 1 -mindepth 1 -type f -name '*.bundle' | wc -l)
delta_bundle=$(find "$SPLUNK_HOME/var/run/searchpeers" -maxdepth 1 -mindepth 1 -type f -name '*.delta' | wc -l)
tmp_folders=$(find "$SPLUNK_HOME/var/run/searchpeers" -maxdepth 1 -mindepth 1 -type d -name '*.tmp' | wc -l)
bundle_folders=$(find "$SPLUNK_HOME/var/run/searchpeers" -maxdepth 1 -mindepth 1 -type d \! -name '*.tmp' | wc -l)
end=$(date +"%s.%3N")
total=$(echo "$end - $start" | bc)
echo "$(date) searchpeers_total_mb=$size_mb full_bundle=$full_bundle delta=$delta_bundle tmp=$tmp_folders bundle_folders=$bundle_folders collection_sec=$total"
if [ "$bundle_folders" -ge 5 ]
then
    # Do cleanup, if necessary
    find "$SPLUNK_HOME/var/run/searchpeers" -maxdepth 1 -mindepth 1 -type d -mmin +"$MAX_MINUTES" | (
        while IFS= read -r bundle
        do
            #echo Looking at directory $bundle
            # Count named pipes under alive_tokens; any hit is treated as a search still using the bundle.
            alive_searches=$(find "$bundle/alive_tokens" -type p | wc -l)
            if [ "$alive_searches" -eq 0 ]
            then
                echo "$(date) Removing old unused bundle: $bundle"
                start=$(date +"%s.%3N")
                rm -rf "$bundle"
                end=$(date +"%s.%3N")
                total=$(echo "$end - $start" | bc)
                echo "$(date) Removed bundle=$(basename "$bundle") seconds=$total"
            else
                echo "$(date) Bundle $bundle still in use by $alive_searches searches. Not removing it."
            fi
        done
    )
    # Remove old top-level files (.bundle, .delta) past the age threshold.
    find "$SPLUNK_HOME/var/run/searchpeers" -maxdepth 1 -mindepth 1 -type f -mmin +"$MAX_MINUTES" | \
    while IFS= read -r f
    do
        echo "$(date) removing old bundle file: $f"
        rm -f "$f"
    done
fi
Very good script, but it won't work when the destination folder is a symlink. To fix that, add -L to each find command in the script 😉.
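To illustrate why -L matters: by default find does not follow a symlink given as its start path, so with -mindepth 1 it prints nothing when searchpeers is itself a symlink. The following is a small sketch of that behavior. `count_entries` is a hypothetical helper name used only for this demonstration.

```shell
#!/bin/sh
# Sketch only: count the top-level entries of a path with find,
# optionally following a symlinked start path via -L.
count_entries() {
    # $1 = path to examine, $2 = "follow" to pass -L to find
    if [ "$2" = "follow" ]
    then
        find -L "$1" -maxdepth 1 -mindepth 1 | wc -l
    else
        find "$1" -maxdepth 1 -mindepth 1 | wc -l
    fi
}
```

Calling `count_entries /path/to/link` on a symlinked directory reports no entries, while `count_entries /path/to/link follow` reports the entries of the directory the link points to. Note that -L has to come before the start path.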
Cleaning up bundles normally requires a restart of the indexer. Do we still need to restart the indexer if I use this script? What is the impact of running this script from cron on a daily basis? Is there any downtime?
Thanks,
I did something similar, although my script deletes anything over 24 hours old...
Seems like there should be a real fix to this. Last week we did a 6.5.3 to 6.5.1 upgrade and are now seeing indexer hangs, a searchpeers folder that is over 900GB on some of the indexers (it takes over 12 mins just to run du on the folder!), high indexer load times, and crazy high throttle_optimize index subtask times. Just seems like whatever is fundamentally going on here should be addressed in the core product.
BTW, I've noticed a very high number of .tmp folders in the searchpeers directory. There are very few ".delta" files and lots of ".bundle" files. I'm not sure if the issue is that Splunk isn't cleaning up this folder, or if there is a snowball effect where each bundle takes so long to process that Splunk can't keep up. We're seeing bundle times of over 90 seconds. (And the default bundle interval is 60 seconds.)
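The snowball hypothesis can be put into numbers with a rough sketch. It assumes, purely for illustration, that bundles are handled strictly one at a time, which may not match how Splunk actually processes them. `backlog_per_hour` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: bundles left unprocessed after one hour if one arrives every
# interval_sec seconds and each takes process_sec seconds, handled serially.
backlog_per_hour() {
    interval_sec=$1
    process_sec=$2
    arrived=$((3600 / interval_sec))
    handled=$((3600 / process_sec))
    if [ "$handled" -ge "$arrived" ]
    then
        echo 0
    else
        echo $((arrived - handled))
    fi
}
```

With the figures above, `backlog_per_hour 60 90` gives 60 arrivals against 40 handled, so about 20 bundles would pile up every hour under that assumption.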
BTW, seeing this issue on both bundles from independent SHs as well as from the SHC. And I can reproduce it in multiple deployments.
FYI. From support:
SPL-133450: 6.5+ splunk does full bundle replication everytime - slowing down the system
This will be fixed in Splunk 6.5.2 (should be available by the end of the month). However, if you are interested, we do have an incremental patch to address this issue that is not publicly available.
BTW, this patch does NOT fix the fact that some indexers still have the "searchpeers" folder growing uncontrollably. So I'm posting my own script here for anyone else running into a similar issue.
Is this standalone or in a cluster?
We had an issue in a SHC, where the app.conf contained the install_source_checksum key, which caused the SHC Deployer to keep sending the app over and over again until it ran out of disk space.
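One way to check for this is to search the app directories for app.conf files that set the key. The sketch below is an assumption about how one might do that. `find_checksum_apps` is a hypothetical helper name, and which directory to scan depends on the deployment.

```shell
#!/bin/sh
# Sketch only: list every app.conf under a directory that sets
# install_source_checksum. Lines where the key is commented out are ignored.
find_checksum_apps() {
    apps_dir=$1
    find "$apps_dir" -type f -name app.conf -exec \
        grep -l '^[[:space:]]*install_source_checksum[[:space:]]*=' {} +
}
```

For example, `find_checksum_apps "$SPLUNK_HOME/etc/shcluster/apps"` on a deployer, or `find_checksum_apps "$SPLUNK_HOME/etc/apps"` on a search head.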
It's a distributed environment: 2 SHs (one of them is a deployer) and 2 peers. I verified that some apps had install_source_checksum, commented it out, and restarted Splunk, but it still seems to keep pushing the bundles to the peer every minute.
I had a similar issue going from 6.3.2 to 6.4.3. It ended up resolving itself after a few days, but I had to do a lot of manual cleanup. My support ticket didn't reveal any answers.