Getting Data In

Why is the knowledge bundle directory filling up after 6.5.1 upgrade?

Communicator

I am having an issue with the knowledge bundle directory not deleting old bundles. This started after upgrading from 6.3 to 6.5.1. We only have 1 search head that keeps sending bundles to the directory on the search peer every 2 minutes and does not remove the older ones. There are more than the default 5 that are kept. Is there any place that I can look to find the cause?
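To quantify the problem before fixing it, a quick check like the hypothetical helper below counts the extracted bundle directories a peer is holding (the function name and example path are my own, not from Splunk; adjust as needed):

```shell
#!/bin/sh
# Hedged diagnostic sketch: count extracted knowledge-bundle directories
# on a search peer, excluding *.tmp staging folders.
count_bundles() {
    find "$1" -maxdepth 1 -mindepth 1 -type d ! -name '*.tmp' | wc -l
}

# Example (run on a search peer):
# count_bundles "$SPLUNK_HOME/var/run/searchpeers"
```

Anything well above the default retention of 5 suggests old bundles are not being reaped.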

1 Solution

Communicator

I ended up just writing a PowerShell script to retain the 5 bundles and delete the rest.
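For anyone on Linux, a shell equivalent of that approach (keep the five newest .bundle files, delete the rest) might look like this sketch. The function name and retention count are assumptions, and it relies on `ls -t` line parsing, which assumes bundle file names contain no spaces (true for Splunk's host-timestamp naming):

```shell
#!/bin/sh
# Hedged sketch: keep the N newest *.bundle files in a directory, delete the rest.
prune_bundles() {
    dir=$1
    keep=$2
    # List bundles newest-first, skip the first $keep, remove the remainder.
    ls -1t "$dir"/*.bundle 2>/dev/null | tail -n +$((keep + 1)) |
    while IFS= read -r old
    do
        echo "Deleting old bundle: $old"
        rm -f -- "$old"
    done
}

# Example invocation (adjust for your site):
# prune_bundles "$SPLUNK_HOME/var/run/searchpeers" 5
```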

Path Finder

This is a 6.5.0 / 6.5.1 bug.
Version 6.5.2 fixes it, as mentioned in the Release Notes: https://docs.splunk.com/Documentation/Splunk/6.5.2/ReleaseNotes/6.5.2
For additional info, also see: https://answers.splunk.com/answers/482121/why-is-the-search-head-distributing-entire-knowled.html#an...

Super Champion

For anyone interested, here's the shell script I use to (1) monitor disk usage of the searchpeers folder and (2) clean up any unused artifacts older than 24 hours. I find it frustrating that Splunk isn't "doing the right thing" out of the box. But I'm not above workaround scripts when necessary. So here goes.

Note that I ran into some timing issues on my system due to the massive volume of these files by the time I found and attempted to correct the issue. (The searchpeers folder was over 1 TB and took over 12 hours to delete.) Hence the time checks around the du and the directory-related rm. The other thing of note is that I check the "alive_tokens" in each bundle before deleting it.

I created this to run as a scripted input.
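For reference, a scripted-input stanza for this might look roughly like the fragment below; the app name cleanup_searchpeers and the hourly interval are assumptions on my part, not part of the original post:

```ini
# $SPLUNK_HOME/etc/apps/cleanup_searchpeers/local/inputs.conf
# "cleanup_searchpeers" is a hypothetical app name; interval is in seconds.
[script://$SPLUNK_HOME/etc/apps/cleanup_searchpeers/bin/searchpeers.sh]
interval = 3600
sourcetype = searchpeers_cleanup
disabled = 0
```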

As always, "use at your own risk" and "your mileage may vary", ...

Code listing for searchpeers.sh:

#!/bin/sh
MAX_HOURS=24
PEERS_DIR="$SPLUNK_HOME/var/run/searchpeers"

test -d "$PEERS_DIR" || exit 0
links=$(stat "$PEERS_DIR" -c %h 2>/dev/null)

if [ "$links" -lt 3 ]
then
        # Empty directory.  Not a search peer?...  Nothing to do.
        exit 0
fi

start=$(date +"%s.%3N")
size_mb=$(du -sm "$PEERS_DIR" | cut -f1)

full_bundle=$(find "$PEERS_DIR" -maxdepth 1 -mindepth 1 -type f -name '*.bundle' | wc -l)
delta_bundle=$(find "$PEERS_DIR" -maxdepth 1 -mindepth 1 -type f -name '*.delta' | wc -l)
tmp_folders=$(find "$PEERS_DIR" -maxdepth 1 -mindepth 1 -type d -name '*.tmp' | wc -l)
bundle_folders=$(find "$PEERS_DIR" -maxdepth 1 -mindepth 1 -type d \! -name '*.tmp' | wc -l)

end=$(date +"%s.%3N")
total=$(echo "$end - $start" | bc)

echo "$(date) searchpeers_total_mb=$size_mb full_bundle=$full_bundle delta=$delta_bundle tmp=$tmp_folders bundle_folders=$bundle_folders collection_sec=$total"

if [ "$bundle_folders" -ge 5 ]
then
        # Remove expired bundle directories, but only when no live search holds them open
        find "$PEERS_DIR" -maxdepth 1 -mindepth 1 -type d -mmin +$((MAX_HOURS * 60)) | (
                while read -r bundle
                do
                        #echo Looking at directory $bundle
                        alive_searches=$(find "$bundle/alive_tokens" -type p 2>/dev/null | wc -l)
                        if [ "$alive_searches" -eq 0 ]
                        then
                                echo "$(date) Removing old unused bundle:  $bundle"
                                start=$(date +"%s.%3N")
                                rm -rf "$bundle"
                                end=$(date +"%s.%3N")
                                total=$(echo "$end - $start" | bc)
                                echo "$(date) Removed bundle=$(basename "$bundle") seconds=$total"
                        else
                                echo "$(date) Bundle $bundle still in use by $alive_searches searches.  Not removing it."
                        fi
                done
        )

        # Remove expired top-level .bundle / .delta files
        find "$PEERS_DIR" -maxdepth 1 -mindepth 1 -type f -mmin +$((MAX_HOURS * 60)) | \
        while read -r f
        do
                echo "$(date) removing old bundle file: $f"
                rm -f "$f"
        done
fi

Contributor

Cleaning up the bundles normally requires a restart of the indexer, so let me know: do we need to restart the indexer if we use this script? What is the impact of running this script as a daily cron job? Is there any downtime?

Thanks,

Legend

I did something similar, although my script deletes anything over 24 hours old...

Super Champion

Seems like there should be a real fix to this. Last week we did a 6.5.3 to 6.5.1 upgrade, and now we're seeing indexer hangs, a searchpeers folder over 900 GB on some of the indexers (it takes over 12 minutes just to run du on the folder!), high indexer load times, and extremely high throttle_optimize index subtask times. Whatever is fundamentally going on here should be addressed in the core product.

BTW, I've noticed a very high number of .tmp folders in the searchpeers directory. There are very few ".delta" files and lots of ".bundle" files. I'm not sure whether the issue is that Splunk isn't cleaning up this folder, or whether there is a snowball effect where each bundle takes so long to process that it can't keep up. We're seeing bundle times of over 90 seconds (and the default bundle interval is 60 seconds).

BTW, I'm seeing this issue with bundles from independent SHs as well as from the SHC, and I can reproduce it in multiple deployments.

Super Champion

FYI. From support:

SPL-133450: 6.5+ Splunk does full bundle replication every time, slowing down the system

This will be fixed in Splunk 6.5.2 (should be available by the end of the month); however, if you are interested, we do have an incremental patch to address this issue that is not publicly available.

Super Champion

BTW, this patch does NOT fix the fact that some indexers still have the "searchpeers" folder growing uncontrollably. So I'm posting my own script here for anyone else running into a similar issue.

Communicator

Is this standalone or in a cluster?
We had an issue in a SHC where app.conf contained the install_source_checksum key, which caused the SHC deployer to keep sending the app over and over again until it ran out of disk space.
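A quick way to check for that key is a recursive grep over the deployer's staged apps. The helper name and the example path below are my own assumptions, not from the original post:

```shell
#!/bin/sh
# Hedged sketch: list config files that still carry install_source_checksum.
find_checksum_apps() {
    grep -rl 'install_source_checksum' "$1" 2>/dev/null
}

# Example, run on the deployer:
# find_checksum_apps "$SPLUNK_HOME/etc/shcluster/apps"
```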

Communicator

It's a distributed environment: 2 SHs (one of which is the deployer) and 2 peers. I verified that some apps had install_source_checksum, commented it out, and restarted Splunk, but it still keeps pushing the bundles to the peer every minute.

Contributor

I had a similar issue going from 6.3.2 to 6.4.3. It ended up resolving itself after a few days, but I had to do a lot of manual cleanup. My support ticket didn't reveal any answers.