When the filesystem that Splunk uses to store its indexes becomes unavailable or goes into read-only mode, or when Splunk crashes, inconsistencies are sometimes introduced in the metadata files of some indexes and buckets. These files are typically Sources.data, Hosts.data and SourceTypes.data. There is a set of these at the root of the index hot/warm directory, and another in each bucket.
The presence of a corrupt metadata file in a bucket of one of the indexes currently in use will keep Splunk from restarting. Typically, errors like the one below will show up in $SPLUNK_HOME/var/log/splunk/splunkd.log, and Splunk will crash when attempting to start:
ERROR WordPositionData - couldn't parse hash code
Unfortunately, although splunkd.log reports which index contains a corrupt metadata file as Splunk starts, it does not indicate which bucket that file lives in, or which file it is.
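A quick way to locate those errors, along with the nearby log lines that usually name the affected index, is to grep splunkd.log directly; a minimal sketch, assuming the error text matches the message shown above:

# Show each occurrence of the error with a couple of lines of context
grep -B 2 -A 2 "couldn't parse hash code" "$SPLUNK_HOME/var/log/splunk/splunkd.log"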
Is there a way to quickly scan an index and all of its buckets to detect which metadata files are corrupted and need to be moved out of the way?
There is a command that ships with Splunk which is capable of checking the consistency of the metadata files of any given index or bucket:
$SPLUNK_HOME/bin/splunk cmd recover-metadata {path_to_index|path_to_bucket} --validate
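For example, to validate just the hot/warm directory of the default index on a stock install:

$SPLUNK_HOME/bin/splunk cmd recover-metadata $SPLUNK_HOME/var/lib/splunk/defaultdb/db --validate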
Note that the "--validate" option essentially acts like "fsck -n": it reports errors but does not make any changes. For a given index, I like to run the script below to check the metadata files at the root of the hot/warm db, and then those contained in each bucket:
for i in `find "$PATH_TO_INDEX" \( -name 'db_*_*_*' -o -name 'hot_v*_*' \)`; do echo "Checking metadata in bucket $i ..."; $SPLUNK_HOME/bin/splunk cmd recover-metadata "$i" --validate; done; $SPLUNK_HOME/bin/splunk cmd recover-metadata `echo "$i" | sed 's/\(.*\)\/db_[^/]*$/\1/'` --validate
or fanned out for readability (at least, as readable as shell scripts get):
for i in `find "$PATH_TO_INDEX" \( -name 'db_*_*_*' -o -name 'hot_v*_*' \)`; do
  echo "Checking metadata in bucket $i ..."
  $SPLUNK_HOME/bin/splunk cmd recover-metadata "$i" --validate
done
# $i still holds the last bucket path; strip the bucket name to get the hot/warm db root
$SPLUNK_HOME/bin/splunk cmd recover-metadata `echo "$i" | sed 's/\(.*\)\/db_[^/]*$/\1/'` --validate
"PATH_TO_INDEX" should be the path to the directory of the affected index containing the "db" and "colddb" directories. For the default index ("main"), it is "$SPLUNK_HOME/var/lib/splunk/defaultdb".
Each time an error is reported, the corresponding .data file should be moved out of the way or deleted; Splunk will rebuild it on the next startup.
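For example, assuming the check flagged Sources.data in one of the warm buckets (the bucket name below is made up; use the path the checker printed):

# Rename rather than delete, so the file can be restored if needed;
# Splunk regenerates it at the next startup either way
mv "$PATH_TO_INDEX/db/db_1331400000_1331300000_42/Sources.data" \
   "$PATH_TO_INDEX/db/db_1331400000_1331300000_42/Sources.data.corrupt"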
Another solution is to create a "meta.dirty" file at the root of the affected index's hot/warm db ($SPLUNK_HOME/var/lib/splunk/defaultdb/db/ for example), which will also prompt Splunk to rebuild the metadata files for that index.
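For the default index, that is simply:

touch $SPLUNK_HOME/var/lib/splunk/defaultdb/db/meta.dirty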
Once all corrupted metadata files have been removed, the check should be run again. It will report errors for the files that were moved away, because they can no longer be found, but Splunk should now be ready to start.
Repeat the operation for each index for which splunkd.log reports this type of error.
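If several indexes are affected, the whole procedure can be wrapped in an outer loop; a minimal sketch, assuming all your indexes live under the default $SPLUNK_HOME/var/lib/splunk location (adjust if you use a custom SPLUNK_DB), and checking the hot/warm root directly instead of deriving it with sed:

for idx in "$SPLUNK_HOME"/var/lib/splunk/*; do
  [ -d "$idx/db" ] || continue   # only keep directories that look like index stores
  echo "=== Checking index $idx ==="
  for i in `find "$idx" \( -name 'db_*_*_*' -o -name 'hot_v*_*' \)`; do
    $SPLUNK_HOME/bin/splunk cmd recover-metadata "$i" --validate
  done
  # also check the metadata files at the root of the hot/warm db
  $SPLUNK_HOME/bin/splunk cmd recover-metadata "$idx/db" --validate
done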
As a corollary to the metadata checker above, the following can be used to check the health of your tsidx (time-series index) files.
for tsidx_file in $(find "$PATH_TO_INDEX" -type f -name '*.tsidx'); do
  # Capture the output; tsidxprobe returns a nonzero exit code on failure
  output="$($SPLUNK_HOME/bin/splunk cmd tsidxprobe "$tsidx_file")"
  tsidxprobe_exit_code=$?
  if [ $tsidxprobe_exit_code -ne 0 ]; then
    echo "tsidxprobe error: $tsidx_file gave an error; return code: $tsidxprobe_exit_code"
    echo "$output"
  fi
done
The main idea here is that tsidxprobe returns a nonzero exit code on failure; since its output is hard to predict, the script captures it and only prints it when a check fails.
NOTE: I tried this on Splunk 4.2.4 and it reported that "recover" was removed.
Do note that in most cases, it's the metadata files in the index root directory and/or in its hot buckets that are responsible for this situation.