Splunk Search

How to quickly validate the metadata files of a given index and of all its buckets?

hexx
Splunk Employee

When the filesystem that Splunk uses to store its indexes becomes unavailable or goes into read-only mode, or when Splunk crashes, inconsistencies are sometimes introduced into the metadata files of some indexes and buckets. These files are typically Sources.data, Hosts.data, and SourceTypes.data. There is one set of them at the root of the index's hot/warm directory, and another in each bucket.
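For the default index ("main") and a standard installation layout, these files can be found in locations like the following (paths shown as an example only):

ls "$SPLUNK_HOME/var/lib/splunk/defaultdb/db/"*.data          # index-level metadata
ls "$SPLUNK_HOME/var/lib/splunk/defaultdb/db/db_"*/*.data     # per-bucket metadata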

The presence of a corrupt metadata file in a bucket of one of the indexes currently in use will keep Splunk from restarting. Typically, errors like the one shown below appear in $SPLUNK_HOME/var/log/splunk/splunkd.log and Splunk crashes when attempting to start:

ERROR WordPositionData - couldn't parse hash code

Unfortunately, although splunkd.log reports which index contains a corrupt metadata file as Splunk starts, it does not indicate which bucket the file is in or which file it is.

Is there a way to quickly scan an index and all of its buckets to detect which metadata files are corrupted and need to be moved out of the way?

1 Solution

hexx
Splunk Employee

There is a command that ships with Splunk which can check the consistency of the metadata files of any given index or bucket:

$SPLUNK_HOME/bin/splunk cmd recover-metadata {path_to_index|path_to_bucket} --validate

Note that the "--validate" option essentially acts like "fsck -n": it reports errors but does not make any changes. For a given index, I like to run the script below to check the metadata files at the root of the hot/warm db and then those contained in each bucket:

for i in `find "$PATH_TO_INDEX" \( -name db_*_*_* -o -name hot_v*_* \)`; do echo "Checking metadata in bucket $i ..."; $SPLUNK_HOME/bin/splunk cmd recover-metadata $i --validate; done; $SPLUNK_HOME/bin/splunk cmd recover-metadata `echo $i | sed 's/\(.*\)\/db_[^/]*$/\1/'` --validate

or, fanned out for readability (at least as readable as shell scripts get):

for i in `find "$PATH_TO_INDEX" \( -name db_*_*_* -o -name hot_v*_* \)`; do
    echo "Checking metadata in bucket $i ..."
    $SPLUNK_HOME/bin/splunk cmd recover-metadata $i --validate
done
# finally, validate the metadata files at the root of the hot/warm db
# (strip the trailing bucket directory from the last path found)
$SPLUNK_HOME/bin/splunk cmd recover-metadata `echo $i | sed 's/\(.*\)\/db_[^/]*$/\1/'` --validate

"PATH_TO_INDEX" should be the path to the directory of the affected index containing the "db" and "colddb" directories. For the default index ("main"), it is "$SPLUNK_HOME/var/lib/splunk/defaultdb".

Each time an error is reported, the corresponding .data file should be moved out of the way or deleted; Splunk will rebuild it on the next startup.
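For instance, if the check flags Sources.data in a particular bucket, it can be moved aside rather than deleted (the bucket name below is purely hypothetical):

# move the corrupt metadata file out of the bucket; Splunk rebuilds it at startup
mv "$PATH_TO_INDEX/db/db_1262300400_1262134800_21/Sources.data" /tmp/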

Another solution is to create a "meta.dirty" file at the root of the affected index db ($SPLUNK_HOME/var/lib/splunk/defaultdb/db/ for example), which will also dynamically prompt Splunk to rebuild the metadata files for that index.
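For the default index, that would look something like this:

touch "$SPLUNK_HOME/var/lib/splunk/defaultdb/db/meta.dirty"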

Once all corrupted metadata files have been removed, run the check again. It will still report errors for those files because they can no longer be found, but Splunk should now be ready to start.

Repeat the operation for each index for which splunkd.log reports this type of error.


jrodman
Splunk Employee

As a corollary to the metadata checker above, the following can be used to check the health of your tsidx (time-series index) files.

for tsidx_file in $(find "$PATH_TO_INDEX" -type f -name '*.tsidx'); do
   # tsidxprobe exits nonzero on failure; capture its output and only print it
   # for files that fail the check
   output="$(splunk cmd tsidxprobe "$tsidx_file")"
   tsidxprobe_exit_code=$?
   if [ $tsidxprobe_exit_code -ne 0 ]; then
      echo "tsidxprobe error: $tsidx_file gave an error; return code: $tsidxprobe_exit_code"
      echo "$output"
   fi
done

The main idea here is that tsidxprobe returns nonzero on failure, and its output is hard to predict, so we store it and only emit it when the check fails.


rgcurry
Contributor

NOTE: I tried this on Splunk 4.2.4 and it reported that "recover" was removed.


hexx
Splunk Employee

Do note that in most cases, it's the metadata files in the index root directory and/or in its hot buckets that are responsible for this situation.
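Based on that, a quicker first pass could limit the check to the index root and the hot buckets, reusing the recover-metadata command from the answer above (again assuming PATH_TO_INDEX points at the index directory):

# check the metadata files at the root of the hot/warm db
$SPLUNK_HOME/bin/splunk cmd recover-metadata "$PATH_TO_INDEX/db" --validate
# then check each hot bucket
for i in `find "$PATH_TO_INDEX/db" -maxdepth 1 -type d -name 'hot_v*_*'`; do
    echo "Checking metadata in hot bucket $i ..."
    $SPLUNK_HOME/bin/splunk cmd recover-metadata $i --validate
done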
