Splunk Search

How to quickly validate the metadata files of a given index and of all its buckets?

hexx
Splunk Employee

When the filesystem that Splunk uses to store its indexes becomes unavailable or goes into read-only mode, or when Splunk crashes, inconsistencies are sometimes introduced in the metadata files of some indexes and buckets. These files are typically Sources.data, Hosts.data and SourceTypes.data. There is one set of these in the index hot/warm directory, and another in each bucket.

A corrupt metadata file in a bucket of an index that is currently in use will keep Splunk from restarting. Typically, errors like the one below show up in $SPLUNK_HOME/var/log/splunk/splunkd.log and Splunk crashes when attempting to start:

ERROR WordPositionData - couldn't parse hash code

Unfortunately, although splunkd.log reports at startup which index contains a corrupt metadata file, it does not indicate which bucket the file is in or which file it is.

Is there a way to quickly scan an index and all of its buckets to detect which metadata files are corrupted and need to be moved out of the way?
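For reference, here is a minimal sketch of how these errors can be pulled out of the log. The function name list_metadata_errors is made up for illustration, and the only thing it relies on is the "WordPositionData" string from the error shown above:

```shell
# Hypothetical helper: print every line of a splunkd.log that mentions
# WordPositionData, the component named in the error above.
list_metadata_errors() {
    log_file=$1
    grep 'WordPositionData' "$log_file"
}
```

It would be called as list_metadata_errors "$SPLUNK_HOME/var/log/splunk/splunkd.log". The matching lines should name the affected index, which is the starting point for the bucket-level check.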

1 Solution

hexx
Splunk Employee

There is a command that ships with Splunk that can check the consistency of the metadata files of any given index or bucket:

$SPLUNK_HOME/bin/splunk cmd recover-metadata {path_to_index|path_to_bucket} --validate

Note that the "--validate" option essentially acts like "fsck -n": it reports errors but does not make any changes. For a given index, I like to run the script below to check the metadata files contained in each bucket and then those at the root of the hot/warm db:

for i in `find "$PATH_TO_INDEX" \( -name 'db_*_*_*' -o -name 'hot_v*_*' \)`; do echo "Checking metadata in bucket $i ..."; "$SPLUNK_HOME/bin/splunk" cmd recover-metadata "$i" --validate; done; "$SPLUNK_HOME/bin/splunk" cmd recover-metadata "$PATH_TO_INDEX/db" --validate

Or, spread over several lines for readability (at least by shell-script standards):

for i in `find "$PATH_TO_INDEX" \( -name 'db_*_*_*' -o -name 'hot_v*_*' \)`; do
    # Validate the metadata files of each bucket found under the index
    echo "Checking metadata in bucket $i ..."
    "$SPLUNK_HOME/bin/splunk" cmd recover-metadata "$i" --validate
done
# Validate the metadata files at the root of the hot/warm db
"$SPLUNK_HOME/bin/splunk" cmd recover-metadata "$PATH_TO_INDEX/db" --validate

"PATH_TO_INDEX" should be the path to the directory of the affected index containing the "db" and "colddb" directories. For the default index ("main"), it is "$SPLUNK_HOME/var/lib/splunk/defaultdb".

Each time an error is reported, the corresponding .data file should be moved out of the way or deleted; Splunk will rebuild it on the next startup.
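Moving the file rather than deleting it keeps a copy around in case it is needed later. A minimal sketch of that step follows. The function name quarantine_metadata_file and the quarantine directory are hypothetical, and the bucket name is kept in the destination path only so that files with the same name from different buckets do not overwrite each other:

```shell
# Hypothetical helper: move one corrupt .data file into a quarantine directory,
# under a subdirectory named after the directory the file came from.
quarantine_metadata_file() {
    data_file=$1
    quarantine_dir=$2
    # Name of the bucket (or db root) that holds the file
    bucket_name=$(basename "$(dirname "$data_file")")
    mkdir -p "$quarantine_dir/$bucket_name" &&
        mv "$data_file" "$quarantine_dir/$bucket_name/"
}
```

For example, quarantine_metadata_file "$PATH_TO_INDEX/db/Hosts.data" /tmp/splunk_quarantine would move that one file aside. The quarantine directory should be outside the index path so that the bucket scan does not pick it up.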

Another solution is to create a "meta.dirty" file at the root of the affected index db ($SPLUNK_HOME/var/lib/splunk/defaultdb/db/ for example), which will also dynamically prompt Splunk to rebuild the metadata files for that index.
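As a sketch of the meta.dirty approach, assuming an empty file is enough (only the file's presence is described above, not its contents) and using a made-up function name, mark_index_dirty:

```shell
# Hypothetical helper: create meta.dirty at the root of an index's hot/warm db.
# Assumption: an empty file is sufficient.
mark_index_dirty() {
    index_path=$1
    if [ ! -d "$index_path/db" ]; then
        echo "No db directory under $index_path" >&2
        return 1
    fi
    touch "$index_path/db/meta.dirty"
}
```

For the default index this would be mark_index_dirty "$SPLUNK_HOME/var/lib/splunk/defaultdb".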

Once all corrupted metadata files have been removed, the check should be run again. It will report errors for those files because they can no longer be found, but Splunk should now be ready to start.

Repeat the operation for each index for which splunkd.log reports this type of error.
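When several indexes are affected, the same check can be wrapped in a function and run over a list of index paths. This is only a sketch: validate_index, validate_indexes and the SPLUNK_BIN override are names invented here, and it assumes the hot/warm db lives in the "db" directory under each index path, as described above.

```shell
# Hypothetical wrapper around "recover-metadata --validate" for one index.
# SPLUNK_BIN is an invented override; it defaults to the stock binary path.
validate_index() {
    index_path=$1
    splunk_bin=${SPLUNK_BIN:-$SPLUNK_HOME/bin/splunk}
    # Metadata files at the root of the hot/warm db
    "$splunk_bin" cmd recover-metadata "$index_path/db" --validate
    # Metadata files in each bucket directory found under the index
    find "$index_path" -type d \( -name 'db_*_*_*' -o -name 'hot_v*_*' \) |
    while IFS= read -r bucket; do
        echo "Checking metadata in bucket $bucket ..."
        "$splunk_bin" cmd recover-metadata "$bucket" --validate
    done
}

# Run the check for every index path given as an argument.
validate_indexes() {
    for index_path in "$@"; do
        echo "=== $index_path ==="
        validate_index "$index_path"
    done
}
```

It would be called with one path per affected index, for example validate_indexes "$SPLUNK_HOME/var/lib/splunk/defaultdb" followed by any other index directories named in splunkd.log.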

jrodman
Splunk Employee

As a corollary to the metadata checker above, the following can be used to check the health of your tsidx (text search) files.

for tsidx_file in $(find "$PATH_TO_INDEX" -type f -name '*.tsidx'); do
   output="$(splunk cmd tsidxprobe "$tsidx_file")"
   tsidxprobe_exit_code=$?
   if [ $tsidxprobe_exit_code != 0 ]; then
      echo tsidxprobe "error: $tsidx_file gave an error; return code: $tsidxprobe_exit_code"
      echo "$output"
   fi
done

The key idea here is that tsidxprobe returns nonzero on failure and its output is hard to predict, so the script stores the output and prints it only when the check fails.

rgcurry
Contributor

NOTE: I tried this on Splunk 4.2.4 and it reported that "recover" was removed.

hexx
Splunk Employee

Do note that in most cases, it's the metadata files in the index root directory and/or in its hot buckets that are responsible for this situation.
