As part of a migration from a stand-alone indexer (on Windows) to a three-node indexer cluster (CentOS), we are doing a test migration of one index. We copied all buckets to one of the new nodes (indexer01), appended the instance GUID to the end of the bucket folder names, started the Splunk instances, exited maintenance mode and waited for replication to happen. Everything worked like a charm until...
We searched this newly migrated index over the period "All time" on the cluster. The other two nodes returned warnings such as:
[indexer02] Failed to read size=257 event(s) from rawdata in bucket='qualys~18~C221DE6A-20A8-41A8-A8D2-27E1F7A4B043' path='/opt/splunkcold/qualys/colddb/rb_1526515073_1526386468_18_A221DE6B-20A00-41A8-A8D2-27E1F7A4B043. Rawdata may be corrupt, see search.log. Results may be incomplete!
We tried repairing the specific bucket with the splunk rebuild command and re-initiated the replication, but the result was the same. So we dug a little deeper and found that the rawdata folder, which should contain journal.gz, slicemin.dat and slicesv2.dat, also contains a strange plain-text, non-compressed file with raw events. It is named with a number that doesn't tell us much.
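For reference, the repair attempt looked roughly like this (the bucket path is a placeholder for the affected bucket directory):

# rebuild the tsidx/metadata files of a single bucket from its rawdata
/opt/splunk/bin/splunk rebuild /opt/splunkcold/qualys/colddb/rb_<newest>_<oldest>_<id>_<guid> qualys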
The question is: what is this file? It exists only on indexer01 and is not being replicated to the other nodes. Is there any way to append this file to journal.gz, or to force the replication of this file as well?
After a small amount of scripting and an even greater amount of waiting, we finally managed to get those slice files merged with journal.gz. We tried uncompressing journal.gz and simply appending the slice file, but it didn't work. The final solution was to use "splunk cmd exporttool" and "splunk cmd importtool". Because the number of buckets that needed to be fixed was above 2500, we adapted the export/import script that was published on the Splunk wiki years ago. I am sharing it here in case somebody needs it in the future.
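For a single bucket, the export/import round trip that the script below automates boils down to something like this (paths are placeholders; importtool populates an empty target directory, which then has to be renamed and moved into place):

# export the damaged bucket (journal.gz plus leftover slice) to CSV
/opt/splunk/bin/splunk cmd exporttool /opt/splunkcold/qualys/colddb/db_<newest>_<oldest>_<id> /tmp/bucket_export.csv -csv
# import the CSV into a fresh bucket directory
/opt/splunk/bin/splunk cmd importtool /tmp/new_bucket /tmp/bucket_export.csv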
#!/bin/bash
:'
This is a bucket fixup script. It fixes buckets with leftover slices by using the export/import commands.
It also renames single-instance buckets to cluster format (by appending the instance GUID).
Author: Žiga Humar, Our Space Appliances
The author takes no responsibility for this script or for any data corruption it might cause.
Thanks to jrodman, whose script was a good starting point, and to Christian Bran at Splunk Support,
who explained the logic behind leftover slices to me.
'
# EDIT YOUR VARIABLES HERE
BUCKET_TMPDIR=/tmp
SPLUNK_HOME=/opt/splunk
SPLUNK_BIN=/opt/splunk/bin/splunk
INSTANCE_GUID="C221DE6A-20A8-41B8-A8D2-27E1F7A4B0B8"
PATHS=(splunkhot splunkcold)
# VARIABLES FINISHED
EXPORT_CMD="$SPLUNK_BIN cmd exporttool"
IMPORT_CMD="$SPLUNK_BIN cmd importtool"
declare -a index_list
# build the list of indexes to process (from the first path), skipping internal indexes
for path in ${PATHS[0]};
do
for index in /opt/$path/*;
do
index_name=$(basename $index)
if [ -d $index ] && [ ${index_name:0:1} != "_" ] && [ $index_name != "audit" ]
then
index_list+=($index_name)
fi
done
done
# loop through the hot/warm and cold paths
for path in ${PATHS[@]};
do
echo "$(date) Processing path=/opt/$path/"
# loop through the indexes
for index in /opt/$path/*;
do
index_name=$(basename $index)
# check whether this index should be processed by this instance
index_found=0
for iteration_index in "${index_list[@]}"
do
if [ "$iteration_index" == "$index_name" ] ; then
index_found=1
fi
done
# if this is a folder and it should be processed, keep going
if [ -d $index ] && [ $index_found == 1 ]
then
echo "$(date) Processing index: $index_name"
for bucket in $index/*/db_*;
do
bucket_dir=$(dirname $bucket)
bucket_name=$(basename $bucket)
if [ -d $bucket ] && [ ${bucket_name:0:2} == "db" ]
then
bucket_id_guid=$(echo $bucket_name | sed 's/db_[0-9]*_[0-9]*_//')
bucket_guid=$(echo $bucket_id_guid | sed 's/[0-9]*//')
bucket_id=$(echo $bucket_id_guid | sed 's/_[0-9A-Za-z-]*$//')
#echo "$(date) Checking bucket=$bucket_id index=$index_name"
#echo "$(date) Guid: ${bucket_guid}"
# If the rawdata folder contains an uncompressed slice file, do the export/import procedure
if [ $(find $bucket -type f -regex '.*rawdata/[0-9]+$' | wc -l ) != "0" ] ;
then
echo "$(date) FIXUP task for bucket=${bucket_id} index=$index_name required"
echo "$(date) Exporting bucket=${bucket_id} index=$index_name"
NEW_BUCKET=$BUCKET_TMPDIR/new_bucket_${index_name}_${bucket_id}
EXPORTING_BUCKET=$BUCKET_TMPDIR/export_bucket_${index_name}_${bucket_id}.csv
# delete old export files (just in case they are left from previous migration)
rm -Rf $NEW_BUCKET
rm -Rf $EXPORTING_BUCKET
# do export
SECONDS=0
$EXPORT_CMD $bucket $EXPORTING_BUCKET -csv
duration_export=$SECONDS
echo "Export took $duration_export seconds."
# do import
echo "$(date) Reimporting bucket=${bucket_id} index=$index_name"
SECONDS=0
$IMPORT_CMD $NEW_BUCKET $EXPORTING_BUCKET
duration_import=$SECONDS
echo "Reimport took $duration_import seconds."
# go into new bucket and get earliest and latest time in the bucket
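# tsidx file names are <newest_epoch>-<oldest_epoch>-<id>.tsidx; strip the trailing id and read the two epochs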
(cd $NEW_BUCKET; ls *.tsidx | sed 's/-[0-9]\+\.tsidx$//' |sed 's/-/ /') | {
global_low=0
global_high=0
while read high low; do
if [ $global_high -eq 0 ] || [ $high -gt $global_high ]; then
global_high=$high
fi
if [ $global_low -eq 0 ] || [ $low -lt $global_low ]; then
global_low=$low
fi
done
REAL_BUCKET_NAME=db_${global_high}_${global_low}_${bucket_id}_${INSTANCE_GUID}
# move the old bucket to temporary location
if [ -d $bucket ];
then
mv $bucket $BUCKET_TMPDIR
else
echo "bucket $bucket vanished while processing... inserting the new one and hoping for the best" >&2
fi
# replacing old bucket with a new one
echo "Replacing bucket=${bucket_id} index=$index_name"
mv $NEW_BUCKET $bucket_dir/$REAL_BUCKET_NAME
}
# delete temporary export file and the old one.
rm -rf $BUCKET_TMPDIR/$bucket_name # delete old one
rm $EXPORTING_BUCKET # delete exported one
# if the bucket folder doesn't end with the INSTANCE_GUID, let's append it
elif [ "$bucket_guid" != "_${INSTANCE_GUID}" ];
then
echo "$(date) Renaming bucket=${bucket_id} from single instance to cluster format index=$index_name"
mv $bucket ${bucket}_${INSTANCE_GUID}
fi
fi
done
fi
done
done
Just for info: about 30% of our buckets were in this state. The total size of all indexes is 3 TB, and it took us 4 days to run the script. After running it, there were also a couple of other corrupted buckets that had to be exported in the same fashion, but that was a fast and easy job.
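For anyone checking their own environment, the number of affected buckets can be estimated with the same pattern the script uses to detect leftover slices (adjust the paths to your volumes):

find /opt/splunkhot /opt/splunkcold -type f -regex '.*/rawdata/[0-9]+$' | wc -l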
Just an update: we are in contact with Splunk Support. This strange plain-text, non-compressed file is a temporary slice file into which fresh events are temporarily written. When the slice is "full", it is appended to journal.gz. When a bucket is rolled from hot to warm, the slice is merged into journal.gz. But in the case of an indexer crash, the slice is not merged, so the bucket is left with both journal.gz and the slice file.
Now we are working on a way to merge these slices with journal.gz as efficiently as possible. Will keep you posted.