We recently had to move our Splunk installation and indexes to a new AWS instance, which was somewhat complicated due to the size of the indexes. Since then most of the indexes are updating correctly, but our most important custom index is not.
Partway through the move, indexing was restarted, then stopped. When the data move was complete, we had bucket ID conflicts. We followed all the instructions we could find to correct the issues, renaming all the conflicting buckets, and all indexes and metadata were rebuilt (splunk _internal call /data/indexes/*/rebuild-metadata-and-manifests). Other affected indexes are now working correctly, but our most important index has not processed data past Dec 4. We get these errors in the log files about every second or so:
12-23-2013 02:47:14.427 +0000 ERROR BTree - 133th child has invalid offset: indexsize=32434216 recordsize=77042296, (Leaf)
12-23-2013 02:47:14.427 +0000 ERROR BTreeCP - addUpdate CheckValidException caught: BTree::Exception: Validation failed in checkpoint
We have tried repairing the buckets and metadata several times. Splunk found errors and repaired them, but the BTree error continued. We've stopped and restarted Splunk a number of times to retest, and new repairs were made to the buckets each time. One problematic bucket has been moved into /root -- it refused to be repaired.
None of this affected the BTree error. The data still isn't showing up in the Splunk web interface when we run searches.
What other things can we try to repair this index? I have not seen any other reports of a similar error message when I search through answers.splunk.com.
You're not going to be able to fix that kind of error by rebuilding metadata. The problem is corruption in the rawdata; you can't fix that with a rebuild. If the corruption is present in the snapshot as well (based on your response, that seems likely), the only chance you have of fixing it is if you've got a backup from prior to the introduction of the corruption.
In all likelihood, this is not the reason you're unable to search your custom index. Is that index actually active at this time? Do its warm buckets contain data with timestamps after December 4th?
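One quick way to answer that is from the bucket directory names themselves: warm and cold bucket directories are named db_&lt;newestEpoch&gt;_&lt;oldestEpoch&gt;_&lt;bucketId&gt;, so the event time range can be read off without running a search. A minimal sketch (the bucket name here is a made-up example):

```shell
#!/bin/sh
# Decode the newest/oldest event times encoded in a bucket directory name.
# Warm/cold buckets are named db_<newestEpoch>_<oldestEpoch>_<bucketId>.
name="db_1387756800_1386547200_7"   # hypothetical example bucket
newest=$(echo "$name" | cut -d_ -f2)
oldest=$(echo "$name" | cut -d_ -f3)
# GNU date syntax; on BSD/macOS use: date -u -r "$newest" +%Y-%m-%d
echo "newest event: $(date -u -d "@$newest" +%Y-%m-%d)"
echo "oldest event: $(date -u -d "@$oldest" +%Y-%m-%d)"
```

Run this against the directory names under your index's db/ path; any bucket whose first epoch decodes to a date after December 4th contains newer data.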
Looks like your BTree is corrupted somehow.
Did you have a hard system crash around this time?
Please check your $SPLUNK_HOME/var/lib/splunk/fishbucket/splunk_private_db.
If you have a 'snapshot' directory in there, you can try restoring the btree files from the snapshot:
cp $SPLUNK_HOME/var/lib/splunk/fishbucket/splunk_private_db/snapshot/btree* $SPLUNK_HOME/var/lib/splunk/fishbucket/splunk_private_db
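Putting those steps together, a minimal sketch of the restore (assuming a default install under /opt/splunk; the script only prints the commands so you can review them first, and it keeps a copy of the current btree files before overwriting them):

```shell
#!/bin/sh
# Sketch of the fishbucket btree restore. Prints the plan only; review it,
# then run the commands by hand (or pipe this script's output to sh).
# SPLUNK_HOME default is an assumption -- adjust for your installation.
SPLUNK_HOME="${SPLUNK_HOME:-/opt/splunk}"
DB="$SPLUNK_HOME/var/lib/splunk/fishbucket/splunk_private_db"

plan="$SPLUNK_HOME/bin/splunk stop
mkdir -p $DB/btree.bak
cp $DB/btree* $DB/btree.bak/
cp $DB/snapshot/btree* $DB/
$SPLUNK_HOME/bin/splunk start"

printf '%s\n' "$plan"
```

Stopping Splunk first matters here: copying the btree files while splunkd has them open can leave them inconsistent again.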
Thanks for taking the time to respond. We restored the btree files from a snapshot -- but it had no apparent effect. We backed up all of the original files in the root home directory first, and ran Splunk's file check utility afterward, but Splunk seemed to take little notice of the change. - sheilatabuena
Did you restart the Splunk instance after the restore?
Running ./splunk fsck --all --repair should fix it. If not,
renaming the splunk_private_db folder in the fishbucket directory and restarting the instance will hopefully generate fresh btree files in a new directory.
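As a sketch, the two fallback steps above in order (again only printing the plan; the SPLUNK_HOME default is an assumption, so adjust it for your install):

```shell
#!/bin/sh
# Sketch of the two repair fallbacks: fsck --repair first, then setting the
# fishbucket db aside so Splunk regenerates fresh btree files on restart.
# Prints the plan only; review before running anything.
SPLUNK_HOME="${SPLUNK_HOME:-/opt/splunk}"
FISH="$SPLUNK_HOME/var/lib/splunk/fishbucket"

plan="$SPLUNK_HOME/bin/splunk stop
$SPLUNK_HOME/bin/splunk fsck --all --repair
mv $FISH/splunk_private_db $FISH/splunk_private_db.corrupt
$SPLUNK_HOME/bin/splunk start"

printf '%s\n' "$plan"
```

One caveat: discarding the fishbucket db makes Splunk forget which monitored files it has already read, so previously indexed files may be re-read and indexed as duplicates.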