Is there some way to see how long it takes to repair an index? Let's say the system crashed for some reason and the Splunk index is inconsistent. I can use splunk fsck -all --repair to get the index back to a valid state, but how long does it take? Is there some kind of progress bar to see how much has already been done? I'd like to integrate Splunk into our monitoring framework and start an automatic repair after a system or Splunk crash. It would be nice to inform the administrator about the problem and the actual progress of the index recovery. If it takes too long, he should decide how to proceed.
Hi Lu,
You know, a primary objective is to hide these failures, and to do that we need a solid architecture in which the peer nodes back each other up. Due to solid growth, our cluster is not stable, but with six indexers and both Search Factor and Replication Factor set to three, the net result is that to the outside world we appear very stable.
Cheers
"It depends" makes sense but I would think there would be a formula saying, "on a recommended hardware spec machine, splunk repair will repair 30 MB of data per second" or something like that.
Splunk developers know what algorithms they're running to repair a bucket or index. Even a Big O notation answer would at least give us a ballpark figure.
How long it takes depends directly on how much data you are giving Splunk to repair. You can control how much data you repair with these instructions:
http://www.splunk.com/wiki/Community:PostCrashFsckRepair
There is no progress bar.
I have no timelines on how long it takes; this depends on the computing resources you have (CPU, memory, disk I/O).
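Since there is no built-in progress bar, one workaround is to repair bucket by bucket and report progress yourself. Below is a minimal sketch of that idea, not a Splunk feature. The names `dir_size`, `repair_with_progress`, `repair_one` and `report` are hypothetical; `repair_one` stands for whatever per-bucket repair command you run, and the remaining time is a rough extrapolation that assumes repair time scales with bucket size on disk.

```python
import os
import time
from typing import Callable, Optional, Sequence


def dir_size(path: str) -> int:
    """Return the total size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if os.path.isfile(full):
                total += os.path.getsize(full)
    return total


def repair_with_progress(
    buckets: Sequence[str],
    repair_one: Callable[[str], None],
    report: Callable[[str, float, Optional[float]], None],
) -> None:
    """Repair buckets one at a time and report progress after each one.

    repair_one(bucket) is supplied by the caller (hypothetical hook for
    the actual repair command). report(bucket, fraction_done, eta_seconds)
    is called after each bucket; eta_seconds is None if no estimate exists.
    """
    sizes = {b: dir_size(b) for b in buckets}
    total = sum(sizes.values()) or 1  # avoid division by zero
    done = 0
    start = time.monotonic()
    for bucket in buckets:
        repair_one(bucket)
        done += sizes[bucket]
        fraction = done / total
        elapsed = time.monotonic() - start
        # Assumption: time per byte stays roughly constant across buckets.
        eta = elapsed * (1 - fraction) / fraction if fraction > 0 else None
        report(bucket, fraction, eta)
```

A monitoring framework could pass a `report` callback that forwards the fraction and estimate to the administrator, who can then decide whether to wait or intervene.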
This looks like a solution for my concept. Unfortunately, the first experiment shows a problem with reinserting the repaired buckets into their original place. I simulated a Splunk crash with 'kill -9'.
The Splunk check discovered these broken buckets:
bucket=/opt/splunk/var/lib/splunk/audit/db/hot_v1_0 NEEDS REPAIR: count mismatch tsidx=388 source-metadata=248
bucket=/opt/splunk/var/lib/splunk/defaultdb/db/hot_v1_0 NEEDS REPAIR: count mismatch tsidx=315551 source-metadata=237012
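For the monitoring integration, the check output above can be turned into a list of buckets to repair. This is a minimal sketch that assumes the output looks exactly like the lines quoted in this post; other Splunk versions or other failure types may print different messages. `BrokenBucket` and `parse_fsck_output` are hypothetical names.

```python
import re
from typing import List, NamedTuple


class BrokenBucket(NamedTuple):
    path: str
    tsidx_count: int
    metadata_count: int


# Matches entries of the form seen in this thread:
# bucket=<path> NEEDS REPAIR: count mismatch tsidx=<n> source-metadata=<m>
_PATTERN = re.compile(
    r"bucket=(?P<path>\S+)\s+NEEDS REPAIR:\s+count mismatch\s+"
    r"tsidx=(?P<tsidx>\d+)\s+source-metadata=(?P<meta>\d+)"
)


def parse_fsck_output(text: str) -> List[BrokenBucket]:
    """Extract every bucket flagged with a count mismatch from the check output."""
    return [
        BrokenBucket(m["path"], int(m["tsidx"]), int(m["meta"]))
        for m in _PATTERN.finditer(text)
    ]
```

The resulting list gives the administrator both the affected paths and the size of each mismatch, which can go straight into an alert.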
I moved them out of the original directory, and then Splunk could start as expected. I started a rebuild for every bucket. Now everything is OK, but how do I reinsert them? Some new events were indexed during the rebuild, and now there are files with the same name in the original place. A diff of the repair directory and the new directory shows some conflicts. Is it safe to replace the new files with the old ones? Is there some merge utility? Should I move only the .tsidx files?
Example of a diff conflict; the other files were unique to either the repair directory or the original place:
diff -r /opt/splunk/var/lib/splunk/audit/db/hot_v1_0/Hosts.data /opt/splunk/var/lib/splunk/audit/db/repair/hot_v1_0/Hosts.data
0 1 6 1319803955 1319803991 1319803991
1 host::myhost 6 1319803955 1319803991 1319803991
---
0 1 408 1319803860 1319803920 1319803920
1 host::myhost 408 1319803860 1319803920 1319803920
Binary files /opt/splunk/var/lib/splunk/audit/db/hot_v1_0/rawdata/slices.dat and /opt/splunk/var/lib/splunk/audit/db/repair/hot_v1_0/rawdata/slices.dat differ
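To get the same overview as the diff above in a form a script can act on, the two bucket directories can be compared programmatically. This is only a sketch for listing conflicts; it does not merge anything and says nothing about whether overwriting is safe. `compare_buckets` is a hypothetical helper name.

```python
import filecmp
import os
from typing import Dict, List


def compare_buckets(original: str, repaired: str) -> Dict[str, List[str]]:
    """List files unique to each directory and files present in both but differing.

    Paths in the result are relative to the bucket root.
    """
    result: Dict[str, List[str]] = {
        "only_original": [],
        "only_repaired": [],
        "differing": [],
    }

    def walk(rel: str) -> None:
        left = os.path.join(original, rel)
        right = os.path.join(repaired, rel)
        cmp = filecmp.dircmp(left, right)
        result["only_original"] += sorted(os.path.join(rel, n) for n in cmp.left_only)
        result["only_repaired"] += sorted(os.path.join(rel, n) for n in cmp.right_only)
        # shallow=False forces a byte-by-byte comparison instead of
        # trusting file size and modification time alone.
        _same, mismatch, errors = filecmp.cmpfiles(
            left, right, cmp.common_files, shallow=False
        )
        result["differing"] += sorted(os.path.join(rel, n) for n in mismatch + errors)
        for sub in sorted(cmp.common_dirs):
            walk(os.path.join(rel, sub))

    walk("")
    return result
```

Files listed under `differing` (such as Hosts.data and rawdata/slices.dat in my case) are the real conflicts; the unique files are the easy part.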