Monitoring Splunk

Feature request: fsck repair indexes performance

Explorer

Hi,

I suspect I'm not the only one who has experienced that repairing corrupted indexes takes a long time. At least in my environment, with a single indexer server, I could see that 1 out of my many CPUs was at 100% utilization while the others were idle during the process. Disk utilization did not seem to be an issue at all.
If I'm right that the fsck command fixes index files one bucket at a time, then my question is:
Why doesn't the feature optionally start several threads, handling more than one bucket at a time, to utilize the available CPU resources?
Maybe including the possibility to decide yourself how many concurrent cores you want to use?
This would have saved us a lot of hours of unavailable Splunk service!
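In the meantime, a chosen degree of concurrency can be approximated outside of splunkd. A minimal sketch, assuming a buckets.list file with one bucket path per line; the paths and the worker count of 4 are illustrative, echo stands in for the real repair command, and xargs -P is a widely available GNU/BSD extension:

```shell
#!/bin/sh
# Run up to N bucket repairs concurrently; the operator picks N.
# Assumptions: buckets.list holds one bucket path per line, and echo
# stands in for the real repair command, e.g.:
#   /opt/splunk/bin/splunk fsck repair --one-bucket --bucket-path={}
N=4
printf '%s\n' /idx/db_1 /idx/db_2 /idx/db_3 > buckets.list
xargs -P "$N" -I{} echo "repairing {}" < buckets.list
```

xargs itself blocks until all workers exit, so anything placed after it runs only once every repair is done.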

-Tore

Communicator

+1 for the feature request to parallel fsck.

Since this isn't available, I wrote a quick bash script that generates a list of buckets to fsck and uses the parallel command to kick off many bucket fscks at a time. I then distributed this to a hundred-plus indexers and executed it via pssh.

#!/bin/bash

# Find buckets between two epoch timestamps and write the scan output to a file
/data/index/splunk/bin/splunk fsck scan --all-buckets-one-index --index-name=_internal --min-ET=1483250400 --max-LT=1485928800 > /home/splunk/bucket_repair_07152017/buckets_tmp.list 2>&1

# Extract the bucket paths from the scan output
grep idx /home/splunk/bucket_repair_07152017/buckets_tmp.list | cut -d " " -f 4 | cut -d "=" -f 2 | cut -d "'" -f 2 > /home/splunk/bucket_repair_07152017/buckets.list

# Repair the buckets in parallel, one fsck process per bucket
parallel /data/index/splunk/bin/splunk fsck repair --one-bucket --bucket-path={} < /home/splunk/bucket_repair_07152017/buckets.list >> /home/splunk/bucket_repair_07152017/buckets_repair.log 2>&1
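For reference, the grep/cut chain above pulls the quoted bucket path out of each scan line; the fourth space-separated field is expected to contain path='...'. A tiny runnable illustration, using a hypothetical sample line (the real fsck scan output format may differ):

```shell
#!/bin/sh
# Hypothetical sample line from 'splunk fsck scan' output; the real format
# may differ. This only illustrates what the cut chain extracts.
line="Checking idx=_internal bucket path='/data/index/db_1483250400_1485928800_1'"
echo "$line" | grep idx | cut -d " " -f 4 | cut -d "=" -f 2 | cut -d "'" -f 2
# prints /data/index/db_1483250400_1485928800_1
```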

Path Finder

I would love to see Splunk add this to the startup sequence so multiple cores can be involved, checking a bucket per index. Right now it seems to parse each index and fsck a single bucket in a single index at a time.

This can make recovery take a VERY long time while the box is mostly idle.

The script you have here is nice, but it would be preferable if splunkd incorporated the logic.

Esteemed Legend

I am unfamiliar with parallel, but it seems that it is aware of the threads that it spawns and will not return until all of them are complete. Is this so? The reason I ask is that it seems like it might work to add this to the bottom of your script, @kbecker:

# All fixed; start splunk now
/data/index/splunk/bin/splunk start

But this would be inadvisable (and Splunk would likely not even start) if parallel moves on to the next command (in this case splunk start) before all of the fsck threads are complete.
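For what it's worth, GNU parallel behaves like an ordinary blocking command: it does not return until every job it launched has exited, so a splunk start placed after it would only run once all repairs finish. A small runnable sketch of that blocking behavior, with xargs -P standing in for parallel and short sleeps simulating repairs of different lengths:

```shell
#!/bin/sh
# Jobs of different durations run concurrently (-P 3), but the pipeline
# itself blocks until the slowest one exits, so the final echo runs last.
printf '0.3\n0.1\n0.2\n' | xargs -P 3 -I{} sh -c 'sleep {}; echo "repaired after {}s"'
echo "all repairs finished; safe to run: splunk start"
```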

Esteemed Legend

There was a typo in your script: you were missing the last character of the file. I edited and fixed it.

Engager

I misspoke earlier when I said I could run multiple executions of 'splunk fsck repair --all-buckets-one-index' in parallel. On closer inspection, each execution processed the buckets in the same order. I tried it the way you posted above, and that worked great.

Splunk Employee

Hi.

I think you're saying you want index repair to go faster, and that you believe that parallelizing it will make it go faster.
We certainly can parallelize repair by repairing buckets in parallel, but a single bucket would be very difficult to parallelize.

Can you clarify a bit as to whether this occurred during a normal 'splunk start' after an outage, or were there other commands involved?

New Member

Has any attempt been made to make this process faster? I am also wondering why it is tied to just one CPU.

Path Finder

Using the GNU Parallel approach mentioned above by kbecker did the trick for me. I modified the process a bit and added an outer loop to check every bucket instead of just the single bucket shown in their example. I don't have the commands handy right now, though.

Explorer

Hi,
Your assumption is correct. :-)

What I'm thinking of is running a repair of buckets from all indexes with, e.g., this command:
/opt/splunk/bin/splunk fsck repair --all-buckets-all-indexes 2>&1 | tee fsck-output.txt

As support also indicated, this took around 30 minutes per 10 GB of data on average.
Since only one CPU was active, at 100% the whole time, I assumed no multithreading. Splitting the load onto several cores might have sped up the processing.
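For a sense of scale, the potential saving is easy to estimate from that rate. A back-of-envelope sketch; the 1 TB index size and 8 workers are assumed figures, and perfect linear scaling is an idealization (real speedup will be lower):

```shell
#!/bin/sh
# Idealized estimate: serial repair time vs. the same work split evenly
# across parallel workers. The size and worker count are assumptions.
data_gb=1000        # assumed total index size (1 TB)
min_per_10gb=30     # observed: ~30 minutes per 10 GB, single-threaded
workers=8           # assumed number of parallel repair processes
serial_min=$(( data_gb / 10 * min_per_10gb ))
echo "serial: ${serial_min} min (~$(( serial_min / 60 )) h)"
echo "with ${workers} workers: $(( serial_min / workers )) min"
# prints serial: 3000 min (~50 h)
# prints with 8 workers: 375 min
```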

-Tore
