Monitoring Splunk

BucketMover - freeze candidate $BUCKET_NAME is already on inflight list

SplunkTrust

I just found a message in our splunkd.log of the form:

09-12-2011 23:26:08.864 -0400 INFO  BucketMover - freeze candidate /opt/splunk/var/lib/splunk/fooindex/colddb/db_1298693239_1298663572_692 is already on inflight list, not adding again. already being frozen, or currently being moved to cold, in which case we will freeze it after that completes.

Running a search of the form...

index=_internal freeze candidate source="*splunkd.log" | rex "freeze candidate /opt/splunk/var/lib/splunk/(?<foo>[^\s]+)" | timechart span=1d count by foo

...gives me one bucket that has been doing this basically 1x/minute going back to 29 August. Splunkd has been restarted more than once since that time. Other relevant log messages include:

08-28-2011 13:41:30.733 -0400 INFO  BucketMover - will attempt to freeze: /opt/splunk/var/lib/splunk/fooindex/colddb/db_1298693239_1298663572_692 because frozenTimePeriodInSecs=7776000 exceeds difference between now=1314553290 and latest=1298693239

08-28-2011 10:12:53.252 -0400 INFO  databasePartitionPolicy - Adding /opt/splunk/var/lib/splunk/fooindex/colddb/db_1298693239_1298663572_692 because of fullRebuild

Looking in the index directory structure, this bucket is (numerically) wildly out of sequence. Other cold buckets are in the range of 1241 to 1667. We do run a coldToFrozenScript, but have not seen any issues with it.
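For context, cold bucket directories are named db_&lt;latestEpoch&gt;_&lt;earliestEpoch&gt;_&lt;localId&gt;, and the 08-28 log line above spells out the freeze test: a bucket becomes a candidate once now - latest exceeds frozenTimePeriodInSecs. A minimal Python sketch of that check (illustrative only, not Splunk's actual code; the function names are mine):

```python
import time

def parse_bucket_name(name):
    """Split a db_<latestEpoch>_<earliestEpoch>_<localId> directory name."""
    _, latest, earliest, local_id = name.split("_")
    return int(latest), int(earliest), int(local_id)

def is_freeze_candidate(name, frozen_time_period_secs, now=None):
    """Mirror the BucketMover condition quoted above:
    freeze when now - latest exceeds frozenTimePeriodInSecs."""
    if now is None:
        now = int(time.time())
    latest, _earliest, _local_id = parse_bucket_name(name)
    return now - latest > frozen_time_period_secs

# Plugging in the numbers from the 08-28 log line:
# now=1314553290, latest=1298693239, frozenTimePeriodInSecs=7776000
print(is_freeze_candidate("db_1298693239_1298663572_692",
                          7776000, now=1314553290))  # True
```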

Can we just delete the bucket? How does Splunk know this bucket is on some inflight freezing list -- and how can we clean said list up? (assuming, of course, that such a list exists)

Additional Info, 13 Sept 15:44CT:

Discussing this with hexx, in #splunk, found some additional hidden files:

drwx--x--x   4 splunk splunk  4096 Feb 25  2011 db_1298693239_1298663572_692
-rw-------   1 splunk splunk     0 May 27 00:07 .db_1298693239_1298663572_692.rbsentinel
-rw-------   1 splunk splunk     0 May 27 00:07 .db_1298693239_1298663572_692.rbsentinel.lock

I strongly suspect these hidden files are what is making Splunk's belief of freeze-in-process persist across restarts.
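In case others hit this, a quick sketch for spotting leftover sentinel files across a colddb directory (the glob pattern is an assumption based on the file names shown above):

```python
from pathlib import Path

def find_sentinels(colddb):
    """Return the names of hidden .rbsentinel / .rbsentinel.lock
    files sitting next to bucket directories in colddb."""
    return sorted(p.name for p in Path(colddb).glob(".*.rbsentinel*"))

# e.g. find_sentinels("/opt/splunk/var/lib/splunk/fooindex/colddb")
```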

1 Solution

Splunk Employee

It looks like splunkd's attempt to freeze that bucket with coldToFrozenScript is silently failing each time. To remediate this specific bucket, you can move it to your frozen directory manually while splunkd is stopped, or, better yet, run your coldToFrozenScript on it by hand.

If you would like this investigated in more depth, I would suggest opening a support case and attaching a diag. Ideally, raise the log level of the BucketMover channel to DEBUG and see if you can capture more information about the move attempt before gathering the diag.
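For reference, one way to raise that channel's level is via the standard log.cfg mechanism (a sketch from memory; double-check against the docs for your version): add the category to $SPLUNK_HOME/etc/log-local.cfg, which overrides log.cfg and survives upgrades, then restart splunkd.

```ini
[splunkd]
category.BucketMover=DEBUG
```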



SplunkTrust

Removing those hidden files while Splunk was down made it freeze the bucket right away during startup. I tried turning DEBUG logging on, but nothing meaningful was being logged. I will keep an eye out for this issue in the future and plan to open a support case if it occurs again.
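For anyone cleaning these up themselves, a hedged sketch of the removal step (my own helper, dry-run by default; only run it against a live index with splunkd stopped and a backup in place):

```python
from pathlib import Path

def remove_sentinels(colddb, dry_run=True):
    """Delete leftover hidden .rbsentinel / .rbsentinel.lock files.
    With dry_run=True, only report what would be removed."""
    removed = []
    for p in sorted(Path(colddb).glob(".*.rbsentinel*")):
        if not dry_run:
            p.unlink()
        removed.append(p.name)
    return removed
```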
