Getting Data In

Why are we seeing Frozen Buckets in Cold?

Communicator

Hi all,

I've noticed some weird behavior on one (and only one) of my indexers. A customer complained about data "suddenly disappearing" in the middle of the day. When my team investigated, they found that all the buckets in cold resembled "frozen" buckets - they contain only a rawdata directory with journal.gz inside. When I search for a representative bucket in the logs, I see only the following two messages, repeated over and over again:

12-30-2015 00:13:53.533 +0900 INFO  DbMaxSizeManager - Will freeze bucket=/opt/splunk/var/lib/splunk/jcare/colddb/db_1449606229_1449572203_26083 LT=1449606229 size=30715904 (29MB)

12-30-2015 00:13:55.068 +0900 ERROR BucketMover - aborting move because recursive copy from src='/opt/splunk/var/lib/splunk/jcare/colddb/db_1449606229_1449572203_26083' to dst='/opt/splunk/var/lib/splunk/frozen/jcare/inflight-db_1449606229_1449572203_26083' failed (reason='No space left on device')

The disk space error referenced is due to a known problem with our backups client that we're working on. However, that still leaves me with two questions:

  1. If Splunk fails to move a bucket to frozen, why does it continue freezing other buckets in cold? Shouldn't it just stop altogether?
  2. If Splunk fails to move a bucket to frozen, why does it leave the bucket in "frozen" status on cold? My understanding was that the bucket was first moved, then frozen.

Curious if anyone else has seen this and has suggestions as to why this is happening. Again, the root cause appears to be disk-related, but we are dependent upon a support ticket with our Backups vendor to resolve this, and I'd love to see any ameliorative suggestions that we can implement in the meantime.

Adding relevant lines from indexes.conf here:

[default]
maxTotalDataSizeMB = 1272700

[volume:tstats]
path = /opt/splunk/var/lib/splunk/tstats

[volume:ctxhot]
path = /opt/splunk/var/lib/splunk/ctxhot
maxVolumeDataSizeMB = 36000

[volume:ctxcold]
path = /opt/splunk/var/lib/splunk/ctxcold
maxVolumeDataSizeMB = 74000

[jcare]
homePath = $SPLUNKDB/jcare/db
coldPath = $SPLUNKDB/jcare/colddb
thawedPath = $SPLUNKDB/jcare/thaweddb
tstatsHomePath = volume:tstats/jcare
coldToFrozenDir = $SPLUNKDB/frozen/jcare
homePath.maxDataSizeMB = 185000
coldPath.maxDataSizeMB = 237000
maxHotBuckets = 3
maxDataSize = auto

1 Solution

SplunkTrust

OK, so here's the order of operations when rolling to frozen:

  1. Identify the bucket that needs to be frozen
  2. Freeze the bucket where it lies on the file system
  3. Move the bucket

The move is failing for you, so the bucket stays in the cold folder as a frozen bucket.

To fix this, make a local frozen directory using coldToFrozenDir, and then use your backup script (run outside of Splunk) to check the connection to the backup location; if it's up and accessible, copy the frozen buckets over.

If you're using a script, you need to put a test case in that checks whether the backup location is available and, if not, moves the bucket to a temporary location, which will be read by the script and sent to frozen storage the next time it executes.


Communicator

Thanks. So if it fails to move a bucket, will the process continue attempting to freeze other buckets? As stated above, we're seeing our entire colddb filled with "frozen" buckets.

SplunkTrust

Once the bucket is frozen but not moved, it will stay there. I don't know of any cleanup process that will "catch" the bucket and move it. I'm probably super wrong, though, as I'm not very fluent in Splunk bucket maintenance.

I just know that, logically, it would create the frozen bucket, fail to move it, and therefore it would stay right there in the cold directory.

I'm thinking you'll have one frozen bucket in your cold directory for every one of these errors you've had.
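To gauge how many frozen-but-unmoved buckets are sitting in cold, one rough approach is to count bucket directories that contain only a rawdata subdirectory — which matches the question's description of a frozen bucket. A sketch, assuming the colddb path from the config above:

```python
import os

# coldPath from the indexes.conf in the question (an assumption here)
COLDDB = "/opt/splunk/var/lib/splunk/jcare/colddb"


def looks_frozen(bucket_path):
    """A frozen bucket has had its index files stripped: only rawdata/ remains."""
    return set(os.listdir(bucket_path)) == {"rawdata"}


def frozen_buckets_in_cold(colddb=COLDDB):
    """List db_* bucket directories in colddb that appear frozen in place."""
    frozen = []
    for name in sorted(os.listdir(colddb)):
        path = os.path.join(colddb, name)
        if name.startswith("db_") and os.path.isdir(path) and looks_frozen(path):
            frozen.append(name)
    return frozen
```

Comparing that count against the number of BucketMover errors in splunkd.log would confirm (or refute) the one-stuck-bucket-per-error theory.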

SplunkTrust

Can you please update your question with a copy of your indexes.conf?

Communicator

Done. See above.

SplunkTrust

Looks good to me. I just wanted to be sure you didn't have duplicate paths or something weird. I'm betting the frozen-delete policy doesn't apply when you enable a coldToFrozenDir and Splunk can't reach the directory at bucket-move time.

SplunkTrust

Surely it's supposed to find these and move them once per day or something. It might be worth a support ticket to file a possible bug report.

Influencer

I can't say I've seen this, as our freezing policy is currently "Delete", but these are interesting questions, and probably only a Splunk developer could accurately answer them... but it leads me to a bigger question: "Why do we freeze data?"

If the purpose of freezing data is to free up disk space on the searchable volume, then by freezing first (and continuing to freeze) we come closer to guaranteeing the ability to free up at least some space before moving data to a separate volume. Of the two operations (deleting metadata, and moving to a separate volume), the move seems like it would have the higher chance of failure, assuming an environment where Splunk is running and able to at least operate with the volume the cold buckets are currently on.

If the purpose of freezing data is to ensure that data is retained in a searchable state for only a particular amount of time, to meet some regulatory, legal, or other policy requirement, then by freezing first and continuing to freeze we ensure that the same functional state is still reached at the same or a similar time, while still keeping the data available on disk to be moved once the frozen volume is unblocked.

Of course both of these are pure speculation. 🙂
