I've noticed some weird behavior on one (and only one) of my indexers. A customer complained about data "suddenly disappearing" in the middle of the day. When my team investigated, they found that all the buckets in cold resembled "frozen" buckets: they contained only a rawdata directory with journal.gz inside. When I search the logs for a representative bucket, I see only the following two messages, repeated over and over:
12-30-2015 00:13:53.533 +0900 INFO DbMaxSizeManager - Will freeze bucket=/opt/splunk/var/lib/splunk/jcare/colddb/db_1449606229_1449572203_26083 LT=1449606229 size=30715904 (29MB)
12-30-2015 00:13:55.068 +0900 ERROR BucketMover - aborting move because recursive copy from src='/opt/splunk/var/lib/splunk/jcare/colddb/db_1449606229_1449572203_26083' to dst='/opt/splunk/var/lib/splunk/frozen/jcare/inflight-db_1449606229_1449572203_26083' failed (reason='No space left on device')
The disk space error referenced is due to a known problem with our backups client that we're working on. However, that still leaves me with questions:
Curious whether anyone else has seen this and has suggestions as to why it's happening. Again, the root cause appears to be disk-related, but we're dependent on a support ticket with our backups vendor to resolve that, so I'd love any ameliorative suggestions we can implement in the meantime.
Adding relevant lines from indexes.conf here:
maxTotalDataSizeMB = 1272700
path = /opt/splunk/var/lib/splunk/tstats
path = /opt/splunk/var/lib/splunk/ctxhot
maxVolumeDataSizeMB = 36000
path = /opt/splunk/var/lib/splunk/ctxcold
maxVolumeDataSizeMB = 74000
homePath = $SPLUNKDB/jcare/db
coldPath = $SPLUNKDB/jcare/colddb
thawedPath = $SPLUNKDB/jcare/thaweddb
tstatsHomePath = volume:tstats/jcare
coldToFrozenDir = $SPLUNKDB/frozen/jcare
homePath.maxDataSizeMB = 185000
coldPath.maxDataSizeMB = 237000
maxHotBuckets = 3
maxDataSize = auto
I can't say I've seen this, as our freezing policy is currently "delete," but these are interesting questions that probably only a Splunk developer could accurately answer. It does lead me to a bigger question, though: why do we freeze data at all?
If the purpose of freezing data is to free up disk space on the searchable volume, then freezing first (and continuing to freeze) comes closer to guaranteeing that at least some space is freed before data is moved to a separate volume. Of the two operations (deleting the index metadata versus moving the bucket to a separate volume), the move seems more likely to fail, assuming an environment where Splunk is running and at least able to operate with the volume the cold buckets are currently on.
If the purpose of freezing data is to ensure that data is retained in a searchable state only for a particular amount of time, to meet some regulatory, legal, or other policy requirement, then freezing first and continuing to freeze ensures the same functional state is reached at the same (or similar) time, while the data remains on disk, ready to be moved once the frozen volume is unblocked.
Of course both of these are pure speculation. 🙂
Ok, so here's the order it goes in when rolling to frozen: Splunk first strips the bucket down to its frozen form (rawdata only), then tries to move it. You're failing to move the bucket, so it's staying in the cold folder as a frozen bucket.
To fix this, make a local frozen directory using coldToFrozenDir, and then use your backup script (outside of Splunk) to check the connection to the backup location; if it's up and accessible, copy the frozen buckets over.
If you're using a script, put in a test case that checks whether the backup location is available; if it isn't, move the buckets to a temporary location that the script will pick up and send to frozen storage the next time it executes.
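Here's a minimal sketch of that "probe the backup location, stage locally if it's down" idea. The demo runs against temporary directories so it's self-contained; in production, FROZEN_LOCAL would be your coldToFrozenDir target and BACKUP_DEST your backup mount (both names, and the touch-based probe, are assumptions to adapt to your backup client):

```shell
#!/bin/sh
# Sketch: ship frozen buckets to the backup location if reachable,
# otherwise park them in a staging directory for the next run.
# Demo paths are temp dirs; replace with real paths in production.
FROZEN_LOCAL=$(mktemp -d)   # stands in for the coldToFrozenDir target
BACKUP_DEST=$(mktemp -d)    # stands in for the backup location
STAGING=$(mktemp -d)        # holding area used while backups are down

# Fake one frozen bucket, as Splunk leaves it after rolling to frozen.
mkdir -p "$FROZEN_LOCAL/db_1449606229_1449572203_26083/rawdata"

ship_buckets() {
    dest=$1
    # Probe the destination with a throwaway file before trusting it.
    if touch "$dest/.probe" 2>/dev/null; then
        rm -f "$dest/.probe"
        # Ship anything parked in staging first, then the new buckets.
        for b in "$STAGING"/db_* "$FROZEN_LOCAL"/db_*; do
            [ -d "$b" ] && mv "$b" "$dest/"
        done
    else
        # Backup location unreachable: park buckets for the next run.
        for b in "$FROZEN_LOCAL"/db_*; do
            [ -d "$b" ] && mv "$b" "$STAGING/"
        done
    fi
}

ship_buckets "$BACKUP_DEST"
ls "$BACKUP_DEST"
```

Run it from cron at whatever interval suits your freeze rate; because staged buckets are shipped first on each run, a backup outage only delays the move rather than losing track of buckets.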
Thanks. So if it fails to move a bucket, will the process continue attempting to freeze other buckets? As stated above, we're seeing our entire colddb filled with "frozen" buckets.
Looks good to me. I just wanted to be sure you didn't have duplicate paths or something weird. I'm betting the frozen-delete policy doesn't apply when you enable a coldToFrozenDir and Splunk can't reach that directory at the time it tries the move.
Surely it's supposed to find these and move them once per day or something. It might be worth a support ticket to file a possible bug report.
Once the bucket is frozen but not moved, it will stay there. I don't know of any cleanup process that will "catch" the bucket and move it. I could be wrong, though, as I'm not very fluent in Splunk bucket maintenance.
I just know that, logically, it would create the frozen bucket, fail to move it, and therefore it would stay right there in the cold directory.
I'm thinking you'll have one frozen bucket in your cold directory for every one of these errors you've had.
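If that theory is right, you could sanity-check it by counting cold buckets that are already in the stripped, rawdata-only form. A small sketch, demoed against a temp directory (in practice you'd point COLDDB at the real colddb path from the post, and the Hosts.data file here just stands in for the metadata a normal bucket carries):

```shell
#!/bin/sh
# Count cold buckets whose only content is rawdata/ (the frozen form).
COLDDB=$(mktemp -d)   # demo stand-in for .../jcare/colddb

# Fake one stripped bucket and one ordinary cold bucket.
mkdir -p "$COLDDB/db_100_50_1/rawdata"
mkdir -p "$COLDDB/db_200_150_2/rawdata"
touch "$COLDDB/db_200_150_2/Hosts.data"   # ordinary buckets hold more than rawdata

count=0
for b in "$COLDDB"/db_*; do
    # A bucket whose only entry is rawdata/ matches the frozen form.
    [ "$(ls "$b")" = "rawdata" ] && count=$((count+1))
done
echo "frozen-looking buckets: $count"
```

Comparing that count against the number of BucketMover "aborting move" errors in splunkd.log would tell you whether it really is one stranded bucket per error.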