Brilliant thanks!
This helps a lot and confirms my suspicions about what is going on. I don't think I'll encounter this edge case in production, but I'll account for it nonetheless, just to be safe.
Thanks for the tip regarding archiving. The scripts I'm testing are for a managed Splunk archiving system I've built in Python. These scripts reconcile buckets, removing all of the replicas/copies of a source bucket (dedup) and ensuring that only a single master copy is stored, which saves space. I'm trying to build something 'enterprise grade', since you can run into data loss when using coldToFrozenDir and standard operating system tools to copy or move buckets mid-freeze.
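To give a rough idea of the dedup step, here is a minimal sketch (not the actual code) that groups bucket copies by their source bucket. It assumes the archived bucket directories keep Splunk's clustered naming, db_<newest>_<oldest>_<localid>_<guid> for the origin copy and rb_... for replicas, so all copies of one source bucket share everything after the prefix:

```python
# Minimal sketch of the dedup grouping idea, not the actual implementation.
# Assumes clustered bucket directory names of the form
#   db_<newest>_<oldest>_<localid>_<guid>  (origin copy)
#   rb_<newest>_<oldest>_<localid>_<guid>  (replica copy)
import os
import re
from collections import defaultdict

BUCKET_RE = re.compile(r"^(db|rb)_(\d+)_(\d+)_(\d+)(?:_([A-F0-9-]+))?$", re.IGNORECASE)

def group_bucket_copies(archive_root):
    """Group bucket directories under archive_root by their source-bucket id."""
    groups = defaultdict(list)
    for parent, dirnames, _ in os.walk(archive_root):
        for name in dirnames:
            m = BUCKET_RE.match(name)
            if m:
                # Key ignores the db_/rb_ prefix so origin and replicas collapse together.
                key = m.group(2, 3, 4, 5)
                groups[key].append(os.path.join(parent, name))
    return groups

# Everything in each group beyond a single "master" copy is a candidate for removal.
```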
Take the following scenario when using coldToFrozenDir for an index: Splunk freezes a bucket, copying it to the path specified in coldToFrozenDir. If your buckets are large and you have an OS script/cron job that copies or moves the buckets out to a storage location on a different mount point (a very common use case), you run the risk of truncated data at the target location, because the script may copy a source file that Splunk is still writing to as it freezes the bucket. The copy will not 'wait' for the write to complete. If the target is on the same filesystem, a move covers you and you end up with a complete file, due to the way inodes are handled in Linux (and, I suspect, other *nixes).
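One simplified way to reduce that risk when you must copy across filesystems is to wait until the bucket directory looks quiescent before copying it. This is only a sketch of the idea; the settle interval and the "unchanged fingerprint" rule are my own assumptions, not anything Splunk guarantees:

```python
# Rough sketch: only copy a frozen bucket to another filesystem once its total
# size and newest mtime have stopped changing for a settle window.
import os
import time
import shutil

def dir_fingerprint(path):
    """Total size and newest mtime of all files under path."""
    total, newest = 0, 0.0
    for root, _, files in os.walk(path):
        for f in files:
            st = os.stat(os.path.join(root, f))
            total += st.st_size
            newest = max(newest, st.st_mtime)
    return total, newest

def copy_when_stable(src_bucket, dst_root, settle_seconds=30):
    previous = dir_fingerprint(src_bucket)
    while True:
        time.sleep(settle_seconds)
        current = dir_fingerprint(src_bucket)
        if current == previous:
            break  # nothing grew or changed during the settle window
        previous = current
    shutil.copytree(src_bucket, os.path.join(dst_root, os.path.basename(src_bucket)))
```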
I think one sure way to prevent this is to stop the indexer before copying buckets out of coldToFrozenDir to your archive location: with no files being written to by Splunk, the copy is 'safe' even to a different filesystem such as an NFS/HDFS share.
The scripts I'm testing provide a safe way to do all of the above while Splunk is running and freezing buckets. The coldToFrozenScript generates a lockfile for each bucket, which the consolidation (dedup) and bucket-moving scripts check for before they touch the bucket (roughly as in the sketch below).
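The lockfile handshake is conceptually simple. The names and layout below are illustrative only (a <bucket>.lock file next to the bucket directory), not the actual scripts:

```python
# Illustrative lockfile handshake: the freezing side holds <bucket>.lock while
# it writes the frozen copy; the dedup/move side skips locked buckets.
import os

def acquire_lock(bucket_path):
    """Called by the freezing side before it starts populating bucket_path."""
    lock_path = bucket_path + ".lock"
    # O_EXCL makes creation atomic, so two writers can't both think they own it.
    fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    os.close(fd)
    return lock_path

def release_lock(bucket_path):
    """Called by the freezing side once the bucket copy is complete."""
    os.remove(bucket_path + ".lock")

def is_safe_to_process(bucket_path):
    """Called by the dedup/move scripts: only touch buckets with no lockfile."""
    return not os.path.exists(bucket_path + ".lock")
```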
This allows you to manage archiving on a cluster of any size without needing to shut down nodes to guarantee a safe copy. The dedup, coldToFrozenScript, and copy scripts also use a modular plug-in system for bucket verification (full source/destination hash checking, file size, etc.), as well as for encrypting, moving, or uploading buckets to S3 after dedup.
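For illustration, a verification plug-in could look something like the following. This is a hypothetical interface, not the real one:

```python
# Hypothetical shape of a verification plug-in interface, just to show the
# modular-checks idea; the real scripts' API is likely different.
import hashlib
import os

class VerificationPlugin:
    def verify(self, source_dir, dest_dir):
        raise NotImplementedError

class SizeCheck(VerificationPlugin):
    def verify(self, source_dir, dest_dir):
        return _total_size(source_dir) == _total_size(dest_dir)

class HashCheck(VerificationPlugin):
    def verify(self, source_dir, dest_dir):
        return _tree_hash(source_dir) == _tree_hash(dest_dir)

def _total_size(path):
    return sum(os.path.getsize(os.path.join(root, f))
               for root, _, files in os.walk(path) for f in files)

def _tree_hash(path):
    """Hash file contents in a deterministic order (file names omitted for brevity)."""
    h = hashlib.sha256()
    for root, dirs, files in os.walk(path):
        dirs.sort()  # deterministic traversal order
        for f in sorted(files):
            with open(os.path.join(root, f), "rb") as fh:
                for chunk in iter(lambda: fh.read(1 << 20), b""):
                    h.update(chunk)
    return h.hexdigest()

# A copy step might run every configured plugin and only delete the source
# copy once all of them return True.
```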
It also has extensive logging, so when I have some time I plan to develop a Splunk app to report on bucket health and status throughout the archiving system, along with metrics such as disk space saved by consolidation/dedup.
The aim is to automate as much as possible and to be modular/flexible.
Thanks for your help, mmodestino!