This helps a lot and confirms my suspicions about what is going on. I don't think I'll encounter this edge case in production, but I'll account for it nonetheless, just to be safe.
Thanks for the tip regarding archiving. The scripts I'm testing are part of a managed Splunk archiving system I've built in Python. These scripts reconcile buckets (removing all the replicas/copies of a source bucket, i.e. dedup) and ensure only a single master copy is stored, thus saving space. I'm trying to build something 'enterprise grade', as you can run into data-loss issues when using coldToFrozenDir and standard operating-system tools to copy or move buckets mid-freeze.
Consider the following scenario when using coldToFrozenDir for an index: Splunk freezes a bucket, copying it to the path specified by coldToFrozenDir. If your buckets are large and you have an OS script/cronjob that copies or moves the buckets out to a storage location on a different mount point (a very common use case), you run the risk that data is truncated at the target location, because the script can copy a source file that Splunk is still writing to as it freezes the bucket. The copy will not 'wait' for the write to complete. If the target is on the same filesystem you are covered with a move and will end up with a complete file, due to the way inodes are handled in Linux (and, I suspect, other *nixes).
I think a sure way to prevent this is to stop the indexer before copying the buckets out of the coldToFrozenDir to your archive location: with the indexer stopped there are no files being written to by Splunk, so copying to a different filesystem such as an NFS/HDFS share would be 'safe'.
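If you can't stop the indexer, one mitigation is to only copy a file once its size and mtime have stopped changing. This is a minimal sketch of that idea (the function name, polling interval, and check count are my own assumptions, not part of any Splunk tooling); it reduces the truncation risk but is not a substitute for proper locking:

```python
import os
import shutil
import time

def copy_when_stable(src, dst, poll_secs=2.0, checks=2):
    """Copy src to dst only after its size and mtime have been
    unchanged for `checks` consecutive polls. A hypothetical sketch
    of one way to avoid copying a file Splunk is still writing."""
    stable = 0
    last = None
    while stable < checks:
        st = os.stat(src)
        sig = (st.st_size, st.st_mtime)
        if sig == last:
            stable += 1
        else:
            stable = 0
            last = sig
        time.sleep(poll_secs)
    shutil.copy2(src, dst)  # copy2 preserves mtime for later checks
```

Note this is still racy in principle (a writer could pause longer than the polling window), which is why the lockfile approach below the freeze script is the safer design.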
The scripts I'm testing provide a safe way to do all of the above while Splunk is running and freezing buckets. The coldToFrozenScript generates lockfiles for each bucket, which are checked by the consolidation (dedup) scripts and the bucket-moving scripts.
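The lockfile pattern can be sketched roughly like this (the `.lock` suffix and layout are my illustrative assumptions, not the author's actual convention). The key point is that `O_CREAT | O_EXCL` makes creation atomic, so two scripts cannot both claim the same bucket:

```python
import os
from contextlib import contextmanager

@contextmanager
def bucket_lock(bucket_dir):
    """Hold an exclusive lockfile next to a bucket directory.
    O_EXCL makes os.open fail if the lockfile already exists,
    so only one script works on the bucket at a time."""
    lock_path = bucket_dir + ".lock"
    fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    try:
        os.write(fd, str(os.getpid()).encode())  # record owner PID
        yield lock_path
    finally:
        os.close(fd)
        os.remove(lock_path)
```

The dedup and move scripts would then simply skip any bucket whose lockfile exists, retrying on a later pass.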
This allows you to manage archiving on a cluster of any size without needing to shut down nodes to guarantee a safe copy. The dedup, coldToFrozenScript, and copy scripts also use a modular plug-in system for verifying buckets (full source/destination hash checking, file size, etc.), as well as for encrypting, moving, or uploading buckets to S3 after dedup.
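A full source/destination hash check of the kind such a verification plug-in might perform could look like this (a sketch using SHA-256; the function names are my own, not the system's actual plug-in API):

```python
import hashlib

def sha256_of(path, chunk=1 << 20):
    """Stream a file through SHA-256 in 1 MiB chunks so large
    bucket files don't need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def verify_copy(src, dst):
    """Return True only if source and destination hash identically."""
    return sha256_of(src) == sha256_of(dst)
```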
It also has extensive logging, so when I have some time I plan to develop a Splunk app that reports on bucket health and bucket status throughout the archiving system, plus metrics such as disk space saved by consolidation/dedup.
The aim is to automate as much as possible and to be modular/flexible.
Thanks for your help mmodestino!
I hope someone can shed some light on buckets being frozen before they can be replicated, due to streaming errors:
I have a coldToFrozen script that copies buckets to a location for long-term archiving. My scripts are installed on all the cluster peers and are working as expected. As part of testing I've encountered the following edge case and would like clarity on whether this is expected Splunk behaviour.
I have an event generator that generates thousands of events for a single day, written to an index on the cluster. This index has an aggressive rolling period: frozenTimePeriodInSecs = 60. As part of my testing I also restart cluster members every 15 minutes or so (a cronjob calls ./splunk restart on each peer), causing a streaming error on the host sending the original bucket.
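For context, the aggressive test setting lives in indexes.conf; a minimal sketch (the index name and paths here are hypothetical, only the frozen period matches my test):

```
[freeze_test]
homePath   = $SPLUNK_DB/freeze_test/db
coldPath   = $SPLUNK_DB/freeze_test/colddb
thawedPath = $SPLUNK_DB/freeze_test/thaweddb
# Freeze buckets one minute after their newest event ages out
frozenTimePeriodInSecs = 60
```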
What I have encountered is that when a bucket is busy streaming from the source to a replication peer, AND that destination peer is shut down (causing a replication failure), AND the frozenTimePeriodInSecs limit for that bucket is reached on the source, the source indexer will happily freeze the bucket, making it no longer eligible for replication.
You will end up with the number of frozen copies of the bucket across the cluster being less than repFactor.
Bear in mind that I'm assuming the number of frozen copies of a bucket across the indexer cluster will always match repFactor. If the above happens, this will not be the case...
Because freeze timeouts are evaluated and enforced on each individual peer, I suspect the above is normal behaviour and thus an edge case?
The documentation says:
"In the case of an indexer cluster, when a peer freezes a copy of a bucket, it notifies the master. The master then stops doing fix-ups on that bucket. **It operates under the assumption that the other peers will eventually freeze their copies of that bucket as well."
The issue is that the freeze happens on the originating peer before the bucket can be streamed successfully to the replication peers.
I have a question regarding the fields returned by Splunk App for Stream. I've configured a number of TCP flow monitors and I see some flows have a "cancelled" attribute.
I couldn't find any documentation on this field's purpose. Could it mean that an RST was sent instead of a FIN or FIN/ACK to end the TCP flow? Are there other definitions I'm not considering?