Our Splunk system has the potential to grow significantly in the near future, so a Veeam backup of the indexer VM will not be practical for long-term storage (5 years).
My plan:
2 indexers in a multisite cluster. 100% of my data will exist at each site.
The hot + warm buckets + OS will be on the primary partition, backed up by Veeam with VSS.
The cold/frozen data will be archived off to long-term storage. Warm/Cold/Frozen will be something like 6 months/1.5 years/5 years.
But, I don't want to be completely dependent on synchronization of my indexes between the 2 sites, and data corruption/loss will involve some severe implications for us.
What I would like to do is have a hot bucket roll off to both warm AND frozen at the same time. That way, my long-term storage is using the smaller buckets, I don't have to worry about waiting for things to age off to frozen, and I have a nice safe copy of my frozen data.
I see that buckets can go to different locations, but is it possible to route a bucket to 2 locations/statuses simultaneously?
Why don't you trust the replication of data between sites? Why bother clustering at all then? This is the only reason clustering exists.
You can't roll buckets to two locations per se, but you could look at the data roll feature, which will replicate a single copy of each journal.gz to S3 or HDFS. (The journal.gz is what rolls to frozen and is why it is smaller than warm; see the tsidx reduction link below.)
https://docs.splunk.com/Documentation/Splunk/6.6.2/Indexer/ArchivingindexestoHadoop
(Don't be fooled by the name: you don't need to run Hadoop to use the feature, it simply uses Hadoop binaries to do the copying of the buckets.)
Concern about bucket size is not affected by the items you have discussed. That is solely controlled by indexes.conf configurations.
https://docs.splunk.com/Documentation/Splunk/6.6.2/Admin/Indexesconf
Technically I guess you could write some logic to manually copy all journal.gz from warm buckets to a disk mount somewhere....but I'd guess that might be more prone to data loss than replication as we ship it.
Honestly, I would look at tsidx reduction before anything else, as everything described here smells like over-complexity to me.
https://docs.splunk.com/Documentation/Splunk/6.6.2/Indexer/Reducetsidxdiskusage
Depending on your environment and exact requirements, there might be another approach. If you
then you could
The data on your regular indexer cluster remains available and is searchable. Your search head(s) only use this indexer cluster. The copy sent to the "backup" indexer is for backup purposes only; it is frozen after a short time due to the different settings and you can backup the frozen data and even remove it once it's backed up. The "backup" indexer does not need too much disk space as it only stores the data for a relatively short time. The disk requirements on your regular indexers could possibly be somewhat reduced too if you don't need to freeze buckets there any more and can just delete the data when it times out.
Just an idea, but theoretically it should work.
I have a great amount of confidence in the clustering -- that is why I am using it. My "don't trust" comment was too harsh.
But, NOWHERE in my system am I comfortable saying that I have no external backups.
Being able to archive off my log data to a completely separate system is a safety net we need.
I can understand that, but to be clear, we are saying a fully redundant second site is not enough redundancy? AND full VM backups? (Not sure VM backups will work for you tbh; maybe for config directories, but for all the data? These VMs will get big.)
Still I can appreciate the idea, so to address this, I recommend looking at data roll.
You could archive as soon as the bucket rolls to warm (or at any timeframe you wish) and roll it to a Hadoop instance, an Amazon S3 bucket, an Isilon NAS, etc. (basically anything that can run S3). It is identical to rolling to frozen, as only the journal.gz goes. It even dedups replicated data and has fault tolerance.
Otherwise I'd look at rsync of the warm db to another system (I have my doubts this will scale if you do grow large), or straight up configure your systems to send directly to an archive as a second data landing zone.
Also, remember the journal.gz is a proprietary file type, so if you roll them to archive, you need to bring em back into Splunk to be thawed, or use the bucket reader app in hadoop.
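For what "bring them back to be thawed" could look like in practice, here is a minimal Python sketch. It assumes the usual restore flow as I understand it (copy the archived bucket directory into the index's thaweddb directory, then run `splunk rebuild` on it and restart); check the restore docs for your version before relying on it. The function name `stage_for_thaw` and both paths are made up for illustration, and the sketch only returns the rebuild command rather than running it.

```python
import shutil
from pathlib import Path


def stage_for_thaw(archived_bucket, thaweddb_dir):
    """Copy one archived bucket directory into thaweddb.

    Returns the rebuild command as a list so the caller can review it
    (or pass it to subprocess.run) instead of running it blindly.
    """
    src = Path(archived_bucket)
    dest = Path(thaweddb_dir) / src.name
    if dest.exists():
        # Never overwrite a bucket that is already staged or thawed.
        raise FileExistsError(f"{dest} is already staged")
    shutil.copytree(src, dest)
    # Assumption: 'splunk' is on PATH and 'rebuild' regenerates the
    # index files from rawdata/journal.gz.
    return ["splunk", "rebuild", str(dest)]
```

You would call it once per bucket you want back, for example `stage_for_thaw("/mnt/archive/main/db_200_100_1", "/opt/splunk/var/lib/splunk/main/thaweddb")`, with both paths adjusted to your own layout.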
I am looking less for redundancy (as provided by cluster), and more to be able to take my X# of logs completely out of the system. Disaster recovery scenarios, or malicious user activity.
As for the VM -- I expect the majority of the data to not be part of the VEEAM backup. We are still sizing things, but I would rather go too safe and work backwards than have an issue I cannot get out of.
When you say "the system", do you really mean Splunk, or just the physical system itself? It almost sounds like simply sending a second stream from the source directly to an archive would be easiest. Is that possible?
If you are OK with the DR plan being logs in journal.gz format, then yeah, data roll or some custom rsync.
By "system", I mostly mean splunk/the cluster.
We archive logs from several different groups to intermediate forwarders and finally to our splunk.
We have no insight into what others have, other than what they send via their intermediate forwarders.
So you are OK with it being in the proprietary Splunk journal.gz format in the archive? If so, rsync or data roll.
I believe that just the journal.gz should be fine. It is a small file, has all of the data I would need to make it searchable again, and can be copied off with no impact if I wait until it is warm. Splitting to warm+frozen simultaneously would be more fun, but I shall make do.
Thank you for your help here.
Well, if you can swing an HDFS or S3 interface, data roll technically gives you warm + frozen (archive)!
Any time!
Good on ya for diligently planning for the worst!
I don't think there is any native way to do this. You might have to write a custom script that runs at a regular interval and syncs the warm buckets (default location: $SPLUNK_HOME/var/lib/splunk/<<IndexName>>/db, directory names starting with db_) to your frozen location, maybe with rsync.
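A rough Python sketch of such a script follows. It rests on a few assumptions that are not stated in the thread: warm buckets are the `db_*` directories under the index's db path, hot buckets (`hot_*`) and replicated copies (`rb_*`) should be skipped, and only `rawdata/journal.gz` needs to be kept. The function name `copy_warm_journals` and the archive layout are made up for illustration.

```python
import shutil
from pathlib import Path


def copy_warm_journals(index_db_dir, archive_dir):
    """Copy rawdata/journal.gz from every db_* bucket into archive_dir.

    Buckets already present in the archive are skipped, so the function
    can be run repeatedly (for example from cron).
    Returns the names of the buckets copied on this run.
    """
    index_db = Path(index_db_dir)
    archive = Path(archive_dir)
    copied = []
    for bucket in sorted(index_db.glob("db_*")):
        journal = bucket / "rawdata" / "journal.gz"
        if not journal.is_file():
            continue  # incomplete bucket or stray file: nothing to copy
        dest = archive / bucket.name / "rawdata" / "journal.gz"
        if dest.exists():
            continue  # archived on an earlier run
        dest.parent.mkdir(parents=True, exist_ok=True)
        # Copy to a temporary name first so an interrupted run never
        # leaves a truncated journal.gz under the final name.
        tmp = dest.with_name(dest.name + ".partial")
        shutil.copy2(journal, tmp)
        tmp.replace(dest)
        copied.append(bucket.name)
    return copied
```

Run it per index, for example `copy_warm_journals("/opt/splunk/var/lib/splunk/main/db", "/mnt/archive/main")`, with both paths adjusted to your own layout. Skipping `rb_*` directories is a guess at how to avoid archiving the same bucket from both sites; verify that against your cluster before trusting it.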
I was aiming to use frozen to keep my storage size down -- it seems frozen buckets are significantly smaller than warm with the extra files removed.