Hi All,
If I wanted to back up only the rawdata and exclude the 'index files', is it as simple as excluding *.tsidx, or do I need to do more?
I'm assuming that when you restore, Splunk will notice the 'index files' are missing and rebuild them for you. If this isn't automatic and I need to issue a command, that is fine, just tell me what to do! I figure it would be automatic, as index replication handles the creation of 'index files' by itself.
Some context:
Our backup guy is telling me my Splunk systems are the largest users of capacity, so I'm seeing what I can do to reduce the backup size. If there is nothing, so be it, but I'd like to know my options.
I have a clustered environment running Splunk 5.0.4 (4 indexers, with a replication factor and search factor of 4), so the chance of a restore being required is very low, but we obviously still need backups.
I am happy to accept the delay of service restoration while Splunk rebuilds the 'index files'.
It sounds like it is possible, as hinted at under: http://docs.splunk.com/Documentation/Splunk/5.0.4/Indexer/Backupindexeddata
From the above link "Another thing to consider when designing a cluster backup script is whether you want to back up just the bucket's rawdata or both its rawdata and index files. If the latter, the script must also identify a searchable copy of each bucket."
Thanks,
Carson.
The minimum you need to back up in order to restore and rebuild your data is the index/db*/rawdata/journal.gz files and the contents of the index/db*/rawdata/deletes/ directories. Other data, including the tsidx files, can be reconstructed from these, though it will take time and CPU to do so.
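As a rough illustration of that minimal set, here is a sketch in Python that lists the files a backup would need to pick up. The function name `minimal_backup_set` and the choice to search recursively for any `rawdata` directory are assumptions made for this example, not anything Splunk provides; check the result against your actual index layout before relying on it.

```python
from pathlib import Path


def minimal_backup_set(index_dir):
    """Return the files needed to rebuild the buckets under index_dir.

    Hypothetical helper: collects rawdata/journal.gz and everything under
    rawdata/deletes/ for each bucket, and ignores all other bucket files.
    """
    files = []
    for rawdata in sorted(Path(index_dir).rglob("rawdata")):
        if not rawdata.is_dir():
            continue
        journal = rawdata / "journal.gz"
        if journal.is_file():
            files.append(journal)
        deletes = rawdata / "deletes"
        if deletes.is_dir():
            # Keep every file in the deletes directory, at any depth.
            files.extend(sorted(p for p in deletes.rglob("*") if p.is_file()))
    return files
```

Calling `minimal_backup_set("/path/to/index")` would give a list you could hand to whatever copies the files; the path here is a placeholder.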
You should note that when the "rep factor" is higher than the "search factor", the extra non-searchable copies will likewise keep only these minimal files.
In addition, however, to the tsidx files, which can be rebuilt by issuing an index rebuild command, you could also
Yeah, I'm aware of that. It is even, with 2 in each DC, so 3 could be okay for me, but for completeness' sake I've chosen 4.
Hopefully you're aware that you can only be guaranteed 2 searchable copies at each of 2 sites if you have only 2 indexer nodes in the cluster at each site, since Splunk replication in the current version is not site-aware. If you have 3 or more nodes at one site, it is possible for 3 or more copies to be at the same site.
Against Splunk's advice, I'm doing replication across the WAN (my WAN link is 600Mbps with ~25ms latency, which is why I'm going against their advice). I wanted 2 searchable copies in each DC so that everything is okay if there is a link outage and a server failure at the same time.
You're right, I could probably drop the search/rep factor to 3 and still be okay, but disk and processing are still cheap compared to downtime.
Perfect, thank you.
If you restore to a cluster that needs to re-establish its search factor, then the index files should get rebuilt automatically. But if you restore to a standalone node, you need to execute a rebuild on each bucket. The extra files should not cause any problems.
To be clear... excluding *.tsidx will result in those files being recreated... Is that automatic, or only when the rebuild command is run? (So I can update my restore documentation)
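For the standalone case, a sketch along these lines could generate one rebuild command per restored bucket that has no tsidx files. It assumes the per-bucket command takes the form `splunk rebuild <bucket directory>`; confirm that against the documentation for your Splunk version. The function name `rebuild_commands` and the `splunk_bin` parameter are made up for this example, and the sketch only builds the commands, it does not run them.

```python
from pathlib import Path


def rebuild_commands(index_dir, splunk_bin="splunk"):
    """Build one rebuild command per bucket that lacks tsidx files.

    Hypothetical helper. Assumes the command form `splunk rebuild <bucket dir>`;
    verify this for your Splunk version before running anything.
    """
    commands = []
    for journal in sorted(Path(index_dir).glob("**/rawdata/journal.gz")):
        bucket = journal.parent.parent  # <bucket>/rawdata/journal.gz
        has_tsidx = next(bucket.glob("*.tsidx"), None) is not None
        if not has_tsidx:
            commands.append([splunk_bin, "rebuild", str(bucket)])
    return commands
```

Each returned entry is an argument list, so after reviewing the output you could pass it to `subprocess.run` one bucket at a time.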
Also, it would be much more reliable to exclude *.tsidx using the backup agent... leaving the other files won't cause any problems? (Other files being: bloomfilter bucket_info.csv Hosts.data merged_lexicon.lex optimize.result Sources.data SourceTypes.data splunk-autogen-params.dat Strings.data)
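To make the exclusion rule concrete, here is a minimal sketch of a pattern-based filter that drops *.tsidx and leaves every other bucket file alone. A real backup agent will have its own exclusion syntax; `EXCLUDE_PATTERNS` and `should_back_up` are names invented for this illustration.

```python
from fnmatch import fnmatch

# Assumed exclusion list for this sketch: only the tsidx files are skipped.
EXCLUDE_PATTERNS = ["*.tsidx"]


def should_back_up(filename, patterns=EXCLUDE_PATTERNS):
    """Return True unless the file name matches an exclusion pattern."""
    return not any(fnmatch(filename, pattern) for pattern in patterns)
```

With this rule, files such as bloomfilter, Hosts.data and journal.gz are still backed up, and only names ending in .tsidx are skipped.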
Is there any reason in particular you want/need a replication factor AND a search factor of 4? That seems a bit on the excessive side, and there may be more efficient ways to give you the redundancy/resiliency you're after (while keeping storage volumes down).
Just thought I'd get some more info before I provide a (possible) answer 🙂