Recently we replaced our RedHat 7 peers with new RedHat 9 peers, and it seems we lost some data in the process...
Looking at the storage, it almost seems like we lost the cold buckets (and maybe also the warm ones).
We managed to restore a backup of one of the old RHEL7 peers and we connected this to the cluster, but it looks like it's not replicating the cold buckets to the RHEL9 peers..
We are not using smart storage, the cold buckets are in fact just stored in another subdir under the $SPLUNK_DB path.
So... the question arises: are warm and cold buckets replicated?
Our replication factor is set to 3, and I added a single restored peer to a 4-peer cluster.
If there is no automated way of replicating the cold buckets... can I safely copy them from the RHEL7 node to the RHEL9 nodes (e.g. via scp)?
ENOTENOUGHINFO
What exactly did you do? Did you just spin up an instance restored from snapshot/backup? Did you add it to the cluster? Does the CM see it? Do you see the buckets at all? Haven't they rolled to frozen yet on other nodes? What does the dbinspect say?
What we did was :
When I check the "Settings / Indexer Clustering" page on the master it does show the recovered node as well.
The "Indexes" tab on this same page shows all indexes are green.
But... when I do a search for the earliestTime, the older data which is on the recovered peer is not seen.
Only when I add the recovered peer to distSearch.conf does the search see the older events.
Also, when I remove the recovered peer from the cluster again, the older events are gone again, which indicates those cold buckets were not synced to the production nodes.
The buckets have not rolled to frozen, because the frozenTimePeriodInSecs for the index is set to 157248000 (about 5 years) and the data I am trying to recover is from 2020.
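For reference, the "about 5 years" figure can be checked with a quick calculation. This is only a sketch; the helper name secs_to_days is made up for illustration and is not part of Splunk.

```shell
# Convert a frozenTimePeriodInSecs value to whole days (integer division).
# secs_to_days is a hypothetical helper name, not a Splunk command.
secs_to_days() {
  echo $(( $1 / 86400 ))
}

secs_to_days 157248000   # prints 1820, i.e. a little under 5 x 365 days
```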
And I did just run a dbinspect and it seems not to give any errors on the cold buckets on the restored host.
Path is the colddb-path and state is 'cold' as expected
Eventually I would like to remove the recovered peer again from the cluster, since this is still running RHEL7 and it has to be switched off..
So... I am looking for a way to safely get the data on the RHEL9 nodes.
And as a side note, I would like to understand how the warm/cold buckets are handled. Because... if they are indeed not replicated, it also explains why they were lost in the first place... the RHEL9 nodes were clean installations which replaced the RHEL7 nodes. The rough procedure followed in this migration was :
So, if cold buckets were not replicated, they were never replicated to the overflow node and eventually were all gone..
OK. The buckets contain several things in their directory name. Most notably, clustered non-hot buckets contain the GUID of the source indexer. So if you change the GUID of the indexer, it will not match any existing indexers and will not be treated as part of the cluster (actually it's not explicitly written anywhere, but I suppose it will be treated as an unclustered bucket).
Probably the same goes for your original problem - I suppose you had a stand-alone indexer or just distributed indexers without a cluster and then decided to cluster your indexers. In such a case, without manual intervention, old buckets are treated as unclustered and are _not_ replicated.
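To make the naming point concrete, here is a minimal sketch that classifies a bucket directory name by its underscore-separated fields. It assumes the usual layout <db|rb>_<newest>_<oldest>_<localid>[_<guid>], where a missing GUID field means the bucket was created without clustering and an rb_ prefix marks a replicated copy. The function name and the timestamps in the usage lines are made up for illustration.

```shell
# Classify a bucket directory name.
# Assumed layout: <db|rb>_<newest>_<oldest>_<localid>[_<guid>]
# classify_bucket is a hypothetical helper, not a Splunk command.
classify_bucket() {
  # Run in a subshell so IFS and the positional parameters stay local.
  (
    IFS=_
    set -f        # disable globbing for the unquoted expansion below
    set -- $1     # split the name on "_"
    prefix=$1
    guid=$5
    if [ -z "$guid" ]; then
      echo "standalone"
    elif [ "$prefix" = "rb" ]; then
      echo "replicated-copy $guid"
    else
      echo "clustered-origin $guid"
    fi
  )
}

classify_bucket db_1600000000_1590000000_12
classify_bucket db_1600000000_1590000000_12_10B29386-EAD3-45F6-AFEF-6C5897D7507E
```

The first call reports a standalone bucket (no GUID field), the second a clustered bucket together with its origin GUID.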
Aha... that makes sense ... and explains a lot.
I will see if I can restore the cold buckets by renaming the files / setting the correct GUID in the instance.cfg on the restored node.
Thanks a lot for pointing me in the right direction.
Did a quick check on the files in the colddb directory.
There are 4 different GUIDs, which are actually the same as the GUIDs of the existing peers (which makes sense, since I used the original /opt/splunk/etc on the new RHEL9 nodes, which includes the instance.cfg holding the GUID).
$ ls -lrt colddb/ | awk -F_ '{print $5}' | sort -u
10B29386-EAD3-45F6-AFEF-6C5897D7507E
289FAAF8-810C-454E-9CF5-4DEA9C5CA3E7
332E50AC-2BE6-4FFB-96AB-3F7D612A1422
9C46DD6F-782E-4675-8E9B-90CABC42221D
And the current peers :
$ splunk list cluster-peers | grep -v ":" | grep [0-9]
10B29386-EAD3-45F6-AFEF-6C5897D7507E
289FAAF8-810C-454E-9CF5-4DEA9C5CA3E7
332E50AC-2BE6-4FFB-96AB-3F7D612A1422
42C49D52-0A71-4164-91EC-806EAEEEE085
9C46DD6F-782E-4675-8E9B-90CABC42221D
(The 42C49... GUID is from the restored node, holding all the cold buckets)
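A small variation on the ls/awk check above also counts how many bucket directories belong to each GUID, which helps when estimating how much has to be moved. This is a sketch only; the function name is made up, and it assumes the GUID is the fifth underscore-separated field of the directory name.

```shell
# Count bucket directories per origin GUID in the given directory.
# bucket_guid_counts is a hypothetical helper name.
# Names with fewer than 5 fields (no GUID) are skipped.
bucket_guid_counts() {
  ls -1 "$1" \
    | awk -F_ 'NF >= 5 { n[$5]++ } END { for (g in n) print n[g], g }' \
    | sort -k2
}

# Usage (the path is a placeholder):
#   bucket_guid_counts /path/to/index/colddb
```

Each output line is a count followed by a GUID, sorted by GUID.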
My issue is solved
I manually copied the files from colddb on the recovered node to the colddb location on the production nodes.
(enable maintenance-mode, stop splunk on receiving node, copy files, make sure ownership is correct, start splunk, disable maintenance-mode)
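For anyone following along, the steps above can be sketched roughly as below. The function only prints the commands for one receiving peer, it does not run them. The host name, both paths and the splunk:splunk owner are placeholders and assumptions, not values from this thread. Before copying, it is worth checking that no directory with the same name already exists in the destination, and note that "splunk enable maintenance-mode" may ask for confirmation.

```shell
# Print (not execute) the copy steps for one receiving peer.
# Arguments are placeholders: source host, source colddb, destination colddb.
# print_copy_plan is a hypothetical helper; splunk:splunk is an assumed owner.
print_copy_plan() {
  src_host=$1
  src_dir=$2
  dest_dir=$3
  cat <<EOF
# on the cluster manager
splunk enable maintenance-mode
# on the receiving peer
splunk stop
scp -rp "$src_host:$src_dir/*" "$dest_dir/"
chown -R splunk:splunk "$dest_dir"
splunk start
# on the cluster manager
splunk disable maintenance-mode
EOF
}

# Usage (all three values are made up):
#   print_copy_plan old-rhel7-peer /opt/splunk/var/lib/splunk/myindex/colddb /opt/splunk/var/lib/splunk/myindex/colddb
```

Printing the plan first makes it easy to review the order of the steps before running anything on a production peer.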
The recovered node is currently still in the cluster, because removing it would fill up the remaining indexers a bit too much.. which would lead to data loss again 🙂
Once we have added some additional RHEL9 peer nodes, we will remove the recovered node and life will be good again.
Thanks for the tips and clarifying info.