Splunk Enterprise

Are cold buckets replicated when I add an old peer to the cluster?

pharmapartners
Explorer

Recently we replaced our Red Hat 7 peers with new Red Hat 9 peers, and it seems we lost some data in the process.

Looking at the storage, it almost seems like we lost the cold buckets (and maybe also the warm ones).

We managed to restore a backup of one of the old RHEL7 peers and connected it to the cluster, but it does not appear to be replicating the cold buckets to the RHEL9 peers.

We are not using SmartStore; the cold buckets are simply stored in another subdirectory under the $SPLUNK_DB path.

So the question arises: are warm and cold buckets replicated?

Our replication factor is set to 3, and I added a single restored peer to a 4-peer cluster.

If there is no automated way of replicating the cold buckets, can I safely copy them from the RHEL7 node to the RHEL9 nodes (e.g. via scp)?


PickleRick
SplunkTrust

ENOTENOUGHINFO

What exactly did you do? Did you just spin up an instance restored from snapshot/backup? Did you add it to the cluster? Does the CM see it? Do you see the buckets at all? Haven't they rolled to frozen yet on other nodes? What does the dbinspect say?


pharmapartners
Explorer

What we did was:

  • Restored 2 old peer nodes from a backup.
  • Cloned the master node to set up a shadow cluster and changed the replication factor on this clone to 2.
    This gave us a fully balanced mini-cluster (so both restored peer nodes would have all data).
    I did notice, however, that on one of the two recovered nodes the colddb location remained empty.
  • Placed the shadow cluster in maintenance mode and removed one of the peer nodes.
  • Reconfigured this peer to connect to the production cluster.
    Also changed the name in server.conf and removed instance.cfg to prevent duplicate peer names and GUIDs.

When I check the "Settings / Indexer Clustering" page on the master, it does show the recovered node as well.
The "Indexes" tab on this same page shows all indexes are green.

But when I search for the earliest event time, the older data that is on the recovered peer is not found.
Only when I add the recovered peer to distSearch.conf does the search return the older events.
And when I remove the recovered peer from the cluster again, the older events disappear again, which indicates that those cold buckets were not synced to the production nodes.

The buckets have not rolled to frozen, because frozenTimePeriodInSecs for the index is set to 157248000 (about 5 years) and the data I am trying to recover is from 2020.
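
As a quick sanity check of the "about 5 years" figure, the setting can be converted to days with plain shell arithmetic. This is only a sketch; the helper name `secs_to_days` is made up for illustration.

```shell
#!/bin/sh
# Convert a frozenTimePeriodInSecs value to whole days (86400 s per day).
# secs_to_days is a hypothetical helper, not part of Splunk.
secs_to_days() {
    echo $(( $1 / 86400 ))
}

secs_to_days 157248000   # prints 1820 (days), i.e. just under 5 years
```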

I also just ran dbinspect, and it does not report any errors for the cold buckets on the restored host.
The path is the colddb path and the state is 'cold', as expected.
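
For reference, a check along these lines could look like the sketch below. It only builds and prints the search string; the index name `main` is a placeholder, the function name is invented, and the fields used (`state`, `startEpoch`, `splunk_server`, `guId`) are the ones `dbinspect` normally returns.

```shell
#!/bin/sh
# Build an SPL query that counts cold buckets per peer for one index.
# cold_bucket_query is a hypothetical helper; the index name is a placeholder.
cold_bucket_query() {
    index="$1"
    printf '| dbinspect index=%s | search state=cold | stats count AS buckets min(startEpoch) AS oldest BY splunk_server, guId\n' "$index"
}

# Print the query. To run it, pass the string to the Splunk CLI, e.g.
#   "$SPLUNK_HOME/bin/splunk" search "$(cold_bucket_query main)"
cold_bucket_query main
```

Comparing the per-peer counts before and after adding the restored node would show whether the cluster picked up any of its cold buckets.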

 

Eventually I would like to remove the recovered peer from the cluster again, since it is still running RHEL7 and has to be switched off.
So I am looking for a way to safely get the data onto the RHEL9 nodes.

As a side note, I want to understand how warm/cold buckets are handled. If they are indeed not replicated, that would also explain why they were lost in the first place: the RHEL9 nodes were clean installations that replaced the RHEL7 nodes. The rough procedure followed in this migration was:

  • Add an additional "overflow" peer to the cluster and make sure the cluster is synced.
  • Bring down (offline --enforce-counts) one of the RHEL7 nodes and replace it with a clean RHEL9 node.
    The config from /opt/splunk/etc was carried over from the old RHEL7 node.
  • When all nodes had been replaced, the "overflow" node was removed.

So if cold buckets were not replicated, they were never replicated to the overflow node and eventually were all gone.


PickleRick
SplunkTrust

OK. Bucket directory names encode several things. Most notably, clustered non-hot buckets contain the GUID of the source indexer. So if you change the GUID of the indexer, it will not match any existing indexer and the buckets will not be treated as part of the cluster (this is not explicitly documented anywhere, but I suppose they will be treated as unclustered buckets).
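
To illustrate: clustered bucket directories normally follow the pattern `db_<newest>_<oldest>_<localid>_<guid>` (with an `rb_` prefix for replicated copies), while buckets created by a non-clustered indexer lack the GUID suffix. A minimal sketch that classifies a directory name this way follows; the function name and the epoch values in the sample names are made up.

```shell
#!/bin/sh
# Print the origin GUID encoded in a bucket directory name, or "standalone"
# if the name has no GUID part.
# Assumed layout: <db|rb>_<newest>_<oldest>_<localid>[_<guid>]
bucket_origin_guid() {
    name="$1"
    # Split on "_"; the GUID, if present, is the 5th field.
    printf '%s\n' "$name" | awk -F_ 'NF >= 5 { print $5 } NF < 5 { print "standalone" }'
}

bucket_origin_guid "db_1600000000_1590000000_12_10B29386-EAD3-45F6-AFEF-6C5897D7507E"
bucket_origin_guid "db_1600000000_1590000000_12"
```

The first call prints the GUID, the second prints `standalone`.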

Probably the same goes for your original problem: I suppose you had a stand-alone indexer, or just distributed indexers without a cluster, and then decided to cluster your indexers. In such a case, without manual intervention, old buckets are treated as unclustered and are _not_ replicated.

 


pharmapartners
Explorer

Aha... that makes sense ... and explains a lot.

I will see if I can restore the cold buckets by renaming the files / setting the correct GUID in the instance.cfg on the restored node.

Thanks a lot for pointing me in the right direction.

 


pharmapartners
Explorer

Did a quick check on the files in the colddb directory.

There are 4 different GUIDs, which are actually the same as the GUIDs of the existing peers (which makes sense, since I used the original /opt/splunk/etc on the new RHEL9 nodes, which includes the instance.cfg holding the GUID).

 

$ ls -lrt colddb/ | awk -F_ '{print $5}' | sort -u

10B29386-EAD3-45F6-AFEF-6C5897D7507E
289FAAF8-810C-454E-9CF5-4DEA9C5CA3E7
332E50AC-2BE6-4FFB-96AB-3F7D612A1422
9C46DD6F-782E-4675-8E9B-90CABC42221D

 

And the current peers:

 

$ splunk list cluster-peers | grep -v ":" | grep [0-9]
        10B29386-EAD3-45F6-AFEF-6C5897D7507E
        289FAAF8-810C-454E-9CF5-4DEA9C5CA3E7
        332E50AC-2BE6-4FFB-96AB-3F7D612A1422
        42C49D52-0A71-4164-91EC-806EAEEEE085
        9C46DD6F-782E-4675-8E9B-90CABC42221D

 

(The 42C49... GUID is from the restored node, holding all the cold buckets)
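
The same comparison can be done mechanically. The sketch below prints every GUID that occurs in bucket names but does not belong to a current peer; the function name and the two input files (one GUID per line) are hypothetical.

```shell
#!/bin/sh
# Print GUIDs found in bucket names that are not current cluster peers.
# $1: file with peer GUIDs, $2: file with bucket GUIDs (one per line).
unknown_bucket_guids() {
    awk -v peers="$1" '
        BEGIN {
            # Load the peer GUIDs; the CLI output is indented, so strip blanks.
            while ((getline line < peers) > 0) {
                gsub(/[ \t]/, "", line)
                if (line != "") known[line] = 1
            }
        }
        NF && !($1 in known) { print $1 }
    ' "$2"
}

# Possible usage, reusing the commands above:
#   splunk list cluster-peers | grep -v ":" | grep '[0-9]' > peers.txt
#   ls colddb/ | awk -F_ '{print $5}' | sort -u > buckets.txt
#   unknown_bucket_guids peers.txt buckets.txt
```

With the lists shown above the output would be empty, since all four bucket GUIDs belong to current peers.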


pharmapartners
Explorer

My issue is solved.
I manually copied the files from colddb on the recovered node to the colddb location on the production nodes.
(Enable maintenance mode, stop Splunk on the receiving node, copy the files, make sure ownership is correct, start Splunk, disable maintenance mode.)
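
Roughly, those steps correspond to the commands below. This is a sketch that only prints the plan rather than executing anything; the host name, the colddb path, and the `splunk` OS account are placeholders, and `print_copy_plan` is an invented helper.

```shell
#!/bin/sh
# Print (not execute) the commands for copying cold buckets to one peer.
# $1: source host, $2: colddb path of the index, $3: OS account owning the
# Splunk files. All three are placeholders to adapt to your environment.
print_copy_plan() {
    src_host="$1"; colddb="$2"; owner="$3"
    cat <<EOF
# on the cluster manager
splunk enable maintenance-mode
# on the receiving peer
splunk stop
scp -rp '${src_host}:${colddb}/*' ${colddb}/
chown -R ${owner}:${owner} ${colddb}
splunk start
# on the cluster manager
splunk disable maintenance-mode
EOF
}

print_copy_plan old-rhel7-peer /opt/splunk/var/lib/splunk/myindex/colddb splunk
```

Printing the plan first makes it easy to review the paths before running anything on a production indexer.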

The recovered node is currently still in the cluster, because removing it would fill up the remaining indexers a bit too much, which would lead to data loss again 🙂
Once we have added some additional RHEL9 peer nodes, we will remove the recovered node and life will be good again.

Thanks for the tips and the clarifying info.
