We have a multisite clustered environment running Splunk 6.6 and we would like to upgrade it to 7.2.
I see in the documentation that a site-by-site upgrade would only be possible in 3 steps (6.6 to 7.0, then 7.0 to 7.1, then 7.1 to 7.2).
I assume the "no more than one sequential n.n version" rule exists because of the replication that happens in the middle of the process, when the first site is upgraded but not yet the second.
So, would it be possible to "skip" some versions and do something like this:
Step 0, initial situation:
- Site1 : 6.6
- Site2 : 6.6
Step 1, upgrade Site1 to its next version:
- Site1 : 7.0
- Site2 : 6.6
Step 2, upgrade Site2 to the next version of Site1:
- Site1 : 7.0
- Site2 : 7.1
Step 3, upgrade Site1 to the next version of Site2:
- Site1 : 7.2
- Site2 : 7.1
Step 4, last upgrade of Site2:
- Site1 : 7.2
- Site2 : 7.2
You can notice that at no time is there more than one version of difference between the sites, so replication should work at my steps 2 and 3.
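For what it's worth, the "never more than one version apart" property of this plan can be checked mechanically. A minimal sketch (the release ordering and the site pairs are taken from the steps above; the `step` helper is hypothetical, not Splunk tooling):

```shell
# Release train, in upgrade order (from the 6.6 -> 7.2 upgrade path).
releases="6.6 7.0 7.1 7.2"

# Hypothetical helper: map a version string to its position in the train.
step() {
  i=0
  for r in $releases; do
    if [ "$r" = "$1" ]; then echo $i; return; fi
    i=$((i + 1))
  done
}

# Site1:Site2 pairs after each step of the plan above.
max_skew=0
for pair in "6.6:6.6" "7.0:6.6" "7.0:7.1" "7.2:7.1" "7.2:7.2"; do
  s1=$(step "${pair%:*}")
  s2=$(step "${pair#*:}")
  d=$((s1 - s2))
  if [ $d -lt 0 ]; then d=$((-d)); fi
  if [ $d -gt $max_skew ]; then max_skew=$d; fi
  echo "site1=${pair%:*} site2=${pair#*:} skew=$d"
done
echo "max skew: $max_skew"
```

The maximum skew stays at 1, so each replication phase runs with the two sites at most one version apart.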
Is there someone who has already tested this?
I wouldn't suggest that approach, as it's not documented and probably not tried by many. It's better to upgrade one version at a time, perform sanity checks to ensure all your apps work and your indexes and buckets are healthy, then perform the next step, and so on until your desired version. Have time for fallback, should you need it, and have backups.
Update: upgrade done on our 2 production environments (x4 and x10 IDX) without issue.
Only 1 question: after the first upgrade step (upgrading the first site), recovering SF and RF was very long on the 10-indexer environment: it took 2 hours after the end of maintenance mode, following a downtime of 20-25 minutes on site 1.
But after the upgrade of site 2, which took approx. 15 minutes, it took no more than 5 minutes to recover SF and RF!
Each upgrade step was faster and faster, and after the last upgrade it took less than 2 minutes!
What is the explanation for that behaviour?
I guess it depends on the RF and SF between the sites and all, but I'm not surprised. @dxu_splunk might have more insight.
It reminds me of a pilot saying "any landing you walk away from is a good landing." But I get your interest in investigating, and you can certainly explore the fixup activities in the _internal index to see what happened.
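If it helps, one way to dig into that from _internal is to look at the cluster master's fixup activity over each upgrade window. A hedged sketch: CMMaster is the component the cluster master logs under in splunkd.log, but the exact keyword filter is an assumption and may need adjusting for your version:

```
index=_internal sourcetype=splunkd component=CMMaster fixup
| timechart count
```

Comparing the counts across your four upgrade windows should show whether less fixup work was queued each time.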
RF = SF = 2, between 2 sites
```
[clustering]
available_sites = site1,site2
mode = master
multisite = true
replication_factor = 1
site_replication_factor = origin:1,site1:1,site2:1,total:2
site_search_factor = origin:1,site1:1,site2:1,total:2
summary_replication = true
rebalance_threshold = 1
```
Yes, of course; the above is our CM configuration file.
To upgrade, we enabled maintenance mode, then upgraded the indexers of site1, then disabled maintenance mode and waited for everything to come back to green.
Then the same for site2, again for site1, and again for site2, to reach the target version.
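Sketched as a runbook, one round of that procedure looks roughly like this (standard splunk CLI commands, run where indicated; this is a sketch of the steps described, not a tested script, and install paths/hostnames are yours to fill in):

```
# On the cluster master: halt bucket fixup while peers bounce
splunk enable maintenance-mode

# On each site1 indexer, one by one:
splunk stop
#   ... install the new Splunk version over the old one ...
splunk start

# Back on the cluster master:
splunk disable maintenance-mode
splunk show cluster-status   # wait until RF/SF are met again ("green")
```

Then the same round for site2, and again per site until the target version is reached.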
Edit: I think we will use the rolling upgrade for the next upgrade, because now we run splunk v7.2 🙂
So this might be specific to the leapfrog approach used here. The buckets may be re-replicating across sites if the upgrade included any change to the bucket structure/format introduced in that particular release.
Given the number of upgrades you're looking at doing, I would encourage you to consider communicating a maintenance window to the users and shutting the whole system down to upgrade directly from version 6.6 to 7.2 (or 7.3 now!).
"To upgrade across multiple sequential n.n versions, you must take down all peer nodes across all sites during the upgrade process." - Site-by-site upgrade for multisite indexer clusters
Of course we communicate to the users each time we plan an upgrade.
We considered shutting down the entire cluster, but many of our inputs have no buffer in the ingestion path; it was built without one (and they want to remove the remaining syslog-ng servers ...), so we would lose a lot of data if we shut down the entire cluster.
Note: our acceptance environment was upgraded with the process I describe in the initial question, and we had no issue.
But I agree with @lakshman239: if we had had any issue during a replication step, we would have had to downgrade a site to get the same version on both sites before continuing the upgrade.
Oh, do you have a lot of UDP coming in? It's a shame how folks are happy to send data over a lossy protocol but don't accept any loss. lol, right?
Good to know it worked in a lab. Testing in a lab is often an overlooked step by many. Glad you have one and a good communication pattern.
Remember that technically there is no supported downgrade or rollback option with Splunk.
Their argument: "sending with TCP could block the application if the receiver is not available".
Yes, it could be. Badly developed applications.
But most of our TCP inputs go directly to a UF or HF, so the buffer is on the source side and we don't control it; it may be very small.
I'm working on a solution based on NXLog CE to write logs to disk before reading them with the local UF or HF. It will address that kind of case.
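For what that NXLog CE idea could look like, a minimal sketch, assuming syslog arrives over UDP 514 and the local UF/HF monitors the spool file (im_udp and om_file are real NXLog CE modules; the port and paths are example values):

```
<Input udp_in>
    Module  im_udp
    Host    0.0.0.0
    Port    514
</Input>

<Output disk_spool>
    Module  om_file
    File    "/var/spool/nxlog/syslog.log"
</Output>

<Route r>
    Path    udp_in => disk_spool
</Route>
```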
We have some hundreds of configured users, so we must have a communication process!
Oh, and our indexing rate is high (15 MB/s during the day, 6 MB/s at night), so losing all the indexers during an upgrade would lose a big bunch of data!
Understood on the TCP vs UDP. I hear ya.
Any chance those UDP streams are going to the forwarders and not directly to the indexers?
I ask because your word choice compels me to make sure you're aware of Splunk's persistent queues. Essentially, anything being sent from a forwarder will queue up until it can send to the indexers again.
I'm crossing my fingers that this feature could be a benefit to you...
Learn more: Use persistent queues to help prevent data loss
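For reference, persistent queues are enabled per input in inputs.conf on the forwarder. A minimal sketch, assuming a syslog TCP input (the port and sizes are example values; queueSize and persistentQueueSize are the documented settings):

```
# inputs.conf on the forwarder
[tcp://514]
queueSize = 10MB
persistentQueueSize = 5GB   # spills to disk when the in-memory queue is full
```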
Yes, I know about persistent queues, but several big forwarders don't have enough disk space to store the data. But you remind me to think again about the possibility of adding disk space to them ...