
Indexer cluster infra migration query

vik_splunk
Communicator

Hi All,

We are embarking on moving our Splunk 8.1.3 servers from an old version of RHEL to new RHEL servers. The server names are going to be different, and our current config is based on IP/hostname rather than an alias.

We have perused the Answers community as well as the Splunk documentation and have identified the approach below for indexer cluster migration. There is no single Splunk document that covers this fairly common scenario. We have also raised a Splunk support case for advice, but to no avail.

A few details about our existing cluster so you can factor them in while reviewing the proposal:

  1. Multisite indexer cluster with 2 indexers per site across 2 sites (4 indexers total)
  2. Current RF 2 and SF 2 (origin:1 and total:2 for both; see the server.conf sketch after this list)
  3. Daily ingestion of roughly 400-450 GB
  4. Current storage: roughly 9 TB utilised out of 11 TB total hot/warm on each of the 4 indexers, i.e. roughly 36 TB used across the indexer cluster
  5. ~20K buckets in total, approx. 5K buckets on each indexer
  6. Retention varies: 3 months, 6 months, and in some cases 1 year
  7. It is imperative that we have all data replicated to the new servers before we retire the old ones
  8. No forwarder --> indexer discovery
  9. 4 un-clustered search heads, including one running Splunk ES 6.4.x
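
For context, the relevant clustering settings on our current cluster master look roughly like the sketch below (site names, hostnames and the key are illustrative placeholders rather than our real values):

    # server.conf on the existing cluster master (illustrative sketch)
    [general]
    serverName = old-cm
    site = site1

    [clustering]
    mode = master
    multisite = true
    available_sites = site1,site2
    site_replication_factor = origin:1,total:2
    site_search_factor = origin:1,total:2
    pass4SymmKey = <redacted>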

Key concerns:

  1. Storage on existing indexers being overwhelmed.
  2. Degraded search performance on cluster
  3. Resource util increase on cluster instances
  4. Sustained replication
  5. Lack of ability to test the below at scale.

Can you please verify/validate this approach and point out any gotchas? Suggestions for improving it would also be very welcome.

Proposed steps (an illustrative config/CLI sketch follows this list):

  1. Step 0 - Server build - Build new CM and indexers to existing specs (RF 2, SF 2, copy the configuration bundles to the new CM, etc.)
  2. Step 1 - Cluster master migration
    1. Search heads - add the additional (new) cluster master via a DS push
    2. Place the existing CM in maintenance mode, then stop it
    3. Ensure the new CM is started and running, then place it in maintenance mode
    4. Modify server.conf on 1 indexer at a time to point to the new CM
    5. Disable maintenance mode and allow the new cluster master to run replication/bucket fixup
    6. Validate and confirm that no errors or problems are seen
  3. Step 2 - Indexer cluster migration, 1 new indexer on each site
    1. Push changes to forwarders to include 2 new indexers (1 in each site) 
    2. Place new CM in maintenance mode.
    3. Add 1 indexer in each site with existing SF and RF
    4. Remove maintenance mode
    5. Issue the data rebalance command (searchable rebalance; the end goal is roughly 3K-3.2K buckets on each of the 6 indexers)
    6. Wait for catch-up
    7. Place new CM in maintenance mode.
    8. Modify the CM RF from 2 to 4; keep SF the same
    9. While still in maintenance mode, modify the site replication factor to origin:3 and total:4; SF remains the same (across all 6 indexers: 4 old + 2 new). This is to ensure each indexer in a site has 1 copy of each bucket
    10. Restart the cluster master (and possibly the indexers) for the new changes to take effect
    11. Disable maintenance mode
    12. Wait for catch-up
  4. Step 3 - Indexer cluster migration, 1 more indexer on each site
    1. Once caught up and validated, place cluster master in maintenance mode
    2. Take one old indexer offline with enforce-counts and add 1 new indexer on each site
    3. Disable maintenance mode
    4. Immediately issue data rebalance command
    5. Wait for catch-up
  5. Step 4 – Cluster migration completion
    1. Once validated, place cluster master in maintenance mode
    2. Take the 2nd set of old indexers on each site offline with enforce-counts
    3. Modify RF back to 2; SF remains the same
    4. Modify the site replication factor back to origin:1 and total:2 (covering all remaining indexers)
    5. Restart indexers and CM as appropriate
    6. Disable maintenance mode
    7. Allow excess bucket fix up activities to complete
    8. Once the cluster has returned to its steady state, shut down the old indexers
    9. Remove reference to old indexers from forwarders and search heads to avoid warnings/errors
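
To make the above concrete, here is an illustrative sketch of the main configuration and CLI pieces the steps rely on. Hostnames, keys and stanza values are placeholders, and the exact commands/flags should be double-checked against the 8.1 documentation:

    # (a) server.conf on each un-clustered search head, so it can search
    #     both the old and the new cluster master during Step 1:
    [clustering]
    mode = searchhead
    master_uri = clustermaster:old,clustermaster:new

    [clustermaster:old]
    master_uri = https://old-cm.example.com:8089
    pass4SymmKey = <key>
    multisite = true

    [clustermaster:new]
    master_uri = https://new-cm.example.com:8089
    pass4SymmKey = <key>
    multisite = true

    # (b) server.conf on each indexer, re-pointed one at a time to the
    #     new cluster master in Step 1:
    [clustering]
    mode = slave
    master_uri = https://new-cm.example.com:8089
    pass4SymmKey = <key>

    # (c) maintenance mode is toggled on the cluster master in Steps 1-4:
    splunk enable maintenance-mode
    splunk disable maintenance-mode

    # (d) data rebalance, also issued on the cluster master in Steps 2-3
    #     (a searchable-rebalance option exists in recent versions; check the docs):
    splunk rebalance cluster-data -action start
    splunk rebalance cluster-data -action status

    # (e) temporary site replication factor on the new cluster master for
    #     Step 2, reverted again in Step 4 (restart the CM afterwards):
    [clustering]
    site_replication_factor = origin:3,total:4
    site_search_factor = origin:1,total:2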

 


richgalloway
SplunkTrust

I agree this procedure could be much better documented by Splunk.

Splunk Support is for break/fix issues rather than technical advice.

I think you've over-complicated things a little.  

The procedure I prefer for replacing indexer cluster hardware is (example commands follow the list):
1) Install Splunk on the new hardware and configure it to match the old indexers
2) Add all of the new indexers to the cluster

3) Redirect forwarders to the new indexers
4) Put the old indexers into Detention.  This will keep them from receiving new data.
5) Issue a 'splunk offline --enforce-counts' command to ONE old indexer
6) Wait for the buckets to migrate off the old indexer. Depending on the number of buckets, this could take a while.  The indexer will shut itself down when migration is complete.
7) Repeat steps 5-6 for the remaining old indexers.
8) Once all buckets are moved to the new indexers you can remove the old indexers from the cluster.
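
To illustrate steps 4-6, the commands involved look roughly like the sketch below; the flags are from memory, so confirm them against the docs for your version before use:

    # On each OLD indexer peer: manual detention stops new ingestion and
    # inbound replication while the peer keeps serving searches (step 4)
    splunk edit cluster-config -manual_detention on

    # On ONE old indexer at a time: the peer stays up until its buckets meet
    # RF/SF elsewhere in the cluster, then shuts itself down (steps 5-6)
    splunk offline --enforce-counts

    # On the cluster master: watch replication/fixup progress while waiting
    splunk show cluster-status --verbose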

---
If this reply helps you, Karma would be appreciated.

vik_splunk
Communicator

Hi @richgalloway, appreciate your quick response.

We were hoping for a simpler version as well. The current one does seem complex and lacks assurance around cluster performance. As for Splunk Support, we believe they could offer some guidance (as developers of the product) in the absence of comprehensive documentation.

Anyway, a few questions and clarifications around the proposed approach:

  1. We have some technical routing challenges that we are still working to overcome, so for day 1 of the migration we won't have all the rules in place and hence cannot immediately place the old indexers in detention or take them offline. We need them available for data ingestion and searches to start with.
  2. So, without modifying RF and SF, your suggestion is to take one peer offline at a time so the cluster master does its magic, honouring RF and replicating buckets to a new peer? One question remains: I don't think the CM is aware of the storage on its peers, so how do we guarantee that the downed peer's data lands on one of the new peers and not on the existing ones? As it stands, we have only about 2 TB of headroom on each old indexer.
  3. Also, we would have thought a data rebalance is an absolute necessity considering the additional peers?
  4. Does offline mode physically migrate the data off, or does it just ensure new copies are created? That is why we wanted to go from RF 2 to RF 4: in the interim we have 3 indexers in each site and we wanted one copy of each bucket on each of them, so the site-specific factor becomes origin:3 and total:4.
  5. An offline peer participates in searches only up to the point where bucket replication/fixup completes. Considering our volume, that could take several hours, so will it honour searches for the entire period?
  6. Sorry to ask the same question, albeit differently worded: your suggestion is to run with the same RF? (We would prefer that too, but didn't think it would be possible.)

richgalloway
SplunkTrust

1. Indexers in detention participate in searches, but not in ingestion or (inbound) replication. This is an important step in the migration process. If you can't put all old indexers into detention then you risk moving data multiple times and prolonging the process.

2. Putting the old indexers into detention is what prevents them from receiving data from other old indexers.

3. Do the data rebalance after all old indexers are removed from the cluster (a command sketch follows at the end of this reply).

4. Taking a peer offline tells the cluster to ensure all data on that indexer exists elsewhere in the cluster and satisfies RF/SF.  It may require a physical move to do so, depending on the replication status. 

5. I don't know the answer to this question.

6. There's no need to change RF or SF.
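
For point 3, a minimal sketch of the post-migration rebalance, run on the cluster master once the last old peer has gone offline (verify the commands against your version's docs):

    # Kick off a cluster-wide data rebalance and check its progress
    splunk rebalance cluster-data -action start
    splunk rebalance cluster-data -action status

    # Optional cleanup once RF/SF are met everywhere: list and remove any
    # excess bucket copies left over from the migration
    splunk list excess-buckets
    splunk remove excess-buckets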

---
If this reply helps you, Karma would be appreciated.

vik_splunk
Communicator

Thanks @richgalloway, this helps.
