[SmartStore] To avoid failure, can we get recommendations for migrating a clustered environment with 50 indexers to SmartStore?

rbal_splunk
Splunk Employee

We have read the documentation and are planning according to it. We are looking for feedback on common recommendations to avoid any issues.

rbal_splunk
Splunk Employee

Below I have shared information gathered while working with customers migrating large deployments to SmartStore:

1) Connectivity to the remote storage: Before enabling S2 (SmartStore) migration, review the following link. One of the most critical requirements is to test connectivity with the remote store from each indexer that is to be enabled for SmartStore. The Splunk documentation has detailed steps for testing this connectivity: http://docs.splunk.com/Documentation/Splunk/latest/Indexer/MigratetoSmartStore

In addition, please read the links below.

http://docs.splunk.com/Documentation/Splunk/latest/Indexer/AboutSmartStore
http://docs.splunk.com/Documentation/Splunk/latest/Indexer/DeploySmartStoreonindexerclusters
https://answers.splunk.com/answers/701684/smartstorecan-i-get-more-information-on-migrating.html#ans...

In addition to testing basic connectivity, my suggestion is to also check the data throughput: verify that the data transfer rate from each indexer to the remote store is over 100 MB/s. Based on the available throughput, you can calculate how long the data migration may take.
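
The time estimate mentioned above can be sketched roughly as follows. This is only a back-of-the-envelope illustration, not a Splunk tool: the function name and its parameters are hypothetical, and it assumes each indexer uploads its own buckets at a constant sustained rate, which real migrations under search and indexing load may not achieve.

```python
def estimate_migration_hours(data_per_indexer_gb: float,
                             throughput_mb_per_s: float) -> float:
    """Rough hours needed for one indexer to upload its data.

    Assumption (not from Splunk docs): the measured throughput is
    sustained for the whole migration and nothing else competes for it.
    """
    if data_per_indexer_gb < 0:
        raise ValueError("data_per_indexer_gb must not be negative")
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput_mb_per_s must be positive")
    total_mb = data_per_indexer_gb * 1024       # GB -> MB
    seconds = total_mb / throughput_mb_per_s    # MB / (MB/s) = s
    return seconds / 3600                       # s -> hours


if __name__ == "__main__":
    # Hypothetical example: 1 TB (1024 GB) per indexer at 100 MB/s.
    hours = estimate_migration_hours(1024, 100)
    print(f"{hours:.1f} hours")  # prints "2.9 hours"
```

Since indexers upload in parallel, the slowest indexer (the most data or the lowest throughput) gives a lower bound for the whole cluster.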

2) Check that two or more clusters don’t point to the same remote storage:
This is not a recommended configuration, because each cluster manages the contents of the remote store independently, which may lead to the following scenarios:
2.1) One cluster may introduce buckets that are unknown to the second cluster. These buckets may unexpectedly reappear on the cluster later, after bootstrap. If the origin sites of the shared buckets don’t match, you may get unnecessary fix-up jobs.
2.2) One cluster may freeze buckets from underneath the other one, causing a state mismatch on the other cluster and raising S2 problems (i.e., an indexer may think that a bucket is “stable”, that is, that it is on S3, when suddenly it is not).

If, during migration, you need to have two clusters pointing to the same storage while buckets are being uploaded, you must disable freezing of buckets by pushing the following settings to all indexes on the indexers (in indexes.conf):
[default]
maxGlobalDataSizeMB = 0
frozenTimePeriodInSecs = 0

After the migration is complete and you are left with only one cluster pointing to the remote storage, you should remove the config settings above.
Try to restrict the amount of time during which the two clusters are pointing to the same remote storage location.

3) Multi-site mapping:

If you are migrating buckets from one multi-site cluster to another multi-site cluster, consider what the origin site of each bucket is and whether that site is present in the target cluster. If the origin site number is missing in the cluster, you may need to apply the standard clustering site mapping configuration (the "site_mappings" setting in server.conf).
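
As a rough illustration of the check described above, the sketch below compares the origin sites found on migrated buckets with the sites defined in the target cluster and proposes a value for "site_mappings". The function name and inputs are hypothetical, and the comma-separated "origin:target" value format is my understanding of the setting rather than something stated in this post, so please confirm it against the server.conf spec for your version before using it.

```python
def build_site_mappings(origin_sites, cluster_sites, fallback_site):
    """Suggest a site_mappings value for origin sites the cluster lacks.

    origin_sites:  site names seen as bucket origins (e.g. "site3")
    cluster_sites: site names that exist in the target cluster
    fallback_site: existing site to which missing origins are mapped

    Assumption: the value is a comma-separated list of origin:target
    pairs; verify this against the server.conf documentation.
    """
    cluster = set(cluster_sites)
    if fallback_site not in cluster:
        raise ValueError("fallback_site must exist in the target cluster")
    # Origin sites that the target cluster does not know about.
    missing = sorted(set(origin_sites) - cluster)
    return ",".join(f"{site}:{fallback_site}" for site in missing)


if __name__ == "__main__":
    # Hypothetical example: the old cluster had site1..site4,
    # the new cluster only has site1 and site2.
    value = build_site_mappings(
        ["site1", "site2", "site3", "site4"], ["site1", "site2"], "site1"
    )
    print(f"site_mappings = {value}")
```

An empty result means every origin site already exists in the target cluster and no mapping is needed.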

4) Splunk versions:
Before migration, try to upgrade to the latest version to get the latest bug fixes. You should also upgrade the original cluster to the latest possible version. It is highly recommended that you use at least the latest 7.2.x version (the first supported on-prem S2 version). If this is not possible, please consult with the S2 team.

5) Use the LRU eviction policy
Make sure that the LRU policy is active. Starting with 7.1.x this is the default, so normally it does not need any tweaking. If a 7.0.x version is used for some reason, please override the policy with LRU by setting this in server.conf:
[cachemanager]
eviction_policy = lru

6) Do NOT turn off S2 once it has been enabled
Please, do NOT turn off S2 once it has been enabled. Even if, for example, there are unforeseen networking problems that need to be worked out, try to resolve the issues without disabling S2. Splunk should be able to operate correctly even if there are issues with connecting to the remote storage.

7) Splunk crash during migration:
If Splunk crashes or is restarted before the migration has completed, the metrics endpoint ("|rest /services/admin/cacheman/_metrics") will not show the status correctly (it will show that migration has not started). In that case, you can either remove the ".buckets_synced_to_remote_storage" file across all indexes and start the indexer back up, which kicks off the migration from the beginning, or use an alternate method to check progress, as described in the docs. In other words, removing the ".buckets_synced_to_remote_storage" file is not necessary; you may want to do it only for the convenient metrics that the cacheman endpoint provides.

gjanders
SplunkTrust

Excellent post as always, I've up-voted it. May I suggest using "latest" in the documentation links? For example: http://docs.splunk.com/Documentation/Splunk/latest/Indexer/MigratetoSmartStore

That way, when new versions are released, your post will link to the newest version of the documentation!

rbal_splunk
Splunk Employee

Thanks for the suggestion on the links. I made the change.
