This response is provided in conjunction with the Splunk Product Best Practices team. Read more about How Crowdsourcing is Shaping the Future of Splunk Best Practices.
The post What's the order of operations for upgrading Splunk Enterprise? outlines the high-level process for upgrading Splunk Enterprise. One of the steps is to benchmark system health before the upgrade.
Benchmarking system health in prep for upgrade has two main goals:
Make sure the system is healthy enough to go forward without breaking or bogging down mid-upgrade
Establish a baseline of performance before the upgrade so you can tell if your system is performing within expected ranges after the upgrade
Before you start benchmarking, though, make sure you are familiar with your Splunk environment. To get a complete layout of your deployment architecture, see Use the monitoring console to determine your topology in the Inherit a Splunk Enterprise Deployment manual. Review all the topics in that manual if you are new to the Splunk environment you are about to upgrade.
Here's a high-level snapshot of what to check before upgrading. We dive into details below.
Benchmark and check system health with the monitoring console
Benchmark and check forwarder system health
Benchmark and check indexer system health
Benchmark and check search tier system health
1 Benchmark and check system health with the monitoring console
First, check basic system health indicators in the monitoring console with the Health Check. If you haven't already, Download health check updates and then Use the health check as described in the Monitoring Splunk Enterprise Manual.
Then check these basic indicators in the monitoring console at a minimum:
Note: The exact location of items in the monitoring console may vary depending on which version of Splunk Enterprise you're running.
Verify that the monitoring console is configured correctly, and all Splunk Enterprise components are listed and have the correct roles associated with them. (Monitoring Console > Settings > General Setup).
For guidance, see Configure the Monitoring Console in distributed mode in the Monitoring Splunk Enterprise Manual.
Verify that all Splunk Enterprise components are connected and reporting back data. Check search heads, indexers, deployment server, license master, cluster master (if in use), deployer (if in use), and heavy forwarder (if in use). (Monitoring Console > Settings > General Setup: "monitoring" and "state" columns).
Review existing resource utilization (CPU, RAM, disk) for search head and indexer tier. Take screenshots for comparison after upgrade. (Monitoring Console > Resource Usage > Deployment).
Review search scheduling and performance. Correct any skipped and deferred searches before upgrade. (Monitoring Console > Search > Scheduler Activity > Deployment).
For more about the scheduler activity dashboards, see Search: Scheduler activity in the Monitoring Splunk Enterprise Manual.
For tips about discovering skipped searches, see the Answers post Skipped Searches on SHC.
For background about the search scheduler and how it prioritizes (and possibly skips) searches, see Configure the priority of scheduled reports in the Reporting Manual.
Review ingestion queues on the indexers. Ensure they are not filling and failing to recover. (Monitoring Console > Instances > Indexer > Views > Indexing Performance).
For insight about how to troubleshoot ingestion issues, see Identify and triage indexing performance problems in the Troubleshooting Manual.
For guidance about monitoring indexer performance, see Use the monitoring console to view indexing performance in the Managing Indexers and Clusters of Indexers Manual.
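Screenshots work for the comparison, but if you also jot down the numbers, a small script can do the before/after check for you. Here is a minimal Python sketch of that idea. The function names, the metric names, and the 20% tolerance are all assumptions made for illustration; nothing here comes from Splunk itself, and you would fill in the values by hand from the monitoring console.

```python
import json
from pathlib import Path


def save_baseline(path, metrics):
    """Write pre-upgrade metric values (name -> number) to a JSON file."""
    Path(path).write_text(json.dumps(metrics, indent=2, sort_keys=True))


def compare_to_baseline(path, current, tolerance=0.20):
    """Return the metrics that drifted from the saved baseline.

    A metric counts as drifted when it differs from the baseline by more
    than `tolerance` (a fraction of the baseline value, assumed 20% here),
    or when it is missing from the post-upgrade readings.
    """
    baseline = json.loads(Path(path).read_text())
    drifted = {}
    for name, before in baseline.items():
        after = current.get(name)
        if after is None:
            drifted[name] = (before, None)  # not recorded after the upgrade
            continue
        # With a zero baseline the limit is zero, so any change is reported.
        limit = abs(before) * tolerance
        if abs(after - before) > limit:
            drifted[name] = (before, after)
    return drifted
```

Usage would be something like calling save_baseline("baseline.json", {"cpu_pct": 40.0, "skipped": 0}) before the upgrade and compare_to_baseline("baseline.json", {...}) afterwards; an empty result means everything stayed within the tolerance you picked.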
Check the search head cluster
If you're using a search head cluster, run the following checks on the monitoring console:
Review replication latency for errors (top of view) and consistency for time taken (bottom of view). Ideally, you would investigate and correct errors before the upgrade, but if you don't, your error rate before and after the upgrade should be consistent. For the time taken, determine whether replication times vary significantly (spikes) or if they have a natural oscillation based on time of day, day of week, or any other pattern. Whatever that pattern is before the upgrade, it should continue after the upgrade. (Monitoring Console > Search > Search Head Clustering > SHC Configuration Replication).
Check each search head cluster member and ensure that the KVStore role is applied to them. (Monitoring Console > Search > Search Head Clustering > Search Head Clustering: Status and Configuration). Edit the assignments as needed.
For guidance, see Configure the Monitoring Console in distributed mode in the Monitoring Splunk Enterprise Manual.
Review the KV store oplog, specifically “Operations Log Window of KV Store Captain.” Look for a value of at least one hour. Three to four hours is ideal for a busy SHC, and the higher the better. Values below 15 minutes are problematic, and you should investigate and fix them before the upgrade. (Monitoring Console > Search > KVStore > KVStore: Deployment).
In the KVStore: Deployment view, ensure the following settings:
KVstore in the search head cluster has a captain and one or more secondaries
total queued=0 for all nodes
Instances by Average Replication Latency is in the range of 0-10, except for search head clusters running ITSI, which can have a latency range in the 30s or higher
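To make the KV store thresholds above easier to apply, here is a minimal Python sketch that turns them into a checklist. The function names and the category labels are made up for this example, and the "below target" label for values between 15 minutes and one hour is my reading of "look for a value of at least one hour". You would type in the values you read off the KVStore: Deployment view.

```python
def classify_oplog_window(minutes):
    """Classify the oplog window using the thresholds quoted in the post."""
    if minutes < 0:
        raise ValueError("oplog window cannot be negative")
    if minutes < 15:
        return "problematic"   # fix before the upgrade
    if minutes < 60:
        return "below target"  # under the one-hour minimum
    if minutes < 180:
        return "acceptable"    # at least one hour
    return "ideal"             # three hours or more


def kvstore_issues(oplog_minutes, queued_by_node, has_captain, secondary_count):
    """Return a list of human-readable findings; an empty list means no issues."""
    issues = []
    rating = classify_oplog_window(oplog_minutes)
    if rating in ("problematic", "below target"):
        issues.append(f"oplog window is {rating} ({oplog_minutes} min)")
    if not has_captain:
        issues.append("KV store has no captain")
    if secondary_count < 1:
        issues.append("KV store has no secondaries")
    for node, queued in sorted(queued_by_node.items()):
        if queued != 0:
            issues.append(f"{node}: total queued is {queued}, expected 0")
    return issues
```

Replication latency is left out on purpose, since the acceptable range depends on whether the cluster runs ITSI.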
Check the search tier
Run this check on all search tier servers:
If you have implemented report acceleration, review the Summary Status column in the Report Accelerations Summaries for completeness. Review the Access Count column for usage. (Settings > Report Acceleration Summaries). Consider disabling any report accelerations that have never been accessed. If report accelerations aren’t at 100%, the reason is likely related to skipped searches. Correct before upgrading.
For guidance about accelerating reports, see the topic Accelerate Reports in the Reporting Manual.
Check the deployer/search head cluster
Run these checks on the monitoring console for the deployer/search head cluster:
Verify that the status of the cluster is fully healthy. (Monitoring Console > Search > Search Head Clustering: Status and Configuration).
Verify that you can complete a bundle push to all search head cluster nodes successfully.
For instructions, see the topic Update search head cluster members in the Distributed Search Manual.
If you are using a static captain, know which search head cluster node is set to captain. (Monitoring Console > Search > Search Head Clustering: Status and Configuration).
Validate that KV store(s) replicate without issue. (Monitoring Console > Search > KV Store > KV Store Deployment, bottom of view).
For guidance about how to resync the KV store, see Resync the KV store in the Admin Manual.
Check the indexer cluster master
If you are using an indexer cluster master, run the following checks on the monitoring console:
Verify that all data is searchable, and that replication factor and search factor are fully met. (Monitoring Console > Indexing > Indexer Clustering: Status).
Verify that you can successfully complete a bundle push to indexers.
For guidance, see Distribute the configuration bundle in the Managing Indexers and Clusters of Indexers Manual.
For troubleshooting tips, see Configuration bundle issues in that same manual.
Benchmark disk IOPS and load average so you can compare them after the upgrade to verify healthy function. (At the *nix command line: iostat -xz 1 or sar -d, or Monitoring Console > Resource Usage > Deployment).
Verify that unique bucket counts are within reasonable ranges. Although there are no set limits, a good benchmark is 5 million or less for Splunk Enterprise versions 6.6, 7.0, and 7.1, or 9 million or less for Splunk Enterprise version 7.2. If unique bucket counts get much higher than these ranges, you could start experiencing performance degradation. (Monitoring Console > Indexing > Indexes and Volumes: Deployment).
Run the following search on the cluster master:
| rest splunk_server=local /services/cluster/master/peers
| stats sum(bucket_count) AS bucket_count_all
| eval bucket_count = round(bucket_count_all / 1000 / 1000,2)."M"
| eval replication_factor = [
| rest splunk_server=local /services/cluster/config
| return $replication_factor ]
| eval unique = round(bucket_count_all / replication_factor / 1000 / 1000,2)."M"
| fields bucket_count unique
| rename bucket_count AS "Total Buckets", unique AS "Unique Buckets"
If the unique bucket counts are significantly higher than 5 or 9 million, investigate the reasons and fix. Consider setting high bucket count configurations on the CM and IDX servers before upgrading.
For guidance, see slide 21 of the presentation from Splunk .conf2017, Indexer clustering internals, scaling, and performance testing.
Identify the pass4SymmKey in plain text in case it needs to be re-keyed into any configurations after upgrade. This password is managed outside of Splunk.
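As a cross-check of the unique bucket arithmetic from the search earlier in this section, here is a minimal Python sketch of the same calculation: total buckets divided by the replication factor, compared against the benchmarks quoted in this post. The function and constant names are made up for illustration. Note that the sketch flags any count above the benchmark, while the post only calls for investigation when counts are significantly higher, so treat a flag as a prompt to look closer.

```python
# Benchmarks quoted in this post, keyed by Splunk Enterprise version.
UNIQUE_BUCKET_BENCHMARKS = {
    "6.6": 5_000_000,
    "7.0": 5_000_000,
    "7.1": 5_000_000,
    "7.2": 9_000_000,
}


def unique_buckets(total_bucket_count, replication_factor):
    """Approximate unique buckets the way the search does: total / RF."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return total_bucket_count / replication_factor


def format_millions(count):
    """Render a count in millions with two decimals, for example '6.00M'."""
    return f"{count / 1000 / 1000:.2f}M"


def over_benchmark(total_bucket_count, replication_factor, version):
    """True when the unique bucket count exceeds the benchmark for `version`."""
    limit = UNIQUE_BUCKET_BENCHMARKS[version]
    return unique_buckets(total_bucket_count, replication_factor) > limit
```

For example, 18 million total buckets with a replication factor of 3 works out to 6 million unique buckets, which is over the benchmark for 7.1 but under it for 7.2.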
Check the license master
Run these checks on the monitoring console for the license master:
Verify that all indexers are checking into the license master. (Monitoring Console > Instances > Group=License Master).
Verify that _* indexes are successfully forwarding data to the indexing tier (if configured to do so). Run the following search and validate that the license master host is present in the list (you can also check for the cluster master host, the deployment server host, and the deployer host):
index=_internal earliest=-5min | stats count by host
For guidance about how to set up data forwarding, see Best practice: Forward search head data to the indexer layer in the Distributed Search Manual.
For tips about how to set this up for forwarders and license master, see Best practice: Forward master node data to the indexer layer in the Distributed Search Manual.
Archive copies of license(s) off host, or verify that they are included in backups. Make copies of the .lic files in $SPLUNK_HOME/etc/licenses/enterprise/* .
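If the host list from the _internal search above is long, eyeballing it for the management hosts gets tedious. Here is a minimal Python sketch that compares the hosts you exported from the search against the ones you expect. The function name and the host names in the usage note are hypothetical, and comparing names case-insensitively is an assumption that may not suit every environment.

```python
def missing_hosts(reporting_hosts, expected_hosts):
    """Return the expected hosts that did not appear in the search results.

    Names are compared after trimming whitespace and lowercasing, which is
    an assumption of this sketch.
    """
    reporting = {host.strip().lower() for host in reporting_hosts}
    return sorted(
        host for host in expected_hosts
        if host.strip().lower() not in reporting
    )
```

For instance, missing_hosts(["idx1", "LM01", "cm01"], ["lm01", "cm01", "ds01"]) would report only ds01 as missing from the results.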
Check the deployment server
Run these checks on the monitoring console for the deployment server:
Validate that config reload is successful. You can push a config from the forwarder management UI or the command line ( splunk reload deploy-server ). If there are issues with individual lines in serverclass.conf , they will appear in splunkd.log as ERROR and will be skipped, and Splunk will continue loading the rest of the file.
Validate that all forwarders that should be phoning home are doing so successfully. (Monitoring Console > Forwarders > Forwarders: Deployment).
2 Benchmark and check forwarder system health
Verify the following on your forwarders before upgrading your Splunk Enterprise version.
Verify that your current forwarders will work with the new version of the indexers; that is, confirm that the version combinations are supported.
To check forwarder compatibility between versions, see Compatibility between forwarders and indexers in the Splunk Products Version Compatibility Manual.
Verify that the SSL and cipher suite configurations are compatible.
For details, see Configure secure communications between Splunk instances with updated cipher suite and message authentication code in the Securing Splunk Enterprise Manual.
If you are using an app that requires a heavy forwarder or makes external queries, such as DBX or JMX, validate that they work with the new Splunk Enterprise version.
Ensure that any forwarder code management tools you have set up (such as Puppet, Chef, Ansible, or SCCM) can reach all forwarders to be upgraded.
3 Benchmark and check indexer system health
Run these checks on your indexers:
Ensure there is sufficient disk space to take local backups before the upgrade and to deploy the new code during upgrade.
For guidance about managing disk space, see the topic Estimate your storage requirements and related topics in the Capacity Planning Manual.
For items that may affect disk space during upgrade, see the topic About upgrading READ THIS FIRST in the Splunk Enterprise Installation Manual.
Run this search to verify that indexers aren’t running scheduled searches:
index=_internal source="*/scheduler.log" search_group=dmc_group_indexer sourcetype=scheduler
| dedup host savedsearch_name
| stats count(savedsearch_name) by savedsearch_name
Verify that basic searches work and that all the indexers are replying by running this search:
| tstats count where earliest=-5m by splunk_server
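For the disk space check at the top of this list, here is a minimal Python sketch you could run on each indexer. It uses shutil.disk_usage from the Python standard library to read free space. The function names, the size estimates you pass in, and the 10% headroom are all assumptions for illustration; the post does not prescribe a specific margin.

```python
import shutil


def has_room(free_bytes, backup_bytes, install_bytes, headroom=0.10):
    """True when free space covers the backup plus the new code, with headroom.

    `headroom` is an extra safety margin (assumed 10% here).
    """
    needed = (backup_bytes + install_bytes) * (1 + headroom)
    return free_bytes >= needed


def check_path(path, backup_bytes, install_bytes, headroom=0.10):
    """Read free space on the filesystem holding `path` and apply has_room."""
    free = shutil.disk_usage(path).free
    return has_room(free, backup_bytes, install_bytes, headroom)
```

You would call check_path with the filesystem where the backup will land and your own estimates of the backup and install sizes.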
4 Benchmark and check search tier system health
Run these checks on your search tier components:
Validate that the upgrade target version works with all apps (searches, dashboards, add-ons, external inputs).
Check version compatibility via Splunkbase for premium and non-premium apps. For guidance, see Splunk Products Version Compatibility and applications on Splunkbase. Also verify the end-of-support status of Splunkbase apps. For details, see End of Availability: Splunk-Built Apps and Add-ons on Splunk Blogs.
Test homegrown apps. For guidance, see the topic Test your apps before upgrade in the Splunk Enterprise Installation Manual.
Have copies of SSL keys, SAML configs, and external auth credentials (such as passwords) available in plaintext.
Look for failing searches due to missing users in external auth and correct issues prior to upgrade.
Run the following search to evaluate the size of the search bundle being pushed to indexers to determine if it is close to the maximum setting.
If you have a search head cluster, run the search once on any search head member.
If you don't have a search head cluster, run this search on each search head in your environment.
index=_internal sourcetype=splunkd group=bundles_uploads search_group=dmc_group_search_head
| eval baseline_bundle_size_mb=round((average_baseline_bundle_bytes/1024)/1024,1)
| chart max(baseline_bundle_size_mb) AS Max_bundle_size by host
| eval Max_bundle_size=Max_bundle_size . "M"
For guidance about maximum bundle settings, see the topic Modify the Knowledge Bundle in the Splunk Enterprise Distributed Search Manual.
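To judge whether a bundle is "close to the maximum", here is a minimal Python sketch that repeats the bytes-to-MB conversion from the search above and compares the result with your configured limit. The function names, the max_bundle_size_mb parameter, and the 80% warning threshold are assumptions for this example; take the actual limit from your own configuration.

```python
def bundle_size_mb(average_baseline_bundle_bytes):
    """Convert bytes to MB the same way the search does, to one decimal."""
    return round(average_baseline_bundle_bytes / 1024 / 1024, 1)


def bundle_headroom(size_mb, max_bundle_size_mb, warn_fraction=0.80):
    """Report how much of the configured maximum the bundle uses.

    `warn_fraction` is an assumed threshold for "close to the maximum".
    """
    used = size_mb / max_bundle_size_mb
    return {
        "used_fraction": round(used, 2),
        "near_limit": used >= warn_fraction,
    }
```

With a hypothetical limit of 200 MB, a 150 MB bundle uses 75% of the limit and is not flagged, while a 180 MB bundle uses 90% and is.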
To tackle the question about what to monitor during a Splunk Enterprise upgrade, see the Answers post How do I monitor system health during a Splunk Enterprise upgrade?
What's your experience? We'd like to hear from you. We'll be updating this topic as we gather more input.