Solved: What do I validate after I upgrade Splunk Enterprise to confirm the upgrade was successful?

davidpaper · ‎09-10-2019

I need details about what to validate after the upgrade so I know it was successful. How can I tell that everything got upgraded correctly, and that the system is healthy and ready to go?

davidpaper · ‎09-10-2019

This response is provided in conjunction with the Splunk Product Best Practices team. Read more about How Crowdsourcing is Shaping the Future of Splunk Best Practices.

The post What's the order of operations for upgrading Splunk Enterprise? outlines the high-level process for upgrading a Splunk Enterprise deployment. This post focuses on the verification phase and what to check after the upgrade to make sure all the components have upgraded successfully. You can do these checks for each component as you upgrade them, or all at once after the upgrade. During upgrade, make sure you allow adequate time for one component to stabilize before upgrading the next component.

The post-upgrade checks have two main goals:

Make sure the system is healthy after upgrade
Validate that system performance is on par with or better than it was before the upgrade

Here's a high-level snapshot for what to check after upgrading. These checks go in the order of upgrade. We dive into details below.

Check upgrade success on the monitoring console
Check upgrade success on the license master
Check upgrade success on the indexer cluster master
Check upgrade success on the search tier (stand-alone)
Check upgrade success on the search tier (clustered)
Check upgrade success on the deployer
Check upgrade success on the deployment server
Check upgrade success on the indexers

1 Check upgrade success on the monitoring console

The first step to confirm that the monitoring console has upgraded successfully is to log into the monitoring console UI. Once logged in, check the following:

Note: The exact location of items in the monitoring console may vary depending on which version of Splunk Enterprise you're running.

Verify that all search heads, indexers, deployment server(s), license server, cluster master, the deployer, and forwarders (heavy and regular) are reporting a healthy status. (Monitoring Console > Overview).
Verify that components have correct roles associated with them (Monitoring Console > Instances)
Review resource utilization (CPU, RAM, disk) for search head and indexer tier, and compare to screenshots taken before the upgrade to verify that performance levels are comparable (Monitoring Console > Resource Usage > Instance > role: search head and role: indexer).
Review search scheduling and performance. Investigate and correct any skipped and deferred searches. (Monitoring Console > Search > Scheduler Activity > Deployment).
For more about the scheduler activity dashboards, see Search: Scheduler activity in the Monitoring Splunk Enterprise Manual.
For tips about discovering skipped searches, see the Answers post Skipped Searches on SHC.
For background about the search scheduler and how it prioritizes (and possibly skips) searches, see Configure the priority of scheduled reports in the Reporting Manual.
Review ingestion queues on the indexers. Ensure they are not filling and failing to recover. (Monitoring Console > Instances > Indexer > Views > Indexing Performance).
For insight about how to troubleshoot ingestion issues, see Identify and triage indexing performance problems in the Troubleshooting Manual.
For guidance about monitoring indexer performance, see Use the monitoring console to view indexing performance in the Managing Indexers and Clusters of Indexers Manual.
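
If you recorded numbers rather than screenshots before the upgrade, the resource utilization comparison can be sketched in a few lines of Python. This is only an illustration: the metric names, the sample values, and the 10% tolerance are all assumptions, not anything the monitoring console produces in this form.

```python
# Hypothetical metric names and values; substitute your own benchmark numbers.
def regressions(before, after, tolerance=0.10):
    """Return metrics whose post-upgrade value exceeds the baseline by more than `tolerance`."""
    flagged = {}
    for name, old in before.items():
        new = after.get(name)
        if new is None:
            continue  # metric was not collected after the upgrade
        if old > 0 and (new - old) / old > tolerance:
            flagged[name] = (old, new)
    return flagged


before = {"cpu_pct": 40.0, "mem_pct": 55.0, "disk_iops": 300.0}
after = {"cpu_pct": 42.0, "mem_pct": 70.0, "disk_iops": 310.0}
print(regressions(before, after))  # only mem_pct grew by more than 10%
```

Anything the function flags is a candidate for a closer look on the Resource Usage dashboards, not proof of a problem.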

2 Check upgrade success on the license master

Verify that all indexers are checking into the license master. (Monitoring Console > Instances > Group=License Master)
Verify that _* indexes are successfully forwarding data to the indexing tier (if configured to do so). Run the following search and validate that the license master host is present in the list (you can also check for the cluster master host, the deployment server host, and the deployer host):

index=_internal earliest=-5min | stats count by host

For guidance about how to set up data forwarding, see Best practice: Forward search head data to the indexer layer in the Distributed Search Manual.
For tips about how to set this up for forwarders and license master, see Best practice: Forward master node data to the indexer layer in the Distributed Search Manual.
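
If you export the results of the search above (for example as CSV), checking them against the hosts you expect to see can be sketched like this. The host names are made up, and the row layout (a "host" and a "count" field per row, as `stats count by host` returns) is the only thing taken from the search itself.

```python
def missing_hosts(expected, rows):
    """Return expected hosts that do not appear in the search results."""
    seen = {row["host"].lower() for row in rows if int(row.get("count", 0)) > 0}
    return sorted(h for h in expected if h.lower() not in seen)


# Hypothetical exported rows and host names.
rows = [{"host": "idx01", "count": "1520"}, {"host": "lm01", "count": "87"}]
expected = ["lm01", "cm01", "ds01"]
print(missing_hosts(expected, rows))  # hosts that are not forwarding _internal data
```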

3 Check upgrade success on the indexer cluster master

On the cluster master host, check the load average and IOPS to determine that the cluster master has finished processing all upgrade-related activity (at the 'nix command line: iostat -zx 1 or sar -d). Cluster masters are generally not IO intensive, but IO jumps up considerably when indexer rolling restarts occur.
Use your preferred method to monitor swap activity on the cluster master, for example (on 'nix):
vmstat 1 shows pages swapping in and out in the "si" and "so" columns
iostat -zx 1 shows activity on the swap device (you can get the device name from /etc/fstab)
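
If you capture the vmstat output to a file, a small Python sketch can pull out the "si" and "so" columns for you. The sample output below is invented for illustration, and the function assumes the usual Linux vmstat layout with a header row that names the columns.

```python
def swap_activity(vmstat_text):
    """Return a list of (si, so) pairs parsed from captured `vmstat` output."""
    header = None
    samples = []
    for line in vmstat_text.splitlines():
        fields = line.split()
        if "si" in fields and "so" in fields:
            header = fields  # column-name row; vmstat repeats it periodically
            continue
        if header is None or len(fields) != len(header):
            continue  # banner line or truncated row
        if not all(f.isdigit() for f in fields):
            continue
        row = dict(zip(header, map(int, fields)))
        samples.append((row["si"], row["so"]))
    return samples


# Invented sample output for illustration.
sample = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 801234  10240 402112    0    0     5    12  110  230  3  1 96  0  0
 0  0   2048 790000  10240 402500    0   64     0   900  150  300  5  2 90  3  0
"""
samples = swap_activity(sample)
print(samples)
print("swapping seen:", any(si or so for si, so in samples))
```

Sustained non-zero values in either column suggest the host is short on memory, which is worth resolving before you move on to the next component.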

After upgrade activity is finished and the system has returned to a steady state, review the clustering dashboard to ensure the cluster is searchable.
If RF/SF fixup tasks are queued, verify that the fixups are in progress.
For guidance about the cluster recovery process, see the presentation Indexer Clustering Fixups from .conf2017.
For more about bucket fixing and a link to bucket fixing resources in the Splunk Enterprise documentation, see bucket fixing in the Splexicon.
Look for search peers that are alternating ("flapping") between up and pending states, or restarting outside of a rolling restart, by running the following search:

index=_internal source=*splunkd.log sourcetype=splunkd host=cluster_master component=CMPeer peer transitioning NOT bid
| eval transition = from." -> ".to
| timechart count by transition

If the search results alternate between "Pending → Up" and "Up → Pending", the indexers may need more time to finish all upgrade-related activity and check in properly. If the situation persists, you may need to adjust timeouts in the cluster master configuration to give the components more time to come back online.
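
To make "alternating" concrete, here is a minimal Python sketch that counts how often each peer drops from Up back to Pending. It assumes you have exported the matching events as (peer, from, to) tuples in time order; the peer names and the threshold of two drops are assumptions for illustration.

```python
from collections import Counter


def flapping_peers(transitions, threshold=2):
    """Return peers that went Up -> Pending at least `threshold` times."""
    drops = Counter(
        peer for peer, src, dst in transitions if (src, dst) == ("Up", "Pending")
    )
    return sorted(peer for peer, n in drops.items() if n >= threshold)


# Hypothetical transitions, oldest first.
transitions = [
    ("idx01", "Pending", "Up"),
    ("idx02", "Pending", "Up"),
    ("idx01", "Up", "Pending"),
    ("idx01", "Pending", "Up"),
    ("idx01", "Up", "Pending"),
    ("idx01", "Pending", "Up"),
]
print(flapping_peers(transitions))  # idx01 dropped twice, idx02 never did
```

A single Pending → Up per peer is what a clean restart looks like; repeated drops on the same peer are the pattern to chase.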

Verify that forwarders are communicating with the cluster master using the following search:

index=_internal sourcetype=splunkd component=CMIndexerDiscovery

Verify that the monitoring console can still see the cluster master as a search peer. (Monitoring Console > Overview).
Verify that you can successfully complete a bundle push to indexers.
For guidance, see Distribute the configuration bundle in the Managing Indexers and Clusters of Indexers Manual.
For troubleshooting tips, see Configuration bundle issues in that same manual.

4 Check upgrade success on the search tier (stand-alone)

Verify that external auth is working (if configured), including certificates if you're using SAML or another SSO outside of Active Directory.
Verify that the new Splunk Enterprise version works with all apps (searches, dashboards, add-ons, external inputs).
Verify that basic searches work from each standalone search head, and that all the indexers reply by running the following search:

| tstats count where earliest=-5m by splunk_server

Look for skipped or deferred searches that were not skipped or deferred before the upgrade.
Run the following search to evaluate the size of the search bundle being pushed to indexers to determine if it is close to the maximum setting.
If you have a search head cluster, run the search once on any search head member.
If you don't have a search head cluster, run this search on each search head in your environment.

index=_internal sourcetype=splunkd group=bundles_uploads search_group=dmc_group_search_head
| eval baseline_bundle_size_mb=round((average_baseline_bundle_bytes/1024)/1024,1)
| chart max(baseline_bundle_size_mb) AS Max_bundle_size by host
| eval Max_bundle_size=Max_bundle_size . "M"

For guidance about maximum bundle settings, see the topic Modify the Knowledge Bundle in the Distributed Search Manual.
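
The "close to the maximum" judgment can be sketched in Python as well. The byte-to-MB conversion mirrors the eval in the search above; the limit you pass in and the 80% warning ratio are assumptions, so use whatever maximum is actually configured in your deployment.

```python
def bundle_headroom(bundle_bytes, max_bundle_mb, warn_ratio=0.8):
    """Return (size_in_mb, near_limit) for a knowledge bundle.

    max_bundle_mb is whatever limit your deployment is configured with;
    warn_ratio is an arbitrary illustrative threshold.
    """
    size_mb = round(bundle_bytes / 1024 / 1024, 1)
    return size_mb, size_mb >= max_bundle_mb * warn_ratio


# Hypothetical sizes checked against a hypothetical 2048 MB limit.
print(bundle_headroom(1_800_000_000, 2048))
print(bundle_headroom(500 * 1024 * 1024, 2048))
```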

Verify that users can log in using remote auth (if configured) on each search head node.
Review the Report Acceleration Summaries status. Summaries should reach 100% after they catch up. (Settings > Report Acceleration Summaries).

5 Check upgrade success on the search tier (clustered)

Check all upgrade success indicators in the search tier (stand-alone) list in the previous section.
Verify that all search head cluster members are visible in the monitoring console. (Monitoring Console > Indexing > Indexer Clustering: Status).
Verify the search head cluster captain and member details in the monitoring console. (Monitoring Console > Search > Search Head Clustering: Status and Configuration).
Verify that search traffic is distributed evenly in the search head cluster Scheduler Delegation dashboard in the monitoring console by sorting the first panel by instance. (Monitoring Console > Search > Search Head Clustering: Scheduler Delegation).

Measure the time a search head cluster member is taking to spin up by running the following search with a time range before and after the upgrade. Major swings could indicate a problem on the members.

index=_internal uri=*delegatejob*
| timechart median(spent) as median_spent max(spent) as max_spent
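
If you export the "spent" values for a window before the upgrade and a window after it, a quick way to quantify a "major swing" is to compare the medians. This Python sketch uses made-up numbers, and what counts as a major swing is your call.

```python
from statistics import median


def spent_shift(before, after):
    """Return (median_before, median_after, relative_change)."""
    b = median(before)
    a = median(after)
    change = (a - b) / b if b else float("inf")
    return b, a, change


# Hypothetical `spent` values from the two time ranges.
before = [12, 15, 11, 14, 13]
after = [25, 30, 22, 28, 26]
print(spent_shift(before, after))  # median doubled after the upgrade
```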

Look for any errors and warnings in the KV store (mongod) logs using this search:

index=_internal sourcetype=mongod earliest=-15m

Verify that the search head cluster can push a bundle successfully to all indexers, especially if a search head cluster connects to multiple indexing clusters. See the topic Update search head cluster members in the Distributed Search Manual for instructions.
Verify that the KV store comes online on each node and replicates correctly across nodes using this dashboard:
(Monitoring Console > Search > KV Store > KV Store: Deployment)

6 Check upgrade success on the deployer

Verify that a bundle can be pushed from the deployer to all search head nodes.

7 Check upgrade success on the deployment server

Verify that config reload is successful. You can push a config from the forwarder management UI or the command line (splunk reload deploy-server). If there are issues with individual lines in serverclass.conf, they will appear in splunkd.log as ERROR and will be skipped, and Splunk will continue loading the rest of the file.
Verify that all forwarders that should be phoning home are doing so successfully (Monitoring Console > Forwarders > Forwarders: Deployment).

8 Check upgrade success on the indexers

After the upgrade and restart, allow at least 15 minutes for the cluster to finish processing all upgrade-related activity. Check the following indicators to verify that the upgrade is complete and successful.

Verify that all the nodes are present in the UI, either in the cluster master UI or in the monitoring console (Monitoring Console > Indexing > Indexer Clustering: Status).
Verify that all data is searchable, and that replication factor and search factor are fully met (Monitoring Console > Indexing > Indexer Clustering: Status).
Verify that cleanup/fixup tasks are moving forward while continuing to watch load and IO on the cluster master (at the 'nix command line: iostat -zx 1 or sar -d).
Verify that basic searches work and all the indexers are replying by running this search:

| tstats count where earliest=-5m by splunk_server

Verify that all indexers are ingesting data. Check that ingestion rates are continuous and, if they dropped or spiked, whether they returned to the mean:
Check HEC port (if configured). (Monitoring Console > Indexing > Inputs > HTTP Event Collector: Deployment).
Check S2S port(s). (Monitoring Console > Indexing > Inputs > Splunk TCP Input Performance: Deployment).
Review ingestion queues on the indexers. Ensure they are not filling and failing to recover. (Monitoring Console > Instances > Indexer > Views > Indexing Performance).
For insight about how to troubleshoot ingestion issues, see Identify and triage indexing performance problems in the Troubleshooting Manual.
For guidance about monitoring indexer performance, see Use the monitoring console to view indexing performance in the Managing Indexers and Clusters of Indexers Manual.
Review load average and IOPS to determine that the cluster master has finished processing all upgrade-related activity (at the 'nix command line: iostat -zx 1 or sar -d).
Refer to the existing resource utilization metrics collected in the steps outlined in the Answers post How do I benchmark system health before a Splunk Enterprise upgrade? to determine when the cluster master has returned to its normal state of operations. (Monitoring Console > Resource Usage > Deployment).
Scan the internal logs on the cluster master for warnings and errors. You can also check the internal logs of the indexers for warnings and errors, although these logs can contain many entries for unrelated conditions, such as parsing errors, and so on.

index=_internal sourcetype=splunkd source=*splunkd.log log_level!=info

Repeat the checks outlined in the Search tier (clustered) section above to ensure that searches complete in a timely way.
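
For the ingestion-rate check in this section, "returned to the mean" can be sketched as: the most recent samples all sit within some tolerance of the pre-upgrade average. The sample rates, the three-sample tail, and the 20% tolerance in this Python sketch are assumptions for illustration.

```python
from statistics import mean


def recovered(rates, baseline, tail=3, tolerance=0.2):
    """True if the last `tail` samples are all within `tolerance` of the baseline mean."""
    target = mean(baseline)
    recent = rates[-tail:]
    return all(abs(r - target) <= tolerance * target for r in recent)


# Hypothetical ingestion rates (for example KB/s), oldest first.
baseline = [100, 110, 90, 105, 95]           # before the upgrade
post_upgrade = [20, 180, 120, 104, 98, 101]  # dip, spike, then settling
print(recovered(post_upgrade, baseline))
```

If the function keeps returning False well past the 15-minute settling window, go back to the ingestion queue checks above.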

If all else fails...

Here are some resources if you run into any upgrade-related snags.

For specific issues, refer to the Splunk Enterprise Troubleshooting Manual.
If all else fails, contact your Splunk account rep or Splunk Support and Services.

Related upgrade resources

For tips about what to monitor before an upgrade, see the Answers post How do I benchmark system health before a Splunk Enterprise upgrade? to make sure your deployment is ready for upgrade, and that you have taken a benchmark of performance ranges on your Splunk Enterprise components that you can compare with post-upgrade performance.
For tips about what to monitor and check during an upgrade, see the Answers post How do I monitor system health during a Splunk Enterprise upgrade? to make sure the upgrade goes smoothly for all components.
For high-level post-upgrade guidance, review the post-upgrade guidelines in Phase 3: Verify everything works after the upgrade in the Splunk Enterprise Installation Manual.

cstump_splunk · ‎03-14-2020

One minor thing I want to point out about the tstats command:

| tstats count where earliest=-5m by splunk_server

By default, this tstats command will only search default indexes. If there are any data imbalances across the cluster and one of the indexers does not have any data from a default index, it may not appear in the results. You can look for all internal indexes by adding "AND index=_internal"

| tstats count where earliest=-5m AND index=_internal by splunk_server
