Solved: How do I monitor system health during a Splunk Enterprise upgrade?

davidpaper · ‎09-10-2019

I need details about what to monitor during my upgrade so I know it is proceeding as expected. What should I monitor during an upgrade?

davidpaper · ‎09-10-2019

This response is provided in conjunction with the Splunk Product Best Practices team. Read more about How Crowdsourcing is Shaping the Future of Splunk Best Practices.

The Answers post What's the order of operations for upgrading Splunk Enterprise? outlines the high-level process for upgrading a Splunk Enterprise deployment. This post focuses on what to monitor during the upgrade phase to make sure the upgrade goes smoothly for all components.

Keep two things in mind during upgrade:

Upgrade components in the right order. This post gives a suggestion about upgrade order, but the order you should follow depends on your topology. For a high-level overview, see the Answers post What's the order of operations for upgrading Splunk Enterprise? For specific guidance, see the topic How to upgrade Splunk in the Splunk Enterprise Installation Manual.
Ensure that each component has stabilized before proceeding with the next upgrade step. This post provides some guidance on how to make sure you're ready to move to the next step.

Before you start upgrading, thoroughly read the topic About upgrading Splunk Enterprise: READ THIS FIRST in the Splunk Enterprise Installation Manual. The guidelines in this post supplement the detailed instructions in the upgrade documentation.

Also review the Answers post How do I benchmark system health before a Splunk Enterprise upgrade? to make sure your deployment is ready for upgrade. That post walks you through how to take a benchmark of performance ranges on your Splunk Enterprise components that you can compare with post-upgrade performance. We strongly recommend you put together a detailed upgrade plan that matches your topology before you upgrade.

Here's a high-level snapshot of what to check during an upgrade. We dive into details below.

Upgrade and check progress of single-step components
Upgrade and check progress of the indexer cluster master
Check forwarder function during upgrade
Upgrade and check progress of indexers (stand-alone)
Upgrade and check progress of indexers (clustered)

1 Upgrade and check progress of single-step components

Several components are single-step upgrades (no bundle pushes, waiting for fix-ups, or sync waits for other components). These components are present in both distributed and clustered deployments. If all these components are on separate machines, you can upgrade them in the following order. If they are collocated on a single machine, you only need to run the upgrade once.

License master
Deployment server
Search head cluster deployer
Monitoring console

After you upgrade the code for each component, verify that you can log in successfully to each component using the UI. From the Monitoring Console > Overview, verify that the license master, deployment server, and deployer components are visible and running in expected ranges compared to the benchmarks you took before upgrade.
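As a supplement to the UI login check above, here is a minimal sketch of confirming the running version from a server info response. It assumes you have already fetched the JSON from the component's management port (the /services/server/info endpoint with output_mode=json) and that the version sits at entry[0].content.version; verify that shape against your own deployment. The function names and the sample payload are hypothetical.

```python
import json


def splunk_version(server_info_json: str) -> str:
    """Return the version string from a server info JSON response.

    Assumption: the response has the shape {"entry": [{"content": {"version": ...}}]}.
    """
    payload = json.loads(server_info_json)
    return payload["entry"][0]["content"]["version"]


def is_upgraded(server_info_json: str, target_version: str) -> bool:
    """True when the component reports exactly the version you upgraded to."""
    return splunk_version(server_info_json) == target_version


# Hypothetical sample payload, trimmed to the one field this sketch reads.
sample = json.dumps({"entry": [{"name": "server-info", "content": {"version": "8.0.0"}}]})
print(is_upgraded(sample, "8.0.0"))
```

Run it once per component (license master, deployment server, deployer, monitoring console) with the response from that component.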

Note: The exact location of items in the monitoring console may vary depending on which version of Splunk Enterprise you're running.

2 Upgrade and check progress of the indexer cluster master

If you have an indexer cluster, there are several indicators you can check to ensure that the cluster master has upgraded fully and is ready for the next step:

Exit maintenance mode at key points to let the cluster master fully recover. Maintenance mode halts most bucket fixup activity and prevents frequent rolling of hot buckets. To help facilitate cluster recovery after running the upgrade, take the cluster master out of maintenance mode so it can process fixups and manage buckets. This is a best practice after any cluster master restart, and between upgrades of each site in a multi-site indexing cluster.
Check progress indicators at each step. The act of monitoring the cluster master can affect performance because the cluster UI makes REST calls that can compete for resources as the cluster stitches itself together. There are several indicators you can view at the OS layer without adding load to the cluster master. The cluster master is generally ready for the next step when the following indicators are present:
Load average has dropped. On the operating system, run w, uptime, or top to see the system load average.
Disk IO has returned to pre-upgrade levels. At the 'nix command line, run iostat -xz 1 or sar -d.
Threads are no longer pegging a single CPU at 99%+. At the 'nix command line, run top -H, or start top normally and press H to turn on the thread view.
The log splunkd.log returns to normal. Run tail -f $SPLUNK_HOME/var/log/splunk/splunkd.log. The rate at which data gets written to this log slows significantly when the cluster master has caught up. The type of messages written to this log also changes to info-only.
Compare current resource usage with pre-upgrade levels. When the upgraded cluster master and cluster come up, verify that resource usage after the upgrade is comparable to the screenshots you took before the upgrade in Monitoring Console > Resource Usage > Machine.
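Two of the OS-layer indicators above (load average has dropped, splunkd.log is info-only) can be checked with a small script that makes no REST calls to the cluster master. This is a rough sketch, not a definitive readiness test. It assumes splunkd.log lines start with a date, time, timezone offset, and log level; the function names, the 1.25 tolerance, and the sample lines are all made up for illustration.

```python
import os
import re
from collections import Counter

# Assumed line layout: "MM-DD-YYYY HH:MM:SS.mmm +ZZZZ LEVEL Component - message"
LINE_RE = re.compile(
    r"^\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2}\.\d{3} [+-]\d{4} (?P<level>[A-Z]+)\s"
)


def count_levels(lines):
    """Count log lines per level, ignoring lines that do not match the layout."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            counts[match.group("level")] += 1
    return counts


def current_load():
    """1-minute load average, the same figure that uptime reports (Unix only)."""
    return os.getloadavg()[0]


def looks_settled(lines, load_1min, baseline_load, tolerance=1.25):
    """True when recent log lines are info-only and load is near the baseline.

    baseline_load is the pre-upgrade figure from your benchmark;
    tolerance is an arbitrary allowance, not a Splunk recommendation.
    """
    counts = count_levels(lines)
    noisy = counts["WARN"] + counts["ERROR"] + counts["FATAL"]
    return noisy == 0 and load_1min <= baseline_load * tolerance


# Made-up sample lines, not real splunkd output.
sample = [
    "09-10-2019 12:00:01.123 -0700 INFO  ExampleComponent - example message",
    "09-10-2019 12:00:02.456 -0700 WARN  ExampleComponent - example message",
]
print(count_levels(sample))
```

In practice you would pass in the last few hundred lines of $SPLUNK_HOME/var/log/splunk/splunkd.log along with current_load() and the load average you recorded before the upgrade.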

3 Check forwarder function during upgrade

This check is to ensure that forwarders are still checking in during upgrade (for example, with the deployment server), and forwarding data (for example, to the indexer).

Using the monitoring console, ensure that data ingestion continues to flow at the expected rate for the time of day and day of the week (Monitoring Console > Forwarders: Deployment).

4 Upgrade and check progress of indexers (stand-alone)

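As indexers are upgraded and brought back online, ensure they are ingesting and participating in search. To check ingestion, run the following search:

index=_internal component=Metrics per_index_thruput earliest=-30m
| eval mb=(kb/1024)
| timechart span=5m sum(mb) by host

To check that each indexer is participating in search, run the following as a separate search (tstats must be the first command in a search):

| tstats count where earliest=-5m by splunk_server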
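The per-indexer counts returned by splunk_server are easiest to act on when compared against the list of indexers you expect to see. Here is a minimal sketch of that comparison, assuming you have exported the results as rows with splunk_server and count fields; the host names and the function name are hypothetical.

```python
def missing_indexers(expected, rows):
    """Return the expected indexers that returned no events in the search results."""
    reporting = {row["splunk_server"] for row in rows if int(row["count"]) > 0}
    return sorted(set(expected) - reporting)


# Hypothetical inventory and search results.
expected = ["idx1", "idx2", "idx3"]
rows = [
    {"splunk_server": "idx1", "count": "1200"},
    {"splunk_server": "idx2", "count": "980"},
]
print(missing_indexers(expected, rows))
```

An indexer that stays on the missing list after it has been restarted is a signal to stop and investigate before upgrading the next one.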

5 Upgrade and check progress of indexers (clustered)

Verify that indexers rejoin the cluster as they come back online and are marked Status=up and Fully Searchable=yes in Monitoring Console > Indexing > Indexer Clustering > Indexer Clustering: Status.

What's next?

For what to check once the upgrade is complete, see the Answers post What do I validate after I upgrade Splunk Enterprise to confirm the upgrade was successful?

What's your experience? We'd like to hear from you. We'll be updating this topic as we gather more input.

