
[SHC] Troubleshooting Configurations under Search Head Clustering

rbal_splunk
Splunk Employee

It would be nice to have high-level documentation on troubleshooting configurations under search head clustering.

1 Solution

rbal_splunk
Splunk Employee

Must read:
http://docs.splunk.com/Documentation/Splunk/6.2.3/DistSearch/AboutSHC
http://docs.splunk.com/Documentation/Splunk/6.2.3/DistSearch/SHCarchitecture
http://docs.splunk.com/Documentation/Splunk/6.2.3/DistSearch/HowconfigurationworksinSHC
http://docs.splunk.com/Documentation/Splunk/6.2.3/DistSearch/HowconfrepoworksinSHC
http://docs.splunk.com/Documentation/Splunk/6.2.3/DistSearch/PropagateSHCconfigurationchanges

Configuration Changes

Under search head clustering, there are two broad categories of configuration changes:

  • changes to search and UI configurations driven by user activity in the UI/CLI/REST
    • save a report
    • add a panel to a dashboard
    • create a field extraction
  • changes to system configurations made by administrators
    • configure forwarding
    • deploy centralized authentication (e.g. LDAP)
    • install an entirely new app or a hand-edited configuration file

Changes in the first category are replicated across a search head cluster automatically.

Changes in the second category must be validated entirely outside of the search head cluster and then pushed from a central instance, the deployer.

Watch Out For

Users should set conf_deploy_fetch_url on each search head cluster member to mitigate the potential impact of "missed" deployments caused by intermittent outages. Note that technically this parameter is optional, as customers do not necessarily have to use the deployer; some users are willing to take on the burden of distributing baseline configurations via some homegrown mechanism.
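To make the conf_deploy_fetch_url check concrete, here is a minimal Python sketch that looks for the setting in the [shclustering] stanza of a server.conf file's text. The sample content and the deployer host name are placeholders, and the function name is made up for illustration. It reads a single file only, so it does not reproduce how Splunk layers configuration files on top of each other.

```python
import configparser


def deployer_url(server_conf_text):
    """Return conf_deploy_fetch_url from the [shclustering] stanza, or None."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(server_conf_text)
    if not parser.has_section("shclustering"):
        return None
    value = parser.get("shclustering", "conf_deploy_fetch_url", fallback="").strip()
    return value or None


# Hypothetical server.conf content; the host name is a placeholder.
SAMPLE = """
[shclustering]
conf_deploy_fetch_url = https://deployer.example.com:8089
"""

if __name__ == "__main__":
    url = deployer_url(SAMPLE)
    print(url or "conf_deploy_fetch_url is not set on this member")
```

Running something like this against each member's server.conf gives a quick list of members that would miss a deployment after an outage.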

Users must validate system configurations before deploying them, as a bad set of configurations can totally hose a search head cluster.

If a user "disobeys" our best practices and makes a system configuration change directly on a search head cluster member, that change will NOT be replicated to other members. Instead, that member's configuration will permanently diverge from the rest of the search head cluster.

Changes to $SPLUNK_HOME/etc/passwd are not replicated across a search head cluster. This means user creation and password changes must be manually synchronized across search head cluster members. We generally recommend that users use centralized auth (e.g. LDAP) under search head clustering.

Search peer configuration is not replicated, meaning changes to distsearch.conf are not replicated. If a customer is using index clustering, they should use the cluster master to keep search peers in sync across a search head cluster. Otherwise, they'll need to maintain the same set of search peers across their search head cluster members via some other means.

One reason search peers must be consistent across all search head cluster members: the captain alone handles bundle replication for the entire search head cluster. Expect inconsistent behavior if the captain only knows about a subset of the search peers.
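If you already have each member's list of search peers in hand, spotting a mismatch is a simple set comparison. The Python sketch below assumes you have collected the peer lists yourself (how you gather them is out of scope here); the member and peer names are made up.

```python
def peer_differences(peers_by_member):
    """Map each member to the peers it lacks relative to the union of all members."""
    all_peers = set().union(*peers_by_member.values())
    missing = {}
    for member, peers in peers_by_member.items():
        gap = all_peers - set(peers)
        if gap:
            missing[member] = sorted(gap)
    return missing


# Hypothetical inventory: member name -> search peers configured on that member.
SAMPLE = {
    "sh1": {"idx1:8089", "idx2:8089"},
    "sh2": {"idx1:8089"},
}

if __name__ == "__main__":
    for member, gap in peer_differences(SAMPLE).items():
        print(f"{member} is missing: {', '.join(gap)}")
```

An empty result means every member knows about the same set of search peers, which is the state you want before the captain handles bundle replication.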

Notable Behaviors

By default, configuration changes are replicated approximately every 5s. Under good conditions, we expect a change made on one member to be reflected shcluster-wide in 5-10s.

The UI under search head clustering automatically disables access to system configurations. In other words, the "Settings" menu hides configurations that are meant to be deployed instead of edited "live" on a search head cluster member – this includes indexes, inputs, authentication, users, roles, etc.

If an administrator wants to make "one-off" changes to individual search head cluster members, it is possible to re-enable the full Settings menu via the UI itself. If a user does this, we log a WARN to splunkd.log to make this easy to spot.

Changes to log.cfg, log-local.cfg, etc. are not replicated.

Notable REST APIs

/replication/configuration
/apps/local
/apps/deploy

If you think there's a problem with configurations:

Before anything else, sanity-check the search head cluster:

  • search head clustering is configured on every node of interest
  • bootstrap succeeded
  • there's a stable captain in the shcluster: splunk show shcluster-status
  • the captain has a reasonable configuration: splunk list shcluster-captain-info
  • the member of interest has a reasonable configuration: splunk list shcluster-member-info
  • the member of interest has a stable connection to the captain
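If you run these checks often, a small wrapper can collect the output of the three CLI commands in one pass. This is only a sketch: the path to the splunk binary is an assumption, the commands may prompt for credentials depending on your setup, and the function name is made up for illustration.

```python
import subprocess

# CLI commands named in the checklist above.
CHECKS = [
    ["show", "shcluster-status"],
    ["list", "shcluster-captain-info"],
    ["list", "shcluster-member-info"],
]


def run_checks(splunk_bin="/opt/splunk/bin/splunk", runner=subprocess.run):
    """Run each check and return {command: (exit code, stdout)}.

    splunk_bin is an assumed install path; runner is injectable so the
    function can be exercised without a Splunk installation.
    """
    results = {}
    for args in CHECKS:
        proc = runner([splunk_bin, *args], capture_output=True, text=True)
        results[" ".join(args)] = (proc.returncode, proc.stdout)
    return results


if __name__ == "__main__":
    for command, (code, output) in run_checks().items():
        print(f"== splunk {command} (exit {code})")
        print(output)
```

Comparing the captured output from two members side by side is often the fastest way to notice that one of them disagrees about who the captain is.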

Check splunkd.log. In particular, look at the following channels:

  • ConfReplication
  • ConfReplicationThread
  • ConfReplicationHandler
  • loader (during startup)
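A quick way to pull only those channels out of splunkd.log is a small filter like the Python sketch below. It assumes the usual "date time timezone LEVEL component - message" line layout; the sample lines and their messages are invented for illustration, not real log output.

```python
import re

CHANNELS = {"ConfReplication", "ConfReplicationThread", "ConfReplicationHandler", "loader"}

# Assumed layout: "<date> <time> <tz> <LEVEL> <component> - <message>".
LINE_RE = re.compile(
    r"^\S+ \S+ \S+ (?P<level>[A-Z]+)\s+(?P<component>\S+) - (?P<message>.*)$"
)


def conf_replication_events(lines):
    """Yield (level, component, message) for lines logged on the channels of interest."""
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if match and match.group("component") in CHANNELS:
            yield match.group("level"), match.group("component"), match.group("message")


# Hypothetical sample lines.
SAMPLE = [
    "05-12-2015 10:15:30.123 -0700 WARN  ConfReplicationThread - example message",
    "05-12-2015 10:15:31.000 -0700 INFO  Metrics - group=conf",
]

if __name__ == "__main__":
    for level, component, message in conf_replication_events(SAMPLE):
        print(level, component, message)
```

In practice you would pass an open file handle for splunkd.log instead of the sample list.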

If perf issues are a potential concern, check metrics.log and look for group=conf. These metrics messages include summaries of how often various conf-related actions were performed during the last 30s interval, how much cumulative time was spent on each action "type", and the worst/longest execution time for a single invocation of any action. A rough, high-level description of what each "action" means:

  • accept_push: on the captain, accept replicated changes from a member (related: push_to)
  • acquire_mutex: acquire a mutex that "protects" the configuration system (related: release_and_reacquire_mutex)
  • add_commit: on a member, record a change
  • base_initialize: initialize a configuration "root", e.g. $SPLUNK_HOME/etc (related: repo_initialize)
  • check_range: compare two ranges of configuration changes (related: compute_common)
  • compute_common: find the latest common change between a member and the captain (related: check_range)
  • pull_from: on a member, pull changes from the captain (related: reply_pull)
  • purge_eligible: on a member, purge sufficiently old changes from the repo
  • push_to: on a member, push changes to the captain (related: accept_push)
  • release_and_reacquire_mutex: release, then re-acquire a mutex that "protects" the configuration system (related: acquire_mutex)
  • reply_pull: on the captain, reply to a member's pull_from request (related: pull_from)
  • repo_initialize: initialize a configuration repo from disk (related: base_initialize)
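To find which action is eating the time, you can parse the group=conf lines as generic key=value pairs and sort on whichever timing field your metrics.log carries. In the Python sketch below, only group=conf and the action names come from the description above; the other field names (count, wallclock_ms_total, wallclock_ms_max) and the sample lines are assumptions for illustration.

```python
import re

# key=value pairs; values are either quoted or run up to the next space or comma.
KV_RE = re.compile(r'(\w+)=("[^"]*"|[^\s,]+)')


def parse_conf_metrics(line):
    """Return the key=value fields of a group=conf line, or None for any other line."""
    fields = {key: value.strip('"') for key, value in KV_RE.findall(line)}
    return fields if fields.get("group") == "conf" else None


def slowest(lines, field):
    """Return the group=conf row with the largest numeric value in `field`, or None."""
    rows = [row for row in map(parse_conf_metrics, lines) if row and field in row]
    return max(rows, key=lambda row: float(row[field]), default=None)


# Hypothetical sample lines; field names other than group are assumed.
SAMPLE = [
    "05-12-2015 10:15:30.123 -0700 INFO  Metrics - group=conf, action=pull_from, count=6, wallclock_ms_total=120, wallclock_ms_max=45",
    "05-12-2015 10:15:30.123 -0700 INFO  Metrics - group=conf, action=acquire_mutex, count=40, wallclock_ms_total=8, wallclock_ms_max=1",
    "05-12-2015 10:15:30.123 -0700 INFO  Metrics - group=thruput, name=index_thruput, kb=1.5",
]

if __name__ == "__main__":
    worst = slowest(SAMPLE, "wallclock_ms_max")
    print(worst["action"] if worst else "no group=conf lines found")
```

Because the parser is generic, it keeps working if your version logs different field names; just pass the field you want to rank on.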

If network activity between search head cluster members is of particular concern, check splunkd_access.log on the various members. Look for GETs and POSTs against the /replication/configuration REST API.

If network activity between search head cluster members and the deployer is of concern, look for GETs and POSTs against the /apps/deploy and /apps/local REST APIs.
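Both access-log checks can be done with one small tally of requests per endpoint, method, and status code. The Python sketch below assumes splunkd_access.log uses a web-server style request field ("METHOD path HTTP/x.y" followed by the status code) and matches the endpoints as substrings of the path; the sample lines, including the /services prefix, are invented for illustration.

```python
import re
from collections import Counter

ENDPOINTS = ("/replication/configuration", "/apps/deploy", "/apps/local")

# Assumed request field: "<METHOD> <path> HTTP/<version>" followed by the status code.
REQUEST_RE = re.compile(
    r'"(?P<method>GET|POST) (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3})'
)


def count_requests(lines):
    """Count GETs and POSTs against the endpoints of interest, keyed by (endpoint, method, status)."""
    counts = Counter()
    for line in lines:
        match = REQUEST_RE.search(line)
        if not match:
            continue
        for endpoint in ENDPOINTS:
            if endpoint in match.group("path"):
                counts[(endpoint, match.group("method"), match.group("status"))] += 1
                break
    return counts


# Hypothetical sample lines.
SAMPLE = [
    '10.0.0.5 - splunk-system-user [12/May/2015:10:15:30.123 -0700] "POST /services/replication/configuration HTTP/1.1" 200 512',
    '10.0.0.5 - splunk-system-user [12/May/2015:10:15:35.456 -0700] "GET /services/replication/configuration HTTP/1.1" 200 2048',
    '10.0.0.6 - splunk-system-user [12/May/2015:10:16:00.000 -0700] "GET /services/apps/deploy HTTP/1.1" 503 0',
    '10.0.0.6 - admin [12/May/2015:10:16:05.000 -0700] "GET /services/server/info HTTP/1.1" 200 900',
]

if __name__ == "__main__":
    for (endpoint, method, status), total in sorted(count_requests(SAMPLE).items()):
        print(f"{method} {endpoint} -> {status}: {total}")
```

A run of non-200 statuses, or no requests at all during a period when changes were being made, is a good hint about where to look next.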

splunkIT
Splunk Employee

In addition to what @rbal has mentioned above, try adjusting some of these server.conf parameters to limit the load on the captain:

  • Increase max_peer_rep_load to 10 or more.
  • Set "async_replicate_on_proxy = false" to reduce excess replications.
  • Set "scheduling_heuristic = round_robin". In many cases, setting this parameter alone has helped tremendously.
  • Set "captain_is_adhoc_searchhead = true".
  • Set heartbeat_period to 10 or higher. The default is 5 seconds.

Please refer to the server.conf spec for details on these parameters:
http://docs.splunk.com/Documentation/Splunk/6.3.0/Admin/Serverconf


RMartinezDTV
Path Finder

A few of these have changed as of v6.4. User replication ($SPLUNK_HOME/etc/passwd) and search peer replication are now handled when users or search peers are added via the CLI (and, I assume, the REST API).
