Must Read:
Under search head clustering, there are two broad categories of configuration changes:

1. Changes to search and UI configurations driven by user activity in the UI/CLI/REST, for example:
   - adding a panel to a dashboard
   - creating a field extraction
2. Changes to system configurations made by administrators, for example:
   - deploying centralized authentication (e.g. LDAP)
   - installing an entirely new app or a hand-edited configuration file
Changes in the first category are replicated across a search head cluster automatically.
Changes in the second category must be validated outside of the search head cluster entirely and then pushed from a central instance – the deployer.
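As a sketch, the usual deployer workflow is to stage the validated app or configuration under $SPLUNK_HOME/etc/shcluster/apps on the deployer, then push the bundle with `splunk apply shcluster-bundle` (the target URI and credentials below are placeholders):

```
# On the deployer: stage validated configs under etc/shcluster/apps first,
# then push the bundle to any one member; the captain distributes it cluster-wide.
splunk apply shcluster-bundle --answer-yes \
    -target https://sh1.example.com:8089 \
    -auth admin:changeme
```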
Watch Out For
Users should set conf_deploy_fetch_url on each search head cluster member to mitigate the potential impact of "missed" deployments caused by intermittent outages. Note that this parameter is technically optional, as customers do not necessarily have to use the deployer; some users are willing to take on the burden of distributing baseline configurations via some homegrown mechanism.
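A minimal server.conf sketch showing conf_deploy_fetch_url (the hostname is a placeholder):

```
# server.conf on each search head cluster member
[shclustering]
# URL of the deployer; allows a member to fetch the latest bundle on its own,
# e.g. after missing a push due to an outage
conf_deploy_fetch_url = https://deployer.example.com:8089
```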
Users must validate system configurations before deploying them; a bad set of configurations can render an entire search head cluster inoperable.
If a user "disobeys" our best practices and makes a system configuration change directly on a search head cluster member, that change will NOT be replicated to other members. Instead, that member's configuration will permanently diverge from the rest of the search head cluster.
Changes to /etc/passwd are not replicated across a search head cluster. This means user creation and password changes must be manually synchronized across search head cluster members. We generally recommend that users use centralized auth (e.g. LDAP) under search head clustering.
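A hedged authentication.conf sketch of centralized LDAP auth (all hostnames, DNs, and the strategy name below are placeholders; consult the authentication.conf spec for the full parameter set):

```
# authentication.conf – deploy the same copy to every member via the deployer
[authentication]
authType = LDAP
authSettings = corp_ldap

[corp_ldap]
host = ldap.example.com
port = 389
bindDN = cn=splunk,ou=services,dc=example,dc=com
bindDNpassword = changeme
userBaseDN = ou=people,dc=example,dc=com
groupBaseDN = ou=groups,dc=example,dc=com
userNameAttribute = uid
```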
Search peer configuration is not replicated, meaning changes to distsearch.conf are not replicated. If a customer is using index clustering, they should use the cluster master to keep search peers in sync across a search head cluster. Otherwise, they'll need to maintain the same set of search peers across their search head cluster members via some other means.
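Under index clustering, each search head cluster member can be pointed at the cluster master so that the peer list stays in sync automatically (URIs and the secret below are placeholders):

```
# Run on each search head cluster member
splunk edit cluster-config -mode searchhead \
    -master_uri https://cm.example.com:8089 \
    -secret idxcluster_key
splunk restart
```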
One reason it's important for search peers to be consistent across all search head cluster members – the captain alone handles bundle replication for the entire search head cluster. You'll see weird behavior if the captain only knows about a subset of search peers.
By default, configuration changes are replicated approximately every 5s. Under good conditions, we expect a change made on one member to be reflected shcluster-wide in 5-10s.
The UI under search head clustering automatically disables access to system configurations. In other words, the "Settings" menu hides configurations that are meant to be deployed instead of edited "live" on a search head cluster member – this includes indexes, inputs, authentication, users, roles, etc.
If an administrator wants to make "one-off" changes to individual search head cluster members, it is possible to re-enable the full Settings menu via the UI itself. If a user does this, we log a WARN to splunkd.log to make this easy to spot.
Changes to log.cfg, log-local.cfg, etc. are not replicated.
Notable REST APIs
If you think there's a problem with configurations:
Before anything else, sanity-check the search head cluster:
- search head clustering is configured on every node of interest
- bootstrap succeeded
- there's a stable captain in the shcluster – splunk show shcluster-status
- the captain has a reasonable configuration – splunk list shcluster-captain-info
- the member of interest has a reasonable configuration – splunk list shcluster-member-info
- the member of interest has a stable connection to the captain
check splunkd.log. In particular, look on the following channels:
- ConfReplication
- ConfReplicationThread
- ConfReplicationHandler
- loader (during startup)
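To skim a member's splunkd.log for just those channels, a grep along these lines works (the sample log lines below are fabricated purely for illustration):

```shell
# Fabricated splunkd.log sample, for illustration only
cat > /tmp/splunkd_sample.log <<'EOF'
01-01-2024 12:00:00.000 +0000 INFO  ConfReplicationThread - Pull from captain succeeded
01-01-2024 12:00:05.000 +0000 WARN  ConfReplication - Error pulling configurations from captain
01-01-2024 12:00:06.000 +0000 INFO  DispatchManager - unrelated channel
EOF

# Filter to the conf replication channels of interest
grep -E ' (ConfReplication(Thread|Handler)?|loader) ' /tmp/splunkd_sample.log
```

On a real member, point the grep at $SPLUNK_HOME/var/log/splunk/splunkd.log instead.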
If perf issues are a potential concern, check metrics.log and look for group=conf. These metrics messages include summaries of how often various conf-related actions were performed during the last 30s interval, how much cumulative time was spent on each action "type", and the worst/longest execution time for a single invocation of any action. A rough, high-level description of what each "action" means:
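To pull the conf-related metrics out for review, something like the following works (the path assumes a default install):

```
# Show recent conf-action summaries from metrics.log
grep 'group=conf' $SPLUNK_HOME/var/log/splunk/metrics.log | tail -20
```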
If network activity between search head cluster members is of particular concern, check splunkd_access.log on the various members. Look for GETs and POSTs against the /replication/configuration REST API.
If network activity between search head cluster members and the deployer is of concern, check splunkd_access.log for GETs and POSTs against the /apps/deploy and /apps/local REST APIs.
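On each member, greps along these lines surface the relevant REST traffic (paths assume a default install):

```
# Member-to-member conf replication traffic
grep 'replication/configuration' $SPLUNK_HOME/var/log/splunk/splunkd_access.log

# Deployer bundle traffic
grep -E '/apps/(deploy|local)' $SPLUNK_HOME/var/log/splunk/splunkd_access.log
```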
In addition to what @rbal has mentioned above, try adjusting some of these server.conf parameters to limit the load on the captain:
- Set scheduling_heuristic = round_robin. In many cases, setting this parameter alone has helped tremendously.
- Set captain_is_adhoc_searchhead = true.
- Raise heartbeat_period to 10 or higher (the default is 5 seconds).
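Taken together, those tunings would look roughly like this in server.conf on the members (values are illustrative):

```
[shclustering]
# Distribute scheduled searches round-robin instead of load-based
scheduling_heuristic = round_robin
# Exempt the captain from running scheduled searches (ad hoc only)
captain_is_adhoc_searchhead = true
# Heartbeat less frequently to reduce chatter to the captain (default: 5)
heartbeat_period = 10
```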
Please refer to the server.conf spec for details on these parameters.
A few of these have changed as of v6.4: user replication (/etc/passwd) and search peer replication are now handled when members are added via the CLI (and presumably via the REST API as well).