Why do scheduled searches randomly stop running in a Splunk 6.3.0 Search Head Cluster?

emiller42
Motivator

We're running a Search Head Cluster on Splunk 6.3.0. We have noticed that saved searches/alerts for some users stop dispatching seemingly at random. Issuing a rolling-restart on the cluster gets them working again, but eventually they stop.

1 Solution

emiller42
Motivator

EDIT: New details as of 12/11. Scroll down!


Answering this one myself to get the info out there:

This has been registered as a bug with Splunk Support (SPL-109514). Details of the issue, how to detect it, and how to work around it follow:


Description

There appears to be a problem with quota calculation in the search scheduler that is specific to a clustered deployment. Splunk will not dispatch a saved search if the user has reached their concurrent search quota (as defined in authorize.conf). However, it appears the current usage is not calculated correctly, causing the user to show as over quota. We see this 'usage' value slowly grow over time.

Splunk emits WARNs when this happens:

11-18-2015 11:10:34.638 -0600 WARN  SHPMaster - Search not executed: Your maximum number of concurrent searches has been reached. usage=41 quota=30 user=someuser. for search: nobody;search;A Saved Search

While some of these may be legitimate, users affected by this bug generate a far higher volume of them. (We see WARNs every 7 seconds for each scheduled search owned by the affected user.)

A confusing side-effect of this is that because the searches are never dispatched, they don't report as skipped. So if you're looking at scheduler metrics in the DMC, it looks like everything is successfully running.

Detection

Because of the sheer volume of WARNs generated, you can use that to detect the issue. We run the following search as an alert over a 5-minute window:

index=_internal sourcetype=splunkd component=SHPMaster "Search not executed: Your maximum number of concurrent searches has been reached"
| rex "user\=(?<user>.+)\.\s+for search:\s(?<search_user>[^;]+);(?<search_context>[^;]+);(?<search_name>.+)"
| fields _time usage quota user search_*
| stats count by user search_name
| where count>40
| stats values(search_name) as affected_searches by user

Alert if any records are returned.
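
For checking a raw splunkd.log outside of Splunk, the same logic can be sketched in Python. This is only a minimal sketch: the function name affected_searches and its threshold parameter are illustrative assumptions, and the regex assumes the WARN format shown above.

```python
import re
from collections import Counter, defaultdict

# Same pattern as the rex above; Python spells named groups (?P<name>...).
WARN_RE = re.compile(
    r"usage=(?P<usage>\d+) quota=(?P<quota>\d+) "
    r"user=(?P<user>.+)\.\s+for search:\s"
    r"(?P<search_user>[^;]+);(?P<search_context>[^;]+);(?P<search_name>.+)"
)
MARKER = "Search not executed: Your maximum number of concurrent searches has been reached"


def affected_searches(lines, threshold=40):
    """Return {user: sorted search names} for pairs seen more than `threshold` times.

    `lines` is any iterable of splunkd.log lines covering the window of interest.
    """
    counts = Counter()
    for line in lines:
        if "SHPMaster" not in line or MARKER not in line:
            continue
        match = WARN_RE.search(line)
        if match:
            key = (match.group("user"), match.group("search_name").strip())
            counts[key] += 1

    # Mirror "where count>40" followed by "stats values(search_name) by user".
    grouped = defaultdict(set)
    for (user, name), seen in counts.items():
        if seen > threshold:
            grouped[user].add(name)
    return {user: sorted(names) for user, names in grouped.items()}
```

Feed it only the lines from the last five minutes to match the alert window; an empty dict means nothing crossed the threshold.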

Impact

This can prevent alert searches from running. Depending on the importance of those alerts, the impact can be severe.

The frequency of this issue can vary, and appears to be related to overall scheduler activity. Our production cluster saw it happen every day or so, while in a lower volume testing environment it could take over a week to surface.

Remediation

If you are affected by this issue, a rolling-restart of the search head cluster will get things moving again. However, the issue will recur, so this becomes an ongoing maintenance task.

NEW 12/11 - Workaround

This issue is related to new functionality in Splunk 6.3. Pre-6.3, Splunk calculated quotas independently on each search head. In 6.3, this changed to calculating cluster-wide quotas. This new behavior makes sense, but doesn't seem to work correctly in practice. You can restore Splunk to its pre-6.3 behavior by adding the following in limits.conf:

[scheduler]
shc_role_quota_enforcement = false
shc_local_quota_check = true
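
To double-check that a given limits.conf actually carries both settings, here is a minimal Python sketch. The function name workaround_applied is hypothetical, and it assumes a simple file that Python's configparser can read. It inspects one file only and says nothing about the merged configuration Splunk ends up using.

```python
import configparser


def workaround_applied(limits_conf_text):
    """True if the [scheduler] stanza carries both settings from the workaround."""
    # strict=False tolerates repeated keys; interpolation=None leaves '%' alone.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(limits_conf_text)
    if not parser.has_section("scheduler"):
        return False
    scheduler = parser["scheduler"]
    # fallback=None means "not set in this file", which counts as not applied.
    enforcement = scheduler.getboolean("shc_role_quota_enforcement", fallback=None)
    local_check = scheduler.getboolean("shc_local_quota_check", fallback=None)
    return enforcement is False and local_check is True
```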

Alternative Workaround

A way to prevent the bug from occurring is to remove all role-based concurrent search quotas. Note that this leaves your users free to run concurrent searches up to the server-based restrictions in limits.conf.

Since we weren't certain of the interaction of imported roles here, we explicitly set all roles to zero, including the built-in roles ('default', 'user', 'power', and 'admin').

Example authorize.conf stanza:

[role_admin]
srchJobsQuota = 0
rtSrchJobsQuota = 0
cumulativeSrchJobsQuota = 0
cumulativeRTSrchJobsQuota = 0
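
Since the same four settings have to be repeated for every role, a small generator can save some typing. This is a minimal sketch: zero_quota_stanzas is a hypothetical helper, and it assumes every role follows the [role_<name>] pattern of the stanza above. That may not hold for 'default', so check how it is spelled in your own authorize.conf before including it.

```python
# The four quota settings from the example stanza above.
QUOTA_KEYS = (
    "srchJobsQuota",
    "rtSrchJobsQuota",
    "cumulativeSrchJobsQuota",
    "cumulativeRTSrchJobsQuota",
)


def zero_quota_stanzas(roles):
    """Build authorize.conf text that sets every quota to 0 for each role name."""
    blocks = []
    for role in roles:
        # Assumption: stanza names follow the [role_<name>] pattern.
        lines = [f"[role_{role}]"] + [f"{key} = 0" for key in QUOTA_KEYS]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    print(zero_quota_stanzas(["user", "power", "admin"]))
```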

bohanlon_splunk
Splunk Employee

SPL-109514 - Number of concurrent searches increases in an idle SHC member

This defect has been addressed in the 6.3.3 maintenance update. I would recommend upgrading to the latest maintenance update for 6.3.

brigancc
Explorer

I don't see SPL-109514 referenced in any of the known issues or release notes. In fact, I can't find any reference to this issue at all. Could we please confirm which release addressed this issue?

bohanlon_splunk
Splunk Employee

I can confirm the fix for SPL-109514 shipped in 6.3.3.
SPL-122983 is a docs enhancement request to update the release notes to properly reflect this.

brigancc
Explorer

Awesome. Thanks for a rapid response!

Ankitha_d
Path Finder

We had the exact same issue, which was not fixed even after upgrading to 6.3.5. We finally had to use the workaround suggested here.
Thank you for the information.

martin_hempstoc
Explorer

This is so helpful. We recently installed ITSI, which caused us to go over the new quota limits.

| rest /servicesNS/-/-/saved/searches
| search disabled=0 is_scheduled=1
| table title cron_schedule next_scheduled_time
| convert mktime(next_scheduled_time) AS next_scheduled_time_epoch timeformat="%F %T %Z"
| where now()>next_scheduled_time_epoch

We used this to detect any searches that have gone stale.
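
The staleness test in that search (next_scheduled_time already in the past) can also be sketched in Python for anyone post-processing the REST output elsewhere. This is a minimal sketch under stated assumptions: stale_searches is a hypothetical helper, rows are assumed to be dicts with title and next_scheduled_time, and timestamps are assumed to look like "2015-11-18 11:10:00 UTC". Anything not ending in UTC is skipped rather than guessed at.

```python
from datetime import datetime, timezone


def stale_searches(rows, now=None):
    """Return titles of searches whose next_scheduled_time is already in the past."""
    now = now or datetime.now(timezone.utc)
    stale = []
    for row in rows:
        raw = row.get("next_scheduled_time", "")
        if not raw.endswith(" UTC"):
            continue  # empty or non-UTC value: skip instead of guessing the zone
        # Drop the " UTC" suffix and attach the zone explicitly.
        when = datetime.strptime(raw[:-4], "%Y-%m-%d %H:%M:%S").replace(
            tzinfo=timezone.utc
        )
        if now > when:
            stale.append(row["title"])
    return stale
```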

brooklynotss
Path Finder

Thank you so much for this. It has been giving me some late nights of troubleshooting.
Two things:

  1. I found I also needed to apply the same quota value changes to [role_splunk-system-role] to get those warnings to stop.
  2. Regarding your detection recommendation above: thanks for the search, but if scheduled searches have stopped (which is what it detects, right?), how is an additional scheduled search going to alert you?

brooklynotss
Path Finder

According to Splunk Support this is fixed in 6.3.1 or 2 or 3? Can anyone confirm?

alekksi
Communicator

This is still a bug in 6.3.3. I had to apply the workaround to fix this issue after not restarting the search head cluster for a week. There is no reference to this in the known issues, however, nor any response from Splunk Support.

http://docs.splunk.com/Documentation/Splunk/6.3.3/ReleaseNotes/KnownIssues

emiller42
Motivator

Regarding #2: I run the detection search on the Cluster Deployer, which is not a part of the cluster, and thus unaffected by the issue.

brooklynotss
Path Finder

Thank you for that update and additional workaround. At first glance, it seems to do the trick. Apparently I was a little too eager to get to SH Clustering for Windows. I probably should have waited for a few dot releases.
