EDIT: New details as of 12/11. Scroll down!
Answering this one myself to get the info out there:
This has been registered as a bug with Splunk support. (SPL-109514) Details of the issue, how to detect it, and how to work around it follow:
Description
There appears to be a problem with quota calculation in the search scheduler that is specific to a clustered deployment. Splunk will not dispatch a saved search if the user has reached their concurrent search quota (as defined in authorize.conf). However, it appears the current usage is not calculated correctly, causing the user to show as over-quota. We see this 'usage' value slowly grow over time.
Splunk emits WARNs when this happens:
11-18-2015 11:10:34.638 -0600 WARN SHPMaster - Search not executed: Your maximum number of concurrent searches has been reached. usage=41 quota=30 user=someuser. for search: nobody;search;A Saved Search
While some of these may be legitimate, users affected by this bug generate a far higher volume of them (we see WARNs every 7 seconds for each scheduled search owned by the affected user).
A confusing side effect of this is that, because the searches are never dispatched, they don't report as skipped. So if you're looking at scheduler metrics in the DMC, it looks like everything is running successfully.
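One way to see this directly (a sketch; "A Saved Search" is just the name from the WARN example above, substitute one of your own affected scheduled searches) is to check scheduler.log for that search. When the bug is active, the search simply stops producing execution records, rather than showing up with status=skipped:
index=_internal sourcetype=scheduler savedsearch_name="A Saved Search"
| timechart span=15m count by status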
Detection
Because of the sheer volume of WARNs generated, you can use them to detect the issue. We run the following search over a 5-minute window as an alert:
index=_internal sourcetype=splunkd component=SHPMaster "Search not executed: Your maximum number of concurrent searches has been reached"
| rex "user\=(?<user>.+)\.\s+for search:\s(?<search_user>[^;]+);(?<search_context>[^;]+);(?<search_name>.+)"
| fields _time usage quota user search_*
| stats count by user search_name
| where count>40
| stats values(search_name) as affected_searches by user
Alert if any records are returned.
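As a supplementary check (a sketch that relies on the usage and quota key=value pairs auto-extracting from the WARN messages, as the search above does; someuser is the placeholder from the WARN example), you can chart the reported usage against the quota and watch the usage value drift upward between restarts:
index=_internal sourcetype=splunkd component=SHPMaster "maximum number of concurrent searches has been reached"
| rex "user\=(?<user>.+)\.\s+for search:"
| search user=someuser
| timechart span=15m max(usage) as reported_usage, max(quota) as quota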
Impact
This can prevent alert searches from running. Depending on the importance of those alerts, the impact can be severe.
The frequency of this issue can vary, and appears to be related to overall scheduler activity. Our production cluster saw it happen every day or so, while in a lower-volume testing environment it could take over a week to surface.
Remediation
If you are affected by this issue, a rolling restart of the search head cluster will get things moving again. However, the issue will recur, so this becomes an ongoing maintenance task.
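For reference, the rolling restart is issued from the cluster captain (the path shown assumes a default $SPLUNK_HOME):
# Run on the search head cluster captain; restarts members one at a time
$SPLUNK_HOME/bin/splunk rolling-restart shcluster-members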
NEW 12/11 - Workaround
This issue is related to new functionality in Splunk 6.3. Pre-6.3, Splunk calculated quotas independently on each search head. In 6.3, this changed to calculating cluster-wide quotas. The new behavior makes sense, but doesn't seem to work correctly in practice. You can restore Splunk to its pre-6.3 behavior by adding the following in limits.conf:
[scheduler]
shc_role_quota_enforcement = false
shc_local_quota_check = true
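The setting needs to reach every search head cluster member. One way to get it there (a sketch; the app name and placement are just an example, not a requirement) is to push it from the deployer and apply the bundle:
# On the deployer, e.g. $SPLUNK_HOME/etc/shcluster/apps/quota_workaround/local/limits.conf
# then push to the members:
$SPLUNK_HOME/bin/splunk apply shcluster-bundle -target https://<any_member>:8089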
Alternative Workaround
A way to prevent the bug from occurring is to remove all role-based concurrent search quotas. Note that this leaves your users free to run concurrent searches up to the server-based restrictions in limits.conf.
Since we weren't certain how imported roles would interact here, we explicitly set the quotas to zero for all roles, including the built-in roles ('default', 'user', 'power', and 'admin').
Example authorize.conf stanza:
[role_admin]
srchJobsQuota = 0
rtSrchJobsQuota = 0
cumulativeSrchJobsQuota = 0
cumulativeRTSrchJobsQuota = 0
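To verify the effective values after pushing the change, a quick check (a sketch using the roles REST endpoint; the column names mirror the authorize.conf settings) is:
| rest /services/authorization/roles
| table title srchJobsQuota rtSrchJobsQuota cumulativeSrchJobsQuota cumulativeRTSrchJobsQuota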