We're running a Search Head Cluster on Splunk 6.3.0. We have noticed that saved searches/alerts for some users stop dispatching seemingly at random. Issuing a rolling-restart on the cluster gets them working again, but eventually they stop.
Answering this one myself to get the info out there:
This has been registered as a bug with Splunk support (SPL-109514). Details of the issue, how to detect it, and how to work around it follow:
There appears to be a problem with quota calculation in the search scheduler that is specific to a clustered deployment. Splunk will not dispatch a saved search if the user has reached their concurrent search quota (as defined in authorize.conf). However, the current usage does not appear to be calculated correctly, causing the user to show as over quota. We see this 'usage' value slowly grow over time.
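For reference, the per-role quotas live in authorize.conf; a role stanza along these lines (the role name and values here are only illustrative) is where a limit like the quota=30 in the WARNs below would come from:
# Hypothetical role, for illustration only
[role_report_owner]
# Maximum concurrent historical searches for members of this role
srchJobsQuota = 30
# Maximum concurrent real-time searches
rtSrchJobsQuota = 6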
Splunk emits WARNs when this happens:
11-18-2015 11:10:34.638 -0600 WARN SHPMaster - Search not executed: Your maximum number of concurrent searches has been reached. usage=41 quota=30 user=someuser. for search: nobody;search;A Saved Search
While some of these may be legitimate, users affected by this bug generate a far higher volume of them (we see WARNs every 7 seconds for each scheduled search owned by the affected user).
A confusing side-effect of this is that because the searches are never dispatched, they don't report as skipped. So if you're looking at scheduler metrics in the DMC, it looks like everything is successfully running.
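For comparison, the usual way to spot skipped scheduled searches is via the scheduler log; a check along these lines comes back clean when this bug is the cause, because the affected searches are never dispatched or marked skipped at all:
index=_internal sourcetype=scheduler status=skipped
| stats count by user savedsearch_name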
Because of the sheer volume of WARNs generated, you can use them to detect the issue. We run the following search over a 5-minute window as an alert:
index=_internal sourcetype=splunkd component=SHPMaster "Search not executed: Your maximum number of concurrent searches has been reached"
| rex "user\=(?<user>.+)\.\s+for search:\s(?<search_user>[^;]+);(?<search_context>[^;]+);(?<search_name>.+)"
| fields _time usage quota user search_*
| stats count by user search_name
| where count>40
| stats values(search_name) as affected_searches by user
Alert if any records are returned.
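For completeness, here is a sketch of how that detection search could be saved as a 5-minute scheduled alert in savedsearches.conf; the stanza name, schedule, and email recipient are just examples:
# Example stanza; name, schedule, and recipient are placeholders
[SHC scheduler quota bug detector]
enableSched = 1
cron_schedule = */5 * * * *
dispatch.earliest_time = -5m
dispatch.latest_time = now
# Fire when the search returns any rows
counttype = number of events
relation = greater than
quantity = 0
action.email = 1
action.email.to = splunk-admins@example.com
search = index=_internal sourcetype=splunkd component=SHPMaster "Search not executed: Your maximum number of concurrent searches has been reached" \
| rex "user\=(?<user>.+)\.\s+for search:\s(?<search_user>[^;]+);(?<search_context>[^;]+);(?<search_name>.+)" \
| fields _time usage quota user search_* \
| stats count by user search_name \
| where count>40 \
| stats values(search_name) as affected_searches by user
We run this on the cluster deployer rather than on a member, so the detection alert can't itself be blocked by the bug (more on that in the comments below).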
This can prevent alert searches from running. Depending on the importance of those alerts, the impact can be severe.
The frequency of this issue can vary, and appears to be related to overall scheduler activity. Our production cluster saw it happen every day or so, while in a lower volume testing environment it could take over a week to surface.
If you're affected by this issue, a rolling restart of the search head cluster will get things moving again. However, the issue will recur, so this becomes an ongoing maintenance task.
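For reference, the rolling restart is kicked off from the cluster captain:
# Identify the current captain and check member state
splunk show shcluster-status
# From the captain, restart all members one at a time
splunk rolling-restart shcluster-members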
This issue is related to new functionality in Splunk 6.3. Pre-6.3, Splunk calculated quotas independently on each search head. In 6.3, this changed to calculating cluster-wide quotas. The new behavior makes sense, but doesn't seem to work correctly in practice. You can restore Splunk to its pre-6.3 behavior by adding the following to limits.conf:
[scheduler]
shc_role_quota_enforcement = false
shc_local_quota_check = true
A way to prevent the bug from occurring is to remove all role-based concurrent search quotas. Note that this leaves your users free to run concurrent searches up to the server-based restrictions in limits.conf.
Since we weren't certain how imported roles interact here, we explicitly set the quotas to zero for all roles, including the built-in ones ('default', 'user', 'power', and 'admin').
Example authorize.conf stanza:
[role_admin]
srchJobsQuota = 0
rtSrchJobsQuota = 0
cumulativeSrchJobsQuota = 0
cumulativeRTSrchJobsQuota = 0
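Both workarounds (the limits.conf settings and the authorize.conf quotas) need to reach every member. One way to do that is to push them from the deployer as part of an app; the app name and member URI below are placeholders:
# On the deployer, place the settings in an app under shcluster:
#   $SPLUNK_HOME/etc/shcluster/apps/shc_quota_workaround/local/limits.conf
#   $SPLUNK_HOME/etc/shcluster/apps/shc_quota_workaround/local/authorize.conf
# Then push the bundle out (any cluster member can be the target):
splunk apply shcluster-bundle -target https://shc-member1.example.com:8089 -auth admin:changeme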
SPL-109514 - Number of concurrent searches increases in an idle SHC member
This JIRA/defect has been addressed in the 6.3.3 maintenance update. I would recommend upgrading to the latest maintenance update for 6.3.
I don't see SPL-109514 referenced in any of the known issues or release notes. In fact, I can't find any reference to this issue at all. Could we please confirm which release addressed this issue?
I can confirm the fix for SPL-109514 shipped in 6.3.3.
SPL-122983 is a docs enhancement request to update the release notes to properly reflect this.
Awesome. Thanks for a rapid response!
The change went live in the 6.3.3 - 6.3.5 docs:
http://docs.splunk.com/Documentation/Splunk/6.3.3/ReleaseNotes/KnownIssues
We had the exact same issue, which was not fixed even after upgrading to 6.3.5. We finally had to go for the workaround suggested here.
Thank you for the information
There appear to be some bugs introduced in 6.3 in this area:
https://answers.splunk.com/answers/337598/search-head-cluster-pre-63-we-could-run-more-numbe-2.html
This is so helpful. We recently installed ITSI, which caused us to go over the new quota limits.
| rest /servicesNS/-/-/saved/searches
| search disabled=0 is_scheduled=1
| table title cron_schedule next_scheduled_time
| convert mktime(next_scheduled_time) AS next_scheduled_time_epoch timeformat="%F %T %Z"
| where now()>next_scheduled_time_epoch
We used this to detect any searches that have gone stale
Thank you so much for this. It's been giving me some late nights troubleshooting this.
2 things:
According to Splunk Support this is fixed, but is it in 6.3.1, 6.3.2, or 6.3.3? Can anyone confirm?
This is still a bug in 6.3.3. I've had to apply the workaround to fix this issue after not restarting the search head cluster for a week. No reference to this in the known issues however, nor any response from Splunk support.
http://docs.splunk.com/Documentation/Splunk/6.3.3/ReleaseNotes/KnownIssues
Regarding #2: I run the detection search on the Cluster Deployer, which is not a part of the cluster, and thus unaffected by the issue.
Thank you for that update and additional workaround - at first glance, it seems to do the trick. Apparently I was a little too eager to get to SH Clustering for Windows. Probably should have waited for a few point releases.