Prior to 6.3, quotas were enforced on a member-by-member basis; starting with 6.3, the SHC enforces a system-wide quota. Here is an example of how the system-wide quota works.
Let’s assume that we have 2 members in the cluster set to run scheduled searches and the other 10 members set to ad hoc only. The captain takes into consideration only the number of members in the cluster that can run scheduled searches:
((24 * 3) + 6) * 0.85 * 2 = 132
i.e. ((cores * max_searches_per_cpu) + base_max_searches) * (max_searches_perc / 100) * (number of members that run scheduled searches)
In this case the following settings were used (a short shell sketch after this list reproduces the arithmetic):
24-core machines
12 members total; 2 of them are considered job_servers and run scheduled searches
1 of them is the captain
max_searches_per_cpu = 3
base_max_searches = 6
max_searches_perc = 85
auto_summary_perc = 85
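The following is a minimal shell sketch (not from the original write-up) of the quota arithmetic above; the variable names mirror the limits.conf settings, and integer truncation is assumed:

# Per-member limit: (cores * max_searches_per_cpu + base_max_searches),
# scaled down by max_searches_perc; the system-wide quota then multiplies
# by the number of members that run scheduled searches.
cpus=24; max_searches_per_cpu=3; base_max_searches=6
max_searches_perc=85; scheduler_members=2
per_member=$(( (cpus * max_searches_per_cpu + base_max_searches) * max_searches_perc / 100 ))
echo $(( per_member * scheduler_members ))    # prints 132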
When the system-wide quota is reached, the logs show messages like the following:
12-11-2015 11:20:40.790 -0500 WARN SavedSplunker - The maximum number of concurrent scheduled searches has been reached (limits: historical=132, realtime=132). historical=233, realtime=0 ready-to-run scheduled searches are pending.
host = iapp106.howard.ms.com-9000 source = /var/hostlinks/farm/splunk/prod/federated-shc-1/9000/home/var/log/splunk/scheduler.log sourcetype = scheduler
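A quick way to check for this condition (a hedged example; the log location assumes a default install and the scheduler log source shown above):

grep "The maximum number of concurrent scheduled searches has been reached" $SPLUNK_HOME/var/log/splunk/scheduler.log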
Since upgrading to Splunk version 6.3.1, we are seeing a lower number of concurrent scheduled searches.
Did something change in 6.3?
1) Bug SPL-111603: [Regression] No warn message for those skipped saved searches if hit the system-wide quota in SHC
This bug is marked as fixed in 6.4. If you would like to try to reproduce it, take the following steps:
1.1. Create an SHC with 3 members.
1.2. Set the search quota to 1 concurrent search in $SPLUNK_HOME/etc/system/local/limits.conf (with max_searches_per_cpu=0, the quota formula reduces to base_max_searches, i.e. 1):
[search]
base_max_searches=1
max_searches_per_cpu=0
1.3. Create 5 scheduled real-time searches that run every minute:
curl -k -u admin:changed https://localhost:8089/servicesNS/admin/search/saved/searches -d name=testRT_1 --data-urlencode search="index=_internal | head 30" -d is_scheduled=1 -d dispatch.earliest_time="rt-20m" -d dispatch.latest_time="rtnow" --data-urlencode cron_schedule="* * * * *"
curl -k -u admin:changed https://localhost:8089/servicesNS/admin/search/saved/searches -d name=testRT_2 --data-urlencode search="index=_internal | head 30" -d is_scheduled=1 -d dispatch.earliest_time="rt-20m" -d dispatch.latest_time="rtnow" --data-urlencode cron_schedule="* * * * *"
curl -k -u admin:changed https://localhost:8089/servicesNS/admin/search/saved/searches -d name=testRT_3 --data-urlencode search="index=_internal | head 30" -d is_scheduled=1 -d dispatch.earliest_time="rt-20m" -d dispatch.latest_time="rtnow" --data-urlencode cron_schedule="* * * * *"
curl -k -u admin:changed https://localhost:8089/servicesNS/admin/search/saved/searches -d name=testRT_4 --data-urlencode search="index=_internal | head 30" -d is_scheduled=1 -d dispatch.earliest_time="rt-20m" -d dispatch.latest_time="rtnow" --data-urlencode cron_schedule="* * * * *"
curl -k -u admin:changed https://localhost:8089/servicesNS/admin/search/saved/searches -d name=testRT_5 --data-urlencode search="index=_internal | head 30" -d is_scheduled=1 -d dispatch.earliest_time="rt-20m" -d dispatch.latest_time="rtnow" --data-urlencode cron_schedule="* * * * *"
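As an optional sanity check (a hypothetical step, reading back from the same REST endpoint the commands above POST to), fetch one of the saved searches by name:

curl -k -u admin:changed https://localhost:8089/servicesNS/admin/search/saved/searches/testRT_1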
1.4. Wait 2 minutes.
1.5. Check the log on each search head:
grep "system-wide" ~/splunk/var/log/splunk/splunkd.log
12-21-2015 10:09:05.896 +0800 WARN SHPMaster - Search not executed: The maximum number of real-time concurrent system-wide searches has been reached. current=1 maximum=1 for search: admin;search;testRT_1
12-21-2015 10:09:10.952 +0800 WARN SHPMaster - Search not executed: The maximum number of real-time concurrent system-wide searches has been reached. current=1 maximum=1 for search: admin;search;testRT_2
On builds affected by SPL-111603, these warning lines are missing even though the searches are skipped; seeing them indicates a build that has the fix.
2) SPL-111347: SHC: Search and scheduler performance degrade with increasing number of saved searches
This performance degradation was seen only when the number of saved searches was in the thousands.
Starting with 6.3, default behavior for handling user-based and role-based concurrent search quotas has changed. In 6.3, the search head cluster enforces these quotas across the set of cluster members. Prior to 6.3, quotas were enforced instead on a member-by-member basis.
Both pre- and post-6.3, the captain does not take the search user into account when it assigns a search to a member. Combined with the pre-6.3 behavior of member-enforced quotas, this could result in unwanted and unexpected behavior. For example, if the captain happened to assign most of a particular user's searches to one cluster member, that member could quickly reach the quota for that user, even though other members had not yet reached their limit for the user.
If you need to maintain the pre-6.3 behavior, make these attribute changes in limits.conf:
shc_role_quota_enforcement = false
shc_local_quota_check = true
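These attributes must be set on every cluster member, and a restart is generally needed for limits.conf changes to take effect (a hedged sketch, assuming a default install path):

# On each member, after editing limits.conf:
$SPLUNK_HOME/bin/splunk restart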
The new role-based quota on a cluster is calculated by multiplying the individual user's role quota by the number of cluster members.
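For example, with illustrative numbers (not from the case above): a role whose srchJobsQuota is 50 on a 3-member cluster gets a cluster-wide quota of:

50 * 3 = 150 concurrent historical searches for that role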
Release 6.3 also introduced a new bug, SPL-101954/SPL-110906: srchJobsQuota and rtSrchJobsQuota are not centralized across search head clusters.
It appears that this bug was introduced along with the changes for the new role-based SHC quota in 6.3. Due to this bug, in 6.3 and above (including 6.3.1) the system-wide quota will not be reached; instead, only the per-node quota is enforced when scheduling a job.
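To illustrate with the earlier example numbers (assuming the per-node quota follows the same per-member formula):

per-node quota actually enforced:       ((24 * 3) + 6) * 0.85 = 66 (truncated)
system-wide quota that should apply:    66 * 2 = 132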
Also, note that there are two kinds of quota messages that look almost identical: the SavedSplunker message shown earlier, and the SHPMaster message below.
12-08-2015 07:00:24.537 -0600 WARN SHPMaster - Search not executed: The maximum number of historical concurrent system-wide searches has been reached. current=204 maximum=78 for search: …
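To tell the two apart quickly (hedged commands, using the log locations that appear earlier in this write-up):

grep "concurrent scheduled searches" $SPLUNK_HOME/var/log/splunk/scheduler.log    # SavedSplunker: scheduler quota
grep "concurrent system-wide searches" $SPLUNK_HOME/var/log/splunk/splunkd.log    # SHPMaster: system-wide quota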
In addition, here are a few other bugs that are somewhat relevant in this situation:
SPL-111603:[Regression] No warn message for those skipped saved searches if hit the system-wide quota in SHC
SPL-111347: SHC: Search and scheduler performance degrade with increasing number of saved searches
How can we find more information on the two bugs noted above?
SPL-111603:[Regression] No warn message for those skipped saved searches if hit the system-wide quota in SHC
SPL-111347: SHC: Search and scheduler performance degrade with increasing number of saved searches