I'm familiar with some of the system-wide limits and per-user quotas that prevent a Splunk instance from getting oversubscribed. Namely, from limits.conf.spec:
base_max_searches = <int>
* a constant to add to the maximum number of searches, computed as a multiplier of the CPUs
* Defaults to 4
max_searches_per_cpu = <int>
* the maximum number of concurrent historical searches per CPU. The system-wide limit of
  historical searches is computed as:
  max_hist_searches = max_searches_per_cpu x number_of_cpus + base_max_searches
* Note: the maximum number of real-time searches is computed as: max_rt_searches = max_rt_search_multiplier x max_hist_searches
* Defaults to 4
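If I'm reading that formula right, an 8-core machine with the defaults above would allow max_hist_searches = 4 x 8 + 4 = 36 concurrent historical searches, while a 4-core machine would allow only 4 x 4 + 4 = 20.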
And from authorize.conf.spec:
srchDiskQuota = <number>
* Maximum amount of disk space (MB) that can be taken by search jobs of a user that belongs to this role
srchJobsQuota = <number>
* Maximum number of concurrently running historical searches a member of this role can have (excludes real-time searches, see rtSrchJobsQuota)
rtSrchJobsQuota = <number>
* Maximum number of concurrently running real-time searches a member of this role can have
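To make that concrete, I picture setting these per role in authorize.conf with something like the following (the role name and numbers are purely illustrative, not what I actually run):

[role_power_user]
srchJobsQuota = 10
rtSrchJobsQuota = 4
srchDiskQuota = 500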
My question is, how does this work in a distributed environment?
Imagine I have two search heads with users load-balanced between them. Both search heads distribute searches to the same farm of 10 indexers, mostly 8-core boxes, though a couple have only 4 cores.
What do I have to consider so that my indexers don't get over-subscribed?
Do the system-wide limits configured locally on the indexers override the limits set on the search heads?
Perhaps more importantly, how can I proactively monitor performance and detect when a user has hit their quota, or when the system has too many concurrent jobs?
What are the log messages to look for?
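As a starting point I was imagining watching the audit events with something like the search below (assuming index=_audit with action=search and info=granted records each dispatched job the way I think it does), but I don't know whether that is the right place to catch quota or concurrency problems:

index=_audit action=search info=granted | timechart span=5m count by user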
Can I prioritize a saved search to run before any of the other searches queued in the system?