When I do a health check, I get a warning that the skip ratio for scheduled searches is 96.4%. Upon further digging and checking Search Activity: Instance, it shows a skip ratio of 96.3%. I ran the search
index=_internal source=*scheduler.log | stats count by user, app, savedsearch_name, status
and the results showed a high number of skipped scheduled searches from the Cisco Security Suite app. I changed [scheduler] max_searches_per_cpu in the limits.conf file to 35. Is there anything else that I can do?
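For context, that change would look roughly like the following sketch in $SPLUNK_HOME/etc/system/local/limits.conf. (One caveat: as I read limits.conf.spec, max_searches_per_cpu actually lives under the [search] stanza, while [scheduler] holds the percentage settings, so the stanza name above may be off.)

```
[search]
# Concurrent searches allowed per physical CPU core (default: 1).
# Raising this trades CPU headroom for fewer skipped searches.
max_searches_per_cpu = 35
```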
CPU: 16 Physical 32 Virtual
Memory: 262 GB
I believe the bottom line is that the resources need to be there on the indexer so that there are minimal to no skipped searches. The ideal Splunk configuration is a dedicated indexer and a dedicated search head: the search head does what it does, and with enough resources there would be no skipped searches, because the load is kept separate. If you run an indexer/search head on one server, it needs a lot of resources, since the concurrency limit is searches per physical core, not virtual; the more cores and memory you have, the lower the probability of skipped searches. If that is not possible and there is no way around the physical CPU and memory limits, then what I have found is that you can modify the limits.conf file. It is not something that Splunk support recommends, but it helps. Here are the settings I have changed:
max_searches_perc = 50 (default)
max_searches_per_cpu = 1 (default)
I changed max_searches_perc to 60 and max_searches_per_cpu to 10 to see if the skip ratio would go back to 0.00%. When it did, I slowly lowered them until I found a good point where only a small percentage, or none at all, were skipped. I also changed max_searches_perc back down to 50 at that point and watched it. With all the apps that I have and what I need Splunk to do, max_searches_per_cpu is now at 5, and I am good with that. I may still have to get another server for a dedicated search head.
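For reference, the ceiling these settings feed into is, as I understand limits.conf.spec, max_searches_per_cpu x physical_cores + base_max_searches (base_max_searches defaults to 6), and the scheduler may use max_searches_perc percent of that. So on the 16 physical cores above, the defaults give 1 x 16 + 6 = 22 concurrent searches, only 11 of which the scheduler may use. The values I settled on would look like this sketch in $SPLUNK_HOME/etc/system/local/limits.conf:

```
[search]
# Concurrent searches allowed per physical core (default: 1)
max_searches_per_cpu = 5

[scheduler]
# Share of total search concurrency the scheduler may use (default: 50)
max_searches_perc = 50
```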
I hope this helps.
ITSI is a different beast. It's the red-headed stepchild of Splunk Enterprise, and it doesn't have great documentation like Enterprise Security does. This is likely a bug in my case, because the real-time notable event search barely threads the needle on my search head, but the indexer was going nuts: skipped searches everywhere. And the worst part was we didn't need the indexer doing that function, as the search head was doing it fine and wasn't under any stress. I just feel that particular item is a bug. Notable events are something a person should consciously opt into, given the resource strain they cause on your environment. I don't use notable events at all currently.
I'm getting a 100% skip on something that I don't even know that I need, "ITSI event grouping", and this is on the indexer. I couldn't care less, as the indexer's job isn't to do notable events.
It's very frustrating trying to find good documentation around these skip ratios and why they are set this way out of the box with Splunk apps.
After running the search you provided, here is what was returned:
reason: The maximum number of concurrent auto-summarization searches on this instance has been reached
reason: The maximum number of concurrent historical scheduled searches on this instance has been reached
The first one has a count of 173,834 over the last 24 hours, and the second, 27.
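For anyone else digging into this, a search along these lines, assuming the standard scheduler.log fields, breaks skipped searches down by reason:

```
index=_internal sourcetype=scheduler status=skipped earliest=-24h
| stats count by reason
| sort - count
```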
I updated [scheduler] max_searches_per_cpu to 25 and max_searches_perc from 50 to 60.
I would be very careful changing these settings in limits.conf and would talk to Splunk PS before doing so.
In any case, something seems very odd; I have installed that app many times on weaker indexers without any skipped-search issues. Is it a single Splunk instance? Do you run many real-time searches?
Do you have the Enterprise Security app installed as well?
You're right, my last instance of Splunk was on a server with fewer resources and I never noticed this. Since I use it to monitor the security of my network, there may be a lot of real-time searches going on that I don't see. This is also the first time I am using the health check, because it was added in 6.6 and I didn't notice it on 6.5.3. I'll check with Splunk PS to see if I messed something up by making those changes.