I am getting the following error from my Splunk health check:
"The number of extremely lagged searches (1) over the last hour exceeded the red threshold (1) on this Splunk instance"
Do you have any idea what I should do?
Okay, so what this is telling you is that you have one or more extremely slow ("lagged") searches.
You need to figure out WHAT those searches are, and WHY they are slow.
You can start by trying to figure out which jobs are taking up the most time:
|rest /services/search/jobs | sort 0 - performance_command_addinfo_duration_secs
Then you can start looking at the biggest time wasters, and seeing what might be making them slow. There are dozens of things we could look at, from the very simple to the very complex.
First, get rid of all real-time searches. They are almost never truly needed. Use near-real-time scheduled searches that run every minute or two instead, or use data models, or any number of other strategies that save CPU cycles.
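To see whether any real-time searches are currently running, you can ask the jobs endpoint. This is just a sketch; I'm assuming the isRealTimeSearch job property is populated in your version, so adjust the field list to what you actually see:

```
| rest /services/search/jobs splunk_server=local
| search isRealTimeSearch=1
| table sid author label dispatchState runDuration
```

Anything that shows up here repeatedly, from the same user or dashboard, is a candidate for conversion to a scheduled search.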
Second, make sure all saved searches and scheduled searches are using smart mode.
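One rough way to audit that is via the saved-searches endpoint. I'm assuming here that the display.page.search.mode attribute has been set on the searches you care about (if it's absent, the search usually falls back to the default mode), so treat this as a starting point rather than a definitive audit:

```
| rest /services/saved/searches splunk_server=local
| table title is_scheduled display.page.search.mode
| search display.page.search.mode=verbose
```

Anything listed in verbose mode is doing extra work it probably doesn't need to.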
Third, make sure dashboards aren't spamming your instance. They shouldn't recalculate very often, and if many people use the same dashboard, it should be based on loading the results of a periodic saved search rather than each user re-running the same search redundantly.
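For example, instead of every panel dispatching its own copy of a search, a panel can load the most recent results of a scheduled search with loadjob. The saved search name here is made up; substitute your own owner, app, and search name:

```
| loadjob savedsearch="admin:search:Daily Error Summary"
```

That way the expensive search runs once on its schedule, and every dashboard viewer just reads the cached results.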
Fourth, check the individual searches that take a long time and see whether they can be rewritten to waste fewer resources. Anything with a transaction, or more than one join, is probably a good candidate for a refactor. Take each kind of search that is really slow to run, and research here on Answers whether there is a better way to write it. If, after researching, you can't figure it out, post a single question about one problem search, and see what we can help you with.
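As a sketch of the kind of refactor I mean (the index, sourcetype, and field names here are all made up), a join like this:

```
index=web sourcetype=access
| join userid [ search index=auth sourcetype=login | fields userid role ]
```

can usually be rewritten as a single streaming search over both data sets:

```
(index=web sourcetype=access) OR (index=auth sourcetype=login)
| stats values(role) AS role count(eval(sourcetype="access")) AS hits BY userid
```

The stats version avoids the subsearch row limits and the memory cost of join, and it distributes across your indexers.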
The search above is slightly wrong. Try this:
| rest /services/search/jobs splunk_server=local | stats count avg(performance.command.addinfo.duration_secs) AS avg max(performance.command.addinfo.duration_secs) AS max BY search | sort 0 - max - avg