Hello all, thanks for taking the time to read this post. I am writing today about an issue we seem to be having with one of our Splunk dashboards. It's really just 1 particular query within the dashboard...and it seems like it's due to the way in which the query is written. The query is taking on average 2 1/2 - 3 minutes to load, and utilizing between 150-200MB of memory on the search head instance.
index=* sourcetype="WinEventLog:Security" EventCode IN (4625, 4626, 4627, 4628, 4630, 4631, 4632, 4633) AND AccountName IN (admin account prefixes on network) | fillnull value=NULL | eval AccountName=mvindex(AcountName,1) | eval SecurityID=mvindex(SecurityID,1) | eval LoginType=case(LogonType=2, "Regular Logon", LogonType=3, "RPC (not RDP)", LogonType=4, "Batch", LogonType=5, "Service", LogonType=7, "Screen Unlock/SessionResume", LogonType=10, "Remote Desktop", LogonType=11, "Cached", LogonType=9, "New Credentials") | rename AccountName as "User" | stats count(SecurityID) as "Login Events" by AccountName, LoginType, host, _time | sort - _time
Here is another query we are using for standard accounts in the same dashboard...and it loads in less than 15 seconds, utilizing much less resources.
index=* sourcetype="WinEventLog:Security" NOT AccountName IN (admin account prefixes on network) NOT CallerProcessName="*process we want suppressed" EventCode IN (4625, 4626, 4627, 4628, 4630, 4631, 4632, 4633) | fillnull value=NULL | eval AccountName=mvindex(AcountName,1) | eval SecurityID=mvindex(SecurityID,1) | eval LoginType=case(LogonType=2, "Regular Logon", LogonType=3, "RPC (not RDP)", LogonType=4, "Batch", LogonType=5, "Service", LogonType=7, "Screen Unlock/SessionResume", LogonType=10, "Remote Desktop", LogonType=11, "Cached", LogonType=9, "New Credentials") | rename AccountName as "User" | stats count(SecurityID) as "Login Events" by AccountName, LoginType, host, _time | sort - _time
Again, the top query takes minutes to load and uses excessive resources, the bottom query takes seconds to load and doesn't use nearly as much resources. I guess I'm just curious if this is due to the nature of "NOT" statements in a query vice "AND"...or if my query isn't optimized. Maybe both? The queries are searching for the past 24 hours, and are set to 30min refresh intervals.
Have you checked the job inspector on the slow query? It will provided you a detailed summary of execution costs. Pay special attention to high numbers in the Duration and Invocation columns. Also compare your query to the "optimizedSearch" provided by Splunk in the job inspector, and check the logs for warnings or errors.
Also, at the risk of stating the obvious, using index=* is just never a good idea. If you have multiple indexes with similar names, limit it down as best you can, but you don't want to use a single wildcard.
Something else that I notice in both your queries that is likely less obvious, but important, is your use of the sort command.
This is not probably not well known to many, but using "sort - _time" as you have it, limits your search results to only 10,000. So, you are likely getting incomplete results.
This syntax will give you all results:
sort 0 - _time
Thank you so much for your answer. I was able to identify a few optimization issues using the job inspector, but they didn't seem to change much. I have watched the video Splunk has on using the job inspector and EPS to determine how well your search is performing...and it seems that mine is under-performing. For my search, in this example...it returned 56 results by scanning 1,467,895 events in 206.326 seconds. This puts my EPS at 7,114...which is much lower than the 10,000 minimum Splunk says it should be running at.
Also, you've stated that I shouldn't use index=*, but the reason I do that is because we only use 1 index for non Splunk system related items. I put the actual index in there explicitly and it did not change my job inspector data either for time to complete, or resources utilized. As for the sorting, I rarely have more than 75 events per search over a 24 hour period on these particular searches that are populating my dashboard, so I'm not too worried about that. But I'll put it in there just in case. Thanks!
Any other ideas?
Returning 56 results by scanning 1,467,895 events confirms what you already know, the query is not optimal.
There are many ways you can tackle this.
My personal approach would be to simply the query down to the bare minimum, then slowly add pieces back in and gauge the performance impact.
A quicker option would be to create a datamodel using the fields you're searching, accelerate it for the time range you typically search, and run your queries against that. The performance increase will be staggering.
Thanks! I actually started from scratch yesterday and then was adding things back and it didn't seem to help much. The only real way to get it quicker is to stop searching the last 24 hours. And if we were using this as a real-time dashboard, that would be easy...because we could just do 1 hour or something like that. But it's not a real-time dashboard...or at least it's not really necessary for it to be.
So I started from scratch. Removed all the "extra" fluff and just let it spit the data out however it wanted for that event ID, and got the time down to about 130 seconds for a 24 hour search. I think that's about the best I'm going to get for that time-frame...and honestly at this point, we're probably going to end up porting this to a daily report instead of an active dashboard so it just runs once early in the morning and gives us the report for the previous 24 hours.
Unfortunately we don't get to decide the resources we are allocated or features we are allowed to use, so we have to make due with what we're provided...and there are a lot of other people who also need their dashboards and searches to work, so I'll just take the back burner on this one.
I appreciate all the help! I'll mark your response as the answer.