We have the following query that ran for one and a half hours last week and consumed a lot of CPU. How can we optimize it?
index=<index name> Error*
| eval ERROR_TYPE = if(searchmatch("SQLException"), "SQL Exceptions",
(if(searchmatch("JwtTokenUtil"), "JWTToken Error",
(if(searchmatch("JwtAuthenticationTokenFilter"), "Authentication error",
(if(searchmatch("java.lang.OutOfMemoryError: Java heap space"), "Out of memory error",
(if(searchmatch("sendEmail"), "Send Mail error",
(if(searchmatch("message was not delivered"), "IMAP error",
(if(searchmatch("sever not reponding"), "Server error",
(if(searchmatch("Timed Out"),"Timed Out Error", "OTHER")))))))))))))))
| timechart count by ERROR_TYPE
Hi,
after one year I think it's time to give some feedback 🙂
Kidding, I was searching for something else and found this article by accident.
This information
event_count=12146884, result_count=7, available_count=0, scan_count=2942980670
in the audit.log output will tell you that this search scanned almost 3 billion events on disk and just returned/used 12 million. So, 99.6% of the data read from disk was thrown away... sounds a little bit like boiling the ocean.
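To double-check that ratio with a quick back-of-the-envelope SPL snippet (a sketch; `makeresults` only generates a single dummy event so the `eval` has something to run on):

```
| makeresults
| eval used_pct   = round(100 * 12146884 / 2942980670, 2)
| eval wasted_pct = round(100 - used_pct, 2)
| table used_pct wasted_pct
```

used_pct comes out around 0.41 and wasted_pct around 99.59, which is where the "99.6% thrown away" figure comes from.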
Looking at all the "searchmatch" patterns, you might put the terms (e.g. "JwtTokenUtil" or "sendEmail") before the first pipe using "() OR ()", because this leverages the TSIDX index (NOT the compressed raw events) and is 1000x faster than the full event scan you have done.
Another thought: you seem to mix a couple of sourcetypes in one index, which is fine. Adding "sourcetype=" to the OR clause might also speed up the search, because Splunk Enterprise will eliminate all events that do not match the sourcetypes you want to see.
A shortened example:
index=<index name> Error* ( (sourcetype=<st1> JwtTokenUtil) OR (sourcetype=<st2> SQLException) ... )
| eval ERROR_TYPE = case(searchmatch("SQLException"), "SQL Exceptions",
searchmatch("JwtTokenUtil"), "JWTToken Error",
searchmatch("JwtAuthenticationTokenFilter"), "Authentication error",
searchmatch("java.lang.OutOfMemoryError: Java heap space"), "Out of memory error",
searchmatch("sendEmail"), "Send Mail error",
searchmatch("message was not delivered"), "IMAP error",
searchmatch("sever not reponding"), "Server error",
searchmatch("Timed Out"),"Timed Out Error",
true(),"OTHER")
| timechart count by ERROR_TYPE
This will only return events matching "Error*" and the two tokens I defined (with the right sourcetype).
This might save you from reading 99.6% of the events from disk, decompressing them, and doing the field extraction for every single event...
Depending on the structure of your data, you might even use the "append" command to split the big search into multiple shorter ones and avoid running "searchmatch" over all your 12 million results.
index=<index name> Error* (sourcetype=<st1> JwtTokenUtil) | eval ERROR_TYPE="JWTToken Error" | fields _time ERROR_TYPE |
append [ index=<index name> Error* sourcetype=<st2> SQLException | eval ERROR_TYPE="SQL Exceptions" | fields _time ERROR_TYPE ] |
append [] ... | timechart count by ERROR_TYPE
Try it.
You need to fix typos and maybe some syntax errors because I have written this by heart without testing in the "Search&Reporting" App.
Hope this all makes sense.
Feel free to ask questions.
Greetings,
Holger
About the time range. I see the following in _audit -
Audit:[timestamp=07-10-2020 11:35:16.459, user= <user>, action=search, info=completed, search_id=' <user>__ <user>_YmJoX3ByaXZhdGViYW5raW5nX3Vp__search18_1594389297.3145638', total_run_time=5997.91, event_count=12146884, result_count=7, available_count=0, scan_count=2942980670, drop_count=0, exec_time=1594389297, api_et=N/A, api_lt=N/A, search_et=N/A, search_lt=N/A, is_realtime=0, savedsearch_name="search18", search_startup_time="1292", searched_buckets=630, eliminated_buckets=0, considered_events=2942980670, total_slices=3512873, decompressed_slices=3615466, duration.command.search.index=1146121, invocations.command.search.index.bucketcache.hit=0, duration.command.search.index.bucketcache.hit=0, invocations.command.search.index.bucketcache.miss=0, duration.command.search.index.bucketcache.miss=0, invocations.command.search.index.bucketcache.error=0, duration.command.search.rawdata=8063766, invocations.command.search.rawdata.bucketcache.hit=0, duration.command.search.rawdata.bucketcache.hit=0, invocations.command.search.rawdata.bucketcache.miss=0, duration.command.search.rawdata.bucketcache.miss=0, invocations.command.search.rawdata.bucketcache.error=0, sourcetype_count__<index name>=12146884, roles=<roles>', search='search index=<index name> Error* | eval ERROR_TYPE = if(searchmatch("SQLException"), "SQL Exceptions",
(if(searchmatch("JwtTokenUtil"), "JWTToken Error",
(if(searchmatch("JwtAuthenticationTokenFilter"), "Authentication error",
(if(searchmatch("java.lang.OutOfMemoryError: Java heap space"), "Out of memory error",
(if(searchmatch("sendEmail"), "Send Mail error",
(if(searchmatch(" message was not delivered"), "IMAP error",
(if(searchmatch("sever not reponding"), "Server error",
(if(searchmatch("Timed Out"),"Timed Out Error", "OTHER"))))))))))))))) | stats count by ERROR_TYPE'][n/a]
Which field specifies the duration?
The two queries that ended up consuming a lot of CPU used the chart and timechart commands. Are these two prone to using a lot of CPU in certain cases?
I believe the time window doesn't appear in _audit unless it is included in the body of the search, but the fields are search_et and search_lt.
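If you want to pull those timing fields out of _audit yourself, something like the following sketch should work (untested; the field names are taken from the audit event above):

```
index=_audit action=search info=completed
| table _time user savedsearch_name total_run_time event_count scan_count search_et search_lt
| sort - total_run_time
```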
CPU use is secondary to the time taken by the search. You should optimize for that and you should see CPU use go down.
The timechart and chart commands can use a bit of CPU, but not as much as other commands. None of the commands in the query are very CPU-intensive so it really comes down to how much data is being processed.
Another consideration is how well balanced the data is among your indexers. An uneven distribution means one indexer is working harder than the others and is slowing down the search.
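A quick way to check that distribution is to count events per indexer with tstats; splunk_server names the indexer that holds each event (a sketch, untested):

```
| tstats count where index=<index name> by splunk_server
| eventstats sum(count) as total
| eval pct = round(100 * count / total, 1)
```

If the percentages are very uneven across indexers, the slowest indexer sets the pace for the whole search.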
If the query really needs to process a lot of data then there are other ways to speed things up, including data model acceleration and summary indexing, but we should save those for later.
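As a rough illustration of the summary-indexing idea (a sketch, not tested; the summary index name "summary" and the hourly span are assumptions): schedule a search that pre-aggregates the counts and writes them to a summary index with collect, so the dashboard reads the small summary instead of re-scanning raw events.

```
index=<index name> Error*
| eval ERROR_TYPE = case(searchmatch("SQLException"), "SQL Exceptions",
                         searchmatch("JwtTokenUtil"), "JWTToken Error",
                         true(), "OTHER")
| timechart span=1h count by ERROR_TYPE
| collect index=summary
```

The reporting search then becomes a cheap query over pre-aggregated rows in index=summary.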
About -
Another consideration is how well balanced the data is among your indexers. An uneven distribution means one indexer is working harder than the others and is slowing down the search
So, it seems that for this particular index, the primary buckets are available on only four out of eight indexers, and Support suggested rebalancing the data.
To be clear: do queries run only against the primary buckets?
total_run_time=5997.91
It's taking a lot of time. And
available_count=0, scan_count=2942980670
No result....
There's something wrong with the time picker setting.
If the search job hasn't expired yet (it probably has), you could look at it in the Job Inspector.
r. Ismo
index=<index name> Error* sourcetype=<your sourcetype>
| eval ERROR_TYPE = case(searchmatch("SQLException"), "SQL Exceptions",
searchmatch("JwtTokenUtil"), "JWTToken Error",
searchmatch("JwtAuthenticationTokenFilter"), "Authentication error",
searchmatch("java.lang.OutOfMemoryError: Java heap space"), "Out of memory error",
searchmatch("sendEmail"), "Send Mail error",
searchmatch("message was not delivered"), "IMAP error",
searchmatch("sever not reponding"), "Server error",
searchmatch("Timed Out"),"Timed Out Error",
true(),"OTHER")
| timechart count by ERROR_TYPE
Try case(), select an appropriate time range (time picker), and restrict the sourcetype.