We have the following query that ran for one and a half hours last week and consumed a lot of CPU. How can we optimize it?
index=<index name> Error*
| eval ERROR_TYPE = if(searchmatch("SQLException"), "SQL Exceptions",
(if(searchmatch("JwtTokenUtil"), "JWTToken Error",
(if(searchmatch("JwtAuthenticationTokenFilter"), "Authentication error",
(if(searchmatch("java.lang.OutOfMemoryError: Java heap space"), "Out of memory error",
(if(searchmatch("sendEmail"), "Send Mail error",
(if(searchmatch("message was not delivered"), "IMAP error",
(if(searchmatch("sever not reponding"), "Server error",
(if(searchmatch("Timed Out"),"Timed Out Error", "OTHER")))))))))))))))
| timechart count by ERROR_TYPE
Hi,
after one year I think it's time to give some feedback 🙂
Kidding, I was searching for something else and found this article by accident.
This information
event_count=12146884, result_count=7, available_count=0, scan_count=2942980670
in the audit.log output will tell you that this search scanned almost 3 billion events on disk and just returned/used 12 million. So, 99.6% of the data read from disk was thrown away... sounds a little bit like boiling the ocean.
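To double-check that ratio with a quick back-of-the-envelope SPL snippet (a sketch; `makeresults` only generates a single dummy event so the `eval` has something to run on):

```
| makeresults
| eval used_pct   = round(100 * 12146884 / 2942980670, 2)
| eval wasted_pct = round(100 - used_pct, 2)
| table used_pct wasted_pct
```

used_pct comes out around 0.41 and wasted_pct around 99.59, which is where the "99.6% thrown away" figure comes from.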
Looking at all the "searchmatch" patterns, you might put the terms (e.g. "JwtTokenUtil" or "sendEmail") before the first pipe using "() OR ()", because this leverages the TSIDX index (NOT the compressed raw events) and is 1000x faster than the full event scan you have done.
Another thought: you seem to mix a couple of sourcetypes in one index, which is fine. Adding "sourcetype=" to the OR clause might also speed up the search, because Splunk Enterprise will eliminate all events that do not match the sourcetypes you want to see.
A shortened example:
index=<index name> Error* ( (sourcetype=<st1> JwtTokenUtil) OR (sourcetype=<st2> SQLException) ... )
| eval ERROR_TYPE = case(searchmatch("SQLException"), "SQL Exceptions",
searchmatch("JwtTokenUtil"), "JWTToken Error",
searchmatch("JwtAuthenticationTokenFilter"), "Authentication error",
searchmatch("java.lang.OutOfMemoryError: Java heap space"), "Out of memory error",
searchmatch("sendEmail"), "Send Mail error",
searchmatch("message was not delivered"), "IMAP error",
searchmatch("sever not reponding"), "Server error",
searchmatch("Timed Out"),"Timed Out Error",
true(),"OTHER")
| timechart count by ERROR_TYPE
This will only return events matching "Error*" and the two tokens I defined (with the right sourcetype).
This might save you from reading 99.6% of the events from disk, decompressing them, and doing the field extraction for every single event...
Depending on the structure of your data, you might even use the "append" command to split the big search into multiple shorter ones and avoid running "searchmatch" over all your 12 million results.
index=<index name> Error* (sourcetype=<st1> JwtTokenUtil) | eval ERROR_TYPE="JWTToken Error" | fields _time ERROR_TYPE |
append [ index=<index name> Error* sourcetype=<st2> SQLException | eval ERROR_TYPE="SQL Exceptions" | fields _time ERROR_TYPE ] |
append [] ... | timechart count by ERROR_TYPE
Try it.
You need to fix typos and maybe some syntax errors because I have written this by heart without testing in the "Search&Reporting" App.
Hope this all makes sense.
Feel free to ask questions.
Greetings,
Holger
About the time range. I see the following in _audit -
Audit:[timestamp=07-10-2020 11:35:16.459, user= <user>, action=search, info=completed, search_id=' <user>__ <user>_YmJoX3ByaXZhdGViYW5raW5nX3Vp__search18_1594389297.3145638', total_run_time=5997.91, event_count=12146884, result_count=7, available_count=0, scan_count=2942980670, drop_count=0, exec_time=1594389297, api_et=N/A, api_lt=N/A, search_et=N/A, search_lt=N/A, is_realtime=0, savedsearch_name="search18", search_startup_time="1292", searched_buckets=630, eliminated_buckets=0, considered_events=2942980670, total_slices=3512873, decompressed_slices=3615466, duration.command.search.index=1146121, invocations.command.search.index.bucketcache.hit=0, duration.command.search.index.bucketcache.hit=0, invocations.command.search.index.bucketcache.miss=0, duration.command.search.index.bucketcache.miss=0, invocations.command.search.index.bucketcache.error=0, duration.command.search.rawdata=8063766, invocations.command.search.rawdata.bucketcache.hit=0, duration.command.search.rawdata.bucketcache.hit=0, invocations.command.search.rawdata.bucketcache.miss=0, duration.command.search.rawdata.bucketcache.miss=0, invocations.command.search.rawdata.bucketcache.error=0, sourcetype_count__<index name>=12146884, roles=<roles>', search='search index=<index name> Error* | eval ERROR_TYPE = if(searchmatch("SQLException"), "SQL Exceptions",
(if(searchmatch("JwtTokenUtil"), "JWTToken Error",
(if(searchmatch("JwtAuthenticationTokenFilter"), "Authentication error",
(if(searchmatch("java.lang.OutOfMemoryError: Java heap space"), "Out of memory error",
(if(searchmatch("sendEmail"), "Send Mail error",
(if(searchmatch(" message was not delivered"), "IMAP error",
(if(searchmatch("sever not reponding"), "Server error",
(if(searchmatch("Timed Out"),"Timed Out Error", "OTHER"))))))))))))))) | stats count by ERROR_TYPE'][n/a]
Which field specifies the duration?
The two queries that ended up consuming a lot of CPU used the chart and timechart commands. Are these two prone to using a lot of CPU in certain cases?
I believe the time window doesn't appear in _audit unless it is included in the body of the search, but the fields are search_et and search_lt.
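If you want to pull those timing fields out of _audit yourself, something like the following sketch should work (untested; the field names are taken from the audit event above):

```
index=_audit action=search info=completed
| table _time user savedsearch_name total_run_time event_count scan_count search_et search_lt
| sort - total_run_time
```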
CPU use is secondary to the time taken by the search. You should optimize for that and you should see CPU use go down.
The timechart and chart commands can use a bit of CPU, but not as much as other commands. None of the commands in the query are very CPU-intensive so it really comes down to how much data is being processed.
Another consideration is how well balanced the data is among your indexers. An uneven distribution means one indexer is working harder than the others and is slowing down the search.
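A quick way to check that distribution is to count events per indexer with tstats; splunk_server names the indexer that holds each event (a sketch, untested):

```
| tstats count where index=<index name> by splunk_server
| eventstats sum(count) as total
| eval pct = round(100 * count / total, 1)
```

If the percentages are very uneven across indexers, the slowest indexer sets the pace for the whole search.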
If the query really needs to process a lot of data then there are other ways to speed things up, including data model acceleration and summary indexing, but we should save those for later.
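As a rough illustration of the summary-indexing idea (a sketch, not tested; the summary index name "summary" and the hourly span are assumptions): schedule a search that pre-aggregates the counts and writes them to a summary index with collect, so the dashboard reads the small summary instead of re-scanning raw events.

```
index=<index name> Error*
| eval ERROR_TYPE = case(searchmatch("SQLException"), "SQL Exceptions",
                         searchmatch("JwtTokenUtil"), "JWTToken Error",
                         true(), "OTHER")
| timechart span=1h count by ERROR_TYPE
| collect index=summary
```

The reporting search then becomes a cheap query over pre-aggregated rows in index=summary.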
About -
Another consideration is how well balanced the data is among your indexers. An uneven distribution means one indexer is working harder than the others and is slowing down the search
So, it seems that for this particular index, the primary buckets are available on only four out of eight indexers, and Support suggested rebalancing the data.
To be clear: do queries run only against the primary buckets?
total_run_time=5997.91
It's taking a lot of time. And
available_count=0, scan_count=2942980670
No result....
There's something wrong with the time picker setting.
If the search job hasn't expired yet (it probably has), you could look at it in the Job Inspector.
r. Ismo
index=<index name> Error* sourcetype=<your sourcetype>
| eval ERROR_TYPE = case(searchmatch("SQLException"), "SQL Exceptions",
searchmatch("JwtTokenUtil"), "JWTToken Error",
searchmatch("JwtAuthenticationTokenFilter"), "Authentication error",
searchmatch("java.lang.OutOfMemoryError: Java heap space"), "Out of memory error",
searchmatch("sendEmail"), "Send Mail error",
searchmatch("message was not delivered"), "IMAP error",
searchmatch("sever not reponding"), "Server error",
searchmatch("Timed Out"),"Timed Out Error",
true(),"OTHER")
| timechart count by ERROR_TYPE
Try case(), select an appropriate time range (time picker), and restrict the sourcetype.