Hi All,
What I'm trying to do is build a chart with time on the x-axis and percentages by ResponseStatus on the y-axis.
To do that I came up with the Splunk search query below:
match some http requests
| fields _time,ResponseStatus,RequestName
| eval Date=strftime(_time, "%m/%d/%Y")
| eval ResponseStatus=if(isnull(ResponseStatus), 504, ResponseStatus)
| eventstats count as "totalCount" by Date
| eventstats count as "codeCount" by Date,ResponseStatus
| eval percent=round((codeCount/totalCount)*100)
| chart values(percent) by Date,ResponseStatus
But it is hitting the disk usage limit (500MB, which I can't increase) for a 10-day interval, and I'd like to be able to run this over a 3-4 month interval.
What I have noticed is that if I only run the match part of the query, I get all the events without hitting any disk limit, which makes me think the problem is with the counting and group-by part of the query.
My guess is that Splunk is making the computation by keeping in-memory (or, trying to do so and eventually swapping to disk) the full event message even if I specified the useful fields via the fields command.
Is there any way to either effectively have Splunk ignore all the remaining part of the message or obtain the same result via a different path?
Thanks a lot!
Reducing the use of eventstats is always good. Secondly, you can use table to reduce event (row) size; fields doesn't quite do that.
In your example, you can eliminate one of the eventstats commands like this:
match some http requests
| table _time,ResponseStatus,RequestName ``` fields does not reduce row size ```
| eval Date=strftime(_time, "%m/%d/%Y")
| eval ResponseStatus=if(isnull(ResponseStatus), 504, ResponseStatus)
| stats count as "codeCount" by Date,ResponseStatus
| eventstats sum(codeCount) as "totalCount" by Date
| eval percent=round((codeCount/totalCount)*100)
| chart values(percent) by Date,ResponseStatus
You might explicitly remove _raw early in the process so that you operate only on the set of fields you need without dragging more data around.
| fields - _raw
Furthermore, why do you render _time as a string? You might use bin/bucket to align the data to full days (or whatever time unit you need) without creating additional fields.
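For example, a sketch using the fields from your query (bucket is just an alias of bin, and the span value is whatever granularity you want):

| bin _time span=1d ``` align every event's _time to the start of its day ```
| stats count as codeCount by _time, ResponseStatus

That way _time stays a real timestamp, so timechart can consume it directly instead of a Date string.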
You might also rethink your eventstats. If you already calculate by date and response status, why calculate by date again? You might reuse the counts you already have and sum them over date.
Thank you! Your message made me realize I could write the query in another way. This is where I landed after the changes suggested by @yuanliu + your message:
match some stuff
| fields _time,ResponseStatus,RequestName
| fields - _raw
| bucket _time span=1d
| eval ResponseStatus=if(isnull(ResponseStatus), 504, ResponseStatus)
| eventstats count as "total" by _time
| stats count first(total) as "total" by _time, ResponseStatus
| eval percent=(count/total)*100
| timechart span=1d first(percent) by ResponseStatus
Most likely further improvements can be made here as well, but as long as it is not hitting the disk threshold I'm happy with it 🙂
It could be simplified even further:
<base_search>
| table _time ResponseStatus | fields - _raw
| bucket _time span=1d
| fillnull value=504 ResponseStatus
| top 100 ResponseStatus by _time showcount=f ``` top generates the percent field automatically; showcount=f drops the count column ```
| timechart limit=30 span=1d first(percent) by ResponseStatus
No. You don't want to do that. Firstly, removing _raw is not needed. More importantly, table is a transforming command and moves further processing to the search head, which kills parallelization.
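To illustrate, the same pipeline with fields instead of table keeps everything distributable (just a sketch built from the query above; only the second line changes):

<base_search>
| fields _time ResponseStatus ``` fields is a distributable streaming command, so the indexers keep doing the work ```
| bucket _time span=1d
| fillnull value=504 ResponseStatus
| top 100 ResponseStatus by _time showcount=f
| timechart limit=30 span=1d first(percent) by ResponseStatus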
You know, adding table and removing _raw was actually slightly slower in my testing and took up slightly more disk, but the difference was too small for me to be sure. You were the one who first recommended removing _raw, and someone else here recommended table.
| table _time ResponseStatus | fields - _raw
Thank you! Converting that eventstats into stats made it work. The use of table didn't seem to have an impact on performance overall.
Where did you find out fields does not reduce row size? This contradicts what we've been told over many years.
@richgalloway You and @isoutamo are correct. I didn't realize that events and fields are two separate spaces. fields allows events (_raw) to carry on, but search buffers are not burdened by them. As isoutamo points out, table may actually carry a higher performance penalty.