Solved: SRE ERROR counts by percentage

luckyman80 · 10-12-2021

Hi All,

As part of one of my SRE objectives, I am trying to find the following in Splunk:

The highest (max) count of ERRORs within a given time period (1 hr / 24 hr / 144 hr) compared to the monthly 99th percentile.

I was starting off with baby steps, assuming the count simply zooms in on anything 'ERROR'.

But that doesn't seem to even work. Help would be really appreciated.

Thanks in advance, team!

richgalloway · 10-12-2021

"Doesn't seem to even work" is not a good problem description. Please describe the expected and actual results.

To debug a query, run it incrementally, starting with the base search and adding one pipe at a time until it fails. Then you'll know which command to focus on.

Are you sure the source file is /test.log? It's unusual to see files in the root directory in Splunk.

The timechart command is grouping results by status, but status is not mentioned in the stated objective.

The timechart command returns only the fields specified. That means only count and status are available to later commands so there is no Errors or Total field for eval to use.
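
One way to keep both fields available for eval is to have timechart produce them itself. This is only a sketch: the original query is not shown, so the field names Errors and Total are taken from the description above, error_pct is a hypothetical name, and the index and source are the placeholders used elsewhere in this thread.

```spl
index=myindex source="test.log"
| timechart span=1h count AS Total, count(eval(searchmatch("ERROR"))) AS Errors
| eval error_pct=round(100 * Errors / Total, 2)
```

Because Total and Errors are both named in the timechart, they survive into the eval. Hours with no events have Total=0, so error_pct is left empty for them.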

As for "baby steps", try this simple query to count errors.

index=myindex ERROR source="test.log"
| stats count

And this one to find the max count per hour.

index=myindex ERROR source="test.log"
| bin span=1h _time
| stats count by _time
| timechart span=1h max(count)
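
Building on the hourly counts, the stated objective (peak hourly count in a recent window versus the monthly 99th percentile) might be sketched as follows. The 30-day lookback, the 24-hour window, and the field names p99_30d and peak_24h are assumptions for illustration, not part of the original question.

```spl
index=myindex ERROR source="test.log" earliest=-30d
| timechart span=1h count
| eventstats perc99(count) AS p99_30d
| where _time >= relative_time(now(), "-24h")
| stats max(count) AS peak_24h, first(p99_30d) AS p99_30d
```

Here timechart reports 0 for hours without errors, so quiet hours are included in the percentile. Changing "-24h" to "-1h" or "-144h" gives the other windows.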

---
If this reply helps you, Karma would be appreciated.

luckyman80 · 10-13-2021

Hi Richard, I ended up doing this, which worked:

index=myindex ERROR sourcetype=mysourcetype source="/tmp/test*.log"
| rex field=source "seema(?<instance>.*?)\/"
| bin span=1h _time
| stats count by _time instance
| timechart span=1h max(count) by instance

Now that I have the counts per hour, the last thing I want to do is count ERRORs for the whole month, work out the average, and add that as a tolerance level so I can see any hourly totals that sit above it. Is that possible?

richgalloway · 10-13-2021

This method works for me, but it may not be the best way to do it.

index=_internal ERROR 
| bin span=1h _time 
| stats count by component ,_time
| stats max(count) as max by component,_time
| append 
    [ search index=_internal ERROR earliest=-30d 
    | bin span=1h _time 
    | stats count by component ,_time
    | stats avg(count) as avg by component 
    | eval avg=round(avg,0) ]
| fillnull value=0 max
| stats values(_time) as _time,values(*) as * by component
| eval timeMax=mvzip(_time,max)
| mvexpand timeMax
| eval timeMax=split(timeMax,",")
| eval _time=mvindex(timeMax,0), max=mvindex(timeMax,1)
| where max > avg
| table _time component avg max
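
A possibly simpler variant, offered as a sketch rather than a replacement for the query above: compute the hourly counts once over the full 30 days, attach each component's average with eventstats, and keep only recent hours above it. The 24-hour window is an assumption. One reason to consider this form is that values() returns distinct values in lexicographical order, so pairing values(_time) with values(max) through mvzip may not always line up the original rows.

```spl
index=_internal ERROR earliest=-30d
| bin span=1h _time
| stats count by component, _time
| eventstats avg(count) AS avg by component
| eval avg=round(avg, 0)
| where _time >= relative_time(now(), "-24h") AND count > avg
| table _time component avg count
```

As in the query above, the average covers only hours in which a component logged at least one error.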

luckyman80 · 10-13-2021

Thanks Richard. I added the sourcetype /log into the mix, but it just shows the 2 actual events from today and nothing in stats.

richgalloway · 10-13-2021

Nothing in stats may be normal if there are no instances where the error rate exceeds the average.

Please show or describe how you added sourcetype to the mix.

You may need to debug the query to figure out why there are no results.

luckyman80 · 10-13-2021

Hey,

Maybe I wasn't clear in my previous reply. The following works fine on a column chart:

index=myindex ERROR sourcetype=mysourcetype source="/tmp/test*.log"
| rex field=source "seema(?<instance>.*?)\/"
| bin span=1h _time
| stats count by _time instance
| timechart span=1h max(count) by instance

What I was trying to do was keep the above but add an overlay on the same chart showing the average tolerance level, so I can see visually whether any of the hourly levels touch it or go above it.

richgalloway · 10-13-2021

Yeah, that ask was not clear.

See if this answer helps: https://community.splunk.com/t5/Splunk-Search/How-to-overlay-a-straight-line-showing-the-average-tim...

luckyman80 · 10-14-2021

Hi Richard, I tried this, which looks right:

earliest=-1h index=myindex ERROR sourcetype=mysourcetype source="mysource"
| timechart span=5m max(count) by sourcetype
| appendcols [ search earliest=-30d index=myindex ERROR sourcetype=mysourcetype source="mysource"| stats avg(count) AS 30d_average]

What I was hoping for is the max count of ERRORs in 1 hour, with a chart overlay for the 30-day average (added as per the other post).

The hourly count of ERRORs was 5 for the last hour, and over a month it is around 20 per day, so it's strange the timechart isn't showing it.

richgalloway · 10-14-2021

The query is looking for the maximum and average values of the count field, but does that field exist? I suspect it does not. Look at my earlier replies to see how to get the count field.

Also, the query mixes apples and oranges. The first part gets counts by sourcetype, but the appendcols command gets a 30-day average of all sourcetypes. That makes it unlikely any 5-minute count will come close to the average.
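
Putting both points together, a minimal sketch might look like this. It assumes a single series (no split by sourcetype), a 1-hour span, a 24-hour display window, and the hypothetical field names errors and avg_30d. The count is created by timechart itself, and the average is computed from that same hourly series so the two are comparable.

```spl
index=myindex ERROR sourcetype=mysourcetype source="mysource" earliest=-30d
| timechart span=1h count AS errors
| eventstats avg(errors) AS avg_30d
| eval avg_30d=round(avg_30d, 1)
| where _time >= relative_time(now(), "-24h")
```

In the chart's format options, avg_30d can then be selected as a chart overlay so it draws as a line across the hourly columns.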

luckyman80 · 10-12-2021

Thanks Richard, apologies for being cryptic, and that works fine. So is it not possible to take the average of all error counts over a month and overlay it as a tolerance level on the charts to show how high the 1-hour values sit?

richgalloway · 10-13-2021

I'm not saying it's not possible. I just wanted to get your "baby steps" pointed in the right direction.
