SRE ERROR counts by percentage

luckyman80
Path Finder

Hi All,

As part of one of my SRE objectives, I am trying to find the following in Splunk:

The highest (max) count of ERRORs within a given time period (1 hr / 24 hr / 144 hr) compared to the monthly 99th percentile.

I was starting off with baby steps, assuming the count only needs to look at events containing 'ERROR':

index=myIndex ERROR
source="/test.log"
| timechart count by status
| addtotals
| addtotals fieldname=ERROR
| eval ErrorRate=round(Errors/Total*100,2)
| fields _time 5* ErrorRate

But that doesn't seem to even work. Help would be really appreciated.

Thanks in advance, team!

 


richgalloway
SplunkTrust

"Doesn't seem to even work" is not a good problem description.  Please describe the expected and actual results.

To debug a query, run it incrementally, starting with the base search and adding one pipe at a time until it fails.  Then you'll know which command to focus on.

Are you sure the source file is /test.log?  It's unusual to see files in the root directory in Splunk.

The timechart command is grouping results by status, but status is not mentioned in the stated objective.

The timechart command returns only _time and one count column per status value.  The first addtotals adds a Total field and the second adds a field named ERROR, but nothing creates an Errors field, so the eval has nothing to divide and ErrorRate comes out empty.
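
One way to get an error rate is to create both the numerator and the denominator before eval runs. The following is a minimal sketch, not a tested fix: it assumes the base search returns all events from the file (not only those containing ERROR), and `is_error` is a hypothetical helper field introduced for illustration.

```spl
index=myIndex source="/test.log"
| eval is_error=if(searchmatch("ERROR"), 1, 0)
| timechart span=1h count as Total, sum(is_error) as Errors
| eval ErrorRate=round(Errors/Total*100, 2)
| fields _time Total Errors ErrorRate
```

Hours with no events have a Total of 0, so ErrorRate is left empty for those rows.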

As for "baby steps", try this simple query to count errors.

index=myindex ERROR source="test.log"
| stats count

And this one to find the max count per hour.

index=myindex ERROR source="test.log"
| bin span=1h _time
| stats count by _time
| timechart span=1h max(count)
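
The stated objective also mentions a monthly 99th percentile. A possible sketch of that comparison, assuming a 30-day window stands in for "monthly" and that `max_hourly` and `p99_hourly` are illustrative field names:

```spl
index=myindex ERROR source="test.log" earliest=-30d
| bin span=1h _time
| stats count by _time
| stats max(count) as max_hourly, perc99(count) as p99_hourly
```

Note that stats only emits rows for hours that contain at least one matching event, so hours with zero errors do not enter the percentile.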
---
If this reply helps you, Karma would be appreciated.

luckyman80
Path Finder

Hi Richard, I ended up doing this, which worked:

 

index=myindex ERROR sourcetype=mysourcetype source="/tmp/test*.log"
| rex field=source "seema(?<instance>.*?)\/"
| bin span=1h _time
| stats count by _time instance
| timechart span=1h max(count) by instance

 

Now that I have the counts per hour, the last thing I want to do is count ERRORs for the whole month, work out the average, and add that as a tolerance level so I can see any hourly totals that sit above it. Is that possible?


richgalloway
SplunkTrust

This method works for me, but it may not be the best way to do it.

index=_internal ERROR 
| bin span=1h _time 
| stats count by component ,_time
| stats max(count) as max by component,_time
| append 
    [ search index=_internal ERROR earliest=-30d 
    | bin span=1h _time 
    | stats count by component ,_time
    | stats avg(count) as avg by component 
    | eval avg=round(avg,0) ]
| fillnull value=0 max
| stats values(_time) as _time,values(*) as * by component
| eval timeMax=mvzip(_time,max)
| mvexpand timeMax
| eval timeMax=split(timeMax,",")
| eval _time=mvindex(timeMax,0), max=mvindex(timeMax,1)
| where max > avg
| table _time component avg max
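
If the goal is only to list the hours whose count exceeds the 30-day average, a single search with eventstats may be simpler than appending a subsearch and zipping multivalue fields. This is a sketch rather than a tested replacement: it assumes one 30-day search is affordable, and the 24-hour reporting window is an illustrative choice.

```spl
index=_internal ERROR earliest=-30d
| bin span=1h _time
| stats count by component, _time
| eventstats avg(count) as avg by component
| eval avg=round(avg, 0)
| where count > avg AND _time >= relative_time(now(), "-24h")
| table _time component avg count
```

Here eventstats attaches each component's average to every hourly row, so the comparison happens row by row without any multivalue handling.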

luckyman80
Path Finder

Thanks, Richard. I added the sourcetype/log into the mix, but it just shows the 2 actual events from today and nothing in stats.


richgalloway
SplunkTrust

Nothing in stats may be normal if there are no instances where the error rate exceeds the average.

Please show or describe how you added sourcetype to the mix.

You may need to debug the query to figure out why there are no results.


luckyman80
Path Finder

Hey,

Maybe I wasn't clear in my previous reply. When I run the following, it works fine on a column chart:

index=myindex ERROR sourcetype=mysourcetype source="/tmp/test*.log"
| rex field=source "seema(?<instance>.*?)\/"
| bin span=1h _time
| stats count by _time instance
| timechart span=1h max(count) by instance

 

What I was trying to do was keep the above but add an overlay on the same chart showing the average tolerance level, so I can see visually whether any of the hourly levels touch it or go above it.


richgalloway
SplunkTrust

Yeah, that ask was not clear.

See if this answer helps: https://community.splunk.com/t5/Splunk-Search/How-to-overlay-a-straight-line-showing-the-average-tim...


luckyman80
Path Finder

Hi Richard, I tried this, which looks right:

earliest=-1h index=myindex ERROR sourcetype=mysourcetype source="mysource"
| timechart span=5m max(count) by sourcetype
| appendcols [ search earliest=-30d index=myindex ERROR sourcetype=mysourcetype source="mysource"| stats avg(count) AS 30d_average]

What I was hoping for is the max count of ERRORs in 1 hour, with a chart overlay for the 30-day average (added as per the other post).

The count of ERRORs was 5 for the last hour, and over a month it is around 20 per day, so it's strange that the timechart isn't showing it.


richgalloway
SplunkTrust

The query is looking for the maximum and average values of the count field, but does that field exist?  I suspect it does not.  Look at my earlier replies to see how to get the count field.

Also, the query mixes apples and oranges.  The first part gets counts by sourcetype, but the appendcols command gets a 30-day average of all sourcetypes.  That makes it unlikely any 5-minute count will come close to the average.
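
One possible way to address both points is to count each hour first, so that a count field exists, and then let eventstats attach the 30-day average of those same hourly counts to every row. This is a sketch under assumptions, not a confirmed solution: `avg_30d` is a hypothetical field name, and trimming the display to the last 24 hours is an illustrative choice.

```spl
index=myindex ERROR sourcetype=mysourcetype source="mysource" earliest=-30d
| timechart span=1h count
| eventstats avg(count) as avg_30d
| eval avg_30d=round(avg_30d, 2)
| where _time >= relative_time(now(), "-24h")
```

On a column chart, avg_30d can then be selected as a chart overlay in the format options. Because timechart emits a zero for hours with no errors, those hours pull the average down.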


luckyman80
Path Finder

Thanks, Richard, and apologies for being cryptic; that works fine. So is it not possible to take all error counts over a month as an average and overlay that as a tolerance level on the charts, to show how high the 1-hour values sit?


richgalloway
SplunkTrust

I'm not saying it's not possible.  I just wanted to get your "baby steps" pointed in the right direction.
