Hi All,
As part of one of my SRE objectives, I am trying to find the following in Splunk:
The highest (max) count of ERRORs within a given time period (1 hr / 24 hr / 144 hr) compared to the monthly 99th percentile.
I started off with baby steps, assuming the count should simply zoom in on anything containing 'ERROR':
index=myIndex ERROR
source="/test.log"
| timechart count by status
| addtotals
| addtotals fieldname=ERROR
| eval ErrorRate=round(Errors/Total*100,2)
| fields _time 5* ErrorRate
But that doesn't even seem to work. Help would be really appreciated.
Thanks in advance, team!
"Doesn't seem to even work" is not a good problem description. Please describe the expected and actual results.
To debug a query, run it incrementally, starting with the base search and adding one pipe at a time until it fails. Then you'll know which command to focus on.
Are you sure the source file is /test.log? It's unusual to see files in the root directory in Splunk.
The timechart command is grouping results by status, but status is not mentioned in the stated objective.
The timechart command returns only _time plus one column per status value. The first addtotals adds a Total field, but the second one writes its sum to a field named ERROR, so there is no Errors field for eval to use and ErrorRate comes out empty.
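One way to get an error rate from fields that actually exist is to have timechart compute both numbers itself. This is only a sketch and has not been run against your data. It assumes status is a numeric HTTP-style code where 5xx means an error (the 5* in your fields command hints at that), and it drops the ERROR keyword from the base search so that Total counts every event rather than only the errors.

```
index=myindex source="/test.log"
| timechart span=1h count as Total count(eval(status>=500)) as Errors
| eval ErrorRate=round(Errors/Total*100,2)
```

If status means something else in your logs, the condition inside count(eval(...)) is the part to change.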
As for "baby steps", try this simple query to count errors.
index=myindex ERROR source="test.log"
| stats count
And this one to find the max count per hour.
index=myindex ERROR source="test.log"
| bin span=1h _time
| stats count by _time
| timechart span=1h max(count)
Hi Richard, I ended up doing this which worked
index=myindex ERROR sourcetype=mysourcetype source="/tmp/test*.log"
| rex field=source "seema(?<instance>.*?)\/"
| bin span=1h _time
| stats count by _time instance
| timechart span=1h max(count) by instance
Now that I have the counts per hour, the last thing I want to do is get an ERROR count for the whole month, work out the average, and add that as a tolerance level so I can see any hourly totals that sit above it. Is that possible?
This method works for me, but it may not be the best way to do it.
index=_internal ERROR
| bin span=1h _time
| stats count by component ,_time
| stats max(count) as max by component,_time
| append
[ search index=_internal ERROR earliest=-30d
| bin span=1h _time
| stats count by component ,_time
| stats avg(count) as avg by component
| eval avg=round(avg,0) ]
| fillnull value=0 max
| stats values(_time) as _time,values(*) as * by component
| eval timeMax=mvzip(_time,max)
| mvexpand timeMax
| eval timeMax=split(timeMax,",")
| eval _time=mvindex(timeMax,0), max=mvindex(timeMax,1)
| where max > avg
| table _time component avg max
Thanks Richard. I added the sourcetype /log into the mix, but it just shows the 2 actual events from today and nothing in stats.
Nothing in stats may be normal if there are no instances where the error rate exceeds the average.
Please show or describe how you added sourcetype to the mix.
You may need to debug the query to figure out why there are no results.
Hey,
Maybe I wasn't clear in my previous reply. The following works fine on a column chart:
index=myindex ERROR sourcetype=mysourcetype source="/tmp/test*.log"
| rex field=source "seema(?<instance>.*?)\/"
| bin span=1h _time
| stats count by _time instance
| timechart span=1h max(count) by instance
What I was trying to do was keep the above but add an overlay on the same chart showing the average tolerance level, so I can see visually whether any of the hourly levels touch it or go above it.
Yeah, that ask was not clear.
See if this answer helps: https://community.splunk.com/t5/Splunk-Search/How-to-overlay-a-straight-line-showing-the-average-tim...
Hi Richard, I tried this, which looks right:
earliest=-1h index=myindex ERROR sourcetype=mysourcetype source="mysource"
| timechart span=5m max(count) by sourcetype
| appendcols [ search earliest=-30d index=myindex ERROR sourcetype=mysourcetype source="mysource"| stats avg(count) AS 30d_average]
What I was hoping for is the max count of ERRORs in 1 hour, with a chart overlay for the 30-day average (added as per the other post).
The hourly count of ERRORs was 5 for the last hour, and over a month it is around 20 per day, so it's strange the timechart isn't showing it.
The query is looking for the maximum and average values of the count field, but does that field exist? I suspect it does not. Look at my earlier replies to see how to get the count field.
Also, the query mixes apples and oranges. The first part gets counts by sourcetype, but the appendcols command gets a 30-day average of all sourcetypes. That makes it unlikely any 5-minute count will come close to the average.
Thanks Richard, apologies for being cryptic, and that works fine. So is it not possible to take all error counts over a month as an average and then overlay that as a tolerance level on the charts to show how high the 1-hour values sit?
I'm not saying it's not possible. I just wanted to get your "baby steps" pointed in the right direction.
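For what it's worth, here is a minimal sketch of one way the overlay could be built in a single search. It assumes the same placeholder index, sourcetype and source used earlier in the thread, and it has not been run against the original data. The search counts ERROR events per hour over 30 days, uses eventstats to attach the 30-day hourly average and 99th percentile to every row, and then keeps only the last 24 complete hours for charting. The field names avg_30d and p99_30d are made up for this example.

```
index=myindex ERROR sourcetype=mysourcetype source="mysource" earliest=-30d@h latest=@h
| timechart span=1h count
| eventstats avg(count) as avg_30d perc99(count) as p99_30d
| eval avg_30d=round(avg_30d,2)
| where _time >= relative_time(now(), "-24h@h")
```

Because timechart fills empty hours with 0, quiet hours are included in the average. The baseline and the hourly counts also come from the same search, so they are like for like. On a column chart, avg_30d and p99_30d can then be selected as chart overlay fields in the Format menu, where they show up as flat lines. This version charts a single series. Splitting by instance would need the baseline computed per instance as well.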