
Percent Change by Host and Error Type

Michael_Condon
Engager

We're ingesting Tomcat logs and looking for items tagged [SEVERE]. I'd like to be able to pull a report of error rate and look for errors which are occurring at a significantly higher-than-average rate *for their error type*. In addition, we're getting data streams from multiple hosts, each of which is really its own instance and has its own "native" error rate.

I need the average rate of occurrence of errors over the last week, and over the last day, grouped by host and error type. Then I need to flag any error whose rate has risen by... say, 500%.

So far the best I've come up with is this: 


index="tomcat" sourcetype="catalina.out_logs" SEVERE earliest=-7d latest=-1d
| rex field=_raw "\[SEVERE[\s\t]+\][\s\t]+(?<err>.+)[\n\r]"
| eval errhost = host + "::" + err
| bucket _time span=1h
| stats count as tcount by errhost
| stats avg(tcount) as wk_avg by errhost
| appendcols
    [search index="tomcat" sourcetype="catalina.out_logs" SEVERE earliest=-1d latest=now()
    | rex field=_raw "\[SEVERE[\s\t]+\][\s\t]+(?<err>.+)[\n\r]"
    | eval errhost = host + "::" + err
    | bucket _time span=1h
    | stats count as tcount by errhost
    | stats avg(tcount) as day_avg by errhost]
| eval perc= round(((day_avg - wk_avg) * 100 / wk_avg),2)
| fields + perc errhost
| search perc > 500.0
| search NOT "Unable to serve form"
| sort perc desc

 

So - I'm pulling SEVERE errors, extracting just the error text, concatenating that with the host to get my group-by string, bucketing in 1-hour increments to get an average, then building a chart with the 7-day average and the 1-day average for each host/error pair. Wondering if anyone else has a better way to do it?

Thanks!

1 Solution

bowesmana
SplunkTrust
SplunkTrust

Your existing search is somewhat flawed, in that you are bucketing by 1 hour but not using _time in the group by, so your second stats statement is redundant - there is only one tcount value for each errhost, not one per hour.
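
For example, keeping _time in the first stats and then averaging over it would look something like this for the 7-day leg (a sketch reusing the field names from your search):

index="tomcat" sourcetype="catalina.out_logs" SEVERE earliest=-7d latest=-1d
| rex field=_raw "\[SEVERE[\s\t]+\][\s\t]+(?<err>.+)[\n\r]"
| eval errhost = host + "::" + err
| bucket _time span=1h
| stats count as tcount by _time errhost
| stats avg(tcount) as wk_avg by errhost

The first stats now produces one tcount per hour per errhost, so the second stats has something meaningful to average.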

Also, if you use appendcols, the total number of rows generated by each search MUST match, and the rows must line up in the same order; otherwise the appended columns may not match their row. For example, if you have 10 errhosts in the last 7 days and 6 in the last day, then only the first 6 rows will have data appended, and you will not know which errhost the appended columns relate to.
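
If you did want to keep two time-range searches rather than one, a pattern that avoids that alignment problem is append followed by a stats by errhost, which re-joins the two result sets on the key instead of relying on row order - roughly this (again just a sketch with your field names, and with _time added to the group by as above):

index="tomcat" sourcetype="catalina.out_logs" SEVERE earliest=-7d latest=-1d
| rex field=_raw "\[SEVERE[\s\t]+\][\s\t]+(?<err>.+)[\n\r]"
| eval errhost = host + "::" + err
| bucket _time span=1h
| stats count as tcount by _time errhost
| stats avg(tcount) as wk_avg by errhost
| append
    [ search index="tomcat" sourcetype="catalina.out_logs" SEVERE earliest=-1d latest=now()
    | rex field=_raw "\[SEVERE[\s\t]+\][\s\t]+(?<err>.+)[\n\r]"
    | eval errhost = host + "::" + err
    | bucket _time span=1h
    | stats count as tcount by _time errhost
    | stats avg(tcount) as day_avg by errhost ]
| stats values(wk_avg) as wk_avg values(day_avg) as day_avg by errhost

Bear in mind that append runs its argument as a subsearch, so subsearch result limits apply; the single-search approach below avoids that entirely.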

I am guessing you want something more like this - i.e. a single search, with an evaluation of which time window (last 7 days or last day) each event sits in.

Have a look at this search - it generates random data, but will show you how to do that averaging.

 

| makeresults count=10000 
| eval _time=_time-random() % (86400 * 8 )
| eval err="err_".((random() % 2) + 100)
| eval host="host_".(random() % 2) 
| eval errhost = host + "::" + err 
| eval type=if(_time<relative_time(now(), "-d"), 7, 1)
| bucket _time span=1h 
| stats count as tcount by _time type errhost
| stats avg(tcount) as avg by type errhost
| stats avg(eval(if(type==1,avg,null()))) as day_avg avg(eval(if(type==7,avg,null()))) as wk_avg by errhost
| eval perc= round(((day_avg - wk_avg) / wk_avg * 100),2) 
| fields + perc errhost 
| where perc > 1.0

 

So, everything from the 'eval type=...' line onwards is what you need - that line sets type to 7 if the event falls in the last-7-days window, or 1 if it falls in the last day.

This will

  • bucket on 1 hour and count for each hour in each window for each errhost (stats 1)
  • work out the average hourly rate for each window/errhost (stats 2)
  • collapse the separated averages into a single row for each errhost (stats 3)
  • calculate the percentage change and evaluate against your threshold
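
Adapted to your actual data it would look something like this - just a sketch, reusing the index, sourcetype, rex, and 500% threshold from your original search:

index="tomcat" sourcetype="catalina.out_logs" SEVERE earliest=-7d latest=now()
| rex field=_raw "\[SEVERE[\s\t]+\][\s\t]+(?<err>.+)[\n\r]"
| eval errhost = host + "::" + err
| eval type=if(_time<relative_time(now(), "-d"), 7, 1)
| bucket _time span=1h
| stats count as tcount by _time type errhost
| stats avg(tcount) as avg by type errhost
| stats avg(eval(if(type==1,avg,null()))) as day_avg avg(eval(if(type==7,avg,null()))) as wk_avg by errhost
| eval perc=round(((day_avg - wk_avg) / wk_avg * 100),2)
| where perc > 500.0
| sort perc desc

Note the single time range of earliest=-7d latest=now() - the type field is what splits it back into the 7-day baseline and the last-day window.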

I wasn't clear what your final search filter (NOT "Unable to serve form") was for without knowing your data, but if it's there to exclude a particular error string from errhost, then I would suggest

 

| where perc > 1.0 AND !match(errhost, "(?i)Unable to serve form")

 

to do the filtering.

Hope this helps

 


Michael_Condon
Engager

That's much better!

Also, I didn't know that about appendcols. I'm coming to this from a SQL context, and it's been a bit of a learning curve. I appreciate the full explanation and reply.

-Mike
