Solved: Percent Change by Host and Error Type

Michael_Condon · 10-11-2021

We're ingesting Tomcat logs and looking for items tagged [SEVERE]. I'd like to pull a report of error rates and find errors that are occurring at a significantly higher-than-average rate *for their error type*. In addition, we're getting data streams from multiple hosts, each of which is really its own instance with its own "native" error rate.

I need the average rate of occurrence of errors over the last week and over the last day, grouped by host and error type. Then I need to flag any error whose rate has risen by, say, 500%.

So far the best I've come up with is this:

So - I'm pulling SEVERE errors, extracting just the error text, CONCATing that to the host to get my group-by string, bucketing in 1-hour increments to get an average, then building a chart with the 7-day average and the 1-day average for each host/error pair. Wondering if anyone else has a better way to do it?

Thanks!

bowesmana · 10-11-2021

Your existing search is somewhat flawed: you are bucketing by 1 hour but not using _time in the group-by, so your second stats statement is redundant, as there is only one tcount value for each errhost, not one per hour.
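
To illustrate the point, here is a small sketch on generated data (the errhost value is made up for the demo). Without _time in the by clause, the first stats returns a single row per errhost, so the "average" is just the total count, 1000:

```spl
| makeresults count=1000
| eval _time = _time - (random() % 86400)
| eval errhost = "host_0::err_100"
| bin _time span=1h
| stats count as tcount by errhost
| stats avg(tcount) as avg by errhost
```

Adding _time to the first stats gives one row per hourly bucket, so the second stats returns an hourly average (around 40 for this data, since 1000 events are spread over roughly 24 hours):

```spl
| makeresults count=1000
| eval _time = _time - (random() % 86400)
| eval errhost = "host_0::err_100"
| bin _time span=1h
| stats count as tcount by _time errhost
| stats avg(tcount) as avg by errhost
```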

Also, if you use appendcols, both searches MUST return the same number of rows in the same order; otherwise the appended columns may not line up with the row they belong to. For example, if you have 10 errhosts in the last 7 days and 6 in the last day, then only the first 6 rows will have appended data, and you will not know which errhost the appended columns relate to.

I am guessing you want something like this: do it all in a single search, then evaluate each event based on which time window (last 7 days or last 1 day) it sits in.
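
A minimal sketch of that misalignment, using generated rows (the field names day_errhost and day_avg are invented for the demo). The outer search stands in for a 7-day result with three errhosts, and the subsearch for a 1-day result that only contains two of them:

```spl
| makeresults count=3
| streamstats count as n
| eval errhost = "host_" . n
| table errhost
| appendcols
    [| makeresults count=2
     | streamstats count as n
     | eval day_errhost = "host_" . (n + 1), day_avg = n * 10
     | table day_errhost day_avg]
```

Because appendcols pairs rows purely by position, the row for host_1 picks up the values that belong to host_2, host_2 picks up host_3's values, and host_3 gets nothing. In a real search you would not have the day_errhost column to reveal the mismatch.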

Have a look at this search - it generates random data, but will show you how to do that averaging.

| makeresults count=10000 
| eval _time=_time-random() % (86400 * 8 )
| eval err="err_".((random() % 2) + 100)
| eval host="host_".(random() % 2) 
| eval errhost = host + "::" + err 
| eval type=if(_time<relative_time(now(), "-d"), 7, 1)
| bucket _time span=1h 
| stats count as tcount by _time type errhost
| stats avg(tcount) as avg by type errhost
| stats avg(eval(if(type==1,avg,null()))) as day_avg avg(eval(if(type==7,avg,null()))) as wk_avg by errhost
| eval perc= round(((day_avg - wk_avg) / wk_avg * 100),2) 
| fields + perc errhost 
| where perc > 1.0

So everything from the 'eval type=...' line onward is what you need. That line sets type to 1 if the event falls within the last day (the last 24 hours), or to 7 if it is older than that.
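
As a side note, "-d" in relative_time is a rolling 24-hour offset with no snapping. If you would rather split the windows at midnight, a snapped modifier such as "-d@d" is one option. The following sketch just prints both cutoffs so you can compare them (the field names are only for display):

```spl
| makeresults
| eval rolling_cutoff = strftime(relative_time(now(), "-d"), "%Y-%m-%d %H:%M:%S")
| eval snapped_cutoff = strftime(relative_time(now(), "-d@d"), "%Y-%m-%d %H:%M:%S")
| table rolling_cutoff snapped_cutoff
```

Which cutoff is appropriate depends on whether "the last day" should mean the previous 24 hours or a calendar day.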

This will:

- bucket on 1 hour and count for each hour in each window for each errhost (stats 1)
- work out the average hourly rate for each window/errhost (stats 2)
- collapse the separated averages into a single row for each errhost (stats 3)
- calculate the percentage change and evaluate against your threshold
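
Applied to real data, the steps above might look roughly like the sketch below. The index name (tomcat) and the rex pattern are assumptions, since I haven't seen your events, so substitute your own base search and error-text extraction. The threshold is set to the 500% rise mentioned in the question; since perc is already expressed in percent, that is perc > 500, which corresponds to a daily average more than six times the weekly one.

```spl
index=tomcat SEVERE earliest=-7d
| rex "\[SEVERE\]\s*(?<err>.+)"
| eval errhost = host . "::" . err
| eval type = if(_time < relative_time(now(), "-d"), 7, 1)
| bin _time span=1h
| stats count as tcount by _time type errhost
| stats avg(tcount) as avg by type errhost
| stats avg(eval(if(type==1,avg,null()))) as day_avg avg(eval(if(type==7,avg,null()))) as wk_avg by errhost
| eval perc = round((day_avg - wk_avg) / wk_avg * 100, 2)
| where perc > 500
```

Two things to be aware of: hours with no events produce no row in the first stats, so the averages are over hours in which the error occurred at least once; and an errhost that only appears in the last day has no wk_avg, so perc is null and the where clause drops it.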

I wasn't clear what your second search was for without knowing your data, but if it's looking for a particular error string in errhost, then I would suggest

| where perc > 1.0 AND !match(errhost, "(?i)Unable to serve form")

to do the filtering.

Hope this helps

Michael_Condon · 10-12-2021

That's much better!

Also, I didn't know that about appendcols. I'm coming to this from a SQL context, and it's been a bit of a learning curve. Appreciate the full explanation and reply.

-Mike
