Splunk Search

Percent Change by Host and Error Type

Michael_Condon
Engager

We're ingesting Tomcat logs, and looking for items tagged [SEVERE]. I'd like to be able to pull a report of error rate, and look for errors which are occurring at a significantly higher than average rate *for their error type*. In addition, we're getting data streams from multiple hosts, each of which are really their own instance and have their own "native" error rate. 

I need the average rate of occurrence of errors over the last week, and over the last day, grouped by host and error type. Then I need to flag any error who's rate has risen by... say, 500%. 

So far the best I've come up with is this: 


index="tomcat" sourcetype="catalina.out_logs" SEVERE earliest=-7d latest=-1d | rex field=_raw "\[SEVERE[\s\t]+\][\s\t]+(?<err>.+)[\n\r]" | eval errhost = host + "::" + err | bucket _time span=1h | stats count as tcount by errhost | stats avg(tcount) as wk_avg by errhost | appendcols [search index="tomcat" sourcetype="catalina.out_logs" SEVERE earliest=-1d latest=now() | rex field=_raw "\[SEVERE[\s\t]+\][\s\t]+(?<err>.+)[\n\r]" | eval errhost = host + "::" + err | bucket _time span=1h | stats count as tcount by errhost | stats avg(tcount) as day_avg by errhost] | eval perc= round(((day_avg - wk_avg) * 100 / wk_avg),2) | fields + perc errhost | search perc > 500.0 | search NOT "Unable to serve form" | sort perc desc

 

So - I'm pulling SEVERE errors, extracting just the error text, CONCATing that to the host to get my group-by string, bucketing in 1-hour increments to get an average, then building a chart with the 7-day average and the 1-day average for each host/error pair.  Wondering if anyone else has a better way to do it?

Thanks!

Labels (4)
0 Karma
1 Solution

bowesmana
SplunkTrust
SplunkTrust

Your existing search is somewhat flawed, in that you are bucketing by 1hour but not using _time in the group by, so your second stats statement is redundant as there is only one tcount value for each errhost. not one per hour.

Also, if you use appendcols, the total number of rows generated by each search MUST match and the rows must match otherwise the appended columns may not match their appended row, i.e. if you have 10 errhosts in the last 7 days and 6 in the last day, then only the first 6 rows will have data and you will not know which errhost the appended columns relate to.

I am guessing you might want to be doing this - i.e. done in a single search and then an evaluation based on which time window (last 7 or last 1) it sits in

Have a look at this search - it generates random data, but will show you how to do that averaging.

 

| makeresults count=10000 
| eval _time=_time-random() % (86400 * 8 )
| eval err="err_".((random() % 2) + 100)
| eval host="host_".(random() % 2) 
| eval errhost = host + "::" + err 
| eval type=if(_time<relative_time(now(), "-d"), 7, 1)
| bucket _time span=1h 
| stats count as tcount by _time type errhost
| stats avg(tcount) as avg by type errhost
| stats avg(eval(if(type==1,avg,null()))) as day_avg avg(eval(if(type==7,avg,null()))) as wk_avg by errhost
| eval perc= round(((day_avg - wk_avg) / wk_avg * 100),2) 
| fields + perc errhost 
| where perc > 1.0

 

 So, from the 'eval type=...' line, which is setting type to either 7 if it's a last 7 day event or 1 if it's today's event is what you need.

This will

  • bucket on 1 hour and count for each hour in each window for each errhost (stats 1)
  • work out the average hourly rate for each window/errhost (stats 2)
  • collapse the separated averages into a single row for each errhost (stats 3)
  • calculate the percentage change and evaluate against your threshold

I wasn't clear what your second search was for without knowing your data, but if it's looking for a particular error string in errhost, then I would suggest

 

| where perc > 1.0 AND !match(errhost, "(?i)Unable to serve form")

 

to do the filtering.

Hope this helps

 

View solution in original post

0 Karma

bowesmana
SplunkTrust
SplunkTrust

Your existing search is somewhat flawed, in that you are bucketing by 1hour but not using _time in the group by, so your second stats statement is redundant as there is only one tcount value for each errhost. not one per hour.

Also, if you use appendcols, the total number of rows generated by each search MUST match and the rows must match otherwise the appended columns may not match their appended row, i.e. if you have 10 errhosts in the last 7 days and 6 in the last day, then only the first 6 rows will have data and you will not know which errhost the appended columns relate to.

I am guessing you might want to be doing this - i.e. done in a single search and then an evaluation based on which time window (last 7 or last 1) it sits in

Have a look at this search - it generates random data, but will show you how to do that averaging.

 

| makeresults count=10000 
| eval _time=_time-random() % (86400 * 8 )
| eval err="err_".((random() % 2) + 100)
| eval host="host_".(random() % 2) 
| eval errhost = host + "::" + err 
| eval type=if(_time<relative_time(now(), "-d"), 7, 1)
| bucket _time span=1h 
| stats count as tcount by _time type errhost
| stats avg(tcount) as avg by type errhost
| stats avg(eval(if(type==1,avg,null()))) as day_avg avg(eval(if(type==7,avg,null()))) as wk_avg by errhost
| eval perc= round(((day_avg - wk_avg) / wk_avg * 100),2) 
| fields + perc errhost 
| where perc > 1.0

 

 So, from the 'eval type=...' line, which is setting type to either 7 if it's a last 7 day event or 1 if it's today's event is what you need.

This will

  • bucket on 1 hour and count for each hour in each window for each errhost (stats 1)
  • work out the average hourly rate for each window/errhost (stats 2)
  • collapse the separated averages into a single row for each errhost (stats 3)
  • calculate the percentage change and evaluate against your threshold

I wasn't clear what your second search was for without knowing your data, but if it's looking for a particular error string in errhost, then I would suggest

 

| where perc > 1.0 AND !match(errhost, "(?i)Unable to serve form")

 

to do the filtering.

Hope this helps

 

0 Karma

Michael_Condon
Engager

That's much better!

Also, didn't know that RE append columns. I'm coming to this from a SQL context, and it's been a bit of a learning curve. Appreciate the full explanation and reply. 

-Mike

0 Karma
Got questions? Get answers!

Join the Splunk Community Slack to learn, troubleshoot, and make connections with fellow Splunk practitioners in real time!

Meet up IRL or virtually!

Join Splunk User Groups to connect and learn in-person by region or remotely by topic or industry.

Get Updates on the Splunk Community!

SOK it to Me: Top 3 Benefits of Using Splunk Operator on Kubernetes that’ll Make ...

    Thursday, July 9, 2026  |  11:00AM–12:00PM PDT Duration: 1 hour (includes Q&A) Managing can feel like a ...

Upgrade Prep for 10.4, Network Observability Deep Dives, and More from Splunk Lantern

Splunk Lantern is Splunk’s customer success center that provides practical guidance from Splunk experts on key ...

Splunk Developer Day announcements: AI agents, MCP tools, Forecasting, and Custom ...

Splunk Developer Day was packed with product and platform updates for developers building in the AI ...