
Percent Change by Host and Error Type

Michael_Condon
Engager

We're ingesting Tomcat logs and looking for items tagged [SEVERE]. I'd like to be able to pull a report of error rate and look for errors which are occurring at a significantly higher-than-average rate *for their error type*. In addition, we're getting data streams from multiple hosts, each of which is really its own instance and has its own "native" error rate.

I need the average rate of occurrence of errors over the last week, and over the last day, grouped by host and error type. Then I need to flag any error whose rate has risen by... say, 500%.

So far the best I've come up with is this: 


index="tomcat" sourcetype="catalina.out_logs" SEVERE earliest=-7d latest=-1d
| rex field=_raw "\[SEVERE[\s\t]+\][\s\t]+(?<err>.+)[\n\r]"
| eval errhost = host + "::" + err
| bucket _time span=1h
| stats count as tcount by errhost
| stats avg(tcount) as wk_avg by errhost
| appendcols
    [search index="tomcat" sourcetype="catalina.out_logs" SEVERE earliest=-1d latest=now()
    | rex field=_raw "\[SEVERE[\s\t]+\][\s\t]+(?<err>.+)[\n\r]"
    | eval errhost = host + "::" + err
    | bucket _time span=1h
    | stats count as tcount by errhost
    | stats avg(tcount) as day_avg by errhost]
| eval perc= round(((day_avg - wk_avg) * 100 / wk_avg),2)
| fields + perc errhost
| search perc > 500.0
| search NOT "Unable to serve form"
| sort perc desc

 

So - I'm pulling SEVERE errors, extracting just the error text, concatenating that with the host to get my group-by string, bucketing in 1-hour increments to get an average, then building a chart with the 7-day average and the 1-day average for each host/error pair. Wondering if anyone else has a better way to do it?

Thanks!

1 Solution

bowesmana
SplunkTrust
SplunkTrust

Your existing search is somewhat flawed, in that you are bucketing by 1 hour but not using _time in the group by, so your second stats statement is redundant - there is only one tcount value for each errhost, not one per hour.
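
For example, keeping _time in the first stats and then averaging over it would look something like this for the 7-day leg (a sketch reusing the field names from your search):

index="tomcat" sourcetype="catalina.out_logs" SEVERE earliest=-7d latest=-1d
| rex field=_raw "\[SEVERE[\s\t]+\][\s\t]+(?<err>.+)[\n\r]"
| eval errhost = host + "::" + err
| bucket _time span=1h
| stats count as tcount by _time errhost
| stats avg(tcount) as wk_avg by errhost

The first stats now produces one tcount per hour per errhost, so the second stats has something meaningful to average.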

Also, if you use appendcols, the total number of rows generated by each search MUST match, and the rows must line up in the same order; otherwise the appended columns may not match their row. For example, if you have 10 errhosts in the last 7 days and 6 in the last day, then only the first 6 rows will have data appended, and you will not know which errhost the appended columns relate to.
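
If you did want to keep two time-range searches rather than one, a pattern that avoids that alignment problem is append followed by a stats by errhost, which re-joins the two result sets on the key instead of relying on row order - roughly this (again just a sketch with your field names, and with _time added to the group by as above):

index="tomcat" sourcetype="catalina.out_logs" SEVERE earliest=-7d latest=-1d
| rex field=_raw "\[SEVERE[\s\t]+\][\s\t]+(?<err>.+)[\n\r]"
| eval errhost = host + "::" + err
| bucket _time span=1h
| stats count as tcount by _time errhost
| stats avg(tcount) as wk_avg by errhost
| append
    [ search index="tomcat" sourcetype="catalina.out_logs" SEVERE earliest=-1d latest=now()
    | rex field=_raw "\[SEVERE[\s\t]+\][\s\t]+(?<err>.+)[\n\r]"
    | eval errhost = host + "::" + err
    | bucket _time span=1h
    | stats count as tcount by _time errhost
    | stats avg(tcount) as day_avg by errhost ]
| stats values(wk_avg) as wk_avg values(day_avg) as day_avg by errhost

Bear in mind that append runs its argument as a subsearch, so subsearch result limits apply; the single-search approach below avoids that entirely.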

I am guessing you want something more like this - i.e. a single search, with an evaluation of which time window (last 7 days or last day) each event sits in.

Have a look at this search - it generates random data, but will show you how to do that averaging.

 

| makeresults count=10000 
| eval _time=_time-random() % (86400 * 8 )
| eval err="err_".((random() % 2) + 100)
| eval host="host_".(random() % 2) 
| eval errhost = host + "::" + err 
| eval type=if(_time<relative_time(now(), "-d"), 7, 1)
| bucket _time span=1h 
| stats count as tcount by _time type errhost
| stats avg(tcount) as avg by type errhost
| stats avg(eval(if(type==1,avg,null()))) as day_avg avg(eval(if(type==7,avg,null()))) as wk_avg by errhost
| eval perc= round(((day_avg - wk_avg) / wk_avg * 100),2) 
| fields + perc errhost 
| where perc > 1.0

 

So, everything from the 'eval type=...' line onwards is what you need - that line sets type to 7 if the event falls in the last-7-days window, or 1 if it falls in the last day.

This will

  • bucket on 1 hour and count for each hour in each window for each errhost (stats 1)
  • work out the average hourly rate for each window/errhost (stats 2)
  • collapse the separated averages into a single row for each errhost (stats 3)
  • calculate the percentage change and evaluate against your threshold
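
Adapted to your actual data it would look something like this - just a sketch, reusing the index, sourcetype, rex, and 500% threshold from your original search:

index="tomcat" sourcetype="catalina.out_logs" SEVERE earliest=-7d latest=now()
| rex field=_raw "\[SEVERE[\s\t]+\][\s\t]+(?<err>.+)[\n\r]"
| eval errhost = host + "::" + err
| eval type=if(_time<relative_time(now(), "-d"), 7, 1)
| bucket _time span=1h
| stats count as tcount by _time type errhost
| stats avg(tcount) as avg by type errhost
| stats avg(eval(if(type==1,avg,null()))) as day_avg avg(eval(if(type==7,avg,null()))) as wk_avg by errhost
| eval perc=round(((day_avg - wk_avg) / wk_avg * 100),2)
| where perc > 500.0
| sort perc desc

Note the single time range of earliest=-7d latest=now() - the type field is what splits it back into the 7-day baseline and the last-day window.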

I wasn't clear what your final search filter (NOT "Unable to serve form") was for without knowing your data, but if it's there to exclude a particular error string from errhost, then I would suggest

 

| where perc > 1.0 AND !match(errhost, "(?i)Unable to serve form")

 

to do the filtering.

Hope this helps

 


Michael_Condon
Engager

That's much better!

Also, I didn't know that about appendcols. I'm coming to this from a SQL context, and it's been a bit of a learning curve. I appreciate the full explanation and reply.

-Mike
