Splunk Search

How to measure an increase in the number of errors

mataharry
Communicator

I am looking for the best method to highlight hosts with errors, by comparing them to previous days.

For example, I run this search every day:

index="lsfgbc" OR index="lsficeng" process="lsf-sbatchd-audit" numautofsdefunct 
| `autofsversion` 
| table host, index, numstucksbatchd, numautofsdefunct, autofs_version 
| sort index, host

I tried this, but using earliest=-48h latest=-24h returned an empty result.

| set diff [ search index="lsfgbc" OR index="lsficeng" process="lsf-sbatchd-audit" numautofsdefunct 
               earliest=-48h latest=-24h 
             | fields + host | fields - _time _raw  ]
           [ search index="lsfgbc" OR index="lsficeng" process="lsf-sbatchd-audit" numautofsdefunct 
               earliest=-24h latest=now 
             | fields + host | fields - _time _raw ]

How do I compare to the previous day, or to the last month?

1 Solution

yannK
Splunk Employee

An easy approach is to compare counts per day. For example, to see the number of errors per host over a week and generate a nice graph:

error earliest=-7d@d | timechart span=1d count by host useother=0

You can also use summary indexing to save your results every day instead of recalculating them every time: http://www.splunk.com/base/Documentation/4.1.7/Knowledge/Usesummaryindexing. Set up a scheduled search running every day at midnight (plus 15 minutes, to make sure that all your data is available). For example, my saved search "summary_error_daily" uses the "si" version of the timechart, with more precise detail (per hour):

error earliest=-1d@d latest=@d | sitimechart span=1h count by host useother=0

Then retrieve the results with:

index=summary name=summary_error_daily | timechart span=2d count by host

Another method is to use alerting: run a stats count by host search, then use the alert condition "if the number of hosts rises by 1". Let it run for one day (to store the first values), and after that it will fire email alerts. See http://www.splunk.com/base/Documentation/latest/Admin/HowdoesalertingworkinSplunk
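
Here is a rough sketch of that alerting comparison (assuming a plain "error" keyword search; adapt the base search and time ranges to your data). It compares each host's count for the last full day against the day before and keeps only the hosts that increased, so the alert condition can simply be "number of results is greater than 0":

error earliest=-2d@d latest=@d 
| bin _time span=1d 
| stats count by _time, host 
| stats earliest(count) as yesterday, latest(count) as today by host 
| where today > yesterday

Note that a host with errors today but none yesterday will show the same value for both fields in this sketch, so you may want to treat newly appearing hosts separately.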

dwaynelee
New Member

I tried the raw log search above but got this error: Error in 'timechart' command: When you specify a split-by field, only single functions applied to a non-wildcarded data field are allowed.

I would like to get the list of hosts each day; how would I do that?

yannK
Splunk Employee

Can you provide the search you used?

David
Splunk Employee

The way I would recommend doing this is by setting up a summary index to look at the number of events over the last day (-1d@d) and then comparing the last 24 hours to the recent days. That will likely work better than searching the raw logs, and solves the problem itself.
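
As a rough sketch of that comparison (all names here are placeholders: it assumes a scheduled daily search such as "error | stats count by host | collect index=summary source=error_daily_summary" that writes one plain count per host per day, rather than the "si" commands), you could then flag hosts whose last day is above their recent average:

index=summary source=error_daily_summary earliest=-8d@d latest=@d 
| bin _time span=1d 
| stats sum(count) as daily_count by _time, host 
| stats latest(daily_count) as last_day, avg(daily_count) as weekly_avg by host 
| where last_day > weekly_avg

(The average includes the most recent day, so this is slightly conservative; it is only meant to illustrate the comparison.)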

However, to do it based on the raw logs, you can do:

index="lsfgbc" OR index="lsficeng" process="lsf-sbatchd-audit" numautofsdefunct 
        | autofsversion 
        | table host, index, numstucksbatchd, numautofsdefunct, autofs_version
        | timechart span=1d sum(numstucksbatchd) as sumnumstucksbatchd, sum(numautofsdefunct) as numautofsdefunct by host
        | delta sumstucksbatchd as diffsumstucksbatchd
        | delta sumnumautofsdefunct as diffsumnumautofsdefunct

Timechart should summarize the events per day (you might need to play with whether you want sum, avg, or first, depending on the contents of the logs), and delta will then show you the change in values since the previous day. I dropped a couple of the fields from the timechart, just because it can get overwhelming and this sounds like what you want, but you can add them back in as well.

Let me know if that all makes sense.
