
Statistical Analysis using Splunk

Path Finder

Hi,
I'm trying to create an alert mechanism that will be triggered when a certain condition is "statistically significant".

Let's say:
n = number of errors per 5 min
p = probability of n <= 5
a = the critical alpha level = 5%

Null hypothesis:
H0: p <= 5%
H1: p > 5%

Let's say I have a data set with the # of errors per 5 min for the last 7 days, e.g.
3:36 -- 3
3:41 -- 0
3:46 -- 0
.
.
.
.
4:00 -- 1
4:05 -- 4

Is there a way for me to configure an alert in Splunk that checks every 5 minutes whether the number of errors that occurred in the last 5 minutes is statistically significant, and triggers the alert if so?
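(For reference, the test I'm describing can be sketched outside Splunk in Python. The counts below are made-up sample data; this only illustrates the statistics, not a Splunk configuration.)

```python
from statistics import NormalDist, mean, stdev

# Hypothetical per-5-minute error counts (7 days would give ~2016 windows;
# this short sample is made up purely for illustration).
history = [3, 0, 0, 2, 1, 4, 0, 1, 2, 3, 1, 0]
current = 9          # errors observed in the most recent 5-minute window
alpha = 0.05         # the critical alpha level

mu, sigma = mean(history), stdev(history)
z = (current - mu) / sigma            # z-score of the current window
p_value = 1 - NormalDist().cdf(z)     # one-tailed p-value (upper tail)

significant = p_value < alpha
print(f"z={z:.2f} p={p_value:.4f} significant={significant}")
```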

Thanks


Re: Statistical Analysis using Splunk

Path Finder

Thanks for the response. That sounds like a workaround we can incorporate. It isn't really a statistical analysis such as a t-test or a Chi-Square test, but for now it's better than simply running an alert with a threshold count. So, thanks for that.

Note:
I still need to set it up and collect the data for 7 days before preparing the alert.

However I have a few followup questions:

  1. When you say don't use "'sistats' count" in the summary search, what do you mean? Should all summary indexing use "sistats"? I understand that when I use the summary-indexed search in another search, I'll use "stats" instead of "sistats". Am I correct?
  2. If I'm not mistaken then the Error Summary should look like "your error condition | sistats count as numErrors"
  3. The actual alert search should look like this: your error condition | stats count as currentErrors | appendcols [search index=summary search=ErrorSummary earliest=-7d@d latest=@d | stats avg(numErrors) as avgErrors stdev(numErrors) as stdevErrors count as numPeriods] | eval maxVal = avgErrors + (stdevErrors * 2) | eval significant = currentErrors > maxVal | fields + currentErrors avgErrors stdevErrors numPeriods significant

Please let me know if I missed anything ...

I'll report back, once I try the whole thing.

Thanks again for your help


Re: Statistical Analysis using Splunk

Legend

Bah - my original answer was poor, and I have realized that there is a simpler solution. This calculates a mean number of errors over the last 7 days, then creates an upper bound that is two standard deviations from the mean. If the number of errors is greater than the upper bound, the number is statistically significant.

Try this, no summary index is needed:

your-error-condition | stats count as currentErrors | 
appendcols [search your-error-condition earliest=-7d@d latest=@d |
timechart span=5m count as errorCount | 
stats avg(errorCount) as avgErrors stdev(errorCount) as stdevErrors count as numPeriods] |
eval maxVal = avgErrors + (stdevErrors * 2) |
eval significant = currentErrors > maxVal |
fields + currentErrors avgErrors stdevErrors numPeriods significant
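For anyone checking the math, the mean-plus-two-standard-deviations rule in that search can be sketched in plain Python (the counts below are made-up sample data):

```python
from statistics import mean, stdev

# Hypothetical per-5-minute error counts from the last 7 days (made up).
error_counts = [3, 0, 0, 1, 4, 2, 0, 1, 3, 2]
current_errors = 12   # count from the most recent 5-minute window

avg_errors = mean(error_counts)
stdev_errors = stdev(error_counts)
max_val = avg_errors + 2 * stdev_errors   # upper bound: two stdevs above mean

significant = current_errors > max_val
print(avg_errors, round(stdev_errors, 3), round(max_val, 3), significant)
```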

Note that I am only testing for a statistically significant increase in errors, although using the standard deviation makes this a two-tailed statistical test. There are also a few other statistical functions that might be of interest to you here -

Functions for stats, chart and timechart

Functions for eval and where

These may help you calculate an actual t-test or Chi-Square statistic.


Re: Statistical Analysis using Splunk

Path Finder

Thanks. I was thinking along the same lines. But I'd still prefer to use summary indexing instead of a live search, since gathering the data for the last 7 days would take a very long time. Hence I'd collect the data for every 5-minute interval in the summary index, and use a separate search to calculate the statistics from the summary data.

In order to calculate the exact p-value, we need either
1) an integral function in Splunk, or
2) the ability to look up a Z-table in Splunk. Currently, I'm aware of looking up a two-dimensional table in Splunk, but to look up a Z-table we need to match on both the row and the column values. Is there any way to do that?

e.g. for each X, Z = (X - avg)/stddev
Look up the p-value from the Z-score table (I'll save that table as a lookup table in Splunk), and then determine if the p-value < alpha (= .05). If it is, for the 1-tailed test, we can conclude the result is significant.
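To sanity-check the lookup values outside Splunk, Python's `statistics.NormalDist` gives the normal CDF directly; this is the integral Splunk lacks natively, which is why a Z-table lookup is needed in SPL. The numbers below are made up for illustration:

```python
from statistics import NormalDist

def one_tailed_p(x, avg, stddev):
    """Upper-tail p-value: probability of seeing a value at least this
    far above the mean, assuming a normal distribution."""
    z = (x - avg) / stddev
    return 1 - NormalDist().cdf(z)   # the integral Splunk can't do natively

# Hypothetical numbers: current count 10, historical mean 4.0, stdev 2.5.
p = one_tailed_p(x=10, avg=4.0, stddev=2.5)   # z = 2.4
print(round(p, 4))
print(p < 0.05)   # significant at alpha = .05 for the one-tailed test
```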

Here's what I came up with:

error_summary:

index=someindex sourcetype=somesource environment!="someenv*" Type=someType (ErrorCode!="null" AND ErrorCode!=200 AND ErrorCode!=0)| sistats count as numErrors by ErrorCode

Alert Search:

index=someindex sourcetype=somesource environment!="someenv*" Type=sometype (ErrorCode!="null" AND ErrorCode!=200 AND ErrorCode!=0)
| stats count as currentErrors by ErrorCode
| appendcols [search index=somesummary search=summary_error_alert earliest=-7d@d latest=@d
| stats avg(numErrors) as avgErrors stdev(numErrors) as stdevErrors count as numPeriods by ErrorCode]
| eval zVal = (currentErrors - avgErrors)/stdevErrors/(sqrt(numPeriods)) // e.g. z=2.54
| eval zValFirstDec = substr(zVal, 1, 3) // "2.5"
| eval zValSecDec = "0.0"+substr(zVal, -1) // "0.04"
| lookup zTable zValFirstDec, zValSecDec OUTPUT pValue
| eval significant = if(pValue < .05, "true", "false")
| fields + currentErrors avgErrors stdevErrors numPeriods zVal significant

Now the only thing that I'm not sure of is the Splunk lookup against the row & column of the Z-table:
"lookup zTable zValFirstDec, zValSecDec OUTPUT pValue"

I don't know if, or how, we can do that in Splunk.

A new question for Splunk: can we do a lookup in a matrix table with two values (a row value & a column value)?

Does that look like something we can do?


Re: Statistical Analysis using Splunk

Legend

[Okay, I deleted my original dumb answer, and left the second one. This one is the third.]

So your searches are now very close, but I don't think that your summary index is going to give you what you want. A summary indexing search that ends with sistats count as numErrors by ErrorCode
should be retrieved by the following command

index=some_summary search=summary_error_alert earliest=-7d@d latest=@d 
| stats count as numErrors by ErrorCode

And that isn't going to give you the stats that you need. But you can try it. (Run the error_summary search for an hour or two and then try searching the summary index.)

Here is what I think you need:

error_summary:

  index=someindex sourcetype=somesource environment!="someenv*" Type=someType (ErrorCode!="null" AND 
  ErrorCode!=200 AND ErrorCode!=0 ) | eval counter=1 | 
  sitimechart span=5m sum(counter) by ErrorCode

The timechart command will summarize based on the 5-minute intervals and return a series. The stats command will summarize overall and return a single value. I think you need the series.
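The series-vs-single-value distinction matters because the mean and standard deviation must be taken over per-window counts. The bucketing that `timechart span=5m` performs can be sketched in Python (timestamps are made up for illustration):

```python
from collections import Counter

# Hypothetical event timestamps in seconds (made up for illustration).
event_times = [10, 70, 250, 260, 300, 310, 650, 900]
SPAN = 300  # 5 minutes, mirroring `span=5m`

# Bucket each event into its 5-minute window, then count per window:
per_window = Counter(t - t % SPAN for t in event_times)
series = [per_window[w] for w in sorted(per_window)]
print(series)   # a series of per-window counts, one value per window
```

Note one difference: unlike `timechart`, this sketch skips windows with zero events, and those zeros belong in the series too, since they pull the mean and standard deviation down.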

Alert Search:

index=someindex sourcetype=somesource environment!="someenv*" Type=sometype (ErrorCode!="null" AND ErrorCode!=200 AND ErrorCode!=0)
| stats count as currentErrors by ErrorCode
| appendcols [search index=some_summary search=summary_error_alert earliest=-7d@d latest=@d 
| timechart span=5m sum(counter) as numErrors by ErrorCode
| stats count as numPeriods avg(numErrors) as avgErrors stdev(numErrors) as stdevErrors by ErrorCode ]
| eval zVal = (currentErrors - avgErrors)/stdevErrors/(sqrt(numPeriods)) // e.g. z=2.54
| eval zValFirstDec = substr(zVal, 1, 3) // "2.5"
| eval zValSecDec = "0.0"+substr(zVal, -1) // "0.04"
| lookup zTable zValFirstDec, zValSecDec OUTPUT pValue
| eval significant = pValue < .05
| fields + currentErrors avgErrors stdevErrors numPeriods zVal significant pValue

The lookup question:

You can totally do a multi-key lookup in Splunk, just as you have written it. But the CSV file needs to be in this format:

zValFirstDec,zValSecDec,pValue

Not in a matrix format.
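Flattening a matrix-style Z-table into the row format the CSV lookup expects is mechanical; a sketch in Python (the table values below are a tiny made-up excerpt, not a real Z-table):

```python
import csv, io

# Tiny made-up excerpt of a matrix-style Z-table: rows keyed by z to one
# decimal, columns keyed by the second decimal digit.
matrix = {
    "2.5": {"0.03": 0.0057, "0.04": 0.0055},
    "2.6": {"0.03": 0.0043, "0.04": 0.0041},
}

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["zValFirstDec", "zValSecDec", "pValue"])  # lookup header
for first, cols in matrix.items():
    for second, p in cols.items():
        writer.writerow([first, second, p])   # one CSV row per matrix cell

print(buf.getvalue())
```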


Re: Statistical Analysis using Splunk

Path Finder

Thanks Guinn,

We are getting close.... but not there yet. I'm having a hard time collecting data from the summary index within a subsearch.

On inspecting my actual alert search, it appeared that I'm getting 0 results for the subsearch:
search index=somesummary search=summary_error_alert earliest=-7d@d latest=@d
| timechart span=5m sum(counter) as numErrors by ErrorCode
| stats count as numPeriods avg(numErrors) as avgErrors stdev(numErrors) as stdevErrors by ErrorCode

However, I validated that my summary search (which runs every 5 mins with earliest=-10m@m & latest=-5m@m) does indeed return data once I set the time range to the last 7 days.

What I noticed is that if I change the search to use source=summary_error_alert instead of search=summary_error_alert, then it returns some data. With that in mind, when I run the following section of the search, it only returns two fields, "ErrorCode" & "currentErrors":

index=someindex sourcetype=somesource environment!="someenv*" Type=sometype (ErrorCode!="null" AND ErrorCode!=200 AND ErrorCode!=0)
| stats count as currentErrors by ErrorCode
| appendcols [search index=somesummary search=summary_error_alert earliest=-7d@d latest=@d
| timechart span=5m sum(counter) as numErrors by ErrorCode
| stats count as numPeriods avg(numErrors) as avgErrors stdev(numErrors) as stdevErrors by ErrorCode ]

Results:

ErrorCode  currentErrors
2010       1
2011       5
2013       6
2100       1
2102       8
2201       7
....
....

I can't put my finger on it, but it's either the timechart/counter values, or the subsearch/summary search, that's screwing things up.

I'm still researching.. if I find the issue, I'll post it. any help is appreciated.

Basically the final result should look something like this:

ErrorCode  currentErrors  avgErrors  stdevErrors  numPeriods  zVal  pValue  significant
2010       1              7.6        12.4         1567        .12   .03     true
2011       5              7.6        12.4         1567        .23   .07     false

I think we need to use eventstats, instead of stats, since the avg, std & numPeriods should be the same value in all the events/rows, and available for all the events/rows, in order to calculate the zVals for each row.

Any help is appreciated..

Kuntal


Re: Statistical Analysis using Splunk

Path Finder

Timechart messed things up, since its results had the time values as column headers. The following search works as expected. Now I just need to put the subsearch into a summary-indexed search for better performance when gathering large quantities of data:

index=someindex sourcetype=somelogs environment!="PREPROD" Type=sometype (ErrorCode!="null" AND ErrorCode!=200 AND ErrorCode!=0)
| stats count as currentErrors by ErrorCode
| appendcols [search index=someindex sourcetype=somelogs environment!="PREPROD" Type=sometype (ErrorCode!="null" AND ErrorCode!=200 AND ErrorCode!=0)
| bucket _time span=5m
| stats count as numErrors by ErrorCode, _time
| stats avg(numErrors) as avgErrors stdev(numErrors) as stdevErrors count as numPeriods by ErrorCode]
| eval zVal = (currentErrors - avgErrors)/stdevErrors/(sqrt(numPeriods))
| eval Z1 = substr(zVal, 1, 3)
| eval Z2 = substr(zVal, 4, 1)
| lookup zval_lookup Z1, Z2 OUTPUT P as pValue
| eval significant = if(pValue < .05, "true", "false")
| fields + ErrorCode currentErrors avgErrors stdevErrors numPeriods zVal pValue significant


Re: Statistical Analysis using Splunk

Path Finder

Here's the final answer. I figured out why my summary search wasn't working: instead of search_name="summary_error_alert", we were doing search="summary_error_alert". Duh, it took me some time of comparing my other summary searches to figure out what was wrong :).

Nevertheless, here's the final solution. It's not a perfect one, since there are a few assumptions:
1) I can't determine the p-value if the Z-score > 4.09, because the table doesn't have data beyond that; we don't need it either, since beyond that Z-score the probability is extremely low.
2) The error distribution is normal.
3) I'm skipping weekends when gathering summary data, so that my statistics are not skewed by the very low number of errors over the weekends.
4) However, my alert search runs every day.
5) The critical alpha level = 0.05.

Summary Search:

Run conditions:

I run it every 5 mins, and it collects the data between "earliest=-10m@m latest=-5m@m".

Query:

index=someindex sourcetype=somelogs environment!="PREPROD*" Type=someType (ErrorCode!="null" AND ErrorCode!=200 AND ErrorCode!=0)
(date_wday!="sunday" AND date_wday!="saturday")
| bucket _time span=5m
| sistats count as numErrors by ErrorCode, _time

Alert Search:

Alert condition:

Runs every 5 mins
Gathers data from the last 5 mins => earliest=-5m@m latest=now
Alerts on a custom condition, when "significant=true" in the search results

Query:

index=someindex sourcetype=somelogs environment!="PREPROD*" Type=someType (ErrorCode!="null" AND ErrorCode!=200 AND ErrorCode!=0)
| stats count as currentErrors by ErrorCode
| appendcols [search index=mobilesummary search_name=summary_error_alert
earliest=-7d@d latest=@d (date_wday!="sunday" AND date_wday!="saturday")
| stats count as numErrors by ErrorCode, _time
| stats avg(numErrors) as avgErrors stdev(numErrors) as stdevErrors count as numPeriods by ErrorCode]
| eval zVal = round(((currentErrors - avgErrors)/stdevErrors/(sqrt(numPeriods))), 2)
| lookup zval_lookup zVal OUTPUT pValue1Tail
| eval significant = if(zVal > 4.09 OR pValue1Tail < .05, "true", "false")
| fields + ErrorCode currentErrors avgErrors stdevErrors numPeriods zVal pValue1Tail significant

Please let me know if there is any scope for improvement or any issues with the above solution. I'd appreciate any feedback.

Cheers


Re: Statistical Analysis using Splunk

Path Finder

Correction:

Instead of appendcols, we need to perform "join ErrorCode"; otherwise the merge is not correct. appendcols pastes the subsearch results alongside the main results row by row, without doing any "join" on a common field, so the results can be misleading. Hence you need to join the two result sets on the common field, "ErrorCode".
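The pitfall is easy to demonstrate outside Splunk: pasting columns side by side only works if both result sets have the same rows in the same order. The values below are made up for illustration:

```python
# Main-search results and subsearch results, keyed by ErrorCode.
# (Values are made up; ErrorCode 2011 is missing from the main results.)
current = [("2010", 1), ("2013", 6)]                     # currentErrors
history = [("2010", 7.6), ("2011", 5.2), ("2013", 3.1)]  # avgErrors

# appendcols-style merge: paste rows positionally, ignoring ErrorCode.
pasted = [(c[0], c[1], h[1]) for c, h in zip(current, history)]
print(pasted)   # 2013 wrongly picks up 2011's average

# join-style merge: match rows on the common field, ErrorCode.
avg_by_code = dict(history)
joined = [(code, cur, avg_by_code[code]) for code, cur in current]
print(joined)   # each ErrorCode keeps its own average
```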

Note:

  1. I've replaced the two-input Z-score lookup table with a one-input lookup table, hence there's no need for Z1 & Z2.
  2. Since the Z-score table doesn't have results beyond Z = 4.09, I've also added another check for Z > 4.09.
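A one-input lookup table like that (zVal rounded to two decimals, mapped to a one-tailed p-value) could be generated from the normal CDF. This is a guess at the file's contents, since the actual CSV used in the thread isn't shown; the column names just mirror the search:

```python
from statistics import NormalDist

# Build rows for a one-input lookup file: zVal from 0.00 to 4.09 in steps
# of 0.01, each mapped to its one-tailed (upper-tail) p-value.
nd = NormalDist()
rows = []
z = 0.0
while z <= 4.09:
    rows.append((f"{z:.2f}", round(1 - nd.cdf(z), 5)))
    z = round(z + 0.01, 2)   # round each step to stay on the 0.01 grid

print("zVal,pValue1Tail")
for zval, p in rows[:3]:
    print(f"{zval},{p}")
```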

Alert Search:

index=someindex sourcetype=sometype environment!="PREPROD*" Type=someType (ErrorCode!="null" AND ErrorCode!=200 AND ErrorCode!=0)
| stats count as currentErrors by ErrorCode
| join ErrorCode [search index=mobilesummary search_name=summary_error_alert
earliest=-7d@d latest=@d (date_wday!="sunday" AND date_wday!="saturday")
| stats count as numErrors by ErrorCode, _time
| stats avg(numErrors) as avgErrors stdev(numErrors) as stdevErrors count as numPeriods by ErrorCode]
| eval zVal = round(((currentErrors - avgErrors)/stdevErrors), 2)
| lookup zval_lookup zVal OUTPUT pValue1Tail
| eval significant = if(zVal > 4.09 OR pValue1Tail < .05, "true", "false")
| fields + ErrorCode currentErrors avgErrors stdevErrors numPeriods zVal pValue1Tail significant


Re: Statistical Analysis using Splunk

Communicator

This is very valuable! Thanks for posting and documenting this so well. Do you have a copy of the zval_lookup file you used? I'd hate to reinvent the wheel.
