I spent about 5 minutes trying to figure out how to even title this question.
It's much easier to explain with this example; please feel free to edit the title.
We have an access log of format:
ClientIP Hostname URI StatusCode
Now, I am trying to identify a set of ClientIPs that make an unusually large number of requests per Hostname over a specified timespan (for example, per minute): say, a request count per Hostname more than 2 standard deviations higher than the average over that timespan.
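To make the goal concrete, here is a minimal Python sketch of the "flag anything more than 2 standard deviations above the mean" idea, using invented per-minute counts (the IPs, hostname, and numbers are all hypothetical):

```python
from statistics import mean, stdev

# Hypothetical per-minute request counts: (ClientIP, Hostname) -> requests in one minute.
cpm = {("10.0.0.%d" % i, "this.domain.com"): c
       for i, c in enumerate([10, 11, 12, 12, 13, 11, 12, 13, 10, 12])}
cpm[("10.0.0.99", "this.domain.com")] = 90  # the outlier we want to flag

# Group counts per hostname, then flag any ClientIP more than 2 stdev above the mean.
by_host = {}
for (ip, host), count in cpm.items():
    by_host.setdefault(host, []).append((ip, count))

outliers = []
for host, pairs in by_host.items():
    counts = [c for _, c in pairs]
    if len(counts) < 2:
        continue  # stdev needs at least two samples
    m, s = mean(counts), stdev(counts)
    outliers += [(ip, host, c) for ip, c in pairs if c > m + 2 * s]

print(outliers)  # only the 90-requests-per-minute client is flagged
```

Note that with very few clients per hostname, a single huge outlier inflates the standard deviation enough to hide itself; the approach works best when each hostname has a reasonable number of clients in the baseline.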
The reason for trying to do this rather than a fixed count/threshold is:
For those interested, here is how the current "preset threshold" is implemented:
index=accesslog
| stats count as CPM by ClientIP Hostname
| search (Hostname="*.domain.com" CPM>800) OR (Hostname="this.domain.com" CPM>350) OR (Hostname="that.domain.com" CPM>300) OR (...) OR ...
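For comparison, the preset-threshold logic boils down to a lookup table with a wildcard catch-all. A minimal Python sketch (the hostnames and limits mirror the made-up values in the search above), using fnmatch for the wildcard and checking specific hosts before the catch-all, which gives the same alerts as the OR-ed clauses when the specific thresholds are lower:

```python
from fnmatch import fnmatch

# Hypothetical preset thresholds: specific hosts first, then a
# wildcard catch-all for everything else under the domain.
THRESHOLDS = [
    ("this.domain.com", 350),
    ("that.domain.com", 300),
    ("*.domain.com", 800),
]

def exceeds_threshold(hostname: str, cpm: int) -> bool:
    """Return True if cpm is above the first matching host pattern's limit."""
    for pattern, limit in THRESHOLDS:
        if fnmatch(hostname, pattern):
            return cpm > limit
    return False

print(exceeds_threshold("this.domain.com", 400))   # True: above its 350 limit
print(exceeds_threshold("other.domain.com", 500))  # False: under the 800 catch-all
```

The maintenance burden is visible here: every new hostname with different traffic needs its own entry, which is exactly what the statistical approach avoids.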
The easiest solution is to leverage the Prelert Anomaly Detective App. You can easily determine which ClientIPs are making an abnormally different number of requests than other ClientIPs:
index=accesslog | prelertautodetect count over ClientIP
If you want to segment it by host as well:
index=accesslog | prelertautodetect count by Hostname over ClientIP
Here's a slightly different example: finding a ClientIP requesting an abnormally different number of pages than other ClientIPs, in this case segmented by status code.
This can also easily be run as a regularly scheduled search, so that it runs continuously every X minutes.
After a bit of trial and error I figured it out.
index=accesslog earliest=-7d@m-5m latest=-7d@m
| append [ search index=accesslog earliest=-14d@m-5m latest=-14d@m ]
| append [ search index=accesslog earliest=-21d@m-5m latest=-21d@m ]
| bucket _time span=1m
| stats count AS LastCPM by ClientIP Hostname date_mday
| stats avg(LastCPM) as LastAvg, stdev(LastCPM) as LastStdev by Hostname
| join type=outer Hostname [ search index=accesslog earliest=-5m@m latest=now@m | bucket _time span=1m | stats count AS NowCPM by ClientIP Hostname date_mday | stats avg(NowCPM) as NowAvg by Hostname ]
| where NowAvg > LastAvg+LastStdev*2
The output will be something like this:
Hostname           LastAvg    LastStdev  NowAvg
host3.domain.com   25.370370  32.720253  26.600000
host55.domain.com  10.610169  14.518736  13.900000
The logic: look at the average and standard deviation of events (connections) per client, per host, per minute over the historical windows, then compare that with the current average.
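The compare-with-history logic can be sketched in Python, assuming we already have per-minute counts per hostname from the historical windows and from the current window (all hostnames and numbers below are invented):

```python
from statistics import mean, stdev

# Hypothetical per-minute counts per hostname from the historical
# 5-minute windows, and from the current 5 minutes.
history = {
    "host3.domain.com": [20, 25, 22, 30, 24, 21, 26, 23, 25, 27],
    "host55.domain.com": [9, 11, 10, 12, 10, 11],
}
now = {
    "host3.domain.com": [26, 27, 90, 28, 25],   # one minute spikes hard
    "host55.domain.com": [10, 11, 12, 10, 11],
}

# Flag hostnames whose current average exceeds the historical mean + 2 stdev,
# mirroring `where NowAvg > LastAvg+LastStdev*2` in the search above.
flagged = []
for host, past in history.items():
    now_vals = now.get(host)
    if not now_vals:
        continue  # no current traffic for this hostname
    last_avg, last_stdev = mean(past), stdev(past)
    if mean(now_vals) > last_avg + 2 * last_stdev:
        flagged.append(host)

print(flagged)  # only the spiking hostname is flagged
```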
You can add more append subsearches to cover additional time ranges. Every company/site/webservice has a unique access profile. Some will have similar stats at the same time of the same weekday (as in our case); others will have similar stats every day regardless of weekday/weekend, in which case you can change the append time frames to yesterday, the day before, etc., instead of going back a week at a time.
Now, there are 2 things I'm not sure about. Do I need to add _time to the stats count lines? I think that would be needed only if you want to compare non-equal timespans (e.g. more than 5 minutes in the top lines and exactly the last 5 minutes under the join).
The other thing: I wanted to "extract" the individual events from the resulting stats tables (after the where pipe), but I could not find a way to do that. Are the underlying logs that produced the stats lost?
Also, a thing of note: my solution above only works if the number of requests by several ClientIPs, or by one ClientIP, pushes NowAvg high enough to be "caught". The good thing is that it will catch both possibilities; the bad is that it will also catch some legitimate situations, like testers making more connections during tests or a newly added external monitoring solution.
The solution is efficient and can be used in "almost" real-time reports/alerts if you don't specify large timespans. For the 5-minute timespan in the example above, parsing takes longer than the actual search and stats.
For a normal distribution, we can use p97 to approximate two standard deviations.
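To see why a high percentile is a reasonable stand-in for "mean plus two standard deviations", a quick check with Python's statistics.NormalDist (just verifying the approximation, not part of the search):

```python
from statistics import NormalDist

# For a standard normal distribution (mean 0, stdev 1), find how many
# stdevs above the mean the 97th and 97.7th percentiles sit.
z97 = NormalDist().inv_cdf(0.97)    # ~1.88 stdevs above the mean
z977 = NormalDist().inv_cdf(0.977)  # ~2.00 stdevs above the mean

print(round(z97, 2), round(z977, 2))
```

So p97 corresponds to roughly 1.9 standard deviations, and p97.7 to almost exactly 2; either is close enough for this kind of alerting.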
Here is how I would write the search:
| bucket _time span=1m
| stats count as CPM by ClientIP Hostname _time
| eventstats p97(CPM) as threshold by Hostname _time
| where CPM > threshold
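What that search does can be sketched in Python for illustration (the counts and IPs are invented, and Splunk's p97 is approximated here with a simple nearest-rank percentile):

```python
import math

def p97(values):
    """Nearest-rank 97th percentile (a rough stand-in for Splunk's p97)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(0.97 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical CPM values for one (Hostname, minute) bucket, one per ClientIP:
# 99 well-behaved clients plus one abusive one.
rows = [("10.0.0.%d" % i, c) for i, c in enumerate([5] * 40 + [6] * 40 + [7] * 19)]
rows.append(("10.0.0.99", 120))  # the abusive client

# Mirror `eventstats p97(CPM) as threshold` + `where CPM > threshold`.
threshold = p97([c for _, c in rows])
flagged = [ip for ip, c in rows if c > threshold]
print(threshold, flagged)
```

Because the threshold is a percentile of the bucket itself, roughly the top 3% of client rows in every bucket will always be returned, whether or not anything is actually anomalous; that is the source of the "too many rows" problem discussed below.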
Although I am not quite sure what you are trying to compare... start with a short time period and leave off the last line, and I think you will see how it works.
(More info on std dev - look at the graph in this Wikipedia article and you can see that 2 standard deviations would include all but approximately the top 2.2% of values.)
It runs fine and is quite efficient (from a performance point of view) for short timespans. However, it returns too many rows. I've tried changing to p98() and still get quite a few. I believe it's because of the wide variation in request counts at different times of the day. I think comparing the current timespan (e.g. the last 5 minutes) to the same hour:minute timespan over the last 4 weeks would produce better results, e.g. avg(today 15:00-15:05 per IP per host per minute) > avg(last 4 weeks 15:00-15:05 per IP per host per minute). Is that possible with Splunk?
One way would be to look at it purely from a time perspective: then it would be X requests per hostname standing out, followed by looking at the events manually (they could be from the same IP or multiple IPs). Another way (preferred, I think) is to look at it from the client-IP point of view (Y requests per minute per client IP per hostname).
I will reiterate. The idea is to catch anomalies. But only higher than usual number of requests.
The idea is: each hostname gets X requests on average per minute, and Y requests per minute by unique(!) client IP. Somehow we want to be able to see client IPs requesting considerably more than others (anomalies, right?). This is a good way to identify attacks, too-frequent health/monitoring checks (possibly misconfiguration), infinite loops querying sites/webservices/etc. internally or externally (possibly poor code), and so on.
But if you're trying to calculate "higher than two stdev" over the time frame, you need to have some other sample time frame against which to figure out the mean / stdev.
Would you say that your question could be phrased as "Search for sudden uptick in activity from a <client host>?"
Yes, I was actually trying to use last week's data to calculate the average and stdev and compare against the last 5 minutes. Although a better way would be to take the equivalent timespan (e.g. 5 minutes) from the same time range over each of the last 4+ weeks: it is computationally faster and more indicative for web traffic stats. I couldn't even get the first scenario going using some of the example solutions in the links specified above; they are either wrong or have hidden syntax errors.
@felipetesta: Actually, the timeframe is specified in the search (right side) or in the alert/report settings; it's not explicit in the search itself, so the CPM values are per the specified timeframe (in our case 5 min). The bucket command is nice, but I think what I'm after is more advanced, like looking at the average of the previous week's stats per Hostname and comparing with now.
@alucas_1stop: I am not sure where the problem is. If you need to split by time, which is not shown in your example, how about "| bucket _time span=1m | stats count as CPM by ClientIP Hostname _time | search ..."
Then I tried playing with the anomaly-finding commands but did not come to a meaningful result.
I tried to follow this: http://answers.splunk.com/answers/58750/how-do-you-monitoralert-for-spikes-of-negative-events but it doesn't really work, as that example uses a single field, whereas here we are looking at a table (ClientIP vs Hostname).