I have a situation where a server is crashing as the result of a specific user accessing some specific web site. Don't know the user and don't know the web site. But the pattern is this:
When some user (call them userID) accesses some site (call it siteID) it causes the server to crash. All connected users drop, the server is automatically restarted, and within two minutes users are reconnecting. This results in a load pattern (number of connected users) to look something like this (each number is the event count sample interval)
55 57 47 65 52 58 1 2 6 11 23 34 45 56 58 62... ^ crash happens here
Where the event counts would drop by a quantifiable percentage, in this case, going from 58/interval to something close to 1/interval. Also, imagine that this pattern repeats over 24 hours where crashes occur at random times but at least once or twice per hour.
So, the query I want would be something like:
Find all the events in minute X where, if you compared their count to minute X-1, their count would be at least 10 times less than the count at time X. For the one minute time period of X-1 to X, give me the list of all distinctly unique userIDs and the siteID they accessed. Now scan 24 hours of data looking for this pattern and give me the union of all the data sets from each minute prior to the crash.
Finally, and here's what I'm looking for, the offending user. It is highly probably that just one user (or a very small set of users) is accessing a specific web site for which our server is getting tripped up and crashing when doing its thing for this site.
So, I want to find the userID that appears in (is common to) many of these X-1 intervals. Sort of like taking a bunch of X-1 data sets and performing and intersection of all uerIDs across each data set.
I tried doing something like this to build one X-1 data set:
host="webproxy01" earliest=5/3/2010:18:48:0 latest=5/3/2010:18:49:0 | fields userID, siteID| dedup userID
Did it this way because I know the time of each crash (to the nearest minute).
But then I have to create a bunch of saved queries (one for each X-1 interval) and run another query like this to get all the data:
| savedsearch A | append[savedsearch B] | append[savedsearch C] | stats count by userID | sort - count
I just started using Splunk (new user to Splunk) yesterday to analyze this problem. So, my approach is probably reflective of a newbie and I'm sure there must be a better, more generic query that eliminates the need to explicitly specify the earliest/latest intervals and let Splunk figure out those intervals by recognizing the pattern of the drastic drop in events between X and X-1.
As a final point, there are more than one server that I would want to look at, as well because, due to load balancing, the offending user would nearly certainly, randomly, connect to a different server in the pool when the crash occurs. And, of course, cause that server to crash eventually. 😞
Thanks in advance for any suggestions,
I'm not sure whether I understood you correctly, but it might be worth looking at the transaction & delta command.
You could try
| transaction siteID maxspan=60s | delta eventcount as dlt_cnt | where dlt_cnt/eventcount>0.8 | top userID
I am assuming that siteID and userID are fields that are already extracted
A big delta (dlt_cnt) indicates a sharp fall in events, if the delta is close to the eventcount dlt_cnt/eventcount will be close to 1
The documentation about the delta command says that it will compute the difference between the current and the previous value (http://www.splunk.com/base/Documentation/latest/SearchReference/Delta) this is/was confusing to me since I expected previous to mean "take event x go back one event in time, calculate the difference and add the result as a field to event x" it seems to "take event x go back one event in time (event y), substract the field value of x from y and add the delta field to y". I hopeing that this doesn't add to the confusion... 🙂
Thanks for the lead. However, the problem I am having is the "eventcount" you refer to in your suggestion. I cannot figure out how to get that number to use in the math. Are you assuming I have that? Because I don't. There is no built in variable that I can see that refers to the number of events in a given timeslice. I'm looking to do something like "take the diff between number of events in the current time slice and the preceding one". Where to you get the "number of events" part? Thanks, Mark
'eventcount' is a field that the transaction command outputs. It's not really mentioned in the documentation though.
Yep, sorry I didn't mention that the eventcount is automatically added if you work with transactions. The delta command should give you that difference you are looking for between two transactions (which corresponds to the events that occur in the timeslice you need)