I am looking for a way to do a statistical sampling of events in order to provide rapid insight into millions of events.
An example: we have login events from a specific sourcetype, with millions of logins per day, and we want to look at how often users log in over a 7-day period (% of users with 1 login, % with 2 logins, % with 3 logins, etc.). Right now this is a very long query to run, even as a scheduled search.
< search string > | stats count AS logins BY userID | stats count AS users BY logins
This gives me a graph of how many users are in each category of login attempts (1, 2, 3, 4, 5, etc.), but it takes a very long time to run. Any suggestions on how to take a statistical sample would be appreciated.
<< EDIT >>
The suggestion has been to use summary indexing for this. How would you generate summary indexes to get this type of information? If, for example, we have 2 million unique users per day who log in about 5 million times, shrinking 5 million daily events down to 2 million daily events (stats count by userID) is the only way I can think of to check for repeat visits on subsequent days.
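Roughly, the daily summary-generating search I have in mind would be something like this (the summary index name is a placeholder, and collect is just one way of writing the summary rows):
< search string > earliest=-1d@d latest=@d
| stats count AS logins BY userID
| collect index=summary_logins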
Each time the query runs, this still involves looking at 14+ million events in the summary index, which, it would seem, defeats the purpose of a summary index.
Any tips on how to optimize this kind of search?
You really can't do this easily in Splunk today.
The hardest part of this would be selecting the sample so that it gives you a statistically representative subset of the data. The easiest and most natural way for most Splunkable data would be to take events in some representative time slices. (This by itself introduces assumptions about the data, though, that may or may not be true.) For example, it would be very easy to generate a search selecting 2 seconds' worth of data every 12 hours for the previous 9 days:
sourcetype=mylargesourcetype
[ gentimes start=-10 end=-1 increment=12h
| rename starttime as earliest
| eval latest=earliest+2
| fields + earliest latest
| format "(" "(" "" ")" "OR" ")" ]
But this would probably give you a poor sample, since there might be something odd about the 2 seconds right at noon and midnight. You could play little games with the above expression, doing some randomization with the eval random() function to offset the earliest time, but it would be up to you to know how well that works.
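For example, something along these lines (just a sketch; 43200 is half of the 12-hour interval in seconds, and the size of the offset is made up):
sourcetype=mylargesourcetype
[ gentimes start=-10 end=-1 increment=12h
| eval earliest=starttime+(random() % 43200)
| eval latest=earliest+2
| fields + earliest latest
| format "(" "(" "" ")" "OR" ")" ]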
HOWEVER, I don't believe the resulting search will actually yield performance benefits in Splunk today, as I think the engine doesn't take advantage of the time-based index when it looks at the whole list of generated sample intervals. Possibly this is a worthwhile enhancement request.
Instead, as long as your samples are a small enough fraction of your total data, you could get a performance improvement with something like:
| gentimes start=-10 end=-1 increment=12h
| rename starttime as earliest
| eval latest=earliest+2
| sort - earliest
| map maxsearches=99999 search="search earliest=$earliest$ latest=$latest$ sourcetype=mylargesourcetype"
Again, this "works" but the actual sampling is subject to bias. I would strongly recommend some randomization of earliest
after it's been generated (e.g. | eval starttime=starttime+(random()%43200)-21600
, but I'm just making that up), but the degree and type would depend quite a bit on what you suspect about your data. In fact, you may not want to use gentimes
at all, and generate your intervals using some other method.
Update:
Summarization will only help you so much, and sampling out of time probably won't give you statistically good results, i.e., whatever time sampling method you choose will bias your results. Probably you want to select a sample of users, then count their logins and take them as representative. Can you get a list of userids/usernames to choose from? Let's say you do, and you create a lookup table, uids, that looks like:
n,uid
1,username1
2,auser
3,login143
4,another_user
....
(If you just have the list, you can generate the number for n and output it with outputlookup, using the accum or streamstats commands.) Two or three million rows in this table is okay.
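(For instance, a sketch assuming the raw list has been uploaded as a lookup file called uids_raw.csv with a single uid column:)
| inputlookup uids_raw.csv
| streamstats count AS n
| fields n uid
| outputlookup uids.csv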
Now set up an automatic lookup from your sourcetype on the user id field to this lookup table:
[mylargesourcetype]
LOOKUP-uid_n = uids uid AS user_id_field OUTPUT n
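(The uids lookup referenced there also needs a lookup table definition, e.g. in transforms.conf, assuming the table is stored as uids.csv:)
[uids]
filename = uids.csv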
Now you select, say, a thousand random ids from the table. Suppose there are 2,123,456 ids in the table:
sourcetype=mylargesourcetype
[ gentimes start=-1000 increment=1d
| eval n=(random() % 2123456)
| fields n ]
(I'm just using gentimes as a convenient way to get 1000 rows that I then throw away. I could also use inputlookup uids | head 1000, assuming uids always has more than 1000 rows, which I guess it would, or this is pointless. Also, random() generates numbers from 0 through 2^31-1, so without doing a little more arithmetic there will be a tiny bias toward lower-numbered user ids, but since the bias with 2 million users is about 0.1% and doesn't seem systematic, I'm going to ignore it.)
That should give you a quicker-loading subset. Okay, the truth is, if you sample more than 100 names, you'll also have to bump up the subsearch and format limits in limits.conf, and possibly specify a higher limit in the format command (that last part may not be necessary in more recent releases):
sourcetype=mylargesourcetype
[ gentimes start=-1000 increment=1d
| eval n=(random() % 2123456)
| fields n
| format maxresults=999999 ]
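(For reference, the relevant limits.conf knobs are roughly these; the stanza and setting names are from memory, so check the limits.conf spec for your release before relying on them:)
[subsearch]
maxout = 10000

[format]
maxresults = 10000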
edited answer above
Please see the edited question - how would you go about using summary indexing to achieve this kind of measurement on a field that is unique, with millions of possible values, and needs to be compared with the following day's worth of unique events? This is why a statistical sampling seemed like the simple solution (though it isn't actually a possibility).
To answer your question, you probably really need summary indexing, particularly if you are looking for users who've logged in once or twice out of many millions of events and presumably many tens of thousands of users.
And actually, now that I look at your specific question, where you ask about number of logins per user over 7 days, the above type of time interval sampling is completely useless.
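Once a daily summary along the lines you describe (stats count by userID, written to a summary index) is in place, the 7-day report runs over the summary rows instead of the raw events. Roughly, with placeholder index and field names:
index=summary_logins earliest=-7d@d latest=@d
| stats sum(logins) AS logins BY userID
| stats count AS users BY logins
| eventstats sum(users) AS total_users
| eval pct=round(100*users/total_users,2)
| fields logins users pct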
Summary indexing is probably what you are looking for: http://www.splunk.com/base/Documentation/latest/Knowledge/Usesummaryindexing