Splunk Search

Generate Statistical Sampling

mattcg
Explorer

I am looking for a way to do a statistical sampling of events in order to provide rapid insight into millions of events.

An example: we have login events from a specific sourcetype, with millions of logins per day, and we want to look at how often users log in over a 7-day period (% of users with 1 login, % with 2 logins, % with 3 logins, etc.). Right now this is a very long query to run, even as a scheduled search.

< search string > | stats count as logins by userID | stats count as users by logins

This gives me a graph of how many users fall into each category of login count (1, 2, 3, 4, 5, etc.), but it takes a very long time to run. Any suggestions on how to take a statistical sampling would be appreciated.
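
(For reference, the percentage breakdown would just be one more step on top of that output; the eventstats/eval field names below are only illustrative:)

< search string > 
| stats count as logins by userID 
| stats count as users by logins 
| eventstats sum(users) as total_users 
| eval pct_users=round(100*users/total_users, 2)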

<< EDIT >>

The suggestion has been to use summary indexing for this. How would you generate summary indexes to get at this type of information? If, for example, we have 2 million unique users per day who log in about 5 million times, shrinking 5 million daily events down to 2 million daily events (stats count by userID) is the only way I can think of to check for repeat visits on subsequent days.

This still involves looking at 14+ million events in the summary index (which, it would seem, defeats the purpose of a summary index) each time the query needs to run.
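
For concreteness, the only summary generation I can think of is a daily scheduled search along these lines (the summary index name is made up):

< search string > earliest=-1d@d latest=@d 
| stats count as daily_logins by userID 
| collect index=summary_logins

and the 7-day report then still has to walk every row of that summary:

index=summary_logins earliest=-7d@d latest=@d 
| stats sum(daily_logins) as logins by userID 
| stats count as users by logins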

Any tips on how to optimize this kind of search?

1 Solution

gkanapathy
Splunk Employee

You really can't do this easily in Splunk today.

The hardest part of this would be selecting a sample that gives you a statistically representative subset of the data. The easiest and most natural way for most Splunkable data would be to take events in some representative time slices. (This by itself introduces assumptions about the data, though, that may or may not be true.) For example, it would be very easy to generate a search that selects 2 seconds' worth of data every 12 hours for the previous 9 days:

sourcetype=mylargesourcetype 
[ gentimes start=-10 end=-1 increment=12h 
  | rename starttime as earliest 
  | eval latest=earliest+2  
  | fields + earliest latest 
  | format "(" "(" "" ")" "OR" ")" ]

But this would probably give you a poor sample, since there might be something odd about the 2 seconds right at noon and midnight. You could play little games with the above expression, doing some randomization with the eval random() function to offset the earliest time, but it would be up to you to know how well this works.
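
For example, something like this adds a random offset of up to +/- 6 hours to each slice; the amount of jitter is arbitrary and just for illustration:

sourcetype=mylargesourcetype 
[ gentimes start=-10 end=-1 increment=12h 
  | rename starttime as earliest 
  | eval earliest=earliest+(random()%43200)-21600 
  | eval latest=earliest+2 
  | fields + earliest latest 
  | format "(" "(" "" ")" "OR" ")" ]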

HOWEVER, I don't believe the resulting search above will actually yield performance benefits in Splunk today, as I think the engine doesn't take advantage of the time-based index when it is handed the whole list of generated sample intervals. Possibly this is a worthwhile enhancement request.

Instead, as long as your samples are a small enough fraction of your total data, you could get a performance improvement with something like:

| gentimes start=-10 end=-1 increment=12h 
| rename starttime as earliest 
| eval latest=earliest+2 
| sort - earliest
| map maxsearches=99999
      search="search earliest=$earliest$ latest=$latest$ sourcetype=mylargesourcetype"

Again, this "works", but the actual sampling is subject to bias. I would strongly recommend some randomization of earliest after it has been generated (e.g. | eval earliest=earliest+(random()%43200)-21600, but I'm just making that number up); the degree and type of randomization would depend quite a bit on what you suspect about your data. In fact, you may not want to use gentimes at all, and might generate your intervals using some other method.
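
Applied to the map version, that randomization would look roughly like this (again, the +/- 6 hour jitter is made up):

| gentimes start=-10 end=-1 increment=12h 
| rename starttime as earliest 
| eval earliest=earliest+(random()%43200)-21600 
| eval latest=earliest+2 
| sort - earliest
| map maxsearches=99999
      search="search earliest=$earliest$ latest=$latest$ sourcetype=mylargesourcetype"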


Update:

Summarization will only help you so much, and sampling on time slices probably won't give you statistically good results; whatever time-sampling method you choose will bias your results. What you probably want is to select a sample of users, count their logins, and take them as representative. Can you get a list of userids/usernames to choose from? Let's say you can, and you create a lookup table, uids, that looks like:

n,uid
1,username1
2,auser
3,login143
4,another_user
....

(If you just have the list of names, you can generate the number n with the accum or streamstats commands and write the result out with outputlookup.) Two or three million rows in this table is okay.
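
A quick sketch of that numbering step, assuming the raw list is itself readable with inputlookup and has a uid column (both file names here are made up):

| inputlookup uids_raw.csv 
| streamstats count as n 
| table n uid 
| outputlookup uids.csv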

Now set up an automatic lookup (in props.conf) from your sourcetype's user id field to this lookup table:

[mylargesourcetype]
LOOKUP-uid_n = uids uid AS user_id_field OUTPUT n
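
(The uids lookup table itself would be defined in transforms.conf; the CSV file name here is just an example:)

[uids]
filename = uids.csv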

Now you select, say, a thousand random ids from the table. Suppose there are 2,123,456 ids in the table:

sourcetype=mylargesourcetype
[ gentimes start=-1000 increment=1d
  | eval n=(random() % 2123456)
  | fields n ]

(I'm just using gentimes as a convenient way to get 1000 rows that I then throw away. I could also use inputlookup uids | head 1000, assuming uids always has more than 1000 rows, which I guess it would, or this is pointless. Also, random() generates numbers from 0 through 2^31-1, so without doing a little more arithmetic there will be a tiny bias toward lower-numbered user ids; but since that bias with 2 million users is about 0.1% and it isn't systematic, I'm going to ignore it.)

That should give you a quicker-loading subset. Okay, the truth is, if you sample more than 100 names you'll also have to bump up the subsearch and format limits in limits.conf, and possibly specify a higher limit on the format command (that last part may not be necessary in more recent releases):

sourcetype=mylargesourcetype
[ gentimes start=-1000 increment=1d
  | eval n=(random() % 2123456)
  | fields n 
  | format maxresults=999999 ]


gkanapathy
Splunk Employee

edited answer above


mattcg
Explorer

Please see the edited question: how would you go about using summary indexing to achieve this kind of measurement on a field that is unique, with millions of possible values, and needs to be compared against the following day's worth of unique events? This is why statistical sampling seemed like the simple solution (though it isn't actually a possibility).


gkanapathy
Splunk Employee

To answer your question, you probably really need summary indexing, particularly if you are looking for users who've logged in once or twice out of many millions of events and presumably many tens of thousands of users.
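
One way to combine that with the user sampling above would be to run the per-user rollup against the summary index, but only for the sampled names. Roughly (the index and field names are assumptions, and the same subsearch/format limits apply):

index=summary_logins earliest=-7d@d latest=@d 
  [ inputlookup uids 
    | eval r=random() 
    | sort 1000 r 
    | rename uid as userID 
    | fields userID ]
| stats sum(daily_logins) as logins by userID 
| stats count as users by logins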


gkanapathy
Splunk Employee

And actually, now that I look at your specific question, where you ask about number of logins per user over 7 days, the above type of time interval sampling is completely useless.


ziegfried
Influencer

Summary indexing is probably what you are looking for: http://www.splunk.com/base/Documentation/latest/Knowledge/Usesummaryindexing
