Generate Statistical Sampling
mattcg (Explorer), 08-16-2010 11:30 PM

I am looking for a way to do a statistical sampling of events in order to provide rapid insight into millions of events.

An example: we have login events from a specific sourcetype, with millions of logins per day, and want to look at how often users log in over a 7-day period (% of users with 1 login, % with 2 logins, % with 3 logins, etc.). Right now this is a very long query to run, even as a scheduled search.

```
<search string> | stats count as logins by userID | stats count as users by logins
```

This gives me a graph of how many users fall into each login-count bucket (1, 2, 3, 4, 5, etc.), but it takes a very long time to run. Any suggestions on how to take a statistical sample would be appreciated.
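As a minimal sketch of what those two `stats` calls compute (using made-up user ids, not the real data), the same two-level aggregation looks like this in Python:

```python
from collections import Counter

# Hypothetical login events: one userID per login (stand-in for the sourcetype).
events = ["alice", "bob", "alice", "carol", "alice", "bob"]

# First stats call: logins per user.
logins_per_user = Counter(events)

# Second stats call: number of users in each login-count bucket.
users_per_logins = Counter(logins_per_user.values())
```

Here `users_per_logins` is the distribution being graphed: how many users logged in once, twice, three times, and so on.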

**<< EDIT >>**

The suggestion has been to use summary indexing for this. How would you generate summary indexes to achieve this type of information? If, for example, we have 2 million unique users per day who log in about 5 million times, shrinking 5 million daily events down to 2 million daily summary events (`stats count by userID`) is the only way I can think of to check for repeat visits on subsequent days.

This still involves scanning 14+ million events in the summary index each time the query runs, which would seem to defeat the purpose of a summary index.

Any tips on how to optimize this kind of search?

1 Solution

ziegfried (Influencer), 08-17-2010 12:01 AM

Summary indexing is probably what you are looking for: http://www.splunk.com/base/Documentation/latest/Knowledge/Usesummaryindexing

mattcg (Explorer), 08-17-2010 05:43 PM

Please see the edited question - how would you go about using summary indexing to achieve this kind of measurement on a field that is unique, with millions of possible values, and that needs to be compared with the following day's worth of unique events? This is why statistical sampling seemed like the simple solution (though apparently it isn't actually possible).

gkanapathy (Splunk Employee), 08-17-2010 03:11 AM

You really can't do this easily in Splunk today.

The hardest part of this would be selecting a sample that gives you a statistically representative subset of the data. The easiest and most natural way for most Splunkable data would be to take events from some representative time slices. (This by itself introduces assumptions about the data that may or may not be true.) For example, it would be very easy to select 2 seconds' worth of data every 12 hours for the previous 9 days:

```
sourcetype=mylargesourcetype
[ gentimes start=-10 end=-1 increment=12h
| rename starttime as earliest
| eval latest=earliest+2
| fields + earliest latest
| format "(" "(" "" ")" "OR" ")" ]
```

But this would probably give you a poor sample, since there might be something odd about the 2 seconds right at noon and midnight. You could play little games with the above expression, doing some randomization with the `eval` `random()` function to offset the earliest time, but it would be up to you to know how well this works.
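The randomized-offset idea is easier to see outside SPL; this hypothetical Python sketch picks one short window at a random position inside each 12-hour slice (the names and numbers are illustrative, not from the thread):

```python
import random

HALF_DAY = 12 * 3600  # one 12-hour slice, in seconds

def sample_windows(now, days=9, window=2):
    """Return (earliest, latest) pairs: one window-second span per 12h slice."""
    windows = []
    for slice_start in range(now - days * 2 * HALF_DAY, now, HALF_DAY):
        # Random offset inside the slice, instead of always sampling at the edge.
        earliest = slice_start + random.randrange(HALF_DAY - window)
        windows.append((earliest, earliest + window))
    return windows
```

Nine days at two slices per day yields 18 small windows, each placed uniformly at random within its slice rather than pinned to noon and midnight.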

HOWEVER, I don't believe the above resulting search in Splunk today will actually yield you performance benefits, as I think the engine doesn't take advantage of the time-based index when it looks at the whole list of generated sample intervals. Possibly this is a worthwhile enhancement request.

Instead, as long as your samples are a small enough fraction of your total data, you could get a performance improvement with something like:

```
| gentimes start=-10 end=-1 increment=12h
| rename starttime as earliest
| eval latest=earliest+2
| sort - earliest
| map maxsearches=99999
search="search earliest=$earliest$ latest=$latest$ sourcetype=mylargesourcetype"
```

Again, this "works", but the actual sampling is subject to bias. I would strongly recommend some randomization of `earliest` after it's been generated (e.g. `| eval starttime=starttime+(random()%43200)-21600`, but I'm just making that up); the degree and type would depend quite a bit on what you suspect about your data. In fact, you may not want to use `gentimes` at all, and instead generate your intervals using some other method.

Update:

Summarization will only help you so much, and sampling out of time probably won't give you statistically good results, i.e., whatever time-sampling method you choose will bias your results. Probably you want to select a sample of *users*, then count their logins and take them as representative. Can you get a list of userids/usernames to choose from? Let's say you can, and you create a lookup table, `uids`, that looks like:

```
n,uid
1,username1
2,auser
3,login143
4,another_user
....
```

(If you just have the list, you can generate the number for `n` and output with `outputlookup`, using the `accum` or `streamstats` commands.) Two or three million rows in this table is okay.

Now set up an automatic lookup (in `props.conf`) from your sourcetype's user id field to this lookup table:

```
[mylargesourcetype]
LOOKUP-uid_n = uids uid AS user_id_field OUTPUT n
```

Now select, say, a thousand random users from the table; suppose there are 2,123,456 ids in the table:

```
sourcetype=mylargesourcetype
[ gentimes start=-1000 increment=1d
| eval n=(random() % 2123456)
| fields n ]
```

(I'm just using `gentimes` as a convenient way to get 1000 rows that I then throw away. I could also use `inputlookup uids | head 1000`, assuming `uids` always has more than 1000 rows. Which I guess it would, or this is pointless. Also, `random()` generates numbers from 0 through 2^31-1, so without doing a little more arithmetic there will be a tiny bias toward lower-numbered user ids, but since the bias with 2 million users is about 0.1% and it doesn't seem systematic, I'm going to ignore it.)
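The ~0.1% figure checks out; here is the back-of-the-envelope arithmetic in Python, using the thread's hypothetical 2,123,456-id table:

```python
# random() returns one of 2**31 equally likely values (0 .. 2**31 - 1);
# `% 2123456` folds them onto the user-id numbers.
OUTPUTS = 2**31
N_IDS = 2123456

base = OUTPUTS // N_IDS    # chances every id gets under the modulo
extra = OUTPUTS % N_IDS    # count of low-numbered ids that get one extra chance
relative_bias = 1 / base   # how much likelier those low ids are to be picked
```

Each id gets 1011 or 1012 of the 2^31 outcomes, so the low-numbered ids are about 1/1011 ≈ 0.1% more likely to be selected, matching the estimate above.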

That should give you a quicker-loading subset. Okay, the truth is, if you sample more than 100 names, you'll also have to bump up the subsearch and format limits in `limits.conf`, and possibly specify a higher limit in the `format` command (that last step may not be necessary in more recent releases):

```
sourcetype=mylargesourcetype
[ gentimes start=-1000 increment=1d
| eval n=(random() % 2123456)
| fields n
| format maxresults=999999 ]
```

gkanapathy (Splunk Employee), 08-17-2010 03:12 AM

*completely* useless.

gkanapathy (Splunk Employee), 08-17-2010 03:15 AM

To answer your question, you probably really need summary indexing, particularly if you are looking for users who've logged in once or twice out of many millions of events and presumably many tens of thousands of users.


gkanapathy (Splunk Employee), 08-17-2010 07:23 PM

edited answer above