Splunk Search

Is there a best practice for counting distinct values over periods of time?

sc0tt
Builder

I would like to count unique users by day, week, and month, but I'm not sure what the preferred Splunk method is. I've been experimenting with different searches and summary indexes/accelerated reports, and I'm struggling to determine the most efficient solution.

I believe populating a summary index with a daily search such as

search | sistats dc(user_id) BY field1, field2, field3 

will work. However, my concern is that, since there are many unique users, the index can become quite large. Changing the search to something like

search | stats count BY user_id, field1, field2, field3

is another option, but this would create many events, and I don't think it would be any more efficient than just searching the source events.

Are there any recommended solutions for counting distinct users over periods of time?

1 Solution

sideview
SplunkTrust

The broader question here, "what's the best way to get a distinct count of X for each value of foo and bar", has the simple answer | stats dc(X) BY foo bar

But the question here is more about how to do this with summary indexing, which is complicated for distinct counts.

Populating a daily summary index search with the results of something like

search | sistats dc(user_id) BY field1, field2, field3 

will work great if you only want to report on distinct counts at day granularity. But for week and month granularities it won't work. The reason is that the sistats command doesn't preserve the actual user_id values, just the distinct counts for each combination of fields on that day. As such, it has no way of knowing how many of the 150 users it saw on one day are the same users it saw on any other day.
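For reference, the day-granularity report against such a summary is just the plain stats form of the same search, run over the summary data (the index and source names here are hypothetical):

index=summary source="si_daily_user_dc" | stats dc(user_id) BY field1, field2, field3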

On the other hand you could do

search | stats count BY user_id, field1, field2, field3

as your summary search. This would certainly work, but it produces quite a lot of rows if there are many unique combinations of user_id, field1, field2, and field3. If the number of such unique combinations approaches the number of events in your raw data, that's a cardinal sin: the summary index is probably a bad idea and won't be much faster than running the reports against the raw events.
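The upside of this per-user summary is that, because it retains the actual user_id values, you can compute distinct counts over any window. A weekly report might look something like this (index and source names hypothetical):

index=summary source="daily_user_counts" earliest=-1w@w latest=@w | stats dc(user_id) BY field1, field2, field3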

Sometimes a summary indexing use case gets a little overburdened. A telltale sign is when the summary-indexed search uses a lot of fields, but no single report that runs against that summary uses more than a couple. In that case, finding the fault lines and breaking the search up into two or more summary-indexed searches can alleviate this kind of "runaway cross product" and get you back to good performance.

And last but not least, there's the caveman approach of just running three summary-index searches of

search | stats dc(user_id) BY field1, field2, field3

on three schedules: once a day, once a week, and once a month.
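If all three schedules write to the same summary index, a marker field keeps the result sets separate, along these lines (the granularity field name is just an illustration):

search | stats dc(user_id) AS dc_users BY field1, field2, field3 | eval granularity="day"

with the weekly and monthly schedules tagging granularity="week" and granularity="month" respectively, and each report filtering on that field.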



sc0tt
Builder

Thanks for the detailed response. In the end, I've gone with the second search for now. It creates about 10% as many events as the raw data. That's still more data than I'd like, but I think it's sufficient for now. I may also set a short retention policy on the index to keep it from growing too large.
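For anyone following along, per-index retention is set in indexes.conf via frozenTimePeriodInSecs; a 90-day policy would look something like this (the index name is hypothetical):

[user_summary]
frozenTimePeriodInSecs = 7776000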


markthompson
Builder

I use DistinctCount if I want to achieve a similar outcome.
