Splunk Search

Is there a best practice for counting distinct values over periods of time?

sc0tt
Builder

I would like to count unique users by day, week, and month, but I'm not sure what the preferred Splunk method is. I've been experimenting with different searches and summary indexes/accelerated reports, and I'm struggling to determine the most efficient solution.

I believe populating a summary index with a daily search such as

search | sistats dc(user_id) BY field1, field2, field3 

will work. However, my concern is that, since there are many unique users, the index can become quite large. Changing the search to something like

search | stats count BY user_id, field1, field2, field3

is another option, but this would create many events, and I don't think it would be any more efficient than just searching the source events.

Are there any recommended solutions for counting distinct users over periods of time?

1 Solution

sideview
SplunkTrust

The broader question here, "what's the best way to get a distinct count of X for each value of foo and bar", has the simple answer | stats dc(X) BY foo bar

But the question here is more about how to do this with summary indexing, which is complicated for distinct counts.

Populating a daily summary index search with the results of something like

search | sistats dc(user_id) BY field1, field2, field3 

will work great if you only want to report on distinct counts at day granularity. But for week and month granularities it won't work. The reason is that the sistats command doesn't preserve the actual user_id values, just the distinct counts for each combination of fields on that day. As such, it has no way of knowing how many of the 150 users it saw on one day are the same users it saw on any other day.
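For reference, the day-granularity report against such a summary is just the plain stats form of the same search, run over the summary data (the index and source names here are hypothetical):

index=summary source="si_daily_user_dc" | stats dc(user_id) BY field1, field2, field3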

On the other hand you could do

search | stats count BY user_id, field1, field2, field3

as your summary search. This would certainly work, but it produces quite a lot of rows if there are many unique combinations of user_id, field1, field2, and field3. If the number of such unique combinations approaches the number of events in your raw data, that's a cardinal sin: the summary index is probably a bad idea and won't be much faster than running the reports against the raw events.
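The upside of this per-user summary is that, because it retains the actual user_id values, you can compute distinct counts over any window. A weekly report might look something like this (index and source names hypothetical):

index=summary source="daily_user_counts" earliest=-1w@w latest=@w | stats dc(user_id) BY field1, field2, field3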

Sometimes a summary indexing use case gets a little overburdened. A telltale sign is when the summary-indexed search uses a lot of fields, but no single report that runs against that summary uses more than a couple. In that case, finding the fault lines and breaking the search up into two or more summary-indexed searches can alleviate this kind of "runaway cross product" and get you back to good performance.

And last but not least, there's the caveman approach of just running three summary-index searches of

search | stats dc(user_id) BY field1, field2, field3

on three schedules: once a day, once a week, and once a month.
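If all three schedules write to the same summary index, a marker field keeps the result sets separate, along these lines (the granularity field name is just an illustration):

search | stats dc(user_id) AS dc_users BY field1, field2, field3 | eval granularity="day"

with the weekly and monthly schedules tagging granularity="week" and granularity="month" respectively, and each report filtering on that field.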



sc0tt
Builder

Thanks for the detailed response. In the end, I've gone with the second search for now. It creates about 10% as many events as the raw data. That's still more data than I'd like, but I think it's sufficient for now. I may also set a short retention policy on the index to keep it from growing too large.
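For anyone following along, per-index retention is set in indexes.conf via frozenTimePeriodInSecs; a 90-day policy would look something like this (the index name is hypothetical):

[user_summary]
frozenTimePeriodInSecs = 7776000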


markthompson
Builder

I use DistinctCount if I want to achieve a similar outcome.
