<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: What's the best way to get unique counts from long term data? in Splunk Search</title>
    <link>https://community.splunk.com/t5/Splunk-Search/What-s-the-best-way-to-get-unique-counts-from-long-term-data/m-p/254510#M76230</link>
    <description>&lt;P&gt;Do you need the count for each user - or the distinct count of users?&lt;/P&gt;

&lt;P&gt;If the latter, have you considered using &lt;CODE&gt;estdc&lt;/CODE&gt;? If the unique user count can be approximated &lt;A href="https://answers.splunk.com/answers/98220/whats-the-difference-between-dc-distinct-count-and-estdc-estimated-distinct-count.html"&gt;(1-2% error)&lt;/A&gt;, give it a try!&lt;/P&gt;

&lt;P&gt;Lastly, if you want to report on data that is older than your index retention, a summary index is a good choice. It might be easier to use the KV store as you mention; just make sure you remember to add &lt;CODE&gt;append=t&lt;/CODE&gt; to your searches, or you'll overwrite your old lookup!&lt;/P&gt;</description>
    <pubDate>Fri, 02 Dec 2016 20:57:19 GMT</pubDate>
    <dc:creator>aljohnson_splun</dc:creator>
    <dc:date>2016-12-02T20:57:19Z</dc:date>
    <item>
      <title>What's the best way to get unique counts from long term data?</title>
      <link>https://community.splunk.com/t5/Splunk-Search/What-s-the-best-way-to-get-unique-counts-from-long-term-data/m-p/254509#M76229</link>
      <description>&lt;P&gt;We have a few busy indexes that can only retain about 20 days worth of logs. The corner-office-types want unique user counts for the last month. The first thing that popped into my head was doing a daily &lt;CODE&gt;stats count by user&lt;/CODE&gt; and collecting that into a summary index. But, I'm seeing upwards of 2m unique users each day. So that &lt;CODE&gt;collect&lt;/CODE&gt; will be pretty massive.&lt;/P&gt;

&lt;P&gt;Is there a better way? Maybe dumping that &lt;CODE&gt;stats&lt;/CODE&gt; command into a kvstore?&lt;/P&gt;

&lt;P&gt;At what point will &lt;CODE&gt;collect&lt;/CODE&gt; fall over from too many lines of input?&lt;/P&gt;

&lt;P&gt;Other considerations?&lt;/P&gt;</description>
      <pubDate>Fri, 02 Dec 2016 20:10:15 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/What-s-the-best-way-to-get-unique-counts-from-long-term-data/m-p/254509#M76229</guid>
      <dc:creator>twinspop</dc:creator>
      <dc:date>2016-12-02T20:10:15Z</dc:date>
    </item>
    <item>
      <title>Re: What's the best way to get unique counts from long term data?</title>
      <link>https://community.splunk.com/t5/Splunk-Search/What-s-the-best-way-to-get-unique-counts-from-long-term-data/m-p/254510#M76230</link>
      <description>&lt;P&gt;Do you need the count for each user - or the distinct count of users?&lt;/P&gt;

&lt;P&gt;If the latter, have you considered using &lt;CODE&gt;estdc&lt;/CODE&gt;? If the unique user count can be approximated &lt;A href="https://answers.splunk.com/answers/98220/whats-the-difference-between-dc-distinct-count-and-estdc-estimated-distinct-count.html"&gt;(1-2% error)&lt;/A&gt;, give it a try!&lt;/P&gt;
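
&lt;P&gt;Something along these lines (sourcetype and field names are placeholders for your own):&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;sourcetype=my_sourcetype earliest=-30d@d | 
stats estdc(user) as approx_unique_users
&lt;/CODE&gt;&lt;/PRE&gt;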

&lt;P&gt;Lastly, if you want to report on data that is older than your index retention, a summary index is a good choice. It might be easier to use the KV store as you mention; just make sure you remember to add &lt;CODE&gt;append=t&lt;/CODE&gt; to your searches, or you'll overwrite your old lookup!&lt;/P&gt;</description>
      <pubDate>Fri, 02 Dec 2016 20:57:19 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/What-s-the-best-way-to-get-unique-counts-from-long-term-data/m-p/254510#M76230</guid>
      <dc:creator>aljohnson_splun</dc:creator>
      <dc:date>2016-12-02T20:57:19Z</dc:date>
    </item>
    <item>
      <title>Re: What's the best way to get unique counts from long term data?</title>
      <link>https://community.splunk.com/t5/Splunk-Search/What-s-the-best-way-to-get-unique-counts-from-long-term-data/m-p/254511#M76231</link>
      <description>&lt;P&gt;I can't use estdc() because the data is not there. I need a uniques for the whole month, but only have about 19 days of retention. I'll have to run some tests to see how collect and/or the kvstore handles millions of records.&lt;/P&gt;</description>
      <pubDate>Sat, 03 Dec 2016 15:55:20 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/What-s-the-best-way-to-get-unique-counts-from-long-term-data/m-p/254511#M76231</guid>
      <dc:creator>twinspop</dc:creator>
      <dc:date>2016-12-03T15:55:20Z</dc:date>
    </item>
    <item>
      <title>Re: What's the best way to get unique counts from long term data?</title>
      <link>https://community.splunk.com/t5/Splunk-Search/What-s-the-best-way-to-get-unique-counts-from-long-term-data/m-p/254512#M76232</link>
      <description>&lt;P&gt;My solution:&lt;/P&gt;

&lt;UL&gt;
&lt;LI&gt;Created a collection (KV store) with one defined field, timestamp&lt;/LI&gt;
&lt;LI&gt;Created the lookup for that collection&lt;/LI&gt;
&lt;LI&gt;Created a saved search that runs every 4 hours... &lt;/LI&gt;
&lt;LI&gt;... and sends the results to &lt;CODE&gt;outputlookup&lt;/CODE&gt;&lt;/LI&gt;
&lt;/UL&gt;
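
&lt;P&gt;For reference, the collection and lookup stanzas behind the first two steps look roughly like this (my stanza names; yours may differ):&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;# collections.conf
[users_kv]
field.timestamp = time

# transforms.conf
[users_kv_lookup]
external_type = kvstore
collection = users_kv
fields_list = _key, timestamp
&lt;/CODE&gt;&lt;/PRE&gt;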

&lt;P&gt;I use the userid as the _key, but I needed a way to isolate to a month. E.g., November ended on a Thursday. I don't want the data to be polluted or overwritten with December data before analysts have had a chance to check. So I'm prepending every userid with the month:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;sourcetype=my_sourcetype | 
stats latest(_time) as timestamp by userid | 
eventstats last(timestamp) as lasttime | 
eval _key=strftime(lasttime,"%b")."-".userid | 
fields _key timestamp | 
outputlookup users_kv_lookup append=true
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;Now with a simple &lt;CODE&gt;inputlookup&lt;/CODE&gt; command I can get uniques for the month of November:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;| inputlookup users_kv_lookup where _key="Nov-*" | stats dc(_key) as uniques
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;I included the timestamp for... reasons. Maybe it will be useful to see the last login time. I thought having multiple logs for each user was too much though. We have 5m users logging in multiple times per day, each session with 10s - 100s of hits. That could get ugly.&lt;/P&gt;
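
&lt;P&gt;If the last login time does turn out to be useful, it's easy to pull back out; e.g., something like this (stripping the month prefix off the key):&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;| inputlookup users_kv_lookup where _key="Nov-*" | 
eval userid=replace(_key, "^Nov-", ""), last_login=strftime(timestamp, "%F %T") | 
fields userid last_login
&lt;/CODE&gt;&lt;/PRE&gt;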

&lt;P&gt;EDIT - Dawg, do u even optimize?!&lt;/P&gt;

&lt;P&gt;It seems like overwriting existing uids with the same info is time-consuming / resource-intensive. Why not only put the new ones into the collection? Okay then. Duuuh!&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;sourcetype=my_sourcetype | 
stats latest(_time) as TS by userid | 
eventstats last(TS) as lasttime | 
eval _key=strftime(lasttime,"%b")."-".userid | 
lookup users_kv_lookup _key | 
where isnull(timestamp) |
fields _key TS | rename TS as timestamp | 
outputlookup users_kv_lookup append=true
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;Userids that already have records are excluded by the &lt;CODE&gt;where isnull()&lt;/CODE&gt; clause. The speedup was at least 2 orders of magnitude. Woot woot.&lt;/P&gt;</description>
      <pubDate>Tue, 06 Dec 2016 23:58:48 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/What-s-the-best-way-to-get-unique-counts-from-long-term-data/m-p/254512#M76232</guid>
      <dc:creator>twinspop</dc:creator>
      <dc:date>2016-12-06T23:58:48Z</dc:date>
    </item>
  </channel>
</rss>

