Splunk Search

What's the best way to get unique counts from long-term data?

twinspop
Influencer

We have a few busy indexes that can only retain about 20 days' worth of logs. The corner-office types want unique user counts for the last month. The first thing that popped into my head was doing a daily stats count by user and collecting that into a summary index (roughly the sketch below). But I'm seeing upwards of 2m unique users each day, so that collect will be pretty massive.
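
Roughly what I had in mind, with placeholder index and field names (my_index, summary_user_counts, userid), not what we actually run:

index=my_index sourcetype=my_sourcetype earliest=-1d@d latest=@d | 
stats count by userid | 
collect index=summary_user_counts

and then the monthly report would run against the summary index:

index=summary_user_counts earliest=-1mon@mon latest=@mon | 
stats dc(userid) as uniques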

Is there a better way? Maybe dumping that stats command into a kvstore?

At what point will collect fall over from too many lines of input?

Other considerations?

1 Solution

twinspop
Influencer

My solution:

  • Created a collection (KV store) with one defined field, timestamp
  • Created the lookup for that collection (rough config sketch below)
  • Created a saved search that runs every 4 hours...
  • ... and sends the results to outputlookup
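
The collection and lookup definitions look roughly like this; the collection name users_kv is my shorthand here, paired with the users_kv_lookup definition used in the searches below:

# collections.conf
[users_kv]
field.timestamp = number

# transforms.conf
[users_kv_lookup]
external_type = kvstore
collection = users_kv
fields_list = _key, timestamp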

I use the userid as the _key, but I needed a way to isolate a single month. E.g., November ended on a Thursday; I don't want the data to be polluted or overwritten with December data before analysts have had a chance to check it. So I'm prefixing every userid with the month:

sourcetype=my_sourcetype | 
stats latest(_time) as timestamp by userid | 
eventstats last(timestamp) as lasttime | 
eval _key=strftime(lasttime,"%b")."-".userid | 
fields _key timestamp | 
outputlookup users_kv_lookup append=true
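
And the every-4-hours schedule is just a scheduled saved search wrapping that SPL; the stanza name here is made up:

# savedsearches.conf
[users_kv_populate]
enableSched = 1
cron_schedule = 0 */4 * * *
dispatch.earliest_time = -4h@h
dispatch.latest_time = now
search = sourcetype=my_sourcetype | stats latest(_time) as timestamp by userid | eventstats last(timestamp) as lasttime | eval _key=strftime(lasttime,"%b")."-".userid | fields _key timestamp | outputlookup users_kv_lookup append=true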

Now with a simple inputlookup command I can get uniques for the month of November:

| inputlookup users_kv_lookup where _key="Nov-*" | stats dc(_key) as uniques

I included the timestamp for... reasons. Maybe it will be useful to see the last login time (quick example below). I thought keeping multiple records per user was too much, though. We have 5m users logging in multiple times per day, each session with tens to hundreds of hits. That could get ugly.
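
For example, a one-off check of a single user's last login, where 12345 is a made-up userid:

| inputlookup users_kv_lookup where _key="Nov-12345" | 
eval last_login=strftime(timestamp,"%Y-%m-%d %H:%M:%S")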

EDIT - Dawg, do u even optimize?!

It turns out overwriting existing uids with the same info is time-consuming and resource-intensive. Why not only put the new ones into the collection? Okay then. Duuuh!

sourcetype=my_sourcetype | 
stats latest(_time) as TS by userid | 
eventstats last(TS) as lasttime | 
eval _key=strftime(lasttime,"%b")."-".userid | 
lookup users_kv_lookup _key | 
where isnull(timestamp) |
fields _key TS | rename TS as timestamp | 
outputlookup users_kv_lookup append=true

Any userids with existing records are excluded by the where isnull() clause. The speedup was at least 2 orders of magnitude. Woot woot.

aljohnson_splun
Splunk Employee

Do you need the count for each user - or the distinct count of users?

If the latter, have you considered using estdc? If the unique user count can be approximated (1-2% error), give it a try!
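
For example, something like this over whatever window you're reporting on (sourcetype and field names borrowed from the rest of the thread, so adjust to taste):

sourcetype=my_sourcetype | 
stats estdc(userid) as approx_uniques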

Lastly, if you want to report on data that is older than your index retention, a summary index is a good choice. It might be easier to use the KV store as you mention; just make sure you remember to add append=t to your outputlookup searches, or you'll overwrite your old lookup!

twinspop
Influencer

I can't use estdc() because the data is not there. I need uniques for the whole month, but only have about 19 days of retention. I'll have to run some tests to see how collect and/or the KV store handles millions of records.
