Splunk Search

What's the difference between dc (distinct count) and estdc (estimated distinct count)?

khourihan_splun
Splunk Employee

I have a search that computes a unique-visitor count over 30 days' worth of logs.

Using dc() it was a lot slower than estdc(). Here is the comparison:

estdc: 3,300 seconds, distinct count 15,351,270
dc: 17,700 seconds, distinct count 15,134,261

estdc looks good enough, especially given that it's fairly accurate (about a 1.5% difference) and MUCH faster. Any information would be appreciated.

1 Solution

khourihan_splun
Splunk Employee

Basically, the technique is based on hashing and hash collisions: you can estimate how many distinct items were hashed from the number of hash collisions and the size of the hash table.

It uses roughly constant time and memory regardless of the number of unique values. The technique is accurate to about 1-2%, and it may either over- or undercount.
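Splunk doesn't spell out the exact algorithm here, but the "constant memory, 1-2% error" description matches a HyperLogLog-style sketch. Below is a minimal illustrative version in Python; the function name, register count b, and hash choice are my assumptions for illustration, not Splunk's actual implementation:

```python
import hashlib
import math

def estimate_distinct(values, b=12):
    """HyperLogLog-style distinct-count estimate (illustrative sketch).

    Memory is fixed at m = 2**b small registers no matter how many
    values arrive, which is why time and space stay roughly constant.
    """
    m = 1 << b
    registers = [0] * m
    for v in values:
        # 64-bit hash of the value
        h = int.from_bytes(hashlib.sha1(str(v).encode()).digest()[:8], "big")
        idx = h & (m - 1)   # low b bits choose a register
        w = h >> b          # remaining 64 - b bits
        # rank = 1 + number of leading zero bits in w
        rank = (64 - b) - w.bit_length() + 1
        registers[idx] = max(registers[idx], rank)
    # harmonic-mean estimator with the usual bias-correction constant
    alpha = 0.7213 / (1 + 1.079 / m)
    est = alpha * m * m / sum(2.0 ** -r for r in registers)
    # linear-counting correction for small cardinalities
    zeros = registers.count(0)
    if est <= 2.5 * m and zeros:
        est = m * math.log(m / zeros)
    return int(est)
```

With b = 12 the sketch keeps 4,096 one-byte registers, and the typical relative error of this family of estimators is about 1.04/sqrt(m), roughly 1.6%, which lines up with the 1-2% figure above.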



VatsalJagani
SplunkTrust

@khourihan_splunk - Could you please elaborate on how it uses constant time and resources regardless of the number of values? As I understand it, if I search for estdc(bytes), it needs to calculate the hash for each value of bytes and then go through all the hashes and count the number of collisions.


kulick
Path Finder

Just reading this thread.  Cool.  I didn't know about estdc().  I will definitely look for chances to use it now.

Regarding your question @VatsalJagani, I suspect this works by hashing each value (a well-chosen hash function is effectively O(1) per value, or O(n) in the length of the value, which is small), incrementing a "distinct" counter if that hash has not been seen before, and then recording in a table, indexed by the hash, that it has been seen. That is essentially constant time for each row/event considered. Of course, if two distinct values hash to the same bucket you will miscount, but if the hash table is large enough to keep collisions rare, you will get approximately the right answer. There are edge cases to handle in that description, but they are basically the same edge cases that arise in any hashtable implementation, and (sane) hashtable implementations are well understood to be O(1) for insert and lookup.
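The scheme described above can be sketched with a fixed-size table of hash buckets. Everything here (function name, table size, hash choice) is an assumption for illustration, not Splunk's actual implementation:

```python
import hashlib

def approx_distinct(values, table_bits=16):
    """Distinct count via a fixed-size table of 'seen' hash buckets.

    Each value is hashed once and the counter is bumped only the first
    time its bucket is hit, so the work per value is O(1). Two distinct
    values that collide into the same bucket are counted once, so the
    result undercounts slightly when the table is small relative to
    the true cardinality.
    """
    size = 1 << table_bits
    seen = bytearray(size // 8)   # one bit per bucket
    count = 0
    for v in values:
        digest = hashlib.blake2b(str(v).encode(), digest_size=8).digest()
        idx = int.from_bytes(digest, "big") & (size - 1)
        byte, bit = idx >> 3, idx & 7
        if not (seen[byte] >> bit) & 1:  # bucket not seen before
            seen[byte] |= 1 << bit
            count += 1
    return count
```

Duplicates never increment the counter because their bucket bit is already set; the only error comes from distinct values happening to share a bucket.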

Thanks for the Answer, @khourihan_splun !
