Splunk Search

What's the difference between dc (distinct count) and estdc (estimated distinct count)?

khourihan_splun
Splunk Employee

I have a search that computes a unique-visitor count over 30 days' worth of logs.

Using dc() it was a lot slower than estdc(). Here is the comparison:

estdc: 3,300 seconds, distinct count 15,351,270
dc: 17,700 seconds, distinct count 15,134,261

estdc looks good enough, especially given that it's fairly accurate (about a 1.5% difference) and MUCH faster. Any information would be appreciated.

1 Solution

khourihan_splun
Splunk Employee

Basically, the technique is based on hashing and hash collisions: you can estimate how many distinct items were hashed from the number of hash collisions and the size of the hash table.

It uses roughly constant time and memory regardless of the number of unique values. The technique is accurate to about 1-2%, and it may either over- or undercount.
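Splunk doesn't spell out the exact algorithm here, but the "constant memory, 1-2% error" description matches a HyperLogLog-style sketch. Below is a minimal illustrative version in Python; the function name, register count b, and hash choice are my assumptions for illustration, not Splunk's actual implementation:

```python
import hashlib
import math

def estimate_distinct(values, b=12):
    """HyperLogLog-style distinct-count estimate (illustrative sketch).

    Memory is fixed at m = 2**b small registers no matter how many
    values arrive, which is why time and space stay roughly constant.
    """
    m = 1 << b
    registers = [0] * m
    for v in values:
        # 64-bit hash of the value
        h = int.from_bytes(hashlib.sha1(str(v).encode()).digest()[:8], "big")
        idx = h & (m - 1)   # low b bits choose a register
        w = h >> b          # remaining 64 - b bits
        # rank = 1 + number of leading zero bits in w
        rank = (64 - b) - w.bit_length() + 1
        registers[idx] = max(registers[idx], rank)
    # harmonic-mean estimator with the usual bias-correction constant
    alpha = 0.7213 / (1 + 1.079 / m)
    est = alpha * m * m / sum(2.0 ** -r for r in registers)
    # linear-counting correction for small cardinalities
    zeros = registers.count(0)
    if est <= 2.5 * m and zeros:
        est = m * math.log(m / zeros)
    return int(est)
```

With b = 12 the sketch keeps 4,096 one-byte registers, and the typical relative error of this family of estimators is about 1.04/sqrt(m), roughly 1.6%, which lines up with the 1-2% figure above.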



VatsalJagani
SplunkTrust

@khourihan_splunk - Could you please elaborate on how it uses constant time and resources regardless of the number of values? As I understand it, if I search for estdc(bytes), it needs to calculate the hash for each value of bytes and then go through all the hashes and count the number of collisions.


kulick
Path Finder

Just reading this thread.  Cool.  I didn't know about estdc().  I will definitely look for chances to use it now.

Regarding your question @VatsalJagani, I suspect this works by hashing each value (a well-chosen hash function is effectively O(1) per value, or O(n) in the length of the value, which is small), incrementing a "distinct" counter if that hash has not been seen before, and then recording in a table, indexed by the hash, that it has been seen. That is essentially constant time for each row/event considered. Of course, if two distinct values hash to the same bucket you will miscount, but if the hash table is large enough to keep collisions rare, you will get approximately the right answer. There are edge cases to handle in that description, but they are basically the same edge cases that arise in any hashtable implementation, and (sane) hashtable implementations are well understood to be O(1) for insert and lookup.
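The scheme described above can be sketched with a fixed-size table of hash buckets. Everything here (function name, table size, hash choice) is an assumption for illustration, not Splunk's actual implementation:

```python
import hashlib

def approx_distinct(values, table_bits=16):
    """Distinct count via a fixed-size table of 'seen' hash buckets.

    Each value is hashed once and the counter is bumped only the first
    time its bucket is hit, so the work per value is O(1). Two distinct
    values that collide into the same bucket are counted once, so the
    result undercounts slightly when the table is small relative to
    the true cardinality.
    """
    size = 1 << table_bits
    seen = bytearray(size // 8)   # one bit per bucket
    count = 0
    for v in values:
        digest = hashlib.blake2b(str(v).encode(), digest_size=8).digest()
        idx = int.from_bytes(digest, "big") & (size - 1)
        byte, bit = idx >> 3, idx & 7
        if not (seen[byte] >> bit) & 1:  # bucket not seen before
            seen[byte] |= 1 << bit
            count += 1
    return count
```

Duplicates never increment the counter because their bucket bit is already set; the only error comes from distinct values happening to share a bucket.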

Thanks for the Answer, @khourihan_splun !
