
What's the difference between dc (distinct count) and estdc (estimated distinct count)?

Splunk Employee

I have a search that counts unique visitors over 30 days' worth of logs.

With dc() it was a lot slower than with estdc(). Here is the comparison:

estdc: 3300 seconds, 15351270 unique visitors
dc: 17700 seconds, 15134261 unique visitors

estdc looks good enough, given that it's fairly accurate (about 1.4% difference) and MUCH faster (more than five times). Any information on how the two differ would be appreciated.
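For reference, a quick check of those figures in Python, using only the numbers quoted in this post (the variable names are just for illustration):

```python
# Timings and counts exactly as quoted in the comparison above.
estdc_seconds, estdc_count = 3300, 15351270
dc_seconds, dc_count = 17700, 15134261

# Relative difference of the estimate, taking dc() as the exact answer.
relative_difference = (estdc_count - dc_count) / dc_count

# How many times faster the estdc() search ran.
speedup = dc_seconds / estdc_seconds

print(f"estdc differs from dc by {relative_difference:.2%}")
print(f"estdc was {speedup:.1f}x faster")
```

This works out to a difference of about 1.4% and a speedup of roughly 5.4x.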


Splunk Employee

Basically, the technique is based on hashing and hash collisions. You can estimate how many distinct items you have hashed from the number of hash collisions and the size of the hash table.

More or less, it uses constant time and resources regardless of the number of unique values. The technique is accurate to about 1-2%, although it may over- or undercount.
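To make the idea concrete, here is a minimal sketch of one collision-based estimator, known as linear counting. This is an illustration only: I am not claiming it is what estdc does internally, and the function name `estimate_distinct`, the `num_buckets` parameter, and the choice of BLAKE2 as the hash are all assumptions made for the example.

```python
import hashlib
import math


def estimate_distinct(values, num_buckets=1 << 16):
    """Estimate the number of distinct values with linear counting.

    Illustrative sketch only; not Splunk's actual estdc implementation.
    """
    # Fixed-size table of flags: memory is set by num_buckets up front
    # and does not grow with the number of unique values seen.
    occupied = bytearray(num_buckets)

    for value in values:
        # One hash and one table write per value; nothing is stored per value.
        digest = hashlib.blake2b(str(value).encode("utf-8"), digest_size=8).digest()
        occupied[int.from_bytes(digest, "big") % num_buckets] = 1

    empty = num_buckets - sum(occupied)
    if empty == 0:
        raise ValueError("table is saturated; use more buckets")

    # With n distinct values spread uniformly over m buckets, the expected
    # fraction of empty buckets is about exp(-n / m), so n ~ -m * ln(empty / m).
    return -num_buckets * math.log(empty / num_buckets)


# Usage: 50,000 distinct visitors, each appearing three times.
visitors = [f"user-{i}" for i in range(50000)] * 3
print(round(estimate_distinct(visitors)))
```

Repeated values land in a bucket that is already marked, so they do not change the result, and the estimate can come out slightly above or below the true count. In this sketch the accuracy depends on choosing `num_buckets` large enough relative to the expected number of distinct values.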

Motivator

@khourihan_splunk - Could you please elaborate on how it uses constant time and resources regardless of the number of values? As I understand it, if I search for estdc(bytes), it needs to calculate the hash of each value of bytes and then go through all the hashes and count the number of collisions.
