A few days ago, one of my colleagues told me that "if you want to remove duplicates from your search, using
stats count by yourfield is more efficient than
dedup yourfield, because
stats doesn't have to compare ALL the elements of the search while
dedup does", but he didn't give me any demonstration of it.
Is that true?
I've been digging around the internet for days, but I can't find an official answer, just some well-argued approaches:
Somebody even says here that
stats dc(yourfield) is even faster.
To me it makes complete sense, because it's easier to count (or distinct-count) elements by one unique field than to check whether that same element already exists within ALL the data sets.
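To be clear about what I'm comparing (yourfield and myindex here are just placeholders), the two stats forms answer different questions:

```spl
index=myindex | stats dc(yourfield)
index=myindex | stats count by yourfield | fields - count
```

The first returns a single number (how many distinct values exist); the second returns one row per distinct value, which is what dedup yourfield | table yourfield would give you.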
So, what do you guys think? Is there any REAL performance improvement in using
stats over using
dedup? Is there any official answer about this question?
I'm just looking to improve my queries as best I can.
Thank you all!!
I've heard this discussion before, and just had a user run a search that is a prime candidate for this, so I did some comparing. This 24-hour search covered about 10-15 TB of raw data and returned 62,023 pairs.
The base search was something like this:
index IN (index1,index2,index3) event=specific_type_auth_event username IN (user1,user2,username*)
This was piped into 3 different options and based on the overall runtime, I'll keep using stats for my deduping.
Stats took 67 seconds to run:
| stats count by clientip,username | table clientip,username
Dedup took 113 seconds:
| dedup client_ip, username | table client_ip, username
Dedup without the _raw field took 97 seconds:
| fields + username,client_ip | fields - _raw | dedup client_ip, username | table client_ip, username
I also tried other variations, like fields - _* to remove all internal fields, but it didn't have a noticeable effect on either stats or dedup.
Assuming you want a list of all values of a field in an index, both these searches would give you that:
index=a | stats count by field | fields - count
index=a | dedup field | table field
Fundamentally, both searches have to do the same work: read every matching event in index=a, extract the field, and reduce the results to one row per distinct value.
Assuming both commands are built well, there will not be a huge difference in performance. You can verify this by looking at the big numbers to the right of dispatch.stream.remote.indexernamehere in the job inspector, both should show similar and small amounts of data returned to the search head. When looking at run time, make sure you do several executions to get a good average and iron out other activities on the system.
There can be subtle differences.
- dedup should not allow batch mode searches, but instead requires event ordering, and may therefore not allow parallel search pipelines (I didn't verify this)
- careless use of dedup may cause more data to be carried around, e.g. the _raw field
- large stats results will cause an on-disk mergesort, slowing the search head phase of the search down significantly
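Putting those points together, a rough sketch of a leaner dedup versus its stats equivalent (index and field names are just examples): dropping _raw and unneeded fields before dedup avoids carrying whole events to the search head.

```spl
index=a | fields + clientip, username | fields - _raw | dedup clientip, username | table clientip, username
index=a | stats count by clientip, username | fields - count
```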
@martin_mueller and I have argued about this several times. He seems to have it straight in his mind, but for some reason, when he has tried to convince me, I just don't see it. Testing with searches has been very inconclusive when judging strictly by run time (no clear winner). Testing against your own use case and events is probably the best option, because it doesn't take very long to try all 3 and check the job inspector.
According to this page, that is simply not true.
Here's an explanation from that page:
Other commands require all of the events from all of the indexers before the command can finish. These are referred to as non-streaming commands. Examples of non-streaming commands are stats, sort, dedup, top, and append.

Non-streaming commands can run only when all of the data is available. To process non-streaming commands, all of the search results from the indexers are sent to the search head. When this happens, all further processing must be performed by the search head, rather than in parallel on the indexers.
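In practice, that means everything before the first non-streaming command can still run in parallel on the indexers. For example (the sourcetype and field names here are hypothetical):

```spl
index=a sourcetype=access_combined
| eval user=lower(username)
| dedup user
```

The initial search and the streaming eval run on the indexers; from dedup onward, the results are collected on the search head and all further processing happens there.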