Hi Splunkers!
Some days ago, one of my colleagues told me that "if you want to delete duplicates in your search, using stats count by yourfield is more efficient than using dedup yourfield, because stats has better performance: it doesn't have to compare ALL the elements of the search, while dedup does". He didn't give me any demonstration of it, though.
Is that true?
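To make the claim concrete, here is a minimal sketch of the two patterns being compared (your_index and yourfield are placeholders):
index=your_index | dedup yourfield | table yourfield
index=your_index | stats count by yourfield | fields - count
Both should return one row per distinct value of yourfield.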
I've been digging around the internet for days, but I can't find an official answer, just some well-argued takes:
https://antipaucity.com/2018/03/08/more-thoughts-on-stats-vs-dedup-in-splunk/#.XfJoU-hKiUk
https://www.reddit.com/r/Splunk/comments/91nqsc/more_thoughts_on_stats_vs_dedup_in_splunk/
Somebody here even says that stats dc(yourfield) is faster than a plain stats:
https://answers.splunk.com/answers/483100/is-there-a-performance-impact-by-using-dedup-comma.html
To me it makes complete sense, because it's easier to count (or distinct-count) elements grouped by one field than to check whether each element already appears anywhere in ALL the data set.
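Note that the two stats variants answer slightly different questions, so they are only interchangeable if all you need is the number (yourfield is again a placeholder):
index=your_index | stats count by yourfield
returns one row per distinct value, while
index=your_index | stats dc(yourfield)
returns a single row containing just the number of distinct values, which is far less data to carry around when the values themselves don't matter.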
So, what do you guys think? Is there any REAL performance improvement in using stats over dedup? Is there an official answer to this question?
I'm just looking to improve my queries as best I can.
Thank you all!!
I just found this to absolutely be the case, and was able to use this method to tune a bunch of my queries in one of my dashboards. My use-case is that I'm looking for a unique list of hosts reporting to a given index within a timeframe. Here's a small example of the efficiency gain I'm seeing:
Using "dedup host" : scanned 5.4 million events in 171.24 seconds
Using "stats max(_time) by host" : scanned 5.4 million events in 22.672 seconds
I was so impressed by the improvement that I searched for a deeper rationale and found this post instead. I'm sure there's a sophisticated internal answer for this significantly improved execution path, but for now I'll just be happy that it works as well as it does.
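For anyone who wants to reproduce the comparison, the two searches looked roughly like this (your_index is a placeholder, and earliest=-24h stands in for whatever timeframe you need):
index=your_index earliest=-24h | dedup host | table host
index=your_index earliest=-24h | stats max(_time) by host
The stats version also gives you each host's most recent report time for free.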
I've heard this discussion before, and I just had a user run a search that is a prime candidate for this, so I did some comparing. This 24-hour search covered about 10-15 TB of raw data and returned 62,023 client IP/username pairs.
The base search was something like this:
index IN (index1,index2,index3)
event=specific_type_auth_event
username IN (user1,user2,username*)
This was piped into 3 different options, and based on the overall runtime, I'll keep using stats for my deduping.
Stats took 67 seconds to run:
| stats count by client_ip, username
| table client_ip, username
Dedup took 113 seconds:
| dedup client_ip, username
| table client_ip, username
Dedup without the raw field took 97 seconds:
| fields + username,client_ip
| fields - _raw
| dedup client_ip, username
| table client_ip, username
I also tried other variations, like fields - _* to strip out all internal fields, but it had no noticeable effect on either stats or dedup.
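For completeness, that last variation looked roughly like this (reconstructed from the description above, so treat it as a sketch):
index IN (index1,index2,index3) event=specific_type_auth_event username IN (user1,user2,username*)
| fields - _*
| dedup client_ip, username
| table client_ip, username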
Assuming you want a list of all values of a field in an index, both these searches would give you that:
index=a | stats count by field | fields - count
index=a | dedup field | table field
Fundamentally, both searches have to do the same work: read every matching event off disk, extract the field, and reduce the events down to the set of distinct values.
Assuming both commands are built well, there will not be a huge difference in performance. You can verify this by looking at the big numbers to the right of dispatch.stream.remote.indexernamehere in the Job Inspector; both should show similarly small amounts of data returned to the search head. When comparing run times, make sure you do several executions to get a good average and to iron out other activity on the system.
There can be subtle differences.
- dedup should not allow batch-mode searches, because it requires event ordering, and may therefore rule out parallel search pipelines (I didn't verify this)
- a less smart use of dedup may cause more data to be carried around than necessary, e.g. the _raw event (see the sketch after this list)
- large stats results will cause an on-disk mergesort, slowing the search-head phase of the search down significantly
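A minimal sketch of the "smarter" dedup the second point alludes to, keeping only the field you need and explicitly dropping _raw before deduplicating (yourfield is a placeholder):
index=your_index
| fields + yourfield
| fields - _raw
| dedup yourfield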
@martin_mueller and I have argued about this several times. He seems to have it straight in his mind, but for some reason when he has tried to convince me, I just don't see it. Testing with searches has been very inconclusive when judging strictly by run time (no clear winner). Testing for your own use case and events is probably the best option, because it doesn't take very long to try all 3 and check the Job Inspector.
Hi faguilar,
According to this page, that is simply not true.
Here's an explanation from that page:
Other commands require all of the events from all of the indexers before the command can finish. These are referred to as non-streaming commands. Examples of non-streaming commands are stats, sort, dedup, top, and append. Non-streaming commands can run only when all of the data is available. To process non-streaming commands, all of the search results from the indexers are sent to the search head. When this happens, all further processing must be performed by the search head, rather than in parallel on the indexers.
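A practical consequence of that: any streaming commands you put before the non-streaming one still run in parallel on the indexers, so trimming events early shrinks what has to be shipped to the search head. A minimal sketch (your_index and your_sourcetype are placeholders):
index=your_index sourcetype=your_sourcetype
| fields + host
| fields - _raw
| stats count by host
Everything up to and including the fields commands is streaming and runs on the indexers; only the final stats has to complete on the search head, and by then each event has been stripped down to the one field it needs.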
-Mike