Re: Dedup vs. Stats performance

faguilar · ‎12-12-2019

Hi Splunkers!

Some days ago, one of my colleagues told me that "if you want to delete duplicates on your search, using a stats count by yourfield is more efficient than using dedup yourfield because it has better performance since stats doesn't have to compare ALL the elements of the search while dedup does", but he didn't give me to me any demonstration about it.

Is that true?

I've been digging for days on the internet, but I can't find an official answer, just some good argumented approaches:

https://antipaucity.com/2018/03/08/more-thoughts-on-stats-vs-dedup-in-splunk/#.XfJoU-hKiUk

https://www.reddit.com/r/Splunk/comments/91nqsc/more_thoughts_on_stats_vs_dedup_in_splunk/

Somebody even says here that stats dc(yourfield) it's even faster than a simple stats:
https://answers.splunk.com/answers/483100/is-there-a-performance-impact-by-using-dedup-comma.html

For me it makes completely sense, because it's easier to count (or distinct count) just elements by one unique field than check if that same element exists within ALL the data sets.

So, what do you guys think? Is there any REAL performance improvement in using stats over using dedup? Is there any official answer about this question?

I'm just looking for improve my queries the best as I can.

Thank you all!!

sutton115 · ‎11-10-2022

I just found this to absolutely be the case, and was able to use this method to tune a bunch of my queries in one of my dashboards. My use-case is that I'm looking for a unique list of hosts reporting to a given index within a timeframe. Here's a small example of the efficiency gain I'm seeing:

Using "dedup host" : scanned 5.4 million events in 171.24 seconds

Using "stats max(_time) by host" : scanned 5.4 million events in 22.672 seconds

I was so impressed by the improvement that I searched for a deeper rationale and found this post instead. I'm sure there's a sophisticated internal answer for this significantly improved execution path, but for now I'll just be happy that it works as well as it does.

jwrjrobertson05 · ‎11-02-2020

I've heard this discussion before, and just had a user run a search that is a prime candidate for this so I did some comparing. This 24-hour search covered about 10-15Tb of raw data and returned 62,023 pairs

The base search was something like this:

index IN (index1,index2,index3)
event=specific_type_auth_event
username IN (user1,user2,username*)

This was piped into 3 different options and based on the overall runtime, I'll keep using stats for my deduping.

Stats took 67 seconds to run:

| stats count by clientip,username
| table clientip,username

dedup took 113 seconds

| dedup client_ip, username 
| table client_ip, username

Dedup without the raw field took 97 seconds

| fields + username,client_ip
| fields - _raw
| dedup client_ip, username
| table client_ip, username

I also used other variations like fields - _* to pull out all internal fields, but it didn't have noticeable effect for stats or dedup.

martin_mueller · ‎12-12-2019

Assuming you want a list of all values of a field in an index, both these searches would give you that:

index=a | stats count by field | fields - count

index=a | dedup field | table field

Fundamentally, both searches have to do the same work:

load all events matching the search
extract, alias, calculate, lookup, whatever to produce the field
produce a deduplicated list on each indexer (prestats / prededup in remoteSearch in the job inspector) to return to the search head
merge those lists into one on the search head

Assuming both commands are built well, there will not be a huge difference in performance. You can verify this by looking at the big numbers to the right of dispatch.stream.remote.indexernamehere in the job inspector, both should show similar and small amounts of data returned to the search head. When looking at run time, make sure you do several executions to get a good average and iron out other activities on the system.

There can be subtle differences.
- dedup should not allow batch mode searches, but instead requires event ordering and may therefore not allow parallel search pipelines, didn't verify this
- less smart use of dedup may cause more data to be carried around, e.g. the _raw event
- large stats results will cause an on-disk mergesort, slowing the search head phase of the search down significantly

woodcock · ‎12-12-2019

@martin_mueller and I have argued about this several times. He seems to have it straight in his mind but for some reason when he has tried to convince me, I just don't see it. Testing with searches has been very inconclusive when judging strictly by run-time (no clear winner). Probably testing for your use-case and events is the best option because it doesn't take very long to try all 3 and check the Job Inspector.

BainM · ‎12-12-2019

HI faguilar-

According to this page, that is simply not true.

Here's an explanation from that page:

Other commands require all of the events from all of the indexers before
the command can finish. These are referred to as non-streaming commands. Examples of non-streaming commands are stats, sort, dedup, top, and append.

Non-streaming commands can run only when all of the data is available. To process non-streaming commands, all of the search results from the indexers
are sent to the search head. When this happens, all further processing must be performed by the search head, rather than in parallel on the indexers.

-Mike

Dedup vs. Stats performance

Mastering Data Pipelines: Unlocking Value with Splunk

The Latest Cisco Integrations With Splunk Platform!

AI Adoption Hub Launch | Curated Resources to Get Started with AI in Splunk