Splunk Search

Dedup vs. Stats performance

Path Finder

Hi Splunkers!

Some days ago, one of my colleagues told me that "if you want to delete duplicates on your search, using a stats count by yourfield is more efficient than using dedup yourfield because it has better performance since stats doesn't have to compare ALL the elements of the search while dedup does", but he didn't give me to me any demonstration about it.

Is that true?

I've been digging for days on the internet, but I can't find an official answer, just some good argumented approaches:

https://antipaucity.com/2018/03/08/more-thoughts-on-stats-vs-dedup-in-splunk/#.XfJoU-hKiUk

https://www.reddit.com/r/Splunk/comments/91nqsc/more_thoughts_on_stats_vs_dedup_in_splunk/

Somebody even says here that stats dc(yourfield) it's even faster than a simple stats:
https://answers.splunk.com/answers/483100/is-there-a-performance-impact-by-using-dedup-comma.html

For me it makes completely sense, because it's easier to count (or distinct count) just elements by one unique field than check if that same element exists within ALL the data sets.

So, what do you guys think? Is there any REAL performance improvement in using stats over using dedup? Is there any official answer about this question?

I'm just looking for improve my queries the best as I can.

Thank you all!!

SplunkTrust
SplunkTrust

Assuming you want a list of all values of a field in an index, both these searches would give you that:

index=a | stats count by field | fields - count

index=a | dedup field | table field

Fundamentally, both searches have to do the same work:

  • load all events matching the search
  • extract, alias, calculate, lookup, whatever to produce the field
  • produce a deduplicated list on each indexer (prestats / prededup in remoteSearch in the job inspector) to return to the search head
  • merge those lists into one on the search head

Assuming both commands are built well, there will not be a huge difference in performance. You can verify this by looking at the big numbers to the right of dispatch.stream.remote.indexernamehere in the job inspector, both should show similar and small amounts of data returned to the search head. When looking at run time, make sure you do several executions to get a good average and iron out other activities on the system.

There can be subtle differences.
- dedup should not allow batch mode searches, but instead requires event ordering and may therefore not allow parallel search pipelines, didn't verify this
- less smart use of dedup may cause more data to be carried around, e.g. the _raw event
- large stats results will cause an on-disk mergesort, slowing the search head phase of the search down significantly

Esteemed Legend

@martin_mueller and I have argued about this several times. He seems to have it straight in his mind but for some reason when he has tried to convince me, I just don't see it. Testing with searches has been very inconclusive when judging strictly by run-time (no clear winner). Probably testing for your use-case and events is the best option because it doesn't take very long to try all 3 and check the Job Inspector.

Communicator

HI faguilar-

According to this page, that is simply not true.

Here's an explanation from that page:

Other commands require all of the events from all of the indexers before
the command can finish. These are referred to as non-streaming commands. Examples of non-streaming commands are stats, sort, dedup, top, and append.

Non-streaming commands can run only when all of the data is available. To process non-streaming commands, all of the search results from the indexers
are sent to the search head. When this happens, all further processing must be performed by the search head, rather than in parallel on the indexers.

-Mike