<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Dedup vs. Stats performance in Splunk Search</title>
    <link>https://community.splunk.com/t5/Splunk-Search/Dedup-vs-Stats-performance/m-p/527587#M148942</link>
    <description></description>
    <pubDate>Mon, 02 Nov 2020 19:24:53 GMT</pubDate>
    <dc:creator>jwrjrobertson05</dc:creator>
    <dc:date>2020-11-02T19:24:53Z</dc:date>
    <item>
      <title>Dedup vs. Stats performance</title>
      <link>https://community.splunk.com/t5/Splunk-Search/Dedup-vs-Stats-performance/m-p/465388#M131105</link>
      <description>&lt;P&gt;Hi Splunkers!&lt;/P&gt;

&lt;P&gt;A few days ago, one of my colleagues told me that &lt;STRONG&gt;&lt;EM&gt;"if you want to remove duplicates from your search, using &lt;CODE&gt;stats count by yourfield&lt;/CODE&gt; is more efficient than using &lt;CODE&gt;dedup yourfield&lt;/CODE&gt;, because &lt;CODE&gt;stats&lt;/CODE&gt; doesn't have to compare ALL the elements of the search while &lt;CODE&gt;dedup&lt;/CODE&gt; does"&lt;/EM&gt;&lt;/STRONG&gt;, but he didn't give me any demonstration of it.&lt;/P&gt;

&lt;P&gt;Is that true?&lt;/P&gt;

&lt;P&gt;I've been digging around the internet for days, but I can't find an official answer, just some well-argued discussions:&lt;/P&gt;

&lt;P&gt;&lt;A href="https://antipaucity.com/2018/03/08/more-thoughts-on-stats-vs-dedup-in-splunk/#.XfJoU-hKiUk"&gt;https://antipaucity.com/2018/03/08/more-thoughts-on-stats-vs-dedup-in-splunk/#.XfJoU-hKiUk&lt;/A&gt;&lt;/P&gt;

&lt;P&gt;&lt;A href="https://www.reddit.com/r/Splunk/comments/91nqsc/more_thoughts_on_stats_vs_dedup_in_splunk/"&gt;https://www.reddit.com/r/Splunk/comments/91nqsc/more_thoughts_on_stats_vs_dedup_in_splunk/&lt;/A&gt;&lt;/P&gt;

&lt;P&gt;Somebody even says here that &lt;CODE&gt;stats dc(yourfield)&lt;/CODE&gt; is faster than a simple &lt;CODE&gt;stats&lt;/CODE&gt;:&lt;BR /&gt;
&lt;A href="https://answers.splunk.com/answers/483100/is-there-a-performance-impact-by-using-dedup-comma.html"&gt;https://answers.splunk.com/answers/483100/is-there-a-performance-impact-by-using-dedup-comma.html&lt;/A&gt;&lt;/P&gt;

&lt;P&gt;To me it makes complete sense, because it's easier to count (or distinct-count) elements by one unique field than to check whether that same element exists within ALL the data sets.&lt;/P&gt;

&lt;P&gt;So, what do you guys think? Is there any REAL performance improvement in using &lt;CODE&gt;stats&lt;/CODE&gt; over &lt;CODE&gt;dedup&lt;/CODE&gt;? Is there an official answer to this question?&lt;/P&gt;

&lt;P&gt;I'm just looking to improve my queries as best I can.&lt;/P&gt;

&lt;P&gt;Thank you all!!&lt;/P&gt;</description>
      <pubDate>Thu, 12 Dec 2019 16:41:10 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/Dedup-vs-Stats-performance/m-p/465388#M131105</guid>
      <dc:creator>faguilar</dc:creator>
      <dc:date>2019-12-12T16:41:10Z</dc:date>
    </item>
    <item>
      <title>Re: Dedup vs. Stats performance</title>
      <link>https://community.splunk.com/t5/Splunk-Search/Dedup-vs-Stats-performance/m-p/465389#M131106</link>
      <description>&lt;P&gt;Hi faguilar,&lt;/P&gt;

&lt;P&gt;&lt;A href="https://docs.splunk.com/Documentation/Splunk/7.2.7/Search/Writebettersearches"&gt;According to this page&lt;/A&gt;, that is simply not true. &lt;/P&gt;

&lt;P&gt;Here's an explanation from that page: &lt;/P&gt;

&lt;BLOCKQUOTE&gt;
&lt;P&gt;Other commands require all of the events from all of the indexers before the command can finish. These are referred to as non-streaming commands. Examples of &lt;STRONG&gt;non-streaming commands&lt;/STRONG&gt; are &lt;STRONG&gt;stats&lt;/STRONG&gt;, sort, &lt;STRONG&gt;dedup&lt;/STRONG&gt;, top, and append.&lt;/P&gt;

&lt;P&gt;Non-streaming commands can run only when all of the data is available. To process non-streaming commands, all of the search results from the indexers are sent to the search head. When this happens, all further processing must be performed by the search head, rather than in parallel on the indexers.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;

&lt;P&gt;-Mike &lt;/P&gt;</description>
      <pubDate>Thu, 12 Dec 2019 19:02:21 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/Dedup-vs-Stats-performance/m-p/465389#M131106</guid>
      <dc:creator>BainM</dc:creator>
      <dc:date>2019-12-12T19:02:21Z</dc:date>
    </item>
    <item>
      <title>Re: Dedup vs. Stats performance</title>
      <link>https://community.splunk.com/t5/Splunk-Search/Dedup-vs-Stats-performance/m-p/465390#M131107</link>
      <description>&lt;P&gt;@martin_mueller and I have argued about this several times. He seems to have it straight in his mind, but for some reason, whenever he has tried to convince me, I just don't see it. Testing with searches has been very inconclusive when judging strictly by run time (no clear winner). Testing against your own use case and events is probably the best option, because it doesn't take very long to try all 3 and check the &lt;CODE&gt;Job Inspector&lt;/CODE&gt;.&lt;/P&gt;</description>
      <pubDate>Thu, 12 Dec 2019 22:31:24 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/Dedup-vs-Stats-performance/m-p/465390#M131107</guid>
      <dc:creator>woodcock</dc:creator>
      <dc:date>2019-12-12T22:31:24Z</dc:date>
    </item>
    <item>
      <title>Re: Dedup vs. Stats performance</title>
      <link>https://community.splunk.com/t5/Splunk-Search/Dedup-vs-Stats-performance/m-p/465391#M131108</link>
      <description>&lt;P&gt;Assuming you want a list of all values of a field in an index, both of these searches would give you that:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;index=a | stats count by field | fields - count

index=a | dedup field | table field
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;Fundamentally, both searches have to do the same work:&lt;/P&gt;

&lt;UL&gt;
&lt;LI&gt;load all events matching the search &lt;/LI&gt;
&lt;LI&gt;extract, alias, calculate, lookup, whatever to produce the field&lt;/LI&gt;
&lt;LI&gt;produce a deduplicated list on each indexer (prestats / prededup in remoteSearch in the job inspector) to return to the search head &lt;/LI&gt;
&lt;LI&gt;merge those lists into one on the search head &lt;/LI&gt;
&lt;/UL&gt;

&lt;P&gt;Assuming both commands are built well, there will not be a huge difference in performance. You can verify this by looking at the big numbers to the right of dispatch.stream.remote.indexernamehere in the job inspector; both searches should show similar, small amounts of data returned to the search head. When comparing run times, make sure you do several executions to get a good average and iron out the effect of other activity on the system.&lt;/P&gt;
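
&lt;P&gt;The four steps listed above can be sketched in a few lines of Python. This is only an illustrative model, not how Splunk is actually implemented: the event data and indexer names are hypothetical, and the function names &lt;CODE&gt;prestats&lt;/CODE&gt; and &lt;CODE&gt;prededup&lt;/CODE&gt; are simply borrowed from the labels shown in the job inspector. It shows that both approaches reduce the data on each indexer first and then merge small partial results on the search head.&lt;/P&gt;

```python
from collections import Counter

# Hypothetical events, grouped by the indexer that holds them.
EVENTS_BY_INDEXER = {
    "idx1": [{"field": "a"}, {"field": "b"}, {"field": "a"}],
    "idx2": [{"field": "b"}, {"field": "c"}],
}


def prestats(events):
    """Indexer side of a stats-style search: partial counts per field value."""
    return Counter(event["field"] for event in events)


def prededup(events):
    """Indexer side of a dedup-style search: first event seen per field value."""
    first_seen = {}
    for event in events:
        first_seen.setdefault(event["field"], event)
    return list(first_seen.values())


def stats_values(events_by_indexer):
    """Search head: merge the partial counts, then drop the count column."""
    merged = Counter()
    for events in events_by_indexer.values():
        merged.update(prestats(events))
    return sorted(merged)


def dedup_values(events_by_indexer):
    """Search head: dedup the already-deduplicated lists one more time."""
    partial = []
    for events in events_by_indexer.values():
        partial.extend(prededup(events))
    return sorted(event["field"] for event in prededup(partial))


print(stats_values(EVENTS_BY_INDEXER))
print(dedup_values(EVENTS_BY_INDEXER))
```

&lt;P&gt;In this toy model both functions return the same list of values, and in both cases each indexer hands back at most one entry per distinct value, which is why the amount of data returned to the search head should look similar for the two searches.&lt;/P&gt;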

&lt;P&gt;There can be subtle differences:&lt;/P&gt;

&lt;UL&gt;
&lt;LI&gt;dedup requires event ordering, so it should not allow batch mode searches and may therefore not allow parallel search pipelines (I didn't verify this)&lt;/LI&gt;
&lt;LI&gt;less careful use of dedup may cause more data to be carried around, e.g. the _raw event&lt;/LI&gt;
&lt;LI&gt;large stats results will cause an on-disk mergesort, significantly slowing down the search head phase of the search&lt;/LI&gt;
&lt;/UL&gt;</description>
      <pubDate>Thu, 12 Dec 2019 22:45:04 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/Dedup-vs-Stats-performance/m-p/465391#M131108</guid>
      <dc:creator>martin_mueller</dc:creator>
      <dc:date>2019-12-12T22:45:04Z</dc:date>
    </item>
    <item>
      <title>Re: Dedup vs. Stats performance</title>
      <link>https://community.splunk.com/t5/Splunk-Search/Dedup-vs-Stats-performance/m-p/527587#M148942</link>
      <description>&lt;P&gt;I've heard this discussion before, and I just had a user run a search that is a prime candidate for it, so I did some comparing. This 24-hour search covered about 10-15 TB of raw data and returned 62,023 pairs.&lt;/P&gt;&lt;P&gt;The base search was something like this:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;index IN (index1,index2,index3)
event=specific_type_auth_event
username IN (user1,user2,username*)&lt;/LI-CODE&gt;&lt;P&gt;This was piped into 3 different options and, based on the overall runtime, I'll keep using stats for my deduping.&lt;/P&gt;&lt;P&gt;Stats took 67 seconds to run:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;| stats count by clientip,username
| table clientip,username&lt;/LI-CODE&gt;&lt;P&gt;Dedup took 113 seconds:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;| dedup client_ip, username
| table client_ip, username&lt;/LI-CODE&gt;&lt;P&gt;Dedup without the raw field took 97 seconds:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;| fields + username,client_ip
| fields - _raw
| dedup client_ip, username
| table client_ip, username&lt;/LI-CODE&gt;&lt;P&gt;I also tried other variations, like fields - _* to strip out all internal fields, but they had no noticeable effect for either stats or dedup.&lt;/P&gt;</description>
      <pubDate>Mon, 02 Nov 2020 19:24:53 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/Dedup-vs-Stats-performance/m-p/527587#M148942</guid>
      <dc:creator>jwrjrobertson05</dc:creator>
      <dc:date>2020-11-02T19:24:53Z</dc:date>
    </item>
    <item>
      <title>Re: Dedup vs. Stats performance</title>
      <link>https://community.splunk.com/t5/Splunk-Search/Dedup-vs-Stats-performance/m-p/620432#M215686</link>
      <description>&lt;P&gt;I just found this to absolutely be the case, and was able to use this method to tune a bunch of my queries in one of my dashboards. My use-case is that I'm looking for a unique list of hosts reporting to a given index within a timeframe. Here's a small example of the efficiency gain I'm seeing:&lt;/P&gt;&lt;P&gt;Using "dedup host": scanned 5.4 million events in 171.24 seconds&lt;/P&gt;&lt;P&gt;Using "stats max(_time) by host": scanned 5.4 million events in 22.672 seconds&lt;/P&gt;&lt;P&gt;I was so impressed by the improvement that I searched for a deeper rationale and found this post instead. I'm sure there's a sophisticated internal answer for this significantly improved execution path, but for now I'll just be happy that it works as well as it does.&lt;/P&gt;</description>
      <pubDate>Thu, 10 Nov 2022 14:04:55 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/Dedup-vs-Stats-performance/m-p/620432#M215686</guid>
      <dc:creator>sutton115</dc:creator>
      <dc:date>2022-11-10T14:04:55Z</dc:date>
    </item>
  </channel>
</rss>

