<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Best practice dedup: should I use it as early as possible, or postpone it since it is non-streaming? in Splunk Search</title>
    <link>https://community.splunk.com/t5/Splunk-Search/Best-practice-dedup-should-I-use-it-as-early-as-possible-or/m-p/712731#M240392</link>
    <description>&lt;P&gt;I believe this answer is not quite correct.&amp;nbsp; The optimized query is:&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;PRE&gt;index=main sourcetype="access_combined_wcookie" action=purchase status=200 file="success.do" 
| table JSESSIONID, action, status&lt;BR /&gt;| stats count by JSESSIONID, action, status&lt;BR /&gt;| rename JSESSIONID as UserSessions&lt;/PRE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;In a clustered Splunk environment, lines 1-2 execute in parallel on your indexers, the minimized data is then passed to the search head, the search head executes line 3, and then line 4 only operates on 1 row of data.&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;I try to always do a TABLE early in the query, especially before doing an expensive DEDUP, STATS, or BIN.&amp;nbsp; That reduces the dataset on all your indexers, discarding unneeded fields, before it's merged on your search head.&amp;nbsp; Instead of TABLE you could alternatively do two FIELDS commands, one to include the necessary fields and another to remove _raw.&amp;nbsp; Computationally I don't know whether Splunk is more efficient handling event data from FIELDS or handling transformed data from TABLE, but TABLE makes the query simpler.&lt;/P&gt;</description>
    <pubDate>Thu, 27 Feb 2025 19:27:51 GMT</pubDate>
    <dc:creator>satyenshah</dc:creator>
    <dc:date>2025-02-27T19:27:51Z</dc:date>
    <item>
      <title>Best practice dedup: should I use it as early as possible, or postpone it since it is non-streaming?</title>
      <link>https://community.splunk.com/t5/Splunk-Search/Best-practice-dedup-should-I-use-it-as-early-as-possible-or/m-p/391861#M171046</link>
      <description>&lt;P&gt;In the Fundamentals 1 course, lab 8 tells us:&lt;BR /&gt;
"As a best practice and for best performance, place dedup as early in the search as possible." (page 4)&lt;/P&gt;

&lt;P&gt;But the Quick Reference Guide tells us:&lt;BR /&gt;
"Postpone commands that process over the entire result set (non-streaming commands) as late as possible in your search. Some of these commands are: dedup, sort, and stats" (page 2)&lt;/P&gt;

&lt;P&gt;The example search they give in lab 8 places dedup before the distributable streaming command 'rename':&lt;BR /&gt;
index=main sourcetype="access_combined_wcookie" action=purchase status=200 file="success.do" &lt;BR /&gt;
| dedup JSESSIONID&lt;BR /&gt;
| table JSESSIONID, action, status&lt;BR /&gt;
| rename JSESSIONID as UserSessions&lt;/P&gt;

&lt;P&gt;Would it not make sense to place dedup after rename? I guess 'as early as possible' is ambiguous anyway, but any input on where to place dedup would be greatly appreciated.&lt;/P&gt;
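
&lt;P&gt;For concreteness, placing dedup after rename would look something like the sketch below. This is an untested rearrangement of the lab 8 search, not something taken from the course material; note that dedup then has to reference the renamed field UserSessions, since JSESSIONID no longer exists at that point in the pipeline:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;index=main sourcetype="access_combined_wcookie" action=purchase status=200 file="success.do" 
| table JSESSIONID, action, status
| rename JSESSIONID as UserSessions
| dedup UserSessions
&lt;/CODE&gt;&lt;/PRE&gt;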

&lt;P&gt;Cheers,&lt;BR /&gt;
Roelof&lt;/P&gt;</description>
      <pubDate>Wed, 30 Sep 2020 00:43:05 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/Best-practice-dedup-should-I-use-it-as-early-as-possible-or/m-p/391861#M171046</guid>
      <dc:creator>rvsroe</dc:creator>
      <dc:date>2020-09-30T00:43:05Z</dc:date>
    </item>
    <item>
      <title>Re: Best practice dedup: should I use it as early as possible, or postpone it since it is non-streaming?</title>
      <link>https://community.splunk.com/t5/Splunk-Search/Best-practice-dedup-should-I-use-it-as-early-as-possible-or/m-p/391862#M171047</link>
      <description>&lt;P&gt;The best way to tackle the above query is:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;index=main sourcetype="access_combined_wcookie" action=purchase status=200 file="success.do" 
| stats count by JSESSIONID, action, status
| rename JSESSIONID as UserSessions
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;&lt;CODE&gt;stats&lt;/CODE&gt; or &lt;CODE&gt;dedup&lt;/CODE&gt; is much more efficient: reduce the data as much as possible before you do field-level manipulations.&lt;BR /&gt;
In other words, do the statistical reduction as early as possible in your search.&lt;/P&gt;</description>
      <pubDate>Mon, 27 May 2019 09:59:35 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/Best-practice-dedup-should-I-use-it-as-early-as-possible-or/m-p/391862#M171047</guid>
      <dc:creator>koshyk</dc:creator>
      <dc:date>2019-05-27T09:59:35Z</dc:date>
    </item>
    <item>
      <title>Re: Best practice dedup: should I use it as early as possible, or postpone it since it is non-streaming?</title>
      <link>https://community.splunk.com/t5/Splunk-Search/Best-practice-dedup-should-I-use-it-as-early-as-possible-or/m-p/391863#M171048</link>
      <description>&lt;P&gt;Hi Koshyk, &lt;BR /&gt;
Thank you for the quick reply. Just a follow-up: does this mean that renaming before stats or dedup would take more time, because rename would then run over a larger dataset than if it were executed after stats/dedup?&lt;/P&gt;</description>
      <pubDate>Mon, 27 May 2019 10:07:41 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/Best-practice-dedup-should-I-use-it-as-early-as-possible-or/m-p/391863#M171048</guid>
      <dc:creator>rvsroe</dc:creator>
      <dc:date>2019-05-27T10:07:41Z</dc:date>
    </item>
    <item>
      <title>Re: Best practice dedup: should I use it as early as possible, or postpone it since it is non-streaming?</title>
      <link>https://community.splunk.com/t5/Splunk-Search/Best-practice-dedup-should-I-use-it-as-early-as-possible-or/m-p/712731#M240392</link>
      <description>&lt;P&gt;I believe this answer is not quite correct.&amp;nbsp; The optimized query is:&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;PRE&gt;index=main sourcetype="access_combined_wcookie" action=purchase status=200 file="success.do" 
| table JSESSIONID, action, status&lt;BR /&gt;| stats count by JSESSIONID, action, status&lt;BR /&gt;| rename JSESSIONID as UserSessions&lt;/PRE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;In a clustered Splunk environment, lines 1-2 execute in parallel on your indexers, the minimized data is then passed to the search head, the search head executes line 3, and then line 4 only operates on 1 row of data.&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;I try to always do a TABLE early in the query, especially before doing an expensive DEDUP, STATS, or BIN.&amp;nbsp; That reduces the dataset on all your indexers, discarding unneeded fields, before it's merged on your search head.&amp;nbsp; Instead of TABLE you could alternatively do two FIELDS commands, one to include the necessary fields and another to remove _raw.&amp;nbsp; Computationally I don't know whether Splunk is more efficient handling event data from FIELDS or handling transformed data from TABLE, but TABLE makes the query simpler.&lt;/P&gt;</description>
      <pubDate>Thu, 27 Feb 2025 19:27:51 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/Best-practice-dedup-should-I-use-it-as-early-as-possible-or/m-p/712731#M240392</guid>
      <dc:creator>satyenshah</dc:creator>
      <dc:date>2025-02-27T19:27:51Z</dc:date>
    </item>
    <item>
      <title>Re: Best practice dedup: should I use it as early as possible, or postpone it since it is non-streaming?</title>
      <link>https://community.splunk.com/t5/Splunk-Search/Best-practice-dedup-should-I-use-it-as-early-as-possible-or/m-p/712841#M240429</link>
      <description>&lt;P&gt;You have mixed up table and fields. You should never use table before stats etc., as table always moves processing to the SH side. So you should do:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;index=main sourcetype="access_combined_wcookie" action=purchase status=200 file="success.do" 
| fields JSESSIONID, action, status
| stats count by JSESSIONID, action, status
| rename JSESSIONID as UserSessions&lt;/LI-CODE&gt;&lt;P&gt;This lets several indexers do the preliminary phase of stats and send smaller result sets to the SH, which finally combines them to give the true result of stats.&lt;/P&gt;</description>
      <pubDate>Fri, 28 Feb 2025 16:01:46 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/Best-practice-dedup-should-I-use-it-as-early-as-possible-or/m-p/712841#M240429</guid>
      <dc:creator>isoutamo</dc:creator>
      <dc:date>2025-02-28T16:01:46Z</dc:date>
    </item>
    <item>
      <title>Re: Best practice dedup: should I use it as early as possible, or postpone it since it is non-streaming?</title>
      <link>https://community.splunk.com/t5/Splunk-Search/Best-practice-dedup-should-I-use-it-as-early-as-possible-or/m-p/742721#M240927</link>
      <description>&lt;P&gt;I didn't realize that table forces localop.&amp;nbsp; So, optimizing further would be:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;index=main sourcetype="access_combined_wcookie" action=purchase status=200 file="success.do" 
| fields - _*
| fields JSESSIONID, action, status
| stats count by JSESSIONID, action, status
| rename JSESSIONID as UserSessions&lt;/LI-CODE&gt;&lt;P&gt;to discard _raw and other internal fields.&lt;/P&gt;</description>
      <pubDate>Wed, 26 Mar 2025 14:16:56 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/Best-practice-dedup-should-I-use-it-as-early-as-possible-or/m-p/742721#M240927</guid>
      <dc:creator>satyenshah</dc:creator>
      <dc:date>2025-03-26T14:16:56Z</dc:date>
    </item>
    <item>
      <title>Re: Best practice dedup: should I use it as early as possible, or postpone it since it is non-streaming?</title>
      <link>https://community.splunk.com/t5/Splunk-Search/Best-practice-dedup-should-I-use-it-as-early-as-possible-or/m-p/742731#M240928</link>
      <description>&lt;P&gt;It makes no sense to fiddle with fields since you're gonna do stats next.&lt;/P&gt;&lt;P&gt;For example, the performance of&lt;/P&gt;&lt;PRE&gt;search index=_internal &lt;BR /&gt;| stats count by sourcetype&lt;/PRE&gt;&lt;P&gt;and&lt;/P&gt;&lt;PRE&gt;search index=_internal &lt;BR /&gt;| fields - _* &lt;BR /&gt;| fields sourcetype&lt;BR /&gt;| stats count by sourcetype&lt;/PRE&gt;&lt;P&gt;is practically identical.&lt;/P&gt;&lt;P&gt;The only difference between those searches is that one has this in the map phase:&lt;/P&gt;&lt;PRE&gt;litsearch index=_internal&lt;BR /&gt;| addinfo type=count label=prereport_events track_fieldmeta_events=true&lt;BR /&gt;| fields keepcolorder=t "prestats_reserved_*" "psrsvd_*" "sourcetype"&lt;BR /&gt;| prestats count by sourcetype&lt;/PRE&gt;&lt;P&gt;while the other has this:&lt;/P&gt;&lt;PRE&gt;litsearch index=_internal&lt;BR /&gt;| fields - "_*"&lt;BR /&gt;| fields + sourcetype&lt;BR /&gt;| addinfo type=count label=prereport_events track_fieldmeta_events=true&lt;BR /&gt;| fields keepcolorder=t "prestats_reserved_*" "psrsvd_*" "sourcetype"&lt;BR /&gt;| prestats count by sourcetype&lt;/PRE&gt;&lt;P&gt;The two fields commands in the "better" version are actually pointless, since right before prestats Splunk does its own fields command, which limits the data to the fields taking part in the aggregation anyway.&lt;/P&gt;</description>
      <pubDate>Wed, 26 Mar 2025 15:57:29 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/Best-practice-dedup-should-I-use-it-as-early-as-possible-or/m-p/742731#M240928</guid>
      <dc:creator>PickleRick</dc:creator>
      <dc:date>2025-03-26T15:57:29Z</dc:date>
    </item>
  </channel>
</rss>

