<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: How to chart cumulative distinct count over time? in Splunk Search</title>
    <link>https://community.splunk.com/t5/Splunk-Search/How-to-chart-cumulative-distinct-count-over-time/m-p/223873#M65934</link>
    <description>&lt;P&gt;Try something like this&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;your base search | reverse | dedup mykey | timechart dc(mykey) as dc | streamstats sum(dc) as DC_cumulative
&lt;/CODE&gt;&lt;/PRE&gt;

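&lt;P&gt;Because the events are first reversed into chronological order, dedup keeps only the earliest occurrence of each mykey value, so a value that reappears in later intervals is not counted again. This might be expensive, since it uses both reverse and dedup.&lt;/P&gt;</description>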
    <pubDate>Tue, 12 Jan 2016 23:02:45 GMT</pubDate>
    <dc:creator>somesoni2</dc:creator>
    <dc:date>2016-01-12T23:02:45Z</dc:date>
    <item>
      <title>How to chart cumulative distinct count over time?</title>
      <link>https://community.splunk.com/t5/Splunk-Search/How-to-chart-cumulative-distinct-count-over-time/m-p/223872#M65933</link>
      <description>&lt;P&gt;With &lt;CODE&gt;dc(mykey) as DC1&lt;/CODE&gt;, I can plot how many distinct values of &lt;CODE&gt;mykey&lt;/CODE&gt; occur in each fixed time span. If values of &lt;CODE&gt;mykey&lt;/CODE&gt; never repeated over time, &lt;CODE&gt;accum DC1 as DC_accum&lt;/CODE&gt; would give me the cumulative count of distinct values of &lt;CODE&gt;mykey&lt;/CODE&gt; over time. But that case is trivial. In most practical cases, &lt;CODE&gt;mykey&lt;/CODE&gt; values partially repeat. How can I plot the cumulative count of distinct values of &lt;CODE&gt;mykey&lt;/CODE&gt; over time?&lt;/P&gt;
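&lt;P&gt;For concreteness, a minimal sketch of the &lt;CODE&gt;accum&lt;/CODE&gt; approach described above might look like the following, where &lt;CODE&gt;your base search&lt;/CODE&gt; is a placeholder and the default timechart span is assumed. It counts a repeated &lt;CODE&gt;mykey&lt;/CODE&gt; value once in every interval where it appears, so &lt;CODE&gt;DC_accum&lt;/CODE&gt; overshoots the true cumulative distinct count whenever values repeat across intervals.&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;your base search
| timechart dc(mykey) as DC1
| accum DC1 as DC_accum
&lt;/CODE&gt;&lt;/PRE&gt;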

&lt;P&gt;Following suggestions from &lt;A href="https://answers.splunk.com/answers/50628/how-do-you-chart-a-cumulative-sum.html"&gt;How do you chart a cumulative sum&lt;/A&gt;, I tried&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;| reverse
| streamstats dc(mykey) as DC_cumulative
| timechart max(DC_cumulative)
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;This gives me an "improved" result, meaning that the curve plateaus at a value equal to the total distinct count. Does this really do what I want? I am not fully fluent in streamstats, and this whole "stats" without _time makes me nervous.&lt;/P&gt;</description>
      <pubDate>Tue, 12 Jan 2016 20:07:14 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/How-to-chart-cumulative-distinct-count-over-time/m-p/223872#M65933</guid>
      <dc:creator>yuanliu</dc:creator>
      <dc:date>2016-01-12T20:07:14Z</dc:date>
    </item>
    <item>
      <title>Re: How to chart cumulative distinct count over time?</title>
      <link>https://community.splunk.com/t5/Splunk-Search/How-to-chart-cumulative-distinct-count-over-time/m-p/223873#M65934</link>
      <description>&lt;P&gt;Try something like this&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;your base search | reverse | dedup mykey | timechart dc(mykey) as dc | streamstats sum(dc) as DC_cumulative
&lt;/CODE&gt;&lt;/PRE&gt;
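
&lt;P&gt;To sanity-check the recipe without real data, a small synthetic test along these lines may help. Everything in it is an illustrative assumption: the six events generated by &lt;CODE&gt;makeresults&lt;/CODE&gt;, the key values a to d, the one-hour spacing, and &lt;CODE&gt;span=2h&lt;/CODE&gt;. No &lt;CODE&gt;reverse&lt;/CODE&gt; is needed here because the generated events are already in ascending time order.&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;| makeresults count=6
| streamstats count as n
| eval _time = relative_time(now(), "@d") + n * 3600, mykey = case(n==1, "a", n==2, "b", n==3, "a", n==4, "c", n==5, "b", n==6, "d")
| dedup mykey
| timechart span=2h dc(mykey) as dc
| streamstats sum(dc) as DC_cumulative
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;Keys a and b each appear twice, so the final &lt;CODE&gt;DC_cumulative&lt;/CODE&gt; should equal the number of distinct keys (4) rather than the number of events (6).&lt;/P&gt;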

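&lt;P&gt;Because the events are first reversed into chronological order, dedup keeps only the earliest occurrence of each mykey value, so a value that reappears in later intervals is not counted again. This might be expensive, since it uses both reverse and dedup.&lt;/P&gt;</description>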
      <pubDate>Tue, 12 Jan 2016 23:02:45 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/How-to-chart-cumulative-distinct-count-over-time/m-p/223873#M65934</guid>
      <dc:creator>somesoni2</dc:creator>
      <dc:date>2016-01-12T23:02:45Z</dc:date>
    </item>
    <item>
      <title>Re: How to chart cumulative distinct count over time?</title>
      <link>https://community.splunk.com/t5/Splunk-Search/How-to-chart-cumulative-distinct-count-over-time/m-p/223874#M65935</link>
      <description>&lt;P&gt;I initially thought that adding dedup would increase cost, while running timechart before streamstats would reduce the cost of streamstats. So I played these scenarios out, along with my original recipe, over 5.5 million records: 155K cumulative distinct values, with 2K to 3K distinct values in each of 49 surveyed intervals. Amazingly, because dedup reduces the load on subsequent commands at such a duplication rate, adding dedup accelerates my original recipe, too.&lt;/P&gt;

&lt;OL&gt;
&lt;LI&gt;(@somesoni2) 75s: &lt;CODE&gt;| reverse | dedup mykey | timechart dc(mykey) as dc | streamstats sum(dc) as DC_cumulative&lt;/CODE&gt;&lt;/LI&gt;
&lt;LI&gt;(Modified) 86s: &lt;CODE&gt;| dedup mykey | streamstats dc(mykey) as DC_cumulative | timechart max(DC_cumulative)&lt;/CODE&gt;&lt;/LI&gt;
&lt;LI&gt;(Original) 179s: &lt;CODE&gt;| streamstats dc(mykey) as DC_cumulative | timechart max(DC_cumulative)&lt;/CODE&gt;&lt;/LI&gt;
&lt;/OL&gt;

&lt;P&gt;The side effect of placing timechart before streamstats is an added series, &lt;CODE&gt;dc&lt;/CODE&gt;. This series can easily be filtered out, but &lt;CODE&gt;dc&lt;/CODE&gt; is also a useful metric that I had to go out of my way to add back, at even more cost.&lt;/P&gt;
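
&lt;P&gt;A minimal sketch of filtering the series out, assuming it is not wanted in the chart, is to append &lt;CODE&gt;fields - dc&lt;/CODE&gt; to recipe 1. This variant is not part of the timings above, and &lt;CODE&gt;your base search&lt;/CODE&gt; is a placeholder.&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;your base search
| reverse
| dedup mykey
| timechart dc(mykey) as dc
| streamstats sum(dc) as DC_cumulative
| fields - dc
&lt;/CODE&gt;&lt;/PRE&gt;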

&lt;P&gt;Of course, actual savings or cost will depend on data characteristics. The data set in this comparison has an extreme duplication ratio, very different from my target data. (I chose it for the low cost of the raw search.) But I believe that whenever duplication is significant, dedup yields a net savings. Great job!&lt;/P&gt;</description>
      <pubDate>Wed, 20 Jan 2016 23:46:28 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/How-to-chart-cumulative-distinct-count-over-time/m-p/223874#M65935</guid>
      <dc:creator>yuanliu</dc:creator>
      <dc:date>2016-01-20T23:46:28Z</dc:date>
    </item>
    <item>
      <title>Re: How to chart cumulative distinct count over time?</title>
      <link>https://community.splunk.com/t5/Splunk-Search/How-to-chart-cumulative-distinct-count-over-time/m-p/223875#M65936</link>
      <description>&lt;P&gt;As commented under @somesoni2's answer, running timechart before streamstats is more efficient than the original recipe. However, one non-obvious benefit of running streamstats before timechart (the original method) is that it allows a groupby clause, whereas the timechart-first method doesn't. So here is an alternative, nearly as efficient answer if you need groupby:&lt;BR /&gt;
&lt;CODE&gt;| dedup mykey | streamstats dc(mykey) as DC_cumulative by group_key | timechart max(DC_cumulative) by group_key&lt;/CODE&gt;&lt;BR /&gt;
Furthermore, if you also need the interval distinct count that the other method produces as a side effect, you can use&lt;BR /&gt;
&lt;CODE&gt;| dedup mykey | streamstats dc(mykey) as DC_cumulative by group_key | timechart dc(mykey) max(DC_cumulative) by group_key&lt;/CODE&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 21 Jan 2016 00:07:08 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/How-to-chart-cumulative-distinct-count-over-time/m-p/223875#M65936</guid>
      <dc:creator>yuanliu</dc:creator>
      <dc:date>2016-01-21T00:07:08Z</dc:date>
    </item>
  </channel>
</rss>

