<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: What's the best way to get unique counts from long term data? in Splunk Search</title>
    <link>https://community.splunk.com/t5/Splunk-Search/What-s-the-best-way-to-get-unique-counts-from-long-term-data/m-p/254510#M76230</link>
    <description>&lt;P&gt;Do you need the count for each user - or the distinct count of users?&lt;/P&gt;

&lt;P&gt;If the latter, have you considered using &lt;CODE&gt;estdc&lt;/CODE&gt;? If the unique user count can be approximated &lt;A href="https://answers.splunk.com/answers/98220/whats-the-difference-between-dc-distinct-count-and-estdc-estimated-distinct-count.html"&gt;(1-2% error)&lt;/A&gt;, give it a try!&lt;/P&gt;

&lt;P&gt;Lastly, if you want to report on data that is older than your index retention, a summary index is a good choice. It might be easier to use the KV store as you mention; just make sure you remember to add &lt;CODE&gt;append=t&lt;/CODE&gt; to your searches, or you'll overwrite your old lookup!&lt;/P&gt;</description>
    <pubDate>Fri, 02 Dec 2016 20:57:19 GMT</pubDate>
    <dc:creator>aljohnson_splun</dc:creator>
    <dc:date>2016-12-02T20:57:19Z</dc:date>
    <item>
      <title>What's the best way to get unique counts from long term data?</title>
      <link>https://community.splunk.com/t5/Splunk-Search/What-s-the-best-way-to-get-unique-counts-from-long-term-data/m-p/254509#M76229</link>
      <description>&lt;P&gt;We have a few busy indexes that can only retain about 20 days worth of logs. The corner-office-types want unique user counts for the last month. The first thing that popped into my head was doing a daily &lt;CODE&gt;stats count by user&lt;/CODE&gt; and collecting that into a summary index. But, I'm seeing upwards of 2m unique users each day. So that &lt;CODE&gt;collect&lt;/CODE&gt; will be pretty massive.&lt;/P&gt;

&lt;P&gt;Is there a better way? Maybe dumping that &lt;CODE&gt;stats&lt;/CODE&gt; command into a kvstore?&lt;/P&gt;

&lt;P&gt;At what point will &lt;CODE&gt;collect&lt;/CODE&gt; fall over from too many lines of input?&lt;/P&gt;

&lt;P&gt;Other considerations?&lt;/P&gt;</description>
      <pubDate>Fri, 02 Dec 2016 20:10:15 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/What-s-the-best-way-to-get-unique-counts-from-long-term-data/m-p/254509#M76229</guid>
      <dc:creator>twinspop</dc:creator>
      <dc:date>2016-12-02T20:10:15Z</dc:date>
    </item>
    <item>
      <title>Re: What's the best way to get unique counts from long term data?</title>
      <link>https://community.splunk.com/t5/Splunk-Search/What-s-the-best-way-to-get-unique-counts-from-long-term-data/m-p/254510#M76230</link>
      <description>&lt;P&gt;Do you need the count for each user - or the distinct count of users?&lt;/P&gt;

&lt;P&gt;If the latter, have you considered using &lt;CODE&gt;estdc&lt;/CODE&gt;? If the unique user count can be approximated &lt;A href="https://answers.splunk.com/answers/98220/whats-the-difference-between-dc-distinct-count-and-estdc-estimated-distinct-count.html"&gt;(1-2% error)&lt;/A&gt;, give it a try!&lt;/P&gt;
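
&lt;P&gt;Something along these lines (sourcetype and field names are placeholders for your own):&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;sourcetype=my_sourcetype earliest=-30d@d | 
stats estdc(user) as approx_unique_users
&lt;/CODE&gt;&lt;/PRE&gt;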

&lt;P&gt;Lastly, if you want to report on data that is older than your index retention, a summary index is a good choice. It might be easier to use the KV store as you mention; just make sure you remember to add &lt;CODE&gt;append=t&lt;/CODE&gt; to your searches, or you'll overwrite your old lookup!&lt;/P&gt;</description>
      <pubDate>Fri, 02 Dec 2016 20:57:19 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/What-s-the-best-way-to-get-unique-counts-from-long-term-data/m-p/254510#M76230</guid>
      <dc:creator>aljohnson_splun</dc:creator>
      <dc:date>2016-12-02T20:57:19Z</dc:date>
    </item>
    <item>
      <title>Re: What's the best way to get unique counts from long term data?</title>
      <link>https://community.splunk.com/t5/Splunk-Search/What-s-the-best-way-to-get-unique-counts-from-long-term-data/m-p/254511#M76231</link>
      <description>&lt;P&gt;I can't use estdc() because the data is not there. I need a uniques for the whole month, but only have about 19 days of retention. I'll have to run some tests to see how collect and/or the kvstore handles millions of records.&lt;/P&gt;</description>
      <pubDate>Sat, 03 Dec 2016 15:55:20 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/What-s-the-best-way-to-get-unique-counts-from-long-term-data/m-p/254511#M76231</guid>
      <dc:creator>twinspop</dc:creator>
      <dc:date>2016-12-03T15:55:20Z</dc:date>
    </item>
    <item>
      <title>Re: What's the best way to get unique counts from long term data?</title>
      <link>https://community.splunk.com/t5/Splunk-Search/What-s-the-best-way-to-get-unique-counts-from-long-term-data/m-p/254512#M76232</link>
      <description>&lt;P&gt;My solution:&lt;/P&gt;

&lt;UL&gt;
&lt;LI&gt;Created a collection (KV store) with one defined field, timestamp&lt;/LI&gt;
&lt;LI&gt;Created the lookup for that collection&lt;/LI&gt;
&lt;LI&gt;Created a saved search that runs every 4 hours... &lt;/LI&gt;
&lt;LI&gt;... and sends the results to &lt;CODE&gt;outputlookup&lt;/CODE&gt;&lt;/LI&gt;
&lt;/UL&gt;
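
&lt;P&gt;For reference, the collection and lookup stanzas behind the first two steps look roughly like this (my stanza names; yours may differ):&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;# collections.conf
[users_kv]
field.timestamp = time

# transforms.conf
[users_kv_lookup]
external_type = kvstore
collection = users_kv
fields_list = _key, timestamp
&lt;/CODE&gt;&lt;/PRE&gt;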

&lt;P&gt;I use the userid as the _key, but I needed a way to isolate to a month. E.g., November ended on a Thursday. I don't want the data to be polluted or overwritten with December data before analysts have had a chance to check. So I'm prepending every userid with the month:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;sourcetype=my_sourcetype | 
stats latest(_time) as timestamp by userid | 
eventstats last(timestamp) as lasttime | 
eval _key=strftime(lasttime,"%b")."-".userid | 
fields _key timestamp | 
outputlookup users_kv_lookup append=true
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;Now with a simple &lt;CODE&gt;inputlookup&lt;/CODE&gt; command I can get uniques for the month of November:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;| inputlookup users_kv_lookup where _key="Nov-*" | stats dc(_key) as uniques
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;I included the timestamp for... reasons. Maybe it will be useful to see the last login time. I thought having multiple logs for each user was too much though. We have 5m users logging in multiple times per day, each session with 10s - 100s of hits. That could get ugly.&lt;/P&gt;
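
&lt;P&gt;If the last login time does turn out to be useful, it's easy to pull back out; e.g., something like this (stripping the month prefix off the key):&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;| inputlookup users_kv_lookup where _key="Nov-*" | 
eval userid=replace(_key, "^Nov-", ""), last_login=strftime(timestamp, "%F %T") | 
fields userid last_login
&lt;/CODE&gt;&lt;/PRE&gt;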

&lt;P&gt;EDIT - Dawg, do u even optimize?!&lt;/P&gt;

&lt;P&gt;It seems like overwriting existing uids with the same info is time-consuming / resource-intensive. Why not only put the new ones into the collection? Okay then. Duuuh!&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;sourcetype=my_sourcetype | 
stats latest(_time) as TS by userid | 
eventstats last(TS) as lasttime | 
eval _key=strftime(lasttime,"%b")."-".userid | 
lookup users_kv_lookup _key | 
where isnull(timestamp) |
fields _key TS | rename TS as timestamp | 
outputlookup users_kv_lookup append=true
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;Userids that already have records are excluded by the &lt;CODE&gt;where isnull()&lt;/CODE&gt; clause. The speedup was at least 2 orders of magnitude. Woot woot.&lt;/P&gt;</description>
      <pubDate>Tue, 06 Dec 2016 23:58:48 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/What-s-the-best-way-to-get-unique-counts-from-long-term-data/m-p/254512#M76232</guid>
      <dc:creator>twinspop</dc:creator>
      <dc:date>2016-12-06T23:58:48Z</dc:date>
    </item>
  </channel>
</rss>

