<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic How can we improve the performance of a Hunk query? in Splunk Search</title>
    <link>https://community.splunk.com/t5/Splunk-Search/How-can-we-improve-the-performance-of-an-Hunk-s-query/m-p/346360#M102591</link>
    <description>&lt;P&gt;We have the following Hunk query - &lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;index=&amp;lt;claims_table&amp;gt; claim_classification=INPATIENT OR claim_classification="INPATIENT (PART A)*" fst_srvc_dt&amp;gt;=2015* fst_srvc_dt&amp;lt;=2016-11-29 drg_code!=NO_APR drg_code!="N/A" drg_code!=" " drg_code!=00000 
| eval prov_group=if(prov_tin=&amp;lt;tin1&amp;gt; OR prov_tin=&amp;lt;tin2&amp;gt; OR prov_tin=&amp;lt;tin3&amp;gt; OR prov_tin=&amp;lt;tin4&amp;gt; OR prov_tin=&amp;lt;tin5&amp;gt; OR prov_tin=&amp;lt;tin6&amp;gt; OR prov_tin=&amp;lt;tin7&amp;gt; OR  prov_tin=&amp;lt;tin8&amp;gt; OR prov_tin=&amp;lt;tin9&amp;gt; OR prov_tin=&amp;lt;tin10&amp;gt; OR prov_tin=&amp;lt;tin11&amp;gt; OR prov_tin=&amp;lt;tin12&amp;gt;,"montefiore"," all others") 
| stats count as count ,mean(net_pd_amt) as avg by drg_code prov_group 
| eval avg=round(avg,0)
| eventstats sum(count) as total by drg_code 
| sort 20 -total 
| xyseries drg_code prov_group avg count 
| addtotals fieldname=grandtotals 
| sort -grandtotals 
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;The &lt;CODE&gt;claims_table&lt;/CODE&gt; contains two billion claims, and this query processes about a million claims per minute.&lt;/P&gt;

&lt;P&gt;Any ideas on how to improve it?&lt;/P&gt;</description>
    <pubDate>Tue, 25 Apr 2017 20:08:08 GMT</pubDate>
    <dc:creator>ddrillic</dc:creator>
    <dc:date>2017-04-25T20:08:08Z</dc:date>
    <item>
      <title>How can we improve the performance of a Hunk query?</title>
      <link>https://community.splunk.com/t5/Splunk-Search/How-can-we-improve-the-performance-of-an-Hunk-s-query/m-p/346360#M102591</link>
      <description>&lt;P&gt;We have the following Hunk query - &lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;index=&amp;lt;claims_table&amp;gt; claim_classification=INPATIENT OR claim_classification="INPATIENT (PART A)*" fst_srvc_dt&amp;gt;=2015* fst_srvc_dt&amp;lt;=2016-11-29 drg_code!=NO_APR drg_code!="N/A" drg_code!=" " drg_code!=00000 
| eval prov_group=if(prov_tin=&amp;lt;tin1&amp;gt; OR prov_tin=&amp;lt;tin2&amp;gt; OR prov_tin=&amp;lt;tin3&amp;gt; OR prov_tin=&amp;lt;tin4&amp;gt; OR prov_tin=&amp;lt;tin5&amp;gt; OR prov_tin=&amp;lt;tin6&amp;gt; OR prov_tin=&amp;lt;tin7&amp;gt; OR  prov_tin=&amp;lt;tin8&amp;gt; OR prov_tin=&amp;lt;tin9&amp;gt; OR prov_tin=&amp;lt;tin10&amp;gt; OR prov_tin=&amp;lt;tin11&amp;gt; OR prov_tin=&amp;lt;tin12&amp;gt;,"montefiore"," all others") 
| stats count as count ,mean(net_pd_amt) as avg by drg_code prov_group 
| eval avg=round(avg,0)
| eventstats sum(count) as total by drg_code 
| sort 20 -total 
| xyseries drg_code prov_group avg count 
| addtotals fieldname=grandtotals 
| sort -grandtotals 
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;The &lt;CODE&gt;claims_table&lt;/CODE&gt; contains two billion claims, and this query processes about a million claims per minute.&lt;/P&gt;

&lt;P&gt;Any ideas on how to improve it?&lt;/P&gt;</description>
      <pubDate>Tue, 25 Apr 2017 20:08:08 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/How-can-we-improve-the-performance-of-an-Hunk-s-query/m-p/346360#M102591</guid>
      <dc:creator>ddrillic</dc:creator>
      <dc:date>2017-04-25T20:08:08Z</dc:date>
    </item>
    <item>
      <title>Re: How can we improve the performance of a Hunk query?</title>
      <link>https://community.splunk.com/t5/Splunk-Search/How-can-we-improve-the-performance-of-an-Hunk-s-query/m-p/346361#M102592</link>
      <description>&lt;P&gt;You are assigning field results to reserved words (&lt;CODE&gt;count&lt;/CODE&gt;, &lt;CODE&gt;avg&lt;/CODE&gt;, &lt;CODE&gt;total&lt;/CODE&gt;).&lt;/P&gt;

&lt;P&gt;As in&lt;/P&gt;

&lt;P&gt;&lt;CODE&gt;| stats count as count ,mean(net_pd_amt) as avg by drg_code prov_group&lt;/CODE&gt;&lt;BR /&gt;
which, below, I have changed to&lt;/P&gt;

&lt;P&gt;&lt;CODE&gt;| stats count as mycount ,mean(net_pd_amt) as myavg by drg_code prov_group&lt;/CODE&gt;&lt;/P&gt;

&lt;P&gt;Try this instead.&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;index=&amp;lt;claims_table&amp;gt; claim_classification=INPATIENT 
OR claim_classification="INPATIENT (PART A)*" fst_srvc_dt&amp;gt;=2015* fst_srvc_dt&amp;lt;=2016-11-29 drg_code!=NO_APR drg_code!="N/A"       drg_code!=" " drg_code!=00000 
 | eval prov_group=if(prov_tin=&amp;lt;tin1&amp;gt; OR prov_tin=&amp;lt;tin2&amp;gt; OR prov_tin=&amp;lt;tin3&amp;gt; OR prov_tin=&amp;lt;tin4&amp;gt; OR prov_tin=&amp;lt;tin5&amp;gt; OR prov_tin=&amp;lt;tin6&amp;gt; OR prov_tin=&amp;lt;tin7&amp;gt; OR  prov_tin=&amp;lt;tin8&amp;gt; OR prov_tin=&amp;lt;tin9&amp;gt; OR prov_tin=&amp;lt;tin10&amp;gt; OR prov_tin=&amp;lt;tin11&amp;gt; OR prov_tin=&amp;lt;tin12&amp;gt;,"montefiore"," all others") 
 | stats count as mycount ,mean(net_pd_amt) as myavg by drg_code prov_group 
 | eval myavg=round(myavg,0)
 | eventstats sum(mycount) as mytotal by drg_code 
 | sort 20 -mytotal 
 | xyseries drg_code prov_group myavg mycount 
 | addtotals fieldname=grandtotals 
 | sort -grandtotals
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;After that, start from the initial search and add one step at a time, testing each to make sure you still get results.&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;index=&amp;lt;claims_table&amp;gt; claim_classification=INPATIENT 
OR claim_classification="INPATIENT (PART A)*" fst_srvc_dt&amp;gt;=2015* fst_srvc_dt&amp;lt;=2016-11-29 drg_code!=NO_APR drg_code!="N/A" drg_code!=" " drg_code!=00000
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;And then&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;index=&amp;lt;claims_table&amp;gt; claim_classification=INPATIENT 
OR claim_classification="INPATIENT (PART A)*" fst_srvc_dt&amp;gt;=2015* fst_srvc_dt&amp;lt;=2016-11-29 drg_code!=NO_APR drg_code!="N/A" drg_code!=" " drg_code!=00000 
 | eval prov_group=if(prov_tin=&amp;lt;tin1&amp;gt; OR prov_tin=&amp;lt;tin2&amp;gt; OR prov_tin=&amp;lt;tin3&amp;gt; OR prov_tin=&amp;lt;tin4&amp;gt; OR prov_tin=&amp;lt;tin5&amp;gt; OR prov_tin=&amp;lt;tin6&amp;gt; OR prov_tin=&amp;lt;tin7&amp;gt; OR prov_tin=&amp;lt;tin8&amp;gt; OR prov_tin=&amp;lt;tin9&amp;gt; OR prov_tin=&amp;lt;tin10&amp;gt; OR prov_tin=&amp;lt;tin11&amp;gt; OR prov_tin=&amp;lt;tin12&amp;gt;,"montefiore"," all others")
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;And then&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;index=&amp;lt;claims_table&amp;gt; claim_classification=INPATIENT 
OR claim_classification="INPATIENT (PART A)*" fst_srvc_dt&amp;gt;=2015* fst_srvc_dt&amp;lt;=2016-11-29 drg_code!=NO_APR drg_code!="N/A" drg_code!=" " drg_code!=00000 
 | eval prov_group=if(prov_tin=&amp;lt;tin1&amp;gt; OR prov_tin=&amp;lt;tin2&amp;gt; OR prov_tin=&amp;lt;tin3&amp;gt; OR prov_tin=&amp;lt;tin4&amp;gt; OR prov_tin=&amp;lt;tin5&amp;gt; OR prov_tin=&amp;lt;tin6&amp;gt; OR prov_tin=&amp;lt;tin7&amp;gt; OR prov_tin=&amp;lt;tin8&amp;gt; OR prov_tin=&amp;lt;tin9&amp;gt; OR prov_tin=&amp;lt;tin10&amp;gt; OR prov_tin=&amp;lt;tin11&amp;gt; OR prov_tin=&amp;lt;tin12&amp;gt;,"montefiore"," all others") 
 | stats count as mycount ,mean(net_pd_amt) as myavg by drg_code prov_group
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;until you have the entire search running.&lt;/P&gt;</description>
      <pubDate>Tue, 25 Apr 2017 20:46:13 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/How-can-we-improve-the-performance-of-an-Hunk-s-query/m-p/346361#M102592</guid>
      <dc:creator>Claw</dc:creator>
      <dc:date>2017-04-25T20:46:13Z</dc:date>
    </item>
    <item>
      <title>Re: How can we improve the performance of a Hunk query?</title>
      <link>https://community.splunk.com/t5/Splunk-Search/How-can-we-improve-the-performance-of-an-Hunk-s-query/m-p/346362#M102593</link>
      <description>&lt;P&gt;Try a time trial on this, against the most recent 4 days of data - &lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;   index=&amp;lt;claims_table&amp;gt; 
   (claim_classification=INPATIENT OR claim_classification="INPATIENT (PART A)*") 
   AND (fst_srvc_dt&amp;gt;=2016-11-26 AND fst_srvc_dt&amp;lt;=2016-11-29) 
   AND (prov_tin=&amp;lt;tin1&amp;gt; OR prov_tin=&amp;lt;tin2&amp;gt; OR prov_tin=&amp;lt;tin3&amp;gt; OR prov_tin=&amp;lt;tin4&amp;gt; OR prov_tin=&amp;lt;tin5&amp;gt; 
   OR prov_tin=&amp;lt;tin6&amp;gt; OR prov_tin=&amp;lt;tin7&amp;gt; OR  prov_tin=&amp;lt;tin8&amp;gt; OR prov_tin=&amp;lt;tin9&amp;gt; OR prov_tin=&amp;lt;tin10&amp;gt; 
   OR prov_tin=&amp;lt;tin11&amp;gt; OR prov_tin=&amp;lt;tin12&amp;gt;)
   AND NOT (drg_code=NO_APR OR drg_code="N/A" OR drg_code=" " OR drg_code=00000 )
   | stats count as mycount , mean(net_pd_amt) as myavg by drg_code
   | sort 30 -mycount 
   | eval prov_group="montefiore"
   | append 
       [index=&amp;lt;claims_table&amp;gt; 
       (claim_classification=INPATIENT OR claim_classification="INPATIENT (PART A)*") 
       AND (fst_srvc_dt&amp;gt;=2016-11-26 AND fst_srvc_dt&amp;lt;=2016-11-29) 
       AND NOT (prov_tin=&amp;lt;tin1&amp;gt; OR prov_tin=&amp;lt;tin2&amp;gt; OR prov_tin=&amp;lt;tin3&amp;gt; OR prov_tin=&amp;lt;tin4&amp;gt; OR prov_tin=&amp;lt;tin5&amp;gt; 
           OR prov_tin=&amp;lt;tin6&amp;gt; OR prov_tin=&amp;lt;tin7&amp;gt; OR  prov_tin=&amp;lt;tin8&amp;gt; OR prov_tin=&amp;lt;tin9&amp;gt; OR prov_tin=&amp;lt;tin10&amp;gt; 
           OR prov_tin=&amp;lt;tin11&amp;gt; OR prov_tin=&amp;lt;tin12&amp;gt;)
       AND NOT (drg_code=NO_APR OR drg_code="N/A" OR drg_code=" " OR drg_code=00000 )
       | stats count as mycount ,mean(net_pd_amt) as myavg by drg_code
       | sort 30 -mycount 
       | eval prov_group=" all others"
       ]
    | eval myavg=round(myavg,0)
    | eventstats sum(mycount) as mytotal by drg_code 
    | sort 20 -mytotal 
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;... against this...&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;  index=&amp;lt;claims_table&amp;gt; 
   (claim_classification=INPATIENT OR claim_classification="INPATIENT (PART A)*") 
   AND (fst_srvc_dt&amp;gt;=2016-11-26 AND fst_srvc_dt&amp;lt;=2016-11-29) 
   AND NOT (drg_code=NO_APR OR drg_code="N/A" OR drg_code=" " OR drg_code=00000 )
   | eval prov_group=if(prov_tin=&amp;lt;tin1&amp;gt; OR prov_tin=&amp;lt;tin2&amp;gt; OR prov_tin=&amp;lt;tin3&amp;gt; OR prov_tin=&amp;lt;tin4&amp;gt; 
                     OR prov_tin=&amp;lt;tin5&amp;gt; OR prov_tin=&amp;lt;tin6&amp;gt; OR prov_tin=&amp;lt;tin7&amp;gt; OR prov_tin=&amp;lt;tin8&amp;gt;  
                     OR prov_tin=&amp;lt;tin9&amp;gt; OR prov_tin=&amp;lt;tin10&amp;gt; OR prov_tin=&amp;lt;tin11&amp;gt; OR prov_tin=&amp;lt;tin12&amp;gt;,
                    "montefiore"," all others") 
   | stats count as mycount, mean(net_pd_amt) as myavg by drg_code prov_group
   | eval myavg=round(myavg,0)
   | eventstats sum(mycount) as mytotal by drg_code 
   | sort 20 -mytotal 
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;.. and against this...&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;  index=&amp;lt;claims_table&amp;gt; 
   (claim_classification=INPATIENT OR claim_classification="INPATIENT (PART A)*") 
   AND (fst_srvc_dt&amp;gt;=2016-11-26 AND fst_srvc_dt&amp;lt;=2016-11-29) 
   AND NOT (drg_code=NO_APR OR drg_code="N/A" OR drg_code=" " OR drg_code=00000 )
   | join type=left  prov_tin 
        [| makeresults | eval prov_tin="&amp;lt;tin1&amp;gt; &amp;lt;tin2&amp;gt; &amp;lt;tin3&amp;gt; &amp;lt;tin4&amp;gt; &amp;lt;tin5&amp;gt; &amp;lt;tin6&amp;gt; &amp;lt;tin7&amp;gt; &amp;lt;tin8&amp;gt; &amp;lt;tin9&amp;gt; &amp;lt;tin10&amp;gt; &amp;lt;tin11&amp;gt; &amp;lt;tin12&amp;gt;" | makemv prov_tin | mvexpand prov_tin 
        | eval prov_group="montefiore" | table prov_tin prov_group
        ]
   | eval prov_group=coalesce(prov_group," all others")
   | stats count as mycount,mean(net_pd_amt) as myavg by drg_code prov_group
   | eval myavg=round(myavg,0)
   | eventstats sum(mycount) as mytotal by drg_code 
   | sort 20 -mytotal 
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;HR /&gt;

&lt;P&gt;Discussion -&lt;/P&gt;

&lt;P&gt;Performance of each one of these is going to be data- and architecture-dependent. &lt;/P&gt;

&lt;P&gt;The top one assumes that the top 20 drg_codes overall must appear in the top 30 of each prov_group, and that &lt;CODE&gt;prov_tin&lt;/CODE&gt; is an indexed field. It aims to save calculation overall by doing the stats work at the prov_group level; the efficiency gains will depend on how sparse the tins under inspection are.&lt;/P&gt;

&lt;P&gt;The second one is basically yours with the changes suggested by Claw.  &lt;/P&gt;

&lt;P&gt;The third one tests a join against a small generated table as a way of assigning the prov_group.  Efficiency might be significantly better or worse as a join, depending on the cost of sorting compared to the cost of twelve individual compares for each record.&lt;/P&gt;
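&lt;P&gt;A lookup file would avoid the subsearch entirely. As a minimal sketch - assuming a hypothetical CSV lookup named &lt;CODE&gt;tin_groups.csv&lt;/CODE&gt; with columns &lt;CODE&gt;prov_tin&lt;/CODE&gt; and &lt;CODE&gt;prov_group&lt;/CODE&gt; - the join steps could become:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;   ...
   | lookup tin_groups.csv prov_tin OUTPUT prov_group
   | eval prov_group=coalesce(prov_group," all others")
   ...
&lt;/CODE&gt;&lt;/PRE&gt;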

&lt;HR /&gt;

&lt;P&gt;The real BEST solution is likely to consist of some of the following  -&lt;/P&gt;

&lt;P&gt;1) turn (drg_code=NO_APR OR drg_code="N/A" OR drg_code=" " OR drg_code=00000 ) into a single indexed field so you can use it for information at the summary level&lt;BR /&gt;
2) create a summary index of the data you are likely to want to reuse.&lt;BR /&gt;
3) since you only want the top 20 drg_codes by overall count, collect only that information FIRST, then scan for only those.  DANG, I should have coded that one.  Hang on...&lt;/P&gt;
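
&lt;P&gt;For item 2, the collecting search might look roughly like this - a sketch only, assuming a hypothetical summary index named &lt;CODE&gt;claims_summary&lt;/CODE&gt; and daily aggregation (note the &lt;CODE&gt;sum&lt;/CODE&gt; rather than &lt;CODE&gt;mean&lt;/CODE&gt;, so averages can be recomputed correctly later):&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;index=&amp;lt;claims_table&amp;gt; (claim_classification=INPATIENT OR claim_classification="INPATIENT (PART A)*")
 | stats count as claim_count, sum(net_pd_amt) as paid_total by drg_code prov_tin fst_srvc_dt
 | collect index=claims_summary
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;Subsequent searches would then run against &lt;CODE&gt;claims_summary&lt;/CODE&gt; and derive the average as &lt;CODE&gt;sum(paid_total)/sum(claim_count)&lt;/CODE&gt; instead of rescanning the raw claims.&lt;/P&gt;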

&lt;HR /&gt;

&lt;P&gt;If this works, the following should return the top 30 drg_codes for a 3-month period extremely fast... &lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;| tstats count 
where (index=&amp;lt;claims_table&amp;gt;)   
AND (claim_classification="INPATIENT" OR claim_classification="INPATIENT (PART A)*") 
AND NOT (drg_code=="NO_APR" OR drg_code=="N/A" OR drg_code==" " OR drg_code=="00000" )
AND (_time &amp;gt;= strptime("2016-09-01","%Y-%m-%d")) AND (_time &amp;lt;= strptime("2016-11-30","%Y-%m-%d"))
by drg_code 
| rename count as mycount
| sort 30 -mycount
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;...If so, then try this....&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;| tstats count 
where (index=&amp;lt;claims_table&amp;gt;)   
AND (claim_classification="INPATIENT" OR claim_classification="INPATIENT (PART A)*") 
AND NOT (drg_code=="NO_APR" OR drg_code=="N/A" OR drg_code==" " OR drg_code=="00000" )
AND (_time &amp;gt;= strptime("2015-01-01","%Y-%m-%d")) AND (_time &amp;lt;= strptime("2016-11-29","%Y-%m-%d"))
by drg_code 
| rename count as mycount
| sort 30 -mycount
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;...and then this... &lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt; index=&amp;lt;claims_table&amp;gt; 
(claim_classification=INPATIENT OR claim_classification="INPATIENT (PART A)*") 
AND (fst_srvc_dt&amp;gt;=2016-11-26 AND fst_srvc_dt&amp;lt;=2016-11-29) 
AND [| tstats count 
    where (index=&amp;lt;claims_table&amp;gt;)   
    AND (claim_classification="INPATIENT" OR claim_classification="INPATIENT (PART A)*") 
    AND NOT (drg_code=="NO_APR" OR drg_code=="N/A" OR drg_code==" " OR drg_code=="00000" )
    AND (_time &amp;gt;= strptime("2015-01-01","%Y-%m-%d")) AND (_time &amp;lt;= strptime("2016-11-29","%Y-%m-%d"))
      by drg_code 
    | rename count as mycount
    | sort 30 -mycount
    | table drg_code
    ]
| join type=left  prov_tin 
     [| makeresults | eval prov_tin="&amp;lt;tin1&amp;gt; &amp;lt;tin2&amp;gt; &amp;lt;tin3&amp;gt; &amp;lt;tin4&amp;gt; &amp;lt;tin5&amp;gt; &amp;lt;tin6&amp;gt; &amp;lt;tin7&amp;gt; &amp;lt;tin8&amp;gt; &amp;lt;tin9&amp;gt; &amp;lt;tin10&amp;gt; &amp;lt;tin11&amp;gt; &amp;lt;tin12&amp;gt;" | makemv prov_tin | mvexpand prov_tin 
     | eval prov_group="montefiore" | table prov_tin prov_group
     ]
| eval prov_group=coalesce(prov_group," all others")
| stats count as mycount,mean(net_pd_amt) as myavg by drg_code prov_group
| eval myavg=round(myavg,0)
| eventstats sum(mycount) as mytotal by drg_code 
| sort 20 -mytotal 
&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Tue, 29 Sep 2020 13:50:07 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/How-can-we-improve-the-performance-of-an-Hunk-s-query/m-p/346362#M102593</guid>
      <dc:creator>DalJeanis</dc:creator>
      <dc:date>2020-09-29T13:50:07Z</dc:date>
    </item>
    <item>
      <title>Re: How can we improve the performance of a Hunk query?</title>
      <link>https://community.splunk.com/t5/Splunk-Search/How-can-we-improve-the-performance-of-an-Hunk-s-query/m-p/346363#M102594</link>
      <description>&lt;P&gt;PS. This data lives in Hadoop and is queried via Hunk, and there are no date/time indexes. So the beautiful append you wrote in the first example is probably going to take 3 or 4 hours to run in this environment.&lt;/P&gt;</description>
      <pubDate>Tue, 25 Apr 2017 23:32:07 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/How-can-we-improve-the-performance-of-an-Hunk-s-query/m-p/346363#M102594</guid>
      <dc:creator>Claw</dc:creator>
      <dc:date>2017-04-25T23:32:07Z</dc:date>
    </item>
    <item>
      <title>Re: How can we improve the performance of a Hunk query?</title>
      <link>https://community.splunk.com/t5/Splunk-Search/How-can-we-improve-the-performance-of-an-Hunk-s-query/m-p/346364#M102595</link>
      <description>&lt;P&gt;Thank you Claw!!! &lt;/P&gt;</description>
      <pubDate>Wed, 26 Apr 2017 19:04:19 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/How-can-we-improve-the-performance-of-an-Hunk-s-query/m-p/346364#M102595</guid>
      <dc:creator>ddrillic</dc:creator>
      <dc:date>2017-04-26T19:04:19Z</dc:date>
    </item>
    <item>
      <title>Re: How can we improve the performance of a Hunk query?</title>
      <link>https://community.splunk.com/t5/Splunk-Search/How-can-we-improve-the-performance-of-an-Hunk-s-query/m-p/346365#M102596</link>
      <description>&lt;P&gt;ROFL.  Well, then... I'd have to go back to solution 2 - create a summary index of whatever part of this stuff is likely to be reusable. I'd probably aggregate it to the tin/drg_code/day level at each unit (whatever term Hadoop uses for an indexer or a chunk of the distributed database).&lt;/P&gt;</description>
      <pubDate>Wed, 26 Apr 2017 19:34:11 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/How-can-we-improve-the-performance-of-an-Hunk-s-query/m-p/346365#M102596</guid>
      <dc:creator>DalJeanis</dc:creator>
      <dc:date>2017-04-26T19:34:11Z</dc:date>
    </item>
  </channel>
</rss>

