<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic How can we improve the performance of a Hunk query? in Splunk Search</title>
    <link>https://community.splunk.com/t5/Splunk-Search/How-can-we-improve-the-performance-of-an-Hunk-s-query/m-p/346360#M102591</link>
    <description>&lt;P&gt;We have the following Hunk query - &lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;index=&amp;lt;claims_table&amp;gt; claim_classification=INPATIENT OR claim_classification="INPATIENT (PART A)*" fst_srvc_dt&amp;gt;=2015* fst_srvc_dt&amp;lt;=2016-11-29 drg_code!=NO_APR drg_code!="N/A" drg_code!=" " drg_code!=00000 
| eval prov_group=if(prov_tin=&amp;lt;tin1&amp;gt; OR prov_tin=&amp;lt;tin2&amp;gt; OR prov_tin=&amp;lt;tin3&amp;gt; OR prov_tin=&amp;lt;tin4&amp;gt; OR prov_tin=&amp;lt;tin5&amp;gt; OR prov_tin=&amp;lt;tin6&amp;gt; OR prov_tin=&amp;lt;tin7&amp;gt; OR  prov_tin=&amp;lt;tin8&amp;gt; OR prov_tin=&amp;lt;tin9&amp;gt; OR prov_tin=&amp;lt;tin10&amp;gt; OR prov_tin=&amp;lt;tin11&amp;gt; OR prov_tin=&amp;lt;tin12&amp;gt;,"montefiore"," all others") 
| stats count as count ,mean(net_pd_amt) as avg by drg_code prov_group 
| eval avg=round(avg,0)
| eventstats sum(count) as total by drg_code 
| sort 20 -total 
| xyseries drg_code prov_group avg count 
| addtotals fieldname=grandtotals 
| sort -grandtotals 
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;The &lt;CODE&gt;claims_table&lt;/CODE&gt; contains two billion claims, and this query processes about a million claims per minute.&lt;/P&gt;

&lt;P&gt;Any ideas on how to improve it?&lt;/P&gt;</description>
    <pubDate>Tue, 25 Apr 2017 20:08:08 GMT</pubDate>
    <dc:creator>ddrillic</dc:creator>
    <dc:date>2017-04-25T20:08:08Z</dc:date>
    <item>
      <title>How can we improve the performance of a Hunk query?</title>
      <link>https://community.splunk.com/t5/Splunk-Search/How-can-we-improve-the-performance-of-an-Hunk-s-query/m-p/346360#M102591</link>
      <description>&lt;P&gt;We have the following Hunk query - &lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;index=&amp;lt;claims_table&amp;gt; claim_classification=INPATIENT OR claim_classification="INPATIENT (PART A)*" fst_srvc_dt&amp;gt;=2015* fst_srvc_dt&amp;lt;=2016-11-29 drg_code!=NO_APR drg_code!="N/A" drg_code!=" " drg_code!=00000 
| eval prov_group=if(prov_tin=&amp;lt;tin1&amp;gt; OR prov_tin=&amp;lt;tin2&amp;gt; OR prov_tin=&amp;lt;tin3&amp;gt; OR prov_tin=&amp;lt;tin4&amp;gt; OR prov_tin=&amp;lt;tin5&amp;gt; OR prov_tin=&amp;lt;tin6&amp;gt; OR prov_tin=&amp;lt;tin7&amp;gt; OR  prov_tin=&amp;lt;tin8&amp;gt; OR prov_tin=&amp;lt;tin9&amp;gt; OR prov_tin=&amp;lt;tin10&amp;gt; OR prov_tin=&amp;lt;tin11&amp;gt; OR prov_tin=&amp;lt;tin12&amp;gt;,"montefiore"," all others") 
| stats count as count ,mean(net_pd_amt) as avg by drg_code prov_group 
| eval avg=round(avg,0)
| eventstats sum(count) as total by drg_code 
| sort 20 -total 
| xyseries drg_code prov_group avg count 
| addtotals fieldname=grandtotals 
| sort -grandtotals 
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;The &lt;CODE&gt;claims_table&lt;/CODE&gt; contains two billion claims, and this query processes about a million claims per minute.&lt;/P&gt;

&lt;P&gt;Any ideas on how to improve it?&lt;/P&gt;</description>
      <pubDate>Tue, 25 Apr 2017 20:08:08 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/How-can-we-improve-the-performance-of-an-Hunk-s-query/m-p/346360#M102591</guid>
      <dc:creator>ddrillic</dc:creator>
      <dc:date>2017-04-25T20:08:08Z</dc:date>
    </item>
    <item>
      <title>Re: How can we improve the performance of a Hunk query?</title>
      <link>https://community.splunk.com/t5/Splunk-Search/How-can-we-improve-the-performance-of-an-Hunk-s-query/m-p/346361#M102592</link>
      <description>&lt;P&gt;You are assigning field results to reserved words (&lt;CODE&gt;count&lt;/CODE&gt;, &lt;CODE&gt;avg&lt;/CODE&gt;, &lt;CODE&gt;total&lt;/CODE&gt;).&lt;/P&gt;

&lt;P&gt;As in&lt;/P&gt;

&lt;P&gt;&lt;CODE&gt;| stats count as count ,mean(net_pd_amt) as avg by drg_code prov_group&lt;/CODE&gt;&lt;BR /&gt;
which, below, I have changed to&lt;/P&gt;

&lt;P&gt;&lt;CODE&gt;| stats count as mycount ,mean(net_pd_amt) as myavg by drg_code prov_group&lt;/CODE&gt;&lt;/P&gt;

&lt;P&gt;Try this instead.&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;index=&amp;lt;claims_table&amp;gt; claim_classification=INPATIENT 
OR claim_classification="INPATIENT (PART A)*" fst_srvc_dt&amp;gt;=2015* fst_srvc_dt&amp;lt;=2016-11-29 drg_code!=NO_APR drg_code!="N/A"       drg_code!=" " drg_code!=00000 
 | eval prov_group=if(prov_tin=&amp;lt;tin1&amp;gt; OR prov_tin=&amp;lt;tin2&amp;gt; OR prov_tin=&amp;lt;tin3&amp;gt; OR prov_tin=&amp;lt;tin4&amp;gt; OR prov_tin=&amp;lt;tin5&amp;gt; OR prov_tin=&amp;lt;tin6&amp;gt; OR prov_tin=&amp;lt;tin7&amp;gt; OR  prov_tin=&amp;lt;tin8&amp;gt; OR prov_tin=&amp;lt;tin9&amp;gt; OR prov_tin=&amp;lt;tin10&amp;gt; OR prov_tin=&amp;lt;tin11&amp;gt; OR prov_tin=&amp;lt;tin12&amp;gt;,"montefiore"," all others") 
 | stats count as mycount ,mean(net_pd_amt) as myavg by drg_code prov_group 
 | eval myavg=round(myavg,0)
 | eventstats sum(mycount) as mytotal by drg_code 
 | sort 20 -mytotal 
 | xyseries drg_code prov_group myavg mycount 
 | addtotals fieldname=grandtotals 
 | sort -grandtotals
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;After that, start from the initial search and add one step at a time, testing each to make sure you still get results.&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;index=&amp;lt;claims_table&amp;gt; claim_classification=INPATIENT 
OR claim_classification="INPATIENT (PART A)*" fst_srvc_dt&amp;gt;=2015* fst_srvc_dt&amp;lt;=2016-11-29 drg_code!=NO_APR drg_code!="N/A" drg_code!=" " drg_code!=00000
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;And then&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;index=&amp;lt;claims_table&amp;gt; claim_classification=INPATIENT 
OR claim_classification="INPATIENT (PART A)*" fst_srvc_dt&amp;gt;=2015* fst_srvc_dt&amp;lt;=2016-11-29 drg_code!=NO_APR drg_code!="N/A" drg_code!=" " drg_code!=00000 
 | eval prov_group=if(prov_tin=&amp;lt;tin1&amp;gt; OR prov_tin=&amp;lt;tin2&amp;gt; OR prov_tin=&amp;lt;tin3&amp;gt; OR prov_tin=&amp;lt;tin4&amp;gt; OR prov_tin=&amp;lt;tin5&amp;gt; OR prov_tin=&amp;lt;tin6&amp;gt; OR prov_tin=&amp;lt;tin7&amp;gt; OR prov_tin=&amp;lt;tin8&amp;gt; OR prov_tin=&amp;lt;tin9&amp;gt; OR prov_tin=&amp;lt;tin10&amp;gt; OR prov_tin=&amp;lt;tin11&amp;gt; OR prov_tin=&amp;lt;tin12&amp;gt;,"montefiore"," all others")
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;And then&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;index=&amp;lt;claims_table&amp;gt; claim_classification=INPATIENT 
OR claim_classification="INPATIENT (PART A)*" fst_srvc_dt&amp;gt;=2015* fst_srvc_dt&amp;lt;=2016-11-29 drg_code!=NO_APR drg_code!="N/A" drg_code!=" " drg_code!=00000 
 | eval prov_group=if(prov_tin=&amp;lt;tin1&amp;gt; OR prov_tin=&amp;lt;tin2&amp;gt; OR prov_tin=&amp;lt;tin3&amp;gt; OR prov_tin=&amp;lt;tin4&amp;gt; OR prov_tin=&amp;lt;tin5&amp;gt; OR prov_tin=&amp;lt;tin6&amp;gt; OR prov_tin=&amp;lt;tin7&amp;gt; OR prov_tin=&amp;lt;tin8&amp;gt; OR prov_tin=&amp;lt;tin9&amp;gt; OR prov_tin=&amp;lt;tin10&amp;gt; OR prov_tin=&amp;lt;tin11&amp;gt; OR prov_tin=&amp;lt;tin12&amp;gt;,"montefiore"," all others") 
 | stats count as mycount ,mean(net_pd_amt) as myavg by drg_code prov_group
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;until you have the entire search running.&lt;/P&gt;</description>
      <pubDate>Tue, 25 Apr 2017 20:46:13 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/How-can-we-improve-the-performance-of-an-Hunk-s-query/m-p/346361#M102592</guid>
      <dc:creator>Claw</dc:creator>
      <dc:date>2017-04-25T20:46:13Z</dc:date>
    </item>
    <item>
      <title>Re: How can we improve the performance of a Hunk query?</title>
      <link>https://community.splunk.com/t5/Splunk-Search/How-can-we-improve-the-performance-of-an-Hunk-s-query/m-p/346362#M102593</link>
      <description>&lt;P&gt;Try a time trial on this, against the most recent 4 days of data - &lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;   index=&amp;lt;claims_table&amp;gt; 
   (claim_classification=INPATIENT OR claim_classification="INPATIENT (PART A)*") 
   AND (fst_srvc_dt&amp;gt;=2016-11-26 AND fst_srvc_dt&amp;lt;=2016-11-29) 
   AND (prov_tin=&amp;lt;tin1&amp;gt; OR prov_tin=&amp;lt;tin2&amp;gt; OR prov_tin=&amp;lt;tin3&amp;gt; OR prov_tin=&amp;lt;tin4&amp;gt; OR prov_tin=&amp;lt;tin5&amp;gt; 
   OR prov_tin=&amp;lt;tin6&amp;gt; OR prov_tin=&amp;lt;tin7&amp;gt; OR  prov_tin=&amp;lt;tin8&amp;gt; OR prov_tin=&amp;lt;tin9&amp;gt; OR prov_tin=&amp;lt;tin10&amp;gt; 
   OR prov_tin=&amp;lt;tin11&amp;gt; OR prov_tin=&amp;lt;tin12&amp;gt;)
   AND NOT (drg_code=NO_APR OR drg_code="N/A" OR drg_code=" " OR drg_code=00000 )
   | stats count as mycount , mean(net_pd_amt) as myavg by drg_code
   | sort 30 -mycount 
   | eval prov_group="montefiore"
   | append 
       [index=&amp;lt;claims_table&amp;gt; 
       (claim_classification=INPATIENT OR claim_classification="INPATIENT (PART A)*") 
       AND (fst_srvc_dt&amp;gt;=2016-11-26 AND fst_srvc_dt&amp;lt;=2016-11-29) 
       AND NOT (prov_tin=&amp;lt;tin1&amp;gt; OR prov_tin=&amp;lt;tin2&amp;gt; OR prov_tin=&amp;lt;tin3&amp;gt; OR prov_tin=&amp;lt;tin4&amp;gt; OR prov_tin=&amp;lt;tin5&amp;gt; 
           OR prov_tin=&amp;lt;tin6&amp;gt; OR prov_tin=&amp;lt;tin7&amp;gt; OR  prov_tin=&amp;lt;tin8&amp;gt; OR prov_tin=&amp;lt;tin9&amp;gt; OR prov_tin=&amp;lt;tin10&amp;gt; 
           OR prov_tin=&amp;lt;tin11&amp;gt; OR prov_tin=&amp;lt;tin12&amp;gt;)
       AND NOT (drg_code=NO_APR OR drg_code="N/A" OR drg_code=" " OR drg_code=00000 )
       | stats count as mycount ,mean(net_pd_amt) as myavg by drg_code
       | sort 30 -mycount 
       | eval prov_group=" all others"
       ]
    | eval myavg=round(myavg,0)
    | eventstats sum(mycount) as mytotal by drg_code 
    | sort 20 -mytotal 
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;... against this...&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;  index=&amp;lt;claims_table&amp;gt; 
   (claim_classification=INPATIENT OR claim_classification="INPATIENT (PART A)*") 
   AND (fst_srvc_dt&amp;gt;=2016-11-26 AND fst_srvc_dt&amp;lt;=2016-11-29) 
   AND NOT (drg_code=NO_APR OR drg_code="N/A" OR drg_code=" " OR drg_code=00000 )
   | eval prov_group=if(prov_tin=&amp;lt;tin1&amp;gt; OR prov_tin=&amp;lt;tin2&amp;gt; OR prov_tin=&amp;lt;tin3&amp;gt; OR prov_tin=&amp;lt;tin4&amp;gt; 
                     OR prov_tin=&amp;lt;tin5&amp;gt; OR prov_tin=&amp;lt;tin6&amp;gt; OR prov_tin=&amp;lt;tin7&amp;gt; OR prov_tin=&amp;lt;tin8&amp;gt;  
                     OR prov_tin=&amp;lt;tin9&amp;gt; OR prov_tin=&amp;lt;tin10&amp;gt; OR prov_tin=&amp;lt;tin11&amp;gt; OR prov_tin=&amp;lt;tin12&amp;gt;,
                    "montefiore"," all others") 
   | stats count as mycount, mean(net_pd_amt) as myavg by drg_code prov_group
   | eval myavg=round(myavg,0)
   | eventstats sum(mycount) as mytotal by drg_code 
   | sort 20 -mytotal 
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;.. and against this...&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;  index=&amp;lt;claims_table&amp;gt; 
   (claim_classification=INPATIENT OR claim_classification="INPATIENT (PART A)*") 
   AND (fst_srvc_dt&amp;gt;=2016-11-26 AND fst_srvc_dt&amp;lt;=2016-11-29) 
   AND NOT (drg_code=NO_APR OR drg_code="N/A" OR drg_code=" " OR drg_code=00000 )
   | join type=left  prov_tin 
        [| makeresults | eval prov_tin="&amp;lt;tin1&amp;gt; &amp;lt;tin2&amp;gt; &amp;lt;tin3&amp;gt; &amp;lt;tin4&amp;gt; &amp;lt;tin5&amp;gt; &amp;lt;tin6&amp;gt; &amp;lt;tin7&amp;gt; &amp;lt;tin8&amp;gt; &amp;lt;tin9&amp;gt; &amp;lt;tin10&amp;gt; &amp;lt;tin11&amp;gt; &amp;lt;tin12&amp;gt;" | makemv prov_tin | mvexpand prov_tin 
        | eval prov_group="montefiore" | table prov_tin prov_group
        ]
   | eval prov_group=coalesce(prov_group," all others")
   | stats count as mycount,mean(net_pd_amt) as myavg by drg_code prov_group
   | eval myavg=round(myavg,0)
   | eventstats sum(mycount) as mytotal by drg_code 
   | sort 20 -mytotal 
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;HR /&gt;

&lt;P&gt;Discussion -&lt;/P&gt;

&lt;P&gt;Performance of each one of these is going to be data- and architecture-dependent. &lt;/P&gt;

&lt;P&gt;The top one assumes that the top 20 drg_codes overall must appear in the top 30 of each prov_group, and that &lt;CODE&gt;prov_tin&lt;/CODE&gt; is an indexed field. It aims to save calculation overall by doing the stats work at the prov_group level; the efficiency gains will depend on how sparse the tins under inspection are.&lt;/P&gt;

&lt;P&gt;The second one is basically yours with the changes suggested by Claw.  &lt;/P&gt;

&lt;P&gt;The third one tests a join against a small generated table as a way of assigning the prov_group.  Efficiency might be significantly better or worse as a join, depending on the cost of sorting compared to the cost of twelve individual compares for each record.&lt;/P&gt;
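&lt;P&gt;A lookup file would avoid the subsearch entirely. As a minimal sketch - assuming a hypothetical CSV lookup named &lt;CODE&gt;tin_groups.csv&lt;/CODE&gt; with columns &lt;CODE&gt;prov_tin&lt;/CODE&gt; and &lt;CODE&gt;prov_group&lt;/CODE&gt; - the join steps could become:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;   ...
   | lookup tin_groups.csv prov_tin OUTPUT prov_group
   | eval prov_group=coalesce(prov_group," all others")
   ...
&lt;/CODE&gt;&lt;/PRE&gt;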

&lt;HR /&gt;

&lt;P&gt;The real BEST solution is likely to consist of some of the following  -&lt;/P&gt;

&lt;P&gt;1) turn (drg_code=NO_APR OR drg_code="N/A" OR drg_code=" " OR drg_code=00000 ) into a single indexed field so you can use it for information at the summary level&lt;BR /&gt;
2) create a summary index of the data you are likely to want to reuse.&lt;BR /&gt;
3) since you only want the top 20 drg_codes by overall count, collect only that information FIRST, then scan for only those.  DANG, I should have coded that one.  Hang on...&lt;/P&gt;
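
&lt;P&gt;For item 2, the collecting search might look roughly like this - a sketch only, assuming a hypothetical summary index named &lt;CODE&gt;claims_summary&lt;/CODE&gt; and daily aggregation (note the &lt;CODE&gt;sum&lt;/CODE&gt; rather than &lt;CODE&gt;mean&lt;/CODE&gt;, so averages can be recomputed correctly later):&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;index=&amp;lt;claims_table&amp;gt; (claim_classification=INPATIENT OR claim_classification="INPATIENT (PART A)*")
 | stats count as claim_count, sum(net_pd_amt) as paid_total by drg_code prov_tin fst_srvc_dt
 | collect index=claims_summary
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;Subsequent searches would then run against &lt;CODE&gt;claims_summary&lt;/CODE&gt; and derive the average as &lt;CODE&gt;sum(paid_total)/sum(claim_count)&lt;/CODE&gt; instead of rescanning the raw claims.&lt;/P&gt;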

&lt;HR /&gt;

&lt;P&gt;If this works, the following should return the top 30 drg_codes for a 3-month period extremely fast... &lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;| tstats count 
where (index=&amp;lt;claims_table&amp;gt;)   
AND (claim_classification="INPATIENT" OR claim_classification="INPATIENT (PART A)*") 
AND NOT (drg_code=="NO_APR" OR drg_code=="N/A" OR drg_code==" " OR drg_code=="00000" )
AND (_time &amp;gt;= strptime("2016-09-01","%Y-%m-%d")) AND (_time &amp;lt;= strptime("2016-11-30","%Y-%m-%d"))
by drg_code 
| rename count as mycount
| sort 30 -mycount
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;...If so, then try this....&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;| tstats count 
where (index=&amp;lt;claims_table&amp;gt;)   
AND (claim_classification="INPATIENT" OR claim_classification="INPATIENT (PART A)*") 
AND NOT (drg_code=="NO_APR" OR drg_code=="N/A" OR drg_code==" " OR drg_code=="00000" )
AND (_time &amp;gt;= strptime("2015-01-01","%Y-%m-%d")) AND (_time &amp;lt;= strptime("2016-11-29","%Y-%m-%d"))
by drg_code 
| rename count as mycount
| sort 30 -mycount
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;...and then this... &lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt; index=&amp;lt;claims_table&amp;gt; 
(claim_classification=INPATIENT OR claim_classification="INPATIENT (PART A)*") 
AND (fst_srvc_dt&amp;gt;=2016-11-26 AND fst_srvc_dt&amp;lt;=2016-11-29) 
AND [| tstats count 
    where (index=&amp;lt;claims_table&amp;gt;)   
    AND (claim_classification="INPATIENT" OR claim_classification="INPATIENT (PART A)*") 
    AND NOT (drg_code=="NO_APR" OR drg_code=="N/A" OR drg_code==" " OR drg_code=="00000" )
    AND (_time &amp;gt;= strptime("2015-01-01","%Y-%m-%d")) AND (_time &amp;lt;= strptime("2016-11-29","%Y-%m-%d"))
      by drg_code 
    | rename count as mycount
    | sort 30 -mycount
    | table drg_code
    ]
| join type=left  prov_tin 
     [| makeresults | eval prov_tin="&amp;lt;tin1&amp;gt; &amp;lt;tin2&amp;gt; &amp;lt;tin3&amp;gt; &amp;lt;tin4&amp;gt; &amp;lt;tin5&amp;gt; &amp;lt;tin6&amp;gt; &amp;lt;tin7&amp;gt; &amp;lt;tin8&amp;gt; &amp;lt;tin9&amp;gt; &amp;lt;tin10&amp;gt; &amp;lt;tin11&amp;gt; &amp;lt;tin12&amp;gt;" | makemv prov_tin | mvexpand prov_tin 
     | eval prov_group="montefiore" | table prov_tin prov_group
     ]
| eval prov_group=coalesce(prov_group," all others")
| stats count as mycount,mean(net_pd_amt) as myavg by drg_code prov_group
| eval myavg=round(myavg,0)
| eventstats sum(mycount) as mytotal by drg_code 
| sort 20 -mytotal 
&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Tue, 29 Sep 2020 13:50:07 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/How-can-we-improve-the-performance-of-an-Hunk-s-query/m-p/346362#M102593</guid>
      <dc:creator>DalJeanis</dc:creator>
      <dc:date>2020-09-29T13:50:07Z</dc:date>
    </item>
    <item>
      <title>Re: How can we improve the performance of a Hunk query?</title>
      <link>https://community.splunk.com/t5/Splunk-Search/How-can-we-improve-the-performance-of-an-Hunk-s-query/m-p/346363#M102594</link>
      <description>&lt;P&gt;PS. This data lives in Hadoop and is queried via Hunk, and there are no date/time indexes. So the beautiful append you wrote in the first example is probably going to take 3 or 4 hours to run in this environment.&lt;/P&gt;</description>
      <pubDate>Tue, 25 Apr 2017 23:32:07 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/How-can-we-improve-the-performance-of-an-Hunk-s-query/m-p/346363#M102594</guid>
      <dc:creator>Claw</dc:creator>
      <dc:date>2017-04-25T23:32:07Z</dc:date>
    </item>
    <item>
      <title>Re: How can we improve the performance of a Hunk query?</title>
      <link>https://community.splunk.com/t5/Splunk-Search/How-can-we-improve-the-performance-of-an-Hunk-s-query/m-p/346364#M102595</link>
      <description>&lt;P&gt;Thank you Claw!!! &lt;/P&gt;</description>
      <pubDate>Wed, 26 Apr 2017 19:04:19 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/How-can-we-improve-the-performance-of-an-Hunk-s-query/m-p/346364#M102595</guid>
      <dc:creator>ddrillic</dc:creator>
      <dc:date>2017-04-26T19:04:19Z</dc:date>
    </item>
    <item>
      <title>Re: How can we improve the performance of a Hunk query?</title>
      <link>https://community.splunk.com/t5/Splunk-Search/How-can-we-improve-the-performance-of-an-Hunk-s-query/m-p/346365#M102596</link>
      <description>&lt;P&gt;ROFL.  Well, then... I'd have to go back to solution 2 - create a summary index of whatever part of this stuff is likely to be reusable. I'd probably aggregate it to the tin/drg_code/day level at each unit (whatever term Hadoop uses for an indexer or a chunk of the distributed database).&lt;/P&gt;</description>
      <pubDate>Wed, 26 Apr 2017 19:34:11 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/How-can-we-improve-the-performance-of-an-Hunk-s-query/m-p/346365#M102596</guid>
      <dc:creator>DalJeanis</dc:creator>
      <dc:date>2017-04-26T19:34:11Z</dc:date>
    </item>
  </channel>
</rss>

