Join operation query optimization

tareddy · ‎07-20-2017

I am dealing with a massive dataset, i need to perform a join operation on a commonid. I have tried multiple variations of the query but i am facing issues with each of them. It would be great if anyone can provide a solution(Modifying the configuration files is not an option).

Query Version 1: This query is taking up too much of memory and the query is auto-finalizing after reaching 1GB data limit, consequently the results are inaccurate.

index=INDEXA earliest=-7d@d latest=@d sourcetype=GHI "service=randomservice" (api_name=API1 OR api_name=API2 )
| stats first(_time) as _time, values(XYZ) AS XYZ, values(PQR) AS PQR by COMMONID
| fillnull PQR value="NULL"
| search XYZ=DEF
| timechart span=1d count by PQR

Query Version 2: This query appears to be more efficient, but I am getting "[subsearch]: Search auto-finalized after time limit (60 seconds) reached" and consequently i am not able to obtain the accurate results.

index=INDEXA earliest=-7d@d latest=@d sourcetype=GHI "service=randomservice" (api_name=API1 OR api_name=API2 )
[search index=INDEXA earliest=-7d@d latest=@d sourcetype=GHI "service=randomservice"  (api_name=API1  OR api_name=API2 ) 
   | search XYZ=DEF | fields COMMONID
   | dedup COMMONID]  
| stats first(_time) as _time, values(XYZ) AS XYZ, values(PQR) AS PQR by COMMONID
| fillnull PQR value="NULL"
| timechart span=1d count by PQR

Query Version 3: This query will probably give accurate results, but it is very inefficient will take about 20 hours to finish execution.

index=INDEXA earliest=-7d@d latest=@d sourcetype=GHI "service=randomservice"  (api_name=API1  OR api_name=API2 )
| search (PQR=* OR NOT PQR=*)
| join COMMONID
   [search index=INDEXA earliest=-7d@d latest=@d sourcetype=GHI "service=randomservice"  (api_name=API1  OR api_name=API2 )
   | search XYZ=DEF ]
| stats first(_time) as _time, values(XYZ) AS XYZ, values(PQR) AS PQR by COMMONID 
| fillnull PQR value="NULL" 
| timechart span=1d count by PQR

woodcock · ‎07-22-2017

I don't get | search (PQR=* OR NOT PQR=*); it does absolutely nothing so why is it there?

tareddy · ‎07-24-2017

I want all the values of PQR to be considered (including NULL values), you're probably right though, i guess it should work even without "| search (PQR=* OR NOT PQR=*)"

woodcock · ‎07-24-2017

If that part disappears then the search simplifies.

DalJeanis · ‎07-21-2017

1) In query 2 replace | fields COMMONID | dedup COMMONID with |stats by COMMONID for a slight boost.

2) You could break query 2 down into two pieces and have piece 1 output a lookup to be used by piece 2. That way, the combined searches have a much higher total limit.

First half...

index=INDEXA earliest=-7d@d latest=@d sourcetype=GHI "service=randomservice" 
    (api_name=API1 OR api_name=API2 ) 
| search XYZ=DEF 
| stats by COMMONID 
| outputlookup myCommonIds.csv

Second half...

index=INDEXA earliest=-7d@d latest=@d sourcetype=GHI "service=randomservice" 
    (api_name=API1 OR api_name=API2 ) 
    [| inputlookup append=t myCommonIds.csv] 
| stats first(_time) as _time, values(XYZ) AS XYZ, values(PQR) AS PQR by COMMONID 
| fillnull PQR value="NULL"
| timechart span=30m count by PQR

3) Since you're not doing anything to the records before the stats command, you could break the search down even further with multisearch...

| multisearch 
    [ search index=INDEXA earliest=-7d@d latest=-5d@d sourcetype=GHI "service=randomservice" 
          (api_name=API1 OR api_name=API2 )   
          [| inputlookup append=t myCommonIds.csv] 
    ]
    [ search index=INDEXA earliest=-5d@d latest=-3d@d sourcetype=GHI "service=randomservice" 
          (api_name=API1 OR api_name=API2 )   
          [| inputlookup append=t myCommonIds.csv] 
     ]
     [  search index=INDEXA earliest=-3d@d latest=@d  sourcetype=GHI "service=randomservice" 
            (api_name=API1 OR api_name=API2 )  
            [| inputlookup append=t myCommonIds.csv] 
     ]
 | stats first(_time) as _time, values(XYZ) AS XYZ, values(PQR) AS PQR by COMMONID 
 | fillnull PQR value="NULL"
 | timechart span=30m count by PQR

cpetterborg · ‎07-21-2017

Do you have access to doing summary searches? If so, it appears that you could use that method to break up your search into smaller, already summarized chunks of data. That way your end result would take FAR less time to produce at search time.

cmerriman · ‎07-21-2017

just a quick idea, i'm not sure if they're applicable or not for your data.

Can you do: index=INDEXA earliest=-7d@d latest=@d sourcetype=GHI "service=randomservice" XYZ=DEF (api_name=API1 OR api_name=API2 )| stats first(_time) as _time, values(PQR) AS PQR by COMMONID | fillnull PQR value="NULL" | timechart span=1d count by PQR

Join operation query optimization

Index This | Forward, I’m heavy; backward, I’m not. What am I?

A Guide To Cloud Migration Success

Join Us for Splunk University and Get Your Bootcamp Game On!