Splunk Dev

Join operation query optimization

tareddy
Explorer

I am dealing with a massive dataset, i need to perform a join operation on a commonid. I have tried multiple variations of the query but i am facing issues with each of them. It would be great if anyone can provide a solution(Modifying the configuration files is not an option).

Query Version 1: This query is taking up too much of memory and the query is auto-finalizing after reaching 1GB data limit, consequently the results are inaccurate.

index=INDEXA earliest=-7d@d latest=@d sourcetype=GHI "service=randomservice" (api_name=API1 OR api_name=API2 )
| stats first(_time) as _time, values(XYZ) AS XYZ, values(PQR) AS PQR by COMMONID
| fillnull PQR value="NULL"
| search XYZ=DEF
| timechart span=1d count by PQR

Query Version 2: This query appears to be more efficient, but I am getting "[subsearch]: Search auto-finalized after time limit (60 seconds) reached" and consequently i am not able to obtain the accurate results.

index=INDEXA earliest=-7d@d latest=@d sourcetype=GHI "service=randomservice" (api_name=API1 OR api_name=API2 )
[search index=INDEXA earliest=-7d@d latest=@d sourcetype=GHI "service=randomservice"  (api_name=API1  OR api_name=API2 ) 
   | search XYZ=DEF | fields COMMONID
   | dedup COMMONID]  
| stats first(_time) as _time, values(XYZ) AS XYZ, values(PQR) AS PQR by COMMONID
| fillnull PQR value="NULL"
| timechart span=1d count by PQR

Query Version 3: This query will probably give accurate results, but it is very inefficient will take about 20 hours to finish execution.

index=INDEXA earliest=-7d@d latest=@d sourcetype=GHI "service=randomservice"  (api_name=API1  OR api_name=API2 )
| search (PQR=* OR NOT PQR=*)
| join COMMONID
   [search index=INDEXA earliest=-7d@d latest=@d sourcetype=GHI "service=randomservice"  (api_name=API1  OR api_name=API2 )
   | search XYZ=DEF ]
| stats first(_time) as _time, values(XYZ) AS XYZ, values(PQR) AS PQR by COMMONID 
| fillnull PQR value="NULL" 
| timechart span=1d count by PQR
Tags (1)
0 Karma

woodcock
Esteemed Legend

I don't get | search (PQR=* OR NOT PQR=*); it does absolutely nothing so why is it there?

0 Karma

tareddy
Explorer

I want all the values of PQR to be considered (including NULL values), you're probably right though, i guess it should work even without "| search (PQR=* OR NOT PQR=*)"

0 Karma

woodcock
Esteemed Legend

If that part disappears then the search simplifies.

0 Karma

DalJeanis
Legend

1) In query 2 replace | fields COMMONID | dedup COMMONID with |stats by COMMONID for a slight boost.

2) You could break query 2 down into two pieces and have piece 1 output a lookup to be used by piece 2. That way, the combined searches have a much higher total limit.

First half...

index=INDEXA earliest=-7d@d latest=@d sourcetype=GHI "service=randomservice" 
    (api_name=API1 OR api_name=API2 ) 
| search XYZ=DEF 
| stats by COMMONID 
| outputlookup myCommonIds.csv

Second half...

index=INDEXA earliest=-7d@d latest=@d sourcetype=GHI "service=randomservice" 
    (api_name=API1 OR api_name=API2 ) 
    [| inputlookup append=t myCommonIds.csv] 
| stats first(_time) as _time, values(XYZ) AS XYZ, values(PQR) AS PQR by COMMONID 
| fillnull PQR value="NULL"
| timechart span=30m count by PQR

3) Since you're not doing anything to the records before the stats command, you could break the search down even further with multisearch...

| multisearch 
    [ search index=INDEXA earliest=-7d@d latest=-5d@d sourcetype=GHI "service=randomservice" 
          (api_name=API1 OR api_name=API2 )   
          [| inputlookup append=t myCommonIds.csv] 
    ]
    [ search index=INDEXA earliest=-5d@d latest=-3d@d sourcetype=GHI "service=randomservice" 
          (api_name=API1 OR api_name=API2 )   
          [| inputlookup append=t myCommonIds.csv] 
     ]
     [  search index=INDEXA earliest=-3d@d latest=@d  sourcetype=GHI "service=randomservice" 
            (api_name=API1 OR api_name=API2 )  
            [| inputlookup append=t myCommonIds.csv] 
     ]
 | stats first(_time) as _time, values(XYZ) AS XYZ, values(PQR) AS PQR by COMMONID 
 | fillnull PQR value="NULL"
 | timechart span=30m count by PQR
0 Karma

cpetterborg
SplunkTrust
SplunkTrust

Do you have access to doing summary searches? If so, it appears that you could use that method to break up your search into smaller, already summarized chunks of data. That way your end result would take FAR less time to produce at search time.

0 Karma

cmerriman
Super Champion

just a quick idea, i'm not sure if they're applicable or not for your data.

Can you do: index=INDEXA earliest=-7d@d latest=@d sourcetype=GHI "service=randomservice" XYZ=DEF (api_name=API1 OR api_name=API2 )| stats first(_time) as _time, values(PQR) AS PQR by COMMONID | fillnull PQR value="NULL" | timechart span=1d count by PQR

0 Karma
Career Survey
First 500 qualified respondents will receive a $20 gift card! Tell us about your professional Splunk journey.

Can’t make it to .conf25? Join us online!

Get Updates on the Splunk Community!

Splunkbase Unveils New App Listing Management Public Preview

Splunkbase Unveils New App Listing Management Public PreviewWe're thrilled to announce the public preview of ...

Leveraging Automated Threat Analysis Across the Splunk Ecosystem

Are you leveraging automation to its fullest potential in your threat detection strategy?Our upcoming Security ...

Can’t Make It to Boston? Stream .conf25 and Learn with Haya Husain

Boston may be buzzing this September with Splunk University and .conf25, but you don’t have to pack a bag to ...