
Join operation query optimization

tareddy
Explorer

I am dealing with a massive dataset and I need to perform a join operation on a common ID (COMMONID). I have tried multiple variations of the query, but I am facing issues with each of them. It would be great if anyone could provide a solution (modifying the configuration files is not an option).

Query Version 1: This query uses too much memory and auto-finalizes after reaching the 1 GB data limit, so the results are inaccurate.

index=INDEXA earliest=-7d@d latest=@d sourcetype=GHI "service=randomservice" (api_name=API1 OR api_name=API2 )
| stats first(_time) as _time, values(XYZ) AS XYZ, values(PQR) AS PQR by COMMONID
| fillnull PQR value="NULL"
| search XYZ=DEF
| timechart span=1d count by PQR

Query Version 2: This query appears to be more efficient, but I am getting "[subsearch]: Search auto-finalized after time limit (60 seconds) reached", so I am not able to obtain accurate results.

index=INDEXA earliest=-7d@d latest=@d sourcetype=GHI "service=randomservice" (api_name=API1 OR api_name=API2 )
[search index=INDEXA earliest=-7d@d latest=@d sourcetype=GHI "service=randomservice"  (api_name=API1  OR api_name=API2 ) 
   | search XYZ=DEF | fields COMMONID
   | dedup COMMONID]  
| stats first(_time) as _time, values(XYZ) AS XYZ, values(PQR) AS PQR by COMMONID
| fillnull PQR value="NULL"
| timechart span=1d count by PQR

Query Version 3: This query will probably give accurate results, but it is very inefficient and will take about 20 hours to finish execution.

index=INDEXA earliest=-7d@d latest=@d sourcetype=GHI "service=randomservice"  (api_name=API1  OR api_name=API2 )
| search (PQR=* OR NOT PQR=*)
| join COMMONID
   [search index=INDEXA earliest=-7d@d latest=@d sourcetype=GHI "service=randomservice"  (api_name=API1  OR api_name=API2 )
   | search XYZ=DEF ]
| stats first(_time) as _time, values(XYZ) AS XYZ, values(PQR) AS PQR by COMMONID 
| fillnull PQR value="NULL" 
| timechart span=1d count by PQR

woodcock
Esteemed Legend

I don't get | search (PQR=* OR NOT PQR=*); it does absolutely nothing, so why is it there?


tareddy
Explorer

I want all the values of PQR to be considered (including NULL values). You're probably right, though; I guess it should work even without "| search (PQR=* OR NOT PQR=*)".


woodcock
Esteemed Legend

If that part disappears then the search simplifies.
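
With it gone, Query 3 is just the same search minus the no-op filter:

index=INDEXA earliest=-7d@d latest=@d sourcetype=GHI "service=randomservice" (api_name=API1 OR api_name=API2)
| join COMMONID
   [search index=INDEXA earliest=-7d@d latest=@d sourcetype=GHI "service=randomservice" (api_name=API1 OR api_name=API2)
   | search XYZ=DEF ]
| stats first(_time) as _time, values(XYZ) AS XYZ, values(PQR) AS PQR by COMMONID
| fillnull PQR value="NULL"
| timechart span=1d count by PQR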


DalJeanis
Legend

1) In Query 2, replace | fields COMMONID | dedup COMMONID with | stats count by COMMONID | fields COMMONID for a slight boost (stats needs at least one aggregation function, and trimming back to COMMONID keeps the extra count field out of the subsearch output).
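
With that change, the subsearch in Query 2 would look something like this:

   [search index=INDEXA earliest=-7d@d latest=@d sourcetype=GHI "service=randomservice" (api_name=API1 OR api_name=API2)
   | search XYZ=DEF
   | stats count by COMMONID
   | fields COMMONID]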

2) You could break query 2 down into two pieces and have piece 1 output a lookup to be used by piece 2. That way, the combined searches have a much higher total limit.

First half...

index=INDEXA earliest=-7d@d latest=@d sourcetype=GHI "service=randomservice" 
    (api_name=API1 OR api_name=API2 ) 
| search XYZ=DEF 
| stats count by COMMONID 
| fields COMMONID 
| outputlookup myCommonIds.csv

Second half...

index=INDEXA earliest=-7d@d latest=@d sourcetype=GHI "service=randomservice" 
    (api_name=API1 OR api_name=API2 ) 
    [| inputlookup append=t myCommonIds.csv] 
| stats first(_time) as _time, values(XYZ) AS XYZ, values(PQR) AS PQR by COMMONID 
| fillnull PQR value="NULL"
| timechart span=30m count by PQR

3) Since you're not doing anything to the records before the stats command, you could break the search down even further with multisearch...

| multisearch 
    [ search index=INDEXA earliest=-7d@d latest=-5d@d sourcetype=GHI "service=randomservice" 
          (api_name=API1 OR api_name=API2 )   
          [| inputlookup append=t myCommonIds.csv] 
    ]
    [ search index=INDEXA earliest=-5d@d latest=-3d@d sourcetype=GHI "service=randomservice" 
          (api_name=API1 OR api_name=API2 )   
          [| inputlookup append=t myCommonIds.csv] 
     ]
     [  search index=INDEXA earliest=-3d@d latest=@d  sourcetype=GHI "service=randomservice" 
            (api_name=API1 OR api_name=API2 )  
            [| inputlookup append=t myCommonIds.csv] 
     ]
 | stats first(_time) as _time, values(XYZ) AS XYZ, values(PQR) AS PQR by COMMONID 
 | fillnull PQR value="NULL"
 | timechart span=30m count by PQR

cpetterborg
SplunkTrust

Do you have access to summary searches? If so, it appears that you could use that method to break your search up into smaller, already-summarized chunks of data. That way your end result would take FAR less time to produce at search time.
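
As a rough sketch (assuming you can schedule saved searches and write to a summary index, here called summary_ghi), a search run once per day could pre-compute that day's counts:

index=INDEXA earliest=-1d@d latest=@d sourcetype=GHI "service=randomservice" (api_name=API1 OR api_name=API2)
| stats first(_time) as _time, values(XYZ) AS XYZ, values(PQR) AS PQR by COMMONID
| search XYZ=DEF
| fillnull value="NULL" PQR
| bin _time span=1d
| stats count by _time PQR
| collect index=summary_ghi

and the weekly report would then read only the summary:

index=summary_ghi earliest=-7d@d latest=@d
| timechart span=1d sum(count) by PQR

One caveat: this rolls up per day, so a COMMONID whose events cross midnight gets evaluated within each day separately rather than across the full 7 days as in your original stats.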


cmerriman
Super Champion

Just a quick idea; I'm not sure whether it's applicable for your data.

Can you do:

index=INDEXA earliest=-7d@d latest=@d sourcetype=GHI "service=randomservice" XYZ=DEF (api_name=API1 OR api_name=API2)
| stats first(_time) as _time, values(PQR) AS PQR by COMMONID
| fillnull PQR value="NULL"
| timechart span=1d count by PQR
