Splunk Search

Question on left join

Deepz2612
Explorer

I'm not sure why is my left join not working.
I'm sure that my results will be than 50000 records.

kindly assist me!!

Tags (2)
0 Karma
1 Solution

DavidHourani
Super Champion

Hi @Deepz2612,

Not sure if your regexps are formed properly but you can replace your join with a stats as follows :

index=pcf_msl_azure_prod_us
OR (index=pcf_csdn_us-east_prod Api_Call="GET /api/vehicleimage/v1/vehicle/full")
OR (index=pcf_msl_us-east_prod source_type=RTR service="vehicleimage-service" response_code="404" Api_Call="GET /api/vehicleimage/v1/vehicle/full") 
OR(index= pcf_msl_us-east_prod source_type="APP/PROC/WEB" AND cf_app_name="vehicleimage-service" AND ("*uri=https://www.fordpass.com/content*" OR "*uri=https://buildfoc.ford.com/dig*" OR "*uri=https://www.fordeumedia-a.ford.com/nas/gforcenaslive*" OR "*uri=https://build.ford.com/dig*" OR ("*ERROR*" AND "*strix*")) )    
| rex field=msg "vin=(?\w+)" 
| rex \"(?\w+\s\S+)\? 
| rex field=msg "model=(?\w+)&" 
| rex field=msg "model=(?\w+)%2" 
| rex field=msg "model=(?\w+)%" 
| rex field=msg "model=(?([A-Z0-9][-])\w+)" 
| rex field=msg "model=(?\w+) HTTP" 
| rex field=msg "model=(?\w+)&" 
| eval Model = coalesce(Model,Model1,Model2,Model3,Model4,Model5) 
| rex field=msg "make=(?\w+)" 
| rex field=msg "year=(?\w+)" 
| rex field=msg "countryCode=(?\w+)" 
| rex field=msg "uri=(?.*)\/direct" 
| rex field=msg "uri=(?.*)\/dam" 
| rex field=msg "uri=(?.*)\/nas" 
| rex field=msg "strix-(?[^-]+)]" 
| eval URL = coalesce(URL1,URL2,URL3) 
| eval Backend= case (URL=="https://www.fordpass.com/content","FPCMS",URL=="https://www.fordeumedia-a.ford.com","GFORCE",URL=="https://build.ford.com/dig","DIG",URL=="https://buildfoc.ford.com/dig","DIG",true(),URL4) 
| rex field=msg "trace=(?.*)\,span" 
| rename cf_app_id as "APP_ID" 
| stats values(Users) as Users, values(Model) as Model, values(Make) as Make, values(Year) as Year, values(Country_code) as Country_code, values(APP_ID) as APP_ID, count(URL) as CountURL,list(Backend) as Backend by x_b3_traceid
| eval Retry= case (CountURL>"1",CountURL,true(),"No") 

Since I couldn't test the rex you made here's the idea so you can fix it up and get it running:
First call all your data, from all indexes, then extract your fields with rex and final use a stats to get the values of the required field BY the joining field, which in this case is x_b3_traceid.

Let me know if this helps.

Cheers,
David

View solution in original post

0 Karma

Richfez
SplunkTrust
SplunkTrust

If I take your search and remove all the rex and other commands that aren't essential to understanding your data, I get the following search:

index=pcf_msl_azure_prod_us OR index=pcf_csdn_us-east_prod OR index=pcf_msl_us-east_prod source_type=RTR service="vehicleimage-service" response_code="404" Api_Call="GET /api/vehicleimage/v1/vehicle/full" 
| join x_b3_traceid type=left 
    [ search index= pcf_msl_us-east_prod source_type="APP/PROC/WEB" AND cf_app_name="vehicleimage-service" AND
        ("*uri=https://www.fordpass.com/content*" OR "*uri=https://buildfoc.ford.com/dig*" OR "*uri=https://www.fordeumedia-a.ford.com/nas/gforcenaslive*" OR "*uri=https://build.ford.com/dig*" OR ("*ERROR*" AND "*strix*")) 
    | eval URL = coalesce(URL1,URL2,URL3) 
    | eval Backend= case (...) 
    | stats count(URL) as Count,list(Backend) as Backend by x_b3_traceid 
    | eval Retry= case (Count>"1",Count,true(),"No") ] 

I'm probably removing too much to make it actually work, but I think that's the gist of it.

First, could you provide a small sample of the two sets of data? That would be very helpful.

Second, are all those rex commands - did they get munged up in the copy/paste into here? There's no extractions in them, which just means I suspect a <Field17> thing got lost in translation.

Third, to the real issue. I am not convinced join is needed. And if you can drop the join, especially with the suspected data sizes of "more than 50k events" the performance of this should skyrocket.

Why is join not needed? Because it appears like x_b3_traceid is on both sides of the join already.

I'd start over with the data, and there's a couple of options on how to do it.

One way might be to append in the second set of data. Append has limits like join does on size - but in the case of append it's not a horrendously terribly performing thing, so changing your config to let it append more isn't nearly the problem as it would be with join.

It's also possible that a simple "pull all the data into one big pile, then stats it together in the right way later" could also work. This would be best, performance-wise. Well, probably anyway. 🙂

I/we can give examples and help in both cases, but I think a glance at the data may be very helpful.

Generic advice is that unless there's no other way to do a certain thing, avoid join. I mean, it's possible that very specific situations (tiny data, that you can guarantee will remain tiny, and with a lot of other caveats) it can be fine. But get out of the habit of using join, it's a SQL thing, not a Splunk thing. It's really a shame they named it the same because SQL people gravitate towards it since it looks familiar, when really they should try to not do that. But no worries, that's what we're here for; to help overcome the bad command naming done 10 years ago. 🙂

I acknowledge that there will be a change in thinking needed to do this, but the rewards at the end will be huge and your Splunk admins will love you for not crushing their search heads with poorly written, slow memory hogging searches.

So do paste in some sample data! Just a row or two (matching on x_b3_traceid or whatever would be best!) would be enough probably, assuming it's all pretty much like that. Feel free to obfuscate things, though remember, IP addresses really aren't that secret you know, and too much obfuscation can sometimes make the data so unlike the original that it's useless.

Happy Splunking! Hope to hear back from you soon on this!
-Rich

0 Karma

DavidHourani
Super Champion

Hi @Deepz2612,

Not sure if your regexps are formed properly but you can replace your join with a stats as follows :

index=pcf_msl_azure_prod_us
OR (index=pcf_csdn_us-east_prod Api_Call="GET /api/vehicleimage/v1/vehicle/full")
OR (index=pcf_msl_us-east_prod source_type=RTR service="vehicleimage-service" response_code="404" Api_Call="GET /api/vehicleimage/v1/vehicle/full") 
OR(index= pcf_msl_us-east_prod source_type="APP/PROC/WEB" AND cf_app_name="vehicleimage-service" AND ("*uri=https://www.fordpass.com/content*" OR "*uri=https://buildfoc.ford.com/dig*" OR "*uri=https://www.fordeumedia-a.ford.com/nas/gforcenaslive*" OR "*uri=https://build.ford.com/dig*" OR ("*ERROR*" AND "*strix*")) )    
| rex field=msg "vin=(?\w+)" 
| rex \"(?\w+\s\S+)\? 
| rex field=msg "model=(?\w+)&" 
| rex field=msg "model=(?\w+)%2" 
| rex field=msg "model=(?\w+)%" 
| rex field=msg "model=(?([A-Z0-9][-])\w+)" 
| rex field=msg "model=(?\w+) HTTP" 
| rex field=msg "model=(?\w+)&" 
| eval Model = coalesce(Model,Model1,Model2,Model3,Model4,Model5) 
| rex field=msg "make=(?\w+)" 
| rex field=msg "year=(?\w+)" 
| rex field=msg "countryCode=(?\w+)" 
| rex field=msg "uri=(?.*)\/direct" 
| rex field=msg "uri=(?.*)\/dam" 
| rex field=msg "uri=(?.*)\/nas" 
| rex field=msg "strix-(?[^-]+)]" 
| eval URL = coalesce(URL1,URL2,URL3) 
| eval Backend= case (URL=="https://www.fordpass.com/content","FPCMS",URL=="https://www.fordeumedia-a.ford.com","GFORCE",URL=="https://build.ford.com/dig","DIG",URL=="https://buildfoc.ford.com/dig","DIG",true(),URL4) 
| rex field=msg "trace=(?.*)\,span" 
| rename cf_app_id as "APP_ID" 
| stats values(Users) as Users, values(Model) as Model, values(Make) as Make, values(Year) as Year, values(Country_code) as Country_code, values(APP_ID) as APP_ID, count(URL) as CountURL,list(Backend) as Backend by x_b3_traceid
| eval Retry= case (CountURL>"1",CountURL,true(),"No") 

Since I couldn't test the rex you made here's the idea so you can fix it up and get it running:
First call all your data, from all indexes, then extract your fields with rex and final use a stats to get the values of the required field BY the joining field, which in this case is x_b3_traceid.

Let me know if this helps.

Cheers,
David

0 Karma

Richfez
SplunkTrust
SplunkTrust

@davidhourani

Only one note - you left the | search Api_Call="GET /api/vehicleimage/v1/vehicle/full" in there on line 7. You'll have to merge that part of the search into the base search (which should be easy) or else it'll filter out a lot of things that need to be kept.

DavidHourani
Super Champion

You rock, thanks man.

0 Karma
Get Updates on the Splunk Community!

.conf24 | Registration Open!

Hello, hello! I come bearing good news: Registration for .conf24 is now open!   conf is Splunk’s rad annual ...

ICYMI - Check out the latest releases of Splunk Edge Processor

Splunk is pleased to announce the latest enhancements to Splunk Edge Processor.  HEC Receiver authorization ...

Introducing the 2024 SplunkTrust!

Hello, Splunk Community! We are beyond thrilled to announce our newest group of SplunkTrust members!  The ...