Splunk Search

Join two sources via common field to grab another field w/ multiple matches

yumeina
Loves-to-Learn Everything

Hi,

Been trying to connect/join two log sources which have fields that share the same values.

To break it down:
source_1

  • field_A, field_D, and field_E

source_2

  • field_B, and field_C


field_a and field_b can share same value.
field_c can correspond to multiple values of field_A/field_B.

The query should essentially add field_c from source_2 to every filtered event in source_1 (like a left join, with source_2 almost functioning as a lookup table).

I've gotten pretty close with my Join query, but it's a bit slow and not populating all the field_c's. Inspecting the job reveals I'm hitting 50000 result limit.

I've also tried a stew query using stats, which is much faster, but it's not actually connecting the events / data together.

Here are the queries I've been using so far:

join

 

index=index_1
sourcetype=source_1 field_D="Device" field_E=*Down* OR field_E=*Up*
| rename field_A as field_B

| join type=left max=0 field_B [ search source="source_2" earliest=-30d@d latest=@m]
| table field_D field_E field_B field_C

 


stats w/ coalesce()

 

index=index_1
(sourcetype=source_1 field_D="Device" field_E=*Down* OR field_E=*Up*) OR (source="source_2" earliest=-30d@d latest=@m)
| eval field_AB=coalesce(field_A, field_B) 
| fields field_D field_E field_AB field_C
| stats values(*) as * by field_AB

 

 
expected output

field_Dfield_Efield_A/field_Bfield_C
fun_textUp/Down_textshared_valuecorresponding_value

 

Labels (5)
0 Karma

bowesmana
SplunkTrust
SplunkTrust

To add to @yuanliu comment - the starting point to diagnose why something is NOT giving you what you expect is to isolate the simple example of a source from each where you do not get the results expected.

If you are unable to understand why it's not connecting the events as you suggest, post a sanitised example here, so we can help with different sets of eyes.

stats is certainly the way to go.

Does this example model what your data looks like

| makeresults count=3
| eval field_A=split("ABC",""), field_D="text"
| mvexpand field_A
| eval field_E=mvindex(split("DOWN,UP",","),random() % 2)
| append [ 
  | makeresults count=5
  | eval field_B=split("ABC",""), field_C="F_C".(random() % 10)
  | mvexpand field_B
]
| eval field_AB=coalesce(field_A, field_B) 
| fields field_D field_E field_AB field_C
| stats values(*) as * by field_AB
0 Karma

yuanliu
SplunkTrust
SplunkTrust

You are correct that join is slow and easily hits limit.  But how is stats with coalesce not actually connecting the events / data together?  What exactly do you get?  Given that your mock code uses mock field names, are you sure you typed field names correctly in coalesce and group by?

What is the output from the following test?

index=index_1
(sourcetype=source_1 field_D="Device" field_E=*Down* OR field_E=*Up*) OR (source="source_2" earliest=-30d@d latest=@m)
| eval field_AB=coalesce(field_A, field_B)
| where isnotnull(field_AB)
| table _time source field_AB

If field_A exists in every event from source_1 and likewise field_B in source_2, the test should list every event from both sources.  Do you get that?

If all spellings are correct, you can rename your actual field names to field_A, field_B, etc., then post sample data from the two sources (anonymize as needed) so volunteers will have a basis to help in concrete ways.

This is a long way to say that stats with coalesce is the correct code.

0 Karma

yumeina
Loves-to-Learn Everything

Thanks for the response!

With the test query, I'm seeing both sources.

For further research I counted all the field_AB by sourcetype and found that there are significantly more source_2 than source_1, not sure if that affects anything though.

sourcetype (last 5 min)
source_1 -> count 147
source_2 -> count 66359

In my initial query, I do get both event from both sources. It's not populating the fields in each event the way I want it though.

Inital Query Results

field_Dfield_ABfield_Cfield_E
DeviceTypeUniqueID Up
DeviceTypeUniqueID Down
 UniqueIDData_2 
 UniqueIDData_1 

 

Expected Query Results

field_Dfield_ABfield_Cfield_E
DeviceTypeUniqueIDData_1Up
DeviceTypeUniqueIDData_2Down
DeviceTypeUniqueIDData_2Down
DeviceTypeUniqueIDData_1Down

 

0 Karma

yuanliu
SplunkTrust
SplunkTrust

Mere count of field_AB is not particularly insightful. (Also, what are the counts?  Are they unique value counts or counts of events that has field_AB?  There are many other metrics for either of these to be useful.)

Let me clarify something.  In your latest illustration of Inital Query Results, you are having problem populating field_D, field_C, and field_E corresponding to select values of field_AB.  Is this correct?  Your illustration cannot be a precise representation because it is impossible for groupby field_AB to have the same value "UniqueID" in every row.  This means that your observed missing values of field_D, field_C, and field_E has something to do with your actual data.

In other words, coalesce, stats, and groupby are all functioning as designed.  All those "gaps" in field_D, field_C, and field_E only means that source_1 and source_b contain different sets of unique values of field_AB (field_A in source_1, field_B in source_2).  This is a simple test:

 

index=index_1
(sourcetype=source_1 field_D="Device" field_E=*Down* OR field_E=*Up*) OR (source="source_2" earliest=-30d@d latest=@m)
| eval field_AB=coalesce(field_A, field_B)
| eval source = case(source == "source_2", "source_2", sourcetype == "source_1", "source_1", true(), "mingled")
| stats dc(field_AB) by source

 

Are the count the same for both sources? Are there any "mingled" counts? (Note one of your searches uses sourcetype as filter, the other uses source.  This is generally a bad strategy because they will confuse you three months from now.)

If you need concrete help, you must post anonymized raw data, and show us which data set cause the problem.  Alternatively, you need to inspect data closely; specifically, how are field_D and field_E populated with field_A in source_1, and how is field_C populated with field_B in source_2.

0 Karma

yumeina
Loves-to-Learn Everything

The counts were calculated via

 

index=index_1 (sourcetype=source_1 field_D="Text" field_E=*Down* OR field_E=*Up*) OR (sourcetype=source_2)
| dedup field_B keepempty=true
| eval field_AB=coalesce(field_A, field_B)
| where isnotnull(field_AB)
| stats dc(field_AB) as count by sourcetype

 

To your point about source vs. sourcetype, I realized that after looking at those counts and made both of my filters use sourcetype.

As for my Initial Query Results, let me clarify that for you: UniqueID is not the same value in every row. It can occur more than once, yes, as this log is reporting the status of the device as it goes up and down.

As for field_C, to clarify from my initial post: field_C is an identifier which can be mapped to multiple field_AB

field_C is the only field from source_2 I really want to add to source_1. Essentially I'm using source_2 as a reference to pull field_C from, matching all instances of UniqueID_X to whichever corresponding field_C.

Here is what it looks like with that in mind:

Inital Query Results

field_Dfield_ABfield_Cfield_E
DeviceTypeUniqueID_1 Up
DeviceTypeUniqueID_2 Down
 UniqueID_3Data_2 
 UniqueID_4Data_1 


As well as expected results:

field_Dfield_ABfield_Cfield_E
DeviceTypeUniqueID_1Data_1Up
DeviceTypeUniqueID_2Data_2Down
DeviceTypeUniqueID_3Data_2Down
DeviceTypeUniqueID_4Data_1Down


Additionally, here is raw data from source_1:

 

2024/03/07 09:06:12.75 ET,ND="\A001   ",DIP="$ABC0.",Sys=B002,Serv=C,Ent="AAA-BBBBBB  ",AppID=0000,LogID=1111,field_A="AA000111    ",Alias="            ",Tkn1="222222222   ",Tkn2="333333333                ",field_D="DeviceType",field_E="BB AA000111     (00) Down, error 0000"

 

And here is raw data from source_2 (field_B and field_C are towards the bottom):

 

BSYS=  ,EXL=     ,HI=        ,HSTT= ,MBR=     ,NAME=        ,ORRR=  ,RDDD=  ,REV= ,LINK=0000,POSTDATE=240307,RESP=     ,RESP_ERR=                    ,R=   ,D=          ,DE=          ,RECOM= ,SCORE=    ,STAND= ,ROUTE=00000000,NUM=0000000000,NUM1=0000000000,NUM2=0000000000,NUM3=000000000,CODE=1,POS=P,RNDTTT=          ,AUTH=000000,ASYS=0-,AN=A000  ,RCODE= ,TIMEIN=24030709061224,BURNRES= ,BURNRT=00000,BURNS=00000,ACCEPTCODE=0000,ACCEPTNAME=00000000000000  ,ANOTHERID=AAA    ,INPUT_CAP=0,INPUTMODE=0,ST=00,STATE=AA,LEVEL= ,COMREQ=0,PDATE=240308,CCODE=000,RTD=     ,CCC= ,CCC_RES= ,CCC_RES2= ,SCODE=  ,DACTUIB= ,DDD=   ,DDDR=               ,DDDE=  ,DDDEB= ,DDDN= ,DDDT= ,DDDF= ,DDDM=               ,DDDM2=               ,DDMN=       ,DDOI= ,DDRN=            ,DDRC=  ,DDAF=               ,DDTA=               ,DDTAF=               ,DDTT=   ,EI= ,EARD=          ,EERD=          ,EEFT= ,ESSS=     ,EAS=            ,EFRT=          ,EH=AAA0AAA ,EII=AAAAAAA0,EO=AAAAA00B,FMS= ,FPPP= ,FAT=00,GRC=00,HS=00,HII=00000000,ITN=AAA03,ISS=0000,JT=240307090612,LAA=0000000000,LAA2=0000000000,MSI= ,MIDD=AAAAA AAAAAAAAAAA,MT=0000,MMMM= ,MMOD=            ,MMOS=   ,MMU=   ,MD= ,MDDD= ,MRRC=   ,MRRRR=                                ,MSGGG=0,MSGGG=0000,MCS=  ,MCQ=  ,NCF=    ,NPD=240308,NII=AAA,NPPC=      ,NNNPPP= ,NNNPPPR= ,NTNT= ,NTNTR= ,OSS=00,OOM= ,PADD= ,POOOS=000,PDCCC=000000000A,PSC12=000000      ,PSCSC=11111111,PTTTID=      ,QTIME=11111,RAII=111111111,REIII=77775750  ,RTDDD= ,RTRRR=    ,RTRRRS=    ,RIII=BBBBBBB        ,RSSST=AA ,RSAAAA=AAAA B CCCCCCCCC DD EE,RRRC=00,RTTTTT=00000,RESSSCC=A,RSSS=0,RSCC=A,EFFN=00000000,RCCCC=00,RMRRM= ,RRTD=     ,PREPRE=000000      ,SSTS=00,SNNN=000000,Ssrc=,STMF=00000000,field_B=AA000111,SCCA=00000000,SCCA=00000000,STQQ=000000,SYSTEMTRACE=00000000000,TIST=00000,field_C=AA00    ,TOID=00000000,TST=OK,TT=00,TIMEZONE=*,ACCT=00,TDA=                   ,TDARID=           ,TCC=00,TNID=000000,TTTDM=240307080559,TTTIII=0000000000000000,VRES= 

 

Hope that clears it up.

0 Karma

yumeina
Loves-to-Learn Everything

An update:

I've gotten pretty close by doing this:

 

index=index_1
(sourcetype=source_2 earliest=1 latest=now()) OR (sourcetype=source_1 field_D="DeviceType" field_E=*Down* OR field_E=*Up*)
| eval field_AB=coalesce(field_A, field_B) 
| fields field_D field_E field_AB field_C
| stats values(field_E) values(field_C) by field_AB

 

 and then using the UI to sort ascending by field_E.


However, there are still many cases where field_E is not populating with a matching field_AB, despite that same field_E matching a different field_AB. To illustrate, see below:

field_AB

field_E

field_C
11122233

Up

Down

00002
11111179Down 
11111178Up00001


11111178 and 11111179 share 00001 as field_c, but it's not populating.

I checked to make sure it wasn't missing in source_2 and found multiple events where field_AB=11111178 and field_C=00001.

With that in mind, I'm not sure how to ensure I'm getting all the events from the log besides using the earliest and latest filters, unless I'm misunderstanding how those filters work.

0 Karma

yuanliu
SplunkTrust
SplunkTrust

Again, this is a data problem.  As you described in OP, field_E and field_A only exist in source_1, field_C and field_B only exist in source_2.  When you get the results reminiscent to your latest illustration, do you have evidence that field_B in source_2 has a value of 11111179 AND a non-null value of field_C?  In other words, does this give you any result in the exact search period you used to obtain that table?

index=index_1
  sourcetype=source_2 earliest=1 latest=now()
  field_C=* field_B = 11111179

(By the way, I don't think you posted the correct mock code because earliest=1 latest=now() will always give you no results, therefore none of rows should have field_C populated.  Unless your Splunk has a time machine in it.)

0 Karma

yumeina
Loves-to-Learn Everything

I described in OP that field_A and field_B share the value. So there's nothing I can do to join the rest of the corresponding fields together?

Yes, I did find evidence of field_B in source_2 having a value of 11111179 and a non-null value of field_C, which is what I intended to refer to in my previous post. 

In regards to the mock code,  the search worked, so I'm guessing the flags were ignored.

Either way, if it is indeed a data problem, I'm not seeing it in my logs. Thank you for the insight though.

Will update this post if that does end up being the problem.

0 Karma
Get Updates on the Splunk Community!

Easily Improve Agent Saturation with the Splunk Add-on for OpenTelemetry Collector

Agent Saturation What and Whys In application performance monitoring, saturation is defined as the total load ...

Explore the Latest Educational Offerings from Splunk [January 2025 Updates]

At Splunk Education, we are committed to providing a robust learning experience for all users, regardless of ...

Developer Spotlight with Paul Stout

Welcome to our very first developer spotlight release series where we'll feature some awesome Splunk ...