Join two sources via common field to grab another ...

yumeina · ‎03-05-2024

Hi,

Been trying to connect/join two log sources which have fields that share the same values.

To break it down:
source_1

field_A, field_D, and field_E

source_2

field_B, and field_C

field_a and field_b can share same value.
field_c can correspond to multiple values of field_A/field_B.

The query should essentially add field_c from source_2 to every filtered event in source_1 (like a left join, with source_2 almost functioning as a lookup table).

I've gotten pretty close with my Join query, but it's a bit slow and not populating all the field_c's. Inspecting the job reveals I'm hitting 50000 result limit.

I've also tried a stew query using stats, which is much faster, but it's not actually connecting the events / data together.

Here are the queries I've been using so far:

join

index=index_1
sourcetype=source_1 field_D="Device" field_E=*Down* OR field_E=*Up*
| rename field_A as field_B

| join type=left max=0 field_B [ search source="source_2" earliest=-30d@d latest=@m]
| table field_D field_E field_B field_C

stats w/ coalesce()

index=index_1
(sourcetype=source_1 field_D="Device" field_E=*Down* OR field_E=*Up*) OR (source="source_2" earliest=-30d@d latest=@m)
| eval field_AB=coalesce(field_A, field_B) 
| fields field_D field_E field_AB field_C
| stats values(*) as * by field_AB

expected output

field_D	field_E	field_A/field_B	field_C
fun_text	Up/Down_text	shared_value	corresponding_value

bowesmana · ‎03-05-2024

To add to @yuanliu comment - the starting point to diagnose why something is NOT giving you what you expect is to isolate the simple example of a source from each where you do not get the results expected.

If you are unable to understand why it's not connecting the events as you suggest, post a sanitised example here, so we can help with different sets of eyes.

stats is certainly the way to go.

Does this example model what your data looks like

| makeresults count=3
| eval field_A=split("ABC",""), field_D="text"
| mvexpand field_A
| eval field_E=mvindex(split("DOWN,UP",","),random() % 2)
| append [ 
  | makeresults count=5
  | eval field_B=split("ABC",""), field_C="F_C".(random() % 10)
  | mvexpand field_B
]
| eval field_AB=coalesce(field_A, field_B) 
| fields field_D field_E field_AB field_C
| stats values(*) as * by field_AB

yuanliu · ‎03-05-2024

You are correct that join is slow and easily hits limit. But how is stats with coalesce not actually connecting the events / data together? What exactly do you get? Given that your mock code uses mock field names, are you sure you typed field names correctly in coalesce and group by?

What is the output from the following test?

index=index_1
(sourcetype=source_1 field_D="Device" field_E=*Down* OR field_E=*Up*) OR (source="source_2" earliest=-30d@d latest=@m)
| eval field_AB=coalesce(field_A, field_B)
| where isnotnull(field_AB)
| table _time source field_AB

If field_A exists in every event from source_1 and likewise field_B in source_2, the test should list every event from both sources. Do you get that?

If all spellings are correct, you can rename your actual field names to field_A, field_B, etc., then post sample data from the two sources (anonymize as needed) so volunteers will have a basis to help in concrete ways.

This is a long way to say that stats with coalesce is the correct code.

yumeina · ‎03-06-2024

Thanks for the response!

With the test query, I'm seeing both sources.

For further research I counted all the field_AB by sourcetype and found that there are significantly more source_2 than source_1, not sure if that affects anything though.

sourcetype (last 5 min)
source_1 -> count 147
source_2 -> count 66359

In my initial query, I do get both event from both sources. It's not populating the fields in each event the way I want it though.

Inital Query Results

field_D	field_AB	field_C	field_E
DeviceType	UniqueID		Up
DeviceType	UniqueID		Down
	UniqueID	Data_2
	UniqueID	Data_1

Expected Query Results

field_D	field_AB	field_C	field_E
DeviceType	UniqueID	Data_1	Up
DeviceType	UniqueID	Data_2	Down
DeviceType	UniqueID	Data_2	Down
DeviceType	UniqueID	Data_1	Down

yuanliu · ‎03-06-2024

Mere count of field_AB is not particularly insightful. (Also, what are the counts? Are they unique value counts or counts of events that has field_AB? There are many other metrics for either of these to be useful.)

Let me clarify something. In your latest illustration of Inital Query Results, you are having problem populating field_D, field_C, and field_E corresponding to select values of field_AB. Is this correct? Your illustration cannot be a precise representation because it is impossible for groupby field_AB to have the same value "UniqueID" in every row. This means that your observed missing values of field_D, field_C, and field_E has something to do with your actual data.

In other words, coalesce, stats, and groupby are all functioning as designed. All those "gaps" in field_D, field_C, and field_E only means that source_1 and source_b contain different sets of unique values of field_AB (field_A in source_1, field_B in source_2). This is a simple test:

index=index_1
(sourcetype=source_1 field_D="Device" field_E=*Down* OR field_E=*Up*) OR (source="source_2" earliest=-30d@d latest=@m)
| eval field_AB=coalesce(field_A, field_B)
| eval source = case(source == "source_2", "source_2", sourcetype == "source_1", "source_1", true(), "mingled")
| stats dc(field_AB) by source

Are the count the same for both sources? Are there any "mingled" counts? (Note one of your searches uses sourcetype as filter, the other uses source. This is generally a bad strategy because they will confuse you three months from now.)

If you need concrete help, you must post anonymized raw data, and show us which data set cause the problem. Alternatively, you need to inspect data closely; specifically, how are field_D and field_E populated with field_A in source_1, and how is field_C populated with field_B in source_2.

yumeina · ‎03-07-2024

The counts were calculated via

index=index_1 (sourcetype=source_1 field_D="Text" field_E=*Down* OR field_E=*Up*) OR (sourcetype=source_2)
| dedup field_B keepempty=true
| eval field_AB=coalesce(field_A, field_B)
| where isnotnull(field_AB)
| stats dc(field_AB) as count by sourcetype

To your point about source vs. sourcetype, I realized that after looking at those counts and made both of my filters use sourcetype.

As for my Initial Query Results, let me clarify that for you: UniqueID is not the same value in every row. It can occur more than once, yes, as this log is reporting the status of the device as it goes up and down.

As for field_C, to clarify from my initial post: field_C is an identifier which can be mapped to multiple field_AB.

field_C is the only field from source_2 I really want to add to source_1. Essentially I'm using source_2 as a reference to pull field_C from, matching all instances of UniqueID_X to whichever corresponding field_C.

Here is what it looks like with that in mind:

Inital Query Results

field_D	field_AB	field_C	field_E
DeviceType	UniqueID_1		Up
DeviceType	UniqueID_2		Down
	UniqueID_3	Data_2
	UniqueID_4	Data_1

As well as expected results:

field_D	field_AB	field_C	field_E
DeviceType	UniqueID_1	Data_1	Up
DeviceType	UniqueID_2	Data_2	Down
DeviceType	UniqueID_3	Data_2	Down
DeviceType	UniqueID_4	Data_1	Down

Additionally, here is raw data from source_1:

2024/03/07 09:06:12.75 ET,ND="\A001   ",DIP="$ABC0.",Sys=B002,Serv=C,Ent="AAA-BBBBBB  ",AppID=0000,LogID=1111,field_A="AA000111    ",Alias="            ",Tkn1="222222222   ",Tkn2="333333333                ",field_D="DeviceType",field_E="BB AA000111     (00) Down, error 0000"

And here is raw data from source_2 (field_B and field_C are towards the bottom):

BSYS=  ,EXL=     ,HI=        ,HSTT= ,MBR=     ,NAME=        ,ORRR=  ,RDDD=  ,REV= ,LINK=0000,POSTDATE=240307,RESP=     ,RESP_ERR=                    ,R=   ,D=          ,DE=          ,RECOM= ,SCORE=    ,STAND= ,ROUTE=00000000,NUM=0000000000,NUM1=0000000000,NUM2=0000000000,NUM3=000000000,CODE=1,POS=P,RNDTTT=          ,AUTH=000000,ASYS=0-,AN=A000  ,RCODE= ,TIMEIN=24030709061224,BURNRES= ,BURNRT=00000,BURNS=00000,ACCEPTCODE=0000,ACCEPTNAME=00000000000000  ,ANOTHERID=AAA    ,INPUT_CAP=0,INPUTMODE=0,ST=00,STATE=AA,LEVEL= ,COMREQ=0,PDATE=240308,CCODE=000,RTD=     ,CCC= ,CCC_RES= ,CCC_RES2= ,SCODE=  ,DACTUIB= ,DDD=   ,DDDR=               ,DDDE=  ,DDDEB= ,DDDN= ,DDDT= ,DDDF= ,DDDM=               ,DDDM2=               ,DDMN=       ,DDOI= ,DDRN=            ,DDRC=  ,DDAF=               ,DDTA=               ,DDTAF=               ,DDTT=   ,EI= ,EARD=          ,EERD=          ,EEFT= ,ESSS=     ,EAS=            ,EFRT=          ,EH=AAA0AAA ,EII=AAAAAAA0,EO=AAAAA00B,FMS= ,FPPP= ,FAT=00,GRC=00,HS=00,HII=00000000,ITN=AAA03,ISS=0000,JT=240307090612,LAA=0000000000,LAA2=0000000000,MSI= ,MIDD=AAAAA AAAAAAAAAAA,MT=0000,MMMM= ,MMOD=            ,MMOS=   ,MMU=   ,MD= ,MDDD= ,MRRC=   ,MRRRR=                                ,MSGGG=0,MSGGG=0000,MCS=  ,MCQ=  ,NCF=    ,NPD=240308,NII=AAA,NPPC=      ,NNNPPP= ,NNNPPPR= ,NTNT= ,NTNTR= ,OSS=00,OOM= ,PADD= ,POOOS=000,PDCCC=000000000A,PSC12=000000      ,PSCSC=11111111,PTTTID=      ,QTIME=11111,RAII=111111111,REIII=77775750  ,RTDDD= ,RTRRR=    ,RTRRRS=    ,RIII=BBBBBBB        ,RSSST=AA ,RSAAAA=AAAA B CCCCCCCCC DD EE,RRRC=00,RTTTTT=00000,RESSSCC=A,RSSS=0,RSCC=A,EFFN=00000000,RCCCC=00,RMRRM= ,RRTD=     ,PREPRE=000000      ,SSTS=00,SNNN=000000,Ssrc=,STMF=00000000,field_B=AA000111,SCCA=00000000,SCCA=00000000,STQQ=000000,SYSTEMTRACE=00000000000,TIST=00000,field_C=AA00    ,TOID=00000000,TST=OK,TT=00,TIMEZONE=*,ACCT=00,TDA=                   ,TDARID=           ,TCC=00,TNID=000000,TTTDM=240307080559,TTTIII=0000000000000000,VRES=

Hope that clears it up.

yumeina · ‎03-07-2024

An update:

I've gotten pretty close by doing this:

index=index_1
(sourcetype=source_2 earliest=1 latest=now()) OR (sourcetype=source_1 field_D="DeviceType" field_E=*Down* OR field_E=*Up*)
| eval field_AB=coalesce(field_A, field_B) 
| fields field_D field_E field_AB field_C
| stats values(field_E) values(field_C) by field_AB

and then using the UI to sort ascending by field_E.

However, there are still many cases where field_E is not populating with a matching field_AB, despite that same field_E matching a different field_AB. To illustrate, see below:

field_AB	field_E	field_C
11122233	Up Down	00002
11111179	Down
11111178	Up	00001

11111178 and 11111179 share 00001 as field_c, but it's not populating.

I checked to make sure it wasn't missing in source_2 and found multiple events where field_AB=11111178 and field_C=00001.

With that in mind, I'm not sure how to ensure I'm getting all the events from the log besides using the earliest and latest filters, unless I'm misunderstanding how those filters work.

yuanliu · ‎03-07-2024

Again, this is a data problem. As you described in OP, field_E and field_A only exist in source_1, field_C and field_B only exist in source_2. When you get the results reminiscent to your latest illustration, do you have evidence that field_B in source_2 has a value of 11111179 AND a non-null value of field_C? In other words, does this give you any result in the exact search period you used to obtain that table?

index=index_1
  sourcetype=source_2 earliest=1 latest=now()
  field_C=* field_B = 11111179

(By the way, I don't think you posted the correct mock code because earliest=1 latest=now() will always give you no results, therefore none of rows should have field_C populated. Unless your Splunk has a time machine in it.)

yumeina · ‎03-08-2024

I described in OP that field_A and field_B share the value. So there's nothing I can do to join the rest of the corresponding fields together?

Yes, I did find evidence of field_B in source_2 having a value of 11111179 and a non-null value of field_C, which is what I intended to refer to in my previous post.

In regards to the mock code, the search worked, so I'm guessing the flags were ignored.

Either way, if it is indeed a data problem, I'm not seeing it in my logs. Thank you for the insight though.

Will update this post if that does end up being the problem.

Join two sources via common field to grab another field w/ multiple matches

eval

fields

join

stats

table

Join Us for Splunk University and Get Your Bootcamp Game On!

.conf24 | Learning Tracks for Security, Observability, Platform, and Developers!

Announcing Scheduled Export GA for Dashboard Studio