Hi,
I've been trying to connect/join two log sources that have fields sharing the same values.
To break it down:
source_1 contains field_A (plus field_D and field_E).
source_2 contains field_B and field_C.
field_A and field_B can share the same value.
field_C can correspond to multiple values of field_A/field_B.
The query should essentially add field_c from source_2 to every filtered event in source_1 (like a left join, with source_2 almost functioning as a lookup table).
I've gotten pretty close with my join query, but it's a bit slow and not populating all of the field_C values. Inspecting the job reveals I'm hitting the 50,000-result subsearch limit.
I've also tried a query using stats, which is much faster, but it's not actually connecting the events/data together.
Here are the queries I've been using so far:
join
index=index_1
sourcetype=source_1 field_D="Device" (field_E=*Down* OR field_E=*Up*)
| rename field_A as field_B
| join type=left max=0 field_B [ search source="source_2" earliest=-30d@d latest=@m]
| table field_D field_E field_B field_C
stats w/ coalesce()
index=index_1
(sourcetype=source_1 field_D="Device" (field_E=*Down* OR field_E=*Up*)) OR (source="source_2" earliest=-30d@d latest=@m)
| eval field_AB=coalesce(field_A, field_B)
| fields field_D field_E field_AB field_C
| stats values(*) as * by field_AB
expected output
field_D | field_E | field_A/field_B | field_C |
fun_text | Up/Down_text | shared_value | corresponding_value |
To add to @yuanliu's comment: the starting point for diagnosing why something is NOT giving you what you expect is to isolate a simple example from each source where you do not get the expected results.
If you are unable to work out why it's not connecting the events as you expect, post a sanitised example here so we can help with a different set of eyes.
stats is certainly the way to go.
Does this example model what your data looks like?
| makeresults count=3
| eval field_A=split("ABC",""), field_D="text"
| mvexpand field_A
| eval field_E=mvindex(split("DOWN,UP",","),random() % 2)
| append [
| makeresults count=5
| eval field_B=split("ABC",""), field_C="F_C".(random() % 10)
| mvexpand field_B
]
| eval field_AB=coalesce(field_A, field_B)
| fields field_D field_E field_AB field_C
| stats values(*) as * by field_AB
You are correct that join is slow and easily hits the limit. But how is stats with coalesce "not actually connecting the events / data together"? What exactly do you get? Given that your mock code uses mock field names, are you sure you typed the field names correctly in the coalesce and the group-by?
What is the output from the following test?
index=index_1
(sourcetype=source_1 field_D="Device" (field_E=*Down* OR field_E=*Up*)) OR (source="source_2" earliest=-30d@d latest=@m)
| eval field_AB=coalesce(field_A, field_B)
| where isnotnull(field_AB)
| table _time source field_AB
If field_A exists in every event from source_1 and likewise field_B in source_2, the test should list every event from both sources. Do you get that?
If all spellings are correct, you can rename your actual field names to field_A, field_B, etc., then post sample data from the two sources (anonymised as needed) so volunteers have a basis to help in concrete ways.
This is a long way to say that stats with coalesce is the correct code.
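For reference, a complete version of that approach (a sketch using your mock field names, untested against your data) looks like this:
index=index_1
(sourcetype=source_1 field_D="Device" (field_E=*Down* OR field_E=*Up*)) OR (source="source_2" earliest=-30d@d latest=@m)
| eval field_AB=coalesce(field_A, field_B)
| stats values(field_D) as field_D values(field_E) as field_E values(field_C) as field_C by field_AB
| where isnotnull(field_E)
| table field_D field_E field_AB field_C
The final where keeps only keys that appeared in source_1 (those with field_E), which mimics the left join; drop it if you want all keys.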
Thanks for the response!
With the test query, I'm seeing both sources.
For further research, I counted the distinct field_AB values by sourcetype and found that there are significantly more in source_2 than in source_1; not sure if that affects anything, though.
sourcetype (last 5 min)
source_1 -> count 147
source_2 -> count 66359
In my initial query, I do get events from both sources; it's just not populating the fields in each event the way I want.
Initial Query Results
field_D | field_AB | field_C | field_E |
DeviceType | UniqueID | | Up |
DeviceType | UniqueID | | Down |
| UniqueID | Data_2 | |
| UniqueID | Data_1 | |
Expected Query Results
field_D | field_AB | field_C | field_E |
DeviceType | UniqueID | Data_1 | Up |
DeviceType | UniqueID | Data_2 | Down |
DeviceType | UniqueID | Data_2 | Down |
DeviceType | UniqueID | Data_1 | Down |
A mere count of field_AB is not particularly insightful. (Also, what are the counts? Are they unique-value counts or counts of events that have field_AB? There are many other metrics that would be more useful than either of these.)
Let me clarify something. In your latest illustration of Initial Query Results, you are having trouble populating field_D, field_C, and field_E for certain values of field_AB. Is this correct? Your illustration cannot be a precise representation, because it is impossible for a group-by on field_AB to produce the same value "UniqueID" in every row. This means the missing values of field_D, field_C, and field_E you observe have something to do with your actual data.
In other words, coalesce, stats, and group-by are all functioning as designed. All those "gaps" in field_D, field_C, and field_E mean only that source_1 and source_2 contain different sets of unique field_AB values (field_A in source_1, field_B in source_2). Here is a simple test:
index=index_1
(sourcetype=source_1 field_D="Device" (field_E=*Down* OR field_E=*Up*)) OR (source="source_2" earliest=-30d@d latest=@m)
| eval field_AB=coalesce(field_A, field_B)
| eval source = case(source == "source_2", "source_2", sourcetype == "source_1", "source_1", true(), "mingled")
| stats dc(field_AB) by source
Are the counts the same for both sources? Are there any "mingled" counts? (Note that one of your searches filters on sourcetype while the other filters on source. This is generally a bad strategy because it will confuse you three months from now.)
If you need concrete help, you must post anonymised raw data and show us which data set causes the problem. Alternatively, inspect the data closely; specifically, how field_D and field_E are populated alongside field_A in source_1, and how field_C is populated alongside field_B in source_2.
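If the keys do line up, another option is to materialize source_2 as an actual lookup, since you are effectively using it as one. A sketch (the lookup file name field_c_lookup.csv is my invention; schedule the first search so the file stays current):
index=index_1 source="source_2" earliest=-30d@d latest=@m
| stats latest(field_C) as field_C by field_B
| outputlookup field_c_lookup.csv
Then the main search stays small and avoids the join limits entirely:
index=index_1 sourcetype=source_1 field_D="Device" (field_E=*Down* OR field_E=*Up*)
| lookup field_c_lookup.csv field_B as field_A OUTPUT field_C
| table field_D field_E field_A field_C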
The counts were calculated via
index=index_1 (sourcetype=source_1 field_D="Text" (field_E=*Down* OR field_E=*Up*)) OR (sourcetype=source_2)
| dedup field_B keepempty=true
| eval field_AB=coalesce(field_A, field_B)
| where isnotnull(field_AB)
| stats dc(field_AB) as count by sourcetype
To your point about source vs. sourcetype, I realized that after looking at those counts and made both of my filters use sourcetype.
As for my Initial Query Results, let me clarify: UniqueID is not the same value in every row. It can occur more than once, yes, since this log reports the status of a device as it goes up and down.
As for field_C, to clarify from my initial post: field_C is an identifier which can map to multiple field_AB values.
field_C is the only field from source_2 I really want to add to source_1. Essentially I'm using source_2 as a reference to pull field_C from, matching all instances of UniqueID_X to whichever field_C corresponds.
Here is what it looks like with that in mind:
Initial Query Results
field_D | field_AB | field_C | field_E |
DeviceType | UniqueID_1 | | Up |
DeviceType | UniqueID_2 | | Down |
| UniqueID_3 | Data_2 | |
| UniqueID_4 | Data_1 | |
As well as expected results:
field_D | field_AB | field_C | field_E |
DeviceType | UniqueID_1 | Data_1 | Up |
DeviceType | UniqueID_2 | Data_2 | Down |
DeviceType | UniqueID_3 | Data_2 | Down |
DeviceType | UniqueID_4 | Data_1 | Down |
Additionally, here is raw data from source_1:
2024/03/07 09:06:12.75 ET,ND="\A001 ",DIP="$ABC0.",Sys=B002,Serv=C,Ent="AAA-BBBBBB ",AppID=0000,LogID=1111,field_A="AA000111 ",Alias=" ",Tkn1="222222222 ",Tkn2="333333333 ",field_D="DeviceType",field_E="BB AA000111 (00) Down, error 0000"
And here is raw data from source_2 (field_B and field_C are towards the bottom):
BSYS= ,EXL= ,HI= ,HSTT= ,MBR= ,NAME= ,ORRR= ,RDDD= ,REV= ,LINK=0000,POSTDATE=240307,RESP= ,RESP_ERR= ,R= ,D= ,DE= ,RECOM= ,SCORE= ,STAND= ,ROUTE=00000000,NUM=0000000000,NUM1=0000000000,NUM2=0000000000,NUM3=000000000,CODE=1,POS=P,RNDTTT= ,AUTH=000000,ASYS=0-,AN=A000 ,RCODE= ,TIMEIN=24030709061224,BURNRES= ,BURNRT=00000,BURNS=00000,ACCEPTCODE=0000,ACCEPTNAME=00000000000000 ,ANOTHERID=AAA ,INPUT_CAP=0,INPUTMODE=0,ST=00,STATE=AA,LEVEL= ,COMREQ=0,PDATE=240308,CCODE=000,RTD= ,CCC= ,CCC_RES= ,CCC_RES2= ,SCODE= ,DACTUIB= ,DDD= ,DDDR= ,DDDE= ,DDDEB= ,DDDN= ,DDDT= ,DDDF= ,DDDM= ,DDDM2= ,DDMN= ,DDOI= ,DDRN= ,DDRC= ,DDAF= ,DDTA= ,DDTAF= ,DDTT= ,EI= ,EARD= ,EERD= ,EEFT= ,ESSS= ,EAS= ,EFRT= ,EH=AAA0AAA ,EII=AAAAAAA0,EO=AAAAA00B,FMS= ,FPPP= ,FAT=00,GRC=00,HS=00,HII=00000000,ITN=AAA03,ISS=0000,JT=240307090612,LAA=0000000000,LAA2=0000000000,MSI= ,MIDD=AAAAA AAAAAAAAAAA,MT=0000,MMMM= ,MMOD= ,MMOS= ,MMU= ,MD= ,MDDD= ,MRRC= ,MRRRR= ,MSGGG=0,MSGGG=0000,MCS= ,MCQ= ,NCF= ,NPD=240308,NII=AAA,NPPC= ,NNNPPP= ,NNNPPPR= ,NTNT= ,NTNTR= ,OSS=00,OOM= ,PADD= ,POOOS=000,PDCCC=000000000A,PSC12=000000 ,PSCSC=11111111,PTTTID= ,QTIME=11111,RAII=111111111,REIII=77775750 ,RTDDD= ,RTRRR= ,RTRRRS= ,RIII=BBBBBBB ,RSSST=AA ,RSAAAA=AAAA B CCCCCCCCC DD EE,RRRC=00,RTTTTT=00000,RESSSCC=A,RSSS=0,RSCC=A,EFFN=00000000,RCCCC=00,RMRRM= ,RRTD= ,PREPRE=000000 ,SSTS=00,SNNN=000000,Ssrc=,STMF=00000000,field_B=AA000111,SCCA=00000000,SCCA=00000000,STQQ=000000,SYSTEMTRACE=00000000000,TIST=00000,field_C=AA00 ,TOID=00000000,TST=OK,TT=00,TIMEZONE=*,ACCT=00,TDA= ,TDARID= ,TCC=00,TNID=000000,TTTDM=240307080559,TTTIII=0000000000000000,VRES=
Hope that clears it up.
An update:
I've gotten pretty close by doing this:
index=index_1
(sourcetype=source_2 earliest=1 latest=now()) OR (sourcetype=source_1 field_D="DeviceType" (field_E=*Down* OR field_E=*Up*))
| eval field_AB=coalesce(field_A, field_B)
| fields field_D field_E field_AB field_C
| stats values(field_E) values(field_C) by field_AB
and then using the UI to sort ascending by field_E.
However, there are still many cases where field_C is not populating for a matching field_AB, even though that same field_C value is populated for a different field_AB. To illustrate, see below:
field_AB | field_E | field_C |
11122233 | Up Down | 00002 |
11111179 | Down | |
11111178 | Up | 00001 |
11111178 and 11111179 share 00001 as field_C, but it's not populating for 11111179.
I checked to make sure it wasn't missing in source_2 and found multiple events where field_AB=11111178 and field_C=00001.
With that in mind, I'm not sure how to ensure I'm getting all the events from the log besides using the earliest and latest filters, unless I'm misunderstanding how those filters work.
Again, this is a data problem. As you described in the OP, field_E and field_A only exist in source_1, and field_C and field_B only exist in source_2. When you get results reminiscent of your latest illustration, do you have evidence that field_B in source_2 has a value of 11111179 AND a non-null value of field_C? In other words, does this give you any results in the exact search period you used to obtain that table?
index=index_1
sourcetype=source_2 earliest=1 latest=now()
field_C=* field_B = 11111179
(By the way, I don't think you posted the correct mock code, because earliest=1 latest=now() will always give you no results, therefore none of the rows should have field_C populated. Unless your Splunk has a time machine in it.)
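Separately, once the key mismatch is sorted out, if you want each source_1 event on its own row (as in your expected results) rather than aggregated values, eventstats is worth a try; a sketch with your mock names, untested:
index=index_1
(sourcetype=source_1 field_D="DeviceType" (field_E=*Down* OR field_E=*Up*)) OR (sourcetype=source_2)
| eval field_AB=coalesce(field_A, field_B)
| eventstats values(field_C) as field_C by field_AB
| search sourcetype=source_1
| table _time field_D field_E field_AB field_C
eventstats copies field_C onto every event that shares field_AB, and the search then keeps only the source_1 events, one row each.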
I described in the OP that field_A and field_B share values. So is there nothing I can do to join the rest of the corresponding fields together?
Yes, I did find evidence of field_B in source_2 having a value of 11111179 and a non-null value of field_C, which is what I intended to refer to in my previous post.
Regarding the mock code, the search worked, so I'm guessing the flags were ignored.
Either way, if it is indeed a data problem, I'm not seeing it in my logs. Thank you for the insight though.
Will update this post if that does end up being the problem.