Solved: How to write a search with an outer join to only r...

jonbelanger · ‎12-18-2015

I'm looking for the join syntax for an outer join in Splunk that is not "all of A and all of B that's in A". Rather, what I need is "all of A that's not in B."

The A and B index records look something like this (simplifying)

filename=<variable>,

so the search would have to be something like, where the tag values would be completely different with no overlapping values between the two indexes, however the filename values would overlap. I want to find filenames in A that are not in B, based on different value for tag in both indexes.

This is what I'm trying:

index=A tag="tagM" | join type=left [ search tag="tagY" index=B ]

But I don't understand how to get the subset of A that's not in B.

jplumsdaine22 · ‎12-21-2015

If the filename field is common, you shouldn't need a join at all.

(index=A AND tag="tagM") OR (tag="tagY" AND index=B) | stats values(index) as index by filename

This search should give you a results table like this:

filename | index
aaaa.txt | A
           B
bbbb.txt | A
           B
cccc.txt | A

Then you can append another search like | search NOT index=B and you should just see

filename | index
cccc.txt | A

So the full search would be

(index=A AND tag="tagM") OR (tag="tagY" AND index=B) | stats values(index) as index by filename  | search NOT index=B

There's a handy flowchart on which command is best for each situation here: http://docs.splunk.com/Documentation/Splunk/6.3.2/Search/Abouteventcorrelation

View solution in original post

jplumsdaine22 · ‎12-21-2015

If the filename field is common, you shouldn't need a join at all.

(index=A AND tag="tagM") OR (tag="tagY" AND index=B) | stats values(index) as index by filename

This search should give you a results table like this:

filename | index
aaaa.txt | A
           B
bbbb.txt | A
           B
cccc.txt | A

Then you can append another search like | search NOT index=B and you should just see

filename | index
cccc.txt | A

So the full search would be

(index=A AND tag="tagM") OR (tag="tagY" AND index=B) | stats values(index) as index by filename  | search NOT index=B

There's a handy flowchart on which command is best for each situation here: http://docs.splunk.com/Documentation/Splunk/6.3.2/Search/Abouteventcorrelation

jonbelanger · ‎12-22-2015

Although this one has the drawback of not being able to bring additional fields through. The whole key to this is that the index field becomes multivalued with the value of either index name. You can't do the stats by values other than filename because that will throw off the file index search.

jplumsdaine22 · ‎12-23-2015

If you want to bring all the fields through, use | stats values(*) as * by filename |

jonbelanger · ‎12-23-2015

This is exactly what I needed, thank you!

jonbelanger · ‎12-22-2015

Both the first response and this response are logically equivalent. However, this response does not use a subsearch which can be truncated to 10000, invalidating results. This method is also slightly faster.

sameera123 · ‎12-20-2015

renjith_nair · ‎12-19-2015

Try with NOT

index=A tag="tagM" NOT [ search index=B tag="tagY"]

Happy Splunking!

jplumsdaine22 · ‎12-21-2015

You may want to filter the fields that come back with the subsearch or you risk excluding duplicates other than the filename. EG

index=A tag="tagM" NOT [ search index=B tag="tagY" | fields filename]

jonbelanger · ‎12-22-2015

I was able to get this approach to work, but I also had to group the index B results by file name

index=A tag="tagM" NOT [ search index=B tag="tagY" | stats count by filename | fields filename]

Without the stats group-by you get results that are still in index B as well as A. I think this is because "tagY" and filename exists over multiple records that are different values for other fields? Although I don't completely understand that.

The group by filters it down to a list all unique filenames as input to the "NOT" search which seems to work.

jplumsdaine22 · ‎12-22-2015

You can see the difference between

index=A tag="tagM" NOT [ search index=B tag="tagY" | stats count by filename | fields filename]

and

index=A tag="tagM" NOT [ search index=B tag="tagY" | fields filename]

if you check the remoteSearch field in the Job Inspector. You will see the full output of the expanded subsearch. Also try dedup filename as opposed to stats count by filename

jonbelanger · ‎12-23-2015

Yes, I think dedup and stats are logically equivalent, but I'm running into the maxout default limit for subsearch. Will fix that soon.

How to write a search with an outer join to only return results from index=A that are not found in index=B?

Welcome to the Splunk Community!

Tech Talk | Elevating Digital Service Excellence: The Synergy of Splunk RUM & APM

Adoption of RUM and APM at Splunk