Splunk Search

How to write a search with an outer join to only return results from index=A that are not found in index=B?

jonbelanger
Explorer

I'm looking for the join syntax for an outer join in Splunk that is not "all of A and all of B that's in A". Rather, what I need is "all of A that's not in B."

The A and B index records look something like this (simplifying)

filename=<variable>,

so the search would have to be something like, where the tag values would be completely different with no overlapping values between the two indexes, however the filename values would overlap. I want to find filenames in A that are not in B, based on different value for tag in both indexes.

This is what I'm trying:

index=A tag="tagM" | join type=left [ search tag="tagY" index=B ]

But I don't understand how to get the subset of A that's not in B.

0 Karma
1 Solution

jplumsdaine22
Influencer

If the filename field is common, you shouldn't need a join at all.

(index=A AND tag="tagM") OR (tag="tagY" AND index=B) | stats values(index) as index by filename 

This search should give you a results table like this:

filename | index
aaaa.txt | A
           B
bbbb.txt | A
           B
cccc.txt | A

Then you can append another search like | search NOT index=B and you should just see

filename | index
cccc.txt | A

So the full search would be

(index=A AND tag="tagM") OR (tag="tagY" AND index=B) | stats values(index) as index by filename  | search NOT index=B

There's a handy flowchart on which command is best for each situation here: http://docs.splunk.com/Documentation/Splunk/6.3.2/Search/Abouteventcorrelation

View solution in original post

jplumsdaine22
Influencer

If the filename field is common, you shouldn't need a join at all.

(index=A AND tag="tagM") OR (tag="tagY" AND index=B) | stats values(index) as index by filename 

This search should give you a results table like this:

filename | index
aaaa.txt | A
           B
bbbb.txt | A
           B
cccc.txt | A

Then you can append another search like | search NOT index=B and you should just see

filename | index
cccc.txt | A

So the full search would be

(index=A AND tag="tagM") OR (tag="tagY" AND index=B) | stats values(index) as index by filename  | search NOT index=B

There's a handy flowchart on which command is best for each situation here: http://docs.splunk.com/Documentation/Splunk/6.3.2/Search/Abouteventcorrelation

View solution in original post

jonbelanger
Explorer

Although this one has the drawback of not being able to bring additional fields through. The whole key to this is that the index field becomes multivalued with the value of either index name. You can't do the stats by values other than filename because that will throw off the file index search.

0 Karma

jplumsdaine22
Influencer

If you want to bring all the fields through, use | stats values(*) as * by filename |

jonbelanger
Explorer

This is exactly what I needed, thank you!

0 Karma

jonbelanger
Explorer

Both the first response and this response are logically equivalent. However, this response does not use a subsearch which can be truncated to 10000, invalidating results. This method is also slightly faster.

0 Karma

sameera123
Explorer

index=B tag="tagY" | join type=outer field1 [search index=A tag="tagM"|eval field2=field1|table field1,field2 ]|search NOT field2=*|table field1

0 Karma

renjith_nair
SplunkTrust
SplunkTrust

Try with NOT

index=A tag="tagM" NOT [ search index=B tag="tagY"]

jplumsdaine22
Influencer

You may want to filter the fields that come back with the subsearch or you risk excluding duplicates other than the filename. EG

index=A tag="tagM" NOT [ search index=B tag="tagY" | fields filename]

0 Karma

jonbelanger
Explorer

I was able to get this approach to work, but I also had to group the index B results by file name

index=A tag="tagM" NOT [ search index=B tag="tagY" | stats count by filename | fields filename]

Without the stats group-by you get results that are still in index B as well as A. I think this is because "tagY" and filename exists over multiple records that are different values for other fields? Although I don't completely understand that.

The group by filters it down to a list all unique filenames as input to the "NOT" search which seems to work.

0 Karma

jplumsdaine22
Influencer

You can see the difference between

index=A tag="tagM" NOT [ search index=B tag="tagY" | stats count by filename | fields filename]

and

index=A tag="tagM" NOT [ search index=B tag="tagY" | fields filename]

if you check the remoteSearch field in the Job Inspector. You will see the full output of the expanded subsearch. Also try dedup filename as opposed to stats count by filename

0 Karma

jonbelanger
Explorer

Yes, I think dedup and stats are logically equivalent, but I'm running into the maxout default limit for subsearch. Will fix that soon.

0 Karma
Did you miss .conf21 Virtual?

Good news! The event's keynotes and many of its breakout sessions are now available online, and still totally FREE!