I'm looking for the join syntax for an outer join in Splunk that is not "all of A and all of B that's in A". Rather, what I need is "all of A that's not in B."
The A and B index records look something like this (simplifying)
filename=<variable>,
so the search would have to be something like, where the tag values would be completely different with no overlapping values between the two indexes, however the filename values would overlap. I want to find filenames in A that are not in B, based on different value for tag in both indexes.
This is what I'm trying:
index=A tag="tagM" | join type=left [ search tag="tagY" index=B ]
But I don't understand how to get the subset of A that's not in B.
If the filename field is common, you shouldn't need a join at all.
(index=A AND tag="tagM") OR (tag="tagY" AND index=B) | stats values(index) as index by filename
This search should give you a results table like this:
filename | index
aaaa.txt | A
B
bbbb.txt | A
B
cccc.txt | A
Then you can append another search like | search NOT index=B
and you should just see
filename | index
cccc.txt | A
So the full search would be
(index=A AND tag="tagM") OR (tag="tagY" AND index=B) | stats values(index) as index by filename | search NOT index=B
There's a handy flowchart on which command is best for each situation here: http://docs.splunk.com/Documentation/Splunk/6.3.2/Search/Abouteventcorrelation
If the filename field is common, you shouldn't need a join at all.
(index=A AND tag="tagM") OR (tag="tagY" AND index=B) | stats values(index) as index by filename
This search should give you a results table like this:
filename | index
aaaa.txt | A
B
bbbb.txt | A
B
cccc.txt | A
Then you can append another search like | search NOT index=B
and you should just see
filename | index
cccc.txt | A
So the full search would be
(index=A AND tag="tagM") OR (tag="tagY" AND index=B) | stats values(index) as index by filename | search NOT index=B
There's a handy flowchart on which command is best for each situation here: http://docs.splunk.com/Documentation/Splunk/6.3.2/Search/Abouteventcorrelation
Although this one has the drawback of not being able to bring additional fields through. The whole key to this is that the index field becomes multivalued with the value of either index name. You can't do the stats by values other than filename because that will throw off the file index search.
If you want to bring all the fields through, use | stats values(*) as * by filename |
This is exactly what I needed, thank you!
Both the first response and this response are logically equivalent. However, this response does not use a subsearch which can be truncated to 10000, invalidating results. This method is also slightly faster.
index=B tag="tagY" | join type=outer field1 [search index=A tag="tagM"|eval field2=field1|table field1,field2 ]|search NOT field2=*|table field1
Try with NOT
index=A tag="tagM" NOT [ search index=B tag="tagY"]
You may want to filter the fields that come back with the subsearch or you risk excluding duplicates other than the filename. EG
index=A tag="tagM" NOT [ search index=B tag="tagY" | fields filename]
I was able to get this approach to work, but I also had to group the index B results by file name
index=A tag="tagM" NOT [ search index=B tag="tagY" | stats count by filename | fields filename]
Without the stats group-by you get results that are still in index B as well as A. I think this is because "tagY" and filename exists over multiple records that are different values for other fields? Although I don't completely understand that.
The group by filters it down to a list all unique filenames as input to the "NOT" search which seems to work.
You can see the difference between
index=A tag="tagM" NOT [ search index=B tag="tagY" | stats count by filename | fields filename]
and
index=A tag="tagM" NOT [ search index=B tag="tagY" | fields filename]
if you check the remoteSearch field in the Job Inspector. You will see the full output of the expanded subsearch. Also try dedup filename
as opposed to stats count by filename
Yes, I think dedup and stats are logically equivalent, but I'm running into the maxout default limit for subsearch. Will fix that soon.