Splunk Search

Adding a constraint to a subsearch yields more result rows

dbryan
Path Finder

Consider the following two Splunk searches:

index=a | join type=inner MyKey [
    search index=b
]

and:

index=a | join type=inner MyKey [
    search index=b | where MyVal > 0
]

Remarkably, the latter of the searches - the search whose subsearch has a constraint - has three times as many result rows as the former.

The documentation page for the join command suggests semantics that are close enough for the sake of argument to SQL's join:

A join is used to combine the results of a search and subsearch if specified fields are common to each. You can also join a table to itself using the selfjoin command.

This snippet is relevant to the type=inner argument:

A join is used to combine the results of a search and subsearch if specified fields are common to each. You can also join a table to itself using the selfjoin command.

Based on this information, I assume the two Splunk searches above should be equivalent to the following SQL, respectively:

SELECT *
FROM a
INNER JOIN b ON a.MyKey = b.MyKey

and:

SELECT *
FROM a
INNER JOIN b ON a.MyKey = b.MyKey
WHERE b.MyVal > 0

How is it possible that adding a constraint increases the number of result rows?

Interestingly, the following Splunk search produces a third result - one that matches what I got when I put the same data in an SQL database:

index=a | join type=outer MyKey [search index=b | eval hasmatch=1]
| where hasmatch=1

Some more notes:

  • the MyVal field has no duplicates in either table / index
  • I have verified that the raw events in Splunk's indexes match the raw source data in event counts and values for MyVal - the data appears to be indexed correctly
  • the only search-time operations configured for the relevant sourcetypes in props.conf is a report to extract the fields based on a stanza in transforms.conf (the source data is in a CSV dialect)

Can anyone give me some clues here? As far as I'm concerned this behaviour is nonsensical.

Tags (4)
0 Karma
1 Solution

sideview
SplunkTrust
SplunkTrust

I suspect that you're hitting the limits around number of results, that your index="b" search returns more than 50,000 rows, that most of the values are missing "myKey" and "myVal", so when you apply the restricting condition you end up getting more "joinable" rows out.

This probably sounds preposterous. But we have to back up. Join is not a mainstream command in Splunk. In fact it's best to think of join as a weapon of last resort. I've worked with ton of teams and new users and I've seen them gravitate to join. However even in cases that seem complicated, it is almost always better to use simple disjunctions (foo OR bar), a little eval when you need it (to normalize the space of fields and tweak the data), and then the stats or occasionally the transaction command, to roll up all the events and do the 'joining'.

At this point a lot of people still shrug and say "OK but my situation is complicated and join is what I need". So to come down to brass tacks, there are significant shortcomings to subsearches in general and to subsearches in join in particular, and there are significant splunk-like benefits to doing it the Splunk way, that you'll gain some comfort level with as you go along.

First of all, subsearches all have limits, both in terms of the number of result rows that will be returned (in the case of join this defaults to 50,000), and in terms of the number of seconds the subsearch will be allowed to run before it is prematurely finalized.

However if you take this search:

index=a | join type=inner MyKey [search index=b]

and just to fill in some specifics, say we assume that at the end of the day you wanted to get the userId from the "a" results, and a total 'bytes' value from the "b" events. You would just do this:

index=a OR index=b | stats sum(bytes) values(userId) by MyKey

Granted, for a type="inner" you probably want to do some filtering with where, and maybe you want to do that before the stats or after the stats, I think it depends. I'm probably making up enough parts here and you get the idea.

Short version: avoid join. Join is a weapon of last resort and it's almost always easier to use a simple disjunction (foo OR bar), or a simple disjunction followed by a little eval syntax to normalize the space of fields, or maybe the transaction command.

Some geenral comments about subsearches:
http://docs.splunk.com/Documentation/Splunk/4.3.4/User/HowSubsearchesWork

more details about the limits in subsearches, join, append, etc.. Note the [join] stanza says that only 50,000 rows will be yielded out of the subsearch.
http://docs.splunk.com/Documentation/Splunk/4.3.4/Admin/Limitsconf

View solution in original post

sideview
SplunkTrust
SplunkTrust

I suspect that you're hitting the limits around number of results, that your index="b" search returns more than 50,000 rows, that most of the values are missing "myKey" and "myVal", so when you apply the restricting condition you end up getting more "joinable" rows out.

This probably sounds preposterous. But we have to back up. Join is not a mainstream command in Splunk. In fact it's best to think of join as a weapon of last resort. I've worked with ton of teams and new users and I've seen them gravitate to join. However even in cases that seem complicated, it is almost always better to use simple disjunctions (foo OR bar), a little eval when you need it (to normalize the space of fields and tweak the data), and then the stats or occasionally the transaction command, to roll up all the events and do the 'joining'.

At this point a lot of people still shrug and say "OK but my situation is complicated and join is what I need". So to come down to brass tacks, there are significant shortcomings to subsearches in general and to subsearches in join in particular, and there are significant splunk-like benefits to doing it the Splunk way, that you'll gain some comfort level with as you go along.

First of all, subsearches all have limits, both in terms of the number of result rows that will be returned (in the case of join this defaults to 50,000), and in terms of the number of seconds the subsearch will be allowed to run before it is prematurely finalized.

However if you take this search:

index=a | join type=inner MyKey [search index=b]

and just to fill in some specifics, say we assume that at the end of the day you wanted to get the userId from the "a" results, and a total 'bytes' value from the "b" events. You would just do this:

index=a OR index=b | stats sum(bytes) values(userId) by MyKey

Granted, for a type="inner" you probably want to do some filtering with where, and maybe you want to do that before the stats or after the stats, I think it depends. I'm probably making up enough parts here and you get the idea.

Short version: avoid join. Join is a weapon of last resort and it's almost always easier to use a simple disjunction (foo OR bar), or a simple disjunction followed by a little eval syntax to normalize the space of fields, or maybe the transaction command.

Some geenral comments about subsearches:
http://docs.splunk.com/Documentation/Splunk/4.3.4/User/HowSubsearchesWork

more details about the limits in subsearches, join, append, etc.. Note the [join] stanza says that only 50,000 rows will be yielded out of the subsearch.
http://docs.splunk.com/Documentation/Splunk/4.3.4/Admin/Limitsconf

View solution in original post

sideview
SplunkTrust
SplunkTrust

Well, the stats(values) was just a random example. I recommend posting your whole use case and the search with all the joins, as a separate question, and give us a crack at recommending the more splunklike, joinless way of doing the same thing.

0 Karma

dbryan
Path Finder

Thanks for the response; this is enlightening.

I acknowledge, as you say, that I'm sort of fighting against the system by using so many joins, but the way we're working with data simply requires it; stats(values) means we lose certain information and properties that we need when working with the data.

0 Karma
.conf21 Now Fully Virtual!
Register for FREE Today!

We've made .conf21 totally virtual and totally FREE! Our completely online experience will run from 10/19 through 10/20 with some additional events, too!