What's the relation between the Splunk inner/left joins and the ones in relational databases, functionality and terminology wise?
DO NOT USE JOINS IN SPLUNK. See here for visual representations:
https://goo.gl/images/zyDpMD
Continuing with the explanation of @kashz:
Assume:
sourcetype=A
contains {ip, more-columns-specific-to-sourcetype-A}
sourcetype=B
contains {ip, more-columns-specific-to-sourcetype-B}
and furthermore that you have a search like this:
(index=A AND sourcetype=A) OR (index=B AND sourcetype=B)
| stats dc(sourcetype) AS sourcetypeCount values(*) AS * BY ip
INNER JOIN on the ip
field between them will return only those contents that are matching in the both sourcetype
values on the ip
field. To do this, finish your search with this:
| where sourcetypeCount=2
LEFT JOIN A on the ip
field will return the FULL sourcetype=A
dataset (the "left" one) and only those contents from the souretype=B
dataset (the "right" one) which match on field ip
. To do this, finish your search with this:
| where sourcetype="A"
LEFT JOIN B on the ip
field will return the EXCLUSIVE sourcetype=A
dataset (the "left" one) that do not overlap the contents from the souretype=B
dataset (the "right" one) which match on field ip
. To do this, finish your search with this:
| where sourcetype="A" AND sourcetypeCount=1
RIGHT JOIN B on the ip
field will return the FULL sourcetype=B
dataset (the "right" one) and only those contents from the souretype=A
dataset (the "left" one) which match on field ip
. To do this, finish your search with this:
| where sourcetype="B"
RIGHT JOIN A on the ip
field will return the EXCLUSIVE sourcetype=B
dataset (the "right" one) that do not overlap the contents from the souretype=A
dataset (the "left" one) which match on field ip
. To do this, finish your search with this:
| where sourcetype="B" AND sourcetypeCount=1
OUTER JOIN on the ip
field will return both datasets in FULL minus the events that that have ip
in both datasets. To do this, finish your search with this:
| where sourcetypeCount=1
FULL JOIN on the ip
field will return both datasets in FULL. To do this, do nothing.
Using join
is bad because it uses subsearches
which have time and results limits which are enforced without obvious indications leaving users with no easy way to know that the results are incomplete.
Thank you for sharing the docs. They helped a lot.
Splunk Documentation on Joins: https://docs.splunk.com/Documentation/Splunk/7.2.4/SearchReference/Join
To give you a clear picture:
Splunk only support INNER
, LEFT
and OUTER
joins where as in SQL based queries can have a lot more variations; the concept remains the same.
Assume:
DATASET-A contains {ip, more-columns}
DATASET-B contains {ip, more-columns}
ip
field between them will return only those contents that are matching in the both dataset on the ip
field.ip
field will return the FULL DATASET-(left-one) and only those contents from the other dataset-(right-one) which match on field ip
ip
field will return both datasets in FULL with null
values entered in places where a field cannot be found based on the join condition.