Is it possible to perform "left join" lookup from CSV to an index?
Usually lookup start with index, then CSV file and the result only produces the correlation (inner join)
index= owner
| lookup host.csv ip_address AS ip OUTPUTNEW host, owner
but I am looking for left join that still retains all of the data in host.csv
Thank you for your help
Please see the example below:
host.csv
ip_address | host |
10.1.1.1 | host1 |
10.1.1.2 | host2 |
10.1.1.3 | host3 |
10.1.1.4 | host4 |
index=owner
ip | host | owner |
10.1.1.3 | host3 | owner3 |
10.1.1.4 | host4 | owner4 |
10.1.1.5 | host5 | owner5 |
left join "lookup" (expected output) - yellow and green circle (see drawing below)
ip_address | host | owner |
10.1.1.1 | host1 | |
10.1.1.2 | host2 | |
10.1.1.3 | host3 | owner3 |
10.1.1.4 | host4 | owner4 |
normal "inner join" lookup index first, then CSV - green circle
ip_address | host | owner |
10.1.1.3 | host3 | owner3 |
10.1.1.4 | host4 | owner4 |
Sorry I missed the definition here. This is easily fixable
index=owner ``` where source != host.csv ```
| rename ip as ip_address
| append
[inputlookup host.csv
| eval source = "host.csv"]
| stats values(owner) as owner values(source) as source by host ip_address
| where source == "host.csv"
| fields - source
Here, I am back at using the side effect of Splunk's multivalue equality.
Here is the full emulation
| makeresults format=csv data="ip, host, owner
10.1.1.3, host3, owner3
10.1.1.4, host4, owner4
10.1.1.5, host5, owner5"
| eval source = "not-host.csv"
``` the above emulates
index=owner
```
| rename ip as ip_address
| append
[makeresults format=csv data="ip_address, host
10.1.1.1, host1
10.1.1.2, host2
10.1.1.3, host3
10.1.1.4, host4"
``` the above emulates
| inputlookup host.csv
```
| eval source = "host.csv"]
| stats values(owner) as owner values(source) as source by ip_address host
| where source == "host.csv"
| fields - source
Let me first say, that is an excellent description of the requirements. Yes, you can do that with append.
Provided host and ip/ip_address match exactly, you can do
index=owner
| rename ip as ip_address
| append
[inputlookup host.csv]
| stats values(owner) as owner by host ip_address
Hope this helps.
Below is a full emulation you can play with
| makeresults format=csv data="ip, host, owner
10.1.1.3, host3, owner3
10.1.1.4, host4, owner4
10.1.1.5, host5, owner5"
``` the above emulates
index=owner
```
| rename ip as ip_address
| append
[makeresults format=csv data="ip_address, host
10.1.1.1, host1
10.1.1.2, host2
10.1.1.3, host3
10.1.1.4, host4"
``` the above emulates
| inputlookup host.csv
```]
| stats values(owner) by ip_address host
Hi @yuanliu
Thanks for your help. It's been a while since the last time I saw you, I hope you're doing fine.
1) I ran the emulation and the result included 10.1.1.5 from the index=owner (right join). My expected output is host.csv plus the owner data from index=owner only if the IP matches. (See below)
2) When I tested with real data (100k rows CSV), the result reduced to 30k rows. I expected the left side as a baseline and the number of rows will remain the same.
Should I flip the logic? (inputlookup first, then append the index)
I actually tried to flip the logic, but after the append, the number of rows still doesn't match the CSV. (it's close)
I appreciate your help. Thank you
Emulation result
My expected output - yellow and green circle
ip_address | host | owner |
10.1.1.1 | host1 | |
10.1.1.2 | host2 | |
10.1.1.3 | host3 | owner3 |
10.1.1.4 | host4 | owner4 |
Sorry I missed the definition here. This is easily fixable
index=owner ``` where source != host.csv ```
| rename ip as ip_address
| append
[inputlookup host.csv
| eval source = "host.csv"]
| stats values(owner) as owner values(source) as source by host ip_address
| where source == "host.csv"
| fields - source
Here, I am back at using the side effect of Splunk's multivalue equality.
Here is the full emulation
| makeresults format=csv data="ip, host, owner
10.1.1.3, host3, owner3
10.1.1.4, host4, owner4
10.1.1.5, host5, owner5"
| eval source = "not-host.csv"
``` the above emulates
index=owner
```
| rename ip as ip_address
| append
[makeresults format=csv data="ip_address, host
10.1.1.1, host1
10.1.1.2, host2
10.1.1.3, host3
10.1.1.4, host4"
``` the above emulates
| inputlookup host.csv
```
| eval source = "host.csv"]
| stats values(owner) as owner values(source) as source by ip_address host
| where source == "host.csv"
| fields - source
Hello @yuanliu
I tested and it worked fine for the sample. I accepted your suggestion as solution. Thank you for your help.
1) Max 50k rows
When I tested with the real data, I found out that the sub search CSV file is limited to 50k rows. I need the CSV file as my baseline for left join, so if the file has 100k rows, then the expected result after left join is 100k rows (with additional column from index).
a) What do you suggest to fix this issue? (modifying limits.conf is not allowed)
b) Will splitting the CSV work?
2) Join command
Do you think join command can work on my case?
I tested it your solution using join in real data, but it always gave me the result as inner join, instead of left join, although I already specified join type=left
In the solution you provided index will be treated as left data because it's specified first
How do I make the CSV as left data?
I appreciate your help.
Thanks
1) Max 50k rows
b) Will splitting the CSV work?
It's unfortunate that you cannot change limits.conf. Yes, splitting CSV will work. If you don't need these CSVs as lookup, that's not a problem. But if you still need a lookup, you will need to maintain two sets of CSVs, one for lookup, the rest for this purpose. (Alternatively, you can modify your searches to use multiple lookups. At that point, you code can become unmaintainable.)
2) Join command
This is where things become intensely interesting😊 I did not compare your statements with the actual depiction. After reviewing your original description, I notice that your depiction (and illustration) is a left join of CSV on the left, with index search on the right. In this regard, Splunk's join is working exactly as documented.
| inputlookup host.csv
| join type=left ip_address
[ search index=owner
| rename ip as ip_address]
| table host ip_address owner
Here is an emulation:
| makeresults format=csv data="ip_address, host
10.1.1.1, host1
10.1.1.2, host2
10.1.1.3, host3
10.1.1.4, host4"
``` the above emulates
| inputlookup host.csv
```
| join type=left ip_address [makeresults format=csv data="ip, host, owner
10.1.1.3, host3, owner3
10.1.1.4, host4, owner4
10.1.1.5, host5, owner5"
| eval index = "owner"
``` the above emulates
index=owner
```
| rename ip as ip_address]
| table host ip_address owner
The result is the same
host | ip_address | owner |
host1 | 10.1.1.1 | |
host2 | 10.1.1.2 | |
host3 | 10.1.1.3 | owner3 |
host4 | 10.1.1.4 | owner4 |
I suspect that the "bad" output you observe is caused by the 50K row limit. (Try a smaller CSV and a smaller index search you should see.)
In the solution you provided index will be treated as left data because it's specified first
Unlike join, the append-stats method that many Splunkers use does not really depend on which set is introduced first. The control is in the filter.
Hello @yuanliu
Can you please help on my other question?
This is more closer to the real data and it has complexity since it's involving multiple fields
https://community.splunk.com/t5/Splunk-Search/How-do-I-quot-Left-join-quot-by-appending-CSV-to-an-in...
I think I solved this, but I wonder if there is a way to do it without merging the string and mvexpand
I appreciate your help. Thank you so much
@LearningGuy use inputlookup to grab the whole csv file into the search, then append the data from your index, then get common values using stats. E.g.
| inputlookup host.csv
| rename ip_address as ip
| append [search index=owner | fields ip, owner]
| stats values(host) as host values(owner) as owner by ip
Hi @KendallW
I tried your suggestion with real data (100k rows CSV). I expected the left side as a baseline and the number of rows will remain the same , but the number of rows doesn't match the CSV.
In my initial example. My expected output is 4 rows, same number of rows in the host.CSV.
Please suggest. I appreciate your help. Thank you.
My expected output - yellow and green circle
ip_address | host | owner |
10.1.1.1 | host1 | |
10.1.1.2 | host2 | |
10.1.1.3 | host3 | owner3 |
10.1.1.4 | host4 | owner4 |