Solved: "left join" lookup from CSV to an index

LearningGuy · ‎08-05-2024

Is it possible to perform "left join" lookup from CSV to an index?
Usually lookup start with index, then CSV file and the result only produces the correlation (inner join)

index= owner
| lookup host.csv ip_address AS ip OUTPUTNEW host, owner

but I am looking for left join that still retains all of the data in host.csv

Thank you for your help

Please see the example below:

host.csv

ip_address	host
10.1.1.1	host1
10.1.1.2	host2
10.1.1.3	host3
10.1.1.4	host4

index=owner

ip	host	owner
10.1.1.3	host3	owner3
10.1.1.4	host4	owner4
10.1.1.5	host5	owner5

left join "lookup" (expected output) - yellow and green circle (see drawing below)

ip_address	host	owner
10.1.1.1	host1
10.1.1.2	host2
10.1.1.3	host3	owner3
10.1.1.4	host4	owner4

normal "inner join" lookup index first, then CSV - green circle

ip_address	host	owner
10.1.1.3	host3	owner3
10.1.1.4	host4	owner4

yuanliu · ‎08-06-2024

Sorry I missed the definition here. This is easily fixable

index=owner ``` where source != host.csv ```
| rename ip as ip_address
| append
  [inputlookup host.csv
  | eval source = "host.csv"]
| stats values(owner) as owner values(source) as source by host ip_address
| where source == "host.csv"
| fields - source

Here, I am back at using the side effect of Splunk's multivalue equality.

Here is the full emulation

| makeresults format=csv data="ip,	host,	owner
10.1.1.3,	host3,	owner3
10.1.1.4,	host4,	owner4
10.1.1.5,	host5,	owner5"
| eval source = "not-host.csv"
``` the above emulates
index=owner
```
| rename ip as ip_address
| append
    [makeresults format=csv data="ip_address,	host
10.1.1.1,	host1
10.1.1.2,	host2
10.1.1.3,	host3
10.1.1.4, 	host4"
    ``` the above emulates
    | inputlookup host.csv
    ```
    | eval source = "host.csv"]
| stats values(owner) as owner values(source) as source by ip_address host
| where source == "host.csv"
| fields - source

View solution in original post

yuanliu · ‎08-05-2024

Let me first say, that is an excellent description of the requirements. Yes, you can do that with append.

Provided host and ip/ip_address match exactly, you can do

index=owner
| rename ip as ip_address
| append
  [inputlookup host.csv]
| stats values(owner) as owner by host ip_address

Hope this helps.

Below is a full emulation you can play with

| makeresults format=csv data="ip,	host,	owner
10.1.1.3,	host3,	owner3
10.1.1.4,	host4,	owner4
10.1.1.5,	host5,	owner5"
``` the above emulates
index=owner
```
| rename ip as ip_address
| append
    [makeresults format=csv data="ip_address,	host
10.1.1.1,	host1
10.1.1.2,	host2
10.1.1.3,	host3
10.1.1.4, 	host4"
    ``` the above emulates
    | inputlookup host.csv
    ```]
| stats values(owner) by ip_address host

LearningGuy · ‎08-06-2024

Hi @yuanliu

Thanks for your help. It's been a while since the last time I saw you, I hope you're doing fine.

1) I ran the emulation and the result included 10.1.1.5 from the index=owner (right join). My expected output is host.csv plus the owner data from index=owner only if the IP matches. (See below)

2) When I tested with real data (100k rows CSV), the result reduced to 30k rows. I expected the left side as a baseline and the number of rows will remain the same.
Should I flip the logic? (inputlookup first, then append the index)
I actually tried to flip the logic, but after the append, the number of rows still doesn't match the CSV. (it's close)

I appreciate your help. Thank you

Emulation result

My expected output - yellow and green circle

ip_address	host	owner
10.1.1.1	host1
10.1.1.2	host2
10.1.1.3	host3	owner3
10.1.1.4	host4	owner4

yuanliu · ‎08-06-2024

Sorry I missed the definition here. This is easily fixable

index=owner ``` where source != host.csv ```
| rename ip as ip_address
| append
  [inputlookup host.csv
  | eval source = "host.csv"]
| stats values(owner) as owner values(source) as source by host ip_address
| where source == "host.csv"
| fields - source

Here, I am back at using the side effect of Splunk's multivalue equality.

Here is the full emulation

| makeresults format=csv data="ip,	host,	owner
10.1.1.3,	host3,	owner3
10.1.1.4,	host4,	owner4
10.1.1.5,	host5,	owner5"
| eval source = "not-host.csv"
``` the above emulates
index=owner
```
| rename ip as ip_address
| append
    [makeresults format=csv data="ip_address,	host
10.1.1.1,	host1
10.1.1.2,	host2
10.1.1.3,	host3
10.1.1.4, 	host4"
    ``` the above emulates
    | inputlookup host.csv
    ```
    | eval source = "host.csv"]
| stats values(owner) as owner values(source) as source by ip_address host
| where source == "host.csv"
| fields - source

LearningGuy · ‎08-08-2024

Hello @yuanliu

I tested and it worked fine for the sample. I accepted your suggestion as solution. Thank you for your help.

1) Max 50k rows
When I tested with the real data, I found out that the sub search CSV file is limited to 50k rows. I need the CSV file as my baseline for left join, so if the file has 100k rows, then the expected result after left join is 100k rows (with additional column from index).
a) What do you suggest to fix this issue? (modifying limits.conf is not allowed)
b) Will splitting the CSV work?

2) Join command
Do you think join command can work on my case?

I tested it your solution using join in real data, but it always gave me the result as inner join, instead of left join, although I already specified join type=left
In the solution you provided index will be treated as left data because it's specified first
How do I make the CSV as left data?

I appreciate your help.

Thanks

yuanliu · ‎08-08-2024

1) Max 50k rows
b) Will splitting the CSV work?

It's unfortunate that you cannot change limits.conf. Yes, splitting CSV will work. If you don't need these CSVs as lookup, that's not a problem. But if you still need a lookup, you will need to maintain two sets of CSVs, one for lookup, the rest for this purpose. (Alternatively, you can modify your searches to use multiple lookups. At that point, you code can become unmaintainable.)

2) Join command

This is where things become intensely interesting😊 I did not compare your statements with the actual depiction. After reviewing your original description, I notice that your depiction (and illustration) is a left join of CSV on the left, with index search on the right. In this regard, Splunk's join is working exactly as documented.

| inputlookup host.csv
| join type=left ip_address
  [ search index=owner
  | rename ip as ip_address]
| table host ip_address owner

Here is an emulation:

| makeresults format=csv data="ip_address,	host
10.1.1.1,	host1
10.1.1.2,	host2
10.1.1.3,	host3
10.1.1.4, 	host4"
    ``` the above emulates
    | inputlookup host.csv
    ```
| join type=left ip_address [makeresults format=csv data="ip,	host,	owner
10.1.1.3,	host3,	owner3
10.1.1.4,	host4,	owner4
10.1.1.5,	host5,	owner5"
    | eval index = "owner"
``` the above emulates
index=owner
```
| rename ip as ip_address]
| table host ip_address owner

The result is the same

host	ip_address	owner
host1	10.1.1.1
host2	10.1.1.2
host3	10.1.1.3	owner3
host4	10.1.1.4	owner4

I suspect that the "bad" output you observe is caused by the 50K row limit. (Try a smaller CSV and a smaller index search you should see.)

In the solution you provided index will be treated as left data because it's specified first

Unlike join, the append-stats method that many Splunkers use does not really depend on which set is introduced first. The control is in the filter.

LearningGuy · ‎08-20-2024

Hello @yuanliu

Can you please help on my other question?
This is more closer to the real data and it has complexity since it's involving multiple fields
https://community.splunk.com/t5/Splunk-Search/How-do-I-quot-Left-join-quot-by-appending-CSV-to-an-in...

I think I solved this, but I wonder if there is a way to do it without merging the string and mvexpand
I appreciate your help. Thank you so much

isoutamo · ‎08-06-2024

Hi
I think that this helps you
https://community.splunk.com/t5/Splunk-Search/What-is-the-relation-between-the-Splunk-inner-left-joi...
In that answer there is define those joins and how you should do those in Splunk.
r Ismo

KendallW · ‎08-05-2024

@LearningGuy use inputlookup to grab the whole csv file into the search, then append the data from your index, then get common values using stats. E.g.

| inputlookup host.csv
| rename ip_address as ip
| append [search index=owner | fields ip, owner]
| stats values(host) as host values(owner) as owner by ip

LearningGuy · ‎08-06-2024

Hi @KendallW

I tried your suggestion with real data (100k rows CSV). I expected the left side as a baseline and the number of rows will remain the same , but the number of rows doesn't match the CSV.

In my initial example. My expected output is 4 rows, same number of rows in the host.CSV.
Please suggest. I appreciate your help. Thank you.

My expected output - yellow and green circle

ip_address	host	owner
10.1.1.1	host1
10.1.1.2	host2
10.1.1.3	host3	owner3
10.1.1.4	host4	owner4

"left join" lookup from CSV to an index

join

lookup

Index This | What’s a riddle wrapped in an enigma?

BORE at .conf25

OpenTelemetry for Legacy Apps? Yes, You Can!

Are you a member of the Splunk Community?