Splunk Search

"left join" lookup from CSV to an index

LearningGuy
Builder


Is it possible to perform "left join" lookup from CSV to an index?
Usually lookup start with index, then CSV file and the result only produces the correlation (inner join)

 

 

index= owner
| lookup host.csv ip_address AS ip OUTPUTNEW host, owner

 

 

but I am looking for left join that still retains all of the data in host.csv

Thank you for your help

Please see the example below:

host.csv

ip_addresshost
10.1.1.1host1
10.1.1.2host2
10.1.1.3host3
10.1.1.4host4


index=owner

iphostowner
10.1.1.3host3owner3
10.1.1.4host4owner4
10.1.1.5host5owner5

 

left join "lookup" (expected output) - yellow and green circle (see drawing below)

ip_addresshostowner
10.1.1.1host1 
10.1.1.2host2 
10.1.1.3host3owner3
10.1.1.4host4owner4


normal "inner join" lookup   index first, then CSV - green circle

ip_addresshostowner
10.1.1.3host3owner3
10.1.1.4host4owner4

 

LearningGuy_0-1722974764118.png

 

 

Labels (2)
0 Karma
1 Solution

yuanliu
SplunkTrust
SplunkTrust

Sorry I missed the definition here.  This is easily fixable

index=owner ``` where source != host.csv ```
| rename ip as ip_address
| append
  [inputlookup host.csv
  | eval source = "host.csv"]
| stats values(owner) as owner values(source) as source by host ip_address
| where source == "host.csv"
| fields - source

Here, I am back at using the side effect of Splunk's multivalue equality.

Here is the full emulation

| makeresults format=csv data="ip,	host,	owner
10.1.1.3,	host3,	owner3
10.1.1.4,	host4,	owner4
10.1.1.5,	host5,	owner5"
| eval source = "not-host.csv"
``` the above emulates
index=owner
```
| rename ip as ip_address
| append
    [makeresults format=csv data="ip_address,	host
10.1.1.1,	host1
10.1.1.2,	host2
10.1.1.3,	host3
10.1.1.4, 	host4"
    ``` the above emulates
    | inputlookup host.csv
    ```
    | eval source = "host.csv"]
| stats values(owner) as owner values(source) as source by ip_address host
| where source == "host.csv"
| fields - source

 

View solution in original post

yuanliu
SplunkTrust
SplunkTrust

Let me first say, that is an excellent description of the requirements.  Yes, you can do that with append.

Provided host and ip/ip_address match exactly, you can do

index=owner
| rename ip as ip_address
| append
  [inputlookup host.csv]
| stats values(owner) as owner by host ip_address

Hope this helps.

Below is a full emulation you can play with

| makeresults format=csv data="ip,	host,	owner
10.1.1.3,	host3,	owner3
10.1.1.4,	host4,	owner4
10.1.1.5,	host5,	owner5"
``` the above emulates
index=owner
```
| rename ip as ip_address
| append
    [makeresults format=csv data="ip_address,	host
10.1.1.1,	host1
10.1.1.2,	host2
10.1.1.3,	host3
10.1.1.4, 	host4"
    ``` the above emulates
    | inputlookup host.csv
    ```]
| stats values(owner) by ip_address host
Tags (1)

LearningGuy
Builder

Hi @yuanliu 

Thanks for your help. It's been a while since the last time I saw you, I hope you're doing fine.

1)  I ran the emulation and the result included 10.1.1.5 from the index=owner (right join).  My expected output is host.csv plus the owner data from index=owner only if the IP matches.  (See below)

2)  When I tested with real data (100k rows  CSV), the result reduced to 30k rows. I expected the left side as a baseline and the number of rows will remain the same.   
Should I flip the logic?   (inputlookup first, then append the index)
I actually tried to flip the logic, but after the append, the number of rows still doesn't match the CSV. (it's close)

I appreciate your help. Thank you

Emulation result

LearningGuy_0-1722973847770.png

My expected output - yellow and green circle 

ip_addresshostowner
10.1.1.1host1 
10.1.1.2host2 
10.1.1.3host3owner3
10.1.1.4host4owner4

 

LearningGuy_1-1722974735855.png

 

0 Karma

yuanliu
SplunkTrust
SplunkTrust

Sorry I missed the definition here.  This is easily fixable

index=owner ``` where source != host.csv ```
| rename ip as ip_address
| append
  [inputlookup host.csv
  | eval source = "host.csv"]
| stats values(owner) as owner values(source) as source by host ip_address
| where source == "host.csv"
| fields - source

Here, I am back at using the side effect of Splunk's multivalue equality.

Here is the full emulation

| makeresults format=csv data="ip,	host,	owner
10.1.1.3,	host3,	owner3
10.1.1.4,	host4,	owner4
10.1.1.5,	host5,	owner5"
| eval source = "not-host.csv"
``` the above emulates
index=owner
```
| rename ip as ip_address
| append
    [makeresults format=csv data="ip_address,	host
10.1.1.1,	host1
10.1.1.2,	host2
10.1.1.3,	host3
10.1.1.4, 	host4"
    ``` the above emulates
    | inputlookup host.csv
    ```
    | eval source = "host.csv"]
| stats values(owner) as owner values(source) as source by ip_address host
| where source == "host.csv"
| fields - source

 

LearningGuy
Builder

Hello @yuanliu 

I tested and it worked fine for the sample. I accepted your suggestion as solution. Thank you for your help.

1) Max 50k rows
When I tested with the real data, I found out that the sub search CSV file is limited to 50k rows.  I need the CSV file as my baseline for left join, so if the file has 100k rows, then the expected result after left join is 100k rows (with additional column from index).
a) What do you suggest to fix this issue?    (modifying limits.conf is not allowed)
b) Will splitting the CSV work?

2) Join command
Do you think join command can work on my case?

I tested it your solution using join in real data, but it always gave me the result as inner join, instead of left join,  although I already specified join type=left
In the solution you provided index will be treated as left data because it's specified first
How do I make the CSV as left data?     

I appreciate your help.

Thanks

0 Karma

yuanliu
SplunkTrust
SplunkTrust

1) Max 50k rows
b) Will splitting the CSV work?

It's unfortunate that you cannot change limits.conf.  Yes, splitting CSV will work.  If you don't need these CSVs as lookup, that's not a problem.  But if you still need a lookup, you will need to maintain two sets of CSVs, one for lookup, the rest for this purpose. (Alternatively, you can modify your searches to use multiple lookups.  At that point, you code can become unmaintainable.)


2) Join command

This is where things become intensely interesting😊  I did not compare your statements with the actual depiction.  After reviewing your original description, I notice that your depiction (and illustration) is a left join of CSV on the left, with index search on the right.  In this regard, Splunk's join is working exactly as documented.

 

 

| inputlookup host.csv
| join type=left ip_address
  [ search index=owner
  | rename ip as ip_address]
| table host ip_address owner

 

 

Here is an emulation:

 

 

| makeresults format=csv data="ip_address,	host
10.1.1.1,	host1
10.1.1.2,	host2
10.1.1.3,	host3
10.1.1.4, 	host4"
    ``` the above emulates
    | inputlookup host.csv
    ```
| join type=left ip_address [makeresults format=csv data="ip,	host,	owner
10.1.1.3,	host3,	owner3
10.1.1.4,	host4,	owner4
10.1.1.5,	host5,	owner5"
    | eval index = "owner"
``` the above emulates
index=owner
```
| rename ip as ip_address]
| table host ip_address owner

 

 

The result is the same

hostip_addressowner
host110.1.1.1 
host210.1.1.2 
host310.1.1.3owner3
host410.1.1.4owner4

I suspect that the "bad" output you observe is caused by the 50K row limit. (Try a smaller CSV and a smaller index search you should see.)

In the solution you provided index will be treated as left data because it's specified first

Unlike join, the append-stats method that many Splunkers use does not really depend on which set is introduced first.  The control is in the filter.

LearningGuy
Builder

Hello @yuanliu 

Can you please help on my other question?
This is more closer to the real data and it has complexity since it's involving multiple fields
https://community.splunk.com/t5/Splunk-Search/How-do-I-quot-Left-join-quot-by-appending-CSV-to-an-in...


I think I solved this, but I wonder if there is a way to do it without merging the string and mvexpand
I appreciate your help.  Thank you so much

0 Karma

isoutamo
SplunkTrust
SplunkTrust
Hi
I think that this helps you
https://community.splunk.com/t5/Splunk-Search/What-is-the-relation-between-the-Splunk-inner-left-joi...
In that answer there is define those joins and how you should do those in Splunk.
r Ismo

KendallW
Contributor

@LearningGuy use inputlookup to grab the whole csv file into the search, then append the data from your index, then get common values using stats. E.g. 

| inputlookup host.csv
| rename ip_address as ip
| append [search index=owner | fields ip, owner]
| stats values(host) as host values(owner) as owner by ip

LearningGuy
Builder

Hi @KendallW 

I tried your suggestion with real data (100k rows  CSV).  I expected the left side as a baseline and the number of rows will remain the same ,   but the number of rows doesn't match the CSV. 

In my initial example. My expected output is 4 rows, same number of rows in the host.CSV.
Please suggest.  I appreciate your help. Thank you.

My expected output - yellow and green circle 

ip_addresshostowner
10.1.1.1host1 
10.1.1.2host2 
10.1.1.3host3owner3
10.1.1.4host4owner4

 

LearningGuy_0-1722975384163.png

 

 

 

0 Karma
Get Updates on the Splunk Community!

New Case Study Shows the Value of Partnering with Splunk Academic Alliance

The University of Nevada, Las Vegas (UNLV) is another premier research institution helping to shape the next ...

How to Monitor Google Kubernetes Engine (GKE)

We’ve looked at how to integrate Kubernetes environments with Splunk Observability Cloud, but what about ...

Index This | How can you make 45 using only 4?

October 2024 Edition Hayyy Splunk Education Enthusiasts and the Eternally Curious!  We’re back with this ...