Splunk Search

How do I "Left join" by appending CSV to an index in multiple fields

LearningGuy
Builder

Hello,
How do I "Left join" by appending CSV to an index in multiple fields?

I was able to solve the problem, but:
1) Is it possible to solve this without string manipulation and mvexpand? (See the code below.)
mvexpand caused slowness.

2) Can "stats values" be made NOT to remove duplicates? In this case, "stats values(*) as * by ip" merged the "risk" and "score" fields and removed the duplicates. My workaround is to combine the strings so the duplicates are retained.

3) a) Why does "stats values" ignore empty strings?
   b) Why does concatenating null with a non-null string result in null?
I have to use fillnull in order to retain the data.
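
For reference, here is a minimal run that reproduces both behaviors (the null() values stand in for the empty CSV fields):

| makeresults count=3
| streamstats count as n
| eval ip = "10.1.1.1"
| eval risk = if(n < 3, "riskA", null())
| eval contact = if(n == 3, "person1", null())
| eval merged = risk . "," . contact
| stats values(risk) as risk, values(merged) as merged by ip

values(risk) collapses the two riskA rows into a single value, and merged is null on every row because one side of each concatenation is null, so values(merged) comes back empty.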

Please review the sample data and the code below.

Thank you for your help!

host.csv

ip_address   host
10.1.1.1     host1
10.1.1.2     host2
10.1.1.3     host3
10.1.1.4     host4
10.1.1.5     host5
10.1.1.6     host6
10.1.1.7     host7


index=risk

ip          risk    score   contact
10.1.1.1    riskA   6
10.1.1.1    riskB   7
10.1.1.1                    person1
10.1.1.1    riskC   6
10.1.1.2                    person2
10.1.1.3    riskA   6       person3
10.1.1.3    riskE   7       person3
10.1.1.4    riskF   8       person4
10.1.1.8    riskA   6       person8
10.1.1.9    riskB   7       person9


"Left join" expected output - yellow and green rectangle (see drawing below)

ip          host    risk    score   contact
10.1.1.1    host1   riskA   6
10.1.1.1    host1   riskB   7
10.1.1.1    host1                   person1
10.1.1.1    host1   riskC   6
10.1.1.2    host2                   person2
10.1.1.3    host3   riskA   6       person3
10.1.1.3    host3   riskE   7       person3
10.1.1.4    host4   riskF   8       person4
10.1.1.5    host5
10.1.1.6    host6
10.1.1.7    host7




| makeresults format=csv data="ip_address,host
10.1.1.1,host1
10.1.1.2,host2
10.1.1.3,host3
10.1.1.4,host4
10.1.1.5,host5
10.1.1.6,host6
10.1.1.7,host7"
| eval source="csv"
| rename ip_address as ip
| append
    [| makeresults format=csv data="ip,risk,score,contact
10.1.1.1,riskA,6,
10.1.1.1,riskB,7,
10.1.1.1,,,person1
10.1.1.1,riskC,6,
10.1.1.2,,,person2
10.1.1.3,riskA,6,person3
10.1.1.3,riskE,7,person3
10.1.1.4,riskF,8,person4
10.1.1.8,riskA,6,person8
10.1.1.9,riskB,7,person9"
    | fillnull value=0 score
    | fillnull value="N/A" risk score contact
    | eval source="index" ]
| eval strmerged = risk . "," . score . "," . contact
| stats values(*) as * by ip
| mvexpand strmerged
| eval temp = split(strmerged, ",")
| eval risk = mvindex(temp, 0)
| eval score = mvindex(temp, 1)
| eval contact = mvindex(temp, 2)
| search source="csv"
| table ip, host, risk, score, contact

 

 

 

1 Solution

yuanliu
SplunkTrust
SplunkTrust

It is somewhat confusing what that mvexpand is supposed to do and why string merge is necessary.  As I last commented in your other post, there is nothing wrong with Splunk's left join.  Even though I want to avoid join in general, join is better than doing all that extra work.  Here is my emulation:

 

| makeresults format=csv data="ip_address,host
10.1.1.1,host1
10.1.1.2,host2
10.1.1.3,host3
10.1.1.4,host4
10.1.1.5,host5
10.1.1.6,host6
10.1.1.7,host7"
| rename ip_address as ip
| join max=0 type=left ip
    [| makeresults format=csv data="ip,risk,score,contact
10.1.1.1,riskA,6,
10.1.1.1,riskB,7,
10.1.1.1,,,person1
10.1.1.1,riskC,6,
10.1.1.2,,,person2
10.1.1.3,riskA,6,person3
10.1.1.3,riskE,7,person3
10.1.1.4,riskF,8,person4
10.1.1.8,riskA,6,person8
10.1.1.9,riskB,7,person9"]
| table ip, host, risk, score, contact

 

The output is

ip          host    risk    score   contact
10.1.1.1    host1   riskA   6
10.1.1.1    host1   riskB   7
10.1.1.1    host1                   person1
10.1.1.1    host1   riskC   6
10.1.1.2    host2                   person2
10.1.1.3    host3   riskA   6       person3
10.1.1.3    host3   riskE   7       person3
10.1.1.4    host4   riskF   8       person4
10.1.1.5    host5
10.1.1.6    host6
10.1.1.7    host7

Hope this helps. (And thanks for posting data emulation.  That makes things easier.)
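
For completeness, here is a join-free shape that should produce the same left-join result without mvexpand or string merging. It is only a sketch against the data above (written in the inputlookup/index form for brevity; substitute the makeresults emulation to test it). The trick is that eventstats, unlike stats values(), preserves every row while copying host onto each event of the same ip; from_index and ip_has_index are helper fields introduced here:

| inputlookup host.csv
| rename ip_address as ip
| append
    [ search index=risk
      | fields ip risk score contact
      | eval from_index=1 ]
| eventstats values(host) as host, max(from_index) as ip_has_index by ip
| where isnotnull(host) AND (from_index==1 OR isnull(ip_has_index))
| table ip, host, risk, score, contact

The where clause keeps only IPs present in the CSV (the left side), keeps every index row for those IPs, and keeps the bare CSV row only when the IP had no index match, which should reproduce the eleven rows in the expected output above.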


LearningGuy
Builder

Hi @yuanliu,
I appreciate your help.  I accepted your solution.
I tried your solution and it worked; however, the join subsearch hit the 50k max-row limit, so I split the join based on subnets.

Is this the right way to do it?
See below: I split it into 3 joins, each with a filter based on a subnet. The issue is that I don't know whether a subnet will hit the 50k max rows in the future, and I would have to adjust manually.

Do you have any other suggestions?
Thanks again

| inputlookup host.csv
| rename ip_address as ip
| eval source="csv"

| join max=0 type=left ip
  [ search index=risk 
    | fields ip risk score contact
    | search ip="10.1.0.0/16"
    | eval source="risk1" ]

| join max=0 type=left ip
  [ search index=risk 
    | fields ip risk score contact
    | search ip="10.2.0.0/16"
    | eval source="risk2" ]

| join max=0 type=left ip
  [ search index=risk 
    | fields ip risk score contact
    | search ip="10.3.0.0/16"
    | eval source="risk3" ]

| table ip, host, risk, score, contact





yuanliu
SplunkTrust
SplunkTrust

It would be easier if you could influence limits.conf.  Is this a limitation in Splunk Cloud?
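
For reference, the 50K cap you are hitting is join's subsearch row limit. To the best of my knowledge it is controlled by the [join] stanza of limits.conf; the values below are the usual defaults, so verify against limits.conf.spec for your version before relying on them.

# limits.conf (assumed defaults -- check limits.conf.spec for your version)
[join]
# maximum number of rows the join subsearch may return
subsearch_maxout = 50000
# seconds before the join subsearch is auto-finalized
subsearch_maxtime = 60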

Here is how I assess the situation.

  1. If you need to use join, your method is semantically equivalent to using one index search on the right.
  2. Append + stats may or may not be faster; your search also has some inefficiencies.  But as long as current performance is acceptable, you don't have to worry.
  3. It is more profitable to investigate why your right set is greater than 50K.

You have to remember: any solution has to be evaluated against the specific dataset and specific use case.  There is no universal formula.  On the last point, you need to carefully review your indexed data.  There are several ways to reduce the number of rows on the right.

First of all, do you really have more than 50K unique combinations of ip, risk, score, contact?  I ask because you really only care about these fields.  My speculation is that you have many rows with identical combinations.  This is a simple test for you:

index=risk 
| stats count by ip risk score contact
| stats count

 Is this count really greater than 50K?  If not, this would be easier to maintain and possibly more efficient:

| inputlookup host.csv
| rename ip_address as ip
| join max=0 type=left ip
  [ search index=risk 
    | fields ip risk score contact
    | dedup ip risk score contact ]
| table ip host risk score contact

Another possible cause of too many rows on the right is that the index contains many IPs that are missing from the lookup.  If that is the case, you can further restrict the index search with the lookup, i.e.,

| inputlookup host.csv
| rename ip_address as ip
| join max=0 type=left ip
  [ search index=risk
    [ inputlookup host.csv
      | rename ip_address as ip
      | fields ip ]
    | fields ip risk score contact
    | dedup ip risk score contact ]
| table ip host risk score contact

There could be other ways to reduce the amount of data.  They all depend on your dataset and use case.

LearningGuy
Builder

Hi @yuanliu,

Thank you again for your analysis and suggestion.

The environment where I am working is very restrictive about changes to limits.conf due to a resource problem.
The index in the real data set I am working on has more than 1 million rows in a 1-day time frame; I got it down to 150k after filtering on specific subnets and fields and deduping.

If this is the case, is splitting the subsearch for the join the only way to do it?
Does the following split search look correct?
Please let me know if you have a better idea or workaround.  The real data has a lot more than 3 IP subnets 🙂    Thank you again.

| inputlookup host.csv
| rename ip_address as ip
| eval source="csv"

| join max=0 type=left ip
  [ search index=risk 
    | fields ip risk score contact
    | search ip="10.1.0.0/16"
    | eval source="risk1" ]

| join max=0 type=left ip
  [ search index=risk 
    | fields ip risk score contact
    | search ip="10.2.0.0/16"
    | eval source="risk2" ]

| join max=0 type=left ip
  [ search index=risk 
    | fields ip risk score contact
    | search ip="10.3.0.0/16"
    | eval source="risk3" ]

| table ip, host, risk, score, contact




yuanliu
SplunkTrust
SplunkTrust

Yes, the search is semantically equivalent.  That was point one in my previous comment.  If your index search already has a field named ip, there is no need to run the search command in a second pipe.  You also do not need the source evaluations, because join doesn't care about them.

| inputlookup host.csv
| rename ip_address as ip
| join max=0 type=left ip
  [ search index=risk ip="10.1.0.0/16"
    | fields ip risk score contact ]

| join max=0 type=left ip
  [ search index=risk ip="10.2.0.0/16"
    | fields ip risk score contact ]

| join max=0 type=left ip
  [ search index=risk ip="10.3.0.0/16"
    | fields ip risk score contact ]

| table ip, host, risk, score, contact

I do not see any dedup in your mock code, but I assume you have customization that is not shown.

 

LearningGuy
Builder


Hi @yuanliu 

Thank you again for your suggestion
Below I posted a sample search closer to the real one, where I have multiple subnets in the search filter plus an additional field filter.
When I removed the "| search ip" filter and moved it up next to index=risk, the search was about 3 seconds slower, but the results were the same.
1) What is the difference between using "| search ip=" and "ip="? They give the same outcome.

2) Sorry about not mentioning dedup. Because dedup removes rows that have empty/null fields, I put the dedup after the join and added a fillnull command.
If I moved it into each subsearch, I would need to add a fillnull command to each subsearch, which would probably add delay. What do you think?

I appreciate your suggestion again.

Thanks


Before removing "filter ip":

| inputlookup host.csv
| rename ip_address as ip
| search (ip="10.1.0.0/16" OR ip="10.2.0.0/16" OR ip="10.3.0.0/16" OR ip="10.4.0.0/16" OR ip="10.5.0.0/16" OR ip="10.6.0.0/16")
| join max=0 type=left ip
  [ search index=risk
    | where isnotnull(ip) AND isnotnull(risk) AND isnotnull(score)
    | search (ip="10.1.0.0/16" OR ip="10.2.0.0/16") AND (company="compA" OR company="compB")
    | fields ip risk score contact ]
| join max=0 type=left ip
  [ search index=risk
    | where isnotnull(ip) AND isnotnull(risk) AND isnotnull(score)
    | search (ip="10.3.0.0/16" OR ip="10.4.0.0/16") AND (company="compA" OR company="compB")
    | fields ip risk score contact ]
| join max=0 type=left ip
  [ search index=risk
    | search (ip="10.5.0.0/16" OR ip="10.6.0.0/16") AND (company="compA" OR company="compB")
    | fields ip risk score contact ]
| fillnull value=0 score
| fillnull value="N/A" ip risk contact
| dedup ip risk score contact
| table ip, host, risk, score, contact

After removing "filter ip" (3 seconds slower):

| inputlookup host.csv
| rename ip_address as ip
| search (ip="10.1.0.0/16" OR ip="10.2.0.0/16" OR ip="10.3.0.0/16" OR ip="10.4.0.0/16" OR ip="10.5.0.0/16" OR ip="10.6.0.0/16")
| join max=0 type=left ip
  [ search index=risk (ip="10.1.0.0/16" OR ip="10.2.0.0/16") AND (company="compA" OR company="compB")
    | where isnotnull(ip) AND isnotnull(risk) AND isnotnull(score)
    | fields ip risk score contact ]
| join max=0 type=left ip
  [ search index=risk (ip="10.3.0.0/16" OR ip="10.4.0.0/16") AND (company="compA" OR company="compB")
    | where isnotnull(ip) AND isnotnull(risk) AND isnotnull(score)
    | fields ip risk score contact ]
| join max=0 type=left ip
  [ search index=risk (ip="10.5.0.0/16" OR ip="10.6.0.0/16") AND (company="compA" OR company="compB")
    | fields ip risk score contact ]
| fillnull value=0 score
| fillnull value="N/A" ip risk contact
| dedup ip risk score contact
| table ip, host, risk, score, contact


yuanliu
SplunkTrust
SplunkTrust

1) What is the difference between using "| search ip=" and "ip="? They give the same outcome.

The idea that adding filters to the index search improves performance is a general recommendation based on the assumption that the bottleneck is the number of raw events.  This may not be the case.  There can also be cases where an index-search filter works better because the field-based filter applies to index-time extracted fields.  I have observed that in some of my searches an index-time filter slows the search down rather than speeding it up.  I have yet to conduct systematic research, but if it doesn't speed things up for you, use what works better.


2) Sorry about not mentioning dedup. Because dedup removes rows that have empty/null fields, I put the dedup after the join and added a fillnull command.
If I moved it into each subsearch, I would need to add a fillnull command to each subsearch, which would probably add delay. What do you think?

dedup has a keepempty option that you can try:

| dedup keepempty=true ip, risk, score, contact

In some of my use cases, keeping all events that have any empty field is a bit too much.  In that case, you can do fillnull before dedup, provided that you don't care to print the rows with empty risk, score, or contact.  Something like:

| inputlookup host.csv
| rename ip_address as ip
| join max=0 type=left ip
  [ search index=risk company IN (compA, compB)
    | fields ip risk score contact
    | fillnull value=UNSPEC risk score contact
    | dedup ip risk score contact ]
| foreach risk score contact
  [ eval <<FIELD>> = if(<<FIELD>> == "UNSPEC", null(), <<FIELD>>) ]
| table ip host risk score contact

You can also apply the same technique with the split subsearches, as sketched below.  Again, I do not know your data characteristics, so whether dedup does any good is for you to find out.
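
For instance, folding the fillnull + dedup into one of the subnet-split subsearches might look like this (a sketch only, reusing the subnet and company filters from your searches above):

| join max=0 type=left ip
  [ search index=risk ip="10.1.0.0/16" company IN (compA, compB)
    | fields ip risk score contact
    | fillnull value=UNSPEC risk score contact
    | dedup ip risk score contact ]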

LearningGuy
Builder

Hi @yuanliu 

Thanks for the suggestion.
The keepempty=true option is something new I learned; I wish stats values() also had it.
However, when I tried keepempty=true, it added a lot more delay (about 3x) compared to using only dedup, perhaps because I have so many fields.

I've been using fillnull to keep empty fields: although one field may be empty, I still want to keep the others. Your way of using foreach to re-assign the field to null() is awesome. Thanks for showing me this trick.
Is there any benefit to moving "UNSPEC" back to null()?
I usually just give it "N/A" for a string, and 0 for a numeric.


I appreciate your help.  Thanks


yuanliu
SplunkTrust
SplunkTrust
Is there any benefit to moving "UNSPEC" back to null()?
I usually just give it "N/A" for a string, and 0 for a numeric.

None whatsoever.  This is purely for people who want non-existent values to show as blank.

LearningGuy
Builder

Hi @yuanliu 

1) If I left join a CSV and a subsearch that have the same field name, will the data from the subsearch overwrite the data from the CSV in that field?
In my example above it is the "source" field, which I added for tracking purposes.

2) I also found out that keepempty=true doesn't always work in dedup. Have you ever experienced the same?

Thank you again for your help.


yuanliu
SplunkTrust
SplunkTrust

1) You can easily test with an emulation that right-side values override left-side values; see the sketch after point 2.

2) I do not usually use keepempty.  If you have reproducible evidence that dedup does not behave as documented, maybe engage support.
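
For example, a minimal emulation (nothing here beyond makeresults and join) shows the shared field taking the right-side value while left-only fields survive, because join defaults to overwrite=true:

| makeresults
| eval ip="10.1.1.1", host="host1", source="left"
| join type=left ip
    [| makeresults
     | eval ip="10.1.1.1", source="right"]
| table ip host source

This should return ip=10.1.1.1, host=host1, source=right.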
