Splunk Search

How do I "Left join" by appending CSV to an index in multiple fields

LearningGuy
Builder

Hello,
How do I "Left join" by appending CSV to an index in multiple fields?

I was able to solve the problem, but:
1) Is it possible to solve this without string manipulation and mvexpand? (See the code below.)
mvexpand caused slowness.

2) Can "stats values" be made NOT to remove duplicates? In this case, "stats values(*) as * by ip" merged the "risk" and "score" fields and removed the duplicates. My workaround is to combine the strings so the duplicates are retained.

3) a) Why does "stats values" ignore empty strings?
   b) Why does concatenating null with a non-null string result in null?
I have to use fillnull in order to retain the data.
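
For reference, here is a minimal run that reproduces both behaviors (the null() values stand in for the empty CSV fields):

| makeresults count=3
| streamstats count as n
| eval ip = "10.1.1.1"
| eval risk = if(n < 3, "riskA", null())
| eval contact = if(n == 3, "person1", null())
| eval merged = risk . "," . contact
| stats values(risk) as risk, values(merged) as merged by ip

values(risk) collapses the two riskA rows into a single value, and merged is null on every row because one side of each concatenation is null, so values(merged) comes back empty.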

Please review the sample data and the code below.

Thank you for your help!

host.csv

ip_address   host
10.1.1.1     host1
10.1.1.2     host2
10.1.1.3     host3
10.1.1.4     host4
10.1.1.5     host5
10.1.1.6     host6
10.1.1.7     host7


index=risk

ip          risk    score   contact
10.1.1.1    riskA   6
10.1.1.1    riskB   7
10.1.1.1                    person1
10.1.1.1    riskC   6
10.1.1.2                    person2
10.1.1.3    riskA   6       person3
10.1.1.3    riskE   7       person3
10.1.1.4    riskF   8       person4
10.1.1.8    riskA   6       person8
10.1.1.9    riskB   7       person9


"Left join" expected output - yellow and green rectangle (see drawing below)

ip          host    risk    score   contact
10.1.1.1    host1   riskA   6
10.1.1.1    host1   riskB   7
10.1.1.1    host1                   person1
10.1.1.1    host1   riskC   6
10.1.1.2    host2                   person2
10.1.1.3    host3   riskA   6       person3
10.1.1.3    host3   riskE   7       person3
10.1.1.4    host4   riskF   8       person4
10.1.1.5    host5
10.1.1.6    host6
10.1.1.7    host7




| makeresults format=csv data="ip_address,host
10.1.1.1,host1
10.1.1.2,host2
10.1.1.3,host3
10.1.1.4,host4
10.1.1.5,host5
10.1.1.6,host6
10.1.1.7,host7"
| eval source="csv"
| rename ip_address as ip
| append
    [| makeresults format=csv data="ip,risk,score,contact
10.1.1.1,riskA,6,
10.1.1.1,riskB,7,
10.1.1.1,,,person1
10.1.1.1,riskC,6,
10.1.1.2,,,person2
10.1.1.3,riskA,6,person3
10.1.1.3,riskE,7,person3
10.1.1.4,riskF,8,person4
10.1.1.8,riskA,6,person8
10.1.1.9,riskB,7,person9"
    | fillnull value=0 score
    | fillnull value="N/A" risk score contact
    | eval source="index" ]
| eval strmerged = risk . "," . score . "," . contact
| stats values(*) as * by ip
| mvexpand strmerged
| eval temp = split(strmerged, ",")
| eval risk = mvindex(temp, 0)
| eval score = mvindex(temp, 1)
| eval contact = mvindex(temp, 2)
| search source="csv"
| table ip, host, risk, score, contact

 

 

 

1 Solution

yuanliu
SplunkTrust
SplunkTrust

It is somewhat confusing what that mvexpand is supposed to do and why string merge is necessary.  As I last commented in your other post, there is nothing wrong with Splunk's left join.  Even though I want to avoid join in general, join is better than doing all that extra work.  Here is my emulation:

 

| makeresults format=csv data="ip_address,host
10.1.1.1,host1
10.1.1.2,host2
10.1.1.3,host3
10.1.1.4,host4
10.1.1.5,host5
10.1.1.6,host6
10.1.1.7,host7"
| rename ip_address as ip
| join max=0 type=left ip
    [| makeresults format=csv data="ip,risk,score,contact
10.1.1.1,riskA,6,
10.1.1.1,riskB,7,
10.1.1.1,,,person1
10.1.1.1,riskC,6,
10.1.1.2,,,person2
10.1.1.3,riskA,6,person3
10.1.1.3,riskE,7,person3
10.1.1.4,riskF,8,person4
10.1.1.8,riskA,6,person8
10.1.1.9,riskB,7,person9"]
| table ip, host, risk, score, contact

 

The output is

ip          host    risk    score   contact
10.1.1.1    host1   riskA   6
10.1.1.1    host1   riskB   7
10.1.1.1    host1                   person1
10.1.1.1    host1   riskC   6
10.1.1.2    host2                   person2
10.1.1.3    host3   riskA   6       person3
10.1.1.3    host3   riskE   7       person3
10.1.1.4    host4   riskF   8       person4
10.1.1.5    host5
10.1.1.6    host6
10.1.1.7    host7

Hope this helps. (And thanks for posting data emulation.  That makes things easier.)
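
For completeness, here is a join-free shape that should produce the same left-join result without mvexpand or string merging. It is only a sketch against the data above (written in the inputlookup/index form for brevity; substitute the makeresults emulation to test it). The trick is that eventstats, unlike stats values(), preserves every row while copying host onto each event of the same ip; from_index and ip_has_index are helper fields introduced here:

| inputlookup host.csv
| rename ip_address as ip
| append
    [ search index=risk
      | fields ip risk score contact
      | eval from_index=1 ]
| eventstats values(host) as host, max(from_index) as ip_has_index by ip
| where isnotnull(host) AND (from_index==1 OR isnull(ip_has_index))
| table ip, host, risk, score, contact

The where clause keeps only IPs present in the CSV (the left side), keeps every index row for those IPs, and keeps the bare CSV row only when the IP had no index match, which should reproduce the eleven rows in the expected output above.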


LearningGuy
Builder

Hi @yuanliu,
I appreciate your help.  I accepted your solution.
I tried your solution and it worked; however, the join subsearch hit the 50k max-row limit, so I split the join based on subnets.

Is this the right way to do it?
See below: I split it into 3 joins, each with a filter based on a subnet. The issue is that I don't know whether a subnet will hit the 50k max rows in the future, and I would have to adjust manually.

Do you have any other suggestions?
Thanks again

| inputlookup host.csv
| rename ip_address as ip
| eval source="csv"

| join max=0 type=left ip
  [ search index=risk 
    | fields ip risk score contact
    | search ip="10.1.0.0/16"
    | eval source="risk1" ]

| join max=0 type=left ip
  [ search index=risk 
    | fields ip risk score contact
    | search ip="10.2.0.0/16"
    | eval source="risk2" ]

| join max=0 type=left ip
  [ search index=risk 
    | fields ip risk score contact
    | search ip="10.3.0.0/16"
    | eval source="risk3" ]

| table ip, host, risk, score, contact





yuanliu
SplunkTrust
SplunkTrust

It would be easier if you could influence limits.conf.  Is this a limitation in Splunk Cloud?
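
For reference, the 50K cap you are hitting is join's subsearch row limit. To the best of my knowledge it is controlled by the [join] stanza of limits.conf; the values below are the usual defaults, so verify against limits.conf.spec for your version before relying on them.

# limits.conf (assumed defaults -- check limits.conf.spec for your version)
[join]
# maximum number of rows the join subsearch may return
subsearch_maxout = 50000
# seconds before the join subsearch is auto-finalized
subsearch_maxtime = 60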

Here is how I assess the situation.

  1. If you need to use join, your method is semantically equivalent to using one index search on the right.
  2. Append + stats may or may not be faster; your search also has some inefficiencies.  But as long as current performance is acceptable, you don't have to worry.
  3. It is more profitable to investigate why your right set is greater than 50K.

You have to remember: any solution has to be evaluated against the specific dataset and specific use case.  There is no universal formula.  On the last point, you need to carefully review your indexed data.  There are several ways to reduce the number of rows on the right.

First of all, do you really have more than 50K unique combinations of ip, risk, score, contact?  I ask because you really only care about these fields.  My speculation is that you have many rows with identical combinations.  This is a simple test for you:

index=risk 
| stats count by ip risk score contact
| stats count

 Is this count really greater than 50K?  If not, this would be easier to maintain and possibly more efficient:

| inputlookup host.csv
| rename ip_address as ip
| join max=0 type=left ip
  [ search index=risk 
    | fields ip risk score contact
    | dedup ip risk score contact ]
| table ip host risk score contact

Another possible cause of too many rows on the right is that the index contains many IPs that are missing from the lookup.  If that is the case, you can further restrict the index search with the lookup, i.e.,

| inputlookup host.csv
| rename ip_address as ip
| join max=0 type=left ip
  [ search index=risk
    [ inputlookup host.csv
      | rename ip_address as ip
      | fields ip ]
    | fields ip risk score contact
    | dedup ip risk score contact ]
| table ip host risk score contact

There could be other ways to reduce the amount of data.  They all depend on your dataset and use case.

LearningGuy
Builder

Hi @yuanliu,

Thank you again for your analysis and suggestion.

The environment where I am working is very restrictive about changes to limits.conf due to a resource problem.
The index in the real data set I am working on has more than 1 million rows in a 1-day time frame; I got it down to 150k after filtering on specific subnets and fields and deduping.

If this is the case, is splitting the subsearch for the join the only way to do it?
Does the following split search look correct?
Please let me know if you have a better idea or workaround.  The real data has a lot more than 3 IP subnets 🙂    Thank you again.

| inputlookup host.csv
| rename ip_address as ip
| eval source="csv"

| join max=0 type=left ip
  [ search index=risk 
    | fields ip risk score contact
    | search ip="10.1.0.0/16"
    | eval source="risk1" ]

| join max=0 type=left ip
  [ search index=risk 
    | fields ip risk score contact
    | search ip="10.2.0.0/16"
    | eval source="risk2" ]

| join max=0 type=left ip
  [ search index=risk 
    | fields ip risk score contact
    | search ip="10.3.0.0/16"
    | eval source="risk3" ]

| table ip, host, risk, score, contact




yuanliu
SplunkTrust
SplunkTrust

Yes, the search is semantically equivalent.  That was point one in my previous comment.  If your index search already has a field named ip, there is no need to run the search command in a second pipe.  You also do not need the source evaluations, because join doesn't care about them.

| inputlookup host.csv
| rename ip_address as ip
| join max=0 type=left ip
  [ search index=risk ip="10.1.0.0/16"
    | fields ip risk score contact ]

| join max=0 type=left ip
  [ search index=risk ip="10.2.0.0/16"
    | fields ip risk score contact ]

| join max=0 type=left ip
  [ search index=risk ip="10.3.0.0/16"
    | fields ip risk score contact ]

| table ip, host, risk, score, contact

I do not see any dedup in your mock code, but I assume you have customization that is not shown.

 

LearningGuy
Builder


Hi @yuanliu 

Thank you again for your suggestion
Below I posted a sample search closer to the real one, where I have multiple subnets in the search filter plus an additional field filter.
When I removed the "| search ip" filter and moved it up next to index=risk, the search was about 3 seconds slower, but the results were the same.
1) What is the difference between using "| search ip=" and "ip="? They give the same outcome.

2) Sorry about not mentioning dedup. Because dedup removes rows that have empty/null fields, I put the dedup after the join and added a fillnull command.
If I moved it into each subsearch, I would need to add a fillnull command to each subsearch, which would probably add delay. What do you think?

I appreciate your suggestion again.

Thanks


Before removing "filter ip":

| inputlookup host.csv
| rename ip_address as ip
| search (ip="10.1.0.0/16" OR ip="10.2.0.0/16" OR ip="10.3.0.0/16" OR ip="10.4.0.0/16" OR ip="10.5.0.0/16" OR ip="10.6.0.0/16")
| join max=0 type=left ip
  [ search index=risk
    | where isnotnull(ip) AND isnotnull(risk) AND isnotnull(score)
    | search (ip="10.1.0.0/16" OR ip="10.2.0.0/16") AND (company="compA" OR company="compB")
    | fields ip risk score contact ]
| join max=0 type=left ip
  [ search index=risk
    | where isnotnull(ip) AND isnotnull(risk) AND isnotnull(score)
    | search (ip="10.3.0.0/16" OR ip="10.4.0.0/16") AND (company="compA" OR company="compB")
    | fields ip risk score contact ]
| join max=0 type=left ip
  [ search index=risk
    | search (ip="10.5.0.0/16" OR ip="10.6.0.0/16") AND (company="compA" OR company="compB")
    | fields ip risk score contact ]
| fillnull value=0 score
| fillnull value="N/A" ip risk contact
| dedup ip risk score contact
| table ip, host, risk, score, contact

After removing "filter ip" (3 seconds slower):

| inputlookup host.csv
| rename ip_address as ip
| search (ip="10.1.0.0/16" OR ip="10.2.0.0/16" OR ip="10.3.0.0/16" OR ip="10.4.0.0/16" OR ip="10.5.0.0/16" OR ip="10.6.0.0/16")
| join max=0 type=left ip
  [ search index=risk (ip="10.1.0.0/16" OR ip="10.2.0.0/16") AND (company="compA" OR company="compB")
    | where isnotnull(ip) AND isnotnull(risk) AND isnotnull(score)
    | fields ip risk score contact ]
| join max=0 type=left ip
  [ search index=risk (ip="10.3.0.0/16" OR ip="10.4.0.0/16") AND (company="compA" OR company="compB")
    | where isnotnull(ip) AND isnotnull(risk) AND isnotnull(score)
    | fields ip risk score contact ]
| join max=0 type=left ip
  [ search index=risk (ip="10.5.0.0/16" OR ip="10.6.0.0/16") AND (company="compA" OR company="compB")
    | fields ip risk score contact ]
| fillnull value=0 score
| fillnull value="N/A" ip risk contact
| dedup ip risk score contact
| table ip, host, risk, score, contact


yuanliu
SplunkTrust
SplunkTrust

1) What is the difference between using "| search ip=" and "ip="? They give the same outcome.

The idea that adding filters to the index search improves performance is a general recommendation based on the assumption that the bottleneck is the number of raw events.  This may not be the case.  There can also be cases where an index-search filter works better because the field-based filter applies to index-time extracted fields.  I have observed that in some of my searches an index-time filter slows the search down rather than speeding it up.  I have yet to conduct systematic research, but if it doesn't speed things up for you, use what works better.


2) Sorry about not mentioning dedup. Because dedup removes rows that have empty/null fields, I put the dedup after the join and added a fillnull command.
If I moved it into each subsearch, I would need to add a fillnull command to each subsearch, which would probably add delay. What do you think?

dedup has a keepempty option that you can try:

| dedup keepempty=true ip, risk, score, contact

In some of my use cases, keeping all events that have any empty field is a bit too much.  In that case, you can do fillnull before dedup, provided that you don't care to print the rows with empty risk, score, or contact.  Something like:

| inputlookup host.csv
| rename ip_address as ip
| join max=0 type=left ip
  [ search index=risk company IN (compA, compB)
    | fields ip risk score contact
    | fillnull value=UNSPEC risk score contact
    | dedup ip risk score contact ]
| foreach risk score contact
  [ eval <<FIELD>> = if(<<FIELD>> == "UNSPEC", null(), <<FIELD>>) ]
| table ip host risk score contact

You can also apply the same technique with the split subsearches, as sketched below.  Again, I do not know your data characteristics, so whether dedup does any good is for you to find out.
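
For instance, folding the fillnull + dedup into one of the subnet-split subsearches might look like this (a sketch only, reusing the subnet and company filters from your searches above):

| join max=0 type=left ip
  [ search index=risk ip="10.1.0.0/16" company IN (compA, compB)
    | fields ip risk score contact
    | fillnull value=UNSPEC risk score contact
    | dedup ip risk score contact ]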

LearningGuy
Builder

Hi @yuanliu 

Thanks for the suggestion.
The keepempty=true option is something new I learned; I wish stats values() also had it.
However, when I tried keepempty=true, it added a lot more delay (about 3x) compared to using only dedup, perhaps because I have so many fields.

I've been using fillnull to keep empty fields: although one field may be empty, I still want to keep the others. Your way of using foreach to re-assign the field to null() is awesome. Thanks for showing me this trick.
Is there any benefit to moving "UNSPEC" back to null()?
I usually just give it "N/A" for a string, and 0 for a numeric.


I appreciate your help.  Thanks


yuanliu
SplunkTrust
SplunkTrust
Is there any benefit to moving "UNSPEC" back to null()?
I usually just give it "N/A" for a string, and 0 for a numeric.

None whatsoever.  This is purely for people who want non-existent values to show as blank.

LearningGuy
Builder

Hi @yuanliu 

1) If I left join a CSV and a subsearch that have the same field name, will the data from the subsearch overwrite the data from the CSV in that field?
In my example above it is the "source" field, which I added for tracking purposes.

2) I also found out that keepempty=true doesn't always work in dedup. Have you ever experienced the same?

Thank you again for your help.


yuanliu
SplunkTrust
SplunkTrust

1) You can easily test with an emulation that right-side values override left-side values; see the sketch after point 2.

2) I do not usually use keepempty.  If you have reproducible evidence that dedup does not behave as documented, maybe engage support.
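
For example, a minimal emulation (nothing here beyond makeresults and join) shows the shared field taking the right-side value while left-only fields survive, because join defaults to overwrite=true:

| makeresults
| eval ip="10.1.1.1", host="host1", source="left"
| join type=left ip
    [| makeresults
     | eval ip="10.1.1.1", source="right"]
| table ip host source

This should return ip=10.1.1.1, host=host1, source=right.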
