Hello,
How do I do a "left join" by appending a CSV to an index on multiple fields?
I was able to solve the problem, but:
1) Is it possible to solve this without the string manipulation and mvexpand? (See the code.) The mvexpand caused slowness.
2) Can "stats values" NOT remove duplicates? In this case, stats values(*) as * by ip merged the fields "risk" and "score" and removed the duplicates. My workaround is to combine them into a string to retain the duplicates.
3) a) Why does "stats values" ignore empty strings?
b) Why does adding a null into a non-null string result in an empty value? I have to use fillnull to retain the data. (See the snippet right below.)
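For reference, a minimal run showing what I mean in 3b. In eval, any expression that touches a null value becomes null, so the merged string disappears unless I coalesce or fillnull first:
| makeresults
| eval risk=null(), score="6"
| eval merged_bad = risk . "," . score
| eval merged_good = coalesce(risk, "N/A") . "," . score
| table merged_bad, merged_good
merged_bad comes out empty, while merged_good shows "N/A,6".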
Please review the sample data, drawing, and the code.
Thank you for your help!
host.csv
ip_address | host |
10.1.1.1 | host1 |
10.1.1.2 | host2 |
10.1.1.3 | host3 |
10.1.1.4 | host4 |
10.1.1.5 | host5 |
10.1.1.6 | host6 |
10.1.1.7 | host7 |
index=risk
ip | risk | score | contact |
10.1.1.1 | riskA | 6 | |
10.1.1.1 | riskB | 7 | |
10.1.1.1 | | | person1 |
10.1.1.1 | riskC | 6 | |
10.1.1.2 | | | person2 |
10.1.1.3 | riskA | 6 | person3 |
10.1.1.3 | riskE | 7 | person3 |
10.1.1.4 | riskF | 8 | person4 |
10.1.1.8 | riskA | 6 | person8 |
10.1.1.9 | riskB | 7 | person9 |
"Left join" expected output - yellow and green rectangle (see drawing below)
ip | host | risk | score | contact |
10.1.1.1 | host1 | riskA | 6 | |
10.1.1.1 | host1 | riskB | 7 | |
10.1.1.1 | host1 | | | person1 |
10.1.1.1 | host1 | riskC | 6 | |
10.1.1.2 | host2 | | | person2 |
10.1.1.3 | host3 | riskA | 6 | person3 |
10.1.1.3 | host3 | riskE | 7 | person3 |
10.1.1.4 | host4 | riskF | 8 | person4 |
10.1.1.5 | host5 | | | |
10.1.1.6 | host6 | | | |
10.1.1.7 | host7 | | | |
| makeresults format=csv data="ip_address, host
10.1.1.1, host1
10.1.1.2, host2
10.1.1.3, host3
10.1.1.4, host4
10.1.1.5, host5
10.1.1.6, host6
10.1.1.7, host7"
| eval source="csv"
| rename ip_address as ip
| append
[makeresults format=csv data="ip, risk, score, contact
10.1.1.1, riskA, 6,
10.1.1.1, riskB, 7,
10.1.1.1,,, person1
10.1.1.1, riskC, 6,
10.1.1.2,,, person2
10.1.1.3, riskA, 6, person3
10.1.1.3, riskE, 7, person3
10.1.1.4, riskF, 8, person4
10.1.1.8, riskA, 6, person8
10.1.1.9, riskB, 7, person9"
| fillnull value=0 score
| fillnull value="N/A" risk contact
| eval source="index"]
| eval strmerged = risk + "," + score + "," + contact
| stats values(*) as * by ip
| mvexpand strmerged
| eval temp = split(strmerged,",")
| eval risk = mvindex(temp, 0)
| eval score = mvindex(temp, 1)
| eval contact = mvindex(temp, 2)
| search (source="csv" AND source="index") OR (source="csv")
| table ip, host, risk, score, contact
It is somewhat confusing what that mvexpand is supposed to do and why the string merge is necessary. As I last commented in your other post, there is nothing wrong with Splunk's left join. Even though I want to avoid join in general, join is better than doing all that extra work. Here is my emulation:
| makeresults format=csv data="ip_address, host
10.1.1.1, host1
10.1.1.2, host2
10.1.1.3, host3
10.1.1.4, host4
10.1.1.5, host5
10.1.1.6, host6
10.1.1.7, host7"
| rename ip_address as ip
| join max=0 type=left ip
[makeresults format=csv data="ip, risk, score, contact
10.1.1.1, riskA, 6,
10.1.1.1, riskB, 7,
10.1.1.1,,, person1
10.1.1.1, riskC, 6,
10.1.1.2,,, person2
10.1.1.3, riskA, 6, person3
10.1.1.3, riskE, 7, person3
10.1.1.4, riskF, 8, person4
10.1.1.8, riskA, 6, person8
10.1.1.9, riskB, 7, person9"]
| table ip, host, risk, score, contact
The output is
ip | host | risk | score | contact |
10.1.1.1 | host1 | riskA | 6 | |
10.1.1.1 | host1 | riskB | 7 | |
10.1.1.1 | host1 | | | person1 |
10.1.1.1 | host1 | riskC | 6 | |
10.1.1.2 | host2 | | | person2 |
10.1.1.3 | host3 | riskA | 6 | person3 |
10.1.1.3 | host3 | riskE | 7 | person3 |
10.1.1.4 | host4 | riskF | 8 | person4 |
10.1.1.5 | host5 | | | |
10.1.1.6 | host6 | | | |
10.1.1.7 | host7 | | | |
Hope this helps. (And thanks for posting data emulation. That makes things easier.)
Hi @yuanliu,
I appreciate your help. I accepted your solution.
I tried your solution and it worked; however, the subsearch on the index hit the 50K max-row limit, so I split the join by subnet.
Is this the right way to do it?
Below I split it into 3 joins, each with a filter on a subnet, but the issue is that I don't know whether a subnet will hit the 50K max rows in the future, and I would have to adjust manually.
Do you have any other suggestions?
Thanks again
| inputlookup host.csv
| rename ip_address as ip
| eval source="csv"
| join max=0 type=left ip
[ search index=risk
| fields ip risk score contact
| search ip="10.1.0.0/16"
| eval source="risk1" ]
| join max=0 type=left ip
[ search index=risk
| fields ip risk score contact
| search ip="10.2.0.0/16"
| eval source="risk2" ]
| join max=0 type=left ip
[ search index=risk
| fields ip risk score contact
| search ip="10.3.0.0/16"
| eval source="risk3" ]
| table ip, host, risk, score, contact
It would be easier if you could influence limits.conf. Is this a limitation in Splunk Cloud?
Here is how I assess the situation.
You have to remember: any solution has to be evaluated against the specific dataset and specific use case. There is no universal formula. On that last point, you need to carefully review your indexed data. There are several ways to reduce the number of rows on the right side.
First of all, do you really have more than 50K unique combinations of ip, risk, score, and contact? I ask because you only care about these fields. My speculation is that you have many rows with identical combinations. Here is a simple test for you:
index=risk
| stats count by ip risk score contact
| stats count
Is this count really greater than 50K? If not, this would be easier to maintain and possibly more efficient:
| inputlookup host.csv
| rename ip_address as ip
| join max=0 type=left ip
[ search index=risk
| fields ip risk score contact
| dedup ip risk score contact ]
| table ip host risk score contact
Another possible cause of too many rows on the right is that there are too many IPs in the index that are missing from the lookup. If that is the case, you can further restrict the index search by the lookup, i.e.,
| inputlookup host.csv
| rename ip_address as ip
| join max=0 type=left ip
[ search index=risk
[ inputlookup host.csv
| rename ip_address as ip
| fields ip ]
| fields ip risk score contact
| dedup ip risk score contact ]
| table ip host risk score contact
There could be other ways to reduce the amount of data. They all depend on your dataset and use case.
Hi @yuanliu,
Thank you again for your analysis and suggestion.
The environment where I am working is very restrictive about changes to limits.conf due to a resource problem.
The index in the real dataset I am working on has more than 1 million rows in a 1-day time frame; I got it down to 150K after filtering on specific subnets and fields and deduping.
If this is the case:
Is splitting the subsearch for the join the only way to do it?
Does the following split search look correct?
Please let me know if you have a better idea or workaround. The real data has a lot more than 3 IP subnets 🙂 Thank you again.
| inputlookup host.csv
| rename ip_address as ip
| eval source="csv"
| join max=0 type=left ip
[ search index=risk
| fields ip risk score contact
| search ip="10.1.0.0/16"
| eval source="risk1" ]
| join max=0 type=left ip
[ search index=risk
| fields ip risk score contact
| search ip="10.2.0.0/16"
| eval source="risk2" ]
| join max=0 type=left ip
[ search index=risk
| fields ip risk score contact
| search ip="10.3.0.0/16"
| eval source="risk3" ]
| table ip, host, risk, score, contact
Yes, the search is semantically equivalent. That was my first point in the previous comment. If your index search already has a field named ip, there is no need to run the search command in a second pipe. You also do not need those source evaluations because join doesn't care about them.
| inputlookup host.csv
| rename ip_address as ip
| join max=0 type=left ip
[ search index=risk ip="10.1.0.0/16"
| fields ip risk score contact ]
| join max=0 type=left ip
[ search index=risk ip="10.2.0.0/16"
| fields ip risk score contact ]
| join max=0 type=left ip
[ search index=risk ip="10.3.0.0/16"
| fields ip risk score contact ]
| table ip, host, risk, score, contact
I do not see any dedup in your mock code, but I assume you have customization that is not shown.
Hi @yuanliu
Thank you again for your suggestion
Below I posted a sample search closer to the real one, where I have multiple subnets in the search filter and an additional field filter.
When I removed the separate "search ip" filter and moved it up next to index=risk, the search was about 3 seconds slower, but the results were the same.
1) What is the difference between using "| search ip=" and "ip="? They give the same outcome.
2) Sorry about not mentioning dedup. Because dedup removes any rows that have empty/null fields, I put the dedup after the join and added a fillnull command.
If I moved it into each subsearch, I would need to add a fillnull command per subsearch, which would probably add delay. What do you think?
I appreciate your suggestion again.
Thanks
Before moving the ip filter into the index search:
| inputlookup host.csv
| rename ip_address as ip
| search (ip="10.1.0.0/16" OR ip="10.2.0.0/16" OR ip="10.3.0.0/16" OR ip="10.4.0.0/16" OR ip="10.5.0.0/16" OR ip="10.6.0.0/16")
| join max=0 type=left ip
[ search index=risk
| search (ip="10.1.0.0/16" OR ip="10.2.0.0/16") AND (company="compA" OR company="compB")
| where isnotnull(ip) AND isnotnull(risk) AND isnotnull(score)
| fields ip risk score contact
]
| join max=0 type=left ip
[ search index=risk
| search (ip="10.3.0.0/16" OR ip="10.4.0.0/16") AND (company="compA" OR company="compB")
| where isnotnull(ip) AND isnotnull(risk) AND isnotnull(score)
| fields ip risk score contact
]
| join max=0 type=left ip
[ search index=risk
| search (ip="10.5.0.0/16" OR ip="10.6.0.0/16") AND (company="compA" OR company="compB")
| fields ip risk score contact
]
| fillnull value=0 score
| fillnull value="N/A" ip risk contact
| dedup ip risk score contact
| table ip, host, risk, score, contact
After moving the ip filter into the index search (3 seconds slower):
| inputlookup host.csv
| rename ip_address as ip
| search (ip="10.1.0.0/16" OR ip="10.2.0.0/16" OR ip="10.3.0.0/16" OR ip="10.4.0.0/16" OR ip="10.5.0.0/16" OR ip="10.6.0.0/16")
| join max=0 type=left ip
[ search index=risk (ip="10.1.0.0/16" OR ip="10.2.0.0/16") AND (company="compA" OR company="compB")
| fields ip risk score contact
| where isnotnull(ip) AND isnotnull(risk) AND isnotnull(score)
]
| join max=0 type=left ip
[ search index=risk (ip="10.3.0.0/16" OR ip="10.4.0.0/16") AND (company="compA" OR company="compB")
| fields ip risk score contact
| where isnotnull(ip) AND isnotnull(risk) AND isnotnull(score)
]
| join max=0 type=left ip
[ search index=risk (ip="10.5.0.0/16" OR ip="10.6.0.0/16") AND (company="compA" OR company="compB")
| fields ip risk score contact
]
| fillnull value=0 score
| fillnull value="N/A" ip risk contact
| dedup ip risk score contact
| table ip, host, risk, score, contact
1) What is the difference between using "| search ip=" and "ip="? They give the same outcome.
The idea that adding a filter to the index search improves performance is a general recommendation based on the assumption that the bottleneck is the number of raw events. This may not be the case. There are also cases where the index-search filter only works better when the field-based filter applies to index-time extracted fields. I have observed in some of my searches that an index-search filter slows the search down rather than speeding it up. I have yet to conduct systematic research, but if it doesn't speed things up for you, use whatever works better.
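In schematic form, the two variants you are comparing are
| join max=0 type=left ip
[ search index=risk (ip="10.1.0.0/16" OR ip="10.2.0.0/16") AND (company="compA" OR company="compB")
| fields ip risk score contact ]
versus
| join max=0 type=left ip
[ search index=risk
| search (ip="10.1.0.0/16" OR ip="10.2.0.0/16") AND (company="compA" OR company="compB")
| fields ip risk score contact ]
They return the same rows; the only difference is where the filter is applied.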
2) Sorry about not mentioning dedup. Because dedup removes any rows that have empty/null fields, I put the dedup after the join and added a fillnull command.
If I moved it into each subsearch, I would need to add a fillnull command per subsearch, which would probably add delay. What do you think?
dedup has an option keepempty that you can try:
| dedup keepempty=true ip, risk, score, contact
In some of my use cases, keeping all events that have any empty field is a bit too much. In that case, you can do fillnull before dedup, provided that you don't care to print rows with empty risk, score, or contact. Something like:
| inputlookup host.csv
| rename ip_address as ip
| join max=0 type=left ip
[ search index=risk company IN (compA, compB)
| fields ip risk score contact
| fillnull value=UNSPEC risk score contact
| dedup ip risk score contact ]
| foreach risk score contact
[eval <<FIELD>> = if(<<FIELD>> == "UNSPEC", null(), <<FIELD>>)]
| table ip host risk score contact
You can also apply the same technique with the split subsearches, as sketched below. Again, I do not know your data characteristics, so whether dedup does any good is for you to find out.
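For example, one of your split subsearches would become something like this (subnets and companies taken from your mock code; repeat one join block per subnet group):
| inputlookup host.csv
| rename ip_address as ip
| join max=0 type=left ip
[ search index=risk (ip="10.1.0.0/16" OR ip="10.2.0.0/16") company IN (compA, compB)
| fields ip risk score contact
| fillnull value=UNSPEC risk score contact
| dedup ip risk score contact ]
| foreach risk score contact
[eval <<FIELD>> = if(<<FIELD>> == "UNSPEC", null(), <<FIELD>>)]
| table ip host risk score contact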
Hi @yuanliu
Thanks for the suggestion.
The keepempty=true option is something new I learned; I wish stats values() also had that option.
However, when I tried keepempty=true, it added a lot more delay (3x) compared to using only dedup, perhaps because I have so many fields.
I've been using fillnull to keep empty fields. The reason is that even though one field is empty, I still want to keep the other fields. Your way of using foreach to reassign the fields to null() is awesome. Thanks for showing me this trick.
Is there any benefit to moving "UNSPEC" back to null()?
I usually just give it "N/A" for strings and 0 for numerics.
I appreciate your help. Thanks
Is there any benefit to moving "UNSPEC" back to null()?
I usually just give it "N/A" for strings and 0 for numerics.
None whatsoever. This is purely for people who want non-existent values to show as blank.
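If you want to see the difference, a quick test:
| makeresults
| eval contact="N/A"
| eval contact2=if(contact == "N/A", null(), contact)
| table contact, contact2
contact renders as the literal string N/A, whereas contact2 renders as a blank cell.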
Hi @yuanliu
1) If I left join a CSV and a subsearch that have the same field name, will the data from the subsearch overwrite the data from the CSV in that field?
In my example above, it is the "source" field. I added it for tracking purposes.
2) I also found that keepempty=true doesn't always work in dedup. Have you ever experienced the same?
Thank you again for your help.
1) You can easily test with an emulation that the right-side values override the left-side values. For example:
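| makeresults format=csv data="ip, source
10.1.1.1, csv"
| join max=0 type=left ip
[ makeresults format=csv data="ip, source
10.1.1.1, index" ]
| table ip, source
This returns source="index": the subsearch value wins because join's overwrite option defaults to true.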
2) I do not usually use keepempty. If you have reproducible evidence that dedup does not behave as documented, maybe engage support.