I have data from a couple different sources that I am trying to combine together into coherent results. The issue I am running into is that sometimes the data does not line up perfectly. Both data sources will report on a user and try to list all their email aliases but sometimes they are incomplete lists and only partially overlap. So we end up with multiple rows that represent the same user but and have most of the same values for the email field, but because they are not exactly the same, when I try to group by email address it doesn't work out how I would hope.
I included some example SPL below to illustrate what the data looks like. There are also some other fields in results, but those cannot be used for merging results either as the email address of the user is the only field that is in both data sets.
| makeresults
| eval email =split("1@example.com,2@example.com;2@example.com,3@example.com;4@example.com;5@example.com", ";")
| mvexpand email
| eval email=split(email, ",")
| streamstats count as orig_row
So I am wondering if there is any way to combine rows #1 and 2 in the example results while leaving rows 3 and 4 intact?
Thanks!
UPDATED:
| makeresults
| eval email =split("1@example.com,2@example.com;2@example.com,3@example.com;4@example.com;5@example.com", ";")
| mvexpand email
| eval email=split(email, ",")
| streamstats count as orig_row
| mvexpand email
| eventstats list(email) as target
| streamstats current=f count as ses
| eval target=if(ses=0, mvindex(target,1, -1),mvappend(mvindex(target,0,ses-1),mvindex(target,ses+1,-1)))
| eval checked=if(match(target,email),1,0)
| stats list(email) as email sum(checked) as checked by orig_row
| where checked > 0
UPDATED:
| makeresults
| eval email =split("1@example.com,2@example.com;2@example.com,3@example.com;4@example.com;5@example.com", ";")
| mvexpand email
| eval email=split(email, ",")
| streamstats count as orig_row
| mvexpand email
| eventstats list(email) as target
| streamstats current=f count as ses
| eval target=if(ses=0, mvindex(target,1, -1),mvappend(mvindex(target,0,ses-1),mvindex(target,ses+1,-1)))
| eval checked=if(match(target,email),1,0)
| stats list(email) as email sum(checked) as checked by orig_row
| where checked > 0
After your update the query is close. It identifies which rows have overlap, I just need to combine those correctly. I think I can manage that part. Will post back here if I get that working but will mark your answer as correct as well.
| makeresults
| eval email =split("1@example.com,2@example.com;2@example.com,3@example.com;4@example.com;5@example.com", ";")
| mvexpand email
| eval email=split(email, ",")
| streamstats count as orig_row
| mvexpand email
| eventstats list(email) as target
| streamstats current=f count as ses
| eval target=if(ses=0, mvindex(target,1, -1),mvappend(mvindex(target,0,ses-1),mvindex(target,ses+1,-1)))
| eval checked=if(match(target,email),1,0)
| stats list(email) as email sum(checked) as checked, count(orig_row) as oc by orig_row
| where checked > 0 OR oc=1
Looks like 'checked' is not getting a value for me. Perhaps it is a difference in the version of splunk I am on (7.1.6)? But I suspect that match function is not working properly with the multi-value field.
But I suspect that match function is not working properly with the multi-value field.
No. match()
works. both single and multivalue.
@sonydrew
I have a mistake. my answer is updated. please confirm.
If you want the full range of email values, you'll need a different field to do an exact match on. Is there a way you could create the user field (or another common value that you could group on) either with a rex or an eval? Maybe one of the email addresses is username@example.com and you could extract the user field with regex
Well, I don't have the actual user account name available. I could rex out the user portion of the email address (everything before the @), but then I would still end up with two Multivalue rows that I am trying to combine, just with shorter strings. For example if I have primary email address of "john_smith@example.com" and an alias of "john_s@example.com", I could rex out "john_smith" and "john_s" but I don't think that helps.
you could try writing the regex more specifically so it only matches the full username syntax. maybe something like this: ^(?<user>[a-z]+\_[a-z]{2,})\@
That would match john_smith@example.com but not john_s@example.com. might need to be tweaked depending on the naming conventions