topic Re: how to compare characters in two fields and return number of matches? in Splunk Search

how to compare characters in two fields and return number of matches?

brienhawker — Fri, 12 Apr 2019 19:58:25 GMT

I have two fields se_split and re_split which are lined up like so

re_split          se_split
a                     g
g                     h
h                     p
e                     q
c                     a
t                     z
o                     t
p                     w

Is there a way for me to compare the two fields character by character and add up how many characters match?
much thanks.

Re: how to compare characters in two fields and return number of matches?

somesoni2 — Fri, 12 Apr 2019 20:07:42 GMT

All those characters appear in different row OR it's just one multivalued field in single row?

Re: how to compare characters in two fields and return number of matches?

brienhawker — Fri, 12 Apr 2019 20:10:56 GMT

All character are in a single field. I've taken usernames and strung them out and am now trying to compare the two so that i can see if they match a certain amount.

Re: how to compare characters in two fields and return number of matches?

somesoni2 — Fri, 12 Apr 2019 20:49:19 GMT

What would be the expected output based on your example? Do they have to match in the order they appear in the field OR just character match? (e.g. in your sample if you want to compare by order,, nothing matches, but if you just want to check existence of the character, you get aghtp

Re: how to compare characters in two fields and return number of matches?

brienhawker — Wed, 30 Sep 2020 00:09:28 GMT

I just want to match if re_split is in se_split. if it returns the letters that are in that field that is fine because I can just have it count how many letters there are in comparison to se_split and come up with a final number that way. in the end i just want a number that tells me how many matching characters there are and then im going to subtract the number of matching characters from the number of characters in se_split and return a percentage value as my final number. so yes i just want to check existence.

Re: how to compare characters in two fields and return number of matches?

somesoni2 — Wed, 30 Sep 2020 00:05:21 GMT

It'll be easier to give solution if you can provide your current query. You basically have to create a new field which is copy of re_split, expand it (using mvexpand), then compare the character if it's present in se_split (using mvfind) then run some stats to count and combine rows back to original count.

Re: how to compare characters in two fields and return number of matches?

woodcock — Fri, 12 Apr 2019 21:22:47 GMT

See my answers here for background:
https://answers.splunk.com/answers/567851/how-can-i-compare-mvfields-and-get-a-diff.html
https://answers.splunk.com/answers/734599/how-to-compare-the-same-search-from-the-previous-d.html

If your stuff is already in a mv-field then you can skip this part, but if not, do this:

index=YouShouldAlwaysSpecifyAnIndex AND sourcetype=AndSourcetypeToo
|  stats values(re_split) AS re_split values(se_split) AS se_split BY whatever

For run anywhere, try this:

| makeresults 
| eval re_split="a g h e c t o p", se_split="g h p q a z t w" 
| makemv re_split 
| makemv se_split

Then you can EITHER do this:

| streamstats count AS _serial 
| multireport 
    [| mvexpand se_split 
    | where re_split!=se_split 
    | rename se_split AS se_only] 
    [| mvexpand re_split 
    | where re_split!=se_split 
    | rename re_split AS re_only] 
| stats values(*) AS * BY _serial

OR this:

| nomv re_split
| nomv se_split
| rex field=re_split mode=sed "s/[\r\n\s]+/;/g"
| rex field=se_split mode=sed "s/[\r\n\s]+/;/g"
| eval setdiff = split(replace(replace(replace(replace(mvjoin(mvsort(mvappend(split(replace(re_split, "(;|$)", "#1;"), ";"), split(replace(se_split, "(;|$)", "#0;"), ";"))), ";"), ";(\w+)#0\;\1#1", ""), ";\w+#1", ""), "#0", ""), ";(?!\w)|^;", ""), ";")

Re: how to compare characters in two fields and return number of matches?

brienhawker — Fri, 12 Apr 2019 21:25:41 GMT

Here is my query. I commented in the place that I'm trying to run that mvfind command. I have two emails, i cut anything after @ giving me usernames of sorts then i split them and now am trying to search by sender to see if the sender is trying to send something to an email that closely resembles his own.

index=msexchange size>2000000 directionality="Originating" AND action="delivered" AND recipient!="*iccu.com" AND NOT
(message_subject="RE*" OR message_subject="FW*" OR message_subject="*EXTERNAL*") AND
(recipient="*gmail.com" OR recipient="*.edu" OR recipient="*hotmail.com" OR recipient "*yahoo.com" OR recipient="*msn.com" OR 
recipient="*outlook.com" OR recipient="*aol.com" OR recipient="*zoho.com" OR recipient="*icloud.com" OR recipient="*inbox.com" OR recipient="*mail.com" OR recipient="*yandex.com" OR recipient="*protonmail.com")
|rex field=recipient "^(?<re_name>.*)@(?<trash1>.*)$"      `comment("START OF FIRST 3 COMPARASON")`
|rex field=sender "^(?<se_name>.*)@(?<trash1>.*)$"
|rex field=se_name "^(?P<se_first3>...)"
|rex field=re_name "^(?P<re_first3>...)"
|eval first_value=if(like(se_first3, re_first3), 1, 0)   `comment("START OF LAST 3 COMPARASON")`
|rex field=re_name "(?<re_last3>\w{3})$" 
|rex field=se_name "(?<se_last3>\w{3})$" 
|eval last_value=if(like(se_last3, re_last3), 1, 0)
|eval subject_value=if(isnull(message_subject), 1, 0)

`comment("THIS IS WHERE IM PULLING OUT CHARACTERS THAT I DONT WANT")`
|replace "*.*" WITH "" in re_name
|replace "*_*" WITH "" in re_name

|eval se_cut=replace(se_name,"^.","")
|eval re_cut=replace(re_name,"^.","")

|eval se_mvmake=se_name
|eval re_mvmake=re_name

| rex field=re_mvmake "(?<re_my1>.*?)\d\d\d+(?<re_my2>.*)" | eval re_mvmake=if(isnull(re_my1), re_mvmake, re_my1+re_my2)
| rex field=se_mvmake "(?<se_my1>.*?)\d\d\d+(?<se_my2>.*)" | eval se_mvmake=if(isnull(se_my1), se_mvmake, se_my1+se_my2)
`comment("IGNORE THIS")`
|makemv re_mvmake delim=‘“\”’
|makemv se_mvmake delim=‘“\”’
|mvexpand re_mvmake
|mvexpand se_mvmake
|rex field=re_name max_match=10  "\"(?<re_mv>.*?)\""
|rex field=se_name max_match=10  "\"(?<se_mv>.*?)\""

`comment("THIS IS WHAT I AM WORKING ON NOW BASED ON YOUR COMMENT")`
|eval re_split=split(re_name,"")
|eval se_split=split(se_name,"")
|mvexpand se_split
|mvexpand re_split
|eval split_total=mvfind(se_split, re_split)

|eval threat_num=4
|eval is_threat=if(like(se_name, re_name), 2, 0)
|eval appraise=if(like(message_subject, "%ppraisal%"), 2, 0)     `comment("START OF MESSAGE SUBJECT COMPARASON")`
|eval payment=if(like(message_subject, "%ayment%"), 2, 0)
|eval loan=if(like(message_subject, "%oan%"), 2, 0)
|eval estimate=if(like(message_subject, "%stimate%"), 2, 0)
|eval size_count=if(size>5000000, 1, 0)
|eval diff=(threat_num - appraise - payment - loan - estimate + size_count + first_value + last_value + is_threat + subject_value)
|eval Threat_Level=case(diff=1 OR diff=2, "Low", diff=3 OR diff=4, "Medium", diff=5, "High", diff=6 OR diff=7 OR diff=8, "Urgent")
| eval Time=strftime(_time,"%Y-%m-%d %H:%M:%S")
`comment("|where Threat_Level="High" OR Threat_Level="Urgent"")`
|stats values(split_total) values(se_split) values(re_split) values(Threat_Level) as "Threat Level" values(size) as "Message Size"    values(recipient) as Recipient values(Time) as Time by sender subject | sort -"Message Size" "Threat Level"

Re: how to compare characters in two fields and return number of matches?

brienhawker — Fri, 12 Apr 2019 21:35:47 GMT

Thank you. This works great for what im trying to do.

Re: how to compare characters in two fields and return number of matches?

woodcock — Fri, 12 Apr 2019 22:25:00 GMT

I am curious: which one did you use?

Re: how to compare characters in two fields and return number of matches?

woodcock — Sat, 13 Apr 2019 16:16:43 GMT

So you need a set diff that runs against 2 multi-valued fields, essentially, mvdiff.