Splunk Search

how to compare characters in two fields and return number of matches?

brienhawker
Explorer

I have two fields se_split and re_split which are lined up like so

re_split          se_split
a                     g
g                     h
h                     p
e                     q
c                     a
t                     z
o                     t
p                     w

Is there a way for me to compare the two fields character by character and add up how many characters match?
much thanks.

1 Solution

woodcock
Esteemed Legend

See my answers here for background:
https://answers.splunk.com/answers/567851/how-can-i-compare-mvfields-and-get-a-diff.html
https://answers.splunk.com/answers/734599/how-to-compare-the-same-search-from-the-previous-d.html

If your stuff is already in a mv-field then you can skip this part, but if not, do this:

index=YouShouldAlwaysSpecifyAnIndex AND sourcetype=AndSourcetypeToo
|  stats values(re_split) AS re_split values(se_split) AS se_split BY whatever

For run anywhere, try this:

| makeresults 
| eval re_split="a g h e c t o p", se_split="g h p q a z t w" 
| makemv re_split 
| makemv se_split 

Then you can EITHER do this:

| streamstats count AS _serial 
| multireport 
    [| mvexpand se_split 
    | where re_split!=se_split 
    | rename se_split AS se_only] 
    [| mvexpand re_split 
    | where re_split!=se_split 
    | rename re_split AS re_only] 
| stats values(*) AS * BY _serial

OR this:

| nomv re_split
| nomv se_split
| rex field=re_split mode=sed "s/[\r\n\s]+/;/g"
| rex field=se_split mode=sed "s/[\r\n\s]+/;/g"
| eval setdiff = split(replace(replace(replace(replace(mvjoin(mvsort(mvappend(split(replace(re_split, "(;|$)", "#1;"), ";"), split(replace(se_split, "(;|$)", "#0;"), ";"))), ";"), ";(\w+)#0\;\1#1", ""), ";\w+#1", ""), "#0", ""), ";(?!\w)|^;", ""), ";")

View solution in original post

woodcock
Esteemed Legend

So you need a set diff that runs against 2 multi-valued fields, essentially, mvdiff.

0 Karma

woodcock
Esteemed Legend

See my answers here for background:
https://answers.splunk.com/answers/567851/how-can-i-compare-mvfields-and-get-a-diff.html
https://answers.splunk.com/answers/734599/how-to-compare-the-same-search-from-the-previous-d.html

If your stuff is already in a mv-field then you can skip this part, but if not, do this:

index=YouShouldAlwaysSpecifyAnIndex AND sourcetype=AndSourcetypeToo
|  stats values(re_split) AS re_split values(se_split) AS se_split BY whatever

For run anywhere, try this:

| makeresults 
| eval re_split="a g h e c t o p", se_split="g h p q a z t w" 
| makemv re_split 
| makemv se_split 

Then you can EITHER do this:

| streamstats count AS _serial 
| multireport 
    [| mvexpand se_split 
    | where re_split!=se_split 
    | rename se_split AS se_only] 
    [| mvexpand re_split 
    | where re_split!=se_split 
    | rename re_split AS re_only] 
| stats values(*) AS * BY _serial

OR this:

| nomv re_split
| nomv se_split
| rex field=re_split mode=sed "s/[\r\n\s]+/;/g"
| rex field=se_split mode=sed "s/[\r\n\s]+/;/g"
| eval setdiff = split(replace(replace(replace(replace(mvjoin(mvsort(mvappend(split(replace(re_split, "(;|$)", "#1;"), ";"), split(replace(se_split, "(;|$)", "#0;"), ";"))), ";"), ";(\w+)#0\;\1#1", ""), ";\w+#1", ""), "#0", ""), ";(?!\w)|^;", ""), ";")

brienhawker
Explorer

Thank you. This works great for what im trying to do.

0 Karma

woodcock
Esteemed Legend

I am curious: which one did you use?

0 Karma

somesoni2
Revered Legend

All those characters appear in different row OR it's just one multivalued field in single row?

0 Karma

brienhawker
Explorer

All character are in a single field. I've taken usernames and strung them out and am now trying to compare the two so that i can see if they match a certain amount.

0 Karma

somesoni2
Revered Legend

What would be the expected output based on your example? Do they have to match in the order they appear in the field OR just character match? (e.g. in your sample if you want to compare by order,, nothing matches, but if you just want to check existence of the character, you get aghtp

0 Karma

brienhawker
Explorer

I just want to match if re_split is in se_split. if it returns the letters that are in that field that is fine because I can just have it count how many letters there are in comparison to se_split and come up with a final number that way. in the end i just want a number that tells me how many matching characters there are and then im going to subtract the number of matching characters from the number of characters in se_split and return a percentage value as my final number. so yes i just want to check existence.

0 Karma

somesoni2
Revered Legend

It'll be easier to give solution if you can provide your current query. You basically have to create a new field which is copy of re_split, expand it (using mvexpand), then compare the character if it's present in se_split (using mvfind) then run some stats to count and combine rows back to original count.

0 Karma

brienhawker
Explorer

Here is my query. I commented in the place that I'm trying to run that mvfind command. I have two emails, i cut anything after @ giving me usernames of sorts then i split them and now am trying to search by sender to see if the sender is trying to send something to an email that closely resembles his own.

index=msexchange size>2000000 directionality="Originating" AND action="delivered" AND recipient!="*iccu.com" AND NOT
(message_subject="RE*" OR message_subject="FW*" OR message_subject="*EXTERNAL*") AND
(recipient="*gmail.com" OR recipient="*.edu" OR recipient="*hotmail.com" OR recipient "*yahoo.com" OR recipient="*msn.com" OR 
recipient="*outlook.com" OR recipient="*aol.com" OR recipient="*zoho.com" OR recipient="*icloud.com" OR recipient="*inbox.com" OR recipient="*mail.com" OR recipient="*yandex.com" OR recipient="*protonmail.com")
|rex field=recipient "^(?<re_name>.*)@(?<trash1>.*)$"      `comment("START OF FIRST 3 COMPARASON")`
|rex field=sender "^(?<se_name>.*)@(?<trash1>.*)$"
|rex field=se_name "^(?P<se_first3>...)"
|rex field=re_name "^(?P<re_first3>...)"
|eval first_value=if(like(se_first3, re_first3), 1, 0)   `comment("START OF LAST 3 COMPARASON")`
|rex field=re_name "(?<re_last3>\w{3})$" 
|rex field=se_name "(?<se_last3>\w{3})$" 
|eval last_value=if(like(se_last3, re_last3), 1, 0)
|eval subject_value=if(isnull(message_subject), 1, 0)

`comment("THIS IS WHERE IM PULLING OUT CHARACTERS THAT I DONT WANT")`
|replace "*.*" WITH "" in re_name
|replace "*_*" WITH "" in re_name

|eval se_cut=replace(se_name,"^.","")
|eval re_cut=replace(re_name,"^.","")

|eval se_mvmake=se_name
|eval re_mvmake=re_name

| rex field=re_mvmake "(?<re_my1>.*?)\d\d\d+(?<re_my2>.*)" | eval re_mvmake=if(isnull(re_my1), re_mvmake, re_my1+re_my2)
| rex field=se_mvmake "(?<se_my1>.*?)\d\d\d+(?<se_my2>.*)" | eval se_mvmake=if(isnull(se_my1), se_mvmake, se_my1+se_my2)
`comment("IGNORE THIS")`
|makemv re_mvmake delim=‘“\”’
|makemv se_mvmake delim=‘“\”’
|mvexpand re_mvmake
|mvexpand se_mvmake
|rex field=re_name max_match=10  "\"(?<re_mv>.*?)\""
|rex field=se_name max_match=10  "\"(?<se_mv>.*?)\""

`comment("THIS IS WHAT I AM WORKING ON NOW BASED ON YOUR COMMENT")`
|eval re_split=split(re_name,"")
|eval se_split=split(se_name,"")
|mvexpand se_split
|mvexpand re_split
|eval split_total=mvfind(se_split, re_split)

|eval threat_num=4
|eval is_threat=if(like(se_name, re_name), 2, 0)
|eval appraise=if(like(message_subject, "%ppraisal%"), 2, 0)     `comment("START OF MESSAGE SUBJECT COMPARASON")`
|eval payment=if(like(message_subject, "%ayment%"), 2, 0)
|eval loan=if(like(message_subject, "%oan%"), 2, 0)
|eval estimate=if(like(message_subject, "%stimate%"), 2, 0)
|eval size_count=if(size>5000000, 1, 0)
|eval diff=(threat_num - appraise - payment - loan - estimate + size_count + first_value + last_value + is_threat + subject_value)
|eval Threat_Level=case(diff=1 OR diff=2, "Low", diff=3 OR diff=4, "Medium", diff=5, "High", diff=6 OR diff=7 OR diff=8, "Urgent")
| eval Time=strftime(_time,"%Y-%m-%d %H:%M:%S")
`comment("|where Threat_Level="High" OR Threat_Level="Urgent"")`
|stats values(split_total) values(se_split) values(re_split) values(Threat_Level) as "Threat Level" values(size) as "Message Size"    values(recipient) as Recipient values(Time) as Time by sender subject | sort -"Message Size" "Threat Level"
0 Karma
Get Updates on the Splunk Community!

Introducing Splunk Enterprise 9.2

WATCH HERE! Watch this Tech Talk to learn about the latest features and enhancements shipped in the new Splunk ...

Adoption of RUM and APM at Splunk

    Unleash the power of Splunk Observability   Watch Now In this can't miss Tech Talk! The Splunk Growth ...

Routing logs with Splunk OTel Collector for Kubernetes

The Splunk Distribution of the OpenTelemetry (OTel) Collector is a product that provides a way to ingest ...