topic Re: email partial match in Splunk Search

email partial match

ritzz — Fri, 20 Oct 2023 13:04:22 GMT

for my mail logs in JSON format, with my splunk query I created below table

mail from	mail sub	mail to
ABC	account created for A	abc@a.com
ABC	account created for B	bcd@a.com
ABC	account created for C	efg@a.com

In my splunk query I apply dedup on "mail sub". as you can see unique but very similar subject remains in table which I want to further become joined or considered as 1 row.

my ask: what are the possible way that I can partially match table column values and they combined into 1 .? in matching logic if somehow we can use two columns for matching (mail from and mail sub)

mail from	mail sub	mail to	count
ABC	account created for A	abc@a.com	3

count 3 is coming on the basis of partial match in unique subject and mail from combined.

Re: email partial match

gcusello — Fri, 20 Oct 2023 10:36:54 GMT

Hi @ritzz,

did you tried to use stats:

<your-search> | stats values(mail_sub) AS mail_sub values(mail_to) AS mail_to BY mail_from

the only probleme is that the lista in mail _sub and mail_to aren't aligned, because they are sorted in alphabetically order one by one.

if you want to have aligned values you have to combine them:

<your-search> | eval mail=mail_sub." - ".mail_to | stats values(mail) AS mail BY mail_from

Ciao.

Giuseppe

Re: email partial match

yuanliu — Fri, 20 Oct 2023 18:04:02 GMT

In my splunk query I apply dedup on "mail sub". as you can see unique but very similar subject remains in table which I want to further become joined or considered as 1 row.

I have a slightly different reading of the OP's intention based on these sentences. Do you mean you want to group by mail subject's similarity, such as "account created for *"? If so, you must realize that "similar" is a highly subjective word. Unless you spell out precise criteria to determine similarity, you must look for natural language processing tool rather than Splunk search.

Suppose my reading of your intention is correct, and that "account created for" is one such criterion for "similarity", your illustrated single-row output is still wrong. Do you mean something like

mail from	mail sub	mail to	count
ABC	account created for A, B, C	abc@a.com bcd@a.com efg@a.com	3

Not only that. You also mentioned dedup mail sub alone. That is quite counterproductive to accurate counting because you are asking for "count ... on the basis of partial match in unique subject and mail from combined." At the very minimum, you must dedup on mail from and mail sub; you SHOULD probably also add mail to in that list for the count to make sense. But I'll leave those decisions to you.

Now, to use "account created for *" as partial match. There are many ways to do that. Here is one

| rex field="mail sub" "(?<similarity>account created for)\s+(?<disimilarity>.+)" | stats values(disimilarity) as disimilarity values("mail to") as "mail to" by "mail from" similarity | eval similarity = similarity . " *" . mvjoin(disimilarity, ", ") . "*" | fields - disimilarity

This will give you

mail from	similarity	mail to
ABC	account created for A, B, C	abc@a.com bcd@a.com efg@a.com

Hope this helps. Here is an emulation that you can play with and compare with real data

| makeresults format=csv data="mail from, mail sub, mail to ABC, account created for A, abc@a.com ABC, account created for B, bcd@a.com ABC, account created for C, efg@a.com" ``` data emulation above ```