for my mail logs in JSON format, with my splunk query I created below table
mail from | mail sub | mail to |
ABC | account created for A | abc@a.com |
ABC | account created for B | bcd@a.com |
ABC | account created for C | efg@a.com |
In my splunk query I apply dedup on "mail sub". as you can see unique but very similar subject remains in table which I want to further become joined or considered as 1 row.
my ask: what are the possible way that I can partially match table column values and they combined into 1 .? in matching logic if somehow we can use two columns for matching (mail from and mail sub)
mail from | mail sub | mail to | count |
ABC | account created for A | abc@a.com | 3 |
count 3 is coming on the basis of partial match in unique subject and mail from combined.
In my splunk query I apply dedup on "mail sub". as you can see unique but very similar subject remains in table which I want to further become joined or considered as 1 row.
I have a slightly different reading of the OP's intention based on these sentences. Do you mean you want to group by mail subject's similarity, such as "account created for *"? If so, you must realize that "similar" is a highly subjective word. Unless you spell out precise criteria to determine similarity, you must look for natural language processing tool rather than Splunk search.
Suppose my reading of your intention is correct, and that "account created for" is one such criterion for "similarity", your illustrated single-row output is still wrong. Do you mean something like
mail from | mail sub | mail to | count |
ABC | account created for *A, B, C* | abc@a.com | 3 |
Not only that. You also mentioned dedup mail sub alone. That is quite counterproductive to accurate counting because you are asking for "count ... on the basis of partial match in unique subject and mail from combined." At the very minimum, you must dedup on mail from and mail sub; you SHOULD probably also add mail to in that list for the count to make sense. But I'll leave those decisions to you.
Now, to use "account created for *" as partial match. There are many ways to do that. Here is one
| rex field="mail sub" "(?<similarity>account created for)\s+(?<disimilarity>.+)"
| stats values(disimilarity) as disimilarity values("mail to") as "mail to" by "mail from" similarity
| eval similarity = similarity . " *" . mvjoin(disimilarity, ", ") . "*"
| fields - disimilarity
This will give you
mail from | similarity | mail to |
ABC | account created for *A, B, C* | abc@a.com bcd@a.com efg@a.com |
Hope this helps. Here is an emulation that you can play with and compare with real data
| makeresults format=csv data="mail from, mail sub, mail to
ABC, account created for A, abc@a.com
ABC, account created for B, bcd@a.com
ABC, account created for C, efg@a.com"
``` data emulation above ```
Hi @ritzz,
did you tried to use stats:
<your-search>
| stats values(mail_sub) AS mail_sub values(mail_to) AS mail_to BY mail_from
the only probleme is that the lista in mail _sub and mail_to aren't aligned, because they are sorted in alphabetically order one by one.
if you want to have aligned values you have to combine them:
<your-search>
| eval mail=mail_sub." - ".mail_to
| stats values(mail) AS mail BY mail_from
Ciao.
Giuseppe