Scenario: I want to find all sender email addresses whose domain is not an exact match for any domain on a list, but is "similar" to one of them (or contains part of a domain on the list).
For example, a correct sender address could be sender@company.com, while incorrect ones could be sender@company.org, sender@company-corp.net, sender@companycorporation.us, etc.
Sample code:
index=mail sourcetype=xemail
[search index=mail sourcetype=xemail subject = "Blah" |stats count by UID| fields UID]
|stats list(subject) as subj list(sender) as sender list(recipient) as recp by UID
Please provide an example using correct_domain.csv as the good domain list.
Thank you
Have you considered using the cluster command? You can use match=ngramset to look at subcomponents of the domain, then tweak the t threshold value to make the clusters more / less similar. Ideally you want to do some fuzzy matching which IDK how to do in Splunk. The nice thing about using cluster is that it is looking at 3-character substrings.
index=mail sourcetype=xemail
[search index=mail sourcetype=xemail subject = "Blah" |stats count by UID| fields UID]
|stats count by sender subject recipient UID
| cluster field=sender match=ngramset labelonly=t t=0.8
| stats values(sender) by cluster_label
In an example using a list of email addresses, cluster correctly grouped domains that contain yahoo (even when they end in a different TLD).
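To make the "3-character substrings" idea more concrete, here is a minimal Python sketch of comparing two domains by their trigram sets. It is only an illustration: the function names are hypothetical, and the Jaccard ratio used for scoring is an assumption, not necessarily the formula that cluster applies internally.

```python
def trigrams(text):
    """Return the set of 3-character substrings of text (lowercased)."""
    text = text.lower()
    return {text[i:i + 3] for i in range(len(text) - 2)}


def trigram_similarity(a, b):
    """Jaccard similarity of the two trigram sets, between 0.0 and 1.0.

    Assumption: Jaccard is used here only for illustration; Splunk's
    internal scoring for match=ngramset may differ.
    """
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0  # strings shorter than 3 characters have no trigrams
    return len(ta & tb) / len(ta | tb)


if __name__ == "__main__":
    good = "company.com"
    for candidate in ["company.org", "company-corp.net", "gmail.com"]:
        print(candidate, round(trigram_similarity(good, candidate), 2))
```

Domains that share most of their trigrams with company.com score higher than unrelated ones, which is roughly why lowering or raising t makes the clusters looser or tighter.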
Thanks I will give it a shot with cluster
Before I accept your answer, I need a bit more advice.
Let's say I have a large number of whitelisted email domains (around 500K in my correct_domain.csv) that I need to check the sender values against for variations.
Cluster seems to be rather resource-intensive. Is there a way to optimize the comparison? Or is there another way to do this domain variation check?
Thank you
Make sure you are clustering on the smallest possible table on the smallest subset of data that you can manage. The cluster command does not have any memory/resource control options.
If you do not want to use cluster, the next best option would be to use a custom Python command. There is one already written and explained here. Best of luck.
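For what it's worth, the core of such a Python check could look something like the sketch below. This is not the command referred to above, just a rough standalone illustration using the standard library. The CSV column name `domain`, the function names, the 0.8 threshold, and the minimum label length of 4 are all assumptions, and it assumes the whitelist holds bare domains like company.com (no subdomains).

```python
import csv
from difflib import SequenceMatcher


def load_good_domains(csv_file, column="domain"):
    """Read the whitelist into a set; the column name is an assumption."""
    reader = csv.DictReader(csv_file)
    return {row[column].strip().lower() for row in reader if row.get(column)}


def domain_of(sender):
    """Return the lowercased part after the last '@'."""
    return sender.rsplit("@", 1)[-1].strip().lower()


def find_lookalikes(senders, good_domains, threshold=0.8):
    """Yield (sender, good_domain) for senders whose domain resembles,
    but does not exactly equal, a whitelisted domain."""
    # Index good domains by first label, e.g. "company" for company.com.
    # If several good domains share a label, one of them is kept.
    labels = {d.split(".", 1)[0]: d for d in good_domains}
    for sender in senders:
        dom = domain_of(sender)
        if dom in good_domains:
            continue  # exact match: cheap set lookup, nothing to report
        sender_label = dom.split(".", 1)[0]
        for label, good in labels.items():
            # "Contains part of a good domain": skip very short labels
            # to limit accidental substring hits.
            contains = len(label) >= 4 and label in dom
            # Typo-style variations, e.g. conpany vs company.
            similar = SequenceMatcher(None, sender_label, label).ratio() >= threshold
            if contains or similar:
                yield sender, good
                break


if __name__ == "__main__":
    with open("correct_domain.csv", newline="") as f:
        good = load_good_domains(f)
    for hit in find_lookalikes(["sender@company.org"], good):
        print(hit)
```

The exact-match set lookup discards legitimate senders cheaply, so only the leftovers go through the slower comparison. That comparison is still a linear scan over the whitelist, so with 500K domains you would probably want to narrow the candidates first (for example with an index keyed on trigrams) before scoring them.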
Thank you!!!