Dear team,
Good day! Hope you are doing well.
I need some help in understanding a correlation search. The search is as follows:
index=email sourcetype="ironport:summary" action=delivered
|fillnull value="" file_name senderdomain
|rex field=sender "\@(?<senderdomain>[^ ]*)"
| eval list="mozilla"
| `ut_parse_extended(senderdomain,list)`
| stats count first(subject) as subject earliest(_time) as earliest latest(_time) as latest values(file_name) as file_name by ut_domain
| inputlookup append=t previously_seen_domains.csv
| stats sum(count) as No_of_emails values(subject) as subject min(earliest) as earliest max(latest) as latest values(file_name) as file_name by ut_domain
| eval isNew=if(earliest >= relative_time(now(), "-1d@d"), 1,0)
| where isNew=1 and No_of_emails>=1
| mvcombine file_name delim=" "
| eval temp_file=split(file_name," ")
| rex field="temp_file" "\.(?<ext>[^\.]*$)"
| eventstats values(ext) as extension by ut_domain
| table latest earliest ut_domain No_of_emails subject file_name temp_file extension
| eval _comment="exchange search here"
| join type=outer ut_domain
[search index=email sourcetype="MSExchange:2013:MessageTracking" directionality="Incoming" event_id="RECEIVE"
| stats count by sender_domain
| fields sender_domain
| eval list="mozilla"
| `ut_parse_extended(sender_domain,list)`
| table ut_domain sender_domain
]
| eval isExchangeFound=if(isnull(sender_domain),"false","true")
| where isExchangeFound="true"
| eval qualifiers=if(No_of_emails>=5,mvappend(qualifiers, "- More Than 5 emails from a previously unseen domain (Possible Spam)."),qualifiers)
| cluster t=0.5 labelonly=1 showcount=0 field=file_name
| eventstats dc(file_name) as similer_attach_count dc(ut_domain) as no_of_domains by cluster_label
| eval qualifiers=if(similer_attach_count>=2 AND match(extension,"(?i)(bat|chm|cmd|cpl|exe|hlp|hta|jar|msi|pif|ps1|reg|scr|vbe|vbs|wsf|lnk|scr|xlsm|dotm|lnk|zip|rar|gz|html|iso|img|one)") ,mvappend(qualifiers, "- Suspicious email attachments with similar names, sent from " .no_of_domains. " previously unseen domains. (Qbot Style)"),qualifiers)
| where mvcount(qualifiers)>0
| eval _comment="informational qualifier not counted"
| eval qualifiers=if(match(extension,"(?i)(bat|chm|cmd|cpl|exe|hlp|hta|jar|msi|pif|ps1|reg|scr|vbe|vbs|wsf|lnk|scr|xlsm|dotm|lnk|zip|rar|gz|html|iso|img|one)") ,mvappend(qualifiers, "- Email attachment contains a suspicious file extension - " .extension ),qualifiers)
| eval cluster_label=if(isnull(cluster_label),ut_domain,cluster_label)
| stats values(subject) as subject values(no_of_domains) as no_of_domains values(severity) as severity values(file_name) as file_name values(ut_domain) as ut_domain values(qualifiers) as qualifiers min(earliest) as start_time max(latest) as end_time sum(No_of_emails) as No_of_emails by cluster_label
| eval sev=if(no_of_domains>1,mvcount(qualifiers) + 1,mvcount(qualifiers))
| eval urgency=case(sev=1,"low",sev=2,"medium",sev>2,"high" )
| eval reason=mvappend("Alert qualifiers:", qualifiers)
| eval dd=" index=email sourcetype=ironport:summary sender IN (\"*".mvjoin(ut_domain, "\", \"*")."\") | eventstats last(subject) as subject by sender | eventstats last(file_name) as file_name by sender |table _time action sender recipient subject file_name"
| table start_time end_time ut_domain subject No_of_emails file_name reason urgency dd
| `security_content_ctime(start_time)`
| `security_content_ctime(end_time)`
| rename No_of_emails as result
| eval network_segment="ABC"
|search ut_domain=* NOT [inputlookup domain_whitelist.csv | fields ut_domain]
The expansion of the macro `ut_parse_extended(senderdomain,list)`:
| lookup ut_parse_extended_lookup url as senderdomain list as list
| spath input=ut_subdomain_parts
| fields - ut_subdomain_parts
We have this search and it works, but it is giving a lot of false positives. Even after a domain is added to the lookup table, we still get an alert for it. I am a SOC analyst and I have tried to understand this query, but it is very difficult. Can someone please help me simplify it? This is the first time I am posting on a community page, so if I missed any information, I apologize; do let me know if more info is required and I will be more than happy to furnish it.
Appreciate your help and support.
Dear All,
Thank you so much for the responses. I am sorry for making this difficult by sharing the whole use case. The actual requirement is as follows:
1. If an email comes from a domain we have never seen before, we investigate, and once it is confirmed legitimate, we whitelist the domain.
2. If another email comes from that same domain, we should not get an alert (subject to the throttling value).
3. If an email comes from a new domain, one not seen previously, we should get an alert in Splunk. After that, we repeat step 1.
Is it possible for you to help me with a query that satisfies this requirement? I tried my best, but it does not seem to be working. I would really appreciate any help and support.
Regards,
Anoop
Thank you for describing your SOC workflow. Yes, that can be implemented. Some questions remain about your dataset, including the content of the whitelist lookup and perhaps the procedure used to produce it. One particular aspect is the characteristics of sourcetype="ironport:summary" and sourcetype="MSExchange:2013:MessageTracking".
Additionally, is your main goal to improve performance (join is a major performance killer, as @PickleRick points out), or to improve readability (and hence maintainability)? These two do not necessarily converge, as join is better understood in many circles.
Hello Yuanliu,
I am extremely sorry for the delayed response, and thank you so much for your answer. I had a medical emergency; apologies for the delay. I went through your answer and have responded below based on my understanding. If anything is incorrect, please advise me.
Please find the following pointers for the questions you asked:
1. The lookup table '8112_domain_whitelist.csv' contains one column with the domains that need to be whitelisted.
2. sourcetype="ironport:summary". Below are some of the fields we get in this sourcetype:
host
source
UBA Email Ironport:Summary generator
sourcetype
action
direction
eventtype
file_name
info_max_time
info_min_time
info_search_time
internal_message_id
message_size_mb
recipient
sender
src_user
src_user_domain
Time
3. sourcetype="MSExchange:2013:MessageTracking"
This gives success or failure, i.e., whether an email was received by the end user (recipient).
4. How frequently are they updated respectively? --> I don't know the answer to this question, I am sorry. Is there a way I could find out? I will also ask the SIEM engineers if you advise.
5. Is one extremely large compared with another? --> In terms of number of fields, sourcetype="MSExchange:2013:MessageTracking" contains fewer fields and less information than sourcetype="ironport:summary".
6. Expansion of the macro
`ut_parse_extended()`
| lookup ut_parse_extended_lookup url as senderdomain list as list
| spath input=ut_subdomain_parts
| fields - ut_subdomain_parts
7. Expansion of the macro
| `security_content_ctime(end_time)`
| convert timeformat="%Y-%m-%dT%H:%M:%S" ctime(end_time)
8. Expansion of the macro
| `security_content_ctime(start_time)`
| convert timeformat="%Y-%m-%dT%H:%M:%S" ctime(start_time)
9. Is there a way I could improve performance as well as readability?
Appreciate your help and support.
As @yuanliu already mentioned - we don't know your data (we can guess some parts of it from the names of the fields and our own overall experience but it's nowhere near as good as a described sample (anonymized if needed)). We also don't know for sure what the search is supposed to be doing _exactly_. Again - we can make some guesses.
Anyway, I can still see several things wrong with this search.
First and foremost - the use of join command. This command has its limits and is best avoided whenever possible. It is good for some specific use cases and for them only. It's especially tricky when dealing with bigger datasets because it will silently finalize and return only partial results (if any) if you exceed its limits (run time or result count).
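For illustration only, the usual replacement for join here is to search both sourcetypes in one base search and correlate with stats. This sketch uses the field names from the original search, but it drops the ut_parse_extended normalization (which you'd re-apply to the combined domain field), so treat it as a pattern, not a drop-in replacement:

```
index=email (sourcetype="ironport:summary" action=delivered)
    OR (sourcetype="MSExchange:2013:MessageTracking" directionality="Incoming" event_id="RECEIVE")
| rex field=sender "\@(?<senderdomain>[^ ]*)"
| eval domain=coalesce(senderdomain, sender_domain)
| stats count(eval(sourcetype="ironport:summary")) as ironport_count
        count(eval(sourcetype="MSExchange:2013:MessageTracking")) as exchange_count
        by domain
| where ironport_count>0 AND exchange_count>0
```

Unlike join, this runs as a single streaming search and is not subject to the subsearch result/runtime limits.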
Secondly, you whitelist the domains at the very end of your search. That's something that should be done as early as possible to limit the number of events processed further down the pipeline.
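For example, assuming domain_whitelist.csv has a single ut_domain column, the whitelist filter can sit right after the domain is parsed, before any of the heavy processing:

```
index=email sourcetype="ironport:summary" action=delivered
| rex field=sender "\@(?<senderdomain>[^ ]*)"
| eval list="mozilla"
| `ut_parse_extended(senderdomain,list)`
| search NOT [| inputlookup domain_whitelist.csv | fields ut_domain]
```

Everything after that point then only ever sees non-whitelisted domains, which should also cure the "domain is in the lookup but still alerts" symptom if the current filter is being starved by the join.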
Thirdly, while I think I can understand why you do the ut_parse_extended thingy, I don't see much point in this.
Fourthly, you're appending a list of previously seen domains, but we have no idea what fields are in that lookup.
And there is so much more going on there... And this search is using a lot of relatively "heavy" commands...
As you have witnessed first hand, deciphering someone else's complex search is very difficult even for people who are intimately familiar with the specific dataset and detection logic like yourself. It is many times more difficult for volunteers unfamiliar with those specifics.
My suggestion, then, is to start with a description/illustration of your dataset (anonymized as needed), followed by a description of the desired output (an illustration of the current output could help, anonymized as needed), then a description of the detection logic - i.e., given the data you describe, how would an analyst discern the desired results without using Splunk? What fields are available (from each data source) for the analyst to make that determination? What is in the lookup and how is it supposed to help?
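Once those details are clear, the "alert only on first-seen domains" requirement from earlier in the thread usually reduces to a tracker-lookup pattern along these lines (a sketch only; it assumes previously_seen_domains.csv and domain_whitelist.csv each have a ut_domain column, which this thread has not confirmed):

```
index=email sourcetype="ironport:summary" action=delivered
| rex field=sender "\@(?<senderdomain>[^ ]*)"
| eval list="mozilla"
| `ut_parse_extended(senderdomain,list)`
| search NOT [| inputlookup domain_whitelist.csv | fields ut_domain]
| stats count as No_of_emails earliest(_time) as first_seen latest(_time) as last_seen by ut_domain
| lookup previously_seen_domains.csv ut_domain OUTPUT ut_domain as known_domain
| where isnull(known_domain)
```

The alert then fires only for domains absent from the tracker; a companion scheduled search (or a `| fields ut_domain | outputlookup append=true previously_seen_domains.csv` tail) records the new domains so they alert only once, and triage decides whether each one also graduates into the whitelist.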