Splunk Enterprise Security

How to calculate MTTD in ES

vikas_gopal
Builder

Hello Splunk ES experts,

I want to build a query that produces MTTD, for example by analyzing the time difference between when a raw log event is ingested (and meets the condition of a correlation search) and when a notable event is generated by that correlation search. I have tried the search below, but it does not give the results I expect because it does not calculate the time difference for notables in New status; it works fine for any other status. Can someone please help me with this? Maybe it is simple to achieve and I am overcomplicating it.

index=notable
| eval orig_epoch=if(NOT isnum(orig_time), strptime(orig_time, "%m/%d/%Y %H:%M:%S"), 'orig_time')
| eval event_epoch_standardized=orig_epoch, diff_seconds='_time'-'event_epoch_standardized'
| fields + _time, search_name, diff_seconds
| stats count as notable_count, min(diff_seconds) as min_diff_seconds, max(diff_seconds) as max_diff_seconds, avg(diff_seconds) as avg_diff_seconds by search_name
| eval avg_diff=tostring(avg_diff_seconds, "duration")
| addcoltotals labelfield=search_name

 


tscroggins
Influencer

Hi @vikas_gopal,

I answered a similar question a couple years ago. The answer should still be relevant but may require adjustments for the most recent version of ES.

https://community.splunk.com/t5/Splunk-Enterprise-Security/How-to-create-a-Dashboard-that-will-show-...

vikas_gopal
Builder

Thank you so much for your response. I have checked the link; the queries discussed in that answer are helpful for tracking the status of a notable event, such as when it is new, when it is picked up, and when it is closed.

However, this is not exactly what I’m looking for. I apologise if my question wasn't clear. What I need is to calculate the time difference between when the notable event was triggered and the time of the raw log that caused it. This will help me assess how long my correlation search took to detect the anomaly. The goal is to fine-tune the correlation searches, as not all of them are running in real time.

Let me explain with an example: Suppose I have a rule that triggers when there are 50 failed login attempts within a 20-minute window. If this condition was true from 9:00 AM to 9:20 AM, but due to a delay—either from the ES server or some other reason—the search didn’t run until 9:30 AM, then I’ve lost 10 minutes before my SOC team was alerted. If I can have a dashboard that shows the exact time difference between the raw event and the notable trigger, I can better optimise my correlation searches to minimise such delays.


tscroggins
Influencer

Hi @vikas_gopal,

The previous response provides searches to calculate time differences between known notable time values. Original event time values may not be available.

For example, the Expected Host Not Reporting rule uses the metadata command to identify hosts with a lastTime value between 2 and 30 days ago. The lastTime field is stored in the notable, and we can use it to calculate time-to-detect by subtracting the lastTime value from the _time value.
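For that rule, a minimal sketch of the calculation might look like the following; the search_name wildcard and the lastTime handling are assumptions you would adjust to match your environment:

index=notable search_name="*Expected Host Not Reporting*"
| eval lastTime_epoch=if(isnum(lastTime), lastTime, strptime(lastTime, "%m/%d/%Y %H:%M:%S"))
| eval ttd_seconds='_time'-lastTime_epoch
| stats count as notable_count, avg(ttd_seconds) as avg_ttd_seconds, max(ttd_seconds) as max_ttd_seconds by search_name
| eval avg_ttd=tostring(avg_ttd_seconds, "duration")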

An example closer to your description, the Excessive Failed Logins rule, does not store the original event time(s). We could evaluate the notable action definition for the rule to find and execute a drill-down search, which would in turn give us one or more _time values, but as with the rules themselves, success depends on the implementation of the action and drill-down search.
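As a rough illustration, you can list the drill-down searches attached to your enabled correlation searches with something like the search below; the action.correlationsearch.enabled and action.notable.param.* parameter names reflect a typical ES configuration and may differ in your version:

| rest /servicesNS/-/-/saved/searches splunk_server=local
| where 'action.correlationsearch.enabled'="1"
| table title, action.notable.param.drilldown_search, action.notable.param.drilldown_earliest_offset, action.notable.param.drilldown_latest_offset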

When developing rules, understanding event lag is usually a prerequisite. We typically calculate lag by subtracting event _time from event _indextime. The lag value is used as a lookback in rule definitions. For example, a 90th percentile lag of 5 minutes may suggest a lookback of 5 minutes. A rule scheduled to search the last 20 minutes of events would then search between the last 25 and the last 5 minutes. Your mean time-to-detect should be approximately equal to your mean lag time plus rule queuing and execution time.
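As a sketch, a lag profile for the data a rule searches might look like this; the index filter is a placeholder for whatever data source your rule actually uses:

index=your_auth_index earliest=-24h
| eval lag_seconds=_indextime-_time
| stats count, perc90(lag_seconds) as p90_lag_seconds, perc99(lag_seconds) as p99_lag_seconds, max(lag_seconds) as max_lag_seconds by index, sourcetype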

You'll need to adjust your lookback threshold relative to your tolerance for missed detections (false negatives), but this is generally how I would approach the problem.

As an alternative, you could enforce design constraints within your rules and require all notables to include original event _time values.
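As a sketch of that approach, a rule modeled on Excessive Failed Logins could carry the earliest matching event time into the notable as a field; first_event_time is a hypothetical name, and you would standardize on whatever convention fits your rules:

| tstats summariesonly=true count min(_time) as first_event_time from datamodel=Authentication where Authentication.action="failure" by Authentication.src
| where count>=50

With every rule emitting that field, the dashboard calculation becomes uniform across correlation searches:

index=notable first_event_time=*
| eval detect_seconds='_time'-first_event_time
| stats count as notable_count, avg(detect_seconds) as avg_detect_seconds by search_name
| eval avg_detect=tostring(avg_detect_seconds, "duration")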

tscroggins
Influencer

(And all of this may be why the ES team included triage and resolution metrics but excluded detection.)
