Splunk Enterprise Security

Overlapping window frame and throttling produce false positives and false negatives


Can anyone help with a solution for the scenario below? In my case it produces false positives, false NEGATIVES and an outcome that is impossible to predict.

I have tested this: the scenario happens exactly as described, although the outcome is also somewhat unpredictable.

Correlation search (CS) with tstats on an accelerated data model
Scheduling: Real-time. Reason: the application is not super important and the event should be very rare, if it occurs at all, but that can change and it may happen more often than expected. It is acceptable for the search to be skipped a few times, as long as I don't lose the event, which is why the earliest time reaches quite far back.
Earliest time: -24h. Reason: with real-time scheduling a run is skipped if it cannot start, so I want each run to cover a long interval back.
Latest time: now
Cron schedule: every 5 minutes or every hour. Reason: hourly is enough; I don't need to see the event at the exact moment it happens. A delayed notable is acceptable, though better avoided.
Schedule window: no window
Throttle: on 4 fields, window duration 24h. Reason: if the same event (in terms of these 4 fields) happens again within the same 24 hours, I don't want a notable for each occurrence. One is enough, and the incident can be closed after 24 hours if the event has not occurred again; a search can easily check this before closing.

Possible results look like this:

1. E1 happens at T0. The rule runs, is not skipped, and a notable is produced.
The rule runs every hour looking back 24 hours, and I want only one notable for this event and for any event with those 4 fields identical in the next 24 hours.

2. E2 happens at T1, let's say 11 hours after the first event. No notable is produced because of the 24-hour throttle.

3. Another 13 hours later, 24 hours after the first event, the throttle expires and the rule runs again (call this run #25 if the first run was run #1). A notable is created for E2, because the rule looks back 24 hours and E2 is still inside that window.
For me this is a false positive: I am not interested in this notable or in the e-mail that is sent with it. In any case it is not a desired result, and it arrives 13 hours after E2 happened even though the system is not loaded. So this is not the result of a system issue but of rule design or of the way Splunk works, which is what I am trying to find out.

4. 24 hours after the first event, or 13 hours after E2, event E3 happens. It falls under the throttle started by the E2 notable, which lasts another 24 hours, so there is no notable and no e-mail for E3 at this time. Even worse, the CS will never actually catch this event: by the time the throttle ends 24 hours later, the run at that point (run #49, I guess) has a time window that has already moved past the event. So here we do not even have a false positive but a false NEGATIVE! If you think this does not happen, you can test it; I already have.

It is also not predictable which events will be missed, because a REAL-TIME scheduled search has an interval within which it can start. From my tests I think this interval is as long as the cron period: if the cron is 1 minute, the search can run within that minute or be cancelled; if it is 2 minutes, it can run within those 2 minutes or be cancelled; so I guess if it is 1 hour, it can run anywhere within that hour. The results can therefore depend on whether the search runs at the beginning or at the end of the hour, because that changes its relationship with the throttle: the throttle can already be expired if the search runs at the end of the interval and still be active if it runs at the beginning.
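To make the interaction easier to reason about, here is a minimal Python sketch that replays the scenario. It is a simplified model, not Splunk's scheduler: `simulate` and all the numbers are my own assumptions, times are in minutes, all events share the same 4 throttle fields, a run is suppressed while the run time is before the end of the throttle period, and E3 is placed 30 minutes after the 24-hour mark so that it falls just after run #25. The second schedule delays run #25 by 6 minutes to imitate a real-time run that starts late.

```python
def simulate(event_times, run_times, lookback, throttle):
    """Return (run_time, events_in_window) for every run that creates a notable.

    Simplified model (an assumption, not Splunk internals): all times are in
    minutes and every event carries the same values in the throttle fields.
    """
    notables = []
    throttled_until = None
    for run in sorted(run_times):
        # Events this run can see: earliest = run - lookback, latest = run.
        window = [e for e in event_times if run - lookback < e <= run]
        if not window:
            continue  # no results, nothing to alert on
        if throttled_until is not None and run < throttled_until:
            continue  # results exist but the throttle suppresses them
        notables.append((run, window))
        throttled_until = run + throttle  # the throttle restarts at this run
    return notables


HOUR = 60
events = [0, 11 * HOUR, 24 * HOUR + 30]  # E1, E2, E3
on_time = [h * HOUR for h in range(73)]  # hourly runs over three days
# Same schedule, but run #25 starts 6 minutes late.
delayed = [t + 6 if t == 24 * HOUR else t for t in on_time]

print(simulate(events, on_time, 24 * HOUR, 24 * HOUR))
print(simulate(events, delayed, 24 * HOUR, 24 * HOUR))
```

In this model the on-time schedule raises a second notable for E2 at the 24-hour run and reports E3 only at the 48-hour run, 23.5 hours after it happened. With the 6-minute delay, the 48-hour run is still inside the throttle started by the late run, and by the 49-hour run E3 has left the 24-hour window, so E3 is never reported.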

All of the above can be changed to earliest time 5 minutes back, latest time now, cron every 2 minutes and throttle 5 minutes, and the result will be the same. I mention this in case somebody thinks that 24h is too much or that shortening it is a solution.

Considerations: I would rather not use a continuous schedule because, if I am not wrong, those searches put more pressure on the system. I am OK with a few runs being skipped, as long as I see the events sooner rather than later, but at least see them.

While I welcome any observations and recommendations, I would like to stay as close as possible to the described scenario and solve this through the CS settings. I want to find out how to get only one notable within the throttle interval (that is how I would partially interpret the throttle) and how to avoid missing events that occur more than 24 hours after the first one (just because they are throttled by a second notable that should not have been triggered).
For simplicity, assume the AppTeam wants it implemented this way: no more than one notification (e-mail) per hour.

As I am new to Splunk, there may be many ways to do this, but I don't have a clear one right now.

I think this is worth mentioning:
Even from the way Splunk defines a real-time schedule, it is fairly obvious that real-time scheduled searches in ES correlation searches come with the possibility of either missing the interval (because the search could not run) or producing false positives: "Searches with a real-time schedule are skipped if the search cannot be run at the scheduled time. Searches with a real-time schedule do not backfill gaps in data that occur if the search is skipped" - https://docs.splunk.com/Documentation/ES/5.2.2/Tutorials/ScheduleCorrelationSearch
But running on a continuous schedule is not an option either ("Use a continuous schedule to prioritize data completion, as searches with a continuous schedule are never skipped"), at least not for this application, or for some applications.

So I feel trapped between missing events and having false positives.

Some approaches I was thinking about, although nothing is written, done or tested, and none of them is related to the CS settings, which is what I am actually looking for:
- instruct the people who work with notables to ignore any notable within 24 hours of the first one, but also to run a search to check whether other events happened in the 24 hours after the second notable. This looks very complicated and cumbersome, and it does not really apply if you have e-mails.
- maybe a lookup that stores some fields and the time of the last notable, which the rule checks on every run. I am not sure this fits with using tstats and accelerated data for speed, since checking and updating a lookup may work against that.
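
As a rough illustration of the lookup idea, the Python sketch below keeps one stored value per field combination: the time of the event that opened the current throttle period. This is only a model of the logic under my own assumptions, not SPL and not an ES feature; `simulate_with_state` and `anchor` are hypothetical names, times are in minutes, and a single field combination is assumed. The difference from the built-in throttle is that suppression is decided from event time instead of run time.

```python
def simulate_with_state(event_times, run_times, lookback, throttle):
    """Return (run_time, event_time) for every notable that is created.

    `anchor` plays the role of the lookup row: the time of the event that
    opened the current throttle period. Hypothetical logic, times in minutes.
    """
    notables = []
    anchor = None
    for run in sorted(run_times):
        window = sorted(e for e in event_times if run - lookback < e <= run)
        for e in window:
            # Skip anything inside the throttle period opened by the anchor
            # event, including events that were already reported.
            if anchor is not None and e < anchor + throttle:
                continue
            notables.append((run, e))
            anchor = e  # this event opens a new throttle period
    return notables


HOUR = 60
events = [0, 11 * HOUR, 24 * HOUR + 30]  # E1, E2, E3 as in the scenario
on_time = [h * HOUR for h in range(73)]
# Same schedule, but the 24-hour run starts 6 minutes late.
delayed = [t + 6 if t == 24 * HOUR else t for t in on_time]

print(simulate_with_state(events, on_time, 24 * HOUR, 24 * HOUR))
print(simulate_with_state(events, delayed, 24 * HOUR, 24 * HOUR))
```

In this model E2 is never reported because it lies within 24 hours of E1, and E3 is reported by the first run that sees it, whether or not the 24-hour run starts late. The cost is the one the bullet above worries about: every run has to read and update the stored state.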