Can anyone help with a solution for the scenario below? In my case it produces false positives, false NEGATIVES, and outcomes that are impossible to predict.
I have tested this, and the scenario happens 100% as described, though with some unpredictability.
Correlation search (CS) using tstats on an accelerated data model
Scheduling: Real-time. Reason: the application is not critical and the event should be very rare, if it occurs at all, though that can change and it may become more frequent than expected. It is acceptable for the search to be skipped a few times, as long as I don't lose the event, which is why the earliest time reaches quite far back.
Earliest time: -24h. Reason: a search with a real-time schedule is skipped if it cannot run, so I want to cover a long interval back.
Cron schedule: every 5 minutes or every hour. Reason: running every hour is enough; I don't need to see the event exactly when it happens. A delay in producing the notable is acceptable, though better avoided.
Schedule window: no window
Throttle: on 4 fields, with a window duration of 24h. Reason: if the same event (in terms of these 4 fields) happens again within the same 24 hours, I don't want a notable for each occurrence. One is enough, and the incident can be closed after 24 hours if the event has not recurred; a search can easily check this before closing.
Possible results look like this:
1. E1 happens at T0; the rule runs (it is not skipped) and a notable is produced. The rule runs every hour looking back 24 hours, and I want only one notable for this event and for any event with the same 4 field values in the next 24 hours.
2. E2 happens at T1, say 11 hours after E1. No notable is produced, because of the 24-hour throttle.
3. Another 13 hours later, 24 hours after E1, the throttle expires and the rule runs again (call this run #25, if the first run was run #1). A notable is created for E2, because the rule looks back 24 hours and E2 is still within its window. For me this is a false positive: I am not interested in this notable or in the e-mail it sends. It also arrives 13 hours after the event happened, even though the system is not loaded, so this is not a system issue but a rule design issue or a consequence of how Splunk works, which is what I am trying to find out.
4. 24 hours after E1, or 13 hours after E2, event E3 happens. It falls under the throttle started by the E2 notable for another 24 hours, so there is no notable and no e-mail for E3 at that time. Worse, this event will never be caught by the CS, because by the time the throttle ends 24 hours later, the run at that point (run #49, I guess) will have a time range that starts after this event. So here we have not a false positive but a false NEGATIVE! If you think this does not happen, you can test it; I already have.
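The timeline above can be reproduced with a toy model. This is only a sketch, not Splunk's scheduler or throttling code. It assumes that each run looks at the window (run - lookback, run], that all events share the same 4 throttle field values, that a run fires a notable whenever its window is non-empty and no throttle is active, and that the throttle starts at the time of the run that fired. The function name `simulate` and the event times are made up for illustration.

```python
def simulate(event_hours, run_hours, lookback=24.0, throttle=24.0):
    """Toy model of a scheduled search with a look-back window and a throttle.

    All events are assumed to share the same four throttle field values.
    Returns a list of (run_hour, events_in_window) for every run that fires.
    """
    fired = []
    throttled_until = None  # hour at which the current throttle expires
    for run in run_hours:
        # Events visible to this run: (run - lookback, run]
        window = [e for e in event_hours if run - lookback < e <= run]
        if not window:
            continue
        if throttled_until is not None and run < throttled_until:
            continue  # a matching result exists, but it is suppressed
        fired.append((run, window))
        throttled_until = run + throttle
    return fired


# Hypothetical timeline: E1 at hour 0, E2 at hour 11, E3 shortly after hour 24.
events = [0.0, 11.0, 24.5]

on_time = [float(h) for h in range(73)]              # every hourly run executes
one_skip = [float(h) for h in range(73) if h != 48]  # the run at hour 48 is skipped

print("all runs executed:", simulate(events, on_time))
print("run 48 skipped:   ", simulate(events, one_skip))
```

In this model the run at hour 24 fires for E2 alone (the unwanted notable). What happens to E3 depends on timing: if the run at hour 48 executes, E3 is reported 23.5 hours late; if that single run is skipped, E3 has already left the look-back window by hour 49 and is never reported.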
It is also not predictable which events will be skipped, because a search with a real-time schedule has an interval in which it can run. From my tests, I think the interval is as long as the cron period: with a 1-minute cron the search can run within that minute or be canceled, with a 2-minute cron it can run within those 2 minutes or be canceled, so I guess with a 1-hour cron it can run at any point in that hour. Results can therefore depend on whether the search runs at the beginning or the end of the hour, because that affects its relationship with the throttle: the throttle may have expired if the search runs at the end of a one-hour interval, and may still be active if it runs at the beginning.
All of the above can be changed to 5 minutes back to now, running every 2 minutes, with a 5-minute throttle, and the result will be the same. I mention this in case somebody thinks that 24h is too much, or that shortening it is a solution.
Considerations: I would not use a continuous schedule because, if I am not wrong, those searches put more pressure on the system. I am OK with missing the events in a few runs, as long as I see them sooner rather than later, but at least see them.
While I welcome any observations and recommendations, I would like to stay as close as possible to the described scenario and solve this through the CS settings: to find out how to get only one notable per throttle interval (that is how I would partially interpret the throttle) and to avoid events going undetected more than 24 hours after the first one (just because they are throttled by a second notable that shouldn't have been triggered).
For simplicity, assume the AppTeam wants it implemented like this: no more than one notification (e-mail) per hour.
As I am new to Splunk, there may be many ways to do this, but I don't have a clear one now.
I think it is worth mentioning:
Even from how Splunk defines a real-time schedule, it is fairly obvious that searches with a real-time schedule in ES correlation searches come with the possibility of either missing an interval (because the search could not run) or producing false positives: "Searches with a real-time schedule are skipped if the search cannot be run at the scheduled time. Searches with a real-time schedule do not backfill gaps in data that occur if the search is skipped" - https://docs.splunk.com/Documentation/ES/5.2.2/Tutorials/ScheduleCorrelationSearch
But a continuous schedule is not an option either, at least not for this application, or for some applications: "Use a continuous schedule to prioritize data completion, as searches with a continuous schedule are never skipped"
So I feel trapped between missing events and having false positives.
Some approaches I was thinking of, though nothing is written, done, or tested, and none of them relates to the CS settings, which is what I am actually looking for:
- Instruct the people who work with notables to ignore any notable within 24 hours of the first one, but also to run a search to check whether other events happened in the 24 hours after the second notable. This looks very complicated and cumbersome, and it does not really apply if you have e-mails.
- Maybe a lookup that stores some fields and the time of the last notable, with the rule checking that lookup on every run. I am not sure this fits with using tstats and accelerated data for speed, since checking and updating a lookup may work against that.
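The lookup idea in the second bullet can be sketched as a small state machine. This is only an illustration of the logic in plain Python, not SPL and not a tested Splunk configuration. It assumes the rule can remember, per combination of the 4 fields, the newest event time already accounted for and the time of the last notable; the function `run_with_state` and the state keys `last_seen` and `last_notable` are hypothetical names. For brevity the sketch tracks a single field combination.

```python
def run_with_state(event_hours, run_hours, lookback=24.0, throttle=24.0):
    """Toy model of a rule that keeps per-key state, as a lookup might.

    last_seen    = newest event time already accounted for
    last_notable = hour of the last notable for this key
    Returns a list of (run_hour, new_events) for every run that fires.
    """
    state = {"last_seen": None, "last_notable": None}
    fired = []
    for run in run_hours:
        window = [e for e in event_hours if run - lookback < e <= run]
        # Only events newer than the stored watermark count as new.
        new = [e for e in window
               if state["last_seen"] is None or e > state["last_seen"]]
        if not new:
            continue
        throttled = (state["last_notable"] is not None
                     and run < state["last_notable"] + throttle)
        if not throttled:
            fired.append((run, new))
            state["last_notable"] = run
        # Advance the watermark either way, so an event that arrived during
        # the throttle is not reported again after the throttle expires.
        state["last_seen"] = max(new)
    return fired


# Same hypothetical timeline: E1 at hour 0, E2 at hour 11, E3 at hour 24.5.
print(run_with_state([0.0, 11.0, 24.5], [float(h) for h in range(73)]))
```

Under these assumptions E2 is absorbed by the first notable, and E3 produces a notable at the first run after it arrives. If that run is skipped, the next executed run still picks E3 up, because the 24-hour look-back still covers it. Whether reading and writing such a lookup is compatible with a tstats search on accelerated data is left open here.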
But even so, the documentation here is among the worst; I would almost say contradictory:
1. "prioritize current data" vs. "data completion": what is one supposed to understand from such abstract terms?
2. "As excessive failed logins matter most when you hear about them quickly" vs. "If you care more about identifying all excessive failed logins in your environment": again, what is the difference between these two statements?
It might be because I am not a native English speaker, but I am not so sure it is my fault.
Configure a schedule for the correlation search
Correlation searches can run with a real-time or continuous schedule.
• Use a real-time schedule to prioritize current data and performance. Searches with a real-time schedule are skipped if the search cannot be run at the scheduled time. Searches with a real-time schedule do not backfill gaps in data that occur if the search is skipped.
• Use a continuous schedule to prioritize data completion, as searches with a continuous schedule are never skipped.
As excessive failed logins matter most when you hear about them quickly, select a real-time schedule for the search. If you care more about identifying all excessive failed logins in your environment, you can select a continuous schedule for the search instead.
Thank you. I am asking precisely because I am trying to find good solutions for some CS I am building, while putting as little stress on the system as possible and accepting some delay, because the app and use cases are not critical, just important.
Closer and closer. I am trying to understand the difference between a real-time and a continuous schedule on tsidx (using tstats with summariesonly=true, so basically on already indexed and accelerated data).
I find the terms themselves very counterintuitive (continuous and real-time), but that is another topic.
Back to some examples, as they always help:
Take one rule that looks for the same event and alerts if it happens x times in an interval of y minutes. Let's call this a fail event, so I need x fails in y minutes (5 minutes in this example), with no success event inside those 5 minutes and between the fails.
Say I want to put as little pressure on the system as possible, and I search an accelerated data model. I want to be sure I lose no event, so which should I choose?
1. A real-time schedule that looks back 1 hour and runs every minute, with count by _time span=5min inside the search.
2. A continuous schedule that looks back five minutes and runs every minute, with count by _time span=5min.
The advantage of 1, given that this is not a critical app and I want to leave other searches more room, is that up to 59 runs can fail or be canceled and the search can still see the events I described above.
If I use continuous, I understand the run will never be canceled, so it will compete with other searches until the point where the whole system has delayed searches.
Very confusing overall. Looking forward to an answer.
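The "59 skipped runs" figure for option 1 can be checked with simple arithmetic. The sketch below is a toy model, not Splunk behaviour: it assumes runs start exactly on schedule (times in minutes), that each run sees the window (run - lookback, run], and that a skipped run sees nothing. The helper names `runs_that_see` and `is_caught` are made up for illustration.

```python
def runs_that_see(event_minute, lookback, every, horizon=240):
    """Scheduled run times whose window (run - lookback, run] contains the event."""
    return [r for r in range(0, horizon + 1, every)
            if r - lookback < event_minute <= r]


def is_caught(event_minute, lookback, every, skipped, horizon=240):
    """True if at least one run that sees the event was not skipped."""
    return any(r not in skipped
               for r in runs_that_see(event_minute, lookback, every, horizon))


# Hypothetical event at minute 100, search scheduled every minute.
print(len(runs_that_see(100, lookback=60, every=1)))  # runs that see it with a 1-hour look-back
print(len(runs_that_see(100, lookback=5, every=1)))   # runs that see it with a 5-minute look-back

# With the 1-hour look-back, 59 consecutive skips still leave one run that sees the event.
print(is_caught(100, 60, 1, skipped=set(range(100, 159))))
# 60 consecutive skips lose it.
print(is_caught(100, 60, 1, skipped=set(range(100, 160))))
```

With a 60-minute look-back and a 1-minute schedule, 60 runs see any given event, so 59 of them can be skipped. With a 5-minute look-back only 5 runs see it, which matters only if those runs can be skipped at all; the documentation quoted earlier says continuous searches are never skipped.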
Hi, why do you say it is expensive to run in real time? Splunk mentions that searches with a real-time schedule have lower priority than continuous ones and can be skipped, without ever seeing the interval that was skipped. So if you don't want your search to stress the system, you would choose real-time instead of continuous, and because you don't want to lose events, you would look further back in time on every run and hope that at least one run executes and catches those events. If you choose continuous instead, the search will not be skipped but will run anyway. The question is what happens if it simply cannot run because of performance issues: would it still cover the right time span?
I am not an expert in searches, but if Splunk says real-time puts less pressure on the system than continuous, and he wants to use real-time and not continuous, how would he write his search? Also, what happens with your search if it fails to run? Will it just possibly lose some events?
For sure, that rule as originally asked will miss events if they don't fall in the same 5-minute span, even if there are 10 and 2 failed events inside a 5-minute interval, but not in the same interval defined by time span=5. Looking forward to an explanation. Thank you.
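The bucket-boundary problem can be shown with a few numbers. The sketch below is plain Python, not SPL. It compares counting failures in clock-aligned 5-minute buckets (what a fixed span does, as I understand it) with counting them in any 5-minute stretch. The function names and the failure times are hypothetical, and the "no success in between" condition is left out to keep it short.

```python
from collections import Counter


def max_in_fixed_buckets(fail_minutes, span=5):
    """Largest count in any clock-aligned bucket [k*span, (k+1)*span)."""
    buckets = Counter(int(t // span) for t in fail_minutes)
    return max(buckets.values(), default=0)


def max_in_sliding_window(fail_minutes, span=5):
    """Largest count in any window (t - span, t] that ends at a failure."""
    times = sorted(fail_minutes)
    best = 0
    start = 0
    for end, t in enumerate(times):
        # Drop failures that are span minutes or more older than the current one.
        while t - times[start] >= span:
            start += 1
        best = max(best, end - start + 1)
    return best


# Hypothetical burst: 7 failures within 3 minutes, straddling the boundary at minute 5.
fails = [3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0]
print(max_in_fixed_buckets(fails))   # highest count in a single fixed bucket
print(max_in_sliding_window(fails))  # highest count in any 5-minute stretch
```

With a threshold of, say, 5 failures, the fixed buckets split this burst into 4 and 3 and never reach the threshold, while the sliding count sees all 7.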