We have been running alerts that periodically check various sourcetypes and notify us if zero events are found, so that we know when something has broken with logging or indexing and we are missing data. I'm going to set this up for some AWS logs (e.g. CloudTrail) that we collect from multiple (10+) AWS accounts.
The goal is to be alerted if, for example, CloudTrail logs are missing for ANY one of the 10+ AWS accounts we're monitoring. I'm aware of two approaches that would work, each with its own pros and cons. I wonder if somebody has a suggestion that combines the pros of both approaches and avoids the cons.
I. Use one saved alert to search all of the AWS accounts, e.g.:
index=aws sourcetype=cloudtrail
| stats latest(_time) as latest by aws_account_id
| where latest < relative_time(now(), "-1h")
And alert when there ARE results for this.
Pros: only one search/alert needed, only get one alert message covering all accounts with missing data (if more than one)
Cons: only works during the time range of the search (if the problem continues beyond the time range of the search, we'll just stop hearing about it)
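For reference, a saved alert for approach I could be configured in savedsearches.conf roughly like this (a sketch only; the stanza name, hourly schedule, 4-hour window, and email address are placeholders):
[Missing CloudTrail data - any account]
search = index=aws sourcetype=cloudtrail | stats latest(_time) as latest by aws_account_id | where latest < relative_time(now(), "-1h")
enableSched = 1
cron_schedule = 0 * * * *
dispatch.earliest_time = -4h
dispatch.latest_time = now
counttype = number of events
relation = greater than
quantity = 0
action.email = 1
action.email.to = alerts@example.com
In the UI this corresponds to triggering when the number of results is greater than zero.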
II. Use one saved alert for each AWS account, e.g.
index=aws sourcetype=cloudtrail aws_account_id=foo
and
index=aws sourcetype=cloudtrail aws_account_id=bar
etc
And set up individual alerts for each one, alerting when there are NO results.
Pros: not dependent on the time range of the search; if the problem continues beyond the time range of the search, the alert will keep complaining until the problem is resolved
Cons: need to set up a separate alert for each AWS account, including a new one every time we add a new AWS account. If the problem affects multiple accounts, we'll get spammed with separate alerts for each one
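For comparison, each per-account alert in approach II would look something like the following, triggering on zero results, with one copy per account (again just a sketch with placeholder names):
[Missing CloudTrail data - account foo]
search = index=aws sourcetype=cloudtrail aws_account_id=foo
enableSched = 1
cron_schedule = 0 * * * *
dispatch.earliest_time = -1h
dispatch.latest_time = now
counttype = number of events
relation = equal to
quantity = 0
action.email = 1
action.email.to = alerts@example.com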
Is there a way to get the pros of both of these methods without the cons?
index=aws sourcetype=cloudtrail
| stats latest(_time) as latest by aws_account_id
| where latest < relative_time(now(), "-1h") AND latest > relative_time(now(), "-2h")
Continuously missing accounts are excluded by this query, so it does not keep alerting on the same ongoing gap.
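To see when that where clause matches, here is a self-contained example using makeresults with made-up gap ages (0.5, 1.5, and 3 hours are just illustrative values):
| makeresults
| eval gap_hours="0.5 1.5 3"
| makemv gap_hours
| mvexpand gap_hours
| eval latest=now() - gap_hours*3600
| eval fires=if(latest < relative_time(now(), "-1h") AND latest > relative_time(now(), "-2h"), "yes", "no")
| table gap_hours fires
Only the 1.5-hour gap matches; a gap still under an hour, or one that has already grown past two hours, falls outside the band.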
Thanks for answering! I don't see how this would solve the problem, though. To be more specific, let's say I have the search:
index=aws sourcetype=cloudtrail | stats latest(_time) as latest by aws_account_id | where latest < relative_time(now(), "-1h")
with the time frame set to "Last 4 hours". And let's say I have the alert set to run once per hour. This will tell me if any aws_account_id
that was previously sending data has not sent data in the last hour, and it will continue sending that alert once per hour until the latest event received is more than 4 hours old. Then it will stop alerting, even if the problem of missing data continues. My question is how to make this search continue alerting until logging resumes.
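To illustrate with some hypothetical gaps (0.5, 2, and 5 hours; the extra where simply mimics the "Last 4 hours" window, since an account whose latest event is older than that never reaches the stats output at all):
| makeresults
| eval gap_hours="0.5 2 5"
| makemv gap_hours
| mvexpand gap_hours
| eval latest=now() - gap_hours*3600
| where latest > relative_time(now(), "-4h")
| eval fires=if(latest < relative_time(now(), "-1h"), "yes", "no")
| table gap_hours fires
The 2-hour gap alerts, the fresh account does not, and the account that has been silent for 5 hours simply disappears from the results, which is exactly when the alerting stops.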
with the time frame set to "Last 4 hours".
I think the time frame should be set per hour. Will you change the time frame?
Why is the time frame "Last 4 hours"?
I have the alert running once per hour, and the search time frame as "last 4 hours". This way, the results include any events in the last 4 hours where the timestamp is more than one hour old (specifically, more than one hour old, and less than four hours old).
If you set the time frame also to one hour, this search would produce no results at all (because there are no events with a timestamp more than one hour old if you're only looking at a time frame of last one hour).
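Put another way, every event a one-hour search window can see has _time of at least now() minus one hour, so the per-account latest(_time) can never drop below that threshold and the where clause never matches. For example (earliest=-1h pins the window inside the search itself):
index=aws sourcetype=cloudtrail earliest=-1h
| stats latest(_time) as latest by aws_account_id
| where latest < relative_time(now(), "-1h")
This should always come back empty, regardless of whether any account has actually stopped logging.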
Do you make something like a flag?
Not sure what you mean by that?
| eval flag=if(latest < relative_time(now(), "-1h") AND latest > relative_time(now(), "-2h"), "flag", null())
like above.
If an event has the flag, then fire the alert.
index=aws sourcetype=cloudtrail
| stats latest(_time) as latest by aws_account_id
| eval flag=if(latest < relative_time(now(), "-1h") AND latest > relative_time(now(), "-2h"), "flag", null())
| where isnotnull(flag)
like this.
If I do:
index=aws sourcetype=cloudtrail
The results can include events with
aws_account_id=123
aws_account_id=456
aws_account_id=789
So if aws_account_id=789
stops sending logs to Splunk, those events go missing. So if I do the search: index=aws sourcetype=cloudtrail aws_account_id=789
I can alert when there are zero matches, and this works well: it will continue to send alerts until the missing-data problem is fixed. The problem is that I would need to do this for every aws_account_id value.
If I use your suggestion, the alerts will stop once the latest event from the missing aws_account_id value falls outside the time range of the search. But I don't want the alerts to stop until the problem is fixed.
I see; as you like.
We have been running alerts that periodically check various sourcetypes and notify us if zero events are found, so that we know when something has broken with logging or indexing and we are missing data
My query works for this, I think.
If you don't like it, please choose the other approach.
The point is that a gap appearing and a gap continuing are two different things, I think.
We would need another query and another alert for that.