Alerting

Splunk - False Alerts

ashrafshareeb
Path Finder

Hi All,

I'm facing a situation of false alerts being triggered in Splunk.

From the internal splunkd logs,

11-22-2018 14:02:04.653 +0000 INFO SavedSplunker - savedsearch_id="nobody;search;MQ publisher down", user="", app="search", savedsearch_name=" MQ publisher down ", status=success, digest_mode=1, scheduled_time=1542895320, window_time=0, dispatch_time=1542895323, run_time=0.401, result_count=0, alert_actions="email", sid="scheduler_c008417search_RMD5a08184670275bb77_at_1542895320_95737_F13A4D07-C935-4605-B1F8-A2D2A0C73712", suppressed=0, thread_id="AlertNotifierWorker-0"

I can see that result_count=0 is what triggered this alert, but when I compare the index time and the event time (_indextime - _time), there is no major difference.
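For reference, this is roughly how I checked the lag, reusing the same base search (the eval/stats part is just one way of summarizing it):

host=<HOSTNAME> "server.MQPublisher" source="F:\\TEST\\MQPUBLISHER.log"
| eval lag_seconds = _indextime - _time
| stats min(lag_seconds) AS min_lag max(lag_seconds) AS max_lag avg(lag_seconds) AS avg_lag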

Settings

Alert: MQ publisher down
Alert type: Scheduled
Run on Cron Schedule
Earliest: -4m@m
Latest: -1m@m
Cron expression: */1 6-22 * * 1-5

Trigger Conditions

Trigger alert when: Number of Results is less than 1
Trigger: Once For each result

Throttle: Yes
Suppress triggering for: 10 minute(s)

The search being run is:

host=<HOSTNAME> "server.MQPublisher" source="F:\\TEST\\MQPUBLISHER.log"

Can someone help with this, please? The alert is meant to verify that the MQ logs are being written to continuously and that the publisher is not down; if nothing is written to the logs for a period, the alert should trigger.
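For reference, my understanding is that the equivalent savedsearches.conf stanza would look roughly like the sketch below (reconstructed from the UI settings above, not copied from the actual config file):

[MQ publisher down]
search = host=<HOSTNAME> "server.MQPublisher" source="F:\\TEST\\MQPUBLISHER.log"
enableSched = 1
cron_schedule = */1 6-22 * * 1-5
dispatch.earliest_time = -4m@m
dispatch.latest_time = -1m@m
counttype = number of events
relation = less than
quantity = 1
action.email = 1
alert.suppress = 1
alert.suppress.period = 10m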


Richfez
SplunkTrust

Normally I'd say the 3-minute window, running from -4 minutes to -1 minute ago, should be plenty to get around the indexing falling a bit behind. And besides, you have confirmed _indextime and _time are close enough not to matter. So, let's assume that works. What else might be going wrong here?

I'd say the next thing to check is whether there really were no results in that time period. You could do this manually, or maybe ... if you had supplied a search or even an event or two this would be easier, but here goes -

my base search ...
| streamstats time_window=3m count AS three_minute_window
| timechart min(three_minute_window)

Or something like that, run over the few hours (or whatever) around when the false alert showed up. You know, just to confirm what Splunk sees...
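Another quick way to eyeball it (a sketch; adjust the span to taste) is a plain per-minute count over the same period:

my base search ...
| timechart span=1m count

A run of empty minutes around 14:02 would line up with what the scheduler saw.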

Hope this helps at least a little,
-Rich


ashrafshareeb
Path Finder

Hi Rich,

I tried running the search over the time range when the alert was triggered; the statistics are below. The false alert sent an email at about 14:02, and in the streamstats output I can see a change between 2018-11-22 14:02:30 (90) and 2018-11-22 14:02:35 (88). But when I run it over all of 22-11-2018, it shows 90 everywhere, which has me a little confused - does that mean it's all good? If I use a time range of 14:02 to 14:05, I see the below:

_time                               min(three_minute_window)
2018-11-22 14:02:10 90
2018-11-22 14:02:15 90
2018-11-22 14:02:20 90
2018-11-22 14:02:25 90
2018-11-22 14:02:30 90
2018-11-22 14:02:35 88  
2018-11-22 14:02:40 86
2018-11-22 14:02:45 83
2018-11-22 14:02:50 81
2018-11-22 14:02:55 78
2018-11-22 14:03:00 76
2018-11-22 14:03:05 73 
2018-11-22 14:03:10 71
2018-11-22 14:03:15 68
2018-11-22 14:03:20 66
2018-11-22 14:03:25 63
2018-11-22 14:03:30 61
2018-11-22 14:03:35 58
2018-11-22 14:03:40 56
2018-11-22 14:03:45 53
2018-11-22 14:03:50 51
2018-11-22 14:03:55 48
2018-11-22 14:04:00 46
2018-11-22 14:04:05 43
2018-11-22 14:04:10 41
2018-11-22 14:04:15 38
2018-11-22 14:04:20 36
2018-11-22 14:04:25 33
2018-11-22 14:04:30 31
2018-11-22 14:04:35 28
2018-11-22 14:04:40 26
2018-11-22 14:04:45 23
2018-11-22 14:04:50 21
2018-11-22 14:04:55 18
2018-11-22 14:05:00 16
2018-11-22 14:05:05 13
2018-11-22 14:05:10 11
2018-11-22 14:05:15 8
2018-11-22 14:05:20 6
2018-11-22 14:05:25 3
2018-11-22 14:05:30 1

Richfez
SplunkTrust

Well, that was a 3-minute rolling window. So... (I'm thinking this through as I type) at 14:02 you had a count of 90 in the previous three minutes. From 14:02 (I'm just rounding off those seconds) to about 14:05, your count steadily drops until it's nearly at zero.

(By the way, streamstats is probably doing this backwards from the way you would normally think - I use streamstats a lot, but nearly always with a tiny window of 2 events or so, so it's rarely noticeable.)

That's a drop of the entire 90 over 3 minutes, which means you spent three minutes with a continuous count of zero events matching server.MQPublisher for that host and source.

Which means ... that's a legitimate alert. Three minutes, no events matching your criteria.
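If you want to confirm that gap straight from the events, a sketch like this should surface any silence longer than your 3-minute window (field names like prev_time and gap_seconds are just illustrative):

host=<HOSTNAME> "server.MQPublisher" source="F:\\TEST\\MQPUBLISHER.log"
| sort 0 _time
| streamstats current=f window=1 last(_time) AS prev_time
| eval gap_seconds = _time - prev_time
| where gap_seconds > 180
| convert ctime(prev_time)
| table _time prev_time gap_seconds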

Let's try from the other side of things:

What made you think it was a false alert in the first place? Is there some other product/monitoring software you are using that said "everything's fine?" Or was it just that "no one complained then" or something?

Which sort of leads me to wonder: if it was a legit alert by the search's criteria, but the condition wasn't really something that should have alerted, then the criteria just need tweaking. Maybe the "log writing process" gets behind, so Splunk doesn't see current information? Maybe you need to broaden the time frame you are looking at?
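If you suspect the log writer or forwarder falling behind, a sketch like this (same base search, standard _indextime/_time fields) would show whether events around the alert arrived late:

host=<HOSTNAME> "server.MQPublisher" source="F:\\TEST\\MQPUBLISHER.log"
| eval lag_seconds = _indextime - _time
| timechart span=1m max(lag_seconds) AS max_lag_seconds

If that lag spikes past a few minutes right around 14:02, broadening the earliest time on the alert would be the easy fix.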
