Hi All,
I'm seeing false alerts being triggered in Splunk. From the internal splunkd logs:
11-22-2018 14:02:04.653 +0000 INFO SavedSplunker - savedsearch_id="nobody;search;MQ publisher down", user="", app="search", savedsearch_name=" MQ publisher down ", status=success, digest_mode=1, scheduled_time=1542895320, window_time=0, dispatch_time=1542895323, run_time=0.401, result_count=0, alert_actions="email", sid="scheduler_c008417search_RMD5a08184670275bb77_at_1542895320_95737_F13A4D07-C935-4605-B1F8-A2D2A0C73712", suppressed=0, thread_id="AlertNotifierWorker-0"
I can see that result_count=0 is what triggered this alert, but when I checked the difference between the index time and the event time (_indextime - _time), there was no major lag.
Settings
Alert: MQ publisher down
Alert type: Scheduled
Run on Cron Schedule
Earliest: -4m@m
Latest: -1m@m
Cron Expression: */1 6-22 * * 1-5
Trigger Conditions
Trigger alert when Number of Results is less than 1
Trigger: Once For each result
Throttle: Yes
Suppress triggering for 10 minute(s)
The search being run is: host=<HOSTNAME> "server.MQPublisher" source="F:\\TEST\\MQPUBLISHER.log"
Can someone help with this, please? The alert is meant to verify that entries are continuously being written to the MQ logs, i.e. that the publisher is not down. If there is a period in which nothing is written to the logs, this alert should trigger.
Normally I'd say the 3-minute window, running from -4 minutes to -1 minute ago, should be plenty to get around the indexing falling a bit behind. Besides, you have confirmed _indextime and _time are close enough not to matter. So let's assume that works. What else might be going wrong here?
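Just to make the window concrete, here's a small Python sketch (not SPL) of the range that earliest=-4m@m and latest=-1m@m resolve to at a given scheduled_time; the snap-to-minute arithmetic is my reading of Splunk's relative time modifiers:

```python
from datetime import datetime, timedelta

def snap_to_minute(dt: datetime) -> datetime:
    """Emulate Splunk's @m snap: truncate to the start of the minute."""
    return dt.replace(second=0, microsecond=0)

def search_window(scheduled: datetime):
    """Window searched for earliest=-4m@m, latest=-1m@m at scheduled_time."""
    return (snap_to_minute(scheduled - timedelta(minutes=4)),
            snap_to_minute(scheduled - timedelta(minutes=1)))

# scheduled_time=1542895320 in the splunkd log is 2018-11-22 14:02:00 UTC
earliest, latest = search_window(datetime(2018, 11, 22, 14, 2, 0))
print(earliest, latest)   # 2018-11-22 13:58:00 2018-11-22 14:01:00
```

So the run that fired at 14:02 searched roughly 13:58:00 to 14:01:00.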
I'd say the next thing to check is whether there really were no results in that time period. You could do this manually, or maybe... if you had supplied a search or even an event or two this would be easier, but here goes:
my base search ...
| streamstats time_window=3m count AS three_minute_window
| timechart min(three_minute_window)
Or something like that run over the few hours or whatever that the false alert showed up in. You know, just to confirm what Splunk sees...
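If it helps to reason about what that streamstats window does, here's a rough Python equivalent of a trailing 3-minute count (my sketch, not Splunk's implementation; the event timestamps are made up):

```python
from collections import deque

def rolling_count(event_times, window_s=180):
    """Trailing count of events in the last window_s seconds, roughly what
    `streamstats time_window=3m count` computes over time-ordered events."""
    window, out = deque(), []
    for t in sorted(event_times):
        window.append(t)
        while t - window[0] > window_s:   # age out events older than 3 minutes
            window.popleft()
        out.append((t, len(window)))
    return out

# One event every 2 seconds for 3 minutes: the trailing count reaches 90.
steady = list(range(0, 180, 2))
print(rolling_count(steady)[-1])          # (178, 90)
# If traffic then stops, the next event sees all the old ones age out.
print(rolling_count(steady + [400])[-1])  # (400, 1)
```

A steady count means steady traffic; a draining count means old events are aging out of the window with nothing new arriving.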
Hope this helps at least a little,
-Rich
Hi Rich,
I ran the search over the time range in which the alert triggered; below are the statistics. The false alert sent an email at about 14:02, and in the streamstats output below I can see the count change between 2018-11-22 14:02:30 (90) and 2018-11-22 14:02:35 (88). But when I ran it over the whole of 22-11-2018, it shows 90 throughout, which has me a little confused. Does that mean it's all good? If I use the time range 14:02 to 14:05, I see the below:
_time min(three_minute_window)
2018-11-22 14:02:10 90
2018-11-22 14:02:15 90
2018-11-22 14:02:20 90
2018-11-22 14:02:25 90
2018-11-22 14:02:30 90
2018-11-22 14:02:35 88
2018-11-22 14:02:40 86
2018-11-22 14:02:45 83
2018-11-22 14:02:50 81
2018-11-22 14:02:55 78
2018-11-22 14:03:00 76
2018-11-22 14:03:05 73
2018-11-22 14:03:10 71
2018-11-22 14:03:15 68
2018-11-22 14:03:20 66
2018-11-22 14:03:25 63
2018-11-22 14:03:30 61
2018-11-22 14:03:35 58
2018-11-22 14:03:40 56
2018-11-22 14:03:45 53
2018-11-22 14:03:50 51
2018-11-22 14:03:55 48
2018-11-22 14:04:00 46
2018-11-22 14:04:05 43
2018-11-22 14:04:10 41
2018-11-22 14:04:15 38
2018-11-22 14:04:20 36
2018-11-22 14:04:25 33
2018-11-22 14:04:30 31
2018-11-22 14:04:35 28
2018-11-22 14:04:40 26
2018-11-22 14:04:45 23
2018-11-22 14:04:50 21
2018-11-22 14:04:55 18
2018-11-22 14:05:00 16
2018-11-22 14:05:05 13
2018-11-22 14:05:10 11
2018-11-22 14:05:15 8
2018-11-22 14:05:20 6
2018-11-22 14:05:25 3
2018-11-22 14:05:30 1
Well, that was a 3-minute rolling window. So (I'm thinking this through as I type): at 14:02 you had a count of 90 in the previous three minutes. From 14:02 (I'm just rounding off those seconds) to about 14:05, your count steadily drops until it's nearly at zero.
(By the way, streamstats is probably doing this backwards from the way you would normally think. I use streamstats a lot, but nearly always with a tiny window of 2 events or so, so it's rarely noticeable.)
That's a drop of the entire 90 in 3 minutes, which means you spent three minutes with a continuous count of zero "server.MQPublisher" events for that host and source.
Which means ... that's a legitimate alert. Three minutes, no events matching your criteria.
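To put numbers on that: the table drains from 90 at 14:02:30 to 1 at 14:05:30, which matches the roughly one-event-every-two-seconds rate that a steady count of 90 in a 3-minute window implies (my arithmetic, reading the table as a trailing window):

```python
# A steady count of 90 in a 180-second window ~ one event every 2 seconds.
events_per_sec = 90 / 180
print(events_per_sec)            # 0.5

# The drain in the table: 90 at 14:02:30 down to 1 at 14:05:30 = 89 in 180 s.
drain_rate = (90 - 1) / 180
print(round(drain_rate, 2))      # 0.49 - old events aging out, nothing new
```

The drain rate matching the arrival rate is exactly what you'd see if ingestion stopped cold at about 14:02:30.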
Let's try from the other side of things:
What made you think it was a false alert in the first place? Is there some other product or monitoring software you are using that said "everything's fine"? Or was it just that no one complained at the time, or something like that?
Which sort of leads me to wonder: if it was a legit alert, but the condition wasn't really something that should have alerted, then the criteria just need tweaking. Maybe the log-writing process gets behind, so Splunk doesn't see current information? Maybe you need to broaden the time frame you are looking at?
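One way to think about broadening the window: if the log writer (or forwarder) can lag by up to some number of seconds, the latest offset has to end at least that far in the past. A hypothetical helper to pick the smallest whole-minute offset (not a Splunk API, just the arithmetic):

```python
def min_latest_offset(max_lag_s: int) -> str:
    """Smallest whole-minute `latest` offset that tolerates max_lag_s seconds
    of indexing lag. Hypothetical helper, not part of any Splunk API."""
    minutes = -(-max_lag_s // 60)        # ceiling division
    return f"-{minutes}m@m"

print(min_latest_offset(60))    # -1m@m  (the alert's current setting)
print(min_latest_offset(90))    # -2m@m  (90 s of lag needs latest 2 min back)
```

If the lag ever exceeds what the current -1m@m covers, widening earliest/latest (or the "less than 1" threshold) would be the knob to turn.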