I have a current alert that is working as expected to capture a log event that states a service is down. We have started to receive many false positives on this because the service automatically recovers in a matter of seconds. I would like to change the alert so that instead of immediately sending a notification, it will pause for 30 seconds and search for a recovery event and only send the notification if that recovery is not found.
index=networklogs (host=foo10* OR host=foo11*) AND ("member" AND "monitor status down") | rex "monitor status\s+(?<State>\w+)" | rex "member /Common/(?<trpHost>[^:]+):53" | eval Identifier=trpHost + "dropped out of the VIP pool" | eval Summary="Critical Infrastructure - Server dropped out of the VIP pool. Pool member is " + State + "." | eval ProcessID="foo" | eval Severity=if(State=="down", 5, 1) | eval Type=if(State=="down", 1, 2) | eval OwnerGID=1000 | eval ForceUpdateFields="Severity,Type,Summary" | eval Submitter="foo" | eval LOB="IP" | eval AlertGroup="VIP Member Dropped out" | eval Agent="rdns"
I cannot edit the original post or submit any further replies, so here is the second search. The alert should fire only if this search returns no results:
index=networklogs (host=foo10* OR host=foo11*) AND ("member" AND "monitor status up")
index=networklogs host=foo10* OR host=foo11* AND "member" AND ("monitor status up" OR "monitor status down") | rex "monitor status\s+(?<state>up|down)" | transaction host startswith="monitor status down" endswith="monitor status up" maxspan=30s maxevents=2 keepevicted=t | where closed_txn=0 AND state="down"
How can I test this?
I tried changing maxspan to 1s and setting the time range to a period where we had false positives with 6s of downtime, but I still didn't get a result.
Would my test scenario be correct, then? Should I adjust maxspan?
I'm not having success with this. Can you break down what you suggested and explain what each part is doing? I don't understand the closed_txn=0 condition.
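closed_txn is a field that the transaction command adds when keepevicted=t is set. It is 1 for a transaction that was closed normally and 0 for an evicted (incomplete) one. So closed_txn=0 will show transactions that don't have both events (start and end), and the state="down" condition then keeps only the ones where the "down" event never got its matching "up".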
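If the transaction-based search keeps coming back empty, one way to cross-check it is a different technique: use stats to take the most recent state per pool member and keep only the members whose last logged state is still "down" at least 30 seconds later. This is only a rough sketch and not the search suggested above. The field names last_state and last_time are made up for illustration, and it assumes every relevant event matches both rex patterns from the original searches (events where trpHost is not extracted are dropped by the BY clause).

```spl
index=networklogs (host=foo10* OR host=foo11*) "member" ("monitor status up" OR "monitor status down")
| rex "monitor status\s+(?<state>up|down)"
| rex "member /Common/(?<trpHost>[^:]+):53"
| stats latest(state) AS last_state latest(_time) AS last_time BY host trpHost
| where last_state="down" AND last_time <= now() - 30
```

To try the logic without waiting for a real outage, synthetic events from makeresults can stand in for the index. The host and member names below are placeholders:

```spl
| makeresults count=2
| streamstats count AS n
| eval _time=if(n==1, now() - 40, now() - 34)
| eval state=if(n==1, "down", "up"), host="foo10a", trpHost="member1"
| stats latest(state) AS last_state latest(_time) AS last_time BY host trpHost
| where last_state="down" AND last_time <= now() - 30
```

Here the member recovers 6 seconds after going down, so its last state is "up" and the search should return nothing. Changing "up" to "down" in the second eval should produce one row. Note that this approach only looks at the last state inside the search window, so it is meant for a frequently scheduled alert rather than for finding past outages that have since recovered.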