Alert on specific EventCode values unless followed by a specific EventCode within a 5-minute span

danbutterman · ‎12-29-2017

Happy New Year,

I'm working on an alert for certain event codes regarding DFS Replication.

index=wineventlogs sourcetype="WinEventLog:DFS Replication" (host=host1 OR host=host2 OR host=host3) (EventCode=1202 OR EventCode=5002 OR EventCode=5008 OR EventCode=5012 OR EventCode=5014) earliest=-5m
| rex "Message=(?<Message>.*)" 
| table _time,Message,ComputerName,EventCode,Error

I would like to return a result if any of the following EventCodes are found in an event from five minutes ago (EventCode 1202 OR 5002 OR 5008 OR 5012 OR 5014), unless followed by an event with either EventCode 5004 or 1206 (which represent a recovery) within five minutes of the error event code.

I'm eyeballing the case and validate functions, but I'm having some difficulty putting the picture together.

Thank you for any assistance.

danbutterman · ‎01-01-2018

I will give these a try tomorrow when I’m back in office and send an update.

nikita_p · ‎01-01-2018

Hi @danbutterman,
I'm not sure I understood your question correctly, but can you try the query below and see if it helps?
index=wineventlogs sourcetype="WinEventLog:DFS Replication" (host=host1 OR host=host2 OR host=host3) (EventCode=1202 OR EventCode=5002 OR EventCode=5008 OR EventCode=5012 OR EventCode=5014 OR EventCode=1206 OR EventCode=5004) | search EventCode!=1206 AND EventCode!=5004
| rex "Message=(?<Message>.*)"
| table _time,Message,ComputerName,EventCode,Error

You can adjust the time range using a custom time.

woodcock · ‎12-31-2017

This is a pretty good use case for the transaction command; you need startswith, endswith, and maxspan.
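A minimal, untested sketch of that approach follows. It assumes that EventCode is an extracted field and that host is a sensible field to group on. It also uses two things not mentioned above: the keepevicted=true option, which retains transactions that never reached a closing event, and the closed_txn field, which transaction sets to 0 on those unclosed groups. The final search drops any group that still contains a recovery code.

index=wineventlogs sourcetype="WinEventLog:DFS Replication" (EventCode=1202 OR EventCode=5002 OR EventCode=5008 OR EventCode=5012 OR EventCode=5014 OR EventCode=5004 OR EventCode=1206)
| transaction host startswith=eval(EventCode=1202 OR EventCode=5002 OR EventCode=5008 OR EventCode=5012 OR EventCode=5014) endswith=eval(EventCode=5004 OR EventCode=1206) maxspan=5m keepevicted=true
| search closed_txn=0 NOT EventCode=5004 NOT EventCode=1206
| table _time,host,EventCode

Grouping by host rather than EventCode matters here: transaction groups events that share the same value of the listed field, so a transaction keyed on EventCode can never contain both a 5014 and a 5004. A failure logged in the last few minutes will also show up as unclosed, simply because its recovery may not have arrived yet, so the search window probably needs to exclude the most recent five minutes of failures.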

danbutterman · ‎01-02-2018

One step closer.

I swapped out the field "EventCode" with "host" and now I'm seeing transactions.

| transaction host startswith=5014 endswith=5004 maxspan=5m

Next step would be to change this so that a transaction is created only when a "5004" or "1206" is not found (within 5 minutes of the error event, e.g., 5014 or 5008).

danbutterman · ‎01-02-2018

Woodcock,

This is the search I perform to find 5014 and 5004 events (for example).

index=wineventlogs sourcetype="WinEventLog:DFS Replication" (host=DC01 OR host=DC02 OR host=DC03) (EventCode=1202 OR EventCode=5002 OR EventCode=5008 OR EventCode=5012 OR EventCode=5014 OR EventCode=5004 OR EventCode=1206)
| rex "Message=(?<Message>.*)"
| table _time,Message,ComputerName,EventCode,Error

The first two events returned are shown below:

1/1/18
11:00:57.000 PM
01/01/2018 11:00:57 PM
LogName=DFS Replication
SourceName=DFSR
EventCode=5004
EventType=4

1/1/18
11:00:51.000 PM
01/01/2018 11:00:51 PM
LogName=DFS Replication
SourceName=DFSR
EventCode=5014
EventType=3

When I add in the transaction command (as follows), no results are returned:

index=wineventlogs sourcetype="WinEventLog:DFS Replication" (host=DC01 OR host=DC02 OR host=DC03) (EventCode=1202 OR EventCode=5002 OR EventCode=5008 OR EventCode=5012 OR EventCode=5014 OR EventCode=5004 OR EventCode=1206)
| rex "Message=(?<Message>.*)"
| transaction EventCode startswith=5014 endswith=5004 maxspan=5m

The way I imagine the transaction command is supposed to work using the example above is it finds my 5014 event at 11:00:51 PM (which marks the beginning of a new transaction, specifically where a replication error occurred), and finds the 5004 event at 11:00:57 PM (which marks the end of the transaction); however, nothing is returned.

I seem to have taken a wrong turn.

micahkemp · ‎12-29-2017

This is a bit difficult to validate without sample data, but here's my untested attempt:

index=wineventlogs sourcetype="WinEventLog:DFS Replication" (host=host1 OR host=host2 OR host=host3) (EventCode=1202 OR EventCode=5002 OR EventCode=5008 OR EventCode=5012 OR EventCode=5014 OR EventCode=5004 OR EventCode=1206)
| eval failure_time=if(EventCode=1202 OR EventCode=5002 OR EventCode=5008 OR EventCode=5012 OR EventCode=5014, _time, null())
| where isnull(failure_time) OR failure_time<relative_time(now(), "-5min")
| head 1 
| search EventCode=1202 OR EventCode=5002 OR EventCode=5008 OR EventCode=5012 OR EventCode=5014
| rex "Message=(?<Message>.*)"
| table _time,Message,ComputerName,EventCode,Error

It takes into account @somesoni2's point about only counting "failures" that are at least 5 minutes old. The idea is to find all failure and recovery events, drop failures that are less than 5 minutes old, keep only the most recent remaining event, and show it only if it is a failure.

somesoni2 · ‎12-29-2017

You'd need to adjust your time range to give those recovery events time to occur. Look for error events only from -10m@m to -5m@m, and for recovery events from -10m@m to now. That way you can correlate a recovery event with its error event. With the current 5-minute time range you're alerting prematurely, because a recovery event may not have been logged yet.
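The windowing described above could be sketched roughly as follows. This is an untested illustration, not a verified alert: the is_recovery, last_failure, and last_recovery field names are made up for the example, grouping by host is an assumption, and the 300-second check is one way to require that the recovery arrive within five minutes of the failure.

index=wineventlogs sourcetype="WinEventLog:DFS Replication" (host=host1 OR host=host2 OR host=host3) (EventCode=1202 OR EventCode=5002 OR EventCode=5008 OR EventCode=5012 OR EventCode=5014 OR EventCode=5004 OR EventCode=1206) earliest=-10m@m latest=now
| eval is_recovery=if(EventCode=5004 OR EventCode=1206, 1, 0)
| where is_recovery=1 OR _time<relative_time(now(), "-5m@m")
| stats max(eval(if(is_recovery=0, _time, null()))) as last_failure, max(eval(if(is_recovery=1, _time, null()))) as last_recovery by host
| where isnotnull(last_failure) AND (isnull(last_recovery) OR last_recovery<last_failure OR last_recovery>last_failure+300)

The first where clause keeps every recovery event but only those failures that are already at least five minutes old. The final where clause returns a host only when its latest such failure has no recovery after it, or the recovery came more than five minutes later. Because only the latest failure and latest recovery per host are compared, an earlier unrecovered failure in the same window would not be reported on its own.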
