Splunk Dev

Alert on specific EventCode values unless followed by a specific EventCode within a 5-minute span

danbutterman
Explorer

Happy New Year,

I'm working on an alert for certain event codes regarding DFS Replication.

index=wineventlogs sourcetype="WinEventLog:DFS Replication" (host=host1 OR host=host2 OR host=host3) (EventCode=1202 OR EventCode=5002 OR EventCode=5008 OR EventCode=5012 OR EventCode=5014) earliest=-5m
| rex "Message=(?<Message>.*)" 
| table _time,Message,ComputerName,EventCode,Error

I would like to return a result if any of the following EventCodes are found in an event from five minutes ago (EventCode 1202 OR 5002 OR 5008 OR 5012 OR 5014), unless followed by an event with either EventCode 5004 or 1206 (which represent a recovery) within five minutes of the error event code.

I'm eyeballing the case and validate functions, but I'm having some difficulty putting the picture together.

Thank you for any assistance.

danbutterman
Explorer

I will give these a try tomorrow when I'm back in the office and send an update.

nikita_p
Contributor

Hi @danbutterman,
I don't know if I understood your question correctly, but can you try the query below and see if it helps?
index=wineventlogs sourcetype="WinEventLog:DFS Replication" (host=host1 OR host=host2 OR host=host3) (EventCode=1202 OR EventCode=5002 OR EventCode=5008 OR EventCode=5012 OR EventCode=5014 OR EventCode=1206 OR EventCode=5004)
| search EventCode!=1206 AND EventCode!=5004
| rex "Message=(?<Message>.*)"
| table _time,Message,ComputerName,EventCode,Error

You can adjust the time range with a custom time setting.

woodcock
Esteemed Legend

This is a pretty good use case for the transaction command; you need startswith, endswith, and maxspan.
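
A minimal, untested sketch of how those three options might fit together. Grouping by host and writing the start and end conditions as eval(...) filters are assumptions on my part, not something stated above. Without keepevicted, this should return only the error events that were followed by a recovery event (5004 or 1206) on the same host within five minutes:

```spl
index=wineventlogs sourcetype="WinEventLog:DFS Replication"
    (host=host1 OR host=host2 OR host=host3)
    (EventCode=1202 OR EventCode=5002 OR EventCode=5008 OR EventCode=5012 OR EventCode=5014 OR EventCode=5004 OR EventCode=1206)
| transaction host
    startswith=eval(EventCode=1202 OR EventCode=5002 OR EventCode=5008 OR EventCode=5012 OR EventCode=5014)
    endswith=eval(EventCode=5004 OR EventCode=1206)
    maxspan=5m
| table _time, host, EventCode, duration, eventcount
```

The field named after transaction is the one whose value must be identical across all events in a transaction, which is why host is used here rather than EventCode: an error event and its recovery event never share an EventCode.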

danbutterman
Explorer

One step closer.

I swapped out the field "EventCode" with "host" and now I'm seeing transactions.

| transaction host startswith=5014 endswith=5004 maxspan=5m

Next step would be to change this so that a transaction is created only when a "5004" or "1206" is not found (within 5 minutes of the error event, e.g., 5014 or 5008).
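
One possible way to take that next step, offered as an untested sketch. keepevicted=true asks transaction to also output the transactions it would otherwise discard as incomplete, and the final search drops every transaction that contains a recovery code, leaving only error events with no recovery inside the 5-minute span. The host list and event codes are taken from the searches in this thread; the time range is left to the alert's own settings.

```spl
index=wineventlogs sourcetype="WinEventLog:DFS Replication"
    (host=host1 OR host=host2 OR host=host3)
    (EventCode=1202 OR EventCode=5002 OR EventCode=5008 OR EventCode=5012 OR EventCode=5014 OR EventCode=5004 OR EventCode=1206)
| transaction host
    startswith=eval(EventCode=1202 OR EventCode=5002 OR EventCode=5008 OR EventCode=5012 OR EventCode=5014)
    endswith=eval(EventCode=5004 OR EventCode=1206)
    maxspan=5m keepevicted=true
| search NOT EventCode=5004 NOT EventCode=1206
| table _time, host, EventCode, eventcount, closed_txn
```

Inside a transaction EventCode becomes a multivalue field, so NOT EventCode=5004 excludes any transaction in which at least one event carries that code. closed_txn is included in the table only so the complete and incomplete transactions can be told apart while testing.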

danbutterman
Explorer

Woodcock,

This is the search I perform to find 5014 and 5004 events (for example).

index=wineventlogs sourcetype="WinEventLog:DFS Replication" (host=DC01 OR host=DC02 OR host=DC03) (EventCode=1202 OR EventCode=5002 OR EventCode=5008 OR EventCode=5012 OR EventCode=5014 OR EventCode=5004 OR EventCode=1206)
| rex "Message=(?<Message>.*)"
| table _time,Message,ComputerName,EventCode,Error

The first two events returned are shown below:

1/1/18
11:00:57.000 PM
01/01/2018 11:00:57 PM
LogName=DFS Replication
SourceName=DFSR
EventCode=5004
EventType=4

1/1/18
11:00:51.000 PM
01/01/2018 11:00:51 PM
LogName=DFS Replication
SourceName=DFSR
EventCode=5014
EventType=3

When I add in the transaction command (as follows), no results are returned:

index=wineventlogs sourcetype="WinEventLog:DFS Replication" (host=DC01 OR host=DC02 OR host=DC03) (EventCode=1202 OR EventCode=5002 OR EventCode=5008 OR EventCode=5012 OR EventCode=5014 OR EventCode=5004 OR EventCode=1206)
| rex "Message=(?<Message>.*)"
| transaction EventCode startswith=5014 endswith=5004 maxspan=5m

As I understand it, the transaction command in the example above should find my 5014 event at 11:00:51 PM (the start of a new transaction, where a replication error occurred) and then the 5004 event at 11:00:57 PM (the end of the transaction). However, nothing is returned.

I seem to have taken a wrong turn.

micahkemp
Champion

This is a bit difficult to validate without sample data, but here's my untested attempt:

index=wineventlogs sourcetype="WinEventLog:DFS Replication" (host=host1 OR host=host2 OR host=host3) (EventCode=1202 OR EventCode=5002 OR EventCode=5008 OR EventCode=5012 OR EventCode=5014 OR EventCode=5004 OR EventCode=1206)
| eval failure_time=if(EventCode=1202 OR EventCode=5002 OR EventCode=5008 OR EventCode=5012 OR EventCode=5014, _time, null())
| where isnull(failure_time) OR failure_time<relative_time(now(), "-5min")
| head 1
| search EventCode=1202 OR EventCode=5002 OR EventCode=5008 OR EventCode=5012 OR EventCode=5014
| rex "Message=(?<Message>.*)"
| table _time,Message,ComputerName,EventCode,Error

It takes into consideration @somesoni2's point about only counting "failures" that are at least 5 minutes old. It is intended to find all failure and recovery events, remove failures that aren't yet 5 minutes old, keep only the most recent remaining event, and return it only if it is a failure.

somesoni2
Revered Legend

You'd need to adjust your timerange to allow those recovery events to happen. So you should be looking for error events for say -10m@m to -5m@m (only) and recovery event for -10m@m to now. This way you'd be able to correlate a recovery event with error event. With current 5 min time range, you're alerting prematurely as you may not be allowing a recovery event to be logged yet.
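
A rough, untested sketch of that windowing, using stats instead of transaction. The field names is_recovery, error_time, recovery_time, last_error, and last_recovery are made up for illustration. Error events only count if they fall between -10m@m and -5m@m, recovery events count anywhere from -10m@m to now, and a host is returned only when its latest qualifying error has no later recovery:

```spl
index=wineventlogs sourcetype="WinEventLog:DFS Replication" earliest=-10m@m latest=now
    (host=host1 OR host=host2 OR host=host3)
    (EventCode=1202 OR EventCode=5002 OR EventCode=5008 OR EventCode=5012 OR EventCode=5014 OR EventCode=5004 OR EventCode=1206)
| eval is_recovery=if(EventCode=5004 OR EventCode=1206, 1, 0)
| eval error_time=if(is_recovery=0 AND _time < relative_time(now(), "-5m@m"), _time, null())
| eval recovery_time=if(is_recovery=1, _time, null())
| stats max(error_time) AS last_error max(recovery_time) AS last_recovery BY host
| where isnotnull(last_error) AND (isnull(last_recovery) OR last_recovery < last_error)
| eval last_error=strftime(last_error, "%Y-%m-%d %H:%M:%S")
| table host, last_error
```

This is a simplification: it compares only the most recent error and the most recent recovery per host, so it does not check that the recovery arrived within exactly five minutes of the error. If the alert is scheduled every five minutes, the -10m@m to -5m@m error window should mean each error is evaluated once.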
