Alerting

Loss of Feeds Alert - Need Some General Advice

SplunkLunk
Path Finder

Greetings,

Right now I have the following search to report on any hosts that haven't talked in 30 minutes (the csv file lists hosts that are in maintenance or being decommissioned):

|metadata type=hosts
|search NOT [|inputlookup DecomMaint.csv]
|where recentTime < now() - 1800
|eval lastSeen = strftime(recentTime, "%F %T")
|fields + host lastSeen

This works fine, but some hosts generate events at a low rate, so they show up on the report as false positives. Conversely, I have some hosts (domain controllers, for example) that generate events on a pretty constant basis, and I would want to know about those well before 30 minutes of no feeds. So I want to capture three tiers of loss of feeds: no events for five minutes, 30 minutes, and 60 minutes.

I'm trying to develop one alert that would catch all three scenarios rather than have three separate alerts. Is there an efficient way to do it via lookup tables where I would maintain a list of low, medium, and high threshold servers? I would imagine I could create three separate csv files and reference them accordingly in the search. Does this make sense or is there a better way to do it?
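To make the three-csv idea concrete (untested), assuming lookups named something like FiveMinHosts.csv, ThirtyMinHosts.csv, and SixtyMinHosts.csv (placeholder names), each with host and threshold (in seconds) columns, I'm picturing something like:

|metadata type=hosts
|search NOT [|inputlookup DecomMaint.csv]
|lookup FiveMinHosts.csv host OUTPUT threshold as t_low
|lookup ThirtyMinHosts.csv host OUTPUT threshold as t_med
|lookup SixtyMinHosts.csv host OUTPUT threshold as t_high
|eval threshold=coalesce(t_low, t_med, t_high)
|where (now()-recentTime) > threshold
|eval lastSeen = strftime(recentTime, "%F %T")
|fields + host lastSeen threshold

Though any host missing from all three lookups would end up with a null threshold and never alert, so every host would need to be listed somewhere.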

Thanks for any advice.


SplunkLunk
Path Finder

If I only have a handful of servers that will be outliers, is there a way I can use a csv file for just those hosts, have the alert check those against their specific thresholds, and treat everything else as 30 minutes? It would be much easier to maintain a host list if only a few hosts deviate from the default expected check-in time. For example, I have ~300 hosts. Four of those would need to check in every five minutes, two would need to check in every 60 minutes, and the rest need to check in every 30 minutes. Seems like I could do an if/then sort of search, something like the sketch below.
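Something like this is what I'm imagining (untested), assuming a single ExceptionHosts.csv (placeholder name) with host and threshold (seconds) columns for just those few outliers:

|metadata type=hosts
|search NOT [|inputlookup DecomMaint.csv]
|lookup ExceptionHosts.csv host OUTPUT threshold
|eval threshold=coalesce(threshold, 1800)
|where (now()-recentTime) > threshold
|eval lastSeen = strftime(recentTime, "%F %T")
|fields + host lastSeen threshold

Here every host defaults to 1800 seconds (30 minutes) unless it appears in the exception lookup.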


SplunkLunk
Path Finder

Thanks for the advice. I have hundreds of hosts, so I have to think about the best way to do it. The majority of them will be at 30 minutes, with very few at five or 60 minutes. So even though I have a lot of hosts, I might try the first method, since I can cut and paste a lot of the info. Once the csv file is created, it shouldn't be that hard to maintain. I'll try a couple of things out and then "accept" whichever one works best.


lguinn2
Legend

If you want, you could modify this search to include hosts without an entry in the lookup table - see below. This method still allows you to explicitly set some hosts for no monitoring as well.

 | tstats last(_time) as lastSeen where index=* by host
 | append [ inputlookup host_settings.csv ]
 | stats last(lastSeen) as lastSeen last(monitor) as monitor 
         last(threshhold_minutes) as threshhold_minutes by host
 | where isnull(monitor) OR monitor="Y"
 | eval threshhold_minutes=if(isnull(threshhold_minutes),30,threshhold_minutes)
 | eval status=case(isnull(lastSeen),"MISSING",
                    lastSeen >= now()-(threshhold_minutes*60),"okay",
                    1==1,"MISSING")
 | eval lastSeen = strftime(lastSeen,"%x %X")
 | table host lastSeen status threshhold_minutes

This is a pretty good idea in general (I hadn't thought of it before) - in case you have some new hosts in your environment but forget to add them to the CSV, they will still be monitored with the default threshold of 30 minutes if they don't appear in host_settings.csv.


SplunkLunk
Path Finder

Greetings,

I tried this search and I get the error:

"Error in 'TsidxStats': _time aggregations are not yet supported except for min/max"


lguinn2
Legend

You could certainly set up a lookup file for your hosts, perhaps something like this:

host_settings.csv

host,monitor,threshhold_minutes
ahost,Y,30
bhost,Y,10
hostc,N,0

Your search:

| tstats last(_time) as lastSeen where index=* by host
| append [ inputlookup host_settings.csv ]
| stats last(lastSeen) as lastSeen last(monitor) as monitor 
        last(threshhold_minutes) as threshhold_minutes by host
| where monitor="Y"
| eval status=case(isnull(lastSeen),"MISSING",
                   lastSeen >= now()-(threshhold_minutes*60),"okay",
                   1==1,"MISSING")
| eval lastSeen = strftime(lastSeen,"%x %X")
| table host lastSeen status threshhold_minutes

I hope this gives you a good starting point. Why am I appending instead of searching with the lookup table? For my purposes, I want to create a list of all the hosts, whether they have had data within the search time period or not. (Set your search time to approximately the longest time you want to monitor, in your example: 60 minutes.)

Why use tstats instead of metadata? tstats is very fast, almost as fast as metadata. The metadata command, however, can return partial results in larger environments. So if you want better accuracy in this case, use tstats. If not, then just change the tstats to metadata and proceed...
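If you go that route, swapping out the first line would look something like this (untested):

| metadata type=hosts
| rename recentTime as lastSeen
| fields host lastSeen
| append [ inputlookup host_settings.csv ]

with the rest of the search left as-is, since the later stats/eval steps only need host, lastSeen, and the fields appended from the lookup.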

somesoni2
SplunkTrust

If the number of hosts is limited, or there is a pattern in the names that identifies which hosts have a low/medium/high frequency of events, then you can create a lookup that lists every host (or host name pattern) with its corresponding threshold. You can then use the lookup command to pull that threshold value into the search results and update the where condition to check against the threshold field. For example, if you create a lookup with host and threshold columns, your search could look like this:

Lookup - host_event_threshold.csv

host, threshold
host1,1800
host2,3600
host3,300
...

Alert search

|metadata type=hosts 
|search NOT [|inputlookup DecomMaint.csv]
| lookup host_event_threshold.csv host OUTPUT threshold
|where (now()-recentTime)>threshold
|eval lastSeen = strftime(recentTime, "%F %T") 
|fields + host lastSeen
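If you go the host-name-pattern route, the lookup would need to be defined as a wildcard lookup in transforms.conf, roughly like this (stanza and file names are just examples):

[host_event_threshold]
filename = host_event_threshold.csv
match_type = WILDCARD(host)

The search would then reference the lookup definition name (| lookup host_event_threshold host OUTPUT threshold) rather than the raw csv file, and the host column could contain patterns such as dc*.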

DalJeanis
SplunkTrust

I'd kill line 2 (the NOT [|inputlookup DecomMaint.csv] subsearch) and keep all of that "host's acceptable delay" information in the same file. No sense maintaining two CSVs that both tell you a version of the same thing.

This strategy has the advantage that you can just remove decommissioned servers from the file and they will cease reporting (since their threshold is null). For servers under maintenance windows, you can just use the duration of the scheduled window as the threshold.
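So the combined lookup might look something like this (hypothetical hosts and values):

host,threshold
dc01,300
appserver07,1800
fileserver02,3600
patchwindow-host,14400

Decommissioned hosts simply aren't listed, and the 14400 on the last entry would cover a four-hour maintenance window.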
