Right now I have the following search that reports on any hosts that haven't talked in 30 minutes (the CSV file lists hosts that are in maintenance or being decommissioned):
|search NOT [|inputlookup DecomMaint.csv]
|where recentTime < now() - 1800
|eval lastSeen = strftime(recentTime, "%F %T")
|fields + host lastSeen
This works fine, but I have some hosts that generate events at a low rate; they show up on the report as false positives. Conversely, I have some hosts (domain controllers, for example) that generate events on a fairly constant basis, and for those I would want to know before 30 minutes of no feeds. So I want to capture three tiers of loss of feeds: no events for five minutes, 30 minutes, and 60 minutes.
I'm trying to develop one alert that catches all three scenarios rather than maintaining three separate alerts. Is there an efficient way to do it via lookup tables, where I would maintain lists of low-, medium-, and high-threshold servers? I imagine I could create three separate CSV files and reference them accordingly in the search. Does this make sense, or is there a better way to do it?
Thanks for any advice.
If I only have a handful of servers that will be outliers, is there a way I can use a CSV file for just those hosts, have the alert check those against their specific thresholds, and have everything else default to 30 minutes? It would be much easier to maintain a host list if only a few hosts deviate from the default expected check-in time. For example, I have ~300 hosts. Four of those would need to check in every five minutes. Two of them would need to check in every 60 minutes. The rest need to check in every 30 minutes. Seems like I could do an if/then sort of search.
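Something like this is what I'm picturing, building on my original search (exceptions.csv is a hypothetical file with host and threshold columns, threshold in seconds; I'm assuming the base search is the metadata command, since the original uses recentTime):

| metadata type=hosts
| search NOT [|inputlookup DecomMaint.csv]
| lookup exceptions.csv host OUTPUT threshold
| fillnull value=1800 threshold
| where (now() - recentTime) > threshold
| eval lastSeen = strftime(recentTime, "%F %T")
| fields + host lastSeen

The fillnull gives every host not listed in exceptions.csv the default 30-minute (1800-second) threshold, so only the outliers need entries in the file.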
Thanks for the advice. I have hundreds of hosts, so I have to think about the best way to do it. The majority of them will be 30 minutes, with very few at five minutes or 60 minutes. So even though I have a lot of hosts, I might try the first method, as I can cut and paste a lot of the info. Once the CSV file is created, it shouldn't be that hard to maintain. I'll try a couple of things out and then "accept" whichever one works best.
If you want, you could modify this search to include hosts without an entry in the lookup table - see below. This method still allows you to explicitly set some hosts for no monitoring as well.
| tstats last(_time) as lastSeen where index=* by host
| append [ inputlookup host_settings.csv ]
| stats last(lastSeen) as lastSeen last(monitor) as monitor last(threshhold_minutes) as threshhold_minutes by host
| where isnull(monitor) OR monitor="Y"
| eval threshhold_minutes=if(isnull(threshhold_minutes), 30, threshhold_minutes)
| eval status=case(isnull(lastSeen),"MISSING", lastSeen >= now()-(threshhold_minutes*60),"okay", 1==1,"MISSING")
| eval lastSeen = strftime(lastSeen,"%x %X")
| table host lastSeen status threshhold_minutes
This is a pretty good idea in general (I hadn't thought of it before) - in case you have some new hosts in your environment but forget to add them to the CSV, they will still be monitored with the default threshold of 30 minutes as long as they don't appear in host_settings.csv.
You could certainly set up a lookup file for your hosts, perhaps something like this:
host,monitor,threshhold_minutes
ahost,Y,30
bhost,Y,10
hostc,N,0
| tstats last(_time) as lastSeen where index=* by host
| append [ inputlookup host_settings.csv ]
| stats last(lastSeen) as lastSeen last(monitor) as monitor last(threshhold_minutes) as threshhold_minutes by host
| where monitor="Y"
| eval status=case(isnull(lastSeen),"MISSING", lastSeen >= now()-(threshhold_minutes*60),"okay", 1==1,"MISSING")
| eval lastSeen = strftime(lastSeen,"%x %X")
| table host lastSeen status threshhold_minutes
I hope this gives you a good starting point. Why am I appending instead of searching with the lookup table? For my purposes, I want to create a list of all the hosts, whether they have had data within the search time period or not. (Set your search time to approximately the longest time you want to monitor, in your example: 60 minutes.)
Why use tstats instead of metadata? tstats is very fast, almost as fast as metadata. The metadata command, however, can return partial results in larger environments. So if you want better accuracy in this case, use tstats. If not, then just change the tstats to metadata and proceed...
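If you do go the metadata route, only the first pipe of the search changes; the rest stays the same. An untested sketch (note that metadata reports lastTime rather than last(_time), so you'd rename it to keep the downstream field names working):

| metadata type=hosts index=*
| rename lastTime as lastSeen
| fields host lastSeen
| append [ inputlookup host_settings.csv ]
| stats last(lastSeen) as lastSeen last(monitor) as monitor last(threshhold_minutes) as threshhold_minutes by host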
If the number of hosts is limited, OR there is a pattern in the names that defines which hosts have a low/medium/high frequency of events, then you can create a lookup listing every host (or host name pattern) with its corresponding threshold. Then use the lookup command to pull that threshold value into the search results and update the where condition to compare against the threshold field. E.g., if you create a lookup with host and threshold columns, your search could look like this:
Lookup - hosteventthreshold.csv
host,threshold
host1,1800
host2,3600
host3,300
...
| metadata type=hosts
| search NOT [|inputlookup DecomMaint.csv]
| lookup hosteventthreshold.csv host OUTPUT threshold
| where (now() - recentTime) > threshold
| eval lastSeen = strftime(recentTime, "%F %T")
| fields + host lastSeen
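One caveat: the lookup command matches exact values by default. If you want the CSV to hold host name patterns (e.g. dc*) rather than exact names, you'd need a lookup definition with wildcard matching - as far as I know, that means something like this in transforms.conf (stanza name is just an example):

[hosteventthreshold]
filename = hosteventthreshold.csv
match_type = WILDCARD(host)
max_matches = 1

and then referencing the definition name instead of the file: | lookup hosteventthreshold host OUTPUT threshold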
I'd kill the |search NOT [|inputlookup DecomMaint.csv] line and keep all of the "host's acceptable delay" information in the same file. No sense maintaining two CSVs that both tell you a version of the same thing.
This strategy has the advantage that you can just remove decommissioned servers from the file and they will cease reporting (since their threshold is null). For servers in maintenance windows, you can just set the threshold to the duration of the scheduled window.
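For example, the combined file might look like this (host names and thresholds are purely illustrative; threshold in seconds):

host,threshold
dc1,300
dc2,300
fileserver1,1800
backupserver,3600
patching-host,14400

Here patching-host is in a four-hour maintenance window, so it won't alert until 14400 seconds of silence, and any decommissioned host is simply absent from the file.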