Alerting

Alert When Any of Multiple Sources Don't Send At Least One Event

SplunkLunk
Path Finder

Good afternoon,

I want to create a Loss of Feeds alert for multiple database connections. Is there a way to create an alert that emails me if any of the specified sources fails to generate at least one event during a specified time window? I would be using a lookup table to define the sources I want Splunk to look at.

Currently I have sixteen sources I want it to look at, and I'd like to be alerted if any one of them doesn't have an event within "X" minutes. Thanks for any help you can provide.

1 Solution

elliotproebstel
Champion

We have a lookup file that contains host, sourcetype, state, max_delay, severity, and admin. All hosts start with state="up". The max_delay field allows us to have hosts with varying expected delays between events. (Some hosts are expected to log events every 10 seconds, and others may go 3 hours between events in normal operations.) The severity field is carried through into the alert email, and the admin field defines who gets notified when a given host goes down - again, this is for modularity. Every 10 minutes, we run the following search:

| tstats latest(_time) as latest where index=* OR index=_* by host, sourcetype 
| search 
 [ |inputlookup critical_hosts 
 | fields host sourcetype ] 
| lookup critical_hosts host sourcetype OUTPUT max_delay, severity, admin, state, _key 
| eval current_delay=now()-latest
| where max_delay<current_delay
| search state="up"
| eval state="down"
| outputlookup critical_hosts append=true key_field=_key
| fillnull value="" admin 
| eval admin="default_email@our.enterprise,".admin 
| convert ctime(latest) 
| map search="| sendemail from=\"splunk-outage@our.enterprise\" to=\"$admin$\" subject=\"Splunk Alert: Critical Host Not Logging: $host$\" message=\"Splunk last heard from critical host $host$ with sourcetype $sourcetype$ and severity $severity$ at $latest$. The maximum expected delay from this host is $max_delay$ seconds, while the current delay is $current_delay$ seconds. Please check the host for possible failure. If you believe you received this message in error, please email support@our.enterprise. Thank you.\" server=\"ip.address\"" maxsearches=1000
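
For reference, a lookup with this schema might look something like the following (the hosts, sourcetypes, and addresses here are placeholders; max_delay is in seconds):

host        sourcetype         state  max_delay  severity  admin
db01        mssql:errorlog     up     30         high      dba-team@our.enterprise
web03       access_combined    up     600        medium    web-team@our.enterprise
archive01   archive:audit      up     10800      low       storage-team@our.enterprise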

We have a similar search that runs every 10 minutes looking for hosts that have resumed reporting. By updating the lookup file with changes in state, we can send alerts only when a host appears to go down and then when it appears to come back up - not every 10 minutes while it remains down.

SplunkLunk
Path Finder

Greetings,

Trying to see if this will meet my needs. Do you need the quotes in the lookup table for state? So, in the CSV lookup table, would the value be "up" or up for a host? Thanks. Your solution is a little complicated for me, but I'll give it a shot.

Also, for the delay, do you use seconds for everything, or can you define it using various time formats (e.g., 30m, 30s, 30h, etc.)? Thanks.

SplunkLunk
Path Finder

I've accepted this answer so you get credit, even though I can't do everything here (there's an issue with the way the admins have set up our Splunk instance). Thanks though. I was able to generate an alert using some of the info in your response.

elliotproebstel
Champion

Yeah, it seems like a heavy lift at first, but that's what makes it really modular and easy to update. 🙂

We don't use quotes in the lookup table. So in the table, you'll find up and not "up". And we use seconds for all of our delay settings. You could potentially use other time formats, but we didn't bother to. It was easy enough to convert everything to seconds up front and then have a standardized numeric format that we were using in the scheduled searches.
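
If you did want to store human-readable values like 30s/30m/3h instead, one way to normalize them to seconds at search time, right after the lookup, is something along these lines (max_delay_raw is a hypothetical field name for the human-readable value):

| eval delay_num=tonumber(replace(max_delay_raw, "[^0-9]", "")), 
       delay_unit=lower(replace(max_delay_raw, "[0-9 ]", "")), 
       max_delay=case(delay_unit=="s", delay_num, 
                      delay_unit=="m", delay_num*60, 
                      delay_unit=="h", delay_num*3600, 
                      true(), delay_num)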

elliotproebstel
Champion

Also - I can say nice things about this alerting approach because I didn't design it. I inherited it, and I've found it a very clean solution that has been really easy to keep current. I'm not patting myself on the back here, since it wasn't my solution! 🙂

SplunkLunk
Path Finder

Also, what is the corresponding search you use to tell when systems are back up?

elliotproebstel
Champion

The corresponding logic is exactly the same, except lines 7-9 above are replaced with:

| search state="down"
| where max_delay>=current_delay
| eval state="up"

And the wording of the email is changed to reflect that the host is now up. Pretty straightforward!
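
Put together, a recovery search along those lines would look roughly like this (same lookup and email structure as above, with the subject and message reworded for recovery):

| tstats latest(_time) as latest where index=* OR index=_* by host, sourcetype 
| search 
 [ | inputlookup critical_hosts 
 | fields host sourcetype ] 
| lookup critical_hosts host sourcetype OUTPUT max_delay, severity, admin, state, _key 
| eval current_delay=now()-latest
| search state="down"
| where max_delay>=current_delay
| eval state="up"
| outputlookup critical_hosts append=true key_field=_key
| fillnull value="" admin 
| eval admin="default_email@our.enterprise,".admin 
| convert ctime(latest) 
| map search="| sendemail from=\"splunk-outage@our.enterprise\" to=\"$admin$\" subject=\"Splunk Alert: Critical Host Logging Again: $host$\" message=\"Splunk is once again receiving events from critical host $host$ with sourcetype $sourcetype$ and severity $severity$. The latest event arrived at $latest$. If you believe you received this message in error, please email support@our.enterprise. Thank you.\" server=\"ip.address\"" maxsearches=1000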

SplunkLunk
Path Finder

So thanks again for the help. I'll be bugging you some more since you're the only one that seems to have done what I'm asking about. I'm getting an error when running my search:

Error in 'lookup' command: Could not find all of the specified destination fields in the lookup table.

You mention the _key field in the lookup portion of your search, but you don't mention it in your lookup file description. Is there a field in your lookup file called _key? If so, what is the default value?

It seems like the search should create the key so it can update the lookup file via outputlookup, but I don't know why I'm getting the error then.

elliotproebstel
Champion

The _key field is present if you are using a kvstore. I should have mentioned that explicitly, sorry. A feature you get from using a kvstore is the ability to update a single row in the lookup table like this:

| outputlookup critical_hosts append=true key_field=_key

It will find the record with the matching _key value and update that row.
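
For anyone who does have access to configure it: a kvstore-backed lookup is defined with a collections.conf stanza plus a transforms.conf lookup definition on the search head. A minimal sketch, assuming the lookup and collection are both named critical_hosts as above (stanza placement and permissions will depend on your app):

# collections.conf
[critical_hosts]
field.host       = string
field.sourcetype = string
field.state      = string
field.max_delay  = number
field.severity   = string
field.admin      = string

# transforms.conf
[critical_hosts]
external_type = kvstore
collection    = critical_hosts
fields_list   = _key,host,sourcetype,state,max_delay,severity,admin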

SplunkLunk
Path Finder

Thanks. That was the missing piece. I don't know if our Splunk is configured to use kvstore. The central admins just grant access to various people in our organization. I'm assuming it's something I don't have access to configure.

elliotproebstel
Champion

If you can't use a kvstore for this, it's still possible to implement using CSV files. Basically, instead of using the clean outputlookup append method, you'd need to read in the whole file, update just the rows that require a state change, and output the whole file again. Back up your CSV before you start testing that - but it will work, assuming the list isn't too huge.
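
A rough sketch of that CSV approach for the "down" direction, assuming the lookup is a plain file called critical_hosts.csv with the same columns: it reads the whole file, flips state only on rows that have exceeded their max_delay, and writes the whole file back out. The notification step would then be layered on top, as in the kvstore version.

| inputlookup critical_hosts.csv
| join type=left host, sourcetype 
    [ | tstats latest(_time) as latest where index=* OR index=_* by host, sourcetype ]
| eval current_delay=now()-latest
| eval state=if(state=="up" AND (isnull(latest) OR current_delay>max_delay), "down", state)
| table host, sourcetype, state, max_delay, severity, admin
| outputlookup critical_hosts.csv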
