Advanced alerting: How to uniquely report the most recent events while allowing for indexing delays?

Super Champion

I'm looking for best practices for setting up savedsearch email alerting when the alerting has the following requirements:

  1. Alert should be sent as soon as possible. (Recent time window; runs fairly frequently)
  2. There can be indexing delays (possibly due to Splunk, but more likely due to a system outage, such as failing hardware.)
  3. The alert should only report each event once. (We don't want to duplicate the reporting of an event, nor miss reporting it. This is not a 100% hard requirement, although we prefer to err on the side of duplication. We are looking for a best-effort false-positive avoidance solution.)

The combination of these three requirements makes searching more complicated. Without #1, simply delaying the search for a few hours would be sufficient. Without #2, you could just run the search every 15 minutes over the last 15 minute window. Without #3, you could just run the report for a larger time window every 15 minutes and ignore the duplicate alerts. But getting all three is tricky.

I'm thinking that there should be a way to accomplish this in Splunk 4.0+ since we now have not only the event's timestamp (_time) but the index time (_indextime) too. So it seems possible to create a savedsearch that meets all these requirements. Has anyone had any experience with this?

For the purpose of an example, say I have a search that runs every 15 minutes and looks back over a 2-hour timespan. (The assumption being that events delayed by more than 2 hours can be safely ignored.) We want to send all the events we have as soon as possible, so the latest time is snapped only to the latest whole minute to make the rounding easier.

[Very bad OS failure in progress]
counttype = number of events
cron_schedule = */15 * * * *
dispatch.earliest_time = -120m@m
dispatch.latest_time = @m
displayview = flashtimeline
enableSched = 1
quantity = 0
relation = greater than
request.ui_dispatch_view = flashtimeline
search = "Process startup failed" OR "CreateThread failed" | stats count as events, values(sourcetype) as sourcetypes, values(source) as log_files by host
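One idea for squaring all three requirements (a sketch only, not something I've tested): keep the wide 2-hour event-time window, but filter on _indextime so each run reports only the events that arrived since the previous run. Assuming 4.x, where _indextime is usable in eval, the search might become something like:

```
"Process startup failed" OR "CreateThread failed"
| eval index_age = now() - _indextime
| where index_age >= 0 AND index_age < 900
| stats count as events, values(sourcetype) as sourcetypes, values(source) as log_files by host
```

Since the search runs every 15 minutes (900 seconds), each event's index time should fall into exactly one run's window. The caveat is that now() is the dispatch time and cron scheduling has jitter, so consecutive windows may overlap or gap slightly; snapping the comparison to whole minutes (e.g. via relative_time(now(), "-15m@m")) would tighten this.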

One factor I don't know how to account for is this: What is the expected delay between the time a new event is written to the index (which I'm assuming is when _indextime is set) and when that same event becomes visible to a search? Is there any way to calculate this?
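A partial answer to my own question: the gap between _time and _indextime is directly searchable, which at least measures the event-to-index latency per source (a sketch):

```
* | eval latency = _indextime - _time | stats avg(latency) max(latency) by sourcetype
```

As far as I know, the additional delay between indexing and search visibility isn't exposed as a field, so this only bounds part of the problem.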

My ultimate goal would be to roll up some of this logic into a macro for reusability, so any thoughts on that would be welcomed too.

Path Finder

We faced similar requirements to send alerts via SNMP traps when certain errors were logged. Initially we tried a scheduled search that executed Splunk's SNMP script if any events were found. We encountered the issues you described.

Our solution was to write a Python script that searched Splunk (using the CLI), analyzed the search results, and sent the SNMP traps. Splunk runs our script as a configured scripted input; the script logs its activity to its output, which Splunk indexes, so we can later show that we really did send the SNMP traps.

Like you, we observed some latency in Splunk's indexing of application log files. We also noted that Splunk's scripted inputs do not run at intervals exact to the second.

To solve these timing issues, we explicitly scoped the time range for each search. The end time for a search was the current time less the latency. The start time was usually the end time of the previous search (stored in a file between runs), but it was kept within a maximum search span (typically an hour) of the end time, and never after the end time. The time range was then formatted for Splunk searches.

import time

def timeRange(previousEndTime, indexLatencySeconds, maxSpanSeconds):
    # End at "now" minus the expected indexing latency.
    endTime = int(time.time()) - indexLatencySeconds
    # Start where the previous search ended, capped at maxSpanSeconds back.
    startTime = min(endTime, max(previousEndTime, endTime - maxSpanSeconds))
    return "starttimeu=%d endtimeu=%d" % (startTime, endTime)
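The persistence of the previous end time between runs can be sketched like this (the function names and state-file format here are illustrative, not from our actual script):

```python
def loadPreviousEndTime(path, default):
    # Fall back to a default window start if the state file is missing or corrupt.
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (IOError, ValueError):
        return default

def savePreviousEndTime(path, endTime):
    # Record this run's end time so the next run starts where we left off.
    with open(path, 'w') as f:
        f.write(str(endTime))
```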

The Python script used Splunk CLI to run a search. The query formatted the results as pipe delimited fields. Here is a typical query; our script adds the time constraints to the beginning.

MessageType="FATAL" | fields + host,problem | stats count(problem) as count by host,problem | strcat host "|" count "|" problem formatted | fields + formatted
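For illustration, assembling the CLI invocation might look like this (a sketch; `buildCommand` and the placeholder credentials are my assumptions here, though `splunk search "<query>" -auth user:pass` is the standard CLI form):

```python
def buildCommand(query, timeRange):
    # Prepend the time constraints (e.g. "starttimeu=... endtimeu=...")
    # to the query, as described above; credentials are placeholders.
    return 'splunk search "%s %s" -auth admin:changeme' % (timeRange, query)
```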

Python runs the complete Splunk search command and collects its outputs as lists of search results and errors:

from subprocess import Popen, PIPE

def runSearch(command):
    proc = Popen(command, shell=True, stdin=PIPE, stdout=PIPE,
                 stderr=PIPE, close_fds=True)
    proc.stdin.close()
    searchResults = proc.stdout.readlines()
    proc.stdout.close()
    errorResults = proc.stderr.readlines()
    proc.stderr.close()
    return (searchResults, errorResults)

The first two lines of the search results contain headers, and the final line is dropped as well. The remaining lines can easily be transformed into a list of events, each a list of fields:

return [line.rstrip().split('|') for line in searchResults[2:-1]]
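Wrapped as a function and fed hypothetical CLI output (the sample lines below are illustrative, not real Splunk output), the transformation looks like this:

```python
def parseResults(searchResults):
    # Skip the two header lines and the trailing line of the CLI output,
    # then split each pipe-delimited row into its fields.
    return [line.rstrip().split('|') for line in searchResults[2:-1]]

# Hypothetical CLI output: two header lines, two result rows, one trailer.
sample = [
    "formatted\n",
    "---------\n",
    "db02|1|Fatal error\n",
    "web01|3|CreateThread failed\n",
    "2 results\n",
]
```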

In our script, we send an SNMP alert for each resulting event. You can send an email alert.

Splunk indexes the script output, which contains log messages. Early on, the team receiving the SNMP traps claimed that we weren't sending them. We added logging throughout our scripts to aid testing and to record their activity in production. Here are the format string and arguments from the logging call in the code that sends the SNMP trap:

('SNMP trap: problemHost=%s count=%s searchName="%s" problem="%s"', host, count, searchName, problem)

The logger adds a timestamp, severity level, etc. in the spirit of Log4j. Here is an example log line generated by the logging statement above:

<UOW-U-WSE-MEL-APPFATAL-FE9301> 2010-05-27 15:42:57,981 [main] INFO  SnmpTrapDetails - SNMP trap: problemHost=db02 count=1 searchName="Application Error" problem="Fatal: Attempting to action a nonexistent TCA user PROV100AA to/from ROLE NEEDSBUILDER"

Super Champion

I think you'll be happy with the improvements in 4.0 and 4.1. The overlaps/gaps issue shouldn't exist anymore. The problem I'm having is that I want that same precision, but I don't want to wait to report on events 5 to 10 minutes old; I want events within the last minute. Thanks again for your input and feedback.


Path Finder

In our environment, we have high confidence that events will be indexed at least n minutes before we run the query (where n is typically 5 or 10 minutes). If old events arrive later than allowed, we won't alert on them. We will rethink how to exploit _indextime in 4.1 after we complete our upgrades from 3.4. We also will test whether scheduled 4.1 searches yield time range overlaps or gaps.


Path Finder

We started with a scheduled saved search that alerted via Splunk's SNMP script, written in Perl. The other team wanted more details in the SNMP traps, so we modified the SNMP script to search Splunk and extract data. We ran the script via Linux cron. That seemed more direct than Splunk's mechanism of writing results to a file and passing the file path to an alert script. As the variety of searches and SNMP traps grew, we rewrote the scripts in Python to generalize many functions as Python classes. We ran it as a scripted input to capture the script's logs. Now we can move to the Python Splunk API.


Super Champion

Out of curiosity, any reason why you went with an external call to splunk search over using Splunk's Python search APIs? Also, in terms of timing (scoping your search results), I'm looking at startTime and wondering how you deal with out-of-order events. Say you have events A, B, C occurring at times 1, 2, 3, and event B comes from a faster system and gets indexed first. Events A and C (from the same slow machine) don't come in until after your search ran. So your previousEndTime=2; next time, event C is picked up, but A is never seen. Am I missing something?
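To make my scenario concrete, here's a toy model (illustrative only, not Splunk code) of why windowing on event time can drop A, whereas windowing on index time would not:

```python
# (name, event_time, index_time): B's fast host indexes first; A and C's
# slow host doesn't deliver until after the first search (at time 2) ran.
events = [("A", 1, 3), ("B", 2, 2), ("C", 3, 3)]

def searchEventTime(start, end, now):
    # Event-time window; only events already indexed by 'now' are visible.
    return [n for (n, t, it) in events if start < t <= end and it <= now]

def searchIndexTime(start, end):
    # Index-time window: each event lands in exactly one run's window.
    return [n for (n, t, it) in events if start < it <= end]
```

Two consecutive runs with event-time windows, (0, 2] at now=2 and (2, 4] at now=4, return ["B"] and then ["C"]; A is never reported. The same two runs windowed on index time return ["B"] and then ["A", "C"], so nothing is missed or duplicated.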


Super Champion

Thanks for a very detailed and thoughtful response. However, in our case, I would really like to keep as much as possible inside of Splunk. Splunk 4.0 provided a great improvement in being able to select more consistent time ranges, and 4.1 seems to cover every scenario I can think of. Using _indextime should cover the rest, so it really seems like there should be a way to accomplish this all internally. That said, I do find your approach quite interesting; perhaps I'll be able to borrow this idea in other situations. Thanks again. 😉
