
I want to continuously rerun an alert that fails until it is successful

codedtech
Path Finder

I have 20+ alerts that give my team telemetry on our ESX and storage clusters. We collect the metrics from a data bus via API calls and then send them into Splunk for analysis. Sometimes, when the team that manages the data bus has an issue, my reports don't trigger unless I run them manually. Is there a way to create a query that will keep rerunning every hour until the alert completes with results?


Richfez
SplunkTrust

So you don't get alerts because data stopped coming in?

Maybe it's easiest if you rethink your needs - you don't need the alert to run continuously; what you need is for someone to know that things are broken.

Then they'll know why they aren't getting the alerts they expect: the incoming data is broken.

For that task, there are a zillion ways to do this. Searching "splunk alert data stopped coming in" in Google gives you like 400 patterns to try in Community/Answers, and even a blog post from Splunk specifically on this:

https://www.splunk.com/en_us/blog/tips-and-tricks/how-to-determine-when-a-host-stops-sending-logs-to...

Also, if it's always supposed to send results, it shouldn't be set up as an alert at all, but rather as a saved report run on a schedule.
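For what it's worth, here's a minimal sketch of that kind of "data stopped arriving" alert - the index name your_metrics_index and the one-hour threshold are placeholders, so adjust both for your environment:

| tstats latest(_time) as last_seen where index=your_metrics_index by host
| eval minutes_since = round((now() - last_seen) / 60)
| where minutes_since > 60

Schedule that hourly as its own alert and it fires whenever a host hasn't sent anything in over an hour, which covers the "someone needs to know it's broken" part.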

codedtech
Path Finder

@Richfez We have an alert in place for when something breaks, and our API calls are configured to go back and collect the data that was missed during the outage. What I'm trying to do is have these reports rerun and send out once the delayed data hits my index. I'm just lazy and want to avoid manually rerunning each report when this happens.


Richfez
SplunkTrust

AHA!  That's a different question which has a possibly far simpler answer!

You may want to switch to using the internal field

_indextime 

instead of _time.

You can read up on it here - https://docs.splunk.com/Documentation/Splunk/8.0.6/Knowledge/Usedefaultfields - and they're serious about the whole "it won't show up". It won't appear even if you try to work with it directly; you really do have to use eval to copy it into a new field, or you'll find it disappearing in the middle of your SPL sometimes. It's spooky.

One use of it:

index=_internal
| eval indexed_time = _indextime
| eval indexing_delay = indexed_time - _time
| timechart max(indexing_delay) by sourcetype

And if you want, you can even eval it into the _time field so it "just works".

index=_internal
| eval _time = _indextime
| timechart count by sourcetype

The above will chart your count of events by sourcetype, but using the time it was *indexed* as the time axis, not the time assigned to the event.  It may not change your charts much - if your indexing lag is pretty consistent, it just shifts everything by 4 seconds or whatever, which timechart will likely bin together anyway.

But if your indexing lag is more variable, the above timechart will show a different sort of graph than the regular one based on _time itself. That chart may be interesting to see on its own - you might spot all your "when the API broke" moments in it. Take a look.

But in any case, if you convert your reports to use _indextime - there's some fiddling around and testing you'll have to do; Google for "splunk indextime" or similar - you may get results you like better when there are delays.
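As a rough illustration of that conversion - just a sketch with a placeholder index name, and note that the _time window is deliberately wider than the _indextime window so late-arriving events still get picked up:

index=your_metrics_index earliest=-24h
| eval indexed_time = _indextime
| where indexed_time >= relative_time(now(), "-1h@h") AND indexed_time < relative_time(now(), "@h")
| stats count by host

Scheduled hourly, that version keys each run off when events were indexed rather than when they happened, so a delayed batch simply shows up in the next run instead of being missed.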

Happy Splunking!

-Rich
