
I want to continuously rerun an alert that fails until it is successful

codedtech
Path Finder

I have 20+ alerts that give my team telemetry on our ESX and storage clusters. We collect the metrics from a data bus via API calls and then send them into Splunk for analysis. Sometimes, when the team that manages the data bus has an issue, my reports don't trigger unless I run them manually. Is there a way to create a query that will keep rerunning every hour until the alert completes with results?


Richfez
SplunkTrust

So you don't get alerts because data stopped coming in?

Maybe it's easiest if you rethink your needs - you don't need the alert to run continuously; what you need is for someone to know that things are broken.

Then they'll know why they aren't getting the alerts they expect: the incoming data is broken.

For that task, there are a zillion ways to do this. Searching "splunk alert data stopped coming in" in Google gives you like 400 patterns to try in Community/Answers, and even a blog post from Splunk specifically on this:

https://www.splunk.com/en_us/blog/tips-and-tricks/how-to-determine-when-a-host-stops-sending-logs-to...

Also, if it's always supposed to send results, it shouldn't be set up as an alert at all, but rather as a saved report run on a schedule.
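For what it's worth, here's a minimal sketch of that kind of "data stopped arriving" alert - the index name your_metrics_index and the one-hour threshold are placeholders, so adjust both for your environment:

| tstats latest(_time) as last_seen where index=your_metrics_index by host
| eval minutes_since = round((now() - last_seen) / 60)
| where minutes_since > 60

Schedule that hourly as its own alert and it fires whenever a host hasn't sent anything in over an hour, which covers the "someone needs to know it's broken" part.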

codedtech
Path Finder

@Richfez We have an alert in place for when something breaks, and our API calls are configured to go back and collect the data that was missed during the outage. What I'm trying to do is have these reports rerun and send out once the delayed data hits my index. I'm just lazy and want to avoid manually rerunning each report when this happens.


Richfez
SplunkTrust

AHA!  That's a different question which has a possibly far simpler answer!

You may want to switch to using the internal field

_indextime 

instead of _time.

You can read up on it here - https://docs.splunk.com/Documentation/Splunk/8.0.6/Knowledge/Usedefaultfields - and they're serious about the whole "it won't show up". It won't appear even if you try to work with it directly; you really do have to use eval to copy it into a new field, or you'll find it disappearing in the middle of your SPL sometimes. It's spooky.

One use of it:

index=_internal
| eval indexed_time = _indextime
| eval indexing_delay = indexed_time - _time
| timechart max(indexing_delay) by sourcetype

And if you want, you can even eval it into the _time field so it "just works".

index=_internal
| eval _time = _indextime
| timechart count by sourcetype

The above will chart your count of events by sourcetype, but using the time it was *indexed* as the time axis, not the time assigned to the event.  It may not change your charts much - if your indexing lag is pretty consistent, it just shifts everything by 4 seconds or whatever, which timechart will likely bin together anyway.

But if your indexing lag is more variable, the above timechart will show a different sort of graph than the regular one based on _time itself. That chart may be interesting to see on its own - you might spot all your "when the API broke" moments in it. Take a look.

But in any case, if you convert your reports to use _indextime - there's some fiddling around and testing you'll have to do; Google for "splunk indextime" or similar - you may get results you like better when there are delays.
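As a rough illustration of that conversion - just a sketch with a placeholder index name, and note that the _time window is deliberately wider than the _indextime window so late-arriving events still get picked up:

index=your_metrics_index earliest=-24h
| eval indexed_time = _indextime
| where indexed_time >= relative_time(now(), "-1h@h") AND indexed_time < relative_time(now(), "@h")
| stats count by host

Scheduled hourly, that version keys each run off when events were indexed rather than when they happened, so a delayed batch simply shows up in the next run instead of being missed.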

Happy Splunking!

-Rich
