Splunk Observability Cloud

Help using time in a detector

MichaelRTR
Loves-to-Learn

Hi there,

I'd like some advice, please, on creating a detector.

For my Cloud Scheduled Functions, I have a custom metric called "job.status" for operational monitoring.
This metric has the dimensions: job_id, job_event (START/ERROR/END), and job_event_source (SCHEDULER/JOB).

I have Cloud Scheduled Functions (CSFs) that run on a cron schedule.
For example, one service may run its CSFs at:

1pm, 3pm, and 5pm daily.

I want to create some detectors that:

Detector 1: Expected Start. Assert that a job_event=START metric occurs within a small window around the job's cron-scheduled start time, e.g. around 1pm, 3pm, and 5pm daily as in the example above. This lets us know when jobs fail to start.

Detector 2: Expected Completion. Assert that a terminal job_event (END/ERROR) metric occurs within the specific window defined by the job's cron schedule, and/or since the prior job_event=START metric, plus the job's max runtime. (E.g., when(hour >= 1 and hour < 2) for a job scheduled only at 1am with a max runtime of an hour.) This lets us know when jobs run too long.

Is it possible to build the above detectors with the logic available in Splunk Observability Cloud? I'm finding it difficult to achieve.

It would be great to get some advice.

Thanks,
Michael


kknairr
Contributor

@MichaelRTR Your use case is essentially a cron-aware monitoring problem, but Splunk Observability detectors are built on SignalFlow, which operates on metric streams and time windows. A couple of approaches to try:

  • SignalFlow supports calendar window transformations (e.g., when(hour >= 13 and hour < 14)), which can approximate cron-like checks. You can define detectors that look for job_event=START within those windows; see the sketch after this list.
  • For scaling, instead of creating separate detectors per job, consider templated detectors or Terraform automation (the signalfx_detector resource) to generate detectors programmatically; see the Terraform sketch below.
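
To make the first bullet concrete, here is a minimal SignalFlow sketch that leans on the built-in not-reporting (heartbeat) library rather than an explicit hour predicate. The 2h duration is a placeholder, and jobs that only run during part of the day would still need muting rules or a calendar-window condition on top:

  # Heartbeat-style check: fires per job_id when no job_event=START
  # datapoint has arrived for the given duration.
  from signalfx.detectors.not_reporting import not_reporting

  # All START events for all jobs, kept separate per job_id.
  starts = data('job.status', filter=filter('job_event', 'START'))

  # duration='2h' is illustrative; pick it per job's cadence.
  not_reporting.detector(stream=starts, resource_identifier=['job_id'], duration='2h').publish('job_did_not_start')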

SignalFx / Splunk Terraform provider:

terraform-provider-signalfx/docs/resources/detector.md at main · splunk-terraform/terraform-provider...
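
Along the lines of the second bullet, a rough (untested) sketch of generating one detector per job with the signalfx_detector resource; the jobs variable, names, and duration are all placeholders, not a known-good configuration:

  # Sketch: one detector per job, generated from a map of job_ids.
  variable "jobs" {
    type = map(string) # job_id => schedule note, e.g. "1pm/3pm/5pm daily"
  }

  resource "signalfx_detector" "missed_start" {
    for_each = var.jobs

    name         = "Missed START: ${each.key}"
    program_text = <<-EOF
      from signalfx.detectors.not_reporting import not_reporting
      starts = data('job.status', filter=filter('job_event', 'START') and filter('job_id', '${each.key}'))
      not_reporting.detector(stream=starts, resource_identifier=['job_id'], duration='2h').publish('missed_start')
    EOF

    rule {
      detect_label = "missed_start"
      severity     = "Critical"
    }
  }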

SignalFlow:

https://dev.splunk.com/observability/docs/signalflow/



bishida
Splunk Employee

Hi,
So, quick disclaimer--there are probably a few different ways to accomplish this. But the first idea that comes to mind for me is to take a look at muting rules. The idea here is that it might be easier to set up a detector that looks over the past 15 minutes or so for your custom metric filtered by job_event=START. You could detect on a static threshold where count < 1 (effectively zero). This detector would always be running, but then you strategically apply muting rules so that it only alerts you in the time windows you care about. I'm thinking you'll have to use a "custom absolute time window" but then mark it as repeating daily. Something like the sketch below.
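
Roughly (the job_id value and the 15-minute window are placeholders; note that if the metric stops reporting entirely, the stream can go empty rather than read zero, in which case the built-in not-reporting/heartbeat check may be the safer variant):

  # Count START datapoints for one job over a trailing 15-minute window.
  starts = data('job.status', filter=filter('job_event', 'START') and filter('job_id', 'example-job')).count(over='15m').publish(label='starts')

  # Fire when no START was seen in the window; muting rules keep this
  # quiet outside the scheduled start times.
  detect(when(starts < 1)).publish('job_missed_start')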

https://help.splunk.com/en/splunk-observability-cloud/create-alerts-detectors-and-service-level-obje...


MichaelRTR
Loves-to-Learn

Hi Bishida,

Thanks for the response.

While this would be a good solution for one (or a handful of) Cloud Scheduled Functions, I need to roll this out for 150+ CSFs that run on different schedules. It wouldn't be feasible to add that many muting rules, and it probably wouldn't scale as I add more CSFs in the future. Do you know whether SignalFlow can use/read cron expressions, or time windows in general?

Thanks!


bishida
Splunk Employee

Here are a couple of thoughts that might help you choose an approach:

- I'm unsure if you can limit detection to a specific timeframe using SignalFlow alone. It might be possible--I just don't know how to do it off the top of my head, so it might be kinda tricky.

- If you do want to try the muting rule approach, there is an API that could help with scaling and automating it:
https://dev.splunk.com/observability/reference/api/incidents/latest/#endpoint-create-single-muting-r...

- Another idea: you can pull metrics from Observability Cloud into Splunk Enterprise/Cloud using the Observability Cloud IM technical add-on (https://splunkbase.splunk.com/app/5247). You can use this TA to store Observability Cloud metrics in a Splunk metrics index pretty much for free (it doesn't count against your ingest if you have ingest-based licensing). Once the metrics are in Splunk Enterprise/Cloud, the time-restricted searching and alerting would be much easier to configure; see the sketch below.
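
To illustrate that last point, once the metrics land in a Splunk metrics index, the missed-start check could be a scheduled search whose cron mirrors the job's schedule (the index name and job_id below are placeholders):

  | mstats count(job.status) AS start_count WHERE index=o11y_metrics AND job_event="START" AND job_id="example-job"
  | where start_count > 0

Run it over the last 20 minutes or so on a cron like 20 13,15,17 * * *, and trigger the alert when the search returns zero results.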
