Alerting

Suitability of Splunk for real-time monitoring and alerting?

mitag
Contributor

Existential question here... 🙂

What is the appropriate mechanism in Splunk to have multiple (potentially hundreds) of alerts that are based on the latest events, rather than real-time or timeframe searches, while keeping our Splunk deployment sane and simple? (Is it even possible?)

Example:

I need an alert when a volume (disk) breaches an 80% used space threshold, and need it within 30 seconds of when Splunk gets an event. (Then similar alerts for NAS and SAN volumes, CPU, memory, interface utilization, and a whole bunch of other metrics.)

Setting up a few dozen of these real-time searches and their respective alerts brings our cluster to its knees. Attempting to set up fast, auto-refreshing dashboards with the same metrics knocks it out entirely. Whereas doing something like this in SolarWinds or Datadog is a piece of cake, including statistics-based alerts (e.g. if a metric exceeds its 30-minute baseline by more than 20% over the last 3 minutes).
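To illustrate, the baseline comparison above might be sketched in SPL roughly like this (the index, metric, and field names are hypothetical):

```spl
index=metrics_events metric=cpu_pct earliest=-33m@m
| eval window=if(_time >= relative_time(now(), "-3m@m"), "recent", "baseline")
| stats avg(value) AS avg_val BY host, window
| xyseries host window avg_val
| where recent > baseline * 1.2
```

(Not that this addresses the performance question - it just illustrates the kind of alert logic I mean.)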

Is Splunk not the right product for the task? If it is, what is the technical term for my problem, how can it be solved in Splunk, and would you be so kind as to point me to where it's discussed?

Thanks!

P.S. Please read the question fully and please abstain from attempting to answer what you might think the question is. The question is specific enough:

What is the appropriate mechanism in Splunk to have multiple (potentially hundreds) of alerts that are based on the latest events, rather than real-time or timeframe searches, while keeping our Splunk deployment sane and simple?

To rephrase it a bit:

In an environment where the ability to set up fast alerts on a large number of metrics is important, is Splunk (OOTB) the right product for the task? If so, what are the mechanisms to accomplish that? (Because OOTB, Splunk is not suitable for it - at least not in my experience.)

P.P.S. Is metrics such a mechanism? If so what would be a good resource to get those set up and running fast? (Digging through the documentation tells me it's not a streamlined, easy experience.)

1 Solution

shivanshu1593
Contributor

In your scenario, a tool like the SolarWinds SAM module, BMC TrueSight or Microsoft SCOM may do a better job, as they are built with infrastructure monitoring in mind. Splunk, as you yourself explained, doesn't seem to be the right choice for your requirement. If I were you, I'd use the tools mentioned above instead of Splunk.

Real-time searches, as you explained, are causing issues with your cluster, and a scheduled alert/report requires a minimum of one minute between the previous and the current execution, even with a cron schedule. So acting within 30 seconds of the arrival of an event isn't possible without having a real-time search running, which is where using Splunk becomes pointless in your case.

View solution in original post


mitag
Contributor

Thank you! Just to confirm: not even metrics would help? (Provided we're OK with waiting longer than a minute or two for the alert to fire.)

Also: can you think of where this topic is discussed at length? (I can't be the first one to come up with such a question?)


shivanshu1593
Contributor

Metrics can help in this situation, but the chances of that are a little slim. Here are some pointers you may wish to consider before using it.

  1. Using metrics will definitely save disk space and may improve search performance, which your situation badly needs, provided the SPL for the search is highly efficient. Using time- and resource-consuming commands like transaction, join, etc. will nullify the efficiency introduced by metrics.

  2. Implementation isn't as easy as ingesting normal data, as metrics are very particular about data formatting. Also consider the effort you'll have to put in to convert the data into the appropriate format before ingesting it (this will require some scripting magic).

  3. The data formatting may seem difficult and even off-putting to the folks using Splunk at your organisation. I have people on my team who do not like it at all.

  4. Splunk uses metrics data for its own internal logs, and even its dashboards at times take ages to populate. Which again brings us back to the SPL, which may or may not be efficient even if written by a Splunk expert (it boils down to what you want to achieve with the data).

Unfortunately, I cannot direct you to an old thread where this topic was discussed at length; I cannot recall or find any threads like it here.
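As a rough sketch of what an alert over a metrics index can look like once the data is in (the index name, metric name, and threshold below are hypothetical), mstats keeps the search itself cheap:

```spl
| mstats avg(_value) AS pct_used WHERE index=my_metrics AND metric_name="disk.pct_used" AND earliest=-5m span=1m BY host
| where pct_used > 80
```

Scheduled every few minutes, this avoids a real-time search entirely; the hard part, as noted above, is getting the data into the metric format in the first place.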

shivanshu1593
Contributor

A point I forgot to add: even if the SPL is efficient, one search usually takes at least one CPU core to run. Hundreds of searches running at the same time will choke your search head and create a bottleneck of its own, before it can help you solve the bottlenecks of the other servers in your infrastructure, causing a big headache of queued searches and high CPU utilisation. Something we would definitely want to avoid.

to4kawa
SplunkTrust

I'm not sure we actually need a hundred searches.

(index=ps OR index=iostat) (CPU > 80 OR Disk > 80 OR process > 80)
| stats values(*) as * by host

Wouldn't it be a single search like that?

mitag
Contributor

Different thresholds (transcoders vs. SMTP relays vs. SQL servers), escalation paths (engineering, NOC, SOC?), alert types (email, Slack), and alert frequency for different hosts and entities depending on their purpose and criticality. Then it's:

  • host up/down or in warning or critical
  • service or application up/down - or in warning or critical (there can be multiple criteria for that depending on the application or service)
  • high latencies
  • CPU, disk, memory, network utilization - with varying thresholds depending on multiple factors

... and I am probably forgetting a thing or twenty. In SolarWinds, we have about 50-60 active alerts and they're a bear to manage due to the lack of conditional execution in alert actions (e.g. tag-based) - but at least they have escalation tiers, which is not something Splunk does. Going forward we need a better alert management system - in addition to fast alerts that don't bring the cluster down.

to4kawa
SplunkTrust

thanks @mitag

I understand well.

We need the right tool for the right place, I think.

martin_mueller
SplunkTrust

There is no need to have individual alerts for different volumes, NAS volumes, SAN volumes, etc. - merge them all into one. Use lookups if you need different static thresholds. Set sane time ranges. Use _index_earliest if you need to look at data from a "when it arrived" point of view instead of the usual splunky "when it happened" point of view.
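A minimal sketch of that pattern (the sourcetype, lookup, and field names here are hypothetical): one search covers every host and volume, and a lookup supplies per-volume thresholds with a default fallback:

```spl
sourcetype=disk_usage _index_earliest=-5m@m
| stats latest(pct_used) AS pct_used BY host, volume
| lookup volume_thresholds host, volume OUTPUT threshold
| where pct_used > coalesce(threshold, 80)
```

The same lookup can carry an escalation or notification field, which alert actions can then route on.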

As for your question "(Is it even possible?)" - sure. Splunk can run lots of complex searches, so there's no reason for your simple alerts to bring your cluster to its knees.

If you need more specific advice then be more specific about what in particular you're doing that breaks in your environment.

mitag
Contributor

I downvoted this post because it's disrespectful and/or inconsiderate in my eyes: the poster did not appear to read the question through.


martin_mueller
SplunkTrust

Is there a sample or a template search and alert available in official Splunk documentation that I could easily integrate into my environment? (Focus on "easily" - so that someone else in my team not intimately familiar with Splunk could do that?) No such template? Rethink your answer then?

I'm not engaging in hostile back-and-forth on my free time. I don't work for Splunk, I don't get paid for answering stuff here. Rethink your tone?

mitag
Contributor

If there is a possibility you didn't read the question fully or are attempting to answer a different one - please consider deleting or revising your answer. In my eyes, it doesn't even attempt to answer the question - which I believe to be inconsiderate, at the very least - and perhaps the reason for the "tone". Cheers.


mitag
Contributor

The question is specific - or at least contains very specific sub-questions. Specific scenarios knocking out our three-node cluster are:
- 5+ real-time searches on fields extracted at index time,
- someone else on my team converting a dashboard with expensive searches to run in real time (they are used to that in other tools like SolarWinds and Datadog, and when I tell them Splunk is not like that, they stop using it).

... however those scenarios were omitted on purpose, to keep the focus laser sharp: in an environment where fast alerts on a large number of metrics is important, is Splunk the right product for the task? If so, what are the mechanisms to accomplish that? (Because OOTB, Splunk is not suitable for it - at least not in my experience.)

So does your answer imply "no such OOTB mechanism"? (Outside of a rather steep learning curve - or hiring a Splunk specialist - to optimize things? And, of course, metrics? Or are metrics not the right mechanism?)

What is the appropriate mechanism in Splunk to have multiple (potentially hundreds) of alerts that are based on the latest events, rather than real-time or timeframe searches, while keeping our Splunk deployment sane and simple?

Now to your answer...

There is no need to have individual alerts for different volumes, NAS volumes, SAN volumes, etc. - merge them all into one.

There isn't? Well then.

  • Is there a sample or a template search and alert available in official Splunk documentation that I could easily integrate into my environment? (Focus on "easily" - so that someone else in my team not intimately familiar with Splunk could do that?) No such template? Rethink your answer then?
  • How does one merge multiple alerts into one if they need to have different alert actions? E.g. alert different teams based on the host, volume, severity, threshold, escalation level, etc.?
  • How does one merge alerts when the underlying searches - and resulting alerts - are very different? (Local volumes are searched across all hosts excluding certain volume types, while SAN and NAS volumes - at least with respect to "disk full" alerts - are only searched on specific hosts.) If merging them makes the search unwieldy, too complex, hard to manage - rethink your answer?

It seems your answer implies hiring someone with 3+ (or is it 10+?) years of Splunk experience who could optimize the searches and alerts sufficiently to make them more performant? If this sounds about right - perhaps rethink your answer?


jkat54
SplunkTrust

Each search can consume up to one CPU core by default. Each real-time search (nasty things they are) consumes a CPU core indefinitely.

I would love to know why you need to know within 30s if one of these KPIs is breached. Will you fix the problem in less than 30s? Or will you wait 15 minutes for your mailbox to download your emails, refresh, generate a ticket, and then resolve the issue? If you can fix these problems in seconds rather than minutes then real-time might be worth the costs to you. Otherwise, schedule the alert to run every 5-15 minutes instead and be sure to account for indexing lag.
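As a sketch of that scheduling approach (the sourcetype and field name are hypothetical): run the alert on a cron schedule such as */5 * * * *, and search by index time with a one-minute offset so indexing lag doesn't let events fall between runs:

```spl
sourcetype=vmstat cpu_pct>80 _index_earliest=-6m@m _index_latest=-1m@m
| stats max(cpu_pct) AS cpu_pct BY host
```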

You'll note (if you look hard enough) that SolarWinds suffers from similar lag. You might think it's computing health every 30 seconds, but you may actually have more lag than SolarWinds lets on.

mitag
Contributor

Sounds like the answer is a "no"? Should have been a comment though, not an answer - I don't see even a half-serious attempt to answer the actual question.


jkat54
SplunkTrust

You might not like my real answer... Don't use splunk as a primary monitoring tool.

Splunk the data to generate meaningful insights about your environment instead... every time x falls over, 15-30 minutes later the 404s go crazy... OK, now fix that so x recovers within 15-30 minutes, OR replace x with a more elegant solution... and prove to management that adding one more x will lead to 100k more revenue... that's how you Splunk and save the day.

mitag
Contributor

You might not like my real answer...

Deflection. Go easy.

Don't use splunk as a monitoring tool.

Now that may be an answer to one of my questions:

Is Splunk not the right product for the task?

...although it deflects, again: says "don't use" rather than, "no, it is not the right product". See the difference? Besides, I suspect Metrics and Convert event logs to metric data points are either better answers - or at least point me in the right direction.

You're correct, I do not like your answer - and not because it steps on my toes or contradicts my biases. It's because it contains factually incorrect information (about SolarWinds) and doesn't even attempt to answer the actual question.


jkat54
SplunkTrust

Ok then you see this? This is me trying to help you for free... and I'm done doing it.


mitag
Contributor

Good. Delete your answer then?


to4kawa
SplunkTrust

It's a balance. If you search every 5 seconds and the result returns within 1 second, it's OK.
