Solved: Calculation of availability

wcastillocruz · ‎01-21-2021

Hello Community, I am asking for your help with a project that I manage in my company: the calculation of availability.

I calculate availability by retrieving the critical alerts from ITRS (ITRS database), which are then indexed in Splunk via the Splunk_DB_Connect application. I then apply a formula to derive the unavailability, and from it the availability, over a given period (24 hours or a week).

My question is the following: for a cluster of servers (active/passive), I only want to take critical alerts into account if both members of the cluster are affected at the same time. At the moment all the critical alerts generated by the two members of the cluster are indexed, but I want to filter them so that an alert counts only if both servers emitted it at the same time, or if server 2 generated its alert while server 1 was still in critical condition, and vice versa.

Do you have any idea how I can do this filtering? The final goal is to create an availability dashboard with charts. Thank you in advance for your help.

rnowitzki · ‎01-21-2021

Hi @wcastillocruz ,

We could help you better if you showed us some example logs.

But in general you could have a logic like:

If condition of pair_member_1 AND condition of pair_member_2 is critical, then alert.
(this is not SPL, just to write out the logic 🙂 )

You could apply this logic to a timeframe of 5 minutes for example. So, if both of the pair members are critical within a 5 minute window, an alert is triggered (or a flag is set, bulb in your office goes red... whatever you need 😛 ).

If you provide some (anonymized) samples of what your data looks like, we could help you with more details, like how to implement the logic in a search and how to trigger an alert based on it.

BR
Ralph

--
Karma and/or Solution tagging appreciated.

wcastillocruz · ‎01-21-2021

Hi @rnowitzki, thank you for your reply. I have provided a screenshot of two indexed events (each event consists of 2 alerts: a critical alert and a return-to-normal alert). These events were generated by the two member servers of a cluster, but they did not happen at the same time, so availability is not affected: while one member was in critical condition, the second was providing the service.

My goal is to find events that happened at the same time, or where both servers were in a critical state at some point even though the alerts did not start at the same time. For example: server 1 raises an alert on 01/21/2021 at 3:00 p.m. and server 2 raises an alert on 01/21/2021 at 3:20 p.m. while server 1 is still in critical state. I have to find a way to express this condition in SPL. Thank you.

rnowitzki · ‎01-21-2021

Hi @wcastillocruz ,

This should do it.

| makeresults count=5
| streamstats count as id
| eval _time = case(id=1,_time-6000,id=2,_time-12000,id=3,_time-18000,id=4,_time-23000,id=5,_time-23000)
| sort _time
| eval severity=case(id=1,2,id=2,0,id=3,2,id=4,2,id=5,0)
| eval host = case(id=1,"A",id=2,"B",id=3,"A",id=4,"B",id=5,"A")

| timechart latest(severity) by host
| filldown A,B
| eval super_mega_alert=if(A=2 AND B=2,"yes","no")

You will only need the last 3 lines. The others are just to make up some sample data.

Add a span to the timechart that fits your needs, e.g. span=5m to monitor 5-minute windows.
You have to change A and B to the hostnames in your data.

Try to add the commands one after another to see and understand the logic.

So first add only the timechart command. You might have to change "host" to whatever field your device name is in ("managed_entity" maybe?).
You will see many lines with no values; that's where the filldown comes in. It fills each empty cell with the last known value of that field, which is valid because there was no change of state in the meantime.
The last line just checks when both nodes had severity 2 in the same timeframe and puts a "yes" in the new field super_mega_alert, which you can use to trigger your alert.
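To make the logic concrete outside of Splunk, here is a minimal Python sketch (not SPL) of what the last three lines do, replayed on the same made-up sample data as above. The function name both_critical and the tuple layout are just illustrative assumptions, and unlike timechart it does not bucket events into spans; it simply evaluates the state at each event time.

```python
from typing import Dict, Iterable, List, Tuple

CRITICAL = 2  # severity value used for "critical" in the sample data above

Event = Tuple[int, str, int]  # (time offset in seconds, host, severity)


def both_critical(events: Iterable[Event],
                  hosts: Tuple[str, ...] = ("A", "B")) -> List[Tuple[int, str]]:
    """Replay events in time order, carry each host's last known severity
    forward (the role filldown plays) and flag the moments when every host
    is critical."""
    ordered = sorted(events, key=lambda e: e[0])
    last_seen: Dict[str, int] = {}
    rows: List[Tuple[int, str]] = []
    i = 0
    while i < len(ordered):
        t = ordered[i][0]
        # Apply every event that shares this timestamp before checking state.
        while i < len(ordered) and ordered[i][0] == t:
            _, host, severity = ordered[i]
            last_seen[host] = severity
            i += 1
        flag = all(last_seen.get(h) == CRITICAL for h in hosts)
        rows.append((t, "yes" if flag else "no"))
    return rows


# Same made-up events as the makeresults block: (offset, host, severity)
sample = [(-6000, "A", 2), (-12000, "B", 0), (-18000, "A", 2),
          (-23000, "B", 2), (-23000, "A", 0)]
result = both_critical(sample)
```

With this sample, only the row at offset -18000 is flagged "yes": A has just gone critical and B's earlier critical state is still carried forward.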

Let me know if you have any further questions
BR
Ralph

--
Karma and/or Solution tagging appreciated.

wcastillocruz · ‎01-25-2021

Hello @rnowitzki, thank you for answering so quickly. I analyzed and tried your solution, which looks very good to me, but I do not get the desired result. I managed to put the desired SPL search down on paper. It is based on my previous screenshot: an event is composed of two alerts, a Critical alert that marks the start time of the event and an OK alert that marks its end time. In a generic search, how do I identify the start time and the end time of the event? Thanks for your help.

wcastillocruz · ‎01-25-2021

When I group the alerts into an event, the two timestamps end up in the same field ("timestamp start and timestamp end"). How can I separate two values contained in the same field? The values are separated by a space.

rnowitzki · ‎01-26-2021

Hi @wcastillocruz ,

Working with "transaction" as you did is also an option, yes.
Depending on how much data you look at, the performance might not be great, but it should give you what you need. You could add endswith=(severity=0) to the command. Maybe it gives better results.

To your question on how to separate the timestamp field from your screenshot:

I see two options; please adjust to your needs.
- | mvexpand timestamp => this will create 2 lines, each with one of the timestamps.
- | rex field=timestamp "(?<timestampA>\d*)\n?\r?\s?(?<timestampB>\d*)" => puts the timestamps in 2 new fields. They can be separated by a new line or a space.
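
Once every event has a separate start and end timestamp, the time during which both cluster members were critical is just the overlap of their intervals. Here is a rough Python sketch of that arithmetic (not SPL). The function name overlap_seconds, the sample intervals and their end times are made up for illustration, and it assumes the intervals of a single server do not overlap each other.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end) in epoch seconds


def overlap_seconds(a: List[Interval], b: List[Interval]) -> int:
    """Total number of seconds during which an interval from `a` and an
    interval from `b` are both open (both servers critical)."""
    total = 0
    for a_start, a_end in a:
        for b_start, b_end in b:
            # Intersection of the two intervals; negative means no overlap.
            total += max(0, min(a_end, b_end) - max(a_start, b_start))
    return total


# Hypothetical data: server 1 is critical for 30 minutes starting at t=0,
# server 2 goes critical 20 minutes later and again much later on its own.
server1 = [(0, 1800)]
server2 = [(1200, 2400), (5000, 5600)]
shared = overlap_seconds(server1, server2)  # 600 seconds
```

Only the 600 shared seconds would count as unavailability of the cluster; the second interval of server 2 does not count because server 1 was fine at that time.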

BR
Ralph

--
Karma and/or Solution tagging appreciated.

wcastillocruz · ‎01-28-2021

Hello @rnowitzki, 
I would like to ask you one last question: is it possible to get the number of seconds between earliest and latest when they are given as relative times?

rnowitzki · ‎01-28-2021

Hi @wcastillocruz ,

transaction will add a "duration" field, which I think is what you are looking for.
You can also subtract earliest from latest into a new field.
| eval seconds=latest-earliest

Or did I misunderstand?

BR
Ralph

--
Karma and/or Solution tagging appreciated.

wcastillocruz · ‎01-28-2021

I may have explained myself badly. I have to calculate the availability of a service over a period, which can be a year, a month, last week, month to date, etc. In my dashboard I have a time picker where you can select the period over which to calculate the availability. So I have to create a formula that calculates the availability, but it contains a variable: the number of seconds in the period over which the availability is calculated. Here is the search I am aiming for:

index=index_sqlprod-itrs_toc

| eval ID=Env+"_"+Apps+"_"+Function+"_"+varname

| transaction ID startswith=(severity=2) maxevents=2

| eval start_time=mvindex(timestamp,0), end_time=mvindex(timestamp,1)

| eval periode = $earliest$ - $latest$ """""these variables will look for the earliest and latest of my time button according to the selected period"""""

| stats sum(duration) AS duration_indispo by Function

| eval Percent_Available = round((periode - duration_indispo)*100/periode,4) """""here I use the result to calculate the availability according to the desired period """""

But the search is not working. Thanks for your help.

rnowitzki · ‎01-28-2021

Hi @wcastillocruz,

What exactly does not work? No output, wrong output, errors?

Some troubleshooting suggestions:

Are the tokens actually working?
Maybe check with these simple evals:

| eval first=$earliest$
| eval last=$latest$
| table first, last

If they actually contain the data, then the next thing is to check whether the calculation is correct.
I think you have to change the order in your math, like:
| eval periode = $latest$ - $earliest$
In epoch seconds, "latest" is the bigger number, so subtract earliest from latest.

Check these 2 items. If it still does not work, try to debug by removing one SPL line after another to check whether the desired output appears.

So, first remove the last line to check whether
| stats sum(duration) AS duration_indispo by Function
results in a valid number for duration_indispo.

If not, remove that line as well to check whether periode has the correct value, and so on.
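
For reference, the arithmetic the search is aiming for can be sketched outside Splunk like this (Python, not SPL). It assumes earliest and latest are already epoch seconds, and the function name percent_available is only illustrative.

```python
def percent_available(earliest: int, latest: int, downtime_seconds: float) -> float:
    """Availability in percent over the window [earliest, latest], using the
    same formula as the Percent_Available eval in the search above."""
    period = latest - earliest  # latest is the bigger epoch value
    if period <= 0:
        raise ValueError("latest must be greater than earliest")
    return round((period - downtime_seconds) * 100 / period, 4)


# Example: a one-week window with 600 seconds of unavailability.
week = 7 * 24 * 3600
value = percent_available(0, week, 600)
```

The guard on period also shows why the order of the subtraction matters: with earliest - latest the period is negative and the percentage is meaningless. One more thing worth checking in the search as posted: periode is evaluated before stats, and stats only keeps the aggregated field and the by field, so periode may be empty by the time the last eval runs.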

BR
Ralph

--
Karma and/or Solution tagging appreciated.

wcastillocruz · ‎01-28-2021

@rnowitzki

wcastillocruz · ‎01-28-2021

I used this:

| eval start_time=mvindex(timestamp,0), end_time=mvindex(timestamp,1)
