How to get Percentage of Uptime for a Transaction

tdavison76
Explorer

Hello,

If possible, I need help getting a percentage of uptime for a transaction over time.  I have created a search that builds a transaction based on:

startwith=Create

endswith=Close

keepevicted=true

The events come from OpsGenie when an alert is created and closed.  Is there any way to take the time between either Create/Close or Close/Create over a one-week timeframe to obtain the percentage?

Thanks for all of the help, let me know if any more details are needed.

Tom

 

 


Richfez
SplunkTrust
SplunkTrust

I'd love to see a small sample of the data this is based on (and please remember to use the code button to enter it so the browser/system doesn't eat special characters).

In any case, your output has Created and Branch, with Branch being "alert.message", yet you don't include it in your transaction?

Here's a run-anywhere search that illustrates the technique.

 

| makeresults format="CSV" data="time, action, branch
1715258900, create, bigville
1715251900, close, bigville
1715254900, create, smallville
1715253920, close, smallville
1715228900, create, bigville
1715211970, close, bigville"
| eval _time = time
| transaction maxspan=5h branch

 

In this case we have two branches, "bigville" and "smallville".  The first 7 lines just build a set of data to work with.  We then copy the time field into _time so Splunk treats it as the real time of the event.

The meat is the transaction, which we are now doing "by branch" (though 'transaction' doesn't use the keyword "by").  If you run the above, you'll see it creates 3 transactions, each with a duration field in it.  (I had to fiddle with the maxspan to get my silly test data to work right.)
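To make the grouping concrete outside Splunk, here is a minimal Python sketch of the same idea. It is a simplified model and not Splunk's actual transaction algorithm: it assumes events are grouped per branch and that a new group starts whenever adding an event would stretch the group past maxspan. The function name `group_durations` is hypothetical.

```python
from collections import defaultdict

# Sample events copied from the run-anywhere search: (epoch_seconds, action, branch).
EVENTS = [
    (1715258900, "create", "bigville"),
    (1715251900, "close", "bigville"),
    (1715254900, "create", "smallville"),
    (1715253920, "close", "smallville"),
    (1715228900, "create", "bigville"),
    (1715211970, "close", "bigville"),
]

MAXSPAN = 5 * 3600  # 5 hours in seconds, mirroring maxspan=5h


def group_durations(events, maxspan):
    """Simplified model (an assumption, not Splunk internals): per branch,
    walk timestamps in ascending order and start a new group whenever the
    span from the group's first event would exceed maxspan."""
    by_branch = defaultdict(list)
    for ts, _action, branch in events:
        by_branch[branch].append(ts)

    durations = {}
    for branch, stamps in by_branch.items():
        stamps.sort()
        groups = [[stamps[0]]]
        for ts in stamps[1:]:
            if ts - groups[-1][0] <= maxspan:
                groups[-1].append(ts)
            else:
                groups.append([ts])
        # Duration of a group = last timestamp minus first timestamp.
        durations[branch] = [g[-1] - g[0] for g in groups]
    return durations


durations = group_durations(EVENTS, MAXSPAN)
print(durations)
```

Under these assumptions the sample data falls into three groups, two for bigville and one for smallville, which lines up with the 3 transactions mentioned above.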

Now, let's add this to the end -

 

| stats sum(duration) as total_duration by branch

 

And poof, we now have a total sum of the duration fields for each branch.  Once we have that, we can add to the end...

 

| eval percent_uptime = (total_duration / (86400*7)) *100

 

and there's our percent uptime.  Obviously smallville has some problems.  🙂
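The arithmetic in those last two steps can be checked by hand with a small Python sketch. The durations below are an assumption: they were derived by hand from the sample timestamps (the gap between the paired events in each branch), not copied from real Splunk output. `percent_of_week` is a hypothetical helper name.

```python
SECONDS_PER_WEEK = 86400 * 7  # 604800 seconds in a 7-day window

# Assumed per-transaction durations in seconds, hand-derived from the sample data.
durations_by_branch = {
    "bigville": [7000, 16930],
    "smallville": [980],
}


def percent_of_week(durations):
    """Sum the durations and express the total as a percentage of one week."""
    return sum(durations) / SECONDS_PER_WEEK * 100


percents = {branch: percent_of_week(d) for branch, d in durations_by_branch.items()}
for branch, pct in percents.items():
    print(f"{branch}: {pct:.3f}%")
```

With these inputs bigville comes out at roughly 3.957% of the week and smallville at roughly 0.162%, so the result is simply the share of the week covered by the transactions for each branch.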

So, untested (I don't have your data), but I think this should work for you:

 

index=healthcheck integrationName="Opsgenie Edge Connector - Splunk" alert.message = "STORE*" "entity.source"=Meraki
| rename alert.message AS "Branch"
| transaction "alert.id", alert.message startswith=Create endswith=Close keepevicted=true Branch
| where closed_txn=0
| spath 'alert.createdAt'
| eval Created=strftime ('alert.createdAt'/1000,"%m-%d-%Y %I:%M:%S %p")
| stats sum(duration) as total_duration, latest(Created) as Created by Branch
| eval percent_uptime = (total_duration / (86400*7)) *100

 

I moved your rename to earlier (because life is easier this way), added "Branch" to your transaction, left most of that middle bit alone, added the stats to sum the duration of the transactions and to snag the latest "Create" from the event (again by "Branch"), then a bit of cleanup and math.

Give it a try.  As always, if something's not working right, start chopping lines off the end of that search until you get back to data that makes sense.  Then add the lines back one at a time, figuring out how each step works, what it does, and whether its results are right (and fixing it if they aren't) before moving on.  That's how I laid out the run-anywhere example: split into three pieces so you can see how it builds.

 

tdavison76
Explorer

You are awesome, I was able to get it working.

index=healthcheck integrationName="Opsgenie Edge Connector - Splunk" alert.message = "STORE_117_RSO - Unreachable" "entity.source"=Meraki
| rename alert.message AS "Branch"
| transaction "alert.id", alert.message startswith=Create endswith=Close keepevicted=true Branch
| stats sum(duration) as total_duration by Branch
| eval percent_downtime = (total_duration / (86400*7)) *100

 

Sorry, I just have one last question: this actually gives me the downtime.  How would I also show a percentage of uptime?

Wish I could give you 100 Kudos 🙂 


Richfez
SplunkTrust
SplunkTrust

For "uptime", just subtract your downtime from 100%.

Something like this at the end:

| eval percent_uptime = 100 - percent_downtime

Hope that works too!
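As a quick sanity check of that subtraction, here is a minimal Python sketch. The 6 hours of downtime is a made-up input for illustration, and `uptime_percentages` is a hypothetical helper name; it assumes the summed transaction durations represent downtime in seconds over a 7-day window.

```python
SECONDS_PER_WEEK = 86400 * 7  # 7-day window in seconds


def uptime_percentages(total_downtime_seconds, window_seconds=SECONDS_PER_WEEK):
    """Return (percent_downtime, percent_uptime) for the given window."""
    percent_downtime = total_downtime_seconds / window_seconds * 100
    percent_uptime = 100 - percent_downtime
    return percent_downtime, percent_uptime


# Hypothetical example: 6 hours of total downtime in one week.
down, up = uptime_percentages(6 * 3600)
print(f"downtime: {down:.2f}%  uptime: {up:.2f}%")
```

The two values always add up to 100, so once the downtime percentage is right, the uptime percentage follows directly.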

tdavison76
Explorer

That did the trick, thank you again for the excellent help.  Have a good week.

 

Thanks,

Tom


Richfez
SplunkTrust
SplunkTrust

I think we're missing some details to be able to provide *the answer* for you, but I can certainly point you in the right direction!

You have a transaction, so you have duration for each transaction.

So you'll want to sum those durations using stats, then do some division to get your uptime.  Something like this (pseudocode only):

... your base search here
| transaction ...
| stats sum(duration) as total_uptime [by something?]
| eval percent_uptime = (total_uptime / (86400*7)) * 100

That's assuming a 1-week period and that your durations are in seconds (I'm pretty sure that's what pops out of transaction), so the divisor is 86400 seconds per day times 7 days.

Give that a try, and if you have any further problems or questions about this, reply back with a bit more information (like the search involved, a bit of the sample output from that search, etc...)

Also if this helps, karma would be appreciated!

Happy Splunking,

Rich

tdavison76
Explorer

Hello,

Thank you for the very quick response, much appreciated and helpful.  I have been testing the uptime search you provided to obtain the percentage, but I'm not very good at creating searches yet.  This is the search I am using:

 

index=healthcheck integrationName="Opsgenie Edge Connector - Splunk" alert.message = "STORE*" "entity.source"=Meraki
| transaction "alert.id", alert.message startswith=Create endswith=Close keepevicted=true
| where closed_txn=0
| fields alert.updatedAt, alert.message, alertAlias, alert.id, action, "alertDetails.Alert Details URL", closed_txn, _time, dv_number, "alert.createdAt"
| spath 'alert.createdAt'
| eval Created=strftime ('alert.createdAt'/1000,"%m-%d-%Y %I:%M:%S %p")
| rename alert.message AS "Branch"
| table Created, Branch
| sort by Created DESC

I can't figure out what the stats sum(duration) should be grouped by.  The goal is to get the time between the Create and Close of each transaction as a percentage of 7 days.

Thanks again for all of the help,

Tom
