Hello Splunk Lovers,
This question is very specific to ITSI. I would like to know how long the system takes to calculate a Service Health Score for a service with 10 KPIs that are configured with 5 base searches.
I am researching the lag time for my service's health score calculation. As I understand it, there are 3 pieces to this.
1. Data coming into Splunk. I know the details, so you can ignore this.
2. Service schedule/frequency of each KPI. For example, every 5 minutes.
3. How long each KPI search runs and, in turn, when the service has all the information it needs to calculate the Health Score. This is available in the itsi_summary index, which is where my Service Analyzer reads from.
I will be happy to elaborate more if needed; I tried my best to include as much as I can.
Ultimately, I need to understand the "LAG" in calculating the Health Score.
Ah, yes. This is something we have to analyze quite often. Also remember that the deeper you stack KPIs, the longer the lag will be. Any subservice, or any KPI that is derived from the relationship between multiple other KPIs, will have one or more additional lag steps.
Your lag #1 consists of collection and transmission lags, plus whatever lag your ingestion adds. You can chart this by comparing the difference between _time and _indextime. That calculated difference is the lag time for collection, transmission and ingestion. We'll call it the "ingestion lag" for simplicity.
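To make the _indextime minus _time comparison concrete, here is a rough Python sketch of the same arithmetic on a handful of exported events. The sample events and the function names (`ingestion_lags`, `percentile`) are made up for illustration, and the percentile uses a simple nearest-rank method, so the numbers may differ slightly from what Splunk's own percentile functions report.

```python
import math

def ingestion_lags(events):
    """Return the ingestion lag in seconds (_indextime - _time) for each event."""
    return [e["_indextime"] - e["_time"] for e in events]

def percentile(values, pct):
    """Nearest-rank percentile; an illustrative choice, not Splunk's algorithm."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical events with epoch-second timestamps.
sample_events = [
    {"_time": 1000, "_indextime": 1030},
    {"_time": 1060, "_indextime": 1105},
    {"_time": 1120, "_indextime": 1180},
    {"_time": 1180, "_indextime": 1270},
    {"_time": 1240, "_indextime": 1260},
]

lags = ingestion_lags(sample_events)
p95_lag = percentile(lags, 95)
print(f"lags={lags} p95={p95_lag}s")
```

You would run this per data source, since each source can have a very different ingestion lag.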
Your lags #2 and #3 are going to be calculated together, and are going to require analysis of the schedule and the ingestion lag of the underlying data sources. If you want to model a worst-case scenario, then you probably want to take the 90th or 95th percentile of the ingestion lag for each data source, then assume that the event barely missed the cutoff for the KPI search. For instance, if you had a 90 second ingestion lag and a 5m KPI search, then the resulting KPI is up to 390 seconds old (90 + 300), plus however long that search takes to run. That result will roll up to the service health SHKPI one minute later than that. If that service is a subservice, then it will roll up to the next level a minute later, and so on.
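The worst-case arithmetic above can be sketched in a few lines of Python. This is only a back-of-the-envelope model under the assumptions stated in this answer (one KPI interval of staleness, plus one minute per roll-up level); the function and parameter names are hypothetical and nothing here is an ITSI API.

```python
def worst_case_kpi_age(ingestion_lag_s, kpi_interval_s, search_runtime_s=0):
    """Worst-case age of a KPI value, in seconds.

    Assumes the event barely missed the previous KPI run, so it waits a
    full interval on top of its ingestion lag, plus the search runtime.
    """
    return ingestion_lag_s + kpi_interval_s + search_runtime_s

def worst_case_health_age(kpi_age_s, levels=1, rollup_s=60):
    """Worst-case age of the health score after `levels` roll-ups.

    levels=1 is the service itself; each parent service above it adds
    another roll-up step (assumed here to be one minute each).
    """
    return kpi_age_s + levels * rollup_s

# The example from the text: 90 s ingestion lag, 5 minute KPI search.
kpi_age = worst_case_kpi_age(ingestion_lag_s=90, kpi_interval_s=300)
print(kpi_age)                                   # KPI age before search runtime
print(worst_case_health_age(kpi_age))            # after the service roll-up
print(worst_case_health_age(kpi_age, levels=2))  # after one parent service
```

Plug in your own 95th percentile ingestion lag and measured search runtime to get a bound for your service.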
Thanks a bunch for the information. I follow what you are saying. While we can estimate the calculation time, there is one other factor I see happening.
-> What if the KPI calculation has been skipped for 2 rounds? For example, I am looking at the 10:15 value,
but no calculation was done for 10:00 and 10:05, and I see that the next value was calculated at 10:16. What will I get when I use 10:10 to 10:15?
I am not making this up. Sometimes I see values being skipped, or perhaps they were never calculated. What will the value be then? I definitely don't see an N/A option. Is it finalizing the calculation on whatever has been gathered up to that point?
@splunk . this is something we need to
Here are 2 specific questions I have; any information is greatly appreciated.
In general, if the value can't be calculated for a given 5 minutes, I expect it to use the old value. Is that true?
In other cases, I noticed that when, for example, I want to know the score between 10:20 and 10:25, the KPI is calculated at 10:29, and at that point it matches the Service Analyzer (while I also have values for 10:19 and 10:24). Is this adding a lag of up to 300 seconds, i.e. the 5 minutes that I set up?