Hi Splunk Community,
I am having some issues with an ITSI module I am building. Before I get into it: I have checked the advice given here and here, but neither resolves the issue I am having. My dev VM has 12 vCPUs and 16GB RAM, so resources are not an issue. I have also run the health check in the monitoring console and can see no issues with skipped searches.
Out of 12 services, I currently have 5 marked as N/A. These services are fed by an add-on which runs every 5 minutes to collect metrics. Each metrics collection run takes somewhere in the region of 30s to 120s depending on environmental factors, but never anywhere near the 5 minute mark.
To improve efficiency, I use KPI base searches, so resource usage should not be the problem either: I'm sitting at about 4% CPU usage and 11% memory usage.
When I click into one of the affected services, I can see that all service KPIs are marked as Unknown and NaN. The entities for each KPI are showing as Normal and the alert_values have values associated. However, when I open the KPIs in a deep dive, I see 'No Data' for the past few hours. There is data: I have checked the events and they are as recent as a few minutes ago. Also, when I go to configure a KPI under Services and define the threshold values, I can see the aggregate and per-entity stats right up to the present, which shows ITSI is picking up the data as intended in some manner.
To make it more confusing, the services showing N/A are not the ones you would expect. The TA collects the metrics in the same manner each time, for one service after another. What I am seeing is that services that are last in the metrics collection run have health scores, but smaller services with fewer entities in the middle of the run are showing N/A.
So my question is: what is going on? And yes, there have been numerous occasions where all services were showing their health scores. This behaviour occurs frequently, so I want to be able to warn end-users about it.