Hi Splunk Community,
I am having some issues with an ITSI module I am building. Before I get into it, I have checked the advice given here and here, but neither resolves the issue I am having. My dev VM has 12 vCPUs and 16GB RAM, so resources are not an issue. I have also run the health check in the monitoring console and can see no issues with skipped searches.
Out of 12 services, I currently have 5 marked as N/A. These services are fed by an add-on which runs every 5 minutes to collect metrics. Each metric collection run takes somewhere in the region of 30s to 120s depending on environmental factors, but never anywhere near the 5 minute mark.
To improve efficiency, I use KPI base searches, so there isn't an issue with resources. I'm sitting at about 4% CPU usage and 11% memory usage.
When I click into the service I can see all service KPIs are marked as Unknown and NaN. The entities for each KPI are showing as Normal, and the alert_values have values associated. However, when I open up the KPIs in a deep dive I am seeing 'No Data' for the past few hours. There is data: I have checked the events and they are recent, as of a few minutes ago. When I go to configure a KPI under services and define the threshold values, I can also see the aggregate and per-entity stats right up to the present, which shows ITSI is picking up the data in some manner.
To make it more confusing, the services are not appearing as N/A in the pattern you would expect. The TA collects the metrics in the same manner each time, for one service after another. What I am seeing is that services that are last in the metric collection run have health scores, but smaller services with fewer entities in the middle of the metric collection run are showing N/A.
So my question is, what is going on? And yes, I have had numerous occasions where all services were showing their health scores. This behaviour is a frequently occurring issue, so I want to be able to warn end-users about it.
Is your search using tags? Are the tags defined?
All the out-of-the-box services use tags, which cause N/A health scores.
You should create a single ad hoc KPI and add it to a service. If that is working as expected, you should build another ad hoc KPI, then create a base search and change the ad hoc KPIs to use the base search.
These weren't out-of-the-box services but services I had configured myself. As mentioned previously, the behavior is not confined to any one service; it was seen across multiple services at varying times. I did, however, get to the bottom of it. The issue is to do with monitoring lag, and I believe it to be a pretty big problem for end-users if they use ITSI services where entities are imported by way of a recurring search.
Without getting into the specifics of the setup, as they are irrelevant, let's say the system I was pulling KPIs from had what translated into 500 entities at the time of creating the module, services, entities and so on. When I was defining the KPI base searches I had to set a monitoring lag, and the recommended setting was 136 seconds.
I left the ITSI module running away doing its own thing whilst I worked on other projects, but in the meantime the number of entities increased to approximately 2500. The effect of all of these extra entities imported by recurring search was that the monitoring lag increased immensely, with recommendations of 1600 seconds for the last service to have data pulled from the add-on feeding data to ITSI. Once I changed all the monitoring lag values to their new recommended settings, there were no more issues with Unknown/NA/NaN values appearing at any level.
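As a rough illustration of why the lag has to grow with the data, here is a minimal Python sketch that derives a lag value from observed indexing latencies (for example, the difference between Splunk's `_indextime` and `_time` for each event). This is not how ITSI computes its own recommendation; the function name `recommend_monitoring_lag`, the 95th percentile and the 15 second padding are all assumptions made for the example.

```python
import math


def recommend_monitoring_lag(latencies, percentile=95, padding_seconds=15):
    """Return a lag in whole seconds covering `percentile` % of samples.

    `latencies` holds observed delays in seconds between an event's
    timestamp and the moment it became searchable. The percentile and
    padding defaults are illustrative assumptions, not ITSI defaults.
    """
    if not latencies:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies)
    # Nearest-rank percentile: the smallest sample that has at least
    # `percentile` percent of all samples at or below it.
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return math.ceil(ordered[rank - 1] + padding_seconds)


# Hypothetical samples matching the 30s to 120s collection runs above.
print(recommend_monitoring_lag([30, 45, 60, 90, 120]))  # prints 135
```

If the slowest samples grow as more entities are imported, the value returned grows with them, which is the effect described above.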
This does raise a far more important question than what NaN means (there is no explanation anywhere): are we expecting end-users to go and continually update the monitoring lag on the base searches as more entities are added to ITSI? If so, this is really poor behavior from the feature. What if ITSI fails at a crucial time because more entities were added, resulting in increased monitoring lag? If not, is there a way to have ITSI set the monitoring lag dynamically?
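In the absence of a dynamic setting, one workaround could be a periodic check that compares the lag configured on each base search with the lag the data currently needs. The sketch below is only that, a sketch: the base search names, the dictionaries and the function `find_insufficient_lags` are hypothetical, and how the two sets of values are gathered from ITSI is left out.

```python
def find_insufficient_lags(configured, required):
    """Return {base_search_name: shortfall_seconds} for searches whose
    configured monitoring lag is lower than the currently required lag.

    Both arguments map a (hypothetical) base search name to a lag in
    seconds. Searches missing from `required` are skipped.
    """
    shortfalls = {}
    for name, lag in configured.items():
        needed = required.get(name)
        if needed is not None and needed > lag:
            shortfalls[name] = needed - lag
    return shortfalls


# Values taken from the scenario above: lag set to 136s when there were
# 500 entities, while the last service now needs about 1600s.
configured = {"base_search_first": 136, "base_search_last": 136}
required = {"base_search_first": 120, "base_search_last": 1600}
print(find_insufficient_lags(configured, required))
# prints {'base_search_last': 1464}
```

Any non-empty result would be a signal to revisit the monitoring lag before services start showing N/A.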
Yeah, the recurring entity import feature can be a little dangerous for this exact reason. You may want to review your entities and either scale your hardware based on how many you expect or cut down on the number of entities you have. You should also check for duplicates.
We are also facing this issue. Are there any recommended settings for the monitoring lag, considering we have 100 entities for a service?
I just abandoned the ITSI effort after having more questions than answers in the end. I cannot tell you if there have been any improvements in this space since I worked on a custom ITSI module, but it would appear not if you are facing the same problem!