Last week we had a problem with ITSI: some of our services stopped functioning properly. Service health scores and KPIs no longer appeared to be calculated correctly, and entities were not being filtered correctly in the KPIs. At the same time, the itsi_refresh_queue grew to more than 161k and our /opt partition started filling much faster than before.
Has anyone seen this happen in their environment? Is there a rate at which the refresh queue should be decreasing, or a way to run the refresh scripts (which are largely a black box to us) manually to move things along and, hopefully, bring the environment back to a normal state?
Any advice would be greatly appreciated.
For context, we have about 200-300 services with between 5 and 10 base-search-based KPIs and around 90k entities in the KV store (around 27k of which are currently associated with services).
Thanks in advance.
If you have an ITSI instance, you probably have a support contract. Please open a support case for deeper troubleshooting.
Based on the description in the question, here are some pointers.
To check:
If the collection is empty (zero objects) but still large on disk, a one-time workaround could be to clear the KV store collection. Note that splunk clean normally has to be run while Splunk is stopped:
./splunk clean kvstore -app SA-ITOA -collection itsi_refresh_queue
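On the question of how fast the refresh queue should be shrinking: I am not aware of a documented target rate, but you can at least measure whether it is draining at all. Below is a minimal sketch, assuming you can read the queue size at two points in time by whatever means you already use. The function name drain_eta is hypothetical, not part of Splunk or ITSI; it only does integer arithmetic on the two counts you give it.

```shell
#!/bin/sh
# Hypothetical helper (not a Splunk/ITSI command): estimate how fast a
# queue is draining from two size samples taken some seconds apart.
# Usage: drain_eta FIRST_COUNT SECOND_COUNT SECONDS_BETWEEN
drain_eta() {
    first=$1
    second=$2
    seconds=$3

    if [ "$seconds" -le 0 ]; then
        echo "interval must be a positive number of seconds" >&2
        return 2
    fi

    drained=$((first - second))
    if [ "$drained" -le 0 ]; then
        # The queue is flat or growing, so there is no meaningful ETA.
        echo "not draining: queue grew by $((second - first)) in ${seconds}s"
        return 1
    fi

    # Integer arithmetic, rounded down: objects removed per minute.
    per_min=$((drained * 60 / seconds))
    # Minutes until empty if the observed rate holds.
    eta_min=$((second * seconds / drained / 60))
    echo "draining ${per_min}/min, about ${eta_min} min to empty"
}
```

For example, if the queue was at 161000 and ten minutes later at 155000, `drain_eta 161000 155000 600` reports a rate of 600 objects per minute and roughly 258 minutes to empty. If it reports "not draining" across several samples, that is useful detail to include in the support case.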