Splunk IT Service Intelligence

KPIs stopped functioning properly and disk space increased dramatically. How to fix?

Path Finder

Hi all-

We experienced a problem last week with ITSI where some of the services we use stopped functioning properly. Service Health Scores and KPIs seemed to stop being calculated properly. In addition, entities were not getting filtered properly in the KPI. At the same time, we started to see the itsi_refresh_queue increase to over 161k and our /opt partition fill at a rate that was much higher than previously.

My question is, has anyone seen this occur in their environment? Is there a rate at which the refresh queue should be decreasing, or a way to run the refresh scripts (which are quite a black box for us) manually to move this along and hopefully bring the environment back to a normal state?

Any advice would be greatly appreciated.

Also, for context: we have about 200-300 services, each with somewhere between 5-10 base-search-based KPIs, and around 90k entities in the KV store (around 27k of which are actively associated with services at the moment).

Thanks in advance.


Splunk Employee

If you have an ITSI instance, you probably have a support contract; please open a support case for deeper troubleshooting.

From the description of the question, here are some pointers.

  • The itsi_refresh_queue is a KV store collection used to store changes that need to be applied to ITSI. It can fill up when there are service template changes to propagate, entity imports, shared base search updates, threshold updates, and so on. If you recently made a mass change or performed an upgrade, that could be the reason.
  • After a refresh task has been applied, its object is removed from the collection.
  • However, the KV store is file-backed: as a collection grows, it adds new chunks of files on disk to reserve space. When objects are removed from the collection, the files on disk are not removed (instead, the reserved slots are reused later), so the disk space is not recovered. This may be what happened to you: a large collection grew big and did not shrink when its objects were removed.
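To see whether the queue is actually draining, one option is to sample the collection over time through the KV store REST API. This is only a sketch: the management host, port, and credentials below are placeholders you must replace, and the endpoint returns at most one page of documents per request, so the printed number is a lower bound on queue depth.

```shell
# Sketch: sample the itsi_refresh_queue collection via the KV store REST API.
# SPLUNK_MGMT and AUTH are hypothetical placeholders for your environment.
SPLUNK_MGMT="${SPLUNK_MGMT:-https://localhost:8089}"
AUTH="${AUTH:-admin:changeme}"
OUT="$(mktemp)"
if curl -ks -u "$AUTH" \
    "$SPLUNK_MGMT/servicesNS/nobody/SA-ITOA/storage/collections/data/itsi_refresh_queue" \
    -o "$OUT"; then
  # Count documents in the returned JSON page (a lower bound on queue depth).
  python3 -c 'import json,sys; print(len(json.load(open(sys.argv[1]))))' "$OUT" 2>/dev/null \
    || echo "could not parse response; check credentials and endpoint"
else
  echo "management port not reachable at $SPLUNK_MGMT"
fi
rm -f "$OUT"
```

Running this every few minutes and comparing the counts gives a rough sense of whether the refresh tasks are being worked off.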

To check:

  • You can check the collection size on the ITSI health check dashboard: look at the "KV Store Collections" panel and compare "Number of Objects" versus "Collection Size (MB)".
  • Or check the disk directly, under $SPLUNK_HOME/var/lib/splunk/kvstore/mongo at the s_SA-ITO* files, though it may be hard to tell which file maps to which collection.
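The disk-side check above can be scripted. A minimal sketch, assuming a default installation path (adjust SPLUNK_HOME for your environment):

```shell
# Sketch: list the largest KV store collection files on disk.
# Assumes default paths; adjust SPLUNK_HOME for your installation.
SPLUNK_HOME="${SPLUNK_HOME:-/opt/splunk}"
MONGO_DIR="$SPLUNK_HOME/var/lib/splunk/kvstore/mongo"
if [ -d "$MONGO_DIR" ]; then
  # s_* files hold collection data; sort by size to spot the big ones.
  du -h "$MONGO_DIR"/s_* 2>/dev/null | sort -rh | head -20
else
  echo "KV store directory not found at $MONGO_DIR; adjust SPLUNK_HOME"
fi
```

A collection file that stays large while the dashboard reports few or zero objects is the symptom described above.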

If you have an empty collection (zero objects) but a large collection size on disk, a one-time workaround could be to clear the KV store collection:

./splunk clean kvstore -app SA-ITOA -collection itsi_refresh_queue
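If you go that route, a cautious sequence might look like the sketch below. The clean command drops the collection's data, so only run it once you have confirmed the object count really is zero (for example via the health check dashboard), and note that clean operations are typically run with splunkd stopped.

```shell
# Sketch: reclaim disk space from an empty itsi_refresh_queue collection.
# Destructive: only run after confirming the collection holds zero objects.
SPLUNK_HOME="${SPLUNK_HOME:-/opt/splunk}"
if [ -x "$SPLUNK_HOME/bin/splunk" ]; then
  "$SPLUNK_HOME/bin/splunk" stop
  "$SPLUNK_HOME/bin/splunk" clean kvstore -app SA-ITOA -collection itsi_refresh_queue
  "$SPLUNK_HOME/bin/splunk" start
else
  echo "splunk binary not found under $SPLUNK_HOME; adjust SPLUNK_HOME"
fi
```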

Splunk Employee

If this hasn't been resolved yet, I would suggest opening a support case.
