The ITSI capacity planning manual talks about planning for capacity based on the number of entities per KPI.
It is not clear to me how the available abstractions (base searches and service templates) factor into that.
Does anyone know if the number of entities per KPI guidelines apply as written when employing base searches for KPIs and service templates for the services?
I thought that the base search ran once for all services using it, not once per service. Can anyone speak to this?
The ultimate limits are:
First, the cardinality of the shared base search (SBS). This shared base search
... | stats count(kpi1) max(kpi2) last(kpi3) by entity
returns 3 metric results per entity.
If the SBS is used by 3 KPIs per service,
and you have 10 services using those 3 KPIs,
and you split by entity with about 200 entities per service,
then you are looking at 3 x 3 x 10 x 200 = about 18,000 combinations each time the SBS runs.
If that is too much, because the search takes too long or because it hits a stats command limit,
then you may have to create another identical SBS and spread the services across the two, so that each one stays under control.
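As a rough sketch of the arithmetic above (the function name and parameters are made up for illustration, and this is only a back-of-the-envelope estimate, not an ITSI API):

```python
def sbs_combinations(metrics_per_entity, kpis_per_service, services, entities_per_service):
    """Estimate the combinations one shared base search handles per run.

    Hypothetical helper: it simply multiplies the four factors from the
    example above.
    """
    return metrics_per_entity * kpis_per_service * services * entities_per_service


# Numbers from the example: 3 metrics, 3 KPIs, 10 services, ~200 entities each.
total = sbs_combinations(3, 3, 10, 200)
print(total)  # 18000

# Splitting the services evenly across two identical SBS halves the load on each.
per_sbs = sbs_combinations(3, 3, 5, 200)
print(per_sbs)  # 9000
```

Plugging in your own numbers gives a quick way to compare one big SBS against several smaller ones.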
Another limit is that the search takes a long time to run.
If your KPI runs every minute but the search takes more than 60 seconds to complete, you are in trouble.
You may also have to optimize your search or reduce the number of values in the split-by.
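A minimal way to express that check, assuming you already know the search run time (for example from the job inspector); the function name is hypothetical:

```python
def schedule_utilization(runtime_seconds, frequency_minutes):
    """Fraction of the schedule window used by one run of the search.

    Hypothetical helper: a value above 1.0 means the search is still
    running when the next scheduled run is due.
    """
    return runtime_seconds / (frequency_minutes * 60)


# A 75-second search on a 1-minute schedule overruns its window.
print(schedule_utilization(75, 1) > 1.0)  # True

# A 30-second search on a 1-minute schedule uses half of its window.
print(schedule_utilization(30, 1))  # 0.5
```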
Finally, the REST call that creates the entity filter in the SPL query takes a very long time.
I have seen long lists of "host=a OR host=b OR host=c ...." that made the search so massive that it was very slow to parse.
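To get a feel for how fast such a filter grows, here is a small sketch that builds the same kind of OR list and measures it. The host names are invented and this is not how ITSI generates the filter internally; it only illustrates the size of the resulting string:

```python
def build_entity_filter(hosts, field="host"):
    """Join entity names into a 'host=a OR host=b ...' style filter string."""
    return " OR ".join(f"{field}={h}" for h in hosts)


# Small example to show the shape of the filter.
print(build_entity_filter(["a", "b", "c"]))  # host=a OR host=b OR host=c

# 2,000 hypothetical entities named server0000 .. server1999.
hosts = [f"server{i:04d}" for i in range(2000)]
entity_filter = build_entity_filter(hosts)
print(len(entity_filter))  # filter length in characters
```

Every extra entity adds its name plus the " OR " separator, so the search string grows linearly with the number of entities in the filter.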
Thanks @yannK. This makes good sense. In our case, I think the actual issue has more to do with your last bullet point. Specifically, the resulting base search is quite long with all of the host=a OR host=b OR host=c. If the parsing of that (which I'll check) is causing the bottleneck, that's probably part of the answer.
As far as cardinality goes though, how would that affect performance or capacity planning? Our math on a per-base-search basis looks something like 3 metrics x 3 KPIs x 100 services x 40 entities = 36,000 for a given base search. Besides parsing, where else could a bottleneck exist for a base search like this? Note, the actual search (without all the extra stuff at the end) runs well under 30 seconds and only runs on a 15-minute frequency.
Usually, the problems with long lists of entity filters are long search strings, or scripts timing out while waiting for the REST call to return the list.
But if you reach that point, you will be working with a support engineer to tune Splunk.