I want to draw your attention to how Splunk_TA_nix collects CPU utilization data (cpu_metric.sh).
I have been dealing with many false positive alerts regarding CPU usage in our organization.
We have ITSI implemented and use Splunk_TA_nix to collect data.
An alert is generated when two consecutive CPU usage values exceed 90%.
We collect values every 5 minutes.
The script that collects this data (Splunk_TA_nix/bin/cpu_metric.sh) uses the command sar -P ALL 1 1.
This command reports CPU utilization measured over a single 1-second interval.
With our collection schedule (every 5 minutes), we therefore only have information about 1 second out of every 300.
Based on this data we evaluate CPU usage.
CPU usage normally fluctuates depending on when commands are started, how long they run, and how demanding they are.
With this method of measurement, it happens quite often that 2 values cross the threshold in a row. Based on this, an alert is subsequently generated.
For monitoring, however, what matters is the average CPU utilization, not random peaks.
If average values were collected, such false positive alerts would not occur unless the CPU were genuinely overloaded.
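To make the difference concrete, here is a minimal sketch using a made-up workload (not measured data): in every 60-second cycle the CPU is at 100% for 20 seconds and at 10% for the remaining 40. The function name sample_vs_average and the numbers are assumptions chosen only for illustration. It compares a single 1-second sample with the average over a window.

```shell
#!/bin/sh
# Hypothetical workload: in each 60 s cycle, 20 s at 100% CPU and 40 s at 10%.
# $1 = second at which the 1-second sample is taken
# $2 = length of the averaging window in seconds
sample_vs_average() {
    awk -v at="$1" -v window="$2" 'BEGIN {
        for (s = 0; s < window; s++) {
            busy = (s % 60 < 20) ? 100 : 10   # utilization during second s
            if (s == at) sample = busy        # what a 1-second probe would see
            sum += busy
        }
        printf "sample=%d average=%.1f\n", sample, sum / window
    }'
}

sample_vs_average 5 120    # prints: sample=100 average=40.0
sample_vs_average 30 120   # prints: sample=10 average=40.0
```

Depending on the moment the probe happens to run, the 1-second sample reports either 100% or 10%, while the 2-minute average is 40% in both cases.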
The standard way experienced administrators check CPU usage is, for example, sar 120 1, which gives the average CPU usage over 2 minutes. Data collection in sar via cron was once recommended to be set up like this:
*/10 * * * * root /usr/lib64/sa/sa1 -S XALL 600 1
This setup collected the average CPU usage over a 10-minute period, wrote this value to a sar file, and repeated this every 10 minutes.
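As a rough sketch of how such an averaged figure could be read from sar output, the helper below takes sar -u style text on standard input and prints the busy percentage (100 minus %idle) from the "Average:" row for all CPUs. The function name avg_busy is an assumption of this example, and the sample input is illustrative, not a real measurement. It assumes the C locale so that %idle is the last column and uses a decimal point.

```shell
#!/bin/sh
# Reads `sar -u` style output on stdin and prints busy% (100 - %idle)
# taken from the "Average:" row for all CPUs combined.
avg_busy() {
    awk '$1 == "Average:" && $2 == "all" { printf "%.2f\n", 100 - $NF }'
}

# Real use (needs sysstat installed; blocks for 120 seconds):
#   LC_ALL=C sar -u 120 1 | avg_busy

# Illustrative input only, not real measurements:
avg_busy <<'EOF'
12:00:01        CPU     %user     %nice   %system   %iowait    %steal     %idle
12:02:01        all     35.10      0.00      4.40      0.50      0.00     60.00
Average:        all     35.10      0.00      4.40      0.50      0.00     60.00
EOF
```

With the illustrative input above, the helper prints 40.00.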
Such a setting gives a realistic picture of how heavily the CPU is loaded.
Splunk does not provide a reasonable way to set these values in the cpu_metric.sh script.
The only way to solve this is to copy the script and modify it to suit your needs.
However, the connection to Splunk_TA_nix will be lost. What happens when Splunk_TA_nix is upgraded?
My preference is to enable CPU data collection by introducing the following stanza in our application (deployed via the deployment server) which is linked to Splunk_TA_nix.
[script://$SPLUNK_HOME/etc/apps/Splunk_TA_nix/bin/cpu_metric.sh]
disabled = false
index = unix_perfmon_metrics
But this method does not give us the possibility to set OPTIONS for sar.
It would be ideal if something like this could be done:
[script://./bin/my_cpu_metric.sh]
disabled = false
index = unix_perfmon_metrics
where ./bin/my_cpu_metric.sh contains:
exec $SPLUNK_HOME/etc/apps/Splunk_TA_nix/bin/cpu_metric.sh 120 1
But this doesn't work.
For this to work, cpu_metric.sh would need to accept input arguments and use them when calling the sar command.
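A minimal sketch of the kind of argument handling meant here follows. This is a hypothetical example, not the TA's cpu_metric.sh: the function name build_sar_cmd is invented, and it only prints the sar command it would run, so it does not produce the metric output format that Splunk_TA_nix emits. The defaults mirror the current hard-coded sar -P ALL 1 1.

```shell
#!/bin/sh
# Hypothetical sketch, NOT the TA's cpu_metric.sh: shows only how optional
# interval/count arguments could be turned into the sar invocation.
# $1 = interval in seconds (default 1), $2 = count (default 1)
build_sar_cmd() {
    interval="${1:-1}"
    count="${2:-1}"
    # Accept positive integers only, so nothing unexpected reaches sar.
    for n in "$interval" "$count"; do
        case "$n" in
            ''|0|*[!0-9]*)
                echo "interval and count must be positive integers" >&2
                return 1
                ;;
        esac
    done
    echo "sar -P ALL $interval $count"
}

build_sar_cmd 120 1   # prints: sar -P ALL 120 1
build_sar_cmd         # prints: sar -P ALL 1 1
```

With something like this inside the script, a wrapper that calls it with 120 1 would get a 2-minute average, while existing deployments that pass no arguments would keep today's behaviour.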
The same can also be applied to other scripts in this TA.
If you have had similar experiences, feel free to share them. If my concerns are justified, it would be good if this TA were updated to give administrators the ability to set better metrics collection parameters.
What do you think?