Linux monitoring ps.sh for cpu usage > 100% is normalized to 0

taldavita
Explorer

I have the Splunk_TA_nix add-on installed to monitor Linux systems (all VMs). While researching a recent server issue, I found a process running at 500% CPU usage. This is possible because the VM has multiple CPUs.

What I've noticed is that sourcetype=top collects the CPU usage correctly, whereas sourcetype=ps normalizes it: if the usage is under 0 or over 100, it is set to 0.

From ps.sh:
    NORMALIZE='(NR>1) {if ($4<0 || $4>100) $4=0; if ($6<0 || $6>100) $6=0}'
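To show the effect, here is a minimal sketch that runs the same rule against two made-up rows. The header names and sample values are assumptions for illustration only; they are not real output from ps.sh.

```shell
# Same awk rule as quoted from ps.sh above.
NORMALIZE='(NR>1) {if ($4<0 || $4>100) $4=0; if ($6<0 || $6>100) $6=0}'

# Hypothetical sample rows (illustrative header and values, not real ps.sh output).
# Column 4 is the CPU percentage, column 6 the memory percentage.
sample='USER PID PSR pctCPU CPUTIME pctMEM
app 4242 3 500 01:02:03 12.5
app 4243 1 42.0 00:00:10 1.0'

# Apply the rule, then print every record.
normalized=$(printf '%s\n' "$sample" | awk "$NORMALIZE {print}")
printf '%s\n' "$normalized"
```

The row reporting 500 comes back with 0 in column 4, while the row reporting 42.0 passes through unchanged.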

In this case it's a Java container. To figure out which container, I need to look at ARGS, which is collected by ps, not top. So instead of using the results from ps alone, I need to combine top and ps to see the CPU usage history.

Is there a reason for setting CPU usage to 0 when it is greater than 100?

jethompson_splu
Splunk Employee

@taldavita, top and ps obtain the CPU usage they display in different ways, which can make the numbers from the two commands confusing to compare.

To provide a little insight:

ps is based on the accumulated CPU usage since the process started, so its %CPU is a lifetime average (total CPU time / elapsed time).

top reports the (average) CPU usage since the last time it was sampled.
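As a rough illustration of the two calculations, the sketch below uses a hypothetical process. The function names and all numbers are made up for the example, and the top-style figure assumes the value is not divided by the number of CPUs.

```shell
# Lifetime average, ps style: total CPU seconds / seconds since the process started.
lifetime_pct() {
  awk -v cpu="$1" -v elapsed="$2" 'BEGIN { printf "%.1f\n", 100 * cpu / elapsed }'
}

# Interval average, top style: CPU seconds used during the sample window / window length.
interval_pct() {
  awk -v delta="$1" -v window="$2" 'BEGIN { printf "%.1f\n", 100 * delta / window }'
}

# Hypothetical process: alive for 1000 s with 100 CPU-seconds in total,
# of which 50 CPU-seconds were used across several cores in the last 10 s.
ps_style=$(lifetime_pct 100 1000)
top_style=$(interval_pct 50 10)
echo "ps-style: ${ps_style}%  top-style: ${top_style}%"
```

With these numbers the lifetime average is 10.0% while the interval average is 500.0%, so the same process can look idle in ps and very busy in top.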

For reference, see this snippet from man ps:

CPU usage is currently expressed as the percentage of time spent running during the entire lifetime of a process. This is not ideal, and it does not conform to the standards that ps otherwise conforms to. CPU usage is unlikely to add up to exactly 100%.

and from man top:

The task's share of the elapsed CPU time since the last screen update, expressed as a percentage of total CPU time. In a true SMP environment, if 'Irix mode' is Off, top will operate in 'Solaris mode' where a task's cpu usage will be divided by the total number of CPUs. You toggle 'Irix/Solaris' modes with the 'I' interactive command.

So the "normalization" of the ps output is done in an attempt to provide the "active" data that might be represented in top. This is why the CPU usage statistics from the two commands differ. For further insight, the two commands actually run by the scripts are:

From ps.sh:

ps -wweo uname,pid,psr,pcpu,cputime,pmem,rsz,vsz,tty,s,etime,args

From top.sh:

top -bn 1
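To connect the ps format list to the awk rule quoted in the question, this small sketch looks up which keywords sit in positions 4 and 6. It assumes the rule is applied to the columns in exactly this order.

```shell
# Format list copied from the ps command above.
fields='uname,pid,psr,pcpu,cputime,pmem,rsz,vsz,tty,s,etime,args'

# The awk rule touches $4 and $6; find the keywords in those positions.
field4=$(printf '%s\n' "$fields" | cut -d, -f4)
field6=$(printf '%s\n' "$fields" | cut -d, -f6)
echo "\$4 = $field4, \$6 = $field6"
```

Under that assumption, the rule zeroes out-of-range values for both pcpu and pmem, not only CPU usage.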

Hopefully this helps explain why the top and ps output differ, and why the ps script run by the Splunk Unix/Linux TA "normalizes" its data.
