Linux monitoring ps.sh for cpu usage > 100% is normalized to 0

taldavita · ‎08-08-2018

I have the Splunk_TA_nix add-on installed to monitor Linux systems (all VMs). While researching a recent server issue, I found a process running at 500% CPU usage. This is only possible because it's a VM.

What I've noticed is that sourcetype=top collects the CPU usage correctly, whereas sourcetype=ps normalizes it: if the usage is under 0 or over 100, it is set to 0.

From ps.sh:
    NORMALIZE='(NR>1) {if ($4<0 || $4>100) $4=0; if ($6<0 || $6>100) $6=0}'
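To make the effect of that rule concrete, here is a minimal Python sketch that mirrors the awk condition. It is not part of the add-on; the function name and the sample row are hypothetical, and the column order is assumed to match the ps command that ps.sh runs (uname, pid, psr, pcpu, cputime, pmem, ...), so awk's $4 is pcpu and $6 is pmem.

```python
def normalize_ps_fields(fields):
    """Mimic the awk rule: set field 4 and field 6 to 0 when outside 0..100."""
    out = list(fields)
    for idx in (3, 5):  # awk's $4 and $6 are 1-based; Python lists are 0-based
        value = float(out[idx])
        if value < 0 or value > 100:
            out[idx] = "0"
    return out


# Hypothetical data row (the header row is skipped by NR>1 in the awk rule).
row = ["root", "1234", "2", "500", "10:00:00", "12.5",
       "1000", "2000", "?", "S", "02:00:00", "java"]

normalized = normalize_ps_fields(row)
print(normalized[3], normalized[5])
```

With this input the pcpu value of 500 becomes 0 while the pmem value of 12.5 is left alone, which is the behavior described above.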

In this case it's a Java container. To figure out which container, I need to look at the ARGS field, which is collected by ps, not top. So instead of just using the results from ps, I now need to combine both top and ps to see the history of CPU usage.

Is there a reason for setting the CPU usage to 0 when it is greater than 100?

jethompson_splu · ‎08-09-2018

@taldavita, top and ps obtain the CPU usage they display in different ways, which can cause some confusion when comparing the output of the two commands.

To provide a little insight:

ps is based on the accumulated CPU usage (since the process started), where %CPU is an average (total CPU time / elapsed time).

top reports the (average) CPU usage since the last time it was sampled.
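As a rough illustration of the two definitions above, here is a small Python sketch. The function names and all numbers are made up for the example; it does not read real process data and is not how either tool is implemented.

```python
def lifetime_cpu_percent(cpu_seconds, elapsed_seconds):
    """ps-style figure: CPU time accumulated over the whole process lifetime."""
    return 100.0 * cpu_seconds / elapsed_seconds


def interval_cpu_percent(cpu_before, cpu_after, interval_seconds):
    """top-style figure: CPU time used since the previous sample."""
    return 100.0 * (cpu_after - cpu_before) / interval_seconds


# Hypothetical process: 10 hours old, 3600 CPU-seconds used so far,
# then 5 busy threads burn 15 CPU-seconds during a 3-second sample window.
ps_style = lifetime_cpu_percent(3600 + 15, 36000 + 3)
top_style = interval_cpu_percent(3600, 3615, 3)

print(round(ps_style, 1), top_style)
```

In this made-up case the interval figure is 500% while the lifetime average stays near 10%, so the same process can look very different in the two sourcetypes.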

For reference, see this snippet from man ps

CPU usage is currently expressed as the percentage of time spent running during the entire lifetime of a process. This is not ideal, and it does not conform to the standards that ps otherwise conforms to. CPU usage is unlikely to add up to exactly 100%.

and from man top

The task's share of the elapsed CPU time since the last screen update, expressed as a percentage of total CPU time. In a true SMP environment, if 'Irix mode' is Off, top will operate in 'Solaris mode' where a task's cpu usage will be divided by the total number of CPUs. You toggle 'Irix/Solaris' modes with the 'I' interactive command.
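The Irix/Solaris distinction in that man page excerpt comes down to one division. A minimal sketch, with a hypothetical function name and a CPU count chosen only for the example:

```python
def solaris_mode_percent(irix_percent, cpu_count):
    """Convert a per-CPU-scaled (Irix mode) figure to a share of all CPUs."""
    return irix_percent / cpu_count


# A task shown at 500% in Irix mode on a hypothetical 8-CPU machine.
print(solaris_mode_percent(500.0, 8))  # 62.5
```

So a value above 100% in Irix mode simply means the task is using more than one CPU's worth of time.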

So the "normalization" of the ps output is done in an attempt to provide the "active" data that might be represented in top. This is why the CPU usage statistics of the two commands differ. For further insight, the two commands actually being run by the scripts are:

From ps.sh:

ps -wweo uname,pid,psr,pcpu,cputime,pmem,rsz,vsz,tty,s,etime,args

From top.sh:

top -bn 1

Hopefully this helps explain why the top and ps outputs differ and why the ps script run by the Splunk Unix/Linux TA "normalizes" its data.
