
Splunk On Splunk performance measurements for universal forwarders are showing constant CPU percentage

Communicator

I have a number of Universal Forwarders running on 64-bit Linux systems that forward the data from TA-SoS to an indexer running on Windows.

The "ps" output for all my forwarders shows almost constant values, and the same CPU percentage across different machines (with varying numbers of cores).

When running

index=sos sourcetype="ps" | multikv | where COMMAND=="splunkd" | timechart range(pctCPU) by host

I get flat lines for all forwarders. My indexer runs the Windows ps script, and I do get varying measurements for it. If I go back to when a forwarder started, its CPU% shows a peak and then drops to the constant value. Currently the forwarders report a constant 0.4% CPU over several days.

I also have scripts that use top to monitor all applications on the machines, including splunkd, and they show CPU% varying between roughly 0.1% and 1.5% depending on traffic.

Why am I not getting correct CPU% measurements from TA-sos?

edit:
I read up on ps and what it does, and the difference seems to come from how ps and top work:
http://unix.stackexchange.com/questions/58539/top-and-ps-not-showing-the-same-cpu-result

Essentially, ps only measures lifetime CPU usage, while top samples over an interval. Perhaps the forwarders' CPU usage simply varies too little for the lifetime value to change? This makes me wonder how useful it is for detecting CPU spikes on forwarders.
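To illustrate why the lifetime figure barely moves, here is a small simulation. It assumes a hypothetical forwarder that has been up for three days at a steady 0.4% CPU and then has a one-minute burst at 5%; the numbers are made up for illustration, not measurements from my systems.

```python
# Illustrative simulation: a lifetime average (what ps reports) versus a
# per-interval sample (what top reports). All numbers are hypothetical.


def lifetime_pct(cpu_seconds: float, elapsed_seconds: float) -> float:
    """CPU time divided by wall-clock time since the process started."""
    return 100.0 * cpu_seconds / elapsed_seconds


def interval_pct(cpu_delta: float, interval_seconds: float) -> float:
    """CPU time consumed during one sampling interval, as a percentage."""
    return 100.0 * cpu_delta / interval_seconds


uptime = 3 * 24 * 3600          # three days of uptime, in seconds
cpu_used = 0.004 * uptime       # steady 0.4% CPU for the whole period
before = lifetime_pct(cpu_used, uptime)

# A 60-second burst at 5% CPU on top of that history.
burst_seconds = 60
burst_cpu = 0.05 * burst_seconds
after = lifetime_pct(cpu_used + burst_cpu, uptime + burst_seconds)

print(f"lifetime before burst: {before:.4f}%")
print(f"lifetime after burst:  {after:.4f}%")
print(f"as ps would round it:  {after:.1f}%")
print(f"interval during burst: {interval_pct(burst_cpu, burst_seconds):.4f}%")
```

The lifetime value moves from 0.4000% to about 0.4011%, which still rounds to 0.4% at the one-decimal precision ps prints, while the interval sample reads 5%. The longer the process has been running, the less any single burst can shift the lifetime average.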

1 Solution

Splunk Employee

I think you nailed it with your latest edit. From the man page of /usr/bin/ps:

CPU usage is currently expressed as the percentage of time spent running *during the entire lifetime of a process*.

From the man page of /usr/bin/top:

k: %CPU -- CPU usage

The task’s share of the elapsed CPU time *since the last screen update*, expressed as a percentage of total CPU time.
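A minimal Linux sketch of both calculations, reading /proc directly instead of calling ps or top. It assumes the standard /proc/[pid]/stat layout described in proc(5) (utime, stime and starttime in clock ticks, fields 14, 15 and 22) and that the first value in /proc/uptime is system uptime in seconds. The function names are my own, and this approximates what the tools compute rather than reproducing their source. Like top in its default Irix mode, the interval figure is relative to a single CPU, so a process busy on two cores can read about 200%.

```python
import os
import sys
import time


def parse_stat(text: str) -> tuple[int, int, int]:
    """Return (utime, stime, starttime) in clock ticks from a /proc/<pid>/stat line."""
    # Field 2 (the command name) is wrapped in parentheses and may contain
    # spaces, so skip past the last ')' before splitting the remaining fields.
    rest = text[text.rindex(")") + 2:].split()
    # rest[0] is field 3 (state), so field N is rest[N - 3].
    return int(rest[11]), int(rest[12]), int(rest[19])


def read_stat(pid: int) -> tuple[int, int, int]:
    with open(f"/proc/{pid}/stat") as f:
        return parse_stat(f.read())


def read_uptime() -> float:
    with open("/proc/uptime") as f:
        return float(f.read().split()[0])


def lifetime_cpu_pct(pid: int) -> float:
    """ps-style: total CPU time since start divided by time since start."""
    hz = os.sysconf("SC_CLK_TCK")
    utime, stime, starttime = read_stat(pid)
    elapsed = read_uptime() - starttime / hz
    return 100.0 * (utime + stime) / hz / elapsed if elapsed > 0 else 0.0


def interval_cpu_pct(pid: int, interval: float = 5.0) -> float:
    """top-style: CPU time used during one sampling interval."""
    hz = os.sysconf("SC_CLK_TCK")
    u1, s1, _ = read_stat(pid)
    t1 = time.monotonic()
    time.sleep(interval)
    u2, s2, _ = read_stat(pid)
    t2 = time.monotonic()
    return 100.0 * ((u2 + s2) - (u1 + s1)) / hz / (t2 - t1)


if __name__ == "__main__":
    pid = int(sys.argv[1]) if len(sys.argv) > 1 else os.getpid()
    print(f"lifetime (ps-style):  {lifetime_cpu_pct(pid):.2f}%")
    print(f"interval (top-style): {interval_cpu_pct(pid):.2f}%")
```

Run against the splunkd PID (for example `python3 cpu_compare.py <pid>`, where the script name is hypothetical), the first number should track the flat ps value and the second should move with load, like top.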

Splunk Employee

ps_sos.ps1 fetches per-process CPU usage from WMI:

$pctCPU = get-wmiobject Win32_PerfFormattedData_PerfProc_Process -Filter "IDProcess = $myPID" | select -expand PercentProcessorTime

I believe this yields usage over the sample period (5 s by default), which of course makes spikes much more noticeable.
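For comparison, the interval arithmetic behind a formatted counter like this can be done by hand from two raw samples. The sketch below is only that arithmetic: it assumes you have already collected PercentProcessorTime and Timestamp_Sys100NS from Win32_PerfRawData_PerfProc_Process twice (both in 100-nanosecond units), and the sample values are made up. As I understand it, the per-process counter is not normalized by core count and can exceed 100% on multi-core machines; the optional division by logical processors is an assumption about how you might want to compare hosts with different core counts.

```python
def raw_counter_pct(cpu1: int, ts1: int, cpu2: int, ts2: int,
                    logical_cpus: int = 1) -> float:
    """Percent CPU between two raw samples (100 ns timer-style arithmetic).

    cpu1, cpu2: raw PercentProcessorTime values (100 ns units of CPU time)
    ts1, ts2:   matching Timestamp_Sys100NS values (100 ns units of wall time)
    logical_cpus: divide by this to express usage as a share of the whole host
    """
    elapsed = ts2 - ts1
    if elapsed <= 0:
        raise ValueError("second sample must be later than the first")
    return 100.0 * (cpu2 - cpu1) / elapsed / logical_cpus


# Hypothetical samples 5 s apart (5 s = 50,000,000 units of 100 ns), during
# which the process used 0.5 s of CPU time (5,000,000 units).
print(raw_counter_pct(0, 0, 5_000_000, 50_000_000))                  # 10.0
print(raw_counter_pct(0, 0, 5_000_000, 50_000_000, logical_cpus=4))  # 2.5
```

Because each value covers only the window between two samples, a short burst shows up at full size, which matches the spikier Windows graphs.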


Communicator

Yes, but good to get confirmation from someone else as well.

I wonder if the Windows ps_sos script handles this the same way.

I would probably prefer that the script used top, but perhaps there are portability or other reasons for the choice of data source. Measuring CPU usage isn't straightforward, I guess.
