I have a number of Universal Forwarders running on 64-bit Linux systems, forwarding the data from TA-SoS to an indexer running on Windows.
The "ps" output for my forwarders shows almost constant values, and the same CPU percentage across different machines (with varying numbers of cores).
When running
index=sos sourcetype="ps" | multikv | where COMMAND=="splunkd" | timechart range(pctCPU) by host
I get flat lines for all forwarders. My indexer is running the Windows PS script, and I get measurements for that. If I go back to when a forwarder started, the CPU% shows a peak but then settles to the constant value. Currently the forwarders claim to use a constant 0.4% CPU over several days.
I also have scripts running that use `top` to monitor all applications on the machines, including splunkd, and they show CPU% varying between roughly 0.1 and 1.5 depending on traffic.
Why am I not getting correct CPU% measurements from TA-SoS?
edit:
I read up on `ps` and what it does, and there seems to be a difference in how it and `top` work:
http://unix.stackexchange.com/questions/58539/top-and-ps-not-showing-the-same-cpu-result
Essentially, `ps` only reports CPU usage averaged over the lifetime of the process, while `top` samples over a short interval. Perhaps the forwarders simply vary too little in CPU usage for the lifetime value to change? This makes me wonder how useful it is for detecting spikes in CPU usage on forwarders.
I think you nailed it with your latest edit. From the man page of `/usr/bin/ps`:
CPU usage is currently expressed as the percentage of time spent running **during the entire lifetime of a process**.
From the man page of `/usr/bin/top`:
k: %CPU -- CPU usage
The task’s share of the elapsed CPU time **since the last screen update**, expressed as a percentage of total CPU time.
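To make the difference concrete, here is a minimal sketch of the two calculations. The numbers are made up for illustration (a process that has averaged 0.4% over roughly three days and then bursts to 50% for 5 seconds), and the function names `lifetime_pct` and `interval_pct` are my own, not anything from `ps`, `top`, or TA-SoS.

```python
from typing import Tuple


def lifetime_pct(cpu_seconds: float, elapsed_seconds: float) -> float:
    """ps-style figure: total CPU time divided by time since process start."""
    return 100.0 * cpu_seconds / elapsed_seconds


def interval_pct(prev: Tuple[float, float], curr: Tuple[float, float]) -> float:
    """top-style figure: CPU time used between two samples, divided by the
    wall-clock time between those samples."""
    (t0, cpu0), (t1, cpu1) = prev, curr
    return 100.0 * (cpu1 - cpu0) / (t1 - t0)


# Hypothetical samples: (seconds since process start, cumulative CPU seconds).
before = (250000.0, 1000.0)   # ~2.9 days of uptime, 0.4% average so far
after = (250005.0, 1002.5)    # 5 s later, after using 2.5 s of CPU (a 50% burst)

if __name__ == "__main__":
    print(f"lifetime before burst: {lifetime_pct(before[1], before[0]):.3f}%")
    print(f"lifetime after burst:  {lifetime_pct(after[1], after[0]):.3f}%")
    print(f"interval during burst: {interval_pct(before, after):.3f}%")
```

With these assumed numbers the lifetime figure barely moves (0.400% to about 0.401%) while the interval figure reads 50%, which would explain the flat lines for long-running forwarders.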
`ps_sos.ps1` fetches per-process CPU usage from WMI:
$pctCPU = get-wmiobject Win32_PerfFormattedData_PerfProc_Process -Filter "IDProcess = $myPID" | select -expand PercentProcessorTime
I believe that this yields usage over the sample period (5s by default), which makes spikiness a lot more noticeable of course.
Yes, but good to get confirmation from someone else as well.
I wonder if the Windows ps_sos script handles this the same way.
I would probably prefer that the script used top, but perhaps there are portability or other reasons for the choice of data source. Measuring CPU usage isn't straightforward, I guess.
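For what it's worth, a sampled figure can also be obtained on Linux without parsing `top` output, by reading the cumulative CPU ticks from `/proc/<pid>/stat` twice and dividing the difference by the elapsed time. The following is only a rough sketch of that idea, not how TA-SoS works; it is Linux-only, the function names are hypothetical, and the result is a percentage of one core.

```python
import os
import time


def cpu_ticks_from_stat(stat_line: str) -> int:
    """Return utime + stime (fields 14 and 15 of /proc/<pid>/stat) in clock ticks."""
    # The command name (field 2) may contain spaces or parentheses,
    # so split after the last ')' and count fields from there (field 3 = index 0).
    fields = stat_line.rsplit(")", 1)[1].split()
    utime, stime = int(fields[11]), int(fields[12])
    return utime + stime


def read_cpu_ticks(pid: int) -> int:
    """Read the cumulative CPU ticks for a process from /proc (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        return cpu_ticks_from_stat(f.read())


def sampled_pct(pid: int, interval: float = 5.0) -> float:
    """CPU usage over `interval` seconds, as a percentage of one core."""
    ticks_per_second = os.sysconf("SC_CLK_TCK")
    ticks_before = read_cpu_ticks(pid)
    started = time.monotonic()
    time.sleep(interval)
    ticks_after = read_cpu_ticks(pid)
    elapsed = time.monotonic() - started
    return 100.0 * (ticks_after - ticks_before) / ticks_per_second / elapsed


if __name__ == "__main__":
    # Sample this script's own process for one second as a smoke test.
    print(f"{sampled_pct(os.getpid(), interval=1.0):.1f}%")
```

The 5-second default interval is an arbitrary choice here; a shorter interval shows spikes more sharply but is noisier.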