Given sourcetype=ps and sourcetype=top, both of which contain a pctCPU field, how do I use pctCPU from top only while still using fields unique to ps? (Despite the identical field name, the values in these two sources represent very different things.)
In the Splunk Add-on for *nix, for example, ps and top both contain the fields PID, COMMAND and pctCPU. (They share some other field names of interest which I will not use in this example.) As @Paolo Prigione pointed out many years ago, pctCPU in ps is not useful for monitoring. (https://answers.splunk.com/answers/27398/is-nix-sourcetype-ps-pctcpu-really-suitable-for-charting-oo...) In the simplest use case, pctCPU in top would give the instantaneous CPU usage of each process. However, COMMAND in top only gives a simple program name, which is insufficient for my purposes. (In the old *nix app for Splunk, ps' COMMAND includes full arguments; in the Splunk Add-on for *nix, ps has a separate ARGS field.)
Conceivably I can associate top's pctCPU values with ps' app (the combination of COMMAND and ARGS in the new Splunk Add-on for *nix) by joining a top search with a ps search. This looks very wasteful, however. So I thought I would tackle it with a single simple search, then eliminate the values from ps.
index=os (sourcetype=ps OR sourcetype=top)
| bucket _time span=1m
| stats values(eval(if(sourcetype="ps",app,COMMAND))) as app values(eval(if(sourcetype="top",pctCPU,null()))) as pctCPU by _time PID
(bucket _time is necessary because, though launched at the same frequency, the two sources often have sub-minute stagger.) This works for every process that appears in both sources. However, ps and top do not always survey the same processes, even when they are launched within a second of each other, so some processes captured by ps will not show in top in the same time interval, and vice versa. As a result, the above strategy gives null values when a process is in ps only. I want to fill these gaps with values from ps because, for these extremely short-lived processes, pctCPU from ps has the same significance as that from top.
In other words, I want to discard the value of pctCPU from ps when top is available, but use the value from ps when it is not. (The first term in the example, values(eval(if(sourcetype="ps",app,COMMAND))) as app, is in reality the output of a much more sophisticated macro. That output can cause gaps when a process is only in ps but missing from top.)
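To check the intended result outside Splunk, the selection rule can be sketched in Python (this is not SPL). The event dictionaries, the function name prefer_top and the 60-second span are assumptions made for illustration; only the field names come from the question.

```python
from collections import defaultdict


def prefer_top(events, span=60):
    """For each (time bucket, PID), keep pctCPU from top if present, else from ps."""
    buckets = defaultdict(dict)
    for ev in events:
        # Align staggered samples to a common bucket, like `bucket _time span=1m`.
        key = (ev["_time"] - ev["_time"] % span, ev["PID"])
        # A later event from the same source in the same bucket overwrites the earlier one.
        buckets[key][ev["sourcetype"]] = ev["pctCPU"]
    result = {}
    for key, by_source in buckets.items():
        # top wins when it exists; ps only fills the gap.
        result[key] = by_source["top"] if "top" in by_source else by_source.get("ps")
    return result


# Hypothetical sample data: PID 1 is seen by both sources, PID 2 by ps only.
events = [
    {"_time": 120, "PID": 1, "sourcetype": "ps", "pctCPU": 0.3},
    {"_time": 137, "PID": 1, "sourcetype": "top", "pctCPU": 12.5},
    {"_time": 121, "PID": 2, "sourcetype": "ps", "pctCPU": 4.0},
]
selected = prefer_top(events)
```

With this sample, PID 1 should end up with the value from top and PID 2 with the value from ps, which is the behaviour the search is meant to reproduce.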
@woodcock's introduction of coalesce made me search for an alternative statement of the problem. Here is one clunky solution:
index=os (sourcetype="top" OR sourcetype=ps)
| eval pctCPU=sourcetype.pctCPU
| bucket _time span=1m
| stats values(pctCPU) as pctCPU latest(eval(if(sourcetype="ps",app,COMMAND))) as app by _time PID host
| eval pctCPU=replace(if(match(pctCPU,"top"),mvfilter(match(pctCPU,"top")),pctCPU),"[stop]+","")
Effectively, label pctCPU with its source, then filter the desired values by label according to the pseudo code, and finally strip the label. ((ps|top) would be more efficient, but [stop]+ or [tops]+ makes the better sound bite.)
It is noisy in terms of code efficiency, and that span=1m is a very bad approximation. (There should be better methods to tidy up small stagger.) I hope for better, but I'll take this for the time being.
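The label, filter, strip sequence can be sketched in Python as follows (again, not SPL). The function name pick_labelled and the sample strings are hypothetical; the labels mimic what eval pctCPU=sourcetype.pctCPU produces. This sketch strips the label with the anchored (ps|top) pattern rather than the character class.

```python
import re


def pick_labelled(values):
    """values: labelled strings such as 'top12.5' or 'ps0.3' for one (_time, PID) group."""
    top_only = [v for v in values if v.startswith("top")]
    # Keep only top values when at least one exists; otherwise keep whatever ps gave.
    kept = top_only if top_only else values
    # Strip the source label from the front of each remaining value.
    return [re.sub(r"^(ps|top)", "", v) for v in kept]


both = pick_labelled(["ps0.3", "top12.5"])
ps_only = pick_labelled(["ps4.0"])
```

Anchoring the pattern at the start of the value means only the label itself is removed, whereas a character class would remove any run of those letters wherever it occurs.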
Have you considered the Nmon app? You may be able to accomplish what you're looking for and more vs the nix app.
Thanks for the suggestion, @stmyers7941. Though I am keenly aware of the pains induced by the *nix app, the option is not mine to pick. That said, the general method could have other use cases whenever field names are overloaded.
The label-and-filter search works well as a solution to the stated generalised question. But there's a big caveat as to its suitability for fixing the *nix app. In GNU top, the default (which is how top.sh calls it) is Irix mode, in which the percentage is calculated against a single core. For this data to be useful, therefore, one must divide the number by the number of cores. But then, I haven't determined how GNU ps handles pcpu. Is it calibrated against a single core or against all cores? I'll post the outcome in the other thread. In any case, I would really like to see the *nix app fixed at the source, as I suggested in https://answers.splunk.com/answers/117872/for-splunk-add-on-for-linux-why-do-we-need-both-ps-and-top....
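As a worked example of the Irix-mode adjustment, here is a small Python sketch. The function name and the sample numbers are made up for illustration, and it assumes the reading really was taken in Irix mode; whether the ps value needs the same treatment is the open question raised above.

```python
def irix_to_whole_machine(pct_cpu, n_cores):
    """Convert an Irix-mode reading (percent of one core) to percent of the whole machine."""
    if n_cores < 1:
        raise ValueError("n_cores must be at least 1")
    return pct_cpu / n_cores


# A process shown at 250% in Irix mode on a hypothetical 4-core host.
whole_machine = irix_to_whole_machine(250.0, 4)
```

On that hypothetical 4-core host, a 250% reading corresponds to 62.5% of total capacity.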
If you are going with this answer (note that I modified my solution yet again), then you should click "Accept".
@woodcock I'm going with this. After some investigation, I realise that field name overload is a cardinal sin that we shouldn't commit in the first place. So I'm really trying to solve an artificial problem. Still, your methods really expanded my Splunk vocabulary. (xyseries is something I have wanted for some other problems.)
An alternative statement of the problem could be: how do I ask Splunk to perform the following pseudo code?
Ignore pctCPU from sourcetype=ps IF output from sourcetype=top exists for that PID in that sample period (every 5 minutes, though it wavers from period to period and from sourcetype to sourcetype)
Ignore COMMAND from sourcetype=top IF output from sourcetype=ps exists for that PID in that sample period
Like this:
index=os (sourcetype=ps OR sourcetype=top)
| bucket _time span=1m
| chart latest(pctCPU) over _time by sourcetype
| eval pctCPU=coalesce(top, ps)
At this point, each value of _time (each minute) has a pctCPU value that uses sourcetype top in preference to sourcetype ps. Tack on the rest of what you need after that.
@woodcock Thanks for the reply. I need the result by PID so I can show consumption of each process over time.
OK, then do this:
index=os (sourcetype=ps OR sourcetype=top)
| bucket _time span=1m
| chart latest(pctCPU) over _time by sourcetype PID
| eval pctCPU=coalesce(top, ps)
I mean, Splunk won't allow two groupings in chart when over is used. I have already permuted through these.
OK, then try this:
index=os (sourcetype=ps OR sourcetype=top)
| bucket _time span=1m
| stats latest(pctCPU) AS pctCPU by sourcetype PID _time
| eval combo=sourcetype . ":" . PID
| xyseries _time combo pctCPU
| foreach top* [ eval "pctCPU<<MATCHSTR>>"=coalesce('<<FIELD>>', 'ps<<MATCHSTR>>') ]