Solved: Re: How to cherry pick values from different sourc...

yuanliu · ‎10-20-2015

Given sourcetype=ps and sourcetype=top, both of which contain pctCPU, how do I take pctCPU from top only while using fields unique to ps? (Despite the identical field name, the values in these two sources represent very different things.)

In Splunk Add-on for Nix, for example, ps and top both contain fields PID, COMMAND and pctCPU. (They share some other field names of interest which I will not use in this example.) As @Paolo Prigione pointed out many years ago, pctCPU in ps is not useful for monitoring. (https://answers.splunk.com/answers/27398/is-nix-sourcetype-ps-pctcpu-really-suitable-for-charting-oo...) In the simplest use case, pctCPU in top would give the instantaneous CPU usage of each process. However, COMMAND in top only gives a simple program name, which is insufficient for my purposes. (In the old nix for Splunk, ps' COMMAND includes full arguments; in Splunk Add-on for Nix, ps has a separate ARGS field.)

Conceivably I can associate top's pctCPU values with ps' app (combination of COMMAND and ARGS in the new Splunk Add-on for Nix) by joining a top search with a ps search. This looks very wasteful, however. So I thought I would tackle it by a simple search, then eliminate values from ps.

index=os (sourcetype=ps OR sourcetype=top)
| bucket _time span=1m
| stats values(eval(if(sourcetype="ps",app,COMMAND))) as app values(eval(if(sourcetype="top",pctCPU,null()))) as pctCPU by _time PID

(bucket _time is necessary because, though launched with the same frequency, the two sources often have sub-minute stagger.) This works for all processes output from ps. However, as ps and top do not always survey the same processes even when they are launched within a subsecond, some processes captured by ps will not show in top of the same time interval, and vice versa. As a result, the above strategy gives null values when the process is in ps only. I want to fill these gaps with values from ps, because for these extremely momentary processes, pctCPU from ps has the same significance as that from top.

In other words, I want to eliminate the value of pctCPU from ps when top is available, but use the value from ps when it is not. (The first term in the example, the values(...) as app aggregation, is a much more sophisticated macro output in reality. That output can cause gaps when a process is only in ps but missing from top.)

yuanliu · ‎10-21-2015

@woodcock's introduction of coalesce makes me search for alternative statement of the problem. Here is one clunky solution:

index=os (sourcetype="top" OR sourcetype=ps)
| eval pctCPU=sourcetype.pctCPU
| bucket _time span=1m
| stats values(pctCPU) as pctCPU latest(eval(if(sourcetype="ps",app,COMMAND) as app
 by _time PID host
| eval pctCPU=replace(if(match(pctCPU,"top"),mvfilter(match(pctCPU,"top")),pctCPU),"[stop]+","")

Effectively, label pctCPU from different sources, then filter desired values by label based on the pseudo code; get rid of the label lastly. ( (ps|top) would be more efficient, but [stop]+ or [tops]+ has the sound byte.)

It is noisy in terms of code efficiency, and that span=1m is a very bad approximation. (There should be better methods to tidy up small stagger.) I hope for better, but I'll take this for the time being.

View solution in original post

stmyers7941 · ‎10-21-2015

Have you considered the Nmon app? You may be able to accomplish what you're looking for and more vs the nix app.

yuanliu · ‎10-21-2015

Thanks for the suggestion, @stmyers7941. Though keenly aware of the pains induced by the *nix app, the option is not mine to pick. That said, the general method could have other use cases when field name overload happens.

yuanliu · ‎10-21-2015

@woodcock's introduction of coalesce makes me search for alternative statement of the problem. Here is one clunky solution:

index=os (sourcetype="top" OR sourcetype=ps)
| eval pctCPU=sourcetype.pctCPU
| bucket _time span=1m
| stats values(pctCPU) as pctCPU latest(eval(if(sourcetype="ps",app,COMMAND))) as app by _time PID host
| eval pctCPU=replace(if(match(pctCPU,"top"),mvfilter(match(pctCPU,"top")),pctCPU),"[stop]+","")

Effectively: label pctCPU with its source, filter the desired values by label according to the pseudo code, and finally strip the label. (Using (ps|top) as the pattern would be more efficient, but [stop]+ or [tops]+ has the sound bite.)

It is noisy in terms of code efficiency, and that span=1m is a very bad approximation. (There should be better methods to tidy up small stagger.) I hope for better, but I'll take this for the time being.

yuanliu · ‎10-21-2015

The above works well as a solution to the stated generalised question. But there's a big caveat as to suitability for fixing the nix app. In GNU top, the default (which is how top.ps calls it) is Irix mode, in which the percentage is calculated against a single core. For this data to be useful, therefore, one must divide the number by the number of cores. But then, I haven't determined how GNU ps handles pcpu. Is it calibrated against a single core or against all cores? I'll post the outcome in the other thread. In any case, I would really like to see the *nix app fixed at the source, as I suggested in https://answers.splunk.com/answers/117872/for-splunk-add-on-for-linux-why-do-we-need-both-ps-and-top....
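
To make the Irix-mode caveat concrete, here is a minimal sketch of the division described above. It is only an illustration: cpu_count and pctCPU_of_host are hypothetical field names, and the thread does not say where a core count would come from. The values are made up with makeresults so the search runs without any indexed data.

```spl
| makeresults
| eval pctCPU=250.0, cpu_count=4
| eval pctCPU_of_host=round(pctCPU / cpu_count, 1)
```

With these sample numbers, a process that top reports at 250% in Irix mode (two and a half cores busy) comes out as 62.5% of a four-core host.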

woodcock · ‎10-22-2015

If you are going with this answer (note that I modified my solution yet again), then you should click "Accept".

yuanliu · ‎10-26-2015

@woodcock I'm going with this. After some investigation, I realise that field name overload is a cardinal sin that we shouldn't commit in the first place. So I'm really trying to solve an artificial problem. Still, your methods really expanded my Splunk vocabulary. (xyseries is something I have wanted for some other problems.)

yuanliu · ‎10-21-2015

An alternative statement of the problem could be: How to ask Splunk to perform the following pseudo code:

discard pctCPU from sourcetype=ps IF output from sourcetype=top exists for that PID in that sample period (every 5 minutes, but the timing wavers from period to period and from sourcetype to sourcetype)
discard COMMAND from sourcetype=top IF output from sourcetype=ps exists for that PID in that sample period
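
One possible reading of this pseudo code in SPL, offered as a sketch rather than as any poster's solution: keep each source's value in its own column per time bucket and PID, then let coalesce pick the preferred one. The names top_pctCPU, ps_pctCPU, ps_app and top_COMMAND are introduced here purely for illustration, and app is assumed to exist already on ps events (the question describes it as macro output). The span=1m bucket is carried over from the question, with the same caveat about stagger between the two sources.

```spl
index=os (sourcetype=ps OR sourcetype=top)
| bucket _time span=1m
| stats latest(eval(if(sourcetype="top", pctCPU, null()))) as top_pctCPU
        latest(eval(if(sourcetype="ps", pctCPU, null()))) as ps_pctCPU
        latest(eval(if(sourcetype="ps", app, null()))) as ps_app
        latest(eval(if(sourcetype="top", COMMAND, null()))) as top_COMMAND
        by _time PID host
| eval pctCPU=coalesce(top_pctCPU, ps_pctCPU), app=coalesce(ps_app, top_COMMAND)
| fields _time PID host app pctCPU
```

Because coalesce returns its first non-null argument, pctCPU falls back to the ps value only when top reported nothing for that PID in the bucket, and app falls back to top's COMMAND only when ps reported nothing.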

woodcock · ‎10-21-2015

Like this:

index=os (sourcetype=ps OR sourcetype=top)
| bucket _time span=1m
| chart latest(pctCPU) over _time by sourcetype
| eval pctCPU=coalesce(top, ps)

At this point, each value for _time (each minute) has a value for pctCPU that uses sourcetype top in preference to sourcetype ps. Tack on the rest of what you need after that.

yuanliu · ‎10-21-2015

@woodcock Thanks for the reply. I need the result by PID so I can show consumption of each process over time.

woodcock · ‎10-21-2015

OK, then do this:

index=os (sourcetype=ps OR sourcetype=top)
| bucket _time span=1m
| chart latest(pctCPU) over _time by sourcetype PID
| eval pctCPU=coalesce(top, ps)

yuanliu · ‎10-21-2015

I mean, Splunk won't allow two groupings in chart when over is used. I have already permuted through these.

woodcock · ‎10-21-2015

OK, then try this:

index=os (sourcetype=ps OR sourcetype=top)
| bucket _time span=1m
| stats latest(pctCPU) AS pctCPU by sourcetype PID _time
| eval combo=sourcetype . ":" . PID
| xyseries _time combo pctCPU
| foreach top:* [ eval "pctCPU:<<MATCHSTR>>"=coalesce('<<FIELD>>', 'ps:<<MATCHSTR>>') ]
