What is gained by having both ps and top collected separately by the forwarder?
Could they be merged, are people typically picking one or the other?
I'm noticing that the resulting data collected seems pretty redundant, and can consume quite a bit of index time, if you are looking at a number of hosts.
Your question is a great example of data visibility versus data usefulness.
You know that ps and top are different, that they both consume a lot of resources, and that they both have their uses. However, from what I've seen, most users pick one or the other purely out of personal preference, unless some specific issue requires the other.
If you're using the TA for unix to feed the Unix app, then a better question might be - what do I lose from the app when I eliminate one or the other?
I think your focus should be - I can get all of this data, but what does it do for me? What are my goals for Splunk indexing and analysis? What am I looking for? What do I want to report on? What do I want to alert on?
Splunk can index and make available gobs and gobs of data, but that is not the point of implementing Splunk. The point of implementing Splunk is to make that data do something for you. What do you want to do with your data?
The output of ps and top is slightly different. ps gives you far more detail on the processes that are running on your system, including the args provided to the command, as well as the TTY they are running in. top gives you NICE and elapsed time. I do agree, though, that there is almost total overlap aside from these differences, so there might not be much harm in running only one or the other, depending on your use case.
Combining them would be a bit hairy, since we would have to merge and deduplicate the output of both commands and do so for all the platforms that the TA supports (Unix, Solaris, AIX, MacOS, etc).
Not only is there redundancy, but an identical field name can mean totally different things in the two, as is the case with pctCPU. It is really a poor choice of field name on the app's part. (As of 6.2.)
In fact, the only data unique to top as a utility are interval data, as opposed to snapshot values. (Instantaneous and cumulative values, respectively exemplified by NICE and elapsed time, are both snapshot values. Both examples are retrievable via the ps command, at least in modern GNU, with -o nice,etime.) %CPU in top (represented as a field in sourcetype=top) is an interval value. This is totally different from the cumulative value -o pcpu given by the ps utility (also represented as a field in sourcetype=ps). The three load averages that top displays are also unavailable in ps because, though cumulative, they are aggregates rather than per-process values. These are captured in sourcetype=vmstat via the uptime utility.
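To make the interval-versus-cumulative distinction concrete, here is a minimal Python sketch. It assumes the definitions from the GNU man pages: ps computes pcpu as CPU time used divided by the time the process has been running, while top reports the share of CPU used since its last refresh. The function names and the sample process are hypothetical, made up purely for illustration.

```python
def cumulative_pcpu(cpu_seconds, elapsed_seconds):
    """Lifetime average in percent: total CPU time / total running time (ps-style)."""
    if elapsed_seconds <= 0:
        return 0.0
    return 100.0 * cpu_seconds / elapsed_seconds


def interval_pcpu(cpu_before, cpu_after, interval_seconds):
    """Recent usage in percent: CPU time consumed during one sampling interval (top-style)."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return 100.0 * (cpu_after - cpu_before) / interval_seconds


# Hypothetical process: it used 36 s of CPU over its first hour (3600 s),
# then pinned one core for the last 3 s.
lifetime = cumulative_pcpu(36.0 + 3.0, 3600.0 + 3.0)
recent = interval_pcpu(36.0, 39.0, 3.0)

print(f"ps-style pcpu : {lifetime:.1f}%")
print(f"top-style %CPU: {recent:.1f}%")
```

The same process at the same moment yields roughly 1% by one definition and 100% by the other, which is why a shared field name for both is misleading.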
In short, top is a decidedly interactive utility. GNU default displays (as used in top.ps) use human-friendly units and such, which complicates field extractions to no purpose (because all such information is retrievable via GNU ps). The main useful field, %CPU, is confused with ps' pcpu output.
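As an illustration of the extra work that human-friendly formats create, here is a small Python sketch that turns the etime value printed by ps into plain seconds. It assumes the [[DD-]hh:]mm:ss layout documented for etime in the GNU ps man page; the function name is hypothetical and this is not how the TA itself extracts fields.

```python
def etime_to_seconds(etime):
    """Convert a ps etime string ([[DD-]hh:]mm:ss) to a number of seconds."""
    rest = etime.strip()
    days = 0
    if "-" in rest:
        # Optional leading day count, e.g. "2-03:04:05"
        day_part, rest = rest.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in rest.split(":")]
    # Pad missing leading fields (hours) with zeros
    while len(parts) < 3:
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


for sample in ("05:30", "01:02:03", "2-03:04:05"):
    print(sample, "->", etime_to_seconds(sample))
```

A numeric value like this is much easier to chart or alert on than a display string whose layout changes with the magnitude of the value.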
For these sources to be more useful, I would (strictly GNU)
top's diagnostic value is particularly high when run with -H. You can find which thread is eating up CPU. But that would add a lot of overhead to both host and indexer.