Getting uptime of a process from PS

JDukeSplunk · 04-02-2020

I've been searching Splunk Answers all morning trying to get this one. It seems simple enough, but I can't lick it and I'm just spinning my wheels.

I'm trying to get a percentage uptime based on the TA_nix ps sourcetype. The rub is that it's for a two-node cluster, so when one host is down and the other is still up, the cluster as a whole is still up, and that is the number they want.
Also, the search I'm running sometimes returns results greater than 100%, even when I break it down by Node1 and Node2. I'm counting on ps to poll one result per minute for this process.

Here's my search, and a sample set of results so you can see what I'm working with.

index="os" sourcetype="ps" USER=processuser COMMAND="commandiwanttocheck" (host=homehostsh OR host=someotherhosts) 
| lookup serverinfo_lookup hostname AS host OUTPUTNEW  ServerType ClusterNode
| stats count(COMMAND) as TotalResponses max(_time) as last_time min(_time) as first_time by ClusterNode ServerType
| eval minutes=((last_time-first_time)/60)
| eval Percent=round(((TotalResponses)/minutes)*100,2)

Here is the result of the search. I've left my "working" fields in.

ClusterNode  ServerType  TotalResponses  last_time   first_time  Percent  minutes
Node1        API_High    240             1585850077  1585835725  100.33   239.2
Node2        API_High    240             1585850069  1585835718  100.34   239.1833333
Node1        API_Low     240             1585850099  1585835749  100.35   239.1666667
Node2        API_Low     240             1585850060  1585835704  100.31   239.2666667
Node1        Batch_High  240             1585850067  1585835717  100.35   239.1666667
Node2        Batch_High  240             1585850078  1585835723  100.31   239.25
Node1        Batch_Low   240             1585850085  1585835732  100.33   239.2166667
Node2        Batch_Low   240             1585850070  1585835717  100.33   239.2166667
Node1        DMZ         240             1585850051  1585835702  100.36   239.15
Node2        DMZ         240             1585850084  1585835732  100.33   239.2
Node1        Internal    240             1585850079  1585835727  100.33   239.2
Node2        Internal    239             1585850042  1585835752  100.35   238.1666667

to4kawa · 04-08-2020

| makeresults 
| eval _raw=" ClusterNode    ServerType    TotalResponses    last_time    first_time    Percent    minutes
 Node1    API_High    240    1585850077    1585835725    100.33    239.2
 Node2    API_High    240    1585850069    1585835718    100.34    239.1833333
 Node1    API_Low    240    1585850099    1585835749    100.35    239.1666667
 Node2    API_Low    240    1585850060    1585835704    100.31    239.2666667
 Node1    Batch_High    240    1585850067    1585835717    100.35    239.1666667
 Node2    Batch_High    240    1585850078    1585835723    100.31    239.25
 Node1    Batch_Low    240    1585850085    1585835732    100.33    239.2166667
 Node2    Batch_Low    240    1585850070    1585835717    100.33    239.2166667
 Node1    DMZ    240    1585850051    1585835702    100.36    239.15
 Node2    DMZ    240    1585850084    1585835732    100.33    239.2
 Node1    Internal    240    1585850079    1585835727    100.33    239.2
 Node2    Internal    239    1585850042    1585835752    100.35    238.1666667" 
| multikv 
| fields - _* linecount 
| table ClusterNode ServerType TotalResponses last_time first_time Percent minutes
| rename COMMENT as "this is sample. from here, the logic"
| eventstats range(eval(mvappend(first_time,last_time))) as duration by ServerType
| addinfo
| eval baseline=round(info_max_time-info_min_time)
| eval time_perc=round(duration/baseline*100,2)
| fields - info*

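How about this? The query measures uptime per ServerType as a share of the search period: duration is the span from the earliest first_time to the latest last_time across both nodes, and baseline is the length of the search window reported by addinfo.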
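
For the cluster-level number asked about in the question, one possible alternative is to count the minutes in which at least one node reported the process, instead of counting events per node. What follows is only a sketch and has not been run against the poster's data. It assumes that ps really does report once per minute, that serverinfo_lookup gives both hosts of a cluster the same ServerType, and that a four-hour window snapped to whole minutes (earliest=-4h@m latest=@m) is acceptable. The field names nodes_up, minutes_up and total_minutes are made up for illustration.

```spl
index="os" sourcetype="ps" USER=processuser COMMAND="commandiwanttocheck" (host=homehostsh OR host=someotherhosts) earliest=-4h@m latest=@m
| rename COMMENT as "assumption: 4-hour window snapped to the minute; adjust earliest/latest as needed"
| lookup serverinfo_lookup hostname AS host OUTPUTNEW ServerType ClusterNode
| bin _time span=1m
| rename COMMENT as "one row per minute and ServerType; nodes_up is how many nodes reported in that minute"
| stats dc(ClusterNode) as nodes_up by _time ServerType
| rename COMMENT as "a minute counts as up if any node reported, so just count the rows"
| stats count as minutes_up by ServerType
| addinfo
| eval total_minutes=round((info_max_time-info_min_time)/60)
| eval Percent=round(minutes_up/total_minutes*100,2)
| fields ServerType minutes_up total_minutes Percent
```

Each minute bucket is counted at most once and the window covers a whole number of minutes, so Percent should not exceed 100 under these assumptions. The original search goes over 100 because 240 one-minute samples span only about 239 minutes between first_time and last_time, which gives 240/239.2 and hence 100.33.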
