I've been searching Splunk Answers all morning trying to get this one. It seems simple enough, but I can't lick it and I'm just spinning my wheels.
I'm trying to get a percentage uptime based on the TA_nix ps sourcetype. The rub is that it's for a two-node cluster, so when one host is down and the other is still up, the cluster as a whole is still up, and that's the number they want.
Also, the search I'm running sometimes returns results greater than 100%, even when I break it down by Node1 and Node2. I'm counting on ps to report one result per minute for this process.
Here's my search, and a sample set of results so you can see what I'm working with.
index="os" sourcetype="ps" USER=processuser COMMAND="commandiwanttocheck" (host=homehostsh OR host=someotherhosts)
| lookup serverinfo_lookup hostname AS host OUTPUTNEW ServerType ClusterNode
| stats count(COMMAND) as TotalResponses max(_time) as last_time min(_time) as first_time by ClusterNode ServerType
| eval minutes=((last_time-first_time)/60)
| eval Percent=round(((TotalResponses)/minutes)*100,2)
Here is the result of the search. I've still got my "working" fields in there:
ClusterNode ServerType TotalResponses last_time first_time Percent minutes
Node1 API_High 240 1585850077 1585835725 100.33 239.2
Node2 API_High 240 1585850069 1585835718 100.34 239.1833333
Node1 API_Low 240 1585850099 1585835749 100.35 239.1666667
Node2 API_Low 240 1585850060 1585835704 100.31 239.2666667
Node1 Batch_High 240 1585850067 1585835717 100.35 239.1666667
Node2 Batch_High 240 1585850078 1585835723 100.31 239.25
Node1 Batch_Low 240 1585850085 1585835732 100.33 239.2166667
Node2 Batch_Low 240 1585850070 1585835717 100.33 239.2166667
Node1 DMZ 240 1585850051 1585835702 100.36 239.15
Node2 DMZ 240 1585850084 1585835732 100.33 239.2
Node1 Internal 240 1585850079 1585835727 100.33 239.2
Node2 Internal 239 1585850042 1585835752 100.35 238.1666667
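A likely reason for the values just over 100%: 240 samples taken one minute apart cover only 239 intervals, so dividing the count by (last_time-first_time)/60 overshoots slightly. Here is a minimal check using the Node1 API_High row above. The field names naive and corrected are made up for illustration, and the correction assumes ps really does report exactly once per minute.

```spl
| makeresults
| rename COMMENT as "values copied from the Node1 API_High row; assumes one ps sample per minute"
| eval TotalResponses=240, first_time=1585835725, last_time=1585850077
| eval minutes=(last_time-first_time)/60
| eval naive=round(TotalResponses/minutes*100,2)
| eval corrected=round((TotalResponses-1)/minutes*100,2)
| table TotalResponses minutes naive corrected
```

Here naive reproduces the 100.33 from the table, while corrected counts intervals rather than samples and stays at or below 100 as long as no extra samples arrive.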
| makeresults
| eval _raw=" ClusterNode ServerType TotalResponses last_time first_time Percent minutes
Node1 API_High 240 1585850077 1585835725 100.33 239.2
Node2 API_High 240 1585850069 1585835718 100.34 239.1833333
Node1 API_Low 240 1585850099 1585835749 100.35 239.1666667
Node2 API_Low 240 1585850060 1585835704 100.31 239.2666667
Node1 Batch_High 240 1585850067 1585835717 100.35 239.1666667
Node2 Batch_High 240 1585850078 1585835723 100.31 239.25
Node1 Batch_Low 240 1585850085 1585835732 100.33 239.2166667
Node2 Batch_Low 240 1585850070 1585835717 100.33 239.2166667
Node1 DMZ 240 1585850051 1585835702 100.36 239.15
Node2 DMZ 240 1585850084 1585835732 100.33 239.2
Node1 Internal 240 1585850079 1585835727 100.33 239.2
Node2 Internal 239 1585850042 1585835752 100.35 238.1666667"
| multikv
| fields - _* linecount
| table ClusterNode ServerType TotalResponses last_time first_time Percent minutes
| rename COMMENT as "everything above only builds sample data; the actual logic starts here"
| eventstats range(eval(mvappend(first_time,last_time))) as duration by ServerType
| addinfo
| eval baseline=round(info_max_time-info_min_time)
| eval time_perc=round(duration/baseline*100,2)
| fields - info*
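This query aims to measure uptime against the search period: for each ServerType it takes the span between the earliest first_time and the latest last_time across both nodes, then divides that by the length of the search window reported by addinfo. How about this?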
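Another way to get a cluster-level figure, sketched under some assumptions: bucket the raw ps events into one-minute bins and count a minute as "up" if at least one node reported in it. This assumes ps reports roughly once per minute per host, that the search uses a bounded time range (so addinfo returns finite info_min_time and info_max_time), and that the lookup fields are the ones from the original search. The names nodes_up, minutes_up and total_minutes are made up for illustration.

```spl
index="os" sourcetype="ps" USER=processuser COMMAND="commandiwanttocheck" (host=homehostsh OR host=someotherhosts)
| lookup serverinfo_lookup hostname AS host OUTPUTNEW ServerType ClusterNode
| rename COMMENT as "assumption: one ps sample per minute per host, bounded search window"
| bin _time span=1m
| stats dc(ClusterNode) as nodes_up by _time ServerType
| stats count as minutes_up by ServerType
| addinfo
| eval total_minutes=round((info_max_time-info_min_time)/60)
| eval Percent=min(round(minutes_up/total_minutes*100,2),100)
| fields - info_*
```

Unlike a span-based calculation, this notices an outage in the middle of the window, because minutes with no event from either node are simply not counted. The min(...,100) cap is there because a window that does not start on a minute boundary can touch one more bucket than total_minutes. A ServerType with no events at all will be missing from the output rather than showing 0.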