Solved: Nmon splunk app: Calculations for average are defying the logics of Math

Ankitha_d · ‎04-19-2016

Hi,

We have installed the Splunk forwarder to measure CPU usage on a few AIX servers.
These are the stats observed with the following selection in the NMON app:

UI LPAR Pool, Pool Virtual CPU Usage (AIX)
weekly average shown in NMON 0.46
thu 14 0.37
fri 15 0.37
sat 16 0.37
sun 17 0.36
mon 18 0.37
tue 19 0.39
wed 20 0.39
How can the overall average be higher than the average of every single day? Any spike should also be smoothed out. This is a really confusing issue that I am unable to resolve.

guilmxm · ‎04-21-2016

Hello!

Right, could you give more details about the selection you made within the interface? Have you selected any time filtering?
Maybe there is a mistake somewhere in the user interface that would explain this; the details of your selection will help.

In the meantime, please have a look at the Howto interface provided within the App (Howto menu in the app bar); you will find a dedicated dashboard that provides SPL search samples.

HOWTO LPAR: Generate stats and charts with Splunk Search Processing Language (SPL) for IBM Pseries Pools

If I understand correctly what you want to get, you can do things like:

Global stats over the period, with the result as a stats table:

eventtype=nmon:performance type=LPAR host=<myserver> PoolIdle>0
| eval usage=round((poolCPUs-PoolIdle),2)
| eval Pool_id=if(isnull(Pool_id), "0", Pool_id)
| stats min(usage) AS "Min Pool CPU usage", avg(usage) AS "Avg Pool CPU usage", max(usage) AS "Max Pool CPU usage" by frameID,Pool_id,hostname
| eval "Avg Pool CPU usage"=round('Avg Pool CPU usage', 2) | sort frameID,Pool_id,hostname

Per-day stats over the period, with the result as a stats table (adding a | bucket _time span=1d and _time in the by clause):

eventtype=nmon:performance type=LPAR host=<myserver> PoolIdle>0
| eval usage=round((poolCPUs-PoolIdle),2)
| eval Pool_id=if(isnull(Pool_id), "0", Pool_id)
| bucket _time span=1d
| stats min(usage) AS "Min Pool CPU usage", avg(usage) AS "Avg Pool CPU usage", max(usage) AS "Max Pool CPU usage" by _time, frameID,Pool_id,hostname
| eval "Avg Pool CPU usage"=round('Avg Pool CPU usage', 2) | sort frameID,Pool_id,hostname

You can also use timechart, or chart, if you want a chart rather than a stats table as the final result (see the examples in the Howto).

Note that you can also use the provided Pivot data model: open the model available in the Pivot menu within the app bar.

guilmxm · ‎04-22-2016

Right,

The explanation lies in the search associated with the Single forms within the application.
If you have Splunk 6.3, you can open the search associated with the Single form directly in Search to see what is being done.

So, I will take an example with the following selection:

Time range: 1 week, from Monday to Sunday inclusive
1 host is selected
In the charting parameters, the span value is set to 1 day
Average as the stats mode

Let's first take a look at the search associated with the chart:

| tstats max("CPU.poolCPUs") AS "CPU.poolCPUs" max("CPU.LPAR.Pool_usage") AS usage values("CPU.LPAR.Pool_id") AS "CPU.LPAR.Pool_id" from datamodel=NMON_Data_CPU where (nodename = CPU.LPAR) (CPU.hostname=myserver) (CPU.frameID="*") (CPU.hostname=*) (CPU.LPAR.Pool_id="*") (CPU.PoolIdle!=0) `No_Filter(CPU)` groupby _time, "CPU.frameID", "CPU.hostname" prestats=true span=1m
| stats dedup_splitvals=t max("CPU.poolCPUs") AS "CPU.poolCPUs" max("CPU.LPAR.Pool_usage") AS usage values("CPU.LPAR.Pool_id") AS "CPU.LPAR.Pool_id" by _time, "CPU.frameID", "CPU.hostname"
| sort limit=0 _time | fields *
| rename CPU.LPAR.Pool_usage AS usage | fields * | timechart span=1d useother=f limit=0 avg(usage) As usage  by "CPU.hostname"

As you can see, the search (which uses the data model) starts by building a stats table with a minimal span of 1 minute, which means evaluating the Pool usage for each minute in the period.
Finally, the timechart command, with a span of 1d, evaluates the average value per day.
In my example:

_time   myhost
2016-04-11  24.281831
2016-04-12  24.474734
2016-04-13  24.279440
2016-04-14  25.535630
2016-04-15  24.591124
2016-04-16  29.036124
2016-04-17  27.137331

If I take a look at the Single search:

| tstats max("CPU.poolCPUs") AS "CPU.poolCPUs" max("CPU.LPAR.Pool_usage") AS usage values("CPU.LPAR.Pool_id") AS "CPU.LPAR.Pool_id" from datamodel=NMON_Data_CPU where (nodename = CPU.LPAR) (CPU.hostname=myserver) (CPU.frameID="*") (CPU.hostname=*) (CPU.LPAR.Pool_id="*") (CPU.PoolIdle!=0) `No_Filter(CPU)` groupby _time, "CPU.frameID", "CPU.hostname" prestats=true `inline_customspan`
| stats dedup_splitvals=t max("CPU.poolCPUs") AS "CPU.poolCPUs" max("CPU.LPAR.Pool_usage") AS usage values("CPU.LPAR.Pool_id") AS "CPU.LPAR.Pool_id" by _time, "CPU.frameID", "CPU.hostname"
| fields * | sort limit=0 _time | rename CPU.LPAR.Pool_usage AS usage
| stats min(usage) As min_usage, avg(usage) As avg_usage, max(usage) As max_usage, values("CPU.LPAR.Pool_id") AS "CPU.LPAR.Pool_id", max("CPU.poolCPUs") AS "CPU.poolCPUs", sparkline(avg(usage)) As sparkline by "CPU.frameID", "CPU.hostname"
| rename "CPU.frameID" AS frameID, "CPU.hostname" AS hostname, "CPU.LPAR.Pool_id" AS "Pool ID", "CPU.poolCPUs" AS "CPUs in Pool"
| sort limit=0 frameID | eval avg_usage=round(avg_usage,2) | stats avg(avg_usage) AS avg | eval avg=round(avg,2)

Result in my example:

27.59

As you will notice, the first part of the command (tstats) does not use a per-minute span (span=1m) but instead calls a macro, "inline_customspan".
This macro is used within the App to always set the most accurate span value for charting.

By default, Splunk sets this value automatically when you generate charts, but as the selected time range grows, the span quickly gets set to a very large value.
For example, if you use a 24h period, Splunk will set the span to 30m.
This may be fine for very simple things, but not for people who want the most accurate analysis.
This is why the App uses this macro.

Anyway, it should not be used in the Single form, and this is the root cause of the issue.

Also, you may wonder why a per-minute stats table is built before continuing with the other operations: this is done to handle duplicated values, or multiple parallel nmon executions.
I could have used a dedup command too; in the App context I chose to keep the highest value reported per minute as the true value reported by nmon.
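
To illustrate the per-minute step described above, here is a minimal Python sketch (not the App's code; the sample tuples and the function name `max_per_minute` are made up for illustration). It keeps the highest value reported within each minute, so a duplicated or parallel nmon sample does not count twice:

```python
from collections import defaultdict

def max_per_minute(samples):
    """Keep the highest usage seen in each one-minute bucket.

    samples: iterable of (epoch_seconds, usage) tuples (hypothetical data).
    Returns a dict mapping each minute's start (epoch seconds) to its max usage.
    """
    buckets = defaultdict(list)
    for ts, usage in samples:
        buckets[ts - ts % 60].append(usage)  # floor the timestamp to the minute
    return {minute: max(values) for minute, values in sorted(buckets.items())}

# Two samples fall in the same minute (0..59 s), one in the next minute.
samples = [(0, 24.1), (30, 24.3), (60, 25.0)]
per_minute = max_per_minute(samples)
print(per_minute)  # {0: 24.3, 60: 25.0}
```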

The macro sets a span value; in my case, with the time range selected, it sets a 15-minute span.
That means building a stats table of the max Pool usage per 15-minute bucket, then evaluating the average over the whole period (from the 15-minute stats table).
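
A small Python sketch of why the wider span pushes the average up. The numbers are made up, not the data from this thread, and `avg_of_bucket_max` is a hypothetical helper: it takes the max of each bucket, then averages those maxima, which is what the search does with a 1-minute or a 15-minute span.

```python
def avg_of_bucket_max(values, bucket_size):
    """Average of the per-bucket maxima, using buckets of `bucket_size` consecutive values."""
    maxima = [max(values[i:i + bucket_size]) for i in range(0, len(values), bucket_size)]
    return sum(maxima) / len(maxima)

# Hypothetical series: one value per minute for 30 minutes,
# flat at 20 with two one-minute spikes to 40.
usage = [20.0] * 30
usage[3] = 40.0
usage[20] = 40.0

avg_1m = avg_of_bucket_max(usage, 1)    # plain per-minute average
avg_15m = avg_of_bucket_max(usage, 15)  # each 15-minute bucket is represented by its spike

print(round(avg_1m, 2))   # 21.33
print(round(avg_15m, 2))  # 40.0
```

With a 1-minute span each spike counts once among 30 values; with a 15-minute span each spike stands in for its whole bucket, so the average of the maxima can only be equal or higher.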

If I correct the search and initially build the stats table per 1-minute bucket:

| tstats max("CPU.poolCPUs") AS "CPU.poolCPUs" max("CPU.LPAR.Pool_usage") AS usage values("CPU.LPAR.Pool_id") AS "CPU.LPAR.Pool_id" from datamodel=NMON_Data_CPU where (nodename = CPU.LPAR) (CPU.hostname=myserver) (CPU.frameID="*") (CPU.hostname=*) (CPU.LPAR.Pool_id="*") (CPU.PoolIdle!=0) `No_Filter(CPU)` groupby _time, "CPU.frameID", "CPU.hostname" prestats=true span=1m
| stats dedup_splitvals=t max("CPU.poolCPUs") AS "CPU.poolCPUs" max("CPU.LPAR.Pool_usage") AS usage values("CPU.LPAR.Pool_id") AS "CPU.LPAR.Pool_id" by _time, "CPU.frameID", "CPU.hostname"
| fields * | sort limit=0 _time | rename CPU.LPAR.Pool_usage AS usage
| stats min(usage) As min_usage, avg(usage) As avg_usage, max(usage) As max_usage, values("CPU.LPAR.Pool_id") AS "CPU.LPAR.Pool_id", max("CPU.poolCPUs") AS "CPU.poolCPUs", sparkline(avg(usage)) As sparkline by "CPU.frameID", "CPU.hostname"
| rename "CPU.frameID" AS frameID, "CPU.hostname" AS hostname, "CPU.LPAR.Pool_id" AS "Pool ID", "CPU.poolCPUs" AS "CPUs in Pool"
| sort limit=0 frameID | eval avg_usage=round(avg_usage,2) | stats avg(avg_usage) AS avg | eval avg=round(avg,2)

Result in my example:

25.28

If you reuse this search to get the 1-minute stats table, add a "| addcoltotals" to get the final sum of the Pool usage, then divide by the number of entries you got, you will get the average value.
Which is correct.

| tstats max("CPU.poolCPUs") AS "CPU.poolCPUs" max("CPU.LPAR.Pool_usage") AS usage values("CPU.LPAR.Pool_id") AS "CPU.LPAR.Pool_id" from datamodel=NMON_Data_CPU where (nodename = CPU.LPAR) (CPU.hostname=myserver) (CPU.frameID="*") (CPU.hostname=*) (CPU.LPAR.Pool_id="*") (CPU.PoolIdle!=0) `No_Filter(CPU)` groupby _time, "CPU.frameID", "CPU.hostname" prestats=true span=1m
| stats dedup_splitvals=t max("CPU.poolCPUs") AS "CPU.poolCPUs" max("CPU.LPAR.Pool_usage") AS usage values("CPU.LPAR.Pool_id") AS "CPU.LPAR.Pool_id" by _time, "CPU.frameID", "CPU.hostname"
| fields * | sort limit=0 _time | rename CPU.LPAR.Pool_usage AS usage
| addcoltotals

Result in my case:

(57302.50/2267) : 25.28
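
The division above can be checked in a couple of lines of Python. Both figures are taken from this post; the variable names are only illustrative.

```python
total_usage = 57302.50  # column total returned by addcoltotals (from the post)
row_count = 2267        # number of rows in the 1-minute stats table (from the post)

average = total_usage / row_count
print(round(average, 2))  # 25.28
```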

If I do the same with the original search associated with the Single form:

| tstats max("CPU.poolCPUs") AS "CPU.poolCPUs" max("CPU.LPAR.Pool_usage") AS usage values("CPU.LPAR.Pool_id") AS "CPU.LPAR.Pool_id" from datamodel=NMON_Data_CPU where (nodename = CPU.LPAR) (CPU.hostname=myserver) (CPU.frameID="*") (CPU.hostname=*) (CPU.LPAR.Pool_id="*") (CPU.PoolIdle!=0) `No_Filter(CPU)` groupby _time, "CPU.frameID", "CPU.hostname" prestats=true `inline_customspan`
| stats dedup_splitvals=t max("CPU.poolCPUs") AS "CPU.poolCPUs" max("CPU.LPAR.Pool_usage") AS usage values("CPU.LPAR.Pool_id") AS "CPU.LPAR.Pool_id" by _time, "CPU.frameID", "CPU.hostname"
| fields * | sort limit=0 _time | rename CPU.LPAR.Pool_usage AS usage
| addcoltotals

Result in my case:

(16859.41/612) : 27.59

So, that makes sense.

Note that the user interface uses data model searches (which explains the use of the tstats command); you can do the same, more easily, in SPL (see the Howto interfaces).

If I go back to my two examples:

With the span value manually set to 1 minute:

eventtype=nmon:performance type=LPAR (hostname=myserver) (PoolIdle!=0)
| eval Pool_usage=round((poolCPUs-PoolIdle),2)
| bucket _time span=1m
| stats max(Pool_usage) AS Pool_usage by _time,hostname
| stats avg(Pool_usage) AS Pool_usage by hostname

Result: 25.28

The result with the macro being called (which sets the span value to 15 minutes):

eventtype=nmon:performance type=LPAR (hostname=myserver) (PoolIdle!=0)
| eval Pool_usage=round((poolCPUs-PoolIdle),2)
| bucket _time `inline_customspan`
| stats max(Pool_usage) AS Pool_usage by _time,hostname
| stats avg(Pool_usage) AS Pool_usage by hostname

Result: 27.59

Again, that makes sense, and we do get the same results between the pivot data model and SPL searches (hopefully!).

So, to sum up, you are correct: there is an error, and it comes from the fact that the macro is being called to evaluate the global stats of your period, which is incorrect.
The macro should be used only when charting, or in table stats if this is wanted.
Note that if the macro were called and no span value were defined, the result would not be correct either, because Splunk would choose its own value (which would again be less accurate).

A few last things: you have selected multiple hosts, and average aggregation per time interval.

This modifies the search and adds:

| stats avg(CPU.poolCPUs) AS CPU.poolCPUs, avg(usage) AS usage by _time | eval CPU.frameID="aggreg_frameID" | eval CPU.hostname="aggreg_hostname"

In a few words, that means: take all values for all selected hosts at each time interval (normally every minute), and evaluate the average value.
This is fine; you could instead keep only the highest value per minute across all of your hosts by choosing max per time interval.

If you have 10 partitions that measured the pool usage within the same full minute, you may want to keep the highest value as the "true" state of the Pool usage, but the average per minute is not bad either.
There should be a very small difference between the two, but it can happen.
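
As a rough illustration of the two aggregation modes, here is a Python sketch with made-up values for three partitions over two minutes (the timestamps and numbers are assumptions, not data from this thread):

```python
from statistics import mean

# Hypothetical pool usage reported by three partitions for two consecutive minutes.
by_minute = {
    "10:00": [24.0, 24.2, 24.4],
    "10:01": [25.0, 25.0, 25.6],
}

# "Average per time interval": mean across hosts for each minute.
avg_per_interval = {t: round(mean(v), 2) for t, v in by_minute.items()}
# "Max per time interval": highest value across hosts for each minute.
max_per_interval = {t: max(v) for t, v in by_minute.items()}

print(avg_per_interval)  # {'10:00': 24.2, '10:01': 25.2}
print(max_per_interval)  # {'10:00': 24.4, '10:01': 25.6}
```

When the partitions report nearly the same pool usage, as here, the two modes differ only slightly.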

FINAL:

Thank you for reporting this; I will open an issue on Git, and this will be corrected ASAP in release 1.6.16, which is currently under development.
If you like, you can easily modify the view to set the span in the raw search, but if you do so, don't forget to clean up the view when you upgrade.

I hope this helps and answers your question!

Guilhem

Ankitha_d · ‎04-22-2016

Thanks a lot for the help

jplumsdaine22 · ‎04-22-2016

@guilmxm sensational support

guilmxm · ‎04-22-2016

@jplumsdaine22: Thanks 🙂

Ankitha_d · ‎04-22-2016

Found a few more things while checking out the option of listing the averages per day.
I obtained the average over a period of 7 days, from the 15th of April 00:00 to the 21st of April 24:00, and found the average to be 16.38.
I used the query from the same interface and broke it down per day using timechart span=1d.
When I did this, the average for the 15th of April was 16.51.

After this, with the same selections, I set the time range to just the 15th of April, 00:00 to 24:00, and obtained an average of 14.51.
How is this possible? It is turning out to be really confusing! Please help. I have uploaded the images to support the arguments.
I could not attach the one showing the average from the 15th to the 21st, as this was the limit of images allowed.

Ankitha_d · ‎04-21-2016

This is the selection I used to obtain the average over seven days. When I keep the same selections and change only the time range to get the average for each day over the same period, not even one daily average is greater than the overall average. And this is not possible, I guess.

jplumsdaine22 · ‎04-21-2016

@Ankitha_d this answer is posted by the app maintainer - have a read!

sideview · ‎04-20-2016

Unless there's some weird mismatch of definitions, like the "weekly" is only counting the last full monday-to-monday week, this has to be a bug in the logging or in the app.
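
A quick Python sanity check of that reasoning, as a sketch: an overall average computed over the same samples is a weighted mean of the daily averages, so it must fall between the lowest and highest daily value. The daily figures and the weekly figure come from the question; the per-day sample counts are assumptions, and any positive counts give the same bound.

```python
daily_avg = [0.37, 0.37, 0.37, 0.36, 0.37, 0.39, 0.39]  # Thu 14 .. Wed 20, from the question
weekly_shown = 0.46                                      # weekly figure shown in the app

def overall_average(daily_means, daily_counts):
    """Overall mean rebuilt from per-day means and per-day sample counts."""
    total = sum(m * n for m, n in zip(daily_means, daily_counts))
    return total / sum(daily_counts)

# Hypothetical number of samples per day (uneven on purpose).
counts = [1440, 1440, 1200, 1440, 900, 1440, 1440]
overall = overall_average(daily_avg, counts)

print(min(daily_avg) <= overall <= max(daily_avg))  # True
print(weekly_shown > max(daily_avg))                # True: outside the possible range
```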

Ankitha_d · ‎04-20-2016

I do not see any issues with the logging. I am not sure about a bug in the app. I hope someone can help with this.
