Solved: Re: Average TPS and Average response time in a sin...

mrigs13 · ‎04-07-2022

Hi,

I am trying to write a query that would get me the average TPS and average response time for services in the same table. I tried this -

<search>
| eval <evaluate response time as RT>
| bin _time AS "TIME" span=1s
| eventstats count as TPS by TIME, service
| stats count AS txnCount, avg(TPS) as avgTPS, avg(RT) as avgRT by service

However, the numbers don't seem to match when I am running the TPS query individually like this -

<search>
| bin _time AS "TIME" span=1s
| eventstats count as TPS by TIME, service
| stats count AS txnCount, avg(TPS) as avgTPS by service

Any suggestions what I could be doing wrong here?

Thank you!

bowesmana · ‎04-11-2022

The problem with using eventstats to calculate something then stats to do something else on that value is that eventstats will put the calculated value in every event found, so consider this data

12:00:00.001 query=a, rt=100
12:00:00.002 query=a, rt=100
12:00:00.003 query=a, rt=100
12:00:01.001 query=a, rt=100
12:00:01.002 query=a, rt=100
12:00:02.001 query=a, rt=100

After eventstats you will have this data

12:00:00.001 query=a, rt=100 TPS=3
12:00:00.002 query=a, rt=100 TPS=3 
12:00:00.003 query=a, rt=100 TPS=3 
12:00:01.001 query=a, rt=100 TPS=2
12:00:01.002 query=a, rt=100 TPS=2
12:00:02.001 query=a, rt=100 TPS=1

so then when you calculate avg(TPS) on this data you have

3+3+3+2+2+1=14 / 6 events = 2.3333333

however, the average of the TPS is actually 3+2+1=6 / 3 seconds = 2

which is why you have different values and the eventstats+stats version is wrong.

The challenge you have is in getting both percentiles and TPS. You can only use ONE of the TPS values per second when calculating averages and percentiles and you cannot use the double stats method if you want percentiles.

This is a technique that can be used to calculate the numbers using eventstats. This uses streamstats to count the events per second and then sets all other TPS values to null apart from the first one per second, which then means you can use the avg(TPS) and percentiles as the events that have null TPS are not counted, so in the above data example, you get the correct average TPS value of 2.

| eventstats count as TPS by TIME query
| streamstats time_window=1s c as tmp
| eval TPS=if(tmp>1, null(), TPS)
| stats avg(TPS) as avgTPS perc90(TPS) as p90TPS by query

There may be other ways to do this

@ITWhisperer do you know of a way to calc TPS then percentiles+averages of that TPS without using a subsearch to duplicate the search?

View solution in original post

mrigs13 · ‎04-12-2022

Hello @bowesmana & @ITWhisperer, many thanks for the solution, and for the explanation provided along with! 🙂 I had to take a refresher for evenstats and streamstats, and it all fit well with the explanations by @bowesmana.

Finally went with the modified query suggested by @ITWhisperer, and the results look good, and as expected.

Thank you once again, both!

bowesmana · ‎04-10-2022

@mrigs13

Also, note that using eventstats will be much slower than using stats twice. As you are doing average, it's each to calculate the total RT and then create the average RT at the end like this.

<search>
| bin _time as "TIME" span=1s
| stats count as TPS sum(RT) as RT by TIME service
| stats sum(TPS) as txnCount, avg(TPS) as avgTPS, sum(RT) as RT by service
| eval avgRT_sum=RT/txnCount

mrigs13 · ‎04-11-2022

Thanks @bowesmana. Learnt something new about evenstats today. 🙂

I do need the 90th percentile TPS and RT as well. Also, the TPS does not match with what I was getting with eventstats.

<search>
| bin _time AS "TIME" span=1s
| stats count as TPS sum(RT) as RT by TIME query
| stats sum(TPS) as txnCount, avg(TPS) as avgTPS, sum(RT) as RT by query
| eval avgRT_sum=round(RT/txnCount,2)
| eval avgTPS = round (avgTPS,2)

This is the query that I have -

<search>
| bin _time AS "TIME" span=1s
| eventstats count as TPS by TIME, query
| stats count AS txnCount, perc90(TPS) as P90_TPS, perc90(RT) as P90_RT, avg(TPS) as avgTPS, avg(RT) as avgRT by query
| eval avgTPS = round (avgTPS,2)
| eval avgRT = round (avgRT,2)

Thank you!

bowesmana · ‎04-11-2022

The problem with using eventstats to calculate something then stats to do something else on that value is that eventstats will put the calculated value in every event found, so consider this data

12:00:00.001 query=a, rt=100
12:00:00.002 query=a, rt=100
12:00:00.003 query=a, rt=100
12:00:01.001 query=a, rt=100
12:00:01.002 query=a, rt=100
12:00:02.001 query=a, rt=100

After eventstats you will have this data

12:00:00.001 query=a, rt=100 TPS=3
12:00:00.002 query=a, rt=100 TPS=3 
12:00:00.003 query=a, rt=100 TPS=3 
12:00:01.001 query=a, rt=100 TPS=2
12:00:01.002 query=a, rt=100 TPS=2
12:00:02.001 query=a, rt=100 TPS=1

so then when you calculate avg(TPS) on this data you have

3+3+3+2+2+1=14 / 6 events = 2.3333333

however, the average of the TPS is actually 3+2+1=6 / 3 seconds = 2

which is why you have different values and the eventstats+stats version is wrong.

The challenge you have is in getting both percentiles and TPS. You can only use ONE of the TPS values per second when calculating averages and percentiles and you cannot use the double stats method if you want percentiles.

This is a technique that can be used to calculate the numbers using eventstats. This uses streamstats to count the events per second and then sets all other TPS values to null apart from the first one per second, which then means you can use the avg(TPS) and percentiles as the events that have null TPS are not counted, so in the above data example, you get the correct average TPS value of 2.

| eventstats count as TPS by TIME query
| streamstats time_window=1s c as tmp
| eval TPS=if(tmp>1, null(), TPS)
| stats avg(TPS) as avgTPS perc90(TPS) as p90TPS by query

There may be other ways to do this

@ITWhisperer do you know of a way to calc TPS then percentiles+averages of that TPS without using a subsearch to duplicate the search?

ITWhisperer · ‎04-12-2022

@bowesmana Your approach is almost how I would have done it too. A slight modification is to the streamstats which should take into account the query as well. Here is a runanywhere example (based on your approach)

| makeresults count=1000
| eval _time=_time+(random()%1000)/100
| eval query=mvindex(split("ABC",""),random()%3)
| eval RT=random()%100/10
``` the lines above set up some dummy data ```
| bin _time as TIME span=1s
| eventstats count as TPS by TIME query
| streamstats c as tmp global=f by TIME query
| eval TPS=if(tmp>1, null(), TPS)
| stats avg(TPS) as avgTPS perc90(TPS) as p90TPS by query

bowesmana · ‎04-12-2022

Thanks for that pick @ITWhisperer - I copied the wrong streamstats example.

bowesmana · ‎04-10-2022

Make sure you are comparing the same time intervals when you check your numbers - if you latest time is 'now' then you will always get different results every time you run the search

mrigs13 · ‎04-11-2022

I think that's the mistake I was doing when comparing the two. Although I don't recall now 🙂. However, now the initial query seems to be giving me the results I expect.

Final working query -

<search>
| eval RT = endTime - startTime
| bin _time AS "TIME" span=1s
| eventstats count as TPS by TIME, query
| stats count AS txnCount, perc90(TPS) as P90_TPS, perc90(TotalTime) as P90_RT, avg(TPS) as avgTPS, avg(TotalTime) as avgRT by query
| eval avgTPS = round (avgTPS,2)
| eval avgRT = round (avgRT,2)

yuanliu · ‎04-09-2022

Before trying to find what is "wrong", maybe first determine if anything is actually wrong. What is the method used to determine that the numbers do not match? I ask because the codes suggest very dense events, so any observed difference could simply be an artifact due to data variants.

Another possible path to examine could be any side effect from "<evaluate response time as RT>" that is not illustrated.

mrigs13 · ‎04-11-2022

Thank you for responding @yuanliu. I just did a comparison w/ and w/o the RT evaluation - basically comparing the results of the 2 queries that is mentioned in the original post.

As for this -

"<evaluate response time as RT>"

It was a simple evaluation

RT=endTime-startTime

However, it looks like the initial queries that I posted are working as expected. Probably I was using a wrong time window for running both the queries. Final working query -

<search>
| eval RT = endTime - startTime
| bin _time AS "TIME" span=1s
| eventstats count as TPS by TIME, query
| stats count AS txnCount, perc90(TPS) as P90_TPS, perc90(TotalTime) as P90_RT, avg(TPS) as avgTPS, avg(TotalTime) as avgRT by query
| eval avgTPS = round (avgTPS,2)
| eval avgRT = round (avgRT,2)

How to write a query for average TPS and average response time in a single table?

stats

Building Reliable Asset and Identity Frameworks in Splunk ES

Cloud Monitoring Console - Unlocking Greater Visibility in SVC Usage Reporting

Automatic Discovery Part 3: Practical Use Cases

Are you a member of the Splunk Community?

How to write a query for average TPS and average response time in a single table?

stats

Building Reliable Asset and Identity Frameworks in Splunk ES

Cloud Monitoring Console - Unlocking Greater Visibility in SVC Usage Reporting

Automatic Discovery Part 3: Practical Use Cases