Splunk Search

How to write a query for average TPS and average response time in a single table?

mrigs13
Explorer

Hi,

I am trying to write a query that would get me the average TPS and average response time for services in the same table. I tried this -

 

<search>
| eval <evaluate response time as RT>
| bin _time AS "TIME" span=1s
| eventstats count as TPS by TIME, service
| stats count AS txnCount, avg(TPS) as avgTPS, avg(RT) as avgRT by service

 

However, the numbers don't seem to match when I am running the TPS query individually like this -

 

<search>
| bin _time AS "TIME" span=1s
| eventstats count as TPS by TIME, service
| stats count AS txnCount, avg(TPS) as avgTPS by service

 

Any suggestions what I could be doing wrong here?

Thank you!

Labels (1)
Tags (3)
0 Karma
1 Solution

bowesmana
SplunkTrust
SplunkTrust

The problem with using eventstats to calculate something then stats to do something else on that value is that eventstats will put the calculated value in every event found, so consider this data

12:00:00.001 query=a, rt=100
12:00:00.002 query=a, rt=100
12:00:00.003 query=a, rt=100
12:00:01.001 query=a, rt=100
12:00:01.002 query=a, rt=100
12:00:02.001 query=a, rt=100

After eventstats you will have this data

12:00:00.001 query=a, rt=100 TPS=3
12:00:00.002 query=a, rt=100 TPS=3 
12:00:00.003 query=a, rt=100 TPS=3 
12:00:01.001 query=a, rt=100 TPS=2
12:00:01.002 query=a, rt=100 TPS=2
12:00:02.001 query=a, rt=100 TPS=1

so then when you calculate avg(TPS) on this data you have

3+3+3+2+2+1=14 / 6 events = 2.3333333

however, the average of the TPS is actually 3+2+1=6 / 3 seconds = 2

which is why you have different values and the eventstats+stats version is wrong.

The challenge you have is in getting both percentiles and TPS. You can only use ONE of the TPS values per second when calculating averages and percentiles and you cannot use the double stats method if you want percentiles.

This is a technique that can be used to calculate the numbers using eventstats. This uses streamstats to count the events per second and then sets all other TPS values to null apart from the first one per second, which then means you can use the avg(TPS) and percentiles as the events that have null TPS are not counted, so in the above data example, you get the correct average TPS value of 2.

| eventstats count as TPS by TIME query
| streamstats time_window=1s c as tmp
| eval TPS=if(tmp>1, null(), TPS)
| stats avg(TPS) as avgTPS perc90(TPS) as p90TPS by query

There may be other ways to do this

@ITWhisperer do you know of a way to calc TPS then percentiles+averages of that TPS without using a subsearch to duplicate the search?

 

View solution in original post

mrigs13
Explorer

Hello @bowesmana & @ITWhisperer, many thanks for the solution, and for the explanation provided along with! 🙂 I had to take a refresher for evenstats and streamstats, and it all fit well with the explanations by @bowesmana.

Finally went with the modified query suggested by @ITWhisperer, and the results look good, and as expected.

mrigs13_0-1649798206821.png

Thank you once again, both!

0 Karma

bowesmana
SplunkTrust
SplunkTrust

@mrigs13 

Also, note that using eventstats will be much slower than using stats twice. As you are doing average, it's each to calculate the total RT and then create the average RT at the end like this.

<search>
| bin _time as "TIME" span=1s
| stats count as TPS sum(RT) as RT by TIME service
| stats sum(TPS) as txnCount, avg(TPS) as avgTPS, sum(RT) as RT by service
| eval avgRT_sum=RT/txnCount

 

0 Karma

mrigs13
Explorer

Thanks @bowesmana. Learnt something new about evenstats today. 🙂

I do need the 90th percentile TPS and RT as well. Also, the TPS does not match with what I was getting with eventstats. 

 

 

<search>
| bin _time AS "TIME" span=1s
| stats count as TPS sum(RT) as RT by TIME query
| stats sum(TPS) as txnCount, avg(TPS) as avgTPS, sum(RT) as RT by query
| eval avgRT_sum=round(RT/txnCount,2)
| eval avgTPS = round (avgTPS,2)

 

 

mrigs13_0-1649715852489.png

This is the query that I have -

<search>
| bin _time AS "TIME" span=1s
| eventstats count as TPS by TIME, query
| stats count AS txnCount, perc90(TPS) as P90_TPS, perc90(RT) as P90_RT, avg(TPS) as avgTPS, avg(RT) as avgRT by query
| eval avgTPS = round (avgTPS,2)
| eval avgRT = round (avgRT,2)

mrigs13_1-1649716147959.png

Thank you!

0 Karma

bowesmana
SplunkTrust
SplunkTrust

The problem with using eventstats to calculate something then stats to do something else on that value is that eventstats will put the calculated value in every event found, so consider this data

12:00:00.001 query=a, rt=100
12:00:00.002 query=a, rt=100
12:00:00.003 query=a, rt=100
12:00:01.001 query=a, rt=100
12:00:01.002 query=a, rt=100
12:00:02.001 query=a, rt=100

After eventstats you will have this data

12:00:00.001 query=a, rt=100 TPS=3
12:00:00.002 query=a, rt=100 TPS=3 
12:00:00.003 query=a, rt=100 TPS=3 
12:00:01.001 query=a, rt=100 TPS=2
12:00:01.002 query=a, rt=100 TPS=2
12:00:02.001 query=a, rt=100 TPS=1

so then when you calculate avg(TPS) on this data you have

3+3+3+2+2+1=14 / 6 events = 2.3333333

however, the average of the TPS is actually 3+2+1=6 / 3 seconds = 2

which is why you have different values and the eventstats+stats version is wrong.

The challenge you have is in getting both percentiles and TPS. You can only use ONE of the TPS values per second when calculating averages and percentiles and you cannot use the double stats method if you want percentiles.

This is a technique that can be used to calculate the numbers using eventstats. This uses streamstats to count the events per second and then sets all other TPS values to null apart from the first one per second, which then means you can use the avg(TPS) and percentiles as the events that have null TPS are not counted, so in the above data example, you get the correct average TPS value of 2.

| eventstats count as TPS by TIME query
| streamstats time_window=1s c as tmp
| eval TPS=if(tmp>1, null(), TPS)
| stats avg(TPS) as avgTPS perc90(TPS) as p90TPS by query

There may be other ways to do this

@ITWhisperer do you know of a way to calc TPS then percentiles+averages of that TPS without using a subsearch to duplicate the search?

 

ITWhisperer
SplunkTrust
SplunkTrust

@bowesmana Your approach is almost how I would have done it too. A slight modification is to the streamstats which should take into account the query as well. Here is a runanywhere example (based on your approach)

| makeresults count=1000
| eval _time=_time+(random()%1000)/100
| eval query=mvindex(split("ABC",""),random()%3)
| eval RT=random()%100/10
``` the lines above set up some dummy data ```
| bin _time as TIME span=1s
| eventstats count as TPS by TIME query
| streamstats c as tmp global=f by TIME query
| eval TPS=if(tmp>1, null(), TPS)
| stats avg(TPS) as avgTPS perc90(TPS) as p90TPS by query

bowesmana
SplunkTrust
SplunkTrust

Thanks for that pick @ITWhisperer - I copied the wrong streamstats example.

 

0 Karma

bowesmana
SplunkTrust
SplunkTrust

Make sure you are comparing the same time intervals when you check your numbers - if you latest time is 'now' then you will always get different results every time you run the search

0 Karma

mrigs13
Explorer

I think that's the mistake I was doing when comparing the two. Although I don't recall now 🙂. However, now the initial query seems to be giving me the results I expect.

Final working query -

 

<search>
| eval RT = endTime - startTime
| bin _time AS "TIME" span=1s
| eventstats count as TPS by TIME, query
| stats count AS txnCount, perc90(TPS) as P90_TPS, perc90(TotalTime) as P90_RT, avg(TPS) as avgTPS, avg(TotalTime) as avgRT by query
| eval avgTPS = round (avgTPS,2)
| eval avgRT = round (avgRT,2)

 

0 Karma

yuanliu
SplunkTrust
SplunkTrust

Before trying to find what is "wrong", maybe first determine if anything is actually wrong.  What is the method used to determine that the numbers do not match?  I ask because the codes suggest very dense events, so any observed difference could simply be an artifact due to data variants.

Another possible path to examine could be any side effect from "<evaluate response time as RT>" that is not illustrated.

0 Karma

mrigs13
Explorer

Thank you for responding @yuanliu. I just did a comparison w/ and w/o the RT evaluation - basically comparing the results of the 2 queries that is mentioned in the original post.

As for this -

 "<evaluate response time as RT>"

It was a simple evaluation

RT=endTime-startTime

However, it looks like the initial queries that I posted are working as expected. Probably I was using a wrong time window for running both the queries. Final working query -

<search>
| eval RT = endTime - startTime
| bin _time AS "TIME" span=1s
| eventstats count as TPS by TIME, query
| stats count AS txnCount, perc90(TPS) as P90_TPS, perc90(TotalTime) as P90_RT, avg(TPS) as avgTPS, avg(TotalTime) as avgRT by query
| eval avgTPS = round (avgTPS,2)
| eval avgRT = round (avgRT,2)
0 Karma
Get Updates on the Splunk Community!

Improve Your Security Posture

Watch NowImprove Your Security PostureCustomers are at the center of everything we do at Splunk and security ...

Maximize the Value from Microsoft Defender with Splunk

 Watch NowJoin Splunk and Sens Consulting for this Security Edition Tech TalkWho should attend:  Security ...

This Week's Community Digest - Splunk Community Happenings [6.27.22]

Get the latest news and updates from the Splunk Community here! News From Splunk Answers ✍️ Splunk Answers is ...