Splunk Search

Trimmed Average Calculation

zacksoft
Contributor
host = Mayhem
sourcetype="phutans:servo" host=R00878
| eval headers=split(_raw," ")
| eval plant_length=mvindex(headers,10) | sort 0 plant_length

(I am trying to calculate mean plant_lenght here but in a different way)

Let me put it as example :
say we get 10 values of plant_length. 1, 2, 2, 3, 4, 4, 4, 5, 20
Ideally to get mean, we would do (1+2+2+3+4+4+4++5+20)/10
But I don't want it this way. What I am looking for is to drop 10% of data in each tail and divide by number of values left.

Like this : (2+2+3+4+4+4++5)/8
In the numerator here the bottom 10% data (1) and top 10% of data(20) is dropped , so we have 8 vales left, hence the denominator is 8.

(In the above example as we have 10 values so 10% of 10 becomes 1. Therefore 1 value is dropped from top end and 1 value from bottom end.) And after dropping 2 values the denominator has to be 8 , not 10)

I could use some help in implementing the above scenario.

Tags (1)
0 Karma
1 Solution

Sukisen1981
Champion

Try this - index="_audit"
| stats values(available_count)
| rename values(available_count) as x
| eval perc=0.1*mvcount(x)
| mvexpand x
| eventstats count(x) as maxcount
| streamstats count as row
| where row>perc
| eval z=maxcount-row
| where z>perc|reverse

I have bulit out this on action_count on the _audit index, so the code works as is for you. Now I have renamed action_count as x and removed the top and bottom 10% values, the left over values of x is what you need to use to form your average.
WARNING - Run this for last 7 days to have a large data set AND the values keep updating real time so look at maxcount and perc. macount is your total at any point in time and perc is what you need to trim from the bottom and top at any point of time..hope it helps

View solution in original post

0 Karma

Sukisen1981
Champion

Try this - index="_audit"
| stats values(available_count)
| rename values(available_count) as x
| eval perc=0.1*mvcount(x)
| mvexpand x
| eventstats count(x) as maxcount
| streamstats count as row
| where row>perc
| eval z=maxcount-row
| where z>perc|reverse

I have bulit out this on action_count on the _audit index, so the code works as is for you. Now I have renamed action_count as x and removed the top and bottom 10% values, the left over values of x is what you need to use to form your average.
WARNING - Run this for last 7 days to have a large data set AND the values keep updating real time so look at maxcount and perc. macount is your total at any point in time and perc is what you need to trim from the bottom and top at any point of time..hope it helps

0 Karma

zacksoft
Contributor

stats value(available_count) ...gives me the unique entries only.
if I have six values such as 2,2,3,3,4,4. It only shows 2,3,4

0 Karma

Sukisen1981
Champion

hi you need to replace vlaues with list if you want non-unique values as well

index="_audit" | stats list(available_count) | rename list(available_count) as x | eval perc=0.1*mvcount(x) | mvexpand x | eventstats count(x) as maxcount | streamstats count as row | where row>perc | eval z=maxcount-row | where z>perc|reverse
0 Karma

zacksoft
Contributor

The 'list' only returns 100 values. Is there anyway to override that to see all the values?

0 Karma

Sukisen1981
Champion

Hi @zacksoft - Yes the list is limited to 100 values - https://docs.splunk.com/Documentation/Splunk/7.2.6/SearchReference/Multivaluefunctions

I honestly, did not think from the examples that you had provided that in any specific case your list of values will exceed 100 .
What is/could be the maximum number of values in your case?

0 Karma

Sukisen1981
Champion

Try this - this should account for all values , not just the list limit of 100

index="_audit" | table available_count,_time
| where isnotnull(available_count) | rename available_count as x 
|  eventstats count(x) as maxcount
| eval perc=0.1*maxcount  | streamstats count as row | where row>perc | eval z=maxcount-row | where z>perc|reverse
0 Karma

zacksoft
Contributor

@Sukisen1981 Thank you for so patiently assisting with this issue. Really appreciate it.
This is exactly what I wanted.
If I may add one small thing...."what if instead of dropping the tail 10 % of data, I choose to replace them with the next available data.
e.g. 1, 2, 2, 3, 4, 4, 4, 5, 20

The 10 % on each side is, 1 and 20.
So the updated data set to become 2, 2, 2, 3, 4, 4, 4, 5, 5. (I just replaced the each extreme 10 % with the nearest value). Could you please guide how to achieve this. Thank you.

0 Karma

Sukisen1981
Champion

Hi @zacksoft - this looks a bit tough, I suggest asking a separate question on this in the forum. Meanwhile, I will try to find a solution,many apologies but office work calls 😞 😞

0 Karma

zacksoft
Contributor

Sure @Sukisen1981 . If you get time.
Following is the link of the new question.
https://answers.splunk.com/answers/744080/winsorized-average-calculation.html?minQuestionBodyLength=...

0 Karma
Get Updates on the Splunk Community!

Splunk + ThousandEyes: Correlate frontend, app, and network data to troubleshoot ...

 Are you tired of troubleshooting delays caused by siloed frontend, application, and network data? We've got a ...

Splunk Observability for AI

Don’t miss out on an exciting Tech Talk on Splunk Observability for AI!Discover how Splunk’s agentic AI ...

🔐 Trust at Every Hop: How mTLS in Splunk Enterprise 10.0 Makes Security Simpler

From Idea to Implementation: Why Splunk Built mTLS into Splunk Enterprise 10.0  mTLS wasn’t just a checkbox ...