Splunk Search

get bottom 3 duration within each group

arusoft
Communicator

I have duration for multiple websites.

How can I get 3 least duration for each websites. 

So here is example

Duration_in_minwebsiteExtraColumn
10.0x.comA
2.0x.comB
2.0x.comAA
3.0x.comC
4.0x.comABC
15.0 Y.comBB
1.0 Y.comCAV
1.0 Y.comXY
3.0 Y.comA
4.0 Y.comB
5.0 Y.comBB

 

 So I only want these rows ( 3 least duration for each website). 

Duration_in_minwebsiteExtraColumn
2.0x.comB
2.0x.comAA
3.0x.comC
4.0x.comABC
1.0 Y.comCAV
1.0 Y.comXY
3.0 Y.comA
4.0 Y.comB

 

Thanks.

Labels (1)
0 Karma
1 Solution

PickleRick
SplunkTrust
SplunkTrust

So this is getting a completely different task altogether and you should have specified that from the start.

Yes, you can use streamstats (and that's probably the only reasonable approach).

For example:

<your search>
| sort 0 website Duration

Here you have your results ranked by website and within each website by Duration. We're gonna need this in this form (and you want it returned like this anyway :-)).

So now we'll check if the Duration value changed vs. previous occurrence.

| streamstats window=2 current=t dc(Duration) as duration_changes by client
| eval duration_changes=duration_changes-1

Now we can count how many times for any particular website the Duration value changed.

| streamstats sum(duration_changes) as rank by website

And now, since we know that our data was already sorted, which means that any Duration change must have been increasing the value we can filter the results to include only those that changed Duration at most two times, since you want to get top 3 results.

| where rank<3

 

View solution in original post

ITWhisperer
SplunkTrust
SplunkTrust

This relies lexicographic sorting provided by values - if you need numeric sorting, you will need to convert your duration field to a number.

| makeresults
| eval _raw="Duration	website	ExtraColumn
1 min	x.com	A
2 min	x.com	B
2 min	x.com	AA
3 min	x.com	C
4 min	x.com	ABC
1 min	Y.com	BB
1 min	Y.com	CAV
1 min	Y.com	XY
3 min	Y.com	A
4 min	Y.com	B
5 min	Y.com	BB"
| multikv forceheader=1
| table Duration website ExtraColumn



| eventstats values(Duration) as Durations by website
| eval Durations=mvdedup(Durations)
| eval Durations=mvindex(Durations,0,2)
| eval keep=mvfind(Durations,Duration)
| where isnotnull(keep)
| table Duration website ExtraColumn

arusoft
Communicator

Thanks @ITWhisperer , I do need numeric sorting and I update my question too. 

It works ( I had to add | sort website Duration ExtraColumn before we use eventstats)  if I use your example data, but its not working if I update your example with this data 

Duration website ExtraColumn
4 x.com ABC
5 x.com AB
6 x.com AC
2 x.com AA
5 y.com AB
6 y.com AC
2 y.com AA
7 x.com AC
10 x.com AA
10 x.com AAA
10 y.com AB
2 y.com ABC
3 y.com AC
3 x.com C

For some reason Values is sorting duration as if its string. Not sure why its doing that. I guess this is the reason its not working.  My duration is already a number because when I do normal sorting , it sorts corrects. | sort 0 website duration sort works fine here. Also I know my duration field is number because I am doing number based filter (duration >=0) before I use your logic.

Also I noticed that this logic takes lot of time which I guess is because of Values command. If there are 1000s record per website then this is going to be very slow ?  

 

Update: I changed Values to list and now its working. @ITWhisperer can you please advice if this is the right way or this might create other problems. But I do see that list has limitation to only display 100 values.  

0 Karma

ITWhisperer
SplunkTrust
SplunkTrust

You could create a zero filled version of the duration and use that although this may not solve the multi-value limit problem @PickleRick solution with streamstats is probably a better approach overall

| makeresults
| eval _raw="Duration_in_min,website,ExtraColumn
10,x.com,A
15,x.com,A
2,x.com,B
2,x.com,AA
3,x.com,C
4,x.com,ABC
15,Y.com,BB
10,Y.com,XY
1,Y.com,CAV
1,Y.com,XY
1,Y.com,CAV
1,Y.com,XY
1,Y.com,CAV
1,Y.com,XY
1,Y.com,CAV
1,Y.com,XY
1,Y.com,CAV
1,Y.com,XY
1,Y.com,CAVH
1,Y.com,XYF
1,Y.com,CAVF
1,Y.com,XYF
1,Y.com,CAVD
1,Y.com,XYD
1,Y.com,CAV
1,Y.com,CAV
1,Y.com,CXY
1,Y.com,CCAV
1,Y.com,XCY
1,Y.com,CAV
1,Y.com,XY
1,Y.com,CAV
1,Y.com,XY
3,Y.com,A
4,Y.com,B
5,Y.com,BB"
| multikv forceheader=1
| table Duration_in_min website ExtraColumn
| eval Duration=printf("%04d",Duration_in_min)
| eventstats values(Duration) as Durations by website
| eval Durations=mvdedup(Durations)
| eval Durations=mvindex(Durations,0,2)
| eval keep=mvfind(Durations,Duration)
| where isnotnull(keep)
| table Duration_in_min website ExtraColumn

johnhuang
Motivator

Since you're looking for the bottom 3, we can try deduping and keeping 10 lowest duration record for each site which makes it easier downstream -- you can adjust that number up or down depending on your data.

Bit of a hack, here's my attempt.

 

| makeresults
| eval _raw="Duration_in_min	website	ExtraColumn
10	x.com	A
2	x.com	B
2	x.com	AA
3	x.com	C
4	x.com	ABC
15	Y.com	BB
1	Y.com	CAV
1	Y.com	XY
3	Y.com	A
4	Y.com	B
5	Y.com	BB"
| multikv forceheader=1
| eval Duration_in_min=ROUND(Duration_in_min, 1)
| table Duration_in_min website ExtraColumn

| sort 0 Duration_in_min website
| dedup 10 website
| eval Duration_in_min=ROUND(Duration_in_min, 1)
| eventstats list(Duration_in_min) AS duration_list by website
| eval duration_list=MVDEDUP(duration_list)
| eval duration_3rd=IF(MVCOUNT(duration_list)>3, MVINDEX(duration_list, 2), MVINDEX(duration_list, -1))
| where Duration_in_min<=duration_3rd
| sort 0 -website Duration_in_min
| table Duration_in_min website ExtraColumn

 

arusoft
Communicator

Good idea, the only problem is that there is no guarantee how many duplicate durations will be there. So if I dedup 25 and there are 100 duplicate durations that are within bottom 3 then I will lose most of them.

Wish I can rank each row by site and duration and then I can easily filter by rank. I would expect Splunk should provide this out of box functionality. 

0 Karma

johnhuang
Motivator

Yep.

How many websites are in this data set?

0 Karma

arusoft
Communicator

In my case there can be around 100 sites. And each site can have 1000s of durations.

0 Karma

arusoft
Communicator

I tested with following data and it dosen't work

 

| eval _raw="Duration_in_min website ExtraColumn
10 x.com A
15 x.com A
2 x.com B
2 x.com AA
3 x.com C
4 x.com ABC
15 Y.com BB
10 Y.com XY
1 Y.com CAV
1 Y.com XY
1 Y.com CAV
1 Y.com XY
1 Y.com CAV
1 Y.com XY
1 Y.com CAV
1 Y.com XY
1 Y.com CAV
1 Y.com XY
1 Y.com CAVH
1 Y.com XYF
1 Y.com CAVF
1 Y.com XYF
1 Y.com CAVD
1 Y.com XYD
1 Y.com CAV
1 Y.com CAV
1 Y.com CXY
1 Y.com CCAV
1 Y.com XCY
1 Y.com CAV
1 Y.com XY
1 Y.com CAV
1 Y.com XY
3 Y.com A
4 Y.com B
5 Y.com BB"

0 Karma

johnhuang
Motivator

It works. I told you to adjust the dedup according to cardinality of your data. In this case, change the dedup to 100.

| makeresults
| eval _raw="Duration_in_min	website	ExtraColumn
10	x.com	A
15	x.com	A
2	x.com	B
2	x.com	AA
3	x.com	C
4	x.com	ABC
15	Y.com	BB
10	Y.com	XY
1	Y.com	CAV
1	Y.com	XY
1	Y.com	CAV
1	Y.com	XY
1	Y.com	CAV
1	Y.com	XY
1	Y.com	CAV
1	Y.com	XY
1	Y.com	CAV
1	Y.com	XY
1	Y.com	CAVH
1	Y.com	XYF
1	Y.com	CAVF
1	Y.com	XYF
1	Y.com	CAVD
1	Y.com	XYD
1	Y.com	CAV
1	Y.com	CAV
1	Y.com	CXY
1	Y.com	CCAV
1	Y.com	XCY
1	Y.com	CAV
1	Y.com	XY
1	Y.com	CAV
1	Y.com	XY
3	Y.com	A
4	Y.com	B
5	Y.com	BB"
| multikv forceheader=1
| table Duration_in_min website ExtraColumn

| sort 0 Duration_in_min website
| dedup 100 website
| eval Duration_in_min=ROUND(Duration_in_min, 1)
| eventstats list(Duration_in_min) AS duration_list by website
| eval duration_list=MVDEDUP(duration_list)
| eval duration_3rd=IF(MVCOUNT(duration_list)>3, MVINDEX(duration_list, 2), MVINDEX(duration_list, -1))
| where Duration_in_min<=duration_3rd
| sort 0 -website Duration_in_min
| table Duration_in_min website ExtraColumn

arusoft
Communicator

sorry @johnhuang I didn't tried that. But why are we sorting by duration and then by website? should this be other way around because i want this by site and then by duration. Sorry I am not understand the logic here. Can you please explain? Also is there any condition where this will fail (other than more than 100 sites if I dedup 100) ? 

0 Karma

PickleRick
SplunkTrust
SplunkTrust

You say 3 but show 4 or 5 values. Since those results are (at least in your example) indistinguishable, there's no point in showing them separately.

So it's either

<your search>
| sort website Duration
| streamstats count by website
| where count<=3

Or you could try

<your search>
| eventstats count by website Duration
| sort website Duration
| streamstats count as webcount by website
| where webcount<=3

If you really want to have all values for the three distinct lowest values of Duration, that can probably be achieved but would be way uglier to perform.

0 Karma

arusoft
Communicator

So I need to show those 4 or 5 values too because they are same. I am looking to view table where duration falls into bottom 3 for each website.  I made my example simple, but there will more columns. And I want to see all the fields. I updated my question to reflect this. Hopefully I am making sense here. But I already knew about StreamStats. And this is the reason I coudn't use it.

0 Karma

PickleRick
SplunkTrust
SplunkTrust

So this is getting a completely different task altogether and you should have specified that from the start.

Yes, you can use streamstats (and that's probably the only reasonable approach).

For example:

<your search>
| sort 0 website Duration

Here you have your results ranked by website and within each website by Duration. We're gonna need this in this form (and you want it returned like this anyway :-)).

So now we'll check if the Duration value changed vs. previous occurrence.

| streamstats window=2 current=t dc(Duration) as duration_changes by client
| eval duration_changes=duration_changes-1

Now we can count how many times for any particular website the Duration value changed.

| streamstats sum(duration_changes) as rank by website

And now, since we know that our data was already sorted, which means that any Duration change must have been increasing the value we can filter the results to include only those that changed Duration at most two times, since you want to get top 3 results.

| where rank<3

 

arusoft
Communicator

WoW.. this looks like is the solution . Thank you @ppig @johnhuang @ITWhisperer all for your solutions. All helped me learn something new today 🙂

Get Updates on the Splunk Community!

Index This | I am a number, but when you add ‘G’ to me, I go away. What number am I?

March 2024 Edition Hayyy Splunk Education Enthusiasts and the Eternally Curious!  We’re back with another ...

What’s New in Splunk App for PCI Compliance 5.3.1?

The Splunk App for PCI Compliance allows customers to extend the power of their existing Splunk solution with ...

Extending Observability Content to Splunk Cloud

Register to join us !   In this Extending Observability Content to Splunk Cloud Tech Talk, you'll see how to ...