I often run into a case where I need to take the same dataset and compute aggregate statistics over different group-by sets, for instance when I want the output of this:
index=example | stats avg(field1) by x,y,z
| append [ index=example | stats perc95(field2) by a,b,c ]
I am using the case of N=2 group-bys for convenience; in the general case there are N group-bys and arbitrary stats functions. What is the best way to optimize this kind of query without using append (which runs into subsearch limits)?
Some of the patterns I can think of are below.
One way is to use appendpipe.
index=example
| appendpipe [ | stats avg(field1) by x,y,z ]
| appendpipe [ | stats perc95(field2) by a,b,c ]
Unfortunately this seems rather slow, especially once you start adding more subsearches and preserving and passing a large number of non-transformed events through the search.
Another way is to use eventstats to preserve the events data, finishing it off with a final stats.
index=example
| eventstats avg(field1) as avg_field1 by x,y,z
| stats first(avg_field1) as avg_field1, perc95(field2) by a,b,c
Unfortunately this is not much faster.
I think there is another way using streamstats in place of eventstats, but I still haven't figured out how to retrieve the last event without just invoking eventstats last() or relying on an expensive sort.
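For illustration, the kind of shape I mean (just the avg(field1) leg -- using reverse plus dedup to grab the last running value per group is just one guess, and whether it is any cheaper than eventstats last() or a sort is exactly the part I haven't figured out):
index=example
| streamstats avg(field1) as avg_field1 by x,y,z
```after reverse, the first event of each group carries the full-group running average```
| reverse
| dedup x y z
| table x y z avg_field1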
Another way I've tried is intentionally duplicating the data using mvexpand, which has the best performance by far.
index=example
```Duplicate all the data```
| eval key="1,2"
| makemv delim="," key
| mvexpand key
```Set groupby = concatenation of groupby field values```
| eval groupby=case(key="1", x.",".y.",".z, key="2", a.",".b.",".c, true(), null())
| stats avg(field1), perc95(field2) by groupby
Are there any other patterns that are easier or faster? I'm also curious how Splunk processes things under the hood: I know something called "map-reduce" is part of it, but I'd like to know if anyone understands how to optimize this computation and why it's optimal in a theoretical sense.
As with most questions about data analytics, you need to describe the data's characteristics. Do fields x, y, z and a, b, c appear in the same events? What about field1 and field2? If they are totally diagonal (never present in the same events), you will get a diagonal matrix whether you group by x, y, z or by x.",".y.",".z.
| avg(field1) | perc95(field2) | x | y | z | a | b | c |
|---|---|---|---|---|---|---|---|
| avg_x1y1z1 | | x1 | y1 | z1 | | | |
| avg_x2y2z2 | | x2 | y2 | z2 | | | |
| | perc95_a1b1c1 | | | | a1 | b1 | c1 |
| | perc95_a2b2c2 | | | | a2 | b2 | c2 |
Such data would be best presented as two normalized tables. What is the point of combining them?
Yes, field1, field2, x, y, z, a, b, c are all from the same set of events and are all non-null. In general we might have other group-bys besides xyz and abc -- in one of my frequent use cases I have three: x, xy, and xyz (say, when I want to calculate statistics at different levels of granularity, e.g. percentile response times by hour, by hour-IP, or by hour-IP-server).
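For concreteness, the mvexpand pattern above might generalize to that three-level case roughly like this (hour, ip, server, and response_time are placeholder field names standing in for my actual fields):
index=example
| eval key="1,2,3"
| makemv delim="," key
| mvexpand key
```key selects the granularity: hour, hour-IP, or hour-IP-server```
| eval groupby=case(key="1", hour, key="2", hour.",".ip, key="3", hour.",".ip.",".server)
| stats perc95(response_time) by key, groupby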
I guess the question is more of a data-engineering problem than an analytics one: regardless of whether we want two tables or one, how do we generate the data quickly? As it happens, running two or more separate searches is significantly slower than running one search and doing some fancy stats magic on it, even if the latter is more complicated.
Also just out of curiosity, what do we mean by normalized tables here?
(Data normalization just means placing the data in diagonal tables for faster retrieval.) Anyway, the reason data characteristics matter is that the cost of searching depends on them. This is true of all relational data, whether SQL or SPL.
"All from the same set of events" is too broad. It can describe a set of diagonal events like
| field1 | field2 | x | y | z | a | b | c |
|---|---|---|---|---|---|---|---|
| f1v1 | | xv1 | yv1 | zv1 | | | |
| | f2v1 | | | | av1 | bv1 | cv1 |
| f1v2 | | xv2 | yv2 | zv2 | | | |
| | f2v2 | | | | av2 | bv2 | cv2 |
But it fits a set of fully populated events just as well:
| field1 | field2 | x | y | z | a | b | c |
|---|---|---|---|---|---|---|---|
| f1v1 | f2v1 | xv1 | yv1 | zv1 | av1 | bv1 | cv1 |
| f1v2 | f2v2 | xv2 | yv2 | zv2 | av2 | bv2 | cv2 |
For fully populated data, why not use this?
index=example
| stats avg(field1) perc95(field2) by x,y,z,a,b,c
For diagonal (sparse) data, this would speed things up:
index=example field1=* x=* y=* z=*
| stats avg(field1) by x,y,z
| append
    [ index=example field2=* a=* b=* c=*
      | stats perc95(field2) by a,b,c ]
I suspect that you have a specific use case, know that your data sit somewhere between these extremes, and have some specific results in mind. You are correct to say that this is data engineering, because in Splunk you are really designing your schema on the fly. This is where Splunk shows its tremendous power.
In traditional data engineering, you optimize your schema based on queries (analytics) you anticipate and data characteristics. Same here. You need to articulate data characteristics in order to optimize SPL. There is no single "optimal" pattern. Not in SQL, not in SPL.
As you already realized, there is a good reason why Optimizing Searches emphasizes limiting the number of events retrieved from the index. If you append multiple subsearches that retrieve the same raw events from the index, as some of your mock-ups do, it naturally multiplies the index-search cost. When events are numerous, index-search cost can heavily affect the total cost. So applying a filter in the first pipe is important. But which filters can be applied depends heavily on data characteristics and the kind of analytics you perform. The pattern you observed is very much a function of your actual data and the stats you perform.
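For example, if the data really are sparse, one way to keep the filter in the first pipe while still scanning the index only once is something like this -- a sketch only, and not necessarily faster than the append version for your data:
index=example ((field1=* AND x=* AND y=* AND z=*) OR (field2=* AND a=* AND b=* AND c=*))
```in the sparse case each event belongs to exactly one leg, so route it by which field is present```
| eval groupby=case(isnotnull(field1), x.",".y.",".z, isnotnull(field2), a.",".b.",".c)
| stats avg(field1) as avg_field1, perc95(field2) as p95_field2 by groupby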