I often run into a case where I need to take the same dataset and compute aggregate statistics over different group-by sets, for instance when I want the output of this:
index=example | stats avg(field1) by x,y,z
| append [ index=example | stats perc95(field2) by a,b,c ]
I am using the case of N=2 group-bys for convenience; in the general case there are N group-bys and arbitrary stats functions. What is the best way to optimize this kind of query without using append (which runs into subsearch limits)?
Some of the patterns I can think of are below.
One way is to use appendpipe.
index=example
| appendpipe [ | stats avg(field1) by x,y,z ]
| appendpipe [ | stats perc95(field2) by a,b,c ]
Unfortunately this seems rather slow, especially once you start adding more subsearches and preserving and passing a large number of non-transformed events through the search.
Another way is to use eventstats to preserve the events data, finishing it off with a final stats.
index=example
| eventstats avg(field1) as avg_field1 by x,y,z
| stats first(avg_field1) as avg_field1, perc95(field2) by a,b,c
Unfortunately this is not much faster.
I think there is another way using streamstats in place of eventstats, but I still haven't figured out how to retrieve the last event without just invoking eventstats last() or relying on an expensive sort.
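For illustration, the kind of shape I mean (just the avg(field1) leg -- using reverse plus dedup to grab the last running value per group is just one guess, and whether it is any cheaper than eventstats last() or a sort is exactly the part I haven't figured out):
index=example
| streamstats avg(field1) as avg_field1 by x,y,z
```after reverse, the first event of each group carries the full-group running average```
| reverse
| dedup x y z
| table x y z avg_field1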
Another way I've tried is intentionally duplicating the data using mvexpand, which has the best performance by far.
index=example
```Duplicate all the data```
| eval key="1,2"
| makemv delim="," key
| mvexpand key
```Set groupby = concatenation of groupby field values```
| eval groupby=case(key="1", x.",".y.",".z, key="2", a.",".b.",".c, true(), null())
| stats avg(field1), perc95(field2) by groupby
Are there any other patterns that are easier or faster? I'm also curious how Splunk processes things under the hood: I know something called "map-reduce" is part of it, but I'd like to know if anyone understands how to optimize this computation and why it's optimal in a theoretical sense.
As with most questions about data analytics, you need to describe the data's characteristics. Do fields x, y, z and a, b, c appear in the same events? What about field1 and field2? If they are totally diagonal (never present in the same events), you will get a diagonal matrix whether you group by x, y, z or by x.",".y.",".z.
| avg(field1) | perc95(field2) | x | y | z | a | b | c |
|---|---|---|---|---|---|---|---|
| avg_x1y1z1 | | x1 | y1 | z1 | | | |
| avg_x2y2z2 | | x2 | y2 | z2 | | | |
| | perc95_a1b1c1 | | | | a1 | b1 | c1 |
| | perc95_a2b2c2 | | | | a2 | b2 | c2 |
Such data would be best presented as two normalized tables. What is the point of combining them?
Yes, field1, field2, x, y, z, a, b, c are all from the same set of events and are all non-null. In general we might have other group-bys besides xyz and abc -- in one of my frequent use cases I have three: x, xy, and xyz (say, when I want to calculate statistics at different levels of granularity, e.g. percentile response times by hour, by hour-IP, or by hour-IP-server).
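For concreteness, the mvexpand pattern above might generalize to that three-level case roughly like this (hour, ip, server, and response_time are placeholder field names standing in for my actual fields):
index=example
| eval key="1,2,3"
| makemv delim="," key
| mvexpand key
```key selects the granularity: hour, hour-IP, or hour-IP-server```
| eval groupby=case(key="1", hour, key="2", hour.",".ip, key="3", hour.",".ip.",".server)
| stats perc95(response_time) by key, groupby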
I guess the question is more of a data-engineering problem than an analytics one: regardless of whether we want two tables or one, how do we generate the data quickly? As it happens, running two or more separate searches is significantly slower than running one search and doing some fancy stats magic on it, even if the latter is more complicated.
Also just out of curiosity, what do we mean by normalized tables here?
(Data normalization just means placing the data in diagonal tables for faster retrieval.) Anyway, the reason data characteristics matter is that the cost of searching depends on them. This is true of all relational data, whether SQL or SPL.
"All from the same set of events" is too broad. It can describe a set of diagonal events like
| field1 | field2 | x | y | z | a | b | c |
|---|---|---|---|---|---|---|---|
| f1v1 | | xv1 | yv1 | zv1 | | | |
| | f2v1 | | | | av1 | bv1 | cv1 |
| f1v2 | | xv2 | yv2 | zv2 | | | |
| | f2v2 | | | | av2 | bv2 | cv2 |
But it fits a set of fully populated events just as well:
| field1 | field2 | x | y | z | a | b | c |
|---|---|---|---|---|---|---|---|
| f1v1 | f2v1 | xv1 | yv1 | zv1 | av1 | bv1 | cv1 |
| f1v2 | f2v2 | xv2 | yv2 | zv2 | av2 | bv2 | cv2 |
For fully populated data, why not use this?
index=example
| stats avg(field1) perc95(field2) by x,y,z,a,b,c
For diagonal (sparse) data, this would speed things up:
index=example field1=* x=* y=* z=*
| stats avg(field1) by x,y,z
| append
    [ index=example field2=* a=* b=* c=*
      | stats perc95(field2) by a,b,c ]
I suspect that you have a specific use case, know that your data sit somewhere between these extremes, and have some specific results in mind. You are correct to say that this is data engineering, because in Splunk you are really designing your schema on the fly. This is where Splunk shows its tremendous power.
In traditional data engineering, you optimize your schema based on queries (analytics) you anticipate and data characteristics. Same here. You need to articulate data characteristics in order to optimize SPL. There is no single "optimal" pattern. Not in SQL, not in SPL.
As you already realized, there is a good reason why Optimizing Searches emphasizes limiting the number of events retrieved from the index. If you append multiple subsearches that retrieve the same raw events from the index, as some of your mock-ups do, it naturally multiplies the index-search cost. When events are numerous, index-search cost can heavily affect the total cost. So applying a filter in the first pipe is important. But which filters can be applied depends heavily on data characteristics and the kind of analytics you perform. The pattern you observed is very much a function of your actual data and the stats you perform.
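For example, if the data really are sparse, one way to keep the filter in the first pipe while still scanning the index only once is something like this -- a sketch only, and not necessarily faster than the append version for your data:
index=example ((field1=* AND x=* AND y=* AND z=*) OR (field2=* AND a=* AND b=* AND c=*))
```in the sparse case each event belongs to exactly one leg, so route it by which field is present```
| eval groupby=case(isnotnull(field1), x.",".y.",".z, isnotnull(field2), a.",".b.",".c)
| stats avg(field1) as avg_field1, perc95(field2) as p95_field2 by groupby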