Splunk Search

What are your best patterns for handling stats by multiple groupby sets?

w564432
Explorer

I often run into a case where I find I need to take the same dataset and compute aggregate statistics on different group-by sets, for instance if you want the output of this:  

 

index=example | stats avg(field1) by x,y,z
| append [ index=example | stats perc95(field2) by a,b,c ]

 

I am using the case n=2 groupbys for convenience. In the general case there are N groupbys, and arbitrary stats functions... what is the best way to optimize this kind of query, without using append (which runs into subsearch limits)?

Some of the patterns I can think of are below.

One way is to use appendpipe. 

 

index=example
| appendpipe [ | stats avg(field1) by x,y,z ] 
| appendpipe [ | stats perc95(field2) by a,b,c ] 

 

Unfortunately this seems kind of slow, especially once you start having to add more subsearches and preserving and passing  a large number of non-transformed events throughout the search.

Another way is to use eventstats to preserve the events data, finishing it off with a final stats.

 

index=example
| eventstats avg(field1) as avg_field1 by x,y,z
| stats first(avg_field1) as avg_field1, perc95(field2) by a,b,c

 

Unfortunately this is not much faster.

I think there is another way using streamstats in place of eventstats, but I still haven't figured out how to retrieve the last event without just invoking eventstats last() or relying on an expensive sort.  

Another way I've tried is intentionally duplicating your data using mvexpand which has the best performance by far. 

 

index=example
```Duplicate all the data```
| eval key="1,2" 
| makemv delim="," key
| mvexpand key

```Set groupby = concatenation of groupby field values```
| eval groupby=case(key=1,x.",".y.",".z, key2=a.",".b.",".c, true(), null()) 

| stats avg(field1), perc95(field2) by groupby

 

Are there any other patterns that are easier/faster? I'm curious as to how Splunk processes things under the hood, I know something called "map-reduce" is part of it but would be curious to know if anyone knows how to optimize this computation and why it's optimal in a theoretical sense. 

Labels (1)
0 Karma

yuanliu
SplunkTrust
SplunkTrust

As with most questions about data analytics, you need to explain data characteristics.  Do fields x y z and a b c appear in the same events?  What about field1 and field2?  If they are totally diagonal, you will get a diagonal matrix, whether you group by x y z or by x.",".y.",".z.

avg(field1)perc95(field2)xyzabc
avg_x1y1z1 x1y1z1   
avg_x2y2z2 x2y2z2   
 perc95_a1b1c1   a1b1c1
 perc95_a2b2c2   a2b2c2

Such data would be best presented as two normalized tables.  What is the point of combining them?

0 Karma

w564432
Explorer

Yes, field1, field2, x,y,z,a,b,c are all from the same set of events and are all non-null, and in general, we might have other groupbys besides xyz and abc -- in one of my frequent use cases I have three: x, xy, and xyz, for instance (say, when I want to calculate statistics with different levels of granularity -- e.g. percentile response times by hour, or hour-IP, or hour-IP-server ). 

I guess the question is rather more of a data-engineering problem rather than an analytics one: regardless of if we want two tables or one, how do we generate the data in a fast way? As it happens, doing two or more separate searches is significantly slower than, say, running one and doing some fancy stats magic on it, even  if it's more complicated. 

Also just out of curiosity, what do we mean by normalized tables here? 

 

0 Karma

yuanliu
SplunkTrust
SplunkTrust

(Data normalization is just to place data in diagonal tables for faster retrieval.) Anyway, the reason why data characteristics matter is because the cost of searching depends on them.  This is true in all relational data, whether it is SQL or SPL.

"All from the same set of events" is too broad.  It can describe a set of diagonal events like

field1

field2xyzabc
f1v1 xv1yv1zv1   
 f2v1   av1bv1cv1
f1v2 xv2yv2zv2   
 f2v2   av2bv2cv2

But it fits just as well a set of fully populated events like

field1fiel2xyzabc
f1v1f2v1xv1yv1zv1av1bv1cv1
f1v2f2v2xv2yv2zv2av2bv2cv2

For fully populated data, why not use this?

 

index=example
| stats avg(field1) perc95(field2) by x,y,z a,b,c

 

For diagonal (sparse) data, this would speed things up:

 

index=example field1=* x=* y=* z=*
| stats avg(field1) by x,y,z
| append
    [ index=example field2=* a=* b=* c=*
    stats perc95(field2) by a,b,c ]

 

I suspect that you have a specific use case that you know about the data that are in between the extremes, and have some specific results in mind.  You are correct to say that this is data engineering because in Splunk, you are really designing your schema on the fly.  This is where Splunk shows its tremendous power.

In traditional data engineering, you optimize your schema based on queries (analytics) you anticipate and data characteristics.  Same here.  You need to articulate data characteristics in order to optimize SPL.  There is no single "optimal" pattern.  Not in SQL, not in SPL.

As you already realized, there is a good reason why Optimizing Searches emphasizes limiting number of events retrieved from the index.  If you append multiple subsearches that retrieves the same raw events from index, as some of your mock codes do, it naturally multiplies index-search cost.  When events are numerous, index-search cost can heavily affect total cost.  So, using filter in the first pipe is important.  But which filters can be applied relies heavily on data characteristics and the kind of analytics you perform.  The pattern you observed is very much a function of your actual data based on the stats you perform.

0 Karma
Get Updates on the Splunk Community!

How to Get Started with Splunk Data Management Pipeline Builders (Edge Processor & ...

If you want to gain full control over your growing data volumes, check out Splunk’s Data Management pipeline ...

Out of the Box to Up And Running - Streamlined Observability for Your Cloud ...

  Tech Talk Streamlined Observability for Your Cloud Environment Register    Out of the Box to Up And Running ...

Splunk Smartness with Brandon Sternfield | Episode 3

Hello and welcome to another episode of "Splunk Smartness," the interview series where we explore the power of ...