(Data normalization is just placing data in diagonal tables for faster retrieval.) Anyway, data characteristics matter because the cost of searching depends on them. This is true of all relational data, whether it is SQL or SPL.

"All from the same set of events" is too broad. It can describe a set of diagonal events like

field1  field2  x    y    z    a    b    c
f1v1            xv1  yv1  zv1
        f2v1                   av1  bv1  cv1
f1v2            xv2  yv2  zv2
        f2v2                   av2  bv2  cv2

But it fits just as well a set of fully populated events like

field1  field2  x    y    z    a    b    c
f1v1    f2v1    xv1  yv1  zv1  av1  bv1  cv1
f1v2    f2v2    xv2  yv2  zv2  av2  bv2  cv2

For fully populated data, why not use this?

index=example
| stats avg(field1) perc95(field2) by x,y,z,a,b,c

For diagonal (sparse) data, this would speed things up:

index=example field1=* x=* y=* z=*
| stats avg(field1) by x,y,z
| append
    [ search index=example field2=* a=* b=* c=*
    | stats perc95(field2) by a,b,c ]

I suspect that you have a specific use case, that you know your data falls somewhere between these two extremes, and that you have specific results in mind.

You are correct to say that this is data engineering, because in Splunk you are really designing your schema on the fly. This is where Splunk shows its tremendous power. In traditional data engineering, you optimize your schema based on the queries (analytics) you anticipate and on data characteristics. The same applies here: you need to articulate your data characteristics in order to optimize SPL. There is no single "optimal" pattern, not in SQL and not in SPL.

As you already realized, there is a good reason why Optimizing Searches emphasizes limiting the number of events retrieved from the index. If you append multiple subsearches that retrieve the same raw events from the index, as some of your mock code does, you naturally multiply the index-search cost. When events are numerous, index-search cost can heavily affect the total cost, so filtering in the first pipe is important. But which filters can be applied depends heavily on data characteristics and on the kind of analytics you perform. The pattern you observed is very much a function of your actual data and the stats you perform.
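If it helps to see why a single stats fails on the diagonal case, here is a minimal sketch: makeresults and streamstats fabricate and number four events, and case() populates field1/x/y/z only on odd events and field2/a/b/c only on even ones, mimicking the diagonal table above (the numeric values are invented for illustration). Because stats drops any event that is missing one of the by fields, the six-field by clause returns no rows at all for this data:

| makeresults count=4
| streamstats count as n
| eval field1=case(n==1, 10, n==3, 20),
       x=case(n==1, "xv1", n==3, "xv2"),
       y=case(n==1, "yv1", n==3, "yv2"),
       z=case(n==1, "zv1", n==3, "zv2"),
       field2=case(n==2, 5, n==4, 7),
       a=case(n==2, "av1", n==4, "av2"),
       b=case(n==2, "bv1", n==4, "bv2"),
       c=case(n==2, "cv1", n==4, "cv2")
| stats avg(field1) perc95(field2) by x,y,z,a,b,c

Replace the last line with | stats avg(field1) by x,y,z and the field1 half comes back; that is the effect the append pattern reproduces against the index.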
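For contrast, here is the same append pattern with the first-pipe filters removed. Each of the two pipelines now scans all of index=example, so the index-search cost for the same raw events is paid twice; the field filters in the sparse-data version above are what keep each pass cheap:

index=example
| stats avg(field1) by x,y,z
| append
    [ search index=example
    | stats perc95(field2) by a,b,c ]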
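As for filtering in the first pipe, the question is whether events are discarded during the index search or only after they have already been retrieved. A rough sketch against the same hypothetical index=example (recent versions of the SPL optimizer can sometimes push such filters down on their own, but it is safer not to rely on that). This is slower, because every event in index=example is retrieved and most are then thrown away:

index=example
| search field1=* x=* y=* z=*
| stats avg(field1) by x,y,z

This is faster, because the filters are part of the initial search and non-matching events are never handed to the rest of the pipeline:

index=example field1=* x=* y=* z=*
| stats avg(field1) by x,y,z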