We have an index containing ~100k events that are each about 1k in size, making a roughly 100MB collection of data. This covers one week, but eventually we'd like to operate on months of this data. Let's consider this one week first.
We desire to study these events and group them together based on a field we call 'transaction_id'. In practice, most events have unique 'transaction_id' fields and collecting them together will only reduce the number of results to around 65% of the initial input events.
The problem we face is that attempts to aggregate this data hit memory limits (stats command) or run very slowly (transaction command).
Attempt #1: transaction
When we were rookies, we started with transaction. It has a simple and effective "UI" and we used it successfully on this problem. It is however slower and does not scale as well as a 'stats'-based solution so we desire to move on.
Attempt #2: stats
We have read the old scrolls and understand that we should prefer 'stats' if possible (for multiple reasons). We can generally get this technique to work, but if we apply too many input events, our jobs are crushed by the resource limit enforcer on our search head.
Our SPL command is very simple...
search index=data | stats first(field1) as field1 by transaction_id
Our reasoning regarding the required memory usage, from first principles, is thus. Our data set is, in total, about 100MB. In order to aggregate our data, the 'stats' command must hold it all (memory or disk) and combine then return it. If we imagine that the stats command's data structures and in-memory representation require 100x the original data input size, then we would need around 100MBx100 = 10GB of RAM for this operation, if no disk is to be used.
While we understand conceptually that the 'stats' command is smart enough to 'spill to disk' while aggregating to avoid needing infinite amounts of memory, our experience with it in these cases is not rewarding. It gets killed while using far more memory than we would expect given our 'max_mem_usage_mb' setting.
We expect that any given command in our SPL pipeline that honors 'max_mem_usage_mb' will try to use around or under our configured 200MB limit...
max_mem_usage_mb = 200
We want to believe that this should mean that our 'stats' command will begin spilling to disk before using GBs of memory. However, when we monitor this SPL query while running (watching 'top' on the search head), it is easy to see that it will grow well into the tens of GBs of resident set size.
This leads to the following questions...
How does 'stats' decide when to 'spill to disk' and change its algorithmic approach?
Are we just hitting some current weakness in the algorithm that selects the aggregation strategy?
We have taken note of the 'phased_execution' documentation in limits.conf.spec and wonder how we might learn more... 🙂
We are also aware of this talk which proposes a novel and interesting strategy for combatting this kind of problem. We are still reviewing the mechanism it proposes.