I like and need mvexpand to work with some of my data. Sometimes, our input events contain information about multiple underlying events (especially rich JSON data sources). I understand that mvexpand can, in certain situations, lead to scaling challenges with SPL. I generally think of the problematic cases as ones where each individual input event expands into many (hundreds, thousands, or more) new events. I can imagine this being especially tricky when the arity of the expansion varies greatly from input event to input event.
I want to believe that cases where mvexpand merely causes the event count to be doubled should be safe. It seems that these cases could be implemented to be fully streamable (at the indexers) and that the SPL should scale out embarrassingly easily. Here's an example query:
| makeresults count=10000 | streamstats count | eval count=1000*round((count-1)/1000-0.5,0)
| eval mcount=mvrange(0,99,10) | mvexpand mcount | fields count mcount | fields - _raw
| eval ucount=mvrange(0,49,10) | mvexpand ucount | fields count mcount ucount | fields - _raw
| eventstats count as total by count | eventstats count as mtotal by mcount | eventstats count as utotal by ucount
| stats count, values(eval(count." (".total.")")) as cvalues,
values(eval(mcount." (".mtotal.")")) as mvalues,
values(eval(ucount." (".utotal.")")) as uvalues
This SPL makes 10,000 events and then mvexpands twice, once by 10x and once by 5x. The result is 500,000 events, as expected. By tweaking the makeresults and mvrange commands, we can test different limits of the mvexpand command.
Adjusting the ucount to mvrange(0,99,10) produces the expected 1,000,000 events. This, however, is the highest number that works as I expected. Once the number of events produced by any single mvexpand exceeds 1,000,000, some (undesirable) caps begin to be applied.
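For reference, that adjustment just replaces the ucount line of the query above with the following, so the total becomes 10,000 × 10 × 10 = 1,000,000:

| eval ucount=mvrange(0,99,10) | mvexpand ucount | fields count mcount ucount | fields - _raw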
In my case, I need to use mvexpand with a base search that itself produces many tens or hundreds of millions of events. The "expansion factor", if you will, is a small, constant number (<100, likely less than 10, and can be constrained).
Here is an example where the final expansion merely doubles the event count (in a completely local way), which I believe should work...
| makeresults count=10000 | streamstats count | eval count=1000*round((count-1)/1000-0.5,0)
| eval mcount=mvrange(0,99,1) | mvexpand mcount | fields count mcount | fields - _raw
| eval ucount=mvrange(0,49,25) | mvexpand ucount | fields count mcount ucount | fields - _raw
| eventstats count as total by count | eventstats count as mtotal by mcount | eventstats count as utotal by ucount
| stats count, values(eval(count." (".total.")")) as cvalues,
values(eval(mcount." (".mtotal.")")) as mvalues,
values(eval(ucount." (".utotal.")")) as uvalues
Instead of roughly 2,000,000 events (the exact expectation is 1,980,000, i.e. 10,000 × 99 × 2, since mvrange(0,99,1) yields 99 values), I only get 984,200 in my environment.
I am imagining building my own custom command, but I suspect that others have hit this limit. It certainly seems that mvexpand /could/ be smarter than this. Any advice?
(For the record, I have already tried the fields - _raw trick shared in other mvexpand answers.)
Yes, mvexpand is very inefficient. You can trigger the default 500MB memory limit with | makeresults | eval foo = mvrange(0,10000) | mvexpand foo on some Splunk instances, for example; 10,000 simple values shouldn't need 5MB, let alone 500.
Since there's no actual question in your question, I'll provide advice instead of an answer: file an ER with support.
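If loosening that cap is acceptable in your environment, the 500MB default mentioned above is, as far as I know, the shared max_mem_usage_mb setting in limits.conf, which mvexpand (among other commands) honors. A minimal sketch, assuming you can edit limits.conf on the relevant search head and indexers:

[default]
# Raise the memory cap (in MB) that mvexpand checks before truncating
# its output. 500 is the shipped default. Assumption: the observed cap
# in this thread is this limit and not a separate results ceiling.
max_mem_usage_mb = 2000

Note that this setting is shared by several commands, so raising it has broader effects than just mvexpand.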
Well, in many cases you can write searches in a way that doesn't need mvexpand. Whether that's possible or not depends on your case.
And in fact, Martin taught me a great trick to avoid needing mvexpand. The trick covers cases where you would ultimately just be using the field in question in the group-by clause of a subsequent stats command. In this case, you can simply leave the multi-valued field multi-valued and things will "just work". Cool trick! Thanks for showing me that one, Martin!
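As a minimal sketch of that trick (my own toy example, not from the thread): stats with a multi-valued field in its by clause groups each event under every value of that field, so no prior mvexpand is needed:

| makeresults count=3 | eval vals=mvrange(0,3) | stats count by vals

This produces three rows (vals=0, 1, 2), each with count=3, exactly as if vals had been expanded first.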
This behavior is unfortunate, but if it is the current state of the art, then an ER seems the best path forward. Thanks.