I'm really confused about performance related to the use of foreach + rename. I have a macro that renames fields to avoid potential name collisions. My original search takes about 5 seconds on some 100K events; after applying the macro, the search runs for more than 50 seconds! When I remove foreach + rename from the macro, the commands in between add practically no run time, not even 1 second.
To test, I set up several macros. The one that resembles my real macro is:
This macro does nothing except rename a bunch of fields back and forth using foreach:
```
foreach a b c d
    [rename <<FIELD>> as _macro_<<FIELD>>]
| noop
| foreach a b c d
    [rename _macro_<<FIELD>> as <<FIELD>>]
```
Running this over 400K _internal events with a simple stats takes >4s. Running the same stats with a macro that has no foreach + rename, or with a macro in which the renames are fully spelled out, takes <0.4s.
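For reference, the test search is shaped roughly like this (a sketch; `rename_macro` is a hypothetical name for the macro above, and the time range and stats clause are illustrative):

```
index=_internal earliest=-15m
| `rename_macro`
| stats count by sourcetype
```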
How can the dramatic increase in run time on my 100K events be explained when foreach + rename is used on just a handful of names? And why is the degradation far smaller on my 400K events when an equivalent amount of foreach + rename is applied?
(I have also tested the effect of foreach + rename without a macro. It is just as bad. A macro is simply where foreach + rename has more practical application.)
@yuanliu Hello! Because foreach is a "template" command that can handle wildcards (like foreach *), it forces the search engine to switch to full materialization.
The reason your search slows down is exactly what you suspected: the "brute force" extraction.
Can you try running your search again, but put | fields a b c d (the fields you are renaming) immediately before your macro? This tells Splunk to throw away the extra data before foreach forces the full extraction.
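For example (a sketch; `your_macro` and `your_index` stand in for the actual macro and index names):

```
index=your_index
| fields a b c d
| `your_macro`
| stats count
```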
Cheers!
> @yuanliu Hello! Because foreach is a "template" command that can handle wildcards (like foreach *), it forces the search engine to switch to full materialization.
This is so meta 😆 In my Slack thread on this subject, madscient used "'all bets are off' mode" to describe such a scenario. Meanwhile, foreach itself doesn't always degrade performance; in fact, my macro contains another foreach command that does regex and some multivalue calculations, with no degradation. That's why I speculated in that thread that namespace manipulation may be what triggers foreach to go wild.
I also specifically avoided wildcards with foreach in the hope that the compiler would treat the tokens as constants and resolve the renames at compile time. Obviously that was misguided.
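For comparison, the fully spelled-out equivalent that avoids foreach entirely (and, per my tests above, runs fast) is a sketch like:

```
rename a as _macro_a, b as _macro_b, c as _macro_c, d as _macro_d
| noop
| rename _macro_a as a, _macro_b as b, _macro_c as c, _macro_d as d
```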
> Can you try running your search again, but put | fields a b c d (the fields you are renaming) immediately before your macro? This tells Splunk to throw away the extra data before the foreach
The reason I want to preempt potential name collisions by adding such renames is so the macro can stand alone, without worrying about side effects. In my real macro, a b c d are names that wouldn't usually exist in a practical dataset. (In other words, the intention is similar to declaring a set of variables as local in other languages.)
Meanwhile, I am also aware of the effect of narrowing the field set, again in relation to namespace manipulation. A while ago, I posted another Slack thread about the tojson command vs the json_object() call. Here was my observation:
I did some tests with a dataset comparable to that used in my original observation.
- Baseline: `| stats count`
- tojson: `| tojson output_field=json field1 field2 field3 | stats count`
- eval: `| eval json = json_object("field1", field1, "field2", field2, "field3", field3) | stats count`
A | stats count command terminates all three so the job inspector can be compared. All three give an event count of 102,365. Baseline runs for 0.20s, tojson for 38.6s, eval for 0.26s.

I suspect that the number of fields carried in raw events may have some effect. So I ran | fields field1 field2 field3 | tojson output_field=json field1 field2 field3 | stats count. In the half-million dataset, run time was reduced to 1.6s: still meaningfully longer than eval, but the slowdown is much reduced. Of course, determining which fields might be needed downstream is tedious work, and things can change in the future. So, as a practical measure, I will write the long json_object where I can. (Which fields I need in a JSON object is much easier to determine, and future maintenance is more straightforward.)
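To restate that workaround as a pipeline (a sketch using the same placeholder field names):

```
| fields field1 field2 field3
| tojson output_field=json field1 field2 field3
| stats count
```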
Knowing that the number of fields has an effect, I also tested Smart Mode against Fast Mode. No difference.
Is tojson examining every field in a raw event even when a field list is provided? (I do understand that the syntax allows tojson to be used without a list.) Could this be a potential bug?
Basically, my observations with both the foreach + rename combo and tojson suggest that whenever a command's syntax potentially allows unlimited name tokens, the command goes into that "all bets are off" mode and starts to operate on an imaginary list of tokens, even when the actual name tokens are explicitly enumerated.