Splunk Search

Has anyone else come across unexpected behavior using the (?J) mode modifier in the rex command?

gvmorley
Contributor

Hi,

I wanted to see if anyone else had come across some strange behaviour when using the (?J) mode modifier in the 'rex' command.

This modifier should allow you to use the same capture group name more than once in the same regular expression. If you try to do this without the modifier, you get the error:

Regex: two named subpatterns have the same name

In some 'rex' work that I'm doing, I'm using the Regular Expression conditionals syntax for 'If, then, else'.

The syntax for this is:

(?(?=regex)then|else)

I'm using a number of these in a nested way to match some code in Cisco ACLs that has very poor (read awful) syntax structure.

(Anyway, that's another story).

The problem I'm seeing in Splunk is that if the same capture group name appears in both the 'Then' and 'Else' parts, the field is only extracted for the 'Else' case.

If it matches in the 'Then' part, it would appear that the field gets 'nulled' due to the second definition in the 'Else' part. This feels wrong, as if the 'Then' case is matched, the regex engine shouldn't be tracking through the 'Else' part.

You can test this behaviour in Splunk with the following test case:

| makeresults
| eval case1="a then match"
| eval case2="a else match"
| rex field=case1 "(?J)a (?(?=then)(?<case1_match>then)|(?<case1_match>else)) match"
| rex field=case2 "(?J)a (?(?=then)(?<case2_match>then)|(?<case2_match>else)) match"
| table case*

In the resulting table, you should get:

case1 = "a then match"
case1_match = "then"

case2 = "a else match"
case2_match = "else"

What actually happens is that the field 'case1_match' is blank / null.
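To make the suspected mechanism concrete, here is a minimal Python sketch. It is only a model of the hypothesis, not Splunk's actual implementation. Python's `re` does not support (?J), so the two same-named groups are represented as numbered groups 1 and 2, and the conditional is rewritten as a plain alternation guarded by the same lookahead. The two lookup functions are hypothetical names: one resolves the shared name to the last group defined, the other to the first group that actually captured something.

```python
import re

# Groups 1 and 2 stand in for the two groups that share one name
# under (?J). The lookahead guards the "then" alternative.
PATTERN = re.compile(r"a (?:(?=then)(then)|(else)) match")
SAME_NAME_GROUPS = (1, 2)


def last_definition_wins(match):
    """Hypothetical lookup: the name resolves to the last group defined."""
    return match.group(SAME_NAME_GROUPS[-1])


def first_set_group(match):
    """Hypothetical lookup: the name resolves to the first group that captured."""
    for number in SAME_NAME_GROUPS:
        if match.group(number) is not None:
            return match.group(number)
    return None


then_match = PATTERN.search("a then match")
else_match = PATTERN.search("a else match")

for label, m in (("then", then_match), ("else", else_match)):
    print(label, last_definition_wins(m), first_set_group(m))
```

If the field lookup behaves like `last_definition_wins`, the 'Then' case comes back empty while the 'Else' case works, which is the pattern seen in the test above. A lookup like `first_set_group` would return both values.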

I've tried the expression in the online Regex101 site (unfortunately I can't post URLs yet, but copy/paste 'regex101.com/r/lX2uY8/2').

Has anyone else come across this issue before?
Is it by design or is it a bug?

I'm sure that there are other ways for me to tackle what I'm looking at (I'm not too worried about that). What I just want to know is if this is functionality that 'should' work in Splunk.

This is in version 6.4.2 of Splunk Enterprise.

Many thanks,

Graham.

lenpistoria
New Member

Almost 5 years to the day after the last reply to this thread, and this issue still hasn't been resolved in Splunk.

Our situation is slightly different;  we have raw csv data which is processed via regex captures in in-line field extractions (EXTRACT-)  in props.conf.  The caveat for this type of field extraction is that only one single regex can be used.

For years this has worked great, but recently for the same "source", 2 additional fields have been added, making the existing extraction not work.

To compensate, I've retooled the regex using a conditional if/then/else lookahead to test if N+2 fields exist.  If so, process through the "then" regex, otherwise the "else".

Here's the regex:

(?J)^[^,\n]*,(?(?=([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),([^,]+))((?P<retailer>[^,]+),(?P<storeNumber>[^,]+),(?P<displayName>[^,]+),(?P<button>[^,]+),(?P<buttonSelection>[^,]+),(?P<activeDevice>[^,]+),(?P<swVersion>[^,]+),(?P<contentVersion>[^,]+),(?P<platform>[^,]+),(?P<configType>[^,]+),(?P<presses>[^,]+))|((?P<retailer>[^,]+),(?P<storeNumber>[^,]+),(?P<displayName>[^,]+),(?P<button>[^,]+),(?P<swVersion>[^,]+),(?P<contentVersion>[^,]+),(?P<platform>[^,]+),(?P<configType>[^,]+),(?P<presses>[^,]+)))

Here are examples of the two possible data:

Original:

20210727,some_retailer,950,display-10,someButton,7.99.9966,2021.07.11-USA,platformX,PLAYER_MASTER,35

New data:

20210727,some_retailer,950,display-10,someButton,THIS IS THE BUTTON SELECTION,BIG-DEVICE,7.99.9966,2021.07.11-USA,platformX,PLAYER_MASTER,35

This works very well on regex101.com.  Captures populate as expected depending on the number of comma-delimited values present.

Sadly, when I push the change to my dev Splunk, it doesn't work. For the original data (i.e., with two fewer fields), all fields get extracted. However, for the newer data, ONLY the two new fields get extracted.

It's as if the Splunk regex engine isn't implementing PCRE2 correctly, in particular the (?J).

Unfortunately I don't have the privilege of writing tickets, but I do believe it's a Splunk error.

With all that being said, do any of you gurus out there know another way to write a single regex that I could try to accomplish this?

Many thanks in advance! 
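One possible single-regex alternative, sketched here in Python rather than tested in Splunk: name each field only once and wrap the two new fields in an optional non-capturing group, so neither (?J) nor a conditional is needed. The sketch assumes every record ends right after the last field, which is why it adds a `$` anchor that the regex above does not have. Without that anchor the optional group could not tell the two layouts apart. The syntax used is common to Python's `re` and PCRE, but whether an EXTRACT- in props.conf behaves the same way is an assumption that would need checking.

```python
import re

# Each field name appears once. The two new fields sit in an optional
# non-capturing group; the trailing $ forces the engine to skip that
# group when the record is the shorter, original layout.
FIELDS = re.compile(
    r"^[^,\n]*,"
    r"(?P<retailer>[^,]+),(?P<storeNumber>[^,]+),"
    r"(?P<displayName>[^,]+),(?P<button>[^,]+),"
    r"(?:(?P<buttonSelection>[^,]+),(?P<activeDevice>[^,]+),)?"
    r"(?P<swVersion>[^,]+),(?P<contentVersion>[^,]+),"
    r"(?P<platform>[^,]+),(?P<configType>[^,]+),(?P<presses>[^,\n]+)$"
)

original = ("20210727,some_retailer,950,display-10,someButton,"
            "7.99.9966,2021.07.11-USA,platformX,PLAYER_MASTER,35")
new_data = ("20210727,some_retailer,950,display-10,someButton,"
            "THIS IS THE BUTTON SELECTION,BIG-DEVICE,"
            "7.99.9966,2021.07.11-USA,platformX,PLAYER_MASTER,35")

old_fields = FIELDS.match(original).groupdict()
new_fields = FIELDS.match(new_data).groupdict()

print(old_fields)
print(new_fields)
```

In the Python run, the shorter record leaves buttonSelection and activeDevice unset, and the longer record fills all eleven fields.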


gvmorley
Contributor

It looks like this may be a 'PCRE' thing as opposed to anything to do with Splunk.

The site here (http://www.regular-expressions.info/branchreset.html) suggests that:

"In Perl and PCRE, it is best to use a branch reset group when you want groups in different alternatives to have the same name. That's the only way in Perl and PCRE to make sure that groups with the same name really are one and the same group."

So dropping the (?J) mode and the conditional, I can still use duplicate subpatterns within a Branch Reset group:

| rex field=case1 "a (?|(?<case1b_match>then)|(?<case1b_match>else)) match"
| rex field=case2 "a (?|(?<case2b_match>then)|(?<case2b_match>else)) match"

Not quite what I'm looking for, as I'd still like to use the conditional form, but it's an interesting one nevertheless.

Every day's a school-day when it comes to Regular Expressions!

jhg03a
Explorer

I worked around the problem by creating multiple transform stanzas and prioritizing them in the report stanza.

transforms.conf
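# The capture-group name appears to have been stripped from the original post;
# "case_match" below is a placeholder, not the author's field name.
[then_match]
REGEX = a (?<case_match>then) match

[else_match]
REGEX = a (?<case_match>else) match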

props.conf
REPORT-thenelse = then_match,else_match


gvmorley
Contributor

Thanks for the suggestion. Another interesting way to solve the example.

What I'm actually doing is using conditionals to do lots of branching. Think of it like using nested IFs in Excel. You end up with something like this:

(?(?=regex)then|(?(?=regex)then|(?(?=regex)then|(?(?=regex)then|else))))

etc...
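For illustration only, the same nested if/then/else shape can be sketched outside the regex as an ordered list of rules, with the first matching test winning. This is Python, not SPL, and every pattern and field name in it is hypothetical (loosely ACL-flavoured, not taken from the real expressions). It also shows the same field name being reused across branches without needing (?J), because each branch is its own pattern.

```python
import re

# Hypothetical rules: (test, extraction pattern). Order matters, just as
# it does in a nested (?(?=regex)then|else) chain.
RULES = [
    (re.compile(r"host "), re.compile(r"host (?P<addr>\S+)")),
    (re.compile(r"object-group "), re.compile(r"object-group (?P<group>\S+)")),
    (re.compile(r"any\b"), re.compile(r"(?P<addr>any)")),
]

# Final "else" branch: an address followed by a mask.
FALLBACK = re.compile(r"(?P<addr>\S+) (?P<mask>\S+)")


def extract(text):
    """Return the fields from the first rule whose test matches."""
    for test, pattern in RULES:
        if test.match(text):  # plays the role of (?=regex)
            found = pattern.match(text)
            return found.groupdict() if found else {}
    found = FALLBACK.match(text)
    return found.groupdict() if found else {}


for sample in ("host 10.0.0.1", "object-group WEB", "any", "10.0.0.0 0.0.0.255"):
    print(sample, extract(sample))
```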

There are lots of other ways to achieve what I'm looking at; I'm more curious about the (?J) behaviour and whether others are using it.

Thanks again.


Masa
Splunk Employee

It seems that having two groups with the same name in rex confuses Splunk's field extraction.
This worked in my test.

| makeresults
 | eval case1="a then match"
 | eval case2="a else match"
 | rex field=case1 "a (?<case1_match>(?(?=then)then|else)) match"
 | rex field=case2 "a (?<case2_match>(?(?=then)then|else)) match"
 | table case*

From a scalability and resource-usage standpoint, I would avoid conditionals and lookaheads/lookbehinds in regexes as much as possible. But as a check of functionality, it is a good question 🙂


martin_mueller
SplunkTrust

Mild tangent: Have you considered this alternative to (?J)?

| rex field=case1 "a (?<case1_match>(?(?=then)then|else)) match"
| rex field=case2 "a (?<case2_match>(?(?=then)then|else)) match"

gvmorley
Contributor

Thanks for the feedback. Yes, your example would certainly work, as would the very simple form:

| rex field=case1 "a (?<case1_match>then|else) match"
| rex field=case2 "a (?<case2_match>then|else) match"

without any need for the conditional.

The example in the original post was just to demonstrate the issue that I'm seeing with the duplicate subpattern mode.

I'm more curious to see whether others are trying to use the (?J) mode and what their thoughts are.

The great thing about Splunk and Regex is that there are always going to be lots of ways to get to the answer!

Thanks again.
