Splunk Search

Bug in rex command?? Not working if the raw data has more than 12 pipe or comma.

somesoni2
Revered Legend

Running a simple in-line field extraction command.

| gentimes start=-1 | eval temp="f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,f11,f12,f13,f14,f15" | table temp | eval _raw=temp | rex field=temp "(?<aaa>.+),(?<bb>.+),(?<cc>.+),(?<dd>.+),(?<ee>.+),(?<ff>.+),(?<gg>.+),(?<hh>.+),(?<ii>.+),(?<jj>.+),(?<kk>.+),(?<ll>.+),(?<mmm>.+),(?<nn>.+),(?<oooo>.+)"

The command runs and shows headers for the 15 fields I'm extracting, but the field values are blank.

Whereas if I replace the first 2 commas with spaces (and update the regex accordingly), everything works just fine.

| gentimes start=-1 | eval temp="f1 f2 f3,f4,f5,f6,f7,f8,f9,f10,f11,f12,f13,f14,f15" | table temp | eval _raw=temp | rex field=temp "(?<aaa>.+) (?<bb>.+) (?<cc>.+),(?<dd>.+),(?<ee>.+),(?<ff>.+),(?<gg>.+),(?<hh>.+),(?<ii>.+),(?<jj>.+),(?<kk>.+),(?<ll>.+),(?<mmm>.+),(?<nn>.+),(?<oooo>.+)"


Same behavior in Splunk 6.2 and 6.3. Anyone seen this??

1 Solution

gokadroid
Motivator

I am not sure whether this is a bug in Splunk or the regex itself is throwing it off. The first capturing group in your first example, (?<aaa>.+), matches any character one or more times and is greedy, so it takes as much as it can. The expression has no way of knowing where the first group should stop: it consumes every character, commas included, up to the end of the string, and then has to give characters back one at a time until the remaining 14 groups and their commas can be matched. Every later .+ group behaves the same way, so the number of combinations to try grows very quickly. My assumption is that the engine gives up before it finds the match and leaves all the captures blank. Remember that with .+ the regex engine cannot tell in advance that there are exactly 14 commas, one between each pair of groups we want to capture.
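A small Python sketch of the greedy behaviour described above, on a three-field input short enough to follow by hand. It uses Python's re module rather than Splunk's PCRE, so named groups are written (?P<name>...); the sample strings and variable names are made up for illustration.

```python
import re

# Python's re spells named groups (?P<name>...); rex accepts (?<name>...).
# Three greedy groups, same shape as the 15-group pattern in the question.
pattern = re.compile(r"(?P<a>.+),(?P<b>.+),(?P<c>.+)")

# Exactly two commas: group "a" first grabs the whole string, then gives
# characters back until the other two groups and both commas fit.
exact = pattern.search("f1,f2,f3").groupdict()
print(exact)   # {'a': 'f1', 'b': 'f2', 'c': 'f3'}

# One comma more than the pattern needs: the greedy first group keeps
# the extra comma instead of stopping at the first one.
extra = pattern.search("f1,f2,f3,f4").groupdict()
print(extra)   # {'a': 'f1,f2', 'b': 'f3', 'c': 'f4'}
```

A match is found in both cases, but only after the engine has tried and rejected longer candidates for each group, and that retrying is what multiplies as more .+ groups are added.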

I think that if you change your first few capturing groups to something that is not a wildcard (say \w+ or [^,]+), much as your space-delimited example does, it will behave as expected: the engine then has a definite match right at the start, and the remaining "wildcard" groups fall into place after that.

As for the statement "Not working if the raw data has more than 12 pipe or comma", I think it will work with more than 12 pipes or commas if you change the first few capturing groups to non-greedy or exclusion-based ones (similar to matching a definite space) instead of greedy captures. Both of the following, which make only the first two groups stricter, should work:

| rex field=temp "(?<aaa>.+?),(?<bb>.+?),(?<cc>.+),(?<dd>.+),(?<ee>.+),(?<ff>.+),(?<gg>.+),(?<hh>.+),(?<ii>.+),(?<jj>.+),(?<kk>.+),(?<ll>.+),(?<mmm>.+),(?<nn>.+),(?<oooo>.+)"

OR

| rex field=temp "(?<aaa>[^,]+),(?<bb>[^,]+),(?<cc>.+),(?<dd>.+),(?<ee>.+),(?<ff>.+),(?<gg>.+),(?<hh>.+),(?<ii>.+),(?<jj>.+),(?<kk>.+),(?<ll>.+),(?<mmm>.+),(?<nn>.+),(?<oooo>.+)"

It might be that the regex engine fails once there are more than 13 identical wildcard capturing groups, rather than 12 commas or 12 pipes being the cause of the failure. In short, it shouldn't fail if the repeated capture groups are non-greedy or exclude the delimiter, since those are stricter matches than repeated wildcard captures.
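Taking the second form to every group, here is a hedged Python sketch (Python's re, not rex) that builds the pattern from the question's field names and checks the captures. Using [^,]+ for all 15 groups and anchoring with ^ and $ are my assumptions, not something tested in Splunk in this thread, and the approach only makes sense when no field value contains a comma.

```python
import re

# Field names taken from the rex command in the question.
NAMES = ["aaa", "bb", "cc", "dd", "ee", "ff", "gg", "hh",
         "ii", "jj", "kk", "ll", "mmm", "nn", "oooo"]

# Every group excludes the delimiter, so no group can run past a comma.
# Python needs (?P<name>...) where rex accepts (?<name>...).
STRICT = re.compile("^" + ",".join(f"(?P<{n}>[^,]+)" for n in NAMES) + "$")

sample = ",".join(f"f{i}" for i in range(1, 16))    # "f1,f2,...,f15"
fields = STRICT.match(sample).groupdict()
print(fields["aaa"], fields["hh"], fields["oooo"])  # f1 f8 f15

# A row with only 14 fields is rejected outright instead of being
# split in some unexpected way.
short_row = ",".join(f"f{i}" for i in range(1, 15))
print(STRICT.match(short_row))                      # None
```

Because each group stops at the next comma, there is only one way to split the row, so the engine has nothing to retry.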


acharlieh
Influencer

To be fair to Splunk... putting your regex and sample data row into regex101.com fails due to "Catastrophic Backtracking": https://regex101.com/r/U3e5mq/1

Even your modified regex and sample data take over half a second because of the amount of backtracking required to match (or to identify mismatches).

If you can guarantee there are no commas in the body of your data, replacing "any character, any number of times" with "any character except a comma", as suggested by gokadroid, takes the execution time of your regex from catastrophic to almost instantaneous.
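To get a feel for why the difference is so large, here is a toy Python model of the retrying. It is a sketch under my own assumptions: greedy_attempts and sample_row are hypothetical helpers, and the count is not how PCRE or regex101 measure steps, only a rough picture of how many candidate splits a chain of greedy .+ groups goes through.

```python
def sample_row(n):
    """Build "f1,f2,...,fn", the same shape as the data in the question."""
    return ",".join(f"f{i}" for i in range(1, n + 1))


def greedy_attempts(text, pos, groups_left):
    """Toy model of a backtracking engine trying (.+),(.+),...,(.+).

    Returns (matched, attempts). Illustration only: real engines count
    their internal steps differently.
    """
    if pos >= len(text):
        return False, 1      # nothing left for this group to consume
    if groups_left == 1:
        return True, 1       # the last .+ simply takes the rest
    attempts = 0
    # Greedy: try the longest candidate first, then give back one
    # character at a time, recursing whenever a comma follows.
    for end in range(len(text), pos, -1):
        attempts += 1
        if end < len(text) and text[end] == ",":
            ok, sub = greedy_attempts(text, end + 1, groups_left - 1)
            attempts += sub
            if ok:
                return True, attempts
    return False, attempts


results = {}
for n in (5, 10, 15):
    row = sample_row(n)
    matched, attempts = greedy_attempts(row, 0, n)
    results[n] = attempts
    # A [^,]+ group cannot cross a comma, so a chain of them reads each
    # character of the row about once; len(row) is that baseline.
    print(f"{n} fields: matched={matched}, "
          f"greedy attempts={attempts}, row length={len(row)}")
```

In this model the greedy count grows far faster than the row length as fields are added. As far as I know, Splunk bounds this kind of work with the match_limit setting under [rex] in limits.conf; I have not checked whether that is what blanks the fields in 6.2 or 6.3.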

gokadroid
Motivator

Here is more on catastrophic backtracking, which is probably what regex101.com ran into. I picked up the link from the sidebar of the page you posted.

somesoni2
Revered Legend

That is it. Being too vague in the regex was causing the issue. Your second type of regex (solution 2) works fine.
