Solved: Re: Bug in rex command?? Not working if the raw data has more than 12 pipe or comma.

somesoni2 · ‎01-13-2017

Running a simple in-line field extraction command.

| gentimes start=-1 | eval temp="f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,f11,f12,f13,f14,f15" | table temp | eval _raw=temp | rex field=temp "(?<aaa>.+),(?<bb>.+),(?<cc>.+),(?<dd>.+),(?<ee>.+),(?<ff>.+),(?<gg>.+),(?<hh>.+),(?<ii>.+),(?<jj>.+),(?<kk>.+),(?<ll>.+),(?<mmm>.+),(?<nn>.+),(?<oooo>.+)"

The command runs and shows headers for the 15 fields I'm extracting, but the field values are blank.

However, if I replace the first two commas with spaces (and update the regex accordingly), everything works just fine.

| gentimes start=-1 | eval temp="f1 f2 f3,f4,f5,f6,f7,f8,f9,f10,f11,f12,f13,f14,f15" | table temp | eval _raw=temp | rex field=temp "(?<aaa>.+) (?<bb>.+) (?<cc>.+),(?<dd>.+),(?<ee>.+),(?<ff>.+),(?<gg>.+),(?<hh>.+),(?<ii>.+),(?<jj>.+),(?<kk>.+),(?<ll>.+),(?<mmm>.+),(?<nn>.+),(?<oooo>.+)"

Same behavior in Splunk 6.2 and 6.3. Anyone seen this??

gokadroid · ‎01-13-2017

I am not sure whether this is a bug in Splunk or whether the regex itself is throwing it off. The first capturing group in the first example, (.+), matches any character one or more times and as many times as possible (greedy), so it does not know where to stop: it happily consumes one character after another, commas included, until it has run well past the first field (maybe up to the last comma), and only then discovers that the remaining 14 capturing groups have nothing left to match. The engine then has to back up and retry, and with 15 greedy wildcard groups in a row there are a great many combinations to retry, so I assume it gives up and leaves all the captures blank. Remember that nothing in this pattern tells the engine up front that there are exactly 14 commas, one between each pair of groups we want to capture.
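To see the greedy behaviour described above in isolation, here is a small sketch in Python rather than SPL. It assumes a shortened three-field sample (`f1,f2,f3`, not the 15-field row from the question) and uses Python's `re` module, which is a different engine from the one behind `rex`, so it only illustrates how greedy, lazy, and negated-class groups divide the same string.

```python
import re

sample = "f1,f2,f3"  # shortened stand-in for the 15-field row (assumption)

# Greedy: the first .+ grabs as much as it can and gives back only
# enough characters for the rest of the pattern to match.
greedy = re.match(r"(.+),(.+)", sample).groups()

# Lazy: the first .+? stops at the earliest comma that lets the match succeed.
lazy = re.match(r"(.+?),(.+)", sample).groups()

# Negated class: [^,]+ can never run past a comma, so there is nothing to give back.
negated = re.match(r"([^,]+),([^,]+)", sample).groups()

print(greedy)   # ('f1,f2', 'f3')
print(lazy)     # ('f1', 'f2,f3')
print(negated)  # ('f1', 'f2')
```

With only two groups the extra work is invisible, but every additional greedy group multiplies the number of places the engine may have to retry.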

I think that if you change your first few capturing groups (as in your space-separated example) to something other than a wildcard, say \w+ or [^,]+, it will behave as expected, because the engine then has something definite to match right at the start and the remaining "wildcard" pieces fall into place after that.

As for the statement "Not working if the raw data has more than 12 pipe or comma", I think it will work with more than 12 pipes or commas if you just change the first few capturing groups to exclusions or non-greedy ones (like matching a definite space) rather than greedy captures. Both of the following, which make the first two groups non-greedy or exclusion-based, should work:

| rex field=temp "(?<aaa>.+?),(?<bb>.+?),(?<cc>.+),(?<dd>.+),(?<ee>.+),(?<ff>.+),(?<gg>.+),(?<hh>.+),(?<ii>.+),(?<jj>.+),(?<kk>.+),(?<ll>.+),(?<mmm>.+),(?<nn>.+),(?<oooo>.+)"

OR

| rex field=temp "(?<aaa>[^,]+),(?<bb>[^,]+),(?<cc>.+),(?<dd>.+),(?<ee>.+),(?<ff>.+),(?<gg>.+),(?<hh>.+),(?<ii>.+),(?<jj>.+),(?<kk>.+),(?<ll>.+),(?<mmm>.+),(?<nn>.+),(?<oooo>.+)"

It might be that the regex engine starts failing once there are more than 13 identical wildcard capturing groups, rather than 12 commas or 12 pipes being the cause. In short, it shouldn't fail if the repeated capture groups are non-greedy or non-wildcard, i.e. stricter matches or exclusions, rather than repeated wildcard captures.
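Taking the exclusion idea one step further, the sketch below uses [^,]+ for every group rather than only the first two. It is written in Python so it can be run outside Splunk; the `^`/`$` anchors and the assumption that no field value contains a comma are mine, not part of the original commands. The field names are copied from the question.

```python
import re

# Field names copied from the rex command in the question.
names = ["aaa", "bb", "cc", "dd", "ee", "ff", "gg", "hh",
         "ii", "jj", "kk", "ll", "mmm", "nn", "oooo"]

# Python spells named groups (?P<name>...); the rex examples in this
# thread use the (?<name>...) spelling instead.
pattern = re.compile("^" + ",".join(f"(?P<{n}>[^,]+)" for n in names) + "$")

temp = ",".join(f"f{i}" for i in range(1, 16))  # "f1,f2,...,f15"
fields = pattern.match(temp).groupdict()

print(fields["aaa"], fields["hh"], fields["oooo"])  # f1 f8 f15
```

Because no group can cross a comma, each field boundary is fixed the first time it is reached and the engine never has to retry.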


acharlieh · ‎01-13-2017

To be fair to Splunk... putting your regex and sample data row into regex101.com fails due to "Catastrophic Backtracking": https://regex101.com/r/U3e5mq/1

Even your modified regex and sample data take over half a second because of the amount of backtracking required to match (or to identify mismatches).

If you can guarantee there are no commas in the body of your data, replacing "any character, any number of times" (.+) with "any character except a comma" ([^,]+), as suggested by gokadroid, takes the execution time of your regex from catastrophic to almost instantaneous.
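As a rough illustration of why the difference is so large, here is a toy backtracking matcher in Python that counts how many candidate group endings it tries. It is a hypothetical model written for this thread (the function `count_steps` and its parameters are made up), not how regex101, PCRE, or Splunk actually count steps, so only the relative sizes are meaningful.

```python
def count_steps(text, n_groups, allow_comma):
    """Toy backtracking matcher for n_groups comma-separated groups.

    Each group is a greedy run of one or more characters:
    allow_comma=True models '.+', allow_comma=False models '[^,]+'.
    Returns (matched, steps), where steps counts every candidate
    group ending that was tried.
    """
    steps = 0

    def match(pos, group):
        nonlocal steps
        if allow_comma:
            limit = len(text)
        else:
            comma = text.find(",", pos)
            limit = len(text) if comma == -1 else comma
        # Greedy: try the longest candidate first, then back off one character at a time.
        for end in range(limit, pos, -1):
            steps += 1
            if group == n_groups - 1:
                return True  # last group: any non-empty run is enough
            if end < len(text) and text[end] == "," and match(end + 1, group + 1):
                return True
        return False

    matched = match(0, 0)
    return matched, steps


row = ",".join(f"f{i}" for i in range(1, 16))  # "f1,f2,...,f15"
greedy_ok, greedy_steps = count_steps(row, 15, allow_comma=True)
negated_ok, negated_steps = count_steps(row, 15, allow_comma=False)

print(greedy_ok, negated_ok)
print(greedy_steps, negated_steps)
```

In this model both variants eventually match the sample row, but the negated-class version needs one step per group while the greedy version tries far more endings before it finds the only arrangement that works.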

gokadroid · ‎01-13-2017

Here is more on catastrophic backtracking, which is what regex101.com appears to have run into. I picked the link up from the sidebar of the page you posted.

somesoni2 · ‎01-14-2017

That is it. Being too vague in the regex was causing the issue. Your second type of regex (solution 2) works fine.
