Splunk Search

How do I edit my regex to parse fields correctly if a field delimiter appears within a field?

markwymer
Path Finder

Hi,

Another regex problem I'm afraid.....

I've got a very long event with 37 fields where all the fields are quoted and separated by comma. Also there are no key=value pairs.
For the most part my regex works nicely with the event data, but there are occasions where a quote also appears in the actual field data thereby breaking my regex separator character.

Working example (extremely simplified regex and event):

^"(?P<dest_ip>[^"]+)","(?P<dest_port>[^"]+)","(?P<uri>[^"]+)","(?P<request>[^"]+)","(?P<response>[^\n]+)"$

Data:

"192.0.0.20","80","fl=city,name,code,group=true&group.field=city","GET /solr/lpbm/select?fl=city","Logging rate limit reached"

No problem with this, all the fields parse out OK. However, this next event fails - note the additional " in fourth field:-

"192.0.0.20","80","fl=city,name,code,group=true&group.field=city","GET /solr/"lpbm"/select?fl=city","Logging rate limit reached"

This now breaks the [^"]+)"," part of my regex and distorts the field extractions.

Is there a way to do the equivalent of:-

......","(?P<request>[^","]+)",".......

I know that this is invalid, but I don't know what the alternative looks like 😞 !!

Thanks for any help,
Mark.

0 Karma
1 Solution

Sebastian2
Path Finder

Your problem should be solvable by using non greedy (or lazy) quantifiers instead of the [^"] syntax. The advantage is, that you can use the whole pattern "," as seperator instead of just [^"]. How ever, I'm not sure if the Splunk RegEx works as I expect to do, but try (something like) this:

^"(?P<dest_ip>.+?)","(?P<dest_port>.+?)","(?P<uri>.+?)","(?P<request>.+?)","(?P<response>[^\n]+)"$

What's the difference:

  • I'd say the [^"] syntax is "old school". The parser is consuming just everything until an " is found.
  • Lazy quantifiers, how ever, parse as much as they can. And "as much" means: As much as possible unless the whole pattern doesn't match. In theory this should (I can't test that right now) therefore consume a single " but no "," as the pattern would no longer match as a whole. (And it should be a little bit slower, again, in theory)

/edit & just as info: a ? makes an quantifier lazy (here: .+?: "Consume lazy at least one character").

View solution in original post

javiergn
SplunkTrust
SplunkTrust

Try the following:

^"(?P<dest_ip>[^"]+)","(?P<dest_port>[^"]+)","(?P<uri>[^"]+)","(?P<request>[^"][^,]+)","(?P<response>[^\n]+)"$

You can test it here: https://regex101.com/r/nD3sL1/2

Sebastian2
Path Finder

Your problem should be solvable by using non greedy (or lazy) quantifiers instead of the [^"] syntax. The advantage is, that you can use the whole pattern "," as seperator instead of just [^"]. How ever, I'm not sure if the Splunk RegEx works as I expect to do, but try (something like) this:

^"(?P<dest_ip>.+?)","(?P<dest_port>.+?)","(?P<uri>.+?)","(?P<request>.+?)","(?P<response>[^\n]+)"$

What's the difference:

  • I'd say the [^"] syntax is "old school". The parser is consuming just everything until an " is found.
  • Lazy quantifiers, how ever, parse as much as they can. And "as much" means: As much as possible unless the whole pattern doesn't match. In theory this should (I can't test that right now) therefore consume a single " but no "," as the pattern would no longer match as a whole. (And it should be a little bit slower, again, in theory)

/edit & just as info: a ? makes an quantifier lazy (here: .+?: "Consume lazy at least one character").

View solution in original post

Register for .conf21 Now! Go Vegas or Go Virtual!

How will you .conf21? You decide! Go in-person in Las Vegas, 10/18-10/21, or go online with .conf21 Virtual, 10/19-10/20.