Solved: Complex RegEx Capturing Group Assistance

draracle · ‎06-18-2018

I have a couple of similar cases where I am struggling to extract the desired fields with RegEx capturing groups. Please take a look at both cases and share your wisdom.

Thanks!

CASE #1
I am looking for some RegEx help to capture the USERID from log sources where the USERID may be DOMAIN/USERID or just USERID. I do not want to capture 'DOMAIN/', so that the Field Extractions do not end up with two different versions of the user ID.

Sample (loginID=s.buttercup-shopping.com/bcs234):

Jan 1 01:1:10 10.10.10.10 CEF:0|Proxy1|Something|1.4.0|121|Transaction permitted|1| act=permitted app=http dvc=10.10.10.10 dst=1.2.3.4 dhost=host.buttercup-games.com dpt=80 src=10.20.30.40 spt=19491 suser=LDAP://usldap.s.buttercup-shopping.com OU\=
City,OU\=Country,OU\=Users,OU\=Region,DC\=s,DC\=buttercup-shopping,DC\=com/FirstName LastName loginID=s.buttercup-shopping.com/bcs234 destinationTranslatedPort=<redacted>

Sample (loginID=bcs234):

Jan 1 09:1:10 10.10.10.10 CEF:0|Proxy2|Something|2.8.0|121|Transaction permitted|1| act=permitted app=http dvc=10.10.10.10 dst=1.2.3.4 dhost=host.buttercup-games.com dpt=80 src=10.20.30.40 spt=19491 suser=LDAP://usldap.s.buttercup-shopping.com OU\=
City,OU\=Country,OU\=Users,OU\=Region,DC\=s,DC\=buttercup-shopping,DC\=com/FirstName LastName loginID=bcs234 destinationTranslatedPort=<redacted>

Desired Field Extraction:

loginID=bcs234

Progress:
RegEx:

loginID=(?P<userid>.*)(?= destination)

The following RegEx seems to work outside of Splunk, but Splunk does not support stating the same named capturing group (e.g. (?P<userid>...)) over and over again in each alternative (where the (.*) groups reside).
RegEx:

(?<=\.com\/)(.*)(?= destination)|(?<=\.corp\/)(.*)(?= destination)|(?<=loginID=)([A-Za-z0-9_-]{1,})(?= destination)

CASE #2

I was trying to capture the domain and IP addresses from 3 similar logs.

The Field Extractions below worked for the most part, but I still needed a sed statement to remove a trailing '.' because both scenarios with a '.' matched. It seems that when there are more than two cases to match, getting the capturing groups right is fairly difficult or even impossible.

Sample (email address + '.' + ' ')

relay=user@buttercup-games.com. [1.1.1.1]

Sample (email address + ' ')

relay=user@buttercup-games.com [1.1.1.1]

Sample (email address + '.')

relay=user@buttercup-games.com.[1.1.1.1]

FIELD EXTRACTIONS

relay=(?P<dest_domain>.*)(?=(\.[\[\s])|(\s\[))
^(?:[^\[\n]*\[){2}(?P<dest_ip>[^\]]+)

SED

| rex field=dest_domain mode=sed "s/\.$//g"

FrankVl · ‎06-18-2018

The first case can be solved by adding a non-capturing group for the part you want to ignore, and requiring that group to occur 0 or 1 times (?):

loginID=(?:[^\/]+\/)?(?<userid>\S*)

https://regex101.com/r/DO74m7/1
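
As a rough illustration of the optional non-capturing group, here is a small Python sketch run against shortened versions of the two loginID samples from the question. Python's re module is used here only for convenience and is not part of the original answer; it needs the (?P<userid>...) spelling for named groups, and the sample strings are trimmed copies of the logs above.

```python
import re

# Same pattern as above, using Python's (?P<name>...) spelling for the named group.
# (?:[^/]+/)? optionally consumes "DOMAIN/" without capturing it.
LOGIN_RE = re.compile(r"loginID=(?:[^/]+/)?(?P<userid>\S*)")

# Shortened versions of the two samples from the question.
samples = [
    "LastName loginID=s.buttercup-shopping.com/bcs234 destinationTranslatedPort=<redacted>",
    "LastName loginID=bcs234 destinationTranslatedPort=<redacted>",
]

# Both forms should yield the bare user ID.
userids = [LOGIN_RE.search(s).group("userid") for s in samples]
print(userids)
```

Because the domain part sits in a non-capturing group, only one named group is needed, which avoids repeating the group name across alternatives.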

Second case (trick is to end the capturing group for the domain with a \w, to prevent it from grabbing the .):

relay=(?<dest_domain>.*\w+)[\.\s]+\[(?<dest_ip>[^\]]+)

https://regex101.com/r/yjTluC/1
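
The same idea can be checked with a short Python sketch against the three relay samples from the question. This is only an illustration under the assumption that Python's re module behaves like Splunk's regex engine for this pattern; the named groups are written as (?P<name>...) because Python requires that form.

```python
import re

# .*\w+ forces the domain capture to end on a word character, so a trailing
# "." is left for [\.\s]+ to consume before the opening bracket.
RELAY_RE = re.compile(r"relay=(?P<dest_domain>.*\w+)[\.\s]+\[(?P<dest_ip>[^\]]+)")

# The three variants from the question: ". [", " [" and ".[".
samples = [
    "relay=user@buttercup-games.com. [1.1.1.1]",
    "relay=user@buttercup-games.com [1.1.1.1]",
    "relay=user@buttercup-games.com.[1.1.1.1]",
]

results = []
for s in samples:
    m = RELAY_RE.search(s)
    results.append((m.group("dest_domain"), m.group("dest_ip")))
print(results)
```

With the trailing dot excluded by the pattern itself, the separate sed step from the question should no longer be needed.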

draracle · ‎06-19-2018

Thank you! The second one worked flawlessly. The first one is not picking up the user ID in logs where the domain is missing, such as the one below, or simply loginID=userid. What is being matched in these cases is 'xml' from text/xml. Is there still hope? Thanks in advance!

 Jan 1 09:35:37 10.10.10.10 CEF:0|Appliance|Security|8.4.0|121|Transaction permitted|1| act=permitted app=http dvc=10.10.10.10 dst=1.1.1.1 dhost=dict.buttercup-shopping.com dpt=80 src=10.20.30.40 spt=20912 suser=LDAP://usldap.s.buttercup-games.com OU\=J,OU\=C,OU\=Users,OU\=A,DC\=s,DC\=buttercup-games,DC\=com/FirstName LastName loginID=bcs234 destinationTranslatedPort=28213 rt=1529393737 in=395 out=848 requestMethod=GET requestClientApplication=buttercup-shopping Desktop Dict (Windows NT 6.1) reason=- cs1Label=Policy cs1=Super Administrator**Domain Base,Super Administrator**s Default cs2Label=DynCat cs2=0 cs3Label=ContentType cs3=text/xml; charset\=utf-8 cn1Label=DispositionCode cn1=1026 cn2Label=ScanDuration cn2=0 request=http://site.com/fsearch?keyfrom\=sdf.setqw.cd.http.0&q\=%20N&pos\=1&doctype\=xml&xmlVersion\=3.2&dogVersion\=1.0&client\=deskdict&id\=0ef47d7cdd3941d96&vendor\=qiang.buttercup-shopping&in\=buttercup-shoppingDictFull&appVer\=6.3.69.8341&appZengqiang\=1&abTest\=8&le\=eng&scradv\=1&wstate\=yes&LTH\=890&LWH\=0&LSDH\=-1&proc\=some.exe&headTxt\=2B05

FrankVl · ‎06-20-2018

The problem is that there is a / further down the event, which causes my regex to look in the wrong place.

This should fix that (added a \s to the negated character class, so the optional domain part cannot read beyond whitespace):

loginID=(?:[^\/\s]+\/)?(?<userid>\S*)
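
To see the difference between the two patterns, the following Python sketch runs both against a trimmed copy of the failing event. The variable names and the shortened event string are made up for illustration, and Python's re module (with its (?P<name>...) group syntax) is assumed to be close enough to Splunk's engine for this comparison.

```python
import re

# Original pattern: [^/]+ may run across spaces to a "/" much later in the event.
LOOSE_RE = re.compile(r"loginID=(?:[^/]+/)?(?P<userid>\S*)")
# Fixed pattern: [^/\s]+ cannot cross whitespace, so it stays inside the loginID value.
STRICT_RE = re.compile(r"loginID=(?:[^/\s]+/)?(?P<userid>\S*)")

# Trimmed from the failing event: no domain prefix, but a "/" appears later in cs3.
event = (r"loginID=bcs234 destinationTranslatedPort=28213 "
         r"cs3Label=ContentType cs3=text/xml; charset\=utf-8")
# The domain-prefixed form from the original question must still work.
prefixed = "loginID=s.buttercup-shopping.com/bcs234 destinationTranslatedPort=<redacted>"

loose = LOOSE_RE.search(event).group("userid")      # picks up text after "text/"
strict = STRICT_RE.search(event).group("userid")    # stays on the loginID value
with_domain = STRICT_RE.search(prefixed).group("userid")
print(loose, strict, with_domain)
```

The loose pattern skips ahead to the slash in text/xml, while the strict pattern returns the user ID for both the bare and the domain-prefixed form.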

draracle · ‎06-21-2018

That worked! You are a true RegEx genius! Thank you very much!
