Complex RegEx Capturing Group Assistance
I have a couple of similar cases where I am struggling to extract the desired fields with RegEx capturing groups. Please take a look at both cases and share your wisdom.
Thanks!
CASE #1
I am looking for some RegEx help to capture the USERID from log sources where the login ID may be DOMAIN/USERID or just USERID. I do not want to capture 'DOMAIN/', so that the Field Extractions do not produce two different versions of the user ID.
Sample (loginID=s.buttercup-shopping.com/bcs234):
Jan 1 01:1:10 10.10.10.10 CEF:0|Proxy1|Something|1.4.0|121|Transaction permitted|1| act=permitted app=http dvc=10.10.10.10 dst=1.2.3.4 dhost=host.buttercup-games.com dpt=80 src=10.20.30.40 spt=19491 suser=LDAP://usldap.s.buttercup-shopping.com OU\=
City,OU\=Country,OU\=Users,OU\=Region,DC\=s,DC\=buttercup-shopping,DC\=com/FirstName LastName loginID=s.buttercup-shopping.com/bcs234 destinationTranslatedPort=<redacted>
Sample (loginID=bcs234):
Jan 1 09:1:10 10.10.10.10 CEF:0|Proxy2|Something|2.8.0|121|Transaction permitted|1| act=permitted app=http dvc=10.10.10.10 dst=1.2.3.4 dhost=host.buttercup-games.com dpt=80 src=10.20.30.40 spt=19491 suser=LDAP://usldap.s.buttercup-shopping.com OU\=
City,OU\=Country,OU\=Users,OU\=Region,DC\=s,DC\=buttercup-shopping,DC\=com/FirstName LastName loginID=bcs234 destinationTranslatedPort=<redacted>
Desired Field Extraction:
loginID=bcs234
Progress:
RegEx:
loginID=(?P<userid>.*)(?= destination)
The following RegEx seems to work outside of Splunk, but Splunk does not support repeating the same named capturing group (e.g. (?P<userid>...)) in each alternative (where the (.*) groups are).
RegEx:
(?<=\.com\/)(.*)(?= destination)|(?<=\.corp\/)(.*)(?= destination)|(?<=loginID=)([A-Za-z0-9_-]{1,})(?= destination)
CASE #2
I was trying to capture the domain and IP addresses from 3 similar logs.
The Field Extractions below worked for the most part, but I still needed a sed statement to remove a trailing '.', since both scenarios with a '.' matched. It seems that when there are more than two cases to match, getting the capturing groups right is fairly difficult or even impossible.
Sample (email address + '.' + ' ')
relay=user@buttercup-games.com. [1.1.1.1]
Sample (email address + ' ')
relay=user@buttercup-games.com [1.1.1.1]
Sample (email address + '.')
relay=user@buttercup-games.com.[1.1.1.1]
FIELD EXTRACTIONS
relay=(?P<dest_domain>.*)(?=(\.[\[\s])|(\s\[))
^(?:[^\[\n]*\[){2}(?P<dest_ip>[^\]]+)
SED
| rex field=dest_domain mode=sed "s/\.$//g"
The first case can be solved by adding a non-capturing group for the part you want to ignore and requiring that group to occur 0 or 1 times (?):
loginID=(?:[^\/]+\/)?(?<userid>\S*)
https://regex101.com/r/DO74m7/1
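To try the optional non-capturing group outside of Splunk, here is a minimal Python sketch. It assumes Python's re module, which spells named groups as (?P<userid>...), and it uses only the tail of the two sample events from the question. The helper name extract_userid is made up for illustration.

```python
import re

# Same pattern as above, with Python's (?P<name>...) spelling for the named group.
# (?:[^\/]+\/)? optionally consumes "DOMAIN/" without capturing it.
LOGIN_RE = re.compile(r"loginID=(?:[^\/]+\/)?(?P<userid>\S*)")

# Tails of the two sample events from the question.
login_samples = [
    "loginID=s.buttercup-shopping.com/bcs234 destinationTranslatedPort=<redacted>",
    "loginID=bcs234 destinationTranslatedPort=<redacted>",
]


def extract_userid(line):
    """Return the user ID without the domain prefix, or None if loginID is absent."""
    match = LOGIN_RE.search(line)
    return match.group("userid") if match else None


for sample in login_samples:
    print(extract_userid(sample))
```

Both samples should yield the same user ID, because the domain part is matched but never captured.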
Second case (trick is to end the capturing group for the domain with a \w, to prevent it from grabbing the .):
relay=(?<dest_domain>.*\w+)[\.\s]+\[(?<dest_ip>[^\]]+)
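A quick way to check this against all three relay samples is a small Python sketch like the one below. It assumes Python's re module (named groups written as (?P<name>...)); the helper name extract_relay is hypothetical.

```python
import re

# Same pattern as above in Python syntax. Ending dest_domain with \w+ forces the
# capture to stop on a word character, so a trailing "." is left to [\.\s]+.
RELAY_RE = re.compile(r"relay=(?P<dest_domain>.*\w+)[\.\s]+\[(?P<dest_ip>[^\]]+)")

# The three sample variants from the question.
relay_samples = [
    "relay=user@buttercup-games.com. [1.1.1.1]",
    "relay=user@buttercup-games.com [1.1.1.1]",
    "relay=user@buttercup-games.com.[1.1.1.1]",
]


def extract_relay(line):
    """Return (dest_domain, dest_ip) for a relay= entry, or None if nothing matches."""
    match = RELAY_RE.search(line)
    if match is None:
        return None
    return match.group("dest_domain"), match.group("dest_ip")


for sample in relay_samples:
    print(extract_relay(sample))
```

All three variants should produce the same pair, which is what removes the need for the extra sed step.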
Thank you! The second one worked flawlessly. The first one is not picking up logs where the domain is missing, such as the one below, or simply: loginid=userid. What is being matched in these cases is 'xml' from text/xml. Is there still hope? Thanks in advance!
Jan 1 09:35:37 10.10.10.10 CEF:0|Appliance|Security|8.4.0|121|Transaction permitted|1| act=permitted app=http dvc=10.10.10.10 dst=1.1.1.1 dhost=dict.buttercup-shopping.com dpt=80 src=10.20.30.40 spt=20912 suser=LDAP://usldap.s.buttercup-games.com OU\=J,OU\=C,OU\=Users,OU\=A,DC\=s,DC\=buttercup-games,DC\=com/FirstName LastName loginID=bcs234 destinationTranslatedPort=28213 rt=1529393737 in=395 out=848 requestMethod=GET requestClientApplication=buttercup-shopping Desktop Dict (Windows NT 6.1) reason=- cs1Label=Policy cs1=Super Administrator**Domain Base,Super Administrator**s Default cs2Label=DynCat cs2=0 cs3Label=ContentType cs3=text/xml; charset\=utf-8 cn1Label=DispositionCode cn1=1026 cn2Label=ScanDuration cn2=0 request=http://site.com/fsearch?keyfrom\=sdf.setqw.cd.http.0&q\=%20N&pos\=1&doctype\=xml&xmlVersion\=3.2&dogVersion\=1.0&client\=deskdict&id\=0ef47d7cdd3941d96&vendor\=qiang.buttercup-shopping&in\=buttercup-shoppingDictFull&appVer\=6.3.69.8341&appZengqiang\=1&abTest\=8&le\=eng&scradv\=1&wstate\=yes<H\=890&LWH\=0&LSDH\=-1&proc\=some.exe&headTxt\=2B05
Problem is that there is a / somewhere down the line, that causes my regex to look in the wrong place.
This should fix that (added a \s to prevent it from reading beyond whitespace):
loginID=(?:[^\/\s]+\/)?(?<userid>\S*)
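The difference between the two patterns can be reproduced with a short Python sketch. It assumes Python's re module and a shortened excerpt of the failing event, keeping only fields that appear in the log above. The names OLD_RE, NEW_RE, and userid_with are made up for this example.

```python
import re

# First attempt: [^\/]+ may run across spaces until it reaches the next "/".
OLD_RE = re.compile(r"loginID=(?:[^\/]+\/)?(?P<userid>\S*)")
# Fixed version: adding \s to the negated class keeps the optional domain
# part inside the loginID value itself.
NEW_RE = re.compile(r"loginID=(?:[^\/\s]+\/)?(?P<userid>\S*)")

# Shortened excerpt of the failing event (fields taken from the log above).
event_no_domain = (
    r"loginID=bcs234 destinationTranslatedPort=28213 "
    r"cs3Label=ContentType cs3=text/xml; charset\=utf-8"
)
# Same excerpt with a domain prefix, to confirm that case still works.
event_with_domain = event_no_domain.replace(
    "loginID=", "loginID=s.buttercup-shopping.com/"
)


def userid_with(pattern, line):
    """Return the userid group for the given compiled pattern, or None."""
    match = pattern.search(line)
    return match.group("userid") if match else None


print("old:", userid_with(OLD_RE, event_no_domain))
print("new:", userid_with(NEW_RE, event_no_domain))
print("new, with domain:", userid_with(NEW_RE, event_with_domain))
```

With the old pattern the optional group swallows everything up to the slash in text/xml, so the capture starts at xml. With the fix, both forms of the event give bcs234.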
That worked! You are a true RegEx genius! Thank you very much!