Hello,
I would like to ask about a problem with parsing a log using a regex with lookahead.
I have this log:
Oct 10 04:18:31 ATLAS Threat Categories|Blocked Host|7|rt=1633832250000 src=122.226.102.59 cs3Label=Match Type dpt=23 cn2=13 proto=TCP dst=193.85.146.63 cn1=21129644 spt=39528 cs2Label=Protection Group Name cs1Label=IOC Pattern cn1Label=Element Id cn2Label=Protection Group ID cs7Label=Threat Category cs7=Malware cs6=Telnet Bruteforce cs1=122.226.102.59 cs6Label=Threat Name cs3=ip cs2=Default Protection Group
As you can see, there are a number of parameters that are each expressed by two fields, e.g. "cs1Label" (parameter name) and "cs1" (parameter value). My goal is to create a new field at search time whose name is the value of the field "cs1Label" and whose value is the value of the field "cs1". For example:
IOC Pattern = 122.226.102.59
The problem is that the log does not have a fixed structure; the order of the individual fields changes from message to message.
Therefore, a "normal" regex that expects the fields in one fixed order cannot be used. So I created the following regex using lookaheads that parses the appropriate values (I tested it in the regex101.com tester and it works):
^(?=.*\bcs1Label=\b((.*?)((\s\w+\=)|($))))(?=.*\bcs1=\b((.*?)((\s\w+\=)|($))))
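For reference, the behaviour seen in the tester can be reproduced with a minimal Python sketch. This is not Splunk itself: it assumes Python's `re` engine treats this particular pattern the same way as the PCRE flavour, and the names `EVENT`, `LOOKAHEAD` and `extract_pair` are purely illustrative. The event is a shortened copy of the log above.

```python
import re

# Shortened copy of the sample event from the post (only the relevant fields kept).
EVENT = (
    "Oct 10 04:18:31 ATLAS Threat Categories|Blocked Host|7|"
    "rt=1633832250000 src=122.226.102.59 cs3Label=Match Type dpt=23 "
    "cs1Label=IOC Pattern cn1Label=Element Id cs7=Malware "
    "cs1=122.226.102.59 cs6Label=Threat Name cs3=ip"
)

# The lookahead regex from the post, split over two lines for readability.
LOOKAHEAD = re.compile(
    r"^(?=.*\bcs1Label=\b((.*?)((\s\w+\=)|($))))"
    r"(?=.*\bcs1=\b((.*?)((\s\w+\=)|($))))"
)


def extract_pair(event):
    """Return (label, value, match_length) or None if the pair is missing."""
    m = LOOKAHEAD.search(event)
    if m is None:
        return None
    # Group 2 is the label text, group 7 the value text (as in FORMAT = $2::$7).
    return m.group(2), m.group(7), m.end() - m.start()


print(extract_pair(EVENT))
```

Both values are captured regardless of field order, but the overall match has length 0, because the pattern consists only of the `^` anchor and two lookaheads and so consumes no characters.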
Unfortunately, when I use it in Splunk, in transforms.conf, it doesn't work. My transforms.conf looks like this:
[combined_field_cs1]
SOURCE_KEY = _raw
REGEX = ^(?=.*\bcs1Label=\b((.*?)((\s\w+=)|($))))(?=.*\bcs1=\b((.*?)((\s\w+=)|($))))
FORMAT = $2::$7
props.conf
REPORT-combined_field_cs1 = combined_field_cs1
More precisely, when I simply want to look at the messages in the given index, the search freezes; it does not display any messages and I have to close it manually.
I did some testing and it is obvious that the REGEX is the problem. It is not clear to me why the regex does not work in Splunk. Or did I choose completely the wrong path, and do I need a different way to achieve my goal? Could someone more experienced help? Any help will be highly appreciated.
Best regards
Lukas Mecir
I thought again about what you wrote, and I think I finally - inspired by you - found a solution.
Parameters cn1 - cn7 and cs1 - cs7 can appear in each log message, and each parameter is expressed by two fields: e.g. the field cs1Label carries the name of the parameter and the corresponding field cs1 carries its value (and similarly for parameters cs2 - cs7 and cn1 - cn7).
So the message can contain the following fields:
cs1, cs2, cs3, cs4, cs5, cs6, cs7
cs1Label, cs2Label, cs3Label, cs4Label, cs5Label, cs6Label, cs7Label
cn1, cn2, cn3, cn4, cn5, cn6, cn7
cn1Label, cn2Label, cn3Label, cn4Label, cn5Label, cn6Label, cn7Label
These fields can be in a different order in each message.
So in the end I created two regexes for each pair of corresponding fields, e.g. cs1 and cs1Label.
The first covers the case where cs1 comes first in the message and cs1Label comes after it:
cs1=([^=]+)\s.*cs1Label=([^=]+)(\s|$)
The second covers the opposite case, where cs1Label comes first and cs1 comes after it:
cs1Label=([^=]+)\s.*cs1=([^=]+)(\s|$)
The same is done for every other pair.
This ensures that the data is parsed in any field order.
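The two-regex idea can be checked outside Splunk with a small Python sketch. It assumes Python's `re` behaves like Splunk's regex engine for these patterns. The helper `named_field` and the two shortened test events are illustrative only and are not part of the Splunk configuration.

```python
import re

# The two regexes from the post for the cs1 / cs1Label pair.
V1 = re.compile(r"cs1=([^=]+)\s.*cs1Label=([^=]+)(\s|$)")  # value before label
V2 = re.compile(r"cs1Label=([^=]+)\s.*cs1=([^=]+)(\s|$)")  # label before value


def named_field(event):
    """Return (label, value) for the cs1 pair, or None if no regex matches."""
    m = V1.search(event)
    if m:
        return m.group(2), m.group(1)  # corresponds to FORMAT = $2::$1
    m = V2.search(event)
    if m:
        return m.group(1), m.group(2)  # corresponds to FORMAT = $1::$2
    return None


# Shortened events built from the sample log, one per field order.
label_first = (
    "src=122.226.102.59 cs1Label=IOC Pattern cn1Label=Element Id "
    "cs7=Malware cs1=122.226.102.59 cs6Label=Threat Name"
)
value_first = (
    "src=122.226.102.59 cs1=122.226.102.59 cs6Label=Threat Name "
    "cs1Label=IOC Pattern cn1Label=Element Id"
)

for event in (label_first, value_first):
    print(named_field(event))
```

In both orders the helper yields the label `IOC Pattern` with the value `122.226.102.59`. The greedy `[^=]+` first runs into the next field name and then backtracks to the last whitespace before it, which is why values containing spaces are still captured whole.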
Therefore, the transforms.conf file looks like this:
[cs1_named_v1]
SOURCE_KEY = _raw
REGEX = cs1=([^=]+)\s.*cs1Label=([^=]+)(\s|$)
FORMAT = $2::$1
[cs1_named_v2]
SOURCE_KEY = _raw
REGEX = cs1Label=([^=]+)\s.*cs1=([^=]+)(\s|$)
FORMAT = $1::$2
[cn1_named_v1]
SOURCE_KEY = _raw
REGEX = cn1=([^=]+)\s.*cn1Label=([^=]+)(\s|$)
FORMAT = $2::$1
[cn1_named_v2]
SOURCE_KEY = _raw
REGEX = cn1Label=([^=]+)\s.*cn1=([^=]+)(\s|$)
FORMAT = $1::$2
etc.
And of course I added the appropriate REPORT commands to props.conf.
Thanks again for the inspiration and the effort to help.
Try a regex that doesn't use lookahead.
REGEX = \bcs1Label=(.*?)((\s\w+=)|$).*cs1=(.*?)((\s\w+=)|$)
FORMAT = $1::$4
Hi, thanks for the suggestion, but I have already tried such a regex and unfortunately it is not a solution. As I wrote, the order of the fields in the message may vary. Your regex works if the "cs1Label" field comes before the "cs1" field in the message, but if the order is reversed (which can happen), it does not match. That's why I used lookahead: it works with any order of fields, but unfortunately only in regex101.com, not in Splunk ...
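The order dependence described here can be illustrated with a short Python sketch. It assumes Python's `re` matches this pattern the same way Splunk does; `NO_LOOKAHEAD`, `extract` and the two shortened events are illustrative names and data, not part of any Splunk configuration.

```python
import re

# The suggested regex without lookahead (FORMAT = $1::$4).
NO_LOOKAHEAD = re.compile(
    r"\bcs1Label=(.*?)((\s\w+=)|$).*cs1=(.*?)((\s\w+=)|$)"
)


def extract(event):
    """Return (label, value) if the regex matches, otherwise None."""
    m = NO_LOOKAHEAD.search(event)
    return (m.group(1), m.group(4)) if m else None


# Shortened events built from the sample log, one per field order.
label_first = (
    "cs1Label=IOC Pattern cn1Label=Element Id cs7=Malware "
    "cs1=122.226.102.59 cs6Label=Threat Name"
)
value_first = (
    "cs1=122.226.102.59 cs6Label=Threat Name "
    "cs1Label=IOC Pattern cn1Label=Element Id"
)

print(extract(label_first))  # label before value: the pair is extracted
print(extract(value_first))  # value before label: no match
```

Because the pattern requires `cs1=` to appear somewhere after `cs1Label=`, the second event yields no match at all.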