Hello..
I am attempting to extract a string of varying format using regex. I have successfully extracted part of the string but am struggling to extract it when it contains white space or a special character such as '-'.
The text I am trying to extract always has a space before it and always ends with '[', for example:
DEV NS [
CI-DEV [
TST [
My regex so far is: rex "(?\w+) \["
But it only extracts a single block of text, which is fine if there is only one block (as in the case of TST), but if there are 2 blocks (e.g. DEV NS) or text with a hyphen (CI-DEV) then it does not extract the whole string.
Long story short... how do I modify the expression to include the whole string (space and hyphen)?
As always help is very much appreciated.
Cheers,
Alastair
This captures uppercase letters, dashes and whitespace after an " O " when the capture group is followed by a space and an open bracket:
https://regex101.com/r/gD4eW7/3
| rex "O [^A-Z]*(?<myfield>[A-Z\-\s]+) \["
Added: If you want to match starting after " O " while specifically ignoring only "nevo-web", the most efficient regex is probably:
| rex "O (nevo-web )?(?<myfield>[A-Z\-\s]+) \["
I used "O[^A-Z]*" in case there were other unanticipated lowercase words in front of your pattern of interest.
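For anyone who wants to check the first pattern outside Splunk, here is a minimal Python sketch run against the three sample events posted in this thread. Assumptions: Python's re module needs (?P<myfield>...) where rex accepts (?<myfield>...), the helper name extract_env is made up for illustration, and the events contain single spaces exactly as pasted here.

```python
import re

# Same pattern as the first rex above, with the named group rewritten for Python.
PATTERN = re.compile(r"O [^A-Z]*(?P<myfield>[A-Z\-\s]+) \[")

# Sample events copied from this thread.
EVENTS = [
    "[7/03/16 12:42:24:999 AEDT] 0000005e SystemOut O BLD NS [WebContainer : 2]",
    "[7/03/16 12:02:13:370 EST] 00000060 SystemOut O nevo-web CI-BLD [WebContainer : 4]",
    "[7/03/16 11:58:06:564 EST] 00000092 SystemOut O TST [WebContainer : 2]",
]

def extract_env(event):
    """Return the captured field, or None when the pattern does not match."""
    match = PATTERN.search(event)
    return match.group("myfield") if match else None

for event in EVENTS:
    print(extract_env(event))
```

On these three events the sketch prints BLD NS, CI-BLD and TST. Real logs may differ (extra spaces, lowercase names), so treat it as a test harness rather than proof.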
Hello... I have tried the above 2 examples but neither gives me what I am after, and both manage to exclude most of the entries I need.
Why will my solution give me problems? It is only dealing with a small set of data and returns everything I am after.
From my limited knowledge, my query looks for an 'O' and then excludes the word nevo-web if it exists. It then returns everything else before the [, with the end result of spitting out the string I am after.
The problem with the examples to date is that they miss the text after the first white space and before the second (BLD NS) and only return NS.
Open to better solutions, and I do appreciate everyone's input.
MuS made a good catch by adding \s to capture multiple words in the pattern, including "BLD". I meant to do that originally, but I was only looking at two full events when I created the regex.
To address the question about problems: in general, regex works best by matching patterns from left to right. Look-aheads, etc. are not that efficient and they require the pattern to exist or to not exist (less flexibility). Since this is Splunk, I assumed large datasets, and even small datasets can become large over time. Also, it is best to match as generally as possible in case the logs deviate from your test data.
That makes sense.. thank you for taking the time to clarify.
Cheers.
Alastair
In addition this little modification will get all needed results:
.... | rex "O [^A-Z]*(?<myfield>[A-Z\-\s]+) \[" | ...
Good catch. I agree.
Perfect... just out of curiosity, why is this any better than excluding a specific string as in [^"nevo-web"]?
Thanks for all your help.
Alastair
Got it... rex "(?[^\.O+[^"nevo-web"]+)\s\["
seems to do the trick
Thanks for the help and suggestions
The formatting for that regex did not come through right, but if it is doing what it looks like, that approach will give you problems and will take much more time than it should to complete the task, even if it works right. Check out my answer below.
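One likely source of trouble is that [^"nevo-web"] is a negated character class, not an exclusion of the word nevo-web: it matches any single character that is not in the listed set. The Python sketch below illustrates this; it assumes Python's re module treats the class the same way as the PCRE engine behind rex, and is_allowed is a hypothetical helper name.

```python
import re

# The class on its own: it matches exactly ONE character that is not in the set.
# Inside [...], "o-w" is read as the range o..w, so the hyphen is not a literal.
NEGATED_CLASS = re.compile(r'[^"nevo-web"]')

def is_allowed(ch):
    """Return True if the single character ch is matched by the negated class."""
    return NEGATED_CLASS.fullmatch(ch) is not None

for ch in ["T", "a", "n", "b", "s", "-"]:
    print(repr(ch), is_allowed(ch))
```

Here 'T', 'a' and '-' are accepted, while 'n', 'b' and 's' are rejected ('s' only because it falls inside the o..w range). So the class would also reject letters in unrelated lowercase words, which is why an optional literal group such as (nevo-web )? is the safer way to skip that specific string.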
Why not capture everything between brackets...
... | rex "\]\s(?<myField>[^\]]+)\[" | ...
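A quick way to see what this pattern captures is to run it on one of the sample events from this thread. The Python sketch below only approximates rex: the named group is written as (?P<myField>...) because Python's re module does not accept (?<myField>...), and between_brackets is a hypothetical helper name.

```python
import re

# Same pattern as the rex above, with the named group rewritten for Python.
PATTERN = re.compile(r"\]\s(?P<myField>[^\]]+)\[")

# Sample event copied from this thread.
EVENT = "[7/03/16 12:23:27:936 AEDT] 0000005c SystemOut O BLD NS [WebContainer : 0]"

def between_brackets(event):
    """Return the text captured between the closing and opening bracket, or None."""
    match = PATTERN.search(event)
    return match.group("myField") if match else None

print(repr(between_brackets(EVENT)))
```

On this event the group holds the thread id, SystemOut and the O marker as well as BLD NS (plus a trailing space), so it would need further trimming to isolate the environment name.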
Because I am only after the specific text. I am gathering everything except for the string composed of 2 parts (BLD NS).
Hi ahogbin,
based on your example and your regex try this:
... | rex "(?<myField>[^\s]+)\s\[" | ...
Hope this helps ...
cheers, MuS
Looking good... however it is not picking up any string that has whitespace between the words (eg BLD NS - it is only including the NS component).
[7/03/16 12:23:27:936 AEDT] 0000005c SystemOut O BLD NS [WebContainer : 0]
Other than that it is working perfectly
Cheers
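The behaviour described here can be reproduced with a small Python sketch. It assumes the (?P<myField>...) spelling that Python's re module requires in place of (?<myField>...), and last_word is a hypothetical helper name. Because [^\s]+ cannot cross whitespace, only the last word before " [" is captured.

```python
import re

# MuS's pattern with the named group rewritten for Python.
PATTERN = re.compile(r"(?P<myField>[^\s]+)\s\[")

# Sample events copied from this thread.
EVENTS = [
    "[7/03/16 12:42:24:999 AEDT] 0000005e SystemOut O BLD NS [WebContainer : 2]",
    "[7/03/16 12:02:13:370 EST] 00000060 SystemOut O nevo-web CI-BLD [WebContainer : 4]",
    "[7/03/16 11:58:06:564 EST] 00000092 SystemOut O TST [WebContainer : 2]",
]

def last_word(event):
    """Return the run of non-whitespace characters just before ' [', or None."""
    match = PATTERN.search(event)
    return match.group("myField") if match else None

for event in EVENTS:
    print(last_word(event))
```

This prints NS, CI-BLD and TST, matching the report that only the NS part of BLD NS comes through.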
Can you provide all possible combinations please?
There are 3 possible combinations
[7/03/16 12:42:24:999 AEDT] 0000005e SystemOut O BLD NS [WebContainer : 2]
[7/03/16 12:02:13:370 EST] 00000060 SystemOut O nevo-web CI-BLD [WebContainer : 4]
[7/03/16 11:58:06:564 EST] 00000092 SystemOut O TST [WebContainer : 2]
The extracted string is BLD NS or CI-BLD or TST
Your example works perfectly for all but BLD NS
Thank you so much for your help
Cheers,
Alastair
Have gotten a little closer
rex "(?\w{1,4}(?:\s|\-)\w{1,4}) \["
extracts the string I am after, but for some reason some of the strings are extracted with an 'O' in front of them and others are not, e.g.
O TST
The log entry is
[7/03/16 11:32:49:079 EST] 000000b4 SystemOut O TST [WebContainer : 4]
The ones working correctly are
[7/03/16 11:32:49:101 EST] 00000060 SystemOut O nevo-web CI-TST [WebContainer : 0]
i.e. the CI-TST string is extracted.
How do I stop the leading 'O' from being included in the string ?
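A probable cause: \w also matches the letter O, and the pattern always wants two words joined by a space or hyphen. When only one word (TST) sits before " [", the standalone O becomes the first word. The Python sketch below reproduces this and compares it with the anchored pattern suggested elsewhere in this thread. Assumptions: the group name was lost in the post, so "field" is a placeholder; Python needs (?P<field>...) instead of (?<field>...); extract is a hypothetical helper.

```python
import re

# The attempt from the post above; "field" stands in for the lost group name.
ATTEMPT = re.compile(r"(?P<field>\w{1,4}(?:\s|\-)\w{1,4}) \[")

# Anchored alternative from this thread: start after "O " and skip an optional "nevo-web ".
ANCHORED = re.compile(r"O (nevo-web )?(?P<field>[A-Z\-\s]+) \[")

# Sample events copied from this thread.
EVENTS = [
    "[7/03/16 11:32:49:079 EST] 000000b4 SystemOut O TST [WebContainer : 4]",
    "[7/03/16 11:32:49:101 EST] 00000060 SystemOut O nevo-web CI-TST [WebContainer : 0]",
]

def extract(pattern, event):
    """Return the 'field' group for the first match, or None."""
    match = pattern.search(event)
    return match.group("field") if match else None

for event in EVENTS:
    print(repr(extract(ATTEMPT, event)), repr(extract(ANCHORED, event)))
```

On these two events the attempt yields 'O TST' and 'CI-TST', while the anchored pattern yields 'TST' and 'CI-TST', since the O is consumed outside the capture group.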