Solved: Regex to extract varying string

ahogbin · ‎03-06-2016

Hello..

I am attempting to extract a string of varying format using regex. I have successfully extracted part of the string, but I am struggling when it contains white space or a special character such as '-'.

The text I am trying to extract always has a space before it and is always followed by a space and '[', for example:
DEV NS [
CI-DEV [
TST [

My regex so far is: rex "(?\w+) \["

But it only extracts single blocks of text, which is fine if there is only one block (as in the case of TST). If there are 2 blocks (e.g. DEV NS) or text with a hyphen (CI-DEV), it does not extract the whole string.

Long story short... how do I modify the expression to include the whole string (space and hyphen)?

As always help is very much appreciated.

Cheers,

Alastair

landen99 · ‎03-06-2016

This captures uppercase letters, dashes and whitespace after an " O " when the capture group is followed by a space and an open bracket:
https://regex101.com/r/gD4eW7/3

| rex "O [^A-Z]*(?<myfield>[A-Z\-\s]+) \["

Added: If you want to match starting after " O " while ignoring only "nevo-web" specifically, the most efficient regex is probably:

   | rex "O (nevo-web )?(?<myfield>[A-Z\-\s]+) \["

I used "O[^A-Z]*" in case there were other unanticipated lowercase words in front of your pattern of interest.
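
For anyone who wants to check the pattern outside Splunk, here is a minimal Python sketch run against the three sample events posted in this thread. Python's re module writes named groups as (?P<name>...) where rex uses (?<name>...), so that part is adapted; the helper name extract and the variable names are only for illustration.

```python
import re

# Sample events copied from this thread.
EVENTS = [
    "[7/03/16 12:42:24:999 AEDT] 0000005e SystemOut O BLD NS [WebContainer : 2]",
    "[7/03/16 12:02:13:370 EST] 00000060 SystemOut O nevo-web CI-BLD [WebContainer : 4]",
    "[7/03/16 11:58:06:564 EST] 00000092 SystemOut O TST [WebContainer : 2]",
]

# Same pattern as the first rex above; only the named-group syntax differs.
PATTERN = re.compile(r"O [^A-Z]*(?P<myfield>[A-Z\-\s]+) \[")


def extract(event):
    """Return the captured field, or None if the pattern does not match."""
    match = PATTERN.search(event)
    return match.group("myfield") if match else None


results = [extract(e) for e in EVENTS]
print(results)  # ['BLD NS', 'CI-BLD', 'TST']
```

The greedy character class first swallows the trailing space as well, then backtracks by one character so that the literal " [" can match, which is why the captured values carry no trailing whitespace.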

ahogbin · ‎03-06-2016

Hello... I have tried the above 2 examples, but neither gives me what I am after, and both manage to exclude most of the entries I want.
Why will my solution give me problems? It is only dealing with a small set of data and returns everything I am after.

From my limited knowledge, my query looks for an 'O' and then excludes the word nevo-web if it exists. It then returns everything else before the [, with the end result of spitting out the string I am after.

The problem with the examples to date is that they are missing the text after the first white space and before the second (the BLD in BLD NS) and are only returning NS.

Open to better solutions, and I do appreciate everyone's input.

landen99 · ‎03-06-2016

MuS made a good catch by adding \s to capture multiple words in the pattern, including "BLD". I meant to do that originally, but I was only looking at two full events when I created the regex.

To address the question about problems: in general, regex works best when matching patterns from left to right. Look-aheads and similar constructs are not that efficient, and they require the pattern to exist or to not exist, which leaves less flexibility. Since this is Splunk, I assumed large datasets, and even small datasets can become large over time. It is also best to match as generally as possible in case the logs deviate from your test data.

ahogbin · ‎03-06-2016

That makes sense.. thank you for taking the time to clarify.

Cheers.

Alastair

MuS · ‎03-06-2016

In addition, this little modification will get all the needed results:

.... | rex "O [^A-Z]*(?<myfield>[A-Z\-\s]+) \[" | ...

landen99 · ‎03-06-2016

Good catch. I agree.

ahogbin · ‎03-06-2016

Perfect... just out of curiosity, why is this any better than excluding a specific string, as in [^"nevo-web"]?

Thanks for all your help.

Alastair

ahogbin · ‎03-06-2016

Got it... rex "(?[^\.O+[^"nevo-web"]+)\s\[" seems to do the trick

Thanks for the help and suggestions

landen99 · ‎03-06-2016

The formatting for that regex did not come through right, but if it is doing what it looks like, that approach will give you problems and will take much more time than it should to complete the task, even if it works right. Check out my answer below.
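
One concrete reason the [^"nevo-web"] idea can misbehave: square brackets define a character class, so the expression excludes individual characters rather than the word nevo-web, and the "o-w" in the middle is read as a range. The Python sketch below only illustrates that behaviour (the names NOT_CLASS, WORD and allowed are made up for the example); it is not a reconstruction of the garbled rex above.

```python
import re

# Inside [...] each character stands alone, and "o-w" is read as a range.
NOT_CLASS = re.compile(r'[^"nevo-web"]')


def allowed(ch):
    """True if the single character ch is matched by the negated class."""
    return NOT_CLASS.fullmatch(ch) is not None


print(allowed("T"))  # True: uppercase letters are not listed in the class
print(allowed("n"))  # False: 'n' is listed explicitly
print(allowed("s"))  # False: 's' falls inside the o-w range
print(allowed("-"))  # True: the hyphen acts as the range operator, so it is not excluded

# An unrelated word made only of excluded letters is rejected too.
WORD = re.compile(r'[^"nevo-web"]+')
print(WORD.fullmatch("store"))  # None
```

So the class happens to work on these logs only because the wanted values are uppercase; it would also drop any lowercase text that uses the listed letters.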

esix_splunk · ‎03-06-2016

Why not capture everything between brackets...

 ... | rex "\]\s(?<myField>[^\]]+)\[" | ...
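
A quick way to see what this pattern returns is to run it against one of the sample events from this thread. This is a Python sketch, so the named group is written (?P<myField>...) instead of rex's (?<myField>...); the variable names are illustrative only.

```python
import re

# One of the sample events posted in this thread.
EVENT = "[7/03/16 12:42:24:999 AEDT] 0000005e SystemOut O BLD NS [WebContainer : 2]"

# Same pattern as the rex above, with Python's named-group syntax.
PATTERN = re.compile(r"\]\s(?P<myField>[^\]]+)\[")

match = PATTERN.search(EVENT)
captured = match.group("myField")
print(repr(captured))  # '0000005e SystemOut O BLD NS '
```

The capture starts right after the timestamp's closing bracket, so it includes the thread ID and "SystemOut O" along with the wanted text.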

ahogbin · ‎03-06-2016

Because I am only after the specific text. I am gathering everything except for the string composed of 2 parts (BLD NS).

MuS · ‎03-06-2016

Hi ahogbin,

based on your example and your regex try this:

... | rex "(?<myField>[^\s]+)\s\[" | ...

Hope this helps ...

cheers, MuS
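
As a rough check, the same pattern can be tried in Python against the three sample events from this thread. The named group is spelled (?P<myField>...) in Python, and the variable names are just for the example.

```python
import re

# Sample events copied from this thread.
EVENTS = [
    "[7/03/16 12:42:24:999 AEDT] 0000005e SystemOut O BLD NS [WebContainer : 2]",
    "[7/03/16 12:02:13:370 EST] 00000060 SystemOut O nevo-web CI-BLD [WebContainer : 4]",
    "[7/03/16 11:58:06:564 EST] 00000092 SystemOut O TST [WebContainer : 2]",
]

# [^\s]+ cannot cross a space, so only the last word before " [" is captured.
PATTERN = re.compile(r"(?P<myField>[^\s]+)\s\[")

captured = [PATTERN.search(e).group("myField") for e in EVENTS]
print(captured)  # ['NS', 'CI-BLD', 'TST']
```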

ahogbin · ‎03-06-2016

Looking good... however it is not picking up any string that has whitespace between the words (eg BLD NS - it is only including the NS component).

[7/03/16 12:23:27:936 AEDT] 0000005c SystemOut O BLD NS [WebContainer : 0]

Other than that it is working perfectly.

Cheers

MuS · ‎03-06-2016

Can you provide all possible combinations please?

ahogbin · ‎03-06-2016

There are 3 possible combinations
[7/03/16 12:42:24:999 AEDT] 0000005e SystemOut O BLD NS [WebContainer : 2]
[7/03/16 12:02:13:370 EST] 00000060 SystemOut O nevo-web CI-BLD [WebContainer : 4]
[7/03/16 11:58:06:564 EST] 00000092 SystemOut O TST [WebContainer : 2]

The extracted string is BLD NS or CI-BLD or TST

Your example works perfectly for all but BLD NS.

Thank you so much for your help

Cheers,

Alastair

ahogbin · ‎03-06-2016

Have gotten a little closer

rex "(?\w{1,4}(?:\s|\-)\w{1,4}) \["

extracts the string I am after, but for some reason some of the strings are extracted with an 'O' in front of them and others are not:

O TST

The log entry is

[7/03/16 11:32:49:079 EST] 000000b4 SystemOut O TST [WebContainer : 4]

The ones working correctly are

[7/03/16 11:32:49:101 EST] 00000060 SystemOut O nevo-web CI-TST [WebContainer : 0]

i.e. the CI-TST string is extracted.

How do I stop the leading 'O' from being included in the string ?
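
A likely explanation, sketched in Python: the pattern always asks for two short words joined by a space or hyphen, so when the value is a single word such as TST, the standalone "O" in front of it is the only thing that can serve as the first word. The group name myfield is a stand-in, since the original name did not survive in the post above.

```python
import re

# The two events quoted in the post above.
TST_EVENT = "[7/03/16 11:32:49:079 EST] 000000b4 SystemOut O TST [WebContainer : 4]"
CI_EVENT = "[7/03/16 11:32:49:101 EST] 00000060 SystemOut O nevo-web CI-TST [WebContainer : 0]"

# Two words of 1 to 4 characters separated by whitespace or a hyphen,
# followed by " [". The group name is an assumption for this sketch.
PATTERN = re.compile(r"(?P<myfield>\w{1,4}(?:\s|\-)\w{1,4}) \[")

tst_value = PATTERN.search(TST_EVENT).group("myfield")
ci_value = PATTERN.search(CI_EVENT).group("myfield")
print(tst_value)  # O TST
print(ci_value)   # CI-TST
```

Matching the "O " outside the capture group and making the second word optional, as the rex patterns from landen99 and MuS effectively do with [A-Z\-\s]+, keeps the O out of the extracted field.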
