Solved: Regex expression help!

kailun92 · ‎07-23-2013

I used regex (?i)Area>(?P<Message>[^<]+) to extract the whole field below.

Originally <d:Message>(22/7)17:53 Accident on AYE (towards Tuas) after Jurong Port Rd Exit. Avoid lanes 2 and 3.</d:Message>

How can I extract only the part starting from the word "after" up to the word "Exit" (i.e. "after Jurong Port Rd Exit")? The data is updated every 5 minutes. Thanks if you guys can help! 😃

More of my XML is here: Xml Data (I only need to extract accident events).

This picture shows a search by Type="Accident".

I have total 6 Types.

After using | rex "\)\s(?<Message>.*Exit|[^.]+)" | dedup Message, there are still duplicates. Note that the following are all the same accident at Buona Vista Exit:
after Buona Vista Exit
after Buona Vista Exit with congestion till Buona Vista Exit
after Buona Vista Exit with congestion till Clementi Ave 2 Exit
after Buona Vista Exit with congestion till Clementi Ave 6 Exit
after Buona Vista Exit with congestion till Jurong Town Hall Exit

wpreston · ‎07-24-2013

Adding another answer here to avoid confusing the issue with all the different regular expressions. I'll go ahead and apologize now since this will be a pretty long-winded answer. There may be a much simpler or easier way to extract these that some Splunk ninja out there knows, but this is what I could come up with. I'd recommend testing each of these regular expressions on the command line first and, if they work for you, putting them into your transforms.conf so that you don't have to enter them on the search bar every time you need them. This page in the Knowledge Manager manual explains how to put them into transforms.conf and reference them in props.conf if you have any questions about it.

I've been thinking a lot about this one and I think I've figured out the pattern your events follow. I'll write out the structure I see, then how to extract each part of it, using the following event as an example (my field names might not match yours, but go with me for a minute):

example: (22/7)19:55 Accident on ECP (towards Changi Airport) after Maxwell Rd Entrance. Avoid lane 1.</d:Message>

This event is made up of:

One or more Accident Locations: (22/7)19:55 Accident on ECP

(date)time Accident at|on <Accident_Location>

Followed by a direction modifier: (towards Changi Airport)

(towards <Direction_Modifier>)

Followed by a location modifier: after Maxwell Rd Entrance. Note that this field is difficult to extract because its endpoint is arbitrary, i.e. does the field stop at Mandai Ave or at Mandai Ave Exit? Sometimes it stops at Entrance, as in the example.

before|after <Location_Modifier>

Optionally followed by a condition modifier: (missing from this event, but something like "with congestion blah blah".) Note that this field is difficult to extract since it appears that this field starts with arbitrary key words, like "with". Extracting this one will be an evolving experience for you as you come across the arbitrary key words and continue to add them to the rex.

with <Condition_Modifier>

Optionally followed by traffic Advice: Avoid lane 1.

<Advice>

They seem to basically follow this formula, so you can use the following regex's to extract all of these fields. They should account for the special cases where there is more than one Location in the same record. Adjust the field names to match the field names you want to use. (Again, note that these extractions may not cover every possible instance since I don't know your data, this is just how it appears to me. You know your data much better than I and can adapt the regex's to meet your needs)

To extract the Location field:

... your search ... | rex ":\d+\sAccident\s(on|at)\s(?<Location>(\w|\s|[?\/])+)?,?\s\("
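For anyone who wants to sanity-check the Location pattern outside Splunk, here is a minimal Python sketch. Some assumptions to keep in mind: Python's re module spells named groups (?P<name>...) rather than the (?<name>...) form used with rex, and extract_locations is a hypothetical helper name, not anything from Splunk. Both sample messages are copied from this thread.

```python
import re

# Same pattern as the rex above; only the named-group spelling is adapted for Python.
LOCATION_RE = re.compile(
    r":\d+\sAccident\s(on|at)\s(?P<Location>(\w|\s|[?\/])+)?,?\s\("
)

def extract_locations(message):
    """Return every Location match in a message, in order of appearance."""
    return [m.group("Location") for m in LOCATION_RE.finditer(message)]

# One accident in the record.
single = ("(22/7)19:55 Accident on ECP (towards Changi Airport) "
          "after Maxwell Rd Entrance. Avoid lane 1.")

# Two accidents in the same record, including a "/" inside the location.
double = ("(22/7)23:38 Accident at Guillemard Road/Mountbatten Road Junction, "
          "(21/7)9:03 Accident on Dairy Farm Road (towards Bukit Timah Expressway) "
          "after Petir Road. Avoid left lane.")

print(extract_locations(single))
print(extract_locations(double))
```

The sketch uses finditer, so it returns both locations from the second message. As far as I know, rex only keeps the first match per event unless you raise its max_match option, so treat the multi-match behaviour here as a property of the Python sketch.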

To extract the Direction_Modifier field:

... your search ... | rex "\(towards\s(?<Direction_Modifier>[^\)]+)"

Be sure to add in any additional keywords that start this field, like "towards". For example, if there is some data where this field starts with "near", modify the rex like this to account for it:

... your search ... | rex "\((towards|near)\s(?<Direction_Modifier>[^\)]+)"
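The Direction_Modifier extraction can be tried the same way. This is only a sketch under a few assumptions: the named group is written with its closing ">" and in Python's (?P<name>...) spelling, extract_direction is a hypothetical helper name, and "near" is included purely because it is the example keyword mentioned above.

```python
import re

# "(" + keyword + whitespace, then capture everything up to the closing ")".
DIRECTION_RE = re.compile(r"\((towards|near)\s(?P<Direction_Modifier>[^\)]+)")

def extract_direction(message):
    """Return the text that follows the keyword inside the parentheses, or None."""
    m = DIRECTION_RE.search(message)
    return m.group("Direction_Modifier") if m else None

# Both messages are copied from this thread.
msg_ecp = ("(22/7)19:55 Accident on ECP (towards Changi Airport) "
           "after Maxwell Rd Entrance. Avoid lane 1.")
msg_aye = ("(22/7)17:53 Accident on AYE (towards Tuas) "
           "after Jurong Port Rd Exit. Avoid lanes 2 and 3.")

print(extract_direction(msg_ecp))
print(extract_direction(msg_aye))
```

The leading "(22/7)" is skipped because the pattern insists on one of the keywords right after the opening parenthesis.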

Extracting the Location_Modifier field will take some work on your part, and will be an evolving experience for you as you come across the arbitrary key words and continue to add them to the rex. The rex I set up below ends the field capture after a street designator, entry or exit designator, a period (.), or before the word "with" (since the Condition_Modifier field seems to always start with it). You will need to add any other street types or abbreviations into the piped list inside the rex if there are any that I missed. You will also need to add any other words besides "with" that are the start of the Condition_Modifier:

... your search ... | rex "\D\)\s(?<Location_Modifier>[^\.]*?(Exit|Road|Entrance|Avenue|Junction|Parkway|Rd|Pkwy|Ave|Way)\s?(Exit|Entrance)?)\s?(with)?"
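Here is a rough Python sketch of how the Location_Modifier pattern behaves, in case it helps when adding street types. Assumptions: the named group uses Python's (?P<name>...) spelling, extract_location_modifier is a hypothetical helper name, and the "congested" sample is made up by putting an invented date, time and road in front of a fragment quoted earlier in the thread. The other two messages are copied from the thread as-is.

```python
import re

LOCATION_MODIFIER_RE = re.compile(
    r"\D\)\s(?P<Location_Modifier>[^\.]*?"
    r"(Exit|Road|Entrance|Avenue|Junction|Parkway|Rd|Pkwy|Ave|Way)"
    r"\s?(Exit|Entrance)?)\s?(with)?"
)

def extract_location_modifier(message):
    """Return the location modifier ("before/after <place>"), or None."""
    m = LOCATION_MODIFIER_RE.search(message)
    # strip(): the \s? inside the group can swallow the space in front of "with".
    return m.group("Location_Modifier").strip() if m else None

entrance = ("(22/7)19:55 Accident on ECP (towards Changi Airport) "
            "after Maxwell Rd Entrance. Avoid lane 1.")
exit_only = ("(22/7)17:53 Accident on AYE (towards Tuas) "
             "after Jurong Port Rd Exit. Avoid lanes 2 and 3.")
# Made-up prefix; only the part after ")" comes from the thread.
congested = ("(23/7)8:15 Accident on AYE (towards Tuas) "
             "after Buona Vista Exit with congestion till Clementi Ave 6 Exit.")

for message in (entrance, exit_only, congested):
    print(extract_location_modifier(message))
```

Because the capture stops at the first street or exit designator, the "with congestion till ..." tail no longer ends up in the field, which is what should let dedup collapse the Buona Vista variants into one value.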

I'm going to leave off the traffic Advice field since I've rambled on for long enough. Hopefully this gets you what you need and I think it covers all the cases you've posted about so far.

kailun92 · ‎07-24-2013

Thanks will figure out on all cases 😃

paddygriffin · ‎07-24-2013

rex "\)\s(?<message>.*Exit|[^.]+)" | dedup Message
Case sensitivity in field names: I notice you used "message" [all lower case] in the regex but "Message" in the dedup. Field names are case sensitive, so this may be part of your problem.

paddygriffin · ‎07-23-2013

Have you tried using the Interactive Field Extraction (IFX) feature and having Splunk do the heavy lifting with regex while you feed it examples to train it? http://docs.splunk.com/Documentation/Splunk/5.0.3/Knowledge/ExtractfieldsinteractivelywithIFX
A second advantage is that this creates a persistent field definition unlike rex command which is transient.

kailun92 · ‎07-23-2013

thanks will try it out !

gfuente · ‎07-23-2013

Here you go:

Including the word exit:

after\s(?< yourfield >(\w|\s)+)\.

Without the "exit":

after\s(?< yourfield >(\w|\s)+)\sExit\.

*Remove the blanks before and after "yourfield"
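A quick way to see the difference between the two variants is to run them in Python against the message from the question. This is just a sketch with some assumptions: the group is written as (?P<yourfield>...) because Python's re module needs that spelling, and the last part only shows that Python also rejects a group name that still has the blanks in it (Splunk's exact error text may differ).

```python
import re

# Variant 1: keeps the word "Exit" in the captured value.
WITH_EXIT = re.compile(r"after\s(?P<yourfield>(\w|\s)+)\.")
# Variant 2: requires " Exit." after the capture, so "Exit" is left out.
WITHOUT_EXIT = re.compile(r"after\s(?P<yourfield>(\w|\s)+)\sExit\.")

message = ("(22/7)17:53 Accident on AYE (towards Tuas) "
           "after Jurong Port Rd Exit. Avoid lanes 2 and 3.")

with_exit = WITH_EXIT.search(message).group("yourfield")
without_exit = WITHOUT_EXIT.search(message).group("yourfield")
print(with_exit)
print(without_exit)

# Leaving the blanks around the group name makes the pattern invalid.
try:
    re.compile(r"after\s(?P< yourfield >(\w|\s)+)\.")
    blanks_ok = True
except re.error:
    blanks_ok = False
print(blanks_ok)
```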

Regards

kailun92 · ‎07-23-2013

Check out the update, sorry for the brief summary earlier. I tried the regex after removing the blanks before and after the field name, but it is giving me: Invalid regex: syntax error
Regex does not extract any named fields.

wpreston · ‎07-23-2013

Updated to address comment

I've updated the Regular Expression to address the data you're working with, and I believe the following will work:

... your search ... | rex "\)\s(?<Message>[^<]+)"

In the sample data you provided (thanks for that!), it extracts the following data to the Message field:

Message = before Mandai Ave Exit with congestion till BKE Entrance. Avoid lane 4.
Message = after Toa Payoh Exit. Avoid lane 1.
Message = after Thomson Rd.
Message = before Mandai Ave Exit with congestion till BKE Entrance. Avoid lane 4.
Message = before Kallang Way.
Message = after Toa Payoh Exit. Avoid lane 1.
Message = after Thomson Rd.
Message = before Mandai Ave Exit with congestion till BKE Entrance. Avoid lane 4.
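To see where that capture starts and stops, here is a minimal Python sketch run against the original message from the question. Assumptions: the named group is spelled (?P<Message>...) for Python's re module, and the raw string is the single <d:Message> element quoted at the top of the thread rather than the full XML feed.

```python
import re

# Start after the first ")" that is followed by whitespace,
# then take everything up to (not including) the next "<".
MESSAGE_RE = re.compile(r"\)\s(?P<Message>[^<]+)")

raw = ("<d:Message>(22/7)17:53 Accident on AYE (towards Tuas) "
       "after Jurong Port Rd Exit. Avoid lanes 2 and 3.</d:Message>")

extracted = MESSAGE_RE.search(raw).group("Message")
print(extracted)
```

The ")" in "(22/7)" is skipped because it is followed by a digit rather than whitespace, so the capture begins after "(towards Tuas) " and runs until the closing tag.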

Is this what you're looking for?

Update 2

Try this regex:

... your search ... | rex "\)\s(?<Message>.*Exit|[^.]+)"

This basically looks for everything up to and including the word "Exit", or everything up to the first "." in the message field. I don't know if it will work in all possible cases, but it will work in the sample you provided.
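Here is a small Python sketch of how this pattern behaves, including on a message with a "with congestion till ..." tail. Assumptions: the group is spelled (?P<Message>...) for Python, first_message is a hypothetical helper, the "congested" sample has an invented date, time and road in front of a fragment quoted in the question, and the lazy variant (.*?Exit) is only an illustration of one possible tweak, not something posted in this thread.

```python
import re

# Pattern from this update: ".*" is greedy, so it runs to the LAST "Exit".
UPDATE2_RE = re.compile(r"\)\s(?P<Message>.*Exit|[^.]+)")
# Illustrative variant (an assumption): ".*?" is lazy and stops at the FIRST "Exit".
LAZY_RE = re.compile(r"\)\s(?P<Message>.*?Exit|[^.]+)")

def first_message(pattern, text):
    """Return the Message group of the first match, or None."""
    m = pattern.search(text)
    return m.group("Message") if m else None

plain = ("(22/7)17:53 Accident on AYE (towards Tuas) "
         "after Jurong Port Rd Exit. Avoid lanes 2 and 3.")
no_exit = ("(22/7)19:55 Accident on ECP (towards Changi Airport) "
           "after Maxwell Rd Entrance. Avoid lane 1.")
# Made-up prefix; only the part after ")" comes from the thread.
congested = ("(23/7)8:15 Accident on AYE (towards Tuas) "
             "after Buona Vista Exit with congestion till Clementi Ave 6 Exit.")

for text in (plain, no_exit, congested):
    print(first_message(UPDATE2_RE, text), "|", first_message(LAZY_RE, text))
```

With the greedy form, the congested message keeps its whole tail, so each different "till ..." ending yields a different Message value.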

As to using a Splunk Generated Pattern (regex), I don't really use that feature so unfortunately I don't know the answer.

kailun92 · ‎07-24-2013

I just realised that there is still a little bit of duplication; check out the last picture in the update. Is there any way to remove it? I used | dedup Message and it is not helping.

kailun92 · ‎07-23-2013

Thank you sooo much ! Have a great day ! Good job !

kailun92 · ‎07-23-2013

I had some data like: (22/7)23:38 Accident at Guillemard Road/Mountbatten Road Junction, (21/7)9:03 Accident on Dairy Farm Road (towards Bukit Timah Expressway) after Petir Road. Avoid left lane. The field comes out as NULL because of the / and (). How can I solve that?

kailun92 · ‎07-23-2013

Is it possible to do Message = before Mandai Ave Exit, Message = after Toa Payoh Exit, Message = after Thomson Rd, Message = before Mandai Ave Exit, Message = before Kallang Way, Message = after Toa Payoh Exit, Message = after Thomson Rd, Message = before Mandai Ave Exit ? Without the Avoid.

kailun92 · ‎07-23-2013

Check out the update, sorry for the brief summary earlier. 😃 I tried the expression but it won't work: Invalid regex: syntax error
Regex does not extract any named fields.
