I am trying to extract fields out of events that are tab-delimited unless there are quotes around them. For example,
CMSFFMHC 11/21/16 5:19 PM "This is an error message
at some point in the code" 16891349 USERNAME 4 function 1234567890
The format is:
string -tab- date/time -tab- message -tab- error_code -tab- username -tab- transaction_code -tab- function -tab- transaction_id
Sometimes "message" contains a tab character. When it does, it's enclosed in quotes ( "message" ). I'm trying to write a regular expression to extract the message field that would:
- if there is a quotation mark before message, ignore it
- capture everything until a tab followed by eight digits
- if there is a quotation mark at the end of the message, ignore it
Further examples:
ABCDEFG 11/21/16 11:14 PM The request channel timed out while waiting for a reply after 00:00:17.9375000. Increase the timeout value passed to the call to Request or increase the SendTimeout value on the Binding. The time allotted to this operation may have been a portion of a long 27332747 HOST12345 5 Function1 12345678901234567890123456
ABCDEFG 11/21/16 11:13 PM The request channel timed out while waiting for a reply after 00:00:18. Increase the timeout value passed to the call to Request or increase the SendTimeout value on the Binding. The time allotted to this operation may have been a portion of a longer timeo 16964220 HOST23456 5 Function2 23456789012345678901234567
ABCDEFG 11/21/16 5:19 PM "
The operationFFMContactLookupfailed.
Error Code:[OSB-380000]
Reason :[[OSB-381304]Exception in HttpOutboundMessageContext.RetrieveHttpResponseWork.run: java.net.SocketTimeoutException
java.net.SocketTimeoutException
at weblogic.net.http.Sock" 16891349 HOST34567 4 Function3 34567890123456789012345678
ABCDEFG 11/21/16 4:06 PM "
The operationFFMContactLookupfailed.
Error Code:[OSB-380000]
Reason :[[OSB-381304]Exception in HttpOutboundMessageContext.RetrieveHttpResponseWork.run: java.net.SocketTimeoutException
java.net.SocketTimeoutException
at weblogic.net.http.Sock" 16865750 HOST45678 4 Function4 45678901234567890123456789
Here is my current field extraction in props.conf:
EXTRACT-message=^\w+\t[^\t]+\t(?P<message>[^\t]+)
The problem is this keeps the quotation marks and stops at the tab character in the midst of the message.
Can someone more versed in PCRE please lend me a hand?
Thanks!
Like this:
| rex "(?ms)^(?:[^\t]*\t){2}\"?(?<message>.+)\"?\t(?:[^\t]*\t){4}[^\t]*$"
See if this regex suffices; it should capture all the cases you mentioned:
^(?<first>[^\t]+)\t(?<second>[^\t]+)(\t|\t")(?<message>(.*))("\t|\t)\d{8}\t
See extraction at work here
How about something like this?
^[^\t]+\t[^\t]+\t\"?(?<message>[^\"]+)\"?\t\d{8}\t
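One way to try the pattern above outside Splunk is a short Python check. This is only a sketch: Python spells named groups (?P<name>...), the two events are hypothetical reconstructions of the samples with explicit tabs, and extract_message is a made-up helper name.

```python
import re

# Same pattern as above, with Python's (?P<name>...) group syntax.
# [^"]+ can span tabs and newlines, so the match runs up to the
# tab + eight digits + tab that follows the message.
PATTERN = re.compile(r'^[^\t]+\t[^\t]+\t"?(?P<message>[^"]+)"?\t\d{8}\t')

# Hypothetical events rebuilt from the samples (tabs written explicitly).
plain = ("ABCDEFG\t11/21/16 11:14 PM\tThe request channel timed out"
         "\t27332747\tHOST12345\t5\tFunction1\t12345678901234567890123456")
quoted = ('ABCDEFG\t11/21/16 5:19 PM\t"\nError Code:[OSB-380000]\n'
          '\tat weblogic.net.http.Sock"'
          "\t16891349\tHOST34567\t4\tFunction3\t34567890123456789012345678")


def extract_message(event):
    """Return the message field, or None if the event does not match."""
    m = PATTERN.search(event)
    return m.group("message") if m else None


if __name__ == "__main__":
    print(repr(extract_message(plain)))
    print(repr(extract_message(quoted)))
```

One caveat with this approach: because the capture is [^"]+, an event whose message contains a literal double quote will not match at all.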
\"(?P<message>[^\"]+)
should give you a good start.
Sometimes rex can be funny with quote marks; you may end up having to add extra escaping, e.g. \" or \\"
Sorry - just re-read your post. This should suffice, I think:
"(?P<message>.+)"
This will extract everything between the double quote marks and assign it to a field named "message". (remove the single quote marks)
If you can post the contents of your props.conf and a few more sample events, we can review and make sure it will work.
Thank you, sshelly, but your suggestion doesn't work. The regex should only consider the quotations if they are there, which most of the time they are not. More importantly, it needs to stop at the next tab, unless the tab is part of the message when it is enclosed in quotation marks.
So - if a double quote exists, there is a pair of them around the message body? If no double quotes, then no "message"? Just trying to picture it in my head. You can do a conditional pipe in the middle of the regex. Can you describe what would constitute a "message" or at least, what would precede a message? I think I can do it; just need a bit of detail first.
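The "conditional pipe" idea could look something like the following, sketched in Python rather than in props.conf. It assumes the rule described earlier in the thread (a quoted message runs to the closing quote, an unquoted one runs to the next tab). The group names quoted and plain and the helper extract_message are hypothetical; two names are used because Python does not allow the same group name twice in one pattern.

```python
import re

PATTERN = re.compile(
    r'^[^\t]+\t[^\t]+\t'        # first two tab-delimited fields
    r'(?:"(?P<quoted>[^"]*)"'   # quoted message: everything up to the closing quote
    r'|(?P<plain>[^\t]*))'      # otherwise: everything up to the next tab
    r'\t'                       # the tab that ends the message field
)

# Hypothetical events rebuilt from the samples (tabs written explicitly).
plain_event = ("ABCDEFG\t11/21/16 11:13 PM\tThe request channel timed out"
               "\t16964220\tHOST23456\t5\tFunction2\t23456789012345678901234567")
quoted_event = ('ABCDEFG\t11/21/16 4:06 PM\t"\nError Code:[OSB-380000]\n'
                '\tat weblogic.net.http.Sock"'
                "\t16865750\tHOST45678\t4\tFunction4\t45678901234567890123456789")


def extract_message(event):
    """Return the message without surrounding quotes, or None on no match."""
    m = PATTERN.match(event)
    if m is None:
        return None
    quoted = m.group("quoted")
    return quoted if quoted is not None else m.group("plain")


if __name__ == "__main__":
    print(repr(extract_message(plain_event)))
    print(repr(extract_message(quoted_event)))
```

Unlike matching up to "a tab followed by eight digits", this version does not depend on the length of the error code, but it does assume a quoted message never contains a double quote itself.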
Can you post at least two examples of the event data with "message" fields?
There's only one double quote in your events above. Can you edit them, maybe? Also, if you could include the props.conf entries and/or transforms that come into play, that would help. If you're doing a tab-separated extract, you might need to get more specific using transforms, since tabs can appear within field values.
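To see why tabs inside field values matter, it can help to parse a couple of events with a quote-aware tab-separated reader. The sketch below uses Python's csv module for illustration only (it is not a Splunk configuration). The field names are taken loosely from the format line in the question, and the events are hypothetical reconstructions with explicit tabs.

```python
import csv
import io

# Field names adapted from the format described in the question (illustrative).
FIELDS = ["string", "date_time", "message", "error_code",
          "username", "transaction_code", "function", "transaction_id"]

# Hypothetical events: one plain, one with a quoted message holding a newline and a tab.
sample = (
    "ABCDEFG\t11/21/16 11:14 PM\tThe request channel timed out"
    "\t27332747\tHOST12345\t5\tFunction1\t12345678901234567890123456\n"
    'ABCDEFG\t11/21/16 5:19 PM\t"\nError Code:[OSB-380000]\n'
    '\tat weblogic.net.http.Sock"'
    "\t16891349\tHOST34567\t4\tFunction3\t34567890123456789012345678\n"
)


def parse_events(text):
    """Split tab-delimited events into dicts, honouring double-quoted fields."""
    # newline="" lets the csv reader see embedded newlines inside quoted fields.
    reader = csv.reader(io.StringIO(text, newline=""),
                        delimiter="\t", quotechar='"')
    return [dict(zip(FIELDS, row)) for row in reader]


if __name__ == "__main__":
    for event in parse_events(sample):
        print(repr(event["message"]), event["error_code"])
```

A quote-aware parser keeps the embedded tab and newlines inside the message, whereas a naive split on tabs would shift every later field by one position.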