Splunk Search

How to write a regular expression to extract a long string until a group of multiple characters, and ignore if it begins or ends with double quotes?

jwalthour
Communicator

I am trying to extract fields out of events that are tab-delimited unless there are quotes around them. For example,

CMSFFMHC    11/21/16 5:19 PM    "This is an error message 
    at some point in the code"  16891349    USERNAME    4   function    1234567890

The format is:

string -tab- date/time -tab- message -tab- error_code -tab- username -tab- transaction_code -tab- function -tab- transaction_id

Sometimes "message" contains a tab character. When it does, it's enclosed in quotes ( "message" ). I'm trying to write a regular expression to extract the message field that would:
- if there is a quotation mark before message, ignore it
- capture everything until a tab followed by eight digits
- if there is a quotation mark at the end of the message, ignore it
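
The three rules above can be sketched as a single pattern. This is a minimal Python illustration rather than a tested Splunk configuration: the name `extract_message` and the sample events are made up for the example, and it assumes the fields are separated by real tab characters and that the error code is always exactly eight digits followed by a tab.

```python
import re

# Assumption: real tab separators, and an error code of exactly eight digits.
MESSAGE_RE = re.compile(
    r'^[^\t]+\t'            # first field (string)
    r'[^\t]+\t'             # second field (date/time)
    r'"?'                   # optional opening quote, not captured
    r'(?P<message>.+?)'     # message, as short as possible
    r'"?'                   # optional closing quote, not captured
    r'\t(?=\d{8}\t)',       # tab that is followed by eight digits and a tab
    re.DOTALL,              # let "." match newlines inside the message
)


def extract_message(event):
    """Return the message field of one event, or None if nothing matches."""
    match = MESSAGE_RE.search(event)
    return match.group("message") if match else None


# Hypothetical sample events, rebuilt with explicit tab characters.
plain = ("ABCDEFG\t11/21/16 11:14 PM\tThe request channel timed out"
         "\t27332747\tHOST12345\t5\tFunction1\t1234567890")
quoted = ('CMSFFMHC\t11/21/16 5:19 PM\t"This is an error message \n'
          '\tat some point in the code"\t16891349\tUSERNAME\t4\tfunction\t1234567890')

print(repr(extract_message(plain)))
print(repr(extract_message(quoted)))
```

The lazy `.+?` is what keeps the closing quote out of the capture. In a Splunk extraction the same idea would presumably need `(?s)` at the start of the pattern so that `.` can span line breaks.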

Further examples:

ABCDEFG 11/21/16 11:14 PM   The request channel timed out while waiting for a reply after 00:00:17.9375000. Increase the timeout value passed to the call to Request or increase the SendTimeout value on the Binding. The time allotted to this operation may have been a portion of a long    27332747    HOST12345   5   Function1   12345678901234567890123456
ABCDEFG 11/21/16 11:13 PM   The request channel timed out while waiting for a reply after 00:00:18. Increase the timeout value passed to the call to Request or increase the SendTimeout value on the Binding. The time allotted to this operation may have been a portion of a longer timeo    16964220    HOST23456   5   Function2   23456789012345678901234567
ABCDEFG 11/21/16 5:19 PM    "
The operationFFMContactLookupfailed. 
  Error Code:[OSB-380000]
  Reason    :[[OSB-381304]Exception in HttpOutboundMessageContext.RetrieveHttpResponseWork.run: java.net.SocketTimeoutException
java.net.SocketTimeoutException
    at weblogic.net.http.Sock"  16891349    HOST34567   4   Function3   34567890123456789012345678
    ABCDEFG 11/21/16 4:06 PM    "
The operationFFMContactLookupfailed. 
  Error Code:[OSB-380000]
  Reason    :[[OSB-381304]Exception in HttpOutboundMessageContext.RetrieveHttpResponseWork.run: java.net.SocketTimeoutException
java.net.SocketTimeoutException
    at weblogic.net.http.Sock"  16865750    HOST45678   4   Function4   45678901234567890123456789

Here is my current field extraction in props.conf:

EXTRACT-message=^\w+\t[^\t]+\t(?P<message>[^\t]+)
The problem is this keeps the quotation marks and stops at the tab character in the midst of the message.
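
The behavior described here can be reproduced outside Splunk. The following is a small Python sketch that runs the same extraction pattern, with its capture group named `message`, against the first sample event. The event string is rebuilt with explicit `\t` characters, which is an assumption about the raw data.

```python
import re

# The EXTRACT pattern from props.conf, with the capture group named "message".
CURRENT_RE = re.compile(r'^\w+\t[^\t]+\t(?P<message>[^\t]+)')

# First sample event, rebuilt with explicit tabs (assumed layout).
event = ('CMSFFMHC\t11/21/16 5:19 PM\t"This is an error message \n'
         '\tat some point in the code"\t16891349\tUSERNAME\t4\tfunction\t1234567890')

captured = CURRENT_RE.search(event).group("message")
print(repr(captured))
```

Because `[^\t]+` can never cross a tab, the capture ends at the first tab inside the quoted message, and nothing in the pattern skips the opening quote.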

Can someone more versed in PCRE please lend me a hand?

Thanks!

0 Karma

woodcock
Esteemed Legend

Like this:

| rex "(?ms)^(?:[^\t]*\t){2}\"?(?<message>.+)\"?\t(?:[^\t]*\t){4}[^\t]*$"
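
One way to check this pattern outside Splunk is to translate it to Python, where named groups are written `(?P<name>...)` and the quotes need no backslash. In the sketch below (sample event rebuilt with explicit tabs, an assumption about the raw data), the greedy `.+` runs up to the tab before the last five fields, so the closing quote appears to end up inside `message`. The second pattern is a hypothetical variant, not part of the original answer, that only makes the group lazy so the optional `\"?` can consume the quote instead.

```python
import re

# First sample event from the question, rebuilt with explicit tabs (assumed layout).
event = ('CMSFFMHC\t11/21/16 5:19 PM\t"This is an error message \n'
         '\tat some point in the code"\t16891349\tUSERNAME\t4\tfunction\t1234567890')

# The pattern from the answer above, translated to Python syntax.
GREEDY_RE = re.compile(
    r'(?ms)^(?:[^\t]*\t){2}"?(?P<message>.+)"?\t(?:[^\t]*\t){4}[^\t]*$'
)
# Hypothetical variant: identical except that the message group is lazy.
LAZY_RE = re.compile(
    r'(?ms)^(?:[^\t]*\t){2}"?(?P<message>.+?)"?\t(?:[^\t]*\t){4}[^\t]*$'
)

greedy_message = GREEDY_RE.search(event).group("message")
lazy_message = LAZY_RE.search(event).group("message")

print(repr(greedy_message))
print(repr(lazy_message))
```

Both variants rely on the event having exactly five tab-separated fields after the message, rather than on the eight-digit error code.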
0 Karma

gokadroid
Motivator

See if this regex suffices; it should capture all the cases you mentioned:

^(?<first>[^\t]+)\t(?<second>[^\t]+)(\t|\t")(?<message>(.*))("\t|\t)\d{8}\t

See extraction at work here

0 Karma

maciep
Champion

How about something like this?

^[^\t]+\t[^\t]+\t\"?(?<message>[^\"]+)\"?\t\d{8}\t
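
This pattern relies on `[^\"]+`, which spans tabs and line breaks but stops at any double quote. A small Python check (named group rewritten as `(?P<message>...)`; the function name `maciep_message` and the sample events, including the one with an embedded quote, are made up for illustration) suggests it handles both the quoted and unquoted layouts. The caveat is that a message containing a double quote of its own would not match.

```python
import re

# The pattern from the answer above, translated to Python syntax.
MACIEP_RE = re.compile(r'^[^\t]+\t[^\t]+\t"?(?P<message>[^"]+)"?\t\d{8}\t')


def maciep_message(event):
    """Return the captured message, or None when the pattern does not match."""
    match = MACIEP_RE.search(event)
    return match.group("message") if match else None


# Hypothetical sample events, rebuilt with explicit tab characters.
plain = ("ABCDEFG\t11/21/16 11:14 PM\tThe request channel timed out"
         "\t27332747\tHOST12345\t5\tFunction1\t1234567890")
quoted = ('CMSFFMHC\t11/21/16 5:19 PM\t"This is an error message \n'
          '\tat some point in the code"\t16891349\tUSERNAME\t4\tfunction\t1234567890')
embedded = ('CMSFFMHC\t11/21/16 5:19 PM\t"He said "no" twice"'
            '\t16891349\tUSERNAME\t4\tfunction\t1234567890')

for sample in (plain, quoted, embedded):
    print(repr(maciep_message(sample)))
```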
0 Karma

mrgibbon
Contributor

\"(?P<message>[^\"]+) should give you a good start.
Sometimes rex can be funny with quote marks; you may end up having to add extra backslashes, e.g. \" or \\"

0 Karma

sshelly_splunk
Splunk Employee

Sorry - just re-read your post. This should suffice, I think:
"(?P<message>.+)"
This will extract everything between the double quote marks and assign it to a field named "message". (Remove the single quote marks.)
If you can post the contents of your props.conf and a few more sample events, we can review and make sure it will work.

0 Karma

jwalthour
Communicator

Thank you, sshelly, but your suggestion doesn't work. The regex should only consider the quotations if they are there, which most of the time they are not. More importantly, it needs to stop at the next tab, unless the tab is part of the message when it is enclosed in quotation marks.

0 Karma

sshelly_splunk
Splunk Employee

So - if a double quote exists, there is a pair of them around the message body? If no double quotes, then no "message"? Just trying to picture it in my head. You can do a conditional pipe in the middle of the regex. Can you describe what would constitute a "message" or, at least, what would precede a message? I think I can do it; I just need a bit of detail first.

0 Karma

sshelly_splunk
Splunk Employee

Can you post at least two examples of the event data with "message" fields?

0 Karma

sshelly_splunk
Splunk Employee

There's only one double quote in your events above. Can you edit them, maybe? Also, if you could include the props.conf entries and/or transforms that come into play, that would help. If you're doing a tab-separated extract, you might need to get more specific using transforms, since tabs can appear within field values.

0 Karma