How to write the regex for field extractions of key-value pairs in the format FIELD:VALUE from multiline events?

himynamesdave · ‎01-04-2015

I have events that look like this.

I have indexed the data using a props.conf like this:

[gmail-mbox]
MAX_EVENTS = 10000
BREAK_ONLY_BEFORE = From\s.+?@
MAX_TIMESTAMP_LOOKAHEAD = 150
NO_BINARY_CHECK = 1
TRUNCATE = 10000
pulldown_type = 1

Now I am trying to extract fields from each event. I am only interested in the fields:

X-Gmail-Labels:
Delivered-To:
Subject:
From

The field name can be seen before a colon.

The field value is everything after the colon and on the same line (for the above extractions).

How can I write a regex to extract fields in this format? Note, field values may also contain colons.
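A minimal sketch of the kind of pattern being asked for, written with Python's `re` module as a stand-in for Splunk's PCRE engine (the two agree on the features used here, but that is an assumption worth testing in Splunk itself). The sample event, the function name `extract_headers`, and all header values are made up for illustration.

```python
import re

# Hypothetical event; every value here is invented for illustration.
SAMPLE_EVENT = (
    "From 1234567890@xxx Sun Dec 28 16:11:00 2014\n"
    "X-Gmail-Labels: Sent,Important\n"
    "Delivered-To: someone@example.com\n"
    "Subject: Re: meeting at 10:30\n"
    "From: Jane Doe <jane@example.com>\n"
)

# (?m) lets ^ match at the start of every line, not only the start of the event.
# The value group stops at the end of the line but may itself contain colons.
HEADER_RE = re.compile(
    r"(?m)^(X-Gmail-Labels|Delivered-To|Subject|From):[ \t]*([^\r\n]+)"
)


def extract_headers(event):
    """Return {header name: value} for the four headers of interest."""
    return {name: value.rstrip() for name, value in HEADER_RE.findall(event)}


if __name__ == "__main__":
    for name, value in extract_headers(SAMPLE_EVENT).items():
        print(f"{name} = {value}")
```

Because the value group only excludes line breaks, a subject such as "Re: meeting at 10:30" is kept whole. The first line of the sample ("From " followed by a space) is skipped, since the pattern requires a colon directly after the header name.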

Raghav2384 · ‎01-07-2015

Assuming the sample logs break at each new line as provided, I tried replicating a piece of it. Hope this helps.

|gentimes start=-1 
|eval _raw = "X-Gmail-Labels: Sent,Important 
MIME-Version: 1.0 
Received: by 10.52.29.70 with HTTP; Sun, 28 Dec 2014 16:11:00 -0800 (PST) 
X-Originating-IP: [82.13.144.221] 
In-Reply-To: <01ff42fddfded95cfa8b14fa5559b0fb.squirrel@webmail04.register.com>"|extract pairdelim="\n",kvdelim=":"|table *

This extracted all the fields. pairdelim is set to \n (newline) so that each line is treated as one pair, and kvdelim is set to ':' so that each pair is split into key and value at the colon.
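The idea behind the two delimiters can be sketched in plain Python. This only approximates the concept and is not a model of how the extract command is implemented. In particular, splitting on the first colon only is an assumption about the desired result, chosen so that values containing colons stay intact. The function name `split_pairs` is hypothetical.

```python
# Sample headers taken from the search above.
RAW = (
    "X-Gmail-Labels: Sent,Important\n"
    "MIME-Version: 1.0\n"
    "Received: by 10.52.29.70 with HTTP; Sun, 28 Dec 2014 16:11:00 -0800 (PST)\n"
    "X-Originating-IP: [82.13.144.221]\n"
)


def split_pairs(raw, pairdelim="\n", kvdelim=":"):
    """Split raw text into pairs, then split each pair at the first kvdelim."""
    fields = {}
    for pair in raw.split(pairdelim):
        key, sep, value = pair.partition(kvdelim)
        # Skip fragments that have no delimiter or an empty key.
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    return fields


if __name__ == "__main__":
    for key, value in split_pairs(RAW).items():
        print(f"{key} = {value}")
```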

Hope this helps.

Thanks,
Raghav

eddit0r · ‎01-05-2015

The most efficient way to do the extraction in Splunk is to use the REPORT feature and a transforms.conf entry.

props.conf
[gmail-mbox]
REPORT-extract-headers = extract-headers

transforms.conf
[extract-headers]
REGEX = (?m)^([^:\r\n]+):([^\r\n]+)
FORMAT = $1::$2

Or, if you only want the headers mentioned in the question, you can list them explicitly:

transforms.conf
[extract-headers]
REGEX = (?m)^(X-Gmail-Labels|Delivered-To|Subject|From):([^\r\n]+)
FORMAT = $1::$2

That should grab the fields and values in one repeatable operation.
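As a rough illustration of how a line-anchored key/value pattern behaves on a multiline event, the sketch below applies it repeatedly with and without multiline mode. It uses Python's `re` as a stand-in for Splunk's PCRE engine, which is an assumption. The names `START_ONLY`, `EVERY_LINE`, and `pairs` are hypothetical.

```python
import re

# Sample headers taken from the thread; the line breaks are real newlines.
EVENT = (
    "X-Gmail-Labels: Sent,Important\n"
    "MIME-Version: 1.0\n"
    "Received: by 10.52.29.70 with HTTP; Sun, 28 Dec 2014 16:11:00 -0800 (PST)\n"
    "X-Originating-IP: [82.13.144.221]"
)

# Without multiline mode, ^ matches only at the very start of the event.
START_ONLY = re.compile(r"^([^:]+):([^\r\n]+)")
# With (?m), ^ also matches after every line break; the key class excludes
# line breaks so a key cannot span two lines.
EVERY_LINE = re.compile(r"(?m)^([^:\r\n]+):([^\r\n]+)")


def pairs(pattern, event):
    """Apply the pattern repeatedly and return {key: value}, whitespace trimmed."""
    return {key.strip(): value.strip() for key, value in pattern.findall(event)}


if __name__ == "__main__":
    print(len(pairs(START_ONLY, EVENT)), "pair(s) without multiline mode")
    print(len(pairs(EVERY_LINE, EVENT)), "pair(s) with multiline mode")
```

In this sketch only the first header is found when ^ is anchored to the start of the event, while all four are found once ^ is allowed to match at each line start.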

jayannah · ‎01-04-2015

Based on your updated log, here are the extractions.

If you are sure that every event contains the 4 fields you mentioned, you can use the single regex at the URL below:
https://regex101.com/r/cJ5vW2/1
P.S.: If any one of the 4 fields is missing from an event, this regex may not extract anything for that event.

If you are not sure that these fields exist in every event, I would suggest using an individual extraction for each field.

Extraction for X-Gmail-Labels : https://regex101.com/r/cJ5vW2/2
Extraction for Delivered-To : https://regex101.com/r/cJ5vW2/3
Extraction for Subject : https://regex101.com/r/cJ5vW2/4
Extraction for From : https://regex101.com/r/cJ5vW2/5
If you want to extract the name and email ID from the From field separately: https://regex101.com/r/cJ5vW2/6

These individual extractions work fine even if one of the 4 fields is missing from an event.
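For splitting a From value into name and address outside Splunk, one option is Python's standard-library parser rather than a regex. This is a swapped-in technique, not the regex behind the link above. The function name `split_from` and the sample value are hypothetical.

```python
from email.utils import parseaddr


def split_from(value):
    """Return (display_name, address) for a From header value."""
    return parseaddr(value)


if __name__ == "__main__":
    # Hypothetical From value, invented for illustration.
    name, address = split_from("Jane Doe <jane@example.com>")
    print(name)
    print(address)
```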

himynamesdave · ‎01-05-2015

For some reason these extractions capture the event from the start of the specified field to the end of the complete event (capturing everything after the field).

Strangely, when I paste this regex in the field extractor in Splunk GUI the extractions work correctly in the test mode, but fail again when extraction is saved and a search is run.

Any ideas why this might be?

jayannah · ‎01-05-2015

I think that is because your event in Splunk doesn't have the newline character. Can you please post the values extracted for the above 4 fields from one event after the regexes are saved?

Did you use the single regex for all fields or the individual regexes?
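The newline hypothesis can be checked with a small sketch. It is one possible explanation, not a confirmed diagnosis. The pattern below is a hypothetical extraction of the form header, colon, then `.+`, and Python's `re` stands in for Splunk's PCRE engine (an assumption). By default `.` stops at a line break, so the capture only runs to the end of the event when the line breaks are absent.

```python
import re

# Hypothetical single-field extraction: header name, colon, then ".+".
LABELS_RE = re.compile(r"X-Gmail-Labels\s*:\s*(?P<X_Gmail_Labels>.+)")

# Sample headers taken from the thread, once with real line breaks...
WITH_NEWLINES = (
    "X-Gmail-Labels: Sent,Important\n"
    "MIME-Version: 1.0\n"
    "X-Originating-IP: [82.13.144.221]"
)
# ...and once with the line breaks flattened into spaces.
FLATTENED = WITH_NEWLINES.replace("\n", " ")


def labels(event):
    """Return the captured label string, or None if the header is absent."""
    match = LABELS_RE.search(event)
    return match.group("X_Gmail_Labels") if match else None


if __name__ == "__main__":
    print(repr(labels(WITH_NEWLINES)))
    print(repr(labels(FLATTENED)))
```

Replacing `.+` with an explicit `[^\r\n]+` does not help when the line breaks are truly missing from the event, which is why inspecting the raw event text is a useful first step.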

himynamesdave · ‎01-05-2015

I used the individual regexes. Take for example "X-Gmail-Labels\s*:\s*(?P<X_Gmail_Labels>.+)"

Using the event in example 1 in the question, I get the following extraction for "X_Gmail_Labels" http://pastebin.com/mwPkakz1

However, when I run the regex in a search (sourcetype="gmail-mbox" | head 10000 | rex "X-Gmail-Labels\s*:\s*(?P<X_Gmail_Labels>.+)" | top 50 X_Gmail_Labels) all the fields are extracted as expected.

jayannah · ‎01-05-2015

Sorry, the URL you mentioned is blocked at my office.

Can you please try https://regex101.com/r/cJ5vW2/9 as the saved regex and let me know?

Generally the result should be the same whether the regex is saved as an extraction or used in a search query.

jayannah · ‎01-05-2015

Check if this helps: https://regex101.com/r/cJ5vW2/7

Also, I removed all the newline characters and my regex still works fine. Please see https://regex101.com/r/cJ5vW2/8

Please let me know if you are still facing any issues.

jayannah · ‎01-04-2015

This extracts the values that come after both "Received:" and "X-Received:" into the same field. Here is the saved regex:

https://regex101.com/r/lI4kZ4/1

Please let me know if you need any changes and I will modify the regex accordingly.

Also, please answer the questions I posted as comments on your question.

jayannah · ‎01-04-2015

Can you please post one complete event as it appears in the log?
Do you intend to extract the values after "Received:" and "X-Received:" into the same field name or different field names?
What is your event line-break format?

himynamesdave · ‎01-04-2015

Thanks for the help - I really appreciate it.

I have updated the question, if this helps?
