Splunk Search

How to write the regex for field extractions of key-value pairs in the format FIELD:VALUE from multiline events?

himynamesdave
Contributor

I have events that look like this.

Example 1.

Example 2.
.......

I have indexed the data using a props.conf like this:

[gmail-mbox]
MAX_EVENTS = 10000
BREAK_ONLY_BEFORE = From\s.+?@
MAX_TIMESTAMP_LOOKAHEAD = 150
NO_BINARY_CHECK = 1
TRUNCATE = 10000
pulldown_type = 1

I am now trying to extract fields from each event. I am only interested in these fields:

X-Gmail-Labels:
Delivered-To:
Subject:
From

The field name can be seen before a colon.

The field value is everything after the colon and on the same line (for the above extractions).

How can I write a regex to extract fields in this format? Note, field values may also contain colons.

0 Karma

Raghav2384
Motivator

Assuming the sample logs break at each new line as provided, I tried replicating a piece of it:

|gentimes start=-1 
|eval _raw = "X-Gmail-Labels: Sent,Important 
MIME-Version: 1.0 
Received: by 10.52.29.70 with HTTP; Sun, 28 Dec 2014 16:11:00 -0800 (PST) 
X-Originating-IP: [82.13.144.221] 
In-Reply-To: <01ff42fddfded95cfa8b14fa5559b0fb.squirrel@webmail04.register.com>"|extract pairdelim="\n",kvdelim=":"|table *

This extracted all the fields. pairdelim is set to "\n" so that pairs are split at each newline, and kvdelim is set to ":" so that each key is separated from its value at the colon.
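As a rough illustration outside Splunk, the same idea (split the event into lines, then split each line into key and value) can be sketched in Python. The function name parse_headers is hypothetical, and splitting at the first colon only is an assumption made here so that colons inside values survive; it is not claimed to match exactly what the extract command does internally.

```python
def parse_headers(raw: str) -> dict:
    """Split raw text into lines, then split each line at its first colon."""
    fields = {}
    for line in raw.splitlines():
        # partition() splits at the FIRST colon only, so later colons stay in the value
        key, sep, value = line.partition(":")
        if not sep:
            continue  # no colon on this line, so it is not a key-value pair
        fields[key.strip()] = value.strip()
    return fields


# Sample lines taken from the search above
sample = (
    "X-Gmail-Labels: Sent,Important\n"
    "MIME-Version: 1.0\n"
    "Received: by 10.52.29.70 with HTTP; Sun, 28 Dec 2014 16:11:00 -0800 (PST)\n"
)
headers = parse_headers(sample)
print(headers["Received"])
```

Note that the Received value keeps its time-of-day colons, which is the case the question asks about.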

Hope this helps.

Thanks,
Raghav

0 Karma

eddit0r
Explorer

The most efficient way to do the extraction in Splunk is to use a REPORT setting in props.conf together with a transforms.conf entry.

props.conf
[gmail-mbox]
REPORT-extract-headers = extract-headers

transforms.conf
[extract-headers]
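# (?m) lets ^ match at the start of every line of a multiline event, not only at the start of the event
REGEX = (?m)^([^:\r\n]+):[ \t]*([^\r\n]+)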
FORMAT = $1::$2

Or if you want to make it specific to just those headers mentioned you can make it explicit as such.

transforms.conf
[extract-headers]
# (?m) lets ^ match at the start of every line of a multiline event
REGEX = (?m)^(X-Gmail-Labels|Delivered-To|Subject|From):[ \t]*([^\r\n]+)
FORMAT = $1::$2

That should grab the fields and values in one repeatable operation.
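A quick way to sanity-check a header regex before putting it into transforms.conf is to run it over a sample event outside Splunk. The sketch below uses Python's re module as a stand-in for Splunk's regex engine and assumes per-line anchoring (re.MULTILINE, the counterpart of an inline (?m)). The sample event is made up, and the hyphen-to-underscore renaming is only there to mirror the X_Gmail_Labels field name that appears later in this thread.

```python
import re

# Only the four headers of interest; ^ is applied per line via re.MULTILINE
HEADER_RE = re.compile(
    r"^(X-Gmail-Labels|Delivered-To|Subject|From):[ \t]*([^\r\n]*)",
    re.MULTILINE,
)

# Made-up sample event for illustration
event = (
    "Delivered-To: someone@example.com\n"
    "X-Gmail-Labels: Sent,Important\n"
    "MIME-Version: 1.0\n"
    "Subject: Re: meeting at 10:30\n"
    "From: Jane Doe <jane@example.com>\n"
)

# findall() returns one (name, value) tuple per matching line
found = {
    name.replace("-", "_"): value.strip()
    for name, value in HEADER_RE.findall(event)
}
print(found)
```

Because the value group stops at the end of the line rather than at a colon, a subject such as "Re: meeting at 10:30" is captured whole.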

jayannah
Builder

Based on your updated log, here are the extractions.

If you are sure that every event has the 4 fields you mentioned, you can use the single regex at the URL below:
https://regex101.com/r/cJ5vW2/1
P.S.: If any one of the 4 fields is missing from an event, this regex may not extract anything for that event.

If you are not sure that these fields exist in every event, I would suggest using an individual extraction for each field instead.

Extraction for X-Gmail-Labels : https://regex101.com/r/cJ5vW2/2
Extraction for Delivered-To : https://regex101.com/r/cJ5vW2/3
Extraction for Subject : https://regex101.com/r/cJ5vW2/4
Extraction for From : https://regex101.com/r/cJ5vW2/5
If you want to extract the name and email ID from the From field separately : https://regex101.com/r/cJ5vW2/6

These individual extractions work fine even if one of the 4 fields is missing from an event.
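For readers who cannot open the regex101 links, here is a minimal sketch of splitting a From header into a name and an address. It is not the regex behind the links above. The pattern, the group names from_name and from_email, and the sample event are all assumptions for illustration, and the pattern only covers the common "Name <address>" form.

```python
import re

# Hypothetical pattern: "From:" header with a display name and an address in <...>
FROM_RE = re.compile(
    r"^From:[ \t]*(?P<from_name>[^<\r\n]*?)[ \t]*<(?P<from_email>[^>\r\n]+)>",
    re.MULTILINE,
)

# Made-up sample event; the first line imitates an mbox separator ("From " with no colon)
event = (
    "From sender@example.com Sun Dec 28 16:11:00 2014\n"
    "Subject: Re: meeting at 10:30\n"
    "From: Jane Doe <jane@example.com>\n"
)

match = FROM_RE.search(event)
from_name = match.group("from_name") if match else None
from_email = match.group("from_email") if match else None
print(from_name, from_email)
```

Requiring the colon after "From" keeps the pattern from matching the separator line at the top of the event.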

0 Karma

himynamesdave
Contributor

For some reason these extractions capture the event from the start of the specified field to the end of the complete event (capturing everything after the field).

Strangely, when I paste this regex in the field extractor in Splunk GUI the extractions work correctly in the test mode, but fail again when extraction is saved and a search is run.

Any ideas why this might be?

0 Karma

jayannah
Builder

I think that is because your event in Splunk doesn't contain the newline character. Can you please post the values extracted for the above 4 fields from one event after the regexes are saved?

Did you use the single regex for all fields or the individual regexes?

0 Karma

himynamesdave
Contributor

I used the individual regex. Take for example "X-Gmail-Labels\s*:\s*(?P<X_Gmail_Labels>.+)"

Using the event in example 1 in the question, I get the following extraction for "X_Gmail_Labels" http://pastebin.com/mwPkakz1

However, when I run the regex in a search (sourcetype="gmail-mbox" | head 10000 | rex "X-Gmail-Labels\s*:\s*(?P<X_Gmail_Labels>.+)" | top 50 X_Gmail_Labels) all the fields are extracted as expected.
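The newline hypothesis raised earlier in the thread can be checked outside Splunk. The sketch below uses Python's re module as a stand-in and made-up sample text. It shows two conditions under which ".+" runs on to the end of the event: the text contains no newlines, or the dot is allowed to match newlines (DOTALL, the (?s) flag). Whether either applies to the saved extraction is an assumption that would still need to be verified in Splunk.

```python
import re

PATTERN = r"X-Gmail-Labels\s*:\s*(?P<X_Gmail_Labels>.+)"

# Made-up sample event with real newlines, and a copy with the newlines flattened
with_newlines = "X-Gmail-Labels: Sent,Important\nMIME-Version: 1.0\nSubject: hello"
without_newlines = with_newlines.replace("\n", " ")

# Default mode: "." does not match "\n", so the capture stops at the end of the line
value_ok = re.search(PATTERN, with_newlines).group("X_Gmail_Labels")

# No newlines in the text: ".+" has nothing to stop it and runs to the end
value_flat = re.search(PATTERN, without_newlines).group("X_Gmail_Labels")

# DOTALL: "." matches "\n" too, so the capture also runs to the end
value_dotall = re.search(PATTERN, with_newlines, re.DOTALL).group("X_Gmail_Labels")

print(repr(value_ok))
print(repr(value_flat))
print(repr(value_dotall))
```

Replacing ".+" with "[^\r\n]+" makes the stop-at-end-of-line intent explicit regardless of the dot's mode, although it cannot help if the event text really contains no line breaks.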

0 Karma

jayannah
Builder

Sorry, the URL you mentioned is blocked at my office.

Can you please try this https://regex101.com/r/cJ5vW2/9 as the saved regex and let me know?

Generally, the result should be the same whether you save the regex or use it in a search query.

0 Karma

jayannah
Builder

Check if this helps https://regex101.com/r/cJ5vW2/7

Also, I removed all the newline characters and my regex still works fine. Please see here https://regex101.com/r/cJ5vW2/8

Please let me know if you are still facing any issues.

0 Karma

jayannah
Builder

This extracts the values that come after both "Received:" and "X-Received:" into the same field. Here is the saved regex:

https://regex101.com/r/lI4kZ4/1

Please let me know if you need any changes and I will modify the regex accordingly.

Also, please answer the questions I posted as comments on your question.

0 Karma

jayannah
Builder
  1. Can you please post one complete event so we can see how it looks?
  2. Do you intend to extract the values after "Received:" and "X-Received:" into the same field name or into different field names?
  3. What is your event line break format?
0 Karma

himynamesdave
Contributor

Thanks for the help - I really appreciate it.

I have updated the question, if that helps.

0 Karma