I have events that look like this.
Example 2.
.......
I have indexed the data using a props.conf like this:
[gmail-mbox]
MAX_EVENTS = 10000
BREAK_ONLY_BEFORE = From\s.+?@
MAX_TIMESTAMP_LOOKAHEAD = 150
NO_BINARY_CHECK = 1
TRUNCATE = 10000
pulldown_type = 1
Now I am trying to extract fields from each event. I am only interested in these fields:
X-Gmail-Labels:
Delivered-To:
Subject:
From
The field name can be seen before a colon.
The field value is everything after the colon and on the same line (for the above extractions).
How can I write a regex to extract fields in this format? Note, field values may also contain colons.
Assuming the sample logs break at a new line as provided, I tried replicating a piece of the event:
|gentimes start=-1
|eval _raw = "X-Gmail-Labels: Sent,Important
MIME-Version: 1.0
Received: by 10.52.29.70 with HTTP; Sun, 28 Dec 2014 16:11:00 -0800 (PST)
X-Originating-IP: [82.13.144.221]
In-Reply-To: <01ff42fddfded95cfa8b14fa5559b0fb.squirrel@webmail04.register.com>"|extract pairdelim="\n",kvdelim=":"|table *
This extracted all the fields. pairdelim is set to break pairs at \n (newline), and kvdelim splits each key-value pair at ':'.
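To make the pairdelim/kvdelim idea concrete, here is a minimal Python sketch of the same splitting logic. It is only an illustration and not how Splunk implements the extract command; the function name extract_pairs is made up for this example. Each line is split at the first colon only, so values that themselves contain colons (such as the Received header) stay intact.

```python
def extract_pairs(raw, pairdelim="\n", kvdelim=":"):
    """Split raw text into pairs, then split each pair at the FIRST kvdelim only."""
    fields = {}
    for pair in raw.split(pairdelim):
        key, sep, value = pair.partition(kvdelim)
        if sep:  # ignore lines that contain no key/value delimiter at all
            fields[key.strip()] = value.strip()
    return fields


# Same sample headers as in the search above.
raw = (
    "X-Gmail-Labels: Sent,Important\n"
    "MIME-Version: 1.0\n"
    "Received: by 10.52.29.70 with HTTP; Sun, 28 Dec 2014 16:11:00 -0800 (PST)\n"
    "X-Originating-IP: [82.13.144.221]"
)
fields = extract_pairs(raw)
print(fields["Received"])  # the colons inside the timestamp are preserved
```

Splitting at the first delimiter only is the design choice that matters here, given that field values may contain colons.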
Hope this helps.
Thanks,
Raghav
The most efficient way to do the extraction in Splunk is to use a REPORT setting in props.conf that points to a transforms.conf entry.
props.conf
[gmail-mbox]
REPORT-extract-headers = extract-headers
transforms.conf
[extract-headers]
REGEX = ^([^:]+):([^\r\n]+)
FORMAT = $1::$2
Or, if you want to limit it to just the headers you mentioned, you can list them explicitly:
transforms.conf
[extract-headers]
REGEX = ^(X-Gmail-Labels|Delivered-To|Subject|From):([^\r\n]+)
FORMAT = $1::$2
That should grab the fields and values in one repeatable operation.
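As a rough way to try the name/value pattern outside Splunk, here is a small Python sketch. Python's re module is only a stand-in for Splunk's regex engine, the sample event is made up, and the added (?m) flag and [ \t]* after the colon are my assumptions rather than part of the config above. In Python, ^ without multiline mode matches only at the very start of the text, so it is worth checking on real events whether your Splunk setup needs (?m) as well.

```python
import re

# (?m) makes ^ match at the start of every line, not only at the start of the event.
# [ \t]* drops the blank after the colon without ever crossing a line break.
HEADER_RE = re.compile(
    r"(?m)^(X-Gmail-Labels|Delivered-To|Subject|From):[ \t]*([^\r\n]+)"
)

# Made-up event for illustration only.
event = (
    "X-Gmail-Labels: Sent,Important\n"
    "Delivered-To: someone@example.com\n"
    "Subject: Re: meeting at 10:30\n"
    "From: Alice Example <alice@example.com>\n"
)

# Each match yields (name, value), mirroring FORMAT = $1::$2.
fields = dict(HEADER_RE.findall(event))
print(fields["Subject"])

# The generic pattern without multiline mode matches the first line only.
first_only = re.findall(r"^([^:]+):([^\r\n]+)", event)
print(len(first_only))
```

Because the value group is [^\r\n]+ rather than [^:]+, colons inside a value (as in the Subject line here) do not cut the value short.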
Based on your updated log, here are the extractions.
If you are sure that every event has the four fields you mentioned, you can use the single regex at the URL below:
https://regex101.com/r/cJ5vW2/1
P.S.: If any one of the four fields is missing from an event, this regex may not extract anything for that event.
If you are not sure that these fields exist in every event, I would suggest using an individual extraction for each field:
Extraction for X-Gmail-Labels : https://regex101.com/r/cJ5vW2/2
Extraction for Delivered-To : https://regex101.com/r/cJ5vW2/3
Extraction for Subject : https://regex101.com/r/cJ5vW2/4
Extraction for From : https://regex101.com/r/cJ5vW2/5
If you want to extract the name and email ID from the From field separately: https://regex101.com/r/cJ5vW2/6
These individual extractions work fine even if one of the four fields is missing from an event.
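To illustrate the trade-off between one combined regex and individual ones, here is a short Python sketch. The patterns and group names below are hypothetical and written just for this example; they are not the patterns behind the regex101 links, and Python's re is only a stand-in for Splunk's engine.

```python
import re

# Hypothetical combined pattern: all three headers must be present, in this order.
COMBINED = re.compile(
    r"X-Gmail-Labels:[ \t]*(?P<labels>[^\r\n]+).*?"
    r"Delivered-To:[ \t]*(?P<delivered_to>[^\r\n]+).*?"
    r"Subject:[ \t]*(?P<subject>[^\r\n]+)",
    re.DOTALL,
)

# Hypothetical individual patterns: each header is looked up on its own.
INDIVIDUAL = {
    "labels": re.compile(r"(?m)^X-Gmail-Labels:[ \t]*([^\r\n]+)"),
    "delivered_to": re.compile(r"(?m)^Delivered-To:[ \t]*([^\r\n]+)"),
    "subject": re.compile(r"(?m)^Subject:[ \t]*([^\r\n]+)"),
}


def extract_individually(event):
    """Return only the fields whose header is actually present in the event."""
    found = {}
    for name, pattern in INDIVIDUAL.items():
        match = pattern.search(event)
        if match:
            found[name] = match.group(1)
    return found


# Made-up event with no Delivered-To header.
event = "X-Gmail-Labels: Sent\nSubject: hello\n"

print(COMBINED.search(event))        # the combined pattern finds nothing at all
print(extract_individually(event))   # the individual patterns still find two fields
```

The combined pattern is all-or-nothing, which is why a single missing header costs you every field for that event.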
For some reason, these extractions capture everything from the start of the specified field to the end of the whole event, not just the rest of the line.
Strangely, when I paste the regex into the field extractor in the Splunk GUI, the extractions work correctly in test mode, but fail again once the extraction is saved and a search is run.
Any ideas why this might be?
I think that is because your event in Splunk doesn't contain newline characters. Can you please post the values extracted for the four fields above from one event, after the regexes are saved?
Did you use the single regex for all fields, or the individual regexes?
I used the individual regex. Take for example "X-Gmail-Labels\s*:\s*(?P<X_Gmail_Labels>.+)"
Using the event in example 1 in the question, I get the following extraction for "X_Gmail_Labels": http://pastebin.com/mwPkakz1
However, when I run the regex in a search (sourcetype="gmail-mbox" | head 10000 | rex "X-Gmail-Labels\s*:\s*(?P<X_Gmail_Labels>.+)" | top 50 X_Gmail_Labels), all the fields are extracted as expected.
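For what it's worth, the symptom described (the value running on to the end of the event) is what a ".+" capture produces whenever "." is allowed to match newlines. The Python sketch below reproduces that with a made-up event. Whether the saved Splunk extraction actually runs in such a mode is not established in this thread, so treat it as one possible explanation only; Python's re stands in for Splunk's engine here. A capture written as [^\r\n]+ stops at the end of the line in either mode.

```python
import re

# Made-up multi-line event for illustration.
event = "X-Gmail-Labels: Sent,Important\nMIME-Version: 1.0\nSubject: hello\n"

DOT_PATTERN = r"X-Gmail-Labels\s*:\s*(?P<X_Gmail_Labels>.+)"
LINE_PATTERN = r"X-Gmail-Labels\s*:\s*(?P<X_Gmail_Labels>[^\r\n]+)"

# Default mode: "." does not match a newline, so the capture ends with the line.
default_value = re.search(DOT_PATTERN, event).group("X_Gmail_Labels")

# Dot-all mode: "." matches newlines too, so ".+" swallows the rest of the event.
dotall_value = re.search(DOT_PATTERN, event, re.DOTALL).group("X_Gmail_Labels")

# A negated character class is not affected by dot-all mode.
safe_value = re.search(LINE_PATTERN, event, re.DOTALL).group("X_Gmail_Labels")

print(repr(default_value))
print(repr(dotall_value))
print(repr(safe_value))
```

If the saved extraction misbehaves while the rex search works, swapping ".+" for "[^\r\n]+" in the saved regex is a cheap thing to try.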
Sorry, the URL you mentioned is blocked at my office.
Can you please try https://regex101.com/r/cJ5vW2/9 as the saved regex and let me know?
Generally, a regex should behave the same whether it is saved as an extraction or used in a search query.
Check if this helps: https://regex101.com/r/cJ5vW2/7
Also, I removed all the newline characters and my regex still works fine. Please see here: https://regex101.com/r/cJ5vW2/8
Please let me know if you are still facing any issues.
This extracts the values that come after both "Received:" and "X-Received:" into the same field. Here is the saved regex:
https://regex101.com/r/lI4kZ4/1
Please let me know if you need any changes, and I will modify the regex accordingly.
Also, please answer the questions I posted as comments on your question.
Thanks for the help - I really appreciate it.
I have updated the question, in case that helps.