Solved: With regex, can you help us extract the first word that comes after the timestamp?

zacksoft · ‎02-06-2019

I wanted to extract the first word that comes after the timestamp.

The timestamps come in varied formats.

example event1 :

2019-02-05 11:89:17,642 EST BROCOD bla bla bla ......

example event2 :

2019-02-05 19:35:18,642 MARC bla bla bla........

I want to extract BROCOD and MARC.

I tried the following. It should work, but I'm not sure why it is not showing me any results:

| rex "^(?:[^ \n]* ){3}(?P<level>\w+)" | table  level
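A minimal sketch of what this attempt captures, run with Python's `re` module rather than Splunk (an assumption made purely so the pattern can be checked outside a search). The pattern skips exactly three space-separated tokens, so it only lands on the right word when a timezone is present. It does not explain why Splunk showed no result at all.

```python
import re

# The attempt from the post; Python's re accepts the (?P<name>...) syntax as written.
ATTEMPT = re.compile(r"^(?:[^ \n]* ){3}(?P<level>\w+)")

events = [
    "2019-02-05 11:89:17,642 EST BROCOD bla bla bla ......",
    "2019-02-05 19:35:18,642 MARC bla bla bla........",
]


def first_word_after_three_tokens(event):
    """Return the word captured after three space-delimited tokens, or None."""
    m = ATTEMPT.search(event)
    return m.group("level") if m else None


captured = [first_word_after_three_tokens(e) for e in events]
print(captured)  # the second event has no timezone, so the capture lands one word too late
```

For the second event the three skipped tokens are the date, the time and MARC itself, which is why a fixed token count cannot handle both formats.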

horsefez · ‎02-06-2019

Hey zacksoft,

This one is a bit complicated, as you can never be sure whether there will be an abbreviated timezone or not.

https://regex101.com/r/n1RYOu/2

So I found this solution for you, which might look a bit convoluted at first, but it basically matches all the possible timezone abbreviations we have at the moment, and only if they are there.

So please give it a careful look and ask me questions about it if you have any.

Regards,
pyro_wood

Vijeta · ‎02-06-2019

I tried the below and it worked for me:

rex field=x "\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2},\d{3}\s{0,1}\w{0,3}\s(?<level>\w+)"

Example-

|makeresults| eval x="2019-02-05 11:89:17,642 EST BROCOD bla bla bla" |appendpipe[|eval x="2019-02-05 19:35:18,642 MARC bla bla bla"]| rex field=x "\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2},\d{3}\s{0,1}\w{0,3}\s(?<level>\w+)"
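As a rough check outside Splunk, the same pattern can be exercised in Python. This is only a sketch: Python's `re` needs `(?P<level>...)` instead of `(?<level>...)`, and the last two events are hypothetical inputs that are not from the thread, added to show where `\w{0,3}` might fall short.

```python
import re

# The pattern from the post, with the named group rewritten in Python syntax.
PATTERN = re.compile(
    r"\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2},\d{3}\s{0,1}\w{0,3}\s(?P<level>\w+)"
)


def extract_level(event):
    """Return the captured word, or None when the pattern does not match."""
    m = PATTERN.search(event)
    return m.group("level") if m else None


with_tz = extract_level("2019-02-05 11:89:17,642 EST BROCOD bla bla bla")
without_tz = extract_level("2019-02-05 19:35:18,642 MARC bla bla bla")

# Hypothetical events (not from the thread):
short_word = extract_level("2019-02-05 19:35:18,642 FOO bar baz")  # 3-letter first word
long_tz = extract_level("2019-02-05 19:35:18,642 AEST BROCOD bla")  # 4-letter timezone

print(with_tz, without_tz, short_word, long_tz)
```

Both sample events from the question come out right. The hypothetical ones show the trade-off: any first word of up to three letters is treated as a timezone, and a timezone longer than three letters is returned as the word.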

zacksoft · ‎02-06-2019

Thanks Vijeta.
I am wondering how to implement it.
Instead of .......|appendpipe[|eval x="2019-02-05 19: ...........
I used ...|appendpipe[|eval x=_raw ...........
so that it scans all events, but it gives many errors:

index=myIndex host=myhost sourcetype="my.source.type"  |makeresults| eval x=_raw |appendpipe[|eval x=_raw]| rex field=x "\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2},\d{3}\s{0,1}\w{0,3}\s(?<level>\w+)" | table level

Vijeta · ‎02-06-2019

@zacksoft - did you try the below?

You do not need makeresults; I only used it to create sample events. Your query can be:

index=myIndex host=myhost sourcetype="my.source.type"  |rex field=_raw "\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2},\d{3}\s{0,1}\w{0,3}\s(?<level>\w+)" | table level

zacksoft · ‎02-06-2019

Thanks @horsefez

Just to confirm, this is the regex, right? I am a bit new to this regex arena!

index=DEMOhost=anything sourcetype="something.something"
rex "^\d{4}-\d{2}-\d{2}\s*\d{2}:\d{2}:\d{2},\d+\s(?:\b(?:ACDT|ACST|ACT|ACT|ACWST|ADT|AEDT|AEST|AFT|AKDT|AKST|AMST|AMT|AMT|ART|AST|AST|AWST|AZOST|AZOT|AZT|BDT|BIOT|BIT|BOT|BRST|BRT|BST|BST|BST|BTT|CAT|CCT|CDT|CDT|CEST|CET|CHADT|CHAST|CHOT|CHOST|CHST|CHUT|CIST|CIT|CKT|CLST|CLT|COST|COT|CST|CST|CST|CT|CVT|CWST|CXT|DAVT|DDUT|DFT|EASST|EAST|EAT|ECT|ECT|EDT|EEST|EET|EGST|EGT|EIT|EST|FET|FJT|FKST|FKT|FNT|GALT|GAMT|GET|GFT|GILT|GIT|GMT|GST|GST|GYT|HDT|HAEC|HST|HKT|HMT|HOVST|HOVT|ICT|IDLW|IDT|IOT|IRDT|IRKT|IRST|IST|IST|IST|JST|KALT|KGT|KOST|KRAT|KST|LHST|LHST|LINT|MAGT|MART|MAWT|MDT|MET|MEST|MHT|MIST|MIT|MMT|MSK|MST|MST|MUT|MVT|MYT|NCT|NDT|NFT|NPT|NST|NT|NUT|NZDT|NZST|OMST|ORAT|PDT|PET|PETT|PGT|PHOT|PHT|PKT|PMDT|PMST|PONT|PST|PST|PYST|PYT|RET|ROTT|SAKT|SAMT|SAST|SBT|SCT|SDT|SGT|SLST|SRET|SRT|SST|SST|SYOT|TAHT|THA|TFT|TJT|TKT|TLT|TMT|TRT|TOT|TVT|ULAST|ULAT|UTC|UYST|UYT|UZT|VET|VLAT|VOLT|VOST|VUT|WAKT|WAST|WAT|WEST|WET|WIT|WST|YAKT|YEKT)\b\s*)?(?\w+)"
| table match

If yes, I tried this, but it yielded no results! 😞

horsefez · ‎02-06-2019

Hi @zacksoft,

try this one and tell me if it works.

index=DEMO host=anything sourcetype=something 
| rex "^\d{4}-\d{2}-\d{2}\s*\d{2}:\d{2}:\d{2},\d+\s(?:\b(?:ACDT|ACST|ACT|ACT|ACWST|ADT|AEDT|AEST|AFT|AKDT|AKST|AMST|AMT|AMT|ART|AST|AST|AWST|AZOST|AZOT|AZT|BDT|BIOT|BIT|BOT|BRST|BRT|BST|BST|BST|BTT|CAT|CCT|CDT|CDT|CEST|CET|CHADT|CHAST|CHOT|CHOST|CHST|CHUT|CIST|CIT|CKT|CLST|CLT|COST|COT|CST|CST|CST|CT|CVT|CWST|CXT|DAVT|DDUT|DFT|EASST|EAST|EAT|ECT|ECT|EDT|EEST|EET|EGST|EGT|EIT|EST|FET|FJT|FKST|FKT|FNT|GALT|GAMT|GET|GFT|GILT|GIT|GMT|GST|GST|GYT|HDT|HAEC|HST|HKT|HMT|HOVST|HOVT|ICT|IDLW|IDT|IOT|IRDT|IRKT|IRST|IST|IST|IST|JST|KALT|KGT|KOST|KRAT|KST|LHST|LHST|LINT|MAGT|MART|MAWT|MDT|MET|MEST|MHT|MIST|MIT|MMT|MSK|MST|MST|MUT|MVT|MYT|NCT|NDT|NFT|NPT|NST|NT|NUT|NZDT|NZST|OMST|ORAT|PDT|PET|PETT|PGT|PHOT|PHT|PKT|PMDT|PMST|PONT|PST|PST|PYST|PYT|RET|ROTT|SAKT|SAMT|SAST|SBT|SCT|SDT|SGT|SLST|SRET|SRT|SST|SST|SYOT|TAHT|THA|TFT|TJT|TKT|TLT|TMT|TRT|TOT|TVT|ULAST|ULAT|UTC|UYST|UYT|UZT|VET|VLAT|VOLT|VOST|VUT|WAKT|WAST|WAT|WEST|WET|WIT|WST|YAKT|YEKT)\b\s*)?(?<level>\w+)"

zacksoft · ‎02-06-2019

@pyro_wood - This is the most insane looking query. But it is awesome.. it works perfectly ......
You're a genius. Thank you very much.

horsefez · ‎02-06-2019

@zacksoft,

I agree that it looks complicated at first, and I'm glad that it works for you.

But it's not so complicated.
I will explain why it isn't as complicated as it might look.
^ is called an anchor and matches the start of the line (will always be there).
\d{4}-\d{2}-\d{2}\s*\d{2}:\d{2}:\d{2},\d+\s matches the date and time fields (will always be there).
(?:\b(?:ACDT|ACST|ACT|ACWST...|BOT|...|WST|YAKT|YEKT)\b\s*)? looks for a valid timezone abbreviation, using a list of timezone abbreviations I found on the web.
It is basically an OR-list: if it doesn't find ACDT, it checks for ACST, then ACT, and so on. The very last ? makes the entire group enclosed in parentheses optional, meaning the timezone may or may not be there.
(?<level>\w+) captures the word you want; it comes after the optional timezone, whether or not the timezone exists (will always be there).

You might have noticed the \b in the regex. \b marks a word boundary. Long story short, it makes sure that the timezone matching doesn't match the start of words such as "ACTION", "BOTTOM", "PETS", "PHOTO" or "WESTWARDS".
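The effect of the optional group and of \b can be sketched in Python. Assumptions: the timezone list is shortened to five entries for readability (the real regex uses the full list), the named group uses Python's `(?P<level>...)` syntax, and the "ACTION" event is a made-up input, not one from the logs.

```python
import re

TIMESTAMP = r"^\d{4}-\d{2}-\d{2}\s*\d{2}:\d{2}:\d{2},\d+\s"
# Shortened, illustrative timezone list; the regex in this thread uses the full one.
ZONES = r"(?:EST|ACT|BOT|PET|WEST)"

# Same shape as the regex above: optional timezone guarded by word boundaries.
with_boundary = re.compile(TIMESTAMP + r"(?:\b" + ZONES + r"\b\s*)?(?P<level>\w+)")
# Variant without \b, only to show what the boundaries protect against.
without_boundary = re.compile(TIMESTAMP + r"(?:" + ZONES + r"\s*)?(?P<level>\w+)")


def level(pattern, event):
    """Return the captured word, or None when the pattern does not match."""
    m = pattern.search(event)
    return m.group("level") if m else None


tz_event = "2019-02-05 11:89:17,642 EST BROCOD bla bla bla"
plain_event = "2019-02-05 19:35:18,642 MARC bla bla bla"
tricky_event = "2019-02-05 19:35:18,642 ACTION bla bla bla"  # hypothetical

print(level(with_boundary, tz_event))         # timezone skipped, word captured
print(level(with_boundary, plain_event))      # no timezone, word captured directly
print(level(with_boundary, tricky_event))     # "ACT" is not treated as a timezone
print(level(without_boundary, tricky_event))  # "ACT" is eaten, only the tail remains
```

Without the word boundaries, "ACT" would be consumed as a timezone and only "ION" would be left for the capture group.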

Hope this helps a bit.
Regards,
pyro_wood

zacksoft · ‎02-06-2019

Thanks for explaining each step. Now I understand.

lakshman239 · ‎02-06-2019

You can check this out - https://regex101.com/r/cQF8aS/1
You need something like

^.*,\d+\s+(?:EST)?\s?(?\w+)

zacksoft · ‎02-06-2019

Thanks Lakshman.
When I try this, it says "unrecognized character after (? or (?-".
Also, what is the name of the field where the extracted value is stored?
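The error points at the last group: `(?` followed directly by `\w` is not valid group syntax, possibly because the group name was lost when the pattern was posted. With rex, the name of the capture group becomes the name of the extracted field. A small sketch in Python, assuming the group was meant to be named; the name `level` is borrowed from the other replies and is not confirmed for this pattern.

```python
import re

events = [
    "2019-02-05 11:89:17,642 EST BROCOD bla bla bla ......",
    "2019-02-05 19:35:18,642 MARC bla bla bla........",
]

# The pattern exactly as it appears in the post: the final group has no name.
try:
    re.compile(r"^.*,\d+\s+(?:EST)?\s?(?\w+)")
    unnamed_group_error = None
except re.error as exc:
    # Python rejects it too, with its own wording of the error.
    unnamed_group_error = str(exc)

# Assumption: the group was meant to be named. "level" is a guess taken from
# the other answers; Python spells it (?P<level>...), rex uses (?<level>...).
NAMED = re.compile(r"^.*,\d+\s+(?:EST)?\s?(?P<level>\w+)")
levels = [NAMED.search(e).group("level") for e in events]

print(unnamed_group_error)
print(levels)
```

Note that this pattern only knows about EST; any other timezone abbreviation would itself be captured as the word.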
