Distcp job application_1681357021637_0984 MAPREDUCE Wed May 3 04:32:32 MST 2023 Wed May 3 04:32:40 MST 2023 SUCCEEDED default Fine edmse2
Oozie Job on Vip 0306563-230428030149477-oozie-oozi-W Shell-Action Wed May 3 04:32:09 MST 2023 Wed May 3 04:32:17 MST 2023 SUCCEEDED default nemoqee2
Spark Python Pi-job application_1681357021637_0983 SPARK Wed May 3 04:32:02 MST 2023 Wed May 3 04:32:11 MST 2023 SUCCEEDED default Fine edmse2
I need to extract fields like those in the table below. The events do not all share the same format.
Job Succeeded in Nemo-Stage-GLOBAL E2 on lpqecpdb0001556.phx.aexp.com

| Application-Name | Application-Id | Application-Type | Start-Time | Finish-Time | Final-State | Queue | Queue Utilization |
|---|---|---|---|---|---|---|---|
| PI-job | application_1678348796091_805329 | MAPREDUCE | Tue May 2 04:30:09 MST 2023 | Tue May 2 04:30:22 MST 2023 | SUCCEEDED | default | Fine |
| Spark-job | application_1678348796091_805342 | SPARK | Tue May 2 04:31:10 MST 2023 | Tue May 2 04:31:17 MST 2023 | SUCCEEDED | default | Fine |
| Spark Python Pi-job | application_1678348796091_805345 | SPARK | Tue May 2 04:31:41 MST 2023 | Tue May 2 04:31:49 MST 2023 | SUCCEEDED | default | Fine |
| Distcp job | application_1678348796091_805347 | MAPREDUCE | Tue May 2 04:32:10 MST 2023 | Tue May 2 04:32:18 MST 2023 | SUCCEEDED | default | Fine |
| Oozie Job on Vip | 1446459-230327031301376-oozie-oozi-W | Shell-Action | Tue May 2 04:32:10 MST 2023 | Tue May 2 04:32:18 MST 2023 | SUCCEEDED | default | |
As @rut hinted, you need to explicitly break down usable patterns first, because only you know how the desired fields are delimited/anchored. If you don't know, your developers will; it's much better to ask them than volunteers who have no intimate knowledge of your set of applications. @richgalloway raised an important question: do these applications even follow the same log format? If not, no amount of regexing will save the day.
To help you get started, I'll take a crack at it by comparing your sample data with your sample desired output.
Are the above about right? If so, the safest approach is to use two separate regexes to handle the two different application types. For example,
| rex "^(?<Application_name>.+) (?<Application_id>application_\d+\S+) (?<Application_type>\S+) (?<Start_time>(Sun|Mon|Tue|Wed|Thu|Fri|Sat) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) +\d+ (\d+:){2}\d+ \S+ \d+) (?<End_time>(Sun|Mon|Tue|Wed|Thu|Fri|Sat) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) +\d+ (\d+:){2}\d+ \S+ \d+) (?<Final_state>\S+) (?<Queue>\S+) (?<Queue_utilization>\S+) \S+$"
| rex "^(?<Application_name>\D+) (?<Application_id>\d+\S+) (?<Application_type>\S+) (?<Start_time>(Sun|Mon|Tue|Wed|Thu|Fri|Sat) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) +\d+ (\d+:){2}\d+ \S+ \d+) (?<End_time>(Sun|Mon|Tue|Wed|Thu|Fri|Sat) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) +\d+ (\d+:){2}\d+ \S+ \d+) (?<Final_state>\S+) (?<Queue>\S+) \S+$"
| eval Application_name = if(isnull(Application_name), "Analyze this! " . _raw, Application_name) ``` highlight oddballs ```
When you have potentially disparate log formats, be very wary and be narrow. (That is why, even though the last no-space string is to be discarded, I chose to match all the way to the end of the line and mark any unmatched event as needing attention.) The above further assumes that those "oozie" job names do not contain numerals; if that is not the case, you need some other method to anchor these elements.
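If you want to sanity-check these patterns outside Splunk, the first rex translates almost directly into Python's `re` syntax (named groups become `(?P<...>)` there). This is only an illustrative sketch, using the first sample event from the question:

```python
import re

# Python re version of the first rex above: events whose second field is
# a YARN application id of the form application_<digits>_<digits>.
DAY = r"(?:Sun|Mon|Tue|Wed|Thu|Fri|Sat)"
MONTH = r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"
TS = rf"{DAY} {MONTH} +\d+ (?:\d+:){{2}}\d+ \S+ \d+"   # e.g. "Wed May 3 04:32:32 MST 2023"

PATTERN = re.compile(
    rf"^(?P<Application_name>.+) (?P<Application_id>application_\d+\S+) "
    rf"(?P<Application_type>\S+) (?P<Start_time>{TS}) (?P<End_time>{TS}) "
    rf"(?P<Final_state>\S+) (?P<Queue>\S+) (?P<Queue_utilization>\S+) \S+$"
)

event = ("Distcp job application_1681357021637_0984 MAPREDUCE "
         "Wed May 3 04:32:32 MST 2023 Wed May 3 04:32:40 MST 2023 "
         "SUCCEEDED default Fine edmse2")

m = PATTERN.match(event)
print(m.group("Application_name"))   # Distcp job
print(m.group("Application_id"))     # application_1681357021637_0984
print(m.group("Final_state"))        # SUCCEEDED
```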
With that, your sample data gives

| Application_id | Application_name | Application_type | End_time | Final_state | Queue | Queue_utilization | Start_time |
|---|---|---|---|---|---|---|---|
| application_1681357021637_0984 | Distcp job | MAPREDUCE | Wed May 3 04:32:40 MST 2023 | SUCCEEDED | default | Fine | Wed May 3 04:32:32 MST 2023 |
| 0306563-230428030149477-oozie-oozi-W | Oozie Job on Vip | Shell-Action | Wed May 3 04:32:17 MST 2023 | SUCCEEDED | default | | Wed May 3 04:32:09 MST 2023 |
| application_1681357021637_0983 | Spark Python Pi-job | SPARK | Wed May 3 04:32:11 MST 2023 | SUCCEEDED | default | Fine | Wed May 3 04:32:02 MST 2023 |
+1 on that. If this is an in-house developed application, do put pressure on the dev team to be consistent about logging. I know some things are, and always will be, free-form text, but the common fields should be structured, even if some of them are blank in some cases. It greatly improves the handling of such logs.
The format of your example data varies a lot. Writing a pattern for those specific examples is possible, but that doesn't guarantee it will work predictably on the rest of your data.
I've tested the following pattern on the three given examples:
| rex field=_raw "(?<ApplicationName>.+)\s(?<ApplicationId>[\w-]+)\s(?<ApplicationType>[\w-]+)\s(?<StartTime>\w{3}\s\w{3}[\d:\s]+[A-Z]+\s\d{4})\s(?<EndTime>\w{3}\s\w{3}[\d:\s]+[A-Z]+\s\d{4})\s(?<FinalState>[A-Z]+)\s(?<Queue>[^\s]+)\s((?<QueueUtilization>[^\s]+)\s)?\w+$"
You can see it parsing your examples on regex101:
https://regex101.com/r/AkNmTb/1
Apart from the predictability concern, having to handle all those edge cases makes this an inefficient and relatively slow pattern.
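Purely as an illustration (not part of the Splunk search), roughly the same pattern can be exercised with Python's `re` module, which writes named groups as `(?P<...>)`. The Oozie sample event shows the optional `QueueUtilization` group being skipped when that field is absent:

```python
import re

# Approximate Python translation of the rex above; QueueUtilization is
# optional, so the Oozie event (which lacks it) still matches.
pattern = re.compile(
    r"(?P<ApplicationName>.+)\s(?P<ApplicationId>[\w-]+)\s"
    r"(?P<ApplicationType>[\w-]+)\s"
    r"(?P<StartTime>\w{3}\s\w{3}[\d:\s]+[A-Z]+\s\d{4})\s"
    r"(?P<EndTime>\w{3}\s\w{3}[\d:\s]+[A-Z]+\s\d{4})\s"
    r"(?P<FinalState>[A-Z]+)\s(?P<Queue>[^\s]+)\s"
    r"(?:(?P<QueueUtilization>[^\s]+)\s)?\w+$"
)

event = ("Oozie Job on Vip 0306563-230428030149477-oozie-oozi-W Shell-Action "
         "Wed May 3 04:32:09 MST 2023 Wed May 3 04:32:17 MST 2023 "
         "SUCCEEDED default nemoqee2")

m = pattern.search(event)
print(m.group("ApplicationId"))      # 0306563-230428030149477-oozie-oozi-W
print(m.group("QueueUtilization"))   # None (field absent in Oozie events)
```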
PI-job application_1681360813939_33163 MAPREDUCE Thu May 4 04:30:14 MST 2023 Wed Dec 31 17:00:00 MST 1969 UNDEFINED default [Thu May 04 04 Exceeded cadence2
Spark-job application_1681360813939_33167 SPARK Thu May 4 04:31:17 MST 2023 Wed Dec 31 17:00:00 MST 1969 UNDEFINED default [Thu May 04 04 Exceeded cadence2
Spark Python Pi-job application_1681360813939_33169 SPARK Thu May 4 04:31:48 MST 2023 Wed Dec 31 17:00:00 MST 1969 UNDEFINED default [Thu May 04 04 Exceeded cadence2
Distcp job application_1681360813939_33172 MAPREDUCE Thu May 4 04:32:18 MST 2023 Wed Dec 31 17:00:00 MST 1969 UNDEFINED default [Thu May 04 04 Exceeded cadence2
Oozie Job on Vip 0517949-230412214950046-oozie-oozi-W Shell-Action Thu May 4 04:32:18 MST 2023 Wed Dec 31 17:00:00 MST 1969 RUNNING default [Thu May 04 04 cadence2
PI-job application_1681360775209_1286 MAPREDUCE Thu May 4 11:30:15 UTC 2023 Thu May 4 11:30:27 UTC 2023 SUCCEEDED default Fine gcsidle2
Spark-job application_1681360775209_1288 SPARK Thu May 4 11:31:18 UTC 2023 Thu May 4 11:31:24 UTC 2023 SUCCEEDED default Fine gcsidle2
Spark Python Pi-job application_1681360775209_1289 SPARK Thu May 4 11:31:49 UTC 2023 Thu May 4 11:31:57 UTC 2023 SUCCEEDED default Fine gcsidle2
Distcp job application_1681360775209_1290 MAPREDUCE Thu May 4 11:32:19 UTC 2023 Thu May 4 11:32:27 UTC 2023 SUCCEEDED default Fine gcsidle2
Oozie Job on Vip 0002335-230419024434725-oozie-oozi-W Shell-Action Thu May 4 11:32:19 UTC 2023 Thu May 4 11:32:27 UTC 2023 SUCCEEDED default gcsidle2
If you check the field "FinalState", it only picks up "SUCCEEDED"; other events also have "UNDEFINED" and "RUNNING", and it is not picking those up.
As I predicted previously, a little defensive coding goes a long way in the face of such bad formatting: be specific rather than aggressive. The dangling partial timestamp after the queue name is the only thing throwing off my previous solution. As @PickleRick noted, there is no generic solution to bad logging; advocating for a better format is important.
The following revision handles all the variants you have posted so far. If there are any other rule breakers, the last line will catch them.
| rex "^(?<Application_name>.+) (?<Application_id>application_\d+\S+) (?<Application_type>\S+) (?<Start_time>(Sun|Mon|Tue|Wed|Thu|Fri|Sat) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) +\d+ (\d+:){2}\d+ \S+ \d+) (?<End_time>(Sun|Mon|Tue|Wed|Thu|Fri|Sat) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) +\d+ (\d+:){2}\d+ \S+ \d+) (?<Final_state>\S+) (?<Queue>\S+)(\s+\[(Sun|Mon|Tue|Wed|Thu|Fri|Sat) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)( +\d+){2}){0,1} (?<Queue_utilization>\S+) \S+$"
| rex "^(?<Application_name>\D+) (?<Application_id>\d+\S+) (?<Application_type>\S+) (?<Start_time>(Sun|Mon|Tue|Wed|Thu|Fri|Sat) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) +\d+ (\d+:){2}\d+ \S+ \d+) (?<End_time>(Sun|Mon|Tue|Wed|Thu|Fri|Sat) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) +\d+ (\d+:){2}\d+ \S+ \d+) (?<Final_state>\S+) (?<Queue>\S+)(\s+\[(Sun|Mon|Tue|Wed|Thu|Fri|Sat) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)( +\d+){2}){0,1} \S+$"
| eval Application_name = if(isnull(Application_name), "Analyze this! " . _raw, Application_name) ``` highlight oddballs ```
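Again only as an illustrative translation into Python's `re` syntax, the amendment is the optional group after `Queue`, which absorbs a dangling `[Thu May 04 04`-style fragment when one is present:

```python
import re

# Sketch of the amended first pattern in Python: FRAG optionally swallows
# the dangling partial timestamp that appears after the queue name.
DAY = r"(?:Sun|Mon|Tue|Wed|Thu|Fri|Sat)"
MONTH = r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"
TS = rf"{DAY} {MONTH} +\d+ (?:\d+:){{2}}\d+ \S+ \d+"
FRAG = rf"(?:\s+\[{DAY} {MONTH}(?: +\d+){{2}})?"   # e.g. " [Thu May 04 04"

PATTERN = re.compile(
    rf"^(?P<Application_name>.+) (?P<Application_id>application_\d+\S+) "
    rf"(?P<Application_type>\S+) (?P<Start_time>{TS}) (?P<End_time>{TS}) "
    rf"(?P<Final_state>\S+) (?P<Queue>\S+){FRAG} (?P<Queue_utilization>\S+) \S+$"
)

event = ("PI-job application_1681360813939_33163 MAPREDUCE "
         "Thu May 4 04:30:14 MST 2023 Wed Dec 31 17:00:00 MST 1969 "
         "UNDEFINED default [Thu May 04 04 Exceeded cadence2")

m = PATTERN.match(event)
print(m.group("Final_state"))        # UNDEFINED
print(m.group("Queue_utilization"))  # Exceeded
```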
Your samples yield the following:
| Application_id | Application_name | Application_type | End_time | Final_state | Queue | Queue_utilization | Start_time |
|---|---|---|---|---|---|---|---|
| application_1681360813939_33163 | PI-job | MAPREDUCE | Wed Dec 31 17:00:00 MST 1969 | UNDEFINED | default | Exceeded | Thu May 4 04:30:14 MST 2023 |
| application_1681360813939_33167 | Spark-job | SPARK | Wed Dec 31 17:00:00 MST 1969 | UNDEFINED | default | Exceeded | Thu May 4 04:31:17 MST 2023 |
| application_1681360813939_33169 | Spark Python Pi-job | SPARK | Wed Dec 31 17:00:00 MST 1969 | UNDEFINED | default | Exceeded | Thu May 4 04:31:48 MST 2023 |
| application_1681360813939_33172 | Distcp job | MAPREDUCE | Wed Dec 31 17:00:00 MST 1969 | UNDEFINED | default | Exceeded | Thu May 4 04:32:18 MST 2023 |
| 0517949-230412214950046-oozie-oozi-W | Oozie Job on Vip | Shell-Action | Wed Dec 31 17:00:00 MST 1969 | RUNNING | default | | Thu May 4 04:32:18 MST 2023 |
| application_1681360775209_1286 | PI-job | MAPREDUCE | Thu May 4 11:30:27 UTC 2023 | SUCCEEDED | default | Fine | Thu May 4 11:30:15 UTC 2023 |
| application_1681360775209_1288 | Spark-job | SPARK | Thu May 4 11:31:24 UTC 2023 | SUCCEEDED | default | Fine | Thu May 4 11:31:18 UTC 2023 |
| application_1681360775209_1289 | Spark Python Pi-job | SPARK | Thu May 4 11:31:57 UTC 2023 | SUCCEEDED | default | Fine | Thu May 4 11:31:49 UTC 2023 |
| application_1681360775209_1290 | Distcp job | MAPREDUCE | Thu May 4 11:32:27 UTC 2023 | SUCCEEDED | default | Fine | Thu May 4 11:32:19 UTC 2023 |
| 0002335-230419024434725-oozie-oozi-W | Oozie Job on Vip | Shell-Action | Thu May 4 11:32:27 UTC 2023 | SUCCEEDED | default | | Thu May 4 11:32:19 UTC 2023 |
What have you tried so far? How did those efforts not fulfill your requirements?
Please review the sample events and output, as they appear to be unrelated: the table contains timestamps and application IDs that are not in the events.