Distcp job application_1681357021637_0984 MAPREDUCE Wed May 3 04:32:32 MST 2023 Wed May 3 04:32:40 MST 2023 SUCCEEDED default Fine edmse2
Oozie Job on Vip 0306563-230428030149477-oozie-oozi-W Shell-Action Wed May 3 04:32:09 MST 2023 Wed May 3 04:32:17 MST 2023 SUCCEEDED default nemoqee2
Spark Python Pi-job application_1681357021637_0983 SPARK Wed May 3 04:32:02 MST 2023 Wed May 3 04:32:11 MST 2023 SUCCEEDED default Fine edmse2
I need to extract fields like those in the table below. The events do not all share the same format.
Job Succeeded in Nemo-Stage-GLOBAL E2 on lpqecpdb0001556.phx.aexp.com

| Application-Name | Application-Id | Application-Type | Start-Time | Finish-Time | Final-State | Queue | Queue Utilization |
|---|---|---|---|---|---|---|---|
| PI-job | application_1678348796091_805329 | MAPREDUCE | Tue May 2 04:30:09 MST 2023 | Tue May 2 04:30:22 MST 2023 | SUCCEEDED | default | Fine |
| Spark-job | application_1678348796091_805342 | SPARK | Tue May 2 04:31:10 MST 2023 | Tue May 2 04:31:17 MST 2023 | SUCCEEDED | default | Fine |
| Spark Python Pi-job | application_1678348796091_805345 | SPARK | Tue May 2 04:31:41 MST 2023 | Tue May 2 04:31:49 MST 2023 | SUCCEEDED | default | Fine |
| Distcp job | application_1678348796091_805347 | MAPREDUCE | Tue May 2 04:32:10 MST 2023 | Tue May 2 04:32:18 MST 2023 | SUCCEEDED | default | Fine |
| Oozie Job on Vip | 1446459-230327031301376-oozie-oozi-W | Shell-Action | Tue May 2 04:32:10 MST 2023 | Tue May 2 04:32:18 MST 2023 | SUCCEEDED | default | |
As @rut hinted, you need to explicitly break down usable patterns first, because only you know how the desired fields are delimited/anchored. If you don't know, your developers will; it's much better to ask them than volunteers who have no intimate knowledge of your set of applications. @richgalloway raised an important question: do these applications even follow the same log format? If not, no amount of regexing will save the day.
To help you get started, I'll take a crack at it by comparing your sample data with your sample desired output.
Are the above about right? If so, the safest approach is to use two separate regexes to handle the two different application types. For example,
| rex "^(?<Application_name>.+) (?<Application_id>application_\d+\S+) (?<Application_type>\S+) (?<Start_time>(Sun|Mon|Tue|Wed|Thu|Fri|Sat) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) +\d+ (\d+:){2}\d+ \S+ \d+) (?<End_time>(Sun|Mon|Tue|Wed|Thu|Fri|Sat) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) +\d+ (\d+:){2}\d+ \S+ \d+) (?<Final_state>\S+) (?<Queue>\S+) (?<Queue_utilization>\S+) \S+$"
| rex "^(?<Application_name>\D+) (?<Application_id>\d+\S+) (?<Application_type>\S+) (?<Start_time>(Sun|Mon|Tue|Wed|Thu|Fri|Sat) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) +\d+ (\d+:){2}\d+ \S+ \d+) (?<End_time>(Sun|Mon|Tue|Wed|Thu|Fri|Sat) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) +\d+ (\d+:){2}\d+ \S+ \d+) (?<Final_state>\S+) (?<Queue>\S+) \S+$"
| eval Application_name = if(isnull(Application_name), "Analyze this! " . _raw, Application_name) ``` highlight oddballs ```
When you have potentially disparate log formats, be very wary and be narrow. (That is why, even though the last no-space string is to be discarded, I chose to match all the way to the end of the line and mark any unmatched event as needing attention.) The above further assumes that those "oozie" job names do not contain numerals; if that is not the case, you need some other method to anchor these elements.
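If you want to sanity-check these patterns outside Splunk, the first rex translates almost directly into Python's `re` syntax (named groups become `(?P<...>)` there). This is only an illustrative sketch, using the first sample event from the question:

```python
import re

# Python re version of the first rex above: events whose second field is
# a YARN application id of the form application_<digits>_<digits>.
DAY = r"(?:Sun|Mon|Tue|Wed|Thu|Fri|Sat)"
MONTH = r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"
TS = rf"{DAY} {MONTH} +\d+ (?:\d+:){{2}}\d+ \S+ \d+"   # e.g. "Wed May 3 04:32:32 MST 2023"

PATTERN = re.compile(
    rf"^(?P<Application_name>.+) (?P<Application_id>application_\d+\S+) "
    rf"(?P<Application_type>\S+) (?P<Start_time>{TS}) (?P<End_time>{TS}) "
    rf"(?P<Final_state>\S+) (?P<Queue>\S+) (?P<Queue_utilization>\S+) \S+$"
)

event = ("Distcp job application_1681357021637_0984 MAPREDUCE "
         "Wed May 3 04:32:32 MST 2023 Wed May 3 04:32:40 MST 2023 "
         "SUCCEEDED default Fine edmse2")

m = PATTERN.match(event)
print(m.group("Application_name"))   # Distcp job
print(m.group("Application_id"))     # application_1681357021637_0984
print(m.group("Final_state"))        # SUCCEEDED
```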
With that, your sample data gives

| Application_id | Application_name | Application_type | End_time | Final_state | Queue | Queue_utilization | Start_time |
|---|---|---|---|---|---|---|---|
| application_1681357021637_0984 | Distcp job | MAPREDUCE | Wed May 3 04:32:40 MST 2023 | SUCCEEDED | default | Fine | Wed May 3 04:32:32 MST 2023 |
| 0306563-230428030149477-oozie-oozi-W | Oozie Job on Vip | Shell-Action | Wed May 3 04:32:17 MST 2023 | SUCCEEDED | default | | Wed May 3 04:32:09 MST 2023 |
| application_1681357021637_0983 | Spark Python Pi-job | SPARK | Wed May 3 04:32:11 MST 2023 | SUCCEEDED | default | Fine | Wed May 3 04:32:02 MST 2023 |
+1 on that. If this is an in-house developed application, do put pressure on the dev team to be consistent about logging. I know some things are, and always will be, free-form text, but the common fields should be structured, even if some of them are blank in some cases. It greatly improves the handling of such logs.
The format of your example data varies a lot. Writing a pattern for those specific examples is possible, but that doesn't guarantee it will work predictably on the rest of your data.
I've tested the following pattern on the three given examples:
| rex field=_raw "(?<ApplicationName>.+)\s(?<ApplicationId>[\w-]+)\s(?<ApplicationType>[\w-]+)\s(?<StartTime>\w{3}\s\w{3}[\d:\s]+[A-Z]+\s\d{4})\s(?<EndTime>\w{3}\s\w{3}[\d:\s]+[A-Z]+\s\d{4})\s(?<FinalState>[A-Z]+)\s(?<Queue>[^\s]+)\s((?<QueueUtilization>[^\s]+)\s)?\w+$"
You can see it parsing your examples on regex101:
https://regex101.com/r/AkNmTb/1
Apart from the predictability concern, having to handle all those edge cases makes this an inefficient and relatively slow pattern.
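Purely as an illustration (not part of the Splunk search), roughly the same pattern can be exercised with Python's `re` module, which writes named groups as `(?P<...>)`. The Oozie sample event shows the optional `QueueUtilization` group being skipped when that field is absent:

```python
import re

# Approximate Python translation of the rex above; QueueUtilization is
# optional, so the Oozie event (which lacks it) still matches.
pattern = re.compile(
    r"(?P<ApplicationName>.+)\s(?P<ApplicationId>[\w-]+)\s"
    r"(?P<ApplicationType>[\w-]+)\s"
    r"(?P<StartTime>\w{3}\s\w{3}[\d:\s]+[A-Z]+\s\d{4})\s"
    r"(?P<EndTime>\w{3}\s\w{3}[\d:\s]+[A-Z]+\s\d{4})\s"
    r"(?P<FinalState>[A-Z]+)\s(?P<Queue>[^\s]+)\s"
    r"(?:(?P<QueueUtilization>[^\s]+)\s)?\w+$"
)

event = ("Oozie Job on Vip 0306563-230428030149477-oozie-oozi-W Shell-Action "
         "Wed May 3 04:32:09 MST 2023 Wed May 3 04:32:17 MST 2023 "
         "SUCCEEDED default nemoqee2")

m = pattern.search(event)
print(m.group("ApplicationId"))      # 0306563-230428030149477-oozie-oozi-W
print(m.group("QueueUtilization"))   # None (field absent in Oozie events)
```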
PI-job application_1681360813939_33163 MAPREDUCE Thu May 4 04:30:14 MST 2023 Wed Dec 31 17:00:00 MST 1969 UNDEFINED default [Thu May 04 04 Exceeded cadence2
Spark-job application_1681360813939_33167 SPARK Thu May 4 04:31:17 MST 2023 Wed Dec 31 17:00:00 MST 1969 UNDEFINED default [Thu May 04 04 Exceeded cadence2
Spark Python Pi-job application_1681360813939_33169 SPARK Thu May 4 04:31:48 MST 2023 Wed Dec 31 17:00:00 MST 1969 UNDEFINED default [Thu May 04 04 Exceeded cadence2
Distcp job application_1681360813939_33172 MAPREDUCE Thu May 4 04:32:18 MST 2023 Wed Dec 31 17:00:00 MST 1969 UNDEFINED default [Thu May 04 04 Exceeded cadence2
Oozie Job on Vip 0517949-230412214950046-oozie-oozi-W Shell-Action Thu May 4 04:32:18 MST 2023 Wed Dec 31 17:00:00 MST 1969 RUNNING default [Thu May 04 04 cadence2
PI-job application_1681360775209_1286 MAPREDUCE Thu May 4 11:30:15 UTC 2023 Thu May 4 11:30:27 UTC 2023 SUCCEEDED default Fine gcsidle2
Spark-job application_1681360775209_1288 SPARK Thu May 4 11:31:18 UTC 2023 Thu May 4 11:31:24 UTC 2023 SUCCEEDED default Fine gcsidle2
Spark Python Pi-job application_1681360775209_1289 SPARK Thu May 4 11:31:49 UTC 2023 Thu May 4 11:31:57 UTC 2023 SUCCEEDED default Fine gcsidle2
Distcp job application_1681360775209_1290 MAPREDUCE Thu May 4 11:32:19 UTC 2023 Thu May 4 11:32:27 UTC 2023 SUCCEEDED default Fine gcsidle2
Oozie Job on Vip 0002335-230419024434725-oozie-oozi-W Shell-Action Thu May 4 11:32:19 UTC 2023 Thu May 4 11:32:27 UTC 2023 SUCCEEDED default gcsidle2
If you check the field "FinalState", it only picks up "SUCCEEDED"; other events also have "UNDEFINED" and "RUNNING", and it is not picking those up.
As I predicted previously, a little defensive coding goes a long way in the face of such bad formatting: be specific rather than aggressive. The dangling partial timestamp after the queue name is the only thing throwing off my previous solution. As @PickleRick noted, there is no generic solution to bad logging; advocating for a better format is important.
The following revision handles all the variants you have posted so far. If there are any other rule breakers, the last line will catch them.
| rex "^(?<Application_name>.+) (?<Application_id>application_\d+\S+) (?<Application_type>\S+) (?<Start_time>(Sun|Mon|Tue|Wed|Thu|Fri|Sat) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) +\d+ (\d+:){2}\d+ \S+ \d+) (?<End_time>(Sun|Mon|Tue|Wed|Thu|Fri|Sat) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) +\d+ (\d+:){2}\d+ \S+ \d+) (?<Final_state>\S+) (?<Queue>\S+)(\s+\[(Sun|Mon|Tue|Wed|Thu|Fri|Sat) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)( +\d+){2}){0,1} (?<Queue_utilization>\S+) \S+$"
| rex "^(?<Application_name>\D+) (?<Application_id>\d+\S+) (?<Application_type>\S+) (?<Start_time>(Sun|Mon|Tue|Wed|Thu|Fri|Sat) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) +\d+ (\d+:){2}\d+ \S+ \d+) (?<End_time>(Sun|Mon|Tue|Wed|Thu|Fri|Sat) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) +\d+ (\d+:){2}\d+ \S+ \d+) (?<Final_state>\S+) (?<Queue>\S+)(\s+\[(Sun|Mon|Tue|Wed|Thu|Fri|Sat) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)( +\d+){2}){0,1} \S+$"
| eval Application_name = if(isnull(Application_name), "Analyze this! " . _raw, Application_name) ``` highlight oddballs ```
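Again only as an illustrative translation into Python's `re` syntax, the amendment is the optional group after `Queue`, which absorbs a dangling `[Thu May 04 04`-style fragment when one is present:

```python
import re

# Sketch of the amended first pattern in Python: FRAG optionally swallows
# the dangling partial timestamp that appears after the queue name.
DAY = r"(?:Sun|Mon|Tue|Wed|Thu|Fri|Sat)"
MONTH = r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"
TS = rf"{DAY} {MONTH} +\d+ (?:\d+:){{2}}\d+ \S+ \d+"
FRAG = rf"(?:\s+\[{DAY} {MONTH}(?: +\d+){{2}})?"   # e.g. " [Thu May 04 04"

PATTERN = re.compile(
    rf"^(?P<Application_name>.+) (?P<Application_id>application_\d+\S+) "
    rf"(?P<Application_type>\S+) (?P<Start_time>{TS}) (?P<End_time>{TS}) "
    rf"(?P<Final_state>\S+) (?P<Queue>\S+){FRAG} (?P<Queue_utilization>\S+) \S+$"
)

event = ("PI-job application_1681360813939_33163 MAPREDUCE "
         "Thu May 4 04:30:14 MST 2023 Wed Dec 31 17:00:00 MST 1969 "
         "UNDEFINED default [Thu May 04 04 Exceeded cadence2")

m = PATTERN.match(event)
print(m.group("Final_state"))        # UNDEFINED
print(m.group("Queue_utilization"))  # Exceeded
```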
Your samples yield the following:
| Application_id | Application_name | Application_type | End_time | Final_state | Queue | Queue_utilization | Start_time |
|---|---|---|---|---|---|---|---|
| application_1681360813939_33163 | PI-job | MAPREDUCE | Wed Dec 31 17:00:00 MST 1969 | UNDEFINED | default | Exceeded | Thu May 4 04:30:14 MST 2023 |
| application_1681360813939_33167 | Spark-job | SPARK | Wed Dec 31 17:00:00 MST 1969 | UNDEFINED | default | Exceeded | Thu May 4 04:31:17 MST 2023 |
| application_1681360813939_33169 | Spark Python Pi-job | SPARK | Wed Dec 31 17:00:00 MST 1969 | UNDEFINED | default | Exceeded | Thu May 4 04:31:48 MST 2023 |
| application_1681360813939_33172 | Distcp job | MAPREDUCE | Wed Dec 31 17:00:00 MST 1969 | UNDEFINED | default | Exceeded | Thu May 4 04:32:18 MST 2023 |
| 0517949-230412214950046-oozie-oozi-W | Oozie Job on Vip | Shell-Action | Wed Dec 31 17:00:00 MST 1969 | RUNNING | default | | Thu May 4 04:32:18 MST 2023 |
| application_1681360775209_1286 | PI-job | MAPREDUCE | Thu May 4 11:30:27 UTC 2023 | SUCCEEDED | default | Fine | Thu May 4 11:30:15 UTC 2023 |
| application_1681360775209_1288 | Spark-job | SPARK | Thu May 4 11:31:24 UTC 2023 | SUCCEEDED | default | Fine | Thu May 4 11:31:18 UTC 2023 |
| application_1681360775209_1289 | Spark Python Pi-job | SPARK | Thu May 4 11:31:57 UTC 2023 | SUCCEEDED | default | Fine | Thu May 4 11:31:49 UTC 2023 |
| application_1681360775209_1290 | Distcp job | MAPREDUCE | Thu May 4 11:32:27 UTC 2023 | SUCCEEDED | default | Fine | Thu May 4 11:32:19 UTC 2023 |
| 0002335-230419024434725-oozie-oozi-W | Oozie Job on Vip | Shell-Action | Thu May 4 11:32:27 UTC 2023 | SUCCEEDED | default | | Thu May 4 11:32:19 UTC 2023 |
What have you tried so far? How did those efforts not fulfill your requirements?
Please review the sample events and output, as they appear to be unrelated: the table contains timestamps and application IDs that are not in the events.