Splunk Search

How to extract fields like the table below?

bmanikya
Loves-to-Learn Everything

Distcp job application_1681357021637_0984 MAPREDUCE Wed May 3 04:32:32 MST 2023 Wed May 3 04:32:40 MST 2023 SUCCEEDED default Fine edmse2

Oozie Job on Vip 0306563-230428030149477-oozie-oozi-W Shell-Action Wed May 3 04:32:09 MST 2023 Wed May 3 04:32:17 MST 2023 SUCCEEDED default nemoqee2

Spark Python Pi-job application_1681357021637_0983 SPARK Wed May 3 04:32:02 MST 2023 Wed May 3 04:32:11 MST 2023 SUCCEEDED default Fine edmse2

 

I need to extract fields like those in the table below; note that the events do not all share the same format.

 

Job Succeeded in Nemo-Stage-GLOBAL E2 on lpqecpdb0001556.phx.aexp.com

Application-Name | Application-Id | Application-Type | Start-Time | Finish-Time | Final-State | Queue | Queue Utilization
PI-job | application_1678348796091_805329 | MAPREDUCE | Tue May 2 04:30:09 MST 2023 | Tue May 2 04:30:22 MST 2023 | SUCCEEDED | default | Fine
Spark-job | application_1678348796091_805342 | SPARK | Tue May 2 04:31:10 MST 2023 | Tue May 2 04:31:17 MST 2023 | SUCCEEDED | default | Fine
Spark Python Pi-job | application_1678348796091_805345 | SPARK | Tue May 2 04:31:41 MST 2023 | Tue May 2 04:31:49 MST 2023 | SUCCEEDED | default | Fine
Distcp job | application_1678348796091_805347 | MAPREDUCE | Tue May 2 04:32:10 MST 2023 | Tue May 2 04:32:18 MST 2023 | SUCCEEDED | default | Fine
Oozie Job on Vip | 1446459-230327031301376-oozie-oozi-W | Shell-Action | Tue May 2 04:32:10 MST 2023 | Tue May 2 04:32:18 MST 2023 | SUCCEEDED | default |

 

yuanliu
SplunkTrust

As @rut hinted, you need to explicitly break down usable patterns first, because only you know how those desired fields are delimited/anchored.  If you don't know, your developers will; they are in a much better position than volunteers who have no intimate knowledge of your set of applications.  @richgalloway raised an important question: do these applications even follow the same log format?  If not, no amount of regexing will save the day.

To help you get started, I'll take a crack at it by comparing your sample data with your sample desired output.

  1. Application ID in most (Hadoop-based?) apps has a prefix "application_" followed by numerals and underscores.
  2. The above breaks with that Oozie job; there, the application ID begins with a numeral followed by a no-space string.
  3. Application name is whatever comes before the application ID.
  4. After the application ID come two horrible, terrible, very bad, no good, machine-unfriendly timestamps dreadfully conjoined. (They aren't human-friendly, either.)
  5. Final state is a no-space string after the two timestamps.
  6. Queue name is another no-space string following final state.
  7. In most (Hadoop-based?) applications, after the queue name there is a no-space string representing queue utilization, followed by yet another no-space string that is to be discarded.
  8. A single space separates fields.
  9. The above breaks with that Oozie job: whatever that final no-space string is, it is discarded.

Are the above about right?  If so, the safest approach would be to use two separate regexes to handle the two different application types.  For example,

 

| rex "^(?<Application_name>.+) (?<Application_id>application_\d+\S+) (?<Application_type>\S+) (?<Start_time>(Sun|Mon|Tue|Wed|Thu|Fri|Sat) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) +\d+ (\d+:){2}\d+ \S+ \d+) (?<End_time>(Sun|Mon|Tue|Wed|Thu|Fri|Sat) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) +\d+ (\d+:){2}\d+ \S+ \d+) (?<Final_state>\S+) (?<Queue>\S+) (?<Queue_utilization>\S+) \S+$"
| rex "^(?<Application_name>\D+) (?<Application_id>\d+\S+) (?<Application_type>\S+) (?<Start_time>(Sun|Mon|Tue|Wed|Thu|Fri|Sat) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) +\d+ (\d+:){2}\d+ \S+ \d+) (?<End_time>(Sun|Mon|Tue|Wed|Thu|Fri|Sat) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) +\d+ (\d+:){2}\d+ \S+ \d+) (?<Final_state>\S+) (?<Queue>\S+) \S+$"
| eval Application_name = if(isnull(Application_name), "Analyze this! " . _raw, Application_name) ``` highlight oddballs ```
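
To lay the extracted fields out like your desired table, a table command can follow the extraction; a minimal sketch:

| table Application_name Application_id Application_type Start_time End_time Final_state Queue Queue_utilization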

 

When you have potentially disparate log formats, be very afraid and be narrow. (That is why, even though the last no-space string is to be discarded, I chose to match all the way to the end of the line and mark any unmatched event as needing attention.)  The above further assumes that those "oozie" job names do not contain numerals.  If that is not the case, you will need some other method to anchor these elements.
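
Also, if you later need those timestamps as numbers for sorting or duration math, eval's strptime can parse this layout; a sketch, assuming the format in your samples (non-zero-padded days usually still parse with %d, and %Z handling of abbreviations like MST can be unreliable, so verify against your data):

| eval start_epoch = strptime(Start_time, "%a %b %d %H:%M:%S %Z %Y")
| eval end_epoch = strptime(End_time, "%a %b %d %H:%M:%S %Z %Y")
| eval duration_sec = end_epoch - start_epoch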

With that, your sample data will give

Application_id | Application_name | Application_type | End_time | Final_state | Queue | Queue_utilization | Start_time
application_1681357021637_0984 | Distcp job | MAPREDUCE | Wed May 3 04:32:40 MST 2023 | SUCCEEDED | default | Fine | Wed May 3 04:32:32 MST 2023
0306563-230428030149477-oozie-oozi-W | Oozie Job on Vip | Shell-Action | Wed May 3 04:32:17 MST 2023 | SUCCEEDED | default | | Wed May 3 04:32:09 MST 2023
application_1681357021637_0983 | Spark Python Pi-job | SPARK | Wed May 3 04:32:11 MST 2023 | SUCCEEDED | default | Fine | Wed May 3 04:32:02 MST 2023

PickleRick
SplunkTrust

+1 on that. If this is your in-house developed application, do put pressure on the dev team to be consistent about logging. I know there are some things that are, and always will be, free-form text, but the common fields should be structured, even if some of them are blank in some cases. It greatly improves the handling of such logs.
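
For example, if the application emitted key="value" pairs, Splunk's automatic search-time key-value extraction would pick the fields up with no rex at all; a hypothetical restructured event (field names invented for illustration):

app_name="Distcp job" app_id="application_1681357021637_0984" app_type="MAPREDUCE" start_time="2023-05-03T04:32:32-07:00" end_time="2023-05-03T04:32:40-07:00" final_state="SUCCEEDED" queue="default" queue_utilization="Fine"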


rut
Path Finder

The format of your data examples varies a lot. Writing a pattern for those specific examples is possible, but that doesn't guarantee it will work predictably for the rest of your data.

I've tested the following pattern on the three given examples:

 

| rex field=_raw "(?<ApplicationName>.+)\s(?<ApplicationId>[\w-]+)\s(?<ApplicationType>[\w-]+)\s(?<StartTime>\w{3}\s\w{3}[\d:\s]+[A-Z]+\s\d{4})\s(?<EndTime>\w{3}\s\w{3}[\d:\s]+[A-Z]+\s\d{4})\s(?<FinalState>[A-Z]+)\s(?<Queue>[^\s]+)\s((?<QueueUtilization>[^\s]+)\s)?\w+$"

 

 You can see it parsing your examples on regex101:

https://regex101.com/r/AkNmTb/1

Apart from the predictability concern, having to implement all those edge cases makes it an inefficient and relatively slow pattern.


bmanikya
Loves-to-Learn Everything
PI-job application_1681360813939_33163 MAPREDUCE Thu May 4 04:30:14 MST 2023 Wed Dec 31 17:00:00 MST 1969 UNDEFINED default [Thu May 04 04 Exceeded cadence2
Spark-job application_1681360813939_33167 SPARK Thu May 4 04:31:17 MST 2023 Wed Dec 31 17:00:00 MST 1969 UNDEFINED default [Thu May 04 04 Exceeded cadence2
Spark Python Pi-job application_1681360813939_33169 SPARK Thu May 4 04:31:48 MST 2023 Wed Dec 31 17:00:00 MST 1969 UNDEFINED default [Thu May 04 04 Exceeded cadence2
Distcp job application_1681360813939_33172 MAPREDUCE Thu May 4 04:32:18 MST 2023 Wed Dec 31 17:00:00 MST 1969 UNDEFINED default [Thu May 04 04 Exceeded cadence2
Oozie Job on Vip 0517949-230412214950046-oozie-oozi-W Shell-Action Thu May 4 04:32:18 MST 2023 Wed Dec 31 17:00:00 MST 1969 RUNNING default [Thu May 04 04 cadence2
PI-job application_1681360775209_1286 MAPREDUCE Thu May 4 11:30:15 UTC 2023 Thu May 4 11:30:27 UTC 2023 SUCCEEDED default Fine gcsidle2
Spark-job application_1681360775209_1288 SPARK Thu May 4 11:31:18 UTC 2023 Thu May 4 11:31:24 UTC 2023 SUCCEEDED default Fine gcsidle2
Spark Python Pi-job application_1681360775209_1289 SPARK Thu May 4 11:31:49 UTC 2023 Thu May 4 11:31:57 UTC 2023 SUCCEEDED default Fine gcsidle2
Distcp job application_1681360775209_1290 MAPREDUCE Thu May 4 11:32:19 UTC 2023 Thu May 4 11:32:27 UTC 2023 SUCCEEDED default Fine gcsidle2
Oozie Job on Vip 0002335-230419024434725-oozie-oozi-W Shell-Action Thu May 4 11:32:19 UTC 2023 Thu May 4 11:32:27 UTC 2023 SUCCEEDED default gcsidle2

 

If you check the field "FinalState", it only picks up "SUCCEEDED"; other events have UNDEFINED and RUNNING, and those are not picked up.


yuanliu
SplunkTrust

As I predicted previously, a little defensive coding goes a long way in the face of such bad formatting.  Be specific rather than aggressive.  The dangling partial timestamp after the queue name is the only thing throwing off my previous solution.  As @PickleRick noted, there is no generic solution for bad logging; advocating for a better format is important.

The following addition handles all the variants you have posted so far.  If there are any other rule breakers, the last line will catch them.

 

| rex "^(?<Application_name>.+) (?<Application_id>application_\d+\S+) (?<Application_type>\S+) (?<Start_time>(Sun|Mon|Tue|Wed|Thu|Fri|Sat) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) +\d+ (\d+:){2}\d+ \S+ \d+) (?<End_time>(Sun|Mon|Tue|Wed|Thu|Fri|Sat) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) +\d+ (\d+:){2}\d+ \S+ \d+) (?<Final_state>\S+) (?<Queue>\S+)(\s+\[(Sun|Mon|Tue|Wed|Thu|Fri|Sat) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)( +\d+){2}){0,1} (?<Queue_utilization>\S+) \S+$"
| rex "^(?<Application_name>\D+) (?<Application_id>\d+\S+) (?<Application_type>\S+) (?<Start_time>(Sun|Mon|Tue|Wed|Thu|Fri|Sat) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) +\d+ (\d+:){2}\d+ \S+ \d+) (?<End_time>(Sun|Mon|Tue|Wed|Thu|Fri|Sat) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) +\d+ (\d+:){2}\d+ \S+ \d+) (?<Final_state>\S+) (?<Queue>\S+)(\s+\[(Sun|Mon|Tue|Wed|Thu|Fri|Sat) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)( +\d+){2}){0,1} \S+$"
| eval Application_name = if(isnull(Application_name), "Analyze this! " . _raw, Application_name) ``` highlight oddballs ```

 

 Your samples yield the following:

Application_id | Application_name | Application_type | End_time | Final_state | Queue | Queue_utilization | Start_time
application_1681360813939_33163 | PI-job | MAPREDUCE | Wed Dec 31 17:00:00 MST 1969 | UNDEFINED | default | Exceeded | Thu May 4 04:30:14 MST 2023
application_1681360813939_33167 | Spark-job | SPARK | Wed Dec 31 17:00:00 MST 1969 | UNDEFINED | default | Exceeded | Thu May 4 04:31:17 MST 2023
application_1681360813939_33169 | Spark Python Pi-job | SPARK | Wed Dec 31 17:00:00 MST 1969 | UNDEFINED | default | Exceeded | Thu May 4 04:31:48 MST 2023
application_1681360813939_33172 | Distcp job | MAPREDUCE | Wed Dec 31 17:00:00 MST 1969 | UNDEFINED | default | Exceeded | Thu May 4 04:32:18 MST 2023
0517949-230412214950046-oozie-oozi-W | Oozie Job on Vip | Shell-Action | Wed Dec 31 17:00:00 MST 1969 | RUNNING | default | | Thu May 4 04:32:18 MST 2023
application_1681360775209_1286 | PI-job | MAPREDUCE | Thu May 4 11:30:27 UTC 2023 | SUCCEEDED | default | Fine | Thu May 4 11:30:15 UTC 2023
application_1681360775209_1288 | Spark-job | SPARK | Thu May 4 11:31:24 UTC 2023 | SUCCEEDED | default | Fine | Thu May 4 11:31:18 UTC 2023
application_1681360775209_1289 | Spark Python Pi-job | SPARK | Thu May 4 11:31:57 UTC 2023 | SUCCEEDED | default | Fine | Thu May 4 11:31:49 UTC 2023
application_1681360775209_1290 | Distcp job | MAPREDUCE | Thu May 4 11:32:27 UTC 2023 | SUCCEEDED | default | Fine | Thu May 4 11:32:19 UTC 2023
0002335-230419024434725-oozie-oozi-W | Oozie Job on Vip | Shell-Action | Thu May 4 11:32:27 UTC 2023 | SUCCEEDED | default | | Thu May 4 11:32:19 UTC 2023
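
If you want to audit how many events still defeat both patterns, you can count the flagged rows; a minimal follow-on sketch:

| where like(Application_name, "Analyze this!%")
| stats count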

richgalloway
SplunkTrust

What have you tried so far?  How did those efforts not fulfill your requirements?

Please review the sample events and output, as they appear to be unrelated.  The table contains timestamps and application IDs that do not appear in the events.

---
If this reply helps you, Karma would be appreciated.