Splunk Search

need help to hold the status of a job from an event and make it count till next status change event and show it in the timechart

pench2k19
Explorer

Hi Ninjas,

I have following sample events in splunk.

[02/18/2020 10:47:15.1318] CAUAJM_I_40245 EVENT: CHANGE_STATUS STATUS: STARTING JOB: CFDW_ADHOC_C_AIMSAS_D_INV_LNITEM_BILLING_CHGS_M MACHINE: XXXX

[02/18/2020 10:48:15.1318] CAUAJM_I_40245 EVENT: CHANGE_STATUS STATUS: RUNNING JOB: CFDW_ADHOC_C_AIMSAS_D_INV_LNITEM_BILLING_CHGS_M MACHINE: XXXX

[02/18/2020 18:25:15.1318] CAUAJM_I_40245 EVENT: CHANGE_STATUS STATUS: SUCCESS JOB: CFDW_ADHOC_C_AIMSAS_D_INV_LNITEM_BILLING_CHGS_M MACHINE: XXXX

Now i need your help to calculate the total number of running/starting jobs for every 5 minutes , being the job status get hold by the splunk query and make it count in the timechart.

For example, if i am running a query for total number of running jobs between 10:48 and 18:25 , the job name showing in the sample events should be included in the count.

Your help is much appreciated.

Tags (2)
0 Karma

to4kawa
Ultra Champion
| makeresults
| eval _raw="[02/18/2020 09:45:15.1318] CAUAJM_I_40244 EVENT: CHANGE_STATUS STATUS: STARTING JOB: CFDW_ADHOC_A_AIMSAS_D_INV_LNITEM_BILLING_CHGS_M MACHINE: XXXX
[02/18/2020 10:47:15.1318] CAUAJM_I_40245 EVENT: CHANGE_STATUS STATUS: STARTING JOB: CFDW_ADHOC_C_AIMSAS_D_INV_LNITEM_BILLING_CHGS_M MACHINE: XXXX
[02/18/2020 10:47:35.1318] CAUAJM_I_40244 EVENT: CHANGE_STATUS STATUS: RUNNING JOB: CFDW_ADHOC_A_AIMSAS_D_INV_LNITEM_BILLING_CHGS_M MACHINE: XXXX
[02/18/2020 10:48:15.1318] CAUAJM_I_40245 EVENT: CHANGE_STATUS STATUS: RUNNING JOB: CFDW_ADHOC_C_AIMSAS_D_INV_LNITEM_BILLING_CHGS_M MACHINE: XXXX
[02/18/2020 18:25:15.1318] CAUAJM_I_40245 EVENT: CHANGE_STATUS STATUS: SUCCESS JOB: CFDW_ADHOC_C_AIMSAS_D_INV_LNITEM_BILLING_CHGS_M MACHINE: XXXX
[02/18/2020 19:25:15.1318] CAUAJM_I_40244 EVENT: CHANGE_STATUS STATUS: SUCCESS JOB: CFDW_ADHOC_A_AIMSAS_D_INV_LNITEM_BILLING_CHGS_M MACHINE: XXXX"
| makemv delim="
" _raw
| stats count by _raw
| rex "\[(?<timestamp>.*)\] (?<session>\S+) .*STATUS: (?<status>\S+) JOB: (?<job>\S+) MACHINE: (?<machine>\S+)"
| table timestamp session status job machine
| eval _time=strptime(timestamp,"%m/%d/%Y %T.%3Q")
| eval status=if(status="STARTING" OR status="RUNNING","RUNNING",status)
| eval job_status=job.":".status
| timechart cont=f span=5m count by job_status
| makecontinuous span=5m _time
| foreach *:RUNNING [ eval <<FIELD>> = nullif('<<FIELD>>','<<MATCHSTR>>:SUCCESS')]
| foreach *:SUCCESS [ eval <<FIELD>> = if(isnull('<<MATCHSTR>>:RUNNING') AND '<<FIELD>>'=0 ,NULL, '<<FIELD>>')]
| reverse
| filldown *:SUCCESS
| reverse
| foreach *:RUNNING [ eval <<FIELD>> = if(isnull('<<FIELD>>') AND '<<MATCHSTR>>:SUCCESS' > 0 , 1, '<<FIELD>>' )]
| filldown *:RUNNING
| foreach *:* [ eval <<MATCHSEG1>> = max('<<MATCHSEG1>>:RUNNING','<<MATCHSEG1>>:SUCCESS')]
| fields - *:*

It is troublesome.
column chart

0 Karma

pench2k19
Explorer

@to4kawa Thanks for your efforts to present me an answer.

I have modified the query as like yours

index=infra_apps sourcetype=ca:atsys:edemon:txt
| rename Status as status
| table Autosysjob_time status Job hostname
| eval status=if(status="STARTING" OR status="RUNNING","RUNNING",status)
| eval job_status=Job.":".status
| timechart cont=f span=5m count by job_status
| makecontinuous span=5m _time
| foreach :RUNNING [ eval <> = nullif('<>','<>:SUCCESS')]
| foreach *:SUCCESS [ eval <> = if(isnull('<>:RUNNING') AND '<>'=0 ,NULL, '<>')]
| reverse
| filldown *:SUCCESS
| reverse
| foreach *:RUNNING [ eval <> = if(isnull('<>') AND '<>:SUCCESS' > 0 , 1, '<>' )]
| filldown *:RUNNING
| foreach *:
[ eval <> = max('<>:RUNNING','<>:SUCCESS')]
| fields - :

Unfortunately its not giving any result. i would like try this concurrency command as sugessested by @woodcock i am not able to define it to my situation correctly, can you please try for me once

0 Karma

to4kawa
Ultra Champion

| table Autosysjob_time status Job hostname

| table_time status Job hostname Autosysjob

timechart use first field epochtime.

[02/18/2020 09:45:15.1318] CAUAJM_I_40244 EVENT: CHANGE_STATUS STATUS: STARTING JOB: CFDW_ADHOC_A_AIMSAS_D_INV_LNITEM_BILLING_CHGS_M MACHINE: XXXX is
[(?P(_time)] hostname EVENT: CHANGE_STATUS STATUS:(?P(<status>) JOB: (?P<job>) MACHINE:(?P<foo>) ?

Huh?, Where is Autosysjob?

0 Karma

pench2k19
Explorer

@to4kawa

[02/18/2020 09:45:15.1318] ---Autosys Job Time
CFDW_ADHOC_A_AIMSAS_D_INV_LNITEM_BILLING_CHGS ----Autosys Job Name

And I have updated the query and tested for a single job which starts at 12:35 am daily and continuously run for 20 hours , i see the dot at 12:35 am for today but not anything after that, i am expecting a dot at every for minutes in running jobs count.

I see two lines in the graph when i have not filtered with any job name , one is NULL and another one is OTHER, Can you please explain what it is....And the query looks messy , can you please simplify it if you can so that i will have to few more details to this query via lookup file like cluster details.

0 Karma

to4kawa
Ultra Champion

I see,
| rex "\[(?<timestamp>.*)\] (?<session>\S+) .*STATUS: (?<status>\S+) JOB: (?<job>\S+) MACHINE: (?<machine>\S+)"
timestamp is Autosysjob.

no.12

 | table timestamp session status job machine
 ⇨
 | table Autosysjob session status job machine

I think this will work.

0 Karma

pench2k19
Explorer

I am unbale to share the screenshot . I tried to filter this with single job.

But The Job expected to be in running state from 12:30 am to 11:30pm , when i see the graph it doesnt look like so

0 Karma

to4kawa
Ultra Champion

your search results are exactly as you searched
I created it by your sample log.
If you can't do it, your provide log is wrong OR your query is wrong.

0 Karma

pench2k19
Explorer

I am trying filter it with one job...its showing values in the legend

<<>JobName>> : RUNNING
<<>JobName>> : SUCCESS
NULL

And its not showing the continuous line for Running status as expected, Could you please correct this and also i dont need to woryy about success status now...i need data for only how many jobs are in starting state and how many are in running state at each 5 mins

0 Karma

woodcock
Esteemed Legend

Use the concurrency command for this. For each start event, you create a duration field from the stop events, then you do | eval duration = coalesce(duration, now() - _time) to cover those events that have not ended:
https://docs.splunk.com/Documentation/Splunk/latest/SearchReference/Concurrency
If this won't work, then try this:
https://answers.splunk.com/answers/227393/how-to-use-the-concurrency-command-to-timechart-th.html

0 Karma

pench2k19
Explorer

@woodcock
i am struggling to apply this..Can you please spend sometime and share me the concrete query for my scenerio, that will be great help. Thank you!

0 Karma

gcusello
Legend

Hi pench2k19,
I think that:

  • CAUAJM_I_40245 is the transaction_ID,
  • XXXX is the machine you're checking,
  • you already extracted all the fields (Transaction_ID, STATUS, JOB, MACHINE).

if you didn't do it, this is the regex to extract fields:

\]\s+(?<Transaction_ID>.*)\s+STATUS:\s(?<STATUS>.*)\s+JOB:\s+(?<JOB>.*)\s+MACHINE:\s+(?<MACHINE>.*)

that you can test at https://regex101.com/r/szlTxR/1 .

You have two solutions:

  • transaction command,
  • stats command;

The first is easier to use but slower than the other.

Solution 1: Transaction command:

index=your_index STATUS=STARTING OR STATUS=RUNNING  OR STATUS=SUCCESS 
| transaction Transaction_ID MACHINE
| timechart span=300s count BY MACHINE

Solution 2: stats command:

index=your_index STATUS=STARTING OR STATUS=RUNNING  OR STATUS=SUCCESS 
| stats earlier(_time) AS _time BY Transaction_ID MACHINE
| timechart span=300s count BY MACHINE

(I think that you want the time of the transaction start, if instead you prefer the transaction end, you have to use latest instead of earlier in stats command).

I prefer the second.

Ciao.
Giuseppe

pench2k19
Explorer

@gcusello that is not the transaction ID and i have extracted all the fields, the query i have defines is as follows

index=infra_apps sourcetype=ca:atsys:edemon:txt
| Transaction Job Startswith=(Status=STARTING) endswith=(Status=RUNNING)
| fields Job host Autosysjob_time Status
| lookup datalakenodeslist.csv host OUTPUT cluster
| mvexpand cluster
| search Status=STARTING AND cluster= AND host="" AND Job=*
| dedup Job Autosysjob_time host
| timechart span=5m count(Job) by cluster

This is running very due to the transaction command and it has to run against more data. And i am not sur e if this query counting the sample job during the time i specified in my problem statement.

0 Karma

gcusello
Legend

Hi pench2k19,
transaction is a very slow command to use only when there isn't any other solution.
as I showed, you should try to change the transaction command in stats command using the logic you can see in my example:

index=infra_apps sourcetype=ca:atsys:edemon:txt (STATUS=STARTING OR STATUS=RUNNING  OR STATUS=SUCCESS)
| stats  earlier(_time) AS _time BY Job host
| timechart span=5m count(Job) by host

Ciao.
Giuseppe

0 Karma

pench2k19
Explorer

@gcusello my doubt is would your query can hold the status of the job as expected and show it in the count in timechart.

0 Karma

gcusello
Legend

Hi pench2k19,
my search correlates events using as a key the Job (that I think is a Transaction_ID), in this way you have as many records as the number of transactions and for each of them you have the transaction_ID (Job), the host and the starting time, so you can count all the transactions every 5 minutes for each host.

Ciao.
Giuseppe

0 Karma

pench2k19
Explorer

@gcusello your query throwing an error "Error in 'stats' command: The argument 'earlier(_time)' is invalid."

0 Karma

gcusello
Legend

Hi pench2k19,
sorry!

earliest(_time) AS _time

I was sleeping!

Ciao.
Giuseppe

0 Karma

pench2k19
Explorer

@gcusello hope you got my question correct.

I would like to see the number of jobs running or starting at each 5 minutes.

For example as shown in th sample events, if i job a running from 10:48 to 18:25 , I would like to see that job name in the number of running jobs for every 5 minutes event though i have a single event with that status=RUNNING in splunk.

I am sorry to say your query not serving the purpose

0 Karma

gcusello
Legend

Hi pench2k19,
are you sure?
if you use

index=infra_apps sourcetype=ca:atsys:edemon:txt (STATUS=STARTING OR STATUS=RUNNING)
| bin span=5m _time
| stats  count BY Job host _time
| timechart span=5m dc(Job) by host

you have the number of jobs per host every 5 minutes with status=STARTING only or status=RUNNING only or status=both STARTING and RUNNING.

Ciao.
Giuseppe

0 Karma

pench2k19
Explorer

@gcusello for now if i want to see number of running jobs at each 5 minutes , i have filtered your query

index=infra_apps sourcetype=ca:atsys:edemon:txt STATUS=RUNNING
| bin span=5m _time
| stats count BY Job host _time
| timechart span=5m dc(Job) by host

it is still not working as expected, I would like to hold the running status and show it in the running job count till it changes its status

0 Karma
Did you miss .conf21 Virtual?

Good news! The event's keynotes and many of its breakout sessions are now available online, and still totally FREE!