Hi Ninjas,
I have following sample events in splunk.
[02/18/2020 10:47:15.1318] CAUAJM_I_40245 EVENT: CHANGE_STATUS STATUS: STARTING JOB: CFDW_ADHOC_C_AIMSAS_D_INV_LNITEM_BILLING_CHGS_M MACHINE: XXXX
[02/18/2020 10:48:15.1318] CAUAJM_I_40245 EVENT: CHANGE_STATUS STATUS: RUNNING JOB: CFDW_ADHOC_C_AIMSAS_D_INV_LNITEM_BILLING_CHGS_M MACHINE: XXXX
[02/18/2020 18:25:15.1318] CAUAJM_I_40245 EVENT: CHANGE_STATUS STATUS: SUCCESS JOB: CFDW_ADHOC_C_AIMSAS_D_INV_LNITEM_BILLING_CHGS_M MACHINE: XXXX
Now i need your help to calculate the total number of running/starting jobs for every 5 minutes , being the job status get hold by the splunk query and make it count in the timechart.
For example, if i am running a query for total number of running jobs between 10:48 and 18:25 , the job name showing in the sample events should be included in the count.
Your help is much appreciated.
| makeresults
| eval _raw="[02/18/2020 09:45:15.1318] CAUAJM_I_40244 EVENT: CHANGE_STATUS STATUS: STARTING JOB: CFDW_ADHOC_A_AIMSAS_D_INV_LNITEM_BILLING_CHGS_M MACHINE: XXXX
[02/18/2020 10:47:15.1318] CAUAJM_I_40245 EVENT: CHANGE_STATUS STATUS: STARTING JOB: CFDW_ADHOC_C_AIMSAS_D_INV_LNITEM_BILLING_CHGS_M MACHINE: XXXX
[02/18/2020 10:47:35.1318] CAUAJM_I_40244 EVENT: CHANGE_STATUS STATUS: RUNNING JOB: CFDW_ADHOC_A_AIMSAS_D_INV_LNITEM_BILLING_CHGS_M MACHINE: XXXX
[02/18/2020 10:48:15.1318] CAUAJM_I_40245 EVENT: CHANGE_STATUS STATUS: RUNNING JOB: CFDW_ADHOC_C_AIMSAS_D_INV_LNITEM_BILLING_CHGS_M MACHINE: XXXX
[02/18/2020 18:25:15.1318] CAUAJM_I_40245 EVENT: CHANGE_STATUS STATUS: SUCCESS JOB: CFDW_ADHOC_C_AIMSAS_D_INV_LNITEM_BILLING_CHGS_M MACHINE: XXXX
[02/18/2020 19:25:15.1318] CAUAJM_I_40244 EVENT: CHANGE_STATUS STATUS: SUCCESS JOB: CFDW_ADHOC_A_AIMSAS_D_INV_LNITEM_BILLING_CHGS_M MACHINE: XXXX"
| makemv delim="
" _raw
| stats count by _raw
| rex "\[(?<timestamp>.*)\] (?<session>\S+) .*STATUS: (?<status>\S+) JOB: (?<job>\S+) MACHINE: (?<machine>\S+)"
| table timestamp session status job machine
| eval _time=strptime(timestamp,"%m/%d/%Y %T.%3Q")
| eval status=if(status="STARTING" OR status="RUNNING","RUNNING",status)
| eval job_status=job.":".status
| timechart cont=f span=5m count by job_status
| makecontinuous span=5m _time
| foreach *:RUNNING [ eval <<FIELD>> = nullif('<<FIELD>>','<<MATCHSTR>>:SUCCESS')]
| foreach *:SUCCESS [ eval <<FIELD>> = if(isnull('<<MATCHSTR>>:RUNNING') AND '<<FIELD>>'=0 ,NULL, '<<FIELD>>')]
| reverse
| filldown *:SUCCESS
| reverse
| foreach *:RUNNING [ eval <<FIELD>> = if(isnull('<<FIELD>>') AND '<<MATCHSTR>>:SUCCESS' > 0 , 1, '<<FIELD>>' )]
| filldown *:RUNNING
| foreach *:* [ eval <<MATCHSEG1>> = max('<<MATCHSEG1>>:RUNNING','<<MATCHSEG1>>:SUCCESS')]
| fields - *:*
It is troublesome.
@to4kawa Thanks for your efforts to present me an answer.
I have modified the query as like yours
index=infra_apps sourcetype=ca:atsys:edemon:txt
| rename Status as status
| table Autosysjob_time status Job hostname
| eval status=if(status="STARTING" OR status="RUNNING","RUNNING",status)
| eval job_status=Job.":".status
| timechart cont=f span=5m count by job_status
| makecontinuous span=5m _time
| foreach :RUNNING [ eval <> = nullif('<>','<>:SUCCESS')]
| foreach *:SUCCESS [ eval <> = if(isnull('<>:RUNNING') AND '<>'=0 ,NULL, '<>')]
| reverse
| filldown *:SUCCESS
| reverse
| foreach *:RUNNING [ eval <> = if(isnull('<>') AND '<>:SUCCESS' > 0 , 1, '<>' )]
| filldown *:RUNNING
| foreach *: [ eval <> = max('<>:RUNNING','<>:SUCCESS')]
| fields - :
Unfortunately its not giving any result. i would like try this concurrency command as sugessested by @woodcock i am not able to define it to my situation correctly, can you please try for me once
| table Autosysjob_time status Job hostname
⇨
| table_time status Job hostname Autosysjob
timechart
use first field epochtime.
[02/18/2020 09:45:15.1318] CAUAJM_I_40244 EVENT: CHANGE_STATUS STATUS: STARTING JOB: CFDW_ADHOC_A_AIMSAS_D_INV_LNITEM_BILLING_CHGS_M MACHINE: XXXX
is
[(?P(_time)] hostname EVENT: CHANGE_STATUS STATUS:(?P(<status>) JOB: (?P<job>) MACHINE:(?P<foo>)
?
Huh?, Where is Autosysjob?
[02/18/2020 09:45:15.1318] ---Autosys Job Time
CFDW_ADHOC_A_AIMSAS_D_INV_LNITEM_BILLING_CHGS ----Autosys Job Name
And I have updated the query and tested for a single job which starts at 12:35 am daily and continuously run for 20 hours , i see the dot at 12:35 am for today but not anything after that, i am expecting a dot at every for minutes in running jobs count.
I see two lines in the graph when i have not filtered with any job name , one is NULL and another one is OTHER, Can you please explain what it is....And the query looks messy , can you please simplify it if you can so that i will have to few more details to this query via lookup file like cluster details.
I see,
| rex "\[(?<timestamp>.*)\] (?<session>\S+) .*STATUS: (?<status>\S+) JOB: (?<job>\S+) MACHINE: (?<machine>\S+)"
timestamp is Autosysjob.
no.12
| table timestamp session status job machine
⇨
| table Autosysjob session status job machine
I think this will work.
I am unbale to share the screenshot . I tried to filter this with single job.
But The Job expected to be in running state from 12:30 am to 11:30pm , when i see the graph it doesnt look like so
your search results are exactly as you searched
I created it by your sample log.
If you can't do it, your provide log is wrong OR your query is wrong.
I am trying filter it with one job...its showing values in the legend
<<>JobName>> : RUNNING
<<>JobName>> : SUCCESS
NULL
And its not showing the continuous line for Running status as expected, Could you please correct this and also i dont need to woryy about success status now...i need data for only how many jobs are in starting state and how many are in running state at each 5 mins
Use the concurrency
command for this. For each start
event, you create a duration
field from the stop
events, then you do | eval duration = coalesce(duration, now() - _time)
to cover those events that have not ended:
https://docs.splunk.com/Documentation/Splunk/latest/SearchReference/Concurrency
If this won't work, then try this:
https://answers.splunk.com/answers/227393/how-to-use-the-concurrency-command-to-timechart-th.html
@woodcock
i am struggling to apply this..Can you please spend sometime and share me the concrete query for my scenerio, that will be great help. Thank you!
Hi pench2k19,
I think that:
if you didn't do it, this is the regex to extract fields:
\]\s+(?<Transaction_ID>.*)\s+STATUS:\s(?<STATUS>.*)\s+JOB:\s+(?<JOB>.*)\s+MACHINE:\s+(?<MACHINE>.*)
that you can test at https://regex101.com/r/szlTxR/1 .
You have two solutions:
The first is easier to use but slower than the other.
Solution 1: Transaction command:
index=your_index STATUS=STARTING OR STATUS=RUNNING OR STATUS=SUCCESS
| transaction Transaction_ID MACHINE
| timechart span=300s count BY MACHINE
Solution 2: stats command:
index=your_index STATUS=STARTING OR STATUS=RUNNING OR STATUS=SUCCESS
| stats earlier(_time) AS _time BY Transaction_ID MACHINE
| timechart span=300s count BY MACHINE
(I think that you want the time of the transaction start, if instead you prefer the transaction end, you have to use latest
instead of earlier in stats command).
I prefer the second.
Ciao.
Giuseppe
@gcusello that is not the transaction ID and i have extracted all the fields, the query i have defines is as follows
index=infra_apps sourcetype=ca:atsys:edemon:txt
| Transaction Job Startswith=(Status=STARTING) endswith=(Status=RUNNING)
| fields Job host Autosysjob_time Status
| lookup datalakenodeslist.csv host OUTPUT cluster
| mvexpand cluster
| search Status=STARTING AND cluster= AND host="" AND Job=*
| dedup Job Autosysjob_time host
| timechart span=5m count(Job) by cluster
This is running very due to the transaction command and it has to run against more data. And i am not sur e if this query counting the sample job during the time i specified in my problem statement.
Hi pench2k19,
transaction is a very slow command to use only when there isn't any other solution.
as I showed, you should try to change the transaction command in stats command using the logic you can see in my example:
index=infra_apps sourcetype=ca:atsys:edemon:txt (STATUS=STARTING OR STATUS=RUNNING OR STATUS=SUCCESS)
| stats earlier(_time) AS _time BY Job host
| timechart span=5m count(Job) by host
Ciao.
Giuseppe
@gcusello my doubt is would your query can hold the status of the job as expected and show it in the count in timechart.
Hi pench2k19,
my search correlates events using as a key the Job (that I think is a Transaction_ID), in this way you have as many records as the number of transactions and for each of them you have the transaction_ID (Job), the host and the starting time, so you can count all the transactions every 5 minutes for each host.
Ciao.
Giuseppe
@gcusello your query throwing an error "Error in 'stats' command: The argument 'earlier(_time)' is invalid."
Hi pench2k19,
sorry!
earliest(_time) AS _time
I was sleeping!
Ciao.
Giuseppe
@gcusello hope you got my question correct.
I would like to see the number of jobs running or starting at each 5 minutes.
For example as shown in th sample events, if i job a running from 10:48 to 18:25 , I would like to see that job name in the number of running jobs for every 5 minutes event though i have a single event with that status=RUNNING in splunk.
I am sorry to say your query not serving the purpose
Hi pench2k19,
are you sure?
if you use
index=infra_apps sourcetype=ca:atsys:edemon:txt (STATUS=STARTING OR STATUS=RUNNING)
| bin span=5m _time
| stats count BY Job host _time
| timechart span=5m dc(Job) by host
you have the number of jobs per host every 5 minutes with status=STARTING only or status=RUNNING only or status=both STARTING and RUNNING.
Ciao.
Giuseppe
@gcusello for now if i want to see number of running jobs at each 5 minutes , i have filtered your query
index=infra_apps sourcetype=ca:atsys:edemon:txt STATUS=RUNNING
| bin span=5m _time
| stats count BY Job host _time
| timechart span=5m dc(Job) by host
it is still not working as expected, I would like to hold the running status and show it in the running job count till it changes its status