I have a working query that uses Transaction to find the Starting / Ending log event.
I am trying to make some changes but Transaction is not working as I expected.
In my current working example I am looking for a 'job name' and then the starting and ending log event.
In my current code I am using one query:
index=anIndex sourcetype=aSourcetype aJobName AND ("START of script" OR "COMPLETED OK").
This works fine when there are no issues but now if a job fails there will be multiple "START of script" and only one 'COMPLETED OK' event.
So, I tried reworking my query to be as follows to only get the most recent of either the start / completed log event.
index=anIndex sourcetype=aSourcetype aJobName AND "START of script" | head 1
| append [ index=anIndex sourcetype=aSourcetype aJobName AND "COMPLETED OK" | head 1 ]
But when I get to the part of creating a transaction the transaction only has the Starting log event ?
| rex "(?<event_name>(START of script)|(COMPLETED OK))"
| eval event_name=CASE(event_name="START of script", "script_start", event_name="COMPLETED OK", "script_complete")
| eval event_time=strftime(_time, "%Y-%m-%d %H:%M:%S")
| eval {event_name}_time=_time
| rex field=_raw "Batch::(?<batchJobName>[^\s]*)"
| transaction keeporphans=true host batchJobName startswith=(event_name="script_start") endswith=(event_name="script_complete")
Is the use of | append [...] the cause ?
If append cannot be used for transaction what other way can I get the data Im looking for ?
Have you tried sorting your data after doing the append? Per the transaction command docs the data needs to be in descending time-order for the command to work correctly:
| sort 0 -_time
When you do an append, you might be tacking on "earlier" timestamps that are not seen as the transaction command works on the stream of data.
Hi @sjringo,
Your original search may work as expected with the transaction keepevicted option, which will retain transactions without a closing event:
index=anIndex sourcetype=aSourcetype aJobName AND ("START of script" OR "COMPLETED OK")
| rex "(?<event_name>(START of script)|(COMPLETED OK))"
| eval event_name=CASE(event_name="START of script", "script_start", event_name="COMPLETED OK", "script_complete")
| rex field=_raw "Batch::(?<batchJobName>[^\s]*)"
| transaction keepevicted=true host batchJobName startswith=(event_name="script_start") endswith=(event_name="script_complete")
The keepevicted=true kind of works for one of the scenarios.
The way it works under normal conditions (no errors) there will be only one Start and one Complete event except when a job has started and not ended due to it currently running...
When there is an exception, the job was restarted and completed with no additional errors. I have from newest to oldest (Completed, Start, Start) or multiple starts depending upon how many errors caused it to abort.
When currently running (Start only, or multiple starts) with no complete
Then to add more to this. I am not looking for just one job but multiple, each wih its unique job name and creating the transaction using the job name and then there could be multiple jobs currently running...
Instead of using Transaction im thinking I might need to do a join with the first query looking for events with the jobName and Started. The 2nd query looking for jobName and Completed, but then having multple starts for a jobName would cause problems on the join?
If your initial search includes only start and end events, you can forego transaction and use stats to gather simple status and duration information, assuming a job with only a start event has actually failed and isn't currently in progress:
sourcetype=sjringo_jobstatus
| stats range(_time) as duration by host job_id jobrun_id
| where duration>0 ``` remove jobs with duration=0, i.e. no completion event ```
No subsearches (append, join, etc.) are required. Here's a set of example events very loosely based on Tidal Enterprise Scheduler, a scheduler I've used in the past:
2023-11-10 20:00:00 job_id=1 jobrun_id=1 jobrun_status=active
2023-11-10 20:01:00 job_id=1 jobrun_id=2 jobrun_status=active
2023-11-10 20:02:00 job_id=1 jobrun_id=2 jobrun_status=normal
2023-11-10 20:03:00 job_id=2 jobrun_id=3 jobrun_status=active
2023-11-10 20:04:00 job_id=2 jobrun_id=3 jobrun_status=normal
2023-11-10 20:05:00 job_id=3 jobrun_id=4 jobrun_status=active
2023-11-10 20:06:00 job_id=3 jobrun_id=5 jobrun_status=active
2023-11-10 20:07:00 job_id=3 jobrun_id=5 jobrun_status=normal
2023-11-10 20:08:00 job_id=4 jobrun_id=6 jobrun_status=active
2023-11-10 20:09:00 job_id=4 jobrun_id=6 jobrun_status=normal
2023-11-10 20:10:00 job_id=5 jobrun_id=7 jobrun_status=active
2023-11-10 20:11:00 job_id=5 jobrun_id=8 jobrun_status=active
2023-11-10 20:12:00 job_id=5 jobrun_id=8 jobrun_status=normal
In this example, job_id is the job definition, jobrun_id is the job instance, and jobrun_status is the state (active = started/running and normal = completed successfully).
We can use the transaction command to uniquely identify the state of each job instance by host, job_id, and jobrun_id:
sourcetype=sjringo_jobstatus
| transaction keepevicted=true host job_id jobrun_id startswith=jobrun_status=active endswith=jobrun_status=normal
| table _time duration closed_txn _raw
Job instances with both a start (open) and end (close) event will have closed_txn=1; job instances with only a start event--an evicted transaction--will have closed_txn=0:
_time | duration | closed_txn | _raw |
2023-11-10 15:11:00 | 60 | 1 | 2023-11-10 20:11:00 job_id=5 jobrun_id=8 jobrun_status=active 2023-11-10 20:12:00 job_id=5 jobrun_id=8 jobrun_status=normal |
2023-11-10 15:10:00 | 0 | 0 | 2023-11-10 20:10:00 job_id=5 jobrun_id=7 jobrun_status=active |
2023-11-10 15:08:00 | 60 | 1 | 2023-11-10 20:08:00 job_id=4 jobrun_id=6 jobrun_status=active 2023-11-10 20:09:00 job_id=4 jobrun_id=6 jobrun_status=normal |
2023-11-10 15:06:00 | 60 | 1 | 2023-11-10 20:06:00 job_id=3 jobrun_id=5 jobrun_status=active 2023-11-10 20:07:00 job_id=3 jobrun_id=5 jobrun_status=normal |
2023-11-10 15:05:00 | 0 | 0 | 2023-11-10 20:05:00 job_id=3 jobrun_id=4 jobrun_status=active |
2023-11-10 15:03:00 | 60 | 1 | 2023-11-10 20:03:00 job_id=2 jobrun_id=3 jobrun_status=active 2023-11-10 20:04:00 job_id=2 jobrun_id=3 jobrun_status=normal |
2023-11-10 15:01:00 | 60 | 1 | 2023-11-10 20:01:00 job_id=1 jobrun_id=2 jobrun_status=active 2023-11-10 20:02:00 job_id=1 jobrun_id=2 jobrun_status=normal |
2023-11-10 15:00:00 | 0 | 0 | 2023-11-10 20:00:00 job_id=1 jobrun_id=1 jobrun_status=active |
You can remove evicted transactions from your output with the default option of keepevicted=false.
Ok, my SPL is getting smaller.. But here is what is happening with the current scenario im testing with.
I have 3 jobs. Two ran with no issues, each has a start & complete log event (No issues)
The 3rd job has two start events and one complete event. Since it failed once, was restarted and then completed.
When I use:
| Transaction keeporphans=true host batchName startwith="start" endswith="completed"
I end up with three transaction where closed_txn=1 and one where closed_txn is not defined or 0... for a total of 4 transactions.
But, what I want to see in the resulting table is all jobs that have run to completion (closed_txn=1) and then any jobs that are currently running (closed_txn=0) or not defined.
How would I go about eliminating from my results the 4th transaction when I have another transaction with the same jobName that has closed_txn=1 and then not remove it (since its still running) when I dont have a tranaction that has closed_txn=1 ?
My Events:
aJobName1 START
aJobName2 START
aJobName1 COMPLETE
aJobName3 START
aJobName2 START
aJobName3 COMPLETE
aJobName2 COMPLETE
Hi,
Remove keeporphans=true from your transaction command and add keepevicted=true.
Without a field to identify unique instances of jobs, you can't know which of multiple START events correlates to a single COMPLETE event:
T0
aJobName2 START ``` A ```
T1
aJobName2 START ``` B ```
T2
aJobName2 COMPLETE
Which instance of aJobName2 completed, A or B?
The transaction command with keepevicted=true startswith=START endswith=COMPLETE will create two transactions:
T0
aJobName2 START
closed_txn=0
duration=0
T1
ajobName2 START
ajobName2 COMPLETE
closed_txn=1
duration=T2-T1
Is there a job number, process identifier, etc. in your _raw event?
I had to re-think the way I was showing my job info in the UI as a orphaned Transaction could mean two different things.
i.e. 'running' start with no end message or multiple starts along with modifying my initial SPL to look for an aborted message when a failure occurred.
It was quite a learning experience is I went through all different permutations of keeporphans=true/false keepevicted=true/false and then look at the ending transaction data.
I did notice that closed_txn defaults to null and not 0 and had to write some code to make it default to 0.
eval closed_txn = if ( isnull(closed_txn,0, closed_txn)
search closed_txn=0
to find those transactions that only has a START log event...
Thanks for all you input as it gave me a different perspective from the way I was looking at it. I was also to rewrite some of my existing SPL to have fewer lines.
Have you tried sorting your data after doing the append? Per the transaction command docs the data needs to be in descending time-order for the command to work correctly:
| sort 0 -_time
When you do an append, you might be tacking on "earlier" timestamps that are not seen as the transaction command works on the stream of data.
This seemed to work by adding | sort -_time | head 1
i.e. index=anIndex sourcetype=aSourcetype aJobName AND "START of script" | sort -_time | head 1
| append [ index=anIndex sourcetype=aSourcetype aJobName AND "COMPLETED OK" ]
The way I understood sort 0 is that it forces all event but in this scenario there should not be more than a handfull. I think I saw the limit without the 0 was 10,000?
So, I think this worked but need to do some additional testing.
For what im working on I have 3 jobs that im trying to track and each has a different jobName so I ended up writing the initial query as:
index=anIndex sourcetype=aSourcetype aJobName1 AND "START of script" | sort -_time | head 1
| append [ search index=anIndex sourcetype=aSourcetype aJobName1 AND "COMPLETED OK" ]
| append [ search index=anIndex sourcetype=aSourcetype aJobName2 AND "START of script" | sort -_time | head 1
| append [ index=anIndex sourcetype=aSourcetype aJobName2 AND "COMPLETED OK" ]]
| append [ search index=anIndex sourcetype=aSourcetype aJobName3 AND "START of script" | sort -_time | head 1 ]
| append [ index=anIndex sourcetype=aSourcetype aJobName3 AND "COMPLETED OK" ]]
Is there a better way to write this besides using three appends ? I cant use one append and then | head 3 for the Start query as one of the jobs multipe STARTS could kick out a START event from one of the other jobs depending upon when the jobs start...