Hi,
We have to monitor some jobs where one job can have multiple sub-tasks, and the dependencies can be nested: one task depends on another, which in turn depends on yet another.
I am looking to correlate these dependencies and see how much time a job took end to end.
In the example below, task 1 depends on task 2, task 2 depends on task 3, and task 4 is the end of the job.
There is going to be some homework here, but I will give you an analogous problem with a solution that you can adapt to your data. First, install the "CIM Toolkit" app (it has a TON of great stuff in it):
https://splunkbase.splunk.com/app/6243
Next, run this search:
| `CIM_Toolkit_Generate_Renames(Authentication)`
Then press Shift-Ctrl-E to expand the macro into SPL. Tear it apart and take what you need.
Here is why it will do what you need. It starts out with data like this:
title | child1 | objectName | parentName |
Authentication | Authentication | BaseEvent | |
Authentication | Default_Authentication | Authentication | BaseEvent |
Authentication | Failed_Authentication | Authentication | BaseEvent |
Authentication | Insecure_Authentication | Authentication | BaseEvent |
Authentication | Privileged_Authentication | Authentication | BaseEvent |
Authentication | Successful_Authentication | Authentication | BaseEvent |
And transforms it into a final output like this:
| rename Authentication.Successful_Authentication.* AS *
| rename Authentication.Privileged_Authentication.Successful_Privileged_Authentication.* AS *
| rename Authentication.Privileged_Authentication.Failed_Privileged_Authentication.* AS *
| rename Authentication.Privileged_Authentication.* AS *
| rename Authentication.Insecure_Authentication.* AS *
| rename Authentication.Failed_Authentication.* AS *
| rename Authentication.Default_Authentication.Successful_Default_Authentication.* AS *
| rename Authentication.Default_Authentication.Failed_Default_Authentication.* AS *
| rename Authentication.Default_Authentication.* AS *
| rename Authentication.* AS *
I don't think there's an easy way to do that. It's one of the anti-patterns in data: trying to normalize everything. A similar case is when someone implements a tree structure (for example, filesystem-like objects) with just a "pointer" to the parent in an SQL database.
It requires recursive processing to find the path to the root.
And while in some flavours of SQL you can write a stored procedure that will do that for you (it will still not be very efficient, mind you), Splunk has no way of "looping" over its own results and "re-searching" based on them.
The only, very ugly, "quasi-solution" would be to assume that your jobs cannot be nested more than N levels deep and just do N iterations of mapping a job to its parent.
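To make the bounded-depth workaround concrete, here is a minimal Python sketch (for illustration only; Splunk itself cannot loop like this in SPL). The `key`/`parent_key` field names mirror the thread, and `N_LEVELS` is an assumed maximum nesting depth:

```python
N_LEVELS = 3  # assumed maximum nesting depth

# job key -> parent job key (None for a root job); sample values only
parents = {"12345": "67890", "67890": None}

def resolve_root(key: str) -> str:
    """Walk at most N_LEVELS parent links; whatever we reach is the 'root'.

    This is exactly the 'N iterations of mapping a job to its parent'
    idea: if the real tree is deeper than N_LEVELS, the result is wrong.
    """
    current = key
    for _ in range(N_LEVELS):
        parent = parents.get(current)
        if parent is None:
            break
        current = parent
    return current

print(resolve_root("12345"))  # -> 67890
```

In SPL this corresponds to N explicit lookup/join stages, one per nesting level, which is why it gets ugly fast.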
Good pointer. In fact, I started with two levels (as the screenshot shows):
| eval u_key = if(isnull(key), parent_key, key)
| transaction u_key startswith="eval(isnull(key))" endswith="eval(isnull(parent_key) OR key == u_key)" keepevicted=true
| foreach parent_key mode=multivalue
[eval uu_key = mvappend(uu_key, if(mvcount(parent_key) < 2, <<ITEM>>, if(key == <<ITEM>>, null(), <<ITEM>>)))]
| eval _time = if(u_key == uu_key, _time, _time + duration)
| rename duration as duration1
| transaction uu_key startswith="eval(key!=uu_key)" endswith="eval(key==uu_key)" keepevicted=true
| eval duration = sum(duration1) + duration
Then I realized that each additional nesting level requires an additional explicit compaction. Very ugly indeed.
Anyway, here is an emulated dataset to test the above:
_time | key | parent_key |
2023-03-29 07:26:00 | 67890 | |
2023-03-29 07:06:00 | 67890 | |
2023-03-29 06:30:00 | 12345 | 67890 |
2023-03-29 06:04:00 | 12345 | |
Speaking of test datasets, the OP should always provide them as text. Screenshots are very unhelpful.
The output would look like
_time | closed_txn | duration | duration1 | eventcount | field_match_sum | key | parent_key | u_key | uu_key |
2023-03-29 06:30:00 | 0 | 4920 | 1200 1560 | 2 | 2 | 12345 67890 | 12345 67890 | 12345 67890 | 67890 |
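As a sanity check on that output, here is a small Python sketch (illustration only, not SPL) that computes the end-to-end duration of the emulated dataset. Since the sample contains a single job chain (12345 → 67890), the span from the earliest to the latest event is the end-to-end duration:

```python
from datetime import datetime

# (timestamp, key, parent_key) rows from the emulated dataset above
events = [
    ("2023-03-29 07:26:00", "67890", None),
    ("2023-03-29 07:06:00", "67890", None),
    ("2023-03-29 06:30:00", "12345", "67890"),
    ("2023-03-29 06:04:00", "12345", None),
]

def to_epoch(ts: str) -> float:
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").timestamp()

times = [to_epoch(t) for t, _, _ in events]
duration = max(times) - min(times)  # overall span of the chain, in seconds
print(int(duration))  # -> 4920, matching the duration in the output above
```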
Where supported, common table expressions make trees manageable in relational databases. Graph databases make trees trivial. The problem isn't intractable in Splunk; it's just different.
CTEs are not your common SQL. 😉
And with basic SQL it's very annoying to operate on tree-like structures. I had to do it in ASE 11.9.2, so I know what I'm talking about. Good thing it had stored procedures; MySQL at that time didn't 🙂
I used CTEs in ad hoc T-SQL (and less often, CONNECT BY in Oracle) for critical path analysis. It worked very well on indexed identifier and dependency columns. Translating relational SQL solutions to streaming Splunk solutions made up the bulk of my early Splunk experience. I do much less of that now, although the background helps when working with clients on DB Connect solutions, particularly with "UPSERT" style audit tables and normalized, often undocumented third-party schemas.
As I said, sometimes it can be done and sometimes it cannot. CTEs are not part of SQL-92; recursive queries were added in SQL:1999, and some databases started implementing them relatively recently. But we're getting sidetracked here. 😉
With Splunk you have no recursion unless you try some very, very ugly (and hardly maintainable) stuff. So I'd say the original requirement is not fulfillable by Splunk in its raw form.
Hi,
If the start and finish times are in the search results and you only need the total duration, you can use the stats range function to calculate the duration in seconds:
...
| stats min(_time) as start_time max(_time) as finish_time range(_time) as duration
| fieldformat start_time=strftime(start_time, "%F %T")
| fieldformat finish_time=strftime(finish_time, "%F %T")
| fieldformat duration=tostring(duration, "duration")
If you need to know the critical/longest path through the job network, the solution is more involved.
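As a hedged sketch of that more involved case, here is a Python illustration (not SPL) of finding the critical-path length through a job dependency DAG, given per-task durations. The task names and durations are made up for the example:

```python
from functools import lru_cache

# task -> (duration_seconds, list of prerequisite tasks); hypothetical values
tasks = {
    "task3": (600, []),
    "task2": (900, ["task3"]),
    "task1": (300, ["task2"]),
    "task4": (120, ["task1"]),
}

@lru_cache(maxsize=None)
def critical_path_length(task: str) -> int:
    """Longest chain of durations ending at `task` (memoized DAG recursion)."""
    duration, deps = tasks[task]
    return duration + max((critical_path_length(d) for d in deps), default=0)

print(critical_path_length("task4"))  # -> 1920 (600 + 900 + 300 + 120)
```

This is the kind of recursion the thread notes Splunk lacks natively; in SPL you would need a bounded number of explicit join stages instead.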
Thanks, but this is not going to work. It's not that simple; we can't combine everything. The example is just a subset, and there are multiple nested jobs. Also, there is no end time.
In that case, we'll need to see a larger sample dataset (deidentified if necessary). If neither the finish times nor the durations are known, what method have you been using to manually calculate the makespan / end-to-end duration?