Solved: Using join in postprocess takes longer than duplicating the search

ahmetcepoglu · ‎02-19-2014

Hello

I have 3 searchmanagers like so (the actual queries are longer)

{% searchmanager id="s1" search="index=abc | top name" %}
{% postprocessmanager id="s2" managerid="s1" search=" | join name [index=abc | stats count(x) by name ]" %}
{% searchmanager id="s3" search="index=abc | top name | join name [index=abc | stats count(x) by name ]" %}

You can see how s3 does the same thing as s1 and s2 combined.
I have tables that show each search's results, and they finish in the following order: s1, s3, s2

How can post-processing not save me time here?

dwaddle · ‎02-19-2014

"Because join" ... Splunk's join command is useful and sometimes necessary. But for many use cases, particularly for people coming from a SQL background, it is the worst way to solve the problem.

I realize these are just examples and you have more complex real-life searches but let's consider your "s3" search as an example:

index=abc | top name | join name [index=abc | stats count(x) by name ]

In this search, you have asked Splunk to run a dense search - index=abc - twice. Then you've asked it to join those results together. Before "s3" can return any results, it has to dispatch both of those searches, gather the events matching both, and then perform the join. It is possible (but unconfirmed) that Splunk can run those two searches in parallel, but it may just as well not.

However, in a postprocess, the first search absolutely has to finish before the second can begin. So you've asked Splunk to run a dense search, and then wait for it to finish before starting another dense search to join the results together. It is just not going to end well.

Almost any use of join for the purpose of adding on stats data can be accomplished much more efficiently through use of the eventstats command, with the occasional eval here and there. I would suggest rewriting your search to be less SQL-like and take advantage of the different tools that Splunk offers to avoid relying on join except where you definitely need it.
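
As a rough sketch of the eventstats approach described above, the "s3" example might collapse into a single pass over index=abc. This assumes x is a field on the same events that top is counting; x_count and total are placeholder names introduced here, and the percent column only approximates what top reports. It has not been run against real data, so treat it as a starting point rather than a drop-in replacement:

```spl
index=abc
| stats count, count(x) AS x_count BY name
| eventstats sum(count) AS total
| eval percent = round(100 * count / total, 2)
| sort 10 -count
| fields - total
```

Here stats produces both counts per name in one scan, eventstats adds the overall total to every row so a percentage can be computed, and sort 10 keeps the ten most frequent names, which is the default limit for top. The index is searched once and no subsearch is dispatched.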

ahmetcepoglu · ‎02-19-2014

That makes sense, thanks.

somesoni2 · ‎02-19-2014

Postprocesses are primarily for code/query reuse (they help you maintain it efficiently). Performance is not guaranteed in terms of time, though they do save resources. Also, a postprocess should be doing further filtering/processing on existing data.
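
To illustrate that last point, here is a minimal sketch in the same tag style as the question, where the postprocess only trims results the base search has already produced instead of launching a second search. The ids "base" and "top10" and the field name x_count are made up for the example:

```
{% searchmanager id="base" search="index=abc | stats count, count(x) AS x_count BY name" %}
{% postprocessmanager id="top10" managerid="base" search=" | sort 10 -count" %}
```

Because the postprocess contains no subsearch, it never goes back to the index; it only reorders and truncates the rows the base search returned.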
