Splunk Search

How to optimize a search which contains a join of 2 queries, each with a different time range?

mah
Builder

Hi, I have a performance issue with a query using a "join" command.

The problem is that the first search uses a time picker set to the last 4 hours, while the subsearch inside the join (type=outer) uses "earliest=-30d".

Example of the query:

index="A" sourcetype="AB" source="C" 
| eval launch_time=round(strptime(launch_time, "%Y-%m-%dT%H:%M:%S"),0)
| eval search_time=now()
| eval launched_since=round((search_time-launch_time)/86400,0)
| where launched_since > 7
| dedup id sortby -_time
| lookup all_ids account_id OUTPUT acc_name site | site=*

| join type=outer id [ search index="A" sourcetype="AC" source="D" earliest=-30d
| lookup all_ids account_id OUTPUT acc_name site | site=*
| rename agentId as id
| dedup rpg id
| sort rpg 
| stats values(rpg_name) as pg by id acc_name site
| eval Name=if(like(pg,"%name1/%"),"Name1","Name2")
| table id title platform pg Name]


| table site acc_name id pg Name launched_since 
| dedup acc_name id
| eval Name2=if(isnull(Name), "NULL", Name)
| stats count(id) as count by Name2

 

Is it possible to make this search more efficient?

Thanks in advance !


bowesmana
SplunkTrust

You need to be careful using join, especially when, as in your example, you are searching over 30 days. There are subsearch limits that apply to join, particularly on how long the subsearch may run and the size of its result set - I think the defaults are 60 seconds and 50,000 rows.

Generally there are always ways to avoid joins, but given your example, I have the following observations.

  • Your subsearch is doing dedup, sort and stats
    • It looks as though your dedup may be redundant: you are deduping by rpg and id (agentId) and then following it with stats values() on rpg_name, which I am guessing is related to rpg - values() will dedup anyway, and you are already splitting by id
    • Is there any need for the sort before stats? It looks redundant - stats values() returns its values sorted anyway
  • Your lookup can be done _after_ the stats, I believe.
    • Currently your stats is split by acc_name and site, but unless your lookup returns multiple values for the account_id, you could do the lookup after the stats and split by account_id instead
  • I am guessing the "| site=*" is a typo in the search and assume it's intended as a where or search clause
  • The lookup in both searches could be done after the join?
  • Always delay operations until after stats commands if possible so you are working on aggregated data rather than raw data. Although stats will remove data not part of the clause, you can often add extra split by fields that you want to preserve that have no effect on the split.
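Putting those observations together, your subsearch could be trimmed to something like this (a sketch based on the field names in your example - verify them against your actual data, and note I am assuming account_id is available in the events so the lookup can move after the stats):

index="A" sourcetype="AC" source="D" earliest=-30d
| rename agentId as id
| stats values(rpg_name) as pg by id account_id
| lookup all_ids account_id OUTPUT acc_name site
| search site=*
| eval Name=if(like(pg,"%name1/%"),"Name1","Name2")

The dedup and sort are gone because values() dedups and sorts for you, and the lookup now runs once per aggregated row rather than once per raw event.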

However, I have almost always found the construct of 

(search data set 1 date_range_1) OR (search data set 2 date_range_2)
| eval ds1=if(isnotnull(field_in_ds1),1,0)

and then using eval/if and stats to aggregate the two data sets accordingly to be more efficient than most joins when working with large data sets.
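Applied to your example, that pattern might look something like this (a sketch only - it reuses the index/sourcetype names and time ranges from your post, and assumes id/agentId is the common key):

(index="A" sourcetype="AB" source="C" earliest=-4h) OR (index="A" sourcetype="AC" source="D" earliest=-30d)
| eval id=coalesce(id, agentId)
| eval from_outer=if(sourcetype=="AB",1,0)
| stats max(from_outer) as in_outer values(rpg_name) as pg by id
| where in_outer=1

The stats brings the two data sets together on id, and the where keeps only ids that appeared in the "outer" data set - the same effect as your outer join, but with no subsearch limits.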

One big problem with joins is that you will almost never know your join has hit a limit, so the subsearch can silently return an incomplete result set, which in turn makes the outer search incorrect (but that's when we call them 'statistics', right?? :))

Hope this helps


mah
Builder

Thanks for your answer. 

I applied your recommendations, but the join is still really inefficient.

I did not mention it in my initial post, but I have a rex command in the subsearch: can this slow down the overall search?


anilchaithu
Builder

@mah 

1. Use fields to keep only the required fields right after the base search, in both searches. This will improve performance:

index="A" sourcetype="AB" source="C" 

| fields X Y Z

2. You can optimize the subsearch, specifically these three commands:

| dedup rpg id
| sort rpg 
| stats values(rpg_name) as pg by id acc_name site
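For instance, since stats values() already dedups and sorts its values, those three commands can likely collapse into the stats alone (assuming the deduped rpg values are only used to feed rpg_name):

| stats values(rpg_name) as pg by id acc_name site

That removes two full passes over the subsearch results.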
