Splunk Search

How to optimize a search which contains a join of 2 queries, each with a different time range?

mah
Builder

Hi, I have a performance issue with a query using a "join" command.

The problem is that the first search uses a time picker set to the last 4 hours, while the subsearch inside the join (type=outer) uses "earliest=-30d".

Example of the query:

index="A" sourcetype="AB" source="C" 
| eval launch_time=round(strptime(launch_time, "%Y-%m-%dT%H:%M:%S"),0)
| eval search_time=now()
| eval launched_since=round((search_time-launch_time)/86400,0)
| where launched_since > 7
| dedup id sortby -_time
| lookup all_ids account_id OUTPUT acc_name site | site=*

| join type=outer id [ search index="A" sourcetype="AC" source="D" earliest=-30d
| lookup all_ids account_id OUTPUT acc_name site | site=*
| rename agentId as id
| dedup rpg id
| sort rpg 
| stats values(rpg_name) as pg by id acc_name site
| eval Name=if(like(pg,"%name1/%"),"Name1","Name2")
| table id title platform pg Name]


| table site acc_name id pg Name launched_since 
| dedup acc_name id
| eval Name2=if(isnull(Name), "NULL", Name)
| stats count(id) as count by Name2

 

Is it possible to make this search more efficient?

Thanks in advance !


bowesmana
SplunkTrust

You need to be careful using join, especially when, as in your example, you are searching over 30 days. There are subsearch limits that apply to join, particularly on how long the subsearch may run and the size of its result set - I think the defaults are 60 seconds and 50,000 rows.

Generally there are always ways to avoid joins, but given your example, I have the following observations.

  • Your subsearch is doing dedup, sort and stats
    • It looks as though your dedup may be redundant: you are deduping by rpg and id (agentId) and then following it with stats values() on rpg_name, which I am guessing is related to rpg - values() will dedup anyway, and you are already splitting by id
    • Is there any need for the sort before stats? It looks redundant - stats values() returns its values sorted anyway
  • Your lookup can be done _after_ the stats, I believe.
    • Currently your stats is split by acc_name and site, but unless your lookup returns multiple values for the account_id, you could do the lookup after the stats and split by account_id instead
  • I am guessing the "| site=*" is a typo in the search and assume it's intended as a where or search clause
  • The lookup in both searches could be done after the join?
  • Always delay operations until after stats commands if possible so you are working on aggregated data rather than raw data. Although stats will remove data not part of the clause, you can often add extra split by fields that you want to preserve that have no effect on the split.
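Putting those observations together, your subsearch could be trimmed to something like this (a sketch based on the field names in your example - verify them against your actual data, and note I am assuming account_id is available in the events so the lookup can move after the stats):

index="A" sourcetype="AC" source="D" earliest=-30d
| rename agentId as id
| stats values(rpg_name) as pg by id account_id
| lookup all_ids account_id OUTPUT acc_name site
| search site=*
| eval Name=if(like(pg,"%name1/%"),"Name1","Name2")

The dedup and sort are gone because values() dedups and sorts for you, and the lookup now runs once per aggregated row rather than once per raw event.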

However, I have almost always found the construct of 

(search data set 1 date_range_1) OR (search data set 2 date_range_2)
| eval ds1=if(isnotnull(field_in_ds1),1,0)

and then using eval/if and stats to aggregate the two data sets accordingly to be more efficient than most joins when working with large data sets.
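Applied to your example, that pattern might look something like this (a sketch only - it reuses the index/sourcetype names and time ranges from your post, and assumes id/agentId is the common key):

(index="A" sourcetype="AB" source="C" earliest=-4h) OR (index="A" sourcetype="AC" source="D" earliest=-30d)
| eval id=coalesce(id, agentId)
| eval from_outer=if(sourcetype=="AB",1,0)
| stats max(from_outer) as in_outer values(rpg_name) as pg by id
| where in_outer=1

The stats brings the two data sets together on id, and the where keeps only ids that appeared in the "outer" data set - the same effect as your outer join, but with no subsearch limits.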

One big problem with joins is that you will almost never know your join has hit a limit, so the subsearch can silently return an incomplete result set, which in turn makes the outer search incorrect (but that's when we call them 'statistics', right?? :))

Hope this helps


mah
Builder

Thanks for your answer. 

I applied your recommendations, but the join is still really inefficient.

I did not mention it in my initial post, but I have a rex command in the subsearch: can this slow down the overall search?


anilchaithu
Builder

@mah 

1. Use fields to keep only the required fields right after the base search, in both searches. This will improve performance:

index="A" sourcetype="AB" source="C" 

| fields X Y Z

2. You can optimize the subsearch, specifically these three commands:

| dedup rpg id
| sort rpg 
| stats values(rpg_name) as pg by id acc_name site
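For instance, since stats values() already dedups and sorts its values, those three commands can likely collapse into the stats alone (assuming the deduped rpg values are only used to feed rpg_name):

| stats values(rpg_name) as pg by id acc_name site

That removes two full passes over the subsearch results.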
