Splunk Search

Subsearch Performance Optimization

clincg
Path Finder

Hi Splunk experts, I have a search that joins the results from two source types based on a common field:

sourcetype="userActivity" earliest=-1h@h | join type=inner userID [search sourcertype="userAccount" | fields userID, userType]| stats sum(activityCost) by userType

The problem with the query is that it runs extremely slowly: the outer search (sourcetype="userActivity" earliest=-1h@h) returns only about 300 results, but the subsearch scans millions of user account events. If I remove the subsearch, the outer search takes only a few seconds to complete.

Does anyone know how to optimize the subsearch so that it runs faster? Or is there another way to do this without the slow subsearch?

Thanks!

1 Solution

Stephen_Sorkin
Splunk Employee

Your best bet here is to use a lookup table.

First, export your userAccount sourcetype to a lookup table with a search like:

sourcetype=userAccount | stats first(userType) as userType by userID | outputlookup usertable.csv

You might run into some trouble with stats emitting at most 500K rows. In that case, you'll need to export this table in a different way.
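
One possible workaround, assuming you have admin access on the search head: raise the stats row cap in limits.conf. Treat the stanza below as a sketch; the exact default varies by Splunk version:

[stats]
maxresultrows = 2000000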

Then your search becomes:

sourcetype=userActivity earliest=-1h@h | lookup usertable.csv userID OUTPUT userType | stats sum(activityCost) by userType

gkanapathy
Splunk Employee

Seems to me it would be much better to use a subsearch to drive an indexed search into your userAccount sourcetype, since your first search only returns about 300 results:

( sourcetype="userAccount"
  [ search sourcetype="userActivity" earliest=-1h@h 
    | fields userID | format maxresults=500 ] ) 
OR 
( sourcetype="userActivity" earliest=-1h@h) 
| stats 
    first(userType) as userType
    sum(activityCost) as activityCost
  by 
    userID
| stats sum(activityCost) by userType
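
For illustration (the IDs below are made up), the format command renders the subsearch results as an indexable filter, so the first clause of the generated search looks something like:

( sourcetype="userAccount" ( ( userID="1001" ) OR ( userID="1002" ) OR ... ) )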

You will also have to increase the following limits.conf settings so they stay above the ~300 results coming out of the subsearch:

[subsearch]
maxout=500

[format]
maxresults=500

clincg
Path Finder

Thanks for your input. The above query won't work because there is no guarantee that the userActivity search returns fewer than 500 results. As more users are created in our system, it will definitely go above 500.

Stephen_Sorkin
Splunk Employee

Gerald, that's not needed with the stats. In fact, it will make it slower.

gkanapathy
Splunk Employee

I would stick a "| dedup userType,userID" in there before the "| outputlookup"
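
That is, presumably something like this (with the dedup placed just before the export):

sourcetype=userAccount | stats first(userType) as userType by userID | dedup userType,userID | outputlookup usertable.csv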

Stephen_Sorkin
Splunk Employee

For the lookup table, I would suggest building the table up front using a search like the one above. If it takes a long time to build, I'd suggest incrementally adding to it via:

sourcetype=userAccount earliest=-1h@h | inputlookup usertable.csv append=t | stats first(userType) as userType by userID | outputlookup usertable.csv
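
To keep the table fresh automatically, that search could run on a schedule; a minimal savedsearches.conf sketch (the stanza name is made up):

[update_usertable_lookup]
search = sourcetype=userAccount earliest=-1h@h | inputlookup usertable.csv append=t | stats first(userType) as userType by userID | outputlookup usertable.csv
enableSched = 1
cron_schedule = 0 * * * *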

If your userActivity search only ever retrieves hundreds of distinct users and the lookup table doesn't scale well, I'd suggest writing a lookup script that executes one or more searches against sourcetype=userAccount to perform the lookup.

clincg
Path Finder

the userID->userType mapping is not guaranteed to be in recent time; that's why it has to search the whole userAccount sourcetype across all time.

clincg
Path Finder

The userID->userType is spread across many different places, so we prefer not to use the python script to query the RDBMS. If we were to go forward with the look up table approach, what approach would you recommend for updating the look up table? Is that possible to update/create the table on the fly using only Splunk query?

Stephen_Sorkin
Splunk Employee

Is there any guarantee that the specific userID->userType mapping appears in recent events, or does this need to run over all time?

Stephen_Sorkin
Splunk Employee

You don't have to worry too much about the lookup getting too big; once it exceeds a certain size, Splunk indexes it for fast retrieval. Your other choice here is a script-based lookup: if the userID->userType mapping lives in an RDBMS, you can write a Python script that performs the lookup. The contract for the script is that it receives CSV with blank values on its stdin and must produce CSV on its stdout with the blanks filled in. This should scale with the query speed of the RDBMS. The last possibility is a script that runs searches against sourcetype=userAccount.
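
A minimal sketch of that stdin/stdout contract, assuming the lookup fields are userID and userType, and a hypothetical fetch_user_type() backend query:

import csv
import sys

def fetch_user_type(user_id):
    # Hypothetical hook: query the RDBMS (or run a Splunk search) for this user.
    return None

def main():
    reader = csv.DictReader(sys.stdin)
    writer = csv.DictWriter(sys.stdout, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        # Splunk hands us rows with userType blank; fill in the blanks.
        if not row.get("userType"):
            row["userType"] = fetch_user_type(row["userID"]) or ""
        writer.writerow(row)

if __name__ == "__main__":
    main()

You would then point a transforms.conf stanza at the script, along these lines (stanza and file names made up):

[usertype_lookup]
external_cmd = usertype_lookup.py userID userType
fields_list = userID, userType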

clincg
Path Finder

Thanks for your suggestion. The problem with using a lookup table is that the number of users grows every few minutes, so I imagine we would have to update the lookup table constantly, and once it gets too big it would no longer be efficient.
