Hi Splunk experts, I have a search that joins the results from two source types based on a common field:
sourcetype="userActivity" earliest=-1h@h | join type=inner userID [search sourcertype="userAccount" | fields userID, userType]| stats sum(activityCost) by userType
The problem with the query is that it runs extremely slowly: the outer search (sourcetype="userActivity" earliest=-1h@h) only returns about 300 results, but the subsearch has to scan millions of user accounts. If I remove the subsearch, the outer search takes only a few seconds to complete.
Does anyone know how to optimize the subsearch so that it runs faster? Or is there another way to do this without the slow join/subsearch?
Thanks!
Seems to me it would be much better to use a subsearch to drive an indexed search into your userAccount sourcetype, since your first search only returns about 300 results:
( sourcetype="userAccount"
[ search sourcetype="userActivity" earliest=-1h@h
| fields userID | format maxresults=500 ] )
OR
( sourcetype="userActivity" earliest=-1h@h)
| stats
first(userType) as userType
sum(activityCost) as activityCost
by
userID
| stats sum(activityCost) by userType
You will also have to increase these settings in limits.conf to something above 300:
[subsearch]
maxout=500
[format]
maxresults=500
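For what it's worth, the format in the subsearch just renders the matching userIDs back into the outer search as a literal search expression, so (with made-up userIDs) the userAccount half effectively expands to something like:
( sourcetype="userAccount" ( ( userID="1001" ) OR ( userID="1002" ) OR ( userID="1003" ) ) )
That lets Splunk pull only the matching account events straight from the index instead of joining after the fact.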
Thanks for your input. The above query won't work, because there is no guarantee that there are fewer than 500 results from the userActivity search. As more users are created in our system, the userActivity results will definitely exceed 500.
Your best bet here is to use a lookup table.
First export your userAccount sourcetype to a lookup table like:
sourcetype=userAccount | stats first(userType) as userType by userID | outputlookup usertable.csv
You might run into some trouble with stats capping its output at 500K rows. In that case, you'll need to export this table in a different way.
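One sketch of a different way, assuming your version of outputlookup supports append=true, is to build the table in slices small enough for stats and append them, e.g. splitting on a leading userID character (the split key here is just an illustration):
sourcetype=userAccount userID=1* | stats first(userType) as userType by userID | outputlookup usertable.csv
sourcetype=userAccount userID=2* | stats first(userType) as userType by userID | outputlookup append=true usertable.csv
The first slice creates usertable.csv and each later slice appends to it.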
Then your search becomes:
sourcetype=userActivity earliest=-1h@h | lookup usertable.csv userID OUTPUT userType | stats sum(activityCost) by userType
Gerald, not needed with the stats. In fact that will make it slower.
I would stick a "| dedup userType,userID" in there before the "| outputlookup"
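Taken literally, that would put the dedup in the export search like this:
sourcetype=userAccount | stats first(userType) as userType by userID | dedup userType, userID | outputlookup usertable.csv
though, per the comment above, the stats already returns a single row per userID, so the dedup shouldn't change the output.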
For the lookup table, I would suggest building the table up front using a search like the one above. If it takes a long time to build, I'd suggest incrementally adding to it via "sourcetype=userAccount earliest=-1h@h | inputlookup usertable.csv append=t | stats first(userType) as userType by userID | outputlookup usertable.csv".
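Spelled out on its own line (and assuming usertable.csv already exists, since inputlookup will complain otherwise), that incremental update is:
sourcetype=userAccount earliest=-1h@h | inputlookup usertable.csv append=t | stats first(userType) as userType by userID | outputlookup usertable.csv
Because the fresh userAccount events stream through before the appended lookup rows, first(userType) should prefer a newly seen mapping over whatever was already in the table.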
If your userActivity search only ever retrieves hundreds of distinct users, and the lookup table doesn't scale well, I'd suggest writing a lookup script that executes one or more searches against sourcetype=userAccount to perform the lookup.
The userID->userType mapping is not guaranteed to be in recent time; that's why it has to search the whole userAccount sourcetype across all time.
The userID->userType mapping is spread across many different places, so we'd prefer not to use a Python script to query an RDBMS. If we were to go forward with the lookup table approach, what approach would you recommend for updating the lookup table? Is it possible to update/create the table on the fly using only a Splunk query?
Is there any guarantee that the specific userID->userType mapping is in recent time or does this need to run over all time?
You don't have to worry too much about the lookup getting too big. Once it exceeds a certain size, we will index it for fast retrieval. Your other choice here is to create a script-based lookup. If the userID->userType mapping lives in an RDBMS, you can write a Python script that performs the lookup. The contract for the script is that it receives CSV with blank values on its stdin, and it should produce CSV on its stdout with the blanks filled in. This should easily scale with the query speed of the RDBMS. The last possibility is a script that runs searches against sourcetype=userAccount.
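If you do go the scripted route, note that the search side barely changes; with a (hypothetical) lookup definition named usertype_script pointing at your script in transforms.conf, the query would be:
sourcetype=userActivity earliest=-1h@h | lookup usertype_script userID OUTPUT userType | stats sum(activityCost) by userType
The script is only handed the handful of distinct userIDs from the last hour of activity, so it scales with the RDBMS query speed as described above.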
Thanks for your suggestion. The problem with using a lookup table is that the number of users grows every few minutes. I would imagine that we would have to constantly update the lookup table, and once it gets too big it will no longer be efficient.