This widget could not be displayed.
  • Splunk Search

    Subsearch Performance Optimization

    clincg
    Path Finder

    Hi Splunk experts, I have a search that joins the results from two source types based on a common field:

    sourcetype="userActivity" earliest=-1h@h | join type=inner userID [search sourcertype="userAccount" | fields userID, userType]| stats sum(activityCost) by userType
    

    The problem with the query is that it runs extremely slow - the outer search " sourcetype="userActivity" earliest=-1h@h" only returns about 300 results, but the subsearch is searching across millions of users accounts. If I removed the sub search, the outer search only takes a few seconds to complete.

    Does anyone know how to optimize the subsearch so that it can run faster? Or is there any other way to do it without using the slow subsearch command?

    Thanks!

    Tags (1)
    1 Solution

    Stephen_Sorkin
    Splunk Employee
    Splunk Employee

    Your best bet here is to use a lookup table.

    First export your userAccount sourcetype to a lookup table like:

    sourcetype=userAccount | stats first(userType) as userType by userID | outputlookup usertable.csv
    

    You might run into some trouble with stats only emitting 500K rows. In that case, you'll need to export this table in a different way.

    Then your search becomes:

    sourcetype=userActivity earliest=-1h@h | lookup usertable.csv userID OUTPUT userType | stats sum(activityCost) by userType
    

    View solution in original post

    gkanapathy
    Splunk Employee
    Splunk Employee

    Seems to me it would be much better to do use a subsearch and use indexed search into your userAccount sourcetype, if your first search only returns about 300 results:

    ( sourcetype="userAccount"
      [ search sourcetype="userActivity" earliest=-1h@h 
        | fields userID | format maxresults=500 ] ) 
    OR 
    ( sourcetype="userActivity" earliest=-1h@h) 
    | stats 
        first(userType) as userType
        sum(activityCost) as activityCost
      by 
        userID
    | stats sum(activityCost) by userType
    

    You will also have to increase settings in limits.conf to above 300:

    [subsearch]
    maxout=500
    
    [format]
    maxresults=500
    
    0 Karma

    clincg
    Path Finder

    thanks for your input. The above query won't work because there is no guarantee that there are less than 500 results from the userActivity search. As more users are created in our system, the userActivity will definitely go above 500.

    0 Karma

    Stephen_Sorkin
    Splunk Employee
    Splunk Employee

    Your best bet here is to use a lookup table.

    First export your userAccount sourcetype to a lookup table like:

    sourcetype=userAccount | stats first(userType) as userType by userID | outputlookup usertable.csv
    

    You might run into some trouble with stats only emitting 500K rows. In that case, you'll need to export this table in a different way.

    Then your search becomes:

    sourcetype=userActivity earliest=-1h@h | lookup usertable.csv userID OUTPUT userType | stats sum(activityCost) by userType
    

    Stephen_Sorkin
    Splunk Employee
    Splunk Employee

    Gerald, not needed with the stats. In fact that will make it slower.

    gkanapathy
    Splunk Employee
    Splunk Employee

    I would stick a "| dedup userType,userID" in there before the "| outputlookup"

    0 Karma

    Stephen_Sorkin
    Splunk Employee
    Splunk Employee

    For the lookup table, I would suggest building the table up front using a search like the one above. If it takes a long time to build, I'd suggest incrementally adding to it via "sourcetype=userAccount earliest=-1h@h | intputlookup usertable.csv append=t | stats first(userType) as userType by userID | outputlookup usertable.csv".

    If your userActivity search ever only retrieves hundreds of distinct users, and the lookup table doesn't scale well, I'd suggest making a lookup script that executes one or more searches against sourcetype=userAccount to achieve the lookup.

    clincg
    Path Finder

    the userID->userType mapping is not guaranteed to be in recent time; that's why it has to search the whole userAccount sourcetype across all time.

    0 Karma

    clincg
    Path Finder

    The userID->userType is spread across many different places, so we prefer not to use the python script to query the RDBMS. If we were to go forward with the look up table approach, what approach would you recommend for updating the look up table? Is that possible to update/create the table on the fly using only Splunk query?

    0 Karma

    Stephen_Sorkin
    Splunk Employee
    Splunk Employee

    Is there any guarantee that the specific userID->userType mapping is in recent time or does this need to run over all time?

    0 Karma

    Stephen_Sorkin
    Splunk Employee
    Splunk Employee

    You don't have to worry too much about the lookup getting too big. Once it exceeds a certain size, we will index it for fast retrieval. Your other choice here is to create a script based lookup. If the userID->userType mapping lives in an RDBMS, you can write a python script that performs the lookup. The contract for the script is that it will receive CSV with blank values on its stdin, and it should produce CSV on its stdout with the blanks filled in. This should easily scale with the query speed of the RDBMS. Now the last possibility is a script runs searches against sourcetype=userAccount.

    0 Karma

    clincg
    Path Finder

    Thanks for your suggestion. The problem with using look up table is that the number of users is increasing every few minutes, I would imagine that we will have to constantly update the look up table and when the look up table gets way too big, it will not be efficient anymore.

    0 Karma
    Get Updates on the Splunk Community!

    Take Your Breath Away with Splunk Risk-Based Alerting (RBA)

    WATCH NOW!The Splunk Guide to Risk-Based Alerting is here to empower your SOC like never before. Join Haylee ...

    Industry Solutions for Supply Chain and OT, Amazon Use Cases, Plus More New Articles ...

    Splunk Lantern is a Splunk customer success center that provides advice from Splunk experts on valuable data ...

    Enterprise Security Content Update (ESCU) | New Releases

    In November, the Splunk Threat Research Team had one release of new security content via the Enterprise ...