Splunk Search

How to optimize the performance of a slow, composite search

d044160
Explorer

I'd like to get some opinions on the following search. We're not thrilled with its performance, and I'm sure there's a lot of potential to improve it.

What the search is supposed to do: This is a kind of "service check"; on each server we check CPU, memory, and disk space. The data comes from different indexes, needs different result calculations, and so on. Our approach was to use append quite heavily, basically constructing different searches with the same result fields that we could then combine into one table. The search is limited to data from the last 5 minutes. It is used within an app built on the JavaScript framework, with all $parameters$ defaulting to "*".

The search inspector says "This search has completed and has returned 19 results by scanning 25,204 events in 85.338 seconds." The components taking the most time are:
1. dispatch.evaluate.append
2. command.transaction
3. command.search
4. dispatch.stream.remote
5. dispatch.fetch

Any general remarks or hints on inspection and improvement strategies for searches are highly appreciated. Thanks!

There it is (I cut a lot of evals):

search host="$filterHost$" host_category="build-servers" host_group="$filterHostGroup$" index="perfmon" (object="LogicalDisk" (instance = "C:" OR instance ="D:") (counter="% Free Space" OR counter="Free Megabytes")) OR (object="Processor" counter="% Processor Time" instance="_Total")
    | 2x eval ..
    | eval transactionKey="" + host + ":" + instance + ":" + _time 
    | transaction maxevents=4 transactionKey 
    | eval dedupKey=host + ":" + instance 
    | dedup dedupKey sortby -_time 
    | 12x eval
    | append [
    search host="$filterHost$" host_category="build-servers" host_group="$filterHostGroup$" index=perfmon sourcetype="base_pagefile" 
        | dedup host sortby -_time 
        | rex field=_raw max_match=0  "AllocatedBaseSize=(?<alloc>.*)" 
        | rex max_match=0 field=_raw "CurrentUsage=(?<usage>.*)" 
        | eval tmpCounter=mvcount(alloc) 
        | stats max(tmpCounter) sum(alloc) as Total, sum(usage) as Used by _time, host 
        | 9x eval 
    ] 
    | append [
    search index="os" host="$filterHost$" host_category="build-servers" host_group="$filterHostGroup$" 
            (source=df (MountedOn=/ OR MountedOn=/tmp OR MountedOn=/data OR MountedOn=/net/x OR MountedOn=/mnt/y OR MountedOn=/mnt/z OR MountedOn=/mnt/w)) 
            OR (source=cpu CPU="all") 
            OR (sourcetype="vmstat") 
        | eval 2x
        | dedup dedupKey sortby -_time 
        | 14x eval
    ]
    | append [ 
    search index="perfmon" sourcetype="base_systeminfo" host="$filterHost$" host_category="build-servers" host_group="$filterHostGroup$" 
        | dedup host sortby -_time 
        | eval 9x
    ] 
    | search $filterStatus$ service="$filterService$"  
    | eval statusNumber=case(status=="CRITICAL", 0, status=="WARNING", 1, status=="OK", 2) 
    | sort statusNumber -host, -service 
    | eval "Last Check"=strftime(_time, "%F %T") 
    | rename host as "Host" service as "Service" status as "Status" value AS "Description" 
    | table Host Service Status "Description" "Last Check"

martin_mueller
SplunkTrust

Off the top of my head I see two suspect things.

First, startup.handoff at the bottom is seven seconds... that's a lot.

Second, you're streaming five million events off disk. Do you need that many in your results, or is there an inefficient filter causing this? To dig down, run the search up to (but not including) the first pipe and compare the scanCount with the resultCount in the job inspector.
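
For example, a minimal way to do that comparison is to run just the base search over the same 5-minute window (a sketch of the first leg of the posted search with the dashboard tokens replaced by *; adjust the filters to your actual values):

search earliest=-5m host="*" host_category="build-servers" host_group="*" index="perfmon" (object="LogicalDisk" (instance="C:" OR instance="D:") (counter="% Free Space" OR counter="Free Megabytes")) OR (object="Processor" counter="% Processor Time" instance="_Total")

Then open Job > Inspect Job and compare scanCount with resultCount; a scanCount far above the resultCount usually means events are being pulled off disk only to be thrown away later in the pipeline.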

d044160
Explorer

Sounds reasonable 🙂

Inspection of the first part of the search looks like this: [screenshot of the job inspector]


pedromvieira
Communicator

Have you tried using SET instead of APPEND?

http://docs.splunk.com/Documentation/Splunk/latest/SearchReference/Set

You could also try to accelerate it with a summary index or data model acceleration.
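
For reference, set takes two subsearches and combines their result sets, roughly along these lines (a sketch built from two trimmed-down legs of your search; note that the usual subsearch result limits apply, and only a test run will show whether it actually beats append here):

| set union 
    [ search earliest=-5m index="perfmon" sourcetype="base_pagefile" host_category="build-servers" 
        | dedup host sortby -_time ] 
    [ search earliest=-5m index="perfmon" sourcetype="base_systeminfo" host_category="build-servers" 
        | dedup host sortby -_time ]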

d044160
Explorer

From my minimal understanding of Splunk internals: wouldn't APPENDing the results of two searches be cheaper than a SET (UNION) operation? But we'll check that; it should be easy enough.

"Summary Index", I think that might be where we're heading. Thanks!

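If we do go the summary index route, the usual pattern would be a scheduled search that pre-computes the per-host status and writes it out with collect, so the dashboard only reads the small summarized events. A rough sketch for the CPU check alone, where summary_service_check is a made-up index name, Value is assumed to be the perfmon counter value field in the data, and the thresholds are placeholders for whatever the evals actually compute:

search earliest=-5m index="perfmon" host_category="build-servers" object="Processor" counter="% Processor Time" instance="_Total" 
    | stats latest(Value) as cpu_pct by host 
    | eval service="CPU", status=case(cpu_pct>=95, "CRITICAL", cpu_pct>=85, "WARNING", 1=1, "OK") 
    | collect index=summary_service_check

The dashboard search then becomes something like search index=summary_service_check service="$filterService$", which only scans a handful of events per refresh.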

martin_mueller
SplunkTrust

I fear the timings in the job inspector may not be that accurate when heavy subsearches are involved... I think the entire subsearch's time gets attributed to the append command, or something like that.

To track down what's taking so long, run each subsearch individually and post a screenshot of the top of the job inspector, or at least the time it took, the result count, the scan count, and the biggest time hogs.

My crystal ball says one of those might be loading a huge number of events, but we'll see.
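
For example, the pagefile leg on its own would be (copied from the posted search, with the tokens replaced by * and the window pinned to the last five minutes):

search earliest=-5m host="*" host_category="build-servers" host_group="*" index=perfmon sourcetype="base_pagefile" 
    | dedup host sortby -_time 
    | rex field=_raw max_match=0 "AllocatedBaseSize=(?<alloc>.*)" 
    | rex field=_raw max_match=0 "CurrentUsage=(?<usage>.*)" 
    | eval tmpCounter=mvcount(alloc) 
    | stats max(tmpCounter) sum(alloc) as Total, sum(usage) as Used by _time, host

The job inspector of that run then shows the scan count and runtime for just this leg, without the append accounting getting in the way.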

bmacias84
Champion

I think some of your problems are related to transaction. The transaction command is great but adds quite a bit of overhead. You are also running subsearches, which run serially. Consider using multisearch instead of append.
http://docs.splunk.com/Documentation/Splunk/6.2.1/SearchReference/Multisearch
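
A rough sketch of how that could look here (multisearch only accepts streaming, distributable subsearches, so the dedup/transaction steps have to move after it or be replaced; the leg field is just a made-up marker to keep the result sets apart):

| multisearch 
    [ search earliest=-5m index="perfmon" sourcetype="base_pagefile" host_category="build-servers" 
        | eval leg="pagefile" ] 
    [ search earliest=-5m index="perfmon" sourcetype="base_systeminfo" host_category="build-servers" 
        | eval leg="systeminfo" ] 
| dedup host leg sortby -_time

The bracketed searches run at the same time instead of one after another, and the per-host reduction happens on the combined stream outside the brackets.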


d044160
Explorer

I did read up on multisearch, but it seems it would collide with the dedups and transactions, right?
