Splunk Search

How to optimize the performance of a slow, composite search

d044160
Explorer

I'd like to have some opinions on the following search. We're not thrilled with its performance, and I'm sure there's a lot of potential to improve it.

What the search is supposed to do: it is a kind of "service check". On each server we check CPU, memory, and disk space. The data comes from different indexes and needs different result calculations, so our approach was to use append quite heavily: we construct separate searches that produce the same result fields, then combine them into one table. The search only looks at data from the last 5 minutes. It is used within an app built on the JavaScript framework, with all $parameters$ defaulting to "*".

The search inspector says "This search has completed and has returned 19 results by scanning 25,204 events in 85.338 seconds." The components taking the most time are:
1. dispatch.evaluate.append
2. command.transaction
3. command.search
4. dispatch.stream.remote
5. dispatch.fetch

Any general remarks or hints on inspection and improvement strategies for searches are highly appreciated. Thanks!

There it is (I cut a lot of evals):

search host="$filterHost$" host_category="build-servers" host_group="$filterHostGroup$" index="perfmon" (object="LogicalDisk" (instance = "C:" OR instance ="D:") (counter="% Free Space" OR counter="Free Megabytes")) OR (object="Processor" counter="% Processor Time" instance="_Total")
    | 2x eval ..
    | eval transactionKey="" + host + ":" + instance + ":" + _time 
    | transaction maxevents=4 transactionKey 
    | eval dedupKey=host + ":" + instance 
    | dedup dedupKey sortby -_time 
    | 12x eval
    | append [
    search host="$filterHost$" host_category="build-servers" host_group="$filterHostGroup$" index=perfmon sourcetype="base_pagefile" 
        | dedup host sortby -_time 
        | rex field=_raw max_match=0  "AllocatedBaseSize=(?<alloc>.*)" 
        | rex max_match=0 field=_raw "CurrentUsage=(?<usage>.*)" 
        | eval tmpCounter=mvcount(alloc) 
        | stats max(tmpCounter) sum(alloc) as Total, sum(usage) as Used by _time, host 
        | 9x eval 
    ] 
    | append [
    search index="os" host="$filterHost$" host_category="build-servers" host_group="$filterHostGroup$" 
            (source=df (MountedOn=/ OR MountedOn=/tmp OR MountedOn=/data OR MountedOn=/net/x OR MountedOn=/mnt/y OR MountedOn=/mnt/z OR MountedOn=/mnt/w)) 
            OR (source=cpu CPU="all") 
            OR (sourcetype="vmstat") 
        | eval 2x
        | dedup dedupKey sortby -_time 
        | 14x eval
    ]
    | append [ 
    search index="perfmon" sourcetype="base_systeminfo" host="$filterHost$" host_category="build-servers" host_group="$filterHostGroup$" 
        | dedup host sortby -_time 
        | eval 9x
    ] 
    | search $filterStatus$ service="$filterService$"  
    | eval statusNumber=case(status=="CRITICAL", 0, status=="WARNING", 1, status=="OK", 2) 
    | sort statusNumber -host, -service 
    | eval "Last Check"=strftime(_time, "%F %T") 
    | rename host as "Host" service as "Service" status as "Status" value AS "Description" 
    | table Host Service Status "Description" "Last Check"

martin_mueller
SplunkTrust

Off the top of my head I see two suspect things.

First, startup.handoff at the bottom is seven seconds... that's a lot.

Second, you're streaming five million events off disk. Do you need that many in your results or is there an inefficient filter causing this? To dig down, run the search up until before the first pipe symbol and compare the scanCount with the resultCount.
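
As a sketch of that check, this is the base search from the original post cut off before the first pipe, with the dashboard tokens replaced by "*" (their stated default). The outer parentheses around the two object clauses are added only for readability. Run it over the same 5-minute window and compare scanCount with resultCount in the job inspector:

```
index="perfmon" host="*" host_category="build-servers" host_group="*"
    ((object="LogicalDisk" (instance="C:" OR instance="D:") (counter="% Free Space" OR counter="Free Megabytes"))
    OR (object="Processor" counter="% Processor Time" instance="_Total"))
```

If scanCount is far larger than resultCount, the filter is probably being applied after the events are read off disk rather than narrowing what is read.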

d044160
Explorer

Sounds reasonable 🙂

Inspection of the first part of the search looks like this: [job inspector screenshot not preserved]


pedromvieira
Communicator

Have you tried using SET instead of APPEND?

http://docs.splunk.com/Documentation/Splunk/latest/SearchReference/Set

You can also try accelerating it with a summary index or data model acceleration.
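
A rough sketch of the summary index idea, under several assumptions: a summary index named service_check_summary already exists (the name is made up for illustration), the first search runs as a scheduled search every few minutes, and the eval line is only a stand-in for the status calculation that was cut from the original post. The host is copied into a separate field called server (also a made-up name) so the sketch does not depend on how collect treats the host field.

```
index="perfmon" sourcetype="base_systeminfo" host_category="build-servers"
| dedup host sortby -_time
| eval server=host, service="SystemInfo", status="OK"
| table _time server host_group service status
| collect index=service_check_summary
```

The dashboard search would then read only the small, precomputed result set:

```
index=service_check_summary server="$filterHost$" host_group="$filterHostGroup$" service="$filterService$"
| dedup server service sortby -_time
| table server service status _time
```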

d044160
Explorer

From my minimal understanding of Splunk internals: Wouldn't APPENDing the results from 2 searches be cheaper than a SET (UNION) operation? But we'll check that, should be easy enough.

"Summary Index", I think that might be where we're heading. Thanks!


martin_mueller
SplunkTrust

I fear the timings in the job inspector may not be that accurate when heavy subsearches are involved... I think the entire subsearch is added to the append command or something like that.

To track down what's taking so long, do run each subsearch individually and post a screenshot of the top of the job inspector, or the time it took, the result count, the scan count, and the biggest time hogs.

My crystal ball says one of those might be loading a huge number of events, but we'll see.

bmacias84
Champion

I think some of your problems are related to transaction. The transaction command is great but adds quite a bit of overhead. Also, you are running subsearches, which run serially. Consider using multisearch instead of append subsearches.
http://docs.splunk.com/Documentation/Splunk/6.2.1/SearchReference/Multisearch
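
One way to drop transaction might be to let stats pick the latest value per host and drive. This is only a sketch for the LogicalDisk part of the search: it assumes the counter reading is in a field called Value (common for perfmon data, but not confirmed in the post), and freePct and freeMB are made-up field names.

```
index="perfmon" host_category="build-servers" object="LogicalDisk"
    (instance="C:" OR instance="D:") (counter="% Free Space" OR counter="Free Megabytes")
| eval freePct=if(counter=="% Free Space", Value, null())
| eval freeMB=if(counter=="Free Megabytes", Value, null())
| stats latest(_time) as _time latest(freePct) as freePct latest(freeMB) as freeMB by host instance
```

This replaces the transactionKey eval, the transaction, and the dedup with a single stats. One difference to keep in mind: the two counters are no longer guaranteed to come from the same timestamp, only from the most recent event for each counter within the search window.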


d044160
Explorer

I did read up on multisearch, but it seems it would collide with the dedups and transactions, right?
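
For reference, multisearch only accepts streaming commands inside its subsearches, so the dedup (or a stats) would have to move after the multisearch. A minimal sketch with two of the CPU sources from the search above; the service value is made up for illustration:

```
| multisearch
    [ search index="perfmon" host_category="build-servers" object="Processor" counter="% Processor Time" instance="_Total"
      | eval service="CPU" ]
    [ search index="os" host_category="build-servers" source=cpu CPU="all"
      | eval service="CPU" ]
| dedup host service sortby -_time
```

The eval inside each subsearch is fine because it is a streaming command; the transaction would still need to be replaced or moved behind the multisearch.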
