I'd like to get some opinions on the following search. We're not thrilled with its performance, and I'm sure there's a lot of potential to improve it.
The search inspector says "This search has completed and has returned 19 results by scanning 25,204 events in 85.338 seconds." The components taking the most time are:
Any general remarks or hints on inspection and improvement strategies for searches are highly appreciated. Thanks!
Here it is (I cut a lot of evals):
search host="$filterHost$" host_category="build-servers" host_group="$filterHostGroup$" index="perfmon" (object="LogicalDisk" (instance = "C:" OR instance ="D:") (counter="% Free Space" OR counter="Free Megabytes")) OR (object="Processor" counter="% Processor Time" instance="_Total") | 2x eval .. | eval transactionKey="" + host + ":" + instance + ":" + _time | transaction maxevents=4 transactionKey | eval dedupKey=host + ":" + instance | dedup dedupKey sortby -_time | 12x eval | append [ search host="$filterHost$" host_category="build-servers" host_group="$filterHostGroup$" index=perfmon sourcetype="base_pagefile" | dedup host sortby -_time | rex field=_raw max_match=0 "AllocatedBaseSize=(?<alloc>.*)" | rex max_match=0 field=_raw "CurrentUsage=(?<usage>.*)" | eval tmpCounter=mvcount(alloc) | stats max(tmpCounter) sum(alloc) as Total, sum(usage) as Used by _time, host | 9x eval ] | append [ search index="os" host="$filterHost$" host_category="build-servers" host_group="$filterHostGroup$" (source=df (MountedOn=/ OR MountedOn=/tmp OR MountedOn=/data OR MountedOn=/net/x OR MountedOn=/mnt/y OR MountedOn=/mnt/z OR MountedOn=/mnt/w)) OR (source=cpu CPU="all") OR (sourcetype="vmstat") | eval 2x | dedup dedupKey sortby -_time | 14x eval ] | append [ search index="perfmon" sourcetype="base_systeminfo" host="$filterHost$" host_category="build-servers" host_group="$filterHostGroup$" | dedup host sortby -_time | eval 9x ] | search $filterStatus$ service="$filterService$" | eval statusNumber=case(status=="CRITICAL", 0, status=="WARNING", 1, status=="OK", 2) | sort statusNumber -host, -service | eval "Last Check"=strftime(_time, "%F %T") | rename host as "Host" service as "Service" status as "Status" value AS "Description" | table Host Service Status "Description" "Last Check"
I fear the timings in the job inspector may not be that accurate when heavy subsearches are involved... I think the entire subsearch's run time gets attributed to the append command, or something like that.
To track down what's taking so long, run each subsearch individually and post either a screenshot of the top of the job inspector, or the time it took, the result count, the scan count, and the biggest time hogs.
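For example, the pagefile subsearch can be run on its own roughly like this. It is only a sketch copied from the search above and cut off after the dedup; outside a dashboard the $filterHost$ and $filterHostGroup$ tokens are not filled in, so put in real values (or a wildcard) by hand:

```
index=perfmon sourcetype="base_pagefile" host="$filterHost$" host_category="build-servers" host_group="$filterHostGroup$"
| dedup host sortby -_time
```

Then open the job inspector for that run and note the run time, result count and scan count. Do the same for the other two appended subsearches and for the main search, and add the later commands back one pipe at a time if one of them stands out.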
My crystal ball says one of those might be loading a huge number of events, but we'll see.
I think some of your problems are related to transaction. The transaction command is great, but it adds quite a bit of overhead. You are also running subsearches, which run serially. Consider using multisearch instead of saved search.
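As a rough illustration of dropping transaction, the disk part could be grouped with stats instead. This is a sketch, not a drop-in replacement: it assumes the perfmon events carry the counter reading in a field called Value (an assumption, your field name may differ), and the evals that were cut from the posted search would have to be adapted to the new row layout:

```
index="perfmon" host="$filterHost$" host_category="build-servers" host_group="$filterHostGroup$" object="LogicalDisk" (instance="C:" OR instance="D:") (counter="% Free Space" OR counter="Free Megabytes")
| stats latest(_time) as _time latest(Value) as Value by host, instance, counter
```

This keeps only the most recent reading per host, instance and counter, which is about what the transaction plus dedup pair is doing, without building transactions first. As far as I know multisearch only accepts streaming subsearches, so the dedup and stats parts would have to move out of the subsearches and come after it.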
From my minimal understanding of Splunk internals: Wouldn't APPENDing the results from 2 searches be cheaper than a SET (UNION) operation? But we'll check that, should be easy enough.
"Summary Index", I think that might be where we're heading. Thanks!
Off the top of my head I see two suspect things.
First, startup.handoff at the bottom is seven seconds... that's a lot.
Second, you're streaming five million events off disk. Do you need that many in your results or is there an inefficient filter causing this? To dig down, run the search up until before the first pipe symbol and compare the scanCount with the resultCount.
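For the main search, that means running just this part (copied from the search above, with the tokens replaced by real values when run outside the dashboard):

```
host="$filterHost$" host_category="build-servers" host_group="$filterHostGroup$" index="perfmon" (object="LogicalDisk" (instance="C:" OR instance="D:") (counter="% Free Space" OR counter="Free Megabytes")) OR (object="Processor" counter="% Processor Time" instance="_Total")
```

If scanCount is much larger than resultCount, the filter is throwing away most of what it reads from disk, and it is worth checking which of the terms is doing the filtering late.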