Splunk Search

How can I optimize a search that times out?

Path Finder

This search is looking back one month over a large dataset. I would like it to be accelerated, and run once a month on its own, but atm it times out.
I have added comments to clarify my thinking. Thanks for any comments!

index="myIndex" sourcetype=mySourcetype source!=*test* (msg=REQ* OR msg=RSP)

>getting total time: I can eliminate this if this causes a lot of overhead
| addinfo
| eval totalSeconds = round(max(info_max_time) - min(info_min_time))
| fields data.CompanyNumber, data.ElapsedSeconds, HostName, data.IsApproved, totalSeconds, _time, cid
| stats values(data.CompanyNumber) as CompanyNumber values(data.ElapsedSeconds) as ElapsedSeconds values(HostName) as HostName values(data.IsApproved) as isApproved max(totalSeconds) as totalSeconds by _time,cid
| eval declines=if(isApproved="false",1,0)
| stats  dc(cid) as txnsPerHost avg(ElapsedSeconds) as avgElapsedSeconds max(ElapsedSeconds) as maxElapsedSeconds  sum(declines) as declines  max(totalSeconds) as totalSeconds by HostName
| eval tps=round((txnsPerHost/totalSeconds),2)

> These next two are obvi just prettyprinting-- can be eliminated if it's too much overhead
| eval avgElapsedSeconds = round(avgElapsedSeconds,2) 
| eval maxElapsedSeconds = round(maxElapsedSeconds,2) 

I've not renamed anything for fear of overhead-- txnsPerHost as "transactions per host" would be nice
| table HostName txnsPerHost tps avgElapsedSeconds maxElapsedSeconds declines

Tags (2)
0 Karma
1 Solution

SplunkTrust
SplunkTrust

First, if you want to end up with data by host, you cannot stats without including the host in the by field list. As long as a cid only occurs on one host, then it would still work, but it's better practice to break it out. also, values can incur some overhead, so if you only expect one value, then either include it in the by clause or use max() or min(), first() or last().

Here are my assumptions:

1) cid is a transaction id of some kind, that uniquely identifies a transaction that takes place on one Hostname, for one CompanyNumber, and has one IsApproved result.

2) You aren't using _time for anything except your initial aggregation, so it has no other significance to you.

3) You aren't using isApproved for anything except calculating the number of declines.

Try this over a very short period, and see if it produces the expected counts...

index="myIndex" sourcetype=mySourcetype source!=test (msg=REQ* OR msg=RSP)

| fields data.CompanyNumber, data.ElapsedSeconds, HostName, data.IsApproved, cid 

| stats 
    max(data.ElapsedSeconds) as ElapsedSeconds,
    max(eval(if(data.IsApproved="false",1,0)) as declines, 
    by cid, Hostname, data.CompanyNumber 

| stats dc(cid) as txnsPerHost 
     avg(ElapsedSeconds) as avgElapsedSeconds 
     max(ElapsedSeconds) as maxElapsedSeconds 
     sum(declines) as declines 
     by HostName 

| addinfo 
| eval totalSeconds = round(max(info_max_time) - min(info_min_time)) 

| eval tps=round((txnsPerHost/totalSeconds),2) 

View solution in original post

Path Finder

The following version of this search runs over 7-10 days successfully. I want to make a summary index instead, so I can reliably get a month's data. Am I doing this correctly?
index="myIndex" sourcetype=mySourcetype (msg=REQ* OR msg=RSP) (source="E:\\Logs\\Path-*" source!="E:\\Logs\\Path-Test*") (data.HostType<1000 AND data.HostType!=0)
| fields data.CompanyNumber, data.ElapsedSeconds, HostName, data.IsApproved, cid, msg
| stats
max(data.ElapsedSeconds) as ElapsedSeconds,
max(eval(if('data.IsApproved'="false",1,0))) as declines,
by cid, HostName
| stats dc(cid) as txnsPerHost
avg(ElapsedSeconds) as avgElapsedSeconds
max(ElapsedSeconds) as maxElapsedSeconds
sum(declines) as declines
by HostName
| addinfo
| eval totalSeconds = round(info_max_time - info_min_time)
| eval tps=round((txnsPerHost/totalSeconds),2)
| eval avgElapsedSeconds = round(avgElapsedSeconds,2)
| eval maxElapsedSeconds = round(maxElapsedSeconds,2)
| fields HostName txnsPerHost tps avgElapsedSeconds maxElapsedSeconds declines
| sort 0 -txnsPerHost

Summary creating search:
index="myIndex" sourcetype=mySourcetype (msg=REQ* OR msg=RSP) (source="E:\\Logs\\Path-*" source!="E:\\Logs\\Path-Test*") (data.HostType<1000 AND data.HostType!=0)
| fields data.CompanyNumber, data.ElapsedSeconds, HostName, data.IsApproved, cid, msg
| stats
max(data.ElapsedSeconds) as ElapsedSeconds,
max(eval(if('data.IsApproved'="false",1,0))) as declines,
by cid, HostName
| collect index=mySummaryIndex

Search using the summary:
index=mySummaryIndex
| stats dc(cid) as txnsPerHost
avg(ElapsedSeconds) as avgElapsedSeconds
max(ElapsedSeconds) as maxElapsedSeconds
sum(declines) as declines
by HostName
| addinfo
| eval totalSeconds = round(info_max_time - info_min_time)
| eval tps=round((txnsPerHost/totalSeconds),2)
| eval avgElapsedSeconds = round(avgElapsedSeconds,2)
| eval maxElapsedSeconds = round(maxElapsedSeconds,2)
| fields HostName txnsPerHost tps avgElapsedSeconds maxElapsedSeconds declines
| sort 0 -txnsPerHost

0 Karma

SplunkTrust
SplunkTrust

Do post a screenshot of your job inspector.

0 Karma

SplunkTrust
SplunkTrust

First, if you want to end up with data by host, you cannot stats without including the host in the by field list. As long as a cid only occurs on one host, then it would still work, but it's better practice to break it out. also, values can incur some overhead, so if you only expect one value, then either include it in the by clause or use max() or min(), first() or last().

Here are my assumptions:

1) cid is a transaction id of some kind, that uniquely identifies a transaction that takes place on one Hostname, for one CompanyNumber, and has one IsApproved result.

2) You aren't using _time for anything except your initial aggregation, so it has no other significance to you.

3) You aren't using isApproved for anything except calculating the number of declines.

Try this over a very short period, and see if it produces the expected counts...

index="myIndex" sourcetype=mySourcetype source!=test (msg=REQ* OR msg=RSP)

| fields data.CompanyNumber, data.ElapsedSeconds, HostName, data.IsApproved, cid 

| stats 
    max(data.ElapsedSeconds) as ElapsedSeconds,
    max(eval(if(data.IsApproved="false",1,0)) as declines, 
    by cid, Hostname, data.CompanyNumber 

| stats dc(cid) as txnsPerHost 
     avg(ElapsedSeconds) as avgElapsedSeconds 
     max(ElapsedSeconds) as maxElapsedSeconds 
     sum(declines) as declines 
     by HostName 

| addinfo 
| eval totalSeconds = round(max(info_max_time) - min(info_min_time)) 

| eval tps=round((txnsPerHost/totalSeconds),2) 

View solution in original post

SplunkTrust
SplunkTrust

Your two job inspector images are missing a crucial part in between, where command.stats lives. That's probably taking up a big chunk of time.

Small improvement: Drop data.CompanyNumber, totalSeconds, msg from the initial fields... in fact, you can drop the initial fields altogether and let Splunk figure out what fields it needs to load based on the first stats.

Potential small improvement: Assuming data.isApproved only contains "true" or "false", speed up your first stats by replacing max(eval(if(...))) with min('data.isApproved') as isApproved and translating true/false to 1/0 once per cid after the first stats.

Potential big improvement: Why are you splitting the first stats by _time? I see no reason to, and if there are many _time values per cid then this will make the first stats run extremely slow if it crosses the threshold for in-memory sorting, or if large thresholds for in-memory sorting blow your available memory.

0 Karma

Path Finder

Thanks to @DalJeanis I came up with the following version of the search-- runs much faster for a while, but still seems to die before all the values come back (specifically, returns incompletely for tps)

index="myIndex" sourcetype=mySourcetype (source="pathToSource*" source!="pathToSource-Test*") (msg=REQ* OR msg=RSP)

| fields data.CompanyNumber, data.ElapsedSeconds, HostName, data.IsApproved, totalSeconds, _time, cid, msg
| stats 
     max(data.ElapsedSeconds) as ElapsedSeconds,
     max(eval(if('data.IsApproved'="false",1,0))) as declines, 
     by cid,_time, HostName
| stats dc(cid) as txnsPerHost 
     avg(ElapsedSeconds) as avgElapsedSeconds 
     max(ElapsedSeconds) as maxElapsedSeconds 
     sum(declines) as declines 
     by HostName
| addinfo 

| eval totalSeconds = round(max(info_max_time) - min(info_min_time)) 
| eval tps=round((txnsPerHost/totalSeconds),2)
| eval avgElapsedSeconds = round(avgElapsedSeconds,2) 
| eval maxElapsedSeconds = round(maxElapsedSeconds,2) 

| fields HostName txnsPerHost tps avgElapsedSeconds maxElapsedSeconds declines

Current status of JI:
![alt text][https://imgur.com/a/24aEuwI]

alt text
alt text
alt text

At this point, one of my SH's will max out memory, and 2 others will "complete" the search but the TPS column is not fully populated ( a bunch of 0.00s)

0 Karma