Solved: Re: How can I optimize a search that times out?

ridwanahmed · ‎07-25-2018

This search is looking back one month over a large dataset. I would like it to be accelerated, and run once a month on its own, but atm it times out.
I have added comments to clarify my thinking. Thanks for any comments!

index="myIndex" sourcetype=mySourcetype source!=*test* (msg=REQ* OR msg=RSP)

>getting total time: I can eliminate this if this causes a lot of overhead
| addinfo
| eval totalSeconds = round(max(info_max_time) - min(info_min_time))
| fields data.CompanyNumber, data.ElapsedSeconds, HostName, data.IsApproved, totalSeconds, _time, cid
| stats values(data.CompanyNumber) as CompanyNumber values(data.ElapsedSeconds) as ElapsedSeconds values(HostName) as HostName values(data.IsApproved) as isApproved max(totalSeconds) as totalSeconds by _time,cid
| eval declines=if(isApproved="false",1,0)
| stats  dc(cid) as txnsPerHost avg(ElapsedSeconds) as avgElapsedSeconds max(ElapsedSeconds) as maxElapsedSeconds  sum(declines) as declines  max(totalSeconds) as totalSeconds by HostName
| eval tps=round((txnsPerHost/totalSeconds),2)

> These next two are obvi just prettyprinting-- can be eliminated if it's too much overhead
| eval avgElapsedSeconds = round(avgElapsedSeconds,2) 
| eval maxElapsedSeconds = round(maxElapsedSeconds,2)

I've not renamed anything for fear of overhead-- txnsPerHost as "transactions per host" would be nice
| table HostName txnsPerHost tps avgElapsedSeconds maxElapsedSeconds declines

DalJeanis · ‎07-25-2018

First, if you want to end up with data by host, you cannot stats without including the host in the by field list. As long as a cid only occurs on one host, then it would still work, but it's better practice to break it out. also, values can incur some overhead, so if you only expect one value, then either include it in the by clause or use max() or min(), first() or last().

Here are my assumptions:

1) cid is a transaction id of some kind, that uniquely identifies a transaction that takes place on one Hostname, for one CompanyNumber, and has one IsApproved result.

2) You aren't using _time for anything except your initial aggregation, so it has no other significance to you.

3) You aren't using isApproved for anything except calculating the number of declines.

Try this over a very short period, and see if it produces the expected counts...

index="myIndex" sourcetype=mySourcetype source!=test (msg=REQ* OR msg=RSP)

| fields data.CompanyNumber, data.ElapsedSeconds, HostName, data.IsApproved, cid 

| stats 
    max(data.ElapsedSeconds) as ElapsedSeconds,
    max(eval(if(data.IsApproved="false",1,0)) as declines, 
    by cid, Hostname, data.CompanyNumber 

| stats dc(cid) as txnsPerHost 
     avg(ElapsedSeconds) as avgElapsedSeconds 
     max(ElapsedSeconds) as maxElapsedSeconds 
     sum(declines) as declines 
     by HostName 

| addinfo 
| eval totalSeconds = round(max(info_max_time) - min(info_min_time)) 

| eval tps=round((txnsPerHost/totalSeconds),2)

View solution in original post

ridwanahmed · ‎12-14-2018

The following version of this search runs over 7-10 days successfully. I want to make a summary index instead, so I can reliably get a month's data. Am I doing this correctly?
index="myIndex" sourcetype=mySourcetype (msg=REQ* OR msg=RSP) (source="E:\\Logs\\Path-*" source!="E:\\Logs\\Path-Test*") (data.HostType<1000 AND data.HostType!=0) | fields data.CompanyNumber, data.ElapsedSeconds, HostName, data.IsApproved, cid, msg | stats max(data.ElapsedSeconds) as ElapsedSeconds, max(eval(if('data.IsApproved'="false",1,0))) as declines, by cid, HostName | stats dc(cid) as txnsPerHost avg(ElapsedSeconds) as avgElapsedSeconds max(ElapsedSeconds) as maxElapsedSeconds sum(declines) as declines by HostName | addinfo | eval totalSeconds = round(info_max_time - info_min_time) | eval tps=round((txnsPerHost/totalSeconds),2) | eval avgElapsedSeconds = round(avgElapsedSeconds,2) | eval maxElapsedSeconds = round(maxElapsedSeconds,2) | fields HostName txnsPerHost tps avgElapsedSeconds maxElapsedSeconds declines | sort 0 -txnsPerHost

Summary creating search:
index="myIndex" sourcetype=mySourcetype (msg=REQ* OR msg=RSP) (source="E:\\Logs\\Path-*" source!="E:\\Logs\\Path-Test*") (data.HostType<1000 AND data.HostType!=0) | fields data.CompanyNumber, data.ElapsedSeconds, HostName, data.IsApproved, cid, msg | stats max(data.ElapsedSeconds) as ElapsedSeconds, max(eval(if('data.IsApproved'="false",1,0))) as declines, by cid, HostName | collect index=mySummaryIndex

Search using the summary:
index=mySummaryIndex | stats dc(cid) as txnsPerHost avg(ElapsedSeconds) as avgElapsedSeconds max(ElapsedSeconds) as maxElapsedSeconds sum(declines) as declines by HostName | addinfo | eval totalSeconds = round(info_max_time - info_min_time) | eval tps=round((txnsPerHost/totalSeconds),2) | eval avgElapsedSeconds = round(avgElapsedSeconds,2) | eval maxElapsedSeconds = round(maxElapsedSeconds,2) | fields HostName txnsPerHost tps avgElapsedSeconds maxElapsedSeconds declines | sort 0 -txnsPerHost

martin_mueller · ‎07-25-2018

Do post a screenshot of your job inspector.

DalJeanis · ‎07-25-2018

First, if you want to end up with data by host, you cannot stats without including the host in the by field list. As long as a cid only occurs on one host, then it would still work, but it's better practice to break it out. also, values can incur some overhead, so if you only expect one value, then either include it in the by clause or use max() or min(), first() or last().

Here are my assumptions:

1) cid is a transaction id of some kind, that uniquely identifies a transaction that takes place on one Hostname, for one CompanyNumber, and has one IsApproved result.

2) You aren't using _time for anything except your initial aggregation, so it has no other significance to you.

3) You aren't using isApproved for anything except calculating the number of declines.

Try this over a very short period, and see if it produces the expected counts...

index="myIndex" sourcetype=mySourcetype source!=test (msg=REQ* OR msg=RSP)

| fields data.CompanyNumber, data.ElapsedSeconds, HostName, data.IsApproved, cid 

| stats 
    max(data.ElapsedSeconds) as ElapsedSeconds,
    max(eval(if(data.IsApproved="false",1,0)) as declines, 
    by cid, Hostname, data.CompanyNumber 

| stats dc(cid) as txnsPerHost 
     avg(ElapsedSeconds) as avgElapsedSeconds 
     max(ElapsedSeconds) as maxElapsedSeconds 
     sum(declines) as declines 
     by HostName 

| addinfo 
| eval totalSeconds = round(max(info_max_time) - min(info_min_time)) 

| eval tps=round((txnsPerHost/totalSeconds),2)

martin_mueller · ‎07-29-2018

Your two job inspector images are missing a crucial part in between, where command.stats lives. That's probably taking up a big chunk of time.

Small improvement: Drop data.CompanyNumber, totalSeconds, msg from the initial fields... in fact, you can drop the initial fields altogether and let Splunk figure out what fields it needs to load based on the first stats.

Potential small improvement: Assuming data.isApproved only contains "true" or "false", speed up your first stats by replacing max(eval(if(...))) with min('data.isApproved') as isApproved and translating true/false to 1/0 once per cid after the first stats.

Potential big improvement: Why are you splitting the first stats by _time? I see no reason to, and if there are many _time values per cid then this will make the first stats run extremely slow if it crosses the threshold for in-memory sorting, or if large thresholds for in-memory sorting blow your available memory.

ridwanahmed · ‎07-27-2018

Thanks to @DalJeanis I came up with the following version of the search-- runs much faster for a while, but still seems to die before all the values come back (specifically, returns incompletely for tps)

index="myIndex" sourcetype=mySourcetype (source="pathToSource*" source!="pathToSource-Test*") (msg=REQ* OR msg=RSP)

| fields data.CompanyNumber, data.ElapsedSeconds, HostName, data.IsApproved, totalSeconds, _time, cid, msg
| stats 
     max(data.ElapsedSeconds) as ElapsedSeconds,
     max(eval(if('data.IsApproved'="false",1,0))) as declines, 
     by cid,_time, HostName
| stats dc(cid) as txnsPerHost 
     avg(ElapsedSeconds) as avgElapsedSeconds 
     max(ElapsedSeconds) as maxElapsedSeconds 
     sum(declines) as declines 
     by HostName
| addinfo 

| eval totalSeconds = round(max(info_max_time) - min(info_min_time)) 
| eval tps=round((txnsPerHost/totalSeconds),2)
| eval avgElapsedSeconds = round(avgElapsedSeconds,2) 
| eval maxElapsedSeconds = round(maxElapsedSeconds,2) 

| fields HostName txnsPerHost tps avgElapsedSeconds maxElapsedSeconds declines

Current status of JI:
![alt text][https://imgur.com/a/24aEuwI]

At this point, one of my SH's will max out memory, and 2 others will "complete" the search but the TPS column is not fully populated ( a bunch of 0.00s)

How can I optimize a search that times out?

Stay Connected: Your Guide to November Tech Talks, Office Hours, and Webinars!

Transform your security operations with Splunk Enterprise Security

Splunk Admins and App Developers | Earn a $35 gift card!