Hi there,
I am trying to decide which Splunk command will give better long-term performance on the search and on the search head, and I am looking for advice. The two candidates are join type=left and appendcols override=true. The goal is to see information that may or may not be in both searches (e.g., a page saw hits in the last 30 days but no hits today).
Example result
field1 field2 field3 Last 30 Day Total Last 30 Days % Today % Today Total
website1 URL1 Page1 40618 8.605436 7.814981 1083
website1 URL2 Page2 6 0.001271
website2 URL1 Page1 95650 12.050409
website2 URL2 Page2 43862 5.525928 8.103728 50
I have a search with the left join
index=abc data=billing scoring_type=decisioned latest=@d earliest=-30d@d | top field1, field2 by field3 countfield="Last 30 Day Total" percentfield="Last 30 Days %" | join type=left [search index=abc data=billing latest=now earliest=@d | top field1, field2 by field3 countfield="Today Total" percentfield="Today %"]
And then the same search using appendcols override=true
index=abc latest=@d earliest=-30d@d | top field1, field2 by field3 countfield="Last 30 Day Total" percentfield="Last 30 Days %" | appendcols override=true [search index=abc latest=now earliest=@d | top field1, field2 by field3 countfield="Today Total" percentfield="Today %"]
Both searches give me the results I am looking for, but they are both long running and I want to know performance-wise which is the better method.
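For anyone comparing the two, here is a toy illustration of the key semantic difference (built on makeresults so it needs no real index; all field names here are made up). join type=left matches subsearch rows to outer rows by field values, while appendcols simply pastes the subsearch's rows alongside the outer rows in order:

| makeresults count=2
| streamstats count as row
| eval page=if(row=1, "Page1", "Page2")
| join type=left page
    [| makeresults count=1
     | eval page="Page2", today_total=50 ]
| fields page today_total

With the join, today_total lands on the Page2 row only. Swapping the join for | appendcols override=true [...] would instead attach today_total to the first row (Page1), because appendcols aligns by row position, not by field values. That positional behavior is why appendcols is only safe when both searches are guaranteed to return the same rows in the same order.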
Thanks for the responses
msmapper
Hi msmapper,
Like @somesoni2 already said, avoid all types of subsearches - sooner or later you will hit limits or problems without even knowing it.
Regarding your use case:
I find multisearch really useful for comparing different time ranges; see an example here: https://answers.splunk.com/answers/663294/timewrap-compare-last-24-hours-to-the-same-day-ove.html
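A rough sketch of that approach applied to this use case (field names taken from the question; note that multisearch requires each subsearch to be purely streaming, so top cannot go inside the brackets - aggregate afterwards instead):

| multisearch
    [ search index=abc data=billing scoring_type=decisioned earliest=-30d@d latest=@d
      | eval period="Last 30 Day" ]
    [ search index=abc data=billing earliest=@d latest=now
      | eval period="Today" ]
| stats count by period field1 field2 field3

Both time ranges are read in a single pass over the index, and the per-period counts can then be pivoted into columns with a final stats or chart.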
Hope this helps ...
cheers, MuS
I would go another route and avoid the subsearches (and hence join/appendcols) entirely. Give this a try:
index=abc data=billing scoring_type=decisioned latest=now earliest=-30d@d
| eval period=if(_time<relative_time(now(),"@d"),"Last 30 Day","Today") | top field1, field2 by period field3
| eval "Last 30 Day Total"=if(period="Last 30 day",count,null()) | eval "Last 30 Day Percent"=if(period="Last 30 day",percent,null())
| eval "Today Total"=if(period="Today",count,null()) | eval "Today Percent"=if(period="Today",percent,null())
| fields - period count percent | stats values(*) as * by field1 field2 field3
| table field1 field2 field3 "Last 30 Day Total" "Last 30 Days %" "Today Total" "Today %"
Update#1
There were more typos than I expected in the query above. Try this variation instead, which works and performs better (tested with _internal data) on my machine:
index=abc data=billing scoring_type=decisioned latest=now earliest=-30d@d
| eval period=if(_time<relative_time(now(),"@d"),"Last 30 Day","Today")
| stats count by period field3 field1 field2 | eventstats sum(count) as Total by period, field3 | eval percent=count*100/Total | sort period -count | dedup 10 period | eval "Last 30 Day Total"=if(period="Last 30 Day",count,null()) | eval "Last 30 Days %"=if(period="Last 30 Day",percent,null())
| eval "Today Total"=if(period="Today",count,null()) | eval "Today %"=if(period="Today",percent,null())
| fields - period count percent Total| stats values(*) as * by field3 field1 field2 | table field3 field1 field2 "Last 30 Day Total" "Last 30 Days %" "Today Total" "Today %"
Why use dedup instead of stats?
stats list( eval (period<11)) by period field3 field1 field2
| dedup 10 period | eval "Last 30 Day Total"=if(period="Last 30 Day",count,null()) | eval "Last 30 Days %"=if(period="Last 30 Day",percent,null())
| eval "Today Total"=if(period="Today",count,null()) | eval "Today %"=if(period="Today",percent,null())
Hi somesoni2,
While the query runs, it doesn't put the results from Last 30 Days and Today on the same row; I get two rows per field combination, one for the Last 30 Day stats and one for Today. It also doesn't seem to give any performance boost over the subsearch with join or appendcols - in fact it actually runs a bit slower. The subsearch version runs in less than 1 minute and your query takes 2 minutes, over 1.8 million rows.
Am I doing something wrong with your query?
Also, as written, the eval statements for the Total and Percent fields in your example aren't populating. There are a few typos that I want to correct for others to use later:
| eval "Last 30 Day Total"=if(period="Last 30 Day",count,null()) | eval "Last 30 Days %"=if(period="Last 30 Day",percent,null()) | eval "Today Total"=if(period="Today",count,null()) | eval "Today %"=if(period="Today",percent,null()) | stats values(*) as * by field1, field2, field3 | table field1, field2, field3 "Last 30 Day Total" "Last 30 Days %" "Today Total" "Today %"
Thanks for pointing out the typos. Try the updated answer.
Hi somesoni2,
The query performance is a bit better, but I am still experiencing the same issue with the results not landing on the same line together. Getting the data from each timeframe onto one line with the best performance is my goal. Right now the 30 Day and Today numbers are on completely separate rows in the table. I have only ever seen join and appendcols put data like that on the same row.
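For what it's worth, one pattern that collapses the two period rows onto a single line (a sketch, assuming the pipeline already carries period, count, and percent fields as in the answer above) uses eval's curly-brace field-name substitution to build the column names dynamically, then aggregates:

... | eval {period}_Total=count, {period}_Percent=percent
| fields - period count percent
| stats values(*) as * by field1 field2 field3

This yields columns like "Last 30 Day_Total" and "Today_Total" on one row per field1/field2/field3 combination; a final rename can map them to the display names from the original table.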
Regards
msmapper
One thing you need to be aware of about join is that it has some significant limits OOB. You can alter them in limits.conf if needed:
[join]
subsearch_maxout = integer
* Maximum result rows in output from subsearch to join against.
* Defaults to 50000
subsearch_maxtime = integer
* Maximum search time (in seconds) before auto-finalization of subsearch.
* Defaults to 60
subsearch_timeout = integer
* Maximum time to wait for subsearch to fully finish (in seconds).
* Defaults to 120
Hi sjohnson,
Great reminder of the timeout and maxtime limits for subsearches. Do you know how they combine - 2 minutes for subsearch_timeout and then 1 minute for subsearch_maxtime, for a total of 3 minutes?
Right now, the query takes about 30-45 secs to fully display but I am concerned about how it will perform once more pages go live. I want to make sure I start off with the optimal query as well. I do also plan on turning on acceleration for the query once it is finalized to help performance and results.
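On the acceleration point: report acceleration is normally enabled through the UI, but the equivalent savedsearches.conf stanza looks roughly like the sketch below (the stanza name and time range are placeholders, not from this thread). Note that acceleration only helps searches built on transforming commands (stats, chart, top, etc.):

[Billing Page Hits]
search = index=abc data=billing scoring_type=decisioned earliest=-30d@d latest=now | stats count by field1 field2 field3
auto_summarize = 1
auto_summarize.dispatch.earliest_time = -30d@d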
Regards
msmapper