Hi there,
I am trying to decide which Splunk command will give better long-term performance on the search and on the search head, and I am looking for advice. The two candidates are join type=left and appendcols override=true. The goal is to see information that may or may not be in both searches (e.g., a page saw hits in the last 30 days but no hits today).
Example result
field1 field2 field3 Last 30 Day Total Last 30 Days % Today % Today Total
website1 URL1 Page1 40618 8.605436 7.814981 1083
website1 URL2 Page2 6 0.001271
website2 URL1 Page1 95650 12.050409
website2 URL2 Page2 43862 5.525928 8.103728 50
I have a search with the left join
index=abc data=billing scoring_type=decisioned latest=@d earliest=-30d@d | top field1, field2 by field3 countfield="Last 30 Day Total" percentfield="Last 30 Days %" | join type=left [search index=abc data=billing latest=now earliest=@d | top field1, field2 by field3 countfield="Today Total" percentfield="Today %"]
And then the same search using appendcols override=true
index=abc latest=@d earliest=-30d@d | top field1, field2 by field3 countfield="Last 30 Day Total" percentfield="Last 30 Days %" | appendcols override=true [search index=abc latest=now earliest=@d | top field1, field2 by field3 countfield="Today Total" percentfield="Today %"]
Both searches give me the results I am looking for, but they are both long running and I want to know performance-wise which is the better method.
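For anyone comparing the two, here is a toy illustration of the key semantic difference (built on makeresults so it needs no real index; all field names here are made up). join type=left matches subsearch rows to outer rows by field values, while appendcols simply pastes the subsearch's rows alongside the outer rows in order:

| makeresults count=2
| streamstats count as row
| eval page=if(row=1, "Page1", "Page2")
| join type=left page
    [| makeresults count=1
     | eval page="Page2", today_total=50 ]
| fields page today_total

With the join, today_total lands on the Page2 row only. Swapping the join for | appendcols override=true [...] would instead attach today_total to the first row (Page1), because appendcols aligns by row position, not by field values. That positional behavior is why appendcols is only safe when both searches are guaranteed to return the same rows in the same order.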
Thanks for the responses
msmapper
Hi msmapper,
Like @somesoni2 already said, avoid all types of subsearches - sooner or later you will hit limits or problems without even knowing it.
Regarding your use case:
I find multisearch really useful for comparing different time ranges; see an example here: https://answers.splunk.com/answers/663294/timewrap-compare-last-24-hours-to-the-same-day-ove.html
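A rough sketch of that approach applied to this use case (field names taken from the question; note that multisearch requires each subsearch to be purely streaming, so top cannot go inside the brackets - aggregate afterwards instead):

| multisearch
    [ search index=abc data=billing scoring_type=decisioned earliest=-30d@d latest=@d
      | eval period="Last 30 Day" ]
    [ search index=abc data=billing earliest=@d latest=now
      | eval period="Today" ]
| stats count by period field1 field2 field3

Both time ranges are read in a single pass over the index, and the per-period counts can then be pivoted into columns with a final stats or chart.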
Hope this helps ...
cheers, MuS
I would go another route and avoid the subsearches (and hence join/appendcols) entirely. Give this a try:
index=abc data=billing scoring_type=decisioned latest=now earliest=-30d@d
| eval period=if(_time<relative_time(now(),"@d"),"Last 30 Day","Today") | top field1, field2 by period field3
| eval "Last 30 Day Total"=if(period="Last 30 day",count,null()) | eval "Last 30 Day Percent"=if(period="Last 30 day",percent,null())
| eval "Today Total"=if(period="Today",count,null()) | eval "Today Percent"=if(period="Today",percent,null())
| fields - period count percent | stats values(*) as * by field1 field2 field3
| table field1 field2 field3 "Last 30 Day Total" "Last 30 Days %" "Today Total" "Today %"
Update#1
There were more typos than I expected in the query above. Try this variation instead, which works and performs better (tested with _internal data) on my machine:
index=abc data=billing scoring_type=decisioned latest=now earliest=-30d@d
| eval period=if(_time<relative_time(now(),"@d"),"Last 30 Day","Today")
| stats count by period field3 field1 field2 | eventstats sum(count) as Total by period, field3 | eval percent=count*100/Total | sort period -count | dedup 10 period | eval "Last 30 Day Total"=if(period="Last 30 Day",count,null()) | eval "Last 30 Days %"=if(period="Last 30 Day",percent,null())
| eval "Today Total"=if(period="Today",count,null()) | eval "Today %"=if(period="Today",percent,null())
| fields - period count percent Total| stats values(*) as * by field3 field1 field2 | table field3 field1 field2 "Last 30 Day Total" "Last 30 Days %" "Today Total" "Today %"
Why use dedup instead of stats?
stats list( eval (period<11)) by period field3 field1 field2
| dedup 10 period | eval "Last 30 Day Total"=if(period="Last 30 Day",count,null()) | eval "Last 30 Days %"=if(period="Last 30 Day",percent,null())
| eval "Today Total"=if(period="Today",count,null()) | eval "Today %"=if(period="Today",percent,null())
Hi somesoni2,
While the query runs, it doesn't put the results from Last 30 Days and Today on the same row; I get two rows per field combination, one for the Last 30 Day stats and one for Today. It also doesn't seem to give any performance boost over the subsearch with join or appendcols - in fact it actually runs a bit slower. The subsearch version runs in less than 1 minute and your query takes 2 minutes, over 1.8 million rows.
Am I doing something wrong with your query?
Also, as written, the eval statements for the Total and Percent fields in your example aren't populating. There are a few typos that I want to correct for others to use later:
| eval "Last 30 Day Total"=if(period="Last 30 Day",count,null()) | eval "Last 30 Days %"=if(period="Last 30 Day",percent,null()) | eval "Today Total"=if(period="Today",count,null()) | eval "Today %"=if(period="Today",percent,null()) | stats values(*) as * by field1, field2, field3 | table field1, field2, field3 "Last 30 Day Total" "Last 30 Days %" "Today Total" "Today %"
Thanks for pointing out the typos. Try the updated answer.
Hi somesoni2,
The query performance is a bit better, but I am still experiencing the same issue with the results not landing on the same line together. Getting the data from each timeframe onto one line with the best performance is my goal. Right now the 30 Day and Today numbers are on completely separate rows in the table. I have only ever seen join and appendcols put data like that on the same row.
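For what it's worth, one pattern that collapses the two period rows onto a single line (a sketch, assuming the pipeline already carries period, count, and percent fields as in the answer above) uses eval's curly-brace field-name substitution to build the column names dynamically, then aggregates:

... | eval {period}_Total=count, {period}_Percent=percent
| fields - period count percent
| stats values(*) as * by field1 field2 field3

This yields columns like "Last 30 Day_Total" and "Today_Total" on one row per field1/field2/field3 combination; a final rename can map them to the display names from the original table.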
Regards
msmapper
One thing you need to be aware of about join is that it has some significant limits OOB. You can alter them in limits.conf if needed:
[join]
subsearch_maxout = integer
* Maximum result rows in output from subsearch to join against.
* Defaults to 50000
subsearch_maxtime = integer
* Maximum search time (in seconds) before auto-finalization of subsearch.
* Defaults to 60
subsearch_timeout = integer
* Maximum time to wait for subsearch to fully finish (in seconds).
* Defaults to 120
Hi sjohnson,
Great reminder of the timeout and maxtime limits for subsearches. Do you know how they combine - 2 minutes for subsearch_timeout and then 1 minute for subsearch_maxtime, for a total of 3 minutes?
Right now, the query takes about 30-45 secs to fully display but I am concerned about how it will perform once more pages go live. I want to make sure I start off with the optimal query as well. I do also plan on turning on acceleration for the query once it is finalized to help performance and results.
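On the acceleration point: report acceleration is normally enabled through the UI, but the equivalent savedsearches.conf stanza looks roughly like the sketch below (the stanza name and time range are placeholders, not from this thread). Note that acceleration only helps searches built on transforming commands (stats, chart, top, etc.):

[Billing Page Hits]
search = index=abc data=billing scoring_type=decisioned earliest=-30d@d latest=now | stats count by field1 field2 field3
auto_summarize = 1
auto_summarize.dispatch.earliest_time = -30d@d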
Regards
msmapper