How do I know when | tstats summariesonly=true is 100% finished on an accelerated data model?
I have an issue where we upload log drops from yesterday into Splunk, so HOST=_NEW_LOG_DROP (so no new data will go into this host).
We have noticed that performance is a lot better with | tstats summariesonly=true, so we want to keep it on.
However, users often click to see this data and get a blank screen because the data is not 100% ready.
We can use | tstats summariesonly=false, but we have hundreds of millions of lines, and the performance is better with | tstats summariesonly=true.
So I was thinking, do I run a command like this?
| tstats summariesonly=true count(All_TPS_Logs.duration) AS count FROM datamodel=TPS_V5 WHERE (nodename=All_TPS_Logs host=LUAS_2019_01_01 (All_TPS_Logs.user=* OR NOT All_TPS_Logs.user=*) All_TPS_Logs.operationIdentity="*") All_TPS_Logs.name =***
vs
| tstats summariesonly=false count(All_TPS_Logs.duration) AS count FROM datamodel=TPS_V5 WHERE (nodename=All_TPS_Logs host=LUAS_2019_01_01 (All_TPS_Logs.user=* OR NOT All_TPS_Logs.user=*) All_TPS_Logs.operationIdentity="*") All_TPS_Logs.name =***
And when one equals the other, is the data model 100% done?
Thanks in Advance
Rob
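One way to sketch the comparison asked about above is to run both counts in a single search and compute how much of the host's data is already in the summary. This is only an illustration: it reuses the TPS_V5 data model, the All_TPS_Logs node, and the host value from the question, swaps count(All_TPS_Logs.duration) for a plain count, and the field names summarized, total, and pct_summarized are made up for this example. The comparison is meaningful only while no new events arrive for that host, which is the case described here.

```spl
| tstats summariesonly=true count AS summarized FROM datamodel=TPS_V5 WHERE nodename=All_TPS_Logs host=LUAS_2019_01_01
| appendcols
    [| tstats summariesonly=false count AS total FROM datamodel=TPS_V5 WHERE nodename=All_TPS_Logs host=LUAS_2019_01_01]
| eval pct_summarized=if(total > 0, round(100 * summarized / total, 1), null())
```

When pct_summarized reaches 100, the two counts agree for that host and time range.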
Just a heads up that an accelerated data model runs 3 concurrent searches every 5 minutes by default to rebuild that summary range. So when setting summariesonly=t, you will not get back the most recent data because the summary range is not 100% up to date.
Data models are typically never finished so long as data is still streaming in. Searches run automatically every 5 minutes by default to create the secondary TSIDX files that power your accelerated data models (ADM). So anything newer than 5 minutes ago will generally not be in the ADM yet, and under heavy load the gap may go back even farther than 5 minutes. That is why you almost always see searches that use earliest=-65m latest=-5m instead of Last hour. That just means that the stuff happening in the last 5 minutes will not get examined until an hour later.
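As an illustration of the offset window described above, a summaries-only search might exclude the most recent 5 minutes like this. The data model and node names are taken from the question earlier in the thread, and the 5-minute span is an arbitrary choice for the example.

```spl
| tstats summariesonly=true count FROM datamodel=TPS_V5 WHERE nodename=All_TPS_Logs earliest=-65m latest=-5m BY _time span=5m
```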
Thanks for your help woodcock, it has helped me to understand them better. 🙂
Can I reduce the 5 minutes to 1 minute? Is there a property for that?
Of course you can, everything is configurable. But I'm warning you not to do it! Reason being, this will tax the sh** out of your CPU and bring the cluster to a crawl. Running the acceleration searches five times as often puts roughly five times the load on the CPU. A better approach would be to set summariesonly=f so you search the accelerated data model AND the raw data. You will get the benefit of fast searches over the summary range and the complete data set.
The reason you're seeing slow performance when setting the flag to false is the added time it takes to search the raw data. Another question for you: how large is your summary range, and what is your time range set to? If your summary range is 1 month and you're searching 6 months, then yeah, it's going to be slow.
Depending on your use case: if your time range is 6 months and you want to search from 6 months ago up to now, you should set up a 6-month summary range and set the flag to false. If you want 6 months ago to now-5min, then you can set the flag to true and get a lightning-fast search result with a complete dataset.
Another thought: since creating a large summary range takes so much disk and CPU, you could create a smaller summary range and combine it with a summary index. This hybrid approach lets you take advantage of the benefits of each strategy while minimizing disk and CPU usage.
@robertlynch2020 did this answer your question? If so, can you accept it?
Hi Skoelpin, sorry for the delay. I was pulled away for a few days there.
I understand the issues now. My main objective is to stop the user getting a blank screen [this happens if summariesonly=true and the data is not summarized].
What I am going to do is the following.
When a new log drop is uploaded, I can get the time of upload T, and I will subtract it from now() when the user clicks to see their data.
If the difference is less than 5 minutes, I will set the token to summariesonly=false (the data may not be summarized yet); if it is greater than 5 minutes, summariesonly=true.
This way the user will always see data and we can use both techniques.
Thanks for the help
Rob
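A minimal sketch of that check in SPL, assuming the upload time is available as a string (the literal timestamp, the 300-second threshold, and the field names upload_time, age_seconds, and summariesonly are placeholders for this example). It reasons that a freshly uploaded drop has probably not been summarized yet, so it picks summariesonly=false for recent uploads and summariesonly=true once the drop is older than the threshold.

```spl
| makeresults
| eval upload_time=strptime("2019-01-02 09:00:00", "%Y-%m-%d %H:%M:%S")
| eval age_seconds=now() - upload_time
| eval summariesonly=if(age_seconds > 300, "true", "false")
| table upload_time age_seconds summariesonly
```

In a dashboard, the resulting value could be stored in a token and substituted into the tstats search. Elapsed time alone does not guarantee that the summary is complete under heavy load, so treat the threshold as a rough heuristic.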
You could look at the following:
@robertlynch2020
summariesonly=true
Only applies when selecting from an accelerated data model. When false, generates results from both summarized data and data that is not summarized.
Ref: https://docs.splunk.com/Documentation/Splunk/7.2.3/SearchReference/Tstats
For the data model acceleration status, check the "Review summary creation metrics" section:
https://docs.splunk.com/Documentation/Splunk/7.2.3/Knowledge/Acceleratedatamodels
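The acceleration status can usually also be pulled with a search against the summarization REST endpoint. The sketch below is written from memory of how that endpoint is commonly queried; the endpoint arguments and the summary.* field names are assumptions to verify on your own version before relying on them.

```spl
| rest /services/admin/summarization by_tstats=t splunk_server=local count=0
| eval complete_pct=round('summary.complete' * 100, 2)
| table summary.id complete_pct summary.is_inprogress summary.earliest_time summary.latest_time
```

Note that this reports completeness for the whole summary range of each data model, not for a single host.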
Hi
I have an accelerated data model, so what is "data that is not summarized"? Is this data that will be summarized if I give it more time?
Thanks
Rob
@robertlynch2020
Yes. If your search range falls within the defined summary range, it might take a little time for the data to be summarized. After that, you can run the search with summariesonly=true.