Knowledge Management

Do we need TTL in war room situations?

ddrillic
Ultra Champion

We reach situations in which Splunk is being used heavily in war rooms by many people, and all the quotas work against the Splunk research folks. One of the major problems in these situations is that deleting your own jobs via the Activity > Jobs interface can take many precious minutes. I assume that if TTL were set to zero, users wouldn't need to administer their own searches. Maybe I'm wrong.

Any thoughts?

1 Solution

woodcock
Esteemed Legend

Do not set TTL to zero. Instead, run smaller searches (you should almost never be using All time) and add more disk space to your $SPLUNK_HOME/var/run/splunk/dispatch/ directory so that it can hold many more searches. I definitely would not drop TTL to near zero; it will cause all kinds of other problems for you. Disk is cheap. Add a new 100 GB disk and create a soft link so that dispatch lives there. You will never have this problem again.


ddrillic
Ultra Champion

My take on that -

-- The architectural overhead of this feature is overwhelming. I've been building search engines for roughly 20 years, and something is off here.


woodcock
Esteemed Legend

It is possible that your user is being space-limited by his role. It is more likely that all users are being limited because the search head is falling below the minimum free-space threshold for the dispatch area (500 MB). Splunk has to protect itself and the host OS.

Most people, especially when starting out, deploy all of Splunk on the only partition available, which also hosts the OS. Every Splunk search's results have to be stored on disk, and this can cause excessive disk usage, so Splunk ships defensive defaults to keep search artifacts from piling up and crashing the server by exhausting disk space: a TTL of 10 minutes, and preserving at least 500 MB of free space, which on a busy search head is not very much.

You probably installed Splunk yourself, without training, and you probably did not use PS. Experienced PS architects set up a separate, very large disk partition for dispatch to ensure that the 500 MB minimum is never reached, so the limit makes no difference. I told you what happened, why, and what you should do about it. Splunk designed it the way they did, and if you really think about it, it does make sense. In any case, they are not going to redesign it.
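If you do decide to tune the ad-hoc TTL rather than the architecture, the relevant setting is `ttl` in the `[search]` stanza of limits.conf (600 seconds, i.e. 10 minutes, by default). A minimal sketch, using a /tmp scratch directory in place of a real $SPLUNK_HOME so nothing live is touched; the 1800-second value is purely illustrative:

```shell
#!/bin/sh
# Demo: write a limits.conf override raising the ad-hoc artifact TTL.
# /tmp/splunk-demo stands in for a real $SPLUNK_HOME; on a live install,
# always edit etc/system/local, never etc/system/default.
SPLUNK_HOME="/tmp/splunk-demo"
mkdir -p "$SPLUNK_HOME/etc/system/local"
cat > "$SPLUNK_HOME/etc/system/local/limits.conf" <<'EOF'
[search]
# Seconds to keep completed ad-hoc search artifacts (default: 600).
ttl = 1800
EOF
cat "$SPLUNK_HOME/etc/system/local/limits.conf"
```

A restart is generally needed for the change to take effect; and as the answers here argue, adding disk is usually the better lever than shrinking (or zeroing) the TTL.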

ddrillic
Ultra Champion

From our sales engineer -

I think you’re taking a narrow view here.
1. How much impact are we actually talking? Can you give me the size of the dispatch directory?
2. Remember: almost ALL of the data in the dispatch directory will be the results of saved searches, alerts, reports, and search accelerations. There's a ton of usefulness in keeping that data for a day or two so users (and Splunk!) can reference why an alert was triggered, or open a dashboard/report with all the data prepopulated (run the search once and use the results over and over).
3. 10 minutes of search retention per user isn't as much as you'd think. If your Splunk cluster is performing VERY well, each active user with a horrible search might be able to add a few GB to that directory, but I doubt it. A TTL of 10 minutes is a very sane default and is used in much larger and more active clusters than yours.

What we need to figure out at this point is:
1. How big is the dispatch directory?
2. Is the size a problem?
3. How much of that is being taken up by ad-hoc search results?
4. Are the users who are seeing issues hitting their search quota or their disk quota?
5. Is that search quota/disk quota being used up by ad-hoc or by saved searches mostly?
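Assuming a default install layout, the first and third questions can be roughly answered from the shell: scheduled and saved-search artifacts land in dispatch subdirectories prefixed `scheduler__`, so everything else is (mostly) ad-hoc. A self-contained sketch against a fake /tmp dispatch directory (swap in $SPLUNK_HOME/var/run/splunk/dispatch on a real search head):

```shell
#!/bin/sh
# Demo of a rough dispatch-directory triage. /tmp/dispatch-demo stands in
# for $SPLUNK_HOME/var/run/splunk/dispatch on a real search head.
# Directory names starting with "scheduler__" belong to scheduled/saved
# searches; the rest are (mostly) ad-hoc jobs.
DISPATCH="/tmp/dispatch-demo"

# Fake layout so the demo is self-contained; skip this on a real host.
mkdir -p "$DISPATCH/scheduler__admin__search__alert1_at_123" \
         "$DISPATCH/1716223999.42"
echo results > "$DISPATCH/1716223999.42/results.csv.gz"

du -sh "$DISPATCH"                                     # 1. total size
du -sh "$DISPATCH"/scheduler__* 2>/dev/null | sort -rh # 3. scheduled share
ls "$DISPATCH" | wc -l                                 # number of job dirs
```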

Same conclusion as @woodcock -

I guess I'd summarize by saying: disk is cheap and CPU is expensive. TTL and disk quotas are meant to help customers trade off between CPU and disk effectively.


ddrillic
Ultra Champion

Let's please keep in mind that the major issue in these war room situations is that deleting your own jobs via the Activity > Jobs interface can take many precious minutes. So even if we increase the disk quota, we'll most likely get stuck deleting the old jobs.


woodcock
Esteemed Legend

The search head should be architected so that everyone can run many searches without hitting the dispatch filesystem minimum (500 MB). User/role quota is a whole other thing. Get a 100 GB disk, mount it on the search head as /splunk/dispatch/, stop Splunk, move everything out of $SPLUNK_HOME/var/run/splunk/dispatch/, remove the empty dispatch directory, replace it with a soft link pointing to the other mount, restart Splunk, and ENJOY LIMITLESS SEARCHING!
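The steps above can be sketched as a script. This demo uses /tmp stand-ins so it touches nothing real; on a live search head you would substitute your actual $SPLUNK_HOME and the new 100 GB mount, and run the commented-out splunk stop/start commands:

```shell
#!/bin/sh
set -e
# Demo of relocating dispatch onto a bigger disk. /tmp stand-ins keep this
# runnable end-to-end; swap in real paths on a live host.
SPLUNK_HOME="/tmp/splunk-demo2"          # really /opt/splunk (or similar)
NEW_DISPATCH="/tmp/splunk-dispatch-disk" # really the new 100 GB mount
DISPATCH="$SPLUNK_HOME/var/run/splunk/dispatch"

rm -rf "$SPLUNK_HOME" "$NEW_DISPATCH"    # demo-only cleanup
mkdir -p "$DISPATCH" "$NEW_DISPATCH"
touch "$DISPATCH/1716224000.1"           # pretend an old artifact exists

# 1. On a real host, stop Splunk first:  "$SPLUNK_HOME/bin/splunk" stop
# 2. Move existing artifacts onto the new disk.
mv "$DISPATCH"/* "$NEW_DISPATCH"/
# 3. Replace the now-empty directory with a soft link to the new mount.
rmdir "$DISPATCH"
ln -s "$NEW_DISPATCH" "$DISPATCH"
# 4. Restart:  "$SPLUNK_HOME/bin/splunk" start
ls -ld "$DISPATCH"
```

After this, Splunk keeps writing to $SPLUNK_HOME/var/run/splunk/dispatch as before, but the artifacts physically land on the new mount.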


ddrillic
Ultra Champion

Looking at one of our production search heads, I see that the $SPLUNK_HOME file system is 240 GB, 13% utilized, and dispatch is 2.7 GB. I guess we can improve here ; -)


ddrillic
Ultra Champion

Right, right - the interesting thing is that we do have a 1 TB file system on each search head, only 7% utilized. I can claim that somebody else is maintaining it ; - )


ddrillic
Ultra Champion

Very interesting @woodcock. Our sales engineer said -

-- I wouldn't reduce the TTL to zero… I mean, TTL will only affect search artifacts generated by a saved search, so the people in question are probably “power users” building tons of useful things and should simply have their role increased to power user and/or their disk quotas increased.

But let's be realistic: how big is the search dispatch directory right now? Wouldn't it make more sense to increase the disk quota for all users or power users?

Is that right - does TTL affect only search artifacts generated by saved searches?
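For reference, the per-role quotas the sales engineer mentions live in authorize.conf: `srchDiskQuota` (MB of dispatch space a user of the role may consume) and `srchJobsQuota` (concurrent search jobs). A hedged sketch that writes to a /tmp scratch directory rather than a live install; the values are illustrative, not recommendations:

```shell
#!/bin/sh
# Demo: raise per-role search quotas in authorize.conf.
# /tmp/splunk-demo3 stands in for a real $SPLUNK_HOME.
SPLUNK_HOME="/tmp/splunk-demo3"
mkdir -p "$SPLUNK_HOME/etc/system/local"
cat > "$SPLUNK_HOME/etc/system/local/authorize.conf" <<'EOF'
[role_power]
# MB of dispatch disk a user with this role may consume (illustrative).
srchDiskQuota = 1000
# Concurrent search jobs allowed per user (illustrative).
srchJobsQuota = 10
EOF
cat "$SPLUNK_HOME/etc/system/local/authorize.conf"
```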


woodcock
Esteemed Legend

As I said, get more disk space.


ddrillic
Ultra Champion

The response from our sales engineer -

-- Yes. TTL only affects search artifacts. Searches don't have artifacts unless they are manually saved or scheduled. A dashboard will have search artifacts, and so will a report or alert - but not an ad-hoc search.


woodcock
Esteemed Legend

As I said, get more disk space.


ddrillic
Ultra Champion

Ah, I was mistaken, sorry.

More info here; https://www.splunk.com/blog/2012/09/12/how-long-does-my-search-live-default-search-ttl.html

So an ad-hoc search does have a TTL (the default is 10 minutes), but usually the search artifacts from saved searches and dashboards are what eat disk space.
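For the saved searches and dashboards that do eat the disk, artifact lifetime can be tuned per search with `dispatch.ttl` in savedsearches.conf; it accepts seconds, or a multiple of the schedule period with a `p` suffix. Another /tmp-only sketch; the stanza name and search string are invented for illustration:

```shell
#!/bin/sh
# Demo: set a per-saved-search artifact TTL in savedsearches.conf.
# /tmp/splunk-demo4 stands in for a real $SPLUNK_HOME; the stanza name
# "Hypothetical War Room Alert" is invented for illustration.
SPLUNK_HOME="/tmp/splunk-demo4"
APP_LOCAL="$SPLUNK_HOME/etc/apps/search/local"
mkdir -p "$APP_LOCAL"
cat > "$APP_LOCAL/savedsearches.conf" <<'EOF'
[Hypothetical War Room Alert]
search = index=main error | stats count by host
# Keep this search's artifacts for 2 schedule periods (the "p" suffix).
dispatch.ttl = 2p
EOF
cat "$APP_LOCAL/savedsearches.conf"
```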


woodcock
Esteemed Legend

As I said, get more disk space.


ddrillic
Ultra Champion

Maybe I'm not being clear - does such a short TTL really bring value to the community? Because we consistently have users who hit the disk usage limit, which is a consequence of the TTL settings.
