We have a complex environment and sometimes an accidental misconfiguration will generate heaps of logs, exceeding our Splunk quota. In the past year we've had two events where 5 violations occurred within 30 days, stopping our Splunk search engine. This causes much grief and the managers are furious. There have been talks of ripping out Splunk because of the outages. Obviously I don't want this.
Examples include a Tomcat developer who enabled DEBUG logging on a production system, which generated 30GB in about an hour, and a mail server with a routing problem that generated 18GB in about an hour. A few events like these are all it takes to break Splunk for weeks.
I've implemented the obvious workarounds: filters on the forwarders, send-rate limiting, and separate indexes. None of it has been enough. We keep losing our Splunk environment.
What I really want is a quota system per index, so I can assign "2GB for mail, 5GB for Windows, 5GB for Tomcat". Is this possible? Or can I at least limit the damage so we don't get a site-wide Splunk outage whenever a single system goes nuts?
PS: I have to add that Splunk Sales have always been really good (and fast!) at supplying the reset keys, but we want to prevent the outages from happening in the first place.
I typically establish a "soft" threshold policy, e.g. 85% of my total daily license volume, and have a scheduled search that sends an email/SMS alert to the internal Splunk admin team when this threshold is crossed, so they can jump on a potential spike before a license breach occurs.
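To make the soft-threshold idea concrete, here is a minimal Python sketch of the check itself. It is only an illustration: the function names are made up, the per-index budgets are the ones wished for in the question, and it assumes you already have today's indexed volume per index from somewhere (how you collect those numbers is not shown). It only tells you when to alert; it does not enforce anything in Splunk.

```python
GB = 1024 ** 3  # bytes per gigabyte

def soft_threshold_breached(bytes_indexed_today, daily_quota_bytes, threshold=0.85):
    """Return True once today's volume reaches the soft threshold (default 85%)."""
    if daily_quota_bytes <= 0:
        raise ValueError("daily_quota_bytes must be positive")
    return bytes_indexed_today >= threshold * daily_quota_bytes

def indexes_over_budget(usage_bytes, budget_bytes, threshold=0.85):
    """Return the sorted names of indexes whose usage has crossed their soft threshold.

    usage_bytes and budget_bytes are hypothetical dicts of {index_name: bytes}.
    Indexes without a budget entry are ignored.
    """
    return sorted(
        name
        for name, used in usage_bytes.items()
        if name in budget_bytes
        and soft_threshold_breached(used, budget_bytes[name], threshold)
    )

# Example budgets taken from the question; usage figures are invented.
budgets = {"mail": 2 * GB, "windows": 5 * GB, "tomcat": 5 * GB}
usage = {"mail": 1 * GB, "windows": 2 * GB, "tomcat": 30 * GB}
noisy = indexes_over_budget(usage, budgets)  # indexes that should trigger an alert
```

Running a check like this every few minutes rather than once a day matters here, since the examples in the question burned through the quota in about an hour.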
Regarding a "quota system per index": you can't do this per index, but you can do it per indexer. In a license master/slave setup, carve the total license stack volume into pools and assign each indexer (license slave) to a specific pool. License violations are then contained within that pool.
Ok, I think I understand. The 20GB license "stack" I have can be divided into multiple smaller "pools". Then 5 violations within a pool will only affect slaves using that pool, and won't affect other slaves using different pools within the same stack.
So to minimise the damage I should have as many "pools" as I have license slaves; one slave per pool. That will be beneficial for other reasons, so I'll do exactly that.
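For anyone else reading, here is a rough Python model of the containment described above. It is a simplification based only on what is said in this thread (5 violations within 30 days stops searching, and violations count per pool); the real license accounting may differ in its details. The pool names and sizes are hypothetical and simply need to fit inside the 20GB stack.

```python
from datetime import date, timedelta

WINDOW_DAYS = 30     # window length, as described in this thread
MAX_VIOLATIONS = 5   # violations within the window that stop searching

def search_disabled_pools(daily_usage_gb, pool_quota_gb, today):
    """Return the set of pools with MAX_VIOLATIONS or more over-quota days in the window.

    daily_usage_gb: {date: {pool_name: gb_indexed_that_day}}
    pool_quota_gb:  {pool_name: daily_quota_gb}
    """
    window_start = today - timedelta(days=WINDOW_DAYS - 1)
    violations = {pool: 0 for pool in pool_quota_gb}
    for day, usage in daily_usage_gb.items():
        if not (window_start <= day <= today):
            continue  # outside the window
        for pool, gb in usage.items():
            if gb > pool_quota_gb[pool]:
                violations[pool] += 1
    return {pool for pool, count in violations.items() if count >= MAX_VIOLATIONS}

# Hypothetical pools, one per license slave, carved from a 20GB stack.
pools = {"mail": 2, "windows": 5, "tomcat": 5}
today = date(2024, 1, 30)

# Tomcat blows its quota on five separate days; the other pools stay within theirs.
usage = {
    today - timedelta(days=n): {"mail": 1, "windows": 3, "tomcat": 30}
    for n in range(5)
}
disabled = search_disabled_pools(usage, pools, today)
```

In this model only the Tomcat pool ends up disabled, while mail and Windows keep searching. With a single 20GB pool, the same five 30GB days would have counted against everything.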