I wonder if there is any solution to prevent a simple DoS on a TCP input?
We have a couple of TCP inputs defined on our forwarder. A buggy application was creating a new connection per second and leaving them open, and after a while the forwarder was out of service. We lost a lot of log messages. Most probably the forwarder's OS was still accepting connections, but the application itself couldn't process the data because it was running out of file handles. We found a lot of "too many open files" messages in splunkd.log. I discovered that one client had over 3700 open connections to one specific TCP input.
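A quick way to spot this kind of situation is to count established connections per remote host on the input's port. This is a sketch, assuming a Linux host with `ss` available and using port 9997 as a placeholder for the TCP input's actual port; note that when `ss` is given a state filter it drops the State column, so the peer address is the fourth field:

```shell
# Count established connections per remote IP on the input port
# (9997 is an example -- substitute your TCP input's port).
ss -tan state established '( sport = :9997 )' \
  | awk 'NR > 1 { split($4, peer, ":"); count[peer[1]]++ }
         END { for (ip in count) print count[ip], ip }' \
  | sort -rn
```

A single source IP with thousands of connections at the top of this list is a strong hint that one client, not overall load, is exhausting the file handles.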
So simply increasing the maximum number of open files (ulimit) wouldn't help at all: in this case it would only take longer until the outage occurs, without really preventing it.
I'm looking for a way to prevent this in the future. How can I avoid the forwarder going out of service and losing log data?
I'm looking for an option that allows me to restrict the number of connections coming from one host, or something like that.
I am not an expert, however I've been around the block a few times.
On the one hand, this isn't specifically a Splunk problem; this sort of thing can affect all kinds of systems. There's no real way in Splunk to prevent a client from connecting repeatedly, because TCP connections are handled by the OS. The real solution here is to make the connecting program behave better. On the other hand, there may be some mitigation that can hopefully make this work better.
First, it really IS possible that increasing the ulimit may fix this, because if this problem is what I think it is, each of those open connections is holding a file handle open. There is a timeout on those open sessions, so they will close at some point (4 minutes, I think). If you can raise the ulimit high enough to cover the files "held open" by those stale but not-yet-terminated sessions, you could very well get around the problem. You may not be able to raise it that high, but it's worth a try!
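Raising the limit is mostly a matter of OS configuration. The snippet below is a sketch for a Linux host, assuming splunkd runs as a user named `splunk` (adjust the user name, and the value of 65536, to your environment):

```shell
# Check the limit the running splunkd process actually has:
grep 'open files' /proc/$(pgrep -o splunkd)/limits

# To raise it persistently for the splunk user, add to
# /etc/security/limits.conf:
#   splunk  soft  nofile  65536
#   splunk  hard  nofile  65536

# If splunkd is started by systemd, the unit's limit takes precedence;
# set it in a drop-in override instead:
#   [Service]
#   LimitNOFILE=65536
```

Checking `/proc/<pid>/limits` is the important step: it shows what the running process actually got, which may differ from what the shell's `ulimit -n` reports.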
On the receiving side, I think the issue is the default Maximum Segment Lifetime (MSL). I don't know if it's changeable, and I'd test and research the heck out of it before trying even if I could, but you can read a bit here on MSL. Some further searching may turn up whether it's changeable on your platform.
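For what it's worth, on Linux the TIME_WAIT interval (derived from the MSL) is compiled into the kernel rather than exposed as a sysctl, so the practical options are mostly to observe rather than tune. A couple of hedged inspection commands, assuming a Linux host:

```shell
# FIN_WAIT_2 timeout -- often confused with the MSL, but a different timer:
sysctl net.ipv4.tcp_fin_timeout

# How many sockets are currently stuck in TIME_WAIT:
ss -tan state time-wait | wc -l
```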
Yes, in fact this is not a Splunk problem but a general one.
But I thought there might be some functionality inside of Splunk that could help with that.
I have solved this now using iptables with its connlimit functionality on the splunk-proxy for the affected ports.
Any further connection attempt is answered with "connection refused", and our software then caches the logs on its own until the connection succeeds.
Knowing the infrastructure and the behavior of our software well, I could pick a suitable maximum number of connections.
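A rule along these lines should achieve that; the port (5140) and the limit (20) below are placeholders, so pick values that fit your own environment:

```shell
# Reject new TCP connections once a single source IP already holds more
# than 20 connections to the input port (5140 and 20 are example values).
# --connlimit-mask 32 counts per individual IPv4 address; a TCP reset is
# what the client sees as "connection refused/reset".
iptables -A INPUT -p tcp --syn --dport 5140 \
  -m connlimit --connlimit-above 20 --connlimit-mask 32 \
  -j REJECT --reject-with tcp-reset
```

Matching on `--syn` means only new connection attempts are rejected; the client's existing connections, up to the limit, keep working.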