I expect to receive roughly 200K EPS of firewall events from 10 core firewalls. My initial thought is that if I point all 10 firewalls at a single Forwarder, the software may not be able to handle such a high-volume stream.
Are there any benchmark figures for the typical performance of a Forwarder? For example, if a single Forwarder can handle up to 30K EPS, I will need about 7 Forwarders at the front end to handle the load before sending the events round-robin to the backend Indexers.
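The arithmetic behind that estimate, as a minimal sketch. The 30K EPS per-Forwarder figure is only a placeholder until someone can point to a real benchmark, and the function name is made up for illustration.

```python
def forwarders_needed(total_eps: int, per_forwarder_eps: int) -> int:
    """Smallest number of forwarders whose combined capacity covers total_eps."""
    if per_forwarder_eps <= 0:
        raise ValueError("per_forwarder_eps must be positive")
    # Integer ceiling division, so there is no floating-point rounding.
    return -(-total_eps // per_forwarder_eps)


TOTAL_EPS = 200_000          # estimate from the question
PER_FORWARDER_EPS = 30_000   # hypothetical capacity, NOT a measured benchmark

count = forwarders_needed(TOTAL_EPS, PER_FORWARDER_EPS)
print(f"forwarders needed: {count}")
print(f"average load per forwarder: {TOTAL_EPS / count:.0f} EPS")
```

With these numbers the result is 7 Forwarders, each carrying a little under the assumed 30K EPS, which leaves almost no headroom for bursts.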
I was just about to post a related question when I saw this one. I'll paste it here in the hope that the discussion will help me too.
If I have a forwarder on a server with a few sources, it performs consistently well "out-of-the-box". If I jump up to a few hundred sources, I usually have to do a little tweaking so it can also perform well. (Usually I start by taking the memory hit and bumping up the number of file descriptors.)
However, if I have many thousands of sources, I find it better to run multiple forwarders on the same server. Increasing the file descriptors and changing time_before_close can only go so far. (And in fact, pretty soon the forwarder consumes enough resources on the production system that I have to start defending its existence. Yes, sadly, the places where we need Splunk the most are also the hardest ones to justify it.)
I'm trying to figure out the balance. How do I know when I've got too many inputs, or too much volume, for a single forwarder to handle? When do I make the tradeoff between the overhead of an additional forwarder and continuing to expand the forwarders I already have configured?
I'm looking for a rule of thumb such as "more than 1,000 files which are updated once per minute, or a total of more than 1G per day, or more than 100 files updated 7 times per second, or .... then you need another forwarder".
I have a couple of forwarders trying to service over 1M log files each. They are clearly not able to keep up regardless of how I've tried tuning them. I can't get approval to install more than one forwarder on those servers. As it is, I'm constantly getting flak for how much CPU and memory that one forwarder consumes.
I have another server with 12 forwarders on it. Each forwarder is processing about 1,000 files which are each updating once per minute. They are keeping up (although Splunk's overall footprint on that server is somewhat alarming to the casual observer).
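To put the cases above on one scale, here is a rough sketch that converts file count and update frequency into file updates per second. The function name is made up, and treating updates per second as a proxy for forwarder load is my own assumption; it ignores event size and the per-file overhead of simply tracking idle files.

```python
def updates_per_second(files: int, updates_per_file_per_minute: float) -> float:
    """Aggregate file updates per second across a set of monitored files."""
    return files * updates_per_file_per_minute / 60.0


# One of the 12 forwarders: about 1,000 files, each updated once per minute.
per_forwarder = updates_per_second(1_000, 1)

# All 12 forwarders on that server combined.
whole_server = 12 * per_forwarder

# The "100 files updated 7 times per second" case from the rule-of-thumb example.
busy_files = updates_per_second(100, 7 * 60)

print(f"per forwarder: {per_forwarder:.1f} updates/s")
print(f"whole server:  {whole_server:.1f} updates/s")
print(f"100 busy files: {busy_files:.1f} updates/s")
```

On this measure, 100 very busy files generate more update activity than the whole 12-forwarder server does, which is part of why I doubt a single number like "files per forwarder" is enough for a rule of thumb.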
I've also been thinking about when it is justified to move the log files elsewhere, off of the production systems, to a server where we can spare the capacity for Splunk to pick them up. We could do an NFS mount, but we already know that the slow disk performance of NFS introduces its own kind of latency. We could try read-only mounts of the same SAN disk, but I'm going to have to use up a lot of political capital to get the Unix admins to even be willing to try it. We could try a cluster file system, but again, that is going to require more pull (money and time) than I can usually muster. Or we could write scripts to pull the files back every few minutes using something like rsync.
I have no idea if rsync is going to prove to be lighter weight than Splunk for moving sources around. Unfortunately, it appears to be the direction we are heading.
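For what it's worth, the rsync pull script I have in mind would look something like the sketch below. The host name and directories are hypothetical, and the flags are just a starting point: -a preserves file attributes, -z compresses in transit, and --partial keeps partially transferred files so an interrupted run can resume.

```python
import subprocess


def build_rsync_command(remote_host: str, remote_dir: str, local_dir: str) -> list:
    """Build an rsync command that pulls a remote log directory to a local one."""
    # Trailing slashes make rsync copy the directory contents, not the directory itself.
    source = f"{remote_host}:{remote_dir.rstrip('/')}/"
    destination = f"{local_dir.rstrip('/')}/"
    return ["rsync", "-az", "--partial", source, destination]


def pull_logs(remote_host: str, remote_dir: str, local_dir: str) -> int:
    """Run one pull and return rsync's exit code (0 means success)."""
    command = build_rsync_command(remote_host, remote_dir, local_dir)
    return subprocess.run(command, check=False).returncode


# Hypothetical names, for illustration only:
# pull_logs("prod-app01", "/var/log/app", "/data/staging/prod-app01")
print(build_rsync_command("prod-app01", "/var/log/app", "/data/staging/prod-app01"))
```

This would run from cron (or similar) every few minutes on the server that has capacity to spare, with the forwarder there monitoring the local staging directory. The pull interval adds directly to indexing latency, so it only makes sense where a few minutes of delay is acceptable.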
So I'm also looking for a rule of thumb: "If you can't expand the Splunk forwarder footprint, what are good low-footprint solutions for moving the log files to somewhere that can afford the Splunk overhead to get them, and how do you know when you need to start doing that?"
I think there may be a practical limit to the number of FDs you can have open before the forwarder gets tangled up in itself. When I bump the number way up (16,000), the forwarder gets lost and stops forwarding (but never logs that there is a problem). I'm experimenting with different values to see if it makes a difference to the forwarder's stability, even if latency increases.
We are running 4.1.3. It was hard to tell if 4.1 was a significant improvement. Maybe it was, but I saw nothing obvious on any of the harder-hit forwarders. At the least, the server footprint didn't really change.
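Since the forwarder doesn't log anything when this happens, one thing I'm trying is watching its descriptor usage from the outside. This is a Linux-only sketch that assumes /proc is available and that you run it as a user allowed to read the forwarder's /proc entries; the function names are my own, and you have to supply the forwarder's PID yourself.

```python
import os
from pathlib import Path


def parse_max_open_files(limits_text: str):
    """Return (soft, hard) open-file limits from the text of /proc/<pid>/limits.

    A limit reported as "unlimited" is returned as None.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            fields = line.split()
            # fields: ["Max", "open", "files", <soft>, <hard>, "files"]
            soft, hard = fields[3], fields[4]
            return tuple(int(v) if v.isdigit() else None for v in (soft, hard))
    raise ValueError("no 'Max open files' line found")


def fd_usage(pid: int):
    """Return (open_fds, soft_limit, hard_limit) for a running process."""
    open_fds = len(os.listdir(f"/proc/{pid}/fd"))
    soft, hard = parse_max_open_files(Path(f"/proc/{pid}/limits").read_text())
    return open_fds, soft, hard


# Example with a hypothetical PID:
# open_fds, soft, hard = fd_usage(12345)
# print(f"{open_fds} of {soft} descriptors in use")
```

Logging these numbers every minute or so should at least show whether the forwarder stalls as it approaches the limit or well before it.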