I started a thread a while ago about UDP errors with syslog (http://answers.splunk.com/answers/42645/log-dropping-in-syslog-ng.html). Tuning the UDP buffers resolved the errors at that time, but we are running into the issue again. My question now is: at what point do you need to consider scaling horizontally for syslog collection, and what would that architecture look like? Currently we are running two RHEL VMs sitting behind a NetScaler VIP for UDP 514. The RHEL servers run syslog-ng and have a UF sending the logs to our 4 indexers. The issue with using the NetScaler for UDP is that it is not load balancing, but rather providing failover only.
So, when should we consider scaling horizontally (e.g. when we exceed 1,500 syslog messages a second), and what would be the best way to scale?
Thanks in advance.
The first thing I would suggest is converting all syslog to TCP. This will allow your LBs to actually balance the traffic. The reason they aren't doing so right now is that UDP is connectionless. Once you go TCP, scaling to multiple boxes is easy and is just a matter of watching device performance and upgrading as needed.
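For reference, a minimal syslog-ng source listening on TCP 514 might look like this (the port, file path, and destination name are illustrative, not from the original setup):

```
# syslog-ng.conf sketch -- listen for syslog over TCP so the LB can
# balance per-connection instead of failing over on UDP
source s_tcp {
    network(
        transport("tcp")
        port(514)
    );
};

destination d_files {
    file("/var/log/remote/${HOST}/messages");
};

log { source(s_tcp); destination(d_files); };
```

Because each sender holds a TCP connection, the load balancer can distribute those connections across the collector pool.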
The next thing I would suggest is replacing the UF on the syslog VMs with an HF. This will allow you to apply more intelligence to the logs coming in and sourcetype them better. An HF can parse data before forwarding, unlike a UF, which makes it an excellent companion to syslog. With an HF we are able to split out all the different log types from a single stream; as an example, Cisco ASAs emit logs from multiple modules, and we use the HF to parse each into its own sourcetype for better field identification.
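As a sketch of how that splitting can be done on an HF (the stanza and sourcetype names here are hypothetical, not the poster's actual configs):

```
# props.conf on the HF -- hypothetical base sourcetype for the raw syslog stream
[cisco:syslog]
TRANSFORMS-split_asa = set_sourcetype_asa

# transforms.conf -- route ASA messages (e.g. "%ASA-6-302013")
# into their own sourcetype at parse time
[set_sourcetype_asa]
REGEX = %ASA-\d-\d+
FORMAT = sourcetype::cisco:asa
DEST_KEY = MetaData:Sourcetype
```

Transforms of this kind run at parse time, which is why they work on an HF (or indexer) but not on a UF.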
I currently have 2 HFs with syslog for an entire datacenter's worth of network gear and they operate just fine. Watch disk I/O and network utilization, as those are your main points of resource consumption; RAM and CPU shouldn't be too bad. If resource pressure does get too high, upgrade the devices or add another.
A bit old, I know, but for those looking, a couple of cautions: TCP is quite a bit heavier than UDP, so if you are already dropping packets then TCP may require more hardware capacity immediately. For sheer volume, UDP is hard to beat pound for pound, and in this case pounds = $$$.
How many endpoints are feeding? A few large ones? 10,000 small ones? I would prefer UDP in the latter case, since the LB will still balance on source address (and today you can set up datagram load balancing).
I have used F5 load balancers with both UDP and TCP to a heavy forwarder running a self-contained Splunk app with syslog-ng as the core. This was used to collect, prefilter, proxy to additional subscribers, and log to Splunk. It is a "blended" approach that scaled exceptionally well (1 LB to 4 HFs was typical), and since syslog-ng was packaged as a Splunk app, the logic (configs) and deployment were handled via a deployment server. We had 20+ syslog-ng/HFs running as a reliable, flexible, very high performance collection layer.
OK. I have configured syslog-ng to receive syslog messages over TCP. However, I am getting this error from a test device (FortiGate 60D):
syslog-ng: Error processing log message:
I figured out my issue. Apparently FortiGate security appliances send syslog over TCP using the RFC 3195 format, which is not supported by syslog-ng. I am now researching installing rsyslog.
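If the device can instead be pointed at a plain TCP syslog listener, a minimal rsyslog configuration might look like the following (port, template name, and paths are illustrative assumptions):

```
# /etc/rsyslog.conf sketch -- plain TCP syslog input via the imtcp module
module(load="imtcp")
input(type="imtcp" port="514")

# write each sending host's messages to its own file
template(name="PerHost" type="string"
         string="/var/log/remote/%HOSTNAME%/messages")
*.* action(type="omfile" dynaFile="PerHost")
```

This handles standard TCP syslog framing; RFC 3195 is a separate protocol and would still need dedicated support on the receiver.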
Great information, thanks. Yeah, I had to explain to our NetScaler admin why it wasn't truly load balancing the syslog messages, since it was UDP. I still don't know if he understands completely.
So is the HF accepting the TCP connections as well as parsing/forwarding to indexers, or is there another app/process accepting the TCP connections, like syslog-ng?
I still use syslog-ng to accept the connections and write them to files. I usually have it write into a tiered file structure.
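The original layout example didn't survive here, but a typical tiered syslog-ng destination (one directory per host, one directory per day) might look like this sketch; the paths and macros are illustrative:

```
# syslog-ng destination sketch -- host/date tiered layout so the HF
# can monitor predictable paths and derive host from the directory
destination d_tiered {
    file(
        "/data/syslog/${HOST}/${YEAR}-${MONTH}-${DAY}/messages.log"
        create-dirs(yes)
    );
};
```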
The HF is just parsing and forwarding to the indexer. It is a full Splunk Enterprise instance without search capabilities.
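A sketch of the HF side of that arrangement, assuming the tiered file layout above and hypothetical indexer names:

```
# inputs.conf on the HF -- monitor the files syslog-ng writes
# (path is illustrative; host_segment pulls the host from the 3rd path segment)
[monitor:///data/syslog/*/*/messages.log]
sourcetype = syslog
host_segment = 3

# outputs.conf -- auto load-balance across the indexers (names are examples)
[tcpout:primary_indexers]
server = idx1.example.com:9997, idx2.example.com:9997
```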
As for UDP load balancing, you can reference this: http://support.citrix.com/article/CTX110551. However, we discovered on our Cisco ACE load balancers that no matter what we did, UDP would break in some way. TCP fixed all of our syslog logging issues.
Thanks. Since it is TCP, I am almost wondering why not just use an HF solely? There are options for buffering as well as for writing to disk if needed. If there are two servers sitting behind a TCP syslog VIP, you would be able to restart HF services individually and should not lose any data, provided a single server is able to accept the full syslog load. Any reason you keep syslog-ng? I understand the need for it when running syslog over UDP.
It allows you to restart Splunk at will without disrupting the syslog stream. Also, syslog restarts are faster, so syslog is down for less time when adding new sources. In short, it's easier to manage this way. We've had instances where syslog was the issue, and instances where the HF was the issue; keeping them separate makes troubleshooting conceptually easier.
So, during our most recent upgrade, we were able to blast the package to all HFs at the same time and not lose data, because syslog was still running the whole time. It also gives us flat files to read when trying to figure out why data isn't making it into Splunk (e.g. a new data format that the parsers aren't configured to read). Once it's fixed, all the backlogged data can be brought in without loss.
In short, my preference may be just that, a personal preference. YMMV.
We initially had syslog-ng for some components but replaced those. Now we have UFs wherever we can. For some network zones there are intermediate UFs. We ran into problems with those (file descriptors, and Splunk's bandwidth throttle, which can be changed in limits.conf). We now run 4 intermediate UFs that each receive about 1 MB from 20,000 clients. The 4 UFs are sending the data to 2 indexers for now, but we will move to 4 soon. This setup can handle weekly sequential Puppet updates (which restart Splunk if there is a config change). The systems that do not have a UF send their data to a system that writes all the data to disk, and a UF reads that data from disk.
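The bandwidth throttle mentioned above is the forwarder's default thruput cap, which can be raised on the intermediate UFs like so (the value shown is illustrative):

```
# limits.conf on an intermediate UF -- raise the default 256 KBps
# forwarding cap; 0 means unlimited
[thruput]
maxKBps = 0
```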
Thanks, Chris. How is the overall syslog load distributed to the 4 UFs? Is there some sort of load balancer or cluster IP that performs this action? If not, how do you account for dropped syslog if you need to restart the UF service?