I'm in the process of uplifting our existing logging systems and need some help understanding how true HA can be achieved in a Splunk-only deployment across multiple datacentres. My number one priority is to ensure that no syslogs are lost under any circumstances.
To start with, here is our current setup:
We have two non-Splunk syslog collectors, one at each datacentre. All of our endpoints are configured to send syslogs to BOTH of these collectors, regardless of which datacentre they are in. These collectors write to a file and use the Splunk universal forwarder to push the syslogs to a single Splunk instance (which exists in only one datacentre). To prevent duplication of data, only one syslog collector is configured to send data upstream to Splunk at any given time. If the active syslog collector fails, the second collector must be manually configured to start sending logs.
As you can imagine, this setup is not ideal. On the upside, there is HA for the syslog collection itself, so if a single collector fails we won't lose any logs. On the downside, there is absolutely no HA for Splunk itself.
I have been trying to understand how I can replace the Syslog collectors with Splunk forwarders but still retain the same level of HA and redundancy we currently have for syslog collection. HA for searching / reporting is desirable, but not essential.
I have considered a typical clustered deployment with a single master node, a single search head, a pair of indexers (peer nodes), and a pair of forwarders. There would be one peer indexer and one forwarder at each datacentre.
There are two problems I see with this approach:
First - how do I prevent duplication of logs? From what I understand, I would not be able to configure our endpoints to send logs to both forwarders, or else we would get duplicate data. They would need to send data to a single forwarder, but then if that forwarder fails, we lose logs. I have read in other discussions that a load balancer is a common way to get around this. E.g. we would configure our endpoints to send logs to a single VIP that exists on a load balancer at each site, and the load balancer then sends the logs to the upstream forwarders. If one of the load balancers fails, the endpoints will be routed to the second load balancer at the other datacentre. This is easily achievable with our current network topology, but is this the best way to handle HA in this situation?
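To make the failover behaviour described above concrete, here is a minimal Python sketch of the routing decision a load balancer would make. It is only an illustration under assumptions: the function name `pick_target`, the target names and the health check are hypothetical, and a real load balancer would do this with its own health probes rather than with code like this.

```python
def pick_target(targets, is_healthy):
    """Return the first healthy target in priority order, or None if all are down.

    `targets` is an ordered list (preferred target first) and `is_healthy`
    is a caller-supplied function; both are hypothetical stand-ins for a
    load balancer's backend pool and health probe.
    """
    for target in targets:
        if is_healthy(target):
            return target
    return None


# Hypothetical forwarder names, one per datacentre.
forwarders = ["forwarder-dc1", "forwarder-dc2"]
down = {"forwarder-dc1"}  # pretend the first forwarder has failed

chosen = pick_target(forwarders, lambda name: name not in down)
print(chosen)
```

Because every event is sent to exactly one target at a time, this avoids the duplication that comes from sending to both forwarders, at the cost of depending on the health check to notice a failure quickly.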
The second problem - we would only have single instances of the master node and search head. From what I understand, this shouldn't be a huge problem. If we lose the datacentre where the master node resides, then the indexing peer at the active datacentre will continue to index all logs. We might run into problems with bucket replication & searching, but our logs should be retained at the very least - which is no worse than our current setup.
For high availability, you should have both a search head cluster and an indexer cluster in place, so that if an indexer or a search head fails, the remaining search heads and indexers can take over.
Now let's discuss your first concern. The best approach is to have a VIP in place, so that the traffic is load balanced and can be redirected if there is a failure on the indexer side. If the problem is with the universal forwarder itself, there is no need to worry: because the logs are stored in a file, the forwarder will resume from the point where it stopped once you fix the issue, so there is no duplication.
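The "resume from the same point" idea can be sketched in a few lines of Python. This is not how the universal forwarder is implemented internally; it is only a minimal illustration, assuming the reader remembers a byte offset into the log file, and the name `read_new_lines` is made up for the example.

```python
def read_new_lines(log_path, offset):
    """Return (lines, new_offset) for complete lines appended after `offset`.

    `offset` is the byte position reached on the previous call. Persisting
    it between runs is what lets a restarted reader continue without
    re-sending old lines or skipping new ones.
    """
    with open(log_path, "rb") as f:
        f.seek(offset)
        data = f.read()
    # Consume only complete lines; a partial trailing line is left in place
    # and will be picked up on the next call once it has been finished.
    end = data.rfind(b"\n") + 1
    lines = data[:end].decode("utf-8", errors="replace").splitlines()
    return lines, offset + end
```

Calling it first with offset 0 and then with the offset it returned yields each line exactly once, even if the reader was stopped while the collector kept writing to the file.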
Your second concern: almost everyone runs a single master node. And you are correct here, the logs will still be indexed without any problems; only the replication of buckets will be an issue for some time.