Does anyone have experience with using a large number of universal forwarders (Around 6000) and the deployment server?
I seem to have hit a limit with my deployment server: somewhere around 2500 to 3000 clients, it pegs the CPU on the box. How have you guys dealt with this in your environments?
In my current project, the deployment server is able to handle 5000+ clients in AWS, but we see a lot of slowness when adding new server classes, clients, etc. Otherwise, it's fine. We are planning to deploy more deployment servers so that each one handles 1000 clients.
In the enterprise infrastructure, the deployment server is able to handle ~800 clients without any issues or slowness.
CPU load is always < 1, and the poll interval is set to 300 seconds:
[default]
phoneHomeIntervalInSecs=300
We were able to significantly reduce CPU thrashing on our deployment servers by increasing handShakeRetryInterval in deploymentclient.conf on the deployment clients. The default value of 60 seconds amplifies problems with already-overloaded deployment servers.
I'm guessing you could actually use a single deployment server for X number of clients, if you just increase the polling period. By default the client checks in every 30 seconds (or is it 60 - don't remember), which at your size (6000 clients) would be 200 (or 100) connections a second. That can certainly be too much, especially if you are going to actually deploy configuration data to the clients every now and then. The recommendation is somewhere around 1 DS per 300-500 clients (~10-15 connections/sec).
If you increase the polling interval to 10 or 20 minutes, there would be less load on the DS (~5-10 connections/sec).
This setting is done in the deploymentclient.conf file on each forwarder. Unfortunately this file may not be accessible to you through the DS mechanism. If it's located in $SPLUNK_HOME/etc/system/local, you have no way to override those settings in another deploymentclient.conf elsewhere in the filesystem, because the /etc/system/local files have the highest precedence.
Unless you have some means other than DS to put files on a server, you'd probably need to start honing your ssh/rdp skills (or invest in another DS).
NB: this is a purely mathematical exercise based on concurrent connections - there may be memory constraints or other internal stuff behind the 300-500 limit that I'm not aware of.
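For anyone who wants to redo the arithmetic above with their own numbers, here is a rough Python sketch. It assumes check-ins are spread evenly across the polling interval (real forwarders may cluster), and the function names are made up for illustration, not anything from Splunk.

```python
import math


def phone_home_rate(clients: int, interval_secs: float) -> float:
    """Average check-ins per second, assuming clients are spread
    evenly across the polling interval (an idealization)."""
    return clients / interval_secs


def interval_for_rate(clients: int, max_rate: float) -> int:
    """Smallest whole-second polling interval that keeps the
    average check-in rate at or below max_rate."""
    return math.ceil(clients / max_rate)


if __name__ == "__main__":
    # 6000 clients, as in the original question.
    for interval in (30, 60, 300, 600, 1200):
        rate = phone_home_rate(6000, interval)
        print(f"{interval:>5}s interval -> {rate:6.1f} connections/sec")
```

With 6000 clients this gives 200/sec at 30 seconds, 100/sec at 60 seconds, and 10/sec or 5/sec at 10 or 20 minutes. `interval_for_rate(6000, 10)` goes the other way and returns the interval needed to stay at or under 10 connections/sec.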
Hope this helps,
Kristian
I would try to break up my deployment server at that level. In the past I have set up deployment servers based on my data centers. You could set up a DNS name and use a load balancer to point the forwarder to the correct server.
We tried using a load balancer in front of a pool of deployment servers, but found that deployment-apps have different checksums on each deployment server in the pool. This causes deployment clients to re-download and re-install apps any time they hit a different deployment server.
This comment is quite old now, but I thought I'd mention in case others had a similar issue.
If you want multiple deployment servers behind a load balancer you can configure crossServerChecksum=true in the serverclass.conf on the deployment server.
From docs:
* Ensures that each app will have the same checksum across different deployment servers.
* Useful if you have multiple deployment servers behind a load-balancer.
* Defaults to false.