Is there a way to limit the number of clients a Splunk deployment server pushes a new or updated app to at the same time? We've run into issues where the Splunk universal forwarder is restarted on hundreds of virtual machines simultaneously, causing massive I/O delays on shared storage.
I'm looking for a way to have a Splunk deployment server stage out the deployment of app changes so that only X number of nodes are deployed to at a time.
There is no way to do this with native Splunk functionality aside from grouping clients in serverclasses (by subnet, server location, server role, etc.). The best option is to raise the phone-home interval (phoneHomeIntervalInSecs in deploymentclient.conf on the forwarders) to a level that reduces your I/O to acceptable levels. The caveat is that it takes longer for your clients to pick up updates.
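As a rough sketch of the phone-home change, the interval is set on each forwarder in deploymentclient.conf. The value of 600 seconds below is only an illustrative assumption, not a recommendation; pick whatever your storage can tolerate.

```ini
# deploymentclient.conf on the universal forwarder
# 600 is a hypothetical value chosen for illustration; the default is 60.
[deployment-client]
phoneHomeIntervalInSecs = 600
```

With a longer interval, clients check in less often, so restarts triggered by an app update should be spread over a wider window rather than landing within the same minute.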
Outside of Splunk, there are some other things you could potentially do, although they are not recommended. One option I've seen is to run with multiple DNS CNAMEs for your deployment server. Push your updates in a staggered manner by dropping CNAMEs out of DNS while pushing updates. The UFs will still phone home, but won't be able to connect until you put the CNAMEs back in. You need to be aware of the DNS TTL values in this case, though, and set them really low.
Another option would be firewall rules: block inbound traffic to the DS from portions of your UF subnets while updates are being pushed.
A good way to achieve this would be to break up your fleet into smaller serverclasses, for example by subnet or host-name pattern.
You can then confirm the successful install of your apps and updates per serverclass prior to rolling out to the rest of the fleet.
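A minimal sketch of what that split might look like in serverclass.conf on the deployment server. The serverclass names (batch_a, batch_b), the app name (my_app), and the IP patterns are all hypothetical; substitute whatever grouping fits your environment.

```ini
# serverclass.conf on the deployment server
# batch_a, batch_b, my_app and the IP patterns are made-up examples.

# First batch: forwarders in one subnet
[serverClass:batch_a]
whitelist.0 = 10.1.1.*

[serverClass:batch_a:app:my_app]
restartSplunkd = true

# Second batch: add this app mapping only after batch_a looks healthy
[serverClass:batch_b]
whitelist.0 = 10.1.2.*

[serverClass:batch_b:app:my_app]
restartSplunkd = true
```

The idea is to map the app to one serverclass at a time, reload the deployment server, check the clients in that batch, and then add the mapping for the next batch.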
Hope this helps.