Question: what's the best "maxKBps" setting in such an environment?
About 2000 Forwarders.
I know the one correct answer doesn't exist, since it may vary from server to server and from environment to environment,
but surely there's a best practice for setting this fundamental value, right?
For months now I have been fine with a value of 0 (no bandwidth limit),
but sometimes the Indexers come under real stress when people load many, many GB of logs (more than 1 TB, for analyses of historical data),
since the Indexers receive so much data that their resources saturate, so I have to force maxKBps to 10240 ONLY on some servers to stay healthy.
Now, is 10240 the right compromise for *ALL* Forwarders, perhaps raising the value gradually afterwards?
If anyone is interested, here is a simple add-on that indexes the maxKBps value of every instance/UF, retrieving it directly from Splunk, so you can track the paths and values and build reports, alerts and dashboards.
printf "$(date +'%Y-%m-%d %T') [$(hostname)] ";[ -z "$SPLUNK_HOME" ] && printf "NO SPLUNK_HOME" || $SPLUNK_HOME/bin/splunk btool limits list --debug|grep "maxKBps"
interval=0 * * * *
After it runs, you get the maxKBps field from all your instances/UFs:
index=* sourcetype=maxKBps earliest=-1h | rex ".*\]\s(?<path>.*)\smaxKBps" | table _time host path maxKBps | sort 0 + maxKBps host
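For reference, this is roughly how the script and the interval above fit together as a scripted input stanza. The app name, script filename and target index are my assumptions, not part of the original add-on:

```
# $SPLUNK_HOME/etc/apps/TA-maxkbps/default/inputs.conf  (hypothetical app name)
[script://./bin/get_maxkbps.sh]    # the one-liner above, saved as a shell script
interval = 0 * * * *               # cron syntax: run at the top of every hour
sourcetype = maxKBps
index = main                       # use a dedicated index in production
disabled = 0
```

The cron-style interval means the value is sampled hourly, which matches the `earliest=-1h` window in the search.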
Obviously a general answer doesn't exist, because it depends on your environment, network and load situation.
Anyway, you should start from the default values (512 for UFs and 0 for HFs) and check whether there are queues in your environment, to identify the systems with critical queues, and then modify maxKBps on those systems.
Obviously the most suitable value must be found by trial and error.
The search to find critical queues is:
index=_internal "has reached maxKBps. As a result, data forwarding may be throttled" sourcetype=splunkd | bin _time span=1h | stats count as countPerHost by host, _time | where countPerHost > 1
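Once a critical host is identified, the change itself is a one-line limits.conf pushed to that forwarder, for example via a deployment-server app (the app name here is an assumption; the stanza and attribute are the standard Splunk ones):

```
# etc/deployment-apps/throttle_tuning/local/limits.conf  (hypothetical app)
[thruput]
# 0 = unlimited; the UF ships with a low default. Tune per host by trial and error.
maxKBps = 10240
```

The forwarder must be restarted (or the deployment client reloaded) for the new limit to take effect.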
Almost ALL Forwarders hit their limit (256 is the default, not 512, in the app SplunkUniversalForwarder/default/limits.conf), and their queues filled up.
For this reason I decided, without hesitation, to set the maximum value (=0) and monitor the situation. And it's here that I found, from my monitoring dashboards, frequent absurd peaks (greater than a TB) that created sudden bottlenecks on the Indexers, which for several minutes no longer sent back the correct ACKs. This is because several users, for whom I have configured a RESTORE path, throw in tons of data, and then THEIR SPECIFIC Forwarder's queues fill up.
The question was: could 0 actually be "dangerous" (a bit like intervening with TRUNCATE=0, if you are not careful), for the reasons explained above, both for the other applications running on the host and for bandwidth saturation? So I was wondering what an "average" value could be, based on your experiences. From mine, 10240 is enough in 90% of cases, allowing the UF to send inputs and metrics without blocking the latter.
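As a sanity check on what 10240 actually means (maxKBps is measured in kilobytes per second), a quick back-of-the-envelope conversion:

```shell
#!/bin/sh
# maxKBps is kilobytes per second; estimate per-UF throughput and hourly volume.
MAXKBPS=10240
MB_PER_SEC=$((MAXKBPS / 1024))                   # 10240 KB/s = 10 MB/s
GB_PER_HOUR=$((MAXKBPS * 3600 / 1024 / 1024))    # ~35 GB per hour per forwarder
echo "maxKBps=$MAXKBPS -> ${MB_PER_SEC} MB/s, ~${GB_PER_HOUR} GB/hour"
```

So even one throttled UF can still move roughly 35 GB/hour, which explains why 10240 is enough in the vast majority of cases while still capping the TB-scale bursts.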
I'll try to find a compromise by monitoring 👍
It highly depends on your environment. I had to deliberately lower the max throughput because one of my UFs was getting a huge file to ingest every hour. The UF was fine with reading it, but as it was pushing the chunks of data to the HFs for parsing, they started getting OOM-killed. So I lowered the throughput limit from 40 MBps to 4 MBps. Now the queue on the UF gets clogged a bit for a while, but my HFs are safe.
If this answer solves your need, please accept it for the other people in the Community; otherwise, please tell us how we can help you.
Ciao and happy splunking.
P.S.: Karma Points are appreciated by all the Contributors 😉