I have an installation where I am trying to leverage an intermediate forwarder (IF) to send logs to my indexers. I have approximately 3000 Universal Forwarders (UFs) that I want to send through the IF, but something is limiting the IF to around 1000 connections. The IF is a Windows Server 2019.
I am monitoring the connections with this PowerShell command: netstat -an | findstr 9997 | measure | select count. I never see more than ~1000 connections, even though I have several thousand UFs configured to connect to this IF.
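I can also break the connections down by TCP state with something like this (using the built-in NetTCPIP cmdlets on Server 2019):
Get-NetTCPConnection -LocalPort 9997 | Group-Object State | Select-Object Name, Count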
I have already tried increasing the max user ports, but there was no change:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\MaxUserPort
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\TcpTimedWaitDelay
I have validated the network by creating a simple client and server to test the maximum connections. It reached the expected maximum of 16,000 connections from the client network to the IF. I can also configure a server to listen on port 9997 and see several thousand clients trying to connect to the port.
I believe there must be something wrong with the Splunk IF configuration, but I am at a loss as to what it could be. There are no limits.conf configurations, and the setup is generally very basic.
My official Splunk support is advising me to build more IFs and limit the clients to less than 1000, which I consider a suboptimal solution. Everything I’ve read indicates that an IF should be capable of handling several thousand UFs.
Any help would be greatly appreciated.
Hi @MichaelM1,
MaxUserPort adjusts limits on ephemeral ports. From the perspective of the intermediate forwarder, this would be the maximum port number allocated for an outbound connection to a downstream receiver. The intermediate forwarder would only listen on one port or however many input ports you have defined.
TcpTimedWaitDelay adjusts the amount of time a closed socket will be held until it can be reused by another winsock client/server.
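As a quick sanity check, you can confirm what those registry values are currently set to with something like this (a sketch; the properties will simply be empty if they've never been configured):
Get-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters' | Select-Object MaxUserPort, TcpTimedWaitDelay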
As a quick test, I installed Splunk Universal Forwarder 9.4.0 for Windows on a clean install of Windows Server 2019 Datacenter Edition named win2019 with the following settings:
# %SPLUNK_HOME%\etc\system\local\inputs.conf
[splunktcp://9997]
disabled = 0
# %SPLUNK_HOME%\etc\system\local\outputs.conf
[tcpout]
defaultGroup = default-autolb-group
[tcpout:default-autolb-group]
server = splunk:9997
[tcpout-server://splunk:9997]
where splunk is a downstream receiver.
To simulate 1000+ connections, I installed Splunk Universal Forwarder 9.4.0 for Linux on a separate system with the following settings:
# $SPLUNK_HOME/etc/system/local/limits.conf
[thruput]
maxKBps = 0
# $SPLUNK_HOME/etc/system/local/outputs.conf
[tcpout]
defaultGroup = default-autolb-group
[tcpout:default-autolb-group]
server = win2019:9997
[tcpout-server://win2019:9997]
# $SPLUNK_HOME/etc/system/local/server.conf
# additional default settings not shown
[general]
parallelIngestionPipelines = 2000
[queue]
maxSize = 1KB
[queue=AQ]
maxSize = 1KB
[queue=WEVT]
maxSize = 1KB
[queue=aggQueue]
maxSize = 1KB
[queue=fschangemanager_queue]
maxSize = 1KB
[queue=parsingQueue]
maxSize = 1KB
[queue=remoteOutputQueue]
maxSize = 1KB
[queue=rfsQueue]
maxSize = 1KB
[queue=vixQueue]
maxSize = 1KB
parallelIngestionPipelines = 2000 creates 2000 connections to win2019:9997. (Don't do this in real life. It's a Splunk instance using 2000x the resources of a typical instance. You'll consume memory very quickly as stack space is allocated for new threads.)
So far, I have no issues creating 2000 connections.
Do you have a firewall or transparent proxy between forwarders and your intermediate forwarder? If yes, does the device limit the number of inbound connections per destination ip:port:protocol tuple?
You may want to try setting the following in $SPLUNK_HOME/etc/splunk-launch.conf on the IF:
SPLUNK_LISTEN_BACKLOG = 512
Thanks for the help on this.
The final solution for me was adding the following to server.conf:
[general]
parallelIngestionPipelines = 200
I am not sure I see the benefit of taking the time to find the optimal size for the various queues as you suggest. I have the available CPU and memory to simply increase the pipelines. I will be adding several IFs and letting them load balance, at which point 200 will be way overkill and I may drop it back to something like 50 (or maybe I won't bother 🙂 since everything is working).
If setting 200 parallel pipelines helped, that means the existing pipeline thread(s) on the IF (however many pipelines you had) were maxed out.
Check out:
index=_internal source=*metrics.log host=<IF> ratio thread=fwddatareceiverthread* | timechart span=30s max(ratio) by thread
If all are > 0.95, then you need to add more pipelines.
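A companion view of queue pressure can also help; a sketch using the standard metrics.log queue fields:
index=_internal source=*metrics.log host=<IF> group=queue | eval fill_ratio = current_size_kb / max_size_kb | timechart span=30s max(fill_ratio) by name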
You could leave it that way, but you're maintaining 200 connections to the downstream receivers. If you have, for example, 16 cores on your intermediate forwarder and want to leave 2 cores free for other activity (so much overhead!), you can do the same thing with larger queues and fewer pipelines by increasing maxSize values by the same relative factor. If your forwarder doesn't have enough memory to hold all queues, keep an eye on memory, paging, and disk queue metrics.
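As a rough sketch of that trade-off (the numbers below are purely illustrative, not tested recommendations):
# %SPLUNK_HOME%\etc\system\local\server.conf
[general]
# e.g. 16 cores, leaving 2 free for other activity
parallelIngestionPipelines = 14
[queue=parsingQueue]
# scale queue sizes up instead of adding even more pipelines
maxSize = 4MB
[queue=indexQueue]
maxSize = 4MB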
Yes, there is at least one firewall between the client network and the intermediate forwarder network. I did a quick-and-dirty test like yours by writing a PowerShell script that ran on the client subnet and simply opened as many connections to the IF as it could, along with a corresponding server script listening on a port. As expected, the server maxed out at 16,000 connections. This confirms that there is no networking device between the client network and the IF network limiting the total number of connections. The inputs and outputs you have are effectively the same as mine. I am not doing anything special with them; it is about as basic as it comes.
The next hop from the IF to the indexers goes through a NAT, as my IF has a private address and the indexers are public. I don't suspect the IF would refuse more than ~1k inbound connections just because the upstream is limiting connections, but I don't have an easy way to verify this. I don't control the indexers, so I can't do a similar end-to-end connection test with a lot of ports.
I am still scratching my head on this, and as I said, I am not satisfied with the suggestion of just building more IF servers and limiting them to 1k clients each.
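For reference, the client test script amounted to something like this (a simplified sketch; 'if-server' is a placeholder for the real hostname):
# open TCP connections to the IF until something refuses more
$target  = 'if-server'
$port    = 9997
$clients = New-Object System.Collections.ArrayList
try {
    while ($true) {
        $c = New-Object System.Net.Sockets.TcpClient
        $c.Connect($target, $port)
        [void]$clients.Add($c)
        if ($clients.Count % 1000 -eq 0) { Write-Host "$($clients.Count) connections open" }
    }
} catch {
    Write-Host "Stopped at $($clients.Count) connections: $($_.Exception.Message)"
}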
Hi @MichaelM1,
Does your test script fail at ~1000 connections when sending a handshake directly to the intermediate forwarder input port and not your server script port? Completing a handshake and sending no data while holding the connection open should work. The splunktcp input will not reset the connection for at least (by default) 10 minutes (see the inputs.conf splunktcp s2sHeartbeatTimeout setting).
It still seems as though there may be a limit at the firewall specific to your splunktcp port, but the firewall would be logging corresponding drops or resets.
The connection(s) from the intermediate forwarder to the downstream receiver(s) shouldn't directly impact new connections from forwarders to the intermediate forwarder, although blocked queues may prevent new connections or close existing ones.
Have you checked metrics.log on the intermediate forwarder for blocked=true events? A large number of streams moving through a single pipeline on an intermediate forwarder will likely require increasing queue sizes or adding pipelines.
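For example, a search along these lines will surface any blocked queues quickly (using the standard metrics.log fields):
index=_internal source=*metrics.log host=<IF> group=queue blocked=true | stats count by name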
I pointed my client test script at port 9997 on the existing IF server that currently has ~1000 clients connected to it, and it immediately fails to establish any more connections. I have since built a new server using the exact same IF configuration, and my client script connects to 9997 with up to 16,000 connections as expected. I don't understand this; I guess the Splunk process with the 1000 connections knows there is data traversing it and is too busy to accept more connections.
In metrics.log I see this every minute:
<date time> Metrics - group=queue, name=parsingqueue, blocked=true, max_size_kb=512, current_size_kb=511, current_size=1217, largest_size=1217,smallest_size=0
I don't know what this means or if it is significant, but I will start researching it.
I am also not sure what you mean by adding pipelines, so I will look into that too.
I tried playing around with the following settings, and none of them had any effect:
[general]
pipelineSetAutoScale
[queue]
autoAdjustQueue
maxSize
Then, after some more research, I added the following to server.conf:
[general]
parallelIngestionPipelines = 200
The server immediately allowed over 7000 connections, consumed all 32 GB of memory, and used a huge amount of network bandwidth on my IF for a few minutes before settling at about 6800 connections and 22-30 GB of memory.
It is currently receiving about 1.5 Gbps and sending ~100 Mbps. Presumably it is burning off the backlog of all the logs it was not able to forward before. I am up to 3200 devices reporting and climbing.
I am hesitant to say this is fixed, and I would like to know if there are any long-term issues with keeping parallelIngestionPipelines = 200. It is also unclear why a single pipeline is limited to ~1000 connections, as I have not seen that documented anywhere.
Hi @MichaelM1,
Increasing parallelIngestionPipelines to a value larger than 1 is similar to running multiple instances of splunkd with splunktcp inputs on different ports. As a starting point, however, I would leave parallelIngestionPipelines unset or at the default value of 1.
splunkd uses a series of queues in a pipeline to process events. Of note:
- parsingQueue
- aggQueue
- typingQueue
- rulesetQueue
- indexQueue
There are other queues, but these are the most well-documented. See https://community.splunk.com/t5/Getting-Data-In/Diagrams-of-how-indexing-works-in-the-Splunk-platfor.... I have copies of the printer and high-DPI display friendly PDFs if you need them.
On a typical universal forwarder acting as an intermediate forwarder, parsingQueue, which performs minimal event parsing, and indexQueue, which sends events to outputs, are the likely bottlenecks.
Your metrics.log event provides a hint:
<date time> Metrics - group=queue, name=parsingqueue, blocked=true, max_size_kb=512, current_size_kb=511, current_size=1217, largest_size=1217,smallest_size=0
Note that metrics.log logs queue names in lower case, but queue names are case-sensitive in configuration files.
parsingQueue is blocked because it has hit its configured size limit (current_size_kb=511 of max_size_kb=512, with 1217 events queued). The inputs.conf splunktcp stopAcceptorAfterQBlock setting controls what happens to the listener port when a queue is blocked, but you don't need to modify this setting.
In your case, I would start by leaving parallelIngestionPipelines at the default value of 1 as noted above and increasing indexQueue to a multiple of 128 bytes that is at least twice the largest_size value observed for parsingQueue. In %SPLUNK_HOME%\etc\system\local\server.conf on the intermediate forwarder:
[queue=indexQueue]
# 2 * 1217KB = 2434KB, rounded up to 2560KB (a multiple of 128 bytes)
maxSize = 2560KB
(x86-64, ARM64, and SPARC architectures have 64-byte cache lines, but on the off chance you encounter AIX on PowerPC with 128-byte cache lines, for example, you'll avoid buffer alignment performance penalties, closed-source splunkd memory allocation overhead notwithstanding.)
Observe metrics.log following the change and keep increasing maxSize until you no longer see instances of blocked=true. If you run out of memory, add more memory to your intermediate forwarder host or consider scaling your intermediate forwarders horizontally with additional hosts.
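One way to watch the effect of each change over time (a sketch; adjust the host and queue names as needed):
index=_internal source=*metrics.log host=<IF> group=queue (name=parsingqueue OR name=indexqueue) | timechart span=5m max(current_size_kb) max(largest_size) by name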
As an alternative, you can start by increasing maxSize for parsingQueue and only increase maxSize for indexQueue if you see blocked=true messages in metrics.log:
[queue=parsingQueue]
maxSize = 2560KB
You can usually find the optimal values through trial and error without resorting to a queue-theoretic analysis.
If you find that your system becomes CPU-bound at some maxSize limit, you can increase parallelIngestionPipelines, for example, to N-2, where N is the number of cores available. Following that change, modify maxSize from default values by observing metrics.log. Note that each pipeline consumes as much memory as a single-pipeline splunkd process with the same memory settings.
I'm also assuming that you've already set maxKBps = 0 in limits.conf:
# $SPLUNK_HOME/etc/system/local/limits.conf
[thruput]
maxKBps = 0
