Getting Data In

Why did ingestion slow way down after I added thousands of new forwarders and/or sources and sourcetypes?

Contributor

My Splunk environment was humming right along until I needed to very quickly add several thousand new FWDs and a bunch of new apps on those endpoints collecting many new sources and sourcetypes. I happen to have an intermediate forwarding tier of HWFs, but I don't know if that makes any difference.

After I added all these new FWDs, sources and sourcetypes, I noticed my rate of ingestion on all my IDXs dropped a lot. Restarting IDXs via rolling restart doesn't seem to make a difference.

What's going on?

1 Solution

Contributor

Hi, I'm David Paper. You might remember me from such posts as SmartStore Behaviors and How to read HEC introspection logs. I'm here today to talk to you about a scourge in Splunk, and that is having too many channels created in too little time on your indexers.

What is a channel? An ingestion channel for a FWD connection over the Splunk2Splunk protocol (generally on port 9997) to an indexer (or HWF) is an identifier for a particular tuple of host+sourcetype+source+tcp_connection, created when a FWD connects to an IDX and begins to send data in. Sometimes these tuples show up in a log when there is a problem registering one, such as:

04-01-2019 23:41:19.706 +0000 ERROR TcpInputProc - Encountered Streaming S2S error=Cannot register new_channel="source::[log location]|host::[hostname]|sourcetype::iis|263011": desired_field_count=19 conflicts with existing_field_count=0 for data received from src=[HF-IP]:51090.

So, when you rolled out a bunch of new FWDs, or updated existing FWDs with new apps or inputs that began ingesting new sources and/or sourcetypes, you created more potential tuples in your environment. All of those tuples ultimately have to be handled by the indexers, and every channel created to handle a new tuple has to be removed at some point. This constant creation and removal load becomes problematic when it gets above a certain rate. More on this below.

With this information in hand, how do you know whether you have too many tuples for your current indexer configuration to handle? There are a number of signs that, taken together, can help you know where to look.

On the FWD end:
- Many new FWD endpoints
- Many new source/sourcetypes
- Combo of both
- FWD queues (UF or HWF) are all backing up and don't seem to ever catch up, but some data is still being ingested
- A FWD restart seems to push a lot of data through for a few minutes then regresses back to pre-restart throughput levels
- Adding additional ingestion pipelines to FWDs hasn't helped with the queue congestion
- netstat output on FWD shows tcp data sitting in Send-Q queue for indexer connections
- tcpdump on either the IDX or FWD shows the TCP window size getting set to 0 from the IDX side, indicating data needs to pause flowing into the IDX.

On the IDX end:
- Ingestion throughput on indexers drops significantly
- Ingestion queues aren't blocked any more than before
- No disk I/O issues or CPU contention issues apparent
- Indexer bucket replication queue may be full
- Adding additional ingestion pipelines to indexers hasn't helped
- Events are getting delayed on ingestion, by minutes or hours
- netstat output on IDX shows tcp data sitting in Recv-Q queue waiting to be processed by the indexer
- tcpdump on either the IDX or FWD shows the TCP window size getting set to 0 from the IDX side, indicating data needs to pause flowing into the IDX. (Yes, it's here twice; it's important.)

Metrics.log example with channel info:
metrics.log.4:03-15-2019 03:42:28.183 -0400 INFO Metrics - group=map, name=pipelineinputchannel, current_size=172, inactive_channels=91, new_channels=47202, removed_channels=54982, reclaimed_channels=516, timedout_channels=6, abandoned_channels=0
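If you want to spot-check channel churn in a raw metrics.log outside of Splunk, a minimal sketch along these lines can pull the counters out of a pipelineinputchannel line (the regex and field names are based on the sample line above):

```python
import re

# Matches numeric key=value pairs in a metrics line,
# e.g. "new_channels=47202, removed_channels=54982"
PAIR_RE = re.compile(r"(\w+)=(\d+)")

def channel_stats(line):
    """Return the numeric fields of a pipelineinputchannel metrics line as a dict."""
    if "name=pipelineinputchannel" not in line:
        return None
    return {k: int(v) for k, v in PAIR_RE.findall(line)}

sample = ('03-15-2019 03:42:28.183 -0400 INFO Metrics - group=map, '
          'name=pipelineinputchannel, current_size=172, inactive_channels=91, '
          'new_channels=47202, removed_channels=54982, reclaimed_channels=516, '
          'timedout_channels=6, abandoned_channels=0')

stats = channel_stats(sample)
print(stats["new_channels"], stats["removed_channels"])  # prints "47202 54982"
```

Summing new_channels across the lines for a one-minute window gives you the per-minute creation rate discussed below, without needing a search head.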

Search to chart new channel creations and removals by indexer:

index=_internal host= source=*metrics.log* new_channels | timechart span=1m avg(new_channels) avg(removed_channels) by host limit=0 useother=f

  • Target staying under an average of 5,000 new channels created on an indexer each minute (regardless of how many ingestion pipelines are on the indexer)
  • An average of 5,000 to 10,000/min will start to show some FWD queuing, and may not recover
  • An average of over 10,000/min will almost always result in problems
  • If these values move up and down in a similar pattern, and the new channel creation rate is consistently above 5,000/min, it is time to take action.

Now you know whether you have a problem that needs addressing. There are two places it can be addressed: the FWD side and the IDX side.

On the FWD side, tell the FWDs to park themselves on a single IDX connection longer, so the tuples they use stay active longer.

For all FWDs, update outputs.conf:
- unset autoLBVolume (if set)
- autoLBFrequency = 120 (up from the default of 30)
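As a sketch, the outputs.conf change might look like this (the tcpout group name and server list are placeholders for your own):

```ini
[tcpout:my_indexers]
server = idx1.example.com:9997, idx2.example.com:9997
# Stay on one indexer connection longer so channels stay active longer
autoLBFrequency = 120
# If autoLBVolume is set in this stanza, remove it
```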

If UF, update props.conf for all sourcetypes
- EVENT_BREAKER_ENABLE=true
- EVENT_BREAKER=regex_expression
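A sketch of the props.conf change on a UF (the sourcetype name is a placeholder, and the regex shown is the common break-at-newlines pattern, so adjust it to match your data):

```ini
[my_sourcetype]
EVENT_BREAKER_ENABLE = true
# Capture group marks the event boundary; here, one event per line
EVENT_BREAKER = ([\r\n]+)
```

With event breaking enabled, the UF can switch indexers at event boundaries instead of holding a connection open mid-event, which makes the autoLBFrequency setting above actually effective.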

If HWF, update outputs.conf
- forceTimebasedAutoLB=true
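On a HWF, the corresponding outputs.conf sketch (again, the tcpout group name is a placeholder):

```ini
[tcpout:my_indexers]
# Force load-balancer switching on the time interval even mid-stream
forceTimebasedAutoLB = true
```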

On the IDX side, tell the IDX to keep more inactive tuples around in memory. This isn't free: it takes roughly 100MB of memory for every 10,000 inactive channels (so 20,000 costs about 200MB), but on an indexer with tens of GB of RAM this shouldn't be an issue.

limits.conf on each indexer:

[input_channels]
max_inactive = 20000

The max_inactive value may need to be raised further if the new_channels creation rate doesn't drop enough to get below the 5,000/min threshold.
