Getting Data In

Why did ingestion slow way down after I added thousands of new forwarders and/or sources and sourcetypes?

Contributor

My Splunk environment was humming right along until I needed to very quickly add several thousand new FWDs and a bunch of new apps on those endpoints collecting many new sources and sourcetypes. I happen to have an intermediate forwarding tier of HWFs, but I don't know if that makes any difference.

After I added all these new FWDs, sources and sourcetypes, I noticed my rate of ingestion on all my IDXs dropped a lot. Restarting IDXs via rolling restart doesn't seem to make a difference.

What's going on?

1 Solution

Contributor

Hi, I'm David Paper. You might remember me from such posts as SmartStore Behaviors and How to read HEC introspection logs. I'm here today to talk to you about a scourge in Splunk, and that is having too many channels created in too little time on your indexers.

What is a channel? An ingestion channel for a FWD connection over the Splunk2Splunk protocol (generally on port 9997) to an indexer (or HWF) is an identifier for a particular tuple of host+sourcetype+source+tcp_connection, created when a FWD connects to an IDX and begins to send data in. Sometimes these tuples show up in a log when there is a problem registering one, such as:

04-01-2019 23:41:19.706 +0000 ERROR TcpInputProc - Encountered Streaming S2S error=Cannot register new_channel="source::[log location]|host::[hostname]|sourcetype::iis|263011": desired_field_count=19 conflicts with existing_field_count=0 for data received from src=[HF-IP]:51090.

So, when you rolled out a bunch of new FWDs, or updated existing FWDs with new apps or inputs that began ingesting new sources and/or sourcetypes, you created more potential tuples in your environment. All of those tuples ultimately have to be handled by the indexers, and every channel created to handle a new tuple has to be removed at some point. This constant creation and removal load becomes problematic when it gets above a certain rate. More on this below.

With this information in hand, how do you know whether you have too many tuples for your current indexer configuration to handle? There are a number of signs that, taken together, can help you know where to look.

On the FWD end:
- Many new FWD endpoints
- Many new source/sourcetypes
- Combo of both
- FWD queues (UF or HWF) are all backing up and don't seem to ever catch up, but some data is still being ingested
- A FWD restart seems to push a lot of data through for a few minutes then regresses back to pre-restart throughput levels
- Adding additional ingestion pipelines to FWDs hasn't helped with the queue congestion
- netstat output on FWD shows tcp data sitting in Send-Q queue for indexer connections
- tcpdump on either the IDX or FWD shows the TCP window size getting set to 0 from the IDX side, indicating data needs to pause flowing into the IDX.

On the IDX end:
- Ingestion throughput on indexers drops significantly
- Ingestion queues aren't blocked any more than before
- No disk I/O issues or CPU contention issues apparent
- Indexer bucket replication queue may be full
- Adding additional ingestion pipelines to indexers hasn't helped
- Events are getting delayed on ingestion, by minutes or hours
- netstat output on IDX shows tcp data sitting in Recv-Q queue waiting to be processed by the indexer
- tcpdump on either the IDX or FWD shows the TCP window size getting set to 0 from the IDX side, indicating data needs to pause flowing into the IDX. (Yes, it's here twice; it's important.)

Metrics.log example with channel info:
metrics.log.4:03-15-2019 03:42:28.183 -0400 INFO Metrics - group=map, name=pipelineinputchannel, current_size=172, inactive_channels=91, new_channels=47202, removed_channels=54982, reclaimed_channels=516, timedout_channels=6, abandoned_channels=0
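If you want to spot-check channel churn in a raw metrics.log outside of Splunk, a minimal sketch along these lines can pull the counters out of a pipelineinputchannel line (the regex and field names are based on the sample line above):

```python
import re

# Matches numeric key=value pairs in a metrics line,
# e.g. "new_channels=47202, removed_channels=54982"
PAIR_RE = re.compile(r"(\w+)=(\d+)")

def channel_stats(line):
    """Return the numeric fields of a pipelineinputchannel metrics line as a dict."""
    if "name=pipelineinputchannel" not in line:
        return None
    return {k: int(v) for k, v in PAIR_RE.findall(line)}

sample = ('03-15-2019 03:42:28.183 -0400 INFO Metrics - group=map, '
          'name=pipelineinputchannel, current_size=172, inactive_channels=91, '
          'new_channels=47202, removed_channels=54982, reclaimed_channels=516, '
          'timedout_channels=6, abandoned_channels=0')

stats = channel_stats(sample)
print(stats["new_channels"], stats["removed_channels"])  # prints "47202 54982"
```

Summing new_channels across the lines for a one-minute window gives you the per-minute creation rate discussed below, without needing a search head.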

Search to chart new channel creations and removals by indexer:

index=_internal host= source=*metrics.log* new_channels | timechart span=1m avg(new_channels) avg(removed_channels) by host limit=0 useother=f

  • Target staying under an average of 5,000 new channels created on an indexer each minute (regardless of how many ingestion pipelines are on the indexer)
  • An average of 5,000 to 10,000/min will start to show some FWD queuing, and may not recover
  • An average of over 10,000/min will almost always result in problems
  • If these values move up and down in a similar pattern, and the new channel creation rate is consistently above 5,000/min, it is time to take action.

Now you know whether you have a problem that needs addressing. There are two places it can be addressed: the FWD side and the IDX side.

On the FWD side, tell the FWDs to park themselves on a single IDX connection longer, so the tuples they use stay active longer.

For all FWDs, update outputs.conf:
- unset autoLBVolume (if set)
- autoLBFrequency = 120 (up from the default of 30)
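As a sketch, the outputs.conf change might look like this (the tcpout group name and server list are placeholders for your own):

```ini
[tcpout:my_indexers]
server = idx1.example.com:9997, idx2.example.com:9997
# Stay on one indexer connection longer so channels stay active longer
autoLBFrequency = 120
# If autoLBVolume is set in this stanza, remove it
```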

If UF, update props.conf for all sourcetypes
- EVENT_BREAKER_ENABLE=true
- EVENT_BREAKER=regex_expression
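A sketch of the props.conf change on a UF (the sourcetype name is a placeholder, and the regex shown is the common break-at-newlines pattern, so adjust it to match your data):

```ini
[my_sourcetype]
EVENT_BREAKER_ENABLE = true
# Capture group marks the event boundary; here, one event per line
EVENT_BREAKER = ([\r\n]+)
```

With event breaking enabled, the UF can switch indexers at event boundaries instead of holding a connection open mid-event, which makes the autoLBFrequency setting above actually effective.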

If HWF, update outputs.conf
- forceTimebasedAutoLB=true
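On a HWF, the corresponding outputs.conf sketch (again, the tcpout group name is a placeholder):

```ini
[tcpout:my_indexers]
# Force load-balancer switching on the time interval even mid-stream
forceTimebasedAutoLB = true
```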

On the IDX side, tell the IDX to keep more inactive tuples around in memory. This isn't free: it takes roughly 100MB of memory for every 10,000 inactive channels (so 20,000 costs about 200MB), but on an indexer with tens of GB of RAM this shouldn't be an issue.

limits.conf on each indexer:

[input_channels]
max_inactive = 20000

The max_inactive value may need to be raised further if the new_channels creation rate doesn't drop enough to get below the 5,000/min threshold.
