Getting Data In

Why did ingestion slow way down after I added thousands of new forwarders and/or sources and sourcetypes?

davidpaper
Contributor

My Splunk environment was humming right along until I needed to very quickly add several thousand new FWDs and a bunch of new apps on those endpoints, collecting many new sources and sourcetypes. I happen to have an intermediate forwarding tier of HWFs, but I don't know if that makes any difference.

After I added all these new FWDs, sources and sourcetypes, I noticed my rate of ingestion on all my IDXs dropped a lot. Restarting IDXs via rolling restart doesn't seem to make a difference.

What's going on?

1 Solution

davidpaper
Contributor

Hi, I'm David Paper. You might remember me from such posts as SmartStore Behaviors and How to read HEC introspection logs. I'm here today to talk to you about a scourge in Splunk, and that is having too many channels created in too little time on your indexers.

What is a channel? An ingestion channel is an identifier for a particular tuple of host+sourcetype+source+tcp_connection, created when a FWD connects to an indexer (or HWF) over the Splunk2Splunk protocol (generally on port 9997) and begins to send data in. Sometimes these tuples show up in a log when there is a problem registering one, like:

04-01-2019 23:41:19.706 +0000 ERROR TcpInputProc - Encountered Streaming S2S error=Cannot register new_channel="source::[log location]|host::[hostname]|sourcetype::iis|263011": desired_field_count=19 conflicts with existing_field_count=0 for data received from src=[HF-IP]:51090.

So, when you rolled out a bunch of new FWDs, or updated existing FWDs with new apps or inputs that began ingesting new sources and/or sourcetypes, you created more potential tuples in your environment. All of those tuples ultimately have to be handled by the indexers. Every channel created to handle a new tuple also has to be removed at some point. This constant creation & removal load becomes problematic when it gets above a certain rate. More on this below.

With this information, how do I know I have too many tuples for my current indexer configuration to handle? There are a number of signs that, taken together, can help you know where to look.

On the FWD end:
- Many new FWD endpoints
- Many new source/sourcetypes
- Combo of both
- FWD queues (UF or HWF) are all backing up and don't seem to ever catch up, but some data is still being ingested (see the search sketch after this list)
- A FWD restart seems to push a lot of data through for a few minutes then regresses back to pre-restart throughput levels
- Adding additional ingestion pipelines to FWDs hasn't helped with the queue congestion
- netstat output on FWD shows tcp data sitting in Send-Q queue for indexer connections
- tcpdump on either the IDX or the FWD shows the TCP window size being set to 0 from the IDX side, indicating that data flowing into the IDX needs to pause.
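
One way to confirm the forwarder-side queue symptoms above is to look for blocked queue events in the forwarders' own metrics.log. This is only a rough sketch: it assumes the FWDs are forwarding their _internal logs, and you may want to narrow the host field to just your forwarders.

index=_internal source=*metrics.log* group=queue blocked=true | timechart span=5m count by host limit=0 useother=f

A forwarder whose blocked-queue count keeps climbing, instead of spiking and recovering, matches the "backing up and never catching up" pattern above.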

On the IDX end:
- Ingestion throughput on indexers drops significantly
- Ingestion queues aren't blocked any more than before
- No disk I/O issues or CPU contention issues apparent
- Indexer bucket replication queue may be full
- Adding additional ingestion pipelines to indexers hasn't helped
- Events are getting delayed on ingestion, by minutes or hours
- netstat output on IDX shows tcp data sitting in Recv-Q queue waiting to be processed by the indexer
- tcpdump on either the IDX or the FWD shows the TCP window size being set to 0 from the IDX side, indicating that data flowing into the IDX needs to pause. (Yes, it's here twice; it's important.)

Metrics.log example with channel info:
metrics.log.4:03-15-2019 03:42:28.183 -0400 INFO Metrics - group=map, name=pipelineinputchannel, current_size=172, inactive_channels=91, new_channels=47202, removed_channels=54982, reclaimed_channels=516, timedout_channels=6, abandoned_channels=0

Search to track new_channel creations and removals by indexer:

index=_internal host= source=*metrics.log* new_channels | timechart span=1m avg(new_channels) avg(removed_channels) by host limit=0 useother=f

  • Target staying under an average of 5000 new channels created on an indexer each minute (regardless of how many ingestion pipelines are on the indexer)
  • An avg of 5000 to 10,000/min will start to show some FWD queuing, and may not recover
  • An avg of over 10,000/min will almost always result in problems
  • If these values move up/down in a similar pattern, and the new channel creation rate is consistently above 5000/min, it is time to take action (see the alert sketch below).
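
To turn that 5000/min guideline into something that can be scheduled as an alert, a search along these lines is a starting point. It is only a sketch: it reuses the avg(new_channels) approach from the search above, the 5000 threshold is the guideline from the bullets rather than a hard limit, and you may want to restrict the host field to just your indexers.

index=_internal source=*metrics.log* group=map name=pipelineinputchannel | bin _time span=1m | stats avg(new_channels) as avg_new_channels by _time host | where avg_new_channels > 5000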

Now you know whether you have a problem that needs addressing. There are two places it can be addressed: the FWD side and the IDX side.

On the FWD side, tell the FWDs to park themselves on a single IDX connection longer, so the tuples they use stay active longer (a combined config sketch follows the settings below).

For all FWDs, update outputs.conf
- unset autoLBVolume (if set)
- autoLBFrequency=120 (up from default of 30)

If UF, update props.conf for all sourcetypes
- EVENT_BREAKER_ENABLE=true
- EVENT_BREAKER=regex_expression

If HWF, update outputs.conf
- forceTimebasedAutoLB=true
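
Pulled together, the forwarder-side changes above might look roughly like the following. Treat it as a sketch rather than a drop-in config: the tcpout group name and the sourcetype stanza are placeholders for your own, and the EVENT_BREAKER regex must match the real event boundaries of each sourcetype (the simple end-of-line pattern below only suits single-line events).

# outputs.conf on all FWDs ("primary_indexers" is a placeholder group name)
[tcpout:primary_indexers]
autoLBFrequency = 120
# autoLBVolume should be removed here if it was previously set
# forceTimebasedAutoLB = true  (uncomment on HWFs only)

# props.conf on UFs, one stanza per sourcetype ("my_sourcetype" is a placeholder)
[my_sourcetype]
EVENT_BREAKER_ENABLE = true
EVENT_BREAKER = ([\r\n]+)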

On the IDX side, tell the IDX to keep more inactive tuples around in memory. This isn't free, and will take ~100MB of memory for every 10k channels, but on an indexer with 10s of GB of RAM, this shouldn't be an issue.

limits.conf
- [input_channels]
- max_inactive = 20000

The max_inactive value may need to be raised further if the new_channels creation rate doesn't drop enough to get below the 5000/min threshold.

jeffland
SplunkTrust

Thank you for this in-depth write-up. I see values on our indexers slightly below and sometimes above 5k, and we still have that setting at its default, auto, which limits.conf.spec explains as:

When set to 'auto', the Splunk platform will tune this setting based on the physical RAM present in the server at startup.

Our indexers have enough RAM to support increasing this value. Is there a way to see what it's currently effectively set to, to estimate the room for improvement there? Btool says it's "auto", and I couldn't find anything logged in _internal, e.g. on indexer restart in the loader component. btool also outputs a setting max_mem_usage_mb under the input_channels stanza; is that used in combination with max_inactive = auto? Should I just increase max_mem_usage_mb, which by default is 200?

Furthermore, can you explain what the other stats mean? I understand new_channels means the number of new tuples that have been created, either because a tuple hadn't been used at all previously (e.g. a new file) or because the channel hadn't been used in a while, the limit for open channels had been reached, and an inactive channel was therefore closed and needs to be reopened. I guess I also understand that inactive_channels is the number of channels currently not actively sending data (probably inactive since inactive_eligibility_age_seconds?).

But what's the difference between a removed, timed out, reclaimed and abandoned channel? I'm looking for clues about why we have a lot of new channels - is it because we remove channels too early, or are we actually receiving many net new tuples? If I knew which of these numbers tells me what is happening in my environment, I could make an informed decision about whether increasing the number of channels is going to help.
