Getting Data In

Why do our Heavy Forwarders randomly shut down one of their parallel pipelines?

mortenklow
Explorer

Hi,

We've recently been testing a new path for data to flow into our Splunk environment from Universal Forwarders.
We have rolled this out to a small portion of our UFs for now, and we've already run into an issue a few times that I can't figure out.

Data flow is as follows:
UFs send logs with [httpout] -> load balancer -> a set of Heavy Forwarders receive the data via [http://..] inputs and forward it with [tcpout] -> the index cluster receives the data via [tcp://..].

The Heavy Forwarders are there to do some filtering and to route specific data to other Splunk environments. They are also configured with parallelIngestionPipelines=2.
I should also mention that every part of this environment runs on Windows.
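
For reference, here is roughly what the configuration looks like on each tier. This is only a sketch; the stanza names, token, hosts, and ports below are placeholders rather than our real values:

# UF outputs.conf - send data out over HTTP to the load balancer (placeholder URI/token)
[httpout]
httpEventCollectorToken = <token>
uri = https://lb.example.com:8088

# HF inputs.conf - HTTP input enabled, plus a token stanza for the UF traffic (placeholders)
[http]
disabled = 0
port = 8088

[http://from_ufs]
token = <token>

# HF outputs.conf - forward on to the index cluster (placeholder hosts)
[tcpout:indexers]
server = idx1.example.com:9997, idx2.example.com:9997

# HF server.conf - two parallel ingestion pipelines
[general]
parallelIngestionPipelines = 2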

The issue:
A few times, I've suddenly had 100% queue fill on all four data queues (parsing/aggregation/typing/indexing) on one of the Heavy Forwarders, but only on one of its pipelines, not the other. After looking in splunkd.log, it seems that one pipeline has simply shut down (I didn't know this was even possible).
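
The per-pipeline queue fill is easy to see in metrics.log with a search along these lines (just a sketch; the host filter is a placeholder):

index=_internal host=<heavy_forwarder> source=*metrics.log group=queue (name=parsingqueue OR name=aggqueue OR name=typingqueue OR name=indexqueue)
| eval fill_pct = round(current_size_kb / max_size_kb * 100, 1)
| timechart span=1m max(fill_pct) by ingest_pipe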

At first, splunkd.log is filled with the usual entries, over and over:

07-01-2022 12:35:55.724 +0200 INFO  TailReader [2092 tailreader1] - Batch input finished reading file='D:\Splunk\var\spool\splunk\tracker.log'
07-01-2022 12:35:59.129 +0200 INFO  AutoLoadBalancedConnectionStrategy [4860 TcpOutEloop] - Connected to idx=X.X.X.X:9997:0, pset=1, reuse=0. autoBatch=1
07-01-2022 12:35:59.602 +0200 INFO  AutoLoadBalancedConnectionStrategy [7160 TcpOutEloop] - Connected to idx=Y.Y.Y.Y:9997:0, pset=0, reuse=0. autoBatch=1

Until it suddenly logs this, once or a few times:

07-01-2022 12:36:15.628 +0200 WARN  TcpOutputProc [4360 indexerPipe_1] - Pipeline data does not have indexKey. [_path] = C:\Program Files\SplunkUniversalForwarder\bin\splunk-winevtlog.exe\n[_raw] = \n[_meta] = punct::\n[_stmid] = FSDNhUnXitGgjKJ.C\n[MetaData:Source] = source::WinEventLog\n[MetaData:Host] = host::HOSTNAME-1\n[MetaData:Sourcetype] = sourcetype::WinEventLog\n[_done] = _done\n[_linebreaker] = _linebreaker\n[_conf] = source::WinEventLog|host::HOSTNAME-1|WinEventLog|\n
07-01-2022 12:36:18.966 +0200 WARN  TcpOutputProc [4360 indexerPipe_1] - Pipeline data does not have indexKey. [_path] = C:\Program Files\SplunkUniversalForwarder\bin\splunk-winevtlog.exe\n[_raw] = \n[_meta] = punct::\n[_stmid] = WpC6rhVY9yDblPB.C\n[MetaData:Source] = source::WinEventLog\n[MetaData:Host] = host::HOSTNAME-2\n[MetaData:Sourcetype] = sourcetype::WinEventLog\n[_done] = _done\n[_linebreaker] = _linebreaker\n[_conf] = source::WinEventLog|host::HOSTNAME-2|WinEventLog|\n

Then, in the same millisecond, this happens:

07-01-2022 12:36:18.966 +0200 INFO  AutoLoadBalancedConnectionStrategy [4860 TcpOutEloop] - Shutting down auto load balanced connection strategy
07-01-2022 12:36:18.966 +0200 INFO  AutoLoadBalancedConnectionStrategy [4860 TcpOutEloop] - Auto load balanced connection strategy shutdown finished
07-01-2022 12:36:18.966 +0200 INFO  TcpOutputProc [4360 indexerPipe_1] - Waiting for TcpOutputGroups to shutdown
07-01-2022 12:36:18.966 +0200 INFO  AutoLoadBalancedConnectionStrategy [4860 TcpOutEloop] - Shutting down auto load balanced connection strategy
07-01-2022 12:36:18.966 +0200 INFO  AutoLoadBalancedConnectionStrategy [4860 TcpOutEloop] - Auto load balanced connection strategy shutdown finished
07-01-2022 12:36:19.968 +0200 INFO  TcpOutputProc [4360 indexerPipe_1] - Received shutdown control key.
07-01-2022 12:36:19.968 +0200 INFO  IndexProcessor [4360 indexerPipe_1] - shutting down: start
07-01-2022 12:36:19.968 +0200 INFO  IndexWriter [4360 indexerPipe_1] - idx=_metrics Sync before shutdown
07-01-2022 12:36:19.968 +0200 INFO  IndexWriter [4360 indexerPipe_1] - idx=_audit Sync before shutdown
07-01-2022 12:36:19.968 +0200 INFO  IndexWriter [4360 indexerPipe_1] - idx=_configtracker Sync before shutdown
07-01-2022 12:36:19.968 +0200 INFO  IndexWriter [4360 indexerPipe_1] - idx=_internal Sync before shutdown
07-01-2022 12:36:19.968 +0200 INFO  IndexWriter [4360 indexerPipe_1] - idx=_thefishbucket Sync before shutdown
07-01-2022 12:36:19.968 +0200 INFO  IndexWriter [4360 indexerPipe_1] - idx=_introspection Sync before shutdown
07-01-2022 12:36:19.968 +0200 INFO  IndexWriter [4360 indexerPipe_1] - idx=_metrics_rollup Sync before shutdown
07-01-2022 12:36:19.968 +0200 INFO  IndexWriter [4360 indexerPipe_1] - idx=_telemetry Sync before shutdown
07-01-2022 12:36:19.968 +0200 INFO  IndexWriter [4360 indexerPipe_1] - idx=history Sync before shutdown
07-01-2022 12:36:19.968 +0200 INFO  IndexWriter [4360 indexerPipe_1] - idx=main Sync before shutdown
07-01-2022 12:36:19.968 +0200 INFO  IndexWriter [4360 indexerPipe_1] - idx=summary Sync before shutdown
07-01-2022 12:36:19.968 +0200 INFO  IndexProcessor [4360 indexerPipe_1] - request state change from=RUN to=SHUTDOWN_IN_PROGRESS
07-01-2022 12:36:19.968 +0200 INFO  IndexWriter [4360 indexerPipe_1] - idx=_audit Handling shutdown or signal, reason=1
07-01-2022 12:36:19.968 +0200 INFO  IndexWriter [4360 indexerPipe_1] - idx=_configtracker Handling shutdown or signal, reason=1
07-01-2022 12:36:19.968 +0200 INFO  IndexWriter [4360 indexerPipe_1] - idx=_internal Handling shutdown or signal, reason=1
07-01-2022 12:36:19.968 +0200 INFO  IndexWriter [4360 indexerPipe_1] - idx=_introspection Handling shutdown or signal, reason=1
07-01-2022 12:36:19.968 +0200 INFO  IndexWriter [4360 indexerPipe_1] - idx=_metrics Handling shutdown or signal, reason=1
07-01-2022 12:36:19.968 +0200 INFO  IndexWriter [4360 indexerPipe_1] - idx=_metrics_rollup Handling shutdown or signal, reason=1
07-01-2022 12:36:19.968 +0200 INFO  IndexWriter [4360 indexerPipe_1] - idx=_telemetry Handling shutdown or signal, reason=1
07-01-2022 12:36:19.968 +0200 INFO  IndexWriter [4360 indexerPipe_1] - idx=_thefishbucket Handling shutdown or signal, reason=1
07-01-2022 12:36:19.968 +0200 INFO  IndexWriter [4360 indexerPipe_1] - idx=history Handling shutdown or signal, reason=1
07-01-2022 12:36:19.968 +0200 INFO  IndexWriter [4360 indexerPipe_1] - idx=main Handling shutdown or signal, reason=1
07-01-2022 12:36:19.968 +0200 INFO  IndexWriter [4360 indexerPipe_1] - idx=summary Handling shutdown or signal, reason=1
07-01-2022 12:36:19.968 +0200 INFO  HotDBManager [4360 indexerPipe_1] - closing hot mgr for idx=_metrics
07-01-2022 12:36:19.968 +0200 INFO  HotDBManager [4360 indexerPipe_1] - closing hot mgr for idx=_audit
07-01-2022 12:36:19.968 +0200 INFO  HotDBManager [4360 indexerPipe_1] - closing hot mgr for idx=_configtracker
07-01-2022 12:36:19.968 +0200 INFO  HotDBManager [4360 indexerPipe_1] - closing hot mgr for idx=_internal
07-01-2022 12:36:19.968 +0200 INFO  HotDBManager [4360 indexerPipe_1] - closing hot mgr for idx=_thefishbucket
07-01-2022 12:36:19.968 +0200 INFO  HotDBManager [4360 indexerPipe_1] - closing hot mgr for idx=_introspection
07-01-2022 12:36:19.968 +0200 INFO  HotDBManager [4360 indexerPipe_1] - closing hot mgr for idx=_metrics_rollup
07-01-2022 12:36:19.968 +0200 INFO  HotDBManager [4360 indexerPipe_1] - closing hot mgr for idx=_telemetry
07-01-2022 12:36:19.968 +0200 INFO  HotDBManager [4360 indexerPipe_1] - closing hot mgr for idx=history
07-01-2022 12:36:19.968 +0200 INFO  HotDBManager [4360 indexerPipe_1] - closing hot mgr for idx=main
07-01-2022 12:36:19.968 +0200 INFO  HotDBManager [4360 indexerPipe_1] - closing hot mgr for idx=summary
07-01-2022 12:36:19.968 +0200 INFO  IndexProcessor [4360 indexerPipe_1] - request state change from=SHUTDOWN_IN_PROGRESS to=SHUTDOWN_COMPLETE
07-01-2022 12:36:19.968 +0200 INFO  IndexProcessor [4360 indexerPipe_1] - shutting down: end

So it seems pipeline-1 shut down, while pipeline-0 continues to work (this repeats over and over, with only pset=0):

07-01-2022 12:36:29.527 +0200 INFO  AutoLoadBalancedConnectionStrategy [7160 TcpOutEloop] - Connected to idx=X.X.X.X:9997:0, pset=0, reuse=0. autoBatch=1

Pipeline-1 somehow still seems to receive data, as I can see in metrics.log that the queues for that one pipeline are constantly at 100%, and we're also losing logs to our indexers.
Then this starts appearing in splunkd.log over and over until I restart Splunk:

07-01-2022 12:36:30.522 +0200 ERROR PipelineComponent [8012 CallbackRunnerThread] - Monotonic time source didn't increase; is it stuck?
07-01-2022 12:36:32.525 +0200 ERROR PipelineComponent [8012 CallbackRunnerThread] - Monotonic time source didn't increase; is it stuck?
07-01-2022 12:36:34.528 +0200 ERROR PipelineComponent [8012 CallbackRunnerThread] - Monotonic time source didn't increase; is it stuck?
07-01-2022 12:36:36.532 +0200 ERROR PipelineComponent [8012 CallbackRunnerThread] - Monotonic time source didn't increase; is it stuck?

I've had this happen at least 3-4 times in the last week. The Heavy Forwarders are running Splunk 9.0.0, but I also had this happen once a few weeks ago when I was still on 8.2.4. I didn't think much of it at the time, as I was in the process of replacing them with new servers on 9.0.0 anyway. But the logs and the symptoms are exactly the same on 9.0.0 as they were that one time on 8.2.4.

I tried looking into the "Pipeline data does not have indexKey" part, but as this data is coming from UFs with identical configs where every input stanza has an index defined, I couldn't find anything wrong there.
That said, even if the index were missing for some log sources, the Heavy Forwarders should not just shut down a pipeline.
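
For completeness, checking this on a UF just means confirming that every enabled input stanza has an index set, for example (the stanza and index name here are only examples, not our actual config):

# UF inputs.conf - each stanza defines a target index (example values)
[WinEventLog://Security]
index = wineventlog
disabled = 0

The effective, merged inputs configuration can be dumped on the UF with btool:

"C:\Program Files\SplunkUniversalForwarder\bin\splunk.exe" btool inputs list --debug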

Has anyone seen this happen before, or do you have any suggestions for what I should try or check?
It really seems like a bug to me, but of course I don't know that.


Best Regards
Morten

1 Solution

hrawat_splunk
Splunk Employee

The root cause of the problem is SPL-210081. It is fixed on the httpout side and requires upgrading the forwarder.

SPL-224974 re-fixes SPL-210081 on the server side. This option is better because:

  1.  You don't have to upgrade tons of forwarders; just upgrade the HF (or whichever layer is receiving httpout).
  2.  It will also take care of all S2S clients (Splunk or third-party software).

SPL-229622 is a new fix that is likely to make it into 9.0.3; it is a further enhanced version of SPL-224974. But you should be fine with SPL-210081 (UF upgrade) or SPL-224974 (HF upgrade).


mortenklow
Explorer

Hi,

Thank you for the reply!

Glad to know there is a known issue with splunk-winevtlog.exe that will be fixed in 9.0.1.
I assume this bug also affects Universal Forwarders, as it's pretty much the same executable?
We've had UFs stop sending Windows event logs many times over the last year, and I've noticed that splunk-winevtlog.exe had stopped running, but I never got around to reporting it as it was so sporadic.
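
When that happens, the affected hosts stand out with a search like this (a sketch; the sourcetype filter and one-hour threshold are just examples):

| tstats max(_time) as last_seen where index=* sourcetype=WinEventLog by host
| eval minutes_since = round((now() - last_seen) / 60, 0)
| where minutes_since > 60
| sort - minutes_since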


Regarding the blocked-queue messages and the shutdown message: I did eventually restart Splunk on the Heavy Forwarder in question, but the shutdown message you quoted came long before that.
There are no entries matching 'DeploymentClient' at all in splunkd.log leading up to this.

What happened is that right after the "Pipeline data does not have indexKey" entry in the log, at the exact same timestamp, one of the pipelines (for lack of a better way of explaining it) shut down immediately.

I realized I split up the different parts of the log in my previous post, so below I've included the log in its entirety, from the start of the error to shutdown complete.

Splunk continues to run, and the other pipeline continues to forward data as expected, but the "crashed" pipeline sits at 100% on all four queues and doesn't appear to be doing anything, though I don't know how to verify that (see the search sketched after the log below).
By the way, I didn't always have two pipelines; when I ran with just one pipeline, the same thing happened and the same messages were logged. Splunk continued to run, but the single and only pipeline had 100% queues.

07-01-2022 12:36:18.966 +0200 WARN  TcpOutputProc [4360 indexerPipe_1] - Pipeline data does not have indexKey. [_path] = C:\Program Files\SplunkUniversalForwarder\bin\splunk-winevtlog.exe\n[_raw] = \n[_meta] = punct::\n[_stmid] = WpC6rhVY9yDblPB.C\n[MetaData:Source] = source::WinEventLog\n[MetaData:Host] = host::HOSTNAME-1\n[MetaData:Sourcetype] = sourcetype::WinEventLog\n[_done] = _done\n[_linebreaker] = _linebreaker\n[_conf] = source::WinEventLog|host::HOSTNAME-1|WinEventLog|\n
07-01-2022 12:36:18.966 +0200 INFO  AutoLoadBalancedConnectionStrategy [4860 TcpOutEloop] - Shutting down auto load balanced connection strategy
07-01-2022 12:36:18.966 +0200 INFO  AutoLoadBalancedConnectionStrategy [4860 TcpOutEloop] - Auto load balanced connection strategy shutdown finished
07-01-2022 12:36:18.966 +0200 INFO  TcpOutputProc [4360 indexerPipe_1] - Waiting for TcpOutputGroups to shutdown
07-01-2022 12:36:18.966 +0200 INFO  AutoLoadBalancedConnectionStrategy [4860 TcpOutEloop] - Shutting down auto load balanced connection strategy
07-01-2022 12:36:18.966 +0200 INFO  AutoLoadBalancedConnectionStrategy [4860 TcpOutEloop] - Auto load balanced connection strategy shutdown finished
07-01-2022 12:36:19.968 +0200 INFO  TcpOutputProc [4360 indexerPipe_1] - Received shutdown control key.
07-01-2022 12:36:19.968 +0200 INFO  IndexProcessor [4360 indexerPipe_1] - shutting down: start
07-01-2022 12:36:19.968 +0200 INFO  IndexWriter [4360 indexerPipe_1] - idx=_metrics Sync before shutdown
07-01-2022 12:36:19.968 +0200 INFO  IndexWriter [4360 indexerPipe_1] - idx=_audit Sync before shutdown
07-01-2022 12:36:19.968 +0200 INFO  IndexWriter [4360 indexerPipe_1] - idx=_configtracker Sync before shutdown
07-01-2022 12:36:19.968 +0200 INFO  IndexWriter [4360 indexerPipe_1] - idx=_internal Sync before shutdown
07-01-2022 12:36:19.968 +0200 INFO  IndexWriter [4360 indexerPipe_1] - idx=_thefishbucket Sync before shutdown
07-01-2022 12:36:19.968 +0200 INFO  IndexWriter [4360 indexerPipe_1] - idx=_introspection Sync before shutdown
07-01-2022 12:36:19.968 +0200 INFO  IndexWriter [4360 indexerPipe_1] - idx=_metrics_rollup Sync before shutdown
07-01-2022 12:36:19.968 +0200 INFO  IndexWriter [4360 indexerPipe_1] - idx=_telemetry Sync before shutdown
07-01-2022 12:36:19.968 +0200 INFO  IndexWriter [4360 indexerPipe_1] - idx=history Sync before shutdown
07-01-2022 12:36:19.968 +0200 INFO  IndexWriter [4360 indexerPipe_1] - idx=main Sync before shutdown
07-01-2022 12:36:19.968 +0200 INFO  IndexWriter [4360 indexerPipe_1] - idx=summary Sync before shutdown
07-01-2022 12:36:19.968 +0200 INFO  IndexProcessor [4360 indexerPipe_1] - request state change from=RUN to=SHUTDOWN_IN_PROGRESS
07-01-2022 12:36:19.968 +0200 INFO  IndexWriter [4360 indexerPipe_1] - idx=_audit Handling shutdown or signal, reason=1
07-01-2022 12:36:19.968 +0200 INFO  IndexWriter [4360 indexerPipe_1] - idx=_configtracker Handling shutdown or signal, reason=1
07-01-2022 12:36:19.968 +0200 INFO  IndexWriter [4360 indexerPipe_1] - idx=_internal Handling shutdown or signal, reason=1
07-01-2022 12:36:19.968 +0200 INFO  IndexWriter [4360 indexerPipe_1] - idx=_introspection Handling shutdown or signal, reason=1
07-01-2022 12:36:19.968 +0200 INFO  IndexWriter [4360 indexerPipe_1] - idx=_metrics Handling shutdown or signal, reason=1
07-01-2022 12:36:19.968 +0200 INFO  IndexWriter [4360 indexerPipe_1] - idx=_metrics_rollup Handling shutdown or signal, reason=1
07-01-2022 12:36:19.968 +0200 INFO  IndexWriter [4360 indexerPipe_1] - idx=_telemetry Handling shutdown or signal, reason=1
07-01-2022 12:36:19.968 +0200 INFO  IndexWriter [4360 indexerPipe_1] - idx=_thefishbucket Handling shutdown or signal, reason=1
07-01-2022 12:36:19.968 +0200 INFO  IndexWriter [4360 indexerPipe_1] - idx=history Handling shutdown or signal, reason=1
07-01-2022 12:36:19.968 +0200 INFO  IndexWriter [4360 indexerPipe_1] - idx=main Handling shutdown or signal, reason=1
07-01-2022 12:36:19.968 +0200 INFO  IndexWriter [4360 indexerPipe_1] - idx=summary Handling shutdown or signal, reason=1
07-01-2022 12:36:19.968 +0200 INFO  HotDBManager [4360 indexerPipe_1] - closing hot mgr for idx=_metrics
07-01-2022 12:36:19.968 +0200 INFO  HotDBManager [4360 indexerPipe_1] - closing hot mgr for idx=_audit
07-01-2022 12:36:19.968 +0200 INFO  HotDBManager [4360 indexerPipe_1] - closing hot mgr for idx=_configtracker
07-01-2022 12:36:19.968 +0200 INFO  HotDBManager [4360 indexerPipe_1] - closing hot mgr for idx=_internal
07-01-2022 12:36:19.968 +0200 INFO  HotDBManager [4360 indexerPipe_1] - closing hot mgr for idx=_thefishbucket
07-01-2022 12:36:19.968 +0200 INFO  HotDBManager [4360 indexerPipe_1] - closing hot mgr for idx=_introspection
07-01-2022 12:36:19.968 +0200 INFO  HotDBManager [4360 indexerPipe_1] - closing hot mgr for idx=_metrics_rollup
07-01-2022 12:36:19.968 +0200 INFO  HotDBManager [4360 indexerPipe_1] - closing hot mgr for idx=_telemetry
07-01-2022 12:36:19.968 +0200 INFO  HotDBManager [4360 indexerPipe_1] - closing hot mgr for idx=history
07-01-2022 12:36:19.968 +0200 INFO  HotDBManager [4360 indexerPipe_1] - closing hot mgr for idx=main
07-01-2022 12:36:19.968 +0200 INFO  HotDBManager [4360 indexerPipe_1] - closing hot mgr for idx=summary
07-01-2022 12:36:19.968 +0200 INFO  IndexProcessor [4360 indexerPipe_1] - request state change from=SHUTDOWN_IN_PROGRESS to=SHUTDOWN_COMPLETE
07-01-2022 12:36:19.968 +0200 INFO  IndexProcessor [4360 indexerPipe_1] - shutting down: end
07-01-2022 12:36:23.531 +0200 ERROR PipelineComponent [8012 CallbackRunnerThread] - Monotonic time source didn't increase; is it stuck?
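
To check whether the stalled pipeline is still moving any data at all, per-pipeline throughput from metrics.log should show it (again just a sketch; the host filter is a placeholder):

index=_internal host=<heavy_forwarder> source=*metrics.log group=thruput name=index_thruput
| timechart span=1m sum(kb) as kb_processed by ingest_pipe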
