Monitoring Splunk

What does it mean when, in the internal metrics logs, StatusMgr TcpOutputProcessor fails, tries, and connects in rapid succession?

landen99
Motivator

The following internal metrics log events from StatusMgr show the same host (X) fail, try, and connect to the same indexer (Y) in quick succession:

Host           _raw
DESKTOP-X   07-18-2017 04:25:27.193 -0500 INFO StatusMgr - destHost=192.168.0.Y, destIp=192.168.0.Y, destPort=9997, eventType=connect_done, publisher=tcpout, sourcePort=8089, statusee=TcpOutputProcessor
DESKTOP-X   07-18-2017 04:25:27.149 -0500 INFO StatusMgr - destHost=192.168.0.Y, destIp=192.168.0.Y, destPort=9997, eventType=connect_try, publisher=tcpout, sourcePort=8089, statusee=TcpOutputProcessor
DESKTOP-X   07-18-2017 04:25:26.749 -0500 INFO StatusMgr - destHost=192.168.0.Y, destIp=192.168.0.Y, destPort=9997, eventType=connect_fail, publisher=tcpout, sourcePort=8089, statusee=TcpOutputProcessor

This same pattern appears for all the forwarders in the environment. What could this mean, is it a problem and how do I troubleshoot it further?
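As a quick way to see what the three StatusMgr events imply, here is a minimal Python sketch that computes the gaps between the fail, try, and done events. The log lines are abbreviated and hypothetical, modeled on the events above:

```python
import re
from datetime import datetime

# Abbreviated, hypothetical lines modeled on the StatusMgr events above,
# in chronological order (fail, then try, then done).
LINES = [
    "07-18-2017 04:25:26.749 -0500 INFO StatusMgr - destPort=9997, eventType=connect_fail",
    "07-18-2017 04:25:27.149 -0500 INFO StatusMgr - destPort=9997, eventType=connect_try",
    "07-18-2017 04:25:27.193 -0500 INFO StatusMgr - destPort=9997, eventType=connect_done",
]

def parse(line):
    """Return (timestamp, eventType) for one StatusMgr line."""
    ts = datetime.strptime(line[:23], "%m-%d-%Y %H:%M:%S.%f")
    event = re.search(r"eventType=(\w+)", line).group(1)
    return ts, event

events = [parse(line) for line in LINES]

# Gap in milliseconds between each consecutive pair of events.
gaps = [
    (b_event, (b_ts - a_ts).total_seconds() * 1000)
    for (a_ts, _), (b_ts, b_event) in zip(events, events[1:])
]
print(gaps)  # fail -> try is ~400 ms (retry backoff); try -> done is ~44 ms
```

The ~400 ms fail-to-try gap is the forwarder backing off before retrying, and the ~44 ms try-to-done gap is simply a fast TCP handshake once the port is accepting again.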

Also, it seems that there are blocked queues:
index=_internal blocked=true | table host _raw

DESKTOP-X     07-18-2017 15:45:20.067 -0500 INFO Metrics - group=queue, name=tcpin_cooked_pqueue, blocked=true, max_size_kb=0, current_size_kb=0, current_size=0, largest_size=0, smallest_size=0
DESKTOP-X     07-18-2017 15:45:20.067 -0500 INFO Metrics - group=queue, name=httpinputq, blocked=true, max_size_kb=0, current_size_kb=0, current_size=0, largest_size=0, smallest_size=0
DESKTOP-X     07-18-2017 15:45:20.067 -0500 INFO Metrics - group=queue, name=udp_queue, blocked=true, max_size_kb=0, current_size_kb=0, current_size=0, largest_size=0, smallest_size=0
DESKTOP-Y    07-18-2017 15:46:24.696 -0500 INFO Metrics - group=queue, name=httpinputq, blocked=true, max_size_kb=0, current_size_kb=0, current_size=0, largest_size=0, smallest_size=0
DESKTOP-Z    07-18-2017 15:44:40.758 -0500 INFO Metrics - group=queue, name=tcpin_cooked_pqueue, blocked=true, max_size_kb=0, current_size_kb=0, current_size=0, largest_size=0, smallest_size=0
DESKTOP-Z    07-18-2017 15:44:40.758 -0500 INFO Metrics - group=queue, name=httpinputq, blocked=true, max_size_kb=0, current_size_kb=0, current_size=0, largest_size=0, smallest_size=0
DESKTOP-Z    07-18-2017 15:44:40.758 -0500 INFO Metrics - group=queue, name=udp_queue, blocked=true, max_size_kb=0, current_size_kb=0, current_size=0, largest_size=0, smallest_size=0
DESKTOP-X    07-18-2017 15:44:49.066 -0500 INFO Metrics - group=queue, name=tcpin_cooked_pqueue, blocked=true, max_size_kb=0, current_size_kb=0, current_size=0, largest_size=0, smallest_size=0
DESKTOP-X    07-18-2017 15:44:49.066 -0500 INFO Metrics - group=queue, name=httpinputq, blocked=true, max_size_kb=0, current_size_kb=0, current_size=0, largest_size=0, smallest_size=0
DESKTOP-Y    07-18-2017 15:44:49.066 -0500 INFO Metrics - group=queue, name=udp_queue, blocked=true, max_size_kb=0, current_size_kb=0, current_size=0, largest_size=0, smallest_size=0
DESKTOP-Y    07-18-2017 15:45:53.696 -0500 INFO Metrics - group=queue, name=httpinputq, blocked=true, max_size_kb=0, current_size_kb=0, current_size=0, largest_size=0, smallest_size=0
1 Solution

maraman_splunk
Splunk Employee

Hi,

It seems that, for whatever reason, your indexer queues are blocked (either all the time or intermittently).
In that case, the indexer shuts down its input port until the situation resolves.
From your host's point of view the port is now closed, so your Splunk instance tries to reconnect. Because the port is closed, the failed attempt returns very quickly, which explains the small delay between events in the internal logs.

So this is a consequence of your indexer's queues being blocked.
You have to investigate until you find the root cause, which could be one of many things: more data arriving than you can handle, slow disks, busy CPUs, or a configuration problem (for example, a forwarding loop!).
The Monitoring Console may help you in that case.
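The distinction this answer relies on shows up directly in the metrics fields: a blocked=true event with max_size_kb=0 is an unused queue being reported, while one near its cap is real backpressure. A minimal Python sketch of that triage, using abbreviated, hypothetical lines modeled on the events in this thread:

```python
import re

# Abbreviated, hypothetical metrics.log fragments modeled on this thread.
EVENTS = [
    "group=queue, name=udp_queue, blocked=true, max_size_kb=0, current_size_kb=0",
    "group=queue, name=tcpin_cooked_pqueue, blocked=true, max_size_kb=500, current_size_kb=499",
]

def classify(event):
    """Split a blocked=true event into 'idle' (zero-capacity, likely an
    unused or disabled queue) vs 'saturated' (a real backpressure signal)."""
    fields = dict(re.findall(r"(\w+)=([\w.]+)", event))
    if int(fields["max_size_kb"]) == 0:
        return fields["name"], "idle"
    return fields["name"], "saturated"

for event in EVENTS:
    print(classify(event))
```

Only the "saturated" events warrant the root-cause investigation described above; the "idle" ones are the logging noise discussed later in this thread.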


sgadmin9997
New Member

Hi all. I've been through the chat above, and interestingly I'm seeing the same issue in my environment: a wide range of blocks in the events coming from the heavy forwarders. The throughput values on the indexer and the HF seem fair, but they collide at the point in the hour when traffic is highest. So I think the root cause of the blocks is the indexer throughput. Correct me if I'm wrong:

INFO Metrics - group=queue, ingest_pipe=0, name=splunktcpin, blocked=true, max_size_kb=500, current_size_kb=499, current_size=1804, largest_size=1954, smallest_size=0
INFO Metrics - group=queue, ingest_pipe=4, name=tcpin_cooked_pqueue, blocked=true, max_size_kb=500, current_size_kb=499, current_size=1914, largest_size=1914, smallest_size=0

Any suggestions will be appreciated.
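One way to test the theory that blocks line up with the busy hour is to bucket the blocked=true events by hour and see where they cluster. A minimal sketch, with hypothetical timestamps standing in for exported event times:

```python
from collections import Counter
from datetime import datetime

# Hypothetical timestamps of blocked=true events exported from metrics.log.
timestamps = [
    "07-18-2017 09:05:12", "07-18-2017 09:35:40",
    "07-18-2017 14:02:03", "07-18-2017 14:12:55", "07-18-2017 14:47:31",
]

# Count events per hour of day; a single dominant bucket supports the
# "peak-hour collision" explanation.
hours = Counter(
    datetime.strptime(ts, "%m-%d-%Y %H:%M:%S").hour for ts in timestamps
)
print(hours.most_common(1))
```

If one hour carries most of the blocks, that points at a traffic peak (or a scheduled job in that window) rather than a steady-state capacity problem.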


magnusmolbach
Explorer

This is a bug.

All of these queues are reported as blocked by default (the execprocessorinternalq only on Linux hosts, though). You can exclude them from the search with:
name!=udp_queue name!=httpinputq name!=tcpin_cooked_pqueue name!=execprocessorinternalq

cramasta
Builder

We are also seeing the same blocked=true for multiple queues on forwarders running 6.6.2. The forwarder is only sending a few hundred events an hour.



mattymo
Splunk Employee

Hey landen99!

If these are all going to the same downstream node, I would wonder about temporary loss of connectivity or dropped packets.

How often does it occur?

It looks like a desktop connecting to what? Another machine running your indexer?

Any sign of blocked queues? You can see that in metrics.log; search for blocked=true.

- MattyMo

landen99
Motivator

I clarified the question to indicate that only one indexer is configured and referenced in these logs, and that this happens on the other forwarders as well.


landen99
Motivator

Also, it seems that there are blocked queues:
index=_internal blocked=true | table host _raw


What could this mean?


mattymo
Splunk Employee

Meh, these are all 0, so I doubt they are anything to worry about. Your forwarders would be sending on tcpin anyway.

- MattyMo

andre_pietsch
Engager

I have had the same problem since the update from 6.5.1 to 6.6.2. This stuff wasn't logged in 6.5.1, so why is it logged now? I get millions of those lines with current_size=0.

It looks to me as if someone is logging disabled input queues at the wrong log level. Worth a ticket?

landen99
Motivator

I filtered out "current_size=0" and found only some old events:
index=_internal blocked=true NOT "current_size=0" | table host _raw

DESKTOP-Z   07-15-2017 01:02:00.551 -0500 INFO Metrics - group=queue, name=tcpin_cooked_pqueue, blocked=true, max_size_kb=500, current_size_kb=499, current_size=1006, largest_size=1006, smallest_size=0
DESKTOP-Z   07-14-2017 01:02:03.697 -0500 INFO Metrics - group=queue, name=indexqueue, blocked=true, max_size_kb=500, current_size_kb=499, current_size=989, largest_size=989, smallest_size=0
DESKTOP-Z   07-12-2017 01:03:08.728 -0500 INFO Metrics - group=queue, name=tcpin_cooked_pqueue, blocked=true, max_size_kb=500, current_size_kb=499, current_size=1006, largest_size=1007, smallest_size=0
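For these non-zero events, the fill ratio confirms the queues were at their caps at the moment they blocked. A small sketch parsing two of the lines above (abbreviated):

```python
import re

# Abbreviated copies of two of the non-zero blocked events above.
LINES = [
    "07-15-2017 01:02:00.551 -0500 INFO Metrics - group=queue, name=tcpin_cooked_pqueue, blocked=true, max_size_kb=500, current_size_kb=499",
    "07-14-2017 01:02:03.697 -0500 INFO Metrics - group=queue, name=indexqueue, blocked=true, max_size_kb=500, current_size_kb=499",
]

for line in LINES:
    fields = dict(re.findall(r"(\w+)=([\w.]+)", line))
    # How full the queue was when it reported blocked=true.
    ratio = int(fields["current_size_kb"]) / int(fields["max_size_kb"])
    print(f'{fields["name"]}: {ratio:.1%} full when it blocked')
```

Note that all three events in the search results land just after 01:00, which may point at a nightly scheduled job briefly saturating the pipeline rather than a steady capacity problem.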

mattymo
Splunk Employee

I'd keep an eye on how often those happen and check the Monitoring Console's indexer performance graphs to make sure you aren't filling queues too often. What does this setup look like? VMs? How many forwarders are forwarding to what? Other than having found these logs, are you noticing any issues?

- MattyMo

mattymo
Splunk Employee

Yeah, then I would check for blocking in the pipeline.

Also, any errors or warnings in splunkd.log? Can I assume no SSL? What versions are the forwarders?

- MattyMo

landen99
Motivator

Forwarders are 6.6.2.
SSL is turned off, I believe.
Here are some of the errors; I guess there are a few problems here, but I'm not sure what they mean or what to do about them:
index=_internal error OR warn | table host _raw

DESKTOP-X   {"timestamp":"2017-07-18T20:55:23.198225Z","agentMode":0,"level":"ERROR","message":"Unable to ping server (5f43b1e8-b496-4a99-98ae-f557d42eabd5): Unable to establish connection to localhost: unknown protocol"}
DESKTOP-X   2017-07-18 15:55:23 ERROR [12572] (CaptureServer.cpp:2151) stream.CaptureServer - Unable to ping server (5f43b1e8-b496-4a99-98ae-f557d42eabd5): Unable to establish connection to localhost: unknown protocol
DESKTOP-Y   2017-07-18 15:54:52 ERROR [4604] (CaptureServer.cpp:2151) stream.CaptureServer - Unable to ping server (f58c7e0a-7060-45dd-855e-80784cd6348d): Unable to establish connection to localhost: unknown protocol
DESKTOP-Y   {"timestamp":"2017-07-18T20:54:52.120805Z","agentMode":0,"level":"ERROR","message":"Unable to ping server (f58c7e0a-7060-45dd-855e-80784cd6348d): Unable to establish connection to localhost: unknown protocol"}
DESKTOP-X   {"timestamp":"2017-07-18T20:55:18.185058Z","agentMode":0,"level":"ERROR","message":"Unable to ping server (5f43b1e8-b496-4a99-98ae-f557d42eabd5): Unable to establish connection to localhost: unknown protocol"}
DESKTOP-X   2017-07-18 15:55:18 ERROR [6380] (CaptureServer.cpp:2151) stream.CaptureServer - Unable to ping server (5f43b1e8-b496-4a99-98ae-f557d42eabd5): Unable to establish connection to localhost: unknown protocol
DESKTOP-Y   2017-07-18 15:54:47 ERROR [6648] (CaptureServer.cpp:2151) stream.CaptureServer - Unable to ping server (f58c7e0a-7060-45dd-855e-80784cd6348d): Unable to establish connection to localhost: unknown protocol
DESKTOP-Y   {"timestamp":"2017-07-18T20:54:47.112086Z","agentMode":0,"level":"ERROR","message":"Unable to ping server (f58c7e0a-7060-45dd-855e-80784cd6348d): Unable to establish connection to localhost: unknown protocol"}
DESKTOP-Z   2017-07-18T15:56:23-0500 [CRITICAL] [rpcstart.py], line 381: action=rpc_server_has_been_abnormally_terminated error=Unsupported JRE version detected Traceback (most recent call last): File "C:\Program Files\Splunk\etc\apps\splunk_app_db_connect\bin\rpcstart.py", line 379, in <module> _start_rpc_server(sys.stdin.read()) File "C:\Program Files\Splunk\etc\apps\splunk_app_db_connect\bin\rpcstart.py", line 363, in _start_rpc_server run(configs) File "C:\Program Files\Splunk\etc\apps\splunk_app_db_connect\bin\rpcstart.py", line 156, in run raise Exception("Unsupported JRE version detected") Exception: Unsupported JRE version detected
DESKTOP-Z   2017-07-18T15:56:23-0500 [CRITICAL] [jre_validator.py], line 76 : Traceback (most recent call last): File "C:\Program Files\Splunk\etc\apps\splunk_app_db_connect\bin\dbx2\jre_validator.py", line 31, in target self.process = subprocess.Popen(self.command, **kwargs) File "C:\Program Files\Splunk\Python-2.7\Lib\subprocess.py", line 390, in __init__ errread, errwrite) File "C:\Program Files\Splunk\Python-2.7\Lib\subprocess.py", line 640, in _execute_child startupinfo) WindowsError: [Error 2] The system cannot find the file specified

landen99
Motivator

I filtered out some events and found these:

DESKTOP-X   {"timestamp":"2017-07-18T23:06:25.020252Z","agentMode":0,"level":"ERROR","message":"Unable to ping server (5f43b1e8-b496-4a99-98ae-f557d42eabd5): Unable to establish connection to localhost: unknown protocol"}
DESKTOP-Y   07-18-2017 18:06:24.514 -0500 WARN DistributedPeer - Peer:https://192.168.0.103:8089 Failed to get server info from https://192.168.0.103:8089/services/server/info response code=401
DESKTOP-Y   07-18-2017 18:06:24.512 -0500 WARN GetRemoteAuthToken - Unable to get authentication token from peeruri="https://192.168.0.103:8089/services/admin/auth-tokens".
DESKTOP-Z   07-18-2017 18:07:03.658 -0500 WARN AdminHandler:AuthenticationHandler - Denied session token for user: splunk-system-user
DESKTOP-X   {"timestamp":"2017-07-18T23:06:50.074866Z","agentMode":0,"level":"ERROR","message":"Unable to ping server (5f43b1e8-b496-4a99-98ae-f557d42eabd5): Unable to establish connection to localhost: unknown protocol"}

Perhaps something in one of those? What exactly should I be looking for?


mattymo
Splunk Employee

These are just Stream and DBX making noise; likely unrelated to the TcpOutputProcessor, just config things.

- MattyMo