The following StatusMgr entries from the internal metrics logs show the same host (X) fail, retry, and reconnect to the same indexer (Y) in quick succession:
Host _raw
DESKTOP-X 07-18-2017 04:25:27.193 -0500 INFO StatusMgr - destHost=192.168.0.Y, destIp=192.168.0.Y, destPort=9997, eventType=connect_done, publisher=tcpout, sourcePort=8089, statusee=TcpOutputProcessor
DESKTOP-X 07-18-2017 04:25:27.149 -0500 INFO StatusMgr - destHost=192.168.0.Y, destIp=192.168.0.Y, destPort=9997, eventType=connect_try, publisher=tcpout, sourcePort=8089, statusee=TcpOutputProcessor
DESKTOP-X 07-18-2017 04:25:26.749 -0500 INFO StatusMgr - destHost=192.168.0.Y, destIp=192.168.0.Y, destPort=9997, eventType=connect_fail, publisher=tcpout, sourcePort=8089, statusee=TcpOutputProcessor
The same pattern appears for all of the forwarders in the environment. What could this mean? Is it a problem, and how do I troubleshoot it further?
Also, it seems that there are blocked queues:
index=_internal blocked=true | table host _raw
DESKTOP-X 07-18-2017 15:45:20.067 -0500 INFO Metrics - group=queue, name=tcpin_cooked_pqueue, blocked=true, max_size_kb=0, current_size_kb=0, current_size=0, largest_size=0, smallest_size=0
DESKTOP-X 07-18-2017 15:45:20.067 -0500 INFO Metrics - group=queue, name=httpinputq, blocked=true, max_size_kb=0, current_size_kb=0, current_size=0, largest_size=0, smallest_size=0
DESKTOP-X 07-18-2017 15:45:20.067 -0500 INFO Metrics - group=queue, name=udp_queue, blocked=true, max_size_kb=0, current_size_kb=0, current_size=0, largest_size=0, smallest_size=0
DESKTOP-Y 07-18-2017 15:46:24.696 -0500 INFO Metrics - group=queue, name=httpinputq, blocked=true, max_size_kb=0, current_size_kb=0, current_size=0, largest_size=0, smallest_size=0
DESKTOP-Z 07-18-2017 15:44:40.758 -0500 INFO Metrics - group=queue, name=tcpin_cooked_pqueue, blocked=true, max_size_kb=0, current_size_kb=0, current_size=0, largest_size=0, smallest_size=0
DESKTOP-Z 07-18-2017 15:44:40.758 -0500 INFO Metrics - group=queue, name=httpinputq, blocked=true, max_size_kb=0, current_size_kb=0, current_size=0, largest_size=0, smallest_size=0
DESKTOP-Z 07-18-2017 15:44:40.758 -0500 INFO Metrics - group=queue, name=udp_queue, blocked=true, max_size_kb=0, current_size_kb=0, current_size=0, largest_size=0, smallest_size=0
DESKTOP-X 07-18-2017 15:44:49.066 -0500 INFO Metrics - group=queue, name=tcpin_cooked_pqueue, blocked=true, max_size_kb=0, current_size_kb=0, current_size=0, largest_size=0, smallest_size=0
DESKTOP-X 07-18-2017 15:44:49.066 -0500 INFO Metrics - group=queue, name=httpinputq, blocked=true, max_size_kb=0, current_size_kb=0, current_size=0, largest_size=0, smallest_size=0
DESKTOP-Y 07-18-2017 15:44:49.066 -0500 INFO Metrics - group=queue, name=udp_queue, blocked=true, max_size_kb=0, current_size_kb=0, current_size=0, largest_size=0, smallest_size=0
DESKTOP-Y 07-18-2017 15:45:53.696 -0500 INFO Metrics - group=queue, name=httpinputq, blocked=true, max_size_kb=0, current_size_kb=0, current_size=0, largest_size=0, smallest_size=0
Hi,
It seems that, for whatever reason, your indexer queues are blocked (either all the time or from time to time).
In that case, the indexer shuts down its input port until the situation resolves.
From your host's side, the port is now closed, so Splunk keeps trying to reconnect; because the port is closed, each attempt finishes very quickly, which explains the small delays between the internal log entries.
So this is a consequence of your indexer's queues being blocked.
You have to investigate until you find the root cause, which could be many things: more data arriving than you can handle, slow disks, busy CPUs, or a configuration problem (for example, a forwarding loop).
The Monitoring Console may help you in that case.
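As a rough starting point (a sketch; host=<your_indexer> is a placeholder for your indexer's host name), something like this shows which queues on the indexer block most often, which usually points at the slow stage of the pipeline:
index=_internal source=*metrics.log* host=<your_indexer> group=queue blocked=true | stats count by name | sort - count
As a rule of thumb, if indexqueue is the one blocking, look at disk I/O on the indexer; if typingqueue blocks while indexqueue does not, heavy regexes/transforms are usually the culprit.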
Hi all - I've been through all of the above. Interestingly, I'm seeing the same issue in my environment: a wide range of blocked events coming from the heavy forwarders. The thruput values on the indexer and the HF seem fair on their own, but they collide at the point in the hour where traffic is highest. So I think the root cause of the blocks is the indexer thruput. Correct me if I'm wrong.
INFO Metrics - group=queue, ingest_pipe=0, name=splunktcpin, blocked=true, max_size_kb=500, current_size_kb=499, current_size=1804, largest_size=1954, smallest_size=0
INFO Metrics - group=queue, ingest_pipe=4, name=tcpin_cooked_pqueue, blocked=true, max_size_kb=500, current_size_kb=499, current_size=1914, largest_size=1914, smallest_size=0
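For reference, this is roughly how I have been lining the two up hour by hour (a sketch; instantaneous_kbps and the blocked flag come from the standard metrics.log thruput and queue groups):
index=_internal source=*metrics.log* group=thruput name=thruput | timechart span=1h avg(instantaneous_kbps) by host
index=_internal source=*metrics.log* group=queue blocked=true | timechart span=1h count by host
If the blocked counts spike in the same hour that the indexer's thruput tops out, that would support the indexer-thruput theory.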
Any suggestions will be appreciated.
This is a bug.
All of these report blocked=true by default (execprocessorinternalq only on Linux hosts, though). I filter them out with:
name!=udp_queue name!=httpinputq name!=tcpin_cooked_pqueue name!=execprocessorinternalq
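Put together, something like this (a sketch) should leave only the queues worth worrying about:
index=_internal source=*metrics.log* group=queue blocked=true name!=udp_queue name!=httpinputq name!=tcpin_cooked_pqueue name!=execprocessorinternalq | stats count by host, name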
We are also seeing the same blocked=true for multiple queues on forwarders that are on 6.6.2, even though the forwarder is only sending a few hundred events an hour.
Hey landen99!
If these are all the same downstream node, I would wonder about a temporary loss of connectivity or dropped packets.
How often does it occur?
It looks like a desktop connecting to what? Another machine running your indexer?
Any sign of blocked queues? You can see that in metrics.log; search for blocked=true.
I clarified the question to indicate that there is only one indexer configured and referenced in these logs, and that this happens to other forwarders as well.
I also added the blocked-queue search results shown in the question above; all of them report current_size=0.
What could this mean?
Meh, these are all 0, so I doubt they are anything to worry about. Your forwarders would be sending on tcpin anyway.
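If you want to rule them out anyway, something like this (a sketch) keeps only the queues that actually carry forwarded data and drops the empty ones:
index=_internal source=*metrics.log* group=queue blocked=true NOT current_size=0 (name=parsingqueue OR name=indexqueue OR name=splunktcpin) | stats count by host, name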
I have had the same problem since updating from 6.5.1 to 6.6.2. This was not logged in 6.5.1, so why is it logged at all? I get millions of these lines with current_size=0.
It looks to me as if someone is logging disabled input queues at the wrong log level. Worth a ticket?
I filtered out "current_size=0" and only found some old events:
index=_internal blocked=true NOT "current_size=0" | table host _raw
DESKTOP-Z 07-15-2017 01:02:00.551 -0500 INFO Metrics - group=queue, name=tcpin_cooked_pqueue, blocked=true, max_size_kb=500, current_size_kb=499, current_size=1006, largest_size=1006, smallest_size=0
DESKTOP-Z 07-14-2017 01:02:03.697 -0500 INFO Metrics - group=queue, name=indexqueue, blocked=true, max_size_kb=500, current_size_kb=499, current_size=989, largest_size=989, smallest_size=0
DESKTOP-Z 07-12-2017 01:03:08.728 -0500 INFO Metrics - group=queue, name=tcpin_cooked_pqueue, blocked=true, max_size_kb=500, current_size_kb=499, current_size=1006, largest_size=1007, smallest_size=0
I'd keep an eye on how often those happen and check the Monitoring Console indexer performance graphs to make sure you aren't filling queues too often. What does this setup look like? VMs? How many forwarders are forwarding to what? Other than having found these logs, are you noticing any issues?
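Outside the Monitoring Console, a quick way to watch how full a given queue gets over time is something like this (a sketch; swap in whichever queue you care about):
index=_internal source=*metrics.log* group=queue name=indexqueue | timechart span=10m max(current_size_kb) AS peak_kb count(eval(blocked="true")) AS blocked_samples by host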
Yeah, then I would check for blocking in the pipeline.
Also, any errors/warnings in splunkd.log? Can I assume no SSL? What versions are the forwarders?
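If it helps, the receiving side records the forwarder version and whether SSL is in use in the tcpin_connections metrics, so something along these lines (a sketch; assumes the usual ssl/version/hostname fields in that group) should answer both at once:
index=_internal source=*metrics.log* group=tcpin_connections | stats latest(version) AS forwarder_version latest(ssl) AS ssl by hostname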
Forwarders are 6.6.2
SSL is turned off, I believe.
Here are some of the errors; I guess there are a few problems here, but I'm not sure what they mean or what to do about them:
index=_internal error OR warn | table host _raw
DESKTOP-X {"timestamp":"2017-07-18T20:55:23.198225Z","agentMode":0,"level":"ERROR","message":"Unable to ping server (5f43b1e8-b496-4a99-98ae-f557d42eabd5): Unable to establish connection to localhost: unknown protocol"}
DESKTOP-X 2017-07-18 15:55:23 ERROR [12572] (CaptureServer.cpp:2151) stream.CaptureServer - Unable to ping server (5f43b1e8-b496-4a99-98ae-f557d42eabd5): Unable to establish connection to localhost: unknown protocol
DESKTOP-Y 2017-07-18 15:54:52 ERROR [4604] (CaptureServer.cpp:2151) stream.CaptureServer - Unable to ping server (f58c7e0a-7060-45dd-855e-80784cd6348d): Unable to establish connection to localhost: unknown protocol
DESKTOP-Y {"timestamp":"2017-07-18T20:54:52.120805Z","agentMode":0,"level":"ERROR","message":"Unable to ping server (f58c7e0a-7060-45dd-855e-80784cd6348d): Unable to establish connection to localhost: unknown protocol"}
DESKTOP-X {"timestamp":"2017-07-18T20:55:18.185058Z","agentMode":0,"level":"ERROR","message":"Unable to ping server (5f43b1e8-b496-4a99-98ae-f557d42eabd5): Unable to establish connection to localhost: unknown protocol"}
DESKTOP-X 2017-07-18 15:55:18 ERROR [6380] (CaptureServer.cpp:2151) stream.CaptureServer - Unable to ping server (5f43b1e8-b496-4a99-98ae-f557d42eabd5): Unable to establish connection to localhost: unknown protocol
DESKTOP-Y 2017-07-18 15:54:47 ERROR [6648] (CaptureServer.cpp:2151) stream.CaptureServer - Unable to ping server (f58c7e0a-7060-45dd-855e-80784cd6348d): Unable to establish connection to localhost: unknown protocol
DESKTOP-Y {"timestamp":"2017-07-18T20:54:47.112086Z","agentMode":0,"level":"ERROR","message":"Unable to ping server (f58c7e0a-7060-45dd-855e-80784cd6348d): Unable to establish connection to localhost: unknown protocol"}
DESKTOP-Z 2017-07-18T15:56:23-0500 [CRITICAL] [rpcstart.py], line 381: action=rpc_server_has_been_abnormally_terminated error=Unsupported JRE version detected Traceback (most recent call last): File "C:\Program Files\Splunk\etc\apps\splunk_app_db_connect\bin\rpcstart.py", line 379, in <module> _start_rpc_server(sys.stdin.read()) File "C:\Program Files\Splunk\etc\apps\splunk_app_db_connect\bin\rpcstart.py", line 363, in _start_rpc_server run(configs) File "C:\Program Files\Splunk\etc\apps\splunk_app_db_connect\bin\rpcstart.py", line 156, in run raise Exception("Unsupported JRE version detected") Exception: Unsupported JRE version detected
DESKTOP-Z 2017-07-18T15:56:23-0500 [CRITICAL] [jre_validator.py], line 76 : Traceback (most recent call last): File "C:\Program Files\Splunk\etc\apps\splunk_app_db_connect\bin\dbx2\jre_validator.py", line 31, in target self.process = subprocess.Popen(self.command, **kwargs) File "C:\Program Files\Splunk\Python-2.7\Lib\subprocess.py", line 390, in __init__ errread, errwrite) File "C:\Program Files\Splunk\Python-2.7\Lib\subprocess.py", line 640, in _execute_child startupinfo) WindowsError: [Error 2] The system cannot find the file specified
I filtered out some events and found these:
DESKTOP-X {"timestamp":"2017-07-18T23:06:25.020252Z","agentMode":0,"level":"ERROR","message":"Unable to ping server (5f43b1e8-b496-4a99-98ae-f557d42eabd5): Unable to establish connection to localhost: unknown protocol"}
DESKTOP-Y 07-18-2017 18:06:24.514 -0500 WARN DistributedPeer - Peer:https://192.168.0.103:8089 Failed to get server info from https://192.168.0.103:8089/services/server/info response code=401
DESKTOP-Y 07-18-2017 18:06:24.512 -0500 WARN GetRemoteAuthToken - Unable to get authentication token from peeruri="https://192.168.0.103:8089/services/admin/auth-tokens".
DESKTOP-Z 07-18-2017 18:07:03.658 -0500 WARN AdminHandler:AuthenticationHandler - Denied session token for user: splunk-system-user
DESKTOP-X {"timestamp":"2017-07-18T23:06:50.074866Z","agentMode":0,"level":"ERROR","message":"Unable to ping server (5f43b1e8-b496-4a99-98ae-f557d42eabd5): Unable to establish connection to localhost: unknown protocol"}
Perhaps something in one of those? What exactly should I be looking for?
These are just Stream and DB Connect (dbx) making noise; likely unrelated to the TcpOutputProcessor, just config things.
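To cut that noise and see whether anything relevant to the forwarding path remains, something like this (a sketch) narrows it to splunkd's own output components:
index=_internal source=*splunkd.log* (log_level=ERROR OR log_level=WARN) component=TcpOut* | stats count by host, component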