
"Ended without a done-key" messages


I have seen these messages in splunkd.log before, usually from two components: tcpoutput and PipelineInputChannel. A couple of questions:

  • What does this error mean?
  • Am I losing data?
  • What are the recommended troubleshooting steps?
  • If I were to file a support case, is there anything besides a diag that I should provide?
1 Solution

Splunk Employee

"Ended without a done key" means that a stream of data ended without a message indicating that the stream was supposed to end. A stream here is roughly a combination of host, source, and sourcetype; put another way, it is the contents of some log file, the output of an input script, the data received on a plain non-Splunk network socket, and so on.

For an analogy, imagine a phone line going dead before anyone says goodbye.

If Tcp Output complains, it was forwarding this stream to a receiving Splunk system (possibly an indexer) and the stream ended abruptly. In this case, Tcp Output inserts a flush key (an explicit message that the stream ended incompletely) so that the stream is at least cleaned up correctly on the receiver/indexer.

If PipelineInputChannel complains, it means that no one tried to clean up. PipelineInputChannel is the class that tracks the state of the stream.
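To make the distinction concrete, here is a toy model of per-stream state tracking. It is only a sketch of the idea described above, not Splunk's implementation: the class names, method names, and message text are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Channel:
    """Toy stand-in for per-stream state (hypothetical, not Splunk's class)."""
    key: tuple                      # (host, source, sourcetype)
    events: list = field(default_factory=list)


class ChannelTracker:
    """Tracks open streams and notes any that end without a done key."""

    def __init__(self):
        self.open = {}
        self.warnings = []

    def on_event(self, key, data):
        # First event for a key opens the channel.
        self.open.setdefault(key, Channel(key)).events.append(data)

    def on_done(self, key):
        # Clean end: the sender said goodbye, so nothing is logged.
        return self.open.pop(key, None)

    def on_disconnect(self, key, flush_sent=False):
        # Abrupt end: if the sender inserted a flush key, the channel is
        # cleaned up quietly; otherwise the tracker itself complains.
        channel = self.open.pop(key, None)
        if channel is not None and not flush_sent:
            self.warnings.append(f"channel {key} ended without a done key")
        return channel
```

In this model, the flush key inserted by the sending side is what keeps the receiving side's tracker quiet; when nobody sends one, the tracker is the component that reports the problem.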

If Tcp Output is complaining, it usually means that an input of some kind has misbehaved. We should investigate.

If PipelineInputChannel is complaining, it (currently) usually means that a forwarder has shut down uncleanly, or the network has dropped a forwarding socket abruptly. Once we correct TcpInput to send a flush in these cases, there will be no usual cause for this message.

These messages indicate that some component in the system is behaving incorrectly, but typically they will not correlate with data loss, because the whole system should track completeness via acknowledgements and resend the data to another system as needed to ensure it is all indexed. They are likely to correlate with small amounts of duplication around the time of the breakage, as the data is resent to another node, and they can correlate with some incomplete event fragments.
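The reason this shows up as duplication rather than loss can be seen in a simplified model of acknowledged forwarding. This is a sketch under assumptions: the function, the node labels "A" and "B", and the batch layout are hypothetical, and real forwarding is considerably more involved.

```python
def forward(batches, ack_lost_on):
    """Send batches to node "A"; if the ack for one batch never arrives,
    resend that batch to node "B". Returns (node, event) pairs indexed."""
    indexed = []
    for i, batch in enumerate(batches):
        # Node A receives and indexes the batch.
        indexed.extend(("A", event) for event in batch)
        if i == ack_lost_on:
            # The connection broke before the ack came back, so the sender
            # cannot know A already has the batch and resends it elsewhere.
            indexed.extend(("B", event) for event in batch)
    return indexed


batches = [["e1", "e2"], ["e3", "e4"], ["e5"]]
result = forward(batches, ack_lost_on=1)
events = [event for _, event in result]
```

Every event in `batches` appears in `events` at least once, and the events from the batch whose acknowledgement was lost appear twice: nothing is dropped, but a small window is duplicated.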

As with all data handling problems, a complete description of the entire path by which the data travels from the generating application or device until it reaches the index (networks, filesystems, forwarders, etc.) is essential. The logged messages contain the source, sourcetype, and host strings, so use these to narrow down what is affected and what is not. Be sure to provide diags of at least one relevant forwarder, any intermediate forwarders, and an affected system where some of the data has arrived.
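One way to do that narrowing is to tally the messages by component, host, source, and sourcetype. The sketch below assumes the usual splunkd.log line layout (date, time, timezone, level, component, " - ", message) and assumes the identifying strings appear in the message as key=value pairs; the exact message wording varies, so adjust the patterns to match what your log actually contains. The function name is made up for illustration.

```python
import re
from collections import Counter

# Assumed layout: "<date> <time> <tz> <LEVEL> <Component> - <message>"
LINE_RE = re.compile(
    r"^\S+ \S+ \S+\s+(?P<level>[A-Z]+)\s+(?P<component>\S+) - (?P<message>.*)$"
)
# Tolerates both "done-key" and "done key".
DONE_KEY_RE = re.compile(r"without a done[- ]key", re.IGNORECASE)
# Assumption: identifying strings are logged as key=value or key="value".
FIELD_RE = re.compile(r'\b(host|source|sourcetype)=("[^"]*"|\S+)')


def summarize_done_key_messages(lines):
    """Count done-key messages per (component, host, source, sourcetype)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if not match or not DONE_KEY_RE.search(match["message"]):
            continue
        # Strip surrounding quotes and any trailing comma from each value.
        fields = {k: v.strip('",') for k, v in FIELD_RE.findall(match["message"])}
        counts[(
            match["component"],
            fields.get("host"),
            fields.get("source"),
            fields.get("sourcetype"),
        )] += 1
    return counts
```

Run it over an open splunkd.log file, for example `summarize_done_key_messages(open(path))`, and compare the resulting keys across forwarders and indexers. A short list of affected hosts and sources is also a useful thing to attach to a support case alongside the diags.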
