VMware ESXi syslog: broken format (not a question,...

swasserroth · 02-10-2020

Hi *,

This is not really a question but rather an analysis of the somewhat broken syslog format of some messages issued by ESXi. No answers are expected, but comments are welcome, especially if you are hit by the problems described here.

Our setup is pretty standard and follows the Splunk syslog forwarder recommendations:

We are using part of the Splunk Add-on for VMware (Version 3.4.6, only the Splunk_TA_esxilogs) for indexing the ESXi logs into Splunk.
The syslogs from the ESXi hosts are forwarded (with UDP or TCP) to a central (r)syslog-server, which itself runs a Splunk Universal Forwarder and sends all logs received from various devices to the Splunk indexer.
The syslog parameters on the ESXi hosts have been changed to allow longer syslog messages (i.e. 4096 bytes instead of the usual 1024 bytes).

Our observations:

Large ESXi syslog messages are identified and indexed as expected in most cases, but with some exceptions!
But: The "health check" of the monitoring console often reports recognition problems related to timestamps and line breaks. All the affected events have the sourcetype vmw-syslog, which the installed TA_esxilogs uses as a temporary sourcetype. If you use this TA, then finding events with this sourcetype in your index may be an indication that you are affected by the problems described in this article.
Sometimes a very large number of VMware ESXi events (sometimes millions!) are indexed on just one (!) timestamp. This usually correlates with the timestamp recognition problems above.
Fiddling with LINE_BREAKER, SHOULD_LINEMERGE, TZ, the timestamp format or other pre-indexing conversions does not really help.
There are various other questions and answers regarding ESXi syslogs on Splunk Answers, but none of these helped us get rid of the problem...

The Wireshark analysis:

Time for examining ESXi syslogs on the network packet level with Wireshark. We captured the syslog traffic on the (r)syslog-server at the incoming network interface to catch the packets in exactly the same format as they are sent from the ESXi host.
The format of some ESXi syslog messages is badly broken!

ESXi uses a funny kind of multiline syslog message for a few events. These events are chunked into packets of less than 1024 bytes, regardless of the syslog packet size set on the ESXi host (see above).
The first packet (a "syslog line") is correct according to the syslog packet format:

<12> [priority]
2020-02-06T10:35:44.222Z [timestamp]
bxa-b4... [hostname]
VSANMGMTSVC: [process]
... [syslog message text]
\n [terminating LF]
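For illustration, here is a minimal Python sketch that splits such a well-formed first line into the fields listed above. The regex is my reading of the layout shown here, not anything taken from the TA; the function name parse_header and the hostname in the usage example are hypothetical. The priority value is decoded the standard syslog way (facility = PRI / 8, severity = PRI mod 8), so <12> means facility 1 (user), severity 4 (warning).

```python
import re

# Assumed layout, based on the sample above:
#   <PRI>TIMESTAMP HOSTNAME PROCESS: TEXT
# Optional whitespace after the priority is tolerated, since the exact
# separators are not visible in the annotated sample.
HEADER_RE = re.compile(
    r"^<(?P<pri>\d{1,3})>\s*"
    r"(?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?Z) "
    r"(?P<hostname>\S+) "
    r"(?P<process>[^:\s]+): "
    r"(?P<text>.*)$",
    re.DOTALL,
)


def parse_header(line):
    """Return the syslog fields of a well-formed line, or None if it does not match."""
    match = HEADER_RE.match(line.rstrip("\n"))
    if match is None:
        return None
    fields = match.groupdict()
    pri = int(fields["pri"])
    fields["pri"] = pri
    # Standard syslog encoding: PRI = facility * 8 + severity
    fields["facility"], fields["severity"] = divmod(pri, 8)
    return fields


if __name__ == "__main__":
    # "esx-host-01" is a made-up hostname for the example.
    sample = "<12>2020-02-06T10:35:44.222Z esx-host-01 VSANMGMTSVC: some message text\n"
    print(parse_header(sample))
```

A line that carries a priority but no timestamp makes parse_header return None, which is a cheap way to spot lines that do not follow this layout.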

So far, so good. BUT the next 2-3 continuation lines are just totally mangled up and wrong for syslog packets (according to the RFC, which by the way does not define multiline syslogs...):

The continuation lines start with the priority field, I would accept this,
but there is NO timestamp, instead the next few bytes of the syslog message text are following,
and now it gets ugly: the packet actually CONTAINS both the hostname and the process,
followed by more text of the syslog message and the terminating LF.

These funny continuation lines go on until the current "long event message text" of this single message has been processed completely. Then the next "normal" syslog message follows. This broken format cannot be fixed easily with props or transforms inside Splunk!
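To make the structure a bit more concrete, here is a rough Python sketch of how the two line types could be told apart and grouped outside of Splunk. It assumes only what is described above: a normal line has a timestamp right after the priority, a continuation line has the priority but no timestamp. The names is_continuation and group_events and the sample lines are made up for the example. It deliberately does not try to cut the embedded hostname and process out of the continuation text, because where exactly they sit depends on the message.

```python
import re

# A "normal" line: priority followed directly by an ISO timestamp.
START_RE = re.compile(
    r"^<\d{1,3}>\s*\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?Z\s"
)
# Any line that at least starts with a priority field.
PRI_RE = re.compile(r"^<\d{1,3}>")


def is_continuation(line):
    """True if the line has a priority but no timestamp after it."""
    return bool(PRI_RE.match(line)) and not START_RE.match(line)


def group_events(lines):
    """Group raw lines into events: a normal line opens a new event,
    continuation lines are attached to the event before them."""
    events = []
    for raw in lines:
        line = raw.rstrip("\n")
        if is_continuation(line) and events:
            events[-1].append(line)
        else:
            events.append([line])
    return events


if __name__ == "__main__":
    # Made-up lines that only mimic the structure described in this post.
    captured = [
        "<12>2020-02-06T10:35:44.222Z esx-host-01 VSANMGMTSVC: first part\n",
        "<12>of the text esx-host-01 VSANMGMTSVC: second part\n",
        "<12>2020-02-06T10:35:45.000Z esx-host-01 VSANMGMTSVC: next message\n",
    ]
    for event in group_events(captured):
        print(len(event), event)
```

This only groups the raw lines. Rebuilding the original message text from them would still need per-message handling.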

To repeat very clearly: This is the format ORIGINATING from the ESXi host as captured directly on the wire without any processing!

Attached is a sample of one of the broken messages captured by Wireshark: textual output from the Wireshark capture. IP- and MAC-addresses have been shortened.

We will file a bug report with VMware, but our expectations of getting a fix are very low...

Hope this helps others wondering about issues with ESXi logging...

Have fun and happy Splunking!
Stephan

jotne · 01-21-2022

Hi

Do you have any update on this?

We are seeing the same problem.