We are populating Splunk using an HEC connection with a source type of _json, set to the default character set of UTF-8. However, a field shown in the raw data as:
"Character test: 0242 (\\u00f2): >\uC3B2<"
is displayed as:
Character test: 0242 (\u00f2): >쎲<
I would have expected the display to show the character ò, whose UTF-8 encoding is the hexadecimal bytes C3 B2, rather than the Unicode character at code point U+C3B2.
In JSON, the four hex digits of a \uXXXX escape are the literal Unicode code point (or, for characters outside the Basic Multilingual Plane, a UTF-16 code unit of a surrogate pair). They are not UTF-8 bytes.
See: https://datatracker.ietf.org/doc/html/rfc8259#section-7
So, for example, a string containing only a single reverse solidus character may be represented as "\u005C"
If the escape held UTF-8 bytes, that example wouldn't have the leading zeros: a reverse solidus is the single byte 5C in UTF-8.
\uc3b2 is indeed Hangul Syllable Ssyeobs
The character you're looking for, LATIN SMALL LETTER O WITH GRAVE, is correctly encoded in JSON as \u00f2
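To check this for yourself, here is a minimal sketch using only Python's standard json and unicodedata modules (not Splunk itself; the variable names are just for illustration). It decodes both escapes and then shows where the bytes C3 B2 actually come from:

```python
import json
import unicodedata

# The \uXXXX escape names a code point, so \uC3B2 decodes to U+C3B2.
decoded_wrong = json.loads('"\\uC3B2"')
# The code point of the intended character is U+00F2.
decoded_right = json.loads('"\\u00f2"')

print(unicodedata.name(decoded_wrong))  # HANGUL SYLLABLE SSYEOBS
print(unicodedata.name(decoded_right))  # LATIN SMALL LETTER O WITH GRAVE

# C3 B2 only appears when U+00F2 is serialized as UTF-8 bytes.
utf8_bytes = decoded_right.encode("utf-8")
print(utf8_bytes.hex())  # c3b2
```

So C3 B2 is what goes on the wire if you send the character itself in a UTF-8 body, while the escape form needs the code point 00F2.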
Thanks, this was a misread of the RFC on my part. I appreciate the help.
Your sourcetype might be set to utf-8 but how is your source sending the data?
It's sending an HTTP request to an HEC endpoint whose default source type is also set to _json. Here is a dump of the request:
POST /services/collector/event?host=myhost&source=KEN-STUFF&sourcetype=_json&index=galaxy&channel=FE0ECFAD-13D5-401B-847D-77833BD77131 HTTP/1.1
Host: <target URL>
User-Agent: XYGATEMA
Connection: keep-alive
Content-Type: application/json
Authorization: Splunk <HEC token>
Content-Length: 1073
{"TIME":"2023-03-24 07:56:55.707","AUDIT": {"RECORDGMT":"2023-03-24:14:56:55.707636","GMTSEQNO":null,"RECORDLCT":"2023-03-24:07:56:55.707636","RECORDAUDITKEY":"","RECORDSESSIONKEY":"","SEQNO":null,"OUTCOME":4,"WARNINGMODE":"N","TESTMODE":"N","SEVERITY":"1","ALERTED":"A","PRODUCTCODE":"EMS","SUBJECT_USERNUMBER_MAJOR":null,"SUBJECT_USERNUMBER_MINOR":null,"TARGET_USERNUMBER_MAJOR":null,"TARGET_USERNUMBER_MINOR":null,"SUBJECTLOGIN":"","SUBJECTSYSTEM":"\\GALAXY","TARGETLOGIN":"","OBJECTTYPE":"COMFORTE.1.B00","OBJECTNAME":"","OPERATION":"EMS-EVENT","TERMINAL":"","MESSAGEID":2135,"MESSAGECODE":null,"RULENAME":"","USER_DATA":"REST alert","RESULT":"07:56 24MAR23 200,00,1268 Character test: 0242 (\\u00f2): >\uC3B2<"},"SESSION": {"RECORDSESSIONKEY":"","RECORDINSTALLKEY":"","SESSIONID":"\\GALAXY.$X98B:51790513","FOUNDSESSIONSTART":"N","FOUNDSESSIONEND":"N","SESSIONNAME":"","PROCESSTHREADID":"\\GALAXY.$X98B:51790513","PROCESSTHREADID2":"\\200.0,1268","CLIENTPROGRAM":"$Unknown.unknown.unknown","ANCESTORPROCESSTHREADID":"","IPADDRV46":"","DNSNAME":"","CLIENTCURRDIR":""}}
So you're sending the JSON escape \uC3B2, not the literal byte sequence \xC3\xB2.
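For anyone building the payload themselves, here is a rough sketch of the two valid ways to put ò into the JSON body. It uses Python's json module purely as an illustration; the actual sender here is a different product, and raw_bytes is a made-up stand-in for wherever the UTF-8 data comes from:

```python
import json

# Hypothetical input: the character arrives as UTF-8 bytes C3 B2.
raw_bytes = b"\xc3\xb2"

# Decode to a string first, so the serializer sees the code point U+00F2.
text = raw_bytes.decode("utf-8")

# Option 1: ASCII-only JSON, where the escape carries the code point.
escaped = json.dumps({"RESULT": text})
print(escaped)  # {"RESULT": "\u00f2"}

# Option 2: keep the literal character and send the body as UTF-8 bytes.
literal = json.dumps({"RESULT": text}, ensure_ascii=False).encode("utf-8")
print(literal)  # b'{"RESULT": "\xc3\xb2"}'
```

Either form decodes back to ò. The mistake in the request above was a mix of the two: UTF-8 byte values written inside a \u escape.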
Yes, and that was indeed the issue. Thanks