Splunk Search

Converting multiple html entities in each field to unicode characters

bitnapper
Path Finder

Hi,

 I have a dataset with very poor qulity and multiple encoding error. Some fields contain data like "Алексей" which sould be "Алексей". My first idea to convert taht, was to search every falty dataset and convert this extermally with a script but I'm curious if theres a better way using splunk. But I have no idea how to get there.

I somehow need to cet every &#(\d{4}); and I could facilitate printf("%c", \1) to get the correct unicode character but I have no Idea how to apply that to every occourance in a single field. Currently I have data like this:

idname
1Алексей

 

Where I wanno get is that:

idnamecorrect_name
1АлексейАлексей

 

Any ideas if that is possible without using python sripts in splunk?

Regards

Thorsten

Labels (1)
0 Karma

bitnapper
Path Finder

Thanks. @bowesmana solution seems to work. But it seems, processing it externally is more efficient. many thanks to all.

bowesmana
SplunkTrust
SplunkTrust

Here's an example that extracts all the &#nnnn; sequences to a multivalue char field, which is then converted to the chars MV.

It seems like the mvdedup can be used to remove duplicates in each case and the ordering appears to be preserved between the to MVs, so then the final foreach will replace that inside name

| makeresults format=csv data="id,name
1,Алексей"
``` Extract all sequences ```
| rex field=name max_match=0 "\&#(?<char>\d{4});"
``` Create the char array ```
| eval chars=mvmap(char, printf("%c", char))
``` Remove duplicates from each MV - assumes ordering is preserved ```
| eval char=mvdedup(char), chars=mvdedup(chars)
``` Now replace each item ```
| eval c=0
| foreach chars mode=multivalue [ eval name=replace(name, "\&#".mvindex(char, c).";", <<ITEM>>), c=c+1 ]

You could make this a macro and pass a string to the macro and the macro could do the conversion.

Note that fixing the ingest is always the best option, but this can deal with any existing data.

Assumes you're running Splunk 9

Tags (1)

tscroggins
Champion

Here's a one-liner that handles both decimal and hexadecimal code points:

| eval name=mvjoin(mvmap(split(name, ";"), printf("%c", if(match(name, "^&#x"), tonumber(replace(name, "&#x", ""), 16), tonumber(replace(name, "&#", ""), 10)))), "")

You can also pad the value with XML tags and use the spath command:

| eval name="<name>".name."</name>"
| spath input=name path=name

or the xpath command:

| eval name="<name>".name."</name>"
| xpath outfield=name "/name" field=name

However, avoid the xpath command in this case. It's an external search command and requires creating a separate Python process to invoke $SPLUNK_HOME/etc/apps/search/bin/xpath.py.

bowesmana
SplunkTrust
SplunkTrust

spath works nicely, but the one liner only works if ALL the data is made up of codepoints only

tscroggins
Champion

Here's an alternative using rex and eval that should accommodate chars that aren't XML entities:

| rex field=name max_match=0 "(?<name>(?:&#[^;]+;|.))"
| eval name=mvjoin(mvmap(name, if(match(name, "^&#"), printf("%c", if(match(name, "^&#x"), tonumber(replace(name, "[&#x;]", ""), 16), tonumber(replace(name, "[&#;]", ""), 10))), name)), "")

If I were using any of these solutions myself, I'd choose spath. It should handle any valid XML without needing to handle edge cases in SPL.

tscroggins
Champion

Indeed, as it is in @bitnapper's original question. nullif, match, etc. could be added for input validation.

Got questions? Get answers!

Join the Splunk Community Slack to learn, troubleshoot, and make connections with fellow Splunk practitioners in real time!

Meet up IRL or virtually!

Join Splunk User Groups to connect and learn in-person by region or remotely by topic or industry.

Get Updates on the Splunk Community!

May 2026 Splunk Expert Sessions: Security & Observability

Level Up Your Operations: May 2026 Splunk Expert Sessions Whether you are refining your security posture or ...

Network to App: Observability Unlocked [May & June Series]

In today’s digital landscape, your environment is no longer confined to the data center. It spans complex ...

SPL2 Deep Dives, AppDynamics Integrations, SAML Made Simple and Much More on Splunk ...

Splunk Lantern is Splunk’s customer success center that provides practical guidance from Splunk experts on key ...