<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: Converting multiple html entities in each field to unicode characters in Splunk Search</title>
    <link>https://community.splunk.com/t5/Splunk-Search/Converting-multiple-html-entities-in-each-field-to-unicode/m-p/674324#M230824</link>
    <description>&lt;P&gt;Here's an example that extracts all the &amp;amp;#nnnn; sequences into a multivalue char field, which is then converted to the chars MV.&lt;/P&gt;&lt;P&gt;mvdedup can be used to remove duplicates in each MV, and the ordering appears to be preserved between the two MVs, so the&amp;nbsp;final foreach will replace each entity inside name.&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;| makeresults format=csv data="id,name
1,&amp;amp;#1040;&amp;amp;#1083;&amp;amp;#1077;&amp;amp;#1082;&amp;amp;#1089;&amp;amp;#1077;&amp;amp;#1081;"
``` Extract all sequences ```
| rex field=name max_match=0 "\&amp;amp;#(?&amp;lt;char&amp;gt;\d{4});"
``` Create the char array ```
| eval chars=mvmap(char, printf("%c", char))
``` Remove duplicates from each MV - assumes ordering is preserved ```
| eval char=mvdedup(char), chars=mvdedup(chars)
``` Now replace each item ```
| eval c=0
| foreach chars mode=multivalue [ eval name=replace(name, "\&amp;amp;#".mvindex(char, c).";", &amp;lt;&amp;lt;ITEM&amp;gt;&amp;gt;), c=c+1 ]&lt;/LI-CODE&gt;&lt;P&gt;You could wrap this in a macro that takes a string and performs the conversion.&lt;/P&gt;&lt;P&gt;Note that fixing the ingest is always the best option, but this approach can deal with any existing data.&lt;/P&gt;&lt;P&gt;This assumes you're running Splunk 9.&lt;/P&gt;</description>
    <pubDate>Mon, 15 Jan 2024 23:56:51 GMT</pubDate>
    <dc:creator>bowesmana</dc:creator>
    <dc:date>2024-01-15T23:56:51Z</dc:date>
    <item>
      <title>Converting multiple html entities in each field to unicode characters</title>
      <link>https://community.splunk.com/t5/Splunk-Search/Converting-multiple-html-entities-in-each-field-to-unicode/m-p/674292#M230812</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&amp;nbsp;I have a dataset with very poor quality and multiple encoding errors. Some fields contain data like "&amp;amp;#1040;&amp;amp;#1083;&amp;amp;#1077;&amp;amp;#1082;&amp;amp;#1089;&amp;amp;#1077;&amp;amp;#1081;" which should be "Алексей". My first idea was to search every faulty dataset and convert it externally with a script, but I'm curious whether there's a better way using Splunk. I have no idea how to get there, though.&lt;/P&gt;&lt;P&gt;I somehow need to catch every &amp;amp;#(\d{4}); and I could use printf("%c", \1) to get the correct Unicode character, but I have no idea how to apply that to every occurrence in a single field. Currently I have data like this:&lt;/P&gt;&lt;TABLE border="1" width="100%"&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD width="50%" height="25px"&gt;id&lt;/TD&gt;&lt;TD width="50%" height="25px"&gt;name&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="50%" height="25px"&gt;1&lt;/TD&gt;&lt;TD width="50%" height="25px"&gt;&amp;amp;#1040;&amp;amp;#1083;&amp;amp;#1077;&amp;amp;#1082;&amp;amp;#1089;&amp;amp;#1077;&amp;amp;#1081;&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Where I want to get is this:&lt;/P&gt;&lt;TABLE border="1" width="100%"&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD width="50%"&gt;id&lt;/TD&gt;&lt;TD width="25%"&gt;name&lt;/TD&gt;&lt;TD width="25%"&gt;correct_name&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="50%"&gt;1&lt;/TD&gt;&lt;TD width="25%"&gt;&amp;amp;#1040;&amp;amp;#1083;&amp;amp;#1077;&amp;amp;#1082;&amp;amp;#1089;&amp;amp;#1077;&amp;amp;#1081;&lt;/TD&gt;&lt;TD width="25%"&gt;Алексей&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Any ideas whether that is possible without using Python scripts in Splunk?&lt;/P&gt;&lt;P&gt;Regards&lt;/P&gt;&lt;P&gt;Thorsten&lt;/P&gt;</description>
      <pubDate>Mon, 15 Jan 2024 19:05:38 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/Converting-multiple-html-entities-in-each-field-to-unicode/m-p/674292#M230812</guid>
      <dc:creator>bitnapper</dc:creator>
      <dc:date>2024-01-15T19:05:38Z</dc:date>
    </item>
    <item>
      <title>Re: Converting multiple html entities in each field to unicode characters</title>
      <link>https://community.splunk.com/t5/Splunk-Search/Converting-multiple-html-entities-in-each-field-to-unicode/m-p/674324#M230824</link>
      <description>&lt;P&gt;Here's an example that extracts all the &amp;amp;#nnnn; sequences into a multivalue char field, which is then converted to the chars MV.&lt;/P&gt;&lt;P&gt;mvdedup can be used to remove duplicates in each MV, and the ordering appears to be preserved between the two MVs, so the&amp;nbsp;final foreach will replace each entity inside name.&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;| makeresults format=csv data="id,name
1,&amp;amp;#1040;&amp;amp;#1083;&amp;amp;#1077;&amp;amp;#1082;&amp;amp;#1089;&amp;amp;#1077;&amp;amp;#1081;"
``` Extract all sequences ```
| rex field=name max_match=0 "\&amp;amp;#(?&amp;lt;char&amp;gt;\d{4});"
``` Create the char array ```
| eval chars=mvmap(char, printf("%c", char))
``` Remove duplicates from each MV - assumes ordering is preserved ```
| eval char=mvdedup(char), chars=mvdedup(chars)
``` Now replace each item ```
| eval c=0
| foreach chars mode=multivalue [ eval name=replace(name, "\&amp;amp;#".mvindex(char, c).";", &amp;lt;&amp;lt;ITEM&amp;gt;&amp;gt;), c=c+1 ]&lt;/LI-CODE&gt;&lt;P&gt;You could wrap this in a macro that takes a string and performs the conversion.&lt;/P&gt;&lt;P&gt;Note that fixing the ingest is always the best option, but this approach can deal with any existing data.&lt;/P&gt;&lt;P&gt;This assumes you're running Splunk 9.&lt;/P&gt;</description>
      <pubDate>Mon, 15 Jan 2024 23:56:51 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/Converting-multiple-html-entities-in-each-field-to-unicode/m-p/674324#M230824</guid>
      <dc:creator>bowesmana</dc:creator>
      <dc:date>2024-01-15T23:56:51Z</dc:date>
    </item>
    <item>
      <title>Re: Converting multiple html entities in each field to unicode characters</title>
      <link>https://community.splunk.com/t5/Splunk-Search/Converting-multiple-html-entities-in-each-field-to-unicode/m-p/674327#M230826</link>
      <description>&lt;P&gt;Here's a one-liner that handles both decimal and hexadecimal code points:&lt;/P&gt;&lt;LI-CODE lang="javascript"&gt;| eval name=mvjoin(mvmap(split(name, ";"), printf("%c", if(match(name, "^&amp;amp;#x"), tonumber(replace(name, "&amp;amp;#x", ""), 16), tonumber(replace(name, "&amp;amp;#", ""), 10)))), "")&lt;/LI-CODE&gt;&lt;P&gt;You can also pad the value with XML tags and use the spath command:&lt;/P&gt;&lt;LI-CODE lang="javascript"&gt;| eval name="&amp;lt;name&amp;gt;".name."&amp;lt;/name&amp;gt;"
| spath input=name path=name&lt;/LI-CODE&gt;&lt;P&gt;or the xpath command:&lt;/P&gt;&lt;LI-CODE lang="javascript"&gt;| eval name="&amp;lt;name&amp;gt;".name."&amp;lt;/name&amp;gt;"
| xpath outfield=name "/name" field=name&lt;/LI-CODE&gt;&lt;P&gt;However, avoid the xpath command in this case. It's an external search command and requires creating a separate Python process to invoke $SPLUNK_HOME/etc/apps/search/bin/xpath.py.&lt;/P&gt;</description>
      <pubDate>Tue, 16 Jan 2024 01:21:36 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/Converting-multiple-html-entities-in-each-field-to-unicode/m-p/674327#M230826</guid>
      <dc:creator>tscroggins</dc:creator>
      <dc:date>2024-01-16T01:21:36Z</dc:date>
    </item>
    <item>
      <title>Re: Converting multiple html entities in each field to unicode characters</title>
      <link>https://community.splunk.com/t5/Splunk-Search/Converting-multiple-html-entities-in-each-field-to-unicode/m-p/674329#M230827</link>
      <description>&lt;P&gt;spath works nicely, but the one-liner only works if ALL the data is made up of code points&lt;/P&gt;</description>
      <pubDate>Tue, 16 Jan 2024 02:26:30 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/Converting-multiple-html-entities-in-each-field-to-unicode/m-p/674329#M230827</guid>
      <dc:creator>bowesmana</dc:creator>
      <dc:date>2024-01-16T02:26:30Z</dc:date>
    </item>
    <item>
      <title>Re: Converting multiple html entities in each field to unicode characters</title>
      <link>https://community.splunk.com/t5/Splunk-Search/Converting-multiple-html-entities-in-each-field-to-unicode/m-p/674330#M230828</link>
      <description>&lt;P&gt;Indeed, as it is in&amp;nbsp;&lt;a href="https://community.splunk.com/t5/user/viewprofilepage/user-id/245847"&gt;@bitnapper&lt;/a&gt;'s original question. nullif, match, etc. could be added for input validation.&lt;/P&gt;</description>
      <pubDate>Tue, 16 Jan 2024 02:48:09 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/Converting-multiple-html-entities-in-each-field-to-unicode/m-p/674330#M230828</guid>
      <dc:creator>tscroggins</dc:creator>
      <dc:date>2024-01-16T02:48:09Z</dc:date>
    </item>
    <item>
      <title>Re: Converting multiple html entities in each field to unicode characters</title>
      <link>https://community.splunk.com/t5/Splunk-Search/Converting-multiple-html-entities-in-each-field-to-unicode/m-p/674928#M230996</link>
      <description>&lt;P&gt;Here's an alternative using rex and eval that should accommodate chars that aren't XML entities:&lt;/P&gt;&lt;LI-CODE lang="javascript"&gt;| rex field=name max_match=0 "(?&amp;lt;name&amp;gt;(?:&amp;amp;#[^;]+;|.))"
| eval name=mvjoin(mvmap(name, if(match(name, "^&amp;amp;#"), printf("%c", if(match(name, "^&amp;amp;#x"), tonumber(replace(name, "[&amp;amp;#x;]", ""), 16), tonumber(replace(name, "[&amp;amp;#;]", ""), 10))), name)), "")&lt;/LI-CODE&gt;&lt;P&gt;If I were using any of these solutions myself, I'd choose spath. It should handle any valid XML without needing to handle edge cases in SPL.&lt;/P&gt;</description>
      <pubDate>Sat, 20 Jan 2024 15:39:43 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/Converting-multiple-html-entities-in-each-field-to-unicode/m-p/674928#M230996</guid>
      <dc:creator>tscroggins</dc:creator>
      <dc:date>2024-01-20T15:39:43Z</dc:date>
    </item>
    <item>
      <title>Re: Converting multiple html entities in each field to unicode characters</title>
      <link>https://community.splunk.com/t5/Splunk-Search/Converting-multiple-html-entities-in-each-field-to-unicode/m-p/674974#M231007</link>
      <description>&lt;P&gt;Thanks. &lt;a href="https://community.splunk.com/t5/user/viewprofilepage/user-id/6367"&gt;@bowesmana&lt;/a&gt;'s solution seems to work, but processing it externally seems more efficient. Many thanks to all.&lt;/P&gt;</description>
      <pubDate>Sun, 21 Jan 2024 20:44:56 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/Converting-multiple-html-entities-in-each-field-to-unicode/m-p/674974#M231007</guid>
      <dc:creator>bitnapper</dc:creator>
      <dc:date>2024-01-21T20:44:56Z</dc:date>
    </item>
  </channel>
</rss>

