<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: non english words length function not working as expected in Splunk Search</title>
    <link>https://community.splunk.com/t5/Splunk-Search/non-english-words-length-function-not-working-as-expected/m-p/673465#M230613</link>
    <description>&lt;P&gt;Normalization may not help for Tamil, which doesn't appear to have canonically equivalent composed forms of most characters in Unicode. I.e., the string&amp;nbsp;இடும்பைக்கு can only (?) be represented in Unicode using 11 code points.&lt;/P&gt;</description>
    <pubDate>Sun, 07 Jan 2024 20:19:53 GMT</pubDate>
    <dc:creator>tscroggins</dc:creator>
    <dc:date>2024-01-07T20:19:53Z</dc:date>
    <item>
      <title>non english words length function not working as expected</title>
      <link>https://community.splunk.com/t5/Splunk-Search/non-english-words-length-function-not-working-as-expected/m-p/668798#M229411</link>
      <description>&lt;P&gt;Hi Splunk Gurus... As you can see, the len() function does not work as expected for non-English words. I checked older posts and the documentation, but no luck. Any suggestions, please? Thanks.&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;| makeresults 
| eval _raw="இடும்பைக்கு"
| eval length=len(_raw) | table _raw length

this produces:
_raw	        length
இடும்பைக்கு	11
(that word இடும்பைக்கு is actually 6 characters, not 11)&lt;/LI-CODE&gt;</description>
      <pubDate>Wed, 15 Nov 2023 23:38:57 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/non-english-words-length-function-not-working-as-expected/m-p/668798#M229411</guid>
      <dc:creator>inventsekar</dc:creator>
      <dc:date>2023-11-15T23:38:57Z</dc:date>
    </item>
    <item>
      <title>Re: non english words length function not working as expected</title>
      <link>https://community.splunk.com/t5/Splunk-Search/non-english-words-length-function-not-working-as-expected/m-p/668825#M229419</link>
      <description>Hi&lt;BR /&gt;Probably best to create a support case and/or add comments/question to docs and/or ask this on Slack?&lt;BR /&gt;r. Ismo</description>
      <pubDate>Thu, 16 Nov 2023 08:03:33 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/non-english-words-length-function-not-working-as-expected/m-p/668825#M229419</guid>
      <dc:creator>isoutamo</dc:creator>
      <dc:date>2023-11-16T08:03:33Z</dc:date>
    </item>
    <item>
      <title>Re: non english words length function not working as expected</title>
      <link>https://community.splunk.com/t5/Splunk-Search/non-english-words-length-function-not-working-as-expected/m-p/668843#M229422</link>
      <description>&lt;P&gt;It does seem like a bug. Splunk is supposed to calculate length based on number of characters, not bytes (and the same goes for parameters in settings, like TRUNCATE or MAX_TIMESTAMP_LOOKAHEAD; so it might be interesting to see if those are also affected).&lt;/P&gt;&lt;P&gt;EDIT: TRUNCATE is actually in bytes. Should get rounded down if it would fall in the middle of a multibyte character. MAX_TIMESTAMP_LOOKAHEAD is in characters however. Confusing.&lt;/P&gt;</description>
      <pubDate>Sun, 07 Jan 2024 14:17:19 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/non-english-words-length-function-not-working-as-expected/m-p/668843#M229422</guid>
      <dc:creator>PickleRick</dc:creator>
      <dc:date>2024-01-07T14:17:19Z</dc:date>
    </item>
    <item>
      <title>Re: non english words length function not working as expected</title>
      <link>https://community.splunk.com/t5/Splunk-Search/non-english-words-length-function-not-working-as-expected/m-p/668932#M229451</link>
      <description>&lt;P&gt;Filled the form for Splunk Channel signup for Slack yesterday. Awaiting update from them. Thanks!&lt;/P&gt;</description>
      <pubDate>Thu, 16 Nov 2023 23:33:46 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/non-english-words-length-function-not-working-as-expected/m-p/668932#M229451</guid>
      <dc:creator>inventsekar</dc:creator>
      <dc:date>2023-11-16T23:33:46Z</dc:date>
    </item>
    <item>
      <title>Re: non english words length function not working as expected</title>
      <link>https://community.splunk.com/t5/Splunk-Search/non-english-words-length-function-not-working-as-expected/m-p/669126#M229513</link>
      <description>&lt;P&gt;It's been 4 days, but still no response from the Splunk Slack admins. Any ideas or suggestions on how to proceed, please? Thanks.&lt;/P&gt;</description>
      <pubDate>Sun, 19 Nov 2023 23:39:22 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/non-english-words-length-function-not-working-as-expected/m-p/669126#M229513</guid>
      <dc:creator>inventsekar</dc:creator>
      <dc:date>2023-11-19T23:39:22Z</dc:date>
    </item>
    <item>
      <title>Re: non english words length function not working as expected</title>
      <link>https://community.splunk.com/t5/Splunk-Search/non-english-words-length-function-not-working-as-expected/m-p/669172#M229521</link>
      <description>&lt;P&gt;Splunk Slack is &lt;STRONG&gt;not&lt;/STRONG&gt; support. Create a case via support portal.&lt;/P&gt;</description>
      <pubDate>Mon, 20 Nov 2023 11:35:16 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/non-english-words-length-function-not-working-as-expected/m-p/669172#M229521</guid>
      <dc:creator>PickleRick</dc:creator>
      <dc:date>2023-11-20T11:35:16Z</dc:date>
    </item>
    <item>
      <title>Re: non english words length function not working as expected</title>
      <link>https://community.splunk.com/t5/Splunk-Search/non-english-words-length-function-not-working-as-expected/m-p/673422#M230596</link>
      <description>&lt;P&gt;I tried to raise a bug report, but it asked me to raise it as an idea, so here it is:&lt;/P&gt;&lt;P&gt;&lt;A href="https://ideas.splunk.com/ideas/EID-I-2176" target="_blank"&gt;https://ideas.splunk.com/ideas/EID-I-2176&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Could you please upvote it so that Splunk will resolve it soon? Thanks.&lt;/P&gt;</description>
      <pubDate>Fri, 05 Jan 2024 22:13:36 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/non-english-words-length-function-not-working-as-expected/m-p/673422#M230596</guid>
      <dc:creator>inventsekar</dc:creator>
      <dc:date>2024-01-05T22:13:36Z</dc:date>
    </item>
    <item>
      <title>Re: non english words length function not working as expected</title>
      <link>https://community.splunk.com/t5/Splunk-Search/non-english-words-length-function-not-working-as-expected/m-p/673443#M230605</link>
      <description>&lt;P&gt;Hi All,&amp;nbsp;&lt;A href="https://docs.splunk.com/Documentation/Community/latest/community/SplunkIdeas" target="_blank" rel="noopener"&gt;https://docs.splunk.com/Documentation/Community/latest/community/SplunkIdeas&lt;/A&gt;&lt;/P&gt;&lt;H2&gt;&lt;FONT size="3"&gt;&lt;STRONG&gt;&lt;SPAN class=""&gt;How Ideas are reviewed and prioritized&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H2&gt;&lt;P&gt;Due to our large and active community of Splunkers, the number of enhancement requests we receive can be voluminous. Splunk Ideas allows us to see which ideas are being requested most across different types of customers and end-user personas.&lt;/P&gt;&lt;P&gt;When we determine which ideas to triage we look at total vote count across a variety of cohorts, which include but are not limited to:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;DIV class=""&gt;Number of total votes (this is the number displayed in the "Vote" box for an idea)&lt;/DIV&gt;&lt;/LI&gt;&lt;LI&gt;&lt;DIV class=""&gt;Number of unique customers requesting an idea ("customers" refers to organizations, not employees)&lt;/DIV&gt;&lt;/LI&gt;&lt;LI&gt;&lt;DIV class=""&gt;Number of votes by customer size or industry. For example: large, small, financial services, government, and so forth.&lt;/DIV&gt;&lt;/LI&gt;&lt;LI&gt;&lt;DIV class=""&gt;Number of votes by customer geography. For example: Americas, Europe, Asia, and so forth.&lt;/DIV&gt;&lt;/LI&gt;&lt;LI&gt;&lt;DIV class=""&gt;Number of votes by end-user persona. For example: admin, SOC analyst, business analyst, and so forth.&lt;/DIV&gt;&lt;/LI&gt;&lt;LI&gt;&lt;DIV class=""&gt;Number of votes from special audiences. For example: &lt;STRONG&gt;Splunk Trust, Design Partners, and so forth.&lt;/STRONG&gt;&lt;/DIV&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Could some long time Splunkers please upvote the idea, so that the Splunk will review it please. 
thanks.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Best Regards,&lt;/P&gt;&lt;P&gt;Sekar&lt;/P&gt;</description>
      <pubDate>Sat, 06 Jan 2024 22:40:07 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/non-english-words-length-function-not-working-as-expected/m-p/673443#M230605</guid>
      <dc:creator>inventsekar</dc:creator>
      <dc:date>2024-01-06T22:40:07Z</dc:date>
    </item>
    <item>
      <title>Re: non english words length function not working as expected</title>
      <link>https://community.splunk.com/t5/Splunk-Search/non-english-words-length-function-not-working-as-expected/m-p/673451#M230607</link>
      <description>&lt;P&gt;I would insist on treating it as a bug.&lt;/P&gt;&lt;P&gt;&lt;A href="https://docs.splunk.com/Documentation/Splunk/latest/SearchReference/CommonEvalFunctions" target="_blank" rel="noopener"&gt;https://docs.splunk.com/Documentation/Splunk/latest/SearchReference/CommonEvalFunctions&lt;/A&gt;&lt;/P&gt;&lt;P&gt;says explicitly&lt;/P&gt;&lt;TABLE&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD&gt;&amp;nbsp;&lt;/TD&gt;&lt;TD&gt;&lt;A class="" href="http://docs.splunk.com/Documentation/Splunk/9.1.2/SearchReference/TextFunctions#len.28.26lt.3Bstr.26gt.3B.29" target="_blank" rel="noopener"&gt;len(&amp;lt;str&amp;gt;)&lt;/A&gt;&lt;/TD&gt;&lt;TD&gt;Returns the count of the number of characters, not bytes, in the string.&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;</description>
      <pubDate>Sun, 07 Jan 2024 14:14:49 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/non-english-words-length-function-not-working-as-expected/m-p/673451#M230607</guid>
      <dc:creator>PickleRick</dc:creator>
      <dc:date>2024-01-07T14:14:49Z</dc:date>
    </item>
    <item>
      <title>Re: non english words length function not working as expected</title>
      <link>https://community.splunk.com/t5/Splunk-Search/non-english-words-length-function-not-working-as-expected/m-p/673460#M230610</link>
      <description>&lt;P&gt;&lt;a href="https://community.splunk.com/t5/user/viewprofilepage/user-id/80737"&gt;@inventsekar&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Your account manager can speak directly to Splunk engineers and product managers to review your support case. This certainly looks like a bug as&amp;nbsp;&lt;a href="https://community.splunk.com/t5/user/viewprofilepage/user-id/231884"&gt;@PickleRick&lt;/a&gt;&amp;nbsp;pointed out. At the very least, it should result in product documentation being updated to reflect product behavior with respect to multibyte characters.&lt;/P&gt;</description>
      <pubDate>Sun, 07 Jan 2024 16:30:38 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/non-english-words-length-function-not-working-as-expected/m-p/673460#M230610</guid>
      <dc:creator>tscroggins</dc:creator>
      <dc:date>2024-01-07T16:30:38Z</dc:date>
    </item>
    <item>
      <title>Re: non english words length function not working as expected</title>
      <link>https://community.splunk.com/t5/Splunk-Search/non-english-words-length-function-not-working-as-expected/m-p/673463#M230611</link>
      <description>&lt;P&gt;Other UTF-8 solutions also count 11 characters, so it may not be a bug if the len() function counts UTF-8 code points.&lt;/P&gt;&lt;P&gt;The string translates to this table split by code points, using &lt;A href="https://en.wiktionary.org/wiki/Appendix:Unicode/Tamil" target="_self"&gt;https://en.wiktionary.org/wiki/Appendix:Unicode/Tamil&lt;/A&gt; as a reference for Tamil characters:&lt;/P&gt;&lt;TABLE border="0" width="256" cellspacing="0" cellpadding="0"&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD width="64" height="20"&gt;URL-encoded UTF-8&lt;/TD&gt;&lt;TD width="64"&gt;Binary-decoded&lt;/TD&gt;&lt;TD width="64"&gt;Unicode Code Point&lt;/TD&gt;&lt;TD width="64"&gt;Character&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD height="20"&gt;%E0%AE%87&lt;/TD&gt;&lt;TD&gt;00001011 10000111&lt;/TD&gt;&lt;TD&gt;U+0B87&lt;/TD&gt;&lt;TD&gt;இ&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD height="20"&gt;%E0%AE%9F&lt;/TD&gt;&lt;TD&gt;00001011 10011111&lt;/TD&gt;&lt;TD&gt;U+0B9F&lt;/TD&gt;&lt;TD&gt;ட&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD height="20"&gt;%E0%AF%81&lt;/TD&gt;&lt;TD&gt;00001011 11000001&lt;/TD&gt;&lt;TD&gt;U+0BC1&lt;/TD&gt;&lt;TD&gt;ு&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD height="20"&gt;%E0%AE%AE&lt;/TD&gt;&lt;TD&gt;00001011 10101110&lt;/TD&gt;&lt;TD&gt;U+0BAE&lt;/TD&gt;&lt;TD&gt;ம&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD height="20"&gt;%E0%AF%8D&lt;/TD&gt;&lt;TD&gt;00001011 11001101&lt;/TD&gt;&lt;TD&gt;U+0BCD&lt;/TD&gt;&lt;TD&gt;◌்&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD height="20"&gt;%E0%AE%AA&lt;/TD&gt;&lt;TD&gt;00001011 10101010&lt;/TD&gt;&lt;TD&gt;U+0BAA&lt;/TD&gt;&lt;TD&gt;ப&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD height="20"&gt;%E0%AF%88&lt;/TD&gt;&lt;TD&gt;00001011 11001000&lt;/TD&gt;&lt;TD&gt;U+0BC8&lt;/TD&gt;&lt;TD&gt;ை&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD height="20"&gt;%E0%AE%95&lt;/TD&gt;&lt;TD&gt;00001011 10010101&lt;/TD&gt;&lt;TD&gt;U+0B95&lt;/TD&gt;&lt;TD&gt;க&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD height="20"&gt;%E0%AF%8D&lt;/TD&gt;&lt;TD&gt;00001011 
11001101&lt;/TD&gt;&lt;TD&gt;U+0BCD&lt;/TD&gt;&lt;TD&gt;◌்&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD height="20"&gt;%E0%AE%95&lt;/TD&gt;&lt;TD&gt;00001011 10010101&lt;/TD&gt;&lt;TD&gt;U+0B95&lt;/TD&gt;&lt;TD&gt;க&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD height="20"&gt;%E0%AF%81&lt;/TD&gt;&lt;TD&gt;00001011 11000001&lt;/TD&gt;&lt;TD&gt;U+0BC1&lt;/TD&gt;&lt;TD&gt;ு&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;P&gt;If we ignore the Unicode mark category, we count 6 characters.&lt;/P&gt;&lt;P&gt;Splunk's implementation of len() would need to be modified to ignore mark characters. Would that produce the correct result across all languages, or would Splunk need to normalize the code points first? Whether a bug or an idea, Splunk would need to address it in a language agnostic way.&lt;/P&gt;</description>
      <pubDate>Sun, 07 Jan 2024 18:33:05 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/non-english-words-length-function-not-working-as-expected/m-p/673463#M230611</guid>
      <dc:creator>tscroggins</dc:creator>
      <dc:date>2024-01-07T18:33:05Z</dc:date>
    </item>
    <item>
      <title>Re: non english words length function not working as expected</title>
      <link>https://community.splunk.com/t5/Splunk-Search/non-english-words-length-function-not-working-as-expected/m-p/673464#M230612</link>
      <description>&lt;P&gt;Ok. So it calls for at least a docs clarification. I can understand that it can be difficult, if not next to impossible, to count characters reliably when they are combined like this. But the ambiguous wording of the description should indeed be clarified. OTOH, I wonder how TRUNCATE would behave if it hit the middle of such a "multicharacter" character.&lt;/P&gt;</description>
      <pubDate>Sun, 07 Jan 2024 20:06:53 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/non-english-words-length-function-not-working-as-expected/m-p/673464#M230612</guid>
      <dc:creator>PickleRick</dc:creator>
      <dc:date>2024-01-07T20:06:53Z</dc:date>
    </item>
    <item>
      <title>Re: non english words length function not working as expected</title>
      <link>https://community.splunk.com/t5/Splunk-Search/non-english-words-length-function-not-working-as-expected/m-p/673465#M230613</link>
      <description>&lt;P&gt;Normalization may not help for Tamil, which doesn't appear to have canonically equivalent composed forms of most characters in Unicode. I.e., the string&amp;nbsp;இடும்பைக்கு can only (?) be represented in Unicode using 11 code points.&lt;/P&gt;</description>
      <pubDate>Sun, 07 Jan 2024 20:19:53 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/non-english-words-length-function-not-working-as-expected/m-p/673465#M230613</guid>
      <dc:creator>tscroggins</dc:creator>
      <dc:date>2024-01-07T20:19:53Z</dc:date>
    </item>
    <item>
      <title>Re: non english words length function not working as expected</title>
      <link>https://community.splunk.com/t5/Splunk-Search/non-english-words-length-function-not-working-as-expected/m-p/673466#M230614</link>
      <description>&lt;P&gt;In a quick test, TRUNCATE truncates the event following the last complete UTF-8 character less than the byte limit. No partial code points are indexed.&lt;/P&gt;</description>
      <pubDate>Sun, 07 Jan 2024 20:32:20 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/non-english-words-length-function-not-working-as-expected/m-p/673466#M230614</guid>
      <dc:creator>tscroggins</dc:creator>
      <dc:date>2024-01-07T20:32:20Z</dc:date>
    </item>
    <item>
      <title>Re: non english words length function not working as expected</title>
      <link>https://community.splunk.com/t5/Splunk-Search/non-english-words-length-function-not-working-as-expected/m-p/673479#M230617</link>
      <description>&lt;P&gt;(I've been a bit fascinated by this off and on today.)&lt;/P&gt;&lt;P&gt;I hadn't done this earlier, but a simple rex command can be used to decompose the field value into its component code points:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="javascript"&gt;| makeresults
| eval _raw="இடும்பைக்கு"
| rex max_match=0 "(?&amp;lt;tmp&amp;gt;.)"&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;TABLE border="1" width="100%"&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD width="33.333333333333336%"&gt;&lt;STRONG&gt;_raw&lt;/STRONG&gt;&lt;/TD&gt;&lt;TD width="33.333333333333336%"&gt;&lt;STRONG&gt;_time&lt;/STRONG&gt;&lt;/TD&gt;&lt;TD width="33.333333333333336%"&gt;&lt;STRONG&gt;tmp&lt;/STRONG&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="33.333333333333336%"&gt;&lt;SPAN&gt;இடும்பைக்கு&lt;/SPAN&gt;&lt;/TD&gt;&lt;TD width="33.333333333333336%"&gt;&lt;SPAN&gt;2024-01-07 18:04:18&lt;/SPAN&gt;&lt;/TD&gt;&lt;TD width="33.333333333333336%"&gt;&lt;DIV class=""&gt;இ&lt;/DIV&gt;&lt;DIV class=""&gt;ட&lt;/DIV&gt;&lt;DIV class=""&gt;ு&lt;/DIV&gt;&lt;DIV class=""&gt;ம&lt;/DIV&gt;&lt;DIV class=""&gt;்&lt;/DIV&gt;&lt;DIV class=""&gt;ப&lt;/DIV&gt;&lt;DIV class=""&gt;ை&lt;/DIV&gt;&lt;DIV class=""&gt;க&lt;/DIV&gt;&lt;DIV class=""&gt;்&lt;/DIV&gt;&lt;DIV class=""&gt;க&lt;/DIV&gt;&lt;DIV class=""&gt;ு&lt;/DIV&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;P&gt;Determining the Unicode category requires a lookup against a Unicode database, a subset of which I've attached as tamil_unicode_block.csv converted to pdf. The general_category field determines whether a code point is a mark (M*):&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="javascript"&gt;| makeresults 
| eval _raw="இடும்பைக்கு"
| rex max_match=0 "(?&amp;lt;char&amp;gt;.)"
| lookup tamil_unicode_block.csv char output general_category
| eval length=mvcount(mvfilter(NOT match(general_category, "^M")))&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;TABLE border="1" width="69.44444444444446%"&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD width="16.666666666666668%" height="25px"&gt;&lt;STRONG&gt;_raw&lt;/STRONG&gt;&lt;/TD&gt;&lt;TD width="16.666666666666668%" height="25px"&gt;&lt;STRONG&gt;_time&lt;/STRONG&gt;&lt;/TD&gt;&lt;TD width="16.666666666666668%" height="25px"&gt;&lt;STRONG&gt;char&lt;/STRONG&gt;&lt;/TD&gt;&lt;TD width="16.666666666666668%" height="25px"&gt;&lt;STRONG&gt;general_category&lt;/STRONG&gt;&lt;/TD&gt;&lt;TD width="16.666666666666668%" height="25px"&gt;&lt;STRONG&gt;length&lt;/STRONG&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="16.666666666666668%" height="25px"&gt;&lt;SPAN&gt;இடும்பைக்கு&lt;/SPAN&gt;&lt;/TD&gt;&lt;TD width="16.666666666666668%" height="25px"&gt;&lt;SPAN&gt;2024-01-07 21:41:41&lt;/SPAN&gt;&lt;/TD&gt;&lt;TD width="16.666666666666668%" height="25px"&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;இ&lt;/DIV&gt;&lt;DIV class=""&gt;ட&lt;/DIV&gt;&lt;DIV class=""&gt;ு&lt;/DIV&gt;&lt;DIV class=""&gt;ம&lt;/DIV&gt;&lt;DIV class=""&gt;்&lt;/DIV&gt;&lt;DIV class=""&gt;ப&lt;/DIV&gt;&lt;DIV class=""&gt;ை&lt;/DIV&gt;&lt;DIV class=""&gt;க&lt;/DIV&gt;&lt;DIV class=""&gt;்&lt;/DIV&gt;&lt;DIV class=""&gt;க&lt;/DIV&gt;&lt;DIV class=""&gt;ு&lt;/DIV&gt;&lt;/DIV&gt;&lt;/TD&gt;&lt;TD width="16.666666666666668%" height="25px"&gt;&lt;DIV class=""&gt;Lo&lt;/DIV&gt;&lt;DIV class=""&gt;Lo&lt;/DIV&gt;&lt;DIV class=""&gt;Mc&lt;/DIV&gt;&lt;DIV class=""&gt;Lo&lt;/DIV&gt;&lt;DIV class=""&gt;Mn&lt;/DIV&gt;&lt;DIV class=""&gt;Lo&lt;/DIV&gt;&lt;DIV class=""&gt;Mc&lt;/DIV&gt;&lt;DIV class=""&gt;Lo&lt;/DIV&gt;&lt;DIV class=""&gt;Mn&lt;/DIV&gt;&lt;DIV class=""&gt;Lo&lt;/DIV&gt;&lt;DIV class=""&gt;Mc&lt;/DIV&gt;&lt;/TD&gt;&lt;TD width="16.666666666666668%" height="25px"&gt;6&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;P&gt;I don't know if this is the&amp;nbsp;&lt;EM&gt;correct&lt;/EM&gt; way to count 
Unicode "characters," but libraries do use the Unicode character database (see&amp;nbsp;&lt;A href="https://www.unicode.org/reports/tr44/" target="_blank" rel="noopener"&gt;https://www.unicode.org/reports/tr44/&lt;/A&gt;)&amp;nbsp;to determine the general category of code points. Splunk would have access to this functionality via e.g. libicu.&lt;/P&gt;</description>
      <pubDate>Mon, 08 Jan 2024 02:53:23 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/non-english-words-length-function-not-working-as-expected/m-p/673479#M230617</guid>
      <dc:creator>tscroggins</dc:creator>
      <dc:date>2024-01-08T02:53:23Z</dc:date>
    </item>
    <item>
      <title>Re: non english words length function not working as expected</title>
      <link>https://community.splunk.com/t5/Splunk-Search/non-english-words-length-function-not-working-as-expected/m-p/673995#M230732</link>
      <description>&lt;P&gt;Hi &lt;a href="https://community.splunk.com/t5/user/viewprofilepage/user-id/49493"&gt;@tscroggins&lt;/a&gt;&amp;nbsp;and all,&lt;/P&gt;&lt;P&gt;I tried to download&amp;nbsp;tamil_unicode_block.csv, but gave up after spending 20 minutes on it.&lt;/P&gt;&lt;P&gt;I then created&amp;nbsp;tamil_unicode_block.csv myself from your PDF file and uploaded it to Splunk.&lt;/P&gt;&lt;P&gt;But the rex-based counting still does not work as I expected. Could you please help me count the characters? Thanks.&lt;/P&gt;&lt;P&gt;Sample event:&amp;nbsp;&lt;BR /&gt;&lt;EM&gt;இடும்பைக்கு இடும்பை&lt;/EM&gt;&lt;SPAN&gt;&amp;nbsp;படுப்பர்&amp;nbsp;&lt;/SPAN&gt;&lt;EM&gt;இடும்பைக்கு &lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&lt;EM&gt;இடும்பை&lt;/EM&gt;&lt;SPAN&gt;&amp;nbsp;படாஅ தவர்&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;Background details: my idea is to index the Tamil-language Thirukkural in Splunk and do some analytics on it.&lt;/P&gt;&lt;P&gt;Each event will be two lines containing exactly seven words.&lt;BR /&gt;Onboarding details are available in a YouTube video on the @siemnewbies channel (I should not post the YouTube link here, as it may look like marketing). I take care of this YouTube channel, which focuses only on Splunk and SIEM newbies.&lt;/P&gt;&lt;P&gt;Best Regards,&lt;/P&gt;&lt;P&gt;Sekar&lt;/P&gt;</description>
      <pubDate>Thu, 11 Jan 2024 23:50:52 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/non-english-words-length-function-not-working-as-expected/m-p/673995#M230732</guid>
      <dc:creator>inventsekar</dc:creator>
      <dc:date>2024-01-11T23:50:52Z</dc:date>
    </item>
    <item>
      <title>Re: non english words length function not working as expected</title>
      <link>https://community.splunk.com/t5/Splunk-Search/non-english-words-length-function-not-working-as-expected/m-p/674008#M230736</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.splunk.com/t5/user/viewprofilepage/user-id/80737"&gt;@inventsekar&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;The PDF appears to have modified the code points! I prefer to use SPL because it doesn't usually require elevated privileges; however, it might be simpler to use an external lookup script. The lookup command treats fields containing only whitespace as empty/null, so the lookup will only identify non-whitespace characters.&lt;/P&gt;&lt;P&gt;We'll need to create a script and a transform, which I've encapsulated in an app:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;$SPLUNK_HOME/etc/apps/TA-ucd/bin/ucd_category_lookup.py&amp;nbsp;&lt;/STRONG&gt;(this file should be readable and executable by the Splunk user, i.e. have at least mode 0500)&lt;/P&gt;&lt;LI-CODE lang="python"&gt;#!/usr/bin/env python

import csv
import unicodedata
import sys

def main():
    # Splunk invokes the script with the names of the two lookup fields.
    if len(sys.argv) != 3:
        print("Usage: python ucd_category_lookup.py [char] [category]")
        sys.exit(1)

    charfield = sys.argv[1]
    categoryfield = sys.argv[2]

    # Splunk passes lookup rows as CSV on stdin and reads results from stdout.
    infile = sys.stdin
    outfile = sys.stdout

    r = csv.DictReader(infile)

    w = csv.DictWriter(outfile, fieldnames=r.fieldnames)
    w.writeheader()

    for result in r:
        # Only rows with a non-empty char value are looked up and returned.
        if result[charfield]:
            result[categoryfield] = unicodedata.category(result[charfield])
            w.writerow(result)
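

# Illustrative sketch only, not part of the original lookup script: a
# hypothetical helper (count_non_mark_chars is an assumed name) that applies
# the same idea to a whole string instead of one code point per row. It
# counts code points that are neither whitespace nor in a Unicode mark
# category (Mn, Mc, Me). Nothing in this script calls it, so the lookup
# behaves exactly as before. It relies on the unicodedata import above.
def count_non_mark_chars(text):
    return sum(
        1
        for ch in text
        if not ch.isspace() and not unicodedata.category(ch).startswith("M")
    )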

main()&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;STRONG&gt;$SPLUNK_HOME/etc/apps/TA-ucd/default/transforms.conf&lt;/STRONG&gt;&lt;/P&gt;&lt;LI-CODE lang="javascript"&gt;[ucd_category_lookup]
external_cmd = ucd_category_lookup.py char category
fields_list = char, category
python.version = python3&lt;/LI-CODE&gt;&lt;P&gt;&lt;STRONG&gt;$SPLUNK_HOME/etc/apps/TA-ucd/metadata/default.meta&lt;/STRONG&gt;&lt;/P&gt;&lt;LI-CODE lang="javascript"&gt;[]
access = read : [ * ], write : [ admin, power ]
export = system&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;With the app in place, we count 31 non-whitespace characters using the lookup:&lt;/P&gt;&lt;LI-CODE lang="javascript"&gt;| makeresults 
| eval _raw="இடும்பைக்கு இடும்பை படுப்பர் இடும்பைக்கு
இடும்பை படாஅ தவர்"
| rex max_match=0 "(?&amp;lt;char&amp;gt;.)"
| lookup ucd_category_lookup char output category
| eval length=mvcount(mvfilter(NOT match(category, "^M")))&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Since this doesn't depend on a language-specific lookup, it should work with text from the Kural or any other source with characters or glyphs represented by Unicode code points.&lt;/P&gt;&lt;P&gt;We can add any logic we'd like to an external lookup script, including counting characters of specific categories directly:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;| makeresults 
| eval _raw="இடும்பைக்கு இடும்பை படுப்பர் இடும்பைக்கு
இடும்பை படாஅ தவர்"
| lookup ucd_count_chars_lookup _raw output count&lt;/LI-CODE&gt;&lt;P&gt;If you'd like to try this approach, I can help with the script, but you may enjoy exploring it yourself first.&lt;/P&gt;</description>
      <pubDate>Fri, 12 Jan 2024 03:49:07 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/non-english-words-length-function-not-working-as-expected/m-p/674008#M230736</guid>
      <dc:creator>tscroggins</dc:creator>
      <dc:date>2024-01-12T03:49:07Z</dc:date>
    </item>
    <item>
      <title>Re: non english words length function not working as expected</title>
      <link>https://community.splunk.com/t5/Splunk-Search/non-english-words-length-function-not-working-as-expected/m-p/674009#M230737</link>
      <description>&lt;P&gt;Excellent,&amp;nbsp;&lt;a href="https://community.splunk.com/t5/user/viewprofilepage/user-id/49493"&gt;@tscroggins&lt;/a&gt;! (If the community allowed it, I would have added more than one upvote. Thanks a ton!)&lt;BR /&gt;(I should start focusing on Python more; Python really solves "big issues" just like that.)&lt;/P&gt;</description>
      <pubDate>Fri, 12 Jan 2024 03:59:44 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/non-english-words-length-function-not-working-as-expected/m-p/674009#M230737</guid>
      <dc:creator>inventsekar</dc:creator>
      <dc:date>2024-01-12T03:59:44Z</dc:date>
    </item>
    <item>
      <title>Re: non english words length function not working as expected</title>
      <link>https://community.splunk.com/t5/Splunk-Search/non-english-words-length-function-not-working-as-expected/m-p/674010#M230738</link>
      <description>&lt;P&gt;Sure, &lt;a href="https://community.splunk.com/t5/user/viewprofilepage/user-id/49493"&gt;@tscroggins&lt;/a&gt;. I spoke with my account manager and wrote to a Splunk account manager (or sales manager, I am not sure). He said he would look into it and reply within a day, but three days have passed and I am still waiting. Let's see. Thanks a lot for your help.&lt;BR /&gt;(As you can see on my YouTube channel "siemnewbies", I have been working on this for more than half a year, but it has been good learning.)&lt;/P&gt;</description>
      <pubDate>Fri, 12 Jan 2024 04:03:44 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/non-english-words-length-function-not-working-as-expected/m-p/674010#M230738</guid>
      <dc:creator>inventsekar</dc:creator>
      <dc:date>2024-01-12T04:03:44Z</dc:date>
    </item>
    <item>
      <title>Re: non english words length function not working as expected</title>
      <link>https://community.splunk.com/t5/Splunk-Search/non-english-words-length-function-not-working-as-expected/m-p/674011#M230739</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.splunk.com/t5/user/viewprofilepage/user-id/80737"&gt;@inventsekar&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;There's a much simpler solution! The regular expression \X token will match any Unicode grapheme. Combined with a lookahead to match only non-whitespace characters, we can extract and count each grapheme:&lt;/P&gt;&lt;LI-CODE lang="javascript"&gt;| makeresults 
| eval _raw="இடும்பைக்கு இடும்பை படுப்பர் இடும்பைக்கு
இடும்பை படாஅ தவர்"
| rex max_match=0 "(?&amp;lt;char&amp;gt;(?=\\S)\\X)"
| eval length=mvcount(char)&lt;/LI-CODE&gt;&lt;P&gt;length = 31&lt;/P&gt;&lt;LI-CODE lang="javascript"&gt;| makeresults 
| eval _raw="இடும்பைக்கு"
| rex max_match=0 "(?&amp;lt;char&amp;gt;(?=\\S)\\X)"
| eval length=mvcount(char)&lt;/LI-CODE&gt;&lt;P&gt;length = 6&lt;/P&gt;&lt;P&gt;We can condense that to a single eval expression:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;| makeresults 
| eval _raw="இடும்பைக்கு இடும்பை படுப்பர் இடும்பைக்கு
இடும்பை படாஅ தவர்"
| eval length=len(replace(replace(_raw, "(?=\\S)\\X", "x"), "\\s", ""))&lt;/LI-CODE&gt;&lt;P&gt;length = 31&lt;/P&gt;&lt;P&gt;You can then use the eval expression in a macro definition and call the macro directly:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;| makeresults 
| eval _raw="இடும்பைக்கு இடும்பை படுப்பர் இடும்பைக்கு
இடும்பை படாஅ தவர்"
| eval length=`num_graphemes(_raw)`&lt;/LI-CODE&gt;&lt;P&gt;To include whitespace characters in the count, remove the (?=\S) lookahead and the outer replace that strips whitespace:&lt;/P&gt;&lt;LI-CODE lang="javascript"&gt;| makeresults 
| eval _raw="இடும்பைக்கு இடும்பை படுப்பர் இடும்பைக்கு
இடும்பை படாஅ தவர்"
| eval length=len(replace(_raw, "\\X", "x"))&lt;/LI-CODE&gt;&lt;P&gt;length = 37&lt;/P&gt;&lt;P&gt;Your new macro would then count each Unicode grapheme, including whitespace characters.&lt;/P&gt;</description>
      <pubDate>Fri, 12 Jan 2024 04:29:30 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/non-english-words-length-function-not-working-as-expected/m-p/674011#M230739</guid>
      <dc:creator>tscroggins</dc:creator>
      <dc:date>2024-01-12T04:29:30Z</dc:date>
    </item>
  </channel>
</rss>

