
non english words length function not working as expected

inventsekar — Wed, 15 Nov 2023 23:38:57 GMT

Hi Splunk Gurus, as you can see below, the len() function is not working as expected for non-English words. I checked old posts and the documentation, but no luck. Any suggestions, please? Thanks.

| makeresults | eval _raw="இடும்பைக்கு" | eval length=len(_raw) | table _raw length

This produces:

_raw	length
இடும்பைக்கு	11

(The word இடும்பைக்கு is actually 6 characters, not 11.)

Re: non english words length function not working as expected

isoutamo — Thu, 16 Nov 2023 08:03:33 GMT

Hi
Probably best to create a support case and/or add comments/question to docs and/or ask this on Slack?
r. Ismo

Re: non english words length function not working as expected

PickleRick — Sun, 07 Jan 2024 14:17:19 GMT

It does seem like a bug. Splunk is supposed to calculate length based on number of characters, not bytes (and the same goes for parameters in settings, like TRUNCATE or MAX_TIMESTAMP_LOOKAHEAD; so it might be interesting to see if those are also affected).

EDIT: TRUNCATE is actually in bytes. It should get rounded down if the limit would fall in the middle of a multibyte character. MAX_TIMESTAMP_LOOKAHEAD, however, is in characters. Confusing.

Re: non english words length function not working as expected

inventsekar — Thu, 16 Nov 2023 23:33:46 GMT

Filled the form for Splunk Channel signup for Slack yesterday. Awaiting update from them. Thanks!

Re: non english words length function not working as expected

inventsekar — Sun, 19 Nov 2023 23:39:22 GMT

It's been 4 days, but still no response from the Splunk Slack admins. Any ideas or suggestions on how to proceed, please? Thanks.

Re: non english words length function not working as expected

PickleRick — Mon, 20 Nov 2023 11:35:16 GMT

Splunk Slack is not support. Create a case via support portal.

Re: non english words length function not working as expected

inventsekar — Fri, 05 Jan 2024 22:13:36 GMT

I tried to raise a bug report, but it asked me to raise it as an idea, so here it is:

https://ideas.splunk.com/ideas/EID-I-2176

Could you please upvote it, so that Splunk will resolve it soon? Thanks.

Re: non english words length function not working as expected

inventsekar — Sat, 06 Jan 2024 22:40:07 GMT

Hi All, https://docs.splunk.com/Documentation/Community/latest/community/SplunkIdeas

How Ideas are reviewed and prioritized

Due to our large and active community of Splunkers, the number of enhancement requests we receive can be voluminous. Splunk Ideas allows us to see which ideas are being requested most across different types of customers and end-user personas.

When we determine which ideas to triage we look at total vote count across a variety of cohorts, which include but are not limited to:

Number of total votes (this is the number displayed in the "Vote" box for an idea)
Number of unique customers requesting an idea ("customers" refers to organizations, not employees)
Number of votes by customer size or industry. For example: large, small, financial services, government, and so forth.
Number of votes by customer geography. For example: Americas, Europe, Asia, and so forth.
Number of votes by end-user persona. For example: admin, SOC analyst, business analyst, and so forth.
Number of votes from special audiences. For example: Splunk Trust, Design Partners, and so forth.

Could some long-time Splunkers please upvote the idea, so that Splunk will review it? Thanks.

Best Regards,

Sekar

Re: non english words length function not working as expected

PickleRick — Sun, 07 Jan 2024 14:14:49 GMT

I would insist on treating it as a bug.

https://docs.splunk.com/Documentation/Splunk/latest/SearchReference/CommonEvalFunctions

says explicitly

len(<str>)

Returns the count of the number of characters, not bytes, in the string.

Re: non english words length function not working as expected

tscroggins — Sun, 07 Jan 2024 16:30:38 GMT

@inventsekar

Your account manager can speak directly to Splunk engineers and product managers to review your support case. This certainly looks like a bug as @PickleRick pointed out. At the very least, it should result in product documentation being updated to reflect product behavior with respect to multibyte characters.

Re: non english words length function not working as expected

tscroggins — Sun, 07 Jan 2024 18:33:05 GMT

Other UTF-8-aware tools also count 11 characters, so it may not be a bug if the len() function counts Unicode code points.

The string translates to this table split by code points, using https://en.wiktionary.org/wiki/Appendix:Unicode/Tamil as a reference for Tamil characters:

URL-encoded UTF-8	Binary-decoded	Unicode Code Point	Character
%E0%AE%87	00001011 10000111	U+0B87	இ
%E0%AE%9F	00001011 10011111	U+0B9F	ட
%E0%AF%81	00001011 11000001	U+0BC1	ு
%E0%AE%AE	00001011 10101110	U+0BAE	ம
%E0%AF%8D	00001011 11001101	U+0BCD	◌்
%E0%AE%AA	00001011 10101010	U+0BAA	ப
%E0%AF%88	00001011 11001000	U+0BC8	ை
%E0%AE%95	00001011 10010101	U+0B95	க
%E0%AF%8D	00001011 11001101	U+0BCD	◌்
%E0%AE%95	00001011 10010101	U+0B95	க
%E0%AF%81	00001011 11000001	U+0BC1	ு

If we ignore code points in the Unicode mark categories (M*), we count 6 characters.

Splunk's implementation of len() would need to be modified to ignore mark characters. Would that produce the correct result across all languages, or would Splunk need to normalize the code points first? Whether a bug or an idea, Splunk would need to address it in a language agnostic way.
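As a rough cross-check outside Splunk, the columns of the table above can be reproduced with Python's standard library. This is only a sketch: the variable names are illustrative, and the word is spelled with escapes taken from the table's code point column so that nothing depends on how the text was pasted.

```python
from urllib.parse import quote

# The 11 code points from the table above, written as escapes.
word = "\u0b87\u0b9f\u0bc1\u0bae\u0bcd\u0baa\u0bc8\u0b95\u0bcd\u0b95\u0bc1"

# One row per code point: URL-encoded UTF-8, 16-bit binary, U+XXXX.
for ch in word:
    bits = format(ord(ch), "016b")
    print(quote(ch), f"{bits[:8]} {bits[8:]}", f"U+{ord(ch):04X}", sep="\t")

code_points = len(word)                  # Python's len() counts code points
utf8_bytes = len(word.encode("utf-8"))   # each of these code points takes 3 bytes
print(code_points, utf8_bytes)
```

Python's len() also reports 11 here, which is consistent with Splunk's len() counting code points rather than bytes (the UTF-8 encoding is 33 bytes long).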

Re: non english words length function not working as expected

PickleRick — Sun, 07 Jan 2024 20:06:53 GMT

OK. So at the very least it calls for a docs clarification. I can understand that reliable character counting can be difficult, if not next to impossible, for such combined characters. But the ambiguous wording of the description should indeed be clarified. OTOH, I wonder how TRUNCATE would behave if it hit the middle of such a "multicharacter" character.

Re: non english words length function not working as expected

tscroggins — Sun, 07 Jan 2024 20:19:53 GMT

Normalization may not help for Tamil, which doesn't appear to have canonically equivalent composed forms of most characters in Unicode. I.e., the string இடும்பைக்கு can only (?) be represented in Unicode using 11 code points.
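A quick way to check the normalization point outside Splunk is Python's unicodedata module. The sketch below is illustrative only; the Latin comparison string is an assumption added for contrast and is not part of the original data.

```python
import unicodedata

# The Tamil word from this thread, spelled as code point escapes.
word = "\u0b87\u0b9f\u0bc1\u0bae\u0bcd\u0baa\u0bc8\u0b95\u0bcd\u0b95\u0bc1"

# NFC composes base + mark pairs only where a precomposed code point exists.
tamil_nfc = unicodedata.normalize("NFC", word)
print(len(word), len(tamil_nfc))

# Contrast: "e" followed by a combining acute accent does have a composed form.
latin = "e\u0301"
latin_nfc = unicodedata.normalize("NFC", latin)
print(len(latin), len(latin_nfc))
```

For this word NFC leaves the code point count at 11, while the Latin pair collapses from 2 code points to 1, so normalizing before calling len() would not give the expected 6.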

Re: non english words length function not working as expected

tscroggins — Sun, 07 Jan 2024 20:32:20 GMT

In a quick test, TRUNCATE truncates the event following the last complete UTF-8 character less than the byte limit. No partial code points are indexed.
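The observed behavior can be illustrated with a small Python sketch. This is not Splunk's implementation, only one way to cut a UTF-8 byte string at a limit without leaving a partial code point; the function name truncate_utf8 is made up for the example.

```python
def truncate_utf8(data: bytes, limit: int) -> bytes:
    """Cut data to at most `limit` bytes without splitting a UTF-8 sequence."""
    if limit >= len(data):
        return data
    end = max(limit, 0)
    # UTF-8 continuation bytes look like 10xxxxxx. If the first dropped byte
    # is one of them, the cut is mid-character, so back up to the lead byte.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]


# The first three code points of the word: two letters and a vowel sign.
word = "\u0b87\u0b9f\u0bc1"
raw = word.encode("utf-8")  # 9 bytes, 3 per code point

for limit in (3, 4, 5, 6, 7):
    kept = truncate_utf8(raw, limit).decode("utf-8")
    print(limit, len(kept))
```

In this sketch a limit of 6 bytes keeps the second letter but drops its vowel sign, so a cut that respects code points can still separate a base letter from its mark. Whether Splunk does the same for combined characters was not tested in this thread.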

Re: non english words length function not working as expected

tscroggins — Mon, 08 Jan 2024 02:53:23 GMT

(I've been a bit fascinated by this off and on today.)

I hadn't done this earlier, but a simple rex command can be used to decompose the field value into its component code points:

| makeresults | eval _raw="இடும்பைக்கு" | rex max_match=0 "(?<tmp>.)"

_raw	_time	tmp
இடும்பைக்கு	2024-01-07 18:04:18	இ ட ு ம ் ப ை க ் க ு

Determining the Unicode category requires a lookup against a Unicode database, a subset of which I've attached as tamil_unicode_block.csv converted to pdf. The general_category field determines whether a code point is a mark (M*):

_raw	_time	char	general_category	length
இடும்பைக்கு	2024-01-07 21:41:41	இ ட ு ம ் ப ை க ் க ு	Lo Lo Mc Lo Mn Lo Mc Lo Mn Lo Mc	6

I don't know if this is the correct way to count Unicode "characters," but libraries do use the Unicode character database (see https://www.unicode.org/reports/tr44/) to determine the general category of code points. Splunk would have access to this functionality via e.g. libicu.
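For comparison, Python ships a copy of the Unicode character database in its unicodedata module, so the same category lookup can be sketched without a CSV file. The variable names below are illustrative, and the word is written with code point escapes.

```python
import unicodedata

# The Tamil word from this thread, spelled as code point escapes.
word = "\u0b87\u0b9f\u0bc1\u0bae\u0bcd\u0baa\u0bc8\u0b95\u0bcd\u0b95\u0bc1"

# General category of each code point (Lo = other letter, Mc/Mn = marks).
categories = [unicodedata.category(ch) for ch in word]
print(" ".join(categories))

# Count only code points whose category is not a mark (M*).
length = sum(1 for cat in categories if not cat.startswith("M"))
print(length)
```

This reproduces the general_category row and the length of 6 shown in the table above.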

Re: non english words length function not working as expected

inventsekar — Thu, 11 Jan 2024 23:50:52 GMT

Hi @tscroggins and all,

I tried to download tamil_unicode_block.csv, but gave up after 20 minutes.

From your PDF file I created tamil_unicode_block.csv myself and uploaded it to Splunk.

But the rex counting still does not work as I expected. Could you please help me count the characters? Thanks.

sample event -
இடும்பைக்கு இடும்பை படுப்பர் இடும்பைக்கு

இடும்பை படாஅ தவர்

Background: my idea is to load the Tamil-language Thirukkural into Splunk and do some analytics.

Each event will be two lines containing exactly seven words.
Onboarding details are available in a YouTube video on the @siemnewbies channel (I should not post the link here, as it may look like marketing). I run this channel, which focuses only on Splunk and SIEM newbies.

Best Regards,

Sekar

Re: non english words length function not working as expected

tscroggins — Fri, 12 Jan 2024 03:49:07 GMT

Hi @inventsekar,

The PDF appears to have modified the code points! I prefer to use SPL because it doesn't usually require elevated privileges; however, it might be simpler to use an external lookup script. The lookup command treats fields containing only whitespace as empty/null, so the lookup will only identify non-whitespace characters.

We'll need to create a script and a transform, which I've encapsulated in an app:

$SPLUNK_HOME/etc/apps/TA-ucd/bin/ucd_category_lookup.py (this file should be readable and executable by the Splunk user, i.e. have at least mode 0500)

#!/usr/bin/env python
import csv
import sys
import unicodedata


def main():
    # Splunk passes the two field names from external_cmd as arguments.
    if len(sys.argv) != 3:
        print("Usage: python ucd_category_lookup.py [char] [category]")
        sys.exit(1)

    charfield = sys.argv[1]
    categoryfield = sys.argv[2]

    # Lookup rows arrive as CSV on stdin; results go back as CSV on stdout.
    r = csv.DictReader(sys.stdin)
    w = csv.DictWriter(sys.stdout, fieldnames=r.fieldnames)
    w.writeheader()

    for result in r:
        # Each char value is expected to hold a single code point.
        if result[charfield]:
            result[categoryfield] = unicodedata.category(result[charfield])
            w.writerow(result)


main()

$SPLUNK_HOME/etc/apps/TA-ucd/default/transforms.conf

[ucd_category_lookup]
external_cmd = ucd_category_lookup.py char category
fields_list = char, category
python.version = python3

$SPLUNK_HOME/etc/apps/TA-ucd/metadata/default.meta

[]
access = read : [ * ], write : [ admin, power ]
export = system

With the app in place, we count 31 non-whitespace characters using the lookup:

| makeresults | eval _raw="இடும்பைக்கு இடும்பை படுப்பர் இடும்பைக்கு இடும்பை படாஅ தவர்" | rex max_match=0 "(?<char>.)" | lookup ucd_category_lookup char output category | eval length=mvcount(mvfilter(NOT match(category, "^M")))

Since this doesn't depend on a language-specific lookup, it should work with text from the Kural or any other source with characters or glyphs represented by Unicode code points.

We can add any logic we'd like to an external lookup script, including counting characters of specific categories directly:

| makeresults | eval _raw="இடும்பைக்கு இடும்பை படுப்பர் இடும்பைக்கு இடும்பை படாஅ தவர்" | lookup ucd_count_chars_lookup _raw output count

If you'd like to try this approach, I can help with the script, but you may enjoy exploring it yourself first.
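For anyone who wants a starting point, here is one possible shape for the counting logic behind a lookup like ucd_count_chars_lookup. The script itself was not posted in this thread, so everything below is an assumption: the function names count_chars and add_counts are hypothetical, and a real external lookup would pass sys.stdin and sys.stdout instead of the in-memory streams used for the demo, with a transforms.conf stanza modeled on the category lookup above.

```python
import csv
import io
import unicodedata


def count_chars(text: str) -> int:
    """Count code points that are neither marks (M*) nor whitespace."""
    return sum(
        1
        for ch in text
        if not ch.isspace() and not unicodedata.category(ch).startswith("M")
    )


def add_counts(infile, outfile, textfield: str, countfield: str) -> None:
    """Read lookup rows as CSV, fill countfield from textfield, write CSV."""
    reader = csv.DictReader(infile)
    writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        if row[textfield]:
            row[countfield] = count_chars(row[textfield])
        writer.writerow(row)


# Demo with in-memory streams: the word from this thread twice, space-separated.
word = "\u0b87\u0b9f\u0bc1\u0bae\u0bcd\u0baa\u0bc8\u0b95\u0bcd\u0b95\u0bc1"
request = io.StringIO(f"_raw,count\n{word} {word},\n")
response = io.StringIO()
add_counts(request, response, "_raw", "count")
print(response.getvalue())
```

Counting inside the script avoids splitting the event into a multivalue field first, at the cost of hard-coding which categories are skipped.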

Re: non english words length function not working as expected

inventsekar — Fri, 12 Jan 2024 03:59:44 GMT

Excellent, @tscroggins! (If the community allowed it, I would have added more than one upvote. Thanks a ton!)
(I should start focusing more on Python; it really solves big issues, just like that.)

Re: non english words length function not working as expected

inventsekar — Fri, 12 Jan 2024 04:03:44 GMT

Sure, @tscroggins. I spoke with my account manager and wrote to a Splunk account manager (or sales manager, I am not sure), and he said he would look into it and reply within a day. Three days have passed and I am still waiting. Let's see. Thanks a lot for your help.
(As you can see on my YouTube channel "siemnewbies", I have been working on this for more than half a year, but it has been good learning.)

Re: non english words length function not working as expected

tscroggins — Fri, 12 Jan 2024 04:29:30 GMT

Hi @inventsekar,

There's a much simpler solution! The regular expression \X token will match any Unicode grapheme. Combined with a lookahead to match only non-whitespace characters, we can extract and count each grapheme:

| makeresults | eval _raw="இடும்பைக்கு இடும்பை படுப்பர் இடும்பைக்கு இடும்பை படாஅ தவர்" | rex max_match=0 "(?<char>(?=\\S)\\X)" | eval length=mvcount(char)

length = 31

| makeresults | eval _raw="இடும்பைக்கு" | rex max_match=0 "(?<char>(?=\\S)\\X)" | eval length=mvcount(char)

length = 6

We can condense that to a single eval expression:

| makeresults | eval _raw="இடும்பைக்கு இடும்பை படுப்பர் இடும்பைக்கு இடும்பை படாஅ தவர்" | eval length=len(replace(replace(_raw, "(?=\\S)\\X", "x"), "\\s", ""))

length = 31

You can then use the eval expression in a macro definition and call the macro directly:

| makeresults | eval _raw="இடும்பைக்கு இடும்பை படுப்பர் இடும்பைக்கு இடும்பை படாஅ தவர்" | eval length=`num_graphemes(_raw)`

To count whitespace characters as well, remove (?=\S) from the regular expression, along with the outer replace() that strips whitespace:

| makeresults | eval _raw="இடும்பைக்கு இடும்பை படுப்பர் இடும்பைக்கு இடும்பை படாஅ தவர்" | eval length=len(replace(_raw, "\\X", "x"))

length = 37

Your new macro would then count each Unicode grapheme, including whitespace characters.