Hi Splunk Gurus... As you can see, the len() function does not work as expected for non-English words. I checked the old posts and documentation, but no luck. Any suggestions, please? Thanks.
| makeresults
| eval _raw="இடும்பைக்கு"
| eval length=len(_raw) | table _raw length
This produces:
_raw length
இடும்பைக்கு 11
(That word, இடும்பைக்கு, is actually 6 characters, not 11.)
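For context, the 11 can be reproduced outside Splunk: the word consists of 11 Unicode code points, 5 of which are combining marks that render together with the base letters. A quick standard-library check in Python:

import unicodedata

word = "இடும்பைக்கு"
for ch in word:
    # Print each code point with its general category and name.
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} {unicodedata.name(ch)}")
print(len(word))  # 11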
Thanks @tscroggins for your upvote and karma points, much appreciated!
I was busy for the last few months and could not spend time on this one.
May I know what your suggestions would be on these points, please:
1) I have been thinking of creating an app per your suggestion listed below. Would you recommend an app, a custom command, or simply uploading a Unicode lookup for all the important languages (e.g. tamil_unicode_block.csv) to Splunk and using it like this (a sketch of how such a lookup file could be generated follows this list):
| makeresults | eval _raw="இடும்பைக்கு" | rex max_match=0 "(?<char>.)" | lookup tamil_unicode_block.csv char output general_category | eval length=mvcount(mvfilter(NOT match(general_category, "^M")))
2) I assume that if I encapsulate the Python script you listed below in that app, it would be a workaround for this issue in a language-agnostic way (the app should work for Tamil, Hindi, Telugu, etc.).
3) Or do you have any other suggestions? Thanks.
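For point 1), a minimal sketch of how a per-language lookup file like tamil_unicode_block.csv could be generated with Python's standard library; the char/general_category column names follow the SPL above, and the U+0B80-U+0BFF range (the Tamil Unicode block) is an assumption for illustration:

import csv
import unicodedata

# Hypothetical generator for tamil_unicode_block.csv: one row per assigned
# code point in the Tamil block, with its Unicode general category.
with open("tamil_unicode_block.csv", "w", newline="", encoding="utf-8") as f:
    w = csv.writer(f)
    w.writerow(["char", "general_category"])
    for cp in range(0x0B80, 0x0C00):
        ch = chr(cp)
        if unicodedata.name(ch, None):  # skip unassigned code points
            w.writerow([ch, unicodedata.category(ch)])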
The app idea (your script from the previous reply):
$SPLUNK_HOME/etc/apps/TA-ucd/bin/ucd_category_lookup.py (this file should be readable and executable by the Splunk user, i.e. have at least mode 0500)
#!/usr/bin/env python

import csv
import unicodedata
import sys

def main():
    if len(sys.argv) != 3:
        print("Usage: python category_lookup.py [char] [category]")
        sys.exit(1)

    # Field names are passed as command-line arguments by Splunk.
    charfield = sys.argv[1]
    categoryfield = sys.argv[2]

    infile = sys.stdin
    outfile = sys.stdout

    # Splunk sends the lookup input as CSV on stdin and reads CSV from stdout.
    r = csv.DictReader(infile)
    header = r.fieldnames

    w = csv.DictWriter(outfile, fieldnames=r.fieldnames)
    w.writeheader()

    for result in r:
        if result[charfield]:
            # Return the Unicode general category (Lo, Mn, Mc, ...) of the character.
            result[categoryfield] = unicodedata.category(result[charfield])
            w.writerow(result)

main()
$SPLUNK_HOME/etc/apps/TA-ucd/default/transforms.conf
[ucd_category_lookup]
external_cmd = ucd_category_lookup.py char category
fields_list = char, category
python.version = python3
$SPLUNK_HOME/etc/apps/TA-ucd/metadata/default.meta
[]
access = read : [ * ], write : [ admin, power ]
export = system
With the app in place, we count 31 non-whitespace characters using the lookup:
| makeresults | eval _raw="இடும்பைக்கு இடும்பை படுப்பர் இடும்பைக்கு இடும்பை படாஅ தவர்" | rex max_match=0 "(?<char>.)" | lookup ucd_category_lookup char output category | eval length=mvcount(mvfilter(NOT match(category, "^M")))
Since this doesn't depend on a language-specific lookup, it should work with text from the Kural or any other source with characters or glyphs represented by Unicode code points.
We can add any logic we'd like to an external lookup script, including counting characters of specific categories directly:
| makeresults | eval _raw="இடும்பைக்கு இடும்பை படுப்பர் இடும்பைக்கு இடும்பை படாஅ தவர்" | lookup ucd_count_chars_lookup _raw output count
If you'd like to try this approach, I can help with the script, but you may enjoy exploring it yourself first.
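The ucd_count_chars_lookup script itself was not shown; as a rough sketch (mirroring ucd_category_lookup.py, with the two field names passed on the command line as an assumed convention), it could count non-mark, non-whitespace code points per row:

#!/usr/bin/env python
# Hypothetical sketch of a ucd_count_chars_lookup external lookup script.

import csv
import sys
import unicodedata

def main():
    if len(sys.argv) != 3:
        print("Usage: python ucd_count_chars_lookup.py [textfield] [countfield]")
        sys.exit(1)

    textfield, countfield = sys.argv[1], sys.argv[2]

    r = csv.DictReader(sys.stdin)
    w = csv.DictWriter(sys.stdout, fieldnames=r.fieldnames)
    w.writeheader()

    for result in r:
        text = result.get(textfield) or ""
        # Count code points that are neither combining marks (category M*)
        # nor whitespace, matching the 31-character example above.
        result[countfield] = sum(
            1 for ch in text
            if not unicodedata.category(ch).startswith("M") and not ch.isspace()
        )
        w.writerow(result)

main()

A matching transforms.conf stanza would point external_cmd at this script with the two field names, analogous to ucd_category_lookup above.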
Hi @inventsekar,
1) As I recall, I only generated the lookup CSV file for testing in Tamil. An all-language lookup might be size prohibitive. The best SPL-based workaround to count all graphemes seems to be an eval expression using the \X regular expression token to match Unicode sequences. The simplest expression was:
| eval count=len(replace(field, "\\X", "x"))
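The same \X idea can be sanity-checked outside SPL with the third-party regex module (the standard library re module does not support \X):

import regex  # pip install regex

word = "இடும்பைக்கு"
print(len(word))                        # 11 code points, as len() reports
print(len(regex.findall(r"\X", word)))  # 6 extended grapheme clusters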
2) The external lookup allowed programmatic access to Python modules or any other library/program if called from an arbitrary script. The example returned a Unicode character category, but the subsequent counting solution wasn't comprehensive.
In Bash, calculating the number of characters may be as simple as:
echo ${#field}
=>
58
but this suffers the same problem as our earlier efforts by not taking into account marks and code sequences used to generate graphemes.
Is Perl better?
perl -CS -lnE 'say length' <<<${field}
=>
58
As before, the length is incorrect. I'm not a Perl expert, but see https://perldoc.perl.org/perlunicode:
The only time that Perl considers a sequence of individual code points as a single logical character is in the \X construct ....
That leads us to:
perl -CS -lnE 's/\X/x/g; say length' <<<${field}
=>
37
There may be a better native Perl, Python, etc. solution, but calling an external program is more expensive than the equivalent SPL.
3) If you only need to count graphemes, I would use the eval command. What other use cases did you have in mind?
In a quick test, TRUNCATE truncates the event following the last complete UTF-8 character less than the byte limit. No partial code points are indexed.
It does seem like a bug. Splunk is supposed to calculate length based on number of characters, not bytes (and the same goes for parameters in settings, like TRUNCATE or MAX_TIMESTAMP_LOOKAHEAD; so it might be interesting to see if those are also affected).
EDIT: TRUNCATE is actually in bytes. Should get rounded down if it would fall in the middle of a multibyte character. MAX_TIMESTAMP_LOOKAHEAD is in characters however. Confusing.
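To illustrate the distinction between the two units, a quick Python check shows the same word has different lengths depending on how it's measured:

word = "இடும்பைக்கு"
print(len(word.encode("utf-8")))  # 33 bytes in UTF-8 (each Tamil code point is 3 bytes)
print(len(word))                  # 11 code points (characters)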
I filled in the form for the Splunk Slack channel signup yesterday. Awaiting an update from them. Thanks!
It's been 4 days, but still no response from the Splunk Slack admins. Any ideas or suggestions on how to proceed, please? Thanks.
Splunk Slack is not support. Create a case via the support portal.