Hi Splunk Gurus... As you can see, the len() function does not work as expected for non-English words. I checked the old posts and documentation, but no luck. Any suggestions, please? Thanks.
| makeresults
| eval _raw="இடும்பைக்கு"
| eval length=len(_raw) | table _raw length
This produces:
_raw length
இடும்பைக்கு 11
(That word, இடும்பைக்கு, is actually 6 characters, not 11.)
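For context, the 11 can be reproduced outside Splunk: the word consists of 11 Unicode code points, 5 of which are combining marks that render together with the base letters. A quick standard-library check in Python:

import unicodedata

word = "இடும்பைக்கு"
for ch in word:
    # Print each code point with its general category and name.
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} {unicodedata.name(ch)}")
print(len(word))  # 11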
Thanks @tscroggins for your upvote and karma points, much appreciated!
I was busy for the last few months and could not spend time on this one.
May I know what your suggestions would be on these points, please:
1) I have been thinking of creating an app per your suggestion listed below. Would you recommend an app, a custom command, or simply uploading a Unicode lookup for all the important languages (e.g. tamil_unicode_block.csv) to Splunk and using it like this (a sketch of how such a lookup file could be generated follows this list):
| makeresults | eval _raw="இடும்பைக்கு" | rex max_match=0 "(?<char>.)" | lookup tamil_unicode_block.csv char output general_category | eval length=mvcount(mvfilter(NOT match(general_category, "^M")))
2) I assume that if I encapsulate the Python script you listed below in that app, it would be a workaround for this issue in a language-agnostic way (the app should work for Tamil, Hindi, Telugu, etc.).
3) Or do you have any other suggestions? Thanks.
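For point 1), a minimal sketch of how a per-language lookup file like tamil_unicode_block.csv could be generated with Python's standard library; the char/general_category column names follow the SPL above, and the U+0B80-U+0BFF range (the Tamil Unicode block) is an assumption for illustration:

import csv
import unicodedata

# Hypothetical generator for tamil_unicode_block.csv: one row per assigned
# code point in the Tamil block, with its Unicode general category.
with open("tamil_unicode_block.csv", "w", newline="", encoding="utf-8") as f:
    w = csv.writer(f)
    w.writerow(["char", "general_category"])
    for cp in range(0x0B80, 0x0C00):
        ch = chr(cp)
        if unicodedata.name(ch, None):  # skip unassigned code points
            w.writerow([ch, unicodedata.category(ch)])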
The app idea (your script from the previous reply):
$SPLUNK_HOME/etc/apps/TA-ucd/bin/ucd_category_lookup.py (this file should be readable and executable by the Splunk user, i.e. have at least mode 0500)
#!/usr/bin/env python

import csv
import unicodedata
import sys

def main():
    if len(sys.argv) != 3:
        print("Usage: python category_lookup.py [char] [category]")
        sys.exit(1)

    # Field names are passed as command-line arguments by Splunk.
    charfield = sys.argv[1]
    categoryfield = sys.argv[2]

    infile = sys.stdin
    outfile = sys.stdout

    # Splunk sends the lookup input as CSV on stdin and reads CSV from stdout.
    r = csv.DictReader(infile)
    header = r.fieldnames

    w = csv.DictWriter(outfile, fieldnames=r.fieldnames)
    w.writeheader()

    for result in r:
        if result[charfield]:
            # Return the Unicode general category (Lo, Mn, Mc, ...) of the character.
            result[categoryfield] = unicodedata.category(result[charfield])
            w.writerow(result)

main()
$SPLUNK_HOME/etc/apps/TA-ucd/default/transforms.conf
[ucd_category_lookup]
external_cmd = ucd_category_lookup.py char category
fields_list = char, category
python.version = python3
$SPLUNK_HOME/etc/apps/TA-ucd/metadata/default.meta
[]
access = read : [ * ], write : [ admin, power ]
export = system
With the app in place, we count 31 non-whitespace characters using the lookup:
| makeresults | eval _raw="இடும்பைக்கு இடும்பை படுப்பர் இடும்பைக்கு இடும்பை படாஅ தவர்" | rex max_match=0 "(?<char>.)" | lookup ucd_category_lookup char output category | eval length=mvcount(mvfilter(NOT match(category, "^M")))
Since this doesn't depend on a language-specific lookup, it should work with text from the Kural or any other source with characters or glyphs represented by Unicode code points.
We can add any logic we'd like to an external lookup script, including counting characters of specific categories directly:
| makeresults | eval _raw="இடும்பைக்கு இடும்பை படுப்பர் இடும்பைக்கு இடும்பை படாஅ தவர்" | lookup ucd_count_chars_lookup _raw output count
If you'd like to try this approach, I can help with the script, but you may enjoy exploring it yourself first.
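The ucd_count_chars_lookup script itself was not shown; as a rough sketch (mirroring ucd_category_lookup.py, with the two field names passed on the command line as an assumed convention), it could count non-mark, non-whitespace code points per row:

#!/usr/bin/env python
# Hypothetical sketch of a ucd_count_chars_lookup external lookup script.

import csv
import sys
import unicodedata

def main():
    if len(sys.argv) != 3:
        print("Usage: python ucd_count_chars_lookup.py [textfield] [countfield]")
        sys.exit(1)

    textfield, countfield = sys.argv[1], sys.argv[2]

    r = csv.DictReader(sys.stdin)
    w = csv.DictWriter(sys.stdout, fieldnames=r.fieldnames)
    w.writeheader()

    for result in r:
        text = result.get(textfield) or ""
        # Count code points that are neither combining marks (category M*)
        # nor whitespace, matching the 31-character example above.
        result[countfield] = sum(
            1 for ch in text
            if not unicodedata.category(ch).startswith("M") and not ch.isspace()
        )
        w.writerow(result)

main()

A matching transforms.conf stanza would point external_cmd at this script with the two field names, analogous to ucd_category_lookup above.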
Hi @inventsekar,
1) As I recall, I only generated the lookup CSV file for testing in Tamil. An all-language lookup might be size prohibitive. The best SPL-based workaround to count all graphemes seems to be an eval expression using the \X regular expression token to match Unicode sequences. The simplest expression was:
| eval count=len(replace(field, "\\X", "x"))
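The same \X idea can be sanity-checked outside SPL with the third-party regex module (the standard library re module does not support \X):

import regex  # pip install regex

word = "இடும்பைக்கு"
print(len(word))                        # 11 code points, as len() reports
print(len(regex.findall(r"\X", word)))  # 6 extended grapheme clusters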
2) The external lookup allowed programmatic access to Python modules or any other library/program if called from an arbitrary script. The example returned a Unicode character category, but the subsequent counting solution wasn't comprehensive.
In Bash, calculating the number of characters may be as simple as:
echo ${#field}
=>
58
but this suffers the same problem as our earlier efforts by not taking into account marks and code sequences used to generate graphemes.
Is Perl better?
perl -CS -lnE 'say length' <<<${field}
=>
58
As before, the length is incorrect. I'm not a Perl expert, but see https://perldoc.perl.org/perlunicode:
The only time that Perl considers a sequence of individual code points as a single logical character is in the \X construct ....
That leads us to:
perl -CS -lnE 's/\X/x/g; say length' <<<${field}
=>
37
There may be a better native Perl, Python, etc. solution, but calling an external program is more expensive than the equivalent SPL.
3) If you only need to count graphemes, I would use the eval command. What other use cases did you have in mind?
In a quick test, TRUNCATE truncates the event following the last complete UTF-8 character less than the byte limit. No partial code points are indexed.
It does seem like a bug. Splunk is supposed to calculate length based on number of characters, not bytes (and the same goes for parameters in settings, like TRUNCATE or MAX_TIMESTAMP_LOOKAHEAD; so it might be interesting to see if those are also affected).
EDIT: TRUNCATE is actually in bytes. Should get rounded down if it would fall in the middle of a multibyte character. MAX_TIMESTAMP_LOOKAHEAD is in characters however. Confusing.
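To illustrate the distinction between the two units, a quick Python check shows the same word has different lengths depending on how it's measured:

word = "இடும்பைக்கு"
print(len(word.encode("utf-8")))  # 33 bytes in UTF-8 (each Tamil code point is 3 bytes)
print(len(word))                  # 11 code points (characters)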
I filled in the form for the Splunk Slack channel signup yesterday. Awaiting an update from them. Thanks!
It's been 4 days, but still no response from the Splunk Slack admins. Any ideas or suggestions on how to proceed, please? Thanks.
Splunk Slack is not support. Create a case via the support portal.