Hi @inventsekar,

1) As I recall, I only generated the lookup CSV file for testing in Tamil. An all-language lookup might be prohibitively large. The best SPL-based workaround to count all graphemes seems to be an eval expression using the \X regular expression token, which matches a full Unicode grapheme sequence (a base character plus any marks attached to it). The simplest expression was:

    | eval count=len(replace(field, "\\X", "x"))

2) The external lookup allowed programmatic access to Python modules or any other library/program if called from an arbitrary script. The example returned a Unicode character category, but the subsequent counting solution wasn't comprehensive. In Bash, calculating the number of characters may be as simple as:

    echo ${#field}
=> 58

but this suffers the same problem as our earlier efforts by not taking into account marks and code sequences used to generate graphemes. Is Perl better?

    perl -CS -lnE 'say length' <<<${field}
=> 58

As before, the length is incorrect. I'm not a Perl expert, but see https://perldoc.perl.org/perlunicode:

"The only time that Perl considers a sequence of individual code points as a single logical character is in the \X construct ...."

That leads us to:

    perl -CS -lnE 's/\X/x/g; say length' <<<${field}
=> 37

There may be a better native Perl, Python, etc. solution, but calling an external program is more expensive than the equivalent SPL.

3) If you only need to count graphemes, I would use the eval command. What other use cases did you have in mind?
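
P.S. If an external script is unavoidable, here is a rough Python sketch of the same idea using only the standard library. It is an approximation, not an implementation of \X: it counts code points whose Unicode general category is not a mark (Mn, Mc, Me), which works for base-plus-mark text such as Tamil but does not handle emoji ZWJ sequences, Hangul jamo, or regional-indicator pairs. The function name approx_grapheme_count and the sample string are hypothetical, chosen for illustration only.

```python
import unicodedata


def approx_grapheme_count(text: str) -> int:
    """Approximate grapheme count: code points that are not combining marks.

    Assumption: every mark (general category Mn, Mc, or Me) attaches to the
    preceding base character. This is NOT full \\X grapheme segmentation.
    """
    return sum(1 for ch in text if not unicodedata.category(ch).startswith("M"))


# Sample word "தமிழ்" spelled with escapes: 3 letters plus 2 marks.
sample = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"

print(len(sample))                    # 5 code points
print(approx_grapheme_count(sample))  # 3 graphemes
```

As with the Perl one-liner, this would still pay the cost of calling out from Splunk, so the eval expression remains the cheaper option when a count is all you need.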