Hi @inventsekar,

The PDF appears to have modified the code points! I prefer to use SPL because it doesn't usually require elevated privileges; however, it might be simpler to use an external lookup script. The lookup command treats fields containing only whitespace as empty/null, so the lookup will only identify non-whitespace characters.

We'll need to create a script and a transform, which I've encapsulated in an app:

$SPLUNK_HOME/etc/apps/TA-ucd/bin/ucd_category_lookup.py (this file should be readable and executable by the Splunk user, i.e. have at least mode 0500)

#!/usr/bin/env python
import csv
import unicodedata
import sys

def main():
    if len(sys.argv) != 3:
        print("Usage: python ucd_category_lookup.py [char] [category]")
        sys.exit(1)

    # Field names are passed as arguments by external_cmd in transforms.conf
    charfield = sys.argv[1]
    categoryfield = sys.argv[2]

    infile = sys.stdin
    outfile = sys.stdout

    r = csv.DictReader(infile)
    w = csv.DictWriter(outfile, fieldnames=r.fieldnames)
    w.writeheader()

    for result in r:
        # Only look up rows that actually carry a character
        if result[charfield]:
            result[categoryfield] = unicodedata.category(result[charfield])
        w.writerow(result)

main()

$SPLUNK_HOME/etc/apps/TA-ucd/default/transforms.conf

[ucd_category_lookup]
external_cmd = ucd_category_lookup.py char category
fields_list = char, category
python.version = python3

$SPLUNK_HOME/etc/apps/TA-ucd/metadata/default.meta

[]
access = read : [ * ], write : [ admin, power ]
export = system

With the app in place, we count 31 non-whitespace characters using the lookup (combining marks, whose categories begin with M, are filtered out of the count):

| makeresults
| eval _raw="இடும்பைக்கு இடும்பை படுப்பர் இடும்பைக்கு
இடும்பை படாஅ தவர்"
| rex max_match=0 "(?<char>.)"
| lookup ucd_category_lookup char output category
| eval length=mvcount(mvfilter(NOT match(category, "^M")))

Since this doesn't depend on a language-specific lookup, it should work with text from the Kural or any other source with characters or glyphs represented by Unicode code points.

We can add any logic we'd like to an external lookup script, including counting characters of specific categories directly:

| makeresults
| eval _raw="இடும்பைக்கு இடும்பை படுப்பர் இடும்பைக்கு
இடும்பை படாஅ தவர்"
| lookup ucd_count_chars_lookup _raw output count

If you'd like to try this approach, I can help with the script, but you may enjoy exploring it yourself first.
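
As a possible starting point, here is a minimal sketch of the counting logic a script behind ucd_count_chars_lookup might use. Only the lookup name comes from the search above; the function names count_chars and run_lookup, the way the field names are passed in, and the choice to skip whitespace and every M* (mark) category are assumptions for illustration, and I haven't run this inside Splunk.

```python
import csv
import sys
import unicodedata


def count_chars(text):
    """Count code points that are neither whitespace nor combining marks (M*)."""
    return sum(
        1
        for ch in text
        if not ch.isspace() and not unicodedata.category(ch).startswith("M")
    )


def run_lookup(textfield, countfield, infile, outfile):
    """Read CSV rows, fill countfield from textfield, and write the rows back out.

    Hypothetical helper: it mirrors the read/modify/write loop of the
    category lookup script, but counts over the whole field value.
    """
    r = csv.DictReader(infile)
    w = csv.DictWriter(outfile, fieldnames=r.fieldnames)
    w.writeheader()
    for result in r:
        if result[textfield]:
            result[countfield] = count_chars(result[textfield])
        w.writerow(result)


# In a real script (assumed layout), the entry point would be something like:
#   run_lookup(sys.argv[1], sys.argv[2], sys.stdin, sys.stdout)
```

A matching transforms.conf stanza would presumably follow the same pattern as ucd_category_lookup, with the two field names passed as arguments to the script. Keeping the counting in its own function makes it easy to check outside Splunk before wiring it up.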