Solved: How can I find events that contain non english wor...

indeed_2000 · ‎10-14-2021

Hi

How can I find events that contain non-English words?

For example, I have a log file in which some lines contain German or Arabic words. How can I recognize these lines?

Thanks

VatsalJagani · ‎06-20-2022

@indeed_2000 - Try the search below. It finds any event that contains a character outside the ASCII range.

<your base search>
| regex _raw="[^\x00-\x7F]"

I hope this helps!!!

indeed_2000 · ‎06-21-2022

@VatsalJagani It doesn't work!

VatsalJagani · ‎06-21-2022

@indeed_2000 - The regex does work on the event that you provided.

https://regex101.com/r/sG3IdX/1

indeed_2000 · ‎06-21-2022

It treats each character as a separate match, but in this line I have only one non-English word!

Expected output:

| table NonEnglish

دالكي

Any idea?

DanielMustaine · ‎06-20-2022

I can't help much with non-English words written in ASCII characters, but for languages written in non-ASCII scripts you could consider using Unicode categories and scripts:

https://www.regular-expressions.info/unicode.html

Quick example:

| makeresults | eval text="كلب means dog according to google" | rex field=text "(?<capturetext>\p{Arabic}*)" | table text capturetext

indeed_2000 · ‎06-21-2022

@DanielMustaine It doesn't work on this event:

2022-06-20 11:16:10,381 INFO [APP] log in : 38773763@#123@دالكي

Any idea?

DanielMustaine · ‎06-22-2022

Hey, try this:

| makeresults | eval text="2022-06-20 11:16:10,381 INFO [APP] log in : 38773763@#123@دالكي " |rex field=text "(?<capturetext>[\p{Arabic}]+)" | table text capturetext

Alternatively, use VatsalJagani's regex, which matches all non-ASCII characters and so also picks up Cyrillic and other scripts:

| makeresults | eval text="2022-06-20 11:16:10,381 INFO [APP] log in : 38773763@#123@دالكي " |rex field=text "(?<capturetext>[^\x00-\x7F]+)" | table text capturetext

If you know which field the non-ASCII characters will be in, substitute that field name into field=xxx. Otherwise you can search the whole event with field=_raw, although that will perform poorly.

indeed_2000 · ‎06-22-2022

It still doesn't work as expected. If a line contains more than one non-English word, I expect all of them to be captured.

E.g.

Expected result for the line below: دالكي هلت

"2022-06-20 11:16:10,381 INFO [APP] log in : 38773763@#123@دالكي هلت "

DanielMustaine · ‎06-23-2022

Try this:

| rex max_match=0 field=text "(?<capturetext>\p{Arabic}[\p{Arabic}\s]+)"

I had less success with the next one, but I didn't have much time to experiment with it.

| rex field=text max_match=0 "(?<capturetext>[^\x00-\x7F][^\x00-\x7F\s]+)"

max_match=0 also means that when one event contains several matching phrases, each of them is added to capturetext as a separate value of a multivalue field.
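To see how the two patterns differ, they can be tried outside Splunk. The following is only a rough Python sketch, not SPL: Python's standard `re` module has no `\p{Arabic}`, so the range `\u0600-\u06FF` (the basic Arabic block) is used as an assumed stand-in, `search` stands in for a single-match rex, and `findall` stands in for max_match=0.

```python
import re

# Sample event from the thread: two Arabic words separated by a space.
line = "2022-06-20 11:16:10,381 INFO [APP] log in : 38773763@#123@دالكي هلت "

# Assumed stand-in for \p{Arabic}; covers only the basic Arabic block.
ARABIC = r"\u0600-\u06FF"

# First pattern: an Arabic character followed by Arabic characters or whitespace.
phrase = re.compile(rf"[{ARABIC}][{ARABIC}\s]+")

# Second pattern: whitespace is excluded from the run, so a match stops at a space.
non_ascii = re.compile(r"[^\x00-\x7F][^\x00-\x7F\s]+")

first_only = phrase.search(line).group(0)  # a single match
all_phrases = phrase.findall(line)         # every match, as a list
all_words = non_ascii.findall(line)        # every match, as a list

print(all_phrases)  # one phrase that spans both words (with trailing space)
print(all_words)    # two separate words
```

In this sketch the first pattern returns the two words as one phrase because whitespace is allowed inside the run, while the second returns them as two values. The second pattern also needs at least two consecutive non-ASCII characters, so a lone non-ASCII character would not match.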

PickleRick · ‎06-25-2022

Just remember that, as was stated before, this has nothing to do with language as such. It will not capture, for example, the sentence "moja matka jada pomidory" ("my mother eats tomatoes" in Polish), even though it's clearly not an English sentence.
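The point is easy to check outside Splunk. Below is a minimal Python sketch (the helper name `has_non_ascii` is made up for illustration) that applies the same `[^\x00-\x7F]` test to the Polish sentence, to the Arabic event from this thread, and to an English loanword with an accent, which is an extra example not taken from the thread.

```python
import re

# Matches any single character outside the 7-bit ASCII range.
NON_ASCII = re.compile(r"[^\x00-\x7F]")

def has_non_ascii(text: str) -> bool:
    """Return True if text contains at least one non-ASCII character."""
    return NON_ASCII.search(text) is not None

polish = "moja matka jada pomidory"  # not English, but pure ASCII
arabic_event = "2022-06-20 11:16:10,381 INFO [APP] log in : 38773763@#123@دالكي"
loanword = "café"  # used in English, but contains a non-ASCII character

for sample in (polish, arabic_event, loanword):
    print(has_non_ascii(sample), sample)
```

So the test detects characters, not languages: it misses the Polish sentence and flags the accented English word.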

indeed_2000 · ‎06-24-2022

Is there any way to extract all the non-English characters in each line as a single value?

2022-06-20 11:16:10,381 INFO [APP] log in : 38773763@#123@دالكي string تست

Expected result:

تست دالكي
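One possible approach, sketched here in Python rather than SPL, is to collect every run of non-ASCII characters and then join the runs into a single string. In Splunk the analogous steps would presumably be rex with max_match=0 followed by joining the resulting multivalue field (for example with the mvjoin eval function); that SPL variant is an untested assumption.

```python
import re

# Event from the question: two Arabic words separated by an ASCII word.
line = "2022-06-20 11:16:10,381 INFO [APP] log in : 38773763@#123@دالكي string تست"

# Collect every run of one or more non-ASCII characters, in order of appearance.
runs = re.findall(r"[^\x00-\x7F]+", line)

# Join the runs with a single space to get one value per line.
non_english = " ".join(runs)

print(non_english)
```

The words come out in the order they appear in the event. How they are displayed depends on the right-to-left rendering of the terminal or browser.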
