@indeed_2000 - Try the search below. It will find any event that contains a character outside the desired character set (here, ASCII).
<your base search>
| regex _raw="[^\x00-\x7F]"
I hope this helps!!!
@VatsalJagani this does not work!
It treats each character separately, but this line contains only one non-English word.
Expected output:
| table NonEnglish
دالكي
Any idea?
I can't help much with non-English words written in ASCII characters, but for languages whose characters are in the Unicode set you could consider using Unicode categories and scripts:
https://www.regular-expressions.info/unicode.html
Quick example:
| makeresults | eval text="كلب means dog according to google" | rex field=text "(?<capturetext>\p{Arabic}*)" | table text capturetext
@DanielMustaine this does not work on the following line:
2022-06-20 11:16:10,381 INFO [APP] log in : 38773763@#123@دالكي
Any idea?
Hey, try this:
| makeresults | eval text="2022-06-20 11:16:10,381 INFO [APP] log in : 38773763@#123@دالكي " |rex field=text "(?<capturetext>[\p{Arabic}]+)" | table text capturetext
Alternatively, use VatsalJagani's regex, which matches all non-ASCII characters (so it will also pick up Russian and other character sets):
| makeresults | eval text="2022-06-20 11:16:10,381 INFO [APP] log in : 38773763@#123@دالكي " |rex field=text "(?<capturetext>[^\x00-\x7F]+)" | table text capturetext
If you know which field the non-standard characters will be in, substitute that field name into field=xxx. Otherwise you can search the whole event with field=_raw, although that will perform poorly.
This still does not work as expected. If more than one non-English word exists on a line, I expect all of them to be captured.
e.g.
Expected result for the line below: دالكي هلت
"2022-06-20 11:16:10,381 INFO [APP] log in : 38773763@#123@دالكي هلت "
Try this:
| rex max_match=0 field=text "(?<capturetext>\p{Arabic}[\p{Arabic}\s]+)"
I didn't have as much success with this one, but I haven't had much time to experiment:
| rex field=text max_match=0 "(?<capturetext>[^\x00-\x7F][^\x00-\x7F\s]+)"
Setting max_match=0 also means that multiple phrases in one log event are pulled into capturetext as separate values of a multivalue field.
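To see the multivalue behaviour for yourself, here is a minimal sketch that reuses the sample event from this thread. The phrase_count field is only an illustrative name, not something from the original searches. Because \s is inside the character class, a captured value can carry trailing whitespace when the Arabic text is followed by a space.

```
| makeresults
| eval text="2022-06-20 11:16:10,381 INFO [APP] log in : 38773763@#123@دالكي string تست"
| rex field=text max_match=0 "(?<capturetext>\p{Arabic}[\p{Arabic}\s]+)"
| eval phrase_count=mvcount(capturetext)
| table text capturetext phrase_count
```

Each separate run of Arabic text should come back as its own value in capturetext, and mvcount reports how many values were captured.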
Just remember that, as stated before, this has nothing to do with language as such. It will not capture, for example, the sentence "moja matka jada pomidory" ("my mother eats tomatoes" in Polish), even though it is clearly not an English sentence.
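A quick way to check this point, as a sketch only: run the non-ASCII pattern against that Polish sentence. The has_non_ascii field is a hypothetical name added for illustration. Since every character in the sentence is plain ASCII, rex finds nothing and capturetext is never created.

```
| makeresults
| eval text="moja matka jada pomidory"
| rex field=text max_match=0 "(?<capturetext>[^\x00-\x7F]+)"
| eval has_non_ascii=if(isnull(capturetext), "no", "yes")
| table text has_non_ascii
```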
Is there any way to treat all the non-English characters in each line as one extraction?
2022-06-20 11:16:10,381 INFO [APP] log in : 38773763@#123@دالكي string تست
Expected result:
تست دالكي
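One possible approach, offered as a sketch rather than a tested answer: capture every Arabic run as a multivalue field with max_match=0, then collapse the values into a single string with mvjoin. The field name word is an assumption for illustration, and NonEnglish is taken from the table command earlier in the thread. Swap \p{Arabic} for [^\x00-\x7F] if you need every non-ASCII script rather than only Arabic.

```
| makeresults
| eval text="2022-06-20 11:16:10,381 INFO [APP] log in : 38773763@#123@دالكي string تست"
| rex field=text max_match=0 "(?<word>\p{Arabic}+)"
| eval NonEnglish=mvjoin(word, " ")
| table text NonEnglish
```

The values are joined in the order they appear in the event, separated by a single space. On real data, replace field=text with the field that holds the text, or with field=_raw.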