Splunk Search

How can I find events that contain non-English words?

indeed_2000
Motivator

Hi

How can I find events that contain non-English words?

For example, I have a log file in which some lines contain German or Arabic words. How can I identify these lines?

Thanks

VatsalJagani
SplunkTrust

@indeed_2000 - Try the search below. It keeps only the events that contain at least one character outside the ASCII range.

<your base search>
| regex _raw="[^\x00-\x7F]"

 

I hope this helps!!!
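
The filtering behaviour can be tried on generated data before running it against real logs. This is only a sketch: the two sample events and the field name `text` are made up for illustration, and it assumes the events are indexed as UTF-8 so that the character class sees whole characters rather than bytes.

```spl
| makeresults count=2
| streamstats count AS n
| eval text=if(n=1, "user alice logged in", "user دالكي logged in")
| regex text="[^\x00-\x7F]"
| table n text
```

Under those assumptions only the second event should remain, because `regex` drops every event whose field does not match the pattern. Unlike `rex`, it filters events and does not create a new field.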


indeed_2000
Motivator

@VatsalJagani This does not work!

VatsalJagani
SplunkTrust

@indeed_2000 - The regex does work on the event that you provided.

https://regex101.com/r/sG3IdX/1


indeed_2000
Motivator

That matches each character separately, but this line contains only one non-English word!

Expected output:

| table NonEnglish 

دالكي

 

Any ideas?

DanielMustaine
Explorer

I can't help much with non-English words written in ASCII characters, but for languages whose characters lie outside ASCII you could consider using Unicode scripts and categories:

https://www.regular-expressions.info/unicode.html 

Quick example:

| makeresults | eval text="كلب means dog according to google" | rex field=text "(?<capturetext>\p{Arabic}*)" | table text capturetext


indeed_2000
Motivator

@DanielMustaine This does not work on the following event:

 

2022-06-20 11:16:10,381 INFO [APP] log in : 38773763@#123@دالكي 

Any ideas?

DanielMustaine
Explorer

Hey, try this:

 

| makeresults | eval text="2022-06-20 11:16:10,381 INFO [APP] log in : 38773763@#123@دالكي " |rex field=text "(?<capturetext>[\p{Arabic}]+)" | table text capturetext

 

Alternatively, use VatsalJagani's regex (which matches all non-ASCII characters, so it will also pick up Cyrillic and other scripts):

| makeresults | eval text="2022-06-20 11:16:10,381 INFO [APP] log in : 38773763@#123@دالكي " |rex field=text "(?<capturetext>[^\x00-\x7F]+)" | table text capturetext

 

If you know which field the non-standard characters will be in, substitute that field name into field=xxx. Otherwise you can search the whole event with field=_raw, although that will perform poorly.

indeed_2000
Motivator

This still does not work as expected. If a line contains more than one non-English word, I expect all of them to be captured.

E.g.

Expected result for the line below: دالكي  هلت 

"2022-06-20 11:16:10,381 INFO [APP] log in : 38773763@#123@دالكي  هلت " 


DanielMustaine
Explorer

Try this:

| rex max_match=0 field=text "(?<capturetext>\p{Arabic}[\p{Arabic}\s]+)"

 

I had less success with this one, but I did not have much time to experiment with it.

| rex field=text max_match=0 "(?<capturetext>[^\x00-\x7F][^\x00-\x7F\s]+)" 

 

Setting max_match=0 also means that multiple phrases in one event are pulled into capturetext as separate values of a multivalue field.
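
To see the multivalue behaviour, something like the following could be run. It is a sketch rather than a tested solution: it reuses the sample line from this thread, swaps in the simpler `[^\x00-\x7F]+` pattern, and the field name `word_count` is invented for the example.

```spl
| makeresults
| eval text="2022-06-20 11:16:10,381 INFO [APP] log in : 38773763@#123@دالكي  هلت "
| rex field=text max_match=0 "(?<capturetext>[^\x00-\x7F]+)"
| eval word_count=mvcount(capturetext)
| mvexpand capturetext
| table capturetext word_count
```

Here `mvcount` reports how many separate runs of non-ASCII characters were captured, and `mvexpand` splits the multivalue field so that each captured word lands in its own result row.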

 

PickleRick
SplunkTrust

Just remember that, as was stated before, this has nothing to do with language as such. It will not capture, for example, the sentence "moja matka jada pomidory" ("my mother eats tomatoes" in Polish), even though it's clearly not an English sentence.
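
A quick way to check this is to run the non-ASCII test against both kinds of text. The sketch below uses made-up sample rows and a hypothetical field name `has_non_ascii`, and it assumes the regex engine handles the text as UTF-8.

```spl
| makeresults count=2
| streamstats count AS n
| eval text=if(n=1, "moja matka jada pomidory", "log in : دالكي")
| eval has_non_ascii=if(match(text, "[^\x00-\x7F]"), "yes", "no")
| table text has_non_ascii
```

The Polish sentence consists entirely of ASCII letters, so the character-range test cannot flag it. Recognising a language, as opposed to a script or character range, would need a different technique, such as a dictionary lookup.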

indeed_2000
Motivator

Is there any way to treat all the non-English characters in each line as one extraction?

2022-06-20 11:16:10,381 INFO [APP] log in : 38773763@#123@دالكي  string تست

 

Expected result:

تست دالكي   
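
One possible approach, offered as an untested sketch: extract every run of non-ASCII characters with max_match=0 and then join the resulting multivalue field into a single string with `mvjoin`. The field name `NonEnglish` follows the earlier post in this thread, and the single-space separator is an assumption.

```spl
| makeresults
| eval text="2022-06-20 11:16:10,381 INFO [APP] log in : 38773763@#123@دالكي  string تست"
| rex field=text max_match=0 "(?<NonEnglish>[^\x00-\x7F]+)"
| eval NonEnglish=mvjoin(NonEnglish, " ")
| table text NonEnglish
```

The words are joined in the order in which they appear in the event. On real data, replace the `makeresults` and `eval` lines with the base search and point `field=` at the relevant field, or at `_raw`.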
