Can Splunk find similar strings in a log?

samlinsongguo
Communicator

Hi
Can Splunk do a similar-string search?
For example, given the string mystring, I want to return any log event containing a string that looks similar to it, such as my5tring or mystrings.
Cheers

0 Karma
1 Solution

DalJeanis
SplunkTrust

Hi - that depends on your criteria for similarity.

It seems like you are looking for something that will search for all terms within a certain Levenshtein distance. Here's something that will calculate that distance, given two words... https://splunkbase.splunk.com/app/1898/
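
For readers who have not met the term: the Levenshtein distance between two words is the smallest number of single-character insertions, deletions, or substitutions that turns one into the other. The following is a minimal Python sketch of that definition, only for checking distances by hand outside Splunk. It is not how either Splunkbase app is implemented, and the function name `levenshtein` is just an illustrative choice.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    # previous[j] = distance between the prefix of a processed so far
    # (before the current character) and the first j characters of b
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning i characters of a into an empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb (or keep it)
            ))
        previous = current
    return previous[-1]


for candidate in ["mystring", "my5tring", "mystrings", "mytsring"]:
    print(candidate, levenshtein("mystring", candidate))
```

Note that a transposition such as mytsring counts as two edits under plain Levenshtein distance, which matters when choosing a cutoff.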

There is no native Splunk method of getting all such possible terms, and it would be a very expensive search. However, we can string together that expensive search if you want to try.

In essence, to find all items similar to "mystring" you would need to search for ( "*ystring" OR "m*string" OR "my*tring" OR "mys*ring" OR "myst*ing" OR "mystr*ng" OR "mystri*g" OR "mystrin*" )
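
A list like this is easy to mistype by hand, so here is a small Python sketch that generates one wildcard term per character position. The helper name `wildcard_variants` is hypothetical, and the output is only text to paste into a search, nothing Splunk-specific.

```python
def wildcard_variants(term: str) -> list[str]:
    """Replace each character of term, one at a time, with a '*' wildcard."""
    return [term[:i] + "*" + term[i + 1:] for i in range(len(term))]


variants = wildcard_variants("mystring")
# Join the variants into a parenthesised OR clause for the search bar
print("(" + " OR ".join(f'"{v}"' for v in variants) + ")")
```

Because * also matches zero or several characters, each term covers a substitution, a deletion, or extra inserted characters at that one position.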

Efficiency-wise, you would probably be best searching for...

 index=foo ("*ystring" OR "m*g" OR "mystrin*")
| fields ... list the fields you care about ... (_raw and _time will survive this command anyway)

and then narrowing the results by extracting candidates with a more specific regular expression. In this case, we've just translated the above search into a regex that pulls each candidate word into a field called myalmostmatch.

 | rex max_match=0 "\b(?<myalmostmatch>\w*ystring\w*|m\w*g|mystrin\w*)\b"

In the above expression \w* will match any number of word characters (including zero of them). \b matches a word break, and | represents a logical OR between the different things that might match. Thus, this will match any single word that looks about like mystring.

Now, that extraction has not specifically dealt with transpositions - mytsring and so on... but as long as the m and g are there at the beginning and end, those words will be pulled out.
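
To see which words that pattern picks up, it can be tried outside Splunk. This is a rough Python sketch, assuming Python's `re` treats \w and \b closely enough to Splunk's PCRE for plain ASCII text like this. The named group is swapped for a non-capturing group so that `re.findall` returns whole words, and `extract_candidates` is an illustrative name.

```python
import re

# Same alternation as the rex command, with a non-capturing group so that
# findall returns each whole matched word instead of a group.
CANDIDATE = re.compile(r"\b(?:\w*ystring\w*|m\w*g|mystrin\w*)\b")


def extract_candidates(text: str) -> list[str]:
    """Return every word in text that the loose pattern accepts."""
    return CANDIDATE.findall(text)


print(extract_candidates("test two m5strng and mystr1ng"))
print(extract_candidates("a transposed mytsring next to a meeting"))
print(extract_candidates("plural mystrings and truncated mystrin"))
```

The second line shows both effects described above: the transposed word is caught by the m...g alternative, and so is an unrelated word like meeting.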

Okay, we now have the result universe, but the middle term "m*g" could include "meeting" and "mutating". We have to calculate the Levenshtein distance from the target to each of the potential terms that we extracted.

| rename COMMENT as "give each potential record a unique number so we can put them back together later"
| rename COMMENT as "then split apart the records that got multiple hits. mvexpand would also kill any records that got no hits."
| streamstats count as recno
| mvexpand myalmostmatch

| rename COMMENT as "calculate the levenshtein distance and kill all records that require more than 2 changes to match"
| levenshtein distance "mystring" myalmostmatch
| where distance < 3   

| rename COMMENT as "collapse the myalmostmatch string and the distance field into a single field, then delete them so that we can rejoin the record"
| rename COMMENT as "(mvcombine only allows a single field to differ between two records or it won't combine them.)"
| eval mymatch="match=".myalmostmatch.";levenshtein=".distance
| fields - myalmostmatch distance
| mvcombine mymatch

The above will provide the basis to get more or less what you are looking for.

0 Karma


DalJeanis
SplunkTrust

Here's some run-anywhere code using the jellyfisher app https://splunkbase.splunk.com/app/3626/#/details to calculate the Levenshtein distance.

| makeresults | eval mydata="test one mystring!!!!test two m5strng and mystr1ng!!!!test three I'm mortgaging my kid's future!!!!test 5 making my day my5tring!!!!test 6 whatever!!!!test 7 there was no matching word to mystring in test 6" | makemv delim="!!!!" mydata | mvexpand mydata | rename mydata as _raw
|streamstats count | eval _time = _time + count  | fields - count 
| rename COMMENT as "the above just generates test data"

| rex max_match=0 "\b(?<myalmostmatch>\w*ystring\w*|m\w*g|mystrin\w*)\b"
| rename COMMENT as "give each potential record a unique number so we can put them back together later"
 | rename COMMENT as "then split apart the records that got multiple hits. mvexpand would also kill any records that got no hits."
 | streamstats count as recno
 | rename _raw as Raw, _time as Time
 | mvexpand myalmostmatch
 | rename COMMENT as "calculate the levenshtein distance and kill all records that require more than 2 changes to match"
 | eval target="mystring"
 | jellyfisher levenshtein_distance(target,myalmostmatch)
 | rename levenshtein_distance as distance
 | where distance < 3   

 | rename COMMENT as "collapse the myalmostmatch string and the distance field into a single field, then delete them so that we can rejoin the record"
 | rename COMMENT as "(mvcombine only allows a single field to differ between two records or it won't combine them.)"
 | eval mymatch="match=".myalmostmatch.";levenshtein=".distance
 | fields - myalmostmatch distance
 | mvcombine mymatch
 | rename Raw as _raw, Time as _time
 | sort 0 _time 
 | table _time _raw recno mymatch

Resulting in this output

_time                _raw                                                     recno  mymatch
2018-07-21 22:26:36  test one mystring                                          1    match=mystring;levenshtein=0
2018-07-21 22:26:37  test two m5strng and mystr1ng                              2    match=m5strng;levenshtein=2
                                                                                     match=mystr1ng;levenshtein=1
2018-07-21 22:26:39  test 5 making my day my5tring                              4    match=my5tring;levenshtein=1
2018-07-21 22:26:41  test 7 there was no matching word to mystring in test 6    5    match=mystring;levenshtein=0
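
The matches in that table can be cross-checked outside Splunk. What follows is a hedged Python sketch of the same steps (extract candidate words, compute the distance, keep those under 3, regroup per event), assuming Python's `re` behaves like the rex pattern on this ASCII data. The names `levenshtein` and `near_matches` are illustrative and have nothing to do with the jellyfisher app's internals.

```python
import re

CANDIDATE = re.compile(r"\b(?:\w*ystring\w*|m\w*g|mystrin\w*)\b")


def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                 # delete
                               current[j - 1] + 1,              # insert
                               previous[j - 1] + (ca != cb)))   # substitute
        previous = current
    return previous[-1]


def near_matches(events, target="mystring", max_distance=2):
    """Map each event with at least one near match to its match strings."""
    results = {}
    for event in events:
        hits = []
        for word in CANDIDATE.findall(event):       # like rex max_match=0
            distance = levenshtein(target, word)
            if distance <= max_distance:            # like where distance < 3
                hits.append(f"match={word};levenshtein={distance}")
        if hits:                                    # events with no hits drop out
            results[event] = hits
    return results


events = [
    "test one mystring",
    "test two m5strng and mystr1ng",
    "test three I'm mortgaging my kid's future",
    "test 5 making my day my5tring",
    "test 6 whatever",
    "test 7 there was no matching word to mystring in test 6",
]
for event, hits in near_matches(events).items():
    print(event, hits)
```

Words such as mortgaging, making, and matching pass the loose pattern but are discarded by the distance cutoff, which is why tests three and 6 are absent from the results.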
0 Karma

samlinsongguo
Communicator

Thank you for your detailed explanation.

woodcock
Esteemed Legend

Cute joke in subject.

0 Karma

renjith_nair
SplunkTrust

@samlinsongguo

Splunk can search using wildcards. For example, below are my data inputs (events):

1,This string contain mystring
2,This string contain mystrings
3,This string contain my5tring

The search below gives me all three rows

index="test" sourcetype="strings"|search *my*tring*

Below gives me only first 2 rows

index="test" sourcetype="strings"|search *mystring*

And below only the first row

index="test" sourcetype="strings"|search *mystring
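
As a rough way to reason about which rows each wildcard pattern selects, here is a Python sketch that uses `fnmatch` as a stand-in for Splunk's wildcard matching. That equivalence is an assumption: Splunk's real matching depends on how events are segmented and indexed, so treat this only as an approximation for these three sample rows. `rows_matching` is a hypothetical helper.

```python
from fnmatch import fnmatchcase

events = [
    "1,This string contain mystring",
    "2,This string contain mystrings",
    "3,This string contain my5tring",
]


def rows_matching(pattern: str) -> list[str]:
    """Return the sample rows whose full text matches the wildcard pattern."""
    # Lowercase both sides, since Splunk search terms are case-insensitive
    return [e for e in events if fnmatchcase(e.lower(), pattern.lower())]


for pattern in ["*my*tring*", "*mystring*", "*mystring"]:
    print(pattern, rows_matching(pattern))
```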

Hope it clarifies

Happy Splunking!

samlinsongguo
Communicator

Thank you for your suggestion, but it is not exactly what I am looking for. I want to find any string that is similar to mystring, not just the two strings I gave as examples.

0 Karma

thambisetty
SplunkTrust

Hi @samlinsongguo,

Hope this helps you https://docs.splunk.com/Documentation/Splunk/7.1.2/Search/UseCASEandTERMtomatchphrases

————————————
If this helps, give a like below.
0 Karma