Solved: Can Splunk find similar strings in a log?

samlinsongguo · ‎07-21-2018

Hi,
Can Splunk do a similar-string search?
For example, given the string mystring, I want to return any log event containing a string that looks similar to it, such as my5tring or mystrings.
Cheers

DalJeanis · ‎07-21-2018

Hi - that depends on your criteria for similarity.

It seems like you are looking for something that will search for all terms within a certain Levenshtein distance. Here's something that will calculate that distance, given two words... https://splunkbase.splunk.com/app/1898/

There is no native Splunk method of getting all such possible terms, and it would be a very expensive search. However, we can string together that expensive search if you want to try.

In essence, to find all items similar to "mystring" you would need to search for ( "*ystring" OR "m*string" OR "my*tring" OR "mys*ring" OR "myst*ing" OR "mystr*ng" OR "mystri*g" OR "mystrin*" )
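For illustration outside Splunk, the "one wildcard per character position" idea can be generated mechanically. This is a minimal Python sketch, not SPL; the function names wildcard_terms and as_search_clause are made up for this example.

```python
def wildcard_terms(target: str) -> list[str]:
    """Return one term per character position, with that character replaced by '*'."""
    return [target[:i] + "*" + target[i + 1:] for i in range(len(target))]


def as_search_clause(target: str) -> str:
    """Join the terms into a parenthesised OR clause of quoted terms."""
    quoted = [f'"{term}"' for term in wildcard_terms(target)]
    return "( " + " OR ".join(quoted) + " )"


if __name__ == "__main__":
    # Prints eight terms for the eight characters of "mystring".
    print(as_search_clause("mystring"))
```

Because * can also match nothing or several characters, each term covers a substituted, dropped, or padded character at that position, which is why the list grows with the length of the target.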

Efficiency-wise, you would probably be best searching for...

 index=foo ("*ystring" OR "m*g" OR "mystrin*")
| fields ... list the fields you care about ... (_raw and _time will survive this command anyway)

and then narrowing the results by extracting candidate words with a more specific regular expression. In this case, we've just translated the above search into a regex that pulls each candidate into a field called myalmostmatch.

 | rex max_match=0 "\b(?<myalmostmatch>\w*ystring\w*|m\w*g|mystrin\w*)\b"

In the above expression, \w* matches any number of word characters (including zero of them), \b matches a word boundary, and | is a logical OR between the alternatives. Thus, this will match any single word that looks roughly like mystring.
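To try the pattern away from Splunk, here is a rough Python equivalent of that rex. It is only a sketch: extract_candidates is a made-up helper name, and Python spells the named group as (?P<name>...) rather than (?<name>...).

```python
import re

# Same three alternatives as the rex; only the named-group spelling differs.
CANDIDATE = re.compile(r"\b(?P<myalmostmatch>\w*ystring\w*|m\w*g|mystrin\w*)\b")


def extract_candidates(raw: str) -> list[str]:
    """Return every word in raw that the pattern accepts, like rex with max_match=0."""
    return [m.group("myalmostmatch") for m in CANDIDATE.finditer(raw)]


if __name__ == "__main__":
    print(extract_candidates("test two m5strng and mystr1ng"))
    print(extract_candidates("a meeting about mystrings"))
```

The second call shows why a later filtering step is needed: the loose middle alternative picks up unrelated words that merely start with m and end with g.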

Now, that extraction has not specifically dealt with transpositions - mytsring and so on... but as long as the m and g are there at the beginning and end, those words will be pulled out.

Okay, we now have the result universe, but the middle term "m*g" could include "meeting" and "mutating". We have to calculate the Levenshtein distance from the target to each of the potential terms that we extracted.
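For readers who have not met the metric, this is a minimal Python sketch of the standard dynamic-programming Levenshtein distance. It is not the code inside either Splunkbase app; it only shows what number those commands are expected to return.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes or substitutions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletes
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or keep when equal
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    for word in ("my5tring", "mystrings", "mytsring", "meeting"):
        print(word, levenshtein("mystring", word))
```

Plain Levenshtein counts a transposition such as mytsring as two substitutions, so it still passes a "fewer than 3 changes" filter, while meeting needs three edits and is dropped.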

| rename COMMENT as "give each potential record a unique number so we can put them back together later"
| rename COMMENT as "then split apart the records that got multiple hits. mvexpand would also kill any records that got no hits."
| streamstats count as recno
| mvexpand myalmostmatch

| rename COMMENT as "calculate the levenshtein distance and kill all records that require more than 2 changes to match"
| levenshtein distance "mystring" myalmostmatch
| where distance < 3   

| rename COMMENT as "collapse the myalmostmatch string and the distance field into a single field, then delete them so that we can rejoin the record"
| rename COMMENT as "(mvcombine only allows a single field to differ between two records or it won't combine them.)"
| eval mymatch="match=".myalmostmatch.";levenshtein=".distance
| fields - myalmostmatch distance
| mvcombine mymatch

The above will provide the basis to get more or less what you are looking for.
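The expand, filter and recombine steps can be hard to picture, so here is a small Python sketch of the same data flow. It is an analogy only: filter_and_regroup and the sample records are made up for illustration, and the distances are supplied by hand rather than computed.

```python
from collections import defaultdict


def filter_and_regroup(records, max_distance=2):
    """records: iterable of (recno, [(candidate, distance), ...]) pairs."""
    # mvexpand analogue: one row per candidate; records with no candidates yield no rows.
    rows = [(recno, word, dist) for recno, candidates in records for word, dist in candidates]
    # where analogue: keep rows within the allowed distance (same as distance < 3).
    kept = [row for row in rows if row[2] <= max_distance]
    # eval + mvcombine analogue: fold both fields into one string, then regroup by recno.
    grouped = defaultdict(list)
    for recno, word, dist in kept:
        grouped[recno].append(f"match={word};levenshtein={dist}")
    return dict(grouped)


if __name__ == "__main__":
    sample = [
        (1, [("mystring", 0)]),
        (2, [("m5strng", 2), ("mystr1ng", 1)]),
        (3, [("meeting", 3)]),  # too far from the target, dropped by the filter
        (4, []),                # no candidates at all, dropped at the expand step
    ]
    print(filter_and_regroup(sample))
```

Folding the candidate and its distance into one string before regrouping mirrors the eval step: only one field differs between the expanded rows, so they can be merged back into a single record.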

DalJeanis · ‎07-21-2018

Here's some run-anywhere code using the jellyfisher app https://splunkbase.splunk.com/app/3626/#/details to calculate the Levenshtein distance.

| makeresults | eval mydata="test one mystring!!!!test two m5strng and mystr1ng!!!!test three I'm mortgaging my kid's future!!!!test 5 making my day my5tring!!!!test 6 whatever!!!!test 7 there was no matching word to mystring in test 6" | makemv delim="!!!!" mydata | mvexpand mydata | rename mydata as _raw
|streamstats count | eval _time = _time + count  | fields - count 
| rename COMMENT as "the above just generates test data"

| rex max_match=0 "\b(?<myalmostmatch>\w*ystring\w*|m\w*g|mystrin\w*)\b"
| rename COMMENT as "give each potential record a unique number so we can put them back together later"
 | rename COMMENT as "then split apart the records that got multiple hits. mvexpand would also kill any records that got no hits."
 | streamstats count as recno
 | rename _raw as Raw, _time as Time
 | mvexpand myalmostmatch
 | rename COMMENT as "calculate the levenshtein distance and kill all records that require more than 2 changes to match"
 | eval target="mystring"
 | jellyfisher levenshtein_distance(target,myalmostmatch)
 | rename levenshtein_distance as distance
 | where distance < 3   

 | rename COMMENT as "collapse the myalmostmatch string and the distance field into a single field, then delete them so that we can rejoin the record"
 | rename COMMENT as "(mvcombine only allows a single field to differ between two records or it won't combine them.)"
 | eval mymatch="match=".myalmostmatch.";levenshtein=".distance
 | fields - myalmostmatch distance
 | mvcombine mymatch
 | rename Raw as _raw, Time as _time
 | sort 0 _time 
 | table _time _raw recno mymatch

Resulting in this output

_time                _raw                                                     recno  mymatch
2018-07-21 22:26:36  test one mystring                                          1    match=mystring;levenshtein=0
2018-07-21 22:26:37  test two m5strng and mystr1ng                              2    match=m5strng;levenshtein=2
                                                                                     match=mystr1ng;levenshtein=1
2018-07-21 22:26:39  test 5 making my day my5tring                              4    match=my5tring;levenshtein=1
2018-07-21 22:26:41  test 7 there was no matching word to mystring in test 6    5    match=mystring;levenshtein=0

samlinsongguo · ‎07-22-2018

Thank you for your detailed explanation.

woodcock · ‎07-21-2018

Cute joke in subject.

renjith_nair · ‎07-21-2018

@samlinsongguo

Splunk can search using wildcards. For example, below are my input events:

1,This string contain mystring
2,This string contain mystrings
3,This string contain my5tring

The search below returns all three rows

index="test" sourcetype="strings"|search *my*tring*

The search below returns only the first 2 rows

index="test" sourcetype="strings"|search *mystring*

And below only the first row

index="test" sourcetype="strings"|search *mystring
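As a rough way to check those three claims without a Splunk instance, the patterns can be applied as shell-style globs in Python. This is only an approximation: Splunk matches wildcards against indexed terms and ignores case, whereas this sketch matches the whole line case-sensitively, and matching_rows is a made-up helper name.

```python
from fnmatch import fnmatchcase

# The three sample events from the post, copied as-is.
EVENTS = [
    "1,This string contain mystring",
    "2,This string contain mystrings",
    "3,This string contain my5tring",
]


def matching_rows(pattern: str) -> list[str]:
    """Return the leading row number of each event the wildcard pattern matches."""
    return [event.split(",", 1)[0] for event in EVENTS if fnmatchcase(event, pattern)]


if __name__ == "__main__":
    for pattern in ("*my*tring*", "*mystring*", "*mystring"):
        print(pattern, matching_rows(pattern))
```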

Hope it clarifies


samlinsongguo · ‎07-22-2018

Thank you for your suggestion, but it is not exactly what I am looking for. I want to find any string that is similar to mystring, not just the two strings I gave.

thambisetty · ‎07-21-2018

Hi @samlinsongguo,

Hope this helps you https://docs.splunk.com/Documentation/Splunk/7.1.2/Search/UseCASEandTERMtomatchphrases

