For example, if I have a username of bsmith843 in a field returned by one search, and bsmiths845 in a field from another search, is there any way to gauge the similarity between the two strings? I know I can use wildcards/regex to try to match the strings, but if I can't match every one, I would like to know how similar they are.
And from even further in the future...
There is an app on Splunkbase which supports Levenshtein distance, Damerau-Levenshtein distance, Jaro distance, Jaro-Winkler, match rating comparison, and Hamming distance comparisons, plus a number of phonetic algorithms, including Soundex. It is called JellyFisher. Here is a sample Levenshtein distance evaluation using this app:
... | jellyfisher levenshtein_distance(sourcetype,source)
What would be returned here is an integer, according to this description of Levenshtein distance.
Each of the JellyFisher functions returns its result in a field named after the function (e.g., levenshtein_distance, damerau_levenshtein_distance, soundex).
Here is a link to the JellyFisher app.
Here is a mocked-up use of it:
| makeresults | eval foo="kitten", bar="smitten" | jellyfisher levenshtein_distance(foo, bar) | table foo bar levenshtein_distance
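If you want to sanity-check the integer you get back, Levenshtein distance is easy to reproduce outside Splunk. This is only a minimal sketch of the standard dynamic-programming algorithm in plain Python, not JellyFisher's actual implementation; the function name levenshtein is made up for illustration.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character edits that turn a into b."""
    # previous[j] = distance between the prefix of a handled so far and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca -> cb (free if equal)
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "smitten"))        # 2
print(levenshtein("bsmith843", "bsmiths845"))  # 2
```

Both pairs come out at 2: one insertion plus one substitution in each case.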
There is a Python function that does something very close to this. It returns a number between 0 and 1 based on the similarity of two terms. You can find it in the
Here is a really quick example of an app named "fieldcompare" which contains a single Python search command. The app is made up of the following files:
import sys
import difflib

import splunk.Intersplunk

# sys must be imported: the getinfo probe below reads and rewrites sys.argv
(isgetinfo, sys.argv) = splunk.Intersplunk.isGetInfo(sys.argv)
args, kwargs = splunk.Intersplunk.getKeywordsAndOptions()

if isgetinfo:
    # streaming, generating, retevs, reqsop, preop
    splunk.Intersplunk.outputInfo(True, False, False, False, None)

(results, dummyresults, settings) = splunk.Intersplunk.getOrganizedResults()

# Field names come from the command's keyword options
field1_name = kwargs.get("field1", "field1")
field2_name = kwargs.get("field2", "field2")
output_field = kwargs.get("result", "ratio")

try:
    for result in results:
        try:
            f1 = result[field1_name]
            f2 = result[field2_name]
        except KeyError:
            # If either field is missing, simply ignore
            continue
        # ratio() is a similarity score between 0 and 1
        sm = difflib.SequenceMatcher(None, f1, f2)
        result[output_field] = sm.ratio()
    splunk.Intersplunk.outputResults(results)
except Exception as e:  # "as" form works on Python 2.6+ and Python 3
    splunk.Intersplunk.generateErrorResults("Unhandled exception: %s" % (e,))
[fieldcompare]
filename = fieldcompare.py
supports_getinfo = true

[commands/fieldcompare]
access = read : [ * ], write : [ admin ]
export = system

[scripts/fieldcompare.py]
access = read : [ * ], write : [ admin ]
export = system
In the example shown above, the search command and app are both called "fieldcompare", but you can use any name you want.
Here is a usage example:
... | fieldcompare field1=first_field field2=compare_field result=output | eval percent=round(100*output,2) | sort - percent
Be sure to look over the Custom search commands docs page for additional details about how to set this up within your Splunk environment.
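To get a feel for the numbers before wiring anything into Splunk, you can run the same difflib comparison in a plain Python shell. This is just a rough sketch using the usernames from the question; the helper name similarity is made up for illustration and is not part of the app.

```python
import difflib


def similarity(a, b):
    # ratio() returns 2*M/T, where M is the number of matched characters
    # and T is the combined length of both strings (1.0 means identical)
    return difflib.SequenceMatcher(None, a, b).ratio()


score = similarity("bsmith843", "bsmiths845")
print(round(100 * score, 2))  # 84.21
```

Here the matcher lines up "bsmith" and "84", so 8 characters match out of 19 in total, which gives 16/19, or roughly 84%.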
I used this script, but it's throwing an "Error in 'script': Getinfo probe failed for external search command 'fieldcompare'" error. Any suggestions?
Yes, this can be done using a custom search script and one of the many Python modules that can compare strings. You can take a look at http://stackoverflow.com/questions/682367/good-python-modules-for-fuzzy-string-comparison which discusses using the Levenshtein distance as a measure. With more detail about your use case, I could suggest how to structure a search and custom command, but this should be enough to start with.
I bring to you a message from the future! Nimsh wrote a Levenshtein custom command at some point .. https://splunkbase.splunk.com/app/1898/