For example, if I have a username of bsmith843 in a field returned by one search, and bsmiths845 in a field from another search, is there any way to gauge the similarity between the two strings? I know I can use wildcards/regex to try to match the strings, but if I can't match every one, I would like to know how similar they are.
And from even further in the future...
There is an app on Splunkbase that supports Levenshtein distance, Damerau-Levenshtein distance, Jaro distance, Jaro-Winkler, match rating comparison, and Hamming distance comparisons, plus a number of phonetic algorithms, including Soundex. It is called JellyFisher. Here is a sample Levenshtein distance evaluation using this app:
... | jellyfisher levenshtein_distance(sourcetype,source)
What would be returned here is an integer, according to this description of Levenshtein distance.
Each of the JellyFisher functions returns its result in a field named after the function (e.g., levenshtein_distance, damerau_levenshtein_distance, soundex).
Here is a link to the JellyFisher app.
Here is a mocked-up use of it:
| makeresults | eval foo="kitten", bar="smitten" | jellyfisher levenshtein_distance(foo, bar) | table foo bar levenshtein_distance
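To make the integer result concrete, here is a minimal plain-Python sketch of Levenshtein distance. It is not JellyFisher's implementation, and the function name `levenshtein` is just an illustrative choice; it only shows what the number counts (single-character insertions, deletions, and substitutions).

```python
def levenshtein(a, b):
    """Return the Levenshtein (edit) distance between strings a and b."""
    # previous[j] is the distance between the prefix of a processed so far
    # (initially empty) and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca -> cb (or keep)
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "smitten"))        # 2
print(levenshtein("bsmith843", "bsmiths845"))  # 2
```

Both pairs come out at 2: one substitution plus one insertion in each case.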
There is a Python function that does something very close to this. It returns a number between 0 and 1 based on the similarity of two terms. You can find it in the difflib module of the standard library (difflib.SequenceMatcher).
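As a quick illustration outside Splunk, and assuming the function meant here is `difflib.SequenceMatcher.ratio()` (the one the command in this answer uses), this is how the two usernames from the question compare. The helper name `similarity` is made up for the example.

```python
import difflib


def similarity(a, b):
    """Return a similarity score between 0.0 and 1.0 for two strings."""
    # ratio() is 2 * (matched characters) / (total characters in both strings)
    return difflib.SequenceMatcher(None, a, b).ratio()


score = similarity("bsmith843", "bsmiths845")
print(round(score, 2))
print(similarity("bsmith843", "bsmith843"))  # identical strings score 1.0
```

For the two usernames, the matching blocks are "bsmith" and "84" (8 characters out of 19 in total), so the score is 16/19, or roughly 0.84.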
Here is a really quick example of an app named "fieldcompare" which contains a single Python search command. The app is made up of the following files:
bin/fieldcompare.py:
import sys
import difflib
import splunk.Intersplunk

# Answer Splunk's getinfo probe, which asks how this command behaves.
(isgetinfo, sys.argv) = splunk.Intersplunk.isGetInfo(sys.argv)
args, kwargs = splunk.Intersplunk.getKeywordsAndOptions()
if isgetinfo:
    # streaming, generating, retevs, reqsop, preop
    splunk.Intersplunk.outputInfo(True, False, False, False, None)

(results, dummyresults, settings) = splunk.Intersplunk.getOrganizedResults()

# Field names come from the command's options, with defaults.
field1_name = kwargs.get("field1", "field1")
field2_name = kwargs.get("field2", "field2")
output_field = kwargs.get("result", "ratio")

try:
    for result in results:
        try:
            f1 = result[field1_name]
            f2 = result[field2_name]
        except KeyError:
            # If either field is missing, simply ignore
            continue
        # ratio() is a similarity score between 0 and 1
        sm = difflib.SequenceMatcher(None, f1, f2)
        result[output_field] = sm.ratio()
    splunk.Intersplunk.outputResults(results)
except Exception as e:
    splunk.Intersplunk.generateErrorResults("Unhandled exception: %s" % (e,))
default/commands.conf:
[fieldcompare]
filename = fieldcompare.py
supports_getinfo = true
metadata/default.meta:
[commands/fieldcompare]
access = read : [ * ], write : [ admin ]
export = system

[scripts/fieldcompare.py]
access = read : [ * ], write : [ admin ]
export = system
In the example shown above, the search command and app are both called "fieldcompare", but you can use any name you want.
Here is a usage example:
... | fieldcompare field1=first_field field2=compare_field result=output | eval percent=round(100*output,2) | sort - percent
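To check the comparison logic without a Splunk install, the same steps can be run on plain dictionaries. This is only a sketch: `add_similarity` and the sample rows are hypothetical stand-ins for the command and for search results, and unlike Splunk's sort it simply leaves out rows that have no score.

```python
import difflib


def add_similarity(results, field1, field2, output="ratio"):
    """Add a 0..1 similarity score to each row that has both fields."""
    for row in results:
        if field1 not in row or field2 not in row:
            continue  # skip rows missing either field, as the command does
        matcher = difflib.SequenceMatcher(None, row[field1], row[field2])
        row[output] = matcher.ratio()
    return results


# Hypothetical rows standing in for Splunk search results.
rows = [
    {"first_field": "bsmith843", "compare_field": "jdoe17"},
    {"first_field": "bsmith843", "compare_field": "bsmiths845"},
    {"first_field": "bsmith843"},  # no compare_field, so no score
]

scored = add_similarity(rows, "first_field", "compare_field", output="output")

# Equivalent of: eval percent=round(100*output,2) | sort - percent
for row in scored:
    if "output" in row:
        row["percent"] = round(100 * row["output"], 2)
ranked = sorted(
    (row for row in scored if "percent" in row),
    key=lambda row: row["percent"],
    reverse=True,
)

for row in ranked:
    print(row["compare_field"], row["percent"])
```

The closest match sorts to the top, which is what the search above does with the percent field.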
Be sure to look over the Custom search commands docs page for additional details on setting this up within your Splunk environment.
Yes, this can be done using a custom search script and one of the many Python modules that can compare strings. You can take a look at http://stackoverflow.com/questions/682367/good-python-modules-for-fuzzy-string-comparison which discusses using the Levenshtein distance as a measure. With more detail about your use case, I could suggest how to structure a search and custom command, but this should be enough to start with.