I'm trying to figure out a way to sort events by how similar the wording in a free-form text field is.

Generate sample data:

| makeresults
| eval raw="1:i like cats,2:i like turtles,3:i like turtles,4:cats are mean,5:mary had a little lamb"
| makemv delim="," raw
| rex field="raw" "(?<event_id>\d):(?<event_log>.+)"
| table event_*

Sample data output:

event_id  event_log
1         i like cats
2         i like turtles
3         i like turtles
4         cats are mean
5         mary had a little lamb

The output I'm after must yield a value that I can sort or filter on to identify the events with the most similar text. None of the specifics of the examples below are important - percent shared words is preferred, but I can work with count of shared words and likely other outputs. The formatting of the example is not important, e.g. an MV field would be just fine in place of the CSV field "event_ids". Myriad other considerations, like how exactly to split on words that may contain punctuation, etc., will be handled later.

Satisfactory output example, using percent shared words:

similarity  event_ids
100%        2, 3
66%         1, 2
66%         1, 3
33%         1, 4

I've tried a good handful of things involving splitting followed by multiple rounds of stats by, but I can't quite get there. I'm familiar with the Levenshtein feature of the URL Toolbox too, but I couldn't think of how to use it to compare each event with every other event.

FWIW this solution does not need to be especially performant - it will process a few hundred events at a time on a schedule, so expensive options like map and foreach black magic are acceptable. Half-baked ideas welcome.
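To make the target metric concrete, here is a minimal Python sketch (not SPL, and not a proposed solution inside Splunk) of the pairwise comparison. It rests on assumptions the post leaves open: words are split on whitespace only, each word counts once per event, the percentage is shared words divided by the larger of the two events' word counts and rounded down, and pairs with no shared words are dropped. The names `shared_word_percent` and `rank_pairs` are made up for illustration.

```python
from itertools import combinations

# Sample events from the post, keyed by event_id.
events = {
    "1": "i like cats",
    "2": "i like turtles",
    "3": "i like turtles",
    "4": "cats are mean",
    "5": "mary had a little lamb",
}


def shared_word_percent(text_a, text_b):
    """Percent of shared unique words, rounded down.

    Assumption: the denominator is the larger unique-word count of the
    two events; the post does not say how unequal lengths are handled.
    """
    words_a = set(text_a.split())
    words_b = set(text_b.split())
    longest = max(len(words_a), len(words_b))
    if longest == 0:
        return 0
    return 100 * len(words_a & words_b) // longest


def rank_pairs(events):
    """Compare every event with every other event exactly once."""
    rows = []
    for (id_a, text_a), (id_b, text_b) in combinations(sorted(events.items()), 2):
        pct = shared_word_percent(text_a, text_b)
        if pct > 0:  # assumption: pairs with nothing in common are omitted
            rows.append((pct, id_a, id_b))
    # Most similar pairs first, then by event id for a stable order.
    rows.sort(key=lambda row: (-row[0], row[1], row[2]))
    return rows


for pct, id_a, id_b in rank_pairs(events):
    print(f"{pct}%  {id_a}, {id_b}")
```

Under those assumptions the sample data produces the four rows in the example table. An SPL version would need the same two ingredients: a way to pair each event with every other event, and a per-pair count of common words.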