I'm trying to figure out a way to sort events by how similar the wording in a free-form text field is.
Generate sample data:
| makeresults
| eval raw="1:i like cats,2:i like turtles,3:i like turtles,4:cats are mean,5:mary had a little lamb"
| makemv delim="," raw
| mvexpand raw
| rex field="raw" "(?<event_id>\d):(?<event_log>.+)"
| table event_*
Sample data output:
event_id | event_log |
1 | i like cats |
2 | i like turtles |
3 | i like turtles |
4 | cats are mean |
5 | mary had a little lamb |
The output I'm after must yield a value that I can sort or filter on to identify the events with the most similar text. None of the specifics of the example below are important: percent shared words is preferred, but I can work with a count of shared words and likely other outputs. The formatting is not important either, e.g. an MV field would be just fine in place of the CSV field "event_ids". Myriad other considerations, such as how exactly to split on words that may contain punctuation, will be handled later.
Satisfactory output example - using percent shared words:
similarity | event_ids |
100% | 2, 3 |
66% | 1, 2 |
66% | 1, 3 |
33% | 1, 4 |
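To pin down what "percent shared words" means in the example above, here is a rough sketch of the metric outside Splunk, in Python rather than SPL. It assumes words are split on whitespace and that the percentage is the number of shared distinct words divided by the larger of the two events' distinct word counts, rounded down. That denominator is my guess; it happens to reproduce the numbers in the table. The function names are made up for illustration.

```python
from itertools import combinations

# Same sample events as in the question, keyed by event_id.
events = {
    1: "i like cats",
    2: "i like turtles",
    3: "i like turtles",
    4: "cats are mean",
    5: "mary had a little lamb",
}

def shared_word_percent(a, b):
    """Percent of distinct words two strings share (assumed definition)."""
    words_a, words_b = set(a.split()), set(b.split())
    if not words_a or not words_b:
        return 0
    shared = len(words_a & words_b)
    # Assumption: divide by the larger distinct-word count, round down.
    return 100 * shared // max(len(words_a), len(words_b))

def similar_pairs(events):
    """Compare every event with every other event; drop pairs sharing nothing."""
    pairs = []
    for (id_a, log_a), (id_b, log_b) in combinations(sorted(events.items()), 2):
        pct = shared_word_percent(log_a, log_b)
        if pct > 0:
            pairs.append((pct, id_a, id_b))
    # Most similar first, then by event id for a stable order.
    pairs.sort(key=lambda p: (-p[0], p[1], p[2]))
    return pairs

for pct, id_a, id_b in similar_pairs(events):
    print(f"{pct}% | {id_a}, {id_b}")
```

Every pair is compared once, so a few hundred events means tens of thousands of comparisons, which is fine for a scheduled job.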
I've tried a good handful of things involving splitting followed by multiple rounds of stats by but I can't quite get there. I'm familiar with the Levenshtein feature of the URL Toolbox too but I couldn't think of how to use it to compare each event with every other event.
FWIW this solution does not need to be especially performant - it will process a few hundred events at a time on a schedule, so expensive options like map and foreach black magic are acceptable.
Half-baked ideas welcome 🙂
The cluster command may be what you're looking for. Experiment with the value for the t option to get the desired results.
| makeresults
| eval _raw="event_id event_log
1 i like cats
2 i like turtles
3 i like turtles
4 cats are mean
5 mary had a little lamb"
| multikv forceheader=1
| cluster field=event_log showcount=t t=0.5
| sort - cluster_count
| table event_id event_log
Perfect, thank you. High values of t are doing what I'd hoped for.