Solved: Re: Count unique words in a given field across all...

modalexii · 06-19-2020

I'm trying to figure out a way to sort events by how similar the wording in a free-form text field is.

Generate sample data:

| makeresults 
| eval raw="1:i like cats,2:i like turtles,3:i like turtles,4:cats are mean,5:mary had a little lamb"
| makemv delim="," raw
| rex field="raw" "(?<event_id>\d):(?<event_log>.+)"
| table event_*

Sample data output:

event_id	event_log
1	i like cats
2	i like turtles
3	i like turtles
4	cats are mean
5	mary had a little lamb

The output I'm after must yield a value that I can sort or filter on to identify the events with the most similar text. None of the specifics of the examples below are important: percent shared words is preferred, but I can work with a count of shared words and likely other outputs. The formatting of the example is not important either, e.g. an MV field would be just fine in place of the CSV field "event_ids". Myriad other considerations, such as how exactly to split on words that may contain punctuation, will be handled later.

Satisfactory output example - using percent shared words:

similarity	event_ids
100%	2, 3
66%	1, 2
66%	1, 3
33%	1, 4

I've tried a good handful of things involving splitting followed by multiple rounds of stats by but I can't quite get there. I'm familiar with the Levenshtein feature of the URL Toolbox too but I couldn't think of how to use it to compare each event with every other event.

FWIW this solution does not need to be especially performant - it will process a few hundred events at a time on a schedule, so expensive options like map and foreach black magic are acceptable.

Half-baked ideas welcome 🙂

richgalloway · 06-19-2020

The cluster command may be what you're looking for. Experiment with the value for the t option to get the desired results.

| makeresults 
| eval _raw="event_id event_log
1         i like cats
2         i like turtles
3         i like turtles
4         cats are mean
5         mary had a little lamb"
| multikv forceheader=1
| cluster field=event_log showcount=t t=0.5
| sort - cluster_count
| table event_id event_log

---
If this reply helps you, Karma would be appreciated.

modalexii · 06-22-2020

Perfect, thank you. High values of t are doing what I'd hoped for.
