Solved: Re: Count unique words in a given field across all...

modalexii · ‎06-19-2020

I'm trying to figure out a way to sort events by how similar the wording in a free-form text field is.

Generate sample data:

| makeresults 
| eval raw="1:i like cats,2:i like turtles,3:i like turtles,4:cats are mean,5:mary had a little lamb"
| makemv delim="," raw
| mvexpand raw
| rex field="raw" "(?<event_id>\d):(?<event_log>.+)"
| table event_*

Sample data output:

event_id	event_log
1	i like cats
2	i like turtles
3	i like turtles
4	cats are mean
5	mary had a little lamb

The output I'm after must yield a value that I can sort or filter on to identify the events with the most similar text. The specifics of the example below are not important: percent shared words is preferred, but I can work with a count of shared words and likely other outputs. The formatting is not important either, e.g. an MV field would be just fine in place of the CSV field "event_ids". Myriad other considerations, such as exactly how to split words that may contain punctuation, will be handled later.

Satisfactory output example - using percent shared words:

similarity	event_ids
100%	2, 3
66%	1, 2
66%	1, 3
33%	1, 4
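
For reference, here is a minimal Python sketch (run outside Splunk, so not an SPL solution) of one way the percentages above could be computed. It makes some assumptions the post does not state: words are split on whitespace, the denominator is the larger of the two events' unique word counts, percentages are truncated to whole numbers, and pairs with no shared words are dropped. The names `shared_word_percent` and `similar_pairs` are made up for illustration.

```python
from itertools import combinations

# Sample events from the post: event_id -> event_log
EVENTS = {
    1: "i like cats",
    2: "i like turtles",
    3: "i like turtles",
    4: "cats are mean",
    5: "mary had a little lamb",
}


def shared_word_percent(log_a, log_b):
    """Percent of unique words shared by two strings (truncated).

    Assumption: the denominator is the larger unique-word count of the two.
    """
    words_a, words_b = set(log_a.split()), set(log_b.split())
    if not words_a or not words_b:
        return 0
    shared = len(words_a & words_b)
    return 100 * shared // max(len(words_a), len(words_b))


def similar_pairs(events):
    """Compare every event with every other event exactly once."""
    rows = []
    for (id_a, log_a), (id_b, log_b) in combinations(sorted(events.items()), 2):
        pct = shared_word_percent(log_a, log_b)
        if pct > 0:  # drop pairs with no words in common
            rows.append((pct, id_a, id_b))
    # Most similar pairs first, then by event id
    rows.sort(key=lambda row: (-row[0], row[1], row[2]))
    return rows


if __name__ == "__main__":
    print("similarity\tevent_ids")
    for pct, id_a, id_b in similar_pairs(EVENTS):
        print(f"{pct}%\t{id_a}, {id_b}")
```

With the sample data this yields the four pairs in the table above. A different denominator (for example the size of the union of both word sets) would give different percentages, so the choice here is only one plausible reading of "percent shared words".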

I've tried a good handful of things involving splitting followed by multiple rounds of stats by but I can't quite get there. I'm familiar with the Levenshtein feature of the URL Toolbox too but I couldn't think of how to use it to compare each event with every other event.

FWIW this solution does not need to be especially performant - it will process a few hundred events at a time on a schedule, so expensive options like map and foreach black magic are acceptable.

Half-baked ideas welcome 🙂

richgalloway · ‎06-19-2020

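The cluster command may be what you're looking for. Experiment with the value of the t (threshold) option to get the desired results; it takes a value between 0 and 1, and higher values require events to be more similar before they are grouped together.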

| makeresults 
| eval _raw="event_id event_log
1         i like cats
2         i like turtles
3         i like turtles
4         cats are mean
5         mary had a little lamb"
| multikv forceheader=1
| cluster field=event_log showcount=t t=0.5
| sort - cluster_count
| table event_id event_log

---
If this reply helps you, Karma would be appreciated.

modalexii · ‎06-22-2020

Perfect, thank you. High values of t are doing what I'd hoped for.
