Count unique words in a given field across all events

modalexii
Engager

I'm trying to figure out a way to sort events by how similar the wording in a free-form text field is.

Generate sample data:

 

| makeresults 
| eval raw="1:i like cats,2:i like turtles,3:i like turtles,4:cats are mean,5:mary had a little lamb"
| makemv delim="," raw
| rex field="raw" "(?<event_id>\d):(?<event_log>.+)"
| table event_*

 

Sample data output:

event_id  event_log
1         i like cats
2         i like turtles
3         i like turtles
4         cats are mean
5         mary had a little lamb

 

The output I'm after must yield a value that I can sort or filter on to identify the events with the most similar text. The specifics of the example below are not important: percent of shared words is preferred, but I can work with a count of shared words and likely other outputs. The formatting is not important either, e.g. an MV field would be just fine in place of the CSV field "event_ids". Myriad other considerations, such as how exactly to split on words that may contain punctuation, will be handled later.

Satisfactory output example - using percent shared words:

similarity  event_ids
100%        2, 3
66%         1, 2
66%         1, 3
33%         1, 4
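
To pin down the metric independently of SPL, here is a minimal Python sketch (run outside Splunk) that reproduces the example table from the sample data. It rests on assumptions the post does not state: similarity is the number of shared unique words divided by the larger of the two events' unique word counts, percentages are rounded down, words are split on whitespace only, and pairs with no shared words are omitted. The names `shared_word_percent` and `similar_pairs` are illustrative only.

```python
from itertools import combinations

# Sample data from the post: event_id -> event_log
events = {
    1: "i like cats",
    2: "i like turtles",
    3: "i like turtles",
    4: "cats are mean",
    5: "mary had a little lamb",
}


def shared_word_percent(a: str, b: str) -> int:
    """Percent of shared unique words, rounded down.

    Assumption: the denominator is the larger unique-word count of the two
    texts. The post does not say how texts of different lengths compare.
    """
    words_a, words_b = set(a.split()), set(b.split())
    longest = max(len(words_a), len(words_b))
    if longest == 0:
        return 0
    return 100 * len(words_a & words_b) // longest


def similar_pairs(events: dict) -> list:
    """Compare every event with every other event exactly once."""
    pairs = []
    for (id_a, log_a), (id_b, log_b) in combinations(sorted(events.items()), 2):
        pct = shared_word_percent(log_a, log_b)
        if pct > 0:  # assumption: pairs with nothing in common are dropped
            pairs.append((pct, (id_a, id_b)))
    # Most similar first, ties broken by event ids
    pairs.sort(key=lambda p: (-p[0], p[1]))
    return pairs


for pct, ids in similar_pairs(events):
    print(f"{pct}%  {ids[0]}, {ids[1]}")
```

With a few hundred events this all-pairs comparison is at most a few tens of thousands of comparisons, which is consistent with the point that performance is not a major concern here.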

 

I've tried a good handful of things involving splitting followed by multiple rounds of stats by but I can't quite get there. I'm familiar with the Levenshtein feature of the URL Toolbox too but I couldn't think of how to use it to compare each event with every other event.

FWIW this solution does not need to be especially performant - it will process a few hundred events at a time on a schedule, so expensive options like map and foreach black magic are acceptable.

Half-baked ideas welcome 🙂


richgalloway
SplunkTrust

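The cluster command may be what you're looking for. Experiment with the value of the t option to get the desired results: t is the similarity threshold (greater than 0.0 and less than 1.0, default 0.8), and the closer it is to 1.0, the more similar events must be to land in the same cluster.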

| makeresults 
| eval _raw="event_id event_log
1         i like cats
2         i like turtles
3         i like turtles
4         cats are mean
5         mary had a little lamb"
| multikv forceheader=1
| cluster field=event_log showcount=t t=0.5
| sort - cluster_count
| table event_id event_log

 

---
If this reply helps you, Karma would be appreciated.

modalexii
Engager

Perfect, thank you. High values of t are doing what I'd hoped for.
