Splunk Search

Count unique words in a given field across all events

modalexii
Engager

I'm trying to figure out a way to sort events by how similar the wording in a free-form text field is.

Generate sample data:

 

| makeresults 
| eval raw="1:i like cats,2:i like turtles,3:i like turtles,4:cats are mean,5:mary had a little lamb"
| makemv delim="," raw
| rex field="raw" "(?<event_id>\d):(?<event_log>.+)"
| table event_*

 

Sample data output:

event_id  event_log
1         i like cats
2         i like turtles
3         i like turtles
4         cats are mean
5         mary had a little lamb

 

The output I'm after must yield a value that I can sort or filter on to identify the events with the most similar text. None of the specifics of the examples below are important: percent shared words is preferred, but I can work with a count of shared words and likely other outputs. The formatting of the example is not important either, e.g. an MV field would be just fine in place of the CSV field "event_ids". Myriad other considerations, like how exactly to split on words that may contain punctuation, will be handled later.

Satisfactory output example - using percent shared words:

similarity  event_ids
100%        2, 3
66%         1, 2
66%         1, 3
33%         1, 4
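
For reference, here is a minimal sketch of the metric the table above seems to use. It is plain Python rather than SPL and is only meant to pin down the arithmetic, not to suggest how Splunk should compute it. The denominator is an assumption: the sample events being compared all have three words, so "shared words divided by the larger unique-word count" reproduces the table, but other definitions (Jaccard, for instance) would give different numbers. The names `shared_word_percent` and `rank_pairs` are made up for this illustration.

```python
from itertools import combinations

# Sample events from the question, keyed by event_id.
events = {
    1: "i like cats",
    2: "i like turtles",
    3: "i like turtles",
    4: "cats are mean",
    5: "mary had a little lamb",
}


def shared_word_percent(a, b):
    """Percent of unique words shared by two strings.

    Assumption: the denominator is the larger of the two unique-word
    counts. Splitting is on whitespace only; punctuation is ignored here.
    """
    words_a, words_b = set(a.split()), set(b.split())
    if not words_a or not words_b:
        return 0
    # Integer division truncates, so 2 of 3 shared words gives 66.
    return 100 * len(words_a & words_b) // max(len(words_a), len(words_b))


def rank_pairs(events):
    """Score every pair of events and return the non-zero pairs, most similar first."""
    scored = [
        (shared_word_percent(events[x], events[y]), (x, y))
        for x, y in combinations(sorted(events), 2)
    ]
    return sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))


for percent, ids in rank_pairs(events):
    print(f"{percent}%  {ids[0]}, {ids[1]}")
```

Every event is compared with every other event, which is quadratic in the number of events. That should be tolerable for a few hundred events at a time.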

 

I've tried a good handful of things involving splitting followed by multiple rounds of stats by but I can't quite get there. I'm familiar with the Levenshtein feature of the URL Toolbox too but I couldn't think of how to use it to compare each event with every other event.

FWIW this solution does not need to be especially performant - it will process a few hundred events at a time on a schedule, so expensive options like map and foreach black magic are acceptable.

Half-baked ideas welcome 🙂

1 Solution

richgalloway
SplunkTrust

The cluster command may be what you're looking for. Experiment with the value of the t option (the similarity threshold) to get the desired results.

| makeresults 
| eval _raw="event_id event_log
1         i like cats
2         i like turtles
3         i like turtles
4         cats are mean
5         mary had a little lamb"
| multikv forceheader=1
| cluster field=event_log showcount=t t=0.5
| sort - cluster_count
| table event_id event_log

 

---
If this reply helps you, Karma would be appreciated.


modalexii
Engager

Perfect, thank you. High values of t are doing what I'd hoped for.
