Splunk Search

Count unique words in a given field across all events

modalexii
Engager

I'm trying to figure out a way to sort events by how similar the wording in a free-form text field is.

Generate sample data:

 

| makeresults 
| eval raw="1:i like cats,2:i like turtles,3:i like turtles,4:cats are mean,5:mary had a little lamb"
| makemv delim="," raw
| mvexpand raw
| rex field="raw" "(?<event_id>\d):(?<event_log>.+)"
| table event_*

 

Sample data output:

event_id  event_log
1         i like cats
2         i like turtles
3         i like turtles
4         cats are mean
5         mary had a little lamb

 

The output I'm after must yield a value that I can sort or filter on to identify the events with the most similar text. None of the specifics of the example below are important: percent shared words is preferred, but I can work with a count of shared words and likely other outputs. The formatting of the example is not important either, e.g. an MV field would be just fine in place of the CSV field "event_ids". Other considerations, such as how exactly to split on words that may contain punctuation, will be handled later.

Satisfactory output example - using percent shared words:

similarity  event_ids
100%        2, 3
66%         1, 2
66%         1, 3
33%         1, 4
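
To make the target metric concrete, here is a minimal sketch in Python (outside Splunk, purely as an illustration) of one way to arrive at the percentages in the example above. It assumes that "percent shared words" means the number of distinct words two events have in common, divided by the distinct-word count of the longer event and rounded down; that definition is my assumption, and the names `shared_word_percent` and `rank_pairs` are hypothetical.

```python
from itertools import combinations

# Sample events from above: event_id -> event_log
events = {
    1: "i like cats",
    2: "i like turtles",
    3: "i like turtles",
    4: "cats are mean",
    5: "mary had a little lamb",
}


def shared_word_percent(a: str, b: str) -> int:
    """Percent of distinct words shared by two strings.

    Assumed metric: shared distinct words divided by the distinct-word
    count of the longer string, rounded down to a whole percent.
    """
    words_a, words_b = set(a.split()), set(b.split())
    longest = max(len(words_a), len(words_b))
    if longest == 0:
        return 0
    return 100 * len(words_a & words_b) // longest


def rank_pairs(logs: dict) -> list:
    """Score every pair of events and return (percent, (id_a, id_b)) tuples.

    Pairs with no shared words are dropped; the rest are sorted by
    descending similarity, then by event ids.
    """
    scored = [
        (shared_word_percent(logs[x], logs[y]), (x, y))
        for x, y in combinations(sorted(logs), 2)
    ]
    return sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))


for similarity, ids in rank_pairs(events):
    print(f"{similarity}%  {ids[0]}, {ids[1]}")
```

Comparing every event with every other event is quadratic in the number of events, which should be tolerable for a few hundred events per run.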

 

I've tried a good handful of things involving splitting followed by multiple rounds of stats by but I can't quite get there. I'm familiar with the Levenshtein feature of the URL Toolbox too but I couldn't think of how to use it to compare each event with every other event.

FWIW this solution does not need to be especially performant - it will process a few hundred events at a time on a schedule, so expensive options like map and foreach black magic are acceptable.

Half-baked ideas welcome 🙂

1 Solution

richgalloway
SplunkTrust

The cluster command may be what you're looking for.  Experiment with the value for the t option to get the desired results.

| makeresults 
| eval _raw="event_id event_log
1         i like cats
2         i like turtles
3         i like turtles
4         cats are mean
5         mary had a little lamb"
| multikv forceheader=1
| cluster field=event_log showcount=t t=0.5
| sort - cluster_count
| table event_id event_log

 

---
If this reply helps you, Karma would be appreciated.

modalexii
Engager

Perfect, thank you. High values of t are doing what I'd hoped for.
