Splunk Search

How to extract the most popular words from the source data?


I have a source file in which I need to find the most popular English words (excluding prepositions and pronouns) and display them.

This is a sample of the text I have: Yaaah...,Goal for Arsenal. City don't deal with the corner and Koscielny smashes home..

I have more than one file like this, and I need to extract the most popular English words from the text, as in the sample above.

Thanks for helping

Splunk Employee

Interesting use case.
Here is a search-time method to do it (still to be tested on a large set of events).

| sort -_time
| rex mode=sed "s/(\.|,|;|=|\"|'|\(|\)|\[|\]| -|!|^-)/ /g"
| eval word=_raw
| makemv delim=" " word
| mvexpand word
| eval word=lower(word)
| eval position=1 | streamstats sum(position) AS position
| table position word
| stats count min(position) max(position) by word

To describe the steps: we sort the events by time (newest first), replace all special characters in the raw event with spaces, copy the result into a field named word, turn it into a multivalue field using the space as a separator, split each value into its own event, and convert it to lowercase. We then generate a counter for the position of each word in the text, and finally count the occurrences of each word, along with the positions of its first and last occurrence.
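
The search above counts every word, including the prepositions and pronouns that the question wants excluded. Below is a minimal sketch of one way to drop them and rank what is left. It assumes a base search of sourcetype=my_tweets and a short hand-written stopword list; both are placeholders for illustration, and the list is far from complete.

```
sourcetype=my_tweets
| rex mode=sed "s/[^A-Za-z' ]+/ /g"
| eval word=split(lower(_raw), " ")
| mvexpand word
| where len(word) > 1
| search NOT word IN ("the", "an", "for", "by", "with", "and", "he", "she", "it", "they", "we", "you")
| stats count by word
| sort - count
| head 20
```

The sed expression keeps only letters, apostrophes, and spaces, so a word such as "don't" stays in one piece. The where clause discards the empty strings that split produces between consecutive spaces, along with single-letter tokens. As with the original search, mvexpand can be expensive, so this is worth trying on a small time range first.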

New Member

For anyone else who might be doing this, I was able to achieve the desired result by using the rex command to extract individual words from the Twitter post body and then piping them to a dynamic lookup table fed by a simple Python script.

The command to extract each word was:
rex field=body "(?[a-zA-Z]{2,}\s)"
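
As written, the pattern has no name in its capture group, which rex requires in order to create a field. Below is a sketch of what a working version could look like. The field name word is an assumption (the name used in the original search is not shown), as is the base search sourcetype=my_tweets. It also adds max_match=0 so that rex captures every word in body rather than only the first match, and moves the trailing \s out of the pattern so that the last word of a post is not missed.

```
sourcetype=my_tweets
| rex field=body max_match=0 "(?<word>[a-zA-Z]{2,})"
| mvexpand word
| eval word=lower(word)
| stats count by word
| sort - count
```

With max_match=0, word becomes a multivalue field, which is why mvexpand is needed before counting.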



Ultra Champion

Sorry, but I don't think that you'll be able to reliably filter out French, German, Spanish, etc. automatically.

I believe though that you could possibly break your text into separate events (one event per word), with use of

in props.conf

EXTRACT-tweetword = ^(?<words>.*)$

And then search like this:

sourcetype=my_tweets NOT the NOT a NOT an NOT for NOT by | top 1000 words

and then successively add a "NOT someword" for each unwanted word that turns up.
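
Once the exclusion list grows, it may be easier to keep it in a lookup file than in the search string. Below is a rough sketch of that idea. It assumes a hypothetical CSV lookup named stopwords.csv with a single column called word, uploaded to Splunk as a lookup table file, and it extracts words at search time with rex rather than relying on one event per word.

```
sourcetype=my_tweets
| rex field=_raw max_match=0 "(?<word>[A-Za-z']{2,})"
| mvexpand word
| eval word=lower(word)
| search NOT [| inputlookup stopwords.csv | fields word]
| top limit=20 word
```

The subsearch expands to a series of word="..." conditions joined by OR, so adding an unwanted word means adding a row to the lookup file instead of editing the search. The values in stopwords.csv should be lowercase to match the output of lower().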

Sadly, this is one of those times where there is probably a better tool than Splunk.




Yes, there is more than one language. It's a collection of online Twitter data, and I now need to pick out the most popular English words (excluding prepositions and pronouns).


Ultra Champion

Interesting... what does the file look like?
Is there more than one language involved?
All words on a separate line?



...yes? Where did you get with this so far? Is there a question you'd like to ask?
