Splunk Search

How to extract the most popular words from source data?

warhead
Engager

I have a source file in which I need to find the most popular English words (excluding prepositions and pronouns) and display them.

This is a sample of the text I have: Yaaah...,Goal for Arsenal. City don't deal with the corner and Koscielny smashes home..

I have more than one file like this, and I need to extract the most popular English words from the text, as shown in the sample above.

Thanks for helping


yannK
Splunk Employee

Interesting use case.
Here is a search-time method to do it (to be tested on a large set of events):

source=*mybook*
| sort -_time
| rex mode=sed "s/(\.|,|;|=|\"|'|\(|\)|\[|\]| -|!|^-)/ /g"
| eval word=_raw
| makemv delim=" " word
| mvexpand word
| eval word=lower(word)
| eval position=1 | streamstats sum(position) AS position
| table position word
| stats count min(position) max(position) by word

To describe the steps: we use a field named word. We replace all special characters with spaces, generate a multivalue field using space as the separator, split each value into a new event, convert to lowercase, generate a counter for the position of each word in the text, and finally count the values, along with the first and last occurrence of each word.
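The same pipeline can be sketched outside Splunk to check the logic. This is a minimal Python equivalent (not from the thread) of the steps above: strip the same punctuation, split on spaces, lowercase, then track count plus first and last position per word.

```python
import re

def word_stats(text):
    """Mimic the SPL pipeline above: strip punctuation, split on
    whitespace, lowercase, then record [count, first_pos, last_pos]
    for each word (positions start at 1, like the streamstats counter)."""
    # Replace the same special characters as the rex mode=sed step.
    cleaned = re.sub(r'[.,;="\'()\[\]!]|\s-|^-', ' ', text)
    stats = {}
    for position, word in enumerate(cleaned.lower().split(), start=1):
        if word in stats:
            stats[word][0] += 1
            stats[word][2] = position
        else:
            stats[word] = [1, position, position]
    return stats
```

Running it on the sample tweet gives one entry per lowercased word, e.g. `word_stats(...)["goal"]` is `[1, 2, 2]`: one occurrence, at position 2.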

jcampos8782
New Member

For anyone else who might be doing this, I was able to achieve the desired result by using a combination of the rex command to extract individual words from the twitter post body and then piping it to a dynamic lookup table fed by a simple python script.

The command to extract each word was:
rex field=body "(?<word>[a-zA-Z]{2,}\s)"
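The "simple python script" behind the dynamic lookup isn't shown in the post. As a hedged sketch only: a Splunk external lookup script reads CSV rows on stdin and writes CSV back on stdout, so a stopword filter might look like the following. The field names `word` and `is_stopword`, and the stopword list itself, are assumptions, not from the original post.

```python
#!/usr/bin/env python
# Hypothetical sketch of a dynamic-lookup stopword filter.
# Splunk external lookups pass rows as CSV on stdin and expect
# CSV with the filled-in output field on stdout.
import csv
import sys

# Assumed stopword list; extend with real prepositions/pronouns.
STOPWORDS = {"the", "a", "an", "for", "by", "and", "of", "to", "in"}

def filter_words(rows):
    """Mark each incoming row with is_stopword = "1" or "0"."""
    for row in rows:
        word = row.get("word", "").lower()
        row["is_stopword"] = "1" if word in STOPWORDS else "0"
        yield row

if __name__ == "__main__":
    reader = csv.DictReader(sys.stdin)
    writer = csv.DictWriter(sys.stdout, fieldnames=["word", "is_stopword"])
    writer.writeheader()
    writer.writerows(filter_words(reader))
```

The search would then keep only events where `is_stopword=0` before running `top word`.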

Jason


kristian_kolb
Ultra Champion

Sorry, but I don't think you'll be able to reliably filter out French/German/Spanish/etc. automatically.

I believe, though, that you could possibly break your text into separate events (one event per word) with the following in props.conf:

[my_tweets]
SHOULD_LINEMERGE = false
LINE_BREAKER=(\s+)
EXTRACT-tweetword = ^(?<words>.*)$

And then search like:

sourcetype=my_tweets NOT the NOT a NOT an NOT for NOT by | top 1000 words

and then successively add a "NOT someword" for each unwanted word that turns up.
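If the exclusion list gets long, a small helper (not from the thread; the names are illustrative) can assemble the growing search string instead of typing each NOT clause by hand:

```python
# Build kristian_kolb's search string from a list of excluded words.
STOPWORDS = ["the", "a", "an", "for", "by"]

def build_search(sourcetype, stopwords, limit=1000):
    """Assemble 'sourcetype=... NOT w1 NOT w2 ... | top N words'."""
    exclusions = " ".join("NOT " + w for w in stopwords)
    return "sourcetype=%s %s | top %d words" % (sourcetype, exclusions, limit)

print(build_search("my_tweets", STOPWORDS))
```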

Sadly, this is one of those times where there is probably a better tool than Splunk.

/Kristian


warhead
Engager

Yes, there is more than one language. It's a collection of Twitter data, and now I need to extract the most popular English words (excluding prepositions and pronouns).


kristian_kolb
Ultra Champion

Interesting... what does the file look like?
Is there more than one language involved?
All words on a separate line?

/k

Ayn
Legend

...yes? Where did you get with this so far? Is there a question you'd like to ask?
