Splunk Search

How to extract the most popular words from source data?

warhead
Engager

I have a source file in which I need to find the most popular English words (excluding prepositions and pronouns) and display them.

This is a sample of the text I have: Yaaah...,Goal for Arsenal. City don't deal with the corner and Koscielny smashes home..

I have more than one file like this, and I need to extract the most popular English words from the text, as shown in the sample above.

Thanks for helping


yannK
Splunk Employee

Interesting use case.
Here is a search-time method to do it (to be tested on a large set of events):

source=*mybook*
| sort -_time
| rex mode=sed "s/(\.|,|;|=|\"|'|\(|\)|\[|\]| -|!|^-)/ /g"
| eval word=_raw
| makemv delim=" " word
| mvexpand word
| eval word=lower(word)
| eval position=1 | streamstats sum(position) AS position
| table position word
| stats count min(position) max(position) by word

To describe the steps: we use a field named word. We replace all special characters with spaces, generate a multivalue field using space as the separator, split each value into a new event, convert to lowercase, generate a counter for the position of each word in the text, and finally count the values, along with the first and last occurrence of each word.
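The same pipeline can be sketched outside Splunk to check the logic. This is a minimal Python equivalent (not from the thread) of the steps above: strip the same punctuation, split on spaces, lowercase, then track count plus first and last position per word.

```python
import re

def word_stats(text):
    """Mimic the SPL pipeline above: strip punctuation, split on
    whitespace, lowercase, then record [count, first_pos, last_pos]
    for each word (positions start at 1, like the streamstats counter)."""
    # Replace the same special characters as the rex mode=sed step.
    cleaned = re.sub(r'[.,;="\'()\[\]!]|\s-|^-', ' ', text)
    stats = {}
    for position, word in enumerate(cleaned.lower().split(), start=1):
        if word in stats:
            stats[word][0] += 1
            stats[word][2] = position
        else:
            stats[word] = [1, position, position]
    return stats
```

Running it on the sample tweet gives one entry per lowercased word, e.g. `word_stats(...)["goal"]` is `[1, 2, 2]`: one occurrence, at position 2.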

jcampos8782
New Member

For anyone else who might be doing this, I was able to achieve the desired result by using a combination of the rex command to extract individual words from the twitter post body and then piping it to a dynamic lookup table fed by a simple python script.

The command to extract each word was:
rex field=body "(?<word>[a-zA-Z]{2,}\s)"
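The "simple python script" behind the dynamic lookup isn't shown in the post. As a hedged sketch only: a Splunk external lookup script reads CSV rows on stdin and writes CSV back on stdout, so a stopword filter might look like the following. The field names `word` and `is_stopword`, and the stopword list itself, are assumptions, not from the original post.

```python
#!/usr/bin/env python
# Hypothetical sketch of a dynamic-lookup stopword filter.
# Splunk external lookups pass rows as CSV on stdin and expect
# CSV with the filled-in output field on stdout.
import csv
import sys

# Assumed stopword list; extend with real prepositions/pronouns.
STOPWORDS = {"the", "a", "an", "for", "by", "and", "of", "to", "in"}

def filter_words(rows):
    """Mark each incoming row with is_stopword = "1" or "0"."""
    for row in rows:
        word = row.get("word", "").lower()
        row["is_stopword"] = "1" if word in STOPWORDS else "0"
        yield row

if __name__ == "__main__":
    reader = csv.DictReader(sys.stdin)
    writer = csv.DictWriter(sys.stdout, fieldnames=["word", "is_stopword"])
    writer.writeheader()
    writer.writerows(filter_words(reader))
```

The search would then keep only events where `is_stopword=0` before running `top word`.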

Jason


kristian_kolb
Ultra Champion

Sorry, but I don't think you'll be able to reliably filter out French/German/Spanish/etc. automatically.

I believe, though, that you could possibly break your text into separate events (one event per word) with the following in props.conf:

[my_tweets]
SHOULD_LINEMERGE = false
LINE_BREAKER=(\s+)
EXTRACT-tweetword = ^(?<words>.*)$

And then search like:

sourcetype=my_tweets NOT the NOT a NOT an NOT for NOT by | top 1000 words

and then successively add a "NOT someword" for each unwanted word that turns up.
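If the exclusion list gets long, a small helper (not from the thread; the names are illustrative) can assemble the growing search string instead of typing each NOT clause by hand:

```python
# Build kristian_kolb's search string from a list of excluded words.
STOPWORDS = ["the", "a", "an", "for", "by"]

def build_search(sourcetype, stopwords, limit=1000):
    """Assemble 'sourcetype=... NOT w1 NOT w2 ... | top N words'."""
    exclusions = " ".join("NOT " + w for w in stopwords)
    return "sourcetype=%s %s | top %d words" % (sourcetype, exclusions, limit)

print(build_search("my_tweets", STOPWORDS))
```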

Sadly, this is one of those times where there is probably a better tool than Splunk.

/Kristian


warhead
Engager

Yes, there is more than one language. It's a collection of Twitter data, and now I need to extract the most popular English words (excluding prepositions and pronouns).


kristian_kolb
Ultra Champion

Interesting... what does the file look like?
Is there more than one language involved?
All words on a separate line?

/k

Ayn
Legend

...yes? Where did you get with this so far? Is there a question you'd like to ask?
