How to extract the most popular words from the source data?

warhead
Engager

I have a source file in which I need to find the most popular English words (excluding prepositions and pronouns) and display them.

This is a sample of the text I have: Yaaah...,Goal for Arsenal. City don't deal with the corner and Koscielny smashes home..

I have more than one file like this, and I need to extract the most popular English words from text like the sample above.

Thanks for helping

yannK
Splunk Employee

Interesting use case.
Here is a search-time method to do it (it still needs to be tested on a large set of events).

source=*mybook*
| sort -_time
| rex mode=sed "s/(\.|,|;|=|\"|'|\(|\)|\[|\]| -|!|^-)/ /g"
| eval word=_raw
| makemv delim=" " word
| mvexpand word
| eval word=lower(word)
| eval position=1 | streamstats sum(position) AS position
| table position word
| stats count min(position) max(position) by word

To describe the steps: we replace all special characters with spaces, copy the raw text into a field named word, build a multivalue field using a space as the separator, split each value into its own event, convert it to lowercase, generate a counter for the position of each word in the text, and finally count the values, along with the first and last occurrence of each word.
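For anyone who wants to check the logic outside Splunk, the steps described above can be sketched in Python. This is only an illustration under assumptions, not the search itself: the function name word_stats and the result layout are made up for the example, and it numbers words within a single string rather than across events.

```python
import re

# Same alternation as the sed expression in the search above.
SPECIALS = re.compile(r"""(\.|,|;|=|"|'|\(|\)|\[|\]| -|!|^-)""")

def word_stats(text):
    """Return {word: {"count", "first", "last"}} for one block of text.

    Hypothetical helper: mirrors the rex / makemv / mvexpand / lower /
    streamstats / stats chain, but is not Splunk code.
    """
    cleaned = SPECIALS.sub(" ", text)          # special characters -> spaces
    stats = {}
    position = 0
    for word in cleaned.split():               # split on spaces, drop empties
        word = word.lower()                    # eval word=lower(word)
        position += 1                          # running position counter
        entry = stats.setdefault(
            word, {"count": 0, "first": position, "last": position}
        )
        entry["count"] += 1
        entry["last"] = position               # max(position) so far
    return stats

if __name__ == "__main__":
    sample = "The cat. The dog!"
    for word, entry in word_stats(sample).items():
        print(word, entry)
```

Note that, as in the search, an apostrophe is treated as a separator, so "don't" is counted as "don" and "t".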

jcampos8782
New Member

For anyone else who might be doing this, I was able to achieve the desired result by using the rex command to extract individual words from the Twitter post body and then piping them to a dynamic lookup table fed by a simple Python script.

The command to extract each word was:
rex field=body "(?[a-zA-Z]{2,}\s)"

Jason
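The script itself was not posted, so the following is only a guess at the general idea: pull out words of two or more letters, drop a stop list of prepositions and pronouns, and count what is left. The names STOPWORDS and popular_words are hypothetical, the stop list is deliberately tiny, and nothing here talks to Splunk's lookup interface.

```python
import re
from collections import Counter

# Words of two or more ASCII letters, as in the rex pattern above.
WORD = re.compile(r"[a-zA-Z]{2,}")

# Tiny illustrative stop list (assumption); a real one would be much longer.
STOPWORDS = {
    "the", "an", "and", "for", "by", "with", "of", "to", "in", "on", "at",
    "it", "he", "she", "we", "they", "you", "me", "him", "her", "them",
}

def popular_words(bodies, top_n=10):
    """Count non-stopword words across an iterable of post bodies."""
    counts = Counter()
    for body in bodies:
        for word in WORD.findall(body):
            word = word.lower()
            if word not in STOPWORDS:
                counts[word] += 1
    return counts.most_common(top_n)

if __name__ == "__main__":
    posts = ["Goal for Arsenal", "Arsenal goal by the corner"]
    print(popular_words(posts, top_n=2))
```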

kristian_kolb
Ultra Champion

Sorry, but I don't think that you'll be able to reliably filter out French, German, Spanish, and other non-English words automatically.

I believe, though, that you could break your text into separate events (one event per word) with the following in props.conf:

[my_tweets]
SHOULD_LINEMERGE = false
LINE_BREAKER=(\s+)
EXTRACT-tweetword = ^(?<words>.*)$

And then search like this:

sourcetype=my_tweets NOT the NOT a NOT an NOT for NOT by | top 1000 words

and then successively add a "NOT someword" for each unwanted word that turns up.
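If the exclusion list grows long, the search string can be generated rather than typed by hand. A minimal sketch, assuming the sourcetype and field names from the example above; the helper name build_search is made up for illustration:

```python
def build_search(stopwords, sourcetype="my_tweets", field="words", limit=1000):
    """Build the search string with one NOT term per unwanted word."""
    exclusions = " ".join("NOT " + word for word in stopwords)
    base = "sourcetype=" + sourcetype
    if exclusions:
        base += " " + exclusions
    return "{} | top {} {}".format(base, limit, field)

if __name__ == "__main__":
    # Add words to this list as unwanted ones turn up in the results.
    print(build_search(["the", "a", "an", "for", "by"]))
```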

Sadly, this is one of those times where there is probably a better tool than Splunk.

/Kristian

warhead
Engager

Yes, there is more than one language. It's a collection of online Twitter data, and now I need to separate out the most popular English words (excluding prepositions and pronouns).

kristian_kolb
Ultra Champion

Interesting... what does the file look like?
Is there more than one language involved?
All words on a separate line?

/k

Ayn
Legend

...yes? Where did you get with this so far? Is there a question you'd like to ask?