I have a source file in which I need to find the most popular English words (excluding prepositions and pronouns) and display them.
This is a sample of the text I have: Yaaah..., Goal for Arsenal. City don't deal with the corner and Koscielny smashes home..
I have more than one file like this, and I need to extract the most popular English words from text like the sample above.
Thanks for helping
Interesting use case.
Here is a search-time method to do it (to be tested on a large set of events):
| sort -_time
| rex mode=sed "s/(\.|,|;|=|\"|'|\(|\)|\[|\]| -|!|^-)/ /g"
| eval word=_raw
| makemv delim=" " word
| mvexpand word
| eval word=lower(word)
| eval position=1 | streamstats sum(position) AS position
| table position word
| stats count min(position) max(position) by word
To describe the steps: we replace all special characters with spaces, copy _raw into a field named word, split it into a multivalue field using space as the separator, expand each value into its own event, convert to lowercase, generate a counter for the position of each word in the text, and finally count the values along with the first and last occurrence of each word.
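The same tokenize/normalize/count logic can be sketched in plain Python (function and variable names here are illustrative, not part of the search above):

```python
import re
from collections import Counter

def word_positions(text):
    """Replace punctuation with spaces, lowercase, split on whitespace,
    then count each word and track its first and last position --
    mirroring the SPL pipeline's rex/makemv/mvexpand/streamstats/stats steps."""
    cleaned = re.sub(r'[.,;="\'()\[\]!-]', ' ', text.lower())
    words = cleaned.split()
    counts = Counter(words)
    first, last = {}, {}
    for pos, w in enumerate(words, start=1):
        first.setdefault(w, pos)
        last[w] = pos
    return {w: (counts[w], first[w], last[w]) for w in counts}

stats = word_positions("Goal for Arsenal. City don't deal with the corner.")
# stats["goal"] -> (count, first position, last position)
```

Note that, like the sed expression in the search, stripping the apostrophe splits "don't" into "don" and "t"; that may or may not be what you want.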
For anyone else who might be doing this: I was able to achieve the desired result with a combination of the rex command, to extract individual words from the Twitter post body, piped to a dynamic lookup fed by a simple Python script.
The command to extract each word was:
rex field=body "(?
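The Python script behind that dynamic lookup isn't shown in the post. As a hedged sketch only: a Splunk external lookup exchanges CSV rows over stdin/stdout, so a minimal stop-word filter could look like the following (the field names `word`/`is_stopword` and the stop-word list are assumptions, not from the original):

```python
import csv
import io
import sys

# Hypothetical stop-word set; a real script would cover all prepositions and pronouns.
STOPWORDS = {"the", "a", "an", "and", "for", "by", "with", "it", "he", "she"}

def filter_stopwords(csv_in, csv_out, word_field="word"):
    """Read CSV rows, flag each row's word as a stop word or not, and write
    the rows back with an extra is_stopword column -- the shape of work a
    Splunk external lookup script performs over stdin/stdout."""
    reader = csv.DictReader(csv_in)
    fields = list(reader.fieldnames) + ["is_stopword"]
    writer = csv.DictWriter(csv_out, fieldnames=fields)
    writer.writeheader()
    for row in reader:
        row["is_stopword"] = "1" if row[word_field].lower() in STOPWORDS else "0"
        writer.writerow(row)

# Invoked by Splunk, this would run as: filter_stopwords(sys.stdin, sys.stdout)
```

In the search you would then keep only rows where `is_stopword=0` before counting.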
Sorry, but I don't think that you'll be able to reliably filter out French/German/Spanish/etc etc automatically.
I believe, though, that you could break your text into separate events (one event per word) with a props.conf stanza like:
[my_tweets]
SHOULD_LINEMERGE = false
LINE_BREAKER = (\s+)
EXTRACT-tweetword = ^(?<words>.*)$
And then search like:
sourcetype=my_tweets NOT the NOT a NOT an NOT for NOT by | top 1000 words
and then successively add a "NOT someword" clause for each unwanted word that turns up.
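Maintaining that growing NOT list by hand gets tedious; a small helper (purely illustrative, not part of the answer above) can assemble the search string from a stop-word list:

```python
def build_search(base="sourcetype=my_tweets",
                 stopwords=("the", "a", "an", "for", "by"),
                 limit=1000):
    """Assemble the iterative 'NOT someword' search described above
    from a list of words to exclude."""
    nots = " ".join(f"NOT {w}" for w in stopwords)
    return f"{base} {nots} | top {limit} words"

query = build_search()
# -> sourcetype=my_tweets NOT the NOT a NOT an NOT for NOT by | top 1000 words
```

Each time an unwanted word appears in the results, append it to the stop-word list and regenerate the query.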
Sadly, this is one of those times where there is probably a better tool than Splunk.