How can I calculate the term frequency for all the words in a field's values?

mhqssyh
Explorer

I am trying to calculate term frequencies for a field. The field is defined as follows:
rex field=_raw "Notes : (?.*)"
The field is extracted correctly, but its values are free-form text with no fixed format, for example:

Notes :

Notes : Troubleshooting, I am simply reinstalling.
Notes : program would not start. I am reinstalling.
Notes : Made MacBook too slow
Notes : computer to slow when using the program! Need to install it into another!

There are thousands of lines like these, and I want to know the term frequency of every word in the notes field. Is there a command that does this, or some other way to achieve it in Splunk?

Any ideas?
Thanks, Yi

1 Solution

jimodonald
Contributor

Here is a related post.... http://answers.splunk.com/answers/62413/how-to-extract-most-popular-words-from-the-source-data.html

I think the REX from that post should get you going in the right direction. I've pasted the REX below. Please see the original for more details.

source=*mybook* | sort -_time | rex mode=sed "s/(\.|,|;|=|\"|'|\(|\)|\[|\]| -|!|^-)/ /g" | eval word=_raw | makemv delim=" " word | mvexpand word | eval word=lower(word) | eval position=1 | streamstats sum(position) AS position | table position word | stats count min(position) max(position) by word 
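Adapted to the question, the same idea can be run against the extracted field instead of _raw. This is only a sketch: the field name notes is an assumption (the capture group name is not visible in the question), and the leading ... stands for whatever base search returns the events. It lowercases the text, replaces every run of characters other than letters, digits, and apostrophes with a space, splits on spaces, and counts each word.

```spl
... | rex field=_raw "Notes : (?<notes>.*)"
| eval word=lower(notes)
| rex mode=sed field=word "s/[^a-z0-9']+/ /g"
| makemv delim=" " word
| mvexpand word
| stats count by word
| sort - count
```

Events with an empty Notes line produce no word value, so stats leaves them out of the counts.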

jzapantis
Path Finder

I used the following syntax to count the frequency of terms in my field:

              | rename COMMENTS_4 AS text
              | rex mode=sed field=text "s/[,.!]/ /g"
              | makemv text
              | mvexpand text
              | eval wordCount = mvcount(text)
              | stats sum(wordCount) as "Word Map Text Analysis" by text

The first line, | rename COMMENTS_4 AS text, just renames my field to "text". So, assuming you rename your own field to text, you can count the terms using the MV* commands.

mhqssyh
Explorer

Thanks, jimodonald! I tried the REX and it works. Now I have another question: can I cluster similar words into one class, such as fast, quick, rapid, and swift?


jzapantis
Path Finder

You have to use a lexicon. Look up the Node.js library for WordNet and upload it. Then build a new app in Splunk. Once that is done, create a .js file that calls the WordNet library, then define a search manager in the .js file that returns your Splunk search. Loop through all the words and pass each one to the WordNet library to build a temporary synonym dictionary. You can optionally save this dictionary in the KV store and update it continually.

I know I didn't give details, but that's because it is a highly involved solution. But it is possible. Start poking around with WordNet and its capabilities.

Keep in mind that all custom Splunk apps are basically Node.js apps; at least that is my current understanding. Community, let me know if I am wrong!

jimodonald
Contributor

Splunk is not going to know what words are synonyms. It could likely be done with a case statement or a lookup table. Either way the synonyms would need to be identified and linked back to a common word.
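As a rough sketch of the case approach, assuming the words have already been split into a word field as in the accepted solution, and picking fast arbitrarily as the common word:

```spl
... | eval word=lower(word)
| eval word=case(word=="quick" OR word=="rapid" OR word=="swift", "fast", true(), word)
| stats count by word
```

For a longer synonym list, a lookup is probably easier to maintain. The variant below assumes a hypothetical lookup file synonyms.csv with the columns word and canonical, uploaded beforehand; words with no entry in the file keep their original value.

```spl
... | lookup synonyms.csv word OUTPUT canonical
| eval word=coalesce(canonical, word)
| stats count by word
```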
