I am trying to calculate term frequency on a field. The field is defined as follows.
rex field=_raw "Notes : (?.*)"
The field is extracted correctly, but its contents are free-form text, such as:
Notes :
Notes : Troubleshooting, I am simply reinstalling.
Notes : program would not start. I am reinstalling.
Notes : Made MacBook too slow
Notes : computer to slow when using the program! Need to install it into another!
There are thousands of lines like these, and I want to know the term frequency of every word in the notes field. Is there a command that does this, or how else can I achieve it in Splunk?
Any ideas?
Thanks, Yi
Here is a related post: http://answers.splunk.com/answers/62413/how-to-extract-most-popular-words-from-the-source-data.html
I think the REX from that post should get you going in the right direction. I've pasted the REX below. Please see the original for more details.
source=*mybook* | sort -_time | rex mode=sed "s/(\.|,|;|=|\"|'|\(|\)|\[|\]| -|!|^-)/ /g" | eval word=_raw | makemv delim=" " word | mvexpand word | eval word=lower(word) | eval position=1 | streamstats sum(position) AS position | table position word | stats count min(position) max(position) by word
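Note that the search above splits _raw, so it counts every word in the event rather than only the words in the extracted field. Here is a minimal sketch of the same idea restricted to one field. It assumes the extracted field is called notes (a hypothetical name, since the capture group name is not visible in the question), and it uses makeresults with three of the sample notes so that it can be run on its own. In a real search, the first three commands would be replaced by the base search and the rex extraction. The comments use the triple-backtick syntax of recent Splunk versions; remove them if your version rejects them.

````spl
| makeresults
```sample data only; "notes" is a hypothetical field name```
| eval notes=split("Troubleshooting, I am simply reinstalling.|program would not start. I am reinstalling.|Made MacBook too slow", "|")
| mvexpand notes
```lowercase, then turn every run of non-alphanumeric characters into one space```
| eval word=lower(notes)
| rex mode=sed field=word "s/[^a-z0-9]+/ /g"
```split on spaces into a multivalue field, then one event per word```
| makemv delim=" " word
| mvexpand word
| stats count by word
| sort - count
````

With this sample data, "i", "am" and "reinstalling" should each come out with a count of 2 and every other word with a count of 1.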
I used the following syntax to count the frequency of terms in my field:
| rename COMMENTS_4 AS text
| rex mode=sed field=text "s/[,.!]/ /g"
| makemv text
| mvexpand text
| eval wordCount = mvcount(text)
| stats sum(wordCount) as "Word Map Text Analysis" by text
The line: | rename COMMENTS_4 AS text
just renames my field to "text". So, assuming you rename your own field to text, you can count the terms using the MV* commands.
Thanks, jimodonald! I tried the REX and it works. But now I have another question: can I cluster similar words into one class, such as fast, quick, rapid, swift?
You have to use a lexicon. Look up the Node.js library for WordNet. Upload that library. Then build a new app in Splunk. Once that is done, create a .js file that calls the WordNet library, then define a search manager in the .js file that returns your Splunk search. Loop through all the words, and pass each one to the WordNet library to build a temporary synonym dictionary. You can optionally save this dictionary in a KV store and update it continually.
I know I didn't give details, but that's because it is a highly involved solution. But it is possible. Start poking around with WordNet and its capabilities.
Keep in mind that all custom Splunk apps are basically Node.js apps; at least, that is my current understanding. Community, let me know if I am wrong!
Splunk is not going to know what words are synonyms. It could likely be done with a case statement or a lookup table. Either way the synonyms would need to be identified and linked back to a common word.
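The case-statement approach could look something like the sketch below. The word list passed to split is made-up sample data standing in for the output of a word-splitting search, and word_class is a hypothetical field name. Only the fast/quick/rapid/swift group comes from the question; any other groups would have to be identified by hand. The comments use the triple-backtick syntax of recent Splunk versions.

````spl
| makeresults
```made-up sample words; in practice "word" comes from the word-splitting search```
| eval word=split("fast quick slow rapid swift slow", " ")
| mvexpand word
```map each synonym to one common word; unlisted words keep their own value```
| eval word_class=case(in(word, "fast", "quick", "rapid", "swift"), "fast", true(), word)
| stats count by word_class
| sort - count
````

On the sample words this should report a count of 4 for "fast" and 2 for "slow". Once the synonym list grows beyond a handful of groups, a lookup table with one row per word and its common word is likely easier to maintain than a long case statement.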