Hello,
I'm currently performing analysis on a free text field and the first step is to remove stop words. This is my approach:
makemv to convert the free text field into a list of words
mvexpand to create an event for each word
SPL snippet:
...
| makemv text_field
| mvexpand text_field
| search NOT [ | inputlookup stopwords.csv | rename StopWord as text_field ]
...
When I use this approach on large data sets, I hit performance limits very quickly. Is there a different way to remove the stop words that is less performance-heavy than my current approach?
Thank you and best regards,
Andrew
With |sed
Can you provide the csv?
@valiquet
Thank you for your reply.
The csv is a single-column lookup with the column name StopWord and is a list of all of the words that I would like to remove. Here is a sample from the list (it's much longer):
StopWord
a
about
above
across
after
afterwards
again
against
all
almost
alone
along
already
also
although
always
am
among
amongst
amoungst
amount
an
and
another
any
anyhow
anyone
anything
anyway
anywhere
are
around
as
at
back
be
became
because
become
becomes
becoming
been
before
beforehand
behind
being
below
beside
besides
between
I'd like to point out that I am currently using the sed mode of the rex command to remove punctuation:
| rex mode=sed field=text_field "s/[^a-zA-Z0-9_-]+/ /g"
If this can somehow be extended to cover the list of stop words in the lookup (which is a couple of hundred words long) then that would be amazing. Is this possible?
Thank you and best regards,
Andrew
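One option that may be lighter than the NOT [ subsearch ] filter is to keep makemv and mvexpand, but match each word against the CSV with the lookup command and then drop the rows that matched. This is only a minimal sketch, not a tested solution. It assumes stopwords.csv is available as a lookup table file in the current app, and is_stopword is a hypothetical field name introduced here purely for illustration:

```
...
| makemv text_field
| mvexpand text_field
| lookup stopwords.csv StopWord AS text_field OUTPUT StopWord AS is_stopword
| where isnull(is_stopword)
...
```

The idea is that the lookup is evaluated per event instead of the whole stop word list being expanded into the search string, so it should not be subject to subsearch result limits. The mvexpand step is still there, so this may only remove part of the cost. As far as I know, CSV lookups match case-sensitively by default, so it may be necessary to lowercase the field first (for example with eval text_field=lower(text_field) before makemv) if the list is all lowercase.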