Is there a more efficient way to remove stop words from a text field than using the makemv and mvexpand combo?

andrewtrobec · 02-28-2018

Hello,

I'm currently performing analysis on a free text field and the first step is to remove stop words. This is my approach:

1. makemv to convert the free text field into a multivalue list of words
2. mvexpand to create an event for each word
3. search with a lookup containing the stop words to remove the events I don't need

SPL snippet:

...
| makemv text_field
| mvexpand text_field
| search NOT [ | inputlookup stopwords.csv | rename StopWord as text_field ]
...

When I use this approach on large data sets I hit my performance limits very quickly. Is there a different way to remove the stop words that is less performance-heavy than my current approach?

Thank you and best regards,

Andrew
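
A possible variation on the snippet above, offered as a sketch rather than a measured improvement: the lookup command can flag stop words after mvexpand, so the stop-word list is not expanded by the subsearch into one long NOT (...) clause. It assumes that stopwords.csv can be referenced directly by file name and that lowercasing the text is acceptable (CSV lookups match case-sensitively by default, while search matches field values case-insensitively). is_stop is a made-up field name.

```
...
| eval text_field=lower(text_field)
| makemv text_field
| mvexpand text_field
| lookup stopwords.csv StopWord AS text_field OUTPUT StopWord AS is_stop
| where isnull(is_stop)
| fields - is_stop
...
```

mvexpand still creates one event per word here, so this may not remove the main cost on large data sets.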

valiquet · 03-09-2018

You can do this with sed (| rex mode=sed).
Can you provide the CSV?

andrewtrobec · 03-10-2018

@valiquet
Thank you for your reply.
The CSV is a single-column lookup with the column name StopWord, and it is a list of all of the words that I would like to remove. Here is a sample from the list (it's much longer):

StopWord
a
about
above
across
after
afterwards
again
against
all
almost
alone
along
already
also
although
always
am
among
amongst
amoungst
amount
an
and
another
any
anyhow
anyone
anything
anyway
anywhere
are
around
as
at
back
be
became
because
become
becomes
becoming
been
before
beforehand
behind
being
below
beside
besides
between

I'd like to point out that I am currently using the sed command to remove punctuation:

rex mode=sed field=text_field "s/[^a-zA-Z0-9_-]+/ /g"

If this can somehow be extended to cover the list of stop words in the lookup (which is a couple of hundred words long) then that would be amazing. Is this possible?

Thank you and best regards,

Andrew
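
One way the sed idea could be extended, as an untested sketch: put the stop words into a single alternation between \b word boundaries and delete every match, so that neither makemv nor mvexpand is needed. The first block hard-codes five words from the sample list to show the shape of the expression. The lower() call is there because the regex is case-sensitive, whereas search matches field values case-insensitively; it assumes the original casing is not needed for the analysis.

```
...
| eval text_field=lower(text_field)
| rex mode=sed field=text_field "s/[^a-zA-Z0-9_-]+/ /g"
| rex mode=sed field=text_field "s/\b(a|about|above|across|after)\b//g"
| rex mode=sed field=text_field "s/ +/ /g"
...
```

The second block is an assumption about how the same expression might be generated from stopwords.csv with a subsearch instead of being typed by hand. sw and sed_expr are made-up field names, the quoting and backslash handling inside eval should be verified on your Splunk version, and the stop words are assumed to contain no regex metacharacters.

```
...
| eval text_field=lower(text_field)
| rex mode=sed field=text_field "s/[^a-zA-Z0-9_-]+/ /g"
| rex mode=sed field=text_field
    [ | inputlookup stopwords.csv
      | stats values(StopWord) as sw
      | eval sed_expr="\"s/\b(" . mvjoin(sw, "|") . ")\b//g\""
      | return $sed_expr ]
| rex mode=sed field=text_field "s/ +/ /g"
...
```

Because \b treats a hyphen as a word boundary and the punctuation step keeps hyphens, a stop word inside a hyphenated token is removed as well: after-sales would become -sales.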
