Is there a more efficient way to remove stop words from a text field than using the makemv and mvexpand combo?

andrewtrobec
Motivator

Hello,

I'm currently performing analysis on a free text field and the first step is to remove stop words. This is my approach:

  1. makemv to convert the free text field into a list of words
  2. mvexpand to create an event for each word
  3. search with a lookup containing stop words to remove events I don't need

SPL snippet:

...
| makemv text_field
| mvexpand text_field
| search NOT [ | inputlookup stopwords.csv | rename StopWord as text_field ]
...

When I use this approach on large data sets, I hit performance limits very quickly. Is there a different, less performance-heavy way to remove the stop words?

Thank you and best regards,

Andrew

valiquet
Contributor

You could do this with sed.
Can you provide the CSV?

andrewtrobec
Motivator

@valiquet
Thank you for your reply.
The CSV is a single-column lookup with the column name StopWord, and it lists all of the words that I would like to remove. Here is a sample from the list (the full list is much longer):

StopWord
a
about
above
across
after
afterwards
again
against
all
almost
alone
along
already
also
although
always
am
among
amongst
amoungst
amount
an
and
another
any
anyhow
anyone
anything
anyway
anywhere
are
around
as
at
back
be
became
because
become
becomes
becoming
been
before
beforehand
behind
being
below
beside
besides
between

I'd like to point out that I am already using rex in sed mode to remove punctuation:

rex mode=sed field=text_field "s/[^a-zA-Z0-9_-]+/ /g"

If this can somehow be extended to cover the list of stop words in the lookup (which is a couple of hundred words long) then that would be amazing. Is this possible?

Thank you and best regards,

Andrew
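
For a short stop-word list, the sed-mode substitution above might be extended with a word-boundary alternation. This is a minimal sketch, not a tested answer: the makeresults sample text and the six hard-coded words are assumptions for illustration, and the full list from stopwords.csv would have to be written into the pattern by hand or generated separately.

```spl
| makeresults
| eval text_field="a quick look at all of the logs again"
| eval text_field=lower(text_field)
| rex mode=sed field=text_field "s/\b(a|about|above|again|all|at)\b/ /g"
| rex mode=sed field=text_field "s/\s+/ /g"
| eval text_field=trim(text_field)
```

The first rex replaces each listed word with a space, the second collapses the runs of whitespace left behind, and trim removes leading and trailing spaces. Two caveats: \b treats a hyphen as a word boundary, so a token such as "all-in" would lose its "all", and the match is case-sensitive, which is why the text is lowercased first. Whether this is actually faster than makemv plus mvexpand on a large data set would need to be measured.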
