Solved: Re: Search performance of raw without wildcards

fredclown · ‎06-23-2022

I am somewhat puzzled by the performance of this search. When I leave the wildcards off, the search is WAY faster than with them. In essence, shouldn't I get the same results from both searches?

index="myindex" sourcetype="mysourcetype" "my term"

vs

index="myindex" sourcetype="mysourcetype" "*my term*"

On another answer I saw a Splunk employee state that ...

"my term"

was essentially the same as ...

_raw="*my term*"

The performance difference on my system is undeniable, so my question is: is there a reason I would want or need to put the wildcards in? Would I potentially get different results? Thanks.

PickleRick · ‎06-23-2022

Unlike most "typical" databases, SIEMs and whatnot, Splunk does the search "in reverse". Whereas your typical ArcSight, Elasticsearch or whatever else splits the data and parses it into separate fields on ingest and then stores it in specific field-oriented structures, Splunk "only" splits the input events into "words" and builds a "reverse index" (an inverted index) of those words.

So (simplifying a bit, but not much) if you're searching for "word1 word2", Splunk first checks a Bloom filter to see whether there are any events containing those words at all. It then looks into the reverse index to see which events contain word1 and which contain word2, works out which events contain both, and finally checks whether the words appear in the sequence you provided.
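
As a rough illustration of that lookup, here is a toy Python model. It is only a sketch and not Splunk's actual implementation: the sample events, the regex-based `tokenize` and the function names are all made up for the example, and the Bloom filter step is left out.

```python
import re
from collections import defaultdict

# Hypothetical sample events; the event ID is just the list position.
EVENTS = [
    "error: my term was not found",
    "my other term",
    "term my",
    "nothing to see here",
]

def tokenize(text):
    # Split on anything that is not a letter, digit or underscore.
    # This is an assumed, simplified stand-in for Splunk's real segmentation.
    return [t for t in re.split(r"[^A-Za-z0-9_]+", text.lower()) if t]

def build_index(events):
    # Map every word to the set of event IDs that contain it.
    index = defaultdict(set)
    for event_id, text in enumerate(events):
        for token in tokenize(text):
            index[token].add(event_id)
    return index

def search_phrase(index, events, phrase):
    words = tokenize(phrase)
    if not words:
        return []
    # Step 1: keep only the events that contain every word.
    candidates = set.intersection(*(index.get(w, set()) for w in words))
    # Step 2: for those candidates only, check the words appear in sequence.
    n = len(words)
    hits = []
    for event_id in sorted(candidates):
        tokens = tokenize(events[event_id])
        if any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1)):
            hits.append(event_id)
    return hits

INDEX = build_index(EVENTS)
print(search_phrase(INDEX, EVENTS, "my term"))  # [0]
```

Events 0, 1 and 2 all contain both words, but only event 0 has them next to each other in the right order. Event 3 is never looked at because it drops out at the index step.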

If you use a wildcard at the end of your search (like "word1 word2*"), Splunk can still be quite fast. It just has to find all words in the reverse index that begin with word2 and run the same process on a somewhat larger number of events. It's relatively easy to find those words in the index and therefore to get all matching events.
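
One way to see why a trailing wildcard stays cheap: if the indexed words are kept in sorted order (an assumption for this sketch, not a statement about Splunk's on-disk format), all words sharing a prefix sit next to each other and can be found with a binary search. The lexicon and the helper name below are hypothetical.

```python
from bisect import bisect_left

# Hypothetical lexicon of indexed words, kept sorted.
LEXICON = sorted([
    "myword2", "term", "terminal", "terms", "test",
    "word1", "word2", "word20", "word3",
])

def terms_with_prefix(lexicon, prefix):
    # Binary-search to the first word >= prefix, then walk forward
    # only while the words still start with the prefix.
    start = bisect_left(lexicon, prefix)
    matches = []
    for term in lexicon[start:]:
        if not term.startswith(prefix):
            break
        matches.append(term)
    return matches

print(terms_with_prefix(LEXICON, "word2"))  # ['word2', 'word20']
```

Only the matching words plus one extra are touched, no matter how large the rest of the lexicon is.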

But if you add a wildcard at the beginning, Splunk has no fixed prefix to look up, so it would have to scan the index and try to match every single word in it against the wildcarded pattern. To be honest, I'm not sure whether it does that or simply searches across the raw event data in this case.
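
A small Python sketch of the "scan every word" variant, purely as an illustration. Which of the two strategies Splunk really uses is not settled here, and the lexicon and function name are invented for the example.

```python
from fnmatch import fnmatchcase

# Hypothetical lexicon of indexed words, kept sorted.
LEXICON = sorted([
    "myword2", "term", "terminal", "terms", "test",
    "word1", "word2", "word20", "word3",
])

def terms_matching(lexicon, pattern):
    # With a leading wildcard the sort order does not help:
    # every word has to be tested against the pattern.
    checked = 0
    matches = []
    for term in lexicon:
        checked += 1
        if fnmatchcase(term, pattern):
            matches.append(term)
    return matches, checked

matches, checked = terms_matching(LEXICON, "*word2")
print(matches, checked)  # ['myword2', 'word2'] 9
```

All nine words are examined to find two matches, so the cost grows with the size of the whole lexicon rather than with the number of hits.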

Either way, it's obviously way, way less efficient than getting events through the reverse index of words.

EDIT: and to be precise, searching for "word1 word2" is not the same as _raw="*word1 word2*". Since Splunk splits events at so-called "breakers" (spaces, tabs, punctuation), a search for "word1 word2" looks for those whole "words". It will _not_ find something like "myword1 word2". But it will find "my word1 word2" or even "field=word1 word2/whatever". Searching for "*word1 word2*", on the other hand, would find "myword1 word2too", at the expense of search performance.
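
The difference in results can be reproduced with a toy Python comparison of word-based matching against plain substring matching. This is a sketch under the assumption that words are split on any non-alphanumeric character, which only approximates Splunk's real segmentation; the function names are hypothetical.

```python
import re

def tokenize(text):
    # Assumed, simplified splitting: anything that is not a letter,
    # digit or underscore separates two words.
    return [t for t in re.split(r"[^A-Za-z0-9_]+", text.lower()) if t]

def phrase_match(event, phrase):
    # Models the search "word1 word2": whole words, in sequence.
    tokens, words = tokenize(event), tokenize(phrase)
    n = len(words)
    return n > 0 and any(
        tokens[i:i + n] == words for i in range(len(tokens) - n + 1)
    )

def substring_match(event, text):
    # Models "*word1 word2*": the text may start or end mid-word.
    return text.lower() in event.lower()

for event in [
    "myword1 word2",
    "my word1 word2",
    "field=word1 word2/whatever",
    "myword1 word2too",
]:
    print(event, phrase_match(event, "word1 word2"),
          substring_match(event, "word1 word2"))
```

The substring match is true for all four events, while the word-based match is true only for "my word1 word2" and "field=word1 word2/whatever".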

fredclown · ‎06-23-2022

Makes sense. Thanks.

richgalloway · ‎06-23-2022

The search '_raw=my term' is the same as '_raw="*my term*"' in concept, but not in execution. Without wildcards, Splunk can use Bloom filters and other metadata to reduce the number of events that have to be examined for a match. With the leading wildcard present, Splunk has to examine every event to see if it matches. That's what takes so long.
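
For anyone unfamiliar with Bloom filters, here is a minimal Python sketch of the idea. It is a generic textbook version, not Splunk's implementation: the class, its sizes and the SHA-256 based hashing are all assumptions chosen for the example.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: can say 'definitely absent' or 'possibly present'."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, term):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{term}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, term):
        for pos in self._positions(term):
            self.bits |= 1 << pos

    def might_contain(self, term):
        # False means the term was never added; True means "maybe".
        return all((self.bits >> pos) & 1 for pos in self._positions(term))

# One hypothetical filter summarising the words in a group of events.
group_filter = BloomFilter()
for word in ["my", "term", "error"]:
    group_filter.add(word)

print(group_filter.might_contain("term"))  # True
```

A search for a whole word can ask the filter first and skip the whole group of events when the answer is False. A pattern like "*term*" is not a single word that can be hashed, so a filter of this kind cannot be consulted for it.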

It's a good practice to avoid leading wildcards.


fredclown · ‎06-23-2022

Thanks for the help sir.
