Splunk Search

Search performance of raw without wildcards

fredclown
Communicator

I am somewhat puzzled by the performance of this search. When I leave the wildcards off, the search is far faster than with them. In essence, shouldn't I get the same results from both searches?

index="myindex" sourcetype="mysourcetype" "my term"

vs

index="myindex" sourcetype="mysourcetype" "*my term*"

 

On another answer I saw a Splunk employee state that ...

"my term"

was essentially the same as ...

_raw="*my term*"

 

The performance difference on my system is undeniable, so my question is: is there a reason I would want or need to put the wildcards in? Would I potentially get different results? Thanks.

1 Solution

PickleRick
Ultra Champion

Contrary to most "typical" databases, SIEMs and whatnot, Splunk does the search "in reverse". Whereas your typical ArcSight, Elasticsearch or whatever else splits the data and parses it into separate fields on ingest and then stores the data in field-oriented structures, Splunk "only" splits the input events into "words" and builds a "reverse index" (an inverted index) of those words.

So (simplifying a bit but not much) if you're searching for "word1 word2", Splunk first checks a bloom filter to see whether any events contain those words at all, then looks into the reverse index to see which events contain word1 and which contain word2, keeps the events that contain both, and finally checks whether the words appear in the sequence you provided.
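The lookup described above can be sketched in a few lines of Python. This is only a toy model and not Splunk's actual implementation: the events, the `tokenize` splitter, and the function names are all made up for illustration, the bloom filter step is left out, and the final sequence check compares token lists rather than raw event text.

```python
import re
from collections import defaultdict

# Hypothetical toy events (not real Splunk data).
EVENTS = [
    "user=alice action=login status=ok",
    "my word1 word2 arrived",
    "word2 came before word1",
    "myword1 word2",
]


def tokenize(text):
    # Crude stand-in for event segmentation: split on anything
    # that is not a letter or a digit.
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def build_index(events):
    # Map each word to the set of event ids that contain it.
    index = defaultdict(set)
    for event_id, text in enumerate(events):
        for token in tokenize(text):
            index[token].add(event_id)
    return index


def phrase_search(index, events, phrase):
    terms = tokenize(phrase)
    if not terms:
        return []
    # Step 1: keep only events that contain every searched word.
    candidates = set.intersection(*(index.get(t, set()) for t in terms))
    # Step 2: among those candidates, check that the words appear in order.
    hits = []
    for event_id in sorted(candidates):
        tokens = tokenize(events[event_id])
        n = len(terms)
        if any(tokens[i:i + n] == terms for i in range(len(tokens) - n + 1)):
            hits.append(event_id)
    return hits


index = build_index(EVENTS)
print(phrase_search(index, EVENTS, "word1 word2"))  # -> [1]
```

Event 2 contains both words but in the wrong order, and event 3 never becomes a candidate because "myword1" is indexed as a different word than "word1".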

If you use a wildcard at the end of your search term (like "word1 word2*"), Splunk can still be quite fast: it just has to find all words in the reverse index that begin with word2 and run the same process on a somewhat larger number of events. It's relatively easy to find those words in the index and therefore get all matching events.

But if you add a wildcard at the beginning, Splunk would have to scan the index for all words that match the wildcarded beginning, which means trying to match every single word in the index. To be honest, I'm not sure whether it does that or simply searches across the raw event data in this case.

Either way, it's obviously way, way less efficient than getting events via the reverse index of words.
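A rough way to see why the two wildcard positions behave so differently, assuming the words are kept in sorted order (an assumption made for this sketch, with a made-up word list and made-up function names): a trailing wildcard becomes a binary search plus a short walk, while a leading wildcard forces a test against every word.

```python
from bisect import bisect_left

# Hypothetical sorted word list standing in for the index's vocabulary.
LEXICON = sorted(
    ["error", "myword2", "word1", "word2", "word2too", "words", "zebra"]
)


def prefix_matches(lexicon, prefix):
    # Trailing wildcard (prefix*): jump to the first word >= prefix,
    # then walk forward only while words still start with the prefix.
    start = bisect_left(lexicon, prefix)
    matches = []
    for term in lexicon[start:]:
        if not term.startswith(prefix):
            break
        matches.append(term)
    return matches


def suffix_matches(lexicon, suffix):
    # Leading wildcard (*suffix): sorted order gives no shortcut,
    # so every word in the list has to be tested.
    return [term for term in lexicon if term.endswith(suffix)]


print(prefix_matches(LEXICON, "word2"))  # -> ['word2', 'word2too']
print(suffix_matches(LEXICON, "word2"))  # -> ['myword2', 'word2']
```

With seven words the difference is invisible, but the first function touches only the matching neighbourhood while the second always touches the whole list.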

EDIT: To be precise, searching for "word1 word2" is not the same as _raw="*word1 word2*". Since Splunk splits events at so-called "breakers" (spaces, tabs, punctuation), a search for "word1 word2" looks for those "words". It will _not_ find something like "myword1 word2", but it will find "my word1 word2" or even "field=word1 word2/whatever". Searching for "*word1 word2*", on the other hand, would find "myword1 word2too", at the expense of search performance.
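The difference in results can be mimicked with a small Python comparison of word-based matching against plain substring matching. The splitter below is a simplified assumption (it breaks on every non-alphanumeric character), and the function names are hypothetical, so treat it as an illustration of the examples above rather than of Splunk's exact segmentation rules.

```python
import re


def tokenize(text):
    # Simplified breaker rule: split on anything that is not a letter or digit.
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def word_phrase_match(event, phrase):
    # Models "word1 word2": whole words must match, in sequence.
    terms = tokenize(phrase)
    tokens = tokenize(event)
    n = len(terms)
    return n > 0 and any(
        tokens[i:i + n] == terms for i in range(len(tokens) - n + 1)
    )


def substring_match(event, text):
    # Models "*word1 word2*": the text may start or end mid-word.
    return text.lower() in event.lower()


for event in [
    "myword1 word2",
    "my word1 word2",
    "field=word1 word2/whatever",
    "myword1 word2too",
]:
    print(
        event,
        word_phrase_match(event, "word1 word2"),
        substring_match(event, "word1 word2"),
    )
```

The substring check accepts all four sample events, while the word-based check accepts only the second and third.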


fredclown
Communicator

Makes sense. Thanks.


richgalloway
SplunkTrust

The search _raw="my term" is the same as _raw="*my term*" in concept, but not in execution. Without wildcards, Splunk can use bloom filters and other metadata to reduce the number of events that have to be examined for a match. With the leading wildcard present, Splunk has to examine every event to see if it matches. That's what takes so long.

It's a good practice to avoid leading wildcards.


fredclown
Communicator

Thanks for the help sir.
