Where can I find a detailed explanation of how the Splunk search algorithm works? There is a pretty good explanation in the docs of how the indexes themselves are created, but I can't find anything as detailed on how the indexes are used.
Is there anything else?
For example, if I search for all events where IP=188.8.131.52, how does Splunk use the default indexed fields, the raw data files, and the extracted fields to find the matches?
If a field is extracted at search time, does it essentially have to do a raw-text search through all possible events?
Do fields parsed at index time behave differently?
If Splunk search is mapreduce and distributed, why do I always receive search results in a reverse time-linear fashion?
That's a nice article. You might also want to look here in the docs
BTW, don't change the default segmentation (which is
How Splunk actually does the searching is proprietary, but I can give you a high-level example. (And maybe someone will correct me if I am off on this.) Assume the following search:
sourcetype=xyz user=sam 184.108.40.206
First, Splunk determines which indexes you are allowed to search. (Note: if you have multiple indexes, but this data appears in only one of them, you can increase the search speed by specifying index=security or whatever.)
Second, the host, source and sourcetype are special fields that help to quickly identify the buckets. In addition, Splunk can determine the date range of a bucket from the timestamp embedded in its name. So the time range for the search, the names of the indexes and the sourcetype (xyz) allow Splunk to identify the subset of buckets to actually search.
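To make the time-range pruning concrete, here is a rough Python sketch. It is not Splunk's code. It assumes, purely for illustration, that a bucket directory is named db_<newest>_<oldest>_<id> with epoch-second timestamps; the helper names (parse_bucket, buckets_in_range) are made up for this example.

```python
from typing import List, NamedTuple


class Bucket(NamedTuple):
    name: str
    newest: int  # epoch seconds of the latest event in the bucket
    oldest: int  # epoch seconds of the earliest event in the bucket


def parse_bucket(name: str) -> Bucket:
    # Assumed layout for this sketch: db_<newest>_<oldest>_<id>
    _, newest, oldest, _ = name.split("_")
    return Bucket(name, int(newest), int(oldest))


def buckets_in_range(names: List[str], earliest: int, latest: int) -> List[str]:
    """Keep only buckets whose time span overlaps [earliest, latest]."""
    selected = []
    for name in names:
        bucket = parse_bucket(name)
        # Two intervals overlap when each one starts before the other ends.
        if bucket.oldest <= latest and bucket.newest >= earliest:
            selected.append(bucket.name)
    return selected


names = [
    "db_1700003000_1700002000_2",
    "db_1700001999_1700001000_1",
    "db_1700000999_1700000000_0",
]
hits = buckets_in_range(names, earliest=1700001500, latest=1700002500)
print(hits)
```

The point is only that the decision is made from the bucket name, without opening any of the bucket's files.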
Third, Splunk looks at the search keywords. In this case, the keywords are sam and 184.108.40.206. Splunk uses bloom filters to quickly identify buckets that do not have both of these keywords, and eliminates them from the search subset. (Look up "Bloom filter" on Wikipedia for a nice description.)
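For anyone who has not met bloom filters before, the following toy Python version shows the idea of ruling out buckets per keyword. It is a minimal sketch, not how Splunk implements it: the filter size, the number of hashes, the use of SHA-256 and the bucket contents are all assumptions chosen for the example.

```python
import hashlib


class BloomFilter:
    """Toy bloom filter: may give false positives, never false negatives."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, term: str):
        # Derive num_hashes bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{term}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, term: str) -> None:
        for pos in self._positions(term):
            self.bits |= 1 << pos

    def might_contain(self, term: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(term))


# Hypothetical buckets and the terms that occur in them.
bucket_terms = {
    "bucket_a": ["sam", "184.108.40.206", "login"],
    "bucket_b": ["alice", "logout"],
}
filters = {}
for name, terms in bucket_terms.items():
    bf = BloomFilter()
    for term in terms:
        bf.add(term)
    filters[name] = bf

keywords = ["sam", "184.108.40.206"]
# A bucket stays a candidate only if its filter cannot rule out any keyword.
candidates = [
    name for name, bf in filters.items()
    if all(bf.might_contain(k) for k in keywords)
]
print(candidates)
```

A bucket that really contains both keywords always survives this check. A bucket that does not contain them is usually dropped, but can occasionally slip through as a false positive, which is why the index inside the bucket still has to be consulted afterwards.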
Fourth: Splunk is now ready to actually look at the inverted keyword index for each bucket. I am not sure how Splunk optimizes the work at this point, but conceptually it identifies all the events that have one of the keywords and then eliminates the events that do not have the second keyword.
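Conceptually, the keyword step looks something like the Python sketch below. This is only an illustration of an inverted index and posting-list intersection. The tokenizer (a simple regex) is a stand-in for Splunk's real segmentation rules, and the sample events are invented.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"[\w.]+")  # simplified stand-in for real segmentation


def build_inverted_index(events):
    """Map each token to the set of event ids that contain it."""
    index = defaultdict(set)
    for event_id, raw in enumerate(events):
        for token in TOKEN.findall(raw.lower()):
            index[token].add(event_id)
    return index


def events_with_all(index, keywords):
    """Return ids of events that contain every keyword."""
    postings = sorted(
        (index.get(k.lower(), set()) for k in keywords), key=len
    )
    if not postings:
        return set()
    # Start from the rarest keyword so the working set stays small.
    result = set(postings[0])
    for posting in postings[1:]:
        result &= posting
    return result


events = [
    "user=sam src=184.108.40.206 action=login",    # both keywords
    "user=alice src=184.108.40.206 action=login",  # IP only
    "user=bob note=sam src=184.108.40.206",        # both, sam not in user
    "user=sam src=10.0.0.1 action=logout",         # sam only
]
index = build_inverted_index(events)
matches = events_with_all(index, ["sam", "184.108.40.206"])
print(sorted(matches))
```

Note that the event where sam appears outside the user field still matches here, because no field extraction has happened yet.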
Notice that through all this work, Splunk has not created the search-time fields yet! Instead, Splunk has eliminated as many events as possible. Imagine that at this point, the search has retrieved 10,000 events:
All of the events contain the keyword sam and the keyword 184.108.40.206
7,000 of the events have the keyword sam in the user field. The other 3,000 events have the word sam in some other part of the event.
Now Splunk performs the necessary search-time field extractions. It eliminates the 3,000 events that do not match the user=sam condition. The search is complete.
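Here is a small Python sketch of that last step, where the field extraction runs only on the events that survived the keyword filtering. The regex and the function names are assumptions for the example. Real extractions are driven by Splunk's configuration, not by code like this.

```python
import re

# Hypothetical extraction rule for the user field.
USER_FIELD = re.compile(r"\buser=(\w+)")


def extract_user(raw):
    """Return the value of the user field, or None if it is absent."""
    match = USER_FIELD.search(raw)
    return match.group(1) if match else None


def apply_field_filter(candidates, value):
    """Keep only events whose extracted user field equals value."""
    return [raw for raw in candidates if extract_user(raw) == value]


# Events that already passed the keyword step (both contain "sam").
candidates = [
    "user=sam src=184.108.40.206 action=login",
    "user=bob note=sam src=184.108.40.206",
]
final = apply_field_filter(candidates, "sam")
print(final)
```

The regex is applied to the handful of candidate events only, never to every event in the time range.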
That's the concept. Now you can see why it doesn't help to index most fields - Splunk searches for the value (sam) and that usually is a great way to initially reduce the size of the data set.
Finally, take a look at this: Configure index-time field extraction
The docs basically say "if the keyword sam appears in MANY places in your data - not just the field user, then maybe you want an index-time field." Maybe. Otherwise, it may hurt more than it helps.
Yeah, I've already read those other docs. Thanks for the thoughtful reply, however it's not really answering what I'm trying to understand.
Thanks for the reply. Yeah, I've already read through the docs you linked to. The crux of my question is hinted at in this: "Second, the host, source and sourcetype are special fields that help to quickly identify the buckets." This makes it sound like these indexed fields act like indexes on a database table. However, we're told that adding more indexed fields doesn't speed up the search.
What I'm trying to understand is if Splunk can avoid having to do a linear, backwards text search when searching for fields that are extracted at search time. Say I want to find all events with user=XYZ.
What I would expect is that if the user field is added as a custom index field, there is some sort of hash table or lookup table or other optimization such that if I specify a specific user or groups of users to search for, Splunk wouldn't have to actually go back and search through all of the events, or buckets, for those values.
On the other hand, with search-time extracted fields, I get the understanding/feeling that Splunk is going back and actually searching for these fields event by event, and will have to basically chew through every event in the time span to do so.
Actually, there is no overall index in Splunk - each bucket has its own index files. So the "index" is segmented by bucket.
For search-time extractions, Splunk does not build an "index" like an RDBMS index. Instead, the field name simply becomes another filtering criterion. Since the search-time extractions are not applied until after all other filtering criteria, Splunk only has to "chew through" a relatively small set of potential matches - not every event in the time span.
"Since the search-time extractions are not applied until after all other filtering criteria, Splunk only has to "chew through" a relatively small set of potential matches" -- Yeah, that makes sense, and I understand that. But my main question is really about how index-time fields work. Are they used to initially reduce the set of buckets?
Well yes, but I don't think that they work like you may expect. Here is how I understand it, based on the previous example: Remember that 10,000 events have the keyword sam. Assume that 500,000 events have the field user.
Which is faster? (1) For Splunk to retrieve events with sam and then determine which of those has user=sam, or
(2) retrieve events that have a user field and then figure out which user fields have the value sam?
Even for index-time fields, I don't think Splunk creates the kind of lookup index that you would find in an RDBMS. There is no cross-bucket index.
Oh, I forgot to answer this part: "If Splunk search is mapreduce and distributed, why do I always receive search results in a reverse time-linear fashion?"
Splunk knows the time range of the data in each bucket. It searches the most recent buckets first. Even when there are multiple indexers, the search combines and sorts the events from the indexers in reverse time order.
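The combining step can be pictured as a streaming merge of already-sorted result lists. The Python sketch below uses heapq.merge for that. It is an illustration only: the (timestamp, text) event shape and the two indexer lists are assumptions, and this is not Splunk's actual merge logic.

```python
import heapq


def merge_newest_first(*streams):
    """Lazily merge streams that are each already sorted newest first."""
    # reverse=True tells heapq.merge the inputs are in descending order.
    return heapq.merge(*streams, key=lambda event: event[0], reverse=True)


# Hypothetical per-indexer results as (epoch_seconds, raw_event) pairs.
indexer_a = [(1700000300, "a3"), (1700000100, "a1")]
indexer_b = [(1700000250, "b2"), (1700000050, "b0")]

merged = list(merge_newest_first(indexer_a, indexer_b))
for timestamp, raw in merged:
    print(timestamp, raw)
```

Because the merge is lazy, the newest events can be shown before the older ones have even been read.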
One reason for this is that many people stop the search when only partial results have been retrieved. Once they see the most recent events on screen, they may see "what's going on" and have no need to see older results.
If you want to see your results in time order, just append | reverse to your search string. It may take longer before you start to see the results though.