When I search using key-value pairs as terms, what kind of optimizations does Splunk perform to retrieve the events that match my terms in the smallest amount of time?
For a given search, such as
> index=myindex myfield=yak
Splunk does not normally retrieve all events from myindex and then
check, after extracting fields, whether myfield exists with the value
yak, because that would be far too slow for typical use.
Instead, Splunk uses search-time configuration information to
determine what possible origins could exist for the field, and
generates a much more constrained search which only returns events
that could possibly produce a field called myfield with the value
"yak".
In the simplest case, there are no lookups, extractions, calculated
fields, and so on which can ever produce the field called myfield. In
this case the search effectively becomes:
> index=myindex yak myfield=yak
Splunk assumes that if the field is going to come from something like
autoKV, then the text of the field value will be present in the
keyword index. Thus, via index traversal, Splunk can return only the
events which are referenced by this keyword. Later, after autoKV has
done its work on all of the returned events, we can test whether
myfield has come to exist on those events and, if so, whether it has
the value "yak".
Thus an event such as
TIMESTAMP yak yak yak myfield=ox
will be returned by the index, autoKV'ed and then filtered out,
because myfield will be ox.
However, an event such as
TIMESTAMP cow cow cow myfield=ox
will not be returned from the index at all, and we can skip all the
work of autoKV and other implicit steps prior to postfiltering.
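The two phases above can be sketched in a few lines of Python. This is only a toy model, not how Splunk is implemented: the tokenizer, the `auto_kv` helper, and the third sample event are assumptions made up for illustration.

```python
import re
from collections import defaultdict

# Toy events; the third one is added here so that something matches.
EVENTS = [
    "TIMESTAMP yak yak yak myfield=ox",
    "TIMESTAMP cow cow cow myfield=ox",
    "TIMESTAMP cow cow cow myfield=yak",
]

def build_keyword_index(events):
    """Map each lowercase token to the set of event ids containing it."""
    index = defaultdict(set)
    for event_id, raw in enumerate(events):
        for token in re.findall(r"\w+", raw.lower()):
            index[token].add(event_id)
    return index

def auto_kv(raw):
    """Very rough stand-in for autoKV: pick up key=value pairs."""
    return dict(re.findall(r"(\w+)=(\w+)", raw))

def search(events, index, field, value):
    # Phase 1: index traversal returns only events containing the keyword.
    candidates = sorted(index.get(value.lower(), set()))
    # Phase 2: extract fields on the candidates only, then post-filter.
    matches = [i for i in candidates if auto_kv(events[i]).get(field) == value]
    return candidates, matches

index = build_keyword_index(EVENTS)
candidates, matches = search(EVENTS, index, "myfield", "yak")
print(candidates)  # [0, 2]
print(matches)     # [2]
```

Event 1 (the "cow" event with myfield=ox) is never touched by the extraction step, which is where the savings come from.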
This becomes more complicated when additional potential sources of the
field may exist. For example, you may have a regex-based extraction such as:
props.conf:
REPORT-myfield = myfield-extractor
transforms.conf:
[myfield-extractor]
REGEX = chicken chicken chicken (\w+)\=
FORMAT = $1::yak
Here we have hardcoded the value yak in the transform, so it does
not have to exist in the event. An event such as
TIMESTAMP chicken chicken chicken myfield=goats
would produce a field called myfield with the value yak. The
default optimization won't work, because 'yak' may not be in the
event. In this case you must give Splunk a hint in fields.conf:
[myfield]
INDEXED_VALUE = false
Now Splunk knows it cannot make this optimization, and must simply
retrieve all the events for this index and test for the field and
value presence.
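To see why the keyword shortcut misses this event, here is a small Python sketch. It is an illustration only: the `indexed_value` parameter is a hypothetical stand-in for the fields.conf setting, and the token check is a crude model of the keyword index.

```python
import re

RAW = "TIMESTAMP chicken chicken chicken myfield=goats"

# Mirrors the transform above: the regex captures the field name and the
# value "yak" is hardcoded, as in FORMAT = $1::yak.
EXTRACTOR = re.compile(r"chicken chicken chicken (\w+)=")

def extract(raw):
    m = EXTRACTOR.search(raw)
    return {m.group(1): "yak"} if m else {}

def matches(raw, field, value, indexed_value=True):
    # With indexed_value=True, emulate the keyword pre-filter: the event is
    # only considered if the value occurs as a token in the raw text.
    if indexed_value and value not in re.findall(r"\w+", raw):
        return False
    return extract(raw).get(field) == value

print(matches(RAW, "myfield", "yak", indexed_value=True))   # False
print(matches(RAW, "myfield", "yak", indexed_value=False))  # True
```

With the pre-filter on, the event is dropped before extraction ever runs, even though the extraction would have produced myfield=yak.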
Of course there are many other ways a field can come to exist.
Take for example a lookup such as country_animals.csv:
Country, Animal
USA, chicken
France, rooster
Nepal, yak
Bhutan, yak
Now, this might be configured to be used for your sourcetype, like so:
transforms.conf:
[country_animals]
filename = country_animals.csv
props.conf:
[country_data]
LOOKUP-animals = country_animals Country OUTPUT Animal as myfield
At this point, the lookup could generate our field! So Splunk does a
reverse mapping in this lookup during search startup, and determines
that if Country=Nepal or Country=Bhutan, then myfield would be yak.
Thus the resulting search will look like:
> index=myindex (yak OR
(sourcetype=country_data AND (Country=Nepal OR Country=Bhutan)))
myfield=yak
Or something along these lines.
And so on. The more possible ways the field could come to exist,
the more elaborate the resulting search passed down to the
optimization and fetch layers may be.
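The reverse mapping for the lookup case can be sketched in Python as well. This is a rough illustration, not Splunk's actual code: `reverse_lookup` and `expand_search` are hypothetical helpers, the CSV is written without the spaces after the commas, and the expanded search is just a display string using the country_data sourcetype from the props.conf stanza.

```python
import csv
import io

LOOKUP_CSV = """Country,Animal
USA,chicken
France,rooster
Nepal,yak
Bhutan,yak
"""

def reverse_lookup(csv_text, input_col, output_col, wanted):
    """Return the input values whose output column equals `wanted`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row[input_col] for row in reader if row[output_col] == wanted]

def expand_search(index, field, value, sourcetype, input_col, inputs):
    """Build a display string for the expanded search (illustrative only)."""
    alternatives = " OR ".join(f"{input_col}={v}" for v in inputs)
    return (
        f"index={index} ({value} OR "
        f"(sourcetype={sourcetype} AND ({alternatives}))) "
        f"{field}={value}"
    )

countries = reverse_lookup(LOOKUP_CSV, "Country", "Animal", "yak")
print(countries)  # ['Nepal', 'Bhutan']
print(expand_search("myindex", "myfield", "yak", "country_data", "Country", countries))
```

Each additional source of the field would add another OR branch to the parenthesized part of the search.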
What does :: really do (e.g., country::Japan)? Does it have any extra value, or does the internal optimizer convert = to :: automatically?
In regard to the search optimisations you describe above, do source, sourcetype and host qualify as autoKV fields, or are they something even more optimal than autoKV?
See this other question: http://answers.splunk.com/comments/174575/view.html
I don't know if we support acquiring source, sourcetype, and host via autoKV. I think effectively we don't because it will always be superseded by the built-in values, and thus I expect we construct the search string around this assumption.
As for the behavior of source, sourcetype, and host fields and searching on them, that's a bit out of scope for a comment, but they are potentially more efficient than most fields or keywords.