Are search-time fields slow? Can I rely on them to efficiently sort through my data?
Are there significant differences in searching on automatically created fields from the text of my events, vs. fields that I configure manually? Are some types of extractions faster than others?
Mostly, search-time fields have superior performance to parse-time (indexed) fields, regardless of whether they are explicitly configured.
When running a search that includes a term such as fieldname=value, Splunk will treat this as a search-time field by default, unless fieldname is explicitly configured as an indexed field in fields.conf. This is true both for configured fields (delimiters, regular expressions) and for automatically identified fields where, e.g., you have fieldname:value in the text of your event. We call this automatic handling code "auto-kv", for automatic key-value extraction.
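To make the distinction concrete, a minimal fields.conf sketch (the field name here is hypothetical) that tells the search side to treat a field as indexed rather than search-time might look like this; the actual index-time extraction itself is configured separately, in transforms.conf:

    [fieldname]
    INDEXED = true

Without such a stanza, a search term like fieldname=value is resolved at search time.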
The Splunk search machinery presumes that value will be present in the events as an indexed string, and will apply the same mechanics to filter the events as if you had entered the string directly, without the fieldname or equals sign. For most patterns, this offers all the performance advantage of a parse-time field, and none of the penalty. The tradeoffs are discussed in more detail in "About indexed field extraction" in the Getting Data In Manual.
In all cases, the post-filtering is applied to the (hopefully) small set of events that actually contain the value string, by applying any extraction mechanisms, then testing to see if the field has been created containing the desired value.
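To sketch how this plays out, a search like the following (the sourcetype and field are just illustrative):

    sourcetype=app_logs storage_format=xml

is first narrowed using the bare indexed term xml; only the events that survive that filter are run through field extraction and then checked for storage_format=xml.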
Ideally the index-based filtering is the most important factor in the speed of your search, but there are cases when search-time extraction must be applied to a large percentage of events. For example, if almost all of your events contain the word xml but only a small portion have this value in the storage_format field, the speed of extraction becomes important. Delimiter-based extractions are quite fast. Auto-kv is quite fast. Regex-based extractions are slower. Sourcetypes with a very large number of regexes, or with very inefficient regexes, can be slower still.
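As a rough illustration of the two styles (the sourcetype and stanza names are hypothetical), a delimiter-based extraction and a regex-based extraction might be configured like this:

    # props.conf
    [my_sourcetype]
    REPORT-delim_fields = my_delim_extraction
    EXTRACT-storage = storage_format=(?<storage_format>\w+)

    # transforms.conf
    [my_delim_extraction]
    DELIMS = ",", "="

The DELIMS-based extraction just splits on fixed characters, which is cheap; the EXTRACT regex has to run against every event that survives the indexed-term filter, so an expensive pattern (or many patterns on one sourcetype) shows up directly in search time.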
jrodman has a good answer. I just want to add a couple of examples of scenarios where indexed fields are the best answer:

1. You have a value like ABC123456 in your event, and you want a field with the value 123456. In this scenario the lookup can be very slow, because you can't use indexed terms for the lookup (in order to actually search on this field, you have to set INDEXED_VALUE=false in the fields.conf file). So if you use this field frequently for searching, an indexed field is your best option.

2. Your event text says something like changed object myclass.myfunction in package mypackage, which is rather frustrating, since what you want is a single field that contains the concatenated value mypackage.myclass.myfunction. In this case you have to either use an indexed field, or extract two different fields and use a search-time pipeline command to combine the values.

These aren't your normal scenarios. Indexed fields are certainly a great option for these situations, but they require a lot more maintenance than the normal search-time field extraction setups. So whenever possible, go with field extractions over indexed fields.
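For the second scenario, a sketch of what the index-time setup could look like (all stanza and field names here are made up for illustration):

    # transforms.conf
    [full_object_path]
    REGEX = changed object (\S+) in package (\S+)
    FORMAT = object_path::$2.$1
    WRITE_META = true

    # props.conf
    [my_sourcetype]
    TRANSFORMS-objpath = full_object_path

    # fields.conf
    [object_path]
    INDEXED = true

Because the concatenated value is written into the index at parse time, a search on object_path=mypackage.myclass.myfunction can use the indexed term directly, but any change to the regex or format only applies to newly indexed data.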
Also keep in mind that for the occasional field extraction, you can always use rex to explicitly pull out matches at search time. (You could use this in combination with something like | extract limit=0 to disable unwanted extractions at search time.) This may be a good option for fields that you never search on interactively, or that are only used in one or two saved searches. This is especially true if it's a costly regex.
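A sketch of that pattern (the field names and regex are invented for the example):

    ... | extract limit=0 | rex field=_raw "object (?<class_name>\S+) in package (?<package_name>\S+)"

Here rex pulls out only the fields this particular search needs, and extract limit=0 is the knob mentioned above for keeping other automatic extractions from running.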
BTW, does anyone know if there is a way to profile the regexes within Splunk? Perhaps find out which ones are the most costly? (I think that's on topic here.)
As for your trick of doing the extractions for only some searches, this should be less necessary as we get better at making the UI less demanding of all fields. Certainly for scheduled searches and for command-line searches in 4.0, you don't have to pay for fields you don't use. We don't have search profiling of any significant sort in 4.0, but it's currently under discussion (which makes me think it's 4.2-ish).
To add color to your examples:
Yes, and yes, but do evaluate whether these fields will be ones that narrow your dataset by several orders of magnitude. If yes, and you search on them significantly, the indexed field choice is probably worthwhile. If they are only likely to filter your search by 100 to 1 or so, it may not be worth it. At around 10 to 1, it is unlikely to be worth it.
For your myclass/mypackage/myfunction example, it can be performant to just search on the three fields, unless these terms are quite common in other contexts.
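As a sketch (with hypothetical search-time field names), the three-field search would just be:

    package=mypackage class=myclass function=myfunction

Each value is still used as an indexed-term filter on its own, so unless mypackage, myclass, or myfunction are common words elsewhere in your data, the search-time post-filtering only has to touch a small set of events.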