Solved: Do search-time fields have performance considerati...

jrodman · ‎03-23-2010

Are search-time fields slow? Can I rely on them to efficiently sort through my data?

Are there significant differences between searching on fields created automatically from the text of my events and searching on fields that I configure manually? Are some types of extractions faster than others?

jrodman · ‎03-23-2010

Mostly, search-time fields have superior performance to parse-time (indexed) fields, regardless of whether they are explicitly configured.

When running a search that includes a term such as fieldname=value, Splunk will treat this as a search-time field by default, unless fieldname is explicitly configured as an indexed field in fields.conf. This is true both for configured fields (delimiters, regular expressions) and for automatically identified fields where, e.g., you have fieldname:value in the text of your event. We call this automatic handling code auto-kv, for automatic key-value extraction.

The Splunk search machinery presumes that value will be present in the events as an indexed string, and will apply the same mechanics to filter the events as if you entered the string directly without the fieldname or equals sign. For most patterns, this offers all the performance advantage of a parse-time field, and none of the penalty. The tradeoffs are discussed in more detail in "About indexed field extraction" in the Getting Data In Manual.

In all cases, a post-filtering step is then applied to the (hopefully) small set of events that actually contain the value string: any extraction mechanisms are run on those events, and each one is tested to see whether the field was created with the desired value.
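To make the two steps concrete, here is a minimal sketch in Python. It is a toy model and not Splunk's actual implementation: the events, the tokenizer, and the helper names (build_index, auto_kv, search) are all made up for illustration. It shows the index narrowing the search to events that contain the value as a term, and the extraction then being run only on those candidates.

```python
import re
from collections import defaultdict

# Hypothetical events. The second one contains "404", but not as the status value.
EVENTS = [
    "method=GET status=404 bytes=512",
    "method=GET status=200 bytes=404",
    "method=POST status=200 bytes=128",
]

def tokenize(event):
    """Split an event into lowercase terms (a simplification of real segmentation)."""
    return [t for t in re.split(r"[^A-Za-z0-9_]+", event.lower()) if t]

def build_index(events):
    """Map each term to the set of event positions that contain it."""
    index = defaultdict(set)
    for pos, event in enumerate(events):
        for term in tokenize(event):
            index[term].add(pos)
    return index

def auto_kv(event):
    """Pull key=value pairs out of the raw text, loosely imitating auto-kv."""
    return dict(re.findall(r"(\w+)=(\w+)", event))

def search(events, index, field, value):
    """Step 1: index lookup on the bare value. Step 2: extract and test the field."""
    candidates = sorted(index.get(value.lower(), set()))
    matches = [events[i] for i in candidates if auto_kv(events[i]).get(field) == value]
    return candidates, matches

INDEX = build_index(EVENTS)
candidates, matches = search(EVENTS, INDEX, "status", "404")
print(candidates)  # positions of events that contain the term at all
print(matches)     # events where the extracted field really has the value
```

In this toy data the third event is never extracted at all, which is where the speed comes from; the second event is a candidate that the post-filter then rejects.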

Ideally the index-based filtering is the most important factor in the speed of your search, but there are cases when search-time extraction must be applied to a large percentage of events. For example, if almost all of your events have the word xml but only a small portion have this value in the storage_format field, the speed of extraction becomes important. Delimiter-based extractions are quite fast. Auto-kv extractions are quite fast. Regex-based extractions are slower. Sourcetypes with a very large number of regexes or very inefficient regexes can be slower still.
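For readers unfamiliar with the distinction, the following Python sketch illustrates what a delimiter-based and a regex-based extraction each do to the same event. The event layout and the field names are assumptions made up for this example, and nothing here reflects how Splunk implements either mechanism or how fast it is; the point is only that the delimiter version is a plain split, while the regex version has to run a pattern match per event.

```python
import re

# Hypothetical comma-separated event.
EVENT = "2010-03-23,alice,xml,save"

# Delimiter-based: split on a fixed character and pair the pieces with known names.
FIELD_NAMES = ["date", "user", "storage_format", "action"]

def extract_by_delimiter(event, delimiter=","):
    return dict(zip(FIELD_NAMES, event.split(delimiter)))

# Regex-based: a pattern with named groups describes the same layout.
PATTERN = re.compile(
    r"^(?P<date>[^,]+),(?P<user>[^,]+),(?P<storage_format>[^,]+),(?P<action>[^,]+)$"
)

def extract_by_regex(event):
    match = PATTERN.match(event)
    return match.groupdict() if match else {}

print(extract_by_delimiter(EVENT))
print(extract_by_regex(EVENT))
```

Both functions produce the same fields for this event, so the choice between them is about cost and about how regular the data is.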


Lowell · ‎03-24-2010

jrodman has a good answer. I just want to add a couple of examples of scenarios where indexed fields are the best answer:

If the field that you are trying to extract is a subset of a term. In other words, say you have the term ABC123456 in your event, and you want a field with the value 123456. In this scenario, the lookup can be very slow because you can't use indexed terms for it. (In order to actually search on this field, you have to set INDEXED_VALUE=false in the fields.conf file.) So if you use this field frequently for searching, an indexed field is your best option.
If the field is a combination of terms that are not adjacent in the event. For example, say you have event text like: ... changed object myclass.myfunction in package mypackage, which is rather frustrating, since what you want is a single field that contains the concatenated value mypackage.myclass.myfunction. In this case you have to either use an indexed field, or extract two different fields and use a search-time pipeline command to combine the values.
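Both scenarios can be sketched in a few lines of Python. This is a toy model rather than Splunk code: the events, the term splitter, and the helper names are assumptions for illustration. The first part shows why a term lookup for 123456 finds nothing when the only term present is ABC123456, so every event has to be extracted and tested. The second part shows the "extract two pieces and combine them" approach for the package example.

```python
import re

# Hypothetical events for the subset-of-a-term scenario.
EVENTS = [
    "job ABC123456 finished",
    "job ABC999999 finished",
    "heartbeat ok",
]

def terms(event):
    """Whole terms in an event (a simplification of real segmentation)."""
    return {t for t in re.split(r"[^A-Za-z0-9_]+", event.lower()) if t}

def extract_id(event):
    """Search-time style extraction of the numeric part of the ABC term."""
    m = re.search(r"\bABC(\d+)\b", event)
    return m.group(1) if m else None

# Term lookup: "123456" is never a whole term, so it yields no candidates.
index_hits = [e for e in EVENTS if "123456" in terms(e)]

# Without a usable term, every event must be extracted and then tested.
scan_hits = [e for e in EVENTS if extract_id(e) == "123456"]

def extract_qualified_name(event):
    """Extract two non-adjacent pieces and join them into one value."""
    m = re.search(r"changed object (\S+) in package (\w+)", event)
    return f"{m.group(2)}.{m.group(1)}" if m else None

print(index_hits)
print(scan_hits)
print(extract_qualified_name("changed object myclass.myfunction in package mypackage"))
```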

These aren't your normal scenarios. Indexed fields are certainly a great option for these situations, but they require a lot more maintenance than the normal search-time field extraction setups. So whenever possible, go with search-time field extractions over indexed fields.

Also keep in mind that for the occasional field extraction, you can always use rex to explicitly pull out matches at search time. (You could use this in combination with something like | extract limit=0 to disable unwanted extractions at search time.) This may be a good option for fields that you never search for interactively, or that are only used in one or two saved searches. This is especially true if it's a costly regex.

BTW, does anyone know if there is a way to profile the regexes within Splunk? Perhaps find out which ones are the most costly? (I think that's on topic here.)

jrodman · ‎03-24-2010

As for your trick for doing the extractions for only some searches, this should be less necessary as we get better at making the UI be less demanding of all fields. Certainly for scheduled searches and for command line searches in 4.0, you don't have to pay for fields you don't use. We don't have search profiling of any significant sort in 4.0, but it's currently under discussion (which makes me think it's 4.2-ish).

jrodman · ‎03-24-2010

To add color to your examples:

Yes, and yes, but do evaluate whether these fields will be ones that narrow your dataset by several orders of magnitude. If yes, and you search on them significantly, the indexed field choice is probably worthwhile. If they are only likely to filter your search by 100 to 1 or so, it may not be worth it. At around 10 to 1, it is unlikely to be worth it.

For your myclass/mypackage/myfunction example, it can be performant to just search on the three fields, unless these terms are quite common in other contexts.

