Splunk Search

Should I use an index-time field extraction?


Dear fellow Splunkers,

I have seen the docs on index-time field extractions and a few related answers here and there, with the general guidance that an index-time extraction is rarely needed or beneficial.

However, I have a dedicated index that holds Apache log files for a lot of different virtual hosts. I have set up search-time field extractions to extract the apache_virtualhost and the HTTP status code.

Now, the following search

index=web apache_virtualhost=some.virtual.host  | timechart count by status    

is very slow and does not complete even after a few minutes, keeping the CPU 100% busy. However,

index=web source=/path/to/logs/for/this/vhost-only.log | timechart count by status

returns a result in an acceptable amount of time.

Being used to relational DBs, I immediately thought "sure, in the second case Splunk can retrieve the small subset of matching rows from the index, whereas the first case needs to push all rows through the regexp first". But that is probably not how Splunk works.

So, is there a simple explanation for this? Have I found one of the rare cases where an index-time field extraction would make sense?

Thanks for sharing your insights!


Re: Should I use an index-time field extraction?


The easy answer is "it depends".

Comparing your two cases, and assuming the set of matched events is identical, it depends on how the field-value pair apache_virtualhost=some.virtual.host behaves. In the general case, Splunk retrieves all events containing "some.virtual.host", performs field extractions, and then filters for the actual field-value pair - potentially discarding many events. If "some.virtual.host" is fairly unique to those events that actually match, this kind of filtering will be fast. If the value is an IP, this kind of filtering may be fairly slow: an address such as 10.1.2.3 is retrieved as 10 AND 1 AND 2 AND 3, so unrelated events that merely contain those common tokens would also be retrieved off disk, field-extracted, and then discarded.
If apache_virtualhost is a calculated field, this optimization of only retrieving potential matches cannot be used, and all events have to be retrieved.
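As a side note (my example, not part of the explanation above): if you know the exact term as it appears in the raw event, the TERM() directive forces Splunk to look up the whole term in the index instead of its minor segments, which avoids retrieving unrelated events. This only works when the value is bounded by major breakers (e.g. spaces) in the raw data; the field name src_ip here is hypothetical:

    index=web TERM(10.1.2.3) src_ip=10.1.2.3 | timechart count by status

The TERM() filter narrows the disk retrieval, while the field-value pair still guarantees the extracted field actually matches.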

In general, good reasons for using index-time field extractions can be:

  • you frequently have to search NOT field=value
  • the value frequently appears outside of the field; a common case is small integers
  • the value is not a whole token but only part of one; a common case is the country code in an IBAN (the first two characters).
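For reference, an index-time extraction of the virtual host could look roughly like this (a sketch, not tested; the sourcetype name apache:access and the regex are assumptions about your data - adjust both):

    # props.conf
    [apache:access]
    TRANSFORMS-vhost = apache_vhost_indexed

    # transforms.conf (assumes the vhost is the first field in each line)
    [apache_vhost_indexed]
    REGEX = ^(\S+)
    FORMAT = apache_virtualhost::$1
    WRITE_META = true

    # fields.conf
    [apache_virtualhost]
    INDEXED = true

Note that this only applies to data indexed after the change - existing events are not re-indexed. Once in place, searching apache_virtualhost::some.virtual.host (double colon) hits the indexed field directly.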

There are good reasons against, too; some of them are:

  • loss of flexibility - no "schema on the fly" at search time
  • increased storage cost, and some downstream speed cost because "bigger is slower"
  • potential for name conflicts - a field name is either indexed or it's not (see fields.conf for the relevant setting), so you can't have one sourcetype where it's extracted at search time and another where it's indexed

For insight into how the search works, first set infocsv_loglevel=DEBUG in limits.conf to get the lispy output in the job inspector, then look at the bonus slides after the end of my .conf 2015 talk at http://conf.splunk.com/session/2015/conf2015_MMueller_Consist_Deploying_OptimizingSplunkKnowledge.pd... - they describe how events are identified for retrieval off disk.
The key indicator for "indexed fields may help" is a high scanCount compared to a low eventCount in the job inspector. If those two are similar, indexed fields won't be able to help much in most scenarios.
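To compare those two numbers across recent searches without opening the job inspector for each one, you can also query the jobs endpoint over REST (a sketch; run from the search bar, and requires permission to use the rest command):

    | rest /services/search/jobs splunk_server=local
    | table sid, label, scanCount, eventCount

A scanCount far above eventCount means many events were pulled off disk and field-extracted only to be discarded by the filter - the situation where an indexed field can help.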
