So this is really a theoretical question based on me trying to wrap my head around Splunk. The purpose of the Common Information Model is to normalize disparate logs into a common schema in an otherwise schemaless engine. As the documentation states, there is no need to index all of those fields. However, at an enterprise-scale install with a security bent there is a small handful of fields that are probably used in the majority of searches: source IP, destination IP, destination port, and maybe another field or two for action or message. I get that one of the reasons not to index tons of fields is throughput concerns, but in this case you probably have several indexers, so the overall load is distributed, and you are only talking about 4 or 5 fields.
There are a number of variables in play but I'm wondering if any extra time spent on indexing would be returned in spades at search time.
If your question is "should you extract those key fields?" - my answer is yes. Field extractions make event correlation, searching and reporting easier to do, and give more targeted results. A large number of search-time field extractions can slow your searches, but there are various ways to control field extraction (like the fields command and the field discovery button). So I would definitely do these key field extractions.
If your question is "should these be index-time field extractions rather than search-time field extractions?" - my answer is no. Index-time field extractions are not faster except in rare corner cases. Your fields are not - most definitely not - one of the corner cases. And index-time field extractions can be troublesome in other ways - they are fragile and difficult to change, for example.
I think one reason that people want to "index fields" is that they are thinking about how things work in relational databases. Splunk is not a relational database, so most of the performance tips for relational databases simply don't apply.
Splunk indexes every word in your data - whether that word is part of a field or not. That's why it's called an "index" instead of a "database." The distinction is important. This is why you do not need index-time field extractions - all the data is already indexed.
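To make that concrete, here is a toy sketch in Python of the general idea: every token in the raw event goes into an inverted index, a search uses that index to narrow down candidate events, and fields are only extracted from those candidates at search time. This is not Splunk's actual implementation; the sample events, the key=value layout, and every function name below are made up for illustration.

```python
import re
from collections import defaultdict

# Hypothetical sample events; the key=value layout is an assumption.
EVENTS = [
    "2012-08-01 10:00:01 action=blocked src_ip=10.1.1.5 dest_ip=192.168.0.9 dest_port=22",
    "2012-08-01 10:00:02 action=allowed src_ip=10.1.1.7 dest_ip=192.168.0.9 dest_port=443",
    "2012-08-01 10:00:03 action=blocked src_ip=10.1.1.5 dest_ip=192.168.0.12 dest_port=3389",
]


def tokenize(event):
    """Split a raw event into tokens on whitespace and '=' (a simplification)."""
    return [t for t in re.split(r"[\s=]+", event) if t]


def build_index(events):
    """Map every token to the set of event ids containing it, field or not."""
    index = defaultdict(set)
    for event_id, event in enumerate(events):
        for token in tokenize(event):
            index[token].add(event_id)
    return index


def extract_fields(event):
    """Search-time extraction: pull key=value pairs out of the raw text."""
    return dict(re.findall(r"(\w+)=(\S+)", event))


def search(index, events, field, value):
    """Find events where field == value.

    The token index narrows the candidates first; fields are extracted
    only from those candidates, then checked against the requested field.
    """
    candidates = index.get(value, set())
    matches = []
    for event_id in sorted(candidates):
        if extract_fields(events[event_id]).get(field) == value:
            matches.append(event_id)
    return matches


INDEX = build_index(EVENTS)
print(search(INDEX, EVENTS, "src_ip", "10.1.1.5"))
print(search(INDEX, EVENTS, "dest_ip", "10.1.1.5"))
```

In this sketch the value 10.1.1.5 is found through the token index no matter which field it belongs to, so the regex only runs against the two events that contain it. The second search shows the search-time check doing its job: the token is present in those events, but not as dest_ip, so nothing matches.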
Of course, if you meant something else by your questions, then this answer is lame. 🙂
Good point about that data not being there natively for the most part and honestly I'm surprised I didn't make the connection. Thinking back on it I don't recall any of the documentation making that base level distinction.
The reason that source, sourcetype and host are indexed fields is simple: they are not contained within the input data! (Well, sometimes host is contained in the input data, such as in syslog. But you can't count on that.)
And you aren't being antagonistic, these are great questions. I can tell that you are really trying to understand why!
One of the reasons for the concern re: the ES app - if it isn't correctly hooked up to your particular company's data feeds, then it is a dud. Since these are unique to each company, documentation alone may not be enough... At least that's my thinking.
Thanks for the link! Hopefully I'm not coming off as being antagonistic; I'm not trying to be. At some level I'm trying to parse a number of things I've heard about Splunk's ES and compliance apps (e.g., a dedicated search head is recommended, a multi-week PS engagement to conform your data into what the app requires for input, etc.) and understand how they are influencing the evolution of Splunk itself in 5.0 and beyond.
There's really no noticeable difference in performance between the scenarios you refer to. Please read dwaddle's excellent explanation on what goes on during a Splunk search: http://splunk-base.splunk.com/answers/54207/slow-search-when-evaluating-a-numeric-value?page=1&focus...
So let me ask this in a slightly different way. Are searches that reference fields extracted at index time faster than searches against fields that are extracted at search time? I need to go back and look at the documentation, but my thought is that, at least on some level, the host, sourcetype, and source fields are extracted at index time because they are the lowest common denominator among all log types with the maximum value, and should be used in your initial search if you are concerned about efficiency.