I would like to know how I can selectively extract fields at index time from our log files, which are in CSV format. Let's say my log file looks like this.
Let's say I would like to extract only field8, field9, and field11 at index-time. The rest would get extracted at search time.
We're currently extracting all fields at index time, which is causing disk space issues. I'm testing extracting all fields at search time. I would also like to extract at index time a couple of fields that are used in almost all searches, and leave the rest to search time, to work out the best search performance.
As a general rule, it is better to perform most knowledge-building activities, such as field extraction, at search time. Additionally, custom field extraction, performed at index time, can degrade performance at both index time and search time. When you add to the number of fields extracted during indexing, the indexing process slows. Later, searches on the index are also slower, because the index has been enlarged by the additional fields, and a search on a larger index takes longer. You can avoid such performance issues by instead relying on search-time field extraction.
How are you extracting your fields now? Are you using
INDEXED_EXTRACTIONS=CSV or are you extracting fields individually?
I wouldn't say that searches on indexes with additional indexed fields are slower. In fact, they can be significantly faster because you have the ability to filter directly on those fields and take advantage of the inverted index and underlying bloom filters, similar to using host, source and sourcetype. Using
INDEXED_EXTRACTIONS will create an index time field extraction for every field in the structured dataset. The size of the index can grow significantly but it depends on how large your events are and the cardinality (number of unique values) that each of the indexed fields has.
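Since index growth depends on the cardinality of each indexed field, it can help to measure cardinality on a sample of the data before deciding which fields to index. The following is only a rough sketch in Python, run outside Splunk. The function name column_cardinality and the sample rows are made up for illustration, and it assumes a plain comma-separated file with no header row.

```python
import csv
import io
from collections import defaultdict


def column_cardinality(csv_text, has_header=False):
    """Return {column_position: number_of_distinct_values}, positions 1-based."""
    distinct = defaultdict(set)
    reader = csv.reader(io.StringIO(csv_text))
    if has_header:
        next(reader, None)  # skip the header row if there is one
    for row in reader:
        for position, value in enumerate(row, start=1):
            distinct[position].add(value)
    return {position: len(values) for position, values in distinct.items()}


# Hypothetical sample: column 1 is a unique id, column 2 a log level.
sample = "1001,debug,alpha\n1002,debug,beta\n1003,info,alpha\n"
cardinality = column_cardinality(sample)
print(cardinality)
```

Columns with few distinct values (a log level, for example) are cheaper candidates for index-time extraction than columns that are unique per event.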
You won't be able to use
INDEXED_EXTRACTIONS=CSV to selectively create index time field extractions. That setting will create an index time field extraction for every field in your CSV file, which isn't what you want.
If you want to have certain fields be index time field extractions and others search time extractions you'll need a combination of settings in props.conf, transforms.conf and fields.conf.
Have a look at the docs here for creating index time field extractions:
Here are the docs for search time field extractions:
If you expect the format of your CSV file to ever change you need to be cautious about using index time field extractions. If any of your select fields changes position or another field is added then you can wind up with incorrect data in certain fields depending on how well written your regex is/isn't. The recommendation to use search time field extractions is generally for flexibility in case your data ever changes. An index time field extraction is a permanent piece of metadata for every event and it can't be changed. See this section in the docs for more details.
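To illustrate the risk with positional extractions, here is a small Python sketch. It is not Splunk configuration. The four-column layout, the sample values and the extract_level helper are all hypothetical. It shows how a regex that picks a field by position silently returns the wrong value once a column is inserted ahead of it.

```python
import re

# Hypothetical 4-column layout: timestamp, host, level, message.
# The pattern takes the 3rd comma-separated value as the "level".
LEVEL_BY_POSITION = re.compile(r"^[^,]*,[^,]*,([^,]*),")


def extract_level(line):
    """Return the 3rd comma-separated value, or None if the line is too short."""
    match = LEVEL_BY_POSITION.match(line)
    return match.group(1) if match else None


original_line = "2024-01-01T00:00:00,web01,debug,started"
# The same event after a hypothetical new column ("eu-west") is added at position 2.
changed_line = "2024-01-01T00:00:00,eu-west,web01,debug,started"

level_before = extract_level(original_line)
level_after = extract_level(changed_line)
print(level_before, level_after)
```

The pattern still matches after the change, so nothing fails visibly. The field just carries the host name instead of the level. With an index-time extraction, events already indexed with the wrong value stay that way.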
Thanks shaskell. I'll read the provided documentation and will implement it accordingly in my test environment. I don't expect the CSV file to ever change. I'm hoping that extracting a couple of frequently used fields at index time will speed up our searches while keeping the index size to a minimum. The fields I'd like to extract have a limited number of values, so the tsidx files won't be huge.
I reckon the bit in the docs you want to pay closest attention to is at...
Which leads me to conclude you want something like the following. The RegEx is as per the emails you and I exchanged.
Add this to transforms.conf...
[indexed-extractions] REGEX = ^([^,]*?),([^,]*?),([^,]*?),([^,]*?),([^,]*?),([^,]*?),([^,]*?),([^,]*?),([^,]*?),([^,]*?),([^,]*?),([^,]*?),([^,]*?),([^,]*?),([^,]*?),([^,]*?),([^,]*?),([^,]*?),(.*) FORMAT = field8::"$8" field9::"$9" field11::"$11" WRITE_META = true
Add this to props.conf...
[whatever-the-name-of-your-sourcetype]
TRANSFORMS-extractions = indexed-extractions
Add this to fields.conf...
[field8]
INDEXED=true

[field9]
INDEXED=true

[field11]
INDEXED=true
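Before restarting anything, it may be worth sanity-checking the REGEX against a sample line outside Splunk. Below is a rough Python sketch of that check. It assumes the pattern is 18 lazy single-field groups followed by a catch-all group, which is how the REGEX line above reads. The sample line of placeholder values v1 to v20 and the indexed_fields helper are made up for illustration. Python's re module and Splunk's PCRE treat this particular pattern the same way, but that is an assumption for anything more exotic.

```python
import re

# Rebuild the pattern: 18 lazy single-field groups, then a catch-all group.
# The count of 18 is an assumption read off the REGEX line in transforms.conf.
FIELD = r"([^,]*?)"
PATTERN = re.compile("^" + ",".join([FIELD] * 18) + ",(.*)")

# Mirrors FORMAT = field8::"$8" field9::"$9" field11::"$11"
INDEXED_GROUPS = {"field8": 8, "field9": 9, "field11": 11}


def indexed_fields(line):
    """Return the values the transform would write for one raw CSV line."""
    match = PATTERN.match(line)
    if match is None:
        return {}  # fewer than 19 columns: the transform would not fire
    return {name: match.group(number) for name, number in INDEXED_GROUPS.items()}


# Hypothetical 20-column event: v1,v2,...,v20
sample_line = ",".join(f"v{n}" for n in range(1, 21))
extracted = indexed_fields(sample_line)
print(extracted)
```

Note that the pattern needs at least 19 comma-separated columns to match at all, and that a quoted value containing a comma would shift every group after it.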
Tony, thanks for providing the required configuration. I've managed to get the selected fields extracted at index time and the rest at search time.
Now my next issue: we currently have a search which creates a field at search time. It works against the index which has all fields extracted at index time, but it's not working when run on the new index I've created. What am I missing?
index=mynewindex level=debug | rex field=message ".value_to_be_extracted." | fields + newfield earliest latest
When I run this search on the current index it works and creates the new field, but I don't get the new field when it's run on the new index. In both cases I'm getting the same number of events, so the data is there and is being searched; the field just isn't being created.