How to extract selective fields from our log files in CSV format at index-time?

shahzadarif · ‎06-29-2016

I would like to know how I could extract selected fields at index time from our log files, which are in CSV format. Let's say my log file looks like this:

date,field1,field2,field3,field4,field5......field20

Let's say I would like to extract only field8, field9, and field11 at index-time. The rest would get extracted at search time.

We're currently extracting all fields at index time, which is causing disk space issues. I'm testing extracting all fields at search time instead. I would also like to extract at index time a couple of fields that are used in almost all searches, and leave the rest to search time, to work out the best search performance.

tread_splunk · ‎06-29-2016

Hi Shahzad,

I reckon the bit in the docs you want to pay closest attention to is at...

http://docs.splunk.com/Documentation/Splunk/6.4.1/Data/Configureindex-timefieldextraction#Define_a_n...

Which leads me to conclude you want something like the following. The RegEx is as per the emails you and I exchanged.

Add this to transforms.conf...

[indexed-extractions]
REGEX = ^([^,]*?),([^,]*?),([^,]*?),([^,]*?),([^,]*?),([^,]*?),([^,]*?),([^,]*?),([^,]*?),([^,]*?),([^,]*?),([^,]*?),([^,]*?),([^,]*?),([^,]*?),([^,]*?),([^,]*?),([^,]*?),(.*)
FORMAT = field8::"$8" field9::"$9" field11::"$11"
WRITE_META = true

Add this to props.conf...

[whatever-the-name-of-your-sourcetype]
TRANSFORMS-extractions = indexed-extractions

Add this to fields.conf...

[field8]
INDEXED=true

[field9]
INDEXED=true

[field11]
INDEXED=true
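Before deploying, it may be worth checking which column each capture group actually picks up. The sketch below is only an illustration in Python (not Splunk itself): it rebuilds the same pattern and runs it against a made-up sample line that follows the layout in the question (date first, then field1 to field20). The sample values and the helper name `captured` are assumptions. The regex above was agreed over email, so the real column order may differ.

```python
import re

# Rebuild the pattern from the transforms.conf stanza:
# 18 lazy non-comma groups, each followed by a comma, then a catch-all group.
PATTERN = re.compile("^" + ",".join(["([^,]*?)"] * 18) + ",(.*)")

# Hypothetical sample line using the layout from the question:
# date, then field1 .. field20 (values are made up for illustration).
SAMPLE = ",".join(["2016-06-29"] + [f"value{i}" for i in range(1, 21)])


def captured(line):
    """Return what $8, $9 and $11 would hold for this line, or None if no match."""
    match = PATTERN.match(line)
    if match is None:
        return None
    return {
        "field8": match.group(8),
        "field9": match.group(9),
        "field11": match.group(11),
    }


print(captured(SAMPLE))
```

If the date really is the first column, group 8 holds the value of field7 rather than field8, so the FORMAT line would need $9, $10 and $12 instead. If the real file has no leading date column, the stanza is fine as written.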

Good luck!

shahzadarif · ‎07-03-2016

Tony, thanks for providing the required configuration. I've managed to get the selected fields extracted at index time and the rest at search time.

Now my next issue: we currently have a search which creates a field at search time. It works against the index which has all fields extracted at index time, but it's not working when run on the new index I've created. What am I missing?

Search is

index=mynewindex level=debug | rex field=message ".value_to_be_extracted."  | fields + newfield earliest latest

When I run this search on the current index it works and creates the new field, but I don't get the new field when it's run on the new index. In both cases I'm getting the same number of events, so the data is there and the search is finding it; it's just not creating the field.

shaskell_splunk · ‎06-29-2016

You won't be able to use INDEXED_EXTRACTIONS=CSV to selectively create index time field extractions. That setting will create an index time field extraction for every field in your CSV file which isn't what you want.

If you want to have certain fields be index time field extractions and others search time extractions you'll need a combination of settings in props.conf, transforms.conf and fields.conf.

Have a look at the docs here for creating index time field extractions:
http://docs.splunk.com/Documentation/Splunk/6.4.1/Data/Configureindex-timefieldextraction

Here are the docs for search time field extractions:
http://docs.splunk.com/Documentation/Splunk/6.4.1/Knowledge/Createandmaintainsearch-timefieldextract...

If you expect the format of your CSV file to ever change you need to be cautious about using index time field extractions. If any of your select fields changes position or another field is added then you can wind up with incorrect data in certain fields depending on how well written your regex is/isn't. The recommendation to use search time field extractions is generally for flexibility in case your data ever changes. An index time field extraction is a permanent piece of metadata for every event and it can't be changed. See this section in the docs for more details.

http://docs.splunk.com/Documentation/Splunk/6.4.1/Data/Aboutindexedfieldextraction
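To illustrate the risk of position-based extraction, here is a minimal Python sketch (not Splunk configuration). The sample lines, the `extra` column and the helper `extract_by_position` are all hypothetical; the point is only that a position-based rule keeps returning values after the layout changes, but from the wrong columns.

```python
def extract_by_position(line, positions):
    """Pick named fields out of a CSV line purely by column index."""
    columns = line.split(",")
    return {name: columns[index] for name, index in positions.items()}


# 0-based column indexes, assuming the date sits at index 0.
POSITIONS = {"field8": 8, "field9": 9, "field11": 11}

# Original layout: date, field1 .. field20 (made-up values).
old_line = ",".join(["2016-06-29"] + [f"value{i}" for i in range(1, 21)])

# Same event after a hypothetical new column is added right after the date.
new_line = ",".join(["2016-06-29", "extra"] + [f"value{i}" for i in range(1, 21)])

print(extract_by_position(old_line, POSITIONS))
print(extract_by_position(new_line, POSITIONS))
```

With the second line every field silently shifts by one column. At search time you would fix the extraction and re-run the search; at index time the wrong values are already written for those events.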

shahzadarif · ‎06-29-2016

Thanks shaskell. I'll read the provided documentation and implement it accordingly in my test environment. I don't expect the CSV file to ever change. I'm hoping that extracting a couple of frequently used fields at index time will speed up our searches while keeping the index size to a minimum. The fields I'd like to extract have a limited number of values, so the tsidx files won't be huge.

sundareshr · ‎06-29-2016

As a general rule, it is better to perform most knowledge-building activities, such as field extraction, at search time. Additionally, custom field extraction, performed at index time, can degrade performance at both index time and search time. When you add to the number of fields extracted during indexing, the indexing process slows. Later, searches on the index are also slower, because the index has been enlarged by the additional fields, and a search on a larger index takes longer. You can avoid such performance issues by instead relying on search-time field extraction.

How are you extracting your fields now? Are you using INDEXED_EXTRACTIONS=CSV or are you extracting fields individually?

shaskell_splunk · ‎06-29-2016

I wouldn't say that searches on indexes with additional indexed fields are slower. In fact, they can be significantly faster because you have the ability to filter directly on those fields and take advantage of the inverted index and underlying bloom filters, similar to using host, source and sourcetype. Using INDEXED_EXTRACTIONS will create an index time field extraction for every field in the structured dataset. The size of the index can grow significantly but it depends on how large your events are and the cardinality (number of unique values) that each of the indexed fields has.
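If you want a rough feel for cardinality before choosing which fields to index, something like the following Python sketch could be run over a sample of the log file. The header names, the sample rows and the helper `column_cardinality` are assumptions for illustration; this only counts unique values per column and does not predict the actual tsidx size.

```python
import csv
import io
from collections import defaultdict


def column_cardinality(csv_text, header):
    """Count the number of unique values seen in each named column."""
    uniques = defaultdict(set)
    for row in csv.reader(io.StringIO(csv_text)):
        for name, value in zip(header, row):
            uniques[name].add(value)
    return {name: len(uniques[name]) for name in header}


# Hypothetical three-column sample; a real file would have more columns.
HEADER = ["date", "field1", "field2"]
SAMPLE = (
    "2016-06-29,debug,alpha\n"
    "2016-06-29,debug,beta\n"
    "2016-06-30,info,gamma\n"
)

print(column_cardinality(SAMPLE, HEADER))
```

Columns with only a handful of distinct values are the cheaper candidates for index-time extraction; columns where nearly every event has a different value are the ones likely to grow the index.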

