Knowledge Management

Why ever use Structured Data Header Extraction?

jthunnissen
Path Finder

I am confused about when to use Structured Data Header Extraction. Am I correct in understanding that structured data header extraction as documented here implies the use of custom index-time field extraction as documented here? This is unclear, as the documented procedures do not reference each other.

In particular, I am surprised that the documentation on structured data header extraction does not mention the significant performance impact of index-time field extraction. This brings me to my follow-up question. If structured data header extraction does indeed imply the use of custom index-time field extraction: since Splunk states in the documentation of the latter that it should only be used in exceptional cases for specific fields, when is it EVER not an extremely bad idea to index all fields of an input file (as is the case with Structured Data Header Extraction)?


jplumsdaine22
Influencer

So Header Extraction has a very important role to play, and that is keeping your data schema-free. Consider a CSV with three fields: A, B and C. As it's a CSV, each event (i.e. each row) has no context in it other than column ordering. You could create a search-time field extraction based on that column ordering, which would be fine unless someone reordered the columns to, say, A, C and B. Now if you modify the field extraction for the source, all events prior to the change will have incorrect field information!
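For illustration, a search-time extraction pinned to column order might be configured roughly like this (the stanza and sourcetype names here are hypothetical, not from the thread):

```
# transforms.conf -- hypothetical search-time extraction tied to column order
[extract_abc_columns]
DELIMS = ","
FIELDS = A, B, C

# props.conf -- apply it to the sourcetype
[my_csv_sourcetype]
REPORT-abc = extract_abc_columns
```

If the column order in the file later changes to A, C and B, this static mapping silently assigns the wrong values to old and new events alike, which is exactly the failure mode described above.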

So instead, Splunk adds the field metadata from the header to each event. That way, if the schema of your CSV changes, it doesn't matter - Splunk will just read the updated header and apply the metadata as it always does.

In terms of impact, bear in mind that the metadata is added at INPUT time rather than index time (indeed, this setting must be on the forwarder), so the load on the indexer is not the same as doing regex parsing on the indexer to add custom fields.
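As a sketch, the input-time extraction is enabled on the forwarder in props.conf (the sourcetype name below is a placeholder):

```
# props.conf on the forwarder -- hypothetical sourcetype stanza
[my_csv_sourcetype]
INDEXED_EXTRACTIONS = csv
# Splunk reads the header row and attaches the field names to each
# event at input time, so the schema travels with the data and the
# indexer does not have to run regex parsing to recover the fields.
```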

Now, in terms of why you shouldn't do this for unstructured sources, the opposite is true - the file for an unstructured source does not contain schema information. So if you apply a schema through index-time field extraction and the data's schema changes, you have now broken future events, and you will be sad.

Have a look for Martin Mueller's .conf talks on indexing - they will illuminate how this works. Also, try it for yourself: turn on indexed_extractions and play with the walklex command to see what's actually happening in the tsidx files.

gjanders
SplunkTrust
SplunkTrust

The only counterpoint here is that I cannot find anything confirming index-time fields are in use unless you use the correct syntax for the fields.

For example, let's say I have a CSV with columns A, B, C, D.
If I search sourcetype=csv A=value, will that improve performance versus a search-time field?
I'd expect that Splunk will attempt to apply search-time field extractions to each event.

However if I search:
sourcetype=csv A::value

I know it's going to use the index-time field. Is there that much performance saving if the :: syntax is not used when searching index-time fields?
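To make the distinction concrete, the two search forms look like this (field, value and index names are placeholders). The = form goes through normal field semantics and may invoke search-time extraction, while the :: form matches the indexed field::value terms directly; indexed fields can also be queried with tstats:

```
sourcetype=csv A=value

sourcetype=csv A::value

| tstats count where index=main sourcetype=csv A=value
```

Note that tstats only works against indexed fields in the first place, which is one measurable way to confirm a field really was extracted at index time.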

I've often seen 50+ column CSVs using index-time extractions, so I suspect the performance impact is possibly greater than the time saved at search time...

Martin Mueller's .conf presentation on this is excellent


jthunnissen
Path Finder

I am also afraid that the performance impact is quite negative, especially since our Splunk developers are not aware of which fields are index-extracted and will always use field="value" rather than field::"value".


jthunnissen
Path Finder

Thank you jplumsdaine22. I will have a look at that .conf talk shortly.

Can you link me to the walkley command you mention? (Ironically, if I Google it the only relevant match is your post here)


jplumsdaine22
Influencer

Conf slides here: https://conf.splunk.com/files/2017/slides/fields-indexed-tokens-and-you.pdf

Sorry, that was a typo on my part - the command is walklex (http://docs.splunk.com/Documentation/Splunk/7.1.2/Troubleshooting/CommandlinetoolsforusewithSupport#...)

Read the slides first and you'll know what to look for. Also check out the search deep-dive talks; I think there's one by Burch out there that's also very good.
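For reference, a minimal walklex invocation looks roughly like this (the bucket path is a placeholder, and argument details may vary by version - check the linked docs):

```
# List the terms stored in a bucket's tsidx file, then filter for
# indexed field terms, which are stored as field::value pairs.
splunk cmd walklex /path/to/index/db/hot_v1_0/*.tsidx "" | grep '::'
```

If your CSV columns show up as A::..., B::... terms in the output, they were extracted at index time.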
