Knowledge Management

Why to ever use Structured Data Header Extraction?

jthunnissen
Path Finder

I am confused about when to use Structured Data Header Extraction. Am I correct in understanding that structured data header extraction as documented here implies the use of custom index-time field extraction as documented here? This is unclear, as the documented procedures do not reference each other.

In particular, I am surprised that the documentation on structured data header extraction does not mention the significant performance impact of index-time field extraction. This brings me to my follow-up question. If structured data header extraction does indeed imply the use of custom index-time field extraction: since Splunk states in the documentation of the latter that it should only be used in exceptional cases for specific fields, when is it EVER not an extremely bad idea to index all fields of an input file (as is the case with Structured Data Header Extraction)?


jplumsdaine22
Influencer

So Header Extraction has a very important role to play, and that is in keeping your data schema-free. Consider a CSV with three fields, A, B and C. As it's a CSV, each event (i.e. each row) has no context in it other than column ordering. You could create a search-time field extraction based on that column ordering, which would be fine unless someone reordered the columns to, say, A, C and B. If you then modify the field extraction for the source, all events prior to the change will have incorrect field information!

So instead, Splunk adds the field metadata from the header to each event. That way, if the schema of your CSV changes, it doesn't matter - Splunk will just read the updated header and apply the metadata as it always does.
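As a concrete sketch, this is typically enabled in props.conf on the forwarder; the sourcetype name below is illustrative, not from the thread:

```ini
# props.conf on the forwarder (sourcetype name "my_csv" is illustrative)
[my_csv]
# Parse the file as structured data and index the header fields
INDEXED_EXTRACTIONS = csv
FIELD_DELIMITER = ,
# Row containing the column names
HEADER_FIELD_LINE_NUMBER = 1
```

With a stanza like this, the forwarder reads the header row and attaches the field names to each event, so a later reordering of columns is picked up automatically.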

In terms of impact, bear in mind that the metadata is added at INPUT time rather than index time (indeed, this setting must be on the forwarder), so the load on the indexer is not the same as doing regex parsing on the indexer to add custom fields.

Now, as to why you shouldn't do this for unstructured sources, the opposite is true - the file for an unstructured source does not contain schema information. So if you apply a schema through index-time field extraction and the data's schema changes, you have now broken future events, and you will be sad.

Have a look for Martin Mueller's .conf talks on indexing - they will illuminate how this works. Also, try it for yourself: turn on indexed_extractions and play with the walklex command to see what's actually happening to the tsidx files.
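For instance, a rough sketch of dumping a bucket's lexicon with walklex (the bucket path is a placeholder - point it at a real tsidx file from your index):

```
# Dump every term in a tsidx file's lexicon; index-time fields
# show up as fieldname::value tokens alongside the raw terms.
$SPLUNK_HOME/bin/splunk cmd walklex /path/to/bucket/your_file.tsidx "*"
```

Comparing the output for a sourcetype with and without indexed extractions makes the storage cost of indexing every column quite visible.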

gjanders
SplunkTrust

The only counterpoint here is that I cannot find anything confirming that index-time fields are actually used unless you use the correct syntax for the fields in your search.

For example, let's say I have a CSV with columns A, B, C, D.
If I search sourcetype=csv A=value, will that improve performance versus a search-time field?
I'd expect that Splunk will attempt to apply search-time field extractions to each event.

However, if I search:
sourcetype=csv A::value

I know that is going to use the index-time field. But is there much performance saving if the :: syntax is not used when searching index-time fields?

I've often seen 50+ column CSVs using indexed extractions, so I suspect the performance impact is possibly greater than the time saved at search time...
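One related knob worth noting: if a field is index-time extracted, it can be declared in fields.conf on the search head so that plain field=value searches can use the indexed tokens without the :: syntax. The field name A below is just the example from this thread:

```ini
# fields.conf on the search head
# Tells Splunk that field A exists in the index, so searches like
# A=value can scan the lexicon instead of extracting at search time.
[A]
INDEXED = true
```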

Martin Mueller's .conf presentation on this is excellent.


jthunnissen
Path Finder

I am also afraid that the performance impact is quite negative, especially since our Splunk developers are not aware of which fields are index-extracted and will always use field="value" rather than field::"value".


jthunnissen
Path Finder

Thank you jplumsdaine22. I will have a look at that .conf talk shortly.

Can you link me to the walkley command you mention? (Ironically, if I Google it the only relevant match is your post here)


jplumsdaine22
Influencer

Conf slides here: https://conf.splunk.com/files/2017/slides/fields-indexed-tokens-and-you.pdf

Sorry, I mistyped - the command is walklex (http://docs.splunk.com/Documentation/Splunk/7.1.2/Troubleshooting/CommandlinetoolsforusewithSupport#...)

Read the slides first and you'll know what to look for. Also check out the search deep dive talks - I think there's one by Burch out there that's also very good.
