Solved: Parsing TSV with variable header names

kristensens · ‎10-07-2024

Hi, I have an Event Hub that receives data from multiple applications, each with a different number of columns and different column names.

The events typically look like this (as an example):

Environment ProductName UtcDate   RequestId Clientid ClientIp #app1 
Environment ProductName UtcDate Instance Region RequestId ClientIp DeviceId #app2
Environment ProductName UtcDate  DeviceId ClientIp #app3
PROD Product1 2024-04-04T20:21:20 abcd-12345-dev bcde-ed-1234 10.12.13.14   #app1
PROD Product2 2024-04-04T20:23:20 gwa us 126d-a23d-1234-def1 10.23.45.67 abcAJHSSz12. #ap
TEST Product3 2024-04-04T20:25:20 Ghsdhg1245 12.34.57.78 #app3
Environment ProductName UtcDate Instance Region RequestId ClientIp DeviceId #app2

The #app at the end of each line is not part of the log; it only annotates the different entries.
How can Splunk automagically select which "format" to use with REPORT/EXTRACT in transforms?

On the HeavyForwarder
transforms.conf

[header1]
DELIMS="\t"
FIELDS=Environment,ProductName,UtcDate,RequestId,Clientid,ClientIp

[header2]
DELIMS="\t"
FIELDS=Environment,ProductName,UtcDate,Instance,Region,RequestId,ClientIp,DeviceId

[header3]
DELIMS="\t"
FIELDS=Environment,ProductName,UtcDate,DeviceId,ClientIp

In props.conf

[eventhub:sourcewithmixedsources]
INDEXED_EXTRACTIONS = TSV
CHECK_FOR_HEADER=true
NO_BINARY_CHECK = 1
SHOULD_LINEMERGE = false
pulldown_type = 1
REPORT-headers = header1, header2, header3

PickleRick · ‎10-07-2024

1. If you're doing indexed extractions, your data is processed as parsed. Adding search-time extractions will only result in double fields (or misassigned fields in case of not-well-defined formats).

2. In general, unless you have a file input whose header specifies the fields within that file, there's no way to assign field names dynamically for indexed extractions.

3. You could try making search-time extraction definitions that match only specific message templates.

For example, as an inline extraction in props.conf:

EXTRACT-fields-for-app1 = ^(?<Environment>\S+)\s+(?<ProductName>\S+)\s+(?<UtcDate>\S+)\s+(?<RequestId>\S+)\s+(?<ClientId>\S+)\s+(?<ClientIp>\d+\.\d+\.\d+\.\d+)$

This should match only app1 events, because it requires a specific number of whitespace-separated fields and anchors the IP value at a particular position within the event. You can add several similar extraction definitions, each covering a separate event template.
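
A sketch of how the other two templates could be covered in the same way, assuming the field order shown in the sample events and that no field is ever empty (a run of tabs would be collapsed by \s+). The class names fields-for-app2 and fields-for-app3 are made up for illustration. Inline regexes in props.conf go under EXTRACT-<class>, while REPORT-<class> refers to a transforms.conf stanza. These are search-time settings, so per point 1 they would replace INDEXED_EXTRACTIONS for this sourcetype rather than sit alongside it.

```ini
# props.conf (search time). Class names are illustrative, not from the original post.
[eventhub:sourcewithmixedsources]
# app2: eight fields, IP address in seventh position, DeviceId last
EXTRACT-fields-for-app2 = ^(?<Environment>\S+)\s+(?<ProductName>\S+)\s+(?<UtcDate>\S+)\s+(?<Instance>\S+)\s+(?<Region>\S+)\s+(?<RequestId>\S+)\s+(?<ClientIp>\d+\.\d+\.\d+\.\d+)\s+(?<DeviceId>\S+)$
# app3: five fields, IP address last
EXTRACT-fields-for-app3 = ^(?<Environment>\S+)\s+(?<ProductName>\S+)\s+(?<UtcDate>\S+)\s+(?<DeviceId>\S+)\s+(?<ClientIp>\d+\.\d+\.\d+\.\d+)$
```

The header lines themselves should not match any of these patterns, since the ClientIp position holds the literal text "ClientIp" rather than an address.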


kristensens · ‎10-09-2024

Thanks for confirming my suspicion. SED'ed a lot!
