Splunk Search

n-level nested transform and extraction

Path Finder

I am trying to do the following:

  1. Define a transform 1 in ./apps/search/local/transforms.conf. This creates 4 fields (field1 .... field4)
  2. Define extraction 1 in ./apps/search/local/props.conf.

This all works perfectly. The fields are fully available at search time.

But now comes the tricky part.

  1. Define a transform 2 in ./apps/search/local/transforms.conf. This transform operates on field4 and creates 4 fields (field5...field8)

  2. Define extraction 2 in ./apps/search/local/props.conf.

Unfortunately, this does not work. I have checked the regex with the rex command, and it works fine there.

Are there any special rules to consider when doing nested transforms & extractions?

UPDATE #1

Here are the relevant contents of my ./apps/search/local/transforms.conf

# This is the base transform that parses the raw log  
[cloudfront-cdn-http]  
DELIMS = "\t"  
FIELDS = cdn_date, cdn_time, cdn_location, cdn_bytes, cdn_ip, cdn_method, cdn_host, cdn_uri, cdn_status, cdn_referer, cdn_useragent, cdn_query  

# Secondary transform that parses the cdn_uri field  
[cdn-uri-v1]  
REGEX = ^/(?<cdn_encoding>(ipad|iphone)[^/]+)/(?<cdn_page>.[^\/]+)  
SOURCE_KEY = cdn_uri  

# Tertiary transform that parses the cdn_page  
[cdn-page-v1]  
REGEX = (?<cdn_file>[\w\-]+)-(?<cdn_bandwidth>\d{2,3}+)k.split.(?<cdn_segment>\d{1,4}+).ts  
SOURCE_KEY = cdn_page  

And here is the relevant chunk of my ./apps/search/local/props.conf that applies the three transforms

[cloudfront_http]  
# extracts the fields from the base transform  
REPORT-cloudfront_http_log = cloudfront-cdn-http  
# extracts the fields from the secondary transform  
REPORT-cloudfront_uri_v1 = cdn-uri-v1  
# extracts the fields from the tertiary transform  
REPORT-cloudfront-page-v1 = cdn-page-v1  

Super Champion

Off the top of my head, EXTRACTs (in props.conf) are done first, then REPORTs are evaluated. (Both of these take a "class" name, and the classes are simply processed in lexicographical order.) Another simple way to ensure fields are extracted in the proper order is to list the transforms in the order you want within a single REPORT entry, like so:

props.conf:

[my_sourcetype]
REPORT-myfield = fields1, fields2

transforms.conf:

[fields1]
REGEX = (?<field1>...) .... (?<field4>...)

[fields2]
SOURCE_KEY = field4
REGEX = (?<field5>...) .... (?<field8>...)

In this example, "fields1" will always be evaluated before "fields2". Does that help? (If not, please provide some sample events and the related props.conf and transforms.conf.)

Update:

Based on your updated example, this is one solution that should work (assuming your regexes and everything else are fine; I didn't look that closely, and without sample events it's hard to say anyway):

[cloudfront_http]  
REPORT-cloudfront = cloudfront-cdn-http, cdn-uri-v1, cdn-page-v1  

The answer provided by southeringtonp should work fine too; it really comes down to your preference. In a more complex situation (for example, if "cdn_uri" could be produced by two different field extractions depending on variations in your events), it would certainly be better to go with the explicit priority approach that southeringtonp pointed out. As is, either should work fine.
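To see why the order matters for chained transforms, here is a rough Python simulation of the idea. It is only a sketch, not how Splunk is implemented: the sample URI is made up (the thread has no sample event), `run_chain` is a hypothetical helper, and the regexes are adapted from the question (Python named-group syntax, possessive quantifiers dropped, literal dots escaped).

```python
import re

# Hypothetical cdn_uri value; the thread does not include a real sample event.
SAMPLE_URI = "/iphone-v1/myvideo-300k.split.12.ts"

# transform name -> (SOURCE_KEY, regex), adapted from the question's transforms.conf.
# Assumption: Python's (?P<name>...) groups stand in for Splunk's (?<name>...).
TRANSFORMS = {
    "cdn-uri-v1": (
        "cdn_uri",
        r"^/(?P<cdn_encoding>(?:ipad|iphone)[^/]+)/(?P<cdn_page>.[^/]+)",
    ),
    "cdn-page-v1": (
        "cdn_page",
        r"(?P<cdn_file>[\w\-]+)-(?P<cdn_bandwidth>\d{2,3})k\.split\."
        r"(?P<cdn_segment>\d{1,4})\.ts",
    ),
}


def run_chain(order, fields):
    """Apply transforms in the given order; skip one if its source field is missing."""
    fields = dict(fields)  # work on a copy so the caller's dict is untouched
    for name in order:
        source_key, pattern = TRANSFORMS[name]
        value = fields.get(source_key)
        if value is None:
            continue  # source field not extracted yet, so nothing to match against
        match = re.search(pattern, value)
        if match:
            fields.update(match.groupdict())
    return fields


base = {"cdn_uri": SAMPLE_URI}  # pretend the base DELIMS transform already ran
good = run_chain(["cdn-uri-v1", "cdn-page-v1"], base)
bad = run_chain(["cdn-page-v1", "cdn-uri-v1"], base)

print(sorted(good))  # includes cdn_file, cdn_bandwidth, cdn_segment
print(sorted(bad))   # cdn_page is present, but the page-level fields are not
```

In the "bad" order the page transform runs before cdn_page exists, so its fields never appear, which matches the symptom described in the question.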


Motivator

As Lowell indicated, lexical order is important when doing multiple tiers of transform.

Often a good practice is to add a number to the extraction's tag to help ensure that the sort order is clear, like so:

[cloudfront_http]  
REPORT-0-cloudfront_http_log = cloudfront-cdn-http
REPORT-1-cloudfront_uri_v1   = cdn-uri-v1  
REPORT-2-cloudfront-page-v1  = cdn-page-v1  

Also, be careful with characters like dashes and underscores when sort order matters. Your example uses both REPORT-cloudfront-xxx and REPORT-cloudfront_xxx. It's a good idea to keep those consistent, or better yet to rely only on alphanumeric sort.

Assuming normal ASCII sort, a dash comes before an underscore, but that might vary if Splunk honors sort orders for different locales.
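A quick way to check the ordering is to sort the class names outside Splunk. The sketch below uses Python's default string sort, which compares code points and so behaves like plain ASCII order for these names; whether Splunk sorts exactly this way is the same assumption made above. The class names are taken from the question's props.conf.

```python
# Class names from the question's props.conf (the part after "REPORT-").
report_classes = [
    "cloudfront_http_log",  # base transform, parses the raw event
    "cloudfront_uri_v1",    # secondary, needs cdn_uri
    "cloudfront-page-v1",   # tertiary, needs cdn_page
]

# Default string sort: '-' (0x2D) sorts before '_' (0x5F).
original_order = sorted(report_classes)

# The same classes with the numeric prefixes suggested above.
numbered = [
    "0-cloudfront_http_log",
    "1-cloudfront_uri_v1",
    "2-cloudfront-page-v1",
]
numbered_order = sorted(numbered)

print(original_order)
# ['cloudfront-page-v1', 'cloudfront_http_log', 'cloudfront_uri_v1']
print(numbered_order)
# ['0-cloudfront_http_log', '1-cloudfront_uri_v1', '2-cloudfront-page-v1']
```

Under that assumption, the tertiary extraction sorts first in the original configuration, before cdn_page exists. Note that the prefixes sort as strings, so with ten or more classes "10" would sort before "2"; zero-padding (00, 01, ...) avoids that.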

Path Finder

Thanks. That did the trick. I inserted the numbers into the report names to force the sort order.
