In Hunk, app-specific field extraction is not picked up by map-reduce jobs

haneoword
Explorer

I'm seeing some odd behavior in a search: I have to inline the regex in order to get the MR job to extract the field.

Step 0: Create a field extraction in an app that is not search

Here are the relevant contents of $HUNK_HOME/etc/apps/{non_searchapp_app}/local/props.conf:

[myvix_sourcetype]
EXTRACT-myField = ^(?:[^\|\n]*\|){6}(?<my_field>[^\|]+)

Step 1: Verify Field Extraction works

Example Search: (Smart Mode)

 index=myvix source=*events*
  • Indeed, on the left hand side I see my_field is recognized and has events being counted for each unique value of my_field
  • Hunk auto-field detection is indeed working

Step 2: Now check to see the field is being extracted by the search

Example Search: (Smart Mode)

 index=myvix source=*events* | table _time, my_field

I get the following results:

 _time                my_field
 2015-05-26 16:19:57     
 2015-05-26 16:19:57      
 ...

Known Workaround

Inline the rex and don't rely on the field extraction in props.conf.

 index=myvix source=*events* | rex field=message "^(?:[^\|\n]*\|){6}(?<my_field>[^\|]+)" | table _time, my_field

results in the following:

 _time                  my_field
 2015-05-26 16:19:57    my_field_value-A
 2015-05-26 16:19:57    my_field_value-B

Interesting corollary:

Inlining the same regex against _raw (i.e. field=_raw) does not work!

 index=myvix source=*events* | rex field=_raw "^(?:[^\|\n]*\|){6}(?<my_field>[^\|]+)" | table _time, my_field, _raw

results:

 _time                  my_field                  _raw
 2015-05-26 16:19:57                            {"header": {"time": 1432675197252, "threadId": "qtpXXXX", "requestMarker": "abadbeef42c8", "env": "production", "server": "some-prod-server", "service": "some-service"}}
 2015-05-26 16:19:57                            {"header": {"time": 1432675197253, "threadId": "qtpYYYY", "requestMarker": "8badbeef9139", "env": "production", "server": "some-otherprod-server", "service": "some-other-service"}}

Notice that _raw doesn't work because the 'message' field of the _raw avro record is not being included. Only the 'header' field is being included.

FWIW, the regex was generated using the "Event Action -> Extract Fields" UI from the main search view.


Interesting corollary++:

As one last attempt to figure this out myself, I added message to the table command, and it works!! Go figure.

 index=myvix source=*events* | rex field=_raw "^(?:[^\|\n]*\|){6}(?<my_field>[^\|]+)" | table _time, my_field, _raw, message

results:

 _time                  my_field            _raw                  message
 2015-05-26 16:19:57    my_field_value-A   {"header": {"time": 1432675197252, "threadId": "qtpXXXX", "requestMarker": "abadbeef42c8", "env": "production", "server": "some-prod-server", "service": "some-service"}, "message": "t.blah.X.blah.blah.blah - |x|xxx|xxx|xxxx|xxx-xxxx|my_field_value-A|xxxx|x|x|blah&blah&blah|xxx/xxx|x|x|"}   t.blah.X.blah.blah.blah - |x|xxx|xxx|xxxx|xxx-xxxx|my_field_value-A|xxxx|x|x|blah&blah&blah|xxx/xxx|x|x|
 2015-05-26 16:19:57    my_field_value-B   {"header": {"time": 1432675197253, "threadId": "qtpYYYY", "requestMarker": "8badbeef9139", "env": "production", "server": "some-otherprod-server", "service": "some-other-service"}, "message": "t.blah.X.blah.blah.blah - |x|xxx|xxx|xxxx|xxx-xxxx|my_field_value-B|xxxx|x|x|blah&blah&blah|xxx/xxx|x|x|"}   t.blah.X.blah.blah.blah - |x|xxx|xxx|xxxx|xxx-xxxx|my_field_value-B|xxxx|x|x|blah&blah&blah|xxx/xxx|x|x|

So it seems I have to tell Hunk ahead of time which "raw fields" to include, and then it will "auto extract"?
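The three outcomes above can be reproduced outside Hunk. This is a minimal Python sketch, not anything Hunk runs: it assumes the two `_raw` shapes shown in the results (header only, and header plus message), abbreviates the sample record, and rewrites the named group as `(?P<my_field>...)` because that is how Python spells it.

```python
import json
import re

# The thread's extraction regex, with the named group in Python syntax.
PATTERN = re.compile(r"^(?:[^\|\n]*\|){6}(?P<my_field>[^\|]+)")

# Abbreviated sample record, based on the results shown in the thread.
header = {"time": 1432675197252, "service": "some-service"}
message = "t.blah.X.blah.blah.blah - |x|xxx|xxx|xxxx|xxx-xxxx|my_field_value-A|xxxx|x|x|"

raw_header_only = json.dumps({"header": header})
raw_with_message = json.dumps({"header": header, "message": message})


def extract(text):
    """Return my_field if the regex matches the given text, else None."""
    m = PATTERN.search(text)
    return m.group("my_field") if m else None


results = {
    "message": extract(message),
    "raw_header_only": extract(raw_header_only),
    "raw_with_message": extract(raw_with_message),
}
print(results)
```

The header-only `_raw` contains no `|` characters at all, so the regex has nothing to match; once the message is part of `_raw`, the first `[^\|\n]*` simply swallows the JSON prefix and the same six pipe-delimited segments are counted.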

Ledion_Bitincka
Splunk Employee

Ahh, the "corollary" and "corollary++" are actually the key to what you're experiencing. Hunk has no way of knowing that my_field is extracted from the "message" field, so the Avro reader doesn't output "message" and the extraction fails. Why does it work when you run "index=myvix source=*events*"? If you're not running a reporting search (stats, timechart, etc.), the search is effectively run in "verbose mode".

There are two ways to fix this:
a) if there are some fields that you always need, you can tell the record readers to always output them - check this answer for how to do that

b) you can tell the extractor that the field is actually being extracted from another field by modifying the extraction rule as follows:

 [myvix_sourcetype]
 EXTRACT-myField = ^(?:[^\|\n]*\|){6}(?<my_field>[^\|]+) in message

Unfortunately both methods require you to edit .conf files.
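To make the reasoning concrete, here is a toy Python model of why naming the source field on the rule (option b) helps. Everything in it is hypothetical: `fields_to_read`, the rule dictionaries and the `src` key are invented for illustration, and the thread does not show how Hunk actually decides which fields the record reader outputs.

```python
# Toy model only: none of these names come from Hunk.
PATTERN = r"^(?:[^\|\n]*\|){6}(?P<my_field>[^\|]+)"


def fields_to_read(search_fields, rules):
    """Work out which record fields a reader would be asked to output."""
    needed = set()
    for name in search_fields:
        rule = rules.get(name)
        if rule is None:
            needed.add(name)         # a plain record field, requested directly
        elif rule["src"] is not None:
            needed.add(rule["src"])  # the rule says where it extracts from
        # else: the source field is unknown, so nothing extra is requested
    return needed


rule_without_src = {"my_field": {"regex": PATTERN, "src": None}}
rule_with_src = {"my_field": {"regex": PATTERN, "src": "message"}}

without_src = fields_to_read(["_time", "my_field"], rule_without_src)
with_src = fields_to_read(["_time", "my_field"], rule_with_src)

print(sorted(without_src))
print(sorted(with_src))
```

In this model only the second rule causes `message` to be requested, which mirrors the behavior described above: without a declared source field, nothing tells the reader that `message` is needed.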

haneoword
Explorer

Since the original developer used the UI to create the regex, it would be great if the UI could infer that message is required. As it stands, this severely limits what end users can do for "schema-on-read" use cases: every field extraction requires a ticket for an admin to go in and edit the .conf file.

I tried both approaches and both worked, as advertised.

Since this is specific to the {non_searchapp_app}, and since I only need that field pulled in when a search actually uses it, I went with b).

 [myvix_sourcetype]
 EXTRACT-myField = ^(?:[^\|\n]*\|){6}(?<my_field>[^\|]+) in message

It worked like a charm! Thanks @ledion once again!

Ledion_Bitincka
Splunk Employee

We're already tracking a similar enhancement request internally; for your reference, it's SPL-94381.

rdagan_splunk
Splunk Employee

In the props.conf do you have your HDFS directory?

[source::/user/hunk/data/England/...]
sourcetype = England
EXTRACT-myField = XYZ

haneoword
Explorer

In $HUNK_HOME/etc/system/local/props.conf (note: that's system/local, not apps/{non_searchapp_app}/local):

 [myvix_sourcetype]
 EVAL-_time = strptime('header.time', "%s%3N")
 TRUNCATE = 102400
 MAX_TIMESTAMP_LOOKAHEAD = 30

 [source::/user/hunkuser/data/...]
 sourcetype = myvix_sourcetype

In $HUNK_HOME/etc/apps/{non_searchapp_app}/local/props.conf:

 [myvix_sourcetype]
 EXTRACT-myField = ^(?:[^\|\n]*\|){6}(?<my_field>[^\|]+)

Ledion_Bitincka
Splunk Employee

I'd also recommend revising the time extraction rule based on this answer - eval-based timestamp extraction causes time-based partition pruning to be disabled.

haneoword
Explorer

@ledion thanks for pointing that out. I had actually read that answer and always focused on the RHS (e.g. the "%s%3N") and not the LHS (e.g. EXTRACT-_time vs EVAL-_time). I'll investigate and report back.

haneoword
Explorer

@Ledion, going with this:

 [myvix_sourcetype]
 #EVAL-_time = strptime('header.time', "%s%3N")
 #EXTRACT-_time = strptime('header.time', "%s%3N")
 TRUNCATE = 102400
 TIME_PREFIX = "time":[ ]
 TIME_FORMAT = %3N
 MAX_TIMESTAMP_LOOKAHEAD = 40

Ledion_Bitincka
Splunk Employee

Two more things:
a) make sure to add header.time in the required fields for the vix
b) you'd need to fix TIME_FORMAT, probably need "%s%3N" (or maybe that's what you have and it doesn't render right here)
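For anyone checking what "%s%3N" is meant to read: the header.time values in this thread look like epoch seconds followed by three millisecond digits. The Python sketch below only illustrates that interpretation, it is not how Splunk parses timestamps; it reuses the TIME_PREFIX regex from the config above, and the sample record is abbreviated from the results earlier in the thread.

```python
import re
from datetime import datetime, timezone

# Abbreviated sample record from the thread.
raw = '{"header": {"time": 1432675197252, "threadId": "qtpXXXX"}}'

# Same prefix regex as TIME_PREFIX; then 10 digits of epoch seconds (%s)
# followed by 3 digits of milliseconds (%3N).
match = re.search(r'"time":[ ](\d{10})(\d{3})', raw)
seconds, millis = int(match.group(1)), int(match.group(2))

# Build the timestamp from integers to avoid float rounding on the millis.
event_time = datetime.fromtimestamp(seconds, tz=timezone.utc).replace(
    microsecond=millis * 1000
)
print(event_time.isoformat())
```

With only `%3N`, the ten seconds digits would have no format specifier to consume them, which is why the `%s` needs to be there.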

haneoword
Explorer

Yup.
- Already had header.time as required fields for the vix.
- Missed the %s... added it
