Solved: Automatic field extraction not extracting all the fields from a particular event

nclarkau · ‎05-18-2010

I cannot get the automatic k/v field extraction to completely extract all fields from this event...

18 May 2010 16:09:17,913 INFO  [ExecuteThread: '76' for queue: 'weblogic.kernel.Default'] com.xxx.xxx.sce.processes.helper.impl.ProcessHelperImpl: ProcessHelperImpl.saveContentItemToTransaction - parameters: Transaction = 
Transaction
{
    DeviceIPAddress = xxxxxxxxxxxx
    MSISDN = xxxxxxxx
    UserId = xxxxxxxxxxxx
    UserAgent = Mozilla/5.0 (Linux; U; Android 2.1-update1; en-au; HTC_Desire_A8183 V1.08.841.2 Build/ERE27) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17
    UserAgentProfile = qwerty
    UserAgentProfileDifferences = not_set
    DeviceType = not_set
    ContentProviderCode = xxxx
    GroupId = xxxxx
    ContentTitle = not_set
    ContentDescription = not_set
    ObjectType = not_set
    ContentURL = not_set
    DeliveryURL = not_set
    RSSDeliveryURL = not_set
    CssNumber = not_set
    Host = not_set
    RssVipAddress = not_set
    IsAudioOnly = false
    IsPacketSwitch = false
    ChannelStartDate = Thu Jan 01 10:00:00 EST 1970
    ChannelEndDate = Thu Jan 01 10:00:00 EST 1970
    DeliveryType = -1
    FairplayDuration = -1
    TotalFairplayDuration = -1
    UnitPrice = -1
    PriceType = -1
    ChargeCode = -1
    MerchantId = not_set
    MerchantModel = not_set
    MerchantName = not_set
    Category = not_set
    BillingDescription = not_set
    BillingProductName = not_set
    ReferenceCode = not_set
    Chargeable = false
    CreatedDate = Tue May 18 16:09:17 EST 2010
    TransactionId = not_set
    TransactionState = -1
    PurchaseValidityPeriod = -1
    AccumulatedDeliveryDuration = -1
    ExpiryDate = Tue May 18 16:09:17 EST 2010
    ErrorCode = false
    AcceptTandC = false
}, ContentItem = 
ContentItem
{
     indicator = video-medium
     content_id = 06002717800b7ec4
     title = xxxxx
     r_object_type = cds_video
     r_modify_date = 2010-04-19 14:35:14.0
     provider_code = xxxxxx
     cp_unique_id = video_medium1-_psv92
     short_description = xxxxx
     i_full_format = 3gp
     media_source = Internally Generated
     object_name = xxxxxx
     background_colour_code = 
     r_content_size = 2178306
     i_chronicle_id = 090027178011287a
     r_folder_path = xxxxxx
     viewable_width = 
     i_contents_id = 06002717800b7ec4
     r_object_id = 0900271780112a4b
     viewable_height = 
     device_type = root^html^mozilla/5^safari^htc-desire
     r_version_label = Active
     effective_date = 2009-05-11 22:55:23.0
     a_webc_url = xxxxxx
     transcoding_profile_name = xxxx
     content_purpose = video-report
     cp_group_id = video_medium1
}

It appears that nothing in the ContentItem group is being parsed. Every field before that point in the event is extracted successfully.

I have tried explicitly using

| extract auto=t |

to no effect. Given that the first section of the event is extracted successfully I assume the problem lies within the extractor (or at least how we are using it).

Help! Please. Thank you! Oh, and we're using version 4.1.2.

Lowell · ‎05-18-2010

Hmm. This could be due to a limit on how many fields get extracted by default, but your event doesn't seem quite big enough to be hitting that. The other possible issue could be the dangling equals (=), though that wouldn't explain why Transaction values are being extracted and ContentItem values are not. That's strange.

I think you would be better off disabling automatic KV extraction and setting up your own explicit field extraction rules. (This isn't too difficult; you'll want to rename the stanzas in my example to match your system.)

Entry in props.conf:

[my_sourcetype]
KV_MODE = none
REPORT-eq-fields = my_eq_extraction
EXTRACT-fields = ^\S+ \S+ \S+ \S+ (?<log_level>\w+)\s*
EXTRACT-thread = \[ExecuteThread: '(?<thread>\d+)'
EXTRACT-queue = queue: '(?<queue>\S+)'
...
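
As a quick sanity check outside Splunk, the three EXTRACT patterns above can be tried against the first line of the posted event. This is only a sketch in Python: the named groups are rewritten from (?<name>...) to Python's (?P<name>...) spelling, and the helper name extract_header_fields is made up for illustration.

```python
import re

# First line of the event from the original post.
HEADER = ("18 May 2010 16:09:17,913 INFO  [ExecuteThread: '76' for queue: "
          "'weblogic.kernel.Default'] com.xxx.xxx.sce.processes.helper.impl."
          "ProcessHelperImpl: ProcessHelperImpl.saveContentItemToTransaction"
          " - parameters: Transaction = ")

# Same patterns as the EXTRACT- lines, with Python-style named groups.
PATTERNS = {
    "log_level": r"^\S+ \S+ \S+ \S+ (?P<log_level>\w+)\s*",
    "thread": r"\[ExecuteThread: '(?P<thread>\d+)'",
    "queue": r"queue: '(?P<queue>\S+)'",
}

def extract_header_fields(line):
    """Apply each pattern once and collect the named group it captures."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, line)
        if match:
            fields[name] = match.group(name)
    return fields

print(extract_header_fields(HEADER))
# {'log_level': 'INFO', 'thread': '76', 'queue': 'weblogic.kernel.Default'}
```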

Entry in transforms.conf:

[my_eq_extraction]
REGEX = ^\s+(\S+) += +(.+?)$
FORMAT = $1::$2

You could even prevent the not_set values from being extracted, if you wanted to (just as an example of the flexibility that you gain by using the regex field extraction approach). This could be accomplished with:

REGEX = ^\s+([A-Za-z_]+) += +((?!not_set).+?)$
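
To see what the $1::$2 style of key/value regex would pick up, here is a rough Python approximation run on a few lines taken from the event. It assumes ^ and $ match at line boundaries (re.MULTILINE in Python); how Splunk itself applies the regex to a multi-line event may differ. The extract helper and the shortened sample are illustrative only.

```python
import re

# A shortened piece of the Transaction block from the posted event.
SAMPLE = """Transaction
{
    DeviceIPAddress = xxxxxxxxxxxx
    UserAgentProfile = qwerty
    DeviceType = not_set
    IsAudioOnly = false
}"""

# Assumption: anchors match per line, emulated here with re.MULTILINE.
KV = re.compile(r"^\s+(\S+) += +(.+?)$", re.MULTILINE)
# Same idea with the negative lookahead that rejects values starting with not_set.
KV_SKIP_NOT_SET = re.compile(
    r"^\s+([A-Za-z_]+) += +((?!not_set).+?)$", re.MULTILINE
)

def extract(pattern, text):
    """Emulate FORMAT = $1::$2 by turning every match into key -> value."""
    return {key: value for key, value in pattern.findall(text)}

all_fields = extract(KV, SAMPLE)
without_not_set = extract(KV_SKIP_NOT_SET, SAMPLE)

print(sorted(all_fields))        # includes DeviceType
print(sorted(without_not_set))   # DeviceType is gone
```

Note that the lookahead only checks the start of the value, so a value such as not_set_yet would be skipped as well.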

sentor · ‎06-22-2011

Well, actually the default value for max number of automatic extractions is 50 (at least as of version 4.2). This limit would be hit rather near where you seem to get stuck - just a few lines into the ContentItem section.

BR/

Kristian

EDIT: Just realised that the original post was from 2010 - a little over a year old...
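
The arithmetic behind that limit can be sketched quickly. The field counts below are simply the "key = value" lines counted in each block of the posted event, and the sketch assumes (without confirmation) that fields are extracted in order, one per line, and that the header line contributes none.

```python
# Counts taken from the event in the original post.
TRANSACTION_FIELDS = 44   # "key = value" lines inside Transaction { ... }
CONTENT_ITEM_FIELDS = 26  # "key = value" lines inside ContentItem { ... }
AUTO_KV_LIMIT = 50        # default quoted in the reply above (as of 4.2)

def fields_left_for_second_block(first_block_fields, limit):
    """Extractions still available once the first block has been consumed."""
    return max(limit - first_block_fields, 0)

remaining = fields_left_for_second_block(TRANSACTION_FIELDS, AUTO_KV_LIMIT)
dropped = CONTENT_ITEM_FIELDS - min(remaining, CONTENT_ITEM_FIELDS)

print(remaining)  # 6
print(dropped)    # 20
```

Under those assumptions only the first six ContentItem fields would fit before the limit is reached.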

nclarkau · ‎05-25-2010

Thanks to Lowell for the answer. I did find one problem with the regexes that had me stumped for a while. Watch out for greediness... www.regular-expressions.info

Greedy...

[override-key-value-extraction]
REGEX = ^\s+([^\s]+) += +(.+)$
FORMAT = $1::$2

I found that only one field was being extracted, but I failed to check what was in it. Eventually I did check, and found that the one field contained all the remaining key/value pairs, which led to much head slapping and then the answer below.

Lazy...

[override-key-value-extraction]
REGEX = ^\s+([^\s]+) += +(.+?)$
FORMAT = $1::$2
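
The difference can be reproduced outside Splunk. The Python sketch below assumes the regex runs in a mode where "." also matches newlines and "$" matches at the end of every line (re.DOTALL plus re.MULTILINE); that is a guess that fits the behaviour described here, not a confirmed statement about Splunk's regex flags.

```python
import re

# A few lines from the ContentItem block of the posted event.
SAMPLE = "\n".join([
    "ContentItem",
    "{",
    "     indicator = video-medium",
    "     content_id = 06002717800b7ec4",
    "     i_full_format = 3gp",
    "}",
])

# Assumed flags: '.' crosses newlines, '^' and '$' match per line.
FLAGS = re.MULTILINE | re.DOTALL
GREEDY = re.compile(r"^\s+([^\s]+) += +(.+)$", FLAGS)
LAZY = re.compile(r"^\s+([^\s]+) += +(.+?)$", FLAGS)

greedy_pairs = GREEDY.findall(SAMPLE)
lazy_pairs = LAZY.findall(SAMPLE)

# Greedy: one pair whose value swallows every following line.
print(len(greedy_pairs), repr(greedy_pairs[0][1]))
# Lazy: one pair per line, each value ending at its own line end.
print(lazy_pairs)
```

With the greedy pattern the single value runs to the end of the event; with the lazy pattern each value stops at the first line end.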

nclarkau · ‎05-26-2010

Yeah, I like to use [^\s] as I find it more readable in the sense that it is explicitly "excluding", but \S is definitely simpler and cleaner. The "." was greedily matching right to the end of the line of the last key/value pair. So the first example above (Greedy) created a field which had the first value and all the remaining key/value pairs as its value. Once I made the last match lazy, it matched only up to the first end of line.

Lowell · ‎05-25-2010

Never mind the REPEAT_MATCH=true comment. You don't need it. It appears that using the format $1::$2 is what triggers the multiple-matching behavior.

Lowell · ‎05-25-2010

Greedy vs non-greedy can really be tricky to track down sometimes. However, I am surprised that this makes a difference, because the end-of-line anchor ($) requires the match to continue to the end of the line anyway... so now I'm curious. Can you post an example that wasn't matching? BTW, I assume you have kept the REPEAT_MATCH=true (which you should need; that was an oversight on my part, sorry about that). Also, note that the regex [^\s] is the same as \S. I've updated my answer to include your fixes.

nclarkau · ‎05-25-2010

Ah, greediness was to blame... added a note as an answer below, as comments don't format nicely.

nclarkau · ‎05-25-2010

I cannot seem to get it to work. The transform is not picking up anything but the first field.

I tried using REPEAT_MATCH but it did not change the behaviour.

[override-key-value-extraction]
REGEX = ^\s+([^\s]+) += +(.+)$
FORMAT = $1::$2
REPEAT_MATCH = true

I found this in the doco..

NOTE: this option is valid only for index time KV extraction.

Any other options?

nclarkau · ‎05-19-2010

Thanks, especially for the 'not_set' tip.

I'll try the explicit approach via config.

The extraction problem appears to be present in 3.x as well.
