I cannot get the automatic k/v field extraction to completely extract all fields from this event...
18 May 2010 16:09:17,913 INFO [ExecuteThread: '76' for queue: 'weblogic.kernel.Default'] com.xxx.xxx.sce.processes.helper.impl.ProcessHelperImpl: ProcessHelperImpl.saveContentItemToTransaction - parameters: Transaction =
Transaction
{
DeviceIPAddress = xxxxxxxxxxxx
MSISDN = xxxxxxxx
UserId = xxxxxxxxxxxx
UserAgent = Mozilla/5.0 (Linux; U; Android 2.1-update1; en-au; HTC_Desire_A8183 V1.08.841.2 Build/ERE27) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17
UserAgentProfile = qwerty
UserAgentProfileDifferences = not_set
DeviceType = not_set
ContentProviderCode = xxxx
GroupId = xxxxx
ContentTitle = not_set
ContentDescription = not_set
ObjectType = not_set
ContentURL = not_set
DeliveryURL = not_set
RSSDeliveryURL = not_set
CssNumber = not_set
Host = not_set
RssVipAddress = not_set
IsAudioOnly = false
IsPacketSwitch = false
ChannelStartDate = Thu Jan 01 10:00:00 EST 1970
ChannelEndDate = Thu Jan 01 10:00:00 EST 1970
DeliveryType = -1
FairplayDuration = -1
TotalFairplayDuration = -1
UnitPrice = -1
PriceType = -1
ChargeCode = -1
MerchantId = not_set
MerchantModel = not_set
MerchantName = not_set
Category = not_set
BillingDescription = not_set
BillingProductName = not_set
ReferenceCode = not_set
Chargeable = false
CreatedDate = Tue May 18 16:09:17 EST 2010
TransactionId = not_set
TransactionState = -1
PurchaseValidityPeriod = -1
AccumulatedDeliveryDuration = -1
ExpiryDate = Tue May 18 16:09:17 EST 2010
ErrorCode = false
AcceptTandC = false
}, ContentItem =
ContentItem
{
indicator = video-medium
content_id = 06002717800b7ec4
title = xxxxx
r_object_type = cds_video
r_modify_date = 2010-04-19 14:35:14.0
provider_code = xxxxxx
cp_unique_id = video_medium1-_psv92
short_description = xxxxx
i_full_format = 3gp
media_source = Internally Generated
object_name = xxxxxx
background_colour_code =
r_content_size = 2178306
i_chronicle_id = 090027178011287a
r_folder_path = xxxxxx
viewable_width =
i_contents_id = 06002717800b7ec4
r_object_id = 0900271780112a4b
viewable_height =
device_type = root^html^mozilla/5^safari^htc-desire
r_version_label = Active
effective_date = 2009-05-11 22:55:23.0
a_webc_url = xxxxxx
transcoding_profile_name = xxxx
content_purpose = video-report
cp_group_id = video_medium1
}
It appears that nothing in the ContentItem group is being parsed. All the fields up to that point in the event are extracted successfully.
I have tried explicitly using
| extract auto=t |
to no effect. Given that the first section of the event is extracted successfully I assume the problem lies within the extractor (or at least how we are using it).
Help! Please. Thank you! Oh, and we're using version 4.1.2.
Hmm. This could be due to a limit as to how many fields get extracted by default, but your event doesn't seem quite big enough to be hitting that. The other possible issue could be the dangling equals (=), but I don't know why Transaction values are being extracted, but ContentItem values are not. That's strange.
I'm thinking you would be better off disabling automatic KV extraction and setting up your own explicit field extraction rules. (This isn't too difficult; you'll want to rename the stanzas in my example to match your system.)
Entry in props.conf:
[my_sourcetype]
KV_MODE = none
REPORT-eq-fields = my_eq_extraction
EXTRACT-fields = ^\S+ \S+ \S+ \S+ (?<log_level>\w+)\s*
EXTRACT-thread = \[ExecuteThread: '(?<thread>\d+)'
EXTRACT-queue = queue: '(?<queue>\S+)'
...
Entry in transforms.conf:
[my_eq_extraction]
REGEX = ^\s+(\S+) += +(.+?)$
FORMAT = $1::$2
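If you want to sanity-check the three EXTRACT- patterns before putting them in props.conf, here is a rough Python sketch that runs them against the first line of the event from the question. It is only an approximation of what Splunk does: the (?<name>...) groups are rewritten as (?P<name>...) because that is the spelling Python's re module accepts, and the function name extract_header_fields is made up for this example.

```python
import re

# First line of the event from the question.
header = (
    "18 May 2010 16:09:17,913 INFO [ExecuteThread: '76' for queue: "
    "'weblogic.kernel.Default'] com.xxx.xxx.sce.processes.helper.impl."
    "ProcessHelperImpl: ProcessHelperImpl.saveContentItemToTransaction "
    "- parameters: Transaction ="
)

# Same patterns as the EXTRACT- lines above, with the named groups
# respelled as (?P<name>...) for Python.
patterns = {
    "EXTRACT-fields": r"^\S+ \S+ \S+ \S+ (?P<log_level>\w+)\s*",
    "EXTRACT-thread": r"\[ExecuteThread: '(?P<thread>\d+)'",
    "EXTRACT-queue": r"queue: '(?P<queue>\S+)'",
}


def extract_header_fields(line):
    """Apply every pattern and merge the named groups into one dict."""
    fields = {}
    for pattern in patterns.values():
        match = re.search(pattern, line)
        if match:
            fields.update(match.groupdict())
    return fields


header_fields = extract_header_fields(header)
print(header_fields)
```

The first pattern relies on the timestamp being exactly four space-separated tokens, so it would need adjusting for events with a different timestamp layout.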
You could even prevent the not_set values from being extracted if you wanted to (just as an example of the flexibility that you gain by using the regex field extraction approach). This could be accomplished with:
REGEX = ^\s+([A-Za-z_]+) += +((?!not_set).+)$
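To see what the negative lookahead changes, here is a small Python sketch comparing the two REGEX values on a few lines modelled on the event. Treat it as an illustration rather than a reproduction of Splunk's behavior: the sample lines are indented by assumption (both patterns require leading whitespace), re.MULTILINE is assumed so that ^ and $ work per line, and extract_pairs is a helper invented for this example.

```python
import re

# Hypothetical excerpt of the event body; the indentation is assumed.
body = (
    "    UserAgentProfile = qwerty\n"
    "    DeviceType = not_set\n"
    "    IsAudioOnly = false\n"
    "    ContentTitle = not_set\n"
    "    ChargeCode = -1\n"
)

# The plain key/value pattern and the variant that skips not_set values.
ALL_PAIRS = re.compile(r"^\s+(\S+) += +(.+?)$", re.MULTILINE)
SKIP_NOT_SET = re.compile(
    r"^\s+([A-Za-z_]+) += +((?!not_set).+)$", re.MULTILINE
)


def extract_pairs(pattern, text):
    """Build a field dict from group 1 (name) and group 2 (value)."""
    return {m.group(1): m.group(2) for m in pattern.finditer(text)}


all_fields = extract_pairs(ALL_PAIRS, body)
filtered_fields = extract_pairs(SKIP_NOT_SET, body)

print(sorted(all_fields))
print(sorted(filtered_fields))
```

Note that the lookahead rejects any value that begins with not_set, not only values that are exactly not_set.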
Well, actually the default value for max number of automatic extractions is 50 (at least as of version 4.2). This limit would be hit rather near where you seem to get stuck - just a few lines into the ContentItem section.
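As a rough check of that arithmetic, here is a tiny Python sketch. It assumes each key = value line in the posted event counts as one automatically extracted field and that the limit of 50 applies; the counts (44 lines in the Transaction block, 26 in the ContentItem block) are taken from the event as posted, and the function name is made up for this example.

```python
AUTO_KV_LIMIT = 50         # default quoted above for 4.2; assumed to apply
TRANSACTION_FIELDS = 44    # key = value lines inside Transaction { ... }
CONTENT_ITEM_FIELDS = 26   # key = value lines inside ContentItem { ... }


def second_block_fields_extracted(limit, first_block, second_block):
    """Return how many fields of the second block still fit under the limit."""
    remaining = max(limit - first_block, 0)
    return min(remaining, second_block)


extracted = second_block_fields_extracted(
    AUTO_KV_LIMIT, TRANSACTION_FIELDS, CONTENT_ITEM_FIELDS
)
print(extracted)  # 6
```

If anything on the first line of the event is also counted toward the limit, even fewer ContentItem fields would make it through.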
BR/
Kristian
EDIT: Just realised that the original post was from 2010 - a little over a year old...
Thanks to Lowell for the answer. I did find one problem with the regexes that had me stumped for a while. Watch out for greediness... www.regular-expressions.info
Greedy...
[override-key-value-extraction]
REGEX = ^\s+([^\s]+) += +(.+)$
FORMAT = $1::$2
I found that only one field was being extracted, but I failed to check what was being extracted. Eventually I did check what was in the one field and found all the key/value pairs, which led to much head slapping and then the answer below.
Lazy...
[override-key-value-extraction]
REGEX = ^\s+([^\s]+) += +(.+?)$
FORMAT = $1::$2
Yeah, I like to use [^\s] as I find it more readable in the sense that it is explicitly "excluding", but \S is definitely simpler and cleaner. The "." was greedily matching right to the end of the line of the last key/value pair. So the first example above (Greedy) created a field which had the first value and all the remaining key/value pairs as its value. Once I made the last match lazy, it matched to the first end of line.
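For anyone wanting to reproduce this outside Splunk, here is a Python sketch of one way the greedy version can swallow the remaining pairs. It rests on an assumption that is not confirmed anywhere in this thread: that the pattern is applied to the whole multi-line event with "." allowed to match newlines and $ allowed to match at every line end (re.DOTALL and re.MULTILINE here). The sample lines and their indentation are also assumed.

```python
import re

# Hypothetical excerpt of the ContentItem block, without a trailing newline.
body = (
    "    indicator = video-medium\n"
    "    content_id = 06002717800b7ec4\n"
    "    i_full_format = 3gp"
)

# Assumed flags: "." matches newlines, ^ and $ match at every line boundary.
FLAGS = re.MULTILINE | re.DOTALL
GREEDY = re.compile(r"^\s+([^\s]+) += +(.+)$", FLAGS)
LAZY = re.compile(r"^\s+([^\s]+) += +(.+?)$", FLAGS)

greedy_pairs = [(m.group(1), m.group(2)) for m in GREEDY.finditer(body)]
lazy_pairs = [(m.group(1), m.group(2)) for m in LAZY.finditer(body)]

print(len(greedy_pairs))  # 1: one field holding the rest of the event
print(len(lazy_pairs))    # 3: one field per line
```

Under those flags $ is satisfied at the very end of the event as well as at each line end, so the greedy (.+) runs to the last line before the anchor is checked, while the lazy (.+?) stops at the first line end it reaches.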
Never mind the REPEAT_MATCH=true comment. You don't need it. It appears that using the format $1::$2 is what triggers the multiple-matching behavior.
Greedy vs non-greedy can really be tricky to track down sometimes. However, I am surprised that this makes a difference because of the end of line anchor ($), which requires that the match be continued to the end of line anyways... so now I'm curious. Can you post an example that wasn't matching? BTW, I assume you have kept the REPEAT_MATCH=true (which you should need. That was an oversight on my part, sorry about that). Also, note that the regex [^\s] is the same as \S. I've updated my answer to include your fixes.
Ah, greediness was to blame... added note as answer below as comments don't format nicely.
I cannot seem to get it to work. The transform is not picking up anything but the first pair.
I tried using REPEAT_MATCH, but it did not change the behaviour.
[override-key-value-extraction]
REGEX = ^\s+([^\s]+) += +(.+)$
FORMAT = $1::$2
REPEAT_MATCH = true
I found this in the doco..
Any other options?
Thanks especially the 'not_set' tip.
I'll try the explicit approach via config.
The extraction problem appears to be present in 3.x as well.