I'm trying to extract two index-time fields from the input stream. Both should be multivalued. I successfully extracted the first one, and it is multivalued, just as I wanted. However, the second field, which is derived from the first (it is a short code, a suffix of the full value), is extracted only from the first value of the first field.
Here is a quick example I've created:
transforms.conf
[mainKey]
REGEX = record(?:\.\d+)?\.code="(?P<mainKey>[^"]+)"
#FORMAT = mainKey::$1
WRITE_META = true
REPEAT_MATCH = true
LOOKAHEAD = 1048576
MV_ADD = 1
[subKey]
REGEX = (?m-s)(?<=^|\s)[a-zA-Z]*(?P<subKey>\d+)(?=\s|$)
#FORMAT = subKey::$1
SOURCE_KEY = field:mainKey
WRITE_META = true
REPEAT_MATCH = true
MV_ADD = 1
props.conf
[testIndexFields]
DATETIME_CONFIG =
NO_BINARY_CHECK = true
category = Custom
description = Testing multivalue index-time fields
pulldown_type = true
TRANSFORMS-mainKey = mainKey
TRANSFORMS-subKey = subKey
Where testIndexFields is a sourcetype I'm importing this data to.
I prepared the following file as a data sample:
2016-12-13 17:07:20, record.1.code="MAIN132" record.2.code="PRE9087", record.3.code="1405"
2016-12-13 17:07:40, record.code="SingleCode0123456"
2016-12-13 17:08:00, record.1.code="123BadOne", record.2.code="GoodOne1", record.3.code="NoSubKey"
2016-12-13 17:08:20, record.1.code="!alsobad123",record.2.code="TryThis1508"
2016-12-13 17:07:20, record.code="Unnumbered0001", record.code="Unnumbered0002", record.code="Unnumbered0003"
I'm expecting the data to be extracted like this:
mainKey=MAIN132 mainKey=PRE9087 mainKey=1405 subKey=132 subKey=9087 subKey=1405
mainKey=SingleCode0123456 subKey=0123456
mainKey=123BadOne mainKey=GoodOne1 mainKey=NoSubKey subKey=1
mainKey=Unnumbered0001 mainKey=Unnumbered0002 mainKey=Unnumbered0003 subKey=0001 subKey=0002 subKey=0003
However, I'm getting this:
mainKey = MAIN132 mainKey = PRE9087 mainKey = 1405 subKey = 132
mainKey = SingleCode0123456 subKey = 0123456
mainKey = 123BadOne mainKey = GoodOne1 mainKey = NoSubKey
mainKey = !alsobad123 mainKey = TryThis1508
mainKey = Unnumbered0001 mainKey = Unnumbered0002 mainKey = Unnumbered0003 subKey = 0001
As you can see, the subKey is extracted from the first occurrence of mainKey only. Is there a way to change this behavior?
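For reference, the intended two-step extraction can be modelled outside Splunk. The following is only a rough Python sketch of the expectation, not of how Splunk processes transforms: the function name `extract` is made up for illustration, and the subKey pattern is simplified to a whole-value match because Python's `re` does not accept the `(?<=^|\s)` lookbehind used in transforms.conf.

```python
import re

# Stand-in for the [mainKey] transform: capture every quoted code value.
MAIN_KEY = re.compile(r'record(?:\.\d+)?\.code="([^"]+)"')
# Simplified stand-in for the [subKey] transform: optional letters, then digits,
# matched against one whole mainKey value at a time (an assumption of this sketch).
SUB_KEY = re.compile(r'[a-zA-Z]*(\d+)')

def extract(event):
    """Return (mainKey values, subKey values) for one raw event."""
    main_keys = MAIN_KEY.findall(event)
    sub_keys = []
    for value in main_keys:
        match = SUB_KEY.fullmatch(value)
        if match:
            sub_keys.append(match.group(1))
    return main_keys, sub_keys

event = '2016-12-13 17:08:00, record.1.code="123BadOne", record.2.code="GoodOne1", record.3.code="NoSubKey"'
print(extract(event))
# (['123BadOne', 'GoodOne1', 'NoSubKey'], ['1'])
```

The point of the sketch is that the second pattern is applied to every value of mainKey, which is the behavior the subKey transform is expected to have.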
If your mainKey regex is working fine and you then extract subKey from mainKey, can you try a regex for subKey similar to the one you used for mainKey and see if it works:
REGEX = record(?:\.\d+)?\.code="(?<mainKeyPrefix>[^\d]+)(?<subKey>[\d]+)"
where the mainKeyPrefix and subKey fields will be created. Otherwise, if that is an option for you, you can extract this at search time using the same regex, something like:
your query to return mainKey
| rex field=mainKey "(?<mainKeyPrefix>[^\d]+)(?<subKey>[\d]+)"
| table mainKey, subKey
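To see what the combined pattern above captures from the sample events, it can be tried in Python. This is a hedged sketch: the named groups are rewritten in Python's `(?P<name>...)` syntax, the helper `sub_keys` is hypothetical, and Python's `re` is assumed to behave like Splunk's regex engine for these simple constructs.

```python
import re

# The suggested pattern, with named groups in Python syntax.
COMBINED = re.compile(
    r'record(?:\.\d+)?\.code="(?P<mainKeyPrefix>[^\d]+)(?P<subKey>[\d]+)"'
)

def sub_keys(event):
    """Collect the subKey group from every match in one raw event."""
    return [m.group("subKey") for m in COMBINED.finditer(event)]

print(sub_keys('2016-12-13 17:07:20, record.1.code="MAIN132" record.2.code="PRE9087", record.3.code="1405"'))
# ['132', '9087']
print(sub_keys('2016-12-13 17:08:20, record.1.code="!alsobad123",record.2.code="TryThis1508"'))
# ['123', '1508']
```

In this emulation the pattern needs at least one non-digit before the digits, so a purely numeric code such as 1405 yields no subKey, and a code such as !alsobad123 does yield one. Depending on the data, the prefix group may need adjusting (for example `[a-zA-Z]*`).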
I ended up extracting the multivalue subKey field at search time using props.conf and transforms.conf, saving it into a summary index, and tokenizing it in fields.conf to preserve its multivalue nature.
The extraction is described in this follow-up question.
The need to tokenize the field in a summary index is due to the following: multivalue fields arrive in a summary index as a single value, apparently created by mvjoin(source,'\n'). If I want to search on individual values, I need that TOKENIZER in fields.conf.
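The effect described above can be illustrated in Python. This is only a sketch of the idea, not of Splunk internals: the sample values are taken from the question, and the pattern `[^\n]+` is a hypothetical tokenizer chosen for illustration, not necessarily the one used in the actual fields.conf.

```python
import re

# A multivalue field as it looks before being written to the summary index.
values = ["132", "9087", "1405"]

# What apparently gets stored: one value, joined on newlines.
stored = "\n".join(values)

# Hypothetical tokenizer: every run of non-newline characters is one token.
TOKEN = re.compile(r"[^\n]+")
tokens = TOKEN.findall(stored)

print(stored == "9087")   # False: the joined value does not equal any single code
print("9087" in tokens)   # True: after tokenizing, individual values are searchable
```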
In the end I decided to extract that field (subKey) at search time and save it into a summary index. The way I did the extraction is described in this follow-up question.
I'm leaving it here because it might be helpful to someone reading it some time later.
Yes, your first suggestion was my next step - I don't really like it too much, because in practice I'm extracting mainKey from differently formatted records, so I have 4 or 5 transforms, all extracting mainKey, and I'd have to replicate, multiply (because subKey is extracted differently from different mainKey formats) and edit them to extract the subKey. Still, if one extraction from a multivalue field doesn't work, I'll have to create all that multitude of subKey extractions.
Extracting subKey at search time doesn't really help because I want to search on subKey=... and it's not an indexed token (one of the points justifying index-time field creation). One of the possibilities that we are still looking at is to put everything into a summary index, extracting the subKey using rex in the summarizing search and saving it along with the mainKey.
By the way, I did find that rex works really differently from the regular expression in transforms.conf. During search time, rex - even the simplest ^[a-zA-Z]*(?P<subKey>\d+)$, with the crudest 'beginning of line'/'end of line' anchors - works as expected and returns multiple values when scanning a multivalue field. Is this a bug or a feature?
I'll accept your answer since it seems that not much more can be done during index time, and there are two good workarounds there.
I am happy that it worked out for you! Happy Splunking!