How to extract a multivalue index-time field from another multivalue index-time field?

Builder

I'm trying to extract two index-time fields from the input stream. Both should be multivalued. I successfully extracted the first one, and it is multivalued, just like I wanted. However, the second field, which should be extracted from the first one (a short code that is a suffix of the full code), is extracted only from the first value of the first field.

Here is a quick example I've created:

transforms.conf

[mainKey]
REGEX = record(?:\.\d+)?\.code="(?P<mainKey>[^"]+)"
#FORMAT = mainKey::$1
WRITE_META = true
REPEAT_MATCH = true
LOOKAHEAD = 1048576
MV_ADD = 1

[subKey]
REGEX = (?m-s)(?<=^|\s)[a-zA-Z]*(?P<subKey>\d+)(?=\s|$)
#FORMAT = subKey::$1
SOURCE_KEY = field:mainKey
WRITE_META = true
REPEAT_MATCH = true
MV_ADD = 1

props.conf

[testIndexFields]
DATETIME_CONFIG =
NO_BINARY_CHECK = true
category = Custom
description = Testing multivalue index-time fields
pulldown_type = true

TRANSFORMS-mainKey = mainKey
TRANSFORMS-subKey = subKey

Here testIndexFields is the sourcetype I'm importing this data into.
I prepared the following file as a data sample:

2016-12-13 17:07:20, record.1.code="MAIN132" record.2.code="PRE9087", record.3.code="1405"
2016-12-13 17:07:40, record.code="SingleCode0123456"
2016-12-13 17:08:00, record.1.code="123BadOne", record.2.code="GoodOne1", record.3.code="NoSubKey"
2016-12-13 17:08:20, record.1.code="!alsobad123",record.2.code="TryThis1508"
2016-12-13 17:07:20, record.code="Unnumbered0001", record.code="Unnumbered0002", record.code="Unnumbered0003"

I'm expecting the data to be extracted like this:

mainKey=MAIN132 mainKey=PRE9087 mainKey=1405 subKey=132 subKey=9087 subKey=1405
mainKey=SingleCode0123456 subKey=0123456
mainKey=123BadOne mainKey=GoodOne1 mainKey=NoSubKey subKey=1
mainKey=Unnumbered0001 mainKey=Unnumbered0002 mainKey=Unnumbered0003 subKey=0001 subKey=0002 subKey=0003

However, I'm getting this:

mainKey = MAIN132  mainKey = PRE9087  mainKey = 1405 subKey = 132
mainKey = SingleCode0123456 subKey = 0123456
mainKey = 123BadOne  mainKey = GoodOne1  mainKey = NoSubKey
mainKey = !alsobad123  mainKey = TryThis1508
mainKey = Unnumbered0001  mainKey = Unnumbered0002  mainKey = Unnumbered0003 subKey = 0001

As you can see, the subKey is extracted from the first occurrence of mainKey only. Is there a way to change this behavior?
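
To make the desired behavior concrete, here is a minimal Python sketch that applies the subKey pattern to every mainKey value separately. It is only an emulation under stated assumptions, not Splunk's implementation: Python's re module rejects the lookbehind (?<=^|\s) because its alternatives differ in width, so the sketch rewrites it as (?:^|(?<=\s)), and the helper name extract_sub_keys is made up for illustration.

```python
import re

# Python rewrite of the subKey pattern from transforms.conf.
# Assumption: (?<=^|\s) is expressed as (?:^|(?<=\s)) because Python's
# re module only accepts fixed-width lookbehinds.
SUB_KEY = re.compile(
    r"(?:^|(?<=\s))[a-zA-Z]*(?P<subKey>\d+)(?=\s|$)",
    re.MULTILINE,
)


def extract_sub_keys(main_keys):
    """Apply the pattern to every mainKey value, not just the first one."""
    sub_keys = []
    for value in main_keys:
        sub_keys.extend(m.group("subKey") for m in SUB_KEY.finditer(value))
    return sub_keys


# mainKey values of the five sample events, in order.
events = [
    ["MAIN132", "PRE9087", "1405"],
    ["SingleCode0123456"],
    ["123BadOne", "GoodOne1", "NoSubKey"],
    ["!alsobad123", "TryThis1508"],
    ["Unnumbered0001", "Unnumbered0002", "Unnumbered0003"],
]

for main_keys in events:
    print(main_keys, "->", extract_sub_keys(main_keys))
```

Under this emulation the first event yields 132, 9087 and 1405, and the third yields only 1. The expected listing above has no row for the !alsobad123 event; here it would yield 1508 from TryThis1508 and nothing from !alsobad123.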

Motivator

If your mainKey regex is working fine and subKey is always derived from mainKey, can you try a regex for subKey similar to the one you used for mainKey, and see if it works:

REGEX = record(?:\.\d+)?\.code="(?<mainKeyPrefix>[^\d]+)(?<subKey>[\d]+)"

This creates the mainKeyPrefix and subKey fields. Alternatively, if search-time extraction is an option for you, you can use the same regex there, something like:

your query to return mainKey
| rex field=mainKey "(?<mainKeyPrefix>[^\d]+)(?<subKey>[\d]+)"
| table mainKey, subKey
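
Before adopting the combined pattern, it may be worth checking what it captures from the sample events in the question. The sketch below is a Python emulation, not Splunk itself: named groups are written as (?P<name>...), re.finditer stands in for REPEAT_MATCH, and the function name sub_keys is hypothetical.

```python
import re

# The suggested pattern, with Python-style named groups.
COMBINED = re.compile(
    r'record(?:\.\d+)?\.code="(?P<mainKeyPrefix>[^\d]+)(?P<subKey>[\d]+)"'
)

# The five sample events from the question, one string per event.
SAMPLE = [
    '2016-12-13 17:07:20, record.1.code="MAIN132" record.2.code="PRE9087", record.3.code="1405"',
    '2016-12-13 17:07:40, record.code="SingleCode0123456"',
    '2016-12-13 17:08:00, record.1.code="123BadOne", record.2.code="GoodOne1", record.3.code="NoSubKey"',
    '2016-12-13 17:08:20, record.1.code="!alsobad123",record.2.code="TryThis1508"',
    '2016-12-13 17:07:20, record.code="Unnumbered0001", record.code="Unnumbered0002", record.code="Unnumbered0003"',
]


def sub_keys(event):
    """Return every subKey the pattern finds in one raw event."""
    return [m.group("subKey") for m in COMBINED.finditer(event)]


for event in SAMPLE:
    print(sub_keys(event))
```

In this emulation the pattern differs from the question's expected output in two places: the all-digit code 1405 is skipped, because [^\d]+ needs at least one non-digit before the digits, and !alsobad123 produces 123, because [^\d]+ also accepts the leading "!". Whether that matters depends on the real data.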

Builder

I ended up extracting the multivalue subKey field at search time using props.conf and transforms.conf, saving it into a summary index, and tokenizing it in fields.conf to preserve its multivalue nature.

The extraction is described in this follow-up question.

The field needs to be tokenized in the summary index for the following reason: multivalue fields arrive in a summary index as a single value, apparently created by mvjoin(source,'\n'). To search on individual values, I need that TOKENIZER setting in fields.conf.
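
As a rough illustration of why the tokenizer is needed, the Python sketch below emulates a multivalue field being flattened and then split again, assuming the values really are joined with newlines as described above. The pattern ([^\n]+) is a hypothetical tokenizer regex chosen for this sketch, not the setting actually used, and both function names are made up.

```python
import re

# Hypothetical tokenizer pattern: each run of non-newline characters is one value.
TOKENIZER = re.compile(r"([^\n]+)")


def to_summary_value(values):
    """Emulate a multivalue field being flattened into one newline-joined string."""
    return "\n".join(values)


def tokenize(stored):
    """Recover the individual values, taking the first group of each match."""
    return [m.group(1) for m in TOKENIZER.finditer(stored)]


stored = to_summary_value(["132", "9087", "1405"])
print(repr(stored))      # the single flattened value
print(tokenize(stored))  # the individual values again
```

Without the split step, a search for a single value such as 9087 would have to match against the whole joined string rather than one of its tokens.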

Builder

In the end I decided to extract that field (subKey) at search time and save it into a summary index. The way I did the extraction is described in this follow-up question.

I'm leaving it here because it might be helpful to someone reading it some time later.

Builder

Yes, your first suggestion was my next step. I don't like it much, because in practice I'm extracting mainKey from differently formatted records, so I have 4 or 5 transforms, all extracting mainKey. I'd have to replicate them, multiply them (because subKey is extracted differently from different mainKey formats), and edit them to extract the subKey. Still, if extraction from a multivalue field doesn't work, I'll have to create all those subKey extractions.

Extracting subKey at search time doesn't really help because I want to search on subKey=... and it's not an indexed token (one of the points justifying index-time field creation). One of the possibilities that we are still looking at is to put everything into a summary index, extracting the subKey using rex in the summarizing search and saving it along with the mainKey.

Builder

By the way, I did find that rex behaves quite differently from the regular expression in transforms.conf. At search time, even the simplest rex pattern, ^[a-zA-Z]*(?P<subKey>\d+)$, with the crudest 'beginning of line'/'end of line' anchors, works as expected and returns multiple values when scanning a multivalue field. Is this a bug or a feature?
I'll accept your answer since it seems that not much more can be done during index time, and there are two good workarounds there.

Motivator

I am happy that it worked out for you! Happy Splunking!
