Splunk Search

How to convert a flat multivalue field back to a real multivalue format via regex?

Bastelhoff
Path Finder

Hey there!

I am currently having some trouble in converting a flattened multivalue field back into a real multivalue field.
I know that there are easy ways with multi value field commands. However I need to do this via regex in this specific case.
So the _raw is looking like this

06/23/2019 11:59:32 +0000, lotsofrandomstuff="lotsofrandomstuff",interestingIds="1234 2345 56784", lotsofotherrandomstuff="lotsofotherrandomstuff"

However I just cannot get it to work.
If I just take the field interestingIds and extract it from there its fairly easy and working:

| rex field=_raw max_match=100 "(?<Ids>\d*)\s?"

But as soon as I set the beginning (interestingIds=”) and the end (“), I only ever get one value, either the first or the last. Or of course all in the same format as before, but not as multivalue.

Get the same field:

| rex field=_raw "interestingIds=\"(?<Ids>((\d)|(\s))*)\".*"

Other stuff I tried:

This only takes the first value

| rex field=_raw max_match=100 "interestingIds=((\"|\s)(?\d+)(\"|\s))*.*"

This one only takes the last value (which kind of means that the value is overwritten each time I guess)

| rex field=_raw max_match=100 " interestingIds=\"((?<Ids>\d+)\s?)*\".*"

This one only gives the very last digit:

| rex field=_raw max_match=100 " interestingIds=\"(\d+\s?)*(?<Ids>\d+)(\s?\d+)*\".*"

You can replace the first * with a ? to get the second segment or a {2} to get the third segment, but it’s only one segment then.

So maybe you have an idea on how to make it work.

Update:
So my next idea was to not focus on the stuff I want to take away, but to filter out everything else, hoping that the Front and Rear Anchor would work:

| rex field=_raw max_match=0 "^(.*interestingIds=)(\"|(.*\s))(?<Ids>\d+)(\"|(\s.*))(, lotsofotherrandomstuff=.*)$"

| rex field=_raw max_match=0 "^(.*interestingIds=\")(?<Ids>\d+)(\", lotsofotherrandomstuff=.*)$"

Unfortunately both options do not work.
It seems like as soon as you specify any kind of position, it fails.

0 Karma

woodcock
Esteemed Legend
0 Karma

Bastelhoff
Path Finder

Thanks! interesting stuff.
In my case, I am taking the data from a custom written index though, so I don't think it can be applied here.

0 Karma

woodcock
Esteemed Legend

OK, to do it from _raw you can use this:

... | rex max_match=0 (?<= |\")(?<interestingIds>\d{3,})
0 Karma

Bastelhoff
Path Finder

Thanks. While this does create a multivalue field, it also grabs lots of stuff from before and after the specified location. Like other IDs from before the
interestingIds=
and after it. It also grabs some timestamps and other stuff (which I simplified in my initial post by writing lotsofrandomstuff="lotsofrandomstuff")
Furthermore it also grabs all the correct values (interestingIds) twice.

Could you elaborate what the start and the ending is doing?
I guess the d{3,} (whout anything after the comma) is to be read as "repeat 3 times or more" (though apparently the minimum is 2 if you replace the 3 with a 1.
The first part (?<= |\") and specifically the ?<= I don't understand. Also changing it to just:

| rex max_match=0 ( |\")(?<interestingIds>\d{3,})

gives exactly the same results.

0 Karma

woodcock
Esteemed Legend

It works for the dataset that you gave me and only grabs the correct stuff. I cannot build an answer for constraints and requirements that you did not provide. The way that it works is with a positive lookbehind that says "the captured value must be preceded by either a space or a vertical-pipe and it must be at least 3 characters long. This means that the timestamp stuff does not qualify. It would only grab the values twice if they appear in the event twice. Again, give better details and data, get a better answer. Removing the positive lookbehind MOST DEFINITELY will NOT give the same results.

0 Karma

Bastelhoff
Path Finder

The “real” dataset consists of over 50 different fields and just one entry would have been bigger than my whole original post.

There is basically all kinds of different fields, including other multivaluefields with strings and numbers before and after the MV which needs to be extracted.

0 Karma

jnudell_2
Builder

The best way to do that is in @woodcock 's original answer. You'll just have to do a second rex, or a split to change the fields with multiple values to real MV fields. Anything you come up with a single regex (if even possible, and I'm not sure it is) would probably be too expensive to run anyway.

0 Karma

woodcock
Esteemed Legend

If you are trying to create a multivalued field from that raw event, then starting with this:

| makeresults 
| eval _raw="06/23/2019 11:59:32 +0000, lotsofrandomstuff=\"lotsofrandomstuff\",interestingIds=\"1234 2345 56784\", lotsofotherrandomstuff=\"lotsofotherrandomstuff\""
| kv

Either this:

| rex max_match=0 field=interestingIds "(?<interestingIds>\d+)"

Or this:

| eval interestingIds=split(interestingIds, " ")

Or this:

| makemv delim=" " interestingIds

Bastelhoff
Path Finder

I wanted to do that with regex right away, also to use it as field extraction. Extracting that from only one pre-extracted field is fairly easy, but isn't what I was looking for.

0 Karma

jnudell_2
Builder

regex doesn't work that way. Once you anchor it with the interestingIds, it will only match the value following interestingIds.

0 Karma

woodcock
Esteemed Legend

The "out of the box" method is like this:

To save:

... | mvjoin(src, ":::") | outputlookup multivalue_src.csv

To restore:

| inputlookup multivalue_src.csv | makemv delim=":::" src
0 Karma

woodcock
Esteemed Legend

Consider JSON Tools:
https://github.com/doksu/TA-jsontools/wiki

Multi-value field preservation when exporting to csv
Fields with multiple values can be easily preserved when exporting to csv. For example, we can convert a single field to JSON:

... | mkjson outputfield=src src
| outputlookup mylookup

This can then be reconstituted with spath:

... | lookup mylookup ... OUTPUT src
| spath input=src

This is useful to migrate KV Store Collection records from one host to another (or to a SHC), but be mindful of the fact spath removes any preceding underscore from field names. For example, to export Enterprise Security's Notable Event lookup to a csv containing JSON, the following could be used:

| inputlookup es_notable_events
| mkjson includehidden=true outputfield=json
| table json
| outputlookup migration_es_notable_events

If the flat-file csv lookup is then migrated, it can be used to re-populate the contents of the KV Store without loss of fidelity (multi-value fields, etc.) but the fields beginning with an underscore must be renamed:

| inputlookup migration_es_notable_events
| spath input=json
| rename key as _key
| rename time as _time
| outputlookup es_notable_events
0 Karma
Get Updates on the Splunk Community!

Extending Observability Content to Splunk Cloud

Register to join us !   In this Extending Observability Content to Splunk Cloud Tech Talk, you'll see how to ...

What's new in Splunk Cloud Platform 9.1.2312?

Hi Splunky people! We are excited to share the newest updates in Splunk Cloud Platform 9.1.2312! Analysts can ...

What’s New in Splunk Security Essentials 3.8.0?

Splunk Security Essentials (SSE) is an app that can amplify the power of your existing Splunk Cloud Platform, ...