Solved: Help with regex for a highly confusing field extra...

vtsguerrero · ‎10-22-2014

Hello guys!
I know Splunk has a REGEX helper, but in this case, I have an amount of data wich is almost binary, take a look at one event row:
Foreach 38 substr I should have a new extracted field and it's also important to consider these two "blank spaces" when they found else consider numbers but always in a 38 sequence starting after XX01, in the original data I don't have the ** to mark the beggining of each field, just used it to show my problem...

2014-10-2210:13:19XX01*0003 000000650000006000000000000000000004 00000000000000000000000200000000*0005 000000000000000000000005000000000007010000001700000017000000000000000000080100000024000000230000000000000000000901000000060000000600000000000000000011010000001300000011000000010000000000120100000006000000060000000000000000001301000000060000000700000001000000000013 0000000100000001000000000000000000150100000061000000610000000000000000001511000000670000006700000000000000000015 00000149000001480000000100000000

Thanks in advance if anybody has a hint on this.
Bst Rgrds!

richgalloway · ‎10-22-2014

Try this:

... | rex max_match=0 "[A-Z]{2}[0-9]{2}(?<sub38>([\d ]{38})" | ...

This will give you a multivalue variable with all of the matches.

---
If this reply helps you, Karma would be appreciated.

View solution in original post

richgalloway · ‎10-22-2014

Try this:

... | rex max_match=0 "[A-Z]{2}[0-9]{2}(?<sub38>([\d ]{38})" | ...

This will give you a multivalue variable with all of the matches.

---
If this reply helps you, Karma would be appreciated.

vtsguerrero · ‎10-22-2014

Almost on the way I guess, shoud I use this rex in search tab or interactive field extractor?
There's just something else, these blocks, don't always start by XX01, they can vary for other letters AB, CD, DA, and so on...
If I use this rex match in search how will I extract it to a fixed field? Should use it inside field extractor right?
Thanks a lot @richgalloway

richgalloway · ‎10-22-2014

Use this command in the search tab and it will extract the data into a multivalue variable (sub38, in my example). You then can manipulate the variable using the various mv* commands.

I've modified my answer to accommodate eye-catchers other than 'XX'.

---
If this reply helps you, Karma would be appreciated.

vtsguerrero · ‎10-22-2014

Sorry bothering again, but just compared some fields and they also may not always use 01 after two letters, take a look at this other entry:

2014-10-2211:31:41UV020003  000001380000011800000014000000000004  000000000000000000000011000000010005  0000000000000000000000070000000000070100000041000000410000000000000000000801000000530000004300000002000000000009010000000400000004000000000000000000110100000011000000100000000000000000001201000000050000000500000000000000000013010000001300000014000000000000000000150100000144000001430000000100000000001511000000430000004200000001000000000015  00000309000003050000000400000000

Is there any table command I could use after this | rex expression just to confirm if selected fields are ok to extract in seach?
Thanks a lot @richgalloway

richgalloway · ‎10-22-2014

Replace '01' in the rex command with '[0-9]{2}'. I'll update my answer.

You should be able to see the extracted field with a table sub38 command.

---
If this reply helps you, Karma would be appreciated.

vtsguerrero · ‎10-22-2014

It's starting at the letters... I need it to start considering after two letters and two numbers, look:

 UV150003 00000192000001700000001000000000 - UV15 shoud not be considered.
 TF900013 00000000000000010000000000000000 - TF90 shoud not be considered.

But I need the blank spaces after 3 and before 0 to be considered in the extraction. Just one more question, will this work every 38 substr or just for first 38?

richgalloway · ‎10-22-2014

Excluding the 'UV15' is easy - just move that part of the regex outside the field. I'll update the answer.

The max_match=0 option tells rex to match all sets of 38.

---
If this reply helps you, Karma would be appreciated.

vtsguerrero · ‎10-22-2014

It worked! The regex is perfect now! The only problem, it's just gettin' the first 38 subsrt, this is a complete log:

2010-05-2215:04:16XX050003  000001310000012000000010000000010004  000000000000000000000004000000000005  0000000000000000000000080000000000070100000068000000670000000000000000000801000000730000006200000000000000000009010000001200000012000000000000000000110100000026000000210000000100000000001201000000080000000800000000000000000013010000001200000011000000020000000000150100000134000001340000000000000000001511000000380000003800000000000000000015  00000315000003090000000600000000

And when I use the rex and table, it worked fine, for the first 38

rex max_match=0 "[A-Z]{2}[0-9]{2}(?<sub38>([\d ]{38}))" | table sub38

Show as result:

0003 00000131000001200000001000000001

Shoud this table show the rest, starting from 0004 and on as I need to extract as one single field?
Thanks in advance @richgalloway

richgalloway · ‎10-22-2014

You can use the mvcount command to verify more than one substr is matched. For example, rex... | eval count=mvcount(sub38) | table count.

---
If this reply helps you, Karma would be appreciated.

vtsguerrero · ‎10-22-2014

count
1

My count is 1, refeers to 0003 00000131000001200000001000000001 only first substr...

richgalloway · ‎10-22-2014

I think I see what is happening. There is only one instance per event of 'XX01' followed by 38 characters. This is trickier than I thought. Perhaps we can do this in two separate operations. Use

rex "[A-Z]{2}[0-9]{2}(?<subs>[\d ]+)"

to extract everything after 'XX01' then use

rex field=subs max_match=0 "(?<sub38>[\d ]{38})"

to break the result into 38-character chunks.

---
If this reply helps you, Karma would be appreciated.

vtsguerrero · ‎10-23-2014

Got it, working perfectly now @richgalloway...
I just changed the beggining to [A-Za-z]
But there's something else... I'm now working with two rex, I saved the first one into an extraction field and now I've calling this field in the second rex...
I have some subfields I have to create for each 38 chunk, I did it like this:

| eval A = substr(sub38,1,4)
| eval B = substr(sub38,5,2)
| eval C = substr(sub38,7,8)
| eval D = substr(sub38,15,8)
| eval E = substr(sub38,16,8)
| eval F = substr(sub38,24,8)

Consdering that substr function starts at first argument and moves to second argument position number, I need to extract for the first field 4 values of the chunk, for the second field 2 values of the chunk and the rest 8 values for fours fields subsequent.
But if I do that using eval substr I only get the first 38 chunk as a result and I needed for all event's chunk, any suggestion on how I could possibly do that?
Thanks in advance @richgalloway !

richgalloway · ‎10-23-2014

I've about reached the limit of my experience and understanding of multi-value fields. I suggest you open a new question.

---
If this reply helps you, Karma would be appreciated.

vtsguerrero · ‎10-23-2014

Okay bro, thanks in advance for the hints...

Help with regex for a highly confusing field extraction

Join the Splunk Community Slack to learn, troubleshoot, and make connections with fellow Splunk practitioners in real time!

Join Splunk User Groups to connect and learn in-person by region or remotely by topic or industry.

Agent Mode Engaged! Enchaining Agentic Operations with Splunk AI Assistant 2.0

Announcing Modern Navigation: A New Era of Splunk User Experience

Modernize your Splunk Apps – Introducing Python 3.13 in Splunk

Join the Conversation

Help with regex for a highly confusing field extraction

Join the Splunk Community Slack to learn, troubleshoot, and make connections with fellow Splunk practitioners in real time!

Join Splunk User Groups to connect and learn in-person by region or remotely by topic or industry.

Agent Mode Engaged! Enchaining Agentic Operations with Splunk AI Assistant 2.0

Announcing Modern Navigation: A New Era of Splunk User Experience

Modernize your Splunk Apps – Introducing Python 3.13 in Splunk