Regex - Extracting a string between two records

mgranger1 · ‎12-22-2017

Okay, here we go. Let's get the basics out of the way. We run Splunk Enterprise 6.6.4, on-prem, from Linux based servers (RedHat). We have 4 indexers, but they aren't clustered, they are just autoLB. I don't think any of this will effect my question, but I like to set the stage.

I'm the Splunk admin for our organization, and while I can muddle my way through Regex, I'm not great with it. I have a situation where there is a data source that throws multiple "records" into a single Splunk "event". This data source is coming off of a mainframe feed where I don't really have the option of altering the source data. Every "record" within the "event" starts with a userid that can be any letter, number or character and may be somewhere between 1 and 8 characters. That user id is followed immediate by a space, 9 dashes, another space and then the word "STRING(S)". I basically need a regex that will pull out each "record" into its own string. I have been able to write a regex that successfully pulls out every other record, but because I have to use the " --------- STRING(S) FOUND" as the terminating string as well as the starting string, I don't know how to tell it to read the terminating string to determine the record is over, but then effectively back up and use the terminating string of one record as the starting string of the next record. Further adding to the complexity is the fact that there may be several CR LF (carriage return, line feed) hidden characters in the string that I want to capture. I've included some sample data, and in the sample data, I need to capture from "@1YMD" down to, but not including "@2EDA". Then, I need the next capture string to go from "@2EDA" and go up to but not include "@2EDC" (and then so on, and so forth through the whole event). I want to capture everything from the word prior to " --------- STRING(S)" to the next occurrence of " --------- STRING(S)" without reading the second userid, so that it is available to start the next record. I've tried non capture groups and having it "give back" some of the characters, but I can't get it just right. This is as close as I've gotten:

(?<member_string>(?<member>[a-zA-Z0-9\@]{1,8})\s+---------\sSTRING$S$.*?)\s[a-zA-Z0-9\@]{1,8}\s---------\sSTRING(S)

I'm really hoping this makes sense to all of you, and that I don't sound like an idiot. Any help would be appreciated.

Sincerely,
Matthew Granger

P.S. - I've tried to clean up the regex to display properly in the "preview" to show less than and greater than symbols and such, hopefully I've do okay

@1YMD --------- STRING(S) FOUND ------------------- 1 00001000$KEY(1YMD) TYPE(AKC) 2 00002000 UID(EJB7) ALLOW 3 00003000 UID(EJC7) ALLOW 4 00005000 UID(EJF4) ALLOW 5 00006000 UID(EJF5) ALLOW 6 00007000 UID(EJ03) ALLOW 7 00008000 UID(EJ18) ALLOW 8 00009000 UID(EJ19) ALLOW 9 00010000 UID(EJ20) ALLOW 10 00011000 UID(EJ21) ALLOW 11 00013000 UID(EJ54) ALLOW 12 00014000 UID(EJ55) ALLOW 13 00015000 UID(EJ58) ALLOW 14 00016000 UID(EJ62) ALLOW 15 00017000 UID(E*KG01) ALLOW 16 00018000 UID(EKL00) ALLOW @2EDA --------- STRING(S) FOUND ------------------- 2 00001000$KEY(2EDA) TYPE(AKC) 3 00002001 UID(EJ19) ALLOW 4 00002101 UID(EJ20) ALLOW 5 00002202 UID(EJ21) ALLOW @2EDC --------- STRING(S) FOUND ------------------- 2 00001000$KEY(2EDC) TYPE(AKC) 3 00002000 UID(EJB7) ALLOW 4 00003000 UID(EJF4) ALLOW 5 00004000 UID(EJF5) ALLOW 6 00005000 UID(EJ03) ALLOW 7 00007000 UID(EJ18) ALLOW 8 00008000 UID(EJ19) ALLOW 9 00009000 UID(EJ20) ALLOW 10 00010000 UID(EJ21) ALLOW 11 00011000 UID(EJ54) ALLOW 12 00012000 UID(EJ58) ALLOW 13 00013000 UID(EJ60) ALLOW 14 00014000 UID(EKL00ON) ALLOW
ISRSUPC - MVS/PDF FILE/LINE/WORD/BYTE/SFOR COMPARE UTILITY- ISPF FOR z/OS 2017/12/20 0.15 PAGE 6 LINE-# SOURCE SECTION SRCH DSN: SECURITY.ACF2AKC.RULES 15 00015000 UID(E**I9) ALLOW @2EMT --------- STRING(S) FOUND ------------------- 2 00001000$KEY(2EMT) TYPE(AKC) 3 00002000 UID(EJB7) ALLOW 4 00003000 UID(EJF4) ALLOW 5 00004000 UID(EJF5) ALLOW 6 00005000 UID(EJ03) ALLOW 7 00007000 UID(EJ18) ALLOW 8 00008000 UID(EJ19) ALLOW 9 00009000 UID(EJ20) ALLOW 10 00010000 UID(EJ21) ALLOW 11 00011000 UID(EJ54) ALLOW 12 00012000 UID(EJ58) ALLOW 13 00013000 UID(EJ60) ALLOW 14 00014000 UID(EKL00ON) ALLOW 15 00015000 UID(E****I9) ALLOW @2FCS --------- STRING(S) FOUND ------------------- 2 00001000$KEY(2FCS) TYPE(AKC) 3 00002000 UID(EJB7) ALLOW 4 00003000 UID(EJF4) ALLOW 5 00004000 UID(EJF5) ALLOW 6 00005000 UID(EJ03) ALLOW 7 00007000 UID(EJ18) ALLOW 8 00008000 UID(EJ19) ALLOW 9 00009000 UID(EJ20) ALLOW 10 00010000 UID(EJ21) ALLOW 11 00011000 UID(EJ54) ALLOW 12 00012000 UID(EJ58) ALLOW 13 00013000 UID(EJ60) ALLOW 14 00014000 UID(EKL00ON) ALLOW 15 00015000 UID(E*******I9) ALLOW

wenthold · ‎12-27-2017

I think you may want to use a lookahead match, but this is a very computationally expensive search:

(?s)(?<data>[a-zA-Z0-9@]{1,8}\s+\-+.*?)(?=([@A-Z0-9]{1,8}\s+-+\s+STRING|$))

What I can't account for is how your events are terminated, and that will make a difference. You mention that there are CR/LFs in the data. How your events are ingested into Splunk, linemerged, etc. will matter.

Again ... this is a VERY expensive regex, and if you're processing a high volume of events it could be a problem. Just plugging this into regex101 with your sample data required 12,291 steps and took ~15ms to complete.

You may want to look into your input configuration and attempt to set your event breaking to make your data easier to work with.

Event Breaking

mgranger1 · ‎12-29-2017

I can't thank you enough for that regex. The result set is "relatively" small, and will only be run once daily to create a lookup table. I'll admit that the source data isn't ideal (far from it), but due to it being off of the mainframe, I don't have a lot of options in editing my source.

I have one problem remaining. As part of this process, I am using the "transaction" command to put several events together prior to running this regex. Some of the data goes across multiple original source events, so by using the transaction command, I am able to put all of the original source text from multiple events into a single field and then attempt to parse it out.

However, when the transaction command puts together the original text into a single field, it still has a hidden and (\t\r\n) in the text. On regex101, the provided regex reads right past these hidden characters (the way I want it to), but when this is done as part of a rex command in the search, it seems to break out at these hidden characters. The ".*" portion of the regex should read any character (even hidden ones), but it doesn't seem to. I've tried \s\S (all whitespace and all non-whitespace), but that didn't capture it either.

Any thoughts?

wenthold · ‎12-29-2017

Try including max_match - for example, if your trying to extract from the field "your_field":

| rex max_match=1000 field=your_field "(?s)(?<data>[a-zA-Z0-9@]{1,8}\s+\-+.*?)(?=([@A-Z0-9]{1,8}\s+-+\s+STRING|$))"

You may want to consider trying stats instead of transaction to merge events. For example, if you're working with the field "your_field":

{base search} | stats values(your_field) as your_fields

Note that this is deposited into the field "your_fields". Then run the rex command against the combined your_fields with max_match:

{base search} | stats values(your_field) as your_fields | rex max_match=1000 field=your_fields "(?s)(?<data>[a-zA-Z0-9@]{1,8}\s+\-+.*?)(?=([@A-Z0-9]{1,8}\s+-+\s+STRING|$))"

I would still looking at LINE_BREAKER in props.conf to make this process easier. Note that doing this will change how your events are formatted, approach doing it on product data lightly. Something like this in props.conf may work:

[your_source_type]
SHOULD_LINEMERGE = false
LINE_BREAKER = (([\s\r\n]+))[a-zA-Z0-9\@]{1,8})\s+---------\s+STRING

niketn · ‎12-26-2017

@mgranger1, your issue is that your data delimiter --------- STRING(S) FOUND ------------------- instead of being in front of the entire data is after a key piece of data i.e. User ID, which means this pattern can not be used to split the data into events.

Splunk can do this kind of correction for your, however, I feel that would be an unnecessary overhead on Splunk, since you will be correcting entire raw data in order to extract multiple events from the same. Somehow try to see if either User ID can be pushed after the delimiter String Found message or else User ID is present both before and after the delimiter string.

In the meanwhile following is the replace command which will match User ID as first pattern and String Found as 2nd Pattern and reverse them. Refer to documentation with similar example: http://docs.splunk.com/Documentation/Splunk/latest/SearchReference/TextFunctions#replace.28X.2CY.2CZ...

| eval data=replace(_raw,"(\s[^\s]+\s)(---------\sSTRING\(S\) FOUND\s-------------------\s)","\2\1")

Try the following run anywhere example based on your sample data to test:

| makeresults
| eval _raw=" @1YMD --------- STRING(S) FOUND ------------------- 1 00001000$KEY(1YMD) TYPE(AKC) 2 00002000 UID(EJB7) ALLOW 3 00003000 UID(EJC7) ALLOW 4 00005000 UID(EJF4) ALLOW 5 00006000 UID(EJF5) ALLOW 6 00007000 UID(EJ03) ALLOW 7 00008000 UID(EJ18) ALLOW 8 00009000 UID(EJ19) ALLOW 9 00010000 UID(EJ20) ALLOW 10 00011000 UID(EJ21) ALLOW 11 00013000 UID(EJ54) ALLOW 12 00014000 UID(EJ55) ALLOW 13 00015000 UID(EJ58) ALLOW 14 00016000 UID(EJ62) ALLOW 15 00017000 UID(E***KG01) ALLOW 16 00018000 UID(E***KL00) ALLOW @2EDA --------- STRING(S) FOUND ------------------- 2 00001000$KEY(2EDA) TYPE(AKC) 3 00002001 UID(EJ19) ALLOW 4 00002101 UID(EJ20) ALLOW 5 00002202 UID(EJ21) ALLOW @2EDC --------- STRING(S) FOUND ------------------- 2 00001000$KEY(2EDC) TYPE(AKC) 3 00002000 UID(EJB7) ALLOW 4 00003000 UID(EJF4) ALLOW 5 00004000 UID(EJF5) ALLOW 6 00005000 UID(EJ03) ALLOW 7 00007000 UID(EJ18) ALLOW 8 00008000 UID(EJ19) ALLOW 9 00009000 UID(EJ20) ALLOW 10 00010000 UID(EJ21) ALLOW 11 00011000 UID(EJ54) ALLOW 12 00012000 UID(EJ58) ALLOW 13 00013000 UID(EJ60) ALLOW 14 00014000 UID(E***KL00ON) ALLOW
 ISRSUPC - MVS/PDF FILE/LINE/WORD/BYTE/SFOR COMPARE UTILITY- ISPF FOR z/OS 2017/12/20 0.15 PAGE 6 LINE-# SOURCE SECTION SRCH DSN: SECURITY.ACF2AKC.RULES 15 00015000 UID(E*******I9) ALLOW @2EMT --------- STRING(S) FOUND ------------------- 2 00001000$KEY(2EMT) TYPE(AKC) 3 00002000 UID(EJB7) ALLOW 4 00003000 UID(EJF4) ALLOW 5 00004000 UID(EJF5) ALLOW 6 00005000 UID(EJ03) ALLOW 7 00007000 UID(EJ18) ALLOW 8 00008000 UID(EJ19) ALLOW 9 00009000 UID(EJ20) ALLOW 10 00010000 UID(EJ21) ALLOW 11 00011000 UID(EJ54) ALLOW 12 00012000 UID(EJ58) ALLOW 13 00013000 UID(EJ60) ALLOW 14 00014000 UID(E***KL00ON) ALLOW 15 00015000 UID(E*******I9) ALLOW @2FCS --------- STRING(S) FOUND ------------------- 2 00001000$KEY(2FCS) TYPE(AKC) 3 00002000 UID(EJB7) ALLOW 4 00003000 UID(EJF4) ALLOW 5 00004000 UID(EJF5) ALLOW 6 00005000 UID(EJ03) ALLOW 7 00007000 UID(EJ18) ALLOW 8 00008000 UID(EJ19) ALLOW 9 00009000 UID(EJ20) ALLOW 10 00010000 UID(EJ21) ALLOW 11 00011000 UID(EJ54) ALLOW 12 00012000 UID(EJ58) ALLOW 13 00013000 UID(EJ60) ALLOW 14 00014000 UID(E***KL00ON) ALLOW 15 00015000 UID(E*******I9) ALLOW @2FC2 --------- STRING(S) FOUND ------------------- 2 00001000$KEY(2FC2) TYPE(AKC) 3 00002000 UID(EJB7) ALLOW 4 00003000 UID(EJF4) ALLOW 5 00004000 UID(EJF5) ALLOW 6 00005000 UID(EJ03) ALLOW 7 00007000 UID(EJ18) ALLOW 8 00008000 UID(EJ19) ALLOW 9 00009000 UID(EJ20) ALLOW 10 00010000 UID(EJ21) ALLOW 11 00011000 UID(EJ54) ALLOW 12 00012000 UID(EJ58) ALLOW 13 00013000 UID(EJ60) ALLOW 14 00014000 UID(E***KL00ON) ALLOW 15 00015000 UID(E*******I9) ALLOW"
| fields - _time
| eval data=replace(_raw,"(\s[^\s]+\s)(---------\sSTRING\(S\) FOUND\s-------------------\s)","\2\1")
| makemv data delim="--------- STRING(S) FOUND ------------------- "
| mvexpand data
| rename data as _raw

PS: I have used makemv command since it is simple and robust. You can use rex with max_match=0 as well.

 .....
 .....
| eval data=replace(_raw,"(\s[^\s]+\s)(---------\sSTRING\(S\) FOUND\s-------------------\s)","\2\1")
| rex "---------\sSTRING\(S\) FOUND\s-------------------\s(?<data>[^-]+)?" max_match=0
| mvexpand data
| rename data as _raw

If both queries work as expected, choose the one that performs better using Job Inspector. Do consider fixing raw data in the first place as requested above.

____________________________________________
| makeresults | eval message= "Happy Splunking!!!"

mgranger1 · ‎12-27-2017

As I test more, it seems to not be able to parse out the individual portions of the string. The capture groups of the replace aren't found. I have tried the following:

| rex max_match=0 field=TEXT "(?<member_id>\s[^\s]+\s)(?<label_id>---------\sSTRING\(S\) FOUND\s-------------------\s)" 
| table JOBNAME STEPNAME Page member_id label_id

and there is no response for either member_id or label_id. However, if I just do the following:

| rex max_match=0 field=TEXT "(?<label_id>---------\sSTRING\(S\) FOUND\s-------------------\s)" 
| table JOBNAME STEPNAME Page label_id

it returns every occurrence of the "label". If it can't parse out the individual groups, it makes sense that it wouldn't know how to replace them.

Not sure where to go from here.

mgranger1 · ‎12-27-2017

I wish I had the option of switching the source data. I do not. This is coming as a data extract from a mainframe source, and I do not have access to altering this source.

I'm very interested in the method you describe, as I believe it would work, however, I am not able to make the replace function work as expected.

I have tried the following (where TEXT is the source field):

| eval data=replace(TEXT,"(\s[^\s]+\s)(---------\sSTRING\(S\) FOUND\s-------------------\s)","\2\1")
| table JOBNAME STEPNAME Page TEXT data

And there is no difference between "TEXT" (the original source) and "data" (which should be the result of the eval function).

Regex101 (which I realize isn't perfect), does evaluate the two groups properly, but it doesn't seem to be switching the strings as described.

mgranger1 · ‎12-26-2017

I've never noticed the (101010) button, thank you for bringing it to my attention. Once again, here is my "best guess" regex sample

(?<member_string>(?<member>[a-zA-Z0-9\@]+)\s+---------\sSTRING\(S\).*?)\s[a-zA-Z0-9\@]+\s---------\sSTRING\(S\)

And, here is the data sample:

@1YMD --------- STRING(S) FOUND ------------------- 1 00001000$KEY(1YMD) TYPE(AKC) 2 00002000 UID(EJB7) ALLOW 3 00003000 UID(EJC7) ALLOW 4 00005000 UID(EJF4) ALLOW 5 00006000 UID(EJF5) ALLOW 6 00007000 UID(EJ03) ALLOW 7 00008000 UID(EJ18) ALLOW 8 00009000 UID(EJ19) ALLOW 9 00010000 UID(EJ20) ALLOW 10 00011000 UID(EJ21) ALLOW 11 00013000 UID(EJ54) ALLOW 12 00014000 UID(EJ55) ALLOW 13 00015000 UID(EJ58) ALLOW 14 00016000 UID(EJ62) ALLOW 15 00017000 UID(E***KG01) ALLOW 16 00018000 UID(E***KL00) ALLOW @2EDA --------- STRING(S) FOUND ------------------- 2 00001000$KEY(2EDA) TYPE(AKC) 3 00002001 UID(EJ19) ALLOW 4 00002101 UID(EJ20) ALLOW 5 00002202 UID(EJ21) ALLOW @2EDC --------- STRING(S) FOUND ------------------- 2 00001000$KEY(2EDC) TYPE(AKC) 3 00002000 UID(EJB7) ALLOW 4 00003000 UID(EJF4) ALLOW 5 00004000 UID(EJF5) ALLOW 6 00005000 UID(EJ03) ALLOW 7 00007000 UID(EJ18) ALLOW 8 00008000 UID(EJ19) ALLOW 9 00009000 UID(EJ20) ALLOW 10 00010000 UID(EJ21) ALLOW 11 00011000 UID(EJ54) ALLOW 12 00012000 UID(EJ58) ALLOW 13 00013000 UID(EJ60) ALLOW 14 00014000 UID(E***KL00ON) ALLOW
ISRSUPC - MVS/PDF FILE/LINE/WORD/BYTE/SFOR COMPARE UTILITY- ISPF FOR z/OS 2017/12/20 0.15 PAGE 6 LINE-# SOURCE SECTION SRCH DSN: SECURITY.ACF2AKC.RULES 15 00015000 UID(E*******I9) ALLOW @2EMT --------- STRING(S) FOUND ------------------- 2 00001000$KEY(2EMT) TYPE(AKC) 3 00002000 UID(EJB7) ALLOW 4 00003000 UID(EJF4) ALLOW 5 00004000 UID(EJF5) ALLOW 6 00005000 UID(EJ03) ALLOW 7 00007000 UID(EJ18) ALLOW 8 00008000 UID(EJ19) ALLOW 9 00009000 UID(EJ20) ALLOW 10 00010000 UID(EJ21) ALLOW 11 00011000 UID(EJ54) ALLOW 12 00012000 UID(EJ58) ALLOW 13 00013000 UID(EJ60) ALLOW 14 00014000 UID(E***KL00ON) ALLOW 15 00015000 UID(E*******I9) ALLOW @2FCS --------- STRING(S) FOUND ------------------- 2 00001000$KEY(2FCS) TYPE(AKC) 3 00002000 UID(EJB7) ALLOW 4 00003000 UID(EJF4) ALLOW 5 00004000 UID(EJF5) ALLOW 6 00005000 UID(EJ03) ALLOW 7 00007000 UID(EJ18) ALLOW 8 00008000 UID(EJ19) ALLOW 9 00009000 UID(EJ20) ALLOW 10 00010000 UID(EJ21) ALLOW 11 00011000 UID(EJ54) ALLOW 12 00012000 UID(EJ58) ALLOW 13 00013000 UID(EJ60) ALLOW 14 00014000 UID(E***KL00ON) ALLOW 15 00015000 UID(E*******I9) ALLOW @2FC2 --------- STRING(S) FOUND ------------------- 2 00001000$KEY(2FC2) TYPE(AKC) 3 00002000 UID(EJB7) ALLOW 4 00003000 UID(EJF4) ALLOW 5 00004000 UID(EJF5) ALLOW 6 00005000 UID(EJ03) ALLOW 7 00007000 UID(EJ18) ALLOW 8 00008000 UID(EJ19) ALLOW 9 00009000 UID(EJ20) ALLOW 10 00010000 UID(EJ21) ALLOW 11 00011000 UID(EJ54) ALLOW 12 00012000 UID(EJ58) ALLOW 13 00013000 UID(EJ60) ALLOW 14 00014000 UID(E***KL00ON) ALLOW 15 00015000 UID(E*******I9) ALLOW

DalJeanis · ‎12-22-2017

It's useful to look at what something is NOT, rather than what it is. It looks like you can never have an @ in your data, other than in the member ID. Is this correct? If so, then you can use that as the stop for the member_string variable, by taking everything that ISN'T an @, like this...

"(?<member_string>(?<member>[a-zA-Z0-9\@]{1,8}))(\s+?-+\s+STRING\(S\) FOUND\s+-+\s+?[^\@]+)"

We could do a little more, in order to get rid of the ending space character in all but the last member_string, but that pulls out what you are asking for.

This note turned out to be unneeded, but it's generally useful so I'll leave it here for you.

With regex, you can give the system alternatives using parenthesis and the vertical pipe. (A|B) will select either the character "A" or the character "B". (A|$) will select either the character "A" or the end of the input string. So, that's a useful technique.

mgranger1 · ‎12-26-2017

I appreciate this suggestion, however, while all of the member_id examples in the data set start with "@", it isn't true that ALL of the member_id values start with "@". They might start with anything (hence the [a-zA-Z0-9\@]{1,8}

They can be any combination of 1 to 8 characters. Any letter or number, and they might contain an "@" or not. The only consistent thing about them is that they are the first "word" prior to --------- STRING(S).

DalJeanis · ‎12-22-2017

There are at least three ways to "mark" your code so the interface doesn't treat <tags> or * like html: (1) mark with the 101 010 button (2) put four blanks at the beginning of each line (3) put grave accents (the one on the same key as the tilde ~) before and after the code.

niketn · ‎12-22-2017

@mgranger1, Please repost the code and sample data using the code button on Splunk Answers (101010) so that special characters do not escape and modify actual data.

____________________________________________
| makeresults | eval message= "Happy Splunking!!!"

Regex - Extracting a string between two records

Accelerating Observability as Code with the Splunk AI Assistant

Integrating Splunk Search API and Quarto to Create Reproducible Investigation ...

Congratulations to the 2025-2026 SplunkTrust!

Join the Conversation

Regex - Extracting a string between two records

Accelerating Observability as Code with the Splunk AI Assistant

Integrating Splunk Search API and Quarto to Create Reproducible Investigation ...

Congratulations to the 2025-2026 SplunkTrust!