
Rex multiline extraction only grabbing every other occurrence

rmercy
Explorer

Hoping this is something simple with lookahead/lookbehind that I'm missing... I'm trying to extract multi-line fields from ANSI 835 files that are indexed in chunks by line count, so 10K-line events (unfortunately, I have no control over the sourcetype / event breaking for these). My rex is matching the pattern, but after the first match it skips the second and matches the third. Then it skips the fourth and matches the fifth, etc. Each capture starts and ends with the same pattern (CLP*), and there can be all kinds of variations in the number of lines, type of lines (starting characters), number of *-delimited fields (with or without values) in each line, and multiple types of special characters. The constants are the tilde ~ line breaks, and that I need everything between each CLP* occurrence.

In the example 835 below, I would need to have three multi-line fields extracted starting with (1) 77777777*, then (2) 77777778*, and (3) 77777779*, but my rex is only getting (1) and (3). Also, I know there are some redundancies (m and n+, etc.); they don't appear to be impacting the results... though happy to eat that sandwich if I'm wrong. Any help with this would be much appreciated!

Cheers!

 

| rex max_match=0 "(?msi)CLP\*(?P<clmevent>.*?)\n+\CLP\*"

 

Example 835:

N4*Carson*NV*89701~
PER*BL*Nevada Medicaid*TE*8776383472*EM*nvmmis.edisupport@dxc.com~
N1*PE*SUMMER*XX*6666666666~
REF*TJ*111111111~
CLP*77777777*4*72232*0**MC*6666666666666~
CAS*OA*147*50016*0~
CAS*CO*26*22216*0~
NM1*QC*1*TOM*SMITH****MR*77777777777~
NM1*74*1*ALAN*PARKER****C*88888888888~
NM1*PR*2*PACIFI*****PI* 9999~
NM1*GB*1*BARRY*CARRY****MI*666666666~
REF*EA*8888888~
DTM*232*20180314~
DTM*233*20180317~
SE*22*0001~
ST*835*0002~
BPR*H*0*C*NON************20180615~
TRN*1*100004765*5555555555~
DTM*405*20180613~
N1*PR*DIVISON OF HEALTH CARE FINANCING AND POLICY~
N3*1100 East William Street Suite 101~
N4*Carson*NV*89701~
PER*BL*Nevada Medicaid*TE*8776383472*EM*nvmmis.edisupport@dxc.com~
N1*PE*VALLEY*XX*6666666666~
REF*TJ*530824679~
LX*1~
CLP*77777778*2*3002*0**MC*6666666666667~
CAS*OA*176*3002*0~
NM1*QC*1*BOB*THOMAS****MR*55555555555~
NM1*74*1*ALAN*JACKSON****C*66666666666~
REF*EA*8888888~
DTM*232*20171001~
DTM*233*20171002~
CLP*77777779*4*41231.04*0**MC*6666666666668~
CAS*OA*147*9365.04*0~
CAS*CO*26*31866*0~
NM1*QC*1*HELD*ALLEN****MR*77777777778~
NM1*74*1*RYAN*LARRY****C*88888888889~
NM1*PR*2*SENIOR*****PI* 8888~


yuanliu
SplunkTrust

Splunk uses PCRE, but its behavior differs in places, and I have a hard time trusting it with multiline matching. In my vanilla 9.1.2 installation, for example, your rex returns only the 77777777* claim.

Because your ANSI 835 is strictly formatted, maybe split will suffice.

| eval clmevent = mvindex(split(_raw, "
CLP*"), 1, -1) ``` extra newline is paranoia - sample works without ```
| mvexpand clmevent

Your sample event gives me all three. Here is an emulation you can compare with real data:

| makeresults
| fields - _time
| eval _raw="N4*Carson*NV*89701~
PER*BL*Nevada Medicaid*TE*8776383472*EM*nvmmis.edisupport@dxc.com~
N1*PE*SUMMER*XX*6666666666~
REF*TJ*111111111~
CLP*77777777*4*72232*0**MC*6666666666666~
CAS*OA*147*50016*0~
CAS*CO*26*22216*0~
NM1*QC*1*TOM*SMITH****MR*77777777777~
NM1*74*1*ALAN*PARKER****C*88888888888~
NM1*PR*2*PACIFI*****PI* 9999~
NM1*GB*1*BARRY*CARRY****MI*666666666~
REF*EA*8888888~
DTM*232*20180314~
DTM*233*20180317~
SE*22*0001~
ST*835*0002~
BPR*H*0*C*NON************20180615~
TRN*1*100004765*5555555555~
DTM*405*20180613~
N1*PR*DIVISON OF HEALTH CARE FINANCING AND POLICY~
N3*1100 East William Street Suite 101~
N4*Carson*NV*89701~
PER*BL*Nevada Medicaid*TE*8776383472*EM*nvmmis.edisupport@dxc.com~
N1*PE*VALLEY*XX*6666666666~
REF*TJ*530824679~
LX*1~
CLP*77777778*2*3002*0**MC*6666666666667~
CAS*OA*176*3002*0~
NM1*QC*1*BOB*THOMAS****MR*55555555555~
NM1*74*1*ALAN*JACKSON****C*66666666666~
REF*EA*8888888~
DTM*232*20171001~
DTM*233*20171002~
CLP*77777779*4*41231.04*0**MC*6666666666668~
CAS*OA*147*9365.04*0~
CAS*CO*26*31866*0~
NM1*QC*1*HELD*ALLEN****MR*77777777778~
NM1*74*1*RYAN*LARRY****C*88888888889~
NM1*PR*2*SENIOR*****PI* 8888~"
``` data emulation above ```
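
One caveat with the split approach, sketched here in Python as a rough stand-in for the eval (str.split on a literal delimiter, then dropping element 0, which is what mvindex(..., 1, -1) does). Because the events are chunked by line count, an event could begin directly with a CLP* line. In that case there is no newline before the first CLP*, so that claim lands in element 0 and is discarded. The function name and the shortened samples are hypothetical, and this is an illustration of the split logic only, not something run inside Splunk.

```python
def split_claims(raw):
    # Rough stand-in for: mvindex(split(_raw, "<newline>CLP*"), 1, -1)
    # Split on a literal newline + "CLP*", then drop element 0.
    return raw.split("\nCLP*")[1:]


# Event that starts with non-claim lines (like the sample in this thread).
mid_event = (
    "REF*TJ*111111111~\n"
    "CLP*77777777*4~\n"
    "CAS*OA*147~\n"
    "CLP*77777778*2~\n"
    "CAS*OA*176~"
)

# Hypothetical event whose very first line is a CLP segment.
leading_event = (
    "CLP*77777777*4~\n"
    "CAS*OA*147~\n"
    "CLP*77777778*2~\n"
    "CAS*OA*176~"
)

mid_ids = [c.split("*")[0] for c in split_claims(mid_event)]
leading_ids = [c.split("*")[0] for c in split_claims(leading_event)]

print(mid_ids)      # ['77777777', '77777778']
print(leading_ids)  # ['77777778'] (the first claim was in element 0)
```

If real events can start on a CLP* line, it may be worth checking element 0 before discarding it.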

Hope this helps.


ITWhisperer
SplunkTrust

Try something like this

| rex max_match=0 "(?s)(?P<clmevent>(?<=CLP\*).*?(?=CLP\*|$))"
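
For anyone wondering why the original rex skipped every other claim: regex matches do not overlap, and the original pattern consumes the closing CLP* as part of each match. The next search therefore starts after that CLP*, so the claim it belongs to has no opening delimiter left to match on. The lookarounds in the pattern above assert the delimiters without consuming them. Below is a minimal sketch of that behavior using Python's re module, not Splunk, so treat it as an illustration rather than proof of what rex does internally. The five-claim sample and its IDs (1001 to 1005) are made up, and the stray backslash before the final CLP in the original pattern is dropped because Python's re rejects \C.

```python
import re

# Made-up sample: five CLP segments with tilde-terminated lines.
raw = (
    "REF*TJ*111111111~\n"
    "CLP*1001*4*72232*0**MC*A~\n"
    "CAS*OA*147*50016*0~\n"
    "CLP*1002*2*3002*0**MC*B~\n"
    "CAS*OA*176*3002*0~\n"
    "CLP*1003*4*41231.04*0**MC*C~\n"
    "NM1*QC*1*HELD*ALLEN****MR*1~\n"
    "CLP*1004*1*100*0**MC*D~\n"
    "CLP*1005*1*200*0**MC*E~\n"
    "CAS*CO*26*31866*0~"
)

# Consuming pattern (close to the one in the question): the trailing
# CLP* is part of the match, so the following claim loses its opener.
consuming = re.findall(r"(?msi)CLP\*(?P<clmevent>.*?)\n+CLP\*", raw)

# Lookaround pattern: delimiters are asserted, not consumed.
lookaround = re.findall(r"(?s)(?P<clmevent>(?<=CLP\*).*?(?=CLP\*|$))", raw)

# Reduce each captured block to its claim ID (first *-delimited field).
consuming_ids = [m.split("*")[0] for m in consuming]
lookaround_ids = [m.split("*")[0] for m in lookaround]

print(consuming_ids)   # ['1001', '1003']
print(lookaround_ids)  # ['1001', '1002', '1003', '1004', '1005']
```

Note that the last claim is also lost by the consuming pattern, since nothing follows it to act as the closing CLP*.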

rmercy
Explorer

Thanks, @ITWhisperer, works great!  Appreciate the response!


rmercy
Explorer

Thanks, @yuanliu, this works! So does @ITWhisperer's solution! I don't think it will allow me to select both as the accepted solution, so I clicked it for yours since you replied first. Thanks!!
