Hoping this is something simple with lookahead/lookback that I'm missing... trying to extract multi-line fields from ANSI 835 files indexed in chunks by line count, so 10K line events (unfortunately, I have no control over the sourcetype / event breaking for these). My rex is matching the pattern, but after the first match it skips the second and matches the third. Then it skips the fourth and matches the fifth, etc. The capture groups start and ends with the same pattern (CLP*), and there can be all kinds of variations in the number of lines, type of lines (starting characters), number of * delimited fields (without or without values) in each line, and multiple types of special characters. The constants are the tilde ~ line breaks, and that I need everything between each CLP* occurrence.
In the example 835 below, I would need to have three multi-line fields extracted starting with (1) 77777777*, then (2) 77777778*, and (3) 77777779*, but my rex is only getting (1) and (3). Also, I know there are some redundancies (m and n+, etc), doesn't appear they're impacting the results... though happy to eat that sandwich if I'm wrong. Any help with this would be much appreciated!
Cheers!
| rex max_match=0 "(?msi)CLP\*(?P<clmevent>.*?)\n+\CLP\*"
Example 835:
N4*Carson*NV*89701~
PER*BL*Nevada Medicaid*TE*8776383472*EM*nvmmis.edisupport@dxc.com~
N1*PE*SUMMER*XX*6666666666~
REF*TJ*111111111~
CLP*77777777*4*72232*0**MC*6666666666666~
CAS*OA*147*50016*0~
CAS*CO*26*22216*0~
NM1*QC*1*TOM*SMITH****MR*77777777777~
NM1*74*1*ALAN*PARKER****C*88888888888~
NM1*PR*2*PACIFI*****PI* 9999~
NM1*GB*1*BARRY*CARRY****MI*666666666~
REF*EA*8888888~
DTM*232*20180314~
DTM*233*20180317~
SE*22*0001~
ST*835*0002~
BPR*H*0*C*NON************20180615~
TRN*1*100004765*5555555555~
DTM*405*20180613~
N1*PR*DIVISON OF HEALTH CARE FINANCING AND POLICY~
N3*1100 East William Street Suite 101~
N4*Carson*NV*89701~
PER*BL*Nevada Medicaid*TE*8776383472*EM*nvmmis.edisupport@dxc.com~
N1*PE*VALLEY*XX*6666666666~
REF*TJ*530824679~
LX*1~
CLP*77777778*2*3002*0**MC*6666666666667~
CAS*OA*176*3002*0~
NM1*QC*1*BOB*THOMAS****MR*55555555555~
NM1*74*1*ALAN*JACKSON****C*66666666666~
REF*EA*8888888~
DTM*232*20171001~
DTM*233*20171002~
CLP*77777779*4*41231.04*0**MC*6666666666668~
CAS*OA*147*9365.04*0~
CAS*CO*26*31866*0~
NM1*QC*1*HELD*ALLEN****MR*77777777778~
NM1*74*1*RYAN*LARRY****C*88888888889~
NM1*PR*2*SENIOR*****PI* 8888~
Splunk uses pcre but there is some difference. I have a hard time trusting it with multiline. Your code sample in my vanilla 9.1.2 installation, for example, results in 77777777* alone.
Because your ANSI 835 is strictly formatted, maybe split will suffice.
| eval clmevent = mvindex(split(_raw, "
CLP*"), 1, -1) ``` extra newline is paranoia - sample works without ```
| mvexpand clmevent
Your sample event gives me all three. Here is an emulation you can compare with real data
| makeresults
| fields - _time
| eval _raw="N4*Carson*NV*89701~
PER*BL*Nevada Medicaid*TE*8776383472*EM*nvmmis.edisupport@dxc.com~
N1*PE*SUMMER*XX*6666666666~
REF*TJ*111111111~
CLP*77777777*4*72232*0**MC*6666666666666~
CAS*OA*147*50016*0~
CAS*CO*26*22216*0~
NM1*QC*1*TOM*SMITH****MR*77777777777~
NM1*74*1*ALAN*PARKER****C*88888888888~
NM1*PR*2*PACIFI*****PI* 9999~
NM1*GB*1*BARRY*CARRY****MI*666666666~
REF*EA*8888888~
DTM*232*20180314~
DTM*233*20180317~
SE*22*0001~
ST*835*0002~
BPR*H*0*C*NON************20180615~
TRN*1*100004765*5555555555~
DTM*405*20180613~
N1*PR*DIVISON OF HEALTH CARE FINANCING AND POLICY~
N3*1100 East William Street Suite 101~
N4*Carson*NV*89701~
PER*BL*Nevada Medicaid*TE*8776383472*EM*nvmmis.edisupport@dxc.com~
N1*PE*VALLEY*XX*6666666666~
REF*TJ*530824679~
LX*1~
CLP*77777778*2*3002*0**MC*6666666666667~
CAS*OA*176*3002*0~
NM1*QC*1*BOB*THOMAS****MR*55555555555~
NM1*74*1*ALAN*JACKSON****C*66666666666~
REF*EA*8888888~
DTM*232*20171001~
DTM*233*20171002~
CLP*77777779*4*41231.04*0**MC*6666666666668~
CAS*OA*147*9365.04*0~
CAS*CO*26*31866*0~
NM1*QC*1*HELD*ALLEN****MR*77777777778~
NM1*74*1*RYAN*LARRY****C*88888888889~
NM1*PR*2*SENIOR*****PI* 8888~"
``` data emulation above ```
Hope this helps.
Try something like this
| rex max_match=0 "(?s)(?P<clmevent>(?<=CLP\*).*?(?=CLP\*|$))"
Thanks, @ITWhisperer, works great! Appreciate the response!
Splunk uses pcre but there is some difference. I have a hard time trusting it with multiline. Your code sample in my vanilla 9.1.2 installation, for example, results in 77777777* alone.
Because your ANSI 835 is strictly formatted, maybe split will suffice.
| eval clmevent = mvindex(split(_raw, "
CLP*"), 1, -1) ``` extra newline is paranoia - sample works without ```
| mvexpand clmevent
Your sample event gives me all three. Here is an emulation you can compare with real data
| makeresults
| fields - _time
| eval _raw="N4*Carson*NV*89701~
PER*BL*Nevada Medicaid*TE*8776383472*EM*nvmmis.edisupport@dxc.com~
N1*PE*SUMMER*XX*6666666666~
REF*TJ*111111111~
CLP*77777777*4*72232*0**MC*6666666666666~
CAS*OA*147*50016*0~
CAS*CO*26*22216*0~
NM1*QC*1*TOM*SMITH****MR*77777777777~
NM1*74*1*ALAN*PARKER****C*88888888888~
NM1*PR*2*PACIFI*****PI* 9999~
NM1*GB*1*BARRY*CARRY****MI*666666666~
REF*EA*8888888~
DTM*232*20180314~
DTM*233*20180317~
SE*22*0001~
ST*835*0002~
BPR*H*0*C*NON************20180615~
TRN*1*100004765*5555555555~
DTM*405*20180613~
N1*PR*DIVISON OF HEALTH CARE FINANCING AND POLICY~
N3*1100 East William Street Suite 101~
N4*Carson*NV*89701~
PER*BL*Nevada Medicaid*TE*8776383472*EM*nvmmis.edisupport@dxc.com~
N1*PE*VALLEY*XX*6666666666~
REF*TJ*530824679~
LX*1~
CLP*77777778*2*3002*0**MC*6666666666667~
CAS*OA*176*3002*0~
NM1*QC*1*BOB*THOMAS****MR*55555555555~
NM1*74*1*ALAN*JACKSON****C*66666666666~
REF*EA*8888888~
DTM*232*20171001~
DTM*233*20171002~
CLP*77777779*4*41231.04*0**MC*6666666666668~
CAS*OA*147*9365.04*0~
CAS*CO*26*31866*0~
NM1*QC*1*HELD*ALLEN****MR*77777777778~
NM1*74*1*RYAN*LARRY****C*88888888889~
NM1*PR*2*SENIOR*****PI* 8888~"
``` data emulation above ```
Hope this helps.
Thanks, @yuanliu this works! So does @ITWhisperer solution! I don't think it will allow me to select both as accepted solution, so click it for yours since you replied first. Thanks!!