Splunk Search

Rex multiline extraction only grabbing every other occurrence

rmercy
Explorer

Hoping this is something simple with lookahead/lookbehind that I'm missing... I'm trying to extract multi-line fields from ANSI 835 files that are indexed in chunks by line count, so 10K-line events (unfortunately, I have no control over the sourcetype / event breaking for these). My rex is matching the pattern, but after the first match it skips the second and matches the third. Then it skips the fourth and matches the fifth, etc. Each capture group starts and ends with the same pattern (CLP*), and there can be all kinds of variations in the number of lines, the type of lines (starting characters), the number of *-delimited fields (with or without values) in each line, and multiple types of special characters. The constants are the tilde ~ line breaks, and that I need everything between each CLP* occurrence.

In the example 835 below, I would need to have three multi-line fields extracted, starting with (1) 77777777*, then (2) 77777778*, and (3) 77777779*, but my rex is only getting (1) and (3). Also, I know there are some redundancies (m and n+, etc.), but they don't appear to be impacting the results... though happy to eat that sandwich if I'm wrong. Any help with this would be much appreciated!

Cheers!

 

| rex max_match=0 "(?msi)CLP\*(?P<clmevent>.*?)\n+\CLP\*"

 

Example 835:

N4*Carson*NV*89701~
PER*BL*Nevada Medicaid*TE*8776383472*EM*nvmmis.edisupport@dxc.com~
N1*PE*SUMMER*XX*6666666666~
REF*TJ*111111111~
CLP*77777777*4*72232*0**MC*6666666666666~
CAS*OA*147*50016*0~
CAS*CO*26*22216*0~
NM1*QC*1*TOM*SMITH****MR*77777777777~
NM1*74*1*ALAN*PARKER****C*88888888888~
NM1*PR*2*PACIFI*****PI* 9999~
NM1*GB*1*BARRY*CARRY****MI*666666666~
REF*EA*8888888~
DTM*232*20180314~
DTM*233*20180317~
SE*22*0001~
ST*835*0002~
BPR*H*0*C*NON************20180615~
TRN*1*100004765*5555555555~
DTM*405*20180613~
N1*PR*DIVISON OF HEALTH CARE FINANCING AND POLICY~
N3*1100 East William Street Suite 101~
N4*Carson*NV*89701~
PER*BL*Nevada Medicaid*TE*8776383472*EM*nvmmis.edisupport@dxc.com~
N1*PE*VALLEY*XX*6666666666~
REF*TJ*530824679~
LX*1~
CLP*77777778*2*3002*0**MC*6666666666667~
CAS*OA*176*3002*0~
NM1*QC*1*BOB*THOMAS****MR*55555555555~
NM1*74*1*ALAN*JACKSON****C*66666666666~
REF*EA*8888888~
DTM*232*20171001~
DTM*233*20171002~
CLP*77777779*4*41231.04*0**MC*6666666666668~
CAS*OA*147*9365.04*0~
CAS*CO*26*31866*0~
NM1*QC*1*HELD*ALLEN****MR*77777777778~
NM1*74*1*RYAN*LARRY****C*88888888889~
NM1*PR*2*SENIOR*****PI* 8888~


yuanliu
SplunkTrust

Splunk uses PCRE, but there are some differences, and I have a hard time trusting it with multiline. Your code sample in my vanilla 9.1.2 installation, for example, results in 77777777* alone.

Because your ANSI 835 is strictly formatted, maybe split will suffice.

| eval clmevent = mvindex(split(_raw, "
CLP*"), 1, -1) ``` extra newline is paranoia - sample works without ```
| mvexpand clmevent
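
For anyone who wants to check the split logic outside Splunk, here is a minimal Python sketch of the same idea. It is not the SPL itself: it assumes the event text uses plain `\n` line breaks, and the function name `split_claims` and the shortened sample are made up for illustration.

```python
def split_claims(raw):
    """Split on newline + 'CLP*' and drop the piece before the first delimiter.

    Mirrors the idea of mvindex(split(_raw, "\nCLP*"), 1, -1); the delimiter
    itself is removed, so each piece starts with the claim id.
    """
    return raw.split("\nCLP*")[1:]


# Shortened stand-in for the 835 sample in this thread.
raw = (
    "REF*TJ*111111111~\n"
    "CLP*77777777*4*72232*0**MC*6666666666666~\n"
    "CAS*OA*147*50016*0~\n"
    "CLP*77777778*2*3002*0**MC*6666666666667~\n"
    "CAS*OA*176*3002*0~\n"
    "CLP*77777779*4*41231.04*0**MC*6666666666668~\n"
    "CAS*OA*147*9365.04*0~"
)

claims = split_claims(raw)
claim_ids = [claim.split("*")[0] for claim in claims]
print(claim_ids)

# Edge case: a chunk whose very first line is a CLP segment has no newline
# in front of it, so that claim ends up in the dropped first piece.
chunk_starting_with_clp = raw.split("\n", 1)[1]
ids_when_chunk_starts_with_clp = [
    claim.split("*")[0] for claim in split_claims(chunk_starting_with_clp)
]
print(ids_when_chunk_starts_with_clp)
```

The second print shows the one caveat of dropping index 0: if an event happens to begin on a CLP line, that first claim goes with it. The same would apply to `mvindex(..., 1, -1)` in the search above, so it may be worth checking events that start that way.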

Your sample event gives me all three. Here is an emulation you can compare with real data:

| makeresults
| fields - _time
| eval _raw="N4*Carson*NV*89701~
PER*BL*Nevada Medicaid*TE*8776383472*EM*nvmmis.edisupport@dxc.com~
N1*PE*SUMMER*XX*6666666666~
REF*TJ*111111111~
CLP*77777777*4*72232*0**MC*6666666666666~
CAS*OA*147*50016*0~
CAS*CO*26*22216*0~
NM1*QC*1*TOM*SMITH****MR*77777777777~
NM1*74*1*ALAN*PARKER****C*88888888888~
NM1*PR*2*PACIFI*****PI* 9999~
NM1*GB*1*BARRY*CARRY****MI*666666666~
REF*EA*8888888~
DTM*232*20180314~
DTM*233*20180317~
SE*22*0001~
ST*835*0002~
BPR*H*0*C*NON************20180615~
TRN*1*100004765*5555555555~
DTM*405*20180613~
N1*PR*DIVISON OF HEALTH CARE FINANCING AND POLICY~
N3*1100 East William Street Suite 101~
N4*Carson*NV*89701~
PER*BL*Nevada Medicaid*TE*8776383472*EM*nvmmis.edisupport@dxc.com~
N1*PE*VALLEY*XX*6666666666~
REF*TJ*530824679~
LX*1~
CLP*77777778*2*3002*0**MC*6666666666667~
CAS*OA*176*3002*0~
NM1*QC*1*BOB*THOMAS****MR*55555555555~
NM1*74*1*ALAN*JACKSON****C*66666666666~
REF*EA*8888888~
DTM*232*20171001~
DTM*233*20171002~
CLP*77777779*4*41231.04*0**MC*6666666666668~
CAS*OA*147*9365.04*0~
CAS*CO*26*31866*0~
NM1*QC*1*HELD*ALLEN****MR*77777777778~
NM1*74*1*RYAN*LARRY****C*88888888889~
NM1*PR*2*SENIOR*****PI* 8888~"
``` data emulation above ```

Hope this helps.


ITWhisperer
SplunkTrust

Try something like this

| rex max_match=0 "(?s)(?P<clmevent>(?<=CLP\*).*?(?=CLP\*|$))"
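
A likely reason the original pattern skips every other claim: the closing CLP* is part of the match, so it is consumed and cannot serve as the opening CLP* of the next match. With lookarounds the delimiter is only asserted, never consumed. The sketch below reproduces this with Python's `re` module, which is not the PCRE engine Splunk uses but scans for non-overlapping matches the same way. It is an illustration under assumptions: the original pattern is simplified to `\n+CLP\*`, and claim ids 77777780 and 77777781 are invented so that there are five segments.

```python
import re

# Five short CLP segments after one header line (last two ids are invented).
ids = ["77777777", "77777778", "77777779", "77777780", "77777781"]
raw = "REF*TJ*111111111~\n" + "\n".join(
    f"CLP*{claim_id}*4*100*0~\nCAS*OA*147*100*0~" for claim_id in ids
)

# Closing delimiter is consumed, so the scan resumes after it and the
# claim that follows has lost its opening "CLP*".
consuming = re.findall(r"(?s)CLP\*(.*?)\n+CLP\*", raw)

# Closing delimiter is only asserted by the lookahead, so it is still
# available as the start of the next match.
lookaround = re.findall(r"(?s)(?<=CLP\*)(.*?)(?=CLP\*|$)", raw)

consuming_ids = [m.split("*")[0] for m in consuming]
lookaround_ids = [m.split("*")[0] for m in lookaround]
print(consuming_ids)
print(lookaround_ids)
```

In this sketch the consuming pattern returns only the first and third claims (the fifth has no closing CLP* after it), while the lookaround pattern returns all five.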

rmercy
Explorer

Thanks, @ITWhisperer, works great!  Appreciate the response!


rmercy
Explorer

Thanks, @yuanliu, this works! So does @ITWhisperer's solution! I don't think it will allow me to select both as the accepted solution, so I clicked it for yours since you replied first. Thanks!!
