Splunk Search

Rex multiline extraction only grabbing every other occurrence

rmercy
Explorer

Hoping this is something simple with lookahead/lookbehind that I'm missing. I'm trying to extract multi-line fields from ANSI 835 files that are indexed in chunks by line count, so each event is 10K lines (unfortunately, I have no control over the sourcetype or event breaking for these). My rex matches the pattern, but after the first match it skips the second and matches the third, then skips the fourth and matches the fifth, and so on. Each capture group starts and ends with the same pattern (CLP*), and there can be all kinds of variation in the number of lines, the type of lines (starting characters), the number of *-delimited fields (with or without values) in each line, and the special characters used. The constants are the tilde (~) line breaks and the fact that I need everything between each CLP* occurrence.

In the example 835 below, I need three multi-line fields extracted, starting with (1) 77777777*, then (2) 77777778*, and (3) 77777779*, but my rex is only getting (1) and (3). Also, I know there are some redundancies (m and \n+, etc.), but they don't appear to be impacting the results, though I'm happy to eat that sandwich if I'm wrong. Any help with this would be much appreciated!

Cheers!

 

| rex max_match=0 "(?msi)CLP\*(?P<clmevent>.*?)\n+\CLP\*"

 

Example 835:

N4*Carson*NV*89701~
PER*BL*Nevada Medicaid*TE*8776383472*EM*nvmmis.edisupport@dxc.com~
N1*PE*SUMMER*XX*6666666666~
REF*TJ*111111111~
CLP*77777777*4*72232*0**MC*6666666666666~
CAS*OA*147*50016*0~
CAS*CO*26*22216*0~
NM1*QC*1*TOM*SMITH****MR*77777777777~
NM1*74*1*ALAN*PARKER****C*88888888888~
NM1*PR*2*PACIFI*****PI* 9999~
NM1*GB*1*BARRY*CARRY****MI*666666666~
REF*EA*8888888~
DTM*232*20180314~
DTM*233*20180317~
SE*22*0001~
ST*835*0002~
BPR*H*0*C*NON************20180615~
TRN*1*100004765*5555555555~
DTM*405*20180613~
N1*PR*DIVISON OF HEALTH CARE FINANCING AND POLICY~
N3*1100 East William Street Suite 101~
N4*Carson*NV*89701~
PER*BL*Nevada Medicaid*TE*8776383472*EM*nvmmis.edisupport@dxc.com~
N1*PE*VALLEY*XX*6666666666~
REF*TJ*530824679~
LX*1~
CLP*77777778*2*3002*0**MC*6666666666667~
CAS*OA*176*3002*0~
NM1*QC*1*BOB*THOMAS****MR*55555555555~
NM1*74*1*ALAN*JACKSON****C*66666666666~
REF*EA*8888888~
DTM*232*20171001~
DTM*233*20171002~
CLP*77777779*4*41231.04*0**MC*6666666666668~
CAS*OA*147*9365.04*0~
CAS*CO*26*31866*0~
NM1*QC*1*HELD*ALLEN****MR*77777777778~
NM1*74*1*RYAN*LARRY****C*88888888889~
NM1*PR*2*SENIOR*****PI* 8888~


yuanliu
SplunkTrust

Splunk uses PCRE, but its behavior differs in some respects, and I have a hard time trusting it with multiline matching. Your code sample in my vanilla 9.1.2 installation, for example, returns 77777777* alone.

Because your ANSI 835 is strictly formatted, maybe split will suffice.

| eval clmevent = mvindex(split(_raw, "
CLP*"), 1, -1) ``` extra newline is paranoia - sample works without ```
| mvexpand clmevent

Your sample event gives me all three. Here is an emulation you can compare with real data:

| makeresults
| fields - _time
| eval _raw="N4*Carson*NV*89701~
PER*BL*Nevada Medicaid*TE*8776383472*EM*nvmmis.edisupport@dxc.com~
N1*PE*SUMMER*XX*6666666666~
REF*TJ*111111111~
CLP*77777777*4*72232*0**MC*6666666666666~
CAS*OA*147*50016*0~
CAS*CO*26*22216*0~
NM1*QC*1*TOM*SMITH****MR*77777777777~
NM1*74*1*ALAN*PARKER****C*88888888888~
NM1*PR*2*PACIFI*****PI* 9999~
NM1*GB*1*BARRY*CARRY****MI*666666666~
REF*EA*8888888~
DTM*232*20180314~
DTM*233*20180317~
SE*22*0001~
ST*835*0002~
BPR*H*0*C*NON************20180615~
TRN*1*100004765*5555555555~
DTM*405*20180613~
N1*PR*DIVISON OF HEALTH CARE FINANCING AND POLICY~
N3*1100 East William Street Suite 101~
N4*Carson*NV*89701~
PER*BL*Nevada Medicaid*TE*8776383472*EM*nvmmis.edisupport@dxc.com~
N1*PE*VALLEY*XX*6666666666~
REF*TJ*530824679~
LX*1~
CLP*77777778*2*3002*0**MC*6666666666667~
CAS*OA*176*3002*0~
NM1*QC*1*BOB*THOMAS****MR*55555555555~
NM1*74*1*ALAN*JACKSON****C*66666666666~
REF*EA*8888888~
DTM*232*20171001~
DTM*233*20171002~
CLP*77777779*4*41231.04*0**MC*6666666666668~
CAS*OA*147*9365.04*0~
CAS*CO*26*31866*0~
NM1*QC*1*HELD*ALLEN****MR*77777777778~
NM1*74*1*RYAN*LARRY****C*88888888889~
NM1*PR*2*SENIOR*****PI* 8888~"
``` data emulation above ```

Hope this helps.
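
For anyone who wants to sanity-check the split logic outside Splunk, here is a minimal Python sketch (not SPL). It only mirrors the idea of `mvindex(split(_raw, "\nCLP*"), 1, -1)` on a trimmed copy of the sample; the names `raw`, `claims`, and `claim_ids` are made up for illustration, and Python's `str.split` is an assumption standing in for Splunk's `split`.

```python
# Trimmed copy of the sample event: segments end in "~" followed by a newline.
raw = (
    "REF*TJ*111111111~\n"
    "CLP*77777777*4*72232*0**MC*6666666666666~\n"
    "CAS*OA*147*50016*0~\n"
    "LX*1~\n"
    "CLP*77777778*2*3002*0**MC*6666666666667~\n"
    "CAS*OA*176*3002*0~\n"
    "CLP*77777779*4*41231.04*0**MC*6666666666668~\n"
    "CAS*CO*26*31866*0~"
)

# Split on newline + "CLP*". Piece 0 is whatever precedes the first claim
# (header segments), so it is dropped, like mvindex(..., 1, -1) does.
claims = raw.split("\nCLP*")[1:]

# The first *-delimited field of each piece is the claim ID.
claim_ids = [claim.split("*", 1)[0] for claim in claims]

print(claim_ids)
```

One thing the sketch makes visible: in this Python version, a claim that sits at the very start of the text (no newline before its CLP*) would land in piece 0 and be dropped. Whether that matters for chunked 10K-line events is worth checking against real data.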


ITWhisperer
SplunkTrust

Try something like this:

| rex max_match=0 "(?s)(?P<clmevent>(?<=CLP\*).*?(?=CLP\*|$))"
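
As a rough illustration of why the lookaround version catches every claim, the sketch below runs both patterns in Python's `re` module, which is only a stand-in for Splunk's PCRE and may differ in details. The input is synthetic: the last two claim IDs and all field values are made up, and the stray backslash in `\CLP` from the original pattern is dropped because Python rejects `\C`.

```python
import re

# Five synthetic claims; the last two IDs are hypothetical.
ids = ["77777777", "77777778", "77777779", "77777780", "77777781"]
raw = "REF*TJ*111111111~\n" + "".join(
    f"CLP*{cid}*4*100*0**MC*1~\nCAS*OA*147*100*0~\n" for cid in ids
)

# Original style: the closing "CLP*" is part of the match, so it is consumed.
# The next search starts after it, and the claim it opened is never captured.
consuming = re.compile(r"(?s)CLP\*(?P<clmevent>.*?)\n+CLP\*")

# Lookaround style: both delimiters are only asserted, never consumed,
# so the next search can still see the "CLP*" that opens the next claim.
lookaround = re.compile(r"(?s)(?P<clmevent>(?<=CLP\*).*?(?=CLP\*|$))")

def claim_ids(pattern, text):
    """Return the first *-delimited field of every clmevent match."""
    return [m.group("clmevent").split("*", 1)[0] for m in pattern.finditer(text)]

skipping = claim_ids(consuming, raw)
complete = claim_ids(lookaround, raw)

print(skipping)
print(complete)
```

In this sketch the consuming pattern returns only the first and third claims, and the fifth is lost as well because no CLP* follows it. That would also be consistent with the three-claim sample yielding 77777777* alone.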

rmercy
Explorer

Thanks, @ITWhisperer, works great!  Appreciate the response!


rmercy
Explorer

Thanks, @yuanliu, this works! So does @ITWhisperer's solution! I don't think it will let me select both as the accepted solution, so I clicked it for yours since you replied first. Thanks!!
