topic Re: How to extract fields from a multiline header followed by structured data columns? in Splunk Search

How to extract fields from a multiline header followed by structured data columns?

HLVarian — Mon, 22 Feb 2016 19:00:08 GMT

I have a sourcetype that is in CSV format and I'd like to extract fields from the multiline header that proceeds these files coming in. Each new line in the header begins with # and these lines are comma separated. This header is followed by the actual data fields which are semi-colon separated. I'm trying to figure out the best way to go about parsing this data. The data looks like this:

 #PlanID: '1.2.246.352.71.5.459020837699.2820.20131008220539', IrrSessionID: '1.2.246.352.82.6.5130518203565855886.177214397189262813860', FieldNum:1
 #BeamSizeID:'4.0', Status (1, 0, 0), CumMU: 109720.889806
 #Temp C/M(0.000000,0.000000), Pressure C/M(0.000000,0.000000)
 #VALUES;;;
 Field1;Field2;Field3;Field4;...;FieldN
 Data1;Data2;Data4;Data4;...;DataN

I'd like to get out the following fields and what they would be relate to in this example:

PlanID = 1.2.246.352.71.5.459020837699.2820.20131008220539
IrrSessionID = 1.2.246.352.82.6.5130518203565855886.177214397189262813860
FieldNum = 1
BeamSizeID = 4.0
Status = (1, 0, 0)
CumMu = 109720.889806
Temp C/M = (0.000000,0.000000)
Pressure C/M= (0.000000,0.000000)

Followed by:

Field1, Field2,..., etc.

Previously I just had Splunk strip away all the lines beginning with # using:

FIELD DELIMITER = ;
PREAMBLE REGEX = ^\#

(this works well for all my other sourcetypes)

I have also used a Python script to append the necessary fields to file and then I can keep using the aforementioned settings. However, we do not want to run a separate process on our files before getting them into Splunk.

So, what is the best way to get these fields in? Is there a way to manipulate my Python script and use it in Splunk on my incoming data? Or should I use some extensive RegEx in props.conf and transforms.conf to achieve this?

Many thanks!

Re: How to extract fields from a multiline header followed by structured data columns?

stephanefotso — Tue, 23 Feb 2016 14:18:02 GMT

Hello! Let's get a sample of your data please.

Re: How to extract fields from a multiline header followed by structured data columns?

HLVarian — Tue, 23 Feb 2016 19:02:19 GMT

I don't have enough Karma points to provide attachments yet. However this is a sample of the data. Data comes in on a CSV file and looks just like this if opened in text:

 #PlanID: '1.2.246.352.71.5.546855675324.46054.20150723173433', IrrSessionID: '1.2.246.352.82.6.4916826480507715613.151884354515765536450', FieldNum:1
#BeamSizeID:'4.0', Status (1, 0, 0), CumMU: 26940.399953
#Temp C/M(0.000000,0.000000), Pressure C/M(0.000000,0.000000)
#VALUES;;;
LayerNr;Energy;DoseRate;Start-Irradiation;Cumm MU;X;Y;MU;DiagnosticDataValid;X_Measured;Y_Measured;X_MeasuredRaw;Y_MeasuredRaw;Mu_MeasuredC; Mu_MeasuredCt;Mu_MeasuredM;Mu_MeasuredMt;minusXBeamWidth;plusXBeamWidth;minusYBeamWidth;plusYBeamWidth;currentX;currentX_Measured;currentY;currentY_Measured;doseRateC;doseRateM;irradiationTime;numSamplesC;numSamplesM
1;180.139000;304359.803029;AUG 03 2015 16:36:43 GMT;46.599998;-65.599998;-49.200001;46.599998;1;88716.000000;132111.000000;0.000000;0.000000;46.398749;4643.000000;46.342000;4487.000000;5253;5445;4431;3869;-148038;-148074;-75555;-75703;304359.803029;308679.237710;0.008752;100525;100528
2;176.639000;261391.348040;AUG 03 2015 16:36:53 GMT;262.600000;57.400002;-65.599998;49.000000;1;166233.000000;121459.000000;0.000000;0.000000;49.026979;4906.000000;49.523042;4795.000000;5272;5479;3842;4047;128182;126383;-99672;-99767;264168.273430;269533.829245;0.010886;11063;11063
2;176.639000;261391.348040;AUG 03 2015 16:36:53 GMT;262.600000;-57.400002;-41.000000;46.400001;1;93767.000000;138361.000000;0.000000;0.000000;46.398749;4643.000000;46.620857;4514.000000;5409;5458;3862;4404;-128187;-61199;-62274;-63445;260916.063907;265030.589047;0.010261;202;202
2;176.639000;261391.348040;AUG 03 2015 16:36:53 GMT;262.600000;-65.599998;-57.400002;45.799999;1;88777.000000;126961.000000;0.000000;0.000000;45.799153;4583.000000;45.897893;4444.000000;5298;5481;4166;4455;-146450;-141625;-87204;-83594;261488.078635;264819.896647;0.010151;71;71

Hope this helps others out.

Re: How to extract fields from a multiline header followed by structured data columns?

HLVarian — Tue, 23 Feb 2016 19:04:03 GMT

I don't have enough Karma points to post an attachment, but see the "answer" I posted below when i was trying to add an attachment for a snapshot of what the data looks like.

Re: How to extract fields from a multiline header followed by structured data columns?

HLVarian — Tue, 23 Feb 2016 22:30:18 GMT

I've had success with this RegEx in search, now I just need to figure out how to format it for props.conf and transforms.conf

 rex field=_raw "(?m)^\#PlanID: '(?<PlanID>[.?\d]+)\', IrrSessionID: '(?<IrrSessionID>[.?\d]+)\', FieldNum:(?<FieldNum>[.?\d].*)$
^\#BeamSizeID:'(?<BeamSizeID>[.?\d]+)\', Status (?<Status>\([^(]*\))\, CumMU: (?<CumMU>[.?\d]+.*)$
^\#Temp C\/M(?<TempCM>\([^(]*\))\, Pressure C\/M(?<PressureCM>\([^(]*\))$
^\#VALUES;;;$
^LayerNr;Energy;DoseRate;Start-Irradiation;Cumm MU;X;Y;MU;DiagnosticDataValid;X_Measured;Y_Measured;X_MeasuredRaw;Y_MeasuredRaw;Mu_MeasuredC; Mu_MeasuredCt;Mu_MeasuredM;Mu_MeasuredMt;minusXBeamWidth;plusXBeamWidth;minusYBeamWidth;plusYBeamWidth;currentX;currentX_Measured;currentY;currentY_Measured;doseRateC;doseRateM;irradiationTime;numSamplesC;numSamplesM$
^(?<LayerNr>[^\;*]*);(?<Energy>[^\;*]*);(?<DoseRate>[^\;*]*);(?<StartIrradiation>[^\;*]*);(?<Cumm_MU>[^\;*]*);(?<X>[^\;*]*);(?<Y>[^\;*]*);(?<MU>[^\;*]*);(?<DiagnosticDataValid>[^\;*]*);(?<X_Measured>[^\;*]*);(?<Y_Measured>[^\;*]*);(?<X_MeasuredRaw>[^\;*]*);(?<Y_MeasuredRaw>[^\;*]*);(?<Mu_MeasuredC>[^\;*]*);(?<Mu_MeasuredCt>[^\;*]*);(?<Mu_MeasuredM>[^\;*]*);(?<Mu_MeasuredMt>[^\;*]*);(?<minusXBeamWidth>[^\;*]*);(?<plusXBeamWidth>[^\;*]*);(?<minusYBeamWidth>[^\;*]*);(?<plusYBeamWidth>[^\;*]*);(?<currentX>[^\;*]*);(?<currentX_Measured>[^\;*]*);(?<currentY>[^\;*]*);(?<currentY_Measured>[^\;*]*);(?<doseRateC>[^\;*]*);(?<doseRateM>[^\;*]*);(?<irradiationTime>[^\;*]*);(?<numSamplesC>[^\;*]*);(?<numSamplesM>[^;].*)$"  
| table _time, LayerNr,Energy,DoseRate,StartIrradiation,Cumm_MU,X,Y,MU,DiagnosticDataValid,X_Measured,Y_Measured,X_MeasuredRaw,Y_MeasuredRaw,Mu_MeasuredC,Mu_MeasuredCt,Mu_MeasuredM,Mu_MeasuredMt,minusXBeamWidth,plusXBeamWidth,minusYBeamWidth,plusYBeamWidth,currentX,currentX_Measured,currentY,currentY_Measured,doseRateC,doseRateM,irradiationTime,numSamplesC,numSamplesM,PlanID,IrrSessionID,FieldNum,BeamSize,Status,CumMU,TempCM,PressureCM

Re: How to extract fields from a multiline header followed by structured data columns?

HLVarian — Wed, 24 Feb 2016 18:03:45 GMT

Correction, I HAD success with this search. But it now fails.

Re: How to extract fields from a multiline header followed by structured data columns?

HLVarian — Fri, 26 Feb 2016 00:59:57 GMT

It looks like the key to this one was to create eventtypes. One based on the header and the other the data. I then ran a search on the eventtype=IrrHeader using a regex. I then appended a search on eventtype=IrrData. From here I created my table.

This was the final search:

eventtype=IrrHeader 
| rex field=_raw "(?ms)^[^:\n]*:\s+'(?P<PlanID>[^']+)[^:\n]*:\s+'(?P<IrrSessionID>[^']+)[^:\n]*:(?P<FieldNum>\d+)\s+#\w+:'(?P<BeamSizeID>\d+\.\d+)',\s+\w+\s+(?P<Status>\(\d+,\s+\d+,\s+\d+\))[^:\n]*:\s+(?P<CumMU>\d+\.\d+)\s+#\w+\s+\w+\/\w+(?P<TempCM>\(\d+\.\d+,\d+\.\d+\)),\s+\w+\s+\w+\/\w+(?P<PressureCM>\(\d+\.\d+,\d+\.\d+\))" 
| append [search sourcetype=Irr2File eventtype=IrrData 
| rex field=_raw "^(?<LayerNr>[^\;*]*);(?<Energy>[^\;*]*);(?<DoseRate>[^\;*]*);(?<StartIrradiation>[^\;*]*);(?<Cumm_MU>[^\;*]*);(?<X>[^\;*]*);(?<Y>[^\;*]*);(?<MU>[^\;*]*);(?<DiagnosticDataValid>[^\;*]*);(?<X_Measured>[^\;*]*);(?<Y_Measured>[^\;*]*);(?<X_MeasuredRaw>[^\;*]*);(?<Y_MeasuredRaw>[^\;*]*);(?<Mu_MeasuredC>[^\;*]*);(?<Mu_MeasuredCt>[^\;*]*);(?<Mu_MeasuredM>[^\;*]*);(?<Mu_MeasuredMt>[^\;*]*);(?<minusXBeamWidth>[^\;*]*);(?<plusXBeamWidth>[^\;*]*);(?<minusYBeamWidth>[^\;*]*);(?<plusYBeamWidth>[^\;*]*);(?<currentX>[^\;*]*);(?<currentX_Measured>[^\;*]*);(?<currentY>[^\;*]*);(?<currentY_Measured>[^\;*]*);(?<doseRateC>[^\;*]*);(?<doseRateM>[^\;*]*);(?<irradiationTime>[^\;*]*);(?<numSamplesC>[^\;*]*);(?<numSamplesM>[^;].*)$"]
| table _time, LayerNr, Energy, DoseRate, StartIrradiation, Cumm_MU, X, Y, MU, DiagnosticDataValid, X_Measured, Y_Measured, X_MeasuredRaw, Y_MeasuredRaw, Mu_MeasuredC, Mu_MeasuredCt, Mu_MeasuredM, Mu_MeasuredMt, minusXBeamWidth, plusXBeamWidth, minusYBeamWidth, plusYBeamWidth, currentX, currentX_Measured, currentY, currentY_Measured, doseRateC, doseRateM, irradiationTime, numSamplesC, numSamplesM, PlanID, IrrSessionID, FieldNum, BeamSize, Status, CumMU, TempCM, PressureCM

Re: How to extract fields from a multiline header followed by structured data columns?

HLVarian — Fri, 26 Feb 2016 01:12:27 GMT

I, FINALLY, figured it out. Thanks for trying.