Splunk Search
Highlighted

How to extract fields from a multiline header followed by structured data columns?

Path Finder

I have a sourcetype that is in CSV format and I'd like to extract fields from the multiline header that proceeds these files coming in. Each new line in the header begins with # and these lines are comma separated. This header is followed by the actual data fields which are semi-colon separated. I'm trying to figure out the best way to go about parsing this data. The data looks like this:

 #PlanID: '1.2.246.352.71.5.459020837699.2820.20131008220539', IrrSessionID: '1.2.246.352.82.6.5130518203565855886.177214397189262813860', FieldNum:1
 #BeamSizeID:'4.0', Status (1, 0, 0), CumMU: 109720.889806
 #Temp C/M(0.000000,0.000000), Pressure C/M(0.000000,0.000000)
 #VALUES;;;
 Field1;Field2;Field3;Field4;...;FieldN
 Data1;Data2;Data4;Data4;...;DataN

I'd like to get out the following fields and what they would be relate to in this example:

PlanID = 1.2.246.352.71.5.459020837699.2820.20131008220539
IrrSessionID = 1.2.246.352.82.6.5130518203565855886.177214397189262813860
FieldNum = 1
BeamSizeID = 4.0
Status = (1, 0, 0)
CumMu = 109720.889806
Temp C/M = (0.000000,0.000000)
Pressure C/M= (0.000000,0.000000)

Followed by:

Field1, Field2,..., etc.

Previously I just had Splunk strip away all the lines beginning with # using:

FIELD DELIMITER = ;
PREAMBLE REGEX = ^\#

(this works well for all my other sourcetypes)

I have also used a Python script to append the necessary fields to file and then I can keep using the aforementioned settings. However, we do not want to run a separate process on our files before getting them into Splunk.

So, what is the best way to get these fields in? Is there a way to manipulate my Python script and use it in Splunk on my incoming data? Or should I use some extensive RegEx in props.conf and transforms.conf to achieve this?

Many thanks!

0 Karma
Highlighted

Re: How to extract fields from a multiline header followed by structured data columns?

Motivator

Hello! Let's get a sample of your data please.

0 Karma
Highlighted

Re: How to extract fields from a multiline header followed by structured data columns?

Path Finder

I don't have enough Karma points to post an attachment, but see the "answer" I posted below when i was trying to add an attachment for a snapshot of what the data looks like.

0 Karma
Highlighted

Re: How to extract fields from a multiline header followed by structured data columns?

Path Finder

I, FINALLY, figured it out. Thanks for trying.

0 Karma
Highlighted

Re: How to extract fields from a multiline header followed by structured data columns?

Path Finder

I don't have enough Karma points to provide attachments yet. However this is a sample of the data. Data comes in on a CSV file and looks just like this if opened in text:

 #PlanID: '1.2.246.352.71.5.546855675324.46054.20150723173433', IrrSessionID: '1.2.246.352.82.6.4916826480507715613.151884354515765536450', FieldNum:1
#BeamSizeID:'4.0', Status (1, 0, 0), CumMU: 26940.399953
#Temp C/M(0.000000,0.000000), Pressure C/M(0.000000,0.000000)
#VALUES;;;
LayerNr;Energy;DoseRate;Start-Irradiation;Cumm MU;X;Y;MU;DiagnosticDataValid;X_Measured;Y_Measured;X_MeasuredRaw;Y_MeasuredRaw;Mu_MeasuredC; Mu_MeasuredCt;Mu_MeasuredM;Mu_MeasuredMt;minusXBeamWidth;plusXBeamWidth;minusYBeamWidth;plusYBeamWidth;currentX;currentX_Measured;currentY;currentY_Measured;doseRateC;doseRateM;irradiationTime;numSamplesC;numSamplesM
1;180.139000;304359.803029;AUG 03 2015 16:36:43 GMT;46.599998;-65.599998;-49.200001;46.599998;1;88716.000000;132111.000000;0.000000;0.000000;46.398749;4643.000000;46.342000;4487.000000;5253;5445;4431;3869;-148038;-148074;-75555;-75703;304359.803029;308679.237710;0.008752;100525;100528
2;176.639000;261391.348040;AUG 03 2015 16:36:53 GMT;262.600000;57.400002;-65.599998;49.000000;1;166233.000000;121459.000000;0.000000;0.000000;49.026979;4906.000000;49.523042;4795.000000;5272;5479;3842;4047;128182;126383;-99672;-99767;264168.273430;269533.829245;0.010886;11063;11063
2;176.639000;261391.348040;AUG 03 2015 16:36:53 GMT;262.600000;-57.400002;-41.000000;46.400001;1;93767.000000;138361.000000;0.000000;0.000000;46.398749;4643.000000;46.620857;4514.000000;5409;5458;3862;4404;-128187;-61199;-62274;-63445;260916.063907;265030.589047;0.010261;202;202
2;176.639000;261391.348040;AUG 03 2015 16:36:53 GMT;262.600000;-65.599998;-57.400002;45.799999;1;88777.000000;126961.000000;0.000000;0.000000;45.799153;4583.000000;45.897893;4444.000000;5298;5481;4166;4455;-146450;-141625;-87204;-83594;261488.078635;264819.896647;0.010151;71;71

Hope this helps others out.

0 Karma
Highlighted

Re: How to extract fields from a multiline header followed by structured data columns?

Path Finder

I've had success with this RegEx in search, now I just need to figure out how to format it for props.conf and transforms.conf

 rex field=_raw "(?m)^\#PlanID: '(?<PlanID>[.?\d]+)\', IrrSessionID: '(?<IrrSessionID>[.?\d]+)\', FieldNum:(?<FieldNum>[.?\d].*)$
^\#BeamSizeID:'(?<BeamSizeID>[.?\d]+)\', Status (?<Status>\([^(]*\))\, CumMU: (?<CumMU>[.?\d]+.*)$
^\#Temp C\/M(?<TempCM>\([^(]*\))\, Pressure C\/M(?<PressureCM>\([^(]*\))$
^\#VALUES;;;$
^LayerNr;Energy;DoseRate;Start-Irradiation;Cumm MU;X;Y;MU;DiagnosticDataValid;X_Measured;Y_Measured;X_MeasuredRaw;Y_MeasuredRaw;Mu_MeasuredC; Mu_MeasuredCt;Mu_MeasuredM;Mu_MeasuredMt;minusXBeamWidth;plusXBeamWidth;minusYBeamWidth;plusYBeamWidth;currentX;currentX_Measured;currentY;currentY_Measured;doseRateC;doseRateM;irradiationTime;numSamplesC;numSamplesM$
^(?<LayerNr>[^\;*]*);(?<Energy>[^\;*]*);(?<DoseRate>[^\;*]*);(?<StartIrradiation>[^\;*]*);(?<Cumm_MU>[^\;*]*);(?<X>[^\;*]*);(?<Y>[^\;*]*);(?<MU>[^\;*]*);(?<DiagnosticDataValid>[^\;*]*);(?<X_Measured>[^\;*]*);(?<Y_Measured>[^\;*]*);(?<X_MeasuredRaw>[^\;*]*);(?<Y_MeasuredRaw>[^\;*]*);(?<Mu_MeasuredC>[^\;*]*);(?<Mu_MeasuredCt>[^\;*]*);(?<Mu_MeasuredM>[^\;*]*);(?<Mu_MeasuredMt>[^\;*]*);(?<minusXBeamWidth>[^\;*]*);(?<plusXBeamWidth>[^\;*]*);(?<minusYBeamWidth>[^\;*]*);(?<plusYBeamWidth>[^\;*]*);(?<currentX>[^\;*]*);(?<currentX_Measured>[^\;*]*);(?<currentY>[^\;*]*);(?<currentY_Measured>[^\;*]*);(?<doseRateC>[^\;*]*);(?<doseRateM>[^\;*]*);(?<irradiationTime>[^\;*]*);(?<numSamplesC>[^\;*]*);(?<numSamplesM>[^;].*)$"  
| table _time, LayerNr,Energy,DoseRate,StartIrradiation,Cumm_MU,X,Y,MU,DiagnosticDataValid,X_Measured,Y_Measured,X_MeasuredRaw,Y_MeasuredRaw,Mu_MeasuredC,Mu_MeasuredCt,Mu_MeasuredM,Mu_MeasuredMt,minusXBeamWidth,plusXBeamWidth,minusYBeamWidth,plusYBeamWidth,currentX,currentX_Measured,currentY,currentY_Measured,doseRateC,doseRateM,irradiationTime,numSamplesC,numSamplesM,PlanID,IrrSessionID,FieldNum,BeamSize,Status,CumMU,TempCM,PressureCM
0 Karma
Highlighted

Re: How to extract fields from a multiline header followed by structured data columns?

Path Finder

Correction, I HAD success with this search. But it now fails.

0 Karma
Highlighted

Re: How to extract fields from a multiline header followed by structured data columns?

Path Finder

It looks like the key to this one was to create eventtypes. One based on the header and the other the data. I then ran a search on the eventtype=IrrHeader using a regex. I then appended a search on eventtype=IrrData. From here I created my table.

This was the final search:

eventtype=IrrHeader 
| rex field=_raw "(?ms)^[^:\n]*:\s+'(?P<PlanID>[^']+)[^:\n]*:\s+'(?P<IrrSessionID>[^']+)[^:\n]*:(?P<FieldNum>\d+)\s+#\w+:'(?P<BeamSizeID>\d+\.\d+)',\s+\w+\s+(?P<Status>\(\d+,\s+\d+,\s+\d+\))[^:\n]*:\s+(?P<CumMU>\d+\.\d+)\s+#\w+\s+\w+\/\w+(?P<TempCM>\(\d+\.\d+,\d+\.\d+\)),\s+\w+\s+\w+\/\w+(?P<PressureCM>\(\d+\.\d+,\d+\.\d+\))" 
| append [search sourcetype=Irr2File eventtype=IrrData 
| rex field=_raw "^(?<LayerNr>[^\;*]*);(?<Energy>[^\;*]*);(?<DoseRate>[^\;*]*);(?<StartIrradiation>[^\;*]*);(?<Cumm_MU>[^\;*]*);(?<X>[^\;*]*);(?<Y>[^\;*]*);(?<MU>[^\;*]*);(?<DiagnosticDataValid>[^\;*]*);(?<X_Measured>[^\;*]*);(?<Y_Measured>[^\;*]*);(?<X_MeasuredRaw>[^\;*]*);(?<Y_MeasuredRaw>[^\;*]*);(?<Mu_MeasuredC>[^\;*]*);(?<Mu_MeasuredCt>[^\;*]*);(?<Mu_MeasuredM>[^\;*]*);(?<Mu_MeasuredMt>[^\;*]*);(?<minusXBeamWidth>[^\;*]*);(?<plusXBeamWidth>[^\;*]*);(?<minusYBeamWidth>[^\;*]*);(?<plusYBeamWidth>[^\;*]*);(?<currentX>[^\;*]*);(?<currentX_Measured>[^\;*]*);(?<currentY>[^\;*]*);(?<currentY_Measured>[^\;*]*);(?<doseRateC>[^\;*]*);(?<doseRateM>[^\;*]*);(?<irradiationTime>[^\;*]*);(?<numSamplesC>[^\;*]*);(?<numSamplesM>[^;].*)$"]
| table _time, LayerNr, Energy, DoseRate, StartIrradiation, Cumm_MU, X, Y, MU, DiagnosticDataValid, X_Measured, Y_Measured, X_MeasuredRaw, Y_MeasuredRaw, Mu_MeasuredC, Mu_MeasuredCt, Mu_MeasuredM, Mu_MeasuredMt, minusXBeamWidth, plusXBeamWidth, minusYBeamWidth, plusYBeamWidth, currentX, currentX_Measured, currentY, currentY_Measured, doseRateC, doseRateM, irradiationTime, numSamplesC, numSamplesM, PlanID, IrrSessionID, FieldNum, BeamSize, Status, CumMU, TempCM, PressureCM

View solution in original post

0 Karma