I have been playing with the rex command for awhile now and I am stuck. I have a csv source that I need to extract a 1-3 digit number from. The trouble is that depending upon the line in the csv, it is either in the 8th or 9th column in the csv. Sample csv line:
2014/04/28 07:23:41 AM EDT,100.200.250.27,email link click,username@email.com,ABC,"CENTRAL, MAINE",x123456,204,PC,Windows,more,text
OR
8/28/2013 15:04,100.200.250.27,,username@email.com,ABC,DEF,x987654,23,,,IE 8,Mozilla,more
I would like to extract the 204 from the first line and 23 from the second. The trouble is the extra comma in the location in some of the lines (,"CENTRAL, MAINE",).
My base rex is | rex field=_raw ",{8,9}.{1,3} (?<AAA>.*)"
What I am thinking I am doing is searching for 8 or 9 commas ,{8,9}
then when it finds it, select the next 1 to 3 characters .{1,3}
then put all that in an extraction called AAA (?<AAA>.*)
I think I am just off, but any tips would be great!
Thanks, Mike
To make this work correctly in all situations, you need to have the regex handle when CSV fields are enclosed in quotes:
| rex "^(?:(?:\"(?:\"\"|.)*?\",)|(?:[^,]*),){7}(?P<AAA>[^,]+)"
If a field starts with a double quote, everything up to the next ",
will be captured as part of that field.
This regex works for all the following examples:
2014/04/28 07:23:41 AM EDT,100.200.250.27,email link click,username@email.com,ABC,"CENTRAL, M"AINE" ",x123456,204,PC,"Windows",more,text
2014/04/28 07:23:41 AM EDT,100.200.250.27,email link click,username@email.com,ABC,"CENTRAL, MAINE",x123456,204,PC,"Windows",more,text
2014/04/28 07:23:41 AM EDT,100.200.250.27,email link click,username@email.com,"a"",",asdf,x123456,204,PC,"Windows",more,text
8/28/2013 15:04,100.200.250.27,,username@email.com,"",DEF,x987654,23,PC,,IE 8,Mozilla,more
8/28/2013 15:04,100.200.250.27,,username@email.com,ABC,DEF,x987654,23,PC,,IE 8,Mozilla,more
8/28/2013 15:04,100.200.250.27,,username@email.com,ABC,DEF,x987654,23,,,IE 8,Mozilla,more
Edit: Updated to handle double quotes inside quotes followed by a comma.
To make this work correctly in all situations, you need to have the regex handle when CSV fields are enclosed in quotes:
| rex "^(?:(?:\"(?:\"\"|.)*?\",)|(?:[^,]*),){7}(?P<AAA>[^,]+)"
If a field starts with a double quote, everything up to the next ",
will be captured as part of that field.
This regex works for all the following examples:
2014/04/28 07:23:41 AM EDT,100.200.250.27,email link click,username@email.com,ABC,"CENTRAL, M"AINE" ",x123456,204,PC,"Windows",more,text
2014/04/28 07:23:41 AM EDT,100.200.250.27,email link click,username@email.com,ABC,"CENTRAL, MAINE",x123456,204,PC,"Windows",more,text
2014/04/28 07:23:41 AM EDT,100.200.250.27,email link click,username@email.com,"a"",",asdf,x123456,204,PC,"Windows",more,text
8/28/2013 15:04,100.200.250.27,,username@email.com,"",DEF,x987654,23,PC,,IE 8,Mozilla,more
8/28/2013 15:04,100.200.250.27,,username@email.com,ABC,DEF,x987654,23,PC,,IE 8,Mozilla,more
8/28/2013 15:04,100.200.250.27,,username@email.com,ABC,DEF,x987654,23,,,IE 8,Mozilla,more
Edit: Updated to handle double quotes inside quotes followed by a comma.
Great call with the double quote - thanks!
This works fine with your example logs.
... | rex "(?i)^(?:[^,]*,){7,8}(?P<AAA>[^,]+)"
This won't work for OPs second example if the field after the 23 contains a value.