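I am trying to extract a field with 2 distinct problems: (#1) the field can be longer than 498 chars, and (#2) the field can contain embedded double quotes, escaped as "".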
In pseudo speak, here is an example event:
"JOB_NEW" "pretend this is the ""very"" long field with more than 498 chars" "next field" "more fields"
The regex to solve #1: ... | rex "^\"JOB_NEW\" \"(?<lsfcommand>[^\"]*)\""
The regex to solve #2: ... | rex "^\"JOB_NEW\" \"(?<lsfcommand>(\"\"|[^\"])*)\""
Can you help find a regex to solve both #1 and #2?
I've included a data sample below which I believe captures both problems. So if we can find a regex that works for all 3 events then we're golden.
=== props.conf ===
[regextest]
SHOULD_LINEMERGE = false
DATETIME_CONFIG = CURRENT
=== data sample (3 events) ===
"JOB_NEW" "siliconsmart -x ""set sis_stage libgen"" scripts/sis_runme.tcl" 0 "" "default" 32987 1 "LINUX64" "" "" "" "" 2104336 0 "" "" "/prj/abcdef/lsfgbcspool/x" -1 -1 -1 "default" 0 "" "" -1 "" 0 -1
"JOB_NEW" "/prj/abc/def-sys/bolt/users/fooo/KalmanRegressionTip/tip100173/wiltsim/tools/../../library/lsf_tools/lsf_jobname_wait.pl stressTest_iceqbe.pl.b68c83ae\* /prj/abc/lte-sys/bolt/users/fooo/KalmanRegressionTip/tip100173//wiltsim/tools/regress_finalize.pl -m xluo /prj/abc/lte-sys/bolt/users/fooo/KalmanRegressionTip/tip100173//regression/logs/log_stressTest_iceqbe.pl.20130829.013458.report.log fooo fooo" 0 "" "11644" 1 "LINUX64" "" "" "" "" 2098192 0 "" "" "/prj/abcdef/lsfgbcspool/x" -1 -1 -1 "default" 0 "" "" -1 "" 0 -1
"JOB_NEW" "/prj/abc/lte-sys/bolt/users/fooo/KalmanRegressionTip/tip100173/wiltsim/tools/../../library/lsf_tools/lsf_jobname_wait.pl stressTest_iceqbe.pl.b68c83ae.1.autosim_define\* /prj/abc/lte-sys/bolt/users/fooo/KalmanRegressionTip/tip100173//run/performance/ICEQBE/_SingleCellKalmanStressTests/_compare2reference.pl -cltv -metric PDSCH_INFO_UEID_1 ThroughputMbps 0.06 /prj/abc/lte-sys/bolt/users/fooo/KalmanRegressionTip/tip100173//regression/logs/log_stressTest_iceqbe.pl.20130829.013458.report.log fooo fooo" 0 "" "11644" 1 "LINUX64" "" "" "" "" 2098192 0 "" "" "/prj/abcdef/lsfgbcspool/x" -1 -1 -1 "default" 0 "" "" -1 "" 0 -1
To answer my own question. 😉
Here's the regex that fixed it all:
... | rex "^\"JOB_NEW\" \"(?<lsfcommand>(?:\"\"|[^\"])*+)\""
I needed to use a non-capturing group (thank you jonuwz) and a possessive quantifier. Not sure why it works, but happy that it does.
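These quantifiers all got past the length limit:
... | rex "^\"JOB_NEW\" \"(?<lsfcommand>(?:\"\"|[^\"])*+)\""
... | rex "^\"JOB_NEW\" \"(?<lsfcommand>(?:\"\"|[^\"])*?)\""
... | rex "^\"JOB_NEW\" \"(?<lsfcommand>(?:\"\"|[^\"])+?)\""
Only the possessive *+ also solves #2, though: the lazy forms try the closing \" before taking another repetition, so they stop at the first quote of an embedded \"\" pair.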
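For anyone who wants to sanity-check the alternation outside Splunk, here is a rough Python sketch. Several assumptions apply: Python's `re` module is not PCRE, so this says nothing about the 498-char limit seen in Splunk; named groups are written `(?P<name>...)` in Python; the greedy `*` form is used because possessive `*+` needs Python 3.11 or later; the helper name `extract_lsfcommand` is made up for illustration; and the long event is synthetic (padded past 498 chars), while the quoted event is shortened from the first sample event.

```python
import re

# Greedy, non-capturing form of the pattern from this thread, in Python syntax.
# (?:""|[^"]) matches either an escaped quote pair or a single non-quote character.
GREEDY = re.compile(r'^"JOB_NEW" "(?P<lsfcommand>(?:""|[^"])*)"')

# Lazy form, for comparison.
LAZY = re.compile(r'^"JOB_NEW" "(?P<lsfcommand>(?:""|[^"])*?)"')


def extract_lsfcommand(pattern, event):
    """Return the raw lsfcommand field, or None if the event does not match."""
    m = pattern.match(event)
    return m.group("lsfcommand") if m else None


# Shortened copy of the first sample event (embedded doubled quotes).
quoted_event = (
    '"JOB_NEW" "siliconsmart -x ""set sis_stage libgen"" scripts/sis_runme.tcl"'
    ' 0 "" "default"'
)

# Synthetic event whose second field is well over 498 characters.
long_field = "/prj/abc/tool.pl " + "x" * 600
long_event = '"JOB_NEW" "' + long_field + '" 0 "" "11644"'

print(repr(extract_lsfcommand(GREEDY, quoted_event)))
print(repr(extract_lsfcommand(LAZY, quoted_event)))
print(extract_lsfcommand(GREEDY, long_event) == long_field)
```

In this sketch the greedy form returns the whole field including the doubled quotes, while the lazy form returns only the text before the first `""`, because the closing quote is tried before another repetition. It is worth running the lazy variants against the first sample event in Splunk before relying on them.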
Catastrophic backtracking, perhaps?
I think I found something that works, but not entirely sure why it works. If I change the quantifier * to be possessive *+, then I seem to be able to get past the character limitation. Aaah, I love regex, but regex does not love me. 🙂
Thank you jonuwz. The 2nd regex does work for both, but only when the field is less than 498 chars long. Beyond that it exceeds some kind of PCRE limit and fails. I tried your non-capturing suggestion and it also fails on the event with 498+ chars. 😞
So the 2nd regex should work for both, yes?
If the problem is the number of capture groups (and there'd be hundreds with (\"\"|[^\"])*), just make that bit non-capturing, i.e.:
... | rex "^\"JOB_NEW\" \"(?<lsfcommand>(?:\"\"|[^\"])*)\""
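To see what the non-capturing change does to the group count, here is a small Python sketch. It assumes Python's `re` module, which is not the PCRE engine Splunk uses, so it only illustrates the group semantics and cannot show whether the change affects any PCRE limit. The event string is a made-up short example.

```python
import re

# Capturing inner group, as in the 2nd regex from the question (Python syntax).
capturing = re.compile(r'^"JOB_NEW" "(?P<lsfcommand>(""|[^"])*)"')

# Same pattern with the inner group made non-capturing.
non_capturing = re.compile(r'^"JOB_NEW" "(?P<lsfcommand>(?:""|[^"])*)"')

# Number of capture groups each pattern defines.
print(capturing.groups)      # 2
print(non_capturing.groups)  # 1

event = '"JOB_NEW" "a ""b"" c" "next field"'
m = capturing.match(event)
print(m.group("lsfcommand"))  # a ""b"" c
print(m.group(2))             # c
```

In Python at least, a repeated capturing group is still a single numbered group that keeps only its last repetition, so `(?:...)` removes one group here rather than one per character.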