Splunk Search

How to extract a very long field with special characters?

hulahoop
Splunk Employee

I am trying to extract a field with 2 distinct problems:

  1. The field is often longer than 498 characters, at which point Splunk fails to complete the field extraction (possibly because of a PCRE recursion limit).
  2. The field values are tricky to match: each one is surrounded by double quotes and can contain doubled double quotes ("") that escape a literal quote.

In pseudo-form, here is an example event:

"JOB_NEW" "pretend this is the ""very"" long field with more than 498 chars" "next field" "more fields"
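
To make the quoting convention concrete outside Splunk, here is a minimal Python sketch. It assumes the event is space-delimited with double-quoted fields and doubled quotes as the escape, which is what the example above suggests; the use of Python's csv module is my own illustration, not something Splunk does internally.

```python
import csv

# Pseudo event from the post.
event = '"JOB_NEW" "pretend this is the ""very"" long field with more than 498 chars" "next field" "more fields"'

# Treat the event as one space-delimited record: fields are wrapped in
# double quotes, and "" inside a field stands for one literal quote.
fields = next(csv.reader([event], delimiter=" ", quotechar='"', doublequote=True))

for index, value in enumerate(fields):
    print(index, repr(value))
```

The second field comes back with the doubled quotes collapsed to single ones, which is the value the regex extraction is ultimately trying to reach.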

The regex to solve #1: ... | rex "^\"JOB_NEW\" \"(?<lsfcommand>([^\"]*)\")"

The regex to solve #2: ... | rex "^\"JOB_NEW\" \"(?<lsfcommand>(\"\"|[^\"])*)\""

Can you help find a regex to solve both #1 and #2?
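
For reference, a small Python sketch of how the two regexes behave on the pseudo event. The patterns are translated to Python's named-group syntax, (?P<name>...), which is an assumption of this sketch. Python's re engine does not have the PCRE limit, so this only illustrates problem #2.

```python
import re

# Pseudo event from the post (the real field would be longer than 498 chars).
event = '"JOB_NEW" "pretend this is the ""very"" long field with more than 498 chars" "next field" "more fields"'

# Regex #1: stops at the first quote. As written, the closing quote sits
# inside the named group, so it becomes part of the extracted value.
regex_1 = re.compile(r'^"JOB_NEW" "(?P<lsfcommand>([^"]*)")')

# Regex #2: the alternation consumes a doubled quote as one unit,
# so the match runs on to the real closing quote.
regex_2 = re.compile(r'^"JOB_NEW" "(?P<lsfcommand>(""|[^"])*)"')

value_1 = regex_1.match(event).group("lsfcommand")
value_2 = regex_2.match(event).group("lsfcommand")

print(repr(value_1))
print(repr(value_2))
```

Regex #1 cuts the value off at the first doubled quote, while regex #2 returns the whole field.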

I've included a data sample below which I believe captures both problems. So if we can find a regex that works for all 3 events then we're golden.

=== props.conf ===

[regextest]
SHOULD_LINEMERGE = false
DATETIME_CONFIG = CURRENT

=== data sample (3 events) ===

"JOB_NEW" "siliconsmart -x ""set sis_stage libgen"" scripts/sis_runme.tcl" 0 "" "default" 32987 1 "LINUX64" "" "" "" "" 2104336 0 "" "" "/prj/abcdef/lsfgbcspool/x" -1 -1 -1 "default" 0 "" "" -1 "" 0 -1
"JOB_NEW" "/prj/abc/def-sys/bolt/users/fooo/KalmanRegressionTip/tip100173/wiltsim/tools/../../library/lsf_tools/lsf_jobname_wait.pl stressTest_iceqbe.pl.b68c83ae\* /prj/abc/lte-sys/bolt/users/fooo/KalmanRegressionTip/tip100173//wiltsim/tools/regress_finalize.pl -m xluo /prj/abc/lte-sys/bolt/users/fooo/KalmanRegressionTip/tip100173//regression/logs/log_stressTest_iceqbe.pl.20130829.013458.report.log fooo fooo" 0 "" "11644" 1 "LINUX64" "" "" "" "" 2098192 0 "" "" "/prj/abcdef/lsfgbcspool/x" -1 -1 -1 "default" 0 "" "" -1 "" 0 -1
"JOB_NEW" "/prj/abc/lte-sys/bolt/users/fooo/KalmanRegressionTip/tip100173/wiltsim/tools/../../library/lsf_tools/lsf_jobname_wait.pl stressTest_iceqbe.pl.b68c83ae.1.autosim_define\* /prj/abc/lte-sys/bolt/users/fooo/KalmanRegressionTip/tip100173//run/performance/ICEQBE/_SingleCellKalmanStressTests/_compare2reference.pl -cltv -metric PDSCH_INFO_UEID_1 ThroughputMbps 0.06 /prj/abc/lte-sys/bolt/users/fooo/KalmanRegressionTip/tip100173//regression/logs/log_stressTest_iceqbe.pl.20130829.013458.report.log fooo fooo" 0 "" "11644" 1 "LINUX64" "" "" "" "" 2098192 0 "" "" "/prj/abcdef/lsfgbcspool/x" -1 -1 -1 "default" 0 "" "" -1 "" 0 -1

hulahoop
Splunk Employee

To answer my own question. 😉

Here's the regex that fixed it all:

... | rex "^\"JOB_NEW\" \"(?<lsfcommand>(?:\"\"|[^\"])*+)\""

I needed to use a non-capturing group (thank you, jonuwz) and a possessive or lazy quantifier in place of the greedy one. Not sure why it works, but happy that it does.

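These quantifiers all got past the length limit. Be aware that only the possessive form (*+) also handles the doubled quotes: the closing quote is the last thing in the pattern, so the lazy forms (*? and +?) end the match at the first quote they reach and truncate any value containing "".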

... | rex "^\"JOB_NEW\" \"(?<lsfcommand>(?:\"\"|[^\"])*+)\""

... | rex "^\"JOB_NEW\" \"(?<lsfcommand>(?:\"\"|[^\"])*?)\""

... | rex "^\"JOB_NEW\" \"(?<lsfcommand>(?:\"\"|[^\"])+?)\""
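
A minimal Python sketch to compare what the three variants actually extract from the first sample event. It uses Python's named-group syntax, and the helper name extract is hypothetical. Possessive quantifiers need Python 3.11 or later, so the sketch falls back to the greedy form on older versions, which returns the same text here. Python does not share PCRE's limits, so this only checks the quoting behaviour.

```python
import re
import sys

# First sample event from the question, cut off after a few fields.
event = '"JOB_NEW" "siliconsmart -x ""set sis_stage libgen"" scripts/sis_runme.tcl" 0 "" "default"'

def extract(quantifier):
    """Build the pattern with the given quantifier and return the extracted field."""
    pattern = r'^"JOB_NEW" "(?P<lsfcommand>(?:""|[^"])' + quantifier + r')"'
    match = re.search(pattern, event)
    return match.group("lsfcommand") if match else None

# "*+" is only valid in Python 3.11+; plain greedy "*" is the fallback.
possessive = extract("*+" if sys.version_info >= (3, 11) else "*")
lazy_star = extract("*?")
lazy_plus = extract("+?")

print("possessive:", repr(possessive))
print("lazy *?:   ", repr(lazy_star))
print("lazy +?:   ", repr(lazy_plus))
```

The possessive variant returns the full command, while both lazy variants stop just before the first doubled quote.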

hulahoop
Splunk Employee

I think I found something that works, but I'm not entirely sure why. If I change the greedy quantifier * to the possessive *+, I seem to get past the character limit. Aaah, I love regex, but regex does not love me. 🙂

hulahoop
Splunk Employee

Thank you, jonuwz. The 2nd regex does work for both, but only when the field is shorter than 498 chars. Beyond that it seems to exceed some kind of PCRE limit and fails. I tried your non-capturing suggestion and it also fails on the events with 498+ chars. 😞
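
To pin down where the extraction starts failing, one option is to generate synthetic events with a command field of a chosen length and index them in Splunk. Below is a minimal Python sketch of such a generator. The function name make_event and the filler content are hypothetical. The loop only confirms that the pattern logic is sound at each length, since Python's re engine does not have the PCRE limit.

```python
import re

def make_event(field_length):
    """Return (field, event) where the command field has exactly field_length characters."""
    prefix = 'run ""x"" '  # includes doubled quotes, like the real data
    if field_length < len(prefix):
        raise ValueError("field_length is too small for the prefix")
    field = prefix + "a" * (field_length - len(prefix))
    return field, f'"JOB_NEW" "{field}" 0 "" "default"'

# The second regex from the question, in Python named-group syntax.
pattern = re.compile(r'^"JOB_NEW" "(?P<lsfcommand>(""|[^"])*)"')

results = {}
for length in (100, 497, 498, 499, 2000):
    field, event = make_event(length)
    match = pattern.match(event)
    extracted = match.group("lsfcommand") if match else None
    results[length] = (extracted == field)
    print(length, results[length])
```

Feeding the generated events to Splunk with the same rex would show at which length the extraction stops working there.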

jonuwz
Influencer

So the 2nd regex should work for both, yes?

If the problem is the number of captures, and there'd be hundreds with (\"\"|[^\"])*, just make that bit non-capturing, i.e.:

... | rex "^\"JOB_NEW\" \"(?<lsfcommand>(?:\"\"|[^\"])*)\""
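
A short Python sketch of what the change does, using Python's named-group syntax and a shortened pseudo event (both are assumptions of the sketch). It only shows the difference in capture groups. It cannot show whether the change is enough to avoid the limit in Splunk.

```python
import re

event = '"JOB_NEW" "pretend this is the ""very"" long field" "next field"'

# Inner group captures: every loop iteration overwrites group 2.
capturing = re.compile(r'^"JOB_NEW" "(?P<lsfcommand>(""|[^"])*)"')
# Inner group is (?:...): it groups the alternation without capturing.
non_capturing = re.compile(r'^"JOB_NEW" "(?P<lsfcommand>(?:""|[^"])*)"')

cap_match = capturing.match(event)
non_cap_match = non_capturing.match(event)

print(capturing.groups, non_capturing.groups)  # number of capture groups in each pattern
print(repr(cap_match.group(2)))                # last iteration of the inner group
print(cap_match.group("lsfcommand") == non_cap_match.group("lsfcommand"))
```

Both patterns extract the same lsfcommand value. The non-capturing version simply drops the inner capture, which only ever holds the last character matched.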