When extracting the request or cookie from httpd logs, I'm having trouble capturing the entire request when it contains an escaped double quote. The cause appears to be how Splunk handles the sequence \".
For example, if the request field of the log contains this data:
"http://www.mydomain.com/request.pl?clientData=someVar:\"this is the important data\""
Then a regular expression for \"(?[^\"]*?)\"
will capture http://www.mydomain.com/request.pl?clientData=someVar:\
If I try \"(?(?:(\x5c\x22|[^\"]))*?)\"
then the search fails with an error saying "Please check log"... no details.
If I try \"(?(?:(\x5c\x21|[^\"]))*?)\"
then the search completes with no error. Too bad \x21 isn't what I'm looking for.
If I try \"(?(?:(\x5c.|[^\"]))*?)\"
in the hope that ANY character preceded by a backslash will match, then I get an error again.
The simple question is how would one capture data between double quotes where the data may contain escaped double quotes?
Can someone explain how to handle the \" characters in a capture group when my field boundaries are double quotes? That's what I really need. It seems like Splunk has a problem when I escape the backslash and double quotes in my regex. Other regex tools are able to handle things like \"(?(\\"|[^\"])?)\" or \"(?(?:(\\"|[^\"]))?)\" just fine... but Splunk errors on it.
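For reference, the usual way to match a quoted field that may contain escaped quotes is to allow, between the delimiters, either a backslash followed by any character or any character that is neither a quote nor a backslash. Below is a minimal sketch of that pattern in Python rather than Splunk, so it only shows that the regex logic is sound. The group name `request` and the sample line are assumptions for illustration. In a Splunk search the pattern sits inside a quoted string of its own, so the backslashes will likely need a further level of escaping; that part is not shown here and should be checked against the Splunk documentation.

```python
import re

# Escape-aware quoted field. Between the quotes, accept either:
#   \\.      a backslash followed by any single character (covers \")
#   [^"\\]   any character that is neither a quote nor a backslash
# The group name "request" is a hypothetical name chosen for this sketch.
QUOTED = re.compile(r'"(?P<request>(?:\\.|[^"\\])*)"')

# Sample line modeled on the request field from the question.
line = r'"http://www.mydomain.com/request.pl?clientData=someVar:\"this is the important data\"" "other data"'

m = QUOTED.search(line)
request = m.group("request")
print(request)
```

Because the escaped quote is consumed as a two-character unit, the match can only end at an unescaped quote, so the pattern does not stop early at `someVar:\` and does not run on into the next quoted field.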
Try something like this:
| stats c | eval _raw="2015-03-27T15:49:34 http://www.mydomain.com/request.pl?field2=value2&field1=value1&field4=value4&clientData=someVar:\"th... is the important data\"&field3=value3 data2" |rex "^[^\?\n]*\?(?P<url_parameter>.*) " | rex max_match=10 field=url_parameter "(?<url_parameter_field>\w+)=" | rex max_match=10 field=url_parameter "=(?<url_parameter_value>[0-9a-zA-Z\:\\\"\ ]*)" | fields - c
The regex "(?<url>.*)"
works on regex101.com.
Let me clarify a little. It is in fact a little more complicated than I originally stated.
The data is in W3C format. "(?.*)" would match, but the data looks like this:
"data" "data" "data" data data data "http://www.mydomain.com/request.pl?clientData=someVar:\"this is the important data\"" "other data" "more data"
\"(?[^\"]*?)\"\s\"(?[^\"]*?)\"\s\"(?[^\"]*?)\"\s(?\S*?)\s(?\S*?)\s(?\S*?)\s\"(?.*)\"
matches more than the request data.
If you just want the URL then "(?<url>http.*)"
will match it.
If you're trying to match all of the fields, then you have a trickier problem because no single delimiter separates the fields. Space won't work because of embedded spaces and some fields aren't quoted.
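One possible way around the missing single delimiter is to tokenize the line instead of splitting it: at each position, take either a quoted field (allowing backslash escapes inside it) or a run of non-space characters. The sketch below illustrates that idea in Python, not in Splunk. The function name `split_fields` is hypothetical, and the sample line is the one from the clarification above.

```python
import re

# A token is either a double-quoted field with backslash escapes allowed
# inside it (group 1), or an unquoted run of non-space characters (group 2).
TOKEN = re.compile(r'"((?:\\.|[^"\\])*)"|(\S+)')

def split_fields(line):
    """Return the fields of a line that mixes quoted and unquoted values."""
    fields = []
    for m in TOKEN.finditer(line):
        quoted, bare = m.group(1), m.group(2)
        # group 1 is None when the unquoted alternative matched
        fields.append(quoted if quoted is not None else bare)
    return fields

line = (r'"data" "data" "data" data data data '
        r'"http://www.mydomain.com/request.pl?clientData=someVar:\"this is the important data\"" '
        r'"other data" "more data"')

fields = split_fields(line)
for i, field in enumerate(fields):
    print(i, field)
```

The quoted alternative is listed first so that a field beginning with a double quote is always read up to its closing unescaped quote, embedded spaces included, before the unquoted alternative gets a chance to match.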