There is a field in my Bluecoat Proxy logs that is not extracting correctly.
Here are portions of the two possible log formats:
2015-02-02 14:59:08 1170 x.x.x.x - - - OBSERVED "Technology/Internet" - 200 TCP_NC_MISS POST application/json;charset=utf-8 http www.umeng.com 80 /check_config_update - - - y.y.y.y 185 659 - "none" "none" x.x.x.x "Tengine" www.umeng.com
2015-02-02 14:54:09 939 x.x.x.x - - - OBSERVED "Business/Economy" http://cloudcroftwebcam.com/camera-1/ 200 TCP_MISS GET image/jpeg http cloudcroftwebcam.com 80 /camera1.jpg ?Mon%20Feb%202%2007:54:08%20MST%202015 jpg "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)" y.y.y.y 71592 367 - "none" "none" x.x.x.x "Apache" cloudcroftwebcam.com
The field that is not extracting correctly is the http_user_agent field. In the top record, this field is the third "-", just before the "y.y.y.y" IP. In the lower record, the field is the content between the quote marks: "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)"
This is my regex:
\s+\"(?<http_user_agent_new>[^\"]+)\"\s+
This works well when there is content between the quotes, but not when there is no content and just a "-" with no quotes.
I have tried this regex:
\s+(?<http_user_agent_new>[^\s]+)\s+
and it works until there is a space inside the quotes, at which point the match stops.
I tried this regex:
[\s\"|\s+](?<http_user_agent_new>[^[\"|\s]]+)[\"\s+|\s+]
But this does not work either; it seems to be too many ORs for the regex to understand what I want.
How can I search for the dash with no quotes when there is no "http_user_agent" content and search for the content between the quotes when there is?
Hi,
This one works for both examples, though I don't think it is the cleanest:
^.*?(?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.){3}(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9]).*?".*?".*?(?<http_user_agent>-|".*?")\s(?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.){3}(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])
Danny
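
For anyone who wants to try the pattern above outside of Splunk, here is a minimal Python sketch. It makes a few assumptions: the anonymized x.x.x.x and y.y.y.y addresses are replaced with made-up numeric IPs (10.0.0.1 and 192.0.2.10), because the pattern only matches real dotted-quad addresses; the named group is written as (?P<name>...), which is Python's spelling of (?<name>...); and SHORTER_RE is a hypothetical shorter variant that is not from the thread. It relies on the user agent being the first "-" or quoted string that is directly followed by an IP address, which holds for these two samples but may not hold for every Bluecoat log layout.

```python
import re

# One IPv4 octet and a full dotted-quad, exactly as in the answer's pattern.
OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])"
IPV4 = rf"(?:{OCTET}\.){{3}}{OCTET}"

# The answer's pattern, with (?<name>...) rewritten as Python's (?P<name>...).
ANSWER_RE = re.compile(
    rf'^.*?{IPV4}.*?".*?".*?(?P<http_user_agent>-|".*?")\s{IPV4}'
)

# Hypothetical shorter variant (an assumption, not from the thread):
# a bare "-" or a quoted string, immediately followed by an IP address.
SHORTER_RE = re.compile(rf'\s(?P<http_user_agent>-|"[^"]*")\s{IPV4}\s')

# Sample records from the question, with made-up IPs substituted
# for the anonymized x.x.x.x (10.0.0.1) and y.y.y.y (192.0.2.10).
LINE_DASH = (
    '2015-02-02 14:59:08 1170 10.0.0.1 - - - OBSERVED "Technology/Internet" '
    '- 200 TCP_NC_MISS POST application/json;charset=utf-8 http www.umeng.com '
    '80 /check_config_update - - - 192.0.2.10 185 659 - "none" "none" '
    '10.0.0.1 "Tengine" www.umeng.com'
)
LINE_QUOTED = (
    '2015-02-02 14:54:09 939 10.0.0.1 - - - OBSERVED "Business/Economy" '
    'http://cloudcroftwebcam.com/camera-1/ 200 TCP_MISS GET image/jpeg http '
    'cloudcroftwebcam.com 80 /camera1.jpg '
    '?Mon%20Feb%202%2007:54:08%20MST%202015 jpg '
    '"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)" '
    '192.0.2.10 71592 367 - "none" "none" 10.0.0.1 "Apache" '
    'cloudcroftwebcam.com'
)


def extract_user_agent(pattern, line):
    """Return the user agent without surrounding quotes, or None if no match."""
    match = pattern.search(line)
    if match is None:
        return None
    # The capture keeps the quotes when they are present, so strip them here.
    return match.group("http_user_agent").strip('"')


if __name__ == "__main__":
    for pattern in (ANSWER_RE, SHORTER_RE):
        for line in (LINE_DASH, LINE_QUOTED):
            print(extract_user_agent(pattern, line))
```

Note that both patterns capture the quote marks together with the user agent when they are present, which is why the helper strips them afterwards.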