How to edit my regex to extract all expected fields from my sample Blue Coat log?

New Member

I'm using the following regular expression:

(?<timestamp>:"(\d{1,4}\-\d{1,2}\-\d{1,2}\s\d{1,2}:\d{1,2}:\d{1,2})"|(\d{1,4}\-\d{1,2}\-\d{1,2}\s\d{1,2}:\d{1,2}:\d{1,2}))\s+(?<time_taken>:"([^"]+)"|(\S+))\s+(?<c_ip>:"([^"]+)"|(\S+))\s+(?<cs_username>:"([^"]+)"|(\S+))\s+(?<cs_auth_group>:"([^"]+)"|(\S+))\s+(?<x_exception_id>:"([^"]+)"|(\S+))\s+(?<sc_filter_result>:"([^"]+)"|(\S+))\s+(?<cs_categories>:"([^"]+)"|(\S+))\s+(?<cs_referrer>:"([^"]+)"|(\S+))\s+(?<sc_status>:"([^"]+)"|(\S+))\s+(?<s_action>:"([^"]+)"|(\S+))\s+(?<cs_method>:"([^"]+)"|(\S+))\s+(?<rs_content_type>:"([^"]+)"|(\S+))\s+(?<cs_uri_scheme>:"([^"]+)"|(\S+))\s+(?<cs_host>:"([^"]+)"|(\S+))\s+(?<cs_uri_port>:"([^"]+)"|(\S+))\s+(?<cs_uri_path>:"([^"]+)"|(\S+))\s+(?<cs_uri_query>:"([^"]+)"|(\S+))\s+(?<cs_uri_extension>:"([^"]+)"|(\S+))\s+(?<cs_user_agent>:"([^"]+)"|(\S+))\s+(?<s_ip>:"([^"]+)"|(\S+))\s+(?<sc_bytes>:"([^"]+)"|(\S+))\s+(?<cs_bytes>:"([^"]+)"|(\S+))\s+(?<x_virus_id>:"([^"]+)"|(\S+))\s+(?<x_bluecoat_application_name>:"([^"]+)"|(\S+))\s+(?<x_bluecoat_application_operation>:"([^"]+)"|(\S+))\s+(?<cs_auth_type>:"([^"]+)"|(\S+))\s*

against the following sample log event:

2016-07-28 23:37:32 240144 1.1.1.1 - - - OBSERVED "Social Networking" -  200 TCP_TUNNELED CONNECT - tcp plus.google.com 443 / - - "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.130 Safari/537.36" 1.1.1.1 1135 2522 - "GooglePlus" "none" - 

There should be 28 fields in that sample event when date and time are counted as separate fields (my regex combines them into a single timestamp field).

With my regular expression, I'm finding that the space in the "cs_categories" field is being used to end the regex match, which doesn't make sense to me since when I try it out on a regex simulator it matches just fine. Example: http://regexr.com/3dtdr

It's obvious that the space in the cs_categories field is somehow throwing off the parser, but I'm not sure why. I'm not a regex master, so I'm leaning toward it being a Splunk-specific difference in the regex engine, but I could be entirely wrong.

I would really appreciate any kind of help.

Thanks.

Communicator

Bluecoat logs are a pain in the *** to extract but I think this regex should do the trick:

(?<timestamp>[0-9-:\s]{19})\s+(?<time_taken>[^\s]+)\s+(?<c_ip>[^\s]+)\s+(?<cs_username>[^\s]+)\s+(?<cs_auth_group>[^\s]+)\s+(?<x_exception_id>[^\s]+)\s+(?<sc_filter_result>[^\s]+)(?:\s+\"|\s+)(?<cs_categories>[^\"]+)(?:\"\s+|\s+)(?<cs_referrer>[^\s]+)\s+(?<sc_status>[^\s]+)\s+(?<s_action>[^\s]+)\s+(?<cs_method>[^\s]+)\s+(?<rs_content_type>[^\s]+)\s+(?<cs_uri_scheme>[^\s]+)\s+(?<cs_host>[^\s]+)\s+(?<cs_uri_port>[^\s]+)\s+(?<cs_uri_path>[^\s]+)\s+(?<cs_uri_query>[^\s]+)\s+(?<cs_uri_extension>[^\s]+)(?:\s+\"|\s+)(?<cs_user_agent>[^\"]+)(?:\"\s+|\s+)(?<s_ip>[^\s]+)\s+(?<sc_bytes>[^\s]+)\s+(?<cs_bytes>[^\s]+)(?:\s+\"|\s+)(?<x_virus_id>[^\"\s]+)(?:\"\s+\"|\"\s+|\s+\"|\s+)(?<x_bluecoat_application_name>[^\s\"]+)(?:\"\s+\"|\"\s+|\s+\"|\s+)(?<x_bluecoat_application_operation>[^\"\s]+)(?:\"\s+|\s+)(?<cs_auth_type>[^\s]+)

Contributor

I was confused by this phrase: "With my regular expression, I'm finding that the space in the "cs_categories" field is being used to end the regex match". After some experimenting, I understood you to mean that if the category in your data is "Social Networking", then the extracted cs_categories value is "Social (opening quote included, cut off at the space). That's not what I would expect to happen, but I was actually able to reproduce it, so I'm guessing that's what you meant.

So in this regex:

(?<cs_categories>:"([^"]+)"|(\S+))

the "([^"]+)" alternative is supposed to match (because it comes first and the quotes are there), but the (\S+) alternative is also a potential match and seems to be preferred by the regex engine in this instance. I believe that alternative is there to match cases where there is no category and the data just has a single -. There might be other cases too, but the point is they won't have double quotes.

So I would suggest you replace the \S+ with [^"]\S* to prevent that alternative from being used when quotes are present. I think that should work. The idea is that [^"] means the first character cannot be a " and of course we replace the + (which means 1 or more) with the * (which means zero or more) so that we still match instances where the match is a single character long.
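
For what it's worth, here is a minimal Python sketch of what may be going on. This is an assumption, not something confirmed in this thread. Python's re module spells named groups (?P<name>...); otherwise the fragment is the one quoted above. In a pattern like (?<cs_categories>:"([^"]+)"|(\S+)), the colon after the group name is a literal character, so the first alternative can only match text that begins with :", which the sample event does not contain. The shortened sample string, the OBSERVED\s+ prefix used as an anchor, and the non-capturing (?:...) variant are illustrative choices, not part of the original regex.

```python
import re

# Shortened slice of the sample event around the cs_categories field.
sample = 'OBSERVED "Social Networking" -  200'

# Fragment as posted: the ':' after the group name is a literal colon,
# so the quoted alternative needs the text to start with ':"'.
as_posted = re.compile(r'OBSERVED\s+(?P<cs_categories>:"([^"]+)"|(\S+))')

# Suggested change: the second alternative may not start with a quote.
no_quote_start = re.compile(r'OBSERVED\s+(?P<cs_categories>:"([^"]+)"|([^"]\S*))')

# Illustrative variant: the alternation wrapped in a non-capturing group.
non_capturing = re.compile(r'OBSERVED\s+(?P<cs_categories>(?:"([^"]+)"|(\S+)))')

posted_value = as_posted.search(sample).group("cs_categories")
changed_match = no_quote_start.search(sample)
fixed_value = non_capturing.search(sample).group("cs_categories")
dash_value = non_capturing.search('OBSERVED - -  200').group("cs_categories")

print(posted_value)   # "Social
print(changed_match)  # None
print(fixed_value)    # "Social Networking"
print(dash_value)     # -
```

If that reading is right, it would also explain why tightening only the second alternative makes the whole match fail on quoted values: neither alternative can match a value that starts with a bare double quote.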

Hope it helps.

Splunk Employee

Are you using the Splunk Blue Coat TA or do you have custom log formats you're dealing with? If you're not using the TA, this should help:

https://splunkbase.splunk.com/app/2758/

New Member

Not exactly.

I'm testing my own add-on using data generated from SA-Eventgen, and that data happens to be based off of Blue Coat logs. Those logs are pretty custom I think (the one in the original post is a decent example).

I can try the Add-on for Blue Coat ProxySG though and see what happens.

Builder

I doubt it will work for everything, since cs_user_agent changes every time. But this one works on your sample event, and it works for cs_categories. I just used the built-in field extractor to get the one field, and then inserted its pattern into your regex after `(?<cs_categories>`

If possible, I'd set Blue Coat to insert delimiters like | into your logs and just use a delimited extraction. cs_user_agent is the bane of web logs.

(?<timestamp>:"(\d{1,4}\-\d{1,2}\-\d{1,2}\s\d{1,2}:\d{1,2}:\d{1,2})"|(\d{1,4}\-\d{1,2}\-\d{1,2}\s\d{1,2}:\d{1,2}:\d{1,2}))\s+(?<time_taken>:"([^"]+)"|(\S+))\s+(?<c_ip>:"([^"]+)"|(\S+))\s+(?<cs_username>:"([^"]+)"|(\S+))\s+(?<cs_auth_group>:"([^"]+)"|(\S+))\s+(?<x_exception_id>:"([^"]+)"|(\S+))\s+(?<sc_filter_result>:"([^"]+)"|(\S+))\s+(?<cs_categories>"\w+\s+\w+"|(\S+))\s+(?<cs_referrer>:"([^"]+)"|(\S+))\s+(?<sc_status>:"([^"]+)"|(\S+))\s+(?<s_action>:"([^"]+)"|(\S+))\s+(?<cs_method>:"([^"]+)"|(\S+))\s+(?<rs_content_type>:"([^"]+)"|(\S+))\s+(?<cs_uri_scheme>:"([^"]+)"|(\S+))\s+(?<cs_host>:"([^"]+)"|(\S+))\s+(?<cs_uri_port>:"([^"]+)"|(\S+))\s+(?<cs_uri_path>:"([^"]+)"|(\S+))\s+(?<cs_uri_query>:"([^"]+)"|(\S+))\s+(?<cs_uri_extension>:"([^"]+)"|(\S+))\s+(?<cs_user_agent>("\w+/\d+\.\d+\s+\(\w+\s+\w+\s+\d+\.\d+;\s+\w+\)\s+\w+/\d+\.\d+\s+\(\w+,\s+\w+\s+\w+\)\s+\w+/\d+\.\d+\.\d+\.\d+\s+\w+/\d+\.\d+")|(\S+))\s+(?<s_ip>:"([^"]+)"|(\S+))\s+(?<sc_bytes>:"([^"]+)"|(\S+))\s+(?<cs_bytes>:"([^"]+)"|(\S+))\s+(?<x_virus_id>:"([^"]+)"|(\S+))\s+(?<x_bluecoat_application_name>:"([^"]+)"|(\S+))\s+(?<x_bluecoat_application_operation>:"([^"]+)"|(\S+))\s+(?<cs_auth_type>:"([^"]+)"|(\S+))\s*

New Member

That's really odd..

I just tried that expression you have and it's somewhat working, but it turns out there are many variations of how the values for that field can appear, so I still get off-by-one type issues where the wrong field's value is recorded as a cs_categories value.

Not sure if there's a better way to find them all..

I don't actually have access to the source Blue Coat system so I don't have a way to set delimiters like that, though I wish that I could...

Builder

My hand-written regex is rusty, but if the values are always enclosed in quotes, rebuild the line to capture anything between the quotes.

|(\S+))\s+(?<cs_categories>"([^"]+)"

New Member

Yeah, that's what I've tried and then it all just breaks down.

Builder

The closest thing to this that I've dealt with is IIS logs. I followed this guide:

http://blogs.splunk.com/2013/10/18/iis-logs-and-splunk-6/

IIS logs have a header row in the file that defines the fields, and they are whitespace-delimited. The guide also includes props.conf examples that worked for me. The app I put in place works about 95% of the time, which is good enough for what we need. Where it breaks is cs_user_agent.
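
In the same spirit as a field-list-driven, whitespace-delimited extraction, here is a minimal Python sketch (not Splunk configuration) that splits the sample event from the original post into tokens, treating a double-quoted run as a single token, and pairs the tokens with field names. The field list is taken from the regex in the question with date and time kept separate. The names FIELDS, TOKEN, and parse_line are hypothetical, and the sketch assumes values never contain an embedded double quote.

```python
import re

# Field names from the question's regex, with date and time kept separate.
FIELDS = (
    "date time time_taken c_ip cs_username cs_auth_group x_exception_id "
    "sc_filter_result cs_categories cs_referrer sc_status s_action cs_method "
    "rs_content_type cs_uri_scheme cs_host cs_uri_port cs_uri_path "
    "cs_uri_query cs_uri_extension cs_user_agent s_ip sc_bytes cs_bytes "
    "x_virus_id x_bluecoat_application_name "
    "x_bluecoat_application_operation cs_auth_type"
).split()

# A token is either a double-quoted string or a run of non-whitespace.
TOKEN = re.compile(r'"([^"]*)"|(\S+)')


def parse_line(line):
    """Return a field dict, or None if the token count does not match."""
    tokens = [bare or quoted for quoted, bare in TOKEN.findall(line)]
    if len(tokens) != len(FIELDS):
        return None
    return dict(zip(FIELDS, tokens))


# Sample event from the original post.
sample = (
    '2016-07-28 23:37:32 240144 1.1.1.1 - - - OBSERVED "Social Networking" -  '
    '200 TCP_TUNNELED CONNECT - tcp plus.google.com 443 / - - '
    '"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
    'Chrome/44.0.2403.130 Safari/537.36" 1.1.1.1 1135 2522 - "GooglePlus" "none" - '
)

event = parse_line(sample)
print(len(event))              # 28
print(event["cs_categories"])  # Social Networking
print(event["s_ip"])           # 1.1.1.1
```

Because each token is matched independently, a user agent that looks different from the sample does not shift the fields after it, as long as it stays inside double quotes.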
