Splunk for Blue Coat ProxySG: About 5% of our logs did not get any field extraction. Has anyone noticed bad transforms.conf regex?

brigancc
Explorer

With the ProxySG using the default "bcreportermain_v1" output, we found that about 5% of our logs did not get any field extraction. In those events the "http_user_agent" value was blank (represented by a hyphen) and was not quoted, although this is normally a quoted field. So we surmised that the regex might be the problem. Turns out we were correct.

In the line below, the hyphen just before "2.2.2.2" is supposed to be the http_user_agent... as you can see it's unquoted.

2015-12-02 14:38:17 84 1.1.1.1 - - - OBSERVED "Business/Economy" -  200 TCP_NC_MISS GET text/html;charset=UTF-8 http prod-app.enmetric.com 80 /Command-war/retrieve ?limit=5 - - 2.2.2.2 198 129 - "none" "none"

In the line below, you can clearly see the quoted User-Agent field preceding 4.4.4.4 ...

2015-12-02 14:38:17 1662 1.1.1.2 - - - OBSERVED "Web Ads/Analytics" -  200 TCP_NC_MISS GET image/gif http p.liadm.com 80 /imp ?s=5 - "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5)" 4.4.4.4 478 982 - "none" "none"

Original transform for bcreporter_v1

(?<date>[^\s]+)\s+(?<time>[^\s]+)\s+(?<time_taken>[^\s]+)\s+(?<c_ip>[^\s]+)\s+(?<cs_username>[^\s]+)\s+(?<cs_auth_group>[^\s]+)\s+(?<x_exception_id>[^\s]+)\s+(?<filter_result>[^\s]+)\s+\"(?<category>[^\"]+)\"\s+(?<http_referrer>[^\s]+)\s+(?<sc_status>[^\s]+)\s+(?<action>[^\s]+)\s+(?<cs_method>[^\s]+)\s+(?<http_content_type>[^\s]+)\s+(?<cs_uri_scheme>[^\s]+)\s+(?<cs_host>[^\s]+)\s+(?<cs_uri_port>[^\s]+)\s+(?<cs_uri_path>[^\s]+)\s+(?<cs_uri_query>[^\s]+)\s+(?<cs_uri_extension>[^\s]+)\s+\"(?<http_user_agent>[^\"]+)\"\s+(?<s_ip>[^\s]+)\s+(?<sc_bytes>[^\s]+)\s+(?<cs_bytes>[^\s]+)\s+\"?(?<x_virus_id>[^\"]+)\"?\s+\"(?<x_bluecoat_application_name>[^\"]+)\"\s+\"(?<x_bluecoat_application_operation>[^\"]+)\"

Here is the http_user_agent portion by itself

\"(?<http_user_agent>[^\"]+)\"

Config for "bcreportermain_v1"

date time time-taken c-ip cs-username cs-auth-group x-exception-id sc-filter-result cs-categories cs(Referer)  sc-status s-action cs-method rs(Content-Type) cs-uri-scheme cs-host cs-uri-port cs-uri-path cs-uri-query cs-uri-extension cs(User-Agent) s-ip sc-bytes cs-bytes x-virus-id x-bluecoat-application-name x-bluecoat-application-operation
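
To double-check which slot the bare hyphen lands in, the format string can be zipped against a tokenized event. This is only a sketch: it uses Python's `shlex.split` as a stand-in tokenizer, assuming the quoting in these logs is simple enough for shell-style rules (no embedded quotes or backslashes). It is not how Splunk parses the event, and the names `FORMAT`, `EVENT`, and `record` are just for illustration.

```python
import shlex

# Field list copied from the "bcreportermain_v1" config above.
FORMAT = (
    "date time time-taken c-ip cs-username cs-auth-group x-exception-id "
    "sc-filter-result cs-categories cs(Referer)  sc-status s-action cs-method "
    "rs(Content-Type) cs-uri-scheme cs-host cs-uri-port cs-uri-path "
    "cs-uri-query cs-uri-extension cs(User-Agent) s-ip sc-bytes cs-bytes "
    "x-virus-id x-bluecoat-application-name x-bluecoat-application-operation"
)

# The sample event with the unquoted user agent.
EVENT = (
    '2015-12-02 14:38:17 84 1.1.1.1 - - - OBSERVED "Business/Economy" -  200 '
    'TCP_NC_MISS GET text/html;charset=UTF-8 http prod-app.enmetric.com 80 '
    '/Command-war/retrieve ?limit=5 - - 2.2.2.2 198 129 - "none" "none"'
)

field_names = FORMAT.split()   # whitespace-separated field names
values = shlex.split(EVENT)    # splits on whitespace, honoring double quotes

# Both lists should line up one-to-one before pairing them.
assert len(field_names) == len(values)
record = dict(zip(field_names, values))

print(record["cs(User-Agent)"], record["s-ip"])  # prints "- 2.2.2.2"
```

The event has the same number of tokens as the format has fields, so the hyphen in front of 2.2.2.2 really is the cs(User-Agent) value and not a missing column.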

Not sure whether the field should be fixed so that it is always quoted or if the regex is bad... curious if anyone else has noticed this.

1 Solution

brigancc
Explorer

Used the awesome regex tool at http://regex101.com/#PCRE to visualize the matching and found that the http_user_agent named capture group was surrounded by literal, mandatory quotes. Because the transform is one long regex, an event with an unquoted hyphen in the user agent position fails the entire match, so none of its fields get extracted.

The fix was to make both quotes optional by adding the "?" quantifier after each one, so that each quote matches zero or one time.
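
For anyone who wants to reproduce this outside of Splunk, here is a rough Python sketch that rebuilds the transform from a field list and runs it against the two sample events from the question. Assumptions: Python's `re` module stands in for Splunk's PCRE engine, named groups are spelled `(?P<name>...)` because `re` does not accept `(?<name>...)`, and the helper names (`FIELDS`, `TEMPLATES`, `build_pattern`) are made up for this example.

```python
import re

# Capture names in transform order; "quoted" = mandatory quotes,
# "optional" = quotes followed by "?", "bare" = [^\s]+.
FIELDS = [
    ("date", "bare"), ("time", "bare"), ("time_taken", "bare"),
    ("c_ip", "bare"), ("cs_username", "bare"), ("cs_auth_group", "bare"),
    ("x_exception_id", "bare"), ("filter_result", "bare"),
    ("category", "quoted"), ("http_referrer", "bare"),
    ("sc_status", "bare"), ("action", "bare"), ("cs_method", "bare"),
    ("http_content_type", "bare"), ("cs_uri_scheme", "bare"),
    ("cs_host", "bare"), ("cs_uri_port", "bare"), ("cs_uri_path", "bare"),
    ("cs_uri_query", "bare"), ("cs_uri_extension", "bare"),
    ("http_user_agent", "quoted"), ("s_ip", "bare"), ("sc_bytes", "bare"),
    ("cs_bytes", "bare"), ("x_virus_id", "optional"),
    ("x_bluecoat_application_name", "quoted"),
    ("x_bluecoat_application_operation", "quoted"),
]

TEMPLATES = {
    "bare": r"(?P<{0}>[^\s]+)",
    "quoted": r'"(?P<{0}>[^"]+)"',
    "optional": r'"?(?P<{0}>[^"]+)"?',
}


def build_pattern(user_agent_quoting):
    """Join all field fragments with \\s+, overriding only http_user_agent."""
    parts = []
    for name, quoting in FIELDS:
        if name == "http_user_agent":
            quoting = user_agent_quoting
        parts.append(TEMPLATES[quoting].format(name))
    return re.compile(r"\s+".join(parts))


original = build_pattern("quoted")    # quotes required around the user agent
fixed = build_pattern("optional")     # quotes made optional with "?"

UNQUOTED_UA_EVENT = (
    '2015-12-02 14:38:17 84 1.1.1.1 - - - OBSERVED "Business/Economy" -  200 '
    'TCP_NC_MISS GET text/html;charset=UTF-8 http prod-app.enmetric.com 80 '
    '/Command-war/retrieve ?limit=5 - - 2.2.2.2 198 129 - "none" "none"'
)
QUOTED_UA_EVENT = (
    '2015-12-02 14:38:17 1662 1.1.1.2 - - - OBSERVED "Web Ads/Analytics" -  200 '
    'TCP_NC_MISS GET image/gif http p.liadm.com 80 /imp ?s=5 - '
    '"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5)" 4.4.4.4 478 982 - "none" "none"'
)

for label, pattern in (("original", original), ("fixed", fixed)):
    for event in (UNQUOTED_UA_EVENT, QUOTED_UA_EVENT):
        match = pattern.search(event)
        agent = match.group("http_user_agent") if match else None
        print(label, "->", agent)
```

With these two events, the original pattern finds no match at all for the unquoted one, while the fixed pattern extracts "-" as http_user_agent and 2.2.2.2 as s_ip. One thing to be aware of: once the quotes are optional, `[^"]+` can also run across spaces, so the engine only lands on the right split by backtracking. It works for these samples, but it is worth testing against your own data.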

After applying the change, we went from 95% overall field extraction to 100%.

Fixed transform for bcreporter_v1

(?<date>[^\s]+)\s+(?<time>[^\s]+)\s+(?<time_taken>[^\s]+)\s+(?<c_ip>[^\s]+)\s+(?<cs_username>[^\s]+)\s+(?<cs_auth_group>[^\s]+)\s+(?<x_exception_id>[^\s]+)\s+(?<filter_result>[^\s]+)\s+\"(?<category>[^\"]+)\"\s+(?<http_referrer>[^\s]+)\s+(?<sc_status>[^\s]+)\s+(?<action>[^\s]+)\s+(?<cs_method>[^\s]+)\s+(?<http_content_type>[^\s]+)\s+(?<cs_uri_scheme>[^\s]+)\s+(?<cs_host>[^\s]+)\s+(?<cs_uri_port>[^\s]+)\s+(?<cs_uri_path>[^\s]+)\s+(?<cs_uri_query>[^\s]+)\s+(?<cs_uri_extension>[^\s]+)\s+\"?(?<http_user_agent>[^\"]+)\"?\s+(?<s_ip>[^\s]+)\s+(?<sc_bytes>[^\s]+)\s+(?<cs_bytes>[^\s]+)\s+\"?(?<x_virus_id>[^\"]+)\"?\s+\"(?<x_bluecoat_application_name>[^\"]+)\"\s+\"(?<x_bluecoat_application_operation>[^\"]+)\"

Here is the http_user_agent portion by itself

\"?(?<http_user_agent>[^\"]+)\"?


ppablo
Retired

Thanks @brigancc 🙂

Cheers!


ppablo
Retired

Hi @brigancc

Would you actually be able to post your fixed transforms.conf regular expressions that solved your issue as an actual answer in the "Enter your answer here..." box below? That way, this post will actually show as having an answer that may help other users out instead of showing as unresolved.


brigancc
Explorer

Very good point. Thank you for the suggestion. I'll update the post and put the solution as an answer. Thanks!
