Splunk for Blue Coat ProxySG: About 5% of our logs did not get any field extraction. Has anyone noticed a bad transforms.conf regex?

brigancc
Explorer

With the ProxySG using the default "bcreportermain_v1" output, we found that about 5% of our logs did not get any field extraction. We noted that when the "http_user_agent" was blank (represented by a hyphen), it was not quoted, although this is normally a quoted field. So we surmised that the regex might be the problem. It turns out we were correct.

In the line below, the hyphen just before "2.2.2.2" is supposed to be the http_user_agent... as you can see it's unquoted.

2015-12-02 14:38:17 84 1.1.1.1 - - - OBSERVED "Business/Economy" -  200 TCP_NC_MISS GET text/html;charset=UTF-8 http prod-app.enmetric.com 80 /Command-war/retrieve ?limit=5 - - 2.2.2.2 198 129 - "none" "none"

In the line below, you can clearly see the quoted User-Agent field preceding 4.4.4.4 ...

2015-12-02 14:38:17 1662 1.1.1.2 - - - OBSERVED "Web Ads/Analytics" -  200 TCP_NC_MISS GET image/gif http p.liadm.com 80 /imp ?s=5 - "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5)" 4.4.4.4 478 982 - "none" "none"

Original transform for bcreporter_v1

(?<date>[^\s]+)\s+(?<time>[^\s]+)\s+(?<time_taken>[^\s]+)\s+(?<c_ip>[^\s]+)\s+(?<cs_username>[^\s]+)\s+(?<cs_auth_group>[^\s]+)\s+(?<x_exception_id>[^\s]+)\s+(?<filter_result>[^\s]+)\s+\"(?<category>[^\"]+)\"\s+(?<http_referrer>[^\s]+)\s+(?<sc_status>[^\s]+)\s+(?<action>[^\s]+)\s+(?<cs_method>[^\s]+)\s+(?<http_content_type>[^\s]+)\s+(?<cs_uri_scheme>[^\s]+)\s+(?<cs_host>[^\s]+)\s+(?<cs_uri_port>[^\s]+)\s+(?<cs_uri_path>[^\s]+)\s+(?<cs_uri_query>[^\s]+)\s+(?<cs_uri_extension>[^\s]+)\s+\"(?<http_user_agent>[^\"]+)\"\s+(?<s_ip>[^\s]+)\s+(?<sc_bytes>[^\s]+)\s+(?<cs_bytes>[^\s]+)\s+\"?(?<x_virus_id>[^\"]+)\"?\s+\"(?<x_bluecoat_application_name>[^\"]+)\"\s+\"(?<x_bluecoat_application_operation>[^\"]+)\"

Here is the http_user_agent part by itself:

\"(?<http_user_agent>[^\"]+)\"

Config for "bcreportermain_v1"

date time time-taken c-ip cs-username cs-auth-group x-exception-id sc-filter-result cs-categories cs(Referer)  sc-status s-action cs-method rs(Content-Type) cs-uri-scheme cs-host cs-uri-port cs-uri-path cs-uri-query cs-uri-extension cs(User-Agent) s-ip sc-bytes cs-bytes x-virus-id x-bluecoat-application-name x-bluecoat-application-operation
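
To see which value lands in which field, the unquoted sample line can be lined up against this format string. The following is only a rough Python sketch and not how the add-on extracts fields: it uses shlex.split as a stand-in tokenizer (an assumption that works for these samples because they contain no single quotes or backslashes), and the names FORMAT, LINE and fields are made up for illustration.

```python
import shlex

# Access-log format string from the post (field names, in order).
FORMAT = (
    'date time time-taken c-ip cs-username cs-auth-group x-exception-id '
    'sc-filter-result cs-categories cs(Referer)  sc-status s-action cs-method '
    'rs(Content-Type) cs-uri-scheme cs-host cs-uri-port cs-uri-path cs-uri-query '
    'cs-uri-extension cs(User-Agent) s-ip sc-bytes cs-bytes x-virus-id '
    'x-bluecoat-application-name x-bluecoat-application-operation'
)

# Sample event from the post where the user agent is an unquoted hyphen.
LINE = (
    '2015-12-02 14:38:17 84 1.1.1.1 - - - OBSERVED "Business/Economy" -  200 '
    'TCP_NC_MISS GET text/html;charset=UTF-8 http prod-app.enmetric.com 80 '
    '/Command-war/retrieve ?limit=5 - - 2.2.2.2 198 129 - "none" "none"'
)

names = FORMAT.split()
# shlex.split breaks on whitespace but keeps double-quoted strings together.
values = shlex.split(LINE)
fields = dict(zip(names, values))

print(len(names), len(values))
print('cs(User-Agent) =', fields['cs(User-Agent)'])
print('s-ip =', fields['s-ip'])
```

If the two counts printed on the first line agree, every token has a field name, and the cs(User-Agent) slot shows what the proxy wrote when no user agent was present.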

Not sure whether the field should be fixed so that it is always quoted or if the regex is bad... curious if anyone else has noticed this.

brigancc
Explorer

I used the awesome regex tool at http://regex101.com/#PCRE to visualize the matching and found that the http_user_agent named capture group was surrounded by mandatory literal quotes. That caused the whole regex to fail whenever the event had no user agent, because the placeholder hyphen is not quoted.

The fix was to make the quotes optional by adding the "?" quantifier after each one, so that each quote matches zero or one time.

After applying the change, we went from 95% overall field extraction to 100%.

Fixed transform for bcreporter_v1

(?<date>[^\s]+)\s+(?<time>[^\s]+)\s+(?<time_taken>[^\s]+)\s+(?<c_ip>[^\s]+)\s+(?<cs_username>[^\s]+)\s+(?<cs_auth_group>[^\s]+)\s+(?<x_exception_id>[^\s]+)\s+(?<filter_result>[^\s]+)\s+\"(?<category>[^\"]+)\"\s+(?<http_referrer>[^\s]+)\s+(?<sc_status>[^\s]+)\s+(?<action>[^\s]+)\s+(?<cs_method>[^\s]+)\s+(?<http_content_type>[^\s]+)\s+(?<cs_uri_scheme>[^\s]+)\s+(?<cs_host>[^\s]+)\s+(?<cs_uri_port>[^\s]+)\s+(?<cs_uri_path>[^\s]+)\s+(?<cs_uri_query>[^\s]+)\s+(?<cs_uri_extension>[^\s]+)\s+\"?(?<http_user_agent>[^\"]+)\"?\s+(?<s_ip>[^\s]+)\s+(?<sc_bytes>[^\s]+)\s+(?<cs_bytes>[^\s]+)\s+\"?(?<x_virus_id>[^\"]+)\"?\s+\"(?<x_bluecoat_application_name>[^\"]+)\"\s+\"(?<x_bluecoat_application_operation>[^\"]+)\"

Here is the http_user_agent part by itself:

\"?(?<http_user_agent>[^\"]+)\"?
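
For anyone who wants to reproduce this outside Splunk, here is a rough Python sketch that runs both versions of the transform against the two sample events from the question. It is not part of the add-on. It assumes Python's re module treats these constructs the same way as Splunk's PCRE engine, and it rewrites (?<name>...) to Python's (?P<name>...) spelling. The names PREFIX, SUFFIX and build are made up for illustration.

```python
import re

# Transform text before and after the user-agent fragment, copied from the post.
PREFIX = (
    r'(?<date>[^\s]+)\s+(?<time>[^\s]+)\s+(?<time_taken>[^\s]+)\s+(?<c_ip>[^\s]+)\s+'
    r'(?<cs_username>[^\s]+)\s+(?<cs_auth_group>[^\s]+)\s+(?<x_exception_id>[^\s]+)\s+'
    r'(?<filter_result>[^\s]+)\s+\"(?<category>[^\"]+)\"\s+(?<http_referrer>[^\s]+)\s+'
    r'(?<sc_status>[^\s]+)\s+(?<action>[^\s]+)\s+(?<cs_method>[^\s]+)\s+'
    r'(?<http_content_type>[^\s]+)\s+(?<cs_uri_scheme>[^\s]+)\s+(?<cs_host>[^\s]+)\s+'
    r'(?<cs_uri_port>[^\s]+)\s+(?<cs_uri_path>[^\s]+)\s+(?<cs_uri_query>[^\s]+)\s+'
    r'(?<cs_uri_extension>[^\s]+)\s+'
)
SUFFIX = (
    r'\s+(?<s_ip>[^\s]+)\s+(?<sc_bytes>[^\s]+)\s+(?<cs_bytes>[^\s]+)\s+'
    r'\"?(?<x_virus_id>[^\"]+)\"?\s+\"(?<x_bluecoat_application_name>[^\"]+)\"\s+'
    r'\"(?<x_bluecoat_application_operation>[^\"]+)\"'
)

UA_ORIGINAL = r'\"(?<http_user_agent>[^\"]+)\"'   # quotes are mandatory
UA_FIXED = r'\"?(?<http_user_agent>[^\"]+)\"?'    # quotes are optional


def build(ua_fragment):
    """Assemble the full transform and convert (?<name> to Python's (?P<name>."""
    pattern = PREFIX + ua_fragment + SUFFIX
    return re.compile(pattern.replace('(?<', '(?P<'))


# The two sample events from the question.
UNQUOTED = (
    '2015-12-02 14:38:17 84 1.1.1.1 - - - OBSERVED "Business/Economy" -  200 '
    'TCP_NC_MISS GET text/html;charset=UTF-8 http prod-app.enmetric.com 80 '
    '/Command-war/retrieve ?limit=5 - - 2.2.2.2 198 129 - "none" "none"'
)
QUOTED = (
    '2015-12-02 14:38:17 1662 1.1.1.2 - - - OBSERVED "Web Ads/Analytics" -  200 '
    'TCP_NC_MISS GET image/gif http p.liadm.com 80 /imp ?s=5 - '
    '"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5)" 4.4.4.4 478 982 - "none" "none"'
)

original, fixed = build(UA_ORIGINAL), build(UA_FIXED)

for label, line in (('unquoted', UNQUOTED), ('quoted', QUOTED)):
    for name, rx in (('original', original), ('fixed', fixed)):
        m = rx.search(line)
        print(label, name, m.group('http_user_agent') if m else 'NO MATCH')
```

One thing to keep in mind: with optional quotes, [^\"]+ is free to run across spaces, so the match on the unquoted event relies on backtracking to settle on the single hyphen. It does so for the samples here, but it is worth testing against your own events before rolling it out.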

ppablo
Retired

Thanks @brigancc 🙂

Cheers!

ppablo
Retired

Hi @brigancc

Would you be able to post the fixed transforms.conf regular expression that solved your issue as an answer in the "Enter your answer here..." box below? That way, this post will show as having an answer that may help other users, instead of showing as unresolved.

brigancc
Explorer

Very good point. Thank you for the suggestion. I'll update the post and put the solution as an answer. Thanks!
