Host Extraction REGEX in Transforms failing for lengthy events.

lzellmer_splunk
Splunk Employee

After realizing that the hostname of a Blue Coat appliance sits at the end of each incoming event, we added a host extraction to the props and transforms of our modified Blue Coat TA to pull out the correct x_bluecoat_appliance_name. We verified that the regex works by testing with | rex field=_raw "OUR_REGEX" and by testing externally on regex101.com. Both tests were 100% successful.

When applied to new incoming data, the extraction failed on every lengthy event.

Sample data and Props/Transforms are below.

It appears that there is a maximum event length that a REGEX will search for index-time extractions?


props.conf

[bcoat_proxysg]
TRANSFORMS-hostchange = bluecoat_host


transforms.conf

[bluecoat_host]
#we tried both of these regexes - starting with the lookbehind ...
REGEX = \"(\S+)\"\s{1,3}\S+\s\S+(?<=$)
#then the one starting from the beginning of the message - as inefficient as it may be. 
#REGEX = \S+\s+\S+\s+\S+\s+\S+\s+\S+\s\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\"(.+)\"\s+\S+\s+\S+
DEST_KEY = MetaData:Host
FORMAT = host::$1

REGEX = (?<date>\S+)\s+(?<time>\S+)\s+(?<time_taken>\S+)\s+(?<c_ip>\S+)\s+(?<sc_status>\S+)\s(?<s_action>\S+)\s+(?<sc_bytes>\S+)\s+(?<cs_bytes>\S+)\s+(?<cs_method>\S+)\s+(?<cs_uri_scheme>\S+)\s+(?<cs_host>\S+)\s+(?<cs_uri_port>\S+)\s+(?<cs_uri_path>\S+)\s+(?<cs_uri_query>\S+)\s+(?<cs_username>\S+)\s+(?<cs_auth_group>\S+)\s+(?<s_hierarchy>\S+)\s+(?<s_supplier_name>\S+)\s+(?<rs_content_type>\S+)\s+(?<cs_referer>\S+)\s+\"?(?<cs_user_agent>.+)\"?\s+(?<sc_filter_result>\S+)\s+\"(?<cs_categories>.+)\"\s+(?<x_virus_id>\S+)\s+(?<s_ip>\S+)\s+(?<c_port>\S+)\s+(?<x_exception_id>\S+)\s+\"(?<cs_category>.+)\"\s+(?<cs_uri_extension>\S+)\s+(?<cs_uri>\S+)\s+(?<s_sitename>\S+)\s+(?<r_ip>\S+)\s+(?<r_dns>\S+)\s+(?<s_session_id>\S+)\s+\"(?<x_bluecoat_appliance_name>.+)\"\s+(?<x_cache_info>\S+)\s+(?<x_rs_streaming_content>\S+)

FIELDS Listing from Headers
FIELDS="date","time","time_taken","c_ip","sc_status","s_action","sc_bytes","cs_bytes","cs_method","cs_uri_scheme","cs_host","cs_uri_port","cs_uri_path","cs_uri_query","cs_username","cs_auth_group","s_hierarchy","s_supplier_name","rs_content_type","cs_referer","cs_user_agent","sc_filter_result","cs_categories","x_virus_id","s_ip","c_port","x_exception_id","cs_category","cs_uri_extension","cs_uri","s_sitename","r_ip","r_dns","s_session_id","x_bluecoat_appliance_name","x_cache_info","x_rs_streaming_content"


sample data

SHORT MSG:
2015-06-05 12:14:47 44 8.8.8.8 200 TCP_HIT 1206 2324 GET http data.t.bleacherreport.com 80 /jsonp/MLB_Reg/Baseball/2015/6/5/e19d5d29-13ce-4f34-b3c3-c13814afe6da/line_scores.json ?callback=BRLineScore_isOver_76524 username - - 8.8.8.8 application/javascript http://bleacherreport.com/articles/2484112-the-miami-heat-in-surprising-showdown-simply-cant-afford-... "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/7.0; NISSC)" OBSERVED "Sports/Recreation" - 4.4.4.4 62169 - "Sports/Recreation" json http://data.t.bleacherreport.com/jsonp/MLB_Reg/Baseball/2015/6/5/e19d5d29-13ce-4f34-b3c3-c13814afe6d... SG-HTTP-Service 157.166.249.67 data.t.bleacherreport.com - "bc-appliance-hostname" - -

LONG MSG:
2015-06-05 12:05:07 92 8.8.8.8 200 TCP_NC_MISS 715 2636 GET http webstats.americanbar.org 80 /b/ss/abajournalproduction/1/H.22.1/s24794675480276??AQB=1&ndh=1&t=5%2F5%2F2015%208%3A5%3A6%205%20240&ce=UTF-8&ns=americanbarassociation&g=http%3A%2F%2Fwww.abajournal.com%2Fnews%2Farticle%2Fdoes_blackhawks_jersey_ban_violate_the_first_amendment_maybe_law_prof_says%2F%3Futm_source%3Dinternal%26utm_medium%3Dnavigation%26utm_campaign%3Dmost_read&r=http%3A%2F%2Fwww.abajournal.com%2Fnews%2Farticle%2Fman_sued_for_sawing_neighbors_garage_in_half_isnt_liable_judge_rules%2F%3Futm_source%3Dinternal%26utm_medium%3Dnavigation%26utm_campaign%3Dmost_read&cc=USD&c1=http%3A%2F%2Fwww.abajournal.com%2Fnews%2Farticle%2Fdoes_blackhawks_jersey_ban_violate_the_first_amendment_maybe_law_prof_says%2F%3Futm_source%3Dinternal%26utm_medium%3Dnavigation%26utm_campaign%3Dmost_read&c2=http%3A%2F%2Fwww.abajournal.com%2Fnews%2Farticle%2Fman_sued_for_sawing_neighbors_garage_in_half_isnt_liable_judge_rules%2F%3Futm_source%3Dinternal%26utm_medium%3Dnavigation%26utm_campaign%3Dmost_read&c3=news&c4=article&c5=does_blackhawks_jersey_ban_violate_the_first_amendment_maybe_law_prof_says&c19=NOT%20SECURE&c20=NO%20404%20ERROR&c25=Not%20Logged%20In&c28=OTHER&c29=www.abajournal.com&c32=NON-MEMBER&c33=http%3A%2F%2Fwww.abajournal.com%2Fnews%2Farticle%2Fdoes_blackh... megodm - - webstats.americanbar.org image/gif http://www.abajournal.com/news/article/does_blackhawks_jersey_ban_violate_the_first_amendment_maybe_... "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; NISSC; rv:11.0) like Gecko" OBSERVED "Education;Government/Legal" - 8.8.8.8 59406 - "Education" - http://webstats.americanbar.org/b/ss/abajournalproduction/1/H.22.1/s24794675480276?AQB=1&ndh=1&t=5%2... SG-HTTP-Service 4.4.4.4 webstats.americanbar.org - "bc-appliance-hostname" - -

1 Solution

lzellmer_splunk
Splunk Employee

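For events longer than 4096 characters, set the LOOKAHEAD parameter in the transform's stanza. By default the REGEX only searches the first 4096 characters of an event, so a hostname at the end of a longer event is never reached. From the transforms.conf spec: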

LOOKAHEAD = <integer>
* NOTE: This option is valid for all index time transforms, such as index-time
field creation, or DEST_KEY modifications.
* Optional. Specifies how many characters to search into an event.
* Defaults to 4096. You may want to increase this value if you have event line lengths that
exceed 4096 characters (before linebreaking).

transforms.conf

[bluecoat_host]
LOOKAHEAD = 10000
REGEX = \"(\S+)\"\s{1,3}\S+\s\S+(?<=$)
DEST_KEY = MetaData:Host
FORMAT = host::$1
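
To see why the original transform matched short events but not long ones, here is a rough Python sketch of the behavior. It is only a model, not Splunk's implementation: it assumes LOOKAHEAD acts like a plain truncation of the event before the regex runs, it swaps the trailing (?<=$) lookbehind for an ordinary $ anchor, and the helper names (extract_host, make_event) and the synthetic events are made up for illustration.

```python
import re

# Simplified stand-in for the transform's REGEX: the trailing (?<=$) lookbehind
# is replaced by a plain $ anchor (an assumption made for this sketch).
HOST_REGEX = re.compile(r'"(\S+)"\s{1,3}\S+\s\S+$')

DEFAULT_LOOKAHEAD = 4096  # default documented in the transforms.conf spec


def extract_host(raw_event, lookahead=DEFAULT_LOOKAHEAD):
    """Return the captured host, or None, searching only the first `lookahead` characters.

    Assumption: LOOKAHEAD is modelled here as a simple truncation of the event.
    """
    window = raw_event[:lookahead]
    match = HOST_REGEX.search(window)
    return match.group(1) if match else None


def make_event(uri_query_length):
    """Build a synthetic event (hypothetical data) with the appliance name at the end."""
    long_query = "?" + "a" * uri_query_length
    return (
        "2015-06-05 12:05:07 92 8.8.8.8 200 TCP_NC_MISS "
        f'{long_query} - "bc-appliance-hostname" - -'
    )


short_event = make_event(100)   # well under 4096 characters
long_event = make_event(5000)   # hostname sits beyond character 4096

print(len(short_event), extract_host(short_event))
print(len(long_event), extract_host(long_event))
print(len(long_event), extract_host(long_event, lookahead=10000))
```

In this model the long event only yields a host once the lookahead window covers the whole event, which is why LOOKAHEAD should be set comfortably above the longest event you expect.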

richgalloway
SplunkTrust

If you solved your problem, please accept your answer.
