Host Extraction REGEX in Transforms failing for lengthy events.

lzellmer_splunk
Splunk Employee

After realizing that the hostname of a Blue Coat appliance appears at the end of each incoming event, we created a host name extraction in props.conf and transforms.conf of our modified Blue Coat TA to extract the correct x_bluecoat_appliance_name. We verified that the regex works by testing it with | rex field=_raw "OUR_REGEX" and externally on regex101.com. Both tests were 100% successful.

When applied to new incoming data, the extraction failed on all lengthy messages.

Sample data and Props/Transforms are below.

Is there a maximum event length that a regex can search for index-time extractions?


props.conf

[bcoat_proxysg]
TRANSFORMS-hostchange = bluecoat_host


transforms.conf

[bluecoat_host]
#we tried both of these regexes - starting with the lookbehind ...
REGEX = \"(\S+)\"\s{1,3}\S+\s\S+(?<=$)
#then the one starting from the beginning of the message - as inefficient as it may be. 
#REGEX = \S+\s+\S+\s+\S+\s+\S+\s+\S+\s\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\"(.+)\"\s+\S+\s+\S+
DEST_KEY = MetaData:Host
FORMAT = host::$1

REGEX = (?<date>\S+)\s+(?<time>\S+)\s+(?<time_taken>\S+)\s+(?<c_ip>\S+)\s+(?<sc_status>\S+)\s(?<s_action>\S+)\s+(?<sc_bytes>\S+)\s+(?<cs_bytes>\S+)\s+(?<cs_method>\S+)\s+(?<cs_uri_scheme>\S+)\s+(?<cs_host>\S+)\s+(?<cs_uri_port>\S+)\s+(?<cs_uri_path>\S+)\s+(?<cs_uri_query>\S+)\s+(?<cs_username>\S+)\s+(?<cs_auth_group>\S+)\s+(?<s_hierarchy>\S+)\s+(?<s_supplier_name>\S+)\s+(?<rs_content_type>\S+)\s+(?<cs_referer>\S+)\s+\"?(?<cs_user_agent>.+)\"?\s+(?<sc_filter_result>\S+)\s+\"(?<cs_categories>.+)\"\s+(?<x_virus_id>\S+)\s+(?<s_ip>\S+)\s+(?<c_port>\S+)\s+(?<x_exception_id>\S+)\s+\"(?<cs_category>.+)\"\s+(?<cs_uri_extension>\S+)\s+(?<cs_uri>\S+)\s+(?<s_sitename>\S+)\s+(?<r_ip>\S+)\s+(?<r_dns>\S+)\s+(?<s_session_id>\S+)\s+\"(?<x_bluecoat_appliance_name>.+)\"\s+(?<x_cache_info>\S+)\s+(?<x_rs_streaming_content>\S+)

FIELDS Listing from Headers
FIELDS="date","time","time_taken","c_ip","sc_status","s_action","sc_bytes","cs_bytes","cs_method","cs_uri_scheme","cs_host","cs_uri_port","cs_uri_path","cs_uri_query","cs_username","cs_auth_group","s_hierarchy","s_supplier_name","rs_content_type","cs_referer","cs_user_agent","sc_filter_result","cs_categories","x_virus_id","s_ip","c_port","x_exception_id","cs_category","cs_uri_extension","cs_uri","s_sitename","r_ip","r_dns","s_session_id","x_bluecoat_appliance_name","x_cache_info","x_rs_streaming_content"
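
As a quick check outside Splunk that x_bluecoat_appliance_name really is the third field from the end, the FIELDS listing can be paired with a whitespace-and-quote-aware split of an event. This is only a sketch: `parse_event` is a hypothetical helper, the sample event is an abbreviated version of the SHORT MSG below, and it assumes every event has exactly one token per header field.

```python
import shlex

# Field names copied from the FIELDS header listing above.
FIELDS = [
    "date", "time", "time_taken", "c_ip", "sc_status", "s_action",
    "sc_bytes", "cs_bytes", "cs_method", "cs_uri_scheme", "cs_host",
    "cs_uri_port", "cs_uri_path", "cs_uri_query", "cs_username",
    "cs_auth_group", "s_hierarchy", "s_supplier_name", "rs_content_type",
    "cs_referer", "cs_user_agent", "sc_filter_result", "cs_categories",
    "x_virus_id", "s_ip", "c_port", "x_exception_id", "cs_category",
    "cs_uri_extension", "cs_uri", "s_sitename", "r_ip", "r_dns",
    "s_session_id", "x_bluecoat_appliance_name", "x_cache_info",
    "x_rs_streaming_content",
]


def parse_event(raw):
    """Split an event on whitespace, keeping double-quoted values whole."""
    tokens = shlex.split(raw)
    if len(tokens) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} tokens, got {len(tokens)}")
    return dict(zip(FIELDS, tokens))


# Abbreviated, hypothetical event in the same layout as the SHORT MSG sample.
sample_event = (
    '2015-06-05 12:14:47 44 8.8.8.8 200 TCP_HIT 1206 2324 GET http '
    'data.t.bleacherreport.com 80 /jsonp/line_scores.json ?callback=BRLineScore '
    'username - - 8.8.8.8 application/javascript http://bleacherreport.com/ '
    '"Mozilla/5.0 (compatible; MSIE 10.0)" OBSERVED "Sports/Recreation" - '
    '4.4.4.4 62169 - "Sports/Recreation" json http://data.t.bleacherreport.com/ '
    'SG-HTTP-Service 157.166.249.67 data.t.bleacherreport.com - '
    '"bc-appliance-hostname" - -'
)

parsed = parse_event(sample_event)
print(parsed["x_bluecoat_appliance_name"])
```

Because the appliance name sits after the potentially very long cs_uri_query field, its offset within the event grows with the length of the request.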


sample data

SHORT MSG:
2015-06-05 12:14:47 44 8.8.8.8 200 TCP_HIT 1206 2324 GET http data.t.bleacherreport.com 80 /jsonp/MLB_Reg/Baseball/2015/6/5/e19d5d29-13ce-4f34-b3c3-c13814afe6da/line_scores.json ?callback=BRLineScore_isOver_76524 username - - 8.8.8.8 application/javascript http://bleacherreport.com/articles/2484112-the-miami-heat-in-surprising-showdown-simply-cant-afford-... "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/7.0; NISSC)" OBSERVED "Sports/Recreation" - 4.4.4.4 62169 - "Sports/Recreation" json http://data.t.bleacherreport.com/jsonp/MLB_Reg/Baseball/2015/6/5/e19d5d29-13ce-4f34-b3c3-c13814afe6d... SG-HTTP-Service 157.166.249.67 data.t.bleacherreport.com - "bc-appliance-hostname" - -

LONG MSG:
2015-06-05 12:05:07 92 8.8.8.8 200 TCP_NC_MISS 715 2636 GET http webstats.americanbar.org 80 /b/ss/abajournalproduction/1/H.22.1/s24794675480276??AQB=1&ndh=1&t=5%2F5%2F2015%208%3A5%3A6%205%20240&ce=UTF-8&ns=americanbarassociation&g=http%3A%2F%2Fwww.abajournal.com%2Fnews%2Farticle%2Fdoes_blackhawks_jersey_ban_violate_the_first_amendment_maybe_law_prof_says%2F%3Futm_source%3Dinternal%26utm_medium%3Dnavigation%26utm_campaign%3Dmost_read&r=http%3A%2F%2Fwww.abajournal.com%2Fnews%2Farticle%2Fman_sued_for_sawing_neighbors_garage_in_half_isnt_liable_judge_rules%2F%3Futm_source%3Dinternal%26utm_medium%3Dnavigation%26utm_campaign%3Dmost_read&cc=USD&c1=http%3A%2F%2Fwww.abajournal.com%2Fnews%2Farticle%2Fdoes_blackhawks_jersey_ban_violate_the_first_amendment_maybe_law_prof_says%2F%3Futm_source%3Dinternal%26utm_medium%3Dnavigation%26utm_campaign%3Dmost_read&c2=http%3A%2F%2Fwww.abajournal.com%2Fnews%2Farticle%2Fman_sued_for_sawing_neighbors_garage_in_half_isnt_liable_judge_rules%2F%3Futm_source%3Dinternal%26utm_medium%3Dnavigation%26utm_campaign%3Dmost_read&c3=news&c4=article&c5=does_blackhawks_jersey_ban_violate_the_first_amendment_maybe_law_prof_says&c19=NOT%20SECURE&c20=NO%20404%20ERROR&c25=Not%20Logged%20In&c28=OTHER&c29=www.abajournal.com&c32=NON-MEMBER&c33=http%3A%2F%2Fwww.abajournal.com%2Fnews%2Farticle%2Fdoes_blackh... megodm - - webstats.americanbar.org image/gif http://www.abajournal.com/news/article/does_blackhawks_jersey_ban_violate_the_first_amendment_maybe_... "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; NISSC; rv:11.0) like Gecko" OBSERVED "Education;Government/Legal" - 8.8.8.8 59406 - "Education" - http://webstats.americanbar.org/b/ss/abajournalproduction/1/H.22.1/s24794675480276?AQB=1&ndh=1&t=5%2... SG-HTTP-Service 4.4.4.4 webstats.americanbar.org - "bc-appliance-hostname" - -

1 Solution

lzellmer_splunk
Splunk Employee

For events longer than 4096 characters, set the LOOKAHEAD parameter in the transforms.conf stanza so that the regex searches far enough into the event:

LOOKAHEAD = <integer>
* NOTE: This option is valid for all index time transforms, such as index-time field creation, or DEST_KEY modifications.
* Optional. Specifies how many characters to search into an event.
* Defaults to 4096. You may want to increase this value if you have event line lengths that exceed 4096 characters (before linebreaking).
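
The effect of the default can be sketched in Python, outside Splunk. The sketch assumes LOOKAHEAD can be modeled as running the regex against only the first N characters of the event. The helper names and the synthetic events are hypothetical, and the pattern is the regex from the question with the `(?<=$)` lookbehind written as a plain `$`.

```python
import re

# Host regex from the question, with the trailing lookbehind simplified to `$`.
HOST_RE = re.compile(r'"(\S+)"\s{1,3}\S+\s\S+$')

DEFAULT_LOOKAHEAD = 4096  # default stated in the spec text above


def extract_host(event, lookahead=DEFAULT_LOOKAHEAD):
    """Model of an index-time transform: search only the first `lookahead` chars."""
    window = event[:lookahead]
    match = HOST_RE.search(window)
    return match.group(1) if match else None


def make_event(query_len):
    """Build a synthetic proxy-style event whose query string has the given length."""
    query = "?q=" + "a" * query_len
    return (
        "2015-06-05 12:05:07 92 8.8.8.8 200 TCP_NC_MISS GET http "
        f'example.com 80 /path {query} - "bc-appliance-hostname" - -'
    )


short_event = make_event(50)
long_event = make_event(5000)

print(extract_host(short_event))        # hostname lies inside the window
print(extract_host(long_event))         # hostname lies beyond 4096 characters
print(extract_host(long_event, 10000))  # a larger window reaches the hostname
```

In this model the long event yields no match at the default window, because the quoted hostname at the end of the event is never seen by the regex. That is consistent with the regex passing in `| rex` and on regex101.com, where the whole event is searched.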

transforms.conf

[bluecoat_host]
LOOKAHEAD = 10000
REGEX = \"(\S+)\"\s{1,3}\S+\s\S+(?<=$)
DEST_KEY = MetaData:Host
FORMAT = host::$1
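
One way to choose a LOOKAHEAD value is to measure the longest event in a representative sample and add some margin. The following Python sketch illustrates the idea; `suggest_lookahead` and the 25% headroom are assumptions for illustration, not part of Splunk or of the original answer.

```python
import math


def suggest_lookahead(events, default=4096, headroom=1.25):
    """Suggest a LOOKAHEAD value from sample events (hypothetical helper).

    Returns the default when every sampled event already fits within it,
    otherwise the longest event length scaled by `headroom` and rounded up.
    """
    longest = max((len(event) for event in events), default=0)
    if longest <= default:
        return default
    return math.ceil(longest * headroom)


# Synthetic sample: one short event and one 8,000-character event.
sample_events = ["x" * 300, "x" * 8000]
print(suggest_lookahead(sample_events))
```

The value only needs to cover the longest event you expect, so a sample that misses the largest requests will understate it.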

richgalloway
SplunkTrust

If you solved your problem, please accept your answer.
