Hi folks,
I recently onboarded a new sourcetype configured with search-time extractions. The regex works when tested on sample data; however, at search time about 400 nonsense fields are extracted and the desired fields aren't extracted at all.
The config is deployed on the heavy forwarder and the search head cluster.
Any guidance would be much appreciated!
Thanks
[aam_wss]
DATETIME_CONFIG =
NO_BINARY_CHECK = true
category = Custom
disabled = false
KV_MODE = none
pulldown_type = true
TZ = UCT
EXTRACT-wss = " ^(?<x_bluecoat_request_tenant_id>[^\s]+) (?<date>\d+\-\d+\-\d+) (?<time>\d+:\d+:\d+) \"(?<x_bluecoat_appliance_name>[^\s]+)\" (?<time_taken>[^\s]+) (?<c_ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) (?<cs_userdn>[^\s]+) \"?(?<cs_auth_groups>[^\s\"]+)\"? (?<x_exception_id>[^\s]+) (?<sc_filter_result>[^\s]+) \"(?<cs_categories>.*?)\" (?<cs_Referer>[^\s]+) (?<sc_status>[^\s]+) (?<s_action>[^\s]+) (?<cs_method>[^\s]+) (?<rs_Content_Type>[^\s]+) (?<cs_uri_scheme>[^\s]+) (?<cs_host>[^\s]+) (?<cs_uri_port>[^\s]+) (?<cs_uri_path>[^\s]+) (?<cs_uri_query>[^\s]+) (?<cs_uri_extension>[^\s]+) \"?(?<cs_User_Agent>.*?)\"? (?<s_ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) (?<sc_bytes>[^\s]+) (?<cs_bytes>[^\s]+) (?<x_data_leak_detected>[^\s]+) (?<x_virus_id>[^\s]+) (?<x_bluecoat_location_id>[^\s]+) \"(?<x_bluecoat_location_name>.*?)\" (?<x_bluecoat_access_type>[^\s]+) \"(?<x_bluecoat_application_name>.*?)\" \"(?<x_bluecoat_application_operation>.*?)\" (?<r_ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) \"(?<r_supplier_country>.*?)\" (?<x_rs_certificate_validate_status>[^\s]+) (?<x_rs_certificate_observed_errors>[^\s]+) (?<x_cs_ocsp_error>[^\s]+) (?<x_rs_ocsp_error>[^\s]+) (?<ssl_version>[^\s]+) (?<negotiated_cipher>[^\s]+) (?<cipher_size>[^\s]+) (?<x_rs_certificate_hostname>[^\s]+) \"?(?<certificate_hostname_categories>.*?)\"? 
(?<x_cs_negotiated_ssl_version>[^\s]+) (?<x_cs_negotiated_cipher>[^\s]+) (?<x_cs_negotiated_cipher_size>[^\s]+) (?<x_cs_certificate_subject>[^\s]+) (?<cs_icap_status>[^\s]+) (?<cs_icap_error_details>[^\s]+) (?<rs_icap_status>[^\s]+) (?<rs_icap_error_details>[^\s]+) (?<s_supplier_ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) (?<s_supplier_country>[^\s]+) (?<s_supplier_failures>[^\s]+) \"(?<x_cs_client_ip_country>.*?)\" (?<cs_threat_risk>[^\s]+) (?<x_rs_certificate_threat_risk>[^\s]+) (?<x_client_agent_type>[^\s]+) (?<x_client_os>[^\s]+) (?<x_client_agent_sw>[^\s]+) (?<x_client_device_id>[^\s]+) (?<x_client_device_name>[^\s]+) (?<x_client_device_type>[^\s]+) (?<x_client_security_details>[^\s]+) (?<x_client_security_risk_score>[^\s]+) (?<x_bluecoat_reference_id>[^\s]+) (?<x_sc_connection_issuer_keyring>[^\s]+) (?<x_scissuer_keyring_alias>[^\s]+) (?<x_cloud_rs>[^\s]+) (?<x_bluecoat_placeholder>[^\s]+) (?<cs_X_Requested_With>[^\s]+) (?<x_bluecoat_transaction_uuid>[^\s]+)"
The garbage fields are due to automatic key-value extraction, so you need to set KV_MODE = none against your sourcetype on your Search Head. As for the broken field extractions, that is the Splunk life. You are just going to have to work through it. I like to use RegEx101.com. We could help more, but you did not post your broken events.
Thanks for the advice, Regex is tested and functional, KV mode is also set to none. Bit of a weird one I've not come up against before. Raising a support case with Splunk to see if I can get a resolution.
What did they say/find?
He already has KV_MODE = none, and in the comments below my answer he also shared a sample event, which seems to match the regex (after removing the quotes surrounding the regex, which he says he had already tried). He mentions he even used btool to confirm the config is correct.
So it is a bit of a mystery, unless he is actually using the wrong sourcetype or something similar.
Hi @milesmedboe ,
I have tested the following setting for props.conf and it works:
EXTRACT-wss = ^(?<x_bluecoat_request_tenant_id>[^\s]+) (?<date>\d+\-\d+\-\d+) (?<time>\d+:\d+:\d+) "(?<x_bluecoat_appliance_name>[^\s]+)" (?<time_taken>[^\s]+) (?<c_ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) (?<cs_userdn>[^\s]+) "?(?<cs_auth_groups>[^\s"]+)"? (?<x_exception_id>[^\s]+) (?<sc_filter_result>[^\s]+) "(?<cs_categories>.*?)" (?<cs_Referer>[^\s]+) (?<sc_status>[^\s]+) (?<s_action>[^\s]+) (?<cs_method>[^\s]+) (?<rs_Content_Type>[^\s]+) (?<cs_uri_scheme>[^\s]+) (?<cs_host>[^\s]+) (?<cs_uri_port>[^\s]+) (?<cs_uri_path>[^\s]+) (?<cs_uri_query>[^\s]+) (?<cs_uri_extension>[^\s]+) "?(?<cs_User_Agent>.*?)"? (?<s_ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) (?<sc_bytes>[^\s]+) (?<cs_bytes>[^\s]+) (?<x_data_leak_detected>[^\s]+) (?<x_virus_id>[^\s]+) (?<x_bluecoat_location_id>[^\s]+) "(?<x_bluecoat_location_name>.*?)" (?<x_bluecoat_access_type>[^\s]+) "(?<x_bluecoat_application_name>.*?)" "(?<x_bluecoat_application_operation>.*?)" (?<r_ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) "(?<r_supplier_country>.*?)" (?<x_rs_certificate_validate_status>[^\s]+) (?<x_rs_certificate_observed_errors>[^\s]+) (?<x_cs_ocsp_error>[^\s]+) (?<x_rs_ocsp_error>[^\s]+) (?<ssl_version>[^\s]+) (?<negotiated_cipher>[^\s]+) (?<cipher_size>[^\s]+) (?<x_rs_certificate_hostname>[^\s]+) "?(?<certificate_hostname_categories>.*?)"? 
(?<x_cs_negotiated_ssl_version>[^\s]+) (?<x_cs_negotiated_cipher>[^\s]+) (?<x_cs_negotiated_cipher_size>[^\s]+) (?<x_cs_certificate_subject>[^\s]+) (?<cs_icap_status>[^\s]+) (?<cs_icap_error_details>[^\s]+) (?<rs_icap_status>[^\s]+) (?<rs_icap_error_details>[^\s]+) (?<s_supplier_ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) (?<s_supplier_country>[^\s]+) (?<s_supplier_failures>[^\s]+) "(?<x_cs_client_ip_country>.*?)" (?<cs_threat_risk>[^\s]+) (?<x_rs_certificate_threat_risk>[^\s]+) (?<x_client_agent_type>[^\s]+) (?<x_client_os>[^\s]+) (?<x_client_agent_sw>[^\s]+) (?<x_client_device_id>[^\s]+) (?<x_client_device_name>[^\s]+) (?<x_client_device_type>[^\s]+) (?<x_client_security_details>[^\s]+) (?<x_client_security_risk_score>[^\s]+) (?<x_bluecoat_reference_id>[^\s]+) (?<x_sc_connection_issuer_keyring>[^\s]+) (?<x_scissuer_keyring_alias>[^\s]+) (?<x_cloud_rs>[^\s]+) (?<x_bluecoat_placeholder>[^\s]+) (?<cs_X_Requested_With>[^\s]+) (?<x_bluecoat_transaction_uuid>[^\s]+)
If that doesn't work, I would look at your props.conf with btool to see if something is taking precedence over your setting.
Try removing the " around the regex; I guess that was copy-pasted from the search bar (where you do need them)? Also, there's no need to use \" inside the regex; a plain " should do.
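To illustrate why the surrounding quotes matter, here is a minimal Python sketch. It assumes the EXTRACT value reaches the regex engine verbatim, quotes and leading space included (an assumption, not something confirmed in this thread). The pattern is a shortened four-field version of the one in the question, the event is just the start of a log line, and the group names tenant_id and appliance are shortened for illustration. Python's re module needs (?P<name>...) where Splunk's PCRE accepts (?<name>...), so the groups are rewritten accordingly.

```python
import re

# Shortened, illustrative event: only the first few columns of a log line.
event = '26111 1007-03-27 15:00:41 "BV1-ZC0_VvsbkBI" 20'

# Pattern as pasted from the search bar: wrapped in quotes, with a leading
# space and escaped inner quotes. The literal '" ' must be consumed before
# '^' is tested, so the start-of-string anchor can never succeed.
quoted = r'" ^(?P<tenant_id>[^\s]+) (?P<date>\d+\-\d+\-\d+) (?P<time>\d+:\d+:\d+) \"(?P<appliance>[^\s]+)\""'

# Same pattern with the outer quotes, leading space, and backslashes removed.
bare = r'^(?P<tenant_id>[^\s]+) (?P<date>\d+\-\d+\-\d+) (?P<time>\d+:\d+:\d+) "(?P<appliance>[^\s]+)"'

quoted_match = re.search(quoted, event)
bare_match = re.search(bare, event)

print("quoted pattern matched:", quoted_match is not None)
print("bare pattern fields:", bare_match.groupdict())
```

If the quoted form never matches, none of the named fields would appear, which is consistent with the desired fields being missing; it would not by itself explain the extra auto-extracted fields.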
Thanks for the advice. I had attempted this in the first instance; when it did not work, I thought the regex might need to be formatted the same way as in Splunk search. I have reverted as per your suggestions, to no avail.
KV_MODE is set to none, yet Splunk is attempting to automatically extract hundreds of fields. Have used btool to ensure the correct config is in memory, bit stumped!
Thanks again!
Any chance you can share some screenshots of what the data looks like and the kind of fields that get extracted?
Unfortunately, I don't yet have the Karma required to upload anything.
This is a scrubbed example from the logs -
26111 1007-03-27 15:00:41 "BV1-ZC0_VvsbkBI" 20 125.20.105.50 EVERETTE\Naida%00Ldbrljloh "EVERETTE\ROLE-U-ILA-QujqvtyGucatk" - OBSERVED "Business/Economy;Web Ads/Annamaria" https://app.jackqueline.com/player?course=call-monitoring-measure-quality&author=shaunte-miller&name... 200 TCP_BY_MISS GET text/plain https tim-ei00-g0.czmrorwaya01.com 131 /ping ?michAela=00523&bitrate=-1&throughput=-1&playhead=261.3046330&hldxyqaPsczrp=0&playrate=1&timemark=1001312020210&system=anlbjtthfrjtbfk&guillerMina=U_20000036_renf5gzemojr05fo_1530010403312&joaqUina=02&code=U_20000036_renf5gzemojr05fo_1530010403312 - "Mozilla/5.0 (Windows NT 6.1; DOZ04; Kennith/7.0; fm:01.0) like Gecko" 042.047.1.2 051 605 no - 310211 "Dannielle Jonelle Data Iraida (IDA)" explicit_proxy "-" "-" 00.200.105.023 "Charlesetta" RONI_VALID none - - CVRq0.2 VELMA-LEA-WEZ145-JJG202 255 *.czmrorwaya01.com "Business/Economy" CVRq0.2 VELMA-LEA-WEZ145-JJG202 255 - LENA_NOT_SCANNED - LENA_NO_MODIFICATION - 00.200.105.023 - - "United Kingdom" 3 2 sep-windows Windows%207%00Tbvtpgqngo 14.2.1023.0100 020NPG02I10P0S5E002I101B4G00B002 OX2-P-GSU1004 FW - - - - - - - - i0erfy049100v30m-0000000022uqo0o1-000000001p012d53
The selected fields area on the left-hand side displays the following:
Selected Fields
action 9
app 3
Architecture 1
atyp 2
charset 22
color 1
component 2
ct 7
culture 4
date_month 1
date_wday 1
domain 4
ei 27
eventtype 1
factoryName 1
fname 6
hash 3
hl 10
host 1
id 36
idclient 2
index 1
ip 2
lng 2
loc 2
location 2
mode 1
name 11
p 18
product 1
ptag 1
punct 100+
q 69
re 4
resourceGroupName 1
SID 11
source 1
sourcetype 1
splunk_server 1
src_is_expected 1
src_pci_domain 1
src_requires_av 1
src_should_timesync 1
src_should_update 1
status 1
subscriptionId 1
sysparm_auto_request 1
t 55
tag 1
tag::eventtype 1
TYPE 1
type 8
uid 7
url 22
v 80
ved 7
Version 2
vtag 1
Thanks again for your assistance!
You can upload screenshots elsewhere (e.g. imgur) and share the links here 🙂
But looks like auto kv is not disabled for starters.
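For anyone wondering where fields like these come from, here is a rough Python sketch of what a key=value scan does to a URL query string. It is only an approximation for illustration, not Splunk's actual auto-KV algorithm; the pattern and the names KV_PATTERN and auto_fields are my own, and the input is a fragment of the scrubbed sample event posted above.

```python
import re

# Fragment of the URI path and query from the scrubbed sample event.
fragment = "/ping ?michAela=00523&bitrate=-1&throughput=-1&playhead=261.3046330"

# Naive key=value scan: a word-like key, '=', then a value that runs until
# '&', whitespace, or a double quote. Illustrative only.
KV_PATTERN = re.compile(r'([A-Za-z_]\w*)=([^&\s"]+)')

# Each key found this way would show up as its own field.
auto_fields = dict(KV_PATTERN.findall(fragment))

for name, value in auto_fields.items():
    print(f"{name} = {value}")
```

Every distinct query parameter across the proxy logs would turn into a separate field under a scan like this, which would fit with hundreds of short, unrelated field names appearing.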
Not really possible in this corporate environment, sorry 😞
I agree, it definitely looks like auto kv is being applied. Btool however only shows "KV_MODE = none" for this sourcetype.
Can you think of anywhere else this could be getting overridden?
Thanks again
And the events actually have the correct sourcetype assigned (and only 1)?
It is indeed getting the correct sourcetype. The extractions work well (on over 99% of events, anyway) when tested as part of a search.
Have raised a support case with Splunk, will update here if I get a resolution.
Thanks for your help!
What does the data look like? Did you try setting KV_MODE = none? Did you run | extract reload=T after setting that regex on the SH?
Skalli
Thanks Skalli, already had KV_MODE = none, not sure why Splunk is still attempting to extract fields itself.
| extract reload=T didn't help either, wasn't aware of this command though so thanks for bringing it to my attention!