Background
I have a very legacy application with bad/inconsistent log formatting, and I want to collect its logs in Splunk via a Universal Forwarder. The problem is multi-line events: the application dumps XML documents, which contain their own timestamps, into log messages.
Issue
Because these multi-line messages contain a timestamp inside the XML body, and that body becomes part of the log message, Splunk is indexing events with "impossible" timestamps. For example, a log event actually written in 2024 outputs an XML body containing an <example></example> element holding a 2019 timestamp; Splunk picks up that embedded timestamp, and part of the body gets stored as a Splunk event from five years ago.
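To make the failure mode concrete, here is a hypothetical log excerpt (the format is invented for illustration, not taken from the real application):

    2024-06-18 10:15:02,345 INFO RequestHandler - received payload:
    <request>
      <example>2019-03-01T08:00:00Z</example>
    </request>

With Splunk's default line merging (BREAK_ONLY_BEFORE_DATE), the line carrying the 2019 date can be treated as the start of a new event, and that event then gets timestamped as 2019.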
Constraints
I can't make changes on the indexers or on a Heavy Forwarder, and the log format itself can't be fixed at the source until a separate sanitization effort gets going.
Ideas?
My only idea so far is a custom sourcetype that specifies the log timestamp format exactly, with the timestamp regex anchored to the start of the line, and that reduces MAX_TIMESTAMP_LOOKAHEAD so Splunk stops looking past the first match. I believe this would attribute every line in an event correctly, because the XML document always starts with either whitespace or a < character, never a timestamp. However, my understanding is that these settings only take effect on an indexer or a Heavy Forwarder, neither of which I can change.
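For concreteness, a minimal props.conf sketch of that idea (the sourcetype name "legacy_app" and the %Y-%m-%d %H:%M:%S,%3N format are assumptions; substitute your real log format):

    [legacy_app]
    # Timestamp must appear at the very start of the event
    TIME_PREFIX = ^
    TIME_FORMAT = %Y-%m-%d %H:%M:%S,%3N
    # Only scan the first 23 characters for a timestamp, so dates
    # buried in the XML body are never considered
    MAX_TIMESTAMP_LOOKAHEAD = 23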
I'm looking for any alternatives this community can offer as a potential workaround until the log sanitization effort gets off the ground.
This data is not being onboarded properly. That may be your fault or someone else's, but you need to work with the owner of the HF to install a better set of props.conf settings so the data is onboarded correctly.
Focus on the Great Eight settings, with particular attention to LINE_BREAKER, TIME_PREFIX, and TIME_FORMAT.
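The "Great Eight" are the props.conf settings commonly recommended for every sourcetype in Splunk's data-onboarding conf talks. A hedged sketch for this case, assuming an ISO-style timestamp at the start of each real event ("legacy_app" and the regex are placeholders):

    [legacy_app]
    SHOULD_LINEMERGE = false
    # Break events only where the next line starts with a real log
    # timestamp, so dates inside the XML body never start a new event
    LINE_BREAKER = ([\r\n]+)(?=\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})
    TRUNCATE = 100000
    TIME_PREFIX = ^
    TIME_FORMAT = %Y-%m-%d %H:%M:%S,%3N
    MAX_TIMESTAMP_LOOKAHEAD = 23
    # The EVENT_BREAKER pair lets the UF split the stream correctly
    # when load balancing across indexers
    EVENT_BREAKER_ENABLE = true
    EVENT_BREAKER = ([\r\n]+)(?=\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})

The LINE_BREAKER and TIME_* settings take effect on the HF or indexer; the EVENT_BREAKER pair belongs on the UF.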
If the HF owner pushes back, remind them that Splunk suffers when data is not onboarded well. Additionally, the company may suffer if data cannot be searched reliably because the timestamps are wrong.
The UF does not do parsing, except for indexed extractions or when you set force_local_processing = true. So unless you turn your UF into a kind of poor man's HF, parsing and time-extraction settings will not take effect on the UF.
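If you do go the poor-man's-HF route, the switch lives in props.conf on the forwarder itself. A hedged sketch reusing the placeholder stanza from above (be aware this raises CPU and memory use on the forwarder host):

    # props.conf on the Universal Forwarder
    [legacy_app]
    # Force the UF to run the full parsing pipeline locally
    force_local_processing = true
    SHOULD_LINEMERGE = false
    LINE_BREAKER = ([\r\n]+)(?=\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})
    TIME_PREFIX = ^
    TIME_FORMAT = %Y-%m-%d %H:%M:%S,%3N
    MAX_TIMESTAMP_LOOKAHEAD = 23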
If you have access to a HEC endpoint, though, you could consider another method: a third-party shipper like Filebeat, or even your own Python script to pre-parse those events a bit and send them via HTTP.
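A minimal sketch of the Python route (the endpoint URL, token, sourcetype, and timestamp regex are all assumptions to adapt). It merges continuation lines into the preceding event, extracts the leading timestamp, and sends each event to HEC with an explicit time field so Splunk never has to guess:

    import json
    import re
    import sys
    from datetime import datetime

    import requests  # pip install requests

    # Placeholders for illustration; replace with your real values.
    HEC_URL = "https://splunk.example.com:8088/services/collector/event"
    HEC_TOKEN = "00000000-0000-0000-0000-000000000000"
    SOURCETYPE = "legacy_app"

    # Assumed format: events start with "YYYY-MM-DD HH:MM:SS,mmm" at column 0.
    TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3})")

    def events(lines):
        """Group raw lines into events: a new event starts only at a line
        that begins with a real log timestamp, so timestamps buried in
        the XML body never split an event."""
        buf = []
        for line in lines:
            if TS_RE.match(line) and buf:
                yield "".join(buf)
                buf = []
            buf.append(line)
        if buf:
            yield "".join(buf)

    def epoch(event):
        """Epoch seconds from the leading timestamp, or None if the event
        lacks one (naive local time; adjust timezone handling as needed)."""
        m = TS_RE.match(event)
        if not m:
            return None
        dt = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
        return dt.timestamp() + int(m.group(2)) / 1000.0

    def send(event):
        payload = {"sourcetype": SOURCETYPE, "event": event.rstrip("\n")}
        ts = epoch(event)
        if ts is not None:
            payload["time"] = ts  # explicit time: Splunk does no guessing
        resp = requests.post(
            HEC_URL,
            headers={"Authorization": f"Splunk {HEC_TOKEN}"},
            json=payload,
            timeout=10,
        )
        resp.raise_for_status()

    if __name__ == "__main__":
        with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
            for ev in events(fh):
                send(ev)

This is a one-shot batch sketch; a production version would tail the file, batch multiple events per POST, and handle retries.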