Hello Splunk Community,
My team is currently processing logs from a single source that can contain events with different timestamp formats. We are debating the best configuration approach and would like input from the community.
We are currently using a TRANSFORMS-split rule in our props.conf file to differentiate the source types based on some criteria, and then applying a single TIME_FORMAT within each resulting source type stanza. This involves creating several dedicated source types for essentially the same data stream.
props.conf:

[xxx:tomcat9:catalina1]
SHOULD_LINEMERGE = false
LINE_BREAKER = ([\r\n]+)
TIME_FORMAT = %d-%b-%Y %H:%M:%S
TIME_PREFIX = ^
MAX_TIMESTAMP_LOOKAHEAD = 20
TRANSFORMS-split = tomcat9stdout1, tomcat9stdout2

transforms.conf:

[tomcat9stdout1]
REGEX = \d+-[a-zA-Z]+-\d+\s\d+:\d+:\d+
DEST_KEY = MetaData:Sourcetype
FORMAT = sourcetype::xxx:tomcat9:stdout1

[tomcat9stdout2]
REGEX = [0-9]{4}-\d+-\d+\s\d+:\d+:\d+
... (etc. for the other formats)

The alternative approach suggested is to use a single source type and configure a datetime.xml file. This XML file would contain multiple regular expressions, allowing Splunk to iterate through them and automatically identify the correct timestamp format for each individual event.
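For context, a custom datetime.xml defines date/time component regexes and then lists them as patterns for Splunk to try in order. A minimal sketch of what ours might look like (the pattern names here are hypothetical, and the regexes would need to match your actual formats; props.conf then points at the file via DATETIME_CONFIG):

```xml
<!-- e.g. $SPLUNK_HOME/etc/apps/my_app/local/datetime_tomcat.xml (path is an assumption) -->
<datetime>
  <!-- catalina-style date: 25-Dec-2023 -->
  <define name="catalina_date" extract="day, litmonth, year4">
    <text><![CDATA[(\d{1,2})-(\w{3})-(\d{4})]]></text>
  </define>
  <!-- ISO-style date: 2023-12-25 -->
  <define name="iso_date" extract="year4, month, day">
    <text><![CDATA[(\d{4})-(\d{1,2})-(\d{1,2})]]></text>
  </define>
  <!-- shared time portion: 13:45:59 -->
  <define name="std_time" extract="hour, minute, second">
    <text><![CDATA[(\d{1,2}):(\d{2}):(\d{2})]]></text>
  </define>

  <datePatterns>
    <use name="catalina_date"/>
    <use name="iso_date"/>
  </datePatterns>
  <timePatterns>
    <use name="std_time"/>
  </timePatterns>
</datetime>
```

The sourcetype stanza would then reference it with something like DATETIME_CONFIG = /etc/apps/my_app/local/datetime_tomcat.xml (relative to SPLUNK_HOME). This is only an illustration of the file's shape, not a tested config.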
Which approach is considered the industry best practice for handling this specific scenario? Is the datetime.xml method generally more robust and maintainable than splitting source types via transforms?
Thanks for your guidance!
Hi @PickleRick, @bowesmana, thank you for your insightful replies!
Looking at the official props.conf documentation, it seems clear that datetime.xml is the best practice for these situations - props.conf. What do you think?
I also wonder whether the official Tomcat TA covers this, but looking briefly at the TA's props, I don't see any date extraction, which is really strange.
To be fully honest, I've never seen datetime.xml fiddled with. It's a relatively narrow edge case, and I'm not sure there's much documentation on it either.
Neither will be intuitive (I disagree here a bit with @bowesmana). By definition, all events from a given sourcetype should share a common format, so if you have different time formats it would be natural for me to split them into separate sourcetypes. But the split itself is a bit tricky - it won't work the way you're describing, because timestamp recognition happens at the very beginning of the ingestion pipeline, and even if you recast your sourcetype it won't happen again. Even if you CLONE_SOURCETYPE, your duplicated event will be reinjected into the queue after the timestamp recognition phase.
With syslog-provided events it's usually relatively easy because you can split your event stream into multiple sourcetypes before it hits Splunk. With files... it's gonna be tricky. Probably @bowesmana's approach - dynamically overwriting the already extracted timestamp (or the assigned one, because extraction might not happen properly for misformatted timestamps) - might be the way to go. But it's worth documenting extensively because it's not intuitive.
BTW, catalina.out is a mess
😂 Yes, totally right that neither is intuitive - it should really be fixed on the Catalina side. I know we had the same issue with a custom Tomcat app that had multiple date formats, and we pushed back to get them fixed, which got traction.
Yes. It's one of the common issues with tomcat (and java in general) logs. But be prepared for more "fun".
1. Java apps often produce multiline stack dumps. And if by any chance you're forwarding those logs via syslog to a remote machine you'll end up with a single "logical event" split into several separate syslog entries. That's horrible to deal with.
2. Developers tend to happily write to logs... just about anything. And in any format (or lack thereof) they can think of. I don't know why but java seems to be one of the cases where the devs are most prolific in coming up with several different ways of formatting events from the same application.
3. It might or might not be an issue, but rotating logs with log4j is (or at least used to be) painful. It's usually not directly an ingestion issue but it might cause problems if you want to keep the log dir tidy - you can't just use logrotate and send HUP to the app.
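On point 1, the usual mitigation on the Splunk side is to break events only on lines that start with a timestamp, so a stack trace stays attached to the event that produced it. A sketch, assuming the %d-%b-%Y catalina format shown earlier (the lookahead regex would need adjusting for your real formats):

```ini
# props.conf - illustrative sketch, not a tested config
[xxx:tomcat9:catalina1]
SHOULD_LINEMERGE = false
# Break only before a line that begins with a "25-Dec-2023 13:45:59"-style timestamp;
# the capture group is the discarded line terminator, the lookahead keeps the timestamp.
LINE_BREAKER = ([\r\n]+)(?=\d{1,2}-\w{3}-\d{4}\s\d{1,2}:\d{2}:\d{2})
# Raise TRUNCATE if stack dumps can produce very long merged events
TRUNCATE = 100000
```

This only helps if the whole multiline event reaches Splunk intact - as noted above, syslog forwarding that splits the stack dump into separate entries can't be repaired this way.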
I'm not sure about best practice, but splitting the same stream into multiple sourcetypes just to handle different data formats seems non-intuitive.
What about using INGEST_EVAL with a chain of strptime() calls to extract _time, e.g.
INGEST_EVAL = _time=coalesce(strptime(_raw, "%FT%T"), strptime(_raw, "%d-%b-%Y %H:%M:%S"), strptime(_raw, ...))
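To spell out the wiring (a sketch only - the stanza name tomcat_multi_time is made up, and the substr() bound is an assumption to keep strptime looking at the start of the event):

```ini
# transforms.conf
[tomcat_multi_time]
# Try each known format in turn; fall back to the _time Splunk already assigned
INGEST_EVAL = _time=coalesce(strptime(substr(_raw, 1, 25), "%FT%T"), strptime(substr(_raw, 1, 25), "%d-%b-%Y %H:%M:%S"), _time)

# props.conf
[xxx:tomcat9:catalina1]
TRANSFORMS-settime = tomcat_multi_time
```

The nice property of coalesce() here is that each strptime() returns null when its format doesn't match, so the first matching format wins and unparseable events keep whatever _time was assigned upstream rather than being dropped to index time.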