I forward data to Splunk in JSON Lines format with the event timestamp as the first field of each line:
{"time":"2018-11-02T23:59:30.123456Z","type":"xyz", ...
Here is the corresponding props.conf stanza:
[myapp]
SHOULD_LINEMERGE = false
KV_MODE = json
TIME_PREFIX = {\"time\":\"
# Time stamp:
# - ISO 8601 extended format
# - Seconds to a maximum precision of 6 decimal places
# - With zone designator
TIME_FORMAT = %Y-%m-%dT%H:%M:%S.%6N%:z
This works.
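As a sanity check, the prefix-then-parse behaviour can be emulated outside Splunk. This is a rough Python approximation of the stanza above (my own sketch, not Splunk's actual code path); Splunk's %6N and %:z have no direct Python equivalents, so %f and %z stand in for them:

```python
import re
from datetime import datetime

# The TIME_PREFIX regex from the stanza, with the props.conf escaping removed.
TIME_PREFIX = re.compile(r'\{"time":"')
line = '{"time":"2018-11-02T23:59:30.123456Z","type":"xyz"}'

m = TIME_PREFIX.search(line)          # scan the line for the prefix
assert m is not None
raw = line[m.end():m.end() + 27]      # 27 chars: the full timestamp
ts = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S.%f%z")
print(ts.isoformat())                 # 2018-11-02T23:59:30.123456+00:00
```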
Recently, a colleague who is designing the JSON Lines output for a new project, where the data will also be forwarded to Splunk, queried two aspects of what I have just described:

- Placing "time" as the first field of each line. In my colleague's design, "time" would be towards the end of each line.
- The leading {\" at the start of the TIME_PREFIX value.

These queries from my colleague prompted me to revisit the corresponding settings in Splunk and to ask the following questions here...
(Note: I am not talking about the order in which JSON parsers process properties. I don't believe that issue is relevant here, in the context of Splunk timestamp recognition.)
Is there any performance benefit to placing the timestamp at the start of each line? (As opposed to placing "time" later in each line.) I thought the answer was "yes", but, after carefully re-reading the related Splunk docs, I'm no longer sure.
I had previously thought that Splunk scanned each line from left to right for the first match for the TIME_PREFIX regex. However, based on what my colleague tells me about regex processing in environments outside of Splunk, I suspect I've been naive about that strict "left to right" assumption. Which leads to my next question...
Is there any performance benefit to anchoring the TIME_PREFIX regex to the start of the line with ^? Like this:
TIME_PREFIX = ^{\"time\":\"
I'm asking because I previously thought, perhaps naively, that this anchor would be redundant, because Splunk searched the input line from left to right anyway.
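For what it's worth, the anchoring question can be illustrated with Python's re module (my own sketch; Splunk's regex engine may behave differently). When the prefix sits at column 0, anchored and unanchored patterns match identically; they differ only on lines with no prefix, where an unanchored search must try every position before giving up:

```python
import re

unanchored = re.compile(r'\{"time":"')
anchored = re.compile(r'^\{"time":"')

line = '{"time":"2018-11-02T23:59:30.123456Z","type":"xyz"}'
assert unanchored.search(line).start() == 0   # same match either way
assert anchored.search(line).start() == 0

garbled = "x" * 10_000                        # long line, no timestamp prefix
assert unanchored.search(garbled) is None     # must scan all 10,000 positions
assert anchored.search(garbled) is None       # can stop after column 0
```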
Does specifying MAX_TIMESTAMP_LOOKAHEAD offer any performance benefits? If so, how, exactly? (To "abort" reading malformed/garbled input lines sooner rather than later?) The default value of 128 exceeds the longest possible timestamp value; I could reduce it to match that longest possible value.
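My mental model of the lookahead, sketched in Python (an assumption about the behaviour, not the Splunk source): after the TIME_PREFIX match, only the next MAX_TIMESTAMP_LOOKAHEAD characters are handed to the timestamp parser, which bounds the work done per line:

```python
import re
from typing import Optional

TIME_PREFIX = re.compile(r'\{"time":"')
MAX_TIMESTAMP_LOOKAHEAD = 27   # length of 2018-11-02T23:59:30.123456Z

def timestamp_window(line: str) -> Optional[str]:
    """Return the bounded slice the timestamp parser would see, or None."""
    m = TIME_PREFIX.search(line)
    if m is None:
        return None
    return line[m.end():m.end() + MAX_TIMESTAMP_LOOKAHEAD]

good = '{"time":"2018-11-02T23:59:30.123456Z","type":"xyz"}'
print(timestamp_window(good))   # 2018-11-02T23:59:30.123456Z
```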
Do you have any other advice regarding props.conf settings for timestamp recognition in this case? This is really just a "catch-all" in case I missed any issues in my earlier, more specific questions.
For example, given that, in this case, the timestamp is only a few characters into the line, would it be more performant to not specify TIME_PREFIX, and instead let Splunk scan through those first few characters without any TIME_PREFIX-related regex processing? (And also specify MAX_TIMESTAMP_LOOKAHEAD.)
Last question first: one should always specify TIME_PREFIX and TIME_FORMAT. This keeps Splunk from guessing about your data and is slightly more performant.
Use of the ^ character does not improve performance, AFAIK. I tend to use it only if the timestamp is the first character of a line. I would suggest removing { from your TIME_PREFIX setting, just in case the timestamp is not the first field.
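As a concrete sketch of that suggestion (my own rewrite of the stanza from the question, untested against your data), the revised TIME_PREFIX would drop the brace:

```
[myapp]
SHOULD_LINEMERGE = false
KV_MODE = json
TIME_PREFIX = \"time\":\"
TIME_FORMAT = %Y-%m-%dT%H:%M:%S.%6N%:z
```

This still matches when "time" is the first field, and would continue to match if "time" later moved elsewhere in the line.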
I have no information to prove putting the timestamp at the beginning of a line performs better, just a hunch that it does.
I appreciate your answer (thanks!), but I will admit I was hoping for more than a hunch. I have that same hunch.
I'm hoping that the Splunk devs will step in and answer. They have "inside information"—they know the code path—whereas I must rely on information I can gather externally, performing tests and measuring the results. I don't really have the time to do that properly, but it's looking like I'll need to make time if I want a fact-based answer.
I've submitted feedback on the Splunk docs topic "Tune timestamp recognition for better indexing performance", which you'd think might answer these questions, but doesn't. From that topic:
To speed up indexing, you can use props.conf to adjust how far ahead into events the Splunk timestamp processor looks
The topic goes on to mention MAX_TIMESTAMP_LOOKAHEAD, but not TIME_PREFIX.
Perhaps my feedback on that topic might prompt the Splunk devs or writers to address this question.
That's a good approach. The docs team is great about chasing down answers to questions raised by the docs.