I forward data to Splunk in JSON Lines format with the event timestamp as the first field of each line:
{"time":"2018-11-02T23:59:30.123456Z","type":"xyz", ...
Here is the corresponding props.conf stanza:
[myapp]
SHOULD_LINEMERGE = false
KV_MODE = json
TIME_PREFIX = {\"time\":\"
# Time stamp:
# - ISO 8601 extended format
# - Seconds to a maximum precision of 6 decimal places
# - With zone designator
TIME_FORMAT = %Y-%m-%dT%H:%M:%S.%6N%:z
This works.
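As a sanity check, the prefix-then-parse behaviour can be emulated outside Splunk. This is a rough Python approximation of the stanza above (my own sketch, not Splunk's actual code path); Splunk's %6N and %:z have no direct Python equivalents, so %f and %z stand in for them:

```python
import re
from datetime import datetime

# The TIME_PREFIX regex from the stanza, with the props.conf escaping removed.
TIME_PREFIX = re.compile(r'\{"time":"')
line = '{"time":"2018-11-02T23:59:30.123456Z","type":"xyz"}'

m = TIME_PREFIX.search(line)          # scan the line for the prefix
assert m is not None
raw = line[m.end():m.end() + 27]      # 27 chars: the full timestamp
ts = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S.%f%z")
print(ts.isoformat())                 # 2018-11-02T23:59:30.123456+00:00
```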
Recently, a colleague who is designing the JSON Lines output for a new project, where the data will also be forwarded to Splunk, queried two aspects of what I have just described:

- Placing "time" as the first field of each line. In my colleague's design, "time" would be towards the end of each line.
- The leading {\" at the start of the TIME_PREFIX value.

These queries from my colleague prompted me to revisit the corresponding settings in Splunk and to ask the following questions here...
(Note: I am not talking about the order in which JSON parsers process properties. I don't believe that issue is relevant here, in the context of Splunk timestamp recognition.)
Is there any performance benefit to placing the timestamp at the start of each line? (As opposed to placing "time" later in each line.) I thought the answer was "yes", but, after carefully re-reading the related Splunk docs, I'm no longer sure.
I had previously thought that Splunk scanned each line from left to right for the first match for the TIME_PREFIX regex. However, based on what my colleague tells me about regex processing in environments outside of Splunk, I suspect I've been naive about that strict "left to right" assumption. Which leads to my next question...
Is there any performance benefit to anchoring the TIME_PREFIX regex to the start of the line with ^? Like this:
TIME_PREFIX = ^{\"time\":\"
I'm asking because I previously thought, perhaps naively, that this anchor would be redundant, because Splunk searched the input line from left to right anyway.
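For what it's worth, the anchoring question can be illustrated with Python's re module (my own sketch; Splunk's regex engine may behave differently). When the prefix sits at column 0, anchored and unanchored patterns match identically; they differ only on lines with no prefix, where an unanchored search must try every position before giving up:

```python
import re

unanchored = re.compile(r'\{"time":"')
anchored = re.compile(r'^\{"time":"')

line = '{"time":"2018-11-02T23:59:30.123456Z","type":"xyz"}'
assert unanchored.search(line).start() == 0   # same match either way
assert anchored.search(line).start() == 0

garbled = "x" * 10_000                        # long line, no timestamp prefix
assert unanchored.search(garbled) is None     # must scan all 10,000 positions
assert anchored.search(garbled) is None       # can stop after column 0
```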
Does specifying MAX_TIMESTAMP_LOOKAHEAD offer any performance benefits? If so, how, exactly? (To "abort" reading malformed/garbled input lines sooner rather than later?) The default value of 128 exceeds the longest possible timestamp value; I could reduce it to match that longest possible value.
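My mental model of the lookahead, sketched in Python (an assumption about the behaviour, not the Splunk source): after the TIME_PREFIX match, only the next MAX_TIMESTAMP_LOOKAHEAD characters are handed to the timestamp parser, which bounds the work done per line:

```python
import re
from typing import Optional

TIME_PREFIX = re.compile(r'\{"time":"')
MAX_TIMESTAMP_LOOKAHEAD = 27   # length of 2018-11-02T23:59:30.123456Z

def timestamp_window(line: str) -> Optional[str]:
    """Return the bounded slice the timestamp parser would see, or None."""
    m = TIME_PREFIX.search(line)
    if m is None:
        return None
    return line[m.end():m.end() + MAX_TIMESTAMP_LOOKAHEAD]

good = '{"time":"2018-11-02T23:59:30.123456Z","type":"xyz"}'
print(timestamp_window(good))   # 2018-11-02T23:59:30.123456Z
```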
Do you have any other advice regarding props.conf settings for timestamp recognition in this case? This is really just a "catch-all" in case I missed any issues in my earlier, more specific questions.
For example, given that, in this case, the timestamp is only a few characters into the line, would it be more performant to not specify TIME_PREFIX, and instead let Splunk scan through those first few characters without any TIME_PREFIX-related regex processing? (And also specify MAX_TIMESTAMP_LOOKAHEAD.)
Last question first: one should always specify TIME_PREFIX and TIME_FORMAT. This keeps Splunk from guessing about your data and is slightly more performant.
Use of the ^ character does not improve performance, AFAIK. I tend to use it only if the timestamp is the first character of a line. I would suggest removing { from your TIME_PREFIX setting, just in case the timestamp is not the first field.
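As a concrete sketch of that suggestion (my own rewrite of the stanza from the question, untested against your data), the revised TIME_PREFIX would drop the brace:

```
[myapp]
SHOULD_LINEMERGE = false
KV_MODE = json
TIME_PREFIX = \"time\":\"
TIME_FORMAT = %Y-%m-%dT%H:%M:%S.%6N%:z
```

This still matches when "time" is the first field, and would continue to match if "time" later moved elsewhere in the line.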
I have no information to prove putting the timestamp at the beginning of a line performs better, just a hunch that it does.
I appreciate your answer (thanks!), but I will admit I was hoping for more than a hunch. I have that same hunch.
I'm hoping that the Splunk devs will step in and answer. They have "inside information"—they know the code path—whereas I must rely on information I can gather externally, performing tests and measuring the results. I don't really have the time to do that properly, but it's looking like I'll need to make time if I want a fact-based answer.
I've submitted feedback on the Splunk docs topic "Tune timestamp recognition for better indexing performance", which you'd think might answer these questions, but doesn't. From that topic:
To speed up indexing, you can use props.conf to adjust how far ahead into events the Splunk timestamp processor looks
The topic goes on to mention MAX_TIMESTAMP_LOOKAHEAD, but not TIME_PREFIX.
Perhaps my feedback on that topic might prompt the Splunk devs or writers to address this question.
That's a good approach. The docs team is great about chasing down answers to questions raised by the docs.