Which timestamp does indexer use as _time?

yuanliu — Sun, 20 Jun 2021 21:57:42 GMT

When multiple timestamps exist in raw events, which one does the indexer pick as _time? In most cases, Splunk picks the one I would have preferred, even though I never told it which one to prefer. How is the decision made?

In file ingestion, I can explicitly specify "TIMESTAMP_FIELDS". If multiple fields are listed, Splunk still has to pick one of them.
In file monitoring, multiple fields may contain a timestamp. Even with structured input such as CSV, I notice that the field name may not directly determine which field is ultimately chosen. (I was once surprised when a field contained a text string concatenated with a numeric value that fell into the current epoch range, and the numeric part was used as _time. That was one of the rare obviously "wrong" choices by the indexer that I have noticed.)

My most recent (pleasantly surprising) experience was with a JSON API source that comes with several timestamp fields that may or may not be populated, so I also had to add my own timestamp. I named my field "timestamp" because I thought it would be safest to just use it: in some cases, the other timestamp fields can be really stale. That said, I wouldn't mind if one of the "fresher" timestamps were used; in fact, I would prefer a fresh timestamp from the original data.

Rather strangely, if I do not add this retrieval "timestamp", the indexer doesn't populate _time, which is bad. But after I (somewhat reluctantly) add my "timestamp", the indexer picks my "timestamp" field when all other timestamp fields are either stale or blank; otherwise it ignores my (artificial) "timestamp" field and picks a "fresh" timestamp from the original source as _time. This is more or less optimal for me.

In the files that I produce from this API, there is no indication that "timestamp" is "artificial". What criteria does Splunk use to determine that one of the original timestamps is "fresh" or "stale", and that my "timestamp" field could be "too fresh"?

Adding to my befuddlement, when I add the same "timestamp" field to events from a different API (also JSON), the indexer does not populate _time at all.

If, on the other hand, I do not populate my own "timestamp" field, the indexer adds a "timestamp" field to the result, except that the value is always "none". If I cheat by setting a field named "_time", the indexer populates a field named "time" with that value.

At this point, I am at a dead end with this "other" API.

To help me think, I constructed this diagnostic matrix.

Test case	API 1	API 2
Several original timestamp fields, but no fake "timestamp" or "_time"	No _time	No _time (same as API 1)
Fake "timestamp"	_time populated with a desirable selection between the original timestamps and the fake "timestamp"	No _time, just "timestamp"
Fake "_time"	(not tested)	No _time; populates "time" instead

In all cases, my fake time fields are in fractional epoch, while original timestamp fields are in text format. Both sourcetypes do not have "TIMESTAMP_FIELDS" set.

Re: Which timestamp does indexer use as _time?

yuanliu — Sun, 20 Jun 2021 23:16:18 GMT

I now have a partial answer (a large part of it): it has to do with the sourcetype's implicit MAX_TIMESTAMP_LOOKAHEAD property. This property defaults to 128 characters and, unless you change it, it does not show up in the sourcetype stanza in props.conf or in the GUI's Advanced view.

> Both sourcetypes do not have "TIMESTAMP_FIELDS" set.

What I left unsaid is INDEXED_EXTRACTIONS. For both sourcetypes, I tested both json and none. With INDEXED_EXTRACTIONS=json, I could have specified TIMESTAMP_FIELDS, but I didn't. (You can say I really like to examine how automatic extraction works.)

API 1 happens to place a possible timestamp field before the 128-character mark, while API 2's first timestamp field comes after it. My fake timestamp field (whatever I name it) comes at the end. So that files from API 2 are timestamped correctly, I can give MAX_TIMESTAMP_LOOKAHEAD a large enough value, use TIME_PREFIX, or just use INDEXED_EXTRACTIONS=json and set TIMESTAMP_FIELDS.
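
The three options mentioned above might look roughly like the following props.conf sketch. The stanza names are hypothetical, the lookahead values are illustrative guesses rather than tested numbers, and the field name "timestamp" is simply the fake field described in this thread. Only one of the three approaches would be applied to a given sourcetype.

```
# props.conf (sketch; the stanza names below are hypothetical)

# Option 1: widen the window so the scan reaches a timestamp
# that sits past character 128 of the event.
[api2_json_lookahead]
MAX_TIMESTAMP_LOOKAHEAD = 1024

# Option 2: anchor the scan right after the "timestamp" key.
# With TIME_PREFIX set, the lookahead is counted from the end
# of the prefix match instead of from the start of the event.
[api2_json_prefix]
TIME_PREFIX = "timestamp"\s*:\s*
MAX_TIMESTAMP_LOOKAHEAD = 32

# Option 3: structured extraction with an explicitly named
# timestamp field.
[api2_json_indexed]
INDEXED_EXTRACTIONS = json
TIMESTAMP_FIELDS = timestamp
```

Options 2 and 3 both pin _time to the fake "timestamp" field, so they trade the automatic selection of a fresher original timestamp for predictability.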

It is interesting to know that MAX_TIMESTAMP_LOOKAHEAD is still effective when INDEXED_EXTRACTIONS=json (in the absence of TIMESTAMP_FIELDS).

I still do not know:

1. why events from API 1 get no automatically extracted timestamp unless a fake "timestamp" field is present, even though that field sits well beyond the 128-character mark, and
2. why, with the fake "timestamp" appended to the end, the indexer seeks out my fake "timestamp" when the event's first timestamp field contains a null value. (When all possible event timestamp fields are populated and relatively fresh, it sometimes picks another field. All of this happens without an explicit MAX_TIMESTAMP_LOOKAHEAD, i.e., with the default of 128.)
