Splunk Search

Why is date_hour inconsistent with %H?

yuanliu
SplunkTrust

According to doco: "The date_hour field ... is extracted from the event's timestamp (the value in _time)." Consider this test:

index=*
| eval hour=strftime(_time,"%H")
| eval shift=date_hour-hour
| stats count by shift index date_zone
| stats values(eval(index."-".count)) as sourcetype sum(count) as count by shift date_zone
| sort by -shift

The output is all over the map:

shift  date_zone  sourcetype     count
17     local      main-3550674   3573773
                  r-7006
                  sample-16093
16     local      r-1572         1572
0      0          main-1158239   1158239
-7     local      main-3817593   3878299
                  r-18887
                  sample-41819
-8     local      main-1626      4465
                  r-2839

When I examine raw data closely, it seems that strftime(_time,"%H") reports the hour of day correctly.

A similar inconsistency exists in date_mday vs "%d".
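
As a quick arithmetic check, sketched in Python rather than SPL (the helper name `normalize_shift` is made up for illustration): both fields are hours of the day, so a raw difference of 17 and one of -7 describe the same 7-hour offset. The larger value shows up when the two hours fall on opposite sides of midnight.

```python
def normalize_shift(shift: int) -> int:
    """Fold an hour-of-day difference into the range [-12, 12)."""
    return (shift + 12) % 24 - 12


# Raw shift values from the output table above.
observed = [17, 16, 0, -7, -8]

# After folding, only three distinct offsets remain.
folded = sorted({normalize_shift(s) for s in observed})
print(folded)  # [-8, -7, 0]
```

Read this way, the table contains three distinct offsets (0, -7 and -8 hours), not five.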

1 Solution

sideview
SplunkTrust

The one case where I know the date_* fields are all a bit wrong is when _time is extracted from an epoch time value, i.e., when _time is calculated from something in the raw event text that is itself a number of seconds since 1970.

date_hour, date_minute, and all their friends are technically not always extracted. In cases where timestamp extraction can't find a reliable timezone, Splunk isn't supposed to create any of these fields. That, by the way, can be a surprise for people who have come to expect that they are always there.

But when _time is extracted from an epoch time value in the events, the timestamp-extraction code really has no valid timezone to work with. Even so, as far as I know it has always had a bug where it erroneously assumes the data is in GMT and goes on to create date_* fields as though the data were unequivocally in GMT.
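
To make the GMT assumption concrete, here is a small Python sketch. It is not Splunk code; the epoch value and the UTC-7 viewer offset are made up for illustration. It breaks one epoch value down twice: once as GMT, the way the date_* fields would come out under the behavior described above, and once in the viewer's zone, which is what strftime(_time,"%H") reflects.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical epoch value as it might appear in a raw event.
epoch = 1459288451  # 2016-03-29 21:54:11 UTC

# Hour a GMT-assuming extractor would report.
gmt_hour = datetime.fromtimestamp(epoch, tz=timezone.utc).hour

# Hour a viewer at UTC-7 sees for the same instant (assumed offset).
viewer_zone = timezone(timedelta(hours=-7))
viewer_hour = datetime.fromtimestamp(epoch, tz=viewer_zone).hour

print(gmt_hour, viewer_hour, gmt_hour - viewer_hour)  # 21 14 7
```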

The only recourse that I know of is to stop trusting date_* completely when you're using one of these data sets, or to hardcode an offset into your search that represents your offset from GMT (and then change it twice a year for DST!). I recommend not trusting it, and just creating your own little fields in props.conf:

EVAL-hour_of_day=strftime(_time,"%H")
EVAL-day_of_week=strftime(_time,"%a")

It's been this way for years. I've filed it several times as a bug, and I've even had conversations with engineering (years ago now) about why it's a tremendous pain for them to fix.

lguinn2
Legend

Where did you find that statement in the documentation? I couldn't find it - and I think it is wrong...

yuanliu
SplunkTrust

http://docs.splunk.com/Documentation/Splunk/6.3.3/Knowledge/UseDefaultFields - scroll down to "Default datetime fields". The statement surely is wrong in the sample that I just examined.

Interestingly, closer to the top of the doco, there is a correct note:


Note: Only events that have timestamp information in them as generated by their respective systems will have date_* fields. If an event has a date_* field, it represents the value of time/date directly from the event itself. If you have specified any timezone conversions or changed the value of the time/date at indexing or input time (for example, by setting the timestamp to be the time at index or input time), these fields will not represent that.


yuanliu
SplunkTrust

I have this example where _time does not come from an epoch time in the source event. It is a syslog entry:


Mar 29 17:54:11 amiohdrmp1 snmpd[14773]: Connection from UDP: [127.0.0.1]:46920


In this particular case, syslog uses EDT without printing zone info. Splunk correctly dates this event at 3/29/16 9:54:11.000 PM, i.e., 21:54:11. As a result, %H correctly gives 21. However, date_hour is 17, the literal value split out of the source text!
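
The two values can be reproduced outside Splunk with a short Python sketch. It assumes EDT is a fixed UTC-4 offset, that the year is 2016, and that the search user's time zone is UTC (which would match the 21:54:11 display). None of this is Splunk's actual code.

```python
from datetime import datetime, timedelta, timezone

raw = "Mar 29 17:54:11"
edt = timezone(timedelta(hours=-4))  # EDT as a fixed offset (assumption)

# Literal hour, split straight out of the event text (what date_hour shows).
literal_hour = int(raw.split()[2].split(":")[0])

# Normalized time: attach the assumed year and zone, then render in UTC.
local = datetime.strptime(f"2016 {raw}", "%Y %b %d %H:%M:%S").replace(tzinfo=edt)
normalized_hour = local.astimezone(timezone.utc).hour

print(literal_hour, normalized_hour)  # 17 21
```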

Whereas this case looks like a fixable bug, the designer may have had other use cases in mind. You have sufficiently scared me, so I'll just accept "in date_* no trust" as the answer :-)

yuanliu
SplunkTrust

To close the loop: after reviewing @lguinn2's comment about the doco being wrong and discovering the correct note on the same doco page, the above example can be explained as kind of expected behavior. Here, syslog is not logging the year, so Splunk discarded "Mar 29 17:54:11" and supplied the indexer timestamp. Per the correct part of the doco: "If you have ... changed the value of the time/date at indexing or input time (for example, by setting the timestamp to be the time at index or input time), these fields will not represent that."

In this sense, it is not a bug.

lguinn2
Legend

Actually, Splunk does not discard "Mar 29 17:54:11"

If the event arrived to be indexed before Mar 29, 2016, Splunk would assume the year to be 2015. Otherwise, Splunk would assume the current year (2016).
(An easier-to-understand example: if an event showed up today (30-Mar-2016) with a timestamp of "Aug 9 17:54:11", Splunk would assume 2015. For a timestamp of "Feb 2 17:54:11", it would assume 2016.)
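
The year rule described above can be sketched in Python. This is a simplification of the rule as stated in this post, not Splunk's implementation; the function name `infer_year` and the noon arrival time are made up for illustration.

```python
from datetime import datetime


def infer_year(stamp: str, arrival: datetime) -> int:
    """Pick a year for a timestamp that lacks one (simplified rule)."""
    # Try the arrival year first.
    candidate = datetime.strptime(f"{arrival.year} {stamp}", "%Y %b %d %H:%M:%S")
    # A timestamp that would land in the future is taken to be from last year.
    return arrival.year - 1 if candidate > arrival else arrival.year


# An event showing up on 30-Mar-2016 (time of day is an assumption).
arrival = datetime(2016, 3, 30, 12, 0, 0)

print(infer_year("Aug 9 17:54:11", arrival))   # 2015
print(infer_year("Feb 2 17:54:11", arrival))   # 2016
print(infer_year("Mar 29 17:54:11", arrival))  # 2016
```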

Many, many timestamps have this form, although syslog is the most common. If Splunk wasn't able to deal with this, a lot of inputs would be broken.

Why did Splunk figure it was EDT? Check out this docs page, in the section "How Splunk applies time zones". My guess is that the forwarder supplied the timezone of the underlying OS.

lguinn2
Legend

Yes, what @sideview said: "stop trusting date_* completely" - although I would go even farther and say "don't use date_*".

_time is "normalized" - by that I mean: it is parsed by Splunk, using any timezone and props.conf information available; it is stored in the index in UTC; it is displayed to you based on your user timezone setting in the GUI. If you extract the hour (%H) from _time, it will always be "right" and it always exists.

I would really like to have a way to suppress the date fields altogether, so that users can't see them and use them without understanding the consequences.

From this page in the Splunk Docs:
http://docs.splunk.com/Documentation/Splunk/6.3.3/Knowledge/UseDefaultFields
"The datetime values are the literal values parsed from the event when it is indexed, regardless of its timezone."
