Solved: Time Skew for when logs are read.

becksyboy · ‎03-11-2019

Hi,

we deployed the Microsoft Office 365 Reporting Add-on for Splunk five days ago with the following settings:
INDEX_CONTINUOUS
Interval(secs): 600
Query window size (minutes) 10
Delay throttle (minutes) 5
Start Time: 2019-03-07T16:01:05

On the day of the deployment we where seeing the logs being ingested approx 15 mins behind current time, which is what we expected. Today (5 days later) we are seeing this now extended to 30-35 mins. has anyone experienced this or know of the cause/solution?

11/03/2019/09:55:07 11/03/2019 09:09 11/03/2019 09:19 >> approx. 35 mins behind current time
07/03/2019/16:35:15 07/03/2019 16:11 07/03/2019 16:21 >> approx. 15 mins behind current time

thanks

swatts1000 · ‎03-13-2019

You need to be careful of this TA. In a production environment you will find it slowly falls behind your delay throttleuntil it is hopelessly out of sync.

A couple of reasons for this.

It computes the last checkpoint a few different ways:

1) If messages are downloaded then the last checkpoint is the time of the last message. Sounds sensible, but because it operates on a fixed query window size you will start to fall behind your delay throttle period. Even in a busy environment like yours the last message is maybe a second or so earlier than the actual start_date + window_size. But this means you will be 1 second behind your delay throttle. Repeat this over a number of days and you will fall far behind your delay throttle. This is exactly what you are seeing above. It will never catch up.

2) If you ever stop the forwarder that this TA runs on you will be in trouble for similar reasons. It simply picks up off at the last checkpoint and applies this + query_window. This means you are will behind your delay throttle and will never catch up.

To get it to work in a sensible way we had to change how the end_date checkpoint was computed in input_module_ms_o365_message_trace.py:

end_date = utc_now - datetime.timedelta(minutes=delay_throttle)

Then if no messages or skipped messages:

max_date = end_date

Although I am tempted to put this one as max_date = start_date so you reread the same time period if no messages. Because of MS potential delay.

Maybe we missed some tunable config params, but we just found it kept falling behind and only way to fix was to update the code to suit our environment.

-S.

View solution in original post

becksyboy · ‎03-13-2019

To add to this, if we ran this search, we can see the time skew in action:
index=_internal host= source="/opt/splunk/var/log/splunk/ta_ms_o365_reporting_ms_o365_message_trace.log" "start_date=" OR "Start date" OR "End Date" OR "Number of messages returned" |rex field=_raw "start_date=(?P\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})" |rex field=_raw "Start\sdate:\s(?P.?)," | rex field=_raw "End\sdate:\s(?P.?)$"| rex field=_raw "Number\sof\smessages\sreturned:\s(?P\d+)" | convert timeformat="%d/%m/%Y/%T %N" ctime(_time) AS last_polled|table last_polled,Number_Of_Messages,checkpoint_good_start, checkpoint_good_end,checkpoint_bad | sort -last_polled

swatts1000 · ‎03-13-2019

You need to be careful of this TA. In a production environment you will find it slowly falls behind your delay throttleuntil it is hopelessly out of sync.

A couple of reasons for this.

It computes the last checkpoint a few different ways:

1) If messages are downloaded then the last checkpoint is the time of the last message. Sounds sensible, but because it operates on a fixed query window size you will start to fall behind your delay throttle period. Even in a busy environment like yours the last message is maybe a second or so earlier than the actual start_date + window_size. But this means you will be 1 second behind your delay throttle. Repeat this over a number of days and you will fall far behind your delay throttle. This is exactly what you are seeing above. It will never catch up.

2) If you ever stop the forwarder that this TA runs on you will be in trouble for similar reasons. It simply picks up off at the last checkpoint and applies this + query_window. This means you are will behind your delay throttle and will never catch up.

To get it to work in a sensible way we had to change how the end_date checkpoint was computed in input_module_ms_o365_message_trace.py:

end_date = utc_now - datetime.timedelta(minutes=delay_throttle)

Then if no messages or skipped messages:

max_date = end_date

Although I am tempted to put this one as max_date = start_date so you reread the same time period if no messages. Because of MS potential delay.

Maybe we missed some tunable config params, but we just found it kept falling behind and only way to fix was to update the code to suit our environment.

-S.

nightflame · ‎05-08-2019

Thanks for this. It is exactly this behaviour we see. Would be great if the original authors could fix the TA checkpoint issue.

jeremyhagand61 · ‎06-19-2019

I tried this mod to the file in question, but it just started throwing errors so I reverted to the backup copy. Is it just a matter of altering the pyhton script? Or is there more to it than that?

swatts1000 · ‎06-20-2019

All we did is modify the python script. Here's what we did:

In input_module_ms_o365_message_trace.py :

145 start_date = get_start_date(helper, check_point_key)
146 # end_date = start_date + datetime.timedelta(minutes=query_window_size)
147 # CHANGE
148 utc_now = datetime.datetime.utcnow()
149 # CHANGE - read from last checkpoint up to time now minus delay throttle - this makes more sense!
150 end_date = utc_now - datetime.timedelta(minutes=delay_throttle)

...

166 # CHANGE
167 # max_date = start_date + datetime.timedelta(seconds=interval)
168 max_date = end_date

...

204 if len(messages) == message_skip_count:
205 # If we skipped all the messages, move the query window forward by the amount of seconds in the interval
206 # CHANGE
207 # max_date = max_date + datetime.timedelta(seconds=interval)
208 max_date = end_date
209 helper.log_debug("Skipped all messages. Setting max date to %s" % max_date)

jeremyhagand61 · ‎07-10-2019

Thanks for your reply. However it isn't working for me. I get an ExecProcessor error on this line:
end_date = utc_now - datetime.timedelta(minutes=delay_throttle)

When splunk attempts to run the script.

jeremyhagand61 · ‎07-10-2019

Maybe I'll write my own PowerShell modular input.

becksyboy · ‎03-12-2019

Also noticed that when the API call is made, the following API call does not follow on from the previous End date/time. In the example below the end date/time was 07/03/2019 17:21:04, but with the next API call the start time was 07/03/2019 17:21:03. If this keeps happening, after a while the timing gap would get larger.

When TA API call made Start End
07/03/2019 17:45:15 07/03/2019 17:21:03.000 07/03/2019 17:31:03
07/03/2019 17:35:15 07/03/2019 17:11:04.000 07/03/2019 17:21:04

Time Skew for when logs are read.

Join the Splunk Community Slack to learn, troubleshoot, and make connections with fellow Splunk practitioners in real time!

Join Splunk User Groups to connect and learn in-person by region or remotely by topic or industry.

[Puzzles] Solve, Learn, Repeat: Character substitutions with Regular Expressions

Splunk Community Badges!

[Puzzles] Solve, Learn, Repeat: Matching cron expressions

Join the Conversation