All Apps and Add-ons

Time Skew for when logs are read.

Communicator

Hi,

we deployed the Microsoft Office 365 Reporting Add-on for Splunk five days ago with the following settings:
INDEX_CONTINUOUS
Interval(secs): 600
Query window size (minutes) 10
Delay throttle (minutes) 5
Start Time: 2019-03-07T16:01:05

On the day of the deployment we where seeing the logs being ingested approx 15 mins behind current time, which is what we expected. Today (5 days later) we are seeing this now extended to 30-35 mins. has anyone experienced this or know of the cause/solution?

11/03/2019/09:55:07 11/03/2019 09:09 11/03/2019 09:19 >> approx. 35 mins behind current time
07/03/2019/16:35:15 07/03/2019 16:11 07/03/2019 16:21 >> approx. 15 mins behind current time

thanks

0 Karma
1 Solution

Explorer

You need to be careful of this TA. In a production environment you will find it slowly falls behind your delay throttleuntil it is hopelessly out of sync.

A couple of reasons for this.

It computes the last checkpoint a few different ways:

1) If messages are downloaded then the last checkpoint is the time of the last message. Sounds sensible, but because it operates on a fixed query window size you will start to fall behind your delay throttle period. Even in a busy environment like yours the last message is maybe a second or so earlier than the actual startdate + windowsize. But this means you will be 1 second behind your delay throttle. Repeat this over a number of days and you will fall far behind your delay throttle. This is exactly what you are seeing above. It will never catch up.

2) If you ever stop the forwarder that this TA runs on you will be in trouble for similar reasons. It simply picks up off at the last checkpoint and applies this + query_window. This means you are will behind your delay throttle and will never catch up.

To get it to work in a sensible way we had to change how the enddate checkpoint was computed in inputmodulemso365messagetrace.py:

enddate = utcnow - datetime.timedelta(minutes=delay_throttle)

Then if no messages or skipped messages:

maxdate = enddate

Although I am tempted to put this one as maxdate = startdate so you reread the same time period if no messages. Because of MS potential delay.

Maybe we missed some tunable config params, but we just found it kept falling behind and only way to fix was to update the code to suit our environment.

-S.

View solution in original post

Communicator

To add to this, if we ran this search, we can see the time skew in action:
index=internal host= source="/opt/splunk/var/log/splunk/tamso365reportingmso365messagetrace.log" "startdate=" OR "Start date" OR "End Date" OR "Number of messages returned" |rex field=raw "startdate=(?P\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})" |rex field=raw "Start\sdate:\s(?P.?)," | rex field=_raw "End\sdate:\s(?P.?)$"| rex field=raw "Number\sof\smessages\sreturned:\s(?P\d+)" | convert timeformat="%d/%m/%Y/%T %N" ctime(time) AS lastpolled|table lastpolled,NumberOfMessages,checkpointgoodstart, checkpointgoodend,checkpointbad | sort -lastpolled

0 Karma

Explorer

You need to be careful of this TA. In a production environment you will find it slowly falls behind your delay throttleuntil it is hopelessly out of sync.

A couple of reasons for this.

It computes the last checkpoint a few different ways:

1) If messages are downloaded then the last checkpoint is the time of the last message. Sounds sensible, but because it operates on a fixed query window size you will start to fall behind your delay throttle period. Even in a busy environment like yours the last message is maybe a second or so earlier than the actual startdate + windowsize. But this means you will be 1 second behind your delay throttle. Repeat this over a number of days and you will fall far behind your delay throttle. This is exactly what you are seeing above. It will never catch up.

2) If you ever stop the forwarder that this TA runs on you will be in trouble for similar reasons. It simply picks up off at the last checkpoint and applies this + query_window. This means you are will behind your delay throttle and will never catch up.

To get it to work in a sensible way we had to change how the enddate checkpoint was computed in inputmodulemso365messagetrace.py:

enddate = utcnow - datetime.timedelta(minutes=delay_throttle)

Then if no messages or skipped messages:

maxdate = enddate

Although I am tempted to put this one as maxdate = startdate so you reread the same time period if no messages. Because of MS potential delay.

Maybe we missed some tunable config params, but we just found it kept falling behind and only way to fix was to update the code to suit our environment.

-S.

View solution in original post

Explorer

Thanks for this. It is exactly this behaviour we see. Would be great if the original authors could fix the TA checkpoint issue.

0 Karma

Path Finder

I tried this mod to the file in question, but it just started throwing errors so I reverted to the backup copy. Is it just a matter of altering the pyhton script? Or is there more to it than that?

0 Karma

Explorer

All we did is modify the python script. Here's what we did:

In inputmodulemso365message_trace.py :

145 startdate = getstartdate(helper, checkpointkey)
146 # end
date = startdate + datetime.timedelta(minutes=querywindowsize)
147 # CHANGE
148 utc
now = datetime.datetime.utcnow()
149 # CHANGE - read from last checkpoint up to time now minus delay throttle - this makes more sense!
150 enddate = utcnow - datetime.timedelta(minutes=delay_throttle)

...

166 # CHANGE
167 # maxdate = startdate + datetime.timedelta(seconds=interval)
168 maxdate = enddate

...

204 if len(messages) == messageskipcount:
205 # If we skipped all the messages, move the query window forward by the amount of seconds in the interval
206 # CHANGE
207 # maxdate = maxdate + datetime.timedelta(seconds=interval)
208 maxdate = enddate
209 helper.logdebug("Skipped all messages. Setting max date to %s" % maxdate)

0 Karma

Path Finder

Thanks for your reply. However it isn't working for me. I get an ExecProcessor error on this line:
enddate = utcnow - datetime.timedelta(minutes=delay_throttle)

When splunk attempts to run the script.

0 Karma

Path Finder

Maybe I'll write my own PowerShell modular input.

0 Karma

Communicator

Also noticed that when the API call is made, the following API call does not follow on from the previous End date/time. In the example below the end date/time was 07/03/2019 17:21:04, but with the next API call the start time was 07/03/2019 17:21:03. If this keeps happening, after a while the timing gap would get larger.

When TA API call made Start End
07/03/2019 17:45:15 07/03/2019 17:21:03.000 07/03/2019 17:31:03
07/03/2019 17:35:15 07/03/2019 17:11:04.000 07/03/2019 17:21:04

0 Karma