All Apps and Add-ons

Is the rate limit (2000 logs per poll, max) built into the add-on, or is it a limitation within the Microsoft Cloud?

Azerty728
Path Finder

Hello there,

This post is to ask a question, but mostly to share some experience with this add-on.

We discovered that the add-on limits the number of logs obtained on every connection it makes to the Microsoft Cloud.

Every time you poll, you get at most 2000 logs. So if you poll every hour, you get at most 2000 logs/hour; if you poll every 30 seconds, you get at most 240,000 logs/hour.
So depending on your infrastructure, be aware of this behavior. Every message polled is assigned an Index value from 0 to 1999, so you can use this field to monitor whether you are losing logs: search for "Index=1999", and if it matches, you may have lost logs. Reducing the polling interval in the add-on's "Inputs" configuration may help you collect all your logs.
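For rough capacity planning, the 2000-logs-per-poll cap translates directly into a maximum safe polling interval for a sustained message rate. A minimal sketch (the 2000 figure is the cap discussed in this thread, not a documented Microsoft constant):

```python
# Cap observed per poll, as discussed in this thread.
MAX_LOGS_PER_POLL = 2000

def max_polling_interval(messages_per_hour):
    """Longest polling interval (in seconds) that still keeps up with a
    sustained message rate, given the per-poll cap."""
    return MAX_LOGS_PER_POLL * 3600.0 / messages_per_hour

# At a sustained 240,000 messages/hour you would have to poll at least
# every 30 seconds, matching the figures above.
print(max_polling_interval(240000))  # 30.0
```

Of course a burst above the sustained rate still overflows a single poll, which is why monitoring for Index=1999 matters regardless of the interval you pick.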

And now the question: is this a limitation built into the add-on, or is it a limitation within the Microsoft Cloud?
There is nothing in the add-on's GUI to modify this value.
If someone has the info, please share.

Regards

0 Karma

ericlavalley
Explorer

@jconger Any chance this can be included in the add-on? I'm regularly dropping emails due to Microsoft's rate limiting. I have a PowerShell script doing something similar, but I would much rather leverage this app than something homegrown.

0 Karma

nathanielduren
New Member

I changed the script to follow the __next links and process the additional logs. I don't have enough karma to attach a file, so I've pasted it below. Replace %SPLUNK_DIR%/etc/apps/TA-MS_O365_Reporting/bin/input_module_ms_o365_message_trace.py with the below and it should get all the messages for the polling period.

# encoding = utf-8

import os
import sys
import time
import datetime
import re
import requests
import json
import dateutil.parser

def validate_input(helper, definition):
    """Implement your own validation logic to validate the input stanza configurations"""
    # This example accesses the modular input variable
    # microsoft_office_365_account = definition.parameters.get('microsoft_office_365_account', None)
    pass

def get_start_date(helper, check_point_key):

    # Try to get a date from the check point first
    d = helper.get_check_point(check_point_key)

    # If there was a check point date, return it.
    if (d not in [None,'']):
        return dateutil.parser.parse(d["max_date"])
    else:
        # No check point date, so look if a start date was specified as an argument
        d = helper.get_arg("start_date_time")
        if (d not in [None,'']):
            return dateutil.parser.parse(d)
        else:
            # If there was no start date specified, default to 5 days ago
            return datetime.datetime.now() - datetime.timedelta(days=5)

def get_last_epoch(helper, check_point_key):
    e = helper.get_check_point(check_point_key)
    if (e not in [None,'']):
        return e["max_epoch"]
    else:
        return 0

def collect_events(helper, ew):
    global_microsoft_office_365_username = helper.get_arg("office_365_account")["username"]
    global_microsoft_office_365_password = helper.get_arg("office_365_account")["password"]
    index_metadata = helper.get_arg("index_metadata")
    check_point_key = "%s_obj_checkpoint" % helper.get_input_stanza_names()
    start_date = get_start_date(helper, check_point_key)

    # Sometimes MS returns Subject and FromIP as null and Size as 0.
    # We are probably fetching the log before it has fully synced, which means bad data gets into Splunk.
    # So we lag end_date by 180 seconds instead of using now().
    end_date = datetime.datetime.utcnow() - datetime.timedelta(seconds=180)

    microsoft_trace_url = "https://reports.office365.com/ecp/reportingwebservice/reporting.svc/MessageTrace?$format=json&orderby=Received asc&$filter=StartDate eq datetime'%sZ' and EndDate eq datetime'%sZ'" % (start_date.isoformat(), end_date.isoformat())

    helper.log_debug("Endpoint URL: %s" % microsoft_trace_url)

    r = requests.get(microsoft_trace_url, auth=requests.auth.HTTPBasicAuth(global_microsoft_office_365_username, global_microsoft_office_365_password))

    try:
        r.raise_for_status()
        data = r.json()

        max_date = start_date
        max_epoch = get_last_epoch(helper, check_point_key)
        current_max_epoch = max_epoch

        for message_trace in data["d"]["results"]:

            # According to https://msdn.microsoft.com/en-us/library/office/jj984335.aspx
            # The StartDate and EndDate fields do not provide useful information in the report results...
            message_trace.pop("StartDate")
            message_trace.pop("EndDate")

            if not index_metadata:
                message_trace.pop("__metadata")

            # Convert the /Date()/ format returned from the JSON and create a new field
            _received = re.search(r'/Date\((.+?)\)/', message_trace["Received"])
            if(_received):
                t = int(_received.group(1))

                # There is a chance that we could ingest duplicate data due to date granularity.
                # This check should catch those situations.
                if t <= max_epoch:
                    continue

                d = datetime.datetime.utcfromtimestamp(t/1000)
                message_trace["DateReceived"] = d.isoformat() + "Z"

                # Keep up with the max received date
                max_date = max([max_date, d])

                # Keep up with the max epoch as well for greater precision
                current_max_epoch = max([current_max_epoch,t])

            e = helper.new_event(source=helper.get_input_type(), index=helper.get_output_index(), sourcetype=helper.get_sourcetype(), data=json.dumps(message_trace))
            ew.write_event(e)
        _next = data["d"].get("__next", 0)
        helper.log_debug("Next URL: %s" % _next)
        while _next:
            # Follow the __next link so pages beyond the first are ingested,
            # and fold each page's max date/epoch back into the checkpoint so
            # the next poll does not re-ingest those events.
            _next, page_max_date, page_max_epoch = collect_events_next(helper, ew, _next)
            max_date = max([max_date, page_max_date])
            current_max_epoch = max([current_max_epoch, page_max_epoch])

        checkpoint_data = {}
        checkpoint_data["max_date"] = str(max_date)
        checkpoint_data["max_epoch"] = current_max_epoch

        helper.save_check_point(check_point_key, checkpoint_data)

    except Exception as e:
        helper.log_error("HTTP Request error: %s" % str(e))

def collect_events_next(helper, ew, _next):
    global_microsoft_office_365_username = helper.get_arg("office_365_account")["username"]
    global_microsoft_office_365_password = helper.get_arg("office_365_account")["password"]
    index_metadata = helper.get_arg("index_metadata")
    check_point_key = "%s_obj_checkpoint" % helper.get_input_stanza_names()
    start_date = get_start_date(helper, check_point_key)
    microsoft_trace_url = _next + "&$format=json&orderby=Received%20asc"

    helper.log_debug("Endpoint URL: %s" % microsoft_trace_url)

    r = requests.get(microsoft_trace_url, auth=requests.auth.HTTPBasicAuth(global_microsoft_office_365_username, global_microsoft_office_365_password))

    # Track the maximums outside the try block so they can still be
    # returned to the caller if the request fails partway through.
    max_date = start_date
    max_epoch = get_last_epoch(helper, check_point_key)
    current_max_epoch = max_epoch

    try:
        r.raise_for_status()
        data = r.json()

        for message_trace in data["d"]["results"]:

            # According to https://msdn.microsoft.com/en-us/library/office/jj984335.aspx
            # the StartDate and EndDate fields do not provide useful information in the report results.
            message_trace.pop("StartDate")
            message_trace.pop("EndDate")

            if not index_metadata:
                message_trace.pop("__metadata")

            # Convert the /Date()/ format returned in the JSON and create a new field
            _received = re.search(r'/Date\((.+?)\)/', message_trace["Received"])
            if (_received):
                t = int(_received.group(1))

                # There is a chance that we could ingest duplicate data due to date granularity.
                # This check should catch those situations.
                if t <= max_epoch:
                    continue

                d = datetime.datetime.utcfromtimestamp(t/1000)
                message_trace["DateReceived"] = d.isoformat() + "Z"

                # Keep up with the max received date
                max_date = max([max_date, d])

                # Keep up with the max epoch as well for greater precision
                current_max_epoch = max([current_max_epoch, t])

            e = helper.new_event(source=helper.get_input_type(), index=helper.get_output_index(), sourcetype=helper.get_sourcetype(), data=json.dumps(message_trace))
            ew.write_event(e)

        _next = data["d"].get("__next", 0)
        helper.log_debug("Next URL: %s" % _next)
        return _next, max_date, current_max_epoch

    except Exception as e:
        helper.log_error("HTTP Request error: %s" % str(e))
        # Stop paging on error; hand back what we have so the checkpoint still advances.
        return 0, max_date, current_max_epoch
0 Karma

john0499
Explorer

I've just realized we're encountering this issue too. As a university we can have spikes of 40k emails in a few minutes so we're dropping quite a few.

Is there no way for the add-on to fetch historical events and slowly catch up after a spike? This is how the add-on for Okta works.
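One way an add-on could catch up gradually (a hypothetical sketch of the Okta-style behavior described above, not something this add-on currently does) is to bound each poll's query window: the checkpoint then advances one fixed-size slice per poll until it reaches "now" again, instead of requesting the whole backlog in a single capped call.

```python
import datetime

def next_query_window(checkpoint, now, window=datetime.timedelta(minutes=5)):
    """Return the (start, end) query window for the next poll.

    Never queries more than `window` at a time, so a backlog left by a
    spike is drained slice by slice on subsequent polls.
    """
    start = checkpoint
    end = min(start + window, now)
    return start, end

now = datetime.datetime(2020, 1, 1, 12, 0, 0)
behind = datetime.datetime(2020, 1, 1, 10, 0, 0)  # two hours behind after a spike
start, end = next_query_window(behind, now)
print(end - start)  # 0:05:00 -- one five-minute slice per poll until caught up
```

Each poll would then save `end` as the new checkpoint, so after 24 polls the two-hour backlog above would be cleared without any single request hitting the cap.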

0 Karma

hkust
New Member

I suppose this is a limitation of Office 365 Reporting Add-On 1.01! I hope this can be fixed in an update.

0 Karma

Azerty728
Path Finder

Hi all,

@hkust,
15 s is already very small. You can't keep reducing this timer, because the add-on still needs time to fetch those messages.

Maybe you don't have sufficient bandwidth to get that many messages every 15 s?

And to answer @mwoods2, the best way to ensure you get all messages is to set up an alert every time you see Index=1999, so you know to reduce the polling interval. But this is not a miracle solution, just a temporary workaround. What happens when there's a message burst?
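The same Index=1999 condition can also be detected inside the polling script itself, before the data ever reaches a search: if a response page comes back full with no continuation link, some messages in the window were probably dropped. A minimal sketch over the `data["d"]` JSON structure the script in this thread already parses (the 2000 cap is the one discussed here, not a documented constant):

```python
MAX_LOGS_PER_POLL = 2000  # the per-poll cap discussed in this thread

def page_probably_truncated(data):
    """Return True when a MessageTrace response looks capped: the page
    is full and there is no __next link pointing at the remainder."""
    results = data["d"]["results"]
    return len(results) >= MAX_LOGS_PER_POLL and "__next" not in data["d"]

# A full page with no continuation link is the warning sign.
sample = {"d": {"results": [{"Index": i} for i in range(2000)]}}
print(page_probably_truncated(sample))  # True
```

A script could log a warning (or write a marker event) whenever this returns True, which gives you the same signal as the Index=1999 search without waiting for an alert to fire.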

Regards.

0 Karma

hkust
New Member

We installed the Microsoft Office 365 Reporting Add-On for Splunk recently and are able to get the message log. However, with the setting "Interval=15" (sec), we still see "Index: 1999" within a few hours.

Is there any recommended value for the interval, say 5 or 10 seconds? Thanks

James

0 Karma

mwoods2
New Member

Some time has gone by since the original poster asked the question, but I've experienced the same thing. Research seems to indicate that this is a limitation on the Microsoft side. What I've done until I can sort this out further is reduce the polling period significantly (to every minute), as mentioned above. Note that you can poll not only through REST but also through the PowerShell cmdlets (in general, not with this app), and those cmdlets can use the -PageSize parameter to request up to 5,000 records. Not that it helps here, but it is interesting that the REST interface does not have that same capability, even though, according to Microsoft, it also uses PowerShell on the back end (within their environment).

So far the app leaves a number of questions open for me, like:

  • Is this too frequent for o365 polling?
  • What happens when the Heavy Forwarder is down, are the message logs between the time period when it is down and when it comes up lost?
  • How to ensure we’re getting all of the messages that are available and that none are dropped?
  • If there is a delay in reporting for a message or messages, and the retrieval period passes for getting it, do we lose that/those message/messages?

My organization may need to come up with a different solution if my review of the app leaves questions like the above open.

I don't want to downplay the app author's work at all, though! They did a great job, I appreciate their work, and they deserve kudos for it!

0 Karma