How can I fix issues fetching Exchange Online message tracking logs (HTTP 401 Client Error) since April 1, 2023?

rvaglid
Explorer

Since the first of April we have been receiving HTTP 401 Client Error messages in the modular input logs of the Splunk Add-on for Microsoft Office 365 Reporting Web Service (TA-MS_O365_Reporting version 2.0.1).
We tried both OAuth authentication and basic authentication, but we still receive the same error.

I was able to replicate the same issue in another Splunk environment against another M365 tenant.

We also configured the Splunk Add-on for Microsoft Office 365 (splunk_ta_o365 version 4.2.1) to fetch these logs, but we still receive the HTTP 401.

We are pretty confident that the app registrations and permissions are set up correctly.

Both apps connect to the API endpoint https://reports.office365.com/ecp/reportingwebservice/reporting.svc/MessageTrace. Does anyone know of any changes Microsoft has made to this endpoint?

Cheers,

Rolf

alaa_ahmad
Loves-to-Learn Everything

Dear all,

This problem is back again.

amyers16
Path Finder

It's working for me now as well! Had to rebuild my inputs and fix the time, but otherwise seems to be ingesting correctly.

splunkfordummie
Engager

Our Splunk SE who has been following the internal Splunk Jira ticket told us that the issue should be resolved now.  Microsoft had acknowledged this was an issue on their end.   I have confirmed as of midnight that we stopped getting the 401 Client Error messages and are now ingesting logs successfully with the Reporting Web Service Splunk App. 

grokdesigns
Explorer

My rep says they got an update from Microsoft that they can “tentatively” confirm they’ve identified the root cause of the issue and are working on patching it. No ETA given.

MightyJ
Explorer

The official documentation has been updated to list this as a known issue.

Reference:

https://docs.splunk.com/Documentation/AddOns/released/MSO365/Releasenotes

Known issues

Version 4.2.1 of the Splunk Add-on for Microsoft Office 365 contains the following, if any, known issues:

  • Customers will experience a delay in event ingestion after v4.2.0 due to KVstore performance on cloud architecture.

Date filed   Issue number   Description
2023-04-13   ADDON-61818    Repeated 401 Client errors when attempting to pull message trace data.

Workaround: None known

madcitygeek
Explorer

I fixed it temporarily by adding a loop in the python. Seems to be working okay. 

response = session.get(url)

# Noticed via Postman that when MS fails, it returns a 200 with a logon page, not the expected JSON. Retry until they give up the goods.
# Sometimes it gives a 401 as well, but that appears to be transient.
while response.headers.get("Content-Type") != 'application/json;odata=minimalmetadata;streaming=true;charset=utf-8':
    response = session.get(url)

response.raise_for_status()
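One caveat with the loop above: it has no upper bound, so if the endpoint never returns JSON the input would spin forever. A bounded variant with a backoff between attempts might look like the sketch below. This is not the add-on's code; `fetch_json_with_retry`, `max_attempts`, `base_delay`, and the injectable `sleep` are names made up for illustration, and it only assumes the response object exposes `status_code` and `headers` the way a `requests` response does.

```python
import time


def fetch_json_with_retry(fetch, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it returns a 200 JSON response or attempts run out.

    fetch is any zero-argument callable returning an object with
    .status_code and .headers (hypothetical wrapper, not part of the add-on).
    """
    last = None
    for attempt in range(max_attempts):
        last = fetch()
        content_type = last.headers.get("Content-Type", "")
        # A logon page comes back as 200 with an HTML content type,
        # so check both the status and the content type.
        if last.status_code == 200 and content_type.startswith("application/json"):
            return last
        if attempt < max_attempts - 1:
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError(
        f"no JSON response after {max_attempts} attempts "
        f"(last status {last.status_code})"
    )
```

With a `requests` session this would be called as `fetch_json_with_retry(lambda: session.get(url))`. Raising after the last attempt lets the modular input log a clear failure instead of hanging.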

rvaglid
Explorer

Thank you all for your input on this issue. 

We managed to get it working again after a great deal of trial and error. 
We ended up creating a brand new service principal, and applying the same permissions again. We had to manually alter the manifest to be able to select "ReportingWebService.Read.All".

Our working theory is that the service principal we have been using for years might be "outdated" in a way, as it started working instantly with a brand new service principal.
We do not have any information from MS on this theory, as we were able to resolve the issue just before submitting a case to MS.

Cheers,
Rolf

bbour53
Engager

Throwing my 2-cents in here. We've been using the OAuth route since the beginning of the year without issue. We didn't start seeing the 401 error until 4/7. 

After the errors started on 4/7, it looked like the error was intermittent and ingestion continued until 4/10, at which point the errors became very persistent.

Disabling the input and re-enabling it seemed to bring some temporary relief, but the issue persists. Sometimes, based on the log, you can see the skiptoken successfully incrementing until it randomly hits the 401. Debug logs show successful retrieval of the access token, and Azure AD logs confirm that the app isn't getting any auth failures.

Ideally the consistency of the Microsoft endpoint improves, but maybe the Splunk Add-On for Microsoft Office 365 needs a better method to catch this error and retry instead of starting the collection again at the first message after every failure.
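To illustrate the "catch and retry" idea, here is a rough sketch of paging that retries a failed page in place rather than going back to the first message. It is not how the add-on is actually written; `collect_all`, `fetch_page`, and `TransientError` are hypothetical names, and `fetch_page(url)` is assumed to return the page's items plus the next link (the URL carrying the skiptoken), or `None` when there are no more pages.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for a failure worth retrying, such as a 401 mid-run."""


def collect_all(fetch_page, first_url, max_retries=3, delay=1.0, sleep=time.sleep):
    """Follow next links, retrying a failed page instead of restarting the run."""
    items = []
    url = first_url
    while url is not None:
        for attempt in range(max_retries + 1):
            try:
                page_items, next_url = fetch_page(url)
                break
            except TransientError:
                if attempt == max_retries:
                    # Give up on this page only after exhausting the retries.
                    raise
                sleep(delay)
        # Pages fetched before the failure are kept; only the failed URL is retried.
        items.extend(page_items)
        url = next_url
    return items
```

The point of the design is that the current URL (and therefore the skiptoken) survives a failure, so a single 401 costs one repeated request instead of a full re-collection.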

AhmadGul23
Loves-to-Learn Everything

Hi @bbour53 ,

Thank you for sharing your temp work-around, I've been trying it out since I'm facing this issue as well but somehow it seems it doesn't work every time. 

Can you kindly confirm that the temp workaround only consists of disabling the message trace input from Splunk Add-On for Microsoft 365 via Web GUI and then re-enabling it instantly or is there a wait time until we re-enable the input?

It worked once or twice for me. What I saw was that it made two API calls every 300 seconds: the first would fetch messages, but the second would get the 401 client error. Now even the first one isn't getting any messages.

AhmadGul23
Loves-to-Learn Everything

Does Query Window Size have any impact if we change that? 

It was set to 5 minutes, and I tried changing it to 30 minutes, while the interval of the API call is 5 minutes.

AhmadGul23
Loves-to-Learn Everything

Just to update - I've been trying the workaround of disabling the input and then enabling the input for Message Trace Logs - initially it seemed to work but now it's not working at all. 

Has anyone found any other workaround that seems to work?

scannon4
Communicator

Anyone have an update on this?  We have this issue when using Message Trace inside the Add on for Microsoft Office 365.  But the older app Splunk Add On for Microsoft Office 365 Reporting Web Service works.

amyers16
Path Finder

I am facing the same issue, but according to the link @Wiessiet posted, it was resolved on April 6th and they are not updating the case information any longer. 

Wiessiet
Path Finder

I saw the same thing yesterday when I went to reference the status. My organization has a ticket open with Microsoft and our O365 team forwarded me the update from a Microsoft engineer that they could replicate the issue. If our ticket bears fruit and I have a worthwhile update for everyone I'll post it here.

Wiessiet
Path Finder

Update from me - I just met with my Splunk support team for something else and made a mention about this. Apparently Splunk is aware and has an internal Jira open. It would appear MS changed something on their side that broke Splunk's TA. I'm going to open a case with Splunk anyway to help with visibility, but this might be a case of having to wait for the two companies to sort it out and fix it >.<

Wiessiet
Path Finder

By way of a further update - I logged a case with Splunk support today and got the following response:

 

Thank you for submitting the case. We are aware of this issue and I want to let you know that we have received many cases of the same issue from other customers as well.

We have reproduced & encountered the same error and suspect an issue with the API, not with the add-on. I would request that you allow us some time to validate the issue from the Microsoft Azure end to know the API behaviour.

In order to expedite the case, we have escalated the issue to our internal team, and the add-on engineering team has started the conversation with Microsoft about the 401 client error (message trace failure). Rest assured that I will keep you informed of any further updates on this matter.

I have also associated the internal Jira ticket for this issue with your support case so now even your account owner can check the status for any update from our internal add-on engineering team regarding this.

If you have any other questions, kindly let me know.

Wiessiet
Path Finder

I got an update today on the ticket I have open:

Thank you for your patience while we have been working to resolve the issue you reported. We would like to assure you that our engineering team and Microsoft team have been conducting a thorough investigation into the problem.

We are pleased to inform you that the MS team have successfully reproduced the issue locally and is currently implementing patches to prevent reported issues.

As per the latest update from the Microsoft side, they have a new patch being tested, and it will be rolled out to Production today or tomorrow. The fix should be available very soon. If there are any delays, we will update you as soon as possible.

We kindly request your patience as we work to implement this fix. Rest assured that we will update you as soon as we have further information from our engineering team.

Thank you for your understanding and support.

 

In addition, my sales engineer indicated that they are having potential success by rolling back to the updated, OAuth version of the Microsoft add-on versus using the Splunk add-on:

https://splunkbase.splunk.com/app/3720 instead of

https://splunkbase.splunk.com/app/4055

3720 actually indicates that users should migrate to 4055, but perhaps that's bad advice at the moment. I'm optimistic that they'll fix the 4055 add-on as well though. For now, I'm going to test version 2.0.1 of that first add-on link and I'll report back with my findings.

grokdesigns
Explorer

Seems to be up and working with the 4055 app now.

Wiessiet
Path Finder

I'm getting the same results as well - the official Splunk version of this ingest app is working for me. I didn't adjust anything or update to the just-released newer version (4.3) - I have 4.2.1 installed. A brief test in my non-production worked and pulled in a bunch of trace logs.

amyers16
Path Finder

Interestingly, the logs indicated that it was getting data, but I got nothing when searching for:

sourcetype=o365:reporting:messagetrace