Since the first of April we have been receiving HTTP 401 Client Error messages in the modular input logs from the Splunk Add-on for Microsoft Office 365 Reporting Web Service (TA-MS_O365_Reporting version 2.0.1).
We tried both OAuth authentication and basic authentication, but we still receive the same error.
I was able to replicate the same issue in another Splunk environment against another M365 tenant.
We also configured the Splunk Add-on for Microsoft Office 365 (splunk_ta_o365 version 4.2.1) to fetch these logs, but we still receive the HTTP 401.
We are pretty confident that the app registrations and permissions are set up correctly.
Both apps connect to the API endpoint https://reports.office365.com/ecp/reportingwebservice/reporting.svc/MessageTrace. Does anyone know of any changes Microsoft has made to this endpoint?
Our Splunk SE who has been following the internal Splunk Jira ticket told us that the issue should be resolved now. Microsoft had acknowledged this was an issue on their end. I have confirmed as of midnight that we stopped getting the 401 Client Error messages and are now ingesting logs successfully with the Reporting Web Service Splunk App.
The official documentation has been updated with the known issue:
Version 4.2.1 of the Splunk Add-on for Microsoft Office 365 contains the following, if any, known issues:
| Date filed | Issue number | Description |
| 2023-04-13 | ADDON-61818 | Repeated 401 Client errors when attempting to pull message trace data. |
I fixed it temporarily by adding a loop in the add-on's Python code. Seems to be working okay.
response = session.get(url)
# noticed via Postman that when MS fails, it returns a 200 with a logon page, not the expected json. Retry until they give up the goods.
# Sometimes it gives a 401 as well, but that appears to be transient.
while response.headers["Content-Type"] != 'application/json;odata=minimalmetadata;streaming=true;charset=utf-8':
    response = session.get(url)
response.raise_for_status()
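The loop in this workaround has no upper bound, and it compares the whole Content-Type string exactly. A bounded variant could look like the sketch below. It is only an illustration, not the add-on's actual code: the function name `get_json_with_retry`, its parameters, and the backoff values are all assumptions, and `session` is assumed to be any object whose `get(url)` returns a response with `status_code`, `headers`, and `raise_for_status()` (as a `requests.Session` does).

```python
import time


def get_json_with_retry(session, url, max_attempts=5, delay_seconds=2.0,
                        sleep=time.sleep):
    """Fetch url, retrying while the reply is a logon page or a 401.

    Hypothetical helper: names and defaults are illustrative only.
    """
    last_response = None
    for attempt in range(1, max_attempts + 1):
        response = session.get(url)
        last_response = response
        content_type = response.headers.get("Content-Type", "").lower()

        # Success: HTTP 200 and a JSON body (prefix match, so parameters
        # such as odata=... or charset=... do not matter).
        if response.status_code == 200 and content_type.startswith("application/json"):
            return response

        # Only a 200 with non-JSON content and a 401 are treated as
        # transient here; any other 4xx/5xx status is raised immediately.
        if response.status_code not in (200, 401):
            response.raise_for_status()

        # Wait a little longer after each failed attempt.
        if attempt < max_attempts:
            sleep(delay_seconds * attempt)

    # Out of attempts: raise the HTTP error if there is one (e.g. the 401),
    # otherwise report that only non-JSON responses came back.
    last_response.raise_for_status()
    raise RuntimeError(
        "No JSON response from %s after %d attempts" % (url, max_attempts)
    )
```

In place of the loop, the call would be `response = get_json_with_retry(session, url)`. Capping the attempts keeps the input from spinning forever if the endpoint never recovers.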
Thank you all for your input on this issue.
We managed to get it working again after a great deal of trial and error.
We ended up creating a brand new service principal, and applying the same permissions again. We had to manually alter the manifest to be able to select "ReportingWebService.Read.All".
Our working theory is that the service principal we have been using for years might be "outdated" in a way, as it started working instantly with a brand new service principal.
We do not have any information from MS on this theory, as we were able to resolve the issue just before submitting a case to MS.
Throwing my 2-cents in here. We've been using the OAuth route since the beginning of the year without issue. We didn't start seeing the 401 error until 4/7.
After the errors started on 4/7, it looked like the error was intermittent and ingestion continued until 4/10, at which point the errors became very persistent.
Disabling the input and re-enabling it seemed to bring some temporary relief, but the issue persists. Sometimes, based on the log, you can see the skiptoken successfully incrementing until it randomly hits the 401. Debug logs show successful retrieval of the access token, and Azure AD logs confirm that the app isn't getting any auth failures.
Ideally the consistency of the Microsoft endpoint improves, but maybe the Splunk Add-On for Microsoft Office 365 needs a better method to catch this error and retry instead of starting the collection again at the first message after every failure.
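To make that suggestion concrete, here is a rough sketch of retrying a failed page in place rather than restarting the whole collection. It is not how the add-on works internally (that is not known here); `fetch_page`, `TransientPageError`, and `collect_pages` are hypothetical names, and `fetch_page(token)` is assumed to return the page's records plus the next skiptoken, or `None` on the last page.

```python
import time


class TransientPageError(Exception):
    """Hypothetical error raised by fetch_page for retryable failures,
    such as the intermittent 401."""


def collect_pages(fetch_page, start_token=None, max_retries=3,
                  delay_seconds=1.0, sleep=time.sleep):
    """Yield (token, records) per page, retrying a failed page in place.

    The yielded token is the one used to fetch that page, so a caller
    can store it as a checkpoint and later resume via start_token.
    """
    token = start_token
    while True:
        for attempt in range(max_retries + 1):
            try:
                records, next_token = fetch_page(token)
                break
            except TransientPageError:
                # Give up only after the last allowed retry for this page.
                if attempt == max_retries:
                    raise
                sleep(delay_seconds * (attempt + 1))
        yield token, records
        if next_token is None:
            return
        token = next_token
```

With this shape, a 401 on one page costs only a few retries of that page, and pages already collected are not fetched again.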
Hi @bbour53 ,
Thank you for sharing your temporary workaround. I've been trying it out since I'm facing this issue as well, but it doesn't seem to work every time.
Can you kindly confirm that the temp workaround only consists of disabling the message trace input from Splunk Add-On for Microsoft 365 via Web GUI and then re-enabling it instantly or is there a wait time until we re-enable the input?
It worked once or twice for me. What I saw was that it would make two API calls every 300 seconds: the first would fetch messages, but the second would get the 401 Client Error. Now even the first one isn't getting any messages.
Just to update - I've been trying the workaround of disabling the input and then enabling the input for Message Trace Logs - initially it seemed to work but now it's not working at all.
Has anyone found any other workaround that seems to work?
Anyone have an update on this? We have this issue when using Message Trace inside the Add on for Microsoft Office 365. But the older app Splunk Add On for Microsoft Office 365 Reporting Web Service works.
I saw the same thing yesterday when I went to reference the status. My organization has a ticket open with Microsoft and our O365 team forwarded me the update from a Microsoft engineer that they could replicate the issue. If our ticket bears fruit and I have a worthwhile update for everyone I'll post it here.
Update from me - I just met with my Splunk support team for something else and made a mention about this. Apparently Splunk is aware and has an internal Jira open. It would appear MS changed something on their side that broke Splunk's TA. I'm going to open a case with Splunk anyway to help with visibility, but this might be a case of having to wait for the two companies to sort it out and fix it >.<
By way of a further update - I logged a case with Splunk support today and got the following response:
Thank you for submitting the case. We are aware of this issue and I want to let you know that we have received many cases of the same issue from other customers as well.
We have reproduced & encountered the same error and suspect an issue with the API, not with the add-on. I would request that you allow us some time to validate the issue from the Microsoft Azure end to know the API behaviour.
In order to expedite the case, we have escalated the issue to our internal team, and the add-on engineering team has started the conversation with Microsoft about the 401 client error (message trace failure). Rest assured that I will keep you informed of any further updates on this matter.
I have also associated the internal Jira ticket for this issue with your support case so now even your account owner can check the status for any update from our internal add-on engineering team regarding this.
If you have any other questions, kindly let me know.
I got an update today on the ticket I have open:
Thank you for your patience while we have been working to resolve the issue you reported. We would like to assure you that our engineering team and Microsoft team have been conducting a thorough investigation into the problem.
We are pleased to inform you that the MS team have successfully reproduced the issue locally and is currently implementing patches to prevent reported issues.
As per the latest update from the Microsoft side, They have a new patch being tested and will be rolled out today or the day by tomorrow to Production. ETA for this fix is to be available very soon. If there are any delays, we will update you as soon as possible.
We kindly request your patience as we work to implement this fix. Rest assured that we will update you as soon as we have further information from our engineering team.
Thank you for your understanding and support.
In addition, my sales engineer indicated that they are having potential success by rolling back to the updated, OAuth version of the Microsoft add-on versus using the Splunk add-on:
https://splunkbase.splunk.com/app/3720 instead of
3720 actually indicates that users should migrate to 4055, but perhaps that's bad advice at the moment. I'm optimistic that they'll fix the 4055 add-on as well though. For now, I'm going to test version 2.0.1 of that first add-on link and I'll report back with my findings.
I'm getting the same results as well - the official Splunk version of this ingest app is working for me. I didn't adjust anything or update to the just-released newer version (4.3) - I have 4.2.1 installed. A brief test in my non-production worked and pulled in a bunch of trace logs.