I downloaded and installed the Teams Add-on for Splunk and it worked for a while, until we started seeing a lot of 404 errors like the one below:
ERROR pid=14248 tid=MainThread file=base_modinput.py:log_error:309 | Error getting callRecord data: 404 Client Error: Not Found for url: https://graph.microsoft.com/v1.0/communications/callRecords/<call ID>?$expand=sessions($expand=segments)
I found out that the call ID had been removed from the Teams CDR for some reason, so when Splunk tried to download the CDR it got a 404, which is understandable.
However, the Teams Add-on does not remove the call ID from the webhook directory in this scenario. The call ID stays there forever, and Splunk keeps retrying the download and failing. The result is a huge number of call IDs that never get cleaned up and a massive number of error messages in the log.
Furthermore, I found that when too many call ID files accumulate in the webhook directory (~60K), the Add-on hits a "401 Unauthorized" error when downloading the CDR and stops working soon afterward. After restarting Splunk, the Add-on works again, then stops the moment it hits the 401 error again. I wrote a script to manage the load in the webhook folder, so this is OK for now, but it would be preferable for the Add-on to have load management built in.
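For anyone who needs something similar, here is a minimal sketch of that kind of housekeeping script in Python. It is not the script mentioned above and not part of the Add-on. It assumes the webhook directory holds one plain file per call ID, and the function name, the age limit and the file cap are made up for illustration.

```python
import os
import time


def prune_webhook_dir(directory, max_files=10000,
                      max_age_seconds=7 * 24 * 3600, now=None):
    """Delete stale call ID files, then the oldest ones beyond max_files.

    Assumption: every regular file in `directory` is a pending call ID.
    Returns the names of the files that were removed.
    """
    now = time.time() if now is None else now
    entries = []
    for entry in os.scandir(directory):
        if entry.is_file():
            entries.append((entry.stat().st_mtime, entry.name))
    entries.sort()  # oldest first

    removed = []
    kept = []
    for mtime, name in entries:
        if now - mtime > max_age_seconds:
            # Too old to be worth retrying.
            os.remove(os.path.join(directory, name))
            removed.append(name)
        else:
            kept.append(name)

    # Still over the cap: drop the oldest remaining files.
    excess = max(len(kept) - max_files, 0)
    for name in kept[:excess]:
        os.remove(os.path.join(directory, name))
        removed.append(name)
    return removed
```

Run it from cron or Task Scheduler against the Add-on's webhook directory, with limits well below the ~60K files where the 401 errors started. Keep in mind that any call ID removed this way will never be downloaded.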
Hopefully the author of this Add-on will add this error handling soon. In the meantime, if anyone knows how to get around this 404 issue, please share.
Thanks a lot!
Hello,
The reason is that the call ID is no longer available on the Azure side, but the MS Teams add-on still tries to fetch information for it.
Currently there is no way around this except deleting the local KV store lookup data.
Please try the command below and see whether it improves things.
splunk clean kvstore -app TA_MS_Teams -collection TA_MS_Teams_checkpointer
I don't think this has anything to do with the KV store.
The problem is clear: the add-on doesn't handle the 404 error properly.
Flow in normal situation:
check webhook folder for call ID --> download call ID --> delete call ID from webhook folder --> proceed with next ID
Flow in 404 situation:
check webhook folder for call ID --> download call ID (fails with 404) --> raise error. It stops there, and the call ID that returned 404 is never cleaned up from the webhook folder.
So I think the author of the app just needs to improve the error handling so that a call ID that returns 404 is removed from the webhook folder, and the problem will be solved.
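To make the proposed change concrete, here is a minimal Python sketch of the two flows with the 404 case handled. It is not the add-on's actual code. The names process_call_ids, fetch and handle are hypothetical, and it assumes each file name in the webhook folder is a call ID.

```python
import os


def process_call_ids(directory, fetch, handle):
    """Process every call ID file in `directory`.

    fetch(call_id) -> (status_code, body) is a hypothetical downloader.
    handle(body) receives each successfully downloaded record.
    Returns a dict mapping each call ID to the action taken.
    """
    results = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        status, body = fetch(name)
        if status == 200:
            # Normal flow: ingest the record, then delete the call ID file.
            handle(body)
            os.remove(path)
            results[name] = "ingested"
        elif status == 404:
            # The record is gone upstream, so retrying cannot succeed.
            os.remove(path)
            results[name] = "dropped"
        else:
            # Other failures (401, 429, 5xx) may be transient: keep the file.
            results[name] = "retry"
    return results
```

The only difference from the current behaviour described above is the 404 branch: the file is deleted rather than left behind to be retried forever.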
I totally agree. This app does not handle 400 or 404 errors. The developer @jconger is top notch though; I have met him. It's just that this one app has never really worked correctly.
We have to reset the inputs almost daily. By reset, I mean we create a new subscription input, or we disable and re-enable the call record or user report inputs.
However, I did perform the KV store clean on both of our heavy forwarders (behind a load balancer).
In a load-balanced environment, the webhook, call record, and user report inputs are set up on both HFs, but the subscription is set up on only one.
This worked for me, for now:
1. Disable all inputs.
2. Clean the KV store:
splunk clean kvstore -app TA_MS_Teams -collection TA_MS_Teams_checkpointer
3. Enable the inputs in this order: webhook, subscription, call record, user report.
we have data again...for now.
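If you want to script that sequence, one option is Splunk's REST API on the management port. The sketch below only builds the ordered list of calls. It assumes the add-on's inputs are exposed under data/inputs/<input type>/<name>, and it swaps the splunk clean kvstore command for a REST DELETE on the collection. The input type and input names in the usage note are placeholders, so check your own inputs.conf.

```python
from urllib.parse import quote


def build_reset_plan(base_url, app, collection, inputs_in_enable_order):
    """Return (method, url) steps: disable inputs, clear checkpoints, enable.

    inputs_in_enable_order is a list of (input_type, input_name) pairs.
    The input types are assumptions; use the stanza names from inputs.conf.
    """
    def input_url(kind, name, action):
        return "{}/servicesNS/nobody/{}/data/inputs/{}/{}/{}".format(
            base_url, app, quote(kind, safe=""), quote(name, safe=""), action)

    # Step 1: disable every input.
    plan = [("POST", input_url(kind, name, "disable"))
            for kind, name in inputs_in_enable_order]
    # Step 2: delete all records in the checkpoint collection
    # (used here in place of the `splunk clean kvstore` CLI command).
    plan.append(("DELETE", "{}/servicesNS/nobody/{}/storage/collections/data/{}".format(
        base_url, app, quote(collection, safe=""))))
    # Step 3: enable the inputs in the given order.
    plan += [("POST", input_url(kind, name, "enable"))
             for kind, name in inputs_in_enable_order]
    return plan
```

For example, build_reset_plan("https://localhost:8089", "TA_MS_Teams", "TA_MS_Teams_checkpointer", [("teams_webhook", "my_webhook"), ("teams_subscription", "my_subscription")]) returns five steps. Send them in order with curl or any HTTP client, authenticated as a user allowed to edit inputs.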
Hello,
I have the same issue, and the remediation you shared is correct. But in my case the app runs flawlessly for about 2 days, then the webhook fails > subscription fails > no more call records.
Maybe somebody has found a way to make this app stable. I played with the intervals, but that didn't help. I will try disabling and enabling the webhook and subscription inputs from the CLI with a cron job, but that is so "homemade"...
The app is installed on an HWF and it runs when it runs... I have no idea why it fails randomly...
The bad part is that once it stops collecting call records, they are lost. There is no way to fetch the "historical" logs...
Appreciate your advice...
I have the same problem. The webhook works for a couple of days and then fails. Did the cron job to restart the inputs work as a workaround?
Hi,
I still have no 100% working workaround. I tried creating an alert on my search head that, when the subscription fails, triggers a curl script to disable and re-enable the inputs. I learned two important things there:
1. Order
Disable the webhook, then the subscription input, then the call record input. Enable the webhook, then enable the subscription. This updates the subscription, but sometimes it doesn't work correctly. In that case, clear the KV store first, at which point the webhook input has exited. So disable the webhook again, enable it, then enable the call record input.
Done manually, the method above solves the issue every time.
2. Scripted disable/enable works only about half the time. It seems the call record input is not reset correctly by the script.
So currently I have an alert to myself: "Go monkey and reset it manually" 🙂
Thanks for the update.
I am familiar with Windows and PowerShell scripting. The Splunk instance is not managed by me, and the person who manages it has said he does not know how to script restarting the inputs and clearing the KV store.
I would like a script that runs every night at midnight and completes the steps above.
Can you provide some details on how to accomplish this in Splunk?
Any help would be greatly appreciated.