On the basis of the data I see from our tenant the add on is not retrieving all of the sign in records when compared with the Azure Portal sign in page.
The number of records loaded appears correlated with the polling frequency set. I have tried 300s (5m), 600s (10m) and 900s (15m). In each case the number of underlying events that the add on loads appears different. The effect is quite marked.
Query for the chart above:
index=liquid_it sourcetype="ms:aad:signin"
| timechart span=5m count
| eval tpm=round(count / 5, 2)
| fields - count
I set up an alternate ingest pipeline: AAD --> Event Hub --> Azure function --> Splunk HEC
That reliably produces a full set of the events in the graph the new ingest is "aad_audit" and the reporting add-in is shown as "ms:aad:signin". The difference is quite marked.
Here's a simple fix to the app if developer is watching this thread - in the api call add '+and+signinDateTime+le+(current time - delay minutes)'. So the new filter query will look like:
&$filter=signinDateTime+gt+(check point time)+and+signinDateTime+le+(current time - delay minutes)
With delay minutes set to 5, this will get 99% of the data considering MS's less than 5 minute latency for 99% of events. And if you're making the change, please let the user control the delay minutes.
This Microsoft article https://docs.microsoft.com/en-us/azure/active-directory/reports-monitoring/reference-reports-latenci... talks about the latency for sign-ins and audit logs in Azure. The latency is between 2 to 5 mins. My understand would be that the logs will be available in Azure portal (also ready for the API to pull) within 5 mins of the originating event. So I think setting the polling frequency to >300s should be OK. However, I have concern about this Add-on using the largest siginDateTime/activityDateTime seen during the query as the checkpoint timestamp. My reasoning is that Azure logs may come in different order, and we will miss some events came in late but their originating event timestamps are before the checkpoint.
I have the following scenario in mind:
As suggested by jconger, the "Azure Monitor Add-on for Splunk" may be the better way to collect near real time from the an Event Hub.
FYI... I have been trying to collect Azure AD logs (sign-in, audit), Azure AD risk events, as well as Office 365 logs into Splunk. I feel in general the latencies in the Microsoft reporting infrastructure causing lots of confusion/issues on how we can properly schedule our data ingestion without incomplete/duplicate data problem. It makes it harder to use the data for near real time monitoring/reporting solution.
Completely agree with your statement:
the latencies in the Microsoft reporting infrastructure causing lots of confusion/issues on how we can properly schedule our data ingestion without incomplete/duplicate data problem
Version 1.0.3 of the Azure AD Reporting Add-on has some data collection improvements that should address your issue. Also Azure AD logs can be sent to Event Hubs now. The Azure Monitor Add-on for Splunk can be used to collect them from an Event Hub.
Thanks for the suggestion, will try this out. Have set up the event hub and can see activity. Will be interesting to do a side-by-side and see if I get a more complete set of events via this route than the API-based reporting add-on.
Ok, the results are in and on this basis I can see that the Azure AD Reporting Add-on is missing events.
I set up an alternate ingest pipeline: AAD --> Event Hub --> Azure function --> Splunk HEC
That reliably produces more events than the reporting add-on.
Ok, so there is some relationship between the frequency of polling and how many events get ingested. The more frequent the polling the fewer events (the more missing events). The less frequent the polling the more events get ingested for any given period.
I changed the polling from 300s to 600s and the number of events per minute went up by a factor of 3.
Have you tried the Microsoft cloud services app? It may do what you’re looking for too.
Thanks, will try that. Initially I did not think it did the sign-ins, but on closer reading it may do.
Some further details:
Needless to say this is a deal-breaker. If the audit in Splunk is not complete it is all but useless.
Not sure how to progress in diagnosing this.
Changed the polling frequency to 600s to see if that makes a difference.