We've installed and configured the Azure add-on, and while it works, the various inputs seem to hang once or twice a day. For us, this is most noticeable with the device and user inputs (sourcetypes azure:aad:device and azure:aad:user). I've set the add-on to DEBUG level logging, but there's nothing especially obvious.

Environment: This add-on is running on a heavy forwarder that exists almost exclusively to run API-based add-ons like this. It's a relatively untaxed RHEL 8 VM. We're using version 3.2.0, the now-current version of the add-on. (We had the same problem with 3.1.1 at least. I'm not sure how far back the problem goes, but it's been an intermittent issue for at least a few months.)

First: the add-on has what looks to me like a bug in the interval setting. We've set the interval to "300". This is labeled as the number of seconds between queries, but the logs show the queries running closer to every 300 milliseconds. If we set it lower than 300, the time between queries seems to shorten as you would expect, but setting it higher than 300 doesn't seem to work. (We've tried setting it to values like 5000, to see if we could trick the add-on into pulling every 5 seconds, but that didn't do what we hoped.)

More important, though, is that the input periodically hangs. The normal behavior looks like this:

2022-01-20 16:24:39,314 DEBUG pid=3938342 tid=MainThread file=connectionpool.py:_make_request:461 | https://graph.microsoft.com:443 "GET /v1.0/devices/?$skiptoken=(token 1) HTTP/1.1" 200 None
2022-01-20 16:24:39,476 DEBUG pid=3938342 tid=MainThread file=base_modinput.py:log_debug:288 | _Splunk_ AAD devices nextLink URL (@odata.nextLink): https://graph.microsoft.com/v1.0/devices/?$skiptoken=(token 2)
2022-01-20 16:24:39,477 DEBUG pid=3938342 tid=MainThread file=base_modinput.py:log_debug:288 | _Splunk_ Getting proxy server.
2022-01-20 16:24:39,477 INFO pid=3938342 tid=MainThread file=setup_util.py:log_info:117 | Proxy is not enabled!
2022-01-20 16:24:39,479 DEBUG pid=3938342 tid=MainThread file=connectionpool.py:_new_conn:975 | Starting new HTTPS connection (1): graph.microsoft.com:443
2022-01-20 16:24:39,741 DEBUG pid=3938342 tid=MainThread file=connectionpool.py:_make_request:461 | https://graph.microsoft.com:443 "GET /v1.0/devices/?$skiptoken=(token 2) HTTP/1.1" 200 None

Basically, the add-on makes a request with a given token, and part of the response is a new token. Then, (interval) milliseconds later, it uses that new token and the cycle starts again.

Eventually, though, the add-on gets to the fifth line in the above (where it's starting a new connection), and... that's it. The add-on doesn't do anything until one of the Splunk admins gets the alert we set up that says "hey, there haven't been any new events of sourcetype X in index Y for a couple hours, maybe you should take a look". Sometimes the inputs will hang just a few hours after a restart; sometimes they work just fine for weeks at a time.

Logging into the heavy forwarder and toggling the input to "Disabled" and right back to "Enabled" clears the issue. Presumably disabling the input kills off the underlying Python script, and re-enabling it launches a fresh instance.

We've thought about scripting a regular restart of this add-on, but there doesn't seem to be a way to do so from the CLI, short of restarting the whole heavy forwarder. That's a really big hammer for a relatively small nail, so it's not our first choice. And given that the add-on doesn't hang on any predictable schedule, we don't think it's worth the trade-off. (Plan 5 or plan 6 would probably be building a new heavy forwarder for JUST the Azure add-on, so that a scheduled restart of Splunk as a whole won't impact any other add-ons. But since building a new machine incurs costs to our team, and it's still an inelegant solution, it's probably the last-resort plan.)
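For reference, the request cycle described above (request a page, read @odata.nextLink from the response, request the next page) can be sketched roughly as below. This is NOT the add-on's actual code. The function names fetch_page and iter_items are made up for illustration, and the sketch uses the standard-library urllib rather than whatever HTTP client the add-on uses. It assumes, without confirmation, that the hang is a request blocking on a stalled connection; if that is the case, an explicit timeout like the one shown would turn a silent hang into an exception that shows up in the log.

```python
import json
import urllib.request


def fetch_page(url, token, timeout=30):
    """Hypothetical helper: GET one page of Graph results.

    The explicit timeout is the point of the sketch: without it, a
    stalled socket can block the caller indefinitely.
    """
    req = urllib.request.Request(url, headers={"Authorization": "Bearer " + token})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def iter_items(first_url, get_page, max_pages=10000):
    """Yield every item, following @odata.nextLink until it is absent.

    get_page is any callable that takes a URL and returns the decoded
    JSON page (a dict with "value" and, optionally, "@odata.nextLink").
    max_pages is an illustrative safety cap, not something from the add-on.
    """
    url = first_url
    pages = 0
    while url and pages < max_pages:
        page = get_page(url)
        yield from page.get("value", [])
        url = page.get("@odata.nextLink")  # missing on the last page
        pages += 1
```

Passing the page fetcher in as a parameter keeps the paging loop separate from the network call, so the loop can be exercised with canned pages and the real fetcher can be swapped in with something like lambda u: fetch_page(u, token).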
Aside from setting the add-on to "DEBUG," is there anything else I can do within the add-on to debug this? Has anyone had problems like this before, and if so, how did you work around them? Is the "interval" thing really a bug, and if so, to whom should I report it?
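One way to put numbers on both symptoms using only the add-on's own log is sketched below: it measures the time between consecutive GET requests (to check the interval behavior) and the time since the last log entry (to spot a stall). It assumes the log line format shown in the excerpt above, with a leading "YYYY-MM-DD HH:MM:SS,mmm" timestamp. The function names are made up for illustration, and the lines would need to be read from wherever the add-on writes its log.

```python
import re
from datetime import datetime

# Leading timestamp as it appears in the excerpt, e.g. "2022-01-20 16:24:39,314"
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) ")


def parse_ts(line):
    """Return the line's leading timestamp, or None if it has none."""
    m = TS_RE.match(line)
    if m is None:
        return None
    return datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S,%f")


def request_gaps(lines):
    """Seconds between consecutive log lines that record a GET request."""
    stamps = [parse_ts(line) for line in lines if '"GET ' in line]
    stamps = [s for s in stamps if s is not None]
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]


def seconds_since_last_entry(lines, now):
    """Seconds between the newest timestamped line and `now` (None if no lines)."""
    stamps = [s for s in map(parse_ts, lines) if s is not None]
    if not stamps:
        return None
    return (now - stamps[-1]).total_seconds()
```

Applied to the two GET lines in the excerpt, request_gaps gives a single gap of about 0.427 seconds. A large value from seconds_since_last_entry while the input is enabled would correspond to the stall described above.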