We've installed and configured the Azure add-on, and while it works, the various inputs seem to hang once or twice a day. For us, this is most noticeable with the device and user inputs (sourcetypes azure:aad:device and :user). I've set the add-on to DEBUG level logging, but there's nothing especially obvious.
Environment: This add-on is running on a heavy forwarder that exists almost exclusively to run API-based add-ons like this. It's a relatively untaxed RHEL 8 VM. We're using version 3.2.0, the now-current version of the add-on. (We had the same problem with 3.1.1 at least. I'm not sure how far back the problem goes, but it's been an intermittent issue for at least a few months.)
First: the add-on has what to me looks like a bug in the interval setting. We've set the interval to "300" -- this is labeled as the number of seconds between queries, but the logs show the queries are running closer to every 300 milliseconds. If we set it lower than 300, the time between queries seems to shorten as you would expect, but setting it higher than 300 doesn't seem to work. (We've tried setting it to values like 5000, to see if we could trick the add-on into pulling every 5 seconds, but that didn't do what we hoped.)
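One way to put a number on the actual spacing is to pull the timestamps of the Graph requests out of the add-on's debug log and print the gap between consecutive ones. This is only a sketch: it assumes the log format shown in this post (date, then time of day with comma-separated milliseconds, then the connectionpool.py:_make_request marker), and the function name request_gaps_ms is made up for illustration.

```shell
#!/bin/bash
# Print the gap, in milliseconds, between consecutive Graph API requests
# found in an add-on debug log read from stdin.
# Assumes lines shaped like:
#   2022-01-20 16:24:39,314 DEBUG ... file=connectionpool.py:_make_request:461 | ...
request_gaps_ms() {
  awk '
    /connectionpool\.py:_make_request/ {
      # $2 is the time of day, e.g. 16:24:39,314
      split($2, t, /[:,]/)
      ms = ((t[1] * 60 + t[2]) * 60 + t[3]) * 1000 + t[4]
      if (seen) {
        gap = ms - prev
        if (gap < 0) gap += 86400000   # the log rolled past midnight
        print gap
      }
      prev = ms
      seen = 1
    }
  '
}
```

Feeding it the input's log (for example, request_gaps_ms < your-input-log | sort -n | tail) shows the largest gaps, which makes it easy to compare the real spacing against the configured interval.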
More important, though, is that the input periodically hangs. The normal behavior looks like this:
2022-01-20 16:24:39,314 DEBUG pid=3938342 tid=MainThread file=connectionpool.py:_make_request:461 | https://graph.microsoft.com:443 "GET /v1.0/devices/?$skiptoken=(token 1) HTTP/1.1" 200 None
2022-01-20 16:24:39,476 DEBUG pid=3938342 tid=MainThread file=base_modinput.py:log_debug:288 | _Splunk_ AAD devices nextLink URL (@odata.nextLink): https://graph.microsoft.com/v1.0/devices/?$skiptoken=(token 2)
2022-01-20 16:24:39,477 DEBUG pid=3938342 tid=MainThread file=base_modinput.py:log_debug:288 | _Splunk_ Getting proxy server.
2022-01-20 16:24:39,477 INFO pid=3938342 tid=MainThread file=setup_util.py:log_info:117 | Proxy is not enabled!
2022-01-20 16:24:39,479 DEBUG pid=3938342 tid=MainThread file=connectionpool.py:_new_conn:975 | Starting new HTTPS connection (1): graph.microsoft.com:443
2022-01-20 16:24:39,741 DEBUG pid=3938342 tid=MainThread file=connectionpool.py:_make_request:461 | https://graph.microsoft.com:443 "GET /v1.0/devices/?$skiptoken=(token 2) HTTP/1.1" 200 None
Basically, the add-on makes a request with a given token; part of the response is a new token; then, (interval) milliseconds later, it uses that new token and the cycle starts again.
Eventually, though, the add-on gets to the fifth line in the above (where it's starting a new connection), and... that's it. The add-on doesn't do anything until one of the Splunk admins gets the alert we set up, that says "hey there haven't been any new events of sourcetype X in index Y for a couple hours, maybe you should take a look". Sometimes, the inputs will hang just a few hours after a restart; sometimes they work just fine for weeks at a time.
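To tell this particular hang apart from an input that is merely quiet, it can help to check whether the newest line in the input's debug log is the connection-start message. A minimal sketch, assuming the log format shown above; the function name stalled_on_connect is hypothetical:

```shell
#!/bin/bash
# Exit 0 if the newest line of the given debug log is the
# "Starting new HTTPS connection" entry (logged from connectionpool.py:_new_conn),
# which is the last thing written before the hang described above.
stalled_on_connect() {
  tail -n 1 "$1" | grep -q 'connectionpool\.py:_new_conn'
}
```

A healthy input also passes through this state for a fraction of a second on every request, so the check only means something when the log has also been idle for a while.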
Logging into the heavy forwarder, and toggling the input to "Disabled" and right back to "Enabled" clears the issue. Presumably disabling the input kills off the underlying Python script, then re-enabling it launches a fresh instance.
We've thought about scripting a regular restart of this add-on, but there doesn't seem to be a way in the CLI to do so, short of restarting the whole heavy forwarder. That's a really big hammer for a relatively small nail, so it's not our first choice. And given that the add-on doesn't hang on any predictable schedule, we don't think it's worth the trade-off. (Plan 5 or plan 6 would probably be building a new heavy forwarder for JUST the Azure add-on, so that a scheduled restart of Splunk as a whole wouldn't impact any other add-ons. But since building a new machine incurs costs to our team, and it's still an inelegant solution, it's probably the last-resort plan.)
Aside from setting the add-on to "DEBUG," is there anything else I can do within the add-on to debug this? Anyone had problems like this before, and if you have, how did you work around them? Is the "interval" thing really a bug, and if so to whom should I report it?
Unrelated: Our sales engineer says he's been working with the staff within Splunk that developed the add-on, and allegedly an updated version of the add-on is coming Real Soon Now that addresses this issue. No ETA beyond "soon" unfortunately.
I have a workaround, but it probably wouldn't work if you're using Splunk Cloud.
I wrote a shell script that runs once a minute, via the system cron scheduler, on my heavy forwarder. That shell script tails the logs for the various inputs, and if one of them hasn't had any new entries in the past minute, it uses the REST API to disable and re-enable that input.
Here's a slightly simplified example for a single input:
#!/bin/bash
LOGFILE="/path/to/the/input.log"   # placeholder: debug log of the input to watch
API_USER="api_user"                # placeholder: Splunk account allowed to toggle inputs
API_PASS="api_password"            # placeholder
## You'll have to edit the above based on where you installed Splunk and how you named your inputs
# Epoch seconds of the newest log entry (the first two fields are its date and time)
LAST_ENTRY=$(date +%s --date="$(tail -n 1 "$LOGFILE" | cut -d ' ' -f 1-2)")
DIFF=$(( $(date +%s) - LAST_ENTRY ))
if [[ $DIFF -gt 60 ]]; then
  echo "Attempting auto restart of AAD user input on HF." | mailx -r email@example.com -s "Splunk HF AAD User input notice" firstname.lastname@example.org 2>/dev/null
  logger -p local0.warn "Attempting automatic restart of Splunk MS AAD User input."
  curl --silent -X POST --stderr - -k -u "$API_USER:$API_PASS" https://localhost:8089/servicesNS/nobody/TA-MS-AAD/data/inputs/MS_AAD_user/AAD_Users_Prod/disable >/dev/null
  curl --silent -X POST --stderr - -k -u "$API_USER:$API_PASS" https://localhost:8089/servicesNS/nobody/TA-MS-AAD/data/inputs/MS_AAD_user/AAD_Users_Prod/enable >/dev/null
fi
It's not ideal for a number of reasons:
* It just feels Weird to use something outside of Splunk to monitor an input within Splunk
* When the input hangs, there still is a 1-2 minute gap in the logs
* It doesn't actually FIX the problem, it's just a workaround to reduce the amount of data lost
Really, though, the key part is that you can disable and re-enable individual inputs in that add-on via the REST API: POST to the input's /disable endpoint and then to its /enable endpoint, as the two curl calls in the script above do.
Someone more clever than I might be able to create a monitor within Splunk that looks at the input's debug logs, notices the lack of new events in that file, and toggles the input off and back on. I grew up with shell scripts, so this was much quicker for me to implement.
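For anyone who wants to watch several inputs from the same cron job, the same idea can be split into small functions. This is a rough sketch rather than the exact script I run: the function names are made up, API_USER and API_PASS are assumed to be set elsewhere, the 60-second threshold is just the value used above, and the watch list format (one "logfile input_type input_name" triple per line) is my own invention. The URL layout is copied from the curl calls above.

```shell
#!/bin/bash
# Seconds since the newest entry in a log whose lines start with
# "YYYY-MM-DD HH:MM:SS,mmm". Needs GNU date (as on RHEL).
# An optional second argument overrides "now" (epoch seconds).
log_age_seconds() {
  local logfile=$1 now=${2:-$(date +%s)} stamp last
  stamp=$(tail -n 1 "$logfile" 2>/dev/null | cut -d ' ' -f 1-2)
  stamp=${stamp%,*}                 # drop the ",mmm" milliseconds
  [[ -n $stamp ]] || return 1       # missing or empty log
  last=$(date +%s --date="$stamp") || return 1
  echo $(( now - last ))
}

# Disable, then re-enable one input through splunkd's management port.
# $1 = input type (e.g. MS_AAD_user), $2 = input name (e.g. AAD_Users_Prod)
toggle_input() {
  local base="https://localhost:8089/servicesNS/nobody/TA-MS-AAD/data/inputs/$1/$2"
  curl --silent -k -u "$API_USER:$API_PASS" -X POST "$base/disable" >/dev/null
  curl --silent -k -u "$API_USER:$API_PASS" -X POST "$base/enable" >/dev/null
}

# Read "logfile input_type input_name" triples from stdin and toggle
# every input whose log has been idle for more than 60 seconds.
check_inputs() {
  local logfile type name age
  while read -r logfile type name; do
    age=$(log_age_seconds "$logfile") || continue
    if (( age > 60 )); then
      logger -p local0.warn "Restarting stalled input $type/$name (idle ${age}s)"
      toggle_input "$type" "$name"
    fi
  done
}
```

The cron job then boils down to sourcing these functions and running check_inputs with the watch list on stdin. Unreadable or empty logs are skipped rather than treated as stalled, so a typo in the list won't cause an input to be toggled every minute.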
Interesting, and thanks for sharing, though you're right that this is not ideal for cloud. In the past when we've had this issue, a longer-term fix was to clone the input; the clone would be good for 3-12 months before acting up again, whereas disabling and re-enabling only ever lasted 1-7 days before giving out again.
Good to hear that an updated version may be in the works. Do they plan on actually supporting this?
Never thought about cloning the input, but I've had problems since day one, so I'm not sure it'd help me out.
Personally, I doubt there will be a change in how this add-on is (not) supported, but everything I know about it is second-hand or third-hand.