Good afternoon!
This week we upgraded a Splunk deployment from 9.4.x to 10.0.3, and whilst everything seemingly went well, we came to work the next day to a number of upset users who kept getting random 503 errors and the Oops page.
I was able to replicate this and found that approximately every 5 minutes certain pages in the Splunk UI gave this frustrating error! Further analysis showed that 30 seconds after each 503 error, the following appeared in the _internal logs:
Splunkd daemon is not responding: ('Error connecting to /services/apps/local: The read operation timed out')
After querying https://splunkserver:8089/services/apps/local directly, we found that it usually returned in 10-30 milliseconds but would periodically take up to 180 seconds to respond.
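To quantify the slow responses, a small timing loop can poll the management endpoint. This is only a sketch, not anything Splunk ships: `splunkserver` is a placeholder host, certificate verification is disabled because the management port commonly presents a self-signed certificate, and in practice you would add session authentication.

```python
import ssl
import time
import urllib.error
import urllib.request

# The management port usually presents a self-signed certificate, so skip verification.
_CTX = ssl.create_default_context()
_CTX.check_hostname = False
_CTX.verify_mode = ssl.CERT_NONE


def probe(url: str, timeout: float = 200.0) -> float:
    """Time one GET of url in seconds; an HTTP error (e.g. 401) still counts as a response."""
    start = time.monotonic()
    try:
        urllib.request.urlopen(url, timeout=timeout, context=_CTX).read()
    except urllib.error.HTTPError:
        pass  # the server answered; only the elapsed time matters here
    return time.monotonic() - start


# Example loop (hypothetical host; the endpoint normally requires authentication):
#   for _ in range(20):
#       print(f"{probe('https://splunkserver:8089/services/apps/local'):.3f}s")
#       time.sleep(15)
```

With this, the normal fast responses and the periodic ~180 second outliers stand out clearly in the printed timings.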
I later found errors in _internal referencing the following URL: https://cdn.splunkbase.splunk.com/public/report/apps_dump.json. This environment does not have internet access and has allowInternetAccess=false in server.conf.
Following a chat with some other users in the Community Slack, we found that others have had the same issue on different versions. A setting was added in the latest releases (9.4.8 / 10.0.3 / 10.2.0) [see https://github.com/livehybrid/splunk-spec-files/blob/master/server.conf#L255]: splunkbaseAppsDumpUrl, under [applicationsManagement] in server.conf, which contains that SplunkBase URL.
These settings were not in previous 9.4/10.0 releases and I couldn't find any reference to them in the Release Notes either; however, setting them to a blank value solved the problem for us:
# server.conf
[applicationsManagement]
splunkbaseAppsDumpUrl=
archivedSplunkbaseAppsDumpUrl=

Ultimately we believe this to be a bug introduced in the latest releases, and we have raised it with Support to see if it gets added to the Known Issues for 10.0.3 (https://help.splunk.com/en/splunk-enterprise/release-notes-and-updates/release-notes/10.0/known-issu...)
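After applying the workaround, you can check whether the failed fetch attempts stop appearing in _internal. A minimal sketch, assuming a default splunkd.log location and simply matching on the CDN URL quoted earlier:

```python
from pathlib import Path

# The SplunkBase apps-dump URL seen in the _internal errors.
CDN_URL = "cdn.splunkbase.splunk.com/public/report/apps_dump.json"


def count_cdn_errors(log_text: str) -> int:
    """Count log lines that reference the SplunkBase apps-dump CDN URL."""
    return sum(1 for line in log_text.splitlines() if CDN_URL in line)


if __name__ == "__main__":
    # Assumed default path; adjust for your $SPLUNK_HOME.
    log = Path("/opt/splunk/var/log/splunk/splunkd.log")
    if log.exists():
        print(count_cdn_errors(log.read_text(errors="replace")))
```

If the count stops growing after a restart with the blank settings, the workaround has taken effect.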
Hopefully if you come across this issue you can find this fix! 🙂
https://splunk.my.site.com/customer/s/article/Error-503-Splunkd-daemon-cannot-be-reached-by-Splunk-W... has been updated to say that this is fixed in v10.2.2. However!
I tried upgrading our dev environment from v10.2.1 to v10.2.2. I first removed the workaround (setting splunkbaseAppsDumpUrl / archivedSplunkbaseAppsDumpUrl to blank) before upgrading.
The first thing I saw when logging into the web UI of a search head was a 503 error. So it seems it is not resolved in v10.2.2, at least not here.
Did you clear all the caches along the way (load balancers, proxies, browsers, etc.)?
I updated 9.3.9 to 9.3.11, but I left the workaround in place.
Later I'll try removing it in a test environment.
@verbal_666 Two of the search heads were instances I'd never accessed through the web UI. Upon login they both gave the classic 503 response after a minute of spinning, so I think I can rule out caching or any local issue.
So I think we need to raise a new issue with the Splunk devs to check the bug again 🤕
Meanwhile, I'm leaving the workaround on my 9.3.11 nodes 🤧
I raised this issue with Support after testing 9.4.10; the fixed version requires a different change to server.conf:
# server.conf
[applicationsManagement]
allowInternetAccess = false
Hi @clatham
Thanks for letting us know. With our original issue (introduced in 9.4.8) we already had this value set to false, so it sounds like this value is now being honoured by the code rather than ignored! 🙂
I just started getting a 502 Proxy error (not a 503) after upgrading from 9.3.8 to 9.3.9. The addition of the below stanza to server.conf did not help.
[applicationsManagement]
splunkbaseAppsDumpUrl=
archivedSplunkbaseAppsDumpUrl=
We are using Apache as an SSL proxy to pass our users' CAC credentials through to Splunk.
The proxy server received an invalid response from an upstream server.
The proxy server could not handle the request
Reason: Error reading from remote server
This was a very specific 503 error; if you're getting a 502 proxy error then I would recommend creating a new thread so that someone can help look into it, as it sounds like a different issue.
Many many thanks again for resolving the issue.
I did the same analysis, and found all those CDN calls with errors, but couldn't figure out how to solve it!
Great 👍👍👍
PS: quite annoying that Splunk introduced this really strange bug with 503s/delays/timeouts for users 😫
Then our users start asking questions about the 502 proxy errors.
There are now two Knowledge Base articles about this, so it's an acknowledged issue with these releases and hopefully it will be patched in the next release (I'll keep an eye out!)
The following KB links might be useful for others who find this, but they pretty much repeat what is in the original post:
Applied the suggested fix on Splunk 9.3.9; it only converted

ERROR LocalAppsAdminHandler [846555 TcpChannelThread] - Failed to fetch and parse apps dump; aborting splunkbase apps cache update
ERROR LocalAppsAdminHandler [846555 TcpChannelThread] - Failed to fetch apps from apps dump error at url=https://cdn.splunkbase.splunk.com/public/report/apps_dump.json code=502 desc=Error connecting: Connect Timeout

to (every time I click on the GUI now):

ERROR LocalAppsAdminHandler [1078631 TcpChannelThread] - Failed to parse apps dump uri from server.conf; aborting splunkbase apps cache update

Did you have the same issue?
I did not have 503 errors, but 502 errors instead. Applying the fix here did fix that issue. The UI was also responding very slowly at random moments; the fix remedied that too. Unfortunately, I'm now seeing the "Failed to parse apps dump uri from server.conf" error as well. It doesn't seem to affect the speed of the UI though, so for now we're keeping it like this.
I'm seeing the same behavior on Splunk 9.4.8 after upgrading our lower environment.
Same issue with Splunk 9.2.12
Can confirm that this fix is working on Splunk Version 9.4.8!
Thanks for sharing this find.
Great find, I can confirm this solved this (very annoying) problem for me!
Great, a fix! Thank you.
Added this post to my Splunk support ticket.
We are experiencing this issue. Splunk support was no help. It'd be nice to have accurate change logs.
Hi!
Found the same bug and errors upgrading to 9.4.8 (splunk-9.4.8-c543277b24fa-linux-amd64.tgz) in a customer's environment.
Asked them to apply the same fix; let's see if it works there too!
Thanks,
Marco