Splunk Enterprise

Spike in 503 errors in the Splunk WebUI after upgrading to 10.0.3 (also affects 9.2.10/9.3.9/9.4.8/10.2.0)

livehybrid
SplunkTrust

Good afternoon! 

This week we upgraded a Splunk deployment from 9.4.x to 10.0.3, and whilst everything seemingly went well, we came to work the next day to a number of upset users who kept getting random 503 errors and the Oops page:

[Screenshot: the Splunk Web 503 "Oops" error page]

I was able to replicate this and found that approximately every 5 minutes certain pages in the Splunk UI gave this frustrating error! After further analysis we found that, 30 seconds after the 503 error, the following error appeared in the _internal logs:

Splunkd daemon is not responding: ('Error connecting to /services/apps/local: The read operation timed out')


After querying https://splunkserver:8089/services/apps/local directly, we found that most of the time it returned in 10-30 milliseconds but would periodically take up to 180 seconds to respond.
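
For anyone wanting to reproduce that timing check, a minimal Python sketch could look like the following. It is not the method used in the original investigation; the function names, the use of a bearer token for authentication, and the 5-second "slow" threshold are all assumptions made for illustration.

```python
import ssl
import time
import urllib.error
import urllib.request


def time_request(url, token, timeout=200.0, verify_tls=True):
    """Return the seconds taken by one GET against the management port.

    Assumption: `token` is a Splunk authentication token sent as a bearer
    token. The default timeout sits above the ~180 s worst case seen above.
    """
    ctx = ssl.create_default_context()
    if not verify_tls:
        # Only for test hosts that still use a self-signed certificate.
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    start = time.monotonic()
    try:
        with urllib.request.urlopen(req, timeout=timeout, context=ctx) as resp:
            resp.read()
    except urllib.error.HTTPError:
        # An HTTP error status still tells us how long splunkd took to answer.
        pass
    return time.monotonic() - start


def slow_samples(samples, threshold_s=5.0):
    """Return the timings (in seconds) at or above an arbitrary threshold."""
    return [s for s in samples if s >= threshold_s]
```

Calling `time_request("https://splunkserver:8089/services/apps/local", token)` once every few seconds for ten minutes or so and passing the results to `slow_samples` should make the periodic stall stand out against the usual 10-30 millisecond responses.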


I later found errors in _internal for the following web address: https://cdn.splunkbase.splunk.com/public/report/apps_dump.json - This environment does not have internet access and has allowInternetAccess=false in server.conf.

After a chat with some other users in the Community Slack, we found that others have had the same issue on different versions. A setting was added in the latest releases (9.4.8/10.0.3/10.2.0), splunkbaseAppsDumpUrl under [applicationsManagement] in server.conf [see https://github.com/livehybrid/splunk-spec-files/blob/master/server.conf#L255], which contains that Splunkbase URL.

These settings were not in previous 9.4/10.0 releases and I couldn't find any reference to them in the Release Notes either; however, setting them to a blank value solved the problem for us.

# server.conf
[applicationsManagement]
splunkbaseAppsDumpUrl=
archivedSplunkbaseAppsDumpUrl=
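
To confirm that a given server.conf actually carries the workaround, a small check along these lines might help. It is only a sketch: the function name is hypothetical, and it assumes the file is simple enough for Python's `configparser` to read, which is not guaranteed for every Splunk .conf file. It also inspects a single file rather than the merged configuration, so `splunk btool server list applicationsManagement --debug` remains the better check on a real instance.

```python
import configparser

# The two settings blanked out by the workaround above.
WORKAROUND_KEYS = ("splunkbaseAppsDumpUrl", "archivedSplunkbaseAppsDumpUrl")


def workaround_applied(conf_text):
    """Return True if both settings are present and blank in conf_text."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.optionxform = str  # keep setting names case-sensitive
    parser.read_string(conf_text)
    if not parser.has_section("applicationsManagement"):
        return False
    section = parser["applicationsManagement"]
    return all(
        key in section and section[key].strip() == ""
        for key in WORKAROUND_KEYS
    )
```

Reading the local file with `open(path).read()` and passing the text in is enough to use it; the path to server.conf depends on where the setting was deployed (system/local or an app).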

Ultimately we believe this to be a bug introduced in the latest releases, and we have raised it with Support to see whether it gets added to the Known Issues for 10.0.3 (https://help.splunk.com/en/splunk-enterprise/release-notes-and-updates/release-notes/10.0/known-issu...)

Hopefully if you come across this issue you can find this fix! 🙂

🌟 Did this answer help you? If so, please consider:

  • Adding karma to show it was useful
  • Marking it as the solution if it resolved your issue
  • Commenting if you need any clarification

Your feedback encourages the volunteers in this community to continue contributing


erlingen
Explorer

https://splunk.my.site.com/customer/s/article/Error-503-Splunkd-daemon-cannot-be-reached-by-Splunk-W... has been updated to say that this is fixed in v10.2.2. However!

I tried upgrading our dev environment from v10.2.1 to v10.2.2.
Before upgrading, I removed the workaround (setting splunkbaseAppsDumpUrl / archivedSplunkbaseAppsDumpUrl to nothing).
The first thing I saw when logging into the web UI of a searchhead was a 503 error. So it seems it is not resolved in v10.2.2, not here at least.

verbal_666
Builder

Did you clean all caches around (load balancers, proxies, browsers, etc...)?

I updated 9.3.9 to 9.3.11, but I left the workaround in place.

Later I'll try removing it from a test environment.

erlingen
Explorer

@verbal_666 Two of the searchheads were instances I've never accessed through web. Upon login they both gave the classic 503 response after a minute of spinning. So I think I can rule out cache or any local issue. 

verbal_666
Builder

So, I think we need to raise a new issue with the SPLUNK DEVS to check the bug again 🤕

Meanwhile, I'm leaving the workaround on my nodes in 9.3.11 🤧

clatham
Loves-to-Learn

I raised this issue with Support after testing 9.4.10. The fixed version requires a different change to server.conf:

# server.conf
[applicationsManagement]
allowInternetAccess = false

livehybrid
SplunkTrust

Hi @clatham 

Thanks for letting us know. With our original issue (introduced in 9.4.8) we already had this value set to false, so it sounds like the value is now being honoured by the code rather than ignored! 🙂

 


computermathguy
Communicator

I just started getting a 502 Proxy error (not a 503) after upgrading from 9.3.8 to 9.3.9.  The addition of the below stanza to server.conf did not help.

[applicationsManagement]
splunkbaseAppsDumpUrl=
archivedSplunkbaseAppsDumpUrl=

We are using Apache as an SSL proxy to pass our users' CAC credentials through to Splunk.

Proxy Error

The proxy server received an invalid response from an upstream server.
The proxy server could not handle the request

Reason: Error reading from remote server


livehybrid
SplunkTrust

Hi @computermathguy 

This was a very specific 503 error. If you're getting a 502 Proxy Error, I would recommend creating a new thread so that someone can help look into it, as it sounds like a different issue.

verbal_666
Builder

Many many thanks again for resolving the issue.
I did the same analysis, and found all those CDN calls with errors, but couldn't figure out how to solve it!

https://community.splunk.com/t5/Splunk-Enterprise/9-3-6-to-9-3-9-frequently-WebUI-timeout/td-p/75876...

Great 👍👍👍

PS. quite annoying SPLUNK introduces this really strange bug with 503/delays/timeouts for users 😫

computermathguy
Communicator

Then our users start asking questions about the 502 proxy errors.  


livehybrid
SplunkTrust

There are now 2 Knowledge Base articles about this, so it's an acknowledged issue with these releases and will hopefully be patched in the next release (I'll keep an eye out!)

The following KB links might be useful for others who find this, but they pretty much repeat what is in the original post:

https://splunk.my.site.com/customer/s/article/Splunk-Web-503-errors-after-9-4-8-upgrade-on-offline-d...

https://splunk.my.site.com/customer/s/article/Error-503-Splunkd-daemon-cannot-be-reached-by-Splunk-W...


joao_amorim
Communicator

I applied the suggested fix on Splunk 9.3.9, but it only changed

ERROR LocalAppsAdminHandler [846555 TcpChannelThread] - Failed to fetch and parse apps dump; aborting splunkbase apps cache update
ERROR LocalAppsAdminHandler [846555 TcpChannelThread] - Failed to fetch apps from apps dump error at url=https://cdn.splunkbase.splunk.com/public/report/apps_dump.json code=502 desc=Error connecting: Connect Timeout

to (now logged every time I click in the GUI)

ERROR LocalAppsAdminHandler [1078631 TcpChannelThread] - Failed to parse apps dump uri from server.conf; aborting splunkbase apps cache update

Did you have the same issue?
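
One way to see whether the workaround has simply swapped one message for the other is to tally the LocalAppsAdminHandler errors from splunkd.log before and after the change. The sketch below does that in Python; the regular expression is based only on the three log lines quoted above, and the category names are made up for illustration.

```python
import re
from collections import Counter

# Matches the ERROR lines quoted above and captures the message text.
LINE_RE = re.compile(
    r"ERROR\s+LocalAppsAdminHandler\s+\[[^\]]*\]\s+-\s+(?P<msg>.*)"
)


def classify(line):
    """Return a hypothetical category for one log line, or None."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    msg = match.group("msg")
    if msg.startswith("Failed to parse apps dump uri"):
        return "blank_url"      # seen after blanking the settings
    if msg.startswith("Failed to fetch"):
        return "fetch_failed"   # seen while the CDN URL is unreachable
    return "other"


def tally(lines):
    """Count the categories across an iterable of log lines."""
    return Counter(c for c in map(classify, lines) if c is not None)
```

Running `tally(open(path))` over the splunkd.log from before and after the change gives a quick view of which message dominates; where that file lives depends on the installation.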


i_am_me
Loves-to-Learn

I did not have 503 errors, but 502 errors instead. Applying the fix here did resolve that issue. The UI was also responding very slowly at random moments, and the fix remedied that too. Unfortunately, I'm now seeing the

Failed to parse apps dump uri from server.conf;

error as well. It doesn't seem to affect the speed of the UI though, so for now we're keeping it like this.


tjohnson
Engager

I'm seeing the same behavior on Splunk 9.4.8 after upgrading our lower environment. 

esplunkuser
Engager

Same issue with Splunk 9.2.12


BenjaminFinck
Engager

Can confirm that this fix is working on Splunk Version 9.4.8!

Thanks for sharing this find.

whar_garbl
Path Finder

Great find, I can confirm this solved this (very annoying) problem for me!

teunlaan
Contributor

Great, a fix! Thank you.

Added this post to my splunk support ticket 

davisb221
Explorer

We are experiencing this issue. Splunk support was no help.  It'd be nice to have accurate change logs.

marcoscala
Builder

Hi! 

Found same bug and errors upgrading to 9.4.8 (splunk-9.4.8-c543277b24fa-linux-amd64.tgz) on Customer's Environment.

I've asked them to apply the same fix, so let's see if it works there too!

 

Thanks,

Marco

 
