
Splunk instability after upgrade to v9.3.3 and OS upgrade to Oracle Linux 8: IDX overload, zombie processes, etc.

mmohamed
New Member

Environment Setup:

  • Splunk version: Upgraded from 9.0.7 -> 9.3.3
  • OS:  Upgraded from Oracle Linux 7 (OL7) -> Oracle Linux 8 (OL8)
  • Deployment: 3 separate Splunk instances, each on its own VM: heavy forwarder (HF), indexer (IDX), search head (SH)
  • Private Apps: Several customized Splunk apps are deployed across SH, IDX, and HF
  • THP: Transparent Huge Pages disabled on all instances
  • Data Volume: Low (minimal ingestion)

 

Problem:
Since upgrading Splunk and its underlying OS, all Splunk components have become unstable and unpredictable.

  1. Splunk Services Hang or Fail to Restart
    • The UI (Splunk Web) often becomes unavailable on both the SH and the indexer - pages fail to load or time out.
    • Running splunk status shows multiple splunkd helper processes that won't terminate
    • Restart attempts hang indefinitely at "stopping helpers"
    • Even systemctl stop splunk fails; the only recovery option is rebooting the entire VM.
    • Occasionally, even rebooting hangs - requiring us to kill and restart the VM from Oracle VM Manager (OVM).
  2. Data flow and queue freezing. The errors observed are below:
    • 'The TCP output processor has paused data flow'
    • 'Now skipping indexing of internal audit events because the downstream queue is not accepting data'
    • 'Unable to distribute to peer named <indexer> because replication was unsuccessful'
    • This indicates queue blocking, replication failures, or communication issues between the SH and IDX
    • These errors coincide with the system becoming unresponsive
  3. Dispatch directory filling quickly
    • $SPLUNK_HOME/var/run/splunk/dispatch/ fills up rapidly despite low data ingestion.
    • Implemented a cron job to delete old search artifacts weekly. This helps temporarily, but the directory fills up again within days (see the inspection sketch after this list)
  4. High CPU utilization on the indexer
    • splunkd consistently consumes 99-108% CPU inside the IDX VM when checked with top. However, our external infrastructure monitoring tools don't show a CPU spike for that VM
    • This suggests the CPU load is internal to Splunk (maybe search or indexing pipeline contention?), not OS-level exhaustion
  5. Private apps may be contributing
    • Several private/custom Splunk apps are still running deprecated Python and jQuery components (the upgrade readiness scan shows this even after upgrading the apps).
    • I am not sure whether these are causing the growing helper process count and long-running background processes
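
A sketch of how the helpers and the dispatch directory can be inspected is below (paths assume a default /opt/splunk install; adjust for your $SPLUNK_HOME):

  # main splunkd PID and its helper/child processes
  cat /opt/splunk/var/run/splunk/splunkd.pid
  ps -ef --forest | grep [s]plunkd

  # how many splunkd processes are currently alive
  ps -ef | grep [s]plunkd | wc -l

  # what is filling dispatch: newest artifacts first
  # (scheduled-search artifact names include the owning user and app)
  ls -lt /opt/splunk/var/run/splunk/dispatch | head -20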

Troubleshooting steps taken:

  • Disabled Transparent Huge Pages (THP) on all instances. 
  • Verified correct resource limits in /etc/security/limits.conf (see the systemd note after this list)
    • splunk soft nofile 3077200
    • splunk hard nofile 524288
    • splunk nproc 10240
    • splunk hard nproc 20480
  • Cleaned dispatch folder regularly via cron
  • Verified inter-instance connectivity (HF -> IDX, SH -> IDX) is stable
  • No OS-level CPU, memory, or I/O bottlenecks detected.
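
One thing I am not sure about: because splunkd is started via systemd, I believe the values in /etc/security/limits.conf may not apply to it at all (pam_limits only covers login sessions), so the limits may need to be set on the service unit instead. A minimal sketch of a drop-in override, assuming the unit generated by boot-start is named Splunkd.service (adjust if yours differs):

  # /etc/systemd/system/Splunkd.service.d/override.conf
  [Service]
  LimitNOFILE=524288
  LimitNPROC=20480

  # apply and verify what splunkd actually gets at runtime
  systemctl daemon-reload && systemctl restart Splunkd
  cat /proc/$(head -1 /opt/splunk/var/run/splunk/splunkd.pid)/limits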

Request for help:

  1. Recommended configuration tuning (e.g. server.conf, indexes.conf, limits.conf, or queue parameters) for Splunk 9.3.3 on Oracle Linux 8
  2. Whether custom apps with deprecated components could cause splunkd thread buildup and blocking behavior.
  3. Known Splunk 9.3.3 stability issues on Oracle Linux 8
  4. Steps or tools (splunk diag, debug logs, etc.) to isolate internal CPU spikes and stuck helper threads (a sketch of what I can already collect is after this list).
  5. Recommendations for stabilizing queue performance and preventing dispatch directory overflow 
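
For point 4, a minimal sketch of what I can already collect, assuming default introspection logging is enabled (the field names below are my assumption of the _introspection schema):

  # collect a diag bundle on the affected instance (run as the splunk user)
  $SPLUNK_HOME/bin/splunk diag

  # rough per-process CPU breakdown from introspection data (run from the SH)
  index=_introspection host=<indexer> component=PerProcess data.process=splunkd
  | timechart avg(data.pct_cpu) by data.args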

 


MuS
SplunkTrust

Hi there,

That sounds more like a VM problem if the VMs cannot even be rebooted. I would stop all of them, start one instance, check/verify its activities and, if it's healthy, start the next and repeat that process. If not, troubleshoot why that instance is not healthy and hold back until you have fixed the issues and it's healthy again.

Dispatch filling up means you run a lot of searches, so check what they are and why they run so often.
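
If you want a quick way to see what is running so often, something like this against the scheduler logs should work (field names assume the standard _internal scheduler events):

  index=_internal sourcetype=scheduler status=*
  | stats count avg(run_time) by app, user, savedsearch_name
  | sort - count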

Besides that, disable the UI on the IDX and, like @PickleRick said, use any other tools available for troubleshooting.
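
If you prefer to do that in config rather than from the CLI, a minimal sketch (either option should work; restart splunkd afterwards):

  # $SPLUNK_HOME/etc/system/local/web.conf on the indexer
  [settings]
  startwebserver = 0

  # or via the CLI
  $SPLUNK_HOME/bin/splunk disable webserver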

Hope this helps ...

cheers, MuS


PickleRick
SplunkTrust

1. 9.3.3 is a relatively old version. Even within the 9.3 line (which is still supported) there have been several updates (if I'm not mistaken, the current 9.3 release is 9.3.7).

2. You made two changes at the same time (OS upgrade and Splunk upgrade). Additionally, the correlation of those upgrades with your performance problems might just be coincidental - there could be a change in how your environment is used, or there might be problems with the underlying virtualization. Intermittent problems could also indicate issues on the network level (duplicate IPs?) which would prevent your setup from behaving correctly.

Probably the main tool on Splunk's side would be the Monitoring Console, plus your typical OS-level debugging tools. It's hard to say over the network what's wrong with an installation we don't see.
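
For the stuck helpers specifically, a few generic Linux-level checks (nothing Splunk-specific; gdb may need to be installed, and the <...> placeholders need real PIDs):

  # which splunkd threads are burning CPU
  top -H -p <splunkd_pid>

  # what a stuck helper is blocked on, kernel-side (needs root)
  cat /proc/<helper_pid>/status
  cat /proc/<helper_pid>/stack

  # userspace stacks of a stuck helper
  gdb -p <helper_pid> -batch -ex "thread apply all bt"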

BTW, you shouldn't have webui enabled on the indexer.
