Environment Setup:
- Splunk version: upgraded from 9.0.7 -> 9.3.3
- OS: upgraded from Oracle Linux 7 (OL7) -> Oracle Linux 8 (OL8)
- Deployment: 3 separate Splunk instances, each on its own VM: HF, IDX, SH
- Private apps: several customized Splunk apps deployed across SH, IDX, and HF
- THP: Transparent Huge Pages disabled on all instances
- Data volume: low (minimal ingestion)

Problem: since upgrading Splunk and its underlying OS, all Splunk components have become unstable and unpredictable.

1. Splunk services hang or fail to restart
- The UI (Splunk Web) often becomes unavailable on both the SH and the indexer; pages fail to load or time out.
- Running splunk status shows multiple splunkd helper processes that won't terminate.
- Restart attempts hang indefinitely at "stopping helpers".
- Even systemctl stop splunk fails; the only recovery option is rebooting the entire VM.
- Occasionally even the reboot hangs, and we have to kill and restart the VM from Oracle VM Manager (OVM).

2. Data flow and queue freezing
Errors observed:
- "The TCP output processor has paused data flow"
- "Now skipping indexing of internal audit events because the downstream queue is not accepting data"
- "Unable to distribute to peer named <indexer> because replication was unsuccessful"
These point to queue blocking, replication failures, or communication issues between the SH and IDX, and they coincide with the system becoming unresponsive.

3. Dispatch directory filling quickly
- $SPLUNK_HOME/var/run/splunk/dispatch fills up rapidly despite the low ingestion volume.
- A weekly cron job that deletes old search artifacts helps temporarily, but the directory fills up again within days.

4. High CPU utilization on the indexer
- splunkd consistently consumes 99-108% CPU on the IDX VM when checked with top.
- However, our external infra monitoring tools show no CPU spike for that VM, which suggests the load is internal to Splunk (search or indexing pipeline contention?) rather than OS-level exhaustion.

5. Private apps may be contributing
- Several private/custom Splunk apps still run deprecated Python and jQuery components (the upgrade readiness scan flags this even after upgrading these apps).
- I am not sure whether these are behind the growing helper process count and the long-running background processes.

Troubleshooting steps taken:
- Disabled Transparent Huge Pages (THP) on all instances.
- Verified resource limits in /etc/security/limits.conf:
    splunk soft nofile 3077200
    splunk hard nofile 524288
    splunk nproc 10240
    splunk hard nproc 20480
- Cleaned the dispatch folder regularly via cron.
- Verified inter-instance connectivity (HF -> IDX, SH -> IDX) is stable.
- No OS-level CPU, memory, or I/O bottlenecks detected.

Request for help:
- Recommended configuration tuning (e.g. server.conf, indexes.conf, limits.conf, or queue parameters) for Splunk 9.3.3 on Oracle Linux 8.
- Whether custom apps with deprecated components could cause splunkd thread buildup and blocking behavior.
- Known Splunk 9.3.3 stability issues on Oracle Linux 8.
- Steps or tools (splunk diag, debug logs, etc.) to isolate internal CPU spikes and stuck helper threads.
- Recommendations for stabilizing queue performance and preventing dispatch directory overflow.
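On the configuration-tuning question, my understanding is that dispatch growth is governed by artifact TTL settings rather than queue sizes. These are the knobs I have been looking at; the values shown are illustrative (the ones I believe are the defaults), not recommendations:

```ini
# limits.conf on the search head (illustrative values)
[search]
# Seconds an ad-hoc search artifact lives in dispatch after the search ends.
ttl = 600
# Lifetime once a user explicitly saves the results.
default_save_ttl = 604800

# savedsearches.conf, per scheduled search (commented out; per-search choice)
# Artifact lifetime as a multiple (p) of the search's schedule period.
# dispatch.ttl = 2p
```

If the scheduler is what refills dispatch within days, the per-search `dispatch.ttl` on the busiest scheduled searches seems like the more targeted lever than the global ttl.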
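On the queue side, I have considered sizing the output and pipeline queues, though my impression is that this only rides out short stalls rather than fixing a blocked downstream queue. A sketch of where those settings live, with illustrative values that are not recommendations:

```ini
# outputs.conf on the HF / SH (forwarding side)
[tcpout]
# In-memory output queue; larger values absorb brief indexer stalls
# but only mask, not fix, a downstream queue that stops accepting data.
maxQueueSize = 512KB

# server.conf on the indexer: per-pipeline queue sizing
[queue=parsingQueue]
maxSize = 6MB
```

Before touching these I plan to check metrics.log for `blocked=true` queue events to see which queue fills first, since growing the wrong queue just moves the backpressure.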
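For reference, the dispatch cleanup we cron is along these lines. This is a sketch, not our exact script: the `clean_dispatch` helper name and the 2-day threshold are mine, and the demo below runs against a throwaway directory rather than a live $SPLUNK_HOME so it is safe to try anywhere.

```shell
#!/bin/sh
# Age-based sweep of Splunk dispatch artifacts (sketch; GNU find/touch assumed).

clean_dispatch() {
    dispatch_dir="$1"   # e.g. $SPLUNK_HOME/var/run/splunk/dispatch
    max_days="$2"       # delete artifact dirs older than this many days
    # Only top-level artifact directories are removed, never dispatch itself.
    find "$dispatch_dir" -mindepth 1 -maxdepth 1 -type d -mtime +"$max_days" \
        -exec rm -rf {} +
}

# Demo against a throwaway directory so the sketch is runnable anywhere.
demo="$(mktemp -d)"
mkdir -p "$demo/old_search" "$demo/new_search"
touch -d '10 days ago' "$demo/old_search"   # backdate one artifact
clean_dispatch "$demo" 2
ls "$demo"   # old_search is gone, new_search remains
```

Note that Splunk also ships a supported `clean-dispatch` CLI command that relocates old artifacts instead of deleting them in place; the raw find/rm above is the blunt fallback we ended up with.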
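For isolating the internal CPU spikes and stuck helpers, this is the kind of capture script I intend to run before the next hang. It is a sketch: it assumes SPLUNK_HOME defaults to /opt/splunk, GNU top/ps, and it skips the live steps gracefully when splunkd is not running.

```shell
#!/bin/sh
# Evidence capture for internal splunkd CPU spikes and stuck helpers (sketch).
# Assumption: SPLUNK_HOME=/opt/splunk unless overridden; run as the splunk user.
SPLUNK_HOME="${SPLUNK_HOME:-/opt/splunk}"

main_pid="$(pgrep -o -x splunkd || true)"
if [ -n "$main_pid" ]; then
    # Per-thread CPU view of the oldest (main) splunkd process; thread
    # activity shows whether search or the indexing pipeline burns the CPU.
    top -b -H -n 1 -p "$main_pid" | head -n 30
    # Helper processes with age and CPU: long etime + high pcpu = suspects.
    ps -C splunkd -o pid,ppid,etime,pcpu,args
    # Bundle logs and config for Splunk Support / community review.
    "$SPLUNK_HOME/bin/splunk" diag
else
    echo "splunkd not running; skipping live capture"
fi
capture_done=1
```

Pairing one capture during a hang with one during normal operation should make the diff in helper counts and per-thread CPU obvious.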