Hi All,

We have been experiencing intermittent indexing delays in our Splunk environment, which consists of three standalone (non-clustered) indexers. On certain days, one or more indexers begin queuing for extended durations (7-8 hours, sometimes up to 12-14 hours). During these periods:

- Indexing throughput degrades significantly
- Event ingestion is delayed (in some cases by more than 1 hour)
- Downstream processes such as summary indexing jobs receive incomplete data
- Upstream reporting jobs show incorrect or partial counts

Observations:

- CPU and memory utilization on all indexers remains within normal limits
- Each indexer has 24 CPU cores and 256 GB RAM
- No obvious resource saturation is observed at the OS level (CPU/memory)

However, during these queueing periods:

- Disk I/O latency spikes significantly
- The FlashArray storage reports high read/write activity and elevated latency

From preliminary analysis, we suspect sudden ingestion bursts from specific hosts or Kubernetes pods. In some cases we have observed extreme spikes (e.g., ~24 million events in a single minute from a single source). We believe these short-lived but intense ingestion bursts are overwhelming the storage subsystem, causing:

- Increased disk I/O wait on the indexers
- Backpressure in the indexing pipeline
- Prolonged queue buildup and delayed ingestion

We are trying to identify the exact sources responsible for these bursts through Splunk queries. Specifically, we want to:

1. Identify bursty data sources (hosts / Kubernetes pods): determine which hosts or Kubernetes pods generate sudden spikes in event volume over short time intervals (e.g., 1-minute or 5-minute windows), relative to their typical baseline activity (see the first sketch below).
2. Analyze parsing overhead on indexers: identify which indexers spend the most time parsing incoming data, and which sourcetypes contribute most to that parsing load (see the second sketch below).
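For the first point, a rough search we have been experimenting with is below. It is only a sketch: it groups by the indexed host field, computes each host's per-minute average and standard deviation over the search window, and flags any minute where a host exceeds its own average by three standard deviations.

| tstats count where index=* by _time span=1m, host
| eventstats avg(count) as avg_per_min stdev(count) as stdev_per_min by host
| eval spike_threshold = avg_per_min + (3 * stdev_per_min)
| where count > spike_threshold
| sort - count

For Kubernetes traffic, host alone may not isolate the pod; if the collector writes a pod-name field, a regular search with | bin _time span=1m | stats count by _time, <pod field> over that field would give the pod-level view (tstats only works on indexed fields, and the exact field name depends on the collector). A lighter-weight alternative that avoids scanning raw events is the indexers' own metrics.log, e.g. index=_internal source=*metrics.log* group=per_host_thruput, where series is the sending host and kb/ev give per-interval volume.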
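For the second point, the indexers' internal metrics seem like the most direct source. The sketches below assume the standard metrics.log field names (group, name, cpu_seconds, series, kb, ev) and have not been validated against every Splunk version. The first ranks indexers by CPU time spent in the parsing pipeline; the second ranks sourcetypes by ingested volume per indexer as a proxy for parsing load.

index=_internal source=*metrics.log* group=pipeline name=parsing
| stats sum(cpu_seconds) as parsing_cpu_seconds by host
| sort - parsing_cpu_seconds

index=_internal source=*metrics.log* group=per_sourcetype_thruput
| stats sum(kb) as total_kb sum(ev) as total_events by host, series
| sort - total_kb

We would appreciate any corrections to these searches, or better approaches for correlating the ingestion bursts with the queue buildup and disk I/O latency we are seeing.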