Getting Data In

Duplicate and Missing Logs After Splunk Universal Forwarder Pod Restart in EKS

Ravi1
Loves-to-Learn

We are experiencing consistent log duplication and data loss when the Splunk Universal Forwarder (UF) running as a Helm deployment inside our EKS cluster is restarted or redeployed.

Environment Details:

  • Platform: AWS EKS (Kubernetes)

  • UF Deployment: Helm chart

  • Splunk UF Version: 9.1.2

  • Indexers: Splunk Enterprise 9.1.1 (self-managed)

  • Source Logs: Kubernetes container logs (/var/log/containers, etc.)

 

Symptoms:

  1. After the UF pod is restarted or redeployed:

    • Previously ingested logs are duplicated.

    • Some (though not all) of the logs generated during the restart window are missing in Splunk.

  2. The fishbucket is recreated at each restart:

    • Confirmed by logging into the UF pod post-restart and checking:
      /opt/splunkforwarder/var/lib/splunk/fishbucket/

    • Timestamps indicate it is freshly recreated (ephemeral).

 

Our Hypothesis:

We suspect this behavior is caused by the Splunk UF losing its ingestion state (fishbucket) on pod restart, due to the lack of a PersistentVolumeClaim (PVC) mounted to:

/opt/splunkforwarder/var/lib/splunk
 

This would explain both:

  • Re-ingestion of previously-read files (-> duplicates)

  • Failure to re-ingest certain logs that are no longer available or tracked (-> data loss)

However, we are not yet certain whether the missing logs are due to the non-persistent fishbucket, to container log rotation, or to a combination of both.
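
For clarity, the persistence we believe is missing would look roughly like the following fragment of the UF pod spec (a minimal sketch only; the claim name, storage class and size are placeholders, not values from our actual chart):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: splunk-uf-state              # placeholder name
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: gp3              # placeholder EBS-backed storage class
  resources:
    requests:
      storage: 1Gi
---
# Relevant fragment of the UF pod spec
containers:
  - name: splunk-uf
    volumeMounts:
      - name: uf-state
        mountPath: /opt/splunkforwarder/var/lib/splunk   # where the fishbucket lives
volumes:
  - name: uf-state
    persistentVolumeClaim:
      claimName: splunk-uf-state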

What We Need from Splunk Support:

  • How can we conclusively verify whether the missing logs are caused by fishbucket loss, file rotation, inode mismatch, or other ingestion tracking issues?

  • What is the recommended and supported approach for maintaining ingestion state in a Kubernetes/Helm-based Splunk UF deployment?

  • Is mounting a PersistentVolumeClaim (PVC) to /opt/splunkforwarder/var/lib/splunk sufficient and reliable for preserving fishbucket across pod restarts?

  • Are there additional best practices to prevent both log loss and duplication, especially in dynamic environments like Kubernetes?


livehybrid
Super Champion

Hi @Ravi1 

I agree that the loss of the fishbucket state (due to ephemeral storage) is the cause of both log duplication and data loss after Splunk Universal Forwarder pod restarts in Kubernetes. When the fishbucket is lost, the UF cannot track which files and offsets have already been ingested, leading to re-reading old data (duplicates) and missing logs that rotated or were deleted during downtime.

If logs are rotated (e.g. to myapp.log.1) and Splunk is not configured to monitor the rotated file path, this could result in losing data as well as the more obvious duplication of data caused by the loss of file tracking in the fishbucket.

As far as I am aware, running a UF inside K8s is not generally encouraged. Instead, the Splunk Validated Architecture (SVA) for sending logs to Splunk from K8s is the Splunk OpenTelemetry Collector for Kubernetes, which allows sending of logs (amongst other things) to Splunk Enterprise / Splunk Cloud.
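
For illustration, a minimal values.yaml for the splunk-otel-collector Helm chart could look something like the sketch below (the cluster name, HEC endpoint, token and index are placeholders - please verify the exact values schema against the chart's own documentation):

clusterName: my-eks-cluster          # placeholder
splunkPlatform:
  endpoint: "https://your-indexers.example.com:8088/services/collector"   # HEC endpoint (placeholder)
  token: "00000000-0000-0000-0000-000000000000"                           # HEC token (placeholder)
  index: "k8s_logs"                                                       # target index (placeholder)
agent:
  enabled: true                      # DaemonSet agent that tails container logs on each node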

If you do want to use the UF approach (which may or may not be supported), you could look at adding a PVC, as is done with the full Splunk Enterprise deployment under splunk-operator - check out the Storage Guidelines and StorageClass docs for splunk-operator.
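
As a rough sketch only (not a supported reference configuration), if your chart can render a StatefulSet then a volumeClaimTemplate over the UF state directory would keep the fishbucket across restarts; the names, size and storage class below are placeholders:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: splunk-uf
spec:
  serviceName: splunk-uf
  replicas: 1
  selector:
    matchLabels:
      app: splunk-uf
  template:
    metadata:
      labels:
        app: splunk-uf
    spec:
      containers:
        - name: splunk-uf
          image: splunk/universalforwarder:9.1.2               # match your current UF version
          volumeMounts:
            - name: uf-state
              mountPath: /opt/splunkforwarder/var/lib/splunk   # fishbucket location
  volumeClaimTemplates:
    - metadata:
        name: uf-state
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: gp3                                  # placeholder EBS-backed class
        resources:
          requests:
            storage: 1Gi

One thing to watch with EBS-backed ReadWriteOnce volumes: if you stay on a Deployment instead of a StatefulSet, a rolling update can stall because the replacement pod cannot attach the volume while the old pod still holds it.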

🌟 Did this answer help you? If so, please consider:

  • Adding karma to show it was useful
  • Marking it as the solution if it resolved your issue
  • Commenting if you need any clarification

Your feedback encourages the volunteers in this community to continue contributing
