"Error reading compressed journal while streaming: gzip data truncated". Are my Hadoop archived buckets corrupted, and how do I fix it?

heroku_curzonj
Explorer

While running a query via EMR on a bucket archived to S3 with Hadoop Data Roll, I got the following error:

[hadoop] [ip-192-168-4-184] Streamed search execute failed because: Error reading compressed journal while streaming: gzip data truncated, provider=StdinGzDataProvider

Does this mean that one of the archived journal.gz files is corrupt? If so:

  • How can I figure out how it got corrupted?
  • How do I figure out which one it is and fix it? This is still in the test phase, so I still have all the archived buckets on my indexer. I'm trying to validate that the archival mechanism is safe and reliable.

kpawar_splunk
Splunk Employee

The error "Streamed search execute failed because: Error reading compressed journal while streaming: gzip data truncated, provider=StdinGzDataProvider" means that one or more of the archived journal.gz files are corrupted.

If Splunk crashes or shuts down uncleanly (power loss, hardware failure, OS failure, etc.), some buckets can be left in a bad state where not all data is searchable. If a bucket is corrupted locally on the indexer, the archived copy of that bucket will also be corrupted.

Local Splunk buckets can be repaired by following these instructions: http://docs.splunk.com/Documentation/Splunk/6.5.0/Indexer/Bucketissues

Currently there is no way to fix a corrupted journal.gz that has already been archived. We are working on a fix that will let the search read data from a corrupted journal up to the point where the corruption begins, and that will log an error message in search.log identifying the corrupted journal. This fix will be available in a future release.
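
On the "which one" part of the question: since the error is a truncated gzip stream, one rough way to narrow it down is to try decompressing every journal.gz end to end and see which ones fail. The Python below is only a sketch of that idea, not a Splunk tool. It assumes the journals are ordinary gzip streams and that the buckets (the local copies on the indexer, or archived copies downloaded from S3) are reachable as a normal directory tree. The function names `check_gzip` and `find_bad_journals` are made up for this example.

```python
import gzip
import os
import sys
import zlib


def check_gzip(path, chunk_size=1024 * 1024):
    """Return None if the file decompresses to the end, else a short error string."""
    try:
        with gzip.open(path, "rb") as fh:
            # Read and discard the data; we only care whether decompression succeeds.
            while fh.read(chunk_size):
                pass
    except EOFError as exc:
        # Raised by the gzip module when the stream ends before its end marker.
        return "truncated: %s" % exc
    except (OSError, zlib.error) as exc:
        # Bad header, CRC mismatch, or invalid compressed data.
        return "corrupt: %s" % exc
    return None


def find_bad_journals(root, name="journal.gz"):
    """Walk root and map each failing journal path to its error string."""
    bad = {}
    for dirpath, _dirs, files in os.walk(root):
        if name in files:
            path = os.path.join(dirpath, name)
            error = check_gzip(path)
            if error is not None:
                bad[path] = error
    return bad


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_journals.py <directory containing the buckets>
    for path, error in sorted(find_bad_journals(sys.argv[1]).items()):
        print(path, error)
```

Running it against both the indexer's copy and the archived copy of the same bucket should show whether the journal was already damaged locally before it was archived. A journal that passes this check is only known to be a complete gzip stream; it may still have other problems.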

gurlest
Path Finder

I am having the same issue on v7.2.1. Has there been any progress on a fix for this?

pbrinkman
Path Finder

Hi gurlest, no update has been provided by Splunk or by any of the users on Splunk Answers.

pbrinkman
Path Finder

Hi,

I have been unable to locate any further updates on this topic.
We are running 7.2.1, and I would like to know whether there is still no way to fix a corrupt archived journal.gz file.

Cheers
Paul

jmantor
Path Finder

Has there been any progress?
