Getting Data In

GZIPPED files are being reindexed

Brandon_ganem1
Path Finder

Hello,
I'm aware of the known issue where appending to a .gz file causes Splunk to re-index the entire file.

On a related issue, if you add new data to an existing compressed archive such as a .gz file, Splunk will re-index the entire file, not just the new data in the file. This can result in duplication of events.

Are there any known workarounds for this? My application writes out to a .csv.gz file in real time.
Thank you!

1 Solution

gkanapathy
Splunk Employee

It would be helpful if you indicated how you are writing the gzip file out in real time. Are you collecting multiple blocks of data, gzipping each one, and then appending those to the file? Are you doing this event-by-event? Or are you rewriting and recompressing the entire file each time you add to it?

If you are zipping individual lines and then appending them, I would suggest it's not worth it, since the compression you achieve will be limited. You should instead use the more traditional logrotate-style method: write in plaintext, then gzip the entire file when you rotate it.
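The logrotate-style method described above could look something like this sketch (the file path and the `.1` naming scheme are illustrative assumptions, not from the thread):

```python
import gzip
import os
import shutil

def rotate_and_compress(path):
    """Logrotate-style rotation: rename the live plaintext log out of
    the way, gzip the whole rotated file in one pass, then delete the
    plaintext copy. The writing application should reopen 'path' after
    the rename so new events go to a fresh file."""
    rotated = path + ".1"
    os.rename(path, rotated)
    with open(rotated, "rb") as src, gzip.open(rotated + ".gz", "wb") as dst:
        shutil.copyfileobj(src, dst)  # compress the entire file at once
    os.remove(rotated)
```

Splunk would then be pointed at the live plaintext file, and the rotated `.gz` files are never written to again, so they are only indexed once.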

Any workaround would involve either waiting until the gzip file is complete before presenting it to Splunk, writing in plaintext, or writing each gzipped chunk as a separate file for Splunk (you could write twice: once to append to the file, once to a Splunk batch directory). This last method would work if you are in fact appending gzipped sections to a gzip file, and would be as real-time as your own writing. Writing in plaintext would not hurt either; note that Splunk has to do work to unzip a file that you did work to zip, so plaintext would save some CPU.
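The "write twice" idea above relies on the fact that a gzip file may consist of multiple concatenated members, each a complete compressed stream. A minimal sketch (paths, chunk naming, and the batch-directory location are assumptions for illustration):

```python
import gzip
import os
import time

def write_chunk(chunk, archive_path, batch_dir):
    """Compress one chunk of data, append it as a new gzip member to
    the cumulative archive (concatenated members still form a valid
    .gz stream), and also drop it as its own small .gz file into a
    Splunk batch directory so only the new data gets indexed."""
    member = gzip.compress(chunk)
    with open(archive_path, "ab") as f:  # append only; never rewrite
        f.write(member)
    name = "chunk-%d.csv.gz" % time.time_ns()
    tmp = os.path.join(batch_dir, "." + name)
    with open(tmp, "wb") as f:           # write hidden, then rename, so
        f.write(member)                  # Splunk never sees a partial file
    os.rename(tmp, os.path.join(batch_dir, name))
```

With a batch (sinkhole) input, Splunk deletes each chunk file after indexing it, so duplicates never arise, while the cumulative archive stays intact for the application's own use.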




Brandon_ganem1
Path Finder

Sorry, I wasn't clear in my question. The .gz is being written to by a proprietary app (of course). I'm not sure of the method, but I'll see if I can figure it out by watching the file.

I believe you have still answered my question. I wasn't expecting a good workaround, but I wanted to see if there were any better solutions.

Thank you!
