Getting Data In

GZIPPED files are being reindexed

Brandon_ganem1
Path Finder

Hello,
I'm aware of an issue where a GZ file that gets written to causes the entire file to be reindexed.

On a related issue, if you add new data to an existing compressed archive such as a .gz file, Splunk will re-index the entire file, not just the new data in the file. This can result in duplication of events.

Are there any known workarounds for this? My application writes out to a .csv.gz file in real time.
Thank you!

Solution

gkanapathy
Splunk Employee

It would be helpful if you indicated how you are writing the gzip file out in real time. Are you collecting multiple blocks of data, gzipping each one, and then appending those to the file? Are you doing this event by event? Or are you rewriting and recompressing the entire file each time you add to it?

If you are zipping individual lines and then appending them, I would suggest that it's not worth it, since the compression you achieve will be limited; use the more traditional logrotate-style method of writing in plaintext and then gzipping the entire file when you rotate it.
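For illustration, here's a minimal sketch of that logrotate-style approach; the log path and rotation threshold are made up for the example, and your app would call append_event instead of writing to the .gz directly:

```python
import gzip
import os
import shutil
import time

LOG_PATH = "/var/log/myapp/events.csv"   # hypothetical plaintext log Splunk tails
ROTATE_BYTES = 10 * 1024 * 1024          # arbitrary ~10 MB rotation threshold

def append_event(line: str) -> None:
    """Append one plaintext CSV line; Splunk monitors this file normally."""
    with open(LOG_PATH, "a") as f:
        f.write(line.rstrip("\n") + "\n")
    if os.path.getsize(LOG_PATH) >= ROTATE_BYTES:
        rotate()

def rotate() -> None:
    """Rename the live file and gzip the whole thing once, so Splunk only
    ever sees complete .gz archives that are never appended to again."""
    rotated = f"{LOG_PATH}.{int(time.time())}"
    os.rename(LOG_PATH, rotated)
    with open(rotated, "rb") as src, gzip.open(rotated + ".gz", "wb") as dst:
        shutil.copyfileobj(src, dst)
    os.remove(rotated)
```

Since Splunk tails the plaintext file as it grows, you keep the real-time behavior, and the .gz files only ever exist as finished archives.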

All the workarounds would involve either waiting until the gzip file is complete before presenting it to Splunk, writing in plaintext, or possibly writing each gzipped chunk as a separate file for Splunk (you could write twice: once to append to the file, once to a Splunk batch directory). This last method would work if you are in fact appending gzipped sections to a gzip file, and would be as real-time as your own writing. Writing in plaintext would not hurt either; note that Splunk has to work to unzip a file that you have worked to zip, so plaintext would save some CPU on both ends.
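If the app really is appending gzipped members, a sketch of that write-twice idea might look like this; the archive path and batch directory names are assumptions for the example:

```python
import gzip
import os
import uuid

ARCHIVE = "/var/log/myapp/events.csv.gz"  # hypothetical growing archive
BATCH_DIR = "/opt/splunk_batch"           # hypothetical Splunk batch directory

def write_chunk(lines: list) -> None:
    """Compress one block of events, then write it twice: appended to the
    archive (concatenated gzip members are valid per RFC 1952), and as a
    standalone .gz file that Splunk can index once."""
    payload = gzip.compress(("\n".join(lines) + "\n").encode("utf-8"))
    # 1. Append the compressed member to the growing archive.
    with open(ARCHIVE, "ab") as f:
        f.write(payload)
    # 2. Write the same member as its own complete file for Splunk,
    #    renaming it into place so Splunk never reads a half-written file.
    name = f"chunk-{uuid.uuid4().hex}.csv.gz"
    tmp = os.path.join(BATCH_DIR, "." + name + ".tmp")
    with open(tmp, "wb") as f:
        f.write(payload)
    os.rename(tmp, os.path.join(BATCH_DIR, name))
```

On the Splunk side, a batch input pointed at that directory (with move_policy = sinkhole) would index each chunk file once and then delete it.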

Brandon_ganem1
Path Finder

Sorry, I wasn't clear in my question. The .gz is being written to by a proprietary app (of course). I'm not sure of the method, but I'll see if I can figure it out by watching the file.
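For what it's worth, one quick way to watch the file is to count gzip member headers over time. This is only a heuristic sketch (the magic bytes can also appear by chance inside compressed data): a growing count suggests the app appends compressed members, while a single header on a file whose contents keep changing suggests it rewrites the whole archive:

```python
import sys

def count_gzip_members(path: str) -> int:
    """Count occurrences of the gzip member header: magic bytes 0x1f 0x8b
    followed by the deflate method byte 0x08. Heuristic only."""
    with open(path, "rb") as f:
        data = f.read()
    count, i = 0, 0
    while True:
        i = data.find(b"\x1f\x8b\x08", i)
        if i == -1:
            return count
        count += 1
        i += 3

if __name__ == "__main__":
    print(count_gzip_members(sys.argv[1]))
```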

I believe you have still answered my question. I guess I wasn't expecting a good workaround, but I wanted to see if there were any better solutions.

Thank you!
