Getting Data In

GZIPPED files are being reindexed

Brandon_ganem1
Path Finder

Hello,
I'm aware of the issue where a .gz file that is written to after indexing causes the entire file to be reindexed.

On a related issue, if you add new data to an existing compressed archive such as a .gz file, Splunk will re-index the entire file, not just the new data in the file. This can result in duplication of events.
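To illustrate why the whole file gets re-read: appending compressed data to a .gz file produces a valid multi-member gzip stream, but such a stream can only be decompressed from the beginning, so a reader like Splunk cannot pick up "just the new data". A minimal sketch (file path is illustrative):

```python
import gzip
import os
import tempfile

# Appending a second gzip "member" to an existing .gz file is legal,
# but the result can only be consumed as one stream from the start.
path = os.path.join(tempfile.mkdtemp(), "demo.csv.gz")

with open(path, "wb") as f:
    f.write(gzip.compress(b"host,count\n"))  # original member
with open(path, "ab") as f:
    f.write(gzip.compress(b"web01,42\n"))    # appended member

with gzip.open(path, "rb") as f:
    data = f.read()                          # decompresses every member

print(data)  # b'host,count\nweb01,42\n'
```

Since there is no way to seek into the middle of the compressed stream, any tool that notices the file changed has to decompress and re-read it from the top.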

Are there any known workarounds for this? My application writes out to a .csv.gz file in real time.
Thank you!

1 Solution

gkanapathy
Splunk Employee

It would be helpful if you indicated how you are writing the gzip file out in real time. Are you collecting multiple blocks of data, gzipping each one, and then appending them to the file? Or are you doing this event by event? Or are you rewriting and recompressing the entire file each time you add to it?

If you are gzipping individual lines and then appending them, I would suggest that it's not worth it, since the amount of compression you achieve will be limited; you should instead use the more traditional logrotate-style method of writing in plaintext and then gzipping the entire file when you rotate it.

Any workaround would involve either waiting until the gzip file is complete before presenting it to Splunk, writing in plaintext, or writing each gzipped chunk as a separate file for Splunk (you would write each chunk twice: once appended to the archive, once to a Splunk batch directory). This last method works if you are in fact appending gzipped sections to a gzip file, and is as real-time as your own writes. Writing in plaintext would not hurt either; Splunk has to spend CPU unzipping a file that you spent CPU zipping, so plaintext saves work on both ends.
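The "write twice" idea above can be sketched roughly as follows. All paths and names here are hypothetical, and this assumes the application appends whole gzipped chunks (not a rewritten stream):

```python
import gzip
import os
import time

ARCHIVE = "/var/log/myapp/events.csv.gz"  # hypothetical running archive
BATCH_DIR = "/opt/splunk_batch"           # hypothetical Splunk batch directory

def emit_chunk(rows: bytes, archive: str = ARCHIVE, batch_dir: str = BATCH_DIR) -> str:
    """Compress one chunk of CSV rows, append it to the running archive,
    and drop a standalone copy into the batch directory for Splunk to
    index exactly once. Returns the path of the batch file."""
    blob = gzip.compress(rows)
    with open(archive, "ab") as f:        # append as a new gzip member
        f.write(blob)
    batch_file = os.path.join(batch_dir, "chunk-%d.csv.gz" % time.time_ns())
    tmp = batch_file + ".tmp"
    with open(tmp, "wb") as f:            # write-then-rename so the monitor
        f.write(blob)                     # never sees a half-written file
    os.rename(tmp, batch_file)
    return batch_file
```

On the Splunk side, a `[batch://...]` input stanza with `move_policy = sinkhole` on the batch directory would make Splunk index and then delete each chunk file, so nothing is ever re-read.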

Brandon_ganem1
Path Finder

Sorry, I wasn't clear in my question. The .gz is being written to by a proprietary app (of course). I'm not sure of the method, but I'll see if I can figure it out by watching the file.

I believe you have still answered my question. I wasn't expecting a good workaround, but I wanted to see if there were any better solutions.

Thank you!
