Hello,
I'm aware of an issue where writing to .gz files causes Splunk to re-index the entire file.
Specifically, if you add new data to an existing compressed archive such as a .gz file, Splunk will re-index the entire file, not just the new data in it. This can result in duplicated events.
Are there any known workarounds for this? My application writes to a .csv.gz file in real time.
Thank you!
It would be helpful if you indicated how you are writing the gzip file out in real time. Are you collecting multiple blocks of data, gzipping each one, and then appending them to the file? Are you doing this event by event? Or are you rewriting and recompressing the entire file each time you add to it?
If you are zipping individual lines and then appending them, I would suggest it's not worth it, since the amount of compression you achieve will be limited. You should instead use the more traditional logrotate-style method: write in plaintext, then gzip the entire file when you rotate it.
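To illustrate why per-line gzipping buys so little, here is a quick sketch using Python's standard gzip module on some made-up CSV lines (the data is invented for the comparison; each gzip member carries its own header and trailer, and per-line compression also loses the redundancy between lines):

```python
import gzip

# Invented, repetitive CSV-style log lines for demonstration.
lines = [f"2024-01-01T00:00:{i:02d},hostA,status=OK\n".encode() for i in range(60)]

# Gzipping each line separately: ~18 bytes of gzip framing per line,
# and no chance to exploit similarity between lines.
per_line_total = sum(len(gzip.compress(line)) for line in lines)

# Gzipping the whole batch at once compresses far better.
whole_file = len(gzip.compress(b"".join(lines)))

print(per_line_total, whole_file)  # per-line total is many times larger
```

On data like this, the per-line total typically comes out larger than the uncompressed input, which is why the logrotate-style approach (plaintext, then compress on rotation) is the usual recommendation.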
All the workarounds would involve either waiting until the gzip file is complete before presenting it to Splunk, writing in plaintext, or writing each gzipped chunk as a separate file for Splunk (you could write twice: once to append to the file, once to a Splunk batch directory). This last method would work if you are in fact appending gzipped sections to a gzip file, and would be as real-time as your own writing. Writing in plaintext would not hurt either; note that Splunk has to spend CPU unzipping a file that you spent CPU zipping, so plaintext would actually save some CPU.
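The "write twice" idea above could be sketched like this in Python, assuming your writer can be modified. The function name and both paths are hypothetical; the batch directory stands in for a directory Splunk monitors as a batch input. It relies on the fact that a gzip file may consist of multiple concatenated members, so appending a freshly compressed chunk keeps the archive valid:

```python
import gzip
import os
import uuid

def write_chunk(data: bytes, archive_path: str, batch_dir: str) -> None:
    """Hypothetical helper: append a chunk to a growing .csv.gz archive
    and also drop it as a standalone file for Splunk to index."""
    # One complete gzip member; concatenated members form a valid gzip file.
    member = gzip.compress(data)

    # 1) Append to the growing archive for your own application's use.
    with open(archive_path, "ab") as f:
        f.write(member)

    # 2) Write the same chunk as its own small file into a directory
    #    Splunk watches, so only the new data gets indexed.
    name = f"chunk-{uuid.uuid4().hex}.csv.gz"
    with open(os.path.join(batch_dir, name), "wb") as f:
        f.write(member)
```

Since Splunk consumes each chunk as a fresh file, it never sees the growing archive at all, which sidesteps the re-indexing behavior entirely.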
Sorry, I wasn't clear in my question. The .gz is being written to by a proprietary app (of course). I'm not sure of the method, but I'll see if I can figure it out by watching the file.
I believe you have still answered my question. I wasn't expecting a good workaround, but I wanted to see if there were any better solutions.
Thank you!