Getting Data In

GZIPPED files are being reindexed

Path Finder

Hello,
I'm aware of the issue where GZ files that are written to cause the entire file to be reindexed.

On a related issue, if you add new data to an existing compressed archive such as a .gz file, Splunk will re-index the entire file, not just the new data in the file. This can result in duplication of events.

Are there any known workarounds for this? My application writes out to a .csv.gz file in real time.
Thank you!

1 Solution

Splunk Employee

It would be helpful if you indicated how you are writing the gzip file out in real time. Are you collecting multiple blocks of data, gzipping each one, and then appending them to the file? Are you doing this event by event? Or are you rewriting and recompressing the entire file each time you add to it?

If you are zipping individual lines and then appending them, I would suggest that it's not worth it since the amount of compression you achieve will be limited, and that you should use the more traditional logrotate-style method of writing in plaintext, then gzipping the entire file when you rotate it.
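As a rough illustration of the logrotate-style approach described above, here is a minimal Python sketch (the function name and timestamp-suffix scheme are illustrative, not anything Splunk- or logrotate-specific): write the log in plaintext, and only gzip the whole file when you rotate it.

```python
import gzip
import os
import shutil
import time

def rotate_and_compress(log_path):
    """Rename the active plaintext log out of the way, gzip the rotated
    copy in one pass, and remove the uncompressed original."""
    rotated = "%s.%d" % (log_path, int(time.time()))
    os.rename(log_path, rotated)  # the writer then reopens/creates a fresh log
    with open(rotated, "rb") as src, gzip.open(rotated + ".gz", "wb") as dst:
        shutil.copyfileobj(src, dst)
    os.remove(rotated)
    return rotated + ".gz"
```

Splunk would monitor the plaintext file as it grows and, because the rotated file is only gzipped once it is complete, never sees a gzip file that changes after indexing.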

Any workaround would involve either waiting until the gzip file is complete before presenting it to Splunk, writing in plaintext, or writing each gzipped chunk as a separate file for Splunk (you could write twice: once appending to the archive, once to a Splunk batch directory). This last method would work if you are in fact appending gzipped sections to a gzip file, and would be as real-time as your own writing. Writing in plaintext would not hurt either; note that Splunk has to work to unzip a file that you have worked to zip, so plaintext would save some CPU on both sides.
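The "write twice" variant can be sketched in a few lines of Python. This is a hypothetical helper, not the application's actual writer; it relies on the fact that concatenated gzip members form a valid multi-member gzip stream, and it assumes Splunk is configured with a batch input watching `batch_dir`:

```python
import gzip
import os
import uuid

def write_chunk(chunk_text, archive_path, batch_dir):
    """Append one gzip member to the rolling archive, and drop a copy of
    the same member in a batch directory so only new data gets indexed."""
    member = gzip.compress(chunk_text.encode("utf-8"))
    # Appending a complete gzip member keeps the archive a valid
    # multi-member gzip stream that standard tools can still read.
    with open(archive_path, "ab") as f:
        f.write(member)
    # The same member, written as its own small file, is what the Splunk
    # batch input would pick up (and delete after indexing).
    name = "chunk-%s.csv.gz" % uuid.uuid4().hex
    with open(os.path.join(batch_dir, name), "wb") as f:
        f.write(member)
```

Because each batch file contains only the new chunk, nothing already indexed is ever re-read, while the archive keeps accumulating the full history.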



Path Finder

Sorry, I wasn't clear in my question. The gz file is being written to by a proprietary app (of course). I'm not sure of the method, but I'll see if I can figure it out by watching the file.

I believe you have still answered my question. I guess I wasn't expecting a good work around, but I wanted to see if there were any better solutions.

Thank you!
