Getting Data In

Indexing content that may contain in-line gzip

johnamcafee
New Member

We need to index content that may contain in-line gzip (or otherwise compressed) data. We do not need to search the compressed data, but we do need to read it back out of Splunk in a form that is still valid for decompression and display.

I've done some searching through the documentation and knowledge base but have not found any pages that address the topic of gzip content mingled into text log content.

In our case, the file Splunk is forwarding has a message delimiter that we use as our line breaker, then one line of data that we parse with a REPORT regex, then the content of the message that we are handling. That content, which includes line breaks, usually has some plain-text headers, some other text, and then a body that might be JSON, XML, or gzip or otherwise compressed data.

We control the writing and use of the content, so for example it would be possible for us to base64-encode any binary content before we write it to the log file, then have our application decode it just prior to use, making the log content plain text the rest of the way through.
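To make the base64 idea concrete, here is a minimal sketch of what the writing and reading sides could look like, assuming both are in Python. The `b64:` prefix and the function names are hypothetical choices for illustration, not anything Splunk requires. Only gzip is detected here, by its two-byte magic number; other compression formats would need their own checks.

```python
import base64

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream
B64_PREFIX = "b64:"       # hypothetical marker, not a Splunk convention


def encode_body(body: bytes) -> str:
    """Return a plain-text form of a message body that is safe to log."""
    if body.startswith(GZIP_MAGIC):
        # Binary payload: base64 output is pure ASCII with no newlines.
        return B64_PREFIX + base64.b64encode(body).decode("ascii")
    # Otherwise assumed to be UTF-8 text (headers, JSON, XML); written unchanged.
    return body.decode("utf-8")


def decode_body(text: str) -> bytes:
    """Reverse encode_body after the text has been read back out of the log."""
    if text.startswith(B64_PREFIX):
        return base64.b64decode(text[len(B64_PREFIX):])
    return text.encode("utf-8")
```

After `decode_body`, a gzip body can be passed to `gzip.decompress` as usual. One limitation of this sketch: a plain-text body that happens to start with `b64:` would be misread as encoded, so a real marker should be something that cannot occur in our text.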

We would appreciate your advice and recommendations on how best to accomplish this.


gkanapathy
Splunk Employee

That should be okay. You can put arbitrary text content into Splunk, though as you suggested, you should base64-encode it. If it sits in an extractable field in structured or semi-structured content (JSON, XML), that would be fine. You'll have to make a few config tweaks in Splunk to ensure clean event breaking and to set an appropriate maximum event size, but that's straightforward.
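For the "extractable field" case, a rough sketch of how the writing application might emit each message as a single-line JSON event with the compressed body in its own field. This assumes the writer is in Python; the field names `headers`, `payload_encoding`, and `payload_b64` are made up for illustration.

```python
import base64
import json


def build_event(headers: dict, payload: bytes) -> str:
    """Serialize one message as a single-line JSON event."""
    record = {
        "headers": headers,
        "payload_encoding": "base64",  # hypothetical field names throughout
        "payload_b64": base64.b64encode(payload).decode("ascii"),
    }
    # With default arguments json.dumps emits no raw newlines,
    # so each event stays on one line in the log file.
    return json.dumps(record)


def read_payload(event: str) -> bytes:
    """Recover the original payload bytes from an event read back out."""
    record = json.loads(event)
    return base64.b64decode(record["payload_b64"])
```

Keeping every event on one line should simplify event breaking, though the maximum event size still has to allow for the payload, which grows by roughly a third once base64-encoded.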

However, because you're not going to be searching on that data, there is no reason for Splunk to index it, and since I am guessing it's of substantial size, avoiding that would be very advantageous for disk space and search speed. How would you need to search on the content? Would it be just by timestamp, source, host, and sourcetype? Or would you need to be able to search on the non-gzip text of the event? If the former, you can set SEGMENTATION = none for the sourcetype in props.conf. Also, is the gzip content interleaved with the searchable free text, or all at the end of it?
