All Apps and Add-ons

Index gz archive files from Storage Account via Splunk Add-on for Microsoft Cloud Services not working

Path Finder

Is it possible to index gz archive files from an Azure Storage account into Splunk using the Splunk Add-on for Microsoft Cloud Services?

Importing non-archived files via the Splunk Add-on for Microsoft Cloud Services works perfectly. I have already tested the following parameters in the source and/or sourcetype stanza, but Splunk still imports only "weird symbols"/the raw gz file.

props.conf (tested within source + sourcetype stanza)

invalid_cause = archive
unarchive_cmd = gzip -cd -

I also tried NO_BINARY_CHECK = false to explicitly ignore the gz files (which is not really my goal), but that doesn't work either. Downloading the gz file manually and importing it into Splunk by hand works fine. So I assume the issue is related to the way the Splunk Add-on for Microsoft Cloud Services consumes and forwards the data. Hopefully someone else has run into the same issue and was able to fix it.
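For reference, a complete version of the stanza being attempted might look like the sketch below. The source pattern and its placement are assumptions; note that invalid_cause/unarchive_cmd are honored by file-based inputs on a forwarder or indexer, which may not apply to data arriving through a modular input.

[source::*.gz]
invalid_cause = archive
unarchive_cmd = gzip -cd -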

Thanks for your help in advance,
Thorsten

1 Solution

Splunk Employee

It sounds like you are using the blob input in the Splunk Add-on for Microsoft Cloud Services. The short answer to your question is that the add-on will not decompress the archive to read its contents.

Here is the technical answer:
The add-on doesn't actually download the blob data and then read the content. Instead, the blob input uses the Azure Python SDK to get a blob's content as text. Specifically, it uses the get_blob_to_text method in the blob service -> https://github.com/Azure/azure-storage-python/blob/master/azure-storage-blob/azure/storage/blob/base... This method streams blob bytes as text. So, when it encounters a binary file (like a .gz archive), you get the weird characters.
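The effect is easy to reproduce locally with nothing but the standard library; a minimal sketch (the sample log line is made up):

```python
import gzip

original = "timestamp=2020-01-01T00:00:00Z action=login user=alice\n"
blob_bytes = gzip.compress(original.encode("utf-8"))

# What the blob input effectively does: read the raw blob bytes as text.
# gzip output is binary, so the decoded string is unreadable "weird symbols".
as_text = blob_bytes.decode("utf-8", errors="replace")
print(as_text != original)  # True -- you see the gzip header bytes, not the log

# What would be needed instead: decompress first, then decode.
readable = gzip.decompress(blob_bytes).decode("utf-8")
print(readable == original)  # True
```

This is exactly why the manually downloaded file indexes fine: Splunk's file inputs decompress .gz before reading, while the blob input never does.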

Depending on your use case, you can download the .gz files from Azure blob storage and then index them using traditional file methods. A tool like AzCopy can be handy for this -> https://docs.microsoft.com/en-us/azure/storage/common/storage-use-azcopy
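A hedged sketch of that workflow, assuming AzCopy v10 syntax; the account, container, SAS token, path, and sourcetype below are all placeholders:

# 1. Pull the .gz blobs down to a host running a Splunk forwarder:
#    azcopy copy "https://<account>.blob.core.windows.net/<container>?<SAS>" \
#        /opt/data/azure-blobs --recursive --include-pattern "*.gz"

# 2. inputs.conf -- a monitor input decompresses .gz files natively:
[monitor:///opt/data/azure-blobs]
sourcetype = azure:blob:logs
disabled = false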



Path Finder

Thanks Jason for the clarification. Your answer perfectly explains why I ran into another issue when trying to download a CSV via this app: using Splunk's built-in CSV indexing didn't work. I found a workaround by ignoring the first line and using key/value extraction at search time, so not a big deal in that case.

Just to make sure I got you right: you recommend, at least for now, downloading the files manually and then indexing the data using a file/folder monitor input?
