Splunk Hadoop Connect - unable to read snappy compressed data

splunkears
Path Finder

Does Hadoop Connect support indexing Snappy-compressed files on HDFS?
All it needs is to use -text when reading and indexing the file. Without this, Splunk appears to index garbage.

Any insights?

1 Solution

splunkears
Path Finder

It looks like the current Splunk Hadoop Connect does not support Snappy.
Here is the code I looked into (from Splunk Hadoop Connect):

splunk-HadoopConnect-master/bin/hdfs.py
...
...

def process_file_uri(hdfs_uri):
    hj = HadoopCliJob(HadoopEnvManager.getEnv(hdfs_uri))
    # a hack: I couldn't get the data translator to work with .gz files;
    # so, we rely on the hj.text() to do the gunzip'ing for us
    translator = None
    if hdfs_uri.endswith(".gz"):
        hj.text(hdfs_uri)          # <<== invokes the -text command of the Hadoop FsShell
        translator = FileObjTranslator(hdfs_uri, hj.process.stdout)
    else:
        hj.cat(hdfs_uri)
        translator = get_data_translator(hdfs_uri, hj.process.stdout)

    cur_src = ""
    buf = translator.read()
    bytes_read = len(buf)

The fix is to extend that condition so it also matches URIs ending in ".snappy", routing Snappy files through hj.text() so they are decompressed before indexing.
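A minimal sketch of the proposed change (the helper name and constant below are mine, not from the app): treat ".snappy" like ".gz", so such files are read via hadoop fs -text, which decompresses them, instead of hadoop fs -cat, which returns raw bytes.

```python
# Extensions that "hadoop fs -text" can decompress for us.
# Handling ".gz" this way already exists in hdfs.py; adding
# ".snappy" is the proposed fix. str.endswith accepts a tuple.
TEXT_DECODED_EXTENSIONS = (".gz", ".snappy")

def use_text_decode(hdfs_uri):
    """Return True if the URI should be read via 'hadoop fs -text'
    (decompressed) rather than 'hadoop fs -cat' (raw bytes)."""
    return hdfs_uri.endswith(TEXT_DECODED_EXTENSIONS)
```

Inside process_file_uri, the existing if hdfs_uri.endswith(".gz"): branch would then become if use_text_decode(hdfs_uri): with the rest of the logic unchanged.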



Ledion_Bitincka
Splunk Employee
Splunk Employee

Thanks for pointing this out - I've filed a requirement for us to address this during our next revision of the app.


splunkears
Path Finder

Typo in iii) above; it should read:
iii) How can we flush the current index and re-index HDFS files from the UI?

Thanks.


splunkears
Path Finder

Thanks for considering the request.
Please consider the following:
i) When HDFS files are indexed, please provide a feature to specify the timestamp column. (Compare this with uploading a single file via the Splunk Web UI, which lets us specify and verify the timestamp column so that per-day and per-hour indexing is accurate.)
ii) I also noticed that line breaking goes wrong in Hadoop Connect when reading Snappy files, so I had to add a special sourcetype stanza to introduce line breaking.
iii) Can we flush the current index and re-index HDFS files from the UI?
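For item ii), a sourcetype stanza along these lines can force event breaking; the stanza name and regex are illustrative (not from the app), while SHOULD_LINEMERGE and LINE_BREAKER are standard props.conf settings:

```
# props.conf -- illustrative sourcetype for Snappy-decoded HDFS data
[hdfs_snappy_data]
SHOULD_LINEMERGE = false
LINE_BREAKER = ([\r\n]+)
```

With SHOULD_LINEMERGE disabled, Splunk breaks events purely on the LINE_BREAKER regex, which avoids the multi-line merging that can misfire on decompressed streams.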

