
Splunk Hadoop Connect - unable to read snappy compressed data

splunkears
Path Finder

Does Hadoop Connect support indexing snappy-compressed files on HDFS?
All it needs is to use -text (rather than -cat) while reading the file for indexing; without this, it appears that Splunk will index garbage (the raw compressed bytes).
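For context, here is the difference at the Hadoop CLI level (assuming hdfs:///data/events.snappy is a snappy-compressed text file and the snappy codec is on the cluster's classpath; the path is made up for illustration):

    # -cat dumps the raw compressed bytes; indexing this produces garbage
    hadoop fs -cat hdfs:///data/events.snappy

    # -text detects the codec from the file extension and decompresses,
    # emitting readable lines that are suitable for indexing
    hadoop fs -text hdfs:///data/events.snappy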

Any insights?

1 Solution

splunkears
Path Finder

It looks like the current Splunk Hadoop Connect does not support snappy.
Here is the code I looked into (from Splunk Hadoop Connect):

splunk-HadoopConnect-master/bin/hdfs.py
...
...

def process_file_uri(hdfs_uri):
    hj = HadoopCliJob(HadoopEnvManager.getEnv(hdfs_uri))
    # a hack: I couldn't get the data translator to work with .gz files;
    # so, we rely on the hj.text() to do the gunzip'ing for us
    translator = None
    if hdfs_uri.endswith(".gz"):
        hj.text(hdfs_uri)   # <<== invokes the -text method of hadoop FsShell
        translator = FileObjTranslator(hdfs_uri, hj.process.stdout)
    else:
        hj.cat(hdfs_uri)
        translator = get_data_translator(hdfs_uri, hj.process.stdout)

    cur_src = ""
    buf = translator.read()
    bytes_read = len(buf)

The fix is to extend that condition so that files ending in ".snappy" also take the -text path, i.e. add an hdfs_uri.endswith(".snappy") check alongside the ".gz" one.
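Concretely, a minimal sketch of the patched condition (assuming hadoop fs -text can find a snappy codec on the cluster's classpath to do the actual decompression):

    if hdfs_uri.endswith(".gz") or hdfs_uri.endswith(".snappy"):
        # -text lets hadoop FsShell pick a codec by file extension and
        # decompress, so gzip and snappy both arrive as plain text on stdout
        hj.text(hdfs_uri)
        translator = FileObjTranslator(hdfs_uri, hj.process.stdout)
    else:
        hj.cat(hdfs_uri)
        translator = get_data_translator(hdfs_uri, hj.process.stdout)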



Ledion_Bitincka
Splunk Employee

Thanks for pointing this out - I've filed a requirement for us to address this during our next revision of the app.


splunkears
Path Finder

Typo in iii) above; it should read:
iii) How to flush current index and re-index HDFS files from UI?

Thanks.


splunkears
Path Finder

Thanks for considering the request.
Please consider the following:
i) When HDFS files are indexed, please provide a way to specify the timestamp column. (What I mean is: compare this with uploading a single file via Splunk Web, which lets us specify and verify the timestamp column, so per-day and per-hour indexing is accurate.)
ii) I also noticed that line breaking goes wrong in Hadoop Connect when reading snappy files, so I had to add a special sourcetype stanza to introduce line breaking (see the sketch after this list).
iii) From UI flush current index and re-index HDFS files?
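As an illustration of ii) (and the timestamp column in i)), this is the kind of props.conf stanza I mean; the sourcetype name, break pattern, and time format below are placeholders for illustration, not something shipped with the app:

    [hdfs_snappy_events]
    # break events on newlines only and do not merge lines back together
    LINE_BREAKER = ([\r\n]+)
    SHOULD_LINEMERGE = false
    # point Splunk at the timestamp column and tell it how to parse it
    TIME_PREFIX = ^
    TIME_FORMAT = %Y-%m-%d %H:%M:%S
    MAX_TIMESTAMP_LOOKAHEAD = 19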

