Splunk Hadoop Connect - unable to read snappy compressed data

splunkears
Path Finder

Does Hadoop Connect support indexing Snappy-compressed files on HDFS?
All it needs is to use -text when reading and indexing the file. Without this, Splunk appears to index garbage.
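For context, `hadoop fs -text` decompresses files through the installed codecs (gzip, and Snappy when the codec is available), while `-cat` streams the raw bytes. The paths below are illustrative only:

```shell
# -cat emits the raw compressed bytes (garbage if indexed as text)
hadoop fs -cat /data/events/part-00000.snappy

# -text detects the codec from the file extension and decompresses
hadoop fs -text /data/events/part-00000.snappy
```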

Any insights?

1 Solution

splunkears
Path Finder

It looks like the current Splunk Hadoop Connect does not support Snappy.
Here is the code I looked into (from Splunk Hadoop Connect):

splunk-HadoopConnect-master/bin/hdfs.py
...
...

def process_file_uri(hdfs_uri):
    hj = HadoopCliJob(HadoopEnvManager.getEnv(hdfs_uri))
    # a hack: I couldn't get the data translator to work with .gz files;
    # so, we rely on the hj.text() to do the gunzip'ing for us
    translator = None
    if hdfs_uri.endswith(".gz"):
        hj.text(hdfs_uri)   # <<== invokes the -text method of Hadoop FsShell
        translator = FileObjTranslator(hdfs_uri, hj.process.stdout)
    else:
        hj.cat(hdfs_uri)
        translator = get_data_translator(hdfs_uri, hj.process.stdout)

    cur_src = ""
    buf = translator.read()
    bytes_read = len(buf)

The fix is to extend the condition with hdfs_uri.endswith(".snappy"), so that Snappy files are also read through -text and indexed correctly.
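As a minimal sketch, the suffix check behind that fix could be pulled into a small helper (the name `should_use_text` is my own, not from the app; in hdfs.py the same test would simply extend the existing `if`):

```python
def should_use_text(hdfs_uri):
    """Decide whether to read a file with `hadoop fs -text` (which
    decompresses via the installed codecs, e.g. gzip and Snappy)
    instead of `hadoop fs -cat` (which streams raw bytes)."""
    return hdfs_uri.endswith((".gz", ".snappy"))
```

The branch in process_file_uri() would then read `if should_use_text(hdfs_uri): hj.text(hdfs_uri) ...`, keeping the raw `-cat` path for everything else.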



Ledion_Bitincka
Splunk Employee
Splunk Employee

Thanks for pointing this out - I've filed a requirement for us to address this during our next revision of the app.


splunkears
Path Finder

Typo in iii) above; it should read:
iii) How can we flush the current index and re-index HDFS files from the UI?

Thanks.


splunkears
Path Finder

Thanks for considering the request.
Please also consider the following:
i) When HDFS files are indexed, please provide a feature to specify the timestamp column. (Compare with the single-file upload feature in Splunk Web, which lets us specify and verify the timestamp column so that per-day and per-hour indexing is accurate.)
ii) I also noticed that line breaking goes wrong in Hadoop Connect when reading Snappy files, so I had to add a special sourcetype stanza to introduce line breaking.
iii) How can we flush the current index and re-index HDFS files from the UI?
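Regarding ii), a line-breaking stanza in props.conf might look like the following sketch (the sourcetype name and breaker pattern are placeholders; they must be adapted to the actual event format):

```
[hdfs_snappy_events]
SHOULD_LINEMERGE = false
LINE_BREAKER = ([\r\n]+)
```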

