Does Hadoop Connect support snappy compressed file (on HDFS) for Indexing?
All it needs is to use -text while reading the file for indexing. Without this, it appears Splunk will index garbage (the raw compressed bytes).
Any insights?
It looks like the current Splunk Hadoop Connect does not support Snappy.
Here is the code I looked into (from Splunk Hadoop Connect):
splunk-HadoopConnect-master/bin/hdfs.py
...
...
def process_file_uri(hdfs_uri):
    hj = HadoopCliJob(HadoopEnvManager.getEnv(hdfs_uri))
    # a hack: I couldn't get the data translator to work with .gz files;
    # so, we rely on the hj.text() to do the gunzip'ing for us
    translator = None
    if hdfs_uri.endswith(".gz"):
        hj.text(hdfs_uri)  # <<======== invokes the -text command of Hadoop FsShell
        translator = FileObjTranslator(hdfs_uri, hj.process.stdout)
    else:
        hj.cat(hdfs_uri)
        translator = get_data_translator(hdfs_uri, hj.process.stdout)
    cur_src = ""
    buf = translator.read()
    bytes_read = len(buf)
The fix is to extend that condition so .snappy files are also routed through hj.text():
    if hdfs_uri.endswith(".gz") or hdfs_uri.endswith(".snappy"):
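The one-line change can be sketched as a small, self-contained helper. This is only an illustration of the extension check; the names TEXT_DECODED_EXTENSIONS and needs_text_command are hypothetical, and the real patch would edit the condition in process_file_uri in hdfs.py directly:

```python
# Sketch of the proposed fix: route every extension that
# `hadoop fs -text` can decompress through the text path,
# and fall back to `hadoop fs -cat` for everything else.
# (Hypothetical names, not part of Hadoop Connect.)
TEXT_DECODED_EXTENSIONS = (".gz", ".snappy")

def needs_text_command(hdfs_uri):
    """True if the URI should be read via `hadoop fs -text`."""
    # str.endswith accepts a tuple of suffixes, so adding more
    # decompressible extensions later is a one-line change.
    return hdfs_uri.endswith(TEXT_DECODED_EXTENSIONS)

print(needs_text_command("hdfs://nn/logs/app.log.snappy"))  # True
print(needs_text_command("hdfs://nn/logs/app.log"))         # False
```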
Thanks for pointing this out - I've filed a requirement for us to address this during our next revision of the app.
Typo in iii) above:
iii) How to flush current index and re-index HDFS files from UI?
Thanks.
Thanks for considering the request.
Please consider the following:
i) When HDFS files are indexed, please provide a feature to specify the timestamp column. (What I mean is: compare this with the feature for uploading a single file via Splunk Web, which gives us the option to specify and verify the timestamp column, so indexing per day or per hour is accurate.)
ii) I also noticed that line breaking goes wrong in Hadoop Connect when reading snappy files, so I had to add a special source type stanza to introduce the line breaks.
iii) From UI flush current index and re-index HDFS files?
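For ii), a minimal props.conf sketch of such a source type stanza might look like the following. The sourcetype name snappy_logs and the newline-based breaker are assumptions for illustration, not taken from the app:

```
[snappy_logs]
# Break events on newlines instead of merging lines
SHOULD_LINEMERGE = false
LINE_BREAKER = ([\r\n]+)
# Limit how far Splunk scans each event for a timestamp
MAX_TIMESTAMP_LOOKAHEAD = 30
```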