Splunk Analytics for Hadoop: Why is Splunk not reading current active HDFS file?

suarezry
Builder

We are running Splunk Analytics for Hadoop v6.5.1 with Hortonworks HDP v2.5.

I can search and results are returned within the time range, EXCEPT from the current file: no results are returned when I search for events in the current hour. I'm not sure what the difference is. Can someone help me troubleshoot?


The files are written via WebHDFS, using Fluentd's out_webhdfs plugin:
http://docs.fluentd.org/articles/out_webhdfs

A new file is created on the hour; the HDFS structure is as follows:

/syslogs/yyyy/yyyy-MM-dd_HH_datacollectorhostname.txt
e.g.
/syslogs/2017/2017-01-11_16_datacollector2.txt
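
For illustration, here is a rough Python sketch of that naming scheme (a hypothetical reconstruction, not the collector's actual code; "datacollector2" stands in for the real collector hostname, and whether the hour is local time or UTC depends on how the collector is configured):

from datetime import datetime

# Hypothetical reconstruction of the hourly naming scheme described above.
def hourly_path(ts: datetime, host: str = "datacollector2") -> str:
    return ts.strftime("/syslogs/%Y/%Y-%m-%d_%H_") + host + ".txt"

print(hourly_path(datetime(2017, 1, 11, 16)))
# -> /syslogs/2017/2017-01-11_16_datacollector2.txt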

Here is some sample data:

hdfs@mynn1:~$ hadoop dfs -tail /syslogs/2017/2017-01-11_16_datacollector2.txt


2017-01-11T21:59:59Z    syslog.tcp  {"message":"<167>2017-01-11T21:59:59.976Z myhost.internal Vpxa: verbose vpxa[259FBB70] [Originator@6876 sub=hostdstats] Set internal stats for VM: 878 (vpxa VM id), 314 (vpxd VM id). Is FT primary? false","client_host":"10.0.0.30"}

Here are the contents of my indexes.conf:

[provider:myprovider] 
vix.command.arg.3 = $SPLUNK_HOME/bin/jars/SplunkMR-hy2.jar
vix.env.HADOOP_HOME = /usr/hdp/2.5.0.0-1245/hadoop
vix.env.HUNK_THIRDPARTY_JARS = $SPLUNK_HOME/bin/jars/thirdparty/common/avro-1.7.7.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/avro-mapred-1.7.7.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/commons-compress-1.10.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/commons-io-2.4.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/libfb303-0.9.2.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/parquet-hive-bundle-1.6.0.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/snappy-java-1.1.1.7.jar,$SPLUNK_HOME/bin/jars/thirdparty/hive_1_2/hive-exec-1.2.1.jar,$SPLUNK_HOME/bin/jars/thirdparty/hive_1_2/hive-metastore-1.2.1.jar,$SPLUNK_HOME/bin/jars/thirdparty/hive_1_2/hive-serde-1.2.1.jar
vix.env.JAVA_HOME = /usr/lib/jvm/java-8-oracle 
vix.family = hadoop
vix.fs.default.name = hdfs://mynn1.internal:8020
vix.mapred.child.java.opts = -server -Xmx1024m -XX:ParallelGCThreads=4 -XX:+UseParallelGC -XX:+DisplayVMOutputToStderr 
vix.mapreduce.framework.name = yarn
vix.output.buckets.max.network.bandwidth = 0
vix.splunk.home.hdfs = /user/splunk/splunk-srch/
vix.yarn.resourcemanager.address = mynn2.internal:8050
vix.yarn.resourcemanager.scheduler.address = mynn2.internal:8030

[hdp-syslog] 
vix.input.1.et.format = yyyyMMddHH 
vix.input.1.et.regex = /syslogs/(\d+)/\d+-(\d+)-(\d+)_(\d+)_\w+\.txt
vix.input.1.et.offset = 3600
vix.input.1.lt.format = yyyyMMddHH
vix.input.1.lt.regex = /syslogs/(\d+)/\d+-(\d+)-(\d+)_(\d+)_\w+\.txt
vix.input.1.lt.offset = 3600
vix.input.1.path = /syslogs/... 
vix.provider = myprovider
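
To illustrate what those et/lt settings do on this layout, here is a rough Python sketch (not Splunk's actual implementation; it assumes the documented virtual-index behavior that the regex capture groups are concatenated, parsed with et.format/lt.format, and the offset in seconds is then added):

import re
from datetime import datetime, timedelta

# Rough sketch of how a virtual index derives a time bound from a file
# path: concatenate the capture groups, parse them with the configured
# format (yyyyMMddHH -> %Y%m%d%H), then add the offset in seconds.
VIX_RE = re.compile(r"/syslogs/(\d+)/\d+-(\d+)-(\d+)_(\d+)_\w+\.txt")

def extracted_time(path: str, offset_sec: int) -> datetime:
    groups = VIX_RE.search(path).groups()   # ('2017', '01', '11', '16')
    base = datetime.strptime("".join(groups), "%Y%m%d%H")
    return base + timedelta(seconds=offset_sec)

path = "/syslogs/2017/2017-01-11_16_datacollector2.txt"
print(extracted_time(path, 3600))   # 2017-01-11 17:00:00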

Here are the contents of my props.conf:

[source::/syslogs/...]
sourcetype = hadoop
priority = 100
ANNOTATE_PUNCT = false
SHOULD_LINEMERGE = false
MAX_TIMESTAMP_LOOKAHEAD = 30
TIME_PREFIX = ^ 
TIME_FORMAT = %Y-%m-%dT%H:%M:%SZ 
TZ = UTC
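
As a quick sanity check outside Splunk, the leading timestamp in the sample event above parses with the strptime equivalent of that TIME_FORMAT (a sketch; the tab-separated layout is my assumption), and its 20 characters fit comfortably inside MAX_TIMESTAMP_LOOKAHEAD = 30:

from datetime import datetime, timezone

# The leading field of the sample event parses with the strptime
# equivalent of TIME_FORMAT = %Y-%m-%dT%H:%M:%SZ; TZ = UTC matches
# the trailing "Z".
event = '2017-01-11T21:59:59Z\tsyslog.tcp\t{"message":"..."}'
ts = datetime.strptime(event.split("\t")[0], "%Y-%m-%dT%H:%M:%SZ")
print(ts.replace(tzinfo=timezone.utc))   # 2017-01-11 21:59:59+00:00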
1 Solution

kschon_splunk
Splunk Employee

Did you mean for vix.input.1.et.offset and vix.input.1.lt.offset to be equal? I'm guessing vix.input.1.et.offset should be "0". It's possible that the VIX is interpreting each split as only having events for the minute "on the hour", and for any query that does not include such a minute, it's rejecting all splits. For queries that span more than an hour, it will read each split, and correctly interpret the timestamp for each event.
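
To make that concrete, here is a rough sketch of the time range each split gets, assuming (per the indexes.conf spec) that et.offset and lt.offset are added, in seconds, to the time extracted from the path:

from datetime import datetime, timedelta

# Time parsed from /syslogs/2017/2017-01-11_16_datacollector2.txt
base = datetime(2017, 1, 11, 16)

# Original config (et.offset = lt.offset = 3600): the range collapses
# to the single instant 17:00, so a search over 16:00-17:00 can reject
# the split outright.
print(base + timedelta(seconds=3600), base + timedelta(seconds=3600))

# Suggested fix (et.offset = 0, lt.offset = 3600): the range covers the
# full hour the file actually spans.
print(base + timedelta(seconds=0), base + timedelta(seconds=3600))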

suarezry
Builder

For some reason I was thinking the et.offset was subtracted from the earliest time. Thanks, that was the fix!


kschon_splunk
Splunk Employee
Splunk Employee

Glad it worked!


suarezry
Builder

I shut down the hosts writing to HDFS, suspecting the files were somehow being locked. The problem still persists.
