Splunk Analytics for Hadoop: Why is Splunk not reading current active HDFS file?

suarezry
Builder

We are running Splunk Analytics for Hadoop v6.5.1 with Hortonworks HDP v2.5.

I can search and results are returned within the time range, EXCEPT from the current file: no results are returned when I search for events in the current hour. I'm not sure what the difference is. Can someone help me troubleshoot?


The files are written via WebHDFS, using Fluentd's out_webhdfs plugin:
http://docs.fluentd.org/articles/out_webhdfs

A new file is created on the hour; the HDFS structure is as follows:

/syslogs/yyyy/yyyy-MM-dd_HH_datacollectorhostname.txt
e.g.
/syslogs/2017/2017-01-11_16_datacollector2.txt
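
For illustration, here is a rough Python sketch of that naming scheme (a hypothetical reconstruction, not the collector's actual code; "datacollector2" stands in for the real collector hostname, and whether the hour is local time or UTC depends on how the collector is configured):

from datetime import datetime

# Hypothetical reconstruction of the hourly naming scheme described above.
def hourly_path(ts: datetime, host: str = "datacollector2") -> str:
    return ts.strftime("/syslogs/%Y/%Y-%m-%d_%H_") + host + ".txt"

print(hourly_path(datetime(2017, 1, 11, 16)))
# -> /syslogs/2017/2017-01-11_16_datacollector2.txt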

Here is some sample data:

hdfs@mynn1:~$ hadoop dfs -tail /syslogs/2017/2017-01-11_16_datacollector2.txt


2017-01-11T21:59:59Z    syslog.tcp  {"message":"<167>2017-01-11T21:59:59.976Z myhost.internal Vpxa: verbose vpxa[259FBB70] [Originator@6876 sub=hostdstats] Set internal stats for VM: 878 (vpxa VM id), 314 (vpxd VM id). Is FT primary? false","client_host":"10.0.0.30"}

Here are the contents of my indexes.conf:

[provider:myprovider] 
vix.command.arg.3 = $SPLUNK_HOME/bin/jars/SplunkMR-hy2.jar
vix.env.HADOOP_HOME = /usr/hdp/2.5.0.0-1245/hadoop
vix.env.HUNK_THIRDPARTY_JARS = $SPLUNK_HOME/bin/jars/thirdparty/common/avro-1.7.7.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/avro-mapred-1.7.7.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/commons-compress-1.10.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/commons-io-2.4.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/libfb303-0.9.2.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/parquet-hive-bundle-1.6.0.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/snappy-java-1.1.1.7.jar,$SPLUNK_HOME/bin/jars/thirdparty/hive_1_2/hive-exec-1.2.1.jar,$SPLUNK_HOME/bin/jars/thirdparty/hive_1_2/hive-metastore-1.2.1.jar,$SPLUNK_HOME/bin/jars/thirdparty/hive_1_2/hive-serde-1.2.1.jar
vix.env.JAVA_HOME = /usr/lib/jvm/java-8-oracle 
vix.family = hadoop
vix.fs.default.name = hdfs://mynn1.internal:8020
vix.mapred.child.java.opts = -server -Xmx1024m -XX:ParallelGCThreads=4 -XX:+UseParallelGC -XX:+DisplayVMOutputToStderr 
vix.mapreduce.framework.name = yarn
vix.output.buckets.max.network.bandwidth = 0
vix.splunk.home.hdfs = /user/splunk/splunk-srch/
vix.yarn.resourcemanager.address = mynn2.internal:8050
vix.yarn.resourcemanager.scheduler.address = mynn2.internal:8030

[hdp-syslog] 
vix.input.1.et.format = yyyyMMddHH 
vix.input.1.et.regex = /syslogs/(\d+)/\d+-(\d+)-(\d+)_(\d+)_\w+\.txt
vix.input.1.et.offset = 3600
vix.input.1.lt.format = yyyyMMddHH
vix.input.1.lt.regex = /syslogs/(\d+)/\d+-(\d+)-(\d+)_(\d+)_\w+\.txt
vix.input.1.lt.offset = 3600
vix.input.1.path = /syslogs/... 
vix.provider = myprovider
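
To illustrate what those et/lt settings do on this layout, here is a rough Python sketch (not Splunk's actual implementation; it assumes the documented virtual-index behavior that the regex capture groups are concatenated, parsed with et.format/lt.format, and the offset in seconds is then added):

import re
from datetime import datetime, timedelta

# Rough sketch of how a virtual index derives a time bound from a file
# path: concatenate the capture groups, parse them with the configured
# format (yyyyMMddHH -> %Y%m%d%H), then add the offset in seconds.
VIX_RE = re.compile(r"/syslogs/(\d+)/\d+-(\d+)-(\d+)_(\d+)_\w+\.txt")

def extracted_time(path: str, offset_sec: int) -> datetime:
    groups = VIX_RE.search(path).groups()   # ('2017', '01', '11', '16')
    base = datetime.strptime("".join(groups), "%Y%m%d%H")
    return base + timedelta(seconds=offset_sec)

path = "/syslogs/2017/2017-01-11_16_datacollector2.txt"
print(extracted_time(path, 3600))   # 2017-01-11 17:00:00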

Here are the contents of my props.conf:

[source::/syslogs/...]
sourcetype = hadoop
priority = 100
ANNOTATE_PUNCT = false
SHOULD_LINEMERGE = false
MAX_TIMESTAMP_LOOKAHEAD = 30
TIME_PREFIX = ^ 
TIME_FORMAT = %Y-%m-%dT%H:%M:%SZ 
TZ = UTC
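
As a quick sanity check outside Splunk, the leading timestamp in the sample event above parses with the strptime equivalent of that TIME_FORMAT (a sketch; the tab-separated layout is my assumption), and its 20 characters fit comfortably inside MAX_TIMESTAMP_LOOKAHEAD = 30:

from datetime import datetime, timezone

# The leading field of the sample event parses with the strptime
# equivalent of TIME_FORMAT = %Y-%m-%dT%H:%M:%SZ; TZ = UTC matches
# the trailing "Z".
event = '2017-01-11T21:59:59Z\tsyslog.tcp\t{"message":"..."}'
ts = datetime.strptime(event.split("\t")[0], "%Y-%m-%dT%H:%M:%SZ")
print(ts.replace(tzinfo=timezone.utc))   # 2017-01-11 21:59:59+00:00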
1 Solution

kschon_splunk
Splunk Employee

Did you mean for vix.input.1.et.offset and vix.input.1.lt.offset to be equal? I'm guessing vix.input.1.et.offset should be "0". It's possible that the VIX is interpreting each split as only having events for the minute "on the hour", and for any query that does not include such a minute, it's rejecting all splits. For queries that span more than an hour, it will read each split, and correctly interpret the timestamp for each event.
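
To make that concrete, here is a rough sketch of the time range each split gets, assuming (per the indexes.conf spec) that et.offset and lt.offset are added, in seconds, to the time extracted from the path:

from datetime import datetime, timedelta

# Time parsed from /syslogs/2017/2017-01-11_16_datacollector2.txt
base = datetime(2017, 1, 11, 16)

# Original config (et.offset = lt.offset = 3600): the range collapses
# to the single instant 17:00, so a search over 16:00-17:00 can reject
# the split outright.
print(base + timedelta(seconds=3600), base + timedelta(seconds=3600))

# Suggested fix (et.offset = 0, lt.offset = 3600): the range covers the
# full hour the file actually spans.
print(base + timedelta(seconds=0), base + timedelta(seconds=3600))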

suarezry
Builder

For some reason I was thinking the et.offset was subtracted from the earliest time. Thanks, that was the fix!


kschon_splunk
Splunk Employee
Splunk Employee

Glad it worked!


suarezry
Builder

I shut down the hosts writing to HDFS, suspecting the files were somehow being locked. The problem still persists.
