
Splunk Analytics for Hadoop: Why is Hunk searching all of the HDFS files instead of restricting it to the selected time range?

Builder

We are new to Hunk (now called Splunk Analytics for Hadoop).
I am attempting to run a query against our HDFS directories for the last 5 minutes.
Here is the query: index=foo | sort 0 _time
So it should just return all the entries from the last 5 minutes in the index foo, sorted, without truncation.

But it searches through all 8 million+ events in our HDFS directories, even after it seems to have found the complete list for the last 5 minutes.

Any reasons why it might be doing this?


Builder

post your indexes.conf


Splunk Employee

It sounds like an issue with your "et" (earliest time) configurations. When you give a search a time range, Splunk Analytics for Hadoop (formerly called Hunk) decides whether to read a particular file on HDFS based on the earliest and latest times for that file, as read from its path. (It may also skip files based on other field values, if you have configured other path field extractions.) The relevant configurations for your virtual index are:

vix.input.1.et.regex
vix.input.1.et.format
vix.input.1.et.offset
vix.input.1.lt.regex
vix.input.1.lt.format
vix.input.1.lt.offset

You can get more information about these properties here:
http://docs.splunk.com/Documentation/Splunk/6.5.1/Admin/Indexesconf

If you've already set these props and you don't know what's going wrong, please post the provider and vix stanzas for this vix from your indexes.conf file, and an example HDFS file path, after anonymizing any confidential portions.
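For illustration, here is what such a stanza could look like for a hypothetical virtual index whose files live under daily directories like /data/2017/01/12/ (the index name, provider name, and path here are made up, not taken from this thread):

```ini
[example_vix]
vix.provider = example_provider
vix.input.1.path = /data/...
# Earliest time: the capture groups from the regex are concatenated
# and parsed with the format string below (Java SimpleDateFormat style).
vix.input.1.et.regex = /data/(\d+)/(\d+)/(\d+)/
vix.input.1.et.format = yyyyMMdd
vix.input.1.et.offset = 0
# Latest time: same extraction, shifted forward by one day (86400 s)
# so each daily directory is treated as covering a full 24 hours.
vix.input.1.lt.regex = /data/(\d+)/(\d+)/(\d+)/
vix.input.1.lt.format = yyyyMMdd
vix.input.1.lt.offset = 86400
```

With both et and lt bounds defined, a file can be skipped whenever its [et, lt] window falls entirely outside the search time range.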



Builder

I don't actually see those properties (vix.input.*) when browsing the properties listed under Additional Settings in Virtual Indexes in Splunk Web. Are they somewhere else, or do we need to add them, and what should they be set to?


Builder

You need to find a file called indexes.conf within the directory Splunk is installed in. The vix.input.* settings are inside. Post the contents here.


Builder

[provider:XXX]
vix.command.arg.3 = $SPLUNK_HOME/bin/jars/SplunkMR-hy2.jar
vix.env.HADOOP_HOME = /usr/hdp/2.5.0.0-1245/hadoop
vix.env.HUNK_THIRD_PARTY_JARS = $SPLUNK_HOME/bin/jars/thirdparty/common/avro-1.7.7.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/avro-mapred-1.7.7.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/commons-compress-1.10.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/commons-io-2.4.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/libfb303-0.9.2.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/parquet-hive-bundle-1.6.0.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/snappy-java-1.1.1.7.jar,$SPLUNK_HOME/bin/jars/thirdparty/hive12/hive-exec-1.2.1.jar,$SPLUNK_HOME/bin/jars/thirdparty/hive12/hive-metastore-1.2.1.jar,$SPLUNK_HOME/bin/jars/thirdparty/hive12/hive-serde-1.2.1.jar
vix.env.JAVA_HOME = /usr/lib/jvm/jre-1.8.0
vix.family = hadoop
vix.fs.default.name = hdfs://10.x.x.x:xxxx
vix.mapreduce.framework.name = yarn
vix.output.buckets.max.network.bandwidth = 0
vix.splunk.home.hdfs = /tmp/splunk
vix.yarn.resourcemanager.address = hdfs://10.x.x.x:xxxx

[juniper]
vix.input.1.path = /topics/firewall/...
vix.provider = XXX


Builder

OK, so you are missing a bunch of vix.input definitions, particularly vix.input.1.et.regex and vix.input.1.lt.regex. These tell Splunk how to interpret the datetime from paths.


Builder

Yup, I figured that out - thanks for the heads up.


Splunk Employee

They don't get added by default when you create a new index, but you can add them via the "New Setting" link. It might be easier, though, to edit your indexes.conf file directly. It's probably in your /etc/apps/search/local/ directory (assuming you were in the search app when you created the vix).

As for what they need to be set to, there is a lot of detail on the page I linked to before, and there is an example here:
https://docs.splunk.com/Documentation/Hunk/6.4.5/Hunk/Setupvirtualindexes

Briefly, "et" means "earliest time" and "lt" means "latest time". Each one is extracted from the HDFS path via the regex and interpreted via the date format. The offset is a fixed adjustment: it shifts the et/lt earlier or later than what was obtained from the path by that amount. By the way, another useful config is "timezone", as in:

vix.input.x.et.timezone
vix.input.x.lt.timezone
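The extraction described above can be sketched roughly as follows. This is an illustrative approximation, not Splunk's actual implementation: it assumes the regex capture groups are concatenated and parsed with the date format, then shifted by the offset in seconds (Python's strptime codes stand in for Splunk's Java-style format strings, e.g. %Y%m%d for yyyyMMdd):

```python
import re
from datetime import datetime, timedelta, timezone

def extract_time(path, regex, fmt, offset_secs=0, tz=timezone.utc):
    """Approximate a vix.input.N.et/lt extraction for one file path."""
    m = re.search(regex, path)
    if m is None:
        return None  # no match: this file cannot be time-pruned
    raw = "".join(m.groups())             # concatenate capture groups
    t = datetime.strptime(raw, fmt).replace(tzinfo=tz)
    return t + timedelta(seconds=offset_secs)

# Daily directories: et is midnight, lt is midnight plus one day.
path = "/data/2017/01/12/part-00000.log"
et = extract_time(path, r"/data/(\d+)/(\d+)/(\d+)/", "%Y%m%d")
lt = extract_time(path, r"/data/(\d+)/(\d+)/(\d+)/", "%Y%m%d",
                  offset_secs=86400)
```

A file is then skippable when its [et, lt] window does not overlap the search's time range.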


Builder

Thanks for the insight.
I tried to change my indexes.conf accordingly (below) and it still searches through all the files. Perhaps my regex or format string is wrong. The file path structure I'm using is /topics/foo/01-12-2017/

This is what I added to indexes.conf:
vix.input.1.et.regex = /topics/foo/(\d+)-(\d+)-(\d+)
vix.input.1.et.format = MMddyyyy

Then I ran this query: index=foo earliest=-5m | sort 0 _time

And unfortunately it still ran through all the files before finishing the search. Any ideas?
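One way to sanity-check the et extraction locally is to replay it by hand. The sketch below assumes, as above, that the capture groups are concatenated and parsed with the format string (using Python's %m%d%Y as a stand-in for the Java-style MMddyyyy); the path is the one given in this post:

```python
import re
from datetime import datetime

path = "/topics/foo/01-12-2017/part-00000"
m = re.search(r"/topics/foo/(\d+)-(\d+)-(\d+)", path)
raw = "".join(m.groups())              # "01122017"
et = datetime.strptime(raw, "%m%d%Y")  # ~ MMddyyyy in Java terms
print(et)  # 2017-01-12 00:00:00
```

If the et extraction itself checks out like this, note that the posted config has no vix.input.1.lt.* settings at all, so Splunk may have no upper time bound per file and thus no basis to skip any of them; that is worth ruling out before blaming the regex.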
