Splunk Analytics for Hadoop: Why is Hunk searching all of the HDFS files instead of restricting the search to the selected time range?

EricLloyd79
Builder

We are new to Hunk (now called Splunk Analytics for Hadoop).
I am attempting to run a query on our HDFS directories for the last 5 mins.
Here is the query: index=foo | sort 0 _time
So just return all the entries from the last 5 mins in the index foo sorted without truncation.

But it searches through all 8 million + events in our HDFS directories even after it seems to have found the complete list for the last 5 mins.

Any reasons why it might be doing this?


kschon_splunk
Splunk Employee

It sounds like an issue with your "et" (earliest time) configurations. When you give a search a time range, Splunk Analytics for Hadoop (formerly called Hunk) decides whether to read a particular file on HDFS based on the earliest and latest times for that file, as read from its path. (It may also skip files based on other field values, if you have configured other path field extractions.) The relevant configurations for your virtual index are:

vix.input.1.et.regex
vix.input.1.et.format
vix.input.1.et.offset
vix.input.1.lt.regex
vix.input.1.lt.format
vix.input.1.lt.offset

You can get more information about these properties here:
http://docs.splunk.com/Documentation/Splunk/6.5.1/Admin/Indexesconf

If you've already set these props and you don't know what's going wrong, please post the provider and vix stanzas for this vix from your indexes.conf file, and an example HDFS file path, after anonymizing any confidential portions.
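To make the pruning idea concrete, here is a minimal Python sketch of the logic described above. It is not Splunk's implementation: `path_time` and `file_overlaps` are hypothetical helpers, the assumption that the regex capture groups are concatenated before parsing is inferred from the configs posted in this thread, and Python's `%Y%m%d` stands in for the Java-style `yyyyMMdd` format. Times are treated as UTC for simplicity.

```python
import re
from datetime import datetime, timezone

def path_time(path, regex, fmt, offset=0):
    """Return epoch seconds read from the path, or None if the regex does not match.

    Hypothetical helper: assumes the capture groups are concatenated
    (e.g. "2017", "01", "12" -> "20170112") and then parsed with fmt.
    """
    m = re.search(regex, path)
    if m is None:
        return None
    stamp = "".join(m.groups())
    parsed = datetime.strptime(stamp, fmt).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp()) + offset

def file_overlaps(path, search_et, search_lt, regex, fmt, lt_offset):
    """Decide whether a file has to be read for the search window [search_et, search_lt)."""
    file_et = path_time(path, regex, fmt)
    file_lt = path_time(path, regex, fmt, lt_offset)
    if file_et is None:
        return True  # no time information in the path, so the file cannot be skipped
    return file_et < search_lt and file_lt > search_et

regex = r"/topics/foo/(\d\d\d\d)/(\d\d)/(\d\d)/.+"
search_lt = 1484222700          # 2017-01-12 12:05:00 UTC
search_et = search_lt - 5 * 60  # "last 5 minutes"

for p in ("/topics/foo/2017/01/12/filename.log", "/topics/foo/2017/01/11/filename.log"):
    print(p, file_overlaps(p, search_et, search_lt, regex, "%Y%m%d", 86400))
```

In this sketch, only the file under the 2017/01/12 directory is kept for a five-minute window on that day; a path that does not match the regex is always read.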


EricLloyd79
Builder

As a side note to all who are helping (thank you by the way), I have changed our directory structure to the recommended
/topic/foo/year/month/day
instead of
/topic/foo/month-day-year
and am altering indexes.conf accordingly. Hopefully this more detailed, time-partitioned structure will allow the regex to work properly. I changed the regex and format as well, to:
vix.input.1.et.regex = /topics/foo/(\d\d\d\d)/(\d\d)/(\d\d)/.+
vix.input.1.et.format = yyyyMMdd


EricLloyd79
Builder

More odd behavior.
With the configuration recently posted, if I run the Splunk search:
index=foo | sort 0 _time
and use the time range picker in Splunk Web to select "Last 5 minutes", the search goes through only 1,000,000 of the 8 million+ events (a better result).
When I used:
index=foo earliest=-5m | sort 0 _time
It will find the last 5 min of events, get to ~1,000,000 and then the search hangs...

The good news is that I think it's using the changes to indexes.conf somewhat correctly: when I check how many events are in today's folder alone, it's about 1,000,000. So I think the search is now reading only today's directory. I assume that if I want the search to be even quicker, I'd have to partition my subdirectories further, into years/months/days/hours/minutes, which I'm not sure we're willing to do.

I'm not sure why it's hanging when we use the earliest keyword, but your answer solved my problem.
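As a rough illustration of the partitioning trade-off mentioned above, the Python sketch below counts how many time partitions a search window touches at a given partition width. `partitions_touched` is a hypothetical helper, and the per-hour event estimate assumes events are spread evenly across the day, which real traffic rarely is.

```python
DAY = 24 * 60 * 60   # width of a daily directory, in seconds
HOUR = 60 * 60       # width of a hypothetical hourly directory

def partitions_touched(search_et, search_lt, width):
    """Number of fixed-width partitions overlapping the window [search_et, search_lt)."""
    first = search_et // width
    last = (search_lt - 1) // width
    return last - first + 1

search_et = 1484222400          # 2017-01-12 12:00:00 UTC
search_lt = search_et + 5 * 60  # five minutes later

daily = partitions_touched(search_et, search_lt, DAY)
hourly = partitions_touched(search_et, search_lt, HOUR)

events_per_day = 1_000_000  # figure reported in this thread for one day's folder
print("daily partitions read:", daily, "-> about", daily * events_per_day, "events")
print("hourly partitions read:", hourly, "-> about", hourly * events_per_day // 24, "events")
```

Under these assumptions, an hourly layout would cut the events read for a five-minute search to roughly one twenty-fourth of a day's worth, at the cost of many more directories to manage.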


kschon_splunk
Splunk Employee

It sounds like there is an issue with your query, but I can't see what it is. Hopefully somebody else can. Glad it's working through the GUI though.


EricLloyd79
Builder

Yeah, I cannot understand why, when I run this query:
index="juniper" earliest=-5m | sort 0 _time

it still finds the last 5 mins and then continues searching through the rest of the directories...

This is my indexes.conf now:
vix.command.arg.3 = $SPLUNK_HOME/bin/jars/SplunkMR-hy2.jar
vix.env.HADOOP_HOME = /usr/hdp/2.5.0.0-1245/hadoop
vix.env.HUNK_THIRDPARTY_JARS = $SPLUNK_HOME/bin/jars/thirdparty/common/avro-1.7.7.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/avro-mapred-1.7.7.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/commons-compress-1.10.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/commons-io-2.4.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/libfb303-0.9.2.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/parquet-hive-bundle-1.6.0.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/snappy-java-1.1.1.7.jar,$SPLUNK_HOME/bin/jars/thirdparty/hive_1_2/hive-exec-1.2.1.jar,$SPLUNK_HOME/bin/jars/thirdparty/hive_1_2/hive-metastore-1.2.1.jar,$SPLUNK_HOME/bin/jars/thirdparty/hive_1_2/hive-serde-1.2.1.jar
vix.env.JAVA_HOME = /usr/lib/jvm/jre-1.8.0
vix.family = hadoop
vix.fs.default.name = hdfs://10.x.x.x:xxxx
vix.mapreduce.framework.name = yarn
vix.output.buckets.max.network.bandwidth = 0
vix.splunk.home.hdfs = /tmp/splunk
vix.yarn.resourcemanager.address = hdfs://10.x.x.x:xxxx

[juniper]
vix.provider = hadoopoly
vix.input.1.path = /topics/foo/...
vix.input.1.et.regex = /topics/foo/(\d\d\d\d)/(\d\d)/(\d\d)/.+
vix.input.1.et.format = yyyyMMdd
vix.input.1.et.offset = 0
vix.input.1.lt.regex = /topics/foo/(\d\d\d\d)/(\d\d)/(\d\d)/.+
vix.input.1.lt.format = yyyyMMdd
vix.input.1.lt.offset = 86400


EricLloyd79
Builder

Yes, I've been restarting Splunk after I edit it as well.


kschon_splunk
Splunk Employee

Looks good. Under this scheme, to add a latest time, you would use:

vix.input.1.lt.regex = /topics/foo/(\d\d\d\d)/(\d\d)/(\d\d)/.+
vix.input.1.lt.format = yyyyMMdd
vix.input.1.lt.offset = 86400


ddrillic
Ultra Champion

How is your data organized on the file system, time-wise? As you probably know, Hunk doesn't store any of the data in an index, so it relies on how the data is organized on the HDFS file system.


EricLloyd79
Builder

Our data is stored: /topics/foo/01-12-2017/filename.log
I'm trying to use these parameters to indicate the earliest time, but it doesn't seem to be working:
vix.input.1.et.regex = /topics/firewall/(\d+)-(\d+)-(\d+)
vix.input.1.et.format = MMddyyyy


EricLloyd79
Builder

I don't actually see those properties : vix.input.* when browsing the properties listed under Additional Settings in Virtual Indexes in Splunk Web. Are these properties somewhere else or do we need to add them and what should they be set to?


kschon_splunk
Splunk Employee

They don't get added by default when you create a new index, but you can add them via the "New Setting" link. Might be easier though to edit your indexes.conf file directly. It's probably in your /etc/apps/search/local/ directory (assuming you were in the search app when you created the vix).

As for what they need to be set to, there is a lot of detail on the page I linked to before, and there is an example here:
https://docs.splunk.com/Documentation/Hunk/6.4.5/Hunk/Setupvirtualindexes

Briefly, "et" means "earliest time" and "lt" means "latest time". Each one is extracted from the HDFS path via the regex and interpreted via the date format. The offset is just that: it shifts the et/lt obtained from the path by a fixed number of seconds, forward or backward. BTW, another useful config is "timezone", as in:

vix.input.x.et.timezone
vix.input.x.lt.timezone
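To see why the timezone matters, here is a small Python sketch. It is only an illustration: `path_date_to_epoch` is a hypothetical helper, a fixed UTC offset stands in for whatever value the timezone setting accepts, and `%m%d%Y` is the Python equivalent of the `MMddyyyy` format used elsewhere in this thread.

```python
from datetime import datetime, timedelta, timezone

def path_date_to_epoch(stamp, utc_offset_hours):
    """Interpret a date string taken from a path as midnight in the given UTC offset."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    return int(datetime.strptime(stamp, "%m%d%Y").replace(tzinfo=tz).timestamp())

stamp = "01122017"  # from a path such as /topics/foo/01-12-2017/

as_utc = path_date_to_epoch(stamp, 0)
as_utc_minus_5 = path_date_to_epoch(stamp, -5)

# The same directory name maps to two different instants, five hours apart.
print(as_utc_minus_5 - as_utc)  # 18000
```

If the directory names were written in local time but interpreted as UTC (or the reverse), the file's computed time range shifts by that difference, so a short search window near a day boundary could skip a directory that actually holds matching events.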


EricLloyd79
Builder

Thanks for the insight.
I changed my indexes.conf as shown below, and it still searches through all the files. Perhaps my regex or format is wrong. The directory path I'm using is /topics/foo/01-12-2017/

This is what I added to indexes.conf:
vix.input.1.et.regex = /topics/foo/(\d+)-(\d+)-(\d+)
vix.input.1.et.format = MMddyyyy

Then I ran this query: index=foo earliest=-5m | sort 0 _time

And unfortunately it still ran through all the files before finishing the search. Any ideas?


kschon_splunk
Splunk Employee

What you have so far looks good. Did you also set lt (latest time) properties? If you only have et, then as far as Splunk knows, any file might contain events from the last five minutes. Based on what you have here, I think you want:

vix.input.1.lt.regex = /topics/foo/(\d+)-(\d+)-(\d+)
vix.input.1.lt.format = MMddyyyy
vix.input.1.lt.offset = 86400

Note that 86400 = 24*60*60 is the number of seconds in one day.
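The effect of a missing lt can be sketched in a few lines of Python. This is an illustration of the reasoning above, not Splunk's code: `may_contain` is a hypothetical helper, and `file_lt=None` models a virtual index with no lt settings.

```python
DAY = 24 * 60 * 60  # 86400 seconds, the lt offset suggested above

def may_contain(file_et, file_lt, search_et, search_lt):
    """True if a file covering [file_et, file_lt) could hold events in the search window.

    file_lt=None means no latest time is known, so the file has no upper bound.
    """
    if file_lt is not None and file_lt <= search_et:
        return False  # the file ends before the window starts
    return file_et < search_lt

old_file_et = 1483228800        # 2017-01-01 00:00:00 UTC, an 11-day-old directory
search_lt = 1484222700          # 2017-01-12 12:05:00 UTC
search_et = search_lt - 5 * 60  # "last 5 minutes"

print(may_contain(old_file_et, None, search_et, search_lt))               # True: must be read
print(may_contain(old_file_et, old_file_et + DAY, search_et, search_lt))  # False: can be skipped
```

With only et configured, every directory that starts before the end of the search window has to be read, which matches the behavior of scanning all of the older files.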


suarezry
Builder

You need to find a file called indexes.conf under the directory where Splunk is installed. The vix.input.* settings are in there. Post the contents here.


EricLloyd79
Builder

[provider:XXX]
vix.command.arg.3 = $SPLUNK_HOME/bin/jars/SplunkMR-hy2.jar
vix.env.HADOOP_HOME = /usr/hdp/2.5.0.0-1245/hadoop
vix.env.HUNK_THIRDPARTY_JARS = $SPLUNK_HOME/bin/jars/thirdparty/common/avro-1.7.7.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/avro-mapred-1.7.7.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/commons-compress-1.10.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/commons-io-2.4.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/libfb303-0.9.2.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/parquet-hive-bundle-1.6.0.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/snappy-java-1.1.1.7.jar,$SPLUNK_HOME/bin/jars/thirdparty/hive_1_2/hive-exec-1.2.1.jar,$SPLUNK_HOME/bin/jars/thirdparty/hive_1_2/hive-metastore-1.2.1.jar,$SPLUNK_HOME/bin/jars/thirdparty/hive_1_2/hive-serde-1.2.1.jar
vix.env.JAVA_HOME = /usr/lib/jvm/jre-1.8.0
vix.family = hadoop
vix.fs.default.name = hdfs://10.x.x.x.:xxxx
vix.mapreduce.framework.name = yarn
vix.output.buckets.max.network.bandwidth = 0
vix.splunk.home.hdfs = /tmp/splunk
vix.yarn.resourcemanager.address = hdfs://10.x.x.x:xxxx

[juniper]
vix.input.1.path = /topics/firewall/...
vix.provider = XXX


suarezry
Builder

OK, so you are missing a bunch of vix.input definitions, particularly vix.input.1.et.regex and vix.input.1.lt.regex. These tell Splunk how to derive a date and time from the file paths.


EricLloyd79
Builder

Yup, I figured that out - thanks for the heads up.


suarezry
Builder

Post your indexes.conf.
