Solved: Splunk Analytics for Hadoop: Why is Hunk searching...

EricLloyd79 · ‎01-11-2017

We are new to Hunk (now called Splunk Analytics for Hadoop).
I am attempting to run a query on our HDFS directories for the last 5 mins.
Here is the query: index=foo | sort 0 _time
So just return all the entries from the last 5 mins in the index foo sorted without truncation.

But it searches through all 8 million + events in our HDFS directories even after it seems to have found the complete list for the last 5 mins.

Any reasons why it might be doing this?

kschon_splunk · ‎01-11-2017

It sounds like an issue with your "et" (earliest time) configurations. When you give a search a time range, Splunk Analytics for Hadoop (formerly called Hunk) decides whether to read a particular file on HDFS based on the earliest and latest times for that file, as read from its path. (It may also skip files based on other field values, if you have configured other path field extractions.) The relevant configurations for your virtual index are:

vix.input.1.et.regex
vix.input.1.et.format
vix.input.1.et.offset
vix.input.1.lt.regex
vix.input.1.lt.format
vix.input.1.lt.offset

You can get more information about these properties here:
http://docs.splunk.com/Documentation/Splunk/6.5.1/Admin/Indexesconf

If you've already set these props and you don't know what's going wrong, please post the provider and vix stanzas for this vix from your indexes.conf file, and an example HDFS file path, after anonymizing any confidential portions.
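
To make the pruning rule described above concrete, here is a minimal Python sketch of the decision. It is only an illustration of the idea, not Splunk's implementation: the function name `file_may_match` and its parameters are hypothetical, times are plain epoch seconds, and treating the latest time as an exclusive bound is an assumption.

```python
from typing import Optional


def file_may_match(
    file_et: Optional[int],
    file_lt: Optional[int],
    search_earliest: int,
    search_latest: int,
) -> bool:
    """Return True if a file could hold events inside the search window.

    file_et / file_lt are the earliest and latest event times derived from
    the file's path (epoch seconds), or None when the et/lt settings are
    not configured for the virtual index.
    """
    # A known latest time at or before the window start rules the file out.
    # With no lt, the file might contain events right up to the present.
    if file_lt is not None and file_lt <= search_earliest:
        return False
    # A known earliest time at or after the window end rules the file out.
    # With no et, the file might contain arbitrarily old events.
    if file_et is not None and file_et >= search_latest:
        return False
    return True
```

With neither bound known, `file_may_match(None, None, start, end)` is always true, which matches the symptom in the question: every file has to be read.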

EricLloyd79 · ‎01-12-2017

As a side note to all who are helping (thank you, by the way), I have changed our directory structure to the recommended
/topics/foo/year/month/day
instead of
/topics/foo/month-day-year
and am altering indexes.conf accordingly. Hopefully this more detailed, time-partitioned structure will allow the regex to work properly. I changed the regex as well, to:
vix.input.1.et.regex = /topics/foo/(\d\d\d\d)/(\d\d)/(\d\d)/.+
vix.input.1.et.format = yyyyMMdd
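
One way to sanity-check a regex and format pair like this before restarting Splunk is to try it against a sample path. The sketch below assumes the capture groups are joined in order and then parsed with the date format (which is what a format of yyyyMMdd with three groups implies). The helper name `extract_day` and the sample file path are hypothetical, and Python's `%Y%m%d` stands in for the Java-style `yyyyMMdd`.

```python
import re
from datetime import datetime

# The et.regex value from the post above.
ET_REGEX = r"/topics/foo/(\d\d\d\d)/(\d\d)/(\d\d)/.+"
# strptime equivalent of the Java-style format "yyyyMMdd".
ET_FORMAT = "%Y%m%d"


def extract_day(path: str) -> datetime:
    """Join the regex capture groups and parse them with the date format."""
    match = re.search(ET_REGEX, path)
    if match is None:
        raise ValueError(f"path does not match et.regex: {path}")
    return datetime.strptime("".join(match.groups()), ET_FORMAT)


# Hypothetical file path under the new year/month/day layout.
print(extract_day("/topics/foo/2017/01/12/filename.log"))
```

A path in the old month-day-year layout raises `ValueError` here, which is a quick way to spot files that the new regex would not cover.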

EricLloyd79 · ‎01-13-2017

More odd behavior.
With the configuration recently posted, if I run the Splunk search:
index=foo | sort 0 _time
and use the time range picker in Splunk Web (the button) to select "Last 5 mins", the search goes through only 1,000,000 of the 8 million+ events (a better result).
When I used:
index=foo earliest=-5m | sort 0 _time
It will find the last 5 min of events, get to ~1,000,000 and then the search hangs...

The good news is I think it's using the changes to indexes.conf somewhat correctly, because when I check how many events I have in the folder for today only, it's about 1,000,000. So I think the behavior is that it is searching through only today's directory... I assume if I want the search to be even quicker, I'd have to partition my subdirectories further, into years/months/days/hours/minutes, which I'm not sure we're willing to do.

I'm not sure why it's hanging when we use the earliest keyword, but your answer solved my problem.
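
The point about finer-grained directories can be illustrated with a small sketch. It lists which time partitions overlap a search window for a given partition size. The function `overlapping_partitions` and the example window are hypothetical, and nothing here is specific to Splunk. It only shows that a 5-minute window still touches one whole partition, so the partition size sets the minimum amount of data read.

```python
from datetime import datetime, timedelta


def overlapping_partitions(start, end, step):
    """List partition start times (aligned to `step`) that overlap [start, end)."""
    epoch = datetime(1970, 1, 1)
    # Round `start` down to the nearest multiple of `step`.
    current = start - ((start - epoch) % step)
    partitions = []
    while current < end:
        partitions.append(current)
        current += step
    return partitions


# Hypothetical "last 5 minutes" window.
end = datetime(2017, 1, 13, 15, 0)
start = end - timedelta(minutes=5)

# Daily directories: one partition spanning 24 hours of data.
print(overlapping_partitions(start, end, timedelta(days=1)))
# Hourly directories: one partition spanning a single hour of data.
print(overlapping_partitions(start, end, timedelta(hours=1)))
```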

kschon_splunk · ‎01-13-2017

It sounds like there is an issue with your query, but I can't see what it is. Hopefully somebody else can. Glad it's working through the GUI though.

EricLloyd79 · ‎01-13-2017

Yeah I cannot understand why when I run this query:
index="juniper" earliest=-5m | sort 0 _time

It still finds the last 5 mins and then continues searching through the rest of the directories...

This is my indexes.conf now:
vix.command.arg.3 = $SPLUNK_HOME/bin/jars/SplunkMR-hy2.jar
vix.env.HADOOP_HOME = /usr/hdp/2.5.0.0-1245/hadoop
vix.env.HUNK_THIRDPARTY_JARS = $SPLUNK_HOME/bin/jars/thirdparty/common/avro-1.7.7.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/avro-mapred-1.7.7.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/commons-compress-1.10.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/commons-io-2.4.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/libfb303-0.9.2.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/parquet-hive-bundle-1.6.0.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/snappy-java-1.1.1.7.jar,$SPLUNK_HOME/bin/jars/thirdparty/hive_1_2/hive-exec-1.2.1.jar,$SPLUNK_HOME/bin/jars/thirdparty/hive_1_2/hive-metastore-1.2.1.jar,$SPLUNK_HOME/bin/jars/thirdparty/hive_1_2/hive-serde-1.2.1.jar
vix.env.JAVA_HOME = /usr/lib/jvm/jre-1.8.0
vix.family = hadoop
vix.fs.default.name = hdfs://10.x.x.x:xxxx
vix.mapreduce.framework.name = yarn
vix.output.buckets.max.network.bandwidth = 0
vix.splunk.home.hdfs = /tmp/splunk
vix.yarn.resourcemanager.address = hdfs://10.x.x.x:xxxx

[juniper]
vix.provider = hadoopoly
vix.input.1.path = /topics/foo/...
vix.input.1.et.regex = /topics/foo/(\d\d\d\d)/(\d\d)/(\d\d)/.+
vix.input.1.et.format = yyyyMMdd
vix.input.1.et.offset = 0
vix.input.1.lt.regex = /topics/foo/(\d\d\d\d)/(\d\d)/(\d\d)/.+
vix.input.1.lt.format = yyyyMMdd
vix.input.1.lt.offset = 86400

EricLloyd79 · ‎01-13-2017

Yes, I've been restarting Splunk after I edit it as well.

kschon_splunk · ‎01-12-2017

Looks good. Under this scheme, to add a latest time, you would use:

vix.input.1.lt.regex = /topics/foo/(\d\d\d\d)/(\d\d)/(\d\d)/.+
vix.input.1.lt.format = yyyyMMdd
vix.input.1.lt.offset = 86400

ddrillic · ‎01-11-2017

How is your data organized on the file system, time-wise? As you probably know, Hunk doesn't store any of the data in any index, so it relies on the data organization of the HDFS file system.

EricLloyd79 · ‎01-12-2017

Our data is stored: /topics/foo/01-12-2017/filename.log
I'm trying to use these parameters to indicate the earliest time, but it doesn't seem to be working:
vix.input.1.et.regex = /topics/firewall/(\d+)-(\d+)-(\d+)
vix.input.1.et.format = MMddyyyy

EricLloyd79 · ‎01-11-2017

I don't actually see those vix.input.* properties when browsing the properties listed under Additional Settings in Virtual Indexes in Splunk Web. Are these properties somewhere else, or do we need to add them? And what should they be set to?

kschon_splunk · ‎01-11-2017

They don't get added by default when you create a new index, but you can add them via the "New Setting" link. Might be easier though to edit your indexes.conf file directly. It's probably in your /etc/apps/search/local/ directory (assuming you were in the search app when you created the vix).

As for what they need to be set to, there is a lot of detail on the page I linked to before, and there is an example here:
https://docs.splunk.com/Documentation/Hunk/6.4.5/Hunk/Setupvirtualindexes

Briefly, "et" means "earliest time" and "lt" means "latest time". Each one is extracted from the HDFS path via the regex, and interpreted via the date format. The offset is just that--it makes the et/lt more or less than what was obtained from the path by a fixed amount. BTW, another useful config is "timezone", as in:

vix.input.x.et.timezone
vix.input.x.lt.timezone
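
To see why the time zone matters, here is a small sketch of how the same date taken from a path maps to different epoch times depending on the zone it is interpreted in. The helper `day_start_epoch` is hypothetical, and the UTC-5 offset is just an example of a local zone the directory names might have been written in.

```python
from datetime import datetime, timedelta, timezone


def day_start_epoch(stamp, utc_offset_hours):
    """Epoch seconds for midnight of a yyyyMMdd stamp at the given UTC offset."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    local_midnight = datetime.strptime(stamp, "%Y%m%d").replace(tzinfo=tz)
    return int(local_midnight.timestamp())


as_utc = day_start_epoch("20170112", 0)
as_utc_minus_5 = day_start_epoch("20170112", -5)

# The same directory name starts 5 hours (18000 seconds) later in UTC-5.
print(as_utc_minus_5 - as_utc)
```

If the zone is wrong, a day directory's et/lt range is shifted by hours, so a search near midnight could skip a directory that actually holds matching events.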

EricLloyd79 · ‎01-12-2017

Thanks for the insight.
I changed my indexes.conf as shown below, and it still searches through all the files. Perhaps my regex or wording is wrong. The file path structure I'm using is /topics/foo/01-12-2017/

This is what I added to indexes.conf:
vix.input.1.et.regex = /topics/foo/(\d+)-(\d+)-(\d+)
vix.input.1.et.format = MMddyyyy

Then I ran this query: index=foo earliest=-5m | sort 0 _time

And unfortunately it still ran through all the files before finishing the search. Any ideas?

kschon_splunk · ‎01-12-2017

What you have so far looks good. Did you also set lt (latest time) properties? If you only have et, then as far as Splunk knows, any file might contain events from the last five minutes. Based on what you have here, I think you want:

vix.input.1.lt.regex = /topics/foo/(\d+)-(\d+)-(\d+)
vix.input.1.lt.format = MMddyyyy
vix.input.1.lt.offset = 86400

Note that 86400 = 24*60*60 is the number of seconds in one day.
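
As a worked example of the offset arithmetic, the sketch below computes the time range that these settings would give a file in the month-day-year layout. It assumes the capture groups are joined and parsed with the date format, uses naive datetimes (time zones ignored), and the helper name `path_time_range` and the sample file name are hypothetical.

```python
import re
from datetime import datetime, timedelta

LT_REGEX = r"/topics/foo/(\d+)-(\d+)-(\d+)"
DATE_FORMAT = "%m%d%Y"      # strptime equivalent of "MMddyyyy"
LT_OFFSET_SECONDS = 86400   # 24 * 60 * 60, one day


def path_time_range(path):
    """Return (et, lt): midnight of the directory's day and midnight of the next day."""
    match = re.search(LT_REGEX, path)
    if match is None:
        raise ValueError(f"path does not match the regex: {path}")
    day_start = datetime.strptime("".join(match.groups()), DATE_FORMAT)
    return day_start, day_start + timedelta(seconds=LT_OFFSET_SECONDS)


et, lt = path_time_range("/topics/foo/01-12-2017/filename.log")
print(et, lt)
```

Without the offset, lt would equal et (midnight at the start of the day), and the directory would look as if it ended before any of its events were written.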

suarezry · ‎01-11-2017

You need to find a file called indexes.conf within the directory Splunk is installed in. The vix.input.* settings are inside. Post the contents here.

EricLloyd79 · ‎01-12-2017

[provider:XXX]
vix.command.arg.3 = $SPLUNK_HOME/bin/jars/SplunkMR-hy2.jar
vix.env.HADOOP_HOME = /usr/hdp/2.5.0.0-1245/hadoop
vix.env.HUNK_THIRDPARTY_JARS = $SPLUNK_HOME/bin/jars/thirdparty/common/avro-1.7.7.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/avro-mapred-1.7.7.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/commons-compress-1.10.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/commons-io-2.4.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/libfb303-0.9.2.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/parquet-hive-bundle-1.6.0.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/snappy-java-1.1.1.7.jar,$SPLUNK_HOME/bin/jars/thirdparty/hive_1_2/hive-exec-1.2.1.jar,$SPLUNK_HOME/bin/jars/thirdparty/hive_1_2/hive-metastore-1.2.1.jar,$SPLUNK_HOME/bin/jars/thirdparty/hive_1_2/hive-serde-1.2.1.jar
vix.env.JAVA_HOME = /usr/lib/jvm/jre-1.8.0
vix.family = hadoop
vix.fs.default.name = hdfs://10.x.x.x:xxxx
vix.mapreduce.framework.name = yarn
vix.output.buckets.max.network.bandwidth = 0
vix.splunk.home.hdfs = /tmp/splunk
vix.yarn.resourcemanager.address = hdfs://10.x.x.x:xxxx

[juniper]
vix.input.1.path = /topics/firewall/...
vix.provider = XXX

suarezry · ‎01-12-2017

OK, so you are missing a bunch of vix.input definitions, particularly vix.input.1.et.regex and vix.input.1.lt.regex. These tell Splunk how to interpret the date and time from paths.

EricLloyd79 · ‎01-12-2017

Yup, I figured that out - thanks for the heads up.

suarezry · ‎01-11-2017

Post your indexes.conf.
