I have a Hunk installation that is successfully (albeit slowly) pulling data from an s3:// filesystem. However, I can't get Hunk to search only the relevant directories in S3. A search over a specific time range in the Hunk UI returns the correct results, but Hunk still searches over all files in Hadoop to produce them, which is slow and wasteful.
For instance, I have my data in directories in s3 that follow this format:
s3://my-bucket/data/appname/2016/08/09/22/appname_22_30.log
which would correspond to the logs from my app that were collected on August 9th, 2016 for the minute of 22:30.
I have correspondingly set up my provider with the following properties:
vix.input.1.et.format = yyyyMMddHHmm
vix.input.1.et.offset = 0
vix.input.1.et.regex = .*?/appname/(\d+)?/?(\d+)?/?(\d+)?/?(\d+)?.*_?(\d{2}).*?
vix.input.1.lt.format = yyyyMMddHHmm
vix.input.1.lt.offset = 60
vix.input.1.lt.regex = .*?/appname/(\d+)?/?(\d+)?/?(\d+)?/?(\d+)?.*_?(\d{2}).*?
When running searches, I've noticed in my search.log that I get lines like this...
DEBUG ERP.s3-emr - VirtualIndex - File meets time heuristic path=s3://my-bucket/data/myapp/2016/08/02/11/myapp_11_40.log, search.et=1470009600, search.lt=1470268800, file.et=0, file.lt=9223372036854775807, file.mtime=1470766383
08-09-2016 20:24:02.879 DEBUG ERP.s3-emr - VirtualIndex - File meets the search criteria. Will consider it, path=s3://my-bucket/data/myapp/2016/08/02/11/myapp_11_40.log
...which indicate to me that the regex isn't doing its job, as file.et and file.lt are not set properly (they are left at 0 and 9223372036854775807, the maximum 64-bit integer).
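One way to sanity-check the regex outside Hunk is a small Python sketch like the one below. It is only an approximation under a few assumptions: that Python's `re` behaves like Hunk's Java regex engine for this particular pattern, that Hunk concatenates the capture groups before applying the format (my understanding of the et/lt settings), and that `%Y%m%d%H%M` is an adequate stand-in for `yyyyMMddHHmm`. The helper name `extract_stamp` is made up for illustration.

```python
import re
from datetime import datetime

# Regex and time format copied from the provider settings above.
ET_REGEX = r".*?/appname/(\d+)?/?(\d+)?/?(\d+)?/?(\d+)?.*_?(\d{2}).*?"
ET_FORMAT = "%Y%m%d%H%M"  # assumed Python equivalent of yyyyMMddHHmm


def extract_stamp(path):
    """Hypothetical helper: join the capture groups, or return None on no match."""
    m = re.match(ET_REGEX, path)
    if m is None:
        return None
    # Optional groups that did not participate come back as None; treat as empty.
    return "".join(g or "" for g in m.groups())


# Path from the directory layout example, and path from the search.log excerpt.
sample = "s3://my-bucket/data/appname/2016/08/09/22/appname_22_30.log"
logged = "s3://my-bucket/data/myapp/2016/08/02/11/myapp_11_40.log"

for path in (sample, logged):
    stamp = extract_stamp(path)
    parsed = datetime.strptime(stamp, ET_FORMAT) if stamp else None
    print(path, "->", stamp, parsed)
```

In this sketch the example path yields the string 201608092230, which parses cleanly with the format, while the path from the log excerpt yields no match at all because it contains /myapp/ rather than /appname/. If that difference exists in the real configuration and is not just a slip in anonymizing the post, it would be consistent with file.et and file.lt being left at their defaults.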
Does anyone have any idea as to why this might be happening?
Thanks in advance!!