Using CDH5 (MR2) and Hunk 6.1 on CentOS 6.4...
I have my netflow ASCII data in HDFS in 15-minute increments: each day is a higher-level directory, and each file holds 15 minutes of netflow data. Something like this:
/user/netflow/2015-05-25/asciiflow2014-05-25-02-45-01.csv
/user/netflow/2015-05-25/asciiflow2014-05-25-03-00-01.csv
..
..
/user/netflow/2015-05-26/asciiflow2014-05-26-02-45-01.csv
..
Given this, I am wondering whether the virtual index configuration I have, listed below, is correct.
Searches seem to take the same amount of time no matter what the time period is....
Time Capturing Regex is "/user/netflow/(\d+)-(\d+)-(\d+)/"
Time Format is "yyyyMMdd"
Time Adjustment is 15 Minutes??
Time Range is 1 day ??
You can either extract the time range from the parent dir:
Time Capturing Regex: "/user/netflow/(\d+)-(\d+)-(\d+)/"
Time Format: "yyyyMMdd"
Time Adjustment: 0
Time Range: 1 day
or you can extract the more granular timestamp at the file level:
Time Capturing Regex: "asciiflow(\d+)-(\d+)-(\d+)-(\d+)-(\d+)-\d+\.csv$"
Time Format: "yyyyMMddHHmm"
Time Adjustment: 0
Time Range: 15 minutes
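
To sanity-check either option before changing the virtual index, a rough Python sketch of the idea may help. It is not Hunk's actual implementation: the function name file_time_range is hypothetical, the Java-style formats are translated by hand to Python strptime formats, and it assumes the capture groups are concatenated and parsed with the time format, with the adjustment shifting the start and the range setting the width of the window.

```python
import re
from datetime import datetime, timedelta

def file_time_range(path, regex, strptime_format, time_range,
                    adjustment=timedelta(0)):
    """Hypothetical helper: return (earliest, latest) for a path, or None.

    Assumption: the regex capture groups are concatenated and parsed with
    the time format; this only mimics the idea, not Hunk's own code.
    """
    match = re.search(regex, path)
    if match is None:
        return None
    # Concatenate the groups, e.g. ("2015", "05", "25") -> "20150525".
    stamp = "".join(match.groups())
    earliest = datetime.strptime(stamp, strptime_format) + adjustment
    return earliest, earliest + time_range

# Sample path copied verbatim from the question.
path = "/user/netflow/2015-05-25/asciiflow2014-05-25-02-45-01.csv"

# Option 1: date from the parent directory ("yyyyMMdd" -> "%Y%m%d"), 1 day wide.
by_dir = file_time_range(
    path, r"/user/netflow/(\d+)-(\d+)-(\d+)/", "%Y%m%d", timedelta(days=1))

# Option 2: timestamp from the file name ("yyyyMMddHHmm" -> "%Y%m%d%H%M"),
# 15 minutes wide.
by_file = file_time_range(
    path, r"asciiflow(\d+)-(\d+)-(\d+)-(\d+)-(\d+)-\d+\.csv$",
    "%Y%m%d%H%M", timedelta(minutes=15))

print(by_dir)
print(by_file)
```

If a regex fails to match a path, the helper returns None, which is one way to spot a pattern (such as one with missing backslashes) that can never narrow the search. Note that in the sample path the directory date and the file-name date use different years, so the two options give different dates for it.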