I have created a virtual index with CDH5 and Hunk 6.1. A simple query like the following:
runs for about 28 minutes on our small five-node lab cluster with 32 GB of memory and 3 HDFS nodes, holding some 35 million NetFlow records. My question is twofold:
1) When I run the same query over and over, the runtime stays roughly the same each time. Should I expect an index to be created somewhere so that subsequent queries run faster?
2) In terms of improving Splunk/Hunk/Hadoop performance: if I segregate the data into HDFS directories by date (e.g., 2014-05-26, 2014-05-27), will performance improve (provided I narrow my search to, for example, the last 24 hours)?
Just to clarify: Hunk does not create an index from the data it searches. Yes, we do recommend partitioning your data by time and by any other fields you search frequently.
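For example (hypothetical paths), a time-partitioned HDFS layout for your NetFlow data might look like this, with one directory per day so Hunk can skip directories outside the search time range:

```
/data/netflow/2014-05-26/part-00000
/data/netflow/2014-05-26/part-00001
/data/netflow/2014-05-27/part-00000
/data/netflow/2014-05-27/part-00001
```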
1) In order for Hunk to launch a MapReduce (MR) job, you will need to change your Splunk query:
From this - index=tomnetflow destination_address="22.214.171.124"
To something like this - index=tomnetflow destination_address="126.96.36.199" | top destination_address
In addition, make sure that you are in 'smart mode' and not in 'verbose mode'.
2) Hunk uses a VIX (Virtual Index). No index is actually created, so repeated searches will not get any faster.
3) To make Hunk run faster:
- Make sure your searches trigger MR jobs (see the answer to #1).
- Configure your VIX with a regex that extracts the time from the file name or the HDFS directory name (as you mentioned, this lets Hunk bring back less data per MR job).
- Use Report Acceleration, which will cache the results.
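As a sketch of the second point, a Hunk virtual index in indexes.conf can extract the event time range from the directory name with the vix.input.N.et/lt settings. The provider name, namenode address, and data path below are assumptions for illustration; the regex assumes the date-partitioned layout described above:

```
[provider:hadoop-lab]
vix.family = hadoop
vix.fs.default.name = hdfs://namenode:8020
vix.splunk.home.hdfs = /user/splunk/workdir

[tomnetflow]
vix.provider = hadoop-lab
# "..." is Hunk path syntax for matching any subdirectory depth
vix.input.1.path = /data/netflow/...
# earliest time: taken from the yyyy-MM-dd directory name
vix.input.1.et.regex = /data/netflow/(\d{4}-\d{2}-\d{2})
vix.input.1.et.format = yyyy-MM-dd
vix.input.1.et.offset = 0
# latest time: same directory name, offset by one day (86400 s)
vix.input.1.lt.regex = /data/netflow/(\d{4}-\d{2}-\d{2})
vix.input.1.lt.format = yyyy-MM-dd
vix.input.1.lt.offset = 86400
```

With this in place, a search scoped to the last 24 hours lets Hunk prune all directories whose date falls outside that window before any MR task reads them.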