runs about 28 minutes on our small 5-node lab cluster with 32 GB of memory and 3 HDFS nodes. There are some 35 million NetFlow records. My question is twofold:
1) When I run the same query over and over, the run time stays about the same every time. Should I expect an index to be created somewhere so that subsequent queries run faster?
2) In terms of improving Splunk/Hunk/Hadoop performance, if I segregate the data into HDFS directories by date (for example 2014-05-26, 2014-05-27), will performance increase, provided I narrow my search to, say, the last 24 hours?
1) For Hunk to create a MapReduce (MR) job, you will need to change your Splunk query:
From this - index=tomnetflow destination_address="188.8.131.52"
To something like this - index=tomnetflow destination_address="184.108.40.206" | top destination_address
In addition, make sure that you are in Smart Mode and not in Verbose Mode.
2) Hunk uses a virtual index (VIX). No index is actually built over the data, so running the same query again will not be any faster.
3) To make Hunk run faster: make sure your searches run as MR jobs (see answer #1); configure the VIX with a regex that extracts the time from the file name or the HDFS directory name (as you suggested, this lets Hunk bring in less data per MR job); and consider Report Acceleration, which caches the results.
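To illustrate why date-based directories help, here is a minimal Python sketch of the pruning idea: a regex pulls the day out of each path, and only the paths whose day overlaps the search time range are kept. This is not Hunk's actual implementation. The `/data/netflow/YYYY-MM-DD/` layout, the `DATE_IN_PATH` pattern and the `paths_in_range` helper are all assumptions made up for the example.

```python
import re
from datetime import datetime, timedelta

# Hypothetical layout: one HDFS directory per day,
# e.g. /data/netflow/2014-05-26/part-0000.gz
DATE_IN_PATH = re.compile(r"/(\d{4}-\d{2}-\d{2})/")


def paths_in_range(paths, earliest, latest):
    """Keep only paths whose day directory overlaps [earliest, latest)."""
    kept = []
    for path in paths:
        match = DATE_IN_PATH.search(path)
        if match is None:
            # No date in the path: the file cannot be ruled out, so keep it.
            kept.append(path)
            continue
        day_start = datetime.strptime(match.group(1), "%Y-%m-%d")
        day_end = day_start + timedelta(days=1)
        # Standard interval-overlap check between the day and the search window.
        if day_start < latest and day_end > earliest:
            kept.append(path)
    return kept


paths = [
    "/data/netflow/2014-05-25/part-0000.gz",
    "/data/netflow/2014-05-26/part-0000.gz",
    "/data/netflow/2014-05-27/part-0000.gz",
]

# A 24-hour search window from noon on 2014-05-26 to noon on 2014-05-27.
selected = paths_in_range(
    paths,
    earliest=datetime(2014, 5, 26, 12),
    latest=datetime(2014, 5, 27, 12),
)
print(selected)
```

With this window the 2014-05-25 directory is skipped before any data is read, which is the saving you get from narrowing the search to the last 24 hours.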