Monitoring Splunk

How to improve Hunk performance when accessing Hive tables with many small ORC files?

Path Finder

We are seeing excessively slow performance when accessing Hive tables with many small ORC files.

We are looking for ways to improve performance. From what we can see, Hunk is causing Hadoop to create many thousands of mappers, because each individual ORC file results in its own map task. It can take hours for even one panel on a dashboard to populate.

Does Hunk use the CombineFileInputFormat API? It seems this would reduce the number of mappers generated to complete a search.


Splunk Employee

Hunk does not determine how many mappers are going to run. Hunk submits the job, and Hadoop determines how many mappers (map task attempts) to run. As you are seeing, Hadoop creates a new map task for each ORC file.
A few options to fix this issue:
1) Ask the people who create the ORC files to make them larger (for example, 127 MB per file or larger).
2) Lower the maxsplits flag. By default, vix.splunk.search.mr.maxsplits = 10000, which means Hunk processes up to 10,000 ORC files per job. Lowering this value, say to 5,000, will create more jobs, but each job will process fewer files. That lowers the overhead of each individual Hunk MapReduce job.
3) You can set any Hadoop client flag as long as you prefix it with vix. (for example, vix.mapreduce.job.jvm.numtasks = 100).
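As a concrete illustration of options 2 and 3, these vix.* settings would live in the virtual index provider configuration. The sketch below assumes a provider stanza in indexes.conf; the stanza name is a placeholder, and the values are the examples from this answer, not recommendations:

```ini
# Hypothetical Hunk provider stanza; the provider name is illustrative.
[provider:my-hadoop-provider]
# Cap the number of ORC files processed per MapReduce job (default 10000).
vix.splunk.search.mr.maxsplits = 5000
# Any Hadoop client flag can be passed through with a vix. prefix:
vix.mapreduce.job.jvm.numtasks = 100
```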


New Member

Hunk does have control over the number of mappers: by choosing the input format, it controls the number of splits, which in turn controls the number of mappers. Per the Splunk source code, it does not support CombineFileInputFormat, so unless Hunk adopts it in its code, we won't get this feature.

Hunk should seriously consider adding this feature, since small files are inevitable when we batch-load at short intervals. This small-file problem has been solved in Hadoop with CombineFileInputFormat and in Hive with CombineHiveInputFormat. It should be simple and effective for Hunk to adopt the same approach.
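To make the benefit concrete, the core idea behind CombineFileInputFormat is to pack many small files into a few large splits so that one mapper handles several files instead of one. This is a simplified sketch of that packing logic, not Hadoop's actual implementation (which also considers block and rack locality):

```python
# Sketch of the split-combining idea behind Hadoop's CombineFileInputFormat:
# greedily pack small files into splits capped at a maximum size, so each
# mapper processes several files instead of exactly one.

def combine_splits(file_sizes, max_split_bytes):
    """Pack file sizes (in bytes) into splits no larger than max_split_bytes;
    a file larger than the cap gets a split of its own."""
    splits, current, current_size = [], [], 0
    for size in file_sizes:
        if current and current_size + size > max_split_bytes:
            splits.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        splits.append(current)
    return splits

# 1,000 ORC files of 1 MB each: one mapper per file without combining...
small_files = [1 * 1024 * 1024] * 1000
# ...but only 8 mappers when packed into 128 MB splits.
splits = combine_splits(small_files, 128 * 1024 * 1024)
print(len(splits))  # 8
```

With combining, the mapper count scales with total data volume rather than file count, which is exactly what the many-small-files workload needs.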

Preparing the data as larger files beforehand is not a good option.

