Thursday, April 24, 2014

Hadoop: the number-of-reducers rule of thumb in MapReduce, explained

While reviewing my study material for the CCDH exam I am taking in a few weeks, I came across the topic of the recommended number of reducers.
There is a widely known rule of thumb: set the number of reducers to 0.95 or 1.75 times (#nodes * mapred.tasktracker.reduce.tasks.maximum), where #nodes is the number of worker nodes running a TaskTracker.

With the 0.95 factor, all reduce tasks can launch at once in a single wave, but each one handles more data and typically takes longer. With the 1.75 factor, each reduce task finishes faster, but some reduce tasks wait in a queue for a free slot, so the faster nodes pick up a second wave of work. Prefer the 1.75 factor for better load balancing.
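As a minimal sketch of how you would apply the rule of thumb from the driver code (the node count is a placeholder for your own cluster, and the property name assumes MR1-style TaskTrackers):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReducerCountExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Placeholder: the number of worker nodes in your cluster.
        int nodes = 10;
        // Reduce slots per TaskTracker; 2 is the MR1 default if the property is unset.
        int reduceSlotsPerNode = conf.getInt("mapred.tasktracker.reduce.tasks.maximum", 2);

        // 0.95 -> a single wave of reducers; 1.75 -> two waves and better load balancing.
        double factor = 1.75;
        int numReducers = (int) Math.floor(factor * nodes * reduceSlotsPerNode);

        Job job = Job.getInstance(conf, "reducer-count-example");
        job.setNumReduceTasks(numReducers);
        // ... set mapper, reducer, input/output paths and submit as usual ...
    }
}
```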

From Cloudera documentation:

Typical way to determine how many Reducers to specify:
– Test the job with a relatively small test data set
– Extrapolate to calculate the amount of intermediate data expected from the 'real' input data.
– Use that to calculate the number of Reducers which should be specified.
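A hedged, made-up worked example of that extrapolation (none of these numbers come from Cloudera): say a 1 GB test run produced about 200 MB of intermediate data and the 'real' input is 500 GB, so roughly 100 GB of intermediate data is expected; aiming for about 1 GB per reducer then suggests around 100 reducers.

```java
public class ReducerEstimate {
    public static void main(String[] args) {
        // Hypothetical measurements from a small test run.
        double sampleInputGB = 1.0;          // size of the test data set
        double sampleIntermediateGB = 0.2;   // intermediate (map output) data it produced
        double realInputGB = 500.0;          // size of the 'real' input data

        // Extrapolate the intermediate data expected from the real input.
        double expectedIntermediateGB = realInputGB * (sampleIntermediateGB / sampleInputGB);

        // Assumed target of ~1 GB of intermediate data per reducer (a starting point, not a rule).
        double targetGBPerReducer = 1.0;
        int numReducers = (int) Math.ceil(expectedIntermediateGB / targetGBPerReducer);

        System.out.println("Expected intermediate data: " + expectedIntermediateGB + " GB");
        System.out.println("Suggested number of reducers: " + numReducers);
    }
}
```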

However, this doesn't fit every problem or environment. Looking for a deeper explanation and hints from industry leaders, I found the following advice from Allen Wittenauer of LinkedIn in an interesting conversation on Quora:

 I tend to tell users that their ideal reducers should be the optimal value that gets them closest to:

- A multiple of the block size
- A task time between 5 and 15 minutes
- Creates the fewest files possible

Anything other than that means there is a good chance your reducers are less than great.  There is a tremendous tendency for users to use a REALLY high value ("More parallelism means faster!") or a REALLY low value ("I don't want to blow my namespace quota!").  Both are equally dangerous, resulting in one or more of:

- Terrible performance on the next phase of the workflow
- Terrible performance due to the shuffle
- Terrible overall performance because you've overloaded the namenode with objects that are ultimately useless
- Destroying disk IO for no really sane reason
- Lots of network transfers due to dealing with crazy amounts of CFIF/MFIF work

Now, there are always exceptions and special cases. One particular special case is that if following that advice makes the next step in the workflow do ridiculous things, then we need to likely 'be an exception' in the above general rules of thumb.
[...]
One other thing to mention:  if you are changing the reducer count just to make the partitioning non-skewy in order to influence the partition, you're just plain "doing it wrong".
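One way to read the "multiple of the block size" and "fewest files possible" guidelines is to size the reducer count from the expected final output, so that each reducer writes roughly one HDFS block. The sketch below is my own illustration, not Wittenauer's code; the output-size estimate and the 128 MB default are assumptions:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class BlockAlignedReducers {
    /**
     * Illustrative helper: pick a reducer count so that each reducer writes
     * roughly one HDFS block of final output, keeping the number of output
     * files low and each file close to the block size.
     */
    static int reducersForOutputSize(long estimatedOutputBytes, long blockSizeBytes) {
        return (int) Math.max(1L, (estimatedOutputBytes + blockSizeBytes - 1) / blockSizeBytes);
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Fall back to 128 MB if dfs.blocksize is not set (assumed default).
        long blockSize = conf.getLong("dfs.blocksize", 128L * 1024 * 1024);

        long estimatedOutputBytes = 50L * 1024 * 1024 * 1024; // ~50 GB, a made-up estimate

        Job job = Job.getInstance(conf, "block-aligned-output");
        job.setNumReduceTasks(reducersForOutputSize(estimatedOutputBytes, blockSize));
    }
}
```

With 50 GB of output and 128 MB blocks that works out to about 400 reducers; whether that also lands each task in the 5-to-15-minute range depends on how heavy your reduce function is.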

There is another issue: shared clusters. If the cluster uses the FairScheduler, your job can be affected by other jobs' and other users' quotas, so you would need to adjust the "#nodes" part of the formula accordingly.
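If you don't want to hard-code the "#nodes" part at all, one option (my own suggestion, not from the discussion above) is to ask the cluster for its reduce capacity at submit time through the old org.apache.hadoop.mapred API. Note it reports total slots, not your fair-share, so on a busy shared cluster it is only a starting point and the share factor below is a placeholder:

```java
import org.apache.hadoop.mapred.ClusterStatus;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class ClusterCapacityExample {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf();
        JobClient client = new JobClient(conf);

        // Total reduce slots in the cluster and how many are currently busy.
        ClusterStatus status = client.getClusterStatus();
        int maxReduceSlots = status.getMaxReduceTasks();
        int busyReduceSlots = status.getReduceTasks();

        // On a shared cluster only a fraction of the slots will be yours;
        // 0.5 is a made-up placeholder for your pool's fair share.
        double myShare = 0.5;
        int numReducers = (int) (1.75 * maxReduceSlots * myShare);

        System.out.println("Cluster reduce slots: " + maxReduceSlots
                + " (busy: " + busyReduceSlots + "), suggested reducers: " + numReducers);
    }
}
```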

More insight from Pere Ferrera of Datasalt:

- Don't use too few reducers: if each one is very big and one fails, retrying it will take much longer than retrying a smaller one.
- The number of reducers is usually not the problem; if there is one, it is partitioning. It doesn't matter whether you use 100 reducers if only 1 of them is going to handle all the work.

Two good points. I like the second: partitioning the data correctly is way more important than the number of "slots" where you are sending the data to be reduced. 
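To make the partitioning point concrete, here is a rough sketch of a custom Partitioner that spreads known "hot" keys over several reducers instead of letting a single reducer handle all the work. The "hot-" prefix and the salting scheme are illustrative assumptions, and the reducers would then need to re-aggregate the partial results for those keys:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

/**
 * Illustrative partitioner for a skewed key distribution: keys known to be
 * "hot" are spread over several reducers instead of all landing on one.
 */
public class SkewAwarePartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (key.toString().startsWith("hot-")) {
            // Salt the hot key with the value so its records spread over many partitions.
            int salt = (key.hashCode() ^ value.get()) & Integer.MAX_VALUE;
            return salt % numPartitions;
        }
        // Default hash partitioning for everything else.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

You would plug it in with job.setPartitionerClass(SkewAwarePartitioner.class), keeping in mind that spreading a key over partitions changes what each reducer sees.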
