Configuring and tuning MapReduce's shuffle

Nov 26, 2012

Once you have outgrown your small Hadoop cluster, it’s worth tuning some of the shuffle configurables so that performance keeps up with the physical growth of the cluster. The figure below shows the key configurables in the shuffle stage in Hadoop versions 1.x and earlier, and identifies those that should be tuned.

[Figure: key shuffle-stage configurables in Hadoop 1.x and earlier]

You can read more about these configurables and their default values in mapred-default.xml. Chapter 6 of my book, Hadoop in Practice (Manning Publications), discusses how some of the configuration values in the figure should be tweaked once you start working with mid-size to large Hadoop clusters.
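If you'd like to check the defaults without scrolling through the whole file, a small script can pull out just the shuffle-related entries. The following is a minimal sketch, not part of Hadoop: the function name read_properties, the SHUFFLE_PROPERTIES selection, and the trimmed SAMPLE_XML are all assumptions made for illustration. The property names are ones that appear in Hadoop 1.x's mapred-default.xml, but the selection is not the list from the figure above.

```python
import xml.etree.ElementTree as ET

# Illustrative selection of shuffle-related property names from Hadoop 1.x.
# This is an assumption for the sketch, not the list shown in the figure.
SHUFFLE_PROPERTIES = (
    "io.sort.mb",
    "io.sort.factor",
    "io.sort.spill.percent",
    "mapred.reduce.parallel.copies",
    "mapred.job.shuffle.input.buffer.percent",
    "tasktracker.http.threads",
)

# Trimmed stand-in for a real mapred-default.xml so the sketch runs on its own.
SAMPLE_XML = """<configuration>
  <property><name>mapred.job.tracker</name><value>local</value></property>
  <property><name>io.sort.mb</name><value>100</value></property>
  <property><name>io.sort.factor</name><value>10</value></property>
  <property><name>mapred.reduce.parallel.copies</name><value>5</value></property>
</configuration>
"""


def read_properties(xml_text, wanted=SHUFFLE_PROPERTIES):
    """Return {name: value} for every <property> whose name is in `wanted`."""
    root = ET.fromstring(xml_text)
    found = {}
    for prop in root.iter("property"):
        name = prop.findtext("name", default="").strip()
        if name in wanted:
            # A property may have an empty or missing <value>; treat it as "".
            found[name] = (prop.findtext("value") or "").strip()
    return found


if __name__ == "__main__":
    for name, value in sorted(read_properties(SAMPLE_XML).items()):
        print(f"{name} = {value}")
```

To run it against a real cluster's settings, pass in the contents of the mapred-default.xml that ships with your Hadoop distribution instead of SAMPLE_XML, and extend SHUFFLE_PROPERTIES with whichever names you care about.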

About the author

Alex Holmes is a senior software engineer with over 15 years of experience developing large-scale distributed Java systems. Over the last four years he has gained expertise in Hadoop, solving Big Data problems across a number of projects. He is the author of "Hadoop in Practice", a book published by Manning Publications. He has presented at JavaOne and Jazoon and is currently a technical lead at VeriSign.

If you want to see what Alex is up to, you can check out his work on GitHub or follow him on Twitter or Google+.
