Once you have outgrown your small Hadoop cluster it’s worth tuning some of the shuffle configurables to ensure that your performance keeps up with the physical growth of your cluster. The figure below shows key configurables in the shuffle stage in Hadoop versions 1.x and earlier, and identifies those that should be tuned.
You can read more about these configurables and their default values by looking at mapred-default.xml. My book Hadoop in Practice (Manning Publications) in chapter 6 discusses how some of the configuration values in the figure should be tweaked when you start working with mid to large-size Hadoop clusters.
About the author
Alex Holmes is a senior software engineer with over 15 years of experience developing large scale distributed Java systems. Since 2008 he has gained expertise in using Hadoop to solve Big Data problems across a number of projects. He is the author of Hadoop in Practice (first edition, with second edition currently in the early access program), a book published by Manning Publications. He has presented at JavaOne and Jazoon.
RECENT BLOG POSTS
Understanding how Parquet integrates with Avro, Thrift and Protocol Buffers
Parquet offers integration with a number of object models, and this post shows how Parquet supports various object models.
Using Oozie 4.4.0 with Hadoop 2.2
Patching Oozie's build so that you can create a package targetting Hadoop 2.2.0.
Hadoop in Practice, Second Edition
A sneak peek at what's coming in the second edition of my book.
Using Hadoop 2.2 as a sink in Flume 1.4
Working around the protobuf 2.5 dependency introduced by Hadoop 2.2.
Simplifying secondary sorting in MapReduce with htuple
Introducing htuple, an open-source project to simplify secondary sorting in MapReduce.