

Hadoop unit testing with MiniMRCluster and MiniDFSCluster

In a recent blog post Steve Loughran mentioned that I didn’t cover Hadoop’s MiniMRCluster in my book. When I wrote the testing chapter of “Hadoop in Practice” I decided that covering MRUnit and LocalJobRunner was sufficient for the unit-testing needs of most MapReduce jobs, but for completeness I want to cover MiniMRCluster in this post.

MRUnit is great for quick and easy unit testing of MapReduce jobs, where you don’t want to test Input/OutputFormat and Partitioner code. LocalJobRunner is a step above MRUnit in that it allows you to test Input/OutputFormat classes, but it is single-threaded so it’s not useful for uncovering bugs related to multiple map or reduce tasks, or for properly exercising partitioners.
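For context, here is roughly what that quick-and-easy MRUnit style looks like for a word-count mapper. This is a sketch assuming MRUnit’s old-API MapDriver and a hypothetical WordCountMapper that emits each word with a count of 1:

```java
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.MapDriver;
import org.junit.Test;

public class WordCountMapperTest {

    @Test
    public void testMap() {
        // feed a single record through the mapper in isolation;
        // no InputFormat, Partitioner or cluster is involved
        new MapDriver<LongWritable, Text, Text, LongWritable>()
                .withMapper(new WordCountMapper())
                .withInput(new LongWritable(0), new Text("b a"))
                .withOutput(new Text("b"), new LongWritable(1))
                .withOutput(new Text("a"), new LongWritable(1))
                .runTest();
    }
}
```

Fast and convenient, but as noted above, everything outside the mapper (and reducer) goes untested.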

That’s where MiniMRCluster (and MiniDFSCluster) come into play. These classes offer full-blown in-memory MapReduce and HDFS clusters, and can launch multiple MapReduce and HDFS nodes. MiniMRCluster and MiniDFSCluster are bundled with the Hadoop 1.x test JAR, and are used heavily within Hadoop’s own unit tests.
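If you’d rather not extend a base test class (more on that below), both classes can also be instantiated directly. A sketch, using the constructor signatures from the Hadoop 1.x test JAR:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.MiniDFSCluster;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MiniMRCluster;

public class DirectMiniClusterExample {
    public static void main(String[] args) throws Exception {
        // required to avoid the JobTracker NPE discussed later in this post
        System.setProperty("hadoop.log.dir", "/tmp/logs");

        // format and start an in-memory HDFS with 2 DataNodes
        MiniDFSCluster dfs = new MiniDFSCluster(new Configuration(), 2, true, null);
        FileSystem fs = dfs.getFileSystem();

        // start an in-memory MapReduce cluster with 2 TaskTrackers,
        // pointed at the mini HDFS instance
        MiniMRCluster mr = new MiniMRCluster(2, fs.getUri().toString(), 1);

        // a JobConf pre-wired to submit jobs to the mini cluster
        JobConf conf = mr.createJobConf();

        // ... configure and run jobs against the cluster here ...

        mr.shutdown();
        dfs.shutdown();
    }
}
```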

The easy way to leverage MiniMRCluster and MiniDFSCluster is to extend the abstract ClusterMapReduceTestCase class, which is a JUnit TestCase and starts/stops a Hadoop cluster around each JUnit test. ClusterMapReduceTestCase runs a 2-node MapReduce cluster with 2 HDFS nodes. In theory, you should be able to use this class as follows:

import java.io.*;

import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCountTest extends ClusterMapReduceTestCase {
    public void test() throws Exception {
        JobConf conf = createJobConf();

        Path inDir = new Path("testing/jobconf/input");
        Path outDir = new Path("testing/jobconf/output");

        OutputStream os = getFileSystem().create(new Path(inDir, "text.txt"));
        Writer wr = new OutputStreamWriter(os);
        wr.write("b a\n");
        wr.close();

        conf.setJobName("mr");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(LongWritable.class);

        conf.setMapperClass(WordCountMapper.class);
        conf.setReducerClass(SumReducer.class);

        FileInputFormat.setInputPaths(conf, inDir);
        FileOutputFormat.setOutputPath(conf, outDir);

        assertTrue(JobClient.runJob(conf).isSuccessful());

        // Check the output is as expected
        Path[] outputFiles = FileUtil.stat2Paths(
                getFileSystem().listStatus(outDir, new Utils.OutputFileUtils.OutputFilesFilter()));

        assertEquals(1, outputFiles.length);

        InputStream in = getFileSystem().open(outputFiles[0]);
        BufferedReader reader = new BufferedReader(new InputStreamReader(in));
        assertEquals("a\t1", reader.readLine());
        assertEquals("b\t1", reader.readLine());
        assertNull(reader.readLine());
        reader.close();
    }
}

However, at least with the Hadoop 1.0.3 release, this will fail with the following exception:

12/10/19 23:10:37 ERROR mapred.MiniMRCluster: Job tracker crashed
java.lang.NullPointerException
  at java.io.File.<init>(File.java:222)
  at org.apache.hadoop.mapred.JobHistory.initLogDir(JobHistory.java:531)
  at org.apache.hadoop.mapred.JobHistory.init(JobHistory.java:499)
  at org.apache.hadoop.mapred.JobTracker$2.run(JobTracker.java:2334)
  at org.apache.hadoop.mapred.JobTracker$2.run(JobTracker.java:2331)
  at java.security.AccessController.doPrivileged(Native Method)
  ...

The problem is that the JobTracker expects the hadoop.log.dir system property to be set, which it isn’t in our example, resulting in the NPE. As it turns out this is a bug (see MAPREDUCE-2785) which, according to JIRA, will be fixed in the Hadoop 1.1 release (thanks to Steve for that information). The fix is simple: override the setUp() method in ClusterMapReduceTestCase and set the Hadoop log directory:

@Override
protected void setUp() throws Exception {

    // the JobTracker's JobHistory needs this property to initialize its log dir
    System.setProperty("hadoop.log.dir", "/tmp/logs");

    // start the MapReduce and DFS clusters, as super.setUp() would have done
    super.startCluster(true, null);
}

Once you make this change the JUnit test above will work. Rolling this workaround into each and every one of your unit tests can get tedious, but luckily there are a couple of options out there so that you don’t have to. First, Steve pointed out a LocalMRCluster Groovy class bundled in SmartFrog which fixes this issue by extending MiniMRCluster.

Another alternative is to use my GitHub hadoop-utils project, which contains a JUnit class similar to ClusterMapReduceTestCase called MiniHadoopTestCase. It fixes this property problem, gives you more control over where the in-memory clusters store their data on your local filesystem, and lets you control the number of TaskTrackers and DataNodes.

Hadoop-utils also contains a helper class, TextIOJobBuilder, to help with writing MapReduce input files and verifying the output results. You can see an example of how clean your unit tests can look when combining TextIOJobBuilder with MiniHadoopTestCase in the TotalOrderSortTest class:

public class TotalOrderSortTest extends MiniHadoopTestCase {

    @Test
    public void test() throws Exception {

        InputSampler.RandomSampler sampler = new InputSampler.RandomSampler(1.0, 6, 1);

        JobConf jobConf = super.getMiniHadoop().createJobConf();

        TextIOJobBuilder builder = new TextIOJobBuilder(
                super.getMiniHadoop().getFileSystem())
                .addInput("foo-hump")
                .addInput("foo-hump")
                .addInput("clump-bar")
                .addExpectedOutput("clump-bar")
                .addExpectedOutput("foo-hump")
                .writeInputs();

        new SortConfig(jobConf).setUnique(true);

        SortTest.run(
                jobConf,
                builder,
                2,
                2,
                sampler);
    }
}

The only real downside to using MiniMRCluster and MiniDFSCluster is speed: setup and tear-down take a good 5-10 seconds each, and multiplied across every test case this quickly adds up.
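One way to soften the blow is to pay the startup cost once per test class rather than once per test, by managing the clusters yourself from JUnit 4 @BeforeClass/@AfterClass hooks instead of extending ClusterMapReduceTestCase. A sketch (class name and structure are illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.MiniDFSCluster;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MiniMRCluster;
import org.junit.AfterClass;
import org.junit.BeforeClass;
import org.junit.Test;

public class SharedClusterTest {

    private static MiniDFSCluster dfs;
    private static MiniMRCluster mr;

    @BeforeClass
    public static void startClusters() throws Exception {
        // same workaround as above, applied once for the whole class
        System.setProperty("hadoop.log.dir", "/tmp/logs");
        dfs = new MiniDFSCluster(new Configuration(), 2, true, null);
        mr = new MiniMRCluster(2, dfs.getFileSystem().getUri().toString(), 1);
    }

    @AfterClass
    public static void stopClusters() {
        if (mr != null) mr.shutdown();
        if (dfs != null) dfs.shutdown();
    }

    @Test
    public void testOne() throws Exception {
        JobConf conf = mr.createJobConf();
        // ... run a job against the shared cluster ...
    }
}
```

The trade-off is that tests sharing a cluster must not tread on each other’s HDFS paths.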

About the author


Alex Holmes is a senior software engineer with over 15 years of experience developing large scale distributed Java systems. Since 2008 he has gained expertise in using Hadoop to solve Big Data problems across a number of projects. He is the author of Hadoop in Practice, a book published by Manning Publications. He has presented at JavaOne and Jazoon.

If you want to see what Alex is up to you can check out his work on GitHub, or follow him on Twitter or Google+.

