Alex Holmes

Configuring memory for MapReduce running on YARN

2016-12-07T14:20:00+00:00

The most common issue that I bump into these days when running MapReduce jobs is the following error:

Application application_1409135750325_48141 failed 2 times due to AM Container for
appattempt_1409135750325_48141_000002 exited with exitCode: 143 due to: Container
[pid=4733,containerID=container_1409135750325_48141_02_000001] is running beyond physical memory limits.
Current usage: 2.0 GB of 2 GB physical memory used; 6.0 GB of 4.2 GB virtual memory used. Killing container.

Reading that message it’s pretty clear that your job has exceeded its memory limits, but how do you go about fixing this?

Make sure your job has to cache data

Before we start tinkering with configuration settings, take a moment to think about what your job is doing. Your map or reduce task running out of memory usually means that data is being cached in your map or reduce tasks. Data can be cached for a number of reasons:

Your job is writing out Parquet data, and Parquet buffers data in memory prior to writing it out to disk
Your code (or a library you’re using) is caching data. An example here is joining two datasets together where one dataset is being cached prior to joining it with the other.

Therefore the first step I’d suggest you take is to think about whether you really need to cache data, and whether it’s possible to reduce your memory utilization without too much work. If it is, you may want to do that before bumping up the memory for your job.

How YARN monitors the memory of your container

This section isn’t specific to MapReduce; it’s an overview of how YARN generally monitors memory for running containers (in MapReduce a container is either a map or reduce process).

Each slave node in your YARN cluster runs a NodeManager daemon, and one of the NodeManager’s roles is to monitor the YARN containers running on the node. One part of this work is monitoring the memory utilization of each container.

To do this the NodeManager periodically (every 3 seconds by default, which can be changed via yarn.nodemanager.container-monitor.interval-ms) cycles through all the currently running containers, calculates the process tree (all child processes for each container), and for each process examines the /proc/<PID>/stat file (where PID is the process ID of the container) and extracts the physical memory (aka RSS) and the virtual memory (aka VSZ or VSIZE).
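To make the /proc parsing step concrete, here is a minimal Java sketch of pulling the virtual and physical memory out of a single stat line. It is not the NodeManager's actual code: the class and method names are made up for illustration, and the 4 KiB page size is an assumption (RSS is reported in pages, so the real page size of your system matters). The field positions (23 for vsize in bytes, 24 for rss in pages) follow the proc(5) man page.

```java
/** Minimal sketch: extract VSIZE and RSS from one line of /proc/<PID>/stat. */
final class ProcStatSketch {
    // Assumption: 4 KiB memory pages. The real page size is system-dependent.
    static final long ASSUMED_PAGE_SIZE_BYTES = 4096L;

    /** Returns {vsizeBytes, rssBytes} for a single stat line. */
    static long[] parseMemory(String statLine) {
        // Field 2 (the command name) is wrapped in parentheses and may contain
        // spaces, so only split the text that follows the last ')'.
        int afterComm = statLine.lastIndexOf(')') + 1;
        String[] fields = statLine.substring(afterComm).trim().split("\\s+");

        // fields[0] is field 3 (state), so field N sits at index N - 3.
        long vsizeBytes = Long.parseLong(fields[23 - 3]); // field 23: vsize, in bytes
        long rssPages = Long.parseLong(fields[24 - 3]);   // field 24: rss, in pages
        return new long[] {vsizeBytes, rssPages * ASSUMED_PAGE_SIZE_BYTES};
    }
}
```

To get the totals for a container you would sum these two numbers over every process in the container's process tree.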

If virtual memory checking is enabled (true by default, overridden via yarn.nodemanager.vmem-check-enabled), then YARN compares the summed VSIZE extracted from the container process (and all child processes) with the maximum allowed virtual memory for the container. The maximum allowed virtual memory is basically the configured maximum physical memory for the container multiplied by yarn.nodemanager.vmem-pmem-ratio (default is 2.1). So if your YARN container is configured to have a maximum of 2 GB of physical memory, then this number is multiplied by 2.1 which means you are allowed to use 4.2 GB of virtual memory.

If physical memory checking is enabled (true by default, overridden via yarn.nodemanager.pmem-check-enabled), then YARN compares the summed RSS extracted from the container process (and all child processes) with the maximum allowed physical memory for the container.

If either the virtual or physical utilization is higher than the maximum permitted, YARN will kill the container, as shown at the top of this article.
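The comparison described above can be sketched in a few lines of Java. This is a simplification for illustration only (the names are hypothetical and the real NodeManager logic has more detail), but it shows how the virtual memory limit is derived from the physical limit and the ratio, and how the two checks combine.

```java
/** Simplified sketch of the per-container memory checks described above. */
final class ContainerMemoryCheckSketch {
    // Default of yarn.nodemanager.vmem-pmem-ratio, as stated in the text.
    static final double DEFAULT_VMEM_PMEM_RATIO = 2.1;

    /** Maximum virtual memory = configured physical memory * ratio. */
    static long vmemLimitBytes(long pmemLimitBytes, double vmemPmemRatio) {
        return (long) (pmemLimitBytes * vmemPmemRatio);
    }

    /**
     * rssBytes and vsizeBytes are the sums over the container's process tree.
     * Returns true if either enabled check finds usage above its limit.
     */
    static boolean shouldKill(long rssBytes, long vsizeBytes, long pmemLimitBytes,
                              double vmemPmemRatio,
                              boolean pmemCheckEnabled, boolean vmemCheckEnabled) {
        boolean overVirtual = vmemCheckEnabled
                && vsizeBytes > vmemLimitBytes(pmemLimitBytes, vmemPmemRatio);
        boolean overPhysical = pmemCheckEnabled && rssBytes > pmemLimitBytes;
        return overVirtual || overPhysical;
    }
}
```

For a 2 GB container with the default ratio, the virtual limit works out to roughly 4.2 GB, so a process tree using 6 GB of virtual memory trips the check even if its physical usage is within bounds.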

Increasing the memory available to your MapReduce job

Back in the days when MapReduce didn’t run on YARN, memory configuration was pretty simple, but these days MapReduce runs as a YARN application and things are a little bit more involved. For MapReduce running on YARN there are actually two memory settings you have to configure at the same time:

The physical memory for your YARN map and reduce processes
The JVM heap size for your map and reduce processes

Physical memory for your YARN map and reduce processes

Configure mapreduce.map.memory.mb and mapreduce.reduce.memory.mb to set the YARN container physical memory limits for your map and reduce processes respectively. For example if you want to limit your map process to 2GB and your reduce process to 4GB, and you wanted that to be the default in your cluster, then you’d set the following in mapred-site.xml:

<property>
  <name>mapreduce.map.memory.mb</name>
  <value>2048</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>4096</value>
</property>

The physical memory configured for your job must fall within the minimum and maximum memory allowed for containers in your cluster (check the yarn.scheduler.minimum-allocation-mb and yarn.scheduler.maximum-allocation-mb properties respectively).

JVM heap size for your map and reduce processes

Next you need to configure the JVM heap size for your map and reduce processes. These sizes need to be less than the physical memory you configured in the previous section, because the JVM also uses memory outside the heap. As a general rule they should be 80% of the size of the YARN physical memory settings.

Configure mapreduce.map.java.opts and mapreduce.reduce.java.opts to set the map and reduce heap sizes respectively. To continue the example from the previous section, we’ll take the 2GB and 4GB physical memory limits and multiply by 0.8 to arrive at our Java heap sizes. So we’d end up with the following in mapred-site.xml (assuming you wanted these to be the defaults for your cluster):

<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx1638m</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx3276m</value>
</property>
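If you'd rather not do the 80% arithmetic by hand, a small helper along these lines would do it. This is only a sketch: the class and method names are hypothetical, and rounding down to a whole megabyte is my own choice rather than anything Hadoop requires.

```java
/** Sketch: derive a JVM heap setting from a YARN container size using the 80% rule. */
final class HeapSizingSketch {
    /** 80% of the container size, rounded down to a whole MB (integer arithmetic). */
    static int heapMb(int containerMb) {
        return containerMb * 8 / 10;
    }

    /** Value suitable for mapreduce.map.java.opts or mapreduce.reduce.java.opts. */
    static String javaOpts(int containerMb) {
        return "-Xmx" + heapMb(containerMb) + "m";
    }

    public static void main(String[] args) {
        System.out.println("map:    " + javaOpts(2048));
        System.out.println("reduce: " + javaOpts(4096));
    }
}
```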

Configuring settings for your job

The same configuration properties that I’ve described above apply if you want to individually configure your MapReduce jobs and override the cluster defaults. Again you’ll want to set values for these properties for your job:

Property	Description
mapreduce.map.memory.mb	The amount of physical memory that your YARN map process can use.
mapreduce.reduce.memory.mb	The amount of physical memory that your YARN reduce process can use.
mapreduce.map.java.opts	Used to configure the heap size for the map JVM process. Should be 80% of `mapreduce.map.memory.mb`.
mapreduce.reduce.java.opts	Used to configure the heap size for the reduce JVM process. Should be 80% of `mapreduce.reduce.memory.mb`.
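One way to apply these per job is to pass them as -D options on the command line, which works when your driver goes through Hadoop's generic options handling (for example by running via ToolRunner). The sketch below just assembles the four properties and renders them as -D arguments; the class and method names are hypothetical, and the heap sizes use the 80% rule rounded down.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: build the four per-job memory properties and render them as -D options. */
final class JobMemoryOptionsSketch {
    static Map<String, String> memoryProperties(int mapContainerMb, int reduceContainerMb) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("mapreduce.map.memory.mb", String.valueOf(mapContainerMb));
        props.put("mapreduce.reduce.memory.mb", String.valueOf(reduceContainerMb));
        // Heap is 80% of the container size, rounded down to a whole MB.
        props.put("mapreduce.map.java.opts", "-Xmx" + (mapContainerMb * 8 / 10) + "m");
        props.put("mapreduce.reduce.java.opts", "-Xmx" + (reduceContainerMb * 8 / 10) + "m");
        return props;
    }

    /** Renders the properties as space-separated -Dname=value arguments. */
    static String toGenericOptions(Map<String, String> props) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : props.entrySet()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append("-D").append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }
}
```

Bear in mind that setting mapreduce.map.java.opts or mapreduce.reduce.java.opts for a job replaces the whole value, so include any other JVM flags your cluster default carries.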

Big data anti-patterns presentation

2015-10-28T00:20:00+00:00

Today I presented on big data anti-patterns to an audience at JavaOne. It was live-streamed (no pressure Alex) and I’m hoping the video will be publicly available shortly; if so I’ll update this post with a link.

The presentation covered seven anti-patterns ranging from fairly high-level ones (such as “you don’t have big data”) to ones that were more in the weeds (approximate counting), and covering tools such as Hadoop, Cassandra and Kafka.

Thanks to everyone who attended - I had a lot of fun presenting, and I’m looking forward to giving more talks in the future.

Here’s a link to the slides of the talk: http://www.slideshare.net/grepalex/avoiding-big-data-antipatterns

Understanding how Parquet integrates with Avro, Thrift and Protocol Buffers

2014-05-13T14:20:00+00:00

Parquet is a new columnar storage format that came out of a collaboration between Twitter and Cloudera. Parquet’s generating a lot of excitement in the community for good reason - it’s shaping up to be the next big thing for data storage in Hadoop for a number of reasons:

It’s a sophisticated columnar file format, which means that it’s well-suited to OLAP workloads, or really any workload where projection is a normal part of working with the data.
It has a high level of integration with Hadoop and the ecosystem - you can work with Parquet in MapReduce, Pig, Hive and Impala.
It supports Avro, Thrift and Protocol Buffers.

The last item raises a question - how does Parquet work with Avro and friends? To understand this you’ll need to understand three concepts:

Storage formats, which are binary representations of data. For Parquet this is contained within the parquet-format GitHub project.
Object model converters, whose job it is to map between an external object model and Parquet’s internal data types. These converters exist in the parquet-mr GitHub project.
Object models, which are in-memory representations of data. Avro, Thrift, Protocol Buffers, Hive and Pig are all examples of object models. Parquet does actually supply an example object model (with MapReduce support), but the intention is that you’d use one of the other richer object models such as Avro.

The figure below shows a visual representation of these concepts.

Avro, Thrift and Protocol Buffers all have their own storage formats, but Parquet doesn’t utilize them in any way. Instead their objects are mapped to the Parquet data model. Parquet data is always serialized using its own file format. This is why Parquet can’t read files serialized using Avro’s storage format, and vice-versa.

Let’s examine what happens when you write an Avro object to Parquet:

The Avro converter stores within the Parquet file’s metadata the schema for the objects being written. You can see this by using a Parquet CLI to dump out the Parquet metadata contained within a Parquet file.

$ export HADOOP_CLASSPATH=parquet-avro-1.4.3.jar:parquet-column-1.4.3.jar:parquet-common-1.4.3.jar:parquet-encoding-1.4.3.jar:parquet-format-2.0.0.jar:parquet-generator-1.4.3.jar:parquet-hadoop-1.4.3.jar:parquet-hive-bundle-1.4.3.jar:parquet-jackson-1.4.3.jar:parquet-tools-1.4.3.jar

$ hadoop parquet.tools.Main meta stocks.parquet
creator:     parquet-mr (build 3f25ad97f209e7653e9f816508252f850abd635f)
extra:       avro.schema = {"type":"record","name":"Stock","namespace" [more]...

file schema: hip.ch5.avro.gen.Stock
--------------------------------------------------------------------------------
symbol:      REQUIRED BINARY O:UTF8 R:0 D:0
date:        REQUIRED BINARY O:UTF8 R:0 D:0
open:        REQUIRED DOUBLE R:0 D:0
high:        REQUIRED DOUBLE R:0 D:0
low:         REQUIRED DOUBLE R:0 D:0
close:       REQUIRED DOUBLE R:0 D:0
volume:      REQUIRED INT32 R:0 D:0
adjClose:    REQUIRED DOUBLE R:0 D:0

row group 1: RC:45 TS:2376
--------------------------------------------------------------------------------
symbol:       BINARY UNCOMPRESSED DO:0 FPO:4 SZ:84/84/1.00 VC:45 ENC:B [more]...
date:         BINARY UNCOMPRESSED DO:0 FPO:88 SZ:198/198/1.00 VC:45 EN [more]...
open:         DOUBLE UNCOMPRESSED DO:0 FPO:286 SZ:379/379/1.00 VC:45 E [more]...
high:         DOUBLE UNCOMPRESSED DO:0 FPO:665 SZ:379/379/1.00 VC:45 E [more]...
low:          DOUBLE UNCOMPRESSED DO:0 FPO:1044 SZ:379/379/1.00 VC:45  [more]...
close:        DOUBLE UNCOMPRESSED DO:0 FPO:1423 SZ:379/379/1.00 VC:45  [more]...
volume:       INT32 UNCOMPRESSED DO:0 FPO:1802 SZ:199/199/1.00 VC:45 E [more]...
adjClose:     DOUBLE UNCOMPRESSED DO:0 FPO:2001 SZ:379/379/1.00 VC:45  [more]...

The “avro.schema” entry is where the Avro schema information is stored. This lets the Avro Parquet reader marshal Avro objects without the client having to supply the schema.

You can also use the “schema” command to view the Parquet schema.

$ hadoop parquet.tools.Main schema stocks.parquet
message hip.ch4.avro.gen.Stock {
  required binary symbol (UTF8);
  required binary date (UTF8);
  required double open;
  required double high;
  required double low;
  required double close;
  required int32 volume;
  required double adjClose;
}

This tool is useful when loading a Parquet file into Hive, as you’ll need to use the field names defined in the Parquet schema when defining the Hive table (note that the syntax below only works with Hive 0.13 and newer).

hive> CREATE EXTERNAL TABLE parquet_stocks(
    symbol string,
    date string,
    open double,
    high double,
    low double,
    close double,
    volume int,
    adjClose double
) STORED AS PARQUET
LOCATION '...';

Using Oozie 4.0.0 with Hadoop 2.2

2014-02-16T14:20:00+00:00

The current version of Oozie (4.0.0) doesn’t build correctly when you try and target Hadoop 2.2. The Oozie team have a fix going into release 4.0.1 (see OOZIE-1551), but until then you can hack the Maven files to get it working with 4.0.0.

First download the 4.0.0 version from https://oozie.apache.org/, and then unpackage it. Next run the following command to change the Hadoop version being targeted:

cd oozie-4.0.0/
find . -name pom.xml | xargs sed -ri 's/2\.2\.0-SNAPSHOT/2.2.0/'

Now all you need to do is target the hadoop-2 profile in Maven and you’ll be all set:

mvn -DskipTests=true -P hadoop-2 clean package assembly:single

Hadoop in Practice, Second Edition

2014-02-11T14:20:00+00:00

The first edition of my book went to press in November 2012, just over a year ago! It’s not that long, but in Hadoop years it’s a generation, and there have been many exciting developments in Hadoop and its ecosystem, especially YARN and its promise of a general-purpose, distributed platform that can support computing models beyond MapReduce.

I’m excited to announce that I’ve started work on the second edition of the book, which will bring the existing coverage of the book up to date, and also add new chapters to cover items such as:

An overview of YARN and how it works
How MapReduce 2 works as a YARN application
Recipes for writing your own YARN applications
Pulling data out of Kafka into HDFS
Running Storm on YARN and using it to perform aggregations
Using Spark for in-memory, iterative data processing

The book is currently in MEAP, which is Manning’s early access program. The benefit of this program is that you get new content as it’s being written, and at the end you’ll get the full production-polished version of the book.

I welcome any suggestions or ideas for how the book can be improved at the forum.

Using Hadoop 2.2 as a sink in Flume 1.4

2014-02-09T14:20:00+00:00

Google really screwed the pooch with their protobuf 2.5 release. Code generated with protobuf 2.5 is binary incompatible with older protobuf libraries (I guess Google missed the semantic versioning boat on this release). Unfortunately the current stable release of Flume 1.4 packages protobuf 2.4.1 and if you try and use HDFS on Hadoop 2.2 as a sink you’ll be smacked with the following exception:

java.lang.VerifyError: class org.apache.hadoop.security.proto.SecurityProtos$GetDelegationTokenRequestProto
overrides final method getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
    ...
    at org.apache.hadoop.ipc.ProtobufRpcEngine.getProxy(ProtobufRpcEngine.java:92)
    at org.apache.hadoop.ipc.RPC.getProtocolProxy(RPC.java:537)
    at org.apache.hadoop.hdfs.NameNodeProxies.createNNProxyWithClientProtocol(NameNodeProxies.java:328)
    at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:235)

Hadoop 2.2 uses protobuf 2.5 for its RPC, and Flume loads its older packaged version of protobuf ahead of Hadoop’s, which causes this error. To fix this you’ll need to move both protobuf and guava out of Flume’s lib directory. The following command moves them into your home directory.

$ mv ${flume_bin}/lib/{protobuf-java-2.4.1.jar,guava-10.0.1.jar} ~/

Now if you restart your Flume agent you’ll be able to target HDFS as a sink with Hadoop 2.2. Great success!

Flume’s next release will move to protobuf 2.5 so this problem should magically disappear in due course.

Simplifying secondary sorting in MapReduce with htuple

2013-10-07T14:20:00+00:00

I’ve recently found myself immersed in writing a number of MapReduce jobs that all require secondary sort. Whilst I was nursing my cramping hands after writing what felt like the 100th custom Writable (and supporting partitioner/comparators), a thought occurred to me - “surely there’s a better way”? As I started thinking about this some more, I realized that what I needed was a general-purpose mechanism that would allow me to:

Work with compound elements
Provide pre-built partitioners and comparators that would know how to work with these compound elements
Model all of this in a way that is easy to read and understand

This is the inspiration behind htuple, a small project that I just open-sourced.

htuple

Let me give you an example of how you can use htuple to perform secondary sorting. Imagine that you have a dataset which contains last and first names:

Smith	John
Smith	Anne
Smith	Ken

One example aggregation you may want to perform on this data is to count the number of distinct first names for each last name. A reasonable approach to implementing this in MapReduce would be to emit the last name as the mapper output key, the first name as the mapper output value, and in the reducer you’d collect all the first names in a set and then count them. This would work fine when working with names, but what if your dataset had some keys with a large number of distinct values - large enough that you run into problems caching all the data in the reducer’s memory?

One solution here would be to use secondary sort - and in the example of our names, sort the first names so that the reducer wouldn’t need to store them in a set (instead it can just increment a count as it’s reading the first names). In this case you’d probably end up writing a custom Writable which would contain both the last name and first name, and you’d also write a custom partitioner, and a sorting and grouping comparator. Phew, that’s a lot of work just to get secondary sort working.
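To see why sorted input removes the need for a set, here is a minimal Java sketch of the counting logic a reducer could use once the first names arrive in sorted order. It is plain Java rather than a real Reducer (the class and method names are hypothetical), but the idea carries over directly: only the previous value has to be remembered.

```java
/** Sketch: count distinct values in an already-sorted stream using constant memory. */
final class SortedDistinctCountSketch {
    static int countDistinctSorted(Iterable<String> sortedValues) {
        int distinct = 0;
        String previous = null;
        for (String value : sortedValues) {
            // Because the input is sorted, duplicates are always adjacent, so
            // comparing against the previous value is enough.
            if (!value.equals(previous)) {
                distinct++;
                previous = value;
            }
        }
        return distinct;
    }
}
```

This only gives the right answer when the values really are sorted, which is exactly what secondary sort guarantees.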

Let’s examine how you’d use htuple to do this work. First of all, I’d recommend defining an enum to create logical names for the elements you’ll store in the tuple. In our case we need two elements for the names, so here goes:

/**
 * User-friendly names that we can use to refer to fields in the tuple.
 */
enum TupleFields {
    LAST_NAME,
    FIRST_NAME
}

The first concept we’ll introduce in htuple is the Tuple class. This class is merely a container for reading and writing multiple elements, and will be the class that you’ll use to emit keys from your mapper. There are three ways you can write data into this tuple - here we’ll cover what I think is the most useful method, which is using the enum you just created. Let’s see how this will work in our mapper.

public static class Map extends Mapper<LongWritable, Text, Tuple, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {

        // tokenize the line
        String nameParts[] = value.toString().split("\t");

        // create the tuple, setting the first and last names
        Tuple outputKey = new Tuple();
        outputKey.set(TupleFields.LAST_NAME, nameParts[0]);
        outputKey.set(TupleFields.FIRST_NAME, nameParts[1]);

        // emit the tuple and the original contents of the line
        context.write(outputKey, value);
    }
}

The first thing you do in your mapper is split the input line, where the first token is the last name, and the second token is the first name. Next you create a new Tuple object and set the last and first name. We’re using the enum to logically refer to the fields. What’s happening behind the scenes is that the Tuple class is using the ordinal value of the enum to determine the position in the ArrayList to set. So that means LAST_NAME, which has an ordinal position of 0, will have its value set in index 0 in the Tuple class’s underlying ArrayList.
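The ordinal-to-index mapping is easy to picture with a toy class. The sketch below is not htuple's implementation (the class and its methods are made up for illustration); it only demonstrates the idea of using an enum's ordinal as the position in a backing ArrayList.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy illustration of enum-ordinal addressing; not the real htuple Tuple class. */
final class OrdinalTupleSketch {
    enum TupleFields { LAST_NAME, FIRST_NAME }

    private final List<Object> elements = new ArrayList<>();

    /** Stores the value at the index given by the enum's ordinal. */
    void set(Enum<?> field, Object value) {
        int index = field.ordinal();
        // Pad with nulls so fields can be set in any order.
        while (elements.size() <= index) {
            elements.add(null);
        }
        elements.set(index, value);
    }

    /** Positional access, showing where each enum field actually landed. */
    Object get(int index) {
        return elements.get(index);
    }
}
```

One consequence of this design is that reordering the enum constants changes the layout of the tuple, so the declaration order is part of your key format.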

Now that you’ve emitted your Tuple in the mapper, you need to configure your job for secondary sort. This will then expose you to the second class in htuple, ShuffleUtils. ShuffleUtils allows you to specify which elements in your tuple are used for partitioning, sorting and grouping during the shuffle phase. And this is how you do it:

ShuffleUtils.configBuilder()
    .useNewApi()
    .setPartitionerIndices(TupleFields.LAST_NAME)
    .setSortIndices(TupleFields.values())
    .setGroupIndices(TupleFields.LAST_NAME)
    .configure(conf);

If you recall how secondary sort works (see my book ”Hadoop in Practice” for a detailed explanation), you need to perform three steps in your MapReduce driver:

Specify how your compound key will be partitioned. In our example we only want the partitioner to use the last name so that all records with the same last name get routed to the same reducer.
Specify how your compound key will be sorted. Here we want both the last and first name to be sorted, so that the first names will be presented to your reducer in sorted order.
Specify how your compound key will be grouped. Since we want all the first names to be streamed to a single reducer invocation for a given last name, we only want to group on the last name.

A couple of things worth noting in the above code example:

We’re using the new MapReduce API (i.e. using package org.apache.hadoop.mapreduce), and as such you need to call the useNewApi method.
The values method on an enum returns an array of all of the enum fields in order of definition, which in our example is the last name followed by the first name - exactly the order in which we want the sorting to occur.
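If the interplay between the three settings is hard to visualize, the following plain-Java simulation may help. It doesn't use Hadoop or htuple at all (every name in it is hypothetical); it simply mimics partitioning on the last name, sorting on last and first name, and grouping on the last name, with each key modelled as a two-element array of {lastName, firstName}.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java simulation of partition / sort / group for a {lastName, firstName} key. */
final class SecondarySortSimulation {
    /** Partition on the last name only, so equal last names reach the same reducer. */
    static int partition(String[] key, int numReducers) {
        return (key[0].hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    /** Sort on the last name, then the first name. */
    static final Comparator<String[]> SORT =
            Comparator.comparing((String[] k) -> k[0]).thenComparing(k -> k[1]);

    /** Group on the last name only: one reduce call per last name. */
    static boolean sameGroup(String[] a, String[] b) {
        return a[0].equals(b[0]);
    }

    /** Returns last name -> first names, in the order a single reducer would see them. */
    static Map<String, List<String>> reduceInput(List<String[]> keys) {
        List<String[]> sorted = new ArrayList<>(keys);
        sorted.sort(SORT);
        Map<String, List<String>> groups = new LinkedHashMap<>();
        String[] previous = null;
        for (String[] key : sorted) {
            if (previous == null || !sameGroup(previous, key)) {
                groups.put(key[0], new ArrayList<>()); // a new reduce invocation starts
            }
            groups.get(key[0]).add(key[1]);
            previous = key;
        }
        return groups;
    }
}
```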

You’re done! If you examine the output of the MapReduce job in HDFS you’ll see that indeed all the records are sorted by last and first name.

$ hadoop fs -cat output/part*
Smith	Anne
Smith	John
Smith	Ken

You can look at the complete source in SecondarySort.java. The htuple github page has instructions for downloading, building and running this same example in a couple of easy steps. There’s also a page which shows the types supported by htuple.

Next Generation Hadoop - It's Not Just Batch!

2013-09-25T14:20:00+00:00

In my JavaOne talk today I presented changes that are happening in Hadoop, where it’s shaking off its batch-based shackles and enabling a new Hadoop platform that can support a mix of processing systems, from stream-processing systems to NoSQL systems.

The slides for my talk can be viewed on Speaker Deck. The rest of this post is an overview of the technologies covered in my talk, along with links for further reading.

YARN

With Hadoop 2.x, we now have YARN which acts as a distributed scheduler. This is a big step towards the vision of Hadoop being the Big Data Kernel, as it allows arbitrary applications to be scheduled on the same Hadoop cluster, and enables a new world where we can have silo’d applications coexisting on the same hardware and sharing the same storage.

The following links serve as a good starting ground to learn more about YARN:

An introduction to YARN: http://hortonworks.com/blog/introducing-apache-hadoop-yarn/
A book by Arun Murthy et al. on YARN: http://www.amazon.com/Apache-Hadoop-YARN-Processing-Addison-Wesley/dp/0321934504, first chapter can be read for free at http://hortonworks.com/wp-content/uploads/downloads/2013/06/Apache.Hadoop.YARN_.Sample.pdf.
The YARN ResourceManager: http://hortonworks.com/blog/apache-hadoop-yarn-resourcemanager/
Writing YARN applications: http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html
Setting up a cluster to run MapReduce on YARN: http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html
Configuring YARN: http://hortonworks.com/blog/how-to-plan-and-configure-yarn-in-hdp-2-0/
Default YARN configuration: http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
YARN commands: http://archive.cloudera.com/cdh4/cdh/4/hadoop/hadoop-yarn/hadoop-yarn-site/YarnCommands.html

Apache HBase

HBase is a NoSQL, distributed multi-dimensional map based on Google’s BigTable. It uses HDFS for persistence, which is a huge benefit if a key requirement of your NoSQL system is the ability to read and write data into HBase using MapReduce.

HBase project page: http://hbase.apache.org/ and mailing lists: http://hbase.apache.org/mail-lists.html
A good presentation by Amandeep Khurana on HBase: http://www.slideshare.net/amansk/hbase-hadoop-day-seattle-4987041
HBase wiki: http://wiki.apache.org/hadoop/Hbase
The HBase Reference Guide - a great resource on HBase’s data model, design and configuration: http://hbase.apache.org/book.html
HBase in Action, a book from Manning: http://www.manning.com/dimidukkhurana/

HBase on YARN (Hoya)

Hoya is a YARN application that allows multiple HBase clusters to coexist on a single Hadoop YARN cluster. This provides strong data/resource isolation properties, in conjunction with the ability to easily spin up, upsize/downsize and shut down HBase clusters. Hoya was developed by Steve Loughran and friends over at Hortonworks.

GitHub project: https://github.com/hortonworks/hoya/
Introducing Hoya: http://hortonworks.com/blog/introducing-hoya-hbase-on-yarn/
Hoya architecture: http://hortonworks.com/blog/hoya-hbase-on-yarn-application-architecture/
Presentation by Steve and Devaraj: http://www.slideshare.net/steve_l/hoya-hbase-on-yarn-20130820-hbase-hug

Apache Accumulo

Accumulo is a BigTable implementation much like HBase. It also uses HDFS for storage, and currently has an edge in the security world due to its cell-level security, although it should be noted that this is planned for HBase (see HBASE-6222).

Project page: http://accumulo.apache.org/
Todd Lipcon’s presentation comparing HBase and Accumulo: http://www.slideshare.net/cloudera/h-base-and-accumulo-todd-lipcom-jan-25-2012

ElephantDB

ElephantDB is a read-only key-value store, which uses HDFS to load data which is served in real-time. It’s a part of Nathan Marz’s Lambda Architecture and enables the rapid loading and serving of data produced in the batch tier.

GitHub page: https://github.com/nathanmarz/elephantdb
Presentation by Nathan Marz: http://www.slideshare.net/nathanmarz/elephantdb
Presentation by Soren Macbeth, a contributor to the project: https://speakerdeck.com/sorenmacbeth/introduction-to-elephantdb

Storm

Storm is a stream processing, continuous computation and distributed RPC system developed and open-sourced by Twitter. It allows you to perform near real-time calculations such as trending topics.

Project home: http://storm-project.net/
GitHub project: https://github.com/nathanmarz/storm
Extensive documentation which covers the background and basics on how Storm works: https://github.com/nathanmarz/storm/wiki
Nathan Marz presentation on Storm: http://www.youtube.com/watch?v=bdps8tE0gYo
Running a multi-node Storm cluster from Michael Noll: http://www.michael-noll.com/tutorials/running-multi-node-storm-cluster/
Understanding the parallelism of a Storm topology, also from Mr. Noll: http://www.michael-noll.com/blog/2012/10/16/understanding-the-parallelism-of-a-storm-topology/

Storm on YARN

Yahoo use Storm for a variety of use cases, and created Storm-on-YARN so that they could run Storm on their YARN clusters. They also added the ability for Storm to read/write to secure HDFS.

GitHub project page: https://github.com/yahoo/storm-yarn
Yahoo! blog post introducing the project: http://developer.yahoo.com/blogs/ydn/storm-yarn-released-open-source-143745133.html
Hortonworks blog on the project: http://hortonworks.com/blog/streaming-in-hadoop-yahoo-release-storm-yarn/
Hadoop Summit 2013 presentation: http://www.slideshare.net/Hadoop_Summit/feng-june26-1120amhall1v2

Apache Samza

Samza (incubating) is a stream processing system that uses Kafka for messaging, and optionally YARN for resource management.

Project page: http://samza.incubator.apache.org/
LinkedIn post on Samza’s background: http://engineering.linkedin.com/data-streams/apache-samza-linkedins-real-time-stream-processing-framework

Morphlines

Morphlines is an ETL library from Cloudera that has implementations available for use within Flume, MapReduce and HBase. Using a modified JSON syntax it allows you to create a pipeline of work which can fulfill use cases such as near real-time writes from Flume into Solr Cloud.

GitHub page: https://github.com/cloudera/search
Introductory blog post: http://blog.cloudera.com/blog/2013/07/morphlines-the-easy-way-to-build-and-integrate-etl-apps-for-apache-hadoop/
Presentation from Cloudera: http://www.slideshare.net/cloudera/using-morphlines-for-onthefly-etl
Documentation as part of the Cloudera Development Kit: http://cloudera.github.io/cdk/docs/0.5.0/cdk-morphlines/index.html

Apache Giraph

Giraph is a framework for performing offline batch processing of semi-structured graph data on a massive scale. It offers performance advantages over graph processing with MapReduce.

Project page: http://giraph.apache.org/
Quick start guide: http://giraph.apache.org/quick_start.html
HadoopSummit 2013 presentation: http://www.youtube.com/watch?v=_RsJfZGQo9I
Architectural overview: http://www.slideshare.net/averyching/20111014hortonworks

Impala

Impala from Cloudera is an implementation of Google’s paper on Dremel, and provides interactive SQL capabilities on top of data in HDFS and HBase.

GitHub page: https://github.com/cloudera/impala
Project announcement from Cloudera: http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real/
Impala 1.0 release announcement: http://blog.cloudera.com/blog/2013/05/cloudera-impala-1-0-its-here-its-real-its-already-the-standard-for-sql-on-hadoop/
Configuring Impala for multi-tenant performance: http://blog.cloudera.com/blog/2013/06/configuring-impala-and-mapreduce-for-multi-tenant-performance/
Cloudera presentation at the Swiss Big Data User Group: http://www.slideshare.net/SwissHUG/cloudera-impala-15376625

Apache Drill

An (incubating) project that offers the promise of interactive SQL capabilities over data in HDFS, HBase, Cassandra, MongoDB and Splunk.

Apache incubating project page: http://incubator.apache.org/drill/
Architecture outlines: http://www.slideshare.net/jasonfrantz/drill-architecture-20120913

Parquet

Parquet, a joint initiative from Cloudera and Twitter, is a columnar data format supporting nested data. It can offer space and time advantages over row-ordered data, especially with queries that return a subset of the overall columns. It supports a wide variety of tools (MapReduce, Impala, Pig and Hive) and is used in production by Twitter.

GitHub page: https://github.com/Parquet
Presentation from Cloudera Impala meetup: http://www.slideshare.net/cloudera/presentations-25757981
Hadoop Summit 2013 presentation: http://www.youtube.com/watch?v=pFS-FScophU and accompanying slides http://www.slideshare.net/julienledem/parquet-hadoop-summit-2013
Twitter blog post: https://blog.twitter.com/2013/dremel-made-simple-with-parquet
Cloudera blog post: http://blog.cloudera.com/blog/2013/03/introducing-parquet-columnar-storage-for-apache-hadoop/

ORC File

ORC File is a columnar data format that also supports nested data. It is currently implemented within Hive 0.11.

Presentation from Hortonworks: http://www.slideshare.net/oom65/orc-files
Details on the file format: https://cwiki.apache.org/Hive/languagemanual-orc.html
Hadoop Summit 2013 presentation http://www.youtube.com/watch?v=GV7vpR7vpjM and slides http://www.slideshare.net/oom65/orc-andvectorizationhadoopsummit

Apache Tez

Tez (incubating) is a generalized DAG execution engine. The goal of the project is to remove disk barriers that exist with pipelined MapReduce jobs. The first goal of the project is to provide a MapReduce implementation using Tez, followed by Hive and Pig.

Incubating page at Apache: http://incubator.apache.org/projects/tez.html
Introducing Tez: http://hortonworks.com/blog/introducing-tez-faster-hadoop-processing/
Hadoop Summit 2013 presentation http://www.youtube.com/watch?v=9ZLLzlsz7h8 and accompanying slides http://www.slideshare.net/Hadoop_Summit/murhty-saha-june26255pmroom212

Apache Mesos

Mesos is a cluster manager, similar to YARN, providing resource sharing and isolation capabilities in a distributed cluster. It can support multiple instances and versions of Hadoop, Spark and other applications. It’s used in Twitter to manage various applications in production.

Project page: http://mesos.apache.org/
Tech talk: http://www.youtube.com/watch?v=Hal00g8o1iY

Lambda Architecture

The Lambda Architecture, an architectural blueprint from Nathan Marz, suggests that speed and batch layers should exist to play to their mutual strengths: the speed layer providing near real-time data aggregations, and the batch layer providing a mechanism to correct potential mistakes made in the speed layer.

Nathan’s book, Big Data from Manning, which goes into detail on the Lambda Architecture: http://www.manning.com/marz/
Nathan’s presentation explaining the background behind Lambda: http://www.slideshare.net/nathanmarz/runaway-complexity-in-big-data-and-a-plan-to-stop-it

Summingbird

Summingbird is a project out of Twitter which could be viewed as an implementation of the Lambda Architecture. It allows you to use a single API to define operations on distributed collections, which can then be mapped into MapReduce or Storm executions.

GitHub project page: https://github.com/twitter/summingbird
Twitter blog post on Summingbird: https://blog.twitter.com/2013/streaming-mapreduce-with-summingbird
Sam Ritchie presentation on Summingbird: http://www.youtube.com/watch?v=Y3PETLJeP7o and accompanying slides https://speakerdeck.com/sritchie/summingbird-streaming-mapreduce-at-twitter

Apache Spark

Spark (incubating) is an in-memory distributed processing system which allows you to perform MapReduce, as well as iterative workloads over data. Spark and its family of associated projects (such as Spark Streaming and GraphX) offer a complete solution to most distributed processing use cases.

Project page: http://spark.incubator.apache.org/
Documentation, including links to video tutorials: http://spark.incubator.apache.org/documentation.html

Bucketing, multiplexing and combining in Hadoop - part 2

2013-07-16T14:20:00+00:00

In the first post of this series, we looked at how the MultipleOutputFormat class could be used in a task to write to multiple output files. This approach had a few shortcomings: it couldn’t be used in the map-side of a job that used reducers, and it only worked with the old mapred API.

In this post we’ll look at the MultipleOutputs class, which offers an alternative to the MultipleOutputFormat and also addresses its shortcomings.

MultipleOutputs

Using the MultipleOutputs class is a more modern Hadoop way of writing to multiple outputs. It has both mapred and mapreduce API implementations, and allows you to work with multiple OutputFormat classes in your job. Its approach is different from MultipleOutputFormat - rather than defining its own OutputFormat it merely provides some helper methods which need to be called in your driver code, as well as in your mapper/reducer.

The two MultipleOutputs classes in mapred and mapreduce are close in functionality, the main difference being support of multi-named outputs, which we’ll examine later in this post.

Let’s look at how we would achieve the same result as we did with MultipleOutputFormat. If you recall from the previous post in this series, we were working with some sample data from a fruit market, where the data points were the location of each market, and the fruit that was sold:

cupertino   apple
sunnyvale   banana
cupertino   pear

Our goal is to partition the outputs by city, so there would be city-specific files. First up is our driver code, where we need to tell MultipleOutputs the named outputs, and their related OutputFormat classes. For simplicity we’ve chosen TextOutputFormat for both, but you can use different OutputFormats for each named output.

MultipleOutputs.addNamedOutput(jobConf, "cupertino", TextOutputFormat.class, Text.class, Text.class);
MultipleOutputs.addNamedOutput(jobConf, "sunnyvale", TextOutputFormat.class, Text.class, Text.class);

The named outputs “cupertino” and “sunnyvale” are used for two purposes in MultipleOutputs. First, they act as logical keys that you use in your mapper and reducer to look up their associated OutputCollector instances. Second, they are used as the output filenames in HDFS.

We can’t use an identity reducer in this example as we have to use the MultipleOutputs class to redirect our output to the appropriate file, so let’s go ahead and see what the reducer will look like.

class Reduce extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

    private MultipleOutputs output;

    @Override
    public void configure(final JobConf job) {
        super.configure(job);
        output = new MultipleOutputs(job);
    }

    @Override
    public void reduce(final Text key, final Iterator<Text> values,
                       final OutputCollector<Text, Text> collector, final Reporter reporter)
            throws IOException {
        while (values.hasNext()) {
            output.getCollector(key.toString(), reporter).collect(key, values.next());
        }
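    }

    @Override
    public void close() throws IOException {
        // Flush and close every named output opened by this task.
        output.close();
    }
}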

As you can see, you’re not using the OutputCollector supplied to the reduce method. Instead you create a MultipleOutputs instance in the configure method, which is then used in the reduce method. For each reducer input record, we use the key to look up the OutputCollector and then emit each key/value pair to that collector. Remember that when calling getCollector you must use one of the named outputs that you defined in the job driver. In our case our input keys are either “cupertino” or “sunnyvale”, and they map directly to the named outputs we defined in our driver, so we’re in good shape.

Let’s examine the contents of the job output directory after running the job.

$ hadoop fs -lsr /output
/output/cupertino-r-00000
/output/sunnyvale-r-00000
/output/part-00000
/output/part-00001

This output highlights one of the key differences between MultipleOutputs and MultipleOutputFormat. When using MultipleOutputs you can output to the reducer’s regular OutputCollector, or to the OutputCollector for a named output, or to both, which is why you see part-nnnnn files.

But wait! One problem with MultipleOutputs is that we needed to pre-define the partitions “cupertino” and “sunnyvale” ahead of time in our driver. What if we don’t know the partitions ahead of time?

Dynamic files with the MultipleOutputs class

Up until now MultipleOutputs has treated us well - it supports both the old and new MapReduce APIs, and can also support multiple OutputFormat classes within the same reducer. But as we saw, we essentially had to pre-define the output files in our driver code. So how do we handle cases where we want the output files to be determined dynamically in the reducer?

Luckily MultipleOutputs has a notion of “multi named” outputs. In the driver, instead of enumerating all the output files we want, we simply add a single logical name called “fruit”, using addMultiNamedOutput instead of addNamedOutput:

MultipleOutputs.addMultiNamedOutput(jobConf, "fruit", TextOutputFormat.class, Text.class, Text.class);

In our reducer we always specify “fruit” as the name, but we use a different getCollector method which takes an additional argument that determines the filename used for output:

output.getCollector("fruit", key.toString(), reporter).collect(key, values.next());

Let’s do another HDFS listing:

$ hadoop fs -lsr /output
/output/fruit_cupertino-r-00000
/output/fruit_sunnyvale-r-00000
/output/part-00000
/output/part-00001

Hurray! We now have multiple output files that are dynamically created based on the reducer output key, just like we did with MultipleOutputFormat.
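
One thing to watch when the multi name comes straight from the data: as far as I can tell from the old mapred MultipleOutputs source, named outputs and multi names may only contain letters and digits, so a key such as “san jose” would be rejected. Treat that as an assumption to verify against your Hadoop version. The following is a minimal sketch of a helper you could call before getCollector; the class and method names are hypothetical and not part of Hadoop, and the file naming simply mirrors the listing above.

```java
import java.util.Locale;

/** Hypothetical helper for illustration only; not part of Hadoop. */
class MultiNameSketch {

    /** Keeps only A-Z, a-z and 0-9 so the key can be used as a multi name. */
    static String sanitize(String key) {
        StringBuilder sb = new StringBuilder();
        for (char ch : key.toCharArray()) {
            boolean letter = (ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z');
            boolean digit = ch >= '0' && ch <= '9';
            if (letter || digit) {
                sb.append(ch);
            }
        }
        if (sb.length() == 0) {
            throw new IllegalArgumentException("No usable characters in key: " + key);
        }
        return sb.toString();
    }

    /** Builds a reducer file name in the style of the listing: fruit_cupertino-r-00000. */
    static String reduceFileName(String namedOutput, String multiName, int partition) {
        return String.format(Locale.ROOT, "%s_%s-r-%05d", namedOutput, multiName, partition);
    }

    public static void main(String[] args) {
        System.out.println(sanitize("san jose"));
        System.out.println(reduceFileName("fruit", sanitize("cupertino"), 0));
    }
}
```

Note that stripping characters can make two different keys collide (“san jose” and “sanjose” would share a file), so if that matters for your data you’d want a different encoding.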

Now unfortunately the multi-named output is only supported by the old mapred API, whereas with the new mapreduce API you are forced to define your partitions in your job driver.

Conclusion

There are plenty of things to like about MultipleOutputs, namely its support for both “old” and “new” MapReduce APIs, and its support for multiple OutputFormat classes. Its only real downside is that multi named outputs are only supported in the old mapred API, so if you need dynamic partitions in the new mapreduce API, neither MultipleOutputs nor the MultipleOutputFormat described in part 1 will help you.

Secondary sorting with Avro

2013-06-03T14:20:00+00:00

In the last Avro sorting post you saw how sorting Avro records works in MapReduce, and how one can ignore fields in Avro records for partitioning, sorting and grouping. In the process you discovered that ignored fields are limited by being immutable (since they can only be defined once for a schema), which means you can’t vary what fields are ignored for partitioning, sorting or grouping, which is key for secondary sort.

If you wish to use secondary sort with Avro, one option would be to emit a custom Writable as the map output key, and emit an Avro record as the map output value. With this approach you’d write a custom partitioner, and sorting/grouping implementation.

This post looks at another option, where with some hacking you can actually have secondary sort with Avro map output keys.

True secondary sort with an AvroKey

Avro has some utility classes for sorting and hashing (required for the partitioner), but the code is locked-down with private methods. The hacking therefore requires lifting certain parts of Avro’s code, and writing some helper functions to easily allow jobs fine-grained control over what fields are used for secondary sort.

Let’s take an example with the same Avro schema we used in the last post:

{"type": "record", "name": "com.alexholmes.avro.WeatherNoIgnore",
 "doc": "A weather reading.",
 "fields": [
     {"name": "station", "type": "string"},
     {"name": "time", "type": "long"},
     {"name": "temp", "type": "int"},
     {"name": "counter", "type": "int", "default": 0}
 ]
}

For secondary sort you may imagine a scenario where you want to partition output records by the station, sort records using the station, time and temp fields, and finally group by the station and time fields. The code to do this is as follows (GitHub source):

AvroSort.builder()
    .setJob(job)
    .addPartitionField(WeatherNoIgnore.SCHEMA$, "station", true)
    .addSortField(WeatherNoIgnore.SCHEMA$, "station", true)
    .addSortField(WeatherNoIgnore.SCHEMA$, "time", true)
    .addSortField(WeatherNoIgnore.SCHEMA$, "temp", true)
    .addGroupField(WeatherNoIgnore.SCHEMA$, "station", true)
    .addGroupField(WeatherNoIgnore.SCHEMA$, "time", true)
    .configure();

The ordering of the addXXX calls is significant, as it determines the order in which fields are used for sorting and grouping. The last argument in the addXXX methods is a boolean which indicates whether the ordering is ascending.
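
If it helps to see what those three roles mean without any Avro or Hadoop machinery, here is a minimal plain-Java sketch of the same idea. It is not how AvroSort is implemented; the SecondarySortSketch and Reading names are made up for illustration, and the partition function just imitates the usual hash-modulo approach.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Illustrative model of partition / sort / group; all names are hypothetical. */
class SecondarySortSketch {

    static final class Reading {
        final String station;
        final long time;
        final int temp;

        Reading(String station, long time, int temp) {
            this.station = station;
            this.time = time;
            this.temp = temp;
        }

        @Override
        public String toString() {
            return station + "/" + time + "/" + temp;
        }
    }

    // Partition on station only, so every reading for a station reaches the same reducer.
    static int partition(Reading r, int numReducers) {
        return (r.station.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Sort on station, then time, then temp (all ascending). The order of the
    // comparisons plays the same role as the order of the addSortField calls.
    static final Comparator<Reading> SORT = Comparator
            .comparing((Reading r) -> r.station)
            .thenComparingLong(r -> r.time)
            .thenComparingInt(r -> r.temp);

    // Group on station and time: records that compare equal here share one reduce call.
    static final Comparator<Reading> GROUP = Comparator
            .comparing((Reading r) -> r.station)
            .thenComparingLong(r -> r.time);

    /** Sorts one partition's records, then splits them into reduce groups. */
    static List<List<Reading>> sortAndGroup(List<Reading> records) {
        List<Reading> sorted = new ArrayList<>(records);
        sorted.sort(SORT);
        List<List<Reading>> groups = new ArrayList<>();
        Reading previous = null;
        for (Reading r : sorted) {
            if (previous == null || GROUP.compare(previous, r) != 0) {
                groups.add(new ArrayList<>());
            }
            groups.get(groups.size() - 1).add(r);
            previous = r;
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Reading> input = Arrays.asList(
                new Reading("SFO", 1, 3), new Reading("SFO", 2, 1),
                new Reading("SFO", 1, 2), new Reading("SFO", 1, 1));
        for (List<Reading> group : sortAndGroup(input)) {
            System.out.println(group);
        }
    }
}
```

Each printed list corresponds to one reduce invocation, and within it the temp values arrive already sorted, which is the whole point of secondary sort.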

Most of the heavy lifting is performed in the AvroSort and AvroDataHack classes - the latter, as its name indicates, is where some hacking took place to get things working.

The only caveat with the current implementation is that Avro union types aren’t currently supported - I’ll look into that in the near future.

Avro's built-in sorting

2013-05-28T14:20:00+00:00

Avro has a little-known gem of a feature which allows you to control which fields in an Avro record are used for partitioning, sorting and grouping in MapReduce. The following figure gives a quick refresher as to what these terms mean. Oh, and don’t take the placement of the “sorting” literally - sorting actually occurs on both the map and reduce side - but it’s always performed in the context of a specific partition (i.e. for a specific reducer).

By default all the fields in an Avro map output key are used for partitioning, sorting and grouping in MapReduce. Let’s walk through an example and see how this works. You’ll begin with a simple schema (GitHub source):

{"type": "record", "name": "com.alexholmes.avro.WeatherNoIgnore",
 "doc": "A weather reading.",
 "fields": [
     {"name": "station", "type": "string"},
     {"name": "time", "type": "long"},
     {"name": "temp", "type": "int"},
     {"name": "counter", "type": "int", "default": 0}
 ]
}

We’re going to see what happens when we run this code against a small sample data set, which we’ll generate using Avro code (GitHub source):

File input = tmpFolder.newFile("input.txt");
AvroFiles.createFile(input, WeatherNoIgnore.SCHEMA$, Arrays.asList(
    WeatherNoIgnore.newBuilder().setStation("SFO").setTime(1).setTemp(3).build(),
    WeatherNoIgnore.newBuilder().setStation("IAD").setTime(1).setTemp(1).build(),
    WeatherNoIgnore.newBuilder().setStation("SFO").setTime(2).setTemp(1).build(),
    WeatherNoIgnore.newBuilder().setStation("SFO").setTime(1).setTemp(2).build(),
    WeatherNoIgnore.newBuilder().setStation("SFO").setTime(1).setTemp(1).build()
).toArray());

To understand how Avro is partitioning, sorting and grouping the data, we’ll write an identity mapper and reducer, with a small enhancement to the reducer to increment the counter field for each record we see in an individual reduce invocation (GitHub source):

package com.alexholmes.avro.sort.basic;

import com.alexholmes.avro.WeatherNoIgnore;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapred.AvroValue;
import org.apache.avro.mapreduce.AvroJob;
import org.apache.avro.mapreduce.AvroKeyInputFormat;
import org.apache.avro.mapreduce.AvroKeyOutputFormat;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class AvroSort {

    private static class SortMapper
            extends Mapper<AvroKey<WeatherNoIgnore>, NullWritable,
                           AvroKey<WeatherNoIgnore>, AvroValue<WeatherNoIgnore>> {
        @Override
        protected void map(AvroKey<WeatherNoIgnore> key, NullWritable value, Context context)
                throws IOException, InterruptedException {
            context.write(key, new AvroValue<WeatherNoIgnore>(key.datum()));
        }
    }

    private static class SortReducer
            extends Reducer<AvroKey<WeatherNoIgnore>, AvroValue<WeatherNoIgnore>,
                            AvroKey<WeatherNoIgnore>, NullWritable> {
        @Override
        protected void reduce(AvroKey<WeatherNoIgnore> key,
                              Iterable<AvroValue<WeatherNoIgnore>> values, Context context)
                throws IOException, InterruptedException {
            int counter = 1;
            for (AvroValue<WeatherNoIgnore> value : values) {
                value.datum().setCounter(counter++);
                context.write(new AvroKey<WeatherNoIgnore>(value.datum()),
                              NullWritable.get());
            }
        }
    }

    public boolean runMapReduce(final Job job, Path inputPath, Path outputPath)
            throws Exception {
        FileInputFormat.setInputPaths(job, inputPath);
        job.setInputFormatClass(AvroKeyInputFormat.class);
        AvroJob.setInputKeySchema(job, WeatherNoIgnore.SCHEMA$);

        job.setMapperClass(SortMapper.class);
        AvroJob.setMapOutputKeySchema(job, WeatherNoIgnore.SCHEMA$);
        AvroJob.setMapOutputValueSchema(job, WeatherNoIgnore.SCHEMA$);

        job.setReducerClass(SortReducer.class);
        AvroJob.setOutputKeySchema(job, WeatherNoIgnore.SCHEMA$);

        job.setOutputFormatClass(AvroKeyOutputFormat.class);
        FileOutputFormat.setOutputPath(job, outputPath);

        return job.waitForCompletion(true);
    }
}

If you look at the output of the job below, you’ll see that the output is sorted across all the fields, and that the sorting is in field ordinal order. What this means is that when MapReduce is sorting these records, it compares the station field first, then the time field second, and so on according to the ordering of the fields in the Avro schema. This is pretty much what you’d expect if you write your own complex Writable type, and your comparator compared all the fields in order.

{"station": "IAD", "time": 1, "temp": 1, "counter": 1}
{"station": "SFO", "time": 1, "temp": 1, "counter": 1}
{"station": "SFO", "time": 1, "temp": 2, "counter": 1}
{"station": "SFO", "time": 1, "temp": 3, "counter": 1}
{"station": "SFO", "time": 2, "temp": 1, "counter": 1}

Oh, and before we move on notice that the value for the counter field is always 1, meaning that each reduce invocation was only fed a single key/value pair. This makes sense since our identity mapper only emitted a single value for each key, the keys are unique, and the MapReduce partitioner, sorter and grouper were using all the fields in the record.

Excluding fields for sorting

Avro gives us the ability to indicate that specific fields should be ignored when performing ordering functions. In MapReduce these fields are ignored for partitioning, sorting and grouping, which basically means that we have the ability to perform secondary sorting. Let’s examine the following schema (GitHub source):

{"type": "record", "name": "com.alexholmes.avro.Weather",
 "doc": "A weather reading.",
 "fields": [
     {"name": "station", "type": "string"},
     {"name": "time", "type": "long"},
     {"name": "temp", "type": "int", "order": "ignore"},
     {"name": "counter", "type": "int", "order": "ignore", "default": 0}
 ]
}

It’s pretty much identical to the first schema, the only difference being that the last two fields have an additional “order” attribute whose value is set to “ignore”. Let’s run the same MapReduce code (GitHub source) as above, modified only to work with the different schema, and examine the outputs.

{"station": "IAD", "time": 1, "temp": 1, "counter": 1}
{"station": "SFO", "time": 1, "temp": 3, "counter": 1}
{"station": "SFO", "time": 1, "temp": 2, "counter": 2}
{"station": "SFO", "time": 1, "temp": 1, "counter": 3}
{"station": "SFO", "time": 2, "temp": 1, "counter": 1}

There are a couple of notable differences between this output, and the output from the previous schema which didn’t have any ignored fields. First, it’s clear that the temp field isn’t being used in the sorting, which makes sense since we specified that it should be ignored in the schema. However, more interestingly, note the value of the counter field. All records that had identical station and time values went to the same reducer invocation, evidenced by the increasing value of counter. This is essentially secondary sort!
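
To make the effect of the ignored fields concrete, here is a small plain-Java model of what the framework is doing with this schema. It’s only a sketch with made-up names (IgnoredFieldsSketch, Weather), not Avro’s implementation: the comparator looks at station and time only, and the counter restarts whenever those compared fields change, just like the reducer above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Illustrative model of ordering with ignored fields; all names are hypothetical. */
class IgnoredFieldsSketch {

    static final class Weather {
        final String station;
        final long time;
        final int temp;

        Weather(String station, long time, int temp) {
            this.station = station;
            this.time = time;
            this.temp = temp;
        }
    }

    // Only station and time take part in ordering; temp and counter are "ignored".
    static final Comparator<Weather> ORDER = Comparator
            .comparing((Weather w) -> w.station)
            .thenComparingLong(w -> w.time);

    /** Returns one "station time temp counter" line per record, in output order. */
    static List<String> run(List<Weather> input) {
        List<Weather> sorted = new ArrayList<>(input);
        sorted.sort(ORDER); // List.sort is stable, so ties keep their input order
        List<String> out = new ArrayList<>();
        Weather previous = null;
        int counter = 0;
        for (Weather w : sorted) {
            // A change in the compared fields starts a new reduce invocation.
            if (previous == null || ORDER.compare(previous, w) != 0) {
                counter = 0;
            }
            counter++;
            out.add(w.station + " " + w.time + " " + w.temp + " " + counter);
            previous = w;
        }
        return out;
    }

    public static void main(String[] args) {
        List<Weather> input = Arrays.asList(
                new Weather("SFO", 1, 3), new Weather("IAD", 1, 1), new Weather("SFO", 2, 1),
                new Weather("SFO", 1, 2), new Weather("SFO", 1, 1));
        for (String line : run(input)) {
            System.out.println(line);
        }
    }
}
```

In this model the order of temp within a group falls out of the stable sort and the input order. In a real job I wouldn’t rely on the order of values that compare as equal.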

Sort order

The Avro documentation will give you an idea of how ordering is performed for different Avro types. Field ordering is ascending by default, but you can make it descending by setting the value of the “order” field to “descending”:

{"type": "record", "name": "com.alexholmes.avro.Weather",
 "doc": "A weather reading.",
 "fields": [
     {"name": "station", "type": "string"},
     {"name": "time", "type": "long"},
     {"name": "temp", "type": "int", "order": "descending"},
     {"name": "counter", "type": "int", "order": "ignore", "default": 0}
 ]
}
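
As a rough mental model of what this schema asks for, the ordering is equivalent to the following plain-Java comparator: station and time ascending, temp descending, counter not considered at all. This is only a sketch with hypothetical names (DescendingTempSketch, Row), not Avro’s comparison code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Illustrative model of a descending field; all names are hypothetical. */
class DescendingTempSketch {

    static final class Row {
        final String station;
        final long time;
        final int temp;

        Row(String station, long time, int temp) {
            this.station = station;
            this.time = time;
            this.temp = temp;
        }

        @Override
        public String toString() {
            return station + "/" + time + "/" + temp;
        }
    }

    // Fields are compared in schema order; only temp has its direction flipped.
    static final Comparator<Row> ORDER = Comparator
            .comparing((Row r) -> r.station)
            .thenComparingLong(r -> r.time)
            .thenComparing(Comparator.comparingInt((Row r) -> r.temp).reversed());

    static List<Row> sort(List<Row> rows) {
        List<Row> sorted = new ArrayList<>(rows);
        sorted.sort(ORDER);
        return sorted;
    }

    public static void main(String[] args) {
        System.out.println(sort(Arrays.asList(
                new Row("SFO", 1, 1), new Row("SFO", 1, 3), new Row("IAD", 1, 1),
                new Row("SFO", 2, 1), new Row("SFO", 1, 2))));
    }
}
```

Because temp now takes part in the comparison again, records that differ only in temp no longer land in the same group.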

Limitations

Now, all of this greatness isn’t without some limitations:

You can’t support two MapReduce jobs that use the same Avro key, but have different sorting/partitioning/grouping requirements. Although it’s conceivable that you could create a new instance of the Avro schema and set the ignored flags for these fields yourself.
The partitioner, sorter and grouping functions in MapReduce all work off of the same fields (i.e. they all ignore fields that are set as ignored in the schema). This means that your options for secondary sorting are limited. For example, you wouldn’t be able to partition all stations to the same reducer, and then group by station and time.
Ordering uses a field’s ordinal position to determine its order within the overall set of fields to be ordered. In other words, in a two-field record, the first field is always compared before the second. There’s no way to change this behavior other than flipping the order of the fields in the record.

Having said all of that - the “ignoring fields” feature for sorting is pretty awesome, and something that will no doubt come in handy in my future MapReduce work.

Using Avro's code generation from Maven

2013-05-24T14:20:00+00:00

Avro has the ability to generate Java code from Avro schema, IDL and protocol files. Avro also has a plugin which allows you to generate these Java sources directly from Maven, which is a good idea as it avoids issues that can arise if your schema/protocol files stray from the checked-in code generated equivalents.

Today I created a simple GitHub project called avro-maven because I had to fiddle a bit to get Avro and Maven to play nice. The GitHub project is self-contained and also has a README which goes over the basics. In this post I’ll go over how to use Maven to generate code for schema, IDL and protocol files.

pom.xml updates to support the Avro plugin

Avro schema files only define types, whereas IDL and protocol files model types as well as RPC semantics such as messages. The only difference between IDL and protocol files is that IDL files use Avro’s DSL for specifying RPC, whereas protocol files express the same thing in JSON form.

Each type of file has an entry that can be used in the goals element as can be seen below. All three can be used together, or if you only have schema files you can safely remove the protocol and idl-protocol entries (and vice-versa).

<plugin>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro-maven-plugin</artifactId>
  <version>${avro.version}</version>
  <executions>
    <execution>
      <phase>generate-sources</phase>
      <goals>
        <goal>schema</goal>
        <goal>protocol</goal>
        <goal>idl-protocol</goal>
      </goals>
    </execution>
  </executions>
</plugin>

...

<dependencies>
  <dependency>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro</artifactId>
    <version>${avro.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro-maven-plugin</artifactId>
    <version>${avro.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro-compiler</artifactId>
    <version>${avro.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro-ipc</artifactId>
    <version>${avro.version}</version>
  </dependency>
</dependencies>

By default the plugin assumes that your Avro sources are located in ${basedir}/src/main/avro, and that you want your generated sources to be written to ${project.build.directory}/generated-sources/avro, where ${project.build.directory} is typically the target directory. Keep reading if you want to change any of these settings.

Avro configurables

Luckily Avro’s Maven plugin offers the ability to customize various code generation settings. The following table shows the configurables that can be used for any of the schema, IDL and protocol code generators.

Configurable	Default value	Description
sourceDirectory	${basedir}/src/main/avro	The Avro source directory for schema, protocol and IDL files.
outputDirectory	${project.build.directory}/generated-sources/avro	The directory where Avro writes code-generated sources.
testSourceDirectory	${basedir}/src/test/avro	The input directory containing any Avro files used in testing.
testOutputDirectory	${project.build.directory}/generated-test-sources/avro	The output directory where Avro writes code-generated files for your testing purposes.
fieldVisibility	PUBLIC_DEPRECATED	Determines the accessibility of fields (e.g. whether they are public or private). Must be one of PUBLIC, PUBLIC_DEPRECATED or PRIVATE. PUBLIC_DEPRECATED merely adds a deprecated annotation to each field, e.g. "@Deprecated public long time".

In addition, the includes and testIncludes configurables can also be used to specify alternative file extensions to the defaults, which are **/*.avsc, **/*.avpr and **/*.avdl for schema, protocol and IDL files respectively.

Let’s look at an example of how we can specify all of these options for schema compilation.

<plugin>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro-maven-plugin</artifactId>
  <version>${avro.version}</version>
  <executions>
    <execution>
      <phase>generate-sources</phase>
      <goals>
        <goal>schema</goal>
      </goals>
      <configuration>
        <sourceDirectory>${project.basedir}/src/main/myavro/</sourceDirectory>
        <outputDirectory>${project.basedir}/src/main/java/</outputDirectory>
        <testSourceDirectory>${project.basedir}/src/main/myavro/</testSourceDirectory>
        <testOutputDirectory>${project.basedir}/src/test/java/</testOutputDirectory>
        <fieldVisibility>PRIVATE</fieldVisibility>
        <includes>
          <include>**/*.avro</include>
        </includes>
        <testIncludes>
          <testInclude>**/*.test</testInclude>
        </testIncludes>
      </configuration>
    </execution>
  </executions>
</plugin>

As a reminder everything covered in this blog article can be seen in action in the GitHub repo at https://github.com/alexholmes/avro-maven.

Bucketing, multiplexing and combining in Hadoop - part 1

2013-05-20T14:20:00+00:00

This is the first blog post in a series which looks at some data organization patterns in MapReduce. We’ll look at how to bucket output across multiple files in a single task, how to multiplex data across multiple files, and also how to coalesce data. These are all common patterns that are useful to have in your MapReduce toolkit.

We’ll kick things off with a look at bucketing data outputs in your map or reduce tasks. By default when using a FileOutputFormat-derived OutputFormat (such as TextOutputFormat), all the outputs for a reduce task (or a map task in a map-only job) are written to a single file in HDFS.

Imagine a situation where you have user activity logs being streamed into HDFS, and you want to write a MapReduce job to better organize the incoming data. As an example a large organization with multiple products may want to bucket the logs based on the product. To do this you’ll need the ability to write to multiple output files in a single task. Let’s take a look at how we can make that happen.

MultipleOutputFormat

There are a few ways you can achieve your goal, and the first option we’ll look at is the MultipleOutputFormat class in Hadoop. This is an abstract class that lets you do the following:

Define the output path for each and every key/value output record being emitted by a task.
Incorporate the input paths into the output directory for map-only jobs.
Redefine the key and value that are used to write to the underlying RecordWriter. This is useful in situations where you want to remove data from the outputs as it duplicates data in the filename.
For each output path, define the RecordWriter that should be used to write the outputs.

OK enough with the words - let’s look at some data and code. First up is the simple data we’ll use in our example - imagine you work at a fruit market with locations in multiple cities, and you have a purchase transaction stream which contains the store location along with the fruit that was purchased.

cupertino   apple
sunnyvale   banana
cupertino   pear

To help bucket your data for future analysis, you want to bin each record into city-specific files. For the simple data set above you don’t want to filter, project or transform your data, just bucket it out, so a simple identity map-only job will do the job. To force more than one mapper, we’ll write the data to two separate files.

$ TAB="$(printf '\t')"
$ hadoop fs -put - file1.txt << EOF
cupertino${TAB}apple
sunnyvale${TAB}banana
EOF

$ hadoop fs -put - file2.txt << EOF
cupertino${TAB}pear
EOF

Here’s the code which will let you write city-specific output files.

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;
import org.apache.hadoop.util.Progressable;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

import java.io.IOException;
import java.util.Arrays;

/**
 * An example of how to use {@link org.apache.hadoop.mapred.lib.MultipleOutputFormat}.
 */
public class MOFExample extends Configured implements Tool {

    /**
     * Create output files based on the output record's key name.
     */
    static class KeyBasedMultipleTextOutputFormat
                 extends MultipleTextOutputFormat<Text, Text> {
        @Override
        protected String generateFileNameForKeyValue(Text key, Text value, String name) {
            return key.toString() + "/" + name;
        }
    }

    /**
     * The main job driver.
     */
    public int run(final String[] args) throws Exception {
        String csvInputs = StringUtils.join(Arrays.copyOfRange(args, 0, args.length - 1), ",");
        Path outputDir = new Path(args[args.length - 1]);

        JobConf jobConf = new JobConf(super.getConf());
        jobConf.setJarByClass(MOFExample.class);
        jobConf.setNumReduceTasks(0);
        jobConf.setMapperClass(IdentityMapper.class);

        jobConf.setInputFormat(KeyValueTextInputFormat.class);
        jobConf.setOutputFormat(KeyBasedMultipleTextOutputFormat.class);

        FileInputFormat.setInputPaths(jobConf, csvInputs);
        FileOutputFormat.setOutputPath(jobConf, outputDir);

        return JobClient.runJob(jobConf).isSuccessful() ? 0 : 1;
    }

    /**
     * Main entry point for the utility.
     *
     * @param args arguments
     * @throws Exception when something goes wrong
     */
    public static void main(final String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new MOFExample(), args);
        System.exit(res);
    }
}

Run this code and you’ll see the following files in HDFS, where /output is the job output directory:

$ hadoop fs -lsr /output
/output/cupertino/part-00000
/output/cupertino/part-00001
/output/sunnyvale/part-00000

If you look at the output files you’ll see that the files contain the correct buckets.

$ hadoop fs -cat /output/cupertino/*
cupertino	apple
cupertino	pear

$ hadoop fs -cat /output/sunnyvale/*
sunnyvale	banana

Awesome, you have your data bucketed by store. Now that we have everything working, let’s look at what we did to get there. We had to do two things to get this working:

Extend MultipleTextOutputFormat

This is where the magic happened - let’s look at that class again.

static class KeyBasedMultipleTextOutputFormat extends MultipleTextOutputFormat<Text, Text> {
    @Override
    protected String generateFileNameForKeyValue(Text key, Text value, String name) {
        return key.toString() + "/" + name;
    }
}

You are working with text, which is why you extended MultipleTextOutputFormat, a class that in turn extends MultipleOutputFormat. MultipleTextOutputFormat is a simple class which instructs the MultipleOutputFormat to use TextOutputFormat as the underlying output format for writing out the records. If you were to use MultipleTextOutputFormat as-is, it would behave as if you were using the regular TextOutputFormat, which is to say that it would only write to a single output file. To write data to multiple files you had to extend it, as with the example above.

The generateFileNameForKeyValue method allows you to return the output path for each output record. The third argument, name, is the original FileOutputFormat-created filename, which is in the form “part-NNNNN”, where “NNNNN” is the task index, to ensure uniqueness. To avoid file collisions, it’s a good idea to make sure your generated output paths are unique, and leveraging the original output filename is certainly a good way of doing this. In our example we’re using the key as the directory name, and then writing to the original FileOutputFormat filename within that directory.

Specify the OutputFormat

The next step was easy - specify that this output format should be used for your job:

jobConf.setOutputFormat(KeyBasedMultipleTextOutputFormat.class);

Earlier we also mentioned that you can use the input path as part of the output path, which we will look at next.

Using the input filename as part of the output filename in map-only jobs

What if we wanted to keep the input filename as part of the output filename? This only works for map-only jobs, and can be accomplished by overriding the getInputFileBasedOutputFileName method. Let’s look at the following code to understand how this method fits into the overall sequence of actions that the MultipleOutputFormat class performs:

public void write(K key, V value) throws IOException {

    // get the file name based on the key
    String keyBasedPath = generateFileNameForKeyValue(key, value, myName);

    // get the file name based on the input file name
    String finalPath = getInputFileBasedOutputFileName(myJob, keyBasedPath);

    // get the actual key
    K actualKey = generateActualKey(key, value);
    V actualValue = generateActualValue(key, value);

    RecordWriter<K, V> rw = this.recordWriters.get(finalPath);
    if (rw == null) {
      // if we don't have the record writer yet for the final path, create
      // one
      // and add it to the cache
      rw = getBaseRecordWriter(myFS, myJob, finalPath, myProgressable);
      this.recordWriters.put(finalPath, rw);
    }
    rw.write(actualKey, actualValue);
};

The getInputFileBasedOutputFileName method is called with the output of generateFileNameForKeyValue, which contains our already-customized output file. Our new KeyBasedMultipleTextOutputFormat can now be updated to override getInputFileBasedOutputFileName and append the original input filename to the output filename:

static class KeyBasedMultipleTextOutputFormat extends MultipleTextOutputFormat<Text, Text> {
    @Override
    protected String generateFileNameForKeyValue(Text key, Text value, String name) {
        return key.toString() + "/" + name;
    }

    @Override
    protected String getInputFileBasedOutputFileName(JobConf job, String name) {
        // map.input.file holds the path of the file the current map task is reading
        String infilename = new Path(job.get("map.input.file")).getName();
        return name + "-" + infilename;
    }
}

If you run with your modified OutputFormat class you’ll see the following files in HDFS, confirming that the input filenames are now appended to the end of each output filename.

$ hadoop fs -lsr /output
/output/cupertino/part-00000-file1.txt
/output/cupertino/part-00001-file2.txt
/output/sunnyvale/part-00000-file1.txt

The implementation of getInputFileBasedOutputFileName in MultipleOutputFormat doesn’t do anything interesting by default, but if you set the value of the mapred.outputformat.numOfTrailingLegs configurable to an integer greater than 0, then the getInputFileBasedOutputFileName will use part of the input path as the output path.

Let’s see what happens when we set the value to 1:

jobConf.setInt("mapred.outputformat.numOfTrailingLegs", 1);

The output files in HDFS now exactly mirror the input files used for the job:

$ hadoop fs -lsr /output
/output/file1.txt
/output/file2.txt

If we set mapred.outputformat.numOfTrailingLegs to 2, and our input files exist in the /input directory, then our output directory looks like this:

$ hadoop fs -lsr /output
/output/input/file1.txt
/output/input/file2.txt

As you keep incrementing mapred.outputformat.numOfTrailingLegs, MultipleOutputFormat continues to walk up the parent directories of the input file and includes them in the output path.
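
If it helps to see the "trailing legs" idea spelled out, here's a small plain-Java sketch of it. This is my own simplification and not Hadoop's implementation: the TrailingLegsSketch class and outputNameFor method are hypothetical, paths are treated as plain "/"-separated strings, and /input/file1.txt is just an assumed input path.

```java
import java.util.Arrays;

/** Simplified illustration of using the last N components of the input path. */
class TrailingLegsSketch {

    /**
     * Returns the last numLegs components of inputPath joined by "/",
     * or defaultName when numLegs is zero or negative.
     */
    static String outputNameFor(String inputPath, String defaultName, int numLegs) {
        if (numLegs <= 0) {
            return defaultName;
        }
        // Drop the empty component produced by the leading "/".
        String[] legs = Arrays.stream(inputPath.split("/"))
                .filter(leg -> !leg.isEmpty())
                .toArray(String[]::new);
        int from = Math.max(0, legs.length - numLegs);
        return String.join("/", Arrays.copyOfRange(legs, from, legs.length));
    }

    public static void main(String[] args) {
        String input = "/input/file1.txt";
        for (int legs = 0; legs <= 2; legs++) {
            System.out.println(legs + " -> " + outputNameFor(input, "part-00000", legs));
        }
    }
}
```

With zero legs the default part name is kept, with one leg only the filename is used, and with two legs the parent directory is included as well.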

Modifying the output key and value

It’s very possible that the actual key and value you want to emit are different from those that were used to determine the output file. In our example, we took the output key and wrote to a directory using the key name. If you do that, keeping the key in the output file may be redundant. How would we modify the output record so that the key isn’t written? MultipleOutputFormat has your back with the generateActualKey method.

class KeyBasedMultipleTextOutputFormat extends MultipleTextOutputFormat<Text, Text> {
    @Override
    protected String generateFileNameForKeyValue(Text key, Text value, String name) {
        return key.toString() + "/" + name;
    }

    @Override
    protected Text generateActualKey(Text key, Text value) {
        return null;
    }
}

The returned value from this method replaces the key that’s supplied to the underlying RecordWriter, so if you return null as in the above example, no key will be written to the file.

$ hadoop fs -cat /output/cupertino/*
apple
pear

$ hadoop fs -cat /output/sunnyvale/*
banana

You can achieve the same result for the output value by overriding the generateActualValue method.
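
Here's a tiny plain-Java sketch of the effect a null key or value has on a line of text output. It assumes a writer that emits key, tab, value and leaves out a null part together with the separator, which is what the listings above show; the TextLineSketch class and formatLine method are hypothetical and aren't part of Hadoop.

```java
/** Illustrates how a null key or value changes a tab-separated output line. */
class TextLineSketch {

    /** Formats one record as "key<TAB>value", omitting a null part and the separator. */
    static String formatLine(String key, String value) {
        if (key == null && value == null) {
            return "";
        }
        if (key == null) {
            return value;
        }
        if (value == null) {
            return key;
        }
        return key + "\t" + value;
    }

    public static void main(String[] args) {
        System.out.println(formatLine("cupertino", "apple")); // key and value
        System.out.println(formatLine(null, "apple"));        // what returning null from generateActualKey achieves
        System.out.println(formatLine("cupertino", null));    // the generateActualValue equivalent
    }
}
```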

Changing the RecordWriter

In our final step we’ll look at how you can leverage multiple RecordWriter classes for different output files. This is accomplished by overriding the getBaseRecordWriter method, which is the method that the write code shown earlier calls the first time it sees each output path. In the example below we’re leveraging the same TextOutputFormat for all the files, but it gives you a sense of what can be accomplished.

static class KeyBasedMultipleTextOutputFormat extends MultipleTextOutputFormat<Text, Text> {
    @Override
    protected String generateFileNameForKeyValue(Text key, Text value, String name) {
        return key.toString() + "/" + name;
    }

    @Override
    protected RecordWriter<Text, Text> getBaseRecordWriter(FileSystem fs, JobConf job, String name, Progressable prog) throws IOException {
        // name is the path built by generateFileNameForKeyValue, so it starts with the key
        if (name.startsWith("cupertino")) {
            return new TextOutputFormat<Text, Text>().getRecordWriter(fs, job, name, prog);
        } else if (name.startsWith("sunnyvale")) {
            return new TextOutputFormat<Text, Text>().getRecordWriter(fs, job, name, prog);
        }
        return super.getBaseRecordWriter(fs, job, name, prog);
    }
}

Conclusion

When using MultipleOutputFormat, give some thought to the number of distinct files that each reducer will create. It would be prudent to plan your bucketing so that you have a relatively small number of files.

In this post we extended MultipleTextOutputFormat, which is a simple extension of MultipleOutputFormat that supports text outputs. MultipleSequenceFileOutputFormat also exists to support SequenceFiles in a similar fashion.

So what are the shortcomings with the MultipleOutputFormat class?

If you have a job that uses both map and reduce phases, then MultipleOutputFormat can’t be used in the map-side to write outputs. Of course, MultipleOutputFormat works fine in map-only jobs.
All RecordWriter classes must support exactly the same output record types. For example, you wouldn’t be able to support a RecordWriter that emitted <IntWritable, Text> for one output file, and have another RecordWriter that emitted <Text, Text>.
MultipleOutputFormat exists in the mapred package, so it won’t work with a job that requires use of the mapreduce package.

All is not lost if you bump into any of these issues, as you’ll discover in the next blog post in this series.

Using the libjars option with Hadoop

2013-02-25T14:20:00+00:00

When working with MapReduce, one of the challenges encountered early on is determining how to make your third-party JARs available to the map and reduce tasks. One common approach is to create a fat jar, which is a JAR that contains your classes as well as your third-party classes (see this Cloudera blog post for more details).

A more elegant solution is to take advantage of the libjars option in the hadoop jar command, also mentioned in the Cloudera post at a high level. Here I’ll go into detail on the three steps required to make this work.

Add libjars to the options

It can be confusing to know exactly where to put libjars when running the hadoop jar command. The following example shows the correct position of this option:

$ export LIBJARS=/path/jar1,/path/jar2
$ hadoop jar my-example.jar com.example.MyTool -libjars ${LIBJARS} -mytoolopt value

It’s worth noting in the above example that the JARs supplied as the value of the libjars option are comma-separated, and not separated by your O.S. path delimiter (which is how a Java classpath is delimited).

You may think that you’re done, but often times this step alone may not be enough - read on for more details!

Make sure your code is using GenericOptionsParser

The Java class that’s being supplied to the hadoop jar command should use the GenericOptionsParser class to parse the options being supplied on the CLI. The easiest way to do that is demonstrated with the following code, which leverages the ToolRunner class to parse-out the options:

public static void main(final String[] args) throws Exception {
  Configuration conf = new Configuration();
  int res = ToolRunner.run(conf, new com.example.MyTool(), args);
  System.exit(res);
}

It is crucial that the configuration object being passed into the ToolRunner.run method is the same one that you’re using when setting-up your job. To guarantee this, your class should use the getConf() method defined in Configurable (and implemented in Configured) to access the configuration:

public class SmallFilesMapReduce extends Configured implements Tool {

  public final int run(final String[] args) throws Exception {
    // build the job from the Configuration that ToolRunner populated
    Job job = new Job(super.getConf());
    ...
    job.waitForCompletion(true);
    return ...;
  }
}

If you don’t leverage the Configuration object supplied to the ToolRunner.run method in your MapReduce driver code, then your job won’t be correctly configured and your third-party JARs won’t be copied to the Distributed Cache or loaded in the remote task JVMs.

It’s the ToolRunner.run method (which delegates the command parsing to GenericOptionsParser) that parses out the libjars argument and adds a value for the tmpjars property to the Configuration object. So a quick way to make sure that this step is working is to look at the job file for your MapReduce job (there’s a link when viewing the job details from the JobTracker), and make sure that the tmpjars configuration name exists with a value identical to the path that you specified in your command. You can also use the command line to search for the tmpjars configuration in HDFS:

$ hadoop fs -cat <JOB_OUTPUT_HDFS_DIRECTORY>/_logs/history/*.xml | grep tmpjars
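
To see why sharing the Configuration matters, here's a toy plain-Java model of this step. It is not Hadoop's GenericOptionsParser: a Map stands in for the Configuration object, parseGenericOptions is a hypothetical helper of mine, and the real parser handles more options and may store the JAR locations in a different form.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of pulling -libjars out of the arguments and into a shared configuration. */
class LibJarsSketch {

    /** Moves a "-libjars <list>" pair from args into conf under "tmpjars" and returns the rest. */
    static String[] parseGenericOptions(Map<String, String> conf, String[] args) {
        List<String> remaining = new ArrayList<>();
        for (int i = 0; i < args.length; i++) {
            if ("-libjars".equals(args[i]) && i + 1 < args.length) {
                conf.put("tmpjars", args[++i]);
            } else {
                remaining.add(args[i]);
            }
        }
        return remaining.toArray(new String[0]);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        String[] toolArgs = parseGenericOptions(conf,
                new String[] {"-libjars", "/path/jar1,/path/jar2", "-mytoolopt", "value"});

        // A driver that builds its job from the parsed configuration sees the JAR list...
        System.out.println("shared conf: " + conf.get("tmpjars"));
        // ...while a driver that creates a fresh configuration has lost it.
        System.out.println("fresh conf:  " + new HashMap<String, String>().get("tmpjars"));
        System.out.println("tool args:   " + String.join(" ", toolArgs));
    }
}
```

The second line printed is null, which is the toy equivalent of a job whose third-party JARs never make it to the task JVMs.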

Use HADOOP_CLASSPATH to make your third-party JARs available on the client-side

So far the first two steps tackled what you needed to do to make your third-party JARs available to the remote map and reduce task JVMs. But what hasn’t been covered so far is making these same JARs available to the client JVM, which is the JVM that’s created when you run the hadoop jar command.

For this to happen, you should set the HADOOP_CLASSPATH environment variable to contain the O.S. path-delimited list of third-party JARs. Let’s extend the commands in the first step above with the addition of setting the HADOOP_CLASSPATH environment variable:

$ export LIBJARS=/path/jar1,/path/jar2
$ export HADOOP_CLASSPATH=/path/jar1:/path/jar2
$ hadoop jar my-example.jar com.example.MyTool -libjars ${LIBJARS} -mytoolopt value

Note that the value for HADOOP_CLASSPATH uses a Unix path delimiter of :, so modify accordingly for your platform. And if you don’t like the copy-paste above you could modify that line to substitute colons for the commas:

$ export HADOOP_CLASSPATH=`echo ${LIBJARS} | sed s/,/:/g`
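
If you'd rather build the classpath from code (say, in a launcher that has to run on more than one platform), the same substitution can be sketched in plain Java. The LibJarsClasspathSketch class and toClasspath method are hypothetical names; File.pathSeparator is the standard JDK constant for the platform's classpath delimiter.

```java
import java.io.File;

/** Converts a comma-separated -libjars value into a classpath string. */
class LibJarsClasspathSketch {

    /** Joins the comma-separated JAR paths using the given classpath separator. */
    static String toClasspath(String libJars, String separator) {
        return String.join(separator, libJars.split(","));
    }

    public static void main(String[] args) {
        // File.pathSeparator is ":" on Unix-like systems and ";" on Windows.
        System.out.println(toClasspath("/path/jar1,/path/jar2", File.pathSeparator));
    }
}
```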

Installing AsciiDoc on OSX

2013-02-17T14:00:00+00:00

AsciiDoc is a markup language and tool that I’m starting to play with to produce DocBook and PDF/HTML versions of my work. It took me a little longer than expected to get it up and running, so hopefully this blog will serve as a quick install guide for you, as well as the future me.

First I had to install Homebrew, a useful package manager for OSX:

$ sudo mkdir /usr/local/homebrew
$ cd /usr/local/homebrew
$ curl -L https://github.com/mxcl/homebrew/tarball/master | sudo tar xz --strip 1 -C .
$ sudo ln -s `pwd`/bin/brew /usr/local/bin/brew

Next-up was installing AsciiDoc and other required libraries via brew:

$ sudo brew install autoconf automake libevent asciidoc

After this I had to update my bash profile file to set an environment variable that points to the XML catalog created as part of the AsciiDoc installation:

$ echo "export XML_CATALOG_FILES=/usr/local/etc/xml/catalog" >>  ~/.bash_profile

Now you have to download Apache FOP, a print formatter used by AsciiDoc to create PDFs, which in my case resulted in a file at ~/Downloads/fop-1.0-bin.tar.gz. Untar the contents and create a symbolic link for fop:

$ cd /usr/local/
$ sudo tar -xzvf ~/Downloads/fop-1.0-bin.tar.gz
$ sudo ln -s /usr/local/fop-1.0/fop /usr/bin/fop

Finally, let’s make sure that everything is installed correctly. Create a sample AsciiDoc file called sample.asciidoc with the following contents:

Your First AsciiDoc
===================
Jane Blogs
:Author Initials: JB

This is your first AsciiDoc file - yay for you!

You can then run a2x, which will first generate a DocBook version of your AsciiDoc file, and then go on to generate the PDF.

$ a2x -v -fpdf -dbook --fop sample.asciidoc

This should create a sample.pdf in the same directory as your AsciiDoc file. You can also generate a HTML version with:

$ asciidoc -b html5 -a data-uri -a toc2 sample.asciidoc

Java 6 and 7 with the dotted/dotless I

2013-02-14T14:00:00+00:00

Imagine you’re working on a project in Java where you are handling text in a language that contains characters outside the standard 128-character ASCII scheme, such as Turkish. How about we focus on the dotted and dotless I:

Letter	Description	Unicode (decimal)	Unicode (Java hex)
İ	Upper-case dotted I	304	u0130
I	Upper-case (dotless) Latin I	73	u0049
ı	Lower-case dotless I	305	u0131
i	Lower-case (dotted) Latin I	105	u0069

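This is how the lower and upper-case versions of the Turkish dotted/dotless “I” relate: the upper-case dotted İ pairs with the lower-case dotted i, and the upper-case dotless I pairs with the lower-case dotless ı.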

Since we know that the hexadecimal Unicode representation of the upper-case dotted “I” (İ) is u0130, how about we try to convert it to its lower-case form, which should be the regular lower-case Latin “i”, which in Unicode hexadecimal form is u0069.

System.out.println(String.valueOf('\u0130').toLowerCase());

If we run this same code under Java 6 and Java 7, the printed results differ.

Hmm - I may be mistaken, but it looks like under Java 7 the “i” has grown another dot! Let’s see what the Unicode codepoints in the resulting string look like using the following code:

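static void dumpUnicodeCodePoints(String s) {
    int offset;
    for (int i = 0; i < s.length(); i += offset) {
        int codepoint = s.codePointAt(i);
        // supplementary characters occupy two chars, so advance by the code point's width
        offset = Character.charCount(codepoint);
        System.out.print(String.format("u%04x ", codepoint));
    }
}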

If we again run this in Java 6 and Java 7 against the result of the toLowerCase method on the upper-case dotted “I” we get:

Java 6: u0069
Java 7: u0069 u0307

It looks like the first codepoint is indeed correct (the Latin lower-case “i”), but what is u0307? Wikipedia tells us it’s the “combining dot above”, which is to say that it isn’t displayed as a character of its own; instead it modifies the previous character with an additional dot, and the pair is rendered as a single character (called a grapheme), just like we saw in our example.

What’s puzzling about this is why do we see the behaviour of toLowerCase change between Java versions? If you dig into the Java 7 String class and compare the code against the Java 6 source, you’ll see that the following code was added to Java 7:

} else if (srcChar == '\u0130') { // LATIN CAPITAL LETTER I DOT
    lowerChar = Character.ERROR;
}

Basically the end result of this change is that for this specific case (the upper-case dotted I), Java 7 now consults a special Unicode character database (http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt), which provides data on complex case-mappings. Looking at this file you can see several lines for the upper-case dotted I:

CODE       LOWER   TITLE   UPPER  LANGUAGE
0130;  0069 0307;   0130;   0130;
0130;  0069;        0130;   0130;       tr;
0130;  0069;        0130;   0130;       az;

Entries with a language take precedence over those without, so in my JVM where the default locale is English, the first row of the mapping is used, which lines up with the codepoints that we saw in the output of our Java 7 example. Therefore to make Java do the right thing here for Turkish, we need to explicitly specify the Turkish locale (“tr” is the ISO 639 alpha-2 language code for Turkish) to the toLowerCase method:

dumpUnicodeCodePoints(String.valueOf('\u0130').toLowerCase(new Locale("tr")));

This now yields a result consistent with what we expect from the Turkish lower-case mapping:

u0069

The bottom line is that Java 6 will always convert the upper-case dotted “I” to a lower-case Latin “i”, whereas Java 7 follows the complex Unicode case mapping based on the locale passed into the toLowerCase method, which defaults to Locale.getDefault() if you don’t supply one.
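
Here's a self-contained sketch that puts the pieces above together so you can try it on your own JVM. The DottedISketch class and codePoints helper are my own names, and I'm using Locale.ROOT as a stand-in for "any locale that isn't Turkish or Azeri" so that the result doesn't depend on your machine's default locale.

```java
import java.util.Locale;

/** Compares locale-neutral and Turkish lower-casing of the upper-case dotted I. */
class DottedISketch {

    /** Renders each code point of s as uXXXX, separated by spaces. */
    static String codePoints(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i += Character.charCount(s.codePointAt(i))) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.format("u%04x", s.codePointAt(i)));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String dottedI = "\u0130";
        Locale turkish = Locale.forLanguageTag("tr");

        // Java 7 and later: the locale-neutral mapping keeps the dot as a combining character.
        System.out.println("root:    " + codePoints(dottedI.toLowerCase(Locale.ROOT)));
        // The Turkish mapping yields the plain Latin lower-case i.
        System.out.println("turkish: " + codePoints(dottedI.toLowerCase(turkish)));
    }
}
```

Passing an explicit locale on every call is the design choice worth copying here, since it keeps case conversion from silently changing when the code runs on a machine with a different default locale.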

Oh, and one last tip - for most lower-case mappings the String.toLowerCase method defers to Character.toLowerCase. But take stock of the advice given in the Character.toLowerCase JavaDoc comment, especially in the second and third paragraphs:

/**
 * Converts the character (Unicode code point) argument to
 * lowercase using case mapping information from the UnicodeData
 * file.
 *
 * <p> Note that
 * {@code Character.isLowerCase(Character.toLowerCase(codePoint))}
 * does not always return {@code true} for some ranges of
 * characters, particularly those that are symbols or ideographs.
 *
 * <p>In general, {@link String#toLowerCase()} should be used to map
 * characters to lowercase. {@code String} case mapping methods
 * have several benefits over {@code Character} case mapping methods.
 * {@code String} case mapping methods can perform locale-sensitive
 * mappings, context-sensitive mappings, and 1:M character mappings, whereas
 * the {@code Character} case mapping methods cannot.
 *
 * @param   codePoint   the character (Unicode code point) to be converted.
 * @return  the lowercase equivalent of the character (Unicode code
 *          point), if any; otherwise, the character itself.
 * @see     Character#isLowerCase(int)
 * @see     String#toLowerCase()
 *
 * @since   1.5
 */
public static int toLowerCase(int codePoint) {
    return CharacterData.of(codePoint).toLowerCase(codePoint);
}

LZOP decompression - revenge of the useless cat

2013-02-08T14:00:00+00:00

For me LZOP is the ubiquitous compression codec when working with large text files in HDFS due to its MapReduce data locality advantages. As a result when I want to peek at LZOP-compressed files in HDFS I use a command such as:

shell$ hadoop fs -cat /some/file.lzo | lzop -dc | head

With this command the contents of an LZOP-compressed file in HDFS are piped to the lzop utility, where the -dc flags tell lzop to decompress the stream and write the uncompressed data to standard out, and the final head shows the first 10 lines of the data. I may substitute head with other utilities such as awk or sed, but I always follow this general pattern of piping the lzop output to another utility.

Imagine my surprise the other day when I tried the same command on a smaller file (hence not needing to use the head command), only to see this error:

shell$ hadoop fs -cat /some/file.lzo | lzop -dc
lzop: <stdout>: uncompressed data not written to a terminal

What just happened - why would the first command work, but not the second? My guess is that this is likely the authors of the lzop utility safeguarding us from accidentally flooding standard output with uncompressed data. Which is frustrating, because as you can see from the following example this is a different route than that which the authors of gunzip took:

shell$ echo "the cat" | gzip -c | gunzip -c
the cat

If we run the same command with lzop we see the same result as we saw earlier:

shell$ echo "the cat" | lzop -c | lzop -dc
lzop: <stdout>: uncompressed data not written to a terminal

A crude approach to solving this problem is to pipe the lzop output to cat (which is a necessary violation of the useless cat pattern):

shell$ hadoop fs -cat /some/file.lzo | lzop -dc | cat

Luckily lzop has a -f option which removes the need for the cat:

shell$ hadoop fs -cat /some/file.lzo | lzop -dcf

It turns out that the man page for lzop is instructive with regard to the -f option, and indicates various scenarios where it can be helpful:

shell$ man lzop
...
-f, --force
   Force lzop to

    - overwrite existing files
    - (de-)compress from stdin even if it seems a terminal
    - (de-)compress to stdout even if it seems a terminal
    - allow option -c in combination with -U

   Using -f two or more times forces things like

    - compress files that already have a .lzo suffix
    - try to decompress files that do not have a valid suffix
    - try to handle compressed files with unknown header flags

   Use with care.

Executing variables that contain shell operators

2013-01-27T22:20:00+00:00

I touched a little on pipes in a previous post. Here’s a quick example of an echo utility which outputs two lines, and a pipe operator which redirects that output to a grep utility which performs a simple filter to only include lines that contain the word “cat”:

shell$ echo -e 'the cat \n sat on the mat' | grep cat
the cat

Cool - since that worked, what do you think will happen if you do the following?

shell$ cmd="echo -e 'the cat \n sat on the mat' | grep cat"
shell$ ${cmd}

In the above example we’re simply assigning the original command to a shell variable, and then executing it. So why, then, would the output be this?

shell$ ${cmd}
'the cat
 sat on the mat' | grep cat

This is something that has bitten me in the past when I write shell scripts. What’s happening here is that the shell executes the contents of variable cmd as a single command, which means that everything after echo is treated as arguments to the echo utility, including the pipe.

What we actually need to happen is to have the entire contents of cmd evaluated by the shell so that the shell can create the pipeline between the two utilities. This is where the utility eval comes into play - eval tells the shell to concatenate the arguments and have them executed by the shell.

shell$ eval ${cmd}
the cat

The moral of this story is that if you want to execute a variable that includes any shell constructs (such as the pipe in our example), make sure you eval it. Examples of shell constructs include redirections (e.g. echo "the cat" > file1.txt), shell conditionals, loops and functions.
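
The same pitfall exists when you launch commands from Java, because a process launcher passes operators such as | straight through as arguments unless a shell interprets them first. The sketch below is a rough analogue of eval under a few assumptions: it needs a Unix-like system with sh, printf and grep on the path, the ShellPipelineSketch class and run helper are hypothetical names, and I've swapped echo -e for printf because echo's handling of -e differs between shells.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

/** Runs a command string through "sh -c" so the shell can build the pipeline. */
class ShellPipelineSketch {

    /** Starts argv as a single process and returns everything it prints. */
    static String run(List<String> argv) {
        try {
            Process process = new ProcessBuilder(argv).redirectErrorStream(true).start();
            String output = new String(process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
            process.waitFor();
            return output;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String cmd = "printf 'the cat\\nsat on the mat\\n' | grep cat";
        // Handing the whole string to "sh -c" lets the shell parse the pipe, much like eval does.
        System.out.print(run(List.of("sh", "-c", cmd)));
    }
}
```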

Using awk and friends with Hadoop

2013-01-17T14:20:00+00:00

Imagine you have a CSV file that you want to manipulate. Here’s a sample file we can play with:

lopez,charlie,2002,11,21
parker,ward,1995,04,08
henderson,russell,2007,10,01

Our goal is to transform this into the following form by combining the last three columns:

lopez,charlie,20021121
parker,ward,19950408
henderson,russell,20071001

In Linux this would take all of two seconds (excuse the awkward awk command):

shell$ awk -F"," '{ print $1","$2","$3$4$5 }' people.txt

What if you wanted to quickly do the same in HDFS - and let’s assume you want to write the results back to HDFS. One approach would be to use the HDFS CLI to stream the inputs into awk, and stream the awk output back into HDFS. You could do this with the HDFS cat and put - options (note that adding a hyphen after put instructs the put command to stream data from standard input to HDFS):

shell$ hadoop fs -cat people.txt | awk -F"," '{ print $1","$2","$3$4$5 }' | hadoop fs -put - people-coalesed.txt

BTW, if your input and output files are LZOP-compressed then this command would work:

shell$ hadoop fs -cat people.txt.lzo | lzop -dc | awk -F"," '{ print $1","$2","$3$4$5 }' | \
         lzop -c | hadoop fs -put - people-coalesed.txt.lzo

This is great if your file isn’t too large, but if it’s multiple gigabytes in length then you probably want to harness the power of MapReduce to get this done in a jiffy! The words “in a jiffy” and “MapReduce” aren’t commonly used together, so what do we do? Well you could crack open Pig or Hive and write some custom user-defined functions, but this means you end up in Java which we want to avoid.

Hadoop Streaming comes to the rescue in these situations. Let’s first create our awk script which will be executed:

shell$ cat people.awk
#!/bin/awk -f

BEGIN { FS = "," }
{ print $1","$2","$3$4$5 }

In Linux, if you make this awk script executable, you could execute it as follows:

shell$ ./people.awk people.txt

In MapReduce-land we don’t need to join data in this particular example, so we don’t need to run any reducers. Call your awk script from mappers via Hadoop Streaming with this command:

shell$ HADOOP_HOME=/usr/lib/hadoop
shell$ ${HADOOP_HOME}/bin/hadoop \
  jar ${HADOOP_HOME}/contrib/streaming/*.jar \
  -D mapreduce.job.reduces=0 \
  -D mapred.reduce.tasks=0 \
  -input people.txt \
  -output people-coalesed \
  -mapper people.awk \
  -file people.awk

You can view the output in HDFS with a cat:

shell$ hadoop fs -cat /user/aholmes/people-coalesed/part*
henderson,russell,20071001
lopez,charlie,20021121
parker,ward,19950408

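A few options in the Hadoop Streaming command are worth examining:

-D mapred.reduce.tasks=0 (together with its newer name, mapreduce.job.reduces) sets the number of reducers to zero, which makes this a map-only job.
-mapper people.awk names the executable that each map task runs.
-file people.awk ships the local awk script with the job so that it’s available to the map tasks.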

Finally - to get LZO into the picture you need to add -inputformat, -D mapred.output.compress and -D mapred.output.compression.codec arguments:

shell$ HADOOP_HOME=/usr/lib/hadoop
shell$ ${HADOOP_HOME}/bin/hadoop \
  jar ${HADOOP_HOME}/contrib/streaming/*.jar \
  -D mapreduce.job.reduces=0 \
  -D mapred.reduce.tasks=0 \
  -D mapred.output.compress=true \
  -D stream.map.input.ignoreKey=true \
  -D mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec \
  -inputformat com.hadoop.mapred.DeprecatedLzoTextInputFormat \
  -input people.txt.lzo \
  -output people-coalesed \
  -mapper people.awk \
  -file people.awk

Update 6/3/2013:

This article has a Serbo-Croatian translation by Anja Skrba.

Configuring and tuning MapReduce's shuffle

2012-11-26T14:20:00+00:00

Once you have outgrown your small Hadoop cluster it’s worth tuning some of the shuffle configurables to ensure that your performance keeps up with the physical growth of your cluster. The figure below shows key configurables in the shuffle stage in Hadoop versions 1.x and earlier, and identifies those that should be tuned.

You can read more about these configurables and their default values by looking at mapred-default.xml. My book Hadoop in Practice (Manning Publications) in chapter 6 discusses how some of the configuration values in the figure should be tweaked when you start working with mid to large-size Hadoop clusters.

Controlling user logging in Hadoop

2012-11-12T14:20:00+00:00

Imagine that you’re a Hadoop administrator, and to make things interesting you’re managing a multi-tenant Hadoop cluster where data scientists, developers and QA are pounding your cluster. One day you notice that your disks are filling-up fast, and after some investigating you realize that the root cause is your MapReduce task attempt logs.

How do you guard against this sort of thing happening? Before we get to that we need to understand where these files exist, and how they’re written. The figure below shows the three log files that are created for each task attempt in MapReduce. Notice that the logs are written to the local disk of the task attempt.

OK, so how does Hadoop normally make sure that our disks don’t fill-up with these task attempt logs? I’ll cover three approaches.

Approach 1: mapred.userlog.retain.hours

Hadoop has a mapred.userlog.retain.hours configurable, which is defined in mapred-default.xml as:

The maximum time, in hours, for which the user-logs are to be retained after the job completion.

Great, but what if your disks are filling up before Hadoop has had a chance to automatically clean them up? It may be tempting to reduce mapred.userlog.retain.hours to a smaller value, but before you do that you should know that there’s a bug with the Hadoop versions 1.x and earlier (see MAPREDUCE-158), where the logs for long-running jobs that run longer than mapred.userlog.retain.hours are accidentally deleted. So maybe we should look elsewhere to solve our overflowing logs problem.

Approach 2: mapred.userlog.limit.kb

Hadoop has another configurable, mapred.userlog.limit.kb, which can be used to limit the file size of syslog, which is the log4j log output file. Let’s peek again at the documentation:

The maximum size of user-logs of each task in KB. 0 disables the cap.

The default value is 0, which means that log writes go straight to the log file. So all we need to do is to set a positive value and we’re set, right? Not so fast - it turns out that this approach has two disadvantages:

Hadoop and user logs are actually cached in memory, so you’re taking away mapred.userlog.limit.kb kilobytes worth of memory from your task attempt’s process.
Logs are only written out when the task attempt process has completed, and only contain the last mapred.userlog.limit.kb worth of log entries, so this can make it challenging to debug long-running tasks.
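
Both disadvantages fall out of the same buffering behavior, which can be sketched in plain Java as a capped in-memory tail. This is only a model of the behavior described above and not Hadoop's log appender: the LogTailSketch class is hypothetical, and how Hadoop actually measures the cap may differ.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.Deque;

/** Keeps only the most recent log lines that fit under a size cap, all in memory. */
class LogTailSketch {
    private final long limitBytes;
    private final Deque<String> lines = new ArrayDeque<>();
    private long bufferedBytes;

    LogTailSketch(long limitKb) {
        this.limitBytes = limitKb * 1024;
    }

    /** Buffers a line, evicting the oldest lines once the cap is exceeded. */
    void append(String line) {
        lines.addLast(line);
        bufferedBytes += size(line);
        while (bufferedBytes > limitBytes && lines.size() > 1) {
            bufferedBytes -= size(lines.removeFirst());
        }
    }

    /** Models the end of the task: only what is still buffered reaches the log file. */
    String flush() {
        return String.join("\n", lines);
    }

    private static long size(String line) {
        return line.getBytes(StandardCharsets.UTF_8).length + 1; // +1 for the newline
    }

    public static void main(String[] args) {
        LogTailSketch log = new LogTailSketch(1); // 1 KB cap
        for (int i = 0; i < 100; i++) {
            log.append(String.format("event %03d %s", i, "x".repeat(50)));
        }
        // Earlier events were evicted, so only the tail of the run is printed.
        System.out.println(log.flush());
    }
}
```

Everything held in the deque counts against the task's heap, and anything evicted before the task ends is gone for good, which is what makes long-running tasks hard to debug with this setting.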

OK, so what else can we try? We have one more solution, log levels.

Approach 3: Changing log levels

Ideally all your Hadoop users got the memo about minimizing excessive logging. The reality is that you have limited control over what users decide to log in their code, but you do have control over the task attempt log levels.

If you had a MapReduce job that was aggressively logging in package com.example.mr, then you may be tempted to use the daemonlog CLI to connect to all the TaskTracker daemons and change the logging to ERROR level:

hadoop daemonlog -setlevel <host:port> com.example.mr ERROR

Yet again we hit a roadblock - this will only change the logging level for the TaskTracker process, and not for the task attempt process. Drat! This really only leaves one option, which is to update your ${HADOOP_HOME}/conf/log4j.properties on all your data nodes by adding the following line to this file:

log4j.logger.com.example.mr=ERROR

The great thing about this change is that you don’t need to restart MapReduce, since any new task attempt processes will pick up your changes to log4j.properties.

Pipes and useless cats

2012-10-30T03:20:00+00:00

I love me some Unix command pipes:

$ cat /some/file.txt | sort | head

Pipelines let you chain together multiple commands to manipulate data flows. Pipes are not only useful as a data filtering mechanism, but when combined with tools such as cut, awk and sed can also be used for projections and transformations. The Unix pipe, while simple in concept, is a sophisticated shell construct and one big reason why Unix shells are to this day a popular tool in a programmer/system administrator/data scientist’s toolkit.

So why am I sitting here telling you something that you already know? Fair question - to answer that let’s take another look at that command:

$ cat /some/file.txt | sort | head

While shell pipelines are great, we have a subtle problem here - and it’s something that’s known as a useless cat. No, I don’t hate cats - this expression harks back to the old Usenet days, when a member of comp.unix.shell would write a weekly post where he would highlight a redundant use of the cat command.

So why is the above command useless? Because sort can take one or more files as arguments, much like the majority of Unix commands. So this command can be rewritten as:

$ sort /some/file.txt | head

Removing cat from the equation means that we’ve reduced the number of processes that need to execute, and cut down on the buffering and data copying that the shell needs to do to make pipelines work - a win-win.

In fact cat really doesn’t have many uses - if you need to view the contents of a file you’re better off using vi or less, and otherwise most Unix commands can directly work with files.

So next time you’re about to run a cat command - think about whether or not you need it, or whether you’re just perpetuating use of the useless cat!

Hadoop unit testing with MiniMRCluster and MiniDFSCluster

2012-10-20T05:20:00+00:00

In a recent blog post Steve Loughran mentioned that I didn’t cover Hadoop’s MiniMRCluster in my book. At the time I wrote the testing chapter of “Hadoop in Practice” I decided that covering MRUnit and LocalJobRunner was sufficient to meet the goals of most MapReduce unit tests, but for completeness I want to cover MiniMRCluster in this post.

MRUnit is great for quick and easy unit testing of MapReduce jobs, where you don’t want to test Input/OutputFormat and Partitioner code. LocalJobRunner is a step above MRUnit in that it allows you to test Input/OutputFormat classes, but it is single-threaded so it’s not useful for uncovering bugs related to multiple map or reduce tasks, or for properly exercising partitioners.

That’s where MiniMRCluster (and MiniDFSCluster) come into play. These classes offer full-blown in-memory MapReduce and HDFS clusters, and can launch multiple MapReduce and HDFS nodes. MiniMRCluster and MiniDFSCluster are bundled with the Hadoop 1.x test JAR, and are used heavily within Hadoop’s own unit tests.

The easy way to leverage MiniMRCluster and MiniDFSCluster is to extend the abstract ClusterMapReduceTestCase class, which is a JUnit TestCase and starts/stops a Hadoop cluster around each JUnit test. ClusterMapReduceTestCase runs a 2-node MapReduce cluster with 2 HDFS nodes. The way you should be able to use this class is as follows:

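import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;

import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.ClusterMapReduceTestCase;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Utils;

// WordCountMapper and SumReducer are your own job classes and are not shown here.
public class WordCountTest extends ClusterMapReduceTestCase {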
    public void test() throws Exception {
        JobConf conf = createJobConf();

        Path inDir = new Path("testing/jobconf/input");
        Path outDir = new Path("testing/jobconf/output");

        OutputStream os = getFileSystem().create(new Path(inDir, "text.txt"));
        Writer wr = new OutputStreamWriter(os);
        wr.write("b a\n");
        wr.close();

        conf.setJobName("mr");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(LongWritable.class);

        conf.setMapperClass(WordCountMapper.class);
        conf.setReducerClass(SumReducer.class);

        FileInputFormat.setInputPaths(conf, inDir);
        FileOutputFormat.setOutputPath(conf, outDir);

        assertTrue(JobClient.runJob(conf).isSuccessful());

        // Check the output is as expected
        Path[] outputFiles = FileUtil.stat2Paths(
                getFileSystem().listStatus(outDir, new Utils.OutputFileUtils.OutputFilesFilter()));

        assertEquals(1, outputFiles.length);

        InputStream in = getFileSystem().open(outputFiles[0]);
        BufferedReader reader = new BufferedReader(new InputStreamReader(in));
        assertEquals("a\t1", reader.readLine());
        assertEquals("b\t1", reader.readLine());
        assertNull(reader.readLine());
        reader.close();
    }
}

However, at least with the Hadoop 1.0.3 release, this will fail with the following exception:

12/10/19 23:10:37 ERROR mapred.MiniMRCluster: Job tracker crashed
java.lang.NullPointerException
  at java.io.File.<init>(File.java:222)
  at org.apache.hadoop.mapred.JobHistory.initLogDir(JobHistory.java:531)
  at org.apache.hadoop.mapred.JobHistory.init(JobHistory.java:499)
  at org.apache.hadoop.mapred.JobTracker$2.run(JobTracker.java:2334)
  at org.apache.hadoop.mapred.JobTracker$2.run(JobTracker.java:2331)
  at java.security.AccessController.doPrivileged(Native Method)
  ...

The trick here is that the JobTracker is expecting hadoop.log.dir to be set in the system properties, which it isn’t in our example, causing the NPE. As it turns out this is a bug (see MAPREDUCE-2785) which according to Jira will be fixed in the Hadoop 1.1 release (thanks to Steve for that information). The fix is simple - override the setUp() method in ClusterMapReduceTestCase and set the Hadoop log directory:

@Override
protected void setUp() throws Exception {

    System.setProperty("hadoop.log.dir", "/tmp/logs");

    super.startCluster(true, null);
}

Once you make this change the above JUnit test will work. Rolling this into each and every one of your unit tests can be a bit tedious, but luckily there are a couple of options out there so that you don’t have to. First, Steve pointed out a LocalMRCluster Groovy class bundled in SmartFrog which fixes this issue by extending MiniMRCluster.

Another alternative is to use my GitHub hadoop-utils project, which contains a JUnit class similar to ClusterMapReduceTestCase called MiniHadoopTestCase. It fixes this property problem, gives you more control over where the in-memory clusters store their data on your local filesystem, and lets you control the number of TaskTrackers and DataNodes.

Hadoop-utils also contains a helper class (TextIOJobBuilder) to help with writing MapReduce input files, and verifying the output results. You can see an example of how clean your unit tests can look when combining TextIOJobBuilder with MiniHadoopTestCase in class TotalOrderSortTest:

public class TotalOrderSortTest extends MiniHadoopTestCase {

    @Test
    public void test() throws Exception {

        InputSampler.RandomSampler sampler = new InputSampler.RandomSampler(1.0, 6, 1);

        JobConf jobConf = super.getMiniHadoop().createJobConf();

        TextIOJobBuilder builder = new TextIOJobBuilder(
                super.getMiniHadoop().getFileSystem())
                .addInput("foo-hump")
                .addInput("foo-hump")
                .addInput("clump-bar")
                .addExpectedOutput("clump-bar")
                .addExpectedOutput("foo-hump")
                .writeInputs();

        new SortConfig(jobConf).setUnique(true);

        SortTest.run(
                jobConf,
                builder,
                2,
                2,
                sampler);
    }
}

The only real downside to using MiniMRCluster and MiniDFSCluster is speed - it takes a good 5-10 seconds for both setup and tear-down, and when you multiply this by the number of test cases it can add up.

How partitioning, collecting and spilling work in MapReduce

2012-09-24T05:20:00+00:00

The figure below shows the various steps that the Hadoop MapReduce framework takes after your map function emits a key/value output record. Please note that this figure represents what’s happening with Hadoop versions 1.x and earlier - in Hadoop 2.x there have been some changes which will be discussed in a future blog post.

My book Hadoop in Practice (Manning Publications) in chapter 6 discusses how some of the configuration values in the figure should be tweaked when you start working with mid to large-size Hadoop clusters.

Using sed to perform inline replacements of regex groups

2012-09-17T05:20:00+00:00

I love tools like sed and awk - I use them every day, and only realize how much I rely on them when I’m forced to work on a machine that’s not running Unix. Today I want to look at a feature that is really useful when working with regular expressions in sed.

Imagine that you had an IP address, and you wanted to change the second octet - one way to do this in sed is the following:

shell$ echo "127.0.0.1" | sed "s/127.0/127.1/"
127.1.0.1

That seemed to work well, and was simple. But what if you had a file of random IP’s - how would you change the second octet in that scenario? Sure, you could use awk, but that feels like it would be overkill. Well, it can be done in sed with something called regular expression group substitutions.

First of all, you’ll need to tell sed that you are using extended regular expressions by using the -r option, so that you don’t have to escape some of the regular expression characters (if you’re curious, they are ?+(){}). If you end up needing to use any of these characters as literals, you’ll need to escape them with a backslash (\).

sed supports up to 9 groups that can be defined in the pattern string, and subsequently referenced in the replacement string. In the following command the pattern string starts with a group, which contains the first octet followed by the period, and that’s followed by a second octet. In the replacement string we’re referencing the first (and only) group with \1, followed by 234 which is the replacement for the rest of the matching string, which contains the second octet.

shell$ echo "127.0.0.1" | sed -r "s/^([0-9]{1,3}\.)[0-9]{1,3}/\1234/"
127.234.0.1

What if we wanted to preserve the second octet and simply add a “1” in front of it? In that case you can define a second group in the pattern, and reference the second group in the replacement value:

shell$ echo "127.0.0.1" | sed -r "s/^([0-9]{1,3}\.)([0-9]{1,3})/\11\2/"
127.10.0.1

Actually it would have been easier to just remove the second octet altogether from the pattern:

shell$ echo "127.0.0.1" | sed -r "s/^([0-9]{1,3}\.)/\11/"
127.10.0.1
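
Coming back to the file-of-IPs scenario, the same substitutions work unchanged when sed is handed a file instead of a single echoed address. Here is a minimal sketch with a made-up scratch file; it uses -E, which as far as I know both GNU and BSD sed accept as the extended-regex switch (GNU sed treats it like -r):

```shell
# Hypothetical sample file of IP addresses.
ips_file=$(mktemp)
printf '%s\n' 10.1.2.3 192.168.0.1 127.0.0.1 > "$ips_file"

# Replace the second octet with 234: group 1 keeps the first octet and
# its period, and the uncaptured second octet is dropped.
replace_second_octet() {
  sed -E 's/^([0-9]{1,3}\.)[0-9]{1,3}/\1234/' "$1"
}

# Prefix the second octet with a 1: group 2 re-emits the original octet.
prefix_second_octet() {
  sed -E 's/^([0-9]{1,3}\.)([0-9]{1,3})/\11\2/' "$1"
}

replace_second_octet "$ips_file"
prefix_second_octet "$ips_file"
```

Because sed only understands the single-digit references \1 through \9, \1234 is read as group 1 followed by the literal text 234.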

On a final note - it wasn’t so long ago that I would write a command similar to the one below if I wanted to use sed to perform a substitution and overwrite an existing file:

shell$ sed 's/a/b/' file1.txt > file2.txt; mv file2.txt file1.txt

Ugh! Well there’s no need to do this - sed has a -i option which will do an inline replace of the file:

shell$ sed -i 's/a/b/' file1.txt
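
If overwriting a file in place feels risky, -i can also keep a backup of the original. A minimal sketch, using a made-up scratch directory; the attached-suffix form -i.bak is, to my knowledge, accepted by both GNU and BSD sed:

```shell
# Hypothetical scratch file standing in for file1.txt.
work_dir=$(mktemp -d)
printf 'banana\n' > "$work_dir/file1.txt"

# Edit in place, keeping the untouched original as file1.txt.bak.
sed -i.bak 's/a/b/' "$work_dir/file1.txt"

cat "$work_dir/file1.txt"      # the edited file
cat "$work_dir/file1.txt.bak"  # the original contents
```

Without the g flag only the first match on each line is replaced, so the edited file contains bbnana.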

Ahhh, that’s better! Anything that’s easy on the eyes gets a thumbs-up from me.

Sorting text files with MapReduce

2012-09-10T13:00:00+00:00

In my last post I wrote about sorting files in Linux. Decently large files (in the tens of GB’s) can be sorted fairly quickly using that approach. But what if your files are already in HDFS, or are hundreds of GB’s in size or larger? In this case it makes sense to use MapReduce and leverage your cluster resources to sort your data in parallel.

MapReduce should be thought of as a ubiquitous sorting tool, since by design it sorts all the map output records (using the map output keys), so that all the records that reach a single reducer are sorted. The diagram below shows the internals of how the shuffle phase works in MapReduce.

Given that MapReduce already performs sorting between the map and reduce phases, sorting files can be accomplished with an identity function (one where the inputs to the map and reduce phases are emitted directly). This is in fact what the sort example that is bundled with Hadoop does. You can see how the example code works by examining the org.apache.hadoop.examples.Sort class. To use this example code to sort text files in Hadoop, you would use it as follows:

shell$ export HADOOP_HOME=/usr/lib/hadoop
shell$ $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-examples.jar sort \
         -inFormat org.apache.hadoop.mapred.KeyValueTextInputFormat \
         -outFormat org.apache.hadoop.mapred.TextOutputFormat \
         -outKey org.apache.hadoop.io.Text \
         -outValue org.apache.hadoop.io.Text \
         /hdfs/path/to/input \
         /hdfs/path/to/output

This works well, but it doesn’t offer some of the features that I commonly rely upon in Linux’s sort, such as sorting on a specific column, and case-insensitive sorts.

Linux-esque sorting in MapReduce

I’ve started a new GitHub repo called hadoop-utils, where I plan to roll useful helper classes and utilities. The first one is a flexible Hadoop sort. The same Hadoop example sort can be accomplished with the hadoop-utils sort as follows:

shell$ $HADOOP_HOME/bin/hadoop jar hadoop-utils-<version>-jar-with-dependencies.jar \
         com.alexholmes.hadooputils.sort.Sort \
         /hdfs/path/to/input \
         /hdfs/path/to/output

To bring sorting in MapReduce closer to the Linux sort, the --key and --field-separator options can be used to specify one or more columns that should be used for sorting, as well as a custom separator (whitespace is the default). For example, imagine you had a file in HDFS called /input/300names.txt which contained first and last names:

shell$ hadoop fs -cat /input/300names.txt | head -n 5
       Roy     Franklin
       Mario   Gardner
       Willis  Romero
       Max     Wilkerson
       Latoya  Larson

To sort on the last name you would run:

shell$ $HADOOP_HOME/bin/hadoop jar hadoop-utils-<version>-jar-with-dependencies.jar \
         com.alexholmes.hadooputils.sort.Sort \
         --key 2 \
         /input/300names.txt \
         /hdfs/path/to/output

The syntax of --key is POS1[,POS2], where the first position (POS1) is required, and the second position (POS2) is optional - if it’s omitted then POS1 through the rest of the line is used for sorting. Just like the Linux sort, --key is 1-based, so --key 2 in the above example will sort on the second column in the file.
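
For comparison, this is roughly what the equivalent looks like with the Linux sort on a local copy of the same names. It’s a sketch only: the scratch file is made up, the columns are separated by a single space, and it exercises GNU sort rather than the hadoop-utils job:

```shell
# Hypothetical local copy of the first five names, single-space separated.
names=$(mktemp)
printf '%s\n' 'Roy Franklin' 'Mario Gardner' 'Willis Romero' \
  'Max Wilkerson' 'Latoya Larson' > "$names"

# -k 2 sorts on everything from the second field to the end of the line;
# LC_ALL=C makes the byte-wise ordering reproducible across machines.
sort_by_last_name() {
  LC_ALL=C sort -k 2 "$1"
}

sort_by_last_name "$names"
```

With the Linux sort, leading blanks count as part of the key unless you add the b modifier, so unevenly padded columns are worth normalizing first.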

LZOP integration

Another trick that this sort utility has is its tight integration with LZOP, a useful compression codec that works well with large files in MapReduce (see chapter 5 of Hadoop in Practice for more details on LZOP). It can work with LZOP input files that span multiple splits, and can also LZOP-compress outputs, and even create LZOP index files. You would do this with the --codec and --lzop-index options:

shell$ $HADOOP_HOME/bin/hadoop jar hadoop-utils-<version>-jar-with-dependencies.jar \
         com.alexholmes.hadooputils.sort.Sort \
         --key 2 \
         --codec com.hadoop.compression.lzo.LzopCodec \
         --map-codec com.hadoop.compression.lzo.LzoCodec \
         --lzop-index \
         /hdfs/path/to/input \
         /hdfs/path/to/output

Multiple reducers and total ordering

If your sort job runs with multiple reducers (either because mapreduce.job.reduces in mapred-site.xml has been set to a number larger than 1, or because you’ve used the -r option to specify the number of reducers on the command-line), then by default Hadoop will use the HashPartitioner to distribute records across the reducers. Use of the HashPartitioner means that you can’t concatenate your output files to create a single sorted output file. To do this you’ll need total ordering, which is supported by both the Hadoop example sort and the hadoop-utils sort - the hadoop-utils sort enables this with the --total-order option.

shell$ $HADOOP_HOME/bin/hadoop jar hadoop-utils-<version>-jar-with-dependencies.jar \
         com.alexholmes.hadooputils.sort.Sort \
         --total-order 0.1 10000 10 \
         /hdfs/path/to/input \
         /hdfs/path/to/output

The syntax for this option is unintuitive so let’s look at what each field means.

More details on total ordering can be seen in chapter 4 of Hadoop in Practice.
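
To see why hash partitioning rules out simple concatenation while range partitioning allows it, here is a toy local simulation. Everything in it is an assumption made for illustration: two "reducers" are represented by files, line length modulo 2 stands in for a hash function, and the split point "g" is picked by hand rather than sampled from the input:

```shell
export LC_ALL=C   # byte-wise ordering, so results are reproducible
work_dir=$(mktemp -d)
printf '%s\n' pear fig apple kiwi banana cherry > "$work_dir/input.txt"

# Hash-style partitioning: the target "reducer" is unrelated to sort order.
# Line length modulo 2 is a toy stand-in for a real hash function.
awk -v dir="$work_dir" '{ print > (dir "/hash-" (length($0) % 2)) }' "$work_dir/input.txt"

# Range partitioning: keys below the split point "g" go to reducer 0,
# everything else goes to reducer 1.
awk -v dir="$work_dir" '{ print > (dir "/range-" (($0 < "g") ? 0 : 1)) }' "$work_dir/input.txt"

# Each "reducer" sorts its own partition.
for part in "$work_dir"/hash-* "$work_dir"/range-*; do
  sort -o "$part" "$part"
done

# sort -c exits non-zero if its input is not already sorted.
concatenation_is_sorted() {
  cat "$work_dir/$1-0" "$work_dir/$1-1" | sort -c 2>/dev/null
}

concatenation_is_sorted hash  || echo "hash partitions: concatenation is NOT sorted"
concatenation_is_sorted range && echo "range partitions: concatenation is sorted"
```

Every partition is sorted in both cases; only the range-partitioned files also have non-overlapping key ranges, which is the property total ordering provides.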

More details

For details on how to download and run the hadoop-utils sort take a look at the CLI guide in the GitHub project page.

Lexicographically sorting large files in Linux

2012-09-01T01:14:00+00:00

When I hear the word “sort” my first thought is usually “Hadoop”! Yes, sorting is one thing that Hadoop does well, but if you’re working with large files in Linux the built-in sort command is often all you need.

Let’s say you have a large file on a host with 2GB or more of main memory free. The following sort command is an efficient way to lexicographically order large files.

LC_COLLATE=C sort --buffer-size=1G --temporary-directory=./tmp --unique bigfile.txt

Let’s break this command down and examine each part in detail.
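
As a small-scale sketch of the same invocation, the following runs on a handful of made-up lines with a much smaller buffer. It assumes GNU sort (the long option names are GNU spellings), and it sets LC_ALL rather than LC_COLLATE so that the result doesn’t depend on whatever LC_ALL is already set to in your environment:

```shell
# Hypothetical scratch area standing in for bigfile.txt and ./tmp.
work_dir=$(mktemp -d)
mkdir -p "$work_dir/tmp"   # sort does not create the temporary directory for you
printf '%s\n' banana Apple cherry banana apple > "$work_dir/bigfile.txt"

sort_unique_bytewise() {
  # LC_ALL=C                compare raw bytes (LC_ALL overrides LC_COLLATE when both are set)
  # --buffer-size           how much main memory sort may use before spilling to disk
  # --temporary-directory   where those spill files are written
  # --unique                emit only one line from each run of equal lines
  LC_ALL=C sort --buffer-size=1M --temporary-directory="$work_dir/tmp" \
    --unique "$work_dir/bigfile.txt"
}

sort_unique_bytewise
```

In the C locale uppercase letters sort before lowercase ones, so Apple comes out ahead of apple.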

Slurper v2

2012-08-20T03:14:00+00:00

The current HDFS Slurper was created as part of writing “Hadoop in Practice”, and it just so happened that it also fulfilled a need that we had at work. The one-sentence description of the Slurper is that it’s a utility that copies files between Hadoop file systems. It’s particularly useful in situations where you want to automate moving files from local disk to HDFS, and vice-versa.

While it has worked well for us, with the addition of a few choice features it could be even more useful:

Filter and projection, to remove or reduce data from input files
Write to multiple output files from a single input file
Keep source files intact

As such I have come up with a high-level architecture for what v2 may look like (subject to change of course).

Bare-metal installation for Nginx and Jekyll

2012-08-17T00:56:00+00:00

This blog is a bunch of Jekyll-created HTML which is served by the Nginx HTTP server. This post documents the process of getting Jekyll and Nginx set up from bare metal. It also shows a script that periodically pulls and generates your blog from GitHub sources. The instructions that follow should work for RedHat 6 and derivatives (such as CentOS 6 which is what I’m using).

Create a user and setup ssh

With a new VM you’ll typically be given root access, but security 101 dictates that you avoid running commands as the root user as much as possible. Therefore the first thing you’ll want to do is to create a user, in this case bloguser:

shell$ useradd bloguser

Next, change the password for the user:

shell$ passwd bloguser

Now you’ll want to create an SSH public/private key pair for your user. It’s recommended that you do this on your own machine, not your VM, since you don’t want your private key out there if it can be avoided.

shell$ ssh-keygen -t rsa

This will generate the following files on your local host:

.ssh/id_rsa
.ssh/id_rsa.pub

Once these files are generated, create the .ssh directory on your VM (these steps assume you’re logged-in as root):

shell$ su - bloguser
shell$ mkdir .ssh

Create .ssh/authorized_keys on your VM, and copy the contents of .ssh/id_rsa.pub from your local host:

shell$ vi .ssh/authorized_keys

Setup the permissions on the directory and file.

shell$ chmod 700 .ssh
shell$ chmod 600 .ssh/authorized_keys
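
If you’d rather not paste the key in with vi, the same steps can be scripted. This is a sketch under stated assumptions: a scratch directory plays the role of bloguser’s home directory, and the key string is a placeholder for the contents of your local .ssh/id_rsa.pub:

```shell
# Hypothetical stand-ins for bloguser's home directory and public key.
home_dir=$(mktemp -d)
pubkey='ssh-rsa AAAA...placeholder... you@local-host'

mkdir -p "$home_dir/.ssh"
# Append rather than overwrite, so any existing keys are preserved.
printf '%s\n' "$pubkey" >> "$home_dir/.ssh/authorized_keys"

# Only the owner may access the directory and the key file.
chmod 700 "$home_dir/.ssh"
chmod 600 "$home_dir/.ssh/authorized_keys"

ls -ld "$home_dir/.ssh" "$home_dir/.ssh/authorized_keys"
```

The listing should show drwx------ for the directory and -rw------- for the file.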

Test out your ssh setup, by ssh-ing from your local host to your VM as the bloguser user:

shell$ ssh bloguser@<vm-host>

As root, allow the bloguser user to perform commands as root (if the /etc/sudoers file doesn’t exist, then you will need to install sudo with the yum install sudo command).

shell$ vi /etc/sudoers

Add the following line:

%bloguser       ALL=(ALL)       ALL

Setup some basic security

Next up is tightening-up the SSH configuration.

shell$ sudo vi /etc/ssh/sshd_config

Inside this file you will do three things:

Change the port from 22 to some other number (such as 52846 in the example below).
Disable password authentication, so that a private key must be used to login to the server.
Block the root user from ssh access to your host.

The file therefore needs to contain the following lines (make sure all other entries with these names are commented-out).

Port 52846
PasswordAuthentication no
PermitRootLogin no
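
Before restarting, it’s worth double-checking which entries are actually active. The sketch below runs against a made-up sample file rather than the real /etc/ssh/sshd_config, and the helper name is my own:

```shell
# Hypothetical sample standing in for /etc/ssh/sshd_config.
config_file=$(mktemp)
cat > "$config_file" <<'EOF'
#Port 22
Port 52846
#PasswordAuthentication yes
PasswordAuthentication no
PermitRootLogin no
EOF

# Print only the active (uncommented) lines for the three settings.
active_settings() {
  grep -E '^[[:space:]]*(Port|PasswordAuthentication|PermitRootLogin)[[:space:]]' "$1"
}

active_settings "$config_file"
```

Each setting should appear exactly once. On the real host, running sshd with the -t option as root asks it to validate the configuration file without starting the daemon.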

Restart the ssh daemon to pick up the changes you just made.

shell$ sudo /sbin/service sshd restart

The next step is to setup a firewall to restrict incoming traffic to just ssh and HTTP. To do this create a file called vm-iptables.sh with the following content. You’ll be executing the following commands as root.


#!/bin/bash

# Flush all current rules from iptables
iptables -F

# Allow SSH and HTTP connections
iptables -A INPUT -p tcp --dport 52846 -j ACCEPT
iptables -A INPUT -p tcp --dport 80 -j ACCEPT

# Drop traffic on all other inbound ports
iptables -P INPUT DROP
iptables -P FORWARD DROP

# Allow all outbound traffic
iptables -P OUTPUT ACCEPT

# Accept all traffic on the loopback interface
iptables -A INPUT -i lo -j ACCEPT

# Accept packets belonging to established and related connections
iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT

# Save the iptables
/sbin/service iptables save

# List
iptables -L -v

After you’ve created the file, make it executable and run it to apply and save your rules.


shell$ chmod +x ./vm-iptables.sh
shell$ sudo ./vm-iptables.sh
iptables: Saving firewall rules to /etc/sysconfig/iptables:[  OK  ]
Chain INPUT (policy DROP 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination
    2   104 ACCEPT     tcp  --  any    any     anywhere             anywhere            tcp dpt:ssh
    0     0 ACCEPT     tcp  --  any    any     anywhere             anywhere            tcp dpt:http
    0     0 ACCEPT     all  --  lo     any     anywhere             anywhere
    0     0 ACCEPT     all  --  any    any     anywhere             anywhere            state RELATED,ESTABLISHED

Chain FORWARD (policy DROP 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination

Chain OUTPUT (policy ACCEPT 2 packets, 264 bytes)
 pkts bytes target     prot opt in     out     source               destination

The output shows your new iptables configuration, which reflects the rules we saved in vm-iptables.sh.

Install and start Nginx

Add the EPEL yum repository into your configuration:

shell$ sudo rpm -Uvh http://dl.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-7.noarch.rpm

Install Nginx using yum:

shell$ sudo yum install nginx

Setup Nginx so that it auto-starts at system start time:

shell$ sudo chkconfig nginx on

Start Nginx:

shell$ sudo /sbin/service nginx start

You can test that Nginx is up and running by pointing your browser at your VM IP address - you should see a page confirming that all is good.

Install Jekyll

The following commands will install Jekyll on your VM:

shell$ sudo yum install gcc rubygems ruby-devel
shell$ sudo gem install jekyll

Install Pygments (for code syntax highlighting)

shell$ sudo yum install python-setuptools
shell$ sudo easy_install Pygments

Create a crontab entry and script to generate the blog

We’re going to setup Jekyll to write to the Nginx HTML directory, and since we’re going to do this as the bloguser user, we’ll first need to wipe-out the contents of that directory, and chown it so that the bloguser can write to it:

shell$ sudo rm -rf /usr/share/nginx/html/*
shell$ sudo chown bloguser:bloguser /usr/share/nginx/html

We’ll assume that you have a GitHub repository that’s hosting your Jekyll sources. Therefore you need to install git.

shell$ sudo yum install git

Create a directory to contain your blog source

shell$ sudo mkdir -p /app/blog
shell$ sudo chown bloguser:bloguser /app/blog

The script will send out an email if an error is encountered, so you need to install mail:

shell$ sudo yum install mailx

Next on our list is creating a script which will do the following:

Pulls the latest blog sources from GitHub.
Uses Jekyll to generate the HTML for the blog.
Sends an email if Jekyll exits with an error, or if the home page can’t be retrieved

Create a shell script in /app/blog/gen.sh:

shell$ vi /app/blog/gen.sh

Copy the following content into this file. It clones your GitHub repo the first time it runs, and on later runs updates the local copy via the pull command:

#!/bin/bash

send_email_and_exit() {
  recipient=$1
  message=$2

  echo "Sending email and exiting due to error"

  /bin/mail -s "Blog generation failure" "${recipient}" << EOF
${message}
EOF

  exit 1
}

echo "Running at $(date)"

basedir=/app/blog
gitdir=${basedir}/blog
nginxdir=/usr/share/nginx/html
githubrepo=https://github.com/alexholmes/blog.git
emailto="grep.alex@gmail.com"

if [ ! -d ${gitdir} ]; then
  echo "Checking out repo for the first time"
  mkdir -p ${gitdir}
  cd ${basedir}
  git clone ${githubrepo}
else
  cd ${gitdir}
  git pull
fi

cd ${gitdir}

rm -rf ${nginxdir}/*
jekyll --no-auto . ${nginxdir}/

exitCode=$?

if [ ${exitCode} != "0" ]; then
  send_email_and_exit "${emailto}" "Jekyll failed with exit code ${exitCode}"
fi

# --fail makes curl exit non-zero on HTTP error responses (such as a 404)
curl --fail http://0.0.0.0:80/ >/dev/null 2>&1

exitCode=$?

if [ ${exitCode} != "0" ]; then
  send_email_and_exit "${emailto}" "Curl failed with exit code ${exitCode}"
fi

Make the file executable:

shell$ chmod +x /app/blog/gen.sh

Now all you need is a crontab entry to refresh your blog every 5 minutes:

shell$ crontab -e
*/5 * * * * /app/blog/gen.sh &>> /app/blog/gen.out

To check your crontab settings use the -l option:

shell$ crontab -l
*/5 * * * * /app/blog/gen.sh &>> /app/blog/gen.out

Now you can either wait for up to 5 minutes for the cron to execute the script, or simply run it yourself:

shell$ /app/blog/gen.sh

Now when you refresh your browser you’ll see your Jekyll-generated website!

OSX, Chrome and DNS

2012-08-13T02:06:00+00:00

First post! Welcome to “Hadoop Hamburgers”, where I plan to write some posts about Hadoop and other topics that seem interesting. My first one is not related to Hadoop, but instead related to DNS, a subject near and dear to the heart of my employer, Verisign. Everything in getting this site set up went fairly smoothly, including updating my registrar’s DNS records to point my domain name at my hosting provider. Being an impatient sort, I didn’t want to have to wait for the TTL on my domain name to expire, so I ran a dig request to see if my registrar had pushed through the change:

shell$ dig grepalex.com

;; ANSWER SECTION:
grepalex.com.		3600	IN	A	66.216.100.140

Indeed they had! Next up was trying to hit my website from my browser. When I did that however, Chrome was showing my registrar’s advertising content. A few pokes around led me to Chrome’s web page which lets you invalidate its DNS cache:

chrome://net-internals/#dns

However even after invalidating Chrome’s cache it still showed the content from the registrar. The cool thing about Chrome’s internal page is that it actually shows you the cached IP address, which indeed was still the old value. Clearly the OSX DNS client was performing some additional caching. After some more digging around I found the (Mountain) Lion-specific command which did indeed successfully clean OSX’s cache:

shell$ sudo killall -HUP mDNSResponder

Hurray!