<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">

 <title>Alex Holmes</title>
 <link href="http://grepalex.com/atom.xml" rel="self"/>
 <link href="http://grepalex.com/"/>
 <updated>2026-05-19T13:00:02+00:00</updated>
 <id>http://grepalex.com/</id>
 <author>
   <name>Alex Holmes</name>
 </author>

 
 <entry>
   <title>Configuring memory for MapReduce running on YARN</title>
   
   
   <category term="hadoop" />
   
   <link href="http://grepalex.com/2016/12/07/mapreduce-yarn-memory"/>
   <updated>2016-12-07T14:20:00+00:00</updated>
   <id>http://grepalex.com/2016/12/07/mapreduce-yarn-memory</id>
   <content type="html">&lt;p&gt;The most common issue that I bump into these days when running MapReduce jobs is the following error:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Application application_1409135750325_48141 failed 2 times due to AM Container for
appattempt_1409135750325_48141_000002 exited with exitCode: 143 due to: Container
[pid=4733,containerID=container_1409135750325_48141_02_000001] is running beyond physical memory limits.
Current usage: 2.0 GB of 2 GB physical memory used; 6.0 GB of 4.2 GB virtual memory used. Killing container.&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Reading that message it&amp;#8217;s pretty clear that your job has exceeded its memory limits, but how do you go about fixing this?&lt;/p&gt;

&lt;h1 id='make_sure_your_job_has_to_cache_data'&gt;Make sure your job really needs to cache data&lt;/h1&gt;

&lt;p&gt;Before we start tinkering with configuration settings, take a moment to think about what your job is doing. Your map or reduce task running out of memory usually means that data is being cached in your map or reduce tasks. Data can be cached for a number of reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your job is writing out Parquet data, and Parquet buffers data in memory prior to writing it out to disk&lt;/li&gt;

&lt;li&gt;Your code (or a library you&amp;#8217;re using) is caching data. An example here is joining two datasets together where one dataset is being cached prior to joining it with the other.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Therefore the first step I&amp;#8217;d suggest you take is to think about whether you really need to cache data, and whether it&amp;#8217;s possible to reduce your memory utilization without too much work. If it is, consider doing that before bumping up the memory for your job.&lt;/p&gt;

&lt;h1 id='how_yarn_monitors_the_memory_of_your_container'&gt;How YARN monitors the memory of your container&lt;/h1&gt;

&lt;p&gt;This section isn&amp;#8217;t specific to MapReduce; it&amp;#8217;s an overview of how YARN generally monitors memory for running containers (in MapReduce a container hosts the ApplicationMaster, a map task or a reduce task).&lt;/p&gt;

&lt;p&gt;Each slave node in your YARN cluster runs a &lt;em&gt;NodeManager&lt;/em&gt; daemon, and one of the &lt;em&gt;NodeManager&lt;/em&gt;&amp;#8217;s roles is to monitor the YARN containers running on the node. One part of this work is monitoring the memory utilization of each container.&lt;/p&gt;

&lt;p&gt;To do this the &lt;em&gt;NodeManager&lt;/em&gt; periodically cycles through all the currently running containers (every 3 seconds by default, which can be changed via &lt;code&gt;yarn.nodemanager.container-monitor.interval-ms&lt;/code&gt;). For each container it calculates the process tree (the container process and all of its child processes), and for each process in the tree it examines the &lt;code&gt;/proc/&amp;lt;PID&amp;gt;/stat&lt;/code&gt; file (where PID is the ID of that process) and extracts the physical memory (aka RSS) and the virtual memory (aka VSZ or VSIZE).&lt;/p&gt;
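&lt;p&gt;To make the last step more concrete, here is a minimal sketch that pulls the two memory fields out of a single &lt;code&gt;/proc/&amp;lt;PID&amp;gt;/stat&lt;/code&gt; line. It is not the NodeManager code: the class name &lt;code&gt;ProcStatSketch&lt;/code&gt;, the sample line and the 4096-byte page size are all assumptions made for the example. It relies on the Linux convention that the virtual size is field 23 (in bytes) and the RSS is field 24 (in pages).&lt;/p&gt;

```java
final class ProcStatSketch {
    // Assumed page size in bytes; on a real host look it up (for example with getconf PAGESIZE).
    static final long PAGE_SIZE_BYTES = 4096L;

    // Returns the fields that follow the "(comm)" entry, which may itself contain spaces.
    static String[] fieldsAfterComm(String statLine) {
        return statLine.substring(statLine.lastIndexOf(')') + 2).split(" ");
    }

    // Field 23 of the stat line (index 20 after the comm field): virtual memory size in bytes.
    static long vsizeBytes(String statLine) {
        return Long.parseLong(fieldsAfterComm(statLine)[20]);
    }

    // Field 24 (index 21 after the comm field): resident set size in pages, converted to bytes.
    static long rssBytes(String statLine) {
        return Long.parseLong(fieldsAfterComm(statLine)[21]) * PAGE_SIZE_BYTES;
    }

    public static void main(String[] args) {
        // Made-up stat line for a process holding 6 GB of virtual and 2 GB of physical memory.
        String sample = "4733 (java) S 1 4733 4733 0 -1 4202496 100 0 0 0 50 20 0 0 20 0 30 0 12345 6442450944 524288";
        System.out.println("vsize bytes: " + vsizeBytes(sample));
        System.out.println("rss bytes:   " + rssBytes(sample));
    }
}
```

&lt;p&gt;Summing these two values over every process in a container&amp;#8217;s process tree gives the totals that are compared against the container limits.&lt;/p&gt;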

&lt;p&gt;If virtual memory checking is enabled (true by default, overridden via &lt;code&gt;yarn.nodemanager.vmem-check-enabled&lt;/code&gt;), then YARN compares the summed VSIZE extracted from the container process (and all child processes) with the maximum allowed virtual memory for the container. The maximum allowed virtual memory is basically the configured maximum physical memory for the container multiplied by &lt;code&gt;yarn.nodemanager.vmem-pmem-ratio&lt;/code&gt; (default is 2.1). So if your YARN container is configured to have a maximum of 2 GB of physical memory, then this number is multiplied by 2.1 which means you are allowed to use 4.2 GB of virtual memory.&lt;/p&gt;

&lt;p&gt;If physical memory checking is enabled (true by default, overridden via &lt;code&gt;yarn.nodemanager.pmem-check-enabled&lt;/code&gt;), then YARN compares the summed RSS extracted from the container process (and all child processes) with the maximum allowed physical memory for the container.&lt;/p&gt;

&lt;p&gt;If either the virtual or physical utilization is higher than the maximum permitted, YARN will kill the container, as shown at the top of this article.&lt;/p&gt;
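&lt;p&gt;The two checks can be summarised with a small sketch. This is a simplification written for this article, not YARN&amp;#8217;s implementation (the real monitor has more nuance), and the class and method names are hypothetical. It uses the numbers from the error at the top of the article: a 2 GB physical limit, the default ratio of 2.1, and 6 GB of virtual memory in use.&lt;/p&gt;

```java
final class ContainerMemoryCheck {
    // Virtual memory ceiling: the physical limit multiplied by yarn.nodemanager.vmem-pmem-ratio.
    static long vmemLimitBytes(long pmemLimitBytes, double vmemPmemRatio) {
        return (long) (pmemLimitBytes * vmemPmemRatio);
    }

    // A container is over its limit when the summed usage of its process tree exceeds the limit.
    static boolean overLimit(long usedBytes, long limitBytes) {
        return usedBytes > limitBytes;
    }

    public static void main(String[] args) {
        long gb = 1024L * 1024L * 1024L;
        long pmemLimit = 2 * gb;                          // container physical memory limit
        long vmemLimit = vmemLimitBytes(pmemLimit, 2.1);  // roughly 4.2 GB

        System.out.println("virtual limit bytes: " + vmemLimit);
        System.out.println("6 GB virtual over limit: " + overLimit(6 * gb, vmemLimit));
        System.out.println("1 GB physical over limit: " + overLimit(1 * gb, pmemLimit));
    }
}
```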

&lt;h1 id='increasing_the_memory_availble_to_your_mapreduce_job'&gt;Increasing the memory available to your MapReduce job&lt;/h1&gt;

&lt;p&gt;Back in the days when MapReduce didn&amp;#8217;t run on YARN, memory configuration was pretty simple, but these days MapReduce runs as a YARN application and things are a little more involved. For MapReduce running on YARN there are actually two memory settings you have to configure at the same time:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The physical memory for your YARN map and reduce processes&lt;/li&gt;

&lt;li&gt;The JVM heap size for your map and reduce processes&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id='physical_memory_for_your_yarn_map_and_reduce_processes'&gt;Physical memory for your YARN map and reduce processes&lt;/h2&gt;

&lt;p&gt;Configure &lt;code&gt;mapreduce.map.memory.mb&lt;/code&gt; and &lt;code&gt;mapreduce.reduce.memory.mb&lt;/code&gt; to set the YARN container physical memory limits for your map and reduce processes respectively. For example, if you want to limit your map process to 2GB and your reduce process to 4GB, and you want that to be the default in your cluster, then you&amp;#8217;d set the following in &lt;code&gt;mapred-site.xml&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;property&amp;gt;
  &amp;lt;name&amp;gt;mapreduce.map.memory.mb&amp;lt;/name&amp;gt;
  &amp;lt;value&amp;gt;2048&amp;lt;/value&amp;gt;
&amp;lt;/property&amp;gt;
&amp;lt;property&amp;gt;
  &amp;lt;name&amp;gt;mapreduce.reduce.memory.mb&amp;lt;/name&amp;gt;
  &amp;lt;value&amp;gt;4096&amp;lt;/value&amp;gt;
&amp;lt;/property&amp;gt;&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The physical memory configured for your job must fall within the minimum and maximum memory allowed for containers in your cluster (check the &lt;code&gt;yarn.scheduler.minimum-allocation-mb&lt;/code&gt; and &lt;code&gt;yarn.scheduler.maximum-allocation-mb&lt;/code&gt; properties respectively).&lt;/p&gt;

&lt;h2 id='jvm_heap_size_for_your_map_and_reduce_processes'&gt;JVM heap size for your map and reduce processes&lt;/h2&gt;

&lt;p&gt;Next you need to configure the JVM heap size for your map and reduce processes. These sizes need to be less than the physical memory you configured in the previous section, because the JVM also uses memory outside the heap (thread stacks, class metadata, native buffers). As a general rule they should be about 80% of the YARN physical memory settings.&lt;/p&gt;

&lt;p&gt;Configure &lt;code&gt;mapreduce.map.java.opts&lt;/code&gt; and &lt;code&gt;mapreduce.reduce.java.opts&lt;/code&gt; to set the map and reduce heap sizes respectively. To continue the example from the previous section, we&amp;#8217;ll take the 2GB and 4GB physical memory limits and multiply by 0.8 to arrive at our Java heap sizes. So we&amp;#8217;d end up with the following in &lt;code&gt;mapred-site.xml&lt;/code&gt; (assuming you wanted these to be the defaults for your cluster):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;property&amp;gt;
  &amp;lt;name&amp;gt;mapreduce.map.java.opts&amp;lt;/name&amp;gt;
  &amp;lt;value&amp;gt;-Xmx1638m&amp;lt;/value&amp;gt;
&amp;lt;/property&amp;gt;
&amp;lt;property&amp;gt;
  &amp;lt;name&amp;gt;mapreduce.reduce.java.opts&amp;lt;/name&amp;gt;
  &amp;lt;value&amp;gt;-Xmx3276m&amp;lt;/value&amp;gt;
&amp;lt;/property&amp;gt;&lt;/code&gt;&lt;/pre&gt;
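&lt;p&gt;If you would rather derive the heap flag than work it out by hand, the arithmetic is easy to script. The sketch below assumes the 80% rule of thumb described in this section and rounds down to whole megabytes; the class and method names are hypothetical and are not part of Hadoop.&lt;/p&gt;

```java
final class HeapSizer {
    // 80% of the container size, rounded down to a whole number of megabytes.
    static int heapMb(int containerMb) {
        return containerMb * 8 / 10;
    }

    // Builds the -Xmx flag to place in mapreduce.map.java.opts or mapreduce.reduce.java.opts.
    static String javaOpts(int containerMb) {
        return "-Xmx" + heapMb(containerMb) + "m";
    }

    public static void main(String[] args) {
        System.out.println("map:    " + javaOpts(2048));
        System.out.println("reduce: " + javaOpts(4096));
    }
}
```

&lt;p&gt;With integer arithmetic a 2048 MB container yields a 1638 MB heap and a 4096 MB container yields a 3276 MB heap.&lt;/p&gt;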

&lt;h2 id='configuring_settings_for_your_job'&gt;Configuring settings for your job&lt;/h2&gt;

&lt;p&gt;The same configuration properties that I&amp;#8217;ve described above apply if you want to individually configure your MapReduce jobs and override the cluster defaults. Again you&amp;#8217;ll want to set values for these properties for your job:&lt;/p&gt;
&lt;table class='table table-striped'&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Property&lt;/th&gt;&lt;th&gt;Description&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style='text-align: left;'&gt;mapreduce.map.memory.mb&lt;/td&gt;&lt;td style='text-align: left;'&gt;The amount of physical memory that your YARN map process can use.&lt;/td&gt;
&lt;/tr&gt;&lt;tr&gt;&lt;td style='text-align: left;'&gt;mapreduce.reduce.memory.mb&lt;/td&gt;&lt;td style='text-align: left;'&gt;The amount of physical memory that your YARN reduce process can use.&lt;/td&gt;
&lt;/tr&gt;&lt;tr&gt;&lt;td style='text-align: left;'&gt;mapreduce.map.java.opts&lt;/td&gt;&lt;td style='text-align: left;'&gt;Used to configure the heap size for the map JVM process. Should be 80% of &lt;code&gt;mapreduce.map.memory.mb&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;&lt;tr&gt;&lt;td style='text-align: left;'&gt;mapreduce.reduce.java.opts&lt;/td&gt;&lt;td style='text-align: left;'&gt;Used to configure the heap size for the reduce JVM process. Should be 80% of &lt;code&gt;mapreduce.reduce.memory.mb&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;</content>
 </entry>
 
 <entry>
   <title>Big data anti-patterns presentation</title>
   
   
   <category term="hadoop" />
   
   
   <category term="cassandra" />
   
   
   <category term="kafka" />
   
   
   <category term="storm" />
   
   <link href="http://grepalex.com/2015/10/28/big-data-antipatterns-javaone"/>
   <updated>2015-10-28T00:20:00+00:00</updated>
   <id>http://grepalex.com/2015/10/28/big-data-antipatterns-javaone</id>
   <content type="html">&lt;p&gt;Today I presented on big data anti-patterns to an audience at JavaOne. It was live-streamed (no pressure Alex) and I&amp;#8217;m hoping the video will be publicly available shortly; if so I&amp;#8217;ll update this post with a link.&lt;/p&gt;

&lt;p&gt;The presentation covered seven anti-patterns ranging from fairly high-level ones (such as &amp;#8220;you don&amp;#8217;t have big data&amp;#8221;) to ones that were more in the weeds (approximate counting), and covering tools such as Hadoop, Cassandra and Kafka.&lt;/p&gt;

&lt;p&gt;Thanks to everyone who attended - I had a lot of fun presenting, and I&amp;#8217;m looking forward to giving more talks in the future.&lt;/p&gt;

&lt;p&gt;Here&amp;#8217;s a link to the slides of the talk: &lt;a href='http://www.slideshare.net/grepalex/avoiding-big-data-antipatterns'&gt;http://www.slideshare.net/grepalex/avoiding-big-data-antipatterns&lt;/a&gt;&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Understanding how Parquet integrates with Avro, Thrift and Protocol Buffers</title>
   
   
   <category term="hadoop" />
   
   <link href="http://grepalex.com/2014/05/13/parquet-file-format-and-object-model"/>
   <updated>2014-05-13T14:20:00+00:00</updated>
   <id>http://grepalex.com/2014/05/13/parquet-file-format-and-object-model</id>
   <content type="html">&lt;p&gt;&lt;a href='http://parquet.io/'&gt;Parquet&lt;/a&gt; is a new columnar storage format that came out of a collaboration between Twitter and Cloudera. Parquet&amp;#8217;s generating a lot of excitement in the community, and it&amp;#8217;s shaping up to be the next big thing for data storage in Hadoop for a number of reasons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;It&amp;#8217;s a sophisticated columnar file format, which means that it&amp;#8217;s well-suited to OLAP workloads, or really any workload where projection is a normal part of working with the data.&lt;/li&gt;

&lt;li&gt;It has a high level of integration with Hadoop and the ecosystem - you can work with Parquet in MapReduce, Pig, Hive and Impala.&lt;/li&gt;

&lt;li&gt;It supports Avro, Thrift and Protocol Buffers.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The last item raises a question - how does Parquet work with Avro and friends? To understand this you&amp;#8217;ll need to understand three concepts:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;em&gt;Storage formats&lt;/em&gt;, which are binary representations of data. For Parquet this is contained within the &lt;a href='https://github.com/Parquet/parquet-format'&gt;parquet-format&lt;/a&gt; GitHub project.&lt;/li&gt;

&lt;li&gt;&lt;em&gt;Object model converters&lt;/em&gt;, whose job it is to map between an external object model and Parquet&amp;#8217;s internal data types. These converters exist in the &lt;a href='https://github.com/Parquet/parquet-mr'&gt;parquet-mr&lt;/a&gt; GitHub project.&lt;/li&gt;

&lt;li&gt;&lt;em&gt;Object models&lt;/em&gt;, which are in-memory representations of data. &lt;a href='http://avro.apache.org/'&gt;Avro&lt;/a&gt;, &lt;a href='http://thrift.apache.org/'&gt;Thrift&lt;/a&gt;, &lt;a href='https://code.google.com/p/protobuf/'&gt;Protocol Buffers&lt;/a&gt;, &lt;a href='http://hive.apache.org/'&gt;Hive&lt;/a&gt; and &lt;a href='http://pig.apache.org/'&gt;Pig&lt;/a&gt; are all examples of object models. Parquet does actually supply an &lt;a href='https://github.com/Parquet/parquet-mr/tree/master/parquet-column/src/main/java/parquet/example'&gt;example object model&lt;/a&gt; (with &lt;a href='https://github.com/Parquet/parquet-mr/tree/master/parquet-hadoop/src/main/java/parquet/hadoop/example'&gt;MapReduce support&lt;/a&gt;), but the intention is that you&amp;#8217;d use one of the other richer object models such as Avro.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The figure below shows a visual representation of these concepts (&lt;a href='/images/parquet_storage_object_converter.png'&gt;view a larger image&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;&lt;img alt='Image of storage formats and object models' src='/images/parquet_storage_object_converter.png' /&gt;&lt;/p&gt;

&lt;p&gt;Avro, Thrift and Protocol Buffers all have their own storage formats, but Parquet doesn&amp;#8217;t utilize them in any way. Instead their objects are mapped to the Parquet data model. Parquet data is always serialized using its own file format. This is why Parquet can&amp;#8217;t read files serialized using Avro&amp;#8217;s storage format, and vice-versa.&lt;/p&gt;

&lt;p&gt;Let&amp;#8217;s examine what happens when you write an Avro object to Parquet:&lt;/p&gt;

&lt;p&gt;&lt;img alt='Avro/Parquet write path' src='/images/parquet_avro_write.png' /&gt;&lt;/p&gt;

&lt;p&gt;The Avro converter stores the schema for the objects being written within the Parquet file&amp;#8217;s metadata. You can see this by using the Parquet CLI to dump out the metadata contained within a Parquet file.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ export HADOOP_CLASSPATH=parquet-avro-1.4.3.jar:parquet-column-1.4.3.jar:parquet-common-1.4.3.jar:parquet-encoding-1.4.3.jar:parquet-format-2.0.0.jar:parquet-generator-1.4.3.jar:parquet-hadoop-1.4.3.jar:parquet-hive-bundle-1.4.3.jar:parquet-jackson-1.4.3.jar:parquet-tools-1.4.3.jar

$ hadoop parquet.tools.Main meta stocks.parquet
creator:     parquet-mr (build 3f25ad97f209e7653e9f816508252f850abd635f)
extra:       avro.schema = {&amp;quot;type&amp;quot;:&amp;quot;record&amp;quot;,&amp;quot;name&amp;quot;:&amp;quot;Stock&amp;quot;,&amp;quot;namespace&amp;quot; [more]...

file schema: hip.ch5.avro.gen.Stock
--------------------------------------------------------------------------------
symbol:      REQUIRED BINARY O:UTF8 R:0 D:0
date:        REQUIRED BINARY O:UTF8 R:0 D:0
open:        REQUIRED DOUBLE R:0 D:0
high:        REQUIRED DOUBLE R:0 D:0
low:         REQUIRED DOUBLE R:0 D:0
close:       REQUIRED DOUBLE R:0 D:0
volume:      REQUIRED INT32 R:0 D:0
adjClose:    REQUIRED DOUBLE R:0 D:0

row group 1: RC:45 TS:2376
--------------------------------------------------------------------------------
symbol:       BINARY UNCOMPRESSED DO:0 FPO:4 SZ:84/84/1.00 VC:45 ENC:B [more]...
date:         BINARY UNCOMPRESSED DO:0 FPO:88 SZ:198/198/1.00 VC:45 EN [more]...
open:         DOUBLE UNCOMPRESSED DO:0 FPO:286 SZ:379/379/1.00 VC:45 E [more]...
high:         DOUBLE UNCOMPRESSED DO:0 FPO:665 SZ:379/379/1.00 VC:45 E [more]...
low:          DOUBLE UNCOMPRESSED DO:0 FPO:1044 SZ:379/379/1.00 VC:45  [more]...
close:        DOUBLE UNCOMPRESSED DO:0 FPO:1423 SZ:379/379/1.00 VC:45  [more]...
volume:       INT32 UNCOMPRESSED DO:0 FPO:1802 SZ:199/199/1.00 VC:45 E [more]...
adjClose:     DOUBLE UNCOMPRESSED DO:0 FPO:2001 SZ:379/379/1.00 VC:45  [more]...&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The &amp;#8220;avro.schema&amp;#8221; entry is where the Avro schema information is stored. This allows the Avro Parquet reader to marshal Avro objects without the client having to supply the schema.&lt;/p&gt;

&lt;p&gt;You can also use the &amp;#8220;schema&amp;#8221; command to view the Parquet schema.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ hadoop parquet.tools.Main schema stocks.parquet
message hip.ch4.avro.gen.Stock {
  required binary symbol (UTF8);
  required binary date (UTF8);
  required double open;
  required double high;
  required double low;
  required double close;
  required int32 volume;
  required double adjClose;
}&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This tool is useful when loading a Parquet file into Hive, as you&amp;#8217;ll need to use the field names defined in the Parquet schema when defining the Hive table (note that the syntax below only works with Hive 0.13 and newer).&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;hive&amp;gt; CREATE EXTERNAL TABLE parquet_stocks(
    symbol string,
    date string,
    open double,
    high double,
    low double,
    close double,
    volume int,
    adjClose double
) STORED AS PARQUET
LOCATION &amp;#39;...&amp;#39;;&lt;/code&gt;&lt;/pre&gt;</content>
 </entry>
 
 <entry>
   <title>Using Oozie 4.0.0 with Hadoop 2.2</title>
   
   
   <category term="hadoop" />
   
   <link href="http://grepalex.com/2014/02/16/oozie-and-hadoop-2.2"/>
   <updated>2014-02-16T14:20:00+00:00</updated>
   <id>http://grepalex.com/2014/02/16/oozie-and-hadoop-2.2</id>
   <content type="html">&lt;p&gt;The current version of Oozie (4.0.0) doesn&amp;#8217;t build correctly when you try and target Hadoop 2.2. The Oozie team have a fix going into release 4.0.1 (see &lt;a href='https://issues.apache.org/jira/browse/OOZIE-1551'&gt;OOZIE-1551&lt;/a&gt;), but until then you can hack the Maven files to get it working with 4.0.0.&lt;/p&gt;

&lt;p&gt;First download the 4.0.0 version from &lt;a href='https://oozie.apache.org/'&gt;https://oozie.apache.org/&lt;/a&gt;, and then unpack it. Next run the following command to change the Hadoop version being targeted:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;cd oozie-4.0.0/
find . -name pom.xml | xargs sed -ri &amp;#39;s/2\.2\.0-SNAPSHOT/2.2.0/&amp;#39;&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now all you need to do is target the hadoop-2 profile in Maven and you&amp;#8217;ll be all set:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;mvn -DskipTests=true -P hadoop-2 clean package assembly:single&lt;/code&gt;&lt;/pre&gt;</content>
 </entry>
 
 <entry>
   <title>Hadoop in Practice, Second Edition</title>
   
   
   <category term="hadoop" />
   
   <link href="http://grepalex.com/2014/02/11/hadoop-in-practice-second-edition"/>
   <updated>2014-02-11T14:20:00+00:00</updated>
   <id>http://grepalex.com/2014/02/11/hadoop-in-practice-second-edition</id>
   <content type="html">&lt;p&gt;The first edition of my book went to press in November 2012, just over a year ago! It&amp;#8217;s not that long, but in Hadoop years it&amp;#8217;s a generation, and there have been many exciting developments in Hadoop and its ecosystem, especially YARN and its promise of a general-purpose, distributed platform that can support computing models beyond MapReduce.&lt;/p&gt;

&lt;p&gt;I&amp;#8217;m excited to announce that I&amp;#8217;ve started work on the second edition of the book, which will bring the existing coverage of the book up to date, and also add new chapters to cover items such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An overview of YARN and how it works&lt;/li&gt;

&lt;li&gt;How MapReduce 2 works as a YARN application&lt;/li&gt;

&lt;li&gt;Recipes for writing your own YARN applications&lt;/li&gt;

&lt;li&gt;Pulling data out of Kafka into HDFS&lt;/li&gt;

&lt;li&gt;Running Storm on YARN and using it to perform aggregations&lt;/li&gt;

&lt;li&gt;Using Spark for in-memory, iterative data processing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The book is currently in MEAP, which is Manning&amp;#8217;s early access program. The benefit of this program is that you get new content as it&amp;#8217;s being written, and at the end you&amp;#8217;ll get the full production-polished version of the book.&lt;/p&gt;
&lt;a href='http://www.manning.com/holmes2/'&gt;&lt;img alt='Hadoop in Practice, Second Edition' src='/images/holmes2_cover150.jpg' /&gt;&lt;/a&gt;
&lt;p&gt;I welcome any suggestions or ideas for how the book can be improved at &lt;a href='http://www.manning-sandbox.com/forum.jspa?forumID=901'&gt;the forum&lt;/a&gt;.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Using Hadoop 2.2 as a sink in Flume 1.4</title>
   
   
   <category term="hadoop" />
   
   <link href="http://grepalex.com/2014/02/09/flume-and-hadoop-2.2"/>
   <updated>2014-02-09T14:20:00+00:00</updated>
   <id>http://grepalex.com/2014/02/09/flume-and-hadoop-2.2</id>
   <content type="html">&lt;p&gt;Google really screwed the pooch with their protobuf 2.5 release. Code generated with protobuf 2.5 is binary incompatible with older protobuf libraries (I guess Google missed the &lt;a href='http://semver.org/'&gt;semantic versioning&lt;/a&gt; boat on this release). Unfortunately the current stable release of Flume 1.4 packages protobuf 2.4.1 and if you try and use HDFS on Hadoop 2.2 as a sink you&amp;#8217;ll be smacked with the following exception:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;java.lang.VerifyError: class org.apache.hadoop.security.proto.SecurityProtos$GetDelegationTokenRequestProto
overrides final method getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
    ...
    at org.apache.hadoop.ipc.ProtobufRpcEngine.getProxy(ProtobufRpcEngine.java:92)
    at org.apache.hadoop.ipc.RPC.getProtocolProxy(RPC.java:537)
    at org.apache.hadoop.hdfs.NameNodeProxies.createNNProxyWithClientProtocol(NameNodeProxies.java:328)
    at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:235)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Hadoop 2.2 uses protobuf 2.5 for its RPC, and Flume loads its older packaged version of protobuf ahead of Hadoop&amp;#8217;s, which causes this error. To fix this you&amp;#8217;ll need to move both protobuf and guava out of Flume&amp;#8217;s lib directory. The following command moves them into your home directory.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ mv ${flume_bin}/lib/{protobuf-java-2.4.1.jar,guava-10.0.1.jar} ~/&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now if you restart your Flume agent you&amp;#8217;ll be able to target HDFS as a sink with Hadoop 2.2. Great success!&lt;/p&gt;

&lt;p&gt;Flume&amp;#8217;s next release will &lt;a href='https://issues.apache.org/jira/browse/FLUME-2172'&gt;move to protobuf 2.5&lt;/a&gt; so this problem should magically disappear in due course.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Simplifying secondary sorting in MapReduce with htuple</title>
   
   
   <category term="hadoop" />
   
   <link href="http://grepalex.com/2013/10/07/secondary-sort-with-htuple"/>
   <updated>2013-10-07T14:20:00+00:00</updated>
   <id>http://grepalex.com/2013/10/07/secondary-sort-with-htuple</id>
   <content type="html">&lt;p&gt;I&amp;#8217;ve recently found myself immersed in writing a number of MapReduce jobs that all require secondary sort. Whilst I was nursing my cramping hands after writing what felt like the 100th custom Writable (and supporting partitioner/comparators), a thought occurred to me - &amp;#8220;surely there&amp;#8217;s a better way&amp;#8221;? As I started thinking about this some more, I realized that what I needed was a general-purpose mechanism that would allow me to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Work with compound elements&lt;/li&gt;

&lt;li&gt;Provide pre-built partitioners and comparators that would know how to work with these compound elements&lt;/li&gt;

&lt;li&gt;Model all of this in a way that is easy to read and understand&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is the inspiration behind &lt;a href='https://github.com/alexholmes/htuple'&gt;htuple&lt;/a&gt;, a small project that I just open-sourced.&lt;/p&gt;

&lt;h2 id='htuple'&gt;htuple&lt;/h2&gt;

&lt;p&gt;Let me give you an example of how you can use &lt;code&gt;htuple&lt;/code&gt; to perform secondary sorting. Imagine that you have a dataset which contains last and first names:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Smith	John
Smith	Anne
Smith	Ken&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;One example aggregation you may want to perform on this data is to count the number of distinct first names for each last name. A reasonable approach to implementing this in MapReduce would be to emit the last name as the mapper output key, the first name as the mapper output value, and in the reducer you&amp;#8217;d collect all the first names in a set and then count them. This would work fine when working with names, but what if your dataset had some keys with a large number of distinct values - large enough that you run into problems caching all the data in the reducer&amp;#8217;s memory?&lt;/p&gt;

&lt;p&gt;One solution here would be to use secondary sort - and in the example of our names, sort the first names so that the reducer wouldn&amp;#8217;t need to store them in a set (instead it can just increment a count as it&amp;#8217;s reading the first names). In this case you&amp;#8217;d probably end up writing a custom &lt;code&gt;Writable&lt;/code&gt; which would contain both the last name and first name, and you&amp;#8217;d also write a custom partitioner, and a sorting and grouping comparator. Phew, that&amp;#8217;s a lot of work just to get secondary sort working.&lt;/p&gt;
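&lt;p&gt;To see why sorted values remove the need for a set, consider this minimal sketch of the counting logic. It is plain Java rather than a real reducer, and the class name &lt;code&gt;DistinctCounter&lt;/code&gt; is made up for the example; the point is that when equal values arrive next to each other, remembering only the previous value is enough.&lt;/p&gt;

```java
final class DistinctCounter {
    // Counts distinct values in already-sorted input by comparing each value with its predecessor.
    // Memory use is constant: only the previous value is kept, never the full set.
    static int countDistinctSorted(String[] sortedValues) {
        int count = 0;
        String previous = null;
        for (String value : sortedValues) {
            if (!value.equals(previous)) {
                count++;
                previous = value;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // First names for the last name "Smith", in the order a secondary sort would deliver them.
        String[] firstNames = {"Anne", "Anne", "John", "Ken"};
        System.out.println("distinct first names: " + countDistinctSorted(firstNames));
    }
}
```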

&lt;p&gt;Let&amp;#8217;s examine how you&amp;#8217;d use &lt;code&gt;htuple&lt;/code&gt; to do this work. First of all, I&amp;#8217;d recommend defining an enum to create logical names for the elements you&amp;#8217;ll store in the tuple. In our case we need two elements for the names, so here goes:&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;&lt;code class='java'&gt;&lt;span class='cm'&gt;/**&lt;/span&gt;
&lt;span class='cm'&gt; * User-friendly names that we can use to refer to fields in the tuple.&lt;/span&gt;
&lt;span class='cm'&gt; */&lt;/span&gt;
&lt;span class='kd'&gt;enum&lt;/span&gt; &lt;span class='n'&gt;TupleFields&lt;/span&gt; &lt;span class='o'&gt;{&lt;/span&gt;
    &lt;span class='n'&gt;LAST_NAME&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt;
    &lt;span class='n'&gt;FIRST_NAME&lt;/span&gt;
&lt;span class='o'&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;The first concept we&amp;#8217;ll introduce in &lt;code&gt;htuple&lt;/code&gt; is the &lt;code&gt;Tuple&lt;/code&gt; class. This class is merely a container for reading and writing multiple elements, and will be the class that you&amp;#8217;ll use to emit keys from your mapper. There are three ways you can write data into this tuple - here we&amp;#8217;ll cover what I think is the most useful method, which is using the enum you just created. Let&amp;#8217;s see how this will work in our mapper.&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;&lt;code class='java'&gt;&lt;span class='kd'&gt;public&lt;/span&gt; &lt;span class='kd'&gt;static&lt;/span&gt; &lt;span class='kd'&gt;class&lt;/span&gt; &lt;span class='nc'&gt;Map&lt;/span&gt; &lt;span class='kd'&gt;extends&lt;/span&gt; &lt;span class='n'&gt;Mapper&lt;/span&gt;&lt;span class='o'&gt;&amp;lt;&lt;/span&gt;&lt;span class='n'&gt;LongWritable&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;Text&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;Tuple&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;Text&lt;/span&gt;&lt;span class='o'&gt;&amp;gt;&lt;/span&gt; &lt;span class='o'&gt;{&lt;/span&gt;

    &lt;span class='nd'&gt;@Override&lt;/span&gt;
    &lt;span class='kd'&gt;protected&lt;/span&gt; &lt;span class='kt'&gt;void&lt;/span&gt; &lt;span class='nf'&gt;map&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;LongWritable&lt;/span&gt; &lt;span class='n'&gt;key&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;Text&lt;/span&gt; &lt;span class='n'&gt;value&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;Context&lt;/span&gt; &lt;span class='n'&gt;context&lt;/span&gt;&lt;span class='o'&gt;)&lt;/span&gt;
            &lt;span class='kd'&gt;throws&lt;/span&gt; &lt;span class='n'&gt;IOException&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;InterruptedException&lt;/span&gt; &lt;span class='o'&gt;{&lt;/span&gt;

        &lt;span class='c1'&gt;// tokenize the line&lt;/span&gt;
        &lt;span class='n'&gt;String&lt;/span&gt; &lt;span class='n'&gt;nameParts&lt;/span&gt;&lt;span class='o'&gt;[]&lt;/span&gt; &lt;span class='o'&gt;=&lt;/span&gt; &lt;span class='n'&gt;value&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;toString&lt;/span&gt;&lt;span class='o'&gt;().&lt;/span&gt;&lt;span class='na'&gt;split&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='s'&gt;&amp;quot;\t&amp;quot;&lt;/span&gt;&lt;span class='o'&gt;);&lt;/span&gt;

        &lt;span class='c1'&gt;// create the tuple, setting the first and last names&lt;/span&gt;
        &lt;span class='n'&gt;Tuple&lt;/span&gt; &lt;span class='n'&gt;outputKey&lt;/span&gt; &lt;span class='o'&gt;=&lt;/span&gt; &lt;span class='k'&gt;new&lt;/span&gt; &lt;span class='n'&gt;Tuple&lt;/span&gt;&lt;span class='o'&gt;();&lt;/span&gt;
        &lt;span class='n'&gt;outputKey&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;set&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;TupleFields&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;LAST_NAME&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;nameParts&lt;/span&gt;&lt;span class='o'&gt;[&lt;/span&gt;&lt;span class='mi'&gt;0&lt;/span&gt;&lt;span class='o'&gt;]);&lt;/span&gt;
        &lt;span class='n'&gt;outputKey&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;set&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;TupleFields&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;FIRST_NAME&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;nameParts&lt;/span&gt;&lt;span class='o'&gt;[&lt;/span&gt;&lt;span class='mi'&gt;1&lt;/span&gt;&lt;span class='o'&gt;]);&lt;/span&gt;

        &lt;span class='c1'&gt;// emit the tuple and the original contents of the line&lt;/span&gt;
        &lt;span class='n'&gt;context&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;write&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;outputKey&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;value&lt;/span&gt;&lt;span class='o'&gt;);&lt;/span&gt;
    &lt;span class='o'&gt;}&lt;/span&gt;
&lt;span class='o'&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;The first thing you do in your mapper is split the input line, where the first token is the last name, and the second token is the first name. Next you create a new &lt;code&gt;Tuple&lt;/code&gt; object and set the last and first name. We&amp;#8217;re using the enum to logically refer to the fields. What&amp;#8217;s happening behind the scenes is that the &lt;code&gt;Tuple&lt;/code&gt; class is using the &lt;a href='http://docs.oracle.com/javase/7/docs/api/java/lang/Enum.html#ordinal()'&gt;ordinal value&lt;/a&gt; of the enum to determine the position in the ArrayList to set. So that means &lt;code&gt;LAST_NAME&lt;/code&gt;, which has an ordinal position of &lt;code&gt;0&lt;/code&gt;, will have its value set in index &lt;code&gt;0&lt;/code&gt; in the &lt;code&gt;Tuple&lt;/code&gt; class&amp;#8217;s underlying &lt;code&gt;ArrayList&lt;/code&gt;.&lt;/p&gt;
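
&lt;p&gt;As a rough illustration of that ordinal-based indexing, here is a minimal sketch. &lt;code&gt;OrdinalTupleSketch&lt;/code&gt; is a hypothetical stand-in written for illustration, not htuple&amp;#8217;s actual &lt;code&gt;Tuple&lt;/code&gt; source, and it uses a plain array where the real class uses an &lt;code&gt;ArrayList&lt;/code&gt;.&lt;/p&gt;

```java
import java.util.Arrays;

// Hypothetical stand-in for htuple's Tuple; an assumption, not the library's source.
class OrdinalTupleSketch {

    // Definition order matters: LAST_NAME has ordinal 0, FIRST_NAME has ordinal 1.
    enum TupleFields { LAST_NAME, FIRST_NAME }

    // A plain array stands in for the ArrayList that backs the real Tuple.
    private Object[] fields = new Object[0];

    // Store the value at the slot given by the enum's ordinal, growing the array if needed.
    OrdinalTupleSketch set(TupleFields field, Object value) {
        int index = field.ordinal();
        fields = Arrays.copyOf(fields, Math.max(fields.length, index + 1));
        fields[index] = value;
        return this;
    }

    // Positional read, which is how the shuffle-side code would address a field.
    Object get(int index) {
        return fields[index];
    }

    public static void main(String[] args) {
        OrdinalTupleSketch key = new OrdinalTupleSketch();
        key.set(TupleFields.LAST_NAME, "Smith");
        key.set(TupleFields.FIRST_NAME, "John");
        // prints "Smith John"
        System.out.println(key.get(0) + " " + key.get(1));
    }
}
```

&lt;p&gt;Because the slot comes from the enum&amp;#8217;s ordinal, reordering the enum constants would silently change where each field is stored.&lt;/p&gt;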

&lt;p&gt;Now that you&amp;#8217;ve emitted your Tuple in the mapper, you need to configure your job for secondary sort. This will then expose you to the second class in &lt;code&gt;htuple&lt;/code&gt;, &lt;code&gt;ShuffleUtils&lt;/code&gt;. &lt;code&gt;ShuffleUtils&lt;/code&gt; allows you to specify which elements in your tuple are used for partitioning, sorting and grouping during the shuffle phase. And this is how you do it:&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;&lt;code class='java'&gt;&lt;span class='n'&gt;ShuffleUtils&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;configBuilder&lt;/span&gt;&lt;span class='o'&gt;()&lt;/span&gt;
    &lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;useNewApi&lt;/span&gt;&lt;span class='o'&gt;()&lt;/span&gt;
    &lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;setPartitionerIndices&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;TupleFields&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;LAST_NAME&lt;/span&gt;&lt;span class='o'&gt;)&lt;/span&gt;
    &lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;setSortIndices&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;TupleFields&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;values&lt;/span&gt;&lt;span class='o'&gt;())&lt;/span&gt;
    &lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;setGroupIndices&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;TupleFields&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;LAST_NAME&lt;/span&gt;&lt;span class='o'&gt;)&lt;/span&gt;
    &lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;configure&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;conf&lt;/span&gt;&lt;span class='o'&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;If you recall how secondary sort works (see my book &amp;#8220;&lt;a href='http://www.manning.com/holmes/'&gt;Hadoop in Practice&lt;/a&gt;&amp;#8221; for a detailed explanation), you need to perform three steps in your MapReduce driver:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Specify how your compound key will be partitioned. In our example we only want the partitioner to use the last name so that all records with the same last name get routed to the same reducer.&lt;/li&gt;

&lt;li&gt;Specify how your compound key will be sorted. Here we want both the last and first name to be sorted, so that the first names will be presented to your reducer in sorted order.&lt;/li&gt;

&lt;li&gt;Specify how your compound key will be grouped. Since we want all the first names to be streamed to a single reducer invocation for a given last name, we only want to group on the last name.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A couple of things worth noting in the above code example:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We&amp;#8217;re using the new MapReduce API (i.e. using package &lt;code&gt;org.apache.hadoop.mapreduce&lt;/code&gt;), and as such you need to call the &lt;code&gt;useNewApi&lt;/code&gt; method.&lt;/li&gt;

&lt;li&gt;The &lt;code&gt;values&lt;/code&gt; method on an enum returns an array of all of the enum fields in order of definition, which in our example is the last name followed by the first name - exactly the order in which we want the sorting to occur.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You&amp;#8217;re done! If you examine the output of the MapReduce job in HDFS you&amp;#8217;ll see that indeed all the records are sorted by last and first name.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ hadoop fs -cat output/part*
Smith	Anne
Smith	John
Smith	Ken&lt;/code&gt;&lt;/pre&gt;
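
&lt;p&gt;To see why those three settings produce this ordering, the shuffle can be imitated in plain Java. The sketch below is only an illustration under stated assumptions: &lt;code&gt;SecondarySortSketch&lt;/code&gt;, &lt;code&gt;partition&lt;/code&gt; and &lt;code&gt;groupSorted&lt;/code&gt; are hypothetical names, no Hadoop or htuple classes are involved, and the sample records (including the Jones record) are made up.&lt;/p&gt;

```java
import java.util.Arrays;

// Plain-Java imitation of the three shuffle decisions; not Hadoop or htuple code.
class SecondarySortSketch {

    // Step 1: partition on the last name only, so one reducer sees every record for that name.
    static int partition(String lastName, int numReducers) {
        return Math.floorMod(lastName.hashCode(), numReducers);
    }

    // Steps 2 and 3: sort on (last name, first name), then group on the last name alone.
    static String groupSorted(String[] lines) {
        String[] sorted = lines.clone();
        // Natural String order on "last TAB first" sorts by last name and then first name,
        // because the tab separator sorts below any letter.
        Arrays.sort(sorted);
        StringBuilder out = new StringBuilder();
        String previousLast = null;
        for (String line : sorted) {
            String[] parts = line.split("\t");
            if (!parts[0].equals(previousLast)) {
                // A new last name starts a new group, i.e. a new reduce() invocation.
                if (previousLast != null) {
                    out.append('\n');
                }
                out.append(parts[0]).append(':');
                previousLast = parts[0];
            }
            out.append(' ').append(parts[1]);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String[] lines = { "Smith\tJohn", "Jones\tMary", "Smith\tAnne", "Smith\tKen" };
        // prints:
        // Jones: Mary
        // Smith: Anne John Ken
        System.out.println(groupSorted(lines));
        // The same last name always maps to the same partition number.
        System.out.println(partition("Smith", 4) == partition("Smith", 4));
    }
}
```

&lt;p&gt;Grouping on the last name alone is what lets a single reduce call see Anne, John and Ken in sorted order.&lt;/p&gt;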

&lt;p&gt;You can look at the complete source in &lt;a href='https://github.com/alexholmes/htuple/blob/master/examples/src/main/java/org/htuple/examples/SecondarySort.java'&gt;SecondarySort.java&lt;/a&gt;. The &lt;a href='https://github.com/alexholmes/htuple'&gt;htuple github page&lt;/a&gt; has instructions for downloading, building and running this same example in a couple of easy steps. There&amp;#8217;s also a page which shows the &lt;a href='https://github.com/alexholmes/htuple/blob/master/DATATYPES.md'&gt;types supported by htuple&lt;/a&gt;.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Next Generation Hadoop - It's Not Just Batch!</title>
   
   <category term="--" />
   
   <category term="hadoop" />
   
   <link href="http://grepalex.com/2013/09/25/javaone-2013-nextgen-hadoop"/>
   <updated>2013-09-25T14:20:00+00:00</updated>
   <id>http://grepalex.com/2013/09/25/javaone-2013-nextgen-hadoop</id>
   <content type="html">&lt;p&gt;In my &lt;a href='https://oracleus.activeevents.com/2013/connect/sessionDetail.ww?SESSION_ID=7356'&gt;JavaOne talk today&lt;/a&gt; I presented changes that are happening in Hadoop, where it&amp;#8217;s shaking off its batch-based shackles and enabling a new Hadoop platform that can support a mix of processing systems, from stream-processing systems to NoSQL systems.&lt;/p&gt;

&lt;p&gt;The slides for my talk can be viewed on &lt;a href='https://speakerdeck.com/alexholmes/javaone-2013-presentation-next-generation-hadoop-its-not-just-batch'&gt;Speaker Deck&lt;/a&gt;. The rest of this post is an overview of the technologies covered in my talk, along with links for further reading.&lt;/p&gt;

&lt;h2 id='yarn'&gt;YARN&lt;/h2&gt;

&lt;p&gt;With Hadoop 2.x, we now have YARN, which acts as a distributed scheduler. This is a big step towards the vision of Hadoop being the Big Data Kernel, as it allows arbitrary applications to be scheduled on the same Hadoop cluster, and enables a new world where we can have siloed applications coexisting on the same hardware and sharing the same storage.&lt;/p&gt;

&lt;p&gt;The following links serve as a good starting ground to learn more about YARN:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An introduction to YARN: &lt;a href='http://hortonworks.com/blog/introducing-apache-hadoop-yarn/'&gt;http://hortonworks.com/blog/introducing-apache-hadoop-yarn/&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;A book by Arun Murthy et al. on YARN: &lt;a href='http://www.amazon.com/Apache-Hadoop-YARN-Processing-Addison-Wesley/dp/0321934504'&gt;http://www.amazon.com/Apache-Hadoop-YARN-Processing-Addison-Wesley/dp/0321934504&lt;/a&gt;; the first chapter can be read for free at &lt;a href='http://hortonworks.com/wp-content/uploads/downloads/2013/06/Apache.Hadoop.YARN_.Sample.pdf'&gt;http://hortonworks.com/wp-content/uploads/downloads/2013/06/Apache.Hadoop.YARN_.Sample.pdf&lt;/a&gt;.&lt;/li&gt;

&lt;li&gt;The YARN ResourceManager: &lt;a href='http://hortonworks.com/blog/apache-hadoop-yarn-resourcemanager/'&gt;http://hortonworks.com/blog/apache-hadoop-yarn-resourcemanager/&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;Writing YARN applications: &lt;a href='http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html'&gt;http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;Setting up a cluster to run MapReduce on YARN: &lt;a href='http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html'&gt;http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;Configuring YARN: &lt;a href='http://hortonworks.com/blog/how-to-plan-and-configure-yarn-in-hdp-2-0/'&gt;http://hortonworks.com/blog/how-to-plan-and-configure-yarn-in-hdp-2-0/&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;Default YARN configuration: &lt;a href='http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-common/yarn-default.xml'&gt;http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-common/yarn-default.xml&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;YARN commands: &lt;a href='http://archive.cloudera.com/cdh4/cdh/4/hadoop/hadoop-yarn/hadoop-yarn-site/YarnCommands.html'&gt;http://archive.cloudera.com/cdh4/cdh/4/hadoop/hadoop-yarn/hadoop-yarn-site/YarnCommands.html&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id='apache_hbase'&gt;Apache HBase&lt;/h2&gt;

&lt;p&gt;HBase is a NoSQL, distributed multi-dimensional map based on Google&amp;#8217;s BigTable. It uses HDFS for persistence, which is a huge benefit if a key requirement of your NoSQL system is the ability to read and write data into HBase using MapReduce.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HBase project page: &lt;a href='http://hbase.apache.org/'&gt;http://hbase.apache.org/&lt;/a&gt; and mailing lists: &lt;a href='http://hbase.apache.org/mail-lists.html'&gt;http://hbase.apache.org/mail-lists.html&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;A good presentation by Amandeep Khurana on HBase: &lt;a href='http://www.slideshare.net/amansk/hbase-hadoop-day-seattle-4987041'&gt;http://www.slideshare.net/amansk/hbase-hadoop-day-seattle-4987041&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;HBase wiki: &lt;a href='http://wiki.apache.org/hadoop/Hbase'&gt;http://wiki.apache.org/hadoop/Hbase&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;The HBase Reference Guide - a great resource on HBase&amp;#8217;s data model, design and configuration: &lt;a href='http://hbase.apache.org/book.html'&gt;http://hbase.apache.org/book.html&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;HBase in Action, a book from Manning: &lt;a href='http://www.manning.com/dimidukkhurana/'&gt;http://www.manning.com/dimidukkhurana/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id='hbase_on_yarn_hoya'&gt;HBase on YARN (Hoya)&lt;/h2&gt;

&lt;p&gt;Hoya is a YARN application that allows multiple HBase clusters to coexist on a single Hadoop YARN cluster. This provides strong data/resource isolation properties, in conjunction with the ability to easily spin up, upsize/downsize and shut down HBase clusters. Hoya was developed by Steve Loughran and friends over at Hortonworks.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GitHub project: &lt;a href='https://github.com/hortonworks/hoya/'&gt;https://github.com/hortonworks/hoya/&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;Introducing Hoya: &lt;a href='http://hortonworks.com/blog/introducing-hoya-hbase-on-yarn/'&gt;http://hortonworks.com/blog/introducing-hoya-hbase-on-yarn/&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;Hoya architecture: &lt;a href='http://hortonworks.com/blog/hoya-hbase-on-yarn-application-architecture/'&gt;http://hortonworks.com/blog/hoya-hbase-on-yarn-application-architecture/&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;Presentation by Steve and Devaraj: &lt;a href='http://www.slideshare.net/steve_l/hoya-hbase-on-yarn-20130820-hbase-hug'&gt;http://www.slideshare.net/steve_l/hoya-hbase-on-yarn-20130820-hbase-hug&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id='apache_accumulo'&gt;Apache Accumulo&lt;/h2&gt;

&lt;p&gt;Accumulo is a BigTable implementation much like HBase. It also uses HDFS for storage, and currently has an edge in the security world due to its cell-level security, although it should be noted that this is planned for HBase (see &lt;a href='https://issues.apache.org/jira/browse/hbase-6222'&gt;HBASE-6222&lt;/a&gt;).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Project page: &lt;a href='http://accumulo.apache.org/'&gt;http://accumulo.apache.org/&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;Todd Lipcon&amp;#8217;s presentation comparing HBase and Accumulo: &lt;a href='http://www.slideshare.net/cloudera/h-base-and-accumulo-todd-lipcom-jan-25-2012'&gt;http://www.slideshare.net/cloudera/h-base-and-accumulo-todd-lipcom-jan-25-2012&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id='elephantdb'&gt;ElephantDB&lt;/h2&gt;

&lt;p&gt;ElephantDB is a read-only key-value store, which uses HDFS to load data which is served in real-time. It&amp;#8217;s a part of Nathan Marz&amp;#8217;s Lambda Architecture and enables the rapid loading and serving of data produced in the batch tier.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GitHub page: &lt;a href='https://github.com/nathanmarz/elephantdb'&gt;https://github.com/nathanmarz/elephantdb&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;Presentation by Nathan Marz: &lt;a href='http://www.slideshare.net/nathanmarz/elephantdb'&gt;http://www.slideshare.net/nathanmarz/elephantdb&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;Presentation by Soren Macbeth, a contributor to the project: &lt;a href='https://speakerdeck.com/sorenmacbeth/introduction-to-elephantdb'&gt;https://speakerdeck.com/sorenmacbeth/introduction-to-elephantdb&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id='storm'&gt;Storm&lt;/h2&gt;

&lt;p&gt;Storm is a stream processing, continuous computation and distributed RPC system developed and open-sourced by Twitter. It allows you to perform near real-time calculations such as trending topics.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Project home: &lt;a href='http://storm-project.net/'&gt;http://storm-project.net/&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;GitHub project: &lt;a href='https://github.com/nathanmarz/storm'&gt;https://github.com/nathanmarz/storm&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;Extensive documentation which covers the background and basics on how Storm works: &lt;a href='https://github.com/nathanmarz/storm/wiki'&gt;https://github.com/nathanmarz/storm/wiki&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;Nathan Marz&amp;#8217;s presentation on Storm: &lt;a href='http://www.youtube.com/watch?v=bdps8tE0gYo'&gt;http://www.youtube.com/watch?v=bdps8tE0gYo&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;Running a multi-node Storm cluster from Michael Noll: &lt;a href='http://www.michael-noll.com/tutorials/running-multi-node-storm-cluster/'&gt;http://www.michael-noll.com/tutorials/running-multi-node-storm-cluster/&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;Understanding the parallelism of a Storm topology, also from Mr. Noll: &lt;a href='http://www.michael-noll.com/blog/2012/10/16/understanding-the-parallelism-of-a-storm-topology/'&gt;http://www.michael-noll.com/blog/2012/10/16/understanding-the-parallelism-of-a-storm-topology/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id='storm_on_yarn'&gt;Storm on YARN&lt;/h2&gt;

&lt;p&gt;Yahoo uses Storm for a variety of use cases, and created Storm-on-YARN so that they could run Storm on their YARN clusters. They also added the ability for Storm to read/write to secure HDFS.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GitHub project page: &lt;a href='https://github.com/yahoo/storm-yarn'&gt;https://github.com/yahoo/storm-yarn&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;Yahoo! blog post introducing the project: &lt;a href='http://developer.yahoo.com/blogs/ydn/storm-yarn-released-open-source-143745133.html'&gt;http://developer.yahoo.com/blogs/ydn/storm-yarn-released-open-source-143745133.html&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;Hortonworks blog on the project: &lt;a href='http://hortonworks.com/blog/streaming-in-hadoop-yahoo-release-storm-yarn/'&gt;http://hortonworks.com/blog/streaming-in-hadoop-yahoo-release-storm-yarn/&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;Hadoop Summit 2013 presentation: &lt;a href='http://www.slideshare.net/Hadoop_Summit/feng-june26-1120amhall1v2'&gt;http://www.slideshare.net/Hadoop_Summit/feng-june26-1120amhall1v2&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id='apache_samza'&gt;Apache Samza&lt;/h2&gt;

&lt;p&gt;Samza (incubating) is a stream processing system that uses Kafka for messaging, and optionally YARN for resource management.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Project page: &lt;a href='http://samza.incubator.apache.org/'&gt;http://samza.incubator.apache.org/&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;LinkedIn post on Samza&amp;#8217;s background: &lt;a href='http://engineering.linkedin.com/data-streams/apache-samza-linkedins-real-time-stream-processing-framework'&gt;http://engineering.linkedin.com/data-streams/apache-samza-linkedins-real-time-stream-processing-framework&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id='morphlines'&gt;Morphlines&lt;/h2&gt;

&lt;p&gt;Morphlines is an ETL library from Cloudera that has implementations available for use within Flume, MapReduce and HBase. Using a modified JSON syntax it allows you to create a pipeline of work which can fulfill use cases such as near real-time writes from Flume into Solr Cloud.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GitHub page: &lt;a href='https://github.com/cloudera/search'&gt;https://github.com/cloudera/search&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;Introductory blog post: &lt;a href='http://blog.cloudera.com/blog/2013/07/morphlines-the-easy-way-to-build-and-integrate-etl-apps-for-apache-hadoop/'&gt;http://blog.cloudera.com/blog/2013/07/morphlines-the-easy-way-to-build-and-integrate-etl-apps-for-apache-hadoop/&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;Presentation from Cloudera: &lt;a href='http://www.slideshare.net/cloudera/using-morphlines-for-onthefly-etl'&gt;http://www.slideshare.net/cloudera/using-morphlines-for-onthefly-etl&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;Documentation as part of the Cloudera Development Kit: &lt;a href='http://cloudera.github.io/cdk/docs/0.5.0/cdk-morphlines/index.html'&gt;http://cloudera.github.io/cdk/docs/0.5.0/cdk-morphlines/index.html&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id='apache_giraph'&gt;Apache Giraph&lt;/h2&gt;

&lt;p&gt;Giraph is a framework for performing offline batch processing of semi-structured graph data on a massive scale. It offers performance advantages over graph processing with MapReduce.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Project page: &lt;a href='http://giraph.apache.org/'&gt;http://giraph.apache.org/&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;Quick start guide: &lt;a href='http://giraph.apache.org/quick_start.html'&gt;http://giraph.apache.org/quick_start.html&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;Hadoop Summit 2013 presentation: &lt;a href='http://www.youtube.com/watch?v=_RsJfZGQo9I'&gt;http://www.youtube.com/watch?v=_RsJfZGQo9I&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;Architectural overview: &lt;a href='http://www.slideshare.net/averyching/20111014hortonworks'&gt;http://www.slideshare.net/averyching/20111014hortonworks&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id='impala'&gt;Impala&lt;/h2&gt;

&lt;p&gt;Impala from Cloudera is an implementation of Google&amp;#8217;s paper on &lt;a href='http://research.google.com/pubs/pub36632.html'&gt;Dremel&lt;/a&gt;, and provides interactive SQL capabilities on top of data in HDFS and HBase.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GitHub page: &lt;a href='https://github.com/cloudera/impala'&gt;https://github.com/cloudera/impala&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;Project announcement from Cloudera: &lt;a href='http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real/'&gt;http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real/&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;Impala 1.0 release announcement: &lt;a href='http://blog.cloudera.com/blog/2013/05/cloudera-impala-1-0-its-here-its-real-its-already-the-standard-for-sql-on-hadoop/'&gt;http://blog.cloudera.com/blog/2013/05/cloudera-impala-1-0-its-here-its-real-its-already-the-standard-for-sql-on-hadoop/&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;Configuring Impala for multi-tenant performance: &lt;a href='http://blog.cloudera.com/blog/2013/06/configuring-impala-and-mapreduce-for-multi-tenant-performance/'&gt;http://blog.cloudera.com/blog/2013/06/configuring-impala-and-mapreduce-for-multi-tenant-performance/&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;Cloudera presentation at the Swiss Big Data User Group: &lt;a href='http://www.slideshare.net/SwissHUG/cloudera-impala-15376625'&gt;http://www.slideshare.net/SwissHUG/cloudera-impala-15376625&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id='apache_drill'&gt;Apache Drill&lt;/h2&gt;

&lt;p&gt;An (incubating) project that offers the promise of interactive SQL capabilities over data in HDFS, HBase, Cassandra, MongoDB and Splunk.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Apache incubating project page: &lt;a href='http://incubator.apache.org/drill/'&gt;http://incubator.apache.org/drill/&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;Architecture outlines: &lt;a href='http://www.slideshare.net/jasonfrantz/drill-architecture-20120913'&gt;http://www.slideshare.net/jasonfrantz/drill-architecture-20120913&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id='parquet'&gt;Parquet&lt;/h2&gt;

&lt;p&gt;Parquet, a joint initiative from Cloudera and Twitter, is a columnar data format supporting nested data. It can offer space and time advantages over row-ordered data, especially with queries that return a subset of the overall columns. It supports a wide variety of tools (MapReduce, Impala, Pig and Hive) and is used in production by Twitter.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GitHub page: &lt;a href='https://github.com/Parquet'&gt;https://github.com/Parquet&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;Presentation from Cloudera Impala meetup: &lt;a href='http://www.slideshare.net/cloudera/presentations-25757981'&gt;http://www.slideshare.net/cloudera/presentations-25757981&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;Hadoop Summit 2013 presentation: &lt;a href='http://www.youtube.com/watch?v=pFS-FScophU'&gt;http://www.youtube.com/watch?v=pFS-FScophU&lt;/a&gt; and accompanying slides &lt;a href='http://www.slideshare.net/julienledem/parquet-hadoop-summit-2013'&gt;http://www.slideshare.net/julienledem/parquet-hadoop-summit-2013&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;Twitter blog post: &lt;a href='https://blog.twitter.com/2013/dremel-made-simple-with-parquet'&gt;https://blog.twitter.com/2013/dremel-made-simple-with-parquet&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;Cloudera blog post: &lt;a href='http://blog.cloudera.com/blog/2013/03/introducing-parquet-columnar-storage-for-apache-hadoop/'&gt;http://blog.cloudera.com/blog/2013/03/introducing-parquet-columnar-storage-for-apache-hadoop/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id='orc_file'&gt;ORC File&lt;/h2&gt;

&lt;p&gt;ORC File is a columnar data format that also supports nested data. It is currently implemented within Hive 0.11.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Presentation from Hortonworks: &lt;a href='http://www.slideshare.net/oom65/orc-files'&gt;http://www.slideshare.net/oom65/orc-files&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;Details on the file format: &lt;a href='https://cwiki.apache.org/Hive/languagemanual-orc.html'&gt;https://cwiki.apache.org/Hive/languagemanual-orc.html&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;Hadoop Summit 2013 presentation &lt;a href='http://www.youtube.com/watch?v=GV7vpR7vpjM'&gt;http://www.youtube.com/watch?v=GV7vpR7vpjM&lt;/a&gt; and slides &lt;a href='http://www.slideshare.net/oom65/orc-andvectorizationhadoopsummit'&gt;http://www.slideshare.net/oom65/orc-andvectorizationhadoopsummit&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id='apache_tez'&gt;Apache Tez&lt;/h2&gt;

&lt;p&gt;Tez (incubating) is a generalized DAG execution engine. The goal of the project is to remove disk barriers that exist with pipelined MapReduce jobs. The first goal of the project is to provide a MapReduce implementation using Tez, followed by Hive and Pig.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Incubating page at Apache: &lt;a href='http://incubator.apache.org/projects/tez.html'&gt;http://incubator.apache.org/projects/tez.html&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;Introducing Tez: &lt;a href='http://hortonworks.com/blog/introducing-tez-faster-hadoop-processing/'&gt;http://hortonworks.com/blog/introducing-tez-faster-hadoop-processing/&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;Hadoop Summit 2013 presentation &lt;a href='http://www.youtube.com/watch?v=9ZLLzlsz7h8'&gt;http://www.youtube.com/watch?v=9ZLLzlsz7h8&lt;/a&gt; and accompanying slides &lt;a href='http://www.slideshare.net/Hadoop_Summit/murhty-saha-june26255pmroom212'&gt;http://www.slideshare.net/Hadoop_Summit/murhty-saha-june26255pmroom212&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id='apache_mesos'&gt;Apache Mesos&lt;/h2&gt;

&lt;p&gt;Mesos is a cluster manager, similar to YARN, providing resource sharing and isolation capabilities in a distributed cluster. It can support multiple instances and versions of Hadoop, Spark and other applications. It&amp;#8217;s used at Twitter to manage various applications in production.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Project page: &lt;a href='http://mesos.apache.org/'&gt;http://mesos.apache.org/&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;Tech talk: &lt;a href='http://www.youtube.com/watch?v=Hal00g8o1iY'&gt;http://www.youtube.com/watch?v=Hal00g8o1iY&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id='lambda_architecture'&gt;Lambda Architecture&lt;/h2&gt;

&lt;p&gt;The Lambda Architecture, an architectural blueprint from Nathan Marz, suggests that speed and batch layers should exist to play to their mutual strengths: the speed layer providing near real-time data aggregations, and the batch layer providing a mechanism to correct potential mistakes made in the speed layer.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Nathan&amp;#8217;s book, Big Data from Manning, which goes into detail on the Lambda Architecture: &lt;a href='http://www.manning.com/marz/'&gt;http://www.manning.com/marz/&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;Nathan&amp;#8217;s presentation explaining the background behind Lambda: &lt;a href='http://www.slideshare.net/nathanmarz/runaway-complexity-in-big-data-and-a-plan-to-stop-it'&gt;http://www.slideshare.net/nathanmarz/runaway-complexity-in-big-data-and-a-plan-to-stop-it&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id='summingbird'&gt;Summingbird&lt;/h2&gt;

&lt;p&gt;Summingbird is a project out of Twitter which could be viewed as an implementation of the Lambda Architecture. It allows you to use a single API to define operations on distributed collections which can be mapped into MapReduce or Storm executions.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GitHub project page: &lt;a href='https://github.com/twitter/summingbird'&gt;https://github.com/twitter/summingbird&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;Twitter blog post on Summingbird: &lt;a href='https://blog.twitter.com/2013/streaming-mapreduce-with-summingbird'&gt;https://blog.twitter.com/2013/streaming-mapreduce-with-summingbird&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;Sam Ritchie presentation on Summingbird: &lt;a href='http://www.youtube.com/watch?v=Y3PETLJeP7o'&gt;http://www.youtube.com/watch?v=Y3PETLJeP7o&lt;/a&gt; and accompanying slides &lt;a href='https://speakerdeck.com/sritchie/summingbird-streaming-mapreduce-at-twitter'&gt;https://speakerdeck.com/sritchie/summingbird-streaming-mapreduce-at-twitter&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id='apache_spark'&gt;Apache Spark&lt;/h2&gt;

&lt;p&gt;Spark (incubating) is an in-memory distributed processing system which allows you to perform MapReduce, as well as iterative workloads over data. Spark and its family of associated projects (such as Spark Streaming and GraphX) offer a complete solution to most distributed processing use cases.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Project page: &lt;a href='http://spark.incubator.apache.org/'&gt;http://spark.incubator.apache.org/&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;Documentation, including links to video tutorials: &lt;a href='http://spark.incubator.apache.org/documentation.html'&gt;http://spark.incubator.apache.org/documentation.html&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</content>
 </entry>
 
 <entry>
   <title>Bucketing, multiplexing and combining in Hadoop - part 2</title>
   
   <category term="--" />
   
   <category term="hadoop" />
   
   <link href="http://grepalex.com/2013/07/16/multipleoutputs-part2"/>
   <updated>2013-07-16T14:20:00+00:00</updated>
   <id>http://grepalex.com/2013/07/16/multipleoutputs-part2</id>
   <content type="html">&lt;p&gt;In the &lt;a href='http://grepalex.com/2013/05/20/multipleoutputs-part1/'&gt;first post of this series&lt;/a&gt;, we looked at how the &lt;code&gt;MultipleOutputFormat&lt;/code&gt; class could be used in a task to write to multiple output files. This approach had a few shortcomings: it couldn&amp;#8217;t be used in the map side of a job that used reducers, and it only worked with the old &lt;code&gt;mapred&lt;/code&gt; API.&lt;/p&gt;

&lt;p&gt;In this post we&amp;#8217;ll look at the &lt;code&gt;MultipleOutputs&lt;/code&gt; class, which offers an alternative to the &lt;code&gt;MultipleOutputFormat&lt;/code&gt; and also addresses its shortcomings.&lt;/p&gt;

&lt;h2 id='multipleoutputs'&gt;MultipleOutputs&lt;/h2&gt;

&lt;p&gt;Using the &lt;code&gt;MultipleOutputs&lt;/code&gt; class is a more modern Hadoop way of writing to multiple outputs. It has both &lt;code&gt;mapred&lt;/code&gt; and &lt;code&gt;mapreduce&lt;/code&gt; API implementations, and allows you to work with multiple OutputFormat classes in your job. Its approach is different from &lt;code&gt;MultipleOutputFormat&lt;/code&gt; - rather than defining its own &lt;code&gt;OutputFormat&lt;/code&gt; it merely provides some helper methods which need to be called in your driver code, as well as in your mapper/reducer.&lt;/p&gt;

&lt;p&gt;The two &lt;code&gt;MultipleOutputs&lt;/code&gt; classes in &lt;code&gt;mapred&lt;/code&gt; and &lt;code&gt;mapreduce&lt;/code&gt; are close in functionality, the main difference being support of multi-named outputs, which we&amp;#8217;ll examine later in this post.&lt;/p&gt;

&lt;p&gt;Let&amp;#8217;s look at how we would achieve the same result as we did with &lt;code&gt;MultipleOutputFormat&lt;/code&gt;. If you recall from the previous post in this series, we were working with some sample data from a fruit market, where the data points were the location of each market, and the fruit that was sold:&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;&lt;code class='bash'&gt;cupertino   apple
sunnyvale   banana
cupertino   pear
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;Our goal is to partition the outputs by city, so there would be city-specific files. First up is our driver code, where we need to tell &lt;code&gt;MultipleOutputs&lt;/code&gt; the named outputs, and their related &lt;code&gt;OutputFormat&lt;/code&gt; classes. For simplicity we&amp;#8217;ve chosen &lt;code&gt;TextOutputFormat&lt;/code&gt; for both, but you can use different &lt;code&gt;OutputFormats&lt;/code&gt; for each named output.&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;&lt;code class='java'&gt;&lt;span class='n'&gt;MultipleOutputs&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;addNamedOutput&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;jobConf&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='s'&gt;&amp;quot;cupertino&amp;quot;&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;TextOutputFormat&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;class&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;Text&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;class&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;Text&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;class&lt;/span&gt;&lt;span class='o'&gt;);&lt;/span&gt;
&lt;span class='n'&gt;MultipleOutputs&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;addNamedOutput&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;jobConf&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='s'&gt;&amp;quot;sunnyvale&amp;quot;&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;TextOutputFormat&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;class&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;Text&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;class&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;Text&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;class&lt;/span&gt;&lt;span class='o'&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;The named outputs &amp;#8220;cupertino&amp;#8221; and &amp;#8220;sunnyvale&amp;#8221; are used for two purposes in &lt;code&gt;MultipleOutputs&lt;/code&gt;: first, as logical keys that you use in your mapper and reducer to look up their associated &lt;code&gt;OutputCollector&lt;/code&gt; instances, and second, as the prefixes of the output filenames in HDFS.&lt;/p&gt;

&lt;p&gt;We can&amp;#8217;t use an identity reducer in this example as we have to use the &lt;code&gt;MultipleOutputs&lt;/code&gt; class to redirect our output to the appropriate file, so let&amp;#8217;s go ahead and see what the reducer will look like.&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;&lt;code class='java'&gt;&lt;span class='kd'&gt;class&lt;/span&gt; &lt;span class='nc'&gt;Reduce&lt;/span&gt; &lt;span class='kd'&gt;extends&lt;/span&gt; &lt;span class='n'&gt;MapReduceBase&lt;/span&gt;
        &lt;span class='kd'&gt;implements&lt;/span&gt; &lt;span class='n'&gt;Reducer&lt;/span&gt;&lt;span class='o'&gt;&amp;lt;&lt;/span&gt;&lt;span class='n'&gt;Text&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;Text&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;Text&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;Text&lt;/span&gt;&lt;span class='o'&gt;&amp;gt;&lt;/span&gt; &lt;span class='o'&gt;{&lt;/span&gt;

    &lt;span class='kd'&gt;private&lt;/span&gt; &lt;span class='n'&gt;MultipleOutputs&lt;/span&gt; &lt;span class='n'&gt;output&lt;/span&gt;&lt;span class='o'&gt;;&lt;/span&gt;

    &lt;span class='nd'&gt;@Override&lt;/span&gt;
    &lt;span class='kd'&gt;public&lt;/span&gt; &lt;span class='kt'&gt;void&lt;/span&gt; &lt;span class='nf'&gt;configure&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='kd'&gt;final&lt;/span&gt; &lt;span class='n'&gt;JobConf&lt;/span&gt; &lt;span class='n'&gt;job&lt;/span&gt;&lt;span class='o'&gt;)&lt;/span&gt; &lt;span class='o'&gt;{&lt;/span&gt;
        &lt;span class='kd'&gt;super&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;configure&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;job&lt;/span&gt;&lt;span class='o'&gt;);&lt;/span&gt;
        &lt;span class='n'&gt;output&lt;/span&gt; &lt;span class='o'&gt;=&lt;/span&gt; &lt;span class='k'&gt;new&lt;/span&gt; &lt;span class='n'&gt;MultipleOutputs&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;job&lt;/span&gt;&lt;span class='o'&gt;);&lt;/span&gt;
    &lt;span class='o'&gt;}&lt;/span&gt;

    &lt;span class='nd'&gt;@Override&lt;/span&gt;
    &lt;span class='kd'&gt;public&lt;/span&gt; &lt;span class='kt'&gt;void&lt;/span&gt; &lt;span class='nf'&gt;reduce&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='kd'&gt;final&lt;/span&gt; &lt;span class='n'&gt;Text&lt;/span&gt; &lt;span class='n'&gt;key&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='kd'&gt;final&lt;/span&gt; &lt;span class='n'&gt;Iterator&lt;/span&gt;&lt;span class='o'&gt;&amp;lt;&lt;/span&gt;&lt;span class='n'&gt;Text&lt;/span&gt;&lt;span class='o'&gt;&amp;gt;&lt;/span&gt; &lt;span class='n'&gt;values&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt;
                       &lt;span class='kd'&gt;final&lt;/span&gt; &lt;span class='n'&gt;OutputCollector&lt;/span&gt;&lt;span class='o'&gt;&amp;lt;&lt;/span&gt;&lt;span class='n'&gt;Text&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;Text&lt;/span&gt;&lt;span class='o'&gt;&amp;gt;&lt;/span&gt; &lt;span class='n'&gt;collector&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='kd'&gt;final&lt;/span&gt; &lt;span class='n'&gt;Reporter&lt;/span&gt; &lt;span class='n'&gt;reporter&lt;/span&gt;&lt;span class='o'&gt;)&lt;/span&gt;
            &lt;span class='kd'&gt;throws&lt;/span&gt; &lt;span class='n'&gt;IOException&lt;/span&gt; &lt;span class='o'&gt;{&lt;/span&gt;
        &lt;span class='k'&gt;while&lt;/span&gt; &lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;values&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;hasNext&lt;/span&gt;&lt;span class='o'&gt;())&lt;/span&gt; &lt;span class='o'&gt;{&lt;/span&gt;
            &lt;span class='n'&gt;output&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;getCollector&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;key&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;toString&lt;/span&gt;&lt;span class='o'&gt;(),&lt;/span&gt; &lt;span class='n'&gt;reporter&lt;/span&gt;&lt;span class='o'&gt;).&lt;/span&gt;&lt;span class='na'&gt;collect&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;key&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;values&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;next&lt;/span&gt;&lt;span class='o'&gt;());&lt;/span&gt;
        &lt;span class='o'&gt;}&lt;/span&gt;
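    &lt;span class='o'&gt;}&lt;/span&gt;

    &lt;span class='nd'&gt;@Override&lt;/span&gt;
    &lt;span class='kd'&gt;public&lt;/span&gt; &lt;span class='kt'&gt;void&lt;/span&gt; &lt;span class='nf'&gt;close&lt;/span&gt;&lt;span class='o'&gt;()&lt;/span&gt; &lt;span class='kd'&gt;throws&lt;/span&gt; &lt;span class='n'&gt;IOException&lt;/span&gt; &lt;span class='o'&gt;{&lt;/span&gt;
        &lt;span class='c1'&gt;// Close the named outputs so that buffered records are flushed.&lt;/span&gt;
        &lt;span class='n'&gt;output&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;close&lt;/span&gt;&lt;span class='o'&gt;();&lt;/span&gt;
    &lt;span class='o'&gt;}&lt;/span&gt;
&lt;span class='o'&gt;}&lt;/span&gt;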
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;As you can see, we&amp;#8217;re not using the &lt;code&gt;OutputCollector&lt;/code&gt; supplied to us in the &lt;code&gt;reduce&lt;/code&gt; method. Instead we create a &lt;code&gt;MultipleOutputs&lt;/code&gt; instance in the &lt;code&gt;configure&lt;/code&gt; method, which is then used in the &lt;code&gt;reduce&lt;/code&gt; method. For each reducer input record, we use the key to look up the &lt;code&gt;OutputCollector&lt;/code&gt; and then emit each key/value pair to that collector. Remember that when calling &lt;code&gt;getCollector&lt;/code&gt; you must use one of the named outputs that you defined in the job driver. In our case our input keys are either &amp;#8220;cupertino&amp;#8221; or &amp;#8220;sunnyvale&amp;#8221;, and they map directly to the named outputs we defined in our driver, so we&amp;#8217;re in good shape.&lt;/p&gt;

&lt;p&gt;Let&amp;#8217;s examine the contents of the job output directory after running the job.&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;&lt;code class='bash'&gt;&lt;span class='nv'&gt;$ &lt;/span&gt;hadoop fs -lsr /output
/output/cupertino-r-00000
/output/sunnyvale-r-00000
/output/part-00000
/output/part-00001
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;This output highlights one of the key differences between &lt;code&gt;MultipleOutputs&lt;/code&gt; and &lt;code&gt;MultipleOutputFormat&lt;/code&gt;. When using &lt;code&gt;MultipleOutputs&lt;/code&gt; you can output to the reducer&amp;#8217;s regular &lt;code&gt;OutputCollector&lt;/code&gt;, or to the &lt;code&gt;OutputCollector&lt;/code&gt; for a named output, or to both, which is why you see &lt;code&gt;part-nnnnn&lt;/code&gt; files.&lt;/p&gt;

&lt;p&gt;But wait! One problem with &lt;code&gt;MultipleOutputs&lt;/code&gt; is that we needed to pre-define the partitions &amp;#8220;cupertino&amp;#8221; and &amp;#8220;sunnyvale&amp;#8221; ahead of time in our driver. What if we didn&amp;#8217;t know the partitions ahead of time?&lt;/p&gt;

&lt;h2 id='dynamic_files_with_the_multipleoutputs_class'&gt;Dynamic files with the MultipleOutputs class&lt;/h2&gt;

&lt;p&gt;Up until now &lt;code&gt;MultipleOutputs&lt;/code&gt; has treated us well: it supports both the old and new MapReduce APIs, and can also support multiple &lt;code&gt;OutputFormat&lt;/code&gt; classes within the same reducer. But as we saw, we essentially had to pre-define the output files in our driver code. So how do we handle cases where we want the output files to be determined dynamically in the reducer?&lt;/p&gt;

&lt;p&gt;Luckily, &lt;code&gt;MultipleOutputs&lt;/code&gt; has a notion of &amp;#8220;multi named&amp;#8221; output. In the driver, instead of enumerating all the output files we want, we&amp;#8217;ll add a single logical name called &amp;#8220;fruit&amp;#8221;, using &lt;code&gt;addMultiNamedOutput&lt;/code&gt; instead of &lt;code&gt;addNamedOutput&lt;/code&gt;:&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;&lt;code class='java'&gt;&lt;span class='n'&gt;MultipleOutputs&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;addMultiNamedOutput&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;jobConf&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='s'&gt;&amp;quot;fruit&amp;quot;&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;TextOutputFormat&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;class&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;Text&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;class&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;Text&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;class&lt;/span&gt;&lt;span class='o'&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;In our reducer we always specify &amp;#8220;fruit&amp;#8221; as the name, but we use a different &lt;code&gt;getCollector&lt;/code&gt; overload that takes an additional argument, which determines the filename used for output:&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;&lt;code class='java'&gt;&lt;span class='n'&gt;output&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;getCollector&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='s'&gt;&amp;quot;fruit&amp;quot;&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;key&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;toString&lt;/span&gt;&lt;span class='o'&gt;(),&lt;/span&gt; &lt;span class='n'&gt;reporter&lt;/span&gt;&lt;span class='o'&gt;).&lt;/span&gt;&lt;span class='na'&gt;collect&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;key&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;values&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;next&lt;/span&gt;&lt;span class='o'&gt;());&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;Let&amp;#8217;s do another HDFS listing:&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;&lt;code class='bash'&gt;&lt;span class='nv'&gt;$ &lt;/span&gt;hadoop fs -lsr /output
/output/fruit_cupertino-r-00000
/output/fruit_sunnyvale-r-00000
/output/part-00000
/output/part-00001
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;Hurray! We now have multiple output files that are dynamically created based on the reducer output key, just like we did with &lt;code&gt;MultipleOutputFormat&lt;/code&gt;.&lt;/p&gt;
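&lt;p&gt;One caveat worth checking before relying on this: as far as I can tell from the old-API &lt;code&gt;MultipleOutputs&lt;/code&gt; source, both the named output and the multi name passed to &lt;code&gt;getCollector&lt;/code&gt; may only contain letters and digits, so a key such as &amp;#8220;san-jose&amp;#8221; would be rejected with an &lt;code&gt;IllegalArgumentException&lt;/code&gt;. Below is a minimal sketch of a sanitizer you could apply to the key first. The &lt;code&gt;MultiNames&lt;/code&gt; class and its &lt;code&gt;sanitize&lt;/code&gt; method are hypothetical helpers, not part of Hadoop, and the restriction itself is an assumption you should verify against your Hadoop version.&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;&lt;code class='java'&gt;public final class MultiNames {

    private MultiNames() {
    }

    // Hypothetical helper, not part of Hadoop: drops every character that is
    // not an ASCII letter or digit, so &amp;quot;san-jose&amp;quot; becomes &amp;quot;sanjose&amp;quot;.
    public static String sanitize(String key) {
        String cleaned = key.replaceAll(&amp;quot;[^A-Za-z0-9]&amp;quot;, &amp;quot;&amp;quot;);
        if (cleaned.isEmpty()) {
            throw new IllegalArgumentException(&amp;quot;No usable characters in: &amp;quot; + key);
        }
        return cleaned;
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;In the reducer the call would then become &lt;code&gt;output.getCollector(&amp;quot;fruit&amp;quot;, MultiNames.sanitize(key.toString()), reporter)&lt;/code&gt;. Bear in mind that distinct keys can collapse to the same name after sanitizing (for example &amp;#8220;san-jose&amp;#8221; and &amp;#8220;sanjose&amp;#8221;), in which case their records end up in the same file.&lt;/p&gt;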

&lt;p&gt;Now unfortunately the multi-named output is only supported by the old &lt;code&gt;mapred&lt;/code&gt; API, whereas with the new &lt;code&gt;mapreduce&lt;/code&gt; API you are forced to define your partitions in your job driver.&lt;/p&gt;

&lt;h2 id='conclusion'&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;There are plenty of things to like about &lt;code&gt;MultipleOutputs&lt;/code&gt;, namely its support for both &amp;#8220;old&amp;#8221; and &amp;#8220;new&amp;#8221; MapReduce APIs, and its support for multiple &lt;code&gt;OutputFormat&lt;/code&gt; classes. Its only real downside is that multi named outputs are only supported in the old &lt;code&gt;mapred&lt;/code&gt; API, so those looking for dynamic partitions in the new &lt;code&gt;mapreduce&lt;/code&gt; API are not served by either &lt;code&gt;MultipleOutputs&lt;/code&gt; or the &lt;code&gt;MultipleOutputFormat&lt;/code&gt; described in &lt;a href='http://grepalex.com/2013/05/20/multipleoutputs-part1/'&gt;part 1&lt;/a&gt;.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Secondary sorting with Avro</title>
   
   <category term="--" />
   
   <category term="avro" />
   
   <category term="--" />
   
   <category term="hadoop" />
   
   <link href="http://grepalex.com/2013/06/03/avro-custom-partitioning-sorting-grouping-hadoop"/>
   <updated>2013-06-03T14:20:00+00:00</updated>
   <id>http://grepalex.com/2013/06/03/avro-custom-partitioning-sorting-grouping-hadoop</id>
   <content type="html">&lt;p&gt;In the &lt;a href='http://grepalex.com/2013/05/28/avro-builtin-sorting/'&gt;last Avro sorting post&lt;/a&gt; you saw how sorting Avro records works in MapReduce, and how one can ignore fields in Avro records for partitioning, sorting and grouping. In the process you discovered that ignored fields are limited by being immutable (since they can only be defined once for a schema), which means you can&amp;#8217;t vary what fields are ignored for partitioning, sorting or grouping, which is key for secondary sort.&lt;/p&gt;

&lt;p&gt;If you wish to use secondary sort with Avro, one option would be to emit a custom Writable as the map output key, and emit an Avro record as the map output value. With this approach you&amp;#8217;d write a custom partitioner, and sorting/grouping implementation.&lt;/p&gt;

&lt;p&gt;This post looks at another option, where with some hacking you can actually have secondary sort with Avro map output keys.&lt;/p&gt;

&lt;h2 id='true_secondary_sort_with_an_avrokey'&gt;True secondary sort with an AvroKey&lt;/h2&gt;

&lt;p&gt;Avro has some utility classes for sorting and hashing (required for the partitioner), but the code is locked-down with private methods. The hacking therefore requires lifting certain parts of Avro&amp;#8217;s code, and writing some helper functions to easily allow jobs fine-grained control over what fields are used for secondary sort.&lt;/p&gt;

&lt;p&gt;Let&amp;#8217;s take an example with the same Avro schema we used in the last post:&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;&lt;code class='json'&gt;&lt;span class='p'&gt;{&lt;/span&gt;&lt;span class='nt'&gt;&amp;quot;type&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='s2'&gt;&amp;quot;record&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;,&lt;/span&gt; &lt;span class='nt'&gt;&amp;quot;name&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='s2'&gt;&amp;quot;com.alexholmes.avro.WeatherNoIgnore&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;,&lt;/span&gt;
 &lt;span class='nt'&gt;&amp;quot;doc&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='s2'&gt;&amp;quot;A weather reading.&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;,&lt;/span&gt;
 &lt;span class='nt'&gt;&amp;quot;fields&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='p'&gt;[&lt;/span&gt;
     &lt;span class='p'&gt;{&lt;/span&gt;&lt;span class='nt'&gt;&amp;quot;name&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='s2'&gt;&amp;quot;station&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;,&lt;/span&gt; &lt;span class='nt'&gt;&amp;quot;type&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='s2'&gt;&amp;quot;string&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;},&lt;/span&gt;
     &lt;span class='p'&gt;{&lt;/span&gt;&lt;span class='nt'&gt;&amp;quot;name&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='s2'&gt;&amp;quot;time&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;,&lt;/span&gt; &lt;span class='nt'&gt;&amp;quot;type&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='s2'&gt;&amp;quot;long&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;},&lt;/span&gt;
     &lt;span class='p'&gt;{&lt;/span&gt;&lt;span class='nt'&gt;&amp;quot;name&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='s2'&gt;&amp;quot;temp&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;,&lt;/span&gt; &lt;span class='nt'&gt;&amp;quot;type&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='s2'&gt;&amp;quot;int&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;},&lt;/span&gt;
     &lt;span class='p'&gt;{&lt;/span&gt;&lt;span class='nt'&gt;&amp;quot;name&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='s2'&gt;&amp;quot;counter&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;,&lt;/span&gt; &lt;span class='nt'&gt;&amp;quot;type&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='s2'&gt;&amp;quot;int&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;,&lt;/span&gt; &lt;span class='nt'&gt;&amp;quot;default&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='mi'&gt;0&lt;/span&gt;&lt;span class='p'&gt;}&lt;/span&gt;
 &lt;span class='p'&gt;]&lt;/span&gt;
&lt;span class='p'&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;For secondary sort you may imagine a scenario where you want to partition output records by the station, sort records using the station, time and temp fields, and finally group by the station and time fields. The code to do this is as follows &lt;a href='https://github.com/alexholmes/avro-sorting/blob/master/src/main/java/com/alexholmes/avro/sort/avrokey/AvroSortCustom.java'&gt;GitHub source&lt;/a&gt;:&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;&lt;code class='java'&gt;&lt;span class='n'&gt;AvroSort&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;builder&lt;/span&gt;&lt;span class='o'&gt;()&lt;/span&gt;
    &lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;setJob&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;job&lt;/span&gt;&lt;span class='o'&gt;)&lt;/span&gt;
    &lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;addPartitionField&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;WeatherNoIgnore&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;SCHEMA&lt;/span&gt;&lt;span class='n'&gt;$&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='s'&gt;&amp;quot;station&amp;quot;&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='kc'&gt;true&lt;/span&gt;&lt;span class='o'&gt;)&lt;/span&gt;
    &lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;addSortField&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;WeatherNoIgnore&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;SCHEMA&lt;/span&gt;&lt;span class='n'&gt;$&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='s'&gt;&amp;quot;station&amp;quot;&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='kc'&gt;true&lt;/span&gt;&lt;span class='o'&gt;)&lt;/span&gt;
    &lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;addSortField&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;WeatherNoIgnore&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;SCHEMA&lt;/span&gt;&lt;span class='n'&gt;$&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='s'&gt;&amp;quot;time&amp;quot;&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='kc'&gt;true&lt;/span&gt;&lt;span class='o'&gt;)&lt;/span&gt;
    &lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;addSortField&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;WeatherNoIgnore&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;SCHEMA&lt;/span&gt;&lt;span class='n'&gt;$&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='s'&gt;&amp;quot;temp&amp;quot;&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='kc'&gt;true&lt;/span&gt;&lt;span class='o'&gt;)&lt;/span&gt;
    &lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;addGroupField&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;WeatherNoIgnore&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;SCHEMA&lt;/span&gt;&lt;span class='n'&gt;$&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='s'&gt;&amp;quot;station&amp;quot;&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='kc'&gt;true&lt;/span&gt;&lt;span class='o'&gt;)&lt;/span&gt;
    &lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;addGroupField&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;WeatherNoIgnore&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;SCHEMA&lt;/span&gt;&lt;span class='n'&gt;$&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='s'&gt;&amp;quot;time&amp;quot;&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='kc'&gt;true&lt;/span&gt;&lt;span class='o'&gt;)&lt;/span&gt;
    &lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;configure&lt;/span&gt;&lt;span class='o'&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;The ordering of the &lt;code&gt;addXXX&lt;/code&gt; calls is significant, as it determines the order in which fields are used for sorting and grouping. The last argument in the &lt;code&gt;addXXX&lt;/code&gt; methods is a boolean which indicates whether the ordering is ascending.&lt;/p&gt;
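&lt;p&gt;To make the difference between the sort fields and the group fields concrete, here is a plain-Java sketch with no Avro or Hadoop involved. It sorts a handful of readings by station, time and temp, and then starts a new group whenever the station or time changes, which mimics when a new &lt;code&gt;reduce&lt;/code&gt; call would begin. The &lt;code&gt;Reading&lt;/code&gt; class and both comparators are illustrative stand-ins that I&amp;#8217;ve made up for this example; they are not how &lt;code&gt;AvroSort&lt;/code&gt; is implemented.&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;&lt;code class='java'&gt;import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SecondarySortSketch {

    // Hypothetical stand-in for the WeatherNoIgnore record.
    static final class Reading {
        final String station;
        final long time;
        final int temp;

        Reading(String station, long time, int temp) {
            this.station = station;
            this.time = time;
            this.temp = temp;
        }

        @Override
        public String toString() {
            return station + &amp;quot;/&amp;quot; + time + &amp;quot;/&amp;quot; + temp;
        }
    }

    // Sort order: station, then time, then temp, all ascending.
    static final Comparator&amp;lt;Reading&amp;gt; SORT =
        Comparator.comparing((Reading r) -&amp;gt; r.station)
            .thenComparingLong(r -&amp;gt; r.time)
            .thenComparingInt(r -&amp;gt; r.temp);

    // Grouping ignores temp, so temp varies (in sorted order) within a group.
    static final Comparator&amp;lt;Reading&amp;gt; GROUP =
        Comparator.comparing((Reading r) -&amp;gt; r.station)
            .thenComparingLong(r -&amp;gt; r.time);

    public static void main(String[] args) {
        List&amp;lt;Reading&amp;gt; readings = new ArrayList&amp;lt;&amp;gt;(Arrays.asList(
            new Reading(&amp;quot;SFO&amp;quot;, 1, 3),
            new Reading(&amp;quot;IAD&amp;quot;, 1, 1),
            new Reading(&amp;quot;SFO&amp;quot;, 2, 1),
            new Reading(&amp;quot;SFO&amp;quot;, 1, 2),
            new Reading(&amp;quot;SFO&amp;quot;, 1, 1)));
        readings.sort(SORT);

        Reading previous = null;
        for (Reading r : readings) {
            // A new group starts whenever the group comparator sees a change.
            if (previous == null || GROUP.compare(previous, r) != 0) {
                System.out.println(&amp;quot;group &amp;quot; + r.station + &amp;quot;/&amp;quot; + r.time);
            }
            System.out.println(&amp;quot;  &amp;quot; + r);
            previous = r;
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;With these five readings the sketch produces three groups: IAD at time 1 with a single reading, SFO at time 1 with three readings ordered by temp, and SFO at time 2 with a single reading. In a real job the partition field additionally decides which reducer sees each record, so partitioning on station keeps all readings for one station on the same reducer.&lt;/p&gt;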

&lt;p&gt;Most of the heavy lifting is performed in the &lt;a href='https://github.com/alexholmes/avro-sorting/blob/master/src/main/java/com/alexholmes/avro/sort/avrokey/AvroSort.java'&gt;AvroSort&lt;/a&gt; and &lt;a href='https://github.com/alexholmes/avro-sorting/blob/master/src/main/java/org/apache/avro/io/AvroDataHack.java'&gt;AvroDataHack&lt;/a&gt; classes - the latter, as its name indicates, is where some hacking took place to get things working.&lt;/p&gt;

&lt;p&gt;The only caveat with the current implementation is that Avro union types aren&amp;#8217;t currently supported - I&amp;#8217;ll look into that in the near future.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Avro's built-in sorting</title>
   
   <category term="--" />
   
   <category term="avro" />
   
   <link href="http://grepalex.com/2013/05/28/avro-builtin-sorting"/>
   <updated>2013-05-28T14:20:00+00:00</updated>
   <id>http://grepalex.com/2013/05/28/avro-builtin-sorting</id>
   <content type="html">&lt;p&gt;Avro has a little-known gem of a feature which allows you to control which fields in an Avro record are used for &lt;em&gt;partitioning&lt;/em&gt;, &lt;em&gt;sorting&lt;/em&gt; and &lt;em&gt;grouping&lt;/em&gt; in MapReduce. The following figure gives a quick refresher as to what these terms mean. Oh, and don&amp;#8217;t take the placement of the &amp;#8220;sorting&amp;#8221; literally - sorting actually occurs on both the map and reduce side - but it&amp;#8217;s always performed in the context of a specific partition (i.e. for a specific reducer).&lt;/p&gt;

&lt;p&gt;&lt;img alt='Image of MapReduce shuffle' src='/images/mr-shuffle.png' /&gt;&lt;/p&gt;

&lt;p&gt;By default all the fields in an Avro map output key are used for partitioning, sorting and grouping in MapReduce. Let&amp;#8217;s walk through an example and see how this works. You&amp;#8217;ll begin with a simple schema &lt;a href='https://github.com/alexholmes/avro-sorting/blob/master/src/main/avro/weather-noignores.avsc'&gt;GitHub source&lt;/a&gt;:&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;&lt;code class='json'&gt;&lt;span class='p'&gt;{&lt;/span&gt;&lt;span class='nt'&gt;&amp;quot;type&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='s2'&gt;&amp;quot;record&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;,&lt;/span&gt; &lt;span class='nt'&gt;&amp;quot;name&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='s2'&gt;&amp;quot;com.alexholmes.avro.WeatherNoIgnore&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;,&lt;/span&gt;
 &lt;span class='nt'&gt;&amp;quot;doc&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='s2'&gt;&amp;quot;A weather reading.&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;,&lt;/span&gt;
 &lt;span class='nt'&gt;&amp;quot;fields&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='p'&gt;[&lt;/span&gt;
     &lt;span class='p'&gt;{&lt;/span&gt;&lt;span class='nt'&gt;&amp;quot;name&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='s2'&gt;&amp;quot;station&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;,&lt;/span&gt; &lt;span class='nt'&gt;&amp;quot;type&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='s2'&gt;&amp;quot;string&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;},&lt;/span&gt;
     &lt;span class='p'&gt;{&lt;/span&gt;&lt;span class='nt'&gt;&amp;quot;name&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='s2'&gt;&amp;quot;time&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;,&lt;/span&gt; &lt;span class='nt'&gt;&amp;quot;type&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='s2'&gt;&amp;quot;long&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;},&lt;/span&gt;
     &lt;span class='p'&gt;{&lt;/span&gt;&lt;span class='nt'&gt;&amp;quot;name&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='s2'&gt;&amp;quot;temp&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;,&lt;/span&gt; &lt;span class='nt'&gt;&amp;quot;type&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='s2'&gt;&amp;quot;int&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;},&lt;/span&gt;
     &lt;span class='p'&gt;{&lt;/span&gt;&lt;span class='nt'&gt;&amp;quot;name&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='s2'&gt;&amp;quot;counter&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;,&lt;/span&gt; &lt;span class='nt'&gt;&amp;quot;type&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='s2'&gt;&amp;quot;int&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;,&lt;/span&gt; &lt;span class='nt'&gt;&amp;quot;default&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='mi'&gt;0&lt;/span&gt;&lt;span class='p'&gt;}&lt;/span&gt;
 &lt;span class='p'&gt;]&lt;/span&gt;
&lt;span class='p'&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;We&amp;#8217;re going to see what happens when we run this code against a small sample data set, which we&amp;#8217;ll generate using Avro code &lt;a href='https://github.com/alexholmes/avro-sorting/blob/master/src/test/java/com/alexholmes/avro/sort/AbstractAvroTest.java'&gt;GitHub source&lt;/a&gt;:&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;&lt;code class='java'&gt;&lt;span class='n'&gt;File&lt;/span&gt; &lt;span class='n'&gt;input&lt;/span&gt; &lt;span class='o'&gt;=&lt;/span&gt; &lt;span class='n'&gt;tmpFolder&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;newFile&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='s'&gt;&amp;quot;input.txt&amp;quot;&lt;/span&gt;&lt;span class='o'&gt;);&lt;/span&gt;
&lt;span class='n'&gt;AvroFiles&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;createFile&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;input&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;WeatherNoIgnore&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;SCHEMA&lt;/span&gt;&lt;span class='n'&gt;$&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;Arrays&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;asList&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;
    &lt;span class='n'&gt;WeatherNoIgnore&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;newBuilder&lt;/span&gt;&lt;span class='o'&gt;().&lt;/span&gt;&lt;span class='na'&gt;setStation&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='s'&gt;&amp;quot;SFO&amp;quot;&lt;/span&gt;&lt;span class='o'&gt;).&lt;/span&gt;&lt;span class='na'&gt;setTime&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='mi'&gt;1&lt;/span&gt;&lt;span class='o'&gt;).&lt;/span&gt;&lt;span class='na'&gt;setTemp&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='mi'&gt;3&lt;/span&gt;&lt;span class='o'&gt;).&lt;/span&gt;&lt;span class='na'&gt;build&lt;/span&gt;&lt;span class='o'&gt;(),&lt;/span&gt;
    &lt;span class='n'&gt;WeatherNoIgnore&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;newBuilder&lt;/span&gt;&lt;span class='o'&gt;().&lt;/span&gt;&lt;span class='na'&gt;setStation&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='s'&gt;&amp;quot;IAD&amp;quot;&lt;/span&gt;&lt;span class='o'&gt;).&lt;/span&gt;&lt;span class='na'&gt;setTime&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='mi'&gt;1&lt;/span&gt;&lt;span class='o'&gt;).&lt;/span&gt;&lt;span class='na'&gt;setTemp&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='mi'&gt;1&lt;/span&gt;&lt;span class='o'&gt;).&lt;/span&gt;&lt;span class='na'&gt;build&lt;/span&gt;&lt;span class='o'&gt;(),&lt;/span&gt;
    &lt;span class='n'&gt;WeatherNoIgnore&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;newBuilder&lt;/span&gt;&lt;span class='o'&gt;().&lt;/span&gt;&lt;span class='na'&gt;setStation&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='s'&gt;&amp;quot;SFO&amp;quot;&lt;/span&gt;&lt;span class='o'&gt;).&lt;/span&gt;&lt;span class='na'&gt;setTime&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='mi'&gt;2&lt;/span&gt;&lt;span class='o'&gt;).&lt;/span&gt;&lt;span class='na'&gt;setTemp&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='mi'&gt;1&lt;/span&gt;&lt;span class='o'&gt;).&lt;/span&gt;&lt;span class='na'&gt;build&lt;/span&gt;&lt;span class='o'&gt;(),&lt;/span&gt;
    &lt;span class='n'&gt;WeatherNoIgnore&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;newBuilder&lt;/span&gt;&lt;span class='o'&gt;().&lt;/span&gt;&lt;span class='na'&gt;setStation&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='s'&gt;&amp;quot;SFO&amp;quot;&lt;/span&gt;&lt;span class='o'&gt;).&lt;/span&gt;&lt;span class='na'&gt;setTime&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='mi'&gt;1&lt;/span&gt;&lt;span class='o'&gt;).&lt;/span&gt;&lt;span class='na'&gt;setTemp&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='mi'&gt;2&lt;/span&gt;&lt;span class='o'&gt;).&lt;/span&gt;&lt;span class='na'&gt;build&lt;/span&gt;&lt;span class='o'&gt;(),&lt;/span&gt;
    &lt;span class='n'&gt;WeatherNoIgnore&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;newBuilder&lt;/span&gt;&lt;span class='o'&gt;().&lt;/span&gt;&lt;span class='na'&gt;setStation&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='s'&gt;&amp;quot;SFO&amp;quot;&lt;/span&gt;&lt;span class='o'&gt;).&lt;/span&gt;&lt;span class='na'&gt;setTime&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='mi'&gt;1&lt;/span&gt;&lt;span class='o'&gt;).&lt;/span&gt;&lt;span class='na'&gt;setTemp&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='mi'&gt;1&lt;/span&gt;&lt;span class='o'&gt;).&lt;/span&gt;&lt;span class='na'&gt;build&lt;/span&gt;&lt;span class='o'&gt;()&lt;/span&gt;
&lt;span class='o'&gt;).&lt;/span&gt;&lt;span class='na'&gt;toArray&lt;/span&gt;&lt;span class='o'&gt;());&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;To understand how Avro is partitioning, sorting and grouping the data, we&amp;#8217;ll write an identity mapper and reducer, with a small enhancement to the reducer to increment the &lt;code&gt;counter&lt;/code&gt; field for each record we see in an individual reducer instance &lt;a href='https://github.com/alexholmes/avro-sorting/blob/master/src/main/java/com/alexholmes/avro/sort/basic/AvroSortDefault.java'&gt;GitHub source&lt;/a&gt;:&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;&lt;code class='java'&gt;&lt;span class='kn'&gt;package&lt;/span&gt; &lt;span class='n'&gt;com&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;alexholmes&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;avro&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;sort&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;basic&lt;/span&gt;&lt;span class='o'&gt;;&lt;/span&gt;

&lt;span class='kn'&gt;import&lt;/span&gt; &lt;span class='nn'&gt;com.alexholmes.avro.WeatherNoIgnore&lt;/span&gt;&lt;span class='o'&gt;;&lt;/span&gt;
&lt;span class='kn'&gt;import&lt;/span&gt; &lt;span class='nn'&gt;org.apache.avro.mapred.AvroKey&lt;/span&gt;&lt;span class='o'&gt;;&lt;/span&gt;
&lt;span class='kn'&gt;import&lt;/span&gt; &lt;span class='nn'&gt;org.apache.avro.mapred.AvroValue&lt;/span&gt;&lt;span class='o'&gt;;&lt;/span&gt;
&lt;span class='kn'&gt;import&lt;/span&gt; &lt;span class='nn'&gt;org.apache.avro.mapreduce.AvroJob&lt;/span&gt;&lt;span class='o'&gt;;&lt;/span&gt;
&lt;span class='kn'&gt;import&lt;/span&gt; &lt;span class='nn'&gt;org.apache.avro.mapreduce.AvroKeyInputFormat&lt;/span&gt;&lt;span class='o'&gt;;&lt;/span&gt;
&lt;span class='kn'&gt;import&lt;/span&gt; &lt;span class='nn'&gt;org.apache.avro.mapreduce.AvroKeyOutputFormat&lt;/span&gt;&lt;span class='o'&gt;;&lt;/span&gt;
&lt;span class='kn'&gt;import&lt;/span&gt; &lt;span class='nn'&gt;org.apache.hadoop.fs.Path&lt;/span&gt;&lt;span class='o'&gt;;&lt;/span&gt;
&lt;span class='kn'&gt;import&lt;/span&gt; &lt;span class='nn'&gt;org.apache.hadoop.io.NullWritable&lt;/span&gt;&lt;span class='o'&gt;;&lt;/span&gt;
&lt;span class='kn'&gt;import&lt;/span&gt; &lt;span class='nn'&gt;org.apache.hadoop.mapreduce.Job&lt;/span&gt;&lt;span class='o'&gt;;&lt;/span&gt;
&lt;span class='kn'&gt;import&lt;/span&gt; &lt;span class='nn'&gt;org.apache.hadoop.mapreduce.Mapper&lt;/span&gt;&lt;span class='o'&gt;;&lt;/span&gt;
&lt;span class='kn'&gt;import&lt;/span&gt; &lt;span class='nn'&gt;org.apache.hadoop.mapreduce.Reducer&lt;/span&gt;&lt;span class='o'&gt;;&lt;/span&gt;
&lt;span class='kn'&gt;import&lt;/span&gt; &lt;span class='nn'&gt;org.apache.hadoop.mapreduce.lib.input.FileInputFormat&lt;/span&gt;&lt;span class='o'&gt;;&lt;/span&gt;
&lt;span class='kn'&gt;import&lt;/span&gt; &lt;span class='nn'&gt;org.apache.hadoop.mapreduce.lib.output.FileOutputFormat&lt;/span&gt;&lt;span class='o'&gt;;&lt;/span&gt;

&lt;span class='kn'&gt;import&lt;/span&gt; &lt;span class='nn'&gt;java.io.IOException&lt;/span&gt;&lt;span class='o'&gt;;&lt;/span&gt;

&lt;span class='kd'&gt;public&lt;/span&gt; &lt;span class='kd'&gt;class&lt;/span&gt; &lt;span class='nc'&gt;AvroSort&lt;/span&gt; &lt;span class='o'&gt;{&lt;/span&gt;

    &lt;span class='kd'&gt;private&lt;/span&gt; &lt;span class='kd'&gt;static&lt;/span&gt; &lt;span class='kd'&gt;class&lt;/span&gt; &lt;span class='nc'&gt;SortMapper&lt;/span&gt;
            &lt;span class='kd'&gt;extends&lt;/span&gt; &lt;span class='n'&gt;Mapper&lt;/span&gt;&lt;span class='o'&gt;&amp;lt;&lt;/span&gt;&lt;span class='n'&gt;AvroKey&lt;/span&gt;&lt;span class='o'&gt;&amp;lt;&lt;/span&gt;&lt;span class='n'&gt;WeatherNoIgnore&lt;/span&gt;&lt;span class='o'&gt;&amp;gt;,&lt;/span&gt; &lt;span class='n'&gt;NullWritable&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt;
                           &lt;span class='n'&gt;AvroKey&lt;/span&gt;&lt;span class='o'&gt;&amp;lt;&lt;/span&gt;&lt;span class='n'&gt;WeatherNoIgnore&lt;/span&gt;&lt;span class='o'&gt;&amp;gt;,&lt;/span&gt; &lt;span class='n'&gt;AvroValue&lt;/span&gt;&lt;span class='o'&gt;&amp;lt;&lt;/span&gt;&lt;span class='n'&gt;WeatherNoIgnore&lt;/span&gt;&lt;span class='o'&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class='o'&gt;{&lt;/span&gt;
        &lt;span class='nd'&gt;@Override&lt;/span&gt;
        &lt;span class='kd'&gt;protected&lt;/span&gt; &lt;span class='kt'&gt;void&lt;/span&gt; &lt;span class='nf'&gt;map&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;AvroKey&lt;/span&gt;&lt;span class='o'&gt;&amp;lt;&lt;/span&gt;&lt;span class='n'&gt;WeatherNoIgnore&lt;/span&gt;&lt;span class='o'&gt;&amp;gt;&lt;/span&gt; &lt;span class='n'&gt;key&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;NullWritable&lt;/span&gt; &lt;span class='n'&gt;value&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;Context&lt;/span&gt; &lt;span class='n'&gt;context&lt;/span&gt;&lt;span class='o'&gt;)&lt;/span&gt;
                &lt;span class='kd'&gt;throws&lt;/span&gt; &lt;span class='n'&gt;IOException&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;InterruptedException&lt;/span&gt; &lt;span class='o'&gt;{&lt;/span&gt;
            &lt;span class='n'&gt;context&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;write&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;key&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='k'&gt;new&lt;/span&gt; &lt;span class='n'&gt;AvroValue&lt;/span&gt;&lt;span class='o'&gt;&amp;lt;&lt;/span&gt;&lt;span class='n'&gt;WeatherNoIgnore&lt;/span&gt;&lt;span class='o'&gt;&amp;gt;(&lt;/span&gt;&lt;span class='n'&gt;key&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;datum&lt;/span&gt;&lt;span class='o'&gt;()));&lt;/span&gt;
        &lt;span class='o'&gt;}&lt;/span&gt;
    &lt;span class='o'&gt;}&lt;/span&gt;

    &lt;span class='kd'&gt;private&lt;/span&gt; &lt;span class='kd'&gt;static&lt;/span&gt; &lt;span class='kd'&gt;class&lt;/span&gt; &lt;span class='nc'&gt;SortReducer&lt;/span&gt;
            &lt;span class='kd'&gt;extends&lt;/span&gt; &lt;span class='n'&gt;Reducer&lt;/span&gt;&lt;span class='o'&gt;&amp;lt;&lt;/span&gt;&lt;span class='n'&gt;AvroKey&lt;/span&gt;&lt;span class='o'&gt;&amp;lt;&lt;/span&gt;&lt;span class='n'&gt;WeatherNoIgnore&lt;/span&gt;&lt;span class='o'&gt;&amp;gt;,&lt;/span&gt; &lt;span class='n'&gt;AvroValue&lt;/span&gt;&lt;span class='o'&gt;&amp;lt;&lt;/span&gt;&lt;span class='n'&gt;WeatherNoIgnore&lt;/span&gt;&lt;span class='o'&gt;&amp;gt;,&lt;/span&gt;
                            &lt;span class='n'&gt;AvroKey&lt;/span&gt;&lt;span class='o'&gt;&amp;lt;&lt;/span&gt;&lt;span class='n'&gt;WeatherNoIgnore&lt;/span&gt;&lt;span class='o'&gt;&amp;gt;,&lt;/span&gt; &lt;span class='n'&gt;NullWritable&lt;/span&gt;&lt;span class='o'&gt;&amp;gt;&lt;/span&gt; &lt;span class='o'&gt;{&lt;/span&gt;
        &lt;span class='nd'&gt;@Override&lt;/span&gt;
        &lt;span class='kd'&gt;protected&lt;/span&gt; &lt;span class='kt'&gt;void&lt;/span&gt; &lt;span class='nf'&gt;reduce&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;AvroKey&lt;/span&gt;&lt;span class='o'&gt;&amp;lt;&lt;/span&gt;&lt;span class='n'&gt;WeatherNoIgnore&lt;/span&gt;&lt;span class='o'&gt;&amp;gt;&lt;/span&gt; &lt;span class='n'&gt;key&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt;
                              &lt;span class='n'&gt;Iterable&lt;/span&gt;&lt;span class='o'&gt;&amp;lt;&lt;/span&gt;&lt;span class='n'&gt;AvroValue&lt;/span&gt;&lt;span class='o'&gt;&amp;lt;&lt;/span&gt;&lt;span class='n'&gt;WeatherNoIgnore&lt;/span&gt;&lt;span class='o'&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class='n'&gt;values&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;Context&lt;/span&gt; &lt;span class='n'&gt;context&lt;/span&gt;&lt;span class='o'&gt;)&lt;/span&gt;
                &lt;span class='kd'&gt;throws&lt;/span&gt; &lt;span class='n'&gt;IOException&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;InterruptedException&lt;/span&gt; &lt;span class='o'&gt;{&lt;/span&gt;
            &lt;span class='c1'&gt;// counter restarts at 1 on every reduce() call, i.e. once per key group&lt;/span&gt;
            &lt;span class='kt'&gt;int&lt;/span&gt; &lt;span class='n'&gt;counter&lt;/span&gt; &lt;span class='o'&gt;=&lt;/span&gt; &lt;span class='mi'&gt;1&lt;/span&gt;&lt;span class='o'&gt;;&lt;/span&gt;
            &lt;span class='k'&gt;for&lt;/span&gt; &lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;AvroValue&lt;/span&gt;&lt;span class='o'&gt;&amp;lt;&lt;/span&gt;&lt;span class='n'&gt;WeatherNoIgnore&lt;/span&gt;&lt;span class='o'&gt;&amp;gt;&lt;/span&gt; &lt;span class='n'&gt;value&lt;/span&gt; &lt;span class='o'&gt;:&lt;/span&gt; &lt;span class='n'&gt;values&lt;/span&gt;&lt;span class='o'&gt;)&lt;/span&gt; &lt;span class='o'&gt;{&lt;/span&gt;
                &lt;span class='n'&gt;value&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;datum&lt;/span&gt;&lt;span class='o'&gt;().&lt;/span&gt;&lt;span class='na'&gt;setCounter&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;counter&lt;/span&gt;&lt;span class='o'&gt;++);&lt;/span&gt;
                &lt;span class='n'&gt;context&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;write&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='k'&gt;new&lt;/span&gt; &lt;span class='n'&gt;AvroKey&lt;/span&gt;&lt;span class='o'&gt;&amp;lt;&lt;/span&gt;&lt;span class='n'&gt;WeatherNoIgnore&lt;/span&gt;&lt;span class='o'&gt;&amp;gt;(&lt;/span&gt;&lt;span class='n'&gt;value&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;datum&lt;/span&gt;&lt;span class='o'&gt;()),&lt;/span&gt;
                              &lt;span class='n'&gt;NullWritable&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;get&lt;/span&gt;&lt;span class='o'&gt;());&lt;/span&gt;
            &lt;span class='o'&gt;}&lt;/span&gt;
        &lt;span class='o'&gt;}&lt;/span&gt;
    &lt;span class='o'&gt;}&lt;/span&gt;

    &lt;span class='kd'&gt;public&lt;/span&gt; &lt;span class='kt'&gt;boolean&lt;/span&gt; &lt;span class='nf'&gt;runMapReduce&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='kd'&gt;final&lt;/span&gt; &lt;span class='n'&gt;Job&lt;/span&gt; &lt;span class='n'&gt;job&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;Path&lt;/span&gt; &lt;span class='n'&gt;inputPath&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;Path&lt;/span&gt; &lt;span class='n'&gt;outputPath&lt;/span&gt;&lt;span class='o'&gt;)&lt;/span&gt;
            &lt;span class='kd'&gt;throws&lt;/span&gt; &lt;span class='n'&gt;Exception&lt;/span&gt; &lt;span class='o'&gt;{&lt;/span&gt;
        &lt;span class='n'&gt;FileInputFormat&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;setInputPaths&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;job&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;inputPath&lt;/span&gt;&lt;span class='o'&gt;);&lt;/span&gt;
        &lt;span class='n'&gt;job&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;setInputFormatClass&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;AvroKeyInputFormat&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;class&lt;/span&gt;&lt;span class='o'&gt;);&lt;/span&gt;
        &lt;span class='n'&gt;AvroJob&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;setInputKeySchema&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;job&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;WeatherNoIgnore&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;SCHEMA&lt;/span&gt;&lt;span class='n'&gt;$&lt;/span&gt;&lt;span class='o'&gt;);&lt;/span&gt;

        &lt;span class='n'&gt;job&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;setMapperClass&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;SortMapper&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;class&lt;/span&gt;&lt;span class='o'&gt;);&lt;/span&gt;
        &lt;span class='n'&gt;AvroJob&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;setMapOutputKeySchema&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;job&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;WeatherNoIgnore&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;SCHEMA&lt;/span&gt;&lt;span class='n'&gt;$&lt;/span&gt;&lt;span class='o'&gt;);&lt;/span&gt;
        &lt;span class='n'&gt;AvroJob&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;setMapOutputValueSchema&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;job&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;WeatherNoIgnore&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;SCHEMA&lt;/span&gt;&lt;span class='n'&gt;$&lt;/span&gt;&lt;span class='o'&gt;);&lt;/span&gt;

        &lt;span class='n'&gt;job&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;setReducerClass&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;SortReducer&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;class&lt;/span&gt;&lt;span class='o'&gt;);&lt;/span&gt;
        &lt;span class='n'&gt;AvroJob&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;setOutputKeySchema&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;job&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;WeatherNoIgnore&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;SCHEMA&lt;/span&gt;&lt;span class='n'&gt;$&lt;/span&gt;&lt;span class='o'&gt;);&lt;/span&gt;

        &lt;span class='n'&gt;job&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;setOutputFormatClass&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;AvroKeyOutputFormat&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;class&lt;/span&gt;&lt;span class='o'&gt;);&lt;/span&gt;
        &lt;span class='n'&gt;FileOutputFormat&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;setOutputPath&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;job&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;outputPath&lt;/span&gt;&lt;span class='o'&gt;);&lt;/span&gt;

        &lt;span class='k'&gt;return&lt;/span&gt; &lt;span class='n'&gt;job&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;waitForCompletion&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='kc'&gt;true&lt;/span&gt;&lt;span class='o'&gt;);&lt;/span&gt;
    &lt;span class='o'&gt;}&lt;/span&gt;
&lt;span class='o'&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;The job output below is sorted on all of the fields, in field ordinal order. When MapReduce sorts these records, it compares the &lt;code&gt;station&lt;/code&gt; field first, then the &lt;code&gt;time&lt;/code&gt; field, and so on, following the order in which the fields are declared in the Avro schema. This is pretty much what you&amp;#8217;d expect if you wrote your own complex &lt;code&gt;Writable&lt;/code&gt; type with a comparator that compared all the fields in order.&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;&lt;code class='json'&gt;&lt;span class='p'&gt;{&lt;/span&gt;&lt;span class='nt'&gt;&amp;quot;station&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='s2'&gt;&amp;quot;IAD&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;,&lt;/span&gt; &lt;span class='nt'&gt;&amp;quot;time&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='mi'&gt;1&lt;/span&gt;&lt;span class='p'&gt;,&lt;/span&gt; &lt;span class='nt'&gt;&amp;quot;temp&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='mi'&gt;1&lt;/span&gt;&lt;span class='p'&gt;,&lt;/span&gt; &lt;span class='nt'&gt;&amp;quot;counter&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='mi'&gt;1&lt;/span&gt;&lt;span class='p'&gt;}&lt;/span&gt;
&lt;span class='p'&gt;{&lt;/span&gt;&lt;span class='nt'&gt;&amp;quot;station&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='s2'&gt;&amp;quot;SFO&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;,&lt;/span&gt; &lt;span class='nt'&gt;&amp;quot;time&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='mi'&gt;1&lt;/span&gt;&lt;span class='p'&gt;,&lt;/span&gt; &lt;span class='nt'&gt;&amp;quot;temp&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='mi'&gt;1&lt;/span&gt;&lt;span class='p'&gt;,&lt;/span&gt; &lt;span class='nt'&gt;&amp;quot;counter&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='mi'&gt;1&lt;/span&gt;&lt;span class='p'&gt;}&lt;/span&gt;
&lt;span class='p'&gt;{&lt;/span&gt;&lt;span class='nt'&gt;&amp;quot;station&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='s2'&gt;&amp;quot;SFO&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;,&lt;/span&gt; &lt;span class='nt'&gt;&amp;quot;time&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='mi'&gt;1&lt;/span&gt;&lt;span class='p'&gt;,&lt;/span&gt; &lt;span class='nt'&gt;&amp;quot;temp&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='mi'&gt;2&lt;/span&gt;&lt;span class='p'&gt;,&lt;/span&gt; &lt;span class='nt'&gt;&amp;quot;counter&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='mi'&gt;1&lt;/span&gt;&lt;span class='p'&gt;}&lt;/span&gt;
&lt;span class='p'&gt;{&lt;/span&gt;&lt;span class='nt'&gt;&amp;quot;station&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='s2'&gt;&amp;quot;SFO&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;,&lt;/span&gt; &lt;span class='nt'&gt;&amp;quot;time&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='mi'&gt;1&lt;/span&gt;&lt;span class='p'&gt;,&lt;/span&gt; &lt;span class='nt'&gt;&amp;quot;temp&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='mi'&gt;3&lt;/span&gt;&lt;span class='p'&gt;,&lt;/span&gt; &lt;span class='nt'&gt;&amp;quot;counter&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='mi'&gt;1&lt;/span&gt;&lt;span class='p'&gt;}&lt;/span&gt;
&lt;span class='p'&gt;{&lt;/span&gt;&lt;span class='nt'&gt;&amp;quot;station&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='s2'&gt;&amp;quot;SFO&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;,&lt;/span&gt; &lt;span class='nt'&gt;&amp;quot;time&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='mi'&gt;2&lt;/span&gt;&lt;span class='p'&gt;,&lt;/span&gt; &lt;span class='nt'&gt;&amp;quot;temp&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='mi'&gt;1&lt;/span&gt;&lt;span class='p'&gt;,&lt;/span&gt; &lt;span class='nt'&gt;&amp;quot;counter&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='mi'&gt;1&lt;/span&gt;&lt;span class='p'&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;Oh, and before we move on, notice that the value of the &lt;code&gt;counter&lt;/code&gt; field is always &lt;code&gt;1&lt;/code&gt;, meaning that each reduce invocation was fed only a single key/value pair. That makes sense: our identity mapper emits a single value for each key, the keys are unique, and the MapReduce partitioner, sorter and grouper use all of the fields in the record.&lt;/p&gt;
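
&lt;p&gt;To make the ordinal comparison concrete, here is a minimal plain-Java sketch of the same idea. It is not Avro&amp;#8217;s implementation: &lt;code&gt;OrdinalOrderSketch&lt;/code&gt; and its nested &lt;code&gt;Reading&lt;/code&gt; class are hypothetical stand-ins for the generated &lt;code&gt;WeatherNoIgnore&lt;/code&gt; record, and the comparison simply walks the fields in declaration order and stops at the first difference.&lt;/p&gt;

```java
import java.util.Arrays;

final class OrdinalOrderSketch {

    // Hypothetical plain-Java stand-in for the generated WeatherNoIgnore record.
    static final class Reading {
        final String station;
        final long time;
        final int temp;
        final int counter;

        Reading(String station, long time, int temp, int counter) {
            this.station = station;
            this.time = time;
            this.temp = temp;
            this.counter = counter;
        }

        @Override
        public String toString() {
            return station + "/" + time + "/" + temp;
        }
    }

    // Compare field by field in declaration order; the first difference decides.
    static int compare(Reading a, Reading b) {
        int c = a.station.compareTo(b.station);
        if (c != 0) return c;
        c = Long.compare(a.time, b.time);
        if (c != 0) return c;
        c = Integer.compare(a.temp, b.temp);
        if (c != 0) return c;
        return Integer.compare(a.counter, b.counter);
    }

    // Return a sorted copy, leaving the caller's array untouched.
    static Reading[] sorted(Reading... readings) {
        Reading[] copy = readings.clone();
        Arrays.sort(copy, OrdinalOrderSketch::compare);
        return copy;
    }

    public static void main(String[] args) {
        Reading[] out = sorted(
                new Reading("SFO", 2, 1, 0),
                new Reading("SFO", 1, 3, 0),
                new Reading("IAD", 1, 1, 0),
                new Reading("SFO", 1, 1, 0),
                new Reading("SFO", 1, 2, 0));
        for (Reading r : out) {
            System.out.println(r);
        }
    }
}
```

&lt;p&gt;Running &lt;code&gt;main&lt;/code&gt; prints the readings in the same order as the job output above: IAD/1/1, then SFO/1/1, SFO/1/2, SFO/1/3 and SFO/2/1.&lt;/p&gt;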

&lt;h2 id='excluding_fields_for_sorting'&gt;Excluding fields for sorting&lt;/h2&gt;

&lt;p&gt;Avro gives us the ability to indicate that specific fields should be ignored when performing ordering functions. In MapReduce these fields are ignored for sorting, partitioning and grouping, which basically means that we have the ability to perform secondary sorting. Let&amp;#8217;s examine the following schema (&lt;a href='https://github.com/alexholmes/avro-sorting/blob/master/src/main/avro/weather.avsc'&gt;GitHub source&lt;/a&gt;):&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;&lt;code class='json'&gt;&lt;span class='p'&gt;{&lt;/span&gt;&lt;span class='nt'&gt;&amp;quot;type&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='s2'&gt;&amp;quot;record&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;,&lt;/span&gt; &lt;span class='nt'&gt;&amp;quot;name&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='s2'&gt;&amp;quot;com.alexholmes.avro.Weather&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;,&lt;/span&gt;
 &lt;span class='nt'&gt;&amp;quot;doc&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='s2'&gt;&amp;quot;A weather reading.&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;,&lt;/span&gt;
 &lt;span class='nt'&gt;&amp;quot;fields&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='p'&gt;[&lt;/span&gt;
     &lt;span class='p'&gt;{&lt;/span&gt;&lt;span class='nt'&gt;&amp;quot;name&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='s2'&gt;&amp;quot;station&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;,&lt;/span&gt; &lt;span class='nt'&gt;&amp;quot;type&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='s2'&gt;&amp;quot;string&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;},&lt;/span&gt;
     &lt;span class='p'&gt;{&lt;/span&gt;&lt;span class='nt'&gt;&amp;quot;name&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='s2'&gt;&amp;quot;time&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;,&lt;/span&gt; &lt;span class='nt'&gt;&amp;quot;type&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='s2'&gt;&amp;quot;long&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;},&lt;/span&gt;
     &lt;span class='p'&gt;{&lt;/span&gt;&lt;span class='nt'&gt;&amp;quot;name&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='s2'&gt;&amp;quot;temp&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;,&lt;/span&gt; &lt;span class='nt'&gt;&amp;quot;type&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='s2'&gt;&amp;quot;int&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;,&lt;/span&gt; &lt;span class='nt'&gt;&amp;quot;order&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='s2'&gt;&amp;quot;ignore&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;},&lt;/span&gt;
     &lt;span class='p'&gt;{&lt;/span&gt;&lt;span class='nt'&gt;&amp;quot;name&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='s2'&gt;&amp;quot;counter&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;,&lt;/span&gt; &lt;span class='nt'&gt;&amp;quot;type&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='s2'&gt;&amp;quot;int&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;,&lt;/span&gt; &lt;span class='nt'&gt;&amp;quot;order&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='s2'&gt;&amp;quot;ignore&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;,&lt;/span&gt; &lt;span class='nt'&gt;&amp;quot;default&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='mi'&gt;0&lt;/span&gt;&lt;span class='p'&gt;}&lt;/span&gt;
 &lt;span class='p'&gt;]&lt;/span&gt;
&lt;span class='p'&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;It&amp;#8217;s pretty much identical to the first schema; the only difference is that the last two fields have an additional &amp;#8220;order&amp;#8221; attribute whose value is set to &amp;#8220;ignore&amp;#8221;. Let&amp;#8217;s run the same MapReduce code as above, modified only to work with the new schema (&lt;a href='https://github.com/alexholmes/avro-sorting/blob/master/src/main/java/com/alexholmes/avro/sort/basic/AvroSortWithIgnores.java'&gt;GitHub source&lt;/a&gt;), and examine the output.&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;&lt;code class='json'&gt;&lt;span class='p'&gt;{&lt;/span&gt;&lt;span class='nt'&gt;&amp;quot;station&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='s2'&gt;&amp;quot;IAD&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;,&lt;/span&gt; &lt;span class='nt'&gt;&amp;quot;time&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='mi'&gt;1&lt;/span&gt;&lt;span class='p'&gt;,&lt;/span&gt; &lt;span class='nt'&gt;&amp;quot;temp&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='mi'&gt;1&lt;/span&gt;&lt;span class='p'&gt;,&lt;/span&gt; &lt;span class='nt'&gt;&amp;quot;counter&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='mi'&gt;1&lt;/span&gt;&lt;span class='p'&gt;}&lt;/span&gt;
&lt;span class='p'&gt;{&lt;/span&gt;&lt;span class='nt'&gt;&amp;quot;station&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='s2'&gt;&amp;quot;SFO&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;,&lt;/span&gt; &lt;span class='nt'&gt;&amp;quot;time&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='mi'&gt;1&lt;/span&gt;&lt;span class='p'&gt;,&lt;/span&gt; &lt;span class='nt'&gt;&amp;quot;temp&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='mi'&gt;3&lt;/span&gt;&lt;span class='p'&gt;,&lt;/span&gt; &lt;span class='nt'&gt;&amp;quot;counter&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='mi'&gt;1&lt;/span&gt;&lt;span class='p'&gt;}&lt;/span&gt;
&lt;span class='p'&gt;{&lt;/span&gt;&lt;span class='nt'&gt;&amp;quot;station&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='s2'&gt;&amp;quot;SFO&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;,&lt;/span&gt; &lt;span class='nt'&gt;&amp;quot;time&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='mi'&gt;1&lt;/span&gt;&lt;span class='p'&gt;,&lt;/span&gt; &lt;span class='nt'&gt;&amp;quot;temp&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='mi'&gt;2&lt;/span&gt;&lt;span class='p'&gt;,&lt;/span&gt; &lt;span class='nt'&gt;&amp;quot;counter&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='mi'&gt;2&lt;/span&gt;&lt;span class='p'&gt;}&lt;/span&gt;
&lt;span class='p'&gt;{&lt;/span&gt;&lt;span class='nt'&gt;&amp;quot;station&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='s2'&gt;&amp;quot;SFO&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;,&lt;/span&gt; &lt;span class='nt'&gt;&amp;quot;time&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='mi'&gt;1&lt;/span&gt;&lt;span class='p'&gt;,&lt;/span&gt; &lt;span class='nt'&gt;&amp;quot;temp&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='mi'&gt;1&lt;/span&gt;&lt;span class='p'&gt;,&lt;/span&gt; &lt;span class='nt'&gt;&amp;quot;counter&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='mi'&gt;3&lt;/span&gt;&lt;span class='p'&gt;}&lt;/span&gt;
&lt;span class='p'&gt;{&lt;/span&gt;&lt;span class='nt'&gt;&amp;quot;station&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='s2'&gt;&amp;quot;SFO&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;,&lt;/span&gt; &lt;span class='nt'&gt;&amp;quot;time&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='mi'&gt;2&lt;/span&gt;&lt;span class='p'&gt;,&lt;/span&gt; &lt;span class='nt'&gt;&amp;quot;temp&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='mi'&gt;1&lt;/span&gt;&lt;span class='p'&gt;,&lt;/span&gt; &lt;span class='nt'&gt;&amp;quot;counter&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='mi'&gt;1&lt;/span&gt;&lt;span class='p'&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;There are a couple of notable differences between this output and the output from the previous schema, which didn&amp;#8217;t have any ignored fields. First, it&amp;#8217;s clear that the &lt;code&gt;temp&lt;/code&gt; field isn&amp;#8217;t being used in the sorting, which makes sense since we specified in the schema that it should be ignored. More interestingly, note the value of the &lt;code&gt;counter&lt;/code&gt; field: all records that had identical &lt;code&gt;station&lt;/code&gt; and &lt;code&gt;time&lt;/code&gt; values went to the same reducer invocation, as evidenced by the increasing value of &lt;code&gt;counter&lt;/code&gt;. This is essentially secondary sort!&lt;/p&gt;
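
&lt;p&gt;The grouping effect can be sketched in plain Java, under the assumption that sorting and grouping both use only the non-ignored fields. Everything named here (&lt;code&gt;IgnoredFieldsSketch&lt;/code&gt;, &lt;code&gt;Reading&lt;/code&gt;, &lt;code&gt;compareKey&lt;/code&gt;, &lt;code&gt;sortAndCount&lt;/code&gt;) is hypothetical and only mimics what the job does; it does not use Avro or MapReduce.&lt;/p&gt;

```java
import java.util.Arrays;

final class IgnoredFieldsSketch {

    // Hypothetical plain-Java stand-in for the generated Weather record.
    static final class Reading {
        final String station;
        final long time;
        final int temp;
        int counter;

        Reading(String station, long time, int temp) {
            this.station = station;
            this.time = time;
            this.temp = temp;
        }
    }

    // Only station and time take part in the comparison; temp and counter
    // are skipped, mirroring "order": "ignore" in the schema.
    static int compareKey(Reading a, Reading b) {
        int c = a.station.compareTo(b.station);
        return c != 0 ? c : Long.compare(a.time, b.time);
    }

    // Sort by the key fields, then number the records inside each run of
    // equal keys, which is what the reducer's counter does per reduce() call.
    static Reading[] sortAndCount(Reading... readings) {
        Reading[] sorted = readings.clone();
        Arrays.sort(sorted, IgnoredFieldsSketch::compareKey);
        Reading previous = null;
        int counter = 1;
        for (Reading current : sorted) {
            if (previous == null || compareKey(previous, current) != 0) {
                counter = 1; // new key: MapReduce would start a new reduce() call
            }
            current.counter = counter++;
            previous = current;
        }
        return sorted;
    }

    public static void main(String[] args) {
        Reading[] out = sortAndCount(
                new Reading("SFO", 2, 1),
                new Reading("SFO", 1, 3),
                new Reading("IAD", 1, 1),
                new Reading("SFO", 1, 2),
                new Reading("SFO", 1, 1));
        for (Reading r : out) {
            System.out.println(r.station + " " + r.time + " " + r.temp + " " + r.counter);
        }
    }
}
```

&lt;p&gt;In this sketch the three SFO readings at time 1 share a key and receive counters 1, 2 and 3. They keep their input order only because &lt;code&gt;Arrays.sort&lt;/code&gt; is stable for object arrays; I would not rely on any particular order among equal keys in a real job.&lt;/p&gt;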

&lt;h2 id='sort_order'&gt;Sort order&lt;/h2&gt;

&lt;p&gt;The &lt;a href='http://avro.apache.org/docs/current/spec.html#order'&gt;Avro documentation&lt;/a&gt; explains how ordering is performed for the different Avro types. Field ordering is ascending by default, but you can make it descending by setting the value of the &amp;#8220;order&amp;#8221; attribute to &amp;#8220;descending&amp;#8221;:&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;&lt;code class='json'&gt;&lt;span class='p'&gt;{&lt;/span&gt;&lt;span class='nt'&gt;&amp;quot;type&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='s2'&gt;&amp;quot;record&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;,&lt;/span&gt; &lt;span class='nt'&gt;&amp;quot;name&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='s2'&gt;&amp;quot;com.alexholmes.avro.Weather&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;,&lt;/span&gt;
 &lt;span class='nt'&gt;&amp;quot;doc&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='s2'&gt;&amp;quot;A weather reading.&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;,&lt;/span&gt;
 &lt;span class='nt'&gt;&amp;quot;fields&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='p'&gt;[&lt;/span&gt;
     &lt;span class='p'&gt;{&lt;/span&gt;&lt;span class='nt'&gt;&amp;quot;name&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='s2'&gt;&amp;quot;station&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;,&lt;/span&gt; &lt;span class='nt'&gt;&amp;quot;type&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='s2'&gt;&amp;quot;string&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;},&lt;/span&gt;
     &lt;span class='p'&gt;{&lt;/span&gt;&lt;span class='nt'&gt;&amp;quot;name&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='s2'&gt;&amp;quot;time&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;,&lt;/span&gt; &lt;span class='nt'&gt;&amp;quot;type&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='s2'&gt;&amp;quot;long&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;},&lt;/span&gt;
     &lt;span class='p'&gt;{&lt;/span&gt;&lt;span class='nt'&gt;&amp;quot;name&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='s2'&gt;&amp;quot;temp&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;,&lt;/span&gt; &lt;span class='nt'&gt;&amp;quot;type&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='s2'&gt;&amp;quot;int&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;,&lt;/span&gt; &lt;span class='nt'&gt;&amp;quot;order&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='s2'&gt;&amp;quot;descending&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;},&lt;/span&gt;
     &lt;span class='p'&gt;{&lt;/span&gt;&lt;span class='nt'&gt;&amp;quot;name&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='s2'&gt;&amp;quot;counter&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;,&lt;/span&gt; &lt;span class='nt'&gt;&amp;quot;type&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='s2'&gt;&amp;quot;int&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;,&lt;/span&gt; &lt;span class='nt'&gt;&amp;quot;order&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='s2'&gt;&amp;quot;ignore&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;,&lt;/span&gt; &lt;span class='nt'&gt;&amp;quot;default&amp;quot;&lt;/span&gt;&lt;span class='p'&gt;:&lt;/span&gt; &lt;span class='mi'&gt;0&lt;/span&gt;&lt;span class='p'&gt;}&lt;/span&gt;
 &lt;span class='p'&gt;]&lt;/span&gt;
&lt;span class='p'&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
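&lt;p&gt;The effect of the three values that the &amp;#8220;order&amp;#8221; attribute accepts can also be sketched in plain Java. This is an illustration under assumptions, not Avro&amp;#8217;s comparator: &lt;code&gt;FieldOrderSketch&lt;/code&gt;, &lt;code&gt;Order&lt;/code&gt;, &lt;code&gt;apply&lt;/code&gt; and &lt;code&gt;Reading&lt;/code&gt; are hypothetical names, and the per-field orders are hard-coded to match the schema above.&lt;/p&gt;

```java
import java.util.Arrays;

final class FieldOrderSketch {

    // The three values a field's "order" attribute can take.
    enum Order { ASCENDING, DESCENDING, IGNORE }

    // Hypothetical plain-Java stand-in for the generated Weather record.
    static final class Reading {
        final String station;
        final long time;
        final int temp;
        final int counter;

        Reading(String station, long time, int temp, int counter) {
            this.station = station;
            this.time = time;
            this.temp = temp;
            this.counter = counter;
        }

        @Override
        public String toString() {
            return station + "/" + time + "/" + temp;
        }
    }

    // Turn a raw field comparison into that field's contribution to the result.
    static int apply(Order order, int rawComparison) {
        switch (order) {
            case DESCENDING:
                return Integer.compare(0, rawComparison); // flip the sign safely
            case IGNORE:
                return 0; // the field never influences the result
            default:
                return rawComparison;
        }
    }

    // station and time ascending, temp descending, counter ignored.
    static int compare(Reading a, Reading b) {
        int c = apply(Order.ASCENDING, a.station.compareTo(b.station));
        if (c != 0) return c;
        c = apply(Order.ASCENDING, Long.compare(a.time, b.time));
        if (c != 0) return c;
        c = apply(Order.DESCENDING, Integer.compare(a.temp, b.temp));
        if (c != 0) return c;
        return apply(Order.IGNORE, Integer.compare(a.counter, b.counter));
    }

    // Return a sorted copy, leaving the caller's array untouched.
    static Reading[] sorted(Reading... readings) {
        Reading[] copy = readings.clone();
        Arrays.sort(copy, FieldOrderSketch::compare);
        return copy;
    }

    public static void main(String[] args) {
        Reading[] out = sorted(
                new Reading("SFO", 1, 1, 0),
                new Reading("SFO", 2, 1, 0),
                new Reading("SFO", 1, 3, 0),
                new Reading("IAD", 1, 1, 0),
                new Reading("SFO", 1, 2, 0));
        for (Reading r : out) {
            System.out.println(r);
        }
    }
}
```

&lt;p&gt;With these settings the SFO readings at time 1 come out hottest first (temp 3, 2, 1), while two readings that differ only in &lt;code&gt;counter&lt;/code&gt; compare as equal.&lt;/p&gt;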
&lt;h2 id='limitations'&gt;Limitations&lt;/h2&gt;

&lt;p&gt;Now, all of this greatness isn&amp;#8217;t without some limitations:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You can&amp;#8217;t support two MapReduce jobs that use the same Avro key but have different sorting/partitioning/grouping requirements, although it&amp;#8217;s conceivable that you could create a new instance of the Avro schema and set the ignored flags for these fields yourself.&lt;/li&gt;

&lt;li&gt;The partitioner, sorter and grouping functions in MapReduce all work off of the same fields (i.e. they all ignore fields that are set as ignored in the schema). This means that your options for secondary sorting are limited. For example, you wouldn&amp;#8217;t be able to partition all stations to the same reducer, and then group by station and time.&lt;/li&gt;

&lt;li&gt;Ordering uses a field&amp;#8217;s ordinal position to determine its order within the overall set of fields to be ordered. In other words, in a two-field record, the first field is always compared before the second. There&amp;#8217;s no way to change this behavior other than flipping the order of the fields in the record.&lt;/li&gt;
&lt;/ol&gt;
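&lt;p&gt;To make the third limitation concrete, here is a minimal plain-Java sketch of a field-by-field comparison that walks fields in declaration order. It is not Avro&amp;#8217;s implementation: the class name, the &lt;code&gt;compare&lt;/code&gt; helper and the three integer fields are hypothetical, and the order strings simply mirror the values of the &lt;code&gt;order&lt;/code&gt; attribute in the schema.&lt;/p&gt;

```java
public class OrdinalOrderSketch {

    // Compare two records field by field, in declaration order.
    // orders[i] is "ascending", "descending" or "ignore" for field i.
    // Hypothetical helper for illustration only, not part of the Avro API.
    static int compare(int[] left, int[] right, String[] orders) {
        for (int i = 0; i != orders.length; i++) {
            if (orders[i].equals("ignore")) {
                continue; // ignored fields never influence the result
            }
            int c = Integer.compare(left[i], right[i]);
            if (c != 0) {
                // the first field that differs decides; descending flips the sign
                return orders[i].equals("descending") ? -c : c;
            }
        }
        return 0; // equal on every field that takes part in ordering
    }

    public static void main(String[] args) {
        String[] orders = {"ascending", "descending", "ignore"};
        // Hypothetical records laid out as: id, temp, counter.
        int[] a = {1, 30, 7};
        int[] b = {1, 25, 9};
        System.out.println(compare(a, b, orders)); // prints -1
    }
}
```

&lt;p&gt;Because the loop walks the fields by position, the only way to have the second field compared before the first is to declare it first in the record.&lt;/p&gt;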

&lt;p&gt;Having said all of that - the &amp;#8220;ignoring fields&amp;#8221; feature for sorting is pretty awesome, and something that will no doubt come in handy in my future MapReduce work.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Using Avro's code generation from Maven</title>
   
   <category term="--" />
   
   <category term="avro" />
   
   <link href="http://grepalex.com/2013/05/24/avro-maven"/>
   <updated>2013-05-24T14:20:00+00:00</updated>
   <id>http://grepalex.com/2013/05/24/avro-maven</id>
   <content type="html">&lt;p&gt;&lt;a href='http://avro.apache.org/'&gt;Avro&lt;/a&gt; has the ability to generate Java code from Avro schema, IDL and protocol files. Avro also has a plugin which allows you to generate these Java sources directly from Maven, which is a good idea as it avoids issues that can arise if your schema/protocol files stray from the checked-in code generated equivalents.&lt;/p&gt;

&lt;p&gt;Today I created a simple GitHub project called &lt;a href='https://github.com/alexholmes/avro-maven'&gt;avro-maven&lt;/a&gt; because I had to fiddle a bit to get Avro and Maven to play nice. The GitHub project is self-contained and also has a README which goes over the basics. In this post I&amp;#8217;ll go over how to use Maven to generate code for schema, IDL and protocol files.&lt;/p&gt;

&lt;h1 id='pomxml_updates_to_support_the_avro_plugin'&gt;pom.xml updates to support the Avro plugin&lt;/h1&gt;

&lt;p&gt;Avro schema files only define types, whereas IDL and protocol files model types as well as RPC semantics such as messages. The only difference between IDL and protocol files is that IDL files use Avro&amp;#8217;s DSL for specifying RPC, whereas protocol files express the same information in JSON.&lt;/p&gt;

&lt;p&gt;Each type of file has an entry that can be used in the &lt;code&gt;goals&lt;/code&gt; element as can be seen below. All three can be used together, or if you only have schema files you can safely remove the &lt;code&gt;protocol&lt;/code&gt; and &lt;code&gt;idl-protocol&lt;/code&gt; entries (and vice-versa).&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;&lt;code class='xml'&gt;&lt;span class='nt'&gt;&amp;lt;plugin&amp;gt;&lt;/span&gt;
  &lt;span class='nt'&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;org.apache.avro&lt;span class='nt'&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
  &lt;span class='nt'&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;avro-maven-plugin&lt;span class='nt'&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
  &lt;span class='nt'&gt;&amp;lt;version&amp;gt;&lt;/span&gt;${avro.version}&lt;span class='nt'&gt;&amp;lt;/version&amp;gt;&lt;/span&gt;
  &lt;span class='nt'&gt;&amp;lt;executions&amp;gt;&lt;/span&gt;
    &lt;span class='nt'&gt;&amp;lt;execution&amp;gt;&lt;/span&gt;
      &lt;span class='nt'&gt;&amp;lt;phase&amp;gt;&lt;/span&gt;generate-sources&lt;span class='nt'&gt;&amp;lt;/phase&amp;gt;&lt;/span&gt;
      &lt;span class='nt'&gt;&amp;lt;goals&amp;gt;&lt;/span&gt;
        &lt;span class='nt'&gt;&amp;lt;goal&amp;gt;&lt;/span&gt;schema&lt;span class='nt'&gt;&amp;lt;/goal&amp;gt;&lt;/span&gt;
        &lt;span class='nt'&gt;&amp;lt;goal&amp;gt;&lt;/span&gt;protocol&lt;span class='nt'&gt;&amp;lt;/goal&amp;gt;&lt;/span&gt;
        &lt;span class='nt'&gt;&amp;lt;goal&amp;gt;&lt;/span&gt;idl-protocol&lt;span class='nt'&gt;&amp;lt;/goal&amp;gt;&lt;/span&gt;
      &lt;span class='nt'&gt;&amp;lt;/goals&amp;gt;&lt;/span&gt;
    &lt;span class='nt'&gt;&amp;lt;/execution&amp;gt;&lt;/span&gt;
  &lt;span class='nt'&gt;&amp;lt;/executions&amp;gt;&lt;/span&gt;
&lt;span class='nt'&gt;&amp;lt;/plugin&amp;gt;&lt;/span&gt;

...

&lt;span class='nt'&gt;&amp;lt;dependencies&amp;gt;&lt;/span&gt;
  &lt;span class='nt'&gt;&amp;lt;dependency&amp;gt;&lt;/span&gt;
    &lt;span class='nt'&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;org.apache.avro&lt;span class='nt'&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
    &lt;span class='nt'&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;avro&lt;span class='nt'&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
    &lt;span class='nt'&gt;&amp;lt;version&amp;gt;&lt;/span&gt;${avro.version}&lt;span class='nt'&gt;&amp;lt;/version&amp;gt;&lt;/span&gt;
  &lt;span class='nt'&gt;&amp;lt;/dependency&amp;gt;&lt;/span&gt;
  &lt;span class='nt'&gt;&amp;lt;dependency&amp;gt;&lt;/span&gt;
    &lt;span class='nt'&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;org.apache.avro&lt;span class='nt'&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
    &lt;span class='nt'&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;avro-maven-plugin&lt;span class='nt'&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
    &lt;span class='nt'&gt;&amp;lt;version&amp;gt;&lt;/span&gt;${avro.version}&lt;span class='nt'&gt;&amp;lt;/version&amp;gt;&lt;/span&gt;
  &lt;span class='nt'&gt;&amp;lt;/dependency&amp;gt;&lt;/span&gt;
  &lt;span class='nt'&gt;&amp;lt;dependency&amp;gt;&lt;/span&gt;
    &lt;span class='nt'&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;org.apache.avro&lt;span class='nt'&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
    &lt;span class='nt'&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;avro-compiler&lt;span class='nt'&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
    &lt;span class='nt'&gt;&amp;lt;version&amp;gt;&lt;/span&gt;${avro.version}&lt;span class='nt'&gt;&amp;lt;/version&amp;gt;&lt;/span&gt;
  &lt;span class='nt'&gt;&amp;lt;/dependency&amp;gt;&lt;/span&gt;
  &lt;span class='nt'&gt;&amp;lt;dependency&amp;gt;&lt;/span&gt;
    &lt;span class='nt'&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;org.apache.avro&lt;span class='nt'&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
    &lt;span class='nt'&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;avro-ipc&lt;span class='nt'&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
    &lt;span class='nt'&gt;&amp;lt;version&amp;gt;&lt;/span&gt;${avro.version}&lt;span class='nt'&gt;&amp;lt;/version&amp;gt;&lt;/span&gt;
  &lt;span class='nt'&gt;&amp;lt;/dependency&amp;gt;&lt;/span&gt;
&lt;span class='nt'&gt;&amp;lt;/dependencies&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;By default the plugin assumes that your Avro sources are located in &lt;code&gt;${basedir}/src/main/avro&lt;/code&gt;, and that you want your generated sources to be written to &lt;code&gt;${project.build.directory}/generated-sources/avro&lt;/code&gt;, where &lt;code&gt;${project.build.directory}&lt;/code&gt; is typically the &lt;code&gt;target&lt;/code&gt; directory. Keep reading if you want to change any of these settings.&lt;/p&gt;

&lt;h1 id='avro_configurables'&gt;Avro configurables&lt;/h1&gt;

&lt;p&gt;Luckily Avro&amp;#8217;s Maven plugin offers the ability to customize various code generation settings. The following table shows the configurables that can be used for any of the schema, IDL and protocol code generators.&lt;/p&gt;
&lt;table&gt;
    &lt;tr&gt;
        &lt;td&gt;&lt;strong&gt;Configurable&lt;/strong&gt;&lt;/td&gt;
        &lt;td&gt;&lt;strong&gt;Default value&lt;/strong&gt;&lt;/td&gt;
        &lt;td&gt;&lt;strong&gt;Description&lt;/strong&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;sourceDirectory&lt;/td&gt;
        &lt;td&gt;${basedir}/src/main/avro&lt;/td&gt;
        &lt;td&gt;The Avro source directory for schema, protocol and IDL files.&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;outputDirectory&lt;/td&gt;
        &lt;td&gt;${project.build.directory}/generated-sources/avro&lt;/td&gt;
        &lt;td&gt;The directory where Avro writes code-generated sources.&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;testSourceDirectory&lt;/td&gt;
        &lt;td&gt;${basedir}/src/test/avro&lt;/td&gt;
        &lt;td&gt;The input directory containing any Avro files used in testing.&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;testOutputDirectory&lt;/td&gt;
        &lt;td&gt;${project.build.directory}/generated-test-sources/avro&lt;/td&gt;
        &lt;td&gt;The output directory where Avro writes code-generated files for your testing purposes.&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;fieldVisibility&lt;/td&gt;
        &lt;td&gt;PUBLIC_DEPRECATED&lt;/td&gt;
        &lt;td&gt;Determines the accessibility of fields (e.g. whether they are public or private).
        Must be one of PUBLIC, PUBLIC_DEPRECATED or PRIVATE. PUBLIC_DEPRECATED merely adds a
        deprecated annotation to each field, e.g. &quot;@Deprecated public long time&quot;.&lt;/td&gt;
    &lt;/tr&gt;
&lt;/table&gt;
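&lt;p&gt;As a rough illustration of what &lt;code&gt;fieldVisibility&lt;/code&gt; changes, the hand-written sketch below contrasts a record with a &lt;code&gt;PUBLIC_DEPRECATED&lt;/code&gt; field against one with a &lt;code&gt;PRIVATE&lt;/code&gt; field. This is not real Avro compiler output: the class names are made up, and actual generated classes contain considerably more (schema, builder and serialization code).&lt;/p&gt;

```java
// Hand-written sketch, NOT actual Avro compiler output.
public class FieldVisibilitySketch {

    // Roughly the shape of a "time" field under fieldVisibility=PUBLIC_DEPRECATED.
    static class PublicDeprecatedRecord {
        @Deprecated public long time;

        public long getTime() { return time; }
        public void setTime(long value) { this.time = value; }
    }

    // Roughly the shape under fieldVisibility=PRIVATE: accessors are the only way in.
    static class PrivateRecord {
        private long time;

        public long getTime() { return time; }
        public void setTime(long value) { this.time = value; }
    }

    public static void main(String[] args) {
        PublicDeprecatedRecord a = new PublicDeprecatedRecord();
        a.time = 42L; // direct field access still compiles, but the field is flagged as deprecated

        PrivateRecord b = new PrivateRecord();
        b.setTime(42L); // callers in other classes must go through the setter

        System.out.println(a.getTime() + " " + b.getTime()); // prints 42 42
    }
}
```

&lt;p&gt;If nothing in your code base touches fields directly, &lt;code&gt;PRIVATE&lt;/code&gt; is the stricter choice, since it forces all access through the accessors.&lt;/p&gt;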
&lt;p&gt;In addition, the &lt;code&gt;includes&lt;/code&gt; and &lt;code&gt;testIncludes&lt;/code&gt; configurables can also be used to specify alternative file extensions to the defaults, which are &lt;code&gt;**/*.avsc&lt;/code&gt;, &lt;code&gt;**/*.avpr&lt;/code&gt; and &lt;code&gt;**/*.avdl&lt;/code&gt; for schema, protocol and IDL files respectively.&lt;/p&gt;

&lt;p&gt;Let&amp;#8217;s look at an example of how we can specify all of these options for schema compilation.&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;&lt;code class='xml'&gt;&lt;span class='nt'&gt;&amp;lt;plugin&amp;gt;&lt;/span&gt;
  &lt;span class='nt'&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;org.apache.avro&lt;span class='nt'&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
  &lt;span class='nt'&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;avro-maven-plugin&lt;span class='nt'&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
  &lt;span class='nt'&gt;&amp;lt;version&amp;gt;&lt;/span&gt;${avro.version}&lt;span class='nt'&gt;&amp;lt;/version&amp;gt;&lt;/span&gt;
  &lt;span class='nt'&gt;&amp;lt;executions&amp;gt;&lt;/span&gt;
    &lt;span class='nt'&gt;&amp;lt;execution&amp;gt;&lt;/span&gt;
      &lt;span class='nt'&gt;&amp;lt;phase&amp;gt;&lt;/span&gt;generate-sources&lt;span class='nt'&gt;&amp;lt;/phase&amp;gt;&lt;/span&gt;
      &lt;span class='nt'&gt;&amp;lt;goals&amp;gt;&lt;/span&gt;
        &lt;span class='nt'&gt;&amp;lt;goal&amp;gt;&lt;/span&gt;schema&lt;span class='nt'&gt;&amp;lt;/goal&amp;gt;&lt;/span&gt;
      &lt;span class='nt'&gt;&amp;lt;/goals&amp;gt;&lt;/span&gt;
      &lt;span class='nt'&gt;&amp;lt;configuration&amp;gt;&lt;/span&gt;
        &lt;span class='nt'&gt;&amp;lt;sourceDirectory&amp;gt;&lt;/span&gt;${project.basedir}/src/main/myavro/&lt;span class='nt'&gt;&amp;lt;/sourceDirectory&amp;gt;&lt;/span&gt;
        &lt;span class='nt'&gt;&amp;lt;outputDirectory&amp;gt;&lt;/span&gt;${project.basedir}/src/main/java/&lt;span class='nt'&gt;&amp;lt;/outputDirectory&amp;gt;&lt;/span&gt;
        &lt;span class='nt'&gt;&amp;lt;testSourceDirectory&amp;gt;&lt;/span&gt;${project.basedir}/src/main/myavro/&lt;span class='nt'&gt;&amp;lt;/testSourceDirectory&amp;gt;&lt;/span&gt;
        &lt;span class='nt'&gt;&amp;lt;testOutputDirectory&amp;gt;&lt;/span&gt;${project.basedir}/src/test/java/&lt;span class='nt'&gt;&amp;lt;/testOutputDirectory&amp;gt;&lt;/span&gt;
        &lt;span class='nt'&gt;&amp;lt;fieldVisibility&amp;gt;&lt;/span&gt;PRIVATE&lt;span class='nt'&gt;&amp;lt;/fieldVisibility&amp;gt;&lt;/span&gt;
        &lt;span class='nt'&gt;&amp;lt;includes&amp;gt;&lt;/span&gt;
          &lt;span class='nt'&gt;&amp;lt;include&amp;gt;&lt;/span&gt;**/*.avro&lt;span class='nt'&gt;&amp;lt;/include&amp;gt;&lt;/span&gt;
        &lt;span class='nt'&gt;&amp;lt;/includes&amp;gt;&lt;/span&gt;
        &lt;span class='nt'&gt;&amp;lt;testIncludes&amp;gt;&lt;/span&gt;
          &lt;span class='nt'&gt;&amp;lt;testInclude&amp;gt;&lt;/span&gt;**/*.test&lt;span class='nt'&gt;&amp;lt;/testInclude&amp;gt;&lt;/span&gt;
        &lt;span class='nt'&gt;&amp;lt;/testIncludes&amp;gt;&lt;/span&gt;
      &lt;span class='nt'&gt;&amp;lt;/configuration&amp;gt;&lt;/span&gt;
    &lt;span class='nt'&gt;&amp;lt;/execution&amp;gt;&lt;/span&gt;
  &lt;span class='nt'&gt;&amp;lt;/executions&amp;gt;&lt;/span&gt;
&lt;span class='nt'&gt;&amp;lt;/plugin&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;As a reminder everything covered in this blog article can be seen in action in the GitHub repo at &lt;a href='https://github.com/alexholmes/avro-maven'&gt;https://github.com/alexholmes/avro-maven&lt;/a&gt;.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Bucketing, multiplexing and combining in Hadoop - part 1</title>
   
   <category term="--" />
   
   <category term="hadoop" />
   
   <link href="http://grepalex.com/2013/05/20/multipleoutputs-part1"/>
   <updated>2013-05-20T14:20:00+00:00</updated>
   <id>http://grepalex.com/2013/05/20/multipleoutputs-part1</id>
   <content type="html">&lt;p&gt;This is the first blog post in a series which looks at some data organization patterns in MapReduce. We&amp;#8217;ll look at how to bucket output across multiple files in a single task, how to multiplex data across multiple files, and also how to coalesce data. These are all common patterns that are useful to have in your MapReduce toolkit.&lt;/p&gt;

&lt;p&gt;We&amp;#8217;ll kick things off with a look at bucketing data outputs in your map or reduce tasks. By default when using a FileOutputFormat-derived OutputFormat (such as TextOutputFormat), all the outputs for a reduce task (or a map task in a map-only job) are written to a single file in HDFS.&lt;/p&gt;

&lt;p&gt;&lt;img alt='Image of single output file for a task' src='/images/mof-textoutputformat.png' /&gt;&lt;/p&gt;

&lt;p&gt;Imagine a situation where you have user activity logs being streamed into HDFS, and you want to write a MapReduce job to better organize the incoming data. As an example a large organization with multiple products may want to bucket the logs based on the product. To do this you&amp;#8217;ll need the ability to write to multiple output files in a single task. Let&amp;#8217;s take a look at how we can make that happen.&lt;/p&gt;

&lt;h1 id='multipleoutputformat'&gt;MultipleOutputFormat&lt;/h1&gt;

&lt;p&gt;There are a few ways you can achieve your goal, and the first option we&amp;#8217;ll look at is the &lt;code&gt;MultipleOutputFormat&lt;/code&gt; class in Hadoop. This is an abstract class that lets you do the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Define the output path for each and every key/value output record being emitted by a task.&lt;/li&gt;

&lt;li&gt;Incorporate the input paths into the output directory for map-only jobs.&lt;/li&gt;

&lt;li&gt;Redefine the key and value that are used to write to the underlying &lt;code&gt;RecordWriter&lt;/code&gt;. This is useful in situations where you want to remove data from the outputs as it duplicates data in the filename.&lt;/li&gt;

&lt;li&gt;For each output path, define the &lt;code&gt;RecordWriter&lt;/code&gt; that should be used to write the outputs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img alt='Image of how MultipleOutputFormat works' src='/images/mof-multipleoutputformat.png' /&gt;&lt;/p&gt;

&lt;p&gt;OK enough with the words - let&amp;#8217;s look at some data and code. First up is the simple data we&amp;#8217;ll use in our example - imagine you work at a fruit market with locations in multiple cities, and you have a purchase transaction stream which contains the store location along with the fruit that was purchased.&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;&lt;code class='bash'&gt;cupertino   apple
sunnyvale   banana
cupertino   pear
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;To help bucket your data for future analysis, you want to bin each record into city-specific files. For the simple data set above you don&amp;#8217;t want to filter, project or transform your data, just bucket it out, so a simple identity map-only job will do the job. To force more than one mapper, we&amp;#8217;ll write the data to two separate files.&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;&lt;code class='bash'&gt;&lt;span class='nv'&gt;$ TAB&lt;/span&gt;&lt;span class='o'&gt;=&lt;/span&gt;&lt;span class='s2'&gt;&amp;quot;$(printf &amp;#39;\t&amp;#39;)&amp;quot;&lt;/span&gt;
&lt;span class='nv'&gt;$ &lt;/span&gt;hadoop fs -put - file1.txt &lt;span class='s'&gt;&amp;lt;&amp;lt; EOF&lt;/span&gt;
&lt;span class='s'&gt;cupertino${TAB}apple&lt;/span&gt;
&lt;span class='s'&gt;sunnyvale${TAB}banana&lt;/span&gt;
&lt;span class='s'&gt;EOF&lt;/span&gt;

&lt;span class='nv'&gt;$ &lt;/span&gt;hadoop fs -put - file2.txt &lt;span class='s'&gt;&amp;lt;&amp;lt; EOF&lt;/span&gt;
&lt;span class='s'&gt;cupertino${TAB}pear&lt;/span&gt;
&lt;span class='s'&gt;EOF&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;Here&amp;#8217;s the code which will let you write city-specific output files.&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;&lt;code class='java'&gt;&lt;span class='kn'&gt;import&lt;/span&gt; &lt;span class='nn'&gt;org.apache.commons.lang.StringUtils&lt;/span&gt;&lt;span class='o'&gt;;&lt;/span&gt;
&lt;span class='kn'&gt;import&lt;/span&gt; &lt;span class='nn'&gt;org.apache.hadoop.conf.Configuration&lt;/span&gt;&lt;span class='o'&gt;;&lt;/span&gt;
&lt;span class='kn'&gt;import&lt;/span&gt; &lt;span class='nn'&gt;org.apache.hadoop.conf.Configured&lt;/span&gt;&lt;span class='o'&gt;;&lt;/span&gt;
&lt;span class='kn'&gt;import&lt;/span&gt; &lt;span class='nn'&gt;org.apache.hadoop.fs.FileSystem&lt;/span&gt;&lt;span class='o'&gt;;&lt;/span&gt;
&lt;span class='kn'&gt;import&lt;/span&gt; &lt;span class='nn'&gt;org.apache.hadoop.fs.Path&lt;/span&gt;&lt;span class='o'&gt;;&lt;/span&gt;
&lt;span class='kn'&gt;import&lt;/span&gt; &lt;span class='nn'&gt;org.apache.hadoop.io.Text&lt;/span&gt;&lt;span class='o'&gt;;&lt;/span&gt;
&lt;span class='kn'&gt;import&lt;/span&gt; &lt;span class='nn'&gt;org.apache.hadoop.mapred.*&lt;/span&gt;&lt;span class='o'&gt;;&lt;/span&gt;
&lt;span class='kn'&gt;import&lt;/span&gt; &lt;span class='nn'&gt;org.apache.hadoop.mapred.lib.IdentityMapper&lt;/span&gt;&lt;span class='o'&gt;;&lt;/span&gt;
&lt;span class='kn'&gt;import&lt;/span&gt; &lt;span class='nn'&gt;org.apache.hadoop.mapred.lib.MultipleTextOutputFormat&lt;/span&gt;&lt;span class='o'&gt;;&lt;/span&gt;
&lt;span class='kn'&gt;import&lt;/span&gt; &lt;span class='nn'&gt;org.apache.hadoop.util.Progressable&lt;/span&gt;&lt;span class='o'&gt;;&lt;/span&gt;
&lt;span class='kn'&gt;import&lt;/span&gt; &lt;span class='nn'&gt;org.apache.hadoop.util.Tool&lt;/span&gt;&lt;span class='o'&gt;;&lt;/span&gt;
&lt;span class='kn'&gt;import&lt;/span&gt; &lt;span class='nn'&gt;org.apache.hadoop.util.ToolRunner&lt;/span&gt;&lt;span class='o'&gt;;&lt;/span&gt;

&lt;span class='kn'&gt;import&lt;/span&gt; &lt;span class='nn'&gt;java.io.IOException&lt;/span&gt;&lt;span class='o'&gt;;&lt;/span&gt;
&lt;span class='kn'&gt;import&lt;/span&gt; &lt;span class='nn'&gt;java.util.Arrays&lt;/span&gt;&lt;span class='o'&gt;;&lt;/span&gt;

&lt;span class='cm'&gt;/**&lt;/span&gt;
&lt;span class='cm'&gt; * An example of how to use {@link org.apache.hadoop.mapred.lib.MultipleOutputFormat}.&lt;/span&gt;
&lt;span class='cm'&gt; */&lt;/span&gt;
&lt;span class='kd'&gt;public&lt;/span&gt; &lt;span class='kd'&gt;class&lt;/span&gt; &lt;span class='nc'&gt;MOFExample&lt;/span&gt; &lt;span class='kd'&gt;extends&lt;/span&gt; &lt;span class='n'&gt;Configured&lt;/span&gt; &lt;span class='kd'&gt;implements&lt;/span&gt; &lt;span class='n'&gt;Tool&lt;/span&gt; &lt;span class='o'&gt;{&lt;/span&gt;

    &lt;span class='cm'&gt;/**&lt;/span&gt;
&lt;span class='cm'&gt;     * Create output files based on the output record&amp;#39;s key name.&lt;/span&gt;
&lt;span class='cm'&gt;     */&lt;/span&gt;
    &lt;span class='kd'&gt;static&lt;/span&gt; &lt;span class='kd'&gt;class&lt;/span&gt; &lt;span class='nc'&gt;KeyBasedMultipleTextOutputFormat&lt;/span&gt;
                 &lt;span class='kd'&gt;extends&lt;/span&gt; &lt;span class='n'&gt;MultipleTextOutputFormat&lt;/span&gt;&lt;span class='o'&gt;&amp;lt;&lt;/span&gt;&lt;span class='n'&gt;Text&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;Text&lt;/span&gt;&lt;span class='o'&gt;&amp;gt;&lt;/span&gt; &lt;span class='o'&gt;{&lt;/span&gt;
        &lt;span class='nd'&gt;@Override&lt;/span&gt;
        &lt;span class='kd'&gt;protected&lt;/span&gt; &lt;span class='n'&gt;String&lt;/span&gt; &lt;span class='nf'&gt;generateFileNameForKeyValue&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;Text&lt;/span&gt; &lt;span class='n'&gt;key&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;Text&lt;/span&gt; &lt;span class='n'&gt;value&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;String&lt;/span&gt; &lt;span class='n'&gt;name&lt;/span&gt;&lt;span class='o'&gt;)&lt;/span&gt; &lt;span class='o'&gt;{&lt;/span&gt;
            &lt;span class='k'&gt;return&lt;/span&gt; &lt;span class='n'&gt;key&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;toString&lt;/span&gt;&lt;span class='o'&gt;()&lt;/span&gt; &lt;span class='o'&gt;+&lt;/span&gt; &lt;span class='s'&gt;&amp;quot;/&amp;quot;&lt;/span&gt; &lt;span class='o'&gt;+&lt;/span&gt; &lt;span class='n'&gt;name&lt;/span&gt;&lt;span class='o'&gt;;&lt;/span&gt;
        &lt;span class='o'&gt;}&lt;/span&gt;
    &lt;span class='o'&gt;}&lt;/span&gt;

    &lt;span class='cm'&gt;/**&lt;/span&gt;
&lt;span class='cm'&gt;     * The main job driver.&lt;/span&gt;
&lt;span class='cm'&gt;     */&lt;/span&gt;
    &lt;span class='kd'&gt;public&lt;/span&gt; &lt;span class='kt'&gt;int&lt;/span&gt; &lt;span class='nf'&gt;run&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='kd'&gt;final&lt;/span&gt; &lt;span class='n'&gt;String&lt;/span&gt;&lt;span class='o'&gt;[]&lt;/span&gt; &lt;span class='n'&gt;args&lt;/span&gt;&lt;span class='o'&gt;)&lt;/span&gt; &lt;span class='kd'&gt;throws&lt;/span&gt; &lt;span class='n'&gt;Exception&lt;/span&gt; &lt;span class='o'&gt;{&lt;/span&gt;
        &lt;span class='n'&gt;String&lt;/span&gt; &lt;span class='n'&gt;csvInputs&lt;/span&gt; &lt;span class='o'&gt;=&lt;/span&gt; &lt;span class='n'&gt;StringUtils&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;join&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;Arrays&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;copyOfRange&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;args&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='mi'&gt;0&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;args&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;length&lt;/span&gt; &lt;span class='o'&gt;-&lt;/span&gt; &lt;span class='mi'&gt;1&lt;/span&gt;&lt;span class='o'&gt;),&lt;/span&gt; &lt;span class='s'&gt;&amp;quot;,&amp;quot;&lt;/span&gt;&lt;span class='o'&gt;);&lt;/span&gt;
        &lt;span class='n'&gt;Path&lt;/span&gt; &lt;span class='n'&gt;outputDir&lt;/span&gt; &lt;span class='o'&gt;=&lt;/span&gt; &lt;span class='k'&gt;new&lt;/span&gt; &lt;span class='n'&gt;Path&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;args&lt;/span&gt;&lt;span class='o'&gt;[&lt;/span&gt;&lt;span class='n'&gt;args&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;length&lt;/span&gt; &lt;span class='o'&gt;-&lt;/span&gt; &lt;span class='mi'&gt;1&lt;/span&gt;&lt;span class='o'&gt;]);&lt;/span&gt;

        &lt;span class='n'&gt;JobConf&lt;/span&gt; &lt;span class='n'&gt;jobConf&lt;/span&gt; &lt;span class='o'&gt;=&lt;/span&gt; &lt;span class='k'&gt;new&lt;/span&gt; &lt;span class='n'&gt;JobConf&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='kd'&gt;super&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;getConf&lt;/span&gt;&lt;span class='o'&gt;());&lt;/span&gt;
        &lt;span class='n'&gt;jobConf&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;setJarByClass&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;MOFExample&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;class&lt;/span&gt;&lt;span class='o'&gt;);&lt;/span&gt;
        &lt;span class='n'&gt;jobConf&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;setNumReduceTasks&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='mi'&gt;0&lt;/span&gt;&lt;span class='o'&gt;);&lt;/span&gt;
        &lt;span class='n'&gt;jobConf&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;setMapperClass&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;IdentityMapper&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;class&lt;/span&gt;&lt;span class='o'&gt;);&lt;/span&gt;

        &lt;span class='n'&gt;jobConf&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;setInputFormat&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;KeyValueTextInputFormat&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;class&lt;/span&gt;&lt;span class='o'&gt;);&lt;/span&gt;
        &lt;span class='n'&gt;jobConf&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;setOutputFormat&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;KeyBasedMultipleTextOutputFormat&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;class&lt;/span&gt;&lt;span class='o'&gt;);&lt;/span&gt;

        &lt;span class='n'&gt;FileInputFormat&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;setInputPaths&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;jobConf&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;csvInputs&lt;/span&gt;&lt;span class='o'&gt;);&lt;/span&gt;
        &lt;span class='n'&gt;FileOutputFormat&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;setOutputPath&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;jobConf&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;outputDir&lt;/span&gt;&lt;span class='o'&gt;);&lt;/span&gt;

        &lt;span class='k'&gt;return&lt;/span&gt; &lt;span class='n'&gt;JobClient&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;runJob&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;jobConf&lt;/span&gt;&lt;span class='o'&gt;).&lt;/span&gt;&lt;span class='na'&gt;isSuccessful&lt;/span&gt;&lt;span class='o'&gt;()&lt;/span&gt; &lt;span class='o'&gt;?&lt;/span&gt; &lt;span class='mi'&gt;0&lt;/span&gt; &lt;span class='o'&gt;:&lt;/span&gt; &lt;span class='mi'&gt;1&lt;/span&gt;&lt;span class='o'&gt;;&lt;/span&gt;
    &lt;span class='o'&gt;}&lt;/span&gt;

    &lt;span class='cm'&gt;/**&lt;/span&gt;
&lt;span class='cm'&gt;     * Main entry point for the utility.&lt;/span&gt;
&lt;span class='cm'&gt;     *&lt;/span&gt;
&lt;span class='cm'&gt;     * @param args arguments&lt;/span&gt;
&lt;span class='cm'&gt;     * @throws Exception when something goes wrong&lt;/span&gt;
&lt;span class='cm'&gt;     */&lt;/span&gt;
    &lt;span class='kd'&gt;public&lt;/span&gt; &lt;span class='kd'&gt;static&lt;/span&gt; &lt;span class='kt'&gt;void&lt;/span&gt; &lt;span class='nf'&gt;main&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='kd'&gt;final&lt;/span&gt; &lt;span class='n'&gt;String&lt;/span&gt;&lt;span class='o'&gt;[]&lt;/span&gt; &lt;span class='n'&gt;args&lt;/span&gt;&lt;span class='o'&gt;)&lt;/span&gt; &lt;span class='kd'&gt;throws&lt;/span&gt; &lt;span class='n'&gt;Exception&lt;/span&gt; &lt;span class='o'&gt;{&lt;/span&gt;
        &lt;span class='kt'&gt;int&lt;/span&gt; &lt;span class='n'&gt;res&lt;/span&gt; &lt;span class='o'&gt;=&lt;/span&gt; &lt;span class='n'&gt;ToolRunner&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;run&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='k'&gt;new&lt;/span&gt; &lt;span class='n'&gt;Configuration&lt;/span&gt;&lt;span class='o'&gt;(),&lt;/span&gt; &lt;span class='k'&gt;new&lt;/span&gt; &lt;span class='n'&gt;MOFExample&lt;/span&gt;&lt;span class='o'&gt;(),&lt;/span&gt; &lt;span class='n'&gt;args&lt;/span&gt;&lt;span class='o'&gt;);&lt;/span&gt;
        &lt;span class='n'&gt;System&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;exit&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;res&lt;/span&gt;&lt;span class='o'&gt;);&lt;/span&gt;
    &lt;span class='o'&gt;}&lt;/span&gt;
&lt;span class='o'&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;Run this code and you&amp;#8217;ll see the following files in HDFS, where &lt;code&gt;/output&lt;/code&gt; is the job output directory:&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;&lt;code class='bash'&gt;&lt;span class='nv'&gt;$ &lt;/span&gt;hadoop fs -lsr /output
/output/cupertino/part-00000
/output/cupertino/part-00001
/output/sunnyvale/part-00000
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;If you look at the output files you&amp;#8217;ll see that the files contain the correct buckets.&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;&lt;code class='bash'&gt;&lt;span class='nv'&gt;$ &lt;/span&gt;hadoop fs -cat /output/cupertino/*
cupertino	apple
cupertino	pear

&lt;span class='nv'&gt;$ &lt;/span&gt;hadoop fs -cat /output/sunnyvale/*
sunnyvale	banana
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;Awesome, you have your data bucketed by store. Now that we have everything working, let&amp;#8217;s look at what we did to get there. We had to do two things to get this working:&lt;/p&gt;

&lt;h2 id='extend_multipletextoutputformat'&gt;Extend MultipleTextOutputFormat&lt;/h2&gt;

&lt;p&gt;This is where the magic happened - let&amp;#8217;s look at that class again.&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;&lt;code class='java'&gt;&lt;span class='kd'&gt;static&lt;/span&gt; &lt;span class='kd'&gt;class&lt;/span&gt; &lt;span class='nc'&gt;KeyBasedMultipleTextOutputFormat&lt;/span&gt; &lt;span class='kd'&gt;extends&lt;/span&gt; &lt;span class='n'&gt;MultipleTextOutputFormat&lt;/span&gt;&lt;span class='o'&gt;&amp;lt;&lt;/span&gt;&lt;span class='n'&gt;Text&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;Text&lt;/span&gt;&lt;span class='o'&gt;&amp;gt;&lt;/span&gt; &lt;span class='o'&gt;{&lt;/span&gt;
    &lt;span class='nd'&gt;@Override&lt;/span&gt;
    &lt;span class='kd'&gt;protected&lt;/span&gt; &lt;span class='n'&gt;String&lt;/span&gt; &lt;span class='nf'&gt;generateFileNameForKeyValue&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;Text&lt;/span&gt; &lt;span class='n'&gt;key&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;Text&lt;/span&gt; &lt;span class='n'&gt;value&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;String&lt;/span&gt; &lt;span class='n'&gt;name&lt;/span&gt;&lt;span class='o'&gt;)&lt;/span&gt; &lt;span class='o'&gt;{&lt;/span&gt;
        &lt;span class='k'&gt;return&lt;/span&gt; &lt;span class='n'&gt;key&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;toString&lt;/span&gt;&lt;span class='o'&gt;()&lt;/span&gt; &lt;span class='o'&gt;+&lt;/span&gt; &lt;span class='s'&gt;&amp;quot;/&amp;quot;&lt;/span&gt; &lt;span class='o'&gt;+&lt;/span&gt; &lt;span class='n'&gt;name&lt;/span&gt;&lt;span class='o'&gt;;&lt;/span&gt;
    &lt;span class='o'&gt;}&lt;/span&gt;
&lt;span class='o'&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;You are working with text, which is why you extended &lt;code&gt;MultipleTextOutputFormat&lt;/code&gt;, a class that in turn extends &lt;code&gt;MultipleOutputFormat&lt;/code&gt;. &lt;code&gt;MultipleTextOutputFormat&lt;/code&gt; is a simple class which instructs the &lt;code&gt;MultipleOutputFormat&lt;/code&gt; to use &lt;code&gt;TextOutputFormat&lt;/code&gt; as the underlying output format for writing out the records. If you were to use &lt;code&gt;MultipleTextOutputFormat&lt;/code&gt; as-is, it would behave as if you were using the regular &lt;code&gt;TextOutputFormat&lt;/code&gt;, which is to say that it would only write to a single output file per task. To write data to multiple files you had to extend it, as in the example above.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;generateFileNameForKeyValue&lt;/code&gt; method allows you to return the output path for each record being written. The third argument, &lt;code&gt;name&lt;/code&gt;, is the original &lt;code&gt;FileOutputFormat&lt;/code&gt;-created filename, which is in the form &amp;#8220;part-NNNNN&amp;#8221;, where &amp;#8220;NNNNN&amp;#8221; is the task index, to ensure uniqueness. To avoid file collisions between tasks, make sure your generated output paths are unique; including the original output filename is a simple way of doing this. In our example we&amp;#8217;re using the key as the directory name, and then writing to the original &lt;code&gt;FileOutputFormat&lt;/code&gt; filename within that directory.&lt;/p&gt;
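
&lt;p&gt;To make the naming concrete, here&amp;#8217;s a minimal plain-Java sketch of the same string logic with no Hadoop dependency. The class name and the &lt;code&gt;leafName&lt;/code&gt; helper are made up for illustration; in a real job the framework supplies the &amp;#8220;part-NNNNN&amp;#8221; name for you.&lt;/p&gt;

```java
// Plain-Java sketch (no Hadoop classes) of the path logic described above.
// KeyBasedPathSketch and leafName are hypothetical names used for illustration only.
final class KeyBasedPathSketch {

    // Same expression as the body of generateFileNameForKeyValue: the key becomes
    // a directory, and the framework-supplied leaf name keeps each task's file unique.
    static String fileNameFor(String key, String name) {
        return key + "/" + name;
    }

    // Hypothetical helper that formats a task index in the "part-NNNNN" style.
    static String leafName(int taskIndex) {
        return String.format("part-%05d", taskIndex);
    }

    public static void main(String[] args) {
        // Two tasks emit the key "cupertino"; one task emits "sunnyvale".
        System.out.println(fileNameFor("cupertino", leafName(0)));
        System.out.println(fileNameFor("cupertino", leafName(1)));
        System.out.println(fileNameFor("sunnyvale", leafName(0)));
    }
}
```

&lt;p&gt;Because the task index is part of every path, two tasks that see the same key still write to different files inside the same key directory.&lt;/p&gt;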

&lt;h2 id='specify_the_outputformat'&gt;Specify the OutputFormat&lt;/h2&gt;

&lt;p&gt;The next step is easy. Specify that this output format should be used for your job:&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;&lt;code class='java'&gt;&lt;span class='n'&gt;jobConf&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;setOutputFormat&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;KeyBasedMultipleTextOutputFormat&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;class&lt;/span&gt;&lt;span class='o'&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;Earlier we also mentioned that you can use the input path as part of the output path, which we will look at next.&lt;/p&gt;

&lt;h2 id='using_the_input_filename_as_part_of_the_output_filename_in_maponly_jobs'&gt;Using the input filename as part of the output filename in map-only jobs&lt;/h2&gt;

&lt;p&gt;What if we wanted to keep the input filename as part of the output filename? This only works for map-only jobs, and can be accomplished by overriding the &lt;code&gt;getInputFileBasedOutputFileName&lt;/code&gt; method. Let&amp;#8217;s look at the following code to understand how this method fits into the overall sequence of actions that the &lt;code&gt;MultipleOutputFormat&lt;/code&gt; class performs:&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;&lt;code class='java'&gt;&lt;span class='kd'&gt;public&lt;/span&gt; &lt;span class='kt'&gt;void&lt;/span&gt; &lt;span class='nf'&gt;write&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;K&lt;/span&gt; &lt;span class='n'&gt;key&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;V&lt;/span&gt; &lt;span class='n'&gt;value&lt;/span&gt;&lt;span class='o'&gt;)&lt;/span&gt; &lt;span class='kd'&gt;throws&lt;/span&gt; &lt;span class='n'&gt;IOException&lt;/span&gt; &lt;span class='o'&gt;{&lt;/span&gt;

    &lt;span class='c1'&gt;// get the file name based on the key&lt;/span&gt;
    &lt;span class='n'&gt;String&lt;/span&gt; &lt;span class='n'&gt;keyBasedPath&lt;/span&gt; &lt;span class='o'&gt;=&lt;/span&gt; &lt;span class='n'&gt;generateFileNameForKeyValue&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;key&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;value&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;myName&lt;/span&gt;&lt;span class='o'&gt;);&lt;/span&gt;

    &lt;span class='c1'&gt;// get the file name based on the input file name&lt;/span&gt;
    &lt;span class='n'&gt;String&lt;/span&gt; &lt;span class='n'&gt;finalPath&lt;/span&gt; &lt;span class='o'&gt;=&lt;/span&gt; &lt;span class='n'&gt;getInputFileBasedOutputFileName&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;myJob&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;keyBasedPath&lt;/span&gt;&lt;span class='o'&gt;);&lt;/span&gt;

    &lt;span class='c1'&gt;// get the actual key&lt;/span&gt;
    &lt;span class='n'&gt;K&lt;/span&gt; &lt;span class='n'&gt;actualKey&lt;/span&gt; &lt;span class='o'&gt;=&lt;/span&gt; &lt;span class='n'&gt;generateActualKey&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;key&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;value&lt;/span&gt;&lt;span class='o'&gt;);&lt;/span&gt;
    &lt;span class='n'&gt;V&lt;/span&gt; &lt;span class='n'&gt;actualValue&lt;/span&gt; &lt;span class='o'&gt;=&lt;/span&gt; &lt;span class='n'&gt;generateActualValue&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;key&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;value&lt;/span&gt;&lt;span class='o'&gt;);&lt;/span&gt;

    &lt;span class='n'&gt;RecordWriter&lt;/span&gt;&lt;span class='o'&gt;&amp;lt;&lt;/span&gt;&lt;span class='n'&gt;K&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;V&lt;/span&gt;&lt;span class='o'&gt;&amp;gt;&lt;/span&gt; &lt;span class='n'&gt;rw&lt;/span&gt; &lt;span class='o'&gt;=&lt;/span&gt; &lt;span class='k'&gt;this&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;recordWriters&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;get&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;finalPath&lt;/span&gt;&lt;span class='o'&gt;);&lt;/span&gt;
    &lt;span class='k'&gt;if&lt;/span&gt; &lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;rw&lt;/span&gt; &lt;span class='o'&gt;==&lt;/span&gt; &lt;span class='kc'&gt;null&lt;/span&gt;&lt;span class='o'&gt;)&lt;/span&gt; &lt;span class='o'&gt;{&lt;/span&gt;
      &lt;span class='c1'&gt;// if we don&amp;#39;t have the record writer yet for the final path, create&lt;/span&gt;
      &lt;span class='c1'&gt;// one&lt;/span&gt;
      &lt;span class='c1'&gt;// and add it to the cache&lt;/span&gt;
      &lt;span class='n'&gt;rw&lt;/span&gt; &lt;span class='o'&gt;=&lt;/span&gt; &lt;span class='n'&gt;getBaseRecordWriter&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;myFS&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;myJob&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;finalPath&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;myProgressable&lt;/span&gt;&lt;span class='o'&gt;);&lt;/span&gt;
      &lt;span class='k'&gt;this&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;recordWriters&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;put&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;finalPath&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;rw&lt;/span&gt;&lt;span class='o'&gt;);&lt;/span&gt;
    &lt;span class='o'&gt;}&lt;/span&gt;
    &lt;span class='n'&gt;rw&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;write&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;actualKey&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;actualValue&lt;/span&gt;&lt;span class='o'&gt;);&lt;/span&gt;
&lt;span class='o'&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;The &lt;code&gt;getInputFileBasedOutputFileName&lt;/code&gt; method is called with the output of &lt;code&gt;generateFileNameForKeyValue&lt;/code&gt;, which contains our already-customized output file. Our new &lt;code&gt;KeyBasedMultipleTextOutputFormat&lt;/code&gt; can now be updated to override &lt;code&gt;getInputFileBasedOutputFileName&lt;/code&gt; and append the original input filename to the output filename:&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;&lt;code class='java'&gt;&lt;span class='kd'&gt;static&lt;/span&gt; &lt;span class='kd'&gt;class&lt;/span&gt; &lt;span class='nc'&gt;KeyBasedMultipleTextOutputFormat&lt;/span&gt; &lt;span class='kd'&gt;extends&lt;/span&gt; &lt;span class='n'&gt;MultipleTextOutputFormat&lt;/span&gt;&lt;span class='o'&gt;&amp;lt;&lt;/span&gt;&lt;span class='n'&gt;Text&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;Text&lt;/span&gt;&lt;span class='o'&gt;&amp;gt;&lt;/span&gt; &lt;span class='o'&gt;{&lt;/span&gt;
    &lt;span class='nd'&gt;@Override&lt;/span&gt;
    &lt;span class='kd'&gt;protected&lt;/span&gt; &lt;span class='n'&gt;String&lt;/span&gt; &lt;span class='nf'&gt;generateFileNameForKeyValue&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;Text&lt;/span&gt; &lt;span class='n'&gt;key&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;Text&lt;/span&gt; &lt;span class='n'&gt;value&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;String&lt;/span&gt; &lt;span class='n'&gt;name&lt;/span&gt;&lt;span class='o'&gt;)&lt;/span&gt; &lt;span class='o'&gt;{&lt;/span&gt;
        &lt;span class='k'&gt;return&lt;/span&gt; &lt;span class='n'&gt;key&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;toString&lt;/span&gt;&lt;span class='o'&gt;()&lt;/span&gt; &lt;span class='o'&gt;+&lt;/span&gt; &lt;span class='s'&gt;&amp;quot;/&amp;quot;&lt;/span&gt; &lt;span class='o'&gt;+&lt;/span&gt; &lt;span class='n'&gt;name&lt;/span&gt;&lt;span class='o'&gt;;&lt;/span&gt;
    &lt;span class='o'&gt;}&lt;/span&gt;

    &lt;span class='nd'&gt;@Override&lt;/span&gt;
    &lt;span class='kd'&gt;protected&lt;/span&gt; &lt;span class='n'&gt;String&lt;/span&gt; &lt;span class='nf'&gt;getInputFileBasedOutputFileName&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;JobConf&lt;/span&gt; &lt;span class='n'&gt;job&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;String&lt;/span&gt; &lt;span class='n'&gt;name&lt;/span&gt;&lt;span class='o'&gt;)&lt;/span&gt; &lt;span class='o'&gt;{&lt;/span&gt;
        &lt;span class='n'&gt;String&lt;/span&gt; &lt;span class='n'&gt;infilename&lt;/span&gt; &lt;span class='o'&gt;=&lt;/span&gt; &lt;span class='k'&gt;new&lt;/span&gt; &lt;span class='n'&gt;Path&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;job&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;get&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='s'&gt;&amp;quot;map.input.file&amp;quot;&lt;/span&gt;&lt;span class='o'&gt;)).&lt;/span&gt;&lt;span class='na'&gt;getName&lt;/span&gt;&lt;span class='o'&gt;();&lt;/span&gt;
        &lt;span class='k'&gt;return&lt;/span&gt; &lt;span class='n'&gt;name&lt;/span&gt; &lt;span class='o'&gt;+&lt;/span&gt; &lt;span class='s'&gt;&amp;quot;-&amp;quot;&lt;/span&gt; &lt;span class='o'&gt;+&lt;/span&gt; &lt;span class='n'&gt;infilename&lt;/span&gt;&lt;span class='o'&gt;;&lt;/span&gt;
    &lt;span class='o'&gt;}&lt;/span&gt;
&lt;span class='o'&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;If you run the job with your modified OutputFormat class you&amp;#8217;ll see the following files in HDFS, confirming that the input filename is now appended to each output filename.&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;&lt;code class='bash'&gt;&lt;span class='nv'&gt;$ &lt;/span&gt;hadoop fs -lsr /output
/output/cupertino/part-00000-file1.txt
/output/cupertino/part-00001-file2.txt
/output/sunnyvale/part-00000-file1.txt
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;The default implementation of &lt;code&gt;getInputFileBasedOutputFileName&lt;/code&gt; in &lt;code&gt;MultipleOutputFormat&lt;/code&gt; simply returns the name it was given, but if you set the &lt;code&gt;mapred.outputformat.numOfTrailingLegs&lt;/code&gt; configuration property to an integer greater than 0, then &lt;code&gt;getInputFileBasedOutputFileName&lt;/code&gt; will use part of the input path as the output path.&lt;/p&gt;

&lt;p&gt;Let&amp;#8217;s see what happens when we set the value to 1:&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;&lt;code class='java'&gt;&lt;span class='n'&gt;jobConf&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;setInt&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='s'&gt;&amp;quot;mapred.outputformat.numOfTrailingLegs&amp;quot;&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='mi'&gt;1&lt;/span&gt;&lt;span class='o'&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;The output files in HDFS now exactly mirror the input files used for the job:&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;&lt;code class='bash'&gt;&lt;span class='nv'&gt;$ &lt;/span&gt;hadoop fs -lsr /output
/output/file1.txt
/output/file2.txt
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;If we set &lt;code&gt;mapred.outputformat.numOfTrailingLegs&lt;/code&gt; to 2, and our input files exist in the &lt;code&gt;/input&lt;/code&gt; directory, then our output directory looks like this:&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;&lt;code class='bash'&gt;&lt;span class='nv'&gt;$ &lt;/span&gt;hadoop fs -lsr /output
/output/input/file1.txt
/output/input/file2.txt
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;In other words, each time you increment &lt;code&gt;mapred.outputformat.numOfTrailingLegs&lt;/code&gt;, &lt;code&gt;MultipleOutputFormat&lt;/code&gt; goes one level further up the parent directories of the input file and includes that directory in the output path.&lt;/p&gt;
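
&lt;p&gt;As a rough illustration of that behavior, the sketch below keeps the last N components of an input path using only &lt;code&gt;java.nio&lt;/code&gt;. It approximates what&amp;#8217;s described above and is not the Hadoop implementation; the class and method names are hypothetical, and it assumes a plain absolute path rather than a full HDFS URI.&lt;/p&gt;

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Plain-Java approximation of the "trailing legs" idea; not Hadoop's own code.
// TrailingLegsSketch and outputNameFor are hypothetical names.
final class TrailingLegsSketch {

    // Returns the last `legs` components of inputFile joined with "/".
    // When legs is zero or negative, the supplied name is returned unchanged,
    // which corresponds to the default behavior described above.
    static String outputNameFor(String inputFile, String name, int legs) {
        if (Math.max(legs, 0) == 0) {
            return name;
        }
        Path in = Paths.get(inputFile);
        int count = in.getNameCount();
        // Never walk above the root: clamp the start index at zero.
        int from = Math.max(0, count - legs);
        StringBuilder out = new StringBuilder();
        for (Path leg : in.subpath(from, count)) {
            if (out.length() != 0) {
                out.append('/');
            }
            out.append(leg.toString());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "/input/file1.txt";
        System.out.println(outputNameFor(input, "part-00000", 0));
        System.out.println(outputNameFor(input, "part-00000", 1));
        System.out.println(outputNameFor(input, "part-00000", 2));
    }
}
```

&lt;p&gt;With the input file above, a value of 0 keeps the original name, 1 yields just the input filename, and 2 yields the parent directory plus the filename.&lt;/p&gt;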

&lt;h2 id='modifying_the_output_key_and_value'&gt;Modifying the output key and value&lt;/h2&gt;

&lt;p&gt;It&amp;#8217;s very possible that the actual key and value you want to emit are different from those that were used to determine the output file. In our example, we took the output key and wrote to a directory using the key name. If you do that, keeping the key in the output file may be redundant. How would we modify the output record so that the key isn&amp;#8217;t written? &lt;code&gt;MultipleOutputFormat&lt;/code&gt; has your back with the &lt;code&gt;generateActualKey&lt;/code&gt; method.&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;&lt;code class='java'&gt;&lt;span class='kd'&gt;class&lt;/span&gt; &lt;span class='nc'&gt;KeyBasedMultipleTextOutputFormat&lt;/span&gt; &lt;span class='kd'&gt;extends&lt;/span&gt; &lt;span class='n'&gt;MultipleTextOutputFormat&lt;/span&gt;&lt;span class='o'&gt;&amp;lt;&lt;/span&gt;&lt;span class='n'&gt;Text&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;Text&lt;/span&gt;&lt;span class='o'&gt;&amp;gt;&lt;/span&gt; &lt;span class='o'&gt;{&lt;/span&gt;
    &lt;span class='nd'&gt;@Override&lt;/span&gt;
    &lt;span class='kd'&gt;protected&lt;/span&gt; &lt;span class='n'&gt;String&lt;/span&gt; &lt;span class='nf'&gt;generateFileNameForKeyValue&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;Text&lt;/span&gt; &lt;span class='n'&gt;key&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;Text&lt;/span&gt; &lt;span class='n'&gt;value&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;String&lt;/span&gt; &lt;span class='n'&gt;name&lt;/span&gt;&lt;span class='o'&gt;)&lt;/span&gt; &lt;span class='o'&gt;{&lt;/span&gt;
        &lt;span class='k'&gt;return&lt;/span&gt; &lt;span class='n'&gt;key&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;toString&lt;/span&gt;&lt;span class='o'&gt;()&lt;/span&gt; &lt;span class='o'&gt;+&lt;/span&gt; &lt;span class='s'&gt;&amp;quot;/&amp;quot;&lt;/span&gt; &lt;span class='o'&gt;+&lt;/span&gt; &lt;span class='n'&gt;name&lt;/span&gt;&lt;span class='o'&gt;;&lt;/span&gt;
    &lt;span class='o'&gt;}&lt;/span&gt;

    &lt;span class='nd'&gt;@Override&lt;/span&gt;
    &lt;span class='kd'&gt;protected&lt;/span&gt; &lt;span class='n'&gt;Text&lt;/span&gt; &lt;span class='nf'&gt;generateActualKey&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;Text&lt;/span&gt; &lt;span class='n'&gt;key&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;Text&lt;/span&gt; &lt;span class='n'&gt;value&lt;/span&gt;&lt;span class='o'&gt;)&lt;/span&gt; &lt;span class='o'&gt;{&lt;/span&gt;
        &lt;span class='k'&gt;return&lt;/span&gt; &lt;span class='kc'&gt;null&lt;/span&gt;&lt;span class='o'&gt;;&lt;/span&gt;
    &lt;span class='o'&gt;}&lt;/span&gt;
&lt;span class='o'&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;The returned value from this method replaces the key that&amp;#8217;s supplied to the underlying &lt;code&gt;RecordWriter&lt;/code&gt;, so if you return &lt;code&gt;null&lt;/code&gt; as in the above example, no key will be written to the file.&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;&lt;code class='bash'&gt;&lt;span class='nv'&gt;$ &lt;/span&gt;hadoop fs -cat /output/cupertino/*
apple
pear

&lt;span class='nv'&gt;$ &lt;/span&gt;hadoop fs -cat /output/sunnyvale/*
banana
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;You can achieve the same result for the output value by overriding the &lt;code&gt;generateActualValue&lt;/code&gt; method.&lt;/p&gt;
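
&lt;p&gt;Why does a &lt;code&gt;null&lt;/code&gt; key disappear cleanly, without leaving a stray separator behind? As far as I understand the text line writer, it skips both the key and the separator when the key is &lt;code&gt;null&lt;/code&gt;. The plain-Java sketch below imitates that rule; the class and method names are hypothetical, and it assumes the default tab separator.&lt;/p&gt;

```java
// Plain-Java imitation of how a text line writer treats null keys and values.
// NullKeyLineSketch and formatLine are hypothetical names; this is not Hadoop code.
final class NullKeyLineSketch {

    // Returns the line that would be written, or null when nothing is written
    // because both the key and the value are absent.
    static String formatLine(String key, String value) {
        if (key == null) {
            // No key: the separator is dropped along with it.
            return value == null ? null : value + "\n";
        }
        if (value == null) {
            // No value: only the key is written.
            return key + "\n";
        }
        // Both present: key, tab separator (assumed default), value.
        return key + "\t" + value + "\n";
    }

    public static void main(String[] args) {
        // Key kept in the record.
        System.out.print(formatLine("cupertino", "apple"));
        // Key suppressed, as when generateActualKey returns null.
        System.out.print(formatLine(null, "apple"));
    }
}
```

&lt;p&gt;The first call prints the key and value separated by a tab; the second prints only the value, which matches the file contents shown above.&lt;/p&gt;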

&lt;h2 id='changing_the_recordwriter'&gt;Changing the RecordWriter&lt;/h2&gt;

&lt;p&gt;In our final step we&amp;#8217;ll look at how you can use different &lt;code&gt;RecordWriter&lt;/code&gt; classes for different output files. This is accomplished by overriding the &lt;code&gt;getBaseRecordWriter&lt;/code&gt; method, which the &lt;code&gt;write&lt;/code&gt; method shown earlier calls the first time it encounters a new output path. The &lt;code&gt;name&lt;/code&gt; argument is that final path, so in our example it starts with the key. In the example below we&amp;#8217;re using the same &lt;code&gt;TextOutputFormat&lt;/code&gt; for all the files, but it gives you a sense of what can be accomplished.&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;&lt;code class='java'&gt;&lt;span class='kd'&gt;static&lt;/span&gt; &lt;span class='kd'&gt;class&lt;/span&gt; &lt;span class='nc'&gt;KeyBasedMultipleTextOutputFormat&lt;/span&gt; &lt;span class='kd'&gt;extends&lt;/span&gt; &lt;span class='n'&gt;MultipleTextOutputFormat&lt;/span&gt;&lt;span class='o'&gt;&amp;lt;&lt;/span&gt;&lt;span class='n'&gt;Text&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;Text&lt;/span&gt;&lt;span class='o'&gt;&amp;gt;&lt;/span&gt; &lt;span class='o'&gt;{&lt;/span&gt;
    &lt;span class='nd'&gt;@Override&lt;/span&gt;
    &lt;span class='kd'&gt;protected&lt;/span&gt; &lt;span class='n'&gt;String&lt;/span&gt; &lt;span class='nf'&gt;generateFileNameForKeyValue&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;Text&lt;/span&gt; &lt;span class='n'&gt;key&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;Text&lt;/span&gt; &lt;span class='n'&gt;value&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;String&lt;/span&gt; &lt;span class='n'&gt;name&lt;/span&gt;&lt;span class='o'&gt;)&lt;/span&gt; &lt;span class='o'&gt;{&lt;/span&gt;
        &lt;span class='k'&gt;return&lt;/span&gt; &lt;span class='n'&gt;key&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;toString&lt;/span&gt;&lt;span class='o'&gt;()&lt;/span&gt; &lt;span class='o'&gt;+&lt;/span&gt; &lt;span class='s'&gt;&amp;quot;/&amp;quot;&lt;/span&gt; &lt;span class='o'&gt;+&lt;/span&gt; &lt;span class='n'&gt;name&lt;/span&gt;&lt;span class='o'&gt;;&lt;/span&gt;
    &lt;span class='o'&gt;}&lt;/span&gt;

    &lt;span class='nd'&gt;@Override&lt;/span&gt;
    &lt;span class='kd'&gt;protected&lt;/span&gt; &lt;span class='n'&gt;RecordWriter&lt;/span&gt;&lt;span class='o'&gt;&amp;lt;&lt;/span&gt;&lt;span class='n'&gt;Text&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;Text&lt;/span&gt;&lt;span class='o'&gt;&amp;gt;&lt;/span&gt; &lt;span class='n'&gt;getBaseRecordWriter&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;FileSystem&lt;/span&gt; &lt;span class='n'&gt;fs&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;JobConf&lt;/span&gt; &lt;span class='n'&gt;job&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;String&lt;/span&gt; &lt;span class='n'&gt;name&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;Progressable&lt;/span&gt; &lt;span class='n'&gt;prog&lt;/span&gt;&lt;span class='o'&gt;)&lt;/span&gt; &lt;span class='kd'&gt;throws&lt;/span&gt; &lt;span class='n'&gt;IOException&lt;/span&gt; &lt;span class='o'&gt;{&lt;/span&gt;
        &lt;span class='c1'&gt;// name is the final output path, e.g. cupertino/part-00000&lt;/span&gt;
        &lt;span class='k'&gt;if&lt;/span&gt; &lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;name&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;startsWith&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='s'&gt;&amp;quot;cupertino&amp;quot;&lt;/span&gt;&lt;span class='o'&gt;))&lt;/span&gt; &lt;span class='o'&gt;{&lt;/span&gt;
            &lt;span class='k'&gt;return&lt;/span&gt; &lt;span class='k'&gt;new&lt;/span&gt; &lt;span class='n'&gt;TextOutputFormat&lt;/span&gt;&lt;span class='o'&gt;&amp;lt;&lt;/span&gt;&lt;span class='n'&gt;Text&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;Text&lt;/span&gt;&lt;span class='o'&gt;&amp;gt;().&lt;/span&gt;&lt;span class='na'&gt;getRecordWriter&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;fs&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;job&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;name&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;prog&lt;/span&gt;&lt;span class='o'&gt;);&lt;/span&gt;
        &lt;span class='o'&gt;}&lt;/span&gt; &lt;span class='k'&gt;else&lt;/span&gt; &lt;span class='k'&gt;if&lt;/span&gt; &lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;name&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;startsWith&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='s'&gt;&amp;quot;sunnyvale&amp;quot;&lt;/span&gt;&lt;span class='o'&gt;))&lt;/span&gt; &lt;span class='o'&gt;{&lt;/span&gt;
            &lt;span class='k'&gt;return&lt;/span&gt; &lt;span class='k'&gt;new&lt;/span&gt; &lt;span class='n'&gt;TextOutputFormat&lt;/span&gt;&lt;span class='o'&gt;&amp;lt;&lt;/span&gt;&lt;span class='n'&gt;Text&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;Text&lt;/span&gt;&lt;span class='o'&gt;&amp;gt;().&lt;/span&gt;&lt;span class='na'&gt;getRecordWriter&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;fs&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;job&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;name&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;prog&lt;/span&gt;&lt;span class='o'&gt;);&lt;/span&gt;
        &lt;span class='o'&gt;}&lt;/span&gt;
        &lt;span class='k'&gt;return&lt;/span&gt; &lt;span class='kd'&gt;super&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;getBaseRecordWriter&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;fs&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;job&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;name&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;prog&lt;/span&gt;&lt;span class='o'&gt;);&lt;/span&gt;
    &lt;span class='o'&gt;}&lt;/span&gt;
&lt;span class='o'&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;h2 id='conclusion'&gt;Conclusion&lt;/h2&gt;

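&lt;p&gt;When using &lt;code&gt;MultipleOutputFormat&lt;/code&gt;, give some thought to the number of distinct files that each reducer will create. As the &lt;code&gt;write&lt;/code&gt; method above shows, a &lt;code&gt;RecordWriter&lt;/code&gt; is created and cached for every distinct output path a task encounters, so it would be prudent to plan your bucketing so that you have a relatively small number of files.&lt;/p&gt;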

&lt;p&gt;In this post we extended &lt;code&gt;MultipleTextOutputFormat&lt;/code&gt;, which is a simple extension of &lt;code&gt;MultipleOutputFormat&lt;/code&gt; that supports text outputs. &lt;code&gt;MultipleSequenceFileOutputFormat&lt;/code&gt; also exists to support SequenceFiles in a similar fashion.&lt;/p&gt;

&lt;p&gt;So what are the shortcomings with the &lt;code&gt;MultipleOutputFormat&lt;/code&gt; class?&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;If you have a job that uses both map and reduce phases, then &lt;code&gt;MultipleOutputFormat&lt;/code&gt; can&amp;#8217;t be used in the map-side to write outputs. Of course, &lt;code&gt;MultipleOutputFormat&lt;/code&gt; works fine in map-only jobs.&lt;/li&gt;

&lt;li&gt;All &lt;code&gt;RecordWriter&lt;/code&gt; classes must support exactly the same output record types. For example, you wouldn&amp;#8217;t be able to support a RecordWriter that emitted &lt;code&gt;&amp;lt;IntWritable, Text&amp;gt;&lt;/code&gt; for one output file, and have another RecordWriter that emitted &lt;code&gt;&amp;lt;Text, Text&amp;gt;&lt;/code&gt;.&lt;/li&gt;

&lt;li&gt;&lt;code&gt;MultipleOutputFormat&lt;/code&gt; exists in the &lt;code&gt;mapred&lt;/code&gt; package, so it won&amp;#8217;t work with a job that requires use of the &lt;code&gt;mapreduce&lt;/code&gt; package.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;All is not lost if you bump into any of these issues, as you&amp;#8217;ll discover in the &lt;a href='http://grepalex.com/2013/07/16/multipleoutputs-part2/'&gt;next blog post in this series&lt;/a&gt;.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Using the libjars option with Hadoop</title>
   
   <category term="--" />
   
   <category term="hadoop" />
   
   <link href="http://grepalex.com/2013/02/25/hadoop-libjars"/>
   <updated>2013-02-25T14:20:00+00:00</updated>
   <id>http://grepalex.com/2013/02/25/hadoop-libjars</id>
   <content type="html">&lt;p&gt;When working with MapReduce one of the challenges that is encountered early-on is determining how to make your third-part JAR&amp;#8217;s available to the map and reduce tasks. One common approach is to create a &lt;em&gt;fat jar&lt;/em&gt;, which is a JAR that contains your classes as well as your third-party classes (see &lt;a href='http://blog.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/'&gt;this Cloudera blog post&lt;/a&gt; for more details).&lt;/p&gt;

&lt;p&gt;A more elegant solution is to take advantage of the &lt;code&gt;libjars&lt;/code&gt; option in the &lt;code&gt;hadoop jar&lt;/code&gt; command, also mentioned in the Cloudera post at a high level. Here I&amp;#8217;ll go into detail on the three steps required to make this work.&lt;/p&gt;

&lt;h1 id='add_libjars_to_the_options'&gt;Add libjars to the options&lt;/h1&gt;

&lt;p&gt;It can be confusing to know exactly where to put &lt;code&gt;libjars&lt;/code&gt; when running the &lt;code&gt;hadoop jar&lt;/code&gt; command. The following example shows the correct position of this option:&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;&lt;code class='bash'&gt;&lt;span class='nv'&gt;$ &lt;/span&gt;&lt;span class='nb'&gt;export &lt;/span&gt;&lt;span class='nv'&gt;LIBJARS&lt;/span&gt;&lt;span class='o'&gt;=&lt;/span&gt;/path/jar1,/path/jar2
&lt;span class='nv'&gt;$ &lt;/span&gt;hadoop jar my-example.jar com.example.MyTool -libjars &lt;span class='k'&gt;${&lt;/span&gt;&lt;span class='nv'&gt;LIBJARS&lt;/span&gt;&lt;span class='k'&gt;}&lt;/span&gt; -mytoolopt value
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;It&amp;#8217;s worth noting in the above example that the JARs supplied as the value of the &lt;code&gt;libjars&lt;/code&gt; option are &lt;em&gt;comma-separated&lt;/em&gt;, and not separated by your O.S. path delimiter (which is how a Java classpath is delimited).&lt;/p&gt;
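
&lt;p&gt;The two formats differ only in the delimiter, which is easy to get wrong when you need the same list in both forms. Here&amp;#8217;s a small sketch that turns a comma-separated &lt;code&gt;libjars&lt;/code&gt; value into a classpath-style string; the class and method names are hypothetical and nothing here is part of Hadoop.&lt;/p&gt;

```java
import java.io.File;

// Converts a comma-separated JAR list (the libjars format) into a
// path-delimited list (the Java classpath format).
// LibJarsToClasspath and toClasspath are hypothetical names.
final class LibJarsToClasspath {

    // The separator is passed in so the conversion is explicit;
    // File.pathSeparator is ":" on Unix-like systems and ";" on Windows.
    static String toClasspath(String libJars, String pathSeparator) {
        return String.join(pathSeparator, libJars.split(","));
    }

    public static void main(String[] args) {
        String libJars = "/path/jar1,/path/jar2";
        System.out.println(toClasspath(libJars, File.pathSeparator));
    }
}
```

&lt;p&gt;On a Unix-like system this prints the two JAR paths joined by a colon.&lt;/p&gt;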

&lt;p&gt;You may think that you&amp;#8217;re done, but often this step alone is not enough - read on for more details!&lt;/p&gt;

&lt;h1 id='make_sure_your_code_is_using_genericoptionsparser'&gt;Make sure your code is using GenericOptionsParser&lt;/h1&gt;

&lt;p&gt;The Java class that&amp;#8217;s being supplied to the &lt;code&gt;hadoop jar&lt;/code&gt; command should use the &lt;a href='http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/util/GenericOptionsParser.html'&gt;GenericOptionsParser&lt;/a&gt; class to parse the options being supplied on the CLI. The easiest way to do that is demonstrated with the following code, which leverages the &lt;a href='http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/util/ToolRunner.html'&gt;ToolRunner&lt;/a&gt; class to parse-out the options:&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;&lt;code class='java'&gt;&lt;span class='kd'&gt;public&lt;/span&gt; &lt;span class='kd'&gt;static&lt;/span&gt; &lt;span class='kt'&gt;void&lt;/span&gt; &lt;span class='nf'&gt;main&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='kd'&gt;final&lt;/span&gt; &lt;span class='n'&gt;String&lt;/span&gt;&lt;span class='o'&gt;[]&lt;/span&gt; &lt;span class='n'&gt;args&lt;/span&gt;&lt;span class='o'&gt;)&lt;/span&gt; &lt;span class='kd'&gt;throws&lt;/span&gt; &lt;span class='n'&gt;Exception&lt;/span&gt; &lt;span class='o'&gt;{&lt;/span&gt;
  &lt;span class='n'&gt;Configuration&lt;/span&gt; &lt;span class='n'&gt;conf&lt;/span&gt; &lt;span class='o'&gt;=&lt;/span&gt; &lt;span class='k'&gt;new&lt;/span&gt; &lt;span class='n'&gt;Configuration&lt;/span&gt;&lt;span class='o'&gt;();&lt;/span&gt;
  &lt;span class='kt'&gt;int&lt;/span&gt; &lt;span class='n'&gt;res&lt;/span&gt; &lt;span class='o'&gt;=&lt;/span&gt; &lt;span class='n'&gt;ToolRunner&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;run&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;conf&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='k'&gt;new&lt;/span&gt; &lt;span class='n'&gt;com&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;example&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;MyTool&lt;/span&gt;&lt;span class='o'&gt;(),&lt;/span&gt; &lt;span class='n'&gt;args&lt;/span&gt;&lt;span class='o'&gt;);&lt;/span&gt;
  &lt;span class='n'&gt;System&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;exit&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;res&lt;/span&gt;&lt;span class='o'&gt;);&lt;/span&gt;
&lt;span class='o'&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;It is &lt;strong&gt;crucial&lt;/strong&gt; that the configuration object being passed into the &lt;code&gt;ToolRunner.run&lt;/code&gt; method is the same one that you&amp;#8217;re using when setting-up your job. To guarantee this, your class should use the &lt;code&gt;getConf()&lt;/code&gt; method defined in &lt;code&gt;Configurable&lt;/code&gt; (and implemented in &lt;code&gt;Configured&lt;/code&gt;) to access the configuration:&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;&lt;code class='java'&gt;&lt;span class='kd'&gt;public&lt;/span&gt; &lt;span class='kd'&gt;class&lt;/span&gt; &lt;span class='nc'&gt;SmallFilesMapReduce&lt;/span&gt; &lt;span class='kd'&gt;extends&lt;/span&gt; &lt;span class='n'&gt;Configured&lt;/span&gt; &lt;span class='kd'&gt;implements&lt;/span&gt; &lt;span class='n'&gt;Tool&lt;/span&gt; &lt;span class='o'&gt;{&lt;/span&gt;

  &lt;span class='kd'&gt;public&lt;/span&gt; &lt;span class='kd'&gt;final&lt;/span&gt; &lt;span class='kt'&gt;int&lt;/span&gt; &lt;span class='nf'&gt;run&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='kd'&gt;final&lt;/span&gt; &lt;span class='n'&gt;String&lt;/span&gt;&lt;span class='o'&gt;[]&lt;/span&gt; &lt;span class='n'&gt;args&lt;/span&gt;&lt;span class='o'&gt;)&lt;/span&gt; &lt;span class='kd'&gt;throws&lt;/span&gt; &lt;span class='n'&gt;Exception&lt;/span&gt; &lt;span class='o'&gt;{&lt;/span&gt;
    &lt;span class='n'&gt;Job&lt;/span&gt; &lt;span class='n'&gt;job&lt;/span&gt; &lt;span class='o'&gt;=&lt;/span&gt; &lt;span class='k'&gt;new&lt;/span&gt; &lt;span class='n'&gt;Job&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='kd'&gt;super&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;getConf&lt;/span&gt;&lt;span class='o'&gt;());&lt;/span&gt;
    &lt;span class='o'&gt;...&lt;/span&gt;
    &lt;span class='n'&gt;job&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;waitForCompletion&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='kc'&gt;true&lt;/span&gt;&lt;span class='o'&gt;);&lt;/span&gt;
    &lt;span class='k'&gt;return&lt;/span&gt; &lt;span class='o'&gt;...;&lt;/span&gt;
  &lt;span class='o'&gt;}&lt;/span&gt;
&lt;span class='o'&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;If you don&amp;#8217;t use the Configuration object supplied to the &lt;code&gt;ToolRunner.run&lt;/code&gt; method in your MapReduce driver code, then your job won&amp;#8217;t be correctly configured and your third-party JARs won&amp;#8217;t be copied to the Distributed Cache or loaded in the remote task JVMs.&lt;/p&gt;

&lt;p&gt;It&amp;#8217;s the &lt;code&gt;ToolRunner.run&lt;/code&gt; method (which delegates the command-line parsing to &lt;code&gt;GenericOptionsParser&lt;/code&gt;) that parses out the &lt;code&gt;libjars&lt;/code&gt; argument and adds a value for the &lt;code&gt;tmpjars&lt;/code&gt; property to the Configuration object. So a quick way to make sure that this step is working is to look at the job file for your MapReduce job (there&amp;#8217;s a link when viewing the job details from the JobTracker), and make sure that the &lt;code&gt;tmpjars&lt;/code&gt; configuration name exists with a value identical to the path that you specified in your command. You can also use the command line to search for the &lt;code&gt;tmpjars&lt;/code&gt; configuration in HDFS:&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;&lt;code class='bash'&gt;&lt;span class='nv'&gt;$ &lt;/span&gt;hadoop fs -cat &amp;lt;JOB_OUTPUT_HDFS_DIRECTORY&amp;gt;/_logs/history/*.xml | grep tmpjars
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;h1 id='use_hadoop_classpath_to_make_your_thirdparty_jars_available_on_the_clientside'&gt;Use HADOOP_CLASSPATH to make your third-party JARs available on the client-side&lt;/h1&gt;

&lt;p&gt;So far the first two steps tackled what you needed to do to make your third-party JARs available to the remote map and reduce task JVMs. But what hasn&amp;#8217;t been covered so far is making these same JARs available to the client JVM, which is the JVM that&amp;#8217;s created when you run the &lt;code&gt;hadoop jar&lt;/code&gt; command.&lt;/p&gt;

&lt;p&gt;For this to happen, you should set the &lt;code&gt;HADOOP_CLASSPATH&lt;/code&gt; environment variable to contain the list of third-party JARs, separated by your operating system&amp;#8217;s path delimiter. Let&amp;#8217;s extend the commands in the first step above with the addition of setting the &lt;code&gt;HADOOP_CLASSPATH&lt;/code&gt; environment variable:&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;&lt;code class='bash'&gt;&lt;span class='nv'&gt;$ &lt;/span&gt;&lt;span class='nb'&gt;export &lt;/span&gt;&lt;span class='nv'&gt;LIBJARS&lt;/span&gt;&lt;span class='o'&gt;=&lt;/span&gt;/path/jar1,/path/jar2
&lt;span class='nv'&gt;$ &lt;/span&gt;&lt;span class='nb'&gt;export &lt;/span&gt;&lt;span class='nv'&gt;HADOOP_CLASSPATH&lt;/span&gt;&lt;span class='o'&gt;=&lt;/span&gt;/path/jar1:/path/jar2
&lt;span class='nv'&gt;$ &lt;/span&gt;hadoop jar my-example.jar com.example.MyTool -libjars &lt;span class='k'&gt;${&lt;/span&gt;&lt;span class='nv'&gt;LIBJARS&lt;/span&gt;&lt;span class='k'&gt;}&lt;/span&gt; -mytoolopt value
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;Note that the value for &lt;code&gt;HADOOP_CLASSPATH&lt;/code&gt; uses a Unix path delimiter of &lt;code&gt;:&lt;/code&gt;, so modify accordingly for your platform. And if you don&amp;#8217;t like the copy-paste above you could modify that line to replace the commas with colons:&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;&lt;code class='bash'&gt;&lt;span class='nv'&gt;$ &lt;/span&gt;&lt;span class='nb'&gt;export &lt;/span&gt;&lt;span class='nv'&gt;HADOOP_CLASSPATH&lt;/span&gt;&lt;span class='o'&gt;=&lt;/span&gt;&lt;span class='sb'&gt;`&lt;/span&gt;&lt;span class='nb'&gt;echo&lt;/span&gt; &lt;span class='k'&gt;${&lt;/span&gt;&lt;span class='nv'&gt;LIBJARS&lt;/span&gt;&lt;span class='k'&gt;}&lt;/span&gt; | sed s/,/:/g&lt;span class='sb'&gt;`&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;</content>
 </entry>
 
 <entry>
   <title>Installing AsciiDoc on OSX</title>
   
   <category term="--" />
   
   <category term="OSX" />
   
   <link href="http://grepalex.com/2013/02/17/installing-asciidoc-on-osx"/>
   <updated>2013-02-17T14:00:00+00:00</updated>
   <id>http://grepalex.com/2013/02/17/installing-asciidoc-on-osx</id>
   <content type="html">&lt;p&gt;&lt;a href='http://asciidoc.org/'&gt;AsciiDoc&lt;/a&gt; is a markup language and tool that I&amp;#8217;m starting to play with to produce DocBook and PDF/HTML versions of my work. It took me a little longer than expected to get it up and running, so hopefully this blog will serve as a quick install guide for you, as well as the future me.&lt;/p&gt;

&lt;p&gt;First I had to install &lt;a href='http://mxcl.github.com/homebrew/'&gt;Homebrew&lt;/a&gt;, a useful package manager for OSX:&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;&lt;code class='bash'&gt;&lt;span class='nv'&gt;$ &lt;/span&gt;sudo mkdir /usr/local/homebrew
&lt;span class='nv'&gt;$ &lt;/span&gt;&lt;span class='nb'&gt;cd&lt;/span&gt; /usr/local/homebrew
&lt;span class='nv'&gt;$ &lt;/span&gt;sudo curl -L https://github.com/mxcl/homebrew/tarball/master | tar xz --strip 1 -C .
&lt;span class='nv'&gt;$ &lt;/span&gt;sudo ln -s &lt;span class='sb'&gt;`&lt;/span&gt;&lt;span class='nb'&gt;pwd&lt;/span&gt;&lt;span class='sb'&gt;`&lt;/span&gt;/bin/brew /usr/local/bin/brew
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;Next-up was installing AsciiDoc and other required libraries via &lt;code&gt;brew&lt;/code&gt;:&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;&lt;code class='bash'&gt;&lt;span class='nv'&gt;$ &lt;/span&gt;sudo brew install autoconf automake libevent asciidoc
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;After this I had to update my bash profile file to set an environment variable that points to the XML catalog created as part of the AsciiDoc installation:&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;&lt;code class='bash'&gt;&lt;span class='nv'&gt;$ &lt;/span&gt;&lt;span class='nb'&gt;echo&lt;/span&gt; &lt;span class='s2'&gt;&amp;quot;export XML_CATALOG_FILES=/usr/local/etc/xml/catalog&amp;quot;&lt;/span&gt; &amp;gt;&amp;gt;  ~/.bash_profile
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;Now you have to &lt;a href='http://xmlgraphics.apache.org/fop/download.html'&gt;download Apache FOP&lt;/a&gt;, a print formatter used by AsciiDoc to create PDFs, which in my case resulted in a file at &lt;code&gt;~/Downloads/fop-1.0-bin.tar.gz&lt;/code&gt;. Untar the contents and create a symbolic link for &lt;code&gt;fop&lt;/code&gt;:&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;&lt;code class='bash'&gt;&lt;span class='nv'&gt;$ &lt;/span&gt;&lt;span class='nb'&gt;cd&lt;/span&gt; /usr/local/
&lt;span class='nv'&gt;$ &lt;/span&gt;sudo tar -xzvf ~/Downloads/fop-1.0-bin.tar.gz
&lt;span class='nv'&gt;$ &lt;/span&gt;sudo ln -s /usr/local/fop-1.0/fop /usr/bin/fop
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;Finally, let&amp;#8217;s make sure that everything is installed correctly. Create a sample AsciiDoc file called &lt;code&gt;sample.asciidoc&lt;/code&gt; with the following contents:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Your First AsciiDoc
===================
Jane Blogs
:Author Initials: JB

This is your first AsciiDoc file - yay for you!&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;You can then run &lt;code&gt;a2x&lt;/code&gt;, which first generates a DocBook version of your AsciiDoc file and then goes on to generate the PDF:&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;&lt;code class='bash'&gt;&lt;span class='nv'&gt;$ &lt;/span&gt;a2x -v -fpdf -dbook --fop sample.asciidoc
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;This should create a sample.pdf in the same directory as your AsciiDoc file. You can also generate a HTML version with:&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;&lt;code class='bash'&gt;&lt;span class='nv'&gt;$ &lt;/span&gt;asciidoc -b html5 -a data-uri -a toc2 sample.asciidoc
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;</content>
 </entry>
 
 <entry>
   <title>Java 6 and 7 with the dotted/dotless I</title>
   
   <category term="--" />
   
   <category term="Java" />
   
   <link href="http://grepalex.com/2013/02/14/java-7-and-the-dotted--and-dotless-i"/>
   <updated>2013-02-14T14:00:00+00:00</updated>
   <id>http://grepalex.com/2013/02/14/java-7-and-the-dotted--and-dotless-i</id>
   <content type="html">&lt;p&gt;Imagine you&amp;#8217;re working on a project in Java where you are handling text in a language that contains characters outside the standard 128-character ASCII scheme, such as Turkish. How about we focus on the &lt;a href='http://en.wikipedia.org/wiki/Dotted_and_dotless_I'&gt;dotted and dotless I&lt;/a&gt;:&lt;/p&gt;
&lt;table&gt;
    &lt;tr&gt;
        &lt;td&gt;Letter&lt;/td&gt;
        &lt;td&gt;Description&lt;/td&gt;
        &lt;td&gt;Unicode (decimal)&lt;/td&gt;
        &lt;td&gt;Unicode (Java hex)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;İ&lt;/td&gt;
        &lt;td&gt;Upper-case dotted I&lt;/td&gt;
        &lt;td&gt;304&lt;/td&gt;
        &lt;td&gt;u0130&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;I&lt;/td&gt;
        &lt;td&gt;Upper-case (dotless) Latin I&lt;/td&gt;
        &lt;td&gt;73&lt;/td&gt;
        &lt;td&gt;u0049&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;ı&lt;/td&gt;
        &lt;td&gt;Lower-case dotless I&lt;/td&gt;
        &lt;td&gt;305&lt;/td&gt;
        &lt;td&gt;u0131&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;i&lt;/td&gt;
        &lt;td&gt;Lower-case (dotted) Latin I&lt;/td&gt;
        &lt;td&gt;105&lt;/td&gt;
        &lt;td&gt;u0069&lt;/td&gt;
    &lt;/tr&gt;
&lt;/table&gt;
&lt;p&gt;This is how the &lt;a href='http://www.i18nguy.com/unicode/turkish-i18n.html'&gt;lower and upper-case versions of the Turkish dotted/dotless &amp;#8220;I&amp;#8221; relate&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;img alt='Image of combining-dot-above-i' src='/images/turkish-dotted-undotted-i.png' /&gt;&lt;/p&gt;

&lt;p&gt;Since we know that the hexadecimal Unicode representation of the upper-case dotted &amp;#8220;I&amp;#8221; (İ) is &lt;code&gt;u0130&lt;/code&gt;, how about we try to convert it to its lower-case form, which should be the regular lower-case Latin &amp;#8220;i&amp;#8221;, which in Unicode hexadecimal form is &lt;code&gt;u0069&lt;/code&gt;.&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;&lt;code class='java'&gt;&lt;span class='n'&gt;System&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;out&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;println&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;String&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;valueOf&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='sc'&gt;&amp;#39;\u0130&amp;#39;&lt;/span&gt;&lt;span class='o'&gt;).&lt;/span&gt;&lt;span class='na'&gt;toLowerCase&lt;/span&gt;&lt;span class='o'&gt;());&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;If we run this same code under Java 6 and Java 7 we get:&lt;/p&gt;

&lt;p&gt;&lt;img alt='Image of combining-dot-above-i' src='/images/combining-dot-above-i.png' /&gt;&lt;/p&gt;

&lt;p&gt;Hmm - I may be mistaken, but it looks like under Java 7 the &amp;#8220;i&amp;#8221; has grown another dot! Let&amp;#8217;s see what the Unicode codepoints in the resulting string look like using the following code:&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;&lt;code class='java'&gt;&lt;span class='kt'&gt;int&lt;/span&gt; &lt;span class='n'&gt;offset&lt;/span&gt;&lt;span class='o'&gt;;&lt;/span&gt;
&lt;span class='k'&gt;for&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='kt'&gt;int&lt;/span&gt; &lt;span class='n'&gt;i&lt;/span&gt; &lt;span class='o'&gt;=&lt;/span&gt; &lt;span class='mi'&gt;0&lt;/span&gt;&lt;span class='o'&gt;;&lt;/span&gt; &lt;span class='n'&gt;i&lt;/span&gt; &lt;span class='o'&gt;&amp;lt;&lt;/span&gt; &lt;span class='n'&gt;s&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;length&lt;/span&gt;&lt;span class='o'&gt;();&lt;/span&gt; &lt;span class='n'&gt;i&lt;/span&gt; &lt;span class='o'&gt;+=&lt;/span&gt; &lt;span class='n'&gt;offset&lt;/span&gt;&lt;span class='o'&gt;)&lt;/span&gt; &lt;span class='o'&gt;{&lt;/span&gt;
    &lt;span class='kt'&gt;int&lt;/span&gt; &lt;span class='n'&gt;codepoint&lt;/span&gt; &lt;span class='o'&gt;=&lt;/span&gt; &lt;span class='n'&gt;s&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;codePointAt&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;i&lt;/span&gt;&lt;span class='o'&gt;);&lt;/span&gt;
    &lt;span class='n'&gt;offset&lt;/span&gt; &lt;span class='o'&gt;=&lt;/span&gt; &lt;span class='n'&gt;Character&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;charCount&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;codepoint&lt;/span&gt;&lt;span class='o'&gt;);&lt;/span&gt;
    &lt;span class='n'&gt;System&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;out&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;print&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;String&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;format&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='s'&gt;&amp;quot;u%04x &amp;quot;&lt;/span&gt;&lt;span class='o'&gt;,&lt;/span&gt; &lt;span class='n'&gt;codepoint&lt;/span&gt;&lt;span class='o'&gt;));&lt;/span&gt;
&lt;span class='o'&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;If we again run this in Java 6 and Java 7 against the result of calling the &lt;code&gt;toLowerCase&lt;/code&gt; method on the upper-case dotted &amp;#8220;I&amp;#8221; we get:&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;&lt;code class='bash'&gt;Java 6: u0069
Java 7: u0069 u0307
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;It looks like the first codepoint is indeed correct (the Latin lower-case &amp;#8220;i&amp;#8221;), but what is &lt;code&gt;u0307&lt;/code&gt;? &lt;a href='http://en.wikipedia.org/wiki/Dot_(diacritic)'&gt;Wikipedia tells us&lt;/a&gt; it&amp;#8217;s the &amp;#8220;combining dot above&amp;#8221;, which is to say that it isn&amp;#8217;t displayed on its own. Instead it modifies the previous character with an additional dot, and the pair is displayed as a single character (called a &lt;a href='http://en.wikipedia.org/wiki/Grapheme'&gt;grapheme&lt;/a&gt;), just like we saw in our example.&lt;/p&gt;

&lt;p&gt;What&amp;#8217;s puzzling about this is why do we see the behaviour of &lt;code&gt;toLowerCase&lt;/code&gt; change between Java versions? If you dig into the Java 7 &lt;code&gt;String&lt;/code&gt; class and compare the code against the Java 6 source, you&amp;#8217;ll see that the following code was added to Java 7:&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;&lt;code class='java'&gt;&lt;span class='o'&gt;}&lt;/span&gt; &lt;span class='k'&gt;else&lt;/span&gt; &lt;span class='k'&gt;if&lt;/span&gt; &lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;srcChar&lt;/span&gt; &lt;span class='o'&gt;==&lt;/span&gt; &lt;span class='sc'&gt;&amp;#39;\u0130&amp;#39;&lt;/span&gt;&lt;span class='o'&gt;)&lt;/span&gt; &lt;span class='o'&gt;{&lt;/span&gt; &lt;span class='c1'&gt;// LATIN CAPITAL LETTER I DOT&lt;/span&gt;
    &lt;span class='n'&gt;lowerChar&lt;/span&gt; &lt;span class='o'&gt;=&lt;/span&gt; &lt;span class='n'&gt;Character&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;ERROR&lt;/span&gt;&lt;span class='o'&gt;;&lt;/span&gt;
&lt;span class='o'&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;Basically the end result of this change is that for this specific case (the upper-case dotted I), Java 7 now consults a special Unicode character database (&lt;a href='http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt'&gt;http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt&lt;/a&gt;), which provides data on complex case-mappings. Looking at this file you can see several lines for the upper-case dotted I:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;CODE       LOWER   TITLE   UPPER  LANGUAGE
0130;  0069 0307;   0130;   0130;
0130;  0069;        0130;   0130;       tr;
0130;  0069;        0130;   0130;       az;&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Entries with a language take precedence over those without, so in my JVM where the default locale is English, the first row of the mapping is used, which lines-up with the codepoints that we saw outputted in our Java 7 example. Therefore to make Java do the right thing here for Turkish, we need to explicitly specify the Turkish locale (&amp;#8220;tr&amp;#8221; is the ISO 639 alpha-2 language code for Turkish) to the &lt;code&gt;toLowerCase&lt;/code&gt; method:&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;&lt;code class='java'&gt;&lt;span class='n'&gt;dumpUnicodeCodePoints&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;String&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;valueOf&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='sc'&gt;&amp;#39;\u0130&amp;#39;&lt;/span&gt;&lt;span class='o'&gt;).&lt;/span&gt;&lt;span class='na'&gt;toLowerCase&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='k'&gt;new&lt;/span&gt; &lt;span class='n'&gt;Locale&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='s'&gt;&amp;quot;tr&amp;quot;&lt;/span&gt;&lt;span class='o'&gt;)));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;This now yields a result consistent with what we expect from the Turkish lower-case mapping:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;u0069&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The bottom line is that Java 6 will &lt;em&gt;always&lt;/em&gt; convert the upper-case dotted &amp;#8220;I&amp;#8221; to a lower-case Latin &amp;#8220;i&amp;#8221;, whereas Java 7 is following the complex Unicode case mapping based on the locale passed into the &lt;code&gt;toLowerCase&lt;/code&gt; method, which defaults to &lt;code&gt;Locale.getDefault()&lt;/code&gt; if you don&amp;#8217;t supply one to &lt;code&gt;toLowerCase&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Oh, and one last tip - for most lower-case mappings the &lt;code&gt;String.toLowerCase&lt;/code&gt; method defers to &lt;code&gt;Character.toLowerCase&lt;/code&gt;. But take stock of the advice given in the &lt;code&gt;Character.toLowerCase&lt;/code&gt; JavaDoc comment, especially in the second and third paragraphs:&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;&lt;code class='java'&gt;&lt;span class='cm'&gt;/**&lt;/span&gt;
&lt;span class='cm'&gt; * Converts the character (Unicode code point) argument to&lt;/span&gt;
&lt;span class='cm'&gt; * lowercase using case mapping information from the UnicodeData&lt;/span&gt;
&lt;span class='cm'&gt; * file.&lt;/span&gt;
&lt;span class='cm'&gt; *&lt;/span&gt;
&lt;span class='cm'&gt; * &amp;lt;p&amp;gt; Note that&lt;/span&gt;
&lt;span class='cm'&gt; * {@code Character.isLowerCase(Character.toLowerCase(codePoint))}&lt;/span&gt;
&lt;span class='cm'&gt; * does not always return {@code true} for some ranges of&lt;/span&gt;
&lt;span class='cm'&gt; * characters, particularly those that are symbols or ideographs.&lt;/span&gt;
&lt;span class='cm'&gt; *&lt;/span&gt;
&lt;span class='cm'&gt; * &amp;lt;p&amp;gt;In general, {@link String#toLowerCase()} should be used to map&lt;/span&gt;
&lt;span class='cm'&gt; * characters to lowercase. {@code String} case mapping methods&lt;/span&gt;
&lt;span class='cm'&gt; * have several benefits over {@code Character} case mapping methods.&lt;/span&gt;
&lt;span class='cm'&gt; * {@code String} case mapping methods can perform locale-sensitive&lt;/span&gt;
&lt;span class='cm'&gt; * mappings, context-sensitive mappings, and 1:M character mappings, whereas&lt;/span&gt;
&lt;span class='cm'&gt; * the {@code Character} case mapping methods cannot.&lt;/span&gt;
&lt;span class='cm'&gt; *&lt;/span&gt;
&lt;span class='cm'&gt; * @param   codePoint   the character (Unicode code point) to be converted.&lt;/span&gt;
&lt;span class='cm'&gt; * @return  the lowercase equivalent of the character (Unicode code&lt;/span&gt;
&lt;span class='cm'&gt; *          point), if any; otherwise, the character itself.&lt;/span&gt;
&lt;span class='cm'&gt; * @see     Character#isLowerCase(int)&lt;/span&gt;
&lt;span class='cm'&gt; * @see     String#toLowerCase()&lt;/span&gt;
&lt;span class='cm'&gt; *&lt;/span&gt;
&lt;span class='cm'&gt; * @since   1.5&lt;/span&gt;
&lt;span class='cm'&gt; */&lt;/span&gt;
&lt;span class='kd'&gt;public&lt;/span&gt; &lt;span class='kd'&gt;static&lt;/span&gt; &lt;span class='kt'&gt;int&lt;/span&gt; &lt;span class='nf'&gt;toLowerCase&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='kt'&gt;int&lt;/span&gt; &lt;span class='n'&gt;codePoint&lt;/span&gt;&lt;span class='o'&gt;)&lt;/span&gt; &lt;span class='o'&gt;{&lt;/span&gt;
    &lt;span class='k'&gt;return&lt;/span&gt; &lt;span class='n'&gt;CharacterData&lt;/span&gt;&lt;span class='o'&gt;.&lt;/span&gt;&lt;span class='na'&gt;of&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;codePoint&lt;/span&gt;&lt;span class='o'&gt;).&lt;/span&gt;&lt;span class='na'&gt;toLowerCase&lt;/span&gt;&lt;span class='o'&gt;(&lt;/span&gt;&lt;span class='n'&gt;codePoint&lt;/span&gt;&lt;span class='o'&gt;);&lt;/span&gt;
&lt;span class='o'&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;</content>
 </entry>
 
 <entry>
   <title>LZOP decompression - revenge of the useless cat</title>
   
   <category term="--" />
   
   <category term="*nix" />
   
   <link href="http://grepalex.com/2013/02/08/lzop-decompression-useless-cat"/>
   <updated>2013-02-08T14:00:00+00:00</updated>
   <id>http://grepalex.com/2013/02/08/lzop-decompression-useless-cat</id>
   <content type="html">&lt;p&gt;For me LZOP is the ubiquitous compression codec with working with large text files in HDFS due to its MapReduce data locality advantages. As a result when I want to peek at LZOP-compressed files in HDFS I use a command such as:&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;&lt;code class='bash'&gt;shell&lt;span class='nv'&gt;$ &lt;/span&gt;hadoop fs -cat /some/file.lzo | lzop -dc | head
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;With this command the contents of an LZOP-compressed file in HDFS are piped to the &lt;code&gt;lzop&lt;/code&gt; utility, where the &lt;code&gt;-dc&lt;/code&gt; flags tell lzop to decompress the stream and write the uncompressed data to standard out, and the final &lt;code&gt;head&lt;/code&gt; will show the first 10 lines of the data. I may substitute &lt;code&gt;head&lt;/code&gt; with other utilities such as &lt;code&gt;awk&lt;/code&gt; or &lt;code&gt;sed&lt;/code&gt;, but I always follow this general pattern of piping the &lt;code&gt;lzop&lt;/code&gt; output to another utility.&lt;/p&gt;

&lt;p&gt;Imagine my surprise the other day when I tried the same command on a smaller file (hence not needing to use the &lt;code&gt;head&lt;/code&gt; command), only to see this error:&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;&lt;code class='bash'&gt;shell&lt;span class='nv'&gt;$ &lt;/span&gt;hadoop fs -cat /some/file.lzo | lzop -dc
lzop: &amp;lt;stdout&amp;gt;: uncompressed data not written to a terminal
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;What just happened - why would the first command work, but not the second? My guess is that this is the authors of the &lt;code&gt;lzop&lt;/code&gt; utility safeguarding us from accidentally flooding standard output with uncompressed data. Which is frustrating, because as you can see from the following example this is a different route than the one the authors of &lt;code&gt;gunzip&lt;/code&gt; took:&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;&lt;code class='bash'&gt;shell&lt;span class='nv'&gt;$ &lt;/span&gt;&lt;span class='nb'&gt;echo&lt;/span&gt; &lt;span class='s2'&gt;&amp;quot;the cat&amp;quot;&lt;/span&gt; | gzip -c | gunzip -c
the cat
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;If we run the same command with &lt;code&gt;lzop&lt;/code&gt; we see the same result as we saw earlier:&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;&lt;code class='bash'&gt;shell&lt;span class='nv'&gt;$ &lt;/span&gt;&lt;span class='nb'&gt;echo&lt;/span&gt; &lt;span class='s2'&gt;&amp;quot;the cat&amp;quot;&lt;/span&gt; | lzop -c | lzop -dc
lzop: &amp;lt;stdout&amp;gt;: uncompressed data not written to a terminal
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;A ghetto approach to solving this problem is to pipe the &lt;code&gt;lzop&lt;/code&gt; output to &lt;code&gt;cat&lt;/code&gt; (which is a necessary violation of the &lt;a href='/2012/10/30/useless-cats/'&gt;useless cat&lt;/a&gt; pattern):&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;&lt;code class='bash'&gt;shell&lt;span class='nv'&gt;$ &lt;/span&gt;hadoop fs -cat /some/file.lzo | lzop -dc | cat
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
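&lt;p&gt;Why would tacking a &lt;code&gt;cat&lt;/code&gt; (or a &lt;code&gt;head&lt;/code&gt;) on the end make any difference? Going by the wording of the error, the likely explanation is that &lt;code&gt;lzop&lt;/code&gt; checks whether its standard output is a terminal, and a pipe isn&amp;#8217;t one. The following is only a sketch of that idea, not what &lt;code&gt;lzop&lt;/code&gt; itself does internally; &lt;code&gt;stdout_kind&lt;/code&gt; is a made-up helper that uses the shell&amp;#8217;s &lt;code&gt;-t&lt;/code&gt; test to report what standard output is attached to:&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;&lt;code class='bash'&gt;# Hypothetical helper (not part of lzop): report what file descriptor 1 is attached to.
stdout_kind() {
  if [ -t 1 ]; then
    echo &amp;quot;terminal&amp;quot;
  else
    echo &amp;quot;not a terminal&amp;quot;
  fi
}

# Typed at an interactive prompt, standard output is normally the terminal itself.
stdout_kind

# With a pipe on the end, standard output is the pipe, so this prints &amp;quot;not a terminal&amp;quot;.
stdout_kind | cat
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;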
&lt;p&gt;Luckily &lt;code&gt;lzop&lt;/code&gt; has a &lt;code&gt;-f&lt;/code&gt; option which removes the need for the &lt;code&gt;cat&lt;/code&gt;:&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;&lt;code class='bash'&gt;shell&lt;span class='nv'&gt;$ &lt;/span&gt;hadoop fs -cat /some/file.lzo | lzop -dcf
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;It turns out that the &lt;code&gt;man&lt;/code&gt; page for &lt;code&gt;lzop&lt;/code&gt; is instructive with regards to the &lt;code&gt;-f&lt;/code&gt; option, and indicates various scenarios where it can be helpful:&lt;/p&gt;
&lt;div class='highlight'&gt;&lt;pre&gt;&lt;code class='bash'&gt;shell&lt;span class='nv'&gt;$ &lt;/span&gt;man lzop
...
-f, --force
   Force lzop to

    - overwrite existing files
    - &lt;span class='o'&gt;(&lt;/span&gt;de-&lt;span class='o'&gt;)&lt;/span&gt;compress from stdin even &lt;span class='k'&gt;if &lt;/span&gt;it seems a terminal
    - &lt;span class='o'&gt;(&lt;/span&gt;de-&lt;span class='o'&gt;)&lt;/span&gt;compress to stdout even &lt;span class='k'&gt;if &lt;/span&gt;it seems a terminal
    - allow option -c in combination with -U

   Using -f two or more &lt;span class='nb'&gt;times &lt;/span&gt;forces things like

    - compress files that already have a .lzo suffix
    - try to decompress files that &lt;span class='k'&gt;do &lt;/span&gt;not have a valid suffix
    - try to handle compressed files with unknown header flags

   Use with care.
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;</content>
 </entry>
 
 <entry>
   <title>Executing variables that contain shell operators</title>
   
   <category term="--" />
   
   <category term="*nix" />
   
   <link href="http://grepalex.com/2013/01/27/shell-variable-execution-with-pipes"/>
   <updated>2013-01-27T22:20:00+00:00</updated>
   <id>http://grepalex.com/2013/01/27/shell-variable-execution-with-pipes</id>
   <content type="html">&lt;p&gt;I touched a little on &lt;a href='/2012/10/30/useless-cats/'&gt;pipes&lt;/a&gt; in a previous post. Here&amp;#8217;s a quick example of an &lt;code&gt;echo&lt;/code&gt; utility which outputs two lines, and a pipe operator which redirects that output to a &lt;code&gt;grep&lt;/code&gt; utility which performs a simple filter to only include lines that contain the word &amp;#8220;cat&amp;#8221;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;shell$ echo -e &amp;#39;the cat \n sat on the mat&amp;#39; | grep cat
the cat&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Cool - since that worked, what do you think will happen if you do the following?&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;shell$ cmd=&amp;quot;echo -e &amp;#39;the cat \n sat on the mat&amp;#39; | grep cat&amp;quot;
shell$ ${cmd}&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In the above example we&amp;#8217;re simply assigning the original utility to a shell variable, and then executing it. So why, then, would the output be this?&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;shell$ ${cmd}
&amp;#39;the cat
 sat on the mat&amp;#39; | grep cat&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This is something that has bitten me in the past when I write shell scripts. What&amp;#8217;s happening here is that the shell executes the contents of variable &lt;code&gt;cmd&lt;/code&gt; as a single command, which means that everything after &lt;code&gt;echo&lt;/code&gt; is treated as arguments to the echo utility, including the pipe.&lt;/p&gt;

&lt;p&gt;&lt;img alt='variable-execution' src='/images/variable-execution.png' /&gt;&lt;/p&gt;
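
&lt;p&gt;You can see this for yourself by asking the shell to show the words it produces when it expands the variable. This is just a quick sketch for illustration: it assumes a POSIX-style shell such as bash, and it repeats the &lt;code&gt;cmd&lt;/code&gt; assignment from above so that it can be pasted on its own. Notice that the pipe comes out as an ordinary word, no different from &lt;code&gt;grep&lt;/code&gt; or &lt;code&gt;cat&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;cmd=&amp;quot;echo -e &amp;#39;the cat \n sat on the mat&amp;#39; | grep cat&amp;quot;

# Unquoted expansion splits the string on whitespace and nothing more,
# so the quotes and the pipe are never re-parsed as shell syntax.
for word in ${cmd}; do
  printf &amp;#39;[%s]\n&amp;#39; &amp;quot;${word}&amp;quot;
done&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The loop prints twelve words. The first (&lt;code&gt;echo&lt;/code&gt;) is what gets run as the command, and the remaining eleven, &lt;code&gt;|&lt;/code&gt; included, are handed to it as plain arguments.&lt;/p&gt;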

&lt;p&gt;What we actually need to happen is to have the entire contents of &lt;code&gt;cmd&lt;/code&gt; evaluated by the shell so that the shell can create the pipeline between the two utilities. This is where the utility &lt;a href='http://www.unix.com/man-page/posix/1posix/eval/'&gt;eval&lt;/a&gt; comes into play - &lt;code&gt;eval&lt;/code&gt; tells the shell to concatenate the arguments and have them executed by the shell.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;shell$ eval ${cmd}
the cat&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The moral of this story is that if you want to execute a variable that includes any shell constructs (such as the pipe in our example) - then make sure you &lt;code&gt;eval&lt;/code&gt;. Examples of shell constructs include redirections (e.g. &lt;code&gt;echo &amp;quot;the cat&amp;quot; &amp;gt; file1.txt&lt;/code&gt;), shell conditionals, loops and functions.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Using awk and friends with Hadoop</title>
   
   <category term="--" />
   
   <category term="hadoop" />
   
   <link href="http://grepalex.com/2013/01/17/awk-with-hadoop-streaming"/>
   <updated>2013-01-17T14:20:00+00:00</updated>
   <id>http://grepalex.com/2013/01/17/awk-with-hadoop-streaming</id>
   <content type="html">&lt;p&gt;Imagine you have a CSV file that you want to manipulate. Here&amp;#8217;s a sample file we can play with:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;lopez,charlie,2002,11,21
parker,ward,1995,04,08
henderson,russell,2007,10,01&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Our goal is to transform this into the following form by combining the last three columns:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;lopez,charlie,20021121
parker,ward,19950408
henderson,russell,20071001&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In Linux this would take all of two seconds (excuse the awkward awk command):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;shell$ awk -F&amp;quot;,&amp;quot; &amp;#39;{ print $1&amp;quot;,&amp;quot;$2&amp;quot;,&amp;quot;$3$4$5 }&amp;#39; people.txt&lt;/code&gt;&lt;/pre&gt;
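
&lt;p&gt;If the string concatenation in that command bothers you, a slightly tidier variant is to set awk&amp;#8217;s output field separator (&lt;code&gt;OFS&lt;/code&gt;) alongside &lt;code&gt;FS&lt;/code&gt;, so that the commas in &lt;code&gt;print&lt;/code&gt; emit the delimiter for you. Treat this as a sketch of an alternative rather than the one true way; &lt;code&gt;coalesce_dates&lt;/code&gt; is a made-up function name, and two of the sample rows are fed in with &lt;code&gt;printf&lt;/code&gt; so that it can be tried without a file:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Hypothetical wrapper: reads CSV on standard input and merges columns 3 to 5.
coalesce_dates() {
  # FS splits the input on commas; OFS is what print puts between its arguments.
  awk &amp;#39;BEGIN { FS = OFS = &amp;quot;,&amp;quot; } { print $1, $2, $3 $4 $5 }&amp;#39;
}

printf &amp;#39;%s\n&amp;#39; &amp;#39;lopez,charlie,2002,11,21&amp;#39; &amp;#39;parker,ward,1995,04,08&amp;#39; | coalesce_dates&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This prints &lt;code&gt;lopez,charlie,20021121&lt;/code&gt; and &lt;code&gt;parker,ward,19950408&lt;/code&gt;, the same as the original one-liner would for those rows.&lt;/p&gt;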

&lt;p&gt;What if you wanted to quickly do the same in HDFS - and let&amp;#8217;s assume you want to write the results back to HDFS. One approach would be to use the HDFS CLI to stream the inputs into awk, and stream the awk output back into HDFS. You could do this with the HDFS &lt;code&gt;cat&lt;/code&gt; and &lt;code&gt;put -&lt;/code&gt; options (note that adding a hyphen after &lt;code&gt;put&lt;/code&gt; instructs the put command to stream data from standard input to HDFS):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;shell$ hadoop fs -cat people.txt | awk -F&amp;quot;,&amp;quot; &amp;#39;{ print $1&amp;quot;,&amp;quot;$2&amp;quot;,&amp;quot;$3$4$5 }&amp;#39; | hadoop fs -put - people-coalesed.txt&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;BTW, if your input and output files are LZOP-compressed then this command would work:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;shell$ hadoop fs -cat people.txt.lzo | lzop -dc | awk -F&amp;quot;,&amp;quot; &amp;#39;{ print $1&amp;quot;,&amp;quot;$2&amp;quot;,&amp;quot;$3$4$5 }&amp;#39; | \
         lzop -c | hadoop fs -put - people-coalesed.txt.lzo&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This is great if your file isn&amp;#8217;t too large, but if it&amp;#8217;s multiple gigabytes in length then you probably want to harness the power of MapReduce to get this done in a jiffy! The words &amp;#8220;in a jiffy&amp;#8221; and &amp;#8220;MapReduce&amp;#8221; aren&amp;#8217;t commonly used together, so what do we do? Well you could crack open Pig or Hive and write some custom user-defined functions, but this means you end up in Java which we want to avoid.&lt;/p&gt;

&lt;p&gt;Hadoop Streaming comes to the rescue in these situations. Let&amp;#8217;s first create our awk script which will be executed:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;shell$ cat people.awk
#!/bin/awk -f

BEGIN { FS = &amp;quot;,&amp;quot; }
{ print $1&amp;quot;,&amp;quot;$2&amp;quot;,&amp;quot;$3$4$5 }&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In Linux, if you make this awk script executable, you could execute it as follows:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;shell$ ./people.awk people.txt&lt;/code&gt;&lt;/pre&gt;
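
&lt;p&gt;Hadoop Streaming hands each input line to the mapper on standard input, so it can be worth piping a sample line through the script locally before submitting a job. This is only a sketch: the five-column input line below is an assumed layout chosen to match the sample output, not a line taken from the real &lt;code&gt;people.txt&lt;/code&gt;.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;shell$ printf &amp;#39;henderson,russell,2007,10,01\n&amp;#39; | awk -f people.awk
henderson,russell,20071001&lt;/code&gt;&lt;/pre&gt;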

&lt;p&gt;In MapReduce-land we don&amp;#8217;t need to join data in this particular example, so we don&amp;#8217;t need to run any reducers. Call your awk script from mappers via &lt;a href='http://hadoop.apache.org/docs/mapreduce/current/streaming.html'&gt;Hadoop Streaming&lt;/a&gt; with this command:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;shell$ HADOOP_HOME=/usr/lib/hadoop
shell$ ${HADOOP_HOME}/bin/hadoop \
  jar ${HADOOP_HOME}/contrib/streaming/*.jar \
  -D mapreduce.job.reduces=0 \
  -D mapred.reduce.tasks=0 \
  -input people.txt \
  -output people-coalesed \
  -mapper people.awk \
  -file people.awk&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;You can view the output in HDFS with a &lt;code&gt;cat&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;shell$ hadoop fs -cat /user/aholmes/people-coalesed/part*
henderson,russell,20071001
lopez,charlie,20021121
parker,ward,19950408&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;A few options in the Hadoop Streaming command are worth examining:&lt;/p&gt;

&lt;p&gt;&lt;img alt='awk-streaming-image' src='/images/awk-streaming.png' /&gt;&lt;/p&gt;

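&lt;p&gt;Finally - to get LZO into the picture you need to add &lt;code&gt;-inputformat&lt;/code&gt;, &lt;code&gt;-D mapred.output.compress&lt;/code&gt; and &lt;code&gt;-D mapred.output.compression.codec&lt;/code&gt; arguments. The command below also sets &lt;code&gt;-D stream.map.input.ignoreKey=true&lt;/code&gt; so that only the line contents, and not the keys produced by the LZO input format, are passed to the awk script:&lt;/p&gt;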

&lt;pre&gt;&lt;code&gt;shell$ HADOOP_HOME=/usr/lib/hadoop
shell$ ${HADOOP_HOME}/bin/hadoop \
  jar ${HADOOP_HOME}/contrib/streaming/*.jar \
  -D mapreduce.job.reduces=0 \
  -D mapred.reduce.tasks=0 \
  -D mapred.output.compress=true \
  -D stream.map.input.ignoreKey=true \
  -D mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec \
  -inputformat com.hadoop.mapred.DeprecatedLzoTextInputFormat \
  -input people.txt.lzo \
  -output people-coalesed \
  -mapper people.awk \
  -file people.awk&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Update 6/3/2013:&lt;/p&gt;

&lt;p&gt;This article has a &lt;a href='http://science.webhostinggeeks.com/koriscenje-awk-i-prijatelja'&gt;Serbo-Croatian&lt;/a&gt; translation by Anja Skrba.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Configuring and tuning MapReduce's shuffle</title>
   
   <category term="--" />
   
   <category term="hadoop" />
   
   <link href="http://grepalex.com/2012/11/26/hadoop-shuffle-configurables"/>
   <updated>2012-11-26T14:20:00+00:00</updated>
   <id>http://grepalex.com/2012/11/26/hadoop-shuffle-configurables</id>
   <content type="html">&lt;p&gt;Once you have outgrown your small Hadoop cluster it&amp;#8217;s worth tuning some of the shuffle configurables to ensure that your performance keeps up with the physical growth of your cluster. The figure below shows key configurables in the shuffle stage in Hadoop versions 1.x and earlier, and identifies those that should be tuned.&lt;/p&gt;

&lt;p&gt;&lt;img alt='parition' src='/images/hadoop-shuffle-configurables.png' /&gt;&lt;/p&gt;

&lt;p&gt;You can read more about these configurables and their default values by looking at &lt;a href='http://hadoop.apache.org/docs/r1.0.3/mapred-default.html'&gt;mapred-default.xml&lt;/a&gt;. My book &lt;a href='http://www.manning.com/holmes/'&gt;Hadoop in Practice&lt;/a&gt; (Manning Publications) in chapter 6 discusses how some of the configuration values in the figure should be tweaked when you start working with mid to large-size Hadoop clusters.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Controlling user logging in Hadoop</title>
   
   <category term="--" />
   
   <category term="hadoop" />
   
   <link href="http://grepalex.com/2012/11/12/hadoop-logging"/>
   <updated>2012-11-12T14:20:00+00:00</updated>
   <id>http://grepalex.com/2012/11/12/hadoop-logging</id>
   <content type="html">&lt;p&gt;Imagine that you&amp;#8217;re a Hadoop administrator, and to make things interesting you&amp;#8217;re managing a multi-tenant Hadoop cluster where data scientists, developers and QA are pounding your cluster. One day you notice that your disks are filling-up fast, and after some investigating you realize that the root cause is your MapReduce task attempt logs.&lt;/p&gt;

&lt;p&gt;How do you guard against this sort of thing happening? Before we get to that we need to understand where these files exist, and how they&amp;#8217;re written. The figure below shows the three log files that are created for each task attempt in MapReduce. Notice that the logs are written to the local disk of the task attempt.&lt;/p&gt;

&lt;p&gt;&lt;img alt='parition' src='/images/hadoop-task-logs-location.png' /&gt;&lt;/p&gt;

&lt;p&gt;OK, so how does Hadoop normally make sure that our disks don&amp;#8217;t fill-up with these task attempt logs? I&amp;#8217;ll cover three approaches.&lt;/p&gt;

&lt;h2 id='approach_1_mapreduserlogretainhours'&gt;Approach 1: mapred.userlog.retain.hours&lt;/h2&gt;

&lt;p&gt;Hadoop has a &lt;code&gt;mapred.userlog.retain.hours&lt;/code&gt; configurable, which is defined in &lt;a href='http://hadoop.apache.org/docs/r1.0.3/mapred-default.html'&gt;mapred-default.xml&lt;/a&gt; as:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The maximum time, in hours, for which the user-logs are to be retained after the job completion.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Great, but what if your disks are filling up before Hadoop has had a chance to automatically clean them up? It may be tempting to reduce &lt;code&gt;mapred.userlog.retain.hours&lt;/code&gt; to a smaller value, but before you do that you should know that there&amp;#8217;s a bug in Hadoop versions 1.x and earlier (see &lt;a href='https://issues.apache.org/jira/browse/MAPREDUCE-158'&gt;MAPREDUCE-158&lt;/a&gt;), where the logs for jobs that run longer than &lt;code&gt;mapred.userlog.retain.hours&lt;/code&gt; are accidentally deleted. So maybe we should look elsewhere to solve our overflowing logs problem.&lt;/p&gt;

&lt;h2 id='approach_2_mapreduserloglimitkb'&gt;Approach 2: mapred.userlog.limit.kb&lt;/h2&gt;

&lt;p&gt;Hadoop has another configurable, &lt;code&gt;mapred.userlog.limit.kb&lt;/code&gt;, which can be used to limit the file size of &lt;code&gt;syslog&lt;/code&gt;, which is the log4j log output file. Let&amp;#8217;s peek again at the documentation:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The maximum size of user-logs of each task in KB. 0 disables the cap.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The default value is &lt;code&gt;0&lt;/code&gt;, which means that log writes go straight to the log file. So all we need to do is to set a positive value and we&amp;#8217;re set, right? Not so fast - it turns out that this approach has two disadvantages:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Hadoop and user logs are actually cached in memory, so you&amp;#8217;re taking away &lt;code&gt;mapred.userlog.limit.kb&lt;/code&gt; kilobytes worth of memory from your task attempt&amp;#8217;s process.&lt;/li&gt;

&lt;li&gt;Logs are only written out when the task attempt process has completed, and only contain the last &lt;code&gt;mapred.userlog.limit.kb&lt;/code&gt; worth of log entries, so this can make it challenging to debug long-running tasks.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;OK, so what else can we try? We have one more solution, log levels.&lt;/p&gt;

&lt;h2 id='approach_3_changing_log_levels'&gt;Approach 3: Changing log levels&lt;/h2&gt;

&lt;p&gt;Ideally all your Hadoop users got the memo about minimizing excessive logging. In reality you have limited control over what users decide to log in their code, but you do have control over the task attempt log levels.&lt;/p&gt;

&lt;p&gt;If you had a MapReduce job that was aggressively logging in package &lt;code&gt;com.example.mr&lt;/code&gt;, then you may be tempted to use the &lt;em&gt;daemonlog&lt;/em&gt; CLI to connect to all the TaskTracker daemons and change the logging to ERROR level:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;hadoop daemonlog -setlevel &amp;lt;host:port&amp;gt; com.example.mr ERROR&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Yet again we hit a roadblock - this will only change the logging level for the TaskTracker process, and not for the task attempt process. Drat! This really only leaves one option, which is to update your &lt;code&gt;${HADOOP_HOME}/conf/log4j.properties&lt;/code&gt; on all your data nodes by adding the following line to this file:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;log4j.logger.com.example.mr=ERROR&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The great thing about this change is that you don&amp;#8217;t need to restart MapReduce, since any new task attempt processes will pick up your changes to &lt;code&gt;log4j.properties&lt;/code&gt;.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Pipes and useless cats</title>
   
   <category term="--" />
   
   <category term="*nix" />
   
   <link href="http://grepalex.com/2012/10/30/useless-cats"/>
   <updated>2012-10-30T03:20:00+00:00</updated>
   <id>http://grepalex.com/2012/10/30/useless-cats</id>
   <content type="html">&lt;p&gt;I love me some Unix command &lt;a href='http://en.wikipedia.org/wiki/Pipeline_(Unix%29'&gt;pipes&lt;/a&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ cat /some/file.txt | sort | head&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Pipelines let you chain together multiple commands to manipulate data flows. Pipes are not only useful as a data filtering mechanism, but when combined with tools such as &lt;code&gt;cut&lt;/code&gt;, &lt;code&gt;awk&lt;/code&gt; and &lt;code&gt;sed&lt;/code&gt; can also be used for projections and transformations. The Unix pipe, while simple in concept, is a sophisticated shell construct and one big reason why Unix shells are to this day a popular tool in a programmer/system administrator/data scientist&amp;#8217;s toolkit.&lt;/p&gt;

&lt;p&gt;So why am I sitting here telling you something that you already know? Fair question - to answer that let&amp;#8217;s take another look at that command:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ cat /some/file.txt | sort | head&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;While shell pipelines are great, we have a subtle problem here - and it&amp;#8217;s something that&amp;#8217;s known as a &lt;em&gt;useless cat&lt;/em&gt;. No, I don&amp;#8217;t hate cats - &lt;a href='http://partmaps.org/era/unix/award.html'&gt;this expression harks back&lt;/a&gt; to the old &lt;em&gt;Usenet&lt;/em&gt; days, when a member of &lt;em&gt;comp.unix.shell&lt;/em&gt; would write a weekly post highlighting a redundant use of the &lt;code&gt;cat&lt;/code&gt; command.&lt;/p&gt;

&lt;p&gt;So why is the above command useless? Because &lt;code&gt;sort&lt;/code&gt; can take one or more files as arguments, much like the majority of Unix commands. So this command can be rewritten as:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ sort /some/file.txt | head&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Removing &lt;code&gt;cat&lt;/code&gt; from the equation means that we&amp;#8217;ve reduced the number of processes that need to execute, and cut down on the buffering and data copying that the shell needs to do to make pipelines work - a win-win.&lt;/p&gt;

&lt;p&gt;In fact &lt;code&gt;cat&lt;/code&gt; really doesn&amp;#8217;t have many uses - if you need to view the contents of a file you&amp;#8217;re better off using &lt;code&gt;vi&lt;/code&gt; or &lt;code&gt;less&lt;/code&gt;, and otherwise most Unix commands can directly work with files.&lt;/p&gt;
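
&lt;p&gt;For the minority of commands that only read from standard input, such as &lt;code&gt;tr&lt;/code&gt;, shell input redirection does the same job without the extra process. A minimal sketch, reusing the placeholder path from above:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ tr a-z A-Z &amp;lt; /some/file.txt&lt;/code&gt;&lt;/pre&gt;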

&lt;p&gt;So next time you&amp;#8217;re about to run a &lt;code&gt;cat&lt;/code&gt; command - think about whether or not you need it, or whether you&amp;#8217;re just perpetuating use of the &lt;em&gt;useless cat&lt;/em&gt;!&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Hadoop unit testing with MiniMRCluster and MiniDFSCluster</title>
   
   <category term="--" />
   
   <category term="hadoop" />
   
   <link href="http://grepalex.com/2012/10/20/hadoop-unit-testing-with-minimrcluster"/>
   <updated>2012-10-20T05:20:00+00:00</updated>
   <id>http://grepalex.com/2012/10/20/hadoop-unit-testing-with-minimrcluster</id>
   <content type="html">&lt;p&gt;In a &lt;a href='http://steveloughran.blogspot.com/2012/10/hadoop-in-practice-applied-hadoop.html'&gt;recent blog post&lt;/a&gt; Steve Loughran mentioned that I didn&amp;#8217;t cover Hadoop&amp;#8217;s &lt;a href='http://svn.apache.org/viewvc/hadoop/common/tags/release-1.0.3/src/test/org/apache/hadoop/mapred/MiniMRCluster.java?view=co'&gt;MiniMRCluster&lt;/a&gt; in my book. At the time I wrote the testing chapter of &lt;em&gt;&amp;#8220;Hadoop in Practice&amp;#8221;&lt;/em&gt; I decided that covering &lt;a href='http://mrunit.apache.org/'&gt;MRUnit&lt;/a&gt; and &lt;a href='http://svn.apache.org/viewvc/hadoop/common/tags/release-1.0.3/src/mapred/org/apache/hadoop/mapred/LocalJobRunner.java?view=co'&gt;LocalJobRunner&lt;/a&gt; was sufficient to meet the goals of most MapReduce unit tests, but for completeness I want to cover MiniMRCluster in this post.&lt;/p&gt;

&lt;p&gt;MRUnit is great for quick and easy unit testing of MapReduce jobs, where you don&amp;#8217;t want to test Input/OutputFormat and Partitioner code. LocalJobRunner is a step above MRUnit in that it allows you to test Input/OutputFormat classes, but it is single-threaded so it&amp;#8217;s not useful for uncovering bugs related to multiple map or reduce tasks, or for properly exercising partitioners.&lt;/p&gt;

&lt;p&gt;That&amp;#8217;s where &lt;a href='http://svn.apache.org/viewvc/hadoop/common/tags/release-1.0.3/src/test/org/apache/hadoop/mapred/MiniMRCluster.java?view=co'&gt;MiniMRCluster&lt;/a&gt; (and &lt;a href='http://svn.apache.org/viewvc/hadoop/common/tags/release-1.0.3/src/test/org/apache/hadoop/hdfs/MiniDFSCluster.java?view=co'&gt;MiniDFSCluster&lt;/a&gt;) come into play. These classes offer full-blown in-memory MapReduce and HDFS clusters, and can launch multiple MapReduce and HDFS nodes. MiniMRCluster and MiniDFSCluster are bundled with the Hadoop 1.x test JAR, and are used heavily within Hadoop&amp;#8217;s own unit tests.&lt;/p&gt;

&lt;p&gt;The easy way to leverage MiniMRCluster and MiniDFSCluster is to extend the abstract &lt;a href='http://svn.apache.org/viewvc/hadoop/common/tags/release-1.0.3/src/test/org/apache/hadoop/mapred/ClusterMapReduceTestCase.java?view=co'&gt;ClusterMapReduceTestCase&lt;/a&gt; class, which is a JUnit &lt;code&gt;TestCase&lt;/code&gt; and starts/stops a Hadoop cluster around each JUnit test. ClusterMapReduceTestCase runs a 2-node MapReduce cluster with 2 HDFS nodes. The way you should be able to use this class is as follows:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;public class WordCountTest extends ClusterMapReduceTestCase {
    public void test() throws Exception {
        JobConf conf = createJobConf();

        Path inDir = new Path(&amp;quot;testing/jobconf/input&amp;quot;);
        Path outDir = new Path(&amp;quot;testing/jobconf/output&amp;quot;);

        OutputStream os = getFileSystem().create(new Path(inDir, &amp;quot;text.txt&amp;quot;));
        Writer wr = new OutputStreamWriter(os);
        wr.write(&amp;quot;b a\n&amp;quot;);
        wr.close();

        conf.setJobName(&amp;quot;mr&amp;quot;);

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(LongWritable.class);

        conf.setMapperClass(WordCountMapper.class);
        conf.setReducerClass(SumReducer.class);

        FileInputFormat.setInputPaths(conf, inDir);
        FileOutputFormat.setOutputPath(conf, outDir);

        assertTrue(JobClient.runJob(conf).isSuccessful());

        // Check the output is as expected
        Path[] outputFiles = FileUtil.stat2Paths(
                getFileSystem().listStatus(outDir, new Utils.OutputFileUtils.OutputFilesFilter()));

        assertEquals(1, outputFiles.length);

        InputStream in = getFileSystem().open(outputFiles[0]);
        BufferedReader reader = new BufferedReader(new InputStreamReader(in));
        assertEquals(&amp;quot;a\t1&amp;quot;, reader.readLine());
        assertEquals(&amp;quot;b\t1&amp;quot;, reader.readLine());
        assertNull(reader.readLine());
        reader.close();
    }
}&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;However, at least with the Hadoop 1.0.3 release, this will fail with the following exception:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;12/10/19 23:10:37 ERROR mapred.MiniMRCluster: Job tracker crashed
java.lang.NullPointerException
  at java.io.File.&amp;lt;init&amp;gt;(File.java:222)
  at org.apache.hadoop.mapred.JobHistory.initLogDir(JobHistory.java:531)
  at org.apache.hadoop.mapred.JobHistory.init(JobHistory.java:499)
  at org.apache.hadoop.mapred.JobTracker$2.run(JobTracker.java:2334)
  at org.apache.hadoop.mapred.JobTracker$2.run(JobTracker.java:2331)
  at java.security.AccessController.doPrivileged(Native Method)
  ...&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The trick here is that the JobTracker is expecting &lt;code&gt;hadoop.log.dir&lt;/code&gt; to be set in the system properties, which it isn&amp;#8217;t in our example, causing the NPE. As it turns out this is a bug (see &lt;a href='https://issues.apache.org/jira/browse/MAPREDUCE-2785'&gt;MAPREDUCE-2785&lt;/a&gt;) which according to Jira will be fixed in the Hadoop 1.1 release (thanks to Steve for that information). The fix is simple - override the &lt;code&gt;setUp()&lt;/code&gt; method in ClusterMapReduceTestCase and set the Hadoop log directory:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;@Override
protected void setUp() throws Exception {

    System.setProperty(&amp;quot;hadoop.log.dir&amp;quot;, &amp;quot;/tmp/logs&amp;quot;);

    super.startCluster(true, null);
}&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Once you make this change the above JUnit test will work. This can be a bit tedious to have to roll into each and every one of your unit tests, but luckily there are a couple of options out there so that you don&amp;#8217;t have to. First, Steve pointed out a &lt;a href='http://smartfrog.svn.sourceforge.net/viewvc/smartfrog/trunk/core/hadoop-components/grumpy/src/org/smartfrog/services/hadoop/grumpy/LocalMRCluster.groovy'&gt;LocalMRCluster&lt;/a&gt; Groovy class bundled in &lt;a href='http://wiki.smartfrog.org/wiki/display/sf/SmartFrog+Home'&gt;SmartFrog&lt;/a&gt; which fixes this issue by extending MiniMRCluster.&lt;/p&gt;

&lt;p&gt;Another alternative is to use my GitHub &lt;a href='https://github.com/alexholmes/hadoop-utils'&gt;hadoop-utils&lt;/a&gt; project which contains a JUnit class similar to ClusterMapReduceTestCase called &lt;a href='https://github.com/alexholmes/hadoop-utils/blob/master/src/main/java/com/alexholmes/hadooputils/test/MiniHadoopTestCase.java'&gt;MiniHadoopTestCase&lt;/a&gt;. It fixes this property problem, gives you more control over where the in-memory clusters will store their data on your local filesystem, and lets you control the number of TaskTrackers and DataNodes.&lt;/p&gt;

&lt;p&gt;Hadoop-utils also contains a helper class (TextIOJobBuilder) to help with writing MapReduce input files, and verifying the output results. You can see an example of how clean your unit tests can look when combining TextIOJobBuilder with MiniHadoopTestCase in class &lt;a href='https://github.com/alexholmes/hadoop-utils/blob/master/src/test/java/com/alexholmes/hadooputils/sort/TotalOrderSortTest.java'&gt;TotalOrderSortTest&lt;/a&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;public class TotalOrderSortTest extends MiniHadoopTestCase {

    @Test
    public void test() throws Exception {

        InputSampler.RandomSampler sampler = new InputSampler.RandomSampler(1.0, 6, 1);

        JobConf jobConf = super.getMiniHadoop().createJobConf();

        TextIOJobBuilder builder = new TextIOJobBuilder(
                super.getMiniHadoop().getFileSystem())
                .addInput(&amp;quot;foo-hump&amp;quot;)
                .addInput(&amp;quot;foo-hump&amp;quot;)
                .addInput(&amp;quot;clump-bar&amp;quot;)
                .addExpectedOutput(&amp;quot;clump-bar&amp;quot;)
                .addExpectedOutput(&amp;quot;foo-hump&amp;quot;)
                .writeInputs();

        new SortConfig(jobConf).setUnique(true);

        SortTest.run(
                jobConf,
                builder,
                2,
                2,
                sampler);
    }
}&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The only real downside to using MiniMRCluster and MiniDFSCluster is speed - it takes a good 5-10 seconds for both setup and tear-down, and when you multiply this by the number of test cases it can add up.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>How partitioning, collecting and spilling work in MapReduce</title>
   
   <category term="--" />
   
   <category term="*nix" />
   
   <link href="http://grepalex.com/2012/09/24/map-partition-sort-spill"/>
   <updated>2012-09-24T05:20:00+00:00</updated>
   <id>http://grepalex.com/2012/09/24/map-partition-sort-spill</id>
   <content type="html">&lt;p&gt;The figure below shows the various steps that the Hadoop MapReduce framework takes after your map function emits a key/value output record. Please note that this figure represents what&amp;#8217;s happening with Hadoop versions 1.x and earlier - in Hadoop 2.x there have been some changes which will be discussed in a future blog post.&lt;/p&gt;

&lt;p&gt;My book &lt;a href='http://www.manning.com/holmes/'&gt;Hadoop in Practice&lt;/a&gt; (Manning Publications) in chapter 6 discusses how some of the configuration values in the figure should be tweaked when you start working with mid to large-size Hadoop clusters.&lt;/p&gt;

&lt;p&gt;&lt;img alt='parition' src='/images/hadoopv1-partition-collect-spill.png' /&gt;&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Using sed to perform inline replacements of regex groups</title>
   
   <category term="--" />
   
   <category term="*nix" />
   
   <link href="http://grepalex.com/2012/09/17/sed-regex-substitutions"/>
   <updated>2012-09-17T05:20:00+00:00</updated>
   <id>http://grepalex.com/2012/09/17/sed-regex-substitutions</id>
   <content type="html">&lt;p&gt;I love tools like sed and awk - I use them every day, and only realize how much I rely on them when I&amp;#8217;m forced to work on a machine that&amp;#8217;s not running Unix. Today I want to look at a feature that is really useful when working with regular expressions in sed.&lt;/p&gt;

&lt;p&gt;Imagine that you had an IP address, and you wanted to change the second octet - one way to do this in sed is the following:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;shell$ echo &amp;quot;127.0.0.1&amp;quot; | sed &amp;quot;s/127.0/127.1/&amp;quot;
127.1.0.1&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;That seemed to work well, and was simple. But what if you had a file of random IP&amp;#8217;s - how would you change the second octet in that scenario? Sure, you could use awk, but that feels like it would be overkill. Well, it can be done in sed with something called regular expression group substitutions.&lt;/p&gt;

&lt;p&gt;First of all, you&amp;#8217;ll need to tell sed that you are using extended regular expressions by using the &lt;code&gt;-r&lt;/code&gt; option, so that you don&amp;#8217;t have to escape some of the regular expression characters (if you&amp;#8217;re curious, they are &lt;code&gt;?+(){}&lt;/code&gt;). If you end up needing to use any of these characters as literals, you&amp;#8217;ll need to escape them with a backslash (&lt;code&gt;\&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;sed supports up to 9 groups that can be defined in the pattern string, and subsequently referenced in the replacement string. In the following command the pattern string starts with a group, which contains the first octet followed by the period, and that&amp;#8217;s followed by a second octet. In the replacement string we&amp;#8217;re referencing the first (and only) group with &lt;code&gt;\1&lt;/code&gt;, followed by &lt;code&gt;234&lt;/code&gt; which is the replacement for the rest of the matching string, which contains the second octet.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;shell$ echo &amp;quot;127.0.0.1&amp;quot; | sed -r &amp;quot;s/^([0-9]{1,3}\.)[0-9]{1,3}/\1234/&amp;quot;
127.234.0.1&lt;/code&gt;&lt;/pre&gt;
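
&lt;p&gt;To come back to the file-of-IPs question, the same expression can be pointed at a file, and sed applies the substitution to every line. The sketch below assumes a hypothetical two-line file called &lt;code&gt;ips.txt&lt;/code&gt;. Note that &lt;code&gt;-r&lt;/code&gt; is the GNU sed spelling; BSD and macOS sed use &lt;code&gt;-E&lt;/code&gt; for extended regular expressions.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;shell$ printf &amp;#39;10.1.2.3\n192.168.0.7\n&amp;#39; &amp;gt; ips.txt
shell$ sed -r &amp;#39;s/^([0-9]{1,3}\.)[0-9]{1,3}/\1234/&amp;#39; ips.txt
10.234.2.3
192.234.0.7&lt;/code&gt;&lt;/pre&gt;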

&lt;p&gt;What if we wanted to preserve the second octet and simply add a &amp;#8220;1&amp;#8221; in front of it? In that case you can define a second group in the pattern, and reference the second group in the replacement value:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;shell$ echo &amp;quot;127.0.0.1&amp;quot; | sed -r &amp;quot;s/^([0-9]{1,3}\.)([0-9]{1,3})/\11\2/&amp;quot;
127.10.0.1&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Actually it would have been easier to just remove the second octet altogether from the pattern:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;shell$ echo &amp;quot;127.0.0.1&amp;quot; | sed -r &amp;quot;s/^([0-9]{1,3}\.)/\11/&amp;quot;
127.10.0.1&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;On a final note - it wasn&amp;#8217;t so long ago that I would write a command similar to the one below if I wanted to use sed to perform a substitution and overwrite an existing file:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;shell$ sed &amp;#39;s/a/b/&amp;#39; file1.txt &amp;gt; file2.txt; mv file2.txt file1.txt&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Ugh! Well there&amp;#8217;s no need to do this - sed has a &lt;code&gt;-i&lt;/code&gt; option which will do an inline replace of the file:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;shell$ sed -i &amp;#39;s/a/b/&amp;#39; file1.txt&lt;/code&gt;&lt;/pre&gt;
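
&lt;p&gt;If overwriting the file feels risky, &lt;code&gt;-i&lt;/code&gt; also accepts a suffix, in which case sed keeps a copy of the original under that name. A small sketch, using a throwaway &lt;code&gt;file1.txt&lt;/code&gt; created just for the illustration (only the first &lt;code&gt;a&lt;/code&gt; on the line changes, because the expression has no &lt;code&gt;g&lt;/code&gt; flag):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;shell$ printf &amp;#39;banana\n&amp;#39; &amp;gt; file1.txt
shell$ sed -i.bak &amp;#39;s/a/b/&amp;#39; file1.txt
shell$ cat file1.txt
bbnana
shell$ cat file1.txt.bak
banana&lt;/code&gt;&lt;/pre&gt;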

&lt;p&gt;Ahhh, that&amp;#8217;s better! Anything that&amp;#8217;s easy on the eyes gets a thumbs-up from me.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Sorting text files with MapReduce</title>
   
   <category term="--" />
   
   <category term="hadoop" />
   
   <link href="http://grepalex.com/2012/09/10/sorting-text-files-with-mapreduce"/>
   <updated>2012-09-10T13:00:00+00:00</updated>
   <id>http://grepalex.com/2012/09/10/sorting-text-files-with-mapreduce</id>
   <content type="html">&lt;p&gt;In my &lt;a href='http://grepalex.com/2012/09/01/sorting-large-files-in-linux/'&gt;last post&lt;/a&gt; I wrote about sorting files in Linux. Decently large files (in the tens of GB&amp;#8217;s) can be sorted fairly quickly using that approach. But what if your files are already in HDFS, or are hundreds of GB&amp;#8217;s in size or larger? In this case it makes sense to use MapReduce and leverage your cluster resources to sort your data in parallel.&lt;/p&gt;

&lt;p&gt;MapReduce should be thought of as a ubiquitous sorting tool, since by design it sorts all the map output records (using the map output keys), so that all the records that reach a single reducer are sorted. The diagram below shows the internals of how the shuffle phase works in MapReduce.&lt;/p&gt;

&lt;p&gt;&lt;img alt='shuffle in MapReduce' src='/images/sorting-files-mapreduce-internals.png' /&gt;&lt;/p&gt;

&lt;p&gt;Given that MapReduce already performs sorting between the map and reduce phases, sorting files can be accomplished with an identity function (one where the inputs to the map and reduce phases are emitted directly). This is in fact what the &lt;em&gt;sort&lt;/em&gt; example that is bundled with Hadoop does. You can see how the example code works by examining the &lt;a href='http://svn.apache.org/viewvc/hadoop/common/tags/release-1.0.3/src/examples/org/apache/hadoop/examples/Sort.java?view=markup'&gt;org.apache.hadoop.examples.Sort&lt;/a&gt; class. To use this example code to sort text files in Hadoop, you would use it as follows:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;shell$ export HADOOP_HOME=/usr/lib/hadoop
shell$ $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-examples.jar sort \
         -inFormat org.apache.hadoop.mapred.KeyValueTextInputFormat \
         -outFormat org.apache.hadoop.mapred.TextOutputFormat \
         -outKey org.apache.hadoop.io.Text \
         -outValue org.apache.hadoop.io.Text \
         /hdfs/path/to/input \
         /hdfs/path/to/output&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This works well, but it doesn&amp;#8217;t offer some of the features that I commonly rely upon in Linux&amp;#8217;s sort, such as sorting on a specific column, and case-insensitive sorts.&lt;/p&gt;

&lt;h2 id='linuxesque_sorting_in_mapreduce'&gt;Linux-esque sorting in MapReduce&lt;/h2&gt;

&lt;p&gt;I&amp;#8217;ve started a new GitHub repo called &lt;a href='https://github.com/alexholmes/hadoop-utils'&gt;hadoop-utils&lt;/a&gt;, where I plan to roll useful helper classes and utilities. The first one is a flexible Hadoop sort. The same Hadoop example sort can be accomplished with the hadoop-utils sort as follows:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;shell$ $HADOOP_HOME/bin/hadoop jar hadoop-utils-&amp;lt;version&amp;gt;-jar-with-dependencies.jar \
         com.alexholmes.hadooputils.sort.Sort \
         /hdfs/path/to/input \
         /hdfs/path/to/output&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;To bring sorting in MapReduce closer to the Linux sort, the &lt;code&gt;--key&lt;/code&gt; and &lt;code&gt;--field-separator&lt;/code&gt; options can be used to specify one or more columns that should be used for sorting, as well as a custom separator (whitespace is the default). For example, imagine you had a file in HDFS called &lt;code&gt;/input/300names.txt&lt;/code&gt; which contained first and last names:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;shell$ hadoop fs -cat /input/300names.txt | head -n 5
       Roy     Franklin
       Mario   Gardner
       Willis  Romero
       Max     Wilkerson
       Latoya  Larson&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;To sort on the last name you would run:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;shell$ $HADOOP_HOME/bin/hadoop jar hadoop-utils-&amp;lt;version&amp;gt;-jar-with-dependencies.jar \
         com.alexholmes.hadooputils.sort.Sort \
         --key 2 \
         /input/300names.txt \
         /hdfs/path/to/output&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The syntax of &lt;code&gt;--key&lt;/code&gt; is &lt;code&gt;POS1[,POS2]&lt;/code&gt;, where the first position (POS1) is required, and the second position (POS2) is optional - if it&amp;#8217;s omitted then &lt;code&gt;POS1&lt;/code&gt; through the rest of the line is used for sorting. Just like the Linux sort, &lt;code&gt;--key&lt;/code&gt; is 1-based, so &lt;code&gt;--key 2&lt;/code&gt; in the above example will sort on the second column in the file.&lt;/p&gt;
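
&lt;p&gt;For comparison, this is the Linux &lt;code&gt;sort&lt;/code&gt; behavior being mirrored, run locally on five of the sample names (with single spaces as separators for brevity). It is the plain coreutils command, not output from the hadoop-utils sort, and &lt;code&gt;LC_ALL=C&lt;/code&gt; is set only to make the ordering independent of your locale:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;shell$ printf &amp;#39;Roy Franklin\nMario Gardner\nWillis Romero\nMax Wilkerson\nLatoya Larson\n&amp;#39; | LC_ALL=C sort -k 2
Roy Franklin
Mario Gardner
Latoya Larson
Willis Romero
Max Wilkerson&lt;/code&gt;&lt;/pre&gt;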

&lt;h2 id='lzop_integration'&gt;LZOP integration&lt;/h2&gt;

&lt;p&gt;Another trick that this sort utility has is its tight integration with LZOP, a useful compression codec that works well with large files in MapReduce (see chapter 5 of &lt;a href='http://www.manning.com/holmes/'&gt;Hadoop in Practice&lt;/a&gt; for more details on LZOP). It can work with LZOP input files that span multiple splits, and can also LZOP-compress outputs, and even create LZOP index files. You would do this with the &lt;code&gt;--codec&lt;/code&gt; and &lt;code&gt;--lzop-index&lt;/code&gt; options:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;shell$ $HADOOP_HOME/bin/hadoop jar hadoop-utils-&amp;lt;version&amp;gt;-jar-with-dependencies.jar \
         com.alexholmes.hadooputils.sort.Sort \
         --key 2 \
         --codec com.hadoop.compression.lzo.LzopCodec \
         --map-codec com.hadoop.compression.lzo.LzoCodec \
         --lzop-index \
         /hdfs/path/to/input \
         /hdfs/path/to/output&lt;/code&gt;&lt;/pre&gt;

&lt;h2 id='multiple_reducers_and_total_ordering'&gt;Multiple reducers and total ordering&lt;/h2&gt;

&lt;p&gt;If your sort job runs with multiple reducers (either because &lt;code&gt;mapreduce.job.reduces&lt;/code&gt; in &lt;code&gt;mapred-site.xml&lt;/code&gt; has been set to a number larger than 1, or because you&amp;#8217;ve used the &lt;code&gt;-r&lt;/code&gt; option to specify the number of reducers on the command-line), then by default Hadoop will use the &lt;a href='http://hadoop.apache.org/common/docs/r1.0.3/api/org/apache/hadoop/mapred/lib/HashPartitioner.html'&gt;HashPartitioner&lt;/a&gt; to distribute records across the reducers. Use of the HashPartitioner means that you can&amp;#8217;t concatenate your output files to create a single sorted output file. To do this you&amp;#8217;ll need &lt;em&gt;total ordering&lt;/em&gt;, which is supported by both the Hadoop example sort and the hadoop-utils sort - the hadoop-utils sort enables this with the &lt;code&gt;--total-order&lt;/code&gt; option.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;shell$ $HADOOP_HOME/bin/hadoop jar hadoop-utils-&amp;lt;version&amp;gt;-jar-with-dependencies.jar \
         com.alexholmes.hadooputils.sort.Sort \
         --total-order 0.1 10000 10 \
         /hdfs/path/to/input \
         /hdfs/path/to/output&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The syntax for this option is unintuitive, so let&amp;#8217;s look at what each field means.&lt;/p&gt;

&lt;p&gt;&lt;img alt='sampling image' src='/images/sorting-files-mapreduce-sampling.png' /&gt;&lt;/p&gt;

&lt;p&gt;More details on total ordering can be seen in chapter 4 of &lt;a href='http://www.manning.com/holmes/'&gt;Hadoop in Practice&lt;/a&gt;.&lt;/p&gt;
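
&lt;p&gt;To make the concatenation point concrete, here is a toy shell sketch that involves no Hadoop at all. The six single-letter keys, the odd/even split standing in for hash partitioning, and the two hand-picked key ranges standing in for sampled partition boundaries are all assumptions made purely for illustration.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Six unsorted keys, split across two pretend reducers
keys='d a c b f e'

# Stand-in for hash partitioning: odd lines to reducer 0, even lines to reducer 1
hash0=$(printf '%s\n' $keys | awk 'NR % 2 == 1' | LC_ALL=C sort)
hash1=$(printf '%s\n' $keys | awk 'NR % 2 == 0' | LC_ALL=C sort)

# Stand-in for total ordering: reducer 0 gets keys a-c, reducer 1 gets keys d-f
range0=$(printf '%s\n' $keys | grep '[a-c]' | LC_ALL=C sort)
range1=$(printf '%s\n' $keys | grep '[d-f]' | LC_ALL=C sort)

# Concatenate each pair of sorted reducer outputs in reducer order
echo hash-style: $hash0 $hash1
echo range-style: $range0 $range1&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Each reducer&amp;#8217;s output is sorted in both cases, but only the range-style split yields a sorted result when the outputs are concatenated (&lt;code&gt;a b c d e f&lt;/code&gt;, versus &lt;code&gt;c d f a b e&lt;/code&gt; for the hash-style split).&lt;/p&gt;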

&lt;h2 id='more_details'&gt;More details&lt;/h2&gt;

&lt;p&gt;For details on how to download and run the hadoop-utils sort take a look at the &lt;a href='https://github.com/alexholmes/hadoop-utils/blob/master/CLI.md'&gt;CLI guide&lt;/a&gt; in the &lt;a href='https://github.com/alexholmes/hadoop-utils'&gt;GitHub project page&lt;/a&gt;.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Lexicographically sorting large files in Linux</title>
   
   <category term="--" />
   
   <category term="*nix" />
   
   <link href="http://grepalex.com/2012/09/01/sorting-large-files-in-linux"/>
   <updated>2012-09-01T01:14:00+00:00</updated>
   <id>http://grepalex.com/2012/09/01/sorting-large-files-in-linux</id>
   <content type="html">&lt;p&gt;When I hear the word &amp;#8220;sort&amp;#8221; my first thought is usually &amp;#8220;Hadoop&amp;#8221;! Yes, sorting is one thing that Hadoop does well, but if you&amp;#8217;re working with large files in Linux the built-in sort command is often all you need.&lt;/p&gt;

&lt;p&gt;Let&amp;#8217;s say you have a large file on a host with 2GB or more of main memory free. The following &lt;a href='http://www.oreillynet.com/linux/cmd/cmd.csp?path=s/sort'&gt;sort&lt;/a&gt; command is an efficient way to &lt;a href='http://en.wikipedia.org/wiki/Lexicographical_order'&gt;lexicographically&lt;/a&gt;-order large files.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;LC_COLLATE=C sort --buffer-size=1G --temporary-directory=./tmp --unique bigfile.txt&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Let&amp;#8217;s break this command down and examine each part in detail.&lt;/p&gt;
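
&lt;p&gt;A tiny experiment makes two of those parts concrete. The five-line sample input is made up for illustration, and the sketch uses &lt;code&gt;LC_ALL=C&lt;/code&gt; rather than &lt;code&gt;LC_COLLATE=C&lt;/code&gt; only because &lt;code&gt;LC_ALL&lt;/code&gt;, if it happens to be set in your environment, takes precedence over &lt;code&gt;LC_COLLATE&lt;/code&gt;.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Byte-order collation puts uppercase ASCII ahead of lowercase; --unique drops the repeated b
printf 'b\nA\na\nB\nb\n' | LC_ALL=C sort --unique&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This prints &lt;code&gt;A&lt;/code&gt;, &lt;code&gt;B&lt;/code&gt;, &lt;code&gt;a&lt;/code&gt; and &lt;code&gt;b&lt;/code&gt;, one per line.&lt;/p&gt;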

&lt;p&gt;&lt;img alt='sort image' src='/images/sorting-large-files-linux.png' /&gt;&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Slurper v2</title>
   
   <category term="--" />
   
   <category term="hadoop" />
   
   <link href="http://grepalex.com/2012/08/20/slurper-v2"/>
   <updated>2012-08-20T03:14:00+00:00</updated>
   <id>http://grepalex.com/2012/08/20/slurper-v2</id>
   <content type="html">&lt;p&gt;The current &lt;a href='https://github.com/alexholmes/hdfs-file-slurper'&gt;HDFS Slurper&lt;/a&gt; was created as part of writing &amp;#8220;Hadoop in Practice&amp;#8221;, and it also happened to fulfill a need that we had at work. The one-sentence description of the Slurper is that it&amp;#8217;s a utility that copies files between Hadoop file systems. It&amp;#8217;s particularly useful in situations where you want to automate moving files from local disk to HDFS, and vice versa.&lt;/p&gt;

&lt;p&gt;While it has worked well for us, with the addition of a few choice features it could be even more useful:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Filter and projection, to remove or reduce data from input files&lt;/li&gt;

&lt;li&gt;Write to multiple output files from a single input file&lt;/li&gt;

&lt;li&gt;Keep source files intact&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As such I have come up with a high-level architecture for what v2 may look like (subject to change of course).&lt;/p&gt;

&lt;p&gt;&lt;img alt='Slurper v2 architecture' src='/images/slurper-v2.png' /&gt;&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Bare-metal installation for Nginx and Jekyll</title>
   
   <category term="--" />
   
   <category term="*nix" />
   
   <category term="--" />
   
   <category term="jekyll" />
   
   <link href="http://grepalex.com/2012/08/17/full-jekyll-VM-setup"/>
   <updated>2012-08-17T00:56:00+00:00</updated>
   <id>http://grepalex.com/2012/08/17/full-jekyll-VM-setup</id>
   <content type="html">&lt;p&gt;This blog is a bunch of &lt;a href='https://github.com/mojombo/jekyll'&gt;Jekyll&lt;/a&gt;-created HTML which is served by the &lt;a href='http://nginx.org/'&gt;Nginx&lt;/a&gt; HTTP server. This post documents the process of getting Jekyll and Nginx set up from bare metal. It also shows a script that periodically pulls your blog sources from GitHub and regenerates the site. The instructions that follow should work for RedHat 6 and derivatives (such as CentOS 6, which is what I&amp;#8217;m using).&lt;/p&gt;

&lt;h2 id='create_a_user_and_setup_ssh'&gt;Create a user and setup ssh&lt;/h2&gt;

&lt;p&gt;With a new VM you&amp;#8217;ll typically be given root access, but security 101 dictates that you avoid running commands as the root user as much as possible. Therefore the first thing you&amp;#8217;ll want to do is to create a user, in this case &lt;code&gt;bloguser&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;shell$ useradd bloguser&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Next, change the password for the user:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;shell$ passwd bloguser&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now you&amp;#8217;ll want to create an SSH public/private key pair for your user. It&amp;#8217;s recommended that you do this on your own machine, not your VM, since you don&amp;#8217;t want your private key out there if it can be avoided.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;shell$ ssh-keygen -t rsa&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This will generate the following files on your local host:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;.ssh/id_rsa
.ssh/id_rsa.pub&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Once these files are generated, create the &lt;code&gt;.ssh&lt;/code&gt; directory on your VM (these steps assume you&amp;#8217;re logged-in as root):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;shell$ su - bloguser
shell$ mkdir .ssh&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Create &lt;code&gt;.ssh/authorized_keys&lt;/code&gt; on your VM, and copy the contents of &lt;code&gt;.ssh/id_rsa.pub&lt;/code&gt; from your local host:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;shell$ vi .ssh/authorized_keys&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Set the permissions on the directory and file:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;shell$ chmod 700 .ssh
shell$ chmod 600 .ssh/authorized_keys&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Test out your ssh setup, by ssh-ing from your local host to your VM as the &lt;code&gt;bloguser&lt;/code&gt; user:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;shell$ ssh bloguser@&amp;lt;vm-host&amp;gt;&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;As root, allow the &lt;code&gt;bloguser&lt;/code&gt; user to perform commands as root (if the &lt;code&gt;/etc/sudoers&lt;/code&gt; file doesn&amp;#8217;t exist, then you will need to install &lt;code&gt;sudo&lt;/code&gt; with the &lt;code&gt;yum install sudo&lt;/code&gt; command).&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;shell$ vi /etc/sudoers&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Add the following line:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;%bloguser       ALL=(ALL)       ALL&lt;/code&gt;&lt;/pre&gt;
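
&lt;p&gt;A typo in &lt;code&gt;/etc/sudoers&lt;/code&gt; can leave &lt;code&gt;sudo&lt;/code&gt; unusable, so it may be worth checking the file before logging out of your root session. A minimal check, assuming the &lt;code&gt;visudo&lt;/code&gt; tool that ships with the &lt;code&gt;sudo&lt;/code&gt; package is available:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;shell$ visudo -c&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The &lt;code&gt;-c&lt;/code&gt; option only checks the file for syntax errors and doesn&amp;#8217;t open an editor.&lt;/p&gt;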

&lt;h2 id='setup_some_basic_security'&gt;Setup some basic security&lt;/h2&gt;

&lt;p&gt;Next up is tightening-up the SSH configuration.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;shell$ sudo vi /etc/ssh/sshd_config&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Inside this file you will do three things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Change the port from 22 to some other number (such as 52846 in the example below).&lt;/li&gt;

&lt;li&gt;Disable password authentication, so that a private key must be used to login to the server.&lt;/li&gt;

&lt;li&gt;Block the root user from ssh access to your host.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The file therefore needs to contain the following lines (make sure all other entries with these names are commented-out).&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Port 52846
PasswordAuthentication no
PermitRootLogin no&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Restart the ssh daemon to pick up the changes you just made.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;shell$ sudo /sbin/service sshd restart&lt;/code&gt;&lt;/pre&gt;
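
&lt;p&gt;Since a mistake in this file can lock you out of the VM, a cautious sequence is to validate the configuration just before the restart, and then to confirm that you can log in on the new port from a second terminal while your current session is still open. A sketch, assuming &lt;code&gt;sshd&lt;/code&gt; is installed at &lt;code&gt;/usr/sbin/sshd&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;shell$ sudo /usr/sbin/sshd -t
shell$ ssh -p 52846 bloguser@&amp;lt;vm-host&amp;gt;&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The first command (run on the VM) prints nothing when the configuration is valid. The second is run from your local host.&lt;/p&gt;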

&lt;p&gt;The next step is to set up a firewall to restrict incoming traffic to just ssh and HTTP. To do this, create a file called &lt;code&gt;vm-iptables.sh&lt;/code&gt; with the following content. You&amp;#8217;ll be executing these commands as root.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;
#!/bin/bash

# Flush all current rules from iptables
iptables -F

# Allow SSH and HTTP connections
iptables -A INPUT -p tcp --dport 52846 -j ACCEPT
iptables -A INPUT -p tcp --dport 80 -j ACCEPT

# Drop traffic on all other inbound ports
iptables -P INPUT DROP
iptables -P FORWARD DROP

# Allow all outbound traffic
iptables -P OUTPUT ACCEPT

# Accept any connection on the loopback interface
iptables -A INPUT -i lo -j ACCEPT

# Accept packets belonging to established and related connections
iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT

# Save the iptables
/sbin/service iptables save

# List
iptables -L -v
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After you&amp;#8217;ve created the file, make it executable and run it to apply and save your rules.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;
shell$ chmod +x ./vm-iptables.sh
shell$ sudo ./vm-iptables.sh
iptables: Saving firewall rules to /etc/sysconfig/iptables:[  OK  ]
Chain INPUT (policy DROP 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination
    2   104 ACCEPT     tcp  --  any    any     anywhere             anywhere            tcp dpt:ssh
    0     0 ACCEPT     tcp  --  any    any     anywhere             anywhere            tcp dpt:http
    0     0 ACCEPT     all  --  lo     any     anywhere             anywhere
    0     0 ACCEPT     all  --  any    any     anywhere             anywhere            state RELATED,ESTABLISHED

Chain FORWARD (policy DROP 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination

Chain OUTPUT (policy ACCEPT 2 packets, 264 bytes)
 pkts bytes target     prot opt in     out     source               destination
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The output shows your new iptables configuration, which reflects the rules we saved in &lt;code&gt;vm-iptables.sh&lt;/code&gt;.&lt;/p&gt;

&lt;h2 id='install_and_start_nginx'&gt;Install and start Nginx&lt;/h2&gt;

&lt;p&gt;Add the EPEL yum repository into your configuration:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;shell$ sudo rpm -Uvh http://dl.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-7.noarch.rpm&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Install Nginx using yum:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;shell$ sudo yum install nginx&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Set up Nginx so that it auto-starts at system start time:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;shell$ sudo chkconfig nginx on&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Start Nginx:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;shell$ sudo /sbin/service nginx start&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;You can test that Nginx is up and running by pointing your browser at your VM IP address - you should see a page confirming that all is good.&lt;/p&gt;

&lt;p&gt;&lt;img alt='Nginx welcome screen' src='/images/nginx-welcome.png' /&gt;&lt;/p&gt;

&lt;h2 id='install_jekyll'&gt;Install Jekyll&lt;/h2&gt;

&lt;p&gt;The following commands will install Jekyll on your VM:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;shell$ sudo yum install gcc rubygems ruby-devel
shell$ sudo gem install jekyll&lt;/code&gt;&lt;/pre&gt;

&lt;h2 id='install_pygments_for_code_syntax_highlighting'&gt;Install Pygments (for code syntax highlighting)&lt;/h2&gt;

&lt;pre&gt;&lt;code&gt;shell$ sudo yum install python-setuptools
shell$ sudo easy_install Pygments&lt;/code&gt;&lt;/pre&gt;

&lt;h2 id='create_a_crontab_entry_and_script_to_generate_the_blog'&gt;Create a crontab entry and script to generate the blog&lt;/h2&gt;

&lt;p&gt;We&amp;#8217;re going to set up Jekyll to write to the Nginx HTML directory, and since we&amp;#8217;re going to do this as the &lt;code&gt;bloguser&lt;/code&gt; user, we&amp;#8217;ll first need to wipe out the contents of that directory, and &lt;code&gt;chown&lt;/code&gt; it so that &lt;code&gt;bloguser&lt;/code&gt; can write to it:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;shell$ sudo rm -rf /usr/share/nginx/html/*
shell$ sudo chown bloguser:bloguser /usr/share/nginx/html&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;We&amp;#8217;ll assume that you have a GitHub repository that&amp;#8217;s hosting your Jekyll sources. Therefore you need to install git.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;shell$ sudo yum install git&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Create a directory to contain your blog source:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;shell$ sudo mkdir -p /app/blog
shell$ sudo chown bloguser:bloguser /app/blog&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The script will send out an email if an error is encountered, so you need to install mail:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;shell$ sudo yum install mailx&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Next on our list is creating a script which will do the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pulls the latest blog sources from GitHub.&lt;/li&gt;

&lt;li&gt;Uses Jekyll to generate the HTML for the blog.&lt;/li&gt;

&lt;li&gt;Sends an email if Jekyll exits with an error, or if the home page can&amp;#8217;t be retrieved&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Create a shell script in &lt;code&gt;/app/blog/gen.sh&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;shell$ vi /app/blog/gen.sh&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Copy the following content into this file, which clones your github repo for the first time if it doesn&amp;#8217;t already exist, or updates the local copy via the &lt;code&gt;pull&lt;/code&gt; command:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;#!/bin/bash

send_email_and_exit() {
  recipient=$1
  message=$2

  echo &amp;quot;Sending email and exiting due to error&amp;quot;

  /bin/mail -s &amp;quot;Blog generation failure&amp;quot; &amp;quot;${recipient}&amp;quot; &amp;lt;&amp;lt; EOF
${message}
EOF

  exit 1
}

echo &amp;quot;Running at &amp;quot;`date`

basedir=/app/blog
gitdir=${basedir}/blog
nginxdir=/usr/share/nginx/html
githubrepo=https://github.com/alexholmes/blog.git
emailto=&amp;quot;grep.alex@gmail.com&amp;quot;

if [ ! -d ${gitdir} ]; then
  echo &amp;quot;Checking out repo for the first time&amp;quot;
  mkdir -p ${gitdir}
  cd ${basedir}
  git clone ${githubrepo}
else
  cd ${gitdir}
  git pull
fi

cd ${gitdir}

rm -rf ${nginxdir}/*
jekyll --no-auto . ${nginxdir}/

exitCode=$?

if [ ${exitCode} != &amp;quot;0&amp;quot; ]; then
  send_email_and_exit &amp;quot;${emailto}&amp;quot; &amp;quot;Jekyll failed with exit code ${exitCode}&amp;quot;
fi

curl --fail http://0.0.0.0:80/ &amp;gt;/dev/null 2&amp;gt;&amp;amp;1

exitCode=$?

if [ ${exitCode} != &amp;quot;0&amp;quot; ]; then
  send_email_and_exit &amp;quot;${emailto}&amp;quot; &amp;quot;Curl failed with exit code ${exitCode}&amp;quot;
fi&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Make the file executable:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;shell$ chmod +x /app/blog/gen.sh&lt;/code&gt;&lt;/pre&gt;
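
&lt;p&gt;Before handing the script to cron, you may want to catch any copy-and-paste damage with a syntax-only check. &lt;code&gt;bash -n&lt;/code&gt; parses the script without running any of its commands, and prints nothing if the script parses cleanly:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;shell$ bash -n /app/blog/gen.sh&lt;/code&gt;&lt;/pre&gt;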

&lt;p&gt;Now all you need is a crontab entry to refresh your blog every 5 minutes:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;shell$ crontab -e
*/5 * * * * /app/blog/gen.sh &amp;amp;&amp;gt;&amp;gt; /app/blog/gen.out&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;To check your crontab settings use the &lt;code&gt;-l&lt;/code&gt; option:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;shell$ crontab -l
*/5 * * * * /app/blog/gen.sh &amp;amp;&amp;gt;&amp;gt; /app/blog/gen.out&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now you can either wait for up to 5 minutes for the cron to execute the script, or simply run it yourself:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;shell$ /app/blog/gen.sh&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now when you refresh your browser you&amp;#8217;ll see your Jekyll-generated website!&lt;/p&gt;

&lt;p&gt;&lt;img alt='Nginx serving up Jekyll-generated content' src='/images/nginx-with-jekyll.png' /&gt;&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>OSX, Chrome and DNS</title>
   
   <category term="--" />
   
   <category term="*nix" />
   
   <link href="http://grepalex.com/2012/08/13/osx-dns"/>
   <updated>2012-08-13T02:06:00+00:00</updated>
   <id>http://grepalex.com/2012/08/13/osx-dns</id>
   <content type="html">&lt;p&gt;First post! Welcome to &amp;#8220;Hadoop Hamburgers&amp;#8221;, where I plan to write some posts about Hadoop and other topics that seem interesting. My first one is not related to Hadoop, but instead related to DNS, a subject near and dear to the heart of my employer, &lt;a href='http://verisign-inc.com'&gt;Verisign&lt;/a&gt;. Everything in getting this site set up went fairly smoothly, including updating my &lt;a href='http://en.wikipedia.org/wiki/Domain_name_registrar'&gt;registrar&amp;#8217;s&lt;/a&gt; DNS records to point my domain name at my hosting provider. Being an impatient sort, I didn&amp;#8217;t want to have to wait for the TTL on my domain name to expire, so I ran a &lt;code&gt;dig&lt;/code&gt; request to see if my registrar had pushed through the change:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;shell$ dig grepalex.com

;; ANSWER SECTION:
grepalex.com.		3600	IN	A	66.216.100.140&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Indeed they had! Next up was trying to hit my website from my browser. When I did that, however, Chrome was showing my registrar&amp;#8217;s advertising content. A few pokes around led me to Chrome&amp;#8217;s web page which lets you invalidate its DNS cache:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;chrome://net-internals/#dns&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;However even after invalidating Chrome&amp;#8217;s cache it still showed the content from the registrar. The cool thing about Chrome&amp;#8217;s internal page is that it actually shows you the cached IP address, which indeed was still the old value. Clearly the OSX DNS client was performing some additional caching. After some more digging around I found the (Mountain) Lion-specific command which did indeed successfully clean OSX&amp;#8217;s cache:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;shell$ sudo killall -HUP mDNSResponder&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Hurray!&lt;/p&gt;</content>
 </entry>
 

</feed>
