Archives
December 2016
- Configuring memory for MapReduce running on YARN
This post examines the various memory configuration settings for your MapReduce job.
October 2015
- Big data anti-patterns presentation
Details on the presentation I gave at JavaOne 2015 on big data anti-patterns.
May 2014
- Understanding how Parquet integrates with Avro, Thrift and Protocol Buffers
Parquet integrates with a number of object models, and this post shows how that support works.
February 2014
- Using Oozie 4.4.0 with Hadoop 2.2
Patching Oozie's build so that you can create a package targeting Hadoop 2.2.0.
- Hadoop in Practice, Second Edition
A sneak peek at what's coming in the second edition of my book.
- Using Hadoop 2.2 as a sink in Flume 1.4
Working around the protobuf 2.5 dependency introduced by Hadoop 2.2.
October 2013
- Simplifying secondary sorting in MapReduce with htuple
Introducing htuple, an open-source project to simplify secondary sorting in MapReduce.
September 2013
- Next Generation Hadoop - It's Not Just Batch!
Slides and additional reading from my JavaOne 2013 talk on next-generation Hadoop - mixing real-time and batch.
July 2013
- Bucketing, multiplexing and combining in Hadoop - part 2
In this part we examine the MultipleOutputs class for a more flexible way to write out multiple outputs from your mappers and reducers.
June 2013
- Secondary sorting with Avro
Complete control over how partitioning, sorting and grouping work with Avro map output keys.
May 2013
- Avro's built-in sorting
A look at how Avro supports sorting in MapReduce.
- Using Avro's code generation from Maven
Avro has a Maven plugin which lets you generate code from Avro schema, IDL and protocol files. This post looks at how to use the plugin and its various options.
- Bucketing, multiplexing and combining in Hadoop - part 1
The first in a series of MapReduce data organization patterns, which will cover various common actions such as data bucketing, multiplexing and combining.
February 2013
- Using the libjars option with Hadoop
The Hadoop CLI has an option for indicating any JARs that should be loaded by the MapReduce task classloader. In this post you'll see how to use this option, as well as how to ensure that your MapReduce driver properly supports these JARs.
- Installing AsciiDoc on OSX
AsciiDoc is a cool markup language, similar to Markdown, and comes with tools to convert AsciiDoc to DocBook and PDF formats. Here you'll see how to get it up and running on OSX.
- Java 6 and 7 with the dotted/dotless I
The interesting case of the dotted and dotless "I" in Java.
- LZOP decompression - revenge of the useless cat
The nuances of using the lzop CLI to view the contents of LZOP files.
January 2013
- Executing variables that contain shell operators
A look at how the eval command can be used to execute a command pipeline.
- Using awk and friends with Hadoop
How to use Linux tools such as awk in MapReduce.
November 2012
- Configuring and tuning MapReduce's shuffle
A look at the various MapReduce shuffle configurables, and where in the MapReduce process they are applied.
- Controlling user logging in Hadoop
A look at various approaches that can limit the impact of overly aggressive logging in MapReduce.
October 2012
- Pipes and useless cats
A post for all you dog lovers about instances of unloved cats in Linux.
- Hadoop unit testing with MiniMRCluster and MiniDFSCluster
September 2012
- How partitioning, collecting and spilling work in MapReduce
- Using sed to perform inline replacements of regex groups
- Sorting text files with MapReduce
- Lexicographically sorting large files in Linux