I love me some Unix command pipes:
$ cat /some/file.txt | sort | head
Pipelines let you chain together multiple commands to manipulate data flows. Pipes are not only useful as a data filtering mechanism, but when combined with tools such as sed can also be used for projections and transformations. The Unix pipe, while simple in concept, is a sophisticated shell construct and one big reason why Unix shells are to this day a popular tool in a programmer/system administrator/data scientist’s toolkit.
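As a rough sketch of filtering, projecting and transforming in one pipeline, the snippet below runs grep and sed over a small made-up CSV file. The file contents, column layout and the "user: " prefix are all assumptions chosen for illustration.

```shell
# Hypothetical sample data with columns: name,role,city
data=$(mktemp)
printf '%s\n' 'alice,admin,Paris' 'bob,dev,Berlin' 'carol,dev,Lisbon' > "$data"

# grep filters the rows we care about; the first sed expression projects
# the name column by dropping everything from the first comma onwards;
# the second sed expression transforms each name by adding a prefix.
result=$(grep ',dev,' "$data" | sed -e 's/,.*//' -e 's/^/user: /')

printf '%s\n' "$result"
rm -f "$data"
```

Note that grep is handed the file directly here rather than reading it from a pipe.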
So why am I sitting here telling you something that you already know? Fair question - to answer that let’s take another look at that command:
$ cat /some/file.txt | sort | head
While shell pipelines are great, we have a subtle problem here - and it’s something that’s known as a useless cat. No, I don’t hate cats - this expression harks back to the old Usenet days, when a member of comp.unix.shell would write a weekly post highlighting a redundant use of the cat command.
So why is the cat in the above command useless? Because sort can take one or more files as arguments, much like the majority of Unix commands. So this command can be rewritten as:
$ sort /some/file.txt | head
Removing cat from the equation means that we’ve reduced the number of processes that need to execute, and cut down on the buffering and data copying needed to move data through the pipeline - a win-win.
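To check that dropping cat doesn't change the result, here is a small sketch that runs both forms against a throwaway file. The temporary file and its three lines are assumptions standing in for /some/file.txt.

```shell
# Hypothetical sample file standing in for /some/file.txt
f=$(mktemp)
printf '%s\n' pear apple fig > "$f"

# Three processes: cat, sort and head
with_cat=$(cat "$f" | sort | head)

# Two processes: sort opens the file itself
without_cat=$(sort "$f" | head)

if [ "$with_cat" = "$without_cat" ]; then
  echo "same output"
fi
rm -f "$f"
```

Both variables end up holding the same sorted lines, so the only thing cat contributed was an extra process and an extra pipe.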
cat really doesn’t have many uses beyond concatenating files - if you need to view the contents of a file you’re better off using less, and otherwise most Unix commands can directly work with files.
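A few examples of commands reading a file without help from cat, again using a made-up temporary file. For a command such as tr, which only reads standard input, the shell's < redirection is one way to feed it a file without starting an extra process.

```shell
# Hypothetical sample file
f=$(mktemp)
printf '%s\n' pear apple fig > "$f"

# grep and head both accept a file name as an argument
count=$(grep -c 'a' "$f")     # number of lines containing the letter a
first=$(head -n 1 "$f")       # first line of the file

# tr reads only from stdin, so let the shell open the file for it
upper=$(tr 'a-z' 'A-Z' < "$f")

printf '%s\n' "$count" "$first" "$upper"
rm -f "$f"
```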
So next time you’re about to run a cat command - think about whether or not you need it, or whether you’re just perpetuating use of the useless cat!
About the author
Alex Holmes is a senior software engineer with over 15 years of experience developing large scale distributed Java systems. Since 2008 he has gained expertise in using Hadoop to solve Big Data problems across a number of projects. He is the author of "Hadoop in Practice", a book published by Manning Publications. He has presented at JavaOne and Jazoon.