Pipes and useless cats
I love me some Unix command pipes:
$ cat /some/file.txt | sort | head
Pipelines let you chain together multiple commands to manipulate data flows. Pipes are not only useful as a data filtering mechanism: when combined with tools such as cut, awk and sed, they can also be used for projections and transformations. The Unix pipe, while simple in concept, is a sophisticated shell construct and one big reason why Unix shells are to this day a popular tool in the toolkits of programmers, system administrators and data scientists.
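As a rough illustration of filtering, projection and transformation in a pipeline, here is a minimal sketch. The sample data (a name,department,salary layout) and the variable names are made up for this example and are not from any real dataset.

```shell
# Hypothetical sample data: name,department,salary
tmp=$(mktemp)
printf 'alice,eng,120\nbob,ops,95\ncarol,eng,130\n' > "$tmp"

# Filter:         grep keeps only the rows in the "eng" department
# Projection:     cut keeps only the name and salary columns
# Transformation: sed rewrites the first comma as ": "
result=$(grep ',eng,' "$tmp" | cut -d, -f1,3 | sed 's/,/: /')
echo "$result"

# awk can aggregate as well: sum the salary column across all rows
total=$(awk -F, '{ sum += $3 } END { print sum }' "$tmp")
echo "$total"

rm -f "$tmp"
```

Note that grep and awk are handed the file name directly here, so no extra process is needed just to feed them input.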
So why am I sitting here telling you something that you already know? Fair question - to answer that let’s take another look at that command:
$ cat /some/file.txt | sort | head
While shell pipelines are great, we have a subtle problem here - and it’s something that’s known as a useless cat. No, I don’t hate cats - this expression harks back to the old Usenet days, when a member of the comp.unix.shell newsgroup would write a weekly post highlighting a redundant use of the cat command.
So why is the above command useless? Because sort can take one or more files as arguments, much like the majority of Unix commands. So this command can be rewritten as:
$ sort /some/file.txt | head
Removing cat from the equation means that we’ve reduced the number of processes that need to execute, and cut down on the buffering and data copying that the operating system needs to do to make pipelines work - a win-win.
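If you want to convince yourself that dropping cat doesn’t change the result, a small sketch along these lines can help. The temporary file and its contents are stand-ins invented for the example; the third form uses input redirection, which also works for commands that only read standard input.

```shell
# Hypothetical sample file standing in for /some/file.txt
tmp=$(mktemp)
printf 'pear\napple\nfig\n' > "$tmp"

# The original pipeline, with the useless cat
with_cat=$(cat "$tmp" | sort | head)

# The same pipeline with sort reading the file directly
without_cat=$(sort "$tmp" | head)

# Input redirection: the shell opens the file and hands it to sort as stdin
with_redirect=$(sort < "$tmp" | head)

# All three forms should produce the same sorted lines
if [ "$with_cat" = "$without_cat" ] && [ "$without_cat" = "$with_redirect" ]; then
    echo "identical output"
fi

rm -f "$tmp"
```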
In fact cat really doesn’t have many uses beyond concatenating files, the job it was named for - if you need to view the contents of a file you’re better off using vi or less, and otherwise most Unix commands can directly work with files.
So next time you’re about to run a cat command - think about whether or not you need it, or whether you’re just perpetuating use of the useless cat!
About the author
Alex Holmes works on tough big-data problems. He is a software engineer, author, speaker, and blogger specializing in large-scale Hadoop projects. He is the author of Hadoop in Practice, a book published by Manning Publications. He has presented multiple times at JavaOne, and is a JavaOne Rock Star.
If you want to see what Alex is up to you can check out his work on GitHub, or follow him on Twitter or Google+.