Pipes and useless cats

I love me some Unix command pipes:

$ cat /some/file.txt | sort | head

Pipelines let you chain together multiple commands to manipulate data flows. Pipes are not only useful as a data filtering mechanism; combined with tools such as cut, awk and sed they can also be used for projections and transformations. The Unix pipe, while simple in concept, is a sophisticated shell construct and one big reason why Unix shells are to this day a popular tool in the toolkit of programmers, system administrators and data scientists.
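
To make the filter/project/transform idea concrete, here is a minimal sketch. The sample data and the `user=` prefix are made up for illustration; only awk, cut and sed come from the paragraph above.

```shell
# Hypothetical sample data (name,department,salary), written to a temp file.
tmpfile=$(mktemp)
printf '%s\n' 'alice,eng,120' 'bob,ops,95' 'carol,eng,110' > "$tmpfile"

# Filter:    awk keeps only the rows whose second field is "eng".
# Project:   cut keeps only the first comma-separated field (the name).
# Transform: sed prepends "user=" to every line.
result=$(awk -F, '$2 == "eng"' "$tmpfile" | cut -d, -f1 | sed 's/^/user=/')

printf '%s\n' "$result"
rm -f "$tmpfile"
```

Note that awk reads the file itself here, so the pipeline starts with the first command that actually does some work.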

So why am I sitting here telling you something that you already know? Fair question - to answer that let’s take another look at that command:

$ cat /some/file.txt | sort | head

While shell pipelines are great, we have a subtle problem here - and it’s something that’s known as a useless use of cat, or a useless cat for short. No, I don’t hate cats - this expression harks back to the old Usenet days, when a regular on the comp.unix.shell newsgroup would write a weekly post highlighting a redundant use of the cat command.

So why is the cat in the above command useless? Because sort, like the majority of Unix commands, can take one or more files as arguments. So this command can be rewritten as:

$ sort /some/file.txt | head

Removing cat from the equation means that we’ve reduced the number of processes that need to execute, and eliminated one pipe along with the extra copy of the data that the kernel has to push through it - a win-win.
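
If you want to convince yourself that the two forms are equivalent, a quick sketch like the following compares them. The temp file and its contents are stand-ins for /some/file.txt, and I've used `head -n 2` only to keep the output short.

```shell
# Stand-in for /some/file.txt: three unsorted lines in a temp file.
tmpfile=$(mktemp)
printf '%s\n' pear apple fig > "$tmpfile"

# Three processes and two pipes.
with_cat=$(cat "$tmpfile" | sort | head -n 2)

# Two processes and one pipe; sort opens the file itself.
without_cat=$(sort "$tmpfile" | head -n 2)

if [ "$with_cat" = "$without_cat" ]; then
    echo "same output"
fi
rm -f "$tmpfile"
```

The output is identical; the only difference is how much work the system does to produce it.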

In fact cat really doesn’t have many uses beyond its actual job of concatenating files - if you need to view the contents of a file you’re better off using a pager such as less (or an editor such as vi), and otherwise most Unix commands can work with files directly.
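
Two cases are worth sketching here, with the caveat that this goes slightly beyond the original argument. A few commands, tr being the classic example, read only standard input and take no file arguments; for those, shell input redirection does the job without cat. And when you really do want to join several files together, cat is exactly the right tool. The file names below are hypothetical.

```shell
# Hypothetical input files in a scratch directory.
tmpdir=$(mktemp -d)
printf 'hello\n' > "$tmpdir/a.txt"
printf 'world\n' > "$tmpdir/b.txt"

# tr takes no file arguments, so redirect the file to its standard input
# instead of writing: cat a.txt | tr 'a-z' 'A-Z'
upper=$(tr 'a-z' 'A-Z' < "$tmpdir/a.txt")

# A legitimate cat: concatenating more than one file.
joined=$(cat "$tmpdir/a.txt" "$tmpdir/b.txt")

printf '%s\n' "$upper" "$joined"
rm -rf "$tmpdir"
```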

So next time you’re about to run a cat command - think about whether or not you need it, or whether you’re just perpetuating use of the useless cat!

About the author

Alex Holmes works on tough big-data problems. He is a software engineer, author, speaker, and blogger specializing in large-scale Hadoop projects. He is the author of Hadoop in Practice, a book published by Manning Publications. He has presented multiple times at JavaOne, and is a JavaOne Rock Star.

If you want to see what Alex is up to you can check out his work on GitHub, or follow him on Twitter or Google+.
