When I hear the word “sort”, my first thought is usually “Hadoop”! Sorting is certainly something Hadoop does well, but if you’re working with large files on Linux, the built-in sort command is often all you need.
LC_COLLATE=C sort --buffer-size=1G --temporary-directory=./tmp --unique bigfile.txt
Let’s break this command down and examine each part in detail.
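As a rough guide, here is the same command run against a tiny stand-in file, with each part annotated. This is a sketch that assumes GNU sort (coreutils); the sample data, the scratch directory, and the sorted.txt output file are hypothetical additions for illustration, and the comments reflect my reading of the GNU documentation rather than anything specific to your system.

```shell
# Work in a scratch directory with a small stand-in for bigfile.txt
# (hypothetical sample data; the real file would be far larger).
workdir=$(mktemp -d)
cd "$workdir"
printf 'banana\napple\nBanana\napple\ncherry\n' > bigfile.txt

# sort does not create the temporary directory itself, so make sure it exists.
mkdir -p ./tmp

# If LC_ALL is set it overrides LC_COLLATE, so clear it for this example.
unset LC_ALL

# LC_COLLATE=C                 compare raw byte values instead of locale
#                              collation rules (usually faster; uppercase
#                              ASCII sorts before lowercase)
# --buffer-size=1G             allow up to 1 GiB of main memory for sorting
#                              before spilling to temporary files
# --temporary-directory=./tmp  write those temporary files to ./tmp rather
#                              than $TMPDIR or /tmp
# --unique                     emit only the first of each run of equal lines
LC_COLLATE=C sort --buffer-size=1G --temporary-directory=./tmp --unique bigfile.txt > sorted.txt

cat sorted.txt
# Banana
# apple
# banana
# cherry
```

The duplicate “apple” line is dropped by --unique, and “Banana” comes first because byte-wise comparison places uppercase letters ahead of lowercase ones. For a genuinely large file, pointing the temporary directory at a disk with plenty of free space matters more than it does in this toy run.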
About the author
Alex Holmes is a senior software engineer with over 15 years of experience developing large-scale distributed Java systems. Since 2008 he has gained expertise in using Hadoop to solve Big Data problems across a number of projects. He is the author of Hadoop in Practice (first edition, with the second edition currently in the early access program), published by Manning Publications. He has presented at JavaOne and Jazoon.