When I hear the word “sort”, my first thought is usually “Hadoop”! Yes, sorting is one thing that Hadoop does well, but if you’re working with large files on Linux, the built-in sort command is often all you need.
LC_COLLATE=C sort --buffer-size=1G --temporary-directory=./tmp --unique bigfile.txt
Let’s break this command down and examine each part in detail.
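As a rough sketch of that breakdown, here is the same command run against a tiny stand-in file, with each part annotated. The file name sample.txt and its contents are made up for illustration, and the comments describe the options as documented for GNU sort; other sort implementations may behave differently.

```shell
# Hypothetical stand-in for bigfile.txt, small enough to inspect by eye.
printf 'banana\napple\nbanana\nCherry\napple\n' > sample.txt

# The temporary directory must exist before sort can spill data into it.
mkdir -p ./tmp

# LC_COLLATE=C                   compare lines byte by byte instead of using
#                                locale-aware collation rules
# --buffer-size=1G               upper bound on the in-memory sort buffer
# --temporary-directory=./tmp    where sort writes intermediate files when the
#                                input does not fit in the buffer
# --unique                       emit only one line from each run of equal lines
result=$(LC_COLLATE=C sort --buffer-size=1G --temporary-directory=./tmp --unique sample.txt)

printf '%s\n' "$result"
```

One caveat worth checking on your own system: if LC_ALL is set in the environment, it takes precedence over LC_COLLATE, so the byte-wise ordering may not apply unless LC_ALL is unset or set to C as well.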
About the author
Alex Holmes works on tough big-data problems. He is a software engineer, author, speaker, and blogger specializing in large-scale Hadoop projects. He is the author of Hadoop in Practice, a book published by Manning Publications. He has presented multiple times at JavaOne, and is a JavaOne Rock Star.