Lexicographically sorting large files in Linux

When I hear the word “sort”, my first thought is usually “Hadoop”! Yes, sorting is one thing that Hadoop does well, but if you’re working with large files in Linux, the built-in sort command is often all you need.

Let’s say you have a large file on a host with 2GB or more of main memory free. The following sort command is an efficient way to lexicographically order large files.

LC_COLLATE=C sort --buffer-size=1G --temporary-directory=./tmp --unique bigfile.txt

Let’s break this command down and examine each part in detail.

[Image: the sort command]
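To make the pieces concrete, here is a minimal sketch that runs the same command against a tiny sample file. The scratch directory, the sample words, the `sorted.txt` output name, and the smaller 1M buffer are assumptions for illustration and are not part of the original command. Two details are worth noting: sort expects the temporary directory to already exist, and an exported LC_ALL takes precedence over LC_COLLATE, so the sketch unsets it first.

```shell
#!/bin/sh
# Work in a throwaway directory (hypothetical location, created just for this demo).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# sort writes its spill files here; the directory must exist beforehand.
mkdir -p ./tmp

# Tiny stand-in for bigfile.txt, with one duplicate line and mixed case.
printf '%s\n' banana Apple cherry apple banana Cherry > bigfile.txt

# LC_ALL, if set, overrides LC_COLLATE, so clear it for this shell.
unset LC_ALL

# LC_COLLATE=C            compare raw byte values instead of locale collation rules
# --buffer-size=1M        in-memory buffer (the post uses 1G; 1M is plenty for a demo)
# --temporary-directory   where sorted chunks go when the input exceeds the buffer
# --unique                emit only one line from each run of equal lines
LC_COLLATE=C sort --buffer-size=1M --temporary-directory=./tmp --unique bigfile.txt > sorted.txt

cat sorted.txt
```

With byte-wise comparison the uppercase words come first, so this prints Apple, Cherry, apple, banana, cherry, one per line, with the duplicate "banana" removed.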

About the author

Hadoop in Practice, Second Edition

Alex Holmes is a senior software engineer with over 15 years of experience developing large-scale distributed Java systems. Since 2008, he has gained expertise in using Hadoop to solve Big Data problems across a number of projects. He is the author of Hadoop in Practice, a book published by Manning Publications. He has presented at JavaOne and Jazoon.

If you want to see what Alex is up to you can check out his work on GitHub, or follow him on Twitter or Google+.
