LZOP decompression - revenge of the useless cat
For me LZOP is the go-to compression codec when working with large text files in HDFS, thanks to its MapReduce data locality advantages. As a result, when I want to peek at LZOP-compressed files in HDFS I use a command such as:
shell$ hadoop fs -cat /some/file.lzo | lzop -dc | head
With this command the contents of an LZOP-compressed file in HDFS are piped to the lzop utility, where the -dc flags tell lzop to decompress the stream and write the uncompressed data to standard out, and the final head shows the first 10 lines of the data. I may substitute head with other utilities such as awk or sed, but I always follow this general pattern of piping the lzop output to another utility.
Imagine my surprise the other day when I tried the same command on a smaller file (hence not needing the head command), only to see this error:
shell$ hadoop fs -cat /some/file.lzo | lzop -dc
lzop: <stdout>: uncompressed data not written to a terminal
What just happened - why would the first command work, but not the second? The difference is where standard output lands: piped into head it's a pipe, but without head it's the terminal itself, and lzop refuses to write uncompressed data to a terminal. My guess is that this is the authors of the lzop utility safeguarding us from accidentally flooding our terminals with uncompressed (and potentially binary) data. This is frustrating because, as you can see from the following example, the authors of gunzip took a different route:
shell$ echo "the cat" | gzip -c | gunzip -c
the cat
If we run the same command with lzop we see the same error as we saw earlier:
shell$ echo "the cat" | lzop -c | lzop -dc
lzop: <stdout>: uncompressed data not written to a terminal
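To confirm that the check is against the terminal and not against piping in general, we can redirect lzop's standard output to a file (the path here is just for illustration), at which point the complaint disappears:
shell$ echo "the cat" | lzop -c | lzop -dc > /tmp/the-cat.txt
shell$ cat /tmp/the-cat.txt
the cat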
A ghetto approach to solving this problem is to pipe the lzop output to cat (which is a necessary violation of the useless cat pattern):
shell$ hadoop fs -cat /some/file.lzo | lzop -dc | cat
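For anyone unfamiliar with the pattern: a "useless use of cat" is when cat adds nothing, as in cat file.txt | grep foo instead of grep foo file.txt. Here, though, the trailing cat does real work - it turns lzop's standard output into a pipe rather than a terminal, which sidesteps the check.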
Luckily lzop has a -f option which removes the need for the cat:
shell$ hadoop fs -cat /some/file.lzo | lzop -dcf
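The same flag also makes the earlier echo round trip work when standard out is a terminal:
shell$ echo "the cat" | lzop -c | lzop -dcf
the cat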
It turns out that the man page for lzop is instructive with regard to the -f option, indicating various scenarios where it can be helpful:
shell$ man lzop
...
-f, --force
Force lzop to
- overwrite existing files
- (de-)compress from stdin even if it seems a terminal
- (de-)compress to stdout even if it seems a terminal
- allow option -c in combination with -U
Using -f two or more times forces things like
- compress files that already have a .lzo suffix
- try to decompress files that do not have a valid suffix
- try to handle compressed files with unknown header flags
Use with care.
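As a hypothetical illustration of the doubled -f (the filename below is made up), a single -f won't help if an LZOP file has lost its .lzo suffix, but per the man page -ff forces lzop to attempt the decompression anyway:
shell$ mv file.lzo file.bin
shell$ lzop -dcff file.bin | head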