Next Generation Hadoop - It's Not Just Batch!

In my JavaOne talk today I presented changes that are happening in Hadoop, where it’s shaking off it’s batch-based shackles and enabling a new Hadoop platform that can support a mix of processing systems, from stream-processing systems to NoSQL systems.

The slides for my talk can be viewed on Speaker Deck. The rest of this post is an overview of the technologies covered in my talk, along with links for further reading.


With Hadoop 2.x, we now have YARN which acts as a distributed scheduler. This is a big step towards the vision of Hadoop being the Big Data Kernel, as it allows arbitrary applications to be scheduled on the same Hadoop cluster, and enables a new world where we can have silo’d applications coexisting on the same hardware and sharing the same storage.

The following links serve as a good starting ground to learn more about YARN:

Apache HBase

HBase is a NoSQL, distributed multi-dimensional map based on Google’s BigTable. It uses HDFS for persistence, which is a huge benefit if a key requirement of your NoSQL system is the ability to read and write data into HBase using MapReduce.

HBase on YARN (Hoya)

Hoya is a YARN application that allows multiple HBase clusters to coexist on a single Hadoop YARN cluster. This provides strong data/resource isolation properties, in conjunction with the ability to easily spin up, upsize/downsize and shutdown HBase clusters. Hoya was developed by Steve Loghran and friends over at Hortonworks.

Apache Accumulo

Accumulo is a BigTable implementation much like HBase. It also uses HDFS for storage, and currently has an edge in the security world due to its cell-level security. Although it should be noted that this is planned for HBase (see HBASE-6222).


ElephantDB is a read-only key-value store, which uses HDFS to load data which is served in real-time. It’s a part of Nathan Marz’s Lambda Architecture and enables the rapid loading and serving of data produced in the batch tier.


Storm is a stream processing, continuous computation and distributed RPC system developed and open-sourced by Twitter. It allows you to perform near real-time calculations such as trending topics.

Storm on YARN

Yahoo use Storm for a variety of use cases, and created the Storm-on-YARN so that then could run Storm on their YARN clusters. They also added the ability for Storm to read/write to secure HDFS.

Apache Samza

Samza (incubating) is a stream processing system that uses Kafka for messaging, and optionally YARN for resource management.


Morphlines is a ETL library from Cloudera that has implementations available for use within Flume, MapReduce and HBase. Using a modified JSON syntax it allows you to create a pipeline of work which can fulfill use cases such as near real-time writes from Flume into Solr Cloud.

Apache Giraph

Giraph is a framework for performing offline batch processing of semi-structured graph data on a massive scale. It offers performance advantages over graph processing with MapReduce.


Impala from Cloudera is an implementation of Google’s paper on Dremel, and provides interactive SQL capabilities on top of data in HDFS and HBase.

Apache Drill

An (incubating) project that offers the promise of interactive SQL capabilities over data in HDFS, HBase, Cassandra, MongoDB and Splunk.


Parquet, a joint initiative from Cloudera and Twitter, is a columnar data format supporting nested data. It can offer space and time advantages over row-ordered data, especially with queries that return a subset of the overall columns. It supports a wide variety of tools (MapReduce, Impala, Pig and Hive) and is used in production by Twitter.

ORC File

ORC File is a columnar data format that also supports nested data. It is currently implemented within Hive 0.11.

Apache Tez

Tez (incubating) is a generalized DAG execution engine. The goal of the project is to remove disk barriers that exist with pipelined MapReduce jobs. The first goal of the project is to provide a MapReduce implementation using Tez, followed by Hive and Pig.

Apache Mesos

Mesos is a cluster manager, similar to YARN, providing resource sharing and isolation capabilities in a distributed cluster. It can support multiple instances and versions of Hadoop, Spark and other applications. It’s used in Twitter to manage various applications in production.

Lambda Architecture

The Lambda Architecture, an architectural blueprint from Nathan Marz, suggests that speed and batch layers should exist to play to their mutual strengths: the speed layer providing near real-time data aggregations, and the batch layer providing a mechanism to correct potential mistakes made in the speed layer.


Summingbird is a project out of Twitter which could be viewed as an implementation of the Lambda Architecture. It allows you to using a single API to define operations on distributed collections which can be mapped into MapReduce or Storm executions.

Apache Spark

Spark (incubating) is an in-memory distributed processing system which allows you to perform MapReduce, as well as iterative workloads over data. Spark and its family of associated projects (such as Spark Streaming, GraphX) offers a complete solution to most distributed processing use cases.

About the author

Hadoop in Practice, Second Edition

Alex Holmes works on tough big-data problems. He is a software engineer, author, speaker, and blogger specializing in large-scale Hadoop projects. He is the author of Hadoop in Practice, a book published by Manning Publications. He has presented multiple times at JavaOne, and is a JavaOne Rock Star.

If you want to see what Alex is up to you can check out his work on GitHub, or follow him on Twitter or Google+.

comments powered by Disqus


Full post archive