Tuesday, April 15, 2014

What is Apache Tez? A Hadoop YARN DAG-based framework for speed and flexibility

A very nice presentation on Apache Tez by Bikas Saha, Hadoop and Tez committer currently at Hortonworks, at QCon 2013 in NY. The slides are quite dense, but illuminating about what Tez (and YARN) can offer to the Hadoop ecosystem.

Apache Tez: Accelerating Hadoop Query Processing:
http://www.infoq.com/presentations/apache-tez

Tez is an Apache project that aims to accelerate Hadoop querying using YARN's possibilities. It is a client-side application (more on this later). But it is not a end-user application: it is thought to build applications on top of Tez.

CHARACTERISTICS

Expressive dataflow definition APIs.

It uses sampling to better partitioning data.

Flexible Input-Output-Processor runtime model

The idea is to have a library of inputs, outputs and processors that can be plugged into one another dinamically to perform the computations.
For example, you can define a mapper as a sequence like this:
INPUT: HDFSInput
PROCESSOR: MapProcessor
OUTPUT: FileSortedOutput
or a reducer as:
INPUT: ShuffleInput
PROCESSOR: ReduceProcessor
OUTPUT: HDFSOutput
This is an expression of what actually happens behind the scenes in MRv1, but made flexible to accomodate other tasks.

Data type agnostic

Nothing enforced. Just bytes. KV, tuples, whatever; it's up to your application.

Simplifying Deployment

Upload the libraries and use Tez client, nothing installed.

 EXECUTION PERFORMANCE

Performance gains over MapReduce

Due to the fact that this is a DAG: removed write barrier between computations and launch overhead, etc.

Plan reconfiguration at runtime

We can change the topology of the graph in runtime! This means, we can choose different topologies depending on e.g. the volume of data, you change the number of reducers. Or make joins in different orders. It really gives a great deal of flex!

Optimal resource management

By reusing YARN containers, for new tasks or to keep shared objects in memory for further task executions.

Dynamic physical flow decisions

Optimizations like "process locally if possible", "can we use in-memory datastore?", etc. This is in the pipeline, but it does open possibilities.








No comments:

Post a Comment