A very nice presentation on Apache Tez by Bikas Saha, Hadoop and Tez committer currently at Hortonworks, at QCon 2013 in NY. The slides are quite dense, but illuminating about what Tez (and YARN) can offer to the Hadoop ecosystem.
Apache Tez: Accelerating Hadoop Query Processing:
http://www.infoq.com/presentations/apache-tez
Tez is an Apache project that aims to accelerate Hadoop querying using YARN's possibilities. It is a client-side application (more on this later). But it is not a end-user application: it is thought to build applications
on top of Tez.
CHARACTERISTICS
Expressive dataflow definition APIs.
It uses sampling to better partitioning data.
Flexible Input-Output-Processor runtime model
The idea is to have a library of inputs, outputs and processors that can be plugged into one another dinamically to perform the computations.
For example, you can define a mapper as a sequence like this:
INPUT: HDFSInput
PROCESSOR: MapProcessor
OUTPUT: FileSortedOutput
or a reducer as:
INPUT: ShuffleInput
PROCESSOR: ReduceProcessor
OUTPUT: HDFSOutput
This is an expression of what actually happens behind the scenes in MRv1, but made flexible to accomodate other
tasks.
Data type agnostic
Nothing enforced. Just bytes. KV, tuples, whatever; it's up to your application.
Simplifying Deployment
Upload the libraries and use Tez client, nothing installed.
EXECUTION PERFORMANCE
Performance gains over MapReduce
Due to the fact that this is a DAG: removed write barrier between computations and launch overhead, etc.
Plan reconfiguration at runtime
We can change the topology of the graph in runtime! This means, we can choose different topologies depending on e.g. the volume of data, you change the number of reducers. Or make joins in different orders. It really gives a great deal of flex!
Optimal resource management
By reusing YARN containers, for new tasks or to keep shared objects in memory for further task executions.
Dynamic physical flow decisions
Optimizations like "process locally if possible", "can we use in-memory datastore?", etc. This is in the pipeline, but it does open possibilities.
No comments:
Post a Comment