Sunday, April 27, 2014

Storm-YARN at Yahoo! - Realtime streaming data processing over HDFS at Hadoop Summit

In Hadoop Summit last year, Andrew Feng of Yahoo! delivered a presentation on Storm-on-YARN, using as a use case the personalization they use in their web pages.

Apache Storm is a streaming data processing aimed at achieving real-time processing over big data. It is fault-tolerant (if any node fails, the task will be respawned in other nodes). We have previously talked about Lambda Architectures recently (and talking about Lambdoop at the same time), and about its seminal article by Nathan Marz. Storm is actually led by nathanmarz, and it proposes to form a topology of tasks separated in Spouts(sources of data) and Bolts (processors with inputs and outputs) in a DAG, which reminds me that of Apache Tez.

I've found the Storm explanation at Hortonworks page quite concise, concrete and to the point - so I'll just leave it there if you want to know more of Storm. I also recommend to browse the concepts at Storm home page, now at Apache Incubator.



I actually find interesting in the presentation, the explanation on how does Storm work internally by way of showing how storm-yarn makes your life easier.

PS. They actually tell everyone that Yahoo! had been using (by July 2013) YARN in a 10.000 nodes cluster. For those that doubt about stability and production-readiness of the system that brings that extra flex to HDFS.

PS2. All due respect, I find Mr. Feng's accent a bit hard to follow, but gotta say that as the presentation advanced I began to like it :)

No comments:

Post a Comment