Sunday, April 27, 2014

Storm-YARN at Yahoo! - Real-time streaming data processing over HDFS at Hadoop Summit

At Hadoop Summit last year, Andrew Feng of Yahoo! delivered a presentation on Storm-on-YARN, using the personalization of Yahoo!'s web pages as the use case.

Apache Storm is a streaming data processing system aimed at real-time processing over big data. It is fault-tolerant (if a node fails, its tasks are respawned on other nodes). We talked recently about Lambda Architectures (and about Lambdoop at the same time), and about the seminal article by Nathan Marz. Storm is actually led by nathanmarz, and it organizes work as a topology: a DAG of Spouts (sources of data) and Bolts (processors with inputs and outputs), which reminds me of Apache Tez.
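To picture what such a topology looks like in code, here is a minimal Storm sketch (using the pre-1.0 backtype.storm package naming of the time): the spout and bolt classes are hypothetical placeholders for your own implementations, while the builder, grouping and submission calls are Storm's usual API.

```java
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.topology.TopologyBuilder;

public class WordCountTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // Spout: the source of the stream (hypothetical class)
        builder.setSpout("sentences", new SentenceSpout(), 2);
        // Bolt: a processor with inputs and outputs (hypothetical class),
        // wired to the spout, forming an edge of the DAG
        builder.setBolt("counts", new WordCountBolt(), 4)
               .shuffleGrouping("sentences");

        Config conf = new Config();
        conf.setNumWorkers(2);

        // Run locally for testing; on a real cluster you would submit it
        // with StormSubmitter.submitTopology(...) instead
        new LocalCluster().submitTopology("word-count", conf, builder.createTopology());
    }
}
```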

I've found the Storm explanation on the Hortonworks page quite concise, concrete and to the point - so I'll just leave it there if you want to know more about Storm. I also recommend browsing the concepts at the Storm home page, now at the Apache Incubator.



What I actually find interesting in the presentation is the explanation of how Storm works internally, given by way of showing how storm-yarn makes your life easier.

PS. They also mention that Yahoo! had been running YARN (as of July 2013) on a 10,000-node cluster - something for those who doubt the stability and production-readiness of the system that brings that extra flexibility to Hadoop.

PS2. With all due respect, I find Mr. Feng's accent a bit hard to follow, but I gotta say that as the presentation advanced I began to like it :)

Thursday, April 24, 2014

Hadoop number of reducers: the MapReduce rule of thumb explained

Reviewing the documentation for my CCDH exam in a few weeks, I came across the topic of the recommended number of reducers.
There is a widely known rule of thumb: 0.95 or 1.75 times (#nodes * mapred.tasktracker.reduce.tasks.maximum).

With the 0.95 factor, all reduce tasks run at the same time (in a single wave), but each one typically takes longer. With the 1.75 factor, each reduce task is smaller and finishes faster, but some reduce tasks wait in a queue and run as a second wave. Prefer the 1.75 factor for better load balancing.
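As a quick sketch of how the rule translates into code - the node count and per-node reduce slots below are made-up values you would take from your own cluster; only setNumReduceTasks() is the actual MapReduce API call:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReducerCountExample {
    public static void main(String[] args) throws Exception {
        int nodes = 20;              // assumed cluster size
        int reduceSlotsPerNode = 2;  // assumed mapred.tasktracker.reduce.tasks.maximum

        // Rule of thumb: 1.75 * (#nodes * reduce slots per node)
        int numReducers = (int) Math.round(1.75 * nodes * reduceSlotsPerNode);  // 70 here

        Job job = Job.getInstance(new Configuration(), "reducer-count-example");
        job.setNumReduceTasks(numReducers);
        // ... set mapper, reducer, input/output paths as usual ...
    }
}
```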

From Cloudera documentation:

Typical way to determine how many Reducers to specify:
– Test the job with a relatively small test data set
– Extrapolate to calculate the amount of intermediate data expected from the 'real' input data.
– Use that to calculate the number of Reducers which should be specified.
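As a made-up worked example of that extrapolation: if a 1 GB test sample produces about 200 MB of intermediate data, then 500 GB of 'real' input should produce roughly 100 GB of intermediate data; aiming at around 1 GB of intermediate data per reducer would then suggest on the order of 100 reducers.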

However, this doesn't fit all, or even most, problems and environments. Looking for a deeper explanation and hints from industry leaders, I found, in an interesting conversation on Quora, the following advice from Allen Wittenauer of LinkedIn:

 I tend to tell users that their ideal reducers should be the optimal value that gets them closest to:

- A multiple of the block size
- A task time between 5 and 15 minutes
- Creates the fewest files possible

Anything other than that means there is a good chance your reducers are less than great.  There is a tremendous tendency for users to use a REALLY high value ("More parallelism means faster!") or a REALLY low value ("I don't want to blow my namespace quota!").  Both are equally dangerous, resulting in one or more of:

- Terrible performance on the next phase of the workflow
- Terrible performance due to the shuffle
- Terrible overall performance because you've overloaded the namenode with objects that are ultimately useless
- Destroying disk IO for no really sane reason
- Lots of network transfers due to dealing with crazy amounts of CFIF/MFIF work

Now, there are always exceptions and special cases. One particular special case is that if following that advice makes the next step in the workflow do ridiculous things, then we need to likely 'be an exception' in the above general rules of thumb.
[...]
One other thing to mention:  if you are changing the reducer count just to make the partitioning non-skewy in order to influence the partition, you're just plain "doing it wrong".

There is another issue: being on a shared cluster. If you use the FairScheduler, you can be affected by other jobs' and users' quotas, so you'd need to adjust your "#nodes" value accordingly.

More insight from Pere Ferrera from Datasalt:

- Don't use too few reducers: if each one is very big and one fails, retrying it will take much longer than retrying a smaller one.
- The number of reducers is generally not the problem; if there is a problem, it is partitioning. It doesn't matter whether you use 100 reducers if only 1 of them ends up handling all the work.

Two good points. I like the second one: partitioning the data correctly is far more important than the number of "slots" you send the data to for reducing.
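To make that point concrete, here is a minimal, hypothetical Partitioner sketch that fights skew by "salting" one known hot key across several reducers; the hot-key value and the salt factor are assumptions, and a job using this would need to re-aggregate the hot key's partial results in a follow-up step.

```java
import java.util.Random;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class SkewAwarePartitioner extends Partitioner<Text, IntWritable> {
    private static final String HOT_KEY = "null_user"; // assumed dominant key
    private static final int SALT = 8;                  // spread it over up to 8 partitions
    private final Random random = new Random();

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (HOT_KEY.equals(key.toString())) {
            // The hot key is scattered over several reducers instead of one
            return random.nextInt(Math.min(SALT, numPartitions));
        }
        // Everything else follows the usual hash partitioning
        return (key.toString().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

// Registered on the job with: job.setPartitionerClass(SkewAwarePartitioner.class);
```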

Tuesday, April 15, 2014

What is Apache Tez? A Hadoop YARN DAG-based framework for speed and flexibility

A very nice presentation on Apache Tez by Bikas Saha, a Hadoop and Tez committer currently at Hortonworks, given at QCon 2013 in NY. The slides are quite dense, but illuminating about what Tez (and YARN) can offer to the Hadoop ecosystem.

Apache Tez: Accelerating Hadoop Query Processing:
http://www.infoq.com/presentations/apache-tez

Tez is an Apache project that aims to accelerate Hadoop query processing by exploiting YARN's possibilities. It runs entirely client-side (more on this later), but it is not an end-user application: it is meant for building applications on top of it.

CHARACTERISTICS

Expressive dataflow definition APIs.

It can use sampling to partition data better.

Flexible Input-Output-Processor runtime model

The idea is to have a library of inputs, outputs and processors that can be plugged into one another dynamically to perform the computations.
For example, you can define a mapper as a sequence like this:
INPUT: HDFSInput
PROCESSOR: MapProcessor
OUTPUT: FileSortedOutput
or a reducer as:
INPUT: ShuffleInput
PROCESSOR: ReduceProcessor
OUTPUT: HDFSOutput
This is an expression of what actually happens behind the scenes in MRv1, but made flexible enough to accommodate other kinds of tasks.
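Purely as an illustration - these are made-up interfaces, not the actual Tez API - the composition above could be pictured like this:

```java
// Hypothetical interfaces mirroring the Input-Processor-Output model
interface Input {}
interface Processor {}
interface Output {}

class Task {
    final Input input;
    final Processor processor;
    final Output output;

    Task(Input input, Processor processor, Output output) {
        this.input = input;
        this.processor = processor;
        this.output = output;
    }
}

// Hypothetical stand-ins for the concrete components named above
class HdfsInput implements Input {}
class ShuffleInput implements Input {}
class MapProcessor implements Processor {}
class ReduceProcessor implements Processor {}
class FileSortedOutput implements Output {}
class HdfsOutput implements Output {}

class IopModelSketch {
    public static void main(String[] args) {
        // MRv1's mapper, expressed as a composition of pluggable pieces
        Task mapTask = new Task(new HdfsInput(), new MapProcessor(), new FileSortedOutput());
        // MRv1's reducer, expressed the same way
        Task reduceTask = new Task(new ShuffleInput(), new ReduceProcessor(), new HdfsOutput());
        // Because the pieces are pluggable, other compositions (e.g. an
        // in-memory output, or a non-map/reduce processor) are possible too.
    }
}
```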

Data type agnostic

Nothing enforced. Just bytes. KV, tuples, whatever; it's up to your application.

Simplifying Deployment

Upload the Tez libraries to HDFS and use the Tez client; nothing needs to be installed on the cluster nodes.

EXECUTION PERFORMANCE

Performance gains over MapReduce

Because the whole job is a DAG, the write barrier to HDFS between consecutive computations is removed, along with repeated job launch overhead, etc.

Plan reconfiguration at runtime

We can change the topology of the graph at runtime! This means we can choose different topologies depending on, e.g., the volume of data (changing the number of reducers), or perform joins in a different order. It really gives a great deal of flexibility!

Optimal resource management

By reusing YARN containers, either for new tasks or to keep shared objects in memory for subsequent task executions.

Dynamic physical flow decisions

Optimizations like "process locally if possible", "can we use an in-memory data store?", etc. This is still in the pipeline, but it does open up possibilities.

Friday, April 11, 2014

Hadoop ownership: a survey of administration and development teams inside companies

A research document, Integrating Hadoop into Business Intelligence and Data Warehousing, shows some valuable insight into the adoption of Big Data technologies (well, really, a Hadoop-biased view of big data) in companies out there. It is sponsored by Cloudera, EMC Greenplum, Hortonworks, ParAccel, SAP, SAS, Tableau Software, and Teradata - so don't try to find mentions of MongoDB or Cassandra there.

Among the many interesting facts, I highlight the one on Hadoop ownership - maybe my interest is spurred because, at the time of writing this, I am assessing whether to get into administration projects :D
There is also a bias towards big corporations, which makes sense if you look at some of the questions (many companies simply don't get to choose which team this falls under).



Ownership of Hadoop


Hadoop environments may be owned and primarily used by a number of organizational units.
In enterprises with multiple Hadoop environments, ownership can be quite diverse.

Data warehouse group (54%) In organizations that are intent on integrating Hadoop into the practices and infrastructure for BI, DW, DI, and analytics, it makes the most sense that the DW group or an equivalent BI team own and (perhaps) maintain a Hadoop environment.

Central IT (35%) Every organization is unique, but more and more, TDWI sees central IT evolving into a provider of IT infrastructure, especially networks, data storage, and server resources. This means that application-specific teams are organized elsewhere, instead of being under IT. In that spirit, it’s possible that Hadoop in the future will be just another infrastructure provided by central IT, with a wide range of applications tapped into it, not just analytic ones.

Application group (29%) Many types of big data are associated with specific applications and the technical or business teams that use them. Just think about Web logs and other Web data being generated or captured by Web applications created by a Web applications development team. In those cases, it makes sense that an application group should have its own Hadoop environment.

Research or analysis group (25%) Many user organizations have dedicated teams of business analysts (recently joined by data scientists), who tackle tough new business questions and problems as they arise. Similar teams support product developers and other researchers who depend heavily on data. These relatively small teams of analysts regularly have their own tools and platforms, and it seems that Hadoop is joining such portfolios.

Business unit or department (15%) Most analytic applications have an obvious connection to a department, such as customer-base segmentation for the marketing department or supplier analytics for the procurement department. This explains why most analytic applications are deployed as departmentally owned silos. This precedence continues with big data analytic applications, as seen in the examples given in the previous two bullets.



Thursday, April 10, 2014

Spider.io and Storm: A storming love story with a happy ending

At spider.io, they have made (what I consider) a lot of changes to their core architecture, to adapt to a rapidly-changing business environment. The gotchas on these slides are worth a look (and the humour is appreciated too).


 
Storm at spider.io - London Storm Meetup 2013-06-18 from Ashley Brown

The fact that they changed technologies so often, far from suggesting a lack of vision, can be seen as a sign of great flexibility and of an ability to adapt to their demands.

The concepts laid out are also quite challenging, yet clearly explained. I've found these slides very rewarding for the time invested.

Wednesday, April 9, 2014

Bitcoin hacking: a MongoDB infrastructure choice led to a hackable system in finance


It's a dirty little secret that everyone knows: Bitcoin exchanges built on top of first-generation NoSQL infrastructure lack even the most basic measures to guarantee the integrity of their accounts.

A very interesting article on some Bitcoin exchanges that, being built on NoSQL technologies (specifically MongoDB), have suffered from concurrency problems in their systems and have been (successfully) hacked.

What happened at Flexcoin, or Poloniex, or any of the other Bitcoin exchanges beset by technical problems (and I'm looking at you Coinbase!), was not an outlier or unexpected or unforeseable. The infrastructure is broken. And it is broken by design.

Flexcoin recently suffered a hacker attack that may well end up being shown to students in concurrency classes, so they gain an interest in how easy money can be made by exploiting other people's mistakes.
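To illustrate the class of bug being described, here is a minimal, self-contained Java sketch of a check-then-act withdrawal race, together with an atomic (compare-and-set style) alternative - the account class and the amounts are, of course, made up:

```java
import java.util.concurrent.atomic.AtomicLong;

public class WithdrawalRace {

    static class Account {
        private final AtomicLong balance = new AtomicLong(100);

        // BROKEN: the balance check and the deduction are two separate steps,
        // so two concurrent requests can both pass the check and both withdraw.
        boolean withdrawUnsafe(long amount) {
            if (balance.get() >= amount) {
                balance.addAndGet(-amount);
                return true;
            }
            return false;
        }

        // SAFER: compare-and-set makes the check and the deduction a single
        // atomic step - the same guarantee a conditional/transactional update
        // would give you in a data store that supports it.
        boolean withdrawSafe(long amount) {
            while (true) {
                long current = balance.get();
                if (current < amount) return false;
                if (balance.compareAndSet(current, current - amount)) return true;
            }
        }

        long balance() { return balance.get(); }
    }

    public static void main(String[] args) throws InterruptedException {
        Account account = new Account();
        Runnable attack = () -> account.withdrawUnsafe(100);
        Thread t1 = new Thread(attack), t2 = new Thread(attack);
        t1.start(); t2.start();
        t1.join(); t2.join();
        // With unlucky timing both withdrawals succeed and the balance goes negative.
        System.out.println("Balance after two concurrent withdrawals: " + account.balance());
    }
}
```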

Bitcoin meets NoSQL: The story of Flexcoin and Poloniex


The article provides some ideas for fixing the problem, and it also had the merit, in my case, of introducing me to HyperDex, a system I didn't know about that provides consistency + scalability, and which they proactively evangelise.

I am now working for a firm specialized in finance, so I'm going to pay extra attention to this kind of article.


Extra ball from Hacking Distributed: how to write documentation on infrastructure.

Monday, April 7, 2014

Lambdoop: Lambda architecture to go

Lambdoop is a product made in Spain that tries to put the famous Lambda Architecture into place with minimal distortion. I haven't tried it, but I have to say that I love the name.

At the moment, their website is minimal too. But this presentation at the NoSQL Matters conference in Barcelona a few months ago shows the present and future of the product which, according to them, is already in place at some (happy) customers'.

The choice of technology stack for the Lambda layers has been Hadoop + HBase for the batch layer, and Spark + Redis for the real-time processing layer. It also provides a Java abstraction layer and facilities to hybridize the two kinds of processing.
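Conceptually - this is not Lambdoop's actual API, just a made-up sketch of the Lambda idea such a stack implements - a query merges the precomputed batch view with the real-time increments that have arrived since the last batch run:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical views: the batch view would be backed by HBase, the speed view by Redis
interface BatchView { Map<String, Long> countsByKey(); }
interface SpeedView { Map<String, Long> recentCountsByKey(); }

class LambdaQuery {
    static Map<String, Long> merge(BatchView batch, SpeedView speed) {
        Map<String, Long> merged = new HashMap<>(batch.countsByKey());
        // Real-time increments are added on top of the (older) batch results
        speed.recentCountsByKey().forEach((k, v) -> merged.merge(k, v, Long::sum));
        return merged;
    }
}
```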

The Lambdoop ecosystem also comprises a set of tools to make life easier for developers (and non-developers, since there is a workflow designer tool too - though I'd ask why not use or adapt Oozie).

It remains to be seen whether the product will honor such a good name and, more specifically, manage to position itself in its niche. It is also not yet clear that the community needs a product shipping a whole Lambda Architecture in one package, rather than keeping the flexibility of arranging the layers separately.
With plenty of Lambda Architecture solutions out there and more to come, they had better hurry.


Thursday, April 3, 2014

Deep Learning with H2O: Neural networks, machine learning and big data

A very interesting presentation from H2O/0xdata, a product that aims to simplify machine learning over data in HDFS, running either on MapReduce or on YARN. A nice explanation with ideas on neural networks and how to optimize them in the Hadoop ecosystem.

I am actually taking Dr. Andrew Ng's Stanford Machine Learning course on Coursera, and by chance I am now on the neural networks chapter, so it's been good to see a big data application that brings both ideas together.

Presentation on H2O Deep Learning

Embedding SlideShare seems to be failing in this layout, so I leave a link to the presentation instead.