Friday, February 14, 2014

Java Hadoop: Reducer Key counter-intuitive behavior

One of the more curious gotchas you will run into as a Hadoop developer is a funny, counter-intuitive behaviour of the Java Reducer.

A SecondarySort example to illustrate


Suppose you have a dataset such as the typical Stock Market data:

Stock  Timestamp  Price

GOOG 2014-02-14 15:51  1000.0
GOOG 2014-02-14 15:52  1001.3
GOOG 2014-02-14 15:53  1001.2
...

The real dataset of course contains many different stocks, not only GOOG, and the records do not necessarily arrive sorted.

You want to sort by <Stock, Timestamp> to properly calculate how fast GOOG stock is rising (it never goes down, does it?). In other words, you want the difference between each price and the next, so you can draw a slope curve to show off in front of your significant other.

You would build a composite key containing Stock and Timestamp. Call it CompositeKey, implement its toString() as the concatenation of the two values, and everything else as you should. I omit most of that code for brevity, but a minimal sketch follows.
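
A minimal sketch of what such a key could look like. The field names stock and timestamp are my assumption, and a real implementation also needs hashCode(), equals() and the corresponding job wiring:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

public class CompositeKey implements WritableComparable<CompositeKey> {
    private Text stock = new Text();
    private Text timestamp = new Text();

    public Text getStock() { return stock; }

    @Override
    public void write(DataOutput out) throws IOException {
        stock.write(out);
        timestamp.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        stock.readFields(in);
        timestamp.readFields(in);
    }

    @Override
    public int compareTo(CompositeKey other) {
        // Sort by stock first, then by timestamp: the secondary sort order
        int cmp = stock.compareTo(other.stock);
        return cmp != 0 ? cmp : timestamp.compareTo(other.timestamp);
    }

    @Override
    public String toString() {
        return stock + "\t" + timestamp;
    }
}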
You would then set up a secondary sort so that each reducer call receives one key with all the corresponding values gathered, sorted by timestamp; the grouping comparator sketched below is the part that makes this happen.
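
A hedged sketch of that grouping comparator, building on the hypothetical CompositeKey above: it compares by stock only, so all of a stock's timestamps land in a single reduce() call while still arriving sorted.

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class StockGroupingComparator extends WritableComparator {

    protected StockGroupingComparator() {
        super(CompositeKey.class, true); // true: create key instances to compare
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        // Group by stock only: GOOG's whole time series becomes one reduce group
        return ((CompositeKey) a).getStock().compareTo(((CompositeKey) b).getStock());
    }
}

It would be registered with job.setGroupingComparatorClass(StockGroupingComparator.class), together with a partitioner that hashes on the stock alone, so all of a stock's records reach the same reducer. With that wiring in place, the reducer looks like this: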


@Override
public void reduce(CompositeKey key, Iterable<DoubleWritable> values, Context context)
      throws IOException, InterruptedException
{
    double previousPrice = -1;
    for (DoubleWritable value : values) {
        if (previousPrice == -1) {
            // First value of the group: no previous price, so we emit 0
            previousPrice = value.get();
        }
        double difference = value.get() - previousPrice;
        context.write(key, new DoubleWritable(difference));
        previousPrice = value.get(); // remember this price for the next iteration
    }
}

Here we should be emitting something like:

GOOG 2014-02-14 15:51  0
GOOG 2014-02-14 15:52  1.3
GOOG 2014-02-14 15:53  -0.1

Are we really doing so?


Look at the variable key.
We are iterating over the values, but apparently we are not iterating over the key. Inside the reducer the key object seems untouched and unchanged, so by regular intuition and Java semantics we should be emitting the same key every time (that would be, for example, the first key, <GOOG, 2014-02-14 15:51>). For a non-Hadoop Java programmer, then, the output should be something like:

GOOG 2014-02-14 15:51  0
GOOG 2014-02-14 15:51  1.3
GOOG 2014-02-14 15:51  -0.1

So why does the above code work?
It works because of how the iteration over the values is implemented.
With each new iteration the value obviously changes. But so does the key: when next() is called on the values iterator, Hadoop overwrites the contents of the very same key object with the key belonging to the next value.
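
You can observe the reuse with a few lines inside reduce() (a throwaway illustration, not production code):

// Inside reduce(): stash the key reference on every iteration.
List<CompositeKey> seenKeys = new ArrayList<>(); // java.util imports assumed
for (DoubleWritable value : values) {
    seenKeys.add(key); // stores a reference, not a copy
}
// Every entry points at the very same object, which now holds the
// LAST key's contents; this prints true.
System.out.println(seenKeys.get(0) == seenKeys.get(seenKeys.size() - 1));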

Why is this not made explicit? That is a good question, and I bet it is one that comes up in many introductory Hadoop courses from those curious enough to delve into Hadoop's guts. The reason behind the reuse is performance: recycling a single key and value object spares the framework from allocating millions of short-lived objects per reduce task.

What if this is what I want?


If you want to keep the keys across iterations (something you should ask yourself twice about), the only way is to perform a deep copy, also known as a hard clone. You can do this elegantly by implementing a clone() method in your CompositeKey that returns a newly created object with deep-copied attributes.
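
A minimal sketch of such a clone() for the hypothetical CompositeKey above (Hadoop's WritableUtils.clone(writable, conf) is an alternative that deep-copies through serialization):

@Override
public CompositeKey clone() {
    CompositeKey copy = new CompositeKey();
    // Allocate fresh Text objects instead of sharing references, so
    // Hadoop's later mutations of this key do not affect the copy.
    copy.stock = new Text(this.stock);
    copy.timestamp = new Text(this.timestamp);
    return copy;
}

With that in place, key.clone() inside the loop gives you a snapshot that is safe to keep.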
