Adventures in Big Data: 2015

Tuesday, September 29, 2015

Benefits of Functional Programming in Big Data

In a recent talk about MongoDB, I noticed that some of the attendees were not used to stream-like, kinda-functional pipeline programming.

I briefly explained the origins of functional programming and some reasons why it has become fashionable lately, even in programming languages that are not functional by nature, like Java itself. But I neglected to give some of the reasons why this is a stable trend.

So, I'd like to point out some benefits of functional programming. I won't overthink it: I took them from The Book.

Purely functional functions (or expressions) have no side effects (memory or I/O). This means that pure functions have several useful properties, many of which can be used to optimize the code:

If the result of a pure expression is not used, it can be removed without affecting other expressions.

If a pure function is called with arguments that cause no side-effects, the result is constant with respect to that argument list (sometimes called referential transparency), i.e. if the pure function is again called with the same arguments, the same result will be returned (this can enable caching optimizations such as memoization).

If there is no data dependency between two pure expressions, then their order can be reversed, or they can be performed in parallel and they cannot interfere with one another (in other terms, the evaluation of any pure expression is thread-safe).

If the entire language does not allow side-effects, then any evaluation strategy can be used; this gives the compiler freedom to reorder or combine the evaluation of expressions in a program (for example, using deforestation).
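To make two of these properties concrete, here is a minimal Java sketch of my own (it is not from the quoted text). It assumes a made-up pure function, sumOfSquaresUpTo, and shows that it can be memoized safely and mapped over a parallel stream without changing the result. The class and method names are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class PureFunctionsSketch {

    // Pure: the result depends only on the argument; no I/O and no shared state is touched.
    static long sumOfSquaresUpTo(int n) {
        long total = 0;
        for (int i = 1; i <= n; i++) {
            total += (long) i * i;
        }
        return total;
    }

    // Memoization is safe only because the same argument always yields the same result.
    private static final Map<Integer, Long> CACHE = new ConcurrentHashMap<>();

    static long memoizedSumOfSquaresUpTo(int n) {
        return CACHE.computeIfAbsent(n, PureFunctionsSketch::sumOfSquaresUpTo);
    }

    // Sequential evaluation of the pure function over 1..upTo.
    static List<Long> sequential(int upTo) {
        return IntStream.rangeClosed(1, upTo)
                .mapToObj(PureFunctionsSketch::sumOfSquaresUpTo)
                .collect(Collectors.toList());
    }

    // Same computation on a parallel stream: the calls share no data,
    // so they cannot interfere, and the collected list keeps encounter order.
    static List<Long> parallel(int upTo) {
        return IntStream.rangeClosed(1, upTo)
                .parallel()
                .mapToObj(PureFunctionsSketch::sumOfSquaresUpTo)
                .collect(Collectors.toList());
    }
}
```

Calling sequential(50) and parallel(50) should give equal lists. If sumOfSquaresUpTo wrote to a shared counter instead, neither the cache nor the parallel version could be trusted.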

Thread-safety is a key concept for scalability: it ensures you can parallelize happily without having to worry about side effects. Scala is a good example of a language designed with this in mind.
Another interesting snippet for Java programmers:

In Java, anonymous classes can sometimes be used to simulate closures;^[48] however, anonymous classes are not always proper replacements to closures because they have more limited capabilities.^[49] Java 8 supports lambda expressions as a replacement for some anonymous classes.^[50] However, the presence of checked exceptions in Java can make functional programming inconvenient, because it can be necessary to catch checked exceptions and then rethrow them—a problem that does not occur in other JVM languages that do not have checked exceptions, such as Scala.
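The checked-exception inconvenience is easier to see in code. The sketch below is my own illustration, not part of the quoted text: parseRecord is a hypothetical method that declares IOException, and IOFunction and unchecked are helper names I made up. Since java.util.function.Function does not allow checked exceptions, the lambda has to catch and rethrow.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

class CheckedLambdaSketch {

    // Hypothetical functional interface whose method may throw a checked exception.
    @FunctionalInterface
    interface IOFunction<T, R> {
        R apply(T t) throws IOException;
    }

    // Adapts an IOFunction to a plain Function by catching the checked
    // exception and rethrowing it wrapped in an unchecked one.
    static <T, R> Function<T, R> unchecked(IOFunction<T, R> f) {
        return t -> {
            try {
                return f.apply(t);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        };
    }

    // Hypothetical operation that declares a checked exception.
    static int parseRecord(String line) throws IOException {
        if (line.isEmpty()) {
            throw new IOException("empty record");
        }
        return line.length();
    }

    // Without the unchecked(...) wrapper, this map call would not compile.
    static List<Integer> lengths(List<String> lines) {
        return lines.stream()
                .map(unchecked(CheckedLambdaSketch::parseRecord))
                .collect(Collectors.toList());
    }
}
```

Wrapping keeps the pipeline readable, but callers now have to know that an UncheckedIOException can come out of lengths.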

Wednesday, August 19, 2015

RocksDB: how architecture changes with hardware changes (Flash memory)

RocksDB is an embedded database from Facebook oriented to take advantage of Flash storage. It's a good example of how software architecture needs to change to cater for new hardware technologies.

Dhruba Borthakur explains it better:

Why do we need an Embedded Database?

Flash is fundamentally different from spinning storage in performance. For the purpose of this discussion, let's assume that a read or write to spinning disk takes about 10 milliseconds while a read or write to flash storage takes about 100 microseconds. Network latency between two machines remains around 50 microseconds. These numbers are not cast in stone and your hardware could be very different from this one, but these numbers demonstrate the relative differences between two scenarios. What does this have anything to do with application-systems architecture? A client wants to store and access data from a database. There are two alternatives, it can store data on locally attached disks or it can store data over the network on a remote server that has disks attached to it. If we consider latency, then the locally attached disks can serve a read request in about 10 milliseconds. And in the client-server architecture, accessing the same data over a network results in a latency of 10.05 milliseconds, the overhead imposed by the network being only a minuscule 0.5%. Given this fact, it is easy to understand why a majority of currently-deployed systems use the client-server model of accessing data. (For the purpose of the discussion, I am ignoring network bandwidth limitations).

Now, let's consider the same scenario but with disks replaced by flash drives. A data access in the case of locally attached flash storage is 100 microseconds whereas accessing the same data via the network is 150 micros. Network data access is 50% higher overhead than local data access and 50% is a pretty big number. This means that databases that run embedded within an application could have much lower latency than applications that access data over a network. Thus, the necessity of an Embedded Database.
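The arithmetic behind those two percentages is short enough to write down. This is my own sketch using only the illustrative figures from the quote (10 ms disk, 100 microseconds flash, 50 microseconds network). The class and method names are made up.

```java
class LatencySketch {

    // Illustrative figures from the quoted post, all in microseconds.
    static final double DISK_US = 10_000.0;   // 10 milliseconds
    static final double FLASH_US = 100.0;
    static final double NETWORK_US = 50.0;

    // Latency of reaching the same storage through one network hop.
    static double remoteLatencyUs(double localUs) {
        return localUs + NETWORK_US;
    }

    // Network cost as a percentage of the local access time.
    static double networkOverheadPercent(double localUs) {
        return NETWORK_US / localUs * 100.0;
    }
}
```

With DISK_US the remote access comes to about 10,050 microseconds (10.05 ms) and the overhead to about 0.5%. With FLASH_US it comes to 150 microseconds and 50%. The network did not get slower; the storage got a hundred times faster.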

More in the History of RocksDB post (curiously, a one-post blog).