Adventures in Big Data: 2016

Saturday, September 10, 2016

[Solved] Fixing external network access in Hortonworks Sandbox HDP 2.4 - DNS NAT on VirtualBox

Hi,
just in case this is of help to anyone.

I installed HDP Sandbox 2.4 for VirtualBox locally on my computer, which sits on a company LAN.

All HDP services worked fine. However, I could not get an internet connection from the VM (needed, for example, to install packages or to compile projects that download dependencies). It turned out to be a problem with name resolution.

[root@sandbox tests]# ping google.com

ping: unknown host google.com

[root@sandbox tests]# ping 216.58.201.142  

PING 216.58.201.142 (216.58.201.142) 56(84) bytes of data.

64 bytes from 216.58.201.142: icmp_seq=1 ttl=53 time=15.4 ms

64 bytes from 216.58.201.142: icmp_seq=2 ttl=53 time=12.4 ms

...

Here 216.58.201.142 is the IP address of google.com, as resolved from my local machine. So the VM can reach the internet by IP; only name resolution fails.
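The two pings above can be folded into one small check. This is only a sketch: check_dns is a made-up helper name, and it assumes getent and ping are available in the VM (they should be on a stock CentOS-based sandbox, but I have not verified every image).

```shell
#!/bin/sh
# Hypothetical helper (not part of HDP): tell a DNS problem apart from a
# general connectivity problem.
#   $1 = a hostname to resolve, $2 = an IP address known to answer pings
check_dns() {
    host="$1"
    ip="$2"
    if getent hosts "$host" >/dev/null 2>&1; then
        # The name resolves, so the resolver configuration works.
        echo "dns-ok"
    elif ping -c 1 -W 2 "$ip" >/dev/null 2>&1; then
        # The IP answers but the name does not resolve: the case in this post.
        echo "dns-broken"
    else
        # Neither works: look at the VM's network/NAT settings first.
        echo "no-connectivity"
    fi
}

# Usage inside the sandbox, with the values from the pings above:
#   check_dns google.com 216.58.201.142
```

If it prints dns-broken, the fix described next should apply. If it prints no-connectivity, the problem is probably elsewhere.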

The HDP Sandbox ships with an /etc/resolv.conf that points to Google's well-known public DNS server, 8.8.8.8.
That server happens to be unreachable from my network, for a reason I still don't know.
The workaround is to make the sandbox use the IP of your LAN's DNS server instead.

You can get that by a simple nslookup from your local machine.

$ nslookup.exe google.com

Server:  your.nameserver.com

Address:  x.y.z.a

Non-authoritative answer:

Name:    google.com

Addresses:  (google IPs)

What you need is your nameserver's address, x.y.z.a. Add it to /etc/resolv.conf on your HDP Sandbox VM as a line of the form "nameserver x.y.z.a", ahead of (or instead of) the 8.8.8.8 entry, and you should be all set.
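For reference, the edit can be scripted along these lines. It is a minimal sketch, not something shipped with the sandbox: add_nameserver is a hypothetical helper, x.y.z.a remains a placeholder for your own DNS server, and it must run as root when it targets /etc/resolv.conf. Depending on how the VM's network is configured, the file may be regenerated on reboot or DHCP renewal, in which case the change has to be reapplied or made in the network configuration instead.

```shell
#!/bin/sh
# Hypothetical helper: put a nameserver entry at the top of a
# resolv.conf-style file.
#   $1 = DNS server IP (the x.y.z.a reported by nslookup)
#   $2 = file to edit (defaults to /etc/resolv.conf)
add_nameserver() {
    ip="$1"
    conf="${2:-/etc/resolv.conf}"
    # Do nothing if the exact entry is already there.
    if grep -qxF "nameserver $ip" "$conf" 2>/dev/null; then
        return 0
    fi
    tmp="$(mktemp)" || return 1
    # Nameservers are tried in the order listed, so the new entry goes
    # first, ahead of the unreachable 8.8.8.8.
    { echo "nameserver $ip"; cat "$conf" 2>/dev/null; } > "$tmp"
    # Overwrite the contents rather than moving the temp file, so the
    # original file keeps its ownership and permissions.
    cat "$tmp" > "$conf"
    rm -f "$tmp"
}

# Usage on the sandbox, as root:
#   add_nameserver x.y.z.a
```

The existing 8.8.8.8 line is kept as a fallback. Delete it by hand if you prefer a single entry.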

From there you can test it by e.g. pinging

Monday, September 5, 2016

Testing Alluxio

Lately I have been testing Alluxio (formerly Tachyon). I plan to use it to keep a large set of objects in memory for (mostly) Spark processing over HDFS; the objects receive a limited number of ad-hoc, asynchronous modifications each day.

Judging by feedback from other companies (the much-hyped Baidu and Barclays examples, among others), Alluxio looks like a good fit for this problem. The Barclays case is especially relevant, since their architecture sounds like a common pattern and resembles one I have actually worked with.

However, there are other contenders too, and I have been impressed by Apache Ignite, a whole suite for all things grid. It has an astonishing set of features, and IgniteRDD (a shared Spark RDD) in particular has caught my attention. Several modes of operation... very interesting. It would deserve a testing effort of its own, especially because its architecture (and set of use cases) seems to differ considerably from Alluxio's.

Also, more general-purpose tools like Redis might do the job. Alluxio's integration with HDFS made it our first choice for attacking the problem, but that certainly does not mean it is the best option. A recent article on using Spark with Redis for time series computation reported that Redis made Spark over 100 times faster than Spark alone and over 45 times faster than Spark with Alluxio. That would also mean Spark with Alluxio is only about 100/45 ≈ 2.2 times as fast as Spark alone, which sounds like too small a gain.

Let's see how it goes...