The Big Data Europe project is entering its second year. Diving deep into the world of Big Data has taught us one major lesson: things are moving fast! Nowadays, there are a lot of Big Data tools available, ranging from data storage through computation to cluster monitoring. Some Big Data technologies have gained ground over the past year, while others have received less attention. During that year, we have designed a platform on which you can deploy Big Data pipelines and which can still accommodate future technologies. And we’re doing everything we can to make it as simple as possible to get started.
Big Data platforms tend to run on multiple machines. Managing these machines can be a chore, and getting technologies running on them can be hard too. You may need more intimate knowledge of the cluster than you’d like to have, and we’ve taken this into account.
The base layer of the BDE platform is a set of machines, regardless of where they are physically located. We abstract over them, currently using Mesos. Given the constraints of these machines, work can automatically be sent to the most suitable nodes. Setting up such a platform can be complex, so we cater for your needs as a developer by supplying a Virtual Machine containing a micro-version of the platform. For cluster administrators, we’re offering Chef recipes.
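To give a feel for what "sending work to a suitable node" can mean, here is a minimal Python sketch of constraint-based placement. It is an illustration only, not the algorithm or API that Mesos uses: the `Node` and `Task` classes, their fields, and the best-fit rule are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    # Hypothetical description of a machine's spare capacity.
    name: str
    free_cpus: float
    free_mem_mb: int


@dataclass
class Task:
    # Hypothetical description of what a piece of work needs.
    name: str
    cpus: float
    mem_mb: int


def pick_node(task: Task, nodes: List[Node]) -> Optional[Node]:
    """Return the node that fits the task with the least leftover capacity."""
    candidates = [
        n for n in nodes
        if n.free_cpus >= task.cpus and n.free_mem_mb >= task.mem_mb
    ]
    if not candidates:
        return None
    # Best fit: prefer the node with the fewest spare CPUs after placement,
    # using spare memory as a tie-breaker.
    return min(
        candidates,
        key=lambda n: (n.free_cpus - task.cpus, n.free_mem_mb - task.mem_mb),
    )


def place(task: Task, nodes: List[Node]) -> Optional[str]:
    """Reserve resources on the chosen node and return its name, if any."""
    node = pick_node(task, nodes)
    if node is None:
        return None
    node.free_cpus -= task.cpus
    node.free_mem_mb -= task.mem_mb
    return node.name
```

Calling `place` repeatedly fills the smaller machines first and returns `None` when no machine can satisfy a task's constraints. A real cluster manager also has to deal with failures, priorities and fairness, which this sketch leaves out.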
The work you want to execute is packaged in Docker images. These are like tiny Virtual Machines for Linux, without the large overhead. Our approach ensures all your code is bundled, and that it doesn’t interfere with other software that’s been installed on the system. We’re going a step further and are trying to make these Docker images as easy to construct as possible. For this, we are building base images for various technologies. You extend the image, plug in the code that you need for your specific algorithm, and let the Docker ecosystem manage the packaging and deployment of your new component.
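As a rough illustration of the "extend the image, plug in your code" idea, the Python sketch below renders the text of a small Dockerfile. The helper `render_dockerfile`, the base image name `example/spark-base:latest`, and the `/app` target directory are hypothetical and chosen for this example; they are not actual BDE image names or conventions. Only the `FROM`, `COPY`, `WORKDIR` and `CMD` instructions are standard Dockerfile syntax.

```python
import json
from typing import List


def render_dockerfile(base_image: str, code_dir: str, command: List[str]) -> str:
    """Render a Dockerfile that extends a base image with your own code."""
    lines = [
        # Start from a base image that already contains the technology.
        f"FROM {base_image}",
        # Copy the algorithm-specific code into the image.
        f"COPY {code_dir} /app",
        "WORKDIR /app",
        # Exec-form CMD is a JSON array of the program and its arguments.
        f"CMD {json.dumps(command)}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical base image and job script, for illustration only.
    print(render_dockerfile("example/spark-base:latest", "./job", ["python", "job.py"]))
```

The point of the sketch is the division of labour: the base image carries the heavy technology stack, and your own component adds only a few lines on top of it.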
So how do we decide what to deploy? The Docker images are built at a finer granularity than the whole system. One such component could be the implementation of a database, another could be an algorithm to detect man-made changes in a set of images. We call a system that groups all the components needed to tackle a practical use case a pipeline. A pipeline therefore consists of a set of Docker images and a description of how their instances connect to each other. The net result is that we can deploy a network of pre-developed components to solve a Big Data problem.
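One way to picture a pipeline is as a small graph of components. The Python sketch below is an assumption about how such a description could look, not the format the BDE platform uses: the `Component` class, the `depends_on` field and the image names are invented for the example. It derives an order in which the components could be started so that each one comes up after the components it connects to.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import List


@dataclass
class Component:
    # Hypothetical pipeline entry: a name, a Docker image, and the
    # names of the components this one needs to connect to.
    name: str
    image: str
    depends_on: List[str] = field(default_factory=list)


def start_order(components: List[Component]) -> List[str]:
    """Return component names ordered so dependencies come first."""
    graph = {c.name: set(c.depends_on) for c in components}
    unknown = set().union(*graph.values()) - set(graph)
    if unknown:
        raise ValueError(f"unknown components referenced: {sorted(unknown)}")
    # TopologicalSorter expects a mapping from node to its predecessors.
    return list(TopologicalSorter(graph).static_order())


# Example pipeline with made-up image names.
pipeline = [
    Component("dashboard", "example/dashboard", depends_on=["change-detector"]),
    Component("change-detector", "example/change-detector", depends_on=["database"]),
    Component("database", "example/database"),
]
```

For this pipeline, `start_order(pipeline)` places the database first, then the change detector, then the dashboard. A circular dependency makes `TopologicalSorter` raise `graphlib.CycleError`, which is a useful early check on a pipeline description.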
A low barrier to entry, a broad application space, and low maintenance costs: that’s the BDE platform in sales terms. The foundations of this platform have been laid. We are continuing to work on easing the deployment of pipelines on the infrastructure and on clear guidelines for building specific components. The basics are in place, so let’s now build the next level!