Big Data analytics involves many data types: structured, unstructured, geographic, real-time media, natural language, time series, event, network and linked. It is necessary here to distinguish between human-generated data and device-generated data, since human-generated data is often less trustworthy, noisier and less clean. A brief description of each type is given below. Structured […]
We’re only about half a year away from seeing the first beta version of the BDE platform, and I can’t wait to see what it can do. Taking data from multiple sources in any number of formats (static files, streams, large datasets, lots and lots of little datasets) is a tough challenge. Just how useful the platform really is will be tested and proven through a series of pilots that we’re already working on. They’ll show how BDE can make it easier to do things that are currently hard, as well as one or two things that are so hard right now as to be unrealistic to attempt.
But there’s an issue that comes up again and again that needs our attention: rights, licences, permissions and obligations. Of course we love to talk about open data, that is, data that is made available for users to do whatever they want with it, including making money. But not all data is open, nor should it be. In the societal challenges that BDE is supporting, it’s obvious that a lot of health data needs to be confidential. The same is true for the social sciences and, of course, security. Between fully open and strictly confidential there is a wide spectrum of possibilities. Creative Commons offers a set of off-the-shelf licences that are extremely useful for many situations and are, rightly, used very widely. Good. But what about situations not covered by CC?
Data may have associated historical rights, or database rights, or any number of possibilities and, if BDE is to combine arbitrarily large datasets from arbitrarily disparate sources, and if it is to be useful in handling data that is not open, it needs to be able to process any permissions and obligations that are associated with that data.
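To make that requirement a little more concrete, here is a minimal sketch in Python of what "processing" permissions and obligations could mean when two datasets are combined. Everything in it is an assumption made for illustration: the `Policy` class, the `combine` function, the action names and the conservative merge rule (keep only what both sources permit, carry over every obligation) are hypothetical. It is not the ODRL model, not anything BDE has implemented, and certainly not a basis for legal compliance.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Policy:
    """Hypothetical, heavily simplified policy attached to one dataset."""
    permissions: FrozenSet[str]  # actions a user may perform
    obligations: FrozenSet[str]  # duties a user takes on by using the data


def combine(a: Policy, b: Policy) -> Policy:
    """Conservative merge (an assumed rule, not a standard one).

    The combined dataset permits only what both sources permit,
    and inherits the obligations of both.
    """
    return Policy(
        permissions=a.permissions & b.permissions,
        obligations=a.obligations | b.obligations,
    )


if __name__ == "__main__":
    open_data = Policy(
        permissions=frozenset({"use", "distribute", "commercialise"}),
        obligations=frozenset({"attribute"}),
    )
    restricted = Policy(
        permissions=frozenset({"use"}),
        obligations=frozenset({"delete-after-project"}),
    )
    merged = combine(open_data, restricted)
    print(sorted(merged.permissions))
    print(sorted(merged.obligations))
```

Even this toy version only works if both sources describe their terms with the same vocabulary, which is exactly what arbitrarily disparate sources cannot be expected to do without a shared model.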
That’s why W3C, a partner in BDE of course, is asking its membership to support the formation of a new Permissions and Obligations Expression Working Group. Quoting from the draft charter:
A permissions and obligations expression system should provide a flexible and interoperable information model that supports transparent and innovative (re)use of digital content across all sectors and communities. The underlying model should support the business models of open, educational, government, and commercial communities through profiles that align with their specific requirements whilst retaining a common semantic layer for wider interoperability. The system should not, however, be the basis of legal compliance or enforcement mechanisms.
That last sentence is important. The Working Group will not create something that can be the basis of any legal enforcement and, for clarity, a number of possible interpretations of the work are explicitly ruled out of scope.
The Working Group is set to handle use cases from the publishing world, where multiple assets (data, images, video and texts) may be combined into a single article, as well as the data-centric use cases of the societal challenges that are the focus of BDE. Whether the Working Group is formed is now in the hands of the W3C Membership but, if approved, it should start work in February (March at the latest), beginning with a review of the extensive work already done by the W3C ODRL Community Group.
The Big Data Europe project is entering its second year. Diving deep into the world of Big Data has taught us one major lesson: things are moving fast! Nowadays, there are a lot of Big Data tools available, ranging from data storage through computation to cluster monitoring. Some Big Data technologies have been on the rise over the past year, others have received less attention. Within that year, we have mapped out a platform on which you can deploy Big Data pipelines, while still leaving room for future technologies. And we’re doing everything we can to make it as simple as possible to get started.
Big Data platforms tend to run on multiple machines. Managing these machines can be a chore. Getting technologies running on them may be hard too. You may need more intimate knowledge about the cluster than you’d like to have – and we’ve taken this into account.
The base layer of the BDE platform is a set of machines, regardless of where they are physically located. We abstract over them, currently using Mesos. Given the constraints of these machines, work can automatically be sent to the most suitable nodes. Setting up such a platform may be complex, so we try to cater for your needs as a developer by supplying a Virtual Machine containing a micro-version of the platform. For cluster administrators, we’re offering Chef recipes.
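The idea of sending work to the most suitable node can be illustrated with a small Python sketch. It is a toy best-fit placement under stated assumptions, not how Mesos works internally (Mesos offers resources to frameworks, which then decide what to run where). The `Node`, `Task` and `place` names, and the choice of CPU and memory as the only constraints, are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    free_cpus: float
    free_mem_mb: int


@dataclass
class Task:
    name: str
    cpus: float
    mem_mb: int


def place(task: Task, nodes: List[Node]) -> Optional[Node]:
    """Best-fit placement: choose the node that can hold the task
    and would be left with the least spare capacity afterwards."""
    candidates = [
        n for n in nodes
        if n.free_cpus >= task.cpus and n.free_mem_mb >= task.mem_mb
    ]
    if not candidates:
        return None  # no machine satisfies the constraints
    best = min(
        candidates,
        key=lambda n: (n.free_cpus - task.cpus, n.free_mem_mb - task.mem_mb),
    )
    # Reserve the resources on the chosen node.
    best.free_cpus -= task.cpus
    best.free_mem_mb -= task.mem_mb
    return best


if __name__ == "__main__":
    cluster = [Node("a", 4.0, 8192), Node("b", 2.0, 4096)]
    chosen = place(Task("indexer", 2.0, 2048), cluster)
    print(chosen.name if chosen else "no fit")
```

The point of the abstraction is that a developer only states what a task needs; which physical machine ends up running it is the platform's concern.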
The work you want to execute is packaged in Docker images. These are like tiny Virtual Machines for Linux, without the large overhead. Our approach ensures all your code is bundled, and that it doesn’t interfere with other software that’s been installed on the system. We’re going a step further and are trying to make these Docker images as easy to construct as possible. For this, we are building base images for various technologies. You extend the image, plug in the code that you need for your specific algorithm, and let the Docker ecosystem manage the packaging and deployment of your new component.
So how do we decide what to deploy? The Docker images are clearly built at a finer granularity than the whole system. One such component could be the implementation of a database, another could be an algorithm to detect man-made changes in a set of images. We consider a system that groups all the components needed to tackle a practical use-case to be a pipeline. The pipeline therefore consists of a set of Docker images and a description of how their instances need to connect to each other. The net result is that we can deploy a network of pre-developed components to solve a Big Data problem.
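The notion of a pipeline as "a set of Docker images plus their connections" can be sketched as a small data structure in Python. This is only an illustration under assumptions: the `Component` and `Pipeline` classes, the `depends_on` field and the image names are hypothetical and do not reflect BDE's actual pipeline format. Connections are modelled as dependencies, and a topological sort (from the standard library's `graphlib`, Python 3.9+) yields an order in which the components could be started.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Dict, List


@dataclass
class Component:
    name: str
    image: str  # Docker image reference (hypothetical names below)
    depends_on: List[str] = field(default_factory=list)


@dataclass
class Pipeline:
    name: str
    components: Dict[str, Component] = field(default_factory=dict)

    def add(self, component: Component) -> None:
        self.components[component.name] = component

    def start_order(self) -> List[str]:
        """Return component names so that every component comes
        after the components it connects to."""
        for c in self.components.values():
            for dep in c.depends_on:
                if dep not in self.components:
                    raise ValueError(
                        f"{c.name} depends on unknown component {dep}"
                    )
        graph = {c.name: set(c.depends_on) for c in self.components.values()}
        # static_order() raises graphlib.CycleError on circular dependencies.
        return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    pipeline = Pipeline("change-detection")
    pipeline.add(Component("store", "example/database"))
    pipeline.add(
        Component("detector", "example/change-detector", depends_on=["store"])
    )
    print(pipeline.start_order())
```

Keeping the description declarative like this means the same pipeline definition can be handed to whatever deploys the containers, without the components needing to know about each other's internals.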
A low barrier to entry, a broad application space, and low maintenance costs. That’s the BDE platform in sales terms. The foundations of this platform are laid. We are continuing to work on easing the deployment of pipelines on the infrastructure and on clear guidelines for how to build specific components. The basics are set, so let’s now build the next level!