Supported Components

The Big Data Europe Integrator Platform (BDI) comes with a huge variety of software that you can instantiate and use in a pipeline within minutes. Take a look at the list below – you’ll probably find exactly what you’re looking for, but if you don’t, the BDI’s Docker-based architecture means you can add any component that runs in Java. Talk to us to find out more.

Data Processing & Computational Frameworks

Apache Flink

Open source platform for distributed stream and batch data processing, providing data distribution, communication and fault tolerance for distributed computations over data streams.
Github README 
bde2020/flink-master

Apache Spark

In-memory data processing engine, providing APIs in Java, Python and Scala, with the objective to simplify the programming complexity by introducing the abstraction of Resilient Distributed Datasets (RDDs).
Github README
bde2020/spark-master

 SANSA

SANSA is a collection of open source algorithms for distributed processing of large-scale RDF knowledge graphs. It provides an RDF library for data ingestion, an OWL library for RDF/OWL operations, a Query library supporting SPARQL, an Inference library for rule-based reasoning on RDF/OWL data, and a Machine Learning library for RDF analytics.
Github SANSA-Stack
SANSA Notebooks

 


Data storage

Hadoop

Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs.
Github README
bde2020/hadoop-base

Hue HDFS File Browser

WebHDFS file browser with Web graphical user interface.
Github README
bde2020/hdfs-filebrowser

OpenLink Virtuoso

Database management system for data modelled in RDF, backed by an RDBMS. It is available in open-source and commercial editions. RDF data can be queried using SPARQL. In addition to RDF, Virtuoso also provides the functionality of a traditional RDBMS.
Github README
tenforce/virtuoso-v7.2.0-latest
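
As a rough illustration of querying RDF data with SPARQL over HTTP, the sketch below builds a GET request URL in the style of the SPARQL 1.1 Protocol, where the query text is passed in the `query` parameter. The endpoint address `http://localhost:8890/sparql` and the helper name `sparql_get_url` are assumptions made for this example; adjust the address to wherever your Virtuoso container is exposed.

```python
from urllib.parse import urlencode

# Assumed endpoint: a locally running Virtuoso instance on its usual HTTP port.
ENDPOINT = "http://localhost:8890/sparql"


def sparql_get_url(endpoint, query):
    """Return a URL that sends `query` to a SPARQL endpoint via HTTP GET.

    The query string is percent-encoded and passed as the `query` parameter.
    """
    return f"{endpoint}?{urlencode({'query': query})}"


# List a handful of triples from the store.
url = sparql_get_url(ENDPOINT, "SELECT * WHERE { ?s ?p ?o } LIMIT 5")
print(url)
```

The resulting URL can be opened with any HTTP client; the response format depends on the `Accept` header sent with the request.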

4Store

 

4store is a database storage and query engine that holds RDF data. It supports both single node and cluster deployment. 4store is available under the GNU General Public Licence, version 3.

Github README
bde2020/4store/

Apache Cassandra

Apache Cassandra is a free and open-source distributed database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. Cassandra offers robust support for clusters spanning multiple datacenters, with asynchronous masterless replication allowing low latency operations for all clients.
Github README
bde2020/cassandra

Apache Hive

Data warehouse software facilitating reading, writing and managing large datasets residing in distributed storage using SQL.
Github README
bde2020/hive

 

Data acquisition

Apache Flume 

Distributed data acquisition framework used to collect, move or redistribute large amounts of data. It is based on pipelines that consist of a source, a channel and a sink, set up with simple key-value configuration files that are read either from the filesystem or from an Apache ZooKeeper node.
Github README
 bde2020/flume
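
To make the source, channel and sink wiring more concrete, here is a minimal sketch that embeds a small Flume-style key-value configuration and reads the pipeline back out of it. The agent name `a1`, the component names `r1`, `c1` and `k1`, and the helper functions are illustrative assumptions, not part of the BDI images; the parser handles only plain `key = value` lines.

```python
# Illustrative agent configuration: one netcat source feeding a memory
# channel that is drained by a logger sink. All names are assumptions.
FLUME_PROPERTIES = """
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1

a1.channels.c1.type = memory

a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
"""


def parse_properties(text):
    """Parse simple `key = value` lines, skipping blanks and # comments."""
    properties = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        properties[key.strip()] = value.strip()
    return properties


def describe_pipeline(properties, agent):
    """Return (source, channel, sink) triples that share a channel."""
    triples = []
    for source in properties[f"{agent}.sources"].split():
        source_channels = properties[f"{agent}.sources.{source}.channels"].split()
        for sink in properties[f"{agent}.sinks"].split():
            sink_channel = properties[f"{agent}.sinks.{sink}.channel"]
            if sink_channel in source_channels:
                triples.append((source, sink_channel, sink))
    return triples


print(describe_pipeline(parse_properties(FLUME_PROPERTIES), "a1"))
```

Note that a source may list several channels (`channels`), while a sink reads from exactly one (`channel`), which is why the two keys differ.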

 

Message passing

Apache Kafka

Distributed publish-subscribe messaging system that scales without downtime. Messages are persisted on disk in a distributed transaction log to prevent data loss and are categorized into topics. Consumers can subscribe to a topic as a consumer group, in which every message is processed once by one member of the group, or as distinct consumers, in which case every message is consumed by every consumer, so that messages (e.g. data entries) can be distributed among several databases.
Github README
bde2020/kafka
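
The difference between subscribing as a consumer group and as distinct consumers can be sketched with a small in-memory model. This is not a Kafka client: the `ToyTopic` class and its methods are hypothetical, and the per-message round-robin within a group is a simplification (Kafka itself assigns partitions to group members).

```python
from collections import defaultdict


class ToyTopic:
    """In-memory stand-in for a topic, for illustration only."""

    def __init__(self):
        # group name -> list of member inboxes (each inbox is a list)
        self._groups = defaultdict(list)
        # group name -> index of the member that receives the next message
        self._next_member = {}

    def subscribe(self, group):
        """Register a new consumer in `group` and return its inbox."""
        inbox = []
        self._groups[group].append(inbox)
        return inbox

    def publish(self, message):
        """Deliver the message to exactly one member of every group."""
        for group, members in self._groups.items():
            index = self._next_member.get(group, 0) % len(members)
            members[index].append(message)
            self._next_member[group] = index + 1


topic = ToyTopic()
indexer_a = topic.subscribe("indexers")  # two members share one group
indexer_b = topic.subscribe("indexers")
archive = topic.subscribe("archive")     # a group of one sees every message

for n in range(4):
    topic.publish(f"entry-{n}")

print(indexer_a)  # half of the messages
print(indexer_b)  # the other half
print(archive)    # all four messages
```

Consumers that each use their own group name behave like the distinct consumers described above, since every group receives its own copy of each message.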

 

Semantic components

FOX

 

Federated knOwledge eXtraction Framework integrating the Linked Data Cloud and using the diversity of NLP algorithms to extract RDF triples of high accuracy out of NL, while integrating and merging the results of Named Entity Recognition tools.
Github README
bde2020/fox

GeoTriples

 

Semi-automated tool transforming geospatial data into RDF graphs with the use of state-of-the-art vocabularies like GeoSPARQL and stSPARQL, without being tightly coupled to a specific vocabulary, and publishing them as Linked Open Geospatial Data, by extending the R2RML mapping language to the specificities of geospatial data.
Github README
/geotriples-ws

Silk

Extensions for Link Discovery

Open source framework, based on the Linked Data paradigm, for integrating heterogeneous data sources. It generates links between related data items in different Linked Data sources, sets RDF links from one data source to another on the Web, and applies data transformations to structured data sources. These tasks are carried out via the declarative Silk Link Specification Language (Silk-LSL), the RDF path language, the SPARQL protocol for local and remote SPARQL endpoints, and the graphical user interface of the Silk Workbench.
Github README
bde2020/silk-workbench

SEMAGROW engine

Algorithmically sophisticated and well-engineered Query Federation engine, combining and cross-indexing public data, regardless of their size, update rate, and schema, while offering a single SPARQL endpoint and allowing full flexibility in terms of metadata details.
Github README
semagrow/semagrow

Sextant

Web application for visualizing, exploring and interacting with time-evolving linked geospatial data. Its user-friendly interface allows domain experts and non-experts to use semantic web technologies and to create thematic maps by combining geospatial and temporal information from heterogeneous data sources. These sources range from standard SPARQL endpoints, to SPARQL endpoints following the GeoSPARQL standard defined by the Open Geospatial Consortium (OGC), to well-adopted geospatial file formats such as KML, GML and GeoTIFF.
Github README
bde2020/sextant

Strabon

 

Semantic spatiotemporal RDF store, used to store linked geospatial data that changes over time and to pose queries using two popular extensions of SPARQL. It supports the serialization of geometric objects in the OGC standards WKT and GML, and offers spatial and temporal selections and joins, a rich set of spatial functions similar to those offered by geospatial relational database systems, and support for multiple Coordinate Reference Systems.
Github README
bde2020/strabon

UnifiedViews

Open source Extract-Transform-Load (ETL) framework for RDF data, allowing users to define, execute, monitor, debug, schedule and share data processing tasks.
Github README
tenforce/unified-views

Ontario

Query processor for Data Lakes that allows heterogeneous data (e.g., CSV, JSON, RDF) to be queried in its original format. Ontario is a realization of the so-called Semantic Data Lake, where Semantic Web techniques such as SPARQL and RDF mapping languages are used under the hood to build the “virtual” data integration process.
Github README
/Ontario

 

 

Watch out for the newest component additions.

 

Contact

For any questions related to the BDI platform, please contact us at platform@big-data-europe.eu.