Supported Components

The Big Data Europe Integrator Platform (BDI) comes with a wide variety of software that you can instantiate and use in a pipeline within minutes. Take a look at the list below – you’ll probably find exactly what you’re looking for, but if you don’t, the BDI’s Docker-based architecture means you can add any component that can be packaged as a Docker container. Talk to us to find out more.

Data processing & computational frameworks

Apache Flink

Open source platform for distributed stream and batch data processing, providing data distribution, communication and fault tolerance for distributed computations over data streams.

    Github README 

  bde2020/flink-master
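
A minimal sketch of the programming model in PyFlink, Flink's Python API (available in recent Flink releases; how jobs are submitted to the bde2020 images depends on your deployment):

    # Streaming word count over an in-memory collection.
    from pyflink.common.typeinfo import Types
    from pyflink.datastream import StreamExecutionEnvironment

    env = StreamExecutionEnvironment.get_execution_environment()
    lines = env.from_collection(["stream and batch", "batch and stream"])
    counts = (
        lines
        .flat_map(lambda line: [(w, 1) for w in line.split()],
                  output_type=Types.TUPLE([Types.STRING(), Types.INT()]))
        .key_by(lambda pair: pair[0])              # group by word
        .reduce(lambda a, b: (a[0], a[1] + b[1]))  # sum counts per word
    )
    counts.print()
    env.execute("word_count")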


 

Apache Spark

In-memory data processing engine providing APIs in Java, Python and Scala, designed to reduce programming complexity by introducing the abstraction of Resilient Distributed Datasets (RDDs).

    Github README

  bde2020/spark-master
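
A minimal PySpark sketch of the RDD abstraction; the master URL assumes the bde2020/spark-master container is reachable under the hostname spark-master (adjust to your deployment):

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("rdd-demo").setMaster("spark://spark-master:7077")
    sc = SparkContext(conf=conf)

    rdd = sc.parallelize(range(100))  # distribute a local dataset
    even_squares = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
    print(even_squares.sum())         # actions trigger the computation
    sc.stop()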

Data storage

 

HDFS

  Hadoop Distributed File System: the primary storage layer of the Hadoop ecosystem, storing very large files as replicated blocks across clusters of commodity machines and providing high-throughput streaming access to them.

    Github README

  bde2020/hadoop-base 
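
A minimal sketch using the third-party hdfs Python package over WebHDFS; the namenode hostname and port (50070 on Hadoop 2.x, 9870 on 3.x) are assumptions about your deployment:

    from hdfs import InsecureClient

    client = InsecureClient("http://namenode:50070", user="root")
    client.write("/demo/hello.txt", data=b"hello hdfs", overwrite=True)
    print(client.list("/demo"))  # -> ['hello.txt']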


Hue (HDFS FileBrowser)

  Web interface for the Hadoop ecosystem whose FileBrowser application lets users browse, upload and manage the files stored in HDFS directly from the browser.

    Github README

bde2020/hdfs-filebrowser                                             


 

OpenLink Virtuoso

  Database management system, available in open-source and commercial editions, for RDF data, which can be queried using SPARQL while providing the functionality of a traditional RDBMS.

    Github README

  tenforce/virtuoso:1.0.0-virtuoso7.2.2
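
A minimal sketch querying the store with the SPARQLWrapper package; 8890 is Virtuoso's default HTTP port, the hostname is an assumption:

    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("http://localhost:8890/sparql")
    sparql.setQuery("SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5")
    sparql.setReturnFormat(JSON)
    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["s"]["value"], row["p"]["value"], row["o"]["value"])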


 

4Store

  Open source RDF database that stores data as RDF triples and answers SPARQL queries, designed from the start to run on clusters for scalability and resilience.

    Github README

bde2020/4store


Apache Cassandra

Free and open-source distributed database management system designed to handle large amounts of data across many commodity servers. It provides high availability with no single point of failure, robust support for clusters spanning multiple datacenters, and asynchronous masterless replication that allows low-latency operations for all clients.

    Github README

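A minimal sketch with the DataStax cassandra-driver package; the contact-point hostname is an assumption:

    from cassandra.cluster import Cluster

    cluster = Cluster(["cassandra"])
    session = cluster.connect()
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS demo WITH replication =
        {'class': 'SimpleStrategy', 'replication_factor': 1}
    """)
    session.execute("CREATE TABLE IF NOT EXISTS demo.kv (k text PRIMARY KEY, v text)")
    session.execute("INSERT INTO demo.kv (k, v) VALUES (%s, %s)", ("greeting", "hello"))
    print(session.execute("SELECT v FROM demo.kv WHERE k = %s", ("greeting",)).one().v)
    cluster.shutdown()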


Apache Hive

Data warehouse software facilitating reading, writing and managing large datasets residing in distributed storage using SQL.

    Github README

bde2020/hive
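
A minimal sketch with the PyHive package against HiveServer2 (default port 10000); the hostname and user are assumptions:

    from pyhive import hive

    conn = hive.Connection(host="hive-server", port=10000, username="hive")
    cur = conn.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS pokes (foo INT, bar STRING)")
    cur.execute("SHOW TABLES")
    print(cur.fetchall())
    conn.close()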


Data acquisition


 

Apache Flume 

Distributed data acquisition framework used to collect, move or redistribute large amounts of data, based on pipelines that consist of a source, a channel and a sink, set up with simple key-value configuration files read either from the filesystem or from an Apache ZooKeeper node.

    Github README

 bde2020/docker-flume
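
To make the key-value configuration concrete, here is a minimal single-agent pipeline (netcat source, memory channel, logger sink, all stock Flume components), written out from Python purely for illustration:

    # Write an agent definition and point flume-ng at it:
    #   flume-ng agent -n a1 -f demo-agent.properties
    agent_conf = """
    a1.sources  = r1
    a1.channels = c1
    a1.sinks    = k1

    a1.sources.r1.type     = netcat
    a1.sources.r1.bind     = 0.0.0.0
    a1.sources.r1.port     = 44444
    a1.sources.r1.channels = c1

    a1.channels.c1.type = memory

    a1.sinks.k1.type    = logger
    a1.sinks.k1.channel = c1
    """

    with open("demo-agent.properties", "w") as f:
        f.write(agent_conf)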


Message passing


Apache Kafka

Distributed publish-subscribe messaging system, scalable without downtime, with messages persisted on disk in a distributed transaction log to prevent data loss and categorized into topics. Consumers can subscribe as a consumer group, where every message is processed by exactly one member of the group, or as distinct consumers, where every message is delivered to each consumer individually, which makes it easy to distribute messages (e.g. data entries) among several databases.

    Github README

  bde2020/docker-kafka
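
A minimal sketch with the kafka-python package; the broker address is an assumption. Consumers sharing a group_id split a topic's messages between them, while consumers with distinct group_ids each receive every message:

    from kafka import KafkaProducer, KafkaConsumer

    producer = KafkaProducer(bootstrap_servers="kafka:9092")
    producer.send("demo-topic", b"hello kafka")
    producer.flush()

    consumer = KafkaConsumer("demo-topic",
                             bootstrap_servers="kafka:9092",
                             group_id="demo-group",
                             auto_offset_reset="earliest")
    for record in consumer:
        print(record.value)  # b'hello kafka'
        break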


Semantic components

FOX

  Federated knOwledge eXtraction Framework that integrates the Linked Data Cloud and exploits the diversity of NLP algorithms to extract RDF triples of high accuracy from natural-language text, integrating and merging the results of named entity recognition tools.

    Github README

  Link


GeoTriples

  Semi-automated tool that transforms geospatial data into RDF graphs using state-of-the-art vocabularies like GeoSPARQL and stSPARQL without being tightly coupled to any specific vocabulary, and publishes them as Linked Open Geospatial Data by extending the R2RML mapping language to the specificities of geospatial data.

    Github README

  bde2020/geotriples      


 

Silk (Extensions for Link Discovery)

  Open source framework, based on the Linked Data paradigm, for integrating heterogeneous data sources. It generates links between related data items within different Linked Data sources and sets RDF links from one data source to another on the Web, applying data transformations to structured sources via the declarative Silk Link Specification Language (Silk-LSL), the RDF path language, the SPARQL protocol for local and remote endpoints, and the graphical user interface of the Silk Workbench.

    Github README

  bde2020/silk-workbench


SEMAGROW engine

Algorithmically sophisticated and well-engineered query federation engine that combines and cross-indexes public data regardless of size, update rate and schema, offering a single SPARQL endpoint and full flexibility in the level of metadata detail.

    Github README

 semagrow/semagrow                                  
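
From the client's perspective the federation is transparent: SEMAGROW is queried like any single SPARQL endpoint. A minimal sketch with the SPARQLWrapper package, the endpoint URL being an assumption about your deployment:

    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("http://semagrow:8080/sparql")
    sparql.setQuery("SELECT (COUNT(*) AS ?triples) WHERE { ?s ?p ?o }")
    sparql.setReturnFormat(JSON)
    result = sparql.query().convert()
    print(result["results"]["bindings"][0]["triples"]["value"])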


 

Sextant

  Web application for visualizing, exploring and interacting with time-evolving linked geospatial data. Its user-friendly interface allows domain experts and non-experts alike to use semantic web technologies and to create thematic maps by combining geospatial and temporal information from heterogeneous sources, ranging from standard SPARQL endpoints and endpoints following the OGC GeoSPARQL standard to well-adopted geospatial file formats like KML, GML and GeoTIFF.

    Github README

  bde2020/sextant


 

Strabon

  Semantic spatiotemporal RDF store for linked geospatial data that changes over time, queried using two popular extensions of SPARQL. It supports the serialization of geometric objects in the OGC standards WKT and GML and offers spatial and temporal selections and joins, a rich set of spatial functions similar to those offered by geospatial relational database systems, and support for multiple Coordinate Reference Systems.

    Github README

 bde2020/strabon
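
A sketch of a spatial selection in GeoSPARQL syntax, sent through the SPARQLWrapper package; the endpoint path is an assumption about your Strabon deployment and the polygon is arbitrary demo data:

    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("http://localhost:8080/strabon/Query")
    sparql.setQuery("""
        PREFIX geo:  <http://www.opengis.net/ont/geosparql#>
        PREFIX geof: <http://www.opengis.net/def/function/geosparql/>
        SELECT ?feature ?wkt WHERE {
          ?feature geo:hasGeometry ?geom .
          ?geom geo:asWKT ?wkt .
          FILTER(geof:sfWithin(?wkt,
            "POLYGON((23 37, 24 37, 24 38, 23 38, 23 37))"^^geo:wktLiteral))
        }
    """)
    sparql.setReturnFormat(JSON)
    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["feature"]["value"], row["wkt"]["value"])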


 

UnifiedViews

  Open source ETL framework that lets users define, execute, monitor, debug, schedule and share RDF data processing tasks organized as pipelines.

    Github README

  tenforce/unified-views


Hadoop ecosystem


Apache Hive/HCatalog

  Apache Hive's data warehouse functionality together with HCatalog, a table and storage management layer for Hadoop that gives users of different data processing tools (e.g. Pig, MapReduce) a shared, relational view of data in distributed storage.

    Github README

  bde2020/hive
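
HCatalog's metadata can also be reached over WebHCat's REST API (default port 50111); the hostname and user name below are assumptions:

    import requests

    resp = requests.get(
        "http://hive-server:50111/templeton/v1/ddl/database",
        params={"user.name": "hive"},
    )
    print(resp.json())  # e.g. {"databases": ["default", ...]}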


Contact

For any questions related to the BDI platform, please contact us at platform@big-data-europe.com