Setting up the Distributed Analytics Environment


APPENDIX A

This appendix is a step-by-step guide to setting up a single machine for stand-alone distributed analytics experimentation and development, using the Hadoop ecosystem and its associated tools and libraries. In a production distributed environment, of course, a cluster of server resources might be available to support the Hadoop ecosystem, and databases, data sources and sinks, and messaging software might be spread across the hardware installation, especially those components that expose a RESTful interface and can be accessed through URLs. Please see the references listed at the end of this appendix for a thorough explanation of how to configure Hadoop, Spark, Flink, and Phoenix, and be sure to refer to the appropriate online documentation for current information about these support components.

Most of the instructions given here are hardware agnostic, although they are especially suited to a MacOS environment. A last note about running Hadoop-based programs in a Windows environment: while this is possible and is sometimes discussed in the literature and online documentation, most components are recommended to run in a Linux or MacOS environment.

Overall Installation Plan

The example system contains a large number of software components built around a Java-centric Maven project; most of these are represented in the dependencies found in your Maven pom.xml files (a short sketch of such entries follows Table A-1). Many other components, however, rely on other infrastructure, languages, and libraries. How you install these other components, and even whether you use them at all, is somewhat optional, and your platform may vary. As mentioned before, throughout this book we have stayed close to a MacOS-only installation. There are several reasons for this: a Mac platform is, in the author's opinion, one of the easiest environments in which to build stand-alone Hadoop prototypes, and the components used throughout the book have gone through multiple versions and debugging phases and are extremely solid.

Let's review the components present in the example system, shown in Table A-1, before we discuss the overall installation plan.

Table A-1. Components of the example system

Number | Component | Discussed in Chapter | URL | Description
1 | Apache Hadoop | all | hadoop.apache.org | map/reduce distributed framework
2 | Apache Spark | all | spark.apache.org | distributed streaming framework
3 | Apache Flink | 1 | flink.apache.org | distributed stream and batch framework
4 | Apache Kafka | 6, 9 | kafka.apache.org | distributed messaging framework
5 | Apache Samza | 9 | samza.apache.org | distributed stream processing framework
6 | Apache Gora | | gora.apache.org | in-memory data model and persistence
7 | Neo4j | 4 | neo4j.org | graph database
8 | Apache Giraph | 4 | giraph.apache.org | graph processing framework
9 | JBoss Drools | 8 | www.drools.org | rule framework
10 | Apache Oozie | | oozie.apache.org | scheduling component for Hadoop jobs
11 | Spring Framework | all | https://projects.spring.io/spring-framework/ | Inversion of Control (IoC) framework and glueware
12 | Spring Data | all | http://projects.spring.io/spring-data/ | Spring data processing, including Hadoop
13 | Spring Integration | | https://projects.spring.io/spring-integration/ | support for enterprise integration pattern-oriented programming
14 | Spring XD | | http://projects.spring.io/spring-xd/ | "extreme data": integration with other Spring components
15 | Spring Batch | | http://projects.spring.io/spring-batch/ | reusable batch function library
16 | Apache Cassandra | | cassandra.apache.org | NoSQL database
17 | Apache Lucene/Solr | 6 | lucene.apache.org, lucene.apache.org/solr | open source search engine
18 | Solandra | 6 | https://github.com/tjake/Solandra | Solr + Cassandra interfacing
19 | OpenIMAJ | 17 | openimaj.org | image processing with Hadoop
20 | Splunk | 9 | splunk.com | Java-centric logging framework
21 | ImageTerrier | 17 | www.imageterrier.org | image-oriented search framework with Hadoop
22 | Apache Camel | | camel.apache.org | general-purpose glueware in Java; implements enterprise integration pattern (EIP) support
23 | Deeplearning4j | 12 | deeplearning4j.org | deep learning toolkit for Java, Hadoop, and Spark
24 | OpenCV / BoofCV | | opencv.org, boofcv.org | low-level image processing operations
25 | Apache Phoenix | | phoenix.apache.org | OLTP and operational analytics for Hadoop
26 | Apache Beam | | beam.incubator.apache.org | unified model for creating data pipelines
27 | NGDATA Lily | 6 | https://github.com/NGDATA/lilyproject | Solr and Hadoop integration
28 | Apache Katta | 6 | http://katta.sourceforge.net | distributed Lucene with Hadoop
29 | Apache Geode | | http://geode.apache.org | distributed in-memory database
30 | Apache Mahout | 12 | mahout.apache.org | machine learning library with support for Hadoop and Spark
31 | BlinkDB | | http://blinkdb.org | massively parallel, approximate query engine for running interactive SQL queries on large volumes of data
32 | OpenTSDB | | http://opentsdb.net | time series-oriented database; runs on Hadoop and HBase
33 | University of Virginia HIPI framework | 17 | http://hipi.cs.virginia.edu/gettingstarted.html | image processing interface with Hadoop
34 | Distributed R and Weka | | https://github.com/vertica/DistributedR | statistical analysis support libraries
35 | Java Advanced Imaging (JAI) | 17 | http://www.oracle.com/technetwork/java/download-1-0-2-140451.html | low-level image processing package
36 | Apache Kudu | | kudu.apache.org | fast analytics processing library for the Hadoop ecosystem
37 | Apache Tika | | tika.apache.org | content-analysis toolkit
38 | Apache Apex | | apex.apache.org | unified stream and batch processing framework
39 | Apache Malhar | | https://github.com/apache/apex-malhar | operator and codec library for use with Apache Apex
40 | MySQL | 4 | | relational database
42 | Maven, Brew, Gradle, Gulp | all | maven.apache.org | build, compile, and version control infrastructure components
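Most of the Java-centric entries in Table A-1 enter the example system as Maven dependencies rather than stand-alone installations. The fragment below is a minimal sketch of what such pom.xml entries look like for Hadoop and Spark; the group and artifact IDs are the standard ones published to Maven Central, but the version numbers shown are only placeholders and should be matched to the releases you actually install.

<!-- Sketch of typical pom.xml dependency entries for the example system.
     Version numbers are placeholders; align them with your installed releases. -->
<dependencies>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.7.3</version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.0.2</version>
  </dependency>
</dependencies>

Running mvn clean install after adding such entries pulls the corresponding libraries into your local repository, so most of the table's Java components require no further setup beyond the pom.xml declaration.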
Once the initial basic components, such as Java, Maven, and your favorite IDE, are installed, the other components may be added gradually as you configure and test the system, as discussed in the following sections.

Set Up the Infrastructure Components

If you develop code actively, you may already have some or all of these components in your development environment, particularly Java, Eclipse (or another IDE such as NetBeans or IntelliJ), the Ant and Maven build utilities, and some other infrastructure components. The basic infrastructure components used in the example system are listed below for your reference.

Basic Example System Setup

Set up a basic development environment. We assume that you are starting with an empty machine. You will need Java, the Eclipse IDE, and Maven; these provide programming language support, an interactive development environment (IDE), and a software build and configuration tool, respectively.
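On MacOS, Homebrew (listed with the build infrastructure in Table A-1) offers a command-line shortcut for some of these installations. The following is a sketch only, assuming Homebrew itself is already installed; the formula names are the standard ones, but the packaged versions may differ from the manual downloads described in the steps that follow.

# Optional MacOS shortcut: install build tooling with Homebrew.
# (Assumes Homebrew is already installed; brew-packaged versions may
# differ from those shown in the book's figures.)
brew install maven          # Apache Maven build tool
brew install hadoop         # single-node Apache Hadoop distribution

# Confirm what ended up on the PATH
mvn --version
hadoop version

If you prefer to track the exact versions used in this book, skip the Homebrew route and follow the manual downloads below.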
First, download the appropriate Java version for development from the Oracle web site: http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html. The current version of Java to use is Java 8. Use java -version to validate that the Java version is correct; you should see something similar to Figure A-1.

Figure A-1. First step: validate that Java is in place and has the correct version

Next, download the Eclipse IDE from the Eclipse web site. Please note that we used the "Mars" version of the IDE for the development described in this book: http://www.eclipse.org/downloads/packages/eclipse-ide-java-ee-developers/marsr

Finally, download the compressed Maven distribution from the Maven web site, https://maven.apache.org/download.cgi. Validate a correct Maven installation with

mvn --version

On the command line, you should see a result similar to the terminal output in Figure A-2.

Figure A-2. Successful Maven version check

Make sure you can log in without a passkey:

ssh localhost

If not, execute the following commands:

ssh-keygen -t rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys

Many online documents, as well as several of the standard Hadoop references, give complete instructions for using ssh with Hadoop appropriately.

Apache Hadoop Setup

Apache Hadoop, along with Apache Spark and Apache Flink, is a key component of the example system infrastructure. In this appendix we discuss a simple installation process for each of these components. The most painless way to install them is to follow one of the many excellent how-to books and online tutorials on setting up these basic components, such as Venner (2009). Appropriate Maven dependencies for Hadoop must be added to your pom.xml file, and to configure the Hadoop system there are also several parameter files to alter.

Add the appropriate properties to core-site.xml:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

and add these properties to hdfs-site.xml:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <!-- Adjust these local paths to match your own installation -->
  <property>
    <name>dfs.name.dir</name>
    <value>file:///home/hadoop/hadoopinfra/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>file:///home/hadoop/hadoopinfra/hdfs/datanode</value>
  </property>
</configuration>
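With both files in place, the usual next step on a single-node setup is to format the namenode and bring up HDFS. The following is a minimal sketch, assuming the Hadoop bin and sbin directories are on your PATH and that the directory paths match the configuration shown above.

# Initialize the namenode storage directory (first run only)
hdfs namenode -format

# Start the HDFS daemons
start-dfs.sh

# Verify the daemons are running; expect NameNode, DataNode,
# and SecondaryNameNode in the listing
jps

# Create a home directory for your user in HDFS
hdfs dfs -mkdir -p /user/$(whoami)

If jps does not show the expected processes, check the daemon logs under the Hadoop logs directory before proceeding to the Spark and Flink setup.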