Modern Federated Systems: An Overview

Leonardo Guerreiro Azevedo, Elton Figueiredo de Souza Soares, Renan Souza and Marcio Ferreira Moreno IBM Research, Brazil

Keywords: Federated Database, Polyglot Database, Multistore, Polystore, Multidatabase, Heterogeneous Data Stores, NoSQL, Dbaas, Distributed File System, Data Processing Frameworks.

Abstract: Usually, modern applications manipulate datasets with diverse models, usages, and storages. “One size fits all” approaches are not sufficient for heterogeneous data, storages, and schemes. The rise of new kinds of data stores and processing, like NoSQL data stores, distributed file systems, and new data processing frameworks, brought new possibilities to meet this scenario’s requirements. However, semantic, schema and storage het- erogeneity, autonomy, and distributed processing are still among the main concerns when building data-driven applications. This work surveys the literature aiming at giving an overview of the state of the art of modern federated database systems. It presents the background, characterizes existing tools, depicts guidelines one should follow when creating solutions, and points out research challenges to consider in future work. This work gives fundamentals for researchers and practitioners in the area.

1 INTRODUCTION ment System) has been evolved to manage different kinds of data (e.g., multimedia objects, XML docu- Several modern applications manipulate diverse ments, spatial data), like IBM DB23 which was built datasets with different models and usages, e.g., med- on a standard SQL engine, but it has evolved to be ical informatics, intelligent transportation, etc. “One a hybrid data management system for structured and size fits all” is not effective in such scenarios. The unstructured data. Usually, using one single DBMS use of a single database and a unique data model for results in loss of performance and flexibility for spe- all data in different data models may degrade perfor- cific applications. For instance, a -oriented mance and executing ETL (Extract-Transform-Load) DBMS is one order of magnitude better for On- processes to load all data in a single database may be line Analytical Processing (OLAP) workloads than an very expensive (Stonebraker et al., 2007). Besides, RDBMS (Ozsu¨ and Valduriez, 2020), while SDBMS manual data curation and maintenance of the ETL (Stream Database Management System) is more effi- pipelines (due to adaptations caused by, e.g., domain cient for stream data, which RDBMS does not even evolution) are labor-intensive (Tan et al., 2017) (Bon- support (Nayak et al., 2013). Thus, a variety of data- diombouy and Valduriez, 2016) (Stonebraker, 2015). processing architectures may be required for special- The problem of accessing heterogeneous data ized markets (Stonebraker et al., 2007). sources has been studied in the context of multi- Schema, semantic, and data sources heterogene- database and data integration systems (Kolev et al., ity, autonomy, and distributed processing are still 2016a). Several new data management solutions have concerns (Tan et al., 2017). A federated system emerged, such as distributed file systems (e.g., GFS1 arises as a solution. It is a middleware that provides and HDFS2), NoSQL data stores (e.g., MongoDB, a seamless interface to heterogeneous data systems Allegrograph, Neo4J, Titan, Dynamo, BigTable, Re- with an independent data model and (perhaps) data dis) and new data processing frameworks (e.g., schemes (Stonebraker, 2015). Spark) as well as hybrid (multimodal, e.g., OrientDB, This work overviews the state-of-the-art of the ArangoDB, or NewSQL, e.g., Google F1, LeanX- new generation federation systems. cale). The RDBMS (Relational Database Manage- It is divided as follows. Section 2 presents the main concepts. Section 3 characterizes existing tools. 1Google File System. 2Hadoop Distributed File System. 3https://www.ibm.com/analytics/db2

276

Azevedo, L., Soares, E., Souza, R. and Moreno, M. Modern Federated Database Systems: An Overview. DOI: 10.5220/0009795402760283 In Proceedings of the 22nd International Conference on Enterprise Information Systems (ICEIS 2020) - Volume 1, pages 276-283 ISBN: 978-989-758-423-7 Copyright c 2020 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved

Modern Federated Database Systems: An Overview

Section 4 presents guidelines and research challenges. value, and a timestamp (used for versioning). Exam- Finally, Section 5 concludes. ples are Google Bigtable, Apache HBase, Cassandra, and Accumulo. Document Systems: are advanced key-value systems 2 MAIN CONCEPTS where values are of the document type, such as JSON, YAML, or XML. It stores records in collections (sim- ilar to tables). Records in a collection may have The shift in the federation database has arisen due different schemes. Besides simple key-value oper- to the storage requirements of modern applications, ations, document stores offer an API or query lan- which resulted in the development of several distinct guage. Examples are MongoDB, CouchDB, Couch- storage technologies to meet specific needs. Now, it is base, RavenDB, and Elasticsearch. a requirement for these technologies to work together. Systems: manipulate data as graphs. This section overviews the state-of-the-art. Their use has grown to manage data with inherent graph-like nature, e.g., Web, geographical systems, 2.1 Storage Solutions transportation, telephones, social and biological net- works. The graph represents schema There are three main layers of storage: distributed and instances as a (labeled)(directed) graph or gener- storage; database management; and, distributed pro- alization of the graph structure (e.g., hypergraphs or cessing (Bondiombouy and Valduriez, 2016). hypernodes) and graph integrity constraints (Angles Distributed Storages. Include files and objects stor- and Gutierrez, 2008). Data are represented as nodes ages. File storage works on unstructured data (i.e., and edges (which connect two nodes). E.g., horse sequences of bytes), organizing them as fixed-length and apple nodes and a likes edge to represent horse or variable-length records. The system organizes likes apple. Nodes and edges may have properties, files hierarchically, and stores file metadata (e.g., file e.g., name and birthday properties for horse and color name, owner, access permission) separate from con- for apple. Often, these systems provide query lan- tent. For shared-nothing, examples are GFS, HDFS, guages that allow for graph traversals and other typi- and GlusterFS; for shared disk, an example is Global cal graph operations, like breadth and depth search. File System 2 (GFS2). Object storage stores data Examples are Neo4J, Infinite Graph, Titan, Graph- as an object which has a unique identifier (oid), Base, Trinity, and Sparksee. properties and metadata. Examples are Lustre and : or RDF stores, are the matter of choice XtreemFS. Also, Ceph and Ozone are systems that for storing and querying semantic datasets (Hasl- combine block and object storage. hofer et al., 2011)(Iancu and Georgescu, 2018), which NoSQL (Not Only SQL) Systems. Emphasize scal- are often described using RDF (Resource Description ability, fault-tolerance, and availability, sometimes at Framework), a standard model for data interchange. the expense of consistency. The main categories are In RDF, datasets are represented as triples (subject, key-value, wide column, document, and graph, as predicate, object). That is a value (object) of a prop- well as hybrid (multimodel or NewSQL) (Ozsu¨ and erty (predicate) of a resource (subject) (Zulkefli et al., Valduriez, 2020). SolidIT4 presents a comparison of 2013). E.g., (LeonardoDaVinci, hasCreated, systems. TheMonalisa). Each part is represented as a Uniform Key-value Systems: store data as key-value pairs Resource Identifier (URI). RDFS (RDF Schema) and where the key identifies the record and the value OWL (Web Ontology Language) are RDF serializ- is a schemaless data. Their typical operations able vocabularies, commonly used to represent on- are put(key, value), get(key) and delete(key). tologies, which define classes and attributes of URIs Examples are Redis, Dynamo, Memcached, and Riak. and their relationships (Iancu and Georgescu, 2018). Extended Key-Value systems store records (a set of Moreover, triplestores are capable of processing a key-value pairs) in collections (or domains), e.g., the large amount of RDF data (Modoni et al., 2014), han- domain Customers where each customer has Cus- dling semantic queries and using inference for uncov- tomer Id, first name, last name, etc. Examples are ering new information out of the existing relations. Amazon SimpleDB and Oracle NoSQL Database. Examples are AllegroGraph, GraphDB, MarkLogic, Wide Column Systems: store data as a but allow- Mulgara, Profium Sense, Blazegraph, Virtuoso, Mar- ing nested values in a schemaless way where a column motta, Stardog, Apache Jena, RDF4 (former Sesame), may have column values. Each column has a name, a Oracle Database 12c (Iancu and Georgescu, 2018). Hybrid Data Stores: combines capabilities typically 4https://db-engines.com/en/ranking found in different data stores and DBMS. They

277 ICEIS 2020 - 22nd International Conference on Enterprise Information Systems

may be multimodel NoSQL systems, which com- – Systems that adopt ontologies and apply bines multiple data models (examples are OrientDB, semantic approaches (schema-mapping and ArangoDB, and Microsoft Azure Cosmos DB), and entity-resolution techniques) to mediate rela- NewSQL DBMSs, which combines the of tional and non-relational data sources, such as NoSQL with the strong consistency and usability of TATOOINE and OPTIQUE. relational DBMS (examples are Google F1, LeanX- • Polystore System. Heterogeneous data stores cale, Apache Ignite, among others). Hybrid Trans- and multiple query interfaces, categorized as: action and Analytics Processing (HTAP) is a class of New SQL aiming at performing OLAP and OLTP in – Systems focused on query answering, such as: the same data allowing real-time analysis and avoid- BigDAWG, Myria and Apache Drill. ing ETL processing. – Systems that concentrate on multi-platform Data Processing Frameworks. Handle a high vol- data-flow scheduling and analytics, such as ume of data in real-time (Zheng et al., 2015). They QoX, Musketeer, and Rheem. focus on data analysis to increase understanding, pat- – Systems focused on data ingestion and deriva- tern discovery, and gain insights. They handle data tion with heterogeneous data stores, such as in batches, in a continuous stream or both ways (Gu- AWESOME. rusamy et al., 2017). Typically, those systems support operators that are automatically parallelized (Bon- Bondiombouy and Valduriez’s classification is based diombouy and Valduriez, 2016). Examples are (Gu- on the data coupling of the systems (Bondiombouy rusamy et al., 2017): (i) Batch-only: Hadoop MapRe- and Valduriez, 2016). They base their work in cloud duce; (ii) Stream-only: Apache Storm and Apache data stores and call them as multistore. Multistore is a Samza; (iii) Hybrid: Apache Spark and Apache Flink. system that provides integrated access to several data stores, such as NoSQL, RDBMS, or HDFS, some- 2.2 Taxonomies times through a data processing framework. Multi- stores can be classified as: There are two main taxonomies to classify federated • Loosely Coupled System. Autonomous local data systems. data stores accessed by a common language or Tan et al. proposed a taxonomy of four categories by their local language. Examples: BigIntegra- considering heterogeneity in data stores and query in- tor, Forward, and Qox. terfaces (Tan et al., 2017): • Tightly Coupled System. Local data stores ac- • Federated Database System. Homogeneous cessed by the multistore system using a single lan- data stores and single standard query inter- guage for querying structured and unstructured face. They feature mediator-wrapper architecture data. They aim at efficient querying for (big) data and employ schema-mapping and entity-merging analytics and/or self-tuning. Examples: Polybase, techniques for data integration. Semantics hetero- HadoopDB, Estocada, Odyssey, and JEN. geneity is a challenge. Example: Multibase. • Hybrid Systems. Some data stores are loosely • Polyglot System. Homogeneous data stores and coupled, while others are tightly coupled. Exam- multiple query interfaces. Different query inter- ples: Spark SQL, CloudMdsQL, and BigDAWG. faces provide semantics, which significantly sim- plifies query formulation. Example: Spark SQL 2.3 Frameworks for Multidatabase allows access to data in relational and procedural modes. Federation Characterization • Multistore System. Heterogeneous data stores et al. and single query interface, categorized as: Besides the taxonomies, Tan and Bondiom- bouy and Valduriez proposed frameworks for feder- – Systems that integrate distributed file systems ated data systems characterization. with RDBMs, such as HadoopDB, Polybase, Tan et al. proposed a framework inspired and JEN. on (Sheth and Larson, 1990) and composed by – Systems that integrate NoSQL systems with five dimensions: (i) Heterogeneity; (ii) Autonomy; RDBMSs, such as BigIntegrator, Forward, and (iii) Transparency; (iv) Flexibility; (v) Optimality. D4M. Heterogeneity: in data-integration systems, implies – Systems focused on optimizing data placement the design intent is threefold: (i) Uniform access and across data stores for query performance, such management of stores’ data; (ii) Advantage of com- as ESTOCADA, Odyssey, and MISO. ponent processing engines; (iii) Minimal loss of ex-

278 Modern Federated Database Systems: An Overview

pressiveness of the underlying query interfaces. Het- 2. Data-placement Optimization: place the data in erogeneity may be classified in: the best fitting engine using, e.g., rule-based and 1. Data-store Heterogeneity: different modeling cost-based methods. techniques, e.g., column store, key-value stores. Tan et al. analyze polystore systems whose de- 2. Processing-engine Heterogeneity: different pro- sign and implementation emphasize query-processing cessing capabilities, e.g., processing engines mod- and query-answering challenges, such as BigDAWG, eled around relations, arrays, and graphs. CloudMdsQL, Myria, and Apache Drill. The framework proposed by Bondiombouy and 3. Query-interface Heterogeneity: different data Valduriez to compare multistore systems is based on models in query engines, supporting various for- two dimensions: (i) Functionality: concerns the di- mal algebras, towards expressiveness in a seman- mensions objective, data model, , and tic context, e.g., a system with relational and array data stores that are supported; (ii) Implementation query interface allows expressing simple linear- Techniques: concern the dimensions special modules, algebra operations on relational data. schema management, query processing, and query Autonomy: relates to regulations and constraints: optimization (Bondiombouy and Valduriez, 2016). 1. Association Autonomy: local data stores decide Related to Functionality, they point out: when to associate and dissociate itself from the • The major Objective of multistore is the ability to federation. integrate relational data (stored in RDBMS) with 2. Execution Autonomy: local data stores may sup- other kinds of data in different data stores; port federation and native applications. They de- • Each multistore supports different kinds of Data cide prioritization when required. Stores (e.g., RDBMS, NoSQL, BigTable, HDFS, 3. Evolution Autonomy: local stores’ Array DBMS, DSMS). evolve independently from the federation, which • In terms of the Data Model, most systems pro- has to adapt to the changes. vide a relational abstraction. BigIntegrator, Poly- Transparency: concerns to integration details of base, and HadoopDB have relational data mod- storage and layout: els. Forward and CloudMdsQL are JSON-based. 1. Location Transparency: local stores provide QoX has a more general graph abstraction to cap- mechanisms to hide location details. ture analytic data flows. SparkSQL has a nested 2. Transformation Transparency: local stores pro- model. BigDAWG and Estocada have no unique vide mechanisms to hide details of data types and model since they allow access data stores with structures. Users can focus on logical-level trans- their native (or island) languages. formations, and the transparent system infers and • Considering Query Language, most systems pro- map data types, and adjust data structures. vide a SQL-like language, like BigIntegrator, Flexibility: concerns the capacity to manage arbitrary HadoopDB (HiveQL), SparkSQL, CloudMdsQL formats and to support flexible workflows and user- (with native subqueries). Polybase uses SQL. defined functions. QoX is XML-Based. Estocada and BigDAWG 1. Schema Flexibility: allows user-defined schemata use native query languages. and dynamic schema discovery, making it possi- Related to Implementation Techniques: ble to automate transformations. • Special Modules: refine the generic architecture 2. Interface Flexibility: the query interface allows or bring new functionalities. Examples for the user-defined functions and extensions instead of first are importer, absorber and finalizer, query being around a fixed algebra. processor, query planner. Examples for the sec- 3. Architectural Flexibility: provide a modularized ond are dataflow engine, HDFS bridge, storage architecture allowing customization to different advisor. scenarios, making possible extending query inter- • For Schema Management, most multistore man- face, query optimizer, backend engines, etc. age a Global-as-View (GAV) or a Local-as- Optimality: concerns opportunities for optimization (LAV) 5 global schema approach, which indicates through improvements in data placement and feder- how the elements of the global schema can be de- ated generation. rived, when needed, from the elements of the data 1. Federated Plan Optimization: federated query 5In LAV, each data source schema is treated as a view defi- plans for sub-queries or data transformation and nition of the global schema. In GAV, the global schema is migration to achieve better performance. defined as a set of views over the data source schemes.

279 ICEIS 2020 - 22nd International Conference on Enterprise Information Systems

source schemes (Lenzerini, 2002). E.g., QoX, Es- – Heterogeneity: handled by wrappers’ (shims). tocada, SparkSQL, and CloudMdsQL do not sup- – Main Components: port global schemes, although they provide mech- ∗ Query-planning module (planner/optimizer); anisms to deal with the data stores’ local schemes. ∗ Performance-monitoring module (monitor); • The Query Processing techniques usually are ex- ∗ Data-migration module (migrator): moves tensions of known techniques from distributed data across database engines when needed; database systems, e.g., data/function shipping, ∗ Query-execution module (executor). query decomposition (based on the data stores’ – Demonstration: BigDAWG was demonstrated capabilities, bind join, select pushdown). using MIMIC II (or “Multiparameter Intelli- • The Query Optimizations are usually supported gent Monitoring in Intensive Care II”) use case, by a (simple) cost model or heuristics. which includes the relational, array, stream, and key-value databases. • CloudMdsQL (Tan et al., 2017; Kolev et al., 2016a; 3 TOOLS Kolev et al., 2016b). – Definition: CloudMdsQL (Cloud Multidata This section characterizes existing polystore tools. store Query Language) is a functional SQL- like language, designed for querying multiple • BigDAWG ( Analytics Working heterogeneous databases within a single query Group) (Elmore et al., 2015; Tan et al., 2017). containing nested subqueries. The CloudMd- – Description: Polystore system for large-scale sQL technology is now at LeanXcale, in a pro- analytics, real-time streaming support, smaller prietary product. Only the compiler remains analytics at interactive speeds, data visual- available for research. ization, and query processing over multiple – Owner/License: LeanXcale8 – proprietary. databases. Each storage engine may have a dif- – Goal: develop a functional SQL-like language ferent data model. for heterogeneous databases within a single – Owner/License: Intel Science and Technology query containing nested subqueries. 6 7 Center for Big Data (ISTC) – BSD-3 . – Highlight: CloudMdsQL exploits local data – Goal: data integration with federated archi- stores by allowing part of the query to be tecture over collections of vertically integrated expressed and processed using local native database engines. queries (e.g., a breadth-first search in a graph – Internal Data Representation and Platform database) which can be called as functions, and for Data Operations: it employs no internal at the same time be optimized using a cost data model or intermediate algebra for query model like pushing down select predicates, us- translation and data transformation. ing binding join, performing join order, or plan- – Context Segregation: BigDAWG separates ning intermediate data shipping. data in islands where each island has a data – Internal Data Representation and Platform model, logical structure, query language or al- for Data Operations: it uses a table-based gebra, and one or more backend engines for common data model that supports other data data storage and query execution, e.g., a rela- types, like arrays and JSON objects to handle tional island may be composed by PostgreSQL non-flat and nested data with basic operators and MySQL DBMSs. over them. – Queries Specification: in the scope of the is- – Context Segregation: by database engines. land, e.g., a SQL query to a relational island. – Query Specification: SQL based on embed- – Query Execution: queries are decomposed in ded functional subqueries written in the native subqueries executed by the database engines query languages of the database engines. The connected to the island. The queries over more language also addresses distributed processing than one island are expressed using a SCOPE frameworks (e.g., Apache Spark), allowing us- operator. A CAST operator is used to change age of user-defined map/filter/reduce operator the semantic context in a cross-island query. as subqueries. – Query Execution: a subquery is defined as a 6https://bigdawg.mit.edu named table expression where the user defines 7https://github.com/bigdawg-istc/bigdawg/blob/master/ license.txt 8https://www.leanxcale.com

280 Modern Federated Database Systems: An Overview

the columns and types of the table and expres- – Context Segregation: in the level of the en- sion in SQL SELECT (which the compiler can capsulated data stores. analyze and possibly rewrite) or a native ex- – Query Specification: in MyriaL, an pression (which is directly delegated to the cor- imperative-declarative hybrid language, or responding data store). E.g., two named ta- via a Python API. It supports user-defined ble expressions may query a relational database function (UDP) and user-defined aggregates and a document store, and be joined to produce (UDA) via an exposed Python API. the result. The query compiler decomposes the – Query Execution: the query into a query execution plan (QEP) in a Compiler (RACO) parses queries in Myria al- directed acyclic graph of relational operators. gebra and transforms them into the specific API Leaf nodes correspond to subqueries to be exe- calls or the query primitives supported by the cuted by the wrappers over the data stores. local database engines. RACO uses rule-based – Heterogeneity: is handled through a medi- optimization to generate federated query plans ator/wrapper architecture and the table-based that take advantage of the performance charac- common model. teristics of the supported database engines. – Main Components: – Main Components: ∗ Query Planner: performs lexical and syntax ∗ MariaX: query-execution engine that uses a analysis besides query rewrite plans; parallel, pipelined, possibly cyclic graph of ∗ Capability Manager: validates rewritten sub- dataflow operators with built-in support for queries against datastore capabilities; asynchronous evaluation of recursive queries. ∗ Query Optimizer: uses cost functions and ∗ RACO: query optimizer and federated query database statistics in a cost model to select executor that uses relational algebra extended the best plan. Besides, users may define cost with imperative constructs capturing the se- and selectivity functions. The optimizer may mantics of array, graph, and key-value data rewrite the QEP generated by the query com- models. piler. CloudMdsQL uses bind join to perform • Apache Drill (Tan et al., 2017; Hausenblas and semi-joins across heterogeneous data stores. Nadeau, 2013). ∗ QEP Builder: generates plans and serializes – Definition: distributed, massively parallel them in JSON. query engine. ∗ Query Execution Controller: parses QEP, identifies sub-plans, and invokes wrappers. – Owner/License: the Apache Software Founda- tion12 – Apache License13. ∗ Finalizer: translates subquery plans into na- tive queries that can be executed by the en- – Goal: answer fast to ad-hoc queries over a gine. huge amount of unstructured and weakly struc- ∗ X-Ray (Guimaraes˜ and Pereira, 2015): Moni- tured data spread across servers. tors execution. . – Internal Data Representation and Plat- form for Data Operations: JSON-based data 9 • Myria (Tan et al., 2017; Halperin et al., 2014). model. – Definition: cloud service for big data manage- – Context Segregation: in the level of the en- ment and analytics. capsulated data stores. – Goal: simplify data upload and data science – Query Specification: supports ANSI SQL and tasks with efficient query execution to process MongoDB QL, and user-defined functions. and explore the data. – Query Execution: queries are parsed and – Owner/License: the University of Washing- transformed into a logical plan, which is trans- ton10 – BSD 311. formed and optimized into a physical plan. – Internal Data Representation and Platform – Main Components: for Data Operations: relational data represen- ∗ Drillbit: daemon service running cluster tation with a relational-algebra compiler. nodes aiming at maximizing data locality. ∗ Zookeeper: broker between the clients and 9 https://myria.cs.washington.edu/ Drillbits, and among Drillbits. 10http://www.washington.edu/ 11https://github.com/uwescience/myria/blob/master/ 12https://drill.apache.org/ LICENSE 13https://drill.apache.org/apacheASF/

281 ICEIS 2020 - 22nd International Conference on Enterprise Information Systems

4 REQUIREMENTS AND • Cross-system Solution (Elmore et al., 2015): RESEARCH CHALLENGES able to include other polystores. The research challenges are (Stonebraker, 2015)(Bon- The modern federated database system main require- diombouy and Valduriez, 2016)(Elmore et al., 2015): ments are: • Query Language: easy to use query language • Location and Data Sources Encapsulation (El- with efficient processing over diverse stores. more et al., 2015): Provide a smooth interface SQL-like language facilitates integration with ex- to free programmers from having to learn several isting tools but comprises efficiency. Alternatives query languages and store engines. are to access stores directly or to use a functional • The Deployment: should support cloud prac- query language that allows native subqueries as tices (Halperin et al., 2014) complexity and cost functions within the query language. of installation, administration, and maintenance • Complex Analytics: efficient combination of should not be prohibitive; mechanisms for pre- data from multiple stores with linear algebra al- dicting and debugging performance, as well as gorithms. Analytics tasks have been moved from controlling costs, should be available. relational analytics (e.g., COUNT, SUM, AVG • Query Language (Kolev et al., 2016b): should with group by) to predictive models (e.g., ma- work on heterogeneous data stores; support arbi- chine learning, regression, statics). The majority trary chain of queries, i.e., a query result in one of such predictive models are based on linear alge- database be input for a query in another database; bra algorithms, such as regression analysis, singu- be schema independent, i.e., allow the integration lar value decomposition, eigenanalysis, k-means of databases with or without schema; allow data- clustering, etc. Although linear algebra pack- metadata transformation, e.g., convert attributes ages are optimized by software and hardware, or relations into data and vice-versa. they have different characteristics when compared to data systems, e.g., size of computation tiles, • Query Tools: should support users and algo- choice of networks, compression techniques. So, rithm designers with an easy-to-use set of inter- algebra packages and data stores should be decou- faces, languages, and APIs that scale from sim- pled, and data should be converted back and forth ple SPJ (Select-Project-Join) queries to advanced between them in a non-expensive manner. application-specific ones (Halperin et al., 2014). • : optimization should han- • Processing (Halperin et al., 2014): should be effi- dle data-flow and multi-platform scheduling able cient and support: scale queries and optimization to update the cost model or add new heuristics combining state-of-the-art and novel techniques, when data stores join or disjoin the system. e.g., use of bind joins, semi-joins, core parallel query processing concepts. • Semantic Mapping and Record Linkage: auto- • Real-time Decision Support (Elmore et al., matic translation of utterances to the local dialect 2015): through stream processing able to connect of a storage system and integration of the results. historical and stream data. • Automatic Copy: efficient copy of data between • Visualization Tools (Elmore et al., 2015): sup- stores and federated system, considering data porting disparate data models and new user in- transformation and memory access techniques. teraction mechanisms. E.g., big data application • Distributed Transactions: handle transactions visualization should support questions like “give over distributed, heterogeneous store systems with me something interesting from data” in an ex- diverse local transaction models. The problem is ploratory way. harder when considering, e.g., NoSQL data stores 14 • Shuffle Data among Backends (Elmore et al., that do not provide ACID transaction support. 2015): i.e., supporting moving data and interme- • Automatic Load Balancing and Provisioning: diate results from one storage to another as needed automatically balance data load among data to fit a user’s query and high performance. E.g., stores by replications or moving data, which re- each engine may know how to read binary data quires a monitoring feature. in parallel directly from another engine (Elmore • Novel User Interfaces: innovative data mining, et al., 2015). Using a monitoring system that visualization, and browsing user interfaces. Typi- learns types of queries, and move data accord- cal user interaction flows may not fit modern fed- ingly, i.e., it transfers the data to the engine that has the best data model to answer the query. 14Atomicity, Consistency, Isolation, and Durability.

282 Modern Federated Database Systems: An Overview

eration. E.g., as new data sources may join or dis- Haslhofer, B., Momeni Roochi, E., Schandl, B., and Zan- join the federation, new data may rise or disap- der, S. (2011). Europeana rdf store report. Technical pear, which brings new knowledge. report, University of Vienna. Hausenblas, M. and Nadeau, J. (2013). Apache drill: in- • Benchmarks: should be developed to evalu- teractive ad-hoc analysis at scale. Big data, 1(2):100– ate federations addressing store combinations and 104. multiple query language processing, like Poly- Iancu, B. and Georgescu, T. M. (2018). Saving Large Se- Bench (Karimov et al., 2018). mantic Data in Cloud: A Survey of the Main DBaaS Solutions. Informatica Economica, 22(1). Karimov, J., Rabl, T., and Markl, V. (2018). Polybench: The first benchmark for polystores. In Technology Confer- 5 CONCLUSION ence on Performance Evaluation and Benchmarking, pages 24–41. Springer. Modern applications require the manipulation of Kolev, B., Bondiombouy, C., Valduriez, P., Jimenez-Peris,´ structured and unstructured data, usually in high vol- R., Pau, R., and Pereira, J. (2016a). The CloudMdsQL ume, over distributed and heterogeneous data sources. Multistore System. In Proc. of Intl. Conf. on Manage- ment of Data (SIGMOD’16) The pattern “one size fits all” does not hold any- , pages 2113–2116. ACM. more. Thus, innovative solutions capable of access- Kolev, B., Valduriez, P., Bondiombouy, C., Jimenez-Peris, R., Pau, R., and Pereira, J. (2016b). CloudMd- ing and manipulating data in such an environment are sQL: Querying Heterogeneous Cloud Data Stores required. with a Common Language. Distributed and parallel This work presented the state-of-the-art, detailed databases, 34(4):463–503. solutions, their main components, how queries are Lenzerini, M. (2002). Data integration: A theoretical per- specified and executed, and other features. Afterward, spective. In Proceedings of the Twenty-First ACM we presented guidelines and challenges the solutions SIGMOD-SIGACT-SIGART Symposium on Principles should address. Researchers and practitioners can use of Database Systems, pages 233–246. ACM. our finds to focus their work. Modoni, G. E., Sacco, M., and Terkaj, W. (2014). A survey of rdf store solutions. In Intl. Conf. on Engineering, As future work, we aim at evaluating the tools in Technology and Innovation (ICE), pages 1–7. IEEE. practice through case studies and experiments to iden- Nayak, A., Poriya, A., and Poojary, D. (2013). Type of tify the level they meet the challenges and to bring NOSQL databases and its comparison with relational new open issues. We also intend to perform a system- databases. International Journal of Applied Informa- atic review towards a broader analysis improving the tion Systems, 5(4):16–19. overview presented in this work. Ozsu,¨ M. T. and Valduriez, P. (2020). Principles of dis- tributed database systems. Springer, 4th edition. Sheth, A. P. and Larson, J. A. (1990). Federated database systems for managing distributed, heterogeneous, and REFERENCES autonomous databases. ACM Computing Surveys (CSUR), 22(3):183–236. Angles, R. and Gutierrez, C. (2008). Survey of graph Stonebraker, M. (2015). The case for polystore. https://wp. database models. ACM Comp. Surveys, 40(1):1–39. sigmod.org/?p=1629. Bondiombouy, C. and Valduriez, P. (2016). Query process- Stonebraker, M., Bear, C., C¸etintemel, U., Cherniack, M., ing in multistore systems: an overview. Research Re- Ge, T., Hachem, N., Harizopoulos, S., Lifter, J., port RR-8890, INRIA. Rogers, J., and Zdonik, S. (2007). One size fits all? Elmore, A., Duggan, J., Stonebraker, M., Balazinska, M., part 2: Benchmarking results. In Proc. CIDR. Cetintemel, U., Gadepally, V., Heer, J., Howe, B., Tan, R., Chirkova, R., Gadepally, V., and Mattson, T. G. Kepner, J., Kraska, T., et al. (2015). A demonstra- (2017). Enabling query processing across heteroge- tion of the bigdawg polystore system. Proceedings of neous data models: A survey. In IEEE Intl. Conf. on the VLDB Endowment, 8(12):1908–1911. Big Data (Big Data), pages 3211–3220. IEEE. Guimaraes,˜ P. and Pereira, J. (2015). X-ray: Monitoring Zheng, Z., Wang, P., Liu, J., and Sun, S. (2015). Real-Time and analysis of queries. In IFIP Big Data Processing Framework: Challenges and So- International Conference on Distributed Applications lutions. Applied Math. & Inf. Sciences, 9(6):3169. and Interoperable Systems, pages 80–93. Springer. Zulkefli, N. S. S., Rahman, N. A., Bakar, Z. A., Nordin, S., Gurusamy, V., Kannan, S., and Nandhini, K. (2017). The Sembok, T. M. T., and Teo, N. H. I. (2013). Evalua- Real Time Big Data Processing Framework: Advan- tion of triple indices in retrieving web documents. In tages and Limitations. Intl. Journal of Computer Sci- Intl. Conf. on Advanced Computer Science Applica- ences and Engineering (JCSE), 5(12):305–312. tions and Technologies (ACSAT), pages 525–529. Halperin, D., Teixeira de Almeida, V., Choo, L. L., and et al. (2014). Demonstration of the myria big data manage- ment service. In Proceedings of the 2014 ACM SIG- MOD Intl. Conf. on Mngt. of Data, pages 881–884.

283