Modern Federated Database Systems: an Overview
Total Page:16
File Type:pdf, Size:1020Kb
Modern Federated Database Systems: An Overview Leonardo Guerreiro Azevedo, Elton Figueiredo de Souza Soares, Renan Souza and Marcio Ferreira Moreno IBM Research, Brazil Keywords: Federated Database, Polyglot Database, Multistore, Polystore, Multidatabase, Heterogeneous Data Stores, NoSQL, Dbaas, Distributed File System, Data Processing Frameworks. Abstract: Usually, modern applications manipulate datasets with diverse models, usages, and storages. “One size fits all” approaches are not sufficient for heterogeneous data, storages, and schemes. The rise of new kinds of data stores and processing, like NoSQL data stores, distributed file systems, and new data processing frameworks, brought new possibilities to meet this scenario’s requirements. However, semantic, schema and storage het- erogeneity, autonomy, and distributed processing are still among the main concerns when building data-driven applications. This work surveys the literature aiming at giving an overview of the state of the art of modern federated database systems. It presents the background, characterizes existing tools, depicts guidelines one should follow when creating solutions, and points out research challenges to consider in future work. This work gives fundamentals for researchers and practitioners in the area. 1 INTRODUCTION ment System) has been evolved to manage different kinds of data (e.g., multimedia objects, XML docu- Several modern applications manipulate diverse ments, spatial data), like IBM DB23 which was built datasets with different models and usages, e.g., med- on a standard SQL engine, but it has evolved to be ical informatics, intelligent transportation, etc. “One a hybrid data management system for structured and size fits all” is not effective in such scenarios. The unstructured data. Usually, using one single DBMS use of a single database and a unique data model for results in loss of performance and flexibility for spe- all data in different data models may degrade perfor- cific applications. For instance, a column-oriented mance and executing ETL (Extract-Transform-Load) DBMS is one order of magnitude better for On- processes to load all data in a single database may be line Analytical Processing (OLAP) workloads than an very expensive (Stonebraker et al., 2007). Besides, RDBMS (Ozsu¨ and Valduriez, 2020), while SDBMS manual data curation and maintenance of the ETL (Stream Database Management System) is more effi- pipelines (due to adaptations caused by, e.g., domain cient for stream data, which RDBMS does not even evolution) are labor-intensive (Tan et al., 2017) (Bon- support (Nayak et al., 2013). Thus, a variety of data- diombouy and Valduriez, 2016) (Stonebraker, 2015). processing architectures may be required for special- The problem of accessing heterogeneous data ized markets (Stonebraker et al., 2007). sources has been studied in the context of multi- Schema, semantic, and data sources heterogene- database and data integration systems (Kolev et al., ity, autonomy, and distributed processing are still 2016a). Several new data management solutions have concerns (Tan et al., 2017). A federated system emerged, such as distributed file systems (e.g., GFS1 arises as a solution. It is a middleware that provides and HDFS2), NoSQL data stores (e.g., MongoDB, a seamless interface to heterogeneous data systems Allegrograph, Neo4J, Titan, Dynamo, BigTable, Re- with an independent data model and (perhaps) data dis) and new data processing frameworks (e.g., schemes (Stonebraker, 2015). Spark) as well as hybrid (multimodal, e.g., OrientDB, This work overviews the state-of-the-art of the ArangoDB, or NewSQL, e.g., Google F1, LeanX- new generation federation systems. cale). The RDBMS (Relational Database Manage- It is divided as follows. Section 2 presents the main concepts. Section 3 characterizes existing tools. 1Google File System. 2Hadoop Distributed File System. 3https://www.ibm.com/analytics/db2 276 Azevedo, L., Soares, E., Souza, R. and Moreno, M. Modern Federated Database Systems: An Overview. DOI: 10.5220/0009795402760283 In Proceedings of the 22nd International Conference on Enterprise Information Systems (ICEIS 2020) - Volume 1, pages 276-283 ISBN: 978-989-758-423-7 Copyright c 2020 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved Modern Federated Database Systems: An Overview Section 4 presents guidelines and research challenges. value, and a timestamp (used for versioning). Exam- Finally, Section 5 concludes. ples are Google Bigtable, Apache HBase, Cassandra, and Accumulo. Document Systems: are advanced key-value systems 2 MAIN CONCEPTS where values are of the document type, such as JSON, YAML, or XML. It stores records in collections (sim- ilar to tables). Records in a collection may have The shift in the federation database has arisen due different schemes. Besides simple key-value oper- to the storage requirements of modern applications, ations, document stores offer an API or query lan- which resulted in the development of several distinct guage. Examples are MongoDB, CouchDB, Couch- storage technologies to meet specific needs. Now, it is base, RavenDB, and Elasticsearch. a requirement for these technologies to work together. Graph Database Systems: manipulate data as graphs. This section overviews the state-of-the-art. Their use has grown to manage data with inherent graph-like nature, e.g., Web, geographical systems, 2.1 Storage Solutions transportation, telephones, social and biological net- works. The graph database model represents schema There are three main layers of storage: distributed and instances as a (labeled)(directed) graph or gener- storage; database management; and, distributed pro- alization of the graph structure (e.g., hypergraphs or cessing (Bondiombouy and Valduriez, 2016). hypernodes) and graph integrity constraints (Angles Distributed Storages. Include files and objects stor- and Gutierrez, 2008). Data are represented as nodes ages. File storage works on unstructured data (i.e., and edges (which connect two nodes). E.g., horse sequences of bytes), organizing them as fixed-length and apple nodes and a likes edge to represent horse or variable-length records. The system organizes likes apple. Nodes and edges may have properties, files hierarchically, and stores file metadata (e.g., file e.g., name and birthday properties for horse and color name, owner, access permission) separate from con- for apple. Often, these systems provide query lan- tent. For shared-nothing, examples are GFS, HDFS, guages that allow for graph traversals and other typi- and GlusterFS; for shared disk, an example is Global cal graph operations, like breadth and depth search. File System 2 (GFS2). Object storage stores data Examples are Neo4J, Infinite Graph, Titan, Graph- as an object which has a unique identifier (oid), Base, Trinity, and Sparksee. properties and metadata. Examples are Lustre and Triplestores: or RDF stores, are the matter of choice XtreemFS. Also, Ceph and Ozone are systems that for storing and querying semantic datasets (Hasl- combine block and object storage. hofer et al., 2011)(Iancu and Georgescu, 2018), which NoSQL (Not Only SQL) Systems. Emphasize scal- are often described using RDF (Resource Description ability, fault-tolerance, and availability, sometimes at Framework), a standard model for data interchange. the expense of consistency. The main categories are In RDF, datasets are represented as triples (subject, key-value, wide column, document, and graph, as predicate, object). That is a value (object) of a prop- well as hybrid (multimodel or NewSQL) (Ozsu¨ and erty (predicate) of a resource (subject) (Zulkefli et al., Valduriez, 2020). SolidIT4 presents a comparison of 2013). E.g., (LeonardoDaVinci, hasCreated, systems. TheMonalisa). Each part is represented as a Uniform Key-value Systems: store data as key-value pairs Resource Identifier (URI). RDFS (RDF Schema) and where the key identifies the record and the value OWL (Web Ontology Language) are RDF serializ- is a schemaless data. Their typical operations able vocabularies, commonly used to represent on- are put(key, value), get(key) and delete(key). tologies, which define classes and attributes of URIs Examples are Redis, Dynamo, Memcached, and Riak. and their relationships (Iancu and Georgescu, 2018). Extended Key-Value systems store records (a set of Moreover, triplestores are capable of processing a key-value pairs) in collections (or domains), e.g., the large amount of RDF data (Modoni et al., 2014), han- domain Customers where each customer has Cus- dling semantic queries and using inference for uncov- tomer Id, first name, last name, etc. Examples are ering new information out of the existing relations. Amazon SimpleDB and Oracle NoSQL Database. Examples are AllegroGraph, GraphDB, MarkLogic, Wide Column Systems: store data as a table but allow- Mulgara, Profium Sense, Blazegraph, Virtuoso, Mar- ing nested values in a schemaless way where a column motta, Stardog, Apache Jena, RDF4 (former Sesame), may have column values. Each column has a name, a Oracle Database 12c (Iancu and Georgescu, 2018). Hybrid Data Stores: combines capabilities typically 4https://db-engines.com/en/ranking found in different data stores and DBMS. They 277 ICEIS 2020 - 22nd International Conference on Enterprise Information Systems may be multimodel NoSQL systems, which com- – Systems that adopt ontologies and apply bines multiple data models (examples are OrientDB, semantic approaches (schema-mapping and ArangoDB, and Microsoft Azure Cosmos DB), and entity-resolution