
International Journal of Geo-Information

Article

Evaluation of Replication Mechanisms on Selected Systems

Tomáš Pohanka * and Vilém Pechanec

Department of Geoinformatics, Faculty of Science, Palacky University, 17. listopadu 50, 77146 Olomouc, Czech Republic; [email protected] * Correspondence: [email protected]

 Received: 16 March 2020; Accepted: 16 April 2020; Published: 17 April 2020 

Abstract: This paper compares database replication of spatial data in PostgreSQL and MySQL. Database replication addresses various problems that arise from overloading a single database server with write and read queries. There are many replication mechanisms, each handling data differently. Criteria for an objective comparison were set for testing and for determining the bottleneck of the replication process. The tests were done over real national vector spatial datasets, namely ArcCR500, Data200, Natural Earth and the Estimated Pedologic-Ecological Unit. HWMonitor Pro was used to monitor the PostgreSQL database, network and system load. Monyog was used to monitor MySQL activity (data and SQL queries) in real time. Both database servers ran on computers with Microsoft Windows. The results of the tests of both replication mechanisms led to a better understanding of these mechanisms and allow informed decisions for future deployment. Graphs and tables include the statistical data and describe the replication mechanisms in specific situations. PostgreSQL with the Slony extension, using asynchronous replication, synchronized a batch of changes with a high transfer speed and high server load. MySQL, with synchronous replication, synchronized every changed record with a low impact on server performance and network bandwidth.

Keywords: database; replication; spatial data; MySQL; PostgreSQL

1. Introduction

Replication is a process of copying and maintaining database objects [1]. The replicated objects are monitored for changes and are synchronized when a change is made [1]. Database replication makes a copy of the data accessible from many servers instead of one central server, or enables many servers to behave like one (parallel query processing) [1]. Database replication is not about joining various data storages into one, which is known as a data warehouse [2,3]. Database replication creates copies of the whole database or just of specific tables on a different database server, and it creates a connection between them for secure one-way or two-way synchronization [4,5]. There are typical scenarios where database replications are used, e.g., securing lower latency for long-distance communication, separating raw and user data storage, and securing high availability and performance. Replication mechanisms are commonly used for web applications (e.g., sensor web and WebGIS), or as georeplication for lower intercontinental latency [6,7]. Replication can be improperly considered a mechanism for creating data backups. Nevertheless, this is not the primary function of database replication. The essence of database replication is to create a redundant, interconnected distributed environment. Keeping all data in one database server is not recommended for many reasons: (a) keeping raw and publicly available data separate; (b) splitting processed data into specific parts for specific users and (c) keeping databases in a local place to gain low query latency [1,4].

ISPRS Int. J. Geo-Inf. 2020, 9, 249; doi:10.3390/ijgi9040249 www.mdpi.com/journal/ijgi

Database replication is a means of increasing performance by the scale-out technique [8,9]. An increasing number of servers can help distribute peaks and heavy loads across many servers, which can then respond to users. The replication technique can also be used to decrease the latency of user requests [4,10,11]. The base unit of replication is a node, which represents one database server. A node can act as a master or as a slave server. There are no standardized names for "master" and "slave" nodes, and every company can label them differently: a master may be called primary, publisher or leader, while a slave may be called standby, subscriber or follower. These names can be connected with a different type of replication logic, but most of the time they express the same behavior of the replication process [12]. The master server should focus mainly on inserting and updating SQL statements. The slave server should work primarily on selecting SQL statements. A minimum of two servers is required to create a replication cluster. There can be many master servers (multi-master replication) for two-way synchronization, one master server with one or more slave servers for one-way synchronization from master to slaves [13], or a master server in standby mode. The standby server is offline and only receives changes from a master server; it is activated when the first master server becomes inaccessible (e.g., by watchdog logic). Replication types are typically divided into synchronous vs. asynchronous and logical vs. physical replication [1,14]. Synchronous replication synchronizes every change instantly: only after confirmation by the slave server does the master server synchronize the next data change. Asynchronous replication synchronizes a batch of changes at a defined time step, e.g., one second, one minute or one day.
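The synchronous/asynchronous distinction above can be sketched in a few lines of code. The following is an illustrative in-memory model only (all class and method names are our own, not taken from PostgreSQL or MySQL): the synchronous master waits for the slave's confirmation of every single change, while the asynchronous master queues changes and ships them as one batch at the next sync step.

```python
# Minimal sketch of synchronous vs. asynchronous replication (hypothetical names).

class Slave:
    def __init__(self):
        self.rows = {}

    def apply(self, key, value):
        # apply one change and confirm it to the master
        self.rows[key] = value
        return True

class SyncMaster:
    """Synchronous: every change is confirmed by the slave before the next one."""
    def __init__(self, slave):
        self.rows, self.slave = {}, slave

    def write(self, key, value):
        self.rows[key] = value
        assert self.slave.apply(key, value)   # wait for the slave's confirmation

class AsyncMaster:
    """Asynchronous: changes queue up and are shipped later as one batch."""
    def __init__(self, slave):
        self.rows, self.slave, self.pending = {}, slave, []

    def write(self, key, value):
        self.rows[key] = value
        self.pending.append((key, value))     # no network traffic yet

    def sync(self):
        # fired at a defined time step, e.g. every second, minute or day
        for key, value in self.pending:
            self.slave.apply(key, value)
        self.pending.clear()

sync_slave, async_slave = Slave(), Slave()
m1, m2 = SyncMaster(sync_slave), AsyncMaster(async_slave)
for i in range(3):
    m1.write(i, "v%d" % i)    # slave is already up to date after each write
    m2.write(i, "v%d" % i)    # slave stays empty until sync() runs

print(len(sync_slave.rows), len(async_slave.rows))   # 3 0
m2.sync()
print(len(async_slave.rows))                         # 3
```

The trade-off measured later in the paper follows directly from this shape: the batch in `sync()` arrives as one burst of traffic, while the synchronous master spreads the same work over the whole editing session.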
Almost every company that creates database solutions has different replication strategies, but they can be placed into these basic categories. For example, Oracle has multi-master replication [1]; Microsoft SQL Server has snapshot, merge and transaction replication [15]; MySQL contains statement-based, row-based and mixed replication [16] and PostgreSQL natively (from version 9.0) supports streaming replication. Many plug-ins and third-party tools for PostgreSQL are available that can provide on-demand functionality (e.g., Slony, PgCluster and Bucardo). Spatial data are stored in a database as a special type of attribute, e.g., geometry, shape and geography. The size of a spatial attribute varies: the shortest is a point, while line and polygon (closed line) attributes have a length based on the count of edge points. The selection of spatial database management systems (SDBMS) was based on the criteria shown in Table 1. All of these SDBMS support some form of database replication, and they are among the most popular and most used databases [17]. This is also the case at the Department of Geoinformatics in Olomouc, Czech Republic, which uses PostgreSQL or MySQL in most of its research projects and student classes. In addition, at the Department of Geoinformatics, PostgreSQL is used as the primary database for ArcGIS Enterprise, especially as data storage for the ArcGIS Server. PostgreSQL is also used by the Esri ArcGIS Portal (also deployed at the Department). MySQL is very popular in web design (e.g., WordPress, Drupal and Wikipedia). MongoDB is the most popular NoSQL database system, which also supports storing and base-processing of spatial data (three base functions—geoIntersects, geoWithin and near). MongoDB does not have strong support in GIS desktop applications, unlike PostgreSQL, MySQL, Oracle and MS SQL Server.

Table 1. Criteria for selecting a database system for the experiment.

Criterion                                       PostgreSQL   MySQL        Oracle       MS SQL       MongoDB
Coherence with projects and students' lessons   Yes          Yes          No           No           No
Multiplatform                                   Yes          Yes          Yes          Yes          Yes
Hardware demands                                Minimal      Minimal      Higher       Higher       Minimal
Software demands                                Minimal      Minimal      Higher       Higher       Minimal
Store and process spatial data                  Yes          Yes          Yes          Yes          Limited
Interconnection with GIS                        Yes          Yes          Yes          Yes          Limited
User base                                       World-wide   World-wide   World-wide   World-wide   World-wide

The replication mechanism used for this experiment was Slony-I v2.2.5 as an extension for PostgreSQL, and the native streaming replication for MySQL. The PostgreSQL replication cluster included one master server and one slave server, created with the Slony extension, which enabled master–slave, asynchronous, logical, trigger-based replication. The replication cluster for MySQL (one master and one slave server) was created using native streaming master–slave, synchronous, logical, statement-based replication (SBR). Slony uses database triggers to collect the events happening in a selected table. The selected table is configured for the publisher/subscriber technique by a Slonik Execute Script [18]. The configuration file sets up the connection info for all databases in the replication cluster and the table sets involved in the replication process. Slony creates unique tables in PostgreSQL to maintain the whole replication process. Events (described by origin, type and parameters) are stored and queued in the "sl_event" table. Data for replication are caught by the triggers and stored in the "sl_log_1" and "sl_log_2" tables. A "localListener" thread then periodically generates a "SYNC" event. This "SYNC" event triggers the "remoteListener" thread to start synchronization in the replication cluster. The idea of MySQL statement-based replication is that there are at least two identical databases on which modifying statements (i.e., Insert, Update and Delete) can be executed; instead of transmitting "raw" data (as Slony or MySQL row-based replication (RBR) do), SBR transmits just the SQL statement that modifies the database. MySQL creates a stream of SQL statements, which are executed in the master server and then in the slave server [19]. MySQL uses three threads (Slony uses two) to process a replication. A "Binlog dump" thread sends binary log content from the master to the slave server.
The second thread, "Slave I/O", runs on the slave server and asks the master server to send the updates. The last thread, "Slave SQL", also runs on the slave server; it reads the logs previously written by the "Slave I/O" thread and executes the events contained in the log [20]. There are other techniques for increasing performance and robustness, such as load balancing or high availability; these three techniques are recommended for highly robust, worldwide (geo)data services [4]. There are many solutions for the evaluation of database performance, such as TPC-x (Transaction Processing Performance Council), SSB (Star Schema Benchmark) or YCSB (Yahoo! Cloud Serving Benchmark). These tests evaluate the performance of databases in various ways: the number of inserts per second; the response time for a number of users; or 50/50, 100/0 and 0/100 read/update tests [11,21–24]. Evaluation of the performance of a replication cluster is not just about the response time of one database, but also about the time it takes to synchronize data, the Central Processing Unit (CPU) workload, the network workload and the impact of the replication itself on computer resource utilization [13]. Criteria were set for an objective comparison of database replication of spatial data. These criteria led to conclusions about the suitability of each replication mechanism. The main criterion was the total processing time of a replication process. The replication processes are highly reliable, with a success rate of 100%; hence, the success rate was not considered as one of the evaluation criteria. Replication mechanisms are also suitable for real-time scenarios, e.g., storing and publishing data from wireless sensor networks (WSN) for the measurement of environmental conditions.

These types of WSN produce a small amount of data (usually on the order of 1 bit to tens of kilobits). Whether a replication mechanism alone can secure high availability of data to clients depends heavily on the number of input requests to the database and on the required replication deadline. These limits of a master database can be tested by "inserts per second" tests. Replication mechanisms are suitable for providing a client-accessible database (for reading) with data processed by, for example, big data techniques, stream analytics or data mining.
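The trigger-based change capture that Slony performs, as described above, can be sketched with a self-contained example. The snippet below uses SQLite (not PostgreSQL) purely for illustration, and the table and trigger names are hypothetical stand-ins for Slony's "sl_log" tables: every change to the replicated table is recorded by a trigger into a log table, and a periodic "SYNC" event would then ship the queued rows to the slave.

```python
# Illustrative sketch of trigger-based change capture in the spirit of Slony's
# sl_log_1 / sl_log_2 tables; SQLite stands in for PostgreSQL, names are hypothetical.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE cities (gid INTEGER PRIMARY KEY, name TEXT);
    -- replication log: a toy stand-in for Slony's sl_log tables
    CREATE TABLE repl_log (op TEXT, gid INTEGER, name TEXT);
    CREATE TRIGGER capture_insert AFTER INSERT ON cities
        BEGIN INSERT INTO repl_log VALUES ('I', NEW.gid, NEW.name); END;
    CREATE TRIGGER capture_update AFTER UPDATE ON cities
        BEGIN INSERT INTO repl_log VALUES ('U', NEW.gid, NEW.name); END;
""")

db.execute("INSERT INTO cities VALUES (1, 'Olomouc')")
db.execute("UPDATE cities SET name = 'Praha' WHERE gid = 1")

# a periodic "SYNC" event would now ship these queued rows to the slave
print(db.execute("SELECT op, gid, name FROM repl_log").fetchall())
# → [('I', 1, 'Olomouc'), ('U', 1, 'Praha')]
```

The key property this illustrates is that the master pays the capture cost at write time (the trigger), while the network transfer can happen later in one batch, which matches the CPU and transfer-speed patterns reported for Slony in Section 3.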

2. Materials and Methods

Spatial data were stored in PostgreSQL 9.5 with the PostGIS 2.3.3 extension and in MySQL 5.7.19. PostgreSQL and MySQL database servers do not require any specialized hardware. They are multiplatform software servers, which run on common hardware (PC, notebook, mobile phone) and operating systems (Apple macOS, GNU/Linux and Microsoft Windows). The main advantages of these database servers are as follows:

• They enable the storage and processing of spatial data;
• They have native connectors to QGIS software;
• They are standalone solutions with a broad platform of users.

These two database servers contained an identical set of vector layers. The selected datasets are commonly used in Czech state administration. The data set contained data from ArcCR500 v3.3 (scale 1:500,000, geographic and topographic information about Czechia), Data200 (scale 1:200,000, based on EuroRegionalMap, a geographic model of Czechia), NaturalEarth v3.0.1 (scale 1:10 m, containing cultural, physical and raster categories for the whole Earth) and the estimated pedologic-ecological unit (EPEU) for the Olomouc region v5.1.2018 (scale 1:5000, absolute and relative production capacity of agricultural land and the conditions for its most efficient use). The spatial data were stored in the databases as attribute data according to the OGC Simple Features for SQL 1.2.1 standard, in the well-known binary (WKB) format. The statistical information about the spatial layers is shown in Table 2. The layers have a different number of records, ranging from 1 to 31,280, and a different number of edge points (15,092–3,725,023). These values are essential for a replication mechanism.

Table 2. Information about data layers.

Layer name   Dataset                Type      Number of edge points   Number of records   Table size (MB)
EPEU         Czech land authority   Polygon   3,725,023               31,280              228
City parts   ArcCR 500              Point     15,092                  15,092              12
Rivers       Data200                Line      338,959                 14,606              21
Cities       Data200                Polygon   672,299                 6353                27
World        NaturalEarth           Polygon   411,132                 1                   4
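The table sizes above follow from how geometries are encoded: in the OGC WKB format mentioned earlier, every edge point contributes two 8-byte doubles, which is why the number of edge points matters to a replication mechanism. As an illustrative sketch (hand-rolled with the standard library, not how PostGIS or MySQL is actually called), a single point serializes to 21 bytes:

```python
# Hand-rolled OGC WKB encoding of a point, for illustration only.
import struct

def point_to_wkb(x, y):
    # 1-byte byte-order flag (1 = little-endian), uint32 geometry type
    # (1 = Point), then two IEEE 754 doubles for the coordinates
    return struct.pack("<BIdd", 1, 1, x, y)

wkb = point_to_wkb(17.25, 49.59)   # roughly Olomouc, lon/lat
print(len(wkb))                    # 21 bytes: 1 + 4 + 8 + 8
print(wkb[:5].hex())               # 0101000000 = little-endian Point header
```

By the same arithmetic, each additional edge point of a line or polygon adds 16 bytes of coordinate data, so the EPEU layer's 3,725,023 edge points alone account for tens of megabytes of geometry payload before any attribute data or storage overhead.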

The master and the slave server were not installed on the same machine (PC or notebook). Hence, the connection between the servers was not only on the network layer (of the TCP/IP network model); the servers were interconnected through the link layer by a UTP cable. Three possible connections were set up between the servers: a connection through a router, a direct connection and a software-limited connection. The direct and through-a-router connections had a maximum transfer speed of 100 Mbps. The limited connection had a transfer speed of 10 Mbps. NetLimiter 4 was used to monitor and limit the network speed. Currently, modern technologies are capable of operating at transfer speeds of at least 10 Gbps, which is the maximum capability of UTP Cat6A and Cat7. UTP Cat8 is still under development, and it will offer 40 Gbps up to 30 meters [25]. Even mobile phones currently allow data transmission through a wireless connection, from HSPA+, sometimes marked as 3.5G, with a maximum speed of 10 Mbps, to 5G with a maximum speed of up to 10 Gbps [26]. A Wi-Fi connection following the IEEE 802.11b standard (published in 1999) allows communication at 11 Mbps [27]. Therefore, 10 Mbps was taken as the lowest data transmission speed and was used for this experiment. This restriction of transfer speed shows the differences between replication types. The limitation of transfer speed was set for a more realistic condition, where database servers do not have to be a single software or hardware solution in a network, and one hardware server can serve many software servers, which share the internet bandwidth.
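As a back-of-the-envelope check (an assumption-based estimate, not a measurement from this experiment), the ideal time to ship the 228 MB EPEU table over each tested link can be computed directly, ignoring protocol overhead and the replication log format:

```python
# Ideal transfer time for the 228 MB EPEU table over the tested links
# (assumes perfect utilization; real replication adds log and protocol overhead).
def transfer_seconds(size_mb, link_mbps):
    return size_mb * 8 / link_mbps   # MB -> megabits, divided by link speed in Mbps

for link_mbps in (100, 10):          # direct/router link vs. the limited link
    print(link_mbps, "Mbps:", round(transfer_seconds(228, link_mbps), 1), "s")
# 100 Mbps: 18.2 s
# 10 Mbps: 182.4 s
```

This tenfold spread in the lower bound is what makes the 10 Mbps restriction useful for separating a batch-oriented mechanism (which must push the whole change set through the bottleneck at once) from a streaming one (which spreads the same volume over the editing session).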

2.1. Server Performance

Replication performance is not just about software robustness, but also about hardware. A personal computer and a notebook with different performance levels were used in this study, as shown in Table 3. The replication process was tested in both of the following configurations: the PC as a master server with the notebook as a slave server, and vice versa. The influence of the connection type was also tested: through a 100 Mbps router, a direct 100 Mbps connection between the servers, and a 10 Mbps limited connection. The pre-test of the replication mechanisms showed that the CPU has a significant influence on some replication processes. HWMonitor Pro was used for monitoring; it creates a log file with graphs of the average CPU load and the load of each single core (for multi-core processors).

Table 3. Server configurations.

Parameter          PC                       Notebook
Operating system   MS Windows 8.1           MS Windows 10
RAM                8 GB                     4 GB
CPU                Intel Core i5-4590       Intel Core i3-2330M
Graphics card      Nvidia GeForce GTX-960   Nvidia GeForce GT-540M
Network card       Realtek RTL8110G         Broadcom NetLink
Network cable      OEM CAT5E UTP            OEM CAT5E UTP

2.2. Testing Procedure

Two primary operations were performed on the spatial data. The first operation involved updating/editing spatial geometry. The second operation was updating/editing attribute values. The main monitored criteria were the CPU loads, the speed of data transfer (in Mbps) and the time required for the whole replication transaction. For every spatial layer, ten measurements were made for each task, for each cluster configuration (PC as master and notebook as slave, and vice versa) and for every network setup (with the 10 Mbps limitation, connection through a router, direct connection). There was no significant change in the measured values over the ten measurements. The database servers ran on standard Microsoft Windows 10 and Microsoft Windows 8.1, as shown in Table 3. Maximum free performance was secured for the database servers during the experiments (e.g., Microsoft Windows Update, antivirus scans and any other software were paused), as well as for the monitoring software (HWMonitor Pro and Monyog). The first task (update/edit spatial geometry) was completed in QGIS software, where a connection to the master database server was established. The whole layer was edited by moving every point or edge point in a spatial layer to a different position at the same time. The movement was random, but any movement of all edge points of a polygon layer has a significant impact on the edge point coordinates of the polygon geometry. The second task (update/edit attribute values) was carried out by a pure SQL query (UPDATE epeu SET b5 = 00100 WHERE gid < 30001) in the master database. One attribute of 30,000 records was updated; this amount was chosen because lower numbers of records were updated and replicated too quickly.
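The second task maps naturally onto statement-based replication: the master ships the SQL statement itself and the slave re-executes it. The sketch below is illustrative only; SQLite stands in for MySQL, the schema is a reduced stand-in for the EPEU table, and the `binlog` list is a toy substitute for MySQL's binary log and its three replication threads.

```python
# Toy model of statement-based replication (SBR): ship the statement, not the rows.
# SQLite stands in for MySQL; names and scale are illustrative only.
import sqlite3

def make_db():
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE epeu (gid INTEGER PRIMARY KEY, b5 TEXT)")
    db.executemany("INSERT INTO epeu VALUES (?, ?)",
                   [(gid, "00000") for gid in range(1, 31)])
    return db

master, slave = make_db(), make_db()   # two identical databases, as SBR requires
binlog = []                            # stand-in for the binary log

stmt = "UPDATE epeu SET b5 = '00100' WHERE gid < 21"
master.execute(stmt)                   # change is made in the master first
binlog.append(stmt)                    # "Binlog dump" side records the statement

for s in binlog:                       # the "Slave SQL" thread replays it verbatim
    slave.execute(s)

print(slave.execute(
    "SELECT COUNT(*) FROM epeu WHERE b5 = '00100'").fetchone()[0])   # 20
```

Note the contrast with Slony and RBR discussed earlier: here a single short statement crosses the network regardless of how many rows it touches, whereas row-shipping mechanisms must transmit every affected record.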

3. Results

The experiment was completed for different types of replication mechanisms in different SDBMS to provide a broader perspective on replication mechanisms. The experiment confirmed that database replications are suitable for spatial data. The different approaches used by PostgreSQL and MySQL show the variability of replication mechanisms and that it can be difficult to choose the right approach for a specific case. Experimenting with different approaches highlights the advantages and disadvantages of the Slony and SBR replication mechanisms; we do not try to compare two similar approaches, e.g.,

Slony and RBR. The analysis resulted in several useful conclusions that can be used to gain a better understanding of spatial data replication for PostgreSQL and MySQL. These findings are described in the following section. Firstly, the average CPU load and the CPU cores' load during replication of the EPEU spatial layer were tested. Afterwards, the transfer speed during replication of the EPEU spatial layer was calculated. Finally, the replication time for the other spatial layers and the average time of replication of attribute changes were calculated. Figures 1 and 2 show the CPU loads during the synchronization of changes in the EPEU layer for PostgreSQL and MySQL.

Figure 1. Average Central Processing Unit (CPU) load during replication of the estimated pedologic-ecological unit (EPEU) layer by PostgreSQL (Slony).

The PostgreSQL database with the Slony extension (Figure 1) clearly showed Slony using asynchronous replication: Slony waits for changes to be completed in the master database, and only after that are the changes synchronized to the slave database. MySQL uses streaming, synchronous replication; thus, every changed record (in this case, every polygon) was promptly sent to the slave server as the same modification statement that was executed in the master server (SBR replication). For PostgreSQL, the master database has an average CPU load of up to 33% at the time of storing changes into the database. PostgreSQL completes the changes in the master server first and stores all data in a table. When the master server is no longer resolving any changes, Slony sends the data to the slave server (middle part of Figure 1). During the synchronization of changes, the CPU is almost in an idle state. PostgreSQL is not slowed down by replication threads sending data to the slave server; this step (sending data) is done after the changes in the master database are finished.


Figure 2. Average CPU load during replication of the EPEU layer by MySQL.

MySQL has slightly higher CPU load values than in the idle state (Figure 2). The peaks at around 30% processor load can be a sign of a demanding SQL query or of a higher number of edge points in a record. Figure 2 shows that the slave database consumes more processor power than the master database, which was caused by its less powerful CPU; for the same task, this CPU needs more resources. MySQL creates a statement with the modification of a record, and this statement is sent to the slave server in a binary log (in cooperation with three threads) almost immediately. Therefore, both servers are busy at the same time, executing the same statement. The process of changes in the EPEU polygon layer takes a longer time, but is not so demanding of CPU power.

3.1. CPU Core Load

Each CPU core was monitored by the same method as the average CPU load. A PC with four CPU cores was used as the master server for monitoring. This experiment shows how the individual cores were involved in the replication process. Figures 1 and 2 show the average CPU load of the master and the slave server. Figures 3 and 4 show the CPU cores of the master server during the replication process.

Figure 3. CPU core load during replication of EPEU layer by PostgreSQL (Slony).

Figure 4. CPU core load during replication of EPEU layer by MySQL.

Figure 3 shows the loads of each CPU core during the editing and storage of data and the subsequent replication process. The graph is divided into three parts (the same as the average CPU load in Figure 1). The first part of Figure 3 shows the loads of the second and fourth CPU cores, which were working on storing the edited data in QGIS. There is an assumption that the second core serves QGIS for the data changes and the first core is used to store the data in the database. The middle part represents the replication process of sending binary log files by the Slony thread. The third part shows the idle state of the CPU after the whole replication process.

Figure 4 shows the use of CPU cores for the MySQL master server. There was no significant difference between the average core loads; every core took part in the replication process. Single-core peaks in Figure 4 correspond to the peaks in the average CPU load in Figure 2 (a demanding SQL query or a higher number of edge points in a record). The full power of the CPU was not used because each modification SQL statement was sent to the slave server immediately after the trigger modification event, and the slave executed the modification statement instead of replacing data in a table (as Slony and RBR do).

3.2. Transfer Speed

The maximum or average transfer speed reached was the next monitored criterion. Figures 5 and 6 show the transfer speed for replication from the PC as a master server to the notebook as a slave server for PostgreSQL and MySQL. Figures 7 and 8 show the average transfer speed for all other server configurations. This experiment shows the differences among the configurations in demand for network bandwidth and robustness.

Figure 5. Replication transfer speed of the EPEU layer by PostgreSQL (Slony).



Figure 6. Replication transfer speed of the EPEU layer by MySQL.

Figure 7. Average replication speed of the EPEU layer for different cluster configurations by PostgreSQL (Slony).


Figure 8. Average replication speed of the EPEU layer for different cluster configurations by MySQL.

There is a definite point in Figure 5 where the data started to upload and download from the master to the slave server. The maximum transfer speed was 70 Mbps (Figure 7) for the configuration with the PC as a master server and via a connection through a router. A router mainly causes a lower speed than the accessible maximum (100 Mbps), since with the direct connection between servers the full 100 Mbps transfer speed was reached. PostgreSQL, after storing changes in a master database, attempted to send data with maximum speed through the network, as can be seen in Figures 5 and 7. On the opposite side is MySQL, with streaming replication, where the synchronization speed was around 10 Mbps (Figures 6 and 8). The difference between transfer speeds was caused by the replication technologies: PostgreSQL with Slony used asynchronous replication (sending raw data after the master server finishes all changes) and MySQL used streaming, SBR, synchronous replication (sending a stream of the modification SQL statements and executing them in the slave server). Transfer peaks for MySQL (Figure 6) were caused by a more complex polygon with a bigger record in the EPEU spatial layer. These results show how demanding network stability, robustness and bandwidth were. PostgreSQL synchronizes batches of all changes into the slave server. Unlike that, MySQL sends a stream of SQL modification statements to be executed in the slave server continuously, while making changes in the master server.
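The contrast between the two transfer profiles can be sketched as a toy timing model (ours, not the authors' measurement code; every parameter value below is an assumption chosen only to reproduce the order of magnitude): an asynchronous batch ships one consolidated change log at link speed, while a synchronous statement stream pays a round trip per record.

```python
# Toy model (illustration only, not the paper's measurement code) contrasting
# the two transfer profiles described above. All parameter values are assumed.

def async_batch_seconds(log_mb: float, link_mbps: float) -> float:
    """Slony-style: ship the whole accumulated change log in one burst,
    saturating the link (hence the 70-100 Mbps peaks seen in Figure 7)."""
    return log_mb * 8 / link_mbps

def sync_stream_seconds(n_records: int, rtt_s: float,
                        record_kb: float, link_mbps: float) -> float:
    """MySQL SBR-style: every modified record costs a round trip plus a
    small statement transfer, so the link is never saturated."""
    per_record = rtt_s + record_kb * 8 / 1000 / link_mbps
    return n_records * per_record

batch = async_batch_seconds(log_mb=200, link_mbps=70)       # one burst at link speed
stream = sync_stream_seconds(n_records=30000, rtt_s=0.05,
                             record_kb=7, link_mbps=100)    # steady per-record trickle
print(f"batch: {batch:.0f} s, stream: {stream:.0f} s")
```

Under these assumed parameters the batch finishes in tens of seconds while the per-record stream takes tens of minutes, matching the order-of-magnitude gap reported in Section 3.3.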

3.3. Replication Time

One of the main criteria is the time of replication from a master to a slave server. PostgreSQL replicates changes in one large log file ("sl_log_1" or "sl_log_2") and tries to use as much network bandwidth as possible. MySQL replicates changes continuously from the start, where changes are made in QGIS. The time of each replication was very different because of the replication technology used. Figure 9 shows the amount of time taken for the replication of the most extensive spatial layer, EPEU. Each situation was measured ten times. The boxplots depict the first and third quartiles, the median and the absolute minimum and maximum of the time taken.
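The boxplot summary used in Figure 9 (first and third quartiles, median and extremes over ten measurements) can be reproduced with Python's standard statistics module; the measurement values below are invented for illustration, not taken from the paper's raw data.

```python
import statistics

# Ten hypothetical replication-time measurements in seconds (illustrative only;
# the paper's raw measurements are not reproduced here).
times = [22.1, 23.4, 21.8, 24.0, 22.9, 23.7, 22.5, 23.1, 24.3, 22.0]

# quantiles(n=4) returns the three quartile boundaries Q1, median, Q3
q1, median, q3 = statistics.quantiles(times, n=4)
summary = {
    "min": min(times),
    "q1": q1,
    "median": median,
    "q3": q3,
    "max": max(times),
}
print(summary)
```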


Figure 9. Replication time of changes of the EPEU layer.

There was a significant difference in the time taken between these two replication mechanisms. The time differential was mainly caused by the synchronous and asynchronous approaches used by each database server. Synchronous replication sent every record change instantly after it was saved in a master database, while asynchronous replication sent a bulk of changes. Slony synchronized between replication nodes in less than 30 s, while MySQL synchronized in almost 1700 s (56 times more time) for the same amount of data and changes.

Figures 10 and 11 show the statistical data of all four server configurations for a replication of the EPEU layer.

Figure 10. Replication time of the EPEU layer for different cluster configurations by PostgreSQL (Slony).


Figure 11. Replication time of the EPEU layer for different cluster configurations by MySQL.

Differences between PostgreSQL and MySQL were not just in terms of replication time, but also in the variance of measurements. The specific numbers regarding Figures 10 and 11 are shown in Table 4. PostgreSQL, unlike MySQL replication, had a significant time increase with the 10 Mbps transfer speed limitation. Figure 12 provides a visual comparison showing the average replication time for each condition. As the Slony replication extension tries to use as much connection bandwidth as possible, the 10 Mbps limitation excessively prolongs the replication time. A router causes a time difference between a PC (as a master server) connected through a router and a direct connection; this is the consequence of the lower transfer speed through a router. For the replication time with MySQL, the limiting factor is the CPU power of the master server. When a notebook is the master server and must process and send an edited layer simultaneously, the replication is slower. The transfer speed limitation is not relevant for MySQL. The most fundamental statistical analyses of replication time are shown in Table 4. The average replication times may look more stable for PostgreSQL than for MySQL, but with a much faster replication process; both database servers had a standard deviation in time of under 5%. Only PostgreSQL with the PC as a master server configuration had a standard deviation in time of 12%.

Table 4. Statistics analyses of the replication time of the EPEU layer for both replication solutions.

Solution             PostgreSQL (Slony)                       MySQL
Cluster conf.        PC as    PC as    Direct   10 Mbps     PC as     PC as     Direct    10 Mbps
                     master   slave    conn.    lim.        master    slave     conn.     lim.
                     server   server                        server    server
Average              23.00    11.40    14.63    99.14       1633.30   2656.50   1643.70   1628.10
Variance              7.39     0.05     0.07     0.07        926.01   6668.25    370.41   2451.89
Range                 8.91     0.82     0.80     0.89        115.00    303.00     73.00    191.00
Standard deviation    2.72     0.22     0.26     0.26         30.43     81.66     19.25     49.52
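As a quick consistency check of Table 4 (using only values printed in the table), the relative standard deviation quoted in the text, standard deviation divided by average, can be recomputed in a few lines of Python:

```python
# Cross-check of the relative standard deviations quoted in the text,
# using the Average and Standard deviation rows of Table 4.
table4 = {
    # configuration: (average_s, standard_deviation_s)
    "PostgreSQL, PC as master": (23.00, 2.72),
    "PostgreSQL, PC as slave": (11.40, 0.22),
    "PostgreSQL, direct conn.": (14.63, 0.26),
    "PostgreSQL, 10 Mbps lim.": (99.14, 0.26),
    "MySQL, PC as master": (1633.30, 30.43),
    "MySQL, PC as slave": (2656.50, 81.66),
    "MySQL, direct conn.": (1643.70, 19.25),
    "MySQL, 10 Mbps lim.": (1628.10, 49.52),
}
rel_sd = {k: sd / avg * 100 for k, (avg, sd) in table4.items()}

# As stated in the text: every configuration stays under 5%, except
# PostgreSQL with the PC as a master server, which is about 12%.
print({k: round(v, 1) for k, v in rel_sd.items()})
```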




Figure 12. Average replication time of the EPEU layer for both replication solutions.

3.4. Replication Time of Other Spatial Layers

The replication time of other spatial layers followed the same methodology used to test the replication time of the EPEU layer. However, they had fewer edge points, records and, of course, size. Hence, it was irrelevant to do the same CPU load and transfer speed test as for the EPEU layer. Figure 13 shows the average replication time (based on ten measurements) for every layer and every replication configuration. If we focus on the replication time for PostgreSQL, we can see there was a definite time development that depended on the number of edge points (in Table 2) with the 10 Mbps speed limitation. Other layers were influenced by the number of edge points, which were also edited (by a movement to another place in QGIS). On the other hand, the MySQL replication time (Figure 14) relied on the number of records. With an increased number of records, the time of replication increased. This is the result of sending each change of a record almost immediately as the record was changed in the master server to the slave server, where the record modification was created again (by executing the SQL statement).

Figure 13. Average replication time of layers with different cluster configurations by PostgreSQL (Slony).



Figure 14. Average replication time of layers with different cluster configurations by MySQL.

There was no difference between point, line or polygon spatial layers for the PostgreSQL and MySQL replication mechanisms. Figures 13 and 14 show the opposite behaviors of the PC as a slave server, where for PostgreSQL the replication was the quickest, while for MySQL the replication was the slowest. For MySQL, SBR replication demands more from the master server. This is because the master server has to store data and simultaneously transfer it to the slave server.

For PostgreSQL, there were considerable differences between the servers' configurations. If the slave server is more powerful than the master server, then the replication is faster (faster data processing in a faster machine). Configurations of the MySQL replication cluster showed almost the same results (same time of replication for PC as master server, direct connection and 10 Mbps limitation).

3.5. Replication Time of Decimal Attribute Data

The last test was done to prove the difference between attributes with a spatial data format and attributes with a decimal format. There is, of course, a difference in the length of attribute records. The decimal or text attribute data should respect at least the first three relational database normalization rules proposed by Codd [28]. The result was a much smaller size of a single decimal or text attribute record compared to attributes containing spatial information.

Figures 15 and 16 show the average times for replication of 30,000 updated records in both database servers. For PostgreSQL, the database replication of decimal attribute records was quickest with the configuration with the PC as a slave server; the speed limitation did not have an impact on the PostgreSQL replication time, and the direct connection or speed limitation had a negligible influence on replication. By contrast, MySQL was affected by the speed limitation, and it was almost five times slower compared with PostgreSQL.
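The record-length gap can be made concrete by counting bytes: in the OGC well-known-binary (WKB) encoding, a polygon carries 16 bytes per vertex plus a small header, whereas a double-precision decimal attribute is only 8 bytes. The sketch below (ours, not the paper's code) builds a minimal little-endian, single-ring WKB polygon with Python's struct module.

```python
import struct

def wkb_polygon_size(n_vertices: int) -> int:
    """Byte length of a little-endian WKB polygon with one ring:
    1 (byte order) + 4 (geometry type) + 4 (ring count)
    + 4 (vertex count) + 16 bytes per vertex (two float64 coordinates)."""
    wkb = struct.pack("<BII", 1, 3, 1)       # byte order, type=3 (Polygon), 1 ring
    wkb += struct.pack("<I", n_vertices)     # vertex count of the ring
    for i in range(n_vertices):              # dummy coordinates
        wkb += struct.pack("<dd", float(i), float(i))
    return len(wkb)

decimal_attr = len(struct.pack("<d", 3.14))  # a single float64 attribute: 8 bytes
polygon_1000 = wkb_polygon_size(1000)        # 13 + 16 * 1000 bytes
print(decimal_attr, polygon_1000)
```

A polygon with a thousand edge points is thus roughly two thousand times larger than a single decimal attribute, which is consistent with Figures 15 and 16 showing much faster replication of decimal records.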



Figure 15. Average replication time of layers with different cluster configurations by PostgreSQL (Slony).

Figure 16. Average replication time of layers with different cluster configurations by MySQL.

For PostgreSQL, it was again proved that a more powerful slave server could decrease the time of the replication process. MySQL, it seems, was optimized for replication of classical data structures rather than spatial data. The 10 Mbps limitation was almost five times slower than a direct connection of the servers. There could be an optimization of the transfer process of decimal attributes: it was faster to execute them and then update spatial data to the slave server. Hence, the 10 Mbps limitation would always slow down the whole replication process.


4. Conclusions and Discussion

This study aimed to describe and test available database replication solutions that provide replication functionality over spatial data. The results from the tests performed on the replication mechanisms confirmed that replication mechanisms of spatial databases are available for the distribution of spatial and sensor data. Spatial data replications were more specific and required a specific approach compared to the replication of common data types.

This paper presents a comparison of two database replication mechanisms for PostgreSQL and MySQL. This selection was based on the criteria shown in Table 1. One of the main reasons to test PostgreSQL and MySQL over Oracle and Microsoft was the price of these solutions. Oracle has not required any exclusive license for Oracle Spatial and Graph since 5 December 2019, and it is included in the Oracle Database; Oracle offers the Standard Edition 2 for a minimum of $21,000. The Microsoft SQL Server is cheaper than Oracle, but it is still $3717 per core. Both systems are out of range of the Department of Geoinformatics. Nevertheless, the suggested experiment is reproducible for any SDBMS. Both database servers (PostgreSQL and MySQL) are conventional in GIS applications. Creating a robust, secure and fast distributed environment should be part of every more extensive project. Besides replication, setting up a load balancer and high availability solutions (e.g., using PgPool for PostgreSQL and MySQL Router for MySQL) should be considered.

The comparison was made under real conditions and on real spatial data. Spatial data from publicly available datasets were used, as were servers with different performance. From the graphs produced, it can be seen that swapping the master and slave roles between the PC and the notebook generated different results based on master or slave server performance.
We also tested the behavior with a speed limitation, a different number of edge points or records, and servers connected through a router, directly or with a 10 Mbps limitation. The times of replication (Section 3.3) were related to the transfer speed (Section 3.2) and the CPU load (Section 3.1) and, of course, to the replication technique. PostgreSQL and MySQL use different replication approaches, and this experiment shows that each replication technique had advantages and disadvantages.

PostgreSQL with the Slony extension uses asynchronous logical replication, and the replication time depends on the server performance and connection bandwidth. Slony-I allows replication of one or more tables among PostgreSQL servers. PostgreSQL can handle larger dataset processing than MySQL and replicate it in a shorter time, although Slony can create an inconsistency between a master server and a slave server for a certain amount of time (time-step synchronization). PostgreSQL servers must have an excellent connection to each other and appropriately powerful hardware. The advantages of PostgreSQL with Slony replication are that Slony enables replication between different PostgreSQL major versions and different operating systems, and replicates only some of the tables to some of the slaves (cascade replication) [18]. Its disadvantages include that it can be harder to set up and that synchronizing a change in a schema can be hard.

MySQL uses native streaming, synchronous logical replication, where the performance and connection bandwidth are not as crucial. The replication depends on the number of edited records because synchronous replication synchronizes every record separately. The master server sends a binary log with a statement (SBR mechanism) to the slave server when it is executed. These binary logs can be used for auditing. Monyog revealed that each record was first deleted from the slave server and then inserted as a new record.
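The per-record replay behavior described above (each statement shipped to the slave and re-executed, with an update appearing as a delete followed by an insert) can be mimicked with a toy in-memory sketch; the record structure and function names below are invented for illustration and are not MySQL's actual implementation.

```python
# Toy statement-based replication: the master logs each mutation as an
# (operation, key, value) event and the slave replays the log in order.
master, slave, binlog = {}, {}, []

def apply_statement(db, op, key, value=None):
    """Execute one mutation against an in-memory 'table'."""
    if op == "UPDATE":        # modeled as DELETE followed by INSERT,
        db.pop(key, None)     # mirroring what Monyog showed on the slave
        db[key] = value
    elif op == "DELETE":
        db.pop(key, None)

def execute(op, key, value=None):
    """Master side: run the statement and append it to the binary log."""
    apply_statement(master, op, key, value)
    binlog.append((op, key, value))

def replay():
    """Slave side: re-execute every logged statement, one by one."""
    for op, key, value in binlog:
        apply_statement(slave, op, key, value)

execute("UPDATE", 1, "polygon-A")
execute("UPDATE", 2, "polygon-B")
execute("DELETE", 1)
replay()
print(slave == master)  # per-statement replay keeps the copies consistent
```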
Although synchronous replication should secure master–slave consistency, the time taken to finish the replication was much greater than for PostgreSQL. Whether one should use this method is very use-case dependent. Slony replication would be an advantage if there is no need for record consistency for some amount of time (so that Slony can finish the changes in the master and then send them to the slave server), e.g., for long-term analyses in the master server. Alternatively (MySQL), it is better to synchronize every record in the slave server immediately as a record changes in the master server, e.g., for real-time mapping (city passporting of trees).

Overall, the right configuration of a whole distributed environment, including database replication in collaboration with load balancing and high availability techniques, will lead to an improvement in reliability and availability of data in the database cluster for a client (web page, desktop GIS). This creates a distributed environment where an application can access the datastore independently of the location of the sources. With an increasing number of replication nodes, the complexity of the system increases, as do the management demands and potential data fragility.

The comparison of PostgreSQL with Slony replication and MySQL native streaming replication shows the difference between asynchronous and synchronous logical replication. Slony is more suitable for a large dataset and can handle extensive data edits. MySQL replicates a data change slower but maintains data consistency in the whole replication cluster at all times. The MySQL replication time is mainly influenced by the number of records and the CPU power of the master server. MySQL replication does not require as much CPU power as PostgreSQL and is less demanding on network bandwidth.
PostgreSQL with Slony database replication can be used in a cluster where a lot of changes in data (edge points) are made, but where it is not crucial to have every change in the whole cluster at the same time (a delay from seconds to minutes). The Slony replication time depends on the power of the slave server and the quantity of edge points in the replicated data. For concurrent work over extensive spatial database storage, where every GIS expert completes a long-term analysis, the right solution would be to work on a local copy of the database and synchronize results and changes between themselves by database replications. Every layer in the spatial database is one table, and Slony does not modify a schema for replication automatically; for adding new layers in the database, it is better to use native PostgreSQL streaming replication or streaming MySQL SBR or RBR replication. Slony is useful for a database in production, where there are changes in data (Update, Insert and Delete) and some delay between the master and slave server is irrelevant. Slony replication can be used in a sensor network that uses Low Power Wide Area Network (LPWAN) technologies for communication, where data transfer does not come with precise time deadlines and it is not necessary to replicate every record immediately as it is received into the master server.

On the other hand, MySQL SBR-type replication is useful for showing instant changes in spatial data. For any solution for movement tracking or online collaboration over one dataset (e.g., passporting trees using mobile phones), it is necessary to synchronize every change in the dataset between databases for consistency. Both replication mechanisms in this experiment proved that they were able to handle a small amount of data without overloading the network or CPU and were able to finish data replication in the order of seconds.
During real-time cooperation, where users modify records one by one (creating new points, updating record attributes), both replication mechanisms were able to provide a powerful distributed environment for this type of work. In practice, these replication mechanisms have primarily been used for the research tasks in [29–31]. The purpose of deploying replication mechanisms for these research tasks was the use of an extensive spatial dataset; hence, the benefits of a network for load balancing between the master database (to write changes) and the slave database (for solving analytical problems) were utilized [5].

Author Contributions: Conceptualization, Tomáš Pohanka, Vilém Pechanec; Methodology, Tomáš Pohanka; Software, Tomáš Pohanka; Validation, Tomáš Pohanka, Vilém Pechanec; Formal Analysis, Tomáš Pohanka; Investigation, Tomáš Pohanka; Resources, Vilém Pechanec; Data Curation, Tomáš Pohanka; Writing—Original Draft Preparation, Tomáš Pohanka; Writing—Review and Editing, Vilém Pechanec; Visualization, Tomáš Pohanka; Supervision, Vilém Pechanec; Project Administration, Vilém Pechanec; Funding Acquisition, Vilém Pechanec. All authors have read and agreed to the published version of the manuscript.

Funding: This paper was created within the project "Advanced application of geospatial technologies for spatial analysis, modelling, and visualization of the phenomena of the real world" (IGA_PrF_2020_027) with the support of the Internal Grant Agency of Palacky University Olomouc.

Conflicts of Interest: The authors declare no conflict of interest.

References

1. Urbano, R.; Arora, N.; Chan, Y.; Downing, A.; Elsbernd, C.; Feng, Y.; Galagali, J.; Klein, J.; Liu, J.; Lu, E.; et al. Oracle Database Advanced Replication 11g; Oracle: Redwood City, CA, USA, 2013; Volume 2.
2. Samtani, G. B2B Integration: A Practical Guide to Collaborative E-Commerce; World Scientific: Singapore, 2002; ISBN 1-86094-326-8.
3. Inmon, W.H.; Hackathorn, R.D. Using the Data Warehouse, 2nd ed.; John Wiley & Sons: New York, NY, USA, 1996; ISBN 0471059668.
4. Özsu, M.T.; Valduriez, P. Principles of Distributed Database Systems; Springer: New York, NY, USA, 2011; ISBN 9781441988331.
5. Pohanka, T.; Pechanec, V.; Solanská, M. Synchronization and Replication of Geodata in the Esri Platform. In Proceedings of the 15th International Multidisciplinary Scientific GeoConference SGEM 2015, Albena, Bulgaria, 18–24 June 2015; Volume 1, pp. 837–844.
6. Pohanka, T.; Pechanec, V.; Hejlová, V. Python Web Server for Sensor Data Visualization. In Proceedings of the 16th International Multidisciplinary Scientific GeoConference (SGEM 2016), Albena, Bulgaria, 30 June–6 July 2016; Volume 1, pp. 803–810.
7. Mistry, R.; Misner, S. Introducing Microsoft SQL Server 2014 Technical Overview; Microsoft Press: Redmond, WA, USA, 2014; ISBN 9780735684751.
8. Takada, M. Distributed Systems: For Fun and Profit. Available online: http://book.mixu.net/distsys/ (accessed on 29 January 2019).
9. Mazilu, M.C. Database Replication. Database Syst. J. 2010, I, 33–38.
10. Makris, A.; Tserpes, K.; Spiliopoulos, G.; Anagnostopoulos, D. Performance evaluation of MongoDB and PostgreSQL for spatio-temporal data. In Proceedings of the CEUR Workshop, Lisbon, Portugal, 26 March 2019; Volume 2322.
11. Kabakus, A.T.; Kara, R. A performance evaluation of in-memory databases. J. King Saud Univ. Comput. Inf. Sci. 2017, 29, 520–525. [CrossRef]
12. Aitchison, R. An Introduction to DNS. In Pro DNS and BIND; Apress: Berkeley, CA, USA, 2005; pp. 3–19. ISBN 1590594940.
13. Bell, C.; Kindahl, M.; Thalmann, L. MySQL High Availability; O'Reilly Media, Inc.: Sebastopol, CA, USA, 2014; ISBN 9780596807306.
14. Schönig, H.-J. PostgreSQL Replication, 2nd ed.; Packt Publishing: Birmingham, UK, 2015; ISBN 978-1-78355-060-9.
15. Guyer, C.; Hubbard, J.; Byham, R. SQL Server Replication. Available online: https://docs.microsoft.com/en-us/sql/relational-databases/replication/-server-replication (accessed on 29 January 2019).
16. MySQL Replication. Available online: https://dev.mysql.com/doc/refman/5.7/en/replication.html (accessed on 29 January 2019).
17. Anon. Solid IT: DB-Engines. Available online: http://db-engines.com/en/ranking (accessed on 29 March 2020).
18. Singer, S. Slony-I. Available online: http://www.slony.info/ (accessed on 26 February 2020).
19. Schneller, D.; Schwedl, U. MySQL Admin Cookbook: Replication and Indexing; Packt Publishing: Birmingham, UK, 2011; ISBN 1849516146.
20. Anon. MySQL :: MySQL 5.7 Reference Manual :: 16.2.2 Replication Implementation Details. Available online: https://dev.mysql.com/doc/refman/5.7/en/replication-implementation-details.html (accessed on 2 April 2020).
21. Tiwari, S. Professional NoSQL, 1st ed.; John Wiley & Sons, Inc.: New York, NY, USA, 2011; ISBN 978-0-470-94224-6.
22. Boicea, A.; Radulescu, F.; Agapin, L.I. MongoDB vs Oracle—Database comparison. In Proceedings of the 2012 3rd International Conference on Emerging Intelligent Data and Web Technologies, Bucharest, Romania, 19–21 September 2012; pp. 330–335.
23. Abramova, V.; Bernardino, J.; Furtado, P. Which NoSQL Database? Open J. Databases 2014, 1, 8.
24. Dolgikh, M. Vysoká Dostupnost v Relačních Databázových Systémech (High Availability in Relational Database Systems); Masaryk University: Brno, Czech Republic, 2014.
25. Panduit. The Future of Copper: Category 8 Cabling; Panduit: Tinley Park, IL, USA, 2016.
26. Hill, S. How Fast Is 5G? Available online: https://www.digitaltrends.com/mobile/how-fast-is-5g/ (accessed on 29 January 2019).

27. IEEE. 802.11b-1999 IEEE Standard for Information Technology—Telecommunications and Information Exchange between Systems—Local and Metropolitan Networks—Specific Requirements—Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Sp; IEEE: Piscataway, NJ, USA, 2000.
28. Codd, E.F. Further Normalization of the Data Base Relational Model. Data Base Syst. 1972, 6, 33–64.
29. Filippovová, J.; Pohanka, T. Environmental Assessment of Central European Floodplain Forests: A Case Study from the Morava River Alluvium. Polish J. Environ. Stud. 2019, 28, 4511–4517. [CrossRef]
30. Pechanec, V.; Machar, I.; Pohanka, T.; Opršal, Z.; Petrovič, F.; Švajda, J.; Šálek, L.; Chobot, K.; Filippovová, J.; Cudlín, P.; et al. Effectiveness of Natura 2000 system for habitat types protection: A case study from the Czech Republic. Nat. Conserv. 2018, 24, 21–41. [CrossRef]
31. Vrublova, K.; Pohanka, T. Is Environmental Conservation of European Beech-Dominated Forests Efficient? Fresenius Environ. Bull. 2019, 28, 1218–1223.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).