Bachelor Degree Project
Performance comparison between NewSQL and SQL
Sharded TiDB vs MariaDB

Bachelor Degree Project in Information Technology
Basic level 30 ECTS
Spring 2020

Mathias Johansson, Jonatan Röör
Supervisor: András Márki
Examiner: Yacine Atif

Abstract

Databases are used extensively for websites, and a large number of websites are built upon Wordpress. Wordpress uses MySQL compatible databases, and for many larger websites it can be imperative to have the best possible performance. Recently, NewSQL databases have been appearing that combine the features of NoSQL databases with the SQL compatibility and ACID compliance that are usually not found in NoSQL databases. Due to their recency, there is a knowledge gap in the literature regarding NewSQL databases. Therefore, this work compares the performance of the NewSQL database TiDB against the SQL database MariaDB in a performance benchmark. The benchmark includes three testing approaches that aim to test multiple performance aspects: load testing, complex queries and performance in a realistic environment. Results from this thesis show that TiDB achieves a better average response time for complex queries and in load testing where the databases and load are large, but gets worse results for simple queries and small datasets. MariaDB performs better when used with a web server and in tests that involve write operations.
Keywords: Database, Performance comparison, NewSQL, TiDB, SQL, MariaDB

Table of contents

1 Introduction
2 Background
2.1 Databases
2.2 SQL
2.3 NoSQL
2.4 NewSQL
2.5 TiDB
2.6 Wordpress
3 Problem
3.1 Problem background
3.2 Problem description
3.3 Aim
3.4 Research questions
3.5 Objectives
3.6 Hypotheses
4 Method
4.1 Empirical Strategies
4.2 Related research
4.3 Approach
5 Experiment
5.1 Benchmarking tools and relevant variables in the domain
5.1.1 Sysbench
5.1.2 Siege
5.1.3 Database queries
5.2 Benchmarking setup
5.3 Results
5.3.1 Sysbench
5.3.2 Database queries
5.3.3 Wordpress - Siege load testing
5.4 Analysis & Conclusions
6 Discussion
6.1 General discussion
6.1.1 Documentation and community
6.1.2 TiDB write performance
6.2 Research usefulness
6.3 Ethics and Validity threats
6.3.1 Ethics
6.3.2 Validity threats
6.4 Future work
7 References

1 Introduction

Websites that require persistent storage usually use databases to store data such as user information and product catalogs. As the number of users on a website increases, it becomes necessary to improve the server architecture to keep up with demand. The web server layer can be improved by adding more hardware resources to the web server, called vertical scaling, or by adding new web servers, called horizontal scaling. More resources can be added to the database server as well to improve performance; however, adding new database servers introduces a number of problems. NoSQL databases are able to use sharding to split their databases across several physical locations, referred to as nodes, improving query response times and redundancy. NoSQL databases also do not enforce a strict table structure and can get better response times than traditional databases even on single nodes.
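The idea of splitting data across nodes can be made concrete with a minimal sketch. It assumes a simple hash-based placement rule, which is only one of several possible sharding schemes and is not claimed to be the one used by any database discussed in this thesis. The node names and the helper functions `shard_for` and `distribute` are hypothetical and exist only for illustration.

```python
import hashlib


def shard_for(key: str, nodes: list[str]) -> str:
    """Map a key to one node by hashing it (illustrative rule only)."""
    # A cryptographic hash gives a placement that is stable across runs,
    # unlike Python's built-in hash(), which is randomized per process.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]


def distribute(keys, nodes):
    """Group keys by the node that the hashing rule assigns them to."""
    placement = {node: [] for node in nodes}
    for key in keys:
        placement[shard_for(key, nodes)].append(key)
    return placement


# Hypothetical node names; a real deployment would use host addresses.
nodes = ["node-a", "node-b", "node-c"]
placement = distribute([f"user:{i}" for i in range(1000)], nodes)

for node, keys in placement.items():
    print(node, len(keys))
```

Because every key is owned by exactly one node, a lookup only has to contact that node, which is where the potential response time gain comes from. The sketch leaves out what makes sharding hard in practice, such as rebalancing when nodes are added and queries that span several shards.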
Győrödi, Győrödi, Pecherle & Olah (2015) compared the relational SQL database MySQL to the NoSQL database MongoDB and found that MongoDB gave lower execution times for all operations tested. However, NoSQL databases also require the website to be built around the NoSQL database, which makes it difficult to move existing websites to non-relational NoSQL databases. NoSQL databases also often cannot guarantee full ACID compliance (Grolinger, Higashino, Tiwari & Capretz 2013). NewSQL databases address this by providing full ACID compliance and SQL interfaces while still allowing for horizontal partitioning. TiDB is one of these NewSQL implementations. It provides a MySQL compatible SQL interface and sharding capabilities to potentially increase scalability and general performance.

Many studies have compared NoSQL databases to SQL databases, but NewSQL databases have not been thoroughly compared to SQL databases. This thesis compares the performance of TiDB and MariaDB to see when it might make sense to use a NewSQL database over a traditional SQL database. The study consists of a series of benchmarks that test various use cases and load configurations to see when each database gets the lowest response time. The Wordpress website framework is used to evaluate how well the databases would work with a real website.

2 Background

2.1 Databases

Almost everything today that needs to store a large amount of data uses a database of some kind. Whether it is an online store that uses a database for storing products and customer information or a bank that stores account balances, databases can be used to store all kinds of data. There are a multitude of database implementations with different characteristics and features, and the database management systems can be controlled through specific programming languages and APIs.
Most database implementations can be categorized into one of the following groups: SQL, NoSQL and, more recently, NewSQL.

2.2 SQL

Structured Query Language (SQL) is used for managing structured data in relational database management systems. Elmasri & Navathe (2011) describe the relational model as a representation of the database as a collection of relations. A relation can be represented as a table of data. Each relation consists of a number of attributes defining what values each row will contain as well as what datatype will be used to store them. The actual definition of the relational model is based in mathematics and uses set theory and first-order predicate logic to reason about databases.

A common feature of relational databases is support for transactions. Transactions group several queries into a single unit of work that is either performed completely or not at all. Using transactions, a system can avoid an operation being left half-completed because a query failed in the middle of it (The PostgreSQL Global Development Group 2020).

ACID is a set of properties that database transactions should possess in order to guarantee validity under all circumstances. These properties are Atomicity, Consistency preservation, Isolation and Durability (Elmasri & Navathe 2011).

● Atomicity means that a transaction should always be performed entirely or not at all. In practice this means that if a transaction fails, e.g. because the system crashes during a query, the database must recover and remove any trace of the transaction.
● Consistency preservation means that a transaction, when executed without interference from other transactions, should take the database from one consistent state to another.
● Isolation means that transactions should appear to be executed in isolation, without interference from other transactions executing concurrently.
● Durability means that all committed changes to the database must persist and that no data should be lost due to failure.

Some notable SQL databases are MySQL and MariaDB:

● MySQL is an open source relational database management system originally created by MySQL AB. Following a series of acquisitions, ownership of MySQL passed first to Sun Microsystems and then to Oracle Corporation.
● MariaDB is a MySQL compatible database that was forked from MySQL in 2010 after the Oracle acquisition. Its lead developer is one of the core developers behind MySQL, and the project is being developed with an emphasis on providing a high level of compatibility with MySQL.

Both MySQL and MariaDB can be ACID compliant depending on which storage engine is being used. The default storage engine in both databases provides ACID compliance (Oracle Corporation 2020).

SQL databases are among the most popular database types, but there are others, such as NoSQL, whose implementations focus on better query performance or offer features not usually found in SQL databases.

2.3 NoSQL

Non-relational databases, also called NoSQL, are mainly thought of as dissimilar to SQL databases, but the name can also be read as “not only SQL”, since a NoSQL database may support query languages like SQL. In comparison to traditional relational SQL databases, NoSQL databases can provide different storage mechanisms for data. Relational SQL databases use a tabular structure which can be read and understood without additional explanations (Meier & Kaufmann 2019). NoSQL databases, however, use different storage strategies that do not necessarily share a common or easily understandable structure for how data is stored.

NoSQL databases are categorized by the type of data storage mechanism they employ. Meier & Kaufmann (2019) describe four general strategies that NoSQL databases use: key-value stores, document stores, column stores and graph databases.
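As a rough illustration of the four strategies just named, the sketch below models each one with plain Python dictionaries. It is a deliberately simplified mental model rather than the storage layout of any real NoSQL product, and all keys, field names and the helper `field_names` are hypothetical.

```python
# Key-value store: a unique key points to an opaque value.
key_value = {"session:42": b"opaque bytes, structure unknown to the store"}

# Document store: each key maps to a document whose fields may differ.
documents = {
    "post:1": {"title": "Hello", "tags": ["intro"]},
    "post:2": {"title": "Second post", "author": "alice"},
}

# Column store: one row key links to a variable number of columns.
columns = {
    "user:1": {"name": "Alice", "email": "alice@example.com", "city": "Stockholm"},
    "user:2": {"name": "Bob"},
}

# Graph database: nodes connected by labelled edges.
graph = {
    "alice": [("follows", "bob")],
    "bob": [],
}


def field_names(store):
    """Return the sorted field names of every record in a dict-of-dicts store."""
    return {key: sorted(record) for key, record in store.items()}


print(field_names(documents))
print(field_names(columns))
```

The point of the comparison is that neither the documents nor the column rows share a fixed set of fields, whereas a relational table would require every row to carry every attribute that the table defines.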
● Key-value databases are one of the simplest NoSQL types: data is represented as key-value pairs where a unique key points to a specific value in the database. This structure resembles a hash table in function and does not enforce any particular structure on the value fields.
● Document stores are based on the key-value structure but save data as documents in formats such as JSON, BSON, XML and YAML. This means that each document can store many fields, similar to how a relational database would store data. Different documents are free to contain different fields from each other, in contrast with relational databases, where each row needs to contain all the fields that its table has defined.
● Column stores are similar to key-value databases, except that a single key can link to multiple columns and a variable number of columns can exist in a single record.