UMEÅ UNIVERSITY
Department of Computing Science
Bachelor Thesis, 15 hp
5DV199 Degree Project: Bachelor of Science in Computing Science
June 3, 2021

A scalability evaluation on CockroachDB

Tobias Lifhjelm

Teacher: Jerry Eriksson
Tutor: Anna Jonsson

Contents

1 Introduction
2 Database Theory
  2.1 Transaction
  2.2 Load Balancing
  2.3 ACID
  2.4 CockroachDB Architecture
    2.4.1 SQL Layer
    2.4.2 Replication Layer
    2.4.3 Transaction Layer
      2.4.3.1 Concurrency
    2.4.4 Distribution Layer
    2.4.5 Storage Layer
3 Research Questions
4 Related Work
5 Method
  5.1 Data Collection
  5.2 Delimitations
6 Results
7 Discussion
  7.1 Conclusion
  7.2 Limitations and Future Work
8 Acknowledgment

Abstract

Databases are a cornerstone of data storage since they store and organize large amounts of data while allowing users to access specific parts of that data easily. Databases must, however, adapt to an increasing number of users without negatively affecting the end-users. CockroachDB (CRDB) is a distributed SQL database that combines the consistency associated with relational database management systems with the scalability needed to handle more user requests simultaneously. This paper presents a study that evaluates the scalability properties of CRDB by measuring how latency is affected by the addition of more nodes to a CRDB cluster. The findings show that latency can decrease when nodes are added to a cluster. However, there are cases where more nodes increase the latency.

1 Introduction

Databases have been a hot topic over the last decades. Databases are useful for long-term storage and for organizing related data. They can store large amounts of data and organize it so that it is easy to search for and fetch specific parts of the data. However, databases are constantly under development, and new types of databases keep entering the market to meet new demands. Furthermore, with the growth of the internet and more users connecting every year, databases have to adapt to this demand to remain useful [1].

In 1970, Edgar F. Codd [2] introduced a new model to organize data independently, reduce redundancy, and improve the consistency of data storage. With this model, he laid the foundation for the coming revolution in database systems, namely relational database management systems (RDBMS). RDBMS dominated the industry through the 1980s and '90s, offering atomicity that ensures that the database remains in a consistent state, efficient queries, and efficient disk space usage. RDBMS organize data into tables related to each other, which makes it possible to retrieve a new table consisting of data from one or more tables in a single query. This retrieval is a database operation known as a table join. Users communicate with these database systems using the Structured Query Language (SQL). SQL allows for table joins without the user needing to know where tables reside on disk. The SQL syntax offers several operations for reading from and modifying tables in a database. Some frequently used operations are:

• Select, to request information or data from the database.
• Update, to modify existing data.
• Insert, to add data to the database.
• Delete, to remove data.
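As a minimal illustration (the customers and orders tables and their columns are hypothetical and not taken from this thesis), the four operations can look as follows, with the SELECT also performing a table join:

    -- Insert: add a row to the customers table
    INSERT INTO customers (id, name) VALUES (1, 'Alice');

    -- Select: request data, here joining two tables in a single query
    SELECT customers.name, orders.total
    FROM customers
    JOIN orders ON orders.customer_id = customers.id;

    -- Update: modify existing data
    UPDATE customers SET name = 'Alice B.' WHERE id = 1;

    -- Delete: remove data
    DELETE FROM orders WHERE total = 0;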
Traditional SQL databases were architected to run on a single server. However, with internet workloads growing larger than any single server can handle, a need emerged to move from single database servers to distributed servers working in unison as one big logical database. Scalability, the ability to add or remove resources to handle changing demands, became an important feature. A server that is part of a bigger cluster is also called a node. The ability of a database to tolerate node failures is also an essential factor. Being tolerant to node failures means that the database remains intact even if some node (server) crashes, which is important for ensuring that data is always available. For this reason, NoSQL databases became an alternative that is easy to scale and tolerant of node failures. However, NoSQL comes with a trade-off: the lack of table joins and other SQL properties [3]. Then came a new type of database architecture, called distributed SQL, that offers scalability without the drawbacks of NoSQL. Distributed SQL databases can distribute data globally (geo-distribution) while still providing familiar SQL syntax and properties [3]. CockroachDB (CRDB) [4] is one of these distributed SQL databases.

CRDB is a distributed SQL database developed by Cockroach Labs. Interest in it has grown in recent years, especially where consistency, meaning that the database always goes from one valid state to another, and scalability are highly valued. A CRDB database cluster consists of nodes. Each node is a separate logical sub-database, and all nodes in a cluster act together as one big logical database. CRDB scales easily by automatically increasing capacity and migrating data when nodes are added [5]. CRDB is described as resilient to node crashes and highly scalable, while still providing consistency across the whole cluster.

A CRDB node consists of several ranges, depending on the size of the data in the database [5]. Ranges are sub-parts of the data in the database, and they are distributed across different nodes. To ensure that data is not lost if a node crashes or goes offline, ranges are replicated across nodes; the replicated copies of a range are called replicas.
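How a table's data is split into ranges and where the replicas of each range live can be inspected from SQL. A minimal sketch, assuming a CockroachDB version that supports the SHOW RANGES statement and reusing the hypothetical customers table from above (the exact output columns vary between versions):

    -- List the ranges backing the table, including the nodes that hold
    -- replicas of each range and the replica that holds the lease
    SHOW RANGES FROM TABLE customers;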
Every node in a CRDB cluster works as a gateway node [5], which means that every node can establish SQL connections directly with a client. When a client starts a SQL connection to a node, it establishes a network connection that it can use to communicate with the database using SQL syntax. Furthermore, the client can request any data from the database, not just the data stored on that node. The node connected to the client either returns the requested data directly, if it holds that data, or communicates with the other nodes to resolve the request. Because every node is a gateway node, CRDB works well with load balancers [6]. A load balancer distributes and establishes SQL connections evenly across the nodes in the cluster.

Scalability in database management is divided into two categories, horizontal and vertical [7]. Horizontal scaling is the ability to partition data so that each node contains a part of the data and runs on a different server. Vertical scaling refers to adding better hardware to a single installation, such as more memory or a better CPU. The purpose of scaling is to handle changing demands, which often means handling greater throughput due to an increasing number of users or requests. Throughput can be measured as the number of writes and reads per second, or as the number of queries and transactions per second performed by the database. The difference between a query and a transaction is that a query is a single statement, whereas a transaction can consist of multiple queries. Another purpose of scaling is to reduce the latency of requests. In this study, latency refers to the time it takes for a user to send a request and receive a response. This study focuses on horizontal scaling.

This study evaluates CRDB, focusing on latency and how it correlates with the number of simultaneous connections and the size of the cluster. The evaluation is done by setting up CRDB clusters in the cloud and implementing test software that measures latency. Doing this gives insight into how CRDB adapts to the new demands that databases are exposed to.

The outline of this study is as follows:

1. Central concepts are defined, and we look at the architecture of CRDB nodes.
2. The research question answered in this study is presented along with its hypothesis.
3. A description of the related work for this study.
4. The method description, with the choice of method, the data collection, and the delimitations of this study.
5. Presentation of the results from the tests.
6. The discussion, with the interpretation of the results, limitations, conclusion, and future work.

2 Database Theory and Architecture

In an RDBMS, a database consists of one or more tables. A table is a collection of related data with a specified number of vertical columns identified by name. Rows, also known as records, are identified by the values in one or more columns. To make each row in a table unique, one column can be a primary key. A primary key is a constraint on a column that does not allow two rows to have the same value in that column.

Workload refers to the type of job a database performs, and it may differ depending on what requests the database receives. For example, one workload can be that all requests to the database consist of update operations, while another can be that all requests consist of select operations.

A SQL connection establishes a network connection to the cluster, after which a client can make requests to the CRDB cluster using SQL syntax [8].

2.1 Transaction

A transaction starts with a BEGIN statement, which means that the following statements are treated as one transaction. The transaction then continues with one or more statements, such as UPDATE or INSERT.
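A minimal SQL sketch, using a hypothetical accounts table that is not taken from the thesis, of a primary key declaration and of a multi-statement transaction; COMMIT, which ends the transaction and makes its changes permanent, is included for completeness:

    -- The id column is the primary key, so no two rows can share an id
    CREATE TABLE accounts (
        id      INT PRIMARY KEY,
        balance INT
    );

    -- The statements between BEGIN and COMMIT run as one transaction
    BEGIN;
    UPDATE accounts SET balance = balance - 100 WHERE id = 1;
    UPDATE accounts SET balance = balance + 100 WHERE id = 2;
    COMMIT;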