Response time analysis on indexing in relational databases and their impact

Sam Pettersson & Martin Borg
18 June 2020

Dept. Computer Science & Engineering
Blekinge Institute of Technology
SE–371 79 Karlskrona, Sweden

This thesis is submitted to the Faculty of Computing at Blekinge Institute of Technology in partial fulfillment of the requirements for the bachelor's degree in software engineering. The thesis is equivalent to 10 weeks of full-time studies.

Contact Information:
Authors:
Martin Borg, E-mail: [email protected]
Sam Pettersson, E-mail: [email protected]
University advisor: Associate Professor Mikael Svahnberg, Dept. Computer Science & Engineering

Faculty of Computing, Blekinge Institute of Technology, SE–371 79 Karlskrona, Sweden
Internet: www.bth.se
Phone: +46 455 38 50 00
Fax: +46 455 38 50 57

Abstract

This is a bachelor thesis concerning the response time and CPU effects of indexing in relational databases. It analyzes two popular databases, PostgreSQL and MariaDB, using a real-world database structure with randomized entries. The experiment was conducted with Docker and its command-line interface, without cached values, to ensure fair outcomes. The procedure was carried out across seven different create, read, update and delete queries with multiple volumes of database entries to discover the databases' strengths and weaknesses when utilizing indexes. The results indicate that indexing has an overall enhancing effect on almost all of the queries. Join, update and delete operations are found to benefit the most from non-clustered indexing. PostgreSQL gains the most from indexes, while MariaDB shows a smaller reduction in response time. A greater usage of CPU resources does not seem to correlate with quicker response times.

Keywords: Database index, MariaDB, PostgreSQL, response time

Abbreviations and Definitions

Column exists inside tables and gives information about which type an object is.
Counterpart refers to the corresponding database version (either with or without non-clustered indexes).

CPU (Central Processing Unit) is a computer component that processes instructions. It runs the operating system, applications and much more, constantly receiving input from the user or software programs.

CRUD is a database term that stands for create, read, update and delete.

Database management system (DBMS) is a group of programs that gives a user access to database entries and the power to change, add or remove information stored inside them [3].

DDL (Data definition language) comprises the SQL commands that can be used to create and modify the structure of database objects.

DML (Data manipulation language) comprises the SQL commands that handle the manipulation of the database data.

Docker is a container engine which runs containers on top of the operating system and utilizes the features of the Linux kernel [49]. This makes it possible to run multiple, different systems on one host.

Foreign key (FK) is a column in a database table that provides a link between data in two tables. FKs act as cross-references between the tables and are used to reference the primary key of another table.

Hierarchical is a data model that organizes data in a tree-like structure.

Indexes are data structures that improve the speed of most retrieval operations on database tables.

ISO/IEC refers to data management standards for local and distributed system environments, which promote the harmonization of data management facilities in different areas [55].

Primary key (PK) is a database key that is unique for every record. It acts as an identifier, such as a vehicle identification number or a telephone number.

Query is executable code that accesses or modifies data.

Record (Database record) is a set of data stored inside a table.

Response time (latent period, latency) is a measurement of how quickly a system responds to input.
Row is present in a table and represents an object of information, also known as a record.

Storage engine (database engine) is the root software component that a DBMS uses to alter information in a database. It is most often unique to the database itself and allows the user to interact with the engine without going through the DBMS interface [22].

Structured Query Language (SQL/RDBMS/Relational databases) is a standard language for communicating with a relational DBMS (RDBMS). The data has to be predefined inside tables or columns. It is particularly useful when one works with systematic data (relations between entities and variables) [3, 1].

Table is a structure that contains rows and columns and can hold different types of data.

Contents

Abstract
Abbreviations and Definitions
1 Introduction
1.1 Scope and Limitations
1.2 Research Questions
2 Literature Study
2.1 Literature Study Methodology
2.1.1 Search results
2.1.2 Selection of databases
3 Theory
3.1 Relational Databases
3.1.1 Structured query language
3.1.2 Database schema
3.1.3 Data definition and manipulation language
3.1.4 Executing queries
3.1.5 Join operation
3.1.6 Compound statements
3.2 Keys
3.2.1 Primary keys
3.2.2 Foreign keys
3.3 Indexes
3.3.1 Index types
3.3.2 Index structures
3.3.3 Indexes impact on CPUs
3.4 Docker
3.4.1 Dockerfiles & images
3.4.2 Docker containers & compose files
4 Method
4.1 Empirical
4.1.1 Hardware and software setup
4.1.2 Preparation
4.1.3 Experiment
4.2 Statistical Analysis
4.3 Execution
4.3.1 Digressions
5 Results
5.1 The Effects of Indexes
5.1.1 Query #1 - Select
5.1.2 Query #2 - Inner join
5.1.3 Query #3 - Outer join
5.1.4 Query #4 - Inner & Outer join
5.1.5 Query #5 - Insert
5.1.6 Query #6 - Update
5.1.7 Query #7 - Delete
5.2 The Results as a Whole
6 Analysis and Discussion
6.1 Overview
6.2 Research Questions
6.2.1 RQ1 - How much do indexes on foreign keys impact the database's speed when retrieving data, compared to using no non-clustered indexes during a set of select and join queries?
6.2.2 RQ2 - In what way does indexing affect the database's response time when inserting, deleting and updating data?
6.2.3 RQ3 - To what degree does indexing impact the CPU performance when analyzing the data from research questions one and two?
6.3 Threats to Validity
6.3.1 Internal validity
6.3.2 Conclusion validity
6.3.3 External validity
6.3.4 Construct validity
7 Conclusions and Future Work
References
A MariaDB SQL Code
A.1 init_maria.sql
B.2 init_maria_no_index.sql
B Python Code
A.1 main.py
B.2 generate/__init__.py
C.3 generate/generateRandomUser.py
D.4 generate/generateRandomPost.py
E.5 generate/generateRandomThread.py
F.6 generate/generateCategories.py
G.7 generate/utils.py
H.8 generate/config/values.json
C Dockerfiles
A.1 docker-compose.yml
B.2 Dockerfile_maria
C.3 Dockerfile_postgre

Chapter 1
Introduction

Databases are stores of organized information that in most cases are located electronically in a computer or server. A database is controlled by a database management system (DBMS), which can add, alter, remove and display information [20].

The first DBMS came in the 1950s, when computer software started to emerge, which made computers more than just giant calculators with minimal storage capacity. It was when businesses started using more software, however, that storage became important. Charles W. Bachman designed the first DBMS in 1960 [59]. Bachman would later form the Database Task Group, which handled the standardization of the programming language Common Business Oriented Language (COBOL) that was presented in 1971. This became known as the CODASYL approach [51, 52].

Today, in 2020, things are different. There is a plethora of databases one could choose from, ranging from SQL (relational database, RDBMS) and NoSQL to object databases, along with others [3]. This thesis studies relational databases because of their popularity and architectural differences, as the feature this thesis analyzes operates in different approaches