Testing SQL-compliance of current DBMSs

Elias Spanos

Master of Science School of Informatics University of Edinburgh 2017 Abstract

Database Management Systems (DBMSs) are widely used in various fields ranging from online banking, web systems and other online services and are very crucial in the day to day management of data. The correct and efficient operation of such sys- tems is an important factor when choosing the proper DBMS software. In the past few decades, the amount of data has increased exponentially, causing a rapid increase in the demand of such systems that can store, organise and manipulate data. With these requirements in mind, different vendors have implemented their own DBMS. However, since these DBMSs have been implemented with a significant amount of differences, a Standard was proposed in order to provide a common way of using such systems. Even though, a Standard has been introduced and adopted by most DBMS vendors, there are some differences that still exist, probably due to the difficulty of interpreting and implementing all the parts of the Standard, and also due to other issues regarding the performance of their systems. Hence, there is a need for alleviating such a prob- lem by exposing such issues between DBMSs in order to evaluate their conformance to the common Standard. In particular, this project investigates five popular DBMSs implementations and evaluate their conformance to the SQL Standard. A crucial ques- tion that will be investigated by conducting this project is whether DBMSs have been implemented the SQL Standard in the same way. The SQL language should pledge that identical SQL code should always return identical answers when it is evaluated on the same database independently of which DBMS is running on. The aim of this project is the implementation of a random query generator and a comparison tool for investigating and highlighting the differences that may exist among current DBMSs. Further, it aims to provide a detailed explanation in regards to the SQL Standard of potential differences and explain how they might affect the transition between current DBMSs.

i Acknowledgements

I would like to thank my supervisors Paolo Guagliardo and Leonid Libkin who were always willing to advise me and help me in order to overcome any difficulty. In addition, I want to thank my family who is always by my side.

ii Declaration

I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification except as specified.

(Elias Spanos)

iii Table of Contents

1 Introduction1 1.1 Motivation...... 3 1.2 Related Work...... 3 1.3 Thesis Structure...... 4

2 Background5 2.1 The SQL Standard...... 5 2.2 The SQL Language...... 6 2.3 Commands of SQL...... 6 2.3.1 Missing values...... 12 2.4 SQL Standard issues...... 14 2.5 The Database Management Systems...... 15

3 Methodology 17 3.1 Methodology...... 17

4 Implementation 19 4.1 Internal Implementation...... 19 4.1.1 Random Query Generator Tool...... 20 4.1.1.1 Configuration file...... 21 4.1.2 Comparison Tool...... 23 4.1.3 Random data generator tool...... 24

5 Experimental Evaluation 26 5.1 The experiment Setup...... 26 5.1.1 Performance Evaluation...... 27 5.2 Experiment Results...... 28

iv 6 Conclusion 46 6.1 Conclusion...... 46 6.2 Summary of the findings...... 47 6.3 Suggestions for future work...... 48

A Source Code 50

B Compilation & Execution Instructions 51

Bibliography 54

v List of Figures

1.1 Abstract DBMS architecture...... 2

2.1 Popularity of modern DBMSs...... 15

3.1 High level overview of the framework...... 18

4.1 Complete architecture of the framework...... 19 4.2 Abstract Architecture of Comparison Tool...... 23 4.3 Method of importing csv files...... 25

5.1 Performance Evaluation...... 27

vi List of Tables

2.1 Aggregation Commands...... 8 2.2 Complex Conditions...... 9 2.3 SET Commands...... 10 2.4 Data Types...... 11 2.5 String Commands...... 11 2.6 OR Truth Table...... 13 2.7 AND Truth Table...... 13 2.8 NOT Truth Table...... 13

5.1 Statistics of Generated queries...... 29 5.2 Difference #1...... 30 5.3 Difference #2...... 31 5.4 Difference #3...... 31 5.5 Difference #4...... 32 5.6 Difference #5...... 33 5.7 Difference #6...... 34 5.8 Difference #7...... 35 5.9 Difference #8...... 36 5.10 Difference #9...... 37 5.11 Difference #10...... 37 5.12 Difference #11...... 38 5.13 Difference #12...... 39 5.14 Difference #13...... 40 5.15 Difference #14...... 41 5.16 Difference #15...... 41 5.17 Difference #16...... 42 5.18 Difference #17...... 43

vii 5.19 Difference #18...... 43 5.20 Difference #19...... 44 5.21 Difference #20...... 45

6.1 Summarised Results...... 47 6.2 Summarised Results...... 48

viii Chapter 1

Introduction

Database Management Systems (DBMSs) have thrived since their first appearance, and they remain the dominant manner for storing various kinds of information. Specif- ically, DBMSs have been extensively used in many fields and has been widely used by almost all companies as they provide a relatively easy way of performing various operations on data, such as insertion, deletion and modification [15, 21, 22]. For this reason, applications can be implemented efficiently and reliably without the need for handling low-level issues such as concurrent and efficient access of data which it is taken cared by DBMS. Hence, the primary role of a DBMS is not only to store data but also to provide a common interface for manipulating it. Figure 1.1 shows from a high-level point of view the structure of a modern DBMS. Moreover, the relational model (RM) is the mainstream model for the DBMSs for managing data and it is by far the most popular model for current DBMSs. It was proposed by Edgar F.Codd in 1969 [6,7] and it has brought a revolution in the area of data management due to its simplicity. A database that uses the relational model is known as a relational database. According to the relational model, data is stored in a database as tables, and each table is composed of rows and columns. Also, each row of a table is known as a tuple based on RM and each column as characteristic or attribute. In fact, in early stages, each database system had its own interface, and migrat- ing applications from one system to another was a slow and difficult task, since an almost complete rewriting of the code would be required. However, as these systems were promising from their first appearance, a standardised language was unavoidable. Structured Query language (SQL) has become a standardised language for querying and managing data and has rapidly become the most widely used DBMS language [4]. Since then, comparable languages have been emerged over the years, however, SQL

1 Chapter 1. Introduction 2

Figure 1.1: Abstract DBMS architecture has persisted to be the dominant language since it has been easy to learn. SQL users and programmers can take advantage of SQL language in order to learn a new lan- guage which is used by essentially all modern DBMSs, and they can write SQL code that with minor changes it can be run on any database system. On the other hand, a sig- nificant disadvantage of programming languages is that each language has its benefits and usage and they cannot always run on any system. In particular, such systems have been extensively studied and tested for their cor- rectness and efficiency. Software testing is an essential approach for testing and eval- uating the quality of such systems [5]. However, implementing a software that can be used to assess complex systems, it usually requires sophisticated and large-scale im- plementation. On the contrary, by automating the process of testing is the only way to have a systematic and effective way of testing such complex systems. Since DBMSs are the most popular systems for storing and manipulating data, many studies have conducted focusing mainly for evaluating their performance. For example, TPC-H is a popular benchmark for evaluating vendors DBMSs implementations [11]. However, this benchmark is designed for analyzing performance and it generates a relatively small number of SQL queries. Chapter 1. Introduction 3

1.1 Motivation

The SQL Standard was established a long time ago, and it aims to provide a common interface among modern DBMSs [2, 12]. As each vendor provides its own DBMS implementation, the extend of which these systems implementations comply with the SQL Standard is essential and needs to be studied. Database Systems have evolved rapidly, and new features have been continuously added. As a result, users, programmers, and companies that make use of such systems are facing unprecedented problems when it comes to migrating their existing SQL code to another DBMS as most of the time the SQL code is partially portable. Mainly three major reasons are involved in the aforesaid problem. Firstly, DBMS vendors do not always comply immediately with the latest Standard, since they have to cope with many considerations such as performance issues and better system management, or even sometimes it is not practical to implement the entire Standard. Therefore vendors usually adopted new changes of the updated Standard into their next product releases. Secondly, DBMS vendors offer their own feature in order to be distinctive among other DBMSs which make SQL code less portable. Lastly, the Standard is written in a natural language which makes the process of interpreting all the aspect in exactly the same way almost impossible. Currently, there is no well-known framework that can be used to test the SQL- Compliance among these systems and be capable of detecting any incompatibilities or semantic issues. Hence, this project aims to build a complete framework for system- atically discovering and highlighting both existing and new differences that may arise between popular DBMSs. Afterwards, experimental evidence with a comprehensive explanation with respects to the SQL standard is provided. Unfortunately, DBMSs do not always provide a meaningful message whenever an SQL code has a syntax er- ror. As a consequence, it can take considerable time to detect and resolve any issue. By conducting this project, it is also intended to disclose all the incompatibilities that may exist and highlight all the differences, with such a way that it will make users, programmers and vendors aware of the current problems.

1.2 Related Work

As DBMSs are necessary for many fields some work has been conducted for evaluating whether different vendors have implemented various points of the Standard similarly. Chapter 1. Introduction 4

In fact, some parts of the Standard is interpreted differently, and the SQL code is not portable among these systems. Albeit some differences are demonstrated [4, 17, 20], there is no well-known tool for evaluating these systems and automatically identify any issues and incompatibilities. Queries that execute without raising any error but are interpreted differently by DBMSs can lead to incoherent results. Some differences in DBMSs behaviours have been detected by users from experience with various systems and are reasonably well documented. But more subtle differences are harder to expose and may go unnoticed. In this work, we present a tool that may assist users in this task.

1.3 Thesis Structure

The structure of the remainder of this thesis report is organized as follows: Chapter 2 introduces SQL language, SQL Standards, the core commands of SQL language and briefly describes the usage of the most important commands, and af- terwards the main issues are introduced by illustrating problematic SQL queries. In addition, an overview of the most popular DBMSs is provided. Chapter 3 describes the main methodology for this thesis project and briefly ex- plains the architecture which is implemented for systematically checking the SQL- compliance of current DBMSs. Chapter 4 provides a detailed explanation about the complete framework and an explanation for each individual tool that the framework is composed of. Also, it briefly discusses the main issues that are addressed. Chapter 5 illustrates the main findings and provides an explanation about the dif- ferences between current DBMSs with appropriate reference to the SQL standard. Chapter 6 concludes this project with explanation of what is been achieved, and a summarised table of all the differences is provided. Afterwards, futures ideas and extensions are provided. Chapter 2

Background

2.1 The SQL Standard

DBMS from its first appearance shows that it will be the dominant trend for storing and managing data. Consequently, different implementations have emerged from various vendors and inevitably, a standardised language should be implemented in order to provide portability among current systems. If applications were implemented using only SQL commands which are defined in that standard and vendors implemented these commands in exactly the same way, then SQL code could be migrated on any DBMS without the need to be adopted. The first appearance of SQL language was in 1970 where IBM developed its first prototype of Relational Database Management System (RDBMS) [7, 10]. Subse- quently, the first SQL standard arose in 1986 by the American National Standard In- stitute (ANSI) with the name of SQL-86 for bearing conformity among vendors im- plementations. Since then, different flavors of the Standard have emerged for revising previous versions or for adding new features such as SQL-87, SQL-89, ANSI/ISO SQL-92 and ANSI/ISO SQL: 1999 which has also been approved by the International Standards Organization (ISO) [2, 12, 14, 23, 25]. The SQL Standard has been con- tinuously developed with the current version being SQL:2016 or ISO/IEC 9075:2016. The ANSI SQL Standard is divided into several parts, and this project focuses on the SQL/Foundation part. In particular, this part contains central elements of SQL. Ex- planations about the findings are given according to SQL:2016 since it is the newest version of the Standard, though this part remains the same in comparison with earlier flavors.

5 Chapter 2. Background 6

2.2 The SQL Language

SQL operations are in the form of commands known as SQL statements. In particular, SQL is composed of primarily two sub-languages, the data definition language (DDL) and the data manipulation language (DML) [10, 13]. DML is a part of the SQL lan- guage that is composed of a family of commands like any programming language and it is used for the creation of queries for inserting, modifying and deleting rows in a database. This sub-language is consisted of ’SELECT-FROM-WHERE’ commands as to be fundamental for any query. In this project, we particularly focus on DML since we want to generate a diversity of random SQL queries. Also, the SQL Standard sup- ports more complex rather than just simple commands for performing various tasks on data. For example, aggregation functions such as ’SUM’, ’MAX’, ’MIN’, and ’AVG’ are used in combination with ’GROUP BY’ and ’HAVING’ SQL statements. The pur- pose of having such commands is to perform a calculation on specific columns in order to return a value. For example the calculation of average salary of a department by per- forming a ’GROUP BY’ based on all the salaries of the employees of that department. DDL is also part of SQL language and can be used to create, modify, delete tables and views, and usually DDL statements start with keywords ’CREATE’, ’DROP’ and ’ALTER’. Also, it supports a command that provides the capability of defining new domains. Moreover, in general, tables and rows are denoted as relations and tuples. Hence, SQL is extremely popular as it offers two capabilities. Firstly, it can access many tuples using just one command and secondly it does not need to specify how to reach a tuple, for example by using an index or not.

2.3 Commands of SQL

In this section, a high-level description of the basic SQL commands is given, as defined in the Standard. Due to the fact that these commands are used to generate random queries, it is imperative to give a short presentation of each command and its purpose. Chapter 2. Background 7

SQL Basic Structure

SELECT [DISTINCT] columns_list FROM tables_list WHERE condition

condition := comparison1 AND comparison2 comparison1 OR comparison2 NOT comparison

term := attribute value

comparison := term1 op term2(op ε {=, <>, <, > <=, >=}) term IS NULL term IS NOT NULL

The SQL basic structure represents the basic SQL commands which are used to re- trieve data from a database, based on some conditions that are specified in the ‘WHERE’ clause. Each SQL query should have at least ‘SELECT’ and ‘FROM’ clause. The ‘WHERE’ clause is optional. The basic SQL query is executed as follows: all the rows of the tables list (‘FROM’ clause) are evaluated. Each row that satisfies the search conditions is selected. Then, only the specified columns of the selected rows appear in the output results. In addition, the ‘DISTINCT’ keyword is optional and can be used to eliminate duplicate rows from the output results. Also, instead of having columns list in the SELECT clause, the “*” keyword can be specified which indicates that all the columns of the tables list will appear in the results. The purpose of the ‘WHERE’ clause is to filter the results, and normally SQL queries contain a ‘WHERE’ clause. Conditions compare expressions using comparison operators. A comparison operator is used to compare two expressions, and logical connectivities, such as AND, NOT and OR are used to connect the comparisons. Chapter 2. Background 8

Example of basic query:

SELECT St.Student_Name FROM Students AS St WHERE St.age >= 20 AND St.age <= 24

The above example is built based on the basic SQL structure. Thus, it retrieves all students’ name who are between 20 and 24 years old.

SQL Basic aggregation Syntax:

SELECT [DISTINCT] Columns_list1 FROM Tables_list WHERE Condition1 GROUPBY Columns_list2 HAVING Condition2

SQL query with aggregation is used to perform a calculation on specific columns in order to return a value. GROUP BY and HAVING commands are used to perform an aggregation. HAVING command is optional.

Table 2.1: Aggregation Commands

Operator Use MIN() Finds the minimum value of a column COUNT() Counts the total number of rows MAX() Finds the maximum value of a column SUM() Calculates the sum of all values of a column AVG() Calculates the average of all values of a column

Aggregate commands can be used both in SELECT and HAVING clause, and in combination with GROUP BY clause. The example below illustrates the proper use of aggregation commands. Chapter 2. Background 9

Example:

SELECT COUNT(St.student_id), St.Country FROM Students AS St GROUPBY St.Country HAVING COUNT(St.student_id) > 3

The above query makes proper use of the aggregation commands. In particular, it lists the number of students in every country where there are more than three students in this Country.

Table 2.2: Complex Conditions

Operator Use EXISTS Returns true, if there is at least one row in the query query Row Op Returns true, if all the comparisons with the operator OP return true ALL Row Op Returns true, if at least one comparison with the operator returns true ANY Row Op IN Returns true, if at least an element exist in a given set (=ANY)

Example:

SELECT * FROM Students AS St WHERE St.Country IN ( UK , Netherland )

The query above retrieves all the students who come from UK and Netherland. Chapter 2. Background 10

Table 2.3: SET Commands

Command Use UNION [ALL] Returns the combination of the results of two SQL queries INTERSECT Return the rows that appear exactly the same in both SQL queries [ALL] EXCEPT [ALL] Return each row that appears to the first query but does not appear to the second query

By default, SET commands remove duplicates in an SQL query result. However, we can keep duplicates in the result, by adding the optional ALL keyword for any of the commands above.

Example:

SELECT Country FROM Students UNION SELECT Country FROM Professor

The above query retrieves the Countries from where both Students and professors come from.

Data Types: Since a variety of data is stored in a database, data types are introduced by the Standard, and some of them should be implemented by all DBMS vendors. Data types indicate the type of data that a column can hold. We provide below a detailed explanation of the most important data types for our project; however, there are numerous data types rather just only these. Chapter 2. Background 11

Table 2.4: Data Types

Types Description SMALLINT Can store a number from a range between -32768 and 32767 INT Can store a number from a range between -2147483648 and 2147483647 BIGINT Can store a number from a range between -9223372036854775808 and 9223372036854775807 VARCHAR Takes as input the length of a variable string which can contains up to 255 characters CHAR Takes as input the length of a fixed size string which can contains up to 255 characters

Table 2.5: String Commands

Operator Use TRIM() Returns the string without leading/trailing characters SUBSTRING() Retrieves a subset of the given string CONCAT() Concatenate two or more strings REPLACE() Replaces a subset of a string with another string LIKE Returns true, if an attribute matches with a pattern

Example:

SELECT SUBSTRING ( SQLSTANDARD , 1, 3 ) AS SQLExtraction

The SQL above query extracts a part of the initial string which is given as pa- rameter to the SUBSTRING. The prototype of the substring function is as follows: SUBSTRING(initial string, start position, end position). Thus, the output result is: SQL

Operator Precedence Operator precedence is an important concept for understanding how an SQL query is evaluated and it is also significant for determining if major DBMSs follow the same Chapter 2. Background 12 operator precedence. There are cases where the expressions in the WHERE clause are quite complicated, and the operator precedence defines in which order the expres- sion should be evaluated. The sequence of operators is provided as follows; classified according to their priority, starting from those with the highest priority:

• () • +, - , • *, /, • = , >, <, >= , <= , <> • NOT , AND • ALL , ANY , IN , LIKE • = - variable assignment.

Operators that have the same priority are evaluated from left to the right. In addi- tion, parenthesis abrogate the priority of the rest operators and as a result expressions that are enclosed by a parenthesis are evaluated first.

2.3.1 Missing values

This section aims to provide a basic background regarding null values and present the problems that can be arisen from the presence of nulls in a database. In the chapter of experiments, it is illustrated that many problems can appear. SQL uses null marker for missing or unknown values in a database, and for that reason null is a reserved word [24]. It is worth mentioning that null should not be confused either with zero value or an empty string. However, Oracle’s database treats the empty string as null [3, 19], and for this reason new issues are arisen. An important consideration is that it cannot be tested if a value of a field is null using usual common comparison operators such as <>, =, < and > but instead IS NOT NULL and IS NULL commands are used. In general, the existence of nulls in a database is the fundamental source of issues and incompatibilities among current DBMS [16, 18]. Also, the evaluation of each com- parison with the existence of nulls is performed based on a three-valued logic (3VL) which is an extension of the common Boolean logic. In Boolean logic, an expression can be evaluated only to two values such as true and false, where the negation evaluate to the opposite values. In contrast with 3VL, where there is an addition value called unknown, and the opposite of it remains the same. In addition, all comparisons in- volving null should be resulted to be unknown according to SQL Standard. Below it is illustrated a truth table for the different comparisons with the suitable outcomes. Chapter 2. Background 13

Table 2.6: OR Truth Table Table 2.7: AND Truth Table

OR True False Unknown AND True False Unknown True True True True True True False Unknown False True False Unknown False False False False Unknown True Unknown Unknown Unknown Unknown False Unknown

Table 2.8: NOT Truth Table

NOT True False False True Unknown Unknown

In an SQL query, each tuple which is evaluated to true is returned in the output result, and tuples that are evaluated to either false or unknown are not returned to the output results. Comparisons involving nulls are considered as false in the WHERE Clause. Hence, if we take into consideration that only Oracle’s database treats an empty string as null then it can be realized that many problems can arise.

Examples

NULL = 10 is evaluated to Unknown

NULL = NULL is evaluated to Unknown

NULL <= 3 is evaluated to Unknown

SELECT St.Student_Name FROM Students AS St WHERE NULL <= 20 Chapter 2. Background 14

The above query will result in an empty set since for each row the WHERE clause will be evaluated to unknown and none of the rows will appear in the output result.

SELECT St.Student_Name FROM Students AS St WHERE NULL = NULL

Logically it is expected that the NULL = NULL should be evaluated to true. However, this expression is evaluated to unknown for every row, and the output result of this query is an empty set.

2.4 SQL Standard issues

An example is provided below which demonstrates that indeed some aspects of the Standard is implemented differently by different vendors. Example: The following SQL query does not return identical output results on both Post- greSQL and Oracle’s database [17].

Query Q1:

SELECT * FROM ( SELECT S.A,S.A FROM S)R

While query Q1 is expected to return identical results, independently of which system is executed on, this is not the truth since this query will output a table with two columns named A in PostgreSQL. On the other hand, in Oracle’s database, the query will return a compile-time error. Indisputably, there are indeed differences between current DBMSs [17]. It can supposed that in most of the cases these differences are minor but if we take into account that these systems are used in many different fields, then small differences might be critical. Chapter 2. Background 15

2.5 The Database Management Systems

Microsoft SQL Server: is a relational database management systems (RDBMS) de- veloped by Microsoft. Microsoft offers different edition of its product such as Stan- dard, Enterprise and Express edition. Express edition is free. All the editions are mainly available on Windows platform. MySQL: is an open-source RDBMS which is freely available, and Oracle owns it. MySQL is used by huge companies such as Google, Facebook and Youtube. It is available on MacOS, Linux and Windows platforms. PostgreSQL: is an open-source RDBMS which is trying to comply with the SQL Standard, and it primarily focuses on portability. It is freely available on both Windows and Linux platforms. IBM DB2: is an RDBMS developed by IBM, and it is available on Linux and Windows platforms. Oracle Database: is an RDBMS developed by Oracle Corporation which is avail- able on MacOS, Linux and Windows platforms.

Figure 2.1: Popularity of modern DBMSs

Figure 2.1 illustrates the most popular DBMSs between 2013 and 2017 using a calculated score based on Google Trend, the number of mentions of DBMSs on the web, and discussion of the systems in StackOverflow and DBA Stack Exchange [1]. It can be seen that the popularity of Oracle database, MySQL, and Microsoft SQL Server remain constant over the years with a relatively high score. Also, there is a steady rise Chapter 2. Background 16 for PostgreSQL in the last three years. Lastly, IBM DB2 popularity has decreased slightly in the last year. Chapter 3

Methodology

3.1 Methodology

As mentioned earlier, the fundamental idea of the current project is to investigate whether current DBMSs comply according to the Standard, and highlight the key is- sues and incompatibilities among these systems. The insight of our approach is to generate a massive number of random queries in order to execute the same queries against numerous vendor’s DBMS, and then to com- pare and verify the output results. The output result of each DBMS is expected to be identical since all the systems will contain the same databases. In this way, the results can be checked for compilation errors, mismatches, and issues/incompatibilities.

Thus, our main contribution is as follows:

• The implementation of a fully-configurable random query tool that can be used to generate a diversity of syntactically correct SQL queries based on a database schema. • The implementation of a comparison tool that can be used for executing each randomly generated query on all DBMSs and compare the output results. • The use an open-source tool called Datafiller for generating realistic data, which is finally imported into the databases. • The tools, as mentioned earlier, can be used as the core framework to run multi- ple experiments in order to detect issues/incompatibilities and compilation-time errors. • In the end, all the issues/incompatibilities and errors are presented with an ap- propriate explanation to the Standard, together with a summarised table.

17 Chapter 3. Methodology 18

Figure 3.1: High level overview of the framework

In particular, we examine five popular DBMSs such as MySQL, IBM DB2, Mi- crosoft SQL Server, PostgreSQL and Oracle database based on their SQL-Compliance. A detailed explanation of the internal implementations and structures of the aforemen- tioned tools is given in the following Chapter. Chapter 4

Implementation

This chapter describes the implementation details of the framework by explaining the different tools which were implemented.

4.1 Internal Implementation

Figure 4.1: Complete architecture of the framework

The implemented framework is composed of three different tools, as introduced in the previous chapter. The random query generator tool and the comparison tool are implemented in Java programming language, whereas the Datafiller tool is imple-

19 Chapter 4. Implementation 20 mented in the Python programming language which is an open-source project that can generate random realistic data. This project uses Java programming language since it is prevalent. It can also be used to build cross-platform systems and our implemented framework consists of several Java classes.

4.1.1 Random Query Generator Tool

The random query generator tool (RQGT) is used to generate valid and syntactically correct SQL queries based on a database schema. In particular, RQGT can automat- ically retrieve the database schema either from PostgreSQL or MySQL. Even though the RQGT runs in a stand-alone manner, it can also run in conjunction with the Com- parison Tool (CT).

The phases of which the RQGT proceeds can be defined as follows:

• During the initialization phase, both the database schema and configuration file are retrieved. • During the run-time phase, the tool generates random SQL queries continuously, based on database schema and parameters which are specified in the configura- tion file. • Each query is pipelined to the Comparison Tool, which is then executed against the DBMSs.

Generating a very large number of syntactically correct SQL queries is a complex task since many considerations must be taken into account. Since the SQL language consists of numerous SQL commands and functions, some of them are simple in terms of their usage such as SELECT...FROM...WHERE, where some others are more trick- ier such as GROUP BY, HAVING or aggregation functions such as MAX, COUNT or AVG. These SQL commands and functions need to be properly generated and syntacti- cally correct. As a result, for generating meaningful and syntactically valid queries, the implementation of the generator tool must take into consideration many factors (i.e., attribute naming, data type compatibility and selected columns). In addition, the tool has been designed in a modular way for re-usability, and there- fore new systems can be easily supported without the need of changing the whole structure of the tools. Hence, an internal representation is implemented with each of its classes responsible for generating a different SQL clause that contributes to the overall query. For example, one of the java classes generates the SELECT clause. Chapter 4. Implementation 21

Having different classes for each SQL statement makes it easier to extend the tool in order to add new functionality and at the same time there is no need to change other parts of the tool. The final SQL query is converted to an SQL string which then is executed to the current DBMSs with the contribution of the comparison tool. An important note is that it is not feasible to generate SQL strings directly because the attribute names and data types need to be tracked first. If SQL strings were to be used, it would not be possible to check if a variable mentioned in the WHERE clause is also mentioned in the FROM clause or if that variable comes as a parameter from an external query.

4.1.1.1 Configuration file

The RQGT is a fully-configurable tool in order to make it feasible to test various sce- narios. A detailed explanation of all the parameters is given subsequently. Another important decision that should be taken into account is the provision of relations and attributes to the tool. An initial approach was to use them as parameters in the configu- ration file. Although this approach works acceptably, the resulting tool lacks portabil- ity, especially when a DBMS has a large number of tables and columns. Hence, It will be time-consuming to give these parameters via the configuration file. Thus, an effi- cient approach is to retrieve the whole schema from DBMSs automatically. As a result, the implemented tool has the capability to automatically retrieve the whole schema for either MySQL or PostgreSQL just by providing the credential for connecting to the database either in the configuration file or as input parameters to the tool.

All the parameters are described as follows:

• relations and attributes = parameters are used to provide to the query generator tool the tables and columns for generating SQL queries according to the database schema. The current architecture has the capability to automatically retrieve the whole database schema from any DBMS by just providing the credentials in the configuration file such as user, pass and dbName parameters. • MaxTablesFrom = parameter is used to set an upper bound of the number of tables that a FROM clause can have. If the upper bound is greater than the total number of tables in the schema, then the upper bound is automatically set to the total number of tables. • MaxAttrSel = parameter indicates an upper bound of projections in the SELECT Chapter 4. Implementation 22

clause. In other words, it is the total number of attributes that can be selected in the SELECT clause. • MaxCondWhere = parameter represents the total number of comparison that the WHERE clause can have. • ProbWhrConst = parameter represents the probability of having comparisons with constant or Null in the WHERE clause. Therefore, a number between zero and one can be given for this parameter. • RepAlias = parameter indicates the probability of having repetition of alias in the SELECT clause. • NestLevel = parameter represents the maximum level of nesting that a query can have. For generating such a query many considerations should be taken into ac- count. For example, we should track attributes for outer queries, because inner query can access outer attributes or attributes from its FROM clause. The oppo- site is not true, meaning that we cannot access attributes from an inner query. • ArithCompSel = parameter represents the probability of having arithmetic compar- isons in the SELECT clause. • Dinstinct = parameter represents the probability of the DISTINCT command ap- pearing in the SELECT clause. • StringInSel = parameter indicates the probability of projecting an attribute of type String or having String functions in the SELECT clause. • StringInWhere = parameter represents the probability of having String compar- isons in WHERE clause. • Rowcompar = parameter represents the probability of having row comparisons (com- parison with more than one attributes) in the WHERE clause. • isNULL = parameter indicates the probability of having ISNULL or IS NOT NULL conditions in the WHERE clause. • isSelectAll = parameter indicates the probability of having * command in the SE- LECT clause. • compTool = parameter indicates the total number of queries that will be generated and executed against the DBMSs. In particular, if the value given to this parame- ter, is equal to one, then only one query is generated without to executing on any DBMS. Otherwise, x (where x the value given to this parameter) random queries will be generated and executed on DBMS.

It can be seen from the configuration file that many parameters of the random query generator can be controlled but, on the other hand, it does not imply that the diversity of Chapter 4. Implementation 23

SQL queries that can be generated is restricted. For example, an upper bound of tables that appear in the FROM clause can be set. In that way, having an enormous table size from Cartesian product can be avoided and in addition we can generate random queries under various scenarios.

SQL Query generated by the random query generator

SELECT r51.c1 AS A1 , (482/AVG(r51.c2) ) AS AGGR0 FROM r5 AS r51 WHERE (r51.c3 = ’Schmedeman Park’) AND ’Heath Avenue’ < r51.c3 GROUPBY r51.c1 HAVING MAX(r51.c1) <= MIN(r51.c1)

4.1.2 Comparison Tool

The generated SQL queries are evaluated on five different DBMSs, as they were men- tioned in the methodology chapter. The comparison tool (CT) takes into consideration all the ways of accessing different databases so that the process of executing and com- paring identical queries among all databases can be automated. The CT is illustrated below and a detailed explanation of its internal implementa- tion is provided.

Figure 4.2: Abstract Architecture of Comparison Tool Chapter 4. Implementation 24

The tool is implemented in Java programming language and is one of the main components of the complete framework. Specifically, the CT is used to compare the results of each randomly generated query. As each DBMS uses various algorithms to evaluate each query, sometimes the results are identical but the order or format differs. As a consequence, for efficient comparison of the query results between DBMSs, the following approach is used: Initially, the query results are stored in a LinkedList, so that an efficient in-memory sorting algorithm can be performed, called Quicksort. In that way, the results among the DBMSs should be the same and a row by row compar- ison is performed to verify that they are identical. If a difference is found or a query raises an error, on one of the tested DBMSs, then an explanation of the difference or the error message is recorded in the log file. An important consideration for implementing this tool is its ease of reuse and ex- tensibility by other systems or software. As each system needs its own credentials such as username, password and database name, a configuration file called compar- isonTool.properties is created for providing these information to the CT. By doing so, the CT becomes portable and adaptable and a new system can be easily added. As shown in the experiments chapter, our approach can detect many important differences and issues.

4.1.3 Random data generator tool

Generating random databases for testing DBMSs is a complex task since meaningful data should be produced in order to reveal corner cases. However, if we take into consideration that different interpretations or missing features can be still revealed, then by generating a small size of data, we can increase the efficiency of our evaluation. Datafiller is a well-known open-source project that provides the capability of gen- erating random data [8,9]. For our project, Datafiller will be used for generating a diversity of database instances in order to evaluate all major DBMSs. In particular, the Datafiller script generates random data, based on an SQL schema which is provided as a parameter to the tool, and it takes into account constraints of the schema in order to generate valid data. For example, it takes into account the domain of each field and if each field should be unique, foreign key or primary key. Another important parameter is the df: null=rate% which indicates the nullable rates. It is necessary to test the be- haviour of current DBMSs containing databases with nulls in order to evaluate whether all DBMSs behave in a similar fashion regarding null values. Chapter 4. Implementation 25

Figure 4.3: Method of importing csv files

Additionally, more complex parameters can be provided as well, such as the num- ber of tuples per table using –size SIZE parameter. It is worth mentioning that these parameters should be defined within the schema script and should start with – df. Fur- thermore, we can generate more realistic data by providing some information in the schema SQL script. For instance, if there is a field which represents a date, then we can provide a range in order for the tool to generate dates only within that range. This can be achieved by specifying the following parameter: range – df: start=year-month-day end=year-month-day next to the Date field. Subsequently, we need to add the –filter parameter while running the script. These are only some of the important parameters that the Datafiller provides but apart from these; it also provides more sophisticated properties which are out of importance for our project. The current version of Datafiller tool supports data import only for PostgreSQL and a beta version for MySQL. Since for our project the data should also be imported into Microsoft SQL Server, Oracle Database and IBM DB2, we transform it into a CSV format and then, we use the bulk-loading command which is supported by all system to important the csv file into the DBMSs. Chapter 5

Experimental Evaluation

In this chapter is presented the procedure that was followed for providing the experi- mental evidence. Subsequently, the main findings are presented with appropriate ex- planation according to the SQL Standard.

5.1 The experiment Setup

The experiments are conducted on a server with the following specifications:

– Intel core i7-3632QM 2.20GHz – 12GB Main memory Ram – Solid state disk (SSD) – Windows 8 operating system – The following versions of DBMSs were installed: PostgreSQL Version 9.6, Microsoft SQL Server Express Edition 2016, IBM DB2 Express-C, Oracle Database 12c and MySQL Community Edition 5.7.

Our implemented framework (Random query generator tool and Comparison tool) was used to conduct the experiments and in this way the current implementation is ver- ified and checked whether is capable of detecting any differences and issues. A very large number of SQL queries is generated and executed against the five aforementioned DBMSs. In particular, more than 100,000 queries are generated and executed. Also, different schemas are used varies from 2 to 5 relations and 2 to 5 attributes and the fol- lowing data types were checked: TEXT, CHAR, VARCHAR, INTEGER, SMALL- INT, BIGINT. Furthermore, the generated data contained different null rates such as 5%, 10%, 20% for identifying possible differences that may arise due to the number of

26 Chapter 5. Experimental Evaluation 27 nulls. For each schema, a database instance is generated using the random data gen- erator called Datafiller. As an important consideration that needs to be checked with these experiments, is whether various features of the Standard are implemented in the same way. The size of the instances was 100 rows per table since it will be aimless to use a larger size, since issues and differences can be still disclosed with a smaller size. In this way, the time that needs for the queries to be executed is increased, and more queries can be evaluated in less time. We additionally generated more than a million randomly queries to evaluate the performance of the random query generator tool.

5.1.1 Performance Evaluation

PERFORMANCE EVALUATION

Conf1 Conf2 Conf3 Conf4

3500000 3000000 2500000 2000000 1500000 1000000 500000 0 1 2 4 8 16 32 NUMBER OF GENERATED QUERIES GENERATED OF NUMBER Conf1 100138 201123 401239 802521 1600133 3205674 Conf2 50069 100580 200630 401277 800069 1602725 Conf3 25034 50270 100313 200630 400033 801418 Conf4 16689 33520 66873 133753 266688 534279 EXECUTION TIME ( MINUTES)

Figure 5.1: Performance Evaluation

In Figure 5.1, we present the number of random queries that are generated over time under different configuration settings such as Conf1, Conf2, Conf3, and Conf4. The purpose of the performance evaluation is to provide an insight about the total number of queries that can be generated over a specific time without being executed. In particular, the performance evaluation test is conducted under different configura- tion settings (Different values are assigned to the parameters in the configuration file). Chapter 5. Experimental Evaluation 28

Conf1 and Conf2 were configured in such a way that a low probability is given for most of the parameters and zero probability is given for having string comparisons in the WHERE clause and String functions in the SELECT clause. The only difference between Conf1 and Conf2, is that Conf2 has a greater number of the total number of comparisons parameter. Conf3 and Conf4 differ from Conf1 and Conf2, in such a way that a high probability is given for having String comparisons in the WHERE clause and String functions in the SELECT clause. As expected, fewer queries are generated within a specific time under conf2 settings, as the number of comparisons is greater compared to Conf1, which takes more time for the queries to be generated. Addition- ally, under Conf3 and Conf4 settings, fewer queries are generated compared to Conf1 and Conf3, since by having String comparisons and String functions takes, even more time for the queries to be generated. Overall, it can be seen that a very large num- ber of random queries can be still generated faster, even under different configuration settings.

5.2 Experiment Results

Below we present our findings, which have arisen using our framework and by gener- ating more than 100,000 random queries and execute them against five DBMSs. The findings have the following format:

• Firstly, the generated SQL query that causes an error or a semantic issue is pre- sented ( Since some features of the Standard are not implemented at all, or im- plemented differently can cause an error). • Then, a table for each query is provided which demonstrates if any DBMS raises an error, otherwise if it is executed as expected or not. • Afterwards, a comprehensive explanation is given for each query according to the SQL Standard. Chapter 5. Experimental Evaluation 29

Table 5.1: Statistics of Generated queries

Conf1 Conf2 Conf3 Conf4 Total Queries 49163 51254 43251 32568 Problematic Queries 984 1538 3027 2931 Percentage of 2% 3% 7% 9% problematic queries

Table 5.1 illustrates the collected statistics of the randomly generated queries which are executed against five modern DBMSs. As described earlier, in the Conf1 and Conf2, a low probability is given for most of the parameters (Zero probability is given for string comparisons in the WHERE clause parameter and String functions in the SELECT clause parameter). In Conf3 and Conf4 a high probability is given for string comparisons and functions, in particular in Conf4 a very high probability is given for strings. As it can be seen from the above table, fewer incompatibilities/issues arise if the generated queries do not contain strings comparisons and functions. Furthermore, a larger number of problematic queries arose in particular under Conf3 and Conf4, since queries that contained string comparisons and functions cause more issues due to the fact that these features of the Standard are implemented differently by various vendors.

Difference 1: Q1:

SELECT r41.A AS A0 FROM r4 AS r41 WHERE r41.A > 5 Chapter 5. Experimental Evaluation 30

Table 5.2: Difference #1

DBMS DBMS Message Mysql Executed (Works as expected) PostgreSQL Executed (Works as expected) Microsoft Executed (Works as expected) SQL Server Oracle [42000][933] ORA-00933: SQL command not properly ended IBM DB2 Executed (Works as expected)

With respect to the Standard, alias feature is optional and it is not compulsory to be implemented by the vendors of DBMSs. The purpose of this feature is to rename tables or columns and can be used to both rename attributes in the SELECT clause and tables in the FROM clause. Aliases are given using ’AS’ keyword. The Oracle database supports the ’AS’ when defining column aliases, but it does not allow to use ’AS’ when defining table aliases (in the FROM clause), in contrast with the rest DBMSs, which support the alias feature both to rename tables or columns. An important note is that the rest SQL queries that are presented below are executed without the keyword AS only on the Oracle Database.

Difference 2: Q2:

SELECT r21.A AS A0 FROM r2 AS r21 WHERE true

According to the Standard, the Boolean data type is composed of the values true and false. It also comprises the value unknown, unless a NOT NULL constraint is defined. Although Boolean data type is essential, it is defined as optional in the Stan- dard, and as a result only a few DBMS implement this feature. PostgreSQL and IBM DB2 have implemented this feature, in contrast with Microsoft SQL Server, Oracle Database and MySQL which do not natively support the Boolean data type. If the query Q2 is executed on Oracle database and MySQL, it will raise an error since true keyword cannot be recognised. Chapter 5. Experimental Evaluation 31

Table 5.3: Difference #2

DBMS DBMS Message Mysql Executed (Works as expected) PostgreSQL Executed (Works as expected) Microsoft [S0001][4145] An expression of non-boolean type specified in a con- SQL Server text where a condition is expected, near ’true’ Oracle [42000][920] ORA-00920: invalid relational operator IBM DB2 Executed (Works as expected)

Difference 3: Q3:

SELECT r41.A AS A0 FROM r4 AS r41 WHERE 1*1

Table 5.4: Difference #3

DBMS DBMS Message Mysql Executed (Works as expected) PostgreSQL [42804] ERROR: argument of WHERE must be type boolean, not type integer Microsoft [S0001][4145] An expression of non-boolean type specified in a con- SQL Server text where a condition is expected, near ’1’. Oracle [42000][933] ORA-00933: SQL command not properly ended IBM DB2 Executed (Works as expected)

According to the Standard, each expression in the ’WHERE’ clause is evaluated to a Boolean type, taking the values True, False and Unknown. Since the Boolean type is not supported by all DBMSs, there are some variations. The most common variation is with a bit, where 0 value represents false and 1 value represents true. The query Q3 performs a multiplication between two constants in the ’WHERE’ clause, and since it cannot be evaluated to a Boolean type, it returns an integer type instead. It is expected Chapter 5. Experimental Evaluation 32 that this query will raise an error if it is executed on any DBMSs. However, MySql and IBM DB2 execute this query without to raise any error. The reason is based on the fact that if the number is positive, then, they implicitly interpret it as true, otherwise as false. On the contrary, the rest three DBMSs will raise an error.

Difference 4: Q4:

SELECT NULL/NULL, 1/2, NULL-NULL FROM r2 AS r21 , r4 AS r41 WHERE r41.B > r21.B

Table 5.5: Difference #4

DBMS DBMS Message Mysql Executed (Works as expected) PostgreSQL [42725] ERROR: operator is not unique: unknown / unknown Hint: Could not choose a best candidate operator. You might need to add explicit type casts. Microsoft Executed (Works as expected) SQL Server Oracle Executed (Works as expected) IBM DB2 Executed (Works as expected)

According to the Standard, any comparison/operation involving null or an unknown truth value should return a null output result. Hence, a division with null should return a null. The evaluation of the NULL/NULL as can be seen from the query Q4 should lead to a null value. However, PostgreSQL raised an error when executed this query, in contrast with the rest DBMSs which are executed this query without any error. The problem still exists in PostgreSQL with other arithmetic operators such as +, - and *. Chapter 5. Experimental Evaluation 33

Difference 5: Q5:

SELECT r21.B/3 FROM r2 AS r21 , r4 AS r41 WHERE NOT(NOT(r41.A <> 18 ) )

Table 5.6: Difference #5

DBMS DBMS Message Mysql Executed (Does not work as expected) PostgreSQL Executed (Works as expected) Microsoft Executed (Works as expected) SQL Server Oracle Executed (Does not work as expected) IBM DB2 Executed (Works as expected)

The query Q5 is executed on a database that contained smallint data type for the attribute r21.B. This query is executed without causing an error on any of the five DBMSs. However, the return type of the output results differs slightly. In IBM DB2, PostgreSQL and Microsoft SQL Server the output results for the column r21.B/3 are returned as an integer type, in the contrast with MySQL and Oracle database, where the results are returned as a decimal data type.

Difference 6: Q6:

SELECT (MIN(r41.B) % AVG(r41.A) ) FROM r4 AS r41 WHERE (10 >= 19 ) GROUPBY r41.A, r41.B Chapter 5. Experimental Evaluation 34

Table 5.7: Difference #6

DBMS DBMS Message Mysql Executed (Works as expected) PostgreSQL Executed (Works as expected) Microsoft Executed (Works as expected) SQL Server Oracle [22019][911] ORA-00911: invalid % character IBM DB2 Executed (Works as expected)

An important consideration is that arithmetic operations such as addition, multi- plication and subtraction are supported in the SELECT clause. Another important arithmetic operation which is also supported in the SELECT clause is the % (modulo). However, the syntax of % operator differs in Oracles database. In particular, Oracle uses a function called MOD, instead of using % which is supported by the rest DBMSs. Thus, the query Q5 needs to replace the operator ’%’ with the function mod in order to be syntactically correct on Oracles database, that is, MOD( MIN(r41.B), AVG(r41.A)).

Difference 7: Q7:

(SELECT r41.A AS A0 FROM r4 AS r41 ) EXCEPT ALL (SELECT r21.A AS A0 FROM r2 AS r21 , r4 AS r42 ) Chapter 5. Experimental Evaluation 35

Table 5.8: Difference #7

DBMS DBMS Message Mysql [42000][1064] You have an error in your SQL syntax; check the manual that,corresponds to your MySQL server version for the right syntax to use near ’EXCEPT ALL PostgreSQL Executed (Works as expected) Microsoft S0002][324] The ’ALL’ version of the EXCEPT operator is not sup- SQL Server ported. Oracle [42000][933] ORA-00933: SQL command not properly ended IBM DB2 Executed (Works as expected)

EXCEPT ALL is an optional feature according to the Standard. The purpose of this feature is to return all the rows from the outer query which are not presented in the inner query without removing duplicates. In particular, MySql and Oracle database do not support this operator, in contrast with PostgreSQL and IBM DB2 which support the EXCEPT ALL operator.

Difference 8: Q8:

(SELECT r41.A AS A0 FROM r4 AS r41 , r2 AS r21 , r3 AS r31 WHERE NOT(r31.B <> r41.A ) ) EXCEPT (SELECT r21.A AS A0 FROM r2 AS r21 , r4 AS r41 WHERE r41.A <> r41.B ) Chapter 5. Experimental Evaluation 36

Table 5.9: Difference #8

DBMS DBMS Message Mysql [42000][1064] You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ’EXCEPT PostgreSQL Executed (Works as expected) Microsoft Executed (Works as expected) SQL Server Oracle [42000][933] ORA-00933: SQL command not properly ended IBM DB2 Executed (Works as expected)

EXCEPT is a mandatory feature according to the Standard and all DBMSs must support this feature in order to comply with the Standard. The purpose of this feature is to return all rows from the outer query which are not presented in the inner query and removing duplicates. Only MySQL does not support the EXCEPT even though it is a compulsory feature according to the Standard. However, Oracle database uses MINUS instead of EXCEPT which has an identical behaviour. The rest DBMSs support this operator.

Difference 9: Q9:

(SELECT r41.A AS A0 FROM r4 AS r41 , r3 AS r31 WHERE (NULL <= 6 OR NOT(r31.B <> r41.A ) ) ) INTERSECT ALL (SELECT r21.A AS A0 FROM r2 AS r21 , r4 AS r41 WHERE r41.A <> r41.B OR ( 0 <> 14) AND r41.A > r21.B) Chapter 5. Experimental Evaluation 37

Table 5.10: Difference #9

DBMS DBMS Message Mysql [42000][1064] You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ’ALL PostgreSQL Executed (Works as expected) Microsoft [S0001][324] The ’ALL’ version of the INTERSECT operator is not SQL Server supported. Oracle [42000][928] ORA-00928: missing SELECT keyword IBM DB2 Executed (Works as expected)

With respect to the Standard INTERSECT ALL is an optional feature. The pur- pose of this feature is to return all the rows which are presented in both inner and outer queries result, without removing duplicates. MySQL, Microsoft SQL Server and Or- acle database do not support this feature at all, in contrast with PostgreSQL and IBM DB2 which support this feature.

Difference 10: Q10:

SELECT 7/0 AS ART0 , 1%NULLAS ART1 FROM r1 AS r11 WHERE (NULL = 19 AND r11.b <> 5)

Table 5.11: Difference #10

DBMS DBMS Message Mysql Executed (Does not work as expected) PostgreSQL [22012] ERROR: division by zero. (Works as expected) Microsoft [S0001][8134] Divide by zero error encountered. (Works as ex- SQL Server pected) Oracle ORA-01476: divisor is equal to zero.(Works as expected) IBM DB2 [22012][-801] Division by zero was attempted.(Works as expected)

A division by zero should not be allowed according to the Standard, and it should Chapter 5. Experimental Evaluation 38 always raise an error. However, MySQL does not raise an error and the out results of the division by zero is NULL.

Difference 11: Q11:

SELECT r11.a AS A1 , r11.b AS A2 FROM r1 AS r11 WHERE (r11.b, r11.a) IN (SELECT r12.a AS A4 , r12.b AS A3 FROM r1 AS r12)

Table 5.12: Difference #11

DBMS DBMS Message Mysql Executed (Works as expected) PostgreSQL Executed (Works as expected) Microsoft [S0001][4145] An expression of non-boolean type specified in a con- SQL Server text where a condition is expected, near ’,’. Oracle Executed (Works as expected) IBM DB2 Executed (Works as expected)

The usage of IN predicate is to compare one or more attributes values with a set of values. However, the query Q11 will raise an error on Microsoft SQL Server as more than one value is specified in the IN predicate. The rest DBMSs execute the query without raising any error. Chapter 5. Experimental Evaluation 39

Difference 12: Q12:

SELECT r31.b AS A1 FROM r3 AS r31 WHERE r31.a >= r31.a GROUPBY r31.a

Table 5.13: Difference #12

DBMS DBMS Message Mysql Executed (Does not work as expected) PostgreSQL [42803] ERROR: column ”r31.b” must appear in the GROUP BY clause or be used in an aggregate function Position: 9 Microsoft [S0001][8120] Column ’r3.B’ is invalid in the select list because it SQL Server is not contained in either an aggregate function or the GROUP BY clause. Oracle [42000][979] ORA-00979: not a GROUP BY expression IBM DB2 [42803][-119] An expression starting with ”B” specified in a SE- LECT clause, HAVING clause, or ORDER BY clause is not spec- ified in the GROUP BY clause or it is in a SELECT clause, HAV- ING clause, or ORDER BY clause with a column function and no GROUP BY clause is specified

”According to the Standard, an SQL query that includes a GROUP BY clause cannot refer to non-aggregated columns in the select list that are not defined in the GROUP BY clause.” The intuition behind this issue is based on the fact that if non- grouped and non-aggregate fields are selected, then the DBMSs cannot know which row to return since there can be more than one row for each corresponding group. The query Q12 selects the attribute r31.b which does not appear in the GROUP BY clause. PostgreSQL, Microsoft SQL Server, Oracle database and IBM DB2 correctly raise an error as the attribute does not appear in the GROUP BY clause. However, MySQL executes this query without raising an error. In particular, MySQL simply returns the first row that exists for each group. Chapter 5. Experimental Evaluation 40

Difference 13: Q13:

SELECT ’SQL’ || ’STANDARD’ FROM R1

Table 5.14: Difference #13

DBMS DBMS Message Mysql Executed (Does not work as expected) PostgreSQL Executed (Works as expected) Microsoft [S0001][102] Incorrect syntax near 0|0. SQL Server Oracle Executed (Works as expected) IBM DB2 Executed (Works as expected)

According to the Standard, concatenation is a mandatory feature and it should be supported by all DBMSs with a double-pipe mark ||. The purpose of this feature is the concatenation of two or more strings into one. Thus, by executing the query Q13 is expected the output result to be SQLSTANDARD. However, MySQL supports the double pipe operator, but interprets it as a logical OR and thus the query returns 0. MySQL uses the CONCAT() function as concatenation and it takes as parameters one or more strings. Microsoft SQL Server uses the + operator in order to perform the concatenation. Only Oracle and PostgreSQL support the double-pipe || concatenation operator.

Difference 14: Q14:

SELECT * FROM r1 AS R1 , r1 AS R1 Chapter 5. Experimental Evaluation 41

Table 5.15: Difference #14

DBMS DBMS Message Mysql [42000][1066] Not unique table/alias: ’R1’ PostgreSQL [42712] ERROR: table name ”r1” specified more than once Microsoft [S0001][1011] The correlation name ’R1’ is specified multiple times SQL Server in a FROM clause. Oracle [42000][933] ORA-00933: SQL command not properly ended IBM DB2 Executed (Does not work as expected)

Query Q14 should raise an error since the same alias is assigned to two tables. Thus, it is unambiguous to which table will refer to using the specific alias. However, the query is executed on IBM DB2 database without to raise an error.

Difference 15: Q15:

SELECT TRIM(’SQLSTANDARD’) FROM R1;

Table 5.16: Difference #15

DBMS DBMS Message Mysql Executed (Works as expected) PostgreSQL Executed (Works as expected) Microsoft [S00010][195] ’TRIM’ is not a recognized built-in function name. SQL Server Oracle Executed (Works as expected) IBM DB2 Executed (Works as expected)

According to Standard, TRIM function is a mandatory feature and it should be implemented by all DBMSs. The use of TRIM feature is to return a string which is given as argument with remove leading and trailing space. This function is not supported by Microsoft SQL Server. Chapter 5. Experimental Evaluation 42

Difference 16: Q16:

SELECT 2 * 5 AS ART WHERE ( 1 = 1 )

Table 5.17: Difference #16

DBMS DBMS Message Mysql [42000][1064] You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ’WHERE (1 = 1 ) PostgreSQL Executed (Works as expected) Microsoft Executed (Works as expected) SQL Server Oracle [42000][923] ORA-00923: FROM keyword not found where ex- pected IBM DB2 [42601][-104] Expected tokens may include: ”FROM”

The basic structure of an SQL query is of the form SELECT...FROM...WHERE. The query, but the query Q16 omits the FROM clause. SQL query without FROM clause can be executed only on PostgreSQL and Microsoft SQL Server, in contrast with Mysql, Oracle and IBM DB2 which raise an error.

Difference 17: Q17:

SELECT SUBSTRING (’Standard’, 1, 4) FROM R1 Chapter 5. Experimental Evaluation 43

Table 5.18: Difference #17

DBMS DBMS Message Mysql Executed (Works as expected) PostgreSQL Executed (Works as expected) Microsoft Executed (Works as expected) SQL Server Oracle [42000][904] ORA-00904: ”SUBSTRING”: invalid identifier IBM DB2 Executed (Works as expected)

The substring function is a mandatory feature according to the Standard and as a re- sult all DBMSs should implement this feature. The use of substring feature is to return a part of the string which is given as argument. Only Oracle database does not support this feature and it uses Substr instead of Substring which has a similar behaviour. The prototype of oracle’s function is: substr(column name,start pos,noo f characters).

Difference 18: Q18:

SELECT "SQLSTANDARD" FROM R1

Table 5.19: Difference #18

DBMS DBMS Message Mysql Executed (Works as expected) PostgreSQL [42703] ERROR: column ”SQLSTANDARD” does not exist Posi- tion: 8 Microsoft [S0001][207] Invalid column name ’SQLSTANDARD’. SQL Server Oracle [42000][904] ORA-00904: ”SQLSTANDARD”: invalid identifier IBM DB2 [56098][-727] An error occurred during implicit system action type ”2”

According to the Standard a string should be encompassed by the ’’ symbol. It is demonstrated in query Q19 that MySQL allows a string to be encompassed by both Chapter 5. Experimental Evaluation 44

’’ and ”” symbols which make the SQL code less portable as the rest DBMSs raise an error if it is used ”” instead of ’’ symbol.

Difference 19: Q19:

SELECT * FROM r1 AS R1 WHERE R1.c3 = ’’

Table 5.20: Difference #19

DBMS DBMS Message Mysql Executed (Works as expected) PostgreSQL Executed (Works as expected) Microsoft Executed (Works as expected) SQL Server Oracle Executed (Does not work as expected) IBM DB2 Executed (Works as expected)

The query Q19 is executed on all DBMSs without to raise an error, though the se- mantic of this query differs. As mentioned in the background chapter, and in particular in the missing value, Oracle database treats the empty string as NULL, in the contrast with the rest DBMSs, which treats it as a normal String. Thus, all the comparisons in the WHERE clause involving NULL are evaluated to Unknown which implies that the output result will be empty since none of the rows will be satisfied. On the other hand, if there is at least a row which contains an empty string, then by executing the above query will be included in the output results. All the DBMSs except Oracle Database return in the output results all the rows that contains an empty string in the column c3. In Oracle database is returned an empty set as output result. Chapter 5. Experimental Evaluation 45

Difference 20: Q20:

SELECT * FROM R1 WHERE r1.c3 LIKE ’standard%’

Table 5.21: Difference #20

DBMS DBMS Message Mysql Executed (Case-Sensitive) PostgreSQL Executed (It is not Case-Sensitive) Microsoft Executed (Case-Sensitive) SQL Server Oracle Executed (It is not Case-Sensitive) IBM DB2 Executed (It is not Case-Sensitive)

Query Q20 is executed on all DBMSs without to raise an error but it differs seman- tically, since this query will not return the same results on all DBMSs. In particular, all the tested systems contained a database that stored a field with the word STANDARD (in capital letters) in the attribute c3 of the relation r1. By executing this query, it can be observed that some of the DBMSs are case sensitive with the LIKE operator, such as PostgreSQL and Oracle and as a result return an empty set as output. In the contrast with the rest systems, which returned one row that is the row that contains the word STANDARD in the attribute c3. Chapter 6

Conclusion

6.1 Conclusion

In this project an entire framework is implemented and composed of the random query generator tool and the comparison tool which are used to evaluate the SQL-compliance of five DBMSs. Moreover, we verified the correctness of the implemented tools by conducting the experiments. We demonstrated from the experimental evaluation chap- ter that the implemented tools are competent to reveal crucial differences and issues among current systems. Without a similar framework, it would be almost impossi- ble to detect some of the differences and issues by generating queries manually, or by testing all DBMSs empirically. As described in the related work, there is not any similar framework, apart from some documentations provided by the vendors of re- spective systems. There are also some studies presenting some issues according to the Standard, but without integrating an automated tool. In addition, a summarised table is provided exposing all the issues and incompati- bilities between the most popular DBMSs. These issues/incompatibilities are of major importance for vendors, users, programmers and researchers of such systems. We be- lieve that this project has contributed in the research area of the database management by introducing a new way of testing such systems according to their SQL-Compliance. In addition, the implemented tools could be of major importance for future vendors or researchers. Furthermore, observing and analysing these incompatibilities makes users, pro- grammers and researchers aware of these issues for avoiding crucial errors in the fu- ture. We verified our assumptions that some parts of the Standard are implemented differently, but it was somewhat unexpected that so many differences have emerged.

46 Chapter 6. Conclusion 47

Lastly, the implemented framework is portable and can be extended efficiently. Hence, although experimental evidence is provided for both numeric and alphanumeric data types, the random generator tool is implemented in such a way that can track any data types such as Date. In this way, it could be extended efficiently to generate queries with attributes of Date data type.

6.2 Summary of the findings

The below tables summarize the main features of the SQL language which are not supported by all popular DBMSs.

Table 6.1: Summarised Results

Microsoft Postgre IBM Operation Mysql SQL Oracle SQL DB2 Server INTERSECT X  XX  ALL INTERSECT    X  AS in FROM    X  Clause X EXCEPT ALL X  X (MINUS ALL is  not supported)  EXCEPT X   (It uses MINUS  instead of EXCEPT) GROUP BY con- tains columns not  XXXX in SELECT list Arithmetic opera- tions in WHERE  XXX  Clause Support keyword True in WHERE   XX  clause  Support of % op- (It uses     erator mod function instead of %) Division by zero  XXXX Row comparisons   X   Identical names in the FROM XXXX  Clause Query without X   XX FROM Clause Chapter 6. Conclusion 48

Table 6.2: Summarised Results

Microsoft IBM Operation Mysql PostgreSQL SQL Oracle DB2 Server  (But It usage is || operator in SE- X || as logical OR),    LECT Clause ( But it uses + ) (Uses CONCAT function instead of ||) TRIM function in   X   SELECT Clause  && operator in (But it usage is     SELECT Clause as logical AND) SUBSTRING X function in    (It uses SUBSTR  SELECT Clause function instead) Enclose Strings with “ ” instead  XXXX of ‘ ’   LIKE Operator    (Case- sensitive) (Case- sensitive)

The following additional conclusions are drawn by conducting our experiments. Despite the fact that the Boolean data type is essential and used by the majority of the developers, three major database systems (Microsoft SQL Server, Oracle database, MySQL) do not provide native support for this data type. Thus, a migration of SQL code into these systems may cause incompatibilities. It was also observed that the behaviour of MySQL might be “weird” under certain situations. In particular, it is the only system that allows a division by zero. Also, it allows non-aggregated columns in the select list, which was something unexpected. Lastly, the operators INTERSECT ALL and EXCEPT ALL are not supported by Mi- crosoft SQL Server and MySQL.

6.3 Suggestions for future work

Several issues arose when DBMSs are tested. The experiments are conducted in var- ious databases that contained integers and strings. We expect that more issues can Chapter 6. Conclusion 49 arise by also generating databases containing dates but by doing so, the generator tool should be extended in order to support this new data type. This should be an easy ex- tension as there is the provision for supporting any data type. Another future extension could be to include a new DBMS for evaluation of its SQL-compliance. This extension also should not need a lot of effort as the framework is implemented in such a way that a new system can be easily added. Appendix A

Source Code

The complete source code was submitted with the current dissertation report. The source code can be also found in the following github repository: https://github.com/spanoselias/TestingSQLCompliance

50 Appendix B

Compilation & Execution Instructions

Compilation Instructions: Note: The Postgresql JDBC library is passed as an argument since the program needs to retrieve the schema from PostgreSQL database. (Credentials need to be given via configuration file or as arguments). In addition, the schema can be retrieved from MySQL as well using ‘mysql-connector-java-5.1.41-bin.jar’

The following commands should be executed in the ‘src’ directory in conjunction with PostgreSQL in order to retrieve the schema (The below commands should be copied from readme TXT which was submitted along with this thesis report or from github repository since by coping these commands straight from the some symbols like 0 might change) Linux: javac -cp ’.:postgresql-42.1.1.jar’ SQLEngine.java

Windows: javac -cp ’.;postgresql-42.1.1.jar’ SQLEngine.java

Execution Instructions: Linux: javac -cp ’.:postgresql-42.1.1.jar’ SQLEngine.java

Windows: javac -cp ’.;postgresql-42.1.1.jar’ SQLEngine.java

51 Appendix B. Compilation & Execution Instructions 52

The following parameters can be passed as arguments to the random query tool: -maxTablesFrom, -maxAttrSel, -maxCondWhere, -maxAttrGrpBy, -probWhrConst, -repAlias, -nestLevel, -arithCompSel, -distinct, -stringInSel,-stringInWhere, -rowcompar, -DBMS, -user, -pass, -dbName, -isNULL,-isSelectAll Example1: javac -cp ’.;postgresql-42.1.1.jar’ SQLEngine.java -maxTablesFrom ,→ 5 -maxCondWhere 3 -maxAttrSel 4

Example2: javac -cp ’.;postgresql-42.1.1.jar’ SQLEngine.java -maxtablesfrom ,→ 5 -maxCondWhere 3 -maxAttrSel 4 -arithCompSel 0.5 -distinct ,→ 0.5

Note: Any number of arguments can be passed to the program, and they are not case-sensitive as it can be seen from the Example2.

The following steps should be followed in order to run the complete framework:

• Firstly, the environment should be set up. Hence, it should run on an environment with the following DBMSs installed: PostgreSQL Version 9.6, Microsoft SQL Server Express Edition 2016, IBM DB2 Express-C, Oracle Database 12c and MySQL Community Edition 5.7. (All the appropriate libraries are provided with the source code under lib directory). And the appropriate credential should be specified in the comparisonTool.properties.

• Both DBMS’ schemas and random generated data are provided under Script di- rectory. Also, the open-source Datafiller can be used for generating realistic ran- dom data (https://www.cri.ensmp.fr/people/coelho/datafiller.# downloade)

• Compilation instructions: javac -cp ’All_libraries_names’ SQLEngine.java -compTool ,→ noOfQueries

• Execution instructions: Appendix B. Compilation & Execution Instructions 53

javac -cp ’All_libraries_names’ SQLEngine.java -compTool ,→ noOfQueries

• A log file with all the differences will be generated once the program execution is finished. Bibliography

[1] DB Engines. https://db-engines.com/en/ranking_trend, 2017. Accessed: 2017-08-03.

[2] ISO/IEC9075-2. https://www.iso.org/standard/63556.html, 2017.

[3] Oracle. https://docs.oracle.com/cd/B19306_01/server.102/b14200/ sql_elements005.htm, 2017. Accessed: 2017-07-24.

[4] T. Arvin. Comparison of different implementations. Troels Arvin’s home page, 2006.

[5] T. Y. Chen, F.-C. Kuo, R. G. Merkel, and T. Tse. Adaptive random testing: The art of test case diversity. Journal of Systems and Software, 83(1):60–66, 2010.

[6] E. F. Codd. Derivability, redundancy, and consistency of relations stored in large data banks. Technical report, IBM Research Report, 1969.

[7] E. F. Codd. A relational model of data for large shared data banks. Communica- tions of the ACM, 13(6):377–387, 1970.

[8] F. Coelho. Datafiller–generate random data from database schema. https:// www.cri.ensmp.fr/people/coelho/datafiller.html. Accessed: 2017-06- 10.

[9] F. Coelho. DataFillerTutorial. http://blog.coelho.net/database/2013/ 12/01/datafiller-tutorial.html, 2017. Accessed: 2017-06-01.

[10] T. M. Connolly and C. E. Begg. Database systems: a practical approach to design, implementation, and management. Pearson Education, 2005.

[11] T. P. P. Council. Tpc benchmark c (standard specification, revision 5.11), 2010. URL: http://www. tpc. org/tpcc, 2010.

54 Bibliography 55

[12] C. J. Date and H. Darwen. A Guide to the SQL Standard, volume 3. Addison- Wesley New York, 1987.

[13] S. Deconinck. Database systems: models, languages, design, and application programming. 2011.

[14] A. Eisenberg, J. Melton, K. Kulkarni, J.-E. Michels, and F. Zemke. Sql: 2003 has been published. ACM SIGMoD Record, 33(1):119–126, 2004.

[15] R. Elmasri. Fundamentals of database systems. Pearson Education India, 2008.

[16] P. Guagliardo and L. Libkin. Making sql queries correct on incomplete databases: a feasibility study. In Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pages 211–223. ACM, 2016.

[17] P. Guagliardo and L. Libkin. A formal semantics of sql queries, its validation, and applications. Proceedings of the VLDB Endowment, 10(8), 2017.

[18] T. Imielinski´ and W. Lipski Jr. Incomplete information in relational databases. Journal of the ACM (JACM), 31(4):761–791, 1984.

[19] H.-J. Klein. How to modify sql queries in order to guarantee sure answers. ACM SIGMOD Record, 23(3):14–20, 1994.

[20] K. Kline, B. Hunt, and D. Kline. SQL in a nutshell: a desktop quick reference.” O’Reilly Media, Inc.”, 2004.

[21] R. Ramakrishnan. Database management systems . pdf. 2000.

[22] J. D. Ullman, H. Garcia-Molina, and J. Widom. Database systems: The complete book, 2002.

[23] R. F. Van Der Lans. The SQL standard: a complete guide reference. Prentice Hall International (UK) Ltd., 1989.

[24] R. van der Meyden. Logical approaches to incomplete information: A survey. In Logics for databases and information systems, pages 307–356. Springer, 1998.

[25] F. Zemke. What. s new in sql: 2011. ACM SIGMOD Record, 41(1):67–73, 2012.