

MIGRATING WEB APPLICATIONS FROM SQL TO NOSQL

by

RAHMA S. AL MAHRUQI

A thesis submitted to the School of Computing in conformity with the requirements for the degree of Doctor of Philosophy

Queen’s University Kingston, Ontario, Canada January 2020

Copyright © Rahma S. Al Mahruqi, 2020

Abstract

In the big data era, data is growing dramatically and the structure of data is becoming increasingly flexible. Non-relational databases, such as the NoSQL class of databases, are playing a major role as an enabler technology to manipulate such data. Non-relational databases have overcome many of the limitations of relational databases, especially those related to enabling systems to scale up to serve more customers producing terabytes of data. More businesses are willing to migrate their legacy relational systems to ones that use NoSQL databases. In this thesis, we present a semi-automated approach to migrate highly dynamic SQL-based web applications to ones that use document-oriented NoSQL databases such as MongoDB, using source analysis and transformation techniques. We outline a set of source transformation steps that can be used to migrate an existing web application database from a SQL one to a document-oriented NoSQL database. We demonstrate our semi-automated framework on the analysis and migration of three existing web applications to extract, classify, translate, migrate and optimize the queries and to migrate the PHP code to interact with the migrated database.

There are two parts to this approach: the migration of schema and data, and the migration of the actual application code with embedded queries. Our approach provides contributions to the second part, migrating and optimizing the embedded SQL queries to interact with the new database system and changing the application code to use the translated queries.

Co-Authorship

The published paper resulting from this thesis has been co-authored with my supervisors Dr. Thomas R. Dean and Dr. Manar H. Alalfi. I am the primary author of the paper. Rahma S. Al Mahruqi, Manar H. Alalfi, and Thomas R. Dean, "A semi-automated framework for migrating web applications from SQL to document oriented NoSQL database," CASCON '19: Proceedings of the 29th Annual International Conference on Computer Science and Software Engineering, Toronto, November 2019, Pages 44–53.

Acknowledgments

First and foremost, I thank God (Allah) the Most Gracious and Merciful for blessing me with guidance, strength, patience and perseverance to complete this work. I would like to express my gratitude to all those who made it possible for me to complete this thesis. I want to thank my supervisors Prof. Thomas R. Dean and Prof. Manar H. Alalfi for their advice, support, and encouragement. They always provided valuable direction and feedback, and gave me an important lesson on academic research. It was a great pleasure for me to conduct this thesis under their supervision. I would also like to thank the members of my oral defense committee, Dr. Ettore Merlo, Dr. James Cordy, Dr. Michael Greenspan, and Dr. Hossam Hassanein, for their time and insightful questions and comments. My appreciation and gratefulness go to all my colleagues and to the friendly members of the School of Computing who have helped in one way or another along the way. In particular, I would like to thank Marwa Afifi, Mariam AlMazour and Ayesha Baber for their sincere friendship. As well, I would like to acknowledge the School of Computing at Queen's University for providing me with the opportunity and environment for pursuing my research and studies. Special appreciation is due to Mrs. Debby Robertson, our graduate assistant. Thank you for always being there and for your endless encouragement. Special thanks go to Dr. Wendy Powley for advocating and inspiring women in computing.

To the souls of my mother and father, who did everything they could for me and helped me to achieve my dreams; I am sure if they were here, they would have been so proud of their daughter. I further wish to thank my family and friends for their support and encouragement. I also appreciate the support and encouragement my husband, Omar, has given me over the past several years. He was instrumental in me taking this leap in the beginning, and for the past 20 years has done nothing but encourage me in whatever goals I set for myself. Finally, I would like to thank my lovely kids, Hafsah, Sara, Abdullah and Maeen, my source of enthusiasm, for always encouraging me and telling me that it is never too late for learning. Thank you for your understanding and patience when I was away from you. I thank my brother Ahmed, my role model; my youngest brother Mohamed; my sisters Azza, Nasra, Fatma and Raheema; and my father-in-law and mother-in-law, uncles, aunts, and cousins for their continuous prayers and support. I thank my home country Oman, Sultan Qaboos University and the Ministry of Higher Education for their support to pursue my degree and for funding my study.

Statement of Originality

I, Rahma S. Al Mahruqi, certify that the research work presented in this thesis is my own and was conducted under the supervision of Dr. Thomas R. Dean and Dr. Manar H. Alalfi. All references to the work of other people are properly cited.

Contents

Abstract

Co-Authorship

Acknowledgments

Statement of Originality

Contents

List of Tables

List of Figures

Chapter 1: Introduction
1.1 Problem Statement
1.2 Motivation
1.3 Research Statement
1.4 Objectives
1.5 Contributions
1.6 Organization of the Thesis

Chapter 2: Background
2.1 Introduction
2.2 Databases
2.3 Database Management Systems
2.4 Data Storage Evolution in Web Applications
2.5 The Relational Model
2.6 NoSQL Databases
2.6.1 Classification of NoSQL Systems
2.6.2 Characteristics of NoSQL Systems
2.6.3 NoSQL Use Cases
2.6.4 Challenges for NoSQL Adoption
2.6.5 MongoDB
2.6.6 NoSQL Usage Statistics
2.6.7 Web Application Migration to NoSQL
2.7 The Need for Automation
2.7.1 Source Transformation
2.8 Query and Application Migration Challenges
2.9 Related Work
2.9.1 Schema and Data Migration
2.10 Query and Application Migration
2.10.1 Research-based Tools
2.10.2 Industry-based Products
2.11 Conclusion

Chapter 3: Proposed Migration Approach
3.1 Introduction
3.2 Proposed Migration Approach
3.3 Conclusion

Chapter 4: Schema and Data Migration
4.1 Introduction
4.2 Data Model Design
4.3 Designing The Schema
4.3.1 Schema Conversion
4.3.2 Data Migration
4.4 Migration Strategy
4.5 Data Migration
4.5.1 Data Migration Tools Overview
4.6 Schema and Data Migration Experiments
4.6.1 PHPBB3 Schema and Data Migration
4.7 WordPress Schema and Data Migration
4.7.1 WordPress Schema Conversion
4.8 Data Migration Considerations
4.9 Data Migration Tools
4.9.1 Data Migration Tools Evaluation
4.9.2 Data Migration Tools Comparison
4.10 Conclusion

Chapter 5: Manual Migration
5.1 Introduction
5.2 PHPBB Code Migration
5.3 SCARF Manual Migration
5.4 Generic Code Migration Cases
5.5 Manual Migration Caveats
5.6 Conclusion

Chapter 6: Query Migration and Optimization
6.1 Introduction
6.2 Query Extraction Phase
6.3 Query Classification Phase
6.4 Query Translation and Migration Phase
6.4.1 Naive Queries Translation
6.4.2 Migration Cases Examples
6.4.3 Dual Queries Translation
6.4.4 SCARF Migration Issues
6.4.5 PHPBB Migration Issues
6.5 Query Migration Evaluation
6.5.1 Query Evaluation Before Indexing
6.5.2 Query Evaluation After Indexing
6.6 Query Optimization Phase
6.6.1 Query Optimization Techniques
6.6.2 Filter Option Optimization
6.6.3 Table Order Optimization
6.6.4 Query Optimization Evaluations
6.7 Conclusion

Chapter 7: Application Migration
7.1 Introduction
7.2 Application Migration Overview
7.2.1 Application Migration Approach
7.2.2 Application Migration Steps
7.3 Phase 1 - Backward Tracing Search
7.3.1 Step 1: Identify and change calls to MySQL
7.3.2 Step 2: Pre-process source files to gather global variables
7.3.3 Step 3: Find functions used to launch queries
7.3.4 Step 4: Search SQL statements using backward tracing from calls to MySQL
7.3.5 Step 5: Identify the prototype of SQL statements
7.4 Phase 2 - Migration of SQL sentences
7.4.1 Step 1: Query classification
7.4.2 Step 2: Migrate query by pattern matching
7.5 Simple Query Translation Example
7.6 Backward Tracing Translation
7.6.1 Backward Tracing Overview
7.6.2 SCARF Use Cases Examples
7.6.3 PHPBB3 Backward Tracing Examples
7.6.4 Complex PHPBB3 Query Builds
7.7 Conclusion

Chapter 8: Application Migration Evaluation
8.1 Introduction
8.2 SCARF Backward Tracing Process
8.3 PHPBB2 Backward Tracing Process
8.4 PHPBB3 Backward Tracing Process
8.4.1 Backward Tracing Queries with Multiple Parameters
8.4.2 Non-Migrated Cases
8.5 SQL Backward Tracing Examples
8.5.1 Backward Tracing Queries with Operators
8.5.2 Issues with non-Migrated Tables
8.5.3 Difference between the Manual Migration and the Automated One
8.6 PHPBB3 Application Evaluation
8.6.1 SQL Process Examples of the Migration
8.7 Application Migration Evaluation
8.8 Manual Migration of PHPBB3 Pages
8.9 Application Migration Statistics
8.10 WordPress Analysis
8.11 Conclusion

Chapter 9: Summary and Conclusions
9.1 Summary
9.2 Limitations
9.3 Future Work

List of Tables

4.1 Structure Comparison between RDBMS & MongoDB

6.1 MongoDB Indexes
6.2 Query Migration Statistics before & after Indexing
6.3 Double Query Migration Statistics before & after Optimization

8.1 SQL to MongoDB Query Mapping
8.2 Codified SQL to MongoDB Query Mapping
8.3 Experiment Results

List of Figures

3.1 Proposed Migration Approach

4.1 Schema Migration Process
4.2 Data Migration Process
4.3 PHPBB3 Relational Schema
4.4 PHPBB3 Relational Logical Structure
4.5 PHPBB3 Relational Schema (users and topics_posted)
4.6 PHPBB3 MongoDB Model
4.7 Schema Migration Interface
4.8 Generated Text Files
4.9 Tool Interface
4.10 WordPress Entity Relational Diagram

5.1 SCARF MySQL ER Diagram ...... 84

6.1 Query Migration Process
6.2 Query execution time comparison before Indexing
6.3 Table Order Optimization Variants
6.4 Query Optimization Analysis

7.1 Application Migration Process ...... 148

8.1 PHPBB3 Index homepage
8.2 PHPBB3 viewforum Page
8.3 PHPBB3 viewtopic Page


List of Acronyms

ACID Atomicity, Consistency, Isolation, and Durability
ANTLR ANother Tool for Language Recognition
API Application Programming Interface
BSON Binary JSON (JavaScript Object Notation)
CMS Content Management System
CRUD Create, Read, Update, and Delete
DAO Data Access Object
DDI Denormalization, Duplication and Intelligent keys
ER Entity Relationship
ETL Extract-Transform-Load
HBase Hadoop database
HDFS Hadoop Distributed File System
HQL Hibernate Query Language
I/O Input/Output operations
JBoss JavaBeans Open Source Software
JDBC Java Database Connectivity
JSON JavaScript Object Notation
NoSQL Not only SQL
OOP Object Oriented Programming
PaaS Platform as a Service
PDI Pentaho Data Integration
PDO PHP Data Objects
PHP PHP: Hypertext Preprocessor
phpBB PHP Bulletin Board
QnA Question and Answer
RDBMS Relational Database Management System
REST Representational State Transfer
SCARF Stanford Conference And Research Forum
SIGCOMM ACM Special Interest Group on Data Communication
SQL Structured Query Language
TXL Turing eXtender Language (a source transformation language)
URL Uniform Resource Locator
YCSB Yahoo Cloud Serving Benchmark

Chapter 1

Introduction

In order to maintain a successful organization or business, a database is usually required. A database is a collection of data, typically describing the activities of one or more related organizations. It helps keep track of all transactions happening within a business in order for it to run smoothly. There are several different types of databases, which will be covered later in the thesis. Throughout this thesis, we will focus on analyzing relational databases and NoSQL databases, which are currently among the most commonly used. Specifically, this research aims at studying the differences between NoSQL databases and comparing the performance of one NoSQL architecture, namely MongoDB, against that of an RDBMS (MySQL). Differences in performance, use cases, and architectural advantages and disadvantages will be discussed, as well as the design purposes of various NoSQL implementations and the migration process from MySQL to MongoDB, which includes the migration of schema and data, query migration, and application migration.

1.1 Problem Statement

The appearance of new data models, NoSQL, has led some application developers to question whether it is more beneficial to transfer existing data and applications from relational databases to these new structures. Going further, the next questions, regarding the possibility, feasibility and reasonableness of undertaking such tasks, should be asked. To answer these questions about the difficulty of carrying out an application migration process, this research considers the scope of changes that would have to be made in the existing environment and in the application source code, and the efficiency of an application while using the new data structures.

With the development of web applications, the demand for high scalability, query performance and expansion becomes increasingly pressing. Relational databases seem unable to handle this, with more applications choosing to migrate to NoSQL databases. There are many studies that discuss the uses of NoSQL databases and compare their efficiency, thus confirming their usefulness and performance [23], and that discuss migration between different relational databases. However, there are very few studies dealing with the problem of replacing a system's data storage with one of a different structure than the existing working one. This lack has become a motivating factor to examine how difficult and laborious it is to move an existing, regularly used application based on the relational environment to a non-relational data structure.

1.2 Motivation

Databases are among the most important pieces of enterprise applications required for business decision making. As part of achieving specific targets, business decision making involves processing and analyzing large volumes of data, which leads to the growth of enterprise databases day by day [86]. NoSQL (Not Only SQL) is often used for storing Big Data. This is a new type of database which is becoming more and more popular among large companies today. NoSQL provides simple scalability and improved performance relative to traditional relational databases. NoSQL solutions take advantage of Cloud Computing, meaning that database servers are delivered as a service, typically over the Internet [106]. NoSQL databases support dynamic provisioning, horizontal scaling, significant database performance, distributed architecture and developer agility benefits [56]. This approach excels at storing data structures which change rapidly as the business needs of the company change, unlike traditional SQL database management systems, which require a carefully designed data structure that does not change over time; when it does, most, if not all, applications accessing it need to be redesigned. Moreover, NoSQL systems allow data to be easily partitioned across hundreds of servers, and queries can be automatically broken down into multiple parallel processes running on multiple servers. Whenever possible, data is held in-memory to make access faster [78].

RDBMS technology can no longer keep pace with the velocity, volume, and variety of data being created and consumed, because these systems are based on relations and data normalization. Accommodating new big data, unstructured and semi-structured, into these systems causes the relational normalized schema to become too complex. Normalizing data requires more tables, which requires more table joins, thus requiring more keys and indexes. As database sizes start to grow into terabytes, performance starts to significantly degrade. The main reason why Cloud Computing is being utilized by many organizations is to avoid high upfront costs and to avoid dealing with issues such as scalability, in which Cloud providers have extensive experience [109]. The decision to start using NoSQL databases involves a set of trade-offs. Migrating to NoSQL databases requires some preparation for enterprise IT departments that grew up with RDBMSs [23]. The transition from legacy RDBMS to NoSQL requires careful planning. A completely different mindset is needed by developers and architects in order to get the most out of this new technology. Distributing data across multiple nodes in a data center results in a linear performance improvement [107]. However, understanding how to store data in ways that allow scalability requires a rethinking of the overall approach to systems design. The plethora of NoSQL products with many different feature sets makes it almost impossible to recommend one based only on the data model that it offers. Moreover, how this transition will impact the other application layers is another very important aspect of this process. Therefore, we will investigate the potential and limitations of migrating web applications from relational databases to NoSQL, which requires changing the embedded queries and the application code to interact with the new database.

1.3 Research Statement

The purpose of this research is to analyze the challenges of migrating web applications from relational databases to NoSQL databases. The research covers schema and data migration, and the migration of application source code and embedded queries. In this thesis, we propose a semi-automated approach to migrate highly dynamic SQL-based web applications (e.g. MySQL) to ones that use document-oriented NoSQL databases. The core idea behind this thesis is summarized in the following statement: We believe that the adoption of NoSQL databases is a necessary requirement to handle big data, so a solution is needed to migrate the source code of an existing legacy web application based on a relational database to use a NoSQL database. Such a solution can reduce the effort of rewriting the application code when the back-end data storage system is changed, and reduce the effort of maintaining data portability between the different database models.

1.4 Objectives

The main objective of this thesis is to develop a semi-automated framework that supports migrating an existing relational database-based web application (MySQL) into a document-oriented NoSQL database-based application (MongoDB), taking into consideration the migration of the application source code. The objectives include the following:

1. Investigate existing solutions and best practices of schema and data migration tools.
2. Develop an automated framework for source code migration.
3. Validate that the proposed framework works for different types of web applications by providing and testing sample use cases.

1.5 Contributions

Our work makes the following contributions to the field of migrating RDBMS-based web applications to NoSQL:

• Analyze the requirements of the schema, data and source code migration.

• Evaluate the published approaches for migrating the database schema and data.

• Develop a semi-automated migration framework, based on TXL, to migrate relational-based web applications to document-oriented NoSQL databases.

• Define and analyze the effects of the migration process on the application code.

• An experiment that evaluates our framework and tool on production web applications of various sizes.

Our approach contributes to migrating and optimizing the embedded SQL queries to interact with the new NoSQL database system and to changing the application code to use the translated queries. In particular, we provide: a framework for the automated migration of relational database web applications to a mixture of relational and document-oriented NoSQL applications; a semi-automated tool that realizes the proposed framework; and an experiment that evaluates our framework and tool on production web applications of various sizes.

1.6 Organization of the Thesis

We proceed by introducing the main concepts of database and web application migration and discussing the related work in Chapter 2. In Chapter 3, we introduce our proposed framework and briefly discuss its different phases. Chapter 4 describes the schema and data migration steps with running examples. We discuss the manual migration experiments, with some examples from our applications under test, in Chapter 5. The implementation details and the analysis of the framework are presented in Chapter 6 and Chapter 7. Chapter 8 evaluates our application migration phase and illustrates some examples of the backward tracing process for the SCARF and PHPBB3 applications. Chapter 9 concludes our work, addresses its limitations, and outlines some future work.

Chapter 2

Background

2.1 Introduction

Addressing today's ever increasing data management needs requires solutions that can achieve unlimited scalability, high availability and massive parallelism while ensuring high performance levels [46]. New kinds of applications, such as business intelligence, enterprise analytics, customer relationship management, document processing, content management systems (CMS), social network sites, Web 2.0 and cloud computing, require horizontal scaling to thousands of nodes on demand when handling the huge collections of structured and unstructured data that traditional relational databases fail to manage [67]. The rate at which data is being generated through interactive applications by large numbers of concurrent users, in distributed processing involving a very large number of servers handling big data applications, has overtaken the capabilities of relational databases, thus driving focus towards NoSQL database adoption [106]. NoSQL database systems have addressed the scaling and performance challenges inherent in a traditional relational database by exploiting partitions, relaxing strict consistency protocols, and by way of distributed systems that can span data centers while handling failure scenarios without difficulty [106].

In this chapter, we introduce the basic ideas of databases, database management systems, data storage evolution in web applications, NoSQL database classification and characteristics, NoSQL use cases, the challenges for NoSQL adoption, and the migration of web application databases from SQL to document-oriented NoSQL. Also, we review recent work by others related to the conversion to NoSQL databases, highlighting the various techniques and approaches, and identifying the need for automation. We end the chapter by briefly introducing the automated technology used in our semi-automated migration approach, the TXL source transformation language [94].

2.2 Databases

A database is a kind of system that is designed for storing data. It is a system made available on a network which handles many incoming requests, even overlapping requests [81]. A database will be given a defined schema, which is a structure describing the shape of the data. Databases usually have mechanisms for preserving the database structure; this way, the data cannot be inserted, modified, or removed in a way that would violate the schema. Most databases handle requests through transactions. A transaction is an independent unit of work enacted upon the data. Many users can contact the database at the same time and create simultaneous transactions; the database makes sure those separate transactions do not interfere with each other. Transactions conform to what is prescribed by the acronym ACID, which stands for atomicity, consistency, isolation, and durability. The ACID formulation of transactions has been current for thirty years [68].
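To make the atomicity property concrete, the following minimal sketch (our own illustration, not from the thesis; the connection credentials and the accounts table are hypothetical) uses PHP's PDO interface to group two updates into a single transaction, so that either both take effect or neither does:

<?php
// Transfer credits between two user accounts atomically, assuming a local
// MySQL server and a hypothetical accounts(id, balance) table.
$pdo = new PDO('mysql:host=localhost;dbname=shop', 'user', 'secret');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

try {
    $pdo->beginTransaction();
    $pdo->exec('UPDATE accounts SET balance = balance - 10 WHERE id = 1');
    $pdo->exec('UPDATE accounts SET balance = balance + 10 WHERE id = 2');
    $pdo->commit();    // both updates become visible together (atomicity)
} catch (Exception $e) {
    $pdo->rollBack();  // on failure, neither update survives
    throw $e;
}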

2.3 Database Management Systems

The definition of a database management system was best stated by R. Ramakrishnan and J. Gehrke [81]: a database management system, or DBMS, is software designed to assist in maintaining and utilizing large collections of data, and the need for such systems, as well as their use, is growing rapidly. Database management systems will continue to grow as more and more data becomes accessible through computer networks. For example, one of the most common and simplest forms of databases is the digital library. With database management systems, it is very easy for libraries to keep track of all items by using a database to show which items are checked out and how many are available. Database management systems have been around since the early 1960s. Charles Bachman designed the very first general-purpose database management system at General Electric, where it was originally known as the Integrated Data Store [81]. In the 1980s, the relational model consolidated its position as the dominant DBMS paradigm, and database systems continued to gain widespread use. The SQL query language for relational databases, developed as part of the IBM System R project, is now the standard query language [81]. In the late 1980s and the 1990s, advances were made in numerous areas of database systems. Large vendors began extending their database systems at this time so they could store new data types, such as text and images, and support more complex queries.

2.4 Data Storage Evolution in Web Applications

The storage systems used in web applications have evolved dramatically over the past 30 years, and the history brings interesting insights. We can roughly divide these storage systems into four generations:

• First generation: File Systems. In the early 1990s, web pages were static. Web servers received HTTP requests for a file, read the file from their local file system, and returned it to the user. The storage system for web applications was simply the file system holding such files.

• Second generation: File and SQL Database Systems. In the mid to late 1990s, the web saw the emergence of dynamic content: content that depends on who the user is or what the user has done, e.g., the items in a shopping cart. To generate a page, the web server invoked a program written in languages such as Perl, Python, PHP, or Java. The program stored the data needed to generate the page in a central SQL database system. The use of SQL was convenient, because SQL has many useful features such as joins, secondary indexes, transactions, aggregations, and many data types. The storage system, thus, was a combination of the file system for static content and the database system for dynamic content [68].

• Third generation: Highly Scalable NoSQL Systems. In the 2000s, large websites emerged, such as Amazon, Hotmail, Google, and Yahoo. These sites had rapid growth in the number of users; soon the database system became a performance and scalability bottleneck. Scaling the database system was hard or expensive, so developers decided to replace the SQL database system with their own homegrown storage systems, such as the Google File System, Dynamo, and others. This was the beginning of what later became known as NoSQL. Some believe that NoSQL databases started with Berkeley DB, which was developed during the late eighties and early nineties [31]. Berkeley DB has since been acquired by Oracle, but the real movement of NoSQL databases started after 2008, and since then it has been rigorously advancing [31].

• Fourth generation: Cloud Storage Systems. In the 2010s, many web applications moved to the cloud, where many vendors share the same computing and storage infrastructure. Storage systems were designed not just to scale, but also to provide isolation, so that applications do not interfere with one another. Examples of such storage systems include Amazon S3, SimpleDB, Azure Blobs, and Azure Tables.

A highlight in this history is the emergence of the NoSQL movement 19 years ago, which sought to replace the SQL database system with cheaper custom-built alternatives. These alternatives were much simpler than a SQL database system, and thus were easier to scale, but they offered more restricted functionality; only a few NoSQL systems offer transactions, secondary indexes, joins, or aggregations, and certainly no NoSQL system offers all the features found in SQL [36].

2.5 The Relational Model

Introduced in the 1970s, the relational model offers a mathematically grounded way of structuring, keeping, and using data [107]. It expands the earlier designs of the flat model and the network model by introducing relations. Relations bring the benefits of keeping the data grouped as constrained collections, whereby data tables contain information in a structured way (e.g. a person's name and address) and relate all the input by assigning values to attributes (e.g. a person's ID number) [107]. Thanks to decades of research and development, database systems that implement the relational model work extremely efficiently and reliably [106]. Combined with the long experience of programmers and database administrators working with these tools, relational databases have become the choice for mission-critical applications, which cannot afford the loss of any information in any situation, especially due to faults [44].

2.6 NoSQL Databases

NoSQL is a general term for non-relational databases. Compared to traditional relational databases, NoSQL databases have four advantages: they are easy to extend, offer high performance, have a flexible data model, and provide high availability [44]. NoSQL databases remove the relational features, so database expansion becomes easy. NoSQL databases perform well at reading and writing big data because of their simple structure [98]. Without affecting performance, a NoSQL database can easily achieve a highly available architecture. Therefore, NoSQL databases are more suitable for processing big data. The "Not only SQL" database management systems, or simply NoSQL, were proposed in order to meet the need for efficient storage and access of large amounts of structured and unstructured data [31]. Google, Amazon, Facebook, Twitter and LinkedIn have been among the first companies to discover the serious limitations of the technologies behind relational databases as far as the demands of newer applications are concerned [10]. Because commercial alternatives did not yet exist, they invented new approaches to manage their data. Their pioneering work generated major interest, because an ever increasing number of companies was facing similar challenges. Within a short time period, open-source NoSQL database projects emerged, to which the big companies flocked.

There is not a single definition that covers all of the aspects of NoSQL databases, as they cover a variety of features. However, one thing that they have in common is that they do not follow the mainstream SQL convention [10]. The main characteristics that distinguish NoSQL from traditional RDBMSs are the partitioning of data and data replication. Despite these benefits, most existing large and medium scale software applications are still based on RDBMSs, since there are many challenges associated with the migration process. The first one is the volume of data to be migrated [44]. Organizations, in general, decide to migrate their database when the volume of stored data is huge and the RDBMS is no longer able to satisfy the scalability and high availability expectations. Another challenge is related to the relational database model's manner of avoiding data redundancy, which is part of the NoSQL models. In this case, it is required to keep the new data model semantically identical to the original one. Thus, all existing relationships have to be effectively represented without loss or alteration of data [67]. Finally, in addition to all data and model migration, there is a cost associated with adapting software applications to communicate properly with the new NoSQL database model.

Application needs have been changing dramatically, due in large part to three trends: growing numbers of users that applications must support; growing user expectations for how applications perform; and growth in the volume and variety of data that developers must work with. NoSQL technology is rising rapidly among Internet companies and the enterprise because it aggregates data management capabilities that meet the needs of modern applications: better application development productivity through a more flexible data model; greater ability to scale dynamically to support more users and data; and improved performance to satisfy the expectations of users wanting highly responsive applications and to allow more complex processing of data [48]. NoSQL is increasingly considered a viable alternative to relational databases, and should be considered particularly for interactive web and mobile applications [44].

NoSQL databases, by using an unstructured or structured-on-the-go kind of approach, aim to eliminate the limitations of strict relations, and offer many different ways to keep and work with the data efficiently for specific use cases, e.g. full-text document storage [27]. Most NoSQL systems have been developed independently from one another, each with specific application objectives, but with the general goal of offering rather simple operations on flexible data structures [69]. Indeed, they usually leverage this simplicity to provide high scalability and massive throughput. A feature that is common to almost all systems, and coherent with the principle of simple operations, is that they handle individual items identified by unique keys [9]. Systems differ on the structure that is supported for these individual items, on the type constructors (map, set, list) that can be used, and on the possibility of nesting [69]. Also, structures are flexible, in the sense that schemas are often relaxed or completely absent [75]. This being a relatively young space, consolidated standards are yet to be found, and, given both the number of NoSQL data stores and the differences between them, it is useful to group them in categories according to some criterion [58]. An interesting classification based on data modeling features has recently been proposed [31]; it groups systems into three major families: extensible record stores, document stores and key-value stores. Systems belonging to the same family largely agree on main data structures and access patterns, whereas they may differ in specific operations support, structure details, and architectural aspects such as consistency models and partitioning [44].

2.6.1 Classification of NoSQL Systems

The NoSQL movement came about largely because of the increasing data storage needs of the Web 2.0 industry [31]. The big players in this industry became frustrated with the difficulties of building distributed storage architectures based on traditional RDBMS systems. Because of this, web application developers took matters into their own hands and developed their own database technologies [84]. Google and Amazon are two such companies that are now seen as pioneers of the NoSQL movement. Google's BigTable showed that it was possible to store simple data on a system that could scale to hundreds or thousands of nodes [16]. Amazon's Dynamo database pioneered the idea of losing strong consistency in favor of high availability; data was not guaranteed to be up-to-date on every node, but updates would be applied to each node eventually [84]. Since then, there have been many developments in the NoSQL industry, with many more vendors now providing NoSQL systems. Of the systems currently available, most will fall into one of three categories (a sketch contrasting them follows the list):

• Key-value Stores: All data is stored as a simple key-value index. The key is used to identify a value that is typically stored as a BLOB, but can contain other data types such as strings or pointers. These systems can be equated to the distributed index popularized by the Memcached open-source cache system, which took advantage of the increasing availability of main memory to store in-memory indexes [16]. These systems are highly efficient; they can scale to a high number of nodes, but provide a very simple data model [17]. Examples of key-value stores include Redis, Riak, Scalaris and Voldemort.

• Document-Oriented Stores: Data is stored in semi-structured documents that are indexed by a key. Documents can have a varying number of attributes of varying types, and can also be queried to look for matching attributes contained within the document [16]. They offer additional functionality over key-value stores while maintaining the ability to partition data over multiple nodes, and provide support for replication and automatic recovery [15]. Examples of document stores are MongoDB, CouchDB, and Dynamo.

• Extensible Record Stores: Also referred to as wide column stores, due to the fact that each row in a related data set can contain a varying number of attributes [16]. Of the NoSQL systems, this data model is the closest to the relational data model. Data is stored in tables, but each row has a dynamic number of attributes. Both rows and columns can be distributed across nodes, providing high scalability and availability [53]. Examples of extensible record stores are Google BigTable, Cassandra, HyperTable and HBase.

• Other non-relational database systems in existence have been put into the NoSQL category but do not fit into the three categories above. Graph databases, such as Neo4j, store data as relationships between key-value pairs [53]. Object-oriented databases store data as collections of objects that can be easily materialized as programming language objects [16]. Both of these systems provide features such as horizontal scaling and the ability to store massive amounts of data; however, as Cattell points out [16], they differ from those found in the three categories described above in that they generally provide ACID transactions, and data querying involves complex object behaviour rather than simple key look-ups. This represents a significant characteristic difference from the key-value stores, document stores and extensible record stores.
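To make the three main families concrete, the following sketch (our own illustration; all names and values are hypothetical) shows how the same forum post might be represented under each data model, using plain PHP arrays:

<?php
// Key-value store: the whole record is an opaque value behind a single key.
$kvKey   = 'post:42';
$kvValue = json_encode(['title' => 'Hello', 'author' => 'rahma']);

// Document store (e.g. MongoDB): a queryable document whose attributes,
// including nested lists and sub-documents, may vary between records.
$document = [
    '_id'      => 42,
    'title'    => 'Hello',
    'author'   => 'rahma',
    'tags'     => ['intro', 'welcome'],
    'comments' => [['user' => 'omar', 'text' => 'Nice post']],
];

// Extensible record store (e.g. BigTable, HBase): a row key plus a
// dynamic set of column/value pairs.
$row = [
    'rowKey'  => '42',
    'columns' => ['title' => 'Hello', 'author' => 'rahma', 'tag:0' => 'intro'],
];

echo json_encode($document, JSON_PRETTY_PRINT), "\n";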

2.6.2 Characteristics of NoSQL Systems

Besides placing each system into one of the three categories described above, all of the systems have common characteristics, which allow them to be collectively described as NoSQL databases [16].

• Horizontal Scaling: This is a key feature of NoSQL. Data can be replicated and partitioned over many servers in a shared-nothing architecture, i.e. all nodes are equal and none of the hardware is shared [15]. This enables two important features of NoSQL: storage of large amounts of data, and the ability to use cheaper commodity servers instead of more expensive enterprise-class servers [16]. For example, the CouchbaseDB [21] system derives its name from this characteristic: Cluster of Unreliable Commodity Hardware. As the user base grows and we require a database with the capability to handle the added load, most NoSQL databases can scale as the data grows. Each table in NoSQL is independent of the others. NoSQL provides the ability to scale the tables horizontally, so we can store frequently required information in one table. All the table joins need to be handled at the application level. Thus, data retrieval is fast [84].

Scalability can be categorized based on [96]:

– Read Scaling: Large number of read operations.

– Write Scaling: Large number of write operations.

• Automatic Sharding: Data is automatically spread across all servers in the cluster, also referred to as "elasticity", due to the fact that servers can be added or removed without any downtime. Any new server added immediately begins to receive data from the other servers in the cluster. Data is also replicated across the cluster [36].

• No Schema or Schema-less: In relational databases, for each table we have to define a schema, which specifies the number of columns and the type of data each holds. It is difficult to change the data type of a column, and adding a new column will result in lots of null values in the table. In NoSQL databases, adding or removing columns is easy because we do not have to specify a schema on table creation [15]. Unlike a traditional RDBMS, NoSQL databases are not confined to a rigid data schema. Any record that is inserted can have an arbitrary number of attributes associated with it, and these attributes can be altered at any time [16]. This provides excellent flexibility for applications whose data may not conform to a constant structure and is likely to change regularly (see the sketch after this list). NoSQL is a kind of database that does not have a fixed schema as a traditional RDBMS does. With NoSQL databases, the developer defines the schema at run time. Developers do not write normal SQL statements against the database, but instead use an API to get the data that they need. NoSQL databases can usually scale across different physical servers easily, without needing to know which server holds the data you are looking for [96].

• Simple Query Interface: As the name implies, NoSQL does not support the SQL query language. Instead, querying is provided through various mechanisms, which vary from one distribution to another. For example, Amazon's SimpleDB uses a subset of SQL commands such as SELECT and DELETE along with operations like GetAttributes and PutAttributes [31]. HBase uses another restricted SQL variant called HQL, and CouchDB uses a procedural approach for querying its document-based records [9].

• High Availability: Data is replicated across multiple servers and even across multiple data centres allowing for a highly available configuration that can handle multiple server failures and support disaster recovery [15].

• Weaker Consistency Model: Providing ACID semantics has been a staple feature of RDBMS databases since their inception in the 1970s. However, in a distributed architecture, the consistency property becomes more difficult to guarantee. Because the web has enabled 24x7 access to applications, availability has become a high priority for a database system. Because of this, developers are willing to lose strong consistency in favour of higher availability. NoSQL systems make this sacrifice, offering eventual consistency instead of strong consistency. This means that it may not be possible to get a consistent view of data across all nodes at any one time [46]. NoSQL databases may dispense with various portions of ACID in order to achieve certain other benefits: partition tolerance, performance, distribution of load, or linear scaling with the addition of new hardware [51]. As far as when to use them, that depends entirely on the needs of the application.
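As a concrete illustration of the schema-less characteristic described above, the following sketch (assuming the mongodb/mongodb Composer package and a local MongoDB server; the database, collection and field names are hypothetical) inserts two records with different attribute sets into the same collection, with no schema change required:

<?php
require 'vendor/autoload.php';

$client = new MongoDB\Client('mongodb://localhost:27017');
$items  = $client->cms->items;   // database and collection names are illustrative

// An image item carries a resolution attribute...
$items->insertOne(['name' => 'Sunset', 'author' => 'omar',
                   'type' => 'image', 'resolution' => '1920x1080']);

// ...while a video item carries a length attribute instead. No ALTER TABLE
// or null-filled columns are needed; the new field exists only on this record.
$items->insertOne(['name' => 'Tutorial', 'author' => 'sara',
                   'type' => 'video', 'length' => 320]);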

2.6.3 NoSQL Use Cases

Choosing a data storage platform for an application is no longer the straightforward choice that it once was. The relational database used to be the only option available, and the decisions at this level usually revolved around which vendor to choose and which version of that vendor's DBMS. However, with new NoSQL data storage technologies emerging, designers are starting to evaluate data storage needs from a different viewpoint. Like all software systems, databases are subject to evolution as time passes. The impact of this evolution can be vast, as a change to the schema of a database can affect the syntactic correctness and the semantic validity of all the surrounding applications [108]. Estimating the volume of data that an application is likely to produce is becoming an increasingly difficult task, particularly if the application is web enabled and accessible by an arbitrary number of users that could potentially grow into millions [46]. Therefore, scalability is now a much higher priority than it may have been 20 years ago. The database must be able to grow in line with potential demand, and the cost of scaling up a relational database can be a significant barrier. Increasing volumes of data also bring performance considerations, which must be addressed. Read and write operations can be adversely affected if the database is not designed to cope with high volumes of data [79]. These are all problems that many users are facing with relational databases, and for this reason they are starting to look at other options. NoSQL databases can potentially provide the solution. They have better scaling capabilities, because they can be horizontally scaled; they can support large volumes of data with minimal impact on performance through Map-Reduce; and they can provide a simpler method of implementing a distributed database [61]. Some of the use cases where this may be particularly beneficial include content management systems, e-commerce, online games, social networking, real-time analytics, event logging and the operational data store for a website [75]. Due to the diversity of NoSQL solutions, choosing the most appropriate data store for a given use case scenario is a challenging task. This part discusses some general guidelines that can be used in this task and shows examples of applications that use different data stores. The following discussion is mostly focused on selecting a specific data model over others, but when relevant, we also examine the appropriateness of specific data stores [48]:

• Applications with unexpected usage peaks: These are mostly applications exposed to the outside world, such as online shops and product sites, which have a steady user base but can do something at some point that ends up attracting a very large number of users. For instance, the application can get referenced on a popular web page, host a video or audio track that goes viral, or release a marketing campaign that becomes more successful than anticipated (coupons, promo codes, ad words, etc.) [9]. In other words, these are applications that cannot predict the inbound traffic volume and its exact time frame, but have to handle it properly. They know the traffic will come in, but they don't know how much, when, or how it will be distributed. Just like the above category, hardware acquisition might not be worth the costs due to fragmented usage, which can lead to a lot of hardware just sitting there unused [53].

• Applications with periodic and demanding processing needs: These can be applications that run batch analyses on various data and which may become very computing intensive. Batch analyses might include [62]: file conversion from one format to another, reporting, semantic text analysis, text indexing, data clustering, data classification, neural network training for machine learning, and so on. All of these operations are time predictable, in terms of when they occur, and have a high demand for computing power, which translates into a greater need for hardware. Although the hardware might be bought, it might not be worth the investment, as the processing takes a limited number of hours per time unit (day, month, and so on). More than that, a company can decide at a certain point to further accelerate the batch processing so it finishes more quickly [107], say in less than an hour as opposed to 3 or 4 hours. This would be very easy in a cloud environment, just by spinning up some more computing units, while it can become challenging if not impossible in standard hosting due to hardware limitations.

Document-oriented databases should be used for applications in which data need not be stored in a table with uniformly sized fields, but instead has to be stored as documents having special characteristics. Document-oriented databases serve well when the domain model can be split and partitioned across some documents [46]. Document-oriented databases should be avoided if the database will have a lot of relations and normalization. They can be used for content management systems and similar software. The first use cases for document-oriented NoSQL stores are applications dealing with data that can be easily interpreted as documents, such as blogging platforms and content management systems (CMS). The MongoDB documentation uses these applications as acknowledged examples. A blog post or an item in a CMS, with all related content such as comments and tags, can be easily transformed into a document format, even though different items may have different attributes [46]. For example, images may have a resolution attribute, while videos have an associated length, but both share name and author attributes. Moreover, these pieces of information are mainly manipulated as aggregates and do not have many relationships with other data. Finally, the capability to query documents based on their content is also important to the implementation of search functionality [45]. The following is a list of areas where NoSQL is currently being used:

• Social Networks.

• Search engines.

• Geo-spatial analysis.

• Molecular modeling.

• Data Warehousing.

• Caching.

2.6.4 Challenges for NoSQL Adoption

Despite the advantages of NoSQL, there are a number of challenges that the vendors are facing that must be addressed before widespread adoption can occur [34]:

• Vendor Support: Although there are many different NoSQL vendors, many are community-driven and do not provide any formal support structure. Most businesses will look for the assurance of a support contract when choosing a database system, to prevent any potential data loss or unavailability of data. However, some NoSQL vendors do provide a support option for enterprise-level applications. For example, the initiator and sponsors of MongoDB offer two support packages with varying levels of support [96]. Riak provides an Enterprise license for their database system which includes 24x7 customer support, developer support and implementation consultancy at an annual cost. Other vendors will need to follow suit in order to penetrate the enterprise market.

• Data Querying: One of the main advantages of relational databases is the ability to query data using SQL. Apart from some minor differences between vendors, the SQL language can be used practically on all relational database systems. Consequently, extracting data from relational databases is a standardized process that has been in existence for nearly 40 years. Users are familiar with how to use SQL and it makes it easier to transfer from one relational DBMS to another. Because NoSQL is relatively new technology, a standardized query language that is capable of extracting data from all NoSQL database types has yet to be developed. There are efforts in existence that are attempting to achieve this, one of which is UnQL, an unstructured query language for JSON, semi-structured and document databases [96].

However, efforts on this project have slowed recently and it does not seem likely that it will be continued [31]. Relational databases are all based on the use of tables to store data, and therefore a universal language is easier to create. However, with NoSQL, there are several different approaches to storing data, as discussed above, making it more difficult to create a universal language. SQL succeeded because it was easy to learn, easy to read and easy to understand. With NoSQL, you either need to learn the query language that the vendor has created for each individual product, or be proficient in a programming language that is compatible with the NoSQL system (a sketch contrasting the two styles follows this list). For instance, Google has created its own query language called GQL, which is compatible with its own data store products such as Google BigTable. Riak bases its query language on Lucene, an open source search engine written in Java [54]. These are two vastly different approaches to querying data, and in the case of Riak and other vendors that use Java-based languages for querying, it is likely to prevent people from transitioning from the familiarity of relational databases and SQL.

• Immaturity of the Technology: NoSQL databases have only been in existence for a matter of years and many vendors are still in beta stage or are releasing updates continuously. This could be unsettling for potential adopters of the technology as most are looking for ultra-reliability in the system that is managing their critical data. Software often needs to go through many minor and major revisions before bugs are discovered and patched. Also, the immaturity of the technology brings a shortage of expertise in the field. Relational database systems have built up sizeable knowledge bases and technical papers to assist users in the deployment and use of their system.
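To illustrate the query-language gap discussed under Data Querying above, the following sketch (assuming the mongodb/mongodb Composer package; the posts collection and its fields are hypothetical) contrasts a portable SQL statement with the equivalent vendor-specific MongoDB call:

<?php
// The same lookup, first as portable SQL, then as MongoDB method calls:
//
//   SELECT title, author FROM posts
//   WHERE author = 'rahma' ORDER BY id DESC;

require 'vendor/autoload.php';

$client = new MongoDB\Client('mongodb://localhost:27017');

// MongoDB has no standard query language; filtering, projection and
// sorting are expressed through driver-specific structures instead.
$cursor = $client->forum->posts->find(
    ['author' => 'rahma'],
    ['projection' => ['title' => 1, 'author' => 1],
     'sort'       => ['_id' => -1]]
);

foreach ($cursor as $post) {
    echo $post['title'], "\n";
}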

2.6.5 MongoDB

MongoDB was selected as an example of a document-based NoSQL database for our migration experiment. The reasons for choosing MongoDB are:

• It works at small and large scales, unlike other NoSQL databases such as Cassandra and HBase, which are only suited to very large scale deployments;

• It has the capability to replicate any complex queries that may be encountered in a complex relational database;

• It has a good reputation for technical support and good documentation;

• It is a good fit for applications built with MySQL and PHP.

MongoDB is an open-source NoSQL document-oriented database, initiated by 10gen company [62]. It was designed to handle growing data storage needs. It is written in C++ and its query language is JavaScript. MongoDB stores data in the form of collections. Each collection contains documents which have a unique identifier (the document ID). Documents are grouped into collections and each document within a collection can have an arbitrary number of fields [106]. It supports indexing of fields in the same way as a relational database; however, there are no joins between collections. MongoDB documents are stored in a binary form of JSON called BSON format. BSON supports Boolean, float, string, integer, date and binary types [10]. Due to document structure, MongoDB is a schema-less or schema-free database, it is easy to add new fields to a document or to change the existing structure of a model. The basic entity in MongoDB is the document which is the collection of key-value pairs. The RDBMS Record is referred as a Document in MongoDB.A Table in RDBMS is referred as the Collection in MongoDB that contains multiple documents. The Primary key concept of RDBMS is reflected by _id field in MongoDB. The joins can be reflected by linking and embedding. The linking can be done either using manual references or DBRef. The linking of the collections is done by using manual references to achieve normalization. The de-normalization can be related to the embedding of documents in one collection [43]. MongoDB has many powerful features that makes it quite attractive. MongoDB is 2.6. NOSQL DATABASES 28

MongoDB has many powerful features that make it quite attractive. It is highly scalable and has no single point of failure. It is comprised of an arbiter, a master node, and many slave nodes [62]. MongoDB offers a technique named sharding to distribute collections over multiple nodes. Sharding is the distribution of data over multiple machines, which is also part of scaling out data. MongoDB can perform automatic sharding, a DBMS-driven partitioning across various servers. When nodes contain different amounts of data, MongoDB automatically redistributes the data so that the load is equally distributed across the nodes [10]. MongoDB also supports master-slave replication. The slave nodes are copies of the master nodes and are used for reads or backups. Many shards hold data replicas, allowing access to data even if the main node queried for the information should fail. The ease of MongoDB commands has also allowed it to gain popularity and become the most popular NoSQL implementation since 2011 [62].

2.6.6 NoSQL Usage Statistics

A survey of the top one million websites found that relational databases accounted for 51% of database usage and NoSQL databases for 49% during the period March 2014 to March 2015. By the period March 2018 to March 2019, the market share of NoSQL databases had increased to 59%, while the use of relational databases had fallen to 41% [25]. For a single-database-type user considering adding another database type to the mix, it is also of interest which databases, SQL and NoSQL alike, are most popularly used together. Over one third of multiple-database-type use is the combination of MySQL and MongoDB [25]. While MongoDB is often considered an alternative to MySQL, the two databases work well together when properly designed.

2.6.7 Web Application Migration to NoSQL

A web application is an application that is accessed via a web browser over a network (e.g. the Internet or a mobile phone network). A web application can also be a computer software application that is coded in a browser-supported language (such as HTML, CSS, JavaScript, Java or PHP) and reliant on a common web browser to render the application executable [56]. Web application code is typically stored on servers. At launch, the browser uses a web address such as a Uniform Resource Locator (URL) to fetch the web application code. The code is then downloaded to the device and the application is executed, either inside the browser or using the browser functionality. Over the course of execution, additional code can be downloaded and executed. The client device may also store the web application code locally, in which case the web application URL points to a local file. It is also possible that the web application is pre-loaded on the client device prior to the delivery of the client device to a user [30]. This is common with, for example, pre-loaded applications on cell phones or laptop computers. Because of the newness of NoSQL solutions, there has been little effort in explaining and exploring their potential. Unfortunately, although migrations of web applications need to be performed with the latest technology, SQL is still being utilized to perform the migration activities. Lee and Zheng [47] explore the idea of migrating legacy web application data to the Joomla CMS. The effort involves justifying the migration of the legacy applications, along with proposing a tool to handle the migration process. MySQL is used as the RDBMS. The authors are restricted to designing the Joomla schema in the initial phase, which can be a discouraging challenge when the data within the legacy web application is not consistent or has no particular structure.

More research in this field is mentioned in Pokornym's [78] work on migrating library web guides to Content Management Systems (CMS). The result of this work is satisfying, but in that research MySQL is used to perform the migration process, and the proposed work addresses data migration only. The result of this effort is rather complicated, as he creates several tables and relationships between those tables, which potentially has a direct negative impact on the migration process. Voda [99] proposes some migration steps for migrating existing PHP applications to the cloud environment. Unfortunately, he discusses theoretical steps for the migration process and concentrates on moving the data between the two platforms. Yunhua's research [31] looks at the problem from the perspective of a web crawler. In this work, MongoDB features were incorporated into the design of a web crawling application. The application enabled them to crawl information from web pages on the Internet and store it in a MongoDB NoSQL solution. No visible effort has addressed migrating the application code of a legacy web application to a NoSQL solution, so our research studies the migration process of web applications to NoSQL in detail. Along with the case studies of NoSQL solutions, there have been great developments in studying the performance of NoSQL and in comparing it with SQL [79]. These results are satisfying, especially in cases where the data sets are large and horizontal scalability is required, or relations between data sets are beyond SQL's ability [66]. Lombardo et al. [84] discussed issues with complex data structures in NoSQL databases. Li and Manoharan [51] analyzed the performance of SQL and key-value based NoSQL databases. Furthermore, Naheman et al. [66] compared HBase with other NoSQL databases, e.g., BigTable, Cassandra, CouchDB, and MongoDB.

Based on their observation and analysis, NoSQL databases may not always achieve better performance than SQL databases.

2.7 The Need for Automation

Migration of web applications from SQL databases to NoSQL databases is an important problem to address nowadays, because businesses are concerned with how to migrate their legacy web applications based on relational databases (already filled with large amounts of data) to this new class of databases. While migration between different relational data processing systems has been widely studied, migration from SQL to NoSQL, with different structures and a different data model, has not. We review existing strategies for migrating legacy web applications from SQL databases to NoSQL databases, noting the unique challenges: the structural differences between these two kinds of databases, the developer experience and effort needed to migrate the source code to interact with the new database, and the immaturity of the technology and the level of support provided. There is therefore a need for automation to assist in the migration process. In this thesis, we propose a framework and a reusable method based on source transformations to automate the migration process, eliminating the need for manual work and avoiding the technical pitfalls. The resulting automated process has the potential to increase the speed and accuracy of web application migration from SQL databases to document-based NoSQL databases by a large factor, by eliminating most of the hand work and the potential for errors.

2.7.1 Source Transformation

Our framework uses formal source transformations to automate the steps of migrating web application databases from SQL to NoSQL. Source transformation is a programming technique based on rules that generically specify the relationship between the original code and the refactored or reprogrammed code; the rules are applied automatically to input code to perform the changes. While there are many source transformation systems that could be used to implement our framework, in this thesis we have used the TXL source transformation language [20] to encode and implement our transformation rules. TXL is a programming language specifically designed to support source transformation and to manipulate and experiment with programming language notations and features. TXL has been used in many production applications with transformations involving many lines of source code. The TXL transformation process mainly involves three parts:

• a context-free (base) grammar for the source language to be manipulated.

• a set of context-free grammatical changes to the base grammar.

• a set of source transformation rules to implement transformation of the extensions to the base language [20].

The TXL processor parses the source program and converts it into a parse tree, then recursively applies the set of transformation rules, beginning with a main rule, until no match is encountered in the parse tree. TXL finalizes the process by un-parsing the transformed parse tree into the target program [20]. In other words, the TXL process interprets a source code program, creates a tree representation of that program following a grammar, applies the transformation rules, and then creates a new source program from the resulting tree.

Tree elements are marked during the query search if they are involved in query execution. The SQL is still in the tree, so when it is later changed to MongoDB actions, the new actions are placed at the same point in the tree.
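As a concrete illustration, consider the following hypothetical before/after pair; this is not one of the actual transformation rules (which are presented in later chapters), and the collection and field names are assumptions. A PHP statement that executes an embedded SQL query is rewritten in place into the equivalent MongoDB action.

Before and After Source Transformation (illustrative)

<?php
// Before: the application builds and executes an SQL query with the MySQL API.
$result = mysql_query("SELECT user_name FROM users WHERE user_id = $id");

// After: the transformation replaces the call, at the same point in the
// parse tree, with the equivalent MongoDB operation (legacy mongo extension;
// collection and field names are hypothetical).
$result = $db->users->find(
    array('user_id' => (int) $id),   // WHERE user_id = $id
    array('user_name' => 1)          // SELECT user_name
);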

2.8 Query and Application Migration Challenges

A NoSQL database is the broad definition of a non-relational database. MongoDB is one of the most popular NoSQL databases nowadays; it is a schema-free, document-oriented database with great query performance on huge amounts of data and great scalability. More businesses have adopted MongoDB as increasingly large-scale data is stored in it, and MongoDB design philosophy and query optimization issues have become growing concerns [41]. Migration is a process in which a source database, with all its structures and data, is moved to another target database, regardless of whether the databases are of the same or different type. Many web applications start out small and typically use relational databases such as MySQL or DB2. Once they become popular, they may scale to a size at which relational databases impact the performance of the application. Migration of the underlying database technology for these mature applications can be difficult. There are two parts to this process. The first is the migration of the database itself, which includes deciding which parts of the database to migrate, designing the new NoSQL database, and migrating the data. The second part is changing the application to use the new NoSQL documents. Considerable research already exists for schema and data migration; the migration of the code has largely been left for manual migration efforts.

In software development, migration from a relational database to NoSQL is a challenging task for developers and database administrators. In particular, modifying the application source code to comply with the new database tends to be the most critical and challenging task, causing many complications to the migration process.

2.9 Related Work

We can observe today that some companies, such as Google, Facebook, Twitter and Amazon, have joined this new way of massive data computing [80]. It is known that those companies use different databases for different purposes. In fact, these companies have lately been more interested in "NewSQL": an RDBMS type that seeks to provide the same scalable performance as NoSQL systems for online transaction processing (OLTP) read-write workloads while still maintaining the ACID guarantees of a traditional database system, like Google with its Spanner database [89]. Migrating SQL-based applications to ones that use a document-oriented NoSQL database is difficult. Existing tools and techniques target the problem of data and schema migration without considering the migration of the applications that interact with the new schemas. For example, ETL (Extract-Transform-Load) tools implement the interface to some mainstream NoSQL database systems to do data transformation, such as Cassandra, MongoDB, and HBase [41]. However, ETL tools cannot automatically map the existing database to the target database. ETL tools can only complete the data migration process, requiring the user to design their own mapping strategies between different databases. In this section, we present the prior research on migration from relational databases to NoSQL databases. We categorize the related work along two dimensions:

• Schema and Data Migration.

• Query and Application Migration.

The remainder of the section discusses the prior research with respect to these two dimensions.

2.9.1 Schema and Data Migration

When designing a data model for MongoDB, the key consideration is to decide whether to use embedding or references between different documents [43]. This determines the structure of documents, the performance and the data redundancy. Kanade et al. [43] discuss the pros and cons of embedded data and references; experiments were performed to find the extent of normalization and embedding needed to reduce the query execution time of MongoDB. Chung [17] and Li [49] separately proposed two ideas to convert schemas from relational databases to HBase. Chung [17] came up with JackHare, a Map-Reduce framework that converts every table in a relational database into a single HBase table. After the conversion, each table has become a column family in HBase. It provides a concrete conversion model which stores all tables of a relational database in a single HBase table. In the conversion, the data of an SQL table is migrated into a column family, and the column schema of the relational table is remapped into qualifiers of that column family. Li [49] uses three guidelines to guide the schema conversion, where related tables are nested as a whole and transformed into a single HBase table. The three guidelines include grouping correlated data in a column family, adding foreign key references if one side needs to access the other side's data, and merging the attached data tables to reduce foreign keys. However, the latter did not take multilevel nesting into consideration.

Schram [86] proposed an abstraction layer that allows software applications to access data in the NoSQL model transparently, without the need to change the existing queries in the applications. Alza [56] presents a new approach for converting the relational data of any database management system to any type of NoSQL database in an optimized data structure form, without the need to specify the schema of tables and the relations between them. The author also implements simplified prototype software based on this algorithm to test its validity. The algorithm takes into consideration the relations between the tables of the source relational database and allows the user to choose between three different modes, Embedded, Linking, and Auto, to decide how to map these relations to the output database. The chosen mode determines the structure of the output NoSQL database. Roijackers et al. [83] created an abstraction layer between SQL and NoSQL databases. According to the authors, certain data sets show better results when processed by relational databases, and others run better on NoSQL databases. From there, two models are used in such a way that requests are analyzed in order to decide which model would be ideal for processing each request. We adopt this approach to automate the process of deciding which tables to migrate to NoSQL and which to keep relational. Ellison [30] introduces a two-stage approach which accurately estimates the migration cost, migration duration and cloud running costs of relational databases. The first stage obtains workload and structure models of the database to be migrated from database logs and the database schema. The second stage performs a discrete-event simulation using these models to obtain the cost and duration estimates.

2.10 Query and Application Migration

The literature offers some solutions for a semi-automatic migration process, or adjustments to relational models so that they can offer better performance and scalability when dealing with large volumes of data. However, the cost of adapting the source code of applications is usually neglected. In the following sections, we discuss the prior related work on query and application migration in two categories:

• Research-based Tools.

• Industry-based Products.

2.10.1 Research-based Tools

Sellami et al. [87] developed ODBAPI, a streamlined, REST-based PaaS API to execute CRUD operations, i.e., Create, Read, Update and Delete, on SQL and NoSQL databases. Rocha et al. [82] propose a product to run MySQL applications with a MongoDB database, using run-time query migration. This is a proxy approach, where queries are migrated at run-time and the original application source code remains untouched. The work is divided into two parts. The first is migrating the data, a process that moves all data from the SQL database to a NoSQL database; this is the equivalent of the step for which we used the Pentaho data migration tool. They create a special collection with SQL schema information (called Metadata) to aid the query migration. The second part is query mapping, where the SQL statements are transformed into MongoDB actions. They use a special MySQL driver that intercepts SQL queries at run-time and reroutes them to a Java service.

This service executes all queries against MongoDB and returns data to the application as if they had been executed against a MySQL database. For this mapping part, they used a modified version of the product MySQL Proxy [64] to intercept MySQL calls. It communicates with a web service developed in Java which executes the queries. In Java, they parse the query using the library JSQLParser [42]. They create MongoDB operations to execute the equivalent query, using information from the Metadata collection (with the original SQL schema). Finally, data recovered from MongoDB operations is transformed to MySQL format and sent back to the caller application. Their work is similar to the MongoDB Connector for BI product [11], but it seems to support more use cases. There are two products: one with very basic SQL support that translates statements directly to MongoDB, and another that manages a virtual database with better SQL support. This is a different approach from our work. We are migrating source code, whereas they are migrating queries at run time without changing the original application. Our work migrates source code with static analysis using a source transformation language. The migration is done once, and the target application is changed to operate with MongoDB. This means that PHP code that dynamically creates SQL statements has to be changed to PHP code that creates MongoDB operations. Serrano et al. [27] describe a methodology for migrating applications relying on relational databases to HBase back-ends. The authors describe how to create the NoSQL schema for best performance, and how to query the database. The paper describes a methodology; it is not an automation tool. To apply it, developers have to redesign the database following the guidelines, and manually change all source code that accesses the database.

2.10.2 Industry-based Products

There are some industrial products and tools that help in migrating SQL queries to NoSQL ones. These products are aimed at using SQL applications with a MongoDB database, not at migrating the applications. This makes sense commercially, since they can also be used with applications where source code is not available. For example, the Unity driver product [61] uses JDBC, an API used in Java to access databases, which is equivalent to PHP Data Objects (PDO) [77]. The Unity driver shows MongoDB equivalences for SQL statements, which can be an aid to learning MongoDB, but it does so only for very basic SQL statements; other statements are processed by its virtual database. Hue et al. [38] propose using YCSB [105] with a middleware layer that translates SQL queries into NoSQL commands. They tested Cassandra and MongoDB with and without the middleware layer, noting that it was possible to build middleware to ease the move from relational data stores to NoSQL databases. Liao et al. [104] proposed a product that allows an application to use both SQL (MySQL) and NoSQL (HBase) databases at the same time. It consists of two main parts: a DB Converter that does the data migration with a synchronization process to keep the databases up-to-date, using Apache Sqoop [90], a data converter, to transform bulk data between the relational database and the NoSQL HBase database. The application uses SQL to interact with both databases; SQL commands are migrated to NoSQL at run-time. It is a middleware written in C# with ANTLR [6] as an SQL parser and an SQL grammar based on MacroScope, a .NET library [55]. The application has to be changed to use its interface, but it can use SQL to access all databases. It uses the MySQL JDBC driver as the Relational Database (RDB) connector, and an SQL query parser and Apache Phoenix [7] as an SQL translator to connect to HBase. This differs from our approach: it does not migrate source code; rather, it processes SQL queries at run-time.

Our work focuses on migrating and optimizing the embedded SQL queries to interact with the new database system and changing the application code to use the translated queries. As we have seen, there are no studies focused on full web application migration to NoSQL databases and how this migration affects application functionality and performance. Our work attempts to fill this gap by reviewing the literature and proposing a semi-automated framework to migrate application queries and source code based on SQL databases to ones that use document-oriented NoSQL databases. To fill this gap in the migration process, we propose a semi-automated approach to migrate highly dynamic SQL-based web applications (e.g. MySQL) to ones that use a document-oriented NoSQL database such as MongoDB. MySQL is the most popular open source relational database management system. MySQL is a fast, robust, easy to use, multi-user and multi-threaded SQL database server, and it supports ACID transactions and foreign keys [101]. MongoDB is known for its easy and quick setup and for features such as high performance, high availability, automatic scaling, and the ability to support fast iterations [56].

2.11 Conclusion

In this chapter, we have introduced a number of higher-level concepts and basic ideas: databases, database management systems, data storage evolution in web applications, NoSQL database classification and characteristics, the challenges for NoSQL adoption, and the migration of web application databases from SQL to document-oriented NoSQL. We then reviewed recent work related to the conversion to NoSQL databases, highlighting the various techniques and approaches, and identifying the need for automation.

In the next chapter, we introduce our framework, a semi-automated framework for migrating databases in web applications from SQL to a document-oriented NoSQL database, and the different phases in the migration process.

Chapter 3

Proposed Migration Approach

3.1 Introduction

In this chapter, we introduce our migration approach. The approach consists of four phases: the first phase deals with the migration of the schema; the second phase migrates the data from the SQL database into the newly created NoSQL schema; the third phase migrates and optimizes the SQL queries into their equivalent MongoDB actions; and the last phase adapts the application code to use the translated queries.

3.2 Proposed Migration Approach

There are two parts to the approach: the migration of schema and data, and the migration of the actual application code with embedded queries. Our approach provides the following contributions: evaluating and implementing existing schema migration algorithms proposed by Arora et al. [8] and Jia et al. [41] to migrate the schema from MySQL to NoSQL, and migrating and optimizing the embedded SQL queries to interact with the new database system and changing the application code to use the translated queries. Chapters 4, 5, 6, and 7 describe our approach in more detail.

Figure 3.1: Proposed Migration Approach

Fig. 3.1 shows the overall structure of our approach. It is composed of four stages. The first stage deals with the migration of the schema: it identifies which part of the database schema should be migrated and creates the MongoDB NoSQL data model. Not all data in a web application should be stored in NoSQL databases. Structured data such as users, administrators, roles and access control should remain in a relational database. Data that is very dynamic in nature, such as the time-line in Facebook, user posts in Bulletin Board systems, or tweets and posts in an application such as Twitter, is the type of data that should be migrated to a NoSQL database. We implemented the algorithms proposed by Arora et al. [8] and Jia et al. [41] to migrate the database schema from MySQL to a MongoDB NoSQL database. The second stage migrates the data in the tables that were identified in the first stage to the newly created NoSQL database. We apply and evaluate two data migration tools, the Pentaho Data Integration tool [73] and the Pelica Migrator tool [71], to migrate the data from the MySQL database to the migrated MongoDB NoSQL database.

The third stage of the approach uses a combination of backward tracing based on slicing [100] and the technique used by Alalfi et al. [5] to extract the set of SQL statements executed by the application. The queries that involve tables migrated in the first stage are translated. One key component of this translation is query optimization. Existing SQL database engines have built-in query planners that optimize the order of operations in complex queries; NoSQL databases require the application developer to specify the order of operations. A naive translation of an SQL query may therefore result in a sub-optimal query. While the queries may be optimized manually, applying some database query optimization techniques during the translation may result in better NoSQL queries.
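As an illustration of why operation order matters, consider the following hedged sketch using MongoDB's aggregation pipeline through the legacy PHP driver; the collection and field names are hypothetical, and this is not one of the thesis's actual translation rules.

Naive versus Optimized Pipeline Order (illustrative)

<?php
// Naive translation: the $lookup (the NoSQL analogue of the join, available
// in MongoDB 3.2+) runs over every post before the selection is applied.
$naive = $db->posts->aggregate(array(
    array('$lookup' => array('from' => 'users', 'localField' => 'user_id',
                             'foreignField' => '_id', 'as' => 'user')),
    array('$match'  => array('topic_id' => 42)),
));

// Optimized translation: filtering first means only the matching posts
// reach the expensive $lookup stage, mirroring what an SQL query planner
// would do automatically.
$optimized = $db->posts->aggregate(array(
    array('$match'  => array('topic_id' => 42)),
    array('$lookup' => array('from' => 'users', 'localField' => 'user_id',
                             'foreignField' => '_id', 'as' => 'user')),
));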

The last stage of our approach is to use the migrated queries as templates to modify the code of the web application so that it generates the modified queries. This involves backward tracing to identify the statements that assemble each of the SQL statements migrated in the third stage. These statements are migrated to generate the appropriate NoSQL query. The main contribution of this thesis is the migration of the application source code, which comprises the last two stages. To test and validate our approach, we collected a set of applications to migrate to NoSQL. These were a small application, SCARF, a conference paper submission and organization system (15.5 Mb) [85]; intermediate applications, PHPBB v2 and v3, a bulletin board and posts system (321 Mb and 995 Mb respectively) [76]; and a large application, WordPress, a Content Management System (1362.6 Mb) [103]. The smallest application allows us to explore migration strategies, while the intermediate and large applications allow us to evaluate in a larger context. Applications were first installed and exercised to generate initial test data to be migrated. SCARF is a conference paper submission and organization web application; we created 10 users, submitted 25 papers and made 67 posts. PHPBB2 and PHPBB3 are bulletin board applications that allow users to post comments about a variety of topics. For PHPBB2, we created 47 users and 815 topics, and made 1017 posts to populate the data set. For PHPBB3, we created 10333 users, 23931 topics and 28174 posts. In the next chapters, we will examine each of the steps of our migration framework in more detail.

3.3 Conclusion

In this chapter, we introduced our proposed migration framework, which is composed of four phases: application schema migration, data migration to the new NoSQL database, query migration and optimization, and web application code adaptation to use the translated queries. The proposed framework is a novel one for migrating web applications based on SQL databases to ones that use document-oriented NoSQL databases. The proposed framework is flexible enough to allow for different server-side technologies and databases; our approach can be adapted for other document-oriented databases and other classes of NoSQL databases. The framework is evaluated on one of the most popular PHP web applications, PHPBB, to check that the application can be migrated to use the NoSQL database. In the next chapter, we go through the first phase of our framework: the schema and data migration phase. We implement and evaluate two published approaches for migrating the database schema to MongoDB and then describe the main differences between the two approaches. We implement an interface to map the database schema from MySQL to MongoDB and then apply the Pentaho Data Integration and Pelica Migrator tools to migrate the data into the NoSQL schema. We compare the two data migration tools based on their ease of use, data transformation, and level of automation.

Chapter 4

Schema and Data Migration

4.1 Introduction

In this chapter, we discuss the first stage in our migration approach, which deals with the migration of the schema and data. The first step in this phase is to identify which part of the database schema should be migrated and to create the MongoDB NoSQL data model. We implemented and evaluated two schema migration algorithms proposed by Arora et al. [8] and Jia et al. [41]. This phase handles the transformation of the MySQL schema to a MongoDB NoSQL schema. For this purpose, our implemented interface takes the MySQL relational database as input and extracts metadata such as table names, attribute names, attribute data types, relationship names, and indexes. For relationship extraction, the primary and foreign key constraints of each table are used, as the relationship information is helpful for schema conversion. The second step is to migrate the data in the tables identified in the first step to the newly created NoSQL database. To do so, we apply and evaluate two data migration tools to migrate the data from the MySQL database to the new MongoDB NoSQL database. Fig. 4.1 shows the overall structure of the schema migration phase from the MySQL database to the MongoDB database.

Figure 4.1: Schema Migration Process

Not all data in a web application should be stored in NoSQL databases. Data such as users, administrators, roles and access control should remain in a relational database. Data that is very dynamic in nature, such as the time-line in Facebook, user posts in bulletin board systems such as PHPBB, or posts in an application such as Twitter, is the type of data that should be migrated to a NoSQL database. For the data transformation, the tables in the source MySQL database from which the data is to be extracted are selected, and the data is transformed into JSON format for final storage in the target MongoDB NoSQL database. For this purpose, the Pentaho Data Integration (PDI) tool is used. The data transformation steps are shown in Fig. 4.2. The structure of relational databases is usually more complex than that of NoSQL, due to the normalization process in which data is split into multiple related tables [41]. In contrast, NoSQL databases store data in a de-normalized, unstructured or semi-structured way. Thus, the transformation from relational databases to NoSQL ones is not a straightforward process. One of the main differences in NoSQL databases is that joins are not supported in the same way as in relational databases. NoSQL has no explicit JOIN operation, so if the table to be migrated is involved in a join relationship with another table, either both tables must be migrated to a single NoSQL collection, or the code for the NoSQL equivalent of a join must be included in the application. Collections can be linked through fields at the application level, but this requires an extra lookup for every link in the query [24].

Figure 4.2: Data Migration Process

In the relational model of the PHPBB application, for example, the data is stored in tables whose attributes represent forum, post, user and topic information, and whose rows represent individual posts and users. In the document model, the data is transformed into JavaScript Object Notation (JSON) files [108].

4.2 Data Model Design

Data models and schemas define the way a database organizes its data. Data modeling is an activity to represent the flow of data by documenting the system design with the help of text and symbols [43]. The activity of data modeling leads to the definition of a database schema. A database schema refers to the logical organization or structure of data, which provides a logical grouping of data entities such as tables, fields, relations, functions and documents [8]. In traditional relational databases, the schema must be defined before the insertion of data. This means the schema is inflexible and cannot easily be changed as application requirements evolve [23]. A database schema can be fixed or dynamic, and NoSQL databases have evolved to address this inflexibility. In the landscape of databases, NoSQL refers to the family of databases that are non-relational in nature and, unlike traditional relational databases, vary in technology and style [23]. It is difficult to handle both the size of data and concurrent actions on data with a standard row-column RDBMS [8].

When designing a data model for MongoDB, the key consideration is to decide when to use embedding or references between different documents [43]. This determines the structure of documents, the performance and the data redundancy. Kanade et al. [43] discuss the pros and cons of embedded data and references. Experiments were performed to find the extent of normalization and embedding needed to reduce the query execution time of MongoDB. There are few existing tools to guide the model transformation and data migration from relational databases to NoSQL databases effectively, because of two major challenges [108]: 1) Model transformation challenge: Due to the lack of automated tools, many model transformation strategies are based on the database administrator's experience, as administrators often design the physical NoSQL model manually based on the existing relational database. This is difficult when dealing with a complex structure of relationships, because it is hard to decide which tables must be embedded together and which tables should use references [108]. If we embed all related tables, performance may be improved, but it may lead to data redundancy. On the other hand, NoSQL does not support the join operation, and if references are used for each table, NoSQL will issue multiple queries when reading related documents. Reading related documents separately several times may result in poor performance. In addition, different applications may require different strategies when transforming the model. Making sure that the model of the target NoSQL database is exactly what we want is not an easy task. 2) Data migration challenge: Currently, there is a lack of fully automated data migration tools.

Table 4.1: Structure Comparison between RDBMS & MongoDB

RDBMS            MongoDB
Table - View     Collection
Row              BSON Document
Index            Index
Join             Embedding - Linking
Partition        Shard
Partition Key    Shard Key

Some tools are partially automated and use simple migration strategies, such as migrating each table of the relational database to a specific collection in MongoDB [108]. So far, there is no existing work that can fully automatically migrate data based on the model information.

4.3 Designing The Schema

In order to convert a PHP application to use a document-based NoSQL database such as MongoDB, the first task is to design a schema for the database. To create the schema for the MongoDB database, we analyze the relational schema manually to determine what data points are required and how the data is structured. This information can then be used to organize the data into the format that fits the MongoDB engine. This schema is then used as the basis for the analysis, in which a number of changes are implemented to evaluate their effect on the application code and functionality. The MongoDB schema is designed by converting the objects from the relational schema into their equivalent MongoDB objects: tables map to collections, rows to BSON documents, and joins to embedding and linking. Table 4.1 shows the structure comparison between SQL and MongoDB.

4.3.1 Schema Conversion

Schema translation is the process of transforming a schema in one data model into a corresponding schema in a different data model [2]. As part of the first two stages of application migration, we evaluated two published approaches for migrating the database schema to MongoDB. Arora et al. [8] transform a relational database into a MongoDB document database. They propose an interface to design the database schema from MySQL to MongoDB, and then apply the Pentaho Data Integration (PDI) tool [14] to migrate the data. The algorithm applies data transformations (embedding documents) and generates the NoSQL structure design. Jia et al. [41] model the NoSQL database, MongoDB, with relational algebra. They propose an approach to model transformation, and they present a tool that aids in the design of the new data structure. Their approach is similar to that of Arora et al. [8], but they give some guidelines for choosing when to embed documents. Their model transformation algorithm only optimizes specific tables instead of the entire relational database, to avoid data redundancy. We chose these approaches because they are applicable to our proposed migration process. The approaches migrate the structure of the MySQL schema to a MongoDB schema. In these approaches, developers extract an Entity-Relationship (ER) model from the existing relational database. If developers are not satisfied with the result of the model transformation, they can modify it, for example by removing some collections, forcing documents to be embedded, or removing some fields. Since the implementations of the two schema migration approaches were not publicly available, we reproduced their algorithms in Java. We then tested our implementation on the migration of the schemas of four PHP-based web applications. Our implementation connects to the source database to obtain the schema: the names of the tables, their attributes and relationships.

The information on the relationships can be retrieved using the primary key constraint and foreign keys of each table. Our implementation provides an interface that allows the user to select the starting tables and columns and, using information about the NoSQL target database server and the algorithms described in the prior research, generate the text files used as input for the data migration tool.
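For illustration, the relationship information can be extracted with queries against MySQL's information_schema. The following PHP sketch lists each foreign key and the table it references; the connection settings and the schema name 'phpbb' are hypothetical, and our actual implementation performs the equivalent lookup in Java.

Extracting Foreign Keys from information_schema (illustrative)

<?php
// Hypothetical credentials; our tool does the equivalent through JDBC in Java.
$pdo = new PDO('mysql:host=localhost;dbname=information_schema', 'user', 'pass');
$sql = "SELECT TABLE_NAME, COLUMN_NAME,
               REFERENCED_TABLE_NAME, REFERENCED_COLUMN_NAME
        FROM KEY_COLUMN_USAGE
        WHERE TABLE_SCHEMA = :db AND REFERENCED_TABLE_NAME IS NOT NULL";
$stmt = $pdo->prepare($sql);
$stmt->execute(array(':db' => 'phpbb'));
foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $fk) {
    // Each row is one edge of the relationship graph used for schema conversion.
    printf("%s.%s -> %s.%s\n",
           $fk['TABLE_NAME'], $fk['COLUMN_NAME'],
           $fk['REFERENCED_TABLE_NAME'], $fk['REFERENCED_COLUMN_NAME']);
}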

4.3.2 Data Migration

In this section, we test two data migration tools to migrate the data from relational MySQL databases to a document-oriented MongoDB NoSQL database for three applications: SCARF, PHPBB, and WordPress. We compare the two tools based on their development agility, user interface, support of data integrity, data correctness verification and level of automation. We collected four applications to migrate the data from MySQL to MongoDB NoSQL: a small application, SCARF (15.5 Mb), two intermediate applications, PHPBB v2 and v3 (321 Mb and 995 Mb respectively), and a large application, WordPress (1362.6 Mb). The smallest application allows us to explore migration strategies, while the intermediate and large applications allow us to evaluate in a larger context. We chose these applications because the transformation from SQL to document-oriented NoSQL suits their type: bulletin board and forum systems, and content management systems that require horizontal scaling to thousands of nodes when handling huge collections of structured and unstructured data sets. Also, all four applications under test are PHP-based applications and are not simple data applications, but have the concepts of posts and news.

Our choice of the combination of PHP-based web applications, a MySQL database, and MongoDB is based on the popularity of these technologies. We used the Pentaho Data Integration tool (PDI) [14], an open-source data integration tool, and Pelica Migrator [71] to migrate the data in our sample applications. The tools have the flexibility to either embed the related tables into MongoDB collections automatically or allow the user to select the columns for joins. We successfully applied the tools to migrate the data residing in the MySQL databases of the applications under test to the generated MongoDB schema, using the parameters generated by our tool. We converted all 7 MySQL tables to MongoDB in the SCARF application. In PHPBB2, 26 tables out of 30 were converted, and in PHPBB3, 64 tables out of 70 were converted to MongoDB. The result was two independent installations of each test system: one with only MySQL, the other with both MySQL and MongoDB.

4.4 Migration Strategy

NoSQL data models differ from relational models in their structure and in the way they store information. Compared with NoSQL databases, the structure of relational databases is more complex because of normalization: according to the rules of normalization, information is split into different tables with join relationships [12]. NoSQL databases, on the other hand, store their information in a de-normalized, unstructured or semi-structured way. Therefore, accurate migration from relational to NoSQL is not an easy task. Some recent methodologies for migration are explained in this section. Han [35] proposed a methodology for data migration from a MySQL relational database to a NoSQL database, implemented in MongoDB. The migration process consists of two steps: loading the logical structure of the source database, and then mapping between the relational model and the MongoDB model.

The second step is dedicated to defining the mapping between the relational model of MySQL and the document-oriented model of MongoDB. The main concern when migrating a relational database to NoSQL is the absence of JOIN operations in the NoSQL database. The document model makes JOIN operations redundant in many cases [16, 31]. If the table to be migrated is joined to another table, we have to migrate both tables to a single collection in NoSQL. In order to avoid cross-table queries in the NoSQL database, Zhao et al. [109] propose a method that follows NoSQL's DDI principles: Denormalization, Duplication and Intelligent keys. In this method, related tables are aggregated into one big table, and then the most suitable key, called the row key, is selected to identify each row. MySQL stores all table schemas in a special database called information_schema, from which each table's primary key (pk) and foreign keys (fk) can be extracted. The previous works mainly focus on improving read performance or on migrating data to a specific type of NoSQL database to achieve high availability and scalability, but they do not provide an effective solution to eliminate reference dependencies among tables in the relational database or to transform the relationships between tables to the NoSQL database.

4.5 Data Migration

Until a few years ago, companies relied heavily on relational database technology to deal with big data. However, Abadi [23] states that accessing petabytes of data efficiently using an RDBMS is very challenging, and solutions like sharding create many other problems. At this point, NoSQL databases emerge as a solution to these challenges. This raises the problem of how to migrate the data from a structured RDBMS to an unstructured NoSQL environment.

Data migration tends to get complicated, as moving large amounts of data between computer systems or into a new format comes with risk [71]. Data migration is the process of transferring data between two or more different databases. Typically, this task is performed automatically and requires only limited interaction with the user. In general, migration of data can occur either during the normal operation of the database or when it is out of service [37]. Data migration is a well-established area for traditional relational databases, but in the NoSQL context the situation is quite immature. Some database vendors provide tools to import and export data from their database; however, the usage of such tools is limited to the specific NoSQL database for which they have been built.

4.5.1 Data Migration Tools Overview

Data migration aims to transfer bulk data from structured data stores such as a relational database to a distributed database, and to acquire schema freedom, easy replication support and high scalability [23]. Related research mainly focuses on the following directions:

• Some tools have been developed to migrate data from the relational database to NoSQL database. Apache Sqoop [90], DataX [26], and Apache Drill [28] are the most popular ones.

• Map-Reduce is a large-scale data processing technique, initially used to analyze text data and build text indexes. Map-Reduce is also used for data analysis and performance improvement over the Hadoop Distributed File System (HDFS) [1].

• Denormalization is a process that seeks to improve the response time for data retrieval while maintaining good system performance for row insertions, updates, and deletions. It can enhance query performance when it is deployed with a complete understanding of application requirements [32].

• Schema conversion provides an approach towards transferring the database's structure in addition to migrating data.

So far, there is no existing framework that can fully automatically migrate data based on the model information. With the development of NoSQL databases, some tools have been developed that provide methods to migrate the data from relational databases to NoSQL databases. For example, ETL (Extract-Transform-Load) tools implement the interface to some mainstream NoSQL database systems to do data migration, such as MongoDB, HBase, and Cassandra [14]. The downside is that those ETL tools cannot automatically map the original database to the target database; they can only complete the data migration process, requiring the user to design their own mapping strategies between different databases. Other tools have been developed to migrate the data from the relational database to a NoSQL database; Apache Sqoop [90] and DataX [26] are two popular ones. Apache Sqoop [90] is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured data stores such as relational databases. However, these tools are not generalized; they are designed for specific types of NoSQL data migration. Regarding the emergent requirements of big data, more and more recent researchers focus on how to integrate SQL and NoSQL. For example, Hsu et al. [36] propose and design a correlation-aware technique for Sqoop, which is utilized to transform the data stored in a traditional relational database into HBase. The proposed technique is able to analyze which tables are frequently used for cross-table queries and then transform this data into the same Hadoop data-node.

In summary, previous work tends to focus on a specific aspect: some focus on model transformation and NoSQL modeling, while others focus on data migration only. Few studies combine the two.

4.6 Schema and Data Migration Experiments

In this section, we illustrate the steps of the schema and data migration for two of the applications under test (PHPBB3 and WordPress). First, we create the MongoDB schema using our implemented interface and generate the input files for the data migration phase. Then we migrate the data using the files generated in the schema transformation phase.

4.6.1 PHPBB3 Schema and Data Migration

The PHPBB Bulletin Board System was implemented to share information among internet users [76]. It is a system that includes functions for Q&A, file and information sharing, and search. PHPBB is a package written in PHP. The PHPBB application is a well-known forum that collects and stores user posts in a MySQL database [76]. It relies on the MySQL relational database for storage of all persistent data, including user registration details, user posts, forum metadata and configuration data such as the board title and pagination limits. In this experiment, the schema and data are migrated to a MongoDB NoSQL database. The sample PHPBB3 database consists of 70 tables. Fig. 4.3 shows part of the Entity-Relationship (ER) diagram of the PHPBB3 application. In the relational model, the data is stored in tables where attributes represent forum, post, user and topic information and rows represent posts and users. In the document model, the data was transformed into JavaScript Object Notation (JSON) files.

Figure 4.3: PHPBB3 Relational Schema

The transformation is divided into two steps: the first step generates the text files corresponding to the tables of the source database using our implemented interface; in the second step, these files are used to load the data into the MongoDB NoSQL database using the Pentaho Data Integration (PDI) tool. The tool has a semi-automatic transformation algorithm in which query translation is done from SQL to the NoSQL query language [8].

We implemented the algorithm proposed by Arora et al. [8] in Java to generate an interface. From this interface, we can select embedding tables and embedded collections, choose columns for joins, and generate input files for Pentaho Data Integration (PDI) [73]. Pentaho then uses these files to load the data into the MongoDB NoSQL database. Our interface consists of the following steps:

• Load the logical structure of the source database: We connect to the source database to obtain the schema. We also need to specify all the information about the NoSQL target database server. In this step, we obtain a representation of the relational model of the source database. For this, we need the names of the tables, their attributes and the relationships. The information on the relationships can be retrieved from the primary key constraint and foreign keys of each table. Fig. 4.4 shows a subset of the PHPBB3 source schema.

Figure 4.4: PHPBB3 Relational Logical Structure

• Mapping between the relational model and MongoDB: The main concern in NoSQL is the absence of JOIN operations. In Fig. 4.5, the relational database uses the user_id field to join the php_users table with the php_topics_posted table to enable the application to report each posted topic to the right user.

Modeling the same data in MongoDB enables us to create a schema in which we embed an array of sub-documents for the posted topics directly within the user document, as described below.

Embedding Array in MongoDB

{
    "topic_id": 1,
    "topic_posted": 1,
    "user": {
        "user_id": 2,
        "user_type": 3,
        "user_name": "rahma"
    }
}

Figure 4.5: PHPBB3 Relational Schema (users and topics_posted)

In this example, the application relies on the MySQL relational database to join four separate tables, while in MongoDB all of the data is aggregated within a single document linked with a single reference to the user document, as shown in Fig. 4.6.

Figure 4.6: PHPBB3 MongoDB Model

The steps involved in the schema and data transformation are explained below:

• A connection with the relational database is established. Once a successful connection has been made to the MySQL server, the existing databases are extracted, and we select from the list the database that needs to be recreated in MongoDB.

• From the table list extracted in step 1, the tables whose corresponding embedded documents are to be created in MongoDB are selected, as in Fig. 4.7. In MongoDB, related data from two or more tables can be stored in a single document. This is known as embedding, and such documents are known as embedded documents [41]. The embedding approach keeps all the related data in a single document, which makes it easy to retrieve and maintain: the whole document can be retrieved in a single query. The drawback is that if the embedded document keeps growing in size, it can impact read/write performance.

Figure 4.7: Schema Migration Interface

• The columns of the embedded tables on which the join has to be applied are selected. Joining the tables is required to generate a single embedded document corresponding to the two tables.

• Using the above information, text files are generated in the format accepted by the Pentaho (PDI) tool, as shown in Fig. 4.8.

• Pentaho is used to load the data into MongoDB [73]. The tool takes the text files as input and generates the corresponding collections in MongoDB, as in Fig. 4.9: it accepts these files, creates the collections, and loads the data into them, after which we can design the transformation. Transformation is the process of creating collections and moving the data into MongoDB using the generated input files.

Figure 4.8: Generated Text Files

To use MongoDB with PHP, we need to download the MongoDB PHP driver and add php_mongo.dll to the PHP extension directory.

Figure 4.9: Pentaho Tool Interface
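Once the extension is loaded, connectivity to the migrated database can be verified with a few lines of PHP. This is a minimal sketch; the host and the database name 'phpbb' are placeholders.

<?php
// Legacy mongo extension (php_mongo.dll); placeholder host and database name.
$client = new MongoClient('mongodb://localhost:27017');
$db     = $client->selectDB('phpbb');
echo 'Collections: ', implode(', ', $db->getCollectionNames()), "\n";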

4.7 WordPress Schema and Data Migration

This section explains how we transform the WordPress schema and data into a MongoDB NoSQL database. We apply our interface, based on the Arora et al. algorithm [8], to convert the WordPress database schema into a MongoDB model, and the Pentaho data migration tool to transform the WordPress data into MongoDB. Fig. 4.10 shows the WordPress Entity Relationship (ER) diagram. This diagram shows the MySQL database model of the WordPress schema and demonstrates the relations between the tables. We use this schema definition to design the schema in MongoDB.

4.7.1 WordPress Schema Conversion

WordPress is a Content Management System (CMS) that is widely deployed to establish websites [103]. According to the 2018 CMS survey, WordPress controls nearly 60% of the CMS market, and nearly 34% of the top websites run on WordPress [19].

[19]. W ordP ress, as most other modern content management systems, is a very database- centric application [102]. It keeps all information in the database such as blog settings, posts, comments, links, and users. Therefor, it is important to understand how the database is organized, what types of data is stored there, and how different tables are linked to each other. W ordP ress SQL database is composed of a total of 12 SQL tables. After our exam- ination of the schema, W ordP ress SQL tables can be divided into two groups. One group is composed of 4 SQL tables and the other one is composed of 6 SQL tables. The remaining 2 tables are independent, i.e., no foreign key exists. All of these tables are connected with each other and have one-to-many relationships between them. The most important tables are wp_users and wp_posts. The Wp_users contains the user0s data and wp_posts table contains the data about all the posts. Meta tables store the information about the features of the users, posts and comments. The Wp_links table holds the information about the links entered into the W ordP ress links feature. The options that are set under the administra- tion tab in W ordP ress are stored in the wp_options table. The categories for both posts and links and the tags for posts are found within the wp_terms table. Posts are associ- ated with categories and tags from the wp_terms table and this association is maintained in the wp_term_relationships table. The below figures show part of the converted MongoDB schema.

WordPress NoSQL schema (posts and postmeta)

table"wp _posts" do table"wp _postmeta" do column"ID",:key column"meta _id",:Key column"post _author",:integer column"post _id",:integer,:references=>"wp _posts" column"post _date",:datetime column"meta _key",:string column"post _content",:text column"meta _value",:text column"post _title",:text end column"post _status",:string end 4.7. WORDPRESS SCHEMA AND DATA MIGRATION 66

WordPress NoSQL schema (users and usermeta)

table"wp _users" do table"wp _usermeta" do column"ID",:key column"umeta _id",:key,:as=>:integer column"user _pass",:string column"user _id",:integer,:references =>"wp _users" column"user _email",:string column"meta _key",:string column"user _status",:integer column"meta _value",:text column"display _name",:string end end

The wp_users and wp_usermeta tables provide an example of the embedded collection approach. In plain SQL, we perform a join to retrieve the data from these two tables; we can avoid the join by embedding the wp_usermeta collection into the wp_users collection. After the modification, the wp_usermeta table schema is shown below.

Embed wp_usermeta and wp_users tables

table"wp _usermeta",:embed_in=>’wp _users’ do column"umeta _id",:key,:as =>:integer column"user _id",:integer,:references =>"wp _users" column"meta _key",:string column"meta _value",:text end

The application can access the data without using join operations. We use the generated translation file to migrate the data from the MySQL database to MongoDB. The same approach is used for the wp_posts and wp_postmeta tables, by embedding the wp_postmeta collection into the wp_posts collection as shown below. However, this is not always a perfect scenario: in this case there will be a single large collection, which could cause performance issues.

Embed wp_postmeta and wp_posts tables

table "wp_postmeta", :embed_in => 'wp_posts', :on => 'post_id' do
  column "meta_id", :key, :as => :integer
  column "post_id", :integer, :references => "wp_posts"
  column "meta_key", :string
  column "meta_value", :text
end
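As a sketch of the benefit, assuming the embedded meta rows appear under an array field named wp_postmeta inside each post document (an assumption about how the migration names the field), the post and its meta data can then be read without a join:

// One findOne() replaces the SQL join
// SELECT * FROM wp_posts p JOIN wp_postmeta m ON m.post_id = p.ID ...
$post = $db->wp_posts->findOne(
    ['ID' => $post_id],
    ['post_title' => 1, 'wp_postmeta' => 1]  // embedded meta array
);
foreach ($post['wp_postmeta'] as $meta) {
    echo $meta['meta_key'], ' = ', $meta['meta_value'], "\n";
}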

4.8 Data Migration Considerations

MongoDB collections are a flat copy of the MySQL table data. Collection attributes must have the correct data types, because MongoDB is a strongly typed database. It is also important that the data is consistent between the two databases, because we are using information from MongoDB and MySQL at the same time. We have to take the following points into consideration when migrating the data from MySQL to MongoDB:

• Document ID: In PHPBB, integer id fields were migrated as is, so identifiers are the same in MongoDB and MySQL, and the application uses the same id fields that are used in MySQL. Indexes were created on the id fields. In SCARF, primary keys have been migrated as MongoID values and renamed to the _id field. This has the advantage of being the default MongoDB field key, so we do not need an extra index here. The field name has been changed, so we need to take this into consideration in the PHP code; we also need to change the code to treat those fields as MongoID, as they are not integers anymore. Since using the MongoDB key requires changing both the field name and the field type, for the automatic migration it is advisable to migrate MySQL primary keys to MongoDB as regular fields with an index (both styles are illustrated in the sketch after this list).

• DateTime fields: The Pentaho data migration tool seems to have a bug with the migration of dates. It reads MySQL dates in the Toronto time zone, but it stores those dates in MongoDB as if they were in UTC. This creates dates with some hours of displacement. This only affects the data migration; the PHP application already uses the correct time zone.

• Boolean fields: MySQL does not have native Boolean fields. The application uses tinyint(1) fields to hold Boolean values, where 0 means false and 1 means true. These values are migrated as true/false Booleans by the data migration tool, but we have to take this into account in the PHP code so as not to store integer values in those fields: MySQL automatically converts the type, but MongoDB does not.

• BLOBs with binary data: Pentaho attempts to store binary data in MongoDB string attributes, but MongoDB stores strings as UTF-8 and will reject binary data that is not a valid UTF-8 string. MongoDB has a special data type for binary data, and the data migration tool should use it; we modified the translation file to force its use.
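The following sketch illustrates these type adjustments in PHP. It is a minimal illustration under our own naming assumptions; the field and variable names are not excerpts from the migrated applications:

// Boolean fields: cast tinyint(1) values explicitly before writing,
// since MongoDB will not coerce an integer stored in a boolean field.
$doc['topic_approved'] = (bool) $row['topic_approved'];

// Document id, SCARF style: keys migrated as MongoID values must be
// wrapped before querying, as they are no longer plain integers.
$paper = $db->papers->findOne(['_id' => new MongoId($id_string)]);

// Document id, PHPBB style: integer ids are kept as is, with an index.
$db->Users->createIndex(['user_id' => 1], ['unique' => true]);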

4.9 Data Migration Tools

In our experiments, we apply two data migration tools to migrate the text files generated by our implemented interface into the MongoDB database. The first tool is the Pentaho Data Integration (PDI) tool [73], an open source data integration tool. PDI is part of Pentaho Studio and delivers powerful Extraction, Transformation, and Loading (ETL) capabilities using a metadata-driven approach. It provides an intuitive, graphical, drag-and-drop design environment with a scalable, standards-based architecture. The second tool is Pelica Migrator, a data migration tool from TechGene's [71] that is used to migrate data from an RDBMS to MongoDB. Pelica works only in the Google Chrome browser, while Pentaho is browser independent: it is designed to run in any browser, e.g. IE, Google Chrome, and Safari. Both tools can be deployed on either the Apache Tomcat [93] or JBoss [40] web application servers. Pelica Migrator has features such as column selection, so users can select the columns they need to migrate. It also provides version control: the tool can maintain the history of migrated tables.

4.9.1 Data Migration Tools Evaluation

In this section, we evaluate the two data migration tools based on community experience and the features of each tool described in the tool documentation. The Pentaho tool is a robust and user-friendly tool, as the majority of the data migration process is done automatically. Database administrators and developers require only semantic knowledge of the database to interact with the tool. This makes the data migration process user-friendly with few inputs from the user, which speeds up the migration process. The following points summarize the evaluation of the PDI tool [70] [57] [72]:

• An open-source data integration tool with community support.

• Good development agility: Once we design the transformation for each table, migration from MySQL to a collection in MongoDB is reasonably fast.

• Good scaling: We can easily add fields into the collections at any time.

• Simplicity of querying: Once we migrate the data successfully from MySQL to MongoDB, querying the data becomes straightforward.

• Support for data integrity checking: Pentaho generates HTML and CSV files. Users can check these files to validate that the data in MongoDB is the same as in the MySQL database.

• User-friendly: Pentaho is a Standard Widget Toolkit (SWT) based design tool. It is a user-friendly ETL tool and platform independent, i.e., it works on Windows, Linux and Unix. It is a JavaScript-based tool and supports parallelism [14].

• Partially automated: First, we must have the input files generated from our implemented interface; loading the input files takes a long time if the database is huge, because the whole database has to be re-scanned to produce the input files for the tables. Second, we have to design the transformation for each table, which requires considerable manual work when the database has many tables. Using Pentaho, we can only get one embedded document in each operation, so we have to perform several operations to get all of the desired embedded documents.

Using Pentaho, we can obtain all embedded documents from MySQL, while using Pelica Migrator with the first algorithm we can obtain only one embedded document from MySQL at a time. Consequently, we can apply more complex queries to the MongoDB database that results from Pentaho, and only simple queries to the one that results from Pelica Migrator with the first algorithm. However, if the second algorithm, which generates the input files for all embedded documents, is applied, then the results are the same. Regarding simplicity of querying with the two tools, we obtain MongoDB databases of similar structure, so querying the data is easy and straightforward with both.

4.9.2 Data Migration Tools Comparison

Pentaho Data Integration (PDI) is an open source data migration tool that handles data transformation between relational databases and document-oriented NoSQL databases [13]. It is a comprehensive tool with advanced features such as clustering of ETL processing; these features are available in the open source version of Pentaho but are found only in the commercial version of the Pelica Migrator tool. Pentaho provides a graphical interface called Spoon, based on the Standard Widget Toolkit (SWT) [92], from which we can create two types of treatment: transformations and tasks (jobs). Jobs and transformations are stored in a meta language, which can be stored either in XML format or in a database [13].

By using the two tools, Pentaho and Pelica, we get a similar structure for the resulting MongoDB schema. However, using Pentaho we can generate all embedded documents from MySQL at one time, while in Pelica we need to select the tables for the embedded documents one by one. Also, using Pelica with the first algorithm [8], we get only a single embedded document from the MySQL schema, and we must apply a complex query to get more embedded documents from MongoDB. The two tools provide a mechanism to query directly in SQL, which allows all kinds of join and nested queries. With Pentaho, it is possible to join data from the source database and create a view of the joined data [13].

Both Pentaho and Pelica are robust solutions for performing data migration. Pelica's strength comes from its control flow and data flow [71]: it gives the developer great flexibility to design the structure and the flow of the ETL process. On the other side, Pentaho includes many more options to access outside data, such as Google Analytics [33], and several options to access web services. It can be used on either Windows or Linux operating systems. Pentaho gives the user a graphical user interface to a parallel-processing ETL engine to solve data integration challenges. The user interface reduces data migration complexity by eliminating the need to code data extraction, data transformation and data loads. The main differences between the two tools are:

• The limit on the number of tables that can be migrated at one time. For Pentaho Data Integration, there is no limit on the number of tables that can be migrated from the source database to the target database at one time, while in Pelica Migrator the user can select a maximum of 10 tables to be migrated at one time.

• Pentaho is an open source data integration tool, while Pelica is a commercial data migration tool; we used the trial version of Pelica to test the migration process. Despite the differences in automation level and migration steps, both tools provide an efficient way of migrating the data compared to a manual migration. The experiments showed that both data migration tools are a suitable solution for migrating data from relational databases to document-oriented NoSQL databases.

Data migration tools are designed and used to save time and cost when a new data mart or data warehouse is developed. Pelica mostly satisfies the needs of large organizations, as it can manage large databases, while Pentaho is mostly used by small to medium enterprises, as its speed is limited and it has limited debugging facilities [14]. The choice between the Pentaho and Pelica Migrator data migration tools thus depends essentially on the type of project at hand.

4.10 Conclusion

In this chapter, we discussed the schema and data migration phase of our migration approach. We evaluated two published approaches for migrating database schemas to the MongoDB document database [8, 41] and then applied two data migration tools, Pentaho and Pelica Migrator, to migrate the data from MySQL to MongoDB. We compared the two data migration tools based on their ease of use, data transformation, and level of automation. During the data migration, the model structure of the MongoDB database was created from the relational database schema using our implemented interface.

In the next chapter, we discuss the experiments on manually migrating the applications under test from the MySQL relational database to the MongoDB NoSQL database. The purpose of the manual migration is to assess the feasibility of, and the work required for, migrating the applications from a SQL database to a NoSQL one. The SCARF and PHPBB3 applications were migrated manually to use MongoDB alongside MySQL. From this, we gained information that later guides the automation of similar migrations.

Figure 4.10: WordPress Entity Relationship Diagram

Chapter 5

Manual Migration

5.1 Introduction

In this chapter, we discuss the manual migration process of the SCARF and PHPBB3 applications from the MySQL relational database to the MongoDB NoSQL database. The data is already migrated to the database as explained in the previous chapter. The process involves migrating part of the database to MongoDB: some of the tables, such as the users table and any credentials tables, stay in the MySQL database, while other related tables, such as posts, are migrated to MongoDB.

The purpose of the manual migration is to analyze and test the feasibility of migrating an application database from SQL to a MongoDB database, and whether we can implement a framework to test it and apply it to other similar applications. The manual migration also allows us to explore the different migration methods. The work is done directly on an Ubuntu 14.04.4 server. Comments have been added to the changes in the code so that we can tell what the code is about and what has been changed. We are using MongoDB version 3.2.10, and as for the PHP module, we are using the installed MongoDB module, version 1.6.14.

Our analysis reveals that PHP applications may be migrated automatically to use MongoDB instead of MySQL, but it may not be feasible to do so fully automatically. One of the main problems is that SQL queries are built dynamically. Doing the migration at run time would avoid this issue; this is the approach used in the MongoDB Connector for BI product [11] and by Rocha et al. [82]. However, our goal is to build a process to statically migrate the source code of PHP applications, because translating each SQL statement at run time incurs a penalty that could eliminate any benefit of the move to NoSQL.

Of the 19 PHP files of the SCARF application, 16 were migrated manually into the MSCARF version and 9 of the 19 files were migrated manually into the SCARF3 version. For the PHPBB3 application, 611 of its 2906 PHP files were migrated manually.

To validate that the migrated application retains the same external functionality as the original application, we manually test all interesting functionalities of the original application and compare the interesting parts with the new version. In the manual test, we exercise all pages, links and forms in the application under test, and we track whether all pages were visited and all SQL statements were exercised. We have tests that check that concrete SQL statements are migrated correctly, independently of the migrated application. We also validate that a concrete application works like its original version, so we logged all queries executed while exercising as many functions of the application as possible. Since we migrated SQL-based web applications to MongoDB document-oriented NoSQL, we have tested that the produced MongoDB actions are functionally equivalent to the original SQL statements.

5.2 PHPBB Code Migration

PHPBB3 is a bulletin board application that allows users to post comments about a variety of topics. For the manual migration process, we created 10333 users, 23931 topics and 28174 posts. In this phase, we migrated some of the PHP pages; all queries were successfully translated, and the application was manually tested to ensure the migrated version had the same functionality. We checked that the migrated application produces the same output with an equivalent database on all pages. In this manual migration, we get rid of the intermediary layers and place the PHP code directly where we see that it would have the same result.

The PHPBB3 application uses SQL statements all over the source code. It has classes to manage differences between databases, but they receive SQL statements as strings. These SQL statements use joins with multiple tables and SQL functions. Our approach creates a MongoDB connection where the SQL connection is made, and uses this connection instead of the SQL one when needed. In PHPBB, integer id fields were migrated as is, so identifiers are the same in MongoDB and MySQL, and the application uses the same id fields that are used in MySQL. Indexes were created on the id fields. The approach generates MongoDB actions whose outputs are the same as those of their SQL counterparts; by doing so, we can maintain the same business logic. PHPBB3 has a plugin system, which is used in some places when preparing SQL statements. The plugins are not used in the current application, but the hooks are fired to give PHPBB3 plugins a chance to change SQL query strings. It is important that the data is consistent between the databases, because we are using information from MongoDB and MySQL at the same time. Since MongoDB is a strongly typed database, it is important to migrate the collection attributes with the correct data types.

• The database connection for the MySQL database is made in the functions.php file.

MySQL Database Connection

mysql_connect($hostname, $username, $password) or die(mysql_error());
mysql_select_db($dbname) or die(mysql_error());

We have replaced it with the MongoDB connection. It will give us the global variable $db to access it.

MongoDB Connection

$manager = new MongoClient($mongourl);
$db = $manager->selectDB($dbname);

Most database accesses are done in the PHP global scope. When done inside a function, we need to declare the usage of this global variable. We have created a class mongo_controller in the mongo/mongo_controller.php file to manage MongoDB connections. It is accessible through the global variable $mongo. This is the same approach used for the SQL databases, which use the global variable $db.

We are keeping the SQL statements along with the migrated code, differentiated by a condition on $mongo->enabled. This way we can easily see the changes. For example:

MongoDB optimized Statement

if (!$mongo->enabled) {
    $sql = 'SELECT forum_id FROM ' . TOPICS_TABLE . " WHERE topic_id = $topic_id";
    $result = $db->sql_query($sql);
    $forum_id = (int) $db->sql_fetchfield('forum_id');
    $db->sql_freeresult($result);
} else {
    // Get attribute 'forum_id' from the document with $topic_id in the Topics collection
    $row = $mongo->db->Topics->findOne(
        ['topic_id' => $topic_id],
        ['forum_id' => 1]
    );
    $forum_id = $row['forum_id'];
}

Here we can see the first block with the original code, and the second block with its equivalent counterpart using MongoDB.

• Creating Indexes: We used some migrated attributes to identify concrete documents, so it is advisable to create an index on these attributes:

phpbb Indexes

db.Users.createIndex({"user_id":1}, { unique: true }); db.Posts.createIndex({"post_id":1}, { unique: true }); db.Topics.createIndex({"topic_id":1}, { unique: true });

• Migrating the text search function: The application has a text search function to search for words in posts. With MySQL, we can use MATCH, and with MongoDB the $text operator. This works well for the general search, but in the PHPBB advanced options we can specify to search only in the title or only in the post text, which the MongoDB $text operator does not allow. So, we have to discard the $text operator and develop a more manual search (a sketch follows this item). PHPBB has four different implementations of text search depending on the SQL database; we use fulltext_native.php, which is for databases that do not implement full-text search functionality.
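A hedged illustration of the two search paths, assuming a text index over the subject and body fields (the collection and field names follow the PHPBB schema but are our own assumptions; the real implementation follows fulltext_native.php):

// General search: a text index over subject and body supports $text.
$db->Posts->createIndex(['post_subject' => 'text', 'post_text' => 'text']);
$cursor = $db->Posts->find(['$text' => ['$search' => $keywords]]);

// Title-only search: $text cannot be limited to one of the indexed
// fields, so we fall back to a regular expression on that field.
$cursor = $db->Posts->find(
    ['post_subject' => new MongoRegex('/' . preg_quote($keywords, '/') . '/i')]
);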

We migrated viewtopic.php, which represents the view where you can see the topic content of a forum and all of its posts; it is the page where posts are listed. viewtopic.php uses only MongoDB for all accesses to Posts, Topics and Users. We tested the code on several cases (for example, posts with polls). In the viewforum.php file, which lists the topics, SQL statements are embedded in the business logic; in these pages, SQL statements are built part by part with conditions and references to multiple tables. viewforum.php uses a calculated field to sort topics, which may need a custom field created in MongoDB. This is one of the core queries. The normal user functionality has been migrated, and the extra functionalities such as the administration control panel (ACP), the moderator control panel (MCP), and the RSS channel feeds have been migrated as well.

5.3 SCARF Manual Migration

SCARF [85], the Stanford Conference And Research Forum, is a PHP based web application designed to help researchers and conference administrators create and maintain discussion forums for their research papers. In SCARF, research papers are uploaded and stored in a database where users can view, comment on and edit them, as well as organize them into sessions. The SCARF application is intended to support interactive conferences such as SIGCOMM [88], for which it was originally developed.

SCARF uses only functional PHP all over the application; in comparison, PHPBB uses some functional PHP, but its database access uses Object-Oriented Programming (OOP). We have checked how the application uses the database: it builds SQL statements all over the source code, and the same PHP file that generates the HTML output contains the SQL statements. SCARF is installed on an Ubuntu 14.04.4 server, and we migrated the application from MySQL to MongoDB manually. The SCARF MySQL database has these tables: authors, comments, files, options, papers, sessions, and users. Fig. 5.1 shows the Entity-Relationship diagram of the MySQL database of the SCARF application. It has SQL statements with many characteristics (joins, calculated attributes, filters with multiple conditions, dynamic queries, etc.). For SCARF, we created a version with all tables migrated, and another version with only some tables migrated to MongoDB. We added comments to the changes in the code, so that we can tell what the code is about and what has been changed. We have migrated the SCARF application manually to the following versions:

• MSCARF is a complete migration of SCARF from MySQL to MongoDB. MongoID is used to identify records, replacing the original integer key fields.

• SCARF3 is a partial migration of SCARF from MySQL to MongoDB. All tables except users and authors have been migrated. In this case, there are some database queries with JOINs between migrated tables and tables that are not migrated.

SCARF MySQL tables and MongoDB Collections

MySQL tables:
    users, authors
MongoDB collections (migrated from SCARF):
    sessions, papers, comments, files, options

In the next section, we illustrate the SCARF3 version, in which all database tables are migrated to MongoDB except the users and authors tables, which remain in the MySQL relational database. The install process (the install.php file) has been adapted to create the needed tables in MySQL and the collections and indexes in MongoDB. The migration is done with the following considerations:

• Embedded Collections: The SCARF MongoDB database has a collection for each table. The schema conversion embeds files in papers, but we keep files as a separate collection, because the application accesses this information separately; changing this data structure would require functional refactoring.

• Key Fields: MongoDB collections use the same key fields as MySQL. These fields are integers defined with AUTO_INCREMENT in MySQL. Since they are expected to be sequential, we use a counter to get new values when inserting documents in MongoDB.

Figure 5.1: SCARF MySQL ER Diagram

Insert document operation in MongoDB

// Get next id from counter
$id = next_counter('paper_id', 'papers');
// Create document
$document = ['paper_id' => $id, 'title' => $title, ----];
// Insert document
$db->papers->insert($document);

The function next_counter is defined in the functions.php file.
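A minimal sketch of what next_counter could look like, assuming a counters collection with one document per sequence (the collection name and the exact body are assumptions; only the function's role is described here):

function next_counter($field, $collection) {
    global $db;
    // Atomically increment and return the per-table sequence value;
    // upsert creates the counter document on its first use.
    $result = $db->counters->findAndModify(
        ['_id' => $collection . '.' . $field],
        ['$inc' => ['seq' => 1]],
        null,
        ['new' => true, 'upsert' => true]
    );
    return $result['seq'];
}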

It is also a good idea to create indexes for the corresponding MySQL primary keys.

MongoDB Indexes

db.papers.createIndex({'paper_id': 1}, {'unique': true});
db.sessions.createIndex({'session_id': 1}, {'unique': true});
db.comments.createIndex({'comment_id': 1}, {'unique': true});
db.options.createIndex({'name': 1}, {'unique': true});
db.files.createIndex({'paper_id': 1, 'name': 1}, {'unique': true});

• Document id: In SCARF, the primary key has been migrated as MongoID values and renamed to the _id field. This has the advantage of being the default MongoDB field key, so we do not need an extra index here. The field name has changed, so we need to take this into account in the PHP code; we also need to change the code to treat those fields as MongoID, as they are not integers anymore. MySQL does not have native Boolean fields: SCARF uses tinyint(1) fields to hold Boolean values, where 0 means false and 1 means true. These values are migrated as true/false Booleans, so we have to take this into consideration in the PHP code and not store integer values in those fields; MySQL automatically converts the type, but MongoDB does not.

• Dynamic Queries: The application always sends SQL queries to the database as plain strings. Parameters are passed as part of the string.

SQL Statement of static query

"SELECT comment_id, comment FROM comments where user_id =’". getUserID() ."’ AND paper_id =’$id’" 5.3. SCARF MANUAL MIGRATION 86

Here user_id is set by an explicit string concatenation, and paper_id by a variable substitution. Since string manipulation only affects the parameters, we will consider it a static query.
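A plausible MongoDB counterpart of this static query, under our translation scheme (a sketch, not generated output):

// The two parameter values become fields of the find() filter, and
// the selected columns become the projection.
$cursor = $db->comments->find(
    ['user_id' => getUserID(), 'paper_id' => $id],
    ['comment_id' => 1, 'comment' => 1]
);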

SQL Statement of a dynamic query

"SELECT * FROM comments$where ORDERBY paper _id"

In this case, the query is a dynamic query. The variable $where is expected to contain the where part of the statement, so it is not an SQL parameter.

• Queries with find Operation: These are queries that request documents from only one collection and can be translated to a find operation. If only one record has to be read (as when reading a document by id), it can be translated to findOne operation.

MongoDB find action syntax

db->collection->find([filter[, projection]]);

• All SQL statements are executed through this function defined in functions.php:

SQL Statement

function query($sql) {
    $result = mysql_query($sql);
    if (!$result) die(mysql_error());
    return $result;
}

As it only receives a string, SQL statements will never be prepared statements with parameters. SQL statements use variable substitution and string concatenation to set parameter values.

SQL Statement

$result= query("SELECT * FROM papers WHERE paper_id=’$id’");

• Sometimes SQL statements are built in parts. For example in comments.php at line 36 the variable $where is defined:

SQL Statement

33  if (!isset($_GET['paper_id']))
34  {
35      // Moderate ALL comments mode
36      $where = "WHERE approved='0'";
37  }
38  else
39  {
40      $id = (int) $_GET['paper_id'];
41      $where = "WHERE paper_id='$id' AND approved='1'";
42  }

To be used later at line 116:

SQL Statement

112  $result = query(
113      "SELECT -----
114       FROM comments
115       LEFT JOIN users ON comments.user_id = users.user_id
116       $where ORDER BY paper_id"
117  );

• MySQL cursors: MySQL cursors are created using mysql_query(). They are nested cursors, but this does not affect the migration. These cursors are accessed using mysql_fetch_row() and equivalent functions.

• Queries with a filter: For queries that have filter conditions, when the operation is translated to a find operation, the filter will be the first parameter. The SQL filter condition is text in infix notation that has to be parsed into its equivalent syntax tree.

filter condition

"WHEREa=1 or(b=2 andc=3)" Would be translated to this php array: [’a’=> 1, ’$or’ =>[ ’b’ => 2, ’c’ =>3]]

MongoDB has a less flexible filter system than SQL. SQL statements can be filtered with calculated values or even with sub-queries.

• projection: These are queries that specify which attributes to get. When the operation is translated to a find operation, the projection will be the second parameter. With find, you can select which attributes to get, but not create new calculated values.

• calculated fields: This case indicates that the SQL statement has some calculated value. In most cases, a good solution is to move the calculation into the PHP code. For example:

SQL and MongoDB of calculated value

$result = query(
    "SELECT CONCAT(firstname, ' ', lastname) FROM users WHERE `email`='" . getEmail() . "'"
);
$row = mysql_fetch_row($result);
return $row[0];

Can be translated to:

$row = $db->users->findOne(
    ['email' => getEmail()], ['firstname' => 1, 'lastname' => 1]
);
return $row['firstname'] . ' ' . $row['lastname'];

• sort: Queries that specify which attributes to order by. When the operation is translated to a find operation, we can add a sort operation to the MongoDB action, as in the following example:

SQL and MongoDB order by clause

"SELECT * FROM users ORDERBY lastname" Can be translated to $result=$db->users->find()->sort([’lastname’ => 1]);

Here we can only order by actual attributes, not by calculated fields.

• count: SQL queries that count all records matching a condition. MongoCursor also has a count function to count its documents. For example:

SQL and MongoDB of count function

$count_admin = $db->users->find(['privilege' => 'admin'])->count();

It can also be used to migrate calls to MySQL's mysql_num_rows() function, which counts records.

$result= query("SELECT * FROM users WHERE email=’$email’ AND password =’". mysql_real_escape_string(md5($_POST[’password’])). "’"); $num_rows= mysql _num_rows($result)

• aggregation: SQL queries that will need an aggregation pipeline, mostly because more than one collection is involved. An SQL statement like this:

Query with aggregation pipeline

$result3= query("SELECT firstname, lastname FROM authors LEFT JOIN users USING(user _id) WHERE paper_id=’$row2[paper_id]’ ORDERBY‘order‘"

Can be translated to the following:

$pipeline = [
    ['$match' => ['paper_id' => $row2['_id']]],
    ['$lookup' => [
        'from' => 'users',
        'localField' => 'user_id',
        'foreignField' => '_id',
        'as' => 'user']],
    ['$unwind' => ['path' => '$user', 'preserveNullAndEmptyArrays' => true]],
    ['$sort' => ['order' => 1]],
    ['$project' => ['user.firstname' => 1, 'user.lastname' => 1, '_id' => 0]]];
$result3 = $db->authors->aggregateCursor($pipeline);

To make this translation, we first need to check which table has the attributes when they were not specified. In this case, from the SQL alone we cannot tell which table has the firstname, lastname, paper_id or order columns.

• update set: An SQL statement to update documents by setting constant values for some attributes. We use the $set operator to assign a constant value, and $inc to increase or decrease a numeric value.

Update Statement with $set operator

"UPDATE users SET password=’$pass’ WHERE email=’$email’" It is an update operation witha$set operator $db->users->update([’email’ =>$email],[’$set’ =>[’password’ =>$pass]]);

• update inc: SQL statement to update documents incrementing or decrementing some attributes.

Update Statement with $inc operator

"UPDATE users SET posts= posts+1 WHERE email=’$email’" It is an update operation witha$inc operator $db->users->update([’email’ =>$email],[’$inc’ =>[’posts’ => 1]]);

5.4 Generic Code Migration Cases

In this part, we discuss the general cases of the migration and the differences between the original SQL statements and the equivalent MongoDB actions.

• MongoDB queries return iterators with documents: MongoDB queries return iterators over documents, while in SQL the application gets the results row by row. In SQL, row fields are sometimes accessed by name and sometimes by position; in MongoDB, documents always access attributes by name. So, in code like this:

SQL Statement

$result= query("SELECT name, type FROM files WHERE paper_id=’$id’"); while($row= mysql _fetch_row($result)) { $name=$row[0]; ----- }

In the translation, the loop must be adapted to the iterator, and fields will be referenced by name.

MongoDB loop action

$result2 = $db->files->find(
    ['paper_id' => $id], ['name' => 1, 'type' => 1]
);
foreach ($result2 as $row) {
    $name = $row['name'];
    ----
}

To get an iterator with all information in the root level (no embedded arrays), we use the flat_iterator wrapper.

iterator wrapper

$result = $mongo->flat_iterator($result);

This iterator merges child attributes into the parent root for each element. If we use a lookup, we get this data:

The unwind command

[
  {post_id: 1, topic: [{topic_id: 1, title: 'A'}, {topic_id: 2, title: 'B'}]},
  {post_id: 2, topic: [{topic_id: 1, title: 'A'}]},
  {post_id: 3}
]

After unwinding topic, it will be:

[
  {post_id: 1, topic: {topic_id: 1, title: 'A'}},
  {post_id: 1, topic: {topic_id: 2, title: 'B'}},
  {post_id: 2, topic: {topic_id: 1, title: 'A'}}
]

If we want to keep rows with no topic elements, we can add the unwind parameter preserveNullAndEmptyArrays. It will keep documents where the array field is missing, null or an empty array.
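A minimal sketch of what the flat_iterator wrapper could look like, assuming a generator-based implementation (the body is our assumption; only the wrapper's behaviour is described above):

function flat_iterator($cursor) {
    foreach ($cursor as $doc) {
        $flat = [];
        foreach ($doc as $key => $value) {
            if (is_array($value) && !isset($value[0])) {
                // Embedded document: lift its attributes to the root level.
                $flat = array_merge($flat, $value);
            } else {
                $flat[$key] = $value;
            }
        }
        yield $flat;
    }
}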

• MongoCursor also has a count function to count its documents.

MongoDB count function

$count_admin = $db->users->find(['privilege' => 'admin'])->count();

• Join multiple tables: An SQL statement like this:

SQL Statement

$result3= query("SELECT firstname, lastname FROM authors LEFT JOIN users USING(user _id) WHERE paper_id=’$row2[paper_id]’ ORDERBY‘order‘")

Can be translated to the following MongoDB action:

MongoDB Join action

$pipeline = [
    ['$match' => ['paper_id' => $row2['_id']]],
    ['$lookup' => [
        'from' => 'users',
        'localField' => 'user_id',
        'foreignField' => '_id',
        'as' => 'user']],
    ['$unwind' => ['path' => '$user', 'preserveNullAndEmptyArrays' => true]],
    ['$sort' => ['order' => 1]],
    ['$project' => ['user.firstname' => 1, 'user.lastname' => 1, '_id' => 0]]];
$result3 = $db->authors->aggregateCursor($pipeline);

But be aware that to make this translation, we first need to check which table has the attributes when they were not specified. In this case, from the SQL alone we cannot tell which table has the firstname, lastname, _id or order columns.

• Queries with multiple tables: In MongoDB version 3.2, the $lookup operator was added to the aggregation pipeline [39]. It can be used to simulate an inner join or a left join. It has the limitation that the match and the equality have to be between a single key from each collection.

MongoDB lookup action

$lookup: {
    from: <collection to join>,
    localField: <field from the input documents>,
    foreignField: <field from the documents of the "from" collection>,
    as: <output array field>
}

For example, an SQL statement like this:

SQL Statement

SELECT * FROM authors
LEFT JOIN users USING (user_id)
WHERE authors.paper_id = $paper_id

Can be translated to this MongoDB action:

MongoDB Action

$result = $db->authors->aggregateCursor([
    ['$match' => ['paper_id' => $paper_id]],   // Filter main collection
    ['$lookup' => [
        'from' => 'users',           // Collection to join
        'localField' => 'user_id',   // Field from the input documents
        'foreignField' => '_id',     // Field from the docs of the "from" collection
        'as' => 'user']],            // Output array field
    ['$unwind' => '$user']           // Unwind the output array field
]);

When possible, it is important to filter the main collection before the lookup command, so that MongoDB does not process unnecessary rows. The purpose of the $unwind command is to deconstruct the array field and create a document for each of its elements [11].

• Queries with joins in both databases: These are SQL statements that use some tables migrated to MongoDB together with other tables that have not been migrated (Dual). We have to look at each one to see how we merge the data from the two databases, as in the following examples.

Case #1: The query returns only one row, with some information from another table, as here:

SQL Statement

$row = mysql_fetch_row(query(
    "SELECT users.email, comments.comment FROM comments
     LEFT JOIN users USING (user_id)
     WHERE comment_id='$comment_id'"
));
$author = $row[0];
$comment = $row[1];

We have to split the query into two parts, one for each database, gathering from the first query the required information to perform the second one.

// Get comment document by id:

MongoDB action

$row = $db->comments->findOne(
    ['comment_id' => $comment_id],
    ['comment' => 1, 'user_id' => 1]
);
$comment = $row['comment'];
$user_id = $row['user_id'];

// Get user email:

SQL Statement

$row = mysql_fetch_row(query(
    "SELECT email FROM users WHERE user_id='$user_id'"
));
$author = $row[0];

Case #2: The query returns multiple rows, completed with information from other tables:

SQL Statement

$result = query(
    "SELECT * FROM sessions
     LEFT JOIN users ON sessions.user_id = users.user_id
     ORDER BY starttime"
);

In this case, we can get records from both databases and cross them using the function join_arrays. This function creates a new array merging records on the specified fields. The join_arrays function is defined in the functions.php file (a possible implementation is sketched after the example below).

PHP code merging data from both databases

// Get sessions ordered by starttime
$result = $db->sessions->find()->sort(['starttime' => 1]);
// Create array from iterator
$result = iterator_to_array($result);
// Extract user_id
$userids = array_column($result, 'user_id');
// Join only if there is data
if (!empty($userids)) {
    // Get user information from MySQL
    $sqlresult = query("
        SELECT * FROM users WHERE user_id IN (" . implode(',', $userids) . ")
    ");
    // Convert to array
    $sqlrows = array();
    while ($sqlrow = mysql_fetch_assoc($sqlresult)) {
        $sqlrows[] = $sqlrow;
    }
    // Merge both arrays by user_id
    $result = join_arrays($result, 'user_id', $sqlrows, 'user_id');
}
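A sketch of join_arrays consistent with this description (the real body lives in functions.php; this version is an assumption based on the behaviour described above):

function join_arrays($left, $leftKey, $right, $rightKey) {
    // Index the right-hand rows by their join key for constant-time lookup.
    $index = [];
    foreach ($right as $row) {
        $index[$row[$rightKey]] = $row;
    }
    // Merge each left-hand record with its matching right-hand record.
    $joined = [];
    foreach ($left as $row) {
        $match = isset($index[$row[$leftKey]]) ? $index[$row[$leftKey]] : [];
        $joined[] = array_merge($row, $match);
    }
    return $joined;
}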

• Migrating BLOBs with binary data: The MySQL database has binary data for the PDF files. Files are stored in BLOB fields that the data migration tool is not able to process: the tool stores binary data in MongoDB string attributes, but MongoDB stores strings as UTF-8 and will reject binary data that is not a valid UTF-8 string. MongoDB has a special data type for binary data, so we modify the translation file to force its use as follows:

MongoDB Binary Data

table"files" do column"paper _id",:integer,:references=>"papers" column"name",:string column"ext",:string column"type",:string column"data",:binary before_save do|row| row.data= BSON::Binary.new(row.data) end end

table"papers" do column"paper _id",:integer column"title",:string column"abstract",:text column"pdf",:binary column"session _id",:integer,:references =>"sessions" column"pdfname",:string column"order",:integer before_save do|row| row.pdf= BSON::Binary.new(row.pdf) end end

5.5 Manual Migration Caveats

From the manual code migration, we extracted these caveats:

• SQL statements and actual SQL calls: To map the SQL queries in PHP, we first need to find where the SQL statements are defined and how they are executed. Each application can use different layers of abstraction.

• No one-to-one translation exists for all possible SQL statements: The SQL language is more powerful than the MongoDB language. Not all SQL statements can be translated to a single MongoDB operation.

• Transactions: MongoDB does not have transactions; it only supports ACID transactions at the document level.

• Some SQL statements need schema knowledge to be interpreted: In some SQL statements, we need to know the schema to interpret them. For example, in an SQL statement like this:

SQL Statement

SELECT * FROM a JOIN b ON a.b_id = b.id WHERE col = 0

To be able to migrate this statement, we need to know what table has the col attribute in the WHERE condition.

• Queries that access data from both databases (DUAL queries): These are SQL statements that use some tables migrated to MongoDB together with other tables that have not been migrated. We need to get information from both databases and merge the data.

We handle all these caveats in our proposed automated process.

5.6 Conclusion

In this chapter, we discussed the manual code migration process of the SCARF and PHPBB3 applications, the different migration cases, and the migration caveats. Migrating queries involving multiple tables is normally more complex than migrating others: migrating these queries to process data from two different databases requires much more work, and it is also more prone to errors, as we have to change the logic of the application. A motive not to migrate all of the database tables would be that data from some tables is still accessed by other applications that are not being migrated; this could be another application using the same user table, for example. This was a manual migration; when it is performed on a large application such as WordPress, the overhead to join data from multiple sources is less than the work needed to migrate all the tables.

In the next chapter, we describe the second phase of our migration framework, the query migration and optimization phase. We start with the naive query translation from MySQL to MongoDB. Then we discuss the different query migration cases and how we can optimize the queries by applying some database query optimization techniques.

Chapter 6

Query Migration and Optimization

6.1 Introduction

The second phase in the migration process is the query migration and optimization phase. Fig. 6.1 illustrates our query migration and optimization process. There are three steps to the query migration process in our approach. The first step is to extract the queries from the application and classify them based on their use of tables from the schema. The second step is to migrate each of the queries and translate them to the equivalent MongoDB actions. The third step in the process is query optimization, where we change the SQL statement before migrating it to produce a better migration. Our approach is implemented in the source transformation language TXL [20], using a modified MySQL grammar adapted from SQL2XMI [3], our own MongoDB API grammar, and the official PHP grammar from the TXL website [94].

6.2 Query Extraction Phase

The first phase in the query migration process is the query extraction phase, where we start by extracting the queries from the web application. We instrument calls to the mysql_query function.

Figure 6.1: Query Migration Process

The final computed string that is passed to this call is logged along with the location (file and line) of the call. This provides an inventory of all the queries executed by the application. PHPBB3 does not have the queries as they are executed in its source code, so to obtain a sample of the queries as executed, we run the application and instruct MySQL to write all queries to a file as they are executed. The idea is to execute the program and observe what queries it really executes.
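A minimal sketch of this instrumentation (the wrapper name and the log path are assumptions; SCARF already routes all queries through a single function, which makes such a wrapper easy to install):

function logged_query($sql) {
    // Record the final computed query string together with its call site.
    $caller = debug_backtrace();
    $site   = $caller[0]['file'] . ':' . $caller[0]['line'];
    file_put_contents('queries.log', $site . "\t" . $sql . "\n", FILE_APPEND);
    return mysql_query($sql);
}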

6.3 Query Classification Phase

The second step in the query migration process is query classification, where we classify the SQL statements, using TXL, based on their database access as described in Chapter 4. We identified which tables in the database are migrated to NoSQL and which remain in the relational database. From the list of extracted SQL statements, a TXL program classifies each statement into a category depending on its database access. The first category (Unchanged) consists of SQL queries that use only tables that are not part of the migration process and remain in the relational database. An example might be the table that identifies users and their account preferences. These queries are obviously left alone. The following TXL program illustrates the definition of the classification rule for the Unchanged category.

TXL Classification Rule Definition of Unchanged Table Category

% Treat SQL statements that use only tables in the mysql database
rule ifUnchanged
    replace * [TransformableStatement]
        sql_statement [MySQLStatement] _ [opt ';]
    construct tables [repeat tableName]
        _ [^ sql_statement]
    where all
        tables [notMigratedTable each tables]
    export TXLexitcode [number]
        3
    by
        sql_statement ': 'unchanged
end rule

The second category of queries (Single) are queries that involve only a single table that has been migrated to MongoDB. These represent a simple translation from the SQL to the equivalent API for MongoDB. The following TXL program illustrates the definition of the classification rule for the Single table category.

TXL Classification Rule Definition of Single Table Category

% Treat SQL statements that use only one table,
% and this table is migrated to mongodb
rule ifSingle
    replace * [TransformableStatement]
        sql_statement [MySQLStatement] _ [opt ';]
    construct tables [repeat tableName]
        _ [^ sql_statement]
    construct num_tables [number]
        _ [length tables]
    deconstruct num_tables
        1
    where all
        tables [isMigratedTable each tables]
    export TXLexitcode [number]
        1
    by
        sql_statement ': 'single
end rule

The third set of queries (Double) are queries that involve more than one table, all of which have been migrated in the previous stage. These SQL queries are also migrated from SQL to the equivalent for MongoDB, but they are more complex than the single-table case. The following TXL program illustrates the definition of the classification rule for the Double category.

TXL Classification Rule Definition of Double Category

% Treat SQL statements that use multiple tables, all
% migrated to mongodb
rule ifDouble
    replace * [TransformableStatement]
        sql_statement [MySQLStatement] _ [opt ';]
    construct tables [repeat tableName]
        _ [^ sql_statement]
    construct num_tables [number]
        _ [length tables]
    deconstruct not num_tables
        1
    where all
        tables [isMigratedTable each tables]
    export TXLexitcode [number]
        2
    by
        sql_statement ': 'double
end rule

The last class of queries (Dual) are queries that include both relational and MongoDB tables. This is the most complex migration, as it requires a combination of SQL statements and the MongoDB API. The following TXL program illustrates the definition of the classification rule for the Dual category.

TXL Classification Rule Definition of Dual Category

% Treat SQL statements that use both tables migrated
% to mongodb and tables that remain in the mysql database
rule ifDual
    replace * [TransformableStatement]
        sql_statement [MySQLStatement] _ [opt ';]
    construct tables [repeat tableName]
        _ [^ sql_statement]
    where
        tables [isMigratedTable each tables]
    where
        tables [notMigratedTable each tables]
    export TXLexitcode [number]
        4
    by
        sql_statement ': 'dual
end rule

6.4 Query Translation and Migration Phase

The third step in the migration of SQL statements is the translation phase, which maps each SQL statement to the corresponding MongoDB action. A single TXL program reads the list of tables that have been migrated and processes each of the extracted SQL queries in turn. Separate translation rules were written for each of the three cases. The migration of common parts means that some TXL pattern-matching parts are shared by multiple SQL operations and some TXL rules are shared by multiple patterns; for example, SQL filters (WHERE) and SQL ordering (ORDER BY) are present in multiple SQL patterns.

6.4.1 Naive Queries Translation

The translation program applies a different set of translation rules for each category of queries. The migrated statements are then stored in a mapped directory; output for statements without a mapping is stored in the unchanged directory. If the statement is migrated, the output is the original SQL statement followed by the resulting MongoDB operation. The following query from the SCARF application is an example of one that is not changed: the tables authors and users in the SCARF application are not sufficiently dynamic to benefit from the use of MongoDB.

SQL Statement

SELECT firstname, lastname FROM authors
LEFT JOIN users USING (user_id)
WHERE paper_id='$row2[paper_id]';

We start with direct naive translations, drawn from the SQL to MongoDB mapping table on the PHP website [74]. For example, the following TXL pattern matches the column, table, where, order and limit elements of a SELECT statement.

SQL Statement Pattern

pattern: SELECT [list select_Expr1] FROM [table_references]
         [opt whereClause] [opt orderbyClause] [opt limitClause]

SELECT statements over a single table that select a list of columns are translated to a MongoDB find operation. If there is an ORDER BY clause, a sort() call is added to the find action, and the pattern also supports the LIMIT clause. An example of this type of translation is shown below: we show the original SQL statement and the equivalent MongoDB action.

SQL Statement

// SELECT * FROM users WHERE email='$email'
db.users.find({email: $email});

This is the simplest case of a select all with a simple where clause. It is turned into a find operation with the search fields as a parameter. The TXL program classifies each file with one SQL statement from the source and moves the result to a different directory for each type. This example illustrates a single table with select, where, order and limit clauses.

SQL Statement

SELECT p.post_id FROM phpbb_posts p
WHERE p.poster_id = $param1
AND p.post_visibility = $param2
ORDER BY p.post_time DESC LIMIT $param3;

MongoDB translated action

db.phpbb_posts.find({poster_id: $param1, post_visibility: $param2})
    .sort({post_time: -1}).limit($param3);

Our current implementation covers the SQL language including SELECT, INSERT, UPDATE, DELETE, and CREATE TABLE statements. There are some complex cases involving built-in SQL functions that are not fully covered; however, the cases on the PHP website [74] are covered, as are the cases found in our test applications. A total of 1185 queries were extracted from PHPBB3; they are categorized as follows: 80 unchanged, 968 single, 63 dual, and 74 double queries. For SCARF: 64 single and 22 double queries. For PHPBB2: 254 single, 29 double, and 36 dual queries, with 72 unchanged queries remaining in the MySQL database.

6.4.2 Migration Cases Examples

In this section, we explore some cases of the migrated queries that do not have a direct one-to-one mapping from SQL to MongoDB. For example, some queries specify an order on the results using computed values. The MongoDB find operation can only be sorted by fields, not by computed values, but we need to produce an output from the MongoDB collection with the same order. Instead of migrating SELECT statements with computed sort expressions directly to MongoDB find operations, we migrate the SQL statement to a MongoDB aggregation action. This creates a new computed field that can be used for sorting, as shown in the following example:

SQL Statement

SELECT t.topic_id FROM (phpbb_topics t)
WHERE t.topic_last_post_time >= $param4
ORDER BY t.topic_type DESC, LOWER(t.topic_title) ASC
LIMIT $param5

MongoDB translated Statement

db.phpbb_topics.aggregate([
    {"$match": {topic_last_post_time: {$gte: $param4}}},
    {"$project": {topic_id: 1, computed1: {$toLower: "$topic_title"}}},
    {"$sort": {topic_type: -1, computed1: 1}},
    {"$limit": $param5}]);

In this example, the SQL function LOWER is implemented in the result by the new field computed1, which is computed using the $toLower built-in function. The field is then used in the sort clause of the MongoDB action. Some queries have multiple aggregation functions (e.g. max, min, count), as in the following example from the PHPBB3 application:

SQL Statement

SELECT MAX(post_id) AS last_post, MIN(post_id) AS first_post,
       COUNT(post_id) AS total_posts
FROM phpbb_posts WHERE topic_id = $id

This statement is translated using the migrate-select-aggregation rule to generate a MongoDB aggregation pipeline with the aggregate functions.

MongoDB translated action

db.phpbb_posts.aggregate([
    {"$project": {post_id: 1}},
    {"$match": {topic_id: $id}},
    {"$group": {"_id": null,
        total_posts: {"$sum": {"$cond": [{"$ifNull": ["$post_id", false]}, 1, 0]}},
        first_post: {"$min": "$post_id"},
        last_post: {"$max": "$post_id"}}}]);

The min and max are simple functions at the end of the pipeline, while the count aggregation is implemented as a combination of $sum, $cond and $ifNull. Another case is that some double queries have a join that is not a simple field equality, such as the following SQL statement:

SQL Statement

SELECT t.topic_id FROM phpbb_topics t, phpbb_topics t2
WHERE t2.topic_id = $topic_id
AND t.forum_id = t2.forum_id
AND t.topic_moved_id = 0
AND t.topic_last_post_id > t2.topic_last_post_id
ORDER BY t.topic_last_post_id ASC LIMIT 1

In this example, the join filter includes an inequality to sort topics by the id of the last post in the topic. The lookup operator in MongoDB can only join two collections with a

single field equality, so the result has to be implemented as two MongoDB actions, one for each collection.

MongoDB translated action

db.phpbb_topics.find({topic_moved_id: 0});
db.phpbb_topics.find({"t.topic_id": $topic_id, forum_id: {$in: $list_forum_id}});
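The application code must then complete the inequality part of the join between the two result sets. A minimal PHP sketch of such glue code follows; $db and $topic_id match the thesis examples, while the control flow and the forum_id restriction are assumptions for illustration.

Hypothetical PHP glue for the inequality join

// Hypothetical glue completing the inequality join in the application code.
$t2 = $db->phpbb_topics->findOne(['topic_id' => intval($topic_id)]);

// Candidate topics corresponding to the generated find() actions above.
$candidates = $db->phpbb_topics->find([
    'topic_moved_id' => 0,
    'forum_id'       => $t2['forum_id'],
]);

// Apply the inequality, then emulate ORDER BY ... ASC LIMIT 1.
$best = null;
foreach ($candidates as $t) {
    if ($t['topic_last_post_id'] > $t2['topic_last_post_id'] &&
        ($best === null ||
         $t['topic_last_post_id'] < $best['topic_last_post_id'])) {
        $best = $t;   // smallest topic_last_post_id that satisfies the filter
    }
}
$result = ($best !== null) ? $best['topic_id'] : null;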

PHPBB3 also uses array elements in SELECT statements, as in the following example:

SQL Statement

SELECT * FROM phpbb_forums WHERE id=$row[id]

SQL statements also use other PHP constructions, such as object properties:

SQL Statement

SELECT * FROM phpbb_forums WHERE parent_id=$this->parent_id

SQL query parameter interpolation: SCARF passes query parameters as interpolated variables like:

Query parameter interpolation

SELECT * FROM users WHERE user_id=$id

6.4.3 Dual Queries Translation

DUAL queries request data from both databases: one operation is performed on MySQL and another on MongoDB to get the data for the join. The following examples illustrate the two dual migration cases: in the first, the main table is not migrated and the secondary table is; in the second, the main table is migrated and the secondary table is not.

Migrating a DUAL statement where the main table is not migrated and the secondary table is migrated creates a SELECT SQL statement for the main table and a MongoDB find() action for the other. The second generated statement has a filter with the variable $list, which represents an array of the field values of this join.

Dual Query Translation

Original SQL statement:
SELECT users.email, comments.comment FROM users
LEFT JOIN comments USING (user_id) WHERE email = '$email';

SQL part:

SELECT * FROM users WHERE email = '$email'

MongoDB part:
db.comments.find({user_id: {$in: $list}});

The second example is a DUAL statement where the main table is migrated and the secondary table is not; this creates a MongoDB find() action for the main table and a SELECT SQL statement for the other. Again, the second generated statement has a filter with the variable $list, which represents an array of the field values of this join.

Dual Query Translation

Original SQL Statement:
SELECT email FROM comments LEFT JOIN users USING (user_id)
WHERE comment_id = '$comment_id';

MongoDB part:
db.comments.find({comment_id: $comment_id}, {email: 1, user_id: 1});

SQL part:

SELECT * FROM users WHERE user_id IN ($list);

We have to take this into consideration in the PHP code during the application migration, when we are extracting data from two different databases. The generated PHP code should run an SQL query to get the data from the users table, then a MongoDB call to get the comments information for these records, and then merge the information from the two databases, as sketched below.
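A minimal sketch of that merging code for the first dual example above, assuming the legacy mysql_* API on the SQL side; the variable names and the shape of the merged rows are illustrative only.

Sketch of dual-query merging in PHP

// SQL side: the un-migrated users table stays in MySQL.
$res   = mysql_query("SELECT * FROM users WHERE email = '" . $email . "'");
$users = array();
$ids   = array();
while ($row = mysql_fetch_assoc($res)) {
    $users[$row['user_id']] = $row;
    $ids[] = intval($row['user_id']);
}

// MongoDB side: comments were migrated; filter with $in on the join key.
$cursor = $db->comments->find(['user_id' => ['$in' => $ids]]);

// Merge the two result sets on user_id.
$joined = array();
foreach ($cursor as $comment) {
    $user     = $users[$comment['user_id']];
    $joined[] = array('email'   => $user['email'],
                      'comment' => $comment['comment']);
}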

6.4.4 SCARF Migration Issues

The queries are migrated with TXL rules and correctly executed in PHP. Most SCARF queries run without problems, but there are some issues:

• SQL projection functions: non-aggregate functions that must be executed on the results.

• Inconsistent field data type usage: the PHP functions used in instrumented calls that access fields by position have been changed to use this information to get the correct data.

• Preserve SELECT projection order: SQL functions in the projection that are not migrated in TXL because they were deferred to be managed at the application level and have no values yet. These are functions like string concatenation or date transformations.

To solve the above issues, we start with the third point, since this way we have a defined structure to pass information between TXL and PHP. This requires adding rules to return a more complex structure: it returns the queries to be executed with MongoDB, and a new array with the order expected in the resulting rows. To solve this, we have two options: 1) manage the projection in PHP and translate these SQL functions; 2) use aggregations for all SELECT statements, and try to migrate these functions.

The second issue is the different field data types: sometimes SCARF uses inconsistent types when referencing fields. If an SQL statement references a numeric column with a string, the database does the translation. But in MongoDB, each document can have different field data types for the same field, and the database does not do any translation. If a field has the numeric value 1 and it is searched for as the string "1", it will not be found. This also affects INSERT and UPDATE operations: using the wrong data type can result in documents with different field data types. In SQL, every column has a type and the MySQL API does the translation, but MongoDB does not do type conversion, so we did the type conversion explicitly. This has been solved by doing data casting in the migration using the schema information.

The third issue is related to the SELECT projection order functions: in SQL, the field order of the projection is preserved in the result, but MongoDB does not always respect it. In most queries this is not a problem, but when PHP recovers rows with the function mysql_fetch_row, values are retrieved by position. Since this cannot be solved at the query level, one solution is to reorder the array at the PHP level based on projection information. Our transformation adds a mapping between fields and positions, and the PHP functions that access fields by position use this mapping to get the correct data.

This behaviour can be seen in the User Options page of SCARF, which queries user information with this code:

Select projection order function

$result = query("SELECT firstname, lastname, showemail, affiliation, email
                 FROM users WHERE user_id='" . $id . "'");
$row = s2m_mysql_fetch_row($result);
$first = $row[0];
$last = $row[1];
$showemail = $row[2];
$affiliation = $row[3];
$email = $row[4];

Fields need to be in the specified order because the code reads them by position. The rules for SELECT statements generate code that produces a $rows variable with the MongoDB operation and, when needed, a $cols array with the column positions. For that example, they produce:

MongoDB projection order function

$rows = $db->users->find(
    ['user_id' => $id],
    ['firstname' => 1, 'lastname' => 1, 'showemail' => 1,
     'affiliation' => 1, 'email' => 1]
);
$cols = ["firstname", "lastname", "showemail", "affiliation", "email"];

The PHP interface has been updated to use this information. The most important change is the s2m_mysql_fetch_row function, which retrieves each row as a non-associative array. It now creates the array by placing values in the required order, as in the sketch below.
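A minimal sketch of such a position-preserving fetch, assuming the result wraps both the fetched documents and the $cols order array; the actual signature in the thesis implementation may differ.

Sketch of a position-preserving fetch function

// Hypothetical reordering fetch: returns the next row as a numeric array
// whose positions follow the SELECT projection order in $result['cols'].
function s2m_mysql_fetch_row(&$result) {
    $doc = array_shift($result['rows']);
    if ($doc === null) {
        return false;               // end of result set, like mysql_fetch_row
    }
    $row = array();
    foreach ($result['cols'] as $i => $field) {
        $row[$i] = isset($doc[$field]) ? $doc[$field] : null;
    }
    return $row;
}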

6.4.5 PHPBB Migration Issues

The migrated queries are correctly executed in PHP. Most PHPBB2 and PHPBB3 queries run without problems, but some issues were found with the query migration:

• The applications use special queries to configure and get information from the database, such as SELECT VERSION() AS version and SET NAMES utf8.

• The application uses PHP namespaces. When a PHP file has a namespace, all new code has to be placed after the namespace declaration.

• Queries use these SQL functions in the SQL projection: CONCAT, UNIX_TIMESTAMP, DATE_ADD, and LOWER. We treat these functions in the projection phase of the query migration stage.

The following part illustrates how we solve the above migration issues in PHPBB applications.

• Use aggregations for calculated fields: MongoDB can have calculated fields in aggregations, but not in the find command. We use the find command when possible for Single queries, but while the find command lets you specify which fields you want from the collection, it cannot create calculated fields. For this reason, migrateSelect1, which creates find MongoDB commands from Single queries, has been updated: it only treats queries that have non-calculated fields, so queries with calculated fields are converted to an aggregation by later rules. Rules that generate aggregations use migrateSimpleProjection to create the projection parameters. Now, this function creates a MongoDB equivalent expression for the specified SQL functions; computed SQL expressions are translated with the function buildExp.

• CONCAT: This function concatenates all strings passed as parameters. We use the equivalent MongoDB $concat operator as follows:

CONCAT function

SQL:
SELECT CONCAT(firstname, ' ', lastname) AS fullname FROM users

MongoDB output:
$rows = $db->users->aggregate(
    [['$project' => ['fullname' =>
        ['$concat' => ['$firstname', ' ', '$lastname']]]]]
);
$cols = ["fullname"];

• LOWER: This function converts a string to lower case. We use the equivalent $toLower MongoDB operator, as defined in the following TXL function.

LOWER function definition

% Build a mongo equivalent projection for the LOWER function
function buildExpLower expr [Expr]
    deconstruct expr
        'LOWER '( E [Expr] ')
    construct N1 [js_primary_expn]
        ERR
    construct subexp [js_primary_expn]
        N1 [buildExp E]
    replace [js_primary_expn]
        ERR
    by
        '{ "$toLower" : subexp '}
end function

This is an example of an SQL statement with the LOWER function and the translated equivalent MongoDB action.

LOWER function

SQL:
SELECT LOWER(email) AS email FROM users

MongoDB output:
$rows = $db->users->aggregate(
    [['$project' => ['email' => ['$toLower' => '$email']]]]
);
$cols = ["email"];

• UNIX_TIMESTAMP: This function converts an SQL date to a Unix timestamp. Unix time is a system for describing a point in time, defined as an approximation of the number of seconds that have elapsed since 1 January 1970 at 00:00 UTC [97]. MongoDB does not have an equivalent function, but a subtraction of two dates gives us the difference in milliseconds. The following PHP function was added to convert an SQL date to a Unix timestamp:

UNIX_TIMESTAMP function

function s2m_project($row, $projection) {
    $matches = array();
    // UNIX_TIMESTAMP(field): convert the SQL date value to Unix time.
    if (preg_match('/UNIX_TIMESTAMP\((\w+)\)/', $projection, $matches)) {
        $param = $matches[1];
        if (isset($row[$param])) {
            return strtotime($row[$param]);
        }
    }
    // CONCAT(a, b, ...): concatenate quoted literals and field values.
    if (preg_match('/CONCAT\((.*?)\)/', $projection, $matches)) {
        $params = explode(',', $matches[1]);
        $str = "";
        foreach ($params as $param) {
            if (substr($param, 0, 1) == '\'' && substr($param, -1) == '\'') {
                $str .= trim($param, '\'');
            }
            if (isset($row[$param])) {
                $str .= $row[$param];
            }
        }
        return $str;
    }
}

So, we subtract the date 1970-01-01 from the given date and divide the result by 1000 to get the equivalent Unix time, as in the following example:

SQL statement and equivalent MongoDB action of Unix Timestamp function

SQL:
SELECT UNIX_TIMESTAMP(starttime) AS starttime FROM sessions

MongoDB output:
$rows = $db->sessions->aggregate(
    [['$project' => ['starttime' => ['$divide' =>
        [['$subtract' => ['$starttime', new Date("1970-01-01")]], 1000]]]]]
);
$cols = ["starttime"];

• UNIX_TIMESTAMP with added minutes: the SQL UNIX_TIMESTAMP parameter can have a number of minutes added with the INTERVAL keyword. Since Unix time is in seconds, the interval in minutes is multiplied by 60.

UNIX_TIMESTAMP with added minutes function

SQL:
SELECT UNIX_TIMESTAMP(starttime + INTERVAL duration MINUTE) AS endtime
FROM sessions

MongoDB output:
$rows = $db->sessions->aggregate(
    [['$project' => ['endtime' => ['$add' => [['$divide' =>
        [['$subtract' => ['$starttime', new Date("1970-01-01")]], 1000]],
        ['$multiply' => ['$duration', 60]]]]]]]
);
$cols = ["endtime"];

• DATE_ADD: MongoDB can add milliseconds to a date. To add minutes, we have to multiply them by 60000 first.

Date Add function

SQL:
SELECT DATE_ADD(starttime, INTERVAL duration MINUTE) AS endtime FROM sessions

MongoDB output:
$rows = $db->sessions->aggregate(
    [['$project' => ['endtime' => ['$add' =>
        ['$starttime', ['$multiply' => ['$duration', 60000]]]]]]]
);
$cols = ["endtime"];

Another change is that the character '$' is no longer accepted as part of an id in the grammar. To differentiate id columns from parameters, parameters are now defined with their own grammar entity, Parameter, as shown below.

New Grammar Entity Parameter

define Parameter
    '$ [SPOFF] [id] [SPON]
end define

We have extended the SQL grammar to also allow PHP variables as parameters. It now allows PHP variables (with the "$" sign) in places where expressions are expected. SCARF and PHPBB use variable interpolation to pass parameters to queries. This is not SQL; it is PHP syntax, meaning that PHP strings are built with information from other variables. We accept these variables in the parser to be able to migrate these constructions. SCARF and PHPBB embed some PHP variables inside SQL statements with the "$" sign; they are resolved at runtime. For example, here:

Embedded php variable in an SQL statement

SELECT forum_id FROM phpbb_topics WHERE topic_id = $topic_id

$topic_id is a PHP variable; it is not part of the SQL, but we allow it, and it is migrated to something like this:

Migrated MongoDB action

$db->phpbb_topics->find(['topic_id' => intval($topic_id)]);

We know that $topic_id is a variable that will be replaced at runtime by its value.

6.5 Query Migration Evaluation

Our naive translation was able to generate a set of MongoDB actions for the SQL queries from our test applications, which we tested against the migrated tables in our experimental installations of the applications. The main goal of this phase is to test the correctness of the translation and to identify how to make a better translation. The MongoDB actions produced the same results as the original SQL queries. However, the performance of the naive translation is an issue; these performance results are shown in the query optimization phase section.
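A correctness check of roughly the following form illustrates the idea; runSql and runMongo are hypothetical helpers that execute a query on each system and return rows as associative arrays.

Sketch of a translation correctness check

// Sketch: compare a migrated MongoDB action against the original SQL query.
function sameResults($sqlQuery, $mongoAction) {
    $sqlRows   = runSql($sqlQuery);       // hypothetical helper
    $mongoRows = runMongo($mongoAction);  // hypothetical helper

    if (count($sqlRows) !== count($mongoRows)) {
        return false;
    }
    // Row-by-row comparison; order matters for queries with ORDER BY.
    foreach ($sqlRows as $i => $row) {
        foreach ($row as $field => $value) {
            $other = isset($mongoRows[$i][$field]) ? $mongoRows[$i][$field] : null;
            // Loose comparison: MySQL returns strings, MongoDB typed values.
            if ($other != $value) {
                return false;
            }
        }
    }
    return true;
}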

6.5.1 Query Evaluation Before Indexing

We measured SELECT queries migrated to MongoDB and classified as SINGLE or DOUBLE. Queries were measured on an Ubuntu 14.04.4 server, with 13 GB disk size, 1 GB memory size, an AMD Opteron 23xx 2.0 GHz CPU with MySQL 5.5.47 and MongoDB 3.2.10, with the following collections from the PHPBB3 application database: 28174 posts, 23931 topics and 10333 users.

File / Query                    SQL (sec.)    MongoDB before index (sec.)
double_mongo/phpbb3_0106.js     0.00049       2.12076
double_mongo/phpbb3_0107.js     0.00066       2.04483
double_mongo/phpbb3_0108.js     0.00054       1.98237
double_mongo/phpbb3_0129.js     0.00081       2.30272
double_mongo/phpbb3_0130.js     0.00051       2.17271
double_mongo/phpbb3_0132.js     0.00056       1.31365
double_mongo/phpbb3_0256.js     0.00131       2.65393
double_mongo/phpbb3_0257.js     0.00112       2.70304
double_mongo/phpbb3_0254.js     0.00151       4.46921
double_mongo/phpbb3_0258.js     0.00129       2.65308
double_mongo/phpbb3_0339.js     0.00682       4.54144
double_mongo/phpbb3_0391.js     0.00120       4.87255
double_mongo/phpbb3_0392.js     0.00154       4.74997
double_mongo/phpbb3_0422.js     0.00324       4.38161
double_mongo/phpbb3_0423.js     0.00322       4.01667
single_mongo/phpbb3_0069.js     0.00036       0.17699
single_mongo/phpbb3_0070.js     0.00041       0.10925
single_mongo/phpbb3_0071.js     0.00051       0.10385
single_mongo/phpbb3_0077.js     0.00620       0.14231
single_mongo/phpbb3_0087.js     0.00596       0.05789
single_mongo/phpbb3_0088.js     0.00534       0.05582
single_mongo/phpbb3_0089.js     0.00489       0.05901
single_mongo/phpbb3_0090.js     0.05187       0.07708
single_mongo/phpbb3_0104.js     0.00074       0.19099
single_mongo/phpbb3_0109.js     0.00024       0.18657
single_mongo/phpbb3_0110.js     0.00027       0.17407
single_mongo/phpbb3_0111.js     0.00077       0.12874
single_mongo/phpbb3_0118.js     0.00062       0.21504
single_mongo/phpbb3_0119.js     0.00058       0.15995
single_mongo/phpbb3_0120.js     0.00047       0.10087

Figure 6.2: Query execution time comparison before Indexing

We measured the time needed to execute each query. Tests were run on both the MySQL and MongoDB systems using the same data sets. The test was done on the original migrated MongoDB queries without indexing, and the time is recorded in seconds as shown in Fig. 6.2.


Six queries take much more time than the rest, about 4 seconds each to finish. All of these queries have a filter on a field of the main collection that does not involve other collections. Those filters can be moved before the join, as in the following example:

Non-optimized MongoDB Action

db.phpbb_posts.aggregate([
  {"$lookup": {"from": "phpbb_topics", "localField": "topic_id",
               "foreignField": "topic_id", "as": "t"}},
  {"$unwind": {"path": "$t", "preserveNullAndEmptyArrays": false}},
  {"$lookup": {"from": "phpbb_users", "localField": "poster_id",
               "foreignField": "user_id", "as": "u"}},
  {"$unwind": {"path": "$u", "preserveNullAndEmptyArrays": false}},
  {"$match": {post_id: {$in: $list}}},
  {"$sort": {"post_time": -1, "post_id": -1}}])

The filter on post_id only involves the first collection of the query, so it can be moved:

Optimized MongoDB Action

db.phpbb_posts.aggregate([
  {"$match": {post_id: {$in: $list}}},
  {"$lookup": {"from": "phpbb_topics", "localField": "topic_id",
               "foreignField": "topic_id", "as": "t"}},
  {"$unwind": {"path": "$t", "preserveNullAndEmptyArrays": false}},
  {"$lookup": {"from": "phpbb_users", "localField": "poster_id",
               "foreignField": "user_id", "as": "u"}},
  {"$unwind": {"path": "$u", "preserveNullAndEmptyArrays": false}},
  {"$sort": {"post_time": -1, "post_id": -1}}])

Table 6.1: MongoDB Indexes

Collection   MongoDB Index
Posts        db.Posts.createIndex({post_id: 1});
Posts        db.Posts.createIndex({forum_id: 1});
Posts        db.Posts.createIndex({topic_id: 1});
Posts        db.Posts.createIndex({poster_id: 1});
Posts        db.Posts.createIndex({topic_id: 1, post_time: 1});
Posts        db.Posts.createIndex({post_username: 1});
Posts        db.Posts.createIndex({post_visibility: 1});
Posts        db.Posts.createIndex({post_delete_user: 1});
Topics       db.Topics.createIndex({topic_id: 1});
Topics       db.Topics.createIndex({forum_id: 1});
Topics       db.Topics.createIndex({forum_id: 1, topic_id: 1});
Topics       db.Topics.createIndex({topic_last_post_time: 1});
Topics       db.Topics.createIndex({forum_id: 1, topic_last_post_time: 1, topic_moved_id: 1});
Topics       db.Topics.createIndex({topic_visibility: 1});
Topics       db.Topics.createIndex({forum_id: 1, topic_visibility: 1, topic_last_post_id: 1});
Users        db.Users.createIndex({user_id: 1});
Users        db.Users.createIndex({username_clean: 1});
Users        db.Users.createIndex({user_birthday: 1});
Users        db.Users.createIndex({user_email_hash: 1});
Users        db.Users.createIndex({user_type: 1});

This optimization gives a functionally equivalent MongoDB query. With this change, the performance of the 6 slow queries improved from about 4 seconds to approximately 0.2 seconds.

6.5.2 Query Evaluation After Indexing

In this part, we evaluate the queries of the previous section after applying indexes on the fields. We create all the equivalent indexes from MySQL, as shown in Table 6.1. There are three queries that take more than 10 minutes to finish. These queries join the collection topics with posts by topic_id. We create an index to help the search of posts by topic:

Table 6.2: Query Migration Statistics before & after Indexing

Total of 1042 Single & Double Migrated Queries    Count    %
Before Indexing
    Slower than SQL                               1031     98.94%
    Similar to SQL                                11       1.06%
    Faster than SQL                               0        0%
After Indexing
    Slower than SQL                               140      13.44%
    Similar to SQL                                811      77.83%
    Faster than SQL                               91       8.73%

MongoDB Index for the SQL Posts Table

db.Posts.createIndex({topic_id: 1});

By adding this index, these queries take times similar to the other queries. With indexes, there is no query variant that takes more than 4 seconds, and the fastest variant of the slowest queries has also improved. Table 6.2 shows the query execution time comparison statistics of MongoDB against SQL for the migrated queries of the PHPBB3 application, before and after adding indexes. As the table shows, of a total of 1042 Single and Double executed queries, the majority without indexes took more time to execute in MongoDB than in SQL: 98.94% of the queries were slower in MongoDB, only 1.06% took similar execution time to SQL, and no queries took less time in MongoDB than in the SQL database. After adding the indexes to the collections, the number of queries that took a long time to run decreased by 85.5%, the number of queries taking a similar time in MongoDB as in SQL increased to 77.83%, and 91 queries ran faster in MongoDB than in SQL, an increase from 0% to 8.73%.

6.6 Query Optimization Phase

Database performance is one of the most challenging aspects of an organization's database operations. A well-designed application may still experience performance problems if the queries it uses are poorly constructed [18]. It is much harder to write efficient queries than to write functionally correct ones, so query optimization can significantly improve a system's performance and energy efficiency. The key to tuning queries is to minimize the search path that the database traverses to find the data [18].

As the amount of data increases, performance decreases while execution time and energy consumption increase; optimization of the queries therefore becomes essential. It is important to optimize queries in order to make the documents quickly accessible and to scale as the collections grow. Many factors can affect database performance and query execution time, including the use of indexes, query structure, data models and schema design, as well as operational factors such as architecture and system configuration. Applying some database query optimization techniques, like creating indexes and changing table order during the translation, may result in better NoSQL queries.

Optimization is not a simple field. If queries have to be optimized at this stage, it should be by applying filters as soon as possible; this is what a human developer would do first. Also, if we want to switch table names or join conditions, it is best done by knowing the data: which collections may have more documents and which fields are indexed. This would have different results with different data. So, we may try to change a query to use an index, or it may be best to create a new index. A naive translation of the SQL query may result in sub-optimal queries. While the queries may be optimized manually, applying some database query optimization techniques during the translation may result in better NoSQL queries.

6.6.1 Query Optimization Techniques

There are some general guidelines that every database developer follows to improve the performance of their systems. The goal of database tuning is to improve the execution optimization of the database system, which plays an important role and runs through the entire life cycle of database applications [18]. There are various optimization techniques which can be implemented to make queries run faster and consume less energy. The goals of optimizing the queries include delivering quick response times, using fewer CPU resources, and reducing I/O operations [18]. The following list gives best practices for optimizing the performance of the queries:

• Indexing.

• Aggregation Pipeline Optimization.

• Projection Optimization.

• Table Order Optimization.

1. Indexing: Unnecessary full-table scans cause a huge amount of unnecessary I/O and can drag down an entire database [39]. The tuning expert first evaluates the database based on the number of rows returned by the query. The most common tuning remedy for unnecessary full-table scans is adding indexes [39]. The primary key of a table acts as a default index; additional indexes can be added to a table depending on the data size it holds.

2. Pipeline Optimization: the developer can place the $match as early in the aggregation pipeline as possible. Because $match limits the total number of documents

in the aggregation pipeline, earlier $match operations minimize the amount of processing down the pipeline. If we place a $match at the very beginning of a pipeline, the query can take advantage of indexes like any other. $match filters the documents, passing only those that match the specified condition(s) to the next pipeline stage. The $match stage has the following prototype form:

Match Syntax

{ $match: { <query> } }

In MongoDB version 3.2, the $lookup operator was added to the aggregation pipeline [59]. It can be used to simulate an inner join or a left join. It has the limitation that the join equality has to be between a single key from each collection.

MongoDB lookup operator

{
  $lookup: {
    from: <collection to join>,
    localField: <field from the input documents>,
    foreignField: <field from the documents of the "from" collection>,
    as: <output array field>
  }
}

For example, an SQL statement like this:

SQL Statement

SELECT * FROM authors LEFT JOIN users USING (user_id)
WHERE authors.paper_id = $paper_id

can be translated to this MongoDB action:

MongoDB lookup operator

$result = $db->authors->aggregateCursor([
    ['$match'  => ['paper_id' => $paper_id]],      // Filter main collection
    ['$lookup' => ['from'         => 'users',      // Collection to join
                   'localField'   => 'user_id',    // Field from the input documents
                   'foreignField' => '_id',        // Field from the "from" docs
                   'as'           => 'user']],     // Output array field
    ['$unwind' => '$user']                         // Unwind the user array field
]);

When possible, it is important to filter the main collection before the lookup command; doing so, MongoDB will not process unnecessary rows. The purpose of the $unwind command is to deconstruct the array field and create a separate document for each array element. If we want to keep rows with no users elements, we can add the unwind parameter preserveNullAndEmptyArrays, which keeps documents where the array field is missing, null or an empty array.

3. Aggregation Pipeline Optimization: Aggregation pipeline operations have an optimization phase which attempts to reshape the pipeline for improved performance. For an aggregation pipeline that contains a projection stage ($project or $addFields) followed by a $match stage, the developer moves any filters in the $match stage that do not require values computed in the projection stage to a new $match stage before the projection. Pipeline sequence optimization syntax:

Pipeline Sequence Optimization

$project or $addFields + $match sequence optimization

If an aggregation pipeline contains multiple projection and/or match stages, the developer performs this optimization for each match stage, moving each match filter before all projection stages that the filter does not depend on, as illustrated in the sketch below.
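As a hedged illustration of this reordering, expressed as PHP pipeline arrays (the stage contents are invented for the example): the filter on a stored field moves before the projection, while the filter on the computed field must stay after it.

Sketch of $project + $match reordering

// Before: both filters sit after the projection that creates computed1.
$before = [
    ['$project' => ['topic_id'  => 1,
                    'computed1' => ['$toLower' => '$topic_title']]],
    ['$match'   => ['topic_id' => 5, 'computed1' => 'news']],
];

// After: the topic_id filter does not depend on the projection, so it is
// moved to a new $match before it; only the computed1 filter remains after.
$after = [
    ['$match'   => ['topic_id' => 5]],
    ['$project' => ['topic_id'  => 1,
                    'computed1' => ['$toLower' => '$topic_title']]],
    ['$match'   => ['computed1' => 'news']],
];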

4. Projection Optimization: The aggregation pipeline can determine if it requires only a subset of the fields in the documents to obtain the results. If so, the pipeline will only use those required fields, reducing the amount of data passing through the pipeline.

6.6.2 Filter Option Optimization

MongoDB supports pipeline optimizations where the developer can place the match filter as early in the aggregation pipeline as possible [59]. Because the filter limits the total number of documents in the aggregation pipeline, it minimizes the amount of processing later in the pipeline. As such, in this step, we move all match conditions as early in the pipeline as possible. For example:

SQL Statement

SELECT DISTINCT p.post_id FROM phpbb_posts p, phpbb_topics t
WHERE p.poster_id = $param1
  AND (p.post_visibility = $param2 OR p.forum_id IN $list)
  AND t.topic_id = p.topic_id
GROUP BY t.topic_id, t.topic_last_post_time
ORDER BY t.topic_last_post_time DESC

In the naive approach, the match is placed later in the aggregation pipeline. The optimized version places the match at the beginning of the pipeline, as shown:

MongoDB optimized Statement

db.phpbb_posts.aggregate([
  {"$match": {$and: [{poster_id: $param1},
      {$or: [{post_visibility: $param2}, {forum_id: {$in: $list}}]}]}},
  {"$lookup": {"from": "phpbb_topics", "localField": "topic_id",
               "foreignField": "topic_id", "as": "t"}},
  {"$unwind": {"path": "$t", "preserveNullAndEmptyArrays": false}},
  {"$project": {"post_id": 1, "t.topic_last_post_time": 1}},
  {"$sort": {"t.topic_last_post_time": -1}},
  {"$group": {"_id": {topic_id: "$t.topic_id",
                      topic_last_post_time: "$t.topic_last_post_time"}}},
  {"$group": {"_id": {post_id: "$post_id"}}}])

Applying the suggested optimizer rule: this optimization rule works on generated aggregate MongoDB actions and moves all main-collection conditions that it can before doing the lookup with other collections. Only conditions that reference the main collection directly are moved. We know that at least the 6 worst cases benefit substantially from this optimization, since we tested it manually.

MongoDB action optimization in filtered joins: SQL WHERE conditions are divided into separable parts by the extractWhereLogicalExpr rule. This rule returns a repetition of filters that can be joined with "AND". The addLookupTableReference rule adds all filters that involve only previously referenced collections; it does this before adding any "lookup" for a new join. With this optimization, the slowest queries take much less time to execute. In the tested queries, there still remain 4 queries that take more than one second to execute; these queries have joins that do not have any filter on the main collection.

6.6.3 Table Order Optimization

In the above example, there are two collections, phpbb_posts and phpbb_topics. The query is implemented as an aggregate pipeline on phpbb_posts, which is the main collection of the query. A double table query where the main collection is filtered is more likely to be faster. So, if the main collection of the query is unfiltered, changing the order of the collections so that the main collection is a filtered one may be faster. However, this will not always be the case; for example, a main collection can be unfiltered but have fewer documents than the result of filtering the other collection. In the following example, we have a table with posts and another with topics, and each post has a field with its topic id.

SQL Statement

SELECT phpbb_topics.topic_title FROM phpbb_topics, phpbb_posts
WHERE phpbb_posts.post_id = 1
  AND phpbb_posts.topic_id = phpbb_topics.topic_id

If we run a query to retrieve the topic title for the post with post_id = 1, the SQL database query planner chooses which table is best to read first, but MongoDB does not. Without the table order optimization program, the MongoDB query looks first at the topic table: it has to read ALL topics and ALL their posts, then discard all posts and retrieve only the post with id = 1. With the optimization, the post table is used first because it is filtered; MongoDB has to read only one post, searching by its id, and for that post it reads only its topic. The transformation rules follow the MongoDB migration guidelines [59]; as a result, the migrated application follows these guidelines. We did a performance test to check the performance of the migrated queries and noticed that some query transformations run too

slow. All were Double queries. In MongoDB, the order of the aggregation of collections matters; MongoDB uses aggregations to produce results similar to JOINs of tables in SQL. In SQL the order of tables does not matter because the SQL database does the optimization by itself, but MongoDB does not. In our tests, we have seen that rearranging table references to name first those that are used more in filters produces better results when translated to MongoDB aggregations. We added an additional optimization step for Double queries to reorder references to tables; before doing so, the process checks that the optimization will produce a functionally equivalent SQL statement. To test how much changing the table order can affect query performance, a rule has been added that creates equivalent queries with all table order variations. JOINs with attached conditions cannot be moved to a position before other tables they reference. We tested the process with PHPBB3 Double queries, permuting the table_references that are lists of tables; variations are calculated using the Steinhaus algorithm [91]. A main rule, replicateStatementPermutations, has been added to the double2mongo.txl program. It replaces a Double statement with each possible table-order variation of the statement. This rule is executed before any MongoDB translation. With this rule activated, the migrated output for a Double query contains all possible MongoDB migration actions obtained by changing the table order, each annotated with its SQL equivalent. For example, the following SQL statement:

SQL Statement

SELECT post_id FROM phpbb_posts p, phpbb_topics t
WHERE p.post_id = $param1 AND p.topic_id = t.topic_id : double

Will produce the following two translated MongoDB actions:

MongoDB action with table order output 1

//"SELECT post _id FROM phpbb_postsp, phpbb _topicst WHEREp.post _id=$param1 ANDp.topic _id=t.topic _id" db.phpbb_posts.aggregate([ {"$match":{post _id:$param1}}, {"$lookup": {"from":"phpbb _topics", "localField":"topic _id", "foreignField":"topic _id","as":"t"}}, {"$unwind":{"path":"$t","preserveNullAndEmptyArrays":false}}, {"$project":{"post _id" :1}}])

MongoDB action with table order output 2

//"SELECT post _id FROM phpbb_topicst, phpbb _postsp WHEREp.post _id=$param1 ANDp.topic _id=t.topic _id" db.phpbb_topics.aggregate([ {"$lookup":{"from":"phpbb _posts", "localField":"topic _id", "foreignField":"topic _id","as":"p"}}, {"$unwind":{"path":"$p","preserveNullAndEmptyArrays":false}}, {"$project":{"p.post _id" :1}}, {"$match":{"p.post_id":$param1}}])

Note that each MongoDB action is optimized if possible. Here, the first action can filter by post_id before doing the join, because it is a field from the main collection. The second action has to filter after the join because the filtered field is not available until the lookup is complete. Changing the table order has produced two SQL queries that are not migrateable to a MongoDB lookup in MongoDB 3.2.10. These are queries with three tables, and this happens when two related tables have another table in between. For example:

SQL Statement

SELECT * FROM phpbb_topics t, phpbb_users u, phpbb_posts p
WHERE p.post_id IN $list AND t.topic_id = p.topic_id
  AND p.poster_id = u.user_id
ORDER BY p.post_time DESC, p.post_id DESC

This is an insufficient table order, because the Topics table is related to Posts and Posts to Users, but there is no direct relation between Topics and Users. We ran all PHPBB3 Double query variations to see the differences in performance from changing the table order. Queries were measured on an Ubuntu 14.04.4 server, with 13 GB disk size, 1 GB memory size, an AMD Opteron 23xx 2.0 GHz CPU, MySQL 5.5.47, and MongoDB 3.2.10, with the following collections from the PHPBB3 application database:

Collections of PHPBB3

Collection    Documents    Indexed fields
Users         10333        user_id
Posts         28174        post_id, text(post_subject, post_text)
Topics        23931        topic_id, topic_first_post_id

Figure 6.3: Table Order Optimization Variants

Fig. 6.3 illustrates the different variants of query execution time when changing the table order for Double queries. The first column has the filename of the query; in that file, we can see all variations, in SQL and MongoDB. The other columns contain the execution time for each variation. Queries with 3 tables have 6 possible variations; in PHPBB3 there are no queries with more than 3 tables. Cells with (−) represent a variation that is not migrateable to a MongoDB lookup, as explained above. As the figure shows, SQL queries normally have only a small difference between permutations: they actually have the same performance, and different measures at different times simply give slightly different numbers. These measures help to find queries that need optimizations, but with

small values, the actual number does not matter, because there may be other factors that consume more time than the query itself; for example, using different drivers would give different numbers. We can see that in this relatively small database, all launched SQL queries are using an index and are very fast with these parameters. The figure also shows 4 queries (127, 129, 130, 352) with a great improvement from switching the table order: their execution times improved from (3.255, 2.293, 2.809, 3.889) seconds to (0.003, 0.099, 0.051, 0.008) seconds respectively. By changing the table order, the optimization rule is able to apply the filter before the lookup. There are still some slow queries; for example, query 255 is slow in all of its variations. All other queries seem to have at least one

of the variations with reasonable performance. We have seen that changing the table order in Double queries can affect performance on MongoDB: MySQL chooses by itself which table is best to read first, but MongoDB does not. As in the topic-title example above, without the table order optimization MongoDB reads all topics and all their posts, whereas with the optimization it reads only the filtered post and its topic. Based on the suggestion of the NoSQL community [59], we optimize the queries by changing the table order based on the following criteria: first use tables filtered by their key, then tables filtered by other non-key fields, and finally tables with no filters (a sketch of this scoring idea is given below). The program produces only one MongoDB output for each Double SQL query, using the table order with the highest score. The migration process executes the function optimizeTableOrder before migrating Double queries. This function changes the SQL statement to an equivalent SQL statement that is expected to produce a better migration; it only changes queries that have been classified as Double. We explain each step of the table order optimization process below.
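A hedged PHP sketch of that scoring idea follows; the thesis implements it as the TXL rules shown next, and the helper functions here are assumptions for illustration.

Sketch of the table scoring idea

// Score a table by how strongly the WHERE conditions filter it.
function scoreTable($table, $conditions, $schema) {
    $score = 0;
    foreach ($conditions as $cond) {
        if (!conditionFiltersTable($cond, $table)) {   // hypothetical helper
            continue;
        }
        // Key filters rank highest, other field filters next, none lowest.
        $score += isKeyField($cond, $table, $schema) ? 2 : 1;
    }
    return $score;
}

// Order tables so the most-filtered one becomes the main collection.
usort($tables, function ($a, $b) use ($conditions, $schema) {
    return scoreTable($b, $conditions, $schema)
         - scoreTable($a, $conditions, $schema);
});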

Table Order Optimization Process

% The TXL rule operates on TransformableStatement:
replace [program]
    TS [TransformableStatement]

% Deconstruct the SQL statement. Only SQL statements classified as Double
% will be processed:
deconstruct TS
    'SELECT optSelectionType [opt selectionType] _ [opt 'SQL_CALC_FOUND_ROWS]
        listExpr1 [list select_Expr1]
    'FROM TR [list table_reference+] whereC [opt whereClause]
        optGroupBy [opt groupbyClause] optOrderBy [opt orderbyClause]
        optLimit [opt limitClause] _ [opt ;] ': double

% Ensure all table references are tables:
where all TR [table_reference_isTable each TR]

% Create a list with all tables (construct a list of table_references):
construct tables [table_reference*]
    _ [. each TR]

% Create a list with all the conditions (extract separable conditions):
construct conditions [logicalExpr*]
    _ [extractWhereLogicalExpr whereC]

% Create a list of table scores. Each element contains an SQL and its score.
% The score is higher when the most filtered tables are named first. Each
% table gains points when it is filtered by a condition. The repetition with
% scores is ordered in reverse order.
construct table_scores [table_reference_points*]
    _ [table_score_calc tables conditions]

% Get the first table reference repetition from table_scores. Since it is
% reverse ordered, the first element has the highest score:
construct ordered_tables [table_reference*]
    _ [^ table_scores]

The following TXL code replaces the original SQL with the new optimized SQL statement, where the new table reference has the reordered list of tables.

The new optimized SQL statement

% Transform the list of tables into a usable list of tables:
deconstruct ordered_tables
    table_1 [table_reference] tables_rest [table_reference*]
construct table_head [list table_reference+]
    table_1
construct newTR [list table_reference+]
    table_head [, each tables_rest]
by
    'SELECT optSelectionType listExpr1 'FROM newTR whereC
        optGroupBy optOrderBy optLimit ': double

We profiled the queries to check their performance. Only some DOUBLE queries were too slow; they were fixed by optimizing the order of tables in the statement, which is why an optimization rule for DOUBLE queries exists. In most cases, the optimization program produces a query with the lowest query execution time. Here is an example of an SQL query and its equivalent optimized MongoDB action:

SQL Statement

SELECT COUNT(p.post_id) AS total FROM phpbb_posts p, phpbb_topics t
WHERE p.forum_id IN $list AND p.post_visibility = $param1
  AND t.topic_id = p.topic_id
  AND t.topic_visibility <> p.post_visibility

MongoDB optimized Statement

db.phpbb_posts.aggregate([
  {"$match": {$and: [{forum_id: {$in: $list}},
                     {post_visibility: $param1}]}},
  {"$lookup": {"from": "phpbb_topics", "localField": "topic_id",
               "foreignField": "topic_id", "as": "t"}},
  {"$unwind": {"path": "$t", "preserveNullAndEmptyArrays": false}},
  {"$project": {"post_id": 1,
      computed1: {$ne: ["$t.topic_visibility", "$post_visibility"]}}},
  {"$match": {computed1: true}},
  {"$group": {"_id": null, total: {"$sum":
      {"$cond": [{"$ifNull": ["$post_id", false]}, 1, 0]}}}}])

This optimization gives a functionally equivalent MongoDB query. With this change, the performance of the 6 slow queries improved from about 4 seconds to about 0.2 seconds.

6.6.4 Query Optimization Evaluations

In order to evaluate the impact of our query optimization step on the performance of the migrated queries, we conducted an experiment on a small data set from one of the applications under test. The experiment was conducted on an Ubuntu 14.04.4 server, with 13 GB disk size, 1 GB memory size, an AMD Opteron 23xx 2.0 GHz CPU, MySQL 5.5.47 and MongoDB 3.2.10, with the following collections from the PHPBB3 application database: 28174 posts, 23931 topics and 10333 users. The queries measured are the SELECT queries that were classified as SINGLE or DOUBLE and migrated to MongoDB.

PHPBB3 Database Collections

Collection    Documents    Indexed fields
Users         10333        user_id
Posts         28174        post_id, text(post_subject, post_text)
Topics        23931        topic_id, topic_first_post_id

Tests were run on both the MySQL and MongoDB systems using the same data sets. One test was done on the original non-optimized queries and the other on the optimized queries, with the time recorded in seconds. Table 6.3 shows the migration statistics of the PHPBB3 MongoDB un-optimized and optimized Double queries in comparison to SQL. As the table shows, of a total of 74 executed Double queries, the majority of the un-optimized queries took more time to execute in MongoDB than in SQL: 81.08% of the migrated queries were slower in MongoDB, while only 18.92% took execution time similar to SQL,

Table 6.3: Double Query Migration Statistics before & after Optimization

Total of 74 Double Queries    Count    %
Un-optimized
    Slower than SQL           60       81.08%
    Similar to SQL            14       18.92%
    Faster than SQL           0        0%
Optimized
    Slower than SQL           6        8.11%
    Similar to SQL            21       28.38%
    Faster than SQL           47       63.51%

and no un-optimized query took less time to execute in MongoDB than in the SQL database. After applying the filter option and table order optimization rules, the number of queries that took a long time to run decreased to 6, and the number of queries taking a similar time in MongoDB as in SQL increased to 21, while 47 Double queries ran faster in MongoDB than in SQL, an increase from 0% to 63.51%. Fig. 6.4 shows a comparison between the execution time of the non-optimized migrated MongoDB queries (green), the optimized migrated MongoDB queries (red), and the original SQL queries (blue). For example, the non-optimized execution time for query 1 was over 5 seconds, while the optimized and original SQL execution times were a small fraction of 1 second. We can observe that the optimization stage of our approach drastically improved the performance of the migrated MongoDB queries to a level comparable to the performance of the SQL queries. The lower graph in Fig. 6.4 shows a closer look at the two bottom lines in the upper graph; the largest difference is in query 6, where the optimized MongoDB time is 0.1 sec versus a very small SQL execution time.

Figure 6.4: Query Optimization Analysis

6.7 Conclusion

In this chapter, we presented the query migration phase of the migration framework. We demonstrated the steps on the migration of a subset of queries of the SCARF and PHPBB3 applications. We conducted an experiment to evaluate our query optimization phase, and the evaluation suggests that our optimization was instrumental in improving the performance of the migrated system, with performance almost equivalent to the original non-migrated

application. Since the test collections are still relatively small, the advantage of MongoDB for these types of collections is not immediately obvious. However, it is clear that our approach does not introduce a significant performance penalty while enabling the use of MongoDB. In the next chapter, we explain how we alter the application to use the translated queries by applying a backward tracing technique to locate and mark up the SQL statements in the PHP code.

Chapter 7

Application Migration

7.1 Introduction

The last stage of our approach is changing the application to use the translated queries. This process is the converse of the SQL query extraction stage. We apply backward tracing combined with the approach used by Alalfi et al. [4] to adapt the translation to the application; the backward tracing marks up impacted statements. We start from the location where each of the migrated SQL statements is executed (a call to the function mysql_query, the configured function used to launch SQL queries in SCARF and both versions of the PHPBB applications). The SQL statement may be a string literal, in which case the transformation is done there. Otherwise, we move backwards from the execution of the SQL statement in the program, tracing the construction of the string literal that contains the SQL statement.

7.2 Application Migration Overview

Fig. 7.1 shows the steps to migrate a PHP application from a MySQL database to use MongoDB.

• Database Migration: Data migration from the MySQL database to MongoDB is done before migrating the source code, using the Pentaho Data Integration (PDI) tool as described in Chapter 4. The tool migrates a flat copy of the MySQL tables to MongoDB collections. Each migrated MySQL table has an equivalent collection with the same fields in MongoDB.

The PDI tool uses a configuration file generated from our implemented interface, based on the algorithm proposed by Arora [8], to specify which tables need to be migrated and their columns. The translation file is included in the source code migration process as the scheme's data. Basically, it identifies which tables are being migrated, what fields they have, and what the primary keys and foreign keys are; a hypothetical rendering of this information is sketched below.
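A hypothetical rendering of that scheme data as a PHP array follows; the actual file format used with PDI may differ.

Hypothetical scheme data for the migration

// Tables listed here are migrated to MongoDB; anything absent stays in
// MySQL, which drives the Single/Double/Dual/Unchanged classification.
$migratedTables = [
    'phpbb_posts' => [
        'fields'       => ['post_id', 'topic_id', 'forum_id', 'poster_id',
                           'post_time', 'post_visibility'],
        'primary_key'  => 'post_id',
        'foreign_keys' => ['topic_id' => 'phpbb_topics',
                           'forum_id' => 'phpbb_forums'],
    ],
    // ... one entry per migrated table ...
];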

• The Source Code Migration: The source code migration is automated using TXL programs to analyze and transform PHP source code. The process is composed of two phases:

• Code migration - Phase 1: A TXL program does a static analysis of all PHP source files from the application to be migrated. It uses a PHP TXL grammar to parse these files, and a set of TXL transformation rules to identify the PHP code that generates SQL sentences and to instrument all calls to the MySQL API. The PHP code instrumentation basically changes calls to the MySQL API to special functions that interact with MongoDB instead. All identified SQL statements are passed to Phase 2 to be migrated to their equivalent MongoDB actions.

Figure 7.1: Application Migration Process

The migrated code replaces the PHP code that constructs the SQL statement. This TXL program produces a set of PHP files with the migrated application, a list of all processed SQL statements, and a summary of the migration process results.

• Code migration - Phase 2: In this step the SQL statements are parsed by TXL using the SQL grammar. First the statements are classified into different categories (Single, Double, Dual, Unchanged) based on the tables migrated to the new MongoDB database. Mainly it parses the SQL with its SQL grammar, as explained in Chapter 6; from there it can count the number of tables and see whether tables are migrated. We use information from the data migration configuration file to determine, for each table, whether it is migrated or remains in the MySQL database.

Then the SQL statements are migrated to MongoDB using TXL transformation rules. Each query type has its own TXL rules to transform the SQL sentences to their equivalent MongoDB actions. The migrated database sentences are pushed back to the static analysis process from Phase 1 to be integrated into the PHP source code.

7.2.1 Application Migration Approach

There are two general approaches to application migration. The first is to dynamically translate each SQL query as it is executed by the PHP application. An example of the first approach is the work proposed by Rocha et al. [82], which leaves the PHP code to build its SQL sentences and intercepts calls to the DB API. Here the target application source code remains unchanged; migration is done at runtime each time, with completely constructed SQL sentences. We note that translating the SQL statement each time incurs a penalty: the optimization issues discussed in the previous chapter make the translation process non-trivial, even if the results are cached. The second approach is to migrate SQL statements statically to use the PHP API for the NoSQL database. This approach allows the developer to adapt to the best of MongoDB over

time. The initial dynamic translation imposes an overhead that could eliminate any benefit of the move to NoSQL; for this reason, we chose the static migration of the application.

The usual approach for using SQL in PHP is to build a query in a string which is then passed to the SQL library for processing. PHP also provides prepared statements, in which the SQL statements are pre-compiled and data is later inserted, but these are not as common as the string method [65]. In essence, we transform each PHP statement that adds to the SQL string literal so that it adds to the MongoDB string literal instead. An alternate approach is to use lambda functions [74]. The string is then executed using the PHP eval function and the results of the evaluations are used to access the query results.

We translate the SQL statements in the code to MongoDB actions expressed as PHP code. SQL statements are defined in one part of the source code (as strings), but they are executed in another part, and other logic may execute in between. So, the PHP code that represents an SQL statement cannot be placed where the SQL statement is defined, as it is executed elsewhere. PHP SQL provides prepared statements, but neither SCARF nor PHPBB use them. MongoDB operations are implemented directly in PHP code; each MongoDB action is triggered with a PHP function. The following example illustrates the difference between an SQL string and the PHP expression of a MongoDB action.

SQL string and MongoDB PHP expression example

SQL Statement:

mysql_query("SELECT * FROM users WHERE user_id=1") MongoDB Action: $db->users->find(array(’user_id’ => 1))

While it is possible to build up PHP data structures and lambda functions [74] to

execute the equivalent MongoDB action, we have chosen an intermediate transform to start. We construct the MongoDB query, consisting of PHP code in a string literal, the same way the SQL query is constructed: we identify each PHP statement that adds a component to the SQL string literal. The resulting string contains MongoDB actions in PHP code. In the PHPBB3 application, all calls to mysql_query are done through the sql_query function provided by the driver. The following example shows the PHP string of a MongoDB find action:

Simple MongoDB find action in PHP

$sql="\$rows=\$db->phpbb _topics->find([’topic_id’=> intval($topic _id)], [’forum_id’=> 1]; \$cols=[\"forum_id\"];" $result= db->sql _query($sql) $forum_row=$db->sql _fetchrow($result); $db->sql_freeresult($result); if(!$forum _row){ trigger_error(’NO_TOPIC’); }

Our approach also instruments calls to the rest of the MySQL API. We have modified versions of the MySQL API functions that work for both the SQL and the MongoDB queries and results. This simplifies the translation by allowing us to focus on the translation of the query. Our migrated version of the sql_query function checks the query: if the query is an un-migrated SQL query, the standard MySQL function is called and the result is returned as normal; if it is a MongoDB action, then we evaluate the string and check the results.

Elided Migrated Version of sql_query

try {
    ...
    $res = eval($out);
    ...
} catch (MongoException $e) {
    ... error handling ...
    return null;
} catch (Throwable $t) {
    ... error handling ...
    return null;
}
...
if (isset($rows)) {
    $res = $rows;
}

Next, we show an example of an UPDATE statement.

SQL UPDATE statement in PHP

$sql = 'UPDATE ' . POSTS_TABLE . ' SET post_reported = 1 WHERE post_id = ' . $post_id;
$db->sql_query($sql);

This query sets the post_reported field to 1 on the post identified by $post_id. It is used to mark a post as reported. POSTS_TABLE is a PHP constant defined in constants.php.

MongoDB update action in PHP

$sql="\$db->phpbb _posts->update([’post_id’ => intval($post _id)], [’\$set’ =>[’post _reported’ => boolval (1)]],[multi => 1])"; $db->sql_query($sql);

This is an intermediate step. While simpler constructions that use only constants and user input, like the example above, can be optimized, SQL statements that are constructed in

pieces are not easily optimized. The main advantage of this approach is that it provides a transparent executable migration: the MongoDB actions are directly visible in the application code. Developers can manually tune the string construction based on the optimized statements produced in the previous chapter and their knowledge of the structure and semantics of their data. A second stage of transformation that translates the strings to direct PHP code is also possible in the future; a sketch of what that might produce follows.
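For illustration only, the string-based find shown earlier might, after such a second stage, become direct driver calls with no eval step; this is a sketch of a possible output, not implemented work.

Sketch of a possible direct PHP translation

// Hypothetical direct translation of the earlier string-based example.
$rows = $db->phpbb_topics->find(
    ['topic_id' => intval($topic_id)],   // filter, with explicit type cast
    ['forum_id' => 1]                    // projection
);
$cols = ["forum_id"];                    // column order for positional access
$forum_row = $db->sql_fetchrow($rows);
$db->sql_freeresult($rows);
if (!$forum_row) {
    trigger_error('NO_TOPIC');
}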

7.2.2 Application Migration Steps

Our migration is divided into two phases:

• Phase 1: Backward tracing search:

– Step 1: Identify and change calls to MySQL.

– Step 2: Pre-process all source files to gather global variables.

– Step 3: Find functions used to launch queries.

– Step 4: Search SQL statements using backward tracing from calls to MySQL.

– Step 5: Identify the prototype of SQL statements.

• Phase 2: Individual query migration

– Step 6: Query classification (Single, Double, Dual, Unchanged).

– Step 7: Migrate each query by pattern matching.

We give a short explanation of the steps before going into each in detail.

7.3 Phase 1 - Backward Tracing Search

The TXL Backward program does a backward search to mark the statements involved in an SQL query execution in a PHP program. It receives a list of all PHP files to process. PHP parsing is done with the official PHP grammar from the TXL website [94]. The process starts by searching for calls to the mysql_query function, which are back-traced to their definitions using pattern matching with TXL rules. This is the function used to launch SQL queries in SCARF and both versions of PHPBB. The parameter of this function is a string that contains the SQL statement, and we trace backward to identify its definition. If the function call has a string literal as a parameter, we know that it is an SQL statement. If it has a variable as a parameter, we search for its declaration; if the variable is initialized with a string literal, we use it as an SQL statement. If the variable is a function parameter, the process repeats for each call to that function in all PHP files. The steps are discussed in the following sections.
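A hypothetical PHP fragment shows the kind of pattern the backward tracing must follow: the SQL string is assembled far from the call that executes it, and may pass through a wrapper function. The function and constant names here are invented for illustration.

Hypothetical pattern traced by the backward search

// The string literal is built in one function...
function get_topic($db, $topic_id) {
    $sql  = 'SELECT forum_id FROM ' . TOPICS_TABLE;   // string literal start
    $sql .= ' WHERE topic_id = ' . $topic_id;         // appended piece
    return run_query($db, $sql);                      // wrapper call
}

// ...but executed in another; tracing starts at mysql_query and walks back
// through run_query's parameter to the assignments of $sql above.
function run_query($db, $sql) {
    return mysql_query($sql);
}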

7.3.1 Step 1: Identify and change calls to MySQL

We start by identifying the functions in the MySQL API, and we add a prefix to the calls. The list of functions is given in the following table:

Table of MySQL Functions

| mysql_real_escape_string | mysql_pconnect |
| mysql_fetch_row | mysql_fetch_assoc |
| mysql_fetch_array | mysql_connect |
| mysql_select_db | mysql_error |
| mysql_num_rows | mysql_query |
| mysql_affected_rows | mysql_data_seek |
| mysql_errno | mysql_free_result |
| mysql_info | mysql_insert_id |

7.3.2 Step 2: Pre-process source files to gather global variables

The second step in the application migration process is to pre-process all source files to gather global variables. The goal is to know the values of the global PHP constants that can be used in SQL definitions. For example, in PHPBB all tables are defined through constants defined in includes/constants.php, which in turn uses the value of $table_prefix defined in the config.php file.
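For concreteness, this is the definition pattern that the pre-processing resolves; the POSTS_TABLE entry mirrors the define() calls shown later in this chapter:

// config.php
$table_prefix = 'phpbb_';

// includes/constants.php
define('POSTS_TABLE', $table_prefix . 'posts');

// Queries then concatenate the constant:
$sql = 'SELECT * FROM ' . POSTS_TABLE;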

Global Variables Extraction

% Extract global variables
construct global_vars [varvalue*]
    _ [message "...Extract constants"]
      [extractGlobal each InputFiles]
      [extractDefine each InputFiles]
      %%%[print]

% Export constants
export GLOBAL_VARS [constant*]
    _ [addConstant each global_vars]

7.3.3 Step 3: Find functions used to launch queries

The next step is to find the application functions used to launch queries. In this step, we search for calls to the function mysql_query. SCARF and the PHPBB applications use mysql_query to launch queries, but they define an application function to wrap these calls. In this step, the process gets a list of functions that use mysql_query.

php Function used to Launch Queries Extraction

% Starting point for backward tracing
construct FunctionName [id]
    'mysql_query

% Extract php function used to launch queries
export SQL_FUNCTIONS [id*]
    _ [message "...SQL Functions"]
      [. FunctionName]
      [searchFunctionsCallingFunction FunctionName InputFiles]

% Process each file
construct _ [stringlit]
    _ [message "...Process files"]
      [processFile each InputFiles]

import MARK_COUNTERS
construct _ [mark_counter*]
    MARK_COUNTERS [print]

replace [program]
    _ [any]
by
    DONE
end function

7.3.4 Step 4: Search SQL statements using backward tracing from calls to MySQL

The process does a backward search to mark the SQL statements involved in an SQL query execution in a PHP program. The SQL statement is the string parameter passed to mysql_query. The goal of this step is to locate the structure of the SQL statements in the PHP source code. The backward tracing process places marks in the TXL parse tree. These marks are used to follow the backward tracing process during development. They are defined as PHP function calls, and they can have parameters with extra information. PHP files modified by the backward tracing will contain these marks:

Backward Tracing PHP Files Marks

| Track function | Description |
| track_call | Call to function; the parameter is not a string literal. |
| track_call_local | Call to function; the parameter is a variable defined in the same block. |
| track_call_strlit | Call to function; the parameter is a string literal. |
| track_strvar | Variable SQL assignment, but it is not a literal. |
| track_strlit | Variable SQL assignment; it is a string literal. |

The call marks (track_call, track_call_local, track_call_strlit) on the function mysql_query mark the places where a call to MySQL is launched. The strlit marks (track_strlit, track_call_strlit) mark the places where a static query is defined (or where queries are converted to a single string with interpolated variables). The following are the main TXL rules and functions:

• main: Reads all PHP files into memory, instruments calls to mysql_query with backward tracing for static SQL, and writes the modified PHP files back to disk.

Backward tracing PHP files marks

% Process one file
function processFile InputFile [stringlit]
    construct _ [stringlit]
        _ [+ "..."] [+ InputFile] [print]
    % Read file
    construct InputProg [program]
        _ [read InputFile]
    % Import SQL functions
    import SQL_FUNCTIONS [id*]
    % Export file name to be used in log function.
    % Log marked queries
    export PHP_FILE_NAME [stringlit]
        InputFile

• instrumentFunctionCall: First searches for any function that calls the searched function using its own parameter. Those functions are instrumented as well by applying this rule recursively.

• instrumentFunctionCallRemainingLiteral: Instruments calls to the searched function when the parameter is a string literal. This string literal is expected to be a complete SQL statement.

• instrumentFunctionCallRemainingExpr: Instruments calls to the searched function when a parameter is any PHP expression.

• instrumentFunctionCallLocalLiteral: Instruments calls to the searched function when a parameter is defined in the same block as a string literal. This string literal is expected to be a complete SQL statement.

• instrumentFunctionCallLocalExpr: Instruments calls to the searched function when a parameter is defined in the same block as a PHP expression.

The remainder of the processFile function, together with the addConstant helper, is shown below:

Backward tracing PHP files marks

% Process source code
construct MigratedProgram [program]
    InputProg [trackFunctionCallWrapped each SQL_FUNCTIONS]
        [trackFunctionCallLocal each SQL_FUNCTIONS]
        [trackFunctionCallRemainingLiteral1a each SQL_FUNCTIONS] % Allow tracing object functions
        [trackFunctionCallRemainingLiteral1b each SQL_FUNCTIONS]
        [trackFunctionCallRemainingExpr each SQL_FUNCTIONS]
        [migrate]
        [instrument_mysql_calls]

% Exit function if nothing changed
deconstruct not InputProg
    MigratedProgram

% Write file
construct OutputProg [program]
    MigratedProgram [instrumentPage1] [instrumentPage2]
        [write InputFile]
match [any] _ [any]
end function

% Add variable value to constant array
function addConstant varvalue [varvalue]
    deconstruct varvalue
        id [id] : str [stringlit]
    construct idstr [stringlit]
        _ [quote id]
    construct constant [constant]
        idstr : str
    replace [constant*]
        R [constant*]
    by
        R [. constant]
end function

• An example of the track_call function: track_call marks a call to a function that executes a query where the parameter is not a string literal. Parameter 1 is the function used to execute the query, and parameter 2 is the parameter passed as the query. For example, in the viewtopic.php file, line 313 has this code:

track_call function

$result = $db->sql_query($sql);

The migration process changes the code to:

track_call function changes

$result = $db->sql_query((($sql)));
track_call('sql_query', $sql);

This means that this line is part of a backward tracing search: sql_query is a function used to launch a query, and $sql is the query (but it is not a literal).

• A second example is track_strlit, which marks a variable SQL assignment used in a query call where the value is a literal. Parameter 1 is the variable, and parameter 2 is the value assigned. For example, line 72 of the viewtopic.php file has this code:

track_strlit function

$sql = 'SELECT forum_id FROM ' . TOPICS_TABLE . " WHERE topic_id = $topic_id";
$result = $db->sql_query($sql);

The migration process changes the code to:

track_strlit function

{ $sql = (((" //\"SELECT forum _id FROM phpbb_topics WHERE topic_id=$topic _id\" \$rows=\$db->phpbb_topics->find([’topic_id’ => intval($topic _id)],[’forum_id’ => 1]); \$cols=[\"forum_id\"]; "))); track_strlit($sql,"SELECT forum _id FROM phpbb_topics WHERE topic_id=$topic _id"); } $result=$db-> sql _query ((($sql))); track_call_local(’sql _query’,$sql);

The track_strlit line indicates that the previous assignment is used as a parameter in an SQL call, and that it is an SQL literal.

7.3.5 Step 5: Identify the prototype of SQL statements

To identify the prototypes of the SQL statements, we started with the patterns from the MongoDB guide [74], and later added and refined rules to cover more queries from the analyzed applications.

7.4 Phase 2 - Migration of SQL sentences

The program transforms each SQL statement from the previous step to MongoDB code. The translated statements do not need to be reintegrated into the PHP source code, because they are modified in place. This program parses the statement as an SQL program. It uses two grammars by Thomas R. Dean to produce PHP code from SQL code [95]. This phase consists of the following steps, as explained in the query migration chapter.

7.4.1 Step 1: Query classification

The program classifies each SQL query into one of these categories (Unchanged, Single, Double, Dual), as explained in the previous chapter. We use information from the data migration configuration file to determine whether each table is migrated or remains in MySQL.
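The table lookup that drives this decision can be sketched as follows; the configuration file name and its one-table-per-line format are assumptions made for illustration, and the category semantics are those defined in the previous chapter:

// Hypothetical: read the list of migrated tables from the data
// migration configuration (file name and format are assumptions).
$migratedTables = array_map('trim', file('migration_config.txt'));

function isMigrated($table, array $migratedTables)
{
    return in_array($table, $migratedTables, true);
}

// A query whose tables are all non-migrated stays Unchanged; the other
// categories depend on how many migrated tables the query references.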

7.4.2 Step 2: Migrate query by pattern matching

Each query type has its own TXL rules to transform the SQL statements to MongoDB actions. These are pattern matching rules: we search for structures of code that match given patterns and replace them according to the rules. We follow the MongoDB guidelines to choose the appropriate transformations [39]; therefore, the rules are written with the optimizations stated in the MongoDB migration guidelines. The transformation phase is done in the sql2mongo.txl program, where the statement is loaded with a MySQL grammar. The SQL tree representation is processed with classification rules, and later changed with translation rules to MongoDB. The optimization rules are part of the transformation rules. For example, the rule migrateSelectCount matches a SELECT with a COUNT in a single query, which is translated to the count() function in MongoDB. An example query that matches this rule is:

SQL query with count

SELECT COUNT(*) AS num FROM comments WHERE paper_id = '$paper_id';

It is migrated to the following equivalent PHP code:

Equivalent translated MongoDB action for count

db.comments.find({paper_id: $paper_id}).count();
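In the PHP application itself, the same action through the legacy Mongo driver would look like the following; this is illustrative only, with $mongo being the global driver handle described in the next section:

$num = $mongo->comments->find(['paper_id' => $paper_id])->count();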

The TXL rule starts with the replace command. It only processes statements classified as single.

TXL rule of single statement

replace [TransformableStatement]
    sql_statement [MySQLStatement] ': single

The sql_statement is required to match this TXL pattern:

Matching pattern for SQL statement with count

deconstruct sql_statement
    'SELECT count '( '* ') _ [opt AsClause]
    'FROM TR [table_references]
    whereC [opt whereClause] _ [opt ';]

This transformation is equivalent to this entry in the SQL to MongoDB mapping table [74].

SQL mapping entry

SQL statement:

SELECT COUNT(*) FROM people WHERE age > 30

And the corresponding MongoDB actions:

corresponding MongoDB actions

db.people.count({ age: { $gt: 30 } })
// or
db.people.find({ age: { $gt: 30 } }).count()

7.5 Simple Query Translation Example

The PHP functions will manage the interaction with MySQL and MongoDB. The process generates PHP MongoDB actions that are equivalent to the SQL statements; doing so, we can maintain the same business logic. The global variable $mongo manages MongoDB actions, mirroring the approach used in the original PHP source code, which uses the global variable $db for the SQL database. In the migrated PHP applications, SQL statements are replaced with MongoDB actions. These MongoDB actions are represented in the PHP code as strings. When the modified mysql_query receives a MongoDB action, it is processed against the MongoDB database. Later, the migrated PHP application uses the instrumented mysql functions to retrieve the results. For example, the instrumented function for mysql_num_rows is implemented in sql2mongo.php as the function s2m_mysql_num_rows(result_id). If result_id represents a result from MySQL, it calls the original mysql_num_rows. Otherwise, result_id represents a result from MongoDB, and the function returns the number of elements in the MongoDB result. Below is an illustration of simple query translations of a SELECT statement and an UPDATE statement.
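A minimal sketch of that instrumented function, assuming MongoDB results come back as PHP arrays while MySQL results remain native result resources (the internals here are an assumption; only the function name is from sql2mongo.php):

function s2m_mysql_num_rows($result_id)
{
    if (is_resource($result_id)) {
        // Result comes from MySQL: delegate to the original API.
        return mysql_num_rows($result_id);
    }
    // Result comes from MongoDB: the instrumented query returned the
    // matched documents as a PHP array, so count its elements.
    return is_array($result_id) ? count($result_id) : 0;
}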

• Simple SELECT statement translation example: An example of a simple SELECT query translation at file includes/functions_admin.php, line 601 is illustrated below.

Simple SELECT Statement Translation Example

$sql = 'SELECT forum_id FROM ' . TOPICS_TABLE . ' WHERE topic_id = ' . $topic_id;
$result = $db->sql_query($sql);

This query selects records from the table TOPICS_TABLE filtered by the topic_id field. It only retrieves the forum_id field. TOPICS_TABLE is a PHP constant defined in constants.php. The query is used to obtain the forum id of a specific topic.

The process operates on all input files at once; the TXL program combines the files into a single tree to process. Having all files in at once allows cross-file dependencies to be evaluated.

We use mysql_query as a starting point. We know that PHPBB uses that function to launch queries, so we search for all functions that call mysql_query in all PHP files.

Then, we read the source code using the PHP grammar. InputProg contains the PHP source code tree.

This PHP program is transformed by multiple rules. The trackFunctionCall functions search for multiple shapes of PHP function calls to any SQL_FUNCTION. In the includes/functions_admin.php PHP tree, there is a match at that statement: a call to sql_query that receives a string variable defined locally. It is replaced with an element that has the same content, and it is marked in the TXL tree.

Since the value assigned to $sql has concatenations, it can be simplified as follows:

Simplified Concatenation String Value

SELECT forum_id FROM phpbb_topics WHERE topic_id = $topic_id

The value is changed to an element with the same content of type SqlString. Then, the migrate function calls the sql2mongo.txl program for each SqlString element found in the PHP tree, replacing its content in place. It will be replaced with:

MongoDB Equivalent Action

"\$rows=\$db->phpbb _topics->find( [’topic_id’ => intval($topic _id)],[’forum _id’ => 1]); \$cols=[\"forum_id\"];"

The rule instrument_mysql_calls looks for any use of the mysql functions and adds a prefix so that our instrumented versions are used, as explained in Step 1 of the previous section.

If none of the previous rules changed the PHP source tree, then the function exits; we do not want to lose comments and formatting from the original source code unnecessarily. Otherwise, the changes are written back to the file.

• Simple query translation example of an UPDATE statement: Here is a simple query translation example of an UPDATE statement at file report.php, line 242.

UPDATE statement example

$sql = 'UPDATE ' . POSTS_TABLE . ' SET post_reported = 1 WHERE post_id = ' . $post_id;
$db->sql_query($sql);

This query sets the post_reported field to 1 on records from the table POSTS_TABLE filtered by the post_id field. POSTS_TABLE is a PHP constant defined in constants.php. The query is used to mark a post as reported. Note that since post_reported is a boolean field in MySQL, this 1 will be converted to TRUE. The execution steps are the same as in the execution plan of the SELECT statement described above. Since the value assigned to $sql has concatenations, it is simplified to:

Simplified Concatenation String Value

"UPDATE phpbb_posts SET post_reported=1 WHERE post _id=$post _id"

The value is changed with an element with the same content of type SqlString. It will be replaced with the following equivalent MongoDB action:

MongoDB Equivalent Action

"\$db->phpbb_posts->update([ ’post_id’ => intval($post _id)], [’\$set’ => [’post_reported’ => boolval (1)]],[multi =>1 ])"

The same process used for the SELECT and UPDATE statement examples is applied to DELETE statements.
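For completeness, a DELETE would follow the same shape; the following pair is illustrative only, modeled on the UPDATE example above, with remove() being the legacy driver's delete operation:

// SQL statement in the application:
$sql = 'DELETE FROM ' . POSTS_TABLE . ' WHERE post_id = ' . $post_id;
$db->sql_query($sql);

// Illustrative migrated MongoDB action string:
$sql = "\$db->phpbb_posts->remove(['post_id' => intval($post_id)])";
$db->sql_query($sql);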

7.6 Backward Tracing Translation

In this section, we present a method of applying the backward tracing technique to PHP application programs to help identify where the SQL statements are defined and executed. First, we introduce the technique that we use to recover the tracing criterion from the source code. We then explain the backward tracing techniques, using the TXL source transformation language, to recover the statements or predicates that may affect or be affected by the tracing criterion. We also show the process of backward tracing on sample PHP source code.

7.6.1 Backward Tracing Overview

Program tracing is an automated source code extraction technique that takes a criterion and program source code as input and yields parts of the program source code as output [50].

The key questions are how to express the tracing criterion in the program source code, and how to make the tracing criterion represent the statements that developers are interested in, so that the criterion is an understandable concept and yields executable program parts. We use a similar approach to mark selected statements by tracing backward. The backward tracing process finds the PHP statements that build the SQL statement and translates them to MongoDB. The code in the trace is identified, and all marked code is part of the backward tracing, but it is not extracted as independent code, since it would not be helpful for the migration without its position and context. The program does a backward trace to mark statements involved in an SQL query execution in a PHP program. This process searches for calls to the mysql_query function, the function used to launch SQL queries in SCARF and PHPBB. If this function call has a literal string as a parameter, we know that it is an SQL statement. If it has a variable as a parameter, we must find the value. The goal is to localize where the SQL is launched and the source code that defines this SQL code. In the file includes/mcp/mcp_main.php, we have a simple example of a call to PHP functions as part of the SQL string.

An SQL built calling another PHP function example

$db->sql_query('INSERT INTO ' . POSTS_TABLE . ' ' .
    $db->sql_build_array('INSERT', $sql_ary));

When we are able to identify all the PHP code involved in generating an SQL statement, we translate it to PHP that generates equivalent MongoDB actions. Note that SQL statements are strings that are moved all around the PHP code, while MongoDB actions are PHP code. This means that some application functions must change their definitions.

We identify all PHP code involved in creating and executing an SQL statement, and transform it to PHP code that creates and executes an equivalent MongoDB action, as in the following example from PHPBB3:

Backward Tracing Example

mysql('SELECT * FROM ' . FORUMS_TABLE);              // [1]
$where = ' WHERE topic_id=' . $topic_id;             // [2]
if ($forum_id) {
    $where .= ' AND forum_id=' . $forum_id;
}
mysql('SELECT * FROM ' . TOPICS_TABLE . $where);     // [3]

function mysql($sql) {                               // [4]
    log($sql);                                       // [5]
    return mysql_query($sql);                        // [6]
}

In this simple example, there is a call to mysql_query marked as [6]; this is where the SQL is executed, so we can start the backward search from there. It uses the variable $sql, defined as a parameter in [4]. Note that within the function this parameter is also used in [5], probably as a string. This could cause a problem if $sql is changed to another type in the migration, and it happens all the time: the same variable can be used as part of a query and for other purposes in other parts of the code. In this case, $sql is a parameter of the function mysql, so we have to search all calls to this function. In this piece of code, it is called from [1] and [3]. [1] constructs an SQL statement concatenating a constant; in this case, the concatenated value represents the table name. [3] also uses the $where variable, whose changes we also need to track. To know the content of the $where variable at [3], we need to look at the [2] block. This block contains pieces of PHP constructing pieces of SQL. At this point the backward tracing has identified all statement fragments.

7.6.2 SCARF Use Case Examples

In the following section we illustrate some examples of the backward tracing process for different cases of converting string concatenation to a single string in the SCARF application.

1. Direct call to mysql_query with a literal string:

SQL Statement

functions.php:
$result = mysql_query("SELECT value FROM options WHERE name='Conference Name'");
track_call_strlit('mysql_query',
    "SELECT value FROM options WHERE name='Conference Name'");

We have a mark where the query is defined, and it is the same place where it is executed. We have it as a single constant string.

2. Direct call to mysql_query with a concatenated string:

SQL Statement

install.php:
mysql_query("GRANT ALL ON " . mysql_real_escape_string($_POST['dbname']) .
    ".* TO '" . mysql_real_escape_string($user) . "'@'localhost' IDENTIFIED BY '" .
    mysql_real_escape_string($pass) . "'");

track_call_strlit('mysql_query',
    "GRANT ALL ON $_POST[dbname].* TO '$user'@'localhost' IDENTIFIED BY '$pass'");

We have a mark where the query is defined, and it is the same place where it is executed. We have it as a single string, with interpolated variables replacing the original concatenated string.

3. Call to mysql_query with a local variable, as in the following example:

SQL Statement

install.php:
$usertable = "CREATE TABLE `users` (
    `user_id` INT NOT NULL auto_increment,
    `email` VARCHAR(200) NOT NULL,
    `password` VARCHAR(50) NOT NULL,
    `firstname` VARCHAR(200) NOT NULL,
    `lastname` VARCHAR(200) NOT NULL,
    `affiliation` VARCHAR(200) NOT NULL,
    `showemail` BOOL DEFAULT false NOT NULL,
    `privilege` ENUM('admin','user') NOT NULL DEFAULT 'user',
    PRIMARY KEY (`user_id`));";
track_strlit($usertable, "CREATE TABLE `users` (
    `user_id` INT NOT NULL auto_increment,
    `email` VARCHAR(200) NOT NULL,
    `password` VARCHAR(50) NOT NULL,
    `firstname` VARCHAR(200) NOT NULL,
    `lastname` VARCHAR(200) NOT NULL,
    `affiliation` VARCHAR(200) NOT NULL,
    `showemail` BOOL DEFAULT false NOT NULL,
    `privilege` ENUM('admin','user') NOT NULL DEFAULT 'user',
    PRIMARY KEY (`user_id`));");
-----
mysql_query($usertable);
track_call_local('mysql_query', $usertable);

The call to mysql_query uses a local variable. The query is defined at the mark track_strlit and later executed at the mark track_call_local; the definition and the usage are in different places.

4. Call through a function as in the following example:

SQL Statement

comments.php:
mysql_fetch_row(query("SELECT title FROM papers WHERE paper_id='$paper_id'"));
track_call_strlit('query', "SELECT title FROM papers WHERE paper_id='$paper_id'");

functions.php:
function query($sql) {
    $result = mysql_query($sql);
    track_call('mysql_query', $sql);
    ...
}

The query is defined in the comments.php file with the mark track_call_strlit, and it is passed as a parameter to the function query. The function query is defined in the functions.php file; there, at the mark track_call, is where the query is actually executed.

5. Call through a function with string concatenation of a variable, such as:

SQL Statement

register.php:
query("SELECT * FROM users WHERE email='" . mysql_real_escape_string($email) . "'");
track_call_strlit('query', "SELECT * FROM users WHERE email='$email'");

functions.php:
function query($sql) {
    $result = mysql_query($sql);
    track_call('mysql_query', $sql);
    ...
}

The query is defined in register.php with the mark track_call_strlit, where we have it translated to a single string. This query is passed as a parameter to the function query. The function query is defined in the functions.php file; there, at the mark track_call, is where the query is actually executed.

6. Call through a function with string concatenation of a variable with an index string:

SQL Statement

generaloptions.php: query("DELETE FROM users WHERE email=’".mysql _real_escape_string($ _GET[’delete_email’]). "’"); track_call_strlit(’query’,"DELETE FROM users WHERE email=’$_GET[delete_email]’"); functions.php: function query($sql) { $result= mysql _query($sql); track_call(’mysql _query’,$sql); ... }

The query is defined in the generaloptions.php file with the mark track_call_strlit, where we have it translated to a single string. This query is passed as a parameter to the function query. The function query is defined in the functions.php file; there, at the mark track_call, is where the query is actually executed.

7. Call through a function with string concatenation of another PHP expression, as in the following example:

SQL Statement useroptions.php: query("UPDATE users SET password=’".md5($pass)."’, email=’$email’ WHERE user_id=’$id’");$ _mongop19= md5($pass); track_call_strlit(’query’,"UPDATE users SET password=’$ _mongop19’, email=’$email’ WHERE user_id=’$id’"); functions.php: function query($sql) { $result= mysql _query($sql); track_call(’mysql _query’,$sql); ... }

The query is defined in useroptions.php with the mark track_call_strlit, where we have it translated to a single string. In this case, we need to create an intermediate variable. This query is passed as a parameter to the function query. The function query is defined in the functions.php file; there, at the mark track_call, is where the query is actually executed.

8. Using interpolated variables: Using interpolated variables is fine when the interpolated values represent query parameters. This is the case for all SCARF queries except two, where they are used to create a dynamic query. These two constructions concatenate a dynamic value that is part of the query itself, as in the following example:

SQL Statement generaloptions.php query("UPDATE users SET".implode _with_keys(",",$values,",")." WHERE user _id=’$id’"); comments.php query("SELECT approved, paper _id, comment_id, showemail, users.email, comment, UNIX_TIMESTAMP(date)AS date, CONCAT(firstname,’’, lastname)AS fullname FROM comments LEFT JOIN users on comments.user_id= users.user _id$where ORDERBY paper _id"); 7.6. BACKWARD TRACING TRANSLATION 175

7.6.3 PHPBB3 Backward Tracing Examples

Doing the same exercise with PHPBB3, we search for mysql_query. It has 13 matches. One of them is the one used to execute application queries, like with SCARF.

• PHPBB3 Backward Tracing Case #1: We will look at the match in the phpbb/db/driver/mysql.php file, line 189:

PHPBB3 Backward Tracing Example #1

if (($this->query_result = @mysql_query($query, $this->db_connect_id)) === false)

It receives the $query string parameter. We can see that it is defined as a function parameter:

Function Parameter definition

phpbb/db/driver/mysql.php, Line 168
function sql_query($query = '', $cache_ttl = 0)

This function is a class function. For simplicity, we will search all calls to the function sql_query of any class; in this particular case, that is enough. This gives us 1890 matches, and all of them look like query launches.

• PHPBB3 Backward Tracing Case # 2 We backward trace one of the simpler matches, at file viewforum.php line 118:

PHPBB3 Backward Tracing Example

118  $db->sql_query($sql);

It uses the $sql variable, which we found defined at line 115:

Update statement Backward Tracing Example #2

115  $sql = 'UPDATE ' . FORUMS_TABLE . '
116      SET forum_posts_approved = forum_posts_approved + 1
117      WHERE forum_id = ' . $forum_id;

To build this string, we need to know the values of the constant FORUMS_TABLE and the variable $forum_id. FORUMS_TABLE is a constant defined in includes/constants.php:

Constants and variable values definition

define('FORUMS_TABLE', $table_prefix . 'forums');

In turn, $table_prefix is defined in config.php

$table_prefix definition

$table_prefix = 'phpbb_';

$forum_id variable is defined in the same viewforum.php file, line 28:

$forum_id variable definition

28  $forum_id = $request->variable('f', 0);

We will assume that $forum_id will be used as a query parameter. Putting all the pieces together:

Backward Tracing Results

config.php, Line 10
10   $table_prefix = 'phpbb_';

includes/constants.php, Line 262
262  define('FORUMS_TABLE', $table_prefix . 'forums');

viewforum.php, Line 28
28   $forum_id = $request->variable('f', 0);

viewforum.php, Line 115
115  $sql = 'UPDATE ' . FORUMS_TABLE . '
116      SET forum_posts_approved = forum_posts_approved + 1
117      WHERE forum_id = ' . $forum_id;

viewforum.php, Line 118
118  $db->sql_query($sql);

phpbb/db/driver/mysql.php, Line 168
168  function sql_query($query = '', $cache_ttl = 0)

phpbb/db/driver/mysql.php, Line 189
189  @mysql_query($query, $this->db_connect_id)

• PHPBB3 Backward Tracing Case # 3 For this example, we will look at the match at viewforum.php, line 70:

PHPBB3 Backward Tracing Example #3

70  $result = $db->sql_query($sql);

It uses the $sql variable, which we found defined at line 67:

67$sql="SELECTf. * $lastread_select 68 FROM$sql _from 69 WHEREf.forum _id=$forum _id"; 7.6. BACKWARD TRACING TRANSLATION 178

That string has three inferred variables; we have to backward trace $lastread_select, $sql_from, and $forum_id. The variable $lastread_select is found three times: the initialization and the changes within two if blocks:

$lastread_select variable initialization and changes inside If block

$lastread_select = '';
if (php_condition) {
    ...
    $lastread_select .= ', ft.mark_time';
}
if (php_condition) {
    ...
    $lastread_select .= ', fw.notify_status';
}

The variable $sql_from is defined at line 50:

$sql_from variable definition

50  $sql_from = FORUMS_TABLE . ' f';

FORUMS_TABLE is a constant defined in includes/constants.php:

FORUMS_TABLE constant definition

define('FORUMS_TABLE', $table_prefix . 'forums');

In turn, $table_prefix is defined in config.php

$table_prefix variable definition

$table_prefix = 'phpbb_';

Continuing with $sql_from:

Constants and variable values definition

if (php_condition) {
    $sql_from .= ' LEFT JOIN ' . FORUMS_TRACK_TABLE . ' ft
        ON (ft.user_id = ' . $user->data['user_id'] . '
        AND ft.forum_id = f.forum_id)';
}

We will leave the concatenation of $user->data['user_id'] as is, because we can see that it will be interpreted as a query parameter. The backward-traced constant FORUMS_TRACK_TABLE is defined in includes/constants.php:

Constants FORUMS_TRACK_TABLE definition

define('FORUMS_TRACK_TABLE', $table_prefix . 'forums_track');

In turn, $table_prefix is defined in config.php

$table_prefix variable definition

$table_prefix = 'phpbb_';

Another change to $sql_from is found in a further condition:

$sql_from continuation

if (php_condition) {
    $sql_from .= ' LEFT JOIN ' . FORUMS_WATCH_TABLE . ' fw ON
        (fw.forum_id = f.forum_id AND fw.user_id = ' . $user->data['user_id'] . ')';
}

The backward-traced constant FORUMS_WATCH_TABLE is defined in includes/constants.php:

FORUMS_WATCH_TABLE constants definition

define('FORUMS_WATCH_TABLE', $table_prefix . 'forums_watch');

In turn, $table_prefix is defined in config.php:

$table_prefix variable definition

$table_prefix = 'phpbb_';

Again, we will leave the concatenation of $user->data[’user_id’] as is, because we can see that it will be interpreted as a query parameter. Finally, $forum_id variable is defined at line 28, where we will assume it is a query parameter:

$forum_id variable definition

28  $forum_id = $request->variable('f', 0);

Putting all pieces together:

Backward Tracing Result

viewforum.php, Line 51
51   $lastread_select = '';

viewforum.php, Line 58
58   if (php_condition) { $lastread_select .= ', ft.mark_time'; }

viewforum.php, Line 64
64   if (php_condition) { $lastread_select .= ', fw.notify_status'; }

config.php, Line 10
10   $table_prefix = 'phpbb_';

includes/constants.php, Line 262
262  define('FORUMS_TABLE', $table_prefix . 'forums');

viewforum.php, Line 50
50   $sql_from = FORUMS_TABLE . ' f';

includes/constants.php, Line 264
264  define('FORUMS_TRACK_TABLE', $table_prefix . 'forums_track');

viewforum.php, Line 56
56   if (php_condition) { $sql_from .= ' LEFT JOIN ' . FORUMS_TRACK_TABLE . '
57       ft ON (ft.user_id = ' . $user->data['user_id'] . '
58       AND ft.forum_id = f.forum_id)'; }

includes/constants.php, Line 264
264  define('FORUMS_WATCH_TABLE', $table_prefix . 'forums_watch');

viewforum.php, Line 63
63   if (php_condition) { $sql_from .= ' LEFT JOIN ' . FORUMS_WATCH_TABLE . '
64       fw ON (fw.forum_id = f.forum_id AND fw.user_id = ' .
65       $user->data['user_id'] . ')'; }

viewforum.php, Line 28
28   $forum_id = $request->variable('f', 0);

viewforum.php, Line 67
67   $sql = "SELECT f.* $lastread_select FROM $sql_from WHERE f.forum_id = $forum_id";

viewforum.php, Line 70
70   $result = $db->sql_query($sql);

phpbb/db/driver/mysql.php, Line 168
168  function sql_query($query = '', $cache_ttl = 0)

phpbb/db/driver/mysql.php, Line 189
189  @mysql_query($query, $this->db_connect_id)

7.6.4 Complex PHPBB3 Query Builds

Most queries are like the previous examples, but there are also more complex query constructions, for example at file viewforum.php, line 521:

Query built with object function

521  $sql = $db->sql_build_query('SELECT', $sql_ary);
522  $result = $db->sql_query($sql);

This query is built with the object function sql_build_query. Luckily, it has only one definition, in phpbb/db/driver/driver.php:

Query built with object function definition

function sql_build_query($query, $array)
{
    $sql = '';
    switch ($query) {
        case 'SELECT':
        case 'SELECT_DISTINCT';
            $sql = str_replace('_', ' ', $query) . ' ' . $array['SELECT'] . ' FROM ';
            // Build table array. We also build an alias array for later checks.
            $table_array = $aliases = array();
            $used_multi_alias = false;
            foreach ($array['FROM'] as $table_name => $alias) {
                if (is_array($alias)) {
                    $used_multi_alias = true;
                    foreach ($alias as $multi_alias) {
                        $table_array[] = $table_name . ' ' . $multi_alias;
                        $aliases[] = $multi_alias;
                    }
                } else {
                    $table_array[] = $table_name . ' ' . $alias;
                    $aliases[] = $alias;
                }
            }

            // We run the following code to determine if we need to re-order the table array.
            if (!empty($array['LEFT_JOIN']) && sizeof($array['FROM']) > 1 && $used_multi_alias !== false) {
                // Take first LEFT JOIN
                $join = current($array['LEFT_JOIN']);
                // If there is a first join match, we need to make sure the table order is correct
                if (!empty($matches[1])) {
                    $first_join_match = trim($matches[1]);
                    $table_array = $last = array();
                    foreach ($array['FROM'] as $table_name => $alias) {
                        if (is_array($alias)) {
                            foreach ($alias as $multi_alias) {
                                ($multi_alias === $first_join_match)
                                    ? $last[] = $table_name . ' ' . $multi_alias
                                    : $table_array[] = $table_name . ' ' . $multi_alias;
                            }
                        } else {
                            ($alias === $first_join_match)
                                ? $last[] = $table_name . ' ' . $alias
                                : $table_array[] = $table_name . ' ' . $alias;
                        }
                    }
                    $table_array = array_merge($table_array, $last);
                }
            }

            $sql .= $this->_sql_custom_build('FROM', implode(' CROSS JOIN ', $table_array));

            if (!empty($array['LEFT_JOIN'])) {
                foreach ($array['LEFT_JOIN'] as $join) {
                    $sql .= ' LEFT JOIN ' . key($join['FROM']) . ' ' . current($join['FROM'])
                        . ' ON (' . $join['ON'] . ')';
                }
            }

            if (!empty($array['WHERE'])) {
                $sql .= ' WHERE ';
                if (is_array($array['WHERE'])) {
                    $sql_where = $this->_process_boolean_tree_first($array['WHERE']);
                } else {
                    $sql_where = $array['WHERE'];
                }
                $sql .= $this->_sql_custom_build('WHERE', $sql_where);
            }

            if (!empty($array['GROUP_BY'])) {
                $sql .= ' GROUP BY ' . $array['GROUP_BY'];
            }

            if (!empty($array['ORDER_BY'])) {
                $sql .= ' ORDER BY ' . $array['ORDER_BY'];
            }

            break;
    }

    return $sql;
}

It is called with SELECT as the first parameter. The second parameter is:

Query Built with Object Function

$sql_ary = array(
    'SELECT'    => $sql_anounce_array['SELECT'],
    'FROM'      => $sql_array['FROM'],
    'LEFT_JOIN' => $sql_anounce_array['LEFT_JOIN'],
    'WHERE'     => '(t.forum_id = ' . $forum_id . ' AND t.topic_type = ' . POST_ANNOUNCE . ') OR
        (' . $db->sql_in_set('t.forum_id', $g_forum_ary) . ' AND t.topic_type = ' . POST_GLOBAL . ')',
    'ORDER_BY'  => 't.topic_time DESC',
);

This array uses the function sql_in_set, which has a similar complexity. To understand exactly what query will be built, that PHP logic has to be interpreted. Here we are not dealing with simple PHP string concatenations; these are PHP algorithms that generate SQL statement parts. The process is able to translate some complex query constructions, but not all of them. Indeed, in PHPBB3 all queries are dynamic: there is no single complete query, and they are all built at runtime. We have some specific rules to catch some of the more complex cases. For example, in PHPBB3 some SQL statements are built by calling the function sql_build_array, and we have a specific rule to catch that. We know that this function receives an array of values and returns the SQL statement part with these values, as in the following example:

Query built with function calling

File"includes/functions _posting.php" has this SQL statement: $sql=’UPDATE’. TOPICS _TABLE.’ SET’.$db->sql _build_array( ’UPDATE’,$sql _data[TOPICS_TABLE][’sql’]).’ WHERE topic_id=’.$data[’topic _id’];

We know that function "sql_build_array" builds an SQL piece of code to assign values in the UPDATE statement. The second parameter is a dictionary of values. So, we have defined a specific pattern that if it finds a call to ’sql_build_array’, it will use its second parameter to set field values, like:

Query built with function calling

$_mongop768 = var_export($sql_data[TOPICS_TABLE]['sql'], TRUE);
$sql = "\$db->phpbb_topics->update(['topic_id' => intval($data[topic_id])],
    ['\$set' => $_mongop768], [multi => 1])";

var_export is a standard PHP function. Here it is used to encode the parameter array as a string, which is later embedded in the query string. The following additions have been made to catch more queries in PHPBB3:

• In PHPBB3, the function that wraps SQL calls may have multiple parameters. Rules in the Backward program have been changed to match function calls with extra parameters. With this change, we were able to find more SQL definitions for query calls.

• Change rules in the translation program to accept SQL embedded parameters that are array elements. PHPBB3 also uses interpolations with array elements, as in the following example:

Interpolations with array elements

SELECT * FROM phpbb_forums WHERE id=$row[id]

• PHPBB3 also uses other constructions, like object properties:

Object properties construction

SELECT * FROM phpbb_forums WHERE parent_id=$this->parent_id

• Catch SQL definitions when multiple complete definitions are used in a single call. Specifically, there are complete SQL definitions created inside IF blocks that later share a call point. Sometimes an SQL call uses multiple SQL definitions; if those definitions are complete, they can be migrated independently, as in the following example:

SQL Statement

if (condition) {
    $sql = "SQL QUERY 1";
} else {
    $sql = "SQL QUERY 2";
}
sql_call($sql)

• Add support for more statements, such as TRUNCATE and ALTER TABLE.

• Add support for some dynamic statements when tables are not migrated. This means allowing some dynamic parameters in queries over non-migrated tables; we are more flexible in the SQL grammar with queries that contain only non-migrated tables, as in the following example:

SQL Statement

SELECT group_id, group_name, group_colour, group_type, group_legend
FROM phpbb_groups
WHERE group_legend > 0
ORDER BY $order_legend ASC

In this example, if phpbb_groups were a migrated table, we would need to know the value of $order_legend. We only need to understand this statement enough to see its table name. The SQL grammar has been updated with more entries with [Parameter] options, where dynamic parts are allowed for non-migrated tables.

This example contains the parameter $order_legend in the ORDER BY clause. This is a PHP interpolated variable; it is not part of the standard SQL grammar. We added the item [Parameter] to the grammar to allow PHP interpolated variables in some parts of the SQL syntax, such as the definition of Expr, so that any SQL statement that expects an expression can take a PHP interpolated variable:

Expression Definition

define Expr
    [number]
  | '- [SPOFF] [Expr] [SPON]
  | [stringlit]
  | [charlit]
  | [Operation2]
  | [Comparing]
  | [ObjectCVar]
  | [Parameter]
  | [SPOFF] [FunctionCall] [SPON]
  | '( [Expr] ')
end define

• Add support for some specific application functions like sql_insert and sql_build_array, which means changing some code before the main line that defines a query. Those changes should only be applied if the query uses migrated tables.

• Support specific application functions: in PHPBB3 there are many SQL statements built using application functions. The Backward program has been updated to include support for some structures using:

– $db->sql_in_set() to construct WHERE IN conditions.

– $db->sql_build_array() to help construct UPDATE statements.

– $db->sql_build_array() to help construct INSERT statements.

This approach helps in the migration of more queries, but it is application specific (a sketch of the sql_in_set pattern follows this list).

• Rule update with parameters: Add support for UPDATE statements with a dynamic column set. All queries with a dynamic column SET are queries for non-migrated tables.
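As an illustration of the sql_in_set pattern mentioned above, a phpBB call builds a WHERE ... IN condition, and an equivalent MongoDB action can use the $in operator. The migrated string below is illustrative only, written in the same style as the other migrated queries; $forum_ids and $_mongoIds are hypothetical names:

// Original phpBB style: sql_in_set builds "forum_id IN (1, 2, 3)".
$sql = 'SELECT topic_id FROM ' . TOPICS_TABLE . '
    WHERE ' . $db->sql_in_set('forum_id', $forum_ids);

// Illustrative migrated form: the id list is encoded with var_export
// (as with sql_build_array above) and becomes a MongoDB $in filter.
$_mongoIds = var_export(array_map('intval', $forum_ids), TRUE);
$sql = "\$rows = \$db->phpbb_topics->find(
    ['forum_id' => ['\$in' => $_mongoIds]],
    ['topic_id' => 1]);
    \$cols = [\"topic_id\"];";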


Testing with PHPBB3, we have seen that some queries intended for non-migrated tables turn out to involve migrated tables with parameters. Each file is processed independently in processFile; therefore, all SQL parts must be in the same file. Wrapping functions (functions that call another function to launch a complete SQL statement) are searched for through all files, and global constants, which in PHPBB are used in all queries, are also gathered beforehand from all files by the extractGlobal and extractDefine rules. Sometimes an SQL call uses multiple SQL definitions. If those SQL definitions are complete, they can be migrated independently. The following is an example from the PHP files of PHPBB3 of catching SQL definitions when multiple complete definitions are used in a single call.

SQL call uses multiple SQL definitions

if(condition) { $sql="SQL QUERY1"; } else { $sql="SQL QUERY2"; } sql_call($sql)

For example, in PHPBB3 the session.php file, at line 644, has a query execution point with two query definitions. These are two complete queries; the query to execute is chosen at runtime. This call is found with the rule trackFunctionCallLocal, because it is a call to launch a query whose parameter is defined in the same PHP block. In fact, it is defined twice. This call to sql_query has two different and complete SQL definitions, and they are reported independently.

SQL call uses multiple SQL definitions example

644  if (!$bot)
645  {
646      $sql = 'SELECT *
647          FROM ' . USERS_TABLE . '
648          WHERE user_id = ' . (int) $this->cookie_data['u'];
649  }
650  else
651  {
652      // We give bots always the same session if it is not yet expired.
653      $sql = 'SELECT u.*, s.*
654          FROM ' . USERS_TABLE . ' u
655          LEFT JOIN ' . SESSIONS_TABLE . ' s ON (s.session_user_id = u.user_id)
656          WHERE u.user_id = ' . (int) $bot;
657  }
658  $result = $db->sql_query($sql);
659  $this->data = $db->sql_fetchrow($result);
660  $db->sql_freeresult($result);

In this code, sql_query receives variable ’$sql’, which contains the SQL statement. The call will be marked with:

SQL call Mark

track_call_local('sql_query', $sql);

This means that this sql_query call uses the variable $sql, which is defined locally. In the previous lines we find the variable assignments, with two versions; we treat each one independently. The first assignment is:

Variable assignment version # 1

$sql = 'SELECT * FROM ' . USERS_TABLE . ' WHERE user_id = ' . (int) $this->cookie_data['u'];

USERS_TABLE is defined as phpbb_users as a constant in another file (constants.php). It can be changed to:

Table constant definition

$sql = 'SELECT * FROM phpbb_users WHERE user_id = ' . (int) $this->cookie_data['u'];

To be able to interpolate the expression "(int) $this->cookie_data['u']", we define it as a variable:

Expression interpolation

$_mongop201 = (int) $this->cookie_data['u'];
$sql = "SELECT * FROM phpbb_users WHERE user_id = $_mongop201";

Since this value is now a string literal, it will be marked with:

New mark of string literal

track_strlit($sql,"SELECT * From phpbb_users where user_id=$ _mongop201");

This means that $sql is a variable used in a query launch, and that it contains a string literal. This literal will be migrated from SQL to MongoDB. The other assignment is found in:

Variable assignment version # 2

$sql = 'SELECT u.*, s.*
    FROM ' . USERS_TABLE . ' u
    LEFT JOIN ' . SESSIONS_TABLE . ' s ON (s.session_user_id = u.user_id)
    WHERE u.user_id = ' . (int) $bot;

As in the previous query, the table constants can be replaced by their names:

Table constant definition

$sql = 'SELECT u.*, s.*
    FROM phpbb_users u
    LEFT JOIN phpbb_sessions s ON (s.session_user_id = u.user_id)
    WHERE u.user_id = ' . (int) $bot;

And the expression "(int) $bot" can be interpolated with a variable:

Expression interpolation

$_mongop202 = (int) $bot;
$sql = "SELECT u.*, s.*
    FROM phpbb_users u
    LEFT JOIN phpbb_sessions s ON (s.session_user_id = u.user_id)
    WHERE u.user_id = $_mongop202";

Since this value is now a literal, it will be marked with:

New mark of string literal

track_strlit($sql, "SELECT u.*, s.*
    FROM phpbb_users u
    LEFT JOIN phpbb_sessions s ON (s.session_user_id = u.user_id)
    WHERE u.user_id = $_mongop202");

This means that $sql is a variable used in a query launch, and that it contains a string literal. This literal will be migrated from SQL to MongoDB. The complete migration for that code is:

Complete code migration

133  if (!$bot)
134  {
135      {
136          $_mongop201 = (int) $this->cookie_data['u'];
137          $sql = ((("
138              //\"SELECT * FROM phpbb_users WHERE user_id=$_mongop201\"
139              \$rows = \$db->phpbb_users->find(['user_id' => intval($_mongop201)]);
140              \$cols = [];
141          ")));
142          track_strlit($sql, "SELECT * FROM phpbb_users WHERE user_id=$_mongop201");
143      }
144  }
145  else
146  {
147      {
148          $_mongop202 = (int) $bot;
149          $sql = ((("//\"SELECT u.*, s.* FROM phpbb_users u LEFT JOIN phpbb_sessions s ON
150              (s.session_user_id=u.user_id) WHERE u.user_id=$_mongop202\"
151              \$rows = \$db->phpbb_users->find(['user_id' => intval($_mongop202)]);
152              \$rows0 = \$rows;
153              $list_user_id = array_column($rows0, 'user_id');
154              foreach ($list_user_id as &$a) { $a = (int) $a; };
155              \$res = \$_sql('SELECT * FROM phpbb_sessions WHERE session_user_id IN \$list_user_id');
156              \$rows = array(); while (\$row = mysql_fetch_assoc(\$res)) \$rows[] = \$row;
157              $rows = $s2m->join_external($rows0, 'user_id', $rows, 'session_user_id', true);
158              \$cols = [];
159              \$proj = [\"name1\" => \"\", \"name2\" => \"\"];")));
160          track_strlit($sql, "SELECT u.*, s.* FROM phpbb_users u
161              LEFT JOIN phpbb_sessions s ON (s.session_user_id=u.user_id)
162              WHERE u.user_id=$_mongop202");
163      }
164  }
165  $result = $db->sql_query((($sql)));
166  track_call_local('sql_query', $sql);

With these changes made to the grammar, we were able to migrate more SQL definitions to MongoDB actions. After adapting the process to handle PHPBB3, it found 1828 query calls and the definitions of 1055 SQL statements. From these, it was able to migrate 1185 queries.

7.7 Conclusion

In this chapter, we have presented the migration of the applications to use the migrated database through the backward tracing technique. We have gone through the backward tracing process by illustrating examples of tracing and migrating queries from the SCARF and PHPBB3 applications. We have migrated SCARF and both versions of PHPBB. The missing parts are a result of missing elements of the backwards markup and of the dynamic nature of queries for which the static analysis could not recover a complete statement. Another limitation of backward tracing is that the query may not exist in the source code at all. Backwards tracing is a well-understood problem, and completion is a matter of dealing with many language details of PHP. We finished the migration of three PHPBB3 web pages manually. We only manually migrated the untranslated queries and did not modify the queries that were migrated automatically. This allowed us to test the functionality of the automatically translated queries in these pages (which executed correctly). We logged all SQL queries executed in a PHPBB3 session, testing as many pages as possible, by instrumenting PHPBB3 to write all executed queries to a file and then going through all pages in the web browser with an anonymous user and with an identified user. With that, we obtained a list of complete PHPBB3 queries to test our procedures and improve our migration rules.

In the next chapter, we will illustrate some examples and use cases of the application migration process through the backward tracing approach. Then we will evaluate the application migration phase of our proposed framework.

Chapter 8

Application Migration Evaluation

8.1 Introduction

In the previous chapters, we have outlined the phases of our framework for semi-automatic migration of web applications from SQL to document-oriented NoSQL databases using a sequence of source transformations that migrate the SQL statements into MongoDB actions in the PHP source code. Our running examples have demonstrated the application of the process to small and medium, but representative, web applications. We have evaluated the migration process of each phase in the previous chapters. In this chapter, we demonstrate and evaluate the backward tracing process described in Chapter 7 on two web applications under test: SCARF, the Stanford Conference and Research Forum, a discussion forum application, and PHPBB versions 2 and 3, a bulletin board forum application used by thousands of users worldwide.

8.2 SCARF Backward Tracing Process

SCARF [85], the Stanford Conference and Research Forum, is a PHP-based web application designed to help researchers and conference administrators create and maintain discussion forums for their research papers. We start the backward tracing process from the location where each of the migrated SQL statements is executed (a call to the function mysql_query). This is the configured function used to launch SQL queries in the SCARF application. The SQL statement may be a literal string, in which case the transformation is done there. Otherwise, we move backwards from the execution of the SQL statement in the program, tracking the construction of the string literal that contains the SQL statement. SCARF has 14 calls to mysql_query:

• 2 calls with a string literal.

• 2 calls with a concatenated string.

• 9 calls receive SQL as a variable defined locally.

• 1 call with a function parameter, function named ’query’.

In turn, the query function has 57 calls:

• 39 calls with a string literal.

• 18 calls with a concatenated string.

By converting the string concatenation to a single string, we can catch all query definitions in SCARF. This gives us the string of the static query when the concatenated parts represent query parameters. When these dynamic parts represent a part of the query itself, they may need more treatment. In the following part, we illustrate use cases of the backward tracing process for the SCARF application. SCARF uses the mysql_query function to execute queries on MySQL. mysql_query has 14 matches. If we skip the installation script, this gives us 2 matches in functions.php. We start with the easiest: a direct call to mysql_query with a string literal.

• SCARF Backward Tracing Case #1

SCARF Simple Backward Tracing Example

functions.php:
$result = mysql_query("SELECT value FROM options WHERE name='Conference Name'");
track_call_strlit('mysql_query',
    "SELECT value FROM options WHERE name='Conference Name'");

In this case, we have a mark where the query is defined, and it is the same place where it is executed. We have it as a single constant string. It calls mysql_query directly with a static SQL string. Only queries from the installation script use mysql_query directly; all the other application queries use an intermediate function, as shown in the next example.

• SCARF Backward Tracing Case #2 The other match for mysql_query is at functions.php, line 10:

SCARF Backward Tracing Example #2

functions.php:
 8  function query($sql)
 9  {
10      $result = mysql_query($sql);
11      if (!$result) die(mysql_error());
12      return $result;
13  }

Here this function receives an $sql variable, and we search for its definition. It is a function parameter, defined in the previous line, so we need to search all calls to this function. Searching for usages of the query function gives us 76 results. This tells us that SCARF uses this function to execute SQL queries. Following all 76 matches, we would have a complete backward trace for SCARF. Here we will follow the match at file comments.php, line 116.

SCARF Backward Tracing Example #2

comments.php:
116  $result = query(
117      "SELECT approved, paper_id, comment_id,
118      showemail, users.email, comment, UNIX_TIMESTAMP(date) AS date,
119      CONCAT(firstname, ' ', lastname) AS fullname
120      FROM comments
121      LEFT JOIN users on comments.user_id = users.user_id
122      $where ORDER BY paper_id");

This call is taking an SQL string. It has $where as an inferred value. We search for its definition; it is found twice in the same file:

Backward tracing with $where example

$where="WHERE approved=’0’"; and $where="WHERE paper _id=’$id’ AND approved=’1’";

The second match has the inferred variable $id. We will assume that $id will be interpreted as a parameter, because it is a PHP variable embedded inside the SQL statement. If we were not able to guess what part of the final built query it is, we would also have to backward trace its definition. Here $id is a filter parameter, so we can omit its definition. These $where definitions are inside IF blocks.

$where definition inside an IF block

if (!isset($_GET['paper_id'])) {
    $where = "WHERE approved='0'";
} else {
    ...
    $where = "WHERE paper_id='$id' AND approved='1'";
}

With that, we have all the parts involved. Putting it together:

Backward Tracing Result

comments.php, Line 38
38   if (php_condition) {
39       $where = "WHERE approved='0'"; }

comments.php, Line 41
41   else {
42       php_code
43       $where = "WHERE paper_id='$id' AND approved='1'";
44   }

comments.php, Line 116
116  $result = query("SELECT approved, paper_id, comment_id, showemail,
117      users.email, comment, UNIX_TIMESTAMP(date) AS date,
118      CONCAT(firstname, ' ', lastname) AS fullname
119      FROM comments
120      LEFT JOIN users on comments.user_id = users.user_id
121      $where ORDER BY paper_id");

functions.php, Line 10
10   function query($sql) {
11       $result = mysql_query($sql);

All migrated SCARF queries executed and returned the expected results for each version of the query.

8.3 PHPBB2 Backward Tracing Process

PHPBB2 also uses mysql_query to launch SQL queries; the query calls are all wrapped through an object function named sql_query. PHPBB2 has fewer queries than PHPBB3, and fewer of them are dynamic. It also has fewer non-migrated queries in the principal PHP pages. In the first run, the process marked 159 queries as no_sql, so for 35% of all query calls it did not find the query definition. That is because in PHPBB2 many SQL calls are used inside IF conditions. We have updated the rules of the Backward program to take this into account, and it gives these results from a total of 460 queries:

• not_migrated queries 69 (15%).

• migrated queries 391 (85%).

The main difference in query calls between PHPBB2 and PHPBB3 is that in PHPBB2 many query calls have a different shape. There are calls like this:

PHPBB2 Query Call example

if (!($result = $db->sql_query($sql))) {
    message_die(CRITICAL_ERROR, "Could not query config information",
        "", __LINE__, __FILE__, $sql);
}

So the query call is a variable assignment placed inside a condition. In PHPBB3, variable assignments are outside conditions, which is much cleaner:

PHPBB3 Query Call example

$result = $db->sql_query($sql);

The function trackFunctionCallLocalVar has been changed to catch calls to query functions inside other PHP structures. These functions also use the new rules removeBlocks and removeCase because, inside them, we are only interested in calls that are not inside other PHP blocks (inside inner 'if' code blocks, for example). These two exclusions are needed so that migrated queries in PHPBB2 are not lost, as illustrated below.
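A minimal PHP sketch of the kind of nesting these rules must distinguish (the function body and the log_query helper are hypothetical):

// The rules must distinguish the wrapper's own top-level call to mysql_query
// from calls nested inside inner blocks such as this if block.
function sql_query($sql)
{
    if (defined('DEBUG')) {
        log_query($sql);        // nested inside an if block: excluded by removeBlocks
    }
    return mysql_query($sql);   // top-level call: identifies this function as a wrapper
}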

8.4 PHPBB3 Backward Tracing Process

Testing PHPBB3, we noticed that its cache module did not work well with the migrated queries. This problem arises because the PHPBB3 implementation uses results from the database directly as keys for its cache algorithms in PHP. It works with MySQL because MySQL returns its results as a PHP resource, so we need to return resources from our instrumented functions as well. PHPBB3 uses its own implementation of a database cache. We had to take this into consideration because PHPBB3 mangles queries before running them. For example, we encode the translated MongoDB PHP code as a string because, unlike in SCARF, it is not only passed as a parameter to be executed at another point: PHPBB3 also uses the query as a key in a dictionary, so the query has to be usable as a dictionary key, like a string. We launched the backward tracing process that we used to migrate SCARF on PHPBB3. To do the backward search, we previously loaded all files at once, so the migration rules could be applied to all sources and catch cross-references. This was fine with SCARF, which had only 19 files, but PHPBB3 has 2906 PHP files, about 21 MB on disk, and the process is not able to load all those files at once. In the first implementation, all files were read once and kept in memory. In the current implementation, they are

read multiple times with the following steps:

• Gather constants that may be used in any file.

• Find functions that wrap the API to launch SQL queries.

• Process each file, instrumenting database calls and migrating SQL queries. Since each file is processed independently, it only has the information from other files gathered in the previous steps, so the search for cross-references is more limited.

The following additions have been made to migrate the PHPBB3 application:

• Add rules to catch calls to object functions: Since PHPBB3 is programmed with object-oriented PHP, the rules that search for function calls have been updated to catch this other form of function call.

• Add new rules to catch dynamic SQL queries: In PHPBB3, all queries are dynamic, since table names are defined as constant values in one file and concatenated into SQL queries whenever a table is referenced. The process creates a global variable GLOBAL_VARS with all constants defined in any file. They are resolved when referenced in any SQL string, so the migration rule can transform these queries; see the sketch below. Without these new rules, the process would not catch any complete SQL sentence.
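A minimal PHP sketch of the pattern these rules resolve (the constant value follows the PHPBB convention described later; the surrounding code is illustrative):

// constants.php: the table name is a constant, not a literal in the query
define('TOPICS_TABLE', 'phpbb_topics');

// elsewhere: the SQL string is only complete once the constant is resolved
$sql = 'SELECT forum_id FROM ' . TOPICS_TABLE . ' WHERE topic_id = ' . $topic_id;
// after resolving TOPICS_TABLE via GLOBAL_VARS:
//   "SELECT forum_id FROM phpbb_topics WHERE topic_id = $topic_id"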

All PHPBB3 queries are dynamic, in that their table names are PHP constants. PHPBB3 does not use prepared statements; parameter values are passed inside the query string. PHPBB uses a different type of function call because it is programmed with objects, while in SCARF all calls were to plain functions, as in the following examples. First, the plain function call in SCARF, where 'query' is a global function that launches a query:

SCARF plain function call

$result = query($sql);

An object function call in PHPBB, where $db is an object of a class that has a definition for the sql_query function:

PHPBB object function call

$result = $db->sql_query($sql);

Backward tracing rules have been changed to follow calls to object functions.

8.4.1 Backward Tracing Queries with Multiple Parameters

Applying backward tracing to PHPBB3, we have 1828 query calls, for which we initially find 1055 query definitions. Looking at the query calls for which no definition is found, we see that most of them use a call with many parameters. Taking this into consideration, we catch 500 more query definitions. Changing the rule that parses external parameters gives 608 migrated queries. Other queries use array elements as parameters; adapting the Backward process to accept parameters of this type allows it to process even more queries correctly. The process places marks in the source code, as illustrated below.
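A hedged PHP illustration of the call shapes the extended rules must match, together with the kind of mark the process places (the track_* names appear in the instrumented output shown later; the $queries array is hypothetical):

$result = $db->sql_query($sql, $cache_ttl);        // call with an extra parameter
$result = $db->sql_query($queries['forum_list']);  // array element as the query argument
track_call_local('sql_query', $sql);               // mark placed next to a traced call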

8.4.2 Non-Migrated Cases

The most recurrent issue is with SQL that references non-migrated tables. Sometimes the process has mangled some parameters in order to migrate to MongoDB, but this interferes with queries that have to be launched in MySQL. We modified the process to keep these parameters as untouched as possible. Also, some SQL sentences lose spaces between SQL parts in the transformations. Queries whose table is given by a parameter are of type not_migrated. In those cases, the SQL sentence is found, but it is not migrated because the table is not known, like this query from out_phpbb/phpbb/captcha/plugins/qa.php:

non-migrated statement with table given by a parameter

DELETE FROM $table WHERE question_id = $question_id

This sentence should only be migrated if $table holds the name of a migrated table. The following option has been implemented to migrate more queries:

• Migrate queries with dynamic table names by inserting into the PHP both the migrated and the unchanged query, together with a condition that chooses the right one based on the table name. There may be an issue here when the migration depends on a specific table name: this requires the query to be present in both SQL and MongoDB forms, since the execution path is decided by the table name at runtime.

In this step we also add a number of other small rules to migrate more queries (changing them from not_migrated to migrated). Each of these rules affects a small number of queries, but together they increase the number of migrated queries.

8.5 SQL Backward Tracing Examples

The following examples show how backward tracing can be used to locate an SQL statement starting from the source code point where the query is processed. The SCARF and PHPBB3 applications use multiple functions to operate with the MySQL database: there are functions to manage connections, cursors, etc. The function mysql_query is the one used to pass the SQL statement to MySQL, so we use it as the starting point of the backward tracing. We use backward tracing to find out what SQL will be executed, but the migrated query must still be executed where mysql_query is placed, not where the SQL string is defined. For example:

SQL Backward Tracing Example

$sql = "SELECT * FROM USERS";
block();
mysql_query($sql);

Here the query is defined before the call to block(), but it is executed after that call. The following section illustrates the backward tracing process for SQL statements that contain operators.

8.5.1 Backward Tracing Queries with Operators

Backward tracing is not affected by SQL operators: first we locate the SQL sentences, and only later do we migrate them. We do not have to backward trace an operator or clause individually; we trace the SQL statement as a whole, ignoring its content. The migrated applications use the PHP function mysql_query to launch queries, and we do a backward trace from calls to this function to find where its parameters are built. We do not search for strings that look like SQL; we search for strings that are used as SQL, no matter what their shape is. In the same way, if the source code has a string that looks like SQL but is not used as SQL, it will not be migrated; it may be part of the application content shown to the user. The process searches for SQL definitions starting from the SQL calls. It tries to include the expressions and variables used in the SQL construction; these variables may come from a web form, a computation, or a file. The following is an example of how backward tracing is done for an SQL construction that contains an operator such as greater than ">" or less than "<", taken from the includes/acp/acp_main.php file of the PHPBB3 application. This is a Double query. It gets all poster_id and topic_id values from all posts of the specified forum where topic_moved_id is 0, topic_last_post_time is greater than the $get_from_time variable, and poster_id is not ANONYMOUS. The filters involve fields from both tables, so the query does a JOIN on topic_id to get this information.

Backward tracing of SQL statement with ">" operator

$sql = 'SELECT p.poster_id, p.topic_id
    FROM ' . POSTS_TABLE . ' p, ' . TOPICS_TABLE . ' t
    WHERE t.forum_id = ' . $forum_id . '
        AND t.topic_moved_id = 0
        AND t.topic_last_post_time > ' . $get_from_time . '
        AND t.topic_id = p.topic_id
        AND p.poster_id <> ' . ANONYMOUS . '
    GROUP BY p.poster_id, p.topic_id';

Comparisons are migrated following the table of conversions [74]. So, this condition:

">" operator translation in MongoDB

t.topic_last_post_time > $get_from_time

is translated to:

't.topic_last_post_time' => ['\$gt' => intval($get_from_time)]

$gt is the comparison operator for "greater than" in MongoDB.

Since the SQL statement joins two tables, it has been migrated as an aggregation to the following MongoDB action:

Equivalent MongoDB action with ">" Operator

\$rows = \$db->phpbb_posts->aggregate(
    [['\$match' => ['poster_id' => ['\$ne' => intval($_mongop720)]]],
     ['\$lookup' => ['from' => 'phpbb_topics', 'localField' => 'topic_id',
         'foreignField' => 'topic_id', 'as' => 't']],
     ['\$unwind' => ['path' => '\$t', 'preserveNullAndEmptyArrays' => false]],
     ['\$project' => ['poster_id' => 1, 'topic_id' => 1, 't.forum_id' => 1,
         't.topic_moved_id' => 1, 't.topic_last_post_time' => 1]],
     ['\$match' => ['\$and' => [['t.forum_id' => intval($forum_id)],
         ['t.topic_moved_id' => 0],
         ['t.topic_last_post_time' => ['\$gt' => intval($get_from_time)]]]]],
     ['\$group' => ['_id' => ['poster_id' => '\$poster_id',
         'topic_id' => '\$topic_id']]]]);
\$cols = [\"poster_id\", \"topic_id\"];")));
$result = $db->sql_query((($sql)));
track_call_local('sql_query', $sql);
$posted = array();
while ($row = $db->sql_fetchrow($result)) {
    $posted[$row['poster_id']][] = $row['topic_id'];
}
$db->sql_freeresult($result);
$sql_ary = array();
foreach ($posted as $user_id => $topic_row) {
    foreach ($topic_row as $topic_id) {
        $sql_ary[] = array(
            'user_id' => (int) $user_id,
            'topic_id' => (int) $topic_id,
            'topic_posted' => 1,
        );
    }
}

$ne selects the documents where the value of the field is not equal to the specified value; this includes documents that do not contain the field. If we want to keep rows with no topic elements, we can set the unwind parameter preserveNullAndEmptyArrays, which keeps documents where the array field is missing, null or an empty array, as in the sketch below.
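A minimal sketch of this alternative, changing only the flag in the $unwind stage shown above:

['\$unwind' => ['path' => '\$t', 'preserveNullAndEmptyArrays' => true]]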

Another example of an SQL construction with ">" can be found in an update statement in the includes/acp/acp_forums.php file:

An SQL construction with ">" operator

$sql = 'UPDATE ' . FORUMS_TABLE . '
    SET left_id = left_id + 2, right_id = right_id + 2
    WHERE left_id > ' . $row['right_id'];

It updates all rows of FORUMS_TABLE where left_id is greater than the $row['right_id'] variable, increasing the left_id and right_id fields by 2. The example below shows the resulting MongoDB action:

Migrated MongoDB action

\$db->phpbb_forums->update(
    ['left_id' => ['\$gt' => intval($row['right_id'])]],
    ['\$inc' => ['left_id' => 2, 'right_id' => 2]],
    ['multi' => 1]);

The (">", "<", ">=", "<=", "<>") SQL operators are translated into ("$gt", "$lt", "$gte", "$lte", "$ne") in the MongoDB actions, as sketched below.
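A hedged PHP sketch of this operator mapping as a lookup table (the helper name mapSqlOperator is hypothetical; the pairs come from the list above):

// Maps an SQL comparison operator to its MongoDB counterpart.
function mapSqlOperator(string $op): string
{
    static $map = [
        '>'  => '$gt',
        '<'  => '$lt',
        '>=' => '$gte',
        '<=' => '$lte',
        '<>' => '$ne',
    ];
    return $map[$op];
}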

8.5.2 Issues with non-Migrated Tables

There is an issue with SQL statements that reference non-migrated tables. Sometimes the process has mangled some parameters in order to migrate to MongoDB, but this interferes with queries that have to be launched in MySQL. We changed the rules to keep these parameters as unchanged as possible. Also, some SQL statements lose spaces between SQL parts in the transformations; we patched these queries in the PHPBB3 source to continue with the tests. PHPBB3 processed with our automated process is runnable: the migrated code is able to execute queries of the types SINGLE, DUAL, DOUBLE and UNCHANGED, but the Unchanged queries present some problems. These are detected queries that use tables kept in MySQL. To migrate more queries, we reduced the dynamic parts of UNCHANGED queries. Most of the remaining dynamic queries specify the table name with a parameter; others are built with application functions such as sql_build_query, which are specific to PHPBB3. We handle these concrete functions, which build the main queries in pages like view_forum.php and view_topic.php, and this improves automatic migration in general. The export_translation.txl program generates a PHP file with a function to determine whether a table has been migrated, using information from the data migration files. In the sql2mongo.txl migration program, the function prepareTableAsParameter is called before the classification rules. If the query has only one table name specified, it is duplicated: one copy is treated as if the parameter specifies a migrated table, and the other as if it does not. As an example, the admin_db_utilities.php file has a query as generic as:

table name as a parameter query

"SELECT * FROM $table"

Now it will be translated to:

if (\mongify::isCollection($table)) {
    $sql = [\"SELECT * FROM $table\"];
} else {
    $rows = \$db->selectCollection($table)->find();
    $cols = [];
}
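For reference, a hedged sketch of what the generated table-membership helper could look like (the class and method names come from the example above; the collection list shown is illustrative):

// Generated from the data migration files by export_translation.txl.
class mongify
{
    // Tables that were migrated to MongoDB collections (illustrative subset).
    private static $collections = ['phpbb_topics', 'phpbb_forums', 'phpbb_posts'];

    public static function isCollection($table)
    {
        return in_array($table, self::$collections, true);
    }
}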

• Create rules to reduce the dynamic parts of undetected queries, such as tables specified with parameters. Here, tables specified as a parameter prevent us from migrating some queries. These queries may have to be migrated to MongoDB if the table is migrated, or be marked as Unchanged if the table is kept in MySQL; sometimes the same query is reused with both migrated and un-migrated tables. Therefore, we created a rule that, for these cases, generates a migrated query while also keeping its SQL statement, along with instrumented PHP that chooses which one to execute depending on the table used. By applying this modification, we increased the number of queries that can be migrated automatically.

• In MySQL, INSERT statements can have an option to specify a priority. In PHPBB this option is sometimes dynamic. Since the priority is ignored in MongoDB, we can ignore this parameter, as in this example from viewtopic.php (a translation sketch follows the example):

SQL Insert Statement with priority

INSERT $sql_priority INTO phpbb_topics_watch (user_id, topic_id, notify_status)
VALUES ($userdata[user_id], $topic_id, 0)
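A hedged sketch of the corresponding MongoDB insert, assuming the same intval casts used by the migrated actions elsewhere (the priority hint is simply dropped):

$db->phpbb_topics_watch->insert([
    'user_id' => intval($userdata['user_id']),   // $sql_priority has no MongoDB counterpart
    'topic_id' => intval($topic_id),
    'notify_status' => 0,
]);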

• Modify the priority definition in the MySQL grammar to allow a parameter. The UpdateStatement and DeleteStatement definitions have also been changed to use this definition.

• Change other rules that only accept one table, such as migrateUpdate and migrateDelete, to accept table names as parameters. For example, in nestedset.php:

UPDATE statement with table name as a parameter

UPDATE $this->table_name
SET $this->column_item_parents = '$_mongop'
WHERE $this->column_parent_id = $_mongop

• Allow parameters in more SQL conditions, such as NOT IN. For example, in groupcp.php:

UPDATE statement with ’NOT IN’ condition

UPDATE phpbb_users
SET user_level = $_mongop
WHERE user_id IN ($remove_mod_sql)
    AND user_level NOT IN ($_mongop)

• Allow UPDATE statements with a column name as a parameter. For example, in admin_forums.php:

UPDATE statement with column name parameter

UPDATE $table SET $orderfield = $i WHERE $idfield = $_mongop

Note that in this case the value of $i is not cast to the column type, since both table and column are unknown. This is an example of a query with a table defined dynamically by a PHP variable: the translation process does not know what value $table will have, and since it does not know the table name, it cannot know its structure. When the column type is not known, the migrated query does not include the type cast, as contrasted below.
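A minimal illustration of the contrast (both fragments follow the conventions of the migrated actions shown earlier):

// Known column type (e.g. an INT column): the migration adds a cast.
'topic_id' => intval($topic_id)

// Unknown table and column: no cast can be added.
$orderfield => $i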

8.5.3 Difference between the Manual Migration and the Automated One

In the manual migration we call MongoDB directly, choosing not to call sql_query, because this is cleaner and the code is easier to read; it could also have been done by passing the MongoDB action as a parameter to the sql_query function. In the manual migration we get rid of the intermediary layers and place the PHP code directly where we know it will produce the same result. So we are not following the string approach we used in the automated migration, which defines the MongoDB operation where the SQL statement is defined and executes it where the SQL would be executed. For example, this code in viewtopic.php:

SQL statement

$sql = 'SELECT forum_id FROM ' . TOPICS_TABLE . " WHERE topic_id = $topic_id";
$result = $db->sql_query($sql);
$forum_id = (int) $db->sql_fetchfield('forum_id');

This code creates an SQL query and passes it to the sql_query function. Manually, we determined that we could bypass this call and call MongoDB directly, as follows:

Migrated MongoDB Action

$row = $mongo->db->Topics->findOne(['topic_id' => $topic_id], ['forum_id' => 1]);
$forum_id = $row['forum_id'];

However, the automated migration process still has to call the sql_query function; the query is executed at the point where this function calls the mysql_query function. This is the sql_query function definition:

sql_query function definition

function sql_query($query = '', $cache_ttl = 0)
{
    if ($query != '')
    {
        global $cache;

        // EXPLAIN only in extra debug mode
        if (defined('DEBUG'))
            { $this->sql_report('start', $query); }
        else if (defined('PHPBB_DISPLAY_LOAD_TIME'))
            { $this->curtime = microtime(true); }

        $this->query_result = ($cache && $cache_ttl) ? $cache->sql_load($query) : false;
        $this->sql_add_num_queries($this->query_result);

        if ($this->query_result === false)
        {
            if (($this->query_result = @mysql_query($query, $this->db_connect_id)) === false)
                { $this->sql_error($query); }

            if (defined('DEBUG'))
                { $this->sql_report('stop', $query); }
            else if (defined('PHPBB_DISPLAY_LOAD_TIME'))
                { $this->sql_time += microtime(true) - $this->curtime; }

            if ($cache && $cache_ttl)
            {
                $this->open_queries[(int) $this->query_result] = $this->query_result;
                $this->query_result = $cache->sql_save($this, $query, $this->query_result, $cache_ttl);
            }
            else if (strpos($query, 'SELECT') === 0 && $this->query_result)
                { $this->open_queries[(int) $this->query_result] = $this->query_result; }
        }
        else if (defined('DEBUG'))
            { $this->sql_report('fromcache', $query); }
    }
    else
    {
        return false;
    }

    return $this->query_result;
}

8.6 PHPBB3 Application Evaluation

Since PHPBB3 comprises thousands of files, it is more difficult to analyse than SCARF. With the latest improvements in the rules, the process gives these totals:

• no_sql (149)

• not_migrated (649)

• migrated (988)

The migration process searches for query calls, and from there determines which query each will use. The source code has 149 query calls for which no SQL was found: these calls use neither a direct SQL statement nor PHP code that the process can identify as constructing the SQL statement. This means the statement is determined at runtime, and the migration process cannot know what it will be. For the other query calls, a query definition has been found, and of these 1637 SQL statements, 988 have been migrated.

8.6.1 SQL Process Examples of the Migration

The migration process changes SQL statements to their MongoDB action counterparts with these steps:

1. Search for PHP code that generates SQL statements.
2. Check whether we can recover the SQL statements.
3. Change the statement definition to its MongoDB counterpart.

These changes affect the PHP code where the SQL statement is defined. Apart from that, PHPBB has one point where all application SQL statements are received and executed; this code is changed to also execute MongoDB actions. The following SELECT statement example illustrates the process.

• Find PHP with SQL construction

Find PHP with SQL construction

$sql = 'SELECT forum_id FROM ' . TOPICS_TABLE . ' WHERE topic_id = ' . $topic_id;
$result = $db->sql_query($sql);

This query is used to obtain the forum_id of a specific topic. It is a basic SELECT query built in the file includes/functions_admin.php, at line 601.

The process first searches all PHPBB3 files for calls to mysql_query, the PHP function used to launch queries against MySQL. It is used in phpbb/db/driver/mysql.php, in the function sql_query, so the process finds that sql_query is the function this application uses to launch queries.

In this code, sql_query receives the variable $sql, which contains the SQL statement. In the previous lines we find the variable's assignment: it is a concatenation of multiple values (the dot '.' concatenates strings in PHP).

Find PHP with SQL construction

'SELECT forum_id FROM ' . TOPICS_TABLE . ' WHERE topic_id = ' . $topic_id

TOPICS_TABLE is defined as the constant phpbb_topics in another file (constants.php), so the expression can be rewritten as:

Find PHP with SQL construction

"SELECT forum_id FROM phpbb_topics WHERE topic_id = " . $topic_id

Table 8.1: SQL to MongoDB Query Mapping

SQL Statement:                  SELECT a, b FROM users WHERE age = 33
Mongo Query Language Statement: db->users->find(['age' => 33], ['a' => 1, 'b' => 1]);

Table 8.2: Codified SQL to MongoDB Query Mapping

SQL Statement:                  SELECT [projection] FROM [table] WHERE [filter]
Mongo Query Language Statement: db->[table]->find([filter], [projection]);

PHP also allows variables to be interpolated inside strings, so the process embeds the $topic_id variable to get a PHP string with a complete SQL statement:

Find PHP with SQL construction

"SELECT forum_id FROM phpbb_topics WHERE topic_id = $topic_id"

• Transform SQL to MongoDB: We parse the string as an SQL statement. It uses the phpbb_topics table; since this is a migrated table, the process will migrate this sentence. The process uses pattern matching based on an SQL to MongoDB mapping table [74]. Table 8.1 shows the SQL statement and the mapped MongoDB action, and Table 8.2 shows the codified form of this mapping used inside the process.

Changing parameters to MongoDB style, the migrated statement would have this shape:

Migrated MongoDB action

$db->phpbb_topics->find(['topic_id' => $topic_id], ['forum_id' => 1]);

It also adds type casts, because MongoDB is stricter about types than MySQL, and a $cols variable so that the column information is available later when the application runs the query.

Migrated MongoDB action

$rows = $db->phpbb_topics->find(['topic_id' => intval($topic_id)], ['forum_id' => 1]);
$cols = ["forum_id"];

• PHP with migrated statement: The PHP source code is modified with the migrated statement at the point where the SQL was built, and now contains this code:

Migrated MongoDB action

$sql = "\$rows = \$db->phpbb_topics->find(['topic_id' => intval($topic_id)],
    ['forum_id' => 1]);
    \$cols = [\"forum_id\"];";
$result = $db->sql_query($sql);

• Query execution: The migration process also instruments the calls to the MySQL API. The instrumented PHP functions are defined in the file sql2mongo.php and manage the interaction with both MySQL and MongoDB. In particular, all calls to sql_query used to launch queries are redirected to our instrumented PHP function; when this function receives a query intended for MongoDB, it executes it with the API for that database.

The generated MongoDB actions mimic the original SQL sentences: each MongoDB operation produces output as similar as possible to its SQL counterpart. This way we can maintain the same business logic.

In our case, when the call to the instrumented sql_query receives our MongoDB find, it returns the same results as the original SQL SELECT. So this line gets the same result with MongoDB and with MySQL (a sketch of the dispatch follows the example):

The same Result for MySQL and MongoDB

$result = $db->sql_query($sql);
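A hedged sketch of how such an instrumented dispatcher could distinguish the two cases, assuming the migrated action is encoded as PHP code in the query string as described above (the wrapper name sql2mongo_query, the detection heuristic, and the return shape are hypothetical):

// Executes MongoDB actions where the original code executed SQL.
function sql2mongo_query($sql)
{
    if (strpos(ltrim($sql), '$rows') === 0) {
        // Migrated query: the string holds PHP code that runs find()/aggregate()
        // and sets $rows and $cols in this scope.
        eval($sql);
        return array('rows' => $rows, 'cols' => $cols); // stand-in for a result handle
    }
    // Unchanged query: send the SQL to MySQL as before.
    return mysql_query($sql);
}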

8.7 Application Migration Evaluation

We made some manual modifications to the automatic PHPBB2 and PHPBB3 migrations in order to execute three pages, so we know that the produced PHP is executable and compatible with the internal SQL cache of the PHPBB2 and PHPBB3 applications. The most recurrent issue is with SQL that references non-migrated tables: sometimes the process has mangled some parameters in order to migrate to MongoDB, but this interferes with queries that have to be launched in MySQL, so we try to keep these parameters as unchanged as possible. The non-processed queries are the dynamic queries launched on the first three pages. The list below contains the complete SQL string of each query before its execution; SELECT statements that involve MongoDB collections are marked (shown here as a quoted "SELECT"). index.php: This is the first page; it shows the list of forums. It launches dynamic queries from multiple files. Here is the list of all queries for the non-migrated tables.

index.php Page’s Queries

container_builder.php
SELECT * FROM phpbb_ext WHERE ext_active = 1;

db.h
SELECT config_name, config_value, is_dynamic FROM phpbb_config;

manager.php
SELECT * FROM phpbb_ext;

session.php
SELECT ban_ip, ban_userid, ban_email, ban_exclude, ban_give_reason, ban_end
FROM phpbb_banlist
WHERE ban_email = '' AND (ban_userid = 1 OR ban_ip <> '');

db.h
SELECT config_name, config_value FROM phpbb_config WHERE is_dynamic = 1;
UPDATE phpbb_config SET config_value = '1545993652'
    WHERE config_name = 'rand_seed_last_update';
UPDATE phpbb_config SET config_value = '769b5faf13bdf3bc2f88427b16bc86bb'
    WHERE config_name = 'rand_seed';

functions_display.php
"SELECT" f.* FROM (phpbb_forums f) ORDER BY f.left_id;

SELECT m.*, u.user_colour, g.group_colour, g.group_type
FROM (phpbb_moderator_cache m)
LEFT JOIN phpbb_users u ON (m.user_id = u.user_id)
LEFT JOIN phpbb_groups g ON (m.group_id = g.group_id)
WHERE m.display_on_index = 1;

index.php
SELECT g.group_id, g.group_name, g.group_colour, g.group_type, g.group_legend
FROM phpbb_groups g
LEFT JOIN phpbb_user_group ug ON (g.group_id = ug.group_id
    AND ug.user_id = 1 AND ug.user_pending = 0)
WHERE g.group_legend > 0 AND (g.group_type <> 2 OR ug.user_id = 1)
ORDER BY g.group_legend ASC;

view_forum.php: This is the second page; it shows forum topics. This page has some dynamic queries that use migrated tables.

SQL statements involving MongoDB collections

"SELECT" t.*, f.forum_name
FROM (phpbb_topics t)
LEFT JOIN phpbb_forums f ON (f.forum_id = t.forum_id)
WHERE (t.forum_id = 2 AND t.topic_type = 2)
    OR (t.forum_id IN (1, 2) AND t.topic_type = 3)
ORDER BY t.topic_time DESC;

"SELECT" t.topic_id
FROM (phpbb_topics t)
WHERE t.forum_id = 2
    AND t.topic_type IN (0, 1)
    AND t.topic_visibility = 1
ORDER BY t.topic_type DESC, t.topic_last_post_time DESC, t.topic_last_post_id DESC
LIMIT 25;

"SELECT" t.*
FROM (phpbb_topics t)
WHERE t.topic_id IN (230, 15, 23313, 228, 246, 245, 244, 243, 242, 241, 240, 239,
    238, 237, 236, 235, 234, 233, 232, 231, 229, 227, 226, 225, 224);

view_topic.php: This is the third page; it shows posts. This page has some dynamic queries that use migrated tables; some of its dynamic queries have not been migrated.

view_topic.php Page Queries

db.h
SELECT config_name, config_value FROM phpbb_config WHERE is_dynamic = 1;
UPDATE phpbb_config SET config_value = '1545994756'
    WHERE config_name = 'rand_seed_last_update';
UPDATE phpbb_config SET config_value = 'ecd37bfc704c5ba1b7bf014d2f544646'
    WHERE config_name = 'rand_seed';

SQL statements involving MongoDB collections: viewtopic.php

"SELECT" t.*, f.*
FROM (phpbb_forums f CROSS JOIN phpbb_topics t)
WHERE t.topic_id = 15 AND f.forum_id = t.forum_id;

SQL statements involving MongoDB collections

"SELECT" p.post_id
FROM phpbb_posts p
WHERE p.topic_id = 15 AND p.post_visibility = 1
ORDER BY p.post_time ASC, p.post_id ASC
LIMIT 10;

"SELECT" u.*, z.friend, z.foe, p.*
FROM (phpbb_users u CROSS JOIN phpbb_posts p)
LEFT JOIN phpbb_zebra z ON (z.user_id = 1 AND z.zebra_id = p.poster_id)
WHERE p.post_id IN (803, 804, 805, 806, 807, 7291)
    AND u.user_id = p.poster_id;

"SELECT" l.*, f.*
FROM phpbb_profile_lang l, phpbb_profile_fields f
WHERE l.lang_id = 0
    AND f.field_active = 1
    AND f.field_hide = 0
    AND f.field_no_view = 0
    AND l.field_id = f.field_id
ORDER BY f.field_order;

view_topic.php Page Queries

manager.php
SELECT * FROM phpbb_profile_fields_data WHERE user_id IN (2, 2588);

auth.php
SELECT a.user_id, a.forum_id, a.auth_setting, a.auth_option_id, ao.auth_option
FROM phpbb_acl_users a, phpbb_acl_options ao
WHERE a.auth_role_id = 0
    AND a.auth_option_id = ao.auth_option_id
    AND a.user_id IN (2, 2588)
    AND ao.auth_option = 'u_readpm';

SELECT a.user_id, a.forum_id, r.auth_option_id, r.auth_setting, r.auth_option_id, ao.auth_option
FROM phpbb_acl_users a, phpbb_acl_roles_data r, phpbb_acl_options ao
WHERE a.auth_role_id = r.role_id
    AND r.auth_option_id = ao.auth_option_id
    AND a.user_id IN (2, 2588)
    AND ao.auth_option = 'u_readpm';

SELECT ug.user_id, a.forum_id, a.auth_setting, a.auth_option_id, ao.auth_option
FROM phpbb_acl_groups a, phpbb_user_group ug, phpbb_groups g, phpbb_acl_options ao
WHERE a.auth_role_id = 0
    AND a.auth_option_id = ao.auth_option_id
    AND a.group_id = ug.group_id
    AND g.group_id = ug.group_id
    AND ug.user_pending = 0
    AND NOT (ug.group_leader = 1 AND g.group_skip_auth = 1)
    AND ug.user_id IN (2, 2588)
    AND ao.auth_option = 'u_readpm';

view_topic.php Page Queries

functions_user.php
SELECT ban_userid FROM phpbb_banlist
WHERE ban_userid IN (2, 2588) AND ban_exclude <> 1 AND ban_end = 0;

functions.php
SELECT COUNT(DISTINCT s.session_ip) AS num_guests
FROM phpbb_sessions s
WHERE s.session_user_id = 1
    AND s.session_time >= 1545994800
    AND s.session_forum_id = 2;

SELECT s.session_user_id, s.session_ip, s.session_viewonline
FROM phpbb_sessions s
WHERE s.session_time >= 1545994830
    AND s.session_forum_id = 2
    AND s.session_user_id <> 1;

SELECT forum_id FROM phpbb_forums WHERE forum_options & 2 <> 0 LIMIT 1;

8.8 Manual Migration of PHPBB3 Pages

The following section describes the manual changes done to test the migrated pages of the PHPBB3 application. We installed and ran the processed PHPBB3 with the existing rules, with manual patches needed to run the first pages. To run PHPBB3, we made some changes to the sources used earlier to instrument SCARF and PHPBB2, mainly because PHPBB3 has multiple levels of sub-directories and also uses namespaces. Testing viewforums.php, the process handled the PHP files, but this page does not have many SQL queries. We also patched the main functionality of viewtopics.php, which has more queries. The first page shows the list of forums, and the second page shows all topics of each forum. Fig. 8.1 displays the migrated index home page of the PHPBB3 application and Fig. 8.2 displays the migrated PHPBB3 forum page.

Figure 8.1: PHPBB3 Index homepage

The first page in PHPBB3 is index.php, which uses viewforum.php by default. This PHP page has six queries, four of which are migrated automatically, but it also uses queries from other files. We manually migrated the remaining two queries. The following manual tuning has been done in the PHPBB3 pages for the un-migrated parts.

Figure 8.2: PHPBB3 viewforum Page

• Changes in the Backward program for namespaces: PHPBB3 uses namespaces, and this keyword needs to be the first statement in a PHP file.

• Sources in sub-directories: PHPBB3 has sources in sub-directories. The functions InstrumentPage1 and InstrumentPage2 have been changed to include new files with a path relative to the project root.

The files have been installed on an Ubuntu 14.04.4 server at pices.ece.queensu.ca, which is the server hosting the PHPBB3 MongoDB database. Fig. 8.3 displays the migrated view topic page of the PHPBB3 application.

Figure 8.3: PHPBB3 viewtopic Page

We can see, though, that table names passed as parameters affect many queries. Below are two concrete examples where the rules have not been able to identify the query.

Table name as a parameter example: viewforum.php

$sql_array = array(
    'SELECT' => 'COUNT(t.topic_id) AS num_topics',
    'FROM'   => array(TOPICS_TABLE => 't',),
    'WHERE'  => 't.forum_id = ' . $forum_id . '
        AND (t.topic_last_post_time >= ' . $min_post_time . '
            OR t.topic_type = ' . POST_ANNOUNCE . '
            OR t.topic_type = ' . POST_GLOBAL . ')
        AND ' . $phpbb_content_visibility->get_visibility_sql('topic', $forum_id, 't.'),
);
$result = $db->sql_query($db->sql_build_query('SELECT', $sql_array));

Note that sql_build_query is a custom function that builds SQL statements, and get_visibility_sql is a custom function that builds part of an SQL condition.

Table name as a parameter example: viewtopic.php

161 $sql_array = array('SELECT' => 't.*, f.*', 'FROM' => array(FORUMS_TABLE => 'f'),);
162 // The FROM-Order is quite important here, else t.* columns can not be correctly bound
163 if ($post_id)
164 {   $sql_array['SELECT'] .= ', p.post_visibility, p.post_time, p.post_id';
165     $sql_array['FROM'][POSTS_TABLE] = 'p'; }
166 // Topics table need to be the last in the chain
167 $sql_array['FROM'][TOPICS_TABLE] = 't';
168 if ($user->data['is_registered'])
169 {   $sql_array['SELECT'] .= ', tw.notify_status'; $sql_array['LEFT_JOIN'] = array();
170     $sql_array['LEFT_JOIN'][] = array('FROM' => array(TOPICS_WATCH_TABLE => 'tw'),
171         'ON' => 'tw.user_id = ' . $user->data['user_id'] .
172     if ($config['allow_bookmarks'])
173     {   $sql_array['SELECT'] .= ', bm.topic_id as bookmarked';
174         $sql_array['LEFT_JOIN'][] = array('FROM' => array(BOOKMARKS_TABLE => 'bm'),
175             'ON' => 'bm.user_id = ' . $user->data['user_id'] .
176             ' AND t.topic_id = bm.topic_id'); }
177     if ($config['load_db_lastread'])
178     {   $sql_array['SELECT'] .= ', tt.mark_time, ft.mark_time as forum_mark_time';
179         $sql_array['LEFT_JOIN'][] = array('FROM' => array(TOPICS_TRACK_TABLE => 'tt'),
180             'ON' => 'tt.user_id = ' . $user->data['user_id'] .
181             ' AND t.topic_id = tt.topic_id');
182         $sql_array['LEFT_JOIN'][] = array('FROM' => array(FORUMS_TRACK_TABLE => 'ft'),
183             'ON' => 'ft.user_id = ' . $user->data['user_id'] .
184             ' AND t.forum_id = ft.forum_id'); }
185 }
186 if (!$post_id)
187 {   $sql_array['WHERE'] = "t.topic_id = $topic_id"; }
188 else
189 {   $sql_array['WHERE'] = "p.post_id = $post_id AND t.topic_id = p.topic_id"; }
190 $sql_array['WHERE'] .= ' AND f.forum_id = t.forum_id';
191 $sql = $db->sql_build_query('SELECT', $sql_array);
192 $result = $db->sql_query($sql);

8.9 Application Migration Statistics

We have applied our process to SCARF and to both versions of PHPBB. In the case of SCARF, all queries were successfully translated automatically, and the application was manually tested to ensure the same functionality. The current implementation of our approach does not fully migrate the two versions of PHPBB: in PHPBB2, 15% of queries are not fully migrated automatically, and in PHPBB3, 25% of queries are not fully migrated automatically. The un-migrated parts involve PHP language features, used to dynamically assemble queries, that are not fully handled by our backward tracing approach. We finished the migration of the PHPBB3 files manually. In this manual intervention we only migrated the untranslated queries and did not modify the queries that were migrated automatically, which allowed us to test the functionality of these pages; they executed correctly. By changing our rules to handle array elements and object properties, we were able to migrate more SQL definitions to MongoDB actions, as in the examples below. In total, 86 queries in SCARF are processed automatically, with 100% coverage; in PHPBB2, 391 of 460 queries are processed automatically (85% coverage); and in PHPBB3, 1185 of 1572 queries are processed automatically (75% coverage). Table 8.3 shows the migration statistics of the three applications.

Migrated SQL Query to MongoDB of viewforum.php page

40 if (!$forum_id)
41 {   $sql = (((" 
42     // \"SELECT forum_id FROM phpbb_topics WHERE topic_id = $topic_id\"
43     \$rows = \$db->phpbb_topics->find([
44         'topic_id' => intval($topic_id)], ['forum_id' => 1]
45     );
46     \$cols = [\"forum_id\"];
47     ")));
48     track_strlit($sql, "SELECT forum_id
49         FROM phpbb_topics
50         WHERE topic_id = $topic_id");
51     $result = $db->sql_query((($sql)));
52     track_call_local2('sql_query', $sql);
53     $forum_id = (int) $db->sql_fetchfield('forum_id');
54     $db->sql_freeresult($result);

Migrated SQL Query to MongoDB aggregate function of viewtopic.php page

77 // \"SELECT topic_last_post_id AS post_id, topic_id, forum_id
78 //   FROM phpbb_topics WHERE topic_id = $topic_id\"
79 \$rows = \$db->phpbb_topics->aggregate(
80     [['\$project' => ['post_id' => '\$topic_last_post_id', 'topic_id' => 1, 'forum_id' => 1]],
81      ['\$match' => ['topic_id' => intval($topic_id)]]]
82 );
83 \$cols = [\"post_id\", \"topic_id\", \"forum_id\"]")));
84 track_strlit($sql, "SELECT topic_last_post_id as post_id, topic_id, forum_id
85     FROM phpbb_topics
86     WHERE topic_id = $topic_id");
87 }
88 $result = $db->sql_query((($sql)));
89 track_call_local('sql_query', $sql);
90 $row = $db->sql_fetchrow($result);
91 $db->sql_freeresult($result);

Table 8.3: Experiment Results

Application Name    SCARF   PHPBB2   PHPBB3
PHP files           19      71       2906
Modified files      16      49       1561
Total Queries       86      460      1572
Migrated Queries    86      391      1185
Single              64      254      968
Double              22      29       74
Dual                0       36       63
Not changed         0       72       80

8.10 WordPress Analysis

We checked the migration process against the WordPress source code. WordPress also uses mysql_query to launch queries, but it has multiple layers of PHP code between the SQL definition and the SQL execution, which interferes with the backward tracing process used to find the query definition. WordPress's architecture and source code are different from SCARF and PHPBB. The WordPress application has some PHP files that are not parsed by our parsing program. Our program reads and interprets all PHP files and stores them in its internal tree representation in order to process them; if there are changes, a new file is generated from that internal representation. So the program needs to understand all the content of the processed PHP files perfectly well, not only the interesting parts. In addition, WordPress has various layers of indirection. The first run of our process on WordPress only detects queries in the file wp-includes/wpdb.php. These two SQL initialization queries:

• SELECT @@SESSION.sql_mode

• SET SESSION sql_mode=’$modes_str’

And this generic query:

• $query

$query is a PHP variable; it can contain any SQL statement. Sometimes it is used in WordPress to pass an array of parameters instructing other PHP code how to build SQL statements. This last query is the entry point of all WordPress queries: all application queries go through _do_query, which in turn is called from query. Note that query itself makes some changes to the database statements at runtime, and also changes its returned values. For example, if a database query starts with the keyword 'INSERT', it checks how many records have been inserted. This is an SQL keyword, but in MongoDB an insertion will not start with 'INSERT', so it will not work properly with migrated statements. The WordPress source code alters all SQL statements at runtime, after they are built. Furthermore, most (if not all) queries are pre-processed with the prepare function, which performs some transformations on the statement at runtime before executing it; for example, it embeds variable parameters into the statement. Note that this is not an SQL prepared statement: it uses its own format, which is not SQL (see the sketch below). It is interesting to see how the statements are passed to prepare, so the Backward program was modified to search for calls to this function. It has trouble understanding these queries, because all tables are prefixed with wpdb-> or blog_prefix; for the purpose of this test, we removed these strings. The modified sources are stored in the wp folder, and wp_ori holds the originals.
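For context, a hedged sketch of the WordPress-style prepare usage discussed above (wpdb::prepare and get_results are part of the real WordPress API; the query and variable names are illustrative):

// prepare() substitutes sprintf-like placeholders (%d, %s) into the statement;
// it is not a database-side prepared statement.
global $wpdb;
$sql = $wpdb->prepare("SELECT ID FROM {$wpdb->posts} WHERE post_author = %d", $author_id);
$results = $wpdb->get_results($sql);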

The process finds 343 calls. Some of them look like complete queries to be used with prepared statements, like:

Prepared statement query

SELECT ID FROM posts WHERE post_author = %d

But note that this is a proprietary format: %d is not a standard SQL placeholder. There are also other entries containing only parts of statements, like:

SQL statement query

AND postmeta.meta_value = %s

So it seems that prepare is called multiple times while building queries at runtime; it is not only called at the last stage. The current migration process cannot be used as-is to migrate the WordPress application. The WordPress source code for accessing a database is different from SCARF and PHPBB: WordPress does not use standard SQL statements, and it has various levels of indirection between where queries are defined and where they are executed, during which the queries are modified at runtime. To make things more complicated, WordPress also parses the queries that it has previously built dynamically; if they are not SQL statements, the application will not work as expected. If WordPress is to be migrated to MongoDB with static analysis at the source level, it cannot be done with generic rules. Other approaches are needed, such as migrating queries at runtime, which would avoid the above problems, with the counterpart of having to deal with the migration on each query execution.

8.11 Conclusion

In this chapter, we have presented some example use cases of our backward tracing process on the applications under test. We showed the original SQL statements, the backward tracing process markups and the translated MongoDB actions, and we evaluated the application migration phase for the three applications. We have migrated SCARF and both versions of the PHPBB application. In the case of SCARF, all queries were successfully translated, and the application was manually tested to ensure the migrated version had the same functionality; we checked that the migrated application produces the same output with an equivalent database on all pages. The current implementation of our approach does not fully migrate the two versions of PHPBB: in PHPBB2, 15% of queries are not fully migrated, and in PHPBB3, 25% are not. The missing parts are a result of missing elements of the backward markup and of the dynamic nature of queries for which static analysis could not recover a complete sentence. A further limitation of backward tracing is that the query may not exist in the source code at all. Backward tracing is a well-understood problem, and completing it is a matter of dealing with the many language details of PHP. We finished the migration of three web pages of PHPBB3 manually. We only manually migrated the untranslated queries and did not modify the queries that were migrated automatically, which allowed us to test the functionality of the automatically translated queries in these pages (they executed correctly). We logged all SQL queries executed in a PHPBB3 session, testing as many pages as possible by instrumenting PHPBB3 to write all executed queries to a file and then visiting all pages in the web browser, both as an anonymous user and as an identified user. This gave us a list of complete PHPBB3 queries with which to test our procedures and improve our migration rules. In the next chapter, we conclude the thesis, highlight the limitations of our migration framework, and outline opportunities for continued future work.

Chapter 9

Summary and Conclusions

9.1 Summary

For over forty years, relational databases have been the leading model for data storage, retrieval and management. However, due to increasing needs for scalability and performance, alternative systems have been developed, namely NoSQL technology [52]. With the increased interest in NoSQL technology, as well as more use case scenarios, these databases have been more frequently evaluated and compared over the last few years, and it is necessary to determine whether all the possibilities and characteristics of non-relational technology have been revealed. There is much research discussing migration between different relational databases, but much less research dealing with the problem of replacing the data storage of a working system with a new data store whose structure differs from the existing one. Also, there has been no visible effort on migrating application source code from a legacy web application based on relational databases to a NoSQL solution. Our research fills this gap by discussing the migration process to NoSQL in web applications in detail.

In this thesis, we have presented an approach for semi-automating the migration of highly dynamic relational-based (e.g. MySQL) web applications to ones that use document-oriented NoSQL databases such as MongoDB. The approach was tested on three applications: SCARF, PHPBB2, and PHPBB3. We demonstrated the approach on the migration of a subset of the queries of SCARF, a conference management system, and PHPBB3, a bulletin board web application. Our framework consists of several automated phases: a query extraction phase, a query classification phase, a query translation and migration phase, a query optimization phase and an application migration phase. In Chapter 2, we introduce our proposed framework and discuss the literature and related work. Chapter 4 describes the schema and data migration steps with running examples. In Chapter 5, we discuss the manual migration experiments with some examples from our applications under test. In Chapter 6 and Chapter 7, we present the implementation details and the analysis of the query migration and application migration phases of our framework. Chapter 8 illustrates some examples and use cases of the backward tracing process for the SCARF and PHPBB3 applications and evaluates the application migration phase. Chapter 9 concludes our work, addresses its limitations, and outlines some future work.

9.2 Limitations

The approach successfully migrated the applications into fully working systems that use MongoDB NoSQL databases while partially interacting with some non-migrated SQL tables. We conducted an experiment to evaluate our optimization phase, and the evaluation suggested that our optimization was instrumental in improving the performance of the migrated system, with performance almost equivalent to the original non-migrated application. The current implementation of our approach does not fully migrate the two versions of PHPBB. The missing parts are a result of missing elements of the backward markup and of the dynamic nature of queries for which static analysis could not recover a complete sentence. A further limitation of backward tracing is that the query may not exist in the source code. Backward tracing is a well-understood problem, and completing it is a matter of dealing with the many language details of PHP. We automated the extraction of the SQL statements by applying backward tracing and the approach suggested by Alalfi et al. [5]. We also provided a semi-automated way to migrate the application source code to use the translated queries. The suggested framework eliminates or minimizes the effort of rewriting the application code when the back-end data storage system is changed. Further, the proposed transformation framework reduces the effort of maintaining data portability between different database models: it transforms the design into compatible code to support moving between different web applications and database systems. Since the sizes of the test collections are still relatively small, the advantage of MongoDB for these types of collections is not immediately obvious; however, it is clear that our approach does not introduce a significant performance penalty while enabling the use of MongoDB. Good candidate systems for our approach are web applications that are open-source and built using the combination of PHP and MySQL. The proposed framework is applicable to other technologies as well, simply by adding their grammars to the static analysis and instrumentation stages.

Because our approach is based on static and dynamic analysis, we require source code. Our choice of the combination of PHP-based web applications, the MySQL database, and MongoDB is based on the popularity of these technologies. PHP has been the most popular server-side language for years and is likely to remain so for some time. MySQL is likewise the fastest-growing database in the industry, with more than 10 million active installations and 80,000 daily downloads [63]. The approach could be applied to other technologies as well. MongoDB is an open-source document-oriented NoSQL database with community support; with over 15 million downloads and counting, it is the most popular NoSQL database today. The rise in popularity of the JavaScript-based stack means many programmers now prefer MongoDB as their database of choice [60], and MongoDB has grown from being just a JSON data store into the most popular NoSQL database solution. We applied the proposed approach to the PHPBB web application [76]. PHPBB is the world's leading open source forum software. It has a powerful permission system and a number of other key features such as private messaging, search functions, a customizable template and language system, and support for multiple database technologies.

9.3 Future Work

One direction for future work is to integrate the optimization from the query migration into the application migration level. The proposed framework works on PHP-based web applications and on migration from the MySQL relational database to the MongoDB document-oriented NoSQL database; future work includes testing the generality of the framework on other document-oriented databases such as CouchDB [22] or Dynamo [29]. While we have illustrated our approach on the PHP language, our framework and its steps are not specific to any particular language. Extending our automated migration framework to web applications built with other development tools, such as JavaScript, Python, or Perl, by adding their grammars to the static analysis and instrumentation stages of the framework, is another area for future research.

Bibliography

[1] Tauro S. A. Comparative study of the new generation, agile, scale-able, high performance databases. International Journal of Computer Applications, 48(20):888–975, 2013.

[2] Rateb Abu-Hamdeh, James Cordy, and Patrick Martin. Schema translation using structural transformation. In Proceedings of CASCON’94, Toronto, Nov. 1994, pages 202–215, 1994.

[3] Manar H. Alalfi, James R. Cordy, and Thomas R. Dean. SQL2XMI: reverse engineering of UML-ER diagrams from relational database schemas. In Proceedings of the 15th Working Conference on Reverse Engineering, WCRE 2008, 15-18 October 2008, Antwerp, Belgium, pages 187–191, 2008.

[4] Manar H. Alalfi, James R. Cordy, and Thomas R. Dean. WAFA: fine-grained dynamic analysis of web applications. In Proceedings of the 11th IEEE International Symposium on Web Systems Evolution, WSE 2009, 25-26 September 2009, Edmonton, Alberta, Canada, pages 141–150, 2009.

[5] Manar H. Alalfi, James R. Cordy, and Thomas R. Dean. Automating coverage metrics for dynamic web applications. In Proceedings of the 14th IEEE European Conference on Software Maintenance and Re-engineering, 15-18 March 2010, Madrid, Spain, pages 51–60, 2010.

[6] Antlr parser. https://www.antlr.org/.

[7] Apache phoenix. https://phoenix.apache.org/.

[8] P. Rupali Arora and Rinkle Aggarwa. An algorithm for transformation of data from mysql to nosql (mongodb). In International Journal of Advanced Studies in Computer Science and Engineering, IJASCSE, Volume 2, Special Issue 1, 2013.

[9] P. Atzeni, F. Bugiotti, and L. Rossi. Uniform access to nosql systems. Journal of Information System, Science Direct., 1999(673), 2013.

[10] Cristina Bazar and Cosmin Sebastian Iosif. The transition from rdbms to nosql, a comparative analysis of three popular non-relational solutions: Cassandra, mongodb and couchbase. Database Systems Journal, 4(2):49–59, 2014.

[11] Mongodb documentation website. https://www.docs.mongodb.com/bi-connector/master.

[12] G. Booch, J. Rumbaugh, and I. Jacobson. The Unified Modeling Language User Guide. Addison-Wesley, 1999.

[13] MR. Carina. Pentaho Data Integration Beginner’s Guide. Buenos Aires: Packet Publishing Limiteds, 2013.

[14] M. Casters, R. Bouman, and J. Van Dongen. Pentaho Kettle Solutions: Building Open Source ETL solutions with Pentaho Data Integration. John Wiley and Sons, 2010.

[15] R. Cattell. Scalable sql and nosql data stores. In SIGMOD Rec, Vol. 39, Issue No. 4, 012–027, 2011.

[16] R. Cattell. Relational databases, object databases, key-value stores, document stores, and extensible record stores: A comparison. In International Journal of Advance Engineering Sciences and Technologies,Vol. 11, Issue No. 1, 015-030, 2012.

[17] W. C. Chung, H. P. Lin, S. C. Chen, M. F. Jiang, and Y. C. Chung. Jackhare: a framework for sql to nosql translation using map-reduce. In Automated Software Engineering, vol. 21, no. 4, pages 489–508, 2014.

[18] Dzone website. https://dzone.com/articles/comparing-mongodb-amp-mysql.

[19] Popular content management systems. https://websitesetup.org/popular-cms/.

[20] James R. Cordy. The TXL source transformation language. Sci. Comput. Program., 61(3):190–210, 2006.

[21] Couchbase, nosql database technology: Post-relational data management for inter- active software systems. https://info.couchbase.com/.

[22] Couchdb. https://couchdb.apache.org/.

[23] Abadi D. Data management in the cloud: Limitations and opportunities. In IEEE Data Engineering Bulletin, vol 32(1), pages 3–12, 2010.

[24] Dipina Damodaran, Shirin Salim, and Surekha Marium. Performance evaluation of mysql and mongodb databases. International Journal on Cybernetics and Informatics (IJCI), 5(2), 2016.

[25] Database trends. https://scalegrid.io/blog/database-trends-sql-vs-nosql-top-databases-single-vs-multiple-database-use.

[26] Datax data migration. https://datax.exchange/.

[27] Eleni Stroulia and Diego Serrano. From relations to multi-dimensional maps: a sql-to-hbase transformation methodology. In Proceedings of the 26th Annual International Conference on Computer Science and Software Engineering (CASCON '16), Blake Jones (Ed.), IBM Corp., Riverton, NJ, USA, pages 156–165, 2016.

[28] Apache drill website. https://drill.apache.org/.

[29] Amazon dynamodb. https://aws.amazon.com/dynamodb.

[30] Martyn Ellison, Radu Calinescu, and Richard F. Paige. Evaluating cloud database migration options using workload models. Journal of Cloud Computing: Advances, Systems and Applications, Article No. 108, 7(1), 2018.

[31] S. Shu G. Yunhua and Z. Guansheng. Application of nosql databases in web crawling. International Journal of Digital Content Technology and its Applications, 5(6), 2012.

[32] Dieter Gawlick and Zhen Hua Liu. Management of flexible schema data in rdbmss opportunities and limitations for nosql. In The Biennial Conference on Innovative Data Systems Research (CIDR), Asilomar, California, USA, pages 4–7, 2015.

[33] Google analytics. https://analytics.google.com/analytics/web/provision//provision.

[34] Dayne Hammes, Hiram Medero, and Harrison Mitchell. Comparison of nosql and sql databases in the cloud. In Proceedings of the Southern Association for Information Systems Conference, Macon, GA, USA, March 21–22, 2014.

[35] J. Han, E. Haihong, G. Le, and J. Du. Correlation aware technique for sql to nosql transformation. In 7th International Conference on Ubi-Media Computing and Workshops, pages 43–46, 2011.

[36] Jen Chun Hsu, Ching Hsien Hsu, Shih Chang Chen, and Yeh Ching Chung. Survey on nosql databases. In International Conference on Pervasive Computing and Applications (ICPCA), pages 363–366. IEEE, 2014.

[37] Chi-Ming Huang. Data migration from relational database to cloud database, January 2012.

[38] Hui-jun Wu, Kai Lu, and Gen Li. Design and implementation of distributed stage db: A high performance distributed key-value database. In Proceedings of the 6th International Asia Conference on Industrial Engineering and Management Innovation, pages 189–198, 2015.

[39] Mongodb documentation manual website. https://docs.mongodb.com/manual/indexes.

[40] Jboss enterprise application platform. https://www.redhat.com/en/technologies/jboss-middleware/application-platform.

[41] Tianyu Jia, Xiaomeng Zhao, Zheng Wang, Dahan Gong, and Guiguang Ding. Model transformation and data migration from relational database to mongodb. In International Congress on Big Data, IEEE Computer Society, 2016.

[42] Jsqlparser website. http://www.jsqlparser.sourceforge.net/.

[43] A. Kanade, A. Gopal, and S. Kanade. A study of normalization and embedding in mongodb. In 2014 IEEE International Advance Computing Conference (IACC), pages 416–421, 2014.

[44] K. Kaur and R. Rani. Modeling and querying data in nosql databases. In IEEE International Conference on Big Data, 2013.

[45] R. Lamllari. Extending a methodology for migration of the database layer to the cloud: Considering relational database schema migration to nosql. Journal of Object Technology, 9(2), 2013.

[46] N. Leavitt. Will nosql databases live up to their promise? IEEE Computer, 43(2):12–14, 2010.

[47] C. Lee and Y. Zheng. Automatic sql-to-nosql schema transformation over the mysql and hbase databases. In International Conference on Consumer Electronics-Taiwan (ICCE-TW), 2015.

[48] C. Lee and Y. Zheng. Sql-to-nosql schema denormalization and migration: A study on content management systems. In IEEE International Conference on Systems, Man, and Cybernetics, 2015.

[49] C. Li. Transforming relational database into hbase: A case study. In 2010 IEEE International Conference on Software Engineering and Service Sciences (ICSESS), pages 683–687, 2010.

[50] Xinzheng Li. Defining and visualizing web application slices using design recovery, April 2004.

[51] Yishan Li and S. Manoharan. A performance comparison of sql and nosql databases. In Proceedings of IEEE Pacific Rim Conference on Communications, Computers and Signal Processing (PACRIM), pages 15–19, 2013.

[52] J. R. Lourenco, V. Abramova, M. Vieira, B. Cabral, and J. Bernardino. Nosql databases: A software engineering perspective. In New Contributions in Information Systems and Technologies, volume 353 of Advances in Intelligent Systems and Computing, pages 741–750. Springer, 2015.

[53] Joao Ricardo Lourenco, Bruno Cabral, Paulo Carreiro, Marco Vieira, and Jorge Bernardino. Choosing the right nosql database for the job: a quality attribute evaluation. Journal of Big Data, 2(1):18, 2015.

[54] Lucene. https://lucene.apache.org/.

[55] Macroscope website. http://macroscope.sourceforge.net/.

[56] Alza A. Mahmood. Automated algorithm for data migration from relational to nosql databases. Al-Nahrain Journal for Engineering Sciences (NJES), 21(1):60–65, 2018.

[57] Tim A. Majchrzak, Tobias Jansen, and Herbert Kuchen. Efficiency evaluation of open source etl tools. In Proceedings of the 2011 ACM Symposium on Applied Computing (SAC 2011), pages 287–294, 2011.

[58] Innocent Mapanga and Prudence Kadebu. Database management systems: A nosql analysis. International Journal of Modern Communication Technologies & Research (IJMCTR), 1(7), 2013.

[59] Mongodb manual website. https://docs.mongodb.com/manual/core/aggregation-pipeline-optimization.

[60] Mongodb-popularity. https://techtrendspro.com/security/why-mongodb-is-so-popular.

[61] Unityjdbc website. http://www.unityjdbc.com/mongojdbc/mongosqltranslate.php.

[62] A. B. M. Moniruzzaman and S. A. Hossain. New era of databases for big data analytics: classification, characteristics and comparison. International Journal of Database Theory and Application, 6(4):1–13, 2013.

[63] Mysql ab, mysql market share. http://www.mysql.com/why-mysql/marketshare.

[64] Proxysql website. https://www.proxysql.com/.

[65] Php manual website. https://www.php.net/manual/en/mysqli.quickstart.prepared-statements.php.

[66] W. Naheman and Jianxin Wei. Review of nosql databases and performance testing on hbase. In Proceedings of International Conference on Mechatronic Sciences, Electric Engineering and Computer (MEC), pages 2304–2309, December 2013.

[67] R. P. Padhy, M. R. Patra, and S. C. Satapathy. Rdbms to nosql: reviewing some next-generation non-relational databases. International Journal of Advanced Engineering Sciences and Technologies, 11(1):15–30, 2011.

[68] P. K. Pandya and J. Radhakrishnan. An equational theory for transactions. In FSTTCS 2003: Foundations of Software Technology and Theoretical Computer Science, pages 38–49, 2003.

[69] Z. Parker, S. Poe, and S. Vrbsky. Comparing nosql mongodb to an sql db. In Proceedings of the 51st ACM Southeast Conference (ACMSE ’13), New York, USA, pages 5–10, 2013.

[70] Pentaho features. https://www.hitachi.com/en-us/products/data-management-analytics/pentaho-platform/pentaho-data-integration.html.

[71] Pelica migrator tool. http://www.techgene.com/it-solutions/data-migration.

[72] Pentaho documentation. https://help.pentaho.com/Documentation/Pentaho_Data_Integration.

[73] Pentaho website. https://sourceforge.net/projects/pentaho/files/Data.

[74] The php website. https://secure.php.net/manual/en/mongo.sqltomongo.php.

[75] An introduction to using mongodb with php: Walking through the basics of schema design, php cloud summit 2011. https://spf13.com/presentation/mongodb-php-and-the-cloud-php-cloud-summit-2011/.

[76] Phpbb website. https://www.phpbb.com/.

[77] Php manual website. http://php.net/manual/en/book.pdo.php.

[78] Jaroslav Pokorný. Nosql databases: a step to database scalability in web environment. International Journal of Web Information Systems, 2012.

[79] R. Hecht and S. Jablonski. Nosql evaluation: A use case oriented survey. In International Conference on Cloud and Service Computing, pages 36–41. IEEE, Hong Kong, China, 2011.

[80] R. P. Padhy, M. R. Patra, and S. C. Satapathy. Rdbms to nosql: Reviewing some next-generation non-relational databases. International Journal of Advanced Engineering Science and Technologies, 11(1), 2011.

[81] R. Ramakrishnan and J. Gehrke. Database Management Systems. McGraw Hill, 2000.

[82] Leonardo Rocha et al. A framework for migrating relational datasets to nosql. Procedia Computer Science, 51:2593–2602, 2015.

[83] J. Roijackers and G. H. L. Fletcher. On bridging relational and document-centric data stores. In Big Data: BNCOD 2013, volume 7968 of Lecture Notes in Computer Science. Springer, Berlin, Heidelberg, 2013.

[84] S. Lombardo, E. Di Nitto, and D. Ardagna. Issues in handling complex data structures with nosql databases. In Proceedings of the 14th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC), pages 443–448, 2012.

[85] Scarf - stanford conference and research forum website. http://scarf.sourceforge.net/.

[86] Aaron Schram and Kenneth M. Anderson. Mysql to nosql: Data modeling challenges in supporting scalability. In ACM SPLASH, pages 19–20, 2012.

[87] R. Sellami, S. Bhiri, and B. Defude. ODBAPI: a unified rest api for relational and nosql data stores. In Proceedings of IEEE International Congress on Big Data (BigData Congress), pages 653–660, 2014.

[88] Sigcomm - Association for Computing Machinery's Special Interest Group on Data Communication. https://dl.acm.org/sig/sigcomm.

[89] Google spanner website. https://cloud.google.com/spanner.

[90] sqoop.apache website. https://www.sqoop.apache.org/.

[91] Steinhaus–Johnson–Trotter algorithm. https://www.cut-the-knot.org/Curriculum/Combinatorics/JohnsonTrotter.shtml.

[92] Standard widget toolkit wiki. https://en.wikipedia.org/wiki/Standard_Widget_Toolkit.

[93] Apache tomcat. https://tomcat.apache.org/.

[94] Txl website. https://www.txl.ca/txl-resources.html.

[95] Transformation paradigms. http://txl.ca/docs/TXLtranspara.pdf.

[96] Uma Bhat and Shraddha Jadhav. Moving towards non-relational databases. International Journal of Computer Applications, 1(13):40–46, 2016.

[97] Unix time. https://en.wikipedia.org/wiki/Unix_time.

[98] V. Abramova and J. Bernardino. Nosql databases: Mongodb vs. cassandra. In Proceedings of the International Conference on Computer Science and Software Engineering, ACM, pages 14–22, 2013.

[99] Ionut Voda. Migrating existing php web applications to the cloud. Informatica Economica, 18(4):62–72, 2015.

[100] M. D. Weiser. Program Slices: Formal, Psychological, and Practical Investigations of an Automatic Program Abstraction Method. University of Michigan, Ann Arbor, 1979.

[101] M. Widenius and D. Axmark. Mysql introduction. Linux Journal, 1999(673), 1999.

[102] Wordpress database. https://idfrontend.wordpress.com/tag/database-wordpress/wordpress-database.

[103] Wordpress website. https://wordpress.com/.

[104] Y.-T. Liao, J. Zhou, C.-H. Lu, S.-C. Chen, C.-H. Hsu, W. Chen, M.-F. Jiang, and Y.-C. Chung. Data adapter for querying and transformation between sql and nosql database. Journal of Future Generation Computer Systems, 2016.

[105] Yahoo cloud serving benchmark. https://scalegrid.io/blog/how-to-benchmark-mongodb-with-ycsb/.

[106] Z. Wei-Ping, L. Ming-Xin, and H. Chen. Using mongodb to implement textbook management system instead of mysql. In IEEE 3rd International Conference on Communication Software and Networks (ICCSN), 2011.

[107] Asadulla Khan Zaki. Nosql databases: new millennium database for big data, big users, cloud computing and its security. IJRET: International Journal of Research in Engineering and Technology, 3(3):403–409, May 2014.

[108] G. Zhao, W. Huang, S. Liang, and Y. Tang. Modeling mongodb with relational model. In Fourth International Conference on Emerging Intelligent Data and Web Technologies (EIDWT), pages 115–121. IEEE, 2013.

[109] G. Zhao, Q. Lin, L. Li, and Z. Li. Schema conversion model of sql database to nosql. In 2014 Ninth International Conference on P2P, Parallel, Grid, Cloud and Internet Computing (3PGCIC), pages 355–362. IEEE, 2014.