Masaryk University Faculty of Informatics

Web application for Data Import from XLSX into a Relational Database

Bachelor’s Thesis

Samuel Toman

Brno, Spring 2021

This is where a copy of the official signed thesis assignment and a copy of the Statement of an Author are located in the printed version of the document.

Declaration

Hereby I declare that this paper is my original authorial work, which I have worked out on my own. All sources, references, and literature used or excerpted during elaboration of this work are properly cited and listed in complete reference to the due source.

Samuel Toman

Advisor: Mgr. Luděk Bártek Ph.D.

Acknowledgements

I would like to express gratitude towards my advisor Mgr. Luděk Bártek, Ph.D., for always being available to patiently answer all my questions. Likewise, I would like to thank my consultants JUDr. Ing. František Kasl, Ph.D. and JUDr. Pavel Loutocký, Ph.D., BA.

Abstract

Spreadsheets are often used in office environments due to their user-friendliness coupled with their practicality. The majority of spreadsheet users are non-professional programmers, and as such, keeping them user-friendly remains a high priority. Their intuitiveness comes at a price, however. Due to their design, they are not well suited for storing and querying large, structured data. They are nevertheless often relegated to precisely that role. The conversion process from a spreadsheet to a relational database can often be problematic and requires some level of technical knowledge. The main objective of this thesis is to provide a semi-automatic means of importing spreadsheets into a relational database, easing the process of conversion while still providing enough modularity to design a suitable database schema. The thesis examines existing solutions and addresses their shortcomings in a resulting web application. As part of the thesis, the application was incorporated into an existing system called “CyQualf.”

Keywords

database systems, MySQL, PHP, web application, Docker

Contents

Introduction

1 Data representation in XLSX documents and SQL databases
  1.1 Office Open XML Workbook (XLSX)
    1.1.1 Data representation
  1.2 Relational database
    1.2.1 Relational model

2 Project requirements
  2.1 Functional requirements
    2.1.1 Mapping schema
    2.1.2 HTTP API
  2.2 Non-functional requirements

3 Exploration of existing XLSX to SQL conversion tools
  3.1 Web converters
    3.1.1 SQLizer
    3.1.2 Other web converters
  3.2 Desktop application converters
  3.3 Conclusion
    3.3.1 Missing functionality

4 Technology stack and frameworks
  4.1 PHP language
  4.2 PHP spreadsheet parser
    4.2.1 PhpSpreadsheet
  4.3 JavaScript
    4.3.1 React
  4.4 Docker

5 Implementation and project structure
  5.1 Project structure
  5.2 Server-side
    5.2.1 Server-side file structure
    5.2.2 Parsing the mapping schema
    5.2.3 Mapping relationships
  5.3 Client-side
    5.3.1 Client-side file structure
    5.3.2 Front-end design components

6 Deployment
  6.1 Docker project structure
    6.1.1 Mariadb service
    6.1.2 Adminer service
    6.1.3 Php service
    6.1.4 Server service
    6.1.5 React-frontend service
  6.2 Summary of configuration files

7 Conclusion

Bibliography

A Usage example
  A.1 Running the application
    A.1.1 Service configurations
  A.2 Using the GUI
  A.3 Using the API

B Graphical user interface design

List of Tables

1.1 One-to-many relationship represented in XLSX.
5.1 An example worksheet Employee

List of Figures

4.1 A comparison of download counts (from NPM package manager) of the three most popular JavaScript front-end frameworks/libraries. Downloads measured from April 2019 to April 2021.
5.1 A class diagram of the mapping schema data structure.
5.2 Component decomposition of the webpage GUI.
A.1 A single table of the mapping schema.
A.2 A mapping schema containing two tables.
B.1 The webpage GUI on a desktop-sized screen width.
B.2 The webpage GUI on a smartphone-sized screen width.

Introduction

Spreadsheet programs are often considered a significant factor in the introduction and establishment of personal computers (PCs), since spreadsheets were one of the main use cases for early PCs [1, p. C-177]. VisiCalc, the first spreadsheet application for PCs, was originally released for the Apple II in 1979 and was considered a huge commercial success. It was often referred to as the Apple II’s first “killer app” [2], meaning a program so essential that one would buy a computer just to be able to use it.

Their continued success shows that spreadsheets provide essential services, often considered irreplaceable by their users. However, as is the case with any software, they are not the tool for everything. Spreadsheets work well enough when manipulating or analyzing manageably small data, but they begin to struggle once the data grows sufficiently large. The problems exacerbated by a growing dataset include poor performance, data redundancy, and error proliferation. The structure of spreadsheets does not allow them to easily link and cross-reference data between tables, enforce data integrity rules, or retrieve data using complex queries. All of these are desirable traits for a system maintaining a sizeable or critical dataset. In short, a spreadsheet is not a database; it is not designed for long-term storage of large or essential data.

Thus, a problem of conversion to a proper database emerges. The existing web applications for importing spreadsheets into relational databases are not functionally sufficient to design a relational schema and subsequently map the data into it. The only available option is a simple import of the entire worksheet into the database as-is, without the option of establishing relations. To fully leverage the advantages of the relational model, the imported tables would have to be further processed into a new schema, which can be a difficult procedure. This thesis aims to develop a web application capable of importing spreadsheet data into a database according to a user-defined schema, using a pleasant graphical interface.

The first chapter of the thesis contains a quick overview and comparison of data representation in spreadsheets and relational databases. The second chapter describes the project requirements, detailing what the application should be capable of and how it should behave. The third chapter explores existing solutions, compares both web and desktop variants, and draws a conclusion based on this analysis. The selected technologies and the reasoning for their selection are outlined in the following, fourth chapter. The fifth chapter details the project structure and selected parts of the implementation. The following sixth chapter explains the deployment of the application using Docker, detailing individual Docker services comprising the project. The seventh chapter contains a conclusion while summarizing the thesis. Additionally, two appendices concerning the usage of the application and its graphical design are appended at the end of the thesis.

1 Data representation in XLSX documents and SQL databases

This chapter offers an overview of the XLSX format in comparison to relational databases. It describes the differences in data representation, structure, and functionality between the two paradigms. A short insight into the file structure of XLSX is also given to deepen the understanding of the format.

1.1 Office Open XML Workbook (XLSX)

XLSX is a spreadsheet format designed by Microsoft, introduced together with Microsoft Office 2007 and standardized by Ecma International, ISO¹ and IEC². The format was designed to comply with the Office Open XML specification [3] and served as a successor to the previous proprietary Excel Binary File Format (XLS) used by earlier versions of Microsoft Excel. Since its inception in December 2006, XLSX has become widely supported by most modern spreadsheet programs because it is a standardized format. In contrast to the previous XLS, which is a binary format, XLSX is a ZIP-compressed³ archive containing several XML⁴ files. Compared to its predecessor, XLSX offers a significant file size reduction [4, p. 324]. As a ZIP archive, the file can be unpacked, revealing the underlying structure of the format:

• [Content_Types].xml - Contains references to all XML files included in the package.

• _rels/ - A folder consisting of a single XML file storing package-level relationships.

1. International Organization for Standardization
2. International Electrotechnical Commission
3. ZIP is an archive file format supporting lossless compression.
4. Extensible Markup Language, a format designed to be both human-readable and machine-readable.

Table 1.1: One-to-many relationship represented in XLSX.

Department Tag   Department Name   Employee         Employee Wage
ENG              Engineering       Julian Johnson   35000
ENG              Engineering       Jane Jones       39000
ACC              Accounting        Martin Moore     28000
ACC              Accounting        Larry Lewis      46000

• docProps/ - A folder containing XML files with overall document properties, such as author, last modification date, and metadata about the file’s content.

• xl/ - This is the main folder, branching into further subfolders and XML files. As a whole, it contains the details about the workbook contents and the data itself.
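The package layout listed above can be inspected with any ZIP reader. The following minimal sketch, written in Python purely for illustration, groups the members of such an archive by their top-level entry. The helper names `group_package_parts` and `build_demo_package` are hypothetical, and the demo archive only mimics the layout with empty placeholder files; it is not a valid workbook.

```python
import io
import zipfile
from collections import defaultdict


def group_package_parts(archive):
    """Group the member names of a ZIP archive by their top-level entry."""
    groups = defaultdict(list)
    with zipfile.ZipFile(archive) as package:
        for name in package.namelist():
            top_level = name.split("/", 1)[0]
            groups[top_level].append(name)
    return dict(groups)


def build_demo_package():
    """Build an in-memory ZIP that only mimics the XLSX layout.

    The members are empty placeholders, so this is NOT a valid workbook.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        for name in (
            "[Content_Types].xml",
            "_rels/.rels",
            "docProps/app.xml",
            "docProps/core.xml",
            "xl/workbook.xml",
            "xl/worksheets/sheet1.xml",
        ):
            package.writestr(name, "")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    parts = group_package_parts(build_demo_package())
    for top_level, members in parts.items():
        print(top_level, members)
```

Passing the path of a real XLSX file to `group_package_parts` instead of the demo buffer should list the same four top-level entries, usually with more files below `xl/`.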

1.1.1 Data representation

The data is represented in tabular form, in rows and columns of cells. The entire spreadsheet can contain multiple worksheets, where each worksheet contains its own cells of data. The individual cells can contain various data types and styles, and can be formatted in different ways. Representing relationships in spreadsheets can be problematic. Table 1.1 shows how a one-to-many relationship can be represented between employees and their departments, where a single department can have multiple employees. Maintaining data this way is error-prone, primarily because spreadsheets do not enforce data integrity. For example, nothing prevents a user from deleting the name of the Engineering department from a single row, which results in inconsistent data.

1.2 Relational database

A relational database is a type of database that stores and provides access to data points that are related to one another, thus conforming to the so-called relational model. The data in this type of database can be linked and cross-referenced, creating relationships. This facilitates effective storage and searchability. The database is administered by software called a Relational Database Management System (RDBMS). There are many such systems available, some proprietary and some open-source. A characteristic feature provided by the vast majority of RDBMSs is SQL⁵ or some variation of it. SQL serves as a means of interacting with the database using an English-like syntax. It provides broad functionality for querying and maintaining the database.

1.2.1 Relational model

SQL databases do not implement the relational model perfectly; instead, they try to approximate the theoretical model with minor deviations. The general principles of the model still apply, however. For clarity, the more common SQL terms will be used instead of the relational model terms; the two are treated as interchangeable here. The model organizes data into tables, each consisting of rows and columns, where a unique key identifies each row. The columns are constrained by a domain (or data type).

The model permits a table row to hold a foreign key referencing the primary key of another table row, thus establishing proper relationships between tables. This allows us to represent the earlier example in Table 1.1 as two separate tables, with an Employee table holding references to a Department table, forming a one-to-many relationship. The RDBMS in use ensures the integrity of the data, and each Employee row can only reference an existing Department.
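The one-to-many relationship described above can be sketched with two tables and a foreign key. The example below is only an illustration in Python, using the standard-library SQLite driver as a stand-in for the MySQL/MariaDB systems discussed in this thesis. The table names, column names, and the helper `try_insert_employee` are assumptions made for this sketch.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
# SQLite checks foreign keys only when this pragma is enabled.
connection.execute("PRAGMA foreign_keys = ON")
connection.executescript(
    """
    CREATE TABLE department (
        tag  TEXT PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE employee (
        id             INTEGER PRIMARY KEY,
        name           TEXT NOT NULL,
        wage           INTEGER NOT NULL,
        department_tag TEXT NOT NULL REFERENCES department (tag)
    );
    """
)
connection.executemany(
    "INSERT INTO department (tag, name) VALUES (?, ?)",
    [("ENG", "Engineering"), ("ACC", "Accounting")],
)
connection.executemany(
    "INSERT INTO employee (name, wage, department_tag) VALUES (?, ?, ?)",
    [
        ("Julian Johnson", 35000, "ENG"),
        ("Jane Jones", 39000, "ENG"),
        ("Martin Moore", 28000, "ACC"),
        ("Larry Lewis", 46000, "ACC"),
    ],
)
connection.commit()


def try_insert_employee(name, wage, department_tag):
    """Return True if the row was stored, False if an integrity rule rejected it."""
    try:
        with connection:  # commits on success, rolls back on error
            connection.execute(
                "INSERT INTO employee (name, wage, department_tag) VALUES (?, ?, ?)",
                (name, wage, department_tag),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

In this sketch, calling `try_insert_employee` with a department tag that has no row in `department` is rejected by the database itself, which is the guarantee a spreadsheet cannot give.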

5. Structured Query Language

2 Project requirements

The base requirement was to create a web application developed in PHP, allowing a semi-automatic data conversion from a spreadsheet into a MySQL/MariaDB database. This includes a back-end accessible by an API, handling the data conversion and import into the database, and a front-end allowing the user to intuitively design a database schema.

2.1 Functional requirements

Functional requirements define the functions that the system must implement. They generally describe system behavior under specific conditions. Using plain language, functional requirements define “what” the system should be able to do.

The back-end portion of the application is responsible for the bulk of the functionality regarding the conversion of data and its import into a database. The connection to the target database should be configurable in the back-end by editing database connection details and credentials. Provided a valid mapping schema and XLSX file, the back-end should be able to convert and import the spreadsheet data into the configured database, according to the mapping schema.

The front-end portion of the application is responsible for easing the conversion process by providing a visual representation that acts as a bridge between the user and the back-end functionality. Its primary purpose is to give the user the capacity to create the mapping schema using a graphical interface. It should prompt the user to select an XLSX file, after which it should allow the user to design a mapping schema for the selected spreadsheet. The user should be able to use all properties of the mapping schema using the web page, which includes the following:

• Creating and deleting tables from the schema.

• Changing the names of schema tables.

• Adding and deleting columns from particular tables.

• Changing the names and data types of columns.

• Adding and removing references to other tables.

Once the user has selected an XLSX file and designed a valid schema, the user can import the data into a database using the web page interface.

2.1.1 Mapping schema

The mapping schema is a JSON¹ file describing how the spreadsheet data should be mapped into a database. The schema describes tables to be created in the database if they do not yet exist. The database tables will then be populated with data from the spreadsheet as defined by the mapping schema. The schema must allow for the following:

• Defining tables, which includes specifying columns and references to other tables.

• Changing table and column names.

• Changing column data types.

• Creating one-to-many and many-to-many relationships between tables.

• Defining multiple tables from a single worksheet by mapping different worksheet columns to different tables.

• Specifying the schema for the entire spreadsheet instead of just one worksheet.
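To make the listed capabilities more concrete, the following Python sketch shows what a mapping schema of this kind could look like, together with a small consistency check. Every key name in the JSON document (`tables`, `worksheet`, `columns`, `source`, `references`) and the function `find_schema_problems` are hypothetical and chosen only for illustration; they are not the format used by the application. As a simplification, the check only accepts references to tables defined in the same schema, whereas a fuller check would also accept tables that already exist in the target database.

```python
import json

# Hypothetical mapping schema: all key names are assumptions made for this sketch.
EXAMPLE_SCHEMA = json.loads(
    """
    {
      "tables": [
        {
          "name": "department",
          "worksheet": "Staff",
          "columns": [
            {"name": "tag", "type": "VARCHAR(8)", "source": "A"},
            {"name": "name", "type": "VARCHAR(64)", "source": "B"}
          ],
          "references": []
        },
        {
          "name": "employee",
          "worksheet": "Staff",
          "columns": [
            {"name": "name", "type": "VARCHAR(64)", "source": "C"},
            {"name": "wage", "type": "INT", "source": "D"}
          ],
          "references": ["department"]
        }
      ]
    }
    """
)


def find_schema_problems(schema):
    """Return a list of problems; an empty list means the schema passed the check."""
    problems = []
    tables = schema.get("tables", [])
    names = [table.get("name") for table in tables]
    if len(names) != len(set(names)):
        problems.append("table names must be unique")
    for table in tables:
        if not table.get("columns"):
            problems.append(f"table {table.get('name')!r} has no columns")
        for target in table.get("references", []):
            if target not in names:
                problems.append(
                    f"table {table.get('name')!r} references unknown table {target!r}"
                )
    return problems
```

Both tables in the example draw their columns from the same worksheet, which corresponds to defining multiple tables from a single worksheet by mapping different worksheet columns to different tables.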

1. JavaScript Object Notation is a human-readable format used to store and transmit data.

2.1.2 HTTP API

The application must provide an HTTP API to access the back-end converter. Sending an HTTP POST² request to a specific URI³ imports the spreadsheet into a specified database. The request must be sent along with the schema and the XLSX spreadsheet. If either the schema or the spreadsheet file is missing from the POST request, no changes should be made to the database, and response code 400 shall be returned. Similarly, if the schema is incorrectly formed or the spreadsheet file cannot be parsed or imported into the defined database, response code 400 should be returned. If the import is successful, response code 200 shall be returned.
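The response code rules above can be summarized as a small decision function. The sketch below is written in Python for illustration only and is not the PHP back-end; the names `choose_status_code` and `make_zip_bytes` are hypothetical. It assumes that an incorrectly formed schema can be detected as invalid JSON and that an unparsable spreadsheet can be detected as a file that is not a ZIP archive. The database import itself, which can also fail and lead to a 400 response, is left out.

```python
import io
import json
import zipfile


def choose_status_code(schema_text, spreadsheet_bytes):
    """Return 400 for a missing or unreadable input, otherwise 200.

    Simplification: the actual database import is not performed here.
    """
    if not schema_text or not spreadsheet_bytes:
        return 400  # schema or spreadsheet missing from the request
    try:
        json.loads(schema_text)
    except json.JSONDecodeError:
        return 400  # schema is not well-formed JSON
    if not zipfile.is_zipfile(io.BytesIO(spreadsheet_bytes)):
        return 400  # an XLSX file is a ZIP archive, so this cannot be one
    return 200


def make_zip_bytes():
    """Create the bytes of a tiny ZIP archive standing in for an uploaded file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("xl/workbook.xml", "")
    return buffer.getvalue()
```

Keeping the decision in one function that performs no writes makes it straightforward to satisfy the requirement that a rejected request leaves the database unchanged.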

2.2 Non-functional requirements

Non-functional requirements specify “how” the system should behave and describe the limits of its functionality. Even when not met, they do not impact the system’s basic functionality, though they usually impact the user experience.

The back-end must be implemented using the PHP programming language. Because data from an XLSX file is converted and imported into an SQL database, the possibility of SQL injections⁴ arises. The PHP implementation should be able to prevent SQL injection attempts originating from the spreadsheet data.

The code should be kept clean and readable by adhering to good programming principles, such as consistent and descriptive naming conventions and avoiding repetitiveness and unnecessary complexity. This ensures easier extensibility and maintainability of the code in the future. The code should also provide some level of documentation to aid readability.
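A common way to meet the SQL injection requirement stated above is to pass cell values to the database as bound parameters rather than concatenating them into the SQL text. The following Python sketch illustrates the idea with the standard-library SQLite driver as a stand-in for MySQL/MariaDB; it is not the PHP implementation, and the table and the cell value are invented for this example.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE employee (name TEXT, wage INTEGER)")

# A cell value that would change the statement if it were concatenated into the SQL text.
malicious_cell = "x', 0); DROP TABLE employee; --"

# The placeholders keep the values separate from the SQL text, so the cell
# content is stored as plain data and is never interpreted as SQL.
connection.execute(
    "INSERT INTO employee (name, wage) VALUES (?, ?)",
    (malicious_cell, 35000),
)
connection.commit()

stored_name = connection.execute("SELECT name FROM employee").fetchone()[0]
row_count = connection.execute("SELECT COUNT(*) FROM employee").fetchone()[0]
```

Placeholders of this kind only cover values. Table and column names taken from a user-defined schema cannot be bound as parameters, so they would need separate validation, for example against a whitelist of allowed characters.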

2. POST is a request method defined by HTTP, usually used to submit an entity to the specified resource, often causing a change in state or side effects on the server [5].
3. Uniform Resource Identifier; a unique sequence of characters specifying a resource.
4. SQL injection is a security vulnerability in which malicious SQL statements are inserted for execution.

The primary purpose of a front-end when an API is available is to provide a more accessible and intuitive option of using the application while also retaining the same capabilities as the API, if possible. As such, the graphical interface should be designed to fulfill this goal, providing a capable yet straightforward method of use. To further improve usability, the web page layout should be responsive, meaning it should adjust itself depending on the available screen size and provide a comfortable experience on all device sizes.

3 Exploration of existing XLSX to SQL conversion tools

Several solutions capable of importing data from XLSX format into an SQL database, or at least solutions that partially achieve this goal, were encountered during exploration. This chapter evaluates and compares a selection of them by inspecting their functionality and capabilities. There are dozens of such applications, the majority of them web applications with similar functionality; therefore, this analysis focuses mainly on the more prominent ones and mentions potential unique features.

3.1 Web converters

The format of a web application grants a level of comfort to the user. There is no need to download and install a desktop application on the user’s computer, taking up memory and computing resources. The only requirement for a web application is a working internet connection. With a sufficient connection, it will work independently of the user’s machine and its operating system. Especially for an application that is likely to be single-use for a significant number of users, a web-based solution seems to be optimal.

3.1.1 SQLizer

At the time of writing, SQLizer¹ is among the more prominent online converters, based on the results of an anonymized search engine query. SQLizer allows the user to upload an XLSX file, select a single worksheet to import, and convert it into an SQL script², which can be used to import the data into the user’s database. The tool is free to use on files with under 5000 rows of data and offers a paid version with no limit [6].

SQLizer is capable of converting data from XLSX, CSV, and JSON, though we are only interested in XLSX. Several additional settings related to the conversion from XLSX are available. Some of the more relevant options are:

• The option to designate the first row as a header (which will be ignored during import).

• Selecting which worksheet to convert.

• Converting either the entire worksheet or selecting an area within the worksheet to convert.

• The possibility of naming the resulting SQL table.

The app allows selecting from three database types: MySQL, PostgreSQL, and Microsoft SQL Server.

The UI³ can be characterized as straightforward and simple to use. The available options are either in the form of drop-down lists or checkboxes. The UI is designed to be responsive and scales well based on the current display size. It is comfortably usable on both mobile and desktop. SQLizer also provides a public REST API⁴ with similar functionality to the graphical UI.

SQLizer lacks some functionality that we require. For example, it cannot import multiple worksheets at once, though this can be achieved by importing the worksheets one by one. A more significant drawback is the tool’s inability to create references between imported tables or references to tables already existing in the database.

1. https://sqlizer.io/#/
2. A sequence of SQL commands.

3.1.2 Other web converters

More online converters exist, though they offer mostly comparable functionality with some differences. Not all online converters can be analyzed within the scope of this thesis. The following ones provide extra functionality that others do not.

BeautifyTools⁵ contains a tool for converting XLSX to SQL that is basically identical to SQLizer in base functionality, albeit missing some additional capabilities, such as ignoring the first row of the worksheet as a header. However, it has one unique feature not found in SQLizer or other similar tools as of the time of writing: the capability to delete or update the worksheet data in the database instead of just importing it as other converters do.

RebaseData⁶ contains a broad suite of converters, mostly between different SQL versions. Among its array of available tools, RebaseData offers a conversion tool from XLSX to SQL⁷. Functionality-wise, this tool is identical to SQLizer and BeautifyTools without the extra settings mentioned before. Besides the web application and a public REST API, it offers its services as a PHP library, a Python library, a Java tool, and a Linux command-line tool. Similar to SQLizer, RebaseData offers a free and a paid version.

3. User Interface; in the case of a web application, it is usually the graphical interface, also referred to as GUI.
4. Application programming interface that conforms to REST architectural style constraints.
5. https://beautifytools.com/excel-to-sql-converter.php

3.2 Desktop application converters

A desktop application is an application that runs locally on a single computer and is used to perform a specific task. This type of application has to be installed on the user’s machine, consuming disk space and other resources. One of the main advantages of desktop applications is their lower reliance on a network connection. They can better leverage their host system’s resources, whereas a web application needs to make calls to a server and wait for a response.

SQL Server Import and Export Wizard⁸ is a tool developed by Microsoft and is a part of their SQL Server Integration Services (SSIS). Once the wizard is installed, it does not require an internet connection to function. Compared to its online counterparts, the Wizard has expanded functionality. First off, instead of returning an SQL script, it is capable of connecting to a provided database and automatically creating and populating tables in it. Another handy feature is the ability to filter the spreadsheet data using an SQL query before inserting it into the database. Additionally, the names of columns and tables can be changed, along with column data types, in a simple and intuitive GUI⁹. The entire conversion process is described on a Microsoft tutorial page¹⁰.

Full Convert¹¹ is a powerful database converter. Among the many conversions supported by the program is the conversion from XLSX into SQL. Much like the aforementioned SQL Server Wizard, it allows for modifying tables’ names and the types and names of their columns. The program offers a myriad of extra functionalities, though for our purposes, the most important of them is referencing other tables, thus allowing the program to create a custom database schema during conversion and populate it with selected spreadsheet data.

Full Convert is by far the most complete solution in terms of functionality. Its main drawback, however, is its lack of accessibility. It offers a 30-day free trial period, after which the least expensive variant costs 699 USD per year for a single user.

6. https://www.rebasedata.com/
7. https://www.rebasedata.com/convert-xlsx-to-mysql-online
8. https://docs.microsoft.com/en-us/sql/relational-databases/import-export/import-data-from-excel-to-sql?view=sql-server-ver15#wiz

3.3 Conclusion

The available web converters generally favor high usability and ease of use at the cost of high performance and functionality. They require an internet connection and do not connect directly to the database to insert the data. On the other hand, they do not require an installation, are generally free to use at least to some capacity, and are very straightforward in terms of usability. Desktop converters generally offer a wider range of functionality, though the solutions with sufficient abilities are expensive, complicated, and do not offer their services online.

3.3.1 Missing functionality

The Full Convert desktop application offers a wide range of functionality not supported by the various online converters. Among the most significant missing features are the following:

9. Graphical User Interface.
10. https://docs.microsoft.com/en-us/sql/integration-services/import-export-data/get-started-with-this-simple-example-of-the-import-and-export-wizard?view=sql-server-ver15#heres-the-new-table-of-data-copied-to-sql-server
11. https://www.fullconvert.com/

• Filtering what tables and columns to import from the spreadsheet.

• Naming of the resulting tables that are to be created in the database.

• Naming of columns and selecting column types for the resulting tables.

• Establishing references between tables, thus allowing the user to create a desired relational schema from the spreadsheet data, including one-to-many and many-to-many relationships between both the tables that are being imported and the tables already existing in the database.

As of the time of writing, none of the web converters are capable of the stated features. As such, using the available online tools, the user is only able to import the spreadsheet table as is. In case the spreadsheet contains multiple tables with one-to-many or many-to-many relationships, the existing converters will not be able to break this table into multiple interconnected tables and instead keep the data in a single joined table. The resulting table will hold redundant and duplicate data, and the database will be unable to leverage many of the advantages of the relational model. Furthermore, if several tables are defined in the spreadsheet, and one wishes to create relationships between them, the online tools would not enable it. The references would have to be later added manually in the database, which is a task requiring further technical expertise. A tool handling this process automatically for a defined schema would greatly reduce the difficulty of converting spreadsheets to a relational database.
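The step of breaking a single joined table into interconnected tables can be illustrated on the data from Table 1.1. The following Python sketch is a minimal example of the idea and not the application’s implementation; the function name `split_joined_rows` and the dictionary keys are assumptions made for this illustration.

```python
# Rows of the joined worksheet from Table 1.1:
# (department tag, department name, employee name, wage)
JOINED_ROWS = [
    ("ENG", "Engineering", "Julian Johnson", 35000),
    ("ENG", "Engineering", "Jane Jones", 39000),
    ("ACC", "Accounting", "Martin Moore", 28000),
    ("ACC", "Accounting", "Larry Lewis", 46000),
]


def split_joined_rows(rows):
    """Split joined rows into unique departments and employees that reference them."""
    departments = {}  # tag -> name; each department is stored only once
    employees = []
    for tag, department_name, employee_name, wage in rows:
        departments.setdefault(tag, department_name)
        employees.append(
            {"name": employee_name, "wage": wage, "department_tag": tag}
        )
    return departments, employees


if __name__ == "__main__":
    departments, employees = split_joined_rows(JOINED_ROWS)
    print(departments)
    for employee in employees:
        print(employee)
```

After the split, each department name is stored once and every employee row carries only the department tag, which would serve as the foreign key in the resulting one-to-many relationship.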

4 Technology stack and frameworks

Selecting the appropriate technologies for specific purposes is an often underestimated aspect of developing software. Choosing an incorrect tool for the job, however, can cause significant issues. Different technologies are designed with different goals in mind; picking the one suitable for our goals depends on many factors. The reasoning on why specific tools were selected is given in this chapter.

4.1 PHP language

PHP is a general-purpose scripting language especially suited for web development [7]. Despite its steady decline over the past years, as of April 2021, PHP is still the 9th most popular programming language, according to TIOBE [8]. According to W3Techs, it is used by 79.2% of all websites whose server-side programming language is known [9]. For those reasons, PHP was selected to be used in the back-end of the application to perform the conversion and import of data into an SQL database.

4.2 PHP spreadsheet parser

First introduced with Microsoft Office 2007, XLSX is a format defined in the Office Open XML standard, which has since its introduction become standardized and used by many spreadsheet programs on different platforms. Since the format is actually a ZIP-compressed archive that contains a directory structure of XML text documents, it would be possible to unpack the ZIP archive and parse the XML files using a PHP XML parser extension called SimpleXML¹ to extract the relevant data. Using an XML parser would be unwieldy and require unnecessary programming, however. Upon further exploration, several PHP libraries capable of parsing specifically XLSX files were found. A selection of the most prominent ones is analyzed below, with a short description of each and the reasoning on why the library was selected or not.

1. https://www.php.net/manual/en/book.simplexml.php

Spreadsheet-reader² is a small pure-PHP spreadsheet reader specializing in efficient data extraction, capable of handling large files without running out of memory. Apart from XLSX files, it also supports reading XLS, CSV, and ODS files. Unfortunately, as of the time of writing, development has stopped, and the library is no longer maintained; it is therefore not a viable choice for our purposes.

SimpleXLSX³ is a lightweight pure-PHP spreadsheet reader focused on reading specifically XLSX files. It has the smallest size and number of dependencies of all compared spreadsheet parser libraries. It is currently being maintained with regular commits to the master branch, though it has a relatively small number of contributors and users. The most significant drawbacks of SimpleXLSX in comparison to the selected library are its relatively limited functionality and its lack of comprehensive documentation. The library contains examples of basic usage in its README⁴ file along with a few example PHP scripts sufficient to comprehend the usage, though it is not as extensively documented as the eventually selected library.

4.2.1 PhpSpreadsheet

PhpSpreadsheet⁵ is the library we elected to use in the back-end as an XLSX parser. It is a pure-PHP library for reading and writing spreadsheet files, a popular successor to an older and no longer maintained library called PHPExcel⁶. PhpSpreadsheet is the most heavyweight library of the compared ones both in terms of size and dependencies, though in exchange for its size, it provides extra functionality. PhpSpreadsheet is by far the most popular of the aforementioned parsers, with the greatest number of contributors and regular updates. Even though our application does not currently utilize much of the extra functionality, the library allows for future improvements without the need to switch to a different one. Its popularity and large number of contributors increase the chances of it being supported long-term.

2. https://github.com/nuovo/spreadsheet-reader
3. https://github.com/shuchkin/simplexlsx
4. README is a text file usually contained within a git repository, specifying basic information about the project.
5. https://github.com/PHPOffice/PhpSpreadsheet
6. https://github.com/PHPOffice/PHPExcel

PhpSpreadsheet provides the most extensive documentation, and despite its considerably larger codebase compared to its competitors, it is the easiest to use. The documentation is hosted on a separate domain7 and provides a wide range of tutorials and instructions for various tasks. A separate site for API documentation is also provided8.

4.3 JavaScript

JavaScript is a general-purpose programming language conforming to the ECMAScript9 specification. It is often described as the programming language of the web due to its strong presence in web development, both in front-end and back-end. StackOverflow’s 2020 Developer Survey, with a sample size of nearly 65,000 people, concluded that JavaScript is used by 67.7% of developers and marked it as the most popular language for the eighth year in a row[10]. Additionally, according to W3Techs, it is used as a client-side programming language by 97.2% of all websites[11].

JavaScript’s popularity can partially be attributed to its wide variety of tools and frameworks suited for an array of purposes; this includes several established frameworks and libraries designed for front-end development, such as React.js, Angular.js, Vue.js, and more.

4.3.1 React

React.js is an open-source, front-end JavaScript library, which was chosen to implement the web application front-end. Along with Angular.js and Vue.js, it is one of the three most widely used JavaScript front-end frameworks/libraries, according to StackOverflow’s 2020 Developer Survey[10]. It is consistently the most downloaded package of the three (according to the NPM10 package manager)[12], as shown in Figure 4.1.

React.js is maintained by Facebook and a community of open-source contributors, and since its initial release in 2013, it has enjoyed stable growth and is very likely to be continually supported in the following years.

Figure 4.1: A comparison of download counts (from the NPM package manager) of the three most popular JavaScript front-end frameworks/libraries. Downloads measured from April 2019 to April 2021.

7. https://phpspreadsheet.readthedocs.io/en/latest/
8. https://phpoffice.github.io/PhpSpreadsheet/
9. A programming language, standardized by Ecma International, meant to ensure web page interoperability across different web browsers.
10. A JavaScript package manager.

4.4 Docker

Docker is a tool allowing the developer to create, deploy, and run applications using so-called containers [13]. Docker containers are based on an operating system paradigm called OS-level virtualization, in which the kernel of the operating system allows the existence of multiple isolated user spaces, called containers. Each container comes bundled with its own software and configurations. It can only see its own contents and is isolated from its host system and other containers, providing a layer of security. However, communication channels between the containers themselves or with the host system can be established as needed.

The containers can run with their own environments, defined by the developer and independent of the host system. Therefore, the project can be developed without considering what system the application will ultimately be running on. This significantly eases the development process and the deployment of the system. The developed web application runs in several discrete containers, each providing a service, together creating the desired functionality. The specific structure and deployment of the application using Docker will be detailed in Chapter 6.

5 Implementation and project structure

This chapter details selected parts of the implementation and the development process. The structure of the project and the functionality of some of its components are also described.

5.1 Project structure

The root of the project has the following structure:

• api/ - A directory containing the server-side portion of the web application, implemented using PHP.

• frontend/ - A directory containing the client-side portion of the web application, implemented using the JavaScript library React.js.

• example_files/ - A folder containing example files, including XLSX spreadsheets and corresponding JSON schemas. The file schema_definition.txt contains a description of a schema example.

• docker-compose.yml - A YAML1 configuration file used by Docker to configure the application’s services.

• php.Dockerfile - A text document used by Docker, containing a series of commands used to assemble an image. This particular Dockerfile contains instructions to build an image running a PHP FastCGI Process Manager2, used by the PHP back-end.

• react.Dockerfile - Similar to php.Dockerfile, except responsible for building an image running a React front-end server.

1. YAML is a recursive acronym for YAML Ain’t Markup Language; it is a human-readable language commonly used for configuration files.
2. FastCGI is a variation of the earlier Common Gateway Interface (CGI), a protocol for interfacing external applications to web servers. FastCGI Process Manager (PHP-FPM) provides extra features, such as faster uploads, logging, and more.[14]


• nginx.Dockerfile - A Dockerfile used for building an image running an Nginx web server.

• README.md - A Markdown file describing the project, with examples and explanations of usage.

The structure of the docker-compose.yml file, together with the Dockerfiles, will be further expanded upon in the following chapter, which focuses on the deployment of the application.

5.2 Server-side

As the name suggests, server-side code runs on a server, which awaits requests from clients. The server-side implementation then processes the requests, and a response is sent back to the client.

5.2.1 Server-side file structure

The PHP back-end implementation is entirely contained within the above-mentioned api/ folder. The folder itself contains the following:

• config/ - A directory containing configuration files used mainly by the Docker services running the server and the aforementioned PHP-FPM, as well as a db-credentials.env file containing database credentials. The application uses these credentials to connect to the database that serves as the target of the spreadsheet imports. The user can change the credentials to connect to the desired database. By default, the values are set to connect to a MariaDB database running in a Docker container.

• public/ - A directory holding a single file, import_xlsx.php, which is a PHP script implementing the HTTP API. The script calls classes defined in the source/ folder.

• source/ - This directory holds the main portion of the logic responsible for converting data from XLSX, mapping it onto a user-defined schema, and importing it into a database.


• vendor/ - The application automatically generates this folder when run. It contains the back-end dependencies. If additional PHP libraries are added to the project in future development, this folder needs to be deleted and regenerated for the changes to take effect.

• composer.json - A JSON configuration file for a PHP package manager called Composer. The file defines PHP package dependencies and specifies their versions, ensuring future compatibility.

• composer.lock - Composer package manager automatically generates this file. It serves to lock the project dependencies to a known state when run.

5.2.2 Parsing the mapping schema

The mapping schema is attached to the HTTP POST request either as a JSON file or as a string. The application must parse the provided schema and create an in-memory object representation, which can be used to map XLSX data into a database.

The class responsible for parsing the schema is called SchemaParser. It is an abstract3 class providing two public static4 methods, one for parsing a JSON file and the other for parsing a JSON string. The methods transform the schema into a data structure described by the class diagram in Figure 5.1. The class diagram is simplified for the sake of readability and only shows properties relevant to the representation of the mapping schema. The following classes are utilized:

• SheetSchema - Represents a single worksheet of the spreadsheet document.

• TableSchema - Describes a table, which will be created in the target database.

3. An abstract class cannot be instantiated. 4. A static method belongs to the class itself and can be invoked without the need of instantiating the said class.


Figure 5.1: A class diagram of the mapping schema data structure.

• ColumnSchema - Defines a table column, including its name and data type. The property col specifies which column in the parent worksheet should be mapped onto this table column.

• ReferenceSchema - This class represents a table reference. The property referencedTable holds the name of a table other than the parent table. The parent table will contain a reference to the referenced table.

• ColumnLink - Describes the columns on which to create a link. The property fromSheetCol specifies a column in the parent sheet, and the property toTableCol specifies the column name of the referenced table.
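The data structure described above can be sketched in JavaScript, the language of the application's front-end. This is a minimal illustration and not the thesis implementation: the class names and the properties col, referencedTable, fromSheetCol, and toTableCol come from the text, and the JSON keys are taken from the schema examples in this document, while the remaining property names (title, tables, columns, references, columnLinks) and the function parseSchema are assumptions.

```javascript
// Illustrative classes mirroring the mapping schema structure.
class ColumnLink {
  constructor(fromSheetCol, toTableCol) {
    this.fromSheetCol = fromSheetCol; // column index in the parent worksheet
    this.toTableCol = toTableCol;     // column name in the referenced table
  }
}

class ReferenceSchema {
  constructor(referencedTable, columnLinks = []) {
    this.referencedTable = referencedTable; // name of the other table
    this.columnLinks = columnLinks;         // assumed property name
  }
}

class ColumnSchema {
  constructor(name, col, type) {
    this.name = name; // column name in the database table
    this.col = col;   // worksheet column mapped onto this table column
    this.type = type; // SQL data type, e.g. "NVARCHAR(60)"
  }
}

class TableSchema {
  constructor(title, columns = [], references = []) {
    this.title = title;
    this.columns = columns;
    this.references = references;
  }
}

class SheetSchema {
  constructor(title, tables = []) {
    this.title = title;
    this.tables = tables;
  }
}

// Hypothetical helper: builds the object structure from a schema JSON string.
function parseSchema(jsonString) {
  const raw = JSON.parse(jsonString);
  return (raw.sheets ?? []).map((sheet) => new SheetSchema(
    sheet.title,
    (sheet.tables ?? []).map((table) => new TableSchema(
      table.title,
      (table.columns ?? []).map((c) => new ColumnSchema(c.name, c.col, c.type)),
      (table.references ?? []).map((r) => new ReferenceSchema(
        r.table,
        (r.mapOnColumns ?? []).map(
          (l) => new ColumnLink(l.fromThisSheetCol, l.toOtherTableCol)
        )
      ))
    ))
  ));
}

// Usage with a small schema in the format used throughout this document.
const sheets = parseSchema(JSON.stringify({
  sheets: [{
    title: "Employee",
    tables: [{
      title: "Employee",
      columns: [{ name: "EmployeeName", col: 1, type: "NVARCHAR(60)" }],
      references: [{
        table: "Department",
        mapOnColumns: [{ fromThisSheetCol: 3, toOtherTableCol: "DepartmentTag" }],
      }],
    }],
  }],
}));
console.log(sheets[0].tables[0].references[0].referencedTable);
```

Keeping the parsed structure as plain objects makes it straightforward to serialize back to JSON when a request is sent.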

5.2.3 Mapping relationships

The application allows for establishing one-to-one, one-to-many, and many-to-many relationships between tables. A single worksheet can be broken up into multiple in-database tables, and relationships can be established among them as well as with other tables.

Table 5.1: An example worksheet Employee

Employee Name     Employee Wage     Department
Julian Johnson    35000             ENG
Jane Jones        39000             ENG
Martin Moore      28000             ACC
Larry Lewis       46000             ACC

Consider the following snippet of a mapping schema:

    "references": [
      {
        "table": "Department",
        "mapOnColumns": [
          {
            "fromThisSheetCol": 3,
            "toOtherTableCol": "DepartmentTag"
          }
        ]
      }
    ]

Each table must be constructed from a worksheet. Let us assume the snippet above is set within the context of a worksheet called Employee, specified in Table 5.1. The snippet references a table Department and specifies that the tables should be joined on the third column of the worksheet Employee (which contains department tags) and the Department column DepartmentTag. In each row, a valid department tag will be replaced with a foreign key to the corresponding department in the resulting table. If a corresponding department tag does not exist, a null value will take the place of the foreign key. This creates a one-to-many relationship between the tables.
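The replacement of department tags by foreign keys can be illustrated with a short JavaScript sketch. It is not the back-end implementation (which is written in PHP); the function resolveReference, the id values of the departments, and the extra row with an unmatched tag are hypothetical and serve only to show the described behaviour.

```javascript
// Rows of the Employee worksheet from Table 5.1, plus one hypothetical row
// ("Sam Sample") whose department tag has no match.
const employeeRows = [
  ["Julian Johnson", 35000, "ENG"],
  ["Jane Jones", 39000, "ENG"],
  ["Martin Moore", 28000, "ACC"],
  ["Larry Lewis", 46000, "ACC"],
  ["Sam Sample", 30000, "HR"],
];

// Hypothetical rows of the Department table; the id values are assumed.
const departmentRows = [
  { id: 1, DepartmentTag: "ENG" },
  { id: 2, DepartmentTag: "ACC" },
];

// The column link from the schema snippet (worksheet columns are 1-indexed).
const link = { fromThisSheetCol: 3, toOtherTableCol: "DepartmentTag" };

// Returns copies of the rows in which the linked column holds the id of the
// matching referenced row, or null when no matching row exists.
function resolveReference(rows, referencedRows, columnLink) {
  const idByValue = new Map(
    referencedRows.map((row) => [row[columnLink.toOtherTableCol], row.id])
  );
  const index = columnLink.fromThisSheetCol - 1;
  return rows.map((row) => {
    const resolved = [...row];
    resolved[index] = idByValue.has(row[index]) ? idByValue.get(row[index]) : null;
    return resolved;
  });
}

const resolvedRows = resolveReference(employeeRows, departmentRows, link);
console.log(resolvedRows);
```

Several employees end up pointing at the same department id, which is what makes the relationship one-to-many.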

5.3 Client-side

Client-side code refers to operations running on the user’s machine. In the case of this application, it primarily refers to a front-end developed in React.js. It is responsible for what the user sees and interacts with in the browser, allowing the user to create a mapping schema and send a request to the back-end.

5.3.1 Client-side file structure

The React front-end is contained within the frontend/ directory. The directory has the following structure:

• public/ - This folder is automatically generated when creating a React project using the create-react-app command. Among other files, it contains index., which is the site index template.

• source/ - The directory holding the front-end implementation.

  – components/ - A folder holding React components responsible for the interactive graphical interface. The interface serves to create a mapping schema.
  – entities/ - This directory contains JavaScript classes used for representing the mapping schema created in the graphical interface. When sending a request to the back-end, this data structure is converted into JSON and attached to the request.

• node_modules/ - A folder that is automatically generated when the application is run, containing packages installed by the NPM package manager.

• package.json - A JSON configuration file used by the NPM package manager, listing project dependencies and their versions. The node_modules/ folder is generated based on this file.


5.3.2 Front-end design components

React.js makes use of so-called React components, which allow the UI to be split into small, independent, and reusable pieces. A React component can accept arbitrary inputs and, based on them, returns elements describing what should appear on the page. This allows us to create a hierarchy of components interacting with each other. A component can pass data to its children using a constructor argument props, and a child component can pass data back to its parent using callback functions.

The visualization in Figure 5.2 shows the resulting component hierarchy. Also noteworthy is the ability of SchemaComponent to lay out the individual SchemaTableComponent instances depending on their number and the available screen space, resulting in a responsive design capable of adapting to a range of screen sizes. Figure B.1 shows an example of the UI adapting to a desktop-sized screen, and Figure B.2 shows the UI adjusting to a smartphone-sized screen.
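The flow of data between a parent and its children (inputs passed down, callbacks passed back up) can be illustrated without any library. The following is a plain-JavaScript sketch of the pattern only; it does not use React, and the names SchemaView, TableItem, and onRename are hypothetical and do not correspond to the components of the application.

```javascript
// A child "component": it derives its output from the props it receives and
// reports user actions to the parent through a callback found in props.
function TableItem(props) {
  return {
    text: `Table: ${props.title}`,
    rename: (newTitle) => props.onRename(props.index, newTitle),
  };
}

// A parent "component": it owns the state, passes data down to its children,
// and hands each child a callback for passing data back up.
function SchemaView(state) {
  const onRename = (index, newTitle) => {
    state.titles[index] = newTitle;
  };
  return state.titles.map((title, index) => TableItem({ title, index, onRename }));
}

const state = { titles: ["Department", "Employee"] };
let rendered = SchemaView(state);
rendered[1].rename("Staff");  // the child notifies the parent
rendered = SchemaView(state); // the parent renders again with the new state
console.log(rendered.map((item) => item.text));
```

In React itself, the re-rendering step is triggered by the library whenever a component's state changes, rather than being invoked manually as in this sketch.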


Figure 5.2: Component decomposition of the webpage GUI.

6 Deployment

Because the web application was developed using Docker, the deployment process is straightforward. All the dependencies needed by the application are included in the Docker project and automatically installed during deployment. The application will run regardless of the host’s operating system because each of the application’s services runs in its own Linux container. The only requirement is having Docker, or a Docker alternative, installed on the host machine.

6.1 Docker project structure

The Docker project is built according to the specification in the docker-compose.yml file. The Compose Specification defines the format of the file [15]; this project uses Compose file format version 3. The docker-compose.yml file specifies five services to be created and run, each inside its own container.

According to Docker, volumes are the preferred mechanism for persisting data generated and used by Docker containers [16]. In contrast to bind mounts, volumes are not dependent on the directory structure and the OS of the host machine, allowing us to retain easy portability [17]. Among other benefits, they are easier to migrate and back up than bind mounts, which is why they are used to persist the database described in Subsection 6.1.1.

6.1.1 Mariadb service

The mariadb service provides a MariaDB database running inside a container, which the application can use. The database is created with the credentials specified in mariadb-credentials.env. It is recommended to change the credentials if the database is to be used.

The service is built from the mariadb:latest image; MariaDB is an open-source relational database forked from MySQL. A volume named mysqldata is assigned to the service, providing data persistence to the database when the application is restarted. The database listens on port 3306, but the port is not exposed to the host. Thus, it can only be accessed by other containers in the network.

This service is not necessary for the functioning of the web application and can be omitted if the user elects to use a different database. However, if the service is omitted, connection credentials to another database must be provided in the file api/config/db-credentials.env.

6.1.2 Adminer service

Adminer is a tool for managing database content, authored by the Czech programmer Jakub Vrána [18], which allows the user to view and edit the database through a user-friendly interface. The service is built from the adminer:latest image and is by default exposed on port 8080, though the port can be configured in the .env file by changing the variable ADMINER_PORT.

The adminer service is not necessary for the functioning of the application and can be omitted if not needed. It serves as a means to easily view the database after the XLSX import to verify that the data has been imported according to expectations.

6.1.3 Php service

The php service runs a PHP FastCGI Process Manager (PHP-FPM), which is an improved PHP FastCGI implementation. FastCGI is a binary protocol responsible for interfacing external applications with a web server. The earlier Common Gateway Interface (CGI) protocol created a new process for each request, which was torn down when finished. The high overhead of this approach is addressed by FastCGI, which keeps processes alive to handle a series of requests [19].

The build of this service is described in a separate php.Dockerfile. The base image used, php:fpm, provides PHP-FPM preinstalled, together with all necessary dependencies. Additional packages are installed in the Dockerfile, along with a PHP package manager called Composer, which is used to install the dependencies specified in the composer.json file mentioned in Chapter 5 (Subsection 5.2.1). The entirety of the api/ directory containing the PHP back-end implementation is copied into the container during the build. Finally, the php-fpm command is executed, and the service becomes available for use. Internally, it is accessible on port 9000, though the port is not exposed to the host machine. Since this service works as the PHP interpreter, it is necessary for the application and cannot be omitted.

6.1.4 Server service

The server service runs a popular open-source web server named Nginx, often used in conjunction with PHP-FPM. According to the April 2021 Web Server Survey by Netcraft, it recently became the most used web server among the surveyed websites [20].

Defined in a separate nginx.Dockerfile, the service is built from an nginx:alpine-perl image. Alpine-based image versions are smaller than their regular counterparts. The reason for using a Perl1-enabled image is that Perl is used to read environment variables in the Nginx configuration files. The server internally runs on port 5000, and by default, the port is exposed to the host machine so that API requests can be made to the server, which are automatically forwarded to the PHP-FPM running in the php service. However, the port on the host machine can be remapped by changing the property API_PORT in the file .env. The server service depends on the php service running and is necessary for the basic functioning of the application.

6.1.5 React-frontend service

The service runs a server responsible for providing the application front-end. It is defined in a separate react.Dockerfile, built from a node:alpine base image providing much of the necessary functionality for running the React front-end. Using the Alpine variant results in a smaller image size. During the build, the entirety of the frontend/ directory is copied into the container. The JavaScript package manager NPM, which comes preinstalled on the node image, is used to install the necessary dependencies defined in the package.json file mentioned in Chapter 5 (Subsection 5.3.1).

The server listens on port 3000 by default, but the port can be changed in the file .env by editing the REACT_PORT variable. The service depends on the php and server services to run. It is not necessary for the functioning of the application if the user only plans to use the API without a GUI.

1. A general-purpose programming language.

6.2 Summary of configuration files

The following is a summary of the aforementioned configuration files along with short descriptions:

• .env - An environment file containing the mapping of ports for individual services. The user can change them to suit their needs.

• mariadb-credentials.env - Specifies credentials for a MariaDB database specified in the mariadb service. Relevant only if the mariadb service is used.

• api/config/db-credentials.env - Defines a connection to the target database. Can be different from the connection specified in mariadb-credentials.env.
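To make the format of the port mapping concrete, the following JavaScript sketch parses the contents of an environment file. The variable names ADMINER_PORT, API_PORT, and REACT_PORT and their default values are those mentioned in Section 6.1; the exact layout of the project's .env file and the helper parseEnv are assumptions. Docker Compose reads the file on its own, so this parser is purely illustrative.

```javascript
// Assumed contents of the .env file, using the default ports from Section 6.1.
const envText = `
# ports exposed to the host machine
ADMINER_PORT=8080
API_PORT=5000
REACT_PORT=3000
`;

// Hypothetical helper: turns KEY=VALUE lines into an object, skipping
// blank lines, comments, and lines without an equals sign.
function parseEnv(text) {
  const result = {};
  for (const line of text.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "" || trimmed.startsWith("#")) continue;
    const separator = trimmed.indexOf("=");
    if (separator === -1) continue;
    const key = trimmed.slice(0, separator).trim();
    result[key] = trimmed.slice(separator + 1).trim();
  }
  return result;
}

const ports = parseEnv(envText);
console.log(ports);
```

Changing a value in the file and restarting the services remaps the corresponding port on the host machine.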

7 Conclusion

The aim of the thesis was to create a web application implemented using PHP, capable of a semi-automatic import of data from an XLSX format into a MySQL/MariaDB database. It provides an intuitive way of mapping a spreadsheet into a configured database using a graphical interface or an API. The goal was achieved by creating a Docker application, resulting in easy deployment regardless of the host system’s OS. The application’s functionality is broken up into separate services, which can be configured to suit the user’s needs or even be entirely omitted if not needed.

A mapping schema was developed for the purpose of mapping the spreadsheet into a relational schema. The mapping schema takes the form of a JSON file that can be automatically generated using the application’s front-end or created manually and used with the provided API.

As part of the thesis, existing solutions were explored to gain a perspective on the available tools and find possible shortcomings and flaws, so that they can be avoided. The technology stack was detailed, and the chosen technologies were briefly described in the following chapter to explain why the specific tools were selected. A description of the project structure and implementation follows next. Finally, the deployment of the project is explained, including the specifications and objectives of the individual services of the Docker application.

None of the existing web applications can create a mapping schema for the spreadsheet and directly import it into a database. The only available solution is a complex and expensive desktop application, inaccessible to a regular non-corporate user due to its restrictive price. Therefore, compared to its online alternatives, the developed web application provides a major expansion of functionality while retaining its accessibility.


Bibliography

1. HILL, Charles WL; JONES, Gareth R; SCHILLING, Melissa A. Strategic management: Theory & cases: An integrated approach. Cengage Learning, 2014. ISBN 1285184491.
2. How computing’s first ’killer app’ changed everything [online]. London: BBC, 2019 [visited on 2021-04-07]. Available from: https://www.bbc.com/news/business-47802280.
3. Excel (.xlsx) Extensions to the Office Open XML SpreadsheetML File Format [online]. Microsoft, 2021 [visited on 2021-04-30]. Available from: https://docs.microsoft.com/en-us/openspecs/office_standards/ms-xlsx/2c5dee00-eff2-4b22-92b6-0738acd4475e.
4. FAIRHURST, Danielle Stein. Using Excel for business analysis: A guide to financial modelling fundamentals. John Wiley & Sons, 2015. ISBN 1119062462.
5. HTTP request methods [online]. Mozilla, 2021 [visited on 2021-04-13]. Available from: https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods.
6. The internet’s leading data migration tool [online]. London: SQLizer, 2021 [visited on 2021-04-09]. Available from: https://sqlizer.io/about/.
7. PHP: Hypertext Preprocessor [online]. The PHP Group, 2021 [visited on 2021-04-18]. Available from: https://www.php.net/.
8. TIOBE Index for April 2021 [online]. TIOBE Software BV, 2021 [visited on 2021-04-18]. Available from: https://www.tiobe.com/tiobe-index/.
9. Usage statistics of PHP for websites [online]. Q-Success, 2021 [visited on 2021-04-18]. Available from: https://w3techs.com/technologies/details/pl-php.
10. 2020 Developer Survey [online]. Stack Overflow, 2021 [visited on 2021-04-26]. Available from: https://insights.stackoverflow.com/survey/2020.
11. Usage statistics of JavaScript as client-side programming language on websites [online]. Q-Success, 2021 [visited on 2021-04-26]. Available from: https://w3techs.com/technologies/details/cp-javascript.
12. POTTER, John. react vs vue vs @angular/core [online]. 2021 [visited on 2021-04-26]. Available from: https://www.npmtrends.com/react-vs-vue-vs-@angular/core.
13. MERKEL, Dirk. Docker: lightweight linux containers for consistent development and deployment. Linux journal. 2014, vol. 2014, no. 239, p. 2.
14. FastCGI Process Manager (FPM) [online]. The PHP Group, 2021 [visited on 2021-04-18]. Available from: https://www.php.net/manual/en/install.fpm.php.
15. The Compose Specification [online]. 2021 [visited on 2021-04-27]. Available from: https://github.com/compose-spec/compose-spec/blob/master/spec.md.
16. Use volumes [online]. Docker Inc., 2021 [visited on 2021-04-28]. Available from: https://docs.docker.com/storage/volumes/.
17. BHAT, Sathyajith. Understanding Docker Volumes. In: Practical Docker with Python. Apress, Berkeley, CA, 2018, pp. 91–118. ISBN 9781484237847.
18. Adminer [online]. 2021 [visited on 2021-04-27]. Available from: https://github.com/vrana/adminer.
19. FastCGI Specification [online]. Open Market Inc., 1996 [visited on 2021-04-28]. Available from: http://www.mit.edu/~yandros/doc/specs/fcgi-spec.html.
20. April 2021 Web Server Survey [online]. Netcraft Ltd, 2021 [visited on 2021-04-30]. Available from: https://news.netcraft.com/archives/2021/04/30/april-2021-web-server-survey.html.

A Usage example

This appendix shows an example of basic usage of the application, including installation.

A.1 Running the application

Due to the application being designed using Docker, its installation process should be simple. This appendix assumes that Docker is installed on the host machine. If an alternative to Docker is being used, use the corresponding commands. In the root folder containing the file docker-compose.yml, run the following command:

    docker-compose build

The command builds and tags services using their names as defined in docker-compose.yml. The command may take up to a few minutes to finish. After it has finished, run the following:

    docker-compose up [service...]

The command starts the listed services. If no service is explicitly listed, all the application services defined in the compose file are started up. This process might take up to a few minutes when run for the first time. The services can be stopped by sending a SIGINT (CTRL + C) signal.

A.1.1 Service configurations

As mentioned in Chapter 6, not all services are necessary for the functioning of the application. If they are not needed, the user may elect to exclude specific services. The following command starts the most basic configuration, which only offers the API for importing the spreadsheets.

    docker-compose up php server

However, with this configuration, the target database connection has to be provided as described in Section 6.2. If the user wishes to use the GUI, the service react-frontend has to be added to the command. Furthermore, the mariadb service can be added to provide a target database, and adminer can be added to provide an in-browser database inspection tool.

Figure A.1: A single table of the mapping schema.

A.2 Using the GUI

Let us assume we want to import Table 1.1 from a spreadsheet file. First, we need to select the XLSX file and the worksheet that contains the desired table. Since the table contains a header in the first row, we should check the “First Row is Header” checkbox, which ensures that the header is not imported into the database as data. After that is done, clicking the “Add table” button allows us to specify a new table, which will be created from the selected worksheet.

If we wish the table to be imported into the database as-is, this can be achieved using the table defined in Figure A.1. If, instead, we want to split the worksheet table into two separate tables, “Department” and “Employee,” with a one-to-many relationship, the schema defined in Figure A.2 will yield that result.


Figure A.2: A mapping schema containing two tables.

A.3 Using the API

The front-end generates a schema JSON file from the visual representation created by the user. The schema JSON is then sent to the back-end along with the XLSX file using the API. The same API function can be used by the user directly, without the need for the front-end. First, the mapping schema JSON needs to be created manually. The following code corresponds to the schema generated from Figure A.2. Additional examples of schema files can be found in the example_files/ folder.

    {
      "sheets": [
        {
          "title": "WorksheetName",
          "firstRowIsHeader": true,
          "calculateFormulas": false,
          "formatData": false,
          "tables": [
            {
              "title": "Department",
              "columns": [
                {
                  "name": "DepartmentTag",
                  "col": 1,
                  "type": "NVARCHAR(3)"
                },
                {
                  "name": "DepartmentName",
                  "col": 2,
                  "type": "NVARCHAR(60)"
                }
              ],
              "references": []
            },
            {
              "title": "Employee",
              "columns": [
                {
                  "name": "EmployeeName",
                  "col": 3,
                  "type": "NVARCHAR(60)"
                },
                {
                  "name": "EmployeeWage",
                  "col": 4,
                  "type": "INTEGER"
                }
              ],
              "references": [
                {
                  "table": "Department"
                }
              ]
            }
          ]
        }
      ]
    }
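When a schema is written by hand, it can help to check it before sending it to the API. The sketch below shows one possible client-side check in JavaScript. The function validateSchema and the rules it applies (non-empty names and types, positive integer column indices, references pointing only to tables defined in the schema) are assumptions made for illustration; the validation performed by the back-end may differ.

```javascript
// Hypothetical helper: collects human-readable problems found in a parsed
// mapping schema object. An empty result means no problems were detected.
function validateSchema(schema) {
  if (!schema || !Array.isArray(schema.sheets)) {
    return ["'sheets' must be an array"];
  }
  const problems = [];
  // Tables from all sheets, so that references across sheets are accepted.
  const allTables = schema.sheets.flatMap((sheet) => sheet.tables ?? []);
  const knownTitles = new Set(allTables.map((table) => table.title));

  for (const table of allTables) {
    for (const column of table.columns ?? []) {
      if (typeof column.name !== "string" || column.name === "") {
        problems.push(`${table.title}: a column is missing its name`);
      }
      if (!Number.isInteger(column.col) || column.col < 1) {
        problems.push(`${table.title}.${column.name}: 'col' must be a positive integer`);
      }
      if (typeof column.type !== "string" || column.type === "") {
        problems.push(`${table.title}.${column.name}: 'type' must be a non-empty string`);
      }
    }
    for (const reference of table.references ?? []) {
      if (!knownTitles.has(reference.table)) {
        problems.push(`${table.title}: referenced table '${reference.table}' is not defined`);
      }
    }
  }
  return problems;
}

// Usage: a deliberately broken schema with an invalid column index and a
// reference to a table that is not defined anywhere in the schema.
const problems = validateSchema({
  sheets: [{
    title: "WorksheetName",
    tables: [{
      title: "Employee",
      columns: [{ name: "EmployeeName", col: 0, type: "NVARCHAR(60)" }],
      references: [{ table: "Department" }],
    }],
  }],
});
console.log(problems);
```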


Once the schema is created, the API request can be made. By default, the server listens on port 5000; therefore, the API is available at the address http://localhost:5000/import_xlsx.php. The request must be an HTTP POST, and its header must contain the following value:

    Content-Type: multipart/form-data

The body of the request consists of the XLSX file and the schema:

    spreadsheet: file.xlsx
    schemaJson: schema.json

The schemaJson value can either be a JSON file containing the schema or the JSON string itself. Upon sending the request, a response code 200 will be returned if the import was successful. If the import fails, code 400 is returned along with an error message describing what went wrong. The import can take up to several seconds, depending on the size of the spreadsheet and the complexity of the schema.
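The request described above could be issued from JavaScript as follows. This is a sketch under stated assumptions: it relies on the standard FormData, Blob, and fetch APIs (available in browsers and in recent Node.js versions), the helper names buildImportForm and importSpreadsheet are hypothetical, and the error message is assumed to be returned as plain text in the response body. The field names spreadsheet and schemaJson and the default address are those given in the text.

```javascript
// Builds the multipart/form-data body: the XLSX file under "spreadsheet" and
// the schema, passed here as a JSON string, under "schemaJson".
function buildImportForm(xlsxBytes, schema) {
  const form = new FormData();
  const file = new Blob([xlsxBytes], {
    type: "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
  });
  form.append("spreadsheet", file, "file.xlsx");
  form.append("schemaJson", JSON.stringify(schema));
  return form;
}

// Sends the import request. The Content-Type header is not set manually:
// fetch derives it, including the multipart boundary, from the FormData body.
async function importSpreadsheet(
  xlsxBytes,
  schema,
  url = "http://localhost:5000/import_xlsx.php"
) {
  const response = await fetch(url, {
    method: "POST",
    body: buildImportForm(xlsxBytes, schema),
  });
  if (response.status !== 200) {
    // Assumption: the body of a failed response carries the error message.
    throw new Error(`Import failed (${response.status}): ${await response.text()}`);
  }
  return response;
}
```

A call such as importSpreadsheet(bytes, schema) returns a promise that resolves once the server has answered with code 200 and rejects otherwise, which fits the fact that an import can take several seconds.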


B Graphical user interface design

Figure B.1: The webpage GUI on a desktop-sized screen width.


Figure B.2: The webpage GUI on a smartphone-sized screen width.
