Web Application Engineering Data Modeling

Matthew Dailey

Information and Communication Technologies Asian Institute of Technology

Matthew Dailey (ICT-AIT) Web Eng

Readings

Readings for these lecture notes:
- Greenspun, SQL For Web Nerds.
- Fowler, Patterns of Enterprise Application Architecture, Addison-Wesley, 2003.
- Ruby, Copeland, and Thomas, Agile Web Development with Rails, 6th edition, 2020.

These notes contain material © Greenspun, 2006; Fowler, 2003; Ruby, Copeland, and Thomas, 2020.

Outline

1 Introduction

2 SQL basics

3 Useful PostgreSQL features

4 Database normalization

5 Object-relational mapping

6 NoSQL (Mongo)

Introduction

To this day, the RDBMS is the king of data storage. NoSQL databases have important use cases (very large datasets, semi-structured or unstructured data, document-oriented processing), but these aren’t relevant for small and medium-sized applications. We will thus learn (or review, for some of you) how to use the RDBMS as an effective means of persistence for our Web applications. Later, we will take a look at NoSQL databases such as MongoDB. For all practical purposes, a “relational database is a big spreadsheet that several people can update simultaneously” (Greenspun).

SQL basics: Tables

Each table in the database is a spreadsheet with fixed columns, each having a name and a data type. The rows are unordered. Example:

create table mailing_list (
    email varchar(100) not null primary key,
    name varchar(100)
);

The primary key constraint means this column must be unique, and in PostgreSQL causes an index to be created on the column. Indices allow efficient search of one or more columns in a table.

SQL basics: Populating and modifying tables

We use SQL’s insert command to add data to a table:

insert into mailing_list ( name, email ) values ( 'Philip Greenspun', '[email protected]' );

We can add and drop columns:

alter table mailing_list add phone_number varchar(20);
alter table mailing_list drop phone_number;

For queries we use select:

select * from mailing_list;

SQL basics: Many-to-one relationships

Most folks have more than one phone number. Should we put a list in the phone number column? It might work but our data would not be in “normal” form (more on normalization later). For many-to-one relationships we normally use a separate table:

create table phone_numbers (
    email varchar(100) references mailing_list,
    phone_type char(1) check ( phone_type in ( 'W', 'H', 'M', 'F' ) ),
    phone_number varchar(20)
);

The keyword references creates a consistency constraint between the two tables. Try adding a phone number for an email address that is not in the mailing_list table: the insert is rejected. Then insert some valid data into the table.

SQL basics: Joins

A join combines information from more than one table:

select * from mailing_list, phone_numbers;

But we don’t get what we want: we get the cross product of the rows in the two tables. We have to be more selective:

select * from mailing_list, phone_numbers where mailing_list.email = phone_numbers.email;

Other useful commands: delete from mailing_list and update mailing_list.
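
The difference between the two select queries can be mimicked in plain Ruby. This is only a sketch of the idea, not how an RDBMS evaluates a join: the constants MAILING_LIST and PHONE_NUMBERS, the helpers cross_product and join_on_email, and the sample rows are all assumptions made for illustration.

```ruby
# Toy in-memory "tables": each row is a hash, as a SQL API might return it.
# The sample rows are made up for illustration.
MAILING_LIST = [
  { email: "alice@example.com", name: "Alice" },
  { email: "bob@example.com",   name: "Bob" }
].freeze

PHONE_NUMBERS = [
  { email: "alice@example.com", phone_type: "W", phone_number: "555-0100" },
  { email: "alice@example.com", phone_type: "M", phone_number: "555-0101" },
  { email: "bob@example.com",   phone_type: "H", phone_number: "555-0102" }
].freeze

# select * from mailing_list, phone_numbers;
# Every mailing_list row is paired with every phone_numbers row.
def cross_product(left, right)
  left.product(right)
end

# ... where mailing_list.email = phone_numbers.email;
# Keep only the pairs whose email values match.
def join_on_email(left, right)
  cross_product(left, right).select { |l, r| l[:email] == r[:email] }
end

puts "cross product: #{cross_product(MAILING_LIST, PHONE_NUMBERS).size} rows"
join_on_email(MAILING_LIST, PHONE_NUMBERS).each do |person, phone|
  puts "#{person[:name]}: #{phone[:phone_type]} #{phone[:phone_number]}"
end
```

With two people and three phone numbers, the cross product has six pairs, while the join keeps only the three pairs that belong together.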

SQL basics: Data types

We saw a few of SQL’s data types already. Here is a more complete but still partial list, for PostgreSQL:
- Fixed-length strings (char(len))
- Variable-length strings (varchar(len))
- Variable-length strings, no limit on length (text)
- Variable-length binary data (bytea)
- Dates and times (date, time, timestamp)
- Numbers (integer, numeric, real, double precision, serial, others)
- Other more complex, less-used types

SQL basics: Constraints

Values can also be constrained:
- not null
- unique
- primary key
- check
- references

That’s all you need for some simple data modeling!

SQL basics: Keys: natural or surrogate?

- A key is an attribute or group of attributes that uniquely identifies a row of a table.
- Composite keys are made up of more than one attribute.
- Natural keys are attributes in the real world: citizen ID number, etc.
- Surrogate keys are artificial keys introduced into the data model that have no relationship to the real-world entities being modeled.
- Many analysts prefer natural keys because surrogate keys are artificial and unrelated to the business logic.
- But natural keys may be coupled to the business logic and might therefore change when requirements change.
- Most Web application frameworks are easiest to work with when you allow them to define their own surrogate key for every table.

Useful PostgreSQL features: User-defined functions

PostgreSQL provides the PL/pgSQL language for specification of user-defined functions. As a simple example, consider f(x) = 2x:

create or replace function doubleint( x integer ) returns integer as $$
declare
    y integer;
begin
    y := 2 * x;
    return y;
end;
$$ language plpgsql;

PL/pgSQL is installed by default in PostgreSQL 9.0 and later. On older versions, before creating a first PL/pgSQL function in your database, you must use the shell command createlang plpgsql apache (use your database’s name instead of apache). Now queries like select doubleint( 10 ); should work.

Useful PostgreSQL features: Triggers

PL/pgSQL functions returning trigger can be set to execute automatically when a table is changed. Example: automatically create a change log entry every time a student changes projects:

create table project_changes (
    studentid integer references students,
    oldproj integer references projects,
    newproj integer references projects,
    update_timestamp timestamp
);

create or replace function proj_log() returns trigger as $PROC$
begin
    if ( NEW.studentid = OLD.studentid and NEW.projectid <> OLD.projectid ) then
        insert into project_changes ( studentid, oldproj, newproj, update_timestamp )
        values ( NEW.studentid, OLD.projectid, NEW.projectid, current_timestamp );
    end if;
    return NEW;
end;
$PROC$ language plpgsql;

drop trigger if exists proj_log_post on students;
create trigger proj_log_post after update on students
    for each row execute procedure proj_log();

The trigger fires on update only, because OLD has no usable value when a row is inserted.

Database normalization: Introduction

- A normalized database only stores atomic data in a non-redundant form.
- The concept of normal form for relational databases was proposed by E.F. Codd in 1970.
- Normalizing a database means ensuring that all data in every table is atomic and depends only on the primary key for that table.
- Normalization means all dependencies are explicit in the data model. This makes it easier to maintain the database in a consistent state.
- There are many levels of normalization. The most important are first, second, and third normal form.

Database normalization: First normal form

Criteria for first normal form:
- All columns in every table are atomic (nondecomposable).
- Every row of every table has a unique primary key.

Example: conference program committee website:
- Papers are submitted by potential authors.
- Papers are reviewed by committee members (who can also be authors).
- The program chair makes acceptance and rejection decisions based on the reviews.

Papers have an author list, a title, a list of keywords, a link to the PDF submission, a set of reviews, and a decision. Reviews have a single author, a paper being reviewed, comments, and ratings from 1–5 for technical quality, originality, and presentation.

1NF procedure:
- Consider each relation and break non-atomic attributes into separate tables.
- Add the relationships between the tables.
- Determine the primary keys.

For atomicity, we need separate tables for (at least):
- papers
- people
- keywords
- reviews

Relationships:
- Papers to authors: many to many. Requires a new table, papers_authors, relating the two.
- Papers to keywords: many to many. Requires a new table, papers_keywords, relating the two.
- Papers to reviews: one to many. Requires a foreign key reference in reviews.
- People to reviews: one to many. Requires a foreign key reference in reviews.

Keys:
- papers: no natural key. Introduce surrogate paper_id.
- people: no natural key. Introduce surrogate person_id.
- keywords: the keyword itself must be unique, so it is a natural key.
- reviews: the (paper, reviewer) pair is unique. It is a natural (composite) key.

With a unique key for all tables, and only atomic data, our database is in first normal form.
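
The 1NF decomposition for papers, authors, and keywords can be sketched in plain Ruby. This is an illustration under stated assumptions, not the course's schema: RAW_PAPERS, to_first_normal_form, and the sample data are invented here, reviews are left out, and people are matched by name only to keep the sketch short (the notes point out that people have no natural key, so a real system could not rely on that).

```ruby
# A non-atomic record: authors and keywords are lists inside one "row".
# The sample data is invented for illustration.
RAW_PAPERS = [
  { title: "Fast Joins", authors: ["Ann", "Raj"], keywords: ["sql", "joins"] },
  { title: "Lazy ORMs",  authors: ["Raj"],        keywords: ["orm", "sql"] }
].freeze

# Break the lists out into separate tables with surrogate ids, plus join
# tables for the two many-to-many relationships.
def to_first_normal_form(raw_papers)
  tables = { papers: [], people: [], keywords: [],
             papers_authors: [], papers_keywords: [] }

  raw_papers.each_with_index do |raw, i|
    paper_id = i + 1 # surrogate key
    tables[:papers] << { paper_id: paper_id, title: raw[:title] }

    raw[:authors].each do |name|
      # Assumption for the sketch: a name identifies a person.
      person = tables[:people].find { |p| p[:name] == name }
      unless person
        person = { person_id: tables[:people].size + 1, name: name }
        tables[:people] << person
      end
      tables[:papers_authors] << { paper_id: paper_id, person_id: person[:person_id] }
    end

    raw[:keywords].each do |word|
      # The keyword itself is the natural key, so store it only once.
      known = tables[:keywords].any? { |k| k[:keyword] == word }
      tables[:keywords] << { keyword: word } unless known
      tables[:papers_keywords] << { paper_id: paper_id, keyword: word }
    end
  end
  tables
end

to_first_normal_form(RAW_PAPERS).each do |name, rows|
  puts "#{name}: #{rows.inspect}"
end
```

Every value in the resulting tables is atomic, and each many-to-many relationship has become its own table of key pairs.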

Database normalization: Second normal form

Criteria for second normal form:
- The database is in 1NF.
- There should be no columns dependent on only part of a composite key.

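Example: suppose we had a column reviewer_home_page in the reviews table. This would be atomic but redundant, because it depends only on the reviewer, which is just one part of the composite (paper, reviewer) key. It should be moved to the people table.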

Database normalization: Third normal form

Criteria for third normal form:
- The database is in 2NF.
- There should be no columns dependent on non-key columns.

Example: suppose for each review, we have a field originality (an integer between 1 and 5) and originality_desc (“Groundbreaking”, “Novel”, “Somewhat new”, “Minor variation of existing work”, and “Complete ripoff”) describing what the rating means. We can see that originality_desc depends directly on originality, which is not a key for reviews. To achieve 3NF we should move originality_desc into a new table and make originality a foreign key reference to that table.
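
The 3NF fix for the originality rating can be sketched in plain Ruby with a lookup hash standing in for the new table. The names ORIGINALITY_RATINGS, REVIEWS, and with_description are hypothetical, the review rows are invented, and the assignment of labels to the numbers 5 down to 1 is an assumption: the notes list the labels but do not say which number each belongs to.

```ruby
# Lookup table: originality is the key and originality_desc depends only
# on it. The 5-to-1 ordering of the labels is an assumption.
ORIGINALITY_RATINGS = {
  5 => "Groundbreaking",
  4 => "Novel",
  3 => "Somewhat new",
  2 => "Minor variation of existing work",
  1 => "Complete ripoff"
}.freeze

# Reviews store only the rating, which acts as a foreign key into the
# lookup table. The rows are invented for illustration.
REVIEWS = [
  { paper_id: 1, reviewer_id: 7, originality: 4 },
  { paper_id: 1, reviewer_id: 9, originality: 2 }
].freeze

# The equivalent of joining reviews to the ratings table.
def with_description(review)
  review.merge(originality_desc: ORIGINALITY_RATINGS.fetch(review[:originality]))
end

REVIEWS.each { |r| puts with_description(r).inspect }
```

Because each description is stored once, rewording a label is a single change in the lookup table rather than an update to every review row.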

Database normalization: Denormalization

- Normalization simplifies data updates and changes to the data model.
- Normalization leads to more complex queries with many joins. This has implications for performance.
- Databases that are primarily transactional should emphasize normalization.
- Databases that are primarily read-only might use denormalization to improve performance and simplify the queries sent to the RDBMS.
- The preferred denormalization technique is to use indexed views (called materialized views in PostgreSQL).
- If denormalization is done at the data model level, constraints should be used to ensure consistency of the redundant data.

Object-relational mapping: Introduction

Most SQL APIs return an array of hashes (associative arrays) or a similar structure in response to queries. Ruby example: the Sequel database API represents each database row as a hash.

To run the Sequel example that follows on Ubuntu, you need the gems pg and sequel. You’ll also need a username with a password for the database, as the connection is through a network socket. In psql, run alter user username with password 'password'; (note that you don’t put quotes around the username).

Ruby Sequel example (put it in a text file such as db_access.rb and run it from the command line using ruby db_access.rb):

require "sequel"

dbh = Sequel.connect( "postgres://mdailey:password@localhost/wae_students_development")

dbh[:students].each do |row|
  row.keys.each do |key|
    printf "%s: %s ", key, row[key]
  end
  print "\n"
end

In object-oriented analysis and design (OOAD) we normally construct a domain model containing the entities in the business domain. If we are using OOAD and an object-oriented programming language like Ruby or Java, we want to work with objects, not database rows. But what if we are stuck with an RDBMS? The simplest thing to do is to map database rows directly to domain model objects. The Active Record pattern for enterprise applications is one of the simplest approaches to so-called object-relational mapping.

Object-relational mapping: Active Record

Active Record: An object that wraps a row in a database table or view, encapsulates the database access, and adds domain logic on that data.

Fowler (2003), Fig. 3.3

Some popular Active Record implementations:
- Ruby ActiveRecord (decoupled from Rails way back in version 3.0)
- CakePHP
- Castle ActiveRecord (.NET)

There are many others.

What Active Record implementations can do for us:
- Automatically construct an instance of Active Record from a SQL result row.
- Automatically construct a SQL insert from a given instance of Active Record.
- Provide static finder methods via reflection that return Active Record instances.
- Map getters and setters to SQL selects and updates, transforming SQL data types to reasonable native types.
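
The responsibilities listed above can be sketched with a toy base class in plain Ruby. This is not Rails ActiveRecord and issues no SQL: ToyRecord, ToyStudent, and every method name below are hypothetical, and an in-memory array of hashes stands in for the database table.

```ruby
# A toy Active Record base class backed by an in-memory array instead of
# SQL. ToyRecord, ToyStudent and all method names here are hypothetical.
class ToyRecord
  class << self
    # One array of row hashes per subclass stands in for a database table.
    def rows
      @rows ||= []
    end

    # Declare the columns and generate a getter and setter for each.
    def columns(*names)
      @column_names = names
      attr_accessor(*names)
    end

    def column_names
      @column_names
    end

    # Build an instance from a result row (a hash).
    def from_row(row)
      record = new
      record.id = row[:id]
      column_names.each { |c| record.public_send("#{c}=", row[c]) }
      record
    end

    # Finder: the equivalent of "select * from t where id = ?".
    def find(id)
      row = rows.find { |r| r[:id] == id }
      row && from_row(row)
    end

    # Finder on an arbitrary column.
    def find_by(column, value)
      row = rows.find { |r| r[column] == value }
      row && from_row(row)
    end
  end

  attr_accessor :id

  def initialize(attrs = {})
    attrs.each { |k, v| public_send("#{k}=", v) }
  end

  # Turn the object back into a row: insert if new, update otherwise.
  def save
    row = self.class.column_names.map { |c| [c, public_send(c)] }.to_h
    if id.nil?
      self.id = self.class.rows.size + 1 # surrogate key
      self.class.rows << row.merge(id: id)
    else
      self.class.rows.find { |r| r[:id] == id }.merge!(row)
    end
    true
  end
end

class ToyStudent < ToyRecord
  columns :name, :studentid
end

s = ToyStudent.new(name: "Matthew Dailey", studentid: 123456)
s.save
puts ToyStudent.find_by(:name, "Matthew Dailey").studentid
```

The subclass declares only its columns; construction from rows, saving, and finders all come from the base class, which is the convenience the pattern offers when objects and tables line up one to one.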

Some advantages of Active Record:
- It works very well when the domain model and business logic are simple.

Some disadvantages:
- It cannot handle complex mappings from objects to relations.
- It couples the domain logic to the database schema.

Object-relational mapping: Active Record in Rails

Some key features of Rails ActiveRecord:
- Object schema is constructed on the fly from the database schema.
- Transparent lazy fetching.
- Transparent optimistic locking via row versioning.
- Simple support for associations between classes.
- Transaction support.
- Validations.
- Value objects.
- Single table inheritance.

Conventions:
- Each database table has a surrogate primary key, id.
- The model class name is singular and UpperCamelCase (e.g. Student); the table name is the plural form of the class name (e.g. students).
- Foreign key reference names are written classname_id.
- Join tables for many-to-many associations are named for the two tables they join, e.g., projects_students.

Default behavior can be changed as necessary (e.g. assign self.table_name in the model class to use a non-standard table name; older Rails versions used the class method set_table_name).

Object-relational mapping: Active Record in Rails: one-to-many associations

Example: students and their projects. Domain model:

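[Figure: UML class diagram. Student (+studentid: integer, +name: string) and Project (+name: string, +url: string) are joined by an association with role +students (multiplicity *) at the Student end and role +project (multiplicity 1) at the Project end.]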

Corresponding database schema:

After creating the database tables (through a direct admin tool or via Rails migrations), we create the model classes:

app/models/project.rb:

class Project < ActiveRecord::Base
  has_many :students
end

app/models/student.rb:

class Student < ActiveRecord::Base
  belongs_to :project
end

The method calls belongs_to and has_many set up the one-to-many relationship between projects and students. Other methods for associations include has_one and has_and_belongs_to_many.

To thoroughly test your ActiveRecord classes, it’s easiest to work from the console. Try the following for an example:

% rails console
>> s = Student.create
>> s.project = Project.create :name => "Soi Cats and Dogs", :url => "web13.cs.ait.ac.th"
>> s.name = "Matthew Dailey"
>> s.studentid = 123456
>> s.save
>> s = Student.find(1)
>> Project.all
>> s = Student.find_by_name("Matthew Dailey")
>> s = Student.find_by_name_and_studentid( "Matthew Dailey", 123456 )
>> Student.find_by_sql( "select * from students where students.name like 'Matt%'" )

You might want to tail -f log/development.log. Note that new creates an instance in memory only, but create creates an instance and commits it to the database.

Object-relational mapping: Active Record in Rails: many-to-many associations

Now for a many-to-many relationship. Suppose I need to record information about peer evaluations of your projects. We need to set up a many-to-many relationship between students and projects. Since the association has an attribute (the score), we have to create an ActiveRecord model for the join table. (The ActiveRecord method has_and_belongs_to_many may be more convenient if you don’t need any attributes on the association.)

% rails generate model ProjectEvaluation score:integer project:references \
    student:references

In app/models/project_evaluation.rb, add:

class ProjectEvaluation < ActiveRecord::Base
  belongs_to :project
  belongs_to :student
end

To the Student and Project models, add the method call

has_many :project_evaluations

That’s it! From the console, try

s = Student.find(1)
p = Project.find(2)
pe = ProjectEvaluation.create :student => s, :project => p, :score => 3
s.project_evaluations

Lastly, try adding

has_many :evaluators, :through => :project_evaluations, :source => :student

to the Project model (what is the purpose of this?).

Object-relational mapping: Active Record in Rails: transactions

Oftentimes it will be important to group multiple database operations into a single atomic transaction. For example:

% rails generate model Student name:string account_balance:float
% rake db:migrate
% rails console
>> bill = Student.create :name => 'Bill G', :account_balance => 10000000.0
>> matt = Student.create :name => 'Matt D', :account_balance => 100.0
>> bill.account_balance -= 10000
>> matt.account_balance += 10000
>> bill.save
>> matt.save

If an exception occurs while saving the second updated student, Bill G. loses 10,000 baht.

It would be safer to encapsulate both operations in a transaction:

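>> Student.transaction do
?>   bill.save!
?>   matt.save!
>> end

Here save! is used because it raises an exception when a record cannot be saved, whereas save returns false on a validation failure, which would not trigger a rollback.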

If any exception occurs during the transaction, it is rolled back.
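
The all-or-nothing behavior can be sketched in plain Ruby with a toy in-memory store. ToyBank, its methods, and the transfer helper are hypothetical names; the snapshot-and-restore approach is only a stand-in for what the RDBMS does for us, and the failing account name is invented to force an error.

```ruby
# A toy "database" with all-or-nothing transactions. ToyBank and its
# methods are hypothetical; a real RDBMS does this work for us.
class ToyBank
  def initialize(balances)
    @balances = balances.dup
  end

  def balance(name)
    @balances.fetch(name)
  end

  # Run the block; if it raises, restore the state from before the block
  # and re-raise the exception.
  def transaction
    snapshot = @balances.dup
    yield
  rescue StandardError
    @balances = snapshot # roll back
    raise
  end

  def withdraw(name, amount)
    raise ArgumentError, "insufficient funds" if @balances.fetch(name) < amount
    @balances[name] -= amount
  end

  def deposit(name, amount)
    raise KeyError, "no such account: #{name}" unless @balances.key?(name)
    @balances[name] += amount
  end
end

# Both steps succeed together or not at all.
def transfer(bank, from, to, amount)
  bank.transaction do
    bank.withdraw(from, amount)
    bank.deposit(to, amount)
  end
end

bank = ToyBank.new("Bill G" => 10_000_000.0, "Matt D" => 100.0)
transfer(bank, "Bill G", "Matt D", 10_000)
begin
  transfer(bank, "Bill G", "Nobody", 10_000) # deposit fails after withdraw
rescue KeyError => e
  puts "rolled back: #{e.message}"
end
puts bank.balance("Bill G")
```

In the failed transfer the withdrawal has already happened when the deposit raises, yet Bill G's balance afterwards is the same as before the attempt.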

Object-relational mapping: Active Record in Rails: optimistic locking

Transactions, except with strict serializable isolation (the highest level of isolation provided in the SQL standard, which locks data read by any transaction), don’t help with the problem of lost updates. Consider the following code executing concurrently in two threads:

Thread 1                             Thread 2
s = Student.find_by_name "Bill G"    s = Student.find_by_name "Bill G"
s.account_balance += 1000000         s.account_balance += 1000000
s.save                               s.save

What should happen, and what actually happens, with snapshot isolation and serializable isolation? Note that PostgreSQL did not support full serializable isolation before version 9.1; since then, its serializable level is truly serializable, and its repeatable read level provides snapshot isolation.
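
The lost update can be replayed deterministically in plain Ruby, without real threads or a database. This is a sketch under stated assumptions: lost_update_demo is a hypothetical helper, a hash stands in for the students table, the starting balance of 0 is chosen for readability, and the interleaving (both reads before either write) is fixed by hand.

```ruby
# A deterministic replay of the interleaving: both "threads" read before
# either one writes. The hash stands in for the students table.
def lost_update_demo
  table = { "Bill G" => { account_balance: 0 } }

  # Both threads load their own private copy of the row.
  copy1 = table["Bill G"].dup
  copy2 = table["Bill G"].dup

  # Each adds one million to its private copy.
  copy1[:account_balance] += 1_000_000
  copy2[:account_balance] += 1_000_000

  # Each saves; the second write silently overwrites the first.
  table["Bill G"] = copy1
  table["Bill G"] = copy2

  table["Bill G"][:account_balance]
end

puts lost_update_demo
```

Two deposits of one million should leave two million in the account, but the function returns one million: the first update is lost.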

Optimistic locking means we allow concurrent users to perform any action they like but track updates to the database. When one user attempts to update an old version of a record, an exception and transaction rollback should occur. In Rails, optimistic locking can be enabled on any ActiveRecord class by adding a version column to the database table:

alter table students add column lock_version int default 0;

The versions are transparently updated and checked by the ActiveRecord base class, which raises ActiveRecord::StaleObjectError when a stale record is saved. Try the concurrent access scenario again with this change.
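
The version check behind optimistic locking can be sketched in plain Ruby. This illustrates the idea only and is not how Rails implements it internally: VersionedTable, StaleRowError, fetch_copy, and concurrent_deposits are hypothetical names, and a hash stands in for the students table with its lock_version column.

```ruby
# Raised when a save is attempted from a stale copy of a row.
class StaleRowError < StandardError; end

# A toy table with a lock_version column. All names are hypothetical.
class VersionedTable
  def initialize
    @rows = {}
  end

  def insert(key, attrs)
    @rows[key] = attrs.merge(lock_version: 0)
  end

  # Hand out a private copy, as an ORM does when it loads a row.
  def fetch_copy(key)
    @rows.fetch(key).dup
  end

  # Roughly: update ... set ..., lock_version = lock_version + 1
  #          where key = ? and lock_version = ?
  def save(key, copy)
    current = @rows.fetch(key)
    unless current[:lock_version] == copy[:lock_version]
      raise StaleRowError, "#{key} was changed by someone else"
    end
    @rows[key] = copy.merge(lock_version: copy[:lock_version] + 1)
  end
end

# The same interleaving as the two-thread scenario, replayed by hand.
def concurrent_deposits(table)
  copy1 = table.fetch_copy("Bill G")
  copy2 = table.fetch_copy("Bill G")
  copy1[:account_balance] += 1_000_000
  copy2[:account_balance] += 1_000_000
  table.save("Bill G", copy1) # succeeds, version 0 -> 1
  table.save("Bill G", copy2) # stale: still carries version 0
end

table = VersionedTable.new
table.insert("Bill G", { account_balance: 0 })
begin
  concurrent_deposits(table)
rescue StaleRowError => e
  puts "second save rejected: #{e.message}"
end
puts table.fetch_copy("Bill G").inspect
```

The second writer is told about the conflict instead of silently overwriting the first update; what to do next (reload and retry, or report the conflict to the user) is left to the application.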

Object-relational mapping: Active Record in Rails

We’ve covered many of the features of Rails’ implementation of Active Record. There are a few others of note:
- Value objects
- Single-table inheritance
- Polymorphic associations

Object-relational mapping: Data Mapper

Active Record maps directly between database tables and domain objects. Data Mapper is an alternative pattern that decouples the domain model from the database schema.

Data Mapper: A layer of mappers that moves data between objects and a database while keeping them independent of each other and the mapper itself.

Fowler (2003), Fig. 3.4
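
The separation that Data Mapper aims for can be sketched in plain Ruby. This is a minimal illustration, not any particular library's API: Person, PersonMapper, the column names first_name, last_name, and email_address, and the sample data are all assumptions, and an in-memory array stands in for the database table.

```ruby
# Domain object: knows nothing about tables, columns, or SQL.
class Person
  attr_reader :full_name, :email

  def initialize(full_name:, email:)
    @full_name = full_name
    @email = email
  end
end

# Mapper: the only place that knows both the object and the row layout.
# The in-memory array stands in for a table whose columns (first_name,
# last_name, email_address) deliberately differ from the object's fields.
class PersonMapper
  def initialize(rows = [])
    @rows = rows
  end

  # Object -> row.
  def insert(person)
    first, last = person.full_name.split(" ", 2)
    @rows << { first_name: first, last_name: last, email_address: person.email }
    person
  end

  # Row -> object, or nil when nothing matches.
  def find_by_email(email)
    row = @rows.find { |r| r[:email_address] == email }
    row && to_person(row)
  end

  private

  def to_person(row)
    Person.new(full_name: [row[:first_name], row[:last_name]].compact.join(" "),
               email: row[:email_address])
  end
end

mapper = PersonMapper.new
mapper.insert(Person.new(full_name: "Ada Lovelace", email: "ada@example.com"))
puts mapper.find_by_email("ada@example.com").full_name
```

Unlike the Active Record approach, Person has no save or find methods, so the table layout could change without touching the domain class; only the mapper would need updating.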

Data Mapper is widely implemented:
- Hibernate for Java
- MassiveJS and many others for JavaScript
- SQLAlchemy for Python
- DataMapper for Ruby

Even if there is no existing implementation for your preferred environment, it is easy to roll your own, starting small and gradually improving the implementation over time.

NoSQL (Mongo): Introduction

Applications dealing with “big” data:
- High volume: we need to store millions or billions of records.
- High velocity: the data are arriving and need to be processed at a very high rate, such as thousands of records per minute.
- High variety: we have potentially many sources providing data that are structured, unstructured, and semi-structured.

Under these conditions, designing schemas, migrating every time we add a new data source or data format, ensuring consistency, and guaranteeing isolated transactions may all be bottlenecks. A possible solution: throw away your schemas, your consistency rules, and/or your isolated transactions! [Think about where SQL and NoSQL would be best used: a banking application and a Facebook post analysis engine.]

NoSQL (Mongo): Types of NoSQL databases

There are several types of NoSQL databases:
- Key-value: dictionaries wherein values are indexed by a single key
- Document: key-value databases in which the value is a document represented in JSON, XML, etc.
- Wide column: row-oriented tables with dynamic columns
- Graph: data are nodes with edges

MongoDB is probably the most popular NoSQL database. It is document oriented.

NoSQL (Mongo): MongoDB features

Besides simple key-value storage and retrieval, MongoDB adds
- Sharding: distributing the data across multiple machines for high throughput
- Replication, duplication, load balancing for high availability at scale
- Document validations: imposing consistency rules where necessary
- Fine-grained locking: reader and writer locks at the global, database, or collection level to deal with concurrency issues.

NoSQL (Mongo): Quick MongoDB tutorial

To get a feel for MongoDB, first install it:

$ sudo apt install mongodb

Start a shell:

$ mongo
MongoDB shell version v3.6.8
connecting to: mongodb://127.0.0.1:27017
Implicit session: session { "id" : UUID("fca1c52f-ca00-4819-9821-7f9576077b33") }
MongoDB server version: 3.6.8
Server has startup warnings:
...
>

Figure out what db we’re connected to:

> db
test

Switch to the studentdb database:

> use studentdb

Insert some data into a new collection:

> db.projects.insertMany([
... { name: "Soi Cats and Dogs", url: "http://scad.org" },
... { name: "ICT Infosystem", url: "http://ict-info.ait.ac.th" }
... ])
{
    "acknowledged" : true,
    "insertedIds" : [
        ObjectId("612eba9d84616b17e76630a4"),
        ObjectId("612eba9d84616b17e76630a5")
    ]
}
>

Search the collection for a document:

> db.projects.find({name: "Soi Cats and Dogs"})
{ "_id" : ObjectId("612eba9d84616b17e76630a4"), "name" : "Soi Cats and Dogs", "url" : "http://scad.org" }
>

Generally we should avoid references where possible, but when we need one document to refer to another, we can use the _id field:

> var project = db.projects.find({name: "Soi Cats and Dogs"}).next();
> project
{
    "_id" : ObjectId("612eba9d84616b17e76630a4"),
    "name" : "Soi Cats and Dogs",
    "url" : "http://scad.org"
}
> db.students.insertMany([
... { name: "Matt Dailey", studentid: "123456", project_id: project._id },
... { name: "Bishal Khanal", studentid: "123457", project_id: project._id }
... ]);
{
    "acknowledged" : true,
    "insertedIds" : [
        ObjectId("612ecf3a84616b17e76630a6"),
        ObjectId("612ecf3a84616b17e76630a7")
    ]
}

> db.students.find()
{ "_id" : ObjectId("612ecf3a84616b17e76630a6"), "name" : "Matt Dailey", "studentid" : "123456", "project_id" : ObjectId("612eba9d84616b17e76630a4") }
{ "_id" : ObjectId("612ecf3a84616b17e76630a7"), "name" : "Bishal Khanal", "studentid" : "123457", "project_id" : ObjectId("612eba9d84616b17e76630a4") }

Things to note here:
- The shell interprets our input as JavaScript.
- The find() method returns a cursor, i.e., an object that has to be iterated to extract its data.
- The cursor’s next() method returns the next record in the cursor’s underlying collection.

Use code such as

while (cursor.hasNext()) {
    var record = cursor.next();
    printjson(record);
}

to iterate over the query’s results.
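
The cursor protocol and the _id reference lookup can also be mimicked in plain Ruby. This sketch does not talk to MongoDB: ToyCursor and the find helper are hypothetical, arrays of hashes stand in for collections, and small integers replace the ObjectId values purely for readability.

```ruby
# A toy cursor over an array of documents (hashes). ToyCursor is
# hypothetical; it only mimics the hasNext()/next() protocol.
class ToyCursor
  def initialize(documents)
    @documents = documents
    @position = 0
  end

  def has_next?
    @position < @documents.size
  end

  def next
    raise StopIteration, "cursor exhausted" unless has_next?
    doc = @documents[@position]
    @position += 1
    doc
  end
end

# find(collection, query): a cursor over the documents whose fields
# match every key/value pair in the query.
def find(collection, query = {})
  ToyCursor.new(collection.select { |doc| query.all? { |k, v| doc[k] == v } })
end

# Integer ids are a simplification; MongoDB would generate ObjectIds.
PROJECTS = [
  { _id: 1, name: "Soi Cats and Dogs", url: "http://scad.org" },
  { _id: 2, name: "ICT Infosystem", url: "http://ict-info.ait.ac.th" }
].freeze
STUDENTS = [
  { _id: 10, name: "Matt Dailey",   project_id: 1 },
  { _id: 11, name: "Bishal Khanal", project_id: 1 }
].freeze

# Follow the reference by hand: look up the project, then find the
# students whose project_id equals its _id.
project = find(PROJECTS, { name: "Soi Cats and Dogs" }).next
cursor = find(STUDENTS, { project_id: project[:_id] })
while cursor.has_next?
  puts cursor.next[:name]
end
```

The second query plays the role a join plays in SQL: the document store keeps only the _id, and the application resolves the reference with another lookup.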
