CIT 668: System Architecture

Distributed Topics

1. MySQL
2. Concurrency
3. Transactions and ACID
4. Scaling
5. Replication
6. Partitioning
7. Brewer’s CAP Theorem
8. ACID vs. BASE
9. Taxonomy of NoSQL databases

MySQL

MySQL Architecture

MySQL Storage Engines

InnoDB:
– Default storage engine as of MySQL 5.5.
– Supports transactions, hot backups, etc.
– Row-level locking.
– Fast crash recovery.

MyISAM:
– Default storage engine starting with MySQL 3.23.
– Does NOT support transactions.
– Must halt writes before doing a backup.
– Table-level locking.
– Higher performance for read-heavy applications.

MySQL History

Year  Version  Description
1995           First commercial version from MySQL AB corporation in Sweden.
2001  3.23     First open source version that is widely used, supporting full text indexing and replication. Dual licensing: commercial and GPL.
2003  4.0      Better SQL syntax support, incl. UNION. SSL and InnoDB options.
2005  4.1      Better SQL syntax support, incl. subqueries. UTF-8 support.
2006  5.0      Even more SQL: views, triggers, stored procedures.
2008  5.1      First release after Sun purchase of MySQL. Added partitioning.
2010  5.5      First release after Oracle purchase of Sun. InnoDB is default.
2013  5.6      InnoDB full text search support and speed improvements.

MySQL Forks

MariaDB: community-developed, GPL-only fork started by MySQL co-founder Monty Widenius in reaction to the Sun purchase. Used by Wikipedia, Google, etc.
– Backwards compatible (versions 5.1-5.5).
– Version 10: Multi-master, NoSQL storage engines.

Drizzle: fork based on MySQL 6.0 code base, designed to be smaller and faster by removing features.

WebScaleSQL: fork of MySQL 5.6 designed for larger scale databases, started by a consortium that includes LinkedIn and Google.

Concurrency

Race Conditions

A race condition is a bug in which the result of a process depends on the sequence or timing of other events.

Mutual Exclusion

To synchronize access to shared objects, we can use mutual exclusion. Code that uses mutual exclusion to synchronize its execution is called a critical section, which is a section of code such that:
1. Only one thread at a time can execute in the critical section.
2. All other threads must wait to enter the section.
3. When a thread leaves the critical section, another thread can enter.

Critical section requirements

Mutual exclusion
– At most one thread is in the critical section.
Progress
– If a thread is outside the critical section, it cannot prevent another thread from entering the critical section.
Bounded waiting
– If a thread is waiting on the critical section, it will eventually enter the critical section.
Performance
– The cost of entering and leaving the critical section is small with respect to the work done within it.

Locking

A lock is an object that provides two operations:
– acquire(): a thread calls this before entering the critical section.
– release(): a thread calls this after leaving the critical section.
A thread holds the lock between acquire() and release().

Lock Granularity

Table Locks
– If a thread is reading, other readers can use the table.
– If a thread is writing, only it can access the table.
– Low overhead, small number of locks.
– Low concurrency.
Row Locks
– High overhead, large number of locks.
– High concurrency.
– Supported by the InnoDB storage engine in MySQL.

Deadlocks

A deadlock is a situation where two or more actions are waiting for the other to finish, and thus neither ever completes.
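One common way to avoid the deadlock described above is to make every thread acquire its locks in the same global order. The sketch below is illustrative only: the `Account` class, the `transfer` function, and the balances are hypothetical names and values, and Python's `threading.Lock` stands in for the acquire()/release() lock from the Locking slide (the `with` statement calls acquire() on entry and release() on exit).

```python
import threading

class Account:
    """Hypothetical account guarded by its own lock."""
    def __init__(self, account_id, balance):
        self.account_id = account_id
        self.balance = balance
        self.lock = threading.Lock()

def transfer(src, dst, amount):
    if src is dst:
        return  # nothing to move; also avoids locking the same lock twice
    # Acquire locks in ascending account_id order, regardless of transfer
    # direction, so two opposing transfers cannot each hold one lock while
    # waiting for the other.
    first, second = sorted((src, dst), key=lambda acct: acct.account_id)
    with first.lock:          # acquire() ... release()
        with second.lock:
            src.balance -= amount
            dst.balance += amount

a = Account(1, 100)
b = Account(2, 100)

# Run many opposing transfers concurrently.
threads = []
for _ in range(50):
    threads.append(threading.Thread(target=transfer, args=(a, b, 1)))
    threads.append(threading.Thread(target=transfer, args=(b, a, 1)))
for t in threads:
    t.start()
for t in threads:
    t.join()
```

If each thread instead locked `src` first and `dst` second, a transfer from a to b and a transfer from b to a could each hold one lock and wait forever for the other.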

Transactions

The classic problem

Code to withdraw funds from a bank account:

    withdraw(account, amount) {
        balance = get_balance(account);
        balance -= amount;
        put_balance(account, balance);
        return amount;
    }

What happens if you set up automatic bill pay and two withdrawals are made simultaneously? Create a separate thread for each withdrawal, each running the same code.

    Thread 1                                 Thread 2
    withdraw(account, amount) {              withdraw(account, amount) {
        balance = get_balance(account);          balance = get_balance(account);
        balance -= amount;                       balance -= amount;
        put_balance(account, balance);           put_balance(account, balance);
        return amount;                           return amount;
    }                                        }

Interleaved schedules

Execution of the two threads can be interleaved with preemptive scheduling. Execution sequence as seen by the CPU:

    Thread 1:  balance = get_balance(account);
    Thread 1:  balance -= amount;
               [context switch]
    Thread 2:  balance = get_balance(account);
    Thread 2:  balance -= amount;
    Thread 2:  put_balance(account, balance);
               [context switch]
    Thread 1:  put_balance(account, balance);

What’s the account balance after this sequence?

Transactions
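To answer the question posed by the interleaved schedule, it can be replayed one step at a time. This is a single-threaded sketch, not real concurrency: the starting balance of 100, the withdrawal amount of 10, and the account name are assumed values chosen for illustration, and `get_balance` and `put_balance` are stand-ins backed by a dictionary.

```python
# Hypothetical in-memory "database" with an assumed starting balance.
accounts = {"alice": 100}

def get_balance(account):
    return accounts[account]

def put_balance(account, balance):
    accounts[account] = balance

amount = 10  # assumed withdrawal amount for both threads

# Thread 1 runs until it is preempted.
t1_balance = get_balance("alice")   # reads 100
t1_balance -= amount                # local copy is now 90

# Context switch: thread 2 runs to completion.
t2_balance = get_balance("alice")   # still reads 100 (thread 1 has not written)
t2_balance -= amount                # local copy is now 90
put_balance("alice", t2_balance)    # stored balance becomes 90

# Context switch: thread 1 resumes with its stale local copy.
put_balance("alice", t1_balance)    # overwrites 90 with 90

final_balance = get_balance("alice")
total_withdrawn = 2 * amount
```

Under these assumptions 20 is paid out but the stored balance drops by only 10: the second write overwrites the first, which is the lost update that motivates atomic transactions.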

A transaction is a set of actions that are executed atomically. Either all actions are completed or none.
– If an error occurs during a transaction, the database rolls back the actions, so the state of the database is left as it was before the transaction.

SQL Transaction example:

    START TRANSACTION;
    UPDATE account SET balance = balance - amount WHERE id=1;
    UPDATE account SET balance = balance + amount WHERE id=2;
    COMMIT;
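The commit-or-roll-back behaviour can be tried out without a MySQL server. The sketch below substitutes Python's built-in `sqlite3` module for MySQL, so it illustrates the idea rather than MySQL's exact behaviour. The table layout, the `CHECK` constraint, the starting balances, and the `transfer` helper are all assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    " id INTEGER PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"  # no overdrafts
)
conn.executemany(
    "INSERT INTO account (id, balance) VALUES (?, ?)", [(1, 100), (2, 50)]
)
conn.commit()

def transfer(conn, src, dst, amount):
    """Move amount from src to dst atomically; return True on commit."""
    try:
        # "with conn" commits if the block succeeds and rolls back
        # if it raises, so both updates apply or neither does.
        with conn:
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # CHECK constraint failed; the credit was rolled back

ok = transfer(conn, 1, 2, 30)        # succeeds
failed = transfer(conn, 1, 2, 500)   # would overdraw account 1
balances = dict(conn.execute("SELECT id, balance FROM account"))
```

The credit is deliberately issued before the debit so that the failing transfer has already modified account 2 when the error occurs; the unchanged balance afterwards shows the rollback at work.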

ACID Properties

Atomicity: All data modifications within a transaction must happen completely or not at all. No partial transactions can be recorded.
Consistency: A transaction takes the database from one valid state to another. All changes to an instance of data must be reflected in all instances of that data.
Isolation: The effects of a transaction in progress should be visible only to the user performing the transaction until it is completed.
Durability: When a system failure occurs, the data in the DB must be accurate up to the last committed transaction before the failure.

Database scaling

Database Scaling Techniques

[Figure: four scaling configurations]
– Base case: one 1 TPS server with 100 users.
– Scaleup: a 1 TPS system scaled up to a 2 TPS centralized system with 200 users.
– Partitioning: two 1 TPS systems, each with 100 users, exchanging 0 TPS.
– Replication: two 2 TPS systems, each with 100 users, sending 1 TPS to each other.

Distributed System Types

Shared Memory
• All CPUs share memory/disk.
• Scalability limited by memory contention (vertical scaling only).

Shared Disk
• CPUs share storage, not RAM.
• Scalability limited by disk contention (vertical scaling only).

Shared Nothing
• Each CPU has its own RAM and disks.
• Very high (horizontal) scalability since there is no contention for shared resources.

Database replication

Purposes of Replication

Data distribution
– Maintains a copy of the DB at another geographic site to lower latency at that site or for disaster recovery (DR).
Load balancing
– Allows the application to access data on multiple servers.
Backup and recovery
– Backups of the replicated DB can be performed without impacting performance of the original production DB.
High availability
– Application can fail over to the replicated DB.

Replication techniques

Eager (synchronous)
– All replicas updated as part of the original transaction.
– Data is always consistent, but ensuring data consistency can lead to long waits or deadlocks.
Lazy (asynchronous)
– Original transaction completes on one node, then updates are propagated to other nodes as separate transactions.
– Can result in conflicts when transactions modify the same object on different nodes before replicas are updated.
– Must have a reconciliation protocol to resolve conflicts.

Master/Slave Replication

Slave DBs only accept read operations from the application.
– Flickr.com DBs logged 13 SELECTs for each write.
Master DB does all writes.
– No read operations.
– Copies write operations to slave DBs.
– Slave data will be slightly behind master.
– Single point of failure!

Master/Slave Replication
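The division of labour between master and slaves can be sketched as a toy in-memory model. It is a simplification under stated assumptions: the `ReplicatedDB` class and its method names are hypothetical, dictionaries stand in for database servers, and replication is triggered by an explicit `replicate()` call rather than running continuously in the background.

```python
import itertools

class ReplicatedDB:
    """Toy master/slave cluster: writes go to the master, reads to slaves."""

    def __init__(self, n_slaves):
        self.master = {}
        self.slaves = [{} for _ in range(n_slaves)]
        self.pending = []  # writes the master has not yet copied to slaves
        self._next_slave = itertools.cycle(range(n_slaves))

    def write(self, key, value):
        # Only the master accepts writes.
        self.master[key] = value
        self.pending.append((key, value))

    def read(self, key):
        # Reads are spread round-robin across the slaves.
        slave = self.slaves[next(self._next_slave)]
        return slave.get(key)

    def replicate(self):
        # Asynchronous step: copy logged writes, in order, to every slave.
        for key, value in self.pending:
            for slave in self.slaves:
                slave[key] = value
        self.pending.clear()

db = ReplicatedDB(n_slaves=2)
db.write("user:1", "alice")
stale = db.read("user:1")   # slaves have not received the write yet
db.replicate()
fresh = db.read("user:1")   # now visible on the slaves
```

The read issued before `replicate()` returns nothing, which mirrors the point that slave data is slightly behind the master.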

Scales reads, not writes.
– Writes are faster, since the master only does writes, but writes do not scale with the addition of more slaves.
– Good for read-centric applications.
Master is a single point of failure.
– On master failure, manually promote one slave to master, then re-parent the remaining slaves to it.

Dual Master Replication

High reliability
– Two identical copies of DB.
Easy maintenance
– Set only one DB to be active.
– Update inactive server.
– Synchronize.
– Flip to other DB as active one, then update it.
One big problem
– How to handle conflicting changes?

Complex Replication Topologies

Dual Master with Read Replicas

Ring Multimaster Topology with Read Replicas at each site

Pyramid Replication (reduces replication load on master)

MySpace Case Study
• 3000 web servers
• 800 cache servers
• 440 database servers hosting >1000 databases
• Each DB server has four 2-core CPUs + 64GB RAM

https://www.microsoft.com/casestudies/Case_Study_Detail.aspx?casestudyid=4000004532

Database Partitioning

Partitioning

Partitioning is the division of a logical database into independent components, called shards or partitions.

Horizontal partitioning divides the database by rows, with groups of rows stored on different nodes. Horizontal partitioning is highly scalable. A horizontal partition is called a shard.

Vertical partitioning divides the database by columns, with sets of columns stored on different nodes. Vertical partitioning is limited by the number of columns that are accessed independently.

Partitioning Criteria

Range partitioning selects a partition by determining if the partition key is within a certain range.

List partitioning assigns each partition a list of values.

Hash partitioning uses a hash function to determine partition membership.
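The three criteria can be compared side by side as small lookup functions. This is a sketch under assumptions: the range boundaries, the country-code lists, and the partition count are made-up values, and SHA-256 is just one reasonable choice of stable hash function.

```python
import bisect
import hashlib

# Range partitioning: partition 0 holds keys below 1000, partition 1 holds
# 1000-1999, partition 2 holds 2000-2999 (assumed boundaries).
RANGE_UPPER_BOUNDS = [1000, 2000, 3000]

def range_partition(key):
    index = bisect.bisect_right(RANGE_UPPER_BOUNDS, key)
    if index == len(RANGE_UPPER_BOUNDS):
        raise ValueError(f"key {key} is above the highest range")
    return index

# List partitioning: each partition is assigned an explicit list of values
# (assumed country codes).
LIST_PARTITIONS = {
    0: ["US", "CA", "MX"],
    1: ["DE", "FR", "SE"],
}
VALUE_TO_PARTITION = {
    value: partition
    for partition, values in LIST_PARTITIONS.items()
    for value in values
}

def list_partition(key):
    return VALUE_TO_PARTITION[key]  # KeyError if the value is not listed

# Hash partitioning: a stable hash of the key, modulo the partition count.
def hash_partition(key, n_partitions):
    digest = hashlib.sha256(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % n_partitions

range_examples = [range_partition(k) for k in (5, 1000, 2999)]
list_examples = [list_partition(c) for c in ("US", "SE")]
hash_example = hash_partition("user42", 4)
```

A cryptographic digest is used instead of Python's built-in `hash()` because the built-in is randomized per process for strings, which would send the same key to different partitions after a restart.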

Wikipedia Shard Architecture

LiveJournal Sharding Architecture 2007

Sharding Advantages

Faster Queries
– Since each shard has a fraction of the whole DB, queries are faster than they would be on the whole DB.
Higher Write Bandwidth
– Can have a master/slave configuration for each shard, so each shard has its own dedicated master to do writes.
High Scalability
– Can continue to divide DB into more shards to scale out indefinitely.

Sharding Disadvantages

Rebalancing
– To scale, you need to split shards, which can require substantial manual effort and downtime.
– Google and Flickr’s shards auto-rebalance, which requires a mechanism to invalidate references, so underlying data can be moved while in use.
Cross-shard joins are slow
– If a request requires data from multiple shards, it’s slower than accessing a single DB.
– Social networks need to find relationships among as many dimensions and users as possible.
Poor support
– Less mature tools and documentation than replication.

Rebalancing Shards

Adding a Shard + Rebalancing

Efficient Rebalancing: Multi-Range Shards

Adding a Multi-Range Shard

NoSQL

What is NoSQL?

“NoSQL is the term used to designate database management systems that differ from classic relational database management systems in some way. These data stores may not require fixed table schemas, and usually avoid join operations and typically scale horizontally.” -- Wikipedia

Non-relational may be more accurate than NoSQL, as some NoSQL DBs support a subset of SQL.

Why not stick to relational DBs?

Limited scalability
– Master/slave clusters are limited by write bandwidth and have a SPOF.
– Partitioning requires that you rewrite your application to find your data, and does not scale joins.
Availability is more important than consistency
– An RDBMS will make data unavailable until it is consistent on each node.
– In large clusters or if a node is down, the period of unavailability can be too long.

What problem do you have?

What data problem are you trying to solve?
– Fault tolerance
– High availability
– Consistency
– Scalability
Why not use a database that solves all of these problems at once?
– It’s impossible to make such a database!

You Still Need RDBMS

Brewer’s CAP Theorem

Web services can ensure at most 2 of the following 3 properties for any given operation at a time:
– Strong consistency. All clients perceive that a set of operations has occurred completely or not at all.
– Availability. All clients can read or write to some replica of the data, even if some nodes fail.
– Partition tolerance. Operations will complete, even if individual components are unavailable.

Distributed systems must be partition tolerant, so we have to choose between Consistency and Availability. However, a system could have two modes: C+A until a partition is detected, then a mode that keeps either C or A.

Network Partitions

A network partition occurs when a network failure causes a cluster to be divided into two subclusters that are isolated from each other. Partitions may occur within or between data centers.
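One widely used way for a subcluster to decide what to do after such a split is a majority (quorum) rule: a side keeps accepting writes only if it can still reach more than half of the cluster. The sketch below illustrates that rule only. The function names, the return strings, and the `prefer_consistency` flag are hypothetical, and real systems add failure detection and membership tracking on top.

```python
def has_quorum(reachable_nodes, cluster_size):
    """True if this side of the partition holds a strict majority."""
    return reachable_nodes > cluster_size // 2

def handle_write(reachable_nodes, cluster_size, prefer_consistency):
    """Decide how a node treats a write while the cluster may be partitioned.

    reachable_nodes counts the node itself plus every peer it can contact.
    """
    if has_quorum(reachable_nodes, cluster_size):
        return "accept"
    if prefer_consistency:
        # Minority side refuses the write: consistent but unavailable.
        return "reject"
    # Minority side takes the write anyway: available, but replicas may
    # diverge and must be reconciled once the partition heals.
    return "accept-and-reconcile-later"

# A 4-node cluster split 3/1: only the larger side has a majority.
majority_side = handle_write(3, 4, prefer_consistency=True)
minority_side = handle_write(1, 4, prefer_consistency=True)

# A 4-node cluster split 2/2: neither side has a strict majority.
even_split = handle_write(2, 4, prefer_consistency=True)
even_split_available = handle_write(2, 4, prefer_consistency=False)
```

The 2/2 case shows why clusters that rely on majorities are often built with an odd number of nodes: an even split leaves no side able to proceed safely.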

[Figure: a cluster of four servers (Server 1 to Server 4) divided by a network partition]

Reacting to a Partition

If a partition occurs, a system must either
– Cancel the operation, reducing availability, OR
– Perform the operation, risking inconsistency.
Business perspective: A > C
– When in doubt, take the customer’s order.
– Apologize, fix, and compensate later.

BASE

Basically Available
– System provides availability according to CAP.
Soft-state
– State of system changes over time even without new user inputs due to asynchronous updates.
Eventually Consistent
– Some nodes have current data.
– Other nodes have previous data.
– If no user updates are made, all nodes will eventually have identical current data.

ACID vs. BASE

ACID                                                  BASE
Consistency most important                            Availability most important
Strong consistency                                    Weak consistency
Pessimistic                                           Optimistic
Data unavailable until correct answer on all nodes    Approximate answers OK
Slower                                                Faster
Less scalable                                         More scalable

Give up Partition Tolerance

Examples
– RDBMSs
– Master/slave RDBMS clusters
Traits
– All nodes must be in contact to function.
– 2-phase commit
– Cache invalidation protocols

Give up Availability

Examples
– Multi-master distributed DBs
– BigTable, HBase, Neo4J
– MongoDB, Redis
Traits
– Shards
– Quorum/majority algorithms
– System down when transactions cannot complete across cluster.

Give up Consistency

Examples
– DNS
– Cassandra, Voldemort
– SimpleDB, CouchDB
Traits
– Eventually consistent
– Highly scalable
– Conflict resolution
– Optimistic

NoSQL Taxonomy

Key/Value Store
• Access data by strings called keys.
• Data has no required format.

Document Store
• Access data by key or by search of document data.
• Document formats: XML, YAML, JSON

Graph
• Nodes (objects), properties, edges
• Object-oriented schema-less network

Key/Value Stores

A simple model
– Hash, map, or dictionary
Key/value stores are an old idea
– BerkeleyDB is a key/value store
Modern key/value stores are different
– Non-ACID
– Highly scalable
– Highly available

Cassandra: Key/Value Column Store

Document Stores

Store documents instead of rows
– No schema
– High flexibility
– Full-text and key search instead of SQL
Document structures include
– JSON
– XML
– YAML

CouchDB: JSON Documents in a B-Tree

CouchDB: Multi-Version Concurrency Control

• Writes don’t have to wait for locks.
• Instead, writes create a new version of the document.
• Before the write completes, reads retrieve the old version.
• After the write completes, reads retrieve the new version.
• System must later remove old versions of documents.

Graphs

A graph is a representation of a set of objects (nodes) connected by links (edges). In a graph database, nodes contain data items called properties.

Social Graphs

Graph Databases

Why graph DBs over Relational DBs?

Many web applications have graph data
– Social networks, tagging systems, CMS, wikis
Graph operations are slow in RDBMS
– Graphs are recursive data structures
– Each traversal of an edge of a graph is a join
Schemas difficult to pre-conceive
– You don’t know which users will friend each other, which tags will be applied, which items in a CMS or wiki will be hyperlinked

Key Point: SQL and NoSQL

If SQL is your only storage tool, then all problems must look the same.

Using SQL and NoSQL storage technologies where appropriate, systems can have – SQL for high consistency where needed – NoSQL for high availability and scalability

Key Points

Database scaling techniques
1. Use caching to avoid need to scale DB
2. Scale up with faster hardware
3. Replicate DB to multiple servers
4. Split DB across multiple servers (partitions)
Concurrency: critical section, lock, deadlock
Transactions and ACID properties
Distributed system types: shared {memory, disk, nothing}
Replication
– Master/slave scales reads but not writes; has SPOF
– Multi-master scales both, but must be able to resolve conflicts.
Partitioning
– Partition tables by rows and put shards on different nodes.

NoSQL Key Points

Brewer’s CAP Theorem
– Tradeoffs between CAP
– Understand NoSQL and SQL CAP properties
– ACID vs. BASE
NoSQL Taxonomy
– Key/value
– Document
– Graph

References (1)

1. John Allspaw, The Art of Capacity Planning, O’Reilly, 2008.
2. Kristina Chodorow, Scaling MongoDB, O’Reilly, 2011.
3. Brad Fitzpatrick, LiveJournal: Behind the Scenes, USENIX, http://danga.com/words/2007_06_usenix/usenix.pdf, 2007.
4. Jim Gray et al., The Dangers of Replication and a Solution. In Proceedings of the 1996 ACM SIGMOD international conference on Management of data (SIGMOD '96).
5. James Hamilton, Scaling at MySpace, http://perspectives.mvdirona.com/2010/02/15/ScalingAtMySpace.aspx, 2010.
6. Cal Henderson, Building Scalable Web Sites, O’Reilly, 2006.
7. Cal Henderson, Flickr and PHP presentation, http://www.niallkennedy.com/blog/uploads/flickr_php.pdf, 2006.
8. Schwartz et al., High Performance MySQL, 3rd edition, O’Reilly, 2012.
9. Theo Schlossnagle, Scalable Internet Architectures, Sams Publishing, 2007.
10. Wikipedia, Server Layout Diagrams, http://meta.wikimedia.org/wiki/Server_layout_diagrams

References (2)

1. Daniel Abadi, The Problems with ACID and How to Fix them without going NoSQL, http://dbmsmusings.blogspot.com/2010/08/problems-with-acid-and-how-to-fix-them.html, 2010.
2. J. Chris Anderson, Jan Lehnardt, Noah Slater, CouchDB: The Definitive Guide, O’Reilly, 2010.
3. Julian Browne, Brewer’s CAP Theorem, http://www.julianbrowne.com/article/viewer/brewers-cap-theorem, 2009.
4. James Hamilton, Scaling at MySpace, http://perspectives.mvdirona.com/2010/02/15/ScalingAtMySpace.aspx, 2010.
5. Eben Hewitt, Cassandra: The Definitive Guide, O’Reilly, 2011.
6. Nathan Hurst, Visual Guide to NoSQL Systems, http://blog.nahurst.com/visual-guide-to-nosql-systems, 2010.
7. Anders Nawroth, Social networks in the database: using a graph database, http://blog.neo4j.org/2009/09/social-networks-in-database-using-graph.html, 2009.
8. Dan Pritchett, BASE: An ACID Alternative, ACM Queue, 2008.
9. Bret Taylor, How FriendFeed uses MySQL to store schema-less data, http://bret.appspot.com/entry/how-friendfeed-uses-mysql, 2009.
10. Werner Vogels, Eventually Consistent, ACM Queue, 2008.
11. Steve Yen, NoSQL is a Horseless Carriage, http://dl.dropbox.com/u/2075876/nosql-steve-yen.pdf, 2009.