MySQL Cluster

Monty Taylor MySQL Inc

University of São Paulo February 2007

Copyright 2006 MySQL AB 1 WWhoho aamm II??

• MMononttyy TayloTaylorr • SSeneniioror ConsConsuultltanantt • WorWorkikingng fforor MMyySSQQLL IncInc.. • LLiivivingng iinn SSeeatatttle,le, WasWashhiingtngtonon • ThTheaeattrree DDiirrececttoror,, LLiighghttiingng DDesesiignergner aandnd PPhothotogrographapherer

Copyright 2006 MySQL AB 2 Quick Introduction to MySQL

Copyright 2006 MySQL AB 3 Quick Introduction to MySQL

SQL Relational Management System

Copyright 2006 MySQL AB 4 Quick Introduction to MySQL

Aiming for full standards compliance (SQL 2003)

Copyright 2006 MySQL AB 5 Quick Introduction to MySQL

Reputation for: ✔Speed ✔Reliability ✔Ease of Use

Copyright 2006 MySQL AB 6 Quick Introduction to MySQL

Cluster appeared in 4.1

Copyright 2006 MySQL AB 7 Quick Introduction to MySQL

5.0 is GA Recommended release for Production Many speed improvements for Cluster

Copyright 2006 MySQL AB 8 Quick Introduction to MySQL

5.1 is Beta Many features for Cluster (also speed improvements)

Copyright 2006 MySQL AB 9 What to Expect Today

Copyright 2006 MySQL AB 10 What to expect today

• Learn about the architecture of MySQL Cluster • Current and future features • What problems MySQL Cluster is currently good at solving – and those that it isn't • Some discussion of internals

Copyright 2006 MySQL AB 11 Storage Engines

Copyright 2006 MySQL AB 12 Storage Engines

Unique feature of MySQL

Copyright 2006 MySQL AB 13 No one best way to store tables

Copyright 2006 MySQL AB 14 Choice of Storage Engines

Copyright 2006 MySQL AB 15 Different engine per (if you want)

Copyright 2006 MySQL AB 16 Just like a Virtual File System Layer

Application Application Application Application

open() read() write() sync()

Kernel

VFS ext3 reiserfs XFS vfat

Copyright 2006 MySQL AB 17 Just like a Virtual File System Layer

Application Application Application Application

INSERT UPDATE DELETE CREATE

MySQL Server

Storage Engine API MyISAM InnoDB NDB

Cluster

Copyright 2006 MySQL AB 18 What is MySQL Cluster?

Copyright 2006 MySQL AB 19 What is MySQL Cluster?

A High Availability

Copyright 2006 MySQL AB 20 What is MySQL Cluster?

A High Availability High Performance

Copyright 2006 MySQL AB 21 What is MySQL Cluster?

A High Availability High Performance In Memory (4.1 and 5.0)

Copyright 2006 MySQL AB 22 What is MySQL Cluster?

A High Availability High Performance In Memory (4.1 and 5.0) Shared Nothing

Copyright 2006 MySQL AB 23 What is MySQL Cluster?

A High Availability High Performance In Memory (4.1 and 5.0) Shared Nothing Clustered

Copyright 2006 MySQL AB 24 What is MySQL Cluster?

A High Availability High Performance In Memory (4.1 and 5.0) Shared Nothing Clustered Storage Engine

Copyright 2006 MySQL AB 25 What is MySQL Cluster?

A High Availability High Performance In Memory (4.1 and 5.0) Shared Nothing Clustered Storage Engine

Copyright 2006 MySQL AB 26 What is MySQL Cluster?

High Availability

Copyright 2006 MySQL AB 27 High Availability

Designed for Five Nines (99.999%) Uptime

Copyright 2006 MySQL AB 28 High Availability

Sub-Second Failover

Copyright 2006 MySQL AB 29 High Availability

Sub-Second Failover

mysqld mysqld

Transactions

Data Nodes

Copyright 2006 MySQL AB 30 High Availability

Sub-Second Failover

mysqld mysqld

Transactions

Data Nodes

Copyright 2006 MySQL AB 31 High Availability

Sub-Second Failover

mysqld mysqld

Transactions

Data Nodes

Copyright 2006 MySQL AB 32 Failure Detection: Heartbeats, lost connections

Node number is an internal Nodes organized number signifying the age Data in a logical circle of the node. Node 1

Data Data Node 4 Node 2

All nodes must Data Heartbeat messages have the same view Node 3 sent to next node in circle of which nodes are alive

Copyright 2006 MySQL AB 33 Heartbeat Handling of MySQL Servers

MySQL Servers (mysqld)

Node Y1 Node Y2 Node Y3 Node Y4 (mysqld) (mysqld) (mysqld) (mysqld)

API_REGREQ API_REGCONF(HB_TIME)

Node X (ndbd)

Copyright 2006 MySQL AB 34 Time to Handle Node Failure

Detection Reorganization Handling transactions Time Time that lost coordinator

Max 4 * Heartbeat A matter of time Timer (default 1.5 microseconds Seconds) sending two rounds Almost instanteous of broadcast m essages if node failure discovered by OS.

Copyright 2006 MySQL AB 35 High Availability

“Hot” (Online) Backup

Copyright 2006 MySQL AB 36 High Availability

Hot (Online) Backup

mysqld mysqld

Transactions

Data Nodes

Copyright 2006 MySQL AB 37 High Availability

Hot (Online) Backup

mysqld mysqld

Transactions

Backup Backup

Data Nodes

Copyright 2006 MySQL AB 38 High Availability

Configurable Amount of Redundancy

Copyright 2006 MySQL AB 39 High Availability

Configurable Amount of Redundancy

NoOfReplicas

Copyright 2006 MySQL AB 40 High Availability

NoOfReplicas=1

D a t a

Copyright 2006 MySQL AB 41 High Availability

NoOfReplicas=2

Da ta Da ta

Copyright 2006 MySQL AB 42 High Availability

NoOfReplicas=3

Da Da ta Da ta ta

Copyright 2006 MySQL AB 43 High Availability

NoOfReplicas=4

Data Data Data Data

Copyright 2006 MySQL AB 44 What is MySQL Cluster?

A High Availability High Performance In Memory (4.1 and 5.0) Shared Nothing Clustered Storage Engine

Copyright 2006 MySQL AB 45 High Performance

High Performance

Copyright 2006 MySQL AB 46 High Performance

Not BEGIN to COMMIT

Copyright 2006 MySQL AB 47 High Performance

Not BEGIN to COMMIT Through Parallelism

Copyright 2006 MySQL AB 48 High Performance

Parallelism

mysqld mysqld

Transactions

Data Nodes

Copyright 2006 MySQL AB 49 What is MySQL Cluster?

A High Availability High Performance In Memory (4.1 and 5.0) Shared Nothing Clustered Storage Engine

Copyright 2006 MySQL AB 50 What is MySQL Cluster?

In Memory (4.1 and 5.0)

Copyright 2006 MySQL AB 51 In Memory (4.1 and 5.0)

Data and Indexes kept in main memory

Size of Database×No Of Replicas×1.1 Per Node= No Of Data Nodes

Copyright 2006 MySQL AB 52 In Memory (4.1 and 5.0)

Check point to disk

Copyright 2006 MySQL AB 53 In Memory (4.1 and 5.0)

Check point to disk

Frequent, Configurable

Copyright 2006 MySQL AB 54 In Memory (4.1 and 5.0)

Check point to disk

Not complete data loss after power outage

Copyright 2006 MySQL AB 55 Flushing the log to disk

REDO log

T1 commits T2 commits GCP T3 commits Time

A global checkpoint GCP flushes REDO log to disk.

Transactions T1 and T2 become disk persistent.

(Also called group commit.)

Copyright 2006 MySQL AB 56 Log and Local Checkpoints

The REDO log is written to disk.

LCP T1 commits T2 commits T3 commits Time

A local checkpoint LCP Since MySQL Cluster replicate all data, the REDO log saves an image of the and LCPs are only used to recover from whole system database on disk. failures. This enables cutting the tail of the REDO log. Transactions T1, T2 and T3 are main memory persistent.

Copyright 2006 MySQL AB 57 Checkpointing takes time

UNDO log REDO log

T1 commits T2 commits T3 commits Time LCP LCP start stop

UNDO log makes LCP image consistent with database at LCP start time.

Copyright 2006 MySQL AB 58 Restoring a Fragment at System Restart

1) Load Data Pages from local checkpoint into memory

Start Execute UNDO log End of UNDO log UNDO log

End Global Fuzzy point of Checkpoint start of Local Log Record Checkpoint

Execute REDO log

Copyright 2006 MySQL AB 59 In Memory (4.1 and 5.0)

Data on Disk in 5.1

Copyright 2006 MySQL AB 60 In Memory (4.1 and 5.0)

Data on Disk in 5.1 Indexed fields and Indexes in main memory

Copyright 2006 MySQL AB 61 What is MySQL Cluster?

A High Availability High Performance In Memory (4.1 and 5.0) Shared Nothing Clustered Storage Engine

Copyright 2006 MySQL AB 62 What is MySQL Cluster?

Shared Nothing

Copyright 2006 MySQL AB 63 Shared Nothing

Commodity PCs

Copyright 2006 MySQL AB 64 Shared Nothing

Commodity Interconnects

Copyright 2006 MySQL AB 65 Shared Nothing

Commodity Interconnects Ethernet

Copyright 2006 MySQL AB 66 Shared Nothing

Commodity Interconnects Ethernet SCI

Copyright 2006 MySQL AB 67 Shared Nothing

No Expensive Shared Disk

Copyright 2006 MySQL AB 68 Shared Nothing

No Expensive Shared Disk (so no single point of failure there)

Copyright 2006 MySQL AB 69 What is MySQL Cluster?

A High Availability High Performance In Memory (4.1 and 5.0) Shared Nothing Clustered Storage Engine

Copyright 2006 MySQL AB 70 What is MySQL Cluster?

Clustered

Copyright 2006 MySQL AB 71 What is MySQL Cluster?

Clustered

Copyright 2006 MySQL AB 72 What is MySQL Cluster?

A High Availability High Performance In Memory (4.1 and 5.0) Shared Nothing Clustered Storage Engine

Copyright 2006 MySQL AB 73 What is MySQL Cluster?

Storage Engine

Copyright 2006 MySQL AB 74 What is MySQL Cluster?

ENGINE=NDBCLUST ER

Copyright 2006 MySQL AB 75 What is MySQL Cluster?

A High Availability High Performance In Memory (4.1 and 5.0) Shared Nothing Clustered Storage Engine

Copyright 2006 MySQL AB 76 Node Types

Copyright 2006 MySQL AB 77 Node Types

Data Nodes (ndbd)

Copyright 2006 MySQL AB 78 Data Nodes

Copyright 2006 MySQL AB 79 Data Nodes

Data nodes (running ndbd)

Copyright 2006 MySQL AB 80 Data Nodes

This cloud means a cluster

Data nodes (running ndbd)

Copyright 2006 MySQL AB 81 Data Nodes grouped

Copyright 2006 MySQL AB 82 Data Nodes grouped into nodegroups

Nodegroup 0 Nodegroup 1

Copyright 2006 MySQL AB 83 NoOfReplicas is number of nodes in node group

NoOfReplicas=2

Nodegroup 0 Nodegroup 1

Copyright 2006 MySQL AB 84 NoOfReplicas is number of nodes in node group

DDAA TTAA NoOfReplicas=2

DDAA TTAA

Nodegroup 0 Nodegroup 1

Copyright 2006 MySQL AB 85 Data Nodes

Nodegroup 0 Nodegroup 1

Copyright 2006 MySQL AB 86 Data Nodes

pk

Nodegroup 0 Nodegroup 1

Copyright 2006 MySQL AB 87 HASH(pk) pk

Copyright 2006 MySQL AB 88 HASH(pk) pk

Copyright 2006 MySQL AB 89 HASH(pk) pk

Copyright 2006 MySQL AB 90 Perception Reality

HASH(pk) pk

Copyright 2006 MySQL AB 91 pk

Copyright 2006 MySQL AB 92 MySQL Servers talk to the Data Nodes

mysqld

Nodegroup 0 Nodegroup 1

Copyright 2006 MySQL AB 93 One used as a Transaction Coordinator

mysqld

One storage node as Transaction Coordinator (TC)

Nodegroup 0 Nodegroup 1

Copyright 2006 MySQL AB 94 One used as a Transaction Coordinator

mysqld

One storage node as Transaction Coordinator (TC)

Nodegroup 0 Nodegroup 1

Copyright 2006 MySQL AB 95 Variant of 2-phase commit protocol

Example showing transaction TC with two operations using three replicas Prepare

Committed/ Primary Completed Primary

Prepare Commit/ Complete Backup Backup

Prepare Commit/ Complete Backup Backup Prepared Commit/ Complete 1. Prepare F1 2. Commit F1 1. Prepare F2 2. Commit F2 3. Complete F1 3. Complete F2

Copyright 2006 MySQL AB 96 Node Types

Data Nodes (ndbd) Up to 48 in one cluster

Copyright 2006 MySQL AB 97 Node Types

Management Server (ndb_mgmd)

Copyright 2006 MySQL AB 98 Management Server config.ini

[ndbd default] NoOfReplicas= 2 DataMemory= 400M IndexMemory= 32M DataDir= /usr/local//cluster mysqld [ndbd] mysqld HostName= 192.168.0.40 ndb_mgmd

[ndbd] HostName= 192.168.0.41

[ndb_mgmd] DataDir= /usr/local/mysql/cluster HostName= 192.168.0.42

[mysqld] [mysqld] ndbd ndbd [mysqld]

Copyright 2006 MySQL AB 99 Management Server

config.ini

[ndbd default]

NoOfReplicas= 2

DataMemory= 400M

IndexMemory= 32M DataDir= /usr/local/mysql/cluster mysqld [ndbd]

HostName= 192.168.0.40 mysqld ndb_mgmd [ndbd]

HostName= 192.168.0.41

[ndb_mgmd]

DataDir= /usr/local/mysql/cluster

HostName= 192.168.0.42

[mysqld]

[mysqld]

[mysqld]

ndbd ndbd

Copyright 2006 MySQL AB 100 Management Server

config.ini mysqld mysqld ndb_mgmd

ndbd ndbd

Copyright 2006 MySQL AB 101 Management Server

config.ini mysqld mysqld ndb_mgmd

ndbd ndbd

Copyright 2006 MySQL AB 102 Management Server

config.ini mysqld mysqld ndb_mgmd

ndbd ndbd

Copyright 2006 MySQL AB 103 Node Types

Management Server (ndb_mgmd) also involved in arbitration, starting backups, issuing commands to nodes (start, stop, restart)

Copyright 2006 MySQL AB 104 Node Types

SQL Nodes (mysqld) sometimes called API nodes

Copyright 2006 MySQL AB 105 SQL Nodes

mysqld mysqld mysqld

Copyright 2006 MySQL AB 106 SQL and API Nodes

mysqld mysqld mysqld

NDB API NDB API

Copyright 2006 MySQL AB 107 DELETE INSERT UPDATE mysqld mysqld mysqld

NDB API NDB API

Copyright 2006 MySQL AB 108 DELETE INSERT UPDATE mysqld mysqld mysqld

NDB API NDB API

update() update()

Copyright 2006 MySQL AB 109 Node Types

SQL Nodes (mysqld) Accessed like any other MySQL Server

Copyright 2006 MySQL AB 110 Node Types

API Nodes Talk NDB API directly to the Data Nodes

Copyright 2006 MySQL AB 111 The full stack

Copyright 2006 MySQL AB 112 Physical Requirements

Copyright 2006 MySQL AB 113 Physical Requirements

A node is a process, not a computer

Copyright 2006 MySQL AB 114 Physical Requirements

At least three physical machines for High Availability

Copyright 2006 MySQL AB 115 Three machines minimum for HA

A B

Copyright 2006 MySQL AB 116 Three machines minimum for HA

A B

Can no longer see A

Copyright 2006 MySQL AB 117 Three machines minimum for HA

A B

Can no longer see A Did A Die?

Copyright 2006 MySQL AB 118 Three machines minimum for HA

A B

Or did the network Can no longer see A link between A and B die?

Copyright 2006 MySQL AB 119 Three machines minimum for HA

A B

Can no longer see B Or did the network Can no longer see A link between A and B die?

Copyright 2006 MySQL AB 120 Three machines minimum for HA Who is in charge now?

A B

Can no longer see B Or did the network Can no longer see A link between A and B die?

Copyright 2006 MySQL AB 121 Three machines minimum for HA Split Brain = Badness

A B

Can no longer see B Or did the network Can no longer see A link between A and B die?

Copyright 2006 MySQL AB 122 Three machines minimum for HA Split Brain = Badness

We detect possible split brain scenarios.

Nodes will shut down instead

Copyright 2006 MySQL AB 123 Three machines minimum for HA C

A B

Management server on 3rd machine

Copyright 2006 MySQL AB 124 Three machines minimum for HA C

A B

Management server on 3rd machine Is TThhee DDeecciiddeerr

Copyright 2006 MySQL AB 125 Three machines minimum for HA C

A B

Management server on 3rd machine Is TThhee DDeecciiddeerr

Copyright 2006 MySQL AB 126 Three machines minimum for HA C

A B

Management server on 3rd machine Is TThhee DDeecciiddeerr

Copyright 2006 MySQL AB 127 Physical Requirements

Management Server: Not CPU intensive

Copyright 2006 MySQL AB 128 Physical Requirements • Data Nodes –Lots of Memory –Disk I/O Can be calculated –(generally) not CPU bound –ndbd is single threaded •multiple nodes for SMP

Copyright 2006 MySQL AB 129 Physical Requirements

SQL Node: Many API/SQL nodes are needed to load Storage Nodes

Copyright 2006 MySQL AB 130 Physical Requirements

SQL Node: MySQL is multi-threaded – takes advantage of SMP transparently

Copyright 2006 MySQL AB 131 A Configuration

[ndbd default] NoOfReplicas= 2 DataMemory= 400M IndexMemory= 32M DataDir= /usr/local/mysql/cluster

[ndbd] HostName= 192.168.0.40

[ndbd] HostName= 192.168.0.41

[ndb_mgmd] DataDir= /usr/local/mysql/cluster HostName= 192.168.0.42

[mysqld] [mysqld] [mysqld]

Copyright 2006 MySQL AB 132 A Configuration

[ndbd default] NoOfReplicas= 2 DataMemory= 400M IndexMemory= 32M DataDir= /usr/local/mysql/cluster

[ndbd] HostName= 192.168.0.40 ndb_mgmd

[ndbd] HostName= 192.168.0.41

[ndb_mgmd] DataDir= /usr/local/mysql/cluster HostName= 192.168.0.42

[mysqld] [mysqld] [mysqld]

Copyright 2006 MySQL AB 133 A Configuration

[ndbd default] NoOfReplicas= 2 DataMemory= 400M IndexMemory= 32M DataDir= /usr/local/mysql/cluster

[ndbd] HostName= 192.168.0.40 ndb_mgmd

[ndbd] HostName= 192.168.0.41

[ndb_mgmd] DataDir= /usr/local/mysql/cluster HostName= 192.168.0.42

[mysqld] [mysqld] ndbd ndbd [mysqld]

Copyright 2006 MySQL AB 134 A Configuration

[ndbd default] NoOfReplicas= 2 DataMemory= 400M IndexMemory= 32M DataDir= /usr/local/mysql/cluster mysqld [ndbd] mysqld HostName= 192.168.0.40 ndb_mgmd

[ndbd] HostName= 192.168.0.41

[ndb_mgmd] DataDir= /usr/local/mysql/cluster HostName= 192.168.0.42

[mysqld] [mysqld] ndbd ndbd [mysqld]

Copyright 2006 MySQL AB 135 A Configuration

[ndbd default] NoOfReplicas= 2 DataMemory= 400M IndexMemory= 32M DataDir= /usr/local/mysql/cluster mysqld [ndbd] mysqld HostName= 192.168.0.40 ndb_mgmd

[ndbd] HostName= 192.168.0.41

[ndb_mgmd] DataDir= /usr/local/mysql/cluster HostName= 192.168.0.42

[mysqld] [mysqld] ndbd ndbd [mysqld]

Copyright 2006 MySQL AB 136 A Configuration

[ndbd default] NoOfReplicas= 2 DataMemory= 400M IndexMemory= 32M DataDir= /usr/local/mysql/cluster mysqld [ndbd] mysqld HostName= 192.168.0.40 ndb_mgmd

[ndbd] HostName= 192.168.0.41

[ndb_mgmd] DataDir= /usr/local/mysql/cluster HostName= 192.168.0.42

[mysqld] [mysqld] ndbd ndbd [mysqld]

Copyright 2006 MySQL AB 137 A Configuration

[ndbd default] applications NoOfReplicas= 2 DataMemory= 400M IndexMemory= 32M DataDir= /usr/local/mysql/cluster mysqld [ndbd] mysqld HostName= 192.168.0.40 ndb_mgmd

[ndbd] HostName= 192.168.0.41

[ndb_mgmd] DataDir= /usr/local/mysql/cluster HostName= 192.168.0.42

[mysqld] [mysqld] ndbd ndbd [mysqld]

Copyright 2006 MySQL AB 138 Using the Cluster

• MySQL Cluster is a storage engine – tables can be stored in the cluster simply by ENGINE=NDBCLUSTER • No constraints • Supports 5.0 features: – Views – Stored Procedures – Triggers – standard permissions – these need to be set up on each SQL node • as the mysql.* tables aren't stored as NDB tables

Copyright 2006 MySQL AB 139 Failure Scenarios

• MySQL Server Node – Can be restarted and reconnected to the cluster. – Applications can connect to other MySQL Server nodes • Data Node – All other storage nodes are informed about failure – Data is replicated, so there is another storage node to service transaction requests – Currently, transactions using a failed node are aborted, and have to be restarted by the application • Management Server Node – Continued operation not dependent on management server – may be restarted, or have redundant nodes

Copyright 2006 MySQL AB 140 Schema Considerations

• All tables have a primary key – if no primary key is explicitly set, NDBCLUSTER creates a hidden field • Three types of indexes – Primary Hash Index – Unique Hash Index – Ordered Tree Index • Creating a UNIQUE index from SQL will also create an ordered index unless USING HASH is specified – requires knowing how you application will query your DB • Primary key is primary hash index and ordered index – unless USING HASH is specified

Copyright 2006 MySQL AB 141 Optimizations

• An improvement in 5.0 is the engine_condition_pushdown option – SELECT A from T where unindexed_field = 42; – The condition is evaluated in the data nodes, not in the SQL node – saving a full table scan • which can be rather expensive on large tables – Common to see 5-10x speed improvement – details in EXPLAIN output • In 5.0, we integrate with the query cache • Batched lookups – SELECT * from t1 where primary_key IN (1,2,3,4,5,6,7,8,9,10); – all 10 key lookups are sent in one batch rather than one at a time – 2-3 times performance gain

Copyright 2006 MySQL AB 142 Condition Pushdown

• Conditions on non-key/index attributes are pushed down to data nodes whenever possible. select * from t2,t3 where t2.attr2 = t3.attr2 and t2.attr1 < 1 and t3.attr1 < 5; Join query is executed as two nested scans on table t2 and t3.The scan on t2 will push the condition t2.attr1 < 1 and the scan on t3 will push the condition t3.attr1 < 5 • Can be a arbitrary nesting of AND/OR/NOT and support the following operators=,=!,<,<=,>,>=,IS NULL, IS NOT NULL, LIKE, and NOT LIKE. • Is enabled|disabled by set engine_condition_pushdown=on|off;

Copyright 2006 MySQL AB 143 New in 5.1

Copyright 2006 MySQL AB 144 Variable Sized Rows

Copyright 2006 MySQL AB 145 4.1 and 5.0

1 2 hello int int VARCHAR

Copyright 2006 MySQL AB 146 4.1 and 5.0

1 2 hello and how are you today? int int VARCHAR

Copyright 2006 MySQL AB 147 4.1 and 5.0

1 2 hello and how are you today? very well. int int VARCHAR

Copyright 2006 MySQL AB 148 4.1 and 5.0

1 2 hello int int VARCHAR

Copyright 2006 MySQL AB 149 Wasted Space

1 2 hello int int VARCHAR

Wasted Space

Copyright 2006 MySQL AB 150 5.1

1 2 hello int int VARCHAR

Copyright 2006 MySQL AB 151 The Saving

Copyright 2006 MySQL AB 152 The Saving

Copyright 2006 MySQL AB 153 Variable Sized Rows

Copyright 2006 MySQL AB 154 More rows per gigabyte

Copyright 2006 MySQL AB 155 Larger datasets in Cluster

Copyright 2006 MySQL AB 156 Copyright 2006 MySQL AB 157 On line Add/Drop Index

Copyright 2006 MySQL AB 158 ADD INDEX (4.1, 5.0)

t1

Index Index

Rows

Copyright 2006 MySQL AB 159 ADD INDEX (4.1, 5.0)

t1 temp table Index Index Index Index Index

Rows Rows

Copyright 2006 MySQL AB 160 ADD INDEX (4.1, 5.0)

t1 temp table Index Index Index Index Index

Rows Rows

Copyright 2006 MySQL AB 161 ADD INDEX (4.1, 5.0)

t1 temp table Index Index Index Index Index

Rows Rows

Copyright 2006 MySQL AB 162 ADD INDEX (4.1, 5.0)

t1 temp table Index Index Index Index Index

Rows Rows

Copyright 2006 MySQL AB 163 ADD INDEX (4.1, 5.0)

t1 temp table Index Index Index Index Index

Rows Rows

Copyright 2006 MySQL AB 164 ADD INDEX (4.1, 5.0)

t1 temp table Index Index Index Index Index

Rows Rows

Copyright 2006 MySQL AB 165 ADD INDEX (4.1, 5.0)

t1 temp table Index Index Index Index Index

Rows Rows

Copyright 2006 MySQL AB 166 ADD INDEX (4.1, 5.0)

temp table Index Index Index

Rows

Copyright 2006 MySQL AB 167 ADD INDEX (4.1, 5.0)

t1 Index Index Index

Rows

Copyright 2006 MySQL AB 168 ADD INDEX in 5.1

t1

Index Index

Rows

Copyright 2006 MySQL AB 169 ADD INDEX in 5.1

t1 Index Index Index

Rows

Copyright 2006 MySQL AB 170 ADD INDEX in 5.1

t1 Index Index Index

Rows

Copyright 2006 MySQL AB 171 ADD INDEX in 5.1

t1 Index Index Index

Rows

Copyright 2006 MySQL AB 172 ADD INDEX in 5.1

t1 Index Index Index

Rows

Copyright 2006 MySQL AB 173 ADD INDEX in 5.1

t1 Index Index Index

Rows

Copyright 2006 MySQL AB 174 ADD INDEX in 5.1

t1 Index Index Index

Rows

Copyright 2006 MySQL AB 175 ADD INDEX in 5.1

t1 Index Index Index

Rows

Copyright 2006 MySQL AB 176 DROP INDEX in 5.1

t1 Index Index Index

Rows

Copyright 2006 MySQL AB 177 DROP INDEX in 5.1

t1

Index Index

Rows

Copyright 2006 MySQL AB 178 How much faster?

Copyright 2006 MySQL AB 179 Online Add Index

• Before (copy the table): mysql> create index b on t1(b); Query OK, 1356 rows affected (2.20 sec) Records: 1356 Duplicates: 0 Warnings: 0

mysql> drop index b on t1; Query OK, 1356 rows affected (2.03 sec) Records: 1356 Duplicates: 0 Warnings: 0

Copyright 2006 MySQL AB 180 Online Add Index

• Before (copy the table): mysql> create index b on t1(b); Query OK, 1356 rows affected (2.20 sec) Records: 1356 Duplicates: 0 Warnings: 0

mysql> drop index b on t1; Query OK, 1356 rows affected (2.03 sec) Records: 1356 Duplicates: 0 Warnings: 0 • Now (just add/drop an index): mysql> create index b on t1(b); Query OK, 0 rows affected (0.58 sec) Records: 0 Duplicates: 0 Warnings: 0

mysql> drop index b on t1; Query OK, 0 rows affected (0.46 sec) Records: 0 Duplicates: 0 Warnings: 0

Copyright 2006 MySQL AB 181 On line Add/Drop Index

Copyright 2006 MySQL AB 182 Faster Add/Drop Index

Copyright 2006 MySQL AB 183 Less memory requirements

Copyright 2006 MySQL AB 184 A more adaptive Cluster

Copyright 2006 MySQL AB 185 User Defined Partitioning

Copyright 2006 MySQL AB 186 Since the dawn of time...

Copyright 2006 MySQL AB 187 Copyright 2006 MySQL AB 188 pk

Copyright 2006 MySQL AB 189 pk

Copyright 2006 MySQL AB 190 pk

Nodegroup 0 Nodegroup 1

Copyright 2006 MySQL AB 191 HASH(pk) pk

Copyright 2006 MySQL AB 192 HASH(pk) pk

Copyright 2006 MySQL AB 193 HASH(pk) pk

Copyright 2006 MySQL AB 194 Perception Reality

HASH(pk) pk

Copyright 2006 MySQL AB 195 pk

Copyright 2006 MySQL AB 196 pk

Two Partitions

Copyright 2006 MySQL AB 197 How the default looks (in 5.1 SHOW CREATE TABLE)

CREATE TABLE account(number int unsigned, location varchar, amount int) PRIMARY KEY (number) [ BY KEY ()] [(PARTITION P0 NODEGROUP 0, PARTITION P1 NODEGROUP 1, …)] ENGINE=NDBCLUSTER;

Copyright 2006 MySQL AB 198 How the default looks (in 5.1 SHOW CREATE TABLE)

CREATE TABLE account(number int unsigned, location varchar, amount int) PRIMARY KEY (number) [PARTITION BY KEY ()] [(PARTITION P0 NODEGROUP 0, PARTITION P1 NODEGROUP 1, …)] ENGINE=NDBCLUSTER;

Copyright 2006 MySQL AB 199 How the default looks (in 5.1 SHOW CREATE TABLE)

CREATE TABLE account(number int unsigned, location varchar, amount int) PRIMARY KEY (number) [PARTITION BY KEY ()] [(PARTITION P0 NODEGROUP 0, PARTITION P1 NODEGROUP 1, …)] ENGINE=NDBCLUSTER;

Copyright 2006 MySQL AB 200 How the default looks (in 5.1 SHOW CREATE TABLE)

CREATE TABLE account(number int unsigned, location varchar, amount int) PRIMARY KEY (number) [PARTITION BY KEY ()] [(PARTITION P0 NODEGROUP 0, PARTITION P1 NODEGROUP 1, …)] ENGINE=NDBCLUSTER;

Copyright 2006 MySQL AB 201 Now, In 5.1

Copyright 2006 MySQL AB 202 User Defined Partitioning

Copyright 2006 MySQL AB 203 By Key

Copyright 2006 MySQL AB 204 Partition by Key

CREATE TABLE account(number int unsigned, location varchar, amount int) PRIMARY KEY (number) [PARTITION BY KEY ()] [(PARTITION P0 NODEGROUP 0, PARTITION P1 NODEGROUP 1, …)] ENGINE=NDBCLUSTER;

Copyright 2006 MySQL AB 205 By Range

Copyright 2006 MySQL AB 206 Partition by Range

CREATE TABLE account(number int unsigned, location varchar, amount int) PRIMARY KEY (number) PARTITION BY RANGE (amount) PARTITIONS 2 (PARTITION P0 VALUES LESS THAN 10000 NODEGROUP 0, PARTITION P1 VALUES LESS THAN 100000 NODEGROUP 1) ENGINE=NDBCLUSTER;

Copyright 2006 MySQL AB 207 or by List

Copyright 2006 MySQL AB 208 Partition by List

CREATE TABLE account(number int unsigned, location varchar, amount int) PRIMARY KEY (number) PARTITION BY LIST (ORD(location)) PARTITIONS 2 (PARTITION P0 VALUES IN (66, 76) -- Berlin, London NODEGROUP 0, PARTITION P1 VALUES IN (78, 84) -- New York, Tokyo NODEGROUP 1) ENGINE=NDBCLUSTER;

Copyright 2006 MySQL AB 209 User Defined Partitioning

Copyright 2006 MySQL AB 210 Gives the User control

Copyright 2006 MySQL AB 211 Gives the User control over physical distribution of data

Copyright 2006 MySQL AB 212 MySQL Cluster Replication

Copyright 2006 MySQL AB 213 Not Internal Mirroring between nodes

Copyright 2006 MySQL AB 214 Replication from one Cluster to Another Cluster

Copyright 2006 MySQL AB 215 Why?

Copyright 2006 MySQL AB 216 the usual reasons

Copyright 2006 MySQL AB 217 Why replicate between Clusters?

• Geographical Redundancy

Copyright 2006 MySQL AB 218 Why replicate between Clusters?

• Geographical Redundancy • Split the processing load

Copyright 2006 MySQL AB 219 Why replicate between Clusters?

• Geographical Redundancy • Split the processing load –e.g. for monthly reports

Copyright 2006 MySQL AB 220 Quick Overview of MySQL Replication

Copyright 2006 MySQL AB 221 MySQL Replication

Copyright 2006 MySQL AB 222 MySQL Replication

INSERT ...

Copyright 2006 MySQL AB 223 MySQL Replication

INSERT ...

INSERT...

Copyright 2006 MySQL AB 224 MySQL Replication

INSERT ...

INSERT...

Copyright 2006 MySQL AB 225 MySQL Replication

INSERT ...

INSERT...

INSERT...

Copyright 2006 MySQL AB 226 MySQL Replication

INSERT ...

INSERT ...

INSERT...

INSERT...

Copyright 2006 MySQL AB 227 MySQL Replication

INSERT ...

INSERT ...

INSERT...

INSERT...

INSERT...

Copyright 2006 MySQL AB 228 MySQL Replication

INSERT ... UPDATE ... INSERT ...

INSERT... UPDATE... INSERT...

INSERT...

Copyright 2006 MySQL AB 229 MySQL Replication

INSERT ... UPDATE ... DELETE ... INSERT ...

INSERT... UPDATE... DELETE... INSERT... UPDATE... INSERT...

Copyright 2006 MySQL AB 230 MySQL Replication

Master

Copyright 2006 MySQL AB 231 MySQL Replication

Master

Copyright 2006 MySQL AB 232 MySQL Replication

Slave

Master Slave

Slave

Copyright 2006 MySQL AB 233 MySQL Replication

Slave

Master Slave

Slave

Copyright 2006 MySQL AB 234 MySQL Replication

Slave

Master Slave

Slave

Copyright 2006 MySQL AB 235 MySQL Replication

Slave

Master Slave

Slave

Copyright 2006 MySQL AB 236 MySQL Replication

Slave

Master Slave

Slave

Copyright 2006 MySQL AB 237 MySQL Replication

Slave

Slow, Unreliable Network Master Slave

Slave

Copyright 2006 MySQL AB 238 MySQL Replication

Slave

Slow, Unreliable Network Master Slave

Slave

Copyright 2006 MySQL AB 239 MySQL Replication

Slave

Slow, Unreliable Network Master Slave

Slave

Copyright 2006 MySQL AB 240 MySQL Replication

Slave

Slow, Unreliable Network Master Slave

Slave

Copyright 2006 MySQL AB 241 Back at Cluster

Copyright 2006 MySQL AB 242 mysqld mysqld mysqld

Copyright 2006 MySQL AB 243 mysqld mysqld mysqld

NDB API NDB API

Copyright 2006 MySQL AB 244 mysqld mysqld mysqld UPDATE UPDATE UPDATE

NDB API NDB API

Copyright 2006 MySQL AB 245 mysqld mysqld mysqld

NDB API NDB API

update() update()

Copyright 2006 MySQL AB 246 mysqld mysqld mysqld

NDB API NDB API

update() update()

Copyright 2006 MySQL AB 247 INSERT UPDATE DELETE

mysqld mysqld mysqld

NDB API NDB API

update() update()

Copyright 2006 MySQL AB 248 INSERT UPDATE DELETE

mysqld mysqld mysqld

HOW? HOW?

NDB API NDB API

update() update()

Copyright 2006 MySQL AB 249 INSERT UPDATE DELETE

mysqld mysqld mysqld

Copyright 2006 MySQL AB 250 INSERT INSERT INSERT UPDATE UPDATE UPDATE DELETE DELETE DELETE

SLAVE

Copyright 2006 MySQL AB 251 INSERT INSERT INSERT UPDATE UPDATE UPDATE DELETE DELETE DELETE ORDER?

SLAVE

Copyright 2006 MySQL AB 252 Serialization is in the storage nodes

Copyright 2006 MySQL AB 253 INSERT UPDATE DELETE

mysqld mysqld mysqld

NDB API NDB API

update() update()

Copyright 2006 MySQL AB 254 INSERT UPDATE DELETE

mysqld mysqld mysqld

NDB API NDB API

update() update()

Copyright 2006 MySQL AB 255 INSERT UPDATE DELETE

mysqld mysqld mysqld

NDB API NDB API

update() update()

Copyright 2006 MySQL AB 256 NDB Injector Thread

• A thread inside the MySQL Server • Subscribes to events in NDB – The event of “Row was committed” • Injects the rows into the binlog • Producing a single, canonical binlog of your cluster – not just one mysql server – it contains EVERYTHING done on ALL ndbapi programs (including mysqld) connected to the cluster

Copyright 2006 MySQL AB 257 INSERT UPDATE DELETE

mysqld mysqld mysqld

NDB API NDB API

update() update()

Copyright 2006 MySQL AB 258 A Closer Look...

Copyright 2006 MySQL AB 259 MySQL Replication between Clusters

Application Application

MySQL Server Master

NdbCluster Handler

NDB Kernel (Data nodes) ndbd ndbd

ndbd ndbd

Copyright 2006 MySQL AB 260 MySQL Replication between Clusters

Application Application

MySQL Server Master

NdbCluster Handler

NDB Kernel (Data nodes) ndbd ndbd

ndbd ndbd

Copyright 2006 MySQL AB 261 MySQL Replication between Clusters

Application Application

MySQL Server Master

NdbCluster

Handler Binlog

NDB Kernel (Data nodes) ndbd ndbd

ndbd ndbd

Copyright 2006 MySQL AB 262 MySQL Replication between Clusters

Application Application

MySQL Server MySQL Server Master Slave

NdbCluster

Handler Binlog

NDB Kernel (Data nodes) ndbd ndbd

ndbd ndbd

Copyright 2006 MySQL AB 263 MySQL Replication between Clusters

Application Application

MySQL Server MySQL Server Master Slave

NdbCluster NdbCluster

Handler Binlog Handler

NDB Kernel NDB Kernel (Data nodes) (Data nodes) ndbd ndbd ndbd ndbd

ndbd ndbd ndbd ndbd

Copyright 2006 MySQL AB 264 MySQL Replication between Clusters

Application Application

MySQL Server MySQL Server Master Slave

NdbCluster $handler Handler Binlog

NDB Kernel (Data nodes) ndbd ndbd

ndbd ndbd

Copyright 2006 MySQL AB 265 MySQL Replication between Clusters

Application Application

MySQL Server MySQL Server Master Slave

NdbCluster NdbCluster

Handler Binlog Handler

NDB Kernel NDB Kernel (Data nodes) (Data nodes) ndbd ndbd ndbd ndbd

ndbd ndbd ndbd ndbd

Copyright 2006 MySQL AB 266 MySQL Replication between Clusters

Application Application

MySQL Server MySQL Server Replication Master Slave

NdbCluster NdbCluster

Handler Binlog Handler

NDB Kernel NDB Kernel (Data nodes) (Data nodes) ndbd ndbd ndbd ndbd

ndbd ndbd ndbd ndbd

Copyright 2006 MySQL AB 267 MySQL Replication between Clusters

Application Application

MySQL Server I / O MySQL Server Replication Master thread Slave

NdbCluster NdbCluster Relay Handler Binlog Handler Binlog

NDB Kernel NDB Kernel (Data nodes) (Data nodes) ndbd ndbd ndbd ndbd

ndbd ndbd ndbd ndbd

Copyright 2006 MySQL AB 268 MySQL Replication between Clusters

Application Application

MySQL Server I / O MySQL Server Replication Master thread Apply Slave thread

NdbCluster NdbCluster Relay Handler Binlog Handler Binlog

NDB Kernel NDB Kernel (Data nodes) (Data nodes) ndbd ndbd ndbd ndbd

ndbd ndbd ndbd ndbd

Copyright 2006 MySQL AB 269 MySQL Replication between Clusters

Application Application

MySQL Server I / O MySQL Server Replication Master thread Apply Slave thread

NdbCluster NdbCluster Relay Handler Binlog Handler Binlog Binlog

NDB Kernel NDB Kernel (Data nodes) (Data nodes) ndbd ndbd ndbd ndbd

ndbd ndbd ndbd ndbd

Copyright 2006 MySQL AB 270 MySQL Replication between Clusters

I / O MySQL Server thread Apply Slave thread

NdbCluster Relay Handler Binlog Binlog

NDB Kernel (Data nodes) ndbd ndbd

ndbd ndbd

Copyright 2006 MySQL AB 271 MySQL Replication between Clusters

Application Application

I / O MySQL Server thread Apply Slave thread

NdbCluster Relay Handler Binlog Binlog

NDB Kernel (Data nodes) ndbd ndbd

ndbd ndbd

Copyright 2006 MySQL AB 272 MySQL Replication between Clusters

Application Application Application Application

MySQL Server I / O MySQL Server Replication Master thread Apply Slave thread

NdbCluster NdbCluster Relay Handler Binlog Handler Binlog Binlog

NDB Kernel NDB Kernel (Data nodes) (Data nodes) ndbd ndbd ndbd ndbd

ndbd ndbd ndbd ndbd

Copyright 2006 MySQL AB 273 One thing...

Who spotted the single point of failure?

Copyright 2006 MySQL AB 274 Redundant Replication Channels

Replication Apply MySQL Server Master I/O MySQL Server thread thread Slave NdbCluster NdbCluster Handler Relay Handler Binlog Binlog Binlog

NDB Kernel NDB Kernel (Data nodes) (Data nodes) ndbd ndbd ndbd ndbd

ndbd ndbd ndbd ndbd

Copyright 2006 MySQL AB 275 Redundant Replication Channels

Replication Apply MySQL Server Master I/O MySQL Server thread thread Slave NdbCluster NdbCluster Handler Relay Handler Binlog Binlog Binlog

NDB Kernel NDB Kernel (Data nodes) (Data nodes) ndbd ndbd ndbd ndbd

ndbd ndbd ndbd ndbd

NdbCluster Binlog Handler

MySQL Server Master

Copyright 2006 MySQL AB 276 Redundant Replication Channels

Replication Apply MySQL Server Master I/O MySQL Server thread thread Slave NdbCluster NdbCluster Handler Relay Handler Binlog Binlog Binlog

NDB Kernel NDB Kernel (Data nodes) (Data nodes) ndbd ndbd ndbd ndbd

ndbd ndbd ndbd ndbd

NdbCluster Binlog Relay NdbCluster Binlog Handler Binlog Handler MySQL Server Master Replication MySQL Server I/O Apply thread thread Slave

Copyright 2006 MySQL AB 277 Redundant Replication Channels

Replication Apply MySQL Server Master I/O MySQL Server thread thread Slave NdbCluster NdbCluster Handler Relay Handler Binlog Binlog Binlog

NDB Kernel NDB Kernel (Data nodes) (Data nodes) ndbd ndbd ndbd ndbd

ndbd ndbd ndbd ndbd

NdbCluster Binlog Relay NdbCluster Binlog Handler Binlog Handler MySQL Server Master Replication MySQL Server I/O Apply thread thread Slave

Copyright 2006 MySQL AB 278 Redundant Replication Channels

Replication Apply MySQL Server Master I/O MySQL Server thread thread Slave NdbCluster NdbCluster Handler Relay Handler Binlog Binlog Binlog

NDB Kernel NDB Kernel (Data nodes) (Data nodes) ndbd ndbd ndbd ndbd

ndbd ndbd ndbd ndbd

NdbCluster Binlog Relay NdbCluster Binlog Handler Binlog Handler MySQL Server Master Replication MySQL Server I/O Apply thread thread Slave

Copyright 2006 MySQL AB 279 Redundant Replication Channels

Replication Apply MySQL Server Master I/O MySQL Server thread thread Slave NdbCluster NdbCluster Handler Relay Handler Binlog Binlog Binlog

NDB Kernel NDB Kernel (Data nodes) (Data nodes) ndbd ndbd ndbd ndbd

ndbd ndbd ndbd ndbd

NdbCluster Binlog Relay NdbCluster Binlog Handler Binlog Handler MySQL Server Master Replication MySQL Server I/O Apply thread thread Slave

Copyright 2006 MySQL AB 280 So,

THAT'S COOL!

Copyright 2006 MySQL AB 281 Not just Cool,

Copyright 2006 MySQL AB 282 It's SHINY.

Copyright 2006 MySQL AB 283 How do I make fail over happen?

Copyright 2006 MySQL AB 284 But first...

Copyright 2006 MySQL AB 285 Epoch

• A point of synchronization in the cluster • Everybody agrees on what transactions are disk persistent • In case of system crash, this is where we'll recover to.

Copyright 2006 MySQL AB 286 Okay, but fail over?

Copyright 2006 MySQL AB 287 Currently manual,

Copyright 2006 MySQL AB 288 Currently manual, but only four simple steps

Copyright 2006 MySQL AB 289 STEP 1

• Find out where the Slave is up to • In the binary log produced by the injector – each Global Check Point (epoch) is a transaction – Where we are up to is recorded on the slave • The cluster database tracks these things – here, the apply_status table. • Which has two columns: server_id and epoch (both integers) • and is ENGINE=ndb – so is available everywhere! • So, we – mysqlS`> SELECT @latest:=MAX(epoch) from cluster.apply_status; – possibly with a WHERE clause for server_id

Copyright 2006 MySQL AB 290 STEP 2

• Find the binlog position for this epoch • The cluster.binlog_index table will help us here – It maps binlog position to GCI – and tells us the number of INSERT, UPDATES, DELETES and SCHEMAOPS per GCI – Is MyISAM and is per-master • So, we run the query (@latest from slave) • mysqlM`> SELECT @file:=SUBSTRING_INDEX(File, '/', -1), @pos:=Position FROM cluster.binlog_index WHERE epoch > @latest ORDER BY epoch ASC LIMIT 1;

Copyright 2006 MySQL AB 291 STEP 3

• Synchronize the second channel – i.e. change the master • Run the query (using @file and @pos from last query) – mysqlS`> CHANGE MASTER TO MASTER_LOG_FILE='@file', MASTER_LOG_POS=@pos;

Copyright 2006 MySQL AB 292 STEP 4

• mysqlS`> START SLAVE; • No, really, that's it.

Copyright 2006 MySQL AB 293 Limitations

• Fail over of replication channels is manual – can be scripted • Since all updates are through one injector thread, there is a limit – this limit is much less than what you can pump through a good cluster – We are working to overcome this

Copyright 2006 MySQL AB 294 You can now...

• Have a 99.999% uptime cluster – with great performance – and a cool name

Copyright 2006 MySQL AB 295 You can now...

• Have a 99.999% uptime cluster – with great performance – and a cool name • Have replication between it and another Cluster (or single server) – Load balancing – Geographical redundancy

Copyright 2006 MySQL AB 296 You can now...

• Have a 99.999% uptime cluster – with great performance – and a cool name • Have replication between it and another Cluster (or single server) – Load balancing – Geographical redundancy • Redundant replication channels between these setups • Redundancy up the wazoo!

Copyright 2006 MySQL AB 297 Copyright 2006 MySQL AB 298 Disk Data

Copyright 2006 MySQL AB 299 Options for Adding Disk Data to NDB

1. Put Existing disk engine into NDB kernel

Copyright 2006 MySQL AB 300 Options for Adding Disk Data to NDB

1. Put Existing disk engine into NDB kernel 2. Careful re-engineering of NDB kernel to add support for disk based attributes

Copyright 2006 MySQL AB 301 Options for Adding Disk Data to NDB

1. Put Existing disk engine into NDB kernel 2. Careful re-engineering of NDB kernel to add support for disk based attributes

Copyright 2006 MySQL AB 302 Two Phase Implementation

1. Data on disk (5.1) 2. Indexes on disk (future)

Copyright 2006 MySQL AB 303 A few concepts...

Copyright 2006 MySQL AB 304 Where we store things

Table Space

Copyright 2006 MySQL AB 305 Where we store things

Table Space

Data file

Copyright 2006 MySQL AB 306 Where we store things

Table Space

Data file Data file

Copyright 2006 MySQL AB 307 Where we store things

Table Space

Data file Data file

Data file

Copyright 2006 MySQL AB 308 Where we store things

Table Space

Table Space Data file Data file

Data file Data file

Data file

Copyright 2006 MySQL AB 309 Where we store things

Table Space

Table Space Data file Data file

Data file Data file

Data file Log file group

Undo file

Copyright 2006 MySQL AB 310 Where we store things

Table Space

Table Space Data file Data file

Data file Data file

Data file Log file group

Undo file Undo file

Copyright 2006 MySQL AB 311 Where we store things

Table Space

Table Space Data file Data file

Data file Data file

Data file Log file group

Undo file Undo file

Copyright 2006 MySQL AB 312 Files are per node

Node 2 Node 1

df1 df1

Copyright 2006 MySQL AB 313 Let's look at the SQL

Copyright 2006 MySQL AB 314 CREATE LOGFILE GROUP

• CREATE LOGFILE GROUP lg_1 ADD UNDOFILE 'undo1' INITIAL_SIZE 16M UNDO_BUFFER_SIZE 2M ENGINE=NDB; • We can add another undo file • ALTER LOGFILE GROUP lg_1 ADD UNDOFILE 'undo2' INITIAL_SIZE 12M ENGINE=NDB; • We currently don't auto-extend files – For more space, add files

Copyright 2006 MySQL AB 315 CREATE TABLESPACE

• CREATE TABLESPACE ts1 ADD DATAFILE 'datafile1' USE LOGFILE GROUP lg1 INITIAL_SIZE 32M ENGINE=NDB; • We can add another datafile too • ALTER TABLESPACE ts1 ADD DATAFILE 'datafile2' INITIAL_SIZE 48M ENGINE=NDB; • We currently don't auto-extend – So just add another file

Copyright 2006 MySQL AB 316 CREATE TABLE

• CREATE TABLE t1 ( pk1 INT NOT NULL PRIMARY KEY, b INT NOT NULL, c INT NOT NULL) TABLESPACE ts1 STORAGE DISK ENGINE=NDB; • b and c will be stored on disk • pk1 in memory (as it's indexed)

Copyright 2006 MySQL AB 317 ALTER TABLE (add index)

• If you add an index, fields moved to MM

Copyright 2006 MySQL AB 318 Checking free space

• INFORMATION_SCHEMA.FILES – Ongoing debate on if this is going to stay in INFORMATION_SCHEMA or move to performance schema. – The actual schema of the table is pretty decided though – Currently NDB only • Pressure your storage engine developers to add code for theirs! • Does require some knowledge about how space is allocated to tables

Copyright 2006 MySQL AB 319 How we allocate disk space

Copyright 2006 MySQL AB 320 Datafile

Copyright 2006 MySQL AB 321 Datafile

Extents

Copyright 2006 MySQL AB 322 Datafile t1

Copyright 2006 MySQL AB 323 Datafile t1

Copyright 2006 MySQL AB 324 Datafile t1

Copyright 2006 MySQL AB 325 Datafile t1

t2

Copyright 2006 MySQL AB 326 Datafile t1

t2

Copyright 2006 MySQL AB 327 Datafile t1

t2

Copyright 2006 MySQL AB 328 Datafile t1

t2

Copyright 2006 MySQL AB 329 Datafile t1

t2

Copyright 2006 MySQL AB 330 Datafile t1

t2

1 Free Extent

Copyright 2006 MySQL AB 331 I_S.FILES for Data files

• Datafile – what ts it belongs to • Extent Size (bytes) • Number of extents in file • Number of free extents • So, Free extents multiplied by extent size = free bytes that can be allocated to tables

Copyright 2006 MySQL AB 332 A useful VIEW

CREATE VIEW isf AS SELECT FILE_NAME, (TOTAL_EXTENTS * EXTENT_SIZE) AS 'Total', (FREE_EXTENTS * EXTENT_SIZE) AS 'Free', ( ((FREE_EXTENTS * EXTENT_SIZE)*100) / (TOTAL_EXTENTS * EXTENT_SIZE)) AS '% Free' FROM INFORMATION_SCHEMA.FILES WHERE ENGINE="ndbcluster" and FILE_TYPE = 'DATAFILE';

Copyright 2006 MySQL AB 333 A useful VIEW

mysql> select * from isf; Empty set (0.01 sec)

Copyright 2006 MySQL AB 334 A useful VIEW

mysql> select * from isf; Empty set (0.01 sec)

mysql> CREATE TABLESPACE ts1 ADD DATAFILE 'datafile.dat' USE LOGFILE GROUP lg1 INITIAL_SIZE 12M ENGINE NDB; Query OK, 0 rows affected (2.55 sec)

Copyright 2006 MySQL AB 335 A useful VIEW

mysql> select * from isf; +------+------+------+------+ | FILE_NAME | Total Extent Size | Total Free In Bytes | % Free Space | +------+------+------+------+ | datafile.dat | 12582912 | 12582912 | 100.0000 | | datafile.dat | 12582912 | 12582912 | 100.0000 | +------+------+------+------+ 2 rows in set (0.00 sec)

mysql>

Copyright 2006 MySQL AB 336 A useful VIEW mysql> select * from isf; +------+------+------+------+ | FILE_NAME | Total Extent Size | Total Free In Bytes | % Free Space | +------+------+------+------+ | datafile.dat | 12582912 | 12582912 | 100.0000 | | datafile.dat | 12582912 | 12582912 | 100.0000 | +------+------+------+------+ 2 rows in set (0.00 sec)

mysql> CREATE TABLE t1 (pk1 int not null primary key, b int not null, c int not null) tablespace ts1 storage disk engine ndb; Query OK, 0 rows affected (0.68 sec)

mysql> insert into t1 () values (); Query OK, 1 row affected, 3 warnings (0.00 sec)

Copyright 2006 MySQL AB 337 A useful VIEW

mysql> select * from isf; +------+------+------+------+ | FILE_NAME | Total Extent Size | Total Free In Bytes | % Free Space | +------+------+------+------+ | datafile.dat | 12582912 | 11534336 | 91.6667 | | datafile.dat | 12582912 | 11534336 | 91.6667 | +------+------+------+------+ 2 rows in set (0.01 sec)

mysql>

Copyright 2006 MySQL AB 338 I_S.FILES for UNDO files

• Free log space • If running out, maybe need to add more

Copyright 2006 MySQL AB 339 The future of I_S.FILES

• Perhaps reporting of (approximate or exact) free space inside extents allocated to tables – This means there'll be the generic row for the data file as well as rows for each table – Potentially very long I_S.FILES table • Other storage engines – Developers need to add fill_files_table to their handlerton – Potentially some code cleanups to make it easier to extend the table

Copyright 2006 MySQL AB 340 Some internal details....

Copyright 2006 MySQL AB 341 Buffer Manager

• Main memory database has no buffer manager – no need, everything is in memory – simplifies things a lot :) • Good Buffer Manager = Good Performance • There will be tuning and enhancements here in future releases

Copyright 2006 MySQL AB 342 NO-STEAL

• We use no-steal algorithm – no uncommitted data written to disk – Saves some disk writes – Limited transaction size (configurable – same as always)

Copyright 2006 MySQL AB 343 Optimized NR

• Traditional NR – copy everything over the wire – PRO: easy to implement correctly – PRO: not too bad for a few gigs of data – CON: very very bad for disk data • think 2TB going over the wire... ouch! • Details on optimized node recovery for NDB – in s1108-ronstrom.pdf – recovery from checkpoint, so don't have to copy everything

Copyright 2006 MySQL AB 344 Disk Data

Copyright 2006 MySQL AB 345 More data in Cluster

Copyright 2006 MySQL AB 346 While keeping Main Memory performance when needed

Copyright 2006 MySQL AB 347 Questions?

Copyright 2006 MySQL AB 348 In 4.1 and 5.0...

Copyright 2006 MySQL AB 349 Storage Nodes

Copyright 2006 MySQL AB 350 CREATE TABLE t1...

Storage Nodes

Copyright 2006 MySQL AB 351 t1 t1

t1 t1

Storage Nodes

Copyright 2006 MySQL AB 352 CREATE DATABASE foo

Storage Nodes

Copyright 2006 MySQL AB 353 CREATE DATABASE foo

CREATE DATABASE foo

Storage Nodes

Copyright 2006 MySQL AB 354 CREATE DATABASE foo CREATE DATABASE foo

CREATE DATABASE foo

Storage Nodes

Copyright 2006 MySQL AB 355 CREATE DATABASE foo CREATE DATABASE foo CREATE DATABASE foo

CREATE DATABASE foo

Storage Nodes

Copyright 2006 MySQL AB 356 But now...

Copyright 2006 MySQL AB 357 in 5.1

Copyright 2006 MySQL AB 358 CREATE DATABASE foo

Storage Nodes

Copyright 2006 MySQL AB 359 foo foo

foo foo

Storage Nodes

Copyright 2006 MySQL AB 360 in 4.1 and 5.0

Copyright 2006 MySQL AB 361 Table Auto Discovery

Copyright 2006 MySQL AB 362 in 5.1 Database and Table auto discovery

Copyright 2006 MySQL AB 363 Disk Data

in 5.1 Database and Table auto discovery

Copyright 2006 MySQL AB 364 Disk Data Integration with Replication in 5.1 Database and Table auto discovery

Copyright 2006 MySQL AB 365 Disk Data Integration with Replication in 5.1 Database and Table auto discovery

User Defined Partitioning

Copyright 2006 MySQL AB 366 Disk Data Integration with Replication Variable Sized Attributes in 5.1 Database and Table auto discovery

User Defined Partitioning

Copyright 2006 MySQL AB 367 Disk Data Integration with Replication Variable Sized Attributes in 5.1 Database and Table auto discovery On line Add/Drop Index User Defined Partitioning

Copyright 2006 MySQL AB 368 Disk Data Better error reporting Integration with Replication Variable Sized Attributes in 5.1 Database and Table auto discovery On line Add/Drop Index User Defined Partitioning

Copyright 2006 MySQL AB 369 Disk Data Better error reporting Integration with Replication Variable Sized Attributes in 5.1 Database and Table auto discovery On line Add/Drop Index User Defined Partitioning More internal QA

Copyright 2006 MySQL AB 370 Disk Data Better error reporting Integration with Replication Variable Sized Attributes in 5.1 Database and Table auto discovery Improvements in Cluster managementOn line Add/Drop Index User Defined Partitioning More internal QA

Copyright 2006 MySQL AB 371 Disk Data Better error reporting

Improving Documentation Integration with Replication Variable Sized Attributes in 5.1 Database and Table auto discovery Improvements in Cluster managementOn line Add/Drop Index User Defined Partitioning More internal QA

Copyright 2006 MySQL AB 372 Disk Data Better error reporting

Improving Documentation Integration with Replication Variable Sized Attributes Improving configuration in 5.1 Database and Table auto discovery Improvements in Cluster managementOn line Add/Drop Index User Defined Partitioning More internal QA

Copyright 2006 MySQL AB 373 Find out More!

• MySQL Online Documentation – http://dev.mysql.com/doc/refman/5.0/en/index.html • Cluster Forum – http://forums.mysql.com/list.php?25 • Cluster Mailing List – http://lists.mysql.com/cluster • Contact Sales – [email protected] • Contact Me – Monty Taylor – [email protected]

Copyright 2006 MySQL AB 374 Quick intro to MySQL

• SQL RDBMS – Aiming for full standards compliance • Reputation for speed, reliability and ease of use • MySQL 5.0 is GA, 5.1 is Beta – many Cluster improvements over 4.1 • 5.0 Supports (among other things) – SQL 2003 Stored procedures – Triggers – Updateable Views – Cursors – Precision math – data dictionary (INFORMATION_SCHEMA database) – distributed transactions (XA)

Copyright 2006 MySQL AB 375 Cluster across the versions...

• Cluster in 4.1 – Focused on High Availability • Cluster in 5.0 – Make it faster! • Cluster in 5.1 – Added new features • Online upgrade between minor versions (e.g. 5.0.18 to 5.0.19)

Copyright 2006 MySQL AB 376 Storage Engines

• Unique feature of MySQL – Theory is: no one way of storing a table is ideal for all situations – So we let you choose! – on a per table basis – it's a VFS for your database tables! • Several to choose from • Can create your own • Well known existing ones: – MyISAM, InnoDB, HEAP, CSV, BDB, Archive, federated • Cluster is a storage engine for MySQL – ENGINE=NDB or ENGINE=ndbcluster

Copyright 2006 MySQL AB 377 What is MySQL Cluster?

• An in-memory clustered storage engine for MySQL Size of Database×No Of Replicas×1.1 – Data and Indexes kept in main mePemr Noordye= No Of Data Nodes – Check-pointed to disk – Disk Data 5.1 • Highly Available – Designed for five 9s availability (99.999% uptime) – sub-second fail over – Hot (Online) Backup (no locks taken during backup) – Amount of redundancy is configurable: NoOfReplicas • Shared-nothing architecture – A cluster of commodity hardware – Choice for interconnects (TCP and SCI are the main ones) – No need for expensive SANs

Copyright 2006 MySQL AB 378 Management Server config.ini

[ndbd default] NoOfReplicas= 2 outside world DataMemory= 400M IndexMemory= 32M DataDir= /usr/local/mysql/cluster mysqld [ndbd] mysqld HostName= 192.168.0.40 ndb_mgmd

[ndbd] HostName= 192.168.0.41

[ndb_mgmd] DataDir= /usr/local/mysql/cluster HostName= 192.168.0.42

[mysqld] [mysqld] ndbd ndbd [mysqld]

Copyright 2006 MySQL AB 379 Nodes

• Storage nodes (ndbd) – Grouped into nodegroups – Each nodegroup holds a partition of the database – Number of nodes in a nodegroup configurable (NoOfReplicas) – Up to 48 in one cluster – One acts as a transaction coordinator • Management Node – Distributes configuration to other nodes • MySQL nodes – Sometimes referred to as API nodes – MySQL servers – Access just like you would any other mysql server

Copyright 2006 MySQL AB 380 Physical Requirements

• A node is a process, not a computer • Need at least three machines in a cluster – This avoids the split brain problem • Management server doesn't need a powerful machine • Data nodes – generally have a lot of memory – Disk IO requirements can be calculated from configuration parameters – not CPU bound – ndbd is single threaded • SQL nodes need the most CPU – have more of these than data nodes – mysqld is multithreaded

Copyright 2006 MySQL AB 381 CREATE TABLE

• CREATE TABLE fbhighscores ( level int, piclevel int, time float, name varchar(50) ) engine=ndb; • CREATE TABLE fblevelset ( name varchar(20), `num` int auto_increment, `level` text, PRIMARY KEY (name,num) ) ENGINE=ndbcluster;

Copyright 2006 MySQL AB 382 Data Distribution on 4 Nodes

Pn AccN Va $ r o l $ F1

Horizontal fragmentation F2 of Table 1 (4 fragments) F3 Fragments distributed 2 copies of data on nodes F4 Fx – primary replica Fx – secondary replica Table 1

Physical configuration Logical configuration

Node group 1 Node group 2 Node 1 Node 2 F1 F1 F2 F2

F3 F3 F4 F4 Node 3 Node 4 Node 1 Node 2 Node 3 Node 4

Copyright 2006 MySQL AB 383