Building a transactional distributed store with Erlang

Alexander Reinefeld, Florian Schintke, Thorsten Schütt

Zuse Institute Berlin, onScale solutions GmbH

Transactional data store - What for?

- Web 2.0 services: shopping, banking, gaming, …

− don’t need full SQL semantics; a key/value DB often suffices, cf. Michael Stonebraker: “One size does not fit all”

- Scalability matters

− >10^4 accesses per second
− many concurrent writes

Traditional Web 2.0 hosting

[Figure: many clients accessing a traditionally hosted Web 2.0 service]

Now think big. Really BIG.

Not how fast our code is today, but:

− Can it “scale out”?
− Can it run in parallel? … distributed?
− Any common resources causing locking?

Asymptotic performance matters!

Our Approach: P2P makes it scalable

[Figure: an “arbitrary” number of clients accessing Web 2.0 services run on P2P nodes in datacenters]

Our Approach

Layered architecture (bottom-up):

- unreliable, distributed nodes
- P2P Layer: implements the crash stop model; scalability, eventual consistency
- Replication Layer: improves availability at the cost of consistency
- Transaction Layer: implements ACID; crash recovery model, strong data consistency
- Application Layer on top of a simple Key/Value Store

providing a scalable distributed data store:

P2P LAYER

Key/Value Store

- for storing “items” (= “key/value pairs”)

− synonyms: “key/value store”, “dictionary”, “map”, …

- just 3 ops

− insert(key, value)
− delete(key)
− lookup(key)

Example: Turing Award winners

  Key      Value
  Backus   1977
  Hoare    1980
  Karp     1985
  Knuth    1974
  Wirth    1984
  ...

Chord# - Distributed Key/Value Store

- key space: total order on items (strings, numbers, …)
- nodes have a random key as their position in the ring
- items are stored on their successor (clockwise)
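To make the placement rule concrete, here is a minimal Erlang sketch (not the Chord# code; module and function names are invented) that picks the clockwise successor of an item key from a sorted list of node keys:

%% Sketch: successor-based item placement on the ring.
-module(placement_sketch).
-export([responsible_node/2]).

%% The item is stored on the first node whose key is >= the item key
%% (its clockwise successor); if there is none, wrap around the ring.
responsible_node(ItemKey, NodeKeys) ->
    Sorted = lists:sort(NodeKeys),
    case [N || N <- Sorted, N >= ItemKey] of
        [Succ | _] -> Succ;
        []         -> hd(Sorted)
    end.

%% Example: responsible_node("Hoare", ["Backus", "Karp", "Knuth", "Wirth"])
%% returns "Karp", matching the ring figure below.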

[Figure: Chord# ring storing the Turing Award items; each node is responsible for the key range between its predecessor and itself, e.g. (Backus, …, Karp] and (Karp, …, Knuth]]

Routing Table and Data Lookup

Building the routing table

- log2 N pointers
- exponentially spaced pointers

Retrieving items

- ≤ log2 N hops
- Example: lookup(Hoare), started from the node responsible for (Backus, …, Karp]

[Figure: Chord# ring with the exponentially spaced routing pointers and the lookup path]
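A hedged sketch of the greedy routing step (invented module name, numeric keys assumed): forward to the farthest pointer that does not overshoot the item key. With exponentially spaced pointers each hop at least halves the remaining ring distance, which gives the ≤ log2 N bound.

%% Sketch of greedy key-based routing, not the Chord# sources.
-module(routing_sketch).
-export([next_hop/3]).

-define(RING, (1 bsl 32)).   % assumed size of a numeric key space

%% Clockwise distance from key From to key To on the ring.
dist(From, To) -> (To - From + ?RING) rem ?RING.

%% RT is the list of keys this node keeps pointers to.
next_hop(MyKey, ItemKey, RT) ->
    Goal   = dist(MyKey, ItemKey),
    Useful = [K || K <- RT, K =/= MyKey, dist(MyKey, K) =< Goal],
    case Useful of
        [] ->
            %% the item lies between this node and its successor
            responsible;
        _ ->
            %% farthest pointer that does not overshoot ItemKey
            lists:foldl(fun(K, Best) ->
                            case dist(MyKey, K) > dist(MyKey, Best) of
                                true  -> K;
                                false -> Best
                            end
                        end, hd(Useful), tl(Useful))
    end.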

Churn

- Nodes join, leave, or crash at any time

- Need a “failure detector” to check the aliveness of nodes (a minimal sketch follows below)

− failure detector may be wrong: Node dead? Or just slow network?

- Churn may cause inconsistencies

− need a local repair mechanism
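The simplest failure detector is a ping with a timeout. The sketch below (invented names and message protocol, not Scalaris code) also shows why such a detector is inherently unreliable: a timeout cannot distinguish a crashed node from a slow network.

%% Sketch of a naive ping-based failure detector.
-module(fd_sketch).
-export([alive/2]).

%% Ask NodePid to echo a fresh reference; report 'dead' on timeout.
alive(NodePid, TimeoutMs) ->
    Ref = make_ref(),
    NodePid ! {ping, self(), Ref},
    receive
        {pong, Ref} -> alive
    after TimeoutMs ->
        dead    % possibly a false suspicion (slow network, busy node, …)
    end.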

Responsibility Consistency

- Violated responsibility consistency caused by an imperfect failure detector: both N3 and N4 claim responsibility for item k

[Figure: ring with nodes N1-N4; N3 is wrongly suspected as crashed, so N3 and N4 both consider themselves responsible for key k]

Lookup Consistency

- Violated lookup consistency caused by an imperfect failure detector: lookup(k) at N1 → N3, but at N2 → N4

[Figure: ring with nodes N1-N4; N1 and N2 disagree on whether N3 has crashed, so their lookups for k return different nodes]

How often does this occur?

- Simulated nodes with imperfect failure detectors (a node detects another, alive node as dead probabilistically)

SUMMARY P2P LAYER

- Chord# provides a key/value store

− scalable

− efficient: log2 N hops

- Quality of the failure detector is crucial

- Need replication to prevent data loss

… improving availability:

REPLICATION LAYER

Replication

- Many schemes

− symmetric replication (see the sketch after this list)
− successor list replication
− …

- Must ensure data consistency

− need quorum-based methods
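For illustration, a sketch of symmetric replication key placement in the style of Ghodsi et al., assuming a numeric key space; the constants and module name are invented, not the Scalaris configuration.

%% Sketch: the i-th replica of key K lives at (K + i*RING/F) mod RING.
-module(symrep_sketch).
-export([replica_keys/2]).

-define(RING, (1 bsl 32)).   % assumed size of the key space

replica_keys(Key, F) ->
    Step = ?RING div F,
    [ (Key + I * Step) rem ?RING || I <- lists:seq(0, F - 1) ].

%% Example: replica_keys(12345, 4) yields four keys spread evenly
%% around the ring; each is then stored on its successor node.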

Quorum-based algorithms

- Enforce consistency by operating on majorities

[Figure: five replicas r1-r5; any three of them form a majority]

- Comes at the cost of increased latency

− but latency can be avoided by clever replica distribution in datacenters (cloud computing)

SUMMARY REPLICATION LAYER

- availability in the face of churn
- quorum algorithms

- But need transactional data access

… coping with concurrency:

TRANSACTION LAYER

Transaction Layer

- Transactions on P2P are challenging because of …

− churn

  - changing node responsibilities

− crash stop fault model

  - as opposed to crash recovery in a traditional DBMS

− imperfect failure detector

  - don’t know whether a node crashed or the network is just slow

Strong Data Consistency

- What is it?

− When a write is finished, all following reads return the new value.

- How to implement?

− Always read/write a majority ⌊f/2⌋ + 1 of the f replicas → the latest version is always in the read or write set (see the sketch below)

− Must ensure that the replication degree is ≤ f
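A small sketch of the read side under this rule, assuming each replica reply carries a version number (module and names invented): the read only succeeds with at least ⌊f/2⌋ + 1 replies, and because any two majorities intersect, the highest version among them is the latest committed value.

-module(quorum_read_sketch).
-export([latest/2]).

%% Replies is a list of {Version, Value} pairs collected from replicas.
latest(F, Replies) when length(Replies) >= (F div 2) + 1 ->
    {ok, lists:max(Replies)};    % tuples compare by Version first
latest(_F, _Replies) ->
    {error, no_majority}.        % not enough replicas answered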

Atomicity

- What is it?
− Make all or no changes!
− Either ‘commit’ or ‘abort’.

- How to implement?

− 2PC? Blocks if the transaction manager fails.
− 3PC? Too much latency.
− We use a variant of the Paxos Commit Protocol

  - non-blocking: votes of the transaction participants are sent to multiple “acceptors”

Adapted Paxos Commit

- Optimistic concurrency control with fallback
- Write
− 3 rounds
− non-blocking (fallback)
- Read even faster
− reads a majority of replicas
− just 1 round
- Succeeds when > f/2 nodes are alive

[Figure: message flow between the leader, the transaction managers (TMs), and the transaction participants (TPs) holding the replicated items; step 1 takes O(log N) hops, steps 2-6 take O(1) hops each, and the leader decides after a majority of answers]
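As a rough illustration of the decision rule only (a sketch after Gray & Lamport's Paxos Commit, not the adapted Scalaris protocol; the data layout and names are assumptions): the transaction commits iff every participant's 'prepared' vote has been learned by a majority of its acceptors.

-module(paxos_commit_sketch).
-export([decide/2]).

%% VotesPerTP: one list per transaction participant, containing the
%% votes (prepared | abort) its acceptors have learned so far.
decide(NumAcceptors, VotesPerTP) ->
    Majority = NumAcceptors div 2 + 1,
    Chosen = fun(Votes, Vote) ->
                 length([V || V <- Votes, V =:= Vote]) >= Majority
             end,
    AllPrepared = lists:all(fun(Vs) -> Chosen(Vs, prepared) end, VotesPerTP),
    AnyAbort    = lists:any(fun(Vs) -> Chosen(Vs, abort) end, VotesPerTP),
    if
        AllPrepared -> commit;      % every TP is known to be prepared
        AnyAbort    -> abort;
        true        -> undecided    % keep running Paxos rounds
    end.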

Transactions have two purposes: consistency of replicas & consistency across items

User request                 Operation on replicas
BOT                          BOT
  debit(a, 100)                debit(a1, 100); debit(a2, 100); debit(a3, 100)
  deposit(b, 100)              deposit(b1, 100); deposit(b2, 100); deposit(b3, 100)
EOT                          EOT
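A small self-contained Erlang sketch of the expansion shown above: each logical operation is turned into one operation per replica. The tuple encoding and replica naming are invented for this illustration; the real transaction layer additionally runs the quorum and commit protocols described before.

-module(expand_sketch).
-export([expand/2]).

%% expand(Ops, F) turns each logical operation into F replica operations.
expand(Ops, F) ->
    [ {Op, {Key, I}, Amount}
      || {Op, Key, Amount} <- Ops, I <- lists:seq(1, F) ].

%% Example:
%% expand([{debit, a, 100}, {deposit, b, 100}], 3).
%% -> [{debit,{a,1},100}, {debit,{a,2},100}, {debit,{a,3},100},
%%     {deposit,{b,1},100}, {deposit,{b,2},100}, {deposit,{b,3},100}]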

SUMMARY TRANSACTION LAYER

- Consistent update of items and replicas

- Mitigates some of the overlay oddities

− node failures

− asynchronicity

… demonstrator application: WIKIPEDIA

Wikipedia

- 50,000 requests/sec
− 95% are answered by squid proxies
− only 2,000 req./sec hit the backend

- Top 10 Web sites:
1. Yahoo!
2.
3. YouTube
4. Windows Live
5. MSN
6. Myspace
7. Wikipedia
8.
9. Blogger.com
10. Yahoo!カテゴリ

Public Wikipedia

[Figure: public Wikipedia architecture with web servers, search servers, NFS, and other servers]

Our Wikipedia Renderer

- Java

− Tomcat, Plog4u
- Jinterface
− interface to Erlang
- Erlang
− Key/Value Store = Chord# + Replication + Transactions

Mapping Wikipedia to Key/Value Store

- Mapping (a possible key layout is sketched after this slide)

                 key             value
  page content   title           list of Wikitext for all versions
  backlinks      title           list of titles
  categories     category name   list of titles

- For each insert or modify we must, in one write transaction:
− update backlinks
− update category page(s)
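One possible key layout for this mapping, as a short Erlang sketch; the tuple tags are invented here, since the talk only specifies what is stored, not the key encoding actually used.

-module(wiki_keys_sketch).
-export([content_key/1, backlinks_key/1, category_key/1]).

content_key(Title)     -> {content, Title}.      % -> list of Wikitext versions
backlinks_key(Title)   -> {backlinks, Title}.    % -> list of titles linking here
category_key(Category) -> {category, Category}.  % -> list of member titles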

Erlang Processes

− Chord#

− load balancing

− transaction framework

− supervision (OTP), sketched below
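A minimal sketch of what “supervision (OTP)” can look like: an OTP supervisor restarting the per-node processes described on the next slide. The child module names are placeholders (each would need to export start_link/0); they are not the Scalaris module names.

-module(node_sup_sketch).
-behaviour(supervisor).
-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    Children =
        [{failure_detector, {fd_stub, start_link, []},
          permanent, 5000, worker, [fd_stub]},
         {configuration,    {config_stub, start_link, []},
          permanent, 5000, worker, [config_stub]},
         {chord_node,       {chord_node_stub, start_link, []},
          permanent, 5000, worker, [chord_node_stub]},
         {database,         {db_stub, start_link, []},
          permanent, 5000, worker, [db_stub]}],
    {ok, {{one_for_one, 10, 10}, Children}}.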

Erlang Processes (per node)

- Failure Detector supervises Chord# nodes and sends crash messages when a failure is detected.
- Configuration provides access to the configuration file and maintains parameter changes made at runtime.
- Key Holder stores the identifier of the node in the overlay.
- Statistics Collector collects statistics information and forwards it to statistics servers.

- Chord# Node performs the main functionality of the node, e.g. successor list and routing table.
- Database stores the key/value pairs in each node.

Accessing Erlang Transactions from Java via Jinterface

void updatePage(String title, int oldVersion, String newText) {
    Transaction t = new Transaction();       // new transaction
    Page p = t.read(title);                  // read old version
    if (p.currentVersion != oldVersion)      // concurrent update?
        t.abort();
    else {
        t.write(p.add(newText));             // write new text
        // update categories
        for (Category c : p)
            t.write(t.read(c.name).add(title));
        t.commit();
    }
}

Performance on Linux Cluster

test results with load generator

[Figures: throughput with increasing access rate over time; CPU load with increasing access rate over time]

- 1,500 trans./sec on 10 CPUs
- 2,500 trans./sec on 16 CPUs (64 cores) and 128 DHT nodes

Implementation

- 11,000 lines of Erlang code

− 2,700 for transactions
− 1,300 for Wikipedia
− 7,000 for Chord# and infrastructure

- Distributed Erlang

− currently has weak security and limited scalability ⇒ we implemented our own transport layer on top of TCP

- Java for rendering and the user interface

SUMMARY

Summary

- P2P as a new paradigm for Web 2.0 hosting

− we support consistent, distributed write operations.

- Numerous applications:

− Internet, transactional online services, …

- Tradeoff: high availability vs. data consistency

Team

- Thorsten Schütt
- Florian Schintke
- Monika Moser
- Stefan Plantikow
- Alexander Reinefeld
- Nico Kruber
- Christian von Prollius
- Seif Haridi (SICS)
- Ali Ghodsi (SICS)
- Tallat Shafaat (SICS)

Publications

Chord#
- T. Schütt, F. Schintke, A. Reinefeld. A Structured Overlay for Multi-dimensional Range Queries. Euro-Par, August 2007.
- T. Schütt, F. Schintke, A. Reinefeld. Structured Overlay without Consistent Hashing: Empirical Results. GP2PC, May 2006.

Transactions
- M. Moser, S. Haridi. Atomic Commitment in Transactional DHTs. 1st CoreGRID Symposium, August 2007.
- T. Shafaat, M. Moser, A. Ghodsi, S. Haridi, T. Schütt, A. Reinefeld. Key-Based Consistency and Availability in Structured Overlay Networks. Infoscale, June 2008.

Wiki
- S. Plantikow, A. Reinefeld, F. Schintke. Transactions for Distributed Wikis on Structured Overlays. DSOM, October 2007.

Talks / Demos
- IEEE Scale Challenge, May 2008: 1st prize (live demo).