XtreemFS - a distributed and replicated cloud file system

Michael Berlin, Zuse Institute Berlin

DESY Computing Seminar, 16.05.2011

Who we are

– Zuse Institute Berlin

– operates the HLRN supercomputer (#63+64)

– Research in Computer Science and Mathematics

– Parallel and Distributed Systems Group

– led by Prof. Alexander Reinefeld (Humboldt University)

– Distributed and failure-tolerant storage systems

Who we are

– Michael Berlin
– PhD student since 03/2011
– studied computer science at Humboldt-Universität zu Berlin
– Diplom thesis dealt with XtreemFS
– currently working on the XtreemFS client

Motivation

– Problem: multiple copies of data
  – Where?
  – Copy complete?
  – Different versions?

[Figure: copies of the data spread over a PC, internal and external cluster nodes, the local file server, and internal and external cluster storage]

Motivation (2)

– Problem: different access interfaces

[Figure: a laptop via 3G/Wi-Fi, a PC, and external cluster nodes each reach the local file server and external cluster storage through different paths such as SCP, VPN, NFS, and SSHFS]

Motivation (3)

– XtreemFS goals:
  – Transparency
  – Availability

[Figure: laptop (via 3G/Wi-Fi), PC, and internal and external cluster nodes all access a single XtreemFS installation]

File Systems Landscape

Outline

1. XtreemFS Architecture
2. Client Interfaces
3. Read-Only Replication
4. Read-Write Replication
5. Metadata Replication
6. Customization through Policies
7. Security
8. Use Case: Mosgrid
9. Snapshots

XtreemFS Architecture (1)

– Volume on a metadata server:
  – provides the hierarchical namespace
– File content on storage servers:
  – accessed directly by clients

[Figure: a PC and internal cluster nodes accessing the local file server and internal cluster storage]

XtreemFS Architecture (2)

Metadata and Replica Catalog (MRC):

– holds volumes

Object Storage Devices (OSDs):

– file content split into objects

– objects can be striped across OSDs

⇒ object-based file system architecture (striping sketched below)
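Because file content is split into fixed-size objects that can be striped round-robin across OSDs, a client can compute locally which object, and hence which OSD, holds a given byte offset. The following Java sketch illustrates that RAID0-style arithmetic; the class and parameter names are invented for illustration and are not the libxtreemfs API.

/** Illustrative RAID0-style striping arithmetic (not the actual libxtreemfs API). */
public class StripingSketch {
    private final int stripeSizeBytes;  // size of one object, e.g. 128 KiB (assumed value)
    private final int stripeWidth;      // number of OSDs the file is striped over

    public StripingSketch(int stripeSizeBytes, int stripeWidth) {
        this.stripeSizeBytes = stripeSizeBytes;
        this.stripeWidth = stripeWidth;
    }

    /** Number of the object that contains the given file offset. */
    public long objectNumber(long fileOffset) {
        return fileOffset / stripeSizeBytes;
    }

    /** Index of the OSD (0 .. stripeWidth-1) that stores the given object. */
    public int osdIndex(long objectNumber) {
        return (int) (objectNumber % stripeWidth);
    }

    public static void main(String[] args) {
        StripingSketch s = new StripingSketch(128 * 1024, 3);
        long offset = 500_000;
        long obj = s.objectNumber(offset);
        System.out.println("offset " + offset + " -> object " + obj + " on OSD " + s.osdIndex(obj));
    }
}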

Scalability

– File I/O throughput
  – parallel I/O: scales with the number of OSDs
– Storage capacity
  – OSDs can be added and removed
  – OSDs may be used by multiple volumes

– Metadata throughput
  – limited by the MRC hardware
  – use many volumes spread over multiple MRCs

Accessing Components

Directory Service (DIR):
– central registry
– all servers (MRC, OSD) register there with their id
– provides:
  – list of available volumes
  – mapping from service id to URL
  – list of available OSDs

Client Interfaces

– XtreemFS supports the POSIX interface and semantics
– mount.xtreemfs: uses FUSE
  – runs on Linux, FreeBSD, OS X and Windows (Dokan)
– libxtreemfs for Java and C++

[Figure: laptop (via 3G/Wi-Fi), PC, and internal and external cluster nodes each access XtreemFS through mount.xtreemfs]

Read-Only Replication

– Requirement: mark the file as read-only

– Replica types:
  a. Full replica:
     – requires a complete copy
  b. Partial replica:
     – fills itself on demand
     – instantly ready to use

[Figure: external cluster nodes read a file replicated on internal and external cluster storage]

Read-Only Replication (2)

Read-Only Replication (3)

– Receiver-initiated transfer at the object level
– OSDs exchange object lists

– Filling strategies:
  – in order
  – rarest first (see the sketch after this list)
– Prefetching available
– On-Close Replication: automatic replica creation
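As a rough illustration of the "rarest first" filling strategy, the sketch below picks the missing object that appears on the fewest other replicas, based on the object lists the OSDs exchange. All names are invented for illustration and are not part of the XtreemFS code base.

import java.util.List;
import java.util.OptionalLong;
import java.util.Set;

/** Illustrative "rarest first" selection over exchanged object lists (not XtreemFS code). */
public class RarestFirstSketch {

    /** objectListsPerOsd: for each remote replica, the set of object numbers it already stores. */
    public static OptionalLong pickNextObject(Set<Long> locallyMissing,
                                              List<Set<Long>> objectListsPerOsd) {
        long bestObject = -1;
        int bestCount = Integer.MAX_VALUE;
        for (long obj : locallyMissing) {
            int count = 0;
            for (Set<Long> objectList : objectListsPerOsd) {
                if (objectList.contains(obj)) {
                    count++;
                }
            }
            // Prefer the object held by the fewest replicas, but by at least one.
            if (count > 0 && count < bestCount) {
                bestCount = count;
                bestObject = obj;
            }
        }
        return bestObject >= 0 ? OptionalLong.of(bestObject) : OptionalLong.empty();
    }
}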

Read-Write Replication

– Availability
– Data safety
– Allow modifications

[Figure: a PC edits important.cpp, which is stored on both the local file server and the internal cluster storage]

Read-Write Replication (2)

Primary/Backup:

Read-Write Replication (3)

Primary/Backup:
1. Lease acquisition
   – at most one valid lease per file
   – revocation = lease timeout

Read-Write Replication (4)

Primary/Backup:
1. Lease acquisition
   – at most one valid lease per file
   – revocation = lease timeout
2. Data dissemination

Read-Write Replication (5)

– Lease acquisition
  – XtreemFS uses Flease rather than a central lock service
  – scalable
  – majority-based

– Data dissemination
  – update strategies:
    – write all, read 1
    – write quorum, read quorum (see the sketch below)
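To make the quorum variants concrete, the sketch below checks the standard overlap condition W + R > N and counts acknowledgements before a write is considered committed. It is a minimal illustration of the general technique, not the actual XtreemFS implementation.

/** Illustrative quorum bookkeeping for replicated writes (not the actual XtreemFS code). */
public class QuorumSketch {
    private final int writeQuorum;   // W
    private final int readQuorum;    // R

    public QuorumSketch(int numReplicas, int writeQuorum, int readQuorum) {
        // Overlapping quorums (W + R > N) guarantee every read sees the latest committed write.
        if (writeQuorum + readQuorum <= numReplicas) {
            throw new IllegalArgumentException("W + R must be greater than N");
        }
        this.writeQuorum = writeQuorum;
        this.readQuorum = readQuorum;
    }

    /** "Write all, read 1" is the special case W = N, R = 1. */
    public static QuorumSketch writeAllReadOne(int numReplicas) {
        return new QuorumSketch(numReplicas, numReplicas, 1);
    }

    /** A write is committed once a write quorum of replicas has acknowledged it. */
    public boolean isWriteCommitted(int acksReceived) {
        return acksReceived >= writeQuorum;
    }

    /** A read is complete once a read quorum of replicas has responded. */
    public boolean isReadComplete(int responsesReceived) {
        return responsesReceived >= readQuorum;
    }
}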

Metadata Replication

– Primary/backup replication
  – volume = database
  – transparently replicate the database
– use leases to elect the primary
– replicate insert/update/delete operations
– Database = key/value store
  – own implementation: BabuDB (primary/backup idea sketched below)
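A minimal sketch of the primary/backup idea for the metadata database: the primary applies each insert/update/delete locally and forwards it to the backups in log order. The interfaces below are invented for illustration and are not BabuDB's actual API.

import java.util.List;

/** Illustrative primary/backup replication of key/value operations (not BabuDB's API). */
public class MetadataPrimarySketch {

    /** One replicated operation on the key/value store. */
    public record LogEntry(long sequenceNo, String op, String key, String value) {}

    /** Minimal view of a backup replica that accepts log entries in order. */
    public interface BackupReplica {
        void apply(LogEntry entry);
    }

    private final List<BackupReplica> backups;
    private long nextSequenceNo = 0;

    public MetadataPrimarySketch(List<BackupReplica> backups) {
        this.backups = backups;
    }

    /** Called only while this node holds the lease and is therefore the primary. */
    public void replicate(String op, String key, String value) {
        LogEntry entry = new LogEntry(nextSequenceNo++, op, key, value);
        applyLocally(entry);
        for (BackupReplica backup : backups) {
            backup.apply(entry);   // a real system would do this asynchronously and await acks
        }
    }

    private void applyLocally(LogEntry entry) {
        // The insert/update/delete against the local key/value store would happen here.
    }
}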

Customization through Policies

– Example: which replica should the client select?
  – determined by policies

[Figure: external cluster nodes choosing between replicas on internal and external cluster storage]

– Policies:

– Authentication

– Authorization

– UID/GID mappings

– Replica placement

– Replica selection


Customization through Policies (2)

– Replica placement/selection policies:
  – filter / sort / group the replica list
  – available default policies:
    – FQDN-based
    – datacenter map
    – Vivaldi (latency estimation)
  – policies can be chained (see the sketch below)
  – own policies possible (Java)

[Figure: on open(), the MRC returns a replica list sorted for the requesting node (node1.ext-cluster), which then picks between osd1.int-cluster and osd1.ext-cluster]
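Custom policies are written in Java; the sketch below illustrates the general filter/sort idea and how several policies can be chained. The interface and policy implementations are invented for illustration and do not reflect the actual XtreemFS policy plugin API.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustrative replica selection policies and chaining (not the actual XtreemFS plugin API). */
public class ReplicaPolicySketch {

    /** Minimal replica description: where the OSD lives and an estimated latency. */
    public record Replica(String osdFqdn, String datacenter, double estimatedLatencyMs) {}

    /** A policy transforms the replica list (filtering, sorting, grouping). */
    public interface SelectionPolicy {
        List<Replica> apply(List<Replica> replicas, String clientFqdn);
    }

    /** FQDN-based: replicas sharing the client's domain suffix are sorted to the front. */
    public static final SelectionPolicy FQDN_BASED = (replicas, clientFqdn) -> {
        String clientDomain = clientFqdn.substring(clientFqdn.indexOf('.') + 1);
        List<Replica> sorted = new ArrayList<>(replicas);
        sorted.sort(Comparator.comparingInt((Replica r) -> r.osdFqdn().endsWith(clientDomain) ? 0 : 1));
        return sorted;
    };

    /** Latency-based: sort by estimated latency, e.g. as produced by Vivaldi coordinates. */
    public static final SelectionPolicy LATENCY_BASED = (replicas, clientFqdn) -> {
        List<Replica> sorted = new ArrayList<>(replicas);
        sorted.sort(Comparator.comparingDouble((Replica r) -> r.estimatedLatencyMs()));
        return sorted;
    };

    /** Policies can be chained: each one refines the list produced by the previous one. */
    public static List<Replica> chain(List<Replica> replicas, String clientFqdn,
                                      List<SelectionPolicy> policies) {
        List<Replica> result = replicas;
        for (SelectionPolicy policy : policies) {
            result = policy.apply(result, clientFqdn);
        }
        return result;
    }
}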

Security

– X.509 certificate support for authentication
– SSL to encrypt communication

[Figure: a laptop connected via 3G/Wi-Fi runs mount.xtreemfs with a user certificate, external cluster nodes run mount.xtreemfs with a host certificate, and both access the same XtreemFS installation]

Use case: Mosgrid

– Mosgrid:
  – eases running experiments in computational chemistry
  – uses grid resources through a web portal
  – the portal allows users to submit and retrieve compute jobs
– XtreemFS serves as the global data repository

Use case: Mosgrid (2)

[Figure: from a browser and PC, jobs are submitted and results retrieved through the web portal frontend (libxtreemfs, Java) and Unicore; cluster nodes mount XtreemFS with a host certificate, the user's machine with a user certificate; input data and results are stored in XtreemFS, which spans Berlin, Köln and Dresden]

Snapshots

– Backups are needed in case of
  – accidental deletion or modification
  – virus infections
– Snapshot: a stable image of the file system at a given point in time

[Figure: a PC issues unlink("important.cpp") while copies of important.cpp exist on the local file server and the internal cluster storage]

Snapshots (2)

– MRC: creates a snapshot when requested
– OSDs: copy-on-write (sketched below)
  – on modify: create a new object version instead of overwriting
  – on delete: only mark the object as deleted

[Figure: timeline – snapshot() at t0, then write("file.txt") creates version V1 at t1 and a second write("file.txt") creates version V2 at t2]
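A minimal sketch of copy-on-write versioning on an OSD, assuming an invented in-memory object store: a write appends a new timestamped version instead of overwriting, and a delete only adds a "deleted" marker so that older snapshots keep seeing the object.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative copy-on-write versioning of objects on an OSD (not the actual XtreemFS code). */
public class CowVersioningSketch {

    public record Version(long timestampMs, byte[] data, boolean deleted) {}

    // All retained versions per object number, ordered by timestamp.
    private final Map<Long, List<Version>> versionsPerObject = new HashMap<>();

    /** On modify: keep the old version and append a new one instead of overwriting. */
    public void write(long objectNumber, byte[] newData, long timestampMs) {
        versionsPerObject.computeIfAbsent(objectNumber, k -> new ArrayList<>())
                         .add(new Version(timestampMs, newData, false));
    }

    /** On delete: only mark the object as deleted so that earlier versions remain readable. */
    public void delete(long objectNumber, long timestampMs) {
        versionsPerObject.computeIfAbsent(objectNumber, k -> new ArrayList<>())
                         .add(new Version(timestampMs, new byte[0], true));
    }

    public List<Version> versionsOf(long objectNumber) {
        return versionsPerObject.getOrDefault(objectNumber, List.of());
    }
}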

Snapshots (3)

– No exact global time: loosely synchronized clocks
  – assumption: maximum drift ε
– Time span-based snapshots (see the sketch below)

[Figure: timeline – snapshot() at t0, with writes to file.txt falling before t0 − ε, inside the uncertainty span [t0 − ε, t0 + ε], and after it; depending on where a write falls, the snapshot sees version V1 (t1) or V2 (t2)]
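The exact rule for deciding which object version belongs to a snapshot is not spelled out on the slide, so the following sketch encodes one plausible reading of a time span-based snapshot: with a maximum clock drift ε, any version whose timestamp is not later than t0 + ε may have been written before the snapshot and is therefore eligible, and the newest such version is exposed. Treat this as an assumption, not as the actual XtreemFS algorithm.

import java.util.List;
import java.util.Optional;

/** Illustrative version selection for a time span-based snapshot (the rule is an assumption). */
public class SnapshotSelectionSketch {

    public record Version(long timestampMs, String id) {}

    /**
     * Returns the version exposed for a snapshot taken at snapshotTimeMs, given a maximum
     * clock drift of epsilonMs: the newest version that is not strictly newer than the
     * uncertainty span [snapshotTimeMs - epsilonMs, snapshotTimeMs + epsilonMs].
     */
    public static Optional<Version> versionForSnapshot(List<Version> versionsOldestFirst,
                                                       long snapshotTimeMs,
                                                       long epsilonMs) {
        Version candidate = null;
        for (Version v : versionsOldestFirst) {
            if (v.timestampMs() <= snapshotTimeMs + epsilonMs) {
                candidate = v;
            }
        }
        return Optional.ofNullable(candidate);
    }
}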

Snapshots (4)

– OSDs: limit the number of versions
  – no new version on every write
  – instead: close-to-open semantics
– problem: the client sends no explicit close
– implicit close (see the sketch below): create a new version if the last write was at least X seconds ago
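A small sketch of the implicit-close rule: a write starts a new version if the previous write happened at least X seconds earlier. The threshold is kept as a constructor parameter because the deck leaves X unspecified; the class is illustrative only.

/** Illustrative implicit-close detection for object versioning ("X seconds" left configurable). */
public class ImplicitCloseSketch {
    private final long idleThresholdMs;        // the "X seconds" from the slide, not a fixed value
    private long lastWriteMs = Long.MIN_VALUE; // timestamp of the most recent write, if any

    public ImplicitCloseSketch(long idleThresholdMs) {
        this.idleThresholdMs = idleThresholdMs;
    }

    /** Returns true if this write starts a new version, i.e. an implicit close happened before it. */
    public boolean onWrite(long nowMs) {
        boolean startsNewVersion = lastWriteMs == Long.MIN_VALUE   // the very first write starts a version
                || nowMs - lastWriteMs >= idleThresholdMs;
        lastWriteMs = nowMs;
        return startsNewVersion;
    }
}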

– Cleanup tool: deletes versions that belong to no snapshot
– Snapshots at the directory level are possible

Future Research

– Self-tuning
– Quota support
– Data de-duplication
– Hierarchical storage management

XtreemFS Software

– Open source: www.xtreemfs.org
– Development:
  – 5 core developers at ZIB
  – integration tests for quality assurance
– Community:
  – users and bug reporters
  – mailing list with 102 subscribers
– Release 1.3:
  – experimental support for read/write replication and snapshots

Thank You!

– References: http://www.xtreemfs.org/publications.php
– www.contrail-project.eu
– The Contrail project is supported by funding under the Seventh Framework Programme of the European Commission: ICT, Internet of Services, Software and Virtualization. GA nr.: FP7-ICT-257438.
