XtreemFS - a distributed and replicated cloud file system
Michael Berlin, Zuse Institute Berlin
DESY Computing Seminar, 16.05.2011

Who we are
– Zuse Institute Berlin
– operates the HLRN supercomputer (#63+64)
– research in Computer Science and Mathematics
– Parallel and Distributed Systems Group
– led by Prof. Alexander Reinefeld (Humboldt University)
– distributed and failure-tolerant storage systems

Who we are (2)
Michael Berlin
– PhD student since 03/2011
– studied computer science at Humboldt-Universität zu Berlin
– Diplom thesis dealt with XtreemFS
– currently working on the XtreemFS client

Motivation
Problem: multiple copies of data. Where are they? Is each copy complete? Are there different versions?
[Diagram: a PC plus internal and external cluster nodes, each with its own storage: local file server, internal cluster storage, external cluster storage]
Motivation (2)
Problem: different access interfaces
[Diagram: laptop via 3G/Wi-Fi, PC, and external cluster nodes, each with a different access path: SCP, VPN+?/NFS, SSHFS]
Motivation (3)
XtreemFS goals: transparency and availability
[Diagram: laptop via 3G/Wi-Fi, PC, internal and external cluster nodes, all accessing one XtreemFS installation]
File Systems Landscape

Outline
1. XtreemFS Architecture
2. Client Interfaces
3. Read-Only Replication
4. Read-Write Replication
5. Metadata Replication
6. Customization through Policies
7. Security
8. Use Case: Mosgrid
9. Snapshots

XtreemFS Architecture (1)
– Volume on a metadata server: provides the hierarchical namespace
– File content on storage servers: accessed directly by clients
[Diagram: a PC and internal cluster nodes accessing a local file server and internal cluster storage]
XtreemFS Architecture (2)
Metadata and Replica Catalog (MRC):
– holds the volumes
Object Storage Devices (OSDs):
– file content is split into objects
– objects can be striped across OSDs
⇒ object-based file system architecture
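To make the object layout concrete, here is a minimal sketch of round-robin striping; the function name and the exact placement rule are illustrative assumptions, not the actual XtreemFS striping policy or API:

```python
def locate(offset, stripe_size, num_osds):
    """Map a byte offset to (object number, OSD index, offset inside the object)
    for a simple round-robin (RAID0-style) striping layout."""
    obj = offset // stripe_size        # objects are fixed-size chunks of the file
    osd = obj % num_osds               # objects are placed round-robin over the OSDs
    return obj, osd, offset % stripe_size
```

With 128 KiB objects striped over 3 OSDs, for example, byte 300000 of a file falls into object 2, which lives on the third OSD.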
Scalability
File I/O throughput:
– parallel I/O: scales with the number of OSDs
Storage capacity:
– OSDs can be added and removed
– OSDs may be used by multiple volumes
Metadata throughput:
– limited by the MRC hardware
– remedy: use many volumes spread over multiple MRCs
Accessing Components
Directory Service (DIR):
– central registry; all servers (MRC, OSD) register there with their id
– provides: the list of available volumes, the mapping of service id to URL, and the list of available OSDs
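The role of the DIR can be pictured with a tiny registry sketch; class and method names here are made up for illustration and the URLs are placeholders, not the real DIR protocol:

```python
class DirectoryService:
    """Toy model of the DIR: services register under an id,
    clients resolve ids to URLs."""
    def __init__(self):
        self.services = {}   # id -> {"type": ..., "url": ...}

    def register(self, service_id, service_type, url):
        self.services[service_id] = {"type": service_type, "url": url}

    def resolve(self, service_id):
        return self.services[service_id]["url"]

    def list_osds(self):
        return [s["url"] for s in self.services.values() if s["type"] == "OSD"]
```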
Client Interfaces
XtreemFS supports the POSIX interface and semantics.
– mount.xtreemfs: uses FUSE; runs on Linux, FreeBSD, OS X, and Windows (Dokan)
– libxtreemfs for Java and C++
[Diagram: laptop via 3G/Wi-Fi, PC, internal and external cluster nodes, each running mount.xtreemfs against the same XtreemFS installation]
Read-Only Replication
Requirement: the file must be marked as read-only.
Replica types:
a. Full replica: requires a complete copy
b. Partial replica: fills itself on demand; instantly ready to use
[Diagram: external cluster nodes reading from replicas on internal and external cluster storage]

Read-Only Replication (2)

Read-Only Replication (3)
– Receiver-initiated transfer at the object level; OSDs exchange object lists
– Filling strategies (which objects to fetch next):
  – in order
  – rarest first
– Prefetching available
– On-close replication: automatic replica creation when a file is closed
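The "rarest first" filling strategy can be sketched as follows; this is a toy model of the idea (fetch the object held by the fewest replicas first), not XtreemFS's actual scheduler:

```python
from collections import Counter

def rarest_first(needed, replica_object_lists):
    """Pick the next object to fetch: among the objects we still need,
    choose the one advertised by the fewest replicas (ties broken by id).

    needed: set of object ids still missing locally.
    replica_object_lists: one set of available object ids per remote replica.
    """
    counts = Counter()
    for objects in replica_object_lists:
        counts.update(o for o in objects if o in needed)
    # objects no replica advertises yet cannot be fetched at all
    candidates = [o for o in needed if counts[o] > 0]
    if not candidates:
        return None
    return min(candidates, key=lambda o: (counts[o], o))
```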
Read-Write Replication
Goals: availability and data safety while still allowing modifications.
[Diagram: a PC modifying important.cpp, which is replicated on a local file server and on internal cluster storage]

Read-Write Replication (2)
Primary/Backup:
Read-Write Replication (3)
Primary/Backup, step 1, lease acquisition:
– at most one valid lease per file
– revocation = lease timeout

Read-Write Replication (4)
Primary/Backup, step 2: data dissemination (after lease acquisition)
Read-Write Replication (5)
– Lease acquisition: instead of a central lock service, XtreemFS uses Flease, a scalable, majority-based lease negotiation algorithm
– Data dissemination update strategies: write all / read 1, or write quorum / read quorum
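The quorum-based update strategy can be illustrated with a toy model: overlapping write and read majorities guarantee that a read sees the latest successfully written version. This sketches the general write-quorum/read-quorum idea only, not XtreemFS's actual implementation:

```python
class QuorumFile:
    """Toy majority quorum: a write succeeds once a majority of replicas
    acknowledge it; a read asks a majority and returns the highest version."""
    def __init__(self, num_replicas):
        self.replicas = [{"version": 0, "data": b""} for _ in range(num_replicas)]
        self.majority = num_replicas // 2 + 1

    def write(self, data, reachable):
        if len(reachable) < self.majority:
            raise IOError("no write quorum")
        new_version = max(self.replicas[i]["version"] for i in reachable) + 1
        for i in reachable:
            self.replicas[i] = {"version": new_version, "data": data}

    def read(self, reachable):
        if len(reachable) < self.majority:
            raise IOError("no read quorum")
        return max((self.replicas[i] for i in reachable),
                   key=lambda r: r["version"])["data"]
```

With 3 replicas, writing to replicas {0, 1} and later reading from {1, 2} still returns the latest data, because any two majorities overlap in at least one replica.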
Metadata Replication
– primary/backup replication: each volume = a database; the database is replicated transparently
– leases are used to elect the primary; inserts/updates/deletes are replicated
– the database is a key/value store; own implementation: BabuDB
Customization through Policies
Example: which replica shall the client select? This is determined by policies.
[Diagram: a client on external cluster nodes choosing among replicas on internal and external cluster storage]
Policies:
– Authentication
– Authorization
– UID/GID mappings
– Replica placement
– Replica selection
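As an illustration of what a replica-selection policy does, here is a sketch of FQDN-based sorting: replicas whose hostname shares a longer domain suffix with the client are assumed to be closer and are ranked first. Names are examples; the real policy interface is in Java:

```python
def fqdn_match(a, b):
    """Number of domain labels two FQDNs share, counted from the right."""
    la, lb = a.split("."), b.split(".")
    n = 0
    while n < min(len(la), len(lb)) and la[-1 - n] == lb[-1 - n]:
        n += 1
    return n

def sort_replicas(client_fqdn, replica_fqdns):
    """FQDN-based replica selection: longest shared domain suffix first."""
    return sorted(replica_fqdns,
                  key=lambda r: fqdn_match(client_fqdn, r),
                  reverse=True)
```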
Customization through Policies (2)
Replica placement/selection policies:
– filter / sort / group the replica list
– available default policies: FQDN-based, datacenter map, Vivaldi (latency estimation)
– policies can be chained
– own policies possible (Java)
[Diagram: on open(), the MRC returns a sorted replica list to the client on node1.ext-cluster, which chooses between osd1.int-cluster and osd1.ext-cluster]

Security
– X.509 certificate support for authentication
– SSL to encrypt communication
[Diagram: laptop via 3G/Wi-Fi running mount.xtreemfs with a user certificate; external cluster nodes running mount.xtreemfs with a host certificate; both accessing XtreemFS]
Use case: Mosgrid
– Mosgrid eases running experiments in computational chemistry; grid resources are used through a web portal
– the portal allows users to submit compute jobs and retrieve the results
– XtreemFS serves as the global data repository
Use case: Mosgrid (2)
[Diagram: a browser talks to the web portal frontend (libxtreemfs, Java), which submits jobs via Unicore to cluster nodes running mount.xtreemfs with a host certificate; a PC accesses input data and results via mount.xtreemfs with a user certificate; the XtreemFS installation spans Berlin, Köln, and Dresden]

Snapshots
– Backups are needed in case of accidental deletion/modification or virus infections
Snapshot: a stable image of the file system at a given point in time.
[Diagram: a PC issues unlink("important.cpp"); the file survives in a snapshot on the local file server and internal cluster storage]

Snapshots (2)
– MRC: creates a snapshot on request
– OSDs: copy-on-write
  – on modify: create a new object instead of overwriting
  – on delete: only mark the object as deleted
[Timeline: a snapshot() taken at t0 while write("file.txt") calls produce versions V1 at t1 and V2 at t2]
Snapshots (3)
– no exact global time exists; clocks are only loosely synchronized
– assumption: maximum clock drift ε
– therefore: time-span-based snapshots covering [t0 - ε, t0 + ε]
[Timeline: writes of "file.txt" around a snapshot() at t0; a write whose timestamp falls inside the span may or may not be part of the snapshot]
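A sketch of how a time-span-based snapshot might resolve versions under loosely synchronized clocks with maximum drift ε. The inclusion rule for writes inside the uncertainty span (take the newest version stamped no later than t0 + ε) is an illustrative choice here, not necessarily XtreemFS's exact rule:

```python
def snapshot_version(versions, t_snap, epsilon):
    """Pick a file version for a snapshot taken at t_snap.

    versions: list of (timestamp, version_id), timestamps from the OSD clock.
    Returns the id of the newest version with timestamp <= t_snap + epsilon,
    or None if no version existed yet.
    """
    eligible = [(t, v) for t, v in versions if t <= t_snap + epsilon]
    if not eligible:
        return None
    return max(eligible)[1]   # newest eligible version wins
```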
Snapshots (4)
– OSDs limit the number of versions: no new version on every write
– instead: close-to-open semantics
– problem: the client sends no explicit close
– implicit close: create a new version if the last write was at least X seconds ago
– a cleanup tool deletes versions which belong to no snapshot
– snapshots on the directory level are possible
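The implicit-close rule can be sketched like this; the gap parameter stands in for the X seconds from the slide, and the class and method names are made up for illustration:

```python
class VersionedObject:
    """Sketch of implicit-close versioning: a write starts a new version only
    if the previous write is at least `gap` seconds old; otherwise the
    current version is overwritten in place."""
    def __init__(self, gap_seconds):
        self.gap = gap_seconds
        self.last_write = None
        self.versions = []          # payloads, newest last

    def write(self, data, now):
        if self.last_write is None or now - self.last_write >= self.gap:
            self.versions.append(data)   # implicit close: start a new version
        else:
            self.versions[-1] = data     # same open/close cycle: overwrite
        self.last_write = now
```

Writes arriving in quick succession collapse into one version, so the number of versions an OSD keeps stays bounded by the number of open/close cycles rather than the number of writes.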
Future Research
– Self-tuning
– Quota support
– Data de-duplication
– Hierarchical Storage Management
XtreemFS Software
– open source: www.xtreemfs.org
– development: 5 core developers at ZIB; integration tests for quality assurance
– community: users and bug reporters; mailing list with 102 subscribers
– release 1.3: experimental support for read/write replication and snapshots
Thank You!
References: http://www.xtreemfs.org/publications.php
www.contrail-project.eu The Contrail project is supported by funding under the Seventh Framework Programme of the European Commission: ICT, Internet of Services, Software and Virtualization. GA nr.: FP7-ICT-257438.