A Generic Basis for Distributed Kernel Infrastructure

KDDM: A Generic Basis for Distributed Kernel Infrastructure
Renaud Lottiaux - Kerlabs, Erich Focht - NEC
June 28th - OLS 2007 - www.kerlabs.com

Goal of this BoF
- Introduce the KDDM concept
- Measure the interest in such a sub-system
- Identify who could be interested in using it
- Get an idea of how far we could be from an integration in the mainline

Linux and Clustering
- The Linux community was quite skeptical regarding clustering
- However, several cluster projects are already included in the mainline, or close to being included:
  - Transparent Inter Process Communication (TIPC)
  - Distributed Lock Manager (DLM)
  - GFS
  - Oracle Cluster FS 2 (OCFS2)
  - Distributed IPC (DIPC)
  - ...
- This is just the beginning!
- It is a good time to set up the basis of a kernel-level distributed infrastructure: KDDM

Definition of KDDM
- A distributed cache of objects: a generic mechanism to share objects between nodes
- Ensures access to data which is transparent, efficient and coherent
- Objects are accessed through a set of functions: don't care about data localization, just use the data
- KDDM can host any kind of data: memory pages, data structures, ...

Object identifier
- Objects are identified using 3 values: an object id, a set id and a name space id
- A set is a collection of objects; you can freely define your sets (e.g. the pages of the same System V IPC segment, ...)
- A name space is a collection of sets; you can freely define your name spaces (e.g. regular Linux name spaces, ...)

Object coherence
- R/W-semaphore-like access to objects: single writer / multiple readers
- Objects are transparently moved / duplicated between nodes
- Duplication means coherence problems; coherence is managed using an "invalidation on write" mechanism
- Lighter coherence mechanisms will be implemented: update on time-out, hardware-assisted data sharing (specific network needed), ...

Basic KDDM interface

    void *_kddm_get_object (struct kddm_set *kddm_set, objid_t objid);
    void *_kddm_grab_object (struct kddm_set *kddm_set, objid_t objid);
    void *_kddm_put_object (struct kddm_set *kddm_set, objid_t objid);
    void *_kddm_remove_object (struct kddm_set *kddm_set, objid_t objid);

IO Linkers
- A KDDM set is instantiated by an IO linker, which determines the nature of the hosted objects and defines the object input/output functions
- One kind of IO linker per kind of object to share: memory pages, file cache pages, inodes, ...
- IO linkers also define the links between objects and physical nodes: file pages are linked to the node hosting the file data, while process memory pages are not linked to a given node

IO Linker functions
- IO linkers are mainly a set of function pointers defining the behavior of the KDDM set:
  - Object allocation / free
  - First touch
  - Object invalidation
  - Object export / import
  - Object synchronization
  - Etc.
- Default functions are provided for kmalloc-based objects
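
To make the single-writer / multiple-readers rule concrete, here is a minimal sketch (not taken from the slides) of a reader and a writer built on the basic interface above; the set pointer, the string payload stored in the object and the header path are assumptions for illustration only.

    /* Hedged sketch: typical read and write paths over one shared object.
     * "my_set", the string payload and the header path are illustrative
     * assumptions, not part of the slides. */
    #include <linux/kernel.h>
    #include <linux/string.h>
    #include <kddm/kddm.h>                  /* assumed header location */

    static void read_shared_object(struct kddm_set *my_set, objid_t id)
    {
            char *data;

            /* get = shared access: several nodes may hold a read copy */
            data = _kddm_get_object(my_set, id);
            printk("object %lu: %s\n", (unsigned long)id, data);
            _kddm_put_object(my_set, id);   /* release the object */
    }

    static void update_shared_object(struct kddm_set *my_set, objid_t id,
                                     const char *text)
    {
            char *data;

            /* grab = exclusive access: remote copies are invalidated
             * before the write is allowed (single writer) */
            data = _kddm_grab_object(my_set, id);
            strcpy(data, text);
            _kddm_put_object(my_set, id);
    }

The caller never knows, nor needs to know, on which node the object currently lives; the put simply ends the critical section.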
The IO Linker Structure

    struct iolinker_struct {
            int (*instantiate) (struct kddm_set *set, void *private_data, int master);
            void (*uninstantiate) (struct kddm_set *set, int destroy);
            int (*first_touch) (struct kddm_obj *entry, struct kddm_set *set, objid_t objid);
            int (*remove_object) (struct kddm_obj *entry, struct kddm_set *set, objid_t objid);
            int (*invalidate_object) (struct kddm_obj *entry, struct kddm_set *set, objid_t objid);
            int (*flush_object) (struct kddm_obj *entry, struct kddm_set *set, objid_t objid);
            int (*insert_object) (struct kddm_obj *entry, struct kddm_set *set, objid_t objid);
            int (*put_object) (struct kddm_obj *entry, struct kddm_set *set, objid_t objid);
            int (*sync_object) (struct kddm_obj *entry, struct kddm_set *set, objid_t objid);
            void (*change_access) (struct kddm_obj *entry, struct kddm_set *set, objid_t objid, dsm_state_t state);
            void *(*alloc_object) (struct kddm_obj *entry, struct kddm_set *set, objid_t objid);
            int (*import_object) (struct kddm_obj *entry, struct rpc_desc *desc);
            int (*export_object) (struct rpc_desc *desc, struct kddm_obj *obj_entry);
            void (*freeze_object) (struct kddm_obj *obj_entry);
            void (*warm_object) (struct kddm_obj *obj_entry);
            int (*is_frozen) (struct kddm_obj *obj_entry);
            char linker_name[16];
            iolinker_id_t linker_id;
    };

KDDM Architecture
(Diagram: distributed services are built on top of the KDDM core; the core relies on I/O linkers, which in turn interface with the local resource managers of each node.)

Outline
- General overview
- Hello world with KDDM!
- Quick KDDM architecture overview
- System V Shared Memory example

KDDM "Hello World!" (1/3)

    struct iolinker_struct hw_linker = {
            linker_name: "hw",
            linker_id:   1
    };

    struct kddm_set *hw_set;

    void hello_world_setup (void)
    {
            register_io_linker (1, &hw_linker);

            hw_set = create_new_kddm_set (kddm_def_ns, /* Default name space */
                                          1,           /* IO linker id */
                                          KDDM_SET_NOT_LINKED,
                                          64,          /* Size of objects to share */
                                          NULL, 0, 0);
    }

KDDM "Hello World!" (2/3)

    void hello_world_node0 (void)
    {
            char *buf_en, *buf_fr;

            buf_en = kddm_grab_object (hw_set, 0);
            strcpy (buf_en, "Hello ");
            kddm_put_object (hw_set, 0);

            buf_fr = kddm_grab_object (hw_set, 1);
            strcpy (buf_fr, "Bonjour ");
            kddm_put_object (hw_set, 1);
    }

    void hello_world_node1 (void)
    {
            char *buf_en, *buf_fr;

            buf_en = kddm_grab_object (hw_set, 0);
            strcpy (&buf_en[6], "world !");
            kddm_put_object (hw_set, 0);

            buf_fr = kddm_grab_object (hw_set, 1);
            strcpy (&buf_fr[8], "monde !");
            kddm_put_object (hw_set, 1);
    }

KDDM "Hello World!" (3/3)

    /* Node 0 */
    hello_world_setup ();
    hello_world_node0 ();

    /* Node 1 */
    hello_world_setup ();
    hello_world_node1 ();

    /* Reading the objects back, on either node: */
    char *buf;

    buf = kddm_get_object (hw_set, 0);
    printk ("%s\n", buf);                 /* Hello world !   */
    kddm_put_object (hw_set, 0);

    buf = kddm_get_object (hw_set, 1);
    printk ("%s\n", buf);                 /* Bonjour monde ! */
    kddm_put_object (hw_set, 1);

    kddm_remove_object (hw_set, 0);
    kddm_remove_object (hw_set, 1);
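
The hello-world linker above only fills in linker_name and linker_id and therefore relies on the default kmalloc-based behaviour. As an illustration only, here is a hedged sketch of what a slightly fuller linker could look like; the callback signatures come from iolinker_struct above, but the fixed object size, the linker id and the entry->object field name are assumptions.

    /* Hedged sketch: an IO linker that allocates and frees fixed-size
     * kmalloc'd objects. HW_OBJECT_SIZE, the linker id and the
     * "entry->object" field are assumptions made for illustration. */
    #include <linux/slab.h>
    #include <kddm/kddm.h>                  /* assumed header location */

    #define HW_OBJECT_SIZE 64

    static void *hw_alloc_object (struct kddm_obj *entry,
                                  struct kddm_set *set, objid_t objid)
    {
            /* Called when this node needs local storage for a copy of the object */
            return kmalloc (HW_OBJECT_SIZE, GFP_KERNEL);
    }

    static int hw_remove_object (struct kddm_obj *entry,
                                 struct kddm_set *set, objid_t objid)
    {
            /* Called when the object is removed from the set on this node */
            kfree (entry->object);          /* assumed name of the data pointer field */
            return 0;
    }

    static struct iolinker_struct hw_kmalloc_linker = {
            linker_name:   "hw-kmalloc",
            linker_id:     2,               /* hypothetical linker id */
            alloc_object:  hw_alloc_object,
            remove_object: hw_remove_object,
    };

Registering it would mirror the hello-world setup: register_io_linker (2, &hw_kmalloc_linker), followed by create_new_kddm_set () with IO linker id 2.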
How does that work?
(Diagram sequence: kddm_grab_object (hw_set, 0) on node 0 creates the object through the local I/O linker and the node writes "Hello" into it; when node 1 calls kddm_grab_object (hw_set, 0), the object is transferred into node 1's memory, where it becomes "Hello world !"; a final kddm_get_object (hw_set, 0) duplicates the object so that both nodes hold a read copy of "Hello world !".)

KDDM Design
(Diagram: the KDDM interface sits on top of the KDDM core, which groups an object server, a name space (NS) manager, a set manager, an object manager, a protocol engine and the IO linkers; below the core, the KDDM communication interface with hotplug support runs on a communication layer over TIPC.)

Building Distributed SHM with KDDM
- Based on KDDM, building a distributed SHM mechanism is quite simple
- We need to share:
  - Segment content: one SHM memory data IO linker, and a set of KDDM sets instantiated with this linker (one per memory segment)
  - SHM ids: one SHM ids IO linker, and a unique KDDM set hosting the existing ids cluster-wide

Distributed SHM implementation
- On a new segment creation (a hedged sketch of this hook follows the Backups slide below):
  - Hook in the kernel newseg function
  - Create a new KDDM set for the segment data
  - Create a new entry for the segment in the ids KDDM set
  - Make the link between the SHM id and the KDDM data set id
- On segment removal:
  - Hook in the kernel do_shm_rmid function
  - Destroy the data KDDM set
  - Remove the entry from the ids KDDM set
- On segment mapping:
  - Hook in the kernel shm_mmap function
  - Set the vm_ops field of the mapping to our set of functions
- On a segment lookup:
  - Hook in the kernel shm_lock function
  - Check if the requested id exists in the ids KDDM set (kddm_get_object)
- VM operations:
  - no_page: kddm_get_object / kddm_grab_object on the data KDDM set
  - wp_page (present in the 2.2 series): kddm_grab_object on the data KDDM set

Container use in Kerrighed SSI OS
- Used as a basic building block to implement:
  - Process memory migration
  - Cluster-wide memory sharing
  - Cluster-wide file cache sharing
  - Cluster-wide inode sharing
  - Cluster-wide locks
  - Signal sharing
  - Etc.
- Could also be used by other projects: DIPC, OpenSSI, etc.

Conclusion
- KDDM is a high-level abstraction to share data between nodes at kernel level
- KDDM can be used to (more) easily implement distributed services
- It could be a very good basis for a distributed kernel infrastructure

Backups
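
To close, here is a hedged sketch of the newseg hook outlined on the "Distributed SHM implementation" slide, kept deliberately close to the hello-world call sequence. The shm_id_object layout, the SHM_DATA_LINKER id, the shm_ids_set variable and the kddm_set_id () helper are all hypothetical names introduced for illustration; this is not the real Kerrighed implementation.

    /* Hedged sketch of the "on a new segment creation" steps: create a data
     * KDDM set for the segment and record the SHM id -> data set id link in
     * the unique, cluster-wide ids KDDM set. */
    #include <linux/errno.h>
    #include <linux/mm.h>
    #include <kddm/kddm.h>                  /* assumed header location */

    struct shm_id_object {                  /* hypothetical ids-set entry */
            int  shm_id;
            long data_set_id;               /* KDDM set holding the segment pages */
    };

    static int shm_newseg_hook (int shm_id)
    {
            struct kddm_set *data_set;
            struct shm_id_object *entry;

            /* One KDDM set per segment, instantiated with the SHM memory
             * data IO linker (SHM_DATA_LINKER is a hypothetical id) */
            data_set = create_new_kddm_set (kddm_def_ns, SHM_DATA_LINKER,
                                            KDDM_SET_NOT_LINKED, PAGE_SIZE,
                                            NULL, 0, 0);
            if (!data_set)
                    return -ENOMEM;

            /* Publish the new id in the ids KDDM set (shm_ids_set and
             * kddm_set_id () are hypothetical) */
            entry = kddm_grab_object (shm_ids_set, shm_id);
            entry->shm_id = shm_id;
            entry->data_set_id = kddm_set_id (data_set);
            kddm_put_object (shm_ids_set, shm_id);

            return 0;
    }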