The LOCUS Distributed Operating System

The LOCUS Distributed Operating System
Bruce Walker, Gerald Popek, Robert English, Charles Kline and Greg Thiel
University of California at Los Angeles, 1983
Presented by Quan (Cary) Zhang

LOCUS is not just a paper…

History of distributed systems
Roughly speaking, we can divide the history of modern computing into the following eras:
- 1970s: Timesharing (1 computer with many users)
- 1980s: Personal computing (1 computer per user)
- 1990s: Parallel computing (many computers per user)
(Andrew S. Tanenbaum, in the Amoeba project)

LOCUS Distributed Operating System
- Network transparency: looks like one system, though each site runs a copy of the kernel
- Distributed file system with name transparency and replication
- Remote process creation and remote IPC
- Works across heterogeneous CPUs

Overview
- Distributed file system: flexible and automatic replication, nested transactions
- Remote processes
- Recovery from conflicts
- Dynamic reconfiguration

Distributed File System
- Naming is similar to UNIX: a single directory tree
- Transparent naming: a name is not a mapping to a location
- Transparent replication: filegroups are glued into the directory tree

Replication
- Improves availability, yet complicates updates
- Directory entries are stored "nearby" the sites that store the child files
- Performance: high levels of the naming tree should be highly replicated, low levels less so
- Essential to the environment (e.g. setup files)

Mechanism: support for replication
- Physical containers store a subset of a filegroup's files
- Every copy of a file resolves to the same <filegroup number, inode> pair
- Version vectors handle the complicated update cases

File System: Operation
- All operations keep the UNIX system-call interface intact
- Roles: Using Site (US), Storage Site (SS), Current Synchronization Site (CSS); a file is identified by <filegroup number, inode>
- Open/read (shown as diagrams in the slides): the US, SS and CSS cooperate, and read sharing is supported
- Close: only the last close is propagated to the SS, because the US can open the file several times

Name resolution
- Search for the pathname iteratively, starting from the working directory or the root
- Each search step ends with a <filegroup, inode> that matches the path component
- Possible optimization: no synchronization on the CSS, because a directory never presents an inconsistent picture (a directory entry is just a pointer)
- Instead of iterating through a remotely located tree, try to use migration

Modification
- Modifications are written to new pages alongside the old pages, followed by an atomic commit to the SS, and close
- Commit and abort (using an undo log): one copy of the file (a shadow page) is updated and committed
- The CSS is notified (the version vector changes), as are the storage sites
- Updated file propagation: other SSs "pull" the new version, also using the commit mechanism
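To make the shadow-page commit just described concrete, here is a minimal single-machine sketch in C. It assumes an in-memory block table, and every name in it (struct open_file, file_commit, and so on) is illustrative rather than the real LOCUS kernel interface. Modified blocks go to freshly allocated shadow pages; commit installs each shadow with a plain pointer swap, so a reader of the committed inode never observes a half-updated file.

```c
/* Minimal sketch of LOCUS-style shadow-page commit. All names are
 * illustrative, not the real LOCUS kernel API. Modified pages are written
 * to freshly allocated "shadow" blocks; commit is a pointer swap per
 * block, so committed state never shows a partial update; abort simply
 * frees the shadows. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NBLOCKS 8
#define BLKSZ   512

struct inode {
    char *blocks[NBLOCKS];   /* committed block pointers */
};

struct open_file {
    struct inode *ip;
    char *shadow[NBLOCKS];   /* NULL = unmodified, else pending new page */
};

/* Write within one block: copy-on-write into a shadow page. */
static void file_write(struct open_file *f, int blk, const char *data, size_t n)
{
    if (!f->shadow[blk]) {
        f->shadow[blk] = calloc(1, BLKSZ);
        if (f->ip->blocks[blk])
            memcpy(f->shadow[blk], f->ip->blocks[blk], BLKSZ);
    }
    memcpy(f->shadow[blk], data, n < BLKSZ ? n : BLKSZ);
}

/* Commit: install every shadow page with one pointer swap per block. */
static void file_commit(struct open_file *f)
{
    for (int b = 0; b < NBLOCKS; b++) {
        if (f->shadow[b]) {
            char *old = f->ip->blocks[b];
            f->ip->blocks[b] = f->shadow[b];   /* the atomic switch point */
            f->shadow[b] = NULL;
            free(old);
        }
    }
}

/* Abort: throw the shadows away; committed state is untouched. */
static void file_abort(struct open_file *f)
{
    for (int b = 0; b < NBLOCKS; b++) {
        free(f->shadow[b]);
        f->shadow[b] = NULL;
    }
}

int main(void)
{
    struct inode ino = {0};
    struct open_file f = { .ip = &ino };

    file_write(&f, 0, "draft", 6);
    file_abort(&f);                  /* nothing becomes visible */
    file_write(&f, 0, "final", 6);
    file_commit(&f);                 /* now readers see "final" */
    printf("%s\n", ino.blocks[0]);
    free(ino.blocks[0]);
    return 0;
}
```

In LOCUS itself the commit additionally updates the version vector at the CSS, after which the other storage sites pull the new pages using the same commit mechanism.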
Creation
- Storage locations and the number of copies for a new file are determined at create time
- Attempts to use the same storage sites as the parent directory, or the local site
- At remote sites, the inode is allocated within the physical container chosen by the filegroup

Machine-dependent files and remote objects
- Machine-dependent files: different versions of the same file, selected per process context
- Remote devices and IPC pipes (distributed memory)

Remote Processes
- Supports remote fork and exec (implemented as a special module)
- Copies the entire process memory to the destination machine: can be slow
- The run system call performs the equivalent of a local fork and a remote exec
- Shared data (e.g. file descriptors and handles) is protected using token passing
- The parent is notified upon child failures

Recovery
- "Partitioning will occur"
- Strict synchronization within a partition (each partition runs independently, using transactions)
- Merging: directories and mailboxes are easy
- Issue: a file may be both deleted and updated in different partitions; solution: propagate the change or the delete, whichever happened later
- Name conflicts: rename the files and notify by email
- Resolution is automatic at the CSS; otherwise the file is passed to a filetype-specific recovery program; otherwise the owner is mailed a message
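The version-vector check behind this merge logic fits in a few lines of C. The sketch below assumes one update counter per site and uses invented names (vv_compare, VV_CONFLICT); it is not taken from the LOCUS sources. If each copy's vector is ahead of the other's somewhere, both partitions modified the file, and the conflict is escalated as described above.

```c
/* Sketch of version-vector comparison at partition merge time: one
 * counter per site records how many updates that site committed to its
 * copy of the file. Illustrative code, not the LOCUS sources. */
#include <stdio.h>

#define NSITES 4

enum vv_order { VV_EQUAL, VV_DOMINATES, VV_DOMINATED, VV_CONFLICT };

static enum vv_order vv_compare(const int a[NSITES], const int b[NSITES])
{
    int a_ahead = 0, b_ahead = 0;
    for (int i = 0; i < NSITES; i++) {
        if (a[i] > b[i]) a_ahead = 1;
        if (a[i] < b[i]) b_ahead = 1;
    }
    if (a_ahead && b_ahead) return VV_CONFLICT;   /* both partitions wrote */
    if (a_ahead)            return VV_DOMINATES;  /* propagate copy a to b */
    if (b_ahead)            return VV_DOMINATED;  /* propagate copy b to a */
    return VV_EQUAL;
}

int main(void)
{
    int copy1[NSITES] = {2, 1, 0, 0};   /* updated during the partition... */
    int copy2[NSITES] = {1, 1, 0, 3};   /* ...and so was this copy */

    if (vv_compare(copy1, copy2) == VV_CONFLICT)
        puts("conflict: hand off to a filetype-specific recovery program");
    return 0;
}
```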
Dynamic Reconfiguration
Principles for reconfiguration:
- Delays (internal table reconstruction) should be negligible
- The same version stays available even upon failure (consistency)
- Clean up when a failure is detected: close affected files and sessions, and issue error return values
Partition protocol:
- Find the fully connected component
- Synchronize site tables: poll all suspected-good machines, then broadcast the result
Merge protocol:
- Forms larger partitions (centralized)
- Polls all sites asynchronously and merges partitions when found
- Finds a new CSS for all filegroups, and rebuilds global tables
Protocol synchronization:
- A passive site periodically checks the active site, and can restart the protocol

Experiment
- Bla, bla, bla, yet with no essential result

Conclusion
- Transparent distributed system behavior with high performance "is feasible"
- "Performance transparency", except for remote fork/exec
- Not much experience with replicated storage
- The high-performance protocol for partition membership works

My perspective on LOCUS
- Not done: parallel computing, because there is no thread concept yet
- Not done: process/thread migration for load balance
- Not done: security (SSH)
- Distributed file system vs. a file-system service in a distributed environment

Discussion
Why not simply store the kernel, shells, etc. on "local storage space"?
- I think you are right: not all resources/files should be distributed across the network of computers, e.g. FITCI
Can the RPC techniques we discussed earlier be implemented in this framework and help?
- Probably true: that RPC paper appeared in 1984, yet this paper was finished in 1983
The location transparency required/implemented in this work may incur an imbalance of performance cost; is this a problem? Can a centralized solution help?
- I think it is possible to add policies to attain load balance, e.g., assign the physical containers for the filegroups to sites respectively, yet there is a real challenge in getting information about the topology of the network
Compare and contrast "The Multikernel", which we discussed at the last Systems Research group meeting, with LOCUS. Could we say LOCUS is a very early stage of the multikernel approach?
- I think the distributed-system project (Amoeba) itself planned to design a multikernel, so the multikernel concept is not new; nonetheless, the multikernel we discussed last week can apply to one machine with different execution units (ISAs)
If the node responsible for a file is very busy and unable to handle network requests, then what happens to my request?
- I don't think this can happen, because replication never stops
It seems that distributed file systems don't tend to be very popular in the real world. Instead, networked organizations that need to make files available in multiple locations tend to concentrate all storage on one server (or bank of servers), using something like NFS. Why has distributed storage, like in this paper, not become more popular?
- Because of the scalability issue: as the number of nodes and partitioning events increases, I think the paper's approach of manual conflict resolution won't scale!
What do they mean by the "guess" provided for the in-core inode number? Is this sent by the US or the CSS, and what happens if the guess is incorrect? And why should guessing be necessary at all? Shouldn't the CSS know exactly which logical file it needs to access from the SS?
Is it possible to store a single file over multiple nodes, not in a replicated form, but in a striped form? That would mean that block A of a file is on PC 1, while block B of the file is on PC 2.
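The last question is left open in the notes. As a contrast with LOCUS's whole-file replication, a hypothetical round-robin striping map of the kind the question envisions might look like the sketch below (LOCUS itself does not stripe files; block_to_node is an invented name):

```c
/* Not how LOCUS works: a sketch of the striping the question asks about.
 * Block i of a file lands on node (i mod N) instead of every storage
 * site holding a full replica. */
#include <stdio.h>

#define NNODES 3

static int block_to_node(int block) { return block % NNODES; }

int main(void)
{
    for (int blk = 0; blk < 6; blk++)
        printf("block %d -> node PC %d\n", blk, block_to_node(blk));
    return 0;
}
```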