Workload Schedulers - Genesis, Algorithms and Comparisons
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
Univa Grid Engine 8.0 Superior Workload Management for an Optimized Data Center
Univa Grid Engine 8.0 Superior Workload Management for an Optimized Data Center Why Univa Grid Engine? Overview Univa Grid Engine 8.0 is our initial Grid Engine is an industry-leading distributed resource release of Grid Engine and is quality management (DRM) system used by hundreds of companies controlled and fully supported by worldwide to build large compute cluster infrastructures for Univa. This release enhances the Grid processing massive volumes of workload. A highly scalable and Engine open source core platform with reliable DRM system, Grid Engine enables companies to produce higher-quality products, reduce advanced features and integrations time to market, and streamline and simplify the computing environment. into enterprise-ready systems as well as cloud management solutions that Univa Grid Engine 8.0 is our initial release of Grid Engine and is quality controlled and fully allow organizations to implement the supported by Univa. This release enhances the Grid Engine open source core platform with most scalable and high performing advanced features and integrations into enterprise-ready systems as well as cloud management distributed computing solution on solutions that allow organizations to implement the most scalable and high performing distributed the market. Univa Grid Engine 8.0 computing solution on the market. marks the start for a product evolution Protect Your Investment delivering critical functionality such Univa Grid Engine 8.0 makes it easy to create a as improved 3rd party application cluster of thousands of machines by harnessing the See why customers are integration, advanced capabilities combined computing power of desktops, servers and choosing Univa Grid Engine for license and policy management clouds in a simple, easy-to-administer environment. -
Solaris-Cluster-Businesscontinuity-168285.Pdf
An Oracle White Paper September 2010 Oracle Solaris and Oracle Solaris Cluster: Extending Oracle Solaris for Business Continuity Oracle White Paper— Oracle Solaris and Oracle Solaris Cluster: Extending Oracle Solaris for Business Continuity Executive Summary.............................................................................1 Introduction..........................................................................................3 Traditional Solution Components ....................................................5 Oracle Solaris Cluster for Business Continuity....................................6 Oracle Solaris Cluster......................................................................6 Introduction—Basic Clustering ........................................................8 Data Availability .............................................................................10 Applications Availability .................................................................11 Network Availability .......................................................................12 Virtualization and Oracle Solaris Cluster .......................................13 Disaster Recovery—Campus and Beyond ....................................16 A Single HA and DR Solution for Multitier Oracle Applications and Databases ..........................................................17 Oracle Solaris ....................................................................................19 Resource Management .................................................................19 -
Push-Based Job Submission Using Reverse SSH Connections
rvGAHP – Push-Based Job Submission using Reverse SSH Connections Scott Callaghan Gideon Juve Karan Vahi University of Southern California USC Information Sciences Institute USC Information Sciences Institute Los Angeles, California Marina Del Rey, California Marina Del Rey, California [email protected] [email protected] [email protected] Philip J. Maechling Thomas H. Jordan Ewa Deelman University of Southern California University of Southern California USC Information Sciences Institute Los Angeles, California Los Angeles, California Marina Del Rey, California [email protected] [email protected] [email protected] ABSTRACT using Reverse SSH Connections. In WORKS’17: WORKS’17: 12th Workshop Computational science researchers running large-scale scientific on Workflows in Support of Large-Scale Science, November 12–17, 2017, Denver, CO, USA. ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/3150994. workflow applications often want to run their workflows onthe 3151003 largest available compute systems to improve time to solution. Workflow tools used in distributed, heterogeneous, high perfor- mance computing environments typically rely on either a push- 1 INTRODUCTION based or a pull-based approach for resource provisioning from these Modern scientific applications typically require the execution of compute systems. However, many large clusters have moved to multiple codes, may include computational and data dependencies, two-factor authentication for job submission, making traditional au- and contain varied computational models ranging from a bag of tomated push-based job submission impossible. On the other hand, tasks to a monolithic parallel code. These applications are often pull-based approaches such as pilot jobs may lead to increased com- complex software suites with substantial computational and data plexity and a reduction in node-hour efficiency. -
Introduction to Python University of Oxford Department of Particle Physics
Particle Physics Cluster Infrastructure Introduction University of Oxford Department of Particle Physics October 2019 Vipul Davda Particle Physics Linux Systems Administrator Room 661 Telephone: x73389 [email protected] Particle Physics Computing Overview 1 Particle Physics Linux Infrastructure Distributed File System gluster NFS /data Worker Nodes /data/atlas /data/lhcb NFS HTCondor /home Batch Server Interactive Servers physics_s/eduroam Network Printer Managed Laptops Managed Desktops Particle Physics Computing Overview 2 Introduction to the Unix Operating System Unix is a Multi-User/Multi-Tasking operating system. Developed in 1969 at AT&T’s Bell Labs by Ken Thompson (Unix) Dennis Ritchie (C) Unix is written in C programming language. Unix was originally a command-line OS, but now has a graphical user interface. It is available in many different forms: Linux , Solaris, AIX, HP-UX, freeBSD It is a well-suited environment for program development: C, C++, Java, Fortran, Python… Unix is mainly used on large servers for scientific applications. Particle Physics Computing Overview 3 Linux Distributions Source: https://www.muylinux.com/2009/04/24/logos-de-distribuciones-gnulinux/ Particle Physics Computing Overview 4 Particle Physics Linux Infrastructure Particle Physics uses CentOS Linux on the cluster. It is a free version of RedHat Enterprise Linux. Particle Physics Computing Overview 5 Basic Linux Commands Particle Physics Computing Overview 6 Basic Linux Commands o ls - list directory contents ls –l - long listing -
The Translational Journey of the Htcondor-CE
Journal of Computational Science xxx (xxxx) xxx Contents lists available at ScienceDirect Journal of Computational Science journal homepage: www.elsevier.com/locate/jocs Principles, technologies, and time: The translational journey of the HTCondor-CE Brian Bockelman a,*, Miron Livny a,b, Brian Lin b, Francesco Prelz c a Morgridge Institute for Research, Madison, USA b Department of Computer Sciences, University of Wisconsin-Madison, Madison, USA c INFN Milan, Milan, Italy ARTICLE INFO ABSTRACT Keywords: Mechanisms for remote execution of computational tasks enable a distributed system to effectively utilize all Distributed high throughput computing available resources. This ability is essential to attaining the objectives of high availability, system reliability, and High throughput computing graceful degradation and directly contribute to flexibility, adaptability, and incremental growth. As part of a Translational computing national fabric of Distributed High Throughput Computing (dHTC) services, remote execution is a cornerstone of Distributed computing the Open Science Grid (OSG) Compute Federation. Most of the organizations that harness the computing capacity provided by the OSG also deploy HTCondor pools on resources acquired from the OSG. The HTCondor Compute Entrypoint (CE) facilitates the remote acquisition of resources by all organizations. The HTCondor-CE is the product of a most recent translational cycle that is part of a multidecade translational process. The process is rooted in a partnership, between members of the High Energy Physics community and computer scientists, that evolved over three decades and involved testing and evaluation with active users and production infrastructures. Through several translational cycles that involved researchers from different organizations and continents, principles, ideas, frameworks and technologies were translated into a widely adopted software artifact that isresponsible for provisioning of approximately 9 million core hours per day across 170 endpoints. -
Design of an Accounting and Metric-Based Cloud-Shifting
Design of an Accounting and Metric-based Cloud-shifting and Cloud-seeding framework for Federated Clouds and Bare-metal Environments Gregor von Laszewski1*, Hyungro Lee1, Javier Diaz1, Fugang Wang1, Koji Tanaka1, Shubhada Karavinkoppa1, Geoffrey C. Fox1, Tom Furlani2 1Indiana University 2Center for Computational Research 2719 East 10th Street University at Buffalo Bloomington, IN 47408. U.S.A. 701 Ellicott St [email protected] Buffalo, New York 14203 ABSTRACT and its predecessor meta-computing elevated this goal by not We present the design of a dynamic provisioning system that is only introducing the utilization of multiple queues accessible to able to manage the resources of a federated cloud environment the users, but by establishing virtual organizations that share by focusing on their utilization. With our framework, it is not resources among the organizational users. This includes storage only possible to allocate resources at a particular time to a and compute resources and exposes the functionality that users specific Infrastructure as a Service framework, but also to utilize need as services. Recently, it has been identified that these them as part of a typical HPC environment controlled by batch models are too restrictive, as many researchers and groups tend queuing systems. Through this interplay between virtualized and to develop and deploy their own software stacks on non-virtualized resources, we provide a flexible resource computational resources to build the specific environment management framework that can be adapted based on users' required for their experiments. Cloud computing provides here a demands. The need for such a framework is motivated by real good solution as it introduces a level of abstraction that lets the user data gathered during our operation of FutureGrid (FG). -
Beginner's Guide to Oracle Grid Engine 6.2 Oracle White Paper—Beginner's Guide to Oracle Grid Engine 6.2
An Oracle White Paper August 2010 Beginner's Guide to Oracle Grid Engine 6.2 Oracle White Paper—Beginner's Guide to Oracle Grid Engine 6.2 Executive Overview ..................................................................................... 1 Introduction .................................................................................................. 1 Chapter 1: Introduction to Oracle Grid Engine ............................................ 3 Oracle Grid Engine Jobs ......................................................................... 3 Oracle Grid Engine Component Architecture .......................................... 3 Oracle Grid Engine Basics ...................................................................... 5 Chapter 2: Oracle Grid Engine Scheduler ................................................... 10 Job Selection ........................................................................................... 10 Job Scheduling ........................................................................................ 17 Other Scheduling Features ...................................................................... 18 Additional Information on Job Scheduling ............................................... 20 Chapter 3: Planning an Oracle Grid Engine Installation .............................. 21 Installation Layout .................................................................................... 21 QMaster Data Spooling ........................................................................... 22 Execution Daemon Data -
7.1 Task Computing 7.2 Task-Based Application Models 7
15CS565 CLOUD COMPUTING MODULE – III High-Throughput Computing – Task Programming (Chapter 7) Task computing is a wide area of distributed system programming encompassing several different models of architecting distributed applications, A task represents a program, which require input files and produce output files as a result of its execution. Applications are then constituted of a collection of tasks. These are submitted for execution and their output data are collected at the end of their execution. This chapter characterizes the abstraction of a task and provides a brief overview of the distrib- uted application models that are based on the task abstraction. The Aneka Task Programming Model is taken as a reference implementation to illustrate the execution of bag-of-tasks (BoT) applications on a distributed infrastructure. 7.1 Task computing 7.1.1 Characterizing a task 7.1.2 Computing categories 1 High-performance computing 2 High-throughput computing 3 Many-task computing 7.1.3 Frameworks for task computing 1. Condor 2. Globus Toolkit 3. Sun Grid Engine (SGE) 4. BOINC 5. Nimrod/G 7.2 Task-based application models 7.2.1 Embarrassingly parallel applications 7.2.2 Parameter sweep applications 7.2.3 MPI applications 7.2.4 Workflow applications with task dependencies 1 What is a workflow? 2 Workflow technologies 1. Kepler, 2. DAGMan, 3. Cloudbus Workflow Management System, and 4. Offspring. 7.3 Aneka task-based programming 7.3.1 Task programming model 7.3.2 Developing applications with the task model 7.3.3 Developing a parameter sweep application 7.3.4 Managing workflows 7.1 Task computing A task identifies one or more operations that produce a distinct output and that can be isolated as a single logical unit. -
Hepix Report
HEPiX Report Helge Meinhard, Pawel Grzywaczewski, Romain Wartel / CERN-IT Post-C5/Computing Seminar 03 December 2010 CERN IT Department CH-1211 Genève 23 Switzerland www.cern.ch/it Outline • Meeting organisation, site reports, (benchmarking,) infrastructure (Helge Meinhard) • Storage, OS and applications, miscellaneous (Pawel Grzywaczewski) • Virtualisation, security and networking, grid and cloud (Romain Wartel) HEPiX report – Helge.Meinhard at cern.ch – 03-Dec-2010 HEPiX • Global organisation of service managers and support staff providing computing facilities for HEP • Covering all platforms of interest (Unix/Linux, Windows, Grid, …) • Aim: Present recent work and future plans, share experience, advise managers • Meetings ~ 2 / y (spring in Europe, autumn typically in North America) HEPiX report – Helge.Meinhard at cern.ch – 03-Dec-2010 HEPiX Autumn 2010 (1) • Held 01 – 05 November at Cornell University, Ithaca NY – CESR: Electron-positron storage ring; CLEO: experiment doing a lot of interesting b physics – New player in the HEPiX field at as site… but a well-known face: Chuck Boeheim, the previous north-American co-chair of HEPiX – Good local organisation – Nice auditorium in conference hotel, basically unlimited coffee supply – First face-to-face meeting in 2010 for most participants HEPiX report – Helge.Meinhard at cern.ch – 03-Dec-2010 HEPiX Autumn 2010 (2) • Format: Pre-defined tracks with conveners and invited speakers per track – Still room for spontaneous talks – either fit into one of the tracks, or classified as ‘miscellaneous’ -
A Comprehensive Perspective on Pilot-Job Systems
A Comprehensive Perspective on Pilot-Job Systems ∗ Matteo Turilli Mark Santcroos Shantenu Jha RADICAL Laboratory, ECE RADICAL Laboratory, ECE RADICAL Laboratory, ECE Rutgers University Rutgers University Rutgers University New Brunswick, NJ, USA New Brunswick, NJ, USA New Brunswick, NJ, USA [email protected] [email protected] [email protected] Pilot-Jobs provide a multi-stage mechanism to execute ABSTRACT workloads. Resources are acquired via a placeholder job and subsequently assigned to workloads. Pilot-Jobs are having Pilot-Job systems play an important role in supporting dis- a high impact on scientific and distributed computing [1]. tributed scientific computing. They are used to consume more They are used to consume more than 700 million CPU hours than 700 million CPU hours a year by the Open Science Grid a year [2] by the Open Science Grid (OSG) [3, 4] communi- communities, and by processing up to 1 million jobs a day for ties, and process up to 1 million jobs a day [5] for the ATLAS the ATLAS experiment on the Worldwide LHC Computing experiment [6] on theLarge Hadron Collider (LHC) [7] Com- Grid. With the increasing importance of task-level paral- puting Grid (WLCG) [8, 9]. A variety of Pilot-Job systems lelism in high-performance computing, Pilot-Job systems are used on distributed computing infrastructures (DCI): are also witnessing an adoption beyond traditional domains. Glidein/GlideinWMS [10, 11], the Coaster System [12], DI- Notwithstanding the growing impact on scientific research, ANE [13], DIRAC [14], PanDA [15], GWPilot [16], Nim- there is no agreement upon a definition of Pilot-Job system rod/G [17], Falkon [18], MyCluster [19] to name a few. -
EGI Federated Platforms Supporting Accelerated Computing
EGI federated platforms supporting accelerated computing PoS(ISGC2017)020 Paolo Andreetto Jan Astalos INFN, Sezione di Padova Institute of Informatics Slovak Academy of Sciences Via Marzolo 8, 35131 Padova, Italy Bratislava, Slovakia E-mail: [email protected] E-mail: [email protected] Miroslav Dobrucky Andrea Giachetti Institute of Informatics Slovak Academy of Sciences CERM Magnetic Resonance Center Bratislava, Slovakia CIRMMP and University of Florence, Italy E-mail: [email protected] E-mail: [email protected] David Rebatto Antonio Rosato INFN, Sezione di Milano CERM Magnetic Resonance Center Via Celoria 16, 20133 Milano, Italy CIRMMP and University of Florence, Italy E-mail: [email protected] E-mail: [email protected] Viet Tran Marco Verlato1 Institute of Informatics Slovak Academy of Sciences INFN, Sezione di Padova Bratislava, Slovakia Via Marzolo 8, 35131 Padova, Italy E-mail: [email protected] E-mail: [email protected] Lisa Zangrando INFN, Sezione di Padova Via Marzolo 8, 35131 Padova, Italy E-mail: [email protected] While accelerated computing instances providing access to NVIDIATM GPUs are already available since a couple of years in commercial public clouds like Amazon EC2, the EGI Federated Cloud has put in production its first OpenStack-based site providing GPU-equipped instances at the end of 2015. However, many EGI sites which are providing GPUs or MIC coprocessors to enable high performance processing are not directly supported yet in a federated manner by the EGI HTC and Cloud platforms. In fact, to use the accelerator cards capabilities available at resource centre level, users must directly interact with the local provider to get information about the type of resources and software libraries available, and which submission queues must be used to submit accelerated computing workloads. -
Bright Cluster Manager 9.0 User Manual
Bright Cluster Manager 9.0 User Manual Revision: 406b1b9 Date: Fri Oct 1 2021 ©2020 Bright Computing, Inc. All Rights Reserved. This manual or parts thereof may not be reproduced in any form unless permitted by contract or by written permission of Bright Computing, Inc. Trademarks Linux is a registered trademark of Linus Torvalds. PathScale is a registered trademark of Cray, Inc. Red Hat and all Red Hat-based trademarks are trademarks or registered trademarks of Red Hat, Inc. SUSE is a registered trademark of Novell, Inc. PGI is a registered trademark of NVIDIA Corporation. FLEXlm is a registered trademark of Flexera Software, Inc. PBS Professional, PBS Pro, and Green Provisioning are trademarks of Altair Engineering, Inc. All other trademarks are the property of their respective owners. Rights and Restrictions All statements, specifications, recommendations, and technical information contained herein are current or planned as of the date of publication of this document. They are reliable as of the time of this writing and are presented without warranty of any kind, expressed or implied. Bright Computing, Inc. shall not be liable for technical or editorial errors or omissions which may occur in this document. Bright Computing, Inc. shall not be liable for any damages resulting from the use of this document. Limitation of Liability and Damages Pertaining to Bright Computing, Inc. The Bright Cluster Manager product principally consists of free software that is licensed by the Linux authors free of charge. Bright Computing, Inc. shall have no liability nor will Bright Computing, Inc. provide any warranty for the Bright Cluster Manager to the extent that is permitted by law.