Database Supercomputer

Dr. Craig Thompson, Professor and Acxiom Database Chair in Engineering
Computer Science and Computer Engineering Department
313 Engineering Hall, Fayetteville, AR 72701
Tel: (479) 575-6519, Fax: (479) 575-5339, Email: [email protected]

Problem

Industry, government, the U.S. military, DHS, the scientific research community, our networked society, and RFID and sensor networks are all generating data at unprecedented and accelerating rates. The commercial database community is dominated by a few main vendors, and database architectures have not changed enough in thirty years: they are monolithic and dominated by disk-based storage. Organizations with huge databases and extreme throughput requirements are outgrowing traditional database architectures and searching for a next-generation approach.

Objective

The objective of this proposed project is to develop an open service architecture for database management systems that can complement disk-based storage with grids of hundreds of low-cost machines (e.g., PCs), all of which can process query fragments in parallel.

Impact if Successful

In Arkansas, companies like Acxiom Corporation and Wal-Mart will benefit from low-cost, scalable database supercomputer technology. These companies currently process terabytes daily and petabytes annually, and the volumes keep growing. Acxiom is already years into pioneering data grid technology: it can process datasets for clients such as Citibank, major credit agencies, and hundreds of others, comparing customer records against a knowledge base the size of the US population in hours to days. Wal-Mart processes all of its transactions worldwide and will soon also process all of its RFID sensor data worldwide. These companies do not want to be in the database business, but they need better database management architectures to keep pace with their voracious data needs.
At the national level, the need for scalable database supercomputing is glaring: the Department of Homeland Security lists data sharing among its organizations as its top priority, the US military views information dominance as a critical goal, and NASA and the scientific computing community need ways to process huge data sets.

Approach

Over the past year, researchers at the University of Arkansas have been working closely with Acxiom Corporation on a next-generation data management architecture (funded by $280K from Acxiom). The approach is related to NSF-sponsored research in grid computing. Much of that research has focused on computational grids, that is, how to spread a computation across large numbers of cheap machines (e.g., PCs). Our work with Acxiom has focused on data grids, in which huge files are partitioned across the main memory of pools of hundreds of PCs. Queries can then be partitioned to operate in parallel across all machines at once. The architecture scales by adding machines and can be tuned to assure very high throughput. Acxiom uses a special-purpose version of this kind of architecture to process huge data sets effectively. The opportunity is to generalize this architecture to cover a wide variety of traditional query operations while retaining the massive capacity and large speedups of the "embarrassingly parallel" data grid architecture.

Task 1 – Year 1 – Storage Management Architecture Prototype

We propose to work with the open source database communities (MySQL and/or Ingres) to develop a standard storage architecture API so that third parties can add their own storage architecture to any front-end relational system. We will then develop a data grid cluster architecture built on a collection of general index structures, initially HASH and ISAM. The system will be prototyped and tested on large datasets. We will also demonstrate that custom indexes, such as Acxiom's current workhorse AbiliTec indexes, can be accessed through the storage engine interface.
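
The data grid idea described above can be illustrated with a minimal sketch. It assumes a hypothetical `StorageEngine` interface (not the MySQL or Ingres storage API, and not Acxiom's design), uses in-process `MemoryNode` objects to stand in for individual PCs, and uses threads to stand in for machines working in parallel. Records are hash-partitioned by key, point lookups go to a single node, and scans are scattered to every node and the partial results gathered.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List, Optional, Protocol

Record = Dict[str, Any]


class StorageEngine(Protocol):
    """Hypothetical storage API that a relational front end could call."""

    def put(self, key: str, record: Record) -> None: ...
    def get(self, key: str) -> Optional[Record]: ...
    def scan(self, predicate: Callable[[Record], bool]) -> List[Record]: ...


class MemoryNode:
    """Stands in for one PC holding its partition in main memory (hash index)."""

    def __init__(self) -> None:
        self.rows: Dict[str, Record] = {}

    def put(self, key: str, record: Record) -> None:
        self.rows[key] = record

    def get(self, key: str) -> Optional[Record]:
        return self.rows.get(key)

    def scan(self, predicate: Callable[[Record], bool]) -> List[Record]:
        return [r for r in self.rows.values() if predicate(r)]


class DataGrid:
    """Hash-partitions records across nodes and fans scans out to all of them."""

    def __init__(self, node_count: int) -> None:
        self.nodes = [MemoryNode() for _ in range(node_count)]

    def _node_for(self, key: str) -> MemoryNode:
        # crc32 gives a stable hash; Python's built-in hash() is salted per run.
        return self.nodes[zlib.crc32(key.encode("utf-8")) % len(self.nodes)]

    def put(self, key: str, record: Record) -> None:
        self._node_for(key).put(key, record)

    def get(self, key: str) -> Optional[Record]:
        # A point lookup touches exactly one node.
        return self._node_for(key).get(key)

    def scan(self, predicate: Callable[[Record], bool]) -> List[Record]:
        # Scatter the query fragment to every node, then gather partial results.
        with ThreadPoolExecutor(max_workers=len(self.nodes)) as pool:
            parts = pool.map(lambda node: node.scan(predicate), self.nodes)
            return [record for part in parts for record in part]


# Usage: load 1,000 made-up customer records and run one parallel scan.
grid = DataGrid(node_count=4)
for i in range(1000):
    grid.put(f"cust-{i}", {"id": i, "state": "AR" if i % 10 == 0 else "TX"})

matches = grid.scan(lambda r: r["state"] == "AR")
```

In a real deployment each node would be a separate machine reached over the network, and adding nodes would require repartitioning; the sketch only shows why point lookups and scans parallelize naturally once data is partitioned by key.
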
The Task 1 prototype would benefit Acxiom by supporting a broader class of relational queries on current datasets. At the same time, the architecture, motivated by Acxiom's positive experience with data grids, will generalize so that the entire field can benefit from a radically new, highly parallel, and scalable storage management architecture.

Tasks 2 & 3 – Years 2 & 3 – Query Processing, Workflow Automation, and Digital Rights Management

Results from Task 1 are immediately useful because some high-performance applications can interface directly to that architectural level. But generalizing the higher levels of a DBMS can also lead to richer, more capable systems. In follow-on years, we expect to extend the work from Year 1 to build higher-level capabilities. These include:
  • Multiple query processors that can operate in parallel against one or several storage management architectures.
  • Extensions of relational queries to scheduling parallel workflows that speed up the processing of multiple simultaneous queries.
  • Secure access, which is critical to the long-term success of next-generation database supercomputers. New policy-driven languages are needed to assure security and privacy.

Risk Assessment and Chance of Success

Based on special-purpose prototypes developed at Acxiom, we can claim with reasonable certainty that new, more general architectures can provide similar benefits to a wide variety of mega-databases. The Principal Investigator has credentials (see below) that qualify him to architect and lead the proposed development effort. A partnership between the University of Arkansas and Acxiom, with connections to open source vendors such as MySQL, is already yielding productive results, and the project can build on this working relationship.

Team

Dr. Craig Thompson – Principal Investigator. Dr. Thompson is Professor and Acxiom Database Chair in Engineering.
He has a strong record of industrial research and external funding (PI for $11.3M in middleware and database research from DARPA, 1990-present). His background includes university teaching, publications, presentations, inventions, consulting, standards, and administration, and his work has had reasonable impact in several fields: software architectures, survivable, reliable, and secure distributed object middleware, and multi-agent systems. He led the DARPA Open OODB project, which helped define OODB functionality but, more importantly, led to service-oriented architectures (SOAs): middleware architectures consisting of collections of Lego-like modular components. He co-authored the Object Management Group's Object Management Architecture and Object Services Architecture documents, which form the blueprint for CORBA and CORBAservices, a direct precursor to today's web services. He also helped architect the DARPA CoABS agent grid and is currently working with Acxiom on improvements to a very high-performance data grid architecture.

Budget

A detailed budget and plan will be developed if this project is selected. A successful project would involve:
  • University of Arkansas CSCE Department – 3-4 faculty for 2-3 months each summer, plus some release time and cost share during the academic year.
  • 7-10 graduate students at the PhD and MS level, plus undergraduate researchers.
  • Acxiom guidance – probably a range of Acxiom architects experienced in different facets of the architecture. Most of these contacts are already in place.
  • MySQL partnership – we have made initial contacts and believe our proposal is aligned with their open source project goals.
Initial estimate: $700-$800K in year one.

Benefits to University of Arkansas

Arkansas is an EPSCoR state and would benefit from the opportunity to create a center of excellence in data engineering.
The investment is strategic regionally because Arkansas is home to Wal-Mart, Acxiom, JB Hunt, Tyson, and many other companies that depend on enterprise computing to process huge data sets. This is a large software architecture and development project that could also involve resources from CAST and the Walton College of Business, so cross-college participation is likely. It would also strengthen university-industry ties in a year when strong ties with industry partners, especially Acxiom, are particularly important. The project would put the University of Arkansas on the map as a world leader in data management. The University's 2010 goal is to build knowledge-intensive industries and world-class research leadership in Arkansas, and this proposal offers an opportunity to do that. If we win this effort, we can leverage our relationship with Acxiom to win other efforts, hopefully benefiting DHS, DoD, NSF, scientific research, and many data management efforts across the US. This project involves thinking outside the box and rethinking the boundaries of the data management area. It would draw students, especially to our Ph.D. and M.S. programs. Even better, it would make a difference in the world. And it would be fun.