Automatic Techniques for Proving Correctness of Heap-Manipulating Programs

Total Page:16

File Type:pdf, Size:1020Kb

Automatic Techniques for Proving Correctness of Heap-Manipulating Programs c 2013 Xiaokang Qiu AUTOMATIC TECHNIQUES FOR PROVING CORRECTNESS OF HEAP-MANIPULATING PROGRAMS BY XIAOKANG QIU DISSERTATION Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science in the Graduate College of the University of Illinois at Urbana-Champaign, 2013 Urbana, Illinois Doctoral Committee: Associate Professor Madhusudan Parthasarathy, Chair Zisman Family Professor Rajeev Alur, University of Pennsylvania Associate Professor Grigore Ro¸su Associate Professor Mahesh Viswanathan ABSTRACT Reliability is critical for system software, such as OS kernels, mobile browsers, embedded systems and cloud systems. The correctness of these programs, especially for security, is highly desirable, as they should provide a trustwor- thy platform for higher-level applications and the end-users. Unfortunately, due to its inherent complexity, the verification process of these programs is typically manual/semi-automatic, tedious, and painful. Automating the reasoning behind these verification tasks and decreasing the dependence on manual help is one of the greatest challenges in software verification. This dissertation presents two logic-based automatic software verification systems, namely Strand and Dryad, that help in the task of verification of heap-manipulating programs, which is one of the most complex aspects of modern software that eludes automatic verification. Strand is a logic that combines an expressive heap-logic with an arbitrary data-logic and admits several powerful decidable fragments. The general decision procedures can be used in not only proving programs correct but also in software analysis and testing. Dryad is a family of logics, including Dryadtree as a first-order logic for trees and Dryadsep as a dialect of separation logic. Both the two logics are amenable to automated reasoning using the natural proof strategy, a radically new approach to software verification. Dryad and the natural proof techniques are so far the most efficient logic-based approach that can verify the full correctness of a wide variety of challenging programs, including a large number of programs from various open-source libraries. They hold promise of hatching the next-generation automatic verification techniques. ii To Ping. iii ACKNOWLEDGMENTS The six-year scholarly pursuit of a Ph.D. in Illinois has been the greatest challenge in my life. It is my privilege to thank all the people that have influenced this endeavor. I am deeply indebted to my advisor, Madhusudan Parthasarathy, who has been a consistent source of research guidance and financial support. As a venerable mentor, he has taught me virtually all I know of being a good researcher. I am extremely admired for his incredible enthusiasm, deep think- ing, broad knowledge, and sense of responsibility. I thank other members of my doctoral committee, Rajeev Alur, Grigore Ro¸su and Mahesh Viswanathan. They have all eagerly provided me with advice on both my research and career. My research collaborator Gennaro Parlato has been an “older brother” of mine, and taught me important and practical lessons on conducting research, writing papers, and surviving the Ph.D. I also thank my fellow students, including Pranav Garg, Edgar Pek, Shambwaditya Saha, Francesco Sorrentino and Andrei S¸tef˘anescu. They have been a source for great conversation and feedback on my research. I am grateful for the great staff in the Department of Computer Science. Elaine Wilson handled my travel arrangements and reimbursements with ex- traordinary efficiency. Shirley Finke and Jennifer Dittmar made sure I got paid every semester. Rhonda McElroy, Mary Beth Kelly and Kara MacGre- gor helped me get visas, write and sign important supporting letters, and meet the coursework requirements on time. I owe much thanks to my parents, Jianxiang Qiu and Minguang Shen. They have unceasingly supported and encouraged me for coming to America and being an academic. I am also very thankful to Elder Wei-Laung Hu and brothers and sisters in the Cornerstone Fellowship, for their prayer, sharing and encouragement in my spiritual journey in the Midwest. iv Finally, I especially thank my wife, Ping. I would not have finished the de- gree without her love, support and patience. She has been with me, through thick and thin. For all those times, this dissertation is lovingly dedicated to her. v TABLE OF CONTENTS LISTOFTABLES .............................viii LISTOFFIGURES ............................ ix LISTOFABBREVIATIONS . .. .. x CHAPTER 1 INTRODUCTION . 1 1.1 SummaryofContributions . 3 1.2 Organization ........................... 4 CHAPTER 2 PRELIMINARIES . 6 2.1 Satisfiability Modulo Theories and Z3 . 6 2.2 Monadic Second-Order Theory and Mona ........... 7 2.3 SeparationLogic ......................... 8 CHAPTER 3 STRAND LOGIC .................... 11 3.1 Overview.............................. 12 3.2 Motivating Examples and Logic Design . 13 3.3 RecursivelyDefinedData-Structures . 17 3.4 TheLogic ............................. 23 3.5 Program Verification Using Strand .............. 27 CHAPTER 4 DECISION PROCEDURES . 39 4.1 Overview.............................. 39 4.2 Satisfiability-Preserving Embeddings . 43 4.3 ASemanticallyDefinedFragment . 48 4.4 ASyntacticallyDefinedFragment . 54 4.5 ExperimentalEvaluation . 64 4.6 RelatedWork ........................... 69 CHAPTER5 NATURALPROOFS. 71 5.1 The Dryadtree Logic ....................... 75 5.2 Deriving the Verification Condition . 82 5.3 A Decidable Fragment of Dryadtree ...............100 5.4 FormulaAbstraction . .109 vi 5.5 Experiments............................115 5.6 RelatedWork ...........................119 CHAPTER 6 COMBINING SEPARATION AND RECURSION . 123 6.1 LogicDesign............................124 6.2 Syntax...............................127 6.3 Semantics .............................129 6.4 Examples .............................137 6.5 Translating toALogicover theGlobalHeap . 139 CHAPTER 7 NATURAL PROOFS FOR STRUCTURE, DATA, ANDSEPARATION ..........................147 7.1 ProgramsandHoare-triples . .149 7.2 Generating the Verification Condition . 151 7.3 UnfoldingAcrosstheFootprint . .158 7.4 FormulaAbstraction . .162 7.5 CaseStudy ............................166 7.6 ExperimentalEvaluation . .171 7.7 RelatedWork ...........................177 7.8 AnnotationSynthesis . .179 CHAPTER8 CONCLUSIONS . .191 8.1 Conclusions ............................191 8.2 ALookAhead...........................192 REFERENCES...............................194 vii LIST OF TABLES 4.1 Results of program verification using Strand ......... 68 5.1 Results of program verification using Dryadtree ........120 6.1 Domain-exact property and Scope function . 141 7.1 Results of verifying data-structure algorithms . 173 7.2 Results of verifying open-source libraries . 175 viii LIST OF FIGURES 3.1 A list with head and tail . 14 3.2 l′ inherits data values from l ................... 15 3.3 A binary tree example represented in Rbt ............ 20 3.4 Definition of the tailorX function ................ 22 3.5 Syntax of Strand ........................ 24 3.6 Predicates defined for non-updating statements. 33 3.7 Predicates defined for updating statements. 34 3.8 Asyntactic transformationforconditions. 35 4.1 Definition of the interpret function ............... 49 4.2 Definition of the tailor function ................. 51 syn 4.3 Syntax of Stranddec ...................... 55 4.4 A valid subset X that falsifies β ................. 58 5.1 Syntax of Dryadtree ....................... 77 5.2 Recursivedefinitionsforredblacktrees . 80 5.3 AVL-findroutine .........................100 5.4 Pre/post conditions and recursive definition for AVL-find . 100 5.5 Expanding the symbolic heap and generating the formulas . 101 5.6 Syntax of local formulas ϕp(x)..................103 dec 5.7 Syntax of Dryadtree .......................104 5.8 Inductive definition of map(ϕ)..................106 6.1 Syntax of Dryadsep .......................128 6.2 The pure predicateforterms/formulas . 131 6.3 Translation of Dryadsep terms .................142 6.4 Translation of Dryadsep formulas ................142 7.1 Syntaxofprograms . .149 7.2 Casestudy:Heapify . .. .. .166 ix LIST OF ABBREVIATIONS APF Array Property Fragment BST Binary Search Tree FOL First-Order Logic ITE If-Then-Else LCA Least Common Ancestor MSO Monadic Second-Order MSOL Monadic Second-Order Logic RDDS Recursively Defined Data-Structure SL Separation Logic SMT Satisfiability Modulo Theories SPE Satisfiability-Preserving Embedding VC Verification Condition WS1S Weak monadic Second-order theory of 1 Successor WS2S Weak monadic Second-order theory of 2 Successors x CHAPTER 1 INTRODUCTION As software systems have become indispensable in our daily lives, their re- liability has grown to be one of the most concerned issues, especially when deployed for critical applications and services. This includes complex embed- ded software in avionics, vehicles and medical equipments, system software such as operating systems and browsers, as well as today’s emerging cloud software. Numerous approaches have been proposed to build reliability-critical soft- ware to satisfy its complex correctness requirements. A focus of intense re- search in this area has been program verification using theorem provers, uti- lizing manually provided proof annotations, such as pre- and post-conditions for functions and loop invariants. Automatic theory solvers (e.g. SMT solvers) that handle a variety of quantifier-free theories including arithmetic, uninterpreted functions, Boolean logic, etc., serve as
Recommended publications
  • Recursive Definitions Across the Footprint That a Straight- Line Program Manipulated, E.G
    natural proofs for verification of dynamic heaps P. MADHUSUDAN UNIV. OF ILLINOIS AT URBANA-CHAMPAIGN (VISITING MSR, BANGALORE) WORKSHOP ON MAKING FORMAL VERIFICATION SCALABLE AND USEABLE CMI, CHENNAI, INDIA JOINT WORK WITH… Xiaokang Qiu .. graduating! Gennaro Andrei Pranav Garg Parlato Stefanescu GOAL: BUILD RELIABLE SOFTWARE Build systems with proven reliability and security guarantees. (Not the same as finding bugs!) Deductive verification with automatic theorem proving • Floyd/Hoare style verification • User supplied modular annotations (pre/post cond, class invariants) and loop invariants • Resulting verification conditions are derived automatically for each linear program segment (using weakest-pre/strongest-post) Verification conditions are then proved valid using mostly automatic theorem proving (like SMT solvers) TARGET: RELIABLE SYS TEMS SOFTWARE Operating Systems Angry birds Browsers Excel VMs Mail clients … App platforms Systems Software Applications Several success stories: • Microsoft Hypervisor verification using VCC [MSR, EMIC] • A secure foundation for mobile apps [@Illinois, ASPLOS’13] A SECURE FOUNDATION FOR MOBILE APPS In collaboration with King’s systems group at Illinois First OS architecture that provides verifiable, high-level abstractions for building mobile applications. Application state is private by default. • Meta-data on files controls access by apps; only app with appropriate permissions can access files • Every page is encrypted before sending to the storage service. • Memory isolation between apps • A page fault handler serves a file- backed page for a process, the fille has to be opened by the same process. • Only the current app can write to the screen buffer. • IPC channel correctness • … VERIFICATION TOOLS ARE MATURE And works! ENOUGH TO BUILD RELIABLE S/W.
    [Show full text]
  • A Formalization of the C99 Standard in HOL, Isabelle And
    A Formalization of the C99 Standard in HOL, Isabelle and Coq Robbert Krebbers Freek Wiedijk Institute for Computing and Information Sciences, Radboud University Nijmegen, The Netherlands The C99 standard Related projects Some subtleties of C The official description issued by ANSI and ISO: Michael Norrish. C and C++ semantics (L4.verified) Undefined behavior due to unknown evaluation order: Written in English Xavier Leroy et al. Verified C compiler in Coq (Compcert) int i = 0; No mathematically precise formalism Chucky Ellison and Grigore Rosu. Executable C semantics in Maude i = ++i; // undefined Incomplete and ambiguous Overflow of signed integers is undefined: The formalizations The Formalin project int i = INT_MAX; Describe a space C semantics of all possible C semantics with relations return i < i + 1; May 2011 to May 2015 between these semantics // undefined: hence, a compiler is allowed to http://ch2o.cs.ru.nl/ And, a small step semantics, C99 : C semantics // optimize this to return 1 Create a formalization of the complete C99 standard On the other hand, unsigned integer arithmetic is modular In the theorem provers HOL4, Isabelle/HOL and Coq Undefined behavior due to jumping into a block with a variable length Which follow the standard closely array declaration: All derived from a common master formalization (e.g. in Ott) goto foo; // undefined Isabelle/ Features int a[n]; HOL4 label foo; printf("bar\n"); C preprocessor HOL Freeing memory makes pointers to it indeterminate C standard library int *x = malloc(sizeof(int)); Floating point arithmetic 99 free (x); Casts C printf("%p\n", x); // undefined Non-determinism Contiguously allocated objects Sequence points int x = 30, y = 31; Alignment requirements int *p = &x + 1, *q = &y; Non-local control flow ( , / , signal handling) goto setjmp longjmp if (memcmp(&p, &q, sizeof(p)) == 0) { volatile, restrict and const variables printf("%d\n", *p); Programs in a `freestanding environment' COQ // the standard is unclear whether this is Purposes // defined (see Defect report #260).
    [Show full text]
  • Type 2 Diabetes Prediction Using Machine Learning Algorithms
    Type 2 Diabetes Prediction Using Machine Learning Algorithms Parisa Karimi Darabi*, Mohammad Jafar Tarokh Department of Industrial Engineering, K. N. Toosi University of Technology, Tehran, Iran Abstract Article Type: Background and Objective: Currently, diabetes is one of the leading causes of Original article death in the world. According to several factors diagnosis of this disease is Article History: complex and prone to human error. This study aimed to analyze the risk of Received: 15 Jul 2020 having diabetes based on laboratory information, life style and, family history Revised: 1 Aug 2020 with the help of machine learning algorithms. When the model is trained Accepted: 5 Aug 2020 properly, people can examine their risk of having diabetes. *Correspondence: Material and Methods: To classify patients, by using Python, eight different machine learning algorithms (Logistic Regression, Nearest Neighbor, Decision Parisa Karimi Darabi, Tree, Random Forest, Support Vector Machine, Naive Bayesian, Neural Department of Industrial Network and Gradient Boosting) were analyzed. Were evaluated by accuracy, Engineering, K. N. Toosi University sensitivity, specificity and ROC curve parameters. of Technology, Tehran, Iran Results: The model based on the gradient boosting algorithm showed the best [email protected] performance with a prediction accuracy of %95.50. Conclusion: In the future, this model can be used for diagnosis diabete. The basis of this study is to do more research and develop models such as other learning machine algorithms. Keywords: Prediction, diabetes, machine learning, gradient boosting, ROC curve DOI : 10.29252/jorjanibiomedj.8.3.4 people die every year from diabetes disease. Introduction By 2045, it will reach 629 million (1).
    [Show full text]
  • Curriculum Vitæ RESEARCH INTERESTS EDUCATION PROFESSIONAL POSITIONS
    Curriculum Vitæ Mark Hills East Carolina University Department of Computer Science Greenville, North Carolina, United States of America Phone: (252) 328-9692 (office) E-mail: [email protected] Skype: mahills Home page: http://www.cs.ecu.edu/hillsma/ RESEARCH INTERESTS I am interested broadly in the fields of programming languages and software engineering, especially where they intersect. Specifically, my research focuses on static and dynamic program analysis; empirical software engineering; software analytics; software repository mining; automated refactoring; programming language semantics; and program verification. My goal is to provide developers with powerful new tools for under- standing, creating, analyzing, and improving software. As part of this, I am a continuing contributor to Rascal, a meta-programming language for program analysis, program transformation, and programming language implementation. EDUCATION • Ph.D. in Computer Science (advisor Grigore Ro¸su),University of Illinois Urbana-Champaign, com- pleted December 2009. • B.S. in Computer Science, Western Illinois University, 1995 PROFESSIONAL POSITIONS Associate Professor East Carolina University, Department of Computer Science (Assistant Professor 2013{2019, Associate Professor 2019{present) My research at ECU has focused on understanding complex, web-based systems written in dynamic lan- guages. This research includes a strong empirical component, looking first at how developers use specific features of dynamic languages, and second at how knowledge of these uses, which implicitly represent devel- oper intent, can lead to more precise analysis tools. Recently, I have also started to look at how to model the knowledge of developers (including students), and at how such models can be used to direct developers to more relevant, understandable code examples.
    [Show full text]
  • Scalable Fault-Tolerant Elastic Data Ingestion in Asterixdb
    UC Irvine UC Irvine Electronic Theses and Dissertations Title Scalable Fault-Tolerant Elastic Data Ingestion in AsterixDB Permalink https://escholarship.org/uc/item/9xv3435x Author Grover, Raman Publication Date 2015 License https://creativecommons.org/licenses/by-nd/4.0/ 4.0 Peer reviewed|Thesis/dissertation eScholarship.org Powered by the California Digital Library University of California UNIVERSITY OF CALIFORNIA, IRVINE Scalable Fault-Tolerant Elastic Data Ingestion in AsterixDB DISSERTATION submitted in partial satisfaction of the requirements for the degree of DOCTOR OF PHILOSOPHY in Information and Computer Science by Raman Grover Dissertation Committee: Professor Michael J. Carey, Chair Professor Chen Li Professor Sharad Mehrotra 2015 © 2015 Raman Grover DEDICATION To my wonderful parents... ii TABLE OF CONTENTS Page LIST OF FIGURES vi LIST OF TABLES viii ACKNOWLEDGMENTS ix CURRICULUM VITAE x ABSTRACT OF THE DISSERTATION xi 1 Introduction 1 1.1 Challenges in Data Feed Management . .3 1.1.1 Genericity and Extensibility . .3 1.1.2 Fetch-Once Compute-Many Model . .5 1.1.3 Scalability and Elasticity . .5 1.1.4 Fault Tolerance . .7 1.2 Contributions . .8 1.3 Organization . .9 2 Related Work 10 2.1 Stream Processing Engines . 10 2.2 Data Routing Engines . 11 2.3 Extract Transform Load (ETL) Systems . 12 2.4 Other Systems . 13 2.4.1 Flume . 14 2.4.2 Chukwa . 15 2.4.3 Kafka . 16 2.4.4 Sqoop . 17 2.5 Summary . 17 3 Background and Preliminaries 19 3.1 AsterixDB . 19 3.1.1 AsterixDB Architecture . 20 3.1.2 AsterixDB Data Model . 21 3.1.3 Querying Data .
    [Show full text]
  • Anomaly Detection and Identification Scheme for VM Live Migration in Cloud Infrastructure✩
    Future Generation Computer Systems 56 (2016) 736–745 Contents lists available at ScienceDirect Future Generation Computer Systems journal homepage: www.elsevier.com/locate/fgcs Anomaly detection and identification scheme for VM live migration in cloud infrastructureI Tian Huang a, Yongxin Zhu a,∗, Yafei Wu a, Stéphane Bressan b, Gillian Dobbie c a School of Microelectronics, Shanghai Jiao Tong University, China b School of Computing, National University of Singapore, Singapore c Department of Computer Science, University of Auckland, New Zealand h i g h l i g h t s • We extend Local Outlier Factors to detect anomalies in time series and adapt to smooth environmental changes. • We propose Dimension Reasoning LOF (DR-LOF) that can point out the ``most anomalous'' dimension of the performance profile. DR-LOF provides clues for administrators to pinpoint and clear the anomalies. • We incorporate DR-LOF and Symbolic Aggregate ApproXimation (SAX) measurement as a scheme to detect and identify the anomalies in the progress of VM live migration. • In experiments, we deploy a typical computing application on small scale clusters and evaluate our anomaly detection and identification scheme by imposing two kinds of common anomalies onto VMs. article info a b s t r a c t Article history: Virtual machines (VM) offer simple and practical mechanisms to address many of the manageability Received 3 March 2015 problems of leveraging heterogeneous computing resources. VM live migration is an important feature Received in revised form of virtualization in cloud computing: it allows administrators to transparently tune the performance 11 May 2015 of the computing infrastructure. However, VM live migration may open the door to security threats.
    [Show full text]
  • A Decidable Logic for Tree Data-Structures with Measurements
    A Decidable Logic for Tree Data-Structures with Measurements Xiaokang Qiu and Yanjun Wang Purdue University fxkqiu,[email protected] Abstract. We present Dryaddec, a decidable logic that allows reason- ing about tree data-structures with measurements. This logic supports user-defined recursive measure functions based on Max or Sum, and re- cursive predicates based on these measure functions, such as AVL trees or red-black trees. We prove that the logic's satisfiability is decidable. The crux of the decidability proof is a small model property which al- lows us to reduce the satisfiability of Dryaddec to quantifier-free lin- ear arithmetic theory which can be solved efficiently using SMT solvers. We also show that Dryaddec can encode a variety of verification and synthesis problems, including natural proof verification conditions for functional correctness of recursive tree-manipulating programs, legal- ity conditions for fusing tree traversals, synthesis conditions for con- ditional linear-integer arithmetic functions. We developed the decision procedure and successfully solved 220+ Dryaddec formulae raised from these application scenarios, including verifying functional correctness of programs manipulating AVL trees, red-black trees and treaps, check- ing the fusibility of height-based mutually recursive tree traversals, and counterexample-guided synthesis from linear integer arithmetic specifi- cations. To our knowledge, Dryaddec is the first decidable logic that can solve such a wide variety of problems requiring flexible combination of measure-related, data-related and shape-related properties for trees. 1 Introduction Logical reasoning about tree data-structures has been needed in various applica- tion scenarios such as program verification [32,24,26,42,49,4,14], compiler opti- mization [17,18,9,44] and webpage layout engines [30,31].
    [Show full text]
  • Data Intensive Computing with Linq to Hpc Technical
    DATA INTENSIVE COMPUTING WITH LINQ TO HPC TECHNICAL OVERVIEW Ade Miller, Principal Program Manager, LINQ to HPC Agenda Introduction to LINQ to HPC Using LINQ to HPC Systems Management Integrating with Other Data Technologies The Data Spectrum One extreme is analyzing raw, unstructured data. The analyst does not know exactly what the data contains, nor LINQ to HPC what cube would be justified. The analyst needs to do ad-hoc analyses that may never be run again. Another extreme is analytics targeting a traditional data warehouse. The analyst Parallel Data knows the cube he or she wants to build, Warehouse and the analyst knows the data sources. What kind of Data? Large Data Volume . 100s of TBs to 10s of PBs New Questions & Non-Traditional data Types New Technologies New Insights . Unstructured & Semi structured . How popular is my product? . Weak relational schema . Distributed Parallel Processing . What is the best ad to serve? . Text, Images, Videos, Logs Frameworks . Is this a fraudulent transaction? . Easy to Scale on commodity hardware . MapReduce-style programming models New Data Sources . Sensors & Devices . Traditional applications . Web Servers . Public data Overview MOVING THE DATA TO THE COMPUTE So how does it work? FIRST, STORE THE DATA Data Intensive Computing with HPC Server Windows HPC Server 2008 R2 Service Pack 2 So how does it work? SECOND, TAKE THE PROCESSING TO THE DATA Data Intensive Computing with HPC Server var logentries = from line in logs where !line.StartsWith("#") select new LogEntry(line); var user = from access
    [Show full text]
  • A DSL for Dynamic Semantics Specification}, Year = {2015}}
    Delft University of Technology Software Engineering Research Group Technical Report Series DynSem: A DSL for Dynamic Semantics Specification Vlad Vergu, Pierre Neron, Eelco Visser Report TUD-SERG-2015-003 SERG TUD-SERG-2015-003 Published, produced and distributed by: Software Engineering Research Group Department of Software Technology Faculty of Electrical Engineering, Mathematics and Computer Science Delft University of Technology Mekelweg 4 2628 CD Delft The Netherlands ISSN 1872-5392 Software Engineering Research Group Technical Reports: http://www.se.ewi.tudelft.nl/techreports/ For more information about the Software Engineering Research Group: http://www.se.ewi.tudelft.nl/ This technical report is an extended version of the paper: Vlad Vergu, Pierre Neron, Eelco Visser. DynSem: A DSL for Dynamic Semantics Specification. In Mari- bel Fernandez, editor, 26th International Conference on Rewriting Techniques and Applications, RTA 2015, Proceedings Leibniz International Proceedings in Informatics Schloss Dagstuhl Leibniz-Zentrum fur Infor- matik, Dagstuhl Publishing, Germany. @inproceedings{VerguNV-RTA-2015, Author = {Vlad Vergu and Pierre Neron and Eelco Visser}, Booktitle = {26th International Conference on Rewriting Techniques and Applications, RTA 2015, Proceedings}, Month = {June}, Note = {To appear}, Publisher = {Schloss Dagstuhl -- Leibniz-Zentrum fur Informatik, Dagstuhl Publishing, Germany}, Series = {Leibniz International Proceedings in Informatics}, Title = {DynSem: A DSL for Dynamic Semantics Specification}, Year = {2015}}
    [Show full text]
  • Cloud Computing Paradigms for Pleasingly Parallel Biomedical
    Cloud Computing Paradigms for Pleasingly Parallel Biomedical Applications Thilina Gunarathne, Tak-Lon Wu, Jong Youl Choi, Seung-Hee Bae, Judy Qiu School of Informatics and Computing / Pervasive Technology Institute Indiana University, Bloomington. {tgunarat, taklwu, jychoi, sebae, xqiu}@indiana.edu Abstract Cloud computing offers exciting new approaches for scientific computing that leverages the hardware and software investments on large scale data centers by major commercial players. Loosely coupled problems are very important in many scientific fields and are on the rise with the ongoing move towards data intensive computing. There exist several approaches to leverage clouds & cloud oriented data processing frameworks to perform pleasingly parallel computations. In this paper we present three pleasingly parallel biomedical applications, 1) assembly of genome fragments 2) sequence alignment and similarity search 3) dimension reduction in the analysis of chemical structures, implemented utilizing cloud infrastructure service based utility computing models of Amazon Web Services and Microsoft Windows Azure as well as utilizing MapReduce based data processing frameworks, Apache Hadoop and Microsoft DryadLINQ. We review and compare each of the frameworks and perform a comparative study among them based on performance, efficiency, cost and the usability. Cloud service based utility computing model and the managed parallelism (MapReduce) exhibited comparable performance and efficiencies for the applications we considered. We analyze the variations in cost between the different platform choices (eg: EC2 instance types), highlighting the need to select the appropriate platform based on the nature of the computation. 1. Introduction Scientists are overwhelmed with the increasing amount of data processing needs arising from the storm of data that is flowing through virtually every field of science.
    [Show full text]
  • Characterization and Prediction of Deep Learning Workloads in Large-Scale GPU Datacenters
    Characterization and Prediction of Deep Learning Workloads in Large-Scale GPU Datacenters Qinghao Hu1,2 Peng Sun3 Shengen Yan3 Yonggang Wen1 Tianwei Zhang1∗ 1School of Computer Science and Engineering, Nanyang Technological University 2S-Lab, Nanyang Technological University 3SenseTime {qinghao.hu, ygwen, tianwei.zhang}@ntu.edu.sg {sunpeng1, yanshengen}@sensetime.com ABSTRACT 1 INTRODUCTION Modern GPU datacenters are critical for delivering Deep Learn- Over the years, we have witnessed the remarkable impact of Deep ing (DL) models and services in both the research community and Learning (DL) technology and applications on every aspect of our industry. When operating a datacenter, optimization of resource daily life, e.g., face recognition [65], language translation [64], adver- scheduling and management can bring significant financial bene- tisement recommendation [21], etc. The outstanding performance fits. Achieving this goal requires a deep understanding of thejob of DL models comes from the complex neural network structures features and user behaviors. We present a comprehensive study and may contain trillions of parameters [25]. Training a production about the characteristics of DL jobs and resource management. model may require large amounts of GPU resources to support First, we perform a large-scale analysis of real-world job traces thousands of petaflops operations [15]. Hence, it is a common prac- from SenseTime. We uncover some interesting conclusions from tice for research institutes, AI companies and cloud providers to the perspectives of clusters, jobs and users, which can facilitate the build large-scale GPU clusters to facilitate DL model development. cluster system designs. Second, we introduce a general-purpose These clusters are managed in a multi-tenancy fashion, offering framework, which manages resources based on historical data.
    [Show full text]
  • Ray: a Distributed Framework for Emerging AI Applications
    Ray: A Distributed Framework for Emerging AI Applications Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan, and Ion Stoica, UC Berkeley https://www.usenix.org/conference/osdi18/presentation/nishihara This paper is included in the Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’18). October 8–10, 2018 • Carlsbad, CA, USA ISBN 978-1-939133-08-3 Open access to the Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation is sponsored by USENIX. Ray: A Distributed Framework for Emerging AI Applications Philipp Moritz,∗ Robert Nishihara,∗ Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan, Ion Stoica University of California, Berkeley Abstract and their use in prediction. These frameworks often lever- age specialized hardware (e.g., GPUs and TPUs), with the The next generation of AI applications will continuously goal of reducing training time in a batch setting. Examples interact with the environment and learn from these inter- include TensorFlow [7], MXNet [18], and PyTorch [46]. actions. These applications impose new and demanding The promise of AI is, however, far broader than classi- systems requirements, both in terms of performance and cal supervised learning. Emerging AI applications must flexibility. In this paper, we consider these requirements increasingly operate in dynamic environments, react to and present Ray—a distributed system to address them. changes in the environment, and take sequences of ac- Ray implements a unified interface that can express both tions to accomplish long-term goals [8, 43].
    [Show full text]