BugDoc: Algorithms to Debug Computational Processes

Raoni Lourenço, Juliana Freire, Dennis Shasha
New York University

arXiv:2004.06530v1 [cs.DB] 12 Apr 2020

Abstract

Data analysis for scientific experiments and enterprises, large-scale simulations, and machine learning tasks all entail the use of complex computational pipelines to reach quantitative and qualitative conclusions. If some of the activities in a pipeline produce erroneous outputs, the pipeline may fail to execute or produce incorrect results. Inferring the root cause(s) of such failures is challenging, usually requiring time and much human thought, while still being error-prone. We propose a new approach that makes use of iteration and provenance to automatically infer the root causes and derive succinct explanations of failures. Through a detailed experimental evaluation, we assess the cost, precision, and recall of our approach compared to the state of the art. Our experimental data and processing software is available for use, reproducibility, and enhancement.

CCS Concepts

• Information systems → Data provenance.

1 Introduction

Computational pipelines are widely used in many domains, from astrophysics and biology to enterprise analytics. They are characterized by interdependent modules, associated parameters, and data inputs. Results derived from these pipelines lead to conclusions and, potentially, actions. If one or more modules in a pipeline produce erroneous or unexpected outputs, these conclusions may be incorrect. Thus, it is critical to identify the causes of such failures.

Discovering the root cause of failures in a pipeline is challenging because problems can come from many different sources, including bugs in the code, input data, software updates, and improper parameter settings. Connecting the erroneous result to its root cause is especially difficult for long pipelines or when multiple pipelines are composed. Consider the following real but sanitized examples.

Example: Enterprise Analytics. In an application deployed by a major software company, plots for sales forecasts showed a sharp decrease compared to historical values. After much investigation, the problem was tracked down to a data feed (coming from an external data provider), whose temporal resolution had changed from monthly to weekly. The change in resolution affected the predictions of a machine learning pipeline, leading to incorrect forecasts.

Example: Exploring Supernovas. In an astronomy experiment, some visualizations of supernovas presented unusual artifacts that could have indicated a discovery. The experimental analysis consisted of multiple pipelines run at different sites, including data collection at the telescope site, data processing at a high-performance computing facility, and data analysis run on the physicist's desktop. After spending substantial time trying to verify the results, the physicists found that a bug introduced in the new version of the data processing software had caused the artifacts.

To debug such problems, users currently expend considerable effort reasoning about the effects of the many possible different settings. This requires them to tune and execute new pipeline instances to test hypotheses manually, which is tedious, time-consuming, and error-prone.

We propose new methods and a system that automatically and iteratively identifies one or more minimal causes of failures in general computational pipelines (or workflows).

Figure 1: Machine learning pipeline and its provenance. A data scientist can explore different input datasets and classifier estimators to identify a suitable solution for a classification problem.

The Need for Systematic Iteration. Consider the example in Figure 1, which shows a generic template for a machine learning pipeline and a log of different instances that were run with their associated results. The pipeline reads a dataset, splits it into training and test subsets, creates and executes an estimator, and computes the F-measure score using 10-fold cross-validation. A data scientist uses this template to understand how different estimators perform for different types of input data, and ultimately, to derive a pipeline instance that leads to high scores.

Analyzing the provenance of the runs, we can see that gradient boosting leads to low scores for two of the datasets (Iris and Digits), but it has a high score for Images. By contrast, decision trees work well for both the Iris and Digits datasets, and logistic regression leads to a high score for Iris. This may suggest that there is a problem with the gradient boosting module for some parameters, that decision trees provide a suitable compromise for different data, and that logistic regression is good for the Iris data. Because each run used different parameters for each method depending on the dataset, a definitive conclusion has to await additional testing of these hyperparameters. Doing so manually is time-consuming and error-prone, while BugDoc automates this process.

Identifying Root Causes of Failures: Challenges. As the above examples illustrate, there are many potential causes for a given problem. Prior work used provenance to explain errors in computational processes that derive data [18, 46]. However, to test these hypotheses and obtain complete (and accurate) explanations, new pipeline instances must be executed that vary the different components of the pipeline.

Trying all possible combinations of parameter-values leads to a combinatorial explosion of instances to execute, and therefore can be prohibitively expensive. Thus, a critical challenge lies in the design of a strategy that is provably efficient (often requiring only a linear number of pipeline executions in the number of parameters) for finding root causes. Causes of errors can include multiple parameters, each of which may have large domains. So, it is important to have clear and concise explanations in terms of the parameter values already tried.

Contributions. In this paper, we introduce BugDoc, a new approach that makes use of iteration and provenance to infer the root causes automatically and derive succinct explanations of failures in pipelines. Our contributions can be summarized as follows:

(1) BugDoc finds root causes autonomously and iteratively, intelligently selecting so-far untested combinations.
(2) We propose debugging algorithms that find root causes using fewer pipeline instances than state-of-the-art methods, avoiding unnecessary costly computations. In fact, BugDoc often finds root causes using only a number of pipeline instances linear in the number of parameters.
(3) The BugDoc system further reduces time by exploiting parallelism, and
(4) Finally, BugDoc derives concise explanations, to facilitate the tasks of human debuggers.

... defines the problem we address. In Section 4, we present algorithms to search for simple and complex causes of failures. We compare BugDoc with the state of the art in Section 5 and conclude in Section 6, where we outline directions for future work.

2 Related Work

Debugging Data and Pipelines. Recently, the problem of explaining query results and interesting features in data has received substantial attention in the literature [4, 14, 18, 39, 46]. Some have focused on explaining where and how errors occur in the data generation process [46] and which data items are most likely to be causes of relational query outputs [39, 47]. Others have attempted to use data to explain salient features in data (e.g., outliers) by discovering relationships among attribute values [4, 14, 18]. In contrast, BugDoc aims to diagnose abnormal behavior in computational pipelines that may be due to errors in data, programs, or sequencing of operations.

Previous work on pipeline debugging has focused on analyzing execution histories to identify problematic parameter settings or inputs, but such work does not iteratively infer and test new workflow instances. Bala and Chana [5] applied several machine learning algorithms to predict whether a particular pipeline instance will fail to execute in a cloud environment. The goal is to reduce the consumption of expensive resources by recommending against executing the instance if it has a high probability of failure. The system does not attempt to find the root causes of such failures. Chen et al. [12] developed a system that identifies problems by finding the differences between provenance (encoded as trees) of good and bad runs. However, in general, these differences do not necessarily identify root causes, though they often contain them.

Some systems have been developed to debug specific applications. Viska [24] helps users identify the underlying causes for performance differences for a set of configurations. Users infer hypotheses by exploring performance data and then test these hypotheses by asking questions about the causal relationships between a set of selected features and the resulting performance. Thus, Viska can be used to validate hypotheses but not identify root causes. Molly [1] combines the analysis of lineage with SAT solvers to find bugs in fault-tolerance protocols for distributed systems. It simulates failures, such as permanent crash failures, message loss, and temporary network partitions, in order to test fault-tolerance protocols over a specified period.

Although not designed for computational pipelines, Data X-Ray [46]
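
The cost argument made in the introduction (an exhaustive search over all parameter-value combinations versus a number of executions linear in the number of parameters) can be made concrete with a small sketch. This is not BugDoc's algorithm, which this excerpt does not describe; it is a simplified one-parameter-at-a-time comparison that assumes a single parameter-value is the root cause. The parameter space, the `library_version` parameter, the `run_pipeline` stand-in, and the hidden root cause are all hypothetical and chosen only for illustration.

```python
from math import prod

# Hypothetical parameter space, loosely modelled on the datasets and
# estimators named in the text; "library_version" is an invented parameter.
SPACE = {
    "dataset": ["Iris", "Digits", "Images"],
    "estimator": ["gradient boosting", "decision tree", "logistic regression"],
    "library_version": ["1.0", "2.0"],
}


def run_pipeline(instance):
    """Stand-in for an expensive pipeline run; returns True on success.

    Assumption for this sketch: the hidden root cause is
    library_version == "2.0".
    """
    return instance["library_version"] != "2.0"


def exhaustive_count(space):
    """Number of instances needed to try every parameter-value combination."""
    return prod(len(values) for values in space.values())


def one_at_a_time(good, bad, evaluate):
    """Swap each differing parameter of a good instance to the bad value.

    Executes at most one new instance per parameter, so the cost is linear
    in the number of parameters. It only detects single-parameter causes.
    """
    suspects = []
    executions = 0
    for name in good:
        if good[name] == bad[name]:
            continue  # nothing to test for this parameter
        trial = dict(good)
        trial[name] = bad[name]
        executions += 1
        if not evaluate(trial):
            suspects.append((name, bad[name]))
    return suspects, executions


good = {"dataset": "Iris", "estimator": "decision tree", "library_version": "1.0"}
bad = {"dataset": "Digits", "estimator": "gradient boosting", "library_version": "2.0"}

suspects, executions = one_at_a_time(good, bad, run_pipeline)
print(exhaustive_count(SPACE))  # 18
print(suspects, executions)     # [('library_version', '2.0')] 3
```

With three parameters the gap is small (18 versus 3 executions), but the exhaustive count grows multiplicatively with each added parameter while the one-at-a-time count grows additively. A cause that involves several parameters jointly would be missed by this sketch, which is one reason the paper distinguishes simple from complex causes.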