Iowa State University Capstones, Theses and Graduate Theses and Dissertations Dissertations

2020

Towards understanding the challenges faced by software developers and enabling automated solutions

Md Johirul Islam Iowa State University

Follow this and additional works at: https://lib.dr.iastate.edu/etd

Recommended Citation Islam, Md Johirul, "Towards understanding the challenges faced by machine learning software developers and enabling automated solutions" (2020). Graduate Theses and Dissertations. 18149. https://lib.dr.iastate.edu/etd/18149

This Dissertation is brought to you for free and open access by the Iowa State University Capstones, Theses and Dissertations at Iowa State University Digital Repository. It has been accepted for inclusion in Graduate Theses and Dissertations by an authorized administrator of Iowa State University Digital Repository. For more information, please contact [email protected]. Towards understanding the challenges faced by machine learning software developers and enabling

automated solutions

by

Md Johirul Islam

A dissertation submitted to the graduate faculty

in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Major: Computer Science

Program of Study Committee: Hridesh Rajan, Major Professor Gianfranco Ciardo Gurpur Prabhu Jin Tian Anuj Sharma

The student author, whose presentation of the scholarship herein was approved by the program of study committee, is solely responsible for the content of this dissertation. The Graduate College will ensure this dissertation is globally accessible and will not permit alterations after a degree is conferred.

Iowa State University

Ames, Iowa

2020

Copyright c Md Johirul Islam, 2020. All rights reserved. ii

DEDICATION

To my teachers, family and friends, who made me realize the real purpose of education. iii

TABLE OF CONTENTS

Page

LIST OF TABLES ...... vii

LIST OF FIGURES ...... viii

ACKNOWLEDGMENTS ...... x

ABSTRACT ...... xi

CHAPTER 1. INTRODUCTION ...... 1 1.1 Contributions ...... 2 1.2 Outline ...... 3

CHAPTER 2. RELATED WORK ...... 4 2.1 Empirical Study Using Stack Overflow ...... 4 2.2 Study of Challenges in ML ...... 5 2.3 Empirical Study of Bugs in Non-ML ...... 6 2.4 Empirical Study of Bugs in DNN ...... 6 2.5 Study of Fixing DNN Bugs ...... 7 2.6 Study of API-misuse Classification ...... 8 2.7 API Misuse Detection ...... 8 2.8 Model Testing ...... 9

CHAPTER 3. WHAT DO DEVELOPERS ASK ABOUT ML LIBRARIES? A LARGE-SCALE STUDY USING STACK OVERFLOW ...... 10 3.1 Introduction ...... 10 3.2 Methodology ...... 12 3.2.1 Classification of Questions ...... 13 3.2.2 Manual Labeling ...... 18 3.2.3 Threats to Validity ...... 19 3.3 Analysis and Results ...... 21 3.4 RQ1: Difficult stages ...... 22 3.4.1 Most difficult stage ...... 22 3.4.2 Data preparation ...... 24 3.5 RQ2: Nature of problems ...... 25 3.5.1 Type mismatch ...... 25 3.5.2 Shape mismatch ...... 26 3.5.3 Data Cleaning ...... 27 3.5.4 Model creation ...... 28 iv

3.5.5 Error/Exception ...... 29 3.5.6 Parameter selection ...... 30 3.5.7 selection ...... 31 3.5.8 Training accuracy ...... 32 3.5.9 Tuning parameter selection ...... 32 3.5.10 Correlation between libraries ...... 34 3.5.11 API Misuses in All ML Stages ...... 35 3.6 RQ3: Nature of libraries ...... 38 3.7 RQ4: Time consistency of difficulty ...... 40 3.8 Implications and Discussion ...... 42 3.8.1 Implications of RQ1 ...... 42 3.8.2 Implications of RQ2 ...... 43 3.8.3 Implications of RQ3 ...... 44 3.8.4 Implications of RQ4 ...... 44 3.9 Conclusion ...... 44

CHAPTER 4. A COMPREHENSIVE STUDY ON BUG CHARACTERISTICS 46 4.1 Introduction ...... 46 4.2 Methodology ...... 47 4.2.1 Data Collection ...... 47 4.2.2 Classification ...... 49 4.2.3 Labeling the Bugs ...... 50 4.2.4 Types of Bugs in Deep Learning Software ...... 51 4.2.5 Classification of Root Causes of Bugs ...... 54 4.2.6 Classification of Effects of Bugs ...... 56 4.3 Frequent bug types ...... 57 4.3.1 Data Bugs ...... 57 4.3.2 Structural Logic Bugs ...... 59 4.3.3 API Bugs ...... 59 4.3.4 Bugs in Github Projects ...... 60 4.4 Root Cause ...... 61 4.4.1 Incorrect Model Parameter (IPS) ...... 61 4.4.2 Structural Inefficiency (SI) ...... 61 4.4.3 Unaligned Tensor (UT) ...... 63 4.4.4 Absence of Type Checking ...... 63 4.4.5 API Change ...... 64 4.4.6 Root Causes in Github Data ...... 64 4.4.7 Relation of Root Cause with Bug Type ...... 64 4.5 Impacts from Bugs ...... 65 4.5.1 Crash ...... 66 4.5.2 Bad Performance ...... 67 4.5.3 Incorrect Functionality ...... 67 4.5.4 Effects of Bugs in Github ...... 68 4.6 Difficult Deep Learning stages ...... 68 4.6.1 Data Preparation ...... 68 4.6.2 Training Stage ...... 69 v

4.6.3 Choice of Model ...... 70 4.7 Commonality of Bug ...... 70 4.8 Evolution of Bugs ...... 72 4.8.1 Structural Logic Bugs Are Increasing ...... 73 4.8.2 Data Bugs Are Decreasing ...... 73 4.9 Threats to Validity ...... 74 4.10 Discussion ...... 75 4.11 Conclusion ...... 76

CHAPTER 5. REPAIRING DEEP NEURAL NETWORKS: FIX PATTERNS AND CHALLENGES 78 5.1 Introduction ...... 78 5.2 Methodology ...... 79 5.2.1 Dataset ...... 79 5.2.2 Bug Fix Pattern Classification ...... 80 5.2.3 Labeling ...... 85 5.3 Bug Fix Patterns ...... 85 5.3.1 Data Dimension ...... 85 5.3.2 Dimension ...... 89 5.3.3 Version-related Fixes ...... 90 5.3.4 Network Connection ...... 90 5.3.5 Add Layer ...... 91 5.3.6 Loss Function ...... 92 5.3.7 Commonality of Fix Patterns in Stack Overflow and Github ...... 94 5.4 Fix patterns across bug types ...... 94 5.5 Fix Patterns across libraries ...... 98 5.6 Introduction of Bugs Through Fixes ...... 99 5.7 Challenges in Fixing Bugs ...... 100 5.7.1 DNN Reuse ...... 102 5.7.2 Untraceable or Semi-Traceable Error ...... 102 5.7.3 Fast and Furious Releases ...... 103 5.8 Threats to Validity ...... 104 5.9 Discussion ...... 105 5.10 Conclusion ...... 106

CHAPTER 6. AMIMLA: MISUSE DETECTION FOR MACHINE LEARNING APIS ...... 108 6.1 Introduction ...... 108 6.2 Motivation: Study of ML API Misuses ...... 110 6.2.1 Data Collection ...... 110 6.2.2 ML API Misuse Classification ...... 111 6.2.3 Observations ...... 113 6.2.4 Threats to Validity ...... 114 6.3 Amimla: ML API Misuse Detection ...... 115 6.3.1 Abstract Neural Network (ANN) ...... 116 6.3.2 Abstract ML Pipeline ...... 118 6.3.3 API Canonical Form ...... 120 6.3.4 Storing Good usage API Canonical form and abstract ML pipeline ...... 121 vi

6.3.5 Collecting Constraints from Stack Overflow ...... 122 6.3.6 Storing Fully Qualified Method Name of APIs ...... 123 6.3.7 Detection ...... 124 6.4 Empirical Evaluation ...... 129 6.4.1 Stack Overflow (SO) Misuse Dataset ...... 129 6.4.2 Github (GH) Misuse Dataset ...... 130 6.4.3 Accuracy Metrics ...... 131 6.4.4 ML API Misuse Detection Accuracy ...... 131 6.4.5 Analyzing Classification of Detection Results ...... 132 6.4.6 Usefulness Analysis ...... 132 6.5 Conclusion ...... 133

CHAPTER 7. CONCLUSION AND FUTURE WORK ...... 134 7.1 Future Work ...... 135 7.1.1 ML Pipeline Design ...... 135 7.1.2 ML Specific Static and Dynamic Analysis Tools ...... 135 7.1.3 ML API Design and Abstraction ...... 135 7.1.4 ML Specific Debuggers ...... 136 7.1.5 Automatic Detection of ML Bugs ...... 136 7.1.6 Automatic Program Repair ...... 136 7.1.7 ML Specific Intermediate Representation (IR) for Enabling ML Program Analysis . 136 7.1.8 Software Testing for ML Software ...... 137 7.1.9 Reuse of ML Models and Versioning ...... 137

BIBLIOGRAPHY ...... 138 vii

LIST OF TABLES

Page Table 3.1 Numbers of posts having different score (S) about ML libraries. The bold column represents selected posts. S = |NU | − |ND| where |NU | is the number of upvotes and |ND| is the number of downvotes...... 12 Table 3.2 Reputation of users for posts in our study...... 21 Table 3.3 Percentage of questions in each top-level category across libraries (in %). . . 21 Table 3.4 Percentage of questions in each subcategory across libraries (in %)...... 22 Table 3.5 Library specific reputation in each top-level category across libraries (in me- dian)...... 29 Table 3.6 Library specific reputation in each subcategory across libraries (in median). . 30 Table 3.7 Number of occurrences utilizing the libraries in Github...... 40 Table 4.1 Summary of the dataset used in the Study ...... 48 Table 4.2 Statistics of Bug Types in Stack Overflow and Github ...... 60 Table 4.3 Statistics of the Root Causes of Bugs ...... 77 Table 4.4 Effects of Bugs in Stack Overflow and Github ...... 77 Table 5.1 Summary of the bug repair dataset...... 80 Table 5.2 Summary of the bug fix patterns...... 81 Table 5.3 Bug Fixes in Stack Overflow (SO) and Github (GH) ...... 86 Table 5.4 P-value of the distribution of Bugs btween the libraries ...... 99 Table 5.5 Statistics of the Introduction of New Bugs During Bug Fix ...... 100 Table 5.6 Tensorflow API changes. Change= # of operations changed in comparison to the previous version...... 104 Table 6.1 Classification of Machine Learning API Misuses...... 110 Table 6.2 Detection accuracy...... 129 Table 6.3 Detection results by ML pipeline stages...... 129 Table 6.4 Detection results by program elements...... 130 Table 6.5 Impact Analysis...... 130 Table 6.6 Not-yet-known misuse detection result...... 133 viii

LIST OF FIGURES

Page Figure 3.1 Stages in a typical ML pipeline, based on [1]...... 15 Figure 3.2 Classification used for categorizing ML library-related Stack Overflow questions for further analysis ...... 15 Figure 3.3 Cohen’s kappa coefficients for labeling process...... 19 Figure 3.4 Question 40430186: An example showing dimension or shape mismatch problem in training in ML...... 26 Figure 3.5 Question 12319454: An example question on model creation for distributed ML using Mahout...... 28 Figure 3.6 Question 45030966: An example question about showing abstraction in deep learning libraries could make identifying root cause of an error/exception difficult...... 31 Figure 3.7 scikit-learn issue #4800: An example of hyperparameter tuning problem. The user filed a bug report, but a developer of the library responded that the prob- lem was with hyperparameter tuning...... 34 Figure 3.8 Correlation between distributions of percentage of questions over stages of the libraries...... 34 Figure 3.9 Question 24617356: An example showing the API misuse problem in ML libraries. Code snippets are omitted...... 36 Figure 3.10 Difficulties over time, across different stages ...... 41 Figure 4.1 Distribution of Bug Types in Stack Overflow ...... 58 Figure 4.2 Stack Overflow Root Cause Classification ...... 62 Figure 4.3 Relation between Root Causes and Types of Bugs ...... 65 Figure 4.4 Distribution of Bug Effects in Stack Overflow ...... 66 Figure 4.5 Bugs across stages of the Deep Learning pipeline ...... 69 Figure 4.6 Correlation of Bug Types among the libraries ...... 71 Figure 4.7 Distribution of different antipatterns ...... 72 Figure 4.8 Example of similar antipattern in Tensorflow and Keras ...... 73 Figure 4.9 Timeline of Evolution of Bugs ...... 74 Figure 5.1 Bug fix pattern distribution ...... 86 Figure 5.2 Distribution of Bug Fix Patterns for Different Bug Types Stack Overflow .. 95 Figure 5.3 Distribution of Bug Fix Patterns for Different Bug Types Github ...... 96 Figure 5.4 Fix of Stack Overflow #49742061 ...... 97 Figure 5.5 Fix of Stack Overflow #54497130 ...... 97 Figure 5.6 Bug fix pattern distribution: SB.P→SB.Processing, SB.L→SB.Logic, DF→Data Flow,SB.I→SB.Initialization, ATC→Absence of Type Checking, BP→Bad Performance, IF→Incorrect Functionality ...... 101 Figure 6.1 Overview of our detection approach and Amimla tool...... 117 Figure 6.2 Code excerpt from Stack Overflow ...... 118 Figure 6.3 Abstract Neural Network ...... 118 ix

Figure 6.4 Bugfix due to wrong fully qualified method names. Red color indicates the deleted code and green color indicates the added code...... 124 x

ACKNOWLEDGMENTS

I would like to take this opportunity to express my thanks to those who helped me with various aspects of conducting research and the writing of this thesis. I would like to thank Dr. Hridesh Rajan for his guidance, patience, and support throughout this research and the writing of this thesis. I am thankful to the US National

Science Foundation for financially supporting this project under grants CCF-15-18897 and CNS-15-13263.

I would like to thank my committee members Dr. Gianfranco Ciardo, Dr. Gurpur M. Prabhu, Dr. Jin Tian, and Dr. Anuj Sharma for their efforts and contributions to this work. I would like to extend my thanks to all the members of the Laboratory of Software Design for offering constructive feedback and timely suggestions during research. I would like the convey my special thanks to Hamid Bagheri, Gian Nguyen for helping me in data collection and manual labeling. I would also like to thank Dr. Hoan Nguyen to suggest improvements in my works and helping me in editing the papers.

The majority of this draft is adopted from the prior peer-reviewed papers in top tier software engineering venues. Chapter3 is based on our paper [2] which is currently under revision for resubmission as per the reviewers feedback. This thesis contains the updated revision of the draft. Chapter4 is based on our accepted paper [3] at ESEC/FSE 2019. Chapter5 is based on accepted paper [4] at ICSE 2020. Chapter6 is based on our draft submitted to FSE 2020 and is being revised for resubmission. This thesis contains a revised version of the draft addressing the most major comments from the reviewers.

In this thesis, I do not include some of the other papers that I worked on during my Ph.D. For example,

I got the chance to work on extending Boa [5] and show its effectiveness in the analyses of large scale transportation data, and our paper was accepted at the Journal of Big Data Analytics in Transportation [6]. I also got the chance to collaborate in creating a large scale dataset for analyzing data science programs that have been accepted at the data showcase of MSR 2019 [7]. I also got the chance to work on several projects related to the analyses of ML models, adversarial attacks to ML, and runtime monitoring of Deep Neural

Network models [8]. xi

ABSTRACT

Modern software systems are increasingly including machine learning (ML) as an integral component.

However, we do not yet understand the difficulties faced by software developers when learning about ML libraries and using them within their systems. To fill that gap this thesis reports on a detailed (manual) examination of 3,243 highly-rated Q&A posts related to ten ML libraries, namely Tensorflow, Keras, scikit- learn, Weka, , , MLlib, , Mahout, and H2O, on Stack Overflow, a popular online technical

Q&A forum. Our findings reveal the urgent need for software engineering (SE) research in this area.

The second part of the thesis particularly focuses on understanding the Deep Neural Network (DNN) bug characteristics. We study 2,716 high-quality posts from Stack Overflow and 500 bug fix commits from

Github about five popular deep learning libraries Caffe, Keras, Tensorflow, Theano, and Torch to understand the types of bugs, their root causes and impacts, bug-prone stage of deep learning pipeline as well as whether there are some common antipatterns found in this buggy software.

While exploring the bug characteristics, our findings imply that repairing software that uses DNNs is one such unmistakable SE need where automated tools could be beneficial; however, we do not fully understand challenges to repairing and patterns that are utilized when manually repairing DNNs. So, the third part of this thesis presents a comprehensive study of bug fix patterns to address these questions. We have studied

415 repairs from Stack Overflow and 555 repairs from Github for five popular deep learning libraries Caffe,

Keras, Tensorflow, Theano, and Torch to understand challenges in repairs and bug repair patterns. Our key

findings reveal that DNN bug fix patterns are distinctive compared to traditional bug fix patterns and the most common bug fix patterns are fixing data dimension and neural network connectivity.

Finally, we propose an automatic technique to detect ML Application Programming Interface (API) misuses. We started with an empirical study to understand ML API misuses. Our study shows that ML

API misuse is prevalent and distinct compared to non-ML API misuses. Inspired by these findings, we contributed Amimla (Api Misuse In Machine Learning Apis) an approach and a tool for ML API misuse xii detection. Amimla relies on several technical innovations. First, we proposed an abstract representation of ML pipelines to use in misuse detection. Second, we proposed an abstract representation of neural networks for deep learning related APIs. Third, we have developed a representation strategy for constraints on ML APIs. Finally, we have developed a misuse detection strategy for both single and multi-APIs. Our experimental evaluation shows that Amimla achieves a high average accuracy of ∼80% on two benchmarks of misuses from Stack Overflow and Github. 1

CHAPTER 1. INTRODUCTION

Machine Learning (ML) is becoming an essential component of modern software engineering. More- over, its usage in safety-critical systems like self-driving cars, automatic traffic and flight control, robotic surgery, etc. is on the rise. Software developers use highly abstract machine learning libraries to write ML programs in their tools and applications. A large number of ML libraries are publicly available to support the increasing usage of ML in the development of software. For example, some of the popular libraries are

Caffe [9], H2O [10], Keras [11], Mahout [12], MLlib [13], scikit-learn [14], Tensorflow [15], Theano [16],

Torch [17], and Weka [18]. All these libraries abstract away the details of ML algorithms and provide Ap- plication Programming Interfaces (API) to the developers. Therefore, developers can easily write complex

ML models using these APIs. This has given rise to the discussions related to the trustworthiness, correct- ness, robustness and performance of the models, characteristics of bugs in the ML programs, developers challenges, bug fix patterns, etc.

Therefore, it has become an urgent need to study the challenges faced by the developers in developing

ML applications to pave the research for solving the ML developers’ challenges. To the best of our knowl- edge, no such studies existed prior to the work in this thesis. So, to fill in this gap we studied the challenges faced by ML developers in using the ML libraries to develop software, the characteristics of the bugs made by the deep learning software developers, the patterns followed by the developers in fixing those bugs, and we also proposed a novel technique to automatically detect API misuses in ML software.

Our key findings and results show that ML developers face most challenges in the model creation stage of the ML pipeline; developing deep learning applications are highly bug-prone with new patterns of bugs;

fixing deep learning applications have several new and unique patterns along with new challenges. We also showed that a novel ANN (Abstract Neural Network) based API misuse detector is more effective in ML

API misuse detection compared to the state of the art prior works in other domains of software engineering. 2

1.1 Contributions

In this thesis, we provide several studies to understand the challenges of ML software developers and propose an automatic technique to detect one of the bugs ML developers are likely to make. Our specific contributions are:

In chapter3, we provide an empirical study to understand the difficulties faced by ML developers in developing ML applications. We used Stack Overflow to understand the difficulties by analyzing the dis- cussions for the 10 most popular ML libraries. We studied 3,243 highly-rated posts from Stack Overflow manually and conduct thorough intra-library and inter-library analyses to answer five different research questions identifying the challenges faced by the developers.

In chapter4, we conducted another empirical study to understand the characteristics of bugs made by the developers in developing Deep Neural Network (DNN) software. We have studied 2,716 qualified Stack

Overflow posts in five different deep learning libraries and done both qualitative and quantitative analyses to answer five different research questions. We have found several unique bug patterns in DNN that needs innovations from software engineering research.

In chapter5, we studied the patterns of repairing DNN programs by the developers. We have studied the 415 bugs from Stack Overflow and 555 bugs from Github from chapter4 and manually analyzed how those bugs are fixed to identify the fix patterns. We also answered five different research questions through detailed qualitative and qualitative analyses. We have reported several newly appearing patterns for fixing

DNN applications and we have also identified some new challenges in fixing DNN software.

In chapter6, we proposed an automatic API misuse detector for ML API misuses made by the develop- ers. We proposed a novel intermediate representation that is very effective in fixing the DNN bugs related to API misuse. We have shown that our approach outperforms the state of the art API misuse detectors in detecting bugs in ML. We have also shown the effectiveness of our approach by fixing some real-world bugs in the repositories by industry leaders like and IBM.

We also contributed four different publicly available manually labeled datasets that have the potential to advance the research related to fixing the problems faced by the DNN developers and enable follow up studies. 3

1.2 Outline

In the next chapter, we provide a detailed literature review on empirical studies conducted for under- standing challenges in other domains of software engineering, bug patterns, and program repair patterns in non-ML fields, API misuse, and API misuse detection techniques in non-ML domains. In chapter3, we provide a large study on the challenges faced by the developers using Stack Overflow and we answer several research questions through qualitative and quantitative analyses of data. In chapter4, we provide a comprehensive study on the bug characteristics made by the DNN software developers. In chapter5, we identify and study the program repair patterns for DNN programs and describe several unique DNN specific program patterns along with some new challenges. In chapter6, we propose Amimla, an automatic tech- nique to detect the API misuses by ML programmers. We evaluate Amimla with a state of the art technique and have demonstrated our approach is effective in detecting API misuses by detecting and fixing some API misuses by industry leaders like Google and IBM. Finally, in chapter7, we provide a conclusion and several new directions of research evolving from this thesis. 4

CHAPTER 2. RELATED WORK

2.1 Empirical Study Using Stack Overflow

Stack Overflow is the widely used platform to study software engineering practice from the developer’s perspective. However, existing work has not studied the usage of ML libraries using Stack Overflow. Mel- drum et. al. [19] studied 266 papers using Stack Overflow platforms to show the growing impact of Stack

Overflow on software engineering research. Treude et. al. [20] did manual labeling of 385 questions to manually classify 385 questions into 10 different categories (how-to, discrepancy, environment, error, deci- sion help, conceptual, review, non-functional, novice, and noise) to identify the question types. This study is useful to learn the general categories of questions asked by developers.

Kavaler et. al. [21] used Stack Overflow data to study the queries on APIs used by Android developers and showed the correlation between APIs used in producing Apps in the market and the questions on APIs asked by developers. Linares-Vásquez et. al. [22] studied the effect of the changes in Android API on the developer community. They used the discussions arising on Stack Overflow immediately after the API is changed and the behavior of the API is modified to study the impact of the change among the developers.

Berral-García et. al. [23] studied machine learning-based algorithms, approaches, execution frameworks and presented a brief discussion of some libraries used in machine learning. Barua et. al. [24] studied Stack

Overflow posts and used LDA topic modeling to extract topics to study the trend of different topics over time.

Rebouças et. al. [25] studied the usage pattern of swift programming language among developers using

Stack Overflow data.

Schenk et. al. [26] studied the geographical distribution of usage and knowledge of different skills using Stack Overflow posts and users data. Stanley et. al. [27] proposed a technique based on the Bayesian probabilistic model to predict the tags of a Stack Overflow post. McDonnel et. al. [28] presented a study of

API stability using Stack Overflow data and as a test case, they used the Android Ecosystem. Baltadzhieva 5 et. al. [29] proposed a technique to predict the quality of a new Stack Overflow question. Joorabchi et. al. [30] studied the challenges faced by computer science learners in different topics and subjects using the

Stack Overflow data.

2.2 Study of Challenges in ML

The authors in [31, 32] have explored some major reasons of technical debts in ML and claimed that the quick advances in ML are not coming out for free. The authors described that these technical debts are caused by complex models, data dependencies, feedback loops, and ML anti-patterns. The authors described these observations based on their experience and discussion with the researchers and developers at Google.

On the other hand, in chapter3 we conduct a large-scale study using the problems discussed by the developers in a well-trusted Q&A forum. The difficulties faced by the developers using different libraries and creating different types of ML models enhance the coverage of this study. Our findings also comply with the discussions on technical debts made by [31, 32] on model complexity and data dependency.

Eric et. al. [33] studied software from Google and built a point-based testing framework that checks the data, features, and model development process. Our key findings also suggest that model creation and data adaptation are the most prevalent challenges across all ML libraries we have studied. Our study has focused on other aspects of ML as well e.g., challenges faced in different stages of ML pipeline, commonalities and variabilities of the difficulties faced by the developers in the ten ML libraries, and the evolution of different challenges over the years, etc.

A more relevant study has been conducted by [34] that focuses on the problems faced by developers in

Microsoft in different stages of the ML pipeline. This work focus on processes and pipelines in ML. On the contrary, we focus on APIs and cross library comparison.

Their findings show that data availability, collection, cleaning, and management and Model evolution, evaluation, and deployement are some key problems faced by the professionals. However, the analyses presented in this thesis supports the difficulties observed at the industry level by [34] as well present a number of new findings e.g., type mismatches are prevalent, shape mismatch mostly occurs in Keras, most issues in ML code leads to failure, parameter selection, choice of loss function are a difficult task for most 6 of the ML developers, the commonality and variability of the 10 libraries in terms of issues found, etc. One other major contrast with [34] is, the authors in [34] studied ML pipeline and workflow using a survey, but we study the ML libraries using Q&A forum.

2.3 Empirical Study of Bugs in Non-ML

There are some empirical studies focused on specific types of bugs. Lu et. al. [35] studied real-world concurrency bug characteristics. Gao et. al. [36] conducted an empirical study on recovery bugs in large- scale distributed systems. API changes problems was studied by [37–40]. Our work focuses on the bugs in the usage of deep learning libraries.

Other prior work that have studied Stack Overflow posts, e.g. [19, 21, 22, 24], have not focused on deep learning software.

Pan, Kim, and Whitehead [41] have studied seven Java projects to discuss the bug fix patterns in these projects. They have also proposed a classification scheme in categorizing the bug fix patterns in Java.

The classification includes 9 broad categories and a total of 26 lower-level categories. This prior research suggests that the IF-related and Method Call (MC) related bugs are most frequent. In DNN bug fix strategies, the MC and sequence addition or deletion related bug fix pattern is present. We do not find any evidence of other bug fix strategies in DNN and that has inspired us to derive a classification scheme using the open coding approach to classify the DNN bug fix patterns.

Programming bugs are well-studied in software engineering. There is a rich body of empirical studies on bugs, e.g. [35, 42–47] and bug repair, e.g. [41, 48–50]; however, these works have not studied DNN bugs and repairs that have their own set of unique challenges [2,3, 51, 52].

2.4 Empirical Study of Bugs in DNN

The closest related work on deep learning bug characteristics is by Zhang et al. [51] who have investi- gated bugs from deep learning applications built on top of Tensorflow. They collected 500 Stack Overflow posts and filtered them to study 87 posts and selected 11 Github projects to include 82 commits with 88 bugs using keywords e.g., bug, fix, wrong, etc. In contrast, we studied a cross-section of five deep learning 7 libraries with different design constraints, using 555 bugs from GitHub and 415 Stack Overflow posts, that allowed us to draw interlibrary observations.

Zhang et al. have studied the bugs and have categorized them into 7 types of bug/root causes and 4 types of impacts/symptoms. Our work expanded the study to include bug types from literature and categorizes the bugs into 11 bug types, 10 root causes, and, 7 impacts. In term of results, our study both confirms what was known as a small scale and produces new knowledge e.g., correlating antipatterns with bugs [53].

Thung et al. [52] studied three machine learning systems, Apache Mahout, Lucene, and OpenNLP, and manually categorize the bugs into different categories. They focused on bug frequencies, bug types, the severity of the bug, bug-fixing duration, bug-fixing effort, and bug impact. Different from them, we focus on bug types, bug root causes, and bug impact of five deep learning libraries which are Tensorflow, Keras,

Torch, Caffe, and Theano.

2.5 Study of Fixing DNN Bugs

Zhang et al. [51] have studied bug patterns in Tensorflow using both Github and Stack Overflow. They have discussed the new patterns and characteristics of the bugs by Tensorflow users to write DNN applica- tions. They have also discussed the three new challenges in detecting and localizing these bugs. The study was limited to Tensorflow and also does not discuss the bug fix patterns. We generalize to a number of deep learning libraries and identify the new patterns of fixing the bugs in DNN software. We also discuss the new challenges in fixing these bugs. We have discussed three new challenges in fixing DNN bugs.

Sun et al. [54] studied the issues in 3 ML libraries scikit-learn, Caffe, and paddle to understand the bug fix patterns in these libraries. However, we study the DNN models created using DNN libraries. Our

findings do not have any commonalities.

Zhang et al. [55] studied 715 Stack Overflow bug related posts for TensorFlow, PyTorch, and Deeplearn- ing4j to classify the questions into 7 different categories and built an automated tool that categorizes ques- tions based on the frequently found words from each category and computing the tf-idf value with respect to the keywords. Also, the authors have studied the challenges of answering the question in Stack Overflow 8 by calculating the response time for each category and have found 5 categories of root causes for the bug related posts. Whereas, our study has been on the bug fixing strategies for 5 DNN libraries.

Pham et al. [56] studied three deep learning (DL) libraries Tensorflow, CNTK, and Theano to localize and detect deep learning bugs using cross-validating different backend i.e., Tensorflow, CNTK. In contrast, our work studied five DL libraries, using bug-fix commits from Github and Stack Overflow posts, that allowed us to draw interlibrary observations of the fixing strategies. Also, our work expanded the study to include deeper discussion about each fix pattern, the common challenges developers face while fixing these bugs, and how fixing bugs introduces a new bug.

2.6 Study of API-misuse Classification

Classification of API-misuse from different perspectives and domains have been done previously. Clas- sification of software defects by IEEE served as the basis for IBM’s orthogonal defect classification (ODC) as discussed in [57]. The defect types include conceptual program elements which are function, check, doc- umentation, assignment, algorithm and violation type. More recently, [58] classified API-misuse for Java programming language.

The classification includes API-misuses like method calls, conditions, iterations, and exception han- dling. None of them support misuse detection for ML APIs.

2.7 API Misuse Detection

Amimla is the most closely related to static approaches for API-misuse detection. They typically mine frequent API usages, which are considered as good usages or usage patterns, and use them to detect misuses which are usages violating those patterns.

Ganter et. al. [59] extract call sequences from programs to perform formal concept analysis to analyze pairwise temporal properties of API calls as described by [60]. There has been works to infer tempo- ral properties of API calls described in the works by [61–66]. There have also been works to mine and model program as itemsets as described by [67–69] and infer pairwise programming rules by frequent item- set mining. [70] mines programs from Github and guard conditions against the predicates of the APIs. 9

GROUMiner [71] uses graphs to represent API usages and patterns, and detect misuses. Amimla uses min- ing for building the API canonical forms. Different from those works, we manually create the constraints of

APIs from documentation and discussions of API usages. More importantly, Amimla models the network layers in neural network APIs with ANNs and API sequences as ML pipelines. Amimla uses ANNs to detect incompatibility between layers, and ML pipelines to detect incompatibility between stages.

2.8 Model Testing

Recent work has studied methods to validate ML models. For example, [72] used concolic testing to detect safety problems in deep neural networks. Specifically, they increased the test coverage by formalizing coverage criteria for DNNs. [73] used SMT to create a verification framework for feed-forward-multi-layer neural networks. This work supports deep learning developers by validating image classification decisions.

[74] created a technique to verify the reachability properties of deep neural networks based on the simplex method. They have extended the simplex method to detect the reachable state’s models with ReLU . [75] focuses on the robustness of DNNs by studying a generic reachability problem for feed- forward DNNs. Compared to these works, the focus of Amimla is on ML API misuse detection. 10

CHAPTER 3. WHAT DO DEVELOPERS ASK ABOUT ML LIBRARIES? A LARGE-SCALE STUDY USING STACK OVERFLOW

3.1 Introduction

Machine learning (ML) is becoming an essential computational tool in a software developer’s toolbox for solving problems that defy traditional algorithmic approaches. Software developers are fulfilling this need by development and refinement of a number of new ML libraries [76]. Recently it has also been suggested that ML can introduce unique software development problems [31–33]. However, we do not yet know about the problems that users of ML libraries face and those that they choose to ask about publicly.

Prior work has shown that studying question and answer (Q&A) forums such as Stack Overflow can give significant insights into software developer’s concerns about a technology [20–22, 24–30, 77–81], but has not focused on ML libraries. More details of related work are discussed in chapter2.

This work presents a study of the problems faced by developers while using popular ML libraries.

Our study also leverages the posts on Stack Overflow. Since 2015, there has been growing interest and significant increase in ML related questions and distinct users making Stack Overflow a representative source of dataset for our study. We selected 10 ML libraries to study, identified by a survey [76] and confirmed by counting the number of posts on Stack Overflow related to those libraries. These libraries are Caffe [9],

H2O [10], Keras [11], Mahout [12], MLlib [13], scikit-learn [14], Tensorflow [15], Theano [16], Torch [17], and Weka [18].

Caffe [9] is a deep learning library for Python and C++. H2O [10] is a deep learning library for Java, R,

Python or Scala, and its key feature is to provide a workflow-like system for building ML models. Keras [11] is a deep learning library for Python whose key feature is to provide higher-level abstractions to make creating neural networks easier. Keras also uses Tensorflow or Theano as the backend. Mahout [12] is aimed at providing scalable ML facilities for Hadoop clusters. MLlib [13] is aimed at providing scalable

ML facilities for Spark clusters. scikit-learn [14] is a Python library that uses Tensorflow or Theano as 11 the backend. This library provides a rich set of abstract APIs [82] to hide the complexity of ML from the user in an effort to make ML features widely accessible. Tensorflow [15] provides facilities to represent a ML model as data flow graphs. Theano [16] and Torch [17] are aimed at scaling ML algorithms using

GPU computing. A novelty of Theano is that it provides some self-verification and unit testing to diagnose some runtime errors. Weka [18] is an ML library for Java. It provides API support for data preparation, classification, regression, clustering, and association rules mining tasks and a GUI for making models easier.

All in all, this set is both representative and provides variety. We selected a total of 3,243 highly-rated

Stack Overflow posts for this study. A team of three Ph.D. students, with experience in coursework on AI and ML, and using ML libraries, independently read and labeled each of the posts producing 9,849 labels that were then compared for consistency producing 177 conflicting labels on 177 different posts. All of these conflicts were resolved using mediated, face-to-face conflict resolution meetings between all three participants. We then performed statistical analysis and a study of the data to answer the following research questions:

RQ1: Difficult stage Which stages are more difficult in an ML pipeline? Figure 3.1 shows stages in a typical ML pipeline.

RQ2: Nature of problems Which problems are more specific to the library and which are inherent to ML?

RQ3: Nature of libraries Which libraries face problems in specific stages and which ones face difficulties in all stages?

RQ4: Consistency Did the problems stay consistent over time?

The remainder of this chapter describes our study and results and makes the following contributions: (1) a labeled and verified, a dataset of 3,243 ML library-related Q&A on Stack Overflow, (2) a classification scheme for ML-related Q&A, (3) an intra-library analysis to identify the strengths and weaknesses of ML libraries, and (4) an inter-library analysis to identify relative strengths and weaknesses.

3.2 Methodology

Our study uses Q&A posts on Stack Overflow, a popular platform used by developers. Our first step was to find the total number of questions asked about all the ML libraries highlighted by some recent 12

Table 3.1: Numbers of posts having different score (S) about ML libraries. The bold column represents selected posts. S = |NU |−|ND| where |NU | is the number of upvotes and |ND| is the number of downvotes.

Library S ≥ 0 S ≥ 1 S ≥ 2 S ≥ 3 S ≥ 4 S ≥ 5 Caffe [9] 2,339 1,320 620 318 192 132 H2O [10] 771 452 167 73 34 17 Keras [11] 5,708 3,323 1,751 953 568 367 Mahout [12] 1,186 610 293 160 103 48 MLlib [13] 1,688 929 498 272 173 119 scikit-learn [14] 9,246 5,302 2,898 1,759 1,188 856 Tensorflow [15] 21,115 10,109 4,962 2,769 1,827 1,334 Theano [16] 2,332 1,341 711 421 265 192 Torch [17] 1,226 640 312 161 91 61 Weka [18] 2,512 1,216 568 293 181 117 Total 48,123 25,242 12,780 7,179 4,622 3,243 surveys [76, 83, 84]. Out of these, we selected 10 popular ML libraries for the study as shown in Table 5.1.

We excluded the other five libraries because the numbers of questions about them were too few (less than

20).

On Stack Overflow, each question is rated by the community. The score of a question is computed as

S = |NU | − |ND| where |NU | is the number of upvotes and |ND| is the number of downvotes. The higher score is an indicator of the higher quality of the question, which has been used by prior works such as [19].

Table 5.1 shows the entire distribution of the questions for each library based on the score of S.

We selected questions with the score of 5 or higher (bold column in Table 5.1) to focus on high-quality questions while keeping the workload of manually labeling each question manageable. Note that in Stack

Overflow duplicate questions are not allowed, and do not receive upvotes because moderators mark them as duplicate. So, duplicate questions are eliminated when we use the score as quality metric.

Next, we manually classified each Stack Overflow question into categories to study them further. We

first discuss the classification of categories and then our labeling process.

3.2.1 Classification of Questions

We classify the questions in Stack Overflow into several categories. First, we classify the questions into two top-level categories based on whether the question is related to ML or not. Questions related to instal- 13 lation problems, dependency, platform incompatibility, Non-ML APIs, overriding the built-in functionality, adding custom functionality fall into Non-ML category as shown in Figure 3.2. We classify the ML-related questions into six categories based on the stages of a typical ML pipeline [1], also reproduced in Figure 3.1.

Among those seven stages, data collection is out of the scope of this study because ML libraries do not provide this functionality, which leaves us with six categories. These six categories are further divided into different subcategories. To find these subcategories one of the Ph.D. students with ML expertise first studied

50% of posts and created the subcategories using open coding scheme adapted from earlier works [85–87].

These labelings were not exposed to the manual raters as discussed later. The goal of this step was to find the coding scheme needed to label the difficulties using an open coding approach. Then these subcategories were sent to three ML experts for review. The three ML experts are faculties who are actively involved in Artificial Intelligence and Machine Learning research. Based on the review of the experts, the subcate- gories were improved and the process continued until an agreement with ML experts was reached. Once, we reached an agreement on the classification scheme, each rater rated all the posts using this classification scheme.

Then, we have reported the results for each library in terms of the number of posts. We have utilized the difficulty of each category as the number of posts labeled as that particular category. To validate that the number of posts is a representative metric of the difficulty in Stack Overflow, we have studied a prior work [88] that has identified the different bugs found in Java and Python. Then, we have counted the posts with tags i.e., for Java with performance-related issues, we have counted the number of posts under the tag ‘performance‘ and ‘java‘. We have found that there is a correlation between the number of posts about an issue and the difficulty faced by the developers. According to the authors, the top 3 problems faced by the Java and Python developers are Performance, Memory, and Security respectively. We have found that the count of posts also reflects this difficulty in Java and Python (Performance (3260)> Memory (2093)>

Security (1870)). Our study indicates that the number of posts can be utilized as a representation of the challenges found in the prior study. The full classification is shown in Figure 3.2. We keep the Non-ML as a separate category as it covers a major share of code in ML projects. The authors in [31] discussed that only

5% of the code in an ML project is related. Next, we describe its categories. 14

Data Data Modelling Training Evaluation Collection Preparation

Prediction Parameter Tuning

Figure 3.1: Stages in a typical ML pipeline, based on [1].

3.2.1.1 Data Preparation

This top-level category includes questions about converting the raw data into the input data format needed by the ML library.

Data adaption. Questions under this subcategory are about reading raw data into the suitable data format required by the library. Data reader provided by the library usually provides this functionality.

Questions about converting data, encoding, etc., also fall under this subcategory.

Featuring. Questions under this subcategory are about feature extraction and selection. Feature ex- traction is a process to reduce the dimensionality of the data where existing features are transformed into a lower-dimensional space. Feature selection is another strategy of where informa- tive features that have an impact on the model are selected.

Type mismatch. Type mismatch happens when the type of data provided by the user doesn’t match the type required by the ML API. For example, if an API needs floating point data as input but the client provides a String then the ML API raises an exception.

Shape mismatch. Shape mismatch occurs when the dimension of the tensor or matrix provided by a layer doesn’t match the dimension needed by the next layer. These kinds of errors are very common in deep learning libraries.

Data cleaning. Data cleaning phase, sometimes also called data wrangling, includes removal of null values, handling missing values, encoding data, etc. Without proper data cleaning the training may throw exceptions, and accuracy may be suboptimal. 15

Non-ML ML

Data Preparation Modelling Training Evaluation Tuning Prediction

Bug Setup Accuracy Featuring Optimizer Robustness Non ML API Model reuse Performance Loss function Custom Code Data cleaning Data adaption Model creation Type mismatch Error/Exception Model selection Model load/store Shape mismatch Method selection Model conversion Strategy selection Model visualization Prediction accuracy Parameter selection Parameter selection Output interpretation

Figure 3.2: Classification used for categorizing ML library-related Stack Overflow questions for further analysis

3.2.1.2 Modelling

The subcategories of this category include:

Model selection. This subcategory includes questions related to the choice of the best model and choice of the API version (e.g. whether to chose SVM or decision tree).

Model creation. This subcategory includes questions related to creating the ML model using the APIs.

Model conversion. This subcategory includes questions related to conversion of a model trained using one library and then using the trained model for prediction in an environment using another library. For example, a model trained in Torch can be used for further training or prediction using Theano.

Model load/store. This subcategory contains questions about storing models to disk and loading them to use later.

3.2.1.3 Training

The subcategories of this category include:

Error/Exception. Questions about errors faced by users in the training phase fall into this subcategory.

The errors may appear due to various reasons. If the errors are due to shape mismatch or type mismatch we put them into the data preparation category. Otherwise, all errors are placed into this subcategory.

Parameter selection. Some frameworks have optional parameters, and developers have to choose appropriate values for these parameters and also pass relevant values to the required parameters. Questions related to these problems fall into this subcategory. 16

Loss function. Questions related to choosing and creating loss functions fall into this category, e.g., whether to use cosine distance.

Optimizer. Questions related to the choice of optimizer are placed into this subcategory, e.g., should I pick Adam or AdaGrad?

Performance. In this subcategory, questions related to long training time and/or high memory con- sumptions are placed.

Accuracy. Questions related to training accuracy and/or convergence are placed into this subcategory.

3.2.1.4 Evaluation

The subcategories of this category include:

Evaluation method selection. In ML related problems, a model needs to be evaluated against different techniques and the choice of the technique has an impact on both achieving the desired performance and the intent of the developer [89] e.g. “which of the eight APIs for eight different types of validations in scikit- learn, namely KFold, LeaveOneOut, StratifiedKFold, RepeatedStratifiedKFold, RepeatedKFold, LeaveOne-

GroupOut, GroupKFold and ShuffleSplit, should be used?”

Visualizing model learning. The developers sometimes need to visualize the behavior of the model to get a better understanding of the training process and also to know the effects of evaluation on the change of loss function and accuracy. Those questions are placed in this subcategory.

3.2.1.5 Hyper-parameter Tuning

Hyperparameter tuning is used to improve the model’s performance. The values of hyperparameters affect model accuracy. For example, a bad may cause a model to learn poorly and give low accuracy. The subcategories of this category include:

Tuning strategy selection. Questions about choosing among APIs for different tuning methodologies are placed into this subcategory. For example, one poster wondered whether they should use the grid search or randomized search or parameter sampling for parameter tuning in scikit-learn? 17

Tuning parameter selection. This subcategory covers discussions related to the selection of parameters for tuning. Some parameters may not affect the model accuracy other than increasing the training time while some might have a significant effect on the accuracy. For example, the following code from a post is trying to tune the kernel and C parameter of the ML algorithm to find the best combination from values given at line 4.

1 from sklearn import svm, datasets

2 from sklearn.model_selection import GridSearchCV

3 iris= datasets.load_iris()

4 parameters={’kernel’:(’linear’,’rbf’),’C’:[1,10]}

5 svc= svm.SVC()

6 clf= GridSearchCV(svc, parameters)

3.2.1.6 Prediction

After the model is trained and evaluated, the model is used to predict new input data. Questions in this top-level category are about problems faced by the developers during prediction and include the following subcategories.

Prediction accuracy. Questions related to prediction accuracy, e.g. due to overfitting, are placed into this category.

Model reuse. Developers might have difficulty in reusing existing models with their datasets for predic- tion to make use of the state of the art models from well-known providers.

Robustness. Questions in this subcategory are about the stability of the models with slight changes, possibly noise, in the datasets.

3.2.2 Manual Labeling

Manual labeling of the Q&A dataset was the most important (and time-consuming) step before our anal- ysis. To decrease the bias in manual labeling we recruited three participants and used inter-rater agreement to provide a common label for each post. Posts are questions with unique id’s. However, for labeling, the raters were asked to review the relevant answers with positive scores as well for understanding the problem 18 well. So when we talk about studying posts in the thesis, we mean studying both the questions and the answers.

Each participant had coursework in both AI and ML and had experience using ML libraries to solve problems. Each participant labeled all the questions producing 9,840 labels. The labels from the raters were mutually exclusive. We have used Stack Overflow tags to filter the questions related to a particular library.

So a question having tag scikit-learn discusses programming problems related to the scikit-learn only.

Participant Training. Before the labeling, the participants were provided with the classification shown in Figure 3.2. Then, a training session was conducted where each (sub) category was discussed and demon- strated using examples.

Labelling Efforts. First, each participant gave each question one of the labels from top-level cate- gories namely Non-ML, Data Preparation, Modelling, Training, Evaluation, Tuning, Prediction. Then, (s)he assigned a subcategory.

Reconciling Results. After collecting labels separately from each participant, a moderator then com- pared them. If there was an inconsistency between participants for a question, the moderator created an issue in a repository for resolution. Among all 3,243 questions, 177 (5%) needed further discussion.

Then, the three participants had two in-person meetings to discuss those 177 questions. The participants read the questions carefully again and voted individually. If the votes matched we accepted those as resolved, otherwise, participants discussed the reasons behind choosing a label and tried to achieve consensus. In most cases, the opinions differed due to the ambiguous nature of the questions. For example, for a question asking about suboptimal accuracy, it was difficult to say from the question without further exploration whether it is talking about accuracy in the prediction stage or accuracy in the training or evaluation stage. We resolved these types of questions by a careful reanalysis of the Q&A text.

We measured the inter-rater agreements using Cohen’s kappa coefficient (κ) as shown in Figure 3.3a. It measures the observed level of agreement between raters of a particular set of nominal values and corrects for agreements that would appear by chance. The interpretation of κ’s values is shown in Figure 3.3b. From

Figure 3.3a, we see that the kappa coefficient between all the raters involved in the labeling process is more than 0.9 indicating perfect agreements. We also computed the Fleiss coefficient [90] which is widely used 19

0.00–0.20 slight agreement R1 R2 R3 0.21–0.40 fair agreement R1 1.00 0.94 0.92 0.41–0.60 moderate agreement R2 0.94 1.00 0.91 0.61–0.80 substantial agreement R3 0.92 0.91 1.00 0.81–1.00 perfect agreement (a) Kappa coefficients (κ). (b) Interpretation of κ value.

Figure 3.3: Cohen’s kappa coefficients for labeling process. for finding IRR between more than 2 raters. The Fleiss coefficient was 90.68% indicating a perfect level of agreement. We computed all the IRR coefficients based on the ratings before the discussion for agreement.

After the discussion for reconciling the conflicts in the presence of a moderator, the agreement level was

100%.

3.2.3 Threats to Validity

Internal Validity In the manual labeling, a threat can be due to the possibility that labeling could be biased. We mitigate this threat by using 3 raters and resolving the conflicts via in-person meetings. The inter-rater reliability coefficient shows that there was a perfect level of agreement between raters.

The possibility of missing relevant posts can also be a threat. We mitigate this bias by collecting the tags that are relevant to a particular ML library. We then collect all the posts containing those tags using Stack

Overflow API.

Classification of questions in the top-level categories can also pose a threat. To mitigate this threat we use the categorization used and described by practitioners and researchers [1, 13, 91].

Classifying the top-level categories into subcategories can have bias and missing subcategories due to an open coding scheme. To mitigate this threat one Ph.D. student initially studied a subset of posts and came up with the subcategories. Then, three ML experts were consulted and their opinion on the classification was used for multiple rounds of revisions and improvements.

The ML expertise of the raters can affect manual labeling. To mitigate this threat we selected raters who have expertise in ML as well as using the libraries in the study. The raters also study the answers and comments in posts to improve their insights. 20

External Validity. In Stack Overflow, threats to validity can be low-quality posts [92] and the chronological order of posts. To eliminate the quality threat, we studied only the posts that have the tag of the relevant library and then kept only the posts with a score ≥ 5. This balanced both quality and labeling effort.

The chronological order of the posts can introduce threats, as the problems in some older posts may be resolved in a later version of the libraries. To alleviate this threat, we classify questions that appear only due to specific API versions into the Non-ML category.

An external threat can be the expertise of the programmers asking the questions. If the questions are asked only by novices, then our results are less general. To understand this threat, we measured reputation, a metric used by Stack Overflow to estimate expertise. Stack Overflow users with reputation above 50 are considered reputable and are allowed to comment on posts. Table 3.2 shows the reputation statistics of the developers asking the questions we studied. We have collected the reputation of users in the respective libraries. For example, the reputation statistics for Keras only include the reputation users earned by asking and answering questions related to Keras. We use tag names to filter the reputation. For example, if user1 has a total reputation of 100,000 across different technologies along with 5,000 reputation in Keras, we only take 5,000 as the relevant reputation in Keras. Our results show that both the mean and median reputations are high (≥ 10 as per Stack Overflow policy [93]), which indicates that the expertise of the programmers asking the questions is not a threat to our study.
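A minimal sketch of this tag-based reputation filtering is shown below; the per-tag reputation records and column names are hypothetical stand-ins for the data collected through the Stack Overflow API.

# Minimal sketch: keep only the reputation a user earned under the studied
# library's tag (hypothetical records, not the actual Stack Overflow API schema).
import pandas as pd

tag_reputation = pd.DataFrame([
    {'user': 'user1', 'tag': 'keras', 'reputation': 5000},
    {'user': 'user1', 'tag': 'java', 'reputation': 95000},
    {'user': 'user2', 'tag': 'keras', 'reputation': 130},
])

keras_rep = tag_reputation[tag_reputation['tag'] == 'keras']
print(keras_rep.groupby('user')['reputation'].median())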

An external threat can also be posts about ML frameworks, e.g., CoreML, Azure, etc. However, to limit the scope, we have focused on machine learning libraries and did not consider posts related to ML frameworks in our dataset.

3.3 Analysis and Results

We have proposed four research questions to understand what developers ask about ML, and we explore their answers in this study. The research questions cover the following aspects: identifying the difficult stages in the current ML pipeline faced by developers (RQ1), understanding whether the problems faced by developers are only due to the design of a library or whether some problems are inherent to ML (RQ2), exploring whether some libraries are more difficult in certain stages and whether there are libraries that show comparable difficulties in all the stages (RQ3), and exploring whether the problems faced by developers changed over time or stayed consistent (RQ4). Next, we answer these questions using a statistical analysis summarized in Tables 3.3 and 3.4 and present our findings.

Table 3.2: Reputation of users for posts in our study.

Library             Min    Max     SD    Mean  Median
Caffe [9]             0  23597   5605    1718     181
H2O [10]             30    168     42      66      50
Keras [11]            0  44553   2392     451     130
Mahout [12]          20    160     34      56      45
MLlib [13]           25  12594   1194     230      65
scikit-learn [14]     0   3820    313     213     100
Tensorflow [15]       0  12265    998     477     170
Theano [16]          20   1135    221     157      66
Torch [17]            0   1234    358     232      55
Weka [18]            25   1603    168      99      49

Table 3.3: Percentage of questions in each top-level category across libraries (in %).

                 Caffe  H2O  Keras  Mahout  MLlib  scikit-learn  Tensorflow  Theano  Torch  Weka    Q1    Q3   IQR  Median   SD
Data Preparation  14.0 41.0  16.0   17.0    33.0      26.0         16.0       17.0   23.0   30.0  16.5  29.0  12.5   20.0   8.7
Modeling          32.0 24.0  28.0   44.0    29.0      25.0         27.0       27.0   33.0   20.0  26.5  31.2   4.7   27.0   5.5
Training          24.0 18.0  25.0    8.0    15.0      18.0         21.0       16.0   20.0   12.0  15.0  20.7   5.7   18.0   4.7
Evaluation         1.0  6.0   8.0    4.0     7.0       9.0          9.0        3.0    3.0   10.0   3.5   8.3   4.8    6.0   2.9
Tuning             1.0  6.0   0.0    0.0     2.0       4.0          1.0        1.0    0.0    0.0   0.0   1.5   1.5    1.0   1.9
Prediction         6.0  0.0  10.0    4.0     6.0       7.0          4.0        2.0    2.0   11.0   2.6   6.6   4.0    5.0   3.2
Non-ML            22.0  6.0  13.0   23.0     6.0      11.0         20.0       35.0   20.0   10.0  10.3  21.4  11.1   16.0   8.6
IQR = Q3 − Q1: inter-quartile range. SD: standard deviation. Gradient color red to represent increasing percentage.

3.4 RQ1: Difficult stages

If we know the relative difficulty of ML stages for developers, then software engineering R&D and educational efforts can prioritize work on challenging stages. This section explores this question.

3.4.1 Most difficult stage

Finding 1 ⇒ Model creation is the most challenging (yet critical) stage in the ML pipeline, especially for libraries supporting distributed ML on clusters like Mahout and MLlib.

Table 3.4: Percentage of questions in each subcategory across libraries (in %).

                           Caffe   H2O   Keras  Mahout  MLlib  scikit-learn  Tensorflow  Theano  Torch   Weka    Q1     Q3    IQR   Median    SD
Data Adaptation             9.84  35.29   7.9   14.58   25.2       8.64         9.22      10.41  22.95  20.51   9.37  22.3   12.93  12.5    8.7
Featuring                   0      5.88   1.09   0       4.2       9.34         0.74       0.52   0      4.27   0.13   4.3    4.17   0.92   3.03
Type Mismatch               1.52   0      1.09   0       2.52      2.92         2.02       2.08   0      1.71   0.27   2.07   1.8    1.61   1.02
Shape Mismatch              1.52   0      5.5    0       0         1.86         2.62       2.08   0      0      0      2.03   2.03   0.75   1.7
Data Cleaning               1.52   0      0.55   2.1     2.52      3.62         2.09       1.6    0      3.41   0.79   2.4    1.61   1.82   1.22
Model Creation             26.52  17.64  25.88  43.75   23.52     21.37        23.01      23.43  22.95  21.36  21.77  25.3    3.53  23.22   6.7
Model Selection             0      0      0.55   0       0.84      2.1          0.6        0      0      0      0      0.58   0.58   0      0.64
Model Conversion            3.79   0      0.27   0       0.84      0.33         2.25       2.6    4.91   3.41   0.24   3.21   2.97   1.54   1.71
Model Load/Store            1.5    5.88   1.63   0       5.04      1.75         1.94       1.01   4.91   1.71   1.55   4.2    2.65   1.73   1.88
Error/Exception             0.76   5.88   5.5    4.2     5.88      4.78         5.32       5.72   1.64   2.56   2.96   5.67   2.71   5.1    1.8
Parameter Selection         9.1    5.88   5.5    0       2.52      3.97         3.74       2.6    8.2    5.13   2.89   5.78   2.89   4.5    2.6
Loss Function               6.1    0      4.09   0       0.84      1.4          3.74       2.6    3.3    1.71   0.98   3.6    2.62   2.16   1.86
Optimizer                   2.3    0      1.09   2.1     0         0.7          2.77       1.04   3.3    0.85   0.74   2.2    1.46   1.07   1.07
Performance                 2.3    5.88   6.27   2.1     5.04      3.27         4.87       3.12   1.64   0.85   2.13   5      2.87   3.2    1.8
Accuracy                    3.78   0      2.45   0       0.84      3.62         0.9        1.04   1.64   0.85   0.84   2.3    1.46   0.97   1.3
Eval. Strategy Selection    0.75   0      2.18   2.08    5.04      3.85         5.24       0      1.64   8.54   0.84   4.7    3.86   1.86   2.64
Visualization               0      0      1.63   0       0         2.68         1.65       0      0      2.68   0      1.23   1.23   0      0.95
Output Interpretation       0      5.88   3.82   2.1     1.68      2.21         2.17       2.6    1.64   1.71   1.69   2.5    0.81   2.13   1.47
Tuning Strategy Selection   0.75   5.88   0.27   0       1.68      3.5          0.45       1.04   0      0      0.07   1.52   1.45   0.6    1.82
Tuning Param Selection      0      0      0      0       0         0.81         0.08       0      0      0      0      0      0      0      0.24
Prediction Accuracy         6.1    0      6.81   4.2     5.04      5.25         3.97       2.08   1.64   8.54   2.55   5.85   3.3    4.6    2.44
Model Reuse                 0      0      1.37   0       0         0.23         0.22       0      0      1.71   0      0.23   0.23   0      0.6
Robustness                  0      0      1.65   0       0.84      1.28         0.3        0      0      0.85   0      0.85   0.85   0.15   0.59
Non-ML API                  2.27   0      2.72   4.2     2.52      2.8          4.4        2.08   3.27   1.71   2.13   3.15   1.02   2.63   1.19
Setup                      16.67   5.88   9.53  18.75    2.52      5.95        14.5       30.72  16.39   7.69   6.39  16.6   10.21  12.05   7.9
Custom Code                 1.52   0      0.81   0       0.84      1.51         1.05       1.56   0      0.85   0.21   1.4    1.19   0.84   0.6
Bug                         1.52   0      0      0       0         0.23         0.07       0      0      0      0      0.06   0.06   0      0.45
IQR and SD are defined in Table 3.3. Gradient color red to represent increasing percentage.

As Table 3.3 shows, model creation is, as expected, the most difficult stage; surprisingly, data preparation is the next most difficult stage, turning out to be more difficult than the training stage.

Model creation has a median of 23% across all the libraries, which is the highest among all the stages in the ML pipeline. Some of the libraries for distributed ML, e.g., Mahout, Torch, Caffe, and MLlib, have abnormally high difficulty in the model creation stage. This suggests that machine learning in a distributed environment is not developer friendly yet. To understand the reason behind the model creation related problems, we have studied the posts labeled as model creation in Mahout and have found that 40% of the model creation related posts are about recommendation engines. Different types of models are needed for different types of recommendation systems, e.g., item-based and user-based recommendation systems. Also, users develop custom recommendation systems for their needs, and [94] has studied different recommendation systems and benchmarked them to inform developers about the better choice of recommendation system based on their intent.

[95] built a Keras-based tool that helps developers create an ML model from the user's intent, and [96] built a similar tool for Weka. Finding 1 indicates that similar tool support for creating models, especially in distributed machine learning, is needed.

Further analysis showed that model creation is especially harder for ML libraries that require developers to use multiple configuration languages to configure their models. For example, in Caffe, questions about using multiple languages are discussed frequently. According to a case study by Amershi et al. [34], Microsoft developers face similar issues in the machine learning pipeline. That study reports that although Data Availability, Collection, Cleaning, and Management are the most challenging for all three groups of developers, Model Evolution, Evaluation, and Deployment are more significant for all groups in terms of frequency. Our results from studying the posts support what was known in that corporate community and add several new findings.

We have conducted an expertise analysis of the developers asking questions using the library-specific reputation on Stack Overflow. From Table 3.6, we have found that the median reputation of the developers asking questions about model creation ranges from 45 to 270. This indicates that model creation is hard for developers with varied expertise levels.

3.4.2 Data preparation

Finding 2 ⇒ Data preparation, especially data adaptation, is the second most difficult stage in the ML pipeline.

This top-level category includes questions about adapting the data to the format required by the library, featuring, dealing with type and shape mismatches, and data cleaning. Altogether, this stage is the next most difficult stage across ML libraries (median 20%).

Further analysis showed that ML libraries that use uncommon formats cause additional difficulties for developers in understanding the format, using the new format in their software, and wrangling and preprocessing data into that format. For example, Weka and MLlib have more problems in data adaptation due to their use of the uncommon ARFF and RDD data formats. For some libraries, data preparation turns out to be even more challenging than model creation. For example, H2O, Torch, and Weka have 35.29%, 22.95%, and 20.51% of posts, respectively, about data adaptation.

Finding 2 suggests that the tradeoffs in the design of data preparation APIs, e.g., the use of custom formats, need more study. Interestingly, most ML textbooks and courses spend little time on data preparation. Thornton et al. [96] built a tool for Weka that not only helps users develop a model but also performs data wrangling, which removes data adaptation-related issues. However, more research is needed on data wrangling support for other libraries that serves users' needs and matches the input expected by the ML models.

The library-specific reputation analysis in Table 3.5 shows that these problems are faced by developers with a median reputation in the range of 45-170. So, we have observed that data preparation related questions are asked by developers with a comparatively lower median reputation than the developers asking questions about model creation.

Surprisingly, the Tuning and Prediction stages of the ML pipeline, topics discussed frequently in ML research papers, appear infrequently in Stack Overflow questions.

3.5 RQ2: Nature of problems

Are some difficulties inherent to ML and thus all ML libraries face them? If so, general solutions could be developed and adapted to all ML libraries. Otherwise, the design of the specific library could be improved by utilizing lessons learned in this section.

3.5.1 Type mismatch

Finding 3 ⇒ Type mismatches appear in most ML libraries.

Type mismatch questions have a median of 1.61%, SD of 1.02%, and IQR of 1.80%. The small IQR indicates that type mismatch appears in most of the ML libraries. scikit-learn, MLlib, Theano, and Tensorflow have higher difficulties with type-related problems, with 2.92%, 2.52%, 2.08%, and 2.02%, respectively. MLlib uses a custom data format called RDD that seems to make type-related problems more frequent for this library. There are also questions about failures due to type mismatch in scikit-learn, Tensorflow, and Theano, as their APIs have type requirements that are not currently checked.

Finding 3 suggests that ML libraries have not focused on type correctness and ML-specific type correctness. Detailed work in this direction is needed to solve several type-related bugs and failures of ML models. A static analysis tool might be able to prevent the majority of these problems. To understand the characteristics of type mismatch related posts, we randomly selected 44 Stack Overflow posts. We found that 31 out of 44 problems were caused by the abstractions created by the libraries to create ML types. The other 13 were standard Python type errors. As an example, the following exception is thrown due to an ML type error.

ValueError: ('Unknown loss function', ':root_mean_squared_error')
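A minimal sketch of the kind of code that triggers this ML type error is shown below; the model itself is hypothetical. Keras resolves loss functions by name at compile time, so a name it does not know, such as 'root_mean_squared_error', raises a ValueError like the one above rather than a standard Python TypeError.

# Minimal sketch (hypothetical model): an unregistered loss name is an "ML type"
# error raised when the string is resolved at compile time.
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(1, input_dim=4))
model.compile(optimizer='adam', loss='root_mean_squared_error')  # ValueError: Unknown loss function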

Table 3.6 shows that in Theano, these questions are asked by developers with a relatively lower reputation, whereas in Tensorflow, developers with a comparatively higher reputation faced these problems.

3.5.2 Shape mismatch

Finding 4 ⇒ Shape mismatch problems appear frequently in deep learning libraries. Keras is an outlier in this subcategory with 5.5% of posts.

Shape mismatch related questions have a median of 0.75%, SD of 1.70%, and IQR of 2.03%. This problem appears in all deep learning libraries, among which Keras is an outlier with 5.50%. In these libraries, the shapes of neurons at adjacent layers must be compatible; otherwise, the library will throw exceptions during training or fail during prediction. An example is shown in Figure 3.4.

Figure 3.4: Question 40430186: An example showing dimension or shape mismatch problem in training in ML.

We have analyzed further and found that in Keras the shape mismatch related problems are caused by confusion between the input_shape and input_dim parameters, between channel-first and channel-last layouts, and by the shape of the dense layers. In Keras, there are two different parameters that specify the size of the input for the input layer, input_shape and input_dim; their operation is similar, but their definitions differ. Similarly, we have found that Keras users are confused between channel-first input (the channel dimension comes first, e.g., (1,32,32)) and channel-last input (the channel dimension comes last, e.g., (32,32,1)). Apart from these two problems, we have found that the inner structure of the dense operation causes problems. The dense operation works on a 1D tensor, and if the tensor is not 1D, instead of throwing an error it executes the operation on the first dimension of the tensor; e.g., if the tensor size is (2,3), then dense operates as if the input were a 1D tensor of size 2. Finding 4 suggests that techniques for verifying shape and dimension compatibility are needed for deep learning libraries. Such techniques could verify whether the data conforms to the model architecture and support dynamic modification of the network to match the data shape.
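To make the first source of confusion concrete, the following minimal sketch (with hypothetical layer sizes) shows the two equivalent ways of declaring the input size in Keras that developers often mix up: input_dim is an integer shorthand that only applies to 1D inputs, while input_shape is a tuple describing one sample without the batch dimension.

# Minimal sketch (hypothetical sizes): two equivalent declarations of a 32-feature input.
from keras.models import Sequential
from keras.layers import Dense

m1 = Sequential()
m1.add(Dense(8, input_shape=(32,)))  # tuple form: shape of one sample

m2 = Sequential()
m2.add(Dense(8, input_dim=32))       # integer shorthand, valid only for 1D inputs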

Abstract APIs that hide the details of the inner workings of deep learning networks can further complicate matters. To illustrate, consider the following Keras code.

1 def CreateModel(shape):
2     if not shape:
3         raise ValueError('Invalid shape')
4     logging.info('Creating model')
5     model = Sequential()
6     model.add(LSTM(4, input_shape=(31, 3)))
7     model.add(Dense(1))
8     model.compile(loss='mean_squared_error', optimizer='adam')
9     return model

The error is at line 6, where an invalid value of (31, 3) is passed to input_shape. The accepted answer suggests that input_shape should be (32, 1) instead. The user could not verify statically whether the built model has a compatible shape or whether there are any unconnected or extra ports while building the model. To alleviate these problems, a tool is needed that can symbolically analyze the ML model and find parameters that satisfy the operation. [72] implemented concolic (concrete-symbolic) testing [97] to measure test coverage, Lipschitz continuity, etc. A similar tool could be developed to prove the satisfiability of an ML model based on its parameters and structure.

To understand the reason behind Keras being an outlier in the Shape Mismatch subcategory, we have selected 60 random posts from the dataset. We have found that 21 out of 60 shape mismatch problems are from Keras, and the shape mismatches in Keras occur due to the abstraction of the APIs used to create layers in the network. The dimensions of the layers violate the contracts between the layers without giving any hints to the developer.

3.5.3 Data Cleaning

Finding 5 ⇒ Most libraries have problems in data cleaning.

As shown in Table 3.4, data cleaning related questions across the libraries have a median of 1.82%, SD of 1.22%, and IQR of 1.61%. Most of the libraries have questions about the data cleaning stage, except for H2O and Torch. This is not surprising since data cleaning is an integral part of any data science pipeline. The libraries scikit-learn, Weka, and MLlib have the most questions. Also, in these libraries developers with relatively lower reputation ask the questions, whereas developers with higher reputation find data cleaning difficult with Keras and Caffe.

Finding 5 suggests that tool support for data cleaning is needed, but such techniques may need to overcome inherent technical challenges. The abstract APIs in these libraries sometimes make cleaning fail. For example, the nan values in a dataframe need to be converted first into the numpy nan type before they can be cleaned using the APIs provided by scikit-learn. Furthermore, these failures do not indicate the root cause, making diagnostics difficult.
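A minimal sketch of this conversion step is shown below; the column names and the '?' placeholder are hypothetical, and SimpleImputer stands in for whichever scikit-learn cleaning API is being used.

# Minimal sketch (hypothetical data): placeholder missing values must first be
# converted to numpy NaN before scikit-learn's imputation APIs can handle them.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({'age': ['25', '?', '31'], 'income': ['50000', '62000', '?']})
df = df.replace('?', np.nan).astype(float)           # '?' -> np.nan
cleaned = SimpleImputer(strategy='mean').fit_transform(df)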

3.5.4 Model creation

In the model creation subcategory, the most difficult stage according to RQ1, we see problems that are both inherent to ML and specific to design choices in the libraries. The inherent difficulty of distributed ML is a major source of questions; e.g., see Figure 3.5.

Figure 3.5: Question 12319454: An example question on model creation for distributed ML using Mahout.

Deep learning libraries like Caffe, Keras, Theano, and Tensorflow also have higher percentages of questions about model creation, with 26.52%, 25.88%, 23.43%, and 23.01%, respectively. This shows that model creation for deep neural networks is difficult as well.

When we study the questions about Caffe, we see that Caffe users have problems in model creation due to the dependency of the model on multiple files. To create a model successfully, one needs to write a schema file in protobuf format, create a solver file, and write code in C++ or Python to build the model [98]. Having several components complicates matters. In our study, 36 out of 135 questions about Caffe are about model creation problems.

3.5.5 Error/Exception

Finding 6 ⇒ Questions on exceptions/errors are prevalent. This indicates that static and dynamic analysis tools are needed for such libraries.

The Error/Exception subcategory has a median of 5.10%, SD of 1.80%, and IQR of 2.71%. All the libraries have questions about runtime errors/exceptions. Surprisingly, though model creation seems problematic in Caffe, runtime failure is very low in Caffe with 0.76%. MLlib, H2O, Keras, Tensorflow, and scikit-learn have higher percentages of runtime errors with 5.88%, 5.88%, 5.50%, 5.32%, and 4.78%, respectively.

Finding 6 suggests that debugging and monitoring facilities for ML need much improvement to help developers resolve errors/exceptions independently. We dug deeper to determine where debugging and monitoring might be most helpful and found that deep learning and distributed ML libraries have more posts about runtime errors at training time, e.g., when a model throws an exception at training time, a model is not converging or learning as training iterations go on, a model is not predicting well, etc. Some recent work has started to address these issues [99, 100], but much more work is needed. Due to the lack of debugging tools to monitor the pipeline, causes of failure are hard to identify. More abstract deep learning libraries throw more runtime exceptions during training; e.g., see Figure 3.6.

Table 3.5: Library specific reputation in each top-level category across libraries (in median).
                  Caffe   H2O   Keras  Mahout  MLlib  scikit-learn  Tensorflow  Theano  Torch  Weka
Data Preparation   85.0  45.0  125.0   47.5    45.0     110.5         170.0      66.5   75.0   55.0
Modeling          270.0  49.0  105.0   55.0    90.0      95.0         155.0      75.0   57.5   45.0
Training          145.0  85.0  145.0   64.0    80.0      95.0         178.0      55.0   73.0   39.0
Evaluation         90.0  80.0  190.0   31.0   629.5     115.0         226.0      98.0   75.0
Tuning            205.0  75.0  100.0  175.0   979.5      69.0
Prediction        218.5 125.0   30.0  521.0   115.0      85.0         192.5      73.0   60.0
Non-ML            208.0  35.0  135.0   40.0    40.0      82.5         175.0      57.5   52.5   36.0
Blank cell represents the non-availability of the Stack Overflow post in that category in the dataset. Gradient color red to represent increasing library specific reputation.

Table 3.6: Library specific reputation in each subcategory across libraries (in median).
                           Caffe  H2O  Keras  Mahout  MLlib  scikit-learn  Tensorflow  Theano  Torch  Weka
Data Adaptation            85.00 42.50 120.00 45 60.00 120 205.00 119 45 55.00
Featuring                  160.00 95.00 115.00 91.5 220.00 40 35.00
Type Mismatch              99.00 80.00 88.00 115 160.00 40 94.00
Shape Mismatch             90.00 164.00 92 90.00 100
Data Cleaning              326.50 1136.50 50 140.00 210 137.50 48 129.00
Model Creation             247.50 50.00 100.00 55 60.00 95 140.00 75 70 60.00
Model Selection            1015.00 35.00 100 165.00
Model Conversion           125.00 751.00 25.00 187.50 435.5 40 27.50
Model Load/Store           12378.00 40.00 272.00 60.00 100 213.00 85 521 45.50
Error/Exception            55.00 30.00 120.00 34 45.00 113 150.00 55 930 153.00
Parameter Selection        145.00 168.00 137.50 73.00 107.5 179.00 151 55 32.50
Loss Function              99.00 160.00 55.00 80 190.00 35 617 36.50
Optimizer                  35.00 130.00 85 122.5 213.00 42.5 120 40.00
Performance                243.00 85.00 145.00 156 137.50 57.5 189.00 230 105 273.00
Accuracy                   182.50 195.00 69.00 80 182.50 27.5 0 35.00
Eval. Strategy Selection   90.00 192.50 42 98.00 140 217.50 25 90.00
Visualization              318.00 319.00 145 318.00
Output Interpretation      80.00 185.00 20 95.00 90 218.00 35 1234 62.50
Tuning Strategy Selection  195.00 75.00 45.00 69.00 105 195.00 979.5
Tuning Param Selection     135.00 50 135.00
Prediction Accuracy        218.50 130.00 30 94.00 105 85.00 332.5 521 52.50
Model Reuse                358.00 85 116.00 69.00
Robustness                 103.00 60.00 180 67.50 85.00
Non-ML API                 270.00 326.50 45 45.00 102.5 206.50 57 286 30.50
Setup                      179.00 35.00 120.00 40 50.00 75 155.00 52.5 39 55.00
Custom Code                519.00 290.00 12594.00 98 312.50 106 25.00
Bug                        651.50 0 7000.00
Blank cell represents the non-availability of the Stack Overflow post in that category in the dataset. Gradient color red to represent increasing library specific reputation.

3.5.6 Parameter selection

Finding 7 ⇒ Parameter selection can be difficult in all the ML libraries.

Figure 3.6: Question 45030966: An example question about Keras showing how abstraction in deep learning libraries can make identifying the root cause of an error/exception difficult.

We expected parameter selection to be an inherent ML issue but found some variation between libraries, with a median of 4.50%, SD of 2.60%, and IQR of 2.89%, suggesting key differences among libraries. Caffe and Torch have comparatively more problems with 9.10% and 8.20%, respectively. Libraries like Keras, Weka, H2O, and MLlib also show a larger percentage of questions on the choice of parameters.

We have found that in Caffe and Torch, the batch size of the data (25% of the problems) plays a significant role [101] in the learning process. Caffe and Torch are based on stochastic search optimization, where decreasing the batch size increases the accuracy as well as the training time of an ML model. For selecting parameters, adding support for meta-heuristic strategies in the libraries could be helpful.

3.5.7 Loss function selection

Finding 8 ⇒ The choice of the loss function has an impact on the performance and the robustness of an ML model.

Loss functions are used to quantify the difference between the values predicted by the model and the actual values (labels). Our results show that developers have difficulty selecting an appropriate loss function, but the extent of the difficulty varies across the libraries (median of 2.16%, SD of 1.86%, and IQR of 2.62%). All deep learning libraries have comparatively more questions about loss functions; for example, Caffe, Keras, Tensorflow, and Torch have the highest percentages with 6.10%, 4.09%, 3.74%, and 3.30%. Also, these issues are prevalent across different levels of expertise (median reputation in the range of 35-617) in each library, as shown in Table 3.6. We have studied the posts related to loss functions in Caffe, Keras, Torch, and Tensorflow and have found that the primary cause of the problems is confusion among the different types of built-in loss functions in each library. We have found that Caffe, Keras, Tensorflow, and Torch have 8, 14, 17, and 9 built-in loss functions, respectively. These loss functions are problem dependent, and the appropriate choice changes if the statement or intent of the problem changes. Also, 13 out of 46 questions related to loss functions in Tensorflow are about gradient calculation for building custom loss functions.

Preliminary work [102] has been done on efficient gradient computation. This indicates the necessity of further research on the usage of loss functions in deep learning libraries, e.g., on loss function recommendation. The selection of the loss function depends primarily on the type of problem. A wrong selection of the loss function can cause a machine learning model to perform poorly (low accuracy) or can decrease the security of a model by decreasing its robustness, which can be exploited by attackers to perform adversarial attacks [103].
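As a minimal, hypothetical sketch of why the choice is problem dependent, the Keras snippet below shows that the same classifier must be compiled with sparse_categorical_crossentropy when labels are integer class indices, but with categorical_crossentropy when labels are one-hot vectors; picking the wrong one either fails with a shape error or optimizes the wrong objective.

# Minimal sketch (hypothetical architecture): the right built-in loss depends on
# how the problem's labels are encoded.
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(10, activation='softmax', input_dim=20))

# Integer class labels in [0, 9]:
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

# One-hot encoded labels of length 10 would instead require:
# model.compile(optimizer='adam', loss='categorical_crossentropy')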

3.5.8 Training accuracy

Finding 9 ⇒ Abstract ML libraries have a higher percentage of questions about training time accuracy and convergence.

We expected training accuracy to be an inherent ML issue impacting all libraries; however, there are few questions about this on Stack Overflow. Caffe and scikit-learn stood out with 3.78% and 3.62% of questions about training accuracy. These libraries provide highly abstract APIs and a large number of optional parameters that need to be selected.

This suggests that the library documentation could be clearer about the impact of optional parameters on training accuracy. Secondly, a recommendation system could be developed for parameter recommendations based on dynamic traces.

3.5.9 Tuning parameter selection

Finding 10 ⇒ scikit-learn has more difficulty in hyperparameter tuning compared to other libraries.

Like accuracy, we considered tuning parameter selection to be an inherent ML issue, impacting more those libraries that have a higher number of parameters. Even though not many libraries have questions about it, scikit-learn and Tensorflow stand out. Tensorflow has higher usage and more questions in general, but scikit-learn was expected due to its large number of optional parameters.

As an example, consider creating an AdaBoostClassifier with 5 optional parameters initialized to the default values shown below.

class sklearn.ensemble.AdaBoostClassifier(base_estimator=None, n_estimators=50,
    learning_rate=1.0, algorithm='SAMME.R', random_state=None)

The base_estimator is set to None, but the user may need to choose an estimator to get the best performance. The learning rate is set to 1.0 by default. At this learning rate, it is highly likely that the model will not learn anything. So the user may often use these APIs incorrectly and wonder why the ML model is not producing useful results. Since these are optional parameters, the user will not even get any error or warning. Finding good values for these parameters and tuning them to build the best model while avoiding overfitting are frequent questions among developers using scikit-learn. Some GitHub issues have been filed against the scikit-learn repository as bugs (see Figure 3.7 for an example), but the underlying problem was that the developer was not able to trace why the model was not showing the expected accuracy and was unable to tune the hyperparameters. We have found 36 questions out of 849 in scikit-learn asking for help with hyperparameter tuning.

In scikit-learn, 32 out of 33 questions in the parameter selection subcategory are related to GridSearchCV. We have found that there are 11 tunable parameters for GridSearchCV. The choice of these parameters affects the estimation during training, which can be guided by the grid search process for tuning parameters. There is work [104] that eases parameter selection for the estimation process. Still, this remains an open research area for further study.
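A minimal sketch of such a grid search over the AdaBoostClassifier defaults discussed above is shown below; the parameter grid and dataset are hypothetical.

# Minimal sketch (hypothetical grid and data): tuning AdaBoostClassifier's optional
# parameters with GridSearchCV instead of relying on the defaults.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, random_state=0)
param_grid = {'n_estimators': [50, 100, 200], 'learning_rate': [0.01, 0.1, 1.0]}

search = GridSearchCV(AdaBoostClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)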

Overall, our results from this and the two previous subsections suggest that parameter recommendation is an urgent need for ML libraries, especially those that have many optional parameters.

Figure 3.7: scikit-learn issue #4800: An example of hyperparameter tuning problem. The user filed a bug report, but a developer of the library responded that the problem was with hyperparameter tuning.

3.5.10 Correlation between libraries

Next, we study whether the patterns of problems exhibited by the libraries have similarities. The correlation between libraries based on common patterns of problems is shown in Figure 3.8. We have identified two major groups.

Figure 3.8: Correlation between distributions of the percentage of questions over stages of the libraries.

Finding 11 ⇒ Weka, H2O, scikit-learn, and MLlib form a strongly correlated group with a correlation coefficient greater than 0.84 between the pairs, indicating that these libraries have similar problems in all the ML stages.

Group 1. Weka, H2O, scikit-learn, and MLlib form a strongly correlated group with a correlation coefficient greater than 0.84 between the pairs. This suggests that the problems appearing in these libraries have some correlation and the difficulties of one library can be described by the difficulty of other libraries in the group.

This finding is interesting because, other than H2O, the libraries in this category don't support deep learning. We believe that the correlation may be because each of these libraries supports many different ML algorithms and allows the user to select an algorithm for their task. This design choice is markedly different from the other group, which is specialized for a single ML approach.

Group 2. Torch, Keras, Theano, and Tensorflow form another group with a strong correlation of more than 0.86 between the pairs. These libraries are all specialized for deep learning.

Finding 12 ⇒ The deep learning libraries Torch, Keras, Theano, and Tensorflow form another group with a strong correlation of more than 0.86 between the pairs, indicating that these libraries exhibit similar problems in all the stages.

This finding is interesting because each of these deep learning libraries has adopted different designs and philosophies. Tensorflow and Torch focus on providing low-level general facilities, Keras focuses on high-level abstractions, whereas Theano focuses on efficiency on both CPUs and GPUs. Our finding suggests that despite the different design philosophies followed by each of these ML libraries, the problems are interrelated for the libraries in this category. So, software engineering research results for one library may generalize to other deep learning libraries.

3.5.11 API Misuses in All ML Stages

ML libraries have APIs that are very often misused. To identify the misuses, we have studied both the questions asked by developers and the well-accepted answers. If the answers pointed out incorrect use of an API and provided a solution using the correct use of the API, we marked the post as containing an API misuse. API misuse is seen across all the stages of the ML pipeline.

Figure 3.9: Question 24617356: An example showing the API misuse problem in ML libraries: (a) question; (b) best accepted answer. Code snippets are omitted.

For example, see Figure 3.9, where a user asks why their training takes much longer, i.e., a larger number of iterations, to reach a certain training accuracy. When they use one API, they are able to achieve the desired accuracy in 5 iterations, whereas with the other API they need 60 iterations to reach the same accuracy. The second API works fine, without any error, and eventually reaches the same accuracy. Still, the user is puzzled that almost 12 times as many iterations are required when using the second API. The answer in Figure 3.9b suggests that the second API needs the data to be shuffled properly before being passed to the API in every iteration. Making that change solves the performance problem. This is an example of API misuse where a precondition of the second API is not satisfied, which leads to a performance bottleneck.

For another example, let us consider a problem related to the creation of a NaiveBayes model. Only the part of the code snippet where the API misuse occurred is shown below:

# Imports needed by this excerpt (logger is defined elsewhere in the user's code).
from itertools import chain
from scipy.sparse import csr_matrix

def convert_to_csr_matrix(vectors):
    logger.info("building the csr_sparse matrix representing tf-idf")
    row = [[i] * len(v) for i, v in enumerate(vectors)]
    row = list(chain(*row))
    column = [j for j, _ in chain(*vectors)]
    data = [d for _, d in chain(*vectors)]
    return csr_matrix((data, (row, column)))

The code failed to work, giving a dimension mismatch error in some parts of the program. The solution to the problem is to use the csr_matrix() API properly: it needs the shape parameter defined explicitly, and the correct way to use the API is to define the shape explicitly, as shown in the code below.

return csr_matrix((data, (row, column)), shape=(len(vectors), dimension))

We have observed another kind of API misuse due to API updates by the library provider. To illustrate, consider the code below, which worked well in MLlib versions < 2.0. For Apache Spark versions >= 2.0, this API does not work. This is one of the top-voted questions in the Apache Spark MLlib category.

1 from pyspark.mllib.clustering import KMeans
2 spark_df = sqlContext.createDataFrame(pandas_df)
3 rdd = spark_df.map(lambda data: Vectors.dense([float(c) for c in data]))
4 mdl = KMeans.train(rdd, 2, maxIterations=10, runs=30, initializationMode="random")

MLlib version 2.0 is not backward compatible, so the code at Line 3 is outdated and must be replaced by the following:

rdd = spark_df.rdd.map(lambda data: Vectors.dense([float(c) for c in data]))

We have found that similar version incompatibility problems are also prevalent in other ML libraries.

Besides the API misuse scenarios discussed above, many other kinds of API misuse are common in ML libraries, and a more detailed analysis and categorization of errors is needed (much like MUBench [105]).

Some common problems include failure to find important features, improperly preparing the dataset, performance and over-fitting problems, suboptimal prediction performance, etc. A detailed analysis of API misuse is beyond the scope of this work.

3.6 RQ3: Nature of libraries

In this section, we explore whether some of the libraries are more difficult in certain stages and whether there are libraries that show comparable difficulties in all the stages (RQ3). To answer RQ3, we look at three measures. Which libraries have a non-zero percentage of questions under the majority of subcategories? Which libraries have an above-median percentage of questions under the majority of subcategories? Which libraries have outliers?

It turns out that scikit-learn and Tensorflow have questions under all subcategories, and Keras, Weka, MLlib, Caffe, and Theano have questions under the majority of subcategories. On the other hand, H2O, Mahout, and Torch have questions concentrated under a few subcategories, while the other subcategories have no questions. We further examined the subcategories under which H2O has the majority of its questions and found that most are about the initial stages, such as how to adapt data for use within H2O, how to create a model, or how to set up the library adequately. We observed similar trends for Mahout, except that it has a proportionally higher percentage of questions about model creation and setup.

Finding 13 ⇒ Early stages for H2O and Mahout, especially setup and model creation, have a comparatively higher percentage of questions compared to later stages.

This may suggest that getting started is harder with H2O and Mahout. Reflecting further on the nature of H2O and Mahout, there is a key similarity between the two libraries: both present non-traditional models of computation to the developer. H2O presents a workflow-like model, and Mahout is for distributed ML.

The absence of questions for later-stage subcategories might suggest either that developers who started with H2O and Mahout stopped using the library, or that all developers who faced problems getting started with H2O and Mahout continued using the library without any major difficulties and had no further questions. Further research is needed to understand which was the case; we did not find any definitive evidence during this study to suggest either way.

Next, we look at libraries that have an above-median percentage of questions under the majority of subcategories. At the top, with >50% of subcategories, are Tensorflow (20 subcategories), scikit-learn (19 subcategories), and Keras (17 subcategories). Tensorflow and Keras are popular libraries for deep learning, and above-average interest in the majority of aspects of their functionality reflects their popularity. scikit-learn is a popular ML library in Python. Though it is not used for deep learning, its use for regression, supervised and unsupervised learning, and recommendation related tasks is well known. This library provides abstract APIs that hide the details of ML. In our study, the majority of questions about scikit-learn were about data preparation (26%), modeling (25%), and training (18%). In the middle, with >30% of subcategories, we have MLlib (13 subcategories), Caffe (12 subcategories), Weka (11 subcategories), Theano (11 subcategories), and Torch (9 subcategories). At the bottom, we have Mahout (6 subcategories) and H2O (8 subcategories). We have previously observed that Mahout and H2O have questions concentrated under a few categories associated with initial stages. Combined with this observation, this suggests that such difficulties are higher for Mahout and H2O compared to other libraries.

Next, we look at outliers. For shape mismatch, Keras is an outlier; for model creation, Mahout is an outlier; for model selection, scikit-learn is an outlier; for output interpretation, H2O and Keras are outliers; for tuning strategy, H2O is an outlier; for tuning parameter selection, scikit-learn is an outlier; for model reuse, Keras and Weka are outliers; and for bug, Caffe and scikit-learn are outliers.

Finding 14 ⇒ scikit-learn is an outlier in several categories, suggesting that a deeper look into its API design might be necessary to improve the usability of this important library.

scikit-learn provides many optional parameters in its APIs, whose values are hard to select yet affect accuracy. That could be the reason why its users have more difficulties selecting parameters. scikit-learn also has an abnormally high percentage of questions about model selection, which is surprising because it is one of the few libraries to provide abstract model selection APIs; the use of these APIs could be simplified. This calls for research on designing better APIs for scikit-learn.

Next, we look at the error/exception related questions.

Finding 15 ⇒ The deep learning libraries Caffe, H2O, Keras, Tensorflow, Theano, and Torch show more training time difficulties compared to other ML libraries.

While this finding should not be a surprise, it reinforces a well-established worry in both the AI/ML and SE/PL communities that explaining why a deep learning model has worked or failed at training time, or gives unexpectedly low performance, remains a hard and open question. We confirm that it is important to solve it to help developers make effective use of deep learning APIs. To ensure that the dataset represents the usage of these libraries in open source projects, we have calculated the number of occurrences of these libraries in GitHub open source projects. Table 3.7 reports the number of occurrences of each library on GitHub. Furthermore, we performed the Kolmogorov-Smirnov test [106] between the distribution of the library usage population and our dataset population. We found a p-value of 0.675 and a KS statistic of 0.3, which suggests that both samples have been drawn from a similar population.

Table 3.7: Number of occurrences utilizing the libraries in Github.

Library   Occurrences      Library        Occurrences
Caffe         146,121      scikit-learn       269,672
H2O            33,112      Tensorflow       3,941,629
Keras         755,427      Theano             228,960
Mahout          2,793      Torch              121,583
MLlib          90,042      Weka                21,779
Overall     5,611,118
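A minimal sketch of this comparison is shown below; the GitHub counts come from Table 3.7, while the per-library post counts of our dataset are hypothetical placeholders, since the exact comparison data is not reproduced in this section.

# Minimal sketch: two-sample Kolmogorov-Smirnov test comparing the library-usage
# distribution on GitHub (Table 3.7) with the distribution of posts in our dataset.
# The dataset counts below are hypothetical placeholders.
from scipy.stats import ks_2samp

github_counts = [146121, 33112, 755427, 2793, 90042,
                 269672, 3941629, 228960, 121583, 21779]
dataset_counts = [130, 20, 370, 50, 120, 850, 1340, 190, 60, 120]  # hypothetical

statistic, p_value = ks_2samp(github_counts, dataset_counts)
print(statistic, p_value)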

3.7 RQ4: Time consistency of difficulty

In this section, we explore the answer to RQ4 to understand whether the problems across different stages stayed consistent over time or whether some problems were prominent only for a certain period and then solved by the library developers. To study this question, we plot the percentage of posts across different stages of all the libraries from 2009 to March 2018.

Figure 3.10: Difficulties over time, across different stages.

Our major observations from Figure 3.10 are described below. Model creation related problems are consistent over time. Choice-of-model problems appear consistently over time, indicating that model creation problems are not affected by the evolution of the libraries. While these are fundamental problems for ML, deeper involvement of software engineering researchers is needed to glean and disseminate lessons, patterns, and anti-patterns to help ML practice.

Data preparation related problems slowly decrease after 2013 and show a sharp increase after 2017. Weka, the library with the most difficulties in the data preparation stage, started losing popularity while tensor representations of data gained popularity, which explains the slow decline in data preparation difficulty. The increase in data preparation questions since 2017 coincides with the increasing interest in deep learning and the popularity of deep learning libraries that provide higher levels of abstraction. Data from a varied set of sources are prepared for deep learning tasks.

Training related problems show a slow increase over time. Due to the popularity of deep learning, where training time errors occur more frequently, training related problems are slowly increasing.

Evaluation problems are consistent over time. Evaluation related problems have not been solved by the evolution of ML libraries over the last decade.

3.8 Implications and Discussion

We have studied four research questions in the previous sections, where we presented our key findings and analyses. In this section, we discuss the implications of the results and the calls to action for the software engineering community.

3.8.1 Implications of RQ1

Our analyses and findings on RQ1 provide the following implications:

1. We have seen that some stages can act as bottlenecks in the ML pipeline. So, we need computational techniques to measure the trustworthiness of the pipeline and the uncertainty it can lead to. We also have to think about the design patterns needed for designing the ML pipeline. Currently, when a certain stage fails, we have to restart the process because the stages are sequential. Finding out how to make the stages more loosely coupled to better support production and development needs software engineering innovations. Software engineers need to think about how to bring the ideas of reactive programming and modular programming to ML pipeline design. Both theoretical and applied techniques are needed to remove the barrier to entry into ML.

2. We need to step back and think about whether the current principles of API design are suitable for ML. Using traditional abstraction techniques to design ML APIs has been found to have limitations in ML. Abstracting away the whole ML computation inside an API has not proven effective in enhancing productivity; rather, it has shown clear indications of technical debt. So it is a new challenge for software engineering researchers to devise new principles for designing ML APIs.

3. We have found that data preparation acts as a barrier to entry in ML. Software engineering researchers can apply ideas from symbolic analysis and data flow analysis to provide automated solutions to these problems faced by ML developers.

3.8.2 Implications of RQ2

In RQ2, we studied the nature of the problems faced by ML developers. Here we mainly studied whether the root causes of the challenges come from drawbacks of the libraries or whether there are new ML-specific causes. We have found that some ML-specific causes like shape mismatch, data cleaning, model creation, parameter selection, and loss function selection often create a barrier to entry. The implications and calls to action for software engineering researchers from this research question are the following:

1. Shape mismatch problems often cause production failures at runtime in deep neural networks. We need to adapt the ideas of API contracts and temporal analyses between the layers of a DNN to solve this kind of error. We may also need to think about how program analysis techniques can be adapted to solve this problem automatically.

2. Automated techniques need to be developed to provide support for model creation. In this endeavor, long-existing programming-by-example approaches can be utilized to help ML developers choose and create the right model. Meta-learning techniques can also be improved to help developers choose the best-performing model given a dataset and a problem.

3. We have identified the challenges faced by developers in choosing the right hyperparameters and loss function. Currently, developers use an exhaustive grid search to find the correct combination of hyperparameters, where the search space is exponential. We need to think about how to improve this grid search technique using pruning methodologies. We may also have to think about how mining large-scale code repositories can help provide a cost-efficient and fast solution to the problem of finding optimal hyperparameters.

4. ML-specific static analysis and dynamic analysis tools and debuggers need to be developed. Researchers and practitioners need to collaborate in developing the theories and tools to resolve the existing technical debt in ML.

3.8.3 Implications of RQ3

In RQ3, we have studied the nature of the libraries in terms of the challenges faced by developers. We have found that libraries performing similar tasks show similar patterns of challenges. For example, the libraries used for deep learning have a strong correlation among themselves in terms of the distribution of the challenges faced by developers. We derive the following implications from the analyses under this research question:

We have shown the possibility of using the same program analysis tools and bug repair tools across the libraries that form a group in terms of the distribution of developer challenges. To achieve this, we might need to convert a program written for a source library into a common intermediate representation (IR). Then, we would do all the analyses and repairs on the IR and finally convert the fixed program back to a model in the original library. We could also convert programs written for one source library to another target library using this common IR. This would help make the community more connected, so that we can take the best pieces of all the libraries to make ML more successful and robust.

3.8.4 Implications of RQ4

In RQ4, we have seen that some of the problems faced by developers, such as problems in model creation, data preparation, training, and evaluation, stayed consistent over time. This paints a negative picture: the unorganized efforts so far have not produced a noticeable positive outcome. So it has become an immediate need to think about how to resolve these challenges, as there is wide discussion about using ML in safety-critical systems. If we want to trust and allow the usage of ML in these life-threatening safety-critical systems, these problems need to be resolved immediately. This needs a joint effort from the software engineering and machine learning communities to develop the necessary theories and tools.

3.9 Conclusion

This study is motivated by the need to empirically understand the problems with the usage of ML libraries. To understand these problems, we retrieved a significant dataset of Q&A from Stack Overflow, classified the questions into categories and subcategories, and performed analysis from four viewpoints: finding the most difficult ML stage, understanding the nature of the problems, understanding the nature of the libraries, and studying whether the difficulties stayed consistent over time. We found that model creation is the most difficult stage, followed by data preparation. We found that type mismatch, data cleaning, and parameter selection are difficult across all libraries. We also found that the initial stages are harder for H2O and Mahout, and scikit-learn has proportionately higher problems in several subcategories. Lastly, we observed that data preparation and training related problems show signs of increasing going forward. These findings are a call to action for SE researchers, as the engineering of software with ML components is likely to be routine in the next decade.

In the future, this dataset can be utilized to train ML models that classify new Stack Overflow posts.

In our current work, we have not studied full-stack ML frameworks, e.g., TFX, Azure, CoreML, SageMaker, etc. A detailed study of these frameworks remains interesting future work.

CHAPTER 4. A COMPREHENSIVE STUDY ON DEEP LEARNING BUG CHARACTERISTICS

4.1 Introduction

A class of machine learning algorithms known as deep learning has received much attention in both academia and industry. These algorithms use multiple layers of transformation functions to convert input to output, with each layer learning successively higher levels of abstraction in the data. The availability of large datasets has made it feasible to train (adjust the weights of) these multiple layers. While the jury is still out on the impact of deep learning on our overall understanding of software's behavior, a significant uptick in its usage and applications in wide-ranging areas warrants research on software engineering practices in the presence of deep learning. This work focuses on the characteristics of bugs in software that makes use of deep learning libraries.

Previous work on this topic generally falls under two categories: work that studied bugs in the implementation of machine learning libraries themselves, and work that studied bugs in the usage of a specific deep learning library. A key work in the first category is Thung et al. [52], who studied bugs in the implementation of three machine learning systems: Mahout, Lucene, and OpenNLP. In the second category, Zhang et al. [51] studied bugs in software that makes use of the Tensorflow library. While both categories of approaches have advanced our knowledge of ML systems, we do not yet have a comprehensive understanding of the bugs encountered when using the class of deep learning libraries.

This work presents a comprehensive study of bugs in the usage of deep learning libraries. We have selected the top five popular deep learning libraries Caffe [9], Keras [11], Tensorflow [15], Theano [107], and Torch [17] based on the user counts from the developer Q&A forum Stack Overflow. While each of these libraries is for deep learning, they have different design goals. For example, Tensorflow focuses on providing low-level, highly configurable facilities, whereas Keras aims to provide high-level abstractions hiding the low-level details. Theano and Torch are focused on easing the use of GPU computing to make deep learning more scalable. Thus, studying them simultaneously allows us to compare and contrast their design goals vis-à-vis bugs in their usage.

We have used two sources of data in our study: posts about these libraries on Stack Overflow and Github bug fix commits. The first dataset gives us insights into bugs that developers encounter when building software with deep learning libraries. A number of these bugs would, hopefully, be fixed based on the discussion in the Q&A forum. The second dataset gives us insights into bugs that were found and fixed in open source software. Our study focuses on the following research questions and compares our findings across the five subject libraries.

RQ1: (Bug Type) What type of bugs are more frequent?

RQ2: (Root cause) What are the root causes of bugs?

RQ3: (Bug Impact) What are the frequent impacts of bugs?

RQ4: (Bug prone stages) Which deep learning pipeline stages are more vulnerable to bugs?

RQ5: (Commonality) Do the bugs follow a common pattern?

RQ6: (Bug evolution) How did the bug pattern change over time?

Findings-at-a-glance. Our study shows that most of the deep learning bugs are Data Bugs and Logic Bugs [108], that the primary root causes of the bugs are Structural Inefficiency (SI) and Incorrect Model Parameter or Structure (IPS) [51], and that most of the bugs happen in the Data Preparation stage of the deep learning pipeline. Our study also confirms some of the findings of the Tensorflow study conducted by Zhang et al. [51]. We have also studied some antipatterns in the bugs to find whether there is any commonality in the code patterns that result in bugs. Our findings show that there is a strong correlation among the distributions of bugs as well as among the antipatterns.

4.2 Methodology

4.2.1 Data Collection

We have used two different data sources for studying bugs in deep learning software: Stack Overflow posts and Github bug fix commits. A summary of these datasets is shown in Table 4.1.

Table 4.1: Summary of the dataset used in the study.

                 Stack Overflow              Github
Library        # Posts    # Bugs    # Commits    # Bugs
Caffe             183        35         100         26
Keras             567       162         100        348
Tensorflow       1558       166         100        100
Theano            231        27         100         35
Torch             177        25         100         46
Total            2716       415         500        555

4.2.1.1 Stack Overflow Data Collection

To study bugs in deep learning software, we have collected data from Stack Overflow, a well-known Q&A site for developers to discuss software development problems. The data collection process consists of two steps.

In the first step, we select candidate posts discussing deep learning libraries. We focus on five deep learning libraries: Caffe, Keras, Tensorflow, Theano, and Torch. These are the five most discussed deep learning libraries on Stack Overflow. We did this by searching for posts tagged with Caffe, Keras, Tensorflow, Theano, and Torch. When posts are about specific libraries, they are more likely to talk about bugs in using deep learning libraries. Using these criteria, we selected all posts about these five libraries. We further filtered out the posts that did not contain any source code, because posts about bugs usually contain code snippets. Moreover, we reduced the number of posts by selecting the posts whose score, computed as the difference between the number of upvotes and the number of downvotes, was greater than 5, to focus on high-quality posts and keep the manual effort manageable. After this step, in total, we retrieved 183, 567, 1558, 231, and 177 posts for Caffe, Keras, Tensorflow, Theano, and Torch, respectively, for further study.
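A minimal sketch of this selection step is shown below; the file name and column names are hypothetical stand-ins for an export of the Stack Overflow data.

# Minimal sketch (hypothetical export and column names): keep posts tagged with
# one of the five libraries, containing code, and with score greater than 5.
import pandas as pd

posts = pd.read_csv('stackoverflow_posts.csv')
libraries = ['caffe', 'keras', 'tensorflow', 'theano', 'torch']

mask = (
    posts['tags'].str.contains('|'.join(libraries), case=False, na=False)
    & posts['body'].str.contains('<code>', na=False)
    & (posts['score'] > 5)
)
candidates = posts[mask]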

In the second step, we manually read these candidates to identify the ones about bugs. The second and third authors manually reviewed the candidates. For each post, we read the question and all answers, focusing on the best-accepted one. If the best-accepted answer was to fix the usage of the deep learning API(s) in the question, we considered that post as talking about deep learning bugs. After this step, we found 35, 162, 166, 27, and 25 bugs for Caffe, Keras, Tensorflow, Theano, and Torch, respectively.

4.2.1.2 Github Data Collection

We mined Github commits to study the changes they make and to check and confirm the bug patterns that we observed in Stack Overflow. The data collection process consists of two steps.

First, we collect all the repositories that use Caffe, Keras, Tensorflow, Theano, and Torch. To collect the repositories that use these libraries, we first find the repositories that contain keywords related to the libraries. After that, we mine all the commits whose title contains the word "fix". Then, we check the import statements in the programs to verify that those repositories truly use deep learning libraries. Next, we randomly select 100 commits per library from the mined commits and classify them.

Second, we use the same process that we used for Stack Overflow. Specifically, the second and the third authors manually studied the 500 commits and labeled them separately. After that, these two authors compared their results to resolve conflicts in the labeling process. We studied each changed line in the commits. Note that some commits may contain more than one bug and some commits may not contain any bug. Overall, we found 26, 348, 100, 35, and 46 bugs in the commits of Caffe, Keras, Tensorflow, Theano, and Torch, respectively.

4.2.2 Classification

In our classification, we focus on three criteria: bug types, root causes, and effects of bugs. The classification schemes used for labeling the bugs along these three criteria are discussed in Section 4.2.4, Section 4.2.5, and Section 4.2.6. We have also classified the bugs into different deep learning stages [1].

To label the bug types, we followed the classification from an existing, well-vetted taxonomy [108] and extended it. The added types were based on the data that we studied, following an open coding scheme.

The bugs may have different root causes and effects. A supervised pilot study and an open coding scheme were used to identify the possible effects of these bugs. We adapted the classification scheme of root causes and bug effects from [51] and extended it based on what we found in the studied posts. One of the authors, with expertise in these libraries, studied the posts initially to come up with the classification scheme for bug types, root causes, and effects. We followed the open coding scheme, and a pilot study was conducted to reach agreement on the classification.

We also classified the bugs into the different stages of the pipeline to understand which stages are more vulnerable to bugs. The deep learning process can be divided into a seven-stage pipeline [1]: data collection, data preparation, choice of model, training, evaluation, hyperparameter tuning, and prediction.

Among the seven stages, the first one is not related to software development. The other stages are related to software development and are supported by the deep learning libraries through their APIs. We use these stages to assign each bug to a pipeline stage.

4.2.3 Labeling the Bugs

Once we had all the classification criteria, we used them to label the posts. The second and the third authors independently studied the posts. We measured the inter-rater agreement among the labelers using Cohen's Kappa coefficient [109] when 5%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, and 100% of the posts were labeled. After 5% of the labeling, the Cohen's Kappa coefficient was close to 0. We then conducted a training session among the raters to clarify the labels and what they mean. After the training session, we conducted another pilot study at 10%, including the first 5%. This time the Cohen's Kappa coefficient was 82%. We again discussed the results and found the reasons for the major disagreements. We then discussed those cases further through examples and continued labeling. The Cohen's Kappa coefficient was more than 90% in subsequent pilot studies.
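To make the agreement measurement concrete, the following is a minimal sketch of how Cohen's Kappa can be computed over two raters' labels, assuming the labels are collected into two parallel Python lists (the variable names and the sample labels are illustrative, not the study's actual data):

# Minimal sketch of the inter-rater agreement check used during labeling.
# The two lists are illustrative stand-ins for the raters' actual labels.
from sklearn.metrics import cohen_kappa_score

rater_1 = ["Data Bug", "API Bug", "SB.Logic Bugs", "Data Bug"]
rater_2 = ["Data Bug", "API Bug", "SB.Logic Bugs", "API Bug"]

kappa = cohen_kappa_score(rater_1, rater_2)
print(f"Cohen's Kappa: {kappa:.2f}")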

The labeling effort was continuously monitored using the Kappa coefficient to track the agreement. We conducted reconciliation efforts at every 10% interval of the labeling. The posts on which the raters disagreed were further discussed in the presence of a supervisor. After discussion, a common label was agreed upon; in the end, every bug received a common label.

4.2.4 Types of Bugs in Deep Learning Software

Developers often encounter different types of bugs while writing deep learning software. To understand those bugs and their root causes, we have classified them into different categories. The classification is inspired by [108] and adapted based on all the Stack Overflow posts that we analyzed.

4.2.4.1 API Bug

This group of bugs is caused by deep learning APIs. Generally, when a developer uses a deep learning API, the bugs associated with that API are inherited automatically, without the knowledge of the user. The prime causes of deep learning API bugs are changes of API definitions across versions, lack of inter-API compatibility, and sometimes wrong or confusing documentation.

4.2.4.2 Coding Bug

These kinds of bugs originate from programming mistakes, which in turn introduce other types of bugs in the software that lead to either runtime errors or incorrect results. A large percentage of the deep learning bugs that we have examined arises from syntactic mistakes that cannot be fixed by changing only a few lines of code. These bugs are not identified by the programming language compiler, resulting in wrong output.

4.2.4.3 Data Bug

This bug may arise if an input to the deep learning software is not properly formatted or cleaned before being supplied to the deep learning model. This type of bug occurs before the data is fed to the deep learning model; it is not caused by a wrong deep learning model, but purely by the type and structure of the training or test data. Similar to coding bugs, data bugs are usually flagged by the compiler, but in some scenarios they can pass unchecked through the compilation process and generate erroneous results.

4.2.4.4 Structural Bug (SB)

A vast majority of the deep learning bugs occur due to incorrect definitions of the deep learning model's structure. These include mismatches of dimensions between different layers of the deep learning model, the presence of anomalies between the training and test datasets, use of incorrect data structures in implementing a particular function, etc. These bugs can be further classified into five subcategories.

Control and Sequence Bug This subclass of bug is caused by a wrong structure of control flow. In many scenarios, due to a wrong if-else or loop guarding condition, the model does not perform as expected. This type of bug either leads to a crash, when a part of the deep learning model does not work, or leads to incorrect functionality due to mishandling of data through the layers.

Data Flow Bug The main difference between the Data Flow Bug and the Data Bug is the place of origin. If a bug occurs due to the type or shape mismatch of input data after it has been fed to the deep learning model, we label it as Data Flow Bug. It includes those scenarios where model layers are not consistent because of different data shape used in consecutive layers. To fix these bugs, developers need to modify the model or reshape the data.

Initialization Bug In deep learning, an Initialization Bug means that parameters or functions are not initialized properly before they are used. This type of bug does not necessarily produce a runtime error, but it makes the model perform worse. Here, functions include both user-defined and API-defined ones. We also place a bug in this category when an API has not been initialized properly.

Logic Bug In deep learning, the logical understanding of each stage of the pipeline is an integral part of the coding process. With an incorrect logical structure of the deep learning model, the output of a program may result in either a runtime error or a faulty outcome. These bugs are often generated in the absence of proper guarding conditions in the code.

Processing Bug One of the most important decisions in the deep learning model structure is to choose the correct algorithm for the learning process. In fact, different deep learning algorithms can lead to different performance and output [110]. Also, to make different layers compatible with each other, the data types of each layer need to follow a contract between them. Processing Bugs happen due to the violation of these contracts.

4.2.4.5 Non Model Structural Bug (NMSB)

Unlike SB, NMSBs occur outside the modeling stage. In other words, these bugs can happen in any deep learning stage except the modeling stage, such as the training stage or the prediction stage. NMSB has similar subcategories to SB: Control and Sequence Bug, Logic Bug, Processing Bug, and Initialization Bug. We do not define a Non Model Structural Data Flow Bug analogous to the Structural Data Flow Bug because Data Bug already covers its meaning.

Control and Sequence Bug This subclass is similar to the Control and Sequence Bug in SB. The bug is caused by an incorrect structure of control flow, like a wrong if-else condition; however, this kind of bug happens outside the modeling stage.

Initialization Bug This subclass is similar to Initialization Bug in SB. The bug is caused by incorrect initialization of a parameter or a function prior to its use.

Logic Bug This subclass is similar to Logic Bug in SB. The bug is caused by misunderstanding the behavior of case statements and logical operators.

Processing Bug This subclass is similar to Processing Bug in SB. The bug is caused by an incorrect choice of algorithm.

4.2.5 Classification of Root Causes of Bugs

4.2.5.1 Absence of Inter API Compatibility.

The main reason for these bugs is the inconsistent combination of two different kinds of libraries. For example, a user cannot directly use a Numpy function in Keras because neither the Tensorflow backend nor the Theano backend of Keras has an implementation of Numpy functions.

4.2.5.2 Absence of Type Checking.

This kind of bug involves a type mismatch problem when calling API methods. These bugs are usually mistakes related to using the wrong type of parameter in an API call.

4.2.5.3 API Change.

The reason for these bugs is the release of new versions of deep learning libraries with incompatible APIs. In other words, the bug happens when a new API version is not backward compatible with its previous version. For example, a user updates to a new version of a deep learning library that has new API syntax; however, the user does not modify his/her code to fit the new version, which leads to an API change bug.

4.2.5.4 API Misuse.

This kind of bug often arises when users use a deep learning API without fully understanding it. Missing conditions can be one kind of API misuse; this bug occurs when a usage does not follow the API usage constraints needed to ensure certain required conditions. Crash is the main effect of these bugs.

4.2.5.5 Confusion with Computation Model.

These bugs happen when a user gets confused about the function of a deep learning API, which leads to misuse of the computation model assumed by the deep learning library. For instance, a user gets confused between the graph construction phase and the evaluation phase.

4.2.5.6 Incorrect Model Parameter or Structure (IPS)

IPS causes problems with constructing the deep learning model, e.g. incorrect model structures or the use of inappropriate parameters. IPS is a common root cause in deep learning software because of both the lack of deep learning knowledge among users and the incomprehensibility of deep learning models. This kind of bug causes functional incorrectness; thus, the effect of this bug is a crash.

4.2.5.7 Others.

These bugs are not related to deep learning software. In other words, these bugs are mostly related to mistakes in the development process, like incorrect syntax.

4.2.5.8 Structure Inefficiency (SI)

SI causes problems related to the modeling stage in deep learning software, like IPS; however, SI leads to bad performance of the deep learning software, while IPS leads to a crash.

4.2.5.9 Unaligned Tensor (UT)

These bugs often occur in the computation graph construction phase. When a user builds the computa- tion graph in deep learning process, they have to provide correct input data that satisfies input specifications of the deep learning API; however, many users do not know the API specifications, or they misunderstand

API signature leading to UT bugs.

4.2.5.10 Wrong Documentation.

Incorrect information in library documentation leads to these bugs. Deep learning library users may face this kind of bug when they read an incorrect definition or an incorrect usage of a deep learning API in the documentation.

4.2.6 Classification of Effects of Bugs

4.2.6.1 Bad Performance.

Bad performance, or poor performance, is one of the common kinds of effects in deep learning software. Furthermore, the major root causes of this effect are SI and CCM, which are related to model construction. Even though developers may use deep learning libraries correctly, they still face model construction problems because the APIs in these libraries are abstract.

4.2.6.2 Crash.

Crash is the most frequent effect in deep learning. In fact, any kind of bug can lead to a Crash. A symptom of a crash is that the software stops running and prints out an error message.

4.2.6.3 Data Corruption.

This bug happens when the data is corrupted as it flows through the network. This effect is a consequence of misunderstanding the deep learning algorithms or APIs. When Data Corruption occurs, a user will receive unexpected outputs.

4.2.6.4 Hang.

The Hang effect is caused when deep learning software ceases to respond to inputs. Either slow hardware or an inappropriate deep learning algorithm can lead to a Hang. A symptom of Hang is that the software runs for a long period of time without providing the desired output.

4.2.6.5 Incorrect Functionality

This effect occurs when the software behaves in an unexpected way without any runtime or compile-time error/warning. This includes incorrect output formats, model layers not working as desired, etc.

4.2.6.6 Memory Out of Bound

Deep learning software often halts due to the unavailability of memory resources. This can be caused either by a wrong model structure or by not having enough computing resources to train a particular model.

4.3 Frequent bug types

In this section, we explore the answer to RQ1 through a statistical analysis of the labeled data. The normalized distribution of bug types in the Stack Overflow data is shown in Figure 4.1. The distribution of bugs shown in Figure 4.1 and the Stack Overflow and Github data in Table 4.2 show the presence of different kinds of bugs in both Stack Overflow and Github for the deep learning libraries we have studied. We present some of the key findings related to bug types in the following subsections.

4.3.1 Data Bugs

Finding 1 ⇒ Data Bugs appear more than 26% of the time

From Figure 4.1 we see that, among the bug types, Data Bugs appear frequently (26%) across all the libraries. In the studied Stack Overflow data, we have seen that 30% of the posts in Tensorflow, 24% of the posts in Keras, 36% of the posts in Torch, 35% of the posts in Theano, and 9% of the posts in Caffe have Data Bugs. Data Bugs mostly appear due to the absence of data pre-processing facilities such as data validation, data shuffling, etc. For example, a developer is trying to read some image files using the following method1.

def _read32(bytestream):
    dt = numpy.dtype(numpy.uint32).newbyteorder('>')
    return numpy.frombuffer(bytestream.read(4), dtype=dt)

The developer eventually got stuck with the following error while trying to train the model using the data returned by the previous library call.

TypeError: only integer scalar arrays can be converted to a scalar index

1 https://tinyurl.com/y3v9o7pu

Figure 4.1: Distribution of Bug Types in Stack Overflow

An expert suggested an answer changing the last return statement to the following, which solved the problem and was accepted.

return numpy.frombuffer(bytestream.read(4), dtype=dt)[0]

The bug is hard to fix by just looking at the error message. It is difficult to identify the exact reason for the bug, which led the developer to post a question on Stack Overflow, and the question was upvoted by other fellow developers as a quality post.

The large percentage of Data Bugs indicates that data pre-processing related difficulties are quite common in deep learning software. These bugs could be addressed by the development and refinement of data verification tools. Support in such data verification tools for modern abstract data types like DataFrame, and for the properties of the model, would help the deep learning community.

4.3.2 Structural Logic Bugs

Finding 2 ⇒ Caffe has 43% Structural Logic Bugs

The second major bug type in Stack Overflow is the Structural Logic Bug, which was expected from our initial hypothesis based on a pilot study. Caffe has more Structural Logic Bugs in Stack Overflow compared to the other libraries. The other libraries also have a significant portion of Structural Logic Bugs, ranging from 0% - 27%.

4.3.3 API Bugs

Finding 3 ⇒ Torch, Keras, Tensorflow have 16%, 11% and 11% API bugs respectively

In deep learning libraries, API changes sometimes break the entire production code. The implicit dependencies between libraries cause problems when one library has some major changes. For example, when Numpy is updated, software built with Tensorflow or Keras may fail. Keras often uses Tensorflow or Theano as its backend, and hence an update of Tensorflow or Theano can cause software developed using Keras to crash. API bugs arise more often in Keras and Tensorflow, as shown in Figure 4.1. More than 81% of the API bugs are from Keras and Tensorflow. An example of such a bug is shown in the code snippet below. The bug in the code below arises because the keyword names in the API signature of Keras changed.

model.fit(tX, tY, epochs=100, batch_size=1, verbose=2)

The developer will get the following error because the epochs keyword does not exist in the pre-2.0 versions of Keras.

model.fit(tX, tY, batch_size=1, verbose=2, epochs=100)
  File "keras/models.py", line 612, in fit
    str(kwargs))
Exception: Received unknown keyword arguments: {'epochs': 100}

To fix this error, the developer needs to change the keyword parameter from epochs to nb_epoch.

Table 4.2: Statistics of Bug Types in Stack Overflow and Github

(SO = Stack Overflow, GH = Github)
                                   Caffe        Keras        TF           Theano       Torch       P value
                                   SO    GH     SO    GH     SO    GH     SO    GH     SO    GH
API Bug                            6%    0%     11%   57%    11%   72%    7%    3%     16%   2%    0.3207
Data Bug                           9%    49%    24%   8%     30%   0%     35%   17%    36%   15%   0.3901
NMSB.Control and Sequence Bug      0%    8%     0%    0%     0%    0%     4%    0%     0%    7%    0.3056
NMSB.Initialization Bug            0%    0%     1%    0%     1%    0%     0%    3%     0%    0%    0.7655
NMSB.Logic Bugs                    11%   0%     13%   2%     8%    0%     25%   6%     12%   7%    0.0109
NMSB.Processing Bug                0%    0%     0%    0%     1%    0%     0%    3%     0%    7%    0.2323
SB.Control and Sequence Bug        6%    12%    2%    0%     4%    0%     4%    3%     8%    9%    1.0000
SB.Data flow Bug                   3%    8%     13%   26%    15%   0%     0%    14%    4%    16%   0.2873
SB.Initialization Bug              0%    0%     1%    0%     8%    1%     0%    23%    20%   11%   0.8446
SB.Logic Bugs                      42%   15%    27%   3%     18%   23%    18%   14%    0%    13%   0.3442
SB.Processing Bug                  23%   8%     8%    4%     4%    4%     7%    14%    4%    13%   0.8535

model.fit(tX, tY, nb_epoch=100, batch_size=1, verbose=2)

4.3.4 Bugs in Github Projects

We have also analyzed the distribution of bugs in the Github bug fix commits. The distribution of bugs across the different libraries in the Github data is shown in Table 4.2. We computed the P value using a t-test where, for each bug type, one distribution is that bug type's share in Github across all the libraries and the other distribution is its share in Stack Overflow across all the libraries.

Finding 4 ⇒ All the bug types have a similar pattern in Github and Stack Overflow for all the libraries

We analyzed the Stack Overflow and Github results using the t-test to find whether the distributions differ significantly. We used a 95% significance level to test the difference between the Stack Overflow and Github results for each of the bug types. In our analysis the null hypothesis is H0: the distributions are the same. If we fail to reject this null hypothesis using the t-test, then we can say the distributions follow the same pattern in both the Stack Overflow and the Github data.

We see that for all the bug types except Non Model Structural Logic Bug, the P value is greater than 5%, indicating that they have a similar pattern, as we fail to reject the null hypothesis.
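To illustrate the mechanics of this comparison, the sketch below runs a two-sample t-test on the API Bug row of Table 4.2 (Stack Overflow vs. Github percentages for the five libraries); the use of scipy and of the default equal-variance two-sample test is an assumption made for this illustration, not a statement of the exact tooling used in the study:

# Sketch of the per-bug-type Stack Overflow vs. Github comparison, using the
# API Bug percentages from Table 4.2 (Caffe, Keras, TF, Theano, Torch).
from scipy import stats

so_api_bug = [6, 11, 11, 7, 16]      # Stack Overflow (%)
gh_api_bug = [0, 57, 72, 3, 2]       # Github (%)

t_stat, p_value = stats.ttest_ind(so_api_bug, gh_api_bug)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # p > 0.05: fail to reject H0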

4.4 Root Cause

In this section, we present the analyses and findings to answer RQ2 identifying major root causes of bugs in deep learning software. The normalized distribution of root causes in Stack Overflow code snippets is shown in Figure 4.2. The data in Table 4.3 shows the presence of different categories of root causes in both Stack Overflow and Github for the deep learning libraries and presents P value showing the similarity of distributions using t-test. We discuss the significant root causes in the following subsections.

4.4.1 Incorrect Model Parameter (IPS)

Finding 5 ⇒ IPS is the most common root cause resulting in average 24% of the bugs across the

libraries

IPS results in bugs that cause the program to crash at runtime, so the execution does not succeed. In Tensorflow and Theano, IPS leads the other root causes in causing bugs, with 26% and 26% of the total share of root causes, respectively.

4.4.2 Structural Inefficiency (SI)

Finding 6 ⇒ Keras, Caffe have 25% and 37% bugs that arise from SI

Figure 4.2: Stack Overflow Root Cause Classification

SI bugs do not cause the program to crash. These bugs often yield suboptimal performance of the deep learning model, and they have more relation to QoS or non-functional requirements. For example, a programmer is trying to train a model to recognize handwritten digits, but the accuracy does not improve and stays constant from epochs 2 - 10.²

Epoch 1/10
2394/2394 [==============================] - 0s - loss: 0.6898 - acc: 0.5455 - val_loss: 0.6835 - val_acc: 0.5716
Epoch 2/10
2394/2394 [==============================] - 0s - loss: 0.6879 - acc: 0.5522 - val_loss: 0.6901 - val_acc: 0.5716
......
Epoch 10/10
2394/2394 [==============================] - 0s - loss: 0.6877 - acc: 0.5522 - val_loss: 0.6849 - val_acc: 0.5716
1027/1027 [==============================] - 0s

2 https://stackoverflow.com/questions/37213388/keras-accuracy-does-not-change

The fix pointed out by an expert, which solved this performance degradation bug, is the following:

# In summary, replace this line:
model.compile(loss="categorical_crossentropy", optimizer="adam")
# with this:
from keras.optimizers import SGD
opt = SGD(lr=0.01)
model.compile(loss="categorical_crossentropy", optimizer=opt)

The answer suggested changing the optimizer to enhance the performance.

4.4.3 Unaligned Tensor (UT)

Finding 7 ⇒ Torch has 28% of the bugs due to UT

In deep learning, tensor dimensions are important for the successful construction of the model. Tensorflow, Keras, Torch, Theano, and Caffe have 16%, 12%, 28%, 7%, and 3% of bugs due to UT, respectively. In Torch, UT is the leading root cause of bugs.

4.4.4 Absence of Type Checking

Finding 8 ⇒ Theano has 30% of the bugs due to the absence of type checking

Most of the deep learning libraries are written in Python. Due to the dynamic nature of Python, the problem of the absence of type checking is felt strongly in these libraries. The absence of type checking leads to 30% of the bugs in Theano, 8% of the bugs in Keras and 15% of the bugs in Tensorflow.

4.4.5 API Change

Finding 9 ⇒ Tensorflow and Keras have 9% and 7% bugs due to API change

In deep learning libraries, API change tends to have a drastic effect. These libraries are interdependent, so an API change in one library breaks other libraries.

4.4.6 Root Causes in Github Data

Finding 10 ⇒ Except for API Misuse, all other root causes have similar patterns in the Github and Stack Overflow data

We computed the P value at the 95% significance level for both the Stack Overflow and the Github data for all the root causes in the five libraries. We see that the P value for the API Misuse root cause is much less than 5%, indicating that, unlike the other root causes, API Misuse has a different distribution in Stack Overflow and Github, as we reject the null hypothesis. The other root causes are similar for both the Stack Overflow and the Github data, as their P values are greater than 5%.

4.4.7 Relation of Root Cause with Bug Type

Finding 11 ⇒ SI contributes 3% - 53% and IPS contributes 24% - 62% of the model-related bugs

We have seen from Figure 4.3 that most of the non-model-related bugs are caused by API Misuse (6% - 100%). Non Model Structural Initialization Bugs and Non Model Structural Processing Bugs are caused by API Misuse 100% of the time in our studied data. Interestingly, for API Bugs, API Change plays the vital role (68%) compared to API Misuse (20%); however, the model-related bugs are more vulnerable to the IPS and SI root causes. We see from Figure 4.3 that Structural Control and Sequence Bugs, Structural Data Flow Bugs, Structural Initialization Bugs, Structural Logic Bugs, and Structural Processing Bugs, which are related to the model, are caused by SI 31%, 3%, 10%, 33%, and 53% of the time, respectively, and by IPS 62%, 59%, 40%, 36%, and 24% of the time, respectively.

Figure 4.3: Relation between Root Causes and Types of Bugs

4.5 Impacts from Bugs

In this section, we explore the answer to RQ3 to understand the major effects of bugs in deep learning software. The normalized distribution of bug effects in Stack Overflow is shown in Figure 4.4. The data in Table 4.4 shows the presence of different kinds of effects in both Stack Overflow and Github for the deep learning libraries. We discuss some of the major effects of bugs in deep learning software in the rest of this section.

Figure 4.4: Distribution of Bug Effects in Stack Overflow

4.5.1 Crash

Finding 12 ⇒ More than 66% of the bugs cause crash.

Our analysis reveals that the most severe effect of bugs is Crash. In deep learning, the bugs mostly cause total failure of the program. In all the libraries, Crash is the top impact, ranging from 40% - 77%, as shown in Figure 4.4.

4.5.2 Bad Performance

Finding 13 ⇒ In Caffe, Keras, Tensorflow, Theano, and Torch, 31%, 16%, 8%, 11%, and 8% of bugs lead to bad performance, respectively

Bad performance is often a concern for deep learning software developers. Even though the model trains successfully, during the evaluation or prediction phase the model may give very poor accuracy in classifying the target classes.

For example, in the following code snippet the user had low accuracy after training because of the use of an incorrect value for the parameter nb_words, which is the maximum size of the vocabulary of the dataset. The developer should use nb_words + 1 instead of nb_words, as answered by an expert 3. If the developer uses nb_words instead of nb_words + 1, the model will not train on the last word, which can lead to the bad performance effect.

embedded = Embedding(nb_words, output_dim=hidden, input_length=maxlen)(sequence)
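The suggested repair is a one-token change to the snippet above; the sketch below restates it with the relevant import, and assumes nb_words, hidden, maxlen, and sequence are defined as in the user's surrounding (unshown) code:

# Sketch of the suggested fix: reserve one extra index so the last word of the
# vocabulary is not dropped (nb_words, hidden, maxlen, and sequence come from
# the user's surrounding code).
from keras.layers import Embedding

embedded = Embedding(nb_words + 1, output_dim=hidden, input_length=maxlen)(sequence)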

4.5.3 Incorrect Functionality

Finding 14 ⇒ 12% of the bugs cause Incorrect Functionality

Incorrect functionality happens when the runtime behavior of the software leads to some unexplainable outcome that is not expected from the logical organization of the model or from previous experience of the developer.

For example, in the following code snippet the user wants to convert the image to a 28∗28 Numpy array; however, the output is a black image.4

with tf.Session() as sess:
    first_image = mnist.train.images[0]
    first_image = np.array(first_image, dtype='uint8')
    pixels = first_image.reshape((28, 28))
    plt.imshow(pixels, cmap='gray')

3 https://stackoverflow.com/questions/37817588/masking-for-keras-blstm
4 https://stackoverflow.com/questions/42353676/display-mnist-image-using-matplotlib

The user got incorrect output because of casting a float array to uint8, which will convert all the pixels to 0 if they are less than 1. To fix the problem, the user can multiply the array by 255, as suggested by an answer. Theano has a higher percentage of posts about incorrect functionality problems compared to bad performance.

4.5.4 Effects of Bugs in Github

Finding 15 ⇒ For all the libraries, the P values for the Stack Overflow and Github bug effects fail to reject the null hypothesis, confirming that the Stack Overflow bugs and the Github bugs have similar effects

The P values shown in Table 4.4 indicate that Bad Performance in Stack Overflow and Github has a P value of 79%, which indicates that the two are very similar. Crash has a P value of 50% between Stack Overflow and Github, indicating that it also cannot reject the null hypothesis with strong confidence. None of the impacts rejects the null hypothesis at the 95% significance level.

4.6 Difficult Deep Learning stages

In this section, we answer RQ4 by studying the bugs arising at the different stages of the deep learning pipeline. We use the categorization of the posts into deep learning stages to analyze RQ4.

4.6.1 Data Preparation

Finding 16 ⇒ 32% of the bugs are in the data preparation stage

From Figure 4.5, we see that most of the bugs in deep learning programming happen at the data preparation stage.

Figure 4.5: Bugs across stages of the Deep Learning pipeline

4.6.2 Training Stage

Finding 17 ⇒ 27% of the bugs are seen during the training stage

The next most bug-prone stage is the Training stage, which is as expected. Most bugs related to IPS and SI arise in the training stage.

4.6.3 Choice of Model

Finding 18 ⇒ Choice of model stage shows 23% of the bugs

Choice of model is the third most bug-prone stage. In the choice of model stage, the model is constructed and the right algorithm is chosen. The major root causes of bugs in this stage are IPS, SI, and UT.

4.7 Commonality of Bug

In this section, we explore the answer to RQ5 to identify whether there is any relationship among the bugs in the different deep learning libraries. Our primary hypothesis was that the libraries would be strongly correlated in their distribution of bugs, as they perform similar tasks.

Our analysis largely confirms that hypothesis, as shown in Figure 4.6. We see that most pairs of libraries have correlation coefficients close to 1. Surprisingly, Caffe shows a very weak correlation with the other libraries in terms of bug types. We then randomly studied 30 Stack Overflow posts for each of the libraries to see whether we notice any common antipatterns that can lead to this strong correlation of bug types.
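For illustration, a correlation of this kind can be computed directly from the per-library bug-type distributions; the sketch below uses the Stack Overflow percentages for Keras and Tensorflow from Table 4.2 and Pearson's correlation coefficient (the exact vectors behind Figure 4.6 may differ, so the printed value is only indicative):

# Sketch of the pairwise comparison behind Figure 4.6: Pearson correlation of
# two libraries' bug-type distributions (Stack Overflow columns of Table 4.2).
import numpy as np

keras_dist      = np.array([11, 24, 0, 1, 13, 0, 2, 13, 1, 27, 8])   # % per bug type
tensorflow_dist = np.array([11, 30, 0, 1,  8, 1, 4, 15, 8, 18, 4])

r = np.corrcoef(keras_dist, tensorflow_dist)[0, 1]
print(f"Pearson correlation (Keras vs. Tensorflow): {r:.2f}")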

Finding 19 ⇒ Tensorflow and Keras have a similar distribution of antipatterns, while Torch has a different distribution of antipatterns

We have identified the antipatterns through a deeper analysis of the Stack Overflow buggy code to further investigate the strong correlation between Tensorflow and Keras as well as the weak correlation of Torch and Caffe. The antipatterns found are Continuous Obsolescence, Cut-and-Paste Programming, Dead Code, Golden Hammer, Input Kludge, Mushroom Management, and Spaghetti Code. This classification is taken from [111]. The distribution of different antipatterns across the libraries is shown in Figure 4.7. We see that in Tensorflow and Keras, 40% of the antipatterns are Input Kludge. On the other hand, in Torch, 40% of the bugs arise due to the Cut-and-Paste Programming antipattern. Tensorflow and Keras have almost the same distribution in Continuous Obsolescence and Dead Code as well. This shows that the strong correlation between the distributions of bugs in Tensorflow and Keras can be explained by the similarity of common antipatterns for these two libraries. The weak correlation between the distributions of Torch and Caffe bugs can be the result of a dissimilar distribution of antipatterns between these two libraries. For example, we show Stack Overflow code snippets of the Input Kludge antipattern from Tensorflow and Keras in Figure 4.8. Both of these programs can be easily broken by user input, and neither program performs sanity checks on its inputs.

Figure 4.6: Correlation of Bug Types among the libraries

Figure 4.7: Distribution of different antipatterns

4.8 Evolution of Bugs

In this section, we explore the answer to RQ6 to understand how the bug patterns have changed over time.

(a) Tensorflow Example of Input Kludge (b) Keras Example of Input Kludge

Figure 4.8: Example of similar antipattern in Tensorflow and Keras

4.8.1 Structural Logic Bugs Are Increasing

Finding 20 ⇒ In Keras, Caffe, and Tensorflow, Structural Logic Bugs show an increasing trend

From 2015 to 2018, the shares of Structural Logic Bugs in Caffe are 30%, 32%, 67%, and 100%, respectively, indicating that structural logic bugs have been discussed more by developers since 2015. This is expected, as deep learning started gaining increasing attention in 2015 and more developers started to use deep learning libraries to write software.

4.8.2 Data Bugs Are Decreasing

Finding 21 ⇒ Data Bugs have slowly decreased since 2015, except in Torch

In Torch, Data Bugs stayed almost consistent, accounting for close to 50% of the bugs discussed in 2016-2018. In Keras, Data Bugs slowly decreased from 27% to 15% since 2015. In Tensorflow, Data Bugs slowly decreased from 30% to 10% between 2015 and 2018. In the other two libraries as well, the Data Bugs slowly decreased, reaching close to 0. The possible reason for this trend is the development of popular specialized data libraries like pandas that enable exploratory data analysis to understand the properties of the data better. Besides, the use of the Tensor data type, which carries type and shape information, helps to get rid of some of the Data Bugs. Still, more verification support in these libraries would help to get rid of these bugs.

Figure 4.9: Timeline of Evolution of Bugs

4.9 Threats to Validity

Internal threat. One internal threat to the validity of our results could be our classification of the bugs. We used the classification scheme from a vetted taxonomy [51, 108] to classify the bugs. We also followed an open coding scheme to add more types when needed. One Ph.D. student was initially dedicated to going over all the posts to come up with an additional classification scheme, if necessary. This whole process was monitored using a pilot study. Another possible source of threat is that the labeling of the data can be biased. To mitigate this threat, two trained Ph.D. students independently studied the posts to label them. The inter-rater agreement was measured using Cohen's Kappa coefficient, and the disagreements were reconciled under the monitoring of an expert. We conducted pilot studies to continuously monitor the labeling process and conducted further training at 5% and 10% of the labeling, where the Kappa coefficient was close to 0% and 80%, respectively.

External threat. An external threat can be the trustworthiness of the dataset we collected. To avoid low-quality posts, we only collected posts with a score of at least 5. A score of 5 can be a good metric to trust the post as a good discussion topic among the programmer community that cannot merely be solved using a Google search. The reputation of the users asking questions about deep learning can be another reason to question the quality of the posts. To alleviate this threat, we have only studied top-scored posts, which are from users with a wide range of reputations (1 - 150K+). This indicates that the posts are from users ranging from newbies to experts. The dataset is unbalanced in terms of the frequency of bugs studied for each library; however, to confirm the distribution of the bugs, we performed an ANOVA test on the bug types, root causes, and impacts for each library. We found that F(0.99) < F-critical(2.55). This implies that the means of the five library populations are not significantly different. This suggests that even though the dataset seems unbalanced in terms of frequency, the bug distribution is not.
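For reference, the mechanics of such an ANOVA check can be sketched as follows; the sketch uses the Stack Overflow bug-type percentages from Table 4.2 as the five libraries' distributions, whereas the study's reported F value of 0.99 combined bug types, root causes, and impacts, so the exact numbers differ (the conclusion, F below the critical value, is the same):

# Sketch of the one-way ANOVA sanity check across the five libraries, using the
# Stack Overflow bug-type percentages from Table 4.2 (illustrative input only).
from scipy import stats

caffe  = [6, 9, 0, 0, 11, 0, 6, 3, 0, 42, 23]
keras  = [11, 24, 0, 1, 13, 0, 2, 13, 1, 27, 8]
tf     = [11, 30, 0, 1, 8, 1, 4, 15, 8, 18, 4]
theano = [7, 35, 4, 0, 25, 0, 4, 0, 0, 18, 7]
torch  = [16, 36, 0, 0, 12, 0, 8, 4, 20, 0, 4]

f_stat, p_value = stats.f_oneway(caffe, keras, tf, theano, torch)
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")  # F well below F-critical = 2.55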

4.10 Discussion

We have seen in the analysis of RQ1 that most of the bugs in deep learning programming are Data Bugs. These types of bugs can have a drastic effect, causing the program to crash as well as leading to bad performance. In general, we see that programmers have very limited or no access to data verification tools. It is often confusing whether the data is in the right format needed by the model, whether the variables are properly encoded or not, whether there are missing data that can cause the model to fail, whether the train/test split is good enough, whether the data is shuffled properly to avoid training bias, etc. This finding suggests that the development of data verification tools can help programmers solve a large number of data bugs. As deep learning models are strongly coupled with data, a model analysis tool to explore whether a particular model is the right fit for the data at hand would also help to resolve this strong coupling between data and model related problems.

We have also seen while exploring RQ1 that structural logic bugs are the second major type of bugs. These happen due to a wrong logical organization of the model or of the hidden layers, using wrong code, etc. These kinds of problems can be solved by automated model and parameter recommendation tools. How to develop such tools needs further research. One methodology could be to mine large-scale open source code repositories [112–114] using a Python dataset [7], identify the common code patterns, and suggest examples from those common code patterns.

4.11 Conclusion

Although deep learning has gained much popularity and a strong developer community in recent years, developing software using existing deep learning libraries can be error-prone. In this chapter, we have presented an empirical study to explore the bugs in software that uses deep learning libraries. We studied 2716 qualified Stack Overflow posts and 500 Github bug fix commits to identify the bug types, the root causes of bugs, and the effects of bugs in the usage of deep learning libraries. We also performed an inter-stage analysis to identify the stages of the deep learning pipeline that are more vulnerable to bugs. We also studied the buggy code in Stack Overflow to find antipatterns leading to bugs, in order to understand the strong correlation of the bug types across deep learning libraries. Our study found that Data Bugs and Logic Bugs are the most severe bug types in deep learning software, appearing more than 50% of the time. Major root causes of these bugs are Incorrect Model Parameter or Structure (IPS) and Structural Inefficiency (SI). Last but not least, bugs in the usage of deep learning libraries are strongly correlated. This work opens multiple avenues for exploration. For instance, while we have studied bugs, we have not yet examined the fix strategies that programmers use. This study is also on a relatively modest dataset and could be repeated on a much larger dataset. Finally, repair strategies could be developed for deep learning programs.

Table 4.3: Statistics of the Root Causes of Bugs

(SO = Stack Overflow, GH = Github)
                                          Caffe        Keras        TF           Theano       Torch       P value
                                          SO    GH     SO    GH     SO    GH     SO    GH     SO    GH
Absence of inter API compatibility        0%    0%     1%    0%     1%    0%     0%    0%     0%    0%    0.1411
Absence of type checking                  3%    12%    8%    3%     15%   15%    30%   20%    8%    13%   0.9717
API Change                                0%    0%     7%    51%    9%    58%    4%    0%     8%    2%    0.2485
API Misuse                                11%   0%     15%   4%     14%   0%     7%    3%     12%   2%    0.0003
Confusion with Computation Model          14%   28%    9%    1%     6%    10%    11%   3%     12%   4%    0.7839
Incorrect Model Parameter or Structure    26%   31%    21%   30%    26%   16%    30%   14%    20%   19%   0.5040
Others                                    0%    0%     0%    0%     0%    0%     0%    0%     0%    2%    0.3466
Structure Inefficiency                    37%   12%    26%   5%     13%   1%     11%   26%    12%   38%   0.7170
Unaligned Tensor                          3%    19%    12%   5%     16%   0%     7%    34%    28%   20%   0.7541
Wrong Documentation                       6%    0%     1%    1%     0%    0%     0%    0%     0%    0%    0.3402

Table 4.4: Effects of Bugs in Stack Overflow and Github

(SO = Stack Overflow, GH = Github)
                          Caffe        Keras        TF           Theano       Torch       P value
                          SO    GH     SO    GH     SO    GH     SO    GH     SO    GH
Bad Performance           31%   19%    16%   14%    8%    8%     11%   6%     8%    24%   0.9152
Crash                     40%   69%    61%   86%    77%   92%    70%   20%    60%   16%   0.7812
Data Corruption           6%    4%     5%    0%     6%    0%     4%    6%     4%    16%   0.948
Hang                      0%    0%     0%    0%     1%    0%     0%    0%     0%    0%    0.3466
Incorrect Functionality   23%   8%     13%   0%     7%    0%     11%   59%    16%   42%   0.5418
Memory Out of bound       0%    0%     3%    0%     1%    0%     4%    0%     0%    0%    0.0844
Unknown                   0%    0%     2%    0%     0%    0%     0%    9%     12%   2%    0.8419

CHAPTER 5. REPAIRING DEEP NEURAL NETWORKS: FIX PATTERNS AND CHALLENGES

5.1 Introduction

The availability of big data has fueled the emergence of deep neural networks (DNN). A DNN consists of a set of layers. Each layer contains a set of nodes collecting inputs from the previous layer and feeding the output to nodes in the next layer via a set of weighted edges. These weights are adjusted using examples, called training data, and set to values that minimize the difference between actual outputs of the DNN and expected outputs measured using an objective function called loss function. The availability of big data has made it possible to accurately adjust weights for DNNs containing many layers. Thus, many software systems are routinely utilizing DNNs. SE for DNNs has thus become important.

A significant SE problem in the software that uses DNNs is the presence of bugs. What are the common bugs in such software? How do they differ? Answering these questions has the potential to fuel SE research on bug detection and repair for DNNs. Fortunately, recent work has shed some light on this issue. Zhang et al. [51] have identified bug types, root causes, and their effects in Tensorflow library for DNN. Islam et al. [3] have studied an even larger set of libraries including Caffe, Keras, Tensorflow, Theano, and Torch to identify bug characteristics. While prior work presents an initial study on repair patterns for Tensorflow, these works have not focused on the characteristics of repairs. Since repairing software that uses DNNs is an unmistakable SE need where automated tools could be very helpful, fully understanding the challenges to repairing and patterns that are utilized when manually repairing bugs in DNNs is critical. What challenges should automated repair tools address? What are the repair patterns whose automation could help developers? Which repair patterns should be prioritized?

Motivated by these questions, we conduct a comprehensive study of bug repair patterns for five DNN libraries: Caffe, Keras, Tensorflow, Theano, and Torch. We leverage the dataset of DNN bugs published by Islam et al. [3], which consists of 415 bugs from Stack Overflow and 555 bugs from Github. We then collect the code snippets used to fix these bugs from both Stack Overflow and Github. We then manually study these repairs and label them according to a classification scheme developed using the open coding approach.

To study the fixes in Stack Overflow, we study the accepted answers and the answers with a score >= 5 from the Stack Overflow posts that fix the bug in the original post. To study the bug fix patterns in Github, we take the bug-fix commits in the dataset and study the code that was changed to fix the bug. If we do not find any fix that matches our selection criteria in Stack Overflow or a relevant fix in Github, we discard those bugs. In total, we have studied 320 bug fixes in Stack Overflow and 347 bug fixes in Github. We have analyzed these bug fixes to answer the following research questions:

RQ1 (Common bug fix patterns) What are the most common bug fix patterns?

RQ2 (Fix pattern across bug types) Are the bug fix patterns different for different bug types?

RQ3 (Fix pattern across libraries) Are the bug fix pattern different for different libraries?

RQ4 (Risk in fix) Does fixing a DNN bug introduce a new bug?

RQ5 (Challenges) What are the challenges in fixing DNN bugs?

Our key findings are as follows: DNN bug fix patterns are distinctive compared to traditional bug fix patterns; the most common bug fix patterns are fixing data dimension and network connectivity; DNN bug fixes have the potential to introduce adversarial vulnerabilities [115]; DNN bug fixes frequently introduce new bugs; and DNN bug localization, reuse of trained model, and coping with frequent releases are major challenges faced by developers when fixing bugs. We also contribute a benchmark of 667 DNN (bug, repair) instances.

This benchmark is also publicly accessible [116].

5.2 Methodology

5.2.1 Dataset

In our study, we build on the bug dataset prepared by Islam et al. [3] to collect and to prepare the dataset of bug fixes. The bug dataset contains 415 bugs from Stack Overflow and 555 bugs from Github for 5 different deep learning libraries as shown in Table 5.1.

Table 5.1: Summary of the bug repair dataset.

             Stack Overflow                  Github
Library      Bugs [3]    Fixes (current)     Bugs [3]    Fixes (current)
Caffe        35          27                  26          17
Keras        162         143                 348         167
Tensorflow   166         118                 100         90
Theano       27          15                  35          32
Torch        25          17                  46          41
Total        415         320                 555         347

Collecting Stack Overflow bug fixes: To collect the bug fixes for the Stack Overflow bug dataset, we study all the answers corresponding to the post ids in the Stack Overflow bug dataset. If a post has an accepted answer with code, then we consider that code snippet as a fix. If the accepted answer doesn't have code but describes what needs to be fixed in the original bug, we consider that a fix as well. If a bug post does not have an accepted answer but has an answer with a score >= 5, we also consider it a fix, as a score of 5 is considered an acceptable quality metric in prior work [3]. Following this methodology, we were able to find 320 fixes for 320 bug-related posts in the Stack Overflow dataset.

Collecting Github bug fixes: To collect Github bug fixes, we went to the link of each buggy code snippet in the dataset. If the code snippet was fixed in a later revision, then we take those fixes. A single line may contain multiple bugs [3], and a single bug fix commit might fix multiple bugs; we consider them different fixes. For example, in the same fix, an API name may be updated from a deprecated version to a new one and a dimension may also be corrected; we consider these two different fixes. Some of the bugs are not yet fixed in the repositories, and some repositories have been made private or deleted since the previous study; we omitted those bugs. Following this methodology, we collected 347 bug fixes from Github.

5.2.2 Bug Fix Pattern Classification

Next, we created a classification scheme to manually label the bug fix dataset. We started with the classification scheme used by Pan, Kim, and Whitehead [41] and found that their classification scheme has 28 non-ML bug fix categories, of which only 4 fix categories are applicable to DNN-related fixes. Then, we used the open coding approach to refine it and come up with 15 different kinds of DNN-specific bug fix patterns.

Table 5.2: Summary of the bug fix patterns.

Bug Fix Pattern             Definition
Loss function               add, remove, or replace the loss function.
Network connection          change node connectivity in the DNN, e.g. change weights, remove edges, add backward propagation.
Add layer                   add another layer to the DNN model.
Layer dimension             change a layer's input and output size, e.g. to make it compatible with adjacent layers' dimensions.
Data dimension              align the input data's dimension with the layer dimension.
Accuracy metric             replace the accuracy metric being used to measure the correctness of a model, often to better match the problem.
Data type                   change the type of data given as input to the DNN.
Activation                  change the activation function used in the DNN.
Iterations                  change the number of times the training would be done, e.g. modify batch size or epochs, or add a loop.
Versioning                  adapt the code to the new version of the library.
API contract                fix API compositions so that the output of an API meets the preconditions of another API.
Data wrangling              fix the form of the data for downstream operations without modifying its intent.
Monitor                     add diagnostics code to monitor training.
Optimizer                   change the optimization function used by the DNN.
Change neural architecture  overhaul the design of the DNN's architecture, including a new set of layers and hyperparameters, generally because the changes above can't fix the bug.

We conducted a pilot study where two Ph.D. students individually studied the fixes to come up with a possible classification. Each student proposed a set of classes that were then reconciled during an in-person meeting where all the authors were present. In the in-person meeting, the authors validated the classification schemes from the two individual raters and updated the classification scheme based on the outcome of the reconciliation effort, under the supervision of the moderator. Our pilot study revealed that there are a number of unique bug fix patterns in our DNN setting. Therefore, the classification from prior work had to be significantly modified. The final classification is shown in Table 5.2 and discussed below.

Finding 1 ⇒ We found that DNN bug fix patterns are very different from traditional bug fix patterns such as [41].

5.2.2.1 Loss Function

This group of fixes is based on the addition, removal, or update of the loss function used during training. The loss function is a key parameter that helps the training process quantify the deviation of the model's outputs from the actual examples. Different kinds of problems demand different loss functions; e.g., cross-entropy loss is widely used in classification problems, whereas mean squared error (MSE) loss is mostly used for regression-based problems. Some problems ask for a custom loss function for better training results, and we group such fixes into this class as well.
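As a concrete illustration of this fix pattern, the Keras-style sketch below swaps a regression loss for a classification loss at compile time; the layer sizes, the optimizer, and the "before" line are illustrative, not taken from a specific studied fix:

# Sketch of a typical loss-function fix: a classification model compiled with a
# classification loss instead of a regression loss (sizes are illustrative).
from keras.models import Sequential
from keras.layers import Dense

model = Sequential([Dense(64, activation='relu', input_shape=(100,)),
                    Dense(10, activation='softmax')])

# Before (buggy): model.compile(loss='mean_squared_error', optimizer='adam')
model.compile(loss='categorical_crossentropy', optimizer='adam',
              metrics=['accuracy'])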

5.2.2.2 Network Connection

This group of fixes changes the connections between nodes in the DNN. A DNN is a graph, where the edges are the weights and biases and the nodes are the elements of each layer. For example, in a dense layer, the weight edges are fully connected with the next layer, and the dimension of the layer determines the number of nodes available in that layer. Bug fixes that reconfigure these connections for better results are classified in this category. The changes include changing weights, removing edges by pruning the network, adding backward propagation, etc.

5.2.2.3 Add Layer

In any classification-based problem, there will be at least two layers in the model: the input layer and the output layer. To learn the features of the input, a DNN frequently needs more intermediate layers (called hidden layers). This group of fixes adds more layers to the DNN to improve performance. Added layers can be a dense layer, where two consecutive layers are fully connected, a convolutional layer, where a convolution function is applied to the input, a dropout layer for reducing overfitting, etc.
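The Keras-style sketch below illustrates this fix pattern by inserting an extra hidden layer and a dropout layer between the input and output layers; the layer sizes and dropout rate are illustrative:

# Sketch of the "Add layer" fix pattern: an extra hidden layer and a dropout
# layer are inserted between the input and output layers (sizes illustrative).
from keras.models import Sequential
from keras.layers import Dense, Dropout

model = Sequential()
model.add(Dense(128, activation='relu', input_shape=(784,)))
model.add(Dense(64, activation='relu'))   # added hidden layer
model.add(Dropout(0.5))                   # added dropout layer to reduce overfitting
model.add(Dense(10, activation='softmax'))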

5.2.2.4 Layer Dimension

These fixes change the dimensions of the layers to make them compatible with adjacent layers and input.

5.2.2.5 Data Dimension

Data dimension related fix is similar to layer dimension, but it is related to the input data rather than to the DNN layers. The dimension of the data needs to be aligned with the DNN. This type of fix is mostly needed when the input dimensions of the DNN and the data dimension do not match.

5.2.2.6 Accuracy Metric

To measure the correctness of a DNN, the accuracy metric is one of the key parameters to be configured.

The problem type has a huge influence on the type of accuracy metric to be used, e.g., classification problems are judged using classification accuracy, F1 score or confusion matrix, but these metrics are unsuitable for assessing a regression-based model where logarithmic loss is more suitable.

5.2.2.7 Data Type

This group of fixes changes the data type of inputs to match the DNN’s expectation.

5.2.2.8 Activation

The activation function for a node in a layer of DNN maps inputs to the output. This group of fixes changes the activation function used in a layer to better match the problem.

5.2.2.9 Iterations

This group of fixes adjusts the number of times the training process will be run.

This is generally done to improve accuracy or to reduce overfitting. These fixes include changing batch size or epoch. In some cases, developers add a loop around the entire training process.

5.2.2.10 Versioning

DNN libraries are being rapidly developed, and a number of releases are not backward compatible, which breaks existing code. This group of fixes adapts code to work with the new version of the DNN library.

5.2.2.11 API Contract

When the output of a DNN API is fed to the input of another DNN API operation, these two operations have to be compatible. This group of fixes adds adapters to fix incompatibilities between composed operations.

5.2.2.12 Data Wrangling

Data wrangling refers to changing the form of data without changing its intent. It is generally done to fix the data for the downstream operations. This group of fixes adds data wrangling to fix a DNN, e.g. by data shifting, shuffling, etc.

5.2.2.13 Monitor

The fixes in this category add code for diagnostics during the training process, typically to print training statistics. This group of fixes does not repair the flaw in the code, but it helps to localize the bug.

5.2.2.14 Optimizer

This group of fixes modifies the optimization algorithms used by the DNN model. The optimization algorithm, which is dependent on the problem, determines the iterative process followed to improve the accuracy of the DNN model.

5.2.2.15 Change Neural Architecture

This group of fixes essentially re-do the DNN model because the initial model was unsuitable.

5.2.3 Labeling

For labeling, we used the classification scheme shown in Table 5.2. Two Ph.D. students with expertise in these DNN libraries were requested to label the fixes according to the classification scheme. We held multiple training sessions to train the raters on the classification scheme. We used the Kappa coefficient [109] to measure the agreement between the raters after the labeling of every 100 bug fix patterns. We found that the Kappa coefficient was 82% for the first 100 labelings and 85% for the second 100 labelings. This high value of the Cohen's Kappa coefficient indicates almost perfect agreement between the raters. In the presence of a moderator, the repair patterns for which there was a label conflict between the raters were reconciled. We adapted this methodology from [3]. Following this strategy, we labeled all the fixes and reconciled the labeling conflicts through moderated discussions. The Kappa score throughout the process was >85%, indicating a clear understanding and almost perfect agreement.

5.3 Bug Fix Patterns

In this section, we explore the answer to RQ1 to understand what the most common bug fix patterns in DNNs are. To answer RQ1, we take the labeled dataset and compute the statistical distribution of the bug fix patterns across the different categories. We also analyze the source code and diffs of those fixes to understand the challenges underlying those patterns. Figure 5.1 shows the distribution of different bug fix patterns in Stack Overflow and Github.

5.3.1 Data Dimension

Finding 22 ⇒ Fixing data dimension is the most common bug fix pattern (18.8%) in Stack Overflow, and it can affect the robustness of the DNN model.

Figure 5.1: Bug fix pattern distribution. (a) Distribution of bug fix patterns in Stack Overflow (labels less than 3.1% are hidden); (b) distribution of bug fix patterns in Github (labels less than 3.2% are hidden).

Table 5.3: Bug Fixes in Stack Overflow (SO) and Github (GH) Caffe Keras Tensorflow Theano Torch SO GH SO GH SO GH SO GH SO GH Loss 11.1% 0.0% 6.3% 1.2% 4.2% 7.8% 13.3% 6.25% 5.9% 4.9% function Network 14.8% 11.8% 18.9% 10.2% 22% 13.3% 0.0% 21.9% 0.0% 26.8% connec- tion Add layer 11.1% 11.8% 5.6% 9.6% 2.5% 1.1% 0.0% 0.0% 5.9% 2.44% Layer 3.7% 0.0% 7.0% 26.3% 13.6% 3.3% 13.3% 9.4% 11.8% 9.8% dimen- sion Data di- 22.2% 0.0% 22.4% 9.6% 11.9% 2.2% 26.7% 15.6% 23.5% 7.3% mension Accuracy 0.0% 0.0% 1.4% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% metric Data type 3.7% 29.4% 7.7% 13.8% 19.5% 10.0% 26.7% 6.2% 29.4% 14.6% Activation 7.4% 0.0% 3.5% 3.6% 0.8% 0.0% 6.7% 12.5% 0.0% 2.4% Iterations 7.4% 5.9% 4.95% 3.6% 5.9% 4.4% 0.0% 9.4% 5.9% 2.4% Versioning 0.0% 0.0% 6.3% 9.0% 8.5% 51.1% 6.7% 0.0% 5.9% 0.0% API con- 3.7% 0.0% 2.1% 1.2% 5.1% 1.1% 0.0% 3.1% 0.0% 0.0% tract Data 0.0% 35.3% 4.2% 2.4% 1.7% 1.1% 0.0% 6.2% 0.0% 19.5% wrangling Monitor 11.1% 5.9% 2.8% 1.2% 1.7% 4.4% 0.0% 0.0% 0.0% 4.9% Optimizer 0.0% 0.0% 1.4% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 2.4% Change 3.7% 0.0% 5.6% 8.4% 2.5% 0.0% 6.7% 9.4% 11.8% 2.4% neural arch.

A large number of bugs (59 out of 415) in Stack Overflow are fixed by changing the data dimension. This suggests that most DNN models can easily be broken if the data processing pipeline changes or a different 87 format of data is fed to the DNN. For example, in the following code snippet, we see how the bug discussed in a Stack Overflow post1 is fixed by adding a dimension to the input images.

1 model= Sequential()

2 ...

3 model.compile()

4 model.load_weights(’./ddx_weights.h5’)

5 img= cv2.imread(’car.jpeg’, -1) # this is isa 32x32 RGB image

6 img= np.array(img)

7 + img= img.reshape((1, 3, 32, 32))

8 y_pred= model.predict_classes(img, 1)

9 print(y_pred)

In the listing, the developer wants to read a CIFAR-10 image whose dimension is (32,32,3) but the expected image size was (1,3,32,32). Data dimension change can be categorized into the following kinds.

Resize Resizing the input data is common, e.g. resizing an input image of shape (190,150) to (150,

150). A risk in this kind of fix is the loss of information from the input due to resizing. Surprisingly, this risk is never stated in the fixes presented on the bug fixes we have studied. 11 out of the 59 data dimension

fixes involve resizing the data. Resizing can be done in two ways: downscale or upscale. The downscale is the method where the risk due to data loss is critical from our observation. Upsampling does not have this risk of data loss, and recent results suggest that adding noise to the data can potentially increase the robustness of a model [117].

Finding 23 ⇒ 63% of the resize related posts in Stack Overflow utilize the downscaling that can

decrease the robustness of a DNN.

7 out of the 11 data resizing post in Stack Overflow involves downscaling. Downscaling decreases the robustness and [118] has shown that a simple resize downscaling operation can have a negative impact on the robustness. During downscaling, significant information loss occurs, and that eventually decreases the features learned by the DNN. A DNN trained with downscaled images will be easier to attack compared to

1https://stackoverflow.com/questions/37666887 88 the one trained with original images. Our findings suggest that it would be useful to verify the effect of the resizing fix on the vulnerability of the DNN.

Reshape Reshaping the input occurs when the input vector shape is changed. For example, a vector of size (32, 32) is changed to (1,32,32). In this case, no data loss happens and the tensor order is changed from 2D to 3D. An example of this fix is presented in the Stack Overflow post #415637202. The reshaping does not lead to data loss. 38 out of 59 data dimension fixes involve reshaping the dimension of the input.

Reshape may also involve changing the dimension through one hot encoding like the following code snippet to fix Stack Overflow post #493929723:

1 train_labels= to_categorical(train_labels)

Reorder To make this kind of dimension change, the input data is ordered mostly to change the chan- nel position. In image classification problems, channel refers to the color channels of three primary colors.

(height, width, channel) represents the typical structure of a 3D image. For example, the input of shape

(32,32,3) is changed to (3,32,32) to fix some bugs. Here the channel number is moved to the first argu- ment from the third argument. It can also involve changing the image dimension order format like from RGB to BGR as in the following snippet for fixing Stack Overflow post # 338285824:

1 img= caffe.io. load_image("ak.png")

2 + img= img[:,:,::-1] *255.0# convert RGB->BGR

Finding 24 ⇒ Reorder and reshaping (79.7% of the data dimension fixes in Stack Overflow) need

an understanding of the specifications of the DNN layers as well as the libraries.

9 out of 59 data dimension fixes involve reordering the dimension of inputs. This is done because some of the libraries require dimension in a specific order. These fixes are seen in the bugs where the developer works with multiple libraries having different channel position requirements in the image data, such as Stack

2https://stackoverflow.com/questions/41563720 3https://stackoverflow.com/questions/49392972 4https://stackoverflow.com/questions/33828582 89

Overflow post #456452765. DNN training can be assumed as a gradient descent based optimization problem, which can be computed when all the functions utilized in the model creation are differentiable. Data should be changed in such a fashion that does not affect the gradient descent computation to avoid side effects. In reshape and reorder, the only changes occur is the addition of dimension and reordering of the values that do not impact the gradient descent computation. So these changes theoretically have no side effects in the

DNN models’ behavior.

5.3.2 Layer Dimension

Finding 25 ⇒ In Github layer dimensions fixes are used more frequently (15.6%) to fix the crash

related bugs (75.9%).

In Github, data dimension related fixes involve 7.5% of all the fixes. On the other hand, fixing the layer dimensions to make the DNN compatible with input data is a more common practice in Github. Dimension related fixes can be done by analyzing the input and output of the layers by converting a neural network into a data flow graph. This kind of fixes includes dimension reduction or addition based on the adjacent layers’ structure. However, these fixes can be either done by changing the data dimension to match the data with the layer dimension or vice-versa. The choice of the fix has an impact on the performance of the model.

This phenomenon is known as the curse of dimensionality [119]. The curse of dimensionality states that increasing or decreasing the dimension can lead to overfitting/underfitting problems. PCA [120], T-SNE

[121] are some examples of the dimension reduction techniques that reduce the dimension of the features but these techniques suffer from the curse of dimensionality. To build an automated approach to avoid this side effect, a tool needs to optimize the performance of the model by either changing the data dimension or the layer dimension. AutoML [95] has done some preliminary work in this field that restructures the model by changing the layer dimension and adding layers to increase the performance. To the best of our knowledge, no tool currently exists that analyzes both data dimension and layer dimension changes to pick the optimum operations for a DNN model.

5https://stackoverflow.com/questions/45645276/ 90

5.3.3 Version-related Fixes

Finding 26 ⇒ Versioning-related bug fixes are the highest (17.6%) in Github indicating the high

maintenance cost in DNN software due to library versioning.

We have found that in Github, long-running projects have to fix a lot of bugs due to frequently changing versions of the DNN libraries. A number of these fixes require changing the API signatures to match with changes in the libraries. We have also observed a more complicated fix pattern for projects that use

Tensorflow library as discussed in §5.7.3. Tensorflow often makes invasive, backward-incompatible changes adding difficulties to fix the introduced bugs. This indicates that the maintenance cost in DNN software is high.

5.3.4 Network Connection

Finding 27 ⇒ Network Connection is a prevalent fix in both Stack Overflow (17.8%) and Github

(14.1%) to fix crash (57.14%), incorrect functionality (16.19%), and bad performance (12.38%) ef-

fects.

The tensor and data flow through the network in a DNN during forward and backward propagation or prediction. For a smooth flow of data, the end-to-end connectivity in the network is essential. 57 out of

415 fixes require fixing or adjusting the connectivity in the network. We have found three kinds of network connection repairs.

Merge layers A number of repair cases fixed bugs by merging two parallel layers into a single layer.

For example, the following code snippet shows a fix,

1+ main_branch.add(Merge([branch_1, branch_2], mode=’dot’)) where two different branches are connected through dot product. The network was disconnected in the bug leading to a crash. 91

Add feedback loops and input layers In some bug fixes, a feedback loop is added in the DNN model.

In some of the fixes, the model is connected to the input via an input layer like the following:

1+ lstm_out= LSTM(128, input_shape=(maxlen, len(chars)))(net_input)

Transfer learning Transfer learning is a popular technique that takes an already-trained network with a different dataset. Then, the new model modifies the last layers to support the goal of the new problem and then performs some retraining without modifying the weights and biases of the layers from the previous network. We have observed several network connection fixes needed when the developer is attempting transfer learning. Generally these fixes change the last few layers of the DNN. One such kind of fix is shown below from Stack Overflow post #572481216:

1 + model_final.fit_generator(train_generator.flow(np.array(X_train), np.array(

y_train), batch_size=32),

2 + validation_data=test_generator.flow(np.array(X_test), np.array(y_test),

batch_size=32),

3 + steps_per_epoch=len(X_train)/32,

4 + validation_steps=len(X_test)/32,

5 + epochs=50)

In this example, the developer wants to train the with a pretrained network VGG19 that has been used for face recognization. In this bug, the developer does not provide the correct data input size that leads to an error and fix was to include a data generator that loads the training data as expected by the VGG19 model.

5.3.5 Add Layer

Finding 28 ⇒ 30% of the add layers related fixes in Stack Overflow includes adding Dropout layer

that can increase the training time ∼2-3x.

6https://stackoverflow.com/questions/57248121/ 92

In a DNN model, adding layers helps to increase the performance and learn the features more accurately.

We have found that a vast majority (∼30%) of these bug fix patterns includes the addition of the dropout layer. Dropout layer helps in removing the effect of the overfitting [122] that can also be achieved by using . According to [122], backpropagation works better for the training dataset but does not work for new data. Dropout layer randomizes the structure of the architecture that helps the neural network to learn more features with every iteration. Dropout layer removes the connection between layers randomly stopping the data flow through those nodes and edges. Randomly reducing connections can have a negative impact on training time. With Dropout layers, the convergence of the training takes ∼2-3x more time [122].

5.3.6 Loss Function

Finding 29 ⇒ Among DNN hyperparameters, change of loss function happens to fix 6.2% (highest)

of the bugs in Stack Overflow and 3.7% in Github that helps to enhance prediction accuracy and

increase the robustness against adversarial attacks.

Loss function plays a crucial role in the convergence of the training process and in getting better accuracy during prediction. A model with wrong loss function does not learn the decision boundary of the features well and there can be overlap between the decision boundaries [8, 123] in the high dimensional feature space making the model vulnerable to adversarial attacks [103]. By a careful and deeper analysis of these loss function related fixes, we have found that they can be categorized into the following kinds:

Add new loss function. The fixes in this category involve adding a custom or built-in loss function.

10 out of 23 fixes fall into this category. In some of the fixes, it is needed to modify the network connec- tivity for the new loss function to work. For example, in the following fix of the bug in Stack Overflow post #512570377, the last layer is kept outside the gradient descent computation during training by adding trainable = False.

1 output= Dense(1, trainable= False)(hidden_a)

7https://stackoverflow.com/questions/51257037/ 93

The custom loss function was designed by the developer in such a way that all but the output layer participate to lead to the convergence of the model. However, the convergence was not successful, as the output layer was actively participating in the forward and backward propagation that caused an abrupt change in the value of the loss function. Fixing these bugs require live trainable parameter analysis. This approach will help to monitor the active trainable parameters during the training to localize and fix these bugs. Currently, the developer needs to rely on theoretical knowledge to fix these bugs due to the lack of such kind of analysis frameworks.

Change loss function. 9 instances of bug fixes fall into the category of changing the loss function.

Our analysis of these fixes reveals that the choice of these loss functions is sometimes confusing. De- velopers need to understand the data properties and the goal of the DNN task to come up with a proper loss function. For example, in constructing DNN models for classification problems, the developers are confused between the choice of binary_crossentropy and categorical_crossentropy as discussed in the fix of Stack Overflow post #457994748 and Stack Overflow post #420812579. The first loss function works better for the binary classification problems; however, when the classification problem has more than two categories, one should use categorical_crossentropy as a loss function to avoid poor performance.

Sometimes, the fix involves adding some filter to the mathematical operation used in the loss function. For example, we see the following bug fix of Stack Overflow post #3422331510

1 cross_entropy=-tf.reduce_sum(y_ * tf.log(tf.clip_by_value(y_conv, 1e-10,1.0))) caused by the following line:

1 cross_entropy=-tf.reduce_sum(y_ * tf.log(y_conv))

In the above code snippet, the problem is that the user will get NaN values if y_conv becomes negative as the log of negative numbers is undefined. The fix adds a clipper function to filter out negative values to the log operation. In another fix of the same kind of bug in Stack Overflow post #4252140011, softmax is used as the filtering operation that stops propagating values <= 0 to the log operation.

8https://stackoverflow.com/questions/45799474/ 9https://stackoverflow.com/questions/42081257 10https://stackoverflow.com/questions/34223315/ 11https://stackoverflow.com/questions/42521400/ 94

1 softmax= tf.nn.softmax(logits)

2 xent=-tf.reduce_sum(labels * tf.log(softmax), 1)

5.3.7 Commonality of Fix Patterns in Stack Overflow and Github

Finding 30 ⇒ The p-value is 0.83 between the bug fix pattern distributions of Stack Overflow and

Github indicating commonality of bug fix patterns in Stack Overflow and Github.

We have conducted a t-test at 95% significance level to understand the distribution of bug fix patterns in

Stack Overflow and Github. The null hypothesis is H0 : the distributions are the same. The null hypothesis is to be rejected if the p-value is less than 5% or 0.05. Our computation shows that the p-value is very high (0.83). So, H0 can not be rejected concluding that the distributions are similar. We also notice that though in some bug fix categories e.g., data dimension, layer dimension, and versioning, there is a significant difference among the Stack Overflow and Github distributions, the other categories have a similar share of occurrences in Stack Overflow and Github. This indicates that the bug fix patterns have commonality across

Stack Overflow and Github.

5.4 Fix patterns across bug types

To answer RQ2, we analyze the correlation between the bug types in the bug dataset presented by [3] and the bug fix patterns studied by this thesis using the distribution of the bugs and their corresponding fixes.

The distribution of bug fix patterns across different bug types in Stack Overflow and Github are shown in the

Figure 5.2 and 5.3, respectively. The horizontal and the vertical axes describe the different bug types from

[3] and the percentage of different fix patterns needed to fix those bugs, respectively.

Finding 31 ⇒ For API bugs, fixing of the specifications between APIs is dominant (42% in Stack

Overflow and 48% in Github). 95

120 Accuracy Metric Activation Optimizer Layer Dimension Add Layer Data Type Data Dimension Loss Function API Contract Iterations Data Wrangling Monitor Versioning Change Neural Arch Neural Connection 100

80

60

40 Percentage of Fix Patterns

20

0

API Bug Data Bug SB.Logic Bugs NMSB.Logic Bugs SB.Data flow Bug SB.Initialization Bug SB.Processing Bug NMSB.Initialization Bug NMSB.Processing Bug SB.Control and Sequence Bug Bug Type

Figure 5.2: Distribution of Bug Fix Patterns for Different Bug Types Stack Overflow

Fixing API specifications involves changing API contracts due to API versioning and supporting inter- library operations within a model. Fixing API specifications is needed due to the following reasons:

Change of specifications due to version upgrade. 20 fixes in Stack Overflow involve changing specifi- cations which are required due to the change of the library version. The changes during the upgrade of the library version involves the following changes: change fully qualified method names, change API signature, and change probabilistic behavior of the APIs. Though fixes due to the change of fully qualified method names and change of API signature are well-studied problems [124–126], the fixes due to the change of probabilistic behavior of the APIs are hard and different from traditional API changes. Localizing of these 96

120 Accuracy Metric Activation Optimizer Layer Dimension Add Layer Data Type Data Dimension Loss Function API Contract Iterations Data Wrangling Monitor Versioning Change Neural Arch Neural Connection 100

80

60

40 Percentage of Fix Patterns

20

0

API Bug Data Bug Coding Bug SB.Logic Bugs NMSB.Logic Bugs SB.Data flow Bug SB.Initialization Bug SB.Processing Bug NMSB.Processing Bug

SB.Control and Sequence Bug NMSB.Control and Sequence Bug Bug Type

Figure 5.3: Distribution of Bug Fix Patterns for Different Bug Types Github bugs are difficult due to the lack of sophisticated probabilistic analysis tools for DNN. For example, the bug discussed in Stack Overflow #4974206112 says that the results are different in two versions of Tensor-

flow. The fix of this bug involves adding a dead code line that tweaks around the underlying probabilistic behavior of the APIs by overriding the modified random seed. The fix of Stack Overflow #49742061 is shown in Figure 5.4. The fix adds the line Xf = tf.contrib.layers.flatten( X ) before the line R = tf.random_uniform( shape = () ). This addition overrides the random seed in the new version with the one in the previous version.

12https://stackoverflow.com/questions/49742061/ 97

Figure 5.4: Fix of Stack Overflow #49742061

Figure 5.5: Fix of Stack Overflow #54497130

Our observation gives the intuition that the fix of versioning bugs due to the change of the probabilistic distribution in different version needs new DNN specific probabilistic analysis techniques. 98

Change specification to support interlibrary. In these fixes, the DNN program uses more than one library. These bugs arise due to the similar assumption of the behavior and specifications for different

APIs in different libraries. Fixing of these bugs requires the expertise in both the libraries e.g., the bug discussed in Stack Overflow #5449713013 that is shown in Figure 5.5. The discussion points to an issue in the official Tensorflow repository. The solution suggested to avoid using APIs from other libraries to pre-process images. However, in similar scenarios, the use of specialized image processing libraries is recommended to get better performance.

From Figure 5.2 and 5.3, we have found that fixing the data dimension is the most prominent pattern

(41.77%) for fixing data bugs in Stack Overflow. For fixing data bugs in Github, the most prominent fix patterns are the change of data type (38.64%) and data dimension (29.55%). This suggests that for fixing data bugs, the major changes are related to data dimensions. This happens because the dimension of the data is very important for the correct functionality of the DNN model.

For fixing logic bugs the most common practice is to modify the network connectivity (∼27.03% in

Stack Overflow and ∼33.33% in Github). A detailed discussion on network connectivity is presented in

§5.3.4. Whereas, a significant amount of data flow bugs can be fixed by changing the layer dimension

(∼36.36% in Stack Overflow and ∼38.98% in Github). A detailed discussion on fixing layer dimension is presented in §5.3.2.

These observations give us the intuition that for fixing different types of bugs, unique technical ap- proaches might be needed.

5.5 Fix Patterns across libraries

To answer RQ3, we have studied the distribution of fix patterns across the 5 libraries. Then, we have conducted statistical pairwise t-test at 95% significance level between the libraries. Table 5.4 shows the p-values found from this test across the libraries.

We assume the null hypothesis is H0: the distribution of the fix patterns across two libraries are same.

If the p-value is less than 5% or 0.05, then we reject H0. The p-value for the library pairs Caffe-Theano

13https://stackoverflow.com/questions/54497130/ 99

Table 5.4: P-value of the distribution of Bugs btween the libraries

Library Caffe Keras Tensorflow Theano Torch Caffe 1.0 0.0045 0.00735 0.19 0.30 Keras 0.0045 1.0 0.84 0.0021 0.0024 Tensorflow 0.0073 0.84 1.0 0.0039 0.0044 Theano 0.19 0.0021 0.0039 1.0 0.80 Torch 0.30 0.0024 0.0044 0.80 1.0

(.19), Caffe-Torch (.30), Keras-Tensorflow (0.84), Theano-Torch (0.8) are much greater than 5%. So in these, cases we can not reject the null hypothesis. So, the libraries Caffe, Theano, and Torch show similar kind of bug fix patterns. The pair Keras-Tensorflow form a very strong related group with a p-value close to 100%.

This suggests that similar kinds of automatic bug fix tools may be reused for Caffe, Theano, and Torch after converting into a common intermediate representation. Similarly, Keras and Tensorflow bugs can be fixed using similar technical approaches.

5.6 Introduction of Bugs Through Fixes

Finding 32 ⇒ 29% of the bug fixes introduce new bugs in the code adding technical debt [31] and

maintenance costs.

To answer RQ4, we have analyzed 100 randomly chosen fixes from Stack Overflow to understand whether

fixing a bug can introduce a new bug. We have read the replies to the answers selected by filtering criteria discussed in §5.2. Then, we have identified whether the fix introduced new bugs by analyzing all replies to the answer fixing the original bug and classify them into bug type, root cause, and impact using the classification scheme proposed by the prior work [3]. We have found that 29% fixes in the randomly sampled dataset introduce at least one new bug in the code. Here, a new bug indicates that the original bug was fixed by the solution posted; however, the solution introduces a new bug that is different from the original bug type. Furthermore, we have compared the bug type, root cause, and the effect of the bugs of Stack Overflow posts with the newly introduced bugs and have found that only 6.8%, 13.8%, and 24.1% of the bugs match the classification of the parent bug type, root cause, and impact, respectively. This result shows that a 100

Table 5.5: Statistics of the Introduction of New Bugs During Bug Fix

Bug Type Root Cause Impact Library API Bug Data Bug SB.DF SB.I SB.L SB.P ATC APIC APIM IPS SI UT Bad performance Crash IF Caffe 0% 0% 0% 0% 100.0% 0% 0% 0% 0% 33.3% 66.7% 0% 66.7% 0% 33.3% Keras 30.8% 7.69% 30.7% 0% 23.1% 7.69% 7.69% 30.8% 0% 15.4% 15.4% 30.8% 23.1% 61.5% 15.4% Tensorflow 60.0% 20.0% 0% 10% 10.0% 0% 20% 60% 0% 10% 10% 0% 20% 70% 10% Theano 0% 0% 0% 0% 100% 0% 0% 0% 0% 0% 100% 0% 100% 0% 0% Torch 50.0% 0% 50% 0% 0% 0% 0% 0% 50% 0% 0% 50% 0% 50% 50% majority of the bugs introduced are of new types and their behavior is entirely different than that of the parent bugs’. In the Table 5.5, we have shown the distribution of the new bugs across the different libraries and how these new bugs are classified into different categories of bug type, root cause, and impact. We have also found that the Crash (55.8%) is the most common impact of these new bugs and a majority of these bugs are of API Bug (37.9%), and the most common root cause of these bugs are API Change (34.5%) that includes the change of either the signature of the API or the fully qualified name of the API. 44.8% and

34.5% of the newly introduced bugs are from Keras and Tensorflow. Caffe, Theano, and Torch related bug

fixes introduce 10.34%, 3.45%, and 6.90% new bugs, respectively.

Finding 33 ⇒ 37.9% of the new bugs are from API Bug, 34.5% of them are due to API Change,

and 55.2% of them end in a crash.

In Figure 5.6, the relation between the parent bugs’ root cause, type and effect with the newly introduced bugs’ distribution has been visualized. In this visualization, the old represents the parent bug and the relation has been drawn by a connection between two bug distributions. The width of the connection determines the strength of the relation. The perimeter covered by each bug type/root cause/impact depicts its overall strength. We have found that a large section of bug fixes introduces API bug and the major reason for that is the API change that mostly due to the versioning of the APIs and these fixes primarily lead to a crash and bad performance.

5.7 Challenges in Fixing Bugs

In this section, we explore the answer to RQ5 to identify the common challenges faced by the developers in fixing the DNN bugs. To understand the challenges, we have used a classification scheme separate from 101

APIM SB.DF IPS Data SB.I SI SB.Logic

API APIC

UT

SB.P ATC

Old SB.P Old UT

Old ATC Old Data

Old SB.L Old SI Old CCM Old SB.CSB Old NMSB.L Old IPS

(a) Root Cause (b) Bug Type

Crash

BP

IF

Old Unknown

Old BP

Old IF Old Crash

(c) Impact

Figure 5.6: Bug fix pattern distribution: SB.P→SB.Processing, SB.L→SB.Logic, DF→Data Flow,SB.I→SB.Initialization, ATC→Absence of Type Checking, BP→Bad Performance, IF→Incorrect Functionality 102 bug fix patterns. Similar to the labeling performed for bug fix patterns, two raters have independently classified each post used in this study. These classes of new challenges are described below:

5.7.1 DNN Reuse

Training DNN models can be expensive because it requires sophisticated computational resources and a large amount of labeled data that might not be readily available. This has led to the reuse of DNN models that creates unique issues such as backdoor attack [127], injection of bias [128], and mismatch between the intent of the pretrained DNN model and the intent of the developer.

1 base_model= ResNet50(input_shape=(224, 224, 3),

2 include_top=False,weights=’imagenet’,pooling=’avg’)

3 +x=base_model.output

4 +x= Dense(512, activation=’relu’)(x)#add new layer

5 +x= Dropout(0.5)(x)#add new layer

6 +x= Dense(512, activation=’relu’)(x)#add new layer

7 +x= Dropout(0.5)(x)

In the example above from Stack Overflow post # 4922644714, the developer wants to train a predefined

DNN model structure ResNet50 using the cancer dataset. The trained network results in overfitting as the developer was not aware of the structure of the reused model and needed to modify the network by adding dropout and dense layers to reduce the effect of overfitting.

5.7.2 Untraceable or Semi-Traceable Error

In case of a crash bug, the common strategy to localize the bug is to analyze the error message. However, we have found that bug localization is very challenging in DNN software because errors and faults are non- trivially related. To illustrate, consider the code snippet below from Stack Overflow post # 3347442415:

1 model= Sequential()

2 model.add(Dense(hidden_size, input_dim=input_size, init=’uniform’))

3 model.add(Activation(’tanh’))

14https://stackoverflow.com/questions/49226447 15https://stackoverflow.com/questions/33474424 103

4 ...

5 y_pred= model.predict(X_nn)

This code produces the following error trace:

1 ttributeError Traceback(most recent call last)

2 in ()

3 1

4 ----> 2 y_pred= model.predict(X_nn)

5 491 def predict(self,X, batch_size=128, verbose=0):

6 492X= standardize_X(X)

7 --> 493 return self._predict_loop(self._predict,X, batch_size,

8 verbose)[0]

9 494

10 495 def predict_proba(self,X, batch_size=128, verbose=1):

12 AttributeError:’Sequential’ object has no attribute’_predict’

From this error message, a developer might start digging into the code of predict function and the

Sequential object; however, the issue is the missing compilation step. Due to this, the model connec- tion is not initialized and error propagates to the predict operation and halts the training process. We have studied randomly 50 bugs yielding Crash from Stack Overflow. We have found that 11 out of 50 posts does not indicate any error message and in rest of the 39, 20 posts have a fix that does not match with the error message.

5.7.3 Fast and Furious Releases

We have previously discussed that a large number of fixes are due to the rapid versioning and invasive changes in DNN libraries.

To study this challenge, we have labeled all removed, reconfigured, or renamed operations of Tensorflow from version 1.10 to 2.0 (latest in June 2019).

In Table 5.6, we have shown the number of symbols of operations available for each Tensorflow releases and the number of operations that have been deleted, renamed, or reconfigured in comparison to the previous 104

Table 5.6: Tensorflow API changes. Change= # of operations changed in comparison to the previous version.

Version # of Symbols Change Release Date [129] v2.0 (Beta) 6504 2185 Jun 7, 2019 v1.14.0 8363 59 Jun 18, 2019 v1.13.1 3560 39 Feb 25, 2019 v1.12.0 3314 52 Nov 1, 2018 v1.11.0 3145 175 Sep 25, 2018 v1.10.0 3230 N/A Aug 7, 2018 version. We have found that from the v1.14 to v2.0 26% of the operations have been changed. We have also studied Keras v2.0, v2.1, v2.2, and v2.3 to understand whether this problem is only prevalent in

Tensorflow or not. Our study has found that during the transition from v2.0-v2.1, v2.1-v.2.2, and v2.2-v2.3, the percentage of changes in operation are 6%, 8%, and 4%, respectively.

A non-trivial challenge for repairing DNN software is the probabilistic behavior of the APIs. Some of these version upgrades also change the probabilistic behavior of the APIs causing some difficult bugs. An example is presented below where the change of the probabilistic distribution changes the output of the same operation with different versions16.

1 with Tensorflow 1.3

2 Z3 = [[-0.44670227 ... 0.46852064]

3 [-0.17601591 ... 0.5747785 ]]

4 with Tensorflow 1.4+

5 Z3 = [[ 1.44169843 ... 1.36546707]

6 [ 1.40708458 ... 1.26248586]]

5.8 Threats to Validity

External Threat. A source of external threat can be the dataset used to study the bug repair pattern. To alleviate this threat we use a benchmark dataset of DNN bugs prepared by [3].

Internal Threat. An internal threat can be the coding approach used to classify the bug fix patterns.

We use the widely adopted open coding approach to come with a classification scheme to minimize this

16https://stackoverflow.com/questions/49742061 105 threat. Two Ph.D. students independently came up with the classification schemes. Then, these schemes were merged through moderated discussions. The expertise of the raters can be another source of an internal threat. We alleviate this threat by involving raters who have expertise in both the DNN libraries and the bug fix patterns. The raters were also trained on the coding scheme before the labeling. We also use kappa coefficient to measure the inter-rater agreement throughout the labeling process. And the value of kappa coefficient indicates that the labeling was successful with a perfect agreement.

Another threat is the number of posts in Stack Overflow for each library are not the same. To mitigate this threat, we have performed an ANOVA test on the Stack Overflow bug fix patterns. We have found that F (0.0002) < F-critical (2.50) that implies that the distribution of the bug fix in Stack Overflow is not unbalanced.

5.9 Discussion

We have analyzed the bug fix patterns in DNN and have found that there are significantly different new patterns compared to the non-ML bug fix patterns. There are also new challenges in fixing these bugs. In the analyses of RQ1, we have found that major fixes are related to data dimension, layer dimension, network connection, addition of layer, and loss function. To fix such issues, we need to know the structure of the network, how data flowing through the network is modified through various mathematical operations, how the performance evolves during forward and backward propagation due to the use of loss function, accuracy metrics, etc. This presents a number of immediate research challenges related to DNN API design and verification. To apply the data-related fixes, we need to understand the implicit dependencies between the data and model. This problem is akin to the notion of implicit coupling between modules. Studying various existing techniques to address strongly coupled data could be beneficial to fix the data-related problems. To

fix the connections among consecutive layers in a neural network, the DNN model needs to be converted to a suitable common intermediate representation (IR). Then, we need to perform a reachability analysis to

find the portion of the graph disconnected from the rest to fix such connection-related problems. Also, the

fixes related to the addition of layer and change of loss function can be addressed automatically by mining specifications related to such layers and loss function from large codebases [130, 131]. 106

In RQ1, we have also shown that some of the bug fixes have a negative impact on the robustness of

DNN models [132]. Studying such cases further and developing tips for new developers is necessary so that they avoid falling into these traps without this knowledge. In RQ2, we have seen that bug fixes are different for different bug types. We have noticed that fixing API bugs require fixing the specification between

APIs. These fixes can be achieved by validating the compatibility among APIs by adding robust test suites before releasing new versions. In RQ3, we have identified the commonalities among the fix patterns of different libraries. Efforts on repairing bugs in DNNs are certainly needed, and they can focus on these commonalities to cover more ground quickly. In RQ4, we have observed that fixing bugs can lead to new bugs. Our findings identifies some common situations where this happens, e.g., fixing layer dimension has the possibility of adding data bugs. We concluded our analyses by showing that new challenges are present in fixing DNN bugs. Though some of these fixing strategies have been adopted by existing tools, more work on validation and repair is warranted in several SE sub-communities such as program analysis, runtime verification, formal methods, etc. Analysis representation specific to DNNs can be developed to enable repair work. Runtime monitoring framework for DNNs would be useful to prevent errors from occurring and to collect traces for dynamic analyses based repair techniques. Safety critical data science applications of DNNs need these efforts to become more dependable [133].

5.10 Conclusion

The widespread adoption of deep neural networks in software systems has fueled the need for software engineering practices specific to DNNs. Previous work has shown that like traditional software, DNN software is prone to bugs albeit with very different characteristics. It is important to further understand the characteristics of bug fixes to inform strategies for repairing DNN software that has these bugs. How do developers go about fixing these bugs? What challenges should automated repair tools address? To that end, we conducted a comprehensive study to understand how bugs are fixed in the DNN software. Our study has led to several findings. First of all, we find that bug fix patterns in DNN are significantly different from traditional bug fix patterns. Second, our results show that fixing the incompatibility between the data and the DNN alone can be of significant help to developers of DNN software, especially if the developers can 107 be warned about the impact of their bug fix on the robustness of the DNN model. Third, our study shows that a prevalent bug fix pattern is version upgrade. While version upgrade is well-studied in SE research, our bug fix patterns suggest that automated repair tools will need to address at least two unique challenges: invasive, backward incompatible changes and probabilistic behavior change. Fourth, our study shows that the structure of the DNN itself needs to be represented in repair tools because several fix patterns rely on identifying incompatibilities in that structure. For instance, network connection fixes where disconnected layers are identified and connected, or adding missing layers, etc. Fifth, we have found that a significant number of bug fixes introduce new bugs in the code. Finally, we have identified three challenges for fixing bugs: bug localization is very difficult, reuse of the DNN model is hard because of limited insights into its behavior, and keeping up with rapid releases is hard.

This study opens up several avenues for future work. First and perhaps most immediately, a number of bug fix patterns identified by this work can be automated in repair tools. Such tools for bug repairs can help the developers integrating DNN into their software. Second, an abstract representation of the DNN along with the code that uses it can be developed. We saw several bug fix patterns that rely on analyzing such a representation. Third, there is a critical need to improve bug localization for DNN by addressing unique challenges that arise, and by creating DNN-aware bug localization tools. Fourth, there is an urgent need to detect bugs introduced by dimension mismatch and specially changes that have the potential to introduce vulnerabilities in the DNNs. Fifth, urgent work is needed on upgrade tools that encode the semantics of version changes and keep up with the change in the signature and semantics of DNN libraries. This is important to keep pace with rapid development in this area. 108

CHAPTER 6. Amimla: MISUSE DETECTION FOR MACHINE LEARNING APIS

6.1 Introduction

Libraries and frameworks have been developed to facilitate the rapid increase in the popularity of ma- chine learning (ML) technologies. Their functionality via APIs provides the building blocks that abstract away the complex details behind the stages in ML pipelines, such as data preparation, model creation, train- ing, evaluation, tuning, and prediction. Using ML libraries/frameworks eases the tasks for ML researchers and practitioners.

By design, the goal of the libraries is also to make the code using their APIs reliable. How- ever, the abstraction in the libraries’ APIs might also introduce constraints on the components of a single API or among multiple APIs in order to use the libraries correctly. For example, method keras.models.Model.predict() expects the first argument to be a numpy array. In another example, in Keras, if backend.set_image_dim_ordering() is called with argument value of ’th’, which means

Theano is used in the back-end, then the input_shape of any Convolution2D layers constructed later must be (batch, channels, rows, cols) whereas if backend.set_image_dim_ordering() is called with ar- gument value of ’tf’ to use Tensorflow in the back-end then the input_shape must be (rows, cols, channels).

Violating constraints lead to misuses of the APIs and would result in crashes or poor performance [4,

134]. An example of single method misuses is discussed in Stack Overflow question 37891954 where the user passed a list instead of a numpy array as an argument to keras.models.Model.predict().

An example misuse with multiple methods is discussed in Stack Overflow question 41771965 where the input_shape of (channels, rows, cols) was used with Theano in the back-end.

There has been a rich body of work on studies and detection for general API misuses as described in [58]. However, there have not been any such studies or detectors designed for ML APIs. Closest related works are by Zhang et al. [51], Thung et al. [52], and Islam et al. [4, 134]. Zhang et al. [51] have analyzed 109 bugs in software that make use of the Tensorflow library, Thung et al. [52] have analyzed bugs in the code of three ML systems. Islam et al. have analyzed bugs [134] and fixes [4] in code that uses five deep learning libraries Caffe, Keras, Tensorflow, Theano, and Torch. In this chapter, we conducted an empirical study on the issues with using ML libraries discussed in more than 2,000 Stack Overflow posts. The result confirmed the existence of ML API misuses in practice with as much as 13% of the posts for Keras. We also observed several characteristics of ML misuses that are different from general APIs and are not supported by existing API misuse detectors. We observed that ML API usages, especially in deep learning, require specific constraints between layers in the network, e.g., compatibility between the dimensions of the output of a layer and the input of the next layer, which never exhibit in previous studies of misuses for general APIs.

Manually checking this kind of constraint is not always possible because the concrete values of dimensions might only be available at runtime. We also observed that the majority of ML misuses in our study are related to arguments and values of arguments that are not supported by existing approaches.

The observations above motivated us to develop Amimla (Api Misuse In Machine Learning Apis), an automated approach specific to detecting ML API misuses. Amimla builds abstract ML pipeline for libraries and uses them to detect missing stage(s). Amimla extracts abstract deep neural networks (DNNs) from ML programs and performs symbolic analysis on abstract DNNs to detect dimension incompatibility misuses.

Amimla creates a canonical form of APIs to detect type related misuses. Amimla also extracts constraints on arguments of single methods and constraints between multiple methods to detect misuses.

Our empirical evaluation shows that Amimla performs well on the misuse dataset from which we drew the observations as well as on an independent dataset extracted from Github ML projects. In both cases,

Amimla achieved the precision of more than 70% and the recall of more than 80%. We also applied Amimla on the working versions of 26 real-world open source projects and detected 52 not-yet-found ML API- related bugs. We reported them to the respective projects, 3 of which have already been confirmed by the developers as of this writing.

In summary, this work makes the following contributions.

1. We present a classification of the ML API misuses. 110

Table 6.1: Classification of Machine Learning API Misuses.

Data Prep. Modeling Training Evaluation Tuning Prediction Receiver 1|0 Method 2|0 9|3 4|3 1|0

Single Argument 3|0 38|35 13|13 1|0 2|7 0|4 Method Order 2|0 Method Calls 2|1 13|8 8|3 1|1 0|1 Method Call

Multiple Argument 4|4 1|0 1|0 Iterations 1|0 2|1 Conditions 0|1 In each cell, the pair of values respectively indicate the numbers of missuses for Tensorflow and Keras found in our study.

2. We have also created datasets of misuses from Stack Overflow and Github that could be utilized by

other researchers conducting similar studies.

3. We have created good usage patterns for Keras and Tensorflow APIs using trusted applications from

Github.

4. We have developed Amimla a framework to detect the misuses in deep learning applications using

popular libraries Keras and Tensorflow.

5. Finally, we have applied Amimla to 26 real-world open source projects, and detected and reported 52

not-yet-found ML API-related bugs.

The rest of this chapter is organized as follows. Next, we present our empirical study that has used Stack

Overflow posts to understand the characteristics of ML API misuses. Section 6.3 presents our API misuse detection approach. Section 6.4 describes the evaluation of our approach, Section ?? presents related ideas, and Section 6.5 concludes.

6.2 Motivation: Study of ML API Misuses

6.2.1 Data Collection

We collected data from Stack Overflow posts to study ML API misuses. Stack Overflow is a well-known

Q&A site for developers discussing programming problems. The data collection process consists of two 111 steps: 1) we used several criteria, discussed below, to select candidate posts discussing the use of ML APIs and 2) we manually read these candidates to identify the ones about API misuses. To focus on the high- quality posts and keep the manual effort manageable, we selected the posts whose scores were greater than

5. The score of a post is a commonly used quality indicator [135]. It is computed as the difference between the number of its upvotes and the number of its downvotes.

In the first step, we focus on Tensorflow and Keras, which are the two most discussed ML libraries on

Stack Overflow. When posts are about specific libraries, they are more likely to talk about APIs. Using these criteria, we selected 1,941 and 641 posts about Tensorflow and Keras, respectively. We did that by searching for posts that were tagged with Tensorflow or Keras. We further filtered the posts that did not contain any source code because posts about API misuses usually contain code snippets. After this step, in total, we selected 1,285 posts for Tensorflow and 430 posts for Keras. By post we mean the questions with unique id.

However, for manual analysis we also study the answer for the questions,

In the second step, two Ph.D. students with expertise in machine learning and familiar with these libraries manually reviewed the candidates. For each post, they read the question and all answers focusing on the best-accepted one. If the best-accepted answer was to fix the usages of the ML API(s) in the question, they considered that post as talking about ML API misuse. Cohen’s kappa coefficient [109] of agreement between the reviewers is 92% indicating a perfect agreement. After this step, we selected 109 and 85 misuse posts for Tensorflow and Keras, respectively, where both reviewers had an agreement. The reviewers were also asked to classify the misuses using our ML API misuse classification which is described next.

6.2.2 ML API Misuse Classification

Our classification takes two orthogonal views on the misuses. First, we look at the elements of the APIs or the roles of the API elements involved in the misuses. This point of view is similar to the criteria for API misuse classification in MuC [58]. For example, we checked if a misuse was on the value of a single method call or a violation of the order between two methods, or there was a missing condition check or control structure in the API usage. Second, we also look at which stage in the ML pipeline a misuse happens. This 112 is important because, in ML libraries, APIs are specifically designed to provide the functionality to support different stages in ML pipeline, e.g., data preparation, modeling, training, evaluation, tuning, and prediction.

Table 6.1 shows our classification for ML API misuses. Rows are for API elements and columns are for

ML stages.

6.2.2.1 Machine Learning Stages

A ML pipeline is typically divided into 6 stages: data preparation, modeling, training, evaluation, tuning, and prediction. Data preparation stage converts raw input data into clean input data suitable for fitting to

ML models. Modeling stage includes selecting a suitable model and constructing the model using the APIs.

Training stage takes the model and adjusts its weights using the input-output examples known as training data. Evaluation is the stage where the trained models are evaluated against the test data. Tuning is the stage where parameters such as learning rate is tuned to improve the performance of the trained models.

Prediction is the final stage in the pipeline where the trained and tuned models are use to predict new data.

6.2.2.2 API elements

Since all ML API usages involve method calls, the top level of our classification contains method calls, and conditions and iterations controlling method calls.

Method calls are the basic blocks of APIs. A method call in Python, e.g., r.m(arg=val), consists of an optional receiver object on which the method is called, e.g., r in this call, a method name, e.g., m, and a list of arguments which could be empty. An argument is a pair of name=value where the name, which could be omitted, must match the name of a formal parameter of the method. A misuse could happen to some component(s) of a single method or happen with multiple methods if it violates the constraint between them.

Misuses in Iterations category are for problems with the looping control structure in ML programs.

A misuse could occur when a usage is executed only once while it should be in an iteration, i.e., missing iteration, or when a usage is executed more than once while it should be run only once, i.e., redundant iteration. 113

Misuses in Conditions involve the preconditions when calling API methods. A missing conditions misuse occurs when a usage does not follow the API usage constraints to ensure certain mandated conditions.

A redundant conditions misuse occurs when there is a condition which restricts a compulsory part of a usage.

6.2.3 Observations

We have the following observations from the study.

O1. There are a significant number of questions about ML API misuses. 109 and 85 posts are about

API misuses among 1,941 (6%) and 641 (13%) posts about Tensorflow and Keras, respectively. These high rates show that API misuses also exist in ML API usages.

Developers seem to struggle the most with using APIs in model selection and construction with 66 posts

(61%) for Tensorflow and 50 posts (59%) for Keras, and also in training models with 29 posts (27%) for

Tensorflow and 21 posts (25%) for Keras.

O2. Most ML API misuses cannot be detected by existing approaches. From Table 6.1, we can see that the majority of misuses are related to argument and argument values. Among 109 Tensorflow and 85

Keras misuses, 57 (52%) and 59 (70%) are related to argument of single method calls, and 38 (35%) and 49

(58%) are related to argument values for Tensorflow and Keras, respectively. To the best of our knowledge, none of the existing detectors can handle misuses in types and values of API arguments. The state of the art classification in MuC [58] also does not have argument types and values in its taxonomy.

O3. Network layer dependent misuse of ML APIs. Some API misuses occurred when the constraints on the properties of consecutive network layers were not respected. The following code snippet shows such an example misuse extracted from Stack Overflow question 34311586

1 model= Sequential()

2 act= PReLU(alpha_initializer=’zero’, weights=None)

3 model.add(Dense(64, input_shape = (128,), ...))

4 ...

5 model.add(Reshape([32,1]))

6 ...

7 model.add(Dense(32, ...))

8 model.add(Reshape([32,1])) 114

Line5 causes the program to throw exception when adding a Reshape layer to the network after a Dense layer. The reason is that the usage did not respect the constraint between Reshape API and the previous

Dense API that requires the product of the numbers in the list inside Reshape, 32 × 1 in this case, has to be equal to the value passed to the first argument, arg0 = 64, of the previous Dense layer.

These observations motivate us to develop a new approach to efficiently detect API misuses of ML libraries. The next section will describe our proposed approach in detail.

6.2.4 Threats to Validity

Internal Validity One possible threat would be the classification of the API misuses in our study. To alleviate this threat, two Ph.D. students were asked to label the misuses individually. We used Cohen’s Kappa statistics [109] to measure the rate of agreement between the raters. Another possible internal threat would be the classification scheme used in Table 6.1. To alleviate the threat, we have adapted the classification from

[105] and added some categories when needed following the well established open coding strategy [85–87].

The ML expertise of the raters can affect the manual labeling. To mitigate this threat we selected raters who have expertise in ML as well as using the libraries in the study. The raters also studied the answers and comments in posts to improve their insights.

External Validity In Stack Overflow threat to validity can be low-quality posts [92], and chronological order of posts. To eliminate the quality threat we studied only the posts that have the tag of the relevant library and then only kept the posts that have score ≥ 5 as suggested in [135]. This balanced both the quality and labeling efforts.

Chronological order of the posts can introduce threat as some older posts may be resolved in a later version of the libraries. To alleviate this threat the raters carefully observed the posts and if some post contains APIs that are outdated were discarded.

An external threat can be the expertise of the developers asking the questions. The observations gained by studying only the questions by new developers may not generalize. To understand this threat, we mea- sured reputation, a metric used by Stack Overflow to estimate expertise. Stack Overflow users above 50 are 115 considered reputable and are allowed to comment on posts. The mean reputation of developers asking the questions about Keras and TensorFlow were 1375 and 1399 respectively indicating that the expertise of the developers asking the questions is not a threat to our study.

6.3 Amimla: ML API Misuse Detection

We now describe Amimla, our automated technique for ML API misuse detection, that builds on the observations described in Section 6.2. The key insights in our technical approach are following:

An abstract representation of DNN. A number of ML APIs provide support for deep neural networks

(DNN). While treating methods in those APIs as black box is certainly possible, we would loose much of the semantics of the DNNs. On the other hand, if we expanded our analysis to include the body of the API methods it would quickly become non-scalable because of the complex numerical operations implemented within these API methods. To that end, our key insight was to create an abstract representation of the neural network that we call abstract neural network (ANN). ANN encodes relevant details of the DNNs as they are being constructed by the modeling APIs, but at the same time helps us avoid the scalability problems.

We use ANNs to detect dimensional incompatibility mismatches between consecutive layers that may arise during the construction of the network. Since the dimension values might not always be concrete, e.g., involving the input parameters of the APIs, we perform symbolic analysis over ANNs during detection.

An abstract representation of ML pipeline. An important step in API misuse detection is to extract sequences of APIs from code. These sequences are then utilized to detect misuses that are violations of temporal patterns, e.g., [70]. In addition to these misuses, in ML API usage we have found that the ordering between categories of APIs needs to be checked, e.g., training-related APIs must be called before prediction-related APIs. To solve this problem, our insight was to create an abstract representation of ML pipelines, infer these representations from code, and then use the inferred representations to detect misuses where the temporal ordering of stages does not conform to the expected order. An abstract pipeline is a sequence of APIs performing the corresponding stages in an ML pipeline. It does not contain any computation in those stages.

Utilizing a canonical form. Another challenge in adapting prior representations of API sequences to ML APIs arises from the nature of the programming languages involved. While previous work has relied on a unique API signature, because the order of actual arguments in a method call is fixed in languages like Java, common languages used for ML such as Python do not impose such constraints. This leads to a combinatorial explosion in the number of legal API sequences. To solve this problem, our insight was to convert each usage into a canonical form that reduces the number of legal API sequences.

Leveraging documentation and good usage. We also make use of good usages and documentation to infer constraints on API usage, which are then utilized to detect misuses.

Figure 6.1 shows an overview of our approach, which can be summarized in the following steps:

1. First, we create the API canonical form of ML API calls. This helps to detect type mismatches and incorrect argument keywords of the APIs. We collect good-usage API canonical forms and store them in a database to detect misuses in new programs by checking against this oracle.

2. Then, we create the fully qualified method names of the APIs from the documentation. We store these in a database to check the fully qualified method names of APIs in target programs.

3. Then, we study top-rated Stack Overflow posts to extract single-method parameter and multiple-method parameter constraints and store them in separate databases.

4. Then, we take a target program and create the API sequence with the canonical form of the APIs, the abstract NN and abstract pipeline, and the fully qualified names of the APIs.

5. Finally, we detect misuses using these intermediate representations and the pre-built databases.

6.3.1 Abstract Neural Network (ANN)

From a deep learning program, we extract the ANN as a graph. Each ANN node represents a layer API and each ANN edge represents the order between layers. An ANN node has two properties representing the corresponding layer's state: the values of its arguments and its output dimension. To create the ANN of a program, we first parse the source code to create the abstract syntax tree (AST).

Figure 6.1: Overview of our detection approach and the Amimla tool. The tool has three components: constraint inference (parameter and multi-method constraints collected from Stack Overflow posts and code, API canonical forms from client source files, and fully qualified names from the API documentation), representation creation for the target ML program (API sequence, fully qualified names, abstract pipeline, and abstract neural network), and misuse detection (single- and multi-method call misuse detection, incorrect API call detection, and dimensional compatibility detection with symbolic analysis).

We then visit the AST to create ANN nodes and connect edges between nodes in the order of visiting. We rely on the semantics of the layer APIs of the libraries and of the operations that add layers to the DNN to extract the ANN as a graph, as shown in Figure 6.3 for the program in Figure 6.2. Each ANN node for a layer is annotated with the API name so that we can check whether an expected dimensional constraint is violated between the APIs. We extract the values of the parameters of the APIs from the AST. We compute the output dimension of each layer using symbolic analysis and the semantic formula of the APIs. As an example, let us see how the output dimension of layer 0 in Figure 6.3, which uses the Conv2D API, is computed.

From the argument values used in this layer, we see that kernel_size=(3,3), strides=(1,1), arg0=32 (arg0 represents the number of filters used in the convolution operation), and input_shape=(128,128,3).

1 img_width, img_height = 128, 128
2 model = Sequential()
3 model.add(Conv2D(32, kernel_size=(3,3), strides=(1,1), input_shape=(img_width, img_height, 3), ...))
5 model.add(Reshape([126*126, 32, 1]))
6 model.add(Activation('relu'))
7 model.add(MaxPooling2D(pool_size=(2,2), name="pool1"))
8 model.add(Flatten())
9 model.add(Dense(10))
10 model.add(Activation('sigmoid'))

Figure 6.2: Code excerpt from Stack Overflow.

Layer 0 Conv2D: arguments { arg0: 32, kernel_size: (3,3), strides: (1,1), input_shape: (128,128,3) }, dimension: [126, 126, 32]
Layer 1 Reshape: arguments { arg0: [126*126, 32, 1] }, dimension: [15876, 32, 1]
Layer 2 Activation: arguments { arg0: "relu" }, dimension: [15876, 32, 1]
Layer 3 MaxPooling2D: arguments { pool_size: (2,2) }, dimension: [7938, 16, 1]
Layer 4 Flatten: arguments { }, dimension: [127008]
Layer 5 Dense: arguments { arg0: 10 }, dimension: [10]
Layer 6 Activation: arguments { arg0: "sigmoid" }, dimension: [10]

Figure 6.3: Abstract neural network built from the source code in Figure 6.2.

So we evaluate the formula through this layer to compute the output dimension. The formula to compute the dimension produced by a convolution is the following:

D′ = (D − K + 2 × P)/S + 1    (6.1)

where D is the dimension passed to the convolution API, D′ is the dimension produced by the Conv2D API, K is the kernel size, P is the padding size, and S is the stride length. As a result, the output dimension due to this Conv2D will be

[(128 − 3 + 2 × 0)/1 + 1, (128 − 3 + 2 × 0)/1 + 1, arg0] = [126, 126, 32]    (6.2)
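To make the computation concrete, the following is a minimal sketch, not Amimla's actual implementation, of evaluating Eq. (6.1) for the Conv2D layer above; the function name and the layer encoding are illustrative assumptions.

# Minimal sketch (illustrative): evaluating Eq. (6.1) for a Conv2D layer.
def conv2d_output_dim(input_dim, filters, kernel_size, strides=(1, 1), padding=(0, 0)):
    # Return [H', W', filters] for an input of shape [H, W, C].
    h, w, _ = input_dim
    h_out = (h - kernel_size[0] + 2 * padding[0]) // strides[0] + 1
    w_out = (w - kernel_size[1] + 2 * padding[1]) // strides[1] + 1
    return [h_out, w_out, filters]

# Layer 0 of Figure 6.3: Conv2D(32, kernel_size=(3,3), strides=(1,1), input_shape=(128,128,3))
print(conv2d_output_dim([128, 128, 3], filters=32, kernel_size=(3, 3)))  # [126, 126, 32]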

6.3.2 Abstract ML Pipeline

The ML pipeline contains the APIs used to create the machine learning stages. We get the information about the APIs used to construct the different stages from the API documentation. Then, we extract the abstract ML pipeline by visiting the abstract syntax tree (AST) of the program. The abstract ML pipeline is an API sequence. The sequence does not contain any information about the actual computation performed during these stages. We use this API sequence of the ML pipeline to identify whether any stages are missing from the pipeline. For example, the absence of the fit API in the following program indicates that the training stage is missing, which may cause the whole machine learning process to fail.

1 ...

2 model= Sequential()

3 model.add(Dense(32,"relu"))

4 model.add(Dense(10,"sigmoid"))

5 model.predict(X)

The program fails to predict successfully because the model has not been trained. The program does not show any exception or error that could help the developer identify the mistake. The developer should either load a pre-trained model or fit the model with the data before being able to make predictions successfully.

We create an ideal pipeline for each of the libraries. To create the ideal pipeline with the required and optional stages of each library, we collect the required/optional relations between the APIs used to create stages by studying Stack Overflow answers and discussions, extracting pipelines from Github code, and consulting the documentation. For example, per the Keras documentation, it is required to first create a model, then compile the model, then fit the model, and then predict with the model. However, from Stack Overflow we have found that instead of using fit we can load a pre-trained model using the load API. We have also collected pipelines as API call sequences while visiting the Github code. We combine these three sources of information and manually verify the final pipeline. An example of the abstract pipeline for a Keras model is represented in the following way:

1 [split, normalize, replace, reshape, shuffle, ...], create, compile, {load, fit}, [evaluate], [scan, ...], predict

In the above code, ... means that some APIs in that part are omitted. split, normalize, replace, reshape, shuffle, create, compile, load, fit, evaluate, scan, and predict are different APIs used to build the stages of an ML pipeline in Keras. For example, the [split, normalize, replace, reshape, shuffle] APIs are used in the data preparation stage, create is used in the model creation stage, compile and fit together are used in the training stage, evaluate is used in the evaluation stage, scan is used in the tuning stage, and predict is used in the prediction stage. Multiple options inside {...} indicate that any one of the options should be present in the pipeline. The options inside [...] indicate that zero or more of these may be present in the program. An ML stage is considered required if it is found in all the error-free complete ML programs we mined. We mark a stage as optional if we find it missing in at least one program that still runs correctly. By running correctly, we mean that model training can still be performed successfully without any error at training time. We store the ideal pipeline for each library for the detection of misuses in new programs.
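As an illustration of how such an ideal pipeline can be checked against the stage sequence extracted from a program, below is a minimal sketch under a simplified encoding; the IDEAL_PIPELINE structure and the check_pipeline helper are hypothetical names, not Amimla's exact data structures.

# Minimal sketch (hypothetical encoding): checking a program's stage sequence
# against an ideal pipeline with required stages, alternatives {...}, and optional [...].
IDEAL_PIPELINE = [
    ("optional", {"split", "normalize", "replace", "reshape", "shuffle"}),
    ("required", {"create"}),
    ("required", {"compile"}),
    ("required", {"load", "fit"}),   # the {...} case: any one of these must appear
    ("optional", {"evaluate"}),
    ("optional", {"scan"}),
    ("required", {"predict"}),
]

def check_pipeline(api_sequence):
    # Return misuse messages for missing required stages or violated stage order.
    misuses, last_pos = [], -1
    for kind, apis in IDEAL_PIPELINE:
        positions = [i for i, api in enumerate(api_sequence) if api in apis]
        if not positions:
            if kind == "required":
                misuses.append("missing required stage: one of " + str(sorted(apis)))
            continue
        if min(positions) < last_pos:
            misuses.append("stage order violated at " + str(sorted(apis)))
        last_pos = max(positions)
    return misuses

# The motivating example above: a model is created but never trained before predict.
print(check_pipeline(["create", "predict"]))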

6.3.3 API Canonical Form

The API canonical form is a representation of APIs as objects of key-value pairs, where keys are argument names and values are argument data types. We can directly load these objects as Python dictionaries to perform a constant-time check.

For example, the following call to the Dense API

Dense(32, input_dim=10, activation='relu')

has the following API canonical form

{'arg0': 'Integer', 'input_dim': 'Integer', 'activation': 'String'}

The API canonical form is important for analyzing ML APIs written in Python, as keyword arguments in Python can be passed in any order when calling the APIs.

For example, the above Dense API can also be called in the following way by swapping the order of activation and input_dim keywords.

Dense(32, activation="relu", input_dim=10)

Using the API canonical form, all permutations of arguments with the same semantics, whose number is normally large due to the large number of arguments in ML APIs, are represented by the same form. The state-of-the-art sequence representation in [70] would consider different permutations as different usages, leading to incorrectly identifying some good usages as misuses.

Type resolution is an important challenge for creating the API canonical form. If in some cases we cannot extract the type, we keep the type as unknown. We also use a def-use analysis for best-effort type resolution. For example, for the following call of the Dense API

Dense(units = 128, input_shape = (128, 32), name="input")

we extract the API canonical form as

1 {'units': 'Integer', 'input_shape': {'arg0': 'Integer', 'arg1': 'Integer'}, 'name': 'String'}

Now let us suppose we see a call of the Dense API like the following later in the control flow:

1 Dense(units= getUnits(),input_shape = (128,32),name="input")

This call alone does not provide sufficient information to extract the type of units, but from a previous usage we know that units is of Integer type. So we replace the unknown type here with the previously known type.

In Python, if a keyword argument is used in the call, then we use that keyword as the key; otherwise, we create the name argi, where i is the index of the argument in the formal parameter list of the API.
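A minimal sketch of this extraction using Python's standard ast module is shown below; the helper names and the type mapping are illustrative assumptions rather than Amimla's implementation.

# Minimal sketch (illustrative): deriving an API canonical form from a call expression.
import ast

def type_name(node):
    # Best-effort mapping of an argument's AST node to a canonical type name.
    if isinstance(node, ast.Constant):
        return {int: "Integer", float: "Float", str: "String", bool: "Boolean"}.get(
            type(node.value), "Unknown")
    if isinstance(node, ast.Tuple):
        return {"arg" + str(i): type_name(e) for i, e in enumerate(node.elts)}
    return "Unknown"  # e.g., a call such as getUnits(); resolved later via def-use information

def canonical_form(call):
    # Positional arguments become arg0, arg1, ...; keyword arguments keep their keyword.
    form = {"arg" + str(i): type_name(a) for i, a in enumerate(call.args)}
    form.update({kw.arg: type_name(kw.value) for kw in call.keywords})
    return form

call = ast.parse("Dense(32, input_dim=10, activation='relu')", mode="eval").body
print(canonical_form(call))
# {'arg0': 'Integer', 'input_dim': 'Integer', 'activation': 'String'}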

6.3.4 Storing Good-Usage API Canonical Forms and the Abstract ML Pipeline

To collect the API canonical forms and the ideal abstract ML pipeline, we used the top 200 repositories for Keras and Tensorflow based on star count. We used the API names of Keras and Tensorflow as keywords to find repositories that contain Keras and Tensorflow programs. We also manually verified a random sample of 50 of these repositories to confirm that they contain the target ML programs. Then we extract the canonical forms of the APIs as discussed in Section 6.3.3 and store them for detection. We also store the ideal abstract ML pipeline as discussed in Section 6.3.2.

6.3.5 Collecting Constraints from Stack Overflow

We study top-voted Stack Overflow posts about Keras and Tensorflow and manually collect constraints on APIs, both for single API call parameters and for multiple API calls. We find as many constraints as we can by manually studying the Stack Overflow posts, answers, and discussions, taking inspiration from [4, 51, 134, 136].

6.3.5.1 Single API Constraints

A single method call parameter constraint for an API is represented as a list. Each element of the list contains a set of relational expressions that have to hold together. Each relational expression specifies a condition on an argument with three components: the argument name arg, the relational operator op, and a value val on the right-hand side of the expression. For example, the Dense API has the constraint {('arg': 'arg0', 'op': '==', 'val': 1), ('arg': 'activation', 'op': '!=', 'val': 'softmax')}. This constraint means that if the first argument is 1, then the activation function should not be softmax. The call Dense(1, activation='softmax') violates this constraint, which indicates a misuse of the Dense API. The single API constraints are collected by a thorough study of Stack Overflow posts and answers. We collected 100+ constraints and stored them using the above representation. The violation of these constraints can lead to either performance pitfalls or runtime exceptions. A unique aspect of ML API misuse detection is that the violation of some constraints may not lead to a runtime error or exception but can cause the training process to not converge. There are also constraints whose violation throws an exception at runtime. For example, the following call

Dropout(-1) throws an exception.

The constraint for the call of Dropout is {('arg': 'arg0', 'op': '>', 'val': 0), ('arg': 'arg0', 'op': '<', 'val': 1)}. So the argument value should be greater than 0: if the value is 0, the Dropout layer has no effect. The value should also be less than 1: if it is 1, all the edges will be dropped out, leaving no data flow to the following layers.
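A minimal sketch of checking such single-API parameter constraints, using the same {'arg', 'op', 'val'} representation, is shown below; the helper name and the constraint list are illustrative.

# Minimal sketch (illustrative): checking single-API parameter constraints.
import operator

OPS = {'==': operator.eq, '!=': operator.ne, '>': operator.gt, '<': operator.lt}

DROPOUT_CONSTRAINTS = [
    {'arg': 'arg0', 'op': '>', 'val': 0},
    {'arg': 'arg0', 'op': '<', 'val': 1},
]

def check_single_api(arg_values, constraints):
    # Return the constraints violated by the concrete argument values of one call.
    violations = []
    for c in constraints:
        actual = arg_values.get(c['arg'])
        if actual is not None and not OPS[c['op']](actual, c['val']):
            violations.append(c)
    return violations

# Dropout(-1): arg0 = -1 violates the '> 0' constraint.
print(check_single_api({'arg0': -1}, DROPOUT_CONSTRAINTS))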

6.3.5.2 Multi-API Constraints

We collect the multiple-method constraints from Stack Overflow and the documentation of the APIs. Multiple-method constraints specify requirements on the temporal order between APIs in a usage. A requirement could be that a certain API must be called if some other API has been called, that certain parameters must be passed to a later API call if some API has been called, or that a certain API is not allowed if some other API has been called. We represent a multiple-method constraint as a tuple <action, call_sequence>. The action can be require or disallow. If the disallow action is used, then the API call order in the call_sequence is not allowed and is reported as a misuse. If the require action is used, then the API call order in the call_sequence must be enforced.

To illustrate, consider the following misuse of the run API, which attempts to use a closed session:

sess= tf.Session()

sess.close()

sess.run(...)

The code snippet above raises an exception at runtime. For the run API, we have the following constraints:

1 <"require","tf.Session()","Session.run()","Session.close()">

2 <"disallow","Session.close()","Session.run()">

Both of these constraints are violated by the run API call in the above example. Violations of multi-API constraints can also cause performance pitfalls. For example, the constraint <"require", "shuffle()", "split()"> asserts that before splitting the data into training and testing sets we need to shuffle the data. Violating this constraint will not lead to any exception but can lead to bad performance at prediction time.

These constraints are then stored for the detection of violations in new programs.
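To illustrate how such require/disallow constraints can be checked over an extracted API call sequence, here is a minimal sketch; the tuple representation and helper are simplifications of the stored constraints, not Amimla's exact code.

# Minimal sketch (illustrative): checking require/disallow temporal constraints.
def violates(constraint, call_sequence):
    # constraint = (action, api_1, api_2, ...): the calls must appear in that relative order.
    action, *apis = constraint
    in_order, start = True, 0
    for api in apis:
        try:
            start = call_sequence.index(api, start) + 1
        except ValueError:
            in_order = False
            break
    if action == "require":
        return not in_order   # the required order is missing or broken
    if action == "disallow":
        return in_order       # the forbidden order actually occurs
    return False

seq = ["tf.Session()", "Session.close()", "Session.run()"]
print(violates(("require", "tf.Session()", "Session.run()", "Session.close()"), seq))  # True
print(violates(("disallow", "Session.close()", "Session.run()"), seq))                 # True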

6.3.6 Storing Fully Qualified Method Name of APIs

We collect the fully qualified names of the API methods from the API documentation. We store them in a database so that we can check new ML programs against this oracle. From a new ML program, we extract the fully qualified method name using the import statements. In case of ambiguity while extracting the fully qualified name, we consider all the ambiguous fully qualified names as valid candidates. If none of those names appears in the oracle, then we mark the call as a misuse.

Figure 6.4: Bug fix due to a wrong fully qualified method name. Red indicates deleted code and green indicates added code.

from keras.layers import *

from keras.model import *
model = Sequential()

model.add(Dense(...))

For example, in the program above, when we extract the fully qualified method name of the Dense API, both keras.layers.Dense and keras.model.Dense are candidate fully qualified names. So we keep both of them and check whether either appears in the oracle. Keeping this information helps us detect a number of misuses that we found on Stack Overflow and Github.

For example, the code snippet from Github in Figure 6.4 contains a misuse due to the use of a wrong fully qualified method name.
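A minimal sketch of this candidate-name resolution is shown below; the oracle contents and helper names are illustrative assumptions.

# Minimal sketch (illustrative): resolving candidate fully qualified names from
# wildcard imports and checking them against the oracle built from documentation.
ORACLE = {"keras.layers.Dense", "keras.models.Sequential"}  # illustrative subset

def candidate_fqns(api_name, wildcard_modules):
    return {module + "." + api_name for module in wildcard_modules}

def is_incorrect_call(api_name, wildcard_modules, oracle=ORACLE):
    # Misuse if none of the candidate fully qualified names appears in the oracle.
    return not (candidate_fqns(api_name, wildcard_modules) & oracle)

modules = ["keras.layers", "keras.model"]           # from the wildcard imports above
print(is_incorrect_call("Dense", modules))          # False: keras.layers.Dense is in the oracle
print(is_incorrect_call("to_categorical", modules)) # True: no candidate is in the oracle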

6.3.7 Detection

In the detection step, we take inputs from our databases, which contain the canonical forms, the constraints, and the fully qualified method names generated from the documentation. We also take the intermediate representations created from the target ML program in which we want to detect misuses.

The high-level detection algorithm is shown in Algorithm 1. The algorithm takes the new code and all the databases DB as input. M contains all the misuses returned for the program, N represents the abstract neural network (ANN), P represents the abstract ML pipeline, and S represents the API sequence.

Algorithm 1 Misuse Detection
1: procedure DETECT(code, DB)
2:   M ← M ∪ getSingleMethodMisuse(code, DB)
3:   N ← getANN(code, DB)
4:   P ← getPipeline(code, DB)
5:   S ← getAPISequence(code, DB)
6:   M ← M ∪ getMultiMethodMisuse(N, P, S, DB)
7:   M ← M ∪ getIncorrectCallMisuse(code, DB)
8:   M ← M ∪ symbolicAnalysis(N)
     return M
9: end procedure
10: procedure GETANN(code, DB)
11:   G ← get AST of code
12:   S ← semantics of the APIs from DB
13:   N ← empty ANN
14:   visit AST G
15:   for each node A in G do
16:     if A is a layer API then
17:       V ← create node by extracting values/symbols for each keyword and argument, computing the output dimension as a value/symbol using the semantics of A from S
18:       N ← N ∪ edge(N, V)
19:     end if
20:   end for
      return N
21: end procedure
22: procedure GETSINGLEMETHODMISUSE(code, DB)
23:   Ms ← ∅
24:   C ← single method constraints from database
25:   APIs ← API canonical forms from code
26:   V ← values of API arguments from code
27:   for each Ac ∈ APIs do
28:     m ← checkCanonicalForm(Ac, DB)
29:     if m != Null then
30:       Ms.add(m)
31:     end if
32:     m ← checkParameterConstraint(Ac, C, V)
33:     if m != Null then
34:       Ms.add(m)
35:     end if
36:   end for
      return Ms
37: end procedure
38: procedure GETINCORRECTCALLMISUSE(code, DB)
39:   Mf ← ∅
40:   Fc ← fully qualified API method names from code
41:   for each fqn ∈ Fc do
42:     if fqn is not in DB then
43:       Mf ← misuse(fqn)
44:     end if
45:   end for
      return Mf
46: end procedure
47: procedure SYMBOLICANALYSIS(N)
48:   Mn ← symbolic analysis among the layers in N
      return Mn
49: end procedure

6.3.7.1 Single Method Call API Misuse detection

At line 2 of Algorithm 1, we detect all the single method call API misuses. This step uses the API canonical form database, the single-method call parameter constraint database, the canonical forms of the APIs generated from the code, and the values of the parameters of the APIs generated from the code. Using the canonical form database, we check whether the pattern of an API call violates all the accepted canonical forms. We iterate over all the canonical forms for an API and try to match each key and value. If we do not find any pattern that matches the new call pattern, we detect a misuse. For example, consider the following code snippet:

model= Sequential()

model.add(Dense("relu"))

The canonical form created from the snippet above is {"arg0": "String"}. Our detector detects a misuse in the snippet, as it violates all the canonical forms we mined for the Dense API: every mined canonical form for Dense contains either arg0 or units as an Integer.

We check for violations of good-usage canonical forms at line 28. For an API call Ac in the new code, if there is no matching canonical form in the DB we return a misuse message m; otherwise, we return null. Then, we add the misuse to our result at line 30. At line 32, we find the misuses that violate single method call constraints. In this case, we utilize the single API constraints collected from Stack Overflow. First, we collect the values of the different arguments used in an API Ac by visiting the AST of the program. Then, we get all the collected constraints for Ac from DB. After that, we iteratively verify whether any of the constraints are violated. If any constraint is violated, we report a constraint-violation misuse. The pseudocode of the process is shown in Algorithm 2.

Then, at line 34 of Algorithm 1, we add the misuses returned at line 32.
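The canonical-form check itself can be sketched as follows; the subset-matching rule and the mined forms shown are simplifying assumptions for illustration.

# Minimal sketch (illustrative): matching a call's canonical form against mined
# good-usage canonical forms for the same API.
MINED_DENSE_FORMS = [
    {'arg0': 'Integer'},
    {'arg0': 'Integer', 'activation': 'String'},
    {'units': 'Integer', 'input_shape': {'arg0': 'Integer', 'arg1': 'Integer'}, 'name': 'String'},
]

def matches(call_form, good_form):
    # Every key used by the call must exist in the good form with the same type.
    return all(good_form.get(k) == t for k, t in call_form.items())

def check_canonical_form(call_form, mined_forms):
    if any(matches(call_form, g) for g in mined_forms):
        return None  # at least one mined usage accepts this call pattern
    return "no mined canonical form matches " + str(call_form)

# Dense("relu") from the snippet above yields {'arg0': 'String'}: reported as a misuse.
print(check_canonical_form({'arg0': 'String'}, MINED_DENSE_FORMS))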

6.3.7.2 Multiple-method call misuse detection

At lines 3, 4, and 5 of Algorithm 1, we get the ANN, the abstract ML pipeline, and the API sequence, respectively. At line 6, we use these artifacts to detect multiple-method call misuses. We get the multi-API constraints, stored in the form of <action, call_sequence>, for the APIs in the code. We then iterate over each of the multi-API constraints and check whether any constraint is violated by the new usage. We report a misuse for each violation, as described in lines 4–8 of Algorithm 3.

Algorithm 2 Detect Single API constraint violation

1: procedure CHECKPARAMETERCONSTRAINT(Ac, DB)
2:   m ← []
3:   Cs ← constraints for Ac from DB
4:   CAc ← get argument values in Ac
5:   for each c ∈ Cs do
6:     if CAc violates c then
7:       Vc ← violation message
8:       m.add(Vc)
9:     end if
10:   end for
11:   return m
12: end procedure

Algorithm 3 Detect Multi API constraint violation
1: procedure GETMULTIMETHODMISUSE(N, P, S, DB)
2:   M ← {}
3:   C ← multi-method constraints from DB for each API in S
4:   for each constraint c ∈ C do
5:     if S violates c or P violates c or N violates c then
6:       M ← M ∪ m
7:     end if
8:   end for
     return M
9: end procedure

6.3.7.3 Detecting Missing ML Stage

To detect whether there are any missing stages in the ML program, we get the pipeline P as the API call sequence of stages from the program code at line 4 of Algorithm 1. We describe the process of detecting missing-ML-stage misuses in Algorithm 4. At line 3, we get the ideal pipeline from DB and detect the misuses in lines 4–14. We first detect misuses with a missing required stage as mc. Then we detect misuses with missing optional stages as mp; these misuses carry a risk of performance pitfalls. Then

we detect the misuses where the order of the stages is violated as mo. We add a misuse with a relevant message for each of these cases.

Algorithm 4 Detect missing ML stage
1: procedure DETECTMISSINGSTAGE(P, DB)
2:   M ← {}
3:   Pg ← get ideal pipeline from DB
4:   for each stage s1, s2 ∈ Pg do
5:     if any of s1, s2 is required and not present in P then
6:       M ← M ∪ mc
7:     end if
8:     if any of s1, s2 is optional and not present in P then
9:       M ← M ∪ mp
10:     end if
11:     if the relative order of s1, s2 is violated in P then
12:       M ← M ∪ mo
13:     end if
14:   end for
      return M
15: end procedure

For each violation we add a misuse and return the messages.

6.3.7.4 Detect Incorrect API call misuses

At line 7, we detect the incorrect API call misuses. We check the fully qualified method names generated from the target code against the fully qualified name database to detect misuses. For each API, we check whether it is attributed to a wrong fully qualified method name. If the API call's fully qualified name is absent from our DB, we report a misuse. These misuses are found very often in both Stack Overflow and Github for Keras and Tensorflow.

6.3.7.5 Dimension mismatch misuse detection

At line 8, we use the ANN to check whether the API usage causes a dimension mismatch. First, we get the ANN at line 3. In each node of the ANN, we have the dimension information for that layer. The ML APIs transform the dimensions; we get this semantic information about the transformations from the API documentation and compute the ANN.

Table 6.2: Detection accuracy.

Dataset         Misuses   Detect            TP                Prec.             Recall            F1
                          Proposed  BL      Proposed  BL      Proposed  BL      Proposed  BL      Proposed  BL
Tensorflow SO   119       127       113     97        61      77%       54%     82%       52%     79%       53%
Keras SO        181       169       148     156       105     92%       71%     85%       57%     89%       63%
Tensorflow GH   101       117       87      84        59      72%       68%     83%       59%     77%       63%
Keras GH        354       385       289     301       230     78%       80%     85%       65%     82%       72%

Table 6.3: Detection results by ML pipeline stages.

Stage        TF SO            Keras SO         TF GH            Keras GH         Overall
             Total  Detected  Total  Detected  Total  Detected  Total  Detected  Total  Detected
Data Prep.   0      0         4      2         16     14        22     18        42     34
Modeling     72     62        125    115       64     56        294    256       555    489
Training     34     24        40     32        17     12        28     21        119    89
Evaluation   10     10        4      2         0      0         0      0         20     12
Tuning       3      1         3      2         4      2         10     6         20     11
Prediction   0      0         5      3         0      0         0      0         5      3

If concrete dimensions are not available, we use symbolic values instead of concrete values. Then, we use z3py (a Python library for the Z3 [137] SMT solver) to run a symbolic analysis that checks whether the dimension constraints between the layers are violated. If a violation is found, we report a misuse with an appropriate message containing the misuse location.

Finally, we return all the misuses (M) at line 8.
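To illustrate the kind of check that z3py enables, here is a minimal sketch; the concrete constraint is an illustrative example rather than one of the constraints Amimla generates.

# Minimal sketch (illustrative): a symbolic dimensional-compatibility check with z3.
from z3 import Int, Solver, unsat

# Suppose the spatial input size of a Conv2D layer is symbolic (e.g., it comes from an
# API parameter) and a later Reshape expects exactly 126*126*32 elements.
d = Int('d')                 # symbolic input height = width
out = d - 3 + 2 * 0 + 1      # Eq. (6.1) with kernel 3, padding 0, stride 1

s = Solver()
s.add(d > 0)
s.add(out * out * 32 == 126 * 126 * 32)   # elements produced vs. elements Reshape expects

if s.check() == unsat:
    print("dimension mismatch misuse: no input size satisfies the layer constraints")
else:
    print("compatible, e.g. d =", s.model()[d])   # d = 128 satisfies the constraints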

6.4 Empirical Evaluation

6.4.1 Stack Overflow (SO) Misuse Dataset

We extracted code snippets from the Stack Overflow posts in the study. For each code snippet, we created an executable program and ran it locally to identify the misuses. To run a program, we sometimes needed to import the necessary libraries and create a mock dataset using the numpy [138] library. A code snippet might have more than one misuse. We fixed misuses one by one until the program ran without errors and recorded all the misuses. As described earlier, running without error means that the training of the model finishes successfully without any training-time error/exception and that the prediction can also be made successfully. We do not consider the performance of the model in this process.

Table 6.4: Detection results by program elements.

In each cell, the pair of values indicates the numbers of total and detected misuses; the last value in each row is the Overall column (empty cells in the remaining columns are omitted).

Single Method Call
  Receiver: 1|0, 7|0; Overall 8|0
  Method: 16|10, 21|20, 66|61, 83|73; Overall 186|164
  Argument: 62|53, 139|123, 35|23, 259|225; Overall 492|424
Multiple Method Call
  Method Order: 2|1; Overall 2|1
  Method Calls: 29|28, 14|11, 4|2; Overall 47|41
  Argument: 7|5, 4|2, 1|1; Overall 12|8
  Iterations: 3|0, 1|0; Overall 4|0
  Conditions: 1|0; Overall 1|0

Table 6.5: Impact Analysis.

Impact        TF SO   Keras SO   TF GH   Keras GH   Total
Performance   18      16         9       123        166
Exception     60      77         28      51         216
Warning       41      88         64      180        373
Total         119     181        101     354        755

The misuses identified through this process are only those related to the Exception impact shown in Table 6.5. The number of misuses is shown in the Misuses column of Table 6.2. The Stack Overflow dataset contains 181 misuses from Keras and 119 misuses from Tensorflow.

6.4.2 Github (GH) Misuse Dataset

We created this dataset from misuse-fixing changes in repositories on Github. First, we used the Github APIs to search for repositories that were written in Python and had the keyword "Keras" or "Tensorflow", to collect all Python repositories relevant to Keras and Tensorflow, respectively. Then, in each repository, we used the Github APIs to search for commits whose log messages contained the keyword "fix". To exclude commits that were not related to Tensorflow or Keras changes, we analyzed the code changes and only included those containing Tensorflow or Keras APIs. For each library, we sorted the repositories by their creation times in ascending order and manually analyzed the first 1,000 commits. We found 56 commits for Keras and 52 commits for Tensorflow that fixed ML API misuses. Then, we collected both the buggy and the fixed code snippets from these commits and added them to the Github dataset. The number of misuses is shown in the Misuses column of Table 6.2. The Github dataset contains 354 misuses from Keras and 101 misuses from Tensorflow.
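For illustration, this kind of mining can be sketched with the public GitHub search REST API as follows; the query strings, result limits, and lack of authentication/pagination handling are simplifications, not the exact scripts used to build the dataset.

# Minimal sketch (illustrative): querying the GitHub search API for candidate
# Python repositories and commits whose messages contain "fix".
import requests

API = "https://api.github.com"
HEADERS = {"Accept": "application/vnd.github+json"}  # add an auth token for real use

def search_repositories(keyword):
    params = {"q": keyword + " language:python", "sort": "stars", "per_page": 10}
    resp = requests.get(API + "/search/repositories", params=params, headers=HEADERS)
    resp.raise_for_status()
    return [item["full_name"] for item in resp.json()["items"]]

def search_fix_commits(repo_full_name):
    params = {"q": "repo:" + repo_full_name + " fix", "per_page": 10}
    resp = requests.get(API + "/search/commits", params=params, headers=HEADERS)
    resp.raise_for_status()
    return [c["commit"]["message"].splitlines()[0] for c in resp.json()["items"]]

for repo in search_repositories("keras")[:3]:
    print(repo, search_fix_commits(repo)[:2])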

6.4.3 Accuracy Metrics

We manually created an executable program for each detected misuse and ran it to validate the misuse. We counted the number of validated misuses as TP. Then, we computed the precision and the recall as the ratios of TP to the number of detected misuses and to the number of expected misuses in the dataset, respectively. We also computed the harmonic mean of precision and recall, F1 = 2 × Precision × Recall / (Precision + Recall).
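As a worked example of these definitions, the following snippet reproduces the Tensorflow GH row of Table 6.2 (TP = 84, detected = 117, expected misuses = 101):

# Worked example: precision, recall, and F1 for the Tensorflow GH row of Table 6.2.
tp, detected, expected = 84, 117, 101

precision = tp / detected                              # ~0.72 -> 72%
recall = tp / expected                                 # ~0.83 -> 83%
f1 = 2 * precision * recall / (precision + recall)     # ~0.77 -> 77%

print("precision=%.0f%% recall=%.0f%% F1=%.0f%%" % (100 * precision, 100 * recall, 100 * f1))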

6.4.4 ML API Misuse Detection Accuracy

We first ran our tool on the Stack Overflow misuse dataset to evaluate its performance on the same dataset used to draw the observations and develop the solution. Then, we ran our tool on the Github misuse dataset, which is independent of the dataset used in the study, to evaluate whether the approach generalizes to other datasets.

Baseline Approach (BL). We compare our approach with [70], referred to as BL in Table 6.2. BL uses the Boa infrastructure [112, 113] to extract API usage sequences for a fixed set of APIs (100 APIs in [70]) from a large number of repositories (over 380K on GitHub). Then, it performs slicing to remove irrelevant statements. Finally, frequent subsequence mining and guard condition mining are used to extract the temporal ordering of related API calls. BL is currently the best-known API misuse detection approach.

Table 6.2 shows the detection accuracy on both the Stack Overflow and Github datasets. We achieve high accuracy on both datasets and for both libraries: the precision is higher than 70% and the recall is higher than 80%. Compared to BL, our approach additionally detects dimensionality-related API misuses using symbolic analysis, as well as misuses due to combinations of argument values, optional hyperparameters, etc.

The F1 scores are comparable between the two datasets, which shows that the design of our approach, based on observations from the Stack Overflow dataset, generalizes to detecting ML API misuses in other datasets.

6.4.5 Analyzing Classification of Detection Results

We further analyze the characteristics of the misuses by mapping them to the classification.

Table 6.3 shows the classification of the misuses by ML pipeline stage. The results for the Github dataset are consistent with the Stack Overflow dataset. Most misuses are in the training and modeling stages. However, in the Github dataset, modeling misuses are dominant. This is because similar posts are flagged as duplicates on Stack Overflow, while multiple Github projects might contain similar mistakes. In terms of detection capability, our approach performed equally well on all stages.

Table 6.4 shows the classification by the roles of the API elements in the misuses. These results are consistent between the Github and Stack Overflow datasets. Overall, while our approach detects misuses with both single and multiple method calls quite well, there were fewer misuses involving conditions and iterations. Furthermore, our approach worked well for methods and arguments.

To understand the impact of the misuses, we created an executable program for each misuse, ran it, and recorded its impact. We classified the impacts into three classes: performance, exception, and warning. Overall, we see misuses with all classes of impact. Warnings are the most frequent, followed by exceptions and finally performance issues. However, this trend is not consistent across all datasets or libraries.

6.4.6 Usefulness Analysis

To demonstrate the usefulness of our tool, we ran it on the latest snapshots of open source projects that use Tensorflow or Keras and are active on Github, to detect not-yet-known misuses. We consider a project active if it has a commit in the last two weeks. As a result, we got 20 projects for Tensorflow and 6 for Keras. Table 6.6 shows the results of running our tool on the latest versions of those projects. Among the 67 and 71 detected Tensorflow and Keras API misuses, we manually checked and found that 25 and 27 are true misuses, respectively, i.e., precisions of 37% and 38%. One of the authors created 20 pull requests with fixes for those misuses, and at the time of this submission we have received seven responses from the developers. A developer from each of the seven projects confirmed the reported misuse. The precision in these in-the-wild projects was low because our approach reported many false positives in big projects with a modularized structure that override the library APIs to create customized versions. The custom APIs created in these projects have the same names as the library APIs but totally different signatures.

Table 6.6: Not-yet-known misuse detection result.

Library      Repo.   Detect   TP   Prec.   Resp.   Confirm
Tensorflow   20      67       25   37%     5       5
Keras        6       71       27   38%     2       2

In many cases, the pipeline stages are implemented in different files or modules, and multi-API constraints are satisfied across those files and modules. Our approach reported a number of these cases as misuses, yielding more false positives. Support for inter-module and inter-file analysis is left as future work. Currently, the response rate for the PRs is low; we have noticed that most of the projects are maintained by a single developer who is unresponsive or unaware of the pull requests. We got a 100% response rate for the PRs submitted to projects of two industry leaders, Google and IBM.

6.5 Conclusion

In this chapter, we presented an empirical study of ML API misuses, followed by a detection strategy for ML-specific API misuses. Our empirical study of Tensorflow and Keras Stack Overflow posts shows that 6% and 13% of the studied posts, respectively, contain misuses, indicating the significant presence of API misuses in ML programming.

The study also motivated us to develop Amimla, an automatic API misuse detection technique and tool that can help developers find misuses in ML programs. Amimla successfully detects the misuses in the Stack Overflow and Github misuse datasets, achieving F1 scores above 70%. It is also useful on unseen datasets for detecting misuses, some of which have already been confirmed by the developers.

CHAPTER 7. CONCLUSION AND FUTURE WORK

The increasing use of ML in developing software has opened new research challenges. In this thesis, we explored several challenges faced by developers that need immediate attention from both researchers and API developers. In Chapter 3, we presented a large-scale study of the difficulties faced by developers. The key findings in Chapter 3 show that developers are likely to face challenges primarily in the data preparation, model creation, and training stages of the ML pipeline. These findings point to the need for new principles and paradigms for ML API development and for ML-specific static analysis tools. In Chapter 4, we conducted a comprehensive study to understand the bug patterns in developing DNN software. We found that there are some unique bug patterns, such as IPS, data flow bugs, and model creation bugs, which differ from other domains of software engineering. This implies that we need new kinds of debuggers and bug detectors to aid the development of DNN applications. In Chapter 5, we presented another empirical study on how developers fix DNNs. We found some new patterns of fixing DNN models, such as fixing network connections, fixing data dimensions, and fixing layer dimensions, which are new and unique to DNNs. These fix patterns can be used to automate the repair of DNN models. In Chapter 6, we proposed Amimla, an automatic technique to detect API misuses made by developers. We have shown that our approach outperforms the state-of-the-art approaches from other domains of software engineering in finding ML misuses.

7.1 Future Work

The contributions made in this thesis can lead to many potential research directions. Some of these future research directions are discussed below:

7.1.1 ML Pipeline Design

We have shown that the current approach to designing ML pipelines causes problems for developers. Developers have difficulty creating error-free pipelines using the ML libraries. Pipelines fail without proper explanation or reasoning, sometimes the pipeline yields poor performance, and sometimes one of the stages becomes a bottleneck for the whole pipeline. So it is an urgent call for researchers to think about strategies and principles for designing more explainable and error-free ML pipelines. We need both theoretical and applied techniques for pipeline design, akin to the design patterns in software development.

7.1.2 ML Specific Static and Dynamic Analysis Tools

Current static and dynamic analysis tools have been found unsuitable for ML. However, both static and dynamic analysis have the potential to identify ML-specific problems in software and produce more robust ML software. This is a critical need, as we must enhance our trust in ML models before they can be brought to production in safety-critical systems. The findings and results presented in this thesis can aid both researchers and practitioners in developing novel ML-specific static and dynamic analysis tools.

7.1.3 ML API Design and Abstraction

We have shown that the current strategy of API design is not well suited to ML. A highly abstract ML API currently serves as a black box, so the developer does not know how the model is learning, when it is failing, and so on. We need to take a step back and rethink the principles of API design for ML. This remains immediate future work for researchers.

7.1.4 ML Specific Debuggers

Currently, ML software is developed in IDEs that provide debugging support for traditional software. However, these tools are not sufficient for debugging ML programs. Google has made some attempts to debug DNN models through TFDebug [129]; however, its scope is currently very limited. So, we need immediate attention from researchers to design ML-specific debuggers.

7.1.5 Automatic Detection of ML Bugs

We have discussed an automatic API misuse detection strategy in this thesis. We have shown that the detection strategy requires new technical innovations for ML. Similarly, we need techniques to detect the other ML-specific bugs discussed in this thesis.

7.1.6 Automatic Program Repair

Our thesis identifies some unique patterns for fixing DNN bugs. We have discussed that these patterns are not present in other domains of software engineering. Therefore, it is a new research challenge to design automatic ML program repair tools built on top of the patterns we identified.

7.1.7 ML Specific Intermediate Representation (IR) for Enabling ML Program Analysis

We have shown that DNN libraries exhibit similar patterns of bugs; similarly, non-DNN libraries exhibit similar kinds of difficulties. This motivates the question of whether we can build a common intermediate representation (IR) for the analysis of ML programs. For example, instead of having five different program analysis tools for five different DNN libraries, we can have a single common program analysis tool: input programs from different libraries are first converted to the common IR, and the program analyses are performed on the IR.

7.1.8 Software Testing for ML Software

An immediate research challenge that follows from our work relates to testing ML software. There is an urgent need to develop new theories of testing ML software to enable proper testing. We also need to understand whether current testing strategies are suitable for ML.

7.1.9 Reuse of ML Models and Versioning

We have shown that many difficulties and bugs originate when reusing existing DNN models. So we need to understand what an ideal strategy for enabling model reuse would be, to reduce work from scratch. We also need to understand whether the current versioning strategies using Github are sufficient for ML models. This remains a challenge for both researchers and practitioners.

BIBLIOGRAPHY

[1] Yufeng Guo, “The 7 Steps of Machine Learning,” 2017. https://towardsdatascience.com/the-7-steps-of-machine-learning-2877d7e5548e.

[2] M. J. Islam, H. A. Nguyen, R. Pan, and H. Rajan, “What do developers ask about ML libraries? a large-scale study using stack overflow,” arXiv preprint arXiv:1906.11940, 2019.

[3] M. J. Islam, G. Nguyen, R. Pan, and H. Rajan, “A comprehensive study on deep learning bug char- acteristics,” in Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2019, (New York, NY, USA), pp. 510–520, ACM, 2019.

[4] M. J. Islam, R. Pan, G. Nguyen, and H. Rajan, “Repairing deep neural networks: Fix patterns and challenges,” in ICSE’20: The 42nd International Conference on Software Engineering, May 2020.

[5] R. Dyer, H. A. Nguyen, H. Rajan, and T. N. Nguyen, “Boa: A language and infrastructure for ana- lyzing ultra-large-scale software repositories,” in Proceedings of the 2013 International Conference on Software Engineering, pp. 422–431, IEEE Press, 2013.

[6] M. J. Islam, A. Sharma, and H. Rajan, “A cyberinfrastructure for big data transportation engineering,” Journal of Big Data Analytics in Transportation, vol. 1, no. 1, pp. 83–94, 2019.

[7] S. Biswas, M. J. Islam, Y. Huang, and H. Rajan, “Boa meets python: A boa dataset of data sci- ence software in python language,” in MSR’19: 16th International Conference on Mining Software Repositories, May 2019.

[8] R. Pan, M. J. Islam, S. Ahmed, and H. Rajan, “Identifying classes susceptible to adversarial attacks,” arXiv preprint arXiv:1905.13284, 2019.

[9] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in Proceedings of the 22nd ACM international conference on Multimedia, pp. 675–678, ACM, 2014.

[10] A. Candel, V. Parmar, E. LeDell, and A. Arora, “Deep learning with h2o,” H2O. ai Inc, 2016.

[11] F. Chollet et al., “Keras.” https://github.com/fchollet/keras, 2015.

[12] S. Owen and S. Owen, Mahout in action. Manning Shelter Island, NY, 2012.

[13] X. Meng, J. Bradley, B. Yavuz, E. Sparks, S. Venkataraman, D. Liu, J. Freeman, D. Tsai, M. Amde, S. Owen, et al., “Mllib: Machine learning in apache spark,” The Journal of Machine Learning Re- search, vol. 17, no. 1, pp. 1235–1241, 2016.

[14] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Pretten- hofer, R. Weiss, V. Dubourg, et al., “Scikit-learn: Machine learning in python,” Journal of machine learning research, vol. 12, no. Oct, pp. 2825–2830, 2011.

[15] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al., “Tensorflow: A system for large-scale machine learning.,” in OSDI, vol. 16, pp. 265– 283, 2016.

[16] J. Bergstra, F. Bastien, O. Breuleux, P. Lamblin, R. Pascanu, O. Delalleau, G. Desjardins, D. Warde- Farley, I. Goodfellow, A. Bergeron, et al., “Theano: Deep learning on gpus with python,” in NIPS 2011, BigLearning Workshop, Granada, Spain, vol. 3, Citeseer, 2011.

[17] R. Collobert, S. Bengio, and J. Mariéthoz, “Torch: a modular machine learning software library,” tech. rep., Idiap, 2002.

[18] G. Holmes, A. Donkin, and I. H. Witten, “Weka: A machine learning workbench,” in Intelligent Information Systems, 1994. Proceedings of the 1994 Second Australian and New Zealand Conference on, pp. 357–361, IEEE, 1994.

[19] S. Meldrum, S. A. Licorish, and B. T. R. Savarimuthu, “Crowdsourced knowledge on stack overflow: A systematic mapping study,” in Proceedings of the 21st International Conference on Evaluation and Assessment in Software Engineering, pp. 180–185, ACM, 2017.

[20] C. Treude, O. Barzilay, and M.-A. Storey, “How do programmers ask and answer questions on the web?: Nier track,” in Software Engineering (ICSE), 2011 33rd International Conference on, pp. 804– 807, IEEE, 2011.

[21] D. Kavaler, D. Posnett, C. Gibler, H. Chen, P. Devanbu, and V. Filkov, “Using and asking: APIs used in the android market and asked about in stackoverflow,” in International Conference on Social Informatics, pp. 405–418, Springer, 2013.

[22] M. Linares-Vásquez, G. Bavota, M. Di Penta, R. Oliveto, and D. Poshyvanyk, “How do API changes trigger stack overflow discussions? a study on the android sdk,” in proceedings of the 22nd Interna- tional Conference on Program Comprehension, pp. 83–94, ACM, 2014.

[23] J. L. Berral-García, “A quick view on current techniques and machine learning algorithms for big data analytics,” in 2016 18th international conference on transparent optical networks (ICTON), pp. 1–4, IEEE, 2016.

[24] A. Barua, S. W. Thomas, and A. E. Hassan, “What are developers talking about? an analysis of topics and trends in stack overflow,” Empirical Software Engineering, vol. 19, no. 3, pp. 619–654, 2014.

[25] M. Rebouças, G. Pinto, F. Ebert, W. Torres, A. Serebrenik, and F. Castor, “An empirical study on the usage of the swift programming language,” in Software Analysis, Evolution, and Reengineering (SANER), 2016 IEEE 23rd International Conference on, vol. 1, pp. 634–638, IEEE, 2016.

[26] D. Schenk and M. Lungu, “Geo-locating the knowledge transfer in stackoverflow,” in Proceedings of the 2013 International Workshop on Social Software Engineering, pp. 21–24, ACM, 2013.

[27] C. Stanley and M. D. Byrne, “Predicting tags for stackoverflow posts,” in Proceedings of ICCM, vol. 2013, 2013.

[28] T. McDonnell, B. Ray, and M. Kim, “An empirical study of api stability and adoption in the android ecosystem,” in Software Maintenance (ICSM), 2013 29th IEEE International Conference on, pp. 70– 79, IEEE, 2013.

[29] A. Baltadzhieva and G. Chrupała, “Predicting the quality of questions on stackoverflow,” in Proceed- ings of the International Conference Recent Advances in Natural Language Processing, pp. 32–40, 2015.

[30] A. Joorabchi, M. English, and A. E. Mahdi, “Text mining stackoverflow: An insight into challenges and subject-related difficulties faced by computer science learners,” Journal of Enterprise Information Management, vol. 29, no. 2, pp. 255–275, 2016.

[31] D. Sculley, T. Phillips, D. Ebner, V. Chaudhary, and M. Young, “Machine learning: The high-interest credit card of technical debt,” 2014.

[32] D. Sculley, G. Holt, D. Golovin, E. Davydov, T. Phillips, D. Ebner, V. Chaudhary, M. Young, J.-F. Crespo, and D. Dennison, “Hidden technical debt in machine learning systems,” in Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NIPS’15, (Cambridge, MA, USA), pp. 2503–2511, MIT Press, 2015.

[33] E. Breck, S. Cai, E. Nielsen, M. Salib, and D. Sculley, “What’s your ML test score? a rubric for ML production systems,” in NIPS Workshop on Reliable Machine Learning in the Wild, 2016.

[34] S. Amershi, A. Begel, C. Bird, R. DeLine, H. Gall, E. Kamar, N. Nagappan, B. Nushi, and T. Zim- mermann, “Software engineering for machine learning: a case study,” in Proceedings of the 41st International Conference on Software Engineering: Software Engineering in Practice, pp. 291–300, IEEE Press, 2019.

[35] S. Lu, S. Park, E. Seo, and Y. Zhou, “Learning from mistakes: a comprehensive study on real world concurrency bug characteristics,” in ACM SIGARCH Computer Architecture News, vol. 36, pp. 329– 339, ACM, 2008.

[36] Y. Gao, W. Dou, F. Qin, C. Gao, D. Wang, J. Wei, R. Huang, L. Zhou, and Y. Wu, “An empirical study on crash recovery bugs in large-scale distributed systems,” in Proceedings of the 2018 26th ACM

Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 539–550, ACM, 2018.

[37] G. Bavota, M. Linares-Vasquez, C. E. Bernal-Cardenas, M. Di Penta, R. Oliveto, and D. Poshyvanyk, “The impact of api change-and fault-proneness on the user ratings of android apps,” IEEE Transac- tions on Software Engineering, vol. 41, no. 4, pp. 384–407, 2015.

[38] D. Dig and R. Johnson, “How do API evolve? a story of refactoring,” Journal of software maintenance and evolution: Research and Practice, vol. 18, no. 2, pp. 83–107, 2006.

[39] J. Li, Y. Xiong, X. Liu, and L. Zhang, “How does web service API evolution affect clients?,” in 2013 IEEE 20th International Conference on Web Services, pp. 300–307, IEEE, 2013.

[40] R. G. Kula, A. Ouni, D. M. German, and K. Inoue, “An empirical study on the impact of refactoring activities on evolving client-used APIs,” Information and Software Technology, vol. 93, pp. 186–199, 2018.

[41] K. Pan, S. Kim, and E. J. Whitehead, “Toward an understanding of bug fix patterns,” Empirical Software Engineering, vol. 14, no. 3, pp. 286–315, 2009.

[42] R. Saha, Y. Lyu, W. Lam, H. Yoshida, and M. Prasad, “Bugs. jar: A large-scale, diverse dataset of real-world Java bugs,” in 2018 IEEE/ACM 15th International Conference on Mining Software Repositories (MSR), pp. 10–13, IEEE, 2018.

[43] F. Ebert, F. Castor, and A. Serebrenik, “An exploratory study on exception handling bugs in Java programs,” Journal of Systems and Software, vol. 106, pp. 82–101, 2015.

[44] W. Ma, L. Chen, X. Zhang, Y. Zhou, and B. Xu, “How do developers fix cross-project correlated bugs? a case study on the scientific python ecosystem,” in 2017 IEEE/ACM 39th International Conference on Software Engineering (ICSE), pp. 381–392, IEEE, 2017.

[45] S. Zaman, B. Adams, and A. E. Hassan, “Security versus performance bugs: a case study on firefox,” in Proceedings of the 8th working conference on mining software repositories, pp. 93–102, ACM, 2011.

[46] Z. Li, L. Tan, X. Wang, S. Lu, Y. Zhou, and C. Zhai, “Have things changed now?: an empirical study of bug characteristics in modern open source software,” in Proceedings of the 1st workshop on Architectural and system support for improving software dependability, pp. 25–33, ACM, 2006.

[47] A. Chou, J. Yang, B. Chelf, S. Hallem, and D. Engler, “An empirical study of operating systems errors,” in ACM SIGOPS Operating Systems Review, vol. 35, pp. 73–88, ACM, 2001.

[48] J. Park, M. Kim, B. Ray, and D.-H. Bae, “An empirical study of supplementary bug fixes,” in Proceedings of the 9th IEEE Working Conference on Mining Software Repositories, pp. 40–49, IEEE Press, 2012.

[49] A. Bachmann, C. Bird, F. Rahman, P. Devanbu, and A. Bernstein, “The missing links: bugs and bug- fix commits,” in Proceedings of the eighteenth ACM SIGSOFT international symposium on Founda- tions of software engineering, pp. 97–106, ACM, 2010.

[50] H. Zhong and Z. Su, “An empirical study on real bug fixes,” in Proceedings of the 37th International Conference on Software Engineering-Volume 1, pp. 913–923, IEEE Press, 2015.

[51] Y. Zhang, Y. Chen, S.-C. Cheung, Y. Xiong, and L. Zhang, “An empirical study on tensorflow program bugs,” in Proceedings of the 27th ACM SIGSOFT International Symposium on Software Testing and Analysis, pp. 129–140, ACM, 2018.

[52] F. Thung, S. Wang, D. Lo, and L. Jiang, “An empirical study of bugs in machine learning systems,” in 2012 IEEE 23rd International Symposium on Software Reliability Engineering, pp. 271–280, IEEE, 2012.

[53] S. E. S. Taba, F. Khomh, Y. Zou, A. E. Hassan, and M. Nagappan, “Predicting bugs using antipat- terns,” in 2013 IEEE International Conference on Software Maintenance, pp. 270–279, IEEE, 2013.

[54] X. Sun, T. Zhou, G. Li, J. Hu, H. Yang, and B. Li, “An empirical study on real bugs for machine learning programs,” in 2017 24th Asia-Pacific Software Engineering Conference (APSEC), pp. 348– 357, IEEE, 2017.

[55] T. Zhang, C. Gao, L. Ma, M. R. Lyu, and M. Kim, “An empirical study of common challenges in developing deep learning applications,” in 2019 IEEE 30th International Symposium on Software Reliability Engineering (ISSRE), IEEE, 2019.

[56] H. V. Pham, T. Lutellier, W. Qi, and L. Tan, “Cradle: cross-backend validation to detect and localize bugs in deep learning libraries,” in Proceedings of the 41st International Conference on Software Engineering, pp. 1027–1038, IEEE Press, 2019.

[57] R. Chillarege, I. S. Bhandari, J. K. Chaar, M. J. Halliday, D. S. Moebus, B. K. Ray, and M.-Y. Wong, “Orthogonal defect classification-a concept for in-process measurements,” IEEE Trans. Softw. Eng., vol. 18, pp. 943–956, Nov. 1992.

[58] S. Amann, H. A. Nguyen, S. Nadi, T. N. Nguyen, and M. Mezini, “A systematic evaluation of static API-misuse detectors,” IEEE Transactions on Software Engineering, pp. 1–1, 2018.

[59] B. Ganter and R. Wille, Formal concept analysis: mathematical foundations. Springer Science & Business Media, 2012.

[60] N. Gruska, A. Wasylkowski, and A. Zeller, “Learning from 6,000 projects: lightweight cross-project anomaly detection,” in Proceedings of the 19th International Symposium on Software Testing and Analysis, pp. 119–130, ACM, 2010.

[61] G. Ammons, R. Bodík, and J. R. Larus, “Mining specifications,” ACM Sigplan Notices, vol. 37, no. 1, pp. 4–16, 2002.

[62] M. D. Ernst, J. Cockrell, W. G. Griswold, and D. Notkin, “Dynamically discovering likely program invariants to support program evolution,” IEEE Transactions on Software Engineering, vol. 27, no. 2, pp. 99–123, 2001.

[63] M. Gabel and Z. Su, “Online inference and enforcement of temporal properties,” in Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering-Volume 1, pp. 15–24, ACM, 2010.

[64] M. Pradel and T. R. Gross, “Automatic generation of object usage specifications from large method traces,” in Proceedings of the 2009 IEEE/ACM International Conference on Automated Software Engineering, pp. 371–382, IEEE Computer Society, 2009.

[65] M. Pradel, C. Jaspan, J. Aldrich, and T. R. Gross, “Statically checking API protocol conformance with mined multi-object specifications,” in Proceedings of the 34th International Conference on Software Engineering, pp. 925–935, IEEE Press, 2012.

[66] A. Wasylkowski, A. Zeller, and C. Lindig, “Detecting object usage anomalies,” in Proceedings of the the 6th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on The foundations of software engineering, pp. 35–44, ACM, 2007.

[67] Z. Li and Y. Zhou, “Pr-miner: Automatically extracting implicit programming rules and detecting violations in large software code,” in Proceedings of the 10th European Software Engineering Con- ference Held Jointly with 13th ACM SIGSOFT International Symposium on Foundations of Software Engineering, ESEC/FSE-13, (New York, NY, USA), pp. 306–315, ACM, 2005.

[68] M. Monperrus, M. Bruch, and M. Mezini, “Detecting missing method calls in object-oriented soft- ware,” in European Conference on Object-Oriented Programming, pp. 2–25, Springer, 2010.

[69] S. Thummalapenta and T. Xie, “Alattin: mining alternative patterns for defect detection,” Automated Software Engineering, vol. 18, no. 3-4, pp. 293–323, 2011.

[70] T. Zhang, G. Upadhyaya, A. Reinhardt, H. Rajan, and M. Kim, “Are code examples on an online Q&A forum reliable?: a study of API misuse on stack overflow,” in 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE), pp. 886–896, IEEE, 2018.

[71] T. T. Nguyen, H. A. Nguyen, N. H. Pham, J. M. Al-Kofahi, and T. N. Nguyen, “Graph-based mining of multiple object usage patterns,” in Proceedings of the 7th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on The Foundations of Software Engineering, ESEC/FSE ’09, (New York, NY, USA), pp. 383–392, ACM, 2009.

[72] Y. Sun, M. Wu, W. Ruan, X. Huang, M. Kwiatkowska, and D. Kroening, “Concolic testing for deep neural networks,” in Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, pp. 109–119, ACM, 2018.

[73] X. Huang, M. Kwiatkowska, S. Wang, and M. Wu, “Safety verification of deep neural networks,” in International Conference on Computer Aided Verification, pp. 3–29, Springer, 2017.

[74] G. Katz, C. Barrett, D. L. Dill, K. Julian, and M. J. Kochenderfer, “Reluplex: An efficient SMT solver for verifying deep neural networks,” in International Conference on Computer Aided Verification, pp. 97–117, Springer, 2017.

[75] W. Ruan, X. Huang, and M. Kwiatkowska, “Reachability analysis of deep neural networks with provable guarantees,” arXiv preprint arXiv:1805.02242, 2018.

[76] kdnuggets, “Top 15 Frameworks for Machine Learning Experts,” 2016. https://www.kdnuggets.com/2016/04/top-15-frameworks-machine-learning-experts.html.

[77] S. Wang, D. Lo, and L. Jiang, “An empirical study on developer interactions in stackoverflow,” in Proceedings of the 28th Annual ACM Symposium on Applied Computing, pp. 1019–1024, ACM, 2013.

[78] J. Yang, K. Tao, A. Bozzon, and G.-J. Houben, “Sparrows and owls: Characterisation of expert behaviour in stackoverflow,” in International Conference on User Modeling, Adaptation, and Personalization, pp. 266–277, Springer, 2014.

[79] T. P. Sahu, N. K. Nagwani, and S. Verma, “Selecting best answer: An empirical analysis on commu- nity question answering sites,” IEEE Access, vol. 4, pp. 4797–4808, 2016.

[80] W. Wang and M. W. Godfrey, “Detecting API usage obstacles: A study of iOS and Android developer questions,” in Proceedings of the 10th Working Conference on Mining Software Repositories, pp. 61–64, IEEE Press, 2013.

[81] Y. Zhang, G. Yin, T. Wang, Y. Yu, and H. Wang, “Evaluating bug severity using crowd-based knowledge: An exploratory study,” in Proceedings of the 7th Asia-Pacific Symposium on Internetware, pp. 70–73, ACM, 2015.

[82] L. Buitinck, G. Louppe, M. Blondel, F. Pedregosa, A. Mueller, O. Grisel, V. Niculae, P. Prettenhofer, A. Gramfort, J. Grobler, R. Layton, J. VanderPlas, A. Joly, B. Holt, and G. Varoquaux, “API design for machine learning software: experiences from the scikit-learn project,” in ECML PKDD Workshop: Languages for Data Mining and Machine Learning, pp. 108–122, 2013.

[83] Z. Wang, K. Liu, J. Li, Y. Zhu, and Y. Zhang, “Various frameworks and libraries of machine learning and deep learning: A survey,” Archives of Computational Methods in Engineering, pp. 1–24, 2019.

[84] G. Nguyen, S. Dlugolinsky, M. Bobák, V. Tran, Á. L. García, I. Heredia, P. Malík, and L. Hluchý, “Machine learning and deep learning frameworks and libraries for large-scale data mining: a survey,” Artificial Intelligence Review, pp. 1–48, 2019.

[85] S. Lockyer, “Coding qualitative data,” The Sage encyclopedia of social science research methods, vol. 1, no. 1, pp. 137–138, 2004.

[86] R. S. Life, “Qualitative data analysis,” 1994.

[87] A. Strauss and J. Corbin, Basics of qualitative research. Sage publications, 1990.

[88] A. Radu and S. Nadi, “A dataset of non-functional bugs,” in Proceedings of the 16th International Conference on Mining Software Repositories, MSR ’19, (Piscataway, NJ, USA), pp. 399–403, IEEE Press, 2019.

[89] P. Domingos, “A few useful things to know about machine learning,” Commun. ACM, vol. 55, pp. 78–87, Oct. 2012.

[90] K. A. Hallgren, “Computing inter-rater reliability for observational data: an overview and tutorial,” Tutorials in quantitative methods for psychology, vol. 8, no. 1, p. 23, 2012.

[91] G. Guzzetta, G. Jurman, and C. Furlanello, “A machine learning pipeline for quantitative phenotype prediction from genotype data,” BMC bioinformatics, vol. 11, no. 8, p. S3, 2010.

[92] C. Rosen and E. Shihab, “What are mobile developers asking about? a large scale study using stack overflow,” Empirical Software Engineering, vol. 21, no. 3, pp. 1192–1223, 2016.

[93] Stack Overflow, “Stack Overflow Privilege Policy,” 2019. https://stackoverflow.com/help/privileges.

[94] A. Said and A. Bellogín, “Comparative recommender system evaluation: benchmarking recommendation frameworks,” in Proceedings of the 8th ACM Conference on Recommender Systems, pp. 129–136, ACM, 2014.

[95] H. Jin, Q. Song, and X. Hu, “Auto-Keras: Efficient neural architecture search with network morphism,” arXiv preprint arXiv:1806.10282, 2018.

[96] C. Thornton, F. Hutter, H. H. Hoos, and K. Leyton-Brown, “Auto-weka: Combined selection and hyperparameter optimization of classification algorithms,” in Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 847–855, ACM, 2013.

[97] K. Sen, D. Marinov, and G. Agha, “CUTE: A concolic unit testing engine for C,” in ACM SIGSOFT Software Engineering Notes, vol. 30, pp. 263–272, ACM, 2005.

[98] Alexandr Honchar, “Using Caffe with your own dataset,” 2017. https://medium.com/machine-learning-world/using-caffe-with-your-own-dataset-b0ade5d71233.

[99] A. Chakarov, A. Nori, S. Rajamani, S. Sen, and D. Vijaykeerthy, “Debugging machine learning tasks,” arXiv preprint arXiv:1603.07292, 2016.

[100] TensorFlow, “Debugging TensorFlow Programs,” 2016. https://www.tensorflow.org/programmers_guide/debugger.

[101] S. L. Smith, P.-J. Kindermans, C. Ying, and Q. V. Le, “Don’t decay the learning rate, increase the batch size,” arXiv preprint arXiv:1711.00489, 2017.

[102] I. Goodfellow, “Efficient per-example gradient computations,” arXiv preprint arXiv:1510.01799, 2015.

[103] S. Saito and S. Roy, “Effects of loss functions and target representations on adversarial robustness,” arXiv preprint arXiv:1812.00181, 2018.

[104] D. Krstajic, L. J. Buturovic, D. E. Leahy, and S. Thomas, “Cross-validation pitfalls when selecting and assessing regression and classification models,” Journal of cheminformatics, vol. 6, no. 1, p. 10, 2014.

[105] S. Amann, S. Nadi, H. A. Nguyen, T. N. Nguyen, and M. Mezini, “MUBench: A benchmark for API-misuse detectors,” in Proceedings of the 13th International Conference on Mining Software Repositories, MSR ’16, (New York, NY, USA), pp. 464–467, ACM, 2016.

[106] H. W. Lilliefors, “On the Kolmogorov-Smirnov test for normality with mean and variance unknown,” Journal of the American Statistical Association, vol. 62, no. 318, pp. 399–402, 1967.

[107] The Theano Development Team, R. Al-Rfou, G. Alain, A. Almahairi, C. Angermueller, D. Bahdanau, N. Ballas, F. Bastien, J. Bayer, A. Belikov, et al., “Theano: A Python framework for fast computation of mathematical expressions,” arXiv preprint arXiv:1605.02688, 2016.

[108] B. Beizer, Software system testing and quality assurance. Van Nostrand Reinhold Co., 1984.

[109] A. J. Viera, J. M. Garrett, et al., “Understanding interobserver agreement: the kappa statistic,” Fam Med, vol. 37, no. 5, pp. 360–363, 2005.

[110] D. Gómez and A. Rojas, “An empirical overview of the no free lunch theorem and its effect on real-world machine learning classification,” Neural Computation, vol. 28, no. 1, pp. 216–228, 2016.

[111] Alexander Shvets, “Software Development AntiPatterns,” 2017. https://sourcemaking.com/antipatterns/software-development-antipatterns.

[112] R. Dyer, H. A. Nguyen, H. Rajan, and T. N. Nguyen, “Boa: A language and infrastructure for analyzing ultra-large-scale software repositories,” in 35th International Conference on Software Engineering, ICSE’13, pp. 422–431, May 2013.

[113] R. Dyer, H. A. Nguyen, H. Rajan, and T. N. Nguyen, “Boa: Ultra-large-scale software repository and source-code mining,” ACM Trans. Softw. Eng. Methodol., vol. 25, no. 1, pp. 7:1–7:34, 2015.

[114] R. Dyer, H. A. Nguyen, H. Rajan, and T. N. Nguyen, “Boa: an enabling language and infrastructure for ultra-large-scale MSR studies,” The Art and Science of Analyzing Software Data, pp. 593–621, 2015.

[115] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” arXiv preprint arXiv:1412.6572, 2014.

[116] M. J. Islam, R. Pan, G. Nguyen, and H. Rajan, “A benchmark for bugs and fix patterns for deep neural networks,” 2020. https://github.com/lab-design/ICSE2020DNNBugRepair.

[117] C. Xie, J. Wang, Z. Zhang, Z. Ren, and A. Yuille, “Mitigating adversarial effects through randomization,” arXiv preprint arXiv:1711.01991, 2017.

[118] Q. Xiao, K. Li, D. Zhang, and Y. Jin, “Wolf in sheep’s clothing-the downscaling attack against deep learning applications,” arXiv preprint arXiv:1712.07805, 2017.

[119] J. H. Friedman, “On bias, variance, 0/1 loss, and the curse-of-dimensionality,” Data mining and knowledge discovery, vol. 1, no. 1, pp. 55–77, 1997.

[120] I. Jolliffe, Principal component analysis. Springer, 2011.

[121] L. v. d. Maaten and G. Hinton, “Visualizing data using t-SNE,” Journal of Machine Learning Research, vol. 9, no. Nov, pp. 2579–2605, 2008.

[122] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.

[123] K. Janocha and W. M. Czarnecki, “On loss functions for deep neural networks in classification,” arXiv preprint arXiv:1702.05659, 2017.

[124] J. Dietrich, K. Jezek, and P. Brada, “Broken promises: An empirical study into evolution problems in Java programs caused by library upgrades,” in 2014 Software Evolution Week - IEEE Conference on Software Maintenance, Reengineering, and Reverse Engineering (CSMR-WCRE), pp. 64–73, IEEE, 2014.

[125] A. Brito, L. Xavier, A. Hora, and M. T. Valente, “Why and how Java developers break APIs,” in 2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 255–265, IEEE, 2018.

[126] K. Jezek, J. Dietrich, and P. Brada, “How Java APIs break – an empirical study,” Information and Software Technology, vol. 65, pp. 129–146, 2015.

[127] X. Chen, C. Liu, B. Li, K. Lu, and D. Song, “Targeted backdoor attacks on deep learning systems using data poisoning,” arXiv preprint arXiv:1712.05526, 2017.

[128] B. Biggio, B. Nelson, and P. Laskov, “Poisoning attacks against support vector machines,” arXiv preprint arXiv:1206.6389, 2012.

[129] TensorFlow, “TensorFlow Github,” 2019. https://github.com/tensorflow/tensorflow/tags/.

[130] G. Uddin, B. Dagenais, and M. P. Robillard, “Temporal analysis of API usage concepts,” in Proceedings of the 34th International Conference on Software Engineering, pp. 804–814, IEEE Press, 2012.

[131] H. A. Nguyen, R. Dyer, T. N. Nguyen, and H. Rajan, “Mining preconditions of APIs in large-scale code corpus,” in Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, pp. 166–177, ACM, 2014.

[132] N. Carlini and D. Wagner, “Towards evaluating the robustness of neural networks,” in 2017 IEEE Symposium on Security and Privacy (SP), pp. 39–57, IEEE, 2017.

[133] H. Rajan, “D4 (dependable data driven discovery) framework,” tech. rep., Iowa State University, Feb. 2020.

[134] M. J. Islam, G. Nguyen, R. Pan, and H. Rajan, “A comprehensive study on deep learning bug characteristics,” in ESEC/FSE 2019: The ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Aug. 2019.

[135] S. M. Nasehi, J. Sillito, F. Maurer, and C. Burns, “What makes a good code example?: A study of programming q&a in stackoverflow,” in 2012 28th IEEE International Conference on Software Maintenance (ICSM), pp. 25–34, IEEE, 2012.

[136] A. Di Franco, H. Guo, and C. Rubio-González, “A comprehensive study of real-world numerical bug characteristics,” in 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 509–519, IEEE, 2017.

[137] L. De Moura and N. Bjørner, “Z3: An efficient SMT solver,” in International Conference on Tools and Algorithms for the Construction and Analysis of Systems, pp. 337–340, Springer, 2008.

[138] S. v. d. Walt, S. C. Colbert, and G. Varoquaux, “The NumPy array: a structure for efficient numerical computation,” Computing in Science & Engineering, vol. 13, no. 2, pp. 22–30, 2011.