Towards Automated Software Evolution of Data-Intensive Applications
Total Page:16
File Type:pdf, Size:1020Kb
City University of New York (CUNY) CUNY Academic Works Dissertations, Theses, and Capstone Projects CUNY Graduate Center 6-2021 Towards Automated Software Evolution of Data-Intensive Applications Yiming Tang The Graduate Center, City University of New York How does access to this work benefit ou?y Let us know! More information about this work at: https://academicworks.cuny.edu/gc_etds/4406 Discover additional works at: https://academicworks.cuny.edu This work is made publicly available by the City University of New York (CUNY). Contact: [email protected] Towards Automated Software Evolution of Data-intensive Applications by Yiming Tang A thesis submitted to the Graduate Faculty in Computer Science in partial fulfillment of the requirements for the degree of Doctor of Philosophy, The City University of New York. 2021 ii © 2021 Yiming Tang All Rights Reserved iii Towards Automated Software Evolution of Data-intensive Applications by Yiming Tang This manuscript has been read and accepted for the Graduate Faculty in Computer Science in satisfaction of the dissertation requirement for the degree of Doctor of Philosophy. May 14, 2021 Raffi Khatchadourian Date Chair of Examining Committee May 14, 2021 Ping Ji Date Executive Officer Supervisory Committee: Raffi Khatchadourian, Advisor Subash Shankar Anita Raja Mehdi Bagherzadeh THE CITY UNIVERSITY OF NEW YORK Abstract Towards Automated Software Evolution of Data-intensive Applications by Yiming Tang Advisor: Raffi Khatchadourian Recent years have witnessed an explosion of work on Big Data. Data-intensive appli- cations analyze and produce large volumes of data typically terabyte and petabyte in size. Many techniques for facilitating data processing are integrated into data- intensive applications. API is a software interface that allows two applications to communicate with each other. Streaming APIs are widely used in today's Object- Oriented programming development that can support parallel processing. In this dissertation, an approach that automatically suggests stream code run in parallel or sequentially is proposed. However, using streams efficiently and properly needs many subtle considerations. The use and misuse patterns for stream codes are proposed in this dissertation. Modern software, especially for highly transactional software sys- iv ABSTRACT v tems, generates vast logging information every day. The huge amount of information prevents developers from receiving useful information effectively. Log-level could be used to filter run-time information. This dissertation proposes an automated evolu- tion approach for alleviating logging information overload by rejuvenating log levels according to developers' interests. Machine Learning (ML) systems are pervasive in today's software society. They are always complex and can process large volumes of data. Due to the complexity of ML systems, they are prone to classic technical debt issues, but how ML systems evolve is still a puzzling problem. This dissertation introduces ML-specific refactoring and technical debt for solving this problem. To My Dear Husband Xiaodong vi Acknowledgements There are two persons I would like to thank for pursuing my Ph.D. degree. First, I would like to thank my advisor Prof. Raffi Khatchadourian for his generous support and guidance through my study period at CUNY. He offered me a chance to pursue my dream of research. He taught me how to become a professional researcher. He has helped and supported me a lot when I have any difficulties that I could not handle. I am very grateful for everything he has done for me. Without him, I would not have these academic contributions listed in the dissertation. I am also grateful to my husband Xiaodong Cao. He is the most important person in my life. Thank you! vii Contents Abstract iv Acknowledgements vii 1 Introduction 1 2 Safe Automated Refactoring for Intelligent Parallelization of Java 8 Streams 7 2.1 Introduction . .7 2.2 Motivation, Background, and Insight . 14 2.2.1 Concurrent Reductions . 20 2.3 Optimization Approach . 23 2.3.1 Intelligent Parallelization Refactorings . 23 2.3.2 Identifying Stream Creation . 28 2.3.3 Tracking Streams and Their Attributes . 28 2.3.4 Tracking Stream Pipelines . 32 2.3.5 Inferring Behavioral Parameter Side-effects . 41 viii 2.3.6 Determining Whether Reduction Ordering Matters . 43 2.3.7 Transformation . 48 2.4 Evaluation . 51 2.4.1 Implementation . 51 2.4.2 Experimental Evaluation . 53 2.4.3 Threats to Validity . 62 2.5 Related Work . 63 2.6 Conclusion . 66 2.7 Future Work . 66 2.7.1 Other programming language support . 67 3 An Empirical Study on the Use and Misuse of Java 8 Streams 68 3.1 Introduction . 68 3.2 Motivating Example and Conceptual Background . 72 3.3 Study Subjects . 74 3.4 Stream Characteristics . 75 3.4.1 Methodology . 75 3.4.2 Results . 78 3.4.3 Discussion . 79 3.5 Stream Usage . 81 3.5.1 Methodology . 81 3.5.2 Results . 82 3.5.3 Discussion . 82 3.6 Stream Misuses . 87 ix 3.6.1 Methodology . 87 3.6.2 Results . 90 3.7 Threats to Validity . 99 3.8 Related Work . 100 3.9 Conclusion . 103 3.10 Future Work . 104 3.10.1 Automated error checker . 104 4 An Empirical Study of Refactorings and Technical Debt in Machine Learning Systems 105 4.1 Introduction . 105 4.2 Methodology . 109 4.2.1 Overview . 109 4.2.2 Commit Mining . 110 4.2.3 Refactoring Identification . 111 4.2.4 Refactoring Classification . 112 4.3 Results . 114 4.3.1 Quantitative Analysis . 114 4.3.2 Qualitative Analysis . 129 4.4 Discussion . 138 4.4.1 Code Duplication in Configuration & Model Code . 138 4.4.2 Combating Code Duplication Debt in ML Systems . 139 4.4.3 Generic vs. ML-specific Refactorings . 140 4.4.4 ML-related Code Performance . 141 x 4.5 Threats to Validity . 141 4.6 Related Work . 143 4.7 Conclusion . 146 4.8 Future Work . 146 4.8.1 ML-based approaches . 146 5 Automated Evolution of Feature Logging Statement Levels Using Git Histories and Degree of Interest 148 5.1 Introduction . 148 5.2 Motivating Example . 155 5.3 Approach . 158 5.3.1 Assumptions . 158 5.3.2 Overview . 159 5.3.3 Feature Logging Statement Level Extraction . 160 5.3.4 Mylyn DOI Model Manipulation . 160 5.3.5 DOI-Feature Logging Level Association . 163 5.3.6 Feature Logging Statement Classification & Heuristics . 164 5.4 Evaluation . 168 5.4.1 Implementation . 168 5.4.2 Experimental Evaluation . 169 5.4.3 Threats to Validity . 185 5.5 Related Work . 186 5.6 Conclusion . 189 5.7 Future Work . 190 xi 5.7.1 Existing Mylyn Context . 190 5.7.2 Machine Learning Techniques . 191 6 Conclusion & Future Work 192 6.1 Future Work . 193 Bibliography 195 VITA 221 Publications 222 Fields of Study 224 xii List of Figures 2.1 A proper subset of the relation E! in E = (ES;EΛ;E!). 29 2.2 A proper subset of the relation O! in O = (OS;OΛ;O!). 30 2.3 Scenarios for whether reduce ordering matters (ROM). 45 2.4 Predecessor tree for listing 4a. 50 3.1 Stream method calls (right lateral of the entire figure) . 83 3.2 Stream method calls (left lateral of the entire figure) . 84 3.3 Studied stream bugs and patches (hierarchical). 91 4.1 Discovered refactorings (hierarchical). 117 4.2 Discovered generic refactorings (nonhierarchical). 123 4.3 ML-specific technical debt category descriptions. 126 5.1 Logging level revitalization approach overview. 159 xiii List of Tables 2.1 Convert Sequential Stream to Parallel preconditions. 24 2.2 Optimize Parallel Stream preconditions. 26 2.3 Optimize Complex Mutable Reduction preconditions. 27 2.4 \Reduction ordering matters" (ROM) lookup table. 44 2.5 Experimental results. 54 2.6 Refactoring failures. 56 2.7 Average run times of JMH benchmarks. 58 3.1 Stream characteristics. 76 3.2 Studied subjects. 88 3.3 Stream bug/patch category legend. 90 3.4 Studied stream bugs and patches (nonhierarchical). 94 4.1 Studied subjects. 111 4.2 Discovered refactorings (nonhierarchical). 116 4.3 Discovered ML-specific technical debt vs. refactoring categories. 125 5.1 Quantitative analysis results. 172 xiv 5.2 Comparison with manual logging statement level changes. 177 5.3 \Ideal" level transformation directions for feature logs. 181 5.4 Feature logging statements in change contexts. 182 xv List of Listings 1 A hypothetical widget class. 14 2 Snippet of Widget collection processing using Java 8 streams based on java.util.stream (Java SE 9 & JDK 9) (Oracle, 2017b). 15 3 Complex reduction operation refactoring example based on java.util.stream (Java SE 9 & JDK 9) (Oracle, 2017b). 21 4 Sequencing stream instance derivations. 36.