Empowering Empirical Research in Software Design: Construction and Studies on a Large-Scale Corpus of UML Models

Thesis for The Degree of Doctor of Philosophy Empowering Empirical Research in Software Design: Construction and Studies on a Large-Scale Corpus of UML Models Truong Ho-Quang Division of Software Engineering Department of Computer Science & Engineering Chalmers University of Technology and University of Gothenburg Gothenburg, Sweden, 2019 Empowering Empirical Research in Software Design: Construction and Studies on a Large-Scale Corpus of UML Models Truong Ho-Quang Copyright ©2019 Truong Ho-Quang except where otherwise stated. All rights reserved. ISBN 978-91-7833-608-1 (PRINT) ISBN 978-91-7833-609-8 (PDF) ISSN 0346-718X The thesis is available in full text online http://hdl.handle.net/2077/61704 Technical Report No 173D Department of Computer Science & Engineering Division of Software Engineering Chalmers University of Technology and University of Gothenburg Gothenburg, Sweden Cover illustration: Steps of printing a Đông Hồ painting (called “Gà D¤ Xướng”) Photos taken by Phùng Hồng Kên. This thesis has been prepared using LATEX. Printed by Chalmers Reproservice, Gothenburg, Sweden 2019. ii “Extraordinary Claims Require Extraordinary Evidence.” - Carl, Sagan. iv Abstract Context: In modern software development, software modeling is considered to be an essential part of the software architecture and design activities. The Unified Modeling Language (UML) has become the de facto standard for software modeling in industry. Surprisingly, there are only a few empirical studies on the practices and impacts of UML modeling in software development. This is mainly due to the lack of empirical data on real-life software systems that use UML modeling. Objective: This PhD thesis contributes to this matter by describing a method to build and curate a big corpus of open-source-software (OSS) projects that contain UML models. Subsequently, this thesis offers observations on the practices and impacts of using UML modeling in these OSS projects. Method: We combine techniques from repository mining and image classification in order to successfully identify more than 24.000 open source projects on GitHub that together contain more than 93.000 UML models. Machine learning techniques are also used to enrich the corpus with annotations. Finally, various empirical studies, including a case study, a user study, a large-scale survey and an experiment, have been carried out across this set of projects. Result: The results show that UML is generally perceived to be helpful to new contributors. The most important motivation for using UML seems to be to facilitate collaboration. In particular, teams use UML during communication and planning of joint implementation efforts. Our study also shows that the use of UML modeling has a positive impact on software quality, i.e. it correlates with lower defect proneness. Further, we find out that visualisation of design concepts, such as class role-stereotypes, helps developers to perform better in software comprehension tasks. Keywords Software Modeling, Software Design, Empirical Research, UML, Modeling Prac- tice, Impacts of Modeling, Open Source System, Mining Software Repository, Data Mining, Data Curation, GitHub. Acknowledgment To accomplish this 5-year Ph.D project, I have received lots of encouragement from colleagues, friends and my family. I would take this opportunity to thank: My main supervisor Prof. Michel R. V. Chaudron, for the continuous support to my Ph.D study, for your patience, motivation, and immense knowledge. Thanks for being not only a great academic adviser but also an intimate friend. My co-supervisor Regina Hebig, for voluntarily supporting me and providing me with tips and comments whenever needed. My former co-supervisor Patrizio Pelliccione, for offering interesting discussions and constructive comments in the first two years of my Ph.D. My examiner Prof. Ivica Crnkovic, for your encouragement, for hard ques- tions which incent me to widen my research from various perspectives. Directors of Ph.D study Jan Jonsson and Agneta Nilsson, for always giving constructive recommendations and reminding me to add buffers to my often- ambitious-research plans. To all of my colleagues at the Software Engineering Division, including of course administrative staff: Thank you all for creating such a friendly and productive work environment. I particularly thank Rodi Jolak for sharing the office with me, for interesting discussions and lots of push-up exercises at work. Hugo, Federico, Grischa, Salome: for being good and helpful neighbours. My research would have not been possible without support from my collab- orators. I would express my deeply thanks to all 24 people who co-authored papers with me. In particular, I thank Gregorio Robles and Miguel Ángel Fern¡ndez for being part of the best team that I've ever had. Many thanks to Arif Nurwidyantoro, Alexandre Bergel, Adithya Raghuraman, Alexander Serebrenik, Bogdan Vasilescu for our regular discussions despite the major time differences. To Dave, Hafeez and Bilal: I am thankful to be part of our team. Thanks to your help, I came to the Ph.D life less nervous. To all friends, near and far away, especially the Vietnamese “gangs” in Gothenburg: I sincerely thank you for sharing lots of joys during the years and recharging my battery when it was getting low. And to the most important people in my life: My lovely wife and daughter - thank you for stepping into my life and making it full of joy and happiness everyday. I also owe much of my success to my parents and grand-fathers, who supported me spiritually throughout writing this thesis and my life in general. Last but not least, I would give many thanks to my brothers and their families for the great suggestions and motivation during my Ph.D life. Truong Ho-Quang Gothenburg, 2019. vii List of Publications Included publications This thesis is based on the following eight (8) papers: [A] T. Ho-Quang, M.R.V. Chaudron, I. Samúelsson, J. Hjaltason, B. Karas- neh, H. Osman “Automatic Classification of UML Class Diagrams from Images”. Published in the Proceedings of the 21st Asia-Pacific Software Engineering Conference (APSEC 2014), Jeju, Korea, December 1 - December 4, 2014. [B] R. Hebig, T. Ho-Quang, M.R.V. Chaudron, G. Robles, F. Miguel Angel “The Quest for Open Source Projects that Use UML: Mining GitHub”. Published in the Proceedings of the ACM/IEEE 19th International Con- ference on Model Driven Engineering Languages and Systems (MODELS 2016), Saint-Malo, France, October 2 - October 7, 2016. [C] T. Ho-Quang, R. Hebig, G. Robles, M.R.V. Chaudron, F. Miguel Angel “Practices and Perceptions of UML Use in Open Source Projects”. Published in the Proceedings of the 39th International Conference on Software Engineering - Software Engineering in Practice Track (ICSE SEIP 2017), Buenos Aires, Argentina, May 20 - May 28, 2017. [D] M.H. Osman, T. Ho-Quang, M.R.V. Chaudron, “An Automated Ap- proach for Classifying Reverse-engineered and Forward-engineered UML Class Diagrams”. Published in the Proceedings of the 44th Euromicro Conference on Soft- ware Engineering and Advanced Applications (SEAA) (pp. 396-399), Prague, Czech Republic, August 29-31, 2018. [E] T. Ho-Quang, M.R.V. Chaudron, R. Hebig, G. Robles “Challenges and Directions for a Community Infrastructure for Big Data-driven Research in Software Architecture” Accepted as a chapter in the book “Model Management and Analytics for Large Scale Systems”, To be published by Elsevier (Expected release date: November 1, 2019). [F] A. Raghuraman, T. Ho-Quang, M.R.V. Chaudron, A. Serebrenik, B. Vasilescu “Does UML Modeling Associate with Higher Software Qual- ity in Open-Source Software?” Accepted at the 16th International Conference on Mining Software Repos- itories (MSR2019), Montr²al, Canada, May 26 - May 27, 2019. ix x [G] T. Ho-Quang, A. Nurwidyantoro, M.R.V. Chaudron “Using Machine Learning for Automated Classification of Class Responsibility Stereotypes in Software Design”. Under Submission. [H] T. Ho-Quang, A. Bergel, A. Nurwidyantoro, M.R.V. Chaudron “In- teractive Role Stereotype-Based Visualization To Comprehend Software Architecture”. Under submission. xi Other publications The following publications were published during my PhD studies, or are currently in submission. However, they are not appended to this thesis, due to contents overlapping that of appended publications or contents not related to the thesis. [a] H. Osman, M.R.V. Chaudron, P. van der Putten, T. Ho-Quang “Con- densing Reverse Engineered Class Diagrams Through Class Name Based Abstraction” 4th World Congress on Information and Communication Technologies (WICT'14), Malacca, Malaysia, December 8 - December 10, 2014. [b] D.R. Stikkolorum, T. Ho-Quang, M.R.V. Chaudron “Revealing Stu- dents' UML Class Diagram Modelling Strategies with WebUML and LogViz” 41st Euromicro Conference on Software Engineering and Advanced Ap- plications (SEAA), Funchal, Madeira, Protugal, August 26 - August 28, 2015. [c] D.R. Stikkolorum, T. Ho-Quang, B. Karasneh, M.R.V. Chaudron “Un- covering Students' Common Difficulties and Strategies During a Class Diagram Design Process: an Online Experiment” Educators Symposium at ACM/IEEE 18th International Conference on Model Driven Engineering Languages and Systems (EduSymp@MoDELS 2015), Ottawa, Canada, September 29, 2015. [d] R. Hebig, T. Ho-Quang, M.R.V. Chaudron, G. Robles, F. Miguel Angel “The Quest for UML in Open Source Projects Initial findings from GitHub” Published in the Proceedings of the Doctoral Consortium at the 12th International Conference on Open Source Systems (OSS 2016), G¨oteborg, Sweden, May 30, 2016 [e] R. Jolak, E. Umuhoza, T. Ho-Quang, M.R.V. Chaudron, M.Brambilla “Dissecting design effort and drawing effort in UML modeling” Published in the Proceedings of the 43rd Euromicro Conference on Software Engineering and Advanced Applications (SEAA), Vienna, Austria, 2017. [f] G. Robles, T. Ho-Quang, R. Hebig, M.R.V. Chaudron, F. Miguel Angel “An extensive collection of UML files in GitHub” Published in the Proceedings of the 14th International Conference on Mining Software Repositories (MSR 2017).

Empowering Empirical Research in Software Design: Construction and Studies on a Large-Scale Corpus of UML Models

Customizing UML with Stereotypes

OMG Systems Modeling Language (OMG Sysml™) Tutorial 25 June 2007

A Model-Driven Engineering Approach to Support the Veriﬁcation of Compliance to Safety Standards

UML Notation Guide 3

Plantuml Language Reference Guide (Version 1.2021.2)

UML 2001: a Standardization Odyssey

Plantuml Language Reference Guide

Software Evolution of Legacy Systems a Case Study of Soft-Migration

On UML's Composite Structure Diagram

Quantitative Safety Analysis of UML Models

During the Early Days of Computing, Computer Systems Were Designed to Fit a Particular Platform and Were Designed to Address T

UML Diagrams