University of Texas at El Paso
DigitalCommons@UTEP Open Access Theses & Dissertations
2019-01-01

Recommended Citation:
Chauhan, Swapnil Singh, "Code Smells Quantification: A Case Study On Large Open Source Research Codebase" (2019). Open Access Theses & Dissertations. 50. https://digitalcommons.utep.edu/open_etd/50

CODE SMELLS QUANTIFICATION: A CASE STUDY ON LARGE OPEN SOURCE RESEARCH CODEBASE

SWAPNIL SINGH CHAUHAN
Master's Program in Computer Science
University of Texas at El Paso, [email protected]

APPROVED:

Omar Badreddin, Ph.D., Chair
Eric Smith, Ph.D.
Monika Akbar, Ph.D.
Yoonsik Cheon, Ph.D.

Charles H. Ambler, Ph.D.
Dean of the Graduate School

Copyright © by Swapnil Singh Chauhan 2019

Dedicated to my inspiring parents and family

CODE SMELLS QUANTIFICATION: A CASE STUDY ON LARGE OPEN SOURCE RESEARCH CODEBASE

by

SWAPNIL SINGH CHAUHAN, B.E. in Computer Science

THESIS

Presented to the Faculty of the Graduate School of The University of Texas at El Paso in Partial Fulfillment of the Requirements for the Degree of MASTER OF SCIENCE

Department of Computer Science
THE UNIVERSITY OF TEXAS AT EL PASO
May 2019

ACKNOWLEDGEMENTS

I would like to thank my supervisor, Dr. Omar Badreddin, for the patient guidance, encouragement, and advice he has provided throughout my time as his student. I have been extremely lucky to have a supervisor who cared so much about my work and who responded to my questions and queries so promptly. Besides my advisor, I would like to thank the rest of my thesis committee: Dr.
Eric Smith, Dr. Monika Akbar, and Dr. Yoonsik Cheon, for their encouragement and insightful comments. I would like to express my sincere gratitude to Dr. Hossain for his immense guidance and support. I would also like to thank Dr. Jamie Acosta and Dr. Vladik Kreinovich for their help and for encouraging me to think more independently about our experiments, assignments, and results. I would also like to thank my lab mate Rahad for his valuable suggestions, which helped me during my thesis work. Last but not least, I would like to thank my family for all their support.

ABSTRACT

Research software has opened new pathways of discovery in many diverse disciplines. Research software is developed under unique budgetary and schedule constraints, and its developers are often an untrained, transient workforce of graduate students and postdocs. As a result, poor software quality hinders sustainability beyond the immediate research goals. More importantly, the prevalent reward structures favor contributions in the form of research articles and systematically undervalue research code contributions. Consequently, researchers and funding agencies do not allocate appropriate effort or resources to the development, sustenance, and dissemination of research codebases. At the same time, there is no uniform methodology for quantifying codebase sustainability. Current methods adopt metrics with fixed thresholds that often do not appropriately characterize codebase quality and sustainability. In this thesis, we conduct a case study to investigate this phenomenon. We analyze a large-scale research codebase over a five-year period. For reference, we compare the research codebase's quality characteristics to a reference codebase developed by Google engineers over the same period of time.
The case study suggests that the quality of both research and professional codebases tends to degrade over time, but the decline is much more pronounced in the research codebase. Similarly, the study found that the number of code quality violations, as a percentage of the codebase, is significantly higher for the research codebase. The study also reveals quality characteristics that are unique to professional codebases. For example, professionals tend to design software with smaller code units, possibly in anticipation that such units will grow in size over time. In contrast, the research codebase's units are significantly larger and become increasingly difficult to maintain over time. The results of this thesis are published in the 2019 IEEE/ACM 14th International Workshop on Software Engineering for Science (SE4Science).

This thesis is organized as follows. Chapter 1 provides a background on code quality and presents a number of code quality metrics. This chapter also presents the two codebases used in the study. Chapter 2 presents related work pertaining to software engineers' perceptions of software quality, the evolution and impact of code quality metrics, methods to quantify code quality, and a review of key methodologies and tools to identify and quantify code quality. Chapter 3 presents the case study design and introduces the quantification metrics developed for the study. Chapter 4 presents the results, analysis, and key findings. We conclude the thesis and outline future work in Chapter 5.

TABLE OF CONTENTS

ACKNOWLEDGEMENTS
ABSTRACT
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES

CHAPTER 1: BACKGROUND
1.1 Introduction to Code Smells
1.2 Code Smell Categories
1.3 About QGIS and TensorFlow
1.4 Tools Selected for Input Extraction
1.4.1 Background on Static Analysis Tool: UNDERSTAND
1.4.2 Background on Clone Detection Tool: PMD-CPD

CHAPTER 2: RELATED WORK
2.1 Developers' Perception of Code Smells
2.2 Evolution and Impact of Code Smells
2.3 Quantification and Assessment of Code Smells
2.4 Tools, Techniques, and Approaches to Identify Code Smells

CHAPTER 3: CASE STUDY DESIGN AND CODE SMELL QUANTIFICATION
3.1 Research Approach
3.2 Large Class
3.3 Large Method
3.4 Lazy Class
3.5 Lazy Method
3.6 Large Code-to-Comment Ratio
3.7 Small Code-to-Comment Ratio
3.8 Code Clones

CHAPTER 4: RESULTS AND KEY FINDINGS
4.1 Results
4.1.1 Code Smell Density and Value
4.2 Key Findings
4.3 Validating our Quantification Measures

CHAPTER 5: CONCLUSION AND FUTURE WORK
5.1 Future Work
5.2 Summary
5.3 Key Contributions of this Research Work