
PART-OF-SPEECH TAGGING OF SOURCE CODE IDENTIFIERS

USING PROGRAMMING LANGUAGE CONTEXT

VERSUS NATURAL LANGUAGE CONTEXT

A thesis submitted

to Kent State University in partial

fulfillment of the requirements for the

degree of Master of Science

by

Reem S. AlSuhaibani

December, 2015


Thesis written by

Reem S. AlSuhaibani

M.S., Kent State University, USA, 2015

B.S., Prince Sultan University, Saudi Arabia, 2010

Approved by

Dr. Jonathan I. Maletic Academic Advisor

Dr. Gwenn L. Volkert Member, Master's Thesis Committee

Dr. Kambiz Ghazinour Member, Master's Thesis Committee

Accepted by

Dr. Javed I. Khan Chair, Department of Computer Science

Dr. James L. Blank Dean, College of Arts and Sciences


TABLE OF CONTENTS

LIST OF FIGURES ...... V

LIST OF TABLES ...... VI

DEDICATION ...... VII

ACKNOWLEDGEMENTS ...... VIII

INTRODUCTION ...... 1

1.1 Research Hypothesis and Questions ...... 3

1.2 Research Contributions ...... 3

1.3 Organization of the Thesis ...... 4

BACKGROUND ...... 5

2.1 Tagging ...... 5

2.1.1 Rule-Based Approach ...... 9

2.1.2 The Stochastic (Probabilistic) Approach ...... 10

2.1.3 Architecture of Part of Speech Taggers ...... 12

2.1.4 Tagsets ...... 13

2.2 Natural Language in Source Code ...... 16

2.2.1 Program Identifiers ...... 17

2.2.2 Comments ...... 23

RELATED WORK ...... 25

OVERVIEW OF APPROACH ...... 28

4.1 Part of Speech Tagging in Programming Languages ...... 28

4.2 Part of Speech Tagging Approach for Source Code ...... 29


4.2.1 Heuristic Rules on Program Identifiers ...... 31

4.2.2 Part of Speech and Method Stereotypes ...... 35

4.3 Part of Speech Tagging on Source Code Comments ...... 36

4.4 Implementation in srcML ...... 39

EVALUATION ...... 42

CONCLUSIONS AND FUTURE RESEARCH ...... 53

6.1 Main Findings ...... 53

6.2 Future Research Directions ...... 54

APPENDIX A ...... 56

APPENDIX B ...... 57

APPENDIX C ...... 58

APPENDIX D ...... 60

REFERENCES ...... 64


LIST OF FIGURES

Figure 2.1 The different approaches of automatic part of speech tagging...... 7

Figure 2.2 The common process of part of speech taggers ...... 12

Figure 4.1 An example of applying heuristics ...... 34

Figure 4.2 The same previous example with the NLTK POS tagger (NLP tagger) ...... 34

Figure 4.3 The result of NLTK tagger on HippoDraw source code comments ...... 38

Figure 4.4 The workflow of the heuristics approach ...... 40

Figure 4.5 Example of how srcNLP tags source code identifiers ...... 40

Figure 5.1 Box plot of program identifiers common between all the 10 systems ...... 51

Figure 5.2 Box plot of program identifiers common without Chipmunk2D ...... 51

Figure 5.3 A general visualization for nlpCMP output structure on the usage of the identifier ‘depth’ between 4 systems ...... 52

Figure 6.1 Shows some of the common identifiers between Monkey Studio and Code Blocks ...... 56


LIST OF TABLES

Table 2.1 Examples of how the word “above” is used in different forms ...... 6

Table 2.2 Differences between supervised and unsupervised part of speech tagging ...... 8

Table 2.3 The NLTK universal language tagsets ...... 16

Table 4.1 Taxonomy of method stereotypes and their corresponding part of speech ...... 35

Table 5.1 Number of verified identifiers for each system according to part of speech with percentages ...... 43

Table 5.2 The 10 open source systems used in the evaluation ...... 45

Table 5.3 The consistency of part of speech within each system ...... 49

Table 5.4 The total number of identifiers common between systems ...... 50


DEDICATION

To my father Saleh AlSuhaibani, who passed away during my research studies and before finishing this thesis; a goal that we both share.


ACKNOWLEDGEMENTS

This thesis would not have been completed without the guidance and blessings of God; I am grateful and thankful for everything God gave me. This thesis also would not have been completed without the support of many people around me who truly deserve to be acknowledged.

I would like first to thank my parents, my father Saleh AlSuhaibani and my mother Huda Alrajhi, for their love, care, continuous support, and the advice they have given me throughout my studies to become the person I am now.

Second, my deepest gratitude and sincere thanks goes to my husband Ahmad for being a father, a brother and a friend during my master’s studies in the United States.

Without his continuous support, I would not have achieved many things.

Thanks go out to my advisor, Professor Jonathan I. Maletic, for his enormous efforts and continual follow-up and attention in achieving the goal of this thesis. Without him, I would not have loved and enjoyed the work that I have accomplished. He has been such an inspirational advisor who has created a spirit of creativity among us as students. I am glad and proud that he is my advisor, and I appreciate all the advice he has given me to expand my knowledge in software engineering.

A special thank you goes to each of the SDML lab members for being supportive and helpful with their advice and opinions. I would like also to extend my thanks to my brothers Abdulrahman, Ahmad and Hussam and my sister Maram, and to all my friends at Kent State University who have been such great supporters with their love and prayers.

Reem S. AlSuhaibani

November 2015, Kent, Ohio


Introduction

With 60-90% of software life cycle resources spent on program maintenance [Boehm 1981; Erlikh 2000], there is a critical need for advanced tools that help in exploring and comprehending today’s large and complex software. To reduce the cost of this software maintenance, it has been demonstrated that natural-language clues in program identifiers can be used to improve software tools [Shepherd, Pollock, Vijay-Shanker 2007].

There have been a number of attempts to apply Natural Language Processing (NLP) techniques to source code to support various program comprehension tasks. In the work presented here, we are particularly interested in determining the part-of-speech of identifiers of functions, types, variables, etc. in source code. We view this as a separate problem from determining the part-of-speech of comments. Comments are typically written in a natural language (English) and often have sentence structure that follows grammatical rules [Etzkorn, Davis 1994; Etzkorn, Davis, Bowen 2001; Vinz, Etzkorn 2008].

Part of speech (PoS) taggers for natural language leverage large amounts of knowledge about English words and their usage in sentences. Thus, they work well for typical English prose, but identifiers lack sentence structure, and without it these methods break down.

Additionally, the manner in which programmers use an identifier in a program is very different from how a writer uses a word in a sentence. The usage context is different, and the meanings of the identifiers are typically specialized for the domain of software. While we do not deny that there is some correspondence between an identifier and its English counterpart, drawing a direct comparison is clearly flawed.

As such, we feel a more appropriate approach is to define the part-of-speech in terms of how an identifier is used in the code rather than how it would be used in prose.

Similar techniques have been used by others [Gupta, Malik, Pollock, Vijay-Shanker 2013; Pollock et al. 2007; Shepherd et al. 2007]. For example, we could simply mark all function names as verbs and variables/objects as nouns. Names of functions typically describe an action (on an object or parameter). Likewise, variables are typically nouns that describe an object in the domain.

Of course, this heuristic is overly simplistic and does not take into consideration a wealth of relevant information that can be derived (statically) from the context of an identifier within the source code. Much like how natural language part of speech taggers use the context of the word in a sentence, we must use the context of the identifier in the program.

Given this viewpoint, we propose a set of heuristics that define the part of speech of identifiers in source code. That is, we have taken terms from part of speech (e.g., noun, verb, etc.) and defined them, using heuristics, in the context of source code. Thus, instead of assigning part of speech to identifiers based on the word’s usage in English prose, we assign it based on its use in source code. The goal of this work is to produce a specialized part of speech tagger for source code. This would be used in conjunction with a Natural Language Processing (NLP) part of speech tagger for the internal comments.

1.1 Research Hypothesis and Questions

The main hypothesis of this work is that defining part of speech in the context of source code for program understanding will produce more useful results than using a natural language based part of speech definition.

This thesis investigates the following research questions:

1. Is it possible to tag program identifiers using programming-based part of speech rules according to their placement in source code?

2. What are the definitions for each part of speech in the context of source code?

3. Are program identifiers consistently used, according to the part of speech, within systems?

4. Are program identifiers consistently used across systems?

1.2 Research Contributions

The results of this research add to the research literature in several ways. Programming-based part of speech tagging allows program identifiers to be tagged without using Natural Language Processing (NLP) part of speech taggers. It is a new way of handling part of speech tagging in source code. Specific contributions of this work include:

• Definitions of proper nouns, nouns, adjectives, pronouns, and verbs in terms of source code, and rules that distinguish them.

• An evaluation of these definitions on 10 open source software systems.

• An empirical study of 10 open source systems in the context of common identifier usage.


1.3 Organization of the Thesis

The thesis is organized as follows: A background is given in CHAPTER 2, followed by related work in CHAPTER 3. The details of the proposed approach are given in CHAPTER 4. An evaluation of the approach is presented in CHAPTER 5. The main results and future work are summarized in CHAPTER 6.


Background

This chapter presents a background of the most important aspects of the part of speech tagging process. First, part of speech tagging is defined, followed by a brief description of some related approaches that can be used for this procedure. An overview of part of speech tagging architecture is also provided to give an overall vision of how the various Natural Language Processing (NLP) taggers work. Next, the term tagset is explored in depth, supported with examples of the most commonly used tagsets.

The following section then explores natural language usage in source code, supported by the most recent research studies. This section, in fact, can be considered the foundation of this thesis.

2.1 Part of Speech Tagging

Part of speech tagging in natural language processing is defined as labeling a word in a sentence or phrase with its appropriate part of speech, based on both its own definition and the context of the sentence. Several applications build on this process, including chunking, lexical acquisition, and concept extraction. Part of speech tagging is not a trivial task; in English, for example, one word may have various part of speech tags. The word “about” can be used as an adverb or preposition depending on how it is used in a sentence, and the word “above” can be used as an adverb, a preposition, an adjective, or a noun (see Table 2.1). Such words may cause confusion for taggers as a result of their ambiguity. Lately, however, there has been rising interest in data-driven machine learning disambiguation approaches, which can be employed in different tasks such as tagging.

Table 2.1 Examples of how the word “above” is used in different forms

Sentence                                     How the word “above” is used
The stars are above.                         Adverb
The airplane is above the clouds.            Preposition
Read the instruction given above.            Adjective
Our financial incentives come from above.    Noun

A number of approaches have been proposed for building part of speech taggers, including Hidden Markov Models. These are statistical methods that work by choosing the tag sequence that maximizes the product of the lexical probability and the contextual probability. This method has been successfully applied to different languages such as German [Brants 2000], English [Brants 2000; Mihalcea 2003], Slovene [Dzeroski, Erjavec, Zavrel 2000] and Spanish [Carrasco, Gelbukh 2003].

Another proposed approach is rule-based, which uses given rules and a lexicon to resolve the ambiguity of a tag. These rules can be either created by developers or learned [Allen 1987], which is discussed further in Section 2.1.1. Other machine learning models used for tagging include maximum entropy and other log-linear models, decision trees, memory-based learning, and transformation-based learning [Zavrel, Daelemans 1999; Ratnaparkhi 1996].


[Guilder 1995] attempted to depict the different approaches of automatic part of speech tagging (see Figure 2.1).

Figure 2.1 The different approaches of automatic part of speech tagging. Many part of speech taggers use aspects of some or all of these approaches; hence, diagrams such as this can become very complicated.

In supervised part of speech tagging as defined by [Guilder 1995], a pre-tagged corpus is created, and supervised taggers rely on it as a foundation for building the tools employed in the overall tagging process. It can be thought of as the tagger's dictionary, since it holds word usage counts (frequencies) paired with tags, as well as the probabilities of tag sequences and/or a set of rules.

Conversely, unsupervised models do not need a pre-tagged corpus; these models instead employ sophisticated computational strategies to generate tag sets automatically. These automatic tag sets are used either to quantify the probabilistic information required by stochastic methods or to develop the context-specific rules required by rule-based systems. The differences between these two approaches as described by [Guilder 1995] can be seen in Table 2.2.

Table 2.2 Differences between supervised and unsupervised part of speech tagging

Supervised:
- Selection of tagset/tagged corpus.
- Creation of dictionaries using tagged corpus.
- Calculation of disambiguation tools, which may include: word frequencies, affix frequencies, tag sequence probabilities, "formulaic" expressions.
- Tagging of test data using dictionary information.
- Disambiguation using statistical, hybrid or rule-based approaches.
- Calculation of tagger accuracy.

Unsupervised:
- Induction of tagset using untagged training data.
- Induction of dictionary using training data.
- Induction of disambiguation tools, which may include: word frequencies, affix frequencies, tag sequence probabilities.
- Tagging of test data using induced dictionaries.
- Disambiguation using statistical, hybrid or rule-based approaches.
- Calculation of tagger accuracy.


The two main approaches to tagging, rule-based and stochastic, are further discussed in the following sections.

2.1.1 Rule-Based Approach

According to [Guilder 1995], the contextual information largely used in typical rule-based approaches for assigning tags to unknown or ambiguous words is known as ‘context frame rules’. To demonstrate what a context frame rule might indicate, [Guilder 1995] uses the example of an unknown/ambiguous word X following a determiner and preceding a noun, which is assigned a tag in the category of an adjective:

det - X - n = X/adj

Although contextual information is used for the tagging process, many available taggers also use morphological information to assist in the process of disambiguation. A rule of this kind could be: if an ambiguous/unknown word ends with –tion and is preceded by a determiner, it can be marked as a noun.
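As a sketch, the two kinds of rules just described (the context frame rule det - X - n = X/adj and a –tion suffix rule) might look as follows in Python; the tag names, rule set, and example words are hypothetical and are not taken from any particular tagger.

```python
# Hypothetical sketch of a rule-based tagging step: one context frame
# rule (det - X - n => X/adj) plus one morphological suffix rule.

def apply_rules(tokens, tags):
    """tags holds one tag per token, with None for unknown/ambiguous words."""
    for i, tag in enumerate(tags):
        if tag is not None:
            continue
        prev_tag = tags[i - 1] if i > 0 else None
        next_tag = tags[i + 1] if i < len(tags) - 1 else None
        # Context frame rule: det - X - n  =>  X/adj
        if prev_tag == "det" and next_tag == "n":
            tags[i] = "adj"
        # Morphological rule: determiner + word ending in -tion  =>  noun
        elif prev_tag == "det" and tokens[i].endswith("tion"):
            tags[i] = "n"
    return tags

# apply_rules(["the", "bright", "star"], ["det", None, "n"])
# -> ["det", "adj", "n"]
```

A real rule-based tagger would iterate many such rules, typically learned or hand-tuned, until no unknown tags remain.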

Several taggers go beyond contextual and morphological information by defining rules related to other factors, such as the use of punctuation. Analogously, in programming languages, the use of camel case or an underscore in identifiers can indicate how well-written a program is, and this idea can be exploited in different research areas.

Overall, rule-based taggers largely require supervised training; they are somewhat hard to build and not very robust. However, a great deal of interest in the automatic induction of rules has emerged. This can be illustrated as running untagged text through a tagger and watching how it performs. In this first phase, human scrutiny and correction are carried out to fix any errors in the tagging process. Then the correctly tagged text is submitted to the tagger, which learns correction rules by comparing the two sets of data. In some cases, multiple iterations of this process are required.

2.1.2 The Stochastic (Probabilistic) Approach

‘Stochastic tagger’ is a general term that can denote any number of different methods employed to address the issue of part of speech tagging. To properly label a model as stochastic, it should incorporate probabilistic information, such as statistics and frequencies.

The simplest stochastic taggers disambiguate words based on the probability that a word appears with a specific tag; the tag assigned to an ambiguous word is the one the word most frequently carries in the training data. The issue with this method is that, although it may offer a proper tag for the word at hand, it can also create sequences of tags that are inadmissible.
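The most-frequent-tag strategy described above can be sketched in a few lines of Python; the tiny training corpus here is invented purely for illustration.

```python
from collections import Counter, defaultdict

# Minimal sketch of the simplest stochastic tagger: each word gets the
# tag it most frequently carried in the (invented) training data.
training = [("the", "DET"), ("can", "NOUN"), ("can", "VERB"),
            ("can", "VERB"), ("run", "VERB"), ("run", "NOUN")]

counts = defaultdict(Counter)
for word, tag in training:
    counts[word][tag] += 1

def unigram_tag(word):
    if word in counts:
        return counts[word].most_common(1)[0][0]
    return "NOUN"  # naive fallback for unseen words

# unigram_tag("can") -> "VERB" (seen twice as VERB, once as NOUN)
```

Note that this tagger considers each word in isolation, which is exactly why it can emit inadmissible tag sequences.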

Calculating the probability of a given sequence of tags is an alternative to the word frequency approach. This approach is usually known as the n-gram method, denoting the fact that the best tag for the word at hand is decided by the probability of its occurring with the tags that precede it. The Viterbi Algorithm is the most common algorithm implementing the n-gram method. It is a search procedure that avoids the polynomial expansion of a breadth-first search by ‘trimming’ the search tree at each level, keeping only the best N maximum likelihood estimates (where N denotes the total number of tags for a related word).
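The bigram (n = 2) variant of the n-gram method can be sketched with a small Viterbi decoder; the transition and emission probabilities below are invented for a two-tag toy example, not estimated from any corpus.

```python
import math

# Toy bigram Viterbi decoder over hand-made probabilities.
# trans holds P(tag | previous tag), emit holds P(word | tag).
TAGS = ["NOUN", "VERB"]
trans = {("<s>", "NOUN"): 0.7, ("<s>", "VERB"): 0.3,
         ("NOUN", "NOUN"): 0.3, ("NOUN", "VERB"): 0.7,
         ("VERB", "NOUN"): 0.6, ("VERB", "VERB"): 0.4}
emit = {("flies", "NOUN"): 0.4, ("flies", "VERB"): 0.6,
        ("time", "NOUN"): 0.9, ("time", "VERB"): 0.1}

def viterbi(words):
    # best[tag] = (log probability, best tag sequence ending in tag)
    best = {t: (math.log(trans[("<s>", t)] * emit[(words[0], t)]), [t])
            for t in TAGS}
    for w in words[1:]:
        nxt = {}
        for t in TAGS:
            p, prev = max(
                (best[pt][0] + math.log(trans[(pt, t)] * emit[(w, t)]), pt)
                for pt in TAGS)
            nxt[t] = (p, best[prev][1] + [t])
        best = nxt
    return max(best.values())[1]

# viterbi(["time", "flies"]) -> ["NOUN", "VERB"]
```

At each position only the best path into each tag is kept, which is the "trimming" that avoids enumerating every possible tag sequence.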

Further complexity can be introduced to a stochastic tagger by combining the two approaches discussed above, blending word frequency calculations with tag sequence probabilities. This approach is referred to as the Hidden Markov Model. The model makes the following assumptions:

1. Each word is independent of the other words and of their tags;

2. The probability of a tag depends only on the N previous tags [Dermatas, Kokkinakis 1995].

Employing the Viterbi Algorithm, both hidden and visible models (i.e., the Hidden Markov Model and the visible Markov Model) can be applied; these are considered the most proficient of the tagging techniques. Nonetheless, the Hidden Markov Model cannot be employed in a fully automated tagging schema because of its crucial reliance on the calculation of statistics over the tag states or output sequences. The Baum-Welch Algorithm is an appropriate solution for training a Hidden Markov Model to automatically process the tagging schema. The Baum-Welch Algorithm, also referred to as the Forward-Backward Algorithm, employs word rather than tag information to robustly map sequences, which improves the probability of the training data [Guilder 1995].

The stochastic technique is appealing compared to traditional rule-governed methods because it is the most automatic: it comes with built-in procedures that assist in statistical automation, and little manual knowledge is needed to run the systems.1

1 One of the oldest taggers is a simple rule-based tagger that automatically learns rules. This tagger shows that the stochastic method is not the only viable speech tagging method, as these rule-based taggers presented significant advantages over stochastic ones [Brill 1992]: Brill, E. (1992), "A simple rule-based part of speech tagger", in Proceedings of the Third Conference on Applied Natural Language Processing, Trento, Italy: Association for Computational Linguistics, pp. 152-155.

2.1.3 Architecture of Part of Speech Taggers

Part of Speech (PoS) taggers for natural language leverage large amounts of knowledge about English words and their use in sentences. However, each tagger follows certain steps to reach the desired results (see Figure 2.2).

Tokenization → Ambiguity look-up → Disambiguation

Figure 2.2 The common process of part of speech taggers

In the first step, a given text is divided into tokens; each token can be a word, an article, or a punctuation mark. These tokens are then used as input for a lexical analyzer. In the ambiguity look-up step, a lexicon provides a list of word forms and their parts of speech, and a guesser analyzes unknown tokens; together these identify the ambiguous words. The disambiguation step is based on information about the word itself, such as its probability, as well as contextual information such as word/tag sequences. For example, a word might be tagged as a noun rather than a verb if the preceding word is a preposition or an article. This step is the most difficult in the tagging process.
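Assuming a toy lexicon, the three stages of Figure 2.2 (tokenization, ambiguity look-up, disambiguation) can be sketched as follows; the mini-lexicon, the naive guesser, and the single contextual rule are all invented for illustration.

```python
import re

# Toy sketch of the three-stage tagging pipeline with an invented lexicon.
LEXICON = {"the": ["DET"], "dog": ["NOUN"], "barks": ["NOUN", "VERB"]}

def tokenize(text):
    # Words and punctuation marks become separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)

def lookup(token):
    # Ambiguity look-up: the "guesser" naively proposes NOUN when unknown.
    return LEXICON.get(token.lower(), ["NOUN"])

def disambiguate(tokens):
    tagged, prev = [], None
    for tok in tokens:
        if not re.match(r"\w", tok):
            tagged.append((tok, "."))  # punctuation tag
            prev = "."
            continue
        candidates = lookup(tok)
        tag = candidates[0]
        # Toy contextual rule: after a noun, prefer a verb reading.
        if prev == "NOUN" and "VERB" in candidates:
            tag = "VERB"
        tagged.append((tok, tag))
        prev = tag
    return tagged

# disambiguate(tokenize("The dog barks."))
# -> [("The","DET"), ("dog","NOUN"), ("barks","VERB"), (".",".")]
```

Here "barks" is ambiguous in the lexicon, and only the contextual rule in the disambiguation stage resolves it to a verb.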

2.1.4 Tagsets

In any language in the world, a word can take the form of a noun, verb, adjective, adverb, pronoun, preposition, or conjunction. The part of speech tagger’s role is to tag or label a word with its appropriate tag, as defined in Section 2.1. A tagset is the set of tags from which the tagger chooses the appropriate one for a given word; such a set is a prerequisite for providing lexical resources, specifically dictionary entries and grammar rules. Indeed, any available state-of-the-art tagger is given a standard tagset to complete the tagging process. There are a number of English part of speech tagsets available from the Natural Language Processing (NLP) community.

Tagsets in Natural Language Processing are used with corpora. A corpus is a finite, machine-readable body of naturally occurring text that is chosen according to certain criteria. Criteria may include: the language (e.g., English/Arabic/German, and whether it is standard or dialect), genre (such as novels), domain (such as newspapers or software manuals), and size (such as 2,000 words, 40K words, or 2M words). Some part of speech annotated reference corpora include the Brown Corpus, the British National Corpus (BNC), and the Penn Treebank. Brief information on each of these corpora is given in the following paragraphs.

The Brown Corpus, formally the Brown University Standard Corpus of Present-Day American English, was compiled in the 1960s in the field of corpus linguistics. It contains 500 samples of English-language text, comprising one million words drawn from a wide variety of sources. A selection of approximately 85 parts of speech is used in the tagged Brown Corpus2, complemented by special indicators for compound forms, contractions, foreign words, and a few other phenomena. This corpus formed the basis for many later corpora such as the Lancaster-Oslo-Bergen Corpus.

The British National Corpus (BNC) contains 100 million words of balanced British English from a wide variety of genres. The project involved the collaboration of three publishers: Oxford University Press, Longman, and W. & R. Chambers; two universities: the University of Oxford and Lancaster University; and the British Library. The BNC Basic (C5) tagset uses 61 tags3. Two revisions were released during the development of this corpus: BNC World in 2001 and the BNC XML Edition in 2007. The BNC was the focus of computational linguists who aimed to create a collection of texts of modern, naturally occurring language, in the form of speech and writing, that a computer could analyze. 90% of this corpus consists of samples of written language and 10% of spoken language; it is open academically, commercially, and educationally.

2 See “Brown Corpus” on Wikipedia for a full list of the tagset used.

3 Look for “The BNC Basic (C5) Tagset” at the University of Oxford for the complete list.

The Penn Treebank is a corpus that contains 2 million words of American English newswire, a set of English texts from the Wall Street Journal (WSJ). The corpus is distributed in 25 directories, each containing 100 files with several sentences. It is the most common corpus in NLP today, and it uses a set of 45 tags4. The entire corpus is automatically labeled with part of speech tags and syntactic labels, which in turn allows the Penn Treebank to develop corpora with semantic and other linguistic information. It has been created both manually, with linguists annotating each sentence with syntactic structure, and semi-automatically, whereby a parser assigned some syntactic structures that linguists then checked and corrected. An automatically parsed corpus that is not corrected by human linguists can still be useful5. Several syntactic treebanks have been developed for a wide variety of languages, for instance, the Penn Chinese Treebank, the Greek Dependency Treebank, and the Penn Arabic Treebank.

The above-mentioned corpora use tagsets that have more than 40 tags; on the contrary, a large number of applications require only a simple basic tagset. Developers of the Natural Language Tool Kit (NLTK6) have provided a simplified universal tagset of only 12 tags (see Table 2.3) that can be used with all the previously stated corpora, as well as various others such as the Web Text, Reuters, and Gutenberg corpora. Furthermore, NLTK gives users the ability to load their own corpora.

4 A full list of all the 45 tags is available at the Penn Treebank Project website.

5 A general improvement for parsers could be achieved here by applying them to large amounts of text and by gathering rule frequencies.

6 NLTK is a suite of libraries for NLP, which will be discussed further in later chapters.


Each corpus needs a tagset to be used for its part of speech tagging. This in turn reveals the fact that the linguists developing these sets are looking to capture different grammatical distinctions. These might include morphological subcategories, such as number for nouns and tense and person for verbs, or syntactic subcategories, such as distinguishing between adjectives in attributive and predicative positions.

Table 2.3 The NLTK universal language tagset

Tag    Meaning               English examples
ADJ    Adjective             new, good, high, special, big
ADP    Adposition            on, of, at, with, by, into, under
ADV    Adverb                really, already, still, early, now
CONJ   Conjunction           and, or, but, if, while, although
DET    Determiner, Article   the, a, some, most, every, which
NOUN   Noun                  year, home, cost, time, Africa
NUM    Numeral               twenty-four, fourth, 1991, 14:24
PRT    Particle              at, on, out, per, that, up, with
PRON   Pronoun               he, their, her, its, my, I, us
VERB   Verb                  is, say, told, given, playing
.      Punctuation marks     . , ; !
X      Other                 esprit, dunno, gr8, university
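To illustrate how a rich tagset collapses onto the 12-tag universal set, here is a partial Penn Treebank to universal mapping; only a representative handful of the 45 Penn tags is shown, and the helper function is ours, not part of NLTK.

```python
# A few representative entries from a Penn Treebank -> universal tagset
# mapping (the full mapping covers all 45 Penn tags; this subset is
# for illustration only).
PENN_TO_UNIVERSAL = {
    "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN",
    "VB": "VERB", "VBD": "VERB", "VBG": "VERB",
    "JJ": "ADJ", "RB": "ADV", "DT": "DET",
    "IN": "ADP", "CD": "NUM", "PRP": "PRON",
}

def simplify(tagged):
    # Unknown Penn tags fall back to the catch-all universal tag X.
    return [(w, PENN_TO_UNIVERSAL.get(t, "X")) for w, t in tagged]

# simplify([("the","DT"), ("cats","NNS"), ("slept","VBD")])
# -> [("the","DET"), ("cats","NOUN"), ("slept","VERB")]
```

The simplification is lossy by design: distinctions such as noun number (NN vs. NNS) and verb tense (VB vs. VBD) disappear, which is exactly the trade-off the universal tagset makes for portability.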

2.2 Natural Language in Source Code

Natural language in source code appears in two main components: program identifier names and comments. These sources of information are very important to programmers and developers, as well as for program understanding and program maintenance tasks.

2.2.1 Program Identifiers

Program identifiers are names given to program entities such as variables, functions, and methods. Formally, they are the program element names from which programmers acquire knowledge of concepts [Deissenbock, Pizka 2005]. These names convey relevant information about the role and properties of the objects or entities they are intended to label, and this valuable information can be used by NLP tools for several program understanding activities.

Identifier Naming: there have been several publications on identifier naming in source code that show how important this process is. Conventional wisdom says that choosing meaningful identifier names improves the ability of the next engineer to comprehend the code. [Caprile, Tonella 2000; Knuth 2003] agree with this idea, stating that: “identifier names are one of the most important sources of information about program entities.” In their work, [Lawrie, Morrell, Feild, Binkley 2006] investigated three levels of identifier naming: full words, abbreviations, and single letters. The study found better program comprehension when full word identifiers are used rather than single letter identifiers, as measured by description rating and confidence in understanding.

Identifier naming is in principle an arbitrary process; however, programmers do not select names arbitrarily. Rather, programmers choose and use names in regular, systematic ways that reflect deep cognitive and linguistic influences. This in turn allows names to carry semantic cues that aid in program understanding and support the larger software development process. High quality identifier names are a key practice that every developer should pay attention to in software engineering. The ability to recover software architecture from the names of source files [Anquetil, Lethbridge 1999] and the mining of concept keywords from identifiers [Ohba, Gondow 2005] are two examples of how important identifier naming is. The importance of this aspect also appears in the drive toward code readability and comprehension, which has been intensely studied by [Biggerstaff, Mitbander, Webster 1993], [Rajlich, Wilde 2002], [Buse, Weimer 2010], [Liblit, Begel, Sweetser 2006], and [Takang, Grubb, Macredie 1996].

According to [Deissenbock, Pizka 2005], identifiers represent 70% of source code tokens. [Eshkevari et al. 2011] explored how identifiers change in code, while [Lawrie, Feild, Binkley 2006] studied the consistency of identifier naming. [Abebe, Haiduc, Tonella, Marcus 2011] discuss the importance of naming to concept location, while [Caprile, Tonella 2000; Rajlich, Gosavi 2004] propose a framework for restructuring and renaming identifiers based on custom rules and dictionaries. Furthermore, [Rilling, Klemola 2003] observe, “In computer programs, identifiers represent defined concepts”, and, further, identifiers that fail to be concise or consistent increase comprehension complexity and its associated costs [Deissenbock, Pizka 2005].

Naming Conventions: these refer to how the programmer can eliminate a “useless freedom” (i.e., the freedom to apply more than one name to the same thing) by using a set of rational rules. The way developers name program components determines the readability of the program’s source code. To make this more formal, each programmer or developer should follow predefined naming conventions in software engineering. According to [Allamanis, Barr, Bird, Sutton 2014], a coding convention is a syntactic restriction not imposed by a programming language grammar. Following such rules has a great impact on several software engineering tasks, including program comprehension and software maintenance. Coding conventions are standard practice, and study results emphasize the importance of selecting accurate and applicable rules [Boogerd, Moonen 2008]. Moreover, naming conventions are important because “studies of how people name things (in general not just in code) have shown that the probability of having two people apply the same name to an object is between 7% and 18%, depending on the object” [Butler, Grogono, Shinghal, Tjandra 1995]. To investigate the extent to which developers follow conventions, [Butler, Wermelinger, Yu 2015] presented a naming convention checking library for Java that allows the declarative specification of conventions regarding typography and the use of abbreviations and phrases. In their evaluation of 3.5 million reference name declarations, they found that a median of over 85% of declarations adhere to conventions. That result suggests that, to some extent, Java developers are aware of the importance of naming conventions.
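A declarative convention check of this flavor can be sketched with a few regular expressions; the convention names and patterns below are hypothetical and are not taken from the library evaluated by [Butler, Wermelinger, Yu 2015].

```python
import re

# Hypothetical declarative convention checks in the spirit of the
# typography conventions described above.
CONVENTIONS = {
    "method":   re.compile(r"^[a-z][a-zA-Z0-9]*$"),  # camelCase
    "class":    re.compile(r"^[A-Z][a-zA-Z0-9]*$"),  # PascalCase
    "constant": re.compile(r"^[A-Z][A-Z0-9_]*$"),    # UPPER_CASE
}

def adheres(kind, name):
    """True if the identifier follows the convention for its kind."""
    return bool(CONVENTIONS[kind].match(name))

# adheres("method", "getName") -> True
# adheres("class", "linkedList") -> False
```

Checking a whole code base then reduces to mapping each declaration to its kind and counting how many names the corresponding pattern accepts.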

Importance of Identifier Naming: several research projects have considered the issue of identifier naming conventions. [Anquetil, Lethbridge 1998] argued that naming conventions are reliable if there is equivalence between the names of software artifacts and the concepts they implement. Indeed, it would be very useful to be able to rely on the names of software artifacts to identify the various implementations of the same concept. [Deissenbock, Pizka 2005] created a formal model for concepts and names that is used to determine reliable names. They defined two characteristics of well-formed identifiers: conciseness and consistency. A mapping from the identifier domain to the conceptual domain is essential to confirm that a variable name is both consistent and concise; identifiers are consistent when they are named according to the concepts to which they map. [Caprile, Tonella 1999], on the other hand, use a grammar to define the naming convention and use it to find semantic meanings in the words. In addition to naming conventions, [Takang, Grubb, Macredie 1996] have examined how informative identifiers are.

Identifier Splitting: a vast number of programmers use two or more words when naming identifiers (multiword identifiers). These names can be written using different cases, such as:

• Camel case (e.g., doubleLinkedList).

• Pascal case (e.g., DoubleLinkedList).

• Underscores (e.g., Double_Linked_List).

• Snake case (e.g., double_linked_list).

In contrast to natural languages, in which word delineation is achieved by punctuation and spaces, identifiers contain no spaces. Thus, to enhance the efficiency of a number of software maintenance tools, multiword identifiers must be split into their constituent words in order to get the expected benefits out of these natural language clues. The automatic splitting of multiword identifiers into their constituent words is straightforward if coding conventions have been followed.
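The splitting of the conventions listed above can be sketched with a simple regular-expression splitter. This is a minimal illustration that assumes the conventions are followed, not one of the published splitting algorithms:

```python
import re

def split_identifier(name):
    """Split a multiword identifier into its constituent words.

    Handles camelCase, PascalCase, underscores, and snake_case by first
    breaking on underscores, then on case transitions (a run of capitals
    followed by lowercase, a capitalized word, a lowercase run, or digits).
    """
    words = []
    for part in name.split('_'):
        words += re.findall(r'[A-Z]+(?![a-z])|[A-Z][a-z]*|[a-z]+|\d+', part)
    return [w.lower() for w in words]

print(split_identifier("doubleLinkedList"))    # ['double', 'linked', 'list']
print(split_identifier("Double_Linked_List"))  # ['double', 'linked', 'list']
```

All four conventions reduce to the same word list, which is what downstream natural language analysis needs.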


Several techniques for identifier splitting have been developed. Linsen aims to find a mapping between each source code identifier and the corresponding set of dictionary words by exploiting high-level and domain-dependent information gathered from different dictionaries [Corazza, Di Martino, Maggio 2012], and Tidier was proposed by [Guerrouj, Di Penta, Antoniol, Guéhéneuc 2013]. However, some forms of identifiers (such as PROGRAMstatus, DAYofBIRTH, FIELDLENGTH) are challenging even for state-of-the-art techniques. Previous studies have shown that splitting algorithms can perform well for identifiers with certain characteristics while being challenged by those with other characteristics, as discussed in the work of [Butler, Wermelinger, Yu, Sharp 2011] and [Enslen, Hill, Pollock, Vijay-Shanker 2009].

Abbreviated Identifiers: a large number of programmers tend to abbreviate their program identifiers while coding, especially identifiers that must be typed frequently and domain-specific words used in comments. However, this behavior does not support natural language processing tasks and may cause several program-understanding difficulties. For this reason, a number of researchers are trying to find ways to expand abbreviated identifiers found in source code.

Expanding abbreviations not only helps with program comprehension but also supports several software engineering tasks, such as code summarization and concept location, as well as enhancing software maintenance tools that utilize natural language information. Automatically expanding abbreviated identifiers into full words gives access to words and their associated meanings that were previously meaningless sequences of characters. A number of tools and methodologies for expanding identifiers are available. [Lawrie, Feild, Binkley 2007] presented a methodology for expanding identifiers and evaluated the process on a code base of slightly over 35 million lines of code. [Hill et al. 2008] presented an automated approach to mining expansions from source code.

In regard to part of speech tagging, abbreviation expansion can improve the effectiveness of analyzing source code identifiers alongside comments, which could result in accurate tagging of words buried behind abbreviations.
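A minimal sketch illustrates the idea of dictionary-based expansion; the dictionary and the subsequence matching rule here are illustrative assumptions, not the technique of any particular tool:

```python
# An abbreviation matches a dictionary word when its letters appear in
# the word in order, starting with the same first letter (e.g., "ctr"
# can expand to "counter"). The small dictionary is purely illustrative.
DICTIONARY = ["counter", "control", "pointer", "parameter", "message", "number"]

def matches(abbrev, word):
    if not word.startswith(abbrev[0]):
        return False
    pos = 0
    for ch in abbrev:
        pos = word.find(ch, pos)
        if pos < 0:
            return False
        pos += 1
    return True

def expand(abbrev):
    """Return all dictionary words the abbreviation could expand to."""
    return [w for w in DICTIONARY if matches(abbrev, w)]

print(expand("ctr"))  # ['counter', 'control']
print(expand("ptr"))  # ['pointer', 'parameter'] -- ambiguity needs context
```

The ambiguous second result shows why real expansion techniques must bring in context, such as surrounding code or domain dictionaries, to choose among candidates.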

Verb-DO pairs: [Booch 1982] stated that, in programming languages, verbs correspond to actions (or operations) and nouns correspond to objects. Consequently, [Shepherd et al. 2007] defined verb-DO pairs as two co-located identifiers in which the first identifier is an action or verb and the second identifier is used as the direct object of the first identifier’s action. [Fry et al. 2008] proposed strategies for automatically extracting precise natural language clues from Java source code in the form of verb–direct object (DO) pairs. Their evaluation study indicated that their techniques obtain 57% precision and 64% recall. The Apache OpenNLP library project, on the other hand, implemented Maxent models that use the verb-DO concept in determining direct-object phrases [Baldridge, Morton, Bierner 2005]. In terms of part of speech tagging, such approaches make tagging verb-DO pairs or action words in source code much easier, thereby enhancing source code part of speech tagging results.
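The idea of reading a verb-DO pair out of a method name can be sketched as follows. This is an illustrative simplification with an assumed verb list, not the extraction strategy of [Fry et al. 2008]:

```python
import re

# If the first word of a camelCase method name is a known action verb,
# treat the remaining words as the direct object. The verb list is an
# illustrative assumption.
ACTION_VERBS = {"get", "set", "add", "remove", "print", "update"}

def verb_do_pair(method_name):
    words = [w.lower() for w in re.findall(r'[A-Z]?[a-z]+', method_name)]
    if len(words) >= 2 and words[0] in ACTION_VERBS:
        return (words[0], " ".join(words[1:]))
    return None  # no verb-DO structure detected

print(verb_do_pair("setName"))      # ('set', 'name')
print(verb_do_pair("addAllItems"))  # ('add', 'all items')
print(verb_do_pair("size"))         # None
```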


2.2.2 Comments

Another valuable source of information for program understanding is internal source code comments. Comments aid programmers in understanding source code faster and more deeply, without needing to read the implementation details. If source code is well documented with reliable comments, its readability improves [Tenny 1988].

More specifically, comments are crucial to sustaining software maintainability and aid in reverse engineering, for example when applying the Read All the Code in One Hour reengineering pattern. In fact, a number of researchers have studied the quality of source code comments and the use of comments by developers during understanding and maintenance activities. [Elshoff, Marcotty 1982] and [Fluri, Wursch, Gall 2007] stated that comments, as well as the structure of the source code, aid in program understanding and therefore reduce maintenance costs. This finding was confirmed by the studies of [Tenny 1988] as well. Another study, described in [Aman, Okazaki 2008], conducted an empirical experiment analyzing the relation between the comment density of a project and its stability. The results showed that projects with a higher comment density tend to be more stable during upgrades.

During the development phase, software developers use natural language to describe the code they write, assisting themselves or other programmers and allowing the code to be easily comprehended during the maintenance phase. Such descriptions constitute internal source code comments. According to [Freitas 2011], comments can be divided into two types: line (or inline) comments and block comments. The number of lines these comments can contain and the syntax used to define them are the only differences between the two types7. Furthermore, there is no standard for the amount of comments that can be included in source code. [Arafat, Riehle 2009] found that in open-source projects with a high percentage of comments and high comment quality, almost 19% of the source code consists of comments. Most importantly, inconsistent comments can mislead the reader [Tan, Yuan, Krishna, Zhou 2007].

7 Inline comments can only contain a single line, whereas block comments can contain one or more lines.


Related Work

This chapter discusses the work most closely related to this thesis. More specifically, it describes how other research uses part of speech tagging on program identifiers and comments. There are a number of investigations into applying Natural Language Processing (NLP) techniques to source code to support various program comprehension tasks. All of these attempts aim to leverage natural language processing techniques to take full advantage of the wealth of information program identifiers provide.

[Gupta, Malik, Pollock, Vijay-Shanker 2013] present a part of speech tagger and syntactic chunker for source code names that takes into account programmers’ naming conventions to understand the regular, systematic ways a program element is named. In their work, they use WordNet and morphological rules to assign all possible part of speech tags to each word. Their evaluation results show a significant improvement in accuracy (11%-20%) of part of speech tagging of identifiers over current approaches.

[Abebe, Tonella 2010] present an approach that extracts concepts and relations from the source code. More specifically, they apply natural language parsing to sentences constructed from source code identifiers in order to support program maintenance.

[Binkley, Hearn, Lawrie 2011] investigate how an English prose tagger performs on source code identifiers. They measure the accuracy of the Stanford Log-linear part of speech tagger on a large corpus of over 145,000 structure-field names. They found that this tagger, with minimal guidance, achieved an accuracy of 81.7% on source code identifiers, showing that the natural language found in software differs from standard English prose.

[Falleri et al. 2010] propose an automated approach to extract identifiers from source code and organize them in a WordNet-like structure. They evaluate their work against a corpus of 24 open source Java programs. Their results give useful insights into the overall software architecture and how it can be used to improve the results of many software engineering tasks.

[Fry et al. 2008] propose the term ‘Natural Language Program Analysis’ (NLPA) for any natural language analysis of source code. They present an automatic approach for discovering useful, precise verb–DO pairs from source code in order to extract useful natural language clues. They also conducted an empirical evaluation of the effectiveness of the automatic technique for precisely extracting verb–direct object pairs from Java source code.

Several researchers have studied the effect of part of speech on Java program identifiers. Some focused on the relationship of noun and verb usage in code, such as [Liblit, Begel, Sweetser 2006] and [Shepherd, Pollock, Vijay-Shanker 2007], while others, such as [Williams, Hill, Pollock, Shanker 2007], focused on preposition usage and its potential effect on software tools.


The distinctive ways in which programmers select and use names in cognitively motivated ways are discussed by [Liblit, Begel, Sweetser 2006]. They found that methods with verb-phrase names in the imperative mood, such as add, addAll, addElement, clear, copyInto, ensureCapacity, setSize, and trimToSize, are actions that actively change the state of the program. In contrast, methods that return some useful piece of data of interest to the caller act as mathematical functions: they passively compute a result but do not alter the state of the program. In particular, true/false methods with verb-phrase names in the indicative mood, such as contains, containsAll, equals, and isEmpty, are factual assertions, while methods returning values other than true/false that have singular or plural names, such as capacity, clone, elementAt, elements, firstElement, hashCode, indexOf, lastElement, lastIndexOf, size, and subList, are things.

A number of studies investigate source code comments. [Vinz, Etzkorn 2008], for example, focused on understanding the sublanguage characterizing the comments of C++ software packages, and more specifically comment grammar. [Freitas, da Cruz, Henriques 2012], on the other hand, tried to locate problem domain concepts in comments and then identify the relevant code chunks associated with them. Several older studies, such as [Woodfield, Dunsmore, Shen 1981] and [Tenny 1988], conducted experiments to show the importance and role of comments in the comprehension of programs.


Overview of Approach

The previous chapter described research on applying natural language part of speech tagging to source code and comments. This assumes that programmers and programming languages use and define part-of-speech in source code in the same way that is done in natural language prose and grammar. In this study, we take a very different viewpoint. Instead of relying on English part-of-speech definitions of words, we attempt to extract new part-of-speech definitions of words in the context of the programming language and their usage in software systems. This chapter presents the core of the thesis and how we determine the part-of-speech of program identifiers (i.e., names of functions, types, variables, etc.) using heuristics.

4.1 Part of Speech Tagging in Programming Languages

Part of speech taggers for natural language (here we limit the discussion to English) leverage large amounts of knowledge about English words and their usage in sentences. Thus, they work fine for typical English prose, but lacking sentence structure (as in the case of source code), these methods break down. Additionally, the manner in which programmers use an identifier in a program is very different from how a writer uses a word in a sentence. There have been a few attempts by different researchers to explore how part of speech tagging could work on source code, as Chapter 3 showed.


The usage of natural language terms in source code is different, and the semantics of the identifiers are typically specialized for the domain of software. While it is true that there is some correspondence between an identifier and its English counterpart, drawing a direct comparison does not make sense in many cases. For example, using the letter ‘i’ as a program identifier most likely does not imply a personal pronoun referring to oneself. Rather, it is most likely a loop index. For this reason, an approach is proposed to define the part-of-speech in terms of how an identifier is used in the code rather than how it would be used in natural language prose.

4.2 Part of Speech Tagging Approach for Source Code

To begin, it is very important to consider that names of functions typically describe an action (on an object or parameter). Likewise, variables are typically nouns that describe an object in the domain. Of course, this heuristic is overly simplistic and does not take into consideration a wealth of relevant information that can be derived (statically) from the context of an identifier within the source code. Much like how natural language part of speech taggers use the context of a word in a sentence, we must use the context of the identifier in the program. Given this viewpoint, a set of heuristics that define the part of speech of identifiers in source code is proposed. That is, part-of-speech terms (e.g., noun, verb, etc.) are redefined using heuristics in the context of source code. Consequently, instead of assigning part of speech to identifiers based on a word’s usage in English prose, it is assigned based on how the identifier is used in source code.


The focus here is to determine how an identifier expresses the intent of the entity it represents. Thus, if an identifier is characterized as a verb, that identifier must represent some sort of action, whereas if an identifier is tagged as a noun, it must represent some sort of object in the system. To accomplish this task, some information about how an identifier is defined, how it is used, and its context is required. To get this information, srcML8 is used, allowing examination of statically computable information about source code entities. Alongside srcML, we use the knowledge of method stereotypes9 [Alhindawi, Dragan, Collard, Maletic 2013] (examples of stereotypes are given in Table 4.1). Primarily, our heuristics use stereotypes to determine how a function behaves with respect to its arguments, local variables, and the calling object (if applicable). In fact, the suggested approach is strongly related to stereotypes: while stereotypes categorize at the method and class level, our approach categorizes at the identifier level.

More formally, the heuristics are defined by presenting a part-of-speech term and then stating its definition with respect to source code, as opposed to its use in English. As they are heuristics, there is room for debate on whether they are completely correct. However, the intention with these base heuristics is that more data may be collected and tighter definitions obtained using techniques already employed in the NLP community. The next section shows in detail the proposed definitions of parts of speech in terms of source code.

8 srcML is a markup language that wraps source code with AST information.

9 Stereotypes give the user information about how a function is being used.


4.2.1 Heuristic Rules on Program Identifiers

The first step is to summarize the motivation behind each definition and then give a list of rules that an identifier must satisfy to be classified under the given term. Note that variables are assigned a part of speech only when they are declared, since the type and name are known at that time. One exception is function identifiers (names), where the part of speech is assigned when they are defined, because the assignment is based on the stereotype of the definition.

In an object-oriented system, the primary way to collect, relate, and move data is to use objects. As discussed in section 2.2.1, a unique name is given to each program object or entity, represented by an identifier name. This is, in fact, analogous to proper nouns in English, which represent the names of unique objects: a person’s name, a location’s name. An identifier in source code is a proper noun only if it satisfies the following rules:

• It names a first class user-defined object.

• The identifier does not appear as a member of another class (i.e., it is not part of a composition relationship).

A noun in English, on the other hand, specifies a set of objects that do not have their own unique identity. These are words like building, food, and device. In object-oriented code, these are akin to identifiers that represent objects making up part of the composition of a larger unique object. Consequently, an identifier in source code is a noun if it satisfies the following rules:

• It names a first class user-defined object.


• The identifier appears as a member of another class (i.e., it is used in a composition relationship; this is where nouns differ from proper nouns).

Adjectives in English describe nouns: a person’s hair color, the age of a city. In source code, these are identifiers whose primary purpose is to convey a characteristic of something: its size/length, whether it is true or false, a radius, a file handle, etc. Primitive types are often used for this purpose, particularly primitives that make up part of a class. For this reason, all identifiers whose types are primitive are considered adjectives.

Accordingly, an identifier is an adjective if it satisfies the following rules:

• The type that the identifier’s value represents is primitive (e.g., int, float, bool). Note that the type of a function’s identifier is its return type.

• If the identifier represents a function, then it further satisfies the constraint that it does not apply any modifications to any of:

o Its aliased arguments (i.e., it does not modify any references or pointers).

o The calling object (i.e., this).

• The identifier does not represent a pointer, reference, const reference, or an array.

Pronouns in English are used as references to nouns or proper nouns; which noun or proper noun they refer to is based on context, e.g., the word “she” can refer to any female person. In source code, these are akin to reference variables and pointers. Hence, an identifier is a pronoun if it satisfies the following rules:

• It names a pointer or reference to a user-defined object or primitive value.


• It is not const; that is, it can be pointed at something else. This implies that parameters passed by const reference in C++ are not pronouns (as const reference passing is an optimization to avoid a copy constructor call).

Verbs represent actions in English: you run, you kick a ball, you play a game. In source code, verbs instead modify the state of the system. Therefore, an identifier is a verb if it satisfies the following:

• It is the name of a function that applies some modification to at least one of three things:

o One or more of its arguments.

o The calling object (this).

o A local variable whose value is then returned.

• It is not const in both its return type and the calling object (i.e., it has to perform some useful modification).
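Taken together, the rules above amount to a small decision procedure over statically derived facts about an identifier. The sketch below is only an illustration under assumed inputs, not the srcNLP implementation; the field names of IdentifierFacts are hypothetical stand-ins for the information derived from srcML:

```python
from dataclasses import dataclass

@dataclass
class IdentifierFacts:
    # Hypothetical statically-derived facts about a declared identifier.
    is_user_defined_type: bool = False  # names a first-class user-defined object
    is_class_member: bool = False       # appears as a member of another class
    is_primitive: bool = False          # int, float, bool, ... (return type for functions)
    is_pointer_or_ref: bool = False     # pointer, reference, or array
    is_const: bool = False
    is_function: bool = False
    modifies_state: bool = False        # mutates arguments, 'this', or a returned local

def part_of_speech(f):
    """Assign a source-code part of speech following the heuristics above."""
    if f.is_function and f.modifies_state and not f.is_const:
        return "verb"
    if f.is_pointer_or_ref and not f.is_const:
        return "pronoun"
    if f.is_primitive and not f.is_pointer_or_ref and not f.modifies_state:
        return "adjective"
    if f.is_user_defined_type and not f.is_class_member:
        return "proper noun"
    if f.is_user_defined_type and f.is_class_member:
        return "noun"
    return "unknown"

# e.g. 'int age;' declared as a class member -> adjective
print(part_of_speech(IdentifierFacts(is_primitive=True, is_class_member=True)))
# e.g. a local user-defined object -> proper noun
print(part_of_speech(IdentifierFacts(is_user_defined_type=True)))
```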

An example of how these heuristics can be applied to source code is given in Figure 4.1. Notice that only identifiers are tagged; types and class names are not. Addressing this is left to future work. A preliminary comparison between the results of the heuristics approach and those of another tagger (the NLTK tagger10) is illustrated in Figure 4.2. Notice that the variable ‘age’ is tagged as an adjective in terms of source code, whereas it is tagged as a noun in terms of English prose, which does not match the desired goal.

10 Explained further in Section 4.3.


class Person{
public:
    float returnAge()            //adjective
    {return curyear - yearborn;}
    void SetName(std::string n)  //verb
    {name = n;}
private:
    int age;                     //adjective
    std::string name;            //noun
    date yearborn;               //noun
};
vector split                     //verb
(string& str){                   //pronoun
    vector result;               //proper noun
    ...
    return result;
}

Figure 4.1 An example of applying heuristics

class Person{
public:
    float returnAge()            //noun
    {return curyear - yearborn;}
    void SetName(std::string n)  //noun
    {name = n;}
private:
    int age;                     //noun
    std::string name;            //noun
    date yearborn;               //noun
};
vector split                     //noun
(const string& str){             //noun
    vector result;               //noun
    ...
    return result;
}

Figure 4.2 The same previous example with the NLTK POS tagger (NLP tagger)


4.2.2 Part of Speech and Method Stereotypes

This section presents a deeper look at how method stereotypes are used to further refine the part of speech for identifiers. Formally, method stereotypes categorize methods based on their role in a given system. They are not based on the name of the method, but on static analysis of the code in the method. We refer the reader to [Dragan, Collard, Maletic 2006; 2010] for a complete definition. The list of stereotypes is provided in Table 4.1, and a brief discussion of each category is given in the next paragraph.

Table 4.1 Taxonomy of method stereotypes and their corresponding part of speech

Stereotype Category | Stereotype | Description | Part-of-Speech
Structural Accessor | get | Returns a data member. | Adjective/noun depending on return type
Structural Accessor | predicate | Returns a Boolean value that is not a data member. | Adjective
Structural Accessor | property | Returns info about data members. | Adjective
Structural Accessor | void-accessor | Returns information via a parameter. | Adjective
Structural Mutator | set | Sets a data member. | Verb
Structural Mutator | command, non-void-command | Performs a complex change to the object's state. | Verb
Creational | constructor, copy-const, destructor, factory | Creates and/or destroys objects. | Verb
Collaborational | collaborator | Works with objects (parameter, local or return value). | Verb
Collaborational | controller | Changes only an external object's state (not this). | Verb
Degenerate | incidental | Does not read/change the object's state. | N/A
Degenerate | empty | Has no statements. | N/A

Structural methods provide and support the structure of the class. For example, accessors read an object’s state while mutators change it. The identifier for a structural method corresponds primarily with adjectives because these methods ask about some characteristic of the object they are part of (isEmpty, getName, etc.). In the case of a mutator, however, it can also be a verb. A creational method’s task, on the other hand, is to create or destroy objects of the class. These correspond primarily to verbs; they completely construct an object, thereby changing the program’s state. A collaborational method’s job is to characterize the communication between objects and how objects are controlled in the system. These are primarily marked as verbs, but future work will investigate whether there are more complex patterns, particularly how verbs are applied to their intended target (which could be an argument, the calling object, etc.). Lastly, degenerate methods are methods that give us little information; in this case, applying our heuristics alone will fail.

It is important to state that, since a method may have more than one stereotype, a small finite state machine implements the rules for assigning a tag based on stereotypes. The approach is naïve and requires further research to be refined, but the implementation uses stereotypes to differentiate between Structural Accessors and the others. Anything combined with a Structural Accessor ends up as an adjective or a noun (since state is not modified by the method, but may be modified outside of the method when a member is returned and is not const). Currently, all other combinations are verbs.
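The stereotype-to-tag assignment just described can be sketched as follows. This is a simplified illustration of the combination rule, not the actual finite state machine in srcNLP; the stereotype names follow Table 4.1:

```python
# Accessor stereotypes map to adjective/noun; mutator, creational, and
# collaborational stereotypes map to verb; degenerate ones get no tag.
ACCESSOR = {"get", "predicate", "property", "void-accessor"}
VERBISH = {"set", "command", "non-void-command", "constructor",
           "copy-const", "destructor", "factory", "collaborator", "controller"}

def tag_from_stereotypes(stereotypes, returns_primitive=True):
    """A method may carry several stereotypes; any Structural Accessor
    forces adjective (primitive return) or noun (object return)."""
    if any(s in ACCESSOR for s in stereotypes):
        return "adjective" if returns_primitive else "noun"
    if any(s in VERBISH for s in stereotypes):
        return "verb"
    return None  # degenerate methods (incidental, empty) give no tag

print(tag_from_stereotypes(["get", "collaborator"]))           # adjective
print(tag_from_stereotypes(["get"], returns_primitive=False))  # noun
print(tag_from_stereotypes(["command"]))                       # verb
print(tag_from_stereotypes(["incidental"]))                    # None
```

The first call shows the combination rule: even combined with a collaborator stereotype, the accessor wins.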

4.3 Part of Speech Tagging on Source Code Comments

The heuristic rules created here deal with source code identifiers only and do not directly apply to source code comments. srcML marks comments with a tag but does no other parsing of the comments (e.g., into sentences). Therefore, the use of an available English tagger is vital. The Natural Language Toolkit (NLTK) tagger was chosen for several reasons: it is a Maximum Entropy based tagger, meaning it provides a principled way of incorporating complex features into probability models; it is trained on the Treebank corpus (the most common in NLP today); and a simple comparison study between the NLTK tagger and the Stanford tagger showed that the NLTK tagger is a better choice for our work11.

To show how effective part of speech tagging for source code comments is, a simple empirical study on random HippoDraw12 source code comments was carried out using the NLTK tagger (Figure 4.3).

11 Besides that, NLTK provides an easy-to-use interface to over 50 corpora and lexical resources, such as WordNet, along with a suite of libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, which gave us a strong belief in this tagger’s utility for our work.

12 An object-oriented statistical data analysis package written in C++.


Comment 1: “A class that does XML serialization and de-serialization of derived classes of AxisModelBase with XML”. Tagging Result: A/DET class/NOUN that/DET does/VERB XML/NOUN serialization/NOUN and/CONJ deserialization/ NOUN of/ADP derived/VERB classes/NOUN of/ADP AxisModelBase/NOUN with/ADP XML/NOUN ./.

Comment 2: "The attribute name for axis model that is logarithmic.” Tagging Result: The/DET attribute/NOUN name/NOUN for/ADP axis/NOUN model/VERB that/ADP is/VERB logarithmic/ADJ ./.

Comment 3: "The point size attribute name.” Tagging Result: The/DET point/NOUN size/NOUN attribute/NOUN name/NOUN ./.

Comment 4: "The attribute name for tick values.” Tagging Result: The/DET attribute/NOUN name/NOUN for/ADP tick/NOUN values/NOUN ./.

Comment 5: "A constructor that takes the tag name and XML controller object as arguments. Since a default constructor does not exist, derived class must use this constructor therefore guaranteeing that the tag name is always set by derived classes.” Tagging Result: A/DET constructor/NOUN that/DET takes/VERB the/DET tag/NOUN name/NOUN and/CONJ XML/NOUN controller/NOUN object/NOUN as/ADP arguments/NOUN ./. Since/ADP a/DET default/NOUN constructor/NOUN does/VERB not/ADV exist/VERB ,/. derived/VERB class/NOUN must/VERB use/VERB this/DET constructor/NOUN therefore/NOUN guaranteeing/VERB that/ADP the/DET tag/NOUN name/NOUN is/VERB always/ADV set/VERB by/ADP derived/VERB classes/NOUN ./.

Figure 4.3 The result of the NLTK tagger on HippoDraw source code comments

Thus, in the work presented, when a comment is encountered in the code, it is characterized using the NLTK tagger, and each word in the comment is given a part of speech tag, as Figure 4.3 shows. There is some mis-tagging in the results, but this does not invalidate the main idea proposed here. As an added feature, identifiers from the code that are also used in comments are tagged with the part of speech produced by the heuristics approach; this is possible because multiple passes over the code can be taken, making an identifier’s part of speech in comments consistent with its usage in the code.
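The consistency pass just described can be sketched as a simple override: wherever a comment word matches a known code identifier, the code-derived tag replaces the prose tag. The tags and identifier table below are illustrative, not output from the actual tools:

```python
def retag_comment(tagged_words, code_pos):
    """Override prose tags with code-derived tags for known identifiers.

    tagged_words: list of (word, prose_tag) pairs, e.g. from a prose tagger
    code_pos: identifier -> part of speech assigned by the code heuristics
    """
    return [(w, code_pos.get(w, t)) for w, t in tagged_words]

# Illustrative data: 'age' is tagged NOUN by a prose tagger, but the
# code heuristics tagged the identifier 'age' as an adjective.
prose = [("the", "DET"), ("age", "NOUN"), ("attribute", "NOUN")]
code_pos = {"age": "ADJECTIVE"}
print(retag_comment(prose, code_pos))
# [('the', 'DET'), ('age', 'ADJECTIVE'), ('attribute', 'NOUN')]
```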


4.4 Implementation in srcML

The heuristics have been implemented in a tool called srcNLP. srcNLP uses srcML, libxml213, and a SAX14 parser to compute data about identifiers. Since srcML wraps identifiers with abstract syntax, the type of any identifier can be determined, as long as the type is statically computable. Furthermore, it is possible to determine whether identifiers or functions are const, whether identifiers are aliases, and so on. Essentially, a large amount of metadata can be tracked for any given identifier, whether a variable, object, or function name, in any given system. srcNLP uses this information, along with data gathered from the tool stereocode, which implements the method stereotype assignment, to determine the constraints of the proposed heuristics. Once determined, the part of speech is inserted directly into the srcML of the source code in the form of srcML tags with an nlp namespace (see Figure 4.4). This means that if an identifier is a noun, it is marked with a noun tag in srcML, and if it is a verb, it is marked with a verb tag. The current implementation of srcNLP is very fast; it is able to completely mark up 2.5 million lines of code in less than four minutes. This does not include the time required for marking up comments, since that part has not yet been integrated into the tool. An example of the markup is shown in Figure 4.5.
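The kind of SAX-based metadata collection srcNLP performs can be illustrated with a toy example over simplified srcML-like XML. This uses Python's standard library rather than libxml2, and the element nesting is an assumption for illustration; it is a sketch, not the srcNLP implementation:

```python
import io
import xml.sax

# Toy srcML-like input (element names and nesting are simplified).
SRC = b"""<unit>
<decl><type><name>int</name></type><name>age</name></decl>
<decl><type><name>string</name></type><name>name</name></decl>
</unit>"""

class DeclHandler(xml.sax.ContentHandler):
    """Collect (identifier, type) pairs from declarations -- the kind of
    statically computable metadata a tool like srcNLP tracks."""
    def __init__(self):
        super().__init__()
        self.stack = []        # currently open element names
        self.text = ""
        self.current_type = None
        self.decls = []        # (identifier, type) pairs

    def startElement(self, tag, attrs):
        self.stack.append(tag)
        self.text = ""

    def characters(self, content):
        self.text += content

    def endElement(self, tag):
        self.stack.pop()
        if tag == "name":
            if "type" in self.stack:
                self.current_type = self.text          # the declared type
            elif self.stack and self.stack[-1] == "decl":
                self.decls.append((self.text, self.current_type))

handler = DeclHandler()
xml.sax.parse(io.BytesIO(SRC), handler)
print(handler.decls)   # [('age', 'int'), ('name', 'string')]
```

With the type in hand, a primitive such as int would trigger the adjective heuristic for 'age'.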

13 Libxml2 is the XML C parser and toolkit.

14 Simple API for XML.


srcML: Apply srcML to the code base and get XML output. → stereocode: Use the stereocode tool to stereotype all the methods. → srcNLP: Add markup for part of speech using srcNLP.

Figure 4.4 The workflow of the heuristics approach

The approach starts by applying srcML to the code base, then uses stereocode to determine the stereotypes of all the methods; the result (a srcML document with stereotype information) is the input to srcNLP. It ends with srcNLP adding XML attributes for part of speech, using libxml2 and a SAX parser.

Figure 4.5 Example of how srcNLP tags source code identifiers

An evaluation tool called nlpCMP was implemented for the purpose of comparing the occurrence of program identifiers between systems, as well as comparing their heuristics-assigned part of speech in XML format. nlpCMP provides useful insights about which part of speech each identifier has across systems, and it includes the name of the file that contains each identifier, the line of code that uses it, and the total number of occurrences. Figure 5.3 in the next chapter will show an example.
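The cross-system comparison that nlpCMP performs can be sketched as follows; the identifier-to-tag maps here are illustrative, not real nlpCMP output:

```python
# Given identifier -> part-of-speech maps from two systems, report the
# identifiers tagged in both systems and whether their tags agree.
def compare_tags(system_a, system_b):
    shared = sorted(set(system_a) & set(system_b))
    return {name: (system_a[name], system_b[name], system_a[name] == system_b[name])
            for name in shared}

a = {"age": "adjective", "result": "proper noun", "name": "noun"}
b = {"age": "adjective", "result": "noun"}
print(compare_tags(a, b))
# {'age': ('adjective', 'adjective', True), 'result': ('proper noun', 'noun', False)}
```

Consistent tags across systems (as for 'age' here) are exactly the evidence the evaluation in the next chapter looks for.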

Currently, elementary data preparation is performed on identifier names: upper-case letters are converted to lower case, and any non-alphabetic symbols are removed. We believe that applying identifier-splitting techniques (section 2.2.1) [Binkley et al. 2013; Enslen, Hill, Pollock, Vijay-Shanker 2009; Hill et al. 2014], abbreviation expansion for abbreviated identifiers (section 2.2.1), and some handling of identifier naming conventions (section 2.2.1) would greatly increase the accuracy of the approach, since it would then be possible to look at the words that make up an identifier instead of at what is likely an agglomeration of multiple terms. Synonyms [Haiduc, Marcus 2008] are also an issue that will likely need to be addressed.
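The data preparation step described above can be sketched as follows (a minimal illustration; the function name and exact rules are assumptions, not the actual srcNLP code):

```python
import re

def normalize_identifier(name):
    """Lowercase an identifier and strip non-alphabetic characters.

    Mirrors the elementary data preparation described in the text; it does
    not split compound identifiers or expand abbreviations.
    """
    return re.sub(r"[^a-z]", "", name.lower())

print(normalize_identifier("m_bufferSize2"))  # -> "mbuffersize"
```

Note that without splitting, `m_bufferSize2` collapses into one token rather than the words "buffer" and "size", which is exactly the limitation the paragraph above identifies.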


Evaluation

Given the heuristics, it is clear that the proposed approach cannot be validated in the same way natural language taggers are; there is no gold standard, since our definitions are not based on English. Therefore, to validate the approach, we need to determine whether it identifies a consistent tag for a given identifier within a system and across systems. The intuition is that if a word is marked consistently within a given system as well as across systems, then the heuristics are more likely revealing something latent about how language is used in software, which supports their validity.

Further, once validated, the data obtained using this approach can be used to apply techniques from the natural language processing domain in order to fine-tune the part of speech assigned to an identifier. The advantage, given the proposed heuristics, is that these models are based on how identifiers are used in code. We believe that more latent knowledge about how identifiers are used in code will be unveiled. We also expect tagging to be consistent for a given identifier between different systems, since the identifier is being used in a consistent manner that the proposed heuristics uncover.

A number of methods were devised for evaluating the results. A preliminary examination of the data is used to determine whether the heuristics are consistent within a system. To that end, we count how many words fall under each separate category (adjective, pronoun, etc.), and then prune out any word that appears only once. This way the data reflects words that were assigned the same part of speech at least twice, which gives at least a modicum of confidence about the consistency of usage. If an identifier is given more than one part of speech within the system, it is counted as a mismatch. A mismatch here means that, despite a word being given the same part of speech at least twice, it can still be assigned a different part of speech somewhere else in the same system. If an identifier is not given more than one part of speech, it is left in the bucket corresponding to the part of speech it was assigned. Finally, a count of how many times each identifier was seen, along with its part of speech, is computed. The resulting data is shown in Table 5.1.
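The counting procedure just described can be sketched as follows (a simplified illustration under assumed data structures, not the actual evaluation code):

```python
from collections import Counter, defaultdict

def summarize(taggings):
    """Summarize (identifier, pos) observations.

    taggings: list of (identifier, pos) pairs, one per tagged occurrence.
    Returns per-category counts of verified identifiers plus a mismatch
    count. Words seen only once are pruned, as described in the text.
    """
    pos_by_word = defaultdict(Counter)
    for word, pos in taggings:
        pos_by_word[word][pos] += 1

    buckets = Counter()
    for word, counts in pos_by_word.items():
        if sum(counts.values()) < 2:   # prune words that appear only once
            continue
        if len(counts) > 1:            # given more than one part of speech
            buckets["mismatch"] += 1
        else:
            buckets[next(iter(counts))] += 1
    return buckets

obs = [("count", "adjective"), ("count", "adjective"),
       ("state", "adjective"), ("state", "pnoun"),
       ("tmp", "pronoun")]             # "tmp" appears once -> pruned
print(summarize(obs))
```

Here "count" is verified as an adjective, "state" becomes a mismatch, and "tmp" is pruned by the at-least-twice threshold.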

Table 5.1 Number of verified identifiers for each system according to part of speech, with percentages

             Blender       Brclad        Cali          Inkscape      Ogre         Average
Verb         1292 (12%)    1411 (12%)    9768 (45%)    1320 (27%)    1109 (40%)   29%
Adjective    4117 (39%)    4278 (36%)    3707 (17%)    1104 (22%)    448 (16%)    26%
Pronoun      2647 (25%)    2608 (22%)    2159 (10%)    998 (20%)     480 (17%)    17%
Prop. Noun   1594 (15%)    2340 (20%)    3129 (14%)    786 (16%)     508 (18%)    16%
Noun         71 (1%)       102 (1%)      1088 (5%)     186 (4%)      79 (3%)      3%
Mismatch     838 (8%)      1040 (9%)     1927 (9%)     516 (11%)     147 (5%)     9%

The first column of Table 5.1 gives the part of speech labels. The numbers that follow are counts for each part of speech according to the systems used. The mismatch row at the bottom represents the number of words in each system that were given more than one part of speech tag, hence lowering the accuracy for the proposed tool.

The data presented in Table 5.1 shows that developers' use of identifiers in each system is fairly consistent given our threshold of at least two. Mismatches only happened between 8% and 11% of the time, with an average of 9%. Noun is the only part of speech that is low. This makes sense, since our heuristic states that nouns occur only in class declarations because they are part of object compositions. Likewise, a large number of verbs and adjectives is reasonable: verbs represent functions with some potential side effect, and these are everywhere in object-oriented projects; adjectives represent functions that return data about objects, or variables that hold data about objects (or arrays), which are also very common. In essence, the data reflects that naming is mostly consistent, as well as a fairly typical breakdown of the types of identifiers within a system.

Since the preliminary evaluation produced promising results, another evaluation was done on 10 open source systems using two methods. The first was the same as the previous evaluation, that is, evaluating the consistency of part of speech of words within each system. The second was evaluating part of speech across systems. Information about each system involved in this evaluation is given in Table 5.2. As the table shows, 9 of the systems are more than 100K SLOC, and 1 system is only 36K SLOC; the small system is included to investigate how the results are affected by including or excluding it in the cross-system evaluation. The systems' sizes range between 36K and 700K SLOC, and the number of unique identifiers per system ranges between 1,526 and 84K. Some of the systems are from the same domain and the rest are from different domains. It is also important to mention that all the systems are written in C++, except Vim, which is mostly written in C.


Table 5.2 The 10 open source systems used in the evaluation

System                  SLOC      Domain                       Total Identifiers
Monkey Studio           182,337   IDE                          6,185
Code::Blocks            727,875   IDE                          23,912
Kdeveloper              145,095   IDE                          4,246
Vim                     321,461   Text editor                  5,259
Programmer's Notepad    222,494   Text editor                  10,032
Cocos2d                 725,966   2D game engine               20,407
OpenCV                  732,553   Real-time computer vision    14,365
Inkscape                483,462   Drawing                      10,607
ParaView                600,080   Visualization                84,373
Chipmunk2D              36,081    2D game physics library      1,526

Table 5.3 displays the results of the first part of the evaluation, which assesses the consistency of part of speech of words within each system. The purpose is to investigate the number of verified identifiers for each system according to part of speech, including percentages. The data shows that programmers' use of identifiers in each system is also fairly consistent given a threshold of at least two. Mismatches happen between 9% and 23% of the time, with an average of 17%. The lowest part of speech is nouns, the same as in the preliminary evaluation, with 2% on average. The highest part of speech is adjectives, with almost 50%. This large number is reasonable, since adjectives represent functions that return data about objects, or variables that hold data about objects (arrays), which are very common, as already mentioned for the preliminary evaluation. In essence, the results over 10 systems confirm that naming is mostly consistent, and show a fairly typical breakdown of the types of identifiers within a system.


Evaluating part of speech across the systems is the second part of the evaluation process. First, a pre-step of checking whether there are common identifiers between systems was taken. Table 5.4 on page 50 shows the results of this step, and the data has a number of implications. There is a high number of common identifiers between systems from the same domain, such as Monkey Studio, Code::Blocks, and Programmer's Notepad. For example, Monkey Studio and Programmer's Notepad share 27% of their identifiers, Code::Blocks and Programmer's Notepad 14%, and Monkey Studio and Code::Blocks 12%. These high percentages led us to investigate the reasons for such results. See APPENDIX A for some of the commonly occurring identifiers between Monkey Studio and Code::Blocks. These two systems are open source and both belong to the same domain, so it is plausible that one system served as the base for the other. Monkey Studio was created in 2005, whereas Programmer's Notepad was created in 2002, which suggests that Monkey Studio may have used some of Programmer's Notepad's original code. Additionally, both systems use Qt for the GUI. In short, these three systems share the same core code base and, as such, have a large number of common identifiers.

Another observation concerns two systems in the same domain but written in different programming languages, to see whether they share common identifiers. This can be illustrated by examining Vim and Programmer's Notepad, since Vim is mostly written in C while Programmer's Notepad is mostly written in C++. These two programming languages are similar in concept, but they are still different. Table 5.4 shows that there are 494 common identifiers between these two systems. This gives evidence that there can be commonalities between systems from the same domain even when they are written in different programming languages.

Among all 10 systems there are only 66 common identifiers out of tens of thousands of unique identifiers. See APPENDIX B for a full list of these identifiers. They include single-letter identifiers, common abbreviations, and short full words that programmers commonly use. The average pairwise commonality was 3.7% and the median was 2.8%; the standard deviation was 4.3 and the variance was 18.7. In contrast, when the small system, Chipmunk2D, is excluded from this evaluation, the results show a total of 117 common identifiers, with an average of 4.2%, a median of 3.2%, a standard deviation of 4.7, and a variance of 22. See APPENDIX B for a full list of these identifiers. Two box plots of the results are shown in Figure 5.1 and Figure 5.2 on page 51.
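The pairwise commonality and summary statistics described above can be computed along these lines (a sketch with made-up identifier sets; the exact denominator used for the percentage is an assumption, since the thesis does not spell it out):

```python
import statistics

def commonality_stats(systems):
    """Pairwise common-identifier percentages across systems.

    systems: dict mapping system name -> set of unique identifiers.
    The percentage for a pair is taken here as the intersection size
    relative to the smaller system's identifier set (one plausible
    definition, assumed for illustration).
    """
    names = list(systems)
    pcts = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            common = systems[a] & systems[b]
            pcts.append(100 * len(common) / min(len(systems[a]), len(systems[b])))
    shared_by_all = set.intersection(*systems.values())
    return {
        "common_to_all": len(shared_by_all),
        "mean": statistics.mean(pcts),
        "median": statistics.median(pcts),
        "stdev": statistics.stdev(pcts),
    }

demo = {
    "sysA": {"i", "count", "state", "tmp", "buf"},
    "sysB": {"i", "count", "ptr", "len"},
    "sysC": {"i", "state", "len", "pos"},
}
print(commonality_stats(demo))
```

Running the same computation over the real identifier sets, with and without Chipmunk2D, would reproduce the two sets of statistics reported above.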

We can conclude that there is a direct correlation between system size and commonality. Large systems tend to have a large number of common identifiers, whereas small systems have a small number. Code::Blocks, for instance, has a very large number of commonalities with ParaView, since both of these systems are more than 200K SLOC, even though they are from different domains.

After confirming that there is commonality between all the systems, the next step is to evaluate the part of speech of these common identifiers and investigate whether they have the same part of speech across systems, in order to assess the consistency of part of speech across systems. This part was carried out using nlpCMP, the evaluation tool that uses srcNLP output to compare parts of speech for each identifier across systems (see Figure 5.3 on page 52). The output of this tool is used to investigate the consistency of the common identifiers across the 10 systems, as well as across the 9 systems that exclude Chipmunk2D, the small system. See APPENDIX C for the list of the 66 common identifiers found between all 10 systems, with the common part of speech for each identifier and how consistent it is, and see APPENDIX D for the corresponding list of the 117 common identifiers found between the 9 systems. A simple calculation of how single letters were most often tagged found that 0% were tagged as verbs or nouns, while 74% were tagged as proper nouns, 47% as adjectives, and 16% as pronouns (a letter can carry more than one tag). Overall, this suggests that these single letters name first-class user-defined objects, which could not have been deduced without this evaluation.
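The cross-system consistency percentages reported in APPENDIX C and APPENDIX D can be computed along these lines (a sketch with made-up data; nlpCMP's actual input format is not shown in the thesis):

```python
def consistency(tags_by_system, pos_set):
    """Fraction of systems in which an identifier carries its common POS.

    tags_by_system: dict mapping system name -> set of POS tags observed
    for one identifier in that system.
    pos_set: the POS tags considered the identifier's common tags.
    """
    hits = sum(1 for tags in tags_by_system.values() if tags & pos_set)
    return hits / len(tags_by_system)

# 'depth' is tagged as an adjective in 10 of 10 systems (cf. APPENDIX C)
depth_tags = {f"sys{i}": {"adjective"} for i in range(10)}
print(consistency(depth_tags, {"adjective"}))  # -> 1.0
```

An identifier like 'options', tagged with its common parts of speech in only 6 of 10 systems, would score 0.6 under this definition, matching the 60% entry in APPENDIX C.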


Table 5.3 The consistency of part of speech within each system

The table gives, for each of the 10 systems, the counts and percentages of identifiers verified as Verb, Adjective, Pronoun, Prop. Noun, and Noun, plus the Mismatch count, along with the average and standard deviation across systems. (The table was rotated in the original document and its per-cell data could not be recovered cleanly.)

Table 5.4 The total number of identifiers common between each pair of systems, with percentages

The table gives, for each pair of the 10 systems, the number of common identifiers and the corresponding percentage. It also reports 66 identifiers common to all 10 systems, and 117 identifiers common to all systems when Chipmunk2D (the small system) is excluded. (The table was rotated in the original document and its per-cell data could not be recovered cleanly.)

Figure 5.1 Box plot of program identifiers common between all 10 systems

Figure 5.2 Box plot of program identifiers common between the systems without Chipmunk2D


Figure 5.3 A visualization of the nlpCMP output structure for the usage of the identifier 'depth' across 4 systems


Conclusions and Future Research

The contribution of this thesis is a novel approach that marks each program identifier with an appropriate part of speech based on how the identifier is used in the source code. The evaluation showed that programmers tend to use identifiers within a system in a consistent way; this was assessed via part of speech consistency and resulted in reasonable percentages.

6.1 Main Findings

The evaluation provided evidence that program identifiers are used fairly

consistently, according to our part of speech definition, within a system. That is,

developers assign identifiers’ semantics and use them in a consistent manner throughout

the project. We observed this trend across 10 different open source software projects. This

supports the usefulness of our programming-based part of speech approach as a practical

alternative to natural language part of speech tagging.

Another important finding is that, within the systems we studied, there is very little

overlap in identifier names. This result has not been previously discussed in the literature

and could have a significant impact on the development of program understanding tools that

rely on dictionaries to infer the meaning of identifiers. The study here did not involve

identifier splitting or any word stemming. This may have a large impact on the

commonality of identifiers between systems. As such, further study of this topic is

warranted.


6.2 Future Research Directions

Future work will pursue a new path of heuristics evaluation. The next step is to apply various filtering techniques (splitting, abbreviation expansion, and stemming) in order to remove threats to validity. Since only identifiers are tagged in this study, that is, types and class names are not tagged, future work will involve tagging these aspects as nouns and then evaluating the results. Evaluating roughly 100 large systems, categorized by domain, would further test the validity of the presented work and give more insight into the usefulness of the approach for various software engineering tasks.

Moreover, since the evaluation in CHAPTER 5 shines a positive light on the proposed heuristics, two new objectives will be pursued. The first is to create a database of words and their typical usages for use by the research community, similar to WordNet's use in the NLP community; this is similar to work done in [Falleri et al. 2010]. Using this database, a large number of words from varying systems, together with their typical usages, can be recorded. This will allow the creation of models based on typical word usage with respect to source code, which would be made available to the research community as a whole. One way to do this is to use the aforementioned database to find typical word usages and create a mathematical model. Another is to more fully investigate the relationship between stereotypes and the proposed part of speech heuristics. Beyond that, investigating higher-order patterns such as the verb-direct-object pairs discussed in section 2.2.1 is one of the goals. In the end, the hope is that this markup can help support research in identifier naming [Binkley, Hearn, Lawrie 2011], code summarization [Haiduc, Aponte, Marcus 2010; Haiduc, Aponte, Moreno, Marcus 2010], and bug fixing [Lo 2015].


APPENDIX A

addwatcher, allocation, appendaction, bracematch, buflength, ccpos, ccstart, chbase, chcolor, checkforchangeoutsidepaint, chunkend, chunkendoffset, chunkoffset, classifywordrb, colourisecamldoc, colourisetadsmsgparam, colourisetadsstring, copydesiredcolours, debugprintf, ends, ensurestyledto, expand, extendwordselect, fnt, foldcamldoc, foldeiffeldockeywords, grab, hsstart, inhexnumber, initstate, isadirectivechar, isanhtmlchar, isanidentifier, iscaml, isdbcsleadbyte, iseiffelcomment, isfillup, islineendchar, issavepoint, item, keywordisambiguous, kwlast, lastdeferred, lengthfind, lengthstyle, linefromhandle, linerangebreak, linerangestart, lineremove, maycoalesce, mergemarkers, mouseclick, nexttabpos, nextwordend, nextwordstart, notifymodified, opcode, paintcontents, pascalkeywords, performeddeletion, pfniscommentleader, prompt, redrawselmargin, removenumber, sendscintilla, setfiles, setforeback, setstyles, settabsize, startmod, starty, tabsz, uparrow, userwords, xend

Figure 6.1 Some of the common identifiers between Monkey Studio and Code::Blocks


APPENDIX B

The 66 common identifiers between all the systems:

Single letters: a, b, c, d, e, f, g, i, j, k, l, m, n, p, r, s, t, v, w

Full words: begin, check, child, context, copy, count, data, delta, depth, done, end, file, filter, first, group, hash, id, key, len, line, lines, mask, max, message, name, next, offset, options, other, out, parent, result, retval, root, size, start, state, target, temp, type, value

Abbreviations: idx, init, pos, ptr, str, val

The 117 common identifiers without the Chipmunk2D system:

Single letters: a, b, c, d, e, f, g, i, j, k, l, m, n, p, r, s, t, v, w

Full words: action, append, base, begin, check, child, code, context, copy, count, create, current, data, delta, depth, done, empty, end, error, event, file, files, filter, first, flag, flags, found, from, function, group, hash, id, insert, item, key, last, line, lines, list, location, mask, max, menu, message, mode, name, next, number, offset, ok, options, other, out, parent, path, prefix, range, result, root, row, size, start, state, stream, target, text, to, type, value, widget

Abbreviations: arg, args, buf, cc, ch, col, dest, dir, fd, fp, idx, init, len, msg, num, op, params, pos, ptr, res, ret, retval, sc, si, str, temp, tmp, val


APPENDIX C

Identifier | Common POS across systems | Total occurrence out of 10 | Consistency percentage of the common POS
a | pnoun and adjective | 10 | 100%
b | adjective | 10 | 100%
c | pnoun | 10 | 100%
d | pnoun, pronoun and adjective | 8 | 80%
e | pnoun and pronoun | 9 | 90%
f | pnoun | 9 | 90%
g | pnoun | 8 | 80%
i | adjective | 10 | 100%
j | adjective | 10 | 100%
k | adjective | 9 | 90%
l | pnoun | 9 | 90%
m | pnoun | 10 | 100%
n | pnoun and adjective | 10 | 100%
p | pnoun | 10 | 100%
r | adjective | 10 | 100%
s | pnoun | 10 | 100%
t | pnoun | 10 | 100%
v | pnoun | 10 | 100%
w | pnoun, pronoun and adjective | 9 | 90%
begin | adjective | 8 | 80%
check | adjective | 8 | 80%
child | pronoun | 9 | 90%
context | pronoun | 10 | 100%
copy | adjective | 9 | 90%
count | adjective | 10 | 100%
data | pnoun and adjective | 10 | 100%
delta | pnoun | 9 | 90%
depth | adjective | 10 | 100%
done | adjective | 9 | 90%
end | pnoun and adjective | 10 | 100%
file | adjective | 9 | 90%
filter | pnoun | 8 | 80%
first | adjective | 10 | 100%
group | pronoun | 9 | 90%
hash | pnoun and adjective | 7 | 70%
id | pnoun | 10 | 100%
idx | pnoun and adjective | 10 | 100%
init | adjective | 9 | 90%
key | pronoun and adjective | 9 | 90%
len | pnoun and adjective | 9 | 90%
line | pronoun and adjective | 10 | 100%
lines | adjective | 10 | 100%
mask | pnoun | 10 | 100%
max | adjective | 9 | 90%
message | adjective | 8 | 80%
name | adjective | 10 | 100%
next | pronoun | 9 | 90%
offset | adjective | 9 | 90%
options | pnoun and pronoun | 6 | 60%
other | pronoun | 9 | 90%
out | pnoun, pronoun and adjective | 8 | 80%
parent | pronoun | 10 | 100%
pos | pnoun | 10 | 100%
ptr | adjective | 9 | 90%
result | pronoun | 10 | 100%
retval | adjective | 9 | 90%
root | pronoun | 9 | 90%
size | adjective | 10 | 100%
start | pnoun and adjective | 10 | 100%
state | adjective | 10 | 100%
str | pronoun | 9 | 90%
target | pnoun and pronoun | 8 | 80%
temp | pnoun and pronoun | 9 | 90%
type | pnoun | 10 | 100%
val | pnoun | 10 | 100%
value | pnoun and pronoun | 10 | 100%
Average | | | 93%


APPENDIX D

Identifier | Common POS | Total occurrence out of 9 | Consistency of the common POS
a | pnoun and adjective | 9 | 100%
b | adjective | 9 | 100%
c | pnoun and adjective | 9 | 100%
d | pronoun and adjective | 8 | 89%
e | pronoun | 9 | 100%
f | pronoun, adjective and pnoun | 9 | 100%
g | pnoun | 7 | 78%
i | pronoun, adjective and pnoun | 9 | 100%
j | adjective | 9 | 100%
k | adjective | 9 | 100%
l | pronoun and pnoun | 9 | 100%
m | pnoun and pronoun | 9 | 100%
n | pronoun, adjective and pnoun | 9 | 100%
p | pnoun and pronoun | 9 | 100%
r | adjective and pronoun | 9 | 100%
s | pronoun and pnoun | 9 | 100%
t | pronoun and pnoun | 9 | 100%
v | adjective and pnoun | 9 | 100%
w | adjective | 9 | 100%
action | pronoun | 8 | 89%
append | adjective | 7 | 78%
arg | adjective | 9 | 100%
args | pnoun | 9 | 100%
base | adjective | 9 | 100%
begin | adjective | 8 | 89%
buf | adjective | 9 | 100%
cc | adjective | 7 | 78%
ch | adjective | 8 | 89%
check | adjective | 8 | 89%
child | pronoun | 6 | 67%
code | adjective | 7 | 78%
col | adjective | 9 | 100%
context | pronoun | 9 | 100%
copy | adjective | 9 | 100%
count | adjective | 9 | 100%
create | verb and adjective | 6 | 67%
current | pronoun | 8 | 89%
data | pronoun, adjective and pnoun | 9 | 100%
delta | pnoun and adjective | 8 | 89%
depth | adjective | 9 | 100%
dest | pronoun | 8 | 89%
dir | pronoun | 9 | 100%
done | adjective | 9 | 100%
empty | adjective | 9 | 100%
end | pnoun and adjective | 9 | 100%
error | adjective | 8 | 89%
event | pronoun | 9 | 100%
fd | adjective and pnoun | 7 | 78%
file | pnoun and adjective | 8 | 89%
files | pnoun | 7 | 78%
filter | pnoun | 7 | 78%
first | adjective | 9 | 100%
flag | adjective | 9 | 100%
flags | adjective and pnoun | 8 | 89%
found | adjective | 9 | 100%
fp | pronoun | 8 | 89%
from | adjective | 9 | 100%
function | pronoun | 6 | 67%
group | pronoun | 8 | 89%
hash | adjective | 7 | 78%
id | pnoun | 9 | 100%
idx | pnoun and adjective | 9 | 100%
init | adjective | 9 | 100%
insert | adjective | 9 | 100%
item | pronoun | 9 | 100%
key | pronoun | 9 | 100%
last | adjective | 8 | 89%
len | adjective | 9 | 100%
line | pronoun, adjective and pnoun | 9 | 100%
lines | adjective and pnoun | 9 | 100%
list | pronoun and pnoun | 9 | 100%
location | pnoun | 8 | 89%
mask | pnoun | 9 | 100%
max | adjective | 9 | 100%
menu | pnoun and pronoun | 7 | 78%
message | pronoun, adjective and pnoun | 7 | 78%
mode | adjective and pnoun | 8 | 89%
msg | pnoun | 8 | 89%
name | pronoun, adjective and pnoun | 9 | 100%
next | pronoun | 8 | 89%
num | adjective | 9 | 100%
number | adjective | 9 | 100%
offset | adjective | 9 | 100%
ok | adjective | 9 | 100%
op | pnoun | 9 | 100%
options | pnoun | 6 | 67%
other | pronoun | 8 | 89%
out | pnoun and adjective | 8 | 89%
params | pnoun and pronoun | 8 | 89%
parent | pronoun | 9 | 100%
path | pronoun | 9 | 100%
pos | pnoun and adjective | 9 | 100%
prefix | pronoun | 7 | 78%
ptr | adjective | 8 | 89%
range | adjective | 8 | 89%
res | adjective and pnoun | 9 | 100%
result | pronoun, adjective and pnoun | 9 | 100%
ret | adjective and pronoun | 9 | 100%
retval | adjective | 9 | 100%
root | pronoun | 8 | 89%
row | adjective | 8 | 89%
sc | pronoun and pnoun | 8 | 89%
si | adjective and pronoun | 6 | 67%
size | adjective | 9 | 100%
start | pnoun and adjective | 9 | 100%
state | adjective | 9 | 100%
str | pronoun | 9 | 100%
stream | pronoun | 8 | 89%
target | pnoun and adjective | 8 | 89%
temp | pronoun | 9 | 100%
text | pronoun | 9 | 100%
tmp | pronoun | 8 | 89%
to | adjective | 9 | 100%
type | pronoun and pnoun | 9 | 100%
val | adjective and pnoun | 9 | 100%
value | pronoun and pnoun | 9 | 100%
widget | pronoun | 9 | 100%
Average of consistency for all the common POS | | | 94%


REFERENCES

[Abebe, Haiduc, Tonella, Marcus 2011] Abebe, S. L., Haiduc, S., Tonella, P., and Marcus, A.,

(2011), "The effect of lexicon bad smells on concept location in source code", in

Proceedings of Source Code Analysis and Manipulation (SCAM), 2011 11th IEEE

International Working Conference on, pp. 125-134.

[Abebe, Tonella 2010] Abebe, S. L. and Tonella, P., (2010), "Natural Language Parsing of

Program Element Names for Concept Extraction", in Proceedings of Program

Comprehension (ICPC), 2010 IEEE 18th International Conference on, June 30 2010-July

2 2010, pp. 156-159.

[Alhindawi, Dragan, Collard, Maletic 2013] Alhindawi, N., Dragan, N., Collard, M. L., and

Maletic, J., (2013), "Improving feature location by enhancing source code with

stereotypes", in Proceedings of Software Maintenance (ICSM), 2013 29th IEEE

International Conference on, pp. 300-309.

[Allamanis, Barr, Bird, Sutton 2014] Allamanis, M., Barr, E. T., Bird, C., and Sutton, C., (2014),

"Learning natural coding conventions", in Proceedings of the 22nd ACM SIGSOFT

International Symposium on Foundations of Software Engineering. Hong Kong, China:

ACM, pp. 281-293.

[Allen 1987] Allen, J., (1987), "Natural language understanding".

[Aman, Okazaki 2008] Aman, H. and Okazaki, H., (2008), "Impact of Comment Statement on

Code Stability in Open Source Development", in Proceedings of JCKBSE, pp. 415-419.

[Anquetil, Lethbridge 1998] Anquetil, N. and Lethbridge, T., (1998), "Assessing the relevance of

identifier names in a legacy software system", in Proceedings of the 1998 conference of

the Centre for Advanced Studies on Collaborative research. Toronto, Ontario, Canada:

IBM Press, pp. 4.

64

[Anquetil, Lethbridge 1999] Anquetil, N. and Lethbridge, T. C., (1999), "Recovering software

architecture from the names of source files", Journal of Software Maintenance, vol. 11,

no. 3, pp. 201-221.

[Arafat, Riehle 2009] Arafat, O. and Riehle, D., (2009), "The commenting practice of open

source", in Proceedings of Proceedings of the 24th ACM SIGPLAN conference

companion on Object oriented programming systems languages and applications, pp.

857-864.

[Baldridge, Morton, Bierner 2005] Baldridge, J., Morton, T., and Bierner, G., (2005), "OpenNLP

maxent package in Java".

[Biggerstaff, Mitbander, Webster 1993] Biggerstaff, T. J., Mitbander, B. G., and Webster, D.,

(1993), "The concept assignment problem in program understanding", in Proceedings of

the 15th international conference on Software Engineering. Baltimore, Maryland, USA:

IEEE Computer Society Press, pp. 482-498.

[Binkley, Hearn, Lawrie 2011] Binkley, D., Hearn, M., and Lawrie, D., (2011), "Improving

identifier informativeness using part of speech information", in Proceedings of the 8th

Working Conference on Mining Software Repositories. Waikiki, Honolulu, HI, USA:

ACM, pp. 203-206.

[Binkley et al. 2013] Binkley, D., Lawrie, D., Pollock, L., Hill, E., and Vijay-Shanker, K.,

(2013), "A dataset for evaluating identifier splitters", in Proceedings of the 10th Working

Conference on Mining Software Repositories. San Francisco, CA, USA: IEEE Press, pp.

401-404.

[Boehm 1981] Boehm, B. W.,(1981),Software engineering economics, Prentice-hall Englewood

Cliffs (NJ).

[Booch 1982] Booch, G., (1982), "Object-oriented design", ACM SIGAda Ada Letters, vol. 1, no.

3, pp. 64-76.

65

[Boogerd, Moonen 2008] Boogerd, C. and Moonen, L., (2008), "Assessing the value of coding

standards: An empirical study", in Proceedings of Software Maintenance, 2008. ICSM

2008. IEEE International Conference on, Sept. 28 2008-Oct. 4 2008, pp. 277-286.

[Brants 2000] Brants, T., (2000), "TnT: a statistical part-of-speech tagger", in Proceedings of the

sixth conference on Applied natural language processing. Seattle, Washington:

Association for Computational Linguistics, pp. 224-231.

[Brill 1992] Brill, E., (1992), "A simple rule-based part of speech tagger", in Proceedings of the

third conference on Applied natural language processing. Trento, Italy: Association for

Computational Linguistics, pp. 152-155.

[Buse, Weimer 2010] Buse, R. P. L. and Weimer, W. R., (2010), "Learning a Metric for Code

Readability", IEEE Trans. Softw. Eng., vol. 36, no. 4, pp. 546-558.

[Butler, Grogono, Shinghal, Tjandra 1995] Butler, G., Grogono, P., Shinghal, R., and Tjandra, I.,

(1995), "Retrieving information from data flow diagrams", in Proceedings of Reverse

Engineering, 1995., Proceedings of 2nd Working Conference on, 14-16 Jul 1995, pp. 22-

29.

[Butler, Wermelinger, Yu 2015] Butler, S., Wermelinger, M., and Yu, Y., (2015), "Investigating

naming convention adherence in Java references".

[Butler, Wermelinger, Yu, Sharp 2011] Butler, S., Wermelinger, M., Yu, Y., and Sharp, H.,

(2011), "Improving the tokenisation of identifier names", in ECOOP 2011–Object-

Oriented Programming, Springer, pp. 130-154.

[Caprile, Tonella 1999] Caprile, B. and Tonella, P., (1999), "Nomen est omen: analyzing the

language of function identifiers", in Proceedings of Reverse Engineering, 1999.

Proceedings. Sixth Working Conference on, 6-8 Oct 1999, pp. 112-122.

66

[Caprile, Tonella 2000] Caprile, B. and Tonella, P., (2000), "Restructuring program identifier

names", in Proceedings of Software Maintenance, 2000. Proceedings. International

Conference on, 2000, pp. 97-107.

[Carrasco, Gelbukh 2003] Carrasco, R. M. and Gelbukh, A., (2003), "Evaluation of TnT Tagger

for Spanish", in Proceedings of Computer Science, 2003. ENC 2003. Proceedings of the

Fourth Mexican International Conference on, pp. 18-25.

[Corazza, Di Martino, Maggio 2012] Corazza, A., Di Martino, S., and Maggio, V., (2012),

"LINSEN: An efficient approach to split identifiers and expand abbreviations", in

Proceedings of Software Maintenance (ICSM), 2012 28th IEEE International Conference

on, pp. 233-242.

[Deissenbock, Pizka 2005] Deissenbock, F. and Pizka, M., (2005), "Concise and consistent

naming [software system identifier naming]", in Proceedings of Program Comprehension,

2005. IWPC 2005. Proceedings. 13th International Workshop on, 15-16 May 2005, pp.

97-106.

[Dermatas, Kokkinakis 1995] Dermatas, E. and Kokkinakis, G., (1995), "Automatic stochastic

tagging of natural language texts", Comput. Linguist., vol. 21, no. 2, pp. 137-163.

[Dragan, Collard, Maletic 2006] Dragan, N., Collard, M. L., and Maletic, J., (2006), "Reverse

engineering method stereotypes", in Proceedings of Software Maintenance, 2006.

ICSM'06. 22nd IEEE International Conference on, pp. 24-34.

[Dragan, Collard, Maletic 2010] Dragan, N., Collard, M. L., and Maletic, J., (2010), "Automatic

identification of class stereotypes", in Proceedings of Software Maintenance (ICSM),

2010 IEEE International Conference on, pp. 1-10.

[Dzeroski, Erjavec, Zavrel 2000] Dzeroski, S., Erjavec, T., and Zavrel, J., (2000),

"Morphosyntactic Tagging of Slovene: Evaluating Taggers and Tagsets", in Proceedings

of LREC.

[Elshoff, Marcotty 1982] Elshoff, J. L. and Marcotty, M., (1982), "Improving computer program

readability to aid modification", Commun. ACM, vol. 25, no. 8, pp. 512-521.

[Enslen, Hill, Pollock, Vijay-Shanker 2009] Enslen, E., Hill, E., Pollock, L., and Vijay-Shanker,

K., (2009), "Mining source code to automatically split identifiers for software analysis",

in Proceedings of the 2009 6th IEEE International Working Conference on Mining

Software Repositories: IEEE Computer Society, pp. 71-80.

[Erlikh 2000] Erlikh, L., (2000), "Leveraging legacy system dollars for e-business", IT

professional, vol. 2, no. 3, pp. 17-23.

[Eshkevari et al. 2011] Eshkevari, L. M., Arnaoudova, V., Penta, M. D., Oliveto, R.,

Guéhéneuc, Y.-G., and Antoniol, G., (2011), "An exploratory study of

identifier renamings", in Proceedings of the 8th Working Conference on Mining Software

Repositories. Waikiki, Honolulu, HI, USA: ACM, pp. 33-42.

[Etzkorn, Davis 1994] Etzkorn, L. H. and Davis, C. G., (1994), "A documentation-related

approach to object-oriented program understanding", in Proceedings of Program

Comprehension, 1994. Proceedings., IEEE Third Workshop on, 14-15 Nov 1994, pp. 39-

45.

[Etzkorn, Davis, Bowen 2001] Etzkorn, L. H., Davis, C. G., and Bowen, L. L., (2001), "The

language of comments in computer software: A sublanguage of English", Journal of

Pragmatics, vol. 33, no. 11, pp. 1731-1756.

[Falleri et al. 2010] Falleri, J. R., Huchard, M., Lafourcade, M., Nebut, C., Prince, V., and Dao,

M., (2010), "Automatic Extraction of a WordNet-Like Identifier Network from

Software", in Proceedings of Program Comprehension (ICPC), 2010 IEEE 18th

International Conference on, June 30 2010-July 2 2010, pp. 4-13.

[Fluri, Wursch, Gall 2007] Fluri, B., Wursch, M., and Gall, H. C., (2007), "Do Code and

Comments Co-Evolve? On the Relation between Source Code and Comment Changes",

in Proceedings of Reverse Engineering, 2007. WCRE 2007. 14th Working Conference

on, 28-31 Oct. 2007, pp. 70-79.

[Freitas, da Cruz, Henriques 2012] Freitas, J. L., da Cruz, D., and Henriques, P. R., (2012), "A

Comment Analysis Approach for Program Comprehension", in Proceedings of Software

Engineering Workshop (SEW), 2012 35th Annual IEEE, pp. 11-20.

[Freitas 2011] Freitas, J. L. F. d., (2011), "Comment analysis for program comprehension".

[Fry et al. 2008] Fry, Z. P., Shepherd, D., Hill, E., Pollock, L., and Vijay-Shanker, K., (2008),

"Analysing source code: looking for useful verb-direct object pairs in all the right

places", Software, IET, vol. 2, no. 1, pp. 27-36.

[Guerrouj, Di Penta, Antoniol, Guéhéneuc 2013] Guerrouj, L., Di Penta, M., Antoniol, G., and

Guéhéneuc, Y. G., (2013), "Tidier: an identifier splitting approach using speech

recognition techniques", Journal of Software: Evolution and Process, vol. 25, no. 6, pp.

575-599.

[Guilder 1995] Guilder, L. V., (1995), "Automated Part of Speech Tagging: A Brief Overview",

Date Accessed: 9/12/2015.

[Gupta, Malik, Pollock, Vijay-Shanker 2013] Gupta, S., Malik, S., Pollock, L., and Vijay-

Shanker, K., (2013), "Part-of-speech tagging of program identifiers for improved text-

based software engineering tools", in Proceedings of Program Comprehension (ICPC),

2013 IEEE 21st International Conference on, 20-21 May 2013, pp. 3-12.

[Haiduc, Aponte, Marcus 2010] Haiduc, S., Aponte, J., and Marcus, A., (2010), "Supporting

program comprehension with source code summarization", in Proceedings of the 32nd

ACM/IEEE International Conference on Software Engineering - Volume 2, pp. 223-226.

[Haiduc, Aponte, Moreno, Marcus 2010] Haiduc, S., Aponte, J., Moreno, L., and Marcus, A.,

(2010), "On the use of automated text summarization techniques for summarizing source

code", in Proceedings of Reverse Engineering (WCRE), 2010 17th Working Conference

on, pp. 35-44.

[Haiduc, Marcus 2008] Haiduc, S. and Marcus, A., (2008), "On the Use of Domain Terms in

Source Code", in Proceedings of Program Comprehension, 2008. ICPC 2008. The 16th

IEEE International Conference on, 10-13 June 2008, pp. 113-122.

[Hill et al. 2014] Hill, E., Binkley, D., Lawrie, D., Pollock, L., and Vijay-Shanker, K., (2014),

"An empirical study of identifier splitting techniques", Empirical Softw. Engg., vol. 19,

no. 6, pp. 1754-1780.

[Hill et al. 2008] Hill, E., Fry, Z. P., Boyd, H., Sridhara, G., Novikova, Y., Pollock, L., and

Vijay-Shanker, K., (2008), "AMAP: automatically mining abbreviation expansions in

programs to enhance software maintenance tools", in Proceedings of the 2008

international working conference on Mining software repositories. Leipzig, Germany:

ACM, pp. 79-88.

[Knuth 2003] Knuth, D., (2003), "Selected Papers on Computer Languages", Stanford,

Calif.: CSLI Publications, Center for the Study of Language and Information.

[Lawrie, Feild, Binkley 2006] Lawrie, D., Feild, H., and Binkley, D., (2006), "Syntactic

identifier conciseness and consistency", in Proceedings of Source Code Analysis and

Manipulation, 2006. SCAM'06. Sixth IEEE International Workshop on, pp. 139-148.

[Lawrie, Feild, Binkley 2007] Lawrie, D., Feild, H., and Binkley, D., (2007), "Extracting

Meaning from Abbreviated Identifiers", in Proceedings of Source Code Analysis and

Manipulation, 2007. SCAM 2007. Seventh IEEE International Working Conference on,

Sept. 30 2007-Oct. 1 2007, pp. 213-222.

[Lawrie, Morrell, Feild, Binkley 2006] Lawrie, D., Morrell, C., Feild, H., and Binkley, D.,

(2006), "What's in a Name? A Study of Identifiers", in Proceedings of Program

Comprehension, 2006. ICPC 2006. 14th IEEE International Conference on, pp. 3-12.

[Liblit, Begel, Sweetser 2006] Liblit, B., Begel, A., and Sweetser, E., (2006), "Cognitive

perspectives on the role of naming in computer programs", in Proceedings of the 18th

annual psychology of programming workshop.

[Mihalcea 2003] Mihalcea, R., (2003), "Performance analysis of a part of speech tagging task",

in Computational Linguistics and Intelligent Text Processing, Springer, pp. 158-167.

[Ohba, Gondow 2005] Ohba, M. and Gondow, K., (2005), "Toward mining "concept keywords"

from identifiers in large software projects", in Proceedings of the 2005 international

workshop on Mining software repositories. St. Louis, Missouri: ACM, pp. 1-5.

[Pollock et al. 2007] Pollock, L., Vijay-Shanker, K., Shepherd, D., Hill, E., Fry, Z. P., and

Maloor, K., (2007), "Introducing natural language program analysis", in Proceedings of

the 7th ACM SIGPLAN-SIGSOFT workshop on Program analysis for software tools and

engineering. San Diego, California, USA: ACM, pp. 15-16.

[Rajlich, Gosavi 2004] Rajlich, V. and Gosavi, P., (2004), "Incremental change in object-

oriented programming", Software, IEEE, vol. 21, no. 4, pp. 62-69.

[Rajlich, Wilde 2002] Rajlich, V. and Wilde, N., (2002), "The role of concepts in program

comprehension", in Proceedings of Program Comprehension, 2002. Proceedings. 10th

International Workshop on, pp. 271-278.

[Ratnaparkhi 1996] Ratnaparkhi, A., (1996), "A maximum entropy model for part-of-speech

tagging", in Proceedings of the conference on empirical methods in

natural language processing, pp. 133-142.

[Rilling, Klemola 2003] Rilling, J. and Klemola, T., (2003), "Identifying comprehension

bottlenecks using program slicing and cognitive complexity metrics", in Proceedings of

Program Comprehension, 2003. 11th IEEE International Workshop on, 10-11 May 2003,

pp. 115-124.

[Shepherd et al. 2007] Shepherd, D., Fry, Z. P., Hill, E., Pollock, L., and Vijay-Shanker, K.,

(2007), "Using natural language program analysis to locate and understand action-

oriented concerns", in Proceedings of the 6th international conference on Aspect-oriented

software development. Vancouver, British Columbia, Canada: ACM, pp. 212-224.

[Shepherd, Pollock, Vijay-Shanker 2007] Shepherd, D., Pollock, L., and Vijay-Shanker, K.,

(2007), "Case study: supplementing program analysis with natural language analysis to

improve a reverse engineering task", in Proceedings of the 7th ACM

SIGPLAN-SIGSOFT workshop on Program analysis for software tools and engineering,

pp. 49-54.

[Takang, Grubb, Macredie 1996] Takang, A. A., Grubb, P. A., and Macredie, R. D., (1996), "The

effects of comments and identifier names on program comprehensibility: an experimental

investigation", J. Prog. Lang., vol. 4, no. 3, pp. 143-167.

[Tan, Yuan, Krishna, Zhou 2007] Tan, L., Yuan, D., Krishna, G., and Zhou, Y., (2007),

"/* iComment: Bugs or bad comments? */", in ACM SIGOPS Operating Systems

Review, pp. 145-158.

[Tenny 1988] Tenny, T., (1988), "Program readability: Procedures versus comments", Software

Engineering, IEEE Transactions on, vol. 14, no. 9, pp. 1271-1279.

[Tian, Lo 2015] Tian, Y. and Lo, D., (2015), "A comparative study on the effectiveness of part-

of-speech tagging techniques on bug reports", in Proceedings of Software Analysis,

Evolution and Reengineering (SANER), 2015 IEEE 22nd International Conference on,

pp. 570-574.

[Vinz, Etzkorn 2008] Vinz, B. L. and Etzkorn, L. H., (2008), "Comments as a Sublanguage: A

Study of Comment Grammar and Purpose", in Proceedings of Software Engineering

Research and Practice, pp. 17-23.

[Williams, Hill, Pollock, Shanker 2007] Williams, M., Hill, E., Pollock, L., and Shanker, V.,

(2007), "The Role of PartsOfSpeech in Java Program Identifiers", Final report, CRA-W

Distributed Mentor Program. Retrieved from

http://archive2.cra.org/Activities/craw_archive/dmp/awards/2007/Williams/finalReport.pdf

[Woodfield, Dunsmore, Shen 1981] Woodfield, S. N., Dunsmore, H. E., and Shen, V. Y., (1981),

"The effect of modularization and comments on program comprehension", in Proceedings

of the 5th international conference on Software engineering, pp. 215-223.

[Zavrel, Daelemans 1999] Zavrel, J. and Daelemans, W., (1999), "Recent advances in memory-

based part-of-speech tagging", in Proceedings of VI Simposio Internacional de

Comunicacion Social, pp. 590-597.
