Part-Of-Speech Tagging of Source Code Identifiers

PART-OF-SPEECH TAGGING OF SOURCE CODE IDENTIFIERS USING PROGRAMMING LANGUAGE CONTEXT VERSUS NATURAL LANGUAGE CONTEXT

A thesis submitted to Kent State University in partial fulfillment of the requirements for the degree of Masters of Science

by Reem S. AlSuhaibani

December, 2015

Thesis written by
Reem S. AlSuhaibani
M.S., Kent State University, USA, 2015
B.S., Prince Sultan University, USA, 2010

Approved by
Dr. Jonathan I. Maletic, Academic Advisor
Dr. Gwenn L. Volkert, Members, Master Thesis Committee
Dr. Kambiz Ghazinour, Members, Master Thesis Committee

Accepted by
Dr. Javed I. Khan, Chair, Department of Computer Science
Dr. James L. Blank, Dean, College of Arts and Sciences

TABLE OF CONTENTS

  1.1 Research Hypothesis and Questions
  1.2 Research Contributions
  1.3 Organization of the Thesis
BACKGROUND
  2.1 Part of Speech Tagging
    2.1.1 Rule-Based Approach
    2.1.2 The Stochastic (Probabilistic) Approach
    2.1.3 Architecture of Part of Speech Taggers
    2.1.4 Tagsets
  2.2 Natural Language in Source Code
    2.2.1 Program Identifiers
    2.2.2 Comments
RELATED WORK
OVERVIEW OF APPROACH
  4.1 Part of Speech Tagging in Programming Languages
  4.2 Part of Speech Tagging Approach for Source Code
    4.2.1 Heuristic Rules on Program Identifiers
    4.2.2 Part of Speech and Method Stereotypes
  4.3 Part of Speech Tagging on Source Code Comments
  4.4 Implementation in srcML
EVALUATION
CONCLUSIONS AND FUTURE RESEARCH
  6.1 Main Findings
  6.2 Future Research Directions

LIST OF FIGURES

Figure 2.1 The different approaches of automatic part of speech tagging.
Figure 2.2 The common process of part of speech taggers
Figure 4.1 An example of applying heuristics
Figure 4.2 The same previous example with the NLTK POS tagger (NLP tagger)
Figure 4.3 The result of NLTK tagger on HippoDraw source code comments
Figure 4.4 The workflow of the heuristics approach
Figure 4.5 Example of how srcNLP tag source code identifiers
Figure 5.1 Box plot of program identifiers common between all the 10 systems
Figure 5.2 Box plot of program identifiers common without Chipmunk2D
Figure 5.3 A general visualization for nlpCMP output structure on the usage of the identifier ‘depth’ between 4 systems
Figure 6.1 Shows some of the common identifiers between Monkey studio and Code Blocks

LIST OF TABLES

Table 2.1 Examples of how the word “above” is used in different forms
Table 2.2 Differences between supervised and unsupervised part of speech tagging
Table 2.3 The NLTK universal language tagsets
Table 4.1 Taxonomy of method stereotypes and their corresponding part of speech
Table 5.1 Number of verified identifiers for each system according to part of speech with percentages
Table 5.2 The 10 open source systems used in the evaluation
Table 5.3 The consistency of part of speech within each system
Table 5.4 The total number of identifiers common between systems

DEDICATION

To my father Saleh AlSuhaibani, who passed away during my research studies and before finishing this thesis; a goal that we both share.

ACKNOWLEDGEMENTS

This thesis would not be completed without the guidance and blessings of God; I am grateful and thankful for everything God gave me.
This thesis also would not be completed without the support of many people around me who really deserve to be acknowledged. I would like first to thank my parents, my father Saleh AlSuhaibani and my mother Huda Alrajhi, for the love, care, continuous support, and advice they have given me throughout my studies, which made me the person I am now. Second, my deepest gratitude and sincere thanks go to my husband Ahmad for being a father, a brother, and a friend during my master’s studies in the United States. Without his continuous support, I would not have achieved many things.

Thanks go out to my advisor, Professor Jonathan I. Maletic, for his enormous efforts, continual follow-up, and attention to achieving the goal of this thesis. Without him, I would not have loved and enjoyed the work that I have accomplished. He has been an inspirational advisor who has fostered a spirit of creativity among us as students. I am glad and proud that he is my advisor, and I appreciate every piece of advice he has given me to expand my knowledge in software engineering.

A special thank you goes to each of the SDML lab members for being supportive and helpful with their advice and opinions. I would also like to extend my thanks to my brothers Abdulrahman, Ahmad, and Hussam and my sister Maram, and to all my friends at Kent State University who have been such great supporters with their love and prayers.

Reem S. AlSuhaibani
November 2015, Kent, Ohio

INTRODUCTION

With 60-90% of software life cycle resources spent on program maintenance [Boehm 1981; Erlikh 2000], there is a critical need for advanced tools that help in exploring and comprehending today’s large and complex software. Natural-language clues in program identifiers have been shown to improve software tools, which can help reduce this maintenance cost [Shepherd, Pollock, Vijay-Shanker 2007].
There have been a number of attempts to apply Natural Language Processing (NLP) techniques to source code to support various program comprehension tasks. In the work presented here, we are particularly interested in determining the part-of-speech of identifiers (the names of functions, types, variables, etc.) in source code. We view this as a separate problem from determining the part-of-speech of comments. Comments are typically written in a natural language (English) and often have sentence structure that follows grammatical rules [Etzkorn, Davis 1994; Etzkorn, Davis,
