
PART-OF-SPEECH TAGGING OF SOURCE CODE IDENTIFIERS

USING PROGRAMMING LANGUAGE CONTEXT

VERSUS NATURAL LANGUAGE CONTEXT

A thesis submitted

to Kent State University in partial

fulfillment of the requirements for the

degree of Master of Science

by

Reem S. AlSuhaibani

December, 2015


Thesis written by

Reem S. AlSuhaibani

M.S., Kent State University, USA, 2015

B.S., Prince Sultan University, Saudi Arabia, 2010

Approved by

Dr. Jonathan I. Maletic Academic Advisor

Dr. Gwenn L. Volkert Member, Master's Thesis Committee

Dr. Kambiz Ghazinour Member, Master's Thesis Committee

Accepted by

Dr. Javed I. Khan Chair, Department of Computer Science

Dr. James L. Blank Dean, College of Arts and Sciences


TABLE OF CONTENTS

LIST OF FIGURES ...... V

LIST OF TABLES ...... VI

DEDICATION ...... VII

ACKNOWLEDGEMENTS ...... VIII

INTRODUCTION ...... 1

1.1 Research Hypothesis and Questions ...... 3

1.2 Research Contributions ...... 3

1.3 Organization of the Thesis ...... 4

BACKGROUND ...... 5

2.1 Tagging ...... 5

2.1.1 Rule-Based Approach ...... 9

2.1.2 The Stochastic (Probabilistic) Approach ...... 10

2.1.3 Architecture of Part of Speech Taggers ...... 12

2.1.4 Tagsets ...... 13

2.2 Natural Language in Source Code ...... 16

2.2.1 Program Identifiers ...... 17

2.2.2 Comments ...... 23

RELATED WORK ...... 25

OVERVIEW OF APPROACH ...... 28

4.1 Part of Speech Tagging in Programming Languages ...... 28

4.2 Part of Speech Tagging Approach for Source Code ...... 29


4.2.1 Heuristic Rules on Program Identifiers ...... 31

4.2.2 Part of Speech and Method Stereotypes ...... 35

4.3 Part of Speech Tagging on Source Code Comments ...... 36

4.4 Implementation in srcML ...... 39

EVALUATION ...... 42

CONCLUSIONS AND FUTURE RESEARCH ...... 53

6.1 Main Findings ...... 53

6.2 Future Research Directions ...... 54

APPENDIX A ...... 56

APPENDIX B ...... 57

APPENDIX C ...... 58

APPENDIX D ...... 60

REFERENCES ...... 64


LIST OF FIGURES

Figure 2.1 The different approaches of automatic part of speech tagging...... 7

Figure 2.2 The common process of part of speech taggers ...... 12

Figure 4.1 An example of applying heuristics ...... 34

Figure 4.2 The same previous example with the NLTK POS tagger (NLP tagger) ...... 34

Figure 4.3 The result of NLTK tagger on HippoDraw source code comments ...... 38

Figure 4.4 The workflow of the heuristics approach ...... 40

Figure 4.5 Example of how srcNLP tags source code identifiers ...... 40

Figure 5.1 Box plot of program identifiers common between all the 10 systems ...... 51

Figure 5.2 Box plot of program identifiers common without Chipmunk2D ...... 51

Figure 5.3 A general visualization for nlpCMP output structure on the usage of the identifier ‘depth’ between 4 systems ...... 52

Figure 6.1 Shows some of the common identifiers between Monkey Studio and Code Blocks ...... 56


LIST OF TABLES

Table 2.1 Examples of how the word “above” is used in different forms ...... 6

Table 2.2 Differences between supervised and unsupervised part of speech tagging ...... 8

Table 2.3 The NLTK universal language tagsets ...... 16

Table 4.1 Taxonomy of method stereotypes and their corresponding part of speech ...... 35

Table 5.1 Number of verified identifiers for each system according to part of speech with percentages ...... 43

Table 5.2 The 10 open source systems used in the evaluation ...... 45

Table 5.3 The consistency of part of speech within each system ...... 49

Table 5.4 The total number of identifiers common between systems ...... 50


DEDICATION

To my father Saleh AlSuhaibani, who passed away during my research studies and before finishing this thesis; a goal that we both share.


ACKNOWLEDGEMENTS

This thesis would not have been completed without the guidance and blessings of God; I am grateful and thankful for everything God gave me. This thesis also would not have been completed without the support of many people around me who truly deserve to be acknowledged.

I would like first to thank my parents, my father Saleh AlSuhaibani and my mother Huda Alrajhi, for their love, care, continuous support, and the advice they have given me throughout my studies to become the person I am now.

Second, my deepest gratitude and sincere thanks goes to my husband Ahmad for being a father, a brother and a friend during my master’s studies in the United States.

Without his continuous support, I would not have achieved many things.

Thanks go out to my advisor, Professor Jonathan I. Maletic, for his enormous efforts and continual follow-up and attention in achieving the goal of this thesis. Without him, I would not have loved and enjoyed the work that I have accomplished. He has been such an inspirational advisor who has created a spirit of creativity among us as students. I am glad and proud that he is my advisor, and I appreciate all the advice he has given me to expand my knowledge in software engineering.

A special thank you goes to each of the SDML lab members for being supportive and helpful with their advice and opinions. I would like also to extend my thanks to my brothers Abdulrahman, Ahmad and Hussam and my sister Maram, and to all my friends at Kent State University who have been such great supporters with their love and prayers.

Reem S. AlSuhaibani

November 2015, Kent, Ohio


Introduction

With 60-90% of software life cycle resources spent on program maintenance [Boehm 1981; Erlikh 2000], there is a critical need for advanced tools that help in exploring and comprehending today’s large and complex software. To reduce the cost of this software maintenance, it has been demonstrated that natural-language clues in program identifiers can be used to improve software tools [Shepherd, Pollock, Vijay-Shanker 2007].

There have been a number of attempts to apply Natural Language Processing (NLP) techniques to source code to support various program comprehension tasks. In the work presented here, we are particularly interested in determining the part-of-speech of identifiers of functions, types, variables, etc. in source code. We view this as a separate problem from determining the part-of-speech of comments. Comments are typically written in a natural language (English) and often have sentence structure that follows grammatical rules [Etzkorn, Davis 1994; Etzkorn, Davis, Bowen 2001; Vinz, Etzkorn 2008].

Part of speech (PoS) taggers for natural language leverage large amounts of knowledge about English words and their usage in sentences. Thus, they work well for typical English prose, but identifiers lack sentence structure, and without it these methods break down.

Additionally, the manner in which programmers use an identifier in a program is very different from how a writer uses a word in a sentence. The usage context is different, and the meanings of the identifiers are typically specialized for the domain of software. While we do not deny that there is some correspondence between an identifier and its English counterpart, drawing a direct comparison is clearly flawed.

As such, we feel a more appropriate approach is to define the part-of-speech in terms of how an identifier is used in the code rather than how it would be used in prose.

Similar techniques have been used by others [Gupta, Malik, Pollock, Vijay-Shanker 2013; Pollock et al. 2007; Shepherd et al. 2007]. For example, we could simply mark all function names as verbs and variables/objects as nouns. Names of functions typically describe an action (on an object or parameter). Likewise, variables are typically nouns that describe an object in the domain.

Of course, this heuristic is overly simplistic and does not take into consideration a wealth of relevant information that can be derived (statically) from the context of an identifier within the source code. Much like how natural language part of speech taggers use the context of the word in a sentence, we must use the context of the identifier in the program.

Given this viewpoint, we propose a set of heuristics that define the part of speech of identifiers in source code. That is, we have taken terms from part of speech (e.g., noun, verb, etc.) and defined them, using heuristics, in the context of source code. Thus, instead of assigning part of speech to identifiers based on the word’s usage in English prose, we assign it based on its use in source code. The goal of this work is to produce a specialized part of speech tagger for source code. This would be used in conjunction with a Natural Language Processing (NLP) part of speech tagger for the internal comments.

1.1 Research Hypothesis and Questions

The main hypothesis of this work is that defining part of speech in the context of source code for program understanding will produce more useful results than using a natural language based part of speech definition.

This thesis investigates the following research questions:

1. Is it possible to tag program identifiers using programming-based part of speech rules according to their placement in source code?

2. What are the definitions for each part of speech in the context of source code?

3. Are program identifiers consistently used, according to the part of speech, within systems?

4. Are program identifiers consistently used across systems?

1.2 Research Contributions

The results of this research add to the research literature in several ways. Programming-based part of speech tagging allows program identifiers to be tagged without using Natural Language Processing (NLP) part of speech taggers. It is a new way of handling part of speech tagging in source code. Specific contributions of this work include:

• Definitions of proper nouns, nouns, adjectives, pronouns, and verbs in terms of source code, and rules that distinguish them.

• An evaluation of these definitions on 10 open source software systems.

• An empirical study of 10 open source systems in the context of common identifier usage.


1.3 Organization of the Thesis

The thesis is organized as follows: A background is given in CHAPTER 2, followed by related work in CHAPTER 3. The details of the proposed approach are given in CHAPTER 4. An evaluation of the approach is presented in CHAPTER 5. The main results and future work are summarized in CHAPTER 6.


Background

This chapter presents a background of the most important aspects of the part of speech tagging process. First, part of speech tagging is defined, followed by a brief description of some related approaches that can be used for this procedure. An overview of part of speech tagging architecture is also provided to give an overall vision of how the various Natural Language Processing (NLP) taggers work. Next, the term tagset is explored in depth, supported with examples of the most commonly used tagsets.

The following section then explores natural language usage in source code, supported by the most recent research studies. This section, in fact, can be considered the foundation of this thesis.

2.1 Part of Speech Tagging

Part of speech tagging in natural language processing is defined as labeling a word in a sentence or phrase with its appropriate part of speech, based on both its own definition and the context of the sentence. Several applications build on this process, including chunking, lexical acquisition, and concept extraction. Part of speech tagging is not a trivial task; in English, for example, one word may have various part of speech tags. The word “about” can be used as an adverb or preposition depending on how it is used in a sentence, and the word “above” can be used as an adverb, a preposition, an adjective, or a noun (see Table 2.1). Such words may cause confusion for taggers as a result of their ambiguity. Lately, however, there has been rising interest in data-driven machine learning disambiguation approaches, which can be employed in different tasks such as tagging.

Table 2.1 Examples of how the word “above” is used in different forms

Sentence                                     How the word “above” is used
The stars are above.                         Adverb
The airplane is above the clouds.            Preposition
Read the instruction given above.            Adjective
Our financial incentives come from above.    Noun

A number of approaches have been proposed for building part of speech taggers, including Hidden Markov Models. These are statistical methods that work by choosing the tag sequence that maximizes the product of the lexical probability and the contextual probability. This method has been successfully applied to different languages such as German [Brants 2000], English [Brants 2000; Mihalcea 2003], Slovene [Dzeroski, Erjavec, Zavrel 2000] and Spanish [Carrasco, Gelbukh 2003].

Another proposed approach is rule-based, which uses given rules and a lexicon to resolve the ambiguity of a tag. These rules can be either created by developers or learned [Allen 1987], which is discussed further in Section 2.1.1. Other machine learning models used for tagging include maximum entropy and other log-linear models, decision trees, memory-based learning, and transformation-based learning [Zavrel, Daelemans 1999; Ratnaparkhi 1996].


[Guilder 1995] attempted to depict the different approaches of automatic part of speech tagging (see Figure 2.1).

Figure 2.1 The different approaches of automatic part of speech tagging. Many part of speech taggers use aspects of some or all of these approaches; hence, diagrams such as this can become very complicated.

In supervised part of speech tagging as defined by [Guilder 1995], a pre-tagged corpus is created, and supervised taggers rely on it as a foundation for building the tools employed in the overall tagging process. It can be thought of as the tagger's dictionary, since it holds word usage counts (frequencies) paired with tags, as well as the probabilities of tag sequences and/or a set of rules.

Conversely, unsupervised models do not need a pre-tagged corpus; these models instead employ sophisticated computational strategies to generate tag sets automatically. These automatic tag sets are used either to quantify the probabilistic information required by stochastic methods or to develop the context-specific rules required by rule-based systems. The differences between these two approaches as described by [Guilder 1995] can be seen in Table 2.2.

Table 2.2 Differences between supervised and unsupervised part of speech tagging

Supervised:
- Selection of tagset/tagged corpus.
- Creation of dictionaries using tagged corpus.
- Calculation of disambiguation tools, which may include: word frequencies, affix frequencies, tag sequence probabilities, "formulaic" expressions.
- Tagging of test data using dictionary information.
- Disambiguation using statistical, hybrid or rule-based approaches.
- Calculation of tagger accuracy.

Unsupervised:
- Induction of tagset using untagged training data.
- Induction of dictionary using training data.
- Induction of disambiguation tools, which may include: word frequencies, affix frequencies, tag sequence probabilities.
- Tagging of test data using induced dictionaries.
- Disambiguation using statistical, hybrid or rule-based approaches.
- Calculation of tagger accuracy.


The two main approaches to tagging, rule-based and stochastic, are further discussed in the following sections.

2.1.1 Rule-Based Approach

According to [Guilder 1995], the contextual information largely used in typical rule-based approaches for assigning tags to unknown or ambiguous words is known as ‘context frame rules’. To demonstrate what a context frame rule might indicate, [Guilder 1995] uses the example of an unknown/ambiguous word X following a determiner and preceding a noun, which is assigned a tag in the category of an adjective:

det - X - n = X/adj

Although contextual information is used for the tagging process, many available taggers also use morphological information to assist in the process of disambiguation. A rule of this kind could be: if an ambiguous/unknown word ends with –tion and is preceded by a determiner, it can be marked as a noun.
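As a sketch, the two kinds of rules just described (the context frame rule det - X - n = X/adj and a –tion suffix rule) might look as follows in Python; the tag names, rule set, and example words are hypothetical and are not taken from any particular tagger.

```python
# Hypothetical sketch of a rule-based tagging step: one context frame
# rule (det - X - n => X/adj) plus one morphological suffix rule.

def apply_rules(tokens, tags):
    """tags holds one tag per token, with None for unknown/ambiguous words."""
    for i, tag in enumerate(tags):
        if tag is not None:
            continue
        prev_tag = tags[i - 1] if i > 0 else None
        next_tag = tags[i + 1] if i < len(tags) - 1 else None
        # Context frame rule: det - X - n  =>  X/adj
        if prev_tag == "det" and next_tag == "n":
            tags[i] = "adj"
        # Morphological rule: determiner + word ending in -tion  =>  noun
        elif prev_tag == "det" and tokens[i].endswith("tion"):
            tags[i] = "n"
    return tags

# apply_rules(["the", "bright", "star"], ["det", None, "n"])
# -> ["det", "adj", "n"]
```

A real rule-based tagger would iterate many such rules, typically learned or hand-tuned, until no unknown tags remain.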

Several taggers go beyond contextual and morphological information by defining rules related to other factors, such as the use of punctuation. Analogously, in programming languages, the use of camel case or an underscore in identifiers can indicate how well-written a program is, and this idea can be exploited in different research areas.

Overall, rule-based taggers largely require supervised training; they are somewhat hard to build and not very robust. However, a great deal of interest in the automatic induction of rules has emerged. This can be illustrated as running untagged text through a tagger and watching how it performs. In this first phase, human scrutiny and correction are carried out to fix any errors in the tagging process. Then the correctly tagged text is submitted to the tagger, which learns correction rules by comparing the two sets of data. In some cases, multiple iterations of this process are required.

2.1.2 The Stochastic (Probabilistic) Approach

‘Stochastic tagger’ is a general term that can denote any number of different methods employed to address the issue of part of speech tagging. To properly label a model as stochastic, it should incorporate probabilistic information, such as statistics and frequencies.

The simplest stochastic taggers disambiguate words based on the probability that a word appears with a specific tag; the tag assigned to an ambiguous word is the one the word most frequently carries in the training data. The issue with this method is that, although it may offer a proper tag for the word at hand, it can also create sequences of tags that are inadmissible.
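The most-frequent-tag strategy described above can be sketched in a few lines of Python; the tiny training corpus here is invented purely for illustration.

```python
from collections import Counter, defaultdict

# Minimal sketch of the simplest stochastic tagger: each word gets the
# tag it most frequently carried in the (invented) training data.
training = [("the", "DET"), ("can", "NOUN"), ("can", "VERB"),
            ("can", "VERB"), ("run", "VERB"), ("run", "NOUN")]

counts = defaultdict(Counter)
for word, tag in training:
    counts[word][tag] += 1

def unigram_tag(word):
    if word in counts:
        return counts[word].most_common(1)[0][0]
    return "NOUN"  # naive fallback for unseen words

# unigram_tag("can") -> "VERB" (seen twice as VERB, once as NOUN)
```

Note that this tagger considers each word in isolation, which is exactly why it can emit inadmissible tag sequences.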

Calculating the probability of a given sequence of tags is an alternative to the word frequency approach. This approach is usually known as the n-gram method, denoting the fact that the best tag for the word at hand is decided by the probability of its occurring with the tags that precede it. The Viterbi Algorithm is the most common algorithm implementing the n-gram method. It is a search procedure that avoids the polynomial expansion of a breadth-first search by ‘trimming’ the search tree at each level, keeping only the best N maximum likelihood estimates (where N denotes the total number of tags for a related word).
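The bigram (n = 2) variant of the n-gram method can be sketched with a small Viterbi decoder; the transition and emission probabilities below are invented for a two-tag toy example, not estimated from any corpus.

```python
import math

# Toy bigram Viterbi decoder over hand-made probabilities.
# trans holds P(tag | previous tag), emit holds P(word | tag).
TAGS = ["NOUN", "VERB"]
trans = {("<s>", "NOUN"): 0.7, ("<s>", "VERB"): 0.3,
         ("NOUN", "NOUN"): 0.3, ("NOUN", "VERB"): 0.7,
         ("VERB", "NOUN"): 0.6, ("VERB", "VERB"): 0.4}
emit = {("flies", "NOUN"): 0.4, ("flies", "VERB"): 0.6,
        ("time", "NOUN"): 0.9, ("time", "VERB"): 0.1}

def viterbi(words):
    # best[tag] = (log probability, best tag sequence ending in tag)
    best = {t: (math.log(trans[("<s>", t)] * emit[(words[0], t)]), [t])
            for t in TAGS}
    for w in words[1:]:
        nxt = {}
        for t in TAGS:
            p, prev = max(
                (best[pt][0] + math.log(trans[(pt, t)] * emit[(w, t)]), pt)
                for pt in TAGS)
            nxt[t] = (p, best[prev][1] + [t])
        best = nxt
    return max(best.values())[1]

# viterbi(["time", "flies"]) -> ["NOUN", "VERB"]
```

At each position only the best path into each tag is kept, which is the "trimming" that avoids enumerating every possible tag sequence.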

Further complexity can be introduced to a stochastic tagger by combining the two approaches discussed above, blending word frequency calculations with tag sequence probabilities. This approach is referred to as the Hidden Markov Model. The model makes the following assumptions:

1. Each word is independent of the other words and of their tags;

2. The probability of a tag depends only on the N previous tags [Dermatas, Kokkinakis 1995].

Employing the Viterbi Algorithm, both hidden and visible models (i.e., the Hidden Markov Model and the visible Markov Model) can be applied; these are considered the most proficient of the tagging techniques. Nonetheless, the Hidden Markov Model cannot be employed in a fully automated tagging schema because of its crucial reliance on the calculation of statistics over the tag states or output sequences. The Baum-Welch Algorithm is an appropriate solution for training a Hidden Markov Model to automatically process the tagging schema. The Baum-Welch Algorithm, also referred to as the Forward-Backward Algorithm, employs word rather than tag information to robustly map sequences, which improves the probability of the training data [Guilder 1995].

The stochastic technique is appealing compared to traditional rule-governed methods because it is the most automatic: it comes with built-in procedures that assist in statistical automation, and little manual knowledge is needed to run the systems.1

1 One of the oldest taggers is a simple rule-based tagger that automatically learns rules. This tagger shows that the stochastic method is not the only viable speech tagging method, as these rule-based taggers presented significant advantages over stochastic ones [Brill 1992]: Brill, E. (1992), "A simple rule-based part of speech tagger", in Proceedings of the Third Conference on Applied Natural Language Processing, Trento, Italy: Association for Computational Linguistics, pp. 152-155.

2.1.3 Architecture of Part of Speech Taggers

Part of Speech (PoS) taggers for natural language leverage large amounts of knowledge about English words and their use in sentences. However, each tagger follows certain steps to reach the desired results (see Figure 2.2).

Tokenization → Ambiguity look-up → Disambiguation

Figure 2.2 The common process of part of speech taggers

In the first step, a given text is divided into tokens; each token can be a word, an article, or a punctuation mark. These tokens are then used as input for a lexical analyzer. In the ambiguity look-up step, a lexicon provides a list of word forms and their parts of speech, and a guesser analyzes unknown tokens; together these identify the ambiguous words. The disambiguation step is based on information about the word itself, such as its probability, as well as contextual information such as word/tag sequences. For example, a word might be tagged as a noun rather than a verb if the preceding word is a preposition or an article. This step is the most difficult in the tagging process.
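Assuming a toy lexicon, the three stages of Figure 2.2 (tokenization, ambiguity look-up, disambiguation) can be sketched as follows; the mini-lexicon, the naive guesser, and the single contextual rule are all invented for illustration.

```python
import re

# Toy sketch of the three-stage tagging pipeline with an invented lexicon.
LEXICON = {"the": ["DET"], "dog": ["NOUN"], "barks": ["NOUN", "VERB"]}

def tokenize(text):
    # Words and punctuation marks become separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)

def lookup(token):
    # Ambiguity look-up: the "guesser" naively proposes NOUN when unknown.
    return LEXICON.get(token.lower(), ["NOUN"])

def disambiguate(tokens):
    tagged, prev = [], None
    for tok in tokens:
        if not re.match(r"\w", tok):
            tagged.append((tok, "."))  # punctuation tag
            prev = "."
            continue
        candidates = lookup(tok)
        tag = candidates[0]
        # Toy contextual rule: after a noun, prefer a verb reading.
        if prev == "NOUN" and "VERB" in candidates:
            tag = "VERB"
        tagged.append((tok, tag))
        prev = tag
    return tagged

# disambiguate(tokenize("The dog barks."))
# -> [("The","DET"), ("dog","NOUN"), ("barks","VERB"), (".",".")]
```

Here "barks" is ambiguous in the lexicon, and only the contextual rule in the disambiguation stage resolves it to a verb.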

2.1.4 Tagsets

In any language in the world, a word can take the form of a noun, verb, adjective, adverb, pronoun, preposition, or conjunction. The part of speech tagger’s role is to tag or label a word with its appropriate tag, as defined in Section 2.1. A tagset is the set of tags from which the tagger chooses the appropriate one for a given word; such a set is a prerequisite for providing lexical resources, specifically dictionary entries and grammar rules. Indeed, any available state-of-the-art tagger is given a standard tagset to complete the tagging process. There are a number of English part of speech tagsets available from the Natural Language Processing (NLP) community.

Tagsets in Natural Language Processing are used with corpora. A corpus is a finite, machine-readable body of naturally occurring text that is chosen according to certain criteria. Criteria may include: the language (e.g., English/Arabic/German, and whether it is standard or dialect), genre (such as novels), domain (such as newspapers or software manuals), and size (such as 2,000 words, 40K words, or 2M words). Some part of speech annotated reference corpora include the Brown Corpus, the British National Corpus (BNC), and the Penn Treebank. Brief information on each of these corpora is given in the following paragraphs.

The Brown Corpus, formally the Brown University Standard Corpus of Present-Day American English, was compiled in the 1960s in the field of corpus linguistics. It contains 500 samples of English-language text, comprising one million words drawn from a wide variety of sources. A selection of approximately 85 parts of speech is used in the tagged Brown Corpus2, complemented by special indicators for compound forms, contractions, foreign words, and a few other phenomena. This corpus formed the basis for many later corpora such as the Lancaster-Oslo-Bergen Corpus.

The British National Corpus (BNC) contains 100 million words of balanced British English from a wide variety of genres. The project involved the collaboration of three publishers: Oxford University Press, Longman, and W. & R. Chambers; two universities: the University of Oxford and Lancaster University; and the British Library. The BNC Basic (C5) tagset uses 61 tags3. Two revisions were released during the development of this corpus: BNC World in 2001 and the BNC XML Edition in 2007. The BNC was the focus of computational linguists who aimed to create a collection of texts of modern, naturally occurring language, in the form of speech and writing, that a computer could analyze. 90% of this corpus consists of samples of written language and 10% of spoken language; it is open academically, commercially, and educationally.

2 See “Brown Corpus” on Wikipedia for a full list of the tagset used.

3 Look for “The BNC Basic (C5) Tagset” at the University of Oxford for the complete list.

The Penn Treebank is a corpus that contains 2 million words of American English newswire, a set of English texts from the Wall Street Journal (WSJ). The corpus is distributed in 25 directories, each containing 100 files with several sentences. It is the most common corpus in NLP today, and it uses a set of 45 tags4. The entire corpus is automatically labeled with part of speech tags and syntactic labels, which in turn allows the Penn Treebank to develop corpora with semantic and other linguistic information. It has been created both manually, with linguists annotating each sentence with syntactic structure, and semi-automatically, whereby a parser assigned some syntactic structures that linguists then checked and corrected. An automatically parsed corpus that is not corrected by human linguists can still be useful5. Several syntactic treebanks have been developed for a wide variety of languages, for instance, the Penn Chinese Treebank, the Greek Dependency Treebank, and the Penn Arabic Treebank.

The above-mentioned corpora use tagsets that have more than 40 tags; on the contrary, a large number of applications require only a simple basic tagset. Developers of the Natural Language Tool Kit (NLTK6) have provided a simplified universal tagset of only 12 tags (see Table 2.3) that can be used with all the previously stated corpora, as well as various others such as the Web Text, Reuters, and Gutenberg corpora. Furthermore, NLTK gives users the ability to load their own corpora.

4 A full list of all the 45 tags is available at the Penn Treebank Project website.

5 A general improvement for parsers could be achieved here by applying them to large amounts of text and by gathering rule frequencies.

6 NLTK is a suite of libraries for NLP, which will be discussed further in later chapters.


Each corpus needs a tagset to be used for its part of speech tagging. This in turn reveals the fact that the linguists developing these sets are looking to capture different grammatical distinctions. These might include morphological subcategories, such as number for nouns and tense and person for verbs, or syntactic subcategories, such as distinguishing between adjectives in attributive and predicative positions.

Table 2.3 The NLTK universal language tagset

Tag    Meaning               English examples
ADJ    Adjective             new, good, high, special, big
ADP    Adposition            on, of, at, with, by, into, under
ADV    Adverb                really, already, still, early, now
CONJ   Conjunction           and, or, but, if, while, although
DET    Determiner, Article   the, a, some, most, every, which
NOUN   Noun                  year, home, cost, time, Africa
NUM    Numeral               twenty-four, fourth, 1991, 14:24
PRT    Particle              at, on, out, per, that, up, with
PRON   Pronoun               he, their, her, its, my, I, us
VERB   Verb                  is, say, told, given, playing
.      Punctuation marks     . , ; !
X      Other                 esprit, dunno, gr8, university
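To illustrate how a rich tagset collapses onto the 12-tag universal set, here is a partial Penn Treebank to universal mapping; only a representative handful of the 45 Penn tags is shown, and the helper function is ours, not part of NLTK.

```python
# A few representative entries from a Penn Treebank -> universal tagset
# mapping (the full mapping covers all 45 Penn tags; this subset is
# for illustration only).
PENN_TO_UNIVERSAL = {
    "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN",
    "VB": "VERB", "VBD": "VERB", "VBG": "VERB",
    "JJ": "ADJ", "RB": "ADV", "DT": "DET",
    "IN": "ADP", "CD": "NUM", "PRP": "PRON",
}

def simplify(tagged):
    # Unknown Penn tags fall back to the catch-all universal tag X.
    return [(w, PENN_TO_UNIVERSAL.get(t, "X")) for w, t in tagged]

# simplify([("the","DT"), ("cats","NNS"), ("slept","VBD")])
# -> [("the","DET"), ("cats","NOUN"), ("slept","VERB")]
```

The simplification is lossy by design: distinctions such as noun number (NN vs. NNS) and verb tense (VB vs. VBD) disappear, which is exactly the trade-off the universal tagset makes for portability.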

2.2 Natural Language in Source Code

Natural language in source code appears in two main components: program identifier names and comments. These sources of information are very important to programmers and developers, as well as for program understanding and program maintenance tasks.

2.2.1 Program Identifiers

Program identifiers are names given to program entities such as variables, functions, and methods. Formally, they are the program element names from which programmers acquire knowledge of concepts [Deissenbock, Pizka 2005]. These names convey relevant information about the role and properties of the objects or entities they are intended to label, and this valuable information can be used by NLP tools for several program understanding activities.

Identifier Naming: there have been several publications on identifier naming in source code that show how important this process is. Conventional wisdom says that choosing meaningful identifier names improves the ability of the next engineer to comprehend the code. [Caprile, Tonella 2000; Knuth 2003] agree with this idea, stating that: “identifier names are one of the most important sources of information about program entities.” In their work, [Lawrie, Morrell, Feild, Binkley 2006] investigated three levels of identifier naming: full words, abbreviations, and single letters. The study found better program comprehension when full word identifiers are used rather than single letter identifiers, as measured by description rating and confidence in understanding.

Identifier naming is in principle an arbitrary process; however, programmers do not select names arbitrarily. Rather, programmers choose and use names in regular, systematic ways that reflect deep cognitive and linguistic influences. This in turn allows names to carry semantic cues that aid in program understanding and support the larger software development process. High quality identifier names are a key practice that every developer should pay attention to in software engineering. The ability to recover software architecture from the names of source files [Anquetil, Lethbridge 1999] and the mining of concept keywords from identifiers [Ohba, Gondow 2005] are two examples of how important identifier naming is. The importance of this aspect also appears in the drive toward code readability and comprehension, which has been intensely studied by [Biggerstaff, Mitbander, Webster 1993], [Rajlich, Wilde 2002], [Buse, Weimer 2010], [Liblit, Begel, Sweetser 2006], and [Takang, Grubb, Macredie 1996].

According to [Deissenbock, Pizka 2005], identifiers represent 70% of source code tokens. [Eshkevari et al. 2011] explored how identifiers change in code, while [Lawrie, Feild, Binkley 2006] studied the consistency of identifier naming. [Abebe, Haiduc, Tonella, Marcus 2011] discuss the importance of naming to concept location, while [Caprile, Tonella 2000; Rajlich, Gosavi 2004] propose a framework for restructuring and renaming identifiers based on custom rules and dictionaries. Furthermore, [Rilling, Klemola 2003] observe, “In computer programs, identifiers represent defined concepts”, and, further, identifiers that fail to be concise or consistent increase comprehension complexity and its associated costs [Deissenbock, Pizka 2005].

Naming Conventions: these refer to how the programmer can eliminate a “useless freedom” (i.e., the freedom to apply more than one name to the same thing) by using a set of rational rules. The way developers name program components determines the readability of the program’s source code. To make this more formal, each programmer or developer should follow predefined naming conventions in software engineering. According to [Allamanis, Barr, Bird, Sutton 2014], a coding convention is a syntactic restriction not imposed by a programming language grammar. Following such rules has a great impact on several software engineering tasks, including program comprehension and software maintenance. Coding conventions are standard practice, and study results emphasize the importance of selecting accurate and applicable rules [Boogerd, Moonen 2008]. Moreover, naming conventions are important because “studies of how people name things (in general not just in code) have shown that the probability of having two people apply the same name to an object is between 7% and 18%, depending on the object” [Butler, Grogono, Shinghal, Tjandra 1995]. To investigate the extent to which developers follow conventions, [Butler, Wermelinger, Yu 2015] presented a naming convention checking library for Java that allows the declarative specification of conventions regarding typography and the use of abbreviations and phrases. In their evaluation of 3.5 million reference name declarations, they found that a median of over 85% of declarations adhere to conventions. That result suggests that, to some extent, Java developers are aware of the importance of naming conventions.
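A declarative convention check of this flavor can be sketched with a few regular expressions; the convention names and patterns below are hypothetical and are not taken from the library evaluated by [Butler, Wermelinger, Yu 2015].

```python
import re

# Hypothetical declarative convention checks in the spirit of the
# typography conventions described above.
CONVENTIONS = {
    "method":   re.compile(r"^[a-z][a-zA-Z0-9]*$"),  # camelCase
    "class":    re.compile(r"^[A-Z][a-zA-Z0-9]*$"),  # PascalCase
    "constant": re.compile(r"^[A-Z][A-Z0-9_]*$"),    # UPPER_CASE
}

def adheres(kind, name):
    """True if the identifier follows the convention for its kind."""
    return bool(CONVENTIONS[kind].match(name))

# adheres("method", "getName") -> True
# adheres("class", "linkedList") -> False
```

Checking a whole code base then reduces to mapping each declaration to its kind and counting how many names the corresponding pattern accepts.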

Importance of Identifier Naming: several research projects have considered the issue of identifier naming conventions. [Anquetil, Lethbridge 1998] argued that naming conventions are reliable if there is equivalence between the names of software artifacts and the concepts they implement. Indeed, it would be very useful to be able to rely on the names of software artifacts to identify the various implementations of the same concept. [Deissenbock, Pizka 2005] created a formal model for concepts and names that is used to determine reliable names. They defined two characteristics of well-formed identifiers: conciseness and consistency. A mapping from the identifier domain to the conceptual domain is essential to confirm that a variable name is both consistent and concise; identifiers are consistent when they are named according to the concepts to which they map. [Caprile, Tonella 1999], on the other hand, use a grammar to define the naming convention and use it to find semantic meanings in the words. In addition to naming conventions, [Takang, Grubb, Macredie 1996] have examined how informative identifiers are.

Identifier Splitting: a vast number of programmers use two or more words when naming identifiers (multiword identifiers). These names can be written using different cases, such as:

• Camel case (e.g., doubleLinkedList).

• Pascal case (e.g., DoubleLinkedList).

• Underscores (e.g., Double_Linked_List).

• Snake case (e.g., double_linked_list).

In contrast to natural languages, in which word delineation is achieved by punctuation and spaces, identifiers contain no spaces. Thus, to enhance the efficiency of a number of software maintenance tools, multiword identifiers must be split into their constituent words in order to get the expected benefits out of these natural language clues. The automatic splitting of multiword identifiers into their constituent words is straightforward if coding conventions have been followed.
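The splitting of the conventions listed above can be sketched with a simple regular-expression splitter. This is a minimal illustration that assumes the conventions are followed, not one of the published splitting algorithms:

```python
import re

def split_identifier(name):
    """Split a multiword identifier into its constituent words.

    Handles camelCase, PascalCase, underscores, and snake_case by first
    breaking on underscores, then on case transitions (a run of capitals
    followed by lowercase, a capitalized word, a lowercase run, or digits).
    """
    words = []
    for part in name.split('_'):
        words += re.findall(r'[A-Z]+(?![a-z])|[A-Z][a-z]*|[a-z]+|\d+', part)
    return [w.lower() for w in words]

print(split_identifier("doubleLinkedList"))    # ['double', 'linked', 'list']
print(split_identifier("Double_Linked_List"))  # ['double', 'linked', 'list']
```

All four conventions reduce to the same word list, which is what downstream natural language analysis needs.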


Several techniques for identifier splitting have been developed. Linsen aims to find a mapping between each source code identifier and the corresponding set of dictionary words by exploiting high-level and domain-dependent information gathered from different dictionaries [Corazza, Di Martino, Maggio 2012], and Tidier was proposed by [Guerrouj, Di Penta, Antoniol, Guéhéneuc 2013]. However, some forms of identifiers (such as PROGRAMstatus, DAYofBIRTH, FIELDLENGTH) are challenging even for state-of-the-art techniques. Previous studies have shown that splitting algorithms can perform well for identifiers with certain characteristics while being challenged by those with other characteristics, as discussed in the work of [Butler, Wermelinger, Yu, Sharp 2011] and [Enslen, Hill, Pollock, Vijay-Shanker 2009].

Abbreviated Identifiers: a large number of programmers tend to abbreviate their program identifiers while coding, especially identifiers that must be typed frequently and domain-specific words used in comments. However, this behavior does not support natural language processing tasks and may cause several program-understanding difficulties. For this reason, a number of researchers are trying to find ways to expand abbreviated identifiers found in source code.

Expanding abbreviations not only helps with program comprehension but also supports several software engineering tasks, such as code summarization and concept location, as well as enhancing software maintenance tools that utilize natural language information. Automatically expanding abbreviated identifiers into full words gives access to words and their associated meanings that were previously meaningless sequences of characters. A number of tools and methodologies for expanding identifiers are available. [Lawrie, Feild, Binkley 2007] presented a methodology for expanding identifiers and evaluated the process on a code base of slightly over 35 million lines of code. [Hill et al. 2008] presented an automated approach to mining expansions from source code.

In regard to part of speech tagging, abbreviation expansion can improve the effectiveness of analyzing source code identifiers alongside comments, which could result in accurate tagging of words buried behind abbreviations.
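A minimal sketch illustrates the idea of dictionary-based expansion; the dictionary and the subsequence matching rule here are illustrative assumptions, not the technique of any particular tool:

```python
# An abbreviation matches a dictionary word when its letters appear in
# the word in order, starting with the same first letter (e.g., "ctr"
# can expand to "counter"). The small dictionary is purely illustrative.
DICTIONARY = ["counter", "control", "pointer", "parameter", "message", "number"]

def matches(abbrev, word):
    if not word.startswith(abbrev[0]):
        return False
    pos = 0
    for ch in abbrev:
        pos = word.find(ch, pos)
        if pos < 0:
            return False
        pos += 1
    return True

def expand(abbrev):
    """Return all dictionary words the abbreviation could expand to."""
    return [w for w in DICTIONARY if matches(abbrev, w)]

print(expand("ctr"))  # ['counter', 'control']
print(expand("ptr"))  # ['pointer', 'parameter'] -- ambiguity needs context
```

The ambiguous second result shows why real expansion techniques must bring in context, such as surrounding code or domain dictionaries, to choose among candidates.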

Verb-DO pairs: [Booch 1982] stated that, in programming languages, verbs correspond to actions (or operations) and nouns correspond to objects. Consequently, [Shepherd et al. 2007] defined verb-DO pairs as two co-located identifiers in which the first identifier is an action or verb and the second identifier is used as the direct object of the first identifier’s action. [Fry et al. 2008] proposed strategies for automatically extracting precise natural language clues from Java source code in the form of verb–direct object (DO) pairs. Their evaluation study indicated that their techniques obtain 57% precision and 64% recall. The Apache OpenNLP library project, on the other hand, implemented Maxent models that use the verb-DO concept in determining direct-object phrases [Baldridge, Morton, Bierner 2005]. In terms of part of speech tagging, such approaches make tagging verb-DO pairs or action words in source code much easier, thereby enhancing source code part of speech tagging results.
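The idea of reading a verb-DO pair out of a method name can be sketched as follows. This is an illustrative simplification with an assumed verb list, not the extraction strategy of [Fry et al. 2008]:

```python
import re

# If the first word of a camelCase method name is a known action verb,
# treat the remaining words as the direct object. The verb list is an
# illustrative assumption.
ACTION_VERBS = {"get", "set", "add", "remove", "print", "update"}

def verb_do_pair(method_name):
    words = [w.lower() for w in re.findall(r'[A-Z]?[a-z]+', method_name)]
    if len(words) >= 2 and words[0] in ACTION_VERBS:
        return (words[0], " ".join(words[1:]))
    return None  # no verb-DO structure detected

print(verb_do_pair("setName"))      # ('set', 'name')
print(verb_do_pair("addAllItems"))  # ('add', 'all items')
print(verb_do_pair("size"))         # None
```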


2.2.2 Comments

Another valuable source of information for program understanding is internal source code comments. Comments aid programmers in understanding source code faster and more deeply, without needing to read the implementation details. If source code is well documented with reliable comments, its readability improves [Tenny 1988].

More specifically, comments are crucial to sustaining software maintainability and aid in reverse engineering, for example when applying the Read All the Code in One Hour reengineering pattern. In fact, a number of researchers have studied the quality of source code comments and the use of comments by developers during understanding and maintenance activities. [Elshoff, Marcotty 1982] and [Fluri, Wursch, Gall 2007] stated that comments, as well as the structure of the source code, aid in program understanding and therefore reduce maintenance costs. This finding was confirmed by the studies of [Tenny 1988] as well. Another study, described in [Aman, Okazaki 2008], conducted an empirical experiment analyzing the relation between the comment density of a project and its stability. The results showed that projects with a higher comment density tend to be more stable during upgrades.

During the development phase, software developers use natural language to describe the code they write, assisting themselves or other programmers and allowing the code to be easily comprehended during the maintenance phase. Such descriptions constitute internal source code comments. According to [Freitas 2011], comments can be divided into two types: line (or inline) comments and block comments. The number of lines these comments can contain and the syntax used to define them are the only differences between the two types7. Furthermore, there is no standard for the amount of comments that can be included in source code. [Arafat, Riehle 2009] found that in open-source projects with a high percentage of comments and high comment quality, almost 19% of the source code consists of comments. Most importantly, inconsistent comments can mislead the reader [Tan, Yuan, Krishna, Zhou 2007].

7 Inline comments can only contain a single line, whereas block comments can contain one or more lines.


Related Work

This chapter discusses the work most closely related to this thesis. More specifically, it describes how other research uses part of speech tagging on program identifiers and comments. There are a number of investigations into applying Natural Language Processing (NLP) techniques to source code to support various program comprehension tasks. All of these attempts aim to leverage natural language processing techniques to take full advantage of the wealth of information program identifiers provide.

[Gupta, Malik, Pollock, Vijay-Shanker 2013] present a part of speech tagger and syntactic chunker for source code names that takes into account programmers’ naming conventions to understand the regular, systematic ways a program element is named. In their work, they use WordNet and morphological rules to assign all possible part of speech tags to each word. Their evaluation results show a significant improvement in accuracy (11%-20%) of part of speech tagging of identifiers over current approaches.

[Abebe, Tonella 2010] present an approach that extracts concepts and relations from the source code. More specifically, they apply natural language parsing to sentences constructed from source code identifiers in order to support program maintenance.

[Binkley, Hearn, Lawrie 2011] investigate how an English prose tagger performs on source code identifiers. They measure the accuracy of the Stanford Log-linear part of speech tagger on a large corpus of over 145,000 structure-field names. They found that this tagger, with minimal guidance, achieved an accuracy of 81.7% on source code identifiers, showing that the natural language found in software differs from standard English prose.

[Falleri et al. 2010] propose an automated approach to extract identifiers from source code and organize them in a WordNet-like structure. They evaluate their work against a corpus of 24 open source Java programs. Their results give useful insights into the overall software architecture and how it can be used to improve the results of many software engineering tasks.

[Fry et al. 2008] propose the term ‘Natural Language Program Analysis’ (NLPA) for any natural language analysis of source code. They present an automatic approach for discovering useful, precise verb–DO pairs from source code in order to extract useful natural language clues. They also conducted an empirical evaluation of the effectiveness of the automatic technique for precisely extracting verb–direct object pairs from Java source code.

Several researchers have studied the effect of part of speech on Java program identifiers. Some focused on the relationship of noun and verb usage in code, such as [Liblit, Begel, Sweetser 2006] and [Shepherd, Pollock, Vijay-Shanker 2007], while others, such as [Williams, Hill, Pollock, Shanker 2007], focused on preposition usage and its potential effect on software tools.


The distinctive ways in which programmers select and use names in cognitively motivated ways are discussed by [Liblit, Begel, Sweetser 2006]. They found that methods with verb-phrase names in the imperative mood, such as add, addAll, addElement, clear, copyInto, ensureCapacity, setSize, and trimToSize, are actions that actively change the state of the program. In contrast, methods that return some useful piece of data of interest to the caller act as mathematical functions: they passively compute a result but do not alter the state of the program. In particular, true/false methods with verb-phrase names in the indicative mood, such as contains, containsAll, equals, and isEmpty, are factual assertions, while methods returning values other than true/false that have singular or plural names, such as capacity, clone, elementAt, elements, firstElement, hashCode, indexOf, lastElement, lastIndexOf, size, and subList, are things.

A number of studies investigate source code comments. [Vinz, Etzkorn 2008], for example, focused on understanding the sublanguage characterizing the comments of C++ software packages, and more specifically comment grammar. [Freitas, da Cruz, Henriques 2012], on the other hand, tried to locate problem domain concepts in comments and then identify the relevant code chunks associated with them. Several older studies, such as [Woodfield, Dunsmore, Shen 1981] and [Tenny 1988], conducted experiments to show the importance and role of comments in the comprehension of programs.


Overview of Approach

The previous chapter described research on applying natural language part of speech tagging to source code and comments. This assumes that programmers and programming languages use and define part-of-speech in source code in the same way that is done in natural language prose and grammar. In this study, we take a very different viewpoint. Instead of relying on English part-of-speech definitions of words, we attempt to extract new part-of-speech definitions of words in the context of the programming language and their usage in software systems. This chapter presents the core of the thesis and how we determine the part-of-speech of program identifiers (i.e., names of functions, types, variables, etc.) using heuristics.

4.1 Part of Speech Tagging in Programming Languages

Part of speech taggers for natural language (here we limit the discussion to English) leverage large amounts of knowledge about English words and their usage in sentences. Thus, they work fine for typical English prose, but lacking sentence structure (as in the case of source code), these methods break down. Additionally, the manner in which programmers use an identifier in a program is very different from how a writer uses a word in a sentence. There have been a few attempts by different researchers to explore how part of speech tagging could work on source code, as Chapter 3 showed.


The usage of natural language terms in source code is different, and the semantics of the identifiers are typically specialized for the domain of software. While it is true that there is some correspondence between an identifier and its English counterpart, drawing a direct comparison does not make sense in many cases. For example, using the letter ‘i’ as a program identifier most likely does not imply a personal pronoun referring to oneself. Rather, it is most likely a loop index. For this reason, an approach is proposed to define the part-of-speech in terms of how an identifier is used in the code rather than how it would be used in natural language prose.

4.2 Part of Speech Tagging Approach for Source Code

To begin, it is very important to consider that names of functions typically describe an action (on an object or parameter). Likewise, variables are typically nouns that describe an object in the domain. Of course, this heuristic is overly simplistic and does not take into consideration a wealth of relevant information that can be derived (statically) from the context of an identifier within the source code. Much like how natural language part of speech taggers use the context of a word in a sentence, we must use the context of the identifier in the program. Given this viewpoint, a set of heuristics that define the part of speech of identifiers in source code is proposed. That is, part-of-speech terms (e.g., noun, verb, etc.) are redefined using heuristics in the context of source code. Consequently, instead of assigning part of speech to identifiers based on a word’s usage in English prose, it is assigned based on how the identifier is used in source code.


The focus here is to determine how an identifier expresses the intent of the entity it represents. Thus, if an identifier is characterized as a verb, that identifier must represent some sort of action, whereas if an identifier is tagged as a noun, it must represent some sort of object in the system. To accomplish this task, some information about how an identifier is defined, how it is used, and its context is required. To get this information, srcML8 is used, allowing examination of statically computable information about source code entities. Alongside srcML, we use the knowledge of method stereotypes9 [Alhindawi, Dragan, Collard, Maletic 2013] (examples of stereotypes are given in Table 4.1). Primarily, our heuristics use stereotypes to determine how a function behaves with respect to its arguments, local variables, and the calling object (if applicable). In fact, the suggested approach is strongly related to stereotypes: while stereotypes categorize at the method and class level, our approach categorizes at the identifier level.

More formally, the heuristics are defined by presenting a part-of-speech term and then stating its definition with respect to source code, as opposed to its use in English. As they are heuristics, there is room for debate on whether they are completely correct. However, the intention with these base heuristics is that more data may be collected and tighter definitions obtained using techniques already employed in the NLP community. The next section shows in detail the proposed definitions of parts of speech in terms of source code.

8 srcML is a markup language that wraps source code with AST information.

9 Stereotypes give the user information about how a function is being used.


4.2.1 Heuristic Rules on Program Identifiers

The first step is to summarize the motivation behind each definition and then give a list of rules that an identifier must satisfy to be classified under the given term. Note that variables are assigned a part of speech only when they are declared, since the type and name are known at that time. One exception is function identifiers (names), where the part of speech is assigned when they are defined, because the assignment is based on the stereotype of the definition.

In an object-oriented system, the primary way to collect, relate, and move data is to use objects. As discussed in section 2.2.1, a unique name is given to each program object or entity, represented by an identifier name. This is, in fact, analogous to proper nouns in English, which represent the names of unique objects: a person’s name, a location’s name. An identifier in source code is a proper noun only if it satisfies the following rules:

• It names a first class user-defined object.

• The identifier does not appear as a member of another class (i.e., it is not part of a composition relationship).

A noun in English, on the other hand, specifies a set of objects that do not have their own unique identity. These are words like building, food, and device. In object-oriented code, these are akin to identifiers that represent objects making up part of the composition of a larger unique object. Consequently, an identifier in source code is a noun if it satisfies the following rules:

• It names a first class user-defined object.


• The identifier appears as a member of another class (i.e., it is used in a composition relationship; this is where nouns differ from proper nouns).

Adjectives in English describe nouns: a person’s hair color, the age of a city. In source code, these are identifiers whose primary purpose is to convey a characteristic of something: its size/length, whether it is true or false, a radius, a file handle, etc. Primitive types are often used for this purpose, particularly primitives that make up part of a class. For this reason, all identifiers whose types are primitive are considered adjectives.

Accordingly, an identifier is an adjective if it satisfies the following rules:

• The type that the identifier’s value represents is primitive (e.g., int, float, bool). Note that the type of a function’s identifier is its return type.

• If the identifier represents a function, then it further satisfies the constraint that it does not apply any modifications to any of:

o Its aliased arguments (i.e., it does not modify any references or pointers).

o The calling object (i.e., this).

• The identifier does not represent a pointer, reference, const reference, or an array.

Pronouns in English are used as references to nouns or proper nouns; which noun or proper noun they refer to is based on context, e.g., the word “she” can refer to any female person. In source code, these are akin to reference variables and pointers. Hence, an identifier is a pronoun if it satisfies the following rules:

• It names a pointer or reference to a user-defined object or primitive value.


• It is not const; that is, it can be pointed at something else. This implies that parameters passed by const reference in C++ are not pronouns (as const reference passing is an optimization to avoid a copy constructor call).

Verbs represent actions in English: you run, you kick a ball, you play a game. In source code, verbs instead modify the state of the system. Therefore, an identifier is a verb if it satisfies the following:

• It is the name of a function that applies some modification to at least one of three things:

o One or more of its arguments.

o The calling object (this).

o A local variable whose value is then returned.

• It is not const in both its return type and the calling object (i.e., it has to perform some useful modification).
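Taken together, the rules above amount to a small decision procedure over statically derived facts about an identifier. The sketch below is only an illustration under assumed inputs, not the srcNLP implementation; the field names of IdentifierFacts are hypothetical stand-ins for the information derived from srcML:

```python
from dataclasses import dataclass

@dataclass
class IdentifierFacts:
    # Hypothetical statically-derived facts about a declared identifier.
    is_user_defined_type: bool = False  # names a first-class user-defined object
    is_class_member: bool = False       # appears as a member of another class
    is_primitive: bool = False          # int, float, bool, ... (return type for functions)
    is_pointer_or_ref: bool = False     # pointer, reference, or array
    is_const: bool = False
    is_function: bool = False
    modifies_state: bool = False        # mutates arguments, 'this', or a returned local

def part_of_speech(f):
    """Assign a source-code part of speech following the heuristics above."""
    if f.is_function and f.modifies_state and not f.is_const:
        return "verb"
    if f.is_pointer_or_ref and not f.is_const:
        return "pronoun"
    if f.is_primitive and not f.is_pointer_or_ref and not f.modifies_state:
        return "adjective"
    if f.is_user_defined_type and not f.is_class_member:
        return "proper noun"
    if f.is_user_defined_type and f.is_class_member:
        return "noun"
    return "unknown"

# e.g. 'int age;' declared as a class member -> adjective
print(part_of_speech(IdentifierFacts(is_primitive=True, is_class_member=True)))
# e.g. a local user-defined object -> proper noun
print(part_of_speech(IdentifierFacts(is_user_defined_type=True)))
```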

An example of how these heuristics can be applied to source code is given in Figure 4.1. Notice that only identifiers are tagged; types and class names are not. Addressing this is left to future work. A preliminary comparison between the results of the heuristics approach and those of another tagger (the NLTK tagger10) is illustrated in Figure 4.2. Notice that the variable ‘age’ is tagged as an adjective in terms of source code, whereas it is tagged as a noun in terms of English prose, which does not match the desired goal.

10 Explained further in Section 4.3.


class Person{
public:
    float returnAge()            //adjective
    {return curyear - yearborn;}
    void SetName(std::string n)  //verb
    {name = n;}
private:
    int age;                     //adjective
    std::string name;            //noun
    date yearborn;               //noun
};
vector split                     //verb
(string& str){                   //pronoun
    vector result;               //proper noun
    ...
    return result;
}

Figure 4.1 An example of applying heuristics

class Person{
public:
    float returnAge()            //noun
    {return curyear - yearborn;}
    void SetName(std::string n)  //noun
    {name = n;}
private:
    int age;                     //noun
    std::string name;            //noun
    date yearborn;               //noun
};
vector split                     //noun
(const string& str){             //noun
    vector result;               //noun
    ...
    return result;
}

Figure 4.2 The same previous example with the NLTK POS tagger (NLP tagger)


4.2.2 Part of Speech and Method Stereotypes

This section presents a deeper look at how method stereotypes are used to further refine the part of speech for identifiers. Formally, method stereotypes categorize methods based on their role in a given system. They are not based on the name of the method, but on static analysis of the code in the method. We refer the reader to [Dragan, Collard, Maletic 2006; 2010] for a complete definition. The list of stereotypes is provided in Table 4.1, and a brief discussion of each category is given in the next paragraph.

Table 4.1 Taxonomy of method stereotypes and their corresponding part of speech

Stereotype Category | Stereotype | Description | Part-of-Speech
Structural Accessor | get | Returns a data member. | Adjective/noun depending on return type
Structural Accessor | predicate | Returns a Boolean value that is not a data member. | Adjective
Structural Accessor | property | Returns info about data members. | Adjective
Structural Accessor | void-accessor | Returns information via a parameter. | Adjective
Structural Mutator | set | Sets a data member. | Verb
Structural Mutator | command, non-void-command | Performs a complex change to the object's state. | Verb
Creational | constructor, copy-const, destructor, factory | Creates and/or destroys objects. | Verb
Collaborational | collaborator | Works with objects (parameter, local or return value). | Verb
Collaborational | controller | Changes only an external object's state (not this). | Verb
Degenerate | incidental | Does not read/change the object's state. | N/A
Degenerate | empty | Has no statements. | N/A

Structural methods provide and support the structure of the class. For example, accessors read an object’s state while mutators change it. The identifier for a structural method corresponds primarily with adjectives because these methods ask about some characteristic of the object they are part of (isEmpty, getName, etc.). In the case of a mutator, however, it can also be a verb. A creational method’s task, on the other hand, is to create or destroy objects of the class. These correspond primarily to verbs; they completely construct an object, thereby changing the program’s state. A collaborational method’s job is to characterize the communication between objects and how objects are controlled in the system. These are primarily marked as verbs, but future work will investigate whether there are more complex patterns, particularly how verbs are applied to their intended target (which could be an argument, the calling object, etc.). Lastly, degenerate methods are methods that give us little information; in this case, applying our heuristics alone will fail.

It is important to state that, since a method may have more than one stereotype, a small finite state machine implements the rules for assigning a tag based on stereotypes. The approach is naïve and requires further research to be refined, but the implementation uses stereotypes to differentiate between Structural Accessors and the others. Anything combined with a Structural Accessor ends up as an adjective or a noun (since state is not modified by the method, but may be modified outside of the method when a member is returned and is not const). Currently, all other combinations are verbs.
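The stereotype-to-tag assignment just described can be sketched as follows. This is a simplified illustration of the combination rule, not the actual finite state machine in srcNLP; the stereotype names follow Table 4.1:

```python
# Accessor stereotypes map to adjective/noun; mutator, creational, and
# collaborational stereotypes map to verb; degenerate ones get no tag.
ACCESSOR = {"get", "predicate", "property", "void-accessor"}
VERBISH = {"set", "command", "non-void-command", "constructor",
           "copy-const", "destructor", "factory", "collaborator", "controller"}

def tag_from_stereotypes(stereotypes, returns_primitive=True):
    """A method may carry several stereotypes; any Structural Accessor
    forces adjective (primitive return) or noun (object return)."""
    if any(s in ACCESSOR for s in stereotypes):
        return "adjective" if returns_primitive else "noun"
    if any(s in VERBISH for s in stereotypes):
        return "verb"
    return None  # degenerate methods (incidental, empty) give no tag

print(tag_from_stereotypes(["get", "collaborator"]))           # adjective
print(tag_from_stereotypes(["get"], returns_primitive=False))  # noun
print(tag_from_stereotypes(["command"]))                       # verb
print(tag_from_stereotypes(["incidental"]))                    # None
```

The first call shows the combination rule: even combined with a collaborator stereotype, the accessor wins.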

4.3 Part of Speech Tagging on Source Code Comments

The heuristic rules created here deal with source code identifiers only and do not directly apply to source code comments. srcML marks comments with a tag but does no other parsing of the comments (e.g., into sentences). Therefore, the use of an available English tagger is vital. The Natural Language Toolkit (NLTK) tagger was chosen for several reasons: it is a Maximum Entropy based tagger, meaning it provides a principled way of incorporating complex features into probability models; it is trained on the Treebank corpus (the most common in NLP today); and a simple comparison study between the NLTK tagger and the Stanford tagger showed that the NLTK tagger is a better choice for our work11.

To show how effective part of speech tagging for source code comments is, a simple empirical study on random HippoDraw12 source code comments was carried out using the NLTK tagger (Figure 4.3).

11 Besides that, NLTK provides an easy-to-use interface to over 50 corpora and lexical resources, such as WordNet, along with a suite of libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, which gave us a strong belief in this tagger’s utility for our work.

12 An object-oriented statistical data analysis package written in C++.


Comment 1: “A class that does XML serialization and de-serialization of derived classes of AxisModelBase with XML”. Tagging Result: A/DET class/NOUN that/DET does/VERB XML/NOUN serialization/NOUN and/CONJ deserialization/ NOUN of/ADP derived/VERB classes/NOUN of/ADP AxisModelBase/NOUN with/ADP XML/NOUN ./.

Comment 2: "The attribute name for axis model that is logarithmic.” Tagging Result: The/DET attribute/NOUN name/NOUN for/ADP axis/NOUN model/VERB that/ADP is/VERB logarithmic/ADJ ./.

Comment 3: "The point size attribute name.” Tagging Result: The/DET point/NOUN size/NOUN attribute/NOUN name/NOUN ./.

Comment 4: "The attribute name for tick values.” Tagging Result: The/DET attribute/NOUN name/NOUN for/ADP tick/NOUN values/NOUN ./.

Comment 5: "A constructor that takes the tag name and XML controller object as arguments. Since a default constructor does not exist, derived class must use this constructor therefore guaranteeing that the tag name is always set by derived classes.” Tagging Result: A/DET constructor/NOUN that/DET takes/VERB the/DET tag/NOUN name/NOUN and/CONJ XML/NOUN controller/NOUN object/NOUN as/ADP arguments/NOUN ./. Since/ADP a/DET default/NOUN constructor/NOUN does/VERB not/ADV exist/VERB ,/. derived/VERB class/NOUN must/VERB use/VERB this/DET constructor/NOUN therefore/NOUN guaranteeing/VERB that/ADP the/DET tag/NOUN name/NOUN is/VERB always/ADV set/VERB by/ADP derived/VERB classes/NOUN ./.

Figure 4.3 The result of the NLTK tagger on HippoDraw source code comments

Thus, in the work presented, when a comment is encountered in the code, it is characterized using the NLTK tagger, and each word in the comment is given a part of speech tag, as Figure 4.3 shows. There is some mis-tagging in the results, but this does not invalidate the main idea proposed here. As an added feature, identifiers from the code that are also used in comments are tagged with the part of speech produced by the heuristics approach; this is possible because multiple passes over the code can be taken, making an identifier’s part of speech in comments consistent with its usage in the code.
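The consistency pass just described can be sketched as a simple override: wherever a comment word matches a known code identifier, the code-derived tag replaces the prose tag. The tags and identifier table below are illustrative, not output from the actual tools:

```python
def retag_comment(tagged_words, code_pos):
    """Override prose tags with code-derived tags for known identifiers.

    tagged_words: list of (word, prose_tag) pairs, e.g. from a prose tagger
    code_pos: identifier -> part of speech assigned by the code heuristics
    """
    return [(w, code_pos.get(w, t)) for w, t in tagged_words]

# Illustrative data: 'age' is tagged NOUN by a prose tagger, but the
# code heuristics tagged the identifier 'age' as an adjective.
prose = [("the", "DET"), ("age", "NOUN"), ("attribute", "NOUN")]
code_pos = {"age": "ADJECTIVE"}
print(retag_comment(prose, code_pos))
# [('the', 'DET'), ('age', 'ADJECTIVE'), ('attribute', 'NOUN')]
```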


4.4 Implementation in srcML

The heuristics have been implemented in a tool called srcNLP. srcNLP uses srcML, libxml213, and a SAX14 parser to compute data about identifiers. Since srcML wraps identifiers with abstract syntax, the type of any identifier can be determined, as long as the type is statically computable. Furthermore, it is possible to determine whether identifiers or functions are const, whether identifiers are aliases, and so on. Essentially, a large amount of metadata can be tracked for any given identifier, whether a variable, object, or function name, in any given system. srcNLP uses this information, along with data gathered from the tool stereocode, which implements the method stereotype assignment, to determine the constraints of the proposed heuristics. Once determined, the part of speech is inserted directly into the srcML of the source code in the form of srcML tags with an nlp namespace (see Figure 4.4). This means that if an identifier is a noun, it is marked with a noun tag in srcML, and if it is a verb, it is marked with a verb tag. The current implementation of srcNLP is very fast; it is able to completely mark up 2.5 million lines of code in less than four minutes. This does not include the time required for marking up comments, since that part has not yet been integrated into the tool. An example of the markup is shown in Figure 4.5.
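The kind of SAX-based metadata collection srcNLP performs can be illustrated with a toy example over simplified srcML-like XML. This uses Python's standard library rather than libxml2, and the element nesting is an assumption for illustration; it is a sketch, not the srcNLP implementation:

```python
import io
import xml.sax

# Toy srcML-like input (element names and nesting are simplified).
SRC = b"""<unit>
<decl><type><name>int</name></type><name>age</name></decl>
<decl><type><name>string</name></type><name>name</name></decl>
</unit>"""

class DeclHandler(xml.sax.ContentHandler):
    """Collect (identifier, type) pairs from declarations -- the kind of
    statically computable metadata a tool like srcNLP tracks."""
    def __init__(self):
        super().__init__()
        self.stack = []        # currently open element names
        self.text = ""
        self.current_type = None
        self.decls = []        # (identifier, type) pairs

    def startElement(self, tag, attrs):
        self.stack.append(tag)
        self.text = ""

    def characters(self, content):
        self.text += content

    def endElement(self, tag):
        self.stack.pop()
        if tag == "name":
            if "type" in self.stack:
                self.current_type = self.text          # the declared type
            elif self.stack and self.stack[-1] == "decl":
                self.decls.append((self.text, self.current_type))

handler = DeclHandler()
xml.sax.parse(io.BytesIO(SRC), handler)
print(handler.decls)   # [('age', 'int'), ('name', 'string')]
```

With the type in hand, a primitive such as int would trigger the adjective heuristic for 'age'.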

13 Libxml2 is the XML C parser and toolkit.

14 Simple API for XML.


srcML: Apply srcML to the code base and get XML output. → stereocode: Use the stereocode tool to stereotype all the methods. → srcNLP: Add markup for part of speech using srcNLP.

Figure 4.4 The workflow of the heuristics approach

The approach starts by applying srcML to the code base, then uses stereocode to determine the stereotypes of all the methods; the result (a srcML document with stereotype information) is the input to srcNLP. It ends with srcNLP adding XML attributes for part of speech, using libxml2 and a SAX parser.

Figure 4.5 Example of how srcNLP tags source code identifiers

An evaluation tool called nlpCMP was implemented for the purpose of comparing the occurrence of program identifiers between systems, as well as comparing their heuristics-assigned part of speech in XML format. nlpCMP provides useful insights about which part of speech each identifier has across systems, and it includes the name of the file that contains each identifier, the line of code that uses it, and the total number of occurrences. Figure 5.3 in the next chapter will show an example.
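The cross-system comparison that nlpCMP performs can be sketched as follows; the identifier-to-tag maps here are illustrative, not real nlpCMP output:

```python
# Given identifier -> part-of-speech maps from two systems, report the
# identifiers tagged in both systems and whether their tags agree.
def compare_tags(system_a, system_b):
    shared = sorted(set(system_a) & set(system_b))
    return {name: (system_a[name], system_b[name], system_a[name] == system_b[name])
            for name in shared}

a = {"age": "adjective", "result": "proper noun", "name": "noun"}
b = {"age": "adjective", "result": "noun"}
print(compare_tags(a, b))
# {'age': ('adjective', 'adjective', True), 'result': ('proper noun', 'noun', False)}
```

Consistent tags across systems (as for 'age' here) are exactly the evidence the evaluation in the next chapter looks for.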

Currently, elementary data preparation is performed on identifier names: upper-case letters are converted to lower case, and any non-alphabetic symbols are removed. We believe that applying identifier-splitting techniques (section 2.2.1) [Binkley et al. 2013; Enslen, Hill, Pollock, Vijay-Shanker 2009; Hill et al. 2014], abbreviation expansion for abbreviated identifiers (section 2.2.1), and some handling of identifier naming conventions (section 2.2.1) would greatly increase the accuracy of the approach, since it would then be possible to look at the words that make up an identifier instead of at what is likely an agglomeration of multiple terms. Synonyms [Haiduc, Marcus 2008] are also an issue that will likely need to be addressed.
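The data preparation step described above can be sketched as follows (a minimal illustration; the function name and exact rules are assumptions, not the actual srcNLP code):

```python
import re

def normalize_identifier(name):
    """Lowercase an identifier and strip non-alphabetic characters.

    Mirrors the elementary data preparation described in the text; it does
    not split compound identifiers or expand abbreviations.
    """
    return re.sub(r"[^a-z]", "", name.lower())

print(normalize_identifier("m_bufferSize2"))  # -> "mbuffersize"
```

Note that without splitting, `m_bufferSize2` collapses into one token rather than the words "buffer" and "size", which is exactly the limitation the paragraph above identifies.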


Evaluation

Given the heuristics, it is clear that the proposed approach cannot be validated in the same way natural language taggers are; there is no gold standard, since our definitions are not based on English. Therefore, to validate the approach, we need to determine whether it identifies a consistent tag for a given identifier within a system and across systems. The intuition is that if a word is marked consistently within a given system as well as across systems, then the heuristics are more likely revealing something latent about how language is used in software, which supports their validity.

Further, once validated, the data obtained using this approach can be used to apply techniques from the natural language processing domain in order to fine-tune the part of speech assigned to an identifier. The advantage, given the proposed heuristics, is that these models are based on how identifiers are used in code. We believe that more latent knowledge about how identifiers are used in code will be unveiled. We also expect tagging to be consistent for a given identifier between different systems, since the identifier is being used in a consistent manner that the proposed heuristics uncover.

A number of methods were devised for evaluating the results. A preliminary examination of the data is used to determine whether the heuristics are consistent within a system. To that end, we count how many words fall under each separate category (adjective, pronoun, etc.), and then prune out any word that appears only once. This way the data reflects words that were assigned the same part of speech at least twice, which gives at least a modicum of confidence about the consistency of usage. If an identifier is given more than one part of speech within the system, it is counted as a mismatch. A mismatch here means that, despite a word being given the same part of speech at least twice, it can still be assigned a different part of speech somewhere else in the same system. If an identifier is not given more than one part of speech, it is left in the bucket corresponding to the part of speech it was assigned. Finally, a count of how many times each identifier was seen, along with its part of speech, is computed. The resulting data is shown in Table 5.1.
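The counting procedure just described can be sketched as follows (a simplified illustration under assumed data structures, not the actual evaluation code):

```python
from collections import Counter, defaultdict

def summarize(taggings):
    """Summarize (identifier, pos) observations.

    taggings: list of (identifier, pos) pairs, one per tagged occurrence.
    Returns per-category counts of verified identifiers plus a mismatch
    count. Words seen only once are pruned, as described in the text.
    """
    pos_by_word = defaultdict(Counter)
    for word, pos in taggings:
        pos_by_word[word][pos] += 1

    buckets = Counter()
    for word, counts in pos_by_word.items():
        if sum(counts.values()) < 2:   # prune words that appear only once
            continue
        if len(counts) > 1:            # given more than one part of speech
            buckets["mismatch"] += 1
        else:
            buckets[next(iter(counts))] += 1
    return buckets

obs = [("count", "adjective"), ("count", "adjective"),
       ("state", "adjective"), ("state", "pnoun"),
       ("tmp", "pronoun")]             # "tmp" appears once -> pruned
print(summarize(obs))
```

Here "count" is verified as an adjective, "state" becomes a mismatch, and "tmp" is pruned by the at-least-twice threshold.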

Table 5.1 Number of verified identifiers for each system according to part of speech, with percentages

             Blender       Brclad        Cali          Inkscape      Ogre         Average
Verb         1292 (12%)    1411 (12%)    9768 (45%)    1320 (27%)    1109 (40%)   29%
Adjective    4117 (39%)    4278 (36%)    3707 (17%)    1104 (22%)    448 (16%)    26%
Pronoun      2647 (25%)    2608 (22%)    2159 (10%)    998 (20%)     480 (17%)    17%
Prop. Noun   1594 (15%)    2340 (20%)    3129 (14%)    786 (16%)     508 (18%)    16%
Noun         71 (1%)       102 (1%)      1088 (5%)     186 (4%)      79 (3%)      3%
Mismatch     838 (8%)      1040 (9%)     1927 (9%)     516 (11%)     147 (5%)     9%

The first column of Table 5.1 gives the part of speech labels. The numbers that follow are counts for each part of speech according to the systems used. The mismatch row at the bottom represents the number of words in each system that were given more than one part of speech tag, hence lowering the accuracy for the proposed tool.

The data presented in Table 5.1 shows that developers' use of identifiers in each system is fairly consistent given our threshold of at least two. Mismatches only happened between 8% and 11% of the time, with an average of 9%. Noun is the only part of speech that is low. This makes sense, since our heuristic states that nouns occur only in class declarations because they are part of object compositions. Likewise, a large number of verbs and adjectives is reasonable: verbs represent functions with some potential side effect, and these are everywhere in object-oriented projects; adjectives represent functions that return data about objects, or variables that hold data about objects (or arrays), which are also very common. In essence, the data reflects that naming is mostly consistent, as well as a fairly typical breakdown of the types of identifiers within a system.

Since the preliminary evaluation produced promising results, another evaluation was done on 10 open source systems using two methods. The first was the same as the previous evaluation, that is, evaluating the consistency of part of speech of words within each system. The second was evaluating part of speech across systems. Information about each system involved in this evaluation is given in Table 5.2. As the table shows, 9 of the systems are more than 100K SLOC, and 1 system is only 36K SLOC; the small system is included to investigate how the results are affected by including or excluding it in the cross-system evaluation. The systems' sizes range between 36K and 700K SLOC, and the number of unique identifiers per system ranges between 1,526 and 84K. Some of the systems are from the same domain and the rest are from different domains. It is also important to mention that all the systems are written in C++, except Vim, which is mostly written in C.


Table 5.2 The 10 open source systems used in the evaluation

System                  SLOC      Domain                       Total Identifiers
Monkey Studio           182,337   IDE                          6,185
Code::Blocks            727,875   IDE                          23,912
Kdeveloper              145,095   IDE                          4,246
Vim                     321,461   Text editor                  5,259
Programmer's Notepad    222,494   Text editor                  10,032
Cocos2d                 725,966   2D game engine               20,407
OpenCV                  732,553   Real-time computer vision    14,365
Inkscape                483,462   Drawing                      10,607
ParaView                600,080   Visualization                84,373
Chipmunk2D              36,081    2D game physics library      1,526

Table 5.3 displays the results of the first part of the evaluation, which assesses the consistency of part of speech of words within each system. The purpose is to investigate the number of verified identifiers for each system according to part of speech, including percentages. The data shows that programmers' use of identifiers in each system is also fairly consistent given a threshold of at least two. Mismatches happen between 9% and 23% of the time, with an average of 17%. The lowest part of speech is nouns, the same as in the preliminary evaluation, with 2% on average. The highest part of speech is adjectives, with almost 50%. This large number is reasonable, since adjectives represent functions that return data about objects, or variables that hold data about objects (arrays), which are very common, as already mentioned for the preliminary evaluation. In essence, the results over 10 systems confirm that naming is mostly consistent, and show a fairly typical breakdown of the types of identifiers within a system.


Evaluating part of speech across the systems is the second part of the evaluation process. First, a pre-step of checking whether there are common identifiers between systems was taken. Table 5.4 on page 50 shows the results of this step, and the data has a number of implications. There is a high number of common identifiers between systems from the same domain, such as Monkey Studio, Code::Blocks, and Programmer's Notepad. For example, Monkey Studio and Programmer's Notepad share 27% of their identifiers, Code::Blocks and Programmer's Notepad 14%, and Monkey Studio and Code::Blocks 12%. These high percentages led us to investigate the reasons for such results. See APPENDIX A for some of the commonly occurring identifiers between Monkey Studio and Code::Blocks. These two systems are open source and both belong to the same domain, so it is plausible that one system served as the base for the other. Monkey Studio was created in 2005, whereas Programmer's Notepad was created in 2002, which suggests that Monkey Studio may have used some of Programmer's Notepad's original code. Additionally, both systems use Qt for the GUI. In short, these three systems share the same core code base and, as such, have a large number of common identifiers.

Another observation concerns two systems in the same domain but written in different programming languages, to see whether they share common identifiers. This can be illustrated by examining Vim and Programmer's Notepad, since Vim is mostly written in C while Programmer's Notepad is mostly written in C++. These two programming languages are similar in concept, but they are still different. Table 5.4 shows that there are 494 common identifiers between these two systems. This gives evidence that there can be commonalities between systems from the same domain even when they are written in different programming languages.

Among all 10 systems there are only 66 common identifiers out of tens of thousands of unique identifiers. See APPENDIX B for a full list of these identifiers. They include single-letter identifiers, common abbreviations, and short full words that programmers commonly use. The average pairwise commonality was 3.7% and the median was 2.8%; the standard deviation was 4.3 and the variance was 18.7. In contrast, when the small system, Chipmunk2D, is excluded from this evaluation, the results show a total of 117 common identifiers, with an average of 4.2%, a median of 3.2%, a standard deviation of 4.7, and a variance of 22. See APPENDIX B for a full list of these identifiers. Two box plots of the results are shown in Figure 5.1 and Figure 5.2 on page 51.
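The pairwise commonality and summary statistics described above can be computed along these lines (a sketch with made-up identifier sets; the exact denominator used for the percentage is an assumption, since the thesis does not spell it out):

```python
import statistics

def commonality_stats(systems):
    """Pairwise common-identifier percentages across systems.

    systems: dict mapping system name -> set of unique identifiers.
    The percentage for a pair is taken here as the intersection size
    relative to the smaller system's identifier set (one plausible
    definition, assumed for illustration).
    """
    names = list(systems)
    pcts = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            common = systems[a] & systems[b]
            pcts.append(100 * len(common) / min(len(systems[a]), len(systems[b])))
    shared_by_all = set.intersection(*systems.values())
    return {
        "common_to_all": len(shared_by_all),
        "mean": statistics.mean(pcts),
        "median": statistics.median(pcts),
        "stdev": statistics.stdev(pcts),
    }

demo = {
    "sysA": {"i", "count", "state", "tmp", "buf"},
    "sysB": {"i", "count", "ptr", "len"},
    "sysC": {"i", "state", "len", "pos"},
}
print(commonality_stats(demo))
```

Running the same computation over the real identifier sets, with and without Chipmunk2D, would reproduce the two sets of statistics reported above.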

We can conclude that there is a direct correlation between system size and commonality. Large systems tend to have a large number of common identifiers, whereas small systems have a small number. Code::Blocks, for instance, has a very large number of commonalities with ParaView, since both of these systems are more than 200K SLOC, even though they are from different domains.

After confirming that there is commonality between all the systems, the next step is to evaluate the part of speech of these common identifiers and investigate whether they have the same part of speech across systems, in order to assess the consistency of part of speech across systems. This part was carried out using nlpCMP, the evaluation tool that uses srcNLP output to compare parts of speech for each identifier across systems (see Figure 5.3 on page 52). The output of this tool is used to investigate the consistency of the common identifiers across the 10 systems, as well as across the 9 systems that exclude Chipmunk2D, the small system. See APPENDIX C for the list of the 66 common identifiers found between all 10 systems, with the common part of speech for each identifier and how consistent it is, and see APPENDIX D for the corresponding list of the 117 common identifiers found between the 9 systems. A simple calculation of how single letters were most often tagged found that 0% were tagged as verbs or nouns, while 74% were tagged as proper nouns, 47% as adjectives, and 16% as pronouns (a letter can carry more than one tag). Overall, this suggests that these single letters name first-class user-defined objects, which could not have been deduced without this evaluation.
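The cross-system consistency percentages reported in APPENDIX C and APPENDIX D can be computed along these lines (a sketch with made-up data; nlpCMP's actual input format is not shown in the thesis):

```python
def consistency(tags_by_system, pos_set):
    """Fraction of systems in which an identifier carries its common POS.

    tags_by_system: dict mapping system name -> set of POS tags observed
    for one identifier in that system.
    pos_set: the POS tags considered the identifier's common tags.
    """
    hits = sum(1 for tags in tags_by_system.values() if tags & pos_set)
    return hits / len(tags_by_system)

# 'depth' is tagged as an adjective in 10 of 10 systems (cf. APPENDIX C)
depth_tags = {f"sys{i}": {"adjective"} for i in range(10)}
print(consistency(depth_tags, {"adjective"}))  # -> 1.0
```

An identifier like 'options', tagged with its common parts of speech in only 6 of 10 systems, would score 0.6 under this definition, matching the 60% entry in APPENDIX C.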


Table 5.3 The consistency of part of speech within each system

The table gives, for each of the 10 systems, the counts and percentages of identifiers verified as Verb, Adjective, Pronoun, Prop. Noun, and Noun, plus the Mismatch count, along with the average and standard deviation across systems. (The table was rotated in the original document and its per-cell data could not be recovered cleanly.)

Table 5.4 The total number of identifiers common between each pair of systems, with percentages

The table gives, for each pair of the 10 systems, the number of common identifiers and the corresponding percentage. It also reports 66 identifiers common to all 10 systems, and 117 identifiers common to all systems when Chipmunk2D (the small system) is excluded. (The table was rotated in the original document and its per-cell data could not be recovered cleanly.)

Figure 5.1 Box plot of program identifiers common between all 10 systems

Figure 5.2 Box plot of program identifiers common between the systems without Chipmunk2D


Figure 5.3 A visualization of the nlpCMP output structure for the usage of the identifier 'depth' across 4 systems


Conclusions and Future Research

The contribution of this thesis is a novel approach that marks each program identifier with an appropriate part of speech based on how the identifier is used in the source code. The evaluation showed that programmers tend to use identifiers within a system in a consistent way; this was assessed via part of speech consistency and resulted in reasonable percentages.

6.1 Main Findings

The evaluation provided evidence that program identifiers are used fairly

consistently, according to our part of speech definition, within a system. That is,

developers assign identifiers’ semantics and use them in a consistent manner throughout

the project. We observed this trend across 10 different open source software projects. This

supports the usefulness of our programming-based part of speech approach as a practical

alternative to natural language part of speech tagging.

Another important finding is that, within the systems we studied, there is very little

overlap in identifier names. This result has not been previously discussed in the literature

and could have a significant impact on the development of program understanding tools that

rely on dictionaries to infer the meaning of identifiers. The study here did not involve

identifier splitting or any word stemming. This may have a large impact on the

commonality of identifiers between systems. As such, further study of this topic is

warranted.


6.2 Future Research Directions

Future work will pursue a new path of heuristics evaluation. The next step is to apply various filtering techniques (splitting, abbreviation expansion, and stemming) in order to remove threats to validity. Since only identifiers are tagged in this study, that is, types and class names are not tagged, future work will involve tagging these aspects as nouns and then evaluating the results. Evaluating roughly 100 large systems, categorized by domain, would further test the validity of the presented work and give more insight into the usefulness of the approach for various software engineering tasks.

Moreover, since the evaluation in CHAPTER 5 shines a positive light on the proposed heuristics, two new objectives will be pursued. The first is to create a database of words and their typical usages for use by the research community, similar to WordNet's use in the NLP community; this is similar to work done in [Falleri et al. 2010]. Using this database, a large number of words from varying systems, together with their typical usages, can be recorded. This will allow the creation of models based on typical word usage with respect to source code, which would be made available to the research community as a whole. One way to do this is to use the aforementioned database to find typical word usages and create a mathematical model. Another is to more fully investigate the relationship between stereotypes and the proposed part of speech heuristics. Beyond that, investigating higher-order patterns such as the verb-direct-object pairs discussed in section 2.2.1 is one of the goals. In the end, the hope is that this markup can help support research in identifier naming [Binkley, Hearn, Lawrie 2011], code summarization [Haiduc, Aponte, Marcus 2010; Haiduc, Aponte, Moreno, Marcus 2010], and bug fixing [Lo 2015].


APPENDIX A

addwatcher, allocation, appendaction, bracematch, buflength, ccpos, ccstart, chbase, chcolor, checkforchangeoutsidepaint, chunkend, chunkendoffset, chunkoffset, classifywordrb, colourisecamldoc, colourisetadsmsgparam, colourisetadsstring, copydesiredcolours, debugprintf, ends, ensurestyledto, expand, extendwordselect, fnt, foldcamldoc, foldeiffeldockeywords, grab, hsstart, inhexnumber, initstate, isadirectivechar, isanhtmlchar, isanidentifier, iscaml, isdbcsleadbyte, iseiffelcomment, isfillup, islineendchar, issavepoint, item, keywordisambiguous, kwlast, lastdeferred, lengthfind, lengthstyle, linefromhandle, linerangebreak, linerangestart, lineremove, maycoalesce, mergemarkers, mouseclick, nexttabpos, nextwordend, nextwordstart, notifymodified, opcode, paintcontents, pascalkeywords, performeddeletion, pfniscommentleader, prompt, redrawselmargin, removenumber, sendscintilla, setfiles, setforeback, setstyles, settabsize, startmod, starty, tabsz, uparrow, userwords, xend

Figure 6.1 Some of the common identifiers between Monkey Studio and Code::Blocks


APPENDIX B

The 66 common identifiers between all the systems:

Single letters: a, b, c, d, e, f, g, i, j, k, l, m, n, p, r, s, t, v, w

Full words: begin, check, child, context, copy, count, data, delta, depth, done, end, file, filter, first, group, hash, id, key, len, line, lines, mask, max, message, name, next, offset, options, other, out, parent, result, retval, root, size, start, state, target, temp, type, value

Abbreviations: idx, init, pos, ptr, str, val

The 117 common identifiers without the Chipmunk2D system:

Single letters: a, b, c, d, e, f, g, i, j, k, l, m, n, p, r, s, t, v, w

Full words: action, append, base, begin, check, child, code, context, copy, count, create, current, data, delta, depth, done, empty, end, error, event, file, files, filter, first, flag, flags, found, from, function, group, hash, id, insert, item, key, last, line, lines, list, location, mask, max, menu, message, mode, name, next, number, offset, ok, options, other, out, parent, path, prefix, range, result, root, row, size, start, state, stream, target, text, to, type, value, widget

Abbreviations: arg, args, buf, cc, ch, col, dest, dir, fd, fp, idx, init, len, msg, num, op, params, pos, ptr, res, ret, retval, sc, si, str, temp, tmp, val


APPENDIX C

Identifier | Common POS across systems | Total occurrence out of 10 | Consistency percentage of the common POS
a | pnoun and adjective | 10 | 100%
b | adjective | 10 | 100%
c | pnoun | 10 | 100%
d | pnoun, pronoun and adjective | 8 | 80%
e | pnoun and pronoun | 9 | 90%
f | pnoun | 9 | 90%
g | pnoun | 8 | 80%
i | adjective | 10 | 100%
j | adjective | 10 | 100%
k | adjective | 9 | 90%
l | pnoun | 9 | 90%
m | pnoun | 10 | 100%
n | pnoun and adjective | 10 | 100%
p | pnoun | 10 | 100%
r | adjective | 10 | 100%
s | pnoun | 10 | 100%
t | pnoun | 10 | 100%
v | pnoun | 10 | 100%
w | pnoun, pronoun and adjective | 9 | 90%
begin | adjective | 8 | 80%
check | adjective | 8 | 80%
child | pronoun | 9 | 90%
context | pronoun | 10 | 100%
copy | adjective | 9 | 90%
count | adjective | 10 | 100%
data | pnoun and adjective | 10 | 100%
delta | pnoun | 9 | 90%
depth | adjective | 10 | 100%
done | adjective | 9 | 90%
end | pnoun and adjective | 10 | 100%
file | adjective | 9 | 90%
filter | pnoun | 8 | 80%
first | adjective | 10 | 100%
group | pronoun | 9 | 90%
hash | pnoun and adjective | 7 | 70%
id | pnoun | 10 | 100%
idx | pnoun and adjective | 10 | 100%
init | adjective | 9 | 90%
key | pronoun and adjective | 9 | 90%
len | pnoun and adjective | 9 | 90%
line | pronoun and adjective | 10 | 100%
lines | adjective | 10 | 100%
mask | pnoun | 10 | 100%
max | adjective | 9 | 90%
message | adjective | 8 | 80%
name | adjective | 10 | 100%
next | pronoun | 9 | 90%
offset | adjective | 9 | 90%
options | pnoun and pronoun | 6 | 60%
other | pronoun | 9 | 90%
out | pnoun, pronoun and adjective | 8 | 80%
parent | pronoun | 10 | 100%
pos | pnoun | 10 | 100%
ptr | adjective | 9 | 90%
result | pronoun | 10 | 100%
retval | adjective | 9 | 90%
root | pronoun | 9 | 90%
size | adjective | 10 | 100%
start | pnoun and adjective | 10 | 100%
state | adjective | 10 | 100%
str | pronoun | 9 | 90%
target | pnoun and pronoun | 8 | 80%
temp | pnoun and pronoun | 9 | 90%
type | pnoun | 10 | 100%
val | pnoun | 10 | 100%
value | pnoun and pronoun | 10 | 100%
Average | | | 93%


APPENDIX D

Identifier | Common POS | Total occurrence out of 9 | Consistency of the common POS
a | pnoun and adjective | 9 | 100%
b | adjective | 9 | 100%
c | pnoun and adjective | 9 | 100%
d | pronoun and adjective | 8 | 89%
e | pronoun | 9 | 100%
f | pronoun, adjective and pnoun | 9 | 100%
g | pnoun | 7 | 78%
i | pronoun, adjective and pnoun | 9 | 100%
j | adjective | 9 | 100%
k | adjective | 9 | 100%
l | pronoun and pnoun | 9 | 100%
m | pnoun and pronoun | 9 | 100%
n | pronoun, adjective and pnoun | 9 | 100%
p | pnoun and pronoun | 9 | 100%
r | adjective and pronoun | 9 | 100%
s | pronoun and pnoun | 9 | 100%
t | pronoun and pnoun | 9 | 100%
v | adjective and pnoun | 9 | 100%
w | adjective | 9 | 100%
action | pronoun | 8 | 89%
append | adjective | 7 | 78%
arg | adjective | 9 | 100%
args | pnoun | 9 | 100%
base | adjective | 9 | 100%
begin | adjective | 8 | 89%
buf | adjective | 9 | 100%
cc | adjective | 7 | 78%
ch | adjective | 8 | 89%
check | adjective | 8 | 89%
child | pronoun | 6 | 67%
code | adjective | 7 | 78%
col | adjective | 9 | 100%
context | pronoun | 9 | 100%
copy | adjective | 9 | 100%
count | adjective | 9 | 100%
create | verb and adjective | 6 | 67%
current | pronoun | 8 | 89%
data | pronoun, adjective and pnoun | 9 | 100%
delta | pnoun and adjective | 8 | 89%
depth | adjective | 9 | 100%
dest | pronoun | 8 | 89%
dir | pronoun | 9 | 100%
done | adjective | 9 | 100%
empty | adjective | 9 | 100%
end | pnoun and adjective | 9 | 100%
error | adjective | 8 | 89%
event | pronoun | 9 | 100%
fd | adjective and pnoun | 7 | 78%
file | pnoun and adjective | 8 | 89%
files | pnoun | 7 | 78%
filter | pnoun | 7 | 78%
first | adjective | 9 | 100%
flag | adjective | 9 | 100%
flags | adjective and pnoun | 8 | 89%
found | adjective | 9 | 100%
fp | pronoun | 8 | 89%
from | adjective | 9 | 100%
function | pronoun | 6 | 67%
group | pronoun | 8 | 89%
hash | adjective | 7 | 78%
id | pnoun | 9 | 100%
idx | pnoun and adjective | 9 | 100%
init | adjective | 9 | 100%
insert | adjective | 9 | 100%
item | pronoun | 9 | 100%
key | pronoun | 9 | 100%
last | adjective | 8 | 89%
len | adjective | 9 | 100%
line | pronoun, adjective and pnoun | 9 | 100%
lines | adjective and pnoun | 9 | 100%
list | pronoun and pnoun | 9 | 100%
location | pnoun | 8 | 89%
mask | pnoun | 9 | 100%
max | adjective | 9 | 100%
menu | pnoun and pronoun | 7 | 78%
message | pronoun, adjective and pnoun | 7 | 78%
mode | adjective and pnoun | 8 | 89%
msg | pnoun | 8 | 89%
name | pronoun, adjective and pnoun | 9 | 100%
next | pronoun | 8 | 89%
num | adjective | 9 | 100%
number | adjective | 9 | 100%
offset | adjective | 9 | 100%
ok | adjective | 9 | 100%
op | pnoun | 9 | 100%
options | pnoun | 6 | 67%
other | pronoun | 8 | 89%
out | pnoun and adjective | 8 | 89%
params | pnoun and pronoun | 8 | 89%
parent | pronoun | 9 | 100%
path | pronoun | 9 | 100%
pos | pnoun and adjective | 9 | 100%
prefix | pronoun | 7 | 78%
ptr | adjective | 8 | 89%
range | adjective | 8 | 89%
res | adjective and pnoun | 9 | 100%
result | pronoun, adjective and pnoun | 9 | 100%
ret | adjective and pronoun | 9 | 100%
retval | adjective | 9 | 100%
root | pronoun | 8 | 89%
row | adjective | 8 | 89%
sc | pronoun and pnoun | 8 | 89%
si | adjective and pronoun | 6 | 67%
size | adjective | 9 | 100%
start | pnoun and adjective | 9 | 100%
state | adjective | 9 | 100%
str | pronoun | 9 | 100%
stream | pronoun | 8 | 89%
target | pnoun and adjective | 8 | 89%
temp | pronoun | 9 | 100%
text | pronoun | 9 | 100%
tmp | pronoun | 8 | 89%
to | adjective | 9 | 100%
type | pronoun and pnoun | 9 | 100%
val | adjective and pnoun | 9 | 100%
value | pronoun and pnoun | 9 | 100%
widget | pronoun | 9 | 100%
Average of consistency for all the common POS | | | 94%


REFERENCES

[Abebe, Haiduc, Tonella, Marcus 2011] Abebe, S. L., Haiduc, S., Tonella, P., and Marcus, A.,

(2011), "The effect of lexicon bad smells on concept location in source code", in

Proceedings of Source Code Analysis and Manipulation (SCAM), 2011 11th IEEE

International Working Conference on, pp. 125-134.

[Abebe, Tonella 2010] Abebe, S. L. and Tonella, P., (2010), "Natural Language Parsing of

Program Element Names for Concept Extraction", in Proceedings of Program

Comprehension (ICPC), 2010 IEEE 18th International Conference on, June 30 2010-July

2 2010, pp. 156-159.

[Alhindawi, Dragan, Collard, Maletic 2013] Alhindawi, N., Dragan, N., Collard, M. L., and

Maletic, J., (2013), "Improving feature location by enhancing source code with

stereotypes", in Proceedings of Software Maintenance (ICSM), 2013 29th IEEE

International Conference on, pp. 300-309.

[Allamanis, Barr, Bird, Sutton 2014] Allamanis, M., Barr, E. T., Bird, C., and Sutton, C., (2014),

"Learning natural coding conventions", in Proceedings of the 22nd ACM SIGSOFT

International Symposium on Foundations of Software Engineering. Hong Kong, China:

ACM, pp. 281-293.

[Allen 1987] Allen, J., (1987), "Natural language understanding".

[Aman, Okazaki 2008] Aman, H. and Okazaki, H., (2008), "Impact of Comment Statement on

Code Stability in Open Source Development", in Proceedings of JCKBSE, pp. 415-419.

[Anquetil, Lethbridge 1998] Anquetil, N. and Lethbridge, T., (1998), "Assessing the relevance of

identifier names in a legacy software system", in Proceedings of the 1998 conference of

the Centre for Advanced Studies on Collaborative research. Toronto, Ontario, Canada:

IBM Press, pp. 4.

64

[Anquetil, Lethbridge 1999] Anquetil, N. and Lethbridge, T. C., (1999), "Recovering software

architecture from the names of source files", Journal of Software Maintenance, vol. 11,

no. 3, pp. 201-221.

[Arafat, Riehle 2009] Arafat, O. and Riehle, D., (2009), "The commenting practice of open

source", in Proceedings of Proceedings of the 24th ACM SIGPLAN conference

companion on Object oriented programming systems languages and applications, pp.

857-864.

[Baldridge, Morton, Bierner 2005] Baldridge, J., Morton, T., and Bierner, G., (2005), "OpenNLP

maxent package in Java".

[Biggerstaff, Mitbander, Webster 1993] Biggerstaff, T. J., Mitbander, B. G., and Webster, D.,

(1993), "The concept assignment problem in program understanding", in Proceedings of

the 15th international conference on Software Engineering. Baltimore, Maryland, USA:

IEEE Computer Society Press, pp. 482-498.

[Binkley, Hearn, Lawrie 2011] Binkley, D., Hearn, M., and Lawrie, D., (2011), "Improving

identifier informativeness using part of speech information", in Proceedings of the 8th

Working Conference on Mining Software Repositories. Waikiki, Honolulu, HI, USA:

ACM, pp. 203-206.

[Binkley et al. 2013] Binkley, D., Lawrie, D., Pollock, L., Hill, E., and Vijay-Shanker, K.,

(2013), "A dataset for evaluating identifier splitters", in Proceedings of the 10th Working

Conference on Mining Software Repositories. San Francisco, CA, USA: IEEE Press, pp.

401-404.

[Boehm 1981] Boehm, B. W.,(1981),Software engineering economics, Prentice-hall Englewood

Cliffs (NJ).

[Booch 1982] Booch, G., (1982), "Object-oriented design", ACM SIGAda Ada Letters, vol. 1, no.

3, pp. 64-76.

65

[Boogerd, Moonen 2008] Boogerd, C. and Moonen, L., (2008), "Assessing the value of coding

standards: An empirical study", in Proceedings of Software Maintenance, 2008. ICSM

2008. IEEE International Conference on, Sept. 28 2008-Oct. 4 2008, pp. 277-286.

[Brants 2000] Brants, T., (2000), "TnT: a statistical part-of-speech tagger", in Proceedings of the

sixth conference on Applied natural language processing. Seattle, Washington:

Association for Computational Linguistics, pp. 224-231.

[Brill 1992] Brill, E., (1992), "A simple rule-based part of speech tagger", in Proceedings of the

third conference on Applied natural language processing. Trento, Italy: Association for

Computational Linguistics, pp. 152-155.

[Buse, Weimer 2010] Buse, R. P. L. and Weimer, W. R., (2010), "Learning a Metric for Code

Readability", IEEE Trans. Softw. Eng., vol. 36, no. 4, pp. 546-558.

[Butler, Grogono, Shinghal, Tjandra 1995] Butler, G., Grogono, P., Shinghal, R., and Tjandra, I.,

(1995), "Retrieving information from data flow diagrams", in Proceedings of Reverse

Engineering, 1995., Proceedings of 2nd Working Conference on, 14-16 Jul 1995, pp. 22-

29.

[Butler, Wermelinger, Yu 2015] Butler, S., Wermelinger, M., and Yu, Y., (2015), "Investigating

naming convention adherence in Java references".

[Butler, Wermelinger, Yu, Sharp 2011] Butler, S., Wermelinger, M., Yu, Y., and Sharp, H.,

(2011), "Improving the tokenisation of identifier names", in ECOOP 2011–Object-

Oriented Programming, Springer, pp. 130-154.

[Caprile, Tonella 1999] Caprile, B. and Tonella, P., (1999), "Nomen est omen: analyzing the

language of function identifiers", in Proceedings of Reverse Engineering, 1999.

Proceedings. Sixth Working Conference on, 6-8 Oct 1999, pp. 112-122.

66

[Caprile, Tonella 2000] Caprile, B. and Tonella, P., (2000), "Restructuring program identifier

names", in Proceedings of Software Maintenance, 2000. Proceedings. International

Conference on, 2000, pp. 97-107.

[Carrasco, Gelbukh 2003] Carrasco, R. M. and Gelbukh, A., (2003), "Evaluation of TnT Tagger

for Spanish", in Proceedings of Computer Science, 2003. ENC 2003. Proceedings of the

Fourth Mexican International Conference on, pp. 18-25.

[Corazza, Di Martino, Maggio 2012] Corazza, A., Di Martino, S., and Maggio, V., (2012),

"LINSEN: An efficient approach to split identifiers and expand abbreviations", in

Proceedings of Software Maintenance (ICSM), 2012 28th IEEE International Conference

on, pp. 233-242.

[Deissenbock, Pizka 2005] Deissenbock, F. and Pizka, M., (2005), "Concise and consistent

naming [software system identifier naming]", in Proceedings of Program Comprehension,

2005. IWPC 2005. Proceedings. 13th International Workshop on, 15-16 May 2005, pp.

97-106.

[Dermatas, Kokkinakis 1995] Dermatas, E. and Kokkinakis, G., (1995), "Automatic stochastic

tagging of natural language texts", Comput. Linguist., vol. 21, no. 2, pp. 137-163.

[Dragan, Collard, Maletic 2006] Dragan, N., Collard, M. L., and Maletic, J., (2006), "Reverse

engineering method stereotypes", in Proceedings of Software Maintenance, 2006.

ICSM'06. 22nd IEEE International Conference on, pp. 24-34.

[Dragan, Collard, Maletic 2010] Dragan, N., Collard, M. L., and Maletic, J., (2010), "Automatic

identification of class stereotypes", in Proceedings of Software Maintenance (ICSM),

2010 IEEE International Conference on, pp. 1-10.

[Dzeroski, Erjavec, Zavrel 2000] Dzeroski, S., Erjavec, T., and Zavrel, J., (2000),

"Morphosyntactic Tagging of Slovene: Evaluating Taggers and Tagsets", in Proceedings

of LREC.

[Elshoff, Marcotty 1982] Elshoff, J. L. and Marcotty, M., (1982), "Improving computer program

readability to aid modification", Commun. ACM, vol. 25, no. 8, pp. 512-521.

[Enslen, Hill, Pollock, Vijay-Shanker 2009] Enslen, E., Hill, E., Pollock, L., and Vijay-Shanker,

K., (2009), "Mining source code to automatically split identifiers for software analysis",

in Proceedings of the 2009 6th IEEE International Working Conference on Mining

Software Repositories: IEEE Computer Society, pp. 71-80.

[Erlikh 2000] Erlikh, L., (2000), "Leveraging legacy system dollars for e-business", IT

professional, vol. 2, no. 3, pp. 17-23.

[Eshkevari et al. 2011] Eshkevari, L. M., Arnaoudova, V., Penta, M. D., Oliveto, R.,

Guéhéneuc, Y.-G., and Antoniol, G., (2011), "An exploratory study of

identifier renamings", in Proceedings of the 8th Working Conference on Mining Software

Repositories. Waikiki, Honolulu, HI, USA: ACM, pp. 33-42.

[Etzkorn, Davis 1994] Etzkorn, L. H. and Davis, C. G., (1994), "A documentation-related

approach to object-oriented program understanding", in Proceedings of Program

Comprehension, 1994. Proceedings., IEEE Third Workshop on, 14-15 Nov 1994, pp. 39-

45.

[Etzkorn, Davis, Bowen 2001] Etzkorn, L. H., Davis, C. G., and Bowen, L. L., (2001), "The

language of comments in computer software: A sublanguage of English", Journal of

Pragmatics, vol. 33, no. 11, pp. 1731-1756.

[Falleri et al. 2010] Falleri, J. R., Huchard, M., Lafourcade, M., Nebut, C., Prince, V., and Dao,

M., (2010), "Automatic Extraction of a WordNet-Like Identifier Network from

Software", in Proceedings of Program Comprehension (ICPC), 2010 IEEE 18th

International Conference on, June 30 2010-July 2 2010, pp. 4-13.

[Fluri, Wursch, Gall 2007] Fluri, B., Wursch, M., and Gall, H. C., (2007), "Do Code and

Comments Co-Evolve? On the Relation between Source Code and Comment Changes",

in Proceedings of Reverse Engineering, 2007. WCRE 2007. 14th Working Conference

on, 28-31 Oct. 2007, pp. 70-79.

[Freitas, da Cruz, Henriques 2012] Freitas, J. L., da Cruz, D., and Henriques, P. R., (2012), "A

Comment Analysis Approach for Program Comprehension", in Proceedings of Software

Engineering Workshop (SEW), 2012 35th Annual IEEE, pp. 11-20.

[Freitas 2011] Freitas, J. L. F. d., (2011), "Comment analysis for program comprehension".

[Fry et al. 2008] Fry, Z. P., Shepherd, D., Hill, E., Pollock, L., and Vijay-Shanker, K., (2008),

"Analysing source code: looking for useful verb-direct object pairs in all the right

places", Software, IET, vol. 2, no. 1, pp. 27-36.

[Guerrouj, Di Penta, Antoniol, Guéhéneuc 2013] Guerrouj, L., Di Penta, M., Antoniol, G., and

Guéhéneuc, Y. G., (2013), "Tidier: an identifier splitting approach using speech

recognition techniques", Journal of Software: Evolution and Process, vol. 25, no. 6, pp.

575-599.

[Guilder 1995] Guilder, L. V., (1995), "Automated Part of Speech Tagging: A Brief Overview",

Date Accessed: 9/12/2015.

[Gupta, Malik, Pollock, Vijay-Shanker 2013] Gupta, S., Malik, S., Pollock, L., and Vijay-

Shanker, K., (2013), "Part-of-speech tagging of program identifiers for improved text-

based software engineering tools", in Proceedings of Program Comprehension (ICPC),

2013 IEEE 21st International Conference on, 20-21 May 2013, pp. 3-12.

[Haiduc, Aponte, Marcus 2010] Haiduc, S., Aponte, J., and Marcus, A., (2010), "Supporting

program comprehension with source code summarization", in Proceedings of the 32nd

ACM/IEEE International Conference on Software Engineering - Volume 2, pp. 223-226.

[Haiduc, Aponte, Moreno, Marcus 2010] Haiduc, S., Aponte, J., Moreno, L., and Marcus, A.,

(2010), "On the use of automated text summarization techniques for summarizing source

code", in Proceedings of Reverse Engineering (WCRE), 2010 17th Working Conference

on, pp. 35-44.

[Haiduc, Marcus 2008] Haiduc, S. and Marcus, A., (2008), "On the Use of Domain Terms in

Source Code", in Proceedings of Program Comprehension, 2008. ICPC 2008. The 16th

IEEE International Conference on, 10-13 June 2008, pp. 113-122.

[Hill et al. 2014] Hill, E., Binkley, D., Lawrie, D., Pollock, L., and Vijay-Shanker, K., (2014),

"An empirical study of identifier splitting techniques", Empirical Softw. Engg., vol. 19,

no. 6, pp. 1754-1780.

[Hill et al. 2008] Hill, E., Fry, Z. P., Boyd, H., Sridhara, G., Novikova, Y., Pollock, L., and

Vijay-Shanker, K., (2008), "AMAP: automatically mining abbreviation expansions in

programs to enhance software maintenance tools", in Proceedings of the 2008

international working conference on Mining software repositories. Leipzig, Germany:

ACM, pp. 79-88.

[Knuth 2003] Knuth, D., (2003), "Selected Papers on Computer Languages", Stanford,

Calif.: CSLI Publications, Center for the Study of Language and Information.

[Lawrie, Feild, Binkley 2006] Lawrie, D., Feild, H., and Binkley, D., (2006), "Syntactic

identifier conciseness and consistency", in Proceedings of Source Code Analysis and

Manipulation, 2006. SCAM'06. Sixth IEEE International Workshop on, pp. 139-148.

[Lawrie, Feild, Binkley 2007] Lawrie, D., Feild, H., and Binkley, D., (2007), "Extracting

Meaning from Abbreviated Identifiers", in Proceedings of Source Code Analysis and

Manipulation, 2007. SCAM 2007. Seventh IEEE International Working Conference on,

Sept. 30 2007-Oct. 1 2007, pp. 213-222.

[Lawrie, Morrell, Feild, Binkley 2006] Lawrie, D., Morrell, C., Feild, H., and Binkley, D.,

(2006), "What's in a Name? A Study of Identifiers", in Proceedings of Program

Comprehension, 2006. ICPC 2006. 14th IEEE International Conference on, pp. 3-12.

[Liblit, Begel, Sweetser 2006] Liblit, B., Begel, A., and Sweetser, E., (2006), "Cognitive

perspectives on the role of naming in computer programs", in Proceedings of the 18th

annual psychology of programming workshop.

[Mihalcea 2003] Mihalcea, R., (2003), "Performance analysis of a part of speech tagging task",

in Computational Linguistics and Intelligent Text Processing, Springer, pp. 158-167.

[Ohba, Gondow 2005] Ohba, M. and Gondow, K., (2005), "Toward mining "concept keywords"

from identifiers in large software projects", in Proceedings of the 2005 international

workshop on Mining software repositories. St. Louis, Missouri: ACM, pp. 1-5.

[Pollock et al. 2007] Pollock, L., Vijay-Shanker, K., Shepherd, D., Hill, E., Fry, Z. P., and

Maloor, K., (2007), "Introducing natural language program analysis", in Proceedings of

the 7th ACM SIGPLAN-SIGSOFT workshop on Program analysis for software tools and

engineering. San Diego, California, USA: ACM, pp. 15-16.

[Rajlich, Gosavi 2004] Rajlich, V. and Gosavi, P., (2004), "Incremental change in object-

oriented programming", Software, IEEE, vol. 21, no. 4, pp. 62-69.

[Rajlich, Wilde 2002] Rajlich, V. and Wilde, N., (2002), "The role of concepts in program

comprehension", in Proceedings of Program Comprehension, 2002. Proceedings. 10th

International Workshop on, pp. 271-278.

[Ratnaparkhi 1996] Ratnaparkhi, A., (1996), "A maximum entropy model for part-of-speech

tagging", in Proceedings of the conference on empirical methods in

natural language processing, pp. 133-142.

[Rilling, Klemola 2003] Rilling, J. and Klemola, T., (2003), "Identifying comprehension

bottlenecks using program slicing and cognitive complexity metrics", in Proceedings of

Program Comprehension, 2003. 11th IEEE International Workshop on, 10-11 May 2003,

pp. 115-124.

[Shepherd et al. 2007] Shepherd, D., Fry, Z. P., Hill, E., Pollock, L., and Vijay-Shanker, K.,

(2007), "Using natural language program analysis to locate and understand action-

oriented concerns", in Proceedings of the 6th international conference on Aspect-oriented

software development. Vancouver, British Columbia, Canada: ACM, pp. 212-224.

[Shepherd, Pollock, Vijay-Shanker 2007] Shepherd, D., Pollock, L., and Vijay-Shanker, K.,

(2007), "Case study: supplementing program analysis with natural language analysis to

improve a reverse engineering task", in Proceedings of the 7th ACM

SIGPLAN-SIGSOFT workshop on Program analysis for software tools and engineering,

pp. 49-54.

[Takang, Grubb, Macredie 1996] Takang, A. A., Grubb, P. A., and Macredie, R. D., (1996), "The

effects of comments and identifier names on program comprehensibility: an experimental

investigation", J. Prog. Lang., vol. 4, no. 3, pp. 143-167.

[Tan, Yuan, Krishna, Zhou 2007] Tan, L., Yuan, D., Krishna, G., and Zhou, Y., (2007),

"/* iComment: Bugs or bad comments? */", in ACM SIGOPS Operating Systems

Review, pp. 145-158.

[Tenny 1988] Tenny, T., (1988), "Program readability: Procedures versus comments", Software

Engineering, IEEE Transactions on, vol. 14, no. 9, pp. 1271-1279.

[Tian, Lo 2015] Tian, Y. and Lo, D., (2015), "A comparative study on the effectiveness of part-

of-speech tagging techniques on bug reports", in Proceedings of Software Analysis,

Evolution and Reengineering (SANER), 2015 IEEE 22nd International Conference on,

pp. 570-574.

[Vinz, Etzkorn 2008] Vinz, B. L. and Etzkorn, L. H., (2008), "Comments as a Sublanguage: A

Study of Comment Grammar and Purpose", in Proceedings of Software Engineering

Research and Practice, pp. 17-23.

[Williams, Hill, Pollock, Shanker 2007] Williams, M., Hill, E., Pollock, L., and Shanker, V.,

(2007), "The Role of PartsOfSpeech in Java Program Identifiers", Final report, CRA-W

Distributed Mentor Program. Retrieved from

http://archive2.cra.org/Activities/craw_archive/dmp/awards/2007/Williams/finalReport.pdf

[Woodfield, Dunsmore, Shen 1981] Woodfield, S. N., Dunsmore, H. E., and Shen, V. Y., (1981),

"The effect of modularization and comments on program comprehension", in Proceedings

of the 5th international conference on Software engineering, pp. 215-223.

[Zavrel, Daelemans 1999] Zavrel, J. and Daelemans, W., (1999), "Recent advances in memory-

based part-of-speech tagging", in Proceedings of VI Simposio Internacional de

Comunicacion Social, pp. 590-597.
