Extracting Meaning from Abbreviated Identifiers

Dawn Lawrie, Henry Feild, David Binkley
Loyola College
Baltimore MD 21210-2699, USA
{lawrie, hfeild, binkley}@cs.loyola.edu

Abstract

Informative identifiers are made up of full (natural language) words and (meaningful) abbreviations. Readers of programs typically have little trouble understanding the purpose of identifiers composed of full words. In addition, those familiar with the code can (most often) determine the meaning of abbreviations used in identifiers. However, when faced with unfamiliar code, abbreviations often carry little useful information. Furthermore, tools that focus on the natural language used in the code have a hard time in the presence of abbreviations. One approach to providing meaning to programmers and tools is to translate (expand) abbreviations into full words. This paper presents a methodology for expanding identifiers and evaluates the process on a code base of just over 35 million lines of code. For example, using phrase extraction, fs_exists is expanded to file status exists, illustrating how the expansion process can facilitate comprehension. On average, 16 percent of the identifiers in a program are expanded. Finally, as an example application, the approach is used to improve the syntactic identification of violations to Deißenböck and Pizka's rules for concise and consistent identifier construction.

Keywords: Software Quality, Program Identifiers, Program Comprehension

1 Introduction

The first Working Session on Information Retrieval Based Approaches in Software Evolution was held at ICSM'06. The number of attendees at this session underscores the growing interest in applying Information Retrieval (IR) techniques to software engineering problems. In order to make effective use of IR techniques, consistent meaningful vocabulary is important. Identifiers present a challenge when domain information is buried behind abbreviations and acronyms. One way to expose this information is to translate terse identifiers into natural language. Then existing (and future) IR based techniques can be employed by software engineering tools to make use of this information in problems such as concept location [22], quality assessment [16], and software reuse [8].

The automated translation of existing identifiers to natural language is complicated by many facets. For example, identifiers often include prefixes, suffixes, or use 'common' abbreviations (believed to be so well known that they go undocumented). To address these, the translation process applies three IR based techniques. First, probability distributions such as relative entropy [19] are used to identify usual versus unusual aspects of identifiers.

Second, for specific parts of each identifier (e.g., non-dictionary words), sets of possible expansions are generated based on wildcard expansions (expansions that begin with the same letter as the abbreviation and in which the subsequent letters occur in the same order). A learning algorithm is applied to determine the likelihood of a particular expansion based on characteristics such as the proportion of adjacent letters in the expansion. For example, lib can expand to library and liberty. In a function that includes comments such as "find opponent liberty and choose move to attack", the expansion would favor liberty over library.

Finally, for multi-word identifiers, machine translation techniques are employed to ensure that the expansion has produced a meaningful cohesive unit. For example, the identifier thenewestone includes three dictionary words fused together. This identifier is a challenge for a naive splitting algorithm as it can be split into three dictionary words two different ways: the-newest-one or then-ewe-stone. Borrowing from machine translation techniques for resolving translation ambiguity, the likelihood of finding the words the, newest, and one versus that of finding then, ewe, and stone in close proximity can be compared.

The usefulness of expanded identifiers is far reaching. For example, they could be used to provide GUI 'tool tips', as an aid to an engineer who must map the program's identifiers to the concepts that they represent. Furthermore, by increasing the amount of natural language found in the code, expansion improves existing source code analysis tools that attempt to map program identifiers to domain level concepts (e.g., those appearing in the documentation). Examples of this include the techniques of Antoniol et al. and Marcus et al. that seek to (re)establish links between source code and its documentation [3, 21]. In another example, expansion allows for better automatic recognition of quality identifiers: while humans can often comprehend code whose identifiers are composed of meaningful abbreviations [17], tools have a much more difficult time doing so. Finally, expansion facilitates correctly identifying violations of conciseness and consistency rules for identifiers such as those laid out by Deißenböck and Pizka [7, 15]. This final example is used herein to demonstrate an application of the expansion.

This paper presents a first step in this process: the expansion of abbreviations and acronyms into meaningful words and phrases based on local function-level language information. For example, a phrase dictionary, built for each function using phrases found in the comments, provides one source of meaningful names. This step allows the value of local information to be better understood and also highlights places where using global information is useful.

The remainder of this paper consists of a description of the process used to expand identifiers in Section 2. An analysis of the expanded identifiers is presented in Section 3. Section 4 describes the example application of the technique. The final sections discuss related and future work and conclude.

2 Technique

In the current implementation, identifier expansion is a two step process. The first step, described in Section 2.1, splits an identifier into its constituent words. Then, as explained in Section 2.2, a collection of possible expansions is considered.

2.1 Identifier Splitting

Following past studies, identifiers are first divided into their constituent parts for analysis [3, 7, 6, 9, 25]. Herein, these parts are called "words": sequences of characters with which some meaning may be associated. Two kinds of words are considered: hard words and soft words. Hard words are separated by the use of word markers (e.g., the use of camelCasing or under_scores) [3]. For example, the identifiers spongeBob and sponge_bob both contain the well separated hard words sponge and bob.

For many identifiers, the division into hard words is sufficient. This occurs when all hard words are dictionary words or known abbreviations. When a hard word is in neither category, the identifier may contain non-well-separated words. For example, the identifier easycase includes the two non-well-separated words easy and case. The identification of these "soft words" is the goal of identifier splitting. One approach does so using a greedy algorithm that recursively finds the longest prefix and suffix that are in the dictionary or a known abbreviation list [9].

For example, the identifier zeroindeg includes a single hard word because there are no word markers; thus, division into hard words is insufficient to identify the concepts within the identifier. The splitting algorithm divides this hard word into the three soft words zero-in-deg.

2.2 Expanding Abbreviations

The expansion algorithm works independently on soft words in the context of the source code for a particular function. It uses four lists of potential expansions: a list of natural-language words extracted from the code, a list of phrases extracted from the code, a list of programming language specific words referred to as a stoplist, and finally a natural-language dictionary. There are two sources of words for the first list: first, words contained in the comments that appear before and within the function, and second, dictionary hard words found in the identifiers of the function.

The phrase list is obtained by running the comments and multi-word identifiers through a phrase finder [10]. Here, the first letter of each word in the phrase is used to build an acronym. If a sequence of letters matches an acronym exactly, the phrase is considered a potential expansion. For instance, the phrase finder extracts the phrase file status from the comments of the program which. This phrase is later returned as the expansion for the soft word fs contained in the identifier fs_exists.

Once the list of potential words and phrases has been extracted, expansion begins. This process has two stages that look first in the word list extracted from the code and comments and in the phrase dictionary, and then in a natural-language dictionary. Thus, language extracted from the source is favored over words that appear in the dictionary, and words on the stoplist are treated the same as natural language words. A word from one of these lists matches an abbreviation if it begins with the same letter and the individual letters of the abbreviation occur, in order, in the word. For example, the search for "abs" in the mozilla source correctly discovers the expansion absolute, while horiz and triag are correctly expanded to horizontal and triangle, respectively.

In the present implementation, if there is a single match in either stage of the search, it is returned as the expansion of the abbreviation. Future work will explore techniques for selecting from among multiple potential expansions by favoring words where the sequences of letters are adjacent, minimizing the number of different expansions for a sequence within the program and across programs, and, for multi-word identifiers, maximizing the likelihood that the words occur together based on co-occurrence information.

[...] code was written in C. Significant C++ and Java code were also studied.

Table 1 shows 10 representative programs. It reports code sizes for the C, C++, and Java code (and their sum) as counted by the Unix utility wc (excluding header files).
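
The matching rule and two-stage search of Section 2.2 can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names (matches, acronym, expand) and the argument layout are hypothetical, the stoplist is omitted, and two behaviours the text leaves open are assumed: a word identical to the abbreviation is not counted as an expansion, and an ambiguous first stage yields no expansion rather than falling through to the dictionary.

```python
from typing import Iterable, Optional


def matches(abbrev: str, word: str) -> bool:
    """Return True if `word` begins with the first letter of `abbrev` and
    the remaining letters of `abbrev` occur, in order, later in `word`."""
    abbrev, word = abbrev.lower(), word.lower()
    # Assumption: a word no longer than the abbreviation is not an expansion.
    if not abbrev or len(word) <= len(abbrev) or abbrev[0] != word[0]:
        return False
    rest = iter(word[1:])
    # `ch in rest` advances the iterator, so letters must appear in order.
    return all(ch in rest for ch in abbrev[1:])


def acronym(phrase: str) -> str:
    """Build an acronym from the first letter of each word in a phrase."""
    return "".join(w[0] for w in phrase.lower().split())


def expand(abbrev: str,
           local_words: Iterable[str],
           local_phrases: Iterable[str],
           dictionary: Iterable[str]) -> Optional[str]:
    """Two-stage search: language taken from the function first, then the
    natural-language dictionary. Only a unique match is returned."""
    stage1 = {w for w in local_words if matches(abbrev, w)}
    stage1 |= {p for p in local_phrases if acronym(p) == abbrev.lower()}
    if len(stage1) == 1:
        return next(iter(stage1))
    if stage1:
        return None  # assumption: ambiguity in stage one stops the search
    stage2 = {w for w in dictionary if matches(abbrev, w)}
    return next(iter(stage2)) if len(stage2) == 1 else None
```

With these assumptions, expand("fs", ["file", "status", "exists"], ["file status"], []) returns the phrase file status, and expand("lib", ["liberty"], [], ["library", "liberty"]) returns liberty because the word found in the function's own comments is considered before the dictionary. When only the dictionary is available, lib matches both library and liberty and no expansion is returned, which is the case the future work on ranking multiple candidates is meant to address.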
