Part-Of-Speech Tagging of Source Code Identifiers

PART-OF-SPEECH TAGGING OF SOURCE CODE IDENTIFIERS USING PROGRAMMING LANGUAGE CONTEXT VERSUS NATURAL LANGUAGE CONTEXT

A thesis submitted to Kent State University in partial fulfillment of the requirements for the degree of Masters of Science

by Reem S. AlSuhaibani
December, 2015

Thesis written by Reem S. AlSuhaibani
M.S., Kent State University, USA, 2015
B.S., Prince Sultan University, USA, 2010

Approved by
Dr. Jonathan I. Maletic, Academic Advisor
Dr. Gwenn L. Volkert, Members, Master Thesis Committee
Dr. Kambiz Ghazinour, Members, Master Thesis Committee

Accepted by
Dr. Javed I. Khan, Chair, Department of Computer Science
Dr. James L. Blank, Dean, College of Arts and Sciences

TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
DEDICATION
ACKNOWLEDGEMENTS
INTRODUCTION
  1.1 Research Hypothesis and Questions
  1.2 Research Contributions
  1.3 Organization of the Thesis
BACKGROUND
  2.1 Part of Speech Tagging
    2.1.1 Rule-Based Approach
    2.1.2 The Stochastic (Probabilistic) Approach
    2.1.3 Architecture of Part of Speech Taggers
    2.1.4 Tagsets
  2.2 Natural Language in Source Code
    2.2.1 Program Identifiers
    2.2.2 Comments
RELATED WORK
OVERVIEW OF APPROACH
  4.1 Part of Speech Tagging in Programming Languages
  4.2 Part of Speech Tagging Approach for Source Code
    4.2.1 Heuristic Rules on Program Identifiers
    4.2.2 Part of Speech and Method Stereotypes
  4.3 Part of Speech Tagging on Source Code Comments
  4.4 Implementation in srcML
EVALUATION
CONCLUSIONS AND FUTURE RESEARCH
  6.1 Main Findings
  6.2 Future Research Directions
APPENDIX A
APPENDIX B
APPENDIX C
APPENDIX D
REFERENCES

LIST OF FIGURES

Figure 2.1 The different approaches of automatic part of speech tagging.
Figure 2.2 The common process of part of speech taggers
Figure 4.1 An example of applying heuristics
Figure 4.2 The same previous example with the NLTK POS tagger (NLP tagger)
Figure 4.3 The result of NLTK tagger on HippoDraw source code comments
Figure 4.4 The workflow of the heuristics approach
Figure 4.5 Example of how srcNLP tags source code identifiers
Figure 5.1 Box plot of program identifiers common between all the 10 systems
Figure 5.2 Box plot of program identifiers common without Chipmunk2D
Figure 5.3 A general visualization for nlpCMP output structure on the usage of the identifier ‘depth’ between 4 systems
Figure 6.1 Shows some of the common identifiers between Monkey studio and Code Blocks

LIST OF TABLES

Table 2.1 Examples of how the word “above” is used in different forms
Table 2.2 Differences between supervised and unsupervised part of speech tagging
Table 2.3 The NLTK universal language tagsets
Table 4.1 Taxonomy of method stereotypes and their corresponding part of speech
Table 5.1 Number of verified identifiers for each system according to part of speech with percentages
Table 5.2 The 10 open source systems used in the evaluation
Table 5.3 The consistency of part of speech within each system
Table 5.4 The total number of identifiers common between systems

DEDICATION

To my father Saleh AlSuhaibani, who passed away during my research studies and before finishing this thesis; a goal that we both share.

ACKNOWLEDGEMENTS

This thesis would not have been completed without the guidance and blessings of God; I am grateful and thankful for everything God gave me. Nor would it have been completed without the support of many people around me who truly deserve to be acknowledged.

I would first like to thank my parents, my father Saleh AlSuhaibani and my mother Huda Alrajhi, for the love, care, continuous support, and advice they have given me throughout my studies, which made me the person I am now. Second, my deepest gratitude and sincere thanks go to my husband Ahmad for being a father, a brother, and a friend during my master’s studies in the United States. Without his continuous support, I would not have achieved many things.

Thanks go out to my advisor, Professor Jonathan I. Maletic, for his enormous efforts, continual follow-up, and attention toward achieving the goal of this thesis. Without him, I would not have loved and enjoyed the work that I have accomplished. He has been an inspirational advisor who has created a spirit of creativity amongst us as students. I am glad and proud that he is my advisor, and I appreciate every piece of advice he has given me to expand my knowledge in software engineering.

A special thank you goes to each of the SDML lab members for being supportive and helpful with their advice and opinions. I would also like to extend my thanks to my brothers Abdulrahman, Ahmad, and Hussam and my sister Maram, and to all my friends at Kent State University who have been such great supporters with their love and prayers.

Reem S. AlSuhaibani
November 2015, Kent, Ohio

Introduction

With 60-90% of software life cycle resources spent on program maintenance [Boehm 1981; Erlikh 2000], there is a critical need for advanced tools that help in exploring and comprehending today’s large and complex software. One way to reduce this maintenance cost is to improve software tools, and natural-language clues in program identifiers have been shown to help with that [Shepherd, Pollock, Vijay-Shanker 2007]. There have been a number of attempts to apply Natural Language Processing (NLP) techniques to source code to support various program comprehension tasks. In the work presented here, we are particularly interested in determining the part of speech of identifiers (the names of functions, types, variables, etc.) in source code. We view this as a separate problem from determining the part of speech of comments. Comments are typically written in a natural language (English) and often have sentence structure that follows grammatical rules [Etzkorn, Davis 1994; Etzkorn, Davis,
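
The natural-language clues mentioned in the introduction sit in the individual words of an identifier, so any part-of-speech tagger for identifiers first has to recover those words. The sketch below is a minimal illustration in Python and is not the thesis's srcML-based implementation: the name split_identifier, the _WORD pattern, and the assumption that identifiers follow camelCase, PascalCase, or snake_case conventions are all hypothetical choices made for this example.

```python
import re

# Hypothetical pattern (not from the thesis). It matches, in order of
# preference: an acronym run such as "XML", a capitalized or lowercase
# word, or a run of digits.
_WORD = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")


def split_identifier(name: str) -> list[str]:
    """Split an identifier into lowercase words.

    Assumes camelCase, PascalCase, or snake_case naming; other
    conventions would need additional rules.
    """
    words = []
    # Underscores separate words in snake_case; case changes and digit
    # boundaries separate the rest.
    for part in name.split("_"):
        words.extend(_WORD.findall(part))
    return [word.lower() for word in words]


if __name__ == "__main__":
    for identifier in ("getUserName", "parseXMLFile", "max_depth"):
        print(identifier, "->", split_identifier(identifier))
```

The resulting word lists are the units a tagger would label. Whether those labels are better derived from natural language context or from programming language context is the question the thesis title poses.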
Recommended publications
  • Usage of IT Terminology in Corpus Linguistics Mavlonova Mavluda Davurovna
    International Journal of Innovative Technology and Exploring Engineering (IJITEE), ISSN: 2278-3075, Volume-8, Issue-7S, May 2019.
    Abstract: Corpus linguistic will be the one of latest and fast method for teaching and learning of different languages. Proposed article can provide the corpus linguistic introduction and give the general overview about the linguistic team of corpus. Article described about the corpus, novelties and corpus are associated in the following field. Proposed paper can focus on the using the corpus linguistic in IT field. As from research it is obvious that corpora will be used as internet and primary data in the field of IT. The method of text corpus is the digestive approach which can drive the different set of rules to govern the natural language from the text in the particular language, it can explore about languages that how it can inter-related with other. Hence the corpus linguistic will be automatically drive from the text sources.
    At this period corpora were used in semantics and related field. At this period corpus linguistics was not commonly used, no computers were used. Corpus linguistic history were opposed by Noam Chomsky. Now a day’s modern corpus linguistic was formed from English work context and now it is used in many other languages. Alongside was corpus linguistic history and this is considered as methodology [Abdumanapovna, 2018]. A landmark is modern corpus linguistic which was publish by Henry Kucera. Corpus linguistic introduce new methods and techniques for more complete description of modern languages and these also provide opportunities to obtain new result with
  • Compound Word Formation
    Snyder, William (in press) Compound word formation. In Jeffrey Lidz, William Snyder, and Joseph Pater (eds.) The Oxford Handbook of Developmental Linguistics. Oxford: Oxford University Press.
    CHAPTER 6 Compound Word Formation, William Snyder
    Languages differ in the mechanisms they provide for combining existing words into new, “compound” words. This chapter will focus on two major types of compound: synthetic -ER compounds, like English dishwasher (for either a human or a machine that washes dishes), where “-ER” stands for the crosslinguistic counterparts to agentive and instrumental -er in English; and endocentric bare-stem compounds, like English flower book, which could refer to a book about flowers, a book used to store pressed flowers, or many other types of book, as long as there is a salient connection to flowers. With both types of compounding we find systematic cross-linguistic variation, and a literature that addresses some of the resulting questions for child language acquisition. In addition to these two varieties of compounding, a few others will be mentioned that look like promising areas for coordinated research on cross-linguistic variation and language acquisition.
    6.1 Compounding: A Selective Review
    6.1.1 Terminology
    The first step will be defining some key terms. An unfortunate aspect of the linguistic literature on morphology is a remarkable lack of consistency in what the “basic” terms are taken to mean. Strictly speaking one should begin with the very term “word,” but as Spencer (1991: 453) puts it, “One of the key unresolved questions in morphology is, ‘What is a word?’.” Setting this grander question to one side, a word will be called a “compound” if it is composed of two or more other words, and has approximately the same privileges of occurrence within a sentence as do other word-level members of its syntactic category (N, V, A, or P).
  • Linguistic Annotation of the Digital Papyrological Corpus: Sematia
    Marja Vierros. Linguistic Annotation of the Digital Papyrological Corpus: Sematia.
    1 Introduction: Why to annotate papyri linguistically?
    Linguists who study historical languages usually find the methods of corpus linguistics exceptionally helpful. When the intuitions of native speakers are lacking, as is the case for historical languages, the corpora provide researchers with materials that replace the intuitions on which the researchers of modern languages can rely. Using large corpora and computers to count and retrieve information also provides empirical back-up from actual language usage. In the case of ancient Greek, the corpus of literary texts (e.g. Thesaurus Linguae Graecae or the Greek and Roman Collection in the Perseus Digital Library) gives information on the Greek language as it was used in lyric poetry, epic, drama, and prose writing; all these literary genres had some artistic aims and therefore do not always describe language as it was used in normal communication. Ancient written texts rarely reflect the everyday language use, let alone speech. However, the corpus of documentary papyri gets close. The writers of the papyri vary between professionally trained scribes and some individuals who had only rudimentary writing skills. The text types also vary from official decrees and orders to small notes and receipts. What they have in common, though, is that they have been written for a specific, current need instead of trying to impress a specific audience. Documentary papyri represent everyday texts, utilitarian prose, and in that respect, they provide us a very valuable source of language actually used by common people in everyday circumstances.
  • Hunspell – the Free Spelling Checker
    Hunspell – The free spelling checker
    About Hunspell: Hunspell is a spell checker and morphological analyzer library and program designed for languages with rich morphology and complex word compounding or character encoding. Hunspell interfaces: Ispell-like terminal interface using Curses library, Ispell pipe interface, OpenOffice.org UNO module.
    Main features of Hunspell spell checker and morphological analyzer:
    - Unicode support (affix rules work only with the first 65535 Unicode characters)
    - Morphological analysis (in custom item and arrangement style) and stemming
    - Max. 65535 affix classes and twofold affix stripping (for agglutinative languages, like Azeri, Basque, Estonian, Finnish, Hungarian, Turkish, etc.)
    - Support complex compoundings (for example, Hungarian and German)
    - Support language specific features (for example, special casing of Azeri and Turkish dotted i, or German sharp s)
    - Handle conditional affixes, circumfixes, fogemorphemes, forbidden words, pseudoroots and homonyms.
    - Free software (LGPL, GPL, MPL tri-license)
    Usage: The src/tools directory contains ten executables after compiling (or some of them are in the src/win_api):
    - affixcompress: dictionary generation from large (millions of words) vocabularies
    - analyze: example of spell checking, stemming and morphological analysis
    - chmorph: example of automatic morphological generation and conversion
    - example: example of spell checking and suggestion
    - hunspell: main program for spell checking and others (see manual)
    - hunzip: decompressor of hzip format
    - hzip: compressor of
  • Building a Treebank for French
    Building a treebank for French
    Anne Abeillé, Lionel Clément, Alexandra Kinyon
    TALaNa, Université Paris 7, 75251 Paris cedex 05, France; University of Pennsylvania, Philadelphia, USA
    abeille, clement, [email protected]
    Abstract: Very few gold standard annotated corpora are currently available for French. We present an ongoing project to build a reference treebank for French starting with a tagged newspaper corpus of 1 Million words (Abeillé et al., 1998), (Abeillé and Clément, 1999). Similarly to the Penn TreeBank (Marcus et al., 1993), we distinguish an automatic parsing phase followed by a second phase of systematic manual validation and correction. Similarly to the Prague treebank (Hajicova et al., 1998), we rely on several types of morphosyntactic and syntactic annotations for which we define extensive guidelines. Our goal is to provide a theory neutral, surface oriented, error free treebank for French. Similarly to the Negra project (Brants et al., 1999), we annotate both constituents and functional relations.
    1. The tagged corpus
    As reported in (Abeillé and Clément, 1999), we present the general methodology, the automatic tagging phase, the human validation phase and the final state of the tagged corpus.
    1.1. Methodology
    1.1.1.
    ... pronoun (= him) or a weak clitic pronoun (= to him or to her), plus can either be a negative adverb (= not any more) or a simple adverb (= more). Inflectional morphology also has to be annotated since morphological endings are important for gathering constituents (based on agreement marks) and also because lots of forms in French are ambiguous with respect to mode, person, number or gender. For exam-
  • Creating and Using Multilingual Corpora in Translation Studies Claudio Fantinuoli and Federico Zanettin
    Claudio Fantinuoli & Federico Zanettin. 2015. Creating and using multilingual corpora in translation studies. In Claudio Fantinuoli & Federico Zanettin (eds.), New directions in corpus-based translation studies, 1–11. Berlin: Language Science Press.
    Chapter 1. Creating and using multilingual corpora in translation studies. Claudio Fantinuoli and Federico Zanettin.
    1 Introduction
    Corpus linguistics has become a major paradigm and research methodology in translation theory and practice, with practical applications ranging from professional human translation to machine (assisted) translation and terminology. Corpus-based theoretical and descriptive research has investigated written and interpreted language, and topics such as translation universals and norms, ideology and individual translator style (Laviosa 2002; Olohan 2004; Zanettin 2012), while corpus-based tools and methods have entered the curricula at translation training institutions (Zanettin, Bernardini & Stewart 2003; Beeby, Rodríguez Inés & Sánchez-Gijón 2009). At the same time, taking advantage of advancements in terms of computational power and increasing availability of electronic texts, enormous progress has been made in the last 20 years or so as regards the development of applications for professional translators and machine translation system users (Coehn 2009; Brunette 2013). The contributions to this volume, which are centred around seven European languages (Basque, Dutch, German, Greek, Italian, Spanish and English), add to the range of studies of corpus-based descriptive studies, and provide examples of some less explored applications of corpus analysis methods to translation research. The chapters, which are based on papers first presented at the 7th congress of the European Society of Translation Studies held in Germersheim in July/August 2013, encompass a variety of research aims and methodologies, and vary as concerns corpus design and compilation, and the techniques used to analyze the data.
  • Corpus Linguistics As a Tool in Legal Interpretation Lawrence M
    Lawrence M. Solan and Tammy Gales, Corpus Linguistics as a Tool in Legal Interpretation, 2017 BYU L. Rev. 1311 (2018). BYU Law Review, Volume 2017, Issue 6, Article 5, August 2017. Available at: https://digitalcommons.law.byu.edu/lawreview/vol2017/iss6/5
    In this paper, we set out to explore conditions in which the use of large linguistic corpora can be optimally employed by judges and others tasked with construing authoritative legal documents. Linguistic corpora, sometimes containing billions of words, are a source of information about the distribution of language usage. Thus, corpora and the tools for using them are most likely to assist in addressing legal issues when the law considers the distribution of language usage to be legally relevant. As Thomas R. Lee and Stephen C. Mouritsen have so ably demonstrated in earlier work, corpus analysis is especially helpful when the legal standard for construction is the ordinary meaning of the document’s terms.
  • Against Corpus Linguistics
    Against Corpus Linguistics. John S. Ehrett (Yale Law School, J.D. 2017). © 2019, John S. Ehrett. The Georgetown Law Journal Online, Vol. 108.
    Corpus linguistics, the use of large, computerized word databases as tools for discovering linguistic meaning, has increasingly become a topic of interest among scholars of constitutional and statutory interpretation. Some judges and academics have recently argued, across the pages of multiple law journals, that members of the judiciary ought to employ these new technologies when seeking to ascertain the original public meaning of a given text. Corpus linguistics, in the minds of its proponents, is a powerful instrument for rendering constitutional originalism and statutory textualism “scientific” and warding off accusations of interpretive subjectivity. This Article takes the opposite view: on balance, judges should refrain from the use of corpora. Although corpus linguistics analysis may appear highly promising, it carries with it several under-examined dangers, including the collapse of essential distinctions between resource quality, the entrenchment of covert linguistic biases, and a loss of reviewability by higher courts.
    TABLE OF CONTENTS
    INTRODUCTION
    I. THE RISE OF CORPUS LINGUISTICS
      A. WHAT IS CORPUS LINGUISTICS?
        1. Frequency
        2. Collocation
        3. Keywords in Context (KWIC)
      B. CORPUS LINGUISTICS IN THE COURTS
        1. United States v. Costello
        2. State v. Canton
        3. State v. Rasabout
    II. AGAINST “JUDICIALIZING” CORPUS LINGUISTICS
      A. SUBVERSION OF SOURCE AUTHORITY HIERARCHIES
      B. IMPROPER PARAMETRIC OUTSOURCING
      C. METHODOLOGICAL INACCESSIBILITY
    III. THE FUTURE OF JUDGING AND CORPUS LINGUISTICS
    INTRODUCTION
    “Corpus linguistics” may sound like a forensic investigative procedure on CSI or NCIS, but the reality is far less dramatic, though no less important.
  • TEI and the Documentation of Mixtepec-Mixtec Jack Bowers
    Jack Bowers. Language Documentation and Standards in Digital Humanities: TEI and the documentation of Mixtepec-Mixtec. Computation and Language [cs.CL]. École Pratique des Hautes Études, 2020. English. tel-03131936
    HAL Id: tel-03131936, https://tel.archives-ouvertes.fr/tel-03131936, submitted on 4 Feb 2021.
    Prepared at the École Pratique des Hautes Études. Defended by Jack Bowers on 8 October 2020. Doctoral school no. 472 (École doctorale de l’École Pratique des Hautes Études). Specialty: Linguistics.
    Composition of the jury:
    - Guillaume Jacques, Directeur de Recherche, CNRS (President)
    - Alexis Michaud, Chargé de Recherche, CNRS (Rapporteur)
    - Tomaž Erjavec, Senior Researcher, Jožef Stefan Institute (Rapporteur)
    - Enrique Palancar, Directeur de Recherche, CNRS (Examiner)
    - Karlheinz Moerth, Senior Researcher, Austrian Center for Digital Humanities and Cultural Heritage (Examiner)
    - Emmanuel Schang, Maître de Conférence, Université d’Orléans (Examiner)
    - Benoit Sagot, Chargé de Recherche, Inria (Examiner)
    - Laurent Romary, Directeur de recherche, Inria (Thesis director)
    1.
  • Corpus Linguistics: a Practical Introduction
    Corpus Linguistics: A Practical Introduction. Nadja Nesselhauf, October 2005 (last updated September 2011).
    1) Corpus Linguistics and Corpora
    - What is corpus linguistics (I)?
    - What data do linguists use to investigate linguistic phenomena?
    - What is a corpus?
    - What is corpus linguistics (II)?
    - What corpora are there?
    - What corpora are available to students of English at the University of Heidelberg? (For a list of corpora available at the Department of English click here)
    2) Corpus Software
    - What software is there to perform linguistic analyses on the basis of corpora?
    - What can the software do?
    - A brief introduction to an online search facility (BNC)
    - A step-by-step introduction to WordSmith Tools
    3) Exercises (I and II)
    - I Using the WordList function of WordSmith
    - II Using the Concord function of WordSmith
    4) How to conduct linguistic analyses on the basis of corpora: two examples
    - Example 1: Australian English vocabulary
    - Example 2: Present perfect and simple past in British and American English
    - What you have to take into account when performing a corpus-linguistic analysis
    5) Exercises (III)
    - Exercise III.1
    - Exercise III.2
    6) Where to find further information on corpus linguistics
    1) Corpus Linguistics and Corpora
    What is corpus linguistics (I)? Corpus linguistics is a method of carrying out linguistic analyses. As it can be used for the investigation of many kinds of linguistic questions and as it has been shown to have the potential to yield highly interesting, fundamental, and often surprising new insights about language, it has become one of the most widespread methods of linguistic investigation in recent years.
  • A Scaleable Automated Quality Assurance Technique for Semantic Representations and Proposition Banks K
    A scaleable automated quality assurance technique for semantic representations and proposition banks
    K. Bretonnel Cohen (Computational Bioscience Program, U. of Colorado School of Medicine; Department of Linguistics, University of Colorado at Boulder), Lawrence E. Hunter (Computational Bioscience Program, U. of Colorado School of Medicine), Martha Palmer (Department of Linguistics, University of Colorado at Boulder)
    Abstract: This paper presents an evaluation of an automated quality assurance technique for a type of semantic representation known as a predicate argument structure. These representations are crucial to the development of an important class of corpus known as a proposition bank. Previous work (Cohen and Hunter, 2006) proposed and tested an analytical technique based on a simple discovery procedure inspired by classic structural linguistic methodology. Cohen and Hunter applied the technique manually to a small set of representations.
    ... (Leech, 1993; Ide and Brew, 2000; Wynne, 2005; Cohen et al., 2005a; Cohen et al., 2005b) that address architectural, sampling, and procedural issues, as well as publications such as (Hripcsak and Rothschild, 2005; Artstein and Poesio, 2008) that address issues in inter-annotator agreement. However, there is not yet a significant body of work on the subject of quality assurance for corpora, or for that matter, for many other types of linguistic resources. (Meyers et al., 2004) describe three error-checking measures used in the construction of NomBank, and the use of inter-annotator agreement as a quality control measure for corpus construction is discussed at some
  • Compounds and Word Trees
    Compounds and word trees. LING 481/581, Winter 2011.
    Organization: Compounds (heads, types of compounds); Word trees.
    Compounding: [root] [root], as in machine gun, land line, snail mail, top-heavy, fieldwork. Notice variability in punctuation (to ignore).
    Compound stress: Green Lake, bluebird, fast lane, Bigfoot, bad boy, old school, hotline, Myspace.
    Compositionality: (Extent to which) meanings of words, phrases determined by morpheme meaning and structure. Predictability of meaning in compounds: ’roid rage (< 1987; ‘road rage’ < 1988); football (< 1486; ‘American football’ 1879); soap opera (< 1939).
    Head: ‘A word in a syntactic construction or a morpheme in a morphological one that determines the grammatical function or meaning of the construction as a whole. For example, house is the head of the noun phrase the red house...’ Aronoff and Fudeman 2011
    Righthand head rule: In English (and most languages), morphological heads are the rightmost non-inflectional morpheme in the word. Lexical category of entire compound = lexical category of rightmost member. Not Spanish (HS example p. 139). English be-: devil, bedevil.
    Compounding and lexical category: noun verb adjective noun machine friend request skin-deep gun foot bail rage quit verb thinktank stir-fry(?) ? adjective high school dry-clean(?) red-hot low-ball. ‘the V+N pattern is unproductive and limited to a few lexically listed items, and the N+V pattern is not really productive either.’ HS: 138
    Heads of words: Morphosyntactic head determines lexical category. Syntactic distribution: That thinktank is overrated. Morphological characteristics, e.g. inflection: plurality (highschool highschools); tense on V (dry-clean dry-cleaned); case on N (Sahaptin wáyxti- “run, race” wayxtitpamá “pertaining to racing” wayxtitpamá k'úsi ‘race horse’ Máytski=ish á-shapaxwnawiinknik-xa-na wayxtitpamá k'úsi-nan..