Polish Inflection Fit for Man and Machine
Total Page:16
File Type:pdf, Size:1020Kb
Polish Inflection Fit for Man and Machine Marjorie J. McShane Memoranda in Computer and Cognitive Science Computing Research Laboratory New Mexico State University January 22, 2001 Table of Contents Preface and Acknowledgments i Introduction 1 Part I: Nouns 25 Masculine 26 Feminine 59 Neuter 79 Part II: Adjectives 89 Part III: Adverbs 101 Part IV: Verbs 109 -ać 110 -awać 129 -ąć 130 -c 135 -eć 143 -ić 153 -ieć 167 -iwać 176 -nąć 178 -ować 195 -ść 197 -uć 209 -yć 210 -ywać 215 -źć 217 References 221 Preface and Acknowledgments This book may be of interest to the following groups of people: teachers of Polish who seek a more detailed treatment of Polish inflection than is found in traditional textbooks, as well as extensive practice examples; students of Polish with those same goals; scholars studying Polish morphology, since a fresh perspective can often lead to new insights; and computational linguists, because the whole approach was driven by computational considerations. However, since the materials contained herein were not expressly catered to any single one of these groups, some degree of selective focus might be in order. This work is not a traditional treatment of Polish inflection, and users should be aware that some traditional assumptions must be left behind. This issue is discussed at length in the Introduction. Special thanks to Adam Przepiórkowski. Not being a native speaker, I particularly appreciate his help with the Polish and his many excellent suggestions. Naturally, any remaining mistakes are my own and, if brought to my attention, will be posted as a list of errata on the web page housing the book. Sincerest gratitute also goes to Ron Zacharski for his commentary on content and style, and for his steady encouragment during this longer-than-expected undertaking. 1 INTRODUCTION This document presents a paradigmatic treatment of Polish inflection that is both machine tractable and insightful for people. The approach bears little resemblance to traditional descriptive approaches because it is driven by computational demands like full coverage and formal precision, rather than by academic demands like terseness and theoretically supported abstraction. However, over the course of the work it became clear that the same sort of tightly organized, precise, broad-coverage input that is required by computational applications could be of benefit to people attempting to understand and master a complex system like Polish inflection. Thus, this work seeks to fill a void in the literature, simultaneously serving two populations: computational linguists, and students and teachers of Polish. 1 The Boas Environment The necessity of taking a new approach to Polish inflection derived from the following task: to test the morphological learning algorithm for the Boas knowledge-elicitation module of the Expedition System (http://crl.nmsu.edu/expedition).1 The goal of Expedition is to enable the quick ramp-up of moderate-quality machine translation (MT) systems from any low-density language (a language for which no significant computer resources have been developed) to English; the goal is not to compete with systems designed expressly for a certain language pair.2 The morphological learning program—hereafter, the learner—takes as input fully specified inflectional paradigms from any natural language and creates rules of inflection that are iteratively refined in a test-debug loop. It is irrelevant for the learner whether the language informant splits words into many narrowly specified paradigms or bunches them into more broadly defined paradigms, as long as all inflectional forms of all words in each paradigm are unambiguously predictable based solely on the spelling of the citation form (i.e., the dictionary form), with no reference to umbrella rules like spelling conventions, consonant and vowel alternations, and semantic classes.3 The learner generates rules of inflection using an iterative learning loop that works as follows. First, the language informant provides all the inflectional forms for the so-called primary example for Paradigm 1. The learner generates rules to cover these word forms, incorporating certain types of generalizations—e.g., a given vowel might be generalized to any vowel, and a given consonant might be generalized to any consonant. Then the informant lists the citation forms of more words belonging to this paradigm and the learner generates what it believes to be all of their inflectional forms, based on its initial set of rules. The language informant corrects mistakes, if there are any, and the learner relearns its rules, incorporating the new data. This process continues until the informant believes the rules are robust enough to cover all the slight inflectional variations (if any) in Paradigm 1. Then he proceeds to Paradigm 2, and continues to establish new paradigms until he has covered all the regular patterns of inflection for the given part of speech.4 When competing rules could account for the data, the learner chooses the most general one that produces no conflicts. Therefore, letter-specific behavior can be collapsed into a single paradigm as long as each letter (or combination of letters, if initially specified as a single token) always behaves the same way. 1 The Expedition Project is supported by Department of Defense Contract MDA904-92-C-5189. 2 For details see McShane and Zacharski, forthcoming; McShane, Helmreich, Nirenburg and Raskin, 2000; Nirenburg 1998; Nirenburg and Raskin 1998; Oflazer 1996; Oflazer and Nirenburg 1999, Oflazer, Nirenburg and McShane, forthcoming. 3 Other machine learning architectures might incorporate rules covering phonological changes, spelling conventions, etc. 4 Irregular lexical items will be gathered during open-class lexical acquisition in order not to corrupt the learner’s rules. 1 Depending upon the morphological nature of the language, one could take either extreme in delineating paradigms, as illustrated by the continuum shown below. One could bunch all the words in a given part of speech into a single paradigm as long as there are perfectly regular, graphotactically predictable rules of inflection (point 1), or one could establish a new paradigm for every slight variation in inflection, producing hundreds or even thousands of paradigms for a given part of speech (point 7). Bunching Splitting | | | | | | | 1 2 3 4567 However, each of these options is undesirable for the goals at hand: to create an approach to inflection that is both machine tractable and insightful for people. As concerns Polish, a maximally extreme bunching approach (point 1) would fail since inflectional patterns are affected by semantic rules and idiosyncrasy that are not detectable from the spelling of the citation form. A relatively extreme bunching approach (point 2) would work computationally but would be of little help to people— e.g., combining certain Masc., Fem. and Neuter nouns in a single paradigm would be utterly confusing. Any point toward the extreme end of the splitting continuum (points 4-6) would also work computationally but would be unwieldy for human use, turning the system into an amorphous mass of data. The current approach (point 3) seeks to create a reasonably small number of carefully crafted, robust paradigms that provide good coverage and can be manipulated by both people and machines. The task of organizing the world of Polish inflection for the Boas environment posed a significant challenge because traditional accounts of Polish inflection define paradigm loosely, collapsing into a single paradigm lexical items with significant inflectional variations that are not predictable based on the spelling of the citation form. There is good reason for this. Many rules—like spelling rules and certain consonant and vowel alternations—apply language wide. Therefore, it is more economical to describe them as umbrella phenomena than to use them as distinguishing criteria for paradigm delineation. However, this almost universally adopted approach to Polish inflection blurs formal distinctions that are crucial for machine processing and, I will suggest, for full human understanding as well. There is no computational requirement that the learner be provided with all the inflectional forms of a word at one go—e.g., all tenses, moods, and non-finite forms for verbs. Instead, the learner could learn all present-tense patterns of inflection first, then past-tense patterns, then future tense, then various non-indicative moods, and so on. In the end, the morphological analyzer will represent the union of all rules anyway. Moreover, descriptive and pedagogical treatments of inflection for many languages take a modular approach, so people acquainted with such materials might find such an approach most convenient. However, a practical consideration within Boas led to the decision to elicit paradigms as whole entities. Since the informant will manually assign each inflecting open-class lexical item to a paradigm during the process of lexical acquisition, creating sub-paradigms would necessitate making multiple assignments for each word: e.g., piec ‘to bake’ might belong to present-tense Paradigm 9, past- tense Paradigm 2, past-participle Paradigm 6, etc. Making multiple paradigm assignments of this sort would slow the process of lexical acquisition to an unacceptable rate.5 5 The current description of Polish inflection was developed based on these requirements, which held for the first iterations of the morphological component of Boas. However, it was recently decided that it would be too time-consuming to have the informant assign each open-class word to a particular paradigm during open-class acquisition. Instead, the informant will provide diagnostics that will help a paradigm-guessing program to automatically assign words to paradigms. For example, if all neuter 2 2 A Brief Overview of Polish CHARACTER SET. Polish is a West Slavic language written using an extended Latin character set. The inventory of vowels and consonants is as follows: VOWELS: a, ą, e, ę, i, o, ó, u, y CONSONANTS: b, c, ć, d, f, g, h, j, k, l, ł, m, n, ń, p, q, r, s, ś, t, u, v, w, x, z, ź, ż The letters q, v, x are not native Polish and are found only in foreign words.