3 Compounding


NIGEL FABB

1 Overview

A compound is a word which consists of two or more words. For example, the Malay compound mata-hari ‘sun’ is a word which consists of two words: mata ‘eye’ and hari ‘day’. Compounds are subject to phonological and morphological processes, which may be specific to compounds or may be shared with other structures, whether derived words or phrases; we explore some of these, and their implications, in this chapter. The words in a compound retain a meaning similar to their meaning as isolated words, but with certain restrictions; for example, a noun in a compound will have a generic rather than a referential function: as Downing (1977) puts it, not every man who takes out the garbage is a garbage man.

1.1 Structure and interpretation

The meaning of a compound is usually to some extent compositional, though it is often not predictable. For example, popcorn is a kind of corn which pops; once you know the meaning, it is possible to see how the parts contribute to the whole – but if you do not know the meaning of the whole, you are not certain to guess it by looking at the meaning of the parts. This lack of predictability arises mainly from two characteristics of compounds: (a) compounds are subject to processes of semantic drift, which can include metonymy, so that a redhead is a person who has red hair; (b) there are many possible semantic relations between the parts in a compound, as between the parts in a sentence, but unlike a sentence, in a compound, case, prepositions and structural position are not available to clarify the semantic relation.

1.1.1 Endocentric and exocentric compounds

Compounds which have a head are called ‘endocentric compounds’. A head of a compound has similar characteristics to the head of a phrase: it represents the core meaning of the constituent, and it is of the same word class. For example, in sneak-thief, thief is the head (a sneak-thief is a kind of thief; thief and sneak-thief are both nouns). Compounds without a head are called ‘exocentric compounds’ or ‘bahuvrihi compounds’ (the Sanskrit name). The distinction between endocentric and exocentric compounds is sometimes a matter of interpretation, and is often of little relevance; for example, whether you think greenhouse is an endocentric or exocentric compound depends on whether you think it is a kind of house. The major interest in the head of a compound relates to the fact that where there is a clear head, its position seems to be constrained; endocentric compounds tend to have heads in a language systematically on either the right (e.g. English) or left (e.g. Vietnamese, French).

1.1.2 Co-ordinate compounds

There is a third kind of compound, where there is some reason to think of both words as equally sharing head-like characteristics, as in student-prince (both a student and a prince); these are called ‘appositional’ or ‘co-ordinate’ or ‘dvandva’ (the Sanskrit name) compounds. Co-ordinate compounds can be a combination of synonyms (example from Haitian):

    toro-bèf (bull-cow) ‘male cow’

a combination of antonyms (example from French):

    aigre-doux (sour-sweet)

or a combination of parallel things (example from Malayalam):

    acchanammamaaIf (father-mother-pl.) ‘parents’

Co-ordinate compounds can have special characteristics in a language. For example, in Mandarin, co-ordinate compounds behave differently in terms of theta-assignment from (endocentric) resultative verb compounds. And in Malayalam, co-ordinate compounds are not affected by gemination processes which other compounds undergo (see section 5.2).

1.1.3 The semantic relations between the parts

The semantic relations between the parts of a compound can often be understood in terms of modification; this is true even for some exocentric compounds like redhead. Modifier–modifiee relations are often found in compounds which resemble equivalent phrases; this is true, for example, of English AN%N1 compounds and many Mandarin compounds (see Anderson 1985b). It is not always the case, though; the French compound est-allemand (East German) corresponds to a phrase allemand de l’est (German from the East). In addition, many compounds manifest relations which can be interpreted as predicator–argument relations, as in sunrise or pull-chain. Note that in pull-chain, chain can be interpreted as an argument of pull, and at the same time pull can be interpreted as a modifier of chain.

1.1.4 Transparency: interpretive and formal

The transparency and predictability of a compound are sometimes correlated with its structural transparency. For example, in languages with two distinct types of compound where one is more interpretively transparent than the other, the less interpretively transparent type will often be subject to greater phonological or morphological modification. A diachronic loss of transparency (both formal and interpretive) can be seen in the process whereby a part of a compound becomes an affix, as in the development of English -like as the second part of a compound to become the derivational suffix -ly.

1.2 Types of compound

Accounts of compounds have divided them into classes. Some of these – such as the exocentric, endocentric and appositional types, or the various interpretive types (modifier–modifiee, complement–predicator, etc.) – are widespread across languages. Then there are compound types which are language- or language-family-specific, such as the Japanese postsyntactic compounds (4.3), Hebrew construct state nominals (4.2), or Mandarin resultative verb compounds (4.1). Other types of compounds are found intermittently; these include synthetic compounds (section 1.2.1), incorporation compounds (section 1.2.2) and reduplication compounds (section 1.2.3).
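
The three-way division into endocentric, exocentric and co-ordinate compounds (sections 1.1.1 and 1.1.2) can be made concrete with a small data structure. The following Python sketch is illustrative only and is not part of the chapter's analysis: the class `Compound`, the functions `compound_type` and `head_side`, and the convention of recording head-like parts by index are all assumptions introduced here. The head annotations simply restate the judgements given in the text (thief heads sneak-thief; redhead has no head; both parts of student-prince are head-like).

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Compound:
    """A compound with an analyst-supplied head annotation (hypothetical model)."""
    parts: Tuple[str, ...]   # the words that make up the compound
    heads: Tuple[int, ...]   # indices of head-like parts; empty tuple = no head


def compound_type(c: Compound) -> str:
    """Classify a compound by how many of its parts are head-like."""
    if not c.heads:
        return "exocentric"      # no part acts as head
    if len(c.heads) == len(c.parts):
        return "co-ordinate"     # every part shares head-like status
    return "endocentric"         # a proper subset of the parts is the head


def head_side(c: Compound) -> Optional[str]:
    """Report 'left' or 'right' for an endocentric compound headed at an edge."""
    if compound_type(c) != "endocentric":
        return None
    if c.heads == (0,):
        return "left"
    if c.heads == (len(c.parts) - 1,):
        return "right"
    return None                  # head is not at an edge


# Annotations below restate the judgements made in the text.
sneak_thief = Compound(("sneak", "thief"), heads=(1,))
redhead = Compound(("red", "head"), heads=())
student_prince = Compound(("student", "prince"), heads=(0, 1))

for c in (sneak_thief, redhead, student_prince):
    print("-".join(c.parts), compound_type(c), head_side(c))
```

The sketch deliberately takes the head annotation as input rather than computing it, since the text stresses that headedness (as with greenhouse) can be a matter of interpretation.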

1.2.1 Synthetic (verbal) compounds

The synthetic compound (also called ‘verbal compound’) is characterized by a co-occurrence of particular formal characteristics with particular restrictions on interpretation. Not all languages have synthetic compounds (e.g. English does, but French does not). The formal characteristic is that a synthetic compound has as its head a derived word consisting of a verb plus one of a set of affixes (many writers on English restrict this to agentive -er, nominal and adjectival -ing, and the passive adjectival -en). Thus the following are formally characterized as synthetic compounds:

    expert-test-ed
    checker-play-ing (as an adjective: a checker-playing king)
    window-clean-ing (as a noun)
    meat-eat-er

(There is some disagreement about whether other affixes should also be included, so that slum-clear-ance for example would be a synthetic compound.) Compounds with this structure are subject to various restrictions (summarized by Roeper and Siegel 1978), most prominent of which is that the left-hand member must be interpreted as equivalent to a syntactic ‘first sister’ of the right-hand member. We discuss this in section 3.2. There have been many accounts of synthetic compounds, including Roeper and Siegel 1978, Selkirk 1982, Lieber 1983, Fabb 1984, Botha 1981 on Afrikaans, and Brousseau 1988 on Fon; see also section 2.1.2. Many of the relevant arguments are summarized in Spencer 1991.

1.2.2 Incorporation compounds

In some languages, incorporation words resemble compounds: for example, both a verb and an incorporated noun may exist as independent words. Bybee (1985) comments on this, but suggests that even where the two parts may be independently attested words, an incorporation word may differ from a compound in certain ways. This includes phonological or morphological differences between incorporated and free forms of a word. (In fact, as we have seen, this is true also of some compound processes; e.g. segment loss, which is found in Tiwi incorporation words, is found also in ASL compounds.) Another difference which Bybee suggests may distinguish incorporation from compounding processes is that the incorporation of a word may depend on its semantic class. For example, in Pawnee it is mainly body part words which are incorporated, while various kinds of name are not (such as personal names, kinship terms, names of particular species of tree, etc.). It is possible that compounding is not restricted in this way; as we will see, semantic restrictions on compounding tend to be in terms of the relation between the parts rather than in terms of the individual meanings of the parts.

1.2.3 Repetition compounds

Whole-word reduplication is sometimes described as a compounding process, because each part of the resulting word corresponds to an independently attested word. Steever (1988), for example, describes as compounds Tamil words which are generated through reduplication: vantu ‘coming’ is reduplicated as vantu-vantu ‘coming time and again’. In another type of Tamil reduplicated compound, the second word is slightly modified: viyAparam ‘business’ becomes viyAparam-kiyAparam ‘business and such’. English examples of this type of compound include words like higgledy-piggledy, hotchpotch, and so on.

1.3 Compounds which contain ‘bound words’

In a prototypical compound, both parts are independently attested as words. However, it is possible to find words which can be parsed into an independently attested word plus another morpheme which is not an independently attested word but also does not appear to be an affix (see Aronoff 1976). Here are some examples from English (unattested part italicized):

    church-goer
    ironmonger
    television
    cranberry

The part which is not attested as an isolated word is sometimes found in other words as well; in some cases it may become an attested word: for example, telly (= television). These parts fail to resemble affixes morphologically (they are relatively unproductive compared to most affixes), and there is no good evidence on phonological grounds for considering them to be affixes.
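
The notion of a ‘bound word’ in section 1.3 amounts to a simple check: parse the word into parts and ask which parts are independently attested. A minimal Python sketch of that check follows. Everything in it is an assumption made for illustration: the toy lexicon, the segmentations of the four English examples, and the function name `unattested_parts` are hypothetical, and the lexicon is stipulated to contain only the parts that are presumably treated as attested in the examples above (the italics marking the unattested parts did not survive in the text).

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy lexicon of independently attested words (an assumption).
TOY_LEXICON: Set[str] = {"church", "iron", "vision", "berry"}

# Assumed segmentations of the four examples; not taken from the text.
SEGMENTATIONS: Dict[str, Tuple[str, str]] = {
    "church-goer": ("church", "goer"),
    "ironmonger": ("iron", "monger"),
    "television": ("tele", "vision"),
    "cranberry": ("cran", "berry"),
}


def unattested_parts(parts: Tuple[str, ...], lexicon: Set[str]) -> List[str]:
    """Return the parts that do not occur in the lexicon as isolated words."""
    return [part for part in parts if part not in lexicon]


for word, parts in SEGMENTATIONS.items():
    bound = unattested_parts(parts, TOY_LEXICON)
    kind = "prototypical compound" if not bound else "contains a bound word"
    print(f"{word}: {kind}; unattested parts: {bound}")
```

A check of this kind only identifies the unattested part; it cannot decide whether that part is a bound word or an affix, which, as the text notes, rests on productivity and phonological evidence.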