Dependency-Based Empty Category Detection Via Phrase Structure Trees

Nianwen Xue, Brandeis University, Waltham, MA, USA
Yaqin Yang, Brandeis University, Waltham, MA, USA

Proceedings of NAACL-HLT 2013, pages 1051–1060, Atlanta, Georgia, 9–14 June 2013. © 2013 Association for Computational Linguistics

Abstract

We describe a novel approach to detecting empty categories (EC) as represented in dependency trees as well as a new metric for measuring EC detection accuracy. The new metric takes into account not only the position and type of an EC, but also the head it is a dependent of in a dependency tree. We also introduce a variety of new features that are more suited for this approach. Tested on a subset of the Chinese Treebank, our system improved significantly over the best previously reported results even when evaluated with this more stringent metric.

1 Introduction

In modern theoretical linguistics, empty categories (ECs) are an important piece of machinery in representing the syntactic structure of a sentence and they are used to represent phonologically null elements such as dropped pronouns and traces of dislocated elements. They have also found their way into large-scale treebanks which have played an important role in advancing the state of the art in syntactic parsing. In phrase-structure treebanks, ECs have been used to indicate long-distance dependencies, discontinuous constituents, and certain dropped elements (Marcus et al., 1993; Xue et al., 2005). Together with labeled brackets and function tags, they make up the full syntactic representation of a sentence.

The use of ECs captures some cross-linguistic commonalities and differences. For example, while both the Penn English TreeBank (PTB) (Marcus et al., 1993) and the Chinese TreeBank (CTB) (Xue et al., 2005) use traces to represent the extraction site of a dislocated element, dropped pronouns (represented as *pro*s) are much more widespread in the CTB. This is because Chinese is a pro-drop language (Huang, 1984) that allows the subject to be dropped in more contexts than English does. While detecting and resolving traces is important to the interpretation of the syntactic structure of a sentence in both English and Chinese, the prevalence of dropped nouns in Chinese text gives EC detection added significance and urgency. They are not only an important component of the syntactic parse of a sentence, but are also essential to a wide range of NLP applications. For example, any meaningful tracking of entities and events in natural language text would have to include those represented by dropped pronouns. If Chinese is translated into a different language, it is also necessary to render these dropped pronouns explicit if the target language does not allow pro-drop. In fact, Chung and Gildea (2010) reported preliminary work that has shown a positive impact of automatic EC detection on statistical machine translation.

Some ECs can be resolved to an overt element in the same text while others only have a generic reference that cannot be linked to any specific entity. Still others have a plausible antecedent in the text, but are not annotated due to annotation limitations. A common practice is to resolve ECs in two separate stages (Johnson, 2002; Dienes and Dubey, 2003b; Dienes and Dubey, 2003a; Campbell, 2004; Gabbard et al., 2006; Schmid, 2006; Cai et al., 2011). The first stage is EC detection, where empty categories are first located and typed. The second stage is EC resolution, where empty categories are linked to an overt element if possible.

In this paper we describe a novel approach to detecting empty categories in Chinese, using the CTB as training and test data. More concretely, EC detection involves (i) identifying the position of the EC, relative to some overt word tokens in the same sentence, and (ii) determining the type of EC, e.g., whether it is a dropped pronoun or a trace. We focus on EC detection here because most of the ECs in the Chinese Treebank are either not resolved to an overt element or linked to another EC. For example, dropped pronouns (*pro*) are not resolved, and traces (*T*) in relative clauses are linked to an empty relative pronoun (*OP*).

In previous work, ECs are either represented linearly, where ECs are indexed to the following word (Yang and Xue, 2010), or attached to nodes in a phrase structure tree (Johnson, 2002; Dienes and Dubey, 2003b; Gabbard et al., 2006). In a linear representation where ECs are indexed to the following word, it is difficult to represent consecutive ECs because that will mean more than one EC will be indexed to the same word (making the classification task more complicated). While in English consecutive ECs are relatively rare, in Chinese this is very common. For example, it is often the case that an empty relative pronoun (*OP*) is followed immediately by a trace (*T*). Another issue with the linear representation of ECs is that it leaves unspecified where the EC should be attached, and crucial dependencies between ECs and other elements in the syntactic structure are not represented, thus limiting the utility of this task.

In a phrase structure representation, ECs are attached to a hierarchical structure and the problem of multiple ECs indexed to the same word token can be avoided because linearly consecutive ECs may be attached to different non-terminal nodes in a phrase structure tree. In a phrase structure framework, ECs are evaluated based on their linear position as well as on their contribution to the overall accuracy of the syntactic parse (Cai et al., 2011).

In the present work, we propose to look at EC detection in a dependency structure representation, where we define EC detection as (i) determining its linear position relative to the following word token, (ii) determining the head it is a dependent of, and (iii) determining the type of EC. Framing EC detection this way also requires a new evaluation metric. An EC is considered to be correctly detected if its linear position, its head, and its type are all correctly determined. We report experimental results that show that, even using this more stringent measure, our EC detection system achieved performance that improved significantly over the state-of-the-art results.

The rest of the paper is organized as follows. In Section 2, we will describe how to represent ECs in a dependency structure in detail and present our approach to EC detection. In Section 3, we describe how linguistic information is encoded as features. In Section 4, we discuss our experimental setup and present our results. In Section 5, we describe related work. Section 6 concludes the paper.

2 Approach

In order to detect ECs anchored in a dependency tree, we first convert the phrase structure trees in the CTB into dependency trees. After the conversion, each word token in a dependency tree, including the ECs, will have one and only one head (or parent). We then train a classifier to predict the position and type of ECs in the dependency tree. Let W be a sequence of word tokens in a sentence and T a syntactic parse tree for W. Our task is to predict whether there is a tuple (h, t, e), such that h and t are word tokens in W, e is an EC, h is the head of e, and t immediately follows e. When EC detection is formulated as a classification task, each classification instance is thus a tuple (h, t). The input to our classifier is T, which can either be a phrase structure tree or a dependency tree. We choose to use a phrase structure tree because phrase structure parsers trained on the Chinese Treebank are readily available, and we also hypothesize that phrase structure trees have a richer hierarchical structure that can be exploited as features for EC detection.

2.1 Empty categories in the Chinese Treebank

According to the CTB bracketing guidelines (Xue and Xia, 2000), there are seven different types of ECs in the CTB. Below is a brief description of the empty categories:

1. *pro*: small pro, used to represent dropped pronouns.
2. *PRO*: big PRO, used to represent shared elements in control structures or elements that have generic references.
3. *OP*: null operator, used to represent empty relative pronouns.
4. *T*: trace left by movement such as topicalization and relativization.
5. *RNR*: right node raising.
6. *: trace left by passivization and raising.
7. *?*: missing elements of unknown category.

An example parse tree with ECs is shown in Figure 1. In the example, there are two ECs, an empty relative pronoun (*OP*) and a trace (*T*), a common syntactic pattern for relative clauses in the ...

In previous work EC detection has been formulated as a classification problem with the target of the classification being word tokens (Yang and Xue, 2010; Chung and Gildea, 2010), or constituents in a parse tree (Gabbard et al., 2006). When word tokens are used as the target of classification, the task is to determine whether there is an EC before each word token, and what type of EC it is. One shortcoming with that representation is that more than one EC can precede the same word token, as is the case in the example in Figure 1, where both *OP* and *T* precede 涉及 ("involve"). In fact, Yang and Xue (2010) take the last EC when there is a sequence of ECs and as a result, some ECs will never get the chance to be detected.
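The tuple formulation and the stricter metric described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `EC` record, the `score` and `last_ec_per_word` helpers, and the token indices in the example are all hypothetical, chosen only to show (a) that an EC counts as correct only when head, following token, and type all match, and (b) how indexing ECs by the following word alone loses one of two consecutive ECs.

```python
from collections import namedtuple

# Hypothetical record for one empty category: `head` and `following` are
# token indices in the sentence (invented for illustration), `ec_type` is
# one of the CTB labels such as "*pro*", "*OP*" or "*T*".
EC = namedtuple("EC", ["head", "following", "ec_type"])


def score(gold, predicted):
    """Return (precision, recall, F1) under the strict matching criterion:
    an EC is correct only if head, following token and type all agree."""
    gold_set, pred_set = set(gold), set(predicted)
    correct = len(gold_set & pred_set)
    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def last_ec_per_word(ecs):
    """Mimic a linear representation that keeps a single EC per following
    token: a later EC in the sequence overwrites an earlier one."""
    by_word = {}
    for ec in ecs:
        by_word[ec.following] = ec
    return list(by_word.values())


if __name__ == "__main__":
    # Two consecutive ECs before the same token, as in a relative clause.
    gold = [EC(5, 3, "*OP*"), EC(3, 3, "*T*")]
    # Same positions and types, but the trace is attached to the wrong head.
    pred = [EC(5, 3, "*OP*"), EC(4, 3, "*T*")]
    print(score(gold, pred))
    print(last_ec_per_word(gold))
```

In this toy case the prediction places both ECs at the right linear position with the right type, yet only one of the two counts as correct because the trace has the wrong head. Keying ECs by the following token alone leaves a single entry, whereas keying them by the (head, following token) pair keeps the two ECs distinct.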