Automatic Semantic Role Annotation for Spanish

Automatic Semantic Role Annotation for Spanish Eckhard Bick M. Pilar Valverde Ibáñez Institute of Language and Communication Departamento de Lengua Española University of Southern Denmark Universidade de Santiago de Compostela [email protected] [email protected] Abstract jects, as well as inter-annotator agreement and annotation consistency is affected by this tend- This paper describes and evaluates the auto- ency. For Spanish, the ADESSE database (Gar- matic annotation of clause-level complements cía-Miguel and Albertuz 2005) uses a set of 143 with semantic roles in a Spanish Web corpus, roles, the AnCora corpus (Taulé et al. 2008) 20 using a rule- and dependency-based approach. roles and the Sensem corpus (Alonso et al. 2007) In all, 52 different role tags, like agent (§AG), 24 roles. Only the AnCora corpus assigns a se- experiencer (§EXP), location (§LOC) etc. are mantic role to all the complements of the clause, distinguished. The annotator uses a role grammar of 568 hand-written Constraint Grammar while the rest only treat valency-bound comple- rules that take as input the syntactic analysis of ments. the HISPAL parser. A rough evaluation of In our corpus, we use a set of 52 semantic 5000 running words was performed, where the roles, adopting the set of roles used by Bick role annotation achieved an F1 of 81,6% on (2007) for the annotation of Portuguese texts. raw text and 90,0% on syntactically revised in- These cover the major categories of the tecto- put. A Spanish Internet corpus of 11.2 million grammatical annotation layer of the Prague De- words has been compiled and automatically pendency Treebank (Hajicova et al. 2000), as annotated with our semantic role grammar, al- well as those of the Spanish AnCora project. lowing us to provide some linguistic and statistical interpretations about the relationship The rules of the grammar use syntactic-semantic between semantic roles on the one hand and information available in the input (lemma, se- syntactic functions, part of speech and semant- mantic prototype of the head, type of preposition, ic prototypes on the other. etc.) as well as information extracted from corpus-based resources, such as the ADESSE data- 1 Semantic roles base (García-Miguel and Albertuz 2005) and the Spanish CorpusEye corpora A semantic role is the underlying relationship that a syntactic constituent has with a predicate. (http://corp.hum.sdu.dk). Therefore, assigning semantic roles to the argu- 2 The grammar ments of a verb is a way of adding deep semantic information to the analysis of a sentence. With We have developed a role grammar of 568 hand- this type of information, we can answer ques- written Constraint Grammar rules that exploit tions like who, when, where or what happened, syntactic and semantic information to assign role which is useful in systems that require the com- tags to the clause-level complements. Input to the prehension of sentences, like dialogue systems, semantic role grammar is provided by the HIS- information retrieval, information extraction or PAL parser (Bick, 2006a). Linguistically, the automatic translation. three main difficulties to overcome in the assign- The idea of semantic roles has a long linguist- ment of syntactic roles were (a) the relative lack ic tradition, originated in the concept of case of lexical-semantic information, (b) the fact that roles (Fillmore 1968), later termed thematic or there is no clear correspondence between syn- theta roles in Government & Binding theory tactic functions and semantic roles, and (c) the (Jackendoff 1982). behaviour of the multi-ambiguous particle se. A higher level of abstraction often implies less consensus on category definitions in the lin- 2.1. Semantic information guistic community, and in semantic role annota- We used the ADESSE database, that contains tion the level of agreement among different pro- syntactic-semantic information about the clauses Kristiina Jokinen and Eckhard Bick (Eds.) NODALIDA 2009 Conference Proceedings, pp. 215–218 Eckhard Bick and M. Pilar Valverde Ibáñez and verbs of a Spanish corpus of 1.5 million assigned to the accusative object in the active words, to study the relationship between the syn- voice, are assigned to §ARG1&. tactic functions and the role of valency-governed Three annotation principles were followed: clause-level complements. All in all, 96 sets of a) All clause-level complements (valency gov- verb lexemes that typically allow a given role erned or not), are systematically assigned a se- with a given syntactic function have been mantic role, including relative pronouns and ad- defined1, moving part of the lexical information verbial subclauses. into the grammar. For example, the following LIST of verbs (V-SP-SUBJ) contains verbs b) The role tags are assigned to semantic depend- whose subject is usually a speaker. ency heads at the token level, CG-style, i.e. alongside syntactic and other tags. However, the (a) LIST V-SP-SUBJ = "contar" "decir" "hablar" ...; (to tell, to say, to speak) semantic head is not necessarily equivalent to the syntactic head. Thus, pp's were role-tagged not With the list in (a) and the following rule, the on the preposition, but on its dependent. In grammar assigns the role “speaker” (§ SP) to any (sub)clauses, the syntactic head is the first verb subject (or agent complement in the passive of a verb chain, while the semantic head is the voice) (§ARG1&) whose dependency-parent (p) last one. is one of the verbs of the list. c) Only one role is allowed for each token, with (b) MAP (§SP) TARGET §ARG1& (p V-SP-SUBJ); the exception of clause-heading verbs which be- In addition, the semantic features of the head sides §PRED (predicator) also carries the “ex- were also used, exploiting the semantic prototype ternal” function for its clause as a whole. information from the HISPAL lexicon. For example, the following rule (c) assigns the role 2.3. The particle se “destination” (§DES) to a dependent of preposi- So-called “se-constructions”, covering not only tion (@P<) if its semantic prototype is in the set true reflexive use, but also others (pronominal, N-LOC (that contains the semantic prototypes re- unaccusative, passive and impersonal), constitute lated with a locative meaning) and its parent is in one of the main sources of ambiguity in the auto- the set of prepositions PRP-DES (that contains matic syntactic analysis of Spanish, and thus in prepositions that typically introduce this role, further levels of analysis like the semantic one. like hasta (till), en dirección a (towards), etc.). These sentences are syntactically similar, but their argument structure is different. (c) MAP (§DES) TARGET @P< (0 N-LOC LINK p PRP-DES); 3 Constraint Grammar (CG) 2.2. Diathesis alternation Our Constraint Grammar uses the new CG3 The tags §ARG0& and §ARG1& are used to sys- compiler, that was developed by the Danish com- tematize diathesis alternation, and assigned to pany GrammarSoft in an open source frame- two types of arguments, respectively: the argu- work, in cooperation with the VISL project at the ment semantically closest to the predicate (0) University of Southern Denmark (for document- (that corresponds to the subject in active voice) ation, see http://beta.visl.sdu.dk/constraint_gram- and the second closest one (1) (that corresponds mar.html). In fact, the semantic role annotation to the accusative object of transitive verbs in project served as a kind of test bed for a number active voice). Specifically, §ARG1& is assigned of CG3 features, allowing the authors to influ- to the subject of passive clauses or unaccusative ence compiler development according to their verbs and to the accusative object of the rest of needs. verbs. §ARG0& is assigned to the rest of sub- Like previous incarnations of the Constraint jects and to the passive agent of the passive Grammar paradigm (Karlsson 1995), CG3 is ba- voice. The grammar takes the active voice as a sically a disambiguation and information map- reference, and the roles that would be assigned to ping methodology operating on token-based the subject in the active voice are instead as- grammatical tags that can be added, removed or signed to §ARG0&, and the roles that would be changed in an incremental and context-sensitive fashion. Unlike previous editions of the formal- 1 The problem of semantic verb ambiguity was limited, ism, however, CG3 explicitly moves beyond since listing the same verb in 2 different lists was only shallow syntax, by allowing the direct creation necessary where a semantic difference corresponded to and use of dependency and other binary rela- a difference in syntactic subcategorization frames. 216 Automatic Semantic Role Annotation for Spanish tions. CG3 also provides for hybridization with 12 §DES (destination), both of which conflict other major parsing paradigms, integrating cor- with §REC (recipient). Those three functions pus-derived statistical information and feature-at- constitute an important source of error not only tribute unification. Finally, CG3 allows the use in the automatic annotation but also in the manu- of regular expressions, increasing rule and tag set al annotation of Spanish corpus with semantic economy and permitting the on-the-fly reference roles (cf. Vaamonde 2008)5. to lexical information by reference to e.g. gram- A relatively high pay-off can thus be expec- matical morphemes and affixes. ted from tackling the most problematic categor- In CG3's direct use of dependency links, as ies. Using the grammar for corpus creation, we we have seen in rule (c), topological methods intend to create a bootstrapping cycle facilitating (here searching leftward from a noun, for the such work, followed by more precise evaluation nearest preposition with nothing in between but allowing a comparison with other role labelling determiners) are replaced with p (parent), c systems for Spanish that are based on machine (child) or s (sibling) relations.

Load more