Eliciting Specialized Frames from Corpora Using Argument-Structure Extraction Techniques Beatriz Sanchez Cardenas, Carlos Ramisch
To cite this version: Beatriz Sanchez Cardenas, Carlos Ramisch. Eliciting specialized frames from corpora using argument-structure extraction techniques. Terminology. International Journal of Theoretical and Applied Issues in Specialized Communication, John Benjamins Publishing, 2019, 25 (1), pp.1-31. 10.1075/term.00026.san. hal-02318280

HAL Id: hal-02318280
https://hal.archives-ouvertes.fr/hal-02318280
Submitted on 16 Oct 2019

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

This preprint version has been produced by the authors upon acceptance and reflects changes requested by reviewers. The official ‘version of record’ https://doi.org/10.1075/term.00026.san is under copyright and the publisher should be contacted for permission to re-use or reprint the material in any form.

Reference: Sánchez-Cárdenas, Beatriz & Carlos Ramisch (2019). Eliciting specialized frames from corpora using argument-structure extraction techniques. Terminology: An International Journal of Theoretical and Applied Issues in Specialized Communication, 25(1).
DOI: https://doi.org/10.1075/term.00026.san
Authors: Beatriz Sánchez Cárdenas and Carlos Ramisch
Length: 8702 words (excluding references)

Beatriz Sánchez-Cárdenas
Research group LexiCon
Department of Translation and Interpreting
University of Granada
Calle Buensuceso, 11
18002 Granada (Spain)
(+34) 958244104
bsc@ugr.es
http://lexicon.ugr.es/sanchezcardenas

Carlos Ramisch
Aix Marseille Univ, Université de Toulon, CNRS, LIS
Parc Scientifique et Technologique de Luminy
163 Avenue de Luminy - Case 901
13288 Marseille Cedex 9 (France)
(+33) 0 86 09 06 72
Carlos.Ramisch@lis-lab.fr
http://pageperso.lis-lab.fr/~carlos.ramisch

Abstract

Frame Semantics provides a powerful cross-lingual model to describe the conceptual structure underlying specialized language. However, building specialized frames is challenging because of the complex nature of predicate-argument structures, and because of the domain-specific uses of general-language predicates. This article presents a semi-automatic method to elicit semantic frames from specialized corpora. Its goal is to discover lexical patterns that reveal the structure of specialized frames and to populate them with corpus-based data. Firstly, we automatically extracted verb-noun triples from corpora using bootstrapping to identify noun-verb-noun phraseological patterns. Secondly, we annotated each noun-verb-noun triple with the lexical domain of the verbs and the semantic class and role of the noun filling each argument slot. We then used these annotations and patterns to classify similar triples.
This allowed us to make generalizations and infer the structure as well as the types of lexical units that belong to these specialized frames. We evaluated our methodology using specialized corpora of environmental science texts in English and in Spanish.

Keywords

Frame semantics, frame-based terminology, corpora, corpus-based extraction, argument structure
1. Introduction

The study of phraseology in scientific texts tends to focus either on general scientific formulaic templates or on the study of terms for their inclusion in specialized dictionaries. However, the description of the language used in a given scientific or technical domain should go far beyond merely collecting an inventory of terms that are used to instantiate general-language constructs (L’Homme 2004, Hanks 2004, Williams 2005, Granger and Meunier 2008, Faber 2012). In fact, a significant part of specialized language is composed of structured lexico-grammatical constructs used to express complex concepts that are typical of a given domain. There is thus a need to develop specialized lexicons that provide this type of information.

This is particularly evident in translation. Translators dealing with specialized texts often have problems transposing the meaning of a sentence across languages because a superficial knowledge of the terms in a text is not sufficient. In addition to translating terms, it is necessary to translate actions and processes along with the entities that participate in them. For instance, a description of the concept of earthquake should include the entities that generally cause this event as well as its effects on other entities. This would afford translators a more in-depth knowledge of the concept and allow them to express it more idiomatically in the target language. In our opinion, such a description should stem from the analysis of specialized corpora in the source and target languages.
In this endeavor, domain-specific corpora are a rich source of information. Given that verbs carry most of the semantic load of the sentence, they are essential to define the underlying conceptual structure of specialized texts (Fellbaum 1990; L'Homme 1998, 2012). Thus, the identification of noun-verb combinations in corpora is crucial to build structured descriptions.

The corpus-based construction of specialized lexical resources requires both linguistic and domain expertise, as well as suitable tools for performing corpus queries. Computational tools can support, enhance and facilitate corpus analysis to confirm and generalize linguistic introspection. To this end, one often needs to run complex queries that model morphosyntactic and syntactic co-occurrence patterns, which in turn are proxies for predicate-argument structure.

Our research combined the principles of Frame-based Terminology (Faber 2012, 2015; Faber and León Araúz 2014) with computational tools for corpus searches, semantic annotation, and frame specification. For automatic corpus searches, we used the MWEtoolkit, a software application that extracts co-occurrence patterns from corpora using multi-level queries that support regular-expression operators (Ramisch 2015). This approach has its roots in a considerable body of literature from the last 20 years on the identification of knowledge patterns in specialized texts (Faber et al. 2009, Feliu 2004, Condamines 2002, Condamines and Rebeyrolle, Meyer et al. 2001, Meyer et al. 1999, inter alia).
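The idea of a multi-level query with regular-expression operators can be illustrated with a minimal Python sketch. It does not use or reproduce the MWEtoolkit interface: the `Token` type, the single-letter tag set (N, V, D, A), the `PATTERN` expression and the `extract_triples` function are all assumptions introduced here for illustration. The query matches on one level (part of speech) and returns another level (lemmas), yielding noun-verb-noun triples.

```python
import re
from typing import List, NamedTuple, Tuple


class Token(NamedTuple):
    """One corpus token with three annotation levels (hypothetical format)."""
    surface: str
    lemma: str
    pos: str  # assumed single-letter tag: N(oun), V(erb), D(eterminer), A(djective)


# Assumed pattern: noun, verb, optional determiner, any adjectives, noun.
PATTERN = re.compile(r"(N)(V)D?A*(N)")


def extract_triples(sentence: List[Token]) -> List[Tuple[str, str, str]]:
    """Return (noun, verb, noun) lemma triples matched on the POS level."""
    # One character per token, so regex offsets are also token indices.
    tags = "".join(tok.pos for tok in sentence)
    triples = []
    for match in PATTERN.finditer(tags):
        noun1, verb, noun2 = (sentence[match.start(g)].lemma for g in (1, 2, 3))
        triples.append((noun1, verb, noun2))
    return triples


sentence = [
    Token("Earthquakes", "earthquake", "N"),
    Token("cause", "cause", "V"),
    Token("severe", "severe", "A"),
    Token("damage", "damage", "N"),
]
print(extract_triples(sentence))
```

Encoding each token as a single tag character keeps the sketch short; a real query language would need richer constraints (for example on lemmas or syntactic dependencies) than this simplification allows.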
The output of the initial