Linguistic Preprocessing for Distributional Analysis: Evidence from French

Emmanuel Cartier
Université Paris 13 Sorbonne Paris Cité
LIPN – RCLN – UMR 7030
[email protected]

1 Introduction

Distributional Hypothesis and Cognitive Foundations

For about fifteen years, the statistical paradigm, rooted in the distributional hypothesis (Harris, 1954) and corpus linguistics (Firth, 1957), has prevailed in the NLP field, with many convincing results: multiword expression, part-of-speech and semantic relation identification, and even probabilistic models of language. These studies have identified interesting linguistic phenomena, such as collocations, "collostructions" (Stefanowitsch, 2003) and "word sketches" (Kilgarriff et al., 2004).

Cognitive Semantics (Langacker, 1987, 1991; Geeraerts et al., 1994; Schmid, 2007, 2013) has also introduced novel concepts, most notably that of "entrenchment", which makes it possible to ground the social lexicalization of linguistic signs and to correlate it with repetition in corpora.

Finally, Construction Grammars (Fillmore et al., 1988; Goldberg, 1995, 2003; Croft, 2001, 2004, 2007) have proposed linguistic models which reject the distinction between the lexicon (a list of "words") and the grammar (rules making explicit how words combine): all linguistic signs are constructions, from morphemes to syntactic schemes, leading to the notion of a "constructicon" as the goal of linguistic description.

Computational Models of the Distributional Hypothesis

As far as Computational Linguistics is concerned, the Vector Space Model (VSM) has prevailed as the implementation of the distributional hypothesis, giving rise to continuous sophistication and several state-of-the-art surveys (Turney and Pantel, 2010; Lenci et al., 2010; Kiela and Clark, 2013; Clark, 2015). (Kiela and Clark, 2014) state that the following parameters are involved in any VSM implementation: vector size, window size, window-based or dependency-based context, feature granularity, similarity metric, weighting scheme, and stopword/high-frequency cut-off. Three of them are directly linked to linguistic preprocessing: window-based or dependency-based context, the latter requiring a dependency analysis of the corpus; feature granularity, i.e. whether n-grams are computed over the raw corpus or over a lemmatized or POS-tagged one; and stopword/high-frequency cut-off, i.e. the removal of high-frequency or "tool" words. (Kiela and Clark, 2014) conducted six experiments/tasks with varying values for each parameter, so as to assess the most efficient settings. They conclude that dependency-based contexts do not trigger any improvement over raw-text n-gram computation; that, as far as feature granularity is concerned, stemming yields the best results; and that stopword or high-frequency word removal does yield better results, but only if no raw frequency weighting is applied; this is in line with the conclusion of (Bullinaria and Levy, 2012).

Nevertheless, these conclusions should be refined and completed:

1/ As far as feature granularity is concerned, the authors do not take into account combinations of features from different levels; (Béchet et al., 2012), for example, have shown that combining features from three levels (form, lemma, POS tag) can result in better pattern recognition for specific linguistic tasks; such a combination is also in line with the Cognitive Semantics and Construction Grammar hypothesis that linguistic signs emerge as constructions combining schemes, lemmas and specific forms;

2/ The experiments on dependency-based contexts call for further investigation, as several works (for example Padó and Lapata, 2007) reached the opposite conclusion;

3/ Stopword or high-frequency word removal yields better results if no frequency weighting is applied; but the authors apply, as almost all work in the field does, a brute-force removal based either on "gold standard" stopword lists or on an arbitrary frequency cut-off; this technique should be refined to remove only the noisy words or n-grams, and should be linguistically motivated.
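To make these parameters concrete, here is a minimal sketch of a count-based VSM exposing three of them: feature granularity (including the combined lemma/POS level advocated in point 1/), window size, and a high-frequency cut-off. The data format, function names and toy corpus are illustrative assumptions, not the system evaluated by (Kiela and Clark, 2014).

```python
# Minimal count-based VSM sketch: window-based co-occurrence counting
# with configurable feature granularity and a brute-force frequency cut-off.
from collections import Counter, defaultdict

# A sentence is a list of (form, lemma, pos) triples, as produced by a
# morphosyntactic analyser (assumed input format).
Sentence = list[tuple[str, str, str]]

def feature(token: tuple[str, str, str], granularity: str) -> str:
    """Project a token onto the chosen level of granularity."""
    form, lemma, pos = token
    return {"form": form, "lemma": lemma, "pos": pos,
            "lemma+pos": f"{lemma}/{pos}"}[granularity]

def cooccurrence_vectors(corpus: list[Sentence], granularity: str = "lemma",
                         window: int = 2, freq_cutoff: int = 0):
    """Window-based co-occurrence counts; the freq_cutoff most frequent
    features are removed, mimicking an arbitrary stopword cut-off."""
    freq = Counter(feature(t, granularity) for s in corpus for t in s)
    banned = {f for f, _ in freq.most_common(freq_cutoff)}
    vectors: dict[str, Counter] = defaultdict(Counter)
    for sentence in corpus:
        feats = [feature(t, granularity) for t in sentence]
        for i, target in enumerate(feats):
            if target in banned:
                continue
            for j in range(max(0, i - window), min(len(feats), i + window + 1)):
                if j != i and feats[j] not in banned:
                    vectors[target][feats[j]] += 1
    return vectors

toy = [[("une", "un", "DET"), ("pièce", "pièce", "NC"),
        ("de", "de", "P"), ("pâte", "pâte", "NC"),
        ("aplatie", "aplatir", "VPP")]]
print(cooccurrence_vectors(toy, granularity="lemma+pos", window=2))
```

A linguistically motivated refinement, in this framing, would replace the `freq_cutoff` ban list with a filter targeting only demonstrably noisy features.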
Linguistic Motivation for Linguistic Preprocessing

The hypothesis supported in this paper is that, while repetition of sequences is the best way to access usage and to induce linguistic properties, language users do not rely only on the sequentiality of language, but also on non-sequential knowledge that cannot be recovered from the surface distribution of words alone. This knowledge is linked to the three classical linguistic units: lexical units, phrases and predicate structures, each being a combination of the preceding one according to language-specific construction rules. Probabilistic models of language have so far focused mainly on the lexical unit level; to fully leverage language, probabilistic research must also model and preprocess phrases and predicate structures.

The present paper tries to ground this hypothesis through an experiment aimed at retrieving lexico-semantic relations in French, in which we preprocess the corpus in three ways:
• morphosyntactic analysis
• peripheral lexical unit removal
• phrase identification.

As we will see, these steps make it easier to access the predicate structures that the experiment aims at revealing, while using a VSM model on the resulting preprocessed corpus.

2 Evidence from French: Semantic Relations and Definitions

Definition model

Here we assume that a definitory statement is a statement asserting the essential properties of a lexical unit. It is composed of the definiendum (DFM), i.e. the lexical unit to be defined; the definiens (DFS), i.e. the phrasal expression denoting the essential properties of the DFM lexical unit; and the definition relator (DEFREL), i.e. the linguistic items denoting the semantic relation between the two previous elements. The traditional model of definition decomposes the DFS into two main parts: HYPERNYM + PROPERTIES. Definitory statements can also comprise other information: enunciation components (according to Sisley, a G-component is a ...); domain restrictions (in Astrophysics, a XXX is a YYY).
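As an illustration of this model, the following sketch segments a simplified, untagged French definitory statement into DFM, DEFREL and a DFS decomposed into HYPERNYM + PROPERTIES. The relator inventory and the crude two-token hypernym heuristic are assumptions made for the example; the paper itself relies on trigger-word markup and morphosyntactic analysis.

```python
# Toy segmentation of a definitory statement into DFM / DEFREL / DFS.
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class Definition:
    dfm: str         # definiendum: the lexical unit being defined
    defrel: str      # definition relator, e.g. the copula
    hypernym: str    # head of the definiens
    properties: str  # remaining properties in the definiens

# Hypothetical relator inventory; a real system would use learned trigger words.
RELATORS = re.compile(r"\b(est|désigne|signifie)\b")

def parse_definition(sentence: str) -> Optional[Definition]:
    m = RELATORS.search(sentence)
    if m is None:
        return None
    dfm = sentence[:m.start()].strip()
    dfs = sentence[m.end():].strip().split()
    # Crude heuristic: take the first two tokens ("DET N") as the hypernym phrase.
    hypernym, properties = " ".join(dfs[:2]), " ".join(dfs[2:])
    return Definition(dfm, m.group(1), hypernym, properties)

print(parse_definition("une abaisse est une pièce de pâte aplatie au rouleau"))
# Definition(dfm='une abaisse', defrel='est', hypernym='une pièce',
#            properties='de pâte aplatie au rouleau')
```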
Corpus

We use three corpora and retain only the nominal entries in each:
Trésor de la Langue Française (TLF): 61 234 nominal lexical units and 90 348 definitions;
French Wiktionary (FRWIK): 140 784 nouns, for a total of 187 041 definitions;
Wikipedia (WIKP): 610 013 glosses (i.e. the first sentence of each article) from the French Wikipedia, extracted with a methodology close to that of (Navigli et al., 2008).

The first two are dictionaries (TLF, FRWIK); the last one is an encyclopedia (WIKP). In the dictionaries, definitions obey lexicographic standards, whereas definitions are more "natural" in WIKP.

System Architecture

The system is composed of four steps:
• Morpho-syntactic analysis of the corpus
• Semantic relation trigger-word markup
• Sentence simplification: this step aims at …

… 2. Unify nominal phrases, as they are the target of hypernym relations, and their sparsity complicates the retrieval of patterns. Take the following source definition:

en/P cuisine/NC ,/PONCT un/DET DEFINIENDUM être/V un/DET pièce/NC de/P pâte/NC aplatir/VPP ,/PONCT généralement/ADV au/P+D rouleau/NC à/P pâtisserie/NC ./PONCT
((cooking) an undercrust is a piece of dough that has been flattened, usually with a rolling pin.)

It will be reduced to:

un/DET DEFINIENDUM être/V un/DET pièce/NC de/P pâte/NC aplatir/VPP ,/PONCT au/P+D rouleau/NC à/P pâtisserie/NC ou/CC un/DET laminoir/NC ./PONCT

And we extract the domain restriction: en/P cuisine/NC.

Steps 1 and 2: Adverbials and subordinate clauses

The first linguistic sequences removed from the source sentence are adverbials and specific clauses. We want to remove only clauses that depend on the main predicate, not those that depend on one of its core components. For example, we remove the incidental clause in:

DEFINIENDUM (/PONCT parfois/ADV Apaiang/NPP ,/PONCT même/ADJ prononciation/NC )/PONCT être/V un/DET atoll/NC de/P le/DET république/NC du/P+D Kiribati/NPP ./PONCT

But relative clauses that depend on one of the definiens components should first be extracted:

DEFINIENDUM être/V du/P+D enzymes/NC qui/PROREL contrôler/V le/DET structure/NC topologique/ADJ de/P l’ADN/NC ...

To achieve this, we apply distributional analysis to these clauses with SDMC (Béchet et al., 2012) and human tuning, so as to determine the most frequent patterns and trigger words of incidental clauses at specific locations in the sentence: at the beginning of the definition, and between the definiendum and a definition relator. Some incidental clauses convey a semantic relation, for example the synonymy relation:
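The two rules below give a minimal sketch of this kind of simplification on POS-tagged input, using the tag set of the examples above: extracting a sentence-initial domain restriction and dropping a parenthesized incidental clause. Both hard-coded rules are stand-ins for the patterns and trigger words that the paper induces with SDMC and human tuning.

```python
# Toy simplification of a POS-tagged definition: domain-restriction
# extraction and incidental-clause removal.
Token = tuple[str, str]  # (form, pos)

def simplify(tagged: list[Token]) -> tuple[list[Token], list[Token]]:
    """Return (simplified sentence, extracted domain restriction)."""
    domain: list[Token] = []
    # Rule 1 (assumed pattern): a leading P + NC followed by a comma is
    # treated as a domain restriction, e.g. "en/P cuisine/NC ,/PONCT".
    if (len(tagged) > 2 and tagged[0][1] == "P"
            and tagged[1][1] == "NC" and tagged[2][0] == ","):
        domain, tagged = tagged[:2], tagged[3:]
    # Rule 2: drop everything between parenthesis tokens, e.g.
    # "(/PONCT parfois/ADV Apaiang/NPP ... )/PONCT".
    simplified, depth = [], 0
    for form, pos in tagged:
        if form == "(":
            depth += 1
        elif form == ")":
            depth = max(0, depth - 1)
        elif depth == 0:
            simplified.append((form, pos))
    return simplified, domain

sentence = [("en", "P"), ("cuisine", "NC"), (",", "PONCT"),
            ("un", "DET"), ("DEFINIENDUM", "NC"), ("être", "V"),
            ("un", "DET"), ("pièce", "NC"), ("de", "P"), ("pâte", "NC")]
print(simplify(sentence))
```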