When Being Unseen from Mbert Is Just the Beginning: Handling New Languages with Multilingual Language Models

Total Page:16

File Type:pdf, Size:1020Kb

When Being Unseen from Mbert Is Just the Beginning: Handling New Languages with Multilingual Language Models When Being Unseen from mBERT is just the Beginning: Handling New Languages With Multilingual Language Models Benjamin Mullery* Antonios Anastasopoulosz Benoît Sagoty Djamé Seddahy yInria, Paris, France *Sorbonne Université, Paris, France zDepartment of Computer Science, George Mason University, USA [email protected] [email protected] Abstract those models. Even languages with millions of na- tive speakers like Sorani Kurdish (about 7 million Transfer learning based on pretraining lan- speakers in the Middle East) or Bambara (spoken guage models on a large amount of raw data by around 5 million people in Mali and neighbor- has become a new norm to reach state-of-the- art performance in NLP. Still, it remains unclear ing countries) are not covered by any available how this approach should be applied for unseen language models at the time of writing. languages that are not covered by any available Even if training multilingual models that cover large-scale multilingual language model and more languages and language varieties is tempting, for which only a small amount of raw data is the curse of multilinguality (Conneau et al., 2020) generally available. In this work, by compar- makes it an impractical solution, as it would require ing multilingual and monolingual models, we show that such models behave in multiple ways to train ever larger models. Furthermore, as shown on unseen languages. Some languages greatly by Wu and Dredze(2020), large-scale multilingual benefit from transfer learning and behave simi- language models are sub-optimal for languages that larly to closely related high resource languages are under-sampled during pretraining. whereas others apparently do not. Focusing In this paper, we analyze task and language adap- on the latter, we show that this failure to trans- tation experiments to get usable language model- fer is largely related to the impact of the script based representations for under-studied low re- used to write such languages. We show that transliterating those languages significantly im- source languages. We run experiments on 15 ty- proves the potential of large-scale multilingual pologically diverse languages on three NLP tasks: language models on downstream tasks. This part-of-speech (POS) tagging, dependency parsing result provides a promising direction towards (DEP) and named-entity recognition (NER). making these massively multilingual models Our results bring forth a diverse set of behaviors 1 useful for a new set of unseen languages. that we classify in three categories reflecting the 1 Introduction abilities of pretrained multilingual language models to be used for low-resource languages. We dub Language models are now a new standard to those categories Easy, Intermediate and Hard. build state-of-the-art Natural Language Process- Hard languages include both stable and endan- ing (NLP) systems. In the past year, monolingual gered languages, but they predominantly are lan- language models have been released for more than guages of communities that are majorly under- 20 languages including Arabic, French, German, served by modern NLP. Hence, we direct our atten- and Italian (Antoun et al., 2020; Martin et al., 2020; tion to these Hard languages. For those languages, de Vries et al., 2019; Cañete et al., 2020; Kuratov we show that the script they are written in can be and Arkhipov, 2019; Schweter, 2020, inter alia). a critical element in the transfer abilities of pre- Additionally, large-scale multilingual models cov- trained multilingual language models. Translit- ering more than 100 languages are now available erating them leads to large gains in performance (XLM-R by Conneau et al.(2020) and mBERT outperforming non-contextual strong baselines. To by Devlin et al.(2019)). Still, most of the 6500+ sum up, our contributions are the following: spoken languages in the world (Hammarström, • We propose a new categorization of the low- 2016) are not covered—remaining unseen—by resource languages that are unseen by avail- 1Code available at https://github.com/benjami able language models: the Hard, the Interme- n-mlr/mbert-unseen-languages.git diate and the Easy languages. 448 Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 448–462 June 6–11, 2021. ©2021 Association for Computational Linguistics • We show that Hard languages can be better of pretraining data seems to correlate with down- addressed by transliterating them into a better- stream task performance (e.g. compare BERT and handled script (typically Latin), providing a RoBERTa (Liu et al., 2020)), several attempts have promising direction towards making multilin- shown that training a model from scratch can be ef- gual language models useful for a new set of ficient even if the amount of data in that language is unseen languages. limited. Indeed, Ortiz Suárez et al.(2020) showed that pretraining ELMo models (Peters et al., 2018) 2 Background and Motivation on less than 1GB of text leads to state-of-the-art As Joshi et al.(2020) vividly illustrate, there is a performance while Martin et al.(2020) showed that large divergence in the coverage of languages by pretraining a BERT model on as few as 4GB of di- NLP technologies. The majority of the 6500+ of verse enough data results in state-of-the-art perfor- the world’s languages are not studied by the NLP mance. Micheli et al.(2020) further demonstrated community, since most have few or no annotated that decent performance was achievable with only datasets, making systems’ development challeng- 100MB of raw text data. ing. Adapting large-scale models for low-resource The development of such models is a matter of languages Multilingual language models can be high importance for the inclusion of communities, used directly on unseen languages, or they can the preservation of endangered languages and more also be adapted using unsupervised methods. For generally to support the rise of tailored NLP ecosys- example, Han and Eisenstein(2019) successfully tems for such languages (Schmidt and Wiegand, used unsupervised model adaptation of the English 2017; Stecklow, 2018; Seddah et al., 2020). In BERT model to Early Modern English for sequence that regard, the advent of the Universal Dependen- labeling. Instead of fine-tuning the whole model, cies project (Nivre et al., 2016) and the WikiAnn Pfeiffer et al.(2020) recently showed that adapter dataset (Pan et al., 2017) have greatly increased layers (Houlsby et al., 2019) can be injected into the number of covered languages by providing an- multilingual language models to provide parameter notated datasets for more than 90 languages for efficient task and language transfer. dependency parsing and 282 languages for NER. Regarding modeling approaches, the emergence Still, as of today, the availability of monolin- of multilingual representation models, first with gual or multilingual language models is limited to static word embeddings (Ammar et al., 2016) and approximately 120 languages, leaving many lan- then with language model-based contextual rep- guages without access to valuable NLP technology, resentations (Devlin et al., 2019; Conneau et al., although some are spoken by millions of people, 2020) enabled transfer from high to low-resource including Bambara and Sorani Kurdish, or are an languages, leading to significant improvements in official language of the European Union, like Mal- downstream task performance (Rahimi et al., 2019; tese. Kondratyuk and Straka, 2019). Furthermore, in What can be done for unseen languages? Un- their most recent forms, these multilingual mod- seen languages strongly vary in the amount of els process tokens at the sub-word level (Kudo available data, in their script (many languages use and Richardson, 2018). As such, they work in an non-Latin scripts such as Sorani Kurdish and Min- open vocabulary setting, only constrained by the grelian), and in their morphological or syntactical pretraining character set. This flexibility enables properties (most largely differ from high-resource such models to process any language, even those Indo-European languages). This makes the design that are not part of their pretraining data. of a methodology to build contextualized models When it comes to low-resource languages, one for such languages challenging at best. In this work, direction is to simply train contextualized embed- by experimenting with 15 typologically diverse un- ding models on whatever data is available. An- seen languages, (i) we show that there is a diversity other option is to adapt/fine-tune a multilingual of behavior depending on the script, the amount of pretrained model to the language of interest. We available data, and the relation to the pretraining briefly discuss these two options. languages; (ii) Focusing on the unseen languages Pretraining language models on a small that lag in performance compared to their easier-to- amount of raw data Even though the amount handle counterparts, we show that the script plays a 449 critical role in the transfer abilities of multilingual 3.3 Language Models language models. Transliterating such languages In all our study, we train our language models using to a script which is used by a related language seen the Transformers library (Wolf et al., 2020). during pretraining can lead to significant improve- ment in downstream performance. MLM from scratch The first approach we eval- uate is to train a dedicated language model from 3 Experimental Setting scratch on the available
Recommended publications
  • A Survey of Orthographic Information in Machine Translation 3
    Machine Translation Journal manuscript No. (will be inserted by the editor) A Survey of Orthographic Information in Machine Translation Bharathi Raja Chakravarthi1 ⋅ Priya Rani 1 ⋅ Mihael Arcan2 ⋅ John P. McCrae 1 the date of receipt and acceptance should be inserted later Abstract Machine translation is one of the applications of natural language process- ing which has been explored in different languages. Recently researchers started pay- ing attention towards machine translation for resource-poor languages and closely related languages. A widespread and underlying problem for these machine transla- tion systems is the variation in orthographic conventions which causes many issues to traditional approaches. Two languages written in two different orthographies are not easily comparable, but orthographic information can also be used to improve the machine translation system. This article offers a survey of research regarding orthog- raphy’s influence on machine translation of under-resourced languages. It introduces under-resourced languages in terms of machine translation and how orthographic in- formation can be utilised to improve machine translation. We describe previous work in this area, discussing what underlying assumptions were made, and showing how orthographic knowledge improves the performance of machine translation of under- resourced languages. We discuss different types of machine translation and demon- strate a recent trend that seeks to link orthographic information with well-established machine translation methods. Considerable attention is given to current efforts of cog- nates information at different levels of machine translation and the lessons that can Bharathi Raja Chakravarthi [email protected] Priya Rani [email protected] Mihael Arcan [email protected] John P.
    [Show full text]
  • Word Representation Or Word Embedding in Persian Texts Siamak Sarmady Erfan Rahmani
    1 Word representation or word embedding in Persian texts Siamak Sarmady Erfan Rahmani (Abstract) Text processing is one of the sub-branches of natural language processing. Recently, the use of machine learning and neural networks methods has been given greater consideration. For this reason, the representation of words has become very important. This article is about word representation or converting words into vectors in Persian text. In this research GloVe, CBOW and skip-gram methods are updated to produce embedded vectors for Persian words. In order to train a neural networks, Bijankhan corpus, Hamshahri corpus and UPEC corpus have been compound and used. Finally, we have 342,362 words that obtained vectors in all three models for this words. These vectors have many usage for Persian natural language processing. and a bit set in the word index. One-hot vector method 1. INTRODUCTION misses the similarity between strings that are specified by their characters [3]. The Persian language is a Hindi-European language The second method is distributed vectors. The words where the ordering of the words in the sentence is in the same context (neighboring words) may have subject, object, verb (SOV). According to various similar meanings (For example, lift and elevator both structures in Persian language, there are challenges that seem to be accompanied by locks such as up, down, cannot be found in languages such as English [1]. buildings, floors, and stairs). This idea influences the Morphologically the Persian is one of the hardest representation of words using their context. Suppose that languages for the purposes of processing and each word is represented by a v dimensional vector.
    [Show full text]
  • Language Resource Management System for Asian Wordnet Collaboration and Its Web Service Application
    Language Resource Management System for Asian WordNet Collaboration and Its Web Service Application Virach Sornlertlamvanich1,2, Thatsanee Charoenporn1,2, Hitoshi Isahara3 1Thai Computational Linguistics Laboratory, NICT Asia Research Center, Thailand 2National Electronics and Computer Technology Center, Pathumthani, Thailand 3National Institute of Information and Communications Technology, Japan E-mail: [email protected], [email protected], [email protected] Abstract This paper presents the language resource management system for the development and dissemination of Asian WordNet (AWN) and its web service application. We develop the platform to establish a network for the cross language WordNet development. Each node of the network is designed for maintaining the WordNet for a language. Via the table that maps between each language WordNet and the Princeton WordNet (PWN), the Asian WordNet is realized to visualize the cross language WordNet between the Asian languages. We propose a language resource management system, called WordNet Management System (WNMS), as a distributed management system that allows the server to perform the cross language WordNet retrieval, including the fundamental web service applications for editing, visualizing and language processing. The WNMS is implemented on a web service protocol therefore each node can be independently maintained, and the service of each language WordNet can be called directly through the web service API. In case of cross language implementation, the synset ID (or synset offset) defined by PWN is used to determined the linkage between the languages. English PWN. Therefore, the original structural 1. Introduction information is inherited to the target WordNet through its sense translation and sense ID. The AWN finally Language resource becomes crucial in the recent needs for cross cultural information exchange.
    [Show full text]
  • Pulley / Descender / Belay Device
    Multi-Purpose Device 11 mm EN FR NO SE Model No. 333010-CE Pulley / Descender / Belay Device EN 12278:2007 EN 341:2011/2A 1019 EN 12841:2006/C Patented WARNING Activities involving the use of this device are potentially dangerous. You are responsible for your own actions and decisions. Before using this device, you must: • Read and understand these user instructions and • Familiarize yourself with its capabilities warnings; and limitations; • Get specific training in its proper use; • Understand and accept the risks involved. FAILURE TO HEED ANY OF THESE WARNINGS MAY RESULT IN SEVERE INJURY OR DEATH. 0 Traceability and Markings D E B F A G C 1019 A. Body controlling production of this PPE E. Individual........................... number No. 1019 00 000 M 0000 VVUÚ, a.s. Unit serial number Pikartská 1337/7 716 07 Ostrava - Radvanice Control Czech Republic Day of manufacture Year of manufacture B. Standards F. Anchor/load end of rope C. Carefully read the instructions for use G. Free end of rope D. Model identification 2 1 Field of Application 4 Inspection, Points to Verify See Text See Text 2 Breaking Strength 5 Compatibility 44 kN O 11 mm (EN) Rope (core + sheath) static, semi-static (EN 1891) type A 22 kN 22 kN 3 Nomenclature of Parts See Text 2 7 8 1 6 10 9 3 5 4 3 6 Installing the Rope LOAD SIDE 2 4 1 3 7 Function Test 8 Securing Function Test STOP! 4 9a EN 341:2011/2A Rescue Descender Lowering From Anchor— Maximum Descent Height: 200 m One Person Minimum/Maximum Working Load: 30–240 kg Maximum Descent Rate: 2 m/s Descent—Two People
    [Show full text]
  • Da Heng Building
    fabric&glass Sefar AG Hinterbissaustrasse 12 9410 Heiden Switzerland Da Heng Building Phone +41 71 898 51 04 Taichung (TW) [email protected] www.sefararchitecture.com [FactBox] Within a spacious green park is the Da Additionally, independent studies have Heng building, which was finished in 2016. also shown that the risk of bird strikes Project/Location The company chairman specifically chose on glass facades with integrated SEFAR® Da Heng, Taichung (TW) this fabric for its progressive appearance Architecture VISION Fabric is significantly www.yojicon.com.tw/list.php and unique attributes. SEFAR® Architecture reduced. The special finish using Sentry- Architect VISION PR 260/50 met. grey fabric was Glas® Foil for facade windows meets the C.Y.Lee & Partners Architects/Planners, Taipei (TW) laminated in the lower curtain wall to re- very highest safety standards. www.cylee.com/team1_tw.php duce glare and provide priavacy at the street The glass lites are fixed on four sides in Designer level for the buidlings occupants. an attached profile. The lites are 1.60 m x + Ray Chen International, Taipei (TW) SEFAR® Architecture VISION is a range of 3 and 4 m high for a total of over 1000 sqm, www.raychen-intl.com.tw high-precision fabrics made from synthet- using 10 mm low iron glass. Glass Manufacturer ic black fibers. In a complex process, fab- Stanley Glass Co. Ltd, Keelung (TW) ric is metal-coated on one side. By means www.stanleyglass.com of digital printing, the coated fabric sides Foil can be further individualized. The reflec- SentryGlas® supplied by Kuraray/DuPont tive properties reduces heat input consid- www.sentryglas.com erably and makes a correspondingly valu- Fabric able contribution to the environment and SEFAR® Architecture VISION PR 260/50 met.
    [Show full text]
  • Reply to Heng: Inborn Aneuploidy and Chromosomal Instability
    LETTER LETTER Reply to Heng: Inborn aneuploidy and chromosomal instability It is with great interest that we read the letter chromosome gains (3). Furthermore, our re- Heng for bringing up the interesting topic of by Heng (1), who kindly brought up the port included cells from patients with con- NCCAs in cancer, but we still feel that the interesting topic of nonclonal-chromosome stitutional trisomy 8, a type of aneuploidy title of our report is justified by its results. aberrations (NCCAs) in the context of our commoninhematologicalaswellassolid recent report on the relationship between neoplasms. We also included cells with Anders Valind1 and David Gisselsson inborn aneuploidy and chromosome insta- multiple trisomies. Neither trisomy 8 nor Department of Clinical Genetics, Lund bility (CIN). We fully agree with Heng that such double trisomies are compatible with University, Lund 22184, Sweden NCCAs are better proxies for CIN than live birth. clonal-chromosome aberrations (CCAs). In In summary, we showed that none of fact, we used the frequency of NCCA as a fairly large spectrum of whole chromosome 1 Heng HH (2014) Distinguishing constitutional and acquired proxy to CIN in our study (2). We also agree gains could automatically destabilize the nonclonal aneuploidy. Proc Natl Acad Sci USA 111:E972. 2 Valind A, Jin Y, Baldetorp B, Gisselsson D (2013) Whole that it might well be that aneuploidy causes chromosome number when present in be- chromosome gain does not in itself confer cancer-like chromosomal other forms of genomic instability, but our nign human cells. Our findings thereby infer instability. Proc Natl Acad Sci USA 110(52):21119–21123.
    [Show full text]
  • Generating Sense Embeddings for Syntactic and Semantic Analogy for Portuguese
    Generating Sense Embeddings for Syntactic and Semantic Analogy for Portuguese Jessica´ Rodrigues da Silva1, Helena de Medeiros Caseli1 1Federal University of Sao˜ Carlos, Department of Computer Science [email protected], [email protected] Abstract. Word embeddings are numerical vectors which can represent words or concepts in a low-dimensional continuous space. These vectors are able to capture useful syntactic and semantic information. The traditional approaches like Word2Vec, GloVe and FastText have a strict drawback: they produce a sin- gle vector representation per word ignoring the fact that ambiguous words can assume different meanings. In this paper we use techniques to generate sense embeddings and present the first experiments carried out for Portuguese. Our experiments show that sense vectors outperform traditional word vectors in syn- tactic and semantic analogy tasks, proving that the language resource generated here can improve the performance of NLP tasks in Portuguese. 1. Introduction Any natural language (Portuguese, English, German, etc.) has ambiguities. Due to ambi- guity, the same word surface form can have two or more different meanings. For exam- ple, the Portuguese word banco can be used to express the financial institution but also the place where we can rest our legs (a seat). Lexical ambiguities, which occur when a word has more than one possible meaning, directly impact tasks at the semantic level and solving them automatically is still a challenge in natural language processing (NLP) applications. One way to do this is through word embeddings. Word embeddings are numerical vectors which can represent words or concepts in a low-dimensional continuous space, reducing the inherent sparsity of traditional vector- space representations [Salton et al.
    [Show full text]
  • Unicode Alphabets for L ATEX
    Unicode Alphabets for LATEX Specimen Mikkel Eide Eriksen March 11, 2020 2 Contents MUFI 5 SIL 21 TITUS 29 UNZ 117 3 4 CONTENTS MUFI Using the font PalemonasMUFI(0) from http://mufi.info/. Code MUFI Point Glyph Entity Name Unicode Name E262 � OEligogon LATIN CAPITAL LIGATURE OE WITH OGONEK E268 � Pdblac LATIN CAPITAL LETTER P WITH DOUBLE ACUTE E34E � Vvertline LATIN CAPITAL LETTER V WITH VERTICAL LINE ABOVE E662 � oeligogon LATIN SMALL LIGATURE OE WITH OGONEK E668 � pdblac LATIN SMALL LETTER P WITH DOUBLE ACUTE E74F � vvertline LATIN SMALL LETTER V WITH VERTICAL LINE ABOVE E8A1 � idblstrok LATIN SMALL LETTER I WITH TWO STROKES E8A2 � jdblstrok LATIN SMALL LETTER J WITH TWO STROKES E8A3 � autem LATIN ABBREVIATION SIGN AUTEM E8BB � vslashura LATIN SMALL LETTER V WITH SHORT SLASH ABOVE RIGHT E8BC � vslashuradbl LATIN SMALL LETTER V WITH TWO SHORT SLASHES ABOVE RIGHT E8C1 � thornrarmlig LATIN SMALL LETTER THORN LIGATED WITH ARM OF LATIN SMALL LETTER R E8C2 � Hrarmlig LATIN CAPITAL LETTER H LIGATED WITH ARM OF LATIN SMALL LETTER R E8C3 � hrarmlig LATIN SMALL LETTER H LIGATED WITH ARM OF LATIN SMALL LETTER R E8C5 � krarmlig LATIN SMALL LETTER K LIGATED WITH ARM OF LATIN SMALL LETTER R E8C6 UU UUlig LATIN CAPITAL LIGATURE UU E8C7 uu uulig LATIN SMALL LIGATURE UU E8C8 UE UElig LATIN CAPITAL LIGATURE UE E8C9 ue uelig LATIN SMALL LIGATURE UE E8CE � xslashlradbl LATIN SMALL LETTER X WITH TWO SHORT SLASHES BELOW RIGHT E8D1 æ̊ aeligring LATIN SMALL LETTER AE WITH RING ABOVE E8D3 ǽ̨ aeligogonacute LATIN SMALL LETTER AE WITH OGONEK AND ACUTE 5 6 CONTENTS
    [Show full text]
  • Heng Xian and the Problem of Studying Looted Artifacts
    Dao (2013) 12:153–160 DOI 10.1007/s11712-013-9323-4 Heng Xian and the Problem of Studying Looted Artifacts Paul R. Goldin Published online: 10 April 2013 # Springer Science+Business Media Dordrecht 2013 Abstract Heng Xian is a previously unknown text reconstructed by Chinese scholars out of a group of more than 1,200 inscribed bamboo strips purchased by the Shanghai Museum on the Hong Kong antiquities market in 1994. The strips have all been assigned an approximate date of 300 B.C.E., and Heng Xian allegedly consists of thirteen of them, but each proposed arrangement of the strips is marred by unlikely textual transitions. The most plausible hypothesis is one that Chinese scholars do not appear to take seriously: that we are missing one or more strips. The paper concludes with a discussion of the hazards of studying unprovenanced artifacts that have appeared during China’s recent looting spree. I believe the time has come for scholars to ask themselves whether their work indirectly abets this destruction of knowledge. Keywords Heng Xian . Chinese philosophy . Shanghai Museum . Looting Heng Xian 恆先 (In the Primordial State of Constancy) is a previously unknown text reconstructed by Chinese scholars out of a group of more than 1,200 inscribed bamboo strips purchased by the Shanghai Museum on the Hong Kong antiquities market in 1994 (MA Chengyuan 2001: 1). The strips have all been assigned an approximate date of 300 B.C.E., and Heng Xian consists of thirteen of them. The first published version was edited by the veteran palaeographer LI Ling 李零 (LI Ling 2003).
    [Show full text]
  • Highway Dharma Letters: Two Buddhist Pilgrims Write to Their Teacher
    Highway Dharma Letters: Two Buddhist Pilgrims Write to Their Teacher The Second Edition of News from True Cultivators by Heng Sure and Heng Chau With a Foreword by Huston Smith Foreword to the Second Edition I have written forewords to more than forty-five books, but I say with- out hesitation that I have never been as honored—and humbled—as I am to have been invited to write this one. This is the most neglected book of the twentieth century—I say this categorically and with com- plete confidence. Come to think of it, though, this stands to reason, for humility requires that authentic spirituality and public relations stand in inverse ration to each other. (“When you pray, go into your closet.”) Humility shows itself also in the fact that the “News” in the title takes the form of daily letters written to the Venerable Master Hsuan Hua, by his two disciples who made the eight-hundred-mile journey from Gold Wheel Temple in South Pasadena to the City of Ten Thousand Buddhas near Ukiah. They covered the distance in the traditional penitential way: three steps followed by a full prostration. One took a vow of complete silence for the duration of the pilgrimage and three years beyond; the other managed practical affairs, did all the talking, driving, cooking, and dealing with occasionally hostile visitors, while still finding time to bow. Having said that this is the most overlooked book of the twenti- eth century, I want only to add that I find it one of the most inspir- ing.
    [Show full text]
  • Quantum Integrable Systems from Conformal Blocks Quantum
    Quantum Integrable Systems from Conformal Blocks (arXiv: 1605.05105 with Joshua D. Qualls) Heng-Yu Chen National Taiwan University Strings 2016 @ Beijing Heng-Yu Chen National Taiwan University Quantum Integrable Systems from Conformal Blocks This talk attempts to discuss the following questions: I Are there more systematic or even more efficient ways to obtain conformal blocks in various dimensions and set up? I Can one extend the correspondence between the degenerate correlation functions and quantum integrable systems in d = 2 dim. CFTs to d > 2 dim. CFTs? I If so, how general such a correspondence is? Heng-Yu Chen National Taiwan University Quantum Integrable Systems from Conformal Blocks Let us begin with the four point function of scalar conformal primary operator φi (x) with scale dimension ∆i in d-dim. CFTs: 2 a 2 b x14 x14 F (u; v) < φ1(x1)φ2(x2)φ3(x3)φ4(x4) >= x 2 x 2 2 (∆1+∆2) 2 (∆3+∆4) 24 13 (x12) 2 (x34) 2 (1) ∆2−∆1 ∆3−∆4 where xij = xi − xj a = 2 , b = 2 and F (u; v) is a function of conformally invariant cross ratios: 2 2 2 2 x12x34 x14x23 u = 2 2 = zz¯; v = 2 2 = (1 − z)(1 − z¯): (2) x13x24 x13x24 Decomposing the four point function further into contributions from the individual exchanged primary operators O∆;l and its descendants: X < φ1(x1)φ2(x2)φ3(x3)φ4(x4) >= λ12O∆;l λ34O∆;l WO∆;l (xi ) (3) fO∆;l g WO∆;l (xi ) is \conformal partial wave" and λijO∆;l are \OPE coefficients.".
    [Show full text]
  • TEI and the Documentation of Mixtepec-Mixtec Jack Bowers
    Language Documentation and Standards in Digital Humanities: TEI and the documentation of Mixtepec-Mixtec Jack Bowers To cite this version: Jack Bowers. Language Documentation and Standards in Digital Humanities: TEI and the documen- tation of Mixtepec-Mixtec. Computation and Language [cs.CL]. École Pratique des Hauts Études, 2020. English. tel-03131936 HAL Id: tel-03131936 https://tel.archives-ouvertes.fr/tel-03131936 Submitted on 4 Feb 2021 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Préparée à l’École Pratique des Hautes Études Language Documentation and Standards in Digital Humanities: TEI and the documentation of Mixtepec-Mixtec Soutenue par Composition du jury : Jack BOWERS Guillaume, JACQUES le 8 octobre 2020 Directeur de Recherche, CNRS Président Alexis, MICHAUD Chargé de Recherche, CNRS Rapporteur École doctorale n° 472 Tomaž, ERJAVEC Senior Researcher, Jožef Stefan Institute Rapporteur École doctorale de l’École Pratique des Hautes Études Enrique, PALANCAR Directeur de Recherche, CNRS Examinateur Karlheinz, MOERTH Senior Researcher, Austrian Center for Digital Humanities Spécialité and Cultural Heritage Examinateur Linguistique Emmanuel, SCHANG Maître de Conférence, Université D’Orléans Examinateur Benoit, SAGOT Chargé de Recherche, Inria Examinateur Laurent, ROMARY Directeur de recherche, Inria Directeur de thèse 1.
    [Show full text]