GLOBALEX 2016 Lexicographic Resources for Human Language Technology
Total Page:16
File Type:pdf, Size:1020Kb
GLOBALEX 2016 Lexicographic Resources for Human Language Technology Workshop Programme 24 May 2016 09:00 – 09:10 Introduction by Ilan Kernerman and Simon Krek 09:10 – 10:30 Session 1 Patrick Hanks, A common-sense paradigm for linguistic research Pamela Faber, Pilar León-Araúz and Arianne Reimerink, EcoLexicon: New features and challenges Sara Carvalho, Rute Costa and Christophe Roche, Ontoterminology meets lexicography: The Multimodal Online Dictionary of Endometriosis (MODE) Gregory Grefenstette and Lawrence Muchemi, Determining the characteristic vocabulary for a specialized dictionary using Word2vec and a directed crawler 10:30 – 11:00 Coffee break 11:00 – 12:40 Session 2 Malin Ahlberg, Lars Borin, Markus Forsberg, Olof Olsson, Anne Schumacher and Jonatan Uppström, Karp: Språkbanken’s open lexical infrastructure Ivelina Stoyanova, Svetla Koeva, Maria Todorova and Svetlozara Leseva, Semi-automatic compilation of a very large multiword expression dictionary for Bulgarian Raffaele Simone and Valentina Piunno, CombiNet: Italian word combinations in an online lexicographic tool Irena Srdanovic and Iztok Kosem, GDEX for Japanese: Automatic extraction of good dictionary example candidates Jana Klímová, Veronika Kolářová and Anna Vernerová, Towards a corpus-based valency lexicon of Czech nouns 12:40 – 14:20 Lunch break 14:20 – 16:00 Session 3 Jan Hajic, Eva Fucikova, Jana Sindlerova and Zdenka Uresova, Verb argument pairing in a Czech- English treebank Sonja Bosch and Laurette Pretorius, The role of computational Zulu verb morphology in multilingual lexicographic applications Martin Benjamin, Toward a global online living dictionary: A model and obstacles for big linguistic data Luis Morgado da Costa, Francis Bond and František Kratochvil, Linking and disambiguating Swadesh texts Luis Espinoza Anke, Roberto Carlini, Horacio Saggion and Francesco Ronzano, DEFEXT: A semi- supervised definition extraction tool 16:00 – 16:30 Coffee break I 16:30 – 17:00 Session 4 Julia Bosque-Gil, Jorge Gracia, Elena Montiel-Ponsoda and Guadalupe Aguado-de-Cea, Modelling multilingual lexicographic resources for the web of data: The K Dictionaries case Ivett Benyeda, Péter Koczka and Tamás Varadi, Creating seed lexicons for under-resourced languages 17:00 – 18:00 GLOBALEX Discussion Editors Ilan Kernerman K Dictionaries Iztok Kosem Trojina, Institute for Applied Slovene Studies Simon Krek “Jožef Stefan” Institute Lars Trap-Jensen Society for Danish Language and Literature Workshop Organizers/Organizing Committee Andrea Abel EURALEX Ilan Kernerman* ASIALEX Steven Kleinedler DSNA Iztok Kosem eLex Simon Krek* eLex Julia Miller AUSTRALEX Maropeng Victor Mojela AFRILEX Danie J. Prinsloo AFRILEX Rachel Edita O. Roxas ASIALEX Lars Trap-Jensen EURALEX Luanne von Schneidemesser DSNA Michael Walsh AUSTRALEX * Co-chairs of the Organising Committee Workshop Programme Committee Michael Adams Indiana University Philipp Cimiano University of Bielefeld Janet DeCesaris Universitat Pompeu Fabra Thierry Declerck German Research Center for Artificial Intelligence Anne Dykstra Fryske Akademie Edward Finegan University of Southern California Thierry Fontenelle Translation Center for the Bodies of the EU Polona Gantar University of Ljubljana Alexander Geyken Berlin-Brandenburg Academy of Sciences and Humanities Rufus Gouws Stellenbosch University Jorge Gracia Madrid Polytechnic University Orin Hargraves University of Colorado Ulrich Heid Hildesheim University II Chu-Ren Huang Hong Kong Polytechnic University Miloš Jakubíček Lexical Computing – Sketch Engine Jelena Kallas Institute of Estonian Language Ilan Kernerman K Dictionaries Annette Klosa German Language Institute Iztok Kosem Trojina, Institute for Applied Slovene Studies Simon Krek “Jožef Stefan” Institute Robert Lew Adam Mickiewicz University Marie Claude l’Homme University of Montreal Nikola Ljubešić University of Zagreb Stella Markantonatou Institute for Language and speech Processing ATHENA John McCrae National University of Ireland Galway Roberto Navigli Sapienza University of Rome Vincent Ooi National University of Singapore Michael Rundell Lexicography Masterclass Mary Salisbury Massey University Adam Smith Macquarie University Pius ten Hacken Innsbruck University Carole Tiberius Institute of Dutch Lexicology Yukio Tono Tokyo University of Foreign Studies Lars Trap-Jensen Society for Danish Language and Literature Tamás Váradi Hungarian Academy of Sciences Elena Volodina Gothenburg University Eveline Wandl-Vogt Austrian Academy of Sciences Shigeru Yamada Waseda University III Table of contents Towards a Corpus-based Valency Lexicon of Czech Nouns 1 Jana Klímová, Veronika Kolářová, Anna Vernerová Ontoterminology meets Lexicography: the Multimodal Online Dictionary of 8 Endometriosis (MODE) Sara Carvalho, Rute Costa, Christophe Roche Verb Argument Pairing in Czech-English Parallel Treebank 16 Jan Hajič, Eva Fučíková, Jana Šindlerová, Zdeňka Urešová DEFEXT: A Semi Supervised Definition Extraction Tool 24 Luis Espinosa-Anke, Roberto Carlini, Horacio Saggion, Francesco Ronzano Linking and Disambiguating Swadesh Lists: Expanding the Open Multilingual 29 Wordnet Using Open Language Resources Luís Morgado da Costa, Francis Bond, František Kratochvíl The Role of Computational Zulu Verb Morphology in Multilingual 37 Lexicographic Applications Sonja Bosch, Laurette Pretorius CombiNet. A Corpus-based Online Database of Italian Word Combinations 45 Valentina Piunno Creating seed lexicons for under-resourced languages 52 Ivett Benyeda, Péter Koczka, Tamás Váradi GDEX for Japanese: Automatic Extraction of Good Dictionary Example 57 Candidates Irena Srdanović, Iztok Kosem Modelling multilingual lexicographic resources for the Web of Data: The K 65 Dictionaries case Julia Bosque-Gil, Jorge Gracia, Elena Montiel-Ponsoda, Guadalupe Aguado-de-Cea EcoLexicon: New Features and Challenges 73 Pamela Faber, Pilar León-Araúz, Arianne Reimerink Determining the Characteristic Vocabulary for a Specialized Dictionary using 81 Word2vec and a Directed Crawler Gregory Grefenstette, Lawrence Muchemi Semi-automatic Compilation of the Dictionary of Bulgarian Multiword 86 Expressions Svetla Koeva, Ivelina Stoyanova, Maria Todorova, Svetlozara Leseva IV Author Index Aguado-de-Cea, Guadalupe ……...….…… 65 León-Araúz, Pilar ………………………… 73 Leseva, Svetlozara ……………………….. 86 Benyeda, Ivett ……………………………. 52 Bond, Francis ……………………....…….. 29 Montiel-Ponsoda, Elena ………………….. 65 Bosch, Sonja …………………….……....... 37 Morgado da Costa, Luís ………………….. 29 Bosque-Gil, Julia ………………….……… 65 Muchemi, Lawrence ……………………… 81 Carlini, Roberto ………………………..…. 24 Piunno, Valentina ……………………….... 45 Carvalho, Sara …………………………...… 8 Pretorius, Laurette …………………...…… 37 Costa, Rute ……………………………...…. 8 Reimerink, Arianne ………………….…… 73 Espinosa-Anke, Luis ……………………... 24 Roche, Christophe …………………………. 8 Ronzano, Francesco ……………………… 24 Faber, Pamela ………………………….…. 73 Fučíková, Eva ………………………….… 16 Saggion, Horacio ……………………….… 24 Srdanović, Irena ……………………..…… 57 Gracia, Jorge …………………………....... 65 Stoyanova, Ivelina ………………………... 86 Grefenstette, Gregory …………………….. 81 Šindlerová, Jana ………………………….. 16 Hajič, Jan …………………………………. 16 Todorova, Maria ………………………….. 86 Klímová, Jana …………………………..….. 1 Koczka, Péter …………………………….. 52 Urešová, Zdeňka …………………………. 16 Koeva, Svetla …………………………….. 86 Kolářová, Veronika ………………………... 1 Váradi, Tamás ………………………….… 52 Kosem, Iztok ………………………...…… 57 Vernerová, Anna …………………………… 1 Kratochvíl, František …………………..…. 29 V Introduction The field of lexicography has been shifting to digital media, with effect on all stages of research, development, design, evaluation, publication, marketing and usage. Modern lexicographic content is created with help of dictionary writing tools, corpus query systems and QA applications, and becomes more easily accessible and useful for integration with numerous LT solutions, as part of bigger knowledge systems and collaborative intelligence. At the same time, extensive interlinked language resources, primarily intended for use in Human Language Technology (HLT), are being created through projects, movements and initiatives, such as Linguistic Linked (Open) Data (LLOD), meeting requirements for optimal use in HLT, e.g. unique identification and use of web standards (RDF or JSON-LD), leading to better federation, interopera- bility and flexible representation. In this context, lexicography constitutes a natural and vital part of the LLOD scheme, currently represented by wordnets, FrameNets, and HLT-oriented lexicons, on- tologies and lexical databases. However, a new research paradigm and common standards are still lacking, and so are common standards for the interoperability of lexicography with HLT applications and systems. The aim of this workshop is to explore the development of global standards for the evaluation of lexicographic resources and their incorporation with new language technology services and other devices. The workshop is the first-ever joint initiative by all the major continental lexicography as- sociations, seeking to promote cooperation with related fields of HLT for all languages worldwide, and it is intended to bridge various existing gaps within and among such different research fields and interest groups. The target audience includes lexicographers, computational and corpus linguists and researchers working in the fields of HLT, Linked Data, the Semantic Web, Artificial Intelligence, etc. GLOBALEX 2016 is sponsored by the five existing continental lexicography associations and the international conferences on