<<

Number 25 ● July 2017 Kernerman kdictionaries.com/kdn News Adam Kilgarriff Prize, 2017

The Adam Kilgarriff Prize was set up in 2016 , including an innovative corpus and a as a memorial to our multi-talented friend, who corpus-based dictionary. Dr. Rutkowski explains contributed so much to the fields of , his project's goals on the last page of this issue. corpus linguistics, and NLP. Along with the other He has accepted an invitation from the organizers Trustees, I’m delighted that the Prize has got off of eLex 2017 in Leiden to give one of the keynote to such a terrific start. Not only did we receive talks at the conference, where he will receive the eight excellent applications, but the projects they Prize in person from Adam’s spouse, Gill Lamden. were based on reflected interesting and original We congratulate all the applicants for their work across the whole spectrum of fields in which impressive submissions, and for making the first Adam himself was engaged. There were submissions in areas iteration of Adam’s prize such a success. Finally, I take this as diverse as corpus building, software tools for language opportunity to thank my fellow Trustees for the great care and research, named entity resources, translation systems, machine thoroughness which they devoted to the task of evaluating all learning, and dictionary-creation. the submissions. We are all confident that this year’s Adam With so many high-quality applications, selecting a winner Kilgarriff Prize has a worthy winner, and we look forward to was a challenging process for the six Trustees, involving inviting applications for the 2019 Prize in due course. several lengthy Skype meetings. But our eventual decision was Michael Rundell unanimous, and we’re all thrilled that the first Adam Kilgarriff Prize is going to Dr. Paweł Rutkowski. Paweł and his team The Trustees of the Adam Kilgarriff Prize are Miloš Jakubíček, at the University of Warsaw have been working for over five Ilan Kernerman, Iztok Kosem, Michael Rundell (Chair), Pavel years on the development of resources for users of Polish Sign Rychlý, and Carole Tiberius.

1 Adam Kilgarriff Prize, 2017 | Michael Rundell 2 Introducing LDL4HELTA: Linked data lexicography for high-end language technology application | Martin Kaltenböck and Ilan Kernerman th 4 Triplifying a dictionary: Some learnings | Timea Turdean and Shrikant Joshi 6 TIAD shared task 2017 – Translation Inference Across | Noam Ordan 25 7 Towards a module for lexicography in OntoLex | Julia Bosque-Gil, Jorge Gracia and jubilee issue Elena Montiel-Ponsoda with a focus on 10 OntoLex 2017 – 1st workshop on the OntoLex model | Philipp Cimiano lexicography and 12 KD API | Morris Alper computational linguistics 13 1st Conference on Language, Data and Knowledge – Galway 2017 | John P. McCrae, Paul Buitelaar, Christian Chiarcos and Sebastian Hellmann 13 Global WordNet Conference – 2018 | Francis Bond 14 The advent of post-editing lexicography | Miloš Jakubíček 15 Sketch Engine and Lexonomy | Ondřej Matuška 16 Phonetic transcription of dotted Hebrew | Alon Itai 18 LOTKS 2017 – Workshop on Language, Ontology, and Knowledge Structures | Fahad Khan 19 The saga of Norsk Ordbok: A scholarly dictionary for the Norwegian vernacular and the Nynorsk language | Oddrun Grønvik 24 NORDIX bilingual dictionaries for Nordic and major European © 2017 All rights reserved. 25 The First Century of English Monolingual Lexicography. Kusujiro Miyoshi | Michael Adams K DICTIONARIES LTD 26 eLex 2017 – Lexicography from Scratch | Carole Tiberius 8 Nahum Hanavi Street 28 Met zoveel woorden. Gids voor trefzeker taal gebruik. Rik Schutz en Ludo Permentier Tel Aviv 6350310 Israel | Anne Dykstra +972-3-5468102 29 GLOBALEX mid-2017 | Ilan Kernerman [email protected] 30 A brief account of ASIALEX 2017 | Hai Xu http://kdictionaries.com 31 Diccionarios electrónicos: perspectivas para el siglo XXI | Beatriz Sánches Cárdenas and Amelia Sanz 32 The Corpus of Polish Sign Language and the Corpus-based Dictionary of Polish Sign Language | Paweł Rutkowski Editor | Ilan Kernerman ISSN 1565-4745 2

Introducing LDL4HELTA: Linked data lexicography for high-end language technology application Martin Kaltenböck and Ilan Kernerman

Background text processing (e.g. MS Word / Office 365), Business is becoming increasingly text mining, content analysis, localization, globalized, and enterprises as well as etc. New major players consist of a range public organisations are increasingly of world leading software and hardware acting in multiple areas, facing challenges corporations (Google, Apple, Microsoft, of cross-lingual and inter-cultural barriers. Facebook, SAP, IBM, Intel, Amazon, etc.) English has been crowned a global language, as well as newcomers offering innovative yet regional identities flourish as well, and solutions. this trend correlates with a rise in human and machine solutions to facilitate and enhance Semantics and lexicography communication across languages and To develop and provide satisfactory LT cultures. Looking at the European market, mechanisms and tools we need suitable for example, we see 24 working languages software as well as high quality data Martin Kaltenböck is in EU28, which make cross-border services and information in the form of corpora, Co-Founder, CFO and considerably complicated. This calls for dictionaries, word form lists and other powerful language technology (LT) and fine lexical resources. Lexicography is Managing Partner of Semantic intense efforts to enable and materialize a vital component in this ecosystem. It Web Company. He leads the the vision of a multilingual digital single follows mainly a qualitative approach LDL4HELTA project, and market (defined as a priority area of the that tends to be both time-consuming and leads and works in several European Commission, see http://ec.europa. cost-intensive. Interoperating lexicography national and international eu/priorities/digital-single-market/). with emerging Semantic Web methods research, industry and public As a result, we see a continuously and technologies is already underway, administration projects – growing LT market, primarily in Europe mainly as part of academic research, but mainly as regards project but also worldwide. The growth leads mainstream lexicographic applications management, requirements LT entrepreneurs to suggest solutions for still make little use of state-of-the-art engineering, and community data- and information-driven organisations linked data (LD) resources. LT, on the and communication activities. to work internationally, to efficiently other hand, follows mainly a quantitative He studied communication, store, access, integrate and disseminate approach along statistical and machine their data, and to allow for both inter- and learning technologies. Bridging the gap psychology and marketing at intra-organisation communication across between these qualitative and quantitative the University of Vienna. borders and language barriers by utilizing approaches is a huge challenge, and https://linkedin.com/in/ software tools. new solutions that combine these fields martinkaltenboeck/ Although the emerging LT industry is fairly successfully stand good chances to be useful young, it has rapidly changed the rules of the for the market. game and excluded major old-time players. In this context, the emerging Semantic It ranges as widely as machine translation Web, with new LD and semantic (e.g. Google Translate, Translation Memory technologies, offers innovative means for tools), speech technologies (e.g. Apple’s Siri information and data retrieval, knowledge digital assistant), education (e.g. e-learning), management systems, and other applications for lexical data exchange and integration. However, while existing methodologies are becoming mature, they still lack sufficient Semantic Web Company is a leading provider of graph-based metadata, refinements for data quality mechanisms, search and analytic solutions, based in Vienna. Its expert team provides provenance methods and security issues. consulting and integration services for semantic data and information There are new RandD initiatives that aim management, and supports customers mainly in North America, Europe to bridge the gap between these disciplines,

2017 and , including global 500 companies. It has recently been but only a few commercial applications. named on KMWorld’s 2017 list of the 100 Companies That Matter in Groundbreaking projects include LIDER Knowledge Management (after being listed also in 2015 and 2016). (http://lider-project.eu/) and LOD2 – https://semantic-web.com Creating Knowledge out of Interlinked Data, which included the development of PoolParty Semantic Suite is a world-class semantic technology tool that 45+ LOD software components (http:// offers sharply focused solutions for knowledge organization and content lod2.eu). Several freely available sources business. As a semantic middleware, PoolParty enriches information with (and semi-open lexicographic resources) valuable metadata and links business and content assets automatically. have also been developed, such as the http://poolparty.biz Linguistic Linked Open Data Cloud (http://

Kernerman Dictionary News, July linguistic-lod.org/llod-cloud), WordNet 3

(https://wordnet.princeton.edu/), BabelNet In order to provide the above-mentioned (http://babelnet.org/), or DBpedia (http:// solutions, an integrated multilingual dbpedia.org). Although these sources metadata and data management approach are comprehensive and useful, as well as is needed, and this is where SWC’s available in machine readable formats (often PoolParty Semantic Suite plays a providing an API) that allow relatively easy crucial role. As PoolParty follows W3C and efficient data integration, their main Semantic Web standards (http://w3.org/ drawbacks still regard the content quality standards/semanticweb/), such as SKOS and (in)completeness. (http://w3.org/2004/02/skos/), it already The need remains to combine such openly incorporates language-independent-based available sources with quality lexicographic technologies. However, as regards text resources, including monolingual, analysis and extraction, the ability to bilingual or multilingual dictionaries that process multilingual information and data offer comprehensive data such as precise is key for success – which means that such definitions, examples of usage, and other systems need to speak as many languages grammatical and semantic information, as possible. among others. The cooperation with KD in the course of LDL4HELTA will enable PoolParty The LDL4HELTA project Semantic Suite to continuously “learn to Linked Data Lexicography for High-End speak” more and more languages, and do Language Technology Application so more precisely, by making use of KD’s Ilan Kernerman is CEO of K (LDL4HELTA, https://ldl4.com/) attempts rich multi-language lexical content and its Dictionaries, leading strategic to deal with the issues described above know-how in lexicography as a base for development and cooperation. by combining lexicography with LD and improved text analysis and processing. integrating closed data sources with open The first goal of the LDL4HELTA He edits and publishes ones to develop new LT methods and project is to model and convert KD data Kernerman Dictionary News, tools. This project is part of the EUREKA into RDF format, make it enrichable by has co-edited conference papers bilateral Austria-Israel RandD framework third-party sources, by applying the Linked and other collections, and (http://eurekanetwork.org/project/id/9898), Data Design Principles proposed by Tim been involved in international endorsed and supported by the Austrian Berners Lee (https://w3.org/DesignIssues/ lexicography associations and Research Promotion Agency and the Israeli LinkedData.html), and to make use of projects, most recently Asialex Chief Scientist Office (Israel Innovation a SPARQL endpoint as an API to enable president (2015-2017) and Authority). It is led by Semantic Web complex and flexile data querying. the Globalex initiative. His Company (SWC) and K Dictionaries (KD), The second goal is to improve word sense interests include lexicography with scholarly cooperation of the Austrian disambiguation as regards entity extraction and the interoperability with Academy of Sciences and Polytechnic and semantic annotation. Several methods NLP and knowledge systems. University of Madrid. are combined to attain this purpose, The project brings together lexicographic including (i) using dictionary data (ii) http://kdictionaries.com/ resources of KD with SWC's expertise in using thesauri and knowledge models, (iii) ilank_2015.pdf semantic technologies for the development making use of corpora and freely available of new products and services, to help lexical resources, and (iv) integrating users’ the international LT market meet the first-choice mechanisms. fast-growing demands for dedicated The project started in July 2015 and language-independent, language-specific ends in September 2017. It is supported and cross-language solutions. These, in by an advisory board including Prof. turn, will enhance cross-lingual search and Christian Chiarcos (Goethe University, usage for multilingual data management Frankfurt), Mr. Orri Erling (Google, San and integration. This entails: Francisco), Prof. Asunción Gómez Pérez ● Enhancing knowledge and technology (Universidad Politécnica de Madrid), Dr. transfer between the partners in lexical Sebastian Hellmann (Leipzig University), methodologies and LD and semantic Prof. Alon Itai (Technion, Haifa), and Ms. technologies; Eveline Wandl-Vogt (Austrian Academy of ● Combining state-of-the-art Sciences).

lexicographic and LT resources with 2017 Semantic Web and LD mechanisms to bridge the gap between them and K Dictionaries creates cross-lexical resources for 50 languages and generate new cross-language lexical cooperates with industry and academia partners worldwide. Based in tools and services; Tel Aviv and incorporating cutting edge pedagogical and multilingual ● Integrating existing and new tools of lexicography methodologies, it develops manually crafted and the partners to give way to improved automatically generated linguistic data serving natural language enterprise-ready software and data processing technologies for human and machine use. solutions for a wider market; http://kdictionaries.com ● Developing new software components

for to upgrade data quality. Kernerman Dictionary News, July 4

Triplifying a dictionary: Some learnings Timea Turdean and Shrikant Joshi

1. Introduction on the monolingual part of KD’s Spanish The Linked Data Lexicography for High-End dataset mentioned above. Language Technology (LDL4HELTA) project1 was launched in cooperation 3. Triplification process between Semantic Web Company (SWC)2 The triplification of a dictionary is a process and K Dictionaries (KD)3, combining of mapping its data (which in KD’s case lexicography and computational linguistics is in propriatory XML format) to RDF with semantic and linked (open) data triples. Following the triplication process, mechanisms and technologies. One of the the resulting data was stored in a database implementation steps of the project was to to facilitate further processing. create a language graph from the dictionary In previous works, an RDF lexicographic data. The input data consists of the Spanish model was proven to work for KD’s lexicographic resource of KD, which is lexicographic resources. The present article Timea Turdean is Technical translated into multiple languages and is reports on how this model was applied on the Consultant at Semantic Web available in XML format. The data needed Global Spanish dataset (i.e. the monolingual to be triplified (that is, converted to RDF4) core and its translations in other languages) Company, Vienna. In her for several purposes, including enhancing and triplified. In the process we ensured current position she supports its enrichment with external resources. that the RDF complied with Semantic Web clients and partners integrating Section 2 of this article describes previous (SW) standards8. semantic technologies and is work carried out in this domain. Section involved in different research 3 discusses in detail the actual process 3.1. Nature of a dictionary entry projects dealing with linguistic of triplification of the dictionary XML The XML format of KD’s Global Spanish data, earth observation data and into RDF. An interesting experiment was dataset consists of a complex structure publication data. She holds a carried out by using and applying the same containing nested components. Each word MSc. from Vienna University principles for the translation of a dictionary, constitutes an entry, containing information of Technology, and her as described in Section 4. Although the such as: pronunciation; inflections; range of background is in text mining initial success has ratified the process, some application; sense indicators; compositional and sentiment analysis. work is still required to explore and enhance phrases; translations (of different it further, which is described as part of the components); alternative scripts; register; timea.turdean@semantic-web. conclusions in Section 5. geographical usage; sense qualifier; com version; ; lexical sense; examples 2. Previous work of usage; homograph information; language There are different initiatives and efforts information; specific display information; that investigate the process and usefulness identifiers; and more… of triplifying lexicographic data. Entries can have predefined values Terminesp5 is a well-known database that that can recur, but their fields can also was transformed into RDF following linked have so-called free values, which can data best practices (cf. Gracia 2015). Our vary too, including: Aspect; Tense; work builds on the findings of Klimek and Subcategorization; Subject Field; Mood; Brümmer (2015), who have investigated Grammatical Gender; Geographical Usage; the usage of the Lemon model6 on KD’s Case; and more… German lexicographic XML data, and demonstrated how it can be represented 3.2. Constructing a lexical model in RDF and noted some missing elements After studying the entry structure, it was that needed to be reconsidered. Bosque-Gil necessary to construct a model representing et al. (2016) also report about combining the entries in the SW conceptual form to

2017 linked data in lexicography, particularly go from the dictionary’s XML format to regarding usage of the Ontolex model7 its triples. The model was designed by Bosque-Gil et al. (2016), and an example 1 https://ldl4.com/ representing two Spanish words having 2 https://www.semantic-web.at/ senses that relate to each other is presented 3 http://kdictionaries.com/ in Figure 1. 4 https://www.w3.org/RDF/ 5 http://linguistic.linkeddata.es/ ontolex/wiki/Final_Model_ terminesp/ Specification 6 http://lemon-model.net/ 8 http://semanticweb.org/wiki/

Kernerman Dictionary News, July 7 https://www.w3.org/community/ Semantic_Web_standards.html 5

Usually, when modelling linked data or a left-to-right order, the process outlined in just RDF it is important to make use of Figure 2 represents: existing models and schemas to enable easier ● A DPU used to upload the XML and more efficient use and integration. A files into UnifiedViews for further well-known model is Lemon9, whose processing; core path can cover some of this dictionary’s ● A DPU which transforms XML data to needs (cf. Klimek and Brümmer, 2015), but RDF using XSLT12. The style sheet is not all of them. The Ontolex model10, which part of the configuration of the unit; is more complex and considered to be the ● The .rdf generated files are stored on evolution of Lemon, offers more capabilities the filesystem; in this regard. However, also after adapting ● Finally, the .rdf generated files are the KD data to the OntoLex model, some uploaded into a triple store, such as pieces of information were still missing Virtuoso Universal Server13. and an additional ontology was needed to be created to cover all such elements and 3.3. URIs catch the specific details that did not get Complexity increases also through the sufficiently treated (such as the free values). URIs (Uniform Resource Identifier) that We named this model extension OntolexKD. are needed for mapping the information The process used to do the mapping from in the dictionary since linked data requires KD’s XMLs to RDF consists of several every resource to have a clearly identified steps. This can be visualised as a processing and persistent identifier. The start was Shrikant Joshi holds a PhD pipeline which manipulates the XML data. to represent a single word (headword) in Linguistics from Université The tool that we used for this mapping was under a desired namespace and build on de Lausanne, with a focus on UnifiedViews11. This is an ETL (Extract, it to associate it with its part of speech, Transform and Load) tool with which you grammatical gender and number, definition the semantics of affixation, its can configure your own data processing and translation. formalisation and subsequent pipeline to generate RDF data. One of The base URIs follow the best practices computational processing, and its use cases is to triplify different data recommended in the ISA study on persistent a BE in Electronics Engineering formats and store the resulting RDF data URIs14 following the pattern: http:// and an MA in French from in a database. Our processing pipeline {domain}/{type}/{concept}/{reference}. University of Pune. He has appears in UnifiedViews as displayed in An example of such URIs for the forms been teaching courses in NLP, Figure 2. of a headword is: French and German Linguistics ● The pipeline is composed of data http://kdictionaries.com/id/lexiconES/ at the University of Pune as processing units (DPUs) which entendedor-n-m-sg-form a visiting lecturer. Currently communicate with each other iteratively. In ● http://kdictionaries.com/id/lexiconES/ he is working as Technical entendedor-n-f-sg-form 9 http://lemon-model.net/ Consultant and Researcher at 10 https://www.w3.org/community/ 12 https://www.w3.org/Style/XSL/ Semantic Web Company. ontolex/wiki/Final_Model_ 13 https://virtuoso.openlinksw.com/ shrikant.joshi@semantic-web. Specification 14 http://philarcher.org/diary/2013/ com 11 https://unifiedviews.eu/ uripersistence/ 2017

Figure 1: Language model example Kernerman Dictionary News, July 6

These two URIs represent the singular In this case, both represent the same TIAD shared task 2017 masculine and singular feminine forms of translation but have different URIs – Translation Inference the Spanish word entendedor. because they were generated from different Across Dictionaries ● http://kdictionaries.com/id/lexiconES/ dictionaries (in accordance with the entendedor-adj-form-1 translation order) that need to be mapped The first shared task on ● http://kdictionaries.com/id/lexiconES/ to each other so as to represent the same Translation Inference Across entendedor-adj-form-2 concept. Dictionaries was aimed to If the dictionary contains two different The word Bank in German can mean explore best methods and adjectival endings, as with entendedor either a bench or a bank in English. techniques for automatic which has different endings for the feminine When either of these English senses is generation of new bilingual and masculine forms (entendedora and translated back into German the result is dictionaries based on existing entendedor), and they are not explicitly the German word Bank. It is, however, resources. It relied on extracts mentioned, then we use numbers in the URI not possible to determine which sense out from 15 bilingual dictionaries to describe them. If the gender is explicitly of the two was translated unless the URI of K Dictionaries (KD) for mentioned, then the URIs would be: that contains the sense ID is included. It ● http://kdictionaries.com/id/lexiconES/ is also important to maintain the order of developing three new language entendedor-adj-form translation (source-target) but later map pairs that were validated ● http://kdictionaries.com/id/lexiconES/ both translations to the same sense and against existing KD data and by entendedora-adj-form same concept. This is difficult to establish human translators. In addition, it should be considered that automatically. TIAD 2017 was organized by the aim of triplifying the XML was for all Noam Ordan, Morris Alper these headwords with senses, forms and 5. Future work and Ilan Kererman (KD) and translations, to connect and be identified The actual overlap and automatic linking Jorge Gracia (OEG, Madrid and linked following the SW standards. of the dictionary resources remains to be Politechnic University). The One of the last steps of complexity was to tested. There are also some lexicographic results were presented in a develop a generic XSLT which can triplify elements which were not covered by the workshop co-located with all the different languages of this dictionary new OntolexKD model and need to be the Language, Data and series and store the complete data in a triple added. store. The question remains whether the There is also the necessity to verify and Knowledge conference at NUI design of such a universal XSLT is possible check for differences between KD’s XML Galway on June 18, 2017 by while taking into account the differences in dataset and the derived KD’s triplified four teams: languages or the differences in dictionaries. dataset. For this, SPARQL queries need ● Kathrin Donandt, Christian to be created that validate and verify the Chiarcos and Maxim Ionov; 4. Application and exploration resulting RDF. Goethe University, Frankfurt We tried to investigate also whether the ● Tom Knorr; Neurocollective, automated resource linking could help with References San Francisco CA the translation of one dictionary into another Bosque-Gil, J., J. Gracia, and A. ● Thomas Proisl, Philipp the language. Two bilingual dictionaries Gómez-Pérez. 2016. Linked data in Heinrich, Stefan Evert and were considered - English(en)-German(de) lexicography. Kernerman Dictionary Besim Kabashi; Erlangen and German(de)-English(en). News 24, 19-24. For the word bank the following Bosque-Gil, J., J. Gracia, E. University translations are found: Montiel-Ponsoda, and G. Aguado-de ● Uliana Sentsova; National Bank (de) – bank (en) – German to English Cea. 2016. Modelling multilingual Research University Higher bank (en) – Bank (de) – English to German lexicographic resources for the Web School of Economics, The URI of the translation from German to of Data: The K Dictionaries case. In Moscow English was designed to look like: Proceedings of GLOBALEX 2016 The papers are published ● http://kdictionaries.com/id/ Workshop at the 10th Language as part of the LDK 2017 tranSetDE-EN/Bank-n-SE00006116- Resources and Evaluation Conference Workshop Proceedings http:// sense-bank-n-Bank-n-SE00006116- (LREC 2016), Portorož, (Slovenia). ceur-ws.org. sense-TC00014378-trans Gracia, J. 2015. Multilingual dictionaries And the one for the translation from English and the Web of Data. Kernerman Noam Ordan to German would be: Dictionary News 23, 1-4. ● http://kdictionaries.com/id/ Klimek, B., and M. Brümmer. 2015.

2017 tranSetEN-DE/bank-n-SE00006110- Enhancing lexicography with semantic https://tiad2017.wordpress. sense-Bank-n-bank-n-SE00006110- language databases. Kernerman com/ sense-TC00014370-trans Dictionary News 23, 5-10.

uv-e-filesDownload uv-t-unzipper uv-t-xslt basic uv-l-filesUpload uv-l-filesToVirtuoso output->input output->fileinput filesOutput->input run->after

Kernerman Dictionary News, July Figure 2: UnifiedViews pipeline used to triplify XML 7

Towards a module for lexicography in OntoLex Julia Bosque-Gil, Jorge Gracia and Elena Montiel-Ponsoda

1. Introduction (domain of usage, region, frequent use Over the past few years, more and more tags, restrictions on number and gender efforts are being devoted towards the depending on a sense, etc.) that affect word conversion of dictionaries into Linguistic meaning and language usage and are not Linked Data (LLD), based on Lemon structural in nature. Collocations, idioms, (McCrae et al., 2012) and its more recent context indicators, semantic selection, etc. version OntoLex-Lemon1, a de facto are presented differently in dictionaries and standard to represent ontology-lexica on modeling them is not trivial. the Web. These works aim both to enrich The natural doubt that would be the so-called Linguistic Linked (Open) entertained by many experts is whether Data cloud2 with lexical information to be OntoLex is supposed to provide the means consumed by natural language processing to model all aspects of a dictionary or (NLP) tools, and to build bridges between whether this is outside its scope of ontology the lexicography and semantic web lexicalization, and therefore should be Julia Bosque Gil is a PhD communities. Recent projects such as tackled by another initiative. In this paper we student at the Ontology LIDER3, or on-going ones such as ENeL4, motivate our insights on OntoLex to enable Engineering Group (OEG) LDH4HELTA5 and LiODi6, promote the dictionary representation as LLD in all its since October 2014. She adoption of linked data technologies in the granularity, and advocate for the creation of work with lexicographic resources focusing a lexicography-specific module that would holds a B.A. in German on language technologies, e-lexicography gather elements concerning dictionary Linguistics and English from and linguistic research, respectively. structure and annotations. The module the Humboldt University Nonetheless, the conversion of a could also link to other modules that might of Berlin and an M.A. in lexicographic resources to OntoLex is not be proposed, such as an etymology-oriented Computational Linguistics always straightforward. Lemon was initially one to support etymological dictionaries. from Brandeis University, developed to enrich a given ontology with The rest of the paper is organized Waltham. Her interests include a lexical layer, and not with the idea of as follows: Section 2 goes through the lexicon-ontology and the rendering any already existent dictionary state-of-the-art of LLD and lexicography syntax-semantics interfaces, to LLD. A majority of scholars working on and some of the problems encountered relations between the lexicon this field, however, are turning to Lemon or during the representation of dictionaries and syntax, semantic annotation OntoLex in pursuit of the latter objective. as LLD. Our motivation for OntoLex The more numerous and resource-specific to be able to tackle those and the issues and the representation of the annotations in a dictionary are, the presented throughout the paper are stated (multilingual) language more complex the modeling solutions are, in that section as well. Section 3 describes resources as linked data. especially if until then the dictionary was five of a series of issues we identified in our She has been working in the targeted at human users. We are aware that work modeling and analyzing dictionary conversion of multilingual some solutions exceed the needs of lexical entries, and which we argue serve as input to RDF as information that some NLP tools require. for discussion on the need for a module part of the LIDER project However, if we are also aiming to bring for lexicography. Our initial approaches and in the modeling of linked data to lexicography, all dictionary towards such a module and a description lexicographic data as linked content must be taken into account and of how it would solve the described issues data in collaboration with K must be retrievable once converted to are outlined in Section 4, while Section 5 Dictionaries and Semantic LLD, i.e, migrating to LLD should imply offers some concluding remarks. Web Company. For her PhD no information loss. This means that structural aspects of the dictionary, as for 2. Background and motivation thesis she is investigating the instance senses and homographs order, There have been several reports in the use of linguistic linked data along with the sub-sense hierarchy some literature on the conversion of dictionaries for research in linguistics and dictionaries display, should be kept in to LLD, most of them relying on Lemon lexicography. 2017 mind when offering modeling solutions. or OntoLex. However, proprietary formats, [email protected] There is a range of dictionary annotations such as that of K Dictionaries (KD)7, often have XML tags used in their annotation 1 http://www.w3.org/2016/05/ontolex/ schemes that refer to linguistic categories or 2 http://linguistic-lod.org/llod-cloud features which are not present in available 3 http://lider-project.eu/ repositories of linguistic categories or which 4 http://elexicography.eu/ lack a compatible definition that prevents us 5 http://ldl4.com/ from reusing the ontology entity at hand. Ad 6 http://acoli.informatik.uni-frankfurt.

de/liodi/ 7 http://kdictionaries.com/ Kernerman Dictionary News, July 8

hoc vocabularies were defined to migrate and integrate multilingual linguistic data content from the German monolingual from Oxford Dictionaries and emerges as dictionary of KD’s Global Series (Klimek an ontology exclusively created to meet and Brümmer, 2015) and its Spanish dictionary representation requirements. It multilingual set (Bosque-Gil et al., 2016). accounts for a range of information found These works approached issues which in dictionaries, from inflected forms to affect, for example, the relation between semantic relations, pragmatic features and a lexical sense and the lexicalized phrases etymological data. The focus is laid on the and idioms in which it occurs, regional representation of grammatical information restrictions, lexical and semantic selection with cross-linguistic validity and the respect (in general) of lexical entries, groups of towards grammar traditions. However, some homographs, tone and register indications, modeling decisions and class definitions inflection groups, context of use, frequency differ from those suggested in OntoLex modifiers to register, etc. Multilingual (e.g. Form in OntoLex vs. a Form in OGL) dictionaries pose further problems due to and the emphasis is not set on the reuse of the modeling of examples and translations available ontology entities. of examples, as well as alternative forms In this position paper we do not focus of those translations (e.g. an example in on a particular kind of lexical information English translated to Japanese in kanji present in dictionaries (e.g., etymology or and hiragana, and that translation in turn morphology) but we aim to highlight some Jorge Gracia is a postdoctoral with a transliteration in rōmaji). The set of difficulties in the modeling of dictionary researcher at Ontology thirteen dictionaries (dialectal, bilingual, entries without information loss. Thus, monolingual, historical, etc.) converted we will not target the representation of Engineering Group, as part of the ENel Action (Declerck and resource-specific features of particular Universidad Politécnica de Wandl-Vogt, 2015) required the definition dictionaries. Taken into account the Madrid. He got his PhD in of new properties to encode different types problems reported in the literature, and after Computer Science at University of temporal information and etymological analyzing dictionary entries in e-dictionaries of Zaragoza in 2009, with a aspects. of English (Oxford Living Dictionaries thesis about heterogeneity Structures typically found in dictionaries, Online; Merriam Webster Dictionary issues on the Semantic Web. such as the sense and sub-sense hierarchy Online; American Heritage Dictionary His current research interests in an entry, are not trivial to model either. Online; COBUILD Advanced English include multilingualism polyLemon (Khan et al., 2016), developed Dictionary and Collins English Dictionary), and linked data, linguistic as part of the conversion of the Liddell-Scott German (Duden Online Wörterbuch; linked data, and cross-lingual Greek-English Lexicon to Lemon, was PONS Deutsch als Fremdsprache Online suggested in order to capture the sense and Wörterbuch), and Spanish (Diccionario de matching and information sub-sense structure in dictionaries using la Lengua Española; Clave: diccionario de access on the Semantic Web. properties such as and uso del español actual), we report on some His current research focus is senseChild senseSibling to relate senses and their of the issues we gathered which may pose on exploring how to move parent senses in the dictionary entry. problems for the modeling with OntoLex language resources (lexica, The accurate representation of and which we believe call for the definition dictionaries, corpora, etc) etymological information as LLD is of a new module to account for them. Future from their data silos into the key in the conversion of historical and steps include the analysis of dictionaries in multilingual Web of Data and etymological dictionaries. An extension to languages that are underrepresented in the make them interoperable, Lemon, Lemonet, to represent etymological LLOD cloud (e.g. Japanese) to identify in order to support a future information of lexical entries was proposed further representation challenges. generation of linked data-aware (Chiarcos and Sukhareva, 2014) and, more We ground our proposal for a lexicography NLP tools. recently, a revisited version builds upon the module on the following four points: (1) properties suggested for the modeling of the the use of OntoLex by the majority of the http://jogracia.url.ph/web/ etymological WordNet8 to undertake the community to convert linguistic resources conversion of the Tower of Babel (Starling) to LLD instead of to lexicalize ontologies, in the LiODi project (Abromeit and Fäth, (2) the nature of Lemon being descriptive 2016). Some recent work on the conversion but not prescriptive and the respect of the classical dictionary Al-Qamus towards different lexicographic views, (3)

2017 to Lemon and LMF has been undertaken the coming together of the lexicography (Khalfi et al., 2016), but no pointers or and the Semantic Web communities and traceback to the original structure are given potential benefits that LLD may bring in the work. about to lexicography, assuming it involves Alternatives to the use of OntoLex are no information loss, and (4) the reuse of available as well. The Oxford Global already available mechanisms in OntoLex. Languages Ontology (OGL) (Parvizi et al. 2016) has been developed to model 3. Issues In the following we report on some of 8 http://www1.icsi.berkeley. the issues we have come across after our

Kernerman Dictionary News, July edu/~demelo/etymwn/ experiences in converting dictionaries to 9

LLD and our analysis of dictionary entries Spanish cometa (m.) ‘comet’, (f.) ‘kite’. In in English, German and Spanish. Here these cases, the dictionary entry can be a we restrain ourselves to issues that reveal single one (e.g. refreshment in English or current limitations of the OntoLex model, cometa in Spanish) but one of its senses i.e, cases in which applying the Lemon core indicates a preferred form. In the case of implies a different view on the data than the refreshment, the plural form is used if the one provided in the original resource and, intended meaning is snacks and beverages; therefore, an information loss (type 1, hence with cometa, the feminine form is applicable T1), and missing entities, e.g. a property or when referring to a kite, the masculine a class, to account for information mostly when denoting a comet. Further examples found in dictionaries (type 2, hence T2). We are good(s), manner(s); and Spanish frente have already raised some of these issues (m.) ‘front’, (f.) ‘forehead’. as input for discussion to the OntoLex community.9 Issue 3 (T2). Usage examples and their translations Issue 1 (T1). Headwords that can take Usage examples of a word or multiword different parts-of-speech expression are often provided in the Both Lemon and OntoLex specify a lexical definition of each of a dictionary entry’s entry as a word, a multiword expression senses. Lexinfo11 includes a property or an affix with a single part-of-speech, lexinfo:senseExample to describe morphological pattern, etymology and an example of a sense (as a subproperty of Elena Montiel-Ponsoda 10 set of senses. However, a headword in Lemon:definition) and which is linked is Associate Professor at a dictionary may occur with different to the example data category in ISOCat.12 the Applied Linguistics parts-of speech depending on context and Nonetheless, due to it being a datatype its senses are nonetheless defined in the property, it does not enable including further Department at Universidad same dictionary entry, all of them derived information on the example or to establish Politécnica de Madrid (UPM) from the same etymology (no homonymy translation relations among examples, since 2012, and member of the involved). Applying the OntoLex model which is common practice in bilingual Ontology Engineering Group would imply the generation of several and multilingual dictionaries. The Lemon since 2006. She got her PhD ontolex:LexicalEntr[ies], one model included a Lemon:UsageExample on Applied Linguistics from per each part-of-speech the headword class and a property Lemon:example to UPM in 2011. Her research can take. Splitting the dictionary entry link to it, but OntoLex does not cover this interests are at the intersection into several lexical entries would cause aspect yet. Examples: Spanish preocuparse between translation (and loss of information (shared etymology, ‘worry’; no hay por qué preocuparse terminology) and knowledge pronunciation, senses implicitly related) ‘there is nothing to worry about’ (Collins representation, including and does not keep track of the dictionary English-Spanish Dictionary). among others: ontology representation. Examples: poison, bread, water (noun and verb), and Spanish lento Issue 4 (T2). Sense and homograph order localization and lexicalization, ‘slow, slowly’ (adjective and adverb) and The order of senses may be based lexico-syntactic patterns alto ‘tall, loudly, height’ (adjective, adverb on frequency of use, date of origin, for ontology development, and noun). concreteness (from the most concrete to functional models for deep most abstract sense, etc.). Homographs semantics analysis, sentiment Issue 2 (T1). Lexical sense requiring a are also given according to some ordering analysis, and linguistic linked particular form criteria that may vary from dictionary to data for content analytics. Some senses of a dictionary headword dictionary. Their order should be searchable She is currently working require a particular form, e.g. in English a and retrievable as to recover the information on the representation of plural form or in Spanish a masculine or provided in the original resource. Examples: lexical resources according feminine one. Since the meaning in these Boa: noun. (1) any of a family (Boidae) of to the linked data paradigm, cases is associated with the form and it large snakes that kill by constriction and may differ significantly from other senses that includes the boa constrictor, anaconda, specifically, on how translation that do share gender or number features, and python (2) a long fluffy scarf (Merriam relations can help in the splitting the dictionary entry into different Webster Dictionary)13; bat1: n. 1. A stout construction of the multilingual lexical entries would be an option (see wooden stick; a cudgel [. . . ]; bat2: n. Any Web of Data.

Issue 1). An alternative is the linking of various nocturnal flying mammals of the [email protected] 2017 of that sense to elements in a catalog of order Chiroptera [. . . ] (American Heritage grammatical categories which encode those Dictionary). grammatical restrictions, but we would need an exhaustive list of them for this option to 11 http://lexinfo.net/ontology/2.0/lexinfo. be applicable. Examples: refreshment(s), owl 12 http://isocat.org/ 9 http://w3.org/community/ontolex/ 13 Example of logical order of senses wiki/Lexicography inspired by Diccionario de la Lengua 10 http://w3.org/2016/05/ Española, Guía de Consulta, http://dle.

ontolex/#lexical-entries rae.es/ Kernerman Dictionary News, July 10

Issue 5 (T2). Semantic selection 4. A module for lexicography Some dictionaries indicate the semantic The previous section dealt with some of features of the lexical items that an entry (in the issues we encountered in our work one of its senses) selects or even the exact with dictionaries and the potential ones that lexical items with which it collocates. This may rise with other lexicographic works is usually indicated either with a specific that have not been migrated to LLD yet. In tag (e.g. KD’s Range Of Application), or the following we draft a potential solution in-between parentheses at the beginning which can serve as a basis for a new module OntoLex 2017 of a definition. Examples are, for instance, in OntoLex specifically developed for the 1st workshop on the the dictionary entry for the German verb representation of dictionaries after thorough OntoLex model dämmen, which in its sense ‘to insulate, revision and improvement according to the absorb, mute’ selects arguments that denote community’s feedback. The W3C OntoLex Community warmth or sound (German Wärme, Schall, In order to keep track of the dictionary Group was launched in 2011, etc.) (KD), the adjective cozy, meaning representation and prevent any loss of with Paul Buitelaar (INSIGHT. beneficial to all those involved and possibly information mentioned in Issue 1, related National University of somewhat corrupt if predicated from a to the splitting of dictionary entries in Ireland, Galway) and Philipp transaction or an arrangement (Google several lexical entries, we propose a new Cimiano (CITEC, Bielefeld Dictionary); or the collocational measure class DictionaryEntry. This new University, Germany) as words of luck: stroke, piece of (Oxford class would both enable to group together chairs, with the goal to define Collocations Dictionary). The OntoLex lexical entries as well as to associate any a model for allowing to Syntax-Semantics Module (synsem) class information shared by all of them. In represent lexical knowledge synsem:OntoMap allows to map a our view, we distinguish lexical entries syntactic frame to an ontology entity, so that and (as containers of lexical in connection to ontologies the frame and its arguments are linked to entries), from the original dictionary entry [1]. The rationale behind the the ontology elements that they lexicalize. (the new class DictionaryEntry) model is that semantics are Even though dictionaries commonly include and the original dictionary resource captured by ontologies, and the information on subcategorization (transitive/ (Dictionary), which would serve in turn role of the lexicon-interface intransitive/reflexive etc. annotations for to record the provenance of each dictionary is to link lexical entries to verbs, for instance), details on the syntactic entry. Mirroring the lime:Lexicon- ontological entities expressing frame are not always provided beyond those ontolex:LexicalEntry relation their denotational meaning, annotations. Since in dictionary conversion we suggest a Dictionary- following a principle called we often lack a given ontology and detailed DictionaryEntry one. Any lexical Semantics by Reference. syntactic information is not provided, the entry created during the conversion to LLD Based on five years of intensive mapping between syntactic arguments and but not originally provided in the resource ontology entities seems difficult to establish would then belong to a , discussions and work, the new lime:Lexicon automatically via : but not to the instance of OntoLex-Lemon model was synsem:OntoMap Dictionary how do we automatically represent that the representing that resource. launched in May 2016 [2] A adjective cozy has a meaning only applied to lime:Lexicon in English, for example, and is becoming the primary transaction or agreement or that the measure could aggregate lexical entries created on method for representing linked words that collocate with luck are stroke or the fly by the LLD expert or original ones lexical resources on the Web piece if the morphosyntactic information coming from as many English dictionaries of Data, not only for capturing provided in the dictionary is just that cozy is as desired. These dictionaries can in turn the lexicon-ontology interface an adjective and luck a noun? Furthermore, differ in their modeling and their views on but for the representation of synsem:condition (in its turn the data, their criteria of sense ordering or lexicographic resources as well. subsuming synsem:propertyRange their structure. The model facilitates bridging and synsem:propertyDomain) Regarding Issue 2, the the gap between the NLP and enables us to state constraints on the DictionaryEntry class would allow arguments of a predicate in a given to divide a single lexical entry into several data science communities by ontology.14 The possibility of reusing it to ones if desired, each with a different making available and linking state the constraints on syntactic arguments preferred form, while maintaining the large amounts of quality lexical even in cases in which we lack a given original dictionary representation. If the information to the knowledge ontology and therefore are not mapping to dictionary entry is not split, the option of represented on the semantic given ontology properties has to be further linking a sense to a grammatical restriction

2017 web, for example in graphs analyzed. In addition, the potential links on gender or number from an external such as DBpedia, applications between the modeled entries (e.g. piece catalog would solve the issue, although the and luck), i.e. the links at the lexical level, implications of this solution (its benefits and are also to be considered, for instance, by drawbacks) will need further analysis. taking into account recent proposals on the In order to represent usage examples representation of lexical functions as LLD and their translations (Issue 3) we propose (Fonseca et al. 2016). to go back to lemon:UsageExample and link it to a LexicalSense. A new class ExampleCluster would link to 14 http://w3.org/2016/05/ UsageExamples that are translations

Kernerman Dictionary News, July ontolex/#conditions from each other. The use of the vartrans 11 module to model translations among senses each of their senses. Thus, the focus was would imply the creation of lexical senses set on representing the data just as it was for each example, and therefore treating the in its original format while being compliant example as a lexical entry, which we deem with the OntoLex formal specification and is beyond the definition of lexical entries. reusing its elements as much as possible. Issue 4 was concerned with the order of We argue that the lexicography module senses in a dictionary entry and the order should aim to set the basis to exploit at of homographs in the macrostructure of the dictionary’s macro-structure level the the dictionary. There are different possible potential benefits of establishing semantic approaches to resolve this: reusing relations among lexical senses based on already available RDF mechanisms, lexical selection or among syntactic frames as part of the Global WordNet reifying the sense order in a new class and arguments and the ontology entities that Association Collaborative SenseOrder, or defining a new property they denote. To this aim, overcoming the Interlingual Index, or existing senseOrder attached to the lexical lack of detailed syntactic information in the resources such as BabelNet, sense. The first option involves the reuse of dictionary as well as the lack of a given DBnary and commercial [s] to declare with e.g. ontology to lexicalize becomes essential. rdfs:Container dictionaries. rdf:_1, rdf:_2 that a particular sense is the first or the second one. However, 5. Conclusion OntoLex 2017 was co-located cases in which a set of senses allows OntoLex is increasingly being used to with the Language, Data and for various orderings, depending on the convert linguistic resources to LLD outside Knowledge conference [3, ordering criterion, or in which some senses the scope of ontology lexicalization. In and see p.13] and presented come from different dictionaries (each with this position statement we have drawn the first opportunity for its order), should also be accounted for. attention to a series of issues raised practitioners to meet to discuss The second option suggests that the sense in the literature on LLD related to the the model, its applications and order is reified in a class SenseOrder conversion of dictionaries to LD and to future development [4]. The linked to the lexical sense. This class five of the ones we came across in the same participants have shown interest would enable us to record the position of line of work and after a later analysis of in continuing the development that sense, its provenance (presumably several additional dictionaries. We argue of OntoLex-Lemon, particularly an instance of the class Dictionary), that the OntoLex model should enable and, if desired, the ordering criterion. If the preservation of the content and the with regard to lexicographic repeated senses were identified (e.g. senses structure of the original resource, even resources. In consequence, the that share a definition in both dictionaries), if the LLD expert opts for a different group is starting to work on a SenseOrder would allow us to have one representation that is better suited to the new best practice document single lexical sense with two different data exploitation by external applications that will provide modeling positions according to the two different or is more in line with his or her view on examples and guidelines orderings and dictionaries, in a similar the lexicon-ontology interface. We have for how to use OntoLex fashion as two containers with two different outlined some of our insights on how specifically to represent sequences of senses. Alternatively, if we to address these issues in a new module lexicographic resources such assume that a lexical sense always comes for lexicography. It would be compatible as dictionaries. Then it will be from just one dictionary source, a property with the mechanisms suggested in the decided whether to stipulate the senseOrder would suffice. state-of-the-art on dictionaries represented Issue 5, dealing with semantic selection, as LLD, as of the moment of writing, and proposed vocabulary elements has been brought up for further discussion also with other potential modules for the or modeling solutions into the in this paper to see whether it could be encoding of specific lexical aspects (e.g. status of a new module or to covered by synsem module mechanisms etymology). The final module is intended keep the best practices proposal or whether it would require new entities to be dictionary-agnostic in the sense that as an informal document. in the context of the lexicography module. it should be applicable (and combined with As part of the conversion of the KD’s other modules if necessary) to different Philipp Cimiano Global Spanish Multilingual Dictionary kinds of dictionaries (e.g., general, Universität Bielefeld (Bosque-Gil et al. 2016), the semantic collocations, learner’s, etymological, selection information provided by KD’s historical, etc.). This would bring linked [1] https://w3.org/community/ tag RangeOfApplication was captured by data (LD) closer to lexicography not ontolex/ the use of . In that only with the aim of leveraging already synsem:condition [2[ https://w3.org/2016/05/ approach, synsem:condition would available dictionaries in LD for NLP tasks, 2017 link a lexical sense to a blank node15 with but also for introducing LD in the work ontolex/ [3] http://ldk2017.org/ an rdf:value recording the strings carried out in that discipline. given as arguments in the original data. This [4] http://ontolex2017. modeling allowed us to deal with the lack of Acknowledgments linguistic-lod.org/ a given ontology and detailed information This work is supported by the Spanish on the syntactic frames of lexical entries for Ministry of Economy and Competitiveness through the project 4V (TIN2013-46238- 15 synsem:condition has C4-2-R), the Excellence Network ReTeLe rdfs:Resource defined as its (TIN2015-68955-REDT), and the Juan

range. de la Cierva program and by the Spanish Kernerman Dictionary News, July 12

Ministry of Education, Culture and Sports Cross-linking Austrian dialectal through the Formación del Profesorado Dictionaries through formalized Universitario (FPU) program. Meanings. In Proceedings of Euralex 2014. This contribution was presented at the Declerck, T., and Wandl-Vogt, E. 2015. 1st Workshop on the OntoLex Model, Towards a Pan European Lexicography co-located with the Language, Data and by Means of Linked Open. Data. In Knowledge Conference (LDK 2017) in Proceedings of eLex 2015, 342-355. Galway, Ireland, on June 18. Declerck, T., and Wandl-Vogt, E. 2014. How to semantically relate Dictionaries cited dialectal Dictionaries in the Linked American Heritage Dictionary Online. Data Framework. In 8th Workshop on KD API Retrieved 25/05/2017 from http:// Language Technology for Cultural Heritage, Social Sciences, and K Dictionaries is completing ahdictionary.com/ Clave: diccionario de uso del español Humanities LaTeCH 2014, 9-12. the development of an online actual (online). Retrieved 25/05/2017 Del Gratta, R., Frontini, F., Khan, F., and API which will provide from http://clave.smdiccionarios.com/ Monachini, M. 2015. Converting the programmatic access to its rich app.php Parole Simple Clips Lexicon into RDF cross-lingual lexicographic COBUILD Advanced English Dictionary. with lemon. Semantic Web 64, 387-392. resources for 50 languages. Retrieved 25/05/2017 from http:// El Maarouf, I., Bradbury, J., and Hanks, In addition to the collinsdictionary.com/ P. 2014. PDEV-lemon: a Linked well-formatted XML Collins English Dictionary. Retrieved Data implementation of the Pattern Dictionary of English Verbs based on data, the API outputs 25/05/2017 from http://collinsdictionary. com/ the Lemon model. In Proceedings of developer-friendly JSON-LD, Duden Online Wörterbuch. Retrieved Language Resources and Evaluation which complies with the 25/05/2017 from http://duden.de/ LREC. familiar RESTful API standard. woerterbuch Fonseca, A., Sadat, F., and Lareau, F. The JSON-LD encodes Merriam Webster Dictionary Online. 2016. Lexfom: a lexical functions RDF linked data, making Retrieved 25/05/2017 from http:// ontology model. In COLING 2016, 145. it highly compatible with merriam-webster.com/ Gracia, J., Villegas, M., Gómez-Pérez, complementary open linked Oxford Living Dictionaries Online. A., and Bel, N. 2016. The Apertium linguistic data sets. Users will Retrieved 25/05/2017 from https:// Bilingual Dictionaries on the Web of be authenticated and given en.oxforddictionaries.com/ Data. Semantic Web Journal. access according to their PONS Deutsch als Fremdsprache Online Khalfi, M., Nahli, O., and Zarghili, A. account type. Wörterbuch. Retrieved 25/05/2017 from 2016. Classical dictionary Al-Qamus http://de.pons.com/ in lemon. In Information Science and There is also an editing Real Academia Española, Diccionario de Technology CiSt., 2016 4th IEEE pipeline in development in la Lengua Española (DLE). Retrieved International Colloquium, 325–330. which editors, translators and 25/05/2017 from http://dle.rae.es Khan, F., Díaz-Vera, J. E., and Monachini, crowdsourced suggestions will M. 2016. Representing Polysemy and use API calls to consolidate References Diachronic Lexico-semantic Data on suggested changes to K Abromeit, F., and Fäth, C. 2016. Linking the Semantic Web. In Proceedings of Dictionaries’ data and direct the Tower of Babel: Modelling a Massive the Second International Workshop on them through quality assurance. Set of Etymological Dictionaries as Semantic Web for Scientific Heritage RDF. In LDL 2016 5th Workshop on co-located with 13th Extended The API will be launched in Linked Data in Linguistics, 11. Semantic Web Conference ESWC 2016., September 2017. Registration is Bosque-Gil, J., Gracia, J., Heraklion, Greece, 37-46. open at [email protected]. Montiel-Ponsoda, E., and Klimek, B., and Brümmer, M. 2015. Aguado-de-Cea, G. 2016. Modelling Enhancing lexicography with semantic Morris Alper multilingual lexicographic resources language databases. Kernerman for the Web of Data: The K Dictionaries Dictionary News 23, 5-10. case. In GLOBALEX 2016 Workshop at McCrae, J., Fellbaum, C., and Cimiano, P. LREC 2016, 65. 2014. Publishing and Linking WordNet 2017 Chiarcos, C., and Sukhareva, M. 2014. using lemon and RDF. In 3rd Workshop Linking Etymological Databases. A Case on Linked Data in Linguistics. Study in Germanic. In 3rd Workshop on Parvizi, A., Kohl, M., González, M., and Linked Data in Linguistics: Multilingual Saurí, R. 2016. Towards a Linguistic Knowledge Resources and Natural Ontology with an Emphasis on Language Processing, 41. Reasoning and Knowledge Reuse. In Declerck, T., and Mörth, K. 2016. Towards Proceedings of Language Resources a Sense-based Access to Related Online and Evaluation LREC. Lexical Resources. In Proceedings of Villegas, M., and Bel, N. 2013. PAROLE/ Euralex 2016, 342-355. SIMPLE “Lemon” ontology and

Kernerman Dictionary News, July Declerck, T., and Wandl-Vogt, E. 2014. lexicons. Semantic Web Journal II. 13

1st Conference on Language, Data and GWC9 Knowledge – Galway, 2017 Singapore, 2018 The Ninth Global WordNet Conference (GWC 2018), will be held from 8 to 12 On June 19-20, the first conference on data as well as knowledge graphs as a key January, 2018, at Nanyang Language, Data and Knowledge (LDK focus for providing knowledge about the Technological University, 2017) took place in Galway, Ireland, which world in terms of ontologies, terminologies Singapore. The conference was attended by over 100 participants from and linked data. Direct applications in focuses on wordnets, but is 27 countries. This new biennial conference NLP are of particular importance in broad in scope, welcoming series brings together researchers from certain semantic technologies, question more general work on lexical across disciplines concerned with the answering, multilinguality and big data. semantics and lexicography, acquisition, curation and use of language Finally, applications of these technologies as well as word sense data in the context of data science and to domains as varied as digital humanities, disambiguation. We especially knowledge-based applications. Language enterprise data analytics and text mining of datasets, such as corpora, typological biomedical literature are of importance. invite papers addressing the resources and, of course, dictionaries, The conference featured invited talks following topics: are of increasing importance to machine from both industry and academia. Zoltan ● Linguistics and lexical learning-based approaches in Natural Szlavik from IBM Benelux gave a talk semantics Language Processing (NLP), linked data about cognitive computing and the Watson ● Architecture of lexical and Semantic Web research and applications systems. Kathleen McKeown from the databases that depend on linguistic and semantic Data Science Institute of the Columbia annotation with lexical, terminological and University talked about the intersection of ● Tools and methods for ontological resources, manual alignment data science and NLP. Antal van den Bosch wordnet development across languages or other human-assigned of Radboud University in the ● Wordnets and applications labels. The acquisition, provenance, and director of the Meertens Institute ●  representation, maintenance, usability, related data science technologies with Standardization, distribution quality as well as legal, organizational new breakthroughs in digital humanities. and availability of wordnets and infrastructure aspects of language Finally, Isaac Graham from NUI Galway and wordnet tools data are therefore rapidly becoming major talked about relevant features of the Irish Submissions are anonymous, areas of research that were at the focus and Welsh languages. and can be for long papers, of the conference. A further focus was In addition to the main conference there short papers, project reports or the combined use and exploitation of were a number of workshops and tutorials. demonstrations. The deadline language data and knowledge graphs in data The 1st OntoLex Workshop discussed the for submissions is September science-based approaches to use cases in development of the OntoLex model, a recent 6, 2017, and acceptance will industry, including biomedical applications, vocabulary from the World Wide Web as well as use cases in humanities and social Consortium, for representing dictionaries be announced to the authors by sciences. relative to ontologies as linked data on the end of September. Final papers The LDK conference has been initiated Web. The Translation Inference Across are due on October 11. by a consortium of researchers from Dictionaries (TIAD) shared task explored In conjunction with the the Insight Centre for Data Analytics the automatic generation of bilingual conference we intend to (National University of Ireland Galway), dictionaries from existing resources and hold two workshops: one on InfAI (Leipzig University, Germany) and was co-organized by K Dictionaries and distributional semantics and Goethe University Frankfurt, Germany, the Polytechnic University of Madrid. one on technology enhanced and a Scientific Committee of leading A further workshop was organized on researchers in NLP, linked data and the Challenges for Wordnets, concerning the learning. Semantic Web, language resources and construction of dictionaries in the wordnet digital humanities. LDK 2017 was endorsed format, organized by the Global WordNet Francis Bond by several international organisations, Association. Importantly, also a tutorial Nanyang Technological namely DBpedia, the ACL Special Interest on text analytics for Digital Humanities University Group for Annotation (SIGANN), Global and Social Sciences with CLARIN

WordNet Association, CLARIN, Big Data was organized jointly by CLARIN and http://compling.hss.ntu.edu.sg/ 2017 Value Association (BDVA) and DARIAH DARIAH Ireland. Finally, a meeting of the events/2018-gwc/ Ireland. DBpedia Association took place after the The conference stands at the intersection main conference. of data science and NLP and brought together The next edition of the LDK conference is researchers from across computer science planned to take place in Leipzig, Germany as well as linguistics and the humanities. in 2019. More details about the conference The conference series is concerned with can be found at http://ldk2017.org. the creation of data resources as well as the metadata, development, evaluation, John P. McCrae, Paul Buitelaar, Christian quality and legal aspects of publishing such Chiarcos and Sebastian Hellmann Kernerman Dictionary News, July 14

The advent of post-editing lexicography Miloš Jakubíček

The lexicographic landscape has been tools for (semi-)automation of specific subject to two major disruptions over the lexicographic tasks have been developed past twenty years. as well. In a review carried out in 2011, The first is related to the uptake of Rundell and Kilgarriff argue3 that while information technology and availability word sense identification and definition of text corpora. Lexicographers were on writing remain to be tackled, many other the forefront of the shift to empiricism in tasks alongside the lexicographic workflow linguistics1 and it was for good: a field that have been already solved with an accuracy never seriously acknowledged any theoretic that delivers time- (and therefore money-) framework was starting to benefit more than saving solutions. This is deemed to be the any other linguistic discipline – practical case for devising dictionary headword lists, needs for describing language as used were finding collocations and other multiword very high. units, or extracting dictionary examples Miloš Jakubíček is an NLP The second change, related to the first from corpora. researcher, software engineer one, was without doubt the breakdown of What is the next step? At the moment traditional publishing business, manifested lexicographers query corpora (by means of and CEO of Lexical Computing in the end of paper dictionaries (as well as many tools) for finding linguistic evidence (LC), a research company by the fall of many renowned dictionary in order to draft a dictionary entry which developing the Sketch Engine publishers). From the perspective of users, they continue working on and which is (SkE) corpus platform. dictionaries are tools to be used while doing subject to a number of reviews in the His research interests are something else, to paraphrase Hilary Nesi.2 lexicographic workflow. devoted mainly to effective The environment has drastically changed The next step is to spare the processing of very large text and so do need to change the tools. lexicographers from such initial corpus corpora for lexicographic and The impacts of both of these changes query and entry drafting. Instead of starting linguistic tasks and syntactic are yet to be discovered: for the latter one, with an “empty” dictionary, they will be parsing of morphologically the status quo can be well described by able to begin with a dictionary database rich languages. Since 2008 quoting another heavyweight in the field, pre-populated with entries according to the long-time editor-in-chief of Macmillan a big underlying reference corpus. These he has been involved in the dictionaries, Michael Rundell, whom I entries will contain suggested word sense development of SkE, in 2011 often heard saying: “After working in this clustering, with definitions (or explanations he became director of the field for 30 years, I thought I had a pretty in alternative forms such as image media), Czech branch of LC leading good idea about how to create and publish labels and examples extracted from the local development team of a dictionary. But things have changed so corpora. These entries will then be edited in SkE, eventually becoming CEO dramatically in the last five years, that I an environment that includes direct links to of LC in 2014 after its founder have only a limited idea of what the future underlying corpus evidence so as to allow Adam Kilgarriff. He is a fellow of lexicography will be.” manual inspection of the source texts, as of the NLP Centre at Masaryk The impact is a bit easier to be foreseen well as mechanisms for easy and simple University where his interests as regards technological innovations. corrections of the entry (e.g. lumping lie mainly in morphosyntactic Contemporary lexicography makes heavy and splitting of word senses, replacing analysis of Czech and its use of corpora and increasingly also of dictionary examples, amending definitions many natural language processing tools and labels). Having all the evidence at hand, practical applications. that automate the analysis of morphology the next step is to leave the “easy” bits milos.jakubicek@sketchengine. as well as syntax and semantics. Many to the computer and have human editors co.uk spend their time on the more intellectually 1 As can be seen from early corpus demanding parts of the job. This opens development projects like COBUILD the way to Post-Editing Lexicography, or BNC, which were both driven and in an analogy to the translation process.

2017 devised (also) for lexicographers, Translators used to use many independent who were themselves employed in tools (dictionaries, in the first place!) up empirical linguistic research (cf. to the moment when machine translation Church, Ward and Hanks, 1990. Word association norms, mutual 3 Rundell and Kilgarriff, 2011. information, and lexicography. Automating the creation of Computational Lnguistics 16.1, dictionaries: Where will it all end? 22-29). In Meunier et al. (eds.), A Taste for 2 See e.g. The Oxford Handbook of Corpora. In honour of Sylviane Lexicography, 2016. Durkin, P. (ed.). Granger. Amsterdam: Benjamins,

Kernerman Dictionary News, July Oxford: OUP, 584. 257-281. 15 and translation memories became mature need to be supplemented by automation as enough to be exploited for professional far as possible. translation, and henceforth translators At the eLex 2017 conference we are going became post-translation editors. to present a proof of concept in the form of An important lesson from the translation interconnecting Sketch Engine, a leading business concerning potential danger for the corpus query system, with Lexonomy, a future of lexicography is that the transition new lightweight dictionary writing system.5 to post-editing translation was by far not We will show how a dictionary draft can be easy, partially because it may have actually directly obtained from a reference corpus (as begun too soon (pushed by translation and a One-Click Dictionary) in Sketch Engine localization agencies pressing to cut costs), and how it can be efficiently post-edited in having machine translation do yet another Lexonomy. “over-promised and under-delivered” U-turn. The future of lexicography presents Eventually, this adoption has further big challenges. It would be naïve not to progressed as the technology became more realize that many of them pose real issues, mature, and, mainly, as the translation problems and obstacles for all players in environment for professionals has become the field. However, the more so we need more suited for the task of post-editing, to look for those of them that present real which is very different from translating opportunities – and, I believe, post-editing from scratch. lexicography is one of them. However, the episode had a very undesired consequence that we want to 5 Jakubíček M., Kovář V., Měchura M. avoid in lexicography. Translators were and Rychlý P. One-Click Dictionary. abandoning machine translation, for both In Electronic lexicography in the 21st technological reasons as well as for fear century, Proceedings of eLex 2017 of becoming jobless. These fears remain, (forthcoming). even though the former is being improved and the latter just did not prove to be the case: the translation industry is growing and Sketch Engine (SkE) started in 2004 as an academic and lexicographic some reports describe it as one of the fastest product for corpus query and management and has since attracted a wide growing businesses today.4 Moreover, as the audience including translators, writers, marketers, brand naming and SEO whole process gets streamlined, there are professionals. To meet the challenge of guiding this variety of users to the good chances that lower per-word income functionality they need in a streamlined way, an all-new user interface is of translators will eventually turn into a under development to not only bring in the latest Web technology but also higher per-hour rate for them. change the way users interact with SkE. With a soft launch due in autumn The lessons for lexicography are 2017, users will enjoy a new friendly design which adapts to small touch straightforward: the transition to post-editing screens of tablets and mobile phones. Input forms and selection screens will must be backed by solid technology (which offer basic and advanced layouts, the former targeted at casual users without we believe we have), revisited workflows a profound knowledge of corpora or NLP and the latter serving academic (which we need to work on), and with and professional users. In this process, various controls have shifted to advocacy explaining that it is not meant more intuitive spots, enabling the user to, for example, adjust the view on to steal lexicographic jobs. The shift to the result screen rather than make this decision beforehand as it is now, post-editing lexicography might well be a while preserving all the options and features that are currently available. fertilizer for the falling industry, showing Hand in hand with developing the new look and feel, SkE will become faster (and hence, more affordable) and more useful to anyone in need of or dictionaries. In addition more effective workflows. to serving as an indispensable tool for gathering data, the new link with A specific account comes with addressing Lexonomy system will enable data conversion into a lexicographic product less-resourced languages – or those that as part of an online dictionary writing tool, which doubles as a hosting are basically not resourced at all at the service that can produce a dictionary and have it published online instantly. moment. There are plenty of them, often Lexonomy also features an easy-to-use XML editor suitable for users with geographically located in areas with no prior knowledge to create lexicographic products complying with current growing numbers of speakers. Many of standards. Embedding Lexonomy in SkE will become vital for starting these speakers live in poverty, which brand new lexicographic projects. Users will access a corpus to identify the nevertheless does include the possession most frequent words and have the list pushed to Lexonomy along with part 2017 of a smartphone. Language resources will of speech tags, usage flags, example sentences, collocations, synonyms, be one of the first data needed when these definitions or translation, thus generating a dictionary draft for post-editing. societies will approach information levels Likewise, a subject-specific can be developed analogically from a of the developed world. There will be a terminology list extracted from a domain corpus. This push & pull model strong business need for them but no time will dramatically change the way dictionaries are built, besides its beneficial for twenty-years-running lexicographic time and money saving implications. projects, another reason why human efforts Ondřej Matuška 4 See, e.g., https://gala-global.org/ Lexical Computing

industry/industry-facts-and-data. Kernerman Dictionary News, July 16

Phonetic transcription of dotted Hebrew Alon Itai

Abstract All through the Middle Ages, scholars The pronunciation of Israeli Hebrew continued to write in unvocalized Hebrew mostly follows the pronunciation rules using the matres lectionis more extensively, of dotted Hebrew script, though there are and this is the standard script of Israeli several systematic deviations. As part of Hebrew. the development process of a new Hebrew With the revival of Hebrew as a spoken lexicographic resource by K Dictionaries, language, at the end of the 19th century, we have constructed and implemented an the Sephardic pronunciation was adopted. algorithm to deduce the pronunciation This pronunciation fused several diacritics from dotted texts and tested it on a large and Israeli Hebrew further modified the manually tagged database. The database pronunciation. Thus dotted texts are only contains 35,443 dotted Hebrew words a rough guide to pronouncing Hebrew, and and their IPA (International Phonetic various systematic deviations exist. Alon Itai is Professor Emeritus Alphabet) transcription. The program of the Computer Science succeeded in correctly predicting the 2. Pronunciation Rules pronunciation of over 89% of the words in 2.1 Consonants. The consonants follow a Department of the Technion, the database. 78.5% of the errors occurred regular pattern. Table 1 shows a many-to-one Haifa. His research interests when predicting the stress of loan words. map to IPA. include Machine Learning and Most of the remaining errors occurred in 2.2 Vowels. Most vowels follow a Natural Language Processing. predicting the pronunciation of schwa. We many-to-one transcription. The two He is the director of the MILA found out that the traditional phonological problematic cases are the diacritic qamatz, center for processing Hebrew explanation that is based on sonority which is most often pronounced /a/ but (http://mila.cs.technion.ac.il/ theory correctly predicts 88.3% of all sometimes /o/, and the diacritic schwa, eng/index.html), dedicated to pronunciations of schwa. We constructed an which is in most cases silent but sometimes developing tools for processing alternative algorithm that correctly predicts pronounced /e/. Hebrew. the pronunciation of schwa in 99% of the 2.3 Stress. In most Hebrew words the [email protected] words of the database. stress is on the last syllable, though some are penultimate. Even though stress is 1. Historical background phonemic, it is not transcribed explicitly in Hebrew is among the first languages Hebrew dotted script. In nearly all cases one transcribed by a phonetic alphabet. The can deduce the stress from the vowel pattern original script constitutes an abjad, of the word or from its morphological i.e., a script where the vowels are not analysis. represented. However, some consonants Some noun patterns also have penultimate served as matres lectionis, namely the stress. For example, the segolite word in addition to their role as pattern class can be easily recognized since י,ו,ה consonants consonants are used to indicate the vowels, the last vowel is the diacritic segol (e.g. kelev/ dog). There are several relatedˈ/ ּכֶ לֶ ב a,u/o and i . During the first centuries A.D. Aramaic replaced Hebrew as the patterns which are easy to identify. Verbs main spoken language of Jewish people in the past tense end with an unstressed .(kaˈtavti/ I wrote/ ּכָ תַ בְ תִ י .in Palestine. However, the Holy Scriptures suffix (e.g and mainly the Bible were canonized using These inflections can be identified by a the original abjad. Because of the need morphological analyzer. to correctly read the Bible, a system of Loan words pose a greater challenge as diacritics, called dots, was added during their stress does not follow the rules of native the 10th century C.E. These symbols were words. The stress is most often penultimate

2017 small, so as to not change the holy texts, and, contrary to native Hebrew words, its and reflected the way the scriptures were position does not change even in the presence pronounced at the time. of a stressed suffix. Thus, a prerequisite

ׁש ר צ פ ּפ ס,ׂש נ מ ל ּכ,ק י ט,ת ח,כ ז ה ד ג ב,ו ּב א,ע Heb IPA ʔ b v g d h z x t j k l m n s p f t͡ s ʁ ʃ

Kernerman Dictionary News, July Table 1: Transcription of Hebrew consonants 17 for determining the stress position of a Total number stress schwa qamatz miscellaneous word is to determine whether it is a loan of errors word, and in some cases a morphological or semantic analysis is necessary. 3,622 3,295 172 122 33 For example, the word bira with ultimate 91.0% 5.2% 3.4% 0.9% stress is a native word meaning capital (city) and with penultimate stress is a loan word Table 2: Distribution of errors meaning beer. The dotted script renders both words identically. Thus, to correctly determine the stress position one first needs Total sample loan words program error slang database error to disambiguate the word, which requires 500 476 15 8 1 examining the context and performing a sematic analysis. 100% 95.2% 3% 1.6% 0.2% 2.4 Miscellaneous. Some combinations of Table 3: Distribution of stress errors on a random sample of 500 words letters/diacritics do not follow the above χa/ at the end of a/ חַ ,rules. For instance χχ/ is/ כְ ח ./word is always pronounced /aχ often, but not always, pronounced /kχ/. hataf-segol) that appear only in native 3. The experiment words. We constructed an algorithm to transform Since in most loan words (though not in dotted Hebrew to IPA. all) the stress is penultimate, the program 3.1 The database. We tested our algorithm placed the stress there, thus eliminating a on a database of 36,358 dotted words potential error. provided by K Dictionaries. The database As described above, the diacritic Schwa was created from the Hebrew dictionary core in Hebrew is sometimes pronounced /e/ and edited by Orna Ben Natan. Then the project sometimes omitted (in Hebrew the diacritic managers, Anat Merdler-Kravitz and Yifat schwa is never pronounced as a phonologist Ben-Moshe reviewed the automatically- schwa). Hebrew phonologists used sonority transcribed words and amended them theory to predict this behavior. Phonologists as necessary. As a result, each database define sonority as the audible energy entry consists of a word in both its dotted omitted with each (Burquest and transcription and IPA counterpart. Payne 1998, O’Grady and Archibald 2013). The database consisted of 21,126 In each syllable the sonority rises until lemmas, represented by their base form. In reaching the syllable’s nucleus (usually addition there were some plural forms of a vowel) and then it falls. In English, the nouns, verb inflections, and 184 multiword sonority scale, from highest to lowest, is expressions. Many Hebrew conjunctions the following: and prepositions are represented as prefixes a > e o > i u > r > l > m n ŋ > z v ð > s f in the standard script. Except for the latter, θ > b d ɡ > p t k. the words did not contain such prefixes. Rosen (1957, following Segal), and later 3.2 Evaluation. The program correctly Boletzky (2007), postulated that when the transcribed 32,736 words which consist of onset of the syllable defies this order, i.e., 90% of the database. the first phoneme is more sonorous than the The miscellaneous category consists of second, the syllable is split by inserting the that were incorrectly phoneme /e/ between the first and second כְ ח occurrences of 9 transcribed as /kx/, and some occurrences . The sonority of the phonemes that were transcribed as the null of the onset of each of the two syllables הְ and עְ of character and /h/ instead of /ʔa/ and /ha/. increases. Thus, for example, since /l/ is The main source of errors is the more sonorous than /v/, /lvi.ˈva/ becomes misplacement of stress. /le.vi.ˈva/, i.e., the syllable /lvi/ becomes The program correctly identified the two syllables /le/ and /vi/, thus causing the stress position of all but 3% of the original sonority of each syllable to increase.

Hebrew words in the sample. Loan To accommodate for the observed 2017 words are the main source of errors. The behavior of Israeli Hebrew, Rosen (1957) program identified some of these words postulated the following sonority hierarchy using heuristics, such as the existence of for Hebrew: non-native consonants (ʤ, ʒ, ʧ), suffixes a > e o > i u > j > l > m n > z v > f x χ > (t͡sija, nik,…), words starting with /f/ b d ɡ > p t k > ʔ h and other patterns that defy Hebrew Rosen did not place /ʁ/ (r) and the silibants phonotactics (a cluster of four consecutive (s and ʃ) in this hierarchy. To properly place consonants or three consecutive consonants phonemes one should check whether an /e/ at the beginning of a word). There are some is inserted before or after the occurrences of diacritics (hataf-qamatz, hataf-patax and the phoneme. Thus, we found that it is best Kernerman Dictionary News, July 18

to place /r/ together with /l/. However, since contained 199 occurrences where qamatz LOTKS 2017 such a split never occurs before or after s, is pronounced /o/ (small qamatz). We used ʃ Workshop on Language, it is not possible to place the silibants to two heuristics to identify (some of) them: conform to this rule. The qamatz was followed by a consonant Ontology, Terminology We tested this sonority rule on our with the diacritic hataf-qamatz (which is and Knowledge Structures database (omitting syllables with silibants always pronounced /o/). Thus, the pattern On September 19th the second in the onset). The theory successfully was /oCo/. edition of the Language, predicted the omission/inclusion of /e/ in The consonant after the qamatz had a Ontology, Terminology 88.3% of words. schwa and the following consonant had and Knowledge Structures We have developed an alternative a dagesh (that indicates germination or (LOTKS) workshop will take algorithm with better performance: When strong pronunciation). Thus, the pattern place as a satellite workshop schwa immediately follows the first letter it was qamatz C1 schwa C2 dagesh. Hebrew of the 12th International is pronounced /e/ if and only if at least one grammar dictates that the dagesh is a light .ב,ג,ד,כ,פ,ת Conference on Computational of the following occurs: dagesh and C2 is either ● Semantics (IWCS) in The first phoneme is a word prefix, This allowed us to identify 74 cases of Montpellier, France. Following such as b (=in) v (=and). /o/ (37.2%). The small qamatz is relatively ● The first phoneme is a verb conjugation rare, appearing in only 0.6% of all words of on from a successful first prefix, e.g., tsaˈper te.sa.ˈper, you will the database and in only 3% of the errors. edition as a joint workshop tell = t (future, 2nd person)+saˈper. at LREC 2016, the intention ● The first phoneme is j,l,m,n,r. Conclusions is once again to provide a ● The second phoneme is ʔ,h,ʕ . We have constructed an algorithm to forum for different research ● If schwa occurs elsewhere it is transcribe dotted Hebrew texts to IPA communities to interact and pronounced /e/ if and only if it is: conforming to the observed Israeli discuss issues within the ● The second schwa in the pattern Hebrew pronunciation. The algorithm was intersection of computational C1 schwa C2 schwa C3 (Ci a implemented as a Python 3 program and is linguistics, ontology consonant). available from the author. The program was engineering, knowledge ● Between two identical or similar tested on a large database and the error rate modelling and terminologies. letters (e.g., between /d/ and /t/) was 11.2%. LOTKS grew out of the The first two rules require a morphological We used the database to test how well need for a workshop that analyzer to identify the correct analysis of sonority theory explains the pronunciation dealt, on the one hand, with a word in context. Since we did not have at of schwa, and have formulated a simple enhancing knowledge bases our disposal a morphological analyzer for alternative algorithm that outperforms the or conceptual schemes with dotted texts, we could not apply these rules, sonority theory algorithm. linguistic knowledge, as well which could have prevented at least 49 References as on the other, the growing errors. The verb conjugation prefixes with schwa are t,j,l,n,m. With the exception of Bolotzky, S. 2007. The sonority in the use of ontologies and concept /t/ the prefix has high sonority and should, phonology of Israeli Hebrew. In Hebrew schemes to enrich linguistic or in most cases, cause a syllable break. Thus and her sisters, Efrat, M. (ed.). Haifa lexical datasets -- in particular the second rule is often subsumed by the University Press, 239-248. computational lexicons. third. (This explains the low number of Burquest, D. A., and Payne, D. L. 1998. The workshop also offers errors when rules 1-2 are ignored.) Since the Phonological analysis: A functional showcasing the use of number of remaining errors was small, we approach. Dallas, TX: Summer Institute conceptual/terminological/ were able to manually identify when rules of Linguistics. ontological resources in 1-2 were applicable, thus obtaining an error O’Grady, W. D., and Archibald, J. 2013. NLP or computational rate of less than 1%. Contemporary linguistic analysis: An linguistics in general. This introduction. (7th ed.). Toronto: Pearson year we have introduced Qamatz Longman, 70. new themes relating to the The diacritic qamatz is most often Rosen, H. 1957. Ha-Ivrit Shelanu, (Our use of terminology schemes pronounced /a/ (big qamatz). The database Hebrew). Tel Aviv: Am Oved. and ontologies in the digital humanities. The workshop welcomes contributions from both academics and industry 2017 professionals. Sonority theory The alternative w/o silibants algorithm Fahad Khan Rules 3-6 Rules 3-6 Rules 1-6 Istituto di Linguistica w/o silibants with silibants with silibants Computazionale (A. Zampolli) Sample size 7449 7449 8612 8612 – CNR # errors 871 125 126 77 https://langandonto.github.io/ % error 11.69% 1.68% 1.46% 0.89% langonto-termiks-2017/

Kernerman Dictionary News, July Table 4: Sonority theory and the alternative algorithm for words with schwa 19

The Saga of Norsk Ordbok: A scholarly dictionary for the Norwegian vernacular and the Nynorsk written language Oddrun Grønvik

The 9th of March 2016 saw the launch of long stay in Copenhagen. Danish was the Norsk Ordbok, a twelve-volume scholarly dominant written language in Norway until dictionary of the Norwegian vernacular after 1905. and the Nynorsk standard language. Norsk In the same period, the Norwegian Ordbok fills twelve volumes of 9,600 vernacular changed so much that a revival pages, has about 11 million words of text, of Old Norse as a written standard after holds 330,000 entries and ca 15,000 fixed 1814 was unthinkable – the spoken dialects phrases. It took 86 years to complete since and the old written standard were too far the material collecting started and until apart. volume 12 was out. Two thirds of the The 1814 Constitution states that editing happened after 2000. The dictionary legislation should take place in the as well as much of the evidence (contained Norwegian language, but this was a shield in the Norwegian Language Collections, against a Swedish takeover. The choice cf. Grønvik 2020) is freely available on of Danish as an administrative language Oddrun Grønvik obtained a the web (http://www.norskordbok.uio.no). nevertheless left Norway with a national B.A. with honors in English The full story of a twelve-volume legitimacy problem – the idea of a separate Litterature and Language from scholarly dictionary could easily fill another Norwegian national identity, and the need volume, but in this article only a few points for an independent state, was questioned. St. Hilda’s College, Oxford, will be adressed, i.e. (1) the linguistic The language issue became the question of has the degree Candidate of backdrop, (2) the dictionary project and its the day from the mid-19th century, and two Philology at University of source material, (3) the digitisation project solutions were presented, though not shaped Oslo, and was in 2006 granted NO2014, and (4) the future. into opposing camps until the end of the an honorary PhD from the nineteenth century. University of for 1. The linguistic and historical The response to the legitimacy issue her work on dictionaries for backdrop to Norsk Ordbok was initated by members of the Royal African languages. She has Norway has a broken history; independence Norwegian Society of Sciences and worked in the publishing until the end of the 14th century, Letters (DKNVS). The society looked industry (1970-1979) subordination under Denmark until 1814 actively for someone who could document and as consultant at the and under Sweden from 1814 to 1905, the Norwegian vernacular language, and Norwegian Language Council and independence again since 1905. These prove (a) its connection to Old Norse, political changes have had a profound and (b) its separateness from Danish (1979-1987), and in 1987 influence on the Norwegian language, and Swedish. Because of the diglossic became Asssistant Professor which in turn has affected the formation situation - Norwegian and Danish were at the University of Oslo and of written standards and the scholarly close cognates, and Danish spoken in joined the editorial staff at lexicography for Norwegian. The chief Norway was phonologically adapted to Norsk Ordbok, serving as Chief result is that Norway today has two written Norwegian - the difficulty was finding Editor between 2006-2016 standards, Bokmål and Nynorsk, which a trained linguist who was close enough with special responsibility for are close cognates, and which are each to ordinary people to gather trustworthy lexicographical and digital documented in a major dictionary. Norsk linguistic information. The problem was development and training. Ordbok documents the Nynorsk written solved when the the self-taught linguist In addition she has served as standard and all Norwegian dialects. and lexicographer Ivar Aasen (1813–1896) academic project manager for The historical background can be presented himself for the task. Aasen summarised as follows (cf. Haugen 1976; was funded from 1840 onwards and the NORAD-funded African Vikør 1995 p. 51 ff. and 92 ff.): throughout his lifetime, first by DKNVS, Languages Lexical Project, The spoken languages of medieval then by Stortinget. Within his lifetime, and on the standardisation Norway, Iceland, Sweden and Denmark Aasen documented the grammatical committee of the Norwegian must have been mutually comprehensible, structure and the lexicon of Norwegian Language Council. Dr 2017 but resulted in different written practices. In in a series of works culminating in Norsk Grønvik has written widely this period, the written language of Norway Grammatik (1864) and the dictionary on the Norwegian language, was what is now called Old Norse. Norsk Ordbog med Dansk Forklaring language development and Once the administration of Norway was (1873). The orthography expressed in the standardisation, lexicography transferred to Denmark, Old Norse was headwords of Aasen’s 1873 dictionary was and digitisation of language gradually replaced by Danish, until by the also his proposed standard for a common, resources. end of the sixteenth century Danish became wholly Norwegian written standard, the [email protected] the language for civic administration. forerunner to today’s Nynorsk (New Norway was not allowed a university Norwegian). Aasen’s work put an end to until 1811, so tertiary education meant a the legitimacy doubts – Norwegian was a Kernerman Dictionary News, July 20

and leave it to each school board to choose which one to adopt. An earlier parliamentary decision, in 1874, had tasked teachers with adapting their oral instruction in class to the dialect of their pupils, instead of the other way round. Since then, Norwegian has been expressed in two written languages. Since 1929 these have been termed Nynorsk and Bokmål (Vikør 1995; Hovdhaugen 2000).

2 The Norsk Ordbok dictionary project and its source material Plans exist back to 1911 for scholarly lexicography for Norwegian, in the form of committee reports and applications for funding. From 1920 onwards some funding was achieved for starting language collections -– slip archives with indexed excerpts, according to the best practice of the times. The resulting lexicographical work was envisaged both as a joint presentation of all spoken and written Norwegian, and separate West-Nordic language, descended as a separate work for each of the standards. from Old Norse, while modern Danish and Separate scholarly dictionaries became Swedish stem from East-Nordic. the solution and resulted in the parallell Aasen’s work was made possible by projects of Norsk Riksmålsordbok and the development of the comparative Norsk Ordbok - Ordbok for det norske methodology of 18th and 19th century folkemålet og det nynorske skriftmålet. historical philology. He systematically Norsk Riksmålsordbok was published in documented the Norwegian dialects, four volumes 1928-1958, with a supplement employed the comparative method to in two volumes published in 1995. establish a common pattern for phonology, The split into two projects had an morphology, lexicon and syntax, using ideological basis with inevitable practical Old Norse as a touchstone, but including implications. The scholarly dictionary for nothing that was undocumented in his time. the Danish-derived standard was to be In shaping his proposed standard language based on printed literature of Norwegian he also took the standards of Swedish and authorship from 1814 onwards, supplied Danish into account, to avoid unnecessary by speech materials representing “educated differenciation from what people were used everyday speech”. The scholarly dictionary to seeing in print. for Nynorsk was to be based on the To the ruling classes of Norway, however, Norwegian vernacular in all its varieties, the idea of even trying to establish a as documented back to about 1600, and on wholly Norwegian written standard, for Nynorsk literature, which came into being everyday use in competition with Danish, from the late 19th century onwards. The first seemed ridiculous and unthinkable. This project regarded speech as a supplementary would mean giving cultural hegemony to category and a dialect label as a warning; an uneducated, though literate, country the other one saw the dialect materials as population. At the same time, something primary source material, expanded and had to be done to nationalise the Norwegian developed through literary use. Sources version of Danish, clearly diverging from the as well as the lexicographical treatment of Danish of Denmark. The counter-solution to them were to be too different for the projects Nynorsk favored adapting standard Danish to be compatible within one framework. to Norwegian phonology and including Norsk Ordbok got off to a belated start in

2017 typically Norwegian words in the lexicon. 1930. A grand plan for material collection, A gradual transition from a Danish to a with a small editorial staff supported by Norwegian written standard, based on volunteers, was drawn up and drew a “educated everyday speech” was envisaged. gratifying response: 600-700 volunteers The first orthographic reform of Danish in came forward within a year or two, and Norway came in 1907 and established the by 1940 the collections encompassed one forerunner of today’s Bokmål. million slips, 20 percent documenting At the end of a long and fierce political speech, the rest drawn from written struggle, the Norwegian parliament in sources. At the same time, a rough first 1885 voted to give both standard languages version was drafted on the basis of existing

Kernerman Dictionary News, July official standing as languages of instruction Nynorsk dictionaries and some large dialect 21 collections. This draft manuscript held production plan. The basis for this plan 130,000 entries and covered 13,500 pages was the conviction that trained linguists of typescript. The plan was to expand this could become efficient scholarly manuscript with the materials from the lexicographers within one year, and after language collections and end with a modern that meet production deadlines as planned. dictionary of 4-5 volumes. In order to succeed, Norsk Ordbok would Work on Norsk Ordbok was halted need roughly 265 man years of efficient during the second World War, and started lexicography within 14 years -– 2001 to again in 1946. A review period led to the 2014. We thought we could do that, given following three decisions: (a) continue (a) competent and tough management, (b) collecting, especially oral materials; (b) scholarly computer developers, (c) enough draw up a detailed plan for the dictionary linguists, and (d) funding. All of these microstructure, so as to do justice especially factors were equally important, and the to the richness of the sources; (c) start planned project, named NO2014, would editing all over again, focussing on full be doing a tightrope act from beginning use of the Language Collections with to end. It was worth trying. the draft manuscript as a guideline. On The conviction that scholarly the basis of this very ambitious plan, the lexicographers could be trained quickly first fascicle was completed in 1950 and and efficiently ran counter to traditional the first volume in 1966. At this time, the views of training needs for scholarly completed dictionary was thought to reach lexicographers. When I started in 1987, the eight volumes at the most. general assumption was that training as a The group of editors increased slowly. scholarly lexicographer would take at least When I was recruited in 1987, I became five years, and the time would be spent in the eigth editor and the second woman getting to know the collections, mastering editor. At that time two volumes were out a multitude of conventions, and accepting and the third completed in manuscript. The the need for extensive crosschecking Language Collections had quadrupled in and proofreading. Previous experience size and the alphabet progress had slowed as a linguist would certainly be utilised, down. A little arithmetic showed that if developed and challenged in handling very work continued at the rate then current, complex materials, but language analysis Norsk Ordbok would be completed around was only one of many tasks, and they all 2060 and reach 16-20 volumes -– a plan which was unlikely to get funding. These facts were therefore kept quiet. Norsk Ordbok needed a miracle, and the miracle turned up in the shape of a huge Norway digitisation project for the university collections of the whole of Norway, designed to counteract nationwide unemployment when the Norwegian telephone and telegraph services went digital. Through the Documentation Project (1991-1997), run by Christian-Emil Smith Ore at the University of Oslo, key components of the Collections were stored in databases and on the Web by 1997, i.e. the excerpt archives, the draft manuscript of 1940 and a number of other resources. All components were then coordinated in a digital index – the Meta Dictionary – with base forms and part of speech as in Norsk Ordbok. The Meta

Dictionary at present holds about 550, 000 2017 entries for Nynorsk.

3 The Digitisation Project NO2014 (2001–2016) As the millenium approached, Norwegian authorities were planning another jubilee -– the bicentenary of the Norwegian constitution in 2014. Norsk Ordbok was chosen as one of the bicentenary projects, on the basis of a carefully worked out Kernerman Dictionary News, July 22

The entry støl in the online version of Norsk Ordbok seemed to hold equal significance. could wish for in professional and human Norsk Ordbok was one of many projects terms. The positions as chief editors – a to embrace the computer at the end of group of four -– were held by former staff the 20th century. The decisive experience members. Tasks were allocated according to for us on the issue of training enough project needs. I was made responsible for lexicographers in a fairly short time, digitalisation and training, and this account was participation in the ALLEX Project is naturally coloured by my particular (1991-2006), a Norwegian and Swedish experience. funded research project designed to A premise for funding was moving the provide monolingual dictionaries for the entire project to a digital platform. We African languages of Zimbabwe. The took that to mean not only producing the ALLEX Project proved that mother-tongue dictionary itself, but also being able to linguists could become efficient scholarly access and sort digitalised materials, take lexicographers in a very short time, working care of the sorting, make sure that no entry through a lexicographical interface based lacked materials, and saving the finished on an analysis of the relevant language, product in a form that allowed different storing results in databases, dealing with types of presentation of the finished product oral materials in corpora, etc. (Grønvik 2005). Categorising and commenting on An important decision concerning the language through well worked-out software software structure was to make a maximum is not only a tool for efficiency, it is also an format the standard, always allowing for immensely effective tool for learning and the most extensive and complex entry, mastery. This conviction combined with the rather than having a more restricted basic assurance of computerisation support from format which might have to be extended. EDD (the Unit for Digital Documentation at The standard sense unit therefore caters for

2017 the University of Oslo), covered points (b) definition, usage examples, sub-definitions and (c) above. The Norwegian Parliament with usage examples, multiword expressions guaranteed point (d), with funding through with several senses, and finally compounds the Ministry of Culture and the University of in which the entry headword appears as Oslo. NO2014 also had the immense good the initial or the final part, plus of course fortune to attract two directors1, one after source tables for literature and geographical the other, who had all the qualifications one location. The editorial interface was the first thing 1. The Project directors of Norsk Ordbok to get finished. By 2013, Norsk Ordbok 2014 were Kristin Bakken 2002-2008 and in digital form was contained within one

Kernerman Dictionary News, July Åse Wetås 2008-2015. application which was able to (1) generate 23 entries from indexed materials, with a (Norway has more than 400 dialects) in a link to the materials, (2) provide a tool common portal allowing cross-searches in for analysing linked materials and storing headwords. the analysis, (3) generate the entry head A Web edition of Norsk Ordbok was (the identifying information of the entry) launched in 2012 (http://norskordbok.uio. from a separate full form register, (4) no/), showing the section of the dictionary present a form where the edited text can be edited and completed in the relational linked directly to the materials underlying database, today from “i” to “å” in the each definition or description, (5) allow alphabet. From January 2014, the Web supervision of production flow at the edition has been linked to a digital map micro and macro levels, (6) present the of Norway, and thus been able to show finished product in an optimally accessible geographical usage extent for word forms, fashion (paper: pdf with preset style sheet senses and expressions. This edition is corresponding to the print typeface; web: popular, and so is the map function. settings for web presentation on different When the NO2014 project was nearing reading tools), and (7) provide a format for completion, we also published our editorial longtime storage of the product linked to handbook through the NO2014 website its sources. The software package became (Redigeringshandboka 2016). The project a sort of lexicographical factory, designed parole throughout was to encourage users to to allow editors to concentrate on analysing look behind the edited text into the materials and editing (Grønvik and Ore 2013). of the Language Collections, raise questions The Norsk Ordbok software is now used and demand response. In public interaction, in three other dictionaries, Bokmålsordboka publishing the guidelines, which have the and Nynorskordboka being the best known role of a method chapter in a dissertation, (see http://ordbok.uib.no/, cf. http:// has turned out to be useful. dictionaryportal.eu). When Norsk Ordbok was completed, When the project NO2014 started in more than 200,000 headwords that had 2001, the alphabet stretch a-h was already never been lexicographically treated, had edited, with great care and consistency, and an entry in a scholarly dictionary, while deep respect for the materials contributed the central vocabulary of Norwegian had over the years, especially the oral materials. received in-depth treatment on the basis A primary task in the digitalisation process of written and oral materials covering was to take care of what can be termed the whole country and four centuries of best practice in the pre-digital editorial documentation (Grønvik 2017). work. Two tasks stood out: (a) the careful A fully sourced account of the history of identification of formerly unstandardised Norsk Ordbok (in Norwegian) will be found dialect word forms, and finding for them in the Festschrift published together with a standard form consistent with modern the final volume (Karlsen et al. 2016). Nynorsk orthography; (b) the treatment of multiword expressions (MWEs), which proved to have been a major difficulty in the pre-digital entry schema. The first task became a permanent concern for the project management, especially in offering all new editors training in handling Norwegian and Nordic dialectology, synchronically and diachronically, but also in giving particular attention to the standardising of new dialect materials which were added to the Language Collections during the project period. For the second task, training in identifying MWEs was offered on a permanent basis, but we also created a software template for the registration and 2017 editorial handling of MWEs, making them directly searchable. Identifying both poorly documented word forms and MWEs was greatly helped by important additions to the digital Language Collections. The most important items were a corpus for Nynorsk literature covering the period 1866–2010, now at 105 million tokens (Nynorskkorpuset), and the digitalization of 65 dialect dictionaries Launch of Norsk Ordbok, 9 March 2016 Kernerman Dictionary News, July 24

4 From the University of Oslo to a new References life at the University of Bergen Berg-Olsen, S. and Wetås, Å. 2015. In June 2014, the University of Oslo Revision and digitisation of the decided to end its commitment to early volumes of Norsk Ordbok: Norwegian lexicography, and get rid of Lexicographical challenges. In Abel, the Language Collections, which comprise A., Vettori, C., and Ralli, N. (eds.), archives going back to the 1880’s and Proceedings of the 16th EURALEX covering far more than the Nynorsk and International Congress. Bolzano. dialect sections. NO2014 was half way EURAC research, 1075-1086. through editing volume 12 when the Grønvik, O. 2005. Norsk Ordbok NORDIX project staff was sacked. Despite great 2014 from manuscript to database difficulties, Norsk Ordbok did get finished – Standard Gains and Growing in good order, but there was a delay of Pains. In Ferenc, K., Kiss, G., Series of fully bilingual more than a year; volume 12 was sent to and Pajzs, J. (eds.), Papers in dictionaries for Nordic and the publisher, Det Norske Samlaget, on Computational Lexicography Complex major European languages. November 24, 2015, and the launch came 2005. Linguistics Institute, Hungarian DANSK in March 2016. Academy of Science. 60-70 Danish-English By that time, the Norwegian government, Grønvik, O. 2020. The lexicography of through the Ministries of Education and Norwegian. In Hanks, P. and Schruyver, / English-Danish Culture, had decided that the Language G.-M. (eds.), International Handbook Danish-French Collections, with the dictionaries of Modern Lexis and Lexicography. / French-Danish Norsk Ordbok, Bokmålsordboka and Berlin and Heidelberg: Springer-Verlag. Danish-German Nynorskordboka, represent essential Forthcoming. / German-Danish linguistic infrastructure, and therefore were Grønvik, O. and Ore, C.-E. S. 2013. Danish-Spanish too important to be left to the management What should the / Spanish-Danish of the University of Oslo alone. do for you – and how? In Kosem, I. Provided that funding was allocated, et al. (eds.), Electronic lexicography NORSK the University of Bergen had volunteered in the 21st century: thinking outside Norwegian-English to house the Language Collections the paper. Proceedings of the eLex / English-Norwegian (comprising collections for Bokmål, 2013 conference, 17-19 October 2013, Norwegian-French Nynorsk, Old Norse and place names). After Tallinn, Estonia. Eesti Keele Instituut, / French-Norwegian inventorying in the winter of 2015-2016, 243-260. Norwegian-German more than 70 tons of books and archives Haugen, E. I. 1976. The Scandinavian / German-Norwegian were moved in the summer of 2016. The languages: An introduction to their Norwegian-Spanish transfer of the digital collections started history. Cambridge, MA: Harvard / Spanish-Norwegian about the same time, the first components University Press. being run from Bergen from September Hovdhaugen, E. 2000. Normative SVENSK 2016. The total move is a very extensive studies in the Scandinavian countries. Swedish-English operation still in process, involving In Auroux, S. et al. (eds.), History of / English-Swedish recruiting and training of research, ICT the Language Sciences. Berlin and Swedish-French and administrative staff. In February this New York: Walter de Gruyter I-II, / French-Swedish year an application for revision and full 888-893 Swedish-German digitisation of Norsk Ordbok a-h was Karlsen, H. U., Vikør, L. S., and Wetås, / German-Swedish submitted by the University of Bergen to Å. 2016. Livet er æve, og evig er ordet. the Ministry of Culture, and it was – so far Oslo: Det Norske Samlaget. Swedish-Spanish and fingers crossed – well received (for Vikør, L. S. 1995. The Nordic languages. / Spanish-Swedish revision plans, see Berg-Olsen et al. 2015). their status and interrelations, 2nd ed. Based on the Global Series and This is how matters stand. Oslo: Novus. other resources and developed The cost of completing Norsk Ordbok by K Dictionaries 2017. through the Project NO2014 (2001-2016) URLs stands at 260 million NOK, somewhere Bokmålsordboka and Nynorskordboka: around 27.5 million Euro. This sounds http://ordbok.uib.no like a lot of money, though it wouldn’t The Documentation Project (1991-1997):

2017 buy many kilometers of road. However, it http://dokpro.uio.no/engelsk/drawer_ is enough not to be thrown away lightly, to_screen1.html especially when there is visible and vocal ENeL Dictionary Portal: http:// public support for maintaining both the dictionaryportal.eu/en/ Language Collections and the dictionaries. Norsk Ordbok: http://no2014.uio.no/perl/ In the future, Norwegian lexicographers ordbok/no2014.cgi will have to continue to serve the public, Nynorskkorpuset: http://no2014.uio.no/ both in Norway and internationally, through korpuset/conc_enkeltsok.htm developing Norwegian lexicography as best Redigeringshandboka 2016: http://no2014. they can. At least we now know that we uio.no/eNo/tekst/redigeringshandboka/

Kernerman Dictionary News, July can do it! redigeringshandboka.pdf 25

The First Century of English Monolingual Lexicography. Kusujiro Miyoshi

In a series of case studies ranging across between the front matter of dictionaries English lexicography of the seventeenth and their actual content” (xiv). Thus, his century, Kusujiro Miyoshi calls the tenets method is “greatly different from, or nearly of forensic dictionary analysis into action diametrically opposed to, that of Starnes and proves its methodological productivity. and Noyes,” which “[leaves] a plenitude of Miyoshi’s cast of characters is mostly historically significant facts undiscovered” familiar to historians of lexicography: (xii). Miyoshi sets out to discover those Robert Cawdrey’s Table Alphabeticall facts, some of them illuminating about (1604), John Bullokar’s English Expositor relationships among dictionaries of the (1616), Henry Cockeram’s English period, some about “the highly creative Dictionarie (1623), Thomas Blount’s use of other dictionaries in one specific Glossographia (1656), Edward Phillips’ dictionary” (xiv), and altogether registering New World of English Words (1658), Elisha diverse approaches to representing Coles’ English Dictionary (1676), and J. lexical knowledge, through which later K.’s New English Dictionary (1702). The lexicographers would sort to identify and The First Century of outlier is Richard Hogarth’s Gazophylacium develop what we might call lexicographical English Monolingual Anglicanum (1689), about which most of best practices. Lexicography us knew next to nothing until we read Chapter for chapter, Miyoshi tells us Kusujiro Miyoshi Miyoshi’s article about it in Kernerman things of interest about these early English Dictionary News (2008). De Witt Starnes dictionaries, things we ought to know Newcastle upon Tyne: and Gertrude Noyes addressed it in their before proceeding to any advanced or Cambridge Scholars classic study, The English Dictionary from speculative arguments about them. For Publishing, 2017 Cawdrey to Johnson 1604-1755 (1991; instance, in his first chapter, Miyoshi Hardback, 180 pages, £58.99 originally 1946), for just five pages, half shows that, although conventional wisdom of which discuss the Gazophylacium’s says otherwise, Cawdrey’s A Table ISBN 978-1-4438-5181-7 antecedents. If you really want to know Alphabeticall and Bullokar’s English http://cambridgescholars.com/ about it, you’ll necessarily turn to Miyoshi’s Expositor are structurally and textually the-first-century-of-english- treatment of it in his new book. closely related. Testing the overlap between monolingual-lexicography Miyoshi has long been a practitioner of the two across the alphabetical range I, L–P, forensic dictionary analysis. Julie Coleman and R–T, he discovers that the Expositor and Sarah Ogilvie (2009) codified its appropriates on average nearly 60% of the “principles and practice” and included Table’s headwords and that nearly 37% Miyoshi’s Johnson’s and Webster’s Verbal of the Expositor derives from the Table. Examples (2007) among their references. Beyond headwords, Miyoshi detects the They proposed that one cannot trust Table’s influence on more than 25% of the lexicographers to tell the truth about those Expositor’s definitions. At the very earliest dictionaries in their prefaces. One can stage, then, English lexicographers were only ascertain the truth by digging elbow aware of one another’s work and built upon deep into dictionary data and analyzing foundations laid by others. them statistically. As Coleman and Ogilvie Chapter 2 contrasts the Expositor and conclude (2009, 18), “Forensic dictionary Cockeram’s English Dictionarie on analysis brings together statistical, the treatment of derivatives. Miyoshi textual and contextual approaches that demonstrates clearly that while Cawdrey allow dictionary researchers to examine, did not itemize derivatives, Bullokar — understand, and reconstruct lexicographic often further developing Cawdrey’s entries practices and policies.” Miyoshi’s — added some. Cockeram lists even more investigations of seventeenth-century derivatives than Bullokar, sometimes where 2017 English dictionaries are statistical and Bullokar lists none, and sometimes adding textual by default — in most cases, we lack them to Bullokar’s text. So, Cawdrey lists useful contextual evidence — and he is a liquid; Bullokar lists Liquid, Liquefaction, master of the method. and Liquifie; and Cockeram lists Liquid, As Miyoshi puts it, his “method, although Liquable, Liquation, Liquator, simple, yields results: it is to dive directly Liquefaction, and Liquifie (15). into the contents of the dictionaries” in Of course, we care whether the words question, “relying little on descriptions in early English dictionaries reflected in their title pages and introductory use. Even if some of these words were materials,” which “tends to reveal the gulf not in use — “Liquator. He which Kernerman Dictionary News, July 26

melteth” — their inclusion can reveal does not in fact. In the second part, entries eLex 2017 something about language attitudes and open with colloquial terms then defined Lexicography from lexicographical technique. Suppose that with Latinate or, as Miyoshi calls them, Scratch Cockeram did make some words up to “refined” terms, the reverse of the “hard see what an extended list of derivatives word” entry pattern. Because the defining This year marks the fifth looked like — false evidence of English, terms are Latin or Latinate, it’s easy to anniversary of the biennial but an important experiment in dictionary assume some connection to the vernacular– Electronic Lexicography in structure and, as Miyoshi points out, classical bilingual dictionaries that precede the 21st Century conference evidence that the notion of derivatives had publication of The English Dictionarie, series. The conference will taken hold of the linguistic imagination but Miyoshi’s collations reveal that take place in Leiden from 19 in early seventeenth-century England. approximately 90% of the “refined” words to 21 September and will be Within Cockeram’s entry, Liquefaction come from the first part of Cockeram’s is defined as “That Liquation is,” and English Dictionarie, Cawdrey’s Table, or hosted by the the cross-reference, Miyoshi argues, may Bullokar’s Expositor. With characteristic Institute. represent “Cockeram’s attempt to present understatement, Miyoshi suggests, “We The theme of eLex 2017 is […] entries in a systematic way” (15), may now have to reconsider the influence Lexicography from Scratch and which I would emend to “increasingly of the English–Latin dictionary on the the focus is on state-of-the-art complex entries.” In the early English early English monolingual dictionary” technologies and methods dictionaries, we see both macro- and (41). for automating the creation microstructural features we now take for If the second part of Cockeram’s of dictionaries. Over the past granted in the process of their invention. dictionary borrows so much material from two decades, advances in The dictionaries tends Cawdrey and Bullokar, then what’s new to mention Cockeram in passing. Miyoshi and interesting about it? Miyoshi explains NLP techniques have enabled clearly sees him as perhaps the central in Chapter 5 that, as in the first part, the automatic extraction of figure in the development of English Cockeram developed a more complex entry different kinds of lexicographic lexicography during the seventeenth structure than those of his contemporaries information from corpora and century. Besides the treatment of derivatives — especially in presenting synonyms and other (digital) resources. As a in Chapter 2, Miyoshi considers his information on word formation — such result, key lexicographic tasks, treatment of high-frequency verbs (Chapter that the second part “is highly significant such as finding collocations, 3), source material (Chapter 4), and entry as a dictionary of its time” that contained definitions, example sentences structure (Chapter 5), and then contrasts “the precursors of techniques which are or translations, are being his Anglicization of foreign words to that indispensable for the development of increasingly transferred of Blount in Glossographia (Chapter 6). English monolingual dictionaries after it” from humans to machines. The First Century of English Monolingual (49). In Chapter 6 he wraps up his inquiry Lexicography Automating the dictionary is a slim book: there are 38 into Cockeram by comparing his approach pages of introductory material, including to Anglicizing foreign words, especially creation process is highly an elegant introduction by John Considine, Latin or Latinate ones — litispendence, for relevant, especially for and the chapters by Miyoshi comprise but example — to Blount’s in Glossographia under-resourced languages, 130 pages, not counting references and and concluding that both lexicographers where dictionaries need to be index. Half of the ten chapters and nearly blotted English with inkhorn terms. compiled from scratch and half the pages — 62 of them — focus on The book under review is a series of case where the users cannot wait for Cockeram’s work. It’s the most detailed studies operating a certain methodology; years, often decades, for the and concentrated analysis of Cockeram’s it concludes only that close attention to dictionary to be “completed”. English Dictionarie I’ve ever read — for the data of seventeenth-century English that reason alone, the book is a valuable dictionary texts leads us to re-evaluate This year we have received addition to the historical literature about relationships among them — both the data nearly 50% more submissions early dictionaries. and the dictionaries, I suppose. Miyoshi’s in comparison with the In Chapter 3, Miyoshi investigates argument about Cockeram opens into a previous conferences. The another aspect of Cockeram’s “system,” sort of teleological arc of lexicographical Programme Committee has the treatment of phrasal verbs, which as an development in the period. Cockeram made a nice selection of papers element of dictionary structure resembles experiments with systematic approaches for presentations, demos and and aligns well with treatment of noun to the lexicon in dictionary form, we’re posters. Each submission derivatives — looking across lexical told. Cockeram leads us to Chapter 7, titled

2017 was reviewed by at least two categories, one detects an inclination “Edward Phillips’s New World of English members of the 69-member towards elaboration that would drive Words (1658): The First Systematic Scientific Committee. later lexicographical innovation until Treatment of English Vocabulary.” its practices were well established in Whereas, Blount, for instance, “still saw Keynote lectures will be dictionaries by John Kersey, Nathan Bailey, naturalized foreign words as the primary delivered by Frieda Steurs and of course Samuel Johnson. In Chapter object of lexicography […] Phillips was (Dutch Language Institute), 4, Miyoshi argues quite persuasively that coming to realize that what matters is the Ivan Titov (University of the tradition of English–Latin dictionaries systematic treatment of the vocabulary Amsterdam/University of we have long assumed — under Starnes’ of English, whatever its origins” (84), and Noyes’ influence — underlies the which is a necessary step towards the

Kernerman Dictionary News, July second part of Cockeram’s dictionary, lexicographical professionalism of 27

Elisha Coles and John Kersey, who and Johnson, however many lemmata he bring the seventeenth century to a close. carried over from the Gazophylacium or any Elisha Coles’s English Dictionary, the other dictionary. Edinburgh), Jane Solomon next chapter’s focus, would summarize and Taken by themselves, then, Miyoshi’s (Dictionary.com), and Ben harmonize all lexicographical developments chapters, though well connected to one Zimmer (Wall Street Journal, of the century, looking backwards over his another, lack essential context. Fortunately, formerly Vocabulary.com). predecessors and extending lexicographical Considine’s introduction outlines both “The system to the internal linking of entries seventeenth-century monolingual English One more keynote will be held — Coles took the dictionary as a book of dictionary tradition” (xxiii–xxiv) and for the first time as part of the miscellaneous entries and made it whole. In “Studying the tradition: before and after Adam Kilgarriff Lecture, in John Considine’s formulation, “In English Starnes and Noyes” (xxiv–xxviii), and memory of our colleague and and Latin, his work was a milestone in the explains Miyoshi’s critical intervention friend Adam Kilgarriff, by establishment of the genre of the compactly in those traditions and the ways in which Paweł Rutkowski (University printed, fully alphabetized classroom his work complements and improves of Warsaw), winner of the dictionary which draws on larger and more upon Starnes and Noyes (xxviii–xxxvii). Adam Kilgarriff Prize 2017. learned dictionaries. I think it would be Considine’s knowledge of the subject The programme includes possible to argue that he was one of the is deep and wide, but the introduction is founders of that genre, although of course brief and appealing — the sort to which traditionally three parallel that is a simplification” (2012, 53). It is, only genuine erudition can lead. Miyoshi sessions, software rather, a simplification and a truth at the proves that seventeenth-century dictionaries demonstrations, poster same time. are textually much more interrelated than sessions, a book and software Chapters 9 and 10 rightly conclude we had realized and reiterates what we’ve exhibition, and a social event. the seventeenth century as it tips into known for a while, that lexicographers of On Wednesday 20 September, the eighteenth, with John Kersey’s New the time rather freely borrowed from one particpants will gather in beach English Dictionary (1702), which pivots another. But why presume originality when pavilion Paal 14 on the Dutch away from the hard words tradition the best possible definition has already coast for a relaxing evening towards the modern dictionary. Chapter 9 been written? I find repeating Considine’s with BBQ while taking in the argues, on close comparison of Hogarth’s conclusion similarly irresistible: “The stunning views of the North Gazophylacium with the New English English Dictionary from Cawdrey to Dictionary, that the former influenced the Johnson will continue, for the time being, Sea. latter, so that while innovative, the New to be the authority of first recourse, but after The conference is preceded by English Dictionary was not independent reading what it has to say on a given topic, the Final Meeting of COST of earlier dictionaries, not quite as new it will always be wise to ask, ‘does Miyoshi Action IS1305 European as the title promised. Chapter 10 suggests have anything to say about that?’ and to turn Network of e-Lexicography that Kersey’s primary innovation is the to this book” (xxxvii). Just so. (http://elexicography.eu/), careful treatment of compound adjectives which will take place on and nouns, a further development of References Monday 18 September at the Cockeram’s interest in morphological Coleman J. and S. Ogilvie. 2009. Forensic same venue. Two workshops complexity, so, Miyoshi believes, tied to Dictionary Analysis: Principles and the seventeenth century more than leading Practice. International Journal of are scheduled on Thursday into the eighteenth. Lexicography 22.2, 1-22. afternoon, immediately after Each of Miyoshi’s chapters looks Considine J. 2012. Elisha Coles in Context. the conference: a Sketch forensically into a very precise matter Dictionaries XXXIII, 42-47. Engine workshop, where of dictionary structure in one or two Miyoshi K. 2007. Johnson’s and Webster’s participants can learn about dictionaries and each has its illuminating Verbal Examples, with Special Reference the new interface and upgrade moment. But their narrowness is also to Exemplifying Usage in Dictionary their corpus skills, and a K a limiting factor and sometimes we are Entries. Lexicographica Series Maior Dictionaries workshop, on misled, much as Miyoshi rightly claims 132. Tübingen: Niemeyer. developing and handling we can be misled by Starnes and Noyes. Miyoshi K. 2008. Gazophylacium human- and machine-driven So, I accept Miyoshi’s point about the Anglicanum (1689), a Turning Point lexicographic resources. Gazophylacium’s influence on Kersey, and in the History of the General English I agree that the New English Dictionary is Dictionary. Kernerman Dictionary For more information on the textually a seventeenth-century specimen, News 16, 4-8. conference, visit the revamped but the textual matters aren’t the only salient Read A. W. 2003. The Beginnings of eLex website at https://elex. 2017 aspects of that dictionary. It leads into the English Lexicography, (ed.), Adams, link/elex2017/. eighteenth century, as Allen Walker Read M. Dictionaries XXIV, 187-226. We hope that you will join us at observed, because Kersey was “the first Starnes D. T. and G. E. Noyes. 1991. The eLex 2017 and we look forward professional lexicographer” (2003, 223). English Dictionary from Cawdrey to He saw the purpose and art of lexicography Johnson 1604-1755. Amsterdam Studies to welcoming you in Leiden. differently from his schoolmaster in the Theory and History of Linguistic predecessors — the “hard words dictionary” Science 57, (ed.), Stein, G. Amsterdam Carole Tiberius tradition might as aptly be described the and Philadelphia: John Benjamins. on behalf of the eLex 2017 “schoolmaster dictionary” tradition — and Organising Committee. in that respect he looks forward to Bailey Michael Adams Kernerman Dictionary News, July 28

Met zoveel woorden. Gids voor trefzeker taalgebruik. Rik Schutz en Ludo Permentier

In 2016 the language lovers and (Dutch) take what fits best in their text (VII). The language experts Rik Schutz and Ludo authors make an appeal to the users’ own Permentier published Met zoveel woorden linguistic feeling and their knowledge of (Mzw), which may be translated as ‘With the expressions found. Their appeal would so many words.’ It may be useful to begin seem to narrow down the target group with explaining what Mzw is not. Well, it considerably. You have to be an educated is not a dictionary, though the authors on speaker of Dutch to fully appreciate Mzw. the inside front cover refer to the book’s Non-native translators into Dutch must have dictionary section. They argue on p. IX that a fairly high command of Dutch. For the Mzw differs from traditional dictionaries. latter group Mzw may be a textbook after So, it’s not a traditional dictionary. Nor is all. The rich collection of expressions given it a textbook aimed at extending the users’ in the book must be particularly tempting knowledge of Dutch (VII). The authors for them. explicitly state that they do not pretend to In the introduction Schutz and Permentier offer their readers anything that they did offer detailed instructions of how their book Met zoveel woorden. not already know. Most of the expressions should be used. In addition, the inside front Gids voor trefzeker and phrases included in their book are and back covers have a brief outline of the taalgebruik known to most native speakers of Dutch, instructions with three sample questions Rik Schutz en Ludo they say. Mzw does not include proverbs, divided into two or three steps. Assuming Permentier because they hardly occur in normal Dutch that most users will refer to the outline, I Amsterdam: Amsterdam texts. Mzw deliberately does not contain will put it to the test. The test will show how University Press, 2016 synonyms, the argument being that good the book seeks to be a guide to well-chosen Paperback, 603 pages, €29.95 collections of synonyms are already language use. available. The first search question the authors ISBN 9789462981805 Alright then, Mzw is not a (traditional) give is ‘how can you express a concept http://en.aup.nl/ dictionary, it’s not a textbook, and it’s not in an evocative way?’ Reading all the books/9789462981805-met- a dictionary of proverbs, or of synonyms. time about a certain administration, the zoveel-woorden.html So, what is it? The book’s subtitle gives concept beïnvloeding ‘influencing’ springs an indication of what the book wants to mind. Step 1 makes it necessary for to achieve. It is meant to be a “guide to me to figure out the most common word well-chosen language use”. Mzw wants for that concept. Note that this step is to help its users to improve their Dutch completely intuitive and thus subjective. I writing. It would seem to aim at a general am afraid that it will often require a more public of people who want to write in Dutch, than average command of Dutch. If there to translate into Dutch, or simply to have is a more common word for beïnvloeding, a good time reading about Dutch (VII). I cannot think of it. Maybe it is the most The authors grouped together the wealth of common word then? Time for step 2, expressions and phrases that are available which involves looking up the word to express oneself in Dutch in well-chosen, in the dictionary section. Beïnvloeding apt terms. The book provides the user with is not listed, however, nor is the verb a useful and often illuminating overview beïnvloeden ‘influence’. Okay, let’s of expressions and phrases that belong to a think again, manipulatie ‘manipulation’ certain concept. Here we see the difference is not the most common word for the between Mzw and traditional dictionaries. concept beïnvloeding, I would say, but If Mzw were a dictionary, it would be an it comes close. Manipulatie appears to onomasiological one. The authors seek to be not listed either, but manipuleren assist writers and translators by offering ‘manipulate’ is. Under manipuleren we

2017 them ways to enforce concepts and to find the expression iets naar je hand zetten intensify their meaning as well as ways to ‘force/bend something to one’s will.’ That aptly word a concept. To meet their goals, is an expression one could certainly use they provide 2,800 intensifications and to liven up a text. The dictionary section 5,700 idiomatic expressions. Next to the provides for every expression a most dictionary part, there are three registers that welcome example sentence that should should help the user find what he or she is make clear in what context it might be looking for. used. Other than from the Internet, the The book offers its users what the authors do not provide the exact sources authors call a grabbelton, a lucky bag, or of their examples. The example given here

Kernerman Dictionary News, July a grab bag, from which the reader may is probably indeed from the Internet, and 29

globaLex maybe even quoted from a text on a certain worthwhile to take this third step if only Globalex mid-2017 administration: for your own pleasure, because that’s what De achterliggende visie is dat de Mzw abundantly gives to anyone interested Following GLOBALEX 2016 wereld er is voor de mens, dat het tot in Dutch expressions. Workshop (that was co-located de basisrechten van de mens behoort de There is a third, fairly short, register that with LREC, http://ailab.ijs.si/ wereld naar zijn hand te zetten. is not discussed on the inside covers. This globalex/), the new Globalex ‘The underlying vision is that the world register claims that it will enable the user to website (http://globalex.link/) is there for man, that it belongs to the find any of the over two thousand lemmata and preparatory board began basic human rights to force the world to used to arrange the 7,500 expressions in to operate in June. The board one’s will.’ the dictionary section. The headwords are consists of representatives The second search question is: ‘how can grouped in eleven categories ranging from of the five continental you express the meaning of a word more Aandacht: valt het wel op? ‘Attention: lexicography associations strongly?’ Step 1 merely involves looking does it attract attention?’ to Verstand: up the word you would like to intensify kan het worden begrepen? ‘(Power of) (Danie Prinsloo, Afrilex; in the dictionary section. Let’s see what reason: can it be understood?’ Does the Edward Finegan, DSNA; Ilan we find under bekend ‘known, familiar’ third register help us find the concept Kernerman, Asialex; Julia (that is step 2). Above a dotted line we ongeduld ‘impatience?’ It does, ongeduld Miller, Australex; Lars Trap find the intensifications overbekend, ‘very is under the category gevoel ‘feeling’, but Jensen, Euralex) and of the well-known, widely known’, welbekend you don’t need the register to check that eLex conference series (Iztok ‘well-known, familiar’, and wijd en zijd it’s in the dictionary section. So, why this Kosem and Simon Krek). The bekend ‘widely known, known far and register? The best I can think of is that the members have been holding wide.’ Below the line we find no less than different categories may stir up the users’ skype meetings about once eleven expressions. The two following inspiration. a month, usually having to examples may serve to illustrate the wide Browsing through Mzw and writing alternately miss someone due semantic range the expressions in Mzw may this review I realized that I have got out have: bekend zijn als de bonte hond ‘have of the habit of consulting paper reference to the time difference between a bad reputation, be notorious’ and publiek works. The only paper reference works on Australia and West Coast USA. geheim ‘open secret.’ The first example my desk are those that are not available The initial outcomes so shows that it’s not only words that intensify, electronically. It appears that roughly the far were mainly in form of phrases can do the same. same collection of headwords is available co-funding the website hosting The third question for the user to ask the on http://onderwoorden.nl/intensiveringen/. (USD20 annual per association) book is ‘which words can be intensified The online version is more extensive than the and initiating mutual greetings by the word ….?’ Let’s take the adjective book when it comes to example sentences. and some participation in duivels ‘devilish.’ Step 1 requires a search On the other hand, online one does not find each other’s conferences in in the first register, which goes from the many intensifying expressions that the Intensifier > Expression > Headword. book so generously provides. 2017. Most practically, the There we find, and that completes step Would I use Mzw myself? Yes, I certainly website is meant to function as 2, two collocations: duivels dilemma would. Mzw is a very useful, or even a hub for the publications of ‘diabolic dilemma’ and duivels plezier indispensible book for writers in Dutch and its members and others. The ‘sinful/wicked pleasure’, with references translators into Dutch who want to liven up next milestone is submitting to dilemma and plezier respectively. It is their language. The user may occasionally a proposal to hold the second worthwhile to follow the references, you find that the collection of concepts and GLOBALEX Workshop at may find out that you don’t even need the expressions is not complete, but then the LREC 2018 in Mizayaki, Japan word duivels. authors do not pretend to be complete. The (http://lrec2018.lrec-conf.org/ The fourth search question the authors Introduction and even the brief outline on en/). This might be held in give is ‘what was this expression again the inside covers are excellent guides to the cooperation with the Global with ...?’ So, what was this expression book’s content. WordNet Association with the again with woorden ‘words’? For the first To the best of my knowledge Mzw 2017 step, we need the second register, Sorting currently is the only printed collection of main topic of Lexicography word > Expression > Headword. Under the Dutch intensifications and intensifying and WordNets. Details will be ‘sorting word’ woord ‘word’, we find iets expressions. Providing access to Dutch released in Q4 of 2017. met zoveel woorden zeggen ‘say something idioms through meaning has been done in so many words’ (step 2). It’s listed three before, though not exactly in the way Schutz Ilan Kernerman times, referring to the headwords duidelijk and Permentier did. Their approach makes ‘clear’, expliciet ‘explicit’, and nadrukkelijk Mzw the only Dutch of its ‘express’ respectively. The third step kind. involves checking the headwords referred to in the dictionary section. Again, it’s Anne Dykstra Kernerman Dictionary News, July 30

A brief account of ASIALEX 2017

The 11th International Conference of We also organized two workshops, on the Asian Association for Lexicography Sketch Engine and DPS5, run by Miloš (ASIALEX 2017) was organized by Jakubíček, of Lexical Computing, and the National Key Research Center for Holger Hvelplund, of IDM, respectively. Linguistics and Applied Linguistics at The theme of ASIALEX 2017 was Guangdong University of Foreign Studies Lexicography in Asia: Challenges, and held in Guangzhou, China from June Innovations and Prospects. The time is 10 to 12. As the organizer of its first ripe to recognize Asian achievements in international conference in 1999, we were lexicographic research and practice over proud to host the ASIALEX conference the past 20 years, and to look ahead to see again after it had traveled around nine Asian how we can respond to new challenges countries and regions. of the revolutions in corpus linguistics The year 2017 is a milestone for the Asian and digital lexicography. In the keynote Association for Lexicography, celebrating speeches, Huang and Abel spoke on the the 20th anniversary of its foundation. For common theme of the dictionary user’s ASIALEX 2017, we received felicitations orientation/participation in the digital age, Hai Xu from the presidents of our global sister and Rundell and Miller discussed extended associations AFRILEX, AUSTRALEX, units of meaning, or phraseology, which DSNA, and EURALEX. lexicographers are increasingly aware of The keynote speakers included: as representing the norm, rather than the ● Prof. Jianhua Huang of Guangdong exception, in language. The issues addressed University of Foreign Studies, (the by these speakers consist of cutting-edge First President of ASIALEX) concerns, and most certainly deserve our ● Prof. Andrea Abel of EURAC closest attention. Research, President of EURALEX The conference was met by high ● Dr. Julia Miller of Adelaide University, enthusiasm of scholars and publishers President of AUSTRALEX from Asia and beyond. As one of the ● Dr. Michael Rundell, Editor-in-Chief largest conferences in its series, ASILAEX of Macmillan Dictionary 2017 hosted around 160 participants from 75 institutes over 24 countries and regions in Asia, Europe, Africa and North ASIAThe Asian Association LEX for Lexicography America. Of the 130 abstracts submitted, the total number of papers accepted was 111. The talks roughly covered the Executive Board 2017-2019 following topics: digital lexicography, general-purpose lexicography, cognitive Rachel Edita O. Roxas • President | Vincent Ooi • Vice-President | approaches to lexicography, bilingual Shirley Dita • Secretray | Deny A. Kwary • Treasurer | GAO Yongwei, lexicography, pedagogical lexicography, LI Lan, Yukio Tono • Members | Jirapa Vitayapirak, Mehmet Gürlek specialized lexicography, and historical lexicography. We were truly indebted to • Conveners | Shigeru Yamada, XU Hai • Co-Chief Editors | Ilan the contributors and reviewers for their Kernerman • Past President hard work in bringing together such a http://asialex.org/#board remarkable meeting. While preparations for this grand event were under way, we sadly lost two great LEXICOGRAPHY Journal of ASIALEX lexicographers who were highly influential in both China and abroad: Professor Gusun Shigeru Yamada (Waseda University, Japan) and Hai Xu (Guangdong Lu of Fudan University, who passed away University of Foreign Studies, China) were appointed Co-Chief on July 28, 2016, and Professor Boran 2017 Editors as of June 2017. Zhang of Nanjing University, who passed away on May 26, 2017. They both made http://asialex.org/#journal enormous contributions to our field. To honour their achievements, we set up a special session in their memory on the Next Conferences topic of unabridged Chinese-English and ASIALEX 2018 will be held in Krabi, Thailand on 8-10 June. English-Chinese dictionaries. ASIALEX 2019 will be at Istanbul University, Turkey in June 2019. Hai Xu http://asialex.org/#conferences Guangdong University of Foreign Studies

Kernerman Dictionary News, July Convener, ASIALEX 2017 31

Diccionarios electrónicos: perspectivas para el siglo XXI

In the last generation, lexicographers and terminologists have under the joint organization of LexiCon Research Group put into practice new methods for creating more dynamic (University of Granada) and the Digital Arts Master’s Degree lexical resources. This century traditional lexicography has (Complutense University of Madrid), with the sponsorship of experienced a veritable transformation, thanks to its entry in Cosnautas, Elhuyar, K Dictionaries and Lexical Computing. the digital era. A complete revolution that has been possible The aim of this summer school is to provide a general because language industries have benefited from NLP tools, overview of new trends in the creation of electronic dictionaries corpus linguistics, cognitive science and cross-cultural studies. and terminological tools, combined with hands-on sessions Many innovative projects have shed new light on where participants can obtain practical experience in dictionary e-lexicography, such as the LEAD dictionary (Paquot 2012), design. Participants will learn about the current protocols, the ARTES bilingual LSP dictionary (Kübler and Pecman software, and practices in this field of the language industry 2012), DiCoInfo (L’Homme et al. 2012), Wiktionary (Meyer and will thus acquire some of the necessary skills to use these and Gurevych 2012), WordNet (Fellbaum 2010), FrameNet tools effectively in the development of new dictionaries. (Fillmore et al. 2003), DANTE (Atkins et al. 2010) and EcoLexicon Beatriz Sánchez Cárdenas, grupo de investigación LexiCon, (Faber et al. 2014). These lexical http://lexicon.ugr.es/ resources take advantage of all the Amelia Sanz, grupo de investigación LEETHI, https://ucm. design potential offered by electronic es/leethi tools and innovative theoretical approaches. In addition, significant References efforts were made by a wide range Atkins, B. T. S., Kilgarriff, A., and Rundell, M. 2010. of agencies and organisms to produce Database of ANalysed Texts of English (DANTE): the powerful terminological databases NEID database project. In Proceedings of the XIV Euralex such as IATE in Europe (http://iate. International Congress. Ljouwert: Afûk, 549-556. europa.eu/) or Le grand dictionnaire Bergenholtz, H. 2011. Access to and presentation of terminologique in (http:// needs-adapted data in monofunctional internet dictionaries. granddictionnaire.com/). In Fuertes-Olivera, P. A., and Bergenholtz, H. (eds.), Evidently, future generations of e-Lexicography: The Internet, Digital Initiatives and lexicographers will need to use NLP Lexicography. London: Bloomsbury (Continuum), 30-53. tools to describe language more Bowker, L. 2012. Meeting the needs of translators in the age accurately, since such resources of e-lexicography: Exploring the possibilities. In Granger, allow the lexicographer to research S., and Paquot, M. (eds.), Electronic Lexicography. Oxford: and express the real use of language Oxford University Press, 379-397. as reflected in large corpora rather Faber, P., León Araúz, P., and Reimerink, A. 2014. than rely on armchair speculations. In Representing environmental knowledge in EcoLexicon. this sense, the technical possibilities In Languages for Specific Purposes in the Digital Era. offered by the digital medium Educational Linguistics, 19. Springer, 267-301. are a source of endless potential. Fellbaum, C. 2010. WordNet. In Theory and applications of For instance, since there is a wide ontology: computer applications. Springer Netherlands, variety of profiles that deal with 231-243. different types of text and knowledge levels (Bowker 2012), Fillmore, C. J., Johnson, C. R., and Petruck, M. R. 2003. tools can now be more easily adapted to different kinds of Background to FrameNet. International Journal of users with different needs (Bergenholz 2011). Consequently, Lexicography 16.3, 235-250. the information in a lexical resource may vary, depending on Kübler, N., and Pecman, M. 2012. The ARTE bilingual LSP whether, for instance, the text that the dictionary user wishes to dictionary: From collocation to higher order phraseology. In create will be submitted to a high-impact journal or is intended Granger, S., and Paquot, M. (eds.), Electronic Lexicography. for popular dissemination of scientific knowledge. Oxford: Oxford University Press, 186-208. On the other hand, the ergonomic nature of the translator’s L’Homme, M. C., Robichaud, B., and Leroyer, P. 2012. ‘workbench’ has also greatly evolved. This is of paramount Encoding collocations in DiCoInfo: From formal to importance, since some dictionary users (e.g. translators) user-friendly representations. In Granger, S., and Paquot, devote a significant amount of time to acquiring knowledge M. (eds.), Electronic Lexicography. Oxford: Oxford 2017 in order to understand the conceptual architecture of specialized University Press, 211-236 texts. Meyer, C. M., and Gurevych, I. 2012. Wiktionary: A new Finally, the issue does not concern only how to improve rival for expert-built lexicons? Exploring the possibilities of existing tools but also how to produce new multifunctional and collaborative lexicography. In Granger, S., and Paquot, M. interactive e-lexicographic tools that would contain general, (eds.), Electronic Lexicography. Oxford: Oxford University conceptual and specialized content (Bowker 2012). Press, 259-292. With the aim of discussing these various issues, the training Paquot, M. 2012. The LEAD dictionary-cum-writing aid: course entitled “Diccionarios electrónicos: perspectivas para An integrated dictionary and corpus tool. In Granger, S., el siglo XXI” is held as part of the El Escorial Summer School and Paquot, M. (eds.), Electronic Lexicography. Oxford:

(Universidad Compultense Madrid) during 17-21 July 2017 Oxford University Press, 163-185. Kernerman Dictionary News, July Kernerman Dictionary News, July 2017 8 NahumHanavi St. Tel Aviv 6350310Israel K DICTIONARIES LTD empirical basisfortheCorpus-basedDictionary been designedexplicitlyforthis project. The annotation conventions that are employed have software developedattheUniversityofHamburg. various grammatical features using the iLex into written Polish, and tagged with respect to with theHamNoSystranscriptionsymbols,translated further segmented, glossed (lemmatized), transcribed video material obtained intherecordingsessionsis including age,gender, etc. The raw account key sociological variables and their selection hastaken into from different partsofPoland signing community:theycome representative ofthePolish participants isintendedtobe Deaf. various topics pertaining to the their experiences,anddiscussing talking aboutthemselvesand recording session, naming objects, presented tothemduringthe of picturestoriesandvideoclips tasks, suchasretelling the content more than20different elicitation by Deaf users ofPJMreacting to sign languageutterances,produced richly annotatedvideosshowing idea is to create a database of cultural heritage. The underlying Polish andEuropeanlinguistic Poles, formsanimportantpartof among thehearingmajorityof which, despiteverylimitedinterest at documentingthelanguage lexicographic descriptionofPJM. a comprehensive grammatical and vast corpusofvideo recordings — the aimtodevelop—basedona communication oftheDeaf,with unit specializing in studies on the in 2010,thisisthefirstPolish solid empiricaldata.Established to analyze PJMon the basis of of University Warsaw providedauniqueopportunity but foundingtheSectionforSignLinguisticsat highly understudiedfromthelinguisticperspective, in established Warsaw. recently, Until was PJM around 1817,whenthefirstschoolfordeafwas has evolvedwithinthePolishDeafcommunitysince P This extensive set of data has been used as the Corpus PJM of group The aims project Corpus PJM The PJM) is a natural visual-gestural language that language visual-gestural natural a is PJM) migowy, język (polski Language Sign olish Corpus-based DictionaryofPolishSignLanguage The CorpusofPolishSignLanguageandthe [email protected] Adam Kilgarriff Prize,2017. Dr Rutkowskiiswinnerofthe Family, LaborandSocialPolicy. Language attheMinistryof of thePolishCouncilforSign universities, andismember and American European made researchvisitstoleading prizes, grantsandscholarships, in Poznań,hehasbeenawarded Adam MickiewiczUniversity the Universityof Warsaw and and textbooks. An alumnusof hundred academicpublications and authorofmorethana in naturallanguagesyntax, general linguistandspecialist at theUniversityof Warsaw, a Section forSignLinguistics is thefounderandheadof Paweł Rutkowski,aged39,

ı

Tel +972-3-5468102 Paweł Rutkowski communication hasbeenquestioned fordecades. , where the full-fledged nature of sign language which isofparticularimportanceinstatessuchas the issueoffulllinguisticrightsDeafminority, grammar andlexiconisinextricably related to PJM and its derived dictionary. This is why the study of Corpus PJM the of foundations linguistic solid the opportunities, would havenotbeenpossible without available freelyathttp://slownikpjm.uw.edu.pl/en/. Paweł Rutkowski, and is an open-access publication Małgorzata Łacheta, Czajkowska-Kisil, Jadwiga Linde-Usiekniewicz and Joanna by edited was modern lexicographicalstandards. The dictionary dictionary of PJMprepared in compliance with of Polish Sign Language otiig ay or o rcre material recorded of hours many Containing

ı [email protected] life andappropriateeducational right tofullparticipation in social opportunities to communicate, the ensure that the Deafhave equal textbooks. Suchattemptsto all textsincludedintheoriginal files withPJMtranslationsof access tothousandsofvideo of computer programs offering Education. These have the form commissioned bytheMinistryof hard-of-hearing), and Deaf (including needs educational for schoolchildren with special multimedia textbookadaptations years concerns the development of of theprojectoverlastthree of thedictionaryteam. were re-recorded byDeaf members appearance the original utterances their Tostandardize Corpus. PJM signed utterancesfoundinthe examples are drawn from authentic bilingual dictionaries. All sentential than customarily provided in that is more extensive and precise i.e. providingsemanticinformation typical monolingualdictionaries, Polish, akintothedefiningstylein The definitions are written in records and describes real usage. Thanks tothat,thedictionary used byDeafsignersandhow. ascertain whichPJMsignsare the corpus makes it possible to elicited from a range of individuals, Another practicalapplication (2016), which is the first the is which (2016),

ı http://kdictionaries.com