The Workshop Programme


9:30 Welcome and Introduction

9:45 Daan Broeder et al., INTERA: A Distributed Metadata Domain of Language Resources

10:15 Laurent Romary et al., An on-line data category registry for linguistic resource management

10:45 Break

11:00 Gil Francopoulo et al., Data Categories in Lexical Markup Framework or how to lighten a Model

11:30 Francesca Bertagna et al., The MILE Lexical Classes: Data Categories for Content Interoperability among Lexicons

12:00 David Dalby and Lee Gillam, “Weaving the Linguasphere”: LS 639, ISO 639 and ISO 12620

12:30 Lunch

14:00 Robert Kelly et al., Annotating Syllable Corpora with Linguistic Data Categories in XML

14:30 Thorsten Trippel, Metadata for Time Aligned Corpora

15:00 Ricardo Ribeiro, How to integrate data from different sources

15:30 Kiyong Lee et al., Towards an international standard on feature structure representation (2)

16:00 Ewan Klein and Stephen Potter, An Ontology for NLP Services

16:30 Discussion and Panel: Evaluation of Metadata and Data Categories

Workshop Organisers

Thierry Declerck, Saarland University and DFKI GmbH
Nancy Ide, Vassar College
Key-Sun Choi, KAIST
Laurent Romary, LORIA

Workshop Programme Committee

Maria Gavrilidou, ILSP
Stelios Piperidis, ILSP
Daan Broeder, Max Planck Institute
Peter Wittenburg, Max Planck Institute
Nicoletta Calzolari, ILC
Monica Monachini, ILC
Claudia Soria, ILC
Khalid Choukry, ELRA/ELDA
Mahtab Nikkhou, ELDA
Kiyong Lee, Korea University
Paul Buitelaar, DFKI
Andreas Witt, University of Bielefeld
Scott Farrar, University of Bremen
William Lewis, University of Arizona
Terry Langendoen, University of Arizona
Gary Simons, SIL International
Eric de la Clergerie, INRIA

Table of Contents

INTERA: A Distributed Metadata Domain of Language Resources. Daan Broeder, Maria Nava and Thierry Declerck

An on-line data category registry for linguistic resource management. Laurent Romary, Nancy Ide, Peter Wittenburg and Thierry Declerck

Data Categories in Lexical Markup Framework or how to lighten a Model. Gil Francopoulo, Monte George and Mandy Pet

The MILE Lexical Classes: Data Categories for Content Interoperability among Lexicons. Francesca Bertagna, Alessandro Lenci, Monica Monachini and Nicoletta Calzolari

“Weaving the Linguasphere”: LS 639, ISO 639 and ISO 12620. David Dalby and Lee Gillam

Annotating Syllable Corpora with Linguistic Data Categories in XML. Robert Kelly, Moritz Neugebauer, Michael Walsh & Stephen Wilson

Metadata for Time Aligned Corpora. Thorsten Trippel

How to integrate data from different sources. Ricardo Ribeiro, David M. de Matos, and Nuno J. Mamede

Towards an international standard on feature structure representation (2): Kiyong Lee, Lou Burnard, Laurent Romary, Eric de la Clergerie, Ulrich Schaefer, Thierry Declerck, Syd Bauman, Harry Bunt, Lionel Clément, Tomaz Erjavec, Azim Roussanaly, and Claude Roux

An Ontology for NLP Services: Ewan Klein and Stephen Potter

Author Index

Syd Bauman, Francesca Bertagna, Daan Broeder, Harry Bunt, Lou Burnard, Nicoletta Calzolari, Lionel Clément, David M. de Matos, Eric de la Clergerie, David Dalby, Thierry Declerck, Tomaz Erjavec, Gil Francopoulo, Monty George, Lee Gillam, Nancy Ide, Robert Kelly, Ewan Klein, Kiyong Lee, Alessandro Lenci, Nuno J. Mamede, Monica Monachini, Maria Nava, Moritz Neugebauer, Mandy Pet, Stephen Potter, Ricardo Ribeiro, Laurent Romary, Azim Roussanaly, Claude Roux, Ulrich Schaefer, Thorsten Trippel, Peter Wittenburg

INTERA: A Distributed Metadata Domain of Language Resources

Daan Broeder, Maria Nava+, Thierry Declerck++

Max-Planck-Institute for Psycholinguistics, [email protected]; +Evaluation and Language Resources Distribution Agency; ++University of Saarland

Abstract
The INTERA and ECHO projects were partly intended to create a critical mass of open and linked metadata descriptions of language resources, helping researchers to understand the benefits of an increased visibility of language resources on the Internet and motivating them to participate. The work was based on the new IMDI version 3.0.3, which is the result of experiences with the earlier version and of new requirements coming from the involved partners. Distribution centers such as ELDA have the opportunity to use and add to this metadata infrastructure and to use it to enhance their catalogues and offer more services to their customers, such as data samples and the download of partial corpora. This document mainly summarizes experiences gained in the INTERA project.

Introduction

At LREC 2000 in Athens the first workshop about metadata concepts for making language resources visible in and discoverable via the Internet (see http://www.mpi.nl/ISLE) was organized by some of the authors. At LREC 2002 two groups demonstrated operational frameworks for creating metadata for language resources and for working with them for management and discovery purposes. While OLAC (Open Language Archives Community, http://language-archives.org) started from a Dublin Core (http://dublincore.org/) point of view, with the goal to create a set that allows for the description of all types of language resources, software tools, and advice, the IMDI (ISLE Metadata Initiative, http://www.mpi.nl/IMDI) activities started with a slightly different approach. The focus was primarily on multimedia/multimodal corpora, and a more detailed set was worked out that can be used not only for resource discovery but also for exploitation and for managing large corpora. Most importantly, IMDI allows its metadata descriptions to be organized into linked hierarchies supporting browsing and enabling data managers to carry out a variety of management tasks.

The two years since 2002 have been used to improve the metadata sets based on the experience of the communities. They have also been used to create an interoperable domain, i.e., a mapping schema was worked out between the IMDI and OLAC sets and the IMDI domain acts as an OLAC data provider. IMDI records can be searched from the OLAC domain.

IMDI Metadata Set 3.0.3

Based on the experiences and on a broad discussion process including field linguists, corpus linguists and language engineers, the IMDI set 3.0.3 was designed as part of the INTERA project (Integrated European language Resource Area, http://www.elda.fr/index.html) and is available as an XML schema. It was adapted to simplify the content description, and the artificial distinction between collectors and other participants, probably influenced by Dublin Core, was removed. Three major extensions were applied. First, it is now possible to describe written resources that are not annotations or descriptions. This was necessary, since most language collections contain written resources in the form of field notes, sketch grammars, phoneme descriptions etc. Second, as a consequence of long discussions with participants of the MILE initiative (see http://www.ilc.cnr.it/EAGLES96/isle/complex/clwg_home_page.htm), it is now possible to describe lexicons with a specialized set of descriptor elements.

Third, it is now possible to define and add project-specific profiles. The earlier IMDI version already supported extensions at various levels in the form of user-defined category–value pairs, i.e., the user was able to define a private category and associate values with it. This feature was used by individuals and also by projects to include special descriptors; however, these descriptors were not fully supported by the IMDI tools. In the new version, projects or sub-domains such as the Dutch Spoken Corpus or the Sign Language community can define a set of important categories, and these are supported while editing and searching.

Therefore, IMDI consists of its core definitions, which have to be stable to assure users that their work will be exploitable even after many years, and of sub-community specific extensions, which nevertheless are the result of discussion processes.

A new direction is also given to IMDI in INTERA, which foresees the linking (or merging) of metadata for language data with descriptors in use in catalogues for language technology tools, like the ACL Natural Language Registry hosted at DFKI (http://registry.dfki.de).

IMDI Catalogue Metadata

The design and development of the IMDI metadata set was directed at offering adequate descriptions at the level of resources. However, it was recognized at an early stage that there is a need to describe whole collections of language resources at the level of a finished or published corpus. During an IMDI workshop in 2001 a proposal for the IMDI Metadata Elements for Catalogue descriptions (documentation available at http://www.mpi.nl/IMDI/documents/Proposals/IMDI_Catalogue_2.1) was presented, based on information from the Evaluation and Language Resources Distribution Agency (ELDA) and the Linguistic Data Consortium (LDC).

The description of language resources for distribution purposes is essential for data centres. Catalogue management is the core activity of data centres and this function is reflected in their own metadata. Beside the description of the content, i.e. the data categories offered within a language resource, it is vital to supply searchable information to potential users trying to locate corpora as units, for instance for a specific application, from a specific source or distributed under a specific license.

At catalogue management level there are special requirements that are connected with aspects of the dissemination activity that are particular to distribution agencies. A data center like ELDA uses classes of descriptors that account for different features of a corpus as a whole. From this point of view, effective metadata should contain the following information:
• Identification of the language resource;
• Description of the data;
• Author(s) and editor(s) of the data;
• Objectives for creating the data and intended purpose of the data;
• Data sources and how the data was created;
• Accuracy and reliability of the data;
• Distribution and contact information, including prices and licensing policies.

In ELDA's metadata, these categories are used alongside other classes of descriptors that account for the content of any particular resource (speech, written or multimodal corpora, lexica, terminologies, etc.).

The set of IMDI Metadata Elements for Catalogue descriptions accounts for the need of offering distribution information, though in a less detailed, flatter representation than the one used by data centers like ELDA. Practically all the metadata classes mentioned above are reflected in the IMDI catalogue descriptions, so that it is possible to specify information like the size of whole corpora, the physical medium of the corpus (CD/DVD), prices, etc.

In particular, specific descriptors accounting for the (foreseen) use of the corpus were introduced. Usually, corpora are created with a specific use in mind and, in that case, it is natural to make this information available at the level of the whole corpus as a list of possible "application domains".

The comparison of the IMDI and ELDA metadata sets has also highlighted the need for another specific category of descriptors. The introduction of metadata elements supplying information on feature distribution is currently under study. This class of metadata would be specific for describing varying parameters across a corpus, where their overall distribution is important in order to determine whether a corpus may be suitable for a certain purpose. Among these feature distribution metadata are:
• Age/Gender distribution of participants (age classes, number of age classes, etc.);
• Language distribution (number of languages, percentage of languages represented in a corpus, etc.);
• Text-type/Genre distribution.

These distribution parameters are particularly important when there is no means of making selections with the desired characteristics directly from the corpus. The metadata elements accounting for distributional features are currently being formalised and described.

The IMDI Framework

Tools

The IMDI initiative also offers a set of tools for the latest metadata set version 3.0.3 (all tools are Open Source and available at http://www.mpi.nl/tools or http://www.mpi.nl/IMDI):

(1) The IMDI-Editor, which allows users to create fully IMDI compliant metadata descriptions and supports all IMDI features such as controlled vocabularies and project-specific profiles.
(2) The IMDI-Browser, which allows navigating in the distributed domain of linked metadata XML files, supporting searching as well as browsing, the setting of bookmarks etc. (fig. 1), and a tree-builder that allows the user to create new user-specific virtual trees by linking arbitrary metadata descriptions and creating arbitrary nodes.
(3) For large archives with a web server, on-the-fly transformed HTML presentations of the metadata files that allow users to browse the linked metadata domain with normal web browsers (fig. 2). Different sites may implement different ways of presenting the IMDI domain.
(4) Software for a service that offers access to IMDI records according to the OAI metadata harvesting protocol.

Distributed Metadata

Metadata used for discovery (metadata search) purposes is distributed over several locations. Often, to offer an effective metadata discovery service, the metadata needs to be brought together at a single site. The OAI model (see http://www.openarchives.org/ and, for OLAC, http://www.language-archives.org/) defines data and service providers related via the metadata harvesting protocol, which defines the interaction pattern and the metadata record packaging. The data providers all have to minimally provide Dublin Core records to achieve a minimal level of semantic interoperability. However, the OAI protocol also allows sending records specified by another schema such as IMDI. Based on this, service providers can build services, for example a metadata search service that covers a large group of different repositories working internally with different metadata sets.

The OAI protocol is comparatively simple to implement; the common praxis is still to harvest XML files. The ECHO project has shown that most of the institutions are not yet prepared to support OAI. The IMDI metadata infrastructure assumed from the beginning that metadata records can be located at various institutions – even on the notebook of remotely working fieldworkers. Therefore, IMDI metadata records can be linked in a simple way – similar to web sites. The browser only needs a registered URL to integrate the IMDI descriptions into the domain. For searching, the IMDI tools will scan all known metadata links and create indexes that can then be exploited.

However, all IMDI tools expect IMDI-type metadata records, i.e., IMDI is not a concept for establishing interoperability between different metadata sets. Within the ECHO project an integrated metadata domain was built that includes ten different repositories from five different disciplines. It was shown that interoperability at the structural level was mainly achieved by harvesting XML-structured files and at the semantic level by creating special mappings (Wittenburg, 2004a/b). The Dublin Core approach reduces too much of the semantic richness of the provided information. Therefore, it is seen just as another view on the data.

Figure 1: The interface of the XML-based special browser that offers advanced functionality.

INTERA Metadata Search Infrastructure

Usually metadata infrastructures depend on one or several sites harvesting the metadata records, storing them in a database and offering users a (web) interface to search them, such as the already mentioned OAI model. INTERA is no different, but the IMDI tools also offer access to metadata that is present on the local machine, without any network access, thus empowering users to create their own (small) metadata repositories.

When designing the IMDI infrastructure a requirement was that no external database software should be required so that metadata search can be done with the browser without network connectivity for local corpora. Of course it is possible to search remote IMDI repositories also from the same interface.
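The linked-tree organization just described lends itself to a very simple local indexing strategy. The sketch below is illustrative only and uses hypothetical element names ("CorpusLink", "Session", "Genre"); it is not the actual IMDI tool code, but it shows how metadata files connected by plain HTTP links could be descended from a top node and indexed in memory, without any external database.

import urllib.request
import xml.etree.ElementTree as ET

def crawl(url, index, seen=None):
    # Descend the linked metadata tree and index selected descriptor values.
    seen = seen if seen is not None else set()
    if url in seen:
        return index
    seen.add(url)
    root = ET.fromstring(urllib.request.urlopen(url).read())
    for session in root.iter("Session"):
        for field in ("Genre", "Language", "Country", "Continent"):
            for elem in session.iter(field):
                value = (elem.text or "").strip()
                index.setdefault((field, value), []).append(url)
    for link in root.iter("CorpusLink"):          # follow plain HTTP links to sub-nodes
        crawl(link.text.strip(), index, seen)
    return index

index = crawl("http://example.org/top-node.xml", {})
hits = index.get(("Genre", "interview"), [])      # local search is then a dictionary lookup

Because the links are ordinary URLs, the same walk works for a corpus tree mirrored on a fieldworker's notebook or served by a normal web server.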

Figure 3 (at the end of the paper) shows the metadata search infrastructure as it is foreseen and being realized in INTERA.

(1) There are multiple sites that store metadata of language resources (which may themselves or may not be available on the net also). These metadata records are available in the form of IMDI structured trees, so they can all be accessed via the top node for a site and can be found by descending the tree.
(2) Some sites can harvest the IMDI records in this way, for example to construct catalogues and/or provide faster metadata search facilities. In the figure ELRA is such a harvesting site.
(3) Metadata search can be performed in different ways, using specific tools such as the IMDI-Browser, which is able to access and search IMDI metadata records that a user has created on a local machine (mysite) and also remote metadata repositories such as MPI that offer IMDI records either from local data or from metadata that has been harvested from other sites. Sites that harvest metadata can store these in a DBMS and offer search facilities through normal WWW browsers.
(4) The definitions and explanations of the terms used in the IMDI metadata set are stored in a central Data Category Repository (DCR); at the moment only access through a normal web browser is possible. However, in future projects we hope to realize interaction with the DCR from special tools so that we can, for example, upload definitions for new metadata descriptors from the IMDI-Editor.
(5) Sites that harvest IMDI metadata records can themselves be harvested by OAI/OLAC service providers. In this way the IMDI records become available to a wider audience.

Figure 2: The interface for browsing in the metadata domain with the help of normal HTML presentations.

Critical Mass of Metadata

Within the INTERA project it was the task to convince various data centers and projects to participate in building a distributed IMDI domain. Typically, these data centers have language resources from the area of language engineering. In the meantime, metadata is generated by the following institutions: European Language Resource Agency (Paris), Institut National de la Langue Francais (INALF, Nancy), German Center for Artificial Intelligence (DFKI, Saarbrücken), University of Saarbrücken (Saarbrücken), Bavarian Speech Archive (Munich), Meertens Institute (Amsterdam), University of Florence, Institute for Language and Speech Processing (Athens), Istituto di Linguistica Computazionale (Pisa), University of Ljubljana, University of Sofia, University of Iasi and the Max-Planck-Institute for Psycholinguistics (Nijmegen).

In the ECHO project (European Cultural Heritage Online, http://www.mpi.nl/echo) one of the tasks was to motivate researchers and institutions to create metadata descriptions of resources that can be seen as part of our heritage. Here the following institutions can be mentioned: University of Helsinki, Phonogrammarchiv Vienna, University of Groningen, University of St. Petersburg, Kotus (Helsinki), Sweden's national Dialect Archive, European Sign Language Communities (Stockholm, London, Netherlands, Germany), University of Utrecht, University of Uppsala, University of Stavanger, University of Lund, DOBES Programme (Nijmegen).

This new emerging domain, including the activities of about 27 partners, includes textual corpora, national speech corpora, multimedia/multimodal corpora, parallel corpora, lexicons and various types of written resources.

Yet we do not have a final estimate of the number of individual resources that will be described and available at the end of 2004, when the two mentioned projects will be finished. At the Max-Planck-Institute there are currently about 30.000 sessions described by metadata. Large corpora such as SMARTKOM, the Dutch Spoken Corpus, the LABLITA corpus and the ATILF corpus will be part of the new domain, so we can expect that there will be many more resource units described and therefore searchable.

It is hoped that this emerging domain is large enough to demonstrate the usefulness of metadata for discovery purposes and that it will inspire others to participate. The ENABLER overview (http://www.enabler-network.org/) has clearly indicated that there is a lack of visibility of language resources on the Internet and that their accessibility is even worse. Therefore, the creation of metadata must be a high-priority program to foster re-usage. In a declaration agreed upon at the ENABLER meeting in Paris in 2003 it was stated that funding agencies should make the generation and integration of proper and openly available metadata descriptions, according to one of the two currently existing standards (OLAC or IMDI), obligatory.

Metadata Creation Process

In the first phase of INTERA and ECHO, various European data centers and research institutions were approached as to whether they were interested in participating in creating an integrated metadata domain. The initiative had a good response, i.e., most reacted in a positive way. However, the knowledge about the principles and goals of metadata creation and the expectations were very different. Some expected a larger amount of funding support and did not see that metadata is not meant to clean up the state of their repositories.

Most of the data centers that finally participated were aware of the relevance and concept of metadata. Therefore, there was no need for intensive training programs. However, since these centers with large corpora were already using header-type information or some internal database, it was not evident for them that IMDI does not only require metadata records. To create a browsable domain as well, it is necessary to create a linked hierarchy of metadata descriptions and meaningful nodes that represent abstract concepts such as "language", "genre" and "age". It would be possible in IMDI to just deliver metadata records, simply create one node representing the institution and link all descriptions to this one node. But that would lead to long and unstructured lists that are not useful for browsing. To help create such meaningful hierarchies, programs would be necessary that create abstractions from the metadata descriptions semi-automatically.

The experiences with projects and institutions in the ECHO project were different. Here training courses and introductions were necessary to inform the researchers about all aspects of standardized metadata. In general these groups had to start from scratch, since they did not work with formal metadata beforehand. Metadata creation then means a considerable amount of work, since interviews are required and analysis work is needed to fill in the values for the metadata elements.

In special cases such as the Sign Language community, a discussion process was initiated that led to additional categories that were absolutely necessary. Only with categories such as "Father.deafness" would metadata be easily exploitable by the members of that specific community. Therefore, the concept of project- or community-specific profiles was introduced.

Problems

The efforts needed to create metadata descriptions varied considerably, as did the available skills to write scripts to semi-automatically create basic information that can be enhanced manually. Although the IMDI infrastructure offers an editor with useful options to increase efficiency, such as storing and re-using blocks of information, manual metadata creation is very time consuming and often not feasible.

The experience showed that it is much easier for researchers to use spreadsheet tools such as EXCEL to create and manipulate a large set of records. The same is true for experienced people who prefer to use scripts to create the metadata records. However, these techniques in general create metadata of bad quality. The following types of problems were encountered:
• There is no guarantee that scripts produce well-formed XML files.
• The character encoding is often not correct.
• Most problematic is that the tools used do not provide support for the controlled vocabularies, leading to typos, spelling variants and many others.

It is the service provider who has to invest time to check the correctness of the produced metadata records and to improve them in collaboration with the data providers. The OAI model for metadata harvesting only requires a validation at the moment of registration and simply points to the errors. This may in general not be sufficient; without additional help many of the data providers would stop.

Improving the content of the metadata descriptions is very important for successful searching. Two phenomena can be observed: (1) since metadata creation is a hard job, even in evident cases elements are not filled in; (2) as already indicated, all kinds of variations can be found, since the creators partly do not make use of controlled vocabularies.

First, in a very large collection it is a problem to identify such errors or missing values. Second, how can they be corrected without starting time-consuming interactions with the various data providers? To detect errors and variants it makes sense to first run a validation against the controlled vocabularies. Until now, however, the errors have to be corrected manually. Methods that use formal closeness (one character difference) or other types of heuristics were not yet applied. Variants that occur due to language differences (for example Afrique, Afrika, Africa) could be corrected if one had suitable online terminology databases.

Third, filling in empty elements is even more difficult, since there can be many reasons why elements were not used. Until now these cases were identified by accident, i.e., by someone inspecting metadata records and finding that, for example, the country is filled in but not the continent. A script using geographic thesaurus information could very easily add information in such a simple case. If the "genre" field, however, is not filled in, there is no simple way to identify this except by producing long lists. Still, it would not be evident how such fields have to be filled in, since only the researchers can do this.

Another aspect that was found during the metadata creation work is that many institutions are looking for institutions that can store their collections. They do not have the human resources to organize them and maintain them in a proper state so that they can be used by others. So we need ingest tools that allow researchers to hand over their data to another institution in an easy way. At the MPI such a system is currently in the works. Ingestion will be tightly combined with metadata creation.

Future

Much effort is invested to create and maintain metadata descriptions, and it is expected that projects such as INTERA and ECHO will help to increase the awareness that metadata is very important. Therefore, we have to assure that the investments will be maintained over a long period.

All IMDI categories were registered within the emerging ISO TC37/SC4 data category repository. In doing so, semantic definitions are carried out in a widely agreed and machine-readable way. It is expected that OLAC and TEI categories will also be entered in the same way. This would give all definitions a higher degree of stability. It would also allow us to make the semantic mapping between the categories explicit, and it would open the possibility that researchers create their own mappings between categories and even develop their own metadata sets by re-using the existing and well-defined categories.

It is expected that creating metadata will also become more attractive when new applications become available. The INTERA project has as one other goal to link the domain of language resources with that of tools that operate on such resources. The MIME type concept is not new; however, the requirements go far beyond this. Bundles of resources have to be processed by tools combining several of them in one step. Characteristics of resources such as their annotation schemes are relevant to detect the most useful tool. Within the INTERA project an interaction between the IMDI domain and the ACL tool registry (ACL Software Registry: http://registry.dfki.de) is being developed that is based on the open Language Resource Exchange Protocol (LREP) that is currently being defined within the INTERA project.

Conclusions

In this paper we presented the metadata creation work in the INTERA and ECHO projects and the experiences that were made. The creation of high-quality metadata descriptions in general costs more effort than was originally expected. The fact that many researchers still see metadata creation as overhead makes infrastructure projects of this sort a difficult, but nevertheless important, enterprise.

A sufficiently large metadata domain is expected to become available this year. To convince other institutions and individuals to contribute to this domain, more utilities have to be developed to easily create large sets of metadata descriptions, to derive corpus structures semi-automatically and to enrich the content.

These structures can be easily distributed over different physical locations, since the connections between the tree nodes are based on HTTP links and the tree nodes and metadata records are disseminated by normal web servers. The purpose of these tree structures is to allow users to browse the available metadata and make sub-selections for later metadata search.

Obviously some special sites or portals are needed to give users entry points to the metadata domain.

References

Broeder, D.G., Brugman, H., Russel, A., and Wittenburg, P. (2000). A Browsable Corpus: accessing linguistic resources the easy way. In Proceedings of the LREC 2000 Workshop, Athens.
Wittenburg, P. (2001). Lexical Structures. DOBES internal document. MPI Nijmegen.
Wittenburg, P. (2003). WP2-TR16-2003 Version 3, Note on ECHO's Digital Open Resource Area. http://www.mpi.nl/echo/tech-report-list.html
Wittenburg, P. (2004). WP2-TR17-2004 Version 1, Note on an ECHO Ontology. http://www.mpi.nl/echo/tech-report-list.html


Figure 3: INTERA Search Infrastructure (diagram: Dublin Core and IMDI metadata harvesting via OAI/OLAC; Data Category definitions in a central DCR; browser- and editor-based queries against local (MYSITE) and remote (MPI, ELRA) IMDI metadata repositories over the WWW).

An on-line data category registry for linguistic resource management

Laurent Romary+, Nancy Ide^, Peter Wittenburg# and Thierry Declerck*

+Laboratoire Loria, ^Vassar College, #MPI and *Saarland University and DFKI GmbH
Laurent [email protected]

Abstract
This document, consisting of an informal merging of papers by Nancy Ide et al. and actual work done in the INTERA project, describes the principles and technical framework for deploying an international data category registry (DCR) in the domain of language resources. We demonstrate the potential usages of data categories for modelling linguistic representations and annotation schemes and describe mechanisms for localizing the descriptive content of data categories. To exemplify the use of these mechanisms, we demonstrate their application to the IMDI metadata set in the context of the INTERA project. We also show the first implementation of an on-line environment that implements the DCR and is currently available to ISO sub-communities to register their own practices before being widely deployed under the auspices of ISO committee TC 37.

Introduction

The description of the metadata attached to any kind of language resource or tool cannot be based on a fully fixed set of fields. The variety of possible types of resources and tools makes it necessary to think of metadata as the combination of a reference set of descriptors (or data categories) that ensures that any two metadata sets using the same data category will be interoperable, while preserving flexibility in combining categories according to the archiving or distribution requirements of the corresponding language resources. For instance, the information describing a lexical database at ILC, Pisa will likely not be the same as the information used to describe recordings of speech data by field linguists at MPI, Nijmegen, or a part-of-speech tagger in the ACL Natural Language Software Registry at DFKI, Saarbrücken (see http://registry.dfki.de). This holds true beyond the metadata level; information included in linguistic annotation of any kind is likely to vary considerably from site to site and, especially, from application to application, depending on the underlying language, context of use or theory. Therefore, any international standardizing effort must offer methods and concepts for designing linguistic formats or annotation schemes that strike the optimal compromise between interoperability across similar applications and flexibility in a specific application.

ISO committee TC 37/SC 4 has initiated an effort to establish a data category registry (DCR) for language resources. The effort implements a combined strategy that relies on general principles for the design of linguistic formats, as stated in the Linguistic Annotation Framework project (Ide & Romary, 2003; Ide et al., 2003), together with a general infrastructure for registering and disseminating data categories at an international level. This strategy is intended to allow the implementer of a language archive to compile his/her own data categories, by either choosing those available in the DCR or defining his/her own using standard mechanisms and formats, as well as to compare site-specific categories with those available in the Registry. This paper demonstrates how the ISO infrastructure can be implemented in a variety of applications and, in particular, its implementation in the INTERA project (see http://www.elda.fr/rubrique22.html) for standardizing metadata descriptors. The paper addresses the following issues:
• The possibilities for use of the framework that has been approved in ISO TC 37/SC 3 (ISO CD 12620-1) in order to provide localization mechanisms covering a wide variety of user needs;
• The adaptation of the IMDI metadata set (see http://www.mpi.nl/IMDI/) to this framework, identifying how the information in the IMDI specification can be expressed as data categories (in the sense of ISO 12620-1) and how it can be instantiated in the standardized format;
• The localization of names given to the IMDI metadata set;
• The on-line environment for browsing the defined categories, through which all results from the INTERA project are made available.

Using data categories to describe language resources

By definition, a data category is an elementary descriptor that can be used to specify and implement a linguistic annotation scheme in the broadest sense, which includes:
• descriptive information attached to a language resource or tool (metadata), as well as information used to describe linguistic features at any level (morpho-syntactic, syntactic, discourse, etc.);
• information concerning the provenance of the annotation, e.g., whether it was produced manually (e.g., via hand annotation or transcription of spoken data) or automatically (e.g., as the output of a POS tagger);
• indication of whether the descriptor is a placeholder for some value (e.g. /grammatical gender/, /content modality/) or a possible value for a placeholder (e.g. /masculine/ or /speech/).

In the context of the description of linguistic formats, the role of data categories is two-fold:
• First, they provide a uniquely identified reference for implementers, who can utilize the data categories through, for example, the use of Formal Public Identifiers, in order to ensure immediate interoperability.
• Second, the data categories in the registry serve as documentation for an annotation scheme, by providing all the necessary information (definition, examples, etc.) to make the semantics of a given data category as precise as possible.

A major issue to be raised in this context is that of the specificity of a given data category with regard to the linguistic context. One basic assumption that has been agreed upon within ISO committee TC 37 is that the implementation of an international data category registry (DCR) should find the right balance between generalization (a data category such as /grammatical number/ is applicable to a large number of languages) and precision in the applicability of a data category to a single language (e.g. only the two values /singular/ and /plural/ are applicable to a given language, whereas other languages may allow for /dual/, /trial/ or /paucal/ (a few)). The next section describes how this issue is related to that of localization, and makes explicit the interrelation between these two requirements.

Another issue concerns the need to provide a mechanism through which annotators can select from the data category registry. For this purpose we propose an on-line browsing tool that allows the user to create a proprietary data category selection (DCS) from the categories available in the DCR, in particular those implemented from the IMDI specification.

Representing data categories

Foreword on object language and working language

When dealing with linguistic data, whether it is expressed in a database or in a semi-structured document, it is often necessary to identify the working language applicable to the data—that is, the language the data itself utilizes. This information can be used to make the right layout choices for presentation (e.g. hyphenation practices), or to select the appropriate spellchecker in an editing environment. In the case of XML documents, the World Wide Web Consortium (W3C) has introduced the xml:lang attribute to identify the working language of a document or document fragment, which enables exploitation of the hierarchical XML information structure and the associated rules of inheritance over embedded XML elements to control scope (the scope of the xml:lang attribute includes all the attributes and descendants of the element where it appears). ISO 16642, which defines a standard Terminology Markup Language (TML; see http://www.iso.ch/iso/en/CatalogueDetailPage.CatalogueDetail?CSNUMBER=32347&ICS1=1), has adopted this attribute as one of the fundamental mechanisms that should be used in any TML compliant with the standard. For example, a definition written in French would carry an ISO 639-1 code identifying its working language: "Une valeur entre 0 et 1 utilisée...".

In addition, the data may itself include information about languages, either because it describes a language, exemplifies some properties of a language, or provides further information about a language sample. This object language may be different from the working language used to convey the information, as in the case of grammar books or lexical descriptions. As defined by the TML, terminological data collections include a "language section" level in the meta-model that is specifically intended to organize the information contained within terminological entries into blocks dedicated to specific object languages. It should thus be systematically associated with a language marker, which in turn should be distinguished from any other working language indication. ISO 16642 considers that the object language indication is itself a data category (/Language Identifier/ in ISO 12620) which is mandatory at language section level. For example, a language section may have English as its object language (it describes the English term "alpha smoothing factor", whose term type is the value fullForm), while its working language is French (the definition "Une valeur entre 0 et 1 utilisée...").

The preceding example demonstrates two phenomena. First, any working language can be superseded at a deeper position in the XML representation by a new marking. This is the case when an English term is provided, where the working language should obviously be English. Second, the working language only applies to linguistic data, and not to other information such as numbers or dates; in particular, it does not apply to code identifiers such as 'fullForm' in the example above. 'fullForm' is the identification of a value described in ISO 12620, which should not be treated as linguistic information (for instance, one should not apply a spell checker to such a field).
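A minimal sketch of this distinction, using invented element names rather than the normative TML serialization: the xml:lang attribute marks the working language of each text field, while the object language is carried by a marker of its own ("objectLang" below).

import xml.etree.ElementTree as ET

entry = ET.Element("termEntry")
# Language section whose object language is English (it describes English terms).
lang_sec = ET.SubElement(entry, "langSec", {"objectLang": "en"})

# Definition written in French: French is the working language of this field only.
defn = ET.SubElement(lang_sec, "definition", {"xml:lang": "fr"})
defn.text = "Une valeur entre 0 et 1 utilisée..."

# The term itself is English data, so the working language is superseded locally.
term = ET.SubElement(lang_sec, "term", {"xml:lang": "en"})
term.text = "alpha smoothing factor"

# 'fullForm' is a code identifier (an ISO 12620 value), not linguistic data,
# so no working-language marking applies to it.
ET.SubElement(lang_sec, "termType").text = "fullForm"

print(ET.tostring(entry, encoding="unicode"))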

The descriptive structure of a data category

Figure 1 shows the overall organization of the descriptive portion7 associated with a data category. As can be seen, a data category is represented as a concept that is described at two different levels of granularity:
• At the higher level, the data category is seen as a unitary concept covering all its possible usages across languages. It is uniquely related to an identifier and contains descriptive information (definition, explanation, notes, and examples) that is valid for all those usages. Two important fields in the description are the profile, which indicates the domain of activity in the field of language resources to which this category can be applied (in the case of the IMDI set, the profile is set to 'metadata'), and the conceptual domain, which lists the possible values that this field can take, independent of its applicability to a certain language.
• The lower level contains language-specific sections ('language section' in ISO CD 12620-1), which provide information concerning the implementation of the data category for a given language. At this level, a more precise definition can be provided for the data category together with specific examples and, where applicable, a subset of the main conceptual domain. Interestingly, it is also possible to give the 'name' of the data category for this language, which is the main entry point for localization that has been chosen for the IMDI metadata set in INTERA.

Entry Identifier: gender
Profile: morpho-syntax
Definition (fr): Catégorie grammaticale reposant, selon les langues et les systèmes, sur la distinction naturelle entre les sexes ou sur des critères formels (Source: TLFi)
Definition (en): Grammatical category… (Source: TLFi (Trad.))
Conceptual Domain: { /feminine/, /masculine/, /neuter/ }

Language section (Object Language: fr) – Name: genre; Conceptual Domain: { /feminine/, /masculine/ }
Language section (Object Language: en) – Name: gender
Language section (Object Language: de) – Name: Geschlecht; Conceptual Domain: { /feminine/, /masculine/, /neuter/ }

Figure 1: An example of the descriptive component of a data category.
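A minimal sketch of the two-level structure shown in Figure 1, expressed as plain Python data classes rather than the ISO CD 12620-1 serialization; the field names are illustrative, not normative.

from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class LanguageSection:
    name: str                                       # localized name, e.g. "genre" for French
    conceptual_domain: Optional[List[str]] = None   # subset of the generic domain, if restricted

@dataclass
class DataCategory:
    identifier: str
    profile: str
    definitions: Dict[str, str]                     # keyed by working language
    conceptual_domain: List[str]
    language_sections: Dict[str, LanguageSection] = field(default_factory=dict)

    def values_for(self, object_lang):
        # Values applicable to a given object language, falling back to the generic domain.
        sec = self.language_sections.get(object_lang)
        return sec.conceptual_domain if sec and sec.conceptual_domain else self.conceptual_domain

gender = DataCategory(
    identifier="gender",
    profile="morpho-syntax",
    definitions={"fr": "Catégorie grammaticale ...", "en": "Grammatical category ..."},
    conceptual_domain=["/feminine/", "/masculine/", "/neuter/"],
    language_sections={
        "fr": LanguageSection("genre", ["/feminine/", "/masculine/"]),
        "en": LanguageSection("gender"),
        "de": LanguageSection("Geschlecht", ["/feminine/", "/masculine/", "/neuter/"]),
    },
)
assert gender.values_for("fr") == ["/feminine/", "/masculine/"]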

Localizing data categories

The preceding structure offers a general framework for localizing data categories:
• from an object language point of view, we can specify the meaning of a data category as it applies to given languages, by specializing definitions, explanations, notes and examples for the language8, or by subsetting the conceptual domain from the generic description to the specific one;
• from a working language point of view, each available field at either the generic or the language-specific level (i.e. definition, explanation and note) can be translated and re-expressed9 in another language. For instance, it is possible to provide a Japanese note on the usage of /gender/ in the German language.

One specific way to localize data categories is naming: providing a name (or term) that best designates the concept in a given language. As such, naming is language specific, but it merges the two notions of object language and working language: it is related to the object language because it gives the name that can be used for the category in this language, and it is related to the working language because the name itself is provided in that same language.

In the case of data categories for language resource metadata, there will likely be little variation of semantics between the generic level and the specific level. In addition, in the process of unifying the IMDI dataset with the OLAC consortium's metadata proposal, there is no certainty as to the stability of the definitions, explanations, or notes associated with a given data category, since its semantics may evolve when the two schemes are merged. Therefore, localization of the names in the IMDI metadata set has been performed so far with two objectives in mind:
• facilitating access to the metadata set for a wide variety of users;
• experimenting with ISO CD 12620-1 (D3.1) through its applicability to the IMDI metadata set.

7 The INTERA project also describes the various issues related to the administration of data categories within the DCR. 8 In which case there is a conjunction of the object and working language axes. 9 Under the condition that the semantic scope that is expressed strictly matches the initial or reference description.

Deploying an on-line environment for accessing localized versions of the IMDI metadata set

Applying 12620-1 to the IMDI metadata set

The main difficulty in applying ISO CD 12620-1 to the IMDI metadata set, a central goal of the INTERA project, has been to map the fields available in the initial documentation onto the general framework for describing data categories. To date, the following choices have been made:
• the IMDI Identifier has been kept as the identifier for the data category;
• the initial name in English (that is, in the English language section) was determined by taking the IMDI identifier, dropping the stop sign and flattening it to lower case;
• the IMDI definition has been retained as the general definition for the data category;
• the IMDI encoding field has been manually modified either to produce a note, when expressed in plain text, or to build up the conceptual domain, when a list of descriptors was made explicit;
• the comments field was transformed into an explanation, except when an explicit example was provided—in this case it was moved to the 'example' descriptor of ISO CD 12620-1.

To exemplify the transformation process, we include below the initial IMDI description for Content.Modalities, together with its representation according to the ISO CD 12620-1 framework.

IMDI entry: Content.Modalities
Element: Content.Modalities
Identifier: Content.Modalities
Definition: Gives a list of modalities used in the session.
Encoding: Open vocabulary 'Content.Modalities' (4.4).
Comments: The element is not used to give an exhaustive list of all the modalities, but should be used to list the modalities that are typical for the task or of interest for the researcher.
Example: In route direction one would typically look at speech and gestures and not at eye-gaze.

ISO CD 12620-1 representation (expressed in GMT – Generic Mapping Tool):
Identifier: Content.Modalities
Profile: Metadata / Content
Conceptual Domain: Unknown, Unspecified, Speech, Writing, Gestures, Pointing gestures, Signs, Eye gaze, Facial expressions, Emotional states, Haptics
Definition: Gives a list of modalities used in the session. (Source: IMDI Part 1, Metadata Elements for Session Descriptions, Draft Proposal Version 3.02, March 2003)
Explanation: The element is not used to give an exhaustive list of all the modalities, but should be used to list the modalities that are typical for the task or of interest for the researcher.
Example: In route direction one would typically look at speech and gestures and not at eye-gaze.
Language section (english) – Name: Content.Modalities
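A hedged sketch of the mapping choices listed above, applied to an IMDI documentation entry held in a plain dictionary. It is illustrative only: the actual INTERA conversion was carried out on the IMDI documentation itself, and the output field names follow the description in this section rather than the exact GMT element names.

def imdi_to_data_category(entry):
    identifier = entry["Identifier"]
    dc = {
        "identifier": identifier,
        "profile": "Metadata",
        # English name: the IMDI identifier with the stop sign dropped, in lower case.
        "name_en": identifier.replace(".", " ").strip().lower(),
        "definition": entry["Definition"],
    }
    encoding = entry.get("Encoding")
    if isinstance(encoding, list):        # an explicit list of descriptors becomes the conceptual domain
        dc["conceptual_domain"] = encoding
    elif encoding:                        # plain text becomes a note
        dc["note"] = encoding
    if entry.get("Example"):              # explicit examples go to the 'example' descriptor
        dc["example"] = entry["Example"]
    if entry.get("Comments"):             # remaining comments become the explanation
        dc["explanation"] = entry["Comments"]
    return dc

modalities = imdi_to_data_category({
    "Identifier": "Content.Modalities",
    "Definition": "Gives a list of modalities used in the session.",
    "Encoding": ["Unknown", "Unspecified", "Speech", "Writing", "Gestures"],
    "Comments": "Lists the modalities that are typical for the task or of interest for the researcher.",
    "Example": "In route direction one would typically look at speech and gestures.",
})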

Localizing names and definitions

To simplify the task of the various partners involved in providing localized names for the IMDI metadata set, a simplified table has been produced by applying a specific XSLT stylesheet to the ISO CD 12620-1 compliant representation. This table recalls the name in English of the IMDI category as of version 3.03 of the IMDI documentation, together with what has been identified as the definition (see previous section). The table has been filled in by the various partners of INTERA to provide names in the following languages: German, Dutch, Swedish, Italian, Spanish, French and Greek.

Making the IMDI metadata set available on-line

Putting the IMDI metadata set in a standardized format is worthwhile only if it is made publicly available for use in the design of specific metadata schemes. We have therefore incorporated the result of the transformation of the IMDI documentation to an ISO CD 12620-1 compliant format into an experimental on-line tool for browsing through the data categories in ISO TC 37. This work, subsidized by INRIA in its corporate action Syntax, has been made freely available to the consortium under http://syntax.loria.fr and was accessible to INTERA partners during the period of the project. This environment is conceived to last far beyond the time of the project, and we kindly ask regular users to ask for their own login and password as soon as possible; the coordinates provided to partners are only there for the sake of a quick initial browsing test. In what follows we describe the main functionalities of the Syntax server when querying the IMDI metadata set, after its standardization in accordance with ISO 12620-1.

Figure 2 shows the main query window of the Syntax server. The top left of the window includes a series of query fields; the top right displays the result set associated with a given query; and at the bottom there is a full display of any data category selected using the magnifier symbol in the display list.

The screenshot in figure 3 shows the search fields of the query window for a request for all data categories belonging to the metadata domain (see the Profile field), and for which the definition contains the word ‘person’. Such a query would typically correspond to a situation where a user is looking for possible data categories corresponding to the description of any kind of participant, e.g., in the course of collecting field data.
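The query in Figure 3 amounts to a simple filter over the registry entries. A minimal sketch, with a toy registry of two illustrative entries (the definitions are invented for the example and are not taken from the actual DCR):

def query(registry, profile=None, definition_contains=None):
    hits = []
    for dc in registry:
        if profile and dc.get("profile") != profile:
            continue
        if definition_contains and definition_contains.lower() not in dc.get("definition", "").lower():
            continue
        hits.append(dc)
    return hits

registry = [
    {"identifier": "interviewer", "profile": "Metadata",
     "definition": "The person conducting the interview in the session."},
    {"identifier": "gender", "profile": "morpho-syntax",
     "definition": "Grammatical category based on ..."},
]
# All metadata categories whose definition mentions 'person':
for dc in query(registry, profile="Metadata", definition_contains="person"):
    print(dc["identifier"])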

The screenshot in figure 4 shows the list of data categories that have fulfilled the initial query. In this figure one can see that it is possible to ask for the full description of a data category (magnifier icon), to add the data category to one's own selection (DCS, Data Category Selection) or to compare two existing data categories (COMP). This last functionality can be particularly useful when one wants to merge two existing proposals, as contemplated in the case of IMDI and OLAC.

Figure 2: Overview of the Syntax query interface for a request on the IMDI metadata set

Figure 3: the fields available for querying data categories

Figure 4: the result set for all entries in the IMDI data categories for which the definition contains the string ‘person’

Finally, figure 5 shows the full representation of a data category (here Interviewer) with the possibility to have access to the conceptual domain (when applicable), or the namings in available languages.

Figure 5: the visualization of the /interviewer/ data category

Perspectives

The work done in INTERA in relation to the standardization and localization of the IMDI metadata set is obviously only a step in the direction of providing an international reference set of metadata categories for language resources and tools. The next step is to use the framework presented in this report not only as a basis for comparing and possibly merging the descriptors offered by the IMDI and OLAC initiatives, but also to relate these to other metadata schemes such as the Dublin Core10 and the TEI (Text Encoding Initiative) header. To accomplish this broader goal, it will be necessary to extend the simple browsing tool described here to enable submission of data categories on-line, and review of the submissions by a committee of worldwide experts (as described in ISO CD 12620-1). This extension is planned for implementation within WP3, in parallel with the final phase of the project.

References

Ide, N., Romary, L. (2003). Outline of the International Standard Linguistic Annotation Framework. Proceedings of the ACL'03 Workshop on Linguistic Annotation: Getting the Model Right, Sapporo, 1-5.
Ide, N., Romary, L., de la Clergerie, E. (2003). International Standard for a Linguistic Annotation Framework. Proceedings of the HLT-NAACL'03 Workshop on The Software Engineering and Architecture of Language Technology, Edmonton.
Martin, L.E. (1990). Knowledge Extraction. In Proceedings of the Twelfth Annual Conference of the Cognitive Science Society (pp. 252–262). Hillsdale, NJ: Lawrence Erlbaum Associates.

10 The OLAC metadata set is actually described as a refinement of the Dublin Core format, and the IMDI set points to the Dublin Core when applicable. There are on-going discussions under the auspices of ISO/TC 37/SC 4 to work together with the TEI to go towards a wide-coverage set of metadata descriptors.

DATA CATEGORIES IN LEXICAL MARKUP FRAMEWORK OR HOW TO LIGHTEN A MODEL

Gil FRANCOPOULO, AFNOR-INRIA, [email protected]
Monte GEORGE, ANSI, [email protected]
Mandy PET, ANSI-ORACLE, [email protected]

1 Introduction

Previously, ISO TC37 efforts have focused on standards and associated models for dealing with language resources such as terminologies, but up until now the focus has not been on the various other aspects of language processing. The Lexical Markup Framework (LMF), a proposed standard numbered ISO-24613, addresses lexical resources at a higher level that allows for interoperability with terminological, human-oriented lexical resources and machine-based NLP lexica. LMF relies heavily on the ISO-12620 data categories (DC), managed under the auspices of the ISO-12620 revision by Laurent Romary (AFNOR-INRIA). They serve as Lego building blocks used to facilitate this interoperability.

We will see how the DC ease the definition and use of various norms and particularly lexical models.

2 Current situation

Traditionally, concerning linguistic constants, the following two strategies are applied:

Strategy #1: The lexical model defines the list of all the possible values for a certain type of information. For instance, /gender/ could be /masculine/, /feminine/ or /neutral/. More precisely, there are two sub-strategies: • define that /gender/ is /masculine/, /feminine/ or /neutral/ without any more details. • define that /gender/ is /masculine/ or /feminine/ for French and /masculine/, /feminine/ or /neutral/ for German.

Strategy #2: The values are not listed at all. The model just states that there is the notion of gender.

An example of the first strategy is applied in the GENELEX [Antoni-Lay] and EAGLES models, where the DTD contains all the possible values. The drawback of such an approach is that the DTD is necessarily huge and could be incomplete, especially for languages unknown to the model authors.

The advantage of the second strategy is that the model is simple and nothing is forgotten. But its drawback is that such a model is useless, as we will see in the next paragraph.

3 Capacities

For a lexical model, we can distinguish two criteria:

• The power of representation: what kind of data is the model able to represent? What languages can the model be applied to?

• The power of operation: is it possible to compare two words? How can a pick list be presented to a user of an interactive workstation? Is it possible to merge two LMF-conforming lexica?

The two criteria are somewhat contradictory: the more generic the approach, the more diverse the lexica that have to be merged.

Coming back to the second strategy, that is, avoiding the definition of the possible values for gender: the power of representation is high but the power of operation is very low. Nothing guarantees that a lexicon defines gender as /m/ and /f/, or /mas/ and /fem/, or worse /neuter/ for French. In such a situation, comparing words or merging various lexica are difficult operations and the norm becomes useless.

4 Merging

Let’s detail a bit what is merging.

Merging can take various forms such as the following use cases:

Use Case #1 Situation: Multilingual lexicon in N languages Goal: Add 1 new language to this lexicon

Use Case #2 Situation: Monolingual lexicon in language L Goal: Add words in language L

Use Case #3 Situation: Multilingual lexicon in N languages Goal: Add missing translations

Let’s add that merging is a frequent operation and a heavy burden for the lexicon manager.

5 Solution

The solution is not easy. We must represent existing data and due to the extension of multilingual databases and various formats used, merging seems to be the most demanding operation.

There is another point to be mentioned. This problem is not specific to lexicon management. The gender definition is shared by other processes like text annotation and feature structures.

That means that:
• it is not very wise to duplicate the effort across various norms;
• text annotation, feature structure coding and lexical representation are not independent processes. In the case of parsing, for instance, the information extracted from the lexicon is transferred to annotations or feature structures; if the norms disagree, there is a danger of producing different (and thus incompatible) values.

The solution is to define the data categories in a separate norm. These values are then shared by the lexicon, annotation and feature structure norms, and of course future norms can also fit into this architecture.

6 Details

The data categories are not only constants such as /masculine/ (preferred to /m/ or /mas/); they are also defined according to the language being processed.

More precisely, each feature is defined as a tree. The top node is, for instance, /gender/. One level below we have /french/, whose possible values are /masculine/ and /feminine/. At the same level as /french/ we have /german/, whose possible values are /masculine/, /feminine/ and /neuter/. For an unknown language, the possible values are the union of the values declared for all languages.
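A minimal sketch of this tree organisation (the structure and function below are our own illustration, not the registry's actual encoding):

```python
# Illustrative data category tree for /gender/: language-specific value sets,
# with the union of all declared values used for an unknown language.
GENDER_TREE = {
    "french": {"masculine", "feminine"},
    "german": {"masculine", "feminine", "neuter"},
}

def allowed_values(language: str) -> set:
    """Return the admissible /gender/ values for a language; fall back to
    the union over all known languages when the language is unknown."""
    if language in GENDER_TREE:
        return GENDER_TREE[language]
    return set().union(*GENDER_TREE.values())

print(allowed_values("french"))   # e.g. {'masculine', 'feminine'}
print(allowed_values("swahili"))  # union of all declared values
```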

As can be seen, the number of values is quite large. A management tool is needed to ease data category search and selection; such a tool is provided by INRIA under the auspices of the Syntax project.

7 A family of norms

The process used is similar to that of TMF (the Terminological Markup Framework), the ISO norm for terminological data [Romary].

Data categories are located at the lower level of the TC37 family of norms as sketched in the following diagram.

[Diagram: TMF, LMF, Annotation and Feature Structures, each building on a shared layer of Data Categories]

All four norms are based on data categories, so each norm is light, non-redundant and can interoperate with the others.


8 Conclusion

Like the other norms of the family, the baseline for LMF is to:
• concentrate on structuring the elements and linking them together;
• relegate language idiosyncrasies to an external, shared norm: ISO-12620.

As we have seen, LMF is part of a broader ISO effort to define a coherent set of norms based on data categories.

Bibliography:

Antoni-Lay, M-H., Francopoulo, G. and Zaysser, L. (1994). A generic model for reusable lexicons: The GENELEX project. Literary and Linguistic Computing 9(1): 47-54.

Romary, L. (2001). Towards an Abstract Representation of Terminological Data Collections – the TMF model. TAMA, Antwerp.

The MILE Lexical Classes: Data Categories for Content Interoperability among Lexicons

Francesca Bertagna1, Alessandro Lenci2, Monica Monachini1, Nicoletta Calzolari1

1 Istituto di Linguistica Computazionale (ILC) – Consiglio Nazionale delle Ricerche, Via Moruzzi 1, 56100 Pisa, Italy
(francesca.bertagna, monica.monachini, nicoletta.calzolari)@ilc.cnr.it
2 Dipartimento di Linguistica, Università degli Studi di Pisa, Via S. Maria 36, 56100 Pisa, Italy
[email protected]

Abstract
Addressing the issue of content interoperability among lexical resources, this paper tests the expressive potential of MILE (Multilingual ISLE Lexical Entry) as a standard environment for computational lexicons. It reports an experiment in which two differently conceived lexicons, FrameNet and NOMLEX, are mapped to the MILE general schema of shared and common lexical objects. So that such initiatives do not remain isolated exercises, and in order to promote them, a set of Data Categories is proposed which represents in RDF the common/shared lexical objects of the MILE semantic layer. This set, developed along the lines of the already existing RDF schema for the syntactic layer, is intended to be submitted to ISO TC37/SC4 for evaluation and approval.

1 Introduction

This paper represents an attempt to further elaborate the mapping experiment presented at the LREC Conference in Bertagna et al. 2004, where differently conceived lexicons are mapped to MILE. We have tried to push forward the potential of MILE as a common standard framework for lexical encoding by proposing a set of Data Categories which instantiate the MILE semantic lexical objects in RDF. The advantages and potential of developing a Data Category Registry (DCR) are well known (Ide et al., 2003). The classes and properties defined here correspond to the E-R diagrams described in Calzolari et al., 2003 for the semantic layer. They are developed along the lines of the already existing ISLE RDF schema for the syntactic layer, with the intent of enlarging the repository of shared lexical objects with those objects necessary for the representation of semantic information. The RDF schema for semantics is presented in the Appendix; for the syntactic schema, the reader is referred to Calzolari et al., 2003. It should be noted that the RDF statements provided here comply with the goals of ISO TC37/SC4 for Language Resource Management.

1.1 RDF and Interoperability

"Interoperability" is meant as the exchange and integration of information between systems (Vckovski, 1999). While HTML and XML allow the access and interchange of data at the formal and structural level, a metadata representation language like RDF/S (further extended with ontology formalization capabilities, e.g. in OWL or DAML+OIL) is expected to enable new and unprecedented progress towards content interoperability among resources. Such is the main vision of the Semantic Web: a wealth of new possibilities stemming from representing document and data semantics with metadata defined within ontologies, which will be easy for a machine to interpret and make use of in an intelligent way (Lassila, 1998). Computational lexicons are repositories of syntactic and semantic information. Recently, there have been various efforts to translate existing lexical resources into RDF/S or DAML+OIL, in an attempt to make their content available in the Semantic Web for various future applications (see Narayanan et al., 2002; Melnik & Decker at www.semanticweb.org/library). However, there is a concrete risk of these experiments becoming mere conversion exercises unless they are backed by an additional framework providing a common, shared and compatible representation of lexical objects. In order to reach true content interoperability, intelligent agents must be given the possibility of manipulating the objects available in different lexical repositories while understanding their deep semantics. This would entail, for instance, that applications be able to understand whether two lexical objects are of the same type, so that the same operations can be applied to them. In this paper we tackle the issue of content interoperability among lexical resources by presenting an experiment of mapping differently conceived lexicons (in this particular case FrameNet and NOMLEX) to a general schema of shared and common lexical objects. The schema adopted in this experiment is MILE (Multilingual ISLE Lexical Entry), a meta-entry for the encoding of multilingual lexical information (Calzolari et al., 2003) developed within ISLE [1] (International Standards for Language Engineering). The aim of the experiment is to bring out problems and collect hints that may emerge while mapping lexicons against an abstract model, while testing the expressive potential of MILE as a standard for computational lexicons.

1 ISLE was an initiative under FP5 within the EU-US International Research Co-operation, with the aim to develop and promote widely agreed Human Language Technology standards and best practice recommendations for infrastructural LRs.

2 MILE

The MILE Lexical Model (MLM) is described with Entity-Relationship (E-R) diagrams defining the entities of the lexical model and the way they can be combined to design an actual lexical entry. The MLM defines a first repertory of "MILE Lexical Classes" (MLCs), which formalize the main building blocks of lexical entries. The MLCs are defined on the basis of an extensive survey of major existing practices in lexicon development. MLCs form a "top ontology of lexical objects", as an abstraction over different lexical models and architectures. The MLM defines each class by specifying its attributes and the relations among them. Classes represent basic lexical notions. Instances of MLCs are the "MILE Data Categories" (MDCs), each of them identified by a URI. MDCs can be either user-defined or reside in a shared repository. Part of the class structure in the MLM has been formalized as an RDF Schema, and data categories have been created using RDF and OWL (Ide et al., 2003).

3 The Mapping Experiment

Two main methodological scenarios concerning the mapping may be envisaged.

(1) The first implies resorting to a high-level mapping of the elements in a lexicon onto the MILE lexical objects. This is similar to the proposal in (Peters et al. 1998), i.e. a common object model, sitting on top of the resource-specific models, which allows a uniform access procedure for all the resources. In this approach, the expert of the specific lexicon takes a number of decisions concerning the mapping between the linguistic information in the lexicon and the set of available lexical objects in the abstract model. One of the main advantages of such a solution is that resources would retain their native structure, without being submitted to format conversion.

(2) In the second approach, the possibility provided by MILE of creating instances of the lexical classes can be exploited to create lexical entries directly in MILE, which thereby acts as a true interchange format.

The most appropriate mapping strategy clearly depends on the possible application scenarios in a distributed and open environment requiring content interoperability of lexical resources. The first approach is most promising for "smart" access to lexical repositories: mapping the resource onto a common schema provided with an explicit formal characterization of object semantics would ease the off-line processing of extracting the required information. The second approach, on the other hand, would be more suitable for the purpose of managing, integrating and merging lexical information residing in different repositories. Creating lexical entries in a MILE-like schema would be a way to make the semantics of each lexical entry available in a fully explicit way, allowing intelligent computational agents to exploit it in inferential systems and knowledge-intensive applications. In what follows, we present some preliminary results of the experiment we have undertaken to map FrameNet and NOMLEX onto MILE. We preferred to perform the mapping at the lexical object level (following strategy (1)), since it is expected that, once the mapping conditions are formally and fully explicitly defined, the conversion at the entry level would follow naturally.

3.1 FrameNet to MILE

Our first experiment concerns the possibility of mapping the FrameNet (FN) architecture to MILE. In this paper, we have preferred not to involve in the mapping experiment two other important lexicon models: the WordNet "family" and the PAROLE/SIMPLE lexicons. From the beginning, one of the requirements for the standard was to represent perfectly the WordNet notions of synset and semantic relation. In this sense, mapping WordNet to MILE is more straightforward, and the interested reader can find an exemplification of it in (Lenci, 2003). At the same time, the MILE architecture being grounded in the GENELEX model, it adheres perfectly to SIMPLE. Representing FrameNet with the expressive modalities of MILE is a more difficult task.

FrameNet (Baker et al., 1998) is an important reality in the lexicon scenario and its linguistic design offers original features the standard has to deal with. The notion of Frame as such does not belong to the classes provided by MILE. Moreover, Narayanan et al. (2002) offer us a ready set of DAML+OIL classes representing the FrameNet notions to work on. We will try to map the Frame, the Frame Element (FE) and the Lexical Unit (LU) onto the corresponding MILE classes. The following figure shows how a certain degree of correspondence is possible: the Frame can be represented by the MLC Predicate, the Frame Element by the Argument and the Lexical Unit by the SemU.

[Fig. 1: Mapping FrameNet Lexical Objects to MILE]

The Frame is an extended and complex structure of knowledge evoking what we may call "actantial scenarios" and playing the central role in the design of the resource. The Frame Elements are the "actants", the entities playing a part in the scenario evoked by the Frame. So, the Frame "Getting" represents a situation where "a Recipient starts off without the Theme in their possession, and then comes to possess it", etc. In this situation, the Frame Elements are the Recipient, the Theme and others. In MILE, the notion that best expresses this same information is the Predicate. It can be lexical or primitive and it is linked to Arguments by means of the hasArgument relation. If we want to use the Predicate to represent the Frame, we have to choose its Primitive (non-lexical) modality. The MILE notions of Predicate, Argument and SemU are flexible enough to be interpreted in a narrower or in a wider way: the Predicate can be closer in size to the subcategorization frame, or more extended and closer to the notion of Frame and scenario. In FN, the set of FE types is open, in order to better fit the specific needs of the different Frames. Even though it provides a recommended set of possible values for the Thematic Roles of the Arguments (derived from SIMPLE), MILE allows the user to choose the most appropriate values independently; in this way, the MLM allows the representation of the open set of Frame Element names.

In mapping FrameNet onto MILE, some mismatches of a formal nature emerge. For example, while in FrameNet the Lexical Unit is directly linked to the Frame, in MILE the Predicate is inserted in a more general class, the SemanticFrame, which sits between the SemU and the Predicate. It specifies the predicative argument structure of the lexical entry and de facto contains the Predicate (with its arguments) and the type of link between the SemU and the Predicate, expressed by means of the attribute TypeOfLink. As a matter of fact, different words belonging to different POSs may share the same predicate in the predicative representation [2]. The problem is that, while in MILE the specification of the TypeOfLink is not optional, in FrameNet the nature of the link between the Lexical Unit and the Frame is underspecified (so we find in the same group the Lexical Units to acquire, acquisition_act etc., all sharing membership of the same Frame GETTING). A possible solution is to add a new value (Underspecified) for the TypeOfLink attribute in MILE. A more serious problem is the lack of any inheritance or embedding mechanism for the MILE Predicate. In FrameNet, two types of relation among frames are possible: first of all, the various frames can be organized in a hierarchical way, exploiting a sort of IS-A relation among the frames: "if frame B inherits from frame A, then B elaborates A, and is a subtype of A" (Narayanan et al., 2002). Moreover, a kind of sub-type relation can be established between a complex frame and several simpler frames (the so-called subframes). These important features of FrameNet cannot be represented using MILE: from this point of view, we can state that a complete "translation" from FrameNet to the standard cannot be successfully achieved. The modularity of MILE, however, may be an answer to this problem: it would allow the addition, for instance, of a new object PredicateRelation to the LexicalModel. Even without the availability of a specific class SubPredicate, MILE would be able to represent the semantics of a predicate considered a part/sub-type of a more complex and articulated Frame. By envisaging specific relations among predicates, it would also be possible to express the temporal ordering among frames (another piece of information we can find in FN). In the near future, we would like to verify whether the strong FN correlation between lexical entry and corpus evidence (by means of annotation) is also representable using MILE devices. We will discuss this aspect later, but surely the flexibility of the model (i.e. its being open to adaptations and improvements without changing what already exists) is an important feature a standard should have in order to represent new linguistic notions and different lexicon "visions".

2 For instance, the verb destroy and the nouns destruction and destroyer may share the same predicate DESTROY, respectively with a MASTER, VERBNOM and AGENTNOM type of link.

3.2 NOMLEX to MILE

The second experiment proposes a mapping between the MLCs and NOMLEX (Reeves et al. 1999), a syntactic lexicon for English nominalizations. NOMLEX has been designed similarly to COMLEX, a syntactic subcategorization lexicon for English verbs. Basically, the strong reason underlying the choice of such a lexicon for the mapping is that NOMLEX has an architecture very far from the MILE E-R model: lexicon entries take the form of parenthesized, nested feature-value structures, allowing lexical information to be expressed in a very synthetic and compact way. NOMLEX basically describes syntactic frames of nominalizations and also relates the noun complements to the verb arguments. All this information, once mapped against the MILE basic notions, proves to be covered by the corresponding MILE Lexical Classes (MLCs). The immediate main divergence consists, hence, in the adopted expressive means. Whereas in the previous experiment two lexicons both based on an E-R model, but without perfectly overlapping notions, were confronted, here the mapping has to deal with the same linguistic notions expressed with two conceptually opposite lexicon structures. Another important diverging point characterizes the two lexicons: the definition of the clear cut between the levels of linguistic representation. In a NOMLEX lexical entry, not only purely syntactic properties are provided, but some semantic pieces of information enter into the description. Within a single feature value, no clear boundaries between the syntactic and semantic parts are defined: as a consequence, the level of interface between syntax and semantics is also partly hidden in the syntactic description of the lexicon. Conversely, in MILE the representation of lexical information is highly modular, flexible and layered, with notions distinctly distributed over different levels of linguistic representation. These differences make the experiment particularly challenging, thus giving the opportunity to better test the MILE model in terms of adequacy, expressiveness and potential.

By way of an example of the mechanisms and effort that two differently conceived lexical organizations involve, notwithstanding the mappability of the linguistic notions, one object class, shared by all NOMLEX entries, is mapped onto the MLCs. This is the class expressed by the feature :nom-type, in which the type of nominalization is declared, i.e. whether it expresses the event/state of the verb and whether it includes incorporations of a verb argument. Mutually exclusive values can be specified, depending on the different expressions and possible incorporations of the argument. Expressing this in the MILE model means decompressing the information and spreading it over different MLCs, belonging to different lexical layers. According to the MILE architecture, indeed, the type of relation between a nominalization and its verb base is more properly of a semantic nature. It involves many MLCs and, moreover, implies the level of interface between syntax and semantics. Next to an MLC:SynU, a corresponding MLC:SemU is needed, with the object CorrespSynUSemU stating a link between the two. From the SemU, the MLC:SemanticFrame branches out, dominating the MLC:Predicate and its connected MLC:Argument(s) [3]. Two attributes of the class SemanticFrame, 'typeOfLink' and 'includedArg' respectively, are in charge of specifying the relation between the SemU and the Predicate and the incorporation of the argument. In Fig. 2, the values 'AGENTNOM' and '0' instantiate the agent nominalization [4]. The object CorrespSynUSemU, at this level of conceptual mapping, remains empty: if the mapping is pushed to the level of lexical entries, it will be instantiated to specify the way the Syntactic and Semantic Frames correspond to each other and, particularly, how semantic Arguments are projected onto the syntactic Slots.

[Fig. 2: A NOMLEX class mapped onto MILE MLCs.]

The mapping between such models is highly costly, since information expressed in a very compact and dense way has to be explicitly decompressed and distributed over the pertinent levels. This operation is due to the high level of granularity of MILE, which, however, has been conceived exactly to allow compatibility with differently packaged linguistic objects.

3 It should be noted that a verb and its nominalizations are supposed to share the same semantic frame.
4 The mechanism applies to all the values: changing the value in NOMLEX means changing the value in one of the MILE MLCs.

4 Experiment Results and Open Issues

In this paper, we presented an experiment aimed at testing the expressive potential of MILE as a standard for computational lexicons. The fundamental idea is that, by providing an efficient standard for the representation of notions at different levels of linguistic description, we can obtain the key element for content interoperability among lexical resources. FrameNet and NOMLEX, two important, representative yet differently conceived lexicons, were chosen for the mapping experiment. The results of both experiments are promising, yet some reflections need to be made.

In the FrameNet to MILE experiment, we see that, even with some limits and approximations, all the FN basic notions can in some way be represented using the MILE Lexical Classes. The possibility of working on a lexicon whose design follows a relational model allows an easier recognition of the lexical objects playing central roles at the architectural level. MILE adheres to a relational model of the lexicon, where the semantics of each object is made explicit by the many relations the object has with the other objects available in the data structure. FrameNet is a lexicon of this type: the meaning of the Frame is not given by a description, a label or a code, but rather by the relations the Frame has with the Lexical Units, the Frame Elements, etc. When trying to map the FN structures onto MILE, we have to verify whether: i) among the MLCs there is a valid correspondent for each FN lexical object; ii) the internal coherence of FN is preserved when passing to MILE (i.e. whether the reciprocal relations between the Frame, the Frame Element and the Lexical Unit are mirrored by the relations between the Predicate, the Argument and the Semantic Unit); and iii) there is no loss of information (and we saw that the danger of losing the important inheritance and embedding mechanisms among the Frames can be averted by adding new specific modules to the MLCs).

The underlying models of NOMLEX and MILE are instead deeply different and the mapping is much more difficult. While MILE pushes the E-R model to the extreme, NOMLEX adopts a typed feature structure formalism to represent syntactic phenomena. The difference between the two is extremely evident when we observe how what in MILE belongs to distinct layers of representation (usually the semantic and syntactic layers) is represented in NOMLEX simply by juxtaposed labels within the same description code. Performing the mapping of a non-E-R lexicon onto MILE presents more difficulties and is much more costly in terms of the human intervention needed to define the mapping conditions. It seems, however, an unavoidable price to pay if we want to open up the semantics and make the data structure more explicit and comparable with other lexical architectures and repositories. All in all, it can be a very useful enterprise when wanting to share and make interoperable the lexicon content in a distributed environment.

The two experiments are promising in showing how the highly expressive MILE can be used to represent both FN and NOMLEX. The modular, granular and flexible framework of the MILE model seems well suited to act as a true interface between differently conceived lexical architectures, since it provides well recognizable, atomic, primitive notions that can be combined, nested and inherited to obtain more complex ones. The described experiments are a first small-scale attempt to establish mapping conditions between some existing lexicons and MILE. If we want MILE to become a really used standard, we should work intensively in the near future to provide mapping conditions between the most important lexicon models and architectures and MILE. It is obvious that this can be achieved only with the participation and help of the lexicon community, in order to benefit from the competence of each lexicon developer.

Furthermore, in order to foster the adoption of MILE as a standard framework for computational lexicons and to strengthen its potential, we tried to enlarge the already available Data Category Registry (DCR) for the syntactic layer by providing a draft RDF schema for the lexical objects of the semantic layer. The schema is included below in Appendix A and contains the RDF instantiations of the classes and properties corresponding to the E-R diagrams presented in Calzolari et al. 2003 for the MILE semantic layer. An RDF Data Category Registry represents one of the most important key issues for starting to develop multilingual lexicons and to reuse existing resources; the schema is intended as a draft to be submitted for evaluation and approval within the Lexical Markup Framework (LMF) Working Group.
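As a rough illustration of what instantiating such data categories in RDF looks like in practice (a minimal sketch using the rdflib library; the namespace URI and resource names below are our own illustrative choices, not the ISLE schema itself), a FrameNet-style frame mapped onto a MILE-style Predicate with its Arguments could be expressed as follows:

```python
# Minimal sketch: a "Getting"-style frame modelled as a Predicate with
# Arguments bearing thematic roles, using class and property names taken
# from the paper (Predicate, Argument, hasArgument, hasThematicRole) but an
# illustrative namespace, not the actual ISLE/MILE schema.
from rdflib import Graph, Literal, Namespace, RDF, RDFS

MLCS = Namespace("http://example.org/isle-schema-semantics#")  # hypothetical

g = Graph()
g.bind("mlcs", MLCS)

getting = MLCS.GETTING
g.add((getting, RDF.type, MLCS.Predicate))
g.add((getting, RDFS.comment, Literal("Primitive predicate for the Getting frame")))

# Frame Elements mapped onto Arguments with thematic roles.
for role in ("Recipient", "Theme"):
    arg = MLCS["GETTING_" + role]
    g.add((arg, RDF.type, MLCS.Argument))
    g.add((arg, MLCS.hasThematicRole, MLCS[role]))
    g.add((getting, MLCS.hasArgument, arg))

print(g.serialize(format="turtle"))
```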

rdf:about="http://www.ilc.cnr.it/~bertagna/ hasCollocation rdf/isle-schema- rdf:resource="http://www.ilc.cnr.it/~bertag SemanticRelation na/rdf/isle-schema-semantics#Collocation"/> rdf:resource="http://www.cs.vassar.edu/~ide /rdf/isle-schema-v.6#SemU"/> rdf:resource="http://www.ilc.cnr.it/~bertag na/rdf/isle-schema-semantics#SemU"/> rdf:about="http://www.ilc.cnr.it/~bertagna/ rdf/isle-schema-semantics #consistsOfSemU"> consistsOfSemU rdf:about="http://www.ilc.cnr.it/~bertagna/ rdf:resource="http://www.ilc.cnr.it/~bertag SynsetRelation na/rdf/isle-schema-semantics#Synset"/> /rdf/isle-schema-v.6#SemU"/> rdf:resource="http://www.ilc.cnr.it/~bertag na/rdf/isle-schema-semantics#Synset"/> rdf:about="http://www.ilc.cnr.it/~bertagna/ rdf/isle-schema-semantics #hasSemFeature"> hasSemFeature rdf:about="http://www.ilc.cnr.it/~bertagna/ rdf:resource="http://www.ilc.cnr.it/~bertag belongsToSynset na/rdf/isle-schema-semantics#Synset"/> na/rdf/isle-schema-semantics##SemFeature"/> rdf:resource="http://www.ilc.cnr.it/~bertag na/rdf/isle-schema-semantics#Synset"/>

#hasSemFeatureName"> hasSemanticFrame hasSemFeatureName na/rdf/isle-schema-semantics#SemFeature"/> semantics#SemFeatureName"/>

rdf/isle-schema-semantics hasSemFeature #hasSemFeatureValue"> hasSemFeatureValue rdf:resource="http://www.cs.vassar.edu/~ide rdf:resource="http://www.ilc.cnr.it/~bertag na/rdf/isle-schema-semantics#SemFeature"/> semantics#SemFeatureValue"/> rdf:about="http://www.ilc.cnr.it/~bertagna/ rdf/isle-schema-semantics #isaSemFeature"> isaSemFeature rdf:about="http://www.ilc.cnr.it/~bertagna/ na/rdf/isle-schema-semantics#SemFeature"/> hasThematicRole na/rdf/isle-schema-semantics#Argument"/> rdf:about="http://www.ilc.cnr.it/~bertagna/ hasSemFeatureName rdf/isle-schema-semantics rdf:resource="http://www.ilc.cnr.it/~bertag hasSelectionalPreferences semantics#SemanticFrame"/> na/rdf/isle-schema-semantics#Predicate"/> rdf:resource="http://www.ilc.cnr.it/~bertag na/rdf/isle-schema-

isDescribedByFeature rdf:resource="http://www.ilc.cnr.it/~bertag isaThematicRole na/rdf/isle-schema-semantics#Predicate"/> semantics#ThematicRole"/> rdf/isle-schema-semantics #hasArgument"> hasArgument Preferences to other classes --> rdf:about="http://www.ilc.cnr.it/~bertagna/ rdf/isle-schema-semantics #selectsSemU"> selectsSemU na/rdf/isle-schema- isDescribedBy semantics#SelectionalPreferences"/> /rdf/isle-schema-v.6#SemU"/> na/rdf/isle-schema- semantics#SelectionalPreferences"/> rdf:about="http://www.ilc.cnr.it/~bertagna/ rdf/isle-schema-semantics rdf:about="http://www.ilc.cnr.it/~bertagna/ selectsSemFeature rdf/isle-schema-semantics #secondArgument"> secondArgument rdf:resource="http://www.ilc.cnr.it/~bertag na/rdf/isle-schema- rdf:resource="http://www.ilc.cnr.it/~bertag rdf:resource="http://www.ilc.cnr.it/~bertag na/rdf/isle-schema- semantics#SelectionalPreferences"/> rdf:about="http://www.ilc.cnr.it/~bertagna/ rdf/isle-schema-semantics #selectsSynset"> rdf/isle-schema-semantics #hasCollocate"> hasCollocate rdf:resource="http://www.ilc.cnr.it/~bertag rdf:resource="http://www.ilc.cnr.it/~bertag na/rdf/isle-schema-semantics#Collocation"/> rdf/isle-schema-semantics #selectsCollocation"> selectsCollocation the Conference (pp. 86-90). Bertagna, F., Lenci, A., Monachini, M., Calzolari, N. (2004). Content Interoperability of Lexical Resources: (2003). Standards and best Practice for Multilingual selectsCollocation MILE/ISLE Lexical Entries (2003). Proceedings of the Lenci A. (2003). Lexicon Design in the Age of the Semantic Web. Eurolan-2003 Summer School Tutorial, Bucharest, Romania. (2002). FrameNet Meets the Semantic Web: a DAML+OIL Frame Representation. In AAAI Peters W., Cunningham H., McCauley C., Bontcheva K, firstArgument Wilks Y. (1998). An Uniform Language Rosource Science Department. New York University.

1 It is for this reason that we propose that the now ubiquitous alpha2 language tags of the ISO 639-1 standard be formally restricted to the identification of literary and standardised written languages (including their spoken, read and signed representations): see Dalby, Gillam et al. (2004). The unambiguous tagging of well documented standard languages can be covered by the few hundred combinations available (26x26), whereas the specification of all forms of spoken, written and signed language, as discussed below, requires at least tens of thousands of identifiers.

"the languages of humankind, viewed collectively as components of a worldwide vehicle for thought, communication and documentation and for the collective maintenance of communal and individual identities".

We now have the technical and communicational resources to view the workings of the linguasphere within three complementary dimensions:

• the “organic linguasphere” is the underlying continuum of all spoken, written and signed conventions through space and time – lexical, phonological, grammatical and semantic – which in differing combinations and geographical patterns constitute the structure of all human languages, against the evolving background of communal identities and cultures;

• the “mental linguasphere” may be abstracted as the total linguistic knowledge and competence of all communicating members of humankind at any one time;

• the “recorded linguasphere” is the cumulative physical and electronic collection of all recorded materials in all languages through time, that collective "library" which is fast becoming our most progressively organized, collectively accessible and dynamically exploited human resource, involving the increasing partnership of machines.

The linguasphere has always existed, in the sense of the continuum of all human languages, linked together haphazardly in the minds of bilingual and multilingual speakers. Today, however, we have the first opportunity of organising the linguasphere as an integrated global entity, as a meta-base for the classification and communication of knowledge about ourselves, in all languages simultaneously. Having identified and classified everything around us, albeit in ways that remain the subject of debate, we will be able to identify and classify ourselves on a global scale, using languages as both our collective tool and our shared medium.

In this paper, we present the principles and composition of a structural meta-base, the “Linguasphere System”, which has been created at the outset of the 21st century to meet the needs and scale of the future co-ordinated development of languages worldwide. We then suggest how this system might be used in combination with metadata registries, as a means to organise unlimited streams of multilingual and multimedia data and materials around a standardized corpus of "language" identifiers. The initial result of this work will be akin to a multilingual electronic thesaurus of all the world's languages. The Linguasphere System, known as LS 639 [2], already provides over 25,000 unique four-letter language identifiers (referred to here as alpha4 langtags).
These langtags are applied equally but unambiguously to:
(i) individual languages,
(ii) individual groupings of two or more languages,
(iii) individual varieties or "dialects" within any language, written, spoken or signed,
(iv) individual components or communities within those varieties,
(v) historical periods within each diachronically recorded language, and
(vi) individual scripts [3] and varieties of script or spelling system.

The coverage of each langtag is correlated and defined in terms of its adjacent identifiers at identical, wider and/or narrower layers of linguistic classification. Each alpha4 langtag has its place within a global hierarchy, but can also stand alone as a unique and unambiguous identifier. This system is based on the key principles outlined below, and the structure of its identifiers (alpha4 langtags) allows for a potential expansion to cover up to 450,000 identified units of linguistic reference.

2 The number 639 reflects the intended function of the LS 639 alpha4 langtags as a complement to, or extension of, the existing ISO 639 alpha2 and alpha3 langtags.
3 Including identical forms, wherever appropriate, to the pre-existing alpha4 langtags of ISO 15924 (2004) Code for…Scripts.

2. Principles of the Linguasphere System

The development of a large-scale system of language identifiers, with capacity for future expansion and the means to cope with continuous changes in languages, requires a clear theoretical basis. The development of the Linguasphere System is based on the following key principles:

• A distinction is maintained between the (preferably discouraged) use of the term "language" to describe a written (and formal spoken) standard language, and its basic, more comprehensive application to a language in all its written, spoken and/or signed forms.

• Languages are not independent objects, like apples, but are parts of a fluid continuum of human communication, the ever-changing surface of the linguasphere.

• Languages have dual roles as means of communication and of communal identity, so that the organisation of data on languages needs to embrace data on the communities who use them.

• Languages and the varieties and components of a language cannot always be classified with certainty, either over periods of time or across geographical areas, so any necessarily flexible form of classificatory coding (including hierarchical sub-tags) needs to be kept separate from the accurate and static tagging of linguistic and communal identities.

The development of a system of specification based on these principles has required not only the production of a list of unique identifiers (alpha4 langtags), but also the means by which to answer questions about the geographical, cultural and linguistic contexts of these identifiers, both historical and current. These will include global questions such as “what is the current spread of language communities within the world's megacities [4]?” or "what is the current distribution of bilingualism and translingualism [5] among language communities in contact?".

4 As exemplified by research on the languages of London, commenced in the 1990s within the "Logosphere" language mapping programme at the School of Oriental and African Studies in London and the Observatoire Linguistique (now Linguasphere Observatory): see Baker and Eversley (2000).
5 The term "translingual" describes a speaker able to navigate competently between two or more closely related languages, or a community in which the majority of speakers are able to do so (e.g. from Catalan to Spanish). The distinction between translingualism and other forms of bilingualism is useful, since it involves differences in the processes of language learning and of translating, as well as in the way languages may influence each other. See Dalby (2000a), pp. 70, 108.

3. Structure of the Linguasphere System

LS 639, the system of Linguasphere alpha4 langtags, provides for correlation with and unambiguous conversion to/from the alpha2 tags of ISO 639-1, the alpha3 tags of ISO 639-2 (parts 2/B and 2/T), the SIL/Ethnologue (proposed ISO 639 part 3), and other related systems (RFC 3066 etc.). Since LS 639, potentially the basis for ISO 639 part 6, is more granular than other identifiers and classifiers, this correlation will provide a precise form of definition and mapping for other systems of language coding and classification (including the existing ISO 639 parts 1 and 2, and the expected parts 3 and 5).

The Linguasphere System consists of three parts:

• a fixed numeric framework of 10 sectors and 100 zones of global linguistic reference, known as the Linguasphere Key;

• an adjustable alphabetic scale of relationship within each of those 100 zones, known as the Linguasphere Scale;

• 25,000+ unique LS 639 alpha4 langtags, or fixed "language labels", known as Linguasphere Identifiers, each having an assigned place on the Linguasphere Scale of the relevant zone.

These parts are briefly presented in the following three sections.

3.1 Linguasphere Key

The framework of global reference is composed of ten referential sectors, these sectors containing a total of one hundred referential zones. This allows any language in the world, or any defined group of languages, or any variety or component or community of any language, to be simply and unambiguously located within the linguasphere by means of a pair of digits. This numeric framework is referred to as the Linguasphere Key [6], and the two digits represent information about the relevant zone.

The first digit of this key is used to refer to one of the ten referential sectors that establish a major division of the linguasphere between:
- languages classified outside five major 'families' or affinities, and
- all those languages which have been classified within them.

Languages in the first of these two categories are initially classified, according to purely geographical criteria, within five geosectors corresponding to the continent where they are spoken. Languages in the second category (including, as it happens, all major languages with an "intercontinental" distribution) are classified within five linguistic phylosectors, corresponding to the continental or intercontinental affinity to which each of them belongs. The ten sectors are ordered, both numerically and alphabetically, so that:

- the five geosectors are each indicated by an even digit: 0=AFRICA; 2=AUSTRALASIA; 4=EURASIA; 6=NORTH-AMERICA; 8=SOUTH-AMERICA

- the five phylosectors are each indicated by an odd digit: 1=AFRO-ASIAN (containing languages of the Afro-Asiatic or Hamito-Semitic affinity); 3=AUSTRONESIAN (containing languages of the Austronesian affinity); 5=INDO-EUROPEAN (containing languages of the Indo-European affinity); 7=SINO-INDIAN (containing languages of the Sino-Tibetan affinity); 9=TRANSAFRICAN (containing languages of the Atlantic-Congo affinity).

The second digit of the Linguasphere Key is used to subdivide these ten sectors (five geosectors + five phylosectors) into one hundred zones, representing the referential sub-division of each sector into a further ten parts. Within the five phylosectors, the component zones (or phylozones) are based on the known linguistic subdivisions of each of the affinities (or 'families') concerned, selected subdivisions being either combined or further divided to arrive at a total of ten referential parts. 5=Indo-European, for example, divides readily into ten phylozones, corresponding to so-called "branches" of the Indo-European wider affinity or "family", whereas in the case of 1=Afro-Asian, a total of ten phylozones is arrived at by allocating three zones (rather than one) to the more complex Chadic "branch" of the Afro-Asiatic intercontinental affinity, representing the three actual linguistic groupings within that branch, i.e. 17=Charic, 18=Mandaric, 19=Bauchic.

Within the five geosectors, twenty-five of the fifty component zones [7] are themselves phylozones, corresponding to wider or narrower affinities, as in the case of 00=Mandic in 0=Africa, for example, or 41=Uralic in 4=Eurasia. The remaining twenty-five zones are geozones, corresponding to convenient geographical groupings of languages that may sometimes share a geo-typological relationship, as in the case of 43=Caucasus or 44=Siberia, or may simply be isolated languages or groupings of languages spoken in the same geographic area, as in the case of 87=Amazon.

The sectors and zones form a consistent table of reference covering the totality of modern languages in the world, to which any past or future system of historical classification may be specifically cross-referenced. A stable framework – or linguistic "workbench" – is thus provided, on which pieces of the historical and contemporary jigsaw of linguistic relationships can be assembled and re-assembled as necessary. The underlying framework of reference will no longer need to be changed each time a new 'family-tree' of remoter or closer affinities is proposed or established. The scale of relationships within this framework (see below) will allow for future changes of classification.

3.2 Linguasphere Scale

The proven or assumed relationships among the languages of each zone are recorded by means of an alphanumeric code, composed of the two digits of the Linguasphere Key, followed by an alpha code [8].

6 See Dalby (2000a), pp. 58-62.
7 The fact that exactly 25 of these 50 zones may be treated as phylozones is a statistical coincidence.
8 The term "alpha code" refers to the function of the Linguasphere Scale in the standardised encoding of relationships among languages, in contrast to the purely identifying function of the LS 639 "alpha4 langtags".

This alpha code, known as the Linguasphere Scale, is variable in length, serving to encode the intermediate and close relationships among languages in the same zone (including groups and varieties of languages), based on current scholarship and documentation. The working of this alpha code is not described in detail within this paper, but may be summarized and exemplified as follows. The Linguasphere Scale is composed of two sequences of up to three letters each, distinguished by case. The first sequence (in upper-case) represents “outer layers”, a graduated coding of relationships, ranging from a substantial minority to a substantial majority of the lexical materials present in the languages of each zone. The second sequence (in lower-case) represents “inner layers”, a geographical and/or linguistic ordering of the closely related varieties of a specific "language" or tight cluster of "languages". Unlike the Linguasphere Key, applied for stable referential purposes to each zone and to each language assigned to that zone, the alpha code of the Linguasphere Scale can be reset at any point within a zone, whenever it is necessary to incorporate new or revised information, or re-classification, into updated versions of the Linguasphere System. This cascade updating of the hierarchical alpha code has no effect on the alpha4 identifiers of the defined languages involved, or of any other unchanged components. An example of the use of this scale is given in the following section.

3.3 Linguasphere Identifiers (LS 639 alpha4 langtags)

The Linguasphere Identifiers, known collectively as LS 639, form an expanding series of over 25,000 unique "four-letter language labels" or alpha4 langtags, each of which has a specific and, if necessary, adjustable place against the Linguasphere Scale of the relevant zone. The system, already indexed to over 70,000 language names and variant names, has the potential for expansion to over 450,000 identifiers [9]. The LS 639 alpha4 langtags have been selected and designed to cover every known language, written, spoken and signed, either modern and/or recorded from the past, as well as a growing catalogue of the component dialects and communities, historical periods and writing systems within individual languages. The application of these identifiers extends not only inwards, however, but also outwards, to include the names of groups of languages up to and including major affinities or 'families'. Their purpose is to provide unique and unambiguous labels for every unit of linguistic reference, from isolated or extinct language communities to the most widely distributed families of modern languages. The alpha4 langtags have been added to the Linguasphere System since a selective outline of the system was first published in 1993 [10] and since its complete global register appeared in 1999-2000 [11]. These LS 639 identifiers are designed to serve as unambiguous machine-readable access tags to all relevant data on and in any unit of linguistic reference at whatever level.

9 In practice the number will be less than 450,000, since readily pronounceable sequences are avoided as much as possible in the composition of alpha4 langtags, for obvious reasons.
10 Dalby (1993).
11 Dalby (2000a), including preview editions published in 1997 and 1998 in accordance with the objectives of the UNESCO Linguapax project.
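To illustrate how such machine-readable codes decompose, here is a minimal Python sketch (our own reading of sections 3.1-3.2, not part of the standard; the function and table names are illustrative):

```python
# Illustrative parser for Linguasphere-style scale codes (e.g. "50BAad"):
# two digits of the Linguasphere Key, then up to three upper-case letters
# (outer layers) and up to three lower-case letters (inner layers).
import re

GEOSECTORS = {0: "AFRICA", 2: "AUSTRALASIA", 4: "EURASIA",
              6: "NORTH-AMERICA", 8: "SOUTH-AMERICA"}
PHYLOSECTORS = {1: "AFRO-ASIAN", 3: "AUSTRONESIAN", 5: "INDO-EUROPEAN",
                7: "SINO-INDIAN", 9: "TRANSAFRICAN"}

CODE = re.compile(r"^(?P<key>\d{2})(?P<outer>[A-Z]{0,3})(?P<inner>[a-z]{0,3})$")

def parse_scale_code(code: str) -> dict:
    """Split a scale code into key, sector, outer and inner layers."""
    m = CODE.match(code)
    if not m:
        raise ValueError(f"not a Linguasphere scale code: {code!r}")
    sector_digit = int(m.group("key")[0])
    sector = GEOSECTORS.get(sector_digit) or PHYLOSECTORS[sector_digit]
    kind = "geosector" if sector_digit % 2 == 0 else "phylosector"
    return {"key": m.group("key"), "sector": sector, "sector_type": kind,
            "outer_layers": m.group("outer"), "inner_layers": m.group("inner")}

print(parse_scale_code("50BAad"))
# e.g. {'key': '50', 'sector': 'INDO-EUROPEAN', 'sector_type': 'phylosector',
#       'outer_layers': 'BA', 'inner_layers': 'ad'}
```

Note that this parses only the classificatory scale code; the alpha4 langtags themselves (e.g. /cyde/) carry no internal classification and remain stable when the scale is revised.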
The linkages among them, represented and controlled by the Linguasphere Scale, will make it possible for machines and human users to navigate the Linguasphere System – and hence the linguasphere itself – in all directions: outwards to wider categories, inwards to narrower varieties, and sideways to adjacent and other related units of linguistic reference. This universal application of alpha4 langtags as static identifiers means that the reassignment of any unit of linguistic reference to a wider or narrower layer of classification does not affect its established identifier or langtag. Each langtag gives access to information on and in each relevant language (or variety or group), and its components, and enables the information to be viewed in the context of the relevant language's wider relationships.

The classification of linguistic relationships provides an obvious framework for organising data on natural languages. Yet how can such a framework, often based on complex hypotheses [12], be protected from the inevitable upheavals caused by any reassessment of linguistic relationships [13]? One remembers the way in which books on African languages, for example, needed to be reclassified in the mid-20th century to cater for major changes in their classification [14]. This problem is overcome in LS 639 by treating the comprehensive identification of inter-relationships among languages as a fundamental category of metadata attached to, but not determining, the alpha4 identifiers of individual languages or varieties of language. A continually updatable roadmap of the linguasphere may consequently serve as a logical supplement to – but not necessarily a part of – the proposed expanded structure of ISO 639.

12 See for example the complex language family index reproduced in Grimes (2000b).
13 Historical relationships among languages are sometimes described as "genetic". This is misleading in that languages are not independent objects when in close contact within the minds of bilingual speakers, who are key players in the evolution of the linguasphere.
14 When major groupings such as “Sudanic” were replaced by new groupings such as “Niger-Congo”.

The Linguasphere System may be briefly illustrated by the following example, tracking the hierarchy of relationships from the most widely distributed of all language families (Indo-European) through to the local form of southern Welsh spoken around the Preseli Hills, where the Linguasphere Observatory is currently situated. The Linguasphere Key is represented by one or two initial digit(s), the outer layer(s) of the Linguasphere Scale by the subsequent upper-case letter(s), and the inner layer(s) by the final lower-case letter(s). The Linguasphere alpha4 identifiers are cited between forward slashes.

Example of the Linguasphere Hierarchy (scale = reference name /alpha4 langtag/, with a parallel example in each case):

sector         5= Indo-European /ineu/                            cf. 4= Eurasia /euas/
zone           50= Celtic /celt/                                  cf. 51= Romanic /rmnc/
outer layers   50B= Brythonic /brtn/                              cf. 50A= Gaelic /gael/
               [50BA= Cymraeg (Welsh) /cymr/] [15]                cf. 50BB= (Breton + Cornish) /brkr/
inner layers   50BAa= Cymraeg (Welsh) /cymr/                      cf. 50BBb= (Breton) /brzg/
               50BAad= Cymraeg y De (South Welsh) /cyde/          cf. 50BAab= (North Welsh) /cyst/
               50BAdda= Iaith y Preseli (Preseli Welsh) /prsl/    cf. 50BAdba= (S. Central Welsh) /cycd/

The totality of Indo-European languages is thus identified by the same form of alpha4 langtag, in this case /ineu/, as the local form of the Welsh language in west Wales, identified by /prsl/. Between these two extremes, alpha4 langtags are likewise used to identify the Celtic languages within Indo-European, /celt/; the "Brythonic" or Britannic languages within Celtic, /brtn/; the "Welsh" or Cymraeg language itself, /cymr/; and the inner layer of "Southern Welsh", /cyde/. Note the duality of language names in Welsh (autonyms) and in English (exonyms, in brackets).

The application of LS 639 alpha4 langtags to all levels of linguistic identification has the following advantages:

1. With hundreds of thousands of potential combinations, LS 639 is able to represent the actual scale of complexity of spoken languages around the world. [16]

2. The full range of 25,000+ alpha4 langtags is already established and will be available from August 2004 for software development (as XML lang tags) and other purposes. [17]

3. The mnemonic form of most alpha4 langtags favours human readability alongside an essential machine readability.
Although machines have no need for mnemonic identifiers, communities of speakers are likely to prefer the “meaningful” tagging of their languages based on their own autonyms.

4. High granularity gives LS 639 a refined power of definition, allowing "languages" to be identified in terms of their components rather than the reverse. [18]

5. The correlation of LS 639 tags with all other forms of language identifiers will support all legacy databases with fixed 2- or 3-character fields for language identifiers.

6. LS 639 supports the parallel use of ISO 639-2, with its proposed extensions (639-3 and 639-5), since each alpha3 tag will be precisely definable in terms of its alpha4 equivalents, covering its components and wider linguistic context. [19] See section 4 below.

7. The use of alpha4 langtags at all levels will facilitate, whenever required, the future redefinition of any "language" as a "variety" of a wider language, or as a "collection" of two or more languages, without changing its LS 639 tag. Such changes of layer of classification (i.e. level) need not affect the application of the relevant identifiers.

8. Each fixed alpha4 langtag is located by reference to its coded and potentially adjustable place on the Linguasphere Scale [20]. Information on the classification of each referent is contained in the relationship scale rather than in the alpha4 langtag itself.

15 An extra outer layer is necessary at this point (although 50BA is identical in content to 50BAa), because 50B=Brythonic subdivides first into 50BA & 50BB, i.e. Welsh versus Breton + Cornish, before subdividing into the three related languages (50BAa, 50BBa & 50BBb, i.e. Welsh, Cornish & Breton).
16 In contrast to alpha3 tags, which are limited to just over 17,500 combinations, adequate for the designation of entire languages but insufficient for the more comprehensive task of distinguishing linguistic varieties and components.
17 If LS 639 is accepted as the basis of a NWIP (New Work Item Proposal) by ISO/TC37/SC2, meeting in Paris in August 2004, then a period of public review of the 25,000+ identifiers will need to be agreed and organised before they are confirmed as part of ISO 639-6 or other international standard.
18 In contrast to alpha3 tags, which depend on the a priori definition of individual "languages" and "language names".
19 In this context, the Linguasphere Observatory welcomes close consultation and collaboration with ISO TC37 and SIL/Ethnologue.
20 Dalby (2000a), pp. 58-74.

4. LS 639 and ISO 12620 The proposed expansion and refinement of ISO 639 coincides with proposed and ongoing work regarding the development of metadata registries for language resources (by sub-committee TC37/SC4). Within these registries, language identifiers will be needed both for use in the language resources they are used to describe or standardise, and also to act as keys to metadata in the definition of the metadata (e.g. XML coding) itself. Based on work carried out initially in ISO 12620:1999, which described so-called “Data Categories” found in terminological collections, ISO 12620 is being revised in conformity with ISO 11179-3 to describe the management of data categories, with subsequent parts providing descriptions of validated data categories, for example part 2 for terminological data categories. The parallel development of these (sets of) standards will provide a link between the creation and management of language identifiers and their management and use within software systems via metadata registries, enabling and ensuring interoperability between language resources that may use differing systems of language identifiers (at the very least). Use of Data Categories for specific types of language resources has been described for terminologies in ISO 16642 (Terminological Markup Framework): see Gillam et al 2002 for a discussion of these standards. The revision of 12620 has produced a data model that can be used to describe identifiers. This model for description, based on ISO 11179-3, requires at minimum a canonical reference name to be attached to the identifier, and to its description. The reference name can be considered as a “conceptual” identifier that can have various forms in different languages: this model parallels the terminological metamodel of ISO 16642, although we are effectively outside the language, perhaps at a metalanguage level. Since ISO 12620 deals with language resources, a prime consideration of language resources is the language, so the various existing and proposed parts of ISO 639 provide various “pick list” values that could be used, in combination with a field such as “language identifier” from ISO 12620:1999, for the description of a language resource. Setting aside the administrative aspects of both LS 639 (and by extension the current and proposed parts of ISO 639) and ISO 12620, the question of what information is contained in such a registry remains open. The alpha2, alpha3 and alpha4 identifiers with their canonical names, and their language names are essential. Here it is important also to consider the mapping between these sets of identifiers: ISO 639-1 provides the single cy identifier; ISO 639-2 provides a bibliographical wel and a terminological cym. The SIL Ethnologue provides the tag /WLS/ to cover all Welsh, and lists Northern Welsh, Southern Welsh and Patagonian Welsh without further coding. The structure for the wider

classification of the SIL identifier is similar to that of the LS 639 /cymr/ identifier, except that an historical "Insular Celtic" level is added by SIL.

SIL: Indo-European > Celtic > Insular > Brythonic > Welsh /WLS/ (United Kingdom)
LS 639: Indo-European > Celtic > Brythonic > Welsh or Cymraeg /cymr/ …

To support these multiple systems, some form of mapping between the identifiers is required, but there is also potentially the need for supporting multiple hierarchies amongst these identifiers, since points of convergence between the systems allow the potential for additional divergence, and access to further identifiers such as those presented earlier. Consideration of a standardized reference name for each language and for its language variants is also important (for the 70,000+ names in LS 639 and the 41,000+ in Ethnologue). Hence, if we consider /cymr/ as the identifier, this can be named in English as /Welsh/ or in Welsh as /Cymraeg/ (these names are independent of their presentation in a specific resource that uses them). It has the equivalents cy, wel, cym and WLS. Describing the description is interesting, since we need to say that /cymr/ = Cymraeg (in the language itself) or Welsh (in English). This could be expressed as /cymr/ = /cymr/ Cymraeg or /engl/ Welsh, or as /cymr/ = Cymraeg or /engl/ Welsh (where the lack of a tag implies the use of the "autonym", i.e. the speakers' own name for the language). The languages of the names of languages therefore need to be specified using the same system of identifiers which are identified by the names! This is further complicated by the fact that the name of a language may vary among the varieties of the language, so that the Welsh autonym could in fact be subdivided to provide tags that represent these varieties, for example /cyst/ Cymraeg, /cyde/ Cymrâg, etc. Finally, to fully include LS 639 identifiers, and to an extent also SIL Ethnologue identifiers, in a 12620-specified Data Category Registry, the identifiers used to describe the language identifiers can also be adopted or included. To cater for the structure of these systems, categories such as /continent/, /Africa/, /Australasia/, /Eurasia/, /North America/, /South America/ are required. The system of language identifiers is then itself a language resource, produced by selecting and organising such identifiers.

18 In contrast to alpha3 tags, which depend on the a priori definition of individual "languages" and "language names".
19 In this context, the Linguasphere Observatory welcomes close consultation and collaboration with ISO TC37 and SIL/Ethnologue.
20 Dalby (2000a), pp. 58-74.
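The mapping problem sketched above can be made concrete with a small, hypothetical registry entry. The following Python fragment is only an illustration of the idea, not the ISO 12620/11179-3 data model itself; the field names are assumptions, while the identifier values (cy, wel, cym, WLS, /cymr/) and the names Cymraeg and Welsh are those discussed in the text.

    # A minimal sketch of one registry entry that ties a conceptual language
    # identifier to equivalents in other code sets and to names expressed in
    # identified languages. Field names are illustrative assumptions.
    welsh_entry = {
        "id": "cymr",                       # LS 639 alpha4 langtag
        "equivalents": {
            "iso639-1": "cy",
            "iso639-2/B": "wel",            # bibliographic form
            "iso639-2/T": "cym",            # terminological form
            "sil": "WLS",
        },
        # Names of the language are themselves tagged with identifiers from
        # the same system; the entry's own id marks the autonym.
        "names": {
            "cymr": "Cymraeg",              # autonym
            "engl": "Welsh",
        },
    }

    def name_of(entry, in_language=None):
        """Return the reference name in the requested language, else the autonym."""
        if in_language and in_language in entry["names"]:
            return entry["names"][in_language]
        return entry["names"][entry["id"]]

    print(name_of(welsh_entry, "engl"))     # Welsh
    print(name_of(welsh_entry))             # Cymraeg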

5. Conclusion and Discussion

The role of the Linguasphere Observatory in the next stage of the development of ISO 639 was recognised in a generous resolution at the ISO meeting held in Oslo in August 2003: "ISO/TC37/SC221 appreciates the valuable practical and theoretical input from the Linguasphere Observatory (Wales) and the British Standards Institution in conjunction with the work with language coding carried out in ISO/TC37/SC2/WG1 and … requests David Dalby of the Linguasphere Observatory to develop further the proposal ISO 639-6 Codes for the representation of names of languages – Extension coding for language variation for use in conjunction with other parts of ISO 639 … and to submit a New Work Item Proposal with a corresponding Working Draft by 2004-05-31 for discussion at the next meeting [of ISO/TC37/SC2/WG1, to be held in Paris in August 2004]."

LS 639, its potential adoption as ISO 639-6, and the use of the LS 639 alpha4 langtags within metadata registries will facilitate the following:
• a "road-map" for the adoption of its more extensive set of alpha4 identifiers, including optional migration from alpha2 and alpha3 identifiers,
• the geographical mapping of alpha4 tagged items. Some of this work has already been undertaken among members of the Linguasphere network, including cartography in UK (centred on Africa), in France (centred on the Himalayas) and in Russia (centred on the Caucasus),
• referential transparency. For example, when we refer to the "English language" it is often unclear whether or not we are referring to the standardised written (and spoken) language. Are we ignoring the minor differences between American, British and other conventions in the standard written language? Alternatively, are we referring collectively to all forms of the

English language, including every spoken "dialectal" variety in the world and all recorded written forms, past and present?
• no duplication of identifiers of ISO 15924 (used to designate the scripts of the world, or – by extension – the communities who use each script). LS 639 and ISO 15924 identifiers may be combined as required, providing that a standardized means for doing so is adopted also.

The dissemination and use of such a system will be important in the fields of business, government, education, social research and the media. Assisting international consortia by the introduction and use of the LS 639 system will be a valuable scientific contribution from Europe. The authors invite discussion regarding development of a road-map for the implementation of a 12620-compatible set of language identifiers which covers aspects identified in previous sections, and which will include a description for the so-called "concatenation" of identifiers from separate systems, for example to clarify the use of en-GB-Latn (denoting the British variety of contemporary Standard English, as written in the Latin script) versus eng-Latn or engl-latn (covering all forms of English written in that script at any time).

Beyond this important technical task lies the prospect of laying the carefully planned foundations and architecture for the next stage in the progressive harmonisation of a multilingual world. This will involve the progressive integration of the dictionaries, thesauri, and electronic translation programs of the languages of the world into a single, multilingual database. Specification of languages will be the central parameter in the construction of this global resource for documentation, translation and interpretation, and in the parallel identification, assessment and development of individual language communities. An efficient system of identifiers for the languages and language communities of the world, in a framework which allows for both change and growth, will be an essential foundation for the global documentation of humankind.

The increasing mobility and dissemination of language communities around the urbanised planet has been paralleled by the electronic transformation of the spoken word into the principal medium of worldwide communication and instant documentation. The observation, understanding and creative exploitation of both phenomena will require the transparent, accurate and unambiguous identification of every spoken, written and sign language, including each component variety, community and recorded corpus, from the most globalised to the most localised. This need for a coherent system of universal linguistic identification is accelerating as electronic communications and speech applications become available to communities of all sizes around the globe in their own languages. The role for such a system is rapidly expanding as demands increase for multilingual translation and interpretation, including subtitling and dubbing. As the multilingual character of megacities - and of the world itself - develops and changes, there is an urgent need for a global system of linguistic and ethnic identification and documentation.

Any institutions or individuals who wish to participate in the further development of the Linguasphere System (including LS 639), and in the updating and expansion of the Linguasphere Register, are asked to contact [email protected] without delay.

21 Sub-Committee SC2 of TC37 is responsible for language coding.
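As a purely illustrative aside, the concatenation question raised above can be pictured with a few lines of Python. The splitting rule used here (discriminating region and script codes by their length) is an assumption made for the sketch and is not prescribed by ISO 639, ISO 3166-1 or ISO 15924.

    # Decompose "concatenated" identifiers combining a language code (ISO 639
    # or LS 639 alpha4), an optional region code and a script code, as in the
    # en-GB-Latn versus eng-Latn versus engl-latn example above.
    def split_langtag(tag):
        parts = tag.split("-")
        result = {"language": parts[0]}
        for part in parts[1:]:
            if len(part) == 2 and part.isalpha():
                result["region"] = part          # e.g. GB
            elif len(part) == 4 and part.isalpha():
                result["script"] = part          # e.g. Latn
        return result

    for tag in ("en-GB-Latn", "eng-Latn", "engl-latn"):
        print(tag, "->", split_langtag(tag))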

Acknowledgements

In the development of the Linguasphere System, David Dalby acknowledges the cumulative support of the Agence de la Francophonie (formerly ACCT), UNESCO, the School of Oriental and African Studies (University of London), the Leverhulme Trust Fund, the British Academy, the British Standards Institution, Opportunity Wales and ICT Marketing Ltd. He is grateful to Debbie Garside and to Chris Cox (BSI) for helpful advice during the preparation of this paper. Lee Gillam acknowledges partial support for this work by the EU co-funded projects SALT (IST-1999-10951) and GIDA (IST-2000-31123), the EPSRC-funded SOCIS project (GR/M89041) and the ESRC FINGRID project.

References

Constable, P. (2004-01), "Issues to resolve in ISO 639", ISO/TC 37/SC 2/WG 1 N 115.
Baker, P. and Eversley, J. (eds.), Multilingual Capital: the languages of London's schoolchildren and their relevance to economic, social and educational policies. Battlebridge: London. 92 pp. ISBN 1 903292 00 X.
Dalby, D. (1966), "Levels of relationship in the comparative study of African languages", African Language Studies VII (SOAS), pp. 171-179.
Dalby, D. (1977), Language Map of Africa and the adjacent islands. International African Institute: London.
Dalby, D. (1993), Les langues de France et des pays et régions limitrophes au 20ème siècle (Essai de classification… précédée d'une introduction théorique et pratique…). 45 pp. L'Observatoire Linguistique: Cressenville. ISBN 2 9502097 5 0.
Dalby, D. (2000a), Linguasphere Register of the World's Languages and Speech Communities: Volume 1 (Introduction and Index). 300 pp. Linguasphere Press: Hebron (Wales). ISBN 0 9532919 1 X.
Dalby, D. (2000b), Linguasphere Register of the World's Languages and Speech Communities: Volume 2 (The Register). 743 pp. Linguasphere Press: Hebron (Wales). ISBN 0 9532919 2 8.
Dalby, D., Gillam, L., Cox, C. and Garside, D. (2004), "Standards for Language Codes: developing ISO 639-6". Proceedings of the 4th International Conference on Language Resources and Evaluation (forthcoming).
Fitzgibbon, A. and Reiter, E. (2003), "'Memories for life': Managing information over a human lifetime". http://www.nesc.ac.uk/esi/events/Grand_Challenges/proposals/Memories.pdf (4 March 2004).
Gillam, L., Ahmad, K., Dalby, D. and Cox, C. (2002), "Knowledge Exchange and Terminology Interchange: The role of standards". In Proceedings of Translating and the Computer 24. ISBN 0 85142 476 7.
Grimes, B.F. (ed.) (2000a), Ethnologue: Volume 1 (Languages of the World). 14th edition. 866 pp. SIL International: Dallas (Texas). ISBN 1 55671 103 4.
Grimes, B.F. (ed.) (2000b), Ethnologue: Volume 2 (Maps and Indexes). 14th edition. 735 pp. SIL International: Dallas (Texas). ISBN 1 55671 104 2.
ISO 639-1 (1988), "Codes for the representation of the names of languages – Part 1 (Alpha2 code)".
ISO 639-2 (1998), "Codes for the representation of the names of languages – Part 2 (Alpha3 code)".
ISO 15924 (2004), "Codes for the representation of the names of scripts".
ISO 3166-1 (1997), "Country codes".
ISO 11179-3 (1994), "Information technology – Specification and standardization of data elements. Part 3 (Basic attributes of data elements)".
ISO 12620 (1999), "Computer Applications in Terminology – Data categories".
ISO 16642 (2003), "Computer Applications in Terminology – Terminological markup framework (TMF)".

Annotating Syllable Corpora with Linguistic Data Categories in XML

Robert Kelly, Moritz Neugebauer, Michael Walsh & Stephen Wilson

Department of Computer Science, University College Dublin, Belfield, Dublin 4, Ireland
{robert.kelly, moritz.neugebauer, michael.j.walsh, stephen.m.wilson}@ucd.ie

Abstract

The usefulness of high quality annotated corpora as a development aid in computational linguistic applications is now well understood. Therefore it is necessary to have systematic, easily understandable and effective means for annotating corpora at many levels of linguistic description. This paper presents a three step methodology for annotating speech corpora using linguistic data categories in XML and provides a concrete example of how such an annotated corpus can be exploited and further enhanced by a syllable recognition system.

1. Introduction 2002). An MTM is an XML document that structures the corpus of syllables using a standard tag set. Thus, an MTM can be seen as a standardised registry of data categories The need for high quality annotated corpora to assist where each XML tag corresponds to a single data category. in the development of speech applications is now well- It is also important to note that the underlying structure of understood. Furthermore, much research has been con- an MTM encodes a finite-state machine. More specifically ducted into the development of tools which support for the an MTM is an extension of a phonotactic automaton, a fi- acquisition of these corpora. This paper presents one par- nite state representation of the allowable segment combina- ticular methodology which assists the linguist in the devel- tions at the syllable level. Thus, an MTM is in fact a multi- opment and maintenance of annotations. The annotation tape finite-state transducer where the state transition struc- methodology combines both user driven and purely data ture describes at least the syllables in the original training driven techniques. Each stage of the process incrementally corpus. The state-transition structure may also have under- enriches a syllable labelled corpus using well-defined and gone generalisation to further account for well-formed syl- universal data categories. The resulting resource adheres lables not observed in the corpus but which are considered to a standard structure employing linguistic data categories well-formed for the language in question. Generalisation in familiar to speech researchers. relation to finite-state structures and further details regard- Annotated corpora for languages are vital linguistic re- ing MTMs are discussed further in section 2.. Once the an- sources from both a language documentation and an ap- notations have been structured in an MTM the syllables can plications perspective. Recently much emphasis has been be recovered by identifying all acceptance paths described placed on developing multilingual resources for syllable by the underlying finite state structure of the MTM. Each recognition. In order to support ubiquitous multilingual re- acceptance path describes a single syllable which can be source development, a standard language and corpus inde- extracted by concatenating the segment annotations which pendent annotation methodology must be identified. In ad- must be present on every transition of the path. In addition dition, this methodology must employ a standard registry of to the required segment annotations each transition of the linguistic data categories which are also independent of the MTM describes a number of further annotations which in- language in question and corpus being annotated. If such clude at least the following levels of linguistic description; a methodology can be developed then the resulting anno- segment, frequency of segment with respect to the particu- tations will necessarily adhere to a standard format. This lar corpus in each phonotactic context, probability of seg- has far reaching implications for the manner and the extent ment in each phonotactic context, features associated with to which the annotations can be used. For example, multi- segment and also implications between phonological fea- lingual speech applications utilising the annotations can be tures. 
Also, if timing information is available, the MTM implemented in a generic fashion such that the annotations will provide annotations detailing the average duration and can be used as plug and play resources. standard deviation of duration of each segment occurring in The annotation procedure outlined in this paper as- each phonotactic context. In addition, the core MTM sup- sumes the existence of a syllable labelled data set. Such ports task specific data categories, examples of which are a syllable data set may not always be available, especially discussed in section 5.. since corpora tend to be labelled at the segment and word The core annotation procedure itself consists of three level but not at the syllable level. However the procedure stages. Firstly a phonotactic automaton learner takes the has been recently adapted such that annotations can be de- set of syllable annotations and constructs the initial finite- rived from segment labelled data (see section 2.). The tech- state structure of the MTM for those syllables. Thus, after nique aims at structuring the existing syllable annotations this first stage the MTM describes annotations at the sylla- in a standard representation. The representation used here ble level in terms of the induced state transition structure. is the Multilingual Time Map (MTM)(Carson-Berndsen, Following the induction of the initial MTM structure, the second stage of the procedure serves to augment the MTM states; and finally a finite set of state transitions over a given with an annotation describing the articulatory features as- alphabet. In the case of phonotactic automata the alphabet sociated with each segment annotation. Users can define is the inventory of segment labels, thus labels on transitions phonological features which are associated with the seg- represent single sound segments and the allowable sound ments labelling the transitions. This segment-feature bun- combinations are modelled by the state-transition structure. dle correspondence is stored in a separate XML structure As an example, figure 1 illustrates a subsection of a phono- called a feature profile. The third and final stage of the core tactic automaton for English showing only a subset of the annotation procedure is to examine the feature annotations possible sound combinations observed in well-formed syl- of the second stage and using a feature hierarchy. This iden- lables. Note that this automaton is nondeterministic with a tifies phonological feature implications which are then in- unique start state (labelled 0) and transitions labelled with tegrated into the annotation using two categories. The first SAMPA2 phoneme symbols. Also final states are denoted specifies those features which are introduced by the asso- by double circles in figure 1. Phonotactic automata have ciated segment and the second those that are shared with proven useful in speech applications, in particular these segments appearing elsewhere in the MTM. finite-state models of phonotactic constraints are used as the The three step procedure detailed above describes a lan- primary knowledge component in a computational phono- guage independent approach to structuring corpora (which logical model of syllable recognition, the Time Map model may differ widely in structure) in a homogeneous annota- (Carson-Berndsen, 1998), discussed further in section 5.. tion scheme by utilising standardised data categories. 
Thus, A phonotactic automaton allows the Time Map recognition the annotation procedure can be applied to corpus of sylla- engine to decide on the well-formedness of putative sylla- bles from any language, thus supporting the development bles. Given such a syllable, a phonotactic automaton for a and extension of a multilingual phonotactic resource cata- language allows the recogniser to determine if the syllable logue. Furthermore, since the first and third stages are com- is well-formed for the language by attempting to trace an pletely data driven the resources can be acquired rapidly acceptance path through the state-transition structure using and at low cost. The acquired MTMs can then be stored in the individual segments of the syllable as input symbols. a central resource repository. Also, since each MTM has an Returning to figure 1, it is easy to see that according to underlying finite-state structure, the acquired MTMs can be this finite-state structure the combinations /s p l aI n/ and efficiently processed (van Noord, 1997). This ensures that /T r aI/ would be considered well-formed while the com- these phonotactic resources can be easily linked to speech binations /s p l aI p/ and /T aI/ would be considered applications which may require the inspection of the anno- ill-formed. tations and/or the use of the finite-state structure described by the MTMs. 2. Data Driven Induction of Initial MTM The first stage of the annotation procedure is to induce the initial finite-state structure of the MTM. This first stage is completely data driven requiring no user intervention however it requires that a corpus of syllable labelled utter- ances be available from which the initial structure can be induced. If such a syllable labelled corpus is not available, which may well be the case since corpora are typically la- belled at the phoneme and word level but not at the syllable level, then a semi-automatic procedure has been developed allowing syllable annotations to be derived from phoneme annotations with a minimum of user supervision. This Figure 1: Phonotactic Automaton for English (subsection). semi-automatic approach to deriving syllable annotations is discussed after a description of the primary topic of this section, namely the data driven induction of MTM finite- An MTM for a language represents an extension of the state structures from syllable labelled data. Firstly, however basic finite-state structure of the phonotactic automaton for a discussion of phonotactic automata is required since the the language to a multitape finite-state machine with each structure of these automata underlie that of MTMs. tape describing an additional annotation with respect to the A phonotactic automaton is a finite-state representation original segment label. The induction procedure outlined in encoding the allowable sound combinations that are valid this section will further annotate the original segment label for a language at the syllable level. Since a phonotactic au- by introducing two additional transition tapes describing tomaton is a finite-state structure, it consists of a number the frequency and relative frequency (probability) of oc- of states with some state designated as the initial or start currence of the segment in each phonotactic context with state1; a subset of the states designated as accepting or final respect to a supplied training corpus. 
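The behaviour described here can be illustrated with a small sketch. The transition table below is an assumed toy fragment, not the automaton of figure 1 (which is not reproducible from the text), but it accepts the combinations /s p l aI n/ and /T r aI/ and rejects /s p l aI p/ and /T aI/, mirroring the example discussed in this section.

    # A minimal sketch of a phonotactic automaton: states, a unique start
    # state, final states, and transitions labelled with SAMPA segments.
    START = 0
    FINALS = {4, 5}                    # states that may end a well-formed syllable
    TRANSITIONS = {
        (0, "s"): 1, (1, "p"): 2, (2, "l"): 3, (3, "aI"): 5, (5, "n"): 4,
        (0, "T"): 6, (6, "r"): 3,      # /T r .../ joins the same vowel state
    }

    def well_formed(segments):
        """Try to trace an acceptance path through the automaton."""
        state = START
        for seg in segments:
            if (state, seg) not in TRANSITIONS:
                return False
            state = TRANSITIONS[(state, seg)]
        return state in FINALS

    print(well_formed(["s", "p", "l", "aI", "n"]))   # True
    print(well_formed(["T", "r", "aI"]))             # True
    print(well_formed(["s", "p", "l", "aI", "p"]))   # False (no /p/ after /aI/)
    print(well_formed(["T", "aI"]))                  # False (/aI/ cannot follow /T/)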
Also, if timing in- formation is available, two further tapes will be added en- 1Note that we assume here a unique start state for all finite- state structures. It can be easily shown that given a machine with hancing the annotation to include the average duration and multiple start states, an equivalent machine can be constructed having a single start state. 2http://www.phon.ucdl.ac.uk/home/sampa/ standard deviation of duration of each segment occurring in This then provides two additional levels of annotation in each phonotactic context. the MTM. The additional annotations described above can all be As mentioned previously the initial MTM inferred from derived through the application of a regular grammatical the corpus of syllable labelled data is stored as an XML inference algorithm applied to the initial corpus of syllable document, the structure of which is rigorously governed by labelled data. We assume here that syllable labelled data an XML schema. The schema specifies the set of allowable consists of a phonemically labelled utterance (with or with- annotation tags that can appear in an MTM. MTMs speci- out timing/durational information) with syllable boundaries fied according to this schema can be easily reused for con- marked. This initial corpus and represents the training cor- tinuing the annotation procedure described here and also pus for the inference algorithm and corresponds to the base for other applications that can take advantage of the cho- annotation which is to be structured and further annotated sen interface format. A portion of an inferred MTM show- using the MTM schema . As discussed in (Kelly, 2004, ing the marked up structure of the unique start state and a Section 3) it is meaningful to apply a regular inference al- single transition detailing the different levels of annotation gorithm here since the number of syllables in any given discussed above is shown in figure 2. The single transi- language represents a finite language (in the formal sense) tion is from the state labelled 0 to the state labelled 2 with and also since syllable phonotactics have been shown to be five tapes of information; a segment label (phoneme /s/), representable as finite-state machines (i.e. as phonotactic a frequency of occurrence, average duration and standard automata). Since we wish to further annotate the syllable deviation of /s/ in the phonotactic context of the transition labelled corpus with probabilities, a stochastic inference al- from state 0 to state 2 (in seconds) and also a weight tape gorithm is required. Note that if probabilities are not re- denoting an inverse log probability for the transition. quired then we can apply the stochastic inference procedure and ignore the inferred probabilities. For the task at hand the choice of stochastic inference procedure is in fact ar- bitrary, however for the annotation methodology outlined 0 here we utilise an implementation of the ALERGIA regular inference algorithm (Carrasco and Oncina, 1999). ... ALERGIA uses the syllables of the training sample to first build a deterministic Prefix Tree Automaton (PTA) ac- cepting exactly the supplied syllable data set. Following 0 this, each pair of states in the PTA is compared using state 2 frequencies derived from counts of syllables that terminate s at states and transition frequencies derived from the com- 2 mon prefixes of syllables that occur in the sample. 
If a 0.1045 0.0059 pair of states is found to (statistically) generate the same 2.302 tail language based on the above frequencies then they are deemed (statistically) identical and merged. Through this ... state merging process a minimal stochastic deterministic automaton is inferred. Space prohibits a full discussion of the ALERGIA algorithm however further details con- cerning the inference algorithm applied to the task of learn- Figure 2: Portion of the XML representation of an MTM. ing phonotactic automata can be found in (Kelly, 2004) and (Carson-Berndsen and Kelly, 2004). The application of a regular inference algorithm at this Since the automata inferred by ALERGIA are stochas- stage of the annotation procedure to automatically induce tic in nature, the frequencies and probabilities of states and syllable phonotactics requires that a corpus of syllable la- more importantly transitions can both be output as transi- belled utterances be available, which may not always be tion tapes. The transition frequencies correspond to the fre- the case. To counter this potential shortcoming, a semi- quency with which segments on the associated segmental automatic incremental approach has been developed to de- tapes occur in particular phonotactic contexts. Similarly, rive syllable annotated data from phonemically labelled ut- the probability of transitions corresponds to the relative fre- terances. The derivation is carried out using an annotation quency with which the segments on the associated segmen- assistant which successively displays phoneme labelled ut- tal tapes occur in particular phonotactic contexts. Further, terances from a given corpus to a user together with sug- the ALERGIA algorithm can be easily extended to take ac- gested syllable boundaries. The suggested boundaries are count of timing information that may be available in the derived from a partial phonotactics that the system has built corpus. Thus, if durational information relating to the seg- from previously annotated utterances. Consequently, the ments of syllables is supplied as part of the initial annota- boundaries may or may not be correct and are subject to tions, e.g. start and end times of segments as seen in the user verification after which the syllable annotation is in- TIMIT corpus (Garofolo et al., 1993), then an average du- tegrated into the partial phonotactics. As successive utter- ration and standard deviation of duration for each segment ances are syllabified the partial phonotactics becomes more in each phonotactic context can easily be extracted and in- complete and following a number of user supervised an- tegrated into the inferred MTM as two additional tapes. notations the system can run in a fully automatic mode, syllabifying the remaining utterances and building a more way we seek to reduce input replication and redundancy. complete phonotactics as it annotates. The system uses the The symbol-feature associations can be encoded in one of chosen regular inference algorithm (again, ALERGIA is three ways: as mappings between symbols and unary, bi- used in this particular case) to build the partial phonotactics nary or multilevel feature structures. Unary features may after each syllable annotation and consequently a syllable be considered to be properties that on their own can be as- phonotactics based on the corpus is produced in addition signed to segments; binary features are attribute/value pairs to the syllable annotations. 
The phonotactics can be output which have two mutually exclusive values; multilevel fea- as an MTM in XML and delivered directly to the second ture structures consist of a number of tiers of information, phase of the annotation procedure as discussed in the fol- each of which has an associated set of phonological fea- lowing section. Further details on the annotation assistant tures as parameters, from which one is chosen. Mappings can be found in (Kelly, 2004) and (Carson-Berndsen and between symbols and unary feature entities require that the Kelly, 2004). entire set of possible features be first input to the module. From this data a Document Type Definition (DTD) is auto- 3. Feature Set Definition matically created. This DTD is used to provide top-down This section describes a feature definition module which constraints on the validation of future feature profiles which facilitates user driven association of a multilingual feature make use of the same feature set. From the DTD, a graph- set with a set of phonological symbols. This information ical feature input interface is generated. Symbol-to-feature is stored in an XML tree structure, a feature profile, and is mappings are created by using the module’s graphical in- used to annotate a particular MTM with segment specific terfaces. Users first select a symbol and then click on those symbol-to-phonological feature associations. An important features which they wish to associate with it. Binary and consideration in the design of the module was to remove multilevel features are defined in a similar fashion, with the all necessity for technical knowledge of the operational and additional step for multilevel structures of first inputting the denotational workings of the technologies employed on the tiers required, followed by the values each tier can take. part of the user. At the same time, it was requisite that as- Once the associations have been defined, the module adds sociations between symbols and features be defined within them to the internal feature profile structure. a coherent and useful structure that allowed easy access to the data by a range of applications and processes. The mod- ule provides an intuitive environment allowing users to de- fine mappings between symbols and phonological features using only graphical representations of the data. The mod- ule encodes these feature definitions internally within an XML based feature profile. The structure of a feature pro- file is shown below (figure 4). Using XML as the data ex- change format guarantees data portability across platforms Figure 3: Example Document Type Definition. and applications, while the module’s interfaces ensure that the user need only deal with the data graphically for the purposes of feature definition, editing and display. 3.2. Interfaces 3.1. Feature Profile Creation The feature definition module provides a number of Feature profiles have an underlying XML representa- user interfaces that allow the users to display, modify or tion that comprises any number of user defined feature as- add to the data stored within the feature profile. In ac- sociations, each of which individually consist of a sym- cordance with the objective outlined above, all interfaces bol and a feature bundle. In addition, each association is enable users to manipulate the data using only its graphi- annotated with a tag, denoting those lan- cal representations. 
Working with the graphically displayed guages for which that particular symbol-feature association feature profile tree, users can perform a number of editing is valid. In this way the feature set may be described as functions: adding or removing features; changing symbols; multilingual, as it is intended to provide a full inventory of deleting or modifying tier information etc. Any changes phonological features for a complete symbol set across a which affect the underlying document’s structure as defined number of languages. While the feature set is shared by by its DTD - the inclusion of an additional tier, for example all languages, the symbol set is language dependent. For - are automatically updated within the DTD, subject to user the definition of the multilingual feature set, a dynamic confirmation. The default symbol set used throughout has approach to interface creation is adopted. Using the data an underlying representation as IPA-Unicode. However, a from the induced MTM of section 2., the module automat- notation transducer allows users map a set of defined fea- ically generates a feature input interface by extracting ev- ture associations from this IPA representation to a number ery unique symbol occurrence from the automaton’s net- of alternative phonetic alphabets (e.g. SAMPA, WordBet, work. Since specifying the full state-transition structure of ARPAbet etc.). Interfaces are also provided for performing the phonotactic automaton underlying an MTM is an in- some data manipulation, e.g. extracting a language spe- cremental process, subsequent passes through the growing cific profile from the feature profile’s multilingual super- network generate input interfaces only for those symbols structure. which do not yet appear in the feature inventory. In this Having defined a rich set of multilingual features, we seek to extract as much useful information from it for cor- To obtain this valuable information, while equally elim- pus annotation. The following section describes how we inating the need for manual effort, we propose a compu- generalise over the data within feature profiles, seeking to tational method based on automated deduction that delivers optimise the information and highlight feature dependen- correspondences between individual features as well as cor- cies. respondences between all sets of sounds created by these combinations. Once the phonological feature associations 4. Data-driven Induction of Phonological have been defined via the previous module, we traverse the XML-trees with the aim of performing as much determin- Implications istic inference as possible. We apply our algorithm to au- Phonological implications are commonly specified tomatically generate feature hierarchies similar to inheri- in terms of hand-crafted rules which capture the set- tance hierarchies in unification-based grammar formalisms, theoretical relation of subsumption between two sets of where features are ordered with respect to the size of their phonemic units (Gazdar et al., 1985). This relation is of- extents, i.e. the segment set they denote. This procedure ten assumed for instance between the feature nasal and the is illustrated in figure 5. With regard to implication rules, feature voiced, expressing the observation that every nasal the feature [rd] implies the feature [semihi] for the example segment is also voiced but not vice versa. 
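The induction step described here can be sketched in a few lines. The toy extents below are loosely adapted from the vowel sets quoted with figure 7 ({iy, uw}, {iy}, {ao, uw}, {ao}, {iy, ao, uw}), with an extra unrounded vowel ih added so that the rule [round] → [voiced, vocalic] comes out unidirectional, as in the text; the data are illustrative only.

    # Compute the extent (segment set) of every feature and read implications
    # off subsumption between extents, distinguishing bi-directional from
    # unidirectional rules as in the feature profile enrichment step.
    extents = {
        "voiced":  {"iy", "ih", "ao", "uw"},
        "vocalic": {"iy", "ih", "ao", "uw"},
        "high":    {"iy", "ih", "uw"},
        "front":   {"iy", "ih"},
        "back":    {"ao", "uw"},
        "semilow": {"ao"},
        "round":   {"iy", "ao", "uw"},
    }

    def induce_implications(extents):
        """Feature A implies feature B whenever extent(A) is a subset of extent(B)."""
        rules = []
        for a, ext_a in extents.items():
            for b, ext_b in extents.items():
                if a != b and ext_a <= ext_b:
                    kind = "<->" if ext_a == ext_b else "->"
                    rules.append((a, kind, b))
        return rules

    for rule in induce_implications(extents):
        print(*rule)        # e.g. round -> voiced, front -> high, voiced <-> vocalic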
While we on the given below, since the set of round vowels is subsumed by one hand believe in the usefulness of such rules even for the the set of semi-high vowels. annotation of speech corpora, we argue on the other hand that it is favourable to induce these rules automatically for two reasons: firstly, the user-defined hsymbol, featurei- pairs which have been defined in the annotation during the step described in the previous section, may eventually be changed by the user at a later stage; this would require an undesirable manual re-evaluation of all rules. Secondly, there is no limit on the size of the phonological feature set which means that the set of implication rules may be hard to establish by hand, if this is possible at all. Thus, we present an automated method to achieve this informative set of im- plicational rules which is then used to further augment the present format of multilingual time maps. The above motivations for phonological feature profiles lead to an expressive knowledge base which provides a fine- Figure 5: Induction of subsumption hierarchies. grained level of description for the modelling of individ- ual phonological segments. However, we acknowledge the We choose to “multiply out” every single combination fact that such a rich set of features – despite its descriptive of features to achieve its extent in terms of phonological value – might not be easily accessible for manual optimi- segments. Finally this information is used to enrich the cur- sation, such as identification of implicational relations be- rent phonological feature profiles with two elements distin- tween individual features as well as possible combinations guishing between bi-directional and unidirectional implica- of features. Segment entries of the kind illustrated in fig- tions. To carry out efficient updates on our lexical knowl- ure 4, in this case segment [l], represent the input data for edge base we use XSL which is a stylesheet language for our automated method to extract information about feature transforming XML documents. In our case we intend to distributions in our database. unify the set of all feature profiles with generalisations over this particular set yielding a more expressive feature profile. The following example, figure 6, displays the feature asso- l ciation for the segment [l] after it has been enriched with all logical implications gained from multiple tree traversal ENG in our lexical knowledge base (note: the “features” subtree GER has been omitted): We can see from this single entry that we are now able consonantal, to say that the segment [l] introduces the feature [lateral] to lateral, our feature trees which in turn means that we can infer the nonvocalic, presence of a segment [l] and all its additional features sim- ...,coronal ply given the featural information [lateral]. Furthermore, we can observe that all features apart from [lateral] do not ... imply presence of the segment in question since they also ... occur in feature associations of other segments. All this in- formation is based on automatically generated feature hier- archies as described in earlier work (Neugebauer and Wil- Figure 4: Entry for [l] after the first module. son, 2004). The core of this work consists of an algorithmic method to deduce hierarchies which encode inheritance re- high voiced, vocalic, round {iy, uw} front voiced, vocalic, round {iy} l back voiced, vocalic, round {ao, uw} ENG semilow voiced, vocalic, round {ao} GER round voiced, vocalic {iy, ao, uw} ... 
Figure 7: Examples for induced implication rules lateral consonantal, ciated symbol within the network. Similarly, once the fea- nonvocalic, ...,coronal ture profiles have themselves been augmented with infor- mation regarding optimised feature sets, further tapes indi- cating feature redundancy or uniquity can be extracted and ... dispersed throughout the MTM. ... The next section provides a practical example of how the results of the methodology presented can be explicitly utilised by a linguistic application, more specifically by a Figure 6: Entry for [l] after the second module. syllable recognition system. 5. Enrichment for Syllable Recognition lationships among sets of segments and features for a single The previous sections of this paper have discussed an language or even for different languages. For the purposes incremental approach to enrich phonological corpus anno- of this paper, the following three steps summarise the pro- tation at different levels. This section discusses how the cedure which augments the introducing and sharing nodes MTM and its contents can be used by a computational in the feature trees. phonological model of syllable recognition, and how the testing phase can be employed to further enrich the annota- 1. for each feature defined in the feature profile, traverse tion of the corpus. the individual feature associations to determine the ex- The Time Map model (Carson-Berndsen, 1998) is a tent (the segments for which the feature is defined) of computational phonological model, which is directly ap- each single feature plicable to speech recognition, that employs a phonotac- tic automaton and axioms of event logic to interpret mul- 2. create all the sets of segments which are denoted by tilinear feature representations of speech utterances. The single features and if the set contains only exactly one model distinguishes two temporal domains. The first do- element, add an introducing node to the feature tree of main is absolute signal time where features are considered that particular element as events with temporal endpoints. The features are ex- 3. compute the complement of the feature subset in ques- tracted from the speech signal of an utterance using Hidden tion and store the result as a value of the created shar- Markov Model techniques and have extraction probabilities ing node of the feature tree associated with them. The utterance is then represented as a multilinear structure of phonological events. Figure 8 il- Our method does not only serve to establish implications lustrates an example of a multilinear representation (in ab- which hold for the domain of single segment entries but solute time) with events on different linguistic tiers (such as since we generalise over the all entries we also achieve phonation, manner, place etc.). Note that the model is not similar information for whole sets of sounds. Consider for procedurally bound to any one particular feature set. The instance the following implicational generalisations which second temporal domain is relative time and considers only are provided in the table below: if we know the features in the temporal relations of overlap and precedence as salient. the leftmost column, we can infer the features to their right. Input to the model is in absolute time. 
However the parsing The set of sounds in the final column is the set of sounds process takes place in relative time using only the overlap which share the union of unique and shared features; in the and precedence relations, and is guided by the phonotac- last row all elements which carry the feature [round] are tic automaton which imposes top-down constraints on the displayed which maps onto the set of round voiced vow- feature overlap relations that can occur in a particular lan- els which is expressed in the following implication rule: guage. The Time Map model has been implemented in both [round] → [voiced, vocalic]. a generic framework (Carson-Berndsen and Walsh, 2000) The information within the resulting feature tree is used and a multi-agent environment (Walsh et al., 2003). to further annotate the MTM of section 3.. The fully spec- In brief, the MTM (XML) representation of the phono- ified feature associations for each symbol within a partic- tactics of a given corpus is used by the parsing algorithm as ular feature profile are extracted and used to construct an an anticipatory guide. For example, from the initial node additional input tape for the MTM. The mapping compo- of the phonotactic automaton, which defines well-formed nent of the optimisation procedure discussed in this section syllables of the corpus, a number of candidate segments are traverses the MTM and inserts a tape containing the phono- anticipated. The MTM representation specifies these seg- logical feature information for each occurrence of the asso- ments with respect to their constituent temporally overlap- 0.7 labial voiced 1

Figure 8: Multilinear representation of [So:n].
Figure 9: Adding overlap constraints.

the total sum of all the ranks. Each time a constraint is recognised in the input its rank is added to a running to- ping phonological features. It is these overlapping features tal. If this total reaches the threshold then the segment is in the multilinear feature representation of the speech ut- deemed to be recognised. The rank and threshold values terance which the parser seeks to identify for each of the presented above are arbitrary, provided merely for the pur- anticipated candidate segments. poses of illustration. The next step is to acquire real values The parser gradually windows though the utterance, for ranks and thresholds. guided by the phonotactics, and each time the constituent The acquisition process is divided into three stages. The features of an anticipated segment are found the segment first stage involves HMM training, for each feature in the is considered recognised and the parser traverses the tran- feature set, using 70% of the corpus. The second stage sition in the phonotactics. Each time a final node in the takes this same 70% for testing, i.e. running the parser over phonotactics is reached a syllable hypothesis is recorded. this subset of the corpus. It should be noted that this is This represents a concrete and novel application of how a performed with retraining in mind, not for the purposes of richly annotated analysis of a corpus can be exploited by a producing overall system recognition results (which typi- syllable recognition system. cally involves testing with the remaining 30%). The re- It is worth noting however that aspects of the recogni- sult of parsing the corpus subset is a corresponding output tion life-cycle can be harnessed to further add to the anno- file for each utterance in the subset, containing a string of tation. While the XML-based phonotactics presents seg- phonemes. The following example is used to describe the ments with their constituent overlapping phonological fea- third stage. Figure 10 presents a window between 150ms tures there is no indication of the relative importance of and 180ms of a speech utterance where three phonological these feature overlaps. In the context of Time Map sylla- features x, y and z (presented with their extraction probabil- ble recognition, where each segment is recognised through ities) all overlap in time. the satisfaction of a number of feature overlap constraints, it would be desirable to know the relative importance of these constraints to the recognition of the segment as a whole. In particular, given that the phonological features present in the multilinear representation of the utterance may be extracted with low probability from the signal, or that un- derspecification occurs in the representation due to back- ground noise etc., it would be beneficial to know which overlap constraints are most important. With respect to the phonotactics it would be desirable to enhance the MTM representation with a ranking of feature overlap constraints for each phoneme segment. In this way the MTM represen- tation explicitly captures the relationship between features. Furthermore, given the declarative nature of the representa- Figure 10: A sample window. tion it can be used not only for speech recognition but also in other speech domains, for example speech synthesis. 
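The rank computation outlined in this section can be sketched as follows. The extraction probabilities assumed here (0.3, 0.7 and 0.45 for x, y and z) are not stated explicitly in the paper; they are chosen so that the pairwise products match the figures quoted in the worked example of this section (0.21, 0.135 and 0.315).

    # Rank of each pairwise overlap constraint = product of the extraction
    # probabilities of the two contributing features, then normalised to sum
    # to 1; a segment counts as recognised once the running total of
    # satisfied-constraint ranks reaches the threshold.
    from itertools import combinations

    extraction_prob = {"x": 0.3, "y": 0.7, "z": 0.45}   # assumed values

    raw = {(a, b): extraction_prob[a] * extraction_prob[b]
           for a, b in combinations(sorted(extraction_prob), 2)}
    total = sum(raw.values())
    ranks = {pair: value / total for pair, value in raw.items()}

    for (a, b), r in ranks.items():
        print(f"{a} o {b}: raw {raw[(a, b)]:.3f}, normalised rank {r:.3f}")

    threshold = sum(ranks.values())                     # all constraints satisfied
    satisfied = [("x", "y"), ("x", "z"), ("y", "z")]
    print(sum(ranks[c] for c in satisfied) >= threshold)   # True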
This augmentation of the phonotactics is achieved by A phonemic segment α is recognised only when three adding the following tags, illustrated in figure 9, as children overlap constraints are satisfied, namely x o y, x o z, and of the element of figure 2. y o z, as in figure 10.3 The third stage involves taking The element and its chil- each output file and comparing it against the corresponding dren define a temporal overlap constraint between two fea- phonemically labelled reference file in the corpus. An edit tures and assign a rank value, a probability, for that con- distance measure is used to calculate where the recognition straint. Obviously a number of such elements are required output and the reference match. Each time a phoneme seg- in order to capture all the feature overlaps which constitute a phoneme segment. The element denotes 3The o symbol denotes overlap. ment is correctly recognised the window of the multilinear Carson-Berndsen, Julie, 1998. Time Map Phonology: Fi- representation of the utterance in which it was recognised nite State Models and Event Logics in Speech Recogni- is re-examined. Each overlap relation in the window which tion. Dordrecht, Holland: Kluwer Academic Publishers. contributed to the recognition of the segment is considered, Carson-Berndsen, Julie, 2002. Multilingual time maps: the extraction probabilities of both contributing features be- Portable phonotactic models for speech technology ap- ing multiplied in order to arrive at a rank value. Taking fig- plications. In Proceedings of the LREC 2002 Workshop ure 10 as an example, α is recognised in the input and is on Portability Issues in Human Language Technology. also present in the same position in the reference file. This Carson-Berndsen, Julie and Robert Kelly, 2004. Acquir- indicates a successful recognition. At this point the window ing reusable multilingual phonotactic resources. To Ap- is re-examined in order to rank the overlaps relative to each pear in Proceedings of the 4th International Conference other by multiplying the probabilities in each relation. For on Language Resources and Evaluation. example the contribution of x o y is 0.21, for x o z it is Carson-Berndsen, Julie and Michael Walsh, 2000. Inter- 0.135, and for y o z it is 0.315. These figures (once nor- preting multilinear representations in speech. In Pro- malised to add to 1) could form the rank value for each of ceedings of the Eighth International Conference on the overlap constraints for α if α only occurred once. How- Speech Science and Technology. Canberra. ever it is likely that a given segment would occur numerous Garofolo, John S., Lori F. Lamel, William M. Fisher, times and hence this process is repeated for each occurrence Jonathan G. Fiscus, David S. Pallett, and Nancy L. and a running total kept for the rank of each constraint. An Dahlgren, 1993. DARPA TIMIT acoustic-phonetic con- average rank value is then derived by dividing by the num- tinuous speech corpus CD-ROM. Technical report, Na- ber of occurrences. tional Institute of Standards and Technology. Following the description above it is possible to fur- Gazdar, Gerald, Ewan Klein, Geoffrey Pullum, and Ivan ther enrich the MTM representation of the phonotactics and Sag, 1985. Generalized phrase structure grammar. Ox- provide data driven statistical values for the relative impor- ford: Blackwell. tance of feature overlap constraints. Knowing which con- Kelly, Robert, 2004. 
A language independent approach to straints are most significant is beneficial as it allows certain acquiring phonotactic resources for speech recognition. constraints to be relaxed, in the case of underspecification, In Proceedings of the 7th Annual Colloquium for the effectively meaning that a segment could be considered sat- UK Special Interest Group for Computational Linguis- isfactorily recognised even though not all of its constraints tics. CLUK04. had been satisfied. This is equivalent to lowering the thresh- Neugebauer, Moritz and Stephen Wilson, 2004. Phonolog- old. ical treebanks - issues in generation and application. To Appear in Proceedings of the 4th International Confer- 6. Conclusions & Future Work ence on Language Resources and Evaluation. van Noord, Gertjan, 1997. FSA Utilities: A toolbox to ma- This paper has presented a systematic language inde- nipulate finite-state automata. In Darrell Raymond, De- pendent methodology for richly annotating speech corpora rick Wood, and Sheng Yu (eds.), Automata Implementa- at multiple levels of linguistic granularity using common tion, Lecture Notes in Computer Science 1260. Springer data categories familiar to speech researchers. The three Verlag, pages 87–108. stage procedure combines data-driven and user-driven ap- Walsh, Michael, Robert Kelly, Gregory M.P. O’ Hare, proaches to annotation. The resulting annotations are stored Julie Carson-Berndsen, and Tarek Abu-Amer, 2003. A in an XML marked-up document known as a Multilin- multi-agent computational linguistic approach to speech gual Time Map embodying a data category registry. This recognition. In Proceedings of the Eighteenth Interna- portable structure incorporates data categories useful and tional Joint Conference on Artifical Intelligence. Aca- common in many areas of speech research and by way of pulco, Mexico: IJCAI-03. example we have shown how such a Multilingual Time Map can be employed and indeed further enriched by a syllable recognition system.
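To make the shape of the Multilingual Time Map concrete, the following sketch indicates how one MTM transition of the kind described with figure 2 might be serialised to XML. The element and attribute names are assumptions made for illustration, not the actual MTM schema; the values (a transition from state 0 to state 2 carrying segment /s/, an average duration of 0.1045 s, a standard deviation of 0.0059 s and a weight of 2.302 as an inverse log probability) are taken from the text, and the frequency value is read from the garbled figure fragment and may be misaligned.

    # Build and print a tiny MTM-like XML fragment with Python's standard
    # library; tag names are illustrative assumptions.
    import xml.etree.ElementTree as ET

    mtm = ET.Element("mtm")
    state0 = ET.SubElement(mtm, "state", id="0", start="true")
    trans = ET.SubElement(state0, "transition", target="2")
    for tape, value in [("segment", "s"), ("frequency", "2"),
                        ("avg-duration", "0.1045"), ("std-duration", "0.0059"),
                        ("weight", "2.302")]:
        ET.SubElement(trans, tape).text = value

    print(ET.tostring(mtm, encoding="unicode"))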

7. Acknowledgements

This material is based upon works supported by the Science Foundation Ireland under Grant No. 02/IN1/I100. The opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the Science Foundation Ireland.

8. References

Carrasco, Rafael C. and Jose Oncina, 1999. Learning deterministic regular grammars from stochastic samples in polynomial time. ITA, 33(1):1–19.

Metadata for Time Aligned Corpora

Thorsten Trippel

Fakultät für Linguistik und Literaturwissenschaft, Universität Bielefeld, Germany
[email protected]

Abstract

For a detailed description of time aligned corpora, for example spoken language corpora and multimodal corpora, specific metadata categories are necessary, extending the scope of traditional metadata categories. We argue that it is necessary to allow metadata on all levels of annotation, i.e. on a general level for catalogues, on the session level for each recording, on the annotation level for multi tier score annotation, and even on the level of individual annotation segments. We use existing standards where they allow this distinction and introduce metadata categories for the layer level.

1. Motivation in practice is currently related to the available tools, signal Metadata descriptions for spoken language corpora and coding and annotation schemes rather than the descriptions multimodal corpora are different from textual resources due thereof. to fundamental differences in structure of multimodal cor- Gibbon et al., 1997b, differentiate between spoken lan- pora. Nevertheless, the detailed description is of high- guage corpora and non-spoken language corpora by eight est importance for the classification of resources enabling characteristics, which are taken from a technical perspec- reusability and portability, which is crucial due to the costs tive but address also content and ethical differences. A of the creation of corpora that are built on this sort of data. technical problem is the volatility of speech data, which is Spoken language corpora are extremely expensive, one main characteristic of speech. The signal disappears as much more than collections of (written) texts. According to soon as it is released. This stresses the need for persistent various investigations, annotation time ranges from the fac- storage of audio signals, covering problems which range tor 10, i.e. 1 minute signal requires 10 minutes for the an- from recording quality (environmental conditions, quality notation, to the factor 100 for the annotation of multimodal of equipment, etc.) to storage (including data formats, data (see for example Gibbon et al., 1997a), which results compression algorithms — if any— and bandwidth, stor- from the time needed for segmentation and transcription de- age space, among others). A major difference, related to pending on the linguistic level. the volatility, is the processing of the language that is ori- Spoken language corpora and multimodal corpora re- ented towards the actual performance time. This has con- quire larger storage and transmitting capacity, as such a sequences for error handling — while, in non-spoken lan- corpus consists not only of annotations, based on audio or guage, a writer might make sure not to show his or her cor- video recordings, but the signal is also part of the corpus, rections, false starts, etc., in spoken language repairs and as it is the signal that enables the use of certain corpora, hesitations occur. The same is true for the recognition of for example, for applications in psycholinguistics, speech words and structures, which, in a written format is given engineering and phonetics, but also to a certain degree for by categories identified by letters, spaces and typographi- sociolinguistics and language varieties studies. To avoid cal symbols. For spoken language utterances the segmen- additional costs for obtaining corpora and for file transfer tation needs additional processing , i.e. unit separation and in addition to locating relevant information, very detailed identification. descriptions of the content are required. 2.2. Data Formats for Spoken Language Corpora and Multimodal Corpora 2. Spoken Language Corpora and Spoken language corpora and multimodal corpora con- Multimodal Corpora sist of two parts: 2.1. Characteristics of Spoken Language Corpora 1. the signal, which is usually an audio or video signal, and Multimodal Corpora which needs to be stored in a processable format. The The representation of spoken language corpora and format algorithms and specifications are not part of multimodal corpora is different from textual ones. 
For ex- this description, but they need to be documented in ample, Leech, 1993, argues that it is impossible to distin- the metadata. guish between the representation and the interpretation for spoken language corpora, as the textual representation of 2. the annotation, which is aligned with the signal us- speech implies the interpretation by an annotator. As anno- ing references to the signal time, so called timestamps. tation of speech is part of multimodal corpora, the same is The annotation itself could be interpreted as metadata true for this more general class of corpora. In fact, in the for the signal, but for the present the metadata dis- following the terms spoken language and multimodal in the cussed are the descriptions for the annotations. context of corpora will be used interchangeably, as the lat- The last point here already refers to another problem, ter is a more general form of the former and the difference which is the distinction between data and metadata, which can become rather obscure and fuzzy. For example for a wordlevel annotation, a wordclass characterisation can be interpreted as meta information on the word, though on a different tier wordclasses could be annotated separately as data. To avoid this, a strong restriction for metadata cate- gories is required, where the content categories are reduced as much as possible, editorial information can be applied automatically by applications, based on the user name and a system date, besides application inheritant technical de- scriptions. Other information can be coded to distinguish different annotation layers from each other and to identify Figure 1: Multi tier annotation using the TASX-annotator. them. Annotation levels: Gesture, words, phrases, gestural func- tion, part of speech, lemma, prosody, and syllables. Annotations are available in two classes resulting from a different background. (Kipp, 2000-2003) and EXMARaLDA (Schmidt, – Document-centred annotation formats, which resemble 2004), which itself uses the data format of the Anno- a textual structure with paragraphs and possibly with tation Graph Toolkit (Bird et al., –2004). headings. The reference timestamps are included us- ing so called milestones (Barnard et al., 1995), which The data-driven formats imply a graph structure which are pointers to a specific point on the linear scale. In is easily modelled by the Annotation Graph Model (Bird XML this is typically done with empty elements hold- and Liberman, 2001). These formats for the annotation are ing attributes that can occur at any position in the doc- mostly equivalent language bindings of this model, having ument. Intervals are defined implicitly as the region certain differences in the metadata and in the interpretation between two milestones or pointers to the timeline. and typing of individual tiers. Data driven annotation formats are the most frequently Document centred annotations typically do not use used formats in the annotation of spoken and multimodal more than one annotation of a signal, everything is data and provide the de facto standard in multi level anno- given on one tier only. tation. Even document-centred annotations can easily be A document centred application for signal data an- transformed into these formats. The TASX format, for ex- notation is Transcriber (Barras, 1998-2003). 
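The implicit intervals of milestone-based annotation can be made explicit with a small conversion step, which is essentially what a transformation into a data-centred, tier-based format has to perform. The following is a minimal Python sketch under simplified assumptions (milestones as (time, label) pairs in seconds); it is not tied to the data model of any of the tools mentioned here.

# Convert milestone-style annotation (points on a timeline) into the
# explicit intervals that data-centred, tier-based formats work with.
# The end of each stretch is implicit: the next milestone, or the end
# of the recording for the last one.

def milestones_to_intervals(milestones, recording_end):
    """milestones: list of (time, label) pairs sorted by time."""
    intervals = []
    for i, (start, label) in enumerate(milestones):
        end = milestones[i + 1][0] if i + 1 < len(milestones) else recording_end
        intervals.append((start, end, label))
    return intervals

if __name__ == "__main__":
    points = [(0.00, "well"), (0.42, "I"), (0.55, "think"), (0.91, "<pause>")]
    for start, end, label in milestones_to_intervals(points, recording_end=1.30):
        print(f"{start:.2f}-{end:.2f}\t{label}")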
The ample, has been used as an intermediate data format for the TEI (Sperberg-McQueen and Burnard, 1994, Section interchange between XWaves, Praat, Transcriber, TASX- 11) provides another structure for the annotation of Annotator, etc. speech, though it does provide a reference to time Figure 1 illustrates a multi tier annotation using the only by pointers to an external timeline (see Sperberg- TASX-annotator. McQueen and Burnard, 1994, Section 14), instead of a direct timestamp. 3. Encoding Metadata Annotation is bound to the word or phrase level, other Metadata as the description of resource properties does linguistic levels of annotations are not intended. not in itself describe a structure. Nevertheless, exist- ing metadata descriptions include and propose structures Data centred annotation formats, in which the annota- for these metadata. These structures vary from simple tion is clearly structured into the units according to attribute-value structures via more structured triples to the level of annotation. These formats explicitly or deeply structured trees. implicitly define intervals on the timeline, i.e. they either represent start and end points of the interval, 3.1. AV-Structures or one or the other is inferred from the preceding or Most existing metadata standards, such as Dublin Core following segment, respectively. Text-like structures, and OLAC, provide an attribute-value structure. Data cate- such as paragraphs, headings, and sections are not rep- gories are named and assigned a specific value. This type of resented; subject classification by keywords is part of structure corresponds to simple Attribute-Value (AV)Struc- the metadata which can be included in some formats. tures that have been in use in knowledge engineering for a long time. It provides a flat structure that can easily be Data centred annotation formats can easily include stored in relational databases. more than one annotation level by adding new tiers The relation between the attribute and value is prede- referring to the same timeline. fined and can be described as a ISA-relation, for example, Data centred applications include software and the author of a resource is the person denoted by the name data formats from the area of phonetics such as given in the value. Praat (Boersma and Weenink, –2004), wavesurfer (Sjolander¨ and Beskow, 2004, Sjolander¨ and Beskow, 3.2. Triples 2000), ESPS waves+ (Entropic, 1998), and tools The predefined relation between attribute and value is for multimodal corpora, such as the TASX-annotator a shortcoming of AV-structures, if other relations are re- (Milde and Gut, 2002), ELAN (ELAN, –2004), ANVIL quired. In knowledge engineering the concept of triples was introduced to avoid this shortcoming. In these triples, just IMDI (see IMDI, 2001 and IMDI, 2003): from the lan- as in AV-structures, a value (an object) is assigned to a prop- guage engineering perspective approaches resources erty (a subject), but in contrast to AV-structures, the relation from the catalogue1 and session level2 differently, al- (a predicate) is explicitly given, as well. Consequently, ev- lowing an inclusion of multimodal data in data repos- ery AV-structure can be represented as a triple, by explicitly itories and describing annotations to a certain degree. stating the predicate. One example for this is the Knowl- edge Interchange Standard (KIF) (see KIF, 1998). 
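The relation between attribute-value structures and triples can be illustrated in a few lines of Python; the resource identifier and the "has_" predicate naming below are invented for the example and do not follow any particular RDF or KIF vocabulary.

# The same metadata fact, first as a flat attribute-value pair (the
# relation between attribute and value is left implicit), then as an
# explicit (subject, predicate, object) triple.

av_record = {"creator": "Jane Doe", "language": "de"}

def av_to_triples(resource_id, record):
    # Spell out the implicit relation as a predicate for every attribute.
    return [(resource_id, f"has_{attribute}", value)
            for attribute, value in record.items()]

for triple in av_to_triples("corpus-042", av_record):
    print(triple)
# ('corpus-042', 'has_creator', 'Jane Doe')
# ('corpus-042', 'has_language', 'de')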
The main shortcomings of all metadata sets remain the Another example is the Resource Description Frame- following: work (RDF, (Beckett, 2004)) standard issued by the World underspecification of data categories: Wide Web Consortium (W3C). Underspecification in this context is a problem 3.3. Tree Structures related to the granularity and semantics of metadata categories, which is defined in terms of human A different kind of structure is given by tree structures. perception rather than on a formal basis. This results These describe a hierarchical relation between metadata in a basic absence of a classification. An example is categories. The relation between metadata categories can a data category language, which does not by itself be of a different nature, as well. Usually it is a taxonomic provide information whether the language of descrip- relation (ISA-relation) between subordinate and superordi- tion, language of the content, or native language of nate categories. One reason for this kind of hierarchy is the a speaker is under consideration. Introducing new inheritance of information from superordinate categories. subcategories does not necessarily solve the problem This is either done by subclassification, i.e. the data cate- because a less standardized vocabulary or a larger gories are specified — given a meaning — by their position number of words in the controlled vocabulary result in the hierarchy, or the granularity of the metadata is deter- in too large adaptions for tools. However, by defining mined by the position in the hierarchy, i.e. a subdividable content models for all data categories — besides value is either given on a superordinate level or subdivided already existing closed vocabularies — and defining on a lower level. the semantics of the data categories formally, this For language resources, the IMDI (IMDI, 2001) stan- problem could be solved. dard as well as the TEI metadata header (Sperberg- McQueen and Burnard, 1994) provides a hierarchy of meta- multilingual sources: Almost all metadata standards are data. Both use the hierarchy for the classification of cate- targeted at monolingual resources, though for simple gories and not for determining granularity. implementations a multiple use of data categories is Tree structured metadata categories contain the most in- allowed. For example in Dublin Core, 2003, it is only formation but are relatively difficult to cope with in terms possible to describe that a resource contains informa- of relational databases with their flat structure. However, tion in a number of languages, not which part of the the relation of categories to a hierarchy can easily be ex- resource contains information in a particular language. pressed using a pointer to an existing ontology, such as in This can be solved by differentiating between different Farrar and Langendoen, 2003. languages in different annotations layers and describ- ing these annotations separately. 4. Metadata Levels multi-participant sources: The same as for the multilin- Metadata sets for the description of corpora are avail- gual resources always applies if some characteristic is able in different forms, allowing the cataloguing of these used with different values, also with persons. Though corpora and accessing them, providing a general descrip- a subclassification exists for different persons con- tion of these metadata. 
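The underspecification of a category such as "language", and the difficulty of keeping tree-structured metadata in flat tables, can both be eased by qualifying every leaf with its path in the hierarchy. The following small sketch uses an invented hierarchy; the category names do not reproduce IMDI or TEI categories.

# Flatten a tree-structured metadata record into uniquely qualified
# category names suitable for a flat, relational representation.
# "language" occurs several times, but every occurrence stays distinct.

def flatten(tree, prefix=""):
    flat = {}
    for key, value in tree.items():
        qualified = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, qualified))
        else:
            flat[qualified] = value
    return flat

session = {
    "description": {"language": "en"},      # language of the description
    "content": {"language": "de"},          # language of the recorded content
    "speaker": {"native language": "tr"},   # native language of a speaker
}

for category, value in sorted(flatten(session).items()):
    print(category, "=", value)
# content.language = de
# description.language = en
# speaker.native language = tr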
Widely used are the following: nected to a resource, such as publisher, author, edi- DC (see Dublin Core, 2003) developed for the cataloguing tor, etc., the standards allow listing persons and items of resources in libraries the DC metadata is suitable only. It seems to be assumed that the resource can be for bibliographical sources such as texts, articles and described as a whole, though the parts may be clearly books; however for corpora and multimodal data a lot distinguishable, such as the speaking person at a given of metadata categories are missing. position of the signal or authorship of a specific sec- tion. OLAC (see Simons and Bird, 2002): developed for the cataloguing of linguistic resources in data reposito- Distinguishing the differences by inserting layers or ries. OLAC metadata provides additional linguis- tiers for every level or speaker enables direct access to this tic data categories and some technical information of information. electronic material for linguistic resources of one lan- However, this does not solve the problem of metadata guage, one annotation, one annotator and one medium. description of this information provided on different lay- ers, such as standoff annotation (McKelvie and Thompson, TEI metadata header (see Sperberg-McQueen and Burnard, 1994): developed for encoding texts, the TEI 1The catalogue level is a general description of the corpus as metadata allows the encoding of metadata categories such. relevant for textual sources which follow the TEI 2The session level can be described as one individual recording document grammar. within a corpus. 1997), primary data identical annotation (Witt, 2002), or – metadata about the actual annotation event, even multi-tier score as common in signal annotation (e.g., which might include deviations from the layer annotations on different linguistic levels represented on metadata or technical information for retrieval tiers with Praat, TASX-Annotator or wavesurfer) software. where different levels can include These different metadata levels are interrelated by shar- • different kinds of annotation units, following different ing data categories and information. However, the linguis- annotation standards and linguistic theories, tic description needs to be far more detailed.

• annotations by different annotators, 4.2. Suggested Metadata Encoding • a variety of different annotation dates and periods, As the representation of metadata in tree structure has advantages in guiding the user, the metadata encoding • a variety of annotation tools, resulting in different re- should refer to an ontology of metadata categories. How- strictions to the annotations, ever, to allow more efficient storage and processing, all categories (leaves in tree terminology) should have unique • different languages, where for multilingual signals names. In IMDI, for example, type is used context depen- (e.g. interpreted speech) each language is annotated dently. To provide context independent naming of cate- on a separate tier, gories type, should be qualified as type of recording, type of medium, type of resource, etc. If this is provided the • ... metadata can be processed in AV-form or, if a predicate is given, in RDF or another knowledge format. To approach this problem a further abstraction is re- quired, namely the introduction of metadata levels or meta- 5. Metadata for Time Aligned Corpora data categories for different uses. Spoken language corpora and multimodal corpora are 4.1. Metadata Categories for Different Uses both time aligned and can be described on various levels of The problems of metadata categories for different anno- granularity: firstly, generally as a whole, called catalogue tation units, linguisic theories, etc., can easily be solved by level, secondly a description for every part of the signal, distinguishing different types of metadata. These are illus- called session level, thirdly, a detailed documentation of the trated by Figure 2. annotation for every level of annotation or every annotation tier, called layer level, and finally for every annotated seg- • Metadata on the catalogue level for a Resource De- ment, called event level. A similar structure can be found scription Library: This includes the bits of informa- in the MPEG-7 standard, where regions, segments, objects, tion used in large data repositories for locating a spe- etc. can be described (MPEG-7, 2003). cific resource providing basic information for the re- The MPEG-7 standard was created to enable easier ac- trieval of further metadata, such as the file format cess and querying multimedia resources such as videofilms and location of the resource and infrastructure require- and audio recordings, based on a time aligned annotation. ments for retrieval (such as software to access the Salembier and Smith, 2001 describe the scheme for mul- data). This can be seen in comparison to an abstract timedia data, based on categories similar to Dublin Core, for an article, or a sales brochure for a product. The 2003, describing recording information (called creation in- information given is highly conventionalized and rel- formation), storage media information, and information re- atively independent of the resource under considera- lated to intellectual property rights. However, linguistic tion. data categories are not intended and the use is intended for large media archives and not for linguistic corpora. Hunter • Metadata on the individual annotation levels for a de- and Armstrong, 1999 describe different schemas for video tailed Linguistic Description: This information is used annotation, based on an early version of RDFS, RDF, DTD, for applications and for detailed research questions. 
etc., allowing arbitrary metadata categories, mentioning the Metadata for linguistic description are the specifica- problems free metadata categories cause in the context of tion of the annotation of a corpus or can be interpreted non-standardized archives. as a sort of user manual. These descriptions include: The metadata categories listed were motivated by the creation of multimodal corpora for the creation of multi- – Metadata on the session level: On the session modal lexica3. A detailed description of the metadata for level information is needed with regard to the this specific corpus can be found in Trippel and Baumann, structure — the data format — and the content 2003. The metadata classification is used for the automatic of the individual primary data. induction of lexica from corpora as described by Trippel – Metadata on the layer level: this includes infor- et al., 2003 to allow to create a lexicon microstructure. mation about the specific annotation such as the annotator, annotation formalisms including data 3The corpora were funded by the German Research Council format and encoding, technology used in annota- grant to the project Theory and Design of Multimodal Lexica, Re- tion, etc. search Group Text Technological Information Modelling. Figure 2: Metadata categories for corpora by intended use
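One way to read the catalogue/session/layer/event distinction is as a chain of increasingly specific descriptions, where an event only records deviations from its layer. The following sketch uses invented category names purely to illustrate this resolution order; it is not a proposal for concrete metadata elements.

# Resolve the effective metadata for a single annotated segment by
# walking catalogue -> session -> layer -> event, letting each more
# specific level override the more general one.
from collections import ChainMap

catalogue = {"corpus": "multimodal demo corpus", "access": "restricted"}
session = {"recording date": "2003-05-12", "medium": "video"}
layer = {"tier": "words (orthographic)", "annotator": "A. Example"}
event = {"annotator": "B. Example"}   # this one segment was corrected later

def effective_metadata(*levels):
    # ChainMap searches its mappings left to right, so the most
    # specific level has to come first.
    return dict(ChainMap(*reversed(levels)))

segment = effective_metadata(catalogue, session, layer, event)
print(segment["annotator"])   # B. Example, the event-level deviation
print(segment["medium"])      # video, inherited from the session level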

5.1. Catalogue Level Metadata for Time Aligned ten implied as the description levels are not distinguished Corpora in many contexts. In the corpora mentioned above, all cat- For the catalogue level the descriptions of Simons and egories that are not directly related to annotation on a spe- Bird, 2002, or IMDI, 2001, can easily be adapted for spe- cific annotation level have been recorded here, using the cific project requirements. The latter is more detailed and IMDI categories. provides a mapping to the former as well as to Dublin Core, 5.3. Layer Level Metadata for Time Aligned Corpora 2003, and can therefore be used in contexts where these are used as a standard. The data categories on the layer level are defined ac- cording to appropriate categories from session and cata- In IMDI, 2001, the data categories are structured hier- logue level. As these are not available in other systems archically, and the lowest level elements still have unique they are described in detail. The categories are given in a names, resulting in the option of storing and processing hierarchy, implying an ontology, though the naming allows them in table formats such as relational databases. processing in table form, as well.

5.2. Session Level Metadata for Time Aligned Information class: classes of information with the follow- Corpora ing subclasses: The most detailed description for session metadata for multimodal corpora is given with the IMDI proposal phonemic: annotation on phonemic level, for exam- (IMDI, 2003), providing a hierarchy of metadata for in- ple describing individual constituents. formation such as Session, with subcategories Name, Title, syllabic: annotation on syllable level, for example: Date, Location, etc. • orthographic: orthographic syllable annota- Due to the reduplication of category names on a low tion level, this system cannot be used directly in a relational • phonemic: phonemic syllable annotation database. A solution is to qualify the leaves of the tree • phonetic: phonetic syllable annotation in order to make sure they are unique. This includes the combination of the category name with the name of the su- word: annotation based on word level segmentation, perordinate category. such as: As the session is distinguished from the specific anno- • orthographic: orthographic word annotation tation level, some metadata categories are not required if in standard orthography they are recorded on the other layers. The metadata cate- • phonemic: phonemic word annotation gories on this level can indeed be inferred from information • phonetic: phonetic word annotation on lower levels; for example, the list of annotators for one • syntactic: syntactic word annotation session can be inferred from the annotators of the individ- ual levels. However, this inference is part of the MetaLex • lemma: lemmatization approach as described by Trippel et al., 2004, and is of- • morphemic: morphemic segmentation prosodic: prosodic annotation based on: 6. Technical Realisation of the Level Based • tones Metadata Concept • breaks The level based metadata concept has been imple- phrase: annotation on larger units, again on different mented and used in the TASX format, which is used by the levels: TASX-Annotator (Milde and Gut, 2002) and the PAX au- dio concordance system (Trippel and Gibbon, 2002). The • orthographic, TASX grammar allows the flexible insertion of metadata on • tones, all levels of annotation. This was motivated by the idea of • syllables, interchanging data created with different tools without data loss, and allowing to store the metadata with the original gloss: interlinear gloss with a specification of the data. gloss language Figure 3 shows the TASX-annotators metadata editor hand/arm gesture: arm gesture annotation creating a metadata entry on one annotation level. • left: left hand and arm • right: right hand and arm • pair: annotation of the movement of the limb pair • complex: annotation of complex gestures • function: functional gesture annotation • spatial relation: spatial relation between the limb pair data warehousing: information concerning annotating personnel, annotation and version with subcategories:

annotator: description of each annotator in terms of:
annotator name
annotator native language
annotator other language
annotator qualification
annotator comment: comments on the annotator
annotator role: function of the annotator, which is relevant especially if more than one annotator is involved
annotator affiliation
annotation date
annotation revision
annotation software
annotation media: media used for annotation, for example audio or video for speech annotation
annotation status: status of the annotation, e.g. finished, work in progress, to be revised
layer title
misc: prose text with other relevant information

Figure 3: Metadata editor of the TASX-annotator

5.4. Event Level Metadata for Time Aligned Corpora
Any information deviating from the layer level needs to be recorded with each segment. For example, an annotation can be done by one person, who will be the annotator, but one segment is corrected by somebody else, who needs to be specified at this segment. This feature is currently used to store technical information for a segment, such as font selection by the TASX-annotator.

7. Summary and Outlook
The necessity for metadata on all levels of annotation has been explained. For some of these levels there are recommendations and proposals for a metadata inventory. These proposals have to be further adapted to allow processing in hierarchical form by linking to ontologies and in flat table structures and lists.
For the other levels for time aligned corpora, especially on levels below the session level, there is need for further discussion. This paper was intended to contribute to the discussion that may lead to a standard for the documentation of this class of corpora.
Further work has to be done to allow the inference of metadata within annotation tools, in order to increase consistency and usability.

Acknowledgement
The work presented in this paper was funded mainly by the German Research Council grant to the project Theory and Design of Multimodal Lexica, Research Group Text Technological Information Modelling. The author is grateful to the researchers affiliated with this group.

8. References
Barnard, David T., Lou Burnard, Jean-Pierre Gaspart, Lynne A. Price, C. M. Sperberg-McQueen, and Giovanni Battista Varile, 1995. Hierarchical encoding of text: Technical problems and SGML solutions. In Nancy Ide and J. Véronis (eds.), The Text Encoding Initiative: Background and Context. Kluwer Academic Publishers, pages 211–231.
Barras, Claude, 1998-2003. Transcriber – a tool for segmenting, labeling and transcribing speech. Software, download at http://www.etca.fr/CTA/gip/Projets/Transcriber/, checked January 2004.
Beckett, 2004. Resource Description Framework (RDF) model and syntax specification. http://www.w3.org/TR/rdf-syntax-grammar/.
Bird, Steven and Mark Liberman, 2001. A formal framework for linguistic annotation. Speech Communication, 33(1,2):23–60.
Bird, Steven, Xiaoyi Ma, Haejoong Lee, Kazuaki Maeda, Beth Randall, Salim Zayat, John Breen, Craig Martell, Chris Osborn, and Jonathan Dick, –2004. AGTK: Annotation Graph Toolkit. Software, download at http://agtk.sourceforge.net/.
Boersma, Paul and David Weenink, –2004. Praat. http://www.praat.org, checked January 2004.
Dublin Core, 2003. DCMI metadata terms. http://dublincore.org/documents/2003/03/04/dcmi-terms/.
ELAN, –2004. EUDICO Linguistic Annotator. Software, download at http://www.mpi.nl/tools/elan.html.
Entropic, 1998. Xwaves. Software. No longer maintained.
Farrar, Scott and D. Terence Langendoen, 2003. Markup and the GOLD ontology. In Workshop on Digitizing and Annotating Text and Field Recordings. LSA Institute, Michigan State University. Published at http://saussure.linguistlist.org/cfdocs/emeld/workshop/2003/langoen-paper.pdf.
Gibbon, Dafydd, Roger Moore, and Richard Winski (eds.), 1997a. Handbook of Standards and Resources for Spoken Language Systems, chapter SL corpus collection. Berlin and New York: Mouton de Gruyter.
Gibbon, Dafydd, Roger Moore, and Richard Winski (eds.), 1997b. Handbook of Standards and Resources for Spoken Language Systems, chapter SL Corpus Design. Berlin: Mouton de Gruyter.
Hunter, Jane and Liz Armstrong, 1999. A comparison of schemas for video metadata representation. In Proceedings of the Eighth International World Wide Web Conference. Toronto. http://www8.org/w8-papers/3c-hypermedia-video/comparison/comparison.html.
IMDI, ISLE Metadata Initiative, 2001. Metadata elements for catalogue descriptions, draft proposal version 2.1. http://www.mpi.nl/world/ISLE/documents/draft/IMDI Catalogue 2.1.pdf.
IMDI, ISLE Metadata Initiative, 2003. Metadata elements for session descriptions, draft proposal version 3.0.3. http://www.mpi.nl/world/ISLE/documents/Proposals/ISLE MetaData 3.0.3.pdf.
KIF, 1998. Knowledge Interchange Format. http://logic.stanford.edu/kif/.
Kipp, Michael, 2000-2003. Anvil — annotation of video and spoken language. Software, download at http://www.dfki.de/ kipp/anvil/, checked January 2004.
Leech, Geoffrey Neil, 1993. Corpus annotation schemes. Literary and Linguistic Computing, 8(4):275–281.
McKelvie, David and Henry S. Thompson, 1997. Hyperlink semantics for standoff markup of read-only documents. In Proceedings of SGML Europe '97. Barcelona.
Milde, Jan-Torsten and Ulrike Gut, 2002. The TASX-environment: an XML-based toolset for time aligned speech corpora. In Proceedings of LREC 2002. Las Palmas.
MPEG-7, 2003. MPEG-7 overview. Technical report, ISO/IEC.
Salembier, Philippe and John R. Smith, 2001. MPEG-7 multimedia description schemes. IEEE Transactions on Circuits and Systems for Video Technology, 11(6):748–759.
Schmidt, Thomas, –2004. EXMARaLDA. Software, download at http://www.rrz.uni-hamburg.de/exmaralda/, checked January 2004.
Simons, Gary and Steven Bird, 2002. OLAC metadata. http://www.language-archives.org/OLAC/metadata.html.
Sjölander, Kåre and Jonas Beskow, 2000. Wavesurfer – an open source speech tool. In Proceedings of ICSLP 2000. Beijing, China.
Sjölander, Kåre and Jonas Beskow, 2004. Wavesurfer. Software.
Sperberg-McQueen, C. M. and Lou Burnard (eds.), 1994. TEI P3: Guidelines for Electronic Text Encoding and Interchange.
Trippel, Thorsten and Tanja Baumann, 2003. Metadaten für Multimodale Korpora. Technical Document 3, Bielefeld University, Bielefeld. ModeLex Technical Document, Research Group Text Technological Modelling of Information.
Trippel, Thorsten and Dafydd Gibbon, 2002. Annotation driven concordancing: the PAX toolkit. In Proceedings of LREC 2002. Las Palmas.
Trippel, Thorsten, Felix Sasaki, and Dafydd Gibbon, 2004. Consistent storage of metadata in inference lexica: the MetaLex approach. In Proceedings of LREC 2004. Lisbon: ELRA.
Trippel, Thorsten, Felix Sasaki, Benjamin Hell, and Dafydd Gibbon, 2003. Acquiring lexical information from multilevel temporal annotations. In Proceedings of Eurospeech 2003. Geneva.
Witt, Andreas, 2002. Meaning and interpretation of concurrent markup. In Proceedings of the ALLC/ACH 2002 – Joint International Conference of the Association for Literary and Linguistic Computing and the Association for Computers and the Humanities. Tübingen. http://www.uni-tuebingen.de/cgi-bin/abs/abs?propid=40.

How to integrate data from different sources

Ricardo Ribeiro†, David M. de Matos∗, Nuno J. Mamede∗

L2F – Spoken Language Systems Laboratory – INESC ID Lisboa, Rua Alves Redol 9, 1000-029 Lisboa, Portugal {ricardo.ribeiro,david.matos,nuno.mamede}@l2f.inesc-id.pt †ISCTE – Instituto Superior de Ciências do Trabalho e da Empresa ∗IST – Instituto Superior Técnico Abstract We present a dynamic multilingual repository for multi-source, multilevel linguistic data descriptions. The repository is able to integrate and merge multiple/concurrent descriptions of linguistic entities and allows existing relationships to be extended and new ones created. In addition, the repository is also capable of storing metadata, allowing for richer descriptions. We present results from work on large data collections and preview developments resulting from ongoing work.

1. Introduction that took place in 1986. Then, several projects were In the area of natural language processing we are often launched that addressed these issues. The EUROTRA-7 presented with the problem of having data sets ready to be Study (EUROTRA-7, 1991) was concerned with access- used, and yet being unable to use them. This happens due to ing the feasibility of designing large scale reusable lex- several facts: the coded information does not satisfy some ical and terminological resources. The main contribu- of the actual needs or the resource format is not appropriate. tions of this study were an initial specification of a model These situations may lead to incompatibilities between the for a reusable lexicon and several recommendations re- data used by two applications that perform the same kind garding the importance of standardization. Another im- of task, preventing the cross reusability of those data sets portant project was Multilex (Paprotte´ and Schumacher, or even their combination to form a richer set of data. 1993). This project aimed at providing specifications of standards for multilingual lexicons. The result was a pre- Usually, in the development process of any kind of re- liminary design for a reusable multilingual lexicon, that source, some decisions are made that may affect the usabil- continued the work previously started during EUROTRA-7. ity of the resource. For instance, if a lexicon is built to The GENELEX project had as main objective the develop- be used by language interpretation applications, generally ment of a generic, application-independent model of lex- it is not suitable to be used directly for language genera- icon. This model is commonly described as theory wel- tion. A generation lexicon is usually indexed by semantic coming since it tries to accommodate data from compet- concepts whereas an interpretation lexicon is indexed by ing theories. The GENELEX (Antoni-Lay et al., 1994) words (Ribeiro et al., 2004; Jing and McKeown, 1998). model was adopted (and adapted) in projects like PA- This paper explores a possible solution to the problem ROLE/SIMPLE (PAROLE, 1998; SIMPLE, 2000) which described. The solution consists of the development of a aimed at the development of the core of a set of natural lan- repository capable of integrating data sets that come from guage resources for the European Community languages. different sources. The data to be integrated may cover dif- Alongside these projects, the EAGLES initiative aimed at ferent levels of description, belong to different paradigms accelerating the provision of standards for large-scale lan- or have different actual formats of representation. Sev- guage resources; means of manipulating such knowledge; eral types of incompatibility may appear, when merging and, means of assessing and evaluating resources, tools and this kind of data, making the coexistence of the involved products (EAGLES, 1999). The work done by EAGLES resources difficult. One of the requisites of this solution was then continued in the scope of the ISLE project, and, in is that this repository should be more than a mere storage this context, specifically by the ISLE Computational Lex- device, and it should act as a bridge between all imported icon Working Group (CLWG). This group is committed to data sets, providing a canonical representation and respec- the consensual definition of a standardized infrastructure to tive translation agents for the involved information. develop multilingual resources for HLT applications [...]. 
The problem of integrating data from diverse sources, Currently, the ISLE CLWG is focused on aspects of compu- in order to reuse it, taking advantage of the best features of tational and multilingual lexicons (Atkins each data set, is not new. In fact, since the mid 1980s, many et al., 2002). researchers, language engineers and technology planners became aware of the idea of reusability and of its cru- This document is organized as follows: §2. presents the cial role in facilitating the development of practical human problems that may appear when trying to reuse data sets language technology products that respond to the needs of coming from different sources, and the requirements for a users (EAGLES, 1999). possible solution; §3. describes the proposed solution: a dy- The triggering event of these concerns was the Au- namic repository that tries to accommodate the differences tomating the Lexicon: Research and Practice in a Mul- of the data sets and their evolution (in content and struc- tilingual Environment (Walker et al., 1995) workshop ture); §4. describes an implementation of the proposed so- lution; Data access and maintenance issues are discussed PAROLE in the following sections. The document concludes with a brief progress report presenting up to date results and some 2. The problem In general, the problems that afflict data sets and algo their reusability refer to miscellaneous incompatibilities: (i) at the description level, i.e., how existing objects are described (the problem manifests itself frequently and (iv) expressiveness: e.g. “United States of America” as Figure 1 presents the description of the word algo (Por- algo tuguese for something) in several lexicons. The exam- ples were taken from PAROLE, SMorph (A¨ıt-Mokhtar, 1998), LUSOlex/BRASILex (Wittmann et al., 2000) and EPLexIC (de Oliveira, n.d.) lexicons. It is possible to ob- serve cases for all the incompatibilities described above. Description (in the first sense), representation and ex- pressiveness incompatibilities can be observed, in this ex- ample, for the word algo. Concerning description incom- patibilities, PAROLE, LUSOlex, and EPLexIC present two different categorizations (adverb and indefinite pronoun) for that word, while SMorph has only one (in SMorph, algo is described only as indefinite pronoun); in what con- cerns representation, PAROLE uses XML (obtained from the original SGML), while the others use a textual (tabular based) format; and, concerning expressiveness, PAROLE and LUSOlex present a higher (similar) description granu- larity. In what concerns described objects, PAROLE and LUSOlex use several objects to describe the word algo, while SMorph and EPLexIC define only an object corre- SMorph sponding to the line where algo is described. The PAROLE lexicon also includes syntactic as well as semantic infor- algo /pr_i/s/GEN:*/pri. mation (the latter from the SIMPLE part), omitted in this figure. LUSOlex To address the incompatibility issues presented above, Adv191 ADVERBIO´ - FlAdv2 we identified a set of requirements: (i) preserving existing Pi1 PRONOME INDEFINIDO - information (this is in an “at least” sense); (ii) allowing data FlAdv2 reuse among different applications; (iii) allowing data to __P____ 0 <><> be imported/exported across existing formats (existing ap- $ plications keep working, but they now use potentially bet- EPLEXIC ter/richer data); and (iv) easy maintenance and documenta- tion of changes. 
algo/R=p/"al˜gu/algo These requirements are ideal in the sense that they may algo/Pi=nn/"al˜gu/algo be addressed in various ways. A given solution for one of them may be optimal, but not suit all of them: some solutions may be better than others and some solutions may Figure 1: Lexicon comparison of algo descriptions. Pho- give rise to new problems. Our proposal seeks to find a netic description according to SAMPA (SAMPA, n.d.). balance, minimizing the negative aspects while meeting the requirements.
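To make the incompatibilities concrete, two of the entry formats from Figure 1 can be mapped onto one common record. This is only a sketch: the field interpretation is guessed from the single algo lines shown above, and real EPLexIC and SMorph entries may carry additional fields.

# Map an EPLexIC-style line (form/tag/SAMPA/lemma) and an SMorph-style
# line (form followed by slash-separated features) onto one common
# dictionary record.  Field meanings are inferred from the examples.

def parse_eplexic(line):
    form, tag, sampa, lemma = line.strip().split("/")
    return {"form": form, "pos": tag.split("=")[0],
            "sampa": sampa, "lemma": lemma, "source": "EPLexIC"}

def parse_smorph(line):
    form, rest = line.split(None, 1)
    features = [f for f in rest.strip(" .").split("/") if f]
    return {"form": form, "pos": features[0],
            "features": features[1:], "source": "SMorph"}

print(parse_eplexic('algo/Pi=nn/"al~gu/algo'))
print(parse_smorph("algo /pr_i/s/GEN:*/pri."))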

3. Proposal
Although models like the one proposed by GENELEX are generic, application-independent and, in this case, even theory welcoming, they are also static and do not describe means of evolving in order to accommodate, for example, different kinds of information than the ones initially foreseen.
We propose a canonical model for storing/manipulating data, and a dynamic maintenance model for keeping the data model synchronized with new data developments. Thus, the proposed model allows evolution of both data and data structure.

Even though a canonical model has its own set of problems, it presents distinct advantages: it is easier to maintain and document a single format than multiple different ones; the effort dedicated to maintenance tasks is concentrated, possibly further improving them; and it allows for a deeper understanding of data, which in turn facilitates reuse (the reverse would require a user to understand multiple models). Figure 2 shows how data moves around within the proposed solution.

Figure 3: Models.

original data import storage export enriched original data resulting from the PAROLE project and its follow-up, the SIMPLE project. original data transformation repository transformation original data model model model model model In figure 3, we show the relationships between the data input and output models, the data transformation models and the repository model, described in the following sub- Figure 2: Data circulation. sections. 3.2. Data input and output models Any sufficiently expressive high-level modeling lan- guage should be suitable for describing our models: one Data input/output models are used to describe external such is UML (Booch et al., 1999; OMG, n.d.); another formats, i.e., formats of data to include in or to obtain from would be XML Schema Definitions (XSD) (W3C, 2001b). the repository. These models may already exist in some Also to consider is their being amenable to automated pro- form (e.g. an SGML DTD) or they may be implicit (e.g. cessing, as well as their usefulness as model documentation SMorph, ispell (Gorin et al., 1971–2003) use tabled data). languages (both UML and XSD fulfill these criteria: XSD, We isolate these models to clearly separate the reposi- directly; UML, partly, via XMI (OMG, 2002)). We chose tory’s canonical model from the outside world. Neverthe- UML for its relative ease of use and rapid learning curve. less, we maintain open the possibility of interaction with Since they can be represented in XMI (i.e., XML), UML other ways of storing/representing data. The following as- diagrams allow for a wide range of processing options. pects must be taken into account. This, in turn, allows for the repository’s data model to be 3.2.1. Information aggregation used as the starting point for a set of processes that not only create the actual database, but also facilitate access to its The repository is not limited in its capacity for storing data items (this may be done, e.g., through the use of code objects by differences in the various descriptive levels of automatically generated from the UML model, as carried data to be imported, nor because of information concerning out by our prototype (de Matos and Mamede, 2003)). a particular domain. In fact, the repository is able to support In addition to the above, UML diagrams provide a use- multiple levels and domains, as well as the relationships ful source of documentation for the current state of the between their objects, thus becoming an important asset for repository model. In fact, meta-information present in the the tasks of information aggregation and organization. UML diagrams may even be included in the database, thus 1 3.2.2. Multiple levels enriching the data sets already there with a meta level. We consider multiple information levels corresponding 3.1. Canonical model to the ones described in the literature (, syntax, and so on). But we are not limited to these “traditional” de- The canonical model consists of a set of class diagrams scriptions: it may be of interest to include support for other that specify the entities involved in the description of lan- levels, e.g. one halfway between morphology and syntax. guage components. Such components are morphological The design of the repository must provide support both to entities, inflection paradigms, predicates and their argu- existing descriptions and to descriptions resulting from ei- ments, and so on. ther cross-references of existing data or from including new The canonical model is based on existing large coverage data in the repository. 
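The general idea of the canonical model, typed entities for each description level linked across levels, can be conveyed by a toy fragment. The class and attribute names below are invented for illustration; they do not reproduce the PAROLE/SIMPLE classes or the actual repository model.

# Toy fragment of a canonical-model-like structure: a morphological
# unit linked to an inflection paradigm and to phonetic descriptions.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class InflectionParadigm:
    identifier: str
    description: str

@dataclass
class PhoneticUnit:
    sampa: str            # SAMPA transcription of the form

@dataclass
class MorphologicalUnit:
    lemma: str
    pos: str
    paradigm: Optional[InflectionParadigm] = None
    phonetics: List[PhoneticUnit] = field(default_factory=list)

algo = MorphologicalUnit(lemma="algo", pos="indefinite pronoun",
                         paradigm=InflectionParadigm("FlAdv2", "invariable"),
                         phonetics=[PhoneticUnit('"al~gu')])
print(algo.lemma, algo.phonetics[0].sampa)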
Evolution to improve support must, models, i.e., we seek a broad coverage linguistic description however, ensure that current uses remain possible. that crosses information from various levels, including but not limited to morphology, syntax, and semantics. Exam- 3.2.3. Multiple sources ples of existing models, as mentioned before, are the ones In addition to the aspect presented in §3.2.2., we must also consider the existence of multiple information sources 1As an example of the usefulness of metadata in the database, in the context of a given domain: data may originate from our prototype uses this meta-information for ensuring the integrity of data relationships and for synthesizing information concerning different projects and/or applications. The main concern some aspects of data enumerations (de Matos and Mamede, 2003). here is maintaining the expressiveness of the original data, as well as the integrity of their meaning and the consistency Repository model Data input model of the data already in the repository. The danger stems from (XML) Data input model (TXT) using different formats and descriptions for stored and im- ported data. As an example, morphology models defined Data input model by the PAROLE project are much more expressive than (SGML) those defined by, say, a morphological analyzer such as JSpell (de Almeida and Pinto, 1994). The repository must be able to import/export both data sets according to their original models. Loader Access The coexistence of multiple sources is a non-trivial XML Class Mapping interface library SQL problem, especially if the canonical model assumes links (skeletons) (C++) (C++) between description levels: importing data from sources without those links may require additional assumptions. An I/O I/O example: PAROLE/SIMPLE morphological entities may SQL SQL be associated with syntactic units and these with semantic units; in contrast, syntactic data from project Edite (Mar- Programs SQL database ques da Silva, 1997), while also associated with semantic SQL (ODBC, ...) information (different from that of SIMPLE), is not directly associated with the morphological level. Regarding integrity, consider a morphological entity: it may be defined in different ways by different models. How- Figure 4: Models and code generation as implemented by ever, when stored in the repository, it must be represented the current prototype (de Matos and Mamede, 2003). as a single object with the semantics of each original source model. This implies that the canonical model must be suf- ficiently flexible and expressive to ensure that the original 4.1. The canonical model semantics of imported objects is not destroyed. Implementing the canonical model consists of defining the model proper and deploying it using some kind of data 3.2.4. Relationships and non-linguistic data storage solution. Requirements as defined in §3.1. must be Beyond data items, which may come from various inde- satisfied. pendent sources and possibly unrelated domains, the repos- Work on the modeling task started with the study itory must contemplate the possible existence of relation- of existing large coverage models defined by the PA- ships between the objects it stores. We have seen examples ROLE/SIMPLE projects. Their models, published as of those relationships (e.g. between morphological and se- SGML DTDs, were enriched according to the requirements mantic objects, or those existing between syntactic and se- for supporting both the new concepts and existing concepts mantic objects). 
Other relationships may be created and that underwent some refinements. The resulting data model stored, to account for any aspect deemed of interest: e.g. differs from the original, but is still very close and has, so relationships with non-linguistic data, such as ontologies. far, proven to be sufficient for providing coverage for other In general, relationships are not restricted in what con- models. cerns the number of related objects: that is, the repository We chose a relational database (RDB) to implement the supports any multiplicity. repository. RDBs confer flexibility to the global design of 3.3. Data transformation models the systems that use them. The flexibility is directly linked These models allow resources from the repository to be to the fine data granularity provided by database tables and adapted to diverse applications. Some of these applications by the operations provided to work with them, e.g., dy- may predate the repository and require proprietary formats. namic changes are possible, making it possible to perform This compatibility issue is just one example of the more changes to data structures while they are in use. RDBs general problem of exporting data described according to are also flexible in the possible views they allow to be de- the canonical model to formats described according to ex- fined over data: they allow finer selection, according to the ternal models. The export capability is of great importance, client’s interests. since the repository must guarantee its usefulness for exist- Any candidate RDB engine must possess some way of ing applications. verifying and enforcing data integrity constraints (e.g. ref- Two sets of models have, thus, been defined: the first erences to foreign keys). The exact nature of these mecha- contains models of the transformations needed for convert- nisms is not important in itself, but must be taken into ac- ing from data described by external models and the canon- count when processing data. ical model. The second set contains models of the transfor- Our choice for storage and data management was mations needed for converting from data described by the MySQL (MySQL, n.d.). Tables and integrity maintenance canonical model and external models. constraints were generated using XSLT scripts taking as input the original UML repository models (de Matos and 4. Implementation Mamede, 2003). Note that only the canonical model dia- We now present implementations for each of the previ- grams are used in this task, i.e., the data input/output and ous concepts. data transformation models are not used. 4.2. Data input and output models be converted to/from the canonical data model. vantages as for the canonical model), but other data descrip- algo tion languages, such as XML Schema Definitions (XSD), "al˜gu sufficient for automatic processing and documentation pur- poses. If the original description does not exist, it is pos- the canonical representation. support (they are assumed to be supported by some outside algo application/system). In what concerns our work, they are "al˜gu formation models (see §3.3.). 4.3. Data transformation models Our work with these models is so far limited to selected cases. Namely, we defined input transformation models for the data resulting from the PA- ROLE/SIMPLE projects. Although preliminary, at the time of this writing, the work allows us to envision the best way of implementing other filters for loading arbitrary data. 
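The role of the database engine's integrity constraints can be illustrated independently of the generated MySQL schema, which is not reproduced here. A minimal, self-contained sketch using SQLite with invented table and column names:

# Illustrate the kind of referential-integrity enforcement the
# repository relies on.  The real schema is generated from the UML
# model; these tables are only stand-ins.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")          # SQLite needs this explicitly
db.executescript("""
CREATE TABLE inflection_paradigm (id INTEGER PRIMARY KEY, pattern TEXT);
CREATE TABLE morphological_unit (
    id       INTEGER PRIMARY KEY,
    lemma    TEXT NOT NULL,
    paradigm INTEGER REFERENCES inflection_paradigm(id)
);
""")
db.execute("INSERT INTO inflection_paradigm VALUES (1, 'FlAdv2')")
db.execute("INSERT INTO morphological_unit VALUES (1, 'algo', 1)")   # accepted

try:
    # Points to a paradigm that does not exist: the engine rejects it.
    db.execute("INSERT INTO morphological_unit VALUES (2, 'casa', 99)")
except sqlite3.IntegrityError as error:
    print("rejected:", error)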
Data from EPLexIC and LUSOlex/BRASILex underwent a different set of transformations, namely, they were con- verted to the external representation of the canonical model prior to loading. More study is needed for comparing these two approaches. Output transformation models have not been explicitly implemented: currently, we obtain data directly from the RDB engine, either through the programming interface, as- sociated with the canonical model, or directly, via SQL commands. Figure 5 presents the output obtained when extracting the description of the word algo using the PAROLE output model. It is possible to observe how the description of the entry algo has been enriched by the information imported pmu from EPLexIC: a phonetic morphological unit ( ) has been added to each morphological unit and the correspond- ing phonetic infection paradigms are also part of the output. During the import process of EPLexIC, each entry of this lexicon produced a new morphological unit in the repository, unless the information of that entry could be ap- pended (as a phonetic description element) to an existing one. Any diverging data was subjet to an individual analy- sis. 5. Data Access Accessing the database implies no special requirement. Figure 5: Repository output description for concept algo It is the nature of the transfered information that introduces using the PAROLE data model. the requirements that should be satisfied.
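The import behaviour described for EPLexIC (append the phonetic description to an existing unit where one matches, otherwise create a new unit) can be sketched as a simple merge step. The record structure below is a toy one, not the repository's actual object model, and diverging entries, which required individual analysis, are not handled.

# Merge an imported EPLexIC-style record into a toy repository: if a
# morphological unit with the same lemma and part of speech already
# exists, only the phonetic description is appended; otherwise a new
# unit is created.

repository = {("algo", "Pi"): {"lemma": "algo", "pos": "Pi", "phonetics": []}}

def import_entry(repo, entry):
    key = (entry["lemma"], entry["pos"])
    if key in repo:
        repo[key]["phonetics"].append(entry["sampa"])     # enrich existing unit
    else:
        repo[key] = {"lemma": entry["lemma"], "pos": entry["pos"],
                     "phonetics": [entry["sampa"]]}        # create a new unit
    return repo

import_entry(repository, {"lemma": "algo", "pos": "Pi", "sampa": '"al~gu'})
print(repository[("algo", "Pi")]["phonetics"])   # ['"al~gu']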

5.1. Access to the canonical repository For convenience and flexibility, a network interface modern RDBs, should not prove difficult to implement. It should be provided. This feature, present in almost all may be either a proprietary or an open protocol implement- ing some kind of distributed SQL transport mechanism. addressed. Data model maintenance has already been par- Examples are ODBC (Microsoft Corporation, n.d.) and tially addressed by the use of UML diagrams and subse- JDBC (Sun Microsystems, Inc., n.d.). We privileged open- quent code generation operations that allow full access to ness, since it facilitates portability and maintenance (on this the corresponding repository data. topic, see, for instance (Norris, 2004)). Since our main source of interaction would come from 7. Final remarks and future directions a set of C++ applications we started by defining a program- Results so far, obtained with large data sets, allow us ming interface for this language. A connectivity library to conclude that our approach addresses the requirements (DTL/ODBC (Gradman and Joy, n.d.)) was used to link the stated above. Moreover, importing the lexicons presented lower level RDB access with the higher level program in- in table 1, enriched the repository and the sets themselves terface (a set of automatically generated C++ classes repre- at three different levels: (i) enrichment obtained from the senting database concepts). As mentioned before, the gen- data integration, which provides an easier selection of the eration of these classes was done using XSLT, taking as needed data for a specific purpose and, given the user of the input the original canonical model UML diagrams. Since data has only one format and one set of data to consider, this was the method used for building the database itself, easier reutilization and improvement of the data itself; (ii) we are able to guarantee close mappings between the dif- the interaction with the data is promoted by means of the ferent levels, thus minimizing concept mismatches. data access possibilities mentioned above: this, in turn, pro- Regardless of these methods, access to the repository is motes more interaction and consequent data enrichment (by open to other methods. This is one of the advantages of allowing simple/easy extension/expansion of data relation- using a RDB engine as a separate data management agent. ships); (iii) implicit enrichment, which is a consequence of In particular, use of other languages is possible, as long as importing new data into the existing structure. they support the concepts in the repository, e.g., via the ob- As an example of the previous, when importing the pho- ject abstraction. We introduce this requirement to prevent netic information of the word forms of EPLexIC to the pho- the high costs associated with explicit management of non- netic paradigms structure of the canonical model, all related native concepts in the target language. Another require- word forms of the imported one were enriched with corre- ment is that a low-level RDB interaction library (either na- sponding phonetic information. tive/proprietary or open/standard) exists that supports the chosen language. Once again, this is to avoid pursuing ex- Lexicon Size (entries) Imported entries pensive solutions. PAROLE 20k fully loaded LUSOlex 65k fully loaded 5.2. Programming interface BRASILex 68k fully loaded More than being just a source of passive data, the repos- EPLexIC 80k partially loaded itory supports “online” uses. 
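The flavour of the generated programming interface, one thin class per repository concept hiding the SQL behind attributes, can be shown with a Python stand-in; the real interface consists of automatically generated C++ classes, and the table and column names below are invented.

# Toy analogue of the generated access layer: a class wraps a row and
# exposes repository concepts as attributes instead of SQL.  SQLite
# plays the role of the remote repository engine here.
import sqlite3

class MorphologicalUnit:
    def __init__(self, connection, row_id):
        self._db, self._id = connection, row_id

    @property
    def lemma(self):
        (value,) = self._db.execute(
            "SELECT lemma FROM morphological_unit WHERE id = ?",
            (self._id,)).fetchone()
        return value

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE morphological_unit (id INTEGER PRIMARY KEY, lemma TEXT)")
db.execute("INSERT INTO morphological_unit VALUES (1, 'algo')")
unit = MorphologicalUnit(db, 1)
print(unit.lemma)    # 'algo'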
6. Maintenance

There are two main aspects regarding maintenance. The first is repository content management: this aspect accounts for future expansion both of data content and of the expressiveness of data models. In fact, current work already points to a few troublesome aspects (e.g. paradigm shifts). So far, though, we have been able to find elegant solutions for all of them and still maintain the original data semantics (of course, in some cases, semantics has been augmented to account for the new uses).

The second maintenance aspect concerns management of data models: this item covers aspects relating to miscellaneous remodeling operations and possible redefinition of the canonical model. This implies transition operations between successive canonical models, which in themselves are no more than instantiations of data import/export operations, albeit possibly more complex than the ones used by applications such as a morphological analyzer.

In spite of having started work on maintenance aspects, content and model maintenance remain to be fully addressed. Data model maintenance has already been partially addressed by the use of UML diagrams and subsequent code generation operations that allow full access to the corresponding repository data.

7. Final remarks and future directions

Results so far, obtained with large data sets, allow us to conclude that our approach addresses the requirements stated above. Moreover, importing the lexicons presented in Table 1 enriched the repository and the data sets themselves at three different levels: (i) enrichment obtained from the data integration, which provides an easier selection of the data needed for a specific purpose and, given that the user of the data has only one format and one set of data to consider, easier reutilization and improvement of the data itself; (ii) interaction with the data is promoted by means of the data access possibilities mentioned above: this, in turn, promotes more interaction and consequent data enrichment (by allowing simple and easy extension/expansion of data relationships); (iii) implicit enrichment, which is a consequence of importing new data into the existing structure.

As an example of the latter, when importing the phonetic information of the word forms of EPLexIC into the phonetic paradigms structure of the canonical model, all related word forms of the imported lexicon were enriched with corresponding phonetic information.

Lexicon      Size (entries)   Imported entries
PAROLE       20k              fully loaded
LUSOlex      65k              fully loaded
BRASILex     68k              fully loaded
EPLexIC      80k              partially loaded
SMorph       35k              not loaded

Table 1: Data sets under study. Since the repository is multilingual, we are using the ISO 639 and ISO 3166 standards to encode the language and region, respectively. Thus, PAROLE, LUSOlex, EPLexIC, and SMorph are all marked as pt PT and BRASILex as pt BR. Although we have data samples for other languages, they have yet to be considered. Italicized numbers in the table refer to word forms.

We are also able to conclude that our work points to a more general solution to the problem of data reuse and integration. In addition, it opens the door to seamless integration with other data description levels, such as language-oriented ontologies.

Acknowledgments

This paper was supported by the DIGA Project (POSI/PLP/41319/2001).

8. References

Antoni-Lay, Marie-Hélène, Gil Francopoulo, and Laurence Zaysser, 1994. A Generic Model for Reusable Lexicons: the Genelex Project. Literary and Linguistic Computing, 9(1):47–54. Oxford University Press.
Atkins, Sue, Nuria Bel, Francesca Bertagna, Pierrette Bouillon, Nicoleta Calzolari, Christiane Fellbaum, Ralph Grishman, Alessandro Lenci, Catherine MacLeod, Martha Palmer, Gregor Thurmair, Marta Villegas, and Antonio Zampolli, 2002. From Resources to Applications. Designing the Multilingual ISLE Lexical Entry. In Proceedings of the Third International Conference on Language Resources and Evaluation (LREC 2002). Las Palmas de Gran Canaria, Spain.
Aït-Mokhtar, S., 1998. L'analyse présyntaxique en une seule étape. Thèse de doctorat, Université Blaise Pascal, GRIL, Clermont-Ferrand.
Booch, Grady, James Rumbaugh, and Ivar Jacobson, 1999. The Unified Modeling Language User Guide. Addison-Wesley Longman, Inc. ISBN 0-201-57168-4.
de Almeida, José João Dias and Ulisses Pinto, 1994. Jspell – um módulo para a análise léxica genérica de linguagem natural. In Encontro da Associação Portuguesa de Linguística. Évora.
de Matos, David Martins and Nuno J. Mamede, 2003. Linguistic Resources Database – Database Driver Code Generator (DTL/ODBC). Technical Report RT/007/2003-CDIL, L2F – Spoken Language Systems Laboratory, INESC-ID Lisboa, Lisboa.
de Oliveira, Luís Caldas, n.d. EPLexIC – European Portuguese Pronunciation Lexicon of INESC-CLUL. Documentation.
EAGLES, 1999. Final Report: Editor's Introduction. EAGLES Document EAG-EB-FR2, Expert Advisory Group on Language Engineering Standards.
EUROTRA-7, 1991. Feasibility and project definition study of the reusability of lexical and terminological resources in computerised applications. Technical report, IMS, University of Stuttgart, Stuttgart.
Gorin, R. E., Pace Willisson, Walt Buehring, and Geoff Kuenning, 1971–2003. International ispell. http://www.gnu.org/software/ispell/ispell.html.
Gradman, Michael and Corwin Joy, n.d. Database Template Library. See: http://dtemplatelib.sf.net/.
Jing, H. and K. McKeown, 1998. Combining multiple, large-scale resources in a reusable lexicon for natural language generation. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and the 17th International Conference on Computational Linguistics.
Marques da Silva, Maria Luísa, 1997. Edite, um sistema de acesso a base de dados em linguagem natural. Análise Morfológica, Sintáctica e Semântica. Tese de mestrado, Instituto Superior Técnico, Lisboa.
Microsoft Corporation, n.d. ODBC – Microsoft Open Database Connectivity. Specifications and implementations may be found, among other places, at: http://msdn.microsoft.com/, www.iodbc.org, or www.unixodbc.org.
MySQL, n.d. MySQL Database Server. MySQL A.B. See: http://www.mysql.com/products/mysql/.
Norris, Jeffrey S., 2004. Mission-Critical Development with Open Source Software: Lessons Learned. IEEE Software, 21(1):42–49.
OMG, 2002. XML Metadata Interchange (XMI) Specification, v1.2. Object Management Group (OMG). See: www.omg.org/technology/documents/formal/xmi.htm.
OMG, n.d. Unified Modelling Language. Object Management Group (OMG). See: www.uml.org.
Paprotté, Wolf and Frank Schumacher, 1993. MULTILEX – Final Report WP9: MLEXd. Technical Report MWP8-MS Final Version, Westfälische Wilhelms-Universität Münster.
PAROLE, 1998. Preparatory Action for Linguistic Resources Organisation for Language Engineering – PAROLE. http://www.hltcentral.org/projects/detail.php?acronym=PAROLE.
Ribeiro, Ricardo, Nuno Mamede, and Isabel Trancoso, 2004. Morphossyntactic Tagging as a Case Study of Linguistic Resources Reuse. Colibri. To appear.
SAMPA, n.d. http://www.phon.ucl.ac.uk/home/sampa/home.htm.
SIMPLE, 2000. Semantic Information for Multifunctional Plurilingual Lexica – SIMPLE. http://www.hltcentral.org/projects/detail.php?acronym=SIMPLE.
Sun Microsystems, Inc., n.d. JDBC Data Access API. See: http://java.sun.com/products/jdbc/.
W3C, 2001a. Extensible Markup Language. World Wide Web Consortium (W3C). See: www.w3c.org/XML.
W3C, 2001b. XML Schema. World Wide Web Consortium (W3C). See: www.w3c.org/XML/Schema and www.oasis-open.org/cover/schemas.html.
Walker, D., A. Zampolli, and N. Calzolari (eds.), 1995. Automating the lexicon: Research and practice in a multilingual environment. Oxford: Oxford University Press.
Wittmann, Luzia, Ricardo Ribeiro, Tânia Pêgo, and Fernando Batista, 2000. Some Language Resources and Tools for Computational Processing of Portuguese at INESC. In Proceedings of the Second International Conference on Language Resources and Evaluation (LREC 2000). Athens, Greece: ELRA.

Towards an international standard on feature structure representation (2)

Kiyong Lee, Lou Burnard, Laurent Romary, Eric de la Clergerie,
Ulrich Schaefer, Thierry Declerck, Syd Bauman,
Harry Bunt, Lionel Clément,
Tomaz Erjavec, Azim Roussanaly, Claude Roux

Korea University Linguistics, Seoul, Korea ([email protected])
Oxford University Computing Services, UK ([email protected])
LORIA, France ({Laurent.Romary, Azim.Roussanaly}@loria.fr)
INRIA, France ({lionel.clement, Eric.De-La-Clergerie}@inria.fr)
DFKI, Germany ([email protected])
Saarland University & DFKI, Germany ([email protected])
Brown University, USA ([email protected])
Tilburg University, The Netherlands ([email protected])
Jozef Stefan Institute, Slovenia ([email protected])
XEROX Research Center Europe, France ([email protected])

Abstract

This paper describes the preliminary results of a joint initiative of the TEI (Text Encoding Initiative) Consortium and the ISO Committee TC 37/SC 4 (Language Resource Management) to provide a standard for the representation and interchange of feature structures. The paper published in the proceedings of this workshop is an extension of a paper published in the LREC 2004 proceedings; about 50% of its content is identical with that paper.

1. Introduction and terminological databases. Since 2000, maintenance This paper describes some preliminary results from a and development of the TEI has been managed by an inter- joint initiative of the TEI (Text Encoding Initiative) Con- national membership Consortium, which announced publi- sortium and the ISO Committee TC 37/SC 4 (Language Re- cation of a complete XML version of the TEI Guidelines, known as P4 in 2002, and is now overseeing production of source management), the goal of which is to define a stan- 1 dard for the representation and interchange of feature struc- a major new revision, known as P5. tures. The joint working group was established in Decem- 1.2. TC 37/SC 4 ber 2002, and its proposals are now progressing to Draft The research areas of ISO/TC 37/SC 4 include com- International Standard status. putational linguistics, computerized lexicography, and lan- 1.1. TEI guage engineering. Language resources consist of con- tents represented by linguistic data in various formats (e.g., Initially launched in 1987, the Text Encoding Initia- speech data, written text corpora, general language lexical tive (TEI) is an international and interdisciplinary effort the corpora). Text corpora, lexica, ontologies and terminolo- goal of which is to help libraries, publishers, and individual gies are typical instances of language resources to be used scholars represent all kinds of literary and linguistic texts for language and knowledge engineering. In both mono- for online research and teaching, using an encoding scheme lingual and multilingual environments, language resources that is maximally expressive and minimally obsolescent. play a crucial role in preparing, processing and manag- The TEI has also played a major role in the development ing the information and knowledge needed by computers of European language engineering standards since the days as well as humans. With a view to mobile computing and of EAGLES. Its recommendations, the “TEI Guidelines”, mobile content etc., the availability of language resources, underpin such key standards as the Corpus Encoding Stan- having to be considered as multilingual, multimedia and dard, and address many other areas of language resource documentation and description, as well as lexicographic 1See also http://www.tei-c.org/ multimodal from the outset will be one of the key success consisting of simple lists of feature-value pairs, to highly factors.2 complex typed and nested structures with reentrancy, as found for instance in HPSG (Pollard and Sag, 1994b), 1.3. Current topics of the joint group LFG (Bresnan, 1982), etc. More recently, FSs have also The joint TEI and ISO activity has focussed on the fol- been used as the internal representation for shallow and ro- lowing topics: bust NLP systems based on finite state technologies, or for merging information sources coming from distinct modali- articulation of a detailed technical proposal for an ties in multi-modal systems. XML format able to represent a feature structure anal- ysis with a precise description of the underlying for- 4. 
The proposal mal mechanism to ensure the coherence and sound- ness of the standard in line with major theoretical This proposal combines a basic set of tags for repre- works in this domain; senting features and feature structures covering in a uni- form way the full range of complexity attested by current provision of specific mechanisms to deal with re- implementations, together with additional mechanisms to entrant structures, clearly distinguished from a generic describe libraries of values, feature value pairs and feature pointing mechanism; structures. As an example, consider the following sim-

ple morpho-syntactic annotation for the word ‘vertes’ in

provision of a coherent description of the notion of French: type, which will enable further development of the standard to include a complementary set of proposal relating to declaration of a Feature System.
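As an indication of the kind of markup the proposal describes, the sketch below builds a feature structure of this shape with Python's xml.etree. The element names (fs, f, sym) are those mentioned in the text; the five features and their values for "vertes" are illustrative assumptions, and the original example also distinguishes value datatypes that are all collapsed to sym here.

import xml.etree.ElementTree as ET

# Illustrative only: feature names and values are assumptions for the example,
# not text taken from the draft standard.
fs = ET.Element("fs")
for name, value in [("form", "vertes"),
                    ("lemma", "vert"),
                    ("pos", "adjective"),
                    ("gender", "feminine"),
                    ("number", "plural")]:
    f = ET.SubElement(fs, "f", name=name)
    ET.SubElement(f, "sym", value=value)

print(ET.tostring(fs, encoding="unicode"))
# e.g. <fs><f name="form"><sym value="vertes" /></f>...</fs>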

vertes integration of this proposal into the on-going revision of the TEI Guidelines (TEI P5) due for publication in 2004; vert 2. Goal of the paper The paper first introduces the basic concepts of the fea- tures structure formalism. Section 4 briefly describes the proposal currently being developed as an ISO Standard and its relation to other ongoing work relating to the deploy- ment within ISO TC 37 of a general data category registry for linguistic description. The current proposals will in- clude this and other external sources for use, as a reference, in the declaration of particular feature sets. Finally some conclusions are drawn.

3. Feature structures In this XML representation, the element is used Feature structures (FSs) form an essential part of many to encode a feature structure, and the element is used language processing systems, whether their focus is on the for each of five feature-value pairs making up this structure. description, enrichment, storage, or management of lin- Each feature-value pair has a name, given by the name at- guistic data. The FS formalism itself has a formal back- tribute, and contains a primitive or atomic value, marked ground in graph theory, and supports powerful unification (in this case) by either a or a ele-

mechanisms for combining elementary structures, which ment, depending on its datatype. Other possible child ele-

¡ ¡ ¢

have facilitated its use in many real-world applications. ments for the f ¢ element include binary for binary- or ¡

There are many possible ways of representing FSs, but the boolean-values such as PLUS or MINUS, and numeric ¢ basic notions have an intrinsic legibility which make them for various kinds of numeric values and ranges. Complex very useful for representing linguistic information in inter- values can also be represented: collections or multivalues

change situations, both between people and between pro- such as lists, sets or multisets (bags) are tagged using a ¡ cessing systems. To take full advantage of this capability, coll ¢ element; feature structures may also be used as a standard way of representing such structures in electronic feature-values, thus providing a recursive ability. The com- format should be made available so that a) specialists from ponents of particular feature structures may be represented diverse application fields can share detailed expertise from directly or referred to by using pointers to previously stored diverse domains and b) implementers can share basic li- “libraries” of features or feature values. We believe that braries dedicated to the manipulation of FSs, thus reducing this XML representation has equivalent expressive power the overall cost of application development. to the classical AVM (Attribute-Value-Matrix) notation, but FSs are uniform objects that can be used to represent a is more readily processed. wide range of objects, ranging from very simple structures In developing the XML representation, the work group was able to simplify considerably the original TEI propos- 2See also http://www.tc37/sc4.org/ als as described in (Langendoen and Simons, 1995b), by fo- cussing on applications of the formalism in linguistic anal- ysis alone. The availability of new XML-based tools, in particular the relax-NG schema language now used to ex- press the TEI markup scheme, also proved beneficial for developing a powerful and expressive formalism, adequate to the needs of those using feature structure analysis. Applications for this formalism have demonstrated the need for more complex mechanisms, which are needed to handle elaborated linguistic information structures. Fol- The feature named “agr” is here labelled “@1”. Its first oc- lowing on from reference works by Shieber (PATR-II) currence contains a feature-value pair (“singular number”); (Shieber, 1986) or (Carpenter, 1992), there has been a its second references this same feature-value pair. whole range of implementations of FSs in computational An alternative way of representing this phenomenon is linguistics applications. Examples include LOGIN/LIFE to use the XML ID/IDREf mechanism, as follows: (Ait-Kaci and Nasr, 1986), ALE (Carpenter and Penn, 1996), Profit (Erbach, 1995), DyALog (de la Clergerie, 2002), ALEP (Simpkins and Groenendijk, 1994), WAM- like Abstract Machine for TFS (Wintner and Francez, 1995), etc. From another point of view, one can con- sider the variety of linguistic levels concerned with such representations, e.g. phonology, morpho-syntax, gram- mars (unification grammars: LFG, HPSG, XTAG), linguis- tic knowledge base or practical grammar implementation guide (LKB, (Copestake, 2002)), underspecified semantics (MRS, (Copestake et al., 1999)), or integration of NLP components (Schaefer, 2003). In our work, we have identified and discussed a cer- tain numbers of concepts and topics introduced in the works cited above and we are proposing an XML-based way of representing the corresponding feature structures. As examples, given for this short paper, we show the ac- tual XML implementation of structure-sharing (also called reentrency) and the XML treatment of types, two topics mentioned in 1.3.: 4.1. Structure Sharing As shown in most of the works cited above, structure sharing (or reentrancy) requires the use of labelling for rep- The working group has identified a need to distinguish resentation in graphic notation such as AVM. 
For example, the case where co-reference implies copying (or transclu- to show that a given feature-value pair (or feature structure) sion) of shared structures or values, from the case where occurs at multiple points in an analysis, it is customary to co-reference simply implies multiple references to the same label the first such occurrence, and then to represent subse- object, but has not yet reached a resolution as to which of quent ones by means of the label. the possible approaches best meets this need. In discussing how to represent this in an XML-based no- tation, we first proposed making use of a global attribute 4.2. Typed Feature Structure label or n, as in the following simple example: The typed feature structure has become a key tool in the linguistic description and implementation of many recent grammar formalisms, 4.2.1. Types Elements of any domain can be sorted into classes called types in a structured way, based on commonalities of their properties. Such linguistic concepts as phrase, word, pos (parts of speech), noun, and verb may be represented as features in non-typed feature structures. But in typed fea- ture structure particular feature-value pairs may be treated as types. By typing, each feature structure is assigned a particular type. A feature specification with a particular value is then constrained by this typing. A feature structure of the type noun, for instance, would not allow a feature like TENSE in it or a specification of its feature CASE with a value of the the arc labelled with a multit-valued feature, say SLASH. type feminine.3 Each arc branching out of the multi-valued node, say set, is then labelled with a feature appropriate to the type. 4.2.2. Definition The extension of non-typed feature structure to typed 4.3. The Equivalence of the XML Representation and feature structure is very simple in a set-theoretic frame- the AVM Annotation work. The main difference between them is the assignment The proposed XML representation having equivalent of types to feature structures. A formal definition of typed expressive power as the classical AVM notation for fea- feature structure can thus be given as follows:4: ture structures, from a semantic point of view the XML ex- pressions can be interpreted as graphs in the classical way

Given a finite set of Features and a finite set of Types, a typed feature structure is a tuple TFS = ⟨Nodes, r, θ, δ⟩ such that:

i. Nodes is a finite set of nodes.
ii. r is a unique member of Nodes called the root.
iii. θ is a total function that maps Nodes to Types.
iv. δ is a partial function from Features × Nodes into Nodes.

First, each of the Nodes must be rooted at or connected back to the root r. Secondly, there must be one and only one root for each feature structure. Thirdly, each of the Nodes, including the root r and the terminal nodes, must be assigned a type by the typing function θ. Finally, each of the Features labelling each of the Nodes is assigned a unique value by the feature value function δ. This type of information can be encoded in an XML notation, as an example (simplified, due to the length of the paper) shows below.

From a semantic point of view, the XML expressions can be interpreted as graphs in the classical way (Carpenter, 1992). In this approach, feature structures are viewed as graphs, i.e., as a certain class of set-theoretical constructs. Carpenter defines a typed feature structure as, given a set Feat of features and a set Type of (hierarchically ordered) types, a quadruple

(1)  F = ⟨Q, q̄, θ, δ⟩

where Q is a finite set whose elements are called nodes; q̄ ∈ Q is the root; θ is a total function from Q to Type (the typing); and δ is a partial function from Feat × Q to Q (defining arcs, labelled with feature names, that connect the nodes). Every node in Q is required to be reachable from the root node q̄. Pollard and Sag (1987) use this view when they introduce feature structures as semantic entities in the interpretation of representations of linguistic information. They refer to graphs as "modelling structures", i.e., as structures that play a role in models, and they introduce AVMs as structures in a "description language" that is to be interpreted in terms of feature structures-as-graphs: "Throughout this volume we will describe feature structures using attribute-value (AVM) diagrams" (Pollard & Sag, 1987, 19–20). This view corresponds to the metamodel shown in Diagram 1, which distinguishes nonterminal nodes, terminal nodes and types, connected by three kinds of relations labelled (1), (2) and (3).

Diagram 1: Metamodel with graphs as model elements
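As a rough, non-normative illustration of the definition just given, the following sketch models a typed feature structure as the quadruple ⟨Q, q̄, θ, δ⟩ and checks the stated well-formedness conditions (a single root in the node set, a total typing function, arcs confined to the node set, and reachability of every node from the root). The concrete nodes, types and features are invented for the example.

from collections import deque

class TypedFeatureStructure:
    """A typed feature structure as a quadruple (nodes, root, typing, delta).

    typing maps every node to a type; delta maps (feature, node) pairs to
    nodes, i.e. it defines the feature-labelled arcs of the graph.
    """

    def __init__(self, nodes, root, typing, delta):
        self.nodes = set(nodes)
        self.root = root
        self.typing = dict(typing)   # node -> type
        self.delta = dict(delta)     # (feature, node) -> node

    def is_well_formed(self):
        # The designated root must belong to the node set.
        if self.root not in self.nodes:
            return False
        # The typing function must be total on the nodes.
        if set(self.typing) != self.nodes:
            return False
        # Arcs must stay inside the node set.
        if not all(src in self.nodes and tgt in self.nodes
                   for (_feat, src), tgt in self.delta.items()):
            return False
        # Every node must be reachable from the root.
        seen, queue = {self.root}, deque([self.root])
        while queue:
            node = queue.popleft()
            for (_feat, src), tgt in self.delta.items():
                if src == node and tgt not in seen:
                    seen.add(tgt)
                    queue.append(tgt)
        return seen == self.nodes

# Hypothetical example: a noun whose AGR value carries NUM and GENDER.
fs = TypedFeatureStructure(
    nodes={"q0", "q1", "q2", "q3"},
    root="q0",
    typing={"q0": "noun", "q1": "agr", "q2": "sing", "q3": "fem"},
    delta={("AGR", "q0"): "q1", ("NUM", "q1"): "q2", ("GENDER", "q1"): "q3"},
)
print(fs.is_well_formed())   # True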


Note here that the line f name=“pos” ¢ sym

¡ Relations of type (1) in this metamodel correspond to fea- ¢

value=“verb”/ ¢ /f in the embedded feature structure


¡ tures like HEAD-DAUGHTER in HPSG, those of type (2) ¢ fs ¢ has been replaced by typing that fs as in fs

to atomic-valued features like GENDER, and those of type  type=“verb” ¢ . (3) to the typing function . The use of type may also increase the expressive power An alternative view is that of graphs as representations, of a graph notation. On the typed graph notation, for as a notational alternative to AVMs rather than as the ob- instance, multi-values can be represented as terminating jects interpreting AVMs. For example, Lee (2004) intro- nodes branching out of the node labelled with the type set, duces feature structures as ways of capturing information, multiset or list. This node in turn is a terminating node of and mentions graphs as a notation for feature structures. 3Note that atomic feature values are considered types, too. Aware of these alternative possible views, Pollard & Sag 4Slightly modified from (Carpenter, 1992). (1987) note that “A common source of confusion is that 5The unique-value restriction on features does not exclude feature structures themselves can be used as descriptions multi-values or alternative values because even in these cases each of other feature structures.” One way to avoid confusion feature ultimately takes a single value which may be considered is to consider the metamodels corresponding to alternative complex in structure. views. In the graphs-as-representations view, the graph (2) and In general, conversion or translation of different XML the AVM (3) are seen as equivalent representations that can representations is required. In the case of XML, such both be interpreted as representing the complex predicate a translation is called transformation, and the established (4). W3C standard language for XML transformation is XSLT (eXtensible Styleheet Transformation; (Clark, 1999)). The input of an XSL transformation is always XML, (2) while the output can be of any syntax, including XML as a AGR NUM

well-supported target format.

noun ———— agr ———- sing ¡ To illustrate the use of XML transformation for of fea-

+————– fem ture structure markup, we give concrete examples. GENDER

5.1. Feature structures as target representation

§ ¨

¨ Construction of (typed) feature structures from £

(3) ¢ noun ¤ £ other XML representations that are e.g. produced by a NUM sing

© shallow NLP system. Specific elements with attributes are

AGR ¥

GENDER fem ¦ translated to possibly nested feature-value pairs, e.g. for in- put to a HPSG(Pollard and Sag, 1994a) parser etc. In the

(4)    ! #"%$& '()'+*,-.*,/+ ! 10*, following example, is trans- lated to the corresponding feature structure. Of course, also (simplifying slightly). This interpretation reflects a similar symbolic names e.g. sg to singular etc. can be translated. view on information as that of first-order logic, with two kinds of individuals: the kind of things that 2 stands for (words and phrases) and the kind of atomic attribute val- ues like ‘fem’ and ‘sing’. These values are associated with word-like individuals through two-place predicates that are in fact functions; moreover, types such as ‘noun’ corre- spond to unary predicates. This corresponds to the meta- model visualized in Diagram 2. Grammar exchange format or meta syntax like in (1) SProUT (Drozdzynski et al., 2004), where a TDL-like ? grammar syntax (Krieger and Schafer¨ , 1994) is translated to (2) words & - atomic phrases feat. values an internal representation based on feature structure XML. 6 The internal representation is used as input for type check- (2) ing and compilation. ? (4) Data exchange between NLP components, e.g. the - feat. value complexes (3) so-called SProUTput DTD that is used for exchange of typed feature structures with external NLP components (in- Diagram 2: First-order metamodel for feature structures put and output) in SProUT6.

Relations of type (1) in this diagram (1) correspond again to features like HEAD-DAUGHTER; (2) to atomic-valued features like GENDER; (3) to features like SYNSEM, and (4) to features like AGR(EEMENT). 5. The role of feature structure markup done offline, e.g., for the exchange of lexica, grammatical resources, or annotated documents. ponents, where several, specialised modules contribute to improved (e.g., disambiguated or more precise) lin- guistic analyses. Examples for such hybrid architectures are Whiteboard (Frank et al., 2003; Schafer¨ , 2003) and 5.2. Feature structures as source representation DeepThought (Callmeier et al., 2004). Extraction or projection of information encoded in In both cases, online or offline integration, different rep- feature structures such as morphology to other formats or resentations of linguistic data can be involved, where fea- as API-like accessors. e.g. an XPath expression like ture structures can either form the source or the target rep- resentation or even both. 6Element names are uppercase in the SProUTput DTD DAG (e.g. for feature path access); XML ID/IDREF dec- If cyclic reentrancies are disallowed, copying of shared val- ues when generating the features structure representation is an easy and probably quicker way in order to get the full is the inverse of example 5.1. above. access to shared values. Identity information is preserved AVM visualisation tools or editors like the feature through the reentrancy attribute anyway. structure renderer in SProUT or Thistle (Calder, 2000) both take (different) descriptions of typed feature structures and 6. Related work within the ISO framework render a graphical representation of the feature structure. Extraction of tree structures etc. from a a complex A distinctive feature of the TEI Guidelines is its use of HPSG feature structure, e.g. for further linguistic process- an integrated model of documentation and documentation ing or visualisation in Thistle. outputs. The ODD system used to produce its recommen- dations, both as printed documentation and as formal syn- Generation of semantics representation. An exam- tax expressed in XML Schema or DTD languages, has re- ple is a transformation of typed feature structures to RMRS cently been revised and re-expressed. This new modular XML markup (Copestake, 2003) which e.g. forms the ba- system for documentation is likely to have wide take up in sic representation for the exchange of deep and shallow many different domains. In applying it to the expression of NLP processing results in the DeepThought architecture the feature structure analysis language, we have identified a (Callmeier et al., 2004), cf. Fig. 1. number of potential areas of synergy with the ongoing ISO 7 5.3. Feature structures as both source and target work on data category registry . representation 7. Conclusions Translation between different feature structure syn- taxes or systems. We give an example of list values The work reported has proved to be an excellent oppor- that can be encoded differently in typed feature structure tunity for experimenting with the new descriptive frame- markup. The XSLT template below takes a list encoded work being developed for the TEI Guidelines themselves. as nested FIRST-REST list typed *cons* and translates it The feature structure activity has been a useful opportunity to the proposed with embedded elements from the to experiment with the creation of relevant tagging systems FIRST attribute values in the input. 
The template works and tools in a relatively limited but formally complex do- recursively on FIRST-REST lists of any length. main. In general, the activity reported in the paper shows many benefits to be gained from joint work on issues which require complementary expertise in textual representation methods and in the representation of linguistic concepts.

5.4. Reentrancies and Transformation A general issue that arises for the case where feature structures are source representations is reentrancies. Here, ‘dereferencing‘ is necessary on the basis of lookup in the XML source in order to have access to every node in the 7See for more details: http://jtc1sc36.org/doc/36N0581.pdf "Paris" Figure 1: Transformation of feature structure XML markup (SProUT) to RMRS (DeepThought).

8. References Copestake, Ann, 2003. Report on the design of RMRS. Technical Report D1.1b, University of Cambridge, Cam- Ait-Kaci, H and R Nasr, 1986. Login: A logic program- bridge, UK. ming language with built-in inheritance. J. Log. Pro- gram., 3(3):185–215. Copestake, Ann, Dan Flickinger, Ivan Sag, and Carl Pol- lard, 1999. Minimal recursion semantics: An introduc- Bering, Christian, Witold Drozdzyski, Gregor Erbach, tion. Draft. Clara Guasch, Petr Homola, Sabine Lehmann, Hong Li, Crysmann, Berthold, 2003. On the efficient implementa- Hans-Ulrich Krieger, Jakub Piskorski, Ulrich Schafer¨ , tion of german verb placement in HPSG. In Proceedings Atsuko Shimada, Melanie Siegel, Feiyu Xu, and of RANLP-2003. Borovets, Bulgaria. Dorothee Ziegler-Eisele, 2003. Corpora and evaluation tools for multilingual named entity grammar develop- Crysmann, Berthold, Anette Frank, Bernd Kiefer, Stefan ment. In Proceedings of Multilingual Corpora Workshop Muller¨ , Jakub Piskorski, Ulrich Schafer¨ , Melanie Siegel, at Corpus Linguistics 2003. Lancaster. Hans Uszkoreit, Feiyu Xu, Markus Becker, and Hans- Ulrich Krieger, 2002. An Integrated Architecture for Bresnan, Joan (ed.), 1982. The Mental Representation Deep and Shallow Processing. In Proceedings of ACL of Grammatical Relations. Cambridge, MA: The MIT 2002. Philadelphia, PA. Press. de la Clergerie, Eric´ Villemonte, 2002. Construire des anal- Busemann, Stephan, Witold Drozdzynski, Hans-Ulrich yseurs avec dyalog. In Proceedings of TALN ’02. Krieger, Jakub Piskorski, Ulrich Schafer¨ , Hans Uszko- Drozdzynski, Witold, Hans-Ulrich Krieger, Jakub Pisko- reit, and Feiyu Xu, 2003. Integrating information ex- rski, Ulrich Schafer¨ , and Feiyu Xu, 2004. Shallow traction and automatic hyperlinking. In Proceedings processing with unification and typed feature struc- of ACL-2003, Interactive Posters/Demonstrations. Sap- tures — foundations and applications. Kunstlic¨ he poro, Japan. Intelligenz, 1:17–23. Http://www.kuenstliche- Calder, Joe, 2000. Thistle: Diagram Display Engines and intelligenz.de/archiv/2004 1/sprout-web.pdf. Editors. HCRC, U. of Edinburgh. Erbach, Gregor, 1995. Profit: with features, inheri- Callmeier, Ulrich, 2000. PET — A platform for exper- tance, and templates. In Proceedings of EACL ’95. imentation with efficient HPSG processing techniques. Frank, Anette, Markus Becker, Berthold Crysmann, Bernd Natural Language Engineering, 6 (1):99 – 108. Kiefer, and Ulrich Schafer¨ , 2003. Integrated shallow and Callmeier, Ulrich, Andreas Eisele, Ulrich Schafer¨ , and deep parsing: TopP meets HPSG. In Proceedings of Melanie Siegel, 2004. The DeepThought core architec- ACL-2003. Sapporo, Japan. ture framework. In Proceedings of LREC-2004. Liss- Kasper, Walter, Jor¨ g Steffen, Jakub Piskorski, and Paul abon, Portugal. Buitelaar, 2004. Integrated language technologies for Carpenter, Bob, 1992. The Logic of Typed Feature Struc- multilingual information services in the MEMPHIS tures. Cambridge University Press. project. In Proceedings of LREC-2004. Lissabon, Por- Carpenter, Bob and Gerald Penn, 1996. Efficient parsing tugal.

of compiled typed attribute value logic grammars. In ¡ £¢ Krieger, Hans-Ulrich and Ulrich Schafer¨ , 1994. — H. Bunt and M. Tomita (eds.), Recent Advances in Pars- a type description language for constraint-based gram- ing Technology. Kluwer, page Recent Advances in Pars- mars. In Proceedings of the 15th International Confer- ing Technology. ence on Computational Linguistics, COLING-94. Clark, James, 1999. XSL Transformations (XSLT). W3C, Langendoen, D. Terence and Gary F. Simons, 1995a. A ra- http://w3c.org/TR/xslt. tionale for the TEI recommendations for feature structure Clark, James and Steve DeRose, 1999. XML Path Lan- markup. In Nancy Ide and Jean Veronis (eds.), Comput- guage (XPath). W3C, http://w3c.org/TR/xpath. ers and the Humanities 29(3). The Text Encoding Initia- Copestake, Ann, 2002. Implementing Typed Feature Struc- tive: Background and Context, Dordrecht: Kluwer Acad. ture Grammars. Stanford, CA: CSLI publications. Publ. Reprint. Langendoen, Terence D. and Gary F. Simons, 1995b. A ra- tionale for the tei recommendations for feature-structure markup. Computers and the Humanities, 29:191–209. Lee, Kyiong, 2004. Language resource management – fea- ture structures part1: Feature structure representation. Document iso/tc 37/sc 4 n 033, ISO. Pollard, Carl and Ivan A. Sag, 1994a. Head-Driven Phrase Structure Grammar. Studies in Contemporary Linguis- tics. Chicago: University of Chicago Press. Pollard, Carl J. and Ivan A. Sag, 1987. Head-Driven Phrase Structure Grammar. University of Chicago Press. Pollard, Carl J. and Ivan A. Sag, 1994b. Head-Driven Phrase Structure Grammar. University of Chicago Press. Sailer, Manfred and Frank Richter, 2001. Eine XML- Kodierung fur¨ AVM-Beschreibungen. In Henning Lobin (ed.), Proceedings der GLDV-Fruhjahr¨ stagung 2001. Gesellschaft fur¨ linguistische Datenverarbeitung. Schaefer, Ulrich, 2003. What: An xslt-based infrastruc- ture for the integration of natural language processing components. In Proceedings of HLT-NAACL 2003 Work- shop: Software Engineering and Architecture of Lan- guage Technology Systems. Schafer¨ , Ulrich, 2003. WHAT: An XSLT-based Infrastruc- ture for the Integration of Natural Language Processing Components. In Proc. of the Workshop on the Software Engineering and Architecture of LT Systems (SEALTS), HLT-NAACL03. Edmonton, Canada. Shieber, Stuart M., 1986. An Introduction to Unification- Based Approaches to Grammar, volume 4 of CSLI Lec- ture Notes Series. Stanford, CA: Center for the Study of Language and Information. Simpkins, N. and M. Groenendijk, 1994. The alep project. technical report. Technical report, Cray Systems / CEC. Thompson, Henry S. and David McKelvie, 1997. Hyper- link semantics for standoff markup of read-only docu- ments. In Proceedings of SGML-EU-1997. Uszkoreit, Hans, 2002. New Chances for Deep Linguis- tic Processing. In Proceedings of COLING 2002. Taipei, Taiwan. Wintner, Shuly and Nissim Francez, 1995. Parsing with typed feature structures. In Proceedings of the Fourth International Workshop on Parsing Technologies. An Ontology for NLP Services

Ewan Klein, Stephen Potter

School of Informatics, University of Edinburgh, Edinburgh, Scotland
{ewan,stephenp}@inf.ed.ac.uk

Abstract

1. Introduction but using a different tagger – say the CandC tagger (Cur- ran and Clark, 2003). A number of issues arise. First, how The main focus of this paper is a framework for describing likely is it that John can simply re-run Janet’s system as it and discovering NLP processing resources. In many ways, was? Can he be sure that he’s using the same versions of the most difficult aspect of this task is the huge space of the software, trained on the same data? Can he be sure that options. Even disregarding the wide variety of theoretical he’s getting the same results against the same Gold Stan- models for describing natural languages, and even if we re- dard test data? For such a scenario, it would be useful to strict attention exclusively to NLP tools, there is sufficient have some method for recording and archiving particular diversity within the NLP community to provoke much dis- configurations of tools, together with a record of the results. agreement about the best way to describe such tools. In this Moreover, the configuration needs to include information paper, we try to narrow down the range of choices by fo- about the location of the relevant versions of the tools. Re- cusing on the following issues. First, we emphasize the role running the experiment should ideally be as simple as firing of description in supporting tool interoperability. Second, up a configuration manager and reloading the configuration we place interoperability within the context of service com- script from a pull-down menu. position. Third, we develop an ontology of NLP services that is informed by the OWL-S semantic framework (OWL- Assuming that all went smoothly, how likely is it that John S). can simply remove the call to TnT in Janet’s script and splice in CandC instead? Do they have the same formatting requirements on their input and output? Do the available 2. Service Use Cases pre-trained versions of these taggers use the same tagsets? Do they require the same number of command-line argu- 2.1. Describing NLP Resources ments? Unfortunately, the answer to these three questions is No.1 This problem is hard for the human to deal with; Before embarking on proposals for how language resources consider how much harder it would be to develop a work- should be described, it is important to consider what re- bench that would automatically check whether two tools quirements need be met. Consequently, we describe two could be serially composed. A crucial obstacle to auto- use cases which will guide our design objectives. We are matic composition is that we lack a general framework for currently interested in addressing the needs of two, rather describing NLP tools in terms of their inputs and outputs. different, communities of users: NLP researchers, on the one hand, and users of domain-specific text mining ser- vices, on the other. 2.3. Text Mining for e-Science

There is increasing interest in deploying text mining tools 2.2. The NLP Workbench to assist scientists in various tasks. One example is knowl- edge discovery in the biomedical domain: a molecular biol- NLP researchers frequently wish to construct application- ogist might have a list of 100 genes which were detected in specific systems that combine a variety of tools, some of a micro-array experiment and wishes to trawl through the them in-house and some of them third-party. For example, existing published research for papers which throw light on to carry out a named entity recognition task, Janet might the function of these genes. Another example is the as- want to run a statistical classifier over a corpus that has been tronomer who detects an X-ray signal in some region of tokenized, tagged with parts of speech (POS) and chun- the sky, and wants to investigate the online literature to ked. She might use her own tokenizer and chunker but use see if any interesting infra-red signals were detected in the someone else’s tagger — say the TnT tagger (Brants, 2000). same region. These brief scenarios are special cases of a She will almost certainly have to write some glue code in a more general interest in using computing technologies to scripting language to plug these different tools together. support scientific research, so-called “e-Science” (Hey and Suppose now that John needs to work with Janet’s system a few months later. He wants to try the same experiment, 1For example, in the case of formatting requirements, TnT uses multiple tabs as a separator between word and tag, while by de- We are grateful to Steven Bird, Denise Ecklund, Harry Halpin fault, CandC uses the underscore as a separator; it can, however, and Jochen Leidner for helpful discussion and comments. be configured to use a single tab as separator. Trefethen, 2002). the Semantic Grid and both approaches interoperability as being achieved through deployment of services. Fos- In such cases, we expect that the researcher will be us- ter et al. (2002) give a general characterization of ser- ing a workflow tool; that is, something that “orchestrates vices as “network enabled entities that provide some ca- e-Science services so that they co-operate to implement pability through the exchange of messages”, and argue 2 the desired behaviour of the system”. In this context, we that a service-oriented perspective supports virtualization would want the researcher to have access to a variety of text in which resources can be accessed uniformly despite being mining services that will carry out particular tasks within a being implemented in diverse ways on diverse platforms. larger application. A service might be essentially a doc- ument classification tool which retrieves documents in re- We would like, then, an environment that offers an infras- sponse to a string of keywords. However, it might be a tructure for the discovery, orchestration and invocation of more elaborate information extraction system which, say, services, and one that is flexible and permits a high degree populates a database that is then queried by some further of re-use and automation of workflows. This desire coin- service. cides with the aims of much of the effort in the Semantic Web Services community, and so it is to this community Each text mining tool which is accessible to the scientist that we look for guidance. end-user must be able to describe what kind of service it offers, so that it can be discovered by the workflow tool. 
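As a small, concrete instance of the glue code the first use case calls for, the sketch below converts part-of-speech tagger output between the two conventions mentioned in footnote 1: TnT's tab-separated word/tag lines and CandC's underscore-joined tokens. It is an illustrative adapter under those default settings only; real tool output has further details (comment lines, tokenisation quirks) that are not modelled here.

def tnt_to_candc(tnt_output: str) -> str:
    """Turn TnT-style lines ("word<TAB>tag") into CandC-style "word_tag" tokens."""
    tokens = []
    for line in tnt_output.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank separator lines
        word, tag = line.split(None, 1)   # split on any run of whitespace/tabs
        tokens.append(word + "_" + tag.strip())
    return " ".join(tokens)

def candc_to_tnt(candc_output: str) -> str:
    """Turn CandC-style "word_tag" tokens back into tab-separated lines."""
    lines = []
    for token in candc_output.split():
        word, _, tag = token.rpartition("_")   # split at the last underscore
        lines.append(word + "\t" + tag)
    return "\n".join(lines)

print(tnt_to_candc("The\tDT\ncat\tNN\nsleeps\tVBZ"))   # The_DT cat_NN sleeps_VBZ
print(candc_to_tnt("The_DT cat_NN sleeps_VBZ"))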
The Semantic Web ‘vision’ is one of enhancing the repre- We would expect there to be a more coarse-grained func- sentation of information on the web with the addition of tionality in this use case: the scientist is unlikely to care well-defined and machine-processable semantics (Berners- which tagger is being used. Nevertheless, it will not always Lee et al., 2001), thereby encouraging a greater degree of be easy to predict in advance where the external boundary ‘intelligent’ automation in interactions with this complex of a text mining service will lie, so in principle the chal- and vast environment. One thread of this initiative concerns lenge of developing explicit and well-understood interfaces the provision of web-based services: the web has great for text mining services overlaps with the previous use case. potential as a medium for, on the one hand, web service- An additional constraint is that the NLP tools must be in- providers to advertise their services and conduct their busi- teroperable with other services provided by the e-Science ness, and on the other, for those with particular service workflow environment, and must be accompanied by de- needs to publicise these needs to the environment so as to scriptions which are intelligible to non-NLP practitioners. have them satisfied.

3. Design Influences and Goals 3.1. Description Logics and OWL-S In order to tease out requirements, let’s reflect further on our first use case. Assume that we are given a simple A number of de facto standards exist for locating and pipeline of processors which carries out some well-defined invoking web services; these include Unversal Descrip- tion, Discovery and Integration protocol (UDDI; Bellwood text processing task. We wish to remove one processor, say et al., 2002), a protocol for building and using registries of a POS tagger, and splice in a new one, while ensuring that we preserve the overall functionality of the pipeline. This services, the Web Services Description Language (WSDL means that we need to abstract away from particular tag- Christensen et al., 2001), an XML-based language for de- gers to a class of such tools, all of which carry out the same scribing the operations a service offers, and SOAP (Gudgin transformation on their input. At this level, we can talk et al., 2003), an XML-based messaging protocol for com- municating with a service. At the time of writing, however, broadly about interchangeability of functionally equiva- the discovery and use of the relatively few services which lent processors. On the other side of the coin, informa- tion about the input and output parameters of two taggers exist relies to a large extent on syntactic matching of terms A and B must be detailed enough for us to tell whether, and on human engineering of the content of the invocation when A is replaced by B, B will accepts input from the calls to them. In order to move towards a semantic service immediately preceding processor and produce output that environment, efforts have been made over the last couple of years to develop the OWL-S (previously DAML-S) upper is acceptable for the immediately following processor. In ontology for describing web services. The intention of this other words, we require a processor to be accompanied by metadata which enables us to make decisions about inter- initiative is to provide an XML-based ontology which stip- operability. ulates the basic information that services should expose to the environment in order to facilitate their automatic dis- de Roure and Hendler (2004) argue that interoperability covery, invocation, composition and monitoring (OWL-S). is a key notion for the e-Science research programme, This ontology is specified in the OWL Web Ontology Lan- and that technologies from both the Grid (Foster et al., guage which provides a language for specifying Descrip- 2001) and the Semantic Web will underpin the programme. tion Logic constructs in the syntax of XML and building on The integration of the two technologies has been dubbed top of the RDF data model. Description Logics (e.g. Baader et al., 2003) form a subset of first-order logics which are particularly suited to the description of hierarchical ontolo- 2See http://www.nesc.ac.uk/esi/events/303/, which gies of concepts, and possess appealing tractability charac- offers a fairly recent overview of current work in the workflow teristics. Hence, an OWL document describes a machine- area. processable ontology or fragment of an ontology. tional ontological knowledge. However, the use of domain- specific ontologies in this manner places certain obligations on agents in this domain. For service discovery to be pos- 3.2. 
OWL-S: Profile, Process and Grounding sible, both the service providers and potential clients must use the same ontologies: the former to advertise their ser- The OWL-S ontology is divided into three principal areas 3 (cf. Figure 1). The Service Profile is used to describe the vices, the latter to formulate their requests. Accordingly, purpose of the service, and so primarily has a role in the there is a need for a standardization effort within domains initial discovery of candidate services for a particular task. in order to develop useful and useable ontological descrip- For the purposes of this paper, we will concentrate on the tions of services. Section 4. describes one such extension of the OWL-S Profile, for describing NLP services, and in use and description of profiles, and hence on NLP service discovery. The Service Model describes how the service is such a manner as to permit both the transformation and cat- performed, and is intended for more detailed consideration egorization modes of use described above. of the adequacy of the service for the task, to allow the pre- cise composition and coordination of several services and 3.3. Reasoning with Profiles: Brokering to enable the execution of the service to be monitored. Fi- nally, the Service Grounding specifies in concrete terms Another implication of our approach is that there is at least how the service is actually invoked, the nature of the mes- one ‘broker’ agent in the domain that acts as a repository sages it expects, the address of the machine and port to for service advertisements and is able to answer service re- which these messages should be addressed and so on. We quests.4 The locations of these brokers would of necessity assume that, in general, if a service profile meets the re- be known a priori to agents in the domain. quirements of a client, then any grounding of that service will be an adequate instantiation. Among the fundamental reasoning capabilities of Descrip- tion Logics are the subsumption of class terms and the clas- The role of the Profile, then, is to describe the essential sification of individuals into their appropriate categories or capability of the service by characterizing it in functional classes. Brokers can exploit these abilities to perform ser- terms (in addition, non-functional aspects of the service can vice discovery in a number of different ways. For example, be specified through additional ‘service parameters’). This on its advertisement, the profile description can be used functional characterization is expressed by detailing the in- to classify this service instance into its appropriate loca- puts a service expects, the outputs it produces, the precon- tion in the domain ontology. Subsequent queries can be ditions that are placed on the service and the effects that the interpreted as defining a class description of the desired service has. As well as characterizing services, the Profile services; the instances of classes in the service hierarchy has an additional use: to allow potential clients to specify which are equivalent to or subsumed by this class are con- and query for their desired services (which may be partial sidered to satisfy this query.5 or more general in nature where details are irrelevant to the client). It is with this sort of reasoning in mind that we approach the formalization of the NLP domain. Through the use of these IOPE (Input-Output- Preconditions-Effects) parameters, a service (or query) 4. 
A Profile Hierarchy for Linguistic may be described in terms of a transformation of its input Resources data into its output data (for example, a POS tagging service can be described as transforming a document into a If we view NL resources as classes arranged in a hierarchy, tagged document). By ‘typing’ data in this fashion, we gain then a number of taxonomies are possible. It seems rela- the ability to define and instantiate ‘semantic pipelines’ of data through a workflow consisting of a number of distinct services. 3Alternatively, one could envisage the use of different ontolo- gies, along with descriptions of equivalence mappings between However, another mode of use is possible: by extending the their entities, but this introduces additional engineering and pro- core OWL-S ontology within a particular domain with sub- cessing overheads. The automation of ontology mapping is a dif- classes of the Profile class, we also gain the ability to ad- ficult problem, for which there are currently no general solutions. vertise and request services in terms of their categorization; 4Different types of broker are possible. The simplest (some- so one might ask for, say, an NL-Tagger if one knew that times termed a ‘matchmaker’ agent) would return matching ad- a tagger was required at this point in the workflow. Both vertisements to the requesting agent, which is then responsible for the ‘transformation’ and ‘categorization’ modes have their selecting and invoking one of these services. More sophisticated uses, and so it is desirable that they be supported in any brokers might try to dynamically construct composite ‘services’ consisting of a number of individual services were none of these environment. alone can satisfy the query, or else to apply heuristics to select, This leads to consideration of precisely how particular ser- negotiate with and invoke services on behalf of the requester. Cf. (Paolucci et al., 2002) for further discussion. vices in a particular domain are to be described. The OWL- 5This basic approach can be extended, if more solutions are S ontology is (necessarily) domain-independent: to ex- required, to return instances of classes which subsume the query press concepts of particular domains one has to extend the class, or even of those which are merely not necessarily disjoint OWL-S ontology through the introduction and use of addi- with the class (although the solutions returned in these cases can no longer be ‘guaranteed’, in any sense, to satisfy the query). provides Resource Service

(A Service presents a ServiceProfile, is described-by a ServiceModel, and supports a ServiceGrounding.)

Figure 1: OWL-S Service Ontology tively uncontroversial to posit a class NL-Resource which Document is partitioned into two subclasses, NL-StaticResource and hasMIME-Type MIME-Type NL-ProcessingResource (cf. Cunningham et al., 2000). By hasDataFormat anyURI ‘static resources’ we mean things like corpora, probabil- hasAnnotation Annotation ity models, lexicons and grammars; by ‘processing re- hasSubjectLanguage ISO-693 sources’ (or processors) we mean tools such as taggers hasSubjectDomain Domain and parsers that use or transform static resources in vari- ous ways. As mentioned earlier, the main challenge is to find a motivation for imposing a further taxonomy onto NL- Figure 2: The Document class ProcessingResource. Our proposal rests on the following ideas: hasMIME-Type: The obvious values to consider are audio for processors which allow speech input, and 1. NLP processors have documents as both input and out- text/plain and text/XML for text processing tools. put. However, we also wish to allow cases where the value 2. Documents have properties which impose precondi- of hasMIME-Type is underspecified with respect to tions on processors and which also record the effects these second two options. Consequently, we treat of processing. Text as a subclass of MIME-Type, partitioned into sub- classes TextPlain and TextXML. 3. A specification of the properties of documents, as in- put/output parameters, induces a classification of NLP hasDataFormat: The value of this property is a URI, more processors. specifically, the URI of a resource which describes the data format of the document. By default, the resource will be an XML DTD or Schema, but any well-defined We make the assumption that NLP tools are in general ad- ditive, in the sense that they contribute new annotation to specification of the document’s structure would be ac- an already annotated document and do not remove or over- ceptable in principle. write any prior annotation.6 As a result, at any point in the hasAnnotation: We treat Annotation as an enumerated processing chain, the annotated document is a record of all class of instances, namely the class {word, sentence, that has preceded and thereby provides a basis for making pos-tag, morphology, syntax, semantics, pragmat- subsequent annotation decisions. This general approach is ics}. Although we believe that these annotation types particularly prominent in XML-based approaches to linguis- are fairly non-controversial, any broadly-accepted re- tic markup, but is also prevalent elsewhere. stricted vocabulary of types would be acceptable. The presence of word and sentence reflect the fact that to- 4.1. Document Properties kenizers will typically segment a text into tokens of one or both these types. Types such as syntax are in- Figure 2 illustrates the Document class, together with its tended to give a coarse-grained characterization of the main properties. We do not wish to be prescriptive about dimension along which annotation takes place. How- the allowable class of values for each of these properties. ever, the specific details of the annotation will depend Nevertheless, we will briefly describe our current assump- on the data model and linguistic theory embodied in a tions. given processing step, and we wish to remain agnostic about such details. 6In practice, some removal of low-level annotation might take place, and we could also envisage approaches in which ambiguity hasSubjectLanguage: Following Bird and Simons is reduced by overwriting previous annotation. 
Nevertheless, for (2001), we use the term ‘subject language’ to mean current purposes the assumption of additivity seems a reasonable “the language which the content of the resource simplfication, describes or discusses”. Values for this property are presumed to come from ISO 639 (i.e., two- or Every subclass of NL-Analyzer will of course inherit these three-letter codes).7 restrictions, and will in turn impose further restrictions of their own.8 Thus, we might insist that every tokenizer iden- hasSubjectDomain: We are focussing here on tool- tifies and annotates word tokens. That is, NL-Tokenizer’s related properties, rather than application-related output will be a Document with the additional restriction properties; consequently the domain or subject mat- that the set of annotation types marked in the document ter of a document is outside the scope of our discus- contains word. Similarly, NL-Tagger will require that its sion. However, within a given application, there may input document has been marked for the annotation type well be domain ontologies which would provide use- word (i.e., has been tokenized), and will output a document ful detail for this property. Moreover, it is obviously of which has additionally been marked for the annotation type interest to test whether a statistical tool that has been pos-tag. trained on one domain can be ported to another. Recall that as a value of hasMIME-Type, Text is underspec- At least some of the document properties that we wish ified: it can be specialised as either TextPlain or TextXML. to record fall within the scope of Dublin Core metadata, Consequently, a tagger which was able to deal equally with and indeed we might want augment the properties men- both kinds of input could advertise itself as having the more tioned above with further elements from the Core, such as general value for hasMIME-Type, namely Text. This would publisher and rights. Bird and Simons (2003) have ar- allow us to compose the tagger with a tokenizer whose gued in favour of uniformly building metadata for describ- output had the property hasMIME-Type . TextXML—that ing language resources as extensions of the Dublin Core. is, composition is allowed if the input of the tagger sub- On the face of it, this is an attractive proposal. However, sumes the output of the tokenizer. However, the reverse there is at least a short term obstacle to implementing it is not true. Suppose the tagger only accepts input with within our current framework: as an intellectual resource, hasMIME-Type . TextXML. Then it cannot straightforwardly an OWL-S ontology also needs to be provided with meta- be composed with a tokenizer whose output is more gen- data, and the obvious solution is to encode such informa- eral, namely hasMIME-Type . Text. tion using Dublin Core elements. Thus, we would need to carefully distinguish between metadata concerning the Although we have concentrated on Document as the input ontology itself, and metadata concerning classes of objects parameter for processors, we need to allow additional in- (such as Document) within the ontology. We therefore post- puts. For example, we allow the NL-Tagger class to have pone consideration of this issue to the future. the input parameter usesTagset, where possible instances would include the Penn Treebank Tagset, the CLAWS2 Tagset, and so on. Moreover, the subclass of probabal- 4.2. 
4.2. Processing Resources

In Figure 3, we sketch a portion of the Profile Hierarchy in order to illustrate the classification of processing resources. The class NL-ProcessingResource is shown with two properties, hasInput and hasOutput: both take values from the class Document. Now, we can create subclasses of Document by restricting the latter’s properties. For example, consider the class Document ⊓ ∃ hasMIME-Type . Text. This is interpreted as the intersection of the set of things in the extension of Document with the set of things whose hasMIME-Type property takes some value from the class Text.

To create a subclass of NL-ProcessingResource, we restrict the class of the inputs, outputs, or both. For example, if the property hasInput is restricted so that its value space is not the whole class Document, but rather just those documents whose MIME type is Text, then we thereby create a new subclass of NL-ProcessingResource; i.e., those processors whose input has to be text rather than audio. We call this the class NL-Analyzer (implicitly in contrast to speech recognizers, whose input would be audio). Note that since the domain of the property hasMIME-Type is in any case restricted to the class Document, we can simplify hasInput . (Document ⊓ ∃ hasMIME-Type . Text) to hasInput . (∃ hasMIME-Type . Text), as shown in the property specification for NL-Analyzer in Figure 3.

Every subclass of NL-Analyzer will of course inherit these restrictions, and will in turn impose further restrictions of their own.8 Thus, we might insist that every tokenizer identifies and annotates word tokens. That is, NL-Tokenizer’s output will be a Document with the additional restriction that the set of annotation types marked in the document contains word. Similarly, NL-Tagger will require that its input document has been marked for the annotation type word (i.e., has been tokenized), and will output a document which has additionally been marked for the annotation type pos-tag.

Recall that as a value of hasMIME-Type, Text is underspecified: it can be specialised as either TextPlain or TextXML. Consequently, a tagger which was able to deal equally with both kinds of input could advertise itself as having the more general value for hasMIME-Type, namely Text. This would allow us to compose the tagger with a tokenizer whose output had the property hasMIME-Type . TextXML—that is, composition is allowed if the input of the tagger subsumes the output of the tokenizer. However, the reverse is not true. Suppose the tagger only accepts input with hasMIME-Type . TextXML. Then it cannot straightforwardly be composed with a tokenizer whose output is more general, namely hasMIME-Type . Text.

Although we have concentrated on Document as the input parameter for processors, we need to allow additional inputs. For example, we allow the NL-Tagger class to have the input parameter usesTagset, where possible instances would include the Penn Treebank Tagset, the CLAWS2 Tagset, and so on. Moreover, the subclass of probabilistic taggers would require an additional input parameter, namely the probability model acquired during training.

Within the framework of OWL-S, we would expect a concrete service to be an instance of a class defined in the Profile Hierarchy. Thus, a particular tagger, say TnT, would advertise itself by declaring that it was an instance of NL-Tagger, and further specifying values for properties that were mandatory for this class.

4.3. Data Format Requirements

In our earlier discussion, we said that the value of hasDataFormat would be a file URI. An alternative would be to allow processors to specify abstract data types as inputs and outputs (Sycara et al., 2002; Zaremski and Wing, 1997). For example, we might say that a tagger takes as input a sequence of sentences, each composed of a sequence of word tokens, and outputs a sequence of sentences, each composed of a sequence of word-tag pairs. However this doesn’t fit in well with the limitations of ontology languages such as Description Logic. For the purposes of matchmaking, a pointer to a format definition file outside the profile hierarchy seems sufficient and more tractable.

8 Note that Description Logic, and thus OWL-S, only supports strict inheritance—defaults are not accommodated.
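As an indication of what such a service advertisement could look like, here is a sketch of a hypothetical TnT-style tagger instance, continuing the illustrative namespace introduced above. The individual names, the usesTagset property spelling, and the use of Document-typed individuals to stand in for the OWL-S input/output parameter descriptions are all simplifying assumptions of this example, not the authors’ actual encoding.

# A hypothetical concrete tagger advertising itself as an instance of NL-Tagger.
:MyTnTTagger a :NL-Tagger ;
    :usesTagset :PennTreebankTagset ;    # additional input parameter (see above)
    :hasInput   :myTaggerInput ;
    :hasOutput  :myTaggerOutput .

# Simplified stand-ins for the mandatory input/output descriptions:
# tokenized plain text in, plain text with pos-tag annotation added on output.
:myTaggerInput  a :Document ; :hasMIME-Type :plainText ; :hasAnnotation :word .
:myTaggerOutput a :Document ; :hasMIME-Type :plainText ; :hasAnnotation :word , :pos-tag .

:plainText          a :TextPlain .
:PennTreebankTagset a :Tagset .          # Tagset is assumed here as a further class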

NL-ProcessingResource
    hasInput: Document
    hasOutput: Document

    isa: NL-Analyzer
        hasInput: ∃ hasMIME-Type . Text
        hasOutput: ∃ hasMIME-Type . Text

        isa: NL-Tokenizer
            hasOutput: hasAnnotation = {word}

        isa: NL-Tagger
            hasInput: hasAnnotation = {word}
            hasOutput: hasAnnotation = {word, pos-tag}

Figure 3: The ProcessingResource class
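One possible OWL encoding of the restrictions shown in Figure 3, again in the illustrative namespace used above, is sketched below. The choice of someValuesFrom and hasValue, and the nesting of the annotation restriction inside the input/output restriction, reflect one reading of the figure rather than a definitive formulation.

:NL-ProcessingResource a owl:Class ;
    rdfs:subClassOf
        [ a owl:Restriction ; owl:onProperty :hasInput  ; owl:allValuesFrom :Document ] ,
        [ a owl:Restriction ; owl:onProperty :hasOutput ; owl:allValuesFrom :Document ] .

# NL-Analyzer: inputs and outputs are textual documents.
:NL-Analyzer a owl:Class ;
    rdfs:subClassOf :NL-ProcessingResource ,
        [ a owl:Restriction ; owl:onProperty :hasInput ;
          owl:someValuesFrom [ a owl:Restriction ;
                               owl:onProperty :hasMIME-Type ; owl:someValuesFrom :Text ] ] ,
        [ a owl:Restriction ; owl:onProperty :hasOutput ;
          owl:someValuesFrom [ a owl:Restriction ;
                               owl:onProperty :hasMIME-Type ; owl:someValuesFrom :Text ] ] .

# NL-Tokenizer: the output document is marked for the annotation type word.
:NL-Tokenizer a owl:Class ;
    rdfs:subClassOf :NL-Analyzer ,
        [ a owl:Restriction ; owl:onProperty :hasOutput ;
          owl:someValuesFrom [ a owl:Restriction ;
                               owl:onProperty :hasAnnotation ; owl:hasValue :word ] ] .

# NL-Tagger: the input is marked for word, the output additionally for pos-tag.
:NL-Tagger a owl:Class ;
    rdfs:subClassOf :NL-Analyzer ,
        [ a owl:Restriction ; owl:onProperty :hasInput ;
          owl:someValuesFrom [ a owl:Restriction ;
                               owl:onProperty :hasAnnotation ; owl:hasValue :word ] ] ,
        [ a owl:Restriction ; owl:onProperty :hasOutput ;
          owl:someValuesFrom [ a owl:Restriction ;
                               owl:onProperty :hasAnnotation ; owl:hasValue :pos-tag ] ] .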

Figure 4: An NLP Web Service Client tool

5. Towards Implementation

To experiment with some of the ideas proposed in this paper, we have developed a prototype environment for the discovery, coordination and (eventually) invocation of NLP Web Services. The NLP Profile Hierarchy described in section 4 has been implemented as an OWL ontology, using the Protégé editor and OWL plugin.9 Unfortunately, the versions of the OWL-S ontologies available at the time of writing fail to validate in Protégé, and we therefore based our approach on the modified versions made available by Péter Mika at http://www.cs.vu.nl/~pmika/owl-s/. The version of the NLP Profile Hierarchy described here can be found at http://gridnlp.org/ontologies/2004/.

In order to be able to reason about NLP services, we have used a broker service, built on top of the RACER (Haarslev and Möller, 2001) Description Logic engine. This broker maintains a description, based on the NLP ontology, of the available language processing resources in the environment; when it receives service advertisements, described using OWL-S and this domain ontology, it classifies these and stores them as instances of the appropriate class in the hierarchy. On receiving an OWL-S query, it composes a class description from this and then returns, as potential solutions, (the URLs of) any service instances of classes equivalent to or subsumed by this description.

This broker is itself a web service, accessed through a WSDL end-point. On the client side, we have developed a prototype composition tool for composing sequences of services and querying the broker. The user is able to specify either the type of processing resource that is needed, or the constraints on the data inputs and outputs to some abstract service (or a combination of both), and the tool constructs the appropriate OWL-S, sends this to the broker (via WSDL and SOAP messaging which is hidden from the user) and then presents the alternative services — if any — to the user. Once a user selects one of these, the tool fetches the URL of the service to extract more detailed information about the service, and the user’s composition view is updated accordingly.

Figure 4 shows a screen-shot of the tool being used to define a workflow; data, represented by directed edges in this graph (with labels describing the class of the data), flows from ‘Sources’ to ‘Sinks’ via one or more services, represented as nodes, labelled with the service name and class. Hence, the screen-shot shows a two-service workflow, producing a DOCUMENT output. To illustrate the use of the tool, and its interaction with the broker, we will now step through the process by which this simple workflow was constructed. (To keep this example reasonably clear, the services are described in rather less detail than might be expected in reality.)

The user — an NLP researcher — here begins with the desire to produce a parsed document. Accordingly, she begins by defining an (anonymous) service of class Parser. The tool has access to the NLP ontology, and a pop-up window allows the class of the desired service to be specified (Figure 5).

Figure 5: Specifying the class of a desired service

Now, the user, via a drop-down menu, places a call to the broker (the address of which is hard-wired into this tool) for details of available services that meet this specification. This has the effect of creating an OWL-S document, the Profile of which is an instance of class Parser. Since this is a query, the broker uses this information to find and return the URIs of (the OWL-S descriptions of) all advertised instances of this class and of any of its child classes. In this case, there are two such instances, called MyLTChunkerParser and MyCassChunkerParser. These are presented to the user as alternatives; she arbitrarily chooses the latter (which is of class Cass-Chunker, an (indirect) subclass of Parser); its OWL-S description specifies the required inputs (namely an NL-Grammar and some (Document) thing which hasMIME-Type of class TextPlain) and the output (a Document), allowing these to be automatically added to the workflow (Figure 6).

Figure 6: Elaborating the workflow through interaction with the broker

Knowing that she needs to first tag the latter Document input, she now replaces its source node with an anonymous service of class POS-Tagger (Figure 7). The broker can now be queried for services of this class which produce an output which hasMIME-Type of class TextPlain. Among the matching services returned by the broker is MyTNTTagger, which is selected and added to give the workflow shown in Figure 4.

Figure 7: Extending the workflow

If satisfied with this workflow, the next steps would involve checking and elaborating the workflow further using the OWL-S Model of each individual service, and then invoking the workflow using the Grounding of each. However, the description of these elements of services will require further conceptualization of the domain, and as a result these steps are not yet implemented. The development of a similar tool to allow human service providers to construct and advertise the OWL-S descriptions of their services is also envisaged.

9 See http://protege.stanford.edu/ for details.
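To give a flavour of how such a request can be phrased, the query for the tagging step might, under the same illustrative assumptions as in section 4, be expressed as an anonymous class description along the following lines (the NL-Tagger class from Figure 3 is used here where the screen-shots show POS-Tagger). The broker classifies this description and returns the advertised instances of any equivalent or subsumed class.

# The desired service: a tagger whose output is a plain-text document.
[] a owl:Class ;
    owl:intersectionOf (
        :NL-Tagger
        [ a owl:Restriction ; owl:onProperty :hasOutput ;
          owl:someValuesFrom [ a owl:Restriction ;
                               owl:onProperty :hasMIME-Type ;
                               owl:someValuesFrom :TextPlain ] ]
    ) .
# A RACER-style classification of this anonymous class against the Profile Hierarchy
# would return, for example, the hypothetical :MyTnTTagger advertised earlier.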

6. Conclusion and Future Work

We have argued that a service-oriented view of NLP components offers a number of advantages. Most notably, it allows us to construct an ontology of component descriptions in the well-developed formalism of Description Logic. This in turn supports service discovery and service composition. We have only considered serial composition here, but there is no reason in principle not to allow more complex forms of interaction between components.

One of the most interesting recent frameworks for constructing workflows of NLP components is that proposed by Krieger (2003). His approach deserves more detailed consideration than we have space for here. However, an important difference between our approach and Krieger’s is that we do not require components to interact within a specific programming environment such as Java. By wrapping components as services, we can abstract away from issues of platform and implementation, and concentrate instead on the semantics of interoperability. In future work, we will spell out in detail how NLP services described at the OWL-S Profile level can be grounded in concrete resources.

References

Franz Baader, Ian Horrocks, and Ulrike Sattler. Description logics as ontology languages for the semantic web. In Dieter Hutter and Werner Stephan, editors, Festschrift in honor of Jörg Siekmann, Lecture Notes in Artificial Intelligence. Springer-Verlag, 2003.

Tom Bellwood, Luc Clément, and Klaus von Riegen. UDDI technical white paper. http://uddi.org/pubs/uddi-v3.00-published-20020719.htm, 2002.

T. Berners-Lee, J. Hendler, and O. Lassila. The semantic web. Scientific American, 284(5):34–43, 2001.

Steven Bird and Gary Simons. The OLAC metadata set and controlled vocabularies. In Proceedings of the ACL/EACL Workshop on Sharing Tools and Resources for Research and Education, Toulouse, 2001. Association for Computational Linguistics.

Steven Bird and Gary Simons. Extending Dublin Core metadata to support the description and discovery of language resources. Computers and the Humanities, 37:375–388, 2003.

Thorsten Brants. TnT – a statistical part-of-speech tagger. In Proceedings of the 6th Applied NLP Conference (ANLP-2000), 2000. http://acl.ldc.upenn.edu/A/A00/A00-1031.pdf.

E. Christensen, F. Curbera, G. Meredith, and S. Weerawarana. Web services description language. http://www.w3.org/TR/2001/NOTE-wsdl-20010315, 2001.

Hamish Cunningham, Kalina Bontcheva, Valentin Tablan, and Yorick Wilks. Software Infrastructure for Language Resources: a Taxonomy of Previous Work and a Requirements Analysis. In Proceedings of the 2nd International Conference on Language Resources and Evaluation (LREC-2), Athens, 2000.

James Curran and Stephen Clark. Language independent NER using a maximum entropy tagger. In Walter Daelemans and Miles Osborne, editors, Seventh Conference on Natural Language Learning (CoNLL-03), pages 164–167, Edmonton, Alberta, Canada, 2003. Association for Computational Linguistics. In association with HLT-NAACL 2003.

David de Roure and James A. Hendler. E-Science: The Grid and the Semantic Web. IEEE Intelligent Systems, 19(1):65–70, January/February 2004.

Ian Foster, Carl Kesselman, Jeffrey M. Nick, and Steven Tuecke. Grid services for distributed system integration. Computer, 35(6):37–46, 2002.

Ian Foster, Carl Kesselman, and Steven Tuecke. The Anatomy of the Grid: Enabling scalable virtual organizations. International Journal of Supercomputer Applications, 15(3), 2001.

M. Gudgin, M. Hadley, N. Mendelsohn, J-J. Moreau, and H. F. Nielsen. Simple object access protocol (SOAP). http://www.w3.org/TR/soap12-part1/, 2003.

V. Haarslev and R. Möller. RACER system description. In Proceedings of the First International Joint Conference on Automated Reasoning, pages 701–706. Springer-Verlag, London, UK, 2001.

Tony Hey and Anne E. Trefethen. The UK e-Science Core Programme and the Grid. Future Generation Computer Systems, 18(8):1017–1031, 2002. ISSN 0167-739X.

Hans-Ulrich Krieger. SDL—A description language for building NLP systems. In Hamish Cunningham and Jon Patrick, editors, HLT-NAACL 2003 Workshop: Software Engineering and Architecture of Language Technology Systems (SEALTS), pages 83–90, Edmonton, Alberta, Canada, May 2003. Association for Computational Linguistics.

OWL-S. OWL-S: Semantic markup for web services. http://www.daml.org/services/owl-s/1.0/, 2003.

Massimo Paolucci, Takahiro Kawamura, Terry R. Payne, and Katia Sycara. Semantic matching of web services capabilities. In Proceedings of the 1st International Semantic Web Conference (ISWC2002), pages 333–347, 2002.

Katia Sycara, Seth Widoff, Matthias Klusch, and Jianguo Lu. LARKS: Dynamic matchmaking among heterogeneous software agents in cyberspace. Autonomous Agents and Multi-Agent Systems, 5:173–203, 2002.

Amy Moormann Zaremski and Jeannette M. Wing. Specification matching of software components. ACM Transactions on Software Engineering and Methodology, 6(4):333–369, 1997.