A Multi-View Hyperlexicon Resource for Speech and Language System Development

Dafydd Gibbon, Thorsten Trippel
Fakultät für Linguistik und Literaturwissenschaft
Universität Bielefeld
Postfach 100 131, D-33501 Bielefeld, Germany
{gibbon, ttrippel}@spectrum.uni-bielefeld.de

Abstract

New generations of integrated multimodal speech and language systems with dictation, readback or talking face facilities require multiple sources of lexical information for development and evaluation. Recent developments in hyperlexicon development offer new perspectives for the development of such resources which are at the same time practically useful, computationally feasible, and theoretically well-founded. We describe the specification, three-level lexical document design principles, and implementation of a MARTIF document structure and several presentation structures for a terminological lexicon, including both on-demand access and full hypertext lexicon compilation. The underlying resource is a relational lexical database with SQL querying and access via a CGI internet interface. This resource is mapped on to the hypergraph structure which defines the macrostructure of the hyperlexicon.

1. Overview

New generations of integrated multimodal speech and language systems with dictation, readback or talking face facilities require multiple sources of lexical information for development and evaluation. Recent developments in hyperlexicon development offer new perspectives for the development of such resources which are at the same time practically useful, computationally feasible, and theoretically well-founded. We describe an interactive hyperlexicon for on-demand parametrised lexical access and full hypertext lexicon compilation with multiple views of lexical resources. The underlying resource is a relational lexical database with SQL querying and access via a CGI internet interface. This resource is mapped on to the hypergraph structure which defines the macrostructure of the hyperlexicon. Fuller discussion of lexical and terminological resources for spoken language systems can be found in (Gibbon et al., 1997a) and (Gibbon et al., 1999 forthcoming).

2. Specification

In development environments for the new generation of multimodal systems there is a growing need for processing many simultaneous and heterogeneous sources of lexical knowledge representing acoustic and visual signal information, references to marked-up corpora, and symbolic information at many levels of linguistic representation. Advanced techniques for processing data with multiple tiers of annotation, often in high-level markup languages such as XML, are currently under development at a number of centres.

Here we address the complementary problem of user access to complex information of this kind, and develop a hyperlexicon interface for such resources as a first step towards creating a standardisable lexicographic resource. We contend that the notion of a document is too complex to be dealt with using purely application-oriented decisions, and consequently we base our approach on a theoretical definition of documents as complex signs, using the term ‘sign’ in the same sense in which it is used for smaller linguistic units such as word and sentence in current linguistic theory. Advances in computational techniques have led to a rapprochement between lexicon theory in computational linguistics and large-scale corpus-based computational lexicography, minimising possible theory-application conflicts in the present case.

In the present approach, a complete document representation contains

1. a text syntax or document structure DS, specifying (a) the category or context of occurrence of the document and (b) the parts (immediate constituents or daughters) of a document;

2. a text interpretation or interpretation structure IS, specifying on semiotic grounds an interpretation of DS in terms of a Content Structure CS and a Presentation Structure PS (analogous to the semantic and phonetic interpretation pair of mainstream linguistic theories).

These distinctions are not made simply on theoretical grounds, however, but in order to provide a specification for practical lexicographic implementation. In the case of a lexicon,

1. DS specifies the structure of a lexical database, e.g. relational, object-oriented, SGML/XML, or as a lexical knowledge base in DATR or another LKRL (lexicon knowledge representation language), its context (e.g. in relation to a corpus, a grammar, other lexica), its parts, and their contexts.

2. IS specifies

(a) CS in terms of category definitions for the fields of the lexicon (e.g. ‘types of lexical information’),

(b) PS in terms of a superimposed set of relations over document constituents defining ‘surface realisations’ as a database view with appropriate front end, a hyperlexicon in hypertext format in an on-line help environment, on CD-ROM, on the web, in clipped version for WAP or a PDA, a printed book, ...

The existing resource on which the present work is built is a relational database, with lexical microstructure at DS level as a database record structure, and fields representing underlying types of lexical information which are relevant for (CS and PS) interpretation (Gibbon et al., 1999). We do not address the issue of semantic representations but take these to be implicit in the user’s understanding of his or her domain. The database concerned has been in regular use on the web with a JavaScript/HTML form interface for SQL queries for some two years.

The domain selected for this task is spoken language and multimodal system terminology, as required by an intelligent software agent for potential developers and users. The domain was selected because it also defines an important resource type for HLT (human language technologies) system development, the technical terminology of the field.

Although we adhere to ISO specifications for terminology, practical experience in HLT and recent developments in lexicology theory show that there are fundamentally untenable constraints involved in current ISO definitions, e.g.

1. The inherently procedural onomasiological view of terminological lexicography, i.e. a direct mapping from concepts to terms, is an obstacle in the way of flexible lexicographic views (and, though adhered to in theory for pure terminologies, generally flouted in practical termbanks thanks to the quasi-isomorphic relation postulated between concepts and wordforms).

2. The identification of keys with concepts is just as much a confusion of categories in onomasiological dictionaries as the identification of keys with spelling in conventional semasiological dictionaries. Since this assumption only holds within one homogeneous discipline and domain, in the multidisciplinary domain of HLT in which concepts evolve dynamically, approaches change, and conceptual structures are hybrid, it fails seriously.

We resolve these conflicts by applying the distinction between underlying DS, conceptual CS and surface PS representations (Gibbon et al., 2000, forthcoming), taking a declarative stance in regarding DS as neutral between CS and PS.

3. Hyperlexicon design and implementation

Following the requirements outlined above, we introduce more detail than previous approaches to hypertext document characterisation, and informally define (cf. also Figure 2):

Document Representation (DR):

Document Structure (DS):

CATEGORY: the location of a document or document part in a larger structure, e.g. in a library, on a web site, or in some other archive;

PARTS: a part of a document, such as a table of contents, chapter, index, lexical entry;

Interpretation Structure (IS):

CONTENT: a real-world semantic domain into which Document Structure is mapped in document planning and understanding (in the case of a terminological lexicon, for instance, the set of objects, relations, states and events in the technical domain concerned);

PRESENTATION: a visual and/or acoustic real-world domain into which Document Structure is mapped in document production and perception (in the case of a terminological lexicon, for instance, as a database front-end, a hypertextual web site, a conventional paper dictionary).

Figure 1: General document representation schema. Content Interpretation maps Document Structure (linear sequence, trees, cross-linked trees, linked multiple trees, rings, ...) onto Content Structure (a semantic domain with specific objects, relations, events, ...); Presentation Interpretation maps it onto Presentation Structure (modality domain(s), page/file segmentation, line segmentation, word segmentation, font, highlights, colour, pagination/linking, ...).

Constructor functions (i.e. a text grammar) are associated with DS, and a pair of interpretation functions, Content Interpretation and Presentation Interpretation, is associated with IS (cf. also Figure 2).

It is Presentation Interpretation, in the form of converters for print make-up and hypertextualisation, with which we are concerned here. Presentation Interpretation is in fact a family of functions, each of which may be quite complex and involve multimodal visual (text, graphic, animated) and acoustic mappings rather than straightforward textual mappings.

The main differences between this basis for design and much current wisdom and folklore about hypertexts are (a) that the present approach is deliberately based on linguistic theory; (b) it is fundamentally triadic (document structure, content structure, presentation structure), in contrast to the simpler ‘logical structure / presentation’ approach of many current textbooks; (c) applied to lexica, it

onomasiological lexicon, often a thesaurus or synonym lexicon,
