Towards Agent-Based Cross-Lingual Interoperability of Distributed Lexical Resources

Claudia Soria*, Maurizio Tesconi°, Andrea Marchetti°, Francesca Bertagna*, Monica Monachini*, Chu-Ren Huang§, Nicoletta Calzolari*

*CNR-ILC and °CNR-IIT, Via Moruzzi 1, 56024 Pisa, Italy
§Academia Sinica, Nankang, Taipei, Taiwan

Proceedings of the Workshop on Multilingual Language Resources and Interoperability, pages 17–24, Sydney, July 2006. © 2006 Association for Computational Linguistics

Abstract

In this paper we present an application fostering the integration and interoperability of computational lexicons, focusing on the particular case of mutual linking and cross-lingual enrichment of two wordnets, the ItalWordNet and Sinica BOW lexicons. This is intended as a case-study investigating the needs and requirements of semi-automatic integration and interoperability of lexical resources.

1 Introduction

In this paper we present an application fostering the integration and interoperability of computational lexicons, focusing on the particular case of mutual linking and cross-lingual enrichment of two wordnets. The development of this application is intended as a case-study and a test-bed for trying out the needs and requirements posed by the challenge of semi-automatic integration and enrichment of practical, large-scale multilingual lexicons for use in computer applications. While a number of lexicons already exist, few of them are practically useful, either because they are not sufficiently broad or because they do not cover the necessary level of detailed information. Moreover, multilingual language resources are not as widely available and are very costly to construct: the work process for manual development of new lexical resources or for tailoring existing ones is too expensive in terms of effort and time to be practically attractive.

The need for ever-growing lexical resources for effective multilingual content processing has urged the language resource community to call for a radical change in the perspective of language resource creation and maintenance and for the design of a "new generation" of LRs: from static, closed and locally developed resources to shared and distributed language services, based on open content interoperability standards. This has often been called a "change in paradigm" (in the sense of Kuhn, see Calzolari and Soria, 2005; Calzolari, 2006). Leaving aside the tantalizing task of building on-site resources, the new paradigm depicts a scenario where lexical resources are cooperatively built as the result of controlled cooperation of different agents, adopting the paradigm of accumulation of knowledge so successful in more mature disciplines, such as biology and physics (Calzolari, 2006).

According to this view (or, better, this vision), different lexical resources reside over distributed places and can not only be accessed but also choreographed by agents presiding over the actions that can be executed over them. This implies the ability to build on each other's achievements, to merge results, and to have them accessible to various systems and applications.

At the same time, there is another argument in favor of distributed lexical resources: language resources, lexicons included, are inherently distributed because of the diversity of languages distributed over the world. It is only natural for language resources to be developed and maintained in their native environment. Since language evolves and changes over time, it is not possible to describe the current state of a language away from where the language is spoken. Lastly, the vast range of diversity of languages also makes it impossible to have one single universal centralized resource, or even a centralized repository of resources.

Although the paradigm of distributed and interoperable lexical resources has largely been discussed and invoked, very little has been done in comparison for the development of new methods and techniques for its practical realization. Some initial steps have been made to design frameworks enabling inter-lexica access, search, integration and operability. An example is the Lexus tool (Kemps-Snijders et al., 2006), based on the Lexical Markup Framework (Romary et al., 2006), which goes in the direction of managing the exchange of data among large-scale lexical resources. A similar tool, but more tailored to the collaborative creation of lexicons for endangered languages, is SHAWEL (Gulrajani and Harrison, 2002). However, the general impression is that little has been done towards the development of new methods and techniques for attaining concrete interoperability among lexical resources. Admittedly, this is a long-term scenario requiring the contribution of many different actors and initiatives (among which we only mention standardisation, distribution and international cooperation).

Nevertheless, the intent of our project is to help fill this gap, by exploring in a controlled way the requirements and implications posed by new-generation multilingual lexical resources. The paper is organized as follows: section 2 describes the general architectural design of our project; section 3 describes the module taking care of cross-lingual integration of lexical resources, also presenting a case-study involving an Italian and a Chinese lexicon. Finally, section 4 presents our considerations and lessons learned on the basis of this exploratory testing.

2 An Architecture for Integrating Lexical Resources

LeXFlow (Soria et al., 2006) was developed with the long-term goal of lexical resource interoperability in mind. In a sense, LeXFlow is intended as a proof of concept attempting to make the vision of an infrastructure for access and sharing of linguistic resources more tangible.

LeXFlow is an adaptation to computational lexicons of XFlow, a cooperative web application for the management of document workflows (DW, Marchetti et al., 2005). A DW can be seen as a process of cooperative authoring where a document can be the goal of the process or just a side effect of the cooperation. Through a DW, a document life-cycle is tracked and supervised, continually providing control over the actions leading to document compilation. In this environment a document travels among agents who essentially carry out a receive-process-send pipeline.

There are two types of agents: external agents are human or software actors performing activities dependent on the particular Document Workflow Type (DWT); internal agents are software actors providing general-purpose activities useful for many DWTs and, for this reason, implemented directly into the system. Internal agents perform general functionalities such as creating/converting a document belonging to a particular DW, populating it with some initial data, duplicating a document to be sent to multiple agents, splitting a document and sending portions of information to different agents, merging duplicated documents coming from multiple agents, aggregating fragments, and finally terminating operations over the document. External agents basically execute some processing using the document content and possibly other data; for instance, accessing an external database or launching an application.

LeXFlow was born by tailoring XFlow to the management of lexical entries; in doing so, we have assumed that each lexical entry can be modelled as a document instance, whose behaviour can be formally specified by means of a lexical workflow type (LWT). An LWT describes the life-cycle of a lexical entry, the agents allowed to act over it, the actions to be performed by the agents, and the order in which the actions are to be executed. Embracing the view of cooperative workflows, agents can have different rights or views over the same entry: this nicely suits the needs of lexicographic work, where we can define different roles (such as encoder, annotator, validator) that can be played by either human or software agents. Other software modules can be inserted in the flow, such as an automatic acquirer of information from corpora or from the web. Moreover, since it derives from a tool designed for the cooperation of agents, LeXFlow makes it possible to manage workflows where the different agents reside over distributed places.

LeXFlow thus inherits from XFlow the general design and architecture, and can be considered as a specialized version of it through the design of specific Lexical Workflow Types and the plug-in of dedicated external software agents. In the next section we briefly illustrate a particular Lexical Workflow Type and the external software agents developed for the purpose of integrating different lexicons belonging to the same language. Since it allows the independent and coordinated sharing of actions over portions of lexicons, LeXFlow naturally lends itself as a tool for the management of distributed lexical resources. Due to its versatility, LeXFlow is both a general framework where ideas on automatic lexical resource integration can be tested and an infrastructure for proving new methods for cooperation among lexicon experts.
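
The notion of a lexical workflow type described in this section can be illustrated with a minimal Python sketch. It is not the LeXFlow or XFlow implementation: the names LexicalEntry, LexicalWorkflowType, encoder and validator, the field names, and the sample lemma are all hypothetical and chosen only for illustration. The sketch assumes that an agent is a function that receives an entry, processes it and sends it on, and that a workflow type is an ordered list of (role, action) steps.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class LexicalEntry:
    """A lexical entry modelled as a document instance (hypothetical fields)."""
    lemma: str
    data: Dict[str, str] = field(default_factory=dict)
    history: List[str] = field(default_factory=list)  # tracked life-cycle


# An agent receives an entry, processes it, and sends it on by returning it.
Agent = Callable[[LexicalEntry], LexicalEntry]


@dataclass
class LexicalWorkflowType:
    """Ordered steps; each step names the role allowed to act and its action."""
    steps: List[Tuple[str, str]]  # (role, action name), executed in this order

    def run(self, entry: LexicalEntry, agents: Dict[str, Agent]) -> LexicalEntry:
        for role, action in self.steps:
            if role not in agents:
                raise KeyError(f"no agent registered for role {role!r}")
            entry = agents[role](entry)              # receive -> process -> send
            entry.history.append(f"{role}:{action}")  # record the step taken
        return entry


def encoder(entry: LexicalEntry) -> LexicalEntry:
    # Illustrative value only; a real encoder could be a human or a program.
    entry.data["pos"] = "noun"
    return entry


def validator(entry: LexicalEntry) -> LexicalEntry:
    # Marks the entry as validated if the encoder has filled in a part of speech.
    entry.data["validated"] = str("pos" in entry.data)
    return entry


lwt = LexicalWorkflowType(steps=[("encoder", "encode"), ("validator", "validate")])
result = lwt.run(LexicalEntry("cane"), {"encoder": encoder, "validator": validator})
print(result.history)  # ['encoder:encode', 'validator:validate']
```

In this sketch the roles are plain dictionary keys, so the same workflow type can be run with human-driven or automatic agents simply by registering a different function under the same role name.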
2.1 Using LeXFlow for Lexicon Enrichment

Figure 1. Lexicons Augmentation Workflow Type.

In previous work (Soria et al., 2006), the LeXFlow framework has been tested for integration of lexicons with differently conceived lexical architectures and diverging formats. It was shown how interoperability is possible between two Italian
