DEB – A Dictionary Editor and Browser

Pavel Smrž and Martin Povolný
Faculty of Informatics, Masaryk University Brno
Botanická 68a, 602 00 Brno, Czech Republic
E-mail: {smrz,xpovolny}@fi.muni.cz

Abstract

XML and related W3C standards (XSLT, XML Schema, XPath, DOM, etc.) are often used for the representation and interchange of linguistic data today, but rarely as a direct means of implementing efficient data manipulation tools. The aim of the paper is to show that incorporating XML and the standards that surround it makes the implemented system generally applicable to various kinds of linguistic data and easy to extend. As a case study we present a designed and implemented system called DEB (Dictionary Editor and Browser) that is able to manage lexical data ranging from dictionaries to lexical databases, semantic networks, and complex ontologies. The design also facilitates the connection to other linguistic tools such as corpus managers or morphological analysers.

1 Introduction

Standard dictionaries usually present their entries in a more or less simple form: chosen information from one or a few related entries is displayed at a time. However, highly structured data such as semantic networks and specialised ontologies call for different possibilities of data views. Most terminological databases and the simplest and most common ontologies form tree hierarchical structures. Many systems are based on hierarchies with multi-parent relations and so form directed acyclic graph (DAG) structures. There are also cases where one wants to work with cycles in entry links and therefore with general graphs or, more precisely, multigraphs. All the structures mentioned above offer an appropriate representation of data, but they can cause serious efficiency problems when one attempts to visualise the organization of entries.

We present the designed and implemented system called DEB (Dictionary Editor and Browser) that is able to manage lexical data ranging from dictionaries to lexical databases, semantic networks, and complex ontologies. It makes it possible to store, index and efficiently retrieve linguistic data, define different views on these data, and provide some metalevel statistics. XML is not used only as the data format. It also plays a role in the customization of the user interface.

XML (Bray et al., 2000), XML Schema (Fallside, 2001), XSLT (Clark, 1999), and other standards play a crucial role in DEB. We take XML as a tool for the whole process of lexical database and/or ontology creation. The incorporation of the standard brings the advantage of general applicability of the implemented system to various kinds of data and also easy extensibility of such systems.

DEB implements special features for the efficient management of data organized in possibly complex networks. The primary stimulus of our work has been the aim to provide a universal system for efficient manipulation of WordNet-like databases. We have taken part in the EuroWordNet II (Vossen, 1998) and BalkaNet (http://www.ceid.upatras.gr/Balkanet/) EC projects, and so the need for such a tool has become obvious.

The universality of DEB can be demonstrated by the fact that it is used not only for the development of the Czech lexical database but also for the storage and retrieval of several dictionaries, including the largest Czech dictionaries with more than 200,000 entries, and for various other linguistic resources.

DEB is based on a specialised module for data storage, namely on the FINLIB text indexing library for the conversion of data into a binary format and on FININDEX for efficient retrieval. This is in contrast to other systems that do not implement their own storage mechanisms and rely instead on the support of various tools, e.g. relational databases. DEB is therefore independent of such external tools. Moreover, the structure of data is much more flexible: the addition of tags and attributes causes no problems. This feature also facilitates the integration of different lexical data types under one umbrella, such as direct connections from usage information into corpora. Lexical semantic databases can then be viewed as containing semistructured data.

There are many systems that are able to store dictionary-like data, some of them using XML as the core element. In recent years many dictionary publishing houses have also operated large systems, so-called lexicographic stations, with complex functionality for manipulating XML. However, these and similar tools are not able to efficiently retrieve the data needed for browsing or even editing ontologies or semantic networks. Therefore, they are not able to provide a universal environment for lexical database management.

On the other hand, DEB does not implement strong web support in its current version and so does not facilitate extensions of semantic networks by many users as, e.g., Papillon does. The possibility of a combination that would join the advantages of the two systems remains open.

The rest of the paper is organized as follows. The next section offers details about the system architecture, the basics of the query language and the client side. The third section presents the original extension of the XSLT processor by the mechanism of nested queries. Then we demonstrate the advantages of the system on the example of a multilingual WordNet management system. We conclude our paper with some future directions of our work.

2 System Architecture

The overall architecture of DEB is presented in Figure 1. The following subsections discuss particular components of the system.

Figure 1: Schema of DEB data interchange

2.1 DEB Server

The server side originated as a practical outcome of two Master's theses at the Faculty of Informatics, Masaryk University in Brno, Czech Republic (Karásek, 2000; Kořenek, 2002). The DEB server is responsible for the storage and retrieval of data. It consists of the following components:

1. a program that converts XML data into a binary representation using the text indexing library FINLIB (Rychlý, 2000);

2. a library and a program that implement a specialized language for querying multiple dictionaries and allow retrieval of the results by dictionary client programs.

DEB does not validate the XML data during the conversion. Thus, it is advisable to use a DTD or XML Schema based validator beforehand.

2.2 Querying the Server

In order to demonstrate the basic features of the query language, we present a simple example of retrieving data from the DEB server. Suppose we have two dictionaries:

1. Czech WordNet, identified as “wn_cz”, with entries of the following form (the ili identifier of the entry, its part of speech, the ili of its hypernym, and its literals in li tags):

     ili:       00004865-n
     pos:       n
     hypernym:  00001234-n
     li:        podvod
     li:        podraz
     li:        podfuk
     li:        bouda

2. a dictionary containing English glosses, identified as “gloss_en”, connected through the ili records, with entries of the following form:

     ili:    00004865-n
     gloss:  an act of deliberate betrayal

Then we can, for example, run the following queries:

  • wn_cz-* sub "pod" – search all entries in the “wn_cz” dictionary that contain the substring “pod” anywhere;

  • ili: (wn_cz-li exa "bouda") – in “wn_cz” find entries that contain the li tag with the text “bouda”, and then take the content of the ili tag of such an entry (projection);

  • gloss: (gloss_en-ili exa ili: (wn_cz-* exa "bouda")) – find entries in “wn_cz” that contain the text “bouda”, then make a projection on ili and search for the result in the ili tag in the “gloss_en” dictionary. Then make a projection on the gloss tag.

2.3 DEB Clients

DEB clients use XSLT for transforming dictionary entries into HTML, which is then presented to the user with the help of an HTML widget. The chosen architecture allows hypertext links in the dictionary or between dictionaries as well as easy linking of other content. Functions that could take advantage of this feature are, for example, linking dictionary entries with corpus usage examples, or incorporating audio data into a dictionary.

A universal client should be able to work not only with standard dictionaries but also with lexical databases that can contain a complex system of links between their lexical entries. Such a client can benefit from the client-side caching of parsed dictionary entries in DOM (Le Hégaret, 2002) and from the use of XPath (Clark and DeRose, 1999) for the extraction of important parts of the entries. All this functionality is provided by DEB.

The user can modify the dictionary data view by supplying their own XSLT sheet. This gives DEB clients an additional level of flexibility. Last but not least, most of the features described above can be included in a “thin” dictionary client application accessible by standard web browsers. This configuration is used in the CZEnglish project, an Internet-based English teaching project for students of the Philosophical Faculty, Masaryk University, Brno.

3 XSLT Processor Extension

XML data retrieved by a dictionary server can be simply sent to an XSLT processor, transformed to HTML (or another format) and presented to the user. This dataflow schema perfectly suits standard dictionaries, where one usually needs to work with a small number of entries only. However, this simple approach runs into obstacles when visualizing data from WordNet-like lexical databases, as most information is in the relations between entries represented by links.

Things get even worse if we want to allow users to customize their data views. Then, for example, one user might want to see the transitive closure of the hypernym-hyponym relation while another might want to see hypernyms of antonyms of the word being examined. The system would need to know in advance what data will be necessary for displaying an entry. It is obvious that the power of user-defined data views in XSLT has its dark side.

A solution that may naturally come to mind is to send the whole dictionary into the XSLT processor. Unfortunately, it would be too slow, and with the current XSLT processors we would run out of memory as soon as our dictionary reached tens of MBs. Moreover, we often need to display data from more than one dictionary at the same time, for example, data from WordNets in different languages.

Our solution to the problem is as follows. We have extended the XSLT processor to allow additional dictionary queries from within the XSLT processor. XSLT sheets can request data from the server based on the entry being processed. The modified dataflow schema is shown in Figure 2.

Figure 2: XSLT processor running queries nested in the XSLT script

The creation of such an extension is possible even without breaking the rules of the XSLT language. We can register a new schema handler (Kaiser and Cimprich, 2002) within the XSLT processor (by schema we understand the part of a URI before the double slash, such as http, ftp, file, etc.). The schema creates a virtual space of XML documents which are results of the DEB queries. From the XSLT processor's point of view, accessing the DEB dictionary data is the same as accessing any other external resource.

Let us demonstrate the described mechanism of the XSLT processor extension on the following example. Suppose we work with a WordNet database in XML where hypernyms of synsets are tagged as hypernym elements containing ILI, where ILI is a link to another synset, for example 00001234-n.

The XSLT code that replaces hypernym elements in WordNet entries with tuples of literals and their sense indexes is built around the following expression:

     document(concat('deb://wn_cz-ili exa "', text(), '"'))

This string tells the XSLT processor to access the document wn_cz-ili exa "ILI" in the schema deb, which means executing the query wn_cz-ili exa "ILI", where ILI is the content of the hypernym element of the entry currently being processed.

The same mechanism can be used for all relations in a selected WordNet dictionary or even between WordNets in different languages. The relations can be either explicitly marked up in the current entry, such as the hypernym relation, or given by the reversal of such a relation, such as the hyponym relation. For example, XSLT code of the same kind can query the DEB server for all hyponyms of the currently processed entry and add the corresponding literals to the output.

A more elaborate example is demonstrated on entries of the following kind (headword, definition, and related words):

     money
     a current medium of exchange in the form of coins and banknotes; coins and banknotes collectively
     pelf
     specie
     dough
     brass
     Brit

The complete XSLT script that transforms the above type of data is given in Figure 3. For each entry (line 5) we output an h1 tag containing the headword hw and the sense number sense (lines 5, 6). Then we look for all rel elements with the value of the type attribute equal to “formal var of”, “archaic var of” or “informal var of” (lines 7, 8) and take their link attribute to construct a query for the DEB server to get the basic synonym of the original word (lines 9 and 10). Then we process the results of the nested query (lines 11 to 30). First we make an h2 tag containing the headword (line 11). Next we run three nested DEB queries to get words that are in various relations with the current word, and then we output the resulting words in a different color for each relation.

4 DEB Support for Lexical Database Development

When editors work remotely, there is a need to merge data and to check consistency on the server side. Thus DEB clients record all modifications performed on the lexical database and generate change logs. These change logs, called “journals”, are XML files which describe how the local copy of the database has been modified, i.e. which records have been added, deleted or updated. The same set of operations should be performed on the data stored in the central server. DEB provides the functions of journal blending and data versioning in order to check and report the results of data consistency tests.

The structure of journal lines is as follows:

  • the snapshot of an entry before the operation
  • the new version of the entry
  • time of the request
  • an identifier of the user (editor)

The architecture of DEB should support different concepts of individual and team, local and remote development of lexical data.

1. In the simplest concept there is a single user/editor working with the server. Even in this concept the journal is useful. It allows us to go back in time and obtain the version of the database valid at a specified time. We can also use it to undo specified actions.

2. Simple team work is supported in the concept of on-line development by multiple editors. All users share a single DEB server and work on one version of the data, so all changes are instantly available to all team members. In this concept the journal can provide information about who made which changes and when. This information can (in some cases) be used to undo a specific user's actions.

Figure 3: sample XSLT script
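The nested-query mechanism of Section 3, on which the script in Figure 3 relies, can be illustrated outside XSLT. The following is only a minimal Python sketch under stated assumptions, not the DEB implementation: the in-memory DICTIONARIES table stands in for the DEB server, the function names run_query, resolve_deb_uri and hypernym_literals are hypothetical, the element names entry, pos and hypernym and the literal of the second entry are invented for illustration, and only the two simple query forms shown in Section 2.2 (exa and sub, without projection) are handled.

```python
import re
import xml.etree.ElementTree as ET

# Toy stand-in for the DEB server: dictionary name -> list of XML entries.
# <ili> and <li> are named in the paper; <entry>, <pos>, <hypernym> and the
# literal "example-hypernym" are assumptions made for this sketch.
DICTIONARIES = {
    "wn_cz": [
        ET.fromstring(
            "<entry><ili>00004865-n</ili><pos>n</pos>"
            "<hypernym>00001234-n</hypernym>"
            "<li>podvod</li><li>podraz</li><li>podfuk</li><li>bouda</li></entry>"
        ),
        ET.fromstring(
            "<entry><ili>00001234-n</ili><pos>n</pos>"
            "<li>example-hypernym</li></entry>"
        ),
    ]
}

# Matches queries such as: wn_cz-li exa "bouda"   or   wn_cz-* sub "pod"
QUERY_RE = re.compile(
    r'^(?P<dict>\w+)-(?P<tag>\*|\w+) (?P<op>exa|sub) "(?P<text>[^"]*)"$'
)


def run_query(query):
    """Return the entries of one dictionary that match a simple query."""
    match = QUERY_RE.match(query)
    if match is None:
        raise ValueError(f"unsupported query: {query!r}")
    tag, op, text = match["tag"], match["op"], match["text"]
    hits = []
    for entry in DICTIONARIES[match["dict"]]:
        elements = entry.iter() if tag == "*" else entry.iter(tag)
        values = [el.text or "" for el in elements]
        if op == "exa":                      # exact match of an element's text
            found = text in values
        else:                                # "sub": substring match
            found = any(text in value for value in values)
        if found:
            hits.append(entry)
    return hits


def resolve_deb_uri(uri):
    """Play the role of the schema handler: deb://QUERY -> matching entries."""
    prefix = "deb://"
    if not uri.startswith(prefix):
        raise ValueError("not a deb:// URI")
    return run_query(uri[len(prefix):])


def hypernym_literals(entry):
    """Follow each hypernym link with a nested query and collect its literals."""
    literals = []
    for hyp in entry.findall("hypernym"):
        uri = 'deb://wn_cz-ili exa "' + (hyp.text or "") + '"'
        for target in resolve_deb_uri(uri):
            literals.extend(li.text for li in target.findall("li"))
    return literals
```

Here hypernym_literals builds the same query string as the document(concat(...)) expression of Section 3, so the view code only ever sees the entries it asks for rather than the whole dictionary.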

3. The most powerful is the third model, which consists of:

  • a central DEB server (potentially with some users working as in the previous concept);

  • one or more local DEB servers which do not have full-time connectivity or enough bandwidth to the central DEB server, each with one or more users/editors.

Changes made by users on local DEB servers are from time to time transmitted to the central DEB server, and the changes on the central server are propagated to the local DEB servers. In this scenario journals play a major role in the synchronization of the local DEB servers with the central one.

Figure 4 represents a development cycle in the third model.

1. In the first step, at time T0, the whole dictionary/database is exported in XML format from the central DEB server. Then these data are imported into the local DEB server.

2. Step two represents users working with their DEB clients on the local DEB server.

3. In step three, at time T1, the journal of changes from the local DEB server is taken and interactively applied to the central DEB server using a DEB client. During this application various conflicts may arise (see Section 5). This process results in a new version of the database being present at the central server.

4. In step four, at time T2, after the merging is done, the central DEB server either prepares a journal of changes between the versions at times T0 and T2, which is then automatically applied on the local DEB server to the data from step 1, or, as in the first step, the full dictionary is exported in XML.

5. Then the process repeats from step two.

Figure 4: Synchronization of DEB servers

During step 4 many problems arise, some of which are also relevant in the “on-line development” model.

5 Applying the journal of changes to the data on the DEB server

Each entry in the journal/changelog must be applied semi-automatically. Support for this should be in the DEB client.

Conflicts during insert

  • key already existing
    – assign new key
    – reject
    – replace
  • XML Schema (or other schema language) constraint failure [entry structure], local evaluation
  • global data constraints, XSLT + nested DEB queries

Conflicts during delete

  • entry already deleted
    – skip journal entry
  • entry changed
    – delete anyway
    – skip ...
  • changed entry's primary key (currently treated as if deleted)
  • global data constraints, XSLT + nested DEB queries

Conflicts during update

  • entry deleted
    – create new entry
    – skip
  • entry changed
    – change anyway
    – skip
    – merge changes (using some of the xmldiff tools)
  • XML Schema (or other schema language) constraint failure [entry structure], local evaluation
  • global data constraints, XSLT + nested DEB queries

6 Consistency checks

The DEB server supports two types of consistency checks/constraints.

1. XML Schema and DTD grammar checks for individual dictionary entries.

2. Global constraints that have the form of XSLT scripts with nested DEB queries. These scripts contain checks for individual constraining conditions. These checks can apply to a whole dictionary or even to several dictionaries.

This concept is somewhat similar to Schematron in its rule-based manner.

7 Conclusions

We have introduced DEB, our XML-based client-server system that allows fast development of lexical databases and general ontologies. The main concept that we have demonstrated is the extension of an XSLT processor with the ability to get additional data by querying the dictionary server.

In our future work on the server side we would like to implement XPath evaluation inside the dictionary server to bring its interface closer to W3C standards. Another task is to make the server work with metadata. We will also consider the implementation of some data versioning features in the server.

We would also like to separate the core functionality from the existing DEB clients and make it a separate layer in order to get a three-level architecture, which would further ease the development of specialized dictionary applications for thin clients such as WWW browsers.

Current work on the standardization of language resources (Lenci and Ide, 2002) focuses on the creation of a unified format for the representation of multilingual lexical data. We strongly believe that DEB could be used as a standard tool for browsing and editing data in the proposed RDF-based formats.

References

Tim Bray, Jean Paoli, and C. M. Sperberg-McQueen. 2000. Extensible Markup Language (XML) 1.0 (Second Edition). http://www.w3.org/TR/REC-xml.

James Clark and Steve DeRose. 1999. XML Path Language (XPath) Version 1.0. http://www.w3.org/TR/xpath.

James Clark. 1999. XSL Transformations (XSLT) Version 1.0. http://www.w3.org/TR/xslt.

David C. Fallside. 2001. XML Schema Part 0: Primer. http://www.w3.org/TR/xmlschema-0/.

Philippe Le Hégaret. 2002. Document Object Model (DOM) Bindings. http://www.w3.org/DOM/Bindings.

Tom Kaiser and Petr Cimprich. 2002. Sablotron – XSLT, DOM and XPath processor. http://www.gingerall.com/charlie/ga/xml/p_sab.xml, January 11.

Luboš Karásek. 2000. System for creation and presentation of multilingual and one language dictionaries. Master's thesis, Faculty of Informatics, Masaryk University, Brno.

Josef Kořenek. 2002. System for maintainance and questioning of large XML dictionaries. Master's thesis, Faculty of Informatics, Masaryk University, Brno.

Alessandro Lenci and Nancy Ide. 2002. The MILE Lexical Model, linguistic and formal architecture. ISLE Workshop.

Pavel Rychlý. 2000. Corpus Managers and Their Effective Implementation. Ph.D. thesis, Faculty of Informatics, Masaryk University, Brno.

Piek Vossen, editor. 1998. EuroWordNet: A Multilingual Database with Lexical Semantic Networks. Kluwer Academic Publishers, Dordrecht.
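As an illustration of the journal application step described in Section 5, the following minimal Python sketch replays journal lines against a central database and reports the conflict cases named there. It is a sketch under stated assumptions rather than the DEB implementation: the JournalLine class, its key field, the apply_journal function and the policy of always skipping a conflicting line are hypothetical, entries are plain strings instead of XML, and schema and global constraint checks are left out.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class JournalLine:
    key: str                 # entry identifier (assumed field, not listed in the paper)
    before: Optional[str]    # snapshot before the operation; None means insert
    after: Optional[str]     # new version of the entry; None means delete
    time: float              # time of the request
    user: str                # identifier of the user (editor)


def apply_journal(
    db: Dict[str, str], journal: List[JournalLine]
) -> List[Tuple[JournalLine, str]]:
    """Apply journal lines in time order; skip and report conflicting lines."""
    conflicts = []
    for line in sorted(journal, key=lambda item: item.time):
        if line.before is None and line.after is None:
            raise ValueError("journal line describes no operation")
        current = db.get(line.key)
        if line.before is None:                      # insert
            if current is not None:
                conflicts.append((line, "key already existing"))
            else:
                db[line.key] = line.after
        elif line.after is None:                     # delete
            if current is None:
                conflicts.append((line, "entry already deleted"))
            elif current != line.before:
                conflicts.append((line, "entry changed"))
            else:
                del db[line.key]
        else:                                        # update
            if current is None:
                conflicts.append((line, "entry deleted"))
            elif current != line.before:
                conflicts.append((line, "entry changed"))
            else:
                db[line.key] = line.after
    return conflicts
```

A line is applied only when the central copy still equals the recorded snapshot; every other case is returned to the caller, which is where an interactive client could offer the choices listed in Section 5 (assign a new key, delete anyway, merge, and so on) instead of the fixed skip used here.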