DEB – a Dictionary Editor and Browser

Pavel Smrž and Martin Povolný
Faculty of Informatics, Masaryk University Brno
Botanická 68a, 602 00 Brno, Czech Republic
E-mail: {smrz,xpovolny}@fi.muni.cz

Abstract

XML and related W3C standards (XSLT, XML Schema, XPath, DOM, etc.) are now widely used for representing and interchanging linguistic data, but seldom as a direct basis for implementing efficient data manipulation software tools. The aim of the paper is to show that incorporating XML and the standards that surround it can bring general applicability of the implemented system to various kinds of linguistic data, as well as easy extensibility of such systems. As a case study we present a designed and implemented system called DEB (Dictionary Editor and Browser) that is able to manage lexical data ranging from dictionaries to lexical databases, semantic networks, and complex ontologies. The design also facilitates the connection to other linguistic tools such as corpus managers or morphological analysers.

1 Introduction

Standard dictionaries usually present their entries in a more or less simple form: selected information from one or a few related entries is displayed at a time. However, highly structured wordnet-like semantic networks and specialised ontologies call for different kinds of data views. Most terminological databases and the simplest and most common ontologies form tree hierarchical structures. Many systems are based on hierarchies with multi-parent relations and so form directed acyclic graph (DAG) structures. There are also cases where one wants to work with cycles in entry links and therefore with general graphs or, more precisely, with multigraphs. All the structures mentioned above offer an appropriate representation of data, but they can cause serious efficiency problems when one attempts to visualise the organization of entries.

We present the designed and implemented system called DEB (Dictionary Editor and Browser) that is able to manage lexical data ranging from dictionaries to lexical databases, semantic networks, and complex ontologies. It makes it possible to store, index and efficiently retrieve linguistic data, define different views on these data, and provide some metalevel statistics. XML is used not only as the data format; it also plays a role in customizing the user interface.

XML (Bray et al., 2000), XML Schema (Fallside, 2001), XSLT (Clark, 1999), and other standards play a crucial role in DEB. We take XML as a tool for the whole process of lexical database and/or ontology creation. Incorporating the standard can bring the advantage of general applicability of the implemented system to various kinds of data, as well as easy extensibility of such systems.

DEB implements special features for the efficient management of data organized in possibly complex networks. The primary stimulus of our work has been the aim to provide a universal system for efficient manipulation of WordNet-like databases. We have taken part in the EuroWordNet II (Vossen, 1998) and BalkaNet (http://www.ceid.upatras.gr/Balkanet/) EC projects, and so the need for such a tool has become obvious.

The universality of DEB can be demonstrated by the fact that it is used not only for the development of the Czech lexical database but also for the storage and retrieval of several dictionaries, including the largest Czech dictionaries with more than 200,000 entries, and for various other linguistic resources.

DEB is based on a specialised module for data storage, namely on the FINLIB text indexing library for the conversion of data into a binary format and on FININDEX for efficient retrieval. This is in contrast to other systems that do not implement their own storage mechanisms and rely instead on the support of various tools, e.g. relational databases. DEB is therefore independent of such external tools. Moreover, the structure of data is much more flexible: for example, adding tags and attributes causes no problems. This feature also facilitates the integration of different lexical data types under one umbrella, such as direct connections of usage information to corpora. Lexical semantic databases can then be viewed as containing semistructured data.

There are many systems that are able to store dictionary-like data, some of them using XML as the core element. In recent years, many dictionary publishing houses have also operated large systems with the complex functionality of so-called lexicographic stations that manipulate XML. However, these and similar tools are not able to efficiently retrieve the data needed for browsing, let alone editing, ontologies or semantic networks. Therefore, they are not able to provide a universal environment for lexical database management.

On the other hand, DEB does not implement strong web support in its current version and so does not facilitate extensions of semantic networks by many users as, e.g., Papillon does. The possibility of a combination that would join the advantages of the two systems remains open.

The rest of the paper is organized as follows. The next section offers details about the system architecture, the basics of the query language and the client side. The third section presents the original extension of the XSLT processor by the mechanism of nested queries. Then we demonstrate the advantages of the system on the example of a multilingual WordNet management system. We conclude our paper with some future directions of our work.

2 System Architecture

The overall architecture of DEB is presented in Figure 1. The following subsections discuss particular components of the system.

Figure 1: Schema of DEB data interchange

2.1 DEB Server

The server side originated as a practical outcome of two Master theses at the Faculty of Informatics, Masaryk University in Brno, Czech Republic (Karásek, 2000), (Korenek, 2002). The DEB server is responsible for the storage and retrieval of data. It consists of the following components:

1. a program that converts XML data into a binary representation using the text indexing library FINLIB (Rychlý, 2000);

2. a library and a program that implement a specialized language for querying multiple dictionaries and allow retrieval of the results by dictionary client programs.

DEB does not validate the XML data during the conversion. Thus, it is advisable to run a DTD or XML Schema based validator beforehand.

2.2 Querying the Server

To demonstrate the basic features of the query language, we present a simple example of retrieving data from the DEB server. Suppose we have two dictionaries:

1. Czech WordNet, identified as "wn_cz", with entries of the following form:

    <synonym>
      <ili>00004865-n</ili>
      <pos>n</pos>
      <hypernym>00001234-n</hypernym>
      <li sense="1">podvod</li>
      <li sense="1">podraz</li>
      <li sense="1">podfuk</li>
      <li sense="6">bouda</li>
    </synonym>

2. a dictionary containing English glosses, identified as "gloss_en", connected through the ili records, with entries of the following form:

    <en>
      <ili>00004865-n</ili>
      <gloss>an act of deliberate betrayal</gloss>
    </en>

Then we can, for example, run the following queries:

• wn_cz-* sub "pod" – search all entries in the "wn_cz" dictionary that contain the substring "pod" anywhere;

• gloss: (wn_cz-li exa "bouda") – in "wn_cz" find entries that contain the li tag with the text "bouda", and then take the content of the ili tag of such an entry (projection);

• gloss: (gloss_en-ili exa ili: (wn_cz-* exa "bouda")) – find entries in "wn_cz" that contain the word "bouda", then make a projection on ili and search for the result in the projection of the ili tag in the "gloss_en" dictionary. Then make a projection on the gloss tag.

2.3 DEB Clients

DEB clients use XSLT for transforming dictionary entries into HTML, which is then presented to the user with the help of an HTML widget. The chosen architecture allows hypertext links within a dictionary or between dictionaries as well as easy linking of other content. Functions that could take advantage of this feature are, for example, linking dictionary entries with corpus usage examples, or incorporating audio data into a dictionary.

A universal client should be able to work not only with standard dictionaries but also with lexical databases that can contain a complex system of links between their lexical entries. Such a client can benefit from the client-side caching of parsed dictionary entries in DOM (Hégaret, 2002) and from the use of XPath (Clark and DeRose, 1999) for extraction of important parts of the entries. All this functionality is provided by DEB.

Users can modify the dictionary data view by supplying their own XSLT sheet, which gives DEB clients an additional level of flexibility. Last but not least, most of the features described above can be included in a "thin" dictionary client application accessible from standard web browsers. This configuration is used in the CZEnglish project, an Internet-based English teaching project for students of the Philosophical Faculty, Masaryk University, Brno.

3 XSLT Processor Extension

XML data retrieved by a dictionary server can be simply sent to an XSLT processor, transformed to HTML (or another format) and presented to the user. This dataflow schema perfectly suits standard dictionaries, where one usually needs to work with only a small number of entries. However, this simple approach runs into obstacles when visualizing data from WordNet-like lexical databases as [...]

Figure 2: XSLT processor running queries nested in the XSLT script

[...] ftp, file, etc.). The schema creates a virtual space of XML documents which are the results of the DEB queries. From the XSLT processor's point of view, accessing the DEB dictionary data is the same as accessing any other external resource.
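The semantics of the example queries from Section 2.2 (substring match, exact match, projection, and a nested query that links two dictionaries through ili) can be mimicked with a few lines of Python. This is only an illustrative sketch built on the standard xml.etree.ElementTree module, not the DEB server or its query language: the dictionary names and sample entries are taken from the paper, while the helper names (entries, match, project, glosses_for), the wrapping <dict> element and the in-memory storage are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Sample entries from Section 2.2, wrapped in a hypothetical <dict> root
# so that each string is a well-formed XML document.
WN_CZ = """<dict>
  <synonym>
    <ili>00004865-n</ili>
    <pos>n</pos>
    <hypernym>00001234-n</hypernym>
    <li sense="1">podvod</li>
    <li sense="1">podraz</li>
    <li sense="1">podfuk</li>
    <li sense="6">bouda</li>
  </synonym>
</dict>"""

GLOSS_EN = """<dict>
  <en>
    <ili>00004865-n</ili>
    <gloss>an act of deliberate betrayal</gloss>
  </en>
</dict>"""


def entries(xml_text):
    """Parse a dictionary and return its top-level entries."""
    return list(ET.fromstring(xml_text))


def match(entry_list, tag, text, exact):
    """Keep entries having a descendant `tag` ('*' means any tag) whose
    text equals `text` (exact=True) or contains it (exact=False)."""
    hits = []
    for entry in entry_list:
        for node in entry.iter():
            if node is entry:
                continue  # only look at the entry's descendants
            if tag != "*" and node.tag != tag:
                continue
            value = (node.text or "").strip()
            if (value == text) if exact else (text in value):
                hits.append(entry)
                break
    return hits


def project(entry_list, tag):
    """Projection: collect the text of every `tag` element in the entries."""
    return [(node.text or "").strip()
            for entry in entry_list
            for node in entry.iter(tag)]


def glosses_for(word):
    """Nested query: find `word` in wn_cz, project on ili, look those ili
    values up in gloss_en, then project on gloss."""
    ilis = set(project(match(entries(WN_CZ), "*", word, exact=True), "ili"))
    linked = [e for e in entries(GLOSS_EN) if set(project([e], "ili")) & ilis]
    return project(linked, "gloss")


if __name__ == "__main__":
    # Analogue of: wn_cz-* sub "pod"
    print(len(match(entries(WN_CZ), "*", "pod", exact=False)))
    # Analogue of projecting on ili after: wn_cz-li exa "bouda"
    print(project(match(entries(WN_CZ), "li", "bouda", exact=True), "ili"))
    # Analogue of the nested gloss query
    print(glosses_for("bouda"))
```

The sketch scans every entry linearly, which is acceptable for two sample records; DEB, by contrast, relies on the FINLIB binary index precisely to avoid such scans on dictionaries with hundreds of thousands of entries.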
