Computer-aided lexicography
Creation, publication, and use of dictionaries: our experience at the Ixa NLP Group
Xabier Artola Zubillaga [email protected]
Faculty of Computer Science, Donostia
2010-11-26 IULA - InfoLex (UPF)

Using the dictionary is not always fun
• How many legs has a fly?
• This looks like a past participle of some verb!!!: shrunk
• There must be a word for...: to remove the hair from the skin of goats and sheep
• I need a verb now!: the fire ...s
• Is there any relationship between these words? Which one?: to burn, to blacken
• Which one is correct?: a quick shower or a fast shower

Using the dictionary is not always fun
• Translating "buy for" into Spanish:
  • The company bought stock for investment purposes
  • They kept buying for several months
  • They bought stock for €3,000,000
  • The defendant said he bought it for his brother
• look after: what does it mean?

Outline of the presentation
• Creation
  • Computer-aided lexicography: text corpora and language databases
  • Dictionary editing environments
  • Knowledge representation issues
• Publication
  • Print
  • Electronic (on-line or whatever)
  • From the editing application to the final product
• Use
  • Use cases, users, and dictionary software functionality
  • Do we get from electronic dictionaries what we could expect from them?

Creation: dictionary making
• Still in the 20th century: piles of index cards within shoeboxes
  • Word usage was compiled largely on paper slips or index cards, as the basis for the creation of dictionary entries
• Computer technology
  • text corpora (concordances, KWIC) to:
    • acquire real language use examples
    • discover and ascertain word senses, extract definitions
    • find and verify collocations
    • find neologisms
    • find out multiword lexical units
  • databases (wide sense) to store dictionary contents

Creation: dictionary making
Today's electronic dictionaries: where do we get dictionary content from?
• print dictionaries (legacy):
  • scanning
  • OCR
  • parsing of typographic features
• importing it from glossaries, entry lists, other electronic dictionaries...
• from scratch: editing (lexicographer)
  • word processors
  • databases
  • XML editors
  • publishers' custom applications
  • dictionary editing software: Tshwanelex...

Creation: dictionary making
• Building electronic dictionaries from legacy dictionaries: [scanning + OCR +] parsing of typographic features
  • Goal: to obtain a structural representation of the dictionary content (often in XML)
  → from text to a lexicographic database
• Two real cases (Ixa NLP Group):
  • eEH: from RTF to TEI SGML / XML (Arregi et al., 2003, 2007)
  • DBE: from RTF to TEI XML (Alegria et al., 2006a, 2006b)

eEH: from RTF to TEI SGML / XML
• Sarasola I. Euskal Hiztegia. Kutxa Fundazioa: Donostia, 1996.
  • Basque monolingual dictionary, reference for the standard Basque dictionary (Hiztegi Batua, Academy of the Basque Language)
  • 33,111 entries, 41,699 senses
  • Typical examples illustrating the use of words, drawn from corpora
• From RTF to TEI SGML (later to TEI XML):
  • DCG written in Prolog
  • TEI DTD: select / customize / enhance
  • Manual correction of the automatically obtained output

eEH: from RTF to TEI SGML / XML
• eEH: electronic Euskal Hiztegia (electronic dictionary prototype)
• Sophisticated indexing system (no databases are used)
  • definition and example texts fully lemmatized
• Users:
  • ordinary
  • advanced (philologists, lexicographers, translators...)
• Functionality
  • full hypertext utility (from definitions and examples to corresponding entries)
  • basic query
  • advanced query
    • especially designed query language
    • dictionary search as in a corpus
• Problem: lack of editing environment

eEH: electronic dictionary prototype
[Screenshots: query language, query interface]

DBE: from RTF to TEI XML
• Miyares Bermúdez E. (dir.) Diccionario Básico Escolar. Centro de Lingüística Aplicada, Santiago de Cuba. 2003.
  • School dictionary, monolingual
  • 7,473 entries, 14,013 word senses (1st ed.)
• From RTF to TEI P4 XML:
  • Word macros
  • Ferret (semi-automatic learning software)
  • TEI DTD: select / customize / enhance
  • Manual correction of the automatically obtained output
• leXkit: dictionary editing environment
• Three on-line versions, two CDs, three print editions

DBE: CD and on-line (3rd version)
[Screenshot with callouts: image, request, entry look-up, other functionality, letter index, indexes, look-up response, cross-references, orthographic help]

Dictionary editing environments
• Essential if databases or markup languages are chosen for dictionary knowledge representation
• Wish list
  • all kinds of editing facilities: XML-transparent, navigation facilities, cross-reference building, wizards...
  • integrity constraint checking and consistency
  • multimedia integration
  • import facilities
  • collaborative editing
    • Wiktionary
    • discussion forums
      • Ultralingua (online discussion forum)
      • Leo collaborative bilingual dictionaries

Dictionary editing environments
• Wish list (cont'd)
  • customized output: dictionary publication
    • different dictionary products:
      • unabridged dictionary
      • student's dictionary
      • ...
    • export formats:
      • electronic versions: XML, HTML, other formats...
      • print: PDF, desktop publishing software...
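
The "customized output" item on the wish list can be illustrated with a minimal Python sketch. It assumes a hypothetical, much simplified entry structure: the element and attribute names (`entry`, `form`, `sense`, `def`, `eg`, `level`) are invented for the illustration and are not the encoding used in eEH, DBE or leXkit. From one stored entry it derives an HTML rendering of the full entry and a plain-text rendering restricted to the senses marked as basic, standing in for an unabridged and a student's product.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical, simplified entry: all names below are assumptions made for
# this sketch, not the actual encoding of any dictionary mentioned here.
ENTRY_XML = """
<entry id="shrink">
  <form>shrink</form>
  <sense n="1" level="basic">
    <def>to become smaller</def>
    <eg>The sweater shrank in the wash.</eg>
  </sense>
  <sense n="2" level="advanced">
    <def>to move back or away, especially from fear</def>
  </sense>
</entry>
"""

def select_senses(entry, levels):
    """Keep only the senses whose 'level' attribute is in the requested set."""
    return [s for s in entry.findall("sense") if s.get("level") in levels]

def to_html(entry, levels):
    """Render the entry as an HTML fragment (electronic version)."""
    parts = [f"<h2>{escape(entry.findtext('form'))}</h2>", "<ol>"]
    for sense in select_senses(entry, levels):
        item = escape(sense.findtext("def"))
        eg = sense.findtext("eg")
        if eg:
            item += f" <i>{escape(eg)}</i>"
        parts.append(f"<li>{item}</li>")
    parts.append("</ol>")
    return "\n".join(parts)

def to_text(entry, levels):
    """Render the entry as plain text (input to a print-oriented pipeline)."""
    lines = [entry.findtext("form")]
    for sense in select_senses(entry, levels):
        line = f"  {sense.get('n')}. {sense.findtext('def')}"
        eg = sense.findtext("eg")
        if eg:
            line += f" ({eg})"
        lines.append(line)
    return "\n".join(lines)

entry = ET.fromstring(ENTRY_XML)
unabridged_html = to_html(entry, {"basic", "advanced"})  # all senses
student_text = to_text(entry, {"basic"})                 # reduced product
print(unabridged_html)
print(student_text)
```

The point of the sketch is only that the selection of content and the output format are decided at export time, while the stored entry stays the same.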

A real case: leXkit (Ixa NLP Group)
• leXkit: a dictionary content management system (Alegria et al., 2006c)
  • Dictionary editing and maintenance
  • XML-based: Berkeley DB XML, a native XML database, for storage
  • Client-server architecture: SOAP-based communication
  • Suitable for different kinds of dictionaries
• Main features:
  • Allows adding, deleting and modifying entries in a friendly fashion: XML details are transparent for the lexicographer
  • Provides the lexicographers with all the features of a full-fledged DBMS: full search capabilities, safe storage, concurrent access, etc.

leXkit
• Main features (cont'd):
  • Maintains entry states (version control and tracking)
  • Automatically generates the files and components needed by a running application such as the current electronic DBE
  • Tailored output is feasible: the data required for print editions, diversified electronic versions, etc. can be easily exported
• Architecture
  • Client
    • The component used by the lexicographer
    • Tool integration (corpora, other dictionaries...)
  • Server: database, concurrency, configuration files (dictionary schema definitions, wizards, etc.), import/export utilities, backups...

leXkit
[Screenshot with callouts. Editor: edition tree, predefined tasks, dictionary tabs, edition textbox. Index: dictionary entries, search results. Viewer: entry preview (WYSIWYG), integrated tools]

leXkit views and info tabs
[Screenshot with callouts. Viewer: XML tab, entry info, session control, ...]

leXkit: system architecture
[Figure: system architecture diagram]

leXkit
• Communication (client / server)
  • SOAP web services (RPC model + cookies)
• Intermediate declarative layer (XML)
  • Dictionary specifications
  • Operations (context-dependent tasks)
  • Wizards (common editing operations, predefined searches...)
• Other technical aspects
  • XSLT is widely used in the application
  • XSLTi: declarative language that adds interactivity to XSLT scripts
  • XML processing: Xerces + Xalan
  • Graphical interface: wxWidgets
  • HTML rendering: Mozilla (wxMozilla)

leXkit: wizards for the DBE
[Screenshot: wizards for the DBE]

leXkit: conclusions
• leXkit has been used at the CLA for editing the DBE's 2nd and 3rd editions: from 7,473 entries / 14,013 senses in the 1st edition to 10,557 entries / 19,374 senses in the 3rd one. leXkit was a vital tool in the qualitative leap of this work.
• Dictionary editing applications are a must, especially if dictionaries are stored in databases or XML-encoded.
• leXkit can be used by other lexicographical teams to create and update dictionaries. It is available as free software (open source) at http://sourceforge.net/projects/lexkit/.

Dictionary representation
• Representation is the key factor for dictionary functionality
  • we won't get what is not stored and adequately represented in the dictionary
  • the representation we choose conditions what we will later be able to get from the dictionary
• Physical level
  • text (no access facilities, deficient structuring)
    • plain or somehow structured (CSV, tabular...)
    • rich text: typography, word processors
    → even the entry concept is diluted sometimes
    → risk: vicious circle (to be avoided)

Dictionary representation
• Physical level (cont'd)
  • database: relational (structure, indexing, query and update facilities)
    • one database = one dictionary
      • is each pertinent information unit correctly represented in a field or column?
    • integrated dictionary system (publishers)
      • publisher's general dictionary database
  • marked text
    • HTML: mark-up language, presentation-oriented
    • SGML / XML: mark-up metalanguage, content-oriented

Dictionary representation
→ content-oriented marked text constitutes a better data model for the representation of dictionary content and structure than the relational model
lexical information
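
One way to picture the claim about content-oriented markup is the following Python sketch. It assumes a hypothetical entry whose element names (`entry`, `form`, `sense`, `def`, `eg`, `xr`) are invented for the illustration. Senses nest to an arbitrary depth inside the entry itself, and the nesting can be walked or searched directly, whereas a relational design would typically need extra self-referencing tables and joins to express the same hierarchy.

```python
import xml.etree.ElementTree as ET

# Hypothetical entry with a nested subsense; every name is an assumption
# made for this sketch, not the encoding of any dictionary cited here.
ENTRY_XML = """
<entry>
  <form>burn</form>
  <sense n="1">
    <def>to be on fire</def>
    <eg>The fire burns.</eg>
    <sense n="1a">
      <def>to give off light or heat</def>
      <eg>A lamp burned in the window.</eg>
    </sense>
  </sense>
  <sense n="2">
    <def>to damage by fire or heat</def>
    <xr target="blacken"/>
  </sense>
</entry>
"""

entry = ET.fromstring(ENTRY_XML)

def walk_senses(parent, depth=0):
    """Yield (depth, sense number, definition) for senses at any nesting depth."""
    for sense in parent.findall("sense"):
        yield depth, sense.get("n"), sense.findtext("def")
        yield from walk_senses(sense, depth + 1)

# The sense hierarchy is recovered from the markup itself.
outline = list(walk_senses(entry))

# Content-oriented tags also allow searching by kind of information,
# regardless of where in the entry it occurs.
examples = [eg.text for eg in entry.iter("eg")]
cross_refs = [xr.get("target") for xr in entry.iter("xr")]

for depth, number, definition in outline:
    print("  " * depth + f"{number}. {definition}")
```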