A framework for processing and presenting parallel text corpora

Dissertation submitted to the Faculty of Information and Cognitive Sciences of the Eberhard Karls Universität Tübingen in fulfillment of the requirements for the degree of Doctor of Natural Sciences (Dr. rer. nat.)

submitted by Dipl.-Inform. Volker Simonis from Mediasch

Tübingen 2004

Date of oral examination: 21 July 2004
Dean: Prof. Dr. Ulrich Güntzer
First reviewer: Prof. Dr. Rüdiger Loos
Second reviewer: Prof. Dr. Wolfram Luther (Univ. Duisburg-Essen)

To my parents

Abstract

This thesis describes an extensible framework for the processing and presentation of multi-modal, parallel text corpora. It can be used to load digital documents in many formats, for example plain text, XML or bit-mapped graphics, to structure these documents with a uniform markup and to link them together. The structuring or tagging can be done with respect to formal, linguistic, semantic, historical and many other aspects. Different, parallel taggings are possible for a document, and the documents marked up this way can be linked together with respect to any of these structures. Depending on the nature of the tagging and the scope of the linking, they can be performed automatically, semi-automatically or manually.

As a foundation of this work, XTE, a simple but powerful XML standoff annotation scheme, has been developed and realized as a DTD and as an XML Schema. XTE is especially well suited for the encoding of multiple, overlapping hierarchies in multi-modal documents and for the cross-linking of the elements of these encodings across several documents.

Together with XTE, elaborate editor and browser applications have been developed which allow the comfortable creation and presentation of XTE-encoded documents. These applications have been realized as a configurable and extensible framework that makes it easy for others to extend, customize and adapt the system to their special needs. The combination of a classical textual synopsis with the supplementary options of dictionaries, encyclopedias, multi-media extensions and powerful tools opens a wide area of applicability for the system, ranging from text analysis and language learning to the creation of critical editions and electronic publishing.

As a side effect of the main topic, different tools for program and software documentation have been developed and a new and innovative, multilingual user interface has been created. The documentation tools have been used to document the components of the framework, while the new user interface has been built into the created applications.

Zusammenfassung

This thesis presents an extensible system for the processing and presentation of multi-modal, parallel text corpora. It can be used to process digital documents in many formats, for example plain text files, XML files or graphics, where processing in this context primarily means structuring and linking. This structuring, which follows a newly developed encoding scheme, can be based on formal, linguistic, semantic, historical or many other criteria. Documents can be annotated simultaneously with arbitrarily many parallel and possibly overlapping structures and can also be linked to each other with respect to any of these structures. Depending on their kind, the different structures can be generated either automatically or semi-automatically, or they can be specified manually by the user.

The foundation of the presented system is XTE, a simple yet powerful standoff encoding scheme that has been realized both as an XML DTD and as an XML Schema. XTE is especially well suited for encoding many mutually overlapping hierarchies in multi-modal documents and for linking these structures across several documents.

Together with XTE, two mature applications have been created for viewing and editing XTE-encoded documents and for working comfortably with the resulting documents. These applications were designed as a customizable and extensible system that is meant to be adaptable to other fields of application and to new user requirements as easily as possible. The combination of a classical synopsis with the extension facilities the system offers, such as dictionaries, encyclopedias and multi-media elements, makes it a tool that can be employed in many areas, ranging from text analysis and language learning over the creation of text-critical editions to electronic publishing.

Besides this system, further results of this work are various tools for software documentation, which have also been used to document the system itself, and a novel, multilingual, graphical user interface, which is employed, among other places, in the system described here.

Contents

1 Introduction
  1.1 Text encoding
    1.1.1 History of text encoding
    1.1.2 Electronic character encodings
  1.2 Text markup
    1.2.1 Text processing
    1.2.2 General Markup Languages
    1.2.3 Specialized Markup Languages for Text
  1.3 Scope and contribution
  1.4 Structure of this work

2 A new markup scheme for text
  2.1 A short introduction to XML
    2.1.1 XML namespaces
    2.1.2 XML schema languages
    2.1.3 XPath, XPointer and XLink
    2.1.4 XSL - The Extensible Stylesheet Language
    2.1.5 The future of XML
  2.2 The problem of overlapping hierarchies
  2.3 Workarounds for the problem of overlapping hierarchies
    2.3.1 The SGML CONCUR feature
    2.3.2 Milestone elements
    2.3.3 Fragmentation
    2.3.4 Virtual joins
    2.3.5 Multiple encodings
    2.3.6 Bottom up virtual hierarchies
    2.3.7 Just in time trees
    2.3.8 Standoff markup
  2.4 XTE - A new standoff markup scheme
    2.4.1 The XTE DTD
    2.4.2 XTE - Expressed as an XML Schema
    2.4.3 Using the XTE DTD together with the XTE XML Schema
    2.4.4 Encoding facsimile texts with XTE

3 The software architecture of LanguageExplorer and LanguageAnalyzer
  3.1 The Java programming language
    3.1.1 The Java APIs
  3.2 The LanguageExplorer text classes
    3.2.1 The document class
    3.2.2 The editor kit


    3.2.3 The view classes
  3.3 The LanguageExplorer file formats
    3.3.1 The LanguageExplorer book format
    3.3.2 Encryption of LanguageExplorer books
    3.3.3 LanguageExplorer configuration files
  3.4 The design of LanguageAnalyzer
  3.5 The design of LanguageExplorer
  3.6 The plugin concept
    3.6.1 Handling new XTE elements
    3.6.2 Support for new media types
    3.6.3 Adding new tools

4 Implementation techniques and libraries
  4.1 Program documentation with ProgDOC
    4.1.1 Introduction
    4.1.2 Some words on Literate Programming
    4.1.3 Software documentation in the age of IDEs
    4.1.4 Software documentation and XML
    4.1.5 Overview of the ProgDOC system
    4.1.6 The \sourceinput command
    4.1.7 Using ProgDOC in two-column mode
    4.1.8 Using the alternative highlighter pdlsthighlight
    4.1.9 The \sourcebegin and \sourceend commands
    4.1.10 The \sourceinputbase command
    4.1.11 The source file format
    4.1.12 LaTeX customization of ProgDOC
    4.1.13 An example Makefile
  4.2 Program documentation with XDoc
    4.2.1 Introduction
    4.2.2 The new XDoc approach
    4.2.3 A prototype implementation
    4.2.4 Conclusion
  4.3 A Locale-Sensitive User Interface
    4.3.1 Introduction
    4.3.2 The Java Swing architecture
    4.3.3 The solution - idea and implementation
    4.3.4 Conclusion
  4.4 Scrolling on demand
    4.4.1 Introduction
    4.4.2 Scrollable menus and toolbars!
    4.4.3 The implementation
    4.4.4 Using the ScrollableBar class
    4.4.5 Conclusion

5 LanguageExplorer
  5.1 Introduction
  5.2 Overview
  5.3 Installation
    5.3.1 Installation under Windows
    5.3.2 Installation under Linux
    5.3.3 Installation under Mac OS X
  5.4 Handling


    5.4.1 Loading books
    5.4.2 Navigation
    5.4.3 The KWIC-Index
    5.4.4 The dictionary
    5.4.5 Searching
    5.4.6 Regular expressions
    5.4.7 Audio output
    5.4.8 Configuration
    5.4.9 System dependencies

6 LanguageAnalyzer
  6.1 Introduction
  6.2 Overview
  6.3 Handling
    6.3.1 Loading content
    6.3.2 Saving XTE files
    6.3.3 Working with multiple documents and encodings
    6.3.4 Tools
    6.3.5 Plugins
  6.4 Command line tools
    6.4.1 Merging XTE files
    6.4.2 Encrypting XTE files

7 Summary and outlook
  7.1 Outlook
  7.2 Related work
    7.2.1 Synopses and e-books
    7.2.2 Natural language processing systems
    7.2.3 Related standards

A Constants

Bibliography


Chapter 1

Introduction

Although we live in the electronic age and electronic media are a natural part of our everyday life, written text is still the main means of storing and communicating information. It was the development of scripts that made it possible for the first time to make ideas, thought and expressed in natural language, persistent across time and space. It was the different writing systems that made it possible to communicate knowledge not only from person to person, but also from one generation to all subsequent generations, and thus directly contributed to the development of human culture.

From the very beginning, the results of writing became manifest in many different ways. It may have started with scratching into clay, carving into stone and wood or painting on walls. It developed further from writing on papyrus to printing on paper and finally to typing keys on a keyboard and storing the results on a magnetic or optical medium whose content can be displayed on a screen or printed on a printing device. This evolution finally led to a tremendous number of texts being available today in many different formats, languages and scripts.

Now, with the possibilities offered by computer and information technology, we have the unique opportunity to collect, edit and structure all these texts, regardless of the format, language or script in which they exist, such that they are available to everybody who has access to these new technologies.

This work will present an extensible framework that allows the processing, structuring, analyzing and finally the presentation of texts from arbitrary sources. Special emphasis will be placed on the comparative processing of related texts such as translations or synopses, on linking these texts together and finally on integrating other tools, for example dictionaries, with the texts in order to increase the comprehension of the original versions. As the word “text” itself derives from the Latin texere, to weave, it seems natural to finally represent related texts in a form that makes it possible to “weave them together” in a sophisticated way.

1.1 Text encoding

The expression “text encoding” is sometimes misleading and overloaded with several different meanings in the area of text processing. Throughout this chapter we will use it to denote the way in which single characters or ideographs are represented electronically on a computer system. It is not to be confused with markup schemes like, for example, that of the Text Encoding Initiative (see section 1.2.3), which are often also called “encodings”.


1.1.1 History of text encoding

As we know today, there is no canonical way to convert spoken or thought language to text. Different cultures have developed different writing systems to record language. The oldest scripts we know of consist of hieroglyphs, which may be thought of as iconic representations of the concepts they intend to describe. Later, some cultures developed ideographic scripts that also use graphic symbols to represent objects or ideas, but in a more abstract way than the hieroglyphs did. Other cultures developed alphabetic scripts, where each symbol represents a phoneme of the language. A sequence of these symbols, which together mimic the pronunciation of an object or an idea in the corresponding spoken language, must be used to represent it in textual form.

Figure 1.1: A picture of the famous “Rosetta Stone” [Park]. Dating from about 200 BC, it is not only an example of how characters have been engraved into stone, but also the first evidence of a synopsis. It contains the same text in two different languages written in three different scripts. The upper and the middle parts contain Egyptian versions written in a hieroglyphic and a demotic script respectively, while the lower part contains the Greek version of the text.

Common to all these different approaches, however, was the fact that the resulting text consisted of a sequence of graphic symbols from a fixed set of available symbols. We call

each of these symbols a character¹.

In the early days of writing, creating textual representations of language was always a manual task. And in order to make their texts understandable to others, writers had to adhere to certain “standards” concerning the shapes of the different characters. This however did not prevent them from turning writing into a highly creative and artistic process, as can be seen, for example, in the calligraphic masterpieces of medieval writers.

The situation changed drastically after Johannes Gutenberg invented the printing press in the middle of the 15th century. Single letters were efficiently molded and cast from metal, resulting in movable metal types with their lead base width varying according to the letter's size. Every page of a book could now be assembled easily from these types. And because the shape of a letter was exactly the same at every position on a page and everywhere in a book, this led to a perfectly regular appearance.

Figure 1.2: A page of the famous Gutenberg Bible. The Bible was printed in two columns, each containing 42 lines of black letters. The coloring, as well as the painting of the initials, for which space had already been reserved on the page, was done manually later on. (Picture taken from [GJ].)

In the middle of the 19th century, the typewriter was invented. This was a mechanical machine which had a built-in metal type for every letter of the Latin alphabet. Each type was coupled with a key on the keyboard of the typewriter such that the user could print a letter by pressing the corresponding key. Because of physical constraints, a typewriter could not contain more than the amount of letters and numbers available in the Latin script. And in order to simplify the machinery, all the letter types had the same width. So in fact the invention of the typewriter was at the same time the invention of mono-spaced fonts.

Because of the limited number and fixed size of their types, documents created with a typewriter did not look very impressive from a typographic point of view. Gutenberg, for example, not only used proportional types for the first printed Bible [GJ], but also a type set

¹ Following [MW], the word “character” is derived from the Greek charaktēr, which itself derives from charassein and means as much as “to scratch” or “to engrave”, which immediately leads us back to the very first textual evidence we are aware of today, scratched into clay or engraved into stone.


Figure 1.3: Even though the first typewriters were restricted to Latin characters, they were quickly adapted to other scripts like Greek or Cyrillic. As these pictures show, even typewriters for Chinese and Japanese were developed. They could handle between 2,000 and 4,000 ideographs, however only at a moderate writing speed.

of about 290 different letters, which contained several slightly varying sizes for each letter and a set of ligatures for common two- and three-letter combinations.

1.1.2 Electronic character encodings

After IBM introduced its electric typewriters in the 1960s, the 1970s brought us the first microcomputers. And one thing for which micro or personal computers have been used ever since (besides playing and calculating) is writing. Here, for the first time, characters had to be encoded in binary form. And because memory was very precious at that time, programmers were very conservative when they had to choose a coding scheme.

In 1960, R. W. Bemer described in a survey [BE60] the large number of different character encodings available at that time. This was the starting point for the creation of the ASCII (American Standard Code for Information Interchange) standard [BSW, BE63]. ASCII was still a 7-bit encoding, but with the help of escape sequences it was possible to express characters that did not fit into the set of the original 128 characters.

Later on, in the 1980s, the European Computer Manufacturers Association (ECMA) [ECMA] created an 8-bit encoding family which contained the ASCII characters as a subset and used the additional 128 code points to encode other alphabetic scripts such as Cyrillic, Arabic, Greek and Hebrew and the various special characters needed for the European languages written with Latin characters. These encodings have been endorsed by the International Organization for Standardization (ISO) [ISO] as the ISO 8859 family of encodings.

But these encodings still had some drawbacks: they combined the standard Latin characters available in ASCII with just one single national character set. So, for example, ISO 8859-5 could be used to write texts that used Latin and Cyrillic characters, and ISO 8859-1 could be used to write texts that contained German umlauts and French accented characters. However, it was still not possible to use one of these standard encodings for writing texts which contained Cyrillic as well as special German and French characters. Another problem was that ideographic scripts with their thousands of symbols could not fit naturally within an 8-bit encoding. Therefore special escape sequences, which were unwieldy because they required complicated parsing, had to be used in these cases. Additionally, the character represented by an arbitrary code point became ambiguous because it depended on the characters and escape sequences that had been read just before it.
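This ambiguity is easy to demonstrate in Java, the implementation language of the framework presented later in this work. The following lines are a minimal sketch (the charset names are the standard Java aliases for the ISO 8859 encodings): one and the same byte value decodes to completely different characters depending on which member of the encoding family is assumed.

import java.io.UnsupportedEncodingException;

// One byte value, two different characters: the meaning of the byte
// 0xE4 depends entirely on the assumed member of the ISO 8859 family.
public class EncodingAmbiguity {
    public static void main(String[] args) throws UnsupportedEncodingException {
        byte[] data = { (byte) 0xE4 };
        System.out.println(new String(data, "ISO-8859-1")); // the German umlaut 'ä'
        System.out.println(new String(data, "ISO-8859-5")); // the Cyrillic letter 'ф'
    }
}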


The Unicode Standard

All these problems led to the foundation of the Unicode consortium [UNI] in 1991 with the goal of creating a universal, efficient, uniform and unambiguous character encoding not only for all the written languages used in the world today, but also for punctuation marks, mathematical and technical symbols and eventually for historic scripts. The Unicode consortium synchronizes its work with ISO such that the Unicode 3.0 standard [U30] is equivalent to the ISO 10646 standard.

Unfortunately, Unicode is a 16-bit encoding, which by default can handle about 65,000 characters. This is still a tribute to memory requirements and was dictated by the widespread use of 16-bit computer architectures at the end of the 1990s. With the help of so-called “surrogate pairs” however, it is possible to encode about one million different characters.

Although the Unicode standard is still under active development and more and more scripts get added as time goes by (Unicode 3.0 defines 49,194 different characters, symbols and ideographs), Unicode has also reserved certain code areas for private use if there is a special need for characters not currently encoded by the standard.

But Unicode does a lot more than just define a code point for a given character. Because many scripts have special requirements, for example a different writing direction or special obligatory ligatures, and because in some scripts new characters can be built by combining two or more existing characters, the Unicode standard also provides support for normalization, decomposition, bidirectional behavior and efficient searching and sorting.

Meanwhile, the Unicode standard has gained broad acceptance in virtually all areas of the computer industry. All modern operating systems as well as most modern programming languages and computer programs support Unicode today. Many newer standards, for example XML and XHTML [XML, XHTML], depend on Unicode.
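The mechanics of the surrogate pairs mentioned above can be illustrated with a few lines of Java, whose strings are sequences of 16-bit UTF-16 code units. The following minimal sketch uses U+1D11E (the musical G clef symbol) merely as an arbitrary example of a character outside the 16-bit range:

// A character outside the Basic Multilingual Plane occupies two
// 16-bit code units (a surrogate pair) in a Java string.
public class SurrogatePair {
    public static void main(String[] args) {
        String clef = "\uD834\uDD1E";      // U+1D11E, MUSICAL SYMBOL G CLEF
        System.out.println(clef.length()); // prints 2: two code units, one character

        char hi = clef.charAt(0); // high surrogate, range 0xD800-0xDBFF
        char lo = clef.charAt(1); // low surrogate, range 0xDC00-0xDFFF
        // Recombine the surrogate pair into the original code point:
        int codePoint = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
        System.out.println(Integer.toHexString(codePoint)); // prints 1d11e
    }
}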

1.2 Text markup

From the beginning, computers have been used for writing and text processing. Ordinary typewriters were used as printing devices to output the texts. Because of the deficiencies mentioned at the end of section 1.1.1, the visual appearance of these works was not very appealing. For this reason, text processing with computers was at first used for administrative purposes only.

After the appearance of the first dot-matrix printers (Epson claims to have introduced the first personal computer printer, the MX-80, in 1978) the situation changed. Now it became feasible to print not only different fonts in different sizes and styles, but also graphics and pictures. After Xerox had invented the laser printer in the 1970s, HP shipped the first laser printers for the mass market in 1984 and the first ink-jet printers in 1988, and the output created with such devices became comparable to that produced by traditional print shops. The time was ripe for the so-called “desktop publishing” era.

1.2.1 Text processing

The only remaining problem was the fact that the screen devices and graphics hardware could not keep up with the development of printing devices. For a long time they only supported the display of text in a fixed-size font, usually based on ASCII or an 8-bit encoding. So the first text processing programs defined special commands or macros that could be inserted into the running text. The only purpose of these commands was to change the appearance of the text, such as its size or style. Some programs, for example WordStar, one of the first word processing programs produced for microcomputers, released back in 1979, could use the bare printer escape sequences for this purpose.


One of the oldest text formatting programs is nroff/troff by J. F. Ossanna [Os76] from AT&T. Its origins can be traced back to a formatting program called runoff, written by J. E. Saltzer, which ran on MIT's CTSS operating system in the mid-sixties. Later on, troff was rewritten in C by Brian Kernighan [Ke78] and became a de facto standard on Unix machines [EP87]. It provided macros, arithmetic variables, operations, and conditional testing for complicated formatting tasks. Many macro packages have been written for the different *roff formatting programs, one of the most famous being the man macro package for the formatting of Unix manual pages.

Donald Knuth, one of the pioneers of computer science, invented his own typesetting program called TeX [Kn91] back in 1978. In fact, TeX is a domain-specific programming language dedicated to typesetting. It supports macros defined by the user; these macros take text as arguments and format it in a special way. Later on, Leslie Lamport extended TeX by a standard macro set called LaTeX [La86]. This was a fundamental change from a purely visual or procedural markup towards a kind of structural or descriptive markup². So instead of writing {\bf Section title} in order to set a section header in bold face, the user could now write \section{Section title} to declare a sentence as a section header. By including a certain “style file” one could influence how a section header would be formatted. In fact, style files contained only implementations of the structural markup macros. However, because structure was separated from appearance, it became much easier to change the visual appearance of a whole document at once.

Another very old text processing system that is still in use and constantly revised today is TUSTEP [TU01, Ba95], the “Tübingen System of Text-processing Programs”. In contrast to the two above-mentioned programs, TUSTEP does a lot more than typesetting. It is an extensible system of different tools that can be used to process text in various ways, such as creating indices, annotations or apparatuses. Furthermore, TUSTEP supports a lot of different, even ancient, languages. It is primarily used to create critical editions, encyclopedias and reference books.

The development of TUSTEP started back in 1966, while the name TUSTEP was established in 1978. In the beginning, TUSTEP also required a lot of formatting codes that had to be inserted right into the text in order to define the text layout. Today however, TUSTEP offers the possibility to use a custom markup for structuring texts. The markup can be bound to arbitrary visual formatting commands in order to produce printable or browsable output. This is a technique similar to the Cascading Style Sheets used with HTML (see section 1.2.3). One interesting point is the fact that TUSTEP supports two different output modes: one that produces output in a mono-spaced font and one that produces high-quality PostScript output. The first format is a reminiscence of the time when displays and printers supported only fixed-size fonts in one style.

1.2.2 General Markup Languages

In 1969, Charles Goldfarb, Edward Mosher and Raymond Lorie picked up an idea proposed some time earlier by William Tunnicliffe and Stanley Rice and began to develop a descriptive markup language called the Generalized Markup Language (GML) [Go90]. However, they not only generalized the generic coding ideas suggested so far, but also introduced formally defined document types. The formal definitions, which were derived from the BNF notation [Wir77], could be used to validate the markup of a document [Go81]. Their efforts finally led to the development of SGML, the Standard Generalized Markup Language.

² The concept of descriptive markup is also called generic coding by some authors.


SGML - The Standard Generalized Markup Language

In 1986, SGML was approved as an international standard by ISO [ISO] under the name ISO 8879. One important point about SGML is the fact that it is a generalized markup language not tied to any special content type, although it was strongly influenced by the needs of the publishing industry. Secondly, SGML does not define a particular markup syntax or special markup tags. Instead, it provides authors with the possibility to create arbitrary document types by defining document type definitions (DTDs) and arbitrary markup conventions, which are called a concrete syntax in SGML. Additionally, SGML defines several optional features which can be used in an SGML document. One of these features, for example, is CONCUR, which allows a document to contain different, possibly even overlapping, logical structures.

However, this universality, which is one of the strengths of SGML, also leads to many problems. It is quite hard to implement a conforming SGML system, that is, a system that can process any standard-conforming SGML document. Furthermore, an SGML document is in general much more verbose than a document which contains only procedural markup, because the format of the latter is usually optimized to be as user friendly as possible and contains a lot of implicit information which has to be made explicit in an SGML document. Therefore it is much harder for an author to manually create an SGML document, and sophisticated tools are needed instead of simple text editors.

XML - The Extensible Markup Language

Back in 1996, the World Wide Web Consortium (W3C) [WWW] formed a working group with the goal of bringing together the two powerful ideas of the Web and of descriptive markup. The intention was to develop a markup language that could be used easily on the Web while maintaining compatibility with SGML.

The result was the specification of XML, the Extensible Markup Language [XML], which was published as a W3C recommendation in 1998. Because of its simplicity (the initial specification consisted of only 25 pages) paired with its elegant design, it was rapidly adopted by virtually all software vendors and became a de facto standard for data exchange.

The drawback of its simplicity is of course the fact that it cannot cover every desirable functionality. Therefore a large number of accompanying specifications has since been created in order to fill the gaps. But while XML itself is by now well established, all the other auxiliary standards seem to suffer from the same problems as SGML did: they are difficult to understand and implement, they are often too specific to be of general interest, and because they are developed by different working groups they often do not fit together very well. Section 2.1 will present some of the different XML-related standards used throughout this work in more detail.

Publishing marked-up documents

Composing a document in a structured way is only the first step in the editing process. For publication, the document will usually have to be translated into another format. Depending on where it will be published, this may be HTML [HTML] for online publications or PS/PDF [PS, PDF] for printed ones. Another widely used possibility is to translate a marked-up document into the input format of one of the text processing systems described in section 1.2.1, for example TeX or troff, and let them produce the final output.

For documents defined in SGML or XML this transformation is usually driven by a stylesheet written in a stylesheet language. The most common stylesheet languages in use today are the Document Style Semantics and Specification Language (DSSSL) for SGML documents, which has been standardized by ISO [DSSSL], the Extensible Stylesheet Language (XSL) for XML, which is a W3C recommendation [XSL], and finally Cascading Style Sheets (CSS) [CSS], a stylesheet language for HTML (see section 1.2.3).


Both DSSSL and XSL define a vocabulary for specifying an abstract formatting description, in the sense that the layout of a document may be specified in terms of typographic categories like paragraphs, flow objects, footnotes, headings, side marks and so on. While this so-called style language is a part of the DSSSL standard, for XSL it is known under the separate name XSL Formatting Objects (XSL-FO). Both style languages however leave the fine tuning of the typographic layout, for example line breaking and line balancing on a page, to the formatter, and are thereby not tied to a special formatter.

One possibility to create the final, publishable document is to directly transform the document description with the help of the DSSSL or XSL transformation languages³ into the desired target format. The second possibility is to first transform to the abstract style language and then use a specific formatter (also called a formatting engine) to create the final representation. The first path is often taken for online documentation published in the HTML format, while the second one is more common for high-quality, printed output formats like PS or PDF.
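In the Java world, which will play an important role in the later chapters, the first path can be followed with the JAXP transformation API. The following minimal sketch applies a stylesheet to a document; the file names are placeholders and the stylesheet is assumed to produce HTML:

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Apply an XSLT stylesheet to an XML document and write the result.
public class Transform {
    public static void main(String[] args) throws Exception {
        TransformerFactory factory = TransformerFactory.newInstance();
        // Compile the stylesheet (style.xsl is a placeholder).
        Transformer transformer =
            factory.newTransformer(new StreamSource("style.xsl"));
        // Transform document.xml and write the result to document.html.
        transformer.transform(new StreamSource("document.xml"),
                              new StreamResult("document.html"));
    }
}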

1.2.3 Specialized Markup Languages for Text

HTML - The Hypertext Markup Language

The Hypertext Markup Language is the well-known lingua franca of the World Wide Web. Tim Berners-Lee and Robert Cailliau developed it in 1989 at the Conseil Européen pour la Recherche Nucléaire (CERN), a high energy physics research center near Geneva. It was designed as a very simple markup language with a syntax based on SGML. It offered a minimalist set of tags, some style options and so-called “hyperlinks” which could associate arbitrary HTML documents with each other. As in the early text processing systems, structural markup was not strictly separated from visual markup.

Although it was primarily intended as a linked information system for high energy physicists, it turned out to be extremely useful for making any kind of content available to others on a computer network. After the National Center for Supercomputing Applications (NCSA) at Urbana-Champaign encouraged Marc Andreessen and Eric Bina to develop Mosaic, a freely available, graphical HTML browser, the distribution of HTML grew exponentially, forming the cornerstone of the World Wide Web as we know it today.

Meanwhile, the development of HTML is controlled by the W3C. HTML has been revised and extended several times; the current W3C recommendation is version 4.x [HTML]. In general however, HTML is moving towards XHTML [XHTML], which is a reformulation of HTML in XML syntax. HTML uses its own stylesheet language called Cascading Style Sheets (CSS) [CSS] to associate style information with the different elements.

DocBook

DocBook [WaMu] is a set of tags for describing books, articles and other prose documents, particularly about computer hardware and software, although it is not limited to these applications. It is defined as a native DTD for SGML as well as for XML.

DocBook was started as a pure SGML DTD around 1991 by HaL Computer Systems and O'Reilly & Associates in order to facilitate the exchange of Unix documentation. Later on, many other computer companies became involved in its further development and extension. In 1998, it finally became the charge of a technical committee of the Organization for the Advancement of Structured Information Standards (OASIS) [OASIS].

³ XSLT, the XSL Transformations language, is the second part of the XSL specification. It is available as a W3C recommendation of its own [XSLT].


Today, both SGML and XML versions are provided by OASIS [DocB].

There are two main ways to publish a DocBook document. The first one uses Jade [Jade], a free DSSSL processor, together with a DocBook style sheet to produce HTML, TeX, RTF [RTF] or MIF [MIF] output. The second way is to use an XSLT processor and produce either HTML output directly or XSL Formatting Objects that can in turn be processed by a formatting objects engine to produce PDF or TeX output.

Meanwhile, DocBook is widely used for the documentation of software projects in the open source community, for example by the Linux Documentation Project [LDP].

OeB - Open eBook Publication Structure

The Open eBook Publication Structure (OeB) [OeB] is a standard developed by major software and hardware companies which have joined in the Open eBook Forum [OeBF]. Its primary intention is to facilitate and to advance the publication and representation of books in electronic form. The main goal is to define a format in which content providers can publish their books such that they are readable on a variety of different reading systems, regardless of whether these systems are special hardware devices, special software or a combination of the two.

In order to simplify the transition from existing systems, OeB is based on several other, well-established standards. It is defined in XML and uses a subset of HTML 4.0 and CSS 1 for the description of content and appearance respectively. The Dublin Core metadata language [DuCo] (also known as RFC 2413 [RFC2413]) is used to specify the bibliographic data, and the Multipurpose Internet Mail Extensions (MIME) media types [RFC2046] are used to denote the type of embedded media objects.

TEI - The Text Encoding Initiative

TEI [SperBu] is a standard encoding scheme for the representation of all kinds of literary and linguistic texts. Like DocBook, it is in fact a set of tags defined in a DTD. TEI was launched in 1987 and has since then gained wide acceptance, especially in the linguistic and philological community. It is available in an SGML as well as an XML version.

While DocBook was designed to facilitate the writing of technical documentation, the main focus of TEI is the methodical markup of already existing documents to make them available electronically. More than one hundred large projects which use the TEI encoding are registered at the TEI home page, most of them digital libraries and text corpora. Although many of the documents encoded with TEI already exist in a printed version, there also exist various stylesheets that transform TEI-encoded documents to HTML, TeX or PDF. The main advantages of TEI for the humanities community, however, are the extended search capabilities offered by documents encoded in such a way, the possibility of easily generating statistics from them and, finally, the possibility to easily interchange documents that are encoded in this format.

1.3 Scope and contribution

This thesis introduces a framework for structuring, analyzing and presenting texts in arbitrary languages and media formats. Although it can be used as a text processor or editor, its main application is not the support of the input and editing process of a text.

Instead, its main feature is the possibility to load digital documents in many formats (plain text, facsimile manuscripts, XML files), to structure these documents with a uniform


markup and to link them together. Structuring is used here in the sense of tagging a document with respect to formal, linguistic, semantic, historical or any other aspects. Different, parallel taggings are possible for a document, and the documents marked up this way can be linked together with respect to any of these structures. Depending on the nature of the tagging and the scope of the linking, they can be performed automatically, semi-automatically or manually. The documents processed this way can be combined with other tools, for example dictionaries or index generators, and then be made available in a form in which they can be comfortably read, browsed, analyzed or transformed into other formats.

All this functionality is realized as a configurable and extensible framework, where the word framework is used in the sense of software framework as defined for example in [GHJV, Szy]. This makes it easy for others to extend, customize and adapt the system to their special needs, where the target domains may be as different as, for example, text analysis, language learning, the creation of critical and historical editions or electronic publishing.

The framework is built around a new XML encoding scheme which is used as a standardized, persistent and media-independent repository for all different kinds of documents along with the different tagging and linking structures defined for them. The advantage of this format, which is defined as an XML DTD and an XML Schema, is the fact that the whole armada of XML-related tools can be used to process the documents, but also to easily transform them into other formats, to exchange them or to use them independently of the framework.

As a side effect of the main topic, different tools for program and software documentation have been developed and a new and sophisticated, multilingual user interface has been created. The documentation tools have been used to document the components of the framework, while the new user interface has been built into the created applications.

1.4 Structure of this work

The remainder of this thesis is organized as follows. The next chapter will describe XTE, a new XML markup scheme that can handle an arbitrary number of possibly overlapping hierarchies and which may be used not only with encoded texts, but also with texts available in different media formats like graphics or sound.

Chapter 3 will then give a brief overview of the software architecture of the implemented system, which consists of an extensible editor for the efficient and comfortable tagging and linking of texts with the new markup scheme (LanguageAnalyzer) and a viewer and browser application for displaying and working with these texts (LanguageExplorer).

Chapter 4 will give some implementation details and describe some general-purpose libraries that have been created during the development process. A new software documentation approach will be introduced which was used to document the system, and an innovative, multilingual user interface which is part of LanguageAnalyzer and LanguageExplorer will be presented.

Finally, the two applications LanguageExplorer and LanguageAnalyzer are described in full detail in chapters 5 and 6 respectively. A chapter containing references to related work, a discussion of the contributions of this thesis and an outlook on further research topics will conclude the work.

Chapter 2

A new markup scheme for text

Section 1.2 introduced some common text markup languages. This chapter will analyze the advantages and problems of the existing languages, especially in the context of overlapping hierarchies. A new encoding scheme, based on XML and on some ideas of the Text Encoding Initiative [SperBu], will then be described which tries to eliminate the identified drawbacks of the other approaches. The new encoding will finally be formally defined as an XML DTD as well as an XML Schema.

2.1 A short introduction to XML

XML is a markup language developed by the W3C consortium [WWW] as a simple and general data interchange format for the World Wide Web. XML was intended to fill the gap between SGML and HTML, i.e. it should have a formal and concise design, but at the same time it should be easy to create and process data in an XML format. The final specification defined a compatible subset of SGML on about 25 pages, compared to the 500+ pages of the original SGML standard.

The following description of XML does not pretend to explain XML completely and formally. Instead, it gives a short and simple introduction for the reader who is not familiar with XML in order to support the understanding of the following sections. For the complete specification refer to [XML].

An XML document is composed of markup and character data. The markup basically consists of opening, closing and empty tags and of comments and processing instructions. In order to distinguish markup from character data, several special characters like <, >, ', " and & have to be escaped as &lt;, &gt;, &apos;, &quot; and &amp; when used in character data. A start tag is written as <Name>, an end tag as </Name> and an empty tag as <Name/>. Start and empty tags may additionally contain an arbitrary number of attribute definitions of the form key='value' before the closing >. It is also possible to use double quotes instead of the single quotes. An element is either an empty tag or a composition of comments, processing instructions, tags and character data enclosed by a matching start and end tag, that is a start and an end tag with the same name.

A textual object is called a well-formed XML document if all the start and end tags are properly nested and matching and the whole document has a single root element. For illustration purposes, a well-formed XML document can be imagined as a well-formed mathematical infix expression, where operations, numbers and variables correspond to the XML character data and the different parentheses, brackets and curly braces correspond to the different XML tags respectively. Usually XML documents also contain an XML declaration of the form <?xml version='1.0' encoding='UTF-8'?> as their first line, which specifies the actual XML version and character encoding. The following listing shows a small XML file


which illustrates the aforementioned properties:

Listing 2.1: A minimalist, well-formed XML example

<?xml version='1.0' encoding='UTF-8'?>
<message style='bold'>Hello world!</message>

As mentioned before, XML documents may also contain comments, which are enclosed by <!-- and -->, and processing instructions, which are enclosed by <? and ?>. While the meaning of comments needs no further explanation, processing instructions allow XML documents to contain instructions for the applications by which they will be processed. For convenience reasons, XML documents may also contain so-called CDATA sections anywhere in the document where character data is allowed. CDATA sections are enclosed by <![CDATA[ and ]]>. They can contain arbitrary character data (except the character sequence ]]>) which would have to be quoted elsewhere, and can be used if a larger part of text needs to be escaped because it would otherwise be recognized as markup. In order to specify characters not available in the current encoding, character references of the form &#dec-number; or &#xhex-number; can be used to refer to an arbitrary Unicode [U30] character code.

Document type definitions

So far we gave a coarse description of what an XML document looks like. However, the XML standard also defines a possibility to restrict the structure of a document. The names and the nesting of elements and the names and types of the attributes allowed for each element can be defined inside the XML document or associated with it. Such a definition is called a document type definition (DTD), and an XML document that is well formed and fully complies with its DTD is called a valid document. XML parsers are not required to validate a document, but they need to check at least if it is well formed. XML parsers which additionally check the validity of a document are called validating parsers.

The document type of a document is given in its document type declaration, which is located between the XML declaration and the root element and has the following form:

<!DOCTYPE Name ExternalID [intSubset]>

The optional ExternalID specifies the location of an external DTD, while the optional intSubset defines the so-called internal subset of the DTD. In a valid document, the name given in the document type declaration has to match the name of the root element. Notice however, that neither an internal nor an external DTD needs to be available for a well-formed document. If both the internal and an external DTD are present, they are merged together, while internal definitions have precedence over external definitions with the same name. This fact can be used to customize a DTD, as will be shown for example in section 2.4. Transforming the small XML example given above into a valid XML document could be done by adding an internal DTD as follows:

Listing 2.2: A minimalist, well-formed and valid XML example

<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE message [
<!ELEMENT message (#PCDATA)>
<!ATTLIST message style (bold|italic) 'bold'>
]>
<message style='bold'>Hello world!</message>

The example shows how elements and attributes are defined with ELEMENT and ATTLIST statements respectively. The ELEMENT definition specifies the child elements which are allowed for an element with a notation similar to the regular expression syntax [Friedl], using the meta-characters (, ), ,, |, ?, * for grouping, sequencing, alternation, option and repetition. The ATTLIST definition determines which attributes are allowed for an element and narrows their type.

Finally, the document type definition can be used to define entities that will be expanded later on. Entities can be used, for example, to create abbreviations for frequently used text sequences to save the user from typing, or to make a DTD itself customizable by defining certain parts of the DTD by means of entities. The first task can be solved with so-called general entities, which are defined in the DTD but which can be used only in the XML document. The second task may be accomplished with so-called parameter entities, which can be defined and used only inside the document type definition. The following lines show the format of general and parameter entities respectively:

<!ENTITY Name 'value'>
<!ENTITY % Name 'value'>

Notice that entity definitions can be used to include external files into a document type definition or into an XML file if an ExternalID declaration is present and references such a file. This functionality is comparable to the include mechanism available in C/C++ and many other programming languages. General entities are referenced as &Name; in the XML document, while parameter entities have to be referenced as %Name; in the document type definition.

Although DTDs are widely used today to constrain the content of XML files, and although there meanwhile exist a lot of quite complex XML vocabularies like DocBook and TEI [DocB, SperBu] that are defined as DTDs, the possibilities of DTDs are still quite restricted. It is not possible, for example, to constrain the ordering and number of child elements in an element with mixed content, that is, an element which contains child elements as well as character content. The number of different attribute types is quite small and it is not possible to define new types. These problems led to the development of new and more sophisticated XML description languages. One of these languages, called the XML Schema Language, became a W3C recommendation in 2001 and will be introduced in section 2.1.2.

2.1.1 XML namespaces

One of the problems of document type definitions is the fact that they do not have a module concept and all the element and attribute definitions are located in a single, global name space. This may lead to name clashes when bigger DTDs are developed or when parts of a DTD should be reused.

These deficiencies led to the development of the XML namespace specification [XML-Na], which became a W3C recommendation in 1999. One of the important points about this specification is the fact that it does not change the underlying XML specification in any way, but instead defines the namespace mechanism such that it remains fully compatible with the XML standard. This is achieved by giving the colon character ':', which is an ordinary character in XML, a special meaning in name declarations. Following the namespace specification, XML names can be composed of a name prefix and a local name, separated by a colon. Such a name is called a qualified name. The namespace prefix can


be bound to a namespace, which is identified by a URI (Uniform Resource Identifier) [URI], by using an attribute declaration of the form xmlns:NSprefix='URI'. This declaration binds the prefix NSprefix for the actual element and all other nested elements to the specified URI. In order to save the user from excessive typing in the case where most of the elements in a document belong to a single namespace, it is possible to declare a default namespace with the attribute xmlns='URI'. After such a declaration, all unprefixed elements are implicitly bound to that default namespace. Namespaces apply equally well to attributes, that is, attribute names can also be given as qualified names according to the XML namespace recommendation. It is essential to notice that the prefix name can be chosen arbitrarily. What counts is the associated URI, that is, qualified names are bound to the URI corresponding to their actual prefix, not to the prefix itself.

Although XML namespaces are designed to fit smoothly into the XML specification without affecting it, this goal has not been fully achieved. One of the biggest problems is that DTDs are not namespace aware. Although it is possible to use qualified names in a DTD, the prefixes have no meaning there. The consequence is that instance documents have to use exactly the same prefix as the DTD for a certain namespace. This is an anachronism however, because it leads to exactly the same problems that should have been solved by namespaces. Although there exist some techniques, as shown in sections 2.4.1 and 2.4.3, to partially work around this problem, the only real solution is to use a namespace-aware schema language instead of DTDs.
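That the prefix is only a stand-in for the URI can be verified with any namespace-aware parser. The following minimal Java sketch uses the JAXP DOM API; the file name doc.xml is a placeholder for an arbitrary document that uses namespaces:

import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Parse a document with namespace processing enabled and inspect how
// a qualified element name is split into prefix, local name and URI.
public class NamespaceDemo {
    public static void main(String[] args) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // namespace processing is off by default
        Document doc = factory.newDocumentBuilder().parse("doc.xml");

        Element root = doc.getDocumentElement();
        System.out.println(root.getPrefix());       // the (arbitrary) prefix
        System.out.println(root.getLocalName());    // the local part of the name
        System.out.println(root.getNamespaceURI()); // the URI, which alone counts
    }
}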

2.1.2 XML schema languages

To overcome the deficiencies of document type definitions, several new, so-called schema languages have been designed and developed [Relax, Trex]. Finally, the W3C consortium itself created a new schema language called the XML Schema Language and made it a recommendation in 2001. One of the main features of XML Schema is the fact that the schemas themselves are completely written in XML and no additional syntax, as for example the DTD syntax, is required. XML Schema also supports namespaces and as such facilitates the modularization of schemas. It allows the definition of own simple and complex types and supports some object-oriented features that allow type derivation and extension. Finally, the XML Schema language has a more flexible and powerful cross-document concept of keys and references than is available in DTDs and allows a more fine-grained constraining of the uniqueness of attribute and element values. As an example of an XML Schema definition, consider the following schema for the “Hello world!” example shown before:

Listing 2.3: An XML Schema for the “Hello world!” example that uses derivation by restriction.

<?xml version='1.0' encoding='UTF-8'?>
<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>
  <xs:simpleType name='styleType'>
    <xs:restriction base='xs:string'>
      <xs:enumeration value='bold'/>
      <xs:enumeration value='italic'/>
    </xs:restriction>
  </xs:simpleType>
  <xs:element name='message'>
    <xs:complexType>
      <xs:simpleContent>
        <xs:extension base='xs:string'>
          <xs:attribute name='style' type='styleType'/>
        </xs:extension>
      </xs:simpleContent>
    </xs:complexType>
  </xs:element>
</xs:schema>

The XML Schema language is defined in two parts, namely XML Schema Structures [XMLSch1] and XML Schema Datatypes [XMLSch2]. Additionally, there exists a non-normative but much more readable document called the XML Schema Primer [XMLSch2] that can be used as a simple introduction to the schema language. Finally, notice that it is possible to automatically generate an XML Schema from an XML DTD, although this transformation is not unique, and there exist several tools that accomplish this task [dtd2xsA, dtd2xsB].
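Once such a schema exists, instance documents can be checked against it programmatically. The following minimal Java sketch uses the javax.xml.validation API, which was introduced with JAXP 1.3 and is therefore more recent than the XML tooling used elsewhere in this work; the file names are placeholders:

import java.io.File;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;

// Validate an instance document against an XML Schema.
public class Validate {
    public static void main(String[] args) throws Exception {
        SchemaFactory factory =
            SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = factory.newSchema(new File("message.xsd"));
        Validator validator = schema.newValidator();
        // validate() throws a SAXException if the document is invalid.
        validator.validate(new StreamSource(new File("message.xml")));
        System.out.println("message.xml is valid");
    }
}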

2.1.3 XPath, XPointer and XLink

This section will describe some auxiliary XML standards which are of little use by themselves, but which are extensively used by other XML specifications. The first of these standards is XPath [XPath], a language for addressing the parts of an XML document. The need for such a language evolved during the development of XSL, the Extensible Stylesheet Language (see section 2.1.4), because XSL needed a transformation language and the transformation language in turn needed a possibility to somehow address the parts of the XML document which it processes. Because this functionality was considered of general use for other applications and standards as well, it eventually became a W3C recommendation of its own.

The XPath expressions for addressing the different parts of an XML document are defined using a simple, non-XML syntax in order to be more concise and to support XPath expressions as values of attributes. They operate on the abstract, logical structure of an XML document, which is comparable to the tree-like structure of the document object model (DOM), for example (see section 3.1.1).

Although any literal string or number can be a valid XPath expression, in general an expression will be a so-called location path, where each location path may consist of several location steps separated by a / character. Every location step in turn consists of an axis specification, a node test and a predicate and has the following format: axis::nodetest[predicate]. Of these three parts, only the node test is mandatory; the axis and the predicate parts are optional.

The axis part of a location step specifies which kind of nodes will be selected in the corresponding step. XPath defines several axes which can be used to navigate the XML tree from a given context node, for example child:: for all the immediate child nodes, parent:: for the parent node, descendant:: for all the child nodes taken recursively, ancestor:: for all the parent nodes taken recursively, attribute:: for all the attribute nodes or namespace:: for all the namespace nodes (for a complete list refer to [XPath, § 2.2]). If no explicit axis is given in a location step, the child axis will be taken as a default. The nodetest part of the XPath expression specifies the name of the nodes that should be selected on the chosen axis, while the asterisk character * can be used to select all the nodes on that axis. Finally, the predicate part can be used to further narrow down the selected node set. The XPath recommendation also defines a set of functions that can be used to further refine the results returned by an XPath expression.

Navigating an XML document with XPath can be compared to the navigation of a file system with the help of wildcards. The selection of the slash character as a location


The selection of the slash character as the location step separator in XPath was chosen intentionally to encourage this association. The following line shows an XPath expression that would select the string "Hello world" if applied to our small XML example previously shown in listing 2.2.

/descendant-or-self::node()/message[attribute::style='bold']/text()

The first location step /descendant-or-self::node(), which can be abbreviated as //, recursively selects all the child nodes of the root node. The second location step message[attribute::style='bold'] selects all message element nodes with a style attribute set to bold, and the last location step finally selects the text nodes of the elements found in the previous step by applying the special text() node test.
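Listing 2.2 is not repeated in this section; as a rough orientation, the expression above can be read against a document of the following shape. Only the message element, its style attribute and the "Hello world" content are implied by the expression itself; the root element name greeting is an assumption:

<?xml version='1.0'?>
<greeting>
  <message style='bold'>Hello world</message>
</greeting>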

XPointer

The initial XML Pointer Language (XPointer) has been factored out into the general XML Pointer Framework [XPoint] with simple, so-called shorthand fragment identifiers and three additional fragment identification schemes: the XPointer element() scheme for addressing elements by their position in the document tree, the XPointer xmlns() scheme for binding namespace prefixes to namespace names and the XPointer xpointer() scheme for full XPath-based addressing. The specification only covers the addressing of fragments in XML files, although the syntax is open for extensions and could be adapted to other media types like for example plain text or certain graphic formats.

The XML Pointer Framework essentially defines a syntax for how to compose an XPointer from one or more scheme parts and a semantics for how an XPointer processor, that is an application that claims to support the XPointer standard, should handle it. The simple shorthand fragment identifiers defined by the standard roughly correspond to the fragment identifiers known from HTML. While the xmlns() scheme is only intended to bind namespace prefixes for subsequent schemes and the element() scheme can only be used to select elements based on their position in the tree representation of an XML document, the xpointer() scheme allows the full XPath standard to be used for the identification of arbitrary fragments in an XML resource. The XPointer framework is used in many other XML related standards like for example XLink [XLink] or XInclude [XInc].
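The following lines illustrate the different flavors; the document name, ID and namespace URI are of course hypothetical:

doc.xml#intro                   a shorthand pointer to the element with ID 'intro'
doc.xml#element(/1/2)           the second child element of the document element
doc.xml#xmlns(m=http://example.org/msg)xpointer(//m:message[@style='bold'])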

XLink

The XLink specification, which became a W3C recommendation in 2001, generalizes the concept of the simple, unidirectional hyperlinks known from HTML. In particular, it provides complex links between more than two resources, it allows meta-data to be associated with a link and it allows links to be expressed independently of the resources that they reference. XLink may be used to address documents of arbitrary media types by using uniform resource identifiers [URI]. However, if the target of the link is an XML document, the fragment identifier of the URI is interpreted as an XPointer.

XLink also provides the possibility of defining so-called link bases, that is documents that contain third-party and inbound links. If the source of a link is in a remote resource and the target points into the actual document, the link is called inbound; if both the source and the target of a link are located in remote documents, the link is called third-party. Simple links as known from HTML are so-called outbound links in XLink terminology. Link bases can be used to collect related links in a single place.

Notice that the XLink specification only defines a set of attributes. These attributes may be applied to arbitrary elements. Depending on the values of these attributes, they turn the elements they have been applied to into resources, locators or arcs. Besides HTML, many other hyper-media standards like HyTime [DeRoDu] and TEI [SperBu] have been influential for the XLink specification.
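As a small illustration, the following sketch shows an extended link connecting a sentence in two documents. The element names and file names are made up; only the attributes in the xlink namespace are defined by the specification:

<link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="extended">
  <res xlink:type="locator" xlink:href="original.xml#s1"    xlink:label="orig"/>
  <res xlink:type="locator" xlink:href="translation.xml#s1" xlink:label="trans"/>
  <go  xlink:type="arc" xlink:from="orig" xlink:to="trans"/>
</link>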


2.1.4 XSL - The Extensible Stylesheet Language

As pointed out before in section 2.1.3, the need for a stylesheet language for XML comparable in functionality to DSSSL [DSSSL] for SGML arose even before the XML specification was approved as a W3C recommendation. This was the starting point for XSL, the Extensible Stylesheet Language. It quickly became clear, however, that the transformation language needed as a part of the stylesheet language was of broader interest, because it could serve as a general tool for the transformation of XML documents written in different vocabularies. Therefore the specification was split into two parts: the XSL part, which effectively only contains the formatting part of the specification and is also known under the name XSL Formatting Objects (XSL-FO), and XSL Transformations (XSLT), the transformation part of the specification.

XSL-FO and XSLT are both quite big and complicated specifications. While the first tries to define an XML vocabulary that covers every possible typographic aspect of publication, the second defines a full-blown, general-purpose transformation language for XML. XSLT is based on a so-called template mechanism comparable to the one present in the AWK [AKW] programming language. XPath-based patterns are used to choose a template and execute its body, that is, to output the elements not belonging to the XSLT vocabulary and to process the XSLT child elements. The processing of an XML document advances until no more matching templates can be found in the corresponding stylesheet. One of the biggest problems of XSLT is that it has no mutable global variables, i.e. it is stateless. This makes it extremely hard and time consuming to achieve certain computations, like for example creating page references or indices for a book based on an XML document, in one pass.
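A minimal stylesheet illustrating this template mechanism on the "Hello world!" document from listing 2.2 might look as follows; the choice of HTML as the output vocabulary is arbitrary:

<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Triggered for every message element with a 'bold' style. -->
  <xsl:template match="message[@style='bold']">
    <b><xsl:value-of select="."/></b>
  </xsl:template>
</xsl:stylesheet>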

2.1.5 The future of XML

XML's success seems like a self-fulfilling prophecy. Since its introduction it has quickly developed into a de facto standard and proliferated into every single domain of information technology. Its initial strengths, conciseness and simplicity, are more and more becoming one of its biggest drawbacks. In fact, every feature dropped from SGML in order to keep XML simple gets reinvented by new XML-related W3C recommendations. And because all these recommendations are prepared by different working groups and are mainly focused on their single, isolated topic, they can hardly be integrated without problems. Tool support, which has always been a problem for SGML and one of the biggest advantages of XML, becomes a problem again, because it will get continuously harder to find tools which support the exact subset of needed recommendations out of the unmanageable total number of existing ones. While it looks like XML will definitely survive as a standard for data exchange, it seems questionable whether it will provide the right basis for complex information systems in the future.

2.2 The problem of overlapping hierarchies

As already described in section 1.2.2, the development of descriptive markup languages like SGML and XML was heavily influenced by the publishing industry. And although these languages are general in the sense that they are not tied to any specific application domain, they are nevertheless somewhat biased towards document creation rather than document editing or the markup of existing documents. This fact leads to the phenomenon that the creation of new SGML or XML vocabularies and the creation of new documents with these vocabularies is straightforward and easy.


Example text:

... This is the first sentence on the first line. The second sentence begins on
the first line and extends across the second and third line. The third sentence
is a short one. ...

Encoding lines:

<line n='1'>This is the first sentence on the first line. The second</line>
<line n='2'>sentence begins on the first line and extends across the</line>
<line n='3'>second and third line. The third sentence is a short one.</line>

Encoding sentences:

<s n='1'>This is the first sentence on the first line.</s> <s n='2'>The second
sentence begins on the first line and extends across the
second and third line.</s> <s n='3'>The third sentence is a short one.</s>

Encoding lines and sentences (illegal XML!):

<line n='1'><s n='1'>This is the first sentence on the first line.</s> <s n='2'>The second</line>
<line n='2'>sentence begins on the first line and extends across the</line>
<line n='3'>second and third line.</s> <s n='3'>The third sentence is a short one.</s></line>

Figure 2.1: A demonstration of the problem of overlapping hierarchies (also known under the names "multiple hierarchies" or "concurrent hierarchies"). The text in the upper box is encoded twice, once line-wise and once sentence-wise. Encoding both hierarchies simultaneously is however impossible in XML, because elements must be properly nested: an element that starts inside another element must also end inside it.

However, as time goes by, vocabularies tend to grow in order to fulfill the needs and wishes of the different user groups of the vocabulary. At some point this leads to the problem of overlapping hierarchies, which is illustrated in figure 2.1. The problem arises because SGML documents as well as XML documents are in fact a kind of tree structure and not a general graph structure. But in a tree structure subtrees cannot overlap; by definition they are either disjoint or nested. The problem of overlapping hierarchies arises whenever there is more than one way to structure a given text. It has already been extensively discussed by different authors [SpHu99, SpHu00, ReMyDu, DuOD01, DuOD02, ThMcK] and several solutions have been proposed. The TEI manual, for example, dedicates a whole chapter to the problem and describes several workarounds [SperBu, § 31].

2.3 Workarounds for the problem of overlapping hierarchies

Because the problem of overlapping hierarchies arises quite often in the area of humanities computing there exist several workarounds for it. They will be discussed in this section along with some examples.

2.3.1 The SGML CONCUR feature

SGML has an optional feature called CONCUR [Bryan, § 9]. It allows the markup of different concurrent hierarchies in one SGML document. To this end, more than one document type may be declared in the header of an SGML document. The first document type will be the base document type. Its elements may be used in the usual way throughout the document. But it is also possible to use elements of the other document definitions at arbitrary places in the document, regardless of whether they overlap with elements of the other document definitions, as long as they are preceded by a prefix that denotes the document type they belong to. This is demonstrated in listing 2.4, which uses the two document types page-layout and structure to encode the two hierarchies from figure 2.1.

Listing 2.4: An example of using the SGML CONCUR feature to encode overlapping hierarchies

<!DOCTYPE page-layout [ ... ]>
<!DOCTYPE structure [ ... ]>

<line n='1'><(structure)s n='1'>This is the first sentence on the first line.</(structure)s>
<(structure)s n='2'>The second</line>
<line n='2'>sentence begins on the first line and extends across the</line>
<line n='3'>second and third line.</(structure)s>
<(structure)s n='3'>The third sentence is a short one.</(structure)s></line>

The SGML CONCUR feature is somewhat related to the XML namespaces [XML-Na] functionality, with the difference that XML documents always have to be well formed, i.e. their elements always have to be properly nested, regardless of the namespace they belong to. The CONCUR feature is an elegant method for the encoding of concurrent hierarchies. Unfortunately, it is only an optional feature of SGML which has seldom been implemented and which has been dropped entirely in XML.

2.3.2 Milestone elements

One method suggested by TEI to avoid problems with concurrent hierarchies is the use of empty elements, so-called milestone elements. Because they contain no content, they do not nest and thus they cannot overlap with other elements. The text from listing 2.4 could be encoded as follows in XML if the two empty elements sb for "sentence begin" and se for "sentence end" were used instead of the s element:

Listing 2.5: Encoding the structure from listing 2.4 with milestone elements

<line n='1'><sb/>This is the first sentence on the first line.<se/><sb/>The second</line>
<line n='2'>sentence begins on the first line and extends across the</line>
<line n='3'>second and third line.<se/><sb/>The third sentence is a short one.<se/></line>

The advantage of this approach is its simplicity. The problem is that the valid placement of the sb and se elements cannot be validated by the XML parser, because in a document type definition there is no way to express the fact that an sb tag must logically always be followed by an se tag. There is also a certain imbalance between the main structure, expressed by the line elements in this example, and other auxiliary structures expressed by milestone elements.


2.3.3 Fragmentation

Another method which can be used to avoid overlapping hierarchies is to break up the elements that cause the problems into smaller fragments that do not overlap with the other structures anymore. Listing 2.6 shows how the text from listing 2.4 could be encoded using this approach.

Listing 2.6: Encoding the structure from listing 2.4 by breaking elements into fragments

<line n='1'><s n='1'>This is the first sentence on the first line.</s><s n='2'>The second</s></line>
<line n='2'><s n='2'>sentence begins on the first line and extends across the</s></line>
<line n='3'><s n='2'>second and third line.</s><s n='3'>The third sentence is a short one.</s></line>

For all its simplicity, this solution also has some drawbacks. Additional processing is needed for the reconstruction of the fragmented structure. Just as with the last approach, the resulting encoding is biased towards the main, unfragmented structure. Finally, fragmentation does not scale very well, as fragments potentially have to be subdivided further as new structures are added to a document.

2.3.4 Virtual joins

The last method, which worked by fragmenting the document, can be improved by using so-called "virtual joins" [SperBu, § 31]. These are special elements that are used to express the logical relationship of otherwise structurally unrelated elements, as shown in listing 2.7. Notice the similarity of this approach to the XLink link base concept discussed in section 2.1.3.

Listing 2.7: Augmenting the structure from listing 2.6 with virtual join elements

<line n='1'><s id='s1'>This is the first sentence on the first line.</s><s id='s2a'>The second</s></line>
<line n='2'><s id='s2b'>sentence begins on the first line and extends across the</s></line>
<line n='3'><s id='s2c'>second and third line.</s><s id='s3'>The third sentence is a short one.</s></line>
<join targets='s2a s2b s2c'/>

Another possibility to create virtual joins is to simply link the corresponding elements with each other as demonstrated in listing 2.8:

Listing 2.8: Augmenting the structure from listing 2.6 with virtual join elements

<line n='1'><s id='s1'>This is the first sentence on the first line.</s><s id='s2a' next='s2b'>The second</s></line>
<line n='2'><s id='s2b' prev='s2a' next='s2c'>sentence begins on the first line and extends across the</s></line>
<line n='3'><s id='s2c' prev='s2b'>second and third line.</s><s id='s3'>The third sentence is a short one.</s></line>

Although virtual joins make the fragmentation solution somewhat more robust, this has to be paid for with an increased complexity. On the other hand, the same advantages discussed for the fragmentation solution also apply to virtual joins.


2.3.5 Multiple encodings

If it is likely that the text in question will not have to be modified, an alternative to the aforementioned solutions can be to encode the text multiple times. On the one hand, this procedure makes each of the encoded versions easier to process, because it represents a single view of the document and is not disturbed by the other encodings. On the other hand, the method needs more memory and there is always the risk of introducing redundant information into the individual encodings, which is hard to keep up to date and which can lead to inconsistencies between the different copies of the document.

2.3.6 Bottom-up virtual hierarchies

In [DuOD01] Durusau and O'Donnell propose the use of a single encoding for every hierarchy in question and the automatic creation of a so-called base file that contains the collected information for every encoding. For this approach to work, the individual encodings have to use the same atomic level PCDATA1, i.e. all the individual documents have to be built up from the same base elements. In their paper they use word segments as base level elements; however, a finer segmentation based on syllables or even characters may be used.

In the base file, each of these base elements contains an attribute for each of the individually marked up documents, which records its position in the corresponding hierarchy. The attributes are written as XPath [XPath] expressions that denote the exact position of the base element in the corresponding markup hierarchy. The authors argue that the base file can be constructed automatically from the different, individually encoded files and give some examples of how the base file can be queried for information which requires knowledge of several of the potentially overlapping hierarchies.

The approach is feasible; however, as soon as a more fine-grained segmentation than word-level segmentation is needed, the base file size grows significantly. Additionally, the base file size is not proportional to the complexity of the hierarchies but to the number of base elements. Even an imaginary hierarchy with just one element would add an additional attribute to every element in the base file.
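As a sketch, a base file for the example text from figure 2.1 might begin as follows. The element and attribute names are invented for illustration; the original paper should be consulted for the exact format:

<base>
  <w lines="/lines/line[1]" sentences="/sentences/s[1]">This</w>
  <w lines="/lines/line[1]" sentences="/sentences/s[1]">is</w>
  <w lines="/lines/line[1]" sentences="/sentences/s[1]">the</w>
</base>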

2.3.7 Just in time trees

In [DuOD02] the same authors propose a new parsing model that honors just the element tags that are valid with respect to the current document type definition. All the other tags are discarded, while their PCDATA content is still processed. With this method, it is possible to attach custom encodings to a single document that may have potentially overlapping hierarchies. In fact, this is a rediscovery of the SGML CONCUR feature. In order to be feasible, the method would need to relax the XML well-formedness constraint, which is a key feature of XML documents. Despite its attractiveness, the new approach requires a new data and processing model that is not compatible with XML. It is therefore questionable whether it will become widely accepted.

2.3.8 Standoff markup

Markup that is external to the content it describes, in the sense that it does not wrap the tagged content but only references it, is called external or standoff markup. In 1997, Thompson and McKelvie [ThMcK] introduced a system they called "standoff markup" which uses links to include parts of one or more other documents that are already tagged into a new hierarchy.

1The term PCDATA derives historically from "parsed character data". It is widely used throughout the W3C Recommendations and denotes the actual character data of an XML document (i.e. all text that is not markup).


Initially designed to add markup to read-only documents and to documents spread across different locations, the approach also solves the problem of overlapping hierarchies. In their paper the authors assume a pipelined architecture where individual tools work on a stream of SGML/XML documents and augment, transform or modify them stepwise.

The advantages of the system are evident: different editors can create different markup for the same document. The documents that are marked up do not even have to be available together or be editable. And finally, the markup can be distributed independently of the documents it describes. The disadvantages are an increased processing complexity and the restriction to SGML/XML elements as targets for the links to the external markup structures.

The TEI consortium established a special working group dedicated to the area of standoff markup [TEISO]. It tries to elaborate guidelines for an external encoding which uses the XML XInclude [XInc] and XPointer [XPoint] features to include content from external resources into a document based on the TEI encoding standards.

2.4 XTE - A new standoff markup scheme

After various workarounds for the realization of overlapping hierarchies have been discussed in the last sections, a new standoff markup scheme called XTE (eXternal Text Encoding) that solves the mentioned problems will now be introduced.

In contrast to the aforementioned external markups, the main idea of XTE is not to have several files which contain different markups of a reference document. Instead, in XTE, all the different markups are collected in a single file. This file effectively stores an arbitrary number of independent encodings of the same document, i.e. different tree structures referencing the same source document. Each individual tree structure is of course well formed; however, it is perfectly legal for elements from different trees to overlap with respect to the content that they reference in the source document.

Although it is possible for the different markups in XTE to reference content from external resources, this is not strictly necessary. XTE is designed in a way that allows the source content to be stored along with the different encodings2 in the same file. Finally, XTE allows the user not only to combine an arbitrary number of encodings of the same document, but also to combine different source documents, each with an arbitrary number of encodings, in a single XTE file. In addition to the encoding of language in textual form, XTE also addresses the encoding of language given in various other formats, like for example graphics (i.e. facsimile editions of a historic text) or sound formats.

The combination of different documents, where each of them may be encoded by a number of different markups and be available in different media formats, and the ability to easily specify links between the different documents and encoding elements makes XTE especially useful for the encoding of parallel, multilingual and multi-modal text corpora. While XTE is fully based on XML and a number of other XML-related standards like XML Namespaces, XPath and XLink, it is nevertheless a quite complex markup scheme, which makes it hard to work on with standard tools like simple word processors or even sophisticated XML editors. In order to take full advantage of its features, a graphical editor and a browser tool have been developed, which will be introduced in chapters 6 and 5 respectively.

2Please note that the terms "encoding" and "markup" will be used interchangeably in this section with the meaning of "markup" as defined in section 1.2


2.4.1 The XTE DTD

XTE can be defined as an XML Document Type Definition (DTD) as well as an XML Schema. This section will introduce the XTE DTD, while the following section is devoted to the XML Schema version of XTE.

In the definition of the XTE DTD, so-called customization layers (see for example [DocB, § 5] or [SperBu, § 29]) will be used in order to provide a simple and intuitive way for users to extend XTE with their own markup schemes or to adapt existing schemes to their needs. This technique is based on an XML/SGML feature that allows entity declarations to be repeated. If an entity is declared more than once, the first declaration will be used. Together with external entities, which can be used to include data from other files into a DTD, it becomes possible to declare every single encoding scheme in its own file while still using entities that have been defined in the main XTE DTD. Finally, the XTE DTD and the different encoding schemes which are needed for a special document can be combined in a customization layer. This customization layer will be the DTD which is used by the XML processor to validate the content of a given instance document. The following listing shows the base XTE DTD:

Listing 2.9: The base XTE DTD
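What follows is a sketch reconstructing the general shape of the prologue of this DTD from the description below. The concrete entity names (xte.prefix and so on) and the namespace URI are assumptions; only the technique of parameterizing the namespace prefix through entities is taken from the text:

<!-- The namespace prefix, redefinable in the internal subset of an instance document. -->
<!ENTITY % xte.prefix "xte:">
<!ENTITY % xte.xmlns  "xmlns:xte">

<!-- All element names are defined by means of the prefix entity. -->
<!ENTITY % xte.XTE  "%xte.prefix;XTE">
<!ENTITY % xte.text "%xte.prefix;text">

<!-- The root element binds the namespace prefix to a fixed value
     and allows an optional default namespace declaration. -->
<!ELEMENT %xte.XTE; (%xte.text;)>
<!ATTLIST %xte.XTE;
  %xte.xmlns; CDATA #FIXED "http://www.language-explorer.org/XTE"
  xmlns       CDATA #IMPLIED>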



One of the problems we face is the fact that DTDs have no knowledge of namespaces. If we want to put the elements defined in XTE into their own namespace, we have to hard-code a namespace prefix into the DTD. As this would greatly reduce the benefit of namespaces, if not render them useless altogether, we define the namespace prefix as a parameter entity, as can be seen in the first two lines of listing 2.9. Subsequently, we define all the element names used in the XTE DTD by means of this parameter entity. This way, the user has the possibility to redefine the namespace prefix that will be used for the XTE elements in the internal subset of the DTD. An instance document that just references the XTE DTD has to use the qualified names with the default namespace prefix xte:

<xte:XTE>
  ...
</xte:XTE>

It is however possible to use another, arbitrary prefix as follows:

<!DOCTYPE my:XTE SYSTEM "XTE.dtd" [
  <!-- entity names follow the sketch of listing 2.9 above -->
  <!ENTITY % xte.prefix "my:">
  <!ENTITY % xte.xmlns  "xmlns:my">
]>
<my:XTE>
  ...
</my:XTE>

It is even possible to let the XTE elements reside in the default namespace, by setting the parameter entities that define the XTE prefix to be the empty string:

<!DOCTYPE XTE SYSTEM "XTE.dtd" [
  <!ENTITY % xte.prefix "">
  <!ENTITY % xte.xmlns  "xmlns">
]>
<XTE>
  ...
</XTE>


After the namespace prefix for the XTE namespace has been parameterized, we do the same for the XLink namespace, because some XLink attributes will be used later on in the DTD. Finally, we define the XTE element that will be the root element of the DTD. We bind the namespace prefixes to their corresponding fixed values on the XTE element and thus for the whole document. We also define an optional xmlns attribute for the XTE element to give users the possibility to define their own default namespace on the root element if they would like to do so.

The text and the group elements

The XTE element contains a single text element that in turn contains either a group element or an optional loadLinkBase element followed by one or more content elements and one or more body elements. The loadLinkBase element can be used to include an XLink link base and will be further specified in listing 2.14.

Listing 2.10: Definition of the text and group elements (Referenced in Listing 2.9 on page 23)
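A sketch of how these content models might look, following the entity conventions assumed above; the additional name entities xte.group, xte.loadLinkBase, xte.content and xte.body are, again, assumptions:

<!ELEMENT %xte.text; (%xte.group; |
                      ((%xte.loadLinkBase;)?, (%xte.content;)+, (%xte.body;)+))>
<!ELEMENT %xte.group; (%xte.text;)+>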

The group element is used for recursion only, because it can contain one or more text elements. At the moment, an XTE document usually contains just a single group element, which in turn contains a sequence of all the different, parallel texts included in the document. But by using such a recursive encoding scheme (see also [SperBu]), more sophisticated text structures can be realized in the future.

The content element

The content elements are used to store the text content of a document as a stream of unformatted characters. Usually all the content belonging to one document is kept in one content element. However, more than one content element may be useful to store out-of-band data like for example footnotes or user-supplied annotations. Notice that the content element is the only element that contains character data (PCDATA in XML notation). All the other elements may well refer to a part of this content, however only through pointers (e.g. the start and end attributes defined in default.attributes).

Listing 2.11: Definition of the content element (Referenced in Listing 2.9 on page 23)
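Under the same assumptions as above, the declaration is essentially a plain character data element:

<!ELEMENT %xte.content; (#PCDATA)>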

The body element

The body element is declared as a composition of the elements declared in the parameter entity local.encodings, while the parameter entity itself, as declared in the XTE DTD, has an empty value. This parameter entity is the main extension point provided for a user of the XTE DTD. Listing 2.17 shows how it can be used to plug custom encodings into the general XTE framework, and figure 2.2 shows a graphical overview of the XTE encoding scheme.


Listing 2.12: Definition of the body element (Referenced in Listing 2.9 on page 23)
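As a sketch of the mechanism (the exact composition in the original DTD may differ): once a customization layer has redefined the entity, for example as "div1 | pages", the body declaration effectively expands to a content model over the plugged-in encoding roots:

<!-- Redefined by a customization layer *before* the base DTD is read. -->
<!ENTITY % local.encodings "div1 | pages">

<!-- Expands to: <!ELEMENT xte:body (div1 | pages)*> -->
<!ELEMENT %xte.body; (%local.encodings;)*>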

Default attributes defined by XTE

Finally, the base XTE DTD also defines some parameter entities which are used in the XTE DTD itself but which are at the same time intended to simplify the creation of new XTE encodings by the user. An example of such a parameter entity is default.attributes, which defines the attributes that should be present on every internal, user-created encoding element. The start and end attributes can be used, for example, to link an element to the content, while the link attribute can be used to link an element to other elements in the same or even in other encodings in the same document. Notice that the format of these attributes is intentionally specified very loosely as CDATA to get a maximum of flexibility. This allows simple solutions like plain numbers as references into the content for the start and end attributes, but also supports more complex and powerful solutions like XPath [XPath] or XPointer [XPoint] expressions as values for these attributes.

Listing 2.13: Definition of the default attributes (Referenced in Listing 2.9 on page 23)
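A sketch of the entity: only the start, end, link and viewClass attributes are named in the text; the thesis speaks of several *Class attributes, of which just viewClass is shown here:

<!ENTITY % default.attributes
  "start     CDATA #IMPLIED
   end       CDATA #IMPLIED
   link      CDATA #IMPLIED
   viewClass CDATA #IMPLIED">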

The various *Class attributes are intended as hints for the processing application on how to handle elements of that specific type. They can contain, for example, Java class names that specify a special view class that should be used to optimally display the corresponding element. The precise process of loading and displaying XTE files is covered in section 3.2.

The loadLinkBase element

There was one part missing in listing 2.9, namely the definition of the loadLinkBase element. This part is now appended in the following listing:

Listing 2.14: Definition of the loadLinkBase element (Referenced in Listing 2.9 on page 23)



Figure 2.2: An example of how XTE could be used to encode the overlapping hierarchies used as an example in figure 2.1. Notice how the elements of the different encodings may well reference parts of the text that overlap (gray arrows), while each individual encoding remains well formed. The various element attributes have been omitted for brevity.


Listing 2.14: Definition of the loadLinkBase element (continued)

"http://www.w3.org/1999/xlink/properties/linkbase" %xlinkActuate; (onLoad |onRequest |other |none) #IMPLIED %xlinkFrom; NMTOKEN #IMPLIED %xlinkTo; NMTOKEN #IMPLIED>

The loadLinkBase element can contain child elements which can be used to define an XLink link base. This link base can be used together with the link attribute specified in default.attributes, or as an exclusive source of linking information for the corresponding encoding.

Defining custom encodings for XTE

As explained so far, the base XTE DTD is just a framework for other, separately defined encodings. The XTE DTD alone cannot be used to tag any documents. However, XTE comes with some simple encodings which can be plugged into the XTE base DTD in order to get a practically usable DTD. The following paragraphs will present two of these encodings and demonstrate how they can be merged into a new, customized DTD.

The following listing, for example, shows a DTD which divides a text into sentences and paragraphs. Furthermore, there exist three additional elements, div1, div2 and div3, which can be used to structure the content on a higher level (e.g. divide it into sections, chapters and parts). The structuring level of these elements (e.g. chapter) can be declared with the help of the name attribute.

Listing 2.15: div1.dtd
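A sketch of what this encoding might look like; the indirection through the sentence.parts entity and the use of %default.attributes; follow the description in the text, while the exact content models are assumptions:

<!-- The sentence content model; redefinable by a customization layer. -->
<!ENTITY % sentence.parts "EMPTY">

<!ELEMENT div1 (div2 | p)*>
<!ATTLIST div1 name CDATA #IMPLIED %default.attributes;>

<!ELEMENT div2 (div3 | p)*>
<!ATTLIST div2 name CDATA #IMPLIED %default.attributes;>

<!ELEMENT div3 (p)*>
<!ATTLIST div3 name CDATA #IMPLIED %default.attributes;>

<!ELEMENT p (s)*>
<!ATTLIST p %default.attributes;>

<!-- Declared empty only indirectly, so that users can refine it. -->
<!ELEMENT s %sentence.parts;>
<!ATTLIST s %default.attributes;>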



Notice how the sentence element is declared as an empty element. However, by doing this indirectly with the help of a parameter entity, the user of the encoding will have the possibility to further subdivide the sentence element if she needs to do so. Listing 2.17, which combines this encoding with another partial encoding and the base XTE DTD, shows how such an extension can be accomplished.

The next listing shows the second example of a partial encoding which can be plugged into and used together with the base XTE DTD. It divides the underlying text into lines and pages according to an actual printed edition. The edition may be specified in the edition attribute of the pages element. The hyphen attribute indicates whether the last word of a line is hyphenated, while the para-pos attribute specifies the position of a line in the paragraph. The last two attributes can be used as hints by view classes when they render these elements.

Listing 2.16: pages.dtd
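Again a sketch along the lines of the description; the attribute types and content models are assumptions:

<!ELEMENT pages (page)*>
<!ATTLIST pages edition CDATA #IMPLIED %default.attributes;>

<!ELEMENT page (line)*>
<!ATTLIST page %default.attributes;>

<!ELEMENT line EMPTY>
<!ATTLIST line
  hyphen   (yes | no) #IMPLIED
  para-pos CDATA #IMPLIED
  %default.attributes;>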

Finally, listing 2.17 shows how the encodings defined in listings 2.15 and 2.16 can be combined and used together with the base XTE DTD. First of all, the parameter entity local.encodings is defined to be either div1 or pages. This has to be done before the inclusion of the XTE DTD as an external entity in order to overwrite the empty definition of local.encodings there. Then the two partial encodings presented before are pulled into the file by declaring and referencing them as external entities. Notice how the sentence element, which is declared as an empty element in listing 2.15, is extended to contain latin and french elements (which can denote Latin and French words in a text) by redefinition of the parameter entity sentence.parts. Also notice that the use of the standard element attributes defined in the main XTE DTD is only possible in the partial


encodings shown in listings 2.15 and 2.16 because the main XTE DTD is included into the final DTD file before the partial encodings.

Listing 2.17: div1pages.dtd

<!-- Sketch reconstructed from the description; the SYSTEM identifiers are assumptions. -->
<!ENTITY % local.encodings "div1 | pages">

<!ENTITY % xte.dtd SYSTEM "XTE.dtd">
%xte.dtd;

<!ENTITY % sentence.parts "(latin | french)*">
<!ELEMENT latin EMPTY>
<!ATTLIST latin %default.attributes;>
<!ELEMENT french EMPTY>
<!ATTLIST french %default.attributes;>

<!ENTITY % div1 SYSTEM "div1.dtd">
%div1;
<!ENTITY % pages SYSTEM "pages.dtd">
%pages;

Finally, the newly created DTD can be used to validate an XML file by including the following lines in the header of the corresponding file:

Listing 2.18: An example XML file which uses the DTD defined in listing 2.17

<!DOCTYPE xte:XTE SYSTEM "div1pages.dtd" [
  ...
]>
<xte:XTE>
  ...
</xte:XTE>

As shown in this section, the XTE DTD is an easily extensible and easily configurable DTD which allows users to define and use several, even overlapping encodings on several different documents and to store all this information in a single XML file. Another approach, namely the implementation of XTE as an XML Schema, will be discussed in the next section.

2.4.2 XTE - Expressed as an XML Schema

As described in section 2.1, XML document type definitions have a number of serious drawbacks. XTE, however, is not tied to a DTD in any way. In particular, it can also be expressed by means of a more general schema language (see section 2.1.2). In this section XTE will be defined as a W3C XML Schema [XMLSch0, XMLSch1, XMLSch2].

Listing 2.19: XTE.xsd

XTE Schema version 0.1
This Schema is available from the following Schema Location:
http://www.language-explorer.org/XTE/schema/XTE.xsd

... Some more attribute definitions ...


Listing 2.19: XTE.xsd (continued)

... Definition of the loadLinkBase type. ...
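From the surviving annotation and the walk-through below, the core of the schema can be sketched roughly as follows; the target namespace URI and the exact type names are assumptions:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:xte="http://www.language-explorer.org/XTE"
           targetNamespace="http://www.language-explorer.org/XTE"
           elementFormDefault="qualified">

  <xs:element name="XTE" type="xte:XTE"/>

  <xs:complexType name="XTE">
    <xs:sequence>
      <xs:element name="text" type="xte:text"/>
    </xs:sequence>
  </xs:complexType>

  <xs:complexType name="text">
    <xs:choice>
      <xs:element name="group" type="xte:group"/>
      <xs:sequence>
        <xs:element name="loadLinkBase" type="xte:loadLinkBase" minOccurs="0"/>
        <xs:element name="content" type="xs:string"/>
        <xs:element name="body" type="xte:body"/>
      </xs:sequence>
    </xs:choice>
  </xs:complexType>

  <xs:complexType name="group">
    <xs:sequence>
      <xs:element name="text" type="xte:text" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>

  <!-- The body and encoding types as well as the attribute and
       loadLinkBase definitions are discussed separately below. -->
</xs:schema>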

Listing 2.19 shows the XTE schema that conforms to the XTE DTD known from listing 2.9. First of all, a global element XTE of type XTE is defined. Then the type XTE is defined to be a complex type that contains a single element of type text. Subsequently, the complex type text is defined to contain either a group element of type group or a sequence of the optional loadLinkBase element and the two content and body elements, which are of type string and body respectively. Finally, the group element is defined as a complex type that contains a single element of type text.

While the XML Schema version is a little more verbose than the DTD version, up to now we have a more or less one-to-one translation of the XTE DTD presented in the previous section, which could also have been produced automatically by means of a DTD to XML Schema translation tool. The extension and configuration capabilities of the DTD version, however, will be implemented by specific features which are available only in the XML Schema language.

The XTE XML Schema realized with substitution groups

While customization layers have been used in the DTD version to make XTE easily extensible for users, two more convenient and intuitive possibilities are available to achieve the same result within XML Schema. The first one is to define a global, empty and abstract encoding element of type encoding which is contained in the body element, as shown in listing 2.20.

Listing 2.20: XTE.xsd (Referenced in Listing 2.19 on page 31)
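A sketch of what this looks like; the type names are assumptions following the naming of the surrounding prose:

<xs:element name="encoding" type="xte:encoding" abstract="true"/>
<xs:complexType name="encoding"/>

<xs:complexType name="body">
  <xs:sequence>
    <xs:element ref="xte:encoding" minOccurs="0" maxOccurs="unbounded"/>
  </xs:sequence>
</xs:complexType>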

Users who want to define their own encodings can now easily do this by deriving the root element type of their encoding from encoding and adding that element to the substitution group for encoding, as shown in listing 2.21.


Listing 2.21: div1.xsd (Referenced in Listing 2.22 on page 33)
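Roughly as follows; the namespace URIs and the attribute group name defaultAttributes are taken from the surrounding text, the rest is an assumption:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:xte="http://www.language-explorer.org/XTE"
           xmlns:div1="http://www.language-explorer.org/XTE/div1"
           targetNamespace="http://www.language-explorer.org/XTE/div1"
           elementFormDefault="qualified">

  <xs:import namespace="http://www.language-explorer.org/XTE"
             schemaLocation="XTE.xsd"/>

  <xs:element name="div1" type="div1:div1"
              substitutionGroup="xte:encoding"/>

  <xs:complexType name="div1">
    <xs:complexContent>
      <xs:extension base="xte:encoding">
        <xs:sequence>
          <!-- div2 and p child elements as in the DTD version -->
        </xs:sequence>
        <xs:attributeGroup ref="xte:defaultAttributes"/>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>
</xs:schema>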

The crucial point in the previous schema definition is the fact that the element div1 is added to the substitution group encoding and the type of the div1 element is derived from encoding. Notice also how the attributes defined in the defaultAttributes attribute group in the file XTE.xsd are reused in the definition of the complex type div1. This is possible because the base XTE Schema was imported into the schema file before the definition of the div1 type (see listing 2.22).

Together with the XML Schema import mechanism, which is comparable to the external entities feature of DTDs, it becomes easy to create one's own encodings and combine them in a new XML Schema. Listing 2.22 shows the missing part of the XML Schema definition for a sentence-wise encoding which is equivalent to the sentence-wise encoding previously defined as a DTD in listing 2.15.

Listing 2.22: div1.xsd

An external encoding which can be used with the XTE Schema version 0.1
This encoding divides the text into up to three divisions (e.g. chapter, section, subsection) where each of these divisions contains paragraphs and the paragraphs contain sentences.


Listing 2.22: div1.xsd (continued)

... Definition of div3 which contains paragraph elements p ...

... Definition of p which contains sentence elements s ...

The elements and types defined in the schema will not be discussed in depth here because they directly correspond to the elements with the same names in the corresponding DTD. As a second example of constructing a custom XTE encoding, a schema definition for the line- and page-wise encoding, which has been previously presented as a DTD in listing 2.16, will be given in the next listing:

Listing 2.23: pages.xsd

An external encoding which can be used with the XTE Schema version 0.1
This encoding divides the text into pages and lines as present in a certain edition of a printed version of the text.


Listing 2.23: pages.xsd (continued)

Again, all the elements defined in this schema directly correspond to the elements with the same name in the DTD version of the encoding. Finally, the two custom encodings defined in listings 2.22 and 2.23 can be combined and merged together with the base XTE Schema as shown in listing 2.24. In fact, it is just a matter of importing the desired partial encodings into one schema file. The base XTE Schema has to be imported into the final schema file only because the defaultAttributes attribute group is used in the definition of the complex types latin and french. Otherwise this would not have to be done explicitly, because the base XTE schema is already imported into the partial encodings (see for example listing 2.23).

Listing 2.24: div1pages.xsd


Listing 2.24: div1pages.xsd (continued)

A collection of external encodings which can be used with the XTE Schema version 0.1

This collection combines the 'div1' and the 'pages' encoding.

The interesting point about listing 2.24 is how it is possible for the user to refine the definition of the sentence element s, which was initially defined in listing 2.22. This is achieved with the help of the derivation mechanism provided by the XML Schema language. Because the complex type sentence.with.parts is derived from the type of the sentence element s, it becomes possible to create sentence elements in a document instance which are in fact of type sentence.with.parts in places where ordinary sentence elements are expected by the paragraph- and sentence-wise encoding previously shown in listing 2.22. The only requirement for this substitution to work is to denote the actual type of such an s element by using a type attribute from the http://www.w3.org/2001/XMLSchema-instance namespace. While default sentence elements which contain no child elements can still be declared without a type attribute, the declaration of a sentence element which contains a latin element could be achieved as shown in the following listing:
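A sketch of such an instance fragment; the prefix bindings and the start/end values are invented for illustration, while the xsi:type mechanism itself is part of the XML Schema standard:

<s start="52" end="135"
   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
   xmlns:enc="http://www.language-explorer.org/XTE/div1pages"
   xsi:type="enc:sentence.with.parts">
  <latin start="60" end="68"/>
</s>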

Notice that the creator of the schema for the paragraph- and sentence-wise encoding did not have to take special care to make the sentence element s customizable by the user, as had to be done in the DTD case (compare with listing 2.15). Instead, the XML Schema language provides this extensibility feature. On the other hand, the XML Schema language also allows the creator of an encoding to use the final attribute on a type to specify which element types should not be further refined by derivation.

Finally, the customized XTE XML Schema created in listing 2.24 could be used to validate a document instance by including the attributes shown in the following listing in the root element of the document:

Listing 2.25: An example XML file which uses the XML Schema defined in listing 2.24

<!-- namespace URI and schema file name follow the conventions assumed above -->
<xte:XTE xmlns:xte="http://www.language-explorer.org/XTE"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.language-explorer.org/XTE div1pages.xsd">
  ...
</xte:XTE>

The XTE XML Schema realized with derivation

Besides the possibility of realizing the XTE Schema extensibility with substitution groups, it is also possible to achieve the same results by using the XML Schema derivation mechanism. This mechanism has already been used in the last section to make elements defined in a partial encoding customizable by other users. In the case of the base XTE XML Schema, derivation will be applied to the body element. The type of the body element has to be defined as follows:

Listing 2.26: The definition of the body type for the XTE Schema realized with derivation
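A sketch of this type definition; the presence of the defaultAttributes attribute group is an assumption, while the defining property is simply that the type declares no child elements:

<xs:complexType name="body">
  <!-- No child elements here; concrete encodings derive their own body types. -->
  <xs:attributeGroup ref="xte:defaultAttributes"/>
</xs:complexType>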

The only change with respect to the old definition of the body type (see listing 2.20) is the fact that body now contains no other elements. By default there are just a few attributes defined for this element. However, in document instances the plain body element type will not be used. Elements which have a type derived from body will be used instead. The line- and page-wise encoding already presented in listing 2.23 would have to be defined as follows to work with the new schema:

Listing 2.27: Definition of the page-wise encoding for the XTE Schema realized with derivation

...


Listing 2.27: Definition of the page-wise encoding for the XTE Schema realized with derivation (continued)

... ...
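Following the description below, the core of this variant is a new pagesBody type derived from body; the child element details and the binding of the pages: prefix are assumptions:

<xs:complexType name="pagesBody">
  <xs:complexContent>
    <xs:extension base="xte:body">
      <xs:sequence>
        <xs:element name="pages" type="pages:pages"/>
      </xs:sequence>
    </xs:extension>
  </xs:complexContent>
</xs:complexType>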

Notice that a new type called pagesBody has been introduced, which is derived from body. Also, the div type no longer has to be derived from encoding. In fact, the auxiliary element encoding, which was used previously, is not necessary any more. Merging different encodings together with the base XTE Schema and customizing them can be done in exactly the same way as shown in listing 2.24 in the previous section. However, in a document instance that uses the new schema for validation purposes, the body elements will have to be supplemented with an XML Schema Instance type attribute that denotes the actual type of the body element. For a body element which contains elements of the page-wise encoding shown in listing 2.27, this looks as follows:

Listing 2.28: Example of a body element which is of type pagesBody

<!-- the namespace URI for the pages encoding is assumed -->
<body xmlns:pages="http://www.language-explorer.org/XTE/pages"
      xsi:type="pages:pagesBody">
  ...
</body>

Notice that the body element also contains a namespace declaration attribute for the target namespace of the encoding, which is needed to qualify the value of the xsi:type attribute. The xsi: prefix itself, which binds the XML Schema Instance namespace, was already introduced on the root XTE element (see listing 2.25). Both of the extension mechanisms for the XTE Schema presented in the last two sections work equally well. However, for reasons of compatibility with the DTD version, which will be explained in more detail in the next section, the actual schema version of XTE uses substitution groups as its extension mechanism.

2.4.3 Using the XTE DTD together with the XTE XML Schema

The last two sections showed in some detail how XTE can be defined as a DTD as well as an XML Schema. However, these two solutions do not necessarily have to be mutually exclusive. By taking some special care during the design of the two XTE implementations, it becomes possible to use both of them at the same time for the validation of an instance document.


This approach has several advantages. First of all, a larger number of applications will be able to validate the instance document, because all applications that understand either a DTD or an XML Schema will be able to validate the document. Furthermore, the schema version of XTE can be used to define additional constraints on the elements that are not expressible in a DTD. In such a case, an application may choose to validate the instance document just against the weaker DTD or, if capable of doing so, also validate it against the more rigorous XML Schema.

The biggest challenge for using a DTD together with an XML Schema is the fact that DTDs do not understand namespaces (for a discussion see section 2.1.1). This means that it is not possible in a DTD to declare attributes or elements as belonging to a certain namespace. Therefore, it seems as if it would be impossible to declare a target namespace in the XTE schema definitions, because doing so would require all the elements in an instance document to be qualified with the corresponding namespace prefix. However, by applying the techniques already demonstrated in listing 2.9, it becomes possible to customize the namespace prefixes used in the DTD. While the XML Schema validator uses the real namespaces to which the name prefixes are bound in order to validate an instance document, the DTD is customized to use the exact namespace prefixes as defined in the XML Schema.

As shown in listings 2.25 and 2.28, there are two places where namespace-qualified attributes are necessarily needed if a document instance should be validated against a schema. The first one is the root element, where the location of the corresponding schema has to be specified with the schemaLocation attribute from the http://www.w3.org/2001/XMLSchema-instance namespace. The second one is every element which may be substituted by an element of a derived type and which has to explicitly state its actual type by using a type attribute from the same namespace. The following listing shows the changes that are necessary to make the XTE DTD from listing 2.9 XML Schema compatible:

Listing 2.29: Changes to the base XTE DTD from listing 2.9 to make it “Schema compatible”

... more entity definitions ...


Listing 2.29: Changes to the base XTE DTD from listing 2.9 to make it “Schema compatible” (continued)

  %typeAttribute; CDATA #IMPLIED
  xmlns:typeNS    CDATA #IMPLIED
  xmlns           CDATA #IMPLIED>

... more element definitions ...

First of all, we define entities for the namespace prefix of the http://www.w3.org/2001/XMLSchema-instance namespace and entities for attributes from this namespace. For brevity, we will use the default xsi: namespace prefix for this namespace in the following part of this section. The XTE root element is then extended by the xmlns:xsi attribute, the content of which is preset to the fixed value http://www.w3.org/2001/XMLSchema-instance, and the xsi:schemaLocation attribute, which will hold the URL of the Schema against which the instance document should be validated. For the case where the user also wants to validate against an XML Schema that uses no target namespace, we additionally add the xsi:noNamespaceSchemaLocation attribute.

The second change extends the attribute list of the body element and the default attributes defined in default.attributes with the xsi:type attribute. This is done in order to support the user customization of encodings through derivation as demonstrated in listing 2.24. Because all new encoding elements should use the attributes defined in the parameter entity default.attributes, they are all customizable by default. If the derived element is defined in its own namespace, a possibility is needed to make this namespace available before it can be referenced in the xsi:type attribute. This is exactly the function of the xmlns:typeNS attribute. It can be used to bind the typeNS: prefix to an arbitrary namespace, which can then be referenced in the xsi:type attribute. Notice that the additional xsi:type and xmlns:typeNS attributes on the body element are only necessary if the XTE Schema is defined by means of derivation. The following listing shows how they would be used in an instance document validated against the custom encoding div1Body, which is defined in the namespace http://www.language-explorer.org/XTE/div1.

Listing 2.30: Usage of the xsi:type and xmlns:typeNS attributes.

<body xmlns:typeNS="http://www.language-explorer.org/XTE/div1"
      xsi:type="typeNS:div1Body">
  ...
</body>


2.4.4 Encoding facsimile texts with XTE

In this section, a short description of an encoding is given which can be used to include facsimile editions of a document in LanguageExplorer and LanguageAnalyzer. The idea behind the encoding, which is shown in listing 2.31, is to define a facsimile-book element that holds an arbitrary number of facsimile pages. Notice how the viewClass attribute, defined initially in the parameter entity default.attributes in the base XTE DTD, is refined and set to the fixed value of the class name that should be used to render elements of that type.

Listing 2.31: A simple DTD for encoding facsimile documents in LanguageExplorer.
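A sketch of such a DTD, put together from the description in the text; the exact content models, the coordinate attributes of the fragments and the fixed view class name are assumptions:

<!ELEMENT facsimile-book (facsimile-page*)>
<!ATTLIST facsimile-book %default.attributes;>

<!-- The refined viewClass attribute is listed first so that it takes
     precedence over the #IMPLIED declaration in default.attributes;
     the concrete class name is a made-up placeholder. -->
<!ELEMENT facsimile-page (facsimile-fragments | facsimile-fragment)*>
<!ATTLIST facsimile-page
  viewClass CDATA #FIXED "FacsimilePageView"
  url       CDATA #IMPLIED
  location  CDATA #IMPLIED
  %default.attributes;>

<!ELEMENT facsimile-fragments (facsimile-fragments | facsimile-fragment)*>
<!ATTLIST facsimile-fragments %default.attributes;>

<!-- A rectangular area of the page image; type may be e.g. 'character',
     'word' or 'line'. -->
<!ELEMENT facsimile-fragment EMPTY>
<!ATTLIST facsimile-fragment
  type   CDATA #IMPLIED
  x      CDATA #IMPLIED
  y      CDATA #IMPLIED
  width  CDATA #IMPLIED
  height CDATA #IMPLIED
  %default.attributes;>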


Each facsimile page, which is represented by the facsimile-page element, has a link to the facsimile image, which may be given as a URL in the url attribute or as a local file system resource in the location attribute. Each facsimile page may be composed of an arbitrary number of so-called facsimile fragments, which are represented by the facsimile-fragment element. Each of them describes a rectangular area of the facsimile image. Fragments which belong together logically can be grouped together in a facsimile-fragments element. Because facsimile-fragments elements can contain not only facsimile-fragment elements but also other facsimile-fragments elements, they can be used to recursively refine the description of a facsimile document. The type attribute of the fragment elements gives a description of the content represented by the fragment and may contain values like character, word or line.

This simple encoding may be used, for example, to represent the results of processing a scanned text image with an OCR program. Notice that the facsimile-fragment element uses the start and end attributes defined in default.attributes, that is, a text model is constructed even for a facsimile document. Although the content of this model is not relevant for the visual representation, it can be used, for example, to linearize the different fragments and provide an easier way of navigation and access. See figure 3.9 on page 55 for a picture of how a facsimile document encoded with this encoding may be represented in LanguageExplorer and LanguageAnalyzer.

Chapter 3

The software architecture of LanguageExplorer and LanguageAnalyzer

This chapter will give a high-level overview of the different software packages which are part of the LanguageExplorer/LanguageAnalyzer framework. It contains design rationales and explains how the different modules of the system work together. Finally, it outlines the different extension points, interfaces and plugin mechanisms which can be used to customize and extend the system. Some general support libraries and implementation techniques will be described in chapter 4.

3.1 The Java programming language

Before the start of a new software project, the selection of an appropriate programming language is one of the first decisions that has to be taken. And of course we were also faced with this problem when the project started some years ago. If political questions can be disregarded, there still remain a couple of objective requirements which have to be fulfilled by the languages in question. As our goal was to build an open system, one of the most important requirements was platform and system independence. We also wanted to use a modern, object-oriented programming language which comes with a rich set of standard libraries. Finally, we looked for a language for which free compilers/interpreters and development environments from different sources were available and which has considerable support from a big user community in order to ensure continuity in the future.

Taking into account these constraints, we finally had the choice between C++ [Str] and Java [GoJoSt], which both seemed to fulfill the desired requirements. Although C++ has the reputation of generating faster code and offers more elaborate language concepts like multiple inheritance and genericity1 compared to Java, we favored Java in the end for two main reasons. The first one was the availability of many free, professional integrated development environments (IDEs) [SAFKKC, BGGSW, JBuil] for Java. The second, and in our eyes the most important, advantage of Java is the tremendous number of available standard and extension libraries for any imaginable application domain.

1Starting with version 1.5, the Java programming language will also offer genericity as a language feature. Although different approaches which extend Java with generics have existed for a while [CarSt, OdWa, MyBaLi], we did not use them in the current work. Using generics may, however, be an option for the future development of the system, as they are becoming a standardized feature now.


And because Java is a language which is translated to byte code and executed by a virtual machine (JVM) [JVM], all these libraries2 are available on every platform for which a Java virtual machine exists. This benefit, combined with the better tool support, outweighs the performance advantage of C++ in our opinion.

3.1.1 The Java APIs

Modern software development is not possible today without the use of supported standard libraries. Especially in the area of graphical user interfaces (GUIs), the needs and expectations of the users can only be fulfilled by building upon the predefined widgets defined in such libraries. But file input and output (IO), the processing of XML documents or the handling of different media types like graphics or sound are also hard to cope with if no supporting libraries exist. The advantage of Java is that it has constantly increased the number of its standard libraries since its appearance in 1995. Today these include libraries for GUIs, IO, networking, image processing, sound, input methods, UNICODE text processing, help systems, XML processing, cryptography, persistence, remote method invocation, containers and basic algorithms, to name just a few. And, as already mentioned earlier, if these libraries are implemented in pure Java, they are system independent and run on every hardware and under every operating system which supports a JVM. During the implementation of LanguageExplorer and LanguageAnalyzer we used most of these libraries. The two most important ones, on which our system is directly built, are the XML libraries commonly known under the name JAXP (Java API for XML Processing) [MacLa] and the GUI library commonly known under the names JFC (Java Foundation Classes) and Swing [ELW]. The next sections will introduce these libraries in some more depth.

Swing and the Java Foundation classes

It was a big deal that, at the time of its first appearance, Java offered a system independent, easy-to-use widget set for GUI programming. This Abstract Window Toolkit (AWT) [Zuk97] was implemented as a kind of unification layer for the different, platform specific widget sets. Every AWT component was in fact just a wrapper class for a concrete counterpart (called peer) provided by the host system. These peers were internally accessed with the help of the Java Native Interface (JNI) [Lia]. This architecture, however, made it particularly hard to port the AWT to new operating systems or native widget sets and restricted the number and functionality of the widgets provided by the AWT to the lowest common denominator of all the supported native widget sets. These problems led to the development of the Java Foundation Classes, a set of GUI libraries composed of the old AWT, a new 2D library called Java2D, support libraries for accessibility and internationalization, and Swing, a platform independent, rich widget set implemented completely in Java. The new libraries are based on modern design principles and commonly accepted design patterns. The most important ones in this context are the Model View Controller pattern (see section 4.3.2 for a discussion of the implementation of the MVC pattern in Swing) and the Observer pattern. Other features provided by Swing are the pluggable look and feel architecture which allows a customization of the look and feel, the input method framework which gives the developer the opportunity to develop system independent input methods for the input of arbitrary languages with the help of a normal keyboard, and accessibility support, which allows developers to create applications with assistive technologies that support disabled people in using them.

2This is not strictly true, because Java programs may execute platform-dependent code through the Java Native Interface (JNI) [Lia]. However, most of the available libraries are written in pure Java and rely solely on the services provided by the Java Runtime Environment (JRE).

For LanguageExplorer and LanguageAnalyzer, the pluggable look and feel architecture has been used, for example, to create a new, multi-lingual user interface (see section 4.3) and the input method framework has been used to create input methods for Cyrillic and Hebrew letters (see section 29). Figure 3.1 gives an overview of the widgets of the Swing library. Most of them have been used in our applications and will be subsequently referenced in the description of LanguageExplorer and LanguageAnalyzer.

Figure 3.1: A class diagram of the Swing classes along with the few AWT classes they are built on. Notice that these AWT classes are simple containers or graphic panes, so there is only a minimal system dependency compared to the AWT widgets, where every single widget depends on the corresponding system widget.

The Java text package

Java provides an extensive collection of classes for working with text. One of the innovations of Java was that the representation format of all kinds of textual data, of the language itself as well as of all the textual data types, is fully based on the UNICODE [U30] standard.


This solves a lot of problems of older programming languages like C or C++, which usually use an 8-bit character set for the built-in textual data types and therefore always have to use special libraries (e.g. ICU [ICU]) if they want to process textual data stored in the UNICODE format. As already discussed in section 1.1.2, the UNICODE standard not only defines a character encoding for a wide range of modern and ancient languages, it also defines methods for handling collation, directionality, searching and other important language aspects for texts stored in that encoding. Figure 3.3 gives an overview of the different text related classes in the standard Java libraries. As can be seen in the figure, they are split across several packages. Among others, the package java.text contains the class Bidi for determining the writing direction of a text, collator classes for doing locale-sensitive string comparisons and the class BreakIterator which can be used, for example, to find word and sentence boundaries in a text. Most of the tasks performed by these classes seem to be trivial. However, for languages other than English they can be quite complicated. There are, for example, languages like Hebrew which have different writing directions for text (right to left) and for numbers or foreign words (left to right), and these writing directions can be arbitrarily nested. Other languages, like Thai, need special, dictionary-based word iterators because the text contains no word separators. Collation is also not straightforward, because every language has its own collation rules for accented and other special characters. And finally, as a consequence of the UNICODE standard, letters can have several representations: a single character code entry, a composition of several character code entries, or just a part of a character code entry representing a ligature. Therefore, even finding single letters in a character stream may be a nontrivial task. Together with the character class java.lang.Character, the classes of the java.text package serve as a base library for all other classes dealing with text in Java. With java.util.regex, a powerful new regular expression package has been added in Java 1.4. It allows for Perl-style regular expressions [Friedl] but also supports the full syntax of UNICODE regular expressions [UnReEx]. See section 5.4.6 for an explanation of the usage of regular expressions in LanguageExplorer.
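As a small illustration of the kind of work these classes take over, consider the following sketch, which uses only the standard java.text API; the sample text and the locale are chosen arbitrarily:

import java.text.BreakIterator;
import java.text.Collator;
import java.util.Locale;

public class TextApiDemo {
    public static void main(String[] args) {
        // Word boundaries are found by a locale-aware BreakIterator,
        // not by naively searching for space characters.
        String text = "Die Brüder Karamasow";
        BreakIterator words = BreakIterator.getWordInstance(Locale.GERMAN);
        words.setText(text);
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE;
             start = end, end = words.next()) {
            System.out.println("token: " + text.substring(start, end));
        }

        // Locale-sensitive comparison: a German collator ranks "äpfel"
        // close to "apfel", unlike a plain code-point comparison.
        Collator collator = Collator.getInstance(Locale.GERMAN);
        System.out.println(collator.compare("äpfel", "apfel"));
    }
}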

Figure 3.2: The high level view of a Swing text component: a JTextComponent aggregates a Document (the model), an EditorKit and TextUI (the controller) and a hierarchy of View objects (the view).

Finally, the package javax.swing.text and its sub-packages contain all the classes which are responsible for the visual representation of textual data on the screen and for the interaction of the user with this data. Many parts of LanguageAnalyzer and LanguageExplorer have been derived from these classes. The high level text components, like for example JTextPane for styled text, are all located in the javax.swing package and are all derived from JTextComponent. They are in fact just container classes for the different model, view and controller related classes located in the javax.swing.text package.


Figure 3.3: An overview of the text related classes and their dependencies in the standard Java APIs. The shaded classes correspond to the parts with the same names in figure 3.2.


The model consists of one or more tree-like structures of elements of type Element over a linear character data content. The controller part is a combination of the class TextUI, which associates every element of the model with a corresponding view object, and the class EditorKit, which is responsible for building and changing the model and controlling the user interaction. Finally, the view part is a hierarchy of View objects, created by the controller, which render the different element structures of the model. In order to support styled text, every element of the model can have associated attributes, which in turn may be resolved through global styles.
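A minimal sketch of this division of labour, using plain Swing classes only (no LanguageExplorer code), shows how the element tree of the model can be walked while the EditorKit and TextUI stay responsible for building and rendering it:

import javax.swing.JTextPane;
import javax.swing.text.Document;
import javax.swing.text.Element;

public class ElementTreeDemo {

    // Recursively print the element tree of a Swing text model.
    static void dump(Element e, int depth) {
        String indent = "";
        for (int i = 0; i < depth; i++) indent += "  ";
        System.out.println(indent + e.getName()
                + " [" + e.getStartOffset() + ", " + e.getEndOffset() + ")");
        for (int i = 0; i < e.getElementCount(); i++) {
            dump(e.getElement(i), depth + 1);
        }
    }

    public static void main(String[] args) throws Exception {
        JTextPane pane = new JTextPane();
        Document doc = pane.getDocument();
        doc.insertString(0, "Hello world.\nSecond paragraph.", null);
        // A styled document exposes branch and leaf elements
        // over its linear character content.
        dump(doc.getDefaultRootElement(), 0);
    }
}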

JAXP - The Java API for XML processing

Since version 1.4, Java comes with a new standard library for XML processing. This library, commonly known under its abbreviation JAXP, is in fact just an abstraction layer for some common, standard XML processing libraries. Different implementations of these libraries can easily be plugged into JAXP without the need to rewrite any code which uses just the abstract functionality provided by JAXP and the underlying standard XML libraries. Currently, JAXP supports the two XML parser standards SAX and DOM as well as XSLT, the 'Extensible Stylesheet Language Transformations'.

Figure 3.4: The parser part of the Java API for XML processing together with the SAX and DOM interfaces.

SAX [Broe] is the abbreviation for Simple API for XML. It was developed in a public review process on the xml-dev mailing list at http://xml.org3 and was one of the first libraries available for XML parsing. Initially a Java-only library, it now has language bindings for many other languages like C, C++, Perl and Python. Meanwhile SAX is widely adopted and a de facto standard for XML parsing. SAX is an event-driven, serial-access mechanism that does element-by-element processing of the XML file. It therefore does not need to read the whole file into memory before processing it, which may be a considerable performance advantage for big XML files or files read from a network connection. Notice that SAX itself provides, for the most part, just interfaces, and there exist many different parser implementations which adhere to these interfaces. SAX, however, offers unified methods for setting and querying features and properties of the underlying parser implementations, like for example the validation property or the namespace awareness.

3Today the xml-dev mailing list is hosted by OASIS [OASIS].


Users who wish to use SAX have to implement the different event handler interfaces like ContentHandler or DTDHandler (see figure 3.4), create an XMLReader instance, that is, an object which implements the SAX parser interface, and call the parse() method on this object with the event handler as argument. The parser will then call the user defined callback methods every time the corresponding parts of the XML source are encountered. JAXP, for its part, defines a factory class which facilitates the creation and configuration of different SAX parsers. DOM, the Document Object Model library, is the second parser interface offered by JAXP. The DOM API creates a complete in-memory tree representation of an XML file or allows the user to build up such a model from scratch. Once the DOM is created, it can be navigated, altered and finally saved back as an XML file. In contrast to SAX, the DOM API always works on a complete copy of an XML file. This may be convenient for many applications, however the increased start-up time and memory consumption should always be considered. Notice that, although not mandatory, many DOM implementations internally use a SAX parser to create the in-memory tree representation of an XML file. The Document Object Model is a platform- and language-neutral interface published by the W3C consortium as a technical recommendation [DOM]. Just like SAX specifies callback methods for every part of an XML file, DOM specifies interfaces for every XML entity. As can be seen in figure 3.4, these interfaces are all derived from Node. A DOM is a tree built up from various Node elements. Again, the JAXP API acts as a wrapper and factory for the different DOM implementations which are available. It offers a unified interface for setting and querying various DOM properties to the programmer and frees her from the burden of bothering with the peculiarities of every single implementation. Notice, however, that there meanwhile exist three DOM levels and that, for example, the serialization of a DOM to disk is only standardized from DOM level 3 on. Therefore it is often necessary in practice to cast the DOM objects created by the standard factory classes to their real type in order to take advantage of non-standardized functionality provided by the different implementations.
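The following minimal JAXP/SAX sketch shows the pattern just described; it uses only the standard javax.xml.parsers and org.xml.sax packages, and the file name is made up:

import java.io.File;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class SaxDemo {
    public static void main(String[] args) throws Exception {
        // The JAXP factory hides the concrete parser implementation
        // and offers unified feature/property handling.
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        SAXParser parser = factory.newSAXParser();

        // DefaultHandler implements ContentHandler, DTDHandler etc.
        // with empty methods; only the needed callbacks are overridden.
        parser.parse(new File("document.xml"), new DefaultHandler() {
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                System.out.println("start of element: " + qName);
            }
            public void characters(char[] ch, int start, int length) {
                System.out.print(new String(ch, start, length));
            }
        });
    }
}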

3.2 The LanguageExplorer text classes

As this work is about "structuring, analyzing and presenting" text, the central part of the two applications LanguageAnalyzer and LanguageExplorer is of course the text component. Building on the foundations laid by the Swing text package, we created our own text component XMLEditorPane, which is derived from JEditorPane. As document model it uses an instance of the class XMLDocument, which is derived from DefaultStyledDocument, together with custom elements. Finally, an editor kit of type XMLEditorKit, which extends StyledEditorKit, is responsible for loading and saving documents for LanguageAnalyzer and LanguageExplorer. Several new view classes can be used together with the ones provided by Swing to render the elements of the document model hierarchy. An overview of the basic LanguageAnalyzer/LanguageExplorer text classes is given in figure 3.5. As implied by the different class names, the created text classes are in fact classes which can handle arbitrary content stored in XML format. However, we did not want to implement a generic XML editor, mainly for two reasons. First of all, there already exist quite a number of high quality XML editors. The second and more important reason is that we do not claim to handle every single XML document in a useful way. We think that XML is just a structured text format where the DTD (or the Schema) contains little to no semantics at all. They only define the structure of data, but not its meaning. It makes no sense to try to handle, for example, a MathML [MathML] file and a MusicXML [MusicXML] file with the same editor, although both of them are XML formats.


Figure 3.5: An overview of the basic LanguageAnalyzer/LanguageExplorer text classes and their relation to the standard Java APIs. The shaded classes represent the standard Java API classes with the same names from figures 3.3 and 3.4.

It would be the same as using one and the same text editor for programming and for writing just because our programs and our articles are both stored as ASCII text. Although this would be possible, it is a lot more comfortable to use a special desktop publishing (DTP) system for writing articles and an integrated development environment (IDE) for programming purposes. It is much the same with the XTE format defined in section 2.4. A generic XML editor could be used to have a look at such a file or even to make some small changes in it. But the semantics of the different linking attributes, for example, would be unclear to such an editor. While this would just complicate the navigation in such a document in the common case, it could also lead to a severe corruption of the internal semantics of the file, because XML document type definitions and schemas (see sections 2.1 and 2.1.2) can only describe and ensure the structure of a document, not its semantics. Another aspect is the aesthetics of the presentation. While LanguageAnalyzer is more or less a tool for the structuring and linking of different documents, where the aesthetics of the representation is not the most important thing compared to performance, LanguageExplorer is used as a viewer and reader for true works of art and as such should be able to display them appropriately. Therefore the application of techniques like text anti-aliasing as well as the usage of different, high quality fonts which support hinting and kerning should be possible to allow a reading experience comparable to that of a printed book.

3.2.1 The document class

The document model class XMLDocument is the in-memory representation of a part of an XTE encoded text (refer to section 2.4 for a description of XTE) for the purpose of displaying and editing it. As opposed to the standard DefaultStyledDocument document class, XMLDocument supports an arbitrary number of so-called root elements, each of which corresponds to one of the XTE encodings defined in the XTE file. In fact, every text element nested inside a group element in the XTE file is represented by a single XMLDocument instance, whereas each of the body elements of a text element in the XTE file is represented by a root element in the XMLDocument object. Figure 3.6 depicts this relation graphically.


Figure 3.6: The in-memory representation of an XTE encoded text with the help of the various model related text classes of LanguageAnalyzer and LanguageExplorer. Refer to figure 2.2 for a more detailed picture of the XTE encoding.

As can be seen in figure 3.6, XTE elements will usually be mapped to Element instances in the XMLDocument. However, there is no strict one-to-one mapping, as the exact relation between an XTE element and an Element object is determined by the LoadAction object associated with the corresponding XTE element type (i.e. the XML tag). This association is resolved through an object of type XML (see figure 3.5). The exact procedure of how this resolving takes place will be explained in more depth in section 3.2.2. For now it is enough to assume a one-to-one relation between the XTE elements and the Element objects in an XMLDocument, where the attributes of the XTE element are stored in the AttributeSet of the Element object. One exception to this rule, which is also visible in figure 3.6, should be mentioned here, however: the content elements of an XTE text element are collapsed and their character data is stored as the content of the XMLDocument. This is done by a TextTagAction object, which is the LoadAction usually associated with an XTE content element. The text which is stored as content of the XMLDocument will be referenced by the elements created in the XMLDocument by translating the start and end attributes of the corresponding XTE elements to the new content representation. Figure 3.7 shows the XMLDocument class in some more detail along with the inner classes it defines. The different classes derived from LoadAction as well as the ImageReader class have to be defined as inner classes, because they are used to build up a new document model and therefore need access to some document methods which have protected access.


Figure 3.7: The UML diagram of the XMLDocument class. Again, shaded classes denote classes from the standard Java text packages (see figure 3.3).

Several methods which deal with the creation, maintenance and removal of the different root elements, each of which represents a single XTE encoding, have been added to the document class. There is always a default root element or encoding which is used by the TextUI class (the controller part of a text component) as the starting point for building up the view hierarchy. The two event classes AddNewRootEvent and ReloadRootViewEvent have been added to signal interested listeners that new root elements (i.e. encodings) have been added to the document or that the default root element has been changed. One of these listeners is, for example, the associated TextUI object, which rebuilds the view hierarchy such that it always corresponds to the structure represented by the default root element. Notice that document models may be shared between different text components. The KWIC-Index window of LanguageExplorer (see figure 5.2 on page 119), for example, is a text component of type KwicTextArea (see figure 3.5) which shares its document model with the corresponding text component for which the KWIC-Index has been created. The KWIC-Index is in fact just a custom encoding of the text content. As such it is represented in the model by an element hierarchy with a root element which is created on the fly after a corresponding user request.
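A sketch of how client code might switch between such parallel encodings, using the XMLDocument methods shown in figure 3.7; since the exact signatures are not reproduced in the text, the parameter and return types below are assumptions for illustration only:

import javax.swing.text.Element;

public class EncodingSwitcher {

    // Make the encoding with the given name the default root element,
    // which in turn makes the TextUI rebuild the view hierarchy.
    static void showEncoding(XMLDocument doc, String encodingName) {
        Element[] roots = doc.getRootElements();
        for (int i = 0; i < roots.length; i++) {
            // Each root element corresponds to one body element of the
            // XTE file, i.e. one parallel encoding of the same content.
            if (encodingName.equals(roots[i].getName())) {
                doc.setDefaultRootElement(roots[i]);
                return;
            }
        }
        // Unknown encoding: create a fresh root element for it
        // (as done on the fly for a newly requested KWIC index).
        doc.createNewRootElement(encodingName);
    }
}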

3.2.2 The editor kit

Reading XTE files and transforming them into the internal representation on the one hand, and storing documents from the internal representation as XTE files on the other, is the main duty of the editor kit class XMLEditorKit. To achieve this functionality, it uses many of the XML related classes described in section 3.1.1. However, not only XTE files can be loaded. In a kind of bootstrapping process, other formats can be loaded and translated to the XTE format as well. If the user requests the loading of a new file, this request is routed to the corresponding XMLEditorKit method. The editor kit has to decide, based on the file format, how the file should be loaded. All the different input formats supported by LanguageAnalyzer, like for example pure text, image or XTE files, are transformed into a uniform internal representation.

If the loaded file is not already XTE encoded, the editor kit creates a default document with a default element structure to allow basic display, navigation and editing of the content. After the user has finished the editing process, the document is stored as an XTE formatted XML file. Currently LanguageAnalyzer supports pure text, various graphic formats like JPEG, PNG and GIF, and XTE encoded XML files as input formats and XTE encoded XML files as output format, while LanguageExplorer handles XTE files only. It is easy, however, to add new input and output formats to LanguageAnalyzer by using the plugin mechanism described in section 3.6. In fact, these plugins just have to build an appropriate document model for the desired input formats or serialize the internal document model to the desired output format. Reading other XML formats is especially easy because the loading of XTE documents is already designed to be highly customizable. This is necessary because XTE is an open encoding which is intended just as a starting point for users who wish or need to define their own encodings (see section 2.4). It is therefore essential to give these users a possibility to influence the way their proprietary encodings are loaded, transformed into the internal representation and finally displayed on the screen. This mapping between XTE elements and XMLDocument elements is handled by the abstract class XML and its concrete descendant XMLFlavour, which are shown in figure 3.8. The XML class maps document type definitions (DTDs) to XMLFlavour objects. For every DTD it instantiates an XMLFlavour object, associates it with the name of the DTD and stores it in a static map from where it can be queried by the user.

Figure 3.8: The class XMLEditorKit together with the various helper classes used for loading, storing and displaying an XTE file.

The XMLFlavour class maps the tags of the DTD to special action and view classes which are responsible for loading, saving and displaying the corresponding elements. For this purpose it uses simple, textual configuration files with the base name of the DTD which conform to the Java property file format. This file format contains key/value pairs separated by an '=' character.


rated by an ’=’ character. In our case the key represents the tag name while the value part contains the fully qualified class name of the corresponding action or view class. Once an XMLFlavour object has read its configuration files, it searches the specified classes on the class path, loads them dynamically into the running JVM and stores them in a local map from where they can be queried by using the corresponding tag name. The settings from the con- figuration file can be overridden by the special loadClass, saveClass and viewClass attributes which can be used on every element in the XTE file. The final process of loading an XTE file into LanguageAnalyzer or LanguageExplorer is as follows: The editor kit creates an object of type XMLReader which is an instance of a SAX event handler. As soon as the document type of the file and the types of the different XTE encodings in the file are available during the parsing of the DTD, the corresponding XMLFlavour objects are created and associated with the encoding names. Finally, at the time when the first ordinary element is reported by the SAX parser, the XMLReader object can query the XML object with the tag name of the element for the proper load action and execute it with the current element as argument. This load action will subsequently initiate the creation of the appropriate model representation in the XMLDocument object. Saving works exactly the other way round with the only difference that the DOM API is being used instead of the SAX API used in the loading case. Depending on the chosen output format, the editor kit queries the appropriate save action objects for every element. The duty of these save action objects is the creation of the necessary nodes in the DOM tree for the XMLDocument element they are responsible for. Once the whole document is translated into the DOM representation, the DOM can be written to an XML-file by using its built-in write method.

3.2.3 The view classes

Another responsibility of the editor kit, not discussed in the previous section, is the creation and provision of a so-called view factory. The view factory is responsible for creating the view objects which render the different elements of the document model on the screen. For performance reasons, the view classes are lightweight objects not derived from any of the standard Swing components shown in figure 3.1. They just render a part of the model to the appropriate part of the text component. Usually, every element is represented by a view object; however, there again is no strict one-to-one mapping between them. A view object which represents a branch element may, for example, decide not to act as a container for the view objects of its child elements but instead to render the child elements directly itself. In fact, every view object can be thought of as a kind of TeX box [Kn91a], and boxes representing child elements are nested inside the box of their parent element. Every box lays out and renders its child boxes depending on their layout attributes. In this way the whole document model is recursively represented on the screen. Besides participating in the layout and rendering process, the second task of the view classes is to translate between the view coordinate space and model positions and vice versa. This is a crucial and nontrivial task which is necessary to enable comfortable navigation and editing of the represented documents. Although the Swing library already comes with quite a few view classes, we still had to develop new ones to cover our special needs. We developed, for example, the ImageView class which represents a bit-mapped version of a text (facsimile) and allows the display of child elements as arbitrary, possibly nested regions which can easily be navigated with the usual cursor keys (see the left part of figure 3.9). Refer to section 2.4.4 for a description of how facsimile texts are encoded in XTE. Another example of custom view classes are the LineView and PageView classes which can be used to display a line- and page-wise encoded text much the same way as in a printed edition.


The lines and pages can be augmented with additional meta-information, like for example page and line numbers, which do not belong to the text model (see the right side of figure 3.9).

Figure 3.9: The left picture shows how the encoding of a facsimile text (in this case words and lines) is rendered as gray blocks onto the image of the text by the ImageView class. The cursor is displayed as a blue rectangle at the current model position. The right side shows a picture produced by the PageView class, which internally uses the LineView objects. Notice the line and page numbers, which do not belong to the text model but are added as a kind of decoration by the view classes.

The user may define other view classes at any time. Adding new classes to the system is straightforward because XMLViewFactory, the view factory used by the XMLEditorKit, also uses the XML class for resolving the appropriate view classes for the elements of the document model. This can be done by editing the corresponding configuration files or by inserting the names of the desired view classes as attributes directly into the XTE documents.
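To give an impression of what such a custom view class involves, here is a deliberately minimal subclass of javax.swing.text.View that merely draws a box around its allocation; it is a sketch only, as a real view like LineView or PageView would lay out child views and render text:

import java.awt.Graphics;
import java.awt.Rectangle;
import java.awt.Shape;
import javax.swing.text.Element;
import javax.swing.text.Position;
import javax.swing.text.View;

public class BoxOutlineView extends View {

    public BoxOutlineView(Element elem) {
        super(elem);
    }

    public float getPreferredSpan(int axis) {
        return 100f;  // fixed size; real views derive this from content
    }

    public void paint(Graphics g, Shape allocation) {
        Rectangle r = allocation.getBounds();
        g.drawRect(r.x, r.y, r.width - 1, r.height - 1);
    }

    // Translate a model position to view coordinates (crude: the
    // whole allocation) and vice versa (crude: the element start).
    public Shape modelToView(int pos, Shape a, Position.Bias b) {
        return a.getBounds();
    }

    public int viewToModel(float x, float y, Shape a, Position.Bias[] bias) {
        bias[0] = Position.Bias.Forward;
        return getStartOffset();
    }
}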

3.3 The LanguageExplorer file formats

LanguageExplorer and LanguageAnalyzer use two different sorts of file formats. The first, and most important one, is the leb file format which is used to store XTE documents along with related data files. The second one is the file format used by LanguageExplorer and LanguageAnalyzer to persistently store user preferences between different executions of the program. These two file formats will be discussed in the next sections.

3.3.1 The LanguageExplorer book format

As described in full detail in section 2.4, LanguageExplorer and LanguageAnalyzer documents are stored as XML files using the XTE encoding scheme. Although this scheme is quite flexible and extensible, there exist situations where even more functionality is needed. This is especially the case if third party content, like for example data from dictionaries and encyclopedias, should be bundled with an XTE document, if facsimile pictures and sound files need to be stored along with the document, or if the document content should be encrypted.


For this purpose an additional container format has been defined which is based on Java archive (jar) files. The jar file format itself [CaWaHu] is based on the popular zip file format, which uses a combination of the Lempel-Ziv algorithm [LeZi] and Huffman coding [Huff] to compress files. The innovation of the jar format is to define and add meta-information to the archive in a well-defined way. This information can be used, for example, to cryptographically sign the archive or to improve the processing of the file in certain common cases, like loading classes from it. All the meta-information available for a jar file is located inside the archive in a special subdirectory called META-INF. The most prominent file in this directory is the so-called manifest file MANIFEST.MF, which can be used to specify arbitrary attributes as key/value pairs. There exist several standard attributes, like for example Manifest-Version, which gives the version of the manifest file format, and Main-Class, which can be used to specify the main class file if the archive stores Java class files.

However, it is also possible to define custom attributes which consist of arbitrary key/value pairs. Together with the various classes offered by the standard Java API in the java.util.jar package, which can be used to easily create and access jar files, the jar file format can be handled quite comfortably within one's own applications. For the jar file format used here, the standard .jar file suffix has been replaced by the suffix .leb, which stands for "LanguageExplorer book", in order to simplify the identification of the documents in the file system. Moreover, new, LanguageExplorer specific attributes have been defined which can be divided into different groups, as shown in table 3.1:

Main LanguageExplorer attributes.
  Name: The name of the XTE encoded XML file.
  Book-ID: A string which contains no white space and which should be unique across all the different leb files. Used as a key into the personal preferences file, where user settings like for example the font family and size can be stored on a per book basis (see section 3.3.3).

Encryption attributes.
  Encrypted: Indicates whether the XTE XML file is encrypted (yes | no).
  Encryption-Provider: If the XTE XML file is encrypted, this attribute may be used to specify the provider of the encryption engine used to encrypt this file. See section 3.3.2 for more information.
  Encryption-Algorithm: If the XTE XML file is encrypted, this attribute may be used to specify the encryption algorithm used for encryption.

Bibliographic attributes.
  Title<n>: The title of the nth document in the XTE file.
  Author<n>: The author of the nth document in the XTE file.
  Language<n>: The ISO-639 [ISO639] two letter language code of the language of the nth document in the XTE file.

Extension attributes.
  Dictionary: Indicates whether the leb file contains dictionaries for the documents encoded in the XTE file (yes | no | partial).
  Dictionary<n>: The name of the nth dictionary file. The name should begin with the hyphen separated ISO-639 two letter language codes of the languages provided by the dictionary.
  Encyclopedia: Indicates whether the leb file contains encyclopedias for the documents encoded in the XTE file (yes | no | partial).
  Encyclopedia<n>: The name of the nth encyclopedia file. The name should begin with the ISO-639 two letter language code of the encyclopedia language.

Table 3.1: Custom attributes defined for the leb file format. Inside the manifest file, keys and values are separated by a combination of a colon character and a space ': '. Keys have to begin at the first column of a line. Values can span several lines; continuation lines are signaled by a space character at the beginning of a line.

The main attributes are used to identify the XTE document. The bibliographic attributes are used to get a quick overview of the contents of an XTE file without the need to parse the XTE file itself. They are used, for example, in the accessory component of the LanguageExplorer open file dialog (see figure 5.3 on page 121), but they can also be useful in the case where the content of the XTE file is encrypted. The extension attributes can be used to declare the names of certain extension files, like for example dictionaries or encyclopedias, which are packed together with the XTE file in the archive.
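Because a leb file is an ordinary jar archive under a different suffix, its custom attributes can be read with the standard java.util.jar classes. A short sketch, with a made-up file name and the attribute names taken from table 3.1:

import java.util.jar.Attributes;
import java.util.jar.JarFile;
import java.util.jar.Manifest;

public class LebManifestDemo {
    public static void main(String[] args) throws Exception {
        // Open the leb archive just like any other jar file.
        JarFile leb = new JarFile("faust.leb");  // hypothetical file name
        Manifest manifest = leb.getManifest();
        Attributes attrs = manifest.getMainAttributes();

        // The custom attributes of table 3.1 are plain key/value pairs.
        System.out.println("Name:      " + attrs.getValue("Name"));
        System.out.println("Book-ID:   " + attrs.getValue("Book-ID"));
        System.out.println("Encrypted: " + attrs.getValue("Encrypted"));
        leb.close();
    }
}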

3.3.2 Encryption of LanguageExplorer books

Works of literature are usually protected by copyright for a certain amount of time. The details of how copyright rules apply to different works in different countries are not the subject of this work; however, the tools presented here are aware of this problem and designed in a way which allows the encryption of the content in question. In this way, it becomes possible to license and distribute even copyright protected material. But encryption could also be desirable to protect the mark-up if the content itself is already freely available. Considering, for example, an edition which combines a novel, sentence-wise aligned with several different translations into other languages and possibly augmented with additional historical information and dictionaries, it may very well be worthwhile to protect such an editorial work independently of the underlying content. Usually only the XTE file in a leb archive will be encrypted, but it is also possible to encrypt the extension files like dictionaries or encyclopedias. For the encryption of the files the Triple-DES-EDE algorithm is used. Triple-DES-EDE is the usual DES [DES] algorithm applied three times in turn to encrypt, decrypt and again encrypt the data source in question with three different keys. DES is a symmetric block cipher cryptosystem, which means that it uses the same key for encryption and decryption. Standard DES uses 64-bit keys and Triple-DES uses three 64-bit keys. Our system currently varies only one of the three 64-bit Triple-DES keys on a per user and book basis. That is, for every combination of a user and a book a new 64-bit key is generated, and this key is used together with the two other, currently hard-wired 64-bit keys to encrypt the content of the book with the Triple-DES algorithm. In the future, however, the remaining two 64-bit keys could be fetched, for example, from a license server or from a license file to enable more sophisticated licensing policies. To save the user from having to remember a randomized 64-bit key value for every encrypted book, a so-called password based encryption algorithm (PBE) [PKCS5] is applied to encrypt the key with a user-supplied, secret password. PBE works by generating a message digest from the user supplied password with a one-way hash function and then using the created hash value as a key for a symmetric block cipher to encrypt the requested content. Our system currently uses PBE with SHA1 [SHA] as hash function and Triple-DES as block cipher; however, the applied algorithm can be configured with the Encryption-Algorithm attribute (see table 3.1) in the leb file. In order to decrypt an encrypted LanguageExplorer book, the user has to supply the encrypted key and his secret password. The password will be used to decrypt the encrypted key, which is a part of the 192-bit Triple-DES key, and finally the Triple-DES algorithm will be used to decrypt the encrypted XTE file.


Figure 3.10: Encryption and decryption of leb books. The user gets only the data which is marked as user data. Notice that decryption happens fully inside the LanguageExplorer application, so the plain content will only be available within the application. Also, because the user has just a part of the Triple-DES key, he cannot gain access to the encrypted content by manually decrypting it.

Notice that the encrypted key has to be entered into the system only once. Afterwards it will be persistently stored and associated with the corresponding LanguageExplorer book in the user's preferences file (see section 3.3.3). The next time the same encrypted book is loaded, the encrypted key will be available from the user's preference file and only the password will have to be supplied (see section 5.4.1). Notice that it is also possible to store the password in the user's preference file; however, this is not absolutely safe, because even though the password will be stored in an encrypted form, the password used to encrypt and decrypt the user passwords is currently hard-wired into the application. Figure 3.10 shows how the encryption and decryption of LanguageExplorer books takes place. In a web-shop scenario, a user who orders a copyright protected book would be queried for a password. A new 64-bit key would be generated and used to encrypt the desired book using the Triple-DES algorithm as described above. Then the 64-bit key would be processed with a PBE algorithm which uses the user-supplied password as parameter, resulting in an encrypted version of the key. The user would receive the encrypted book along with the encrypted key. If he wants to read the book, he has to provide the encrypted key along with his secret password in order to first decrypt the encrypted key and then finally decrypt the whole book with the help of the now decrypted key. For the implementation of the cryptographic features described so far we used the Java Cryptographic Architecture (JCA) and the Java Cryptographic Extension (JCE) [GaSo], both of them standard Java APIs which define an abstract interface for cryptographic algorithms and providers of cryptographic services. Libraries of different providers can easily be plugged into the architecture and their algorithms can be used in a consistent way.
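A minimal JCE sketch of the password-based part of this scheme; it assumes that the standard algorithm name PBEWithSHA1AndDESede is available from the installed provider, and the salt, iteration count and key bytes are made up for illustration:

import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;
import javax.crypto.spec.PBEParameterSpec;

public class PbeSketch {
    public static void main(String[] args) throws Exception {
        // Derive a key from the user-supplied password (PBE with SHA1
        // and Triple-DES, as described above).
        char[] password = "secret".toCharArray();
        byte[] salt = { 1, 2, 3, 4, 5, 6, 7, 8 };  // made-up salt
        SecretKeyFactory factory =
                SecretKeyFactory.getInstance("PBEWithSHA1AndDESede");
        SecretKey pbeKey = factory.generateSecret(new PBEKeySpec(password));

        // Encrypt the (here: made-up) 64-bit book key with the derived
        // key; the encrypted key is what the user receives and stores.
        Cipher cipher = Cipher.getInstance("PBEWithSHA1AndDESede");
        cipher.init(Cipher.ENCRYPT_MODE, pbeKey,
                new PBEParameterSpec(salt, 1000));
        byte[] bookKey = new byte[8];  // placeholder for one 64-bit key
        byte[] encryptedKey = cipher.doFinal(bookKey);
        System.out.println(encryptedKey.length + " bytes of encrypted key");
    }
}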

3.3.3 LanguageExplorer configuration files

Complex applications with many configuration options need a possibility to persistently store these options across different program executions to free the user from the burden of adjusting them again every time he starts the application. For this purpose, LanguageExplorer supports personal configuration files which are stored in the home directory of every user. The format of this file, which is named .LanguageExplorer for LanguageExplorer and .LanguageAnalyzer for LanguageAnalyzer, is a simple text format where each line corresponds to a user preference and each user preference is composed of a name and a value separated by an equal sign '='.

Currently many characteristics of the GUI, like for example the window geometry, are automatically stored in the preference file. But the preference file is also used to store certain attributes like the font family and font size on a per book basis. In the font selection dialog, for example (see figure 5.11 on page 133), the user can select whether he wants to save the settings in the preference file for the current book, whether he wants to save them as the default settings which are applied to every book for which no custom settings are available, or whether he wants to keep them just local to the running application. As mentioned before, the preference file is also used to store the keys of encrypted books after they have been decrypted for the first time. It is also possible to store the user's passwords there; however, even though they are encrypted before they are written to the preference file, this is not very safe, because the password used for the encryption is hard-coded into the application. The current preferences implementation is sufficient for the present needs of LanguageExplorer and LanguageAnalyzer. However, a more powerful approach for the storage of preferences may be appropriate in the future. This could use, for example, the preferences package java.util.prefs which has been newly introduced in Java 1.4. It stores the preferences as XML files instead of plain text, it separates user from system properties and it organizes them in a tree-like structure. This could be an advantage over the current flat storage model, especially for the different plugins and extensions that need to store their own configuration data.
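As a sketch of what such a java.util.prefs based approach could look like, with node and key names invented for illustration (a per-book node could, for instance, be keyed by the Book-ID from the leb manifest):

import java.util.prefs.Preferences;

public class PrefsSketch {
    public static void main(String[] args) {
        // User preferences are organized in a tree, separate from
        // system preferences; here one node per book.
        Preferences book = Preferences.userRoot()
                .node("LanguageExplorer/books/my-book-id");
        book.put("fontFamily", "Serif");
        book.putInt("fontSize", 14);

        // Reading falls back to the supplied defaults if unset.
        System.out.println(book.get("fontFamily", "SansSerif"));
        System.out.println(book.getInt("fontSize", 12));
    }
}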

3.4 The design of LanguageAnalyzer

LanguageAnalyzer, the editor part of the system described in this work, is intended as an application which can be used to create and edit the various encodings of a text and to establish links between the elements of different encodings in the same or even in other documents. The main considerations taken into account during the design process have been to make the above-mentioned tasks as comfortable as possible for the user, but also to keep the application as simple as possible. This resulted in the decision to allow at most two text documents to be worked on simultaneously. Therefore, the main window of LanguageAnalyzer is horizontally split into two main parts. Each of these two parts is vertically split into a text window, which displays the actual content of a document, and a window which shows the different encodings of the content. Notice that the text window need not necessarily contain an electronically encoded text. It may also be the facsimile picture of a text or the wave graph of a spoken text. The different encodings are displayed as tree views, where every encoding is represented by its own tree and every tree is located in its own tab. As indicated by the dashed lines in figure 3.11, the size of the two main windows as well as the size of a text window and the corresponding encoding window may be adjusted relative to each other. Different tools have the opportunity to plug into the menu bar and the toolbar. These tools will usually operate on the content and/or the encodings of one or of both documents and, as a result, will produce new encodings or change the existing ones. Depending on the available input plugins, different media types like text, graphics or sound files may be loaded. The document windows can be saved and loaded either together or separately, depending on the user's requirements. More detailed usage instructions for LanguageAnalyzer can be found in chapter 6.


Figure 3.11: A schematic view of the main application window of LanguageAnalyzer: two text windows, each paired with a tree view of the text markup that is always synchronized with the currently active markup, framed by a menubar, a toolbar and a statusbar. Dashed lines indicate draggable frame borders in the final realization.

The text windows in figure 3.11 are implemented with the help of the text classes described in section 3.2. The tree views in the encoding windows in figure 3.11 are implemented with the help of a customized version of the standard JTree class and placed into a tab of type JTabbedPane. Menu-, tool- and status-bars are wrapped in a container of type ScrollabelBar (see section 4.4) to prevent them from cluttering the GUI if they grow because they are unexpectedly extended by many client plugins. The encoding window and the corresponding text window are coupled by listeners, so each of the windows will be notified and updated if the encoding structure on the one side or the content on the other side changes. Notice that it is very well possible to create editions with more than two parallel document versions by using tools and plugins which are supplied with LanguageAnalyzer. It was just a simple design decision to restrict the number of parallel text versions which are visible together in the GUI.

3.5 The design of LanguageExplorer

LanguageExplorer is the viewer and browser component of the system described in this work. Because far more people will usually use LanguageExplorer to work with an edition created with LanguageAnalyzer than have been involved in creating it, one of the main requirements during the design process has been to achieve a maximum of user friendliness. Besides the menu bar and toolbar, the whole area of the application window is occupied by the different text windows. Notice that LanguageExplorer supports an arbitrary number of parallel document versions, only restricted by the physical extent of the screen. Initially, the available space is equally distributed between the different text windows. However, as indicated by the dashed lines in figure 3.12, the text windows can be arbitrarily resized with respect to each other. Because many of the available actions and tools need a target document on which they will operate (e.g. searching), each of the text windows is equipped with a local toolbar. The encoding chooser, which can be used to choose the default encoding, that is the encoding on which the view of the document content is based, is a prominent entry in this local toolbar.



Other tools are free to plug into the local as well as into the main toolbar. The navigation bar, which is located in the lower part of each text area, offers the possibility of structural navigation in the document based on the current default encoding (i.e. the encoding which has been chosen with the encoding chooser). But just displaying the aligned, parallel document versions is not the only job that LanguageExplorer has been designed for. Many other tools, like for example dictionaries or index generators, can be built in. The data generated or found by these tools can be presented in two additional, so-called extension areas, which can be opened in the upper and the lower part of the application window (see figure 3.13).

Figure 3.12: Layout of the main LanguageExplorer application window: three text windows, each with a local toolbar above and a navigation bar below, framed by a menubar, a toolbar and a statusbar. An arbitrary number of parallel text windows is supported.

Figure 3.13: The LanguageExplorer application window with the upper and lower extension areas.


The user can individually adjust the size of both these windows, and both of them can be closed with a single click if the information presented inside them is not needed anymore. Each of these extension windows can contain a number of different tabs which are created by the different plugins on user request, and each of these tabs can be removed separately by the user. The upper extension area is intended for tools like dictionaries or encyclopedias, but also for displaying annotations or other out-of-band data. In general it is designed to display external, static data which is not strictly contained within the analyzed document. In contrast, the lower extension area is intended for data which can be generated from the document content on the fly, like for example a KWIC index (see section 5.4.3) or a word frequency list. Every newly requested KWIC index will open a new tab in the lower extension area. The user can choose which of the indices he wants to keep and which he wants to remove. Closing the whole extension area effectively only hides the available tabs. They are still accessible when the extension area is opened again.

3.6 The plugin concept

Our framework offers three different extension points which differ in their complexity and satisfy different needs. This section will describe each of them in more detail. The common ground among all these extensions is the fact that they have to be realized as Java classes which implement certain interfaces. To make them available to the applications, they have to be accessible on the system classpath. This can be achieved, for example, by bundling related extension classes into a jar file and copying this jar file into the extensions/ directory of the LanguageAnalyzer or LanguageExplorer installation directory. The applications will automatically load these jar files on start-up and inspect the classes available in them with the help of the Java Reflection API [CaWaHu] in order to make them available.
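The loading mechanism itself is not spelled out in this text, so the following is only a minimal sketch of how such a start-up scan could look, using nothing beyond the standard Java library. The directory name extensions/ comes from the text; the Plugin marker interface is a hypothetical stand-in for the actual interfaces checked by the applications.

import java.io.File;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

public class ExtensionLoader {
    /** Hypothetical marker interface; the real applications check
     *  for their own extension interfaces instead. */
    public interface Plugin {}

    public static List<Class<?>> loadExtensions(File extDir) throws Exception {
        List<Class<?>> found = new ArrayList<Class<?>>();
        File[] files = extDir.listFiles();
        if (files == null) return found;
        List<URL> urls = new ArrayList<URL>();
        for (File f : files)
            if (f.getName().endsWith(".jar")) urls.add(f.toURI().toURL());
        ClassLoader loader = new URLClassLoader(urls.toArray(new URL[0]));
        for (URL u : urls) {
            JarFile jar = new JarFile(new File(u.toURI()));
            for (Enumeration<JarEntry> e = jar.entries(); e.hasMoreElements(); ) {
                String name = e.nextElement().getName();
                if (!name.endsWith(".class")) continue;
                name = name.substring(0, name.length() - 6).replace('/', '.');
                try {
                    // Inspect the class via reflection; keep it if it
                    // implements the expected plugin interface.
                    Class<?> c = Class.forName(name, false, loader);
                    if (Plugin.class.isAssignableFrom(c)) found.add(c);
                } catch (Throwable ignored) { /* skip unloadable classes */ }
            }
            jar.close();
        }
        return found;
    }
}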

3.6.1 Handling new XTE elements

First of all, the system can be extended to support new element types. This will be the most common extension requested by users, because potentially every new element introduced in a customized XTE encoding can require special handling. In order to support such a new element in LanguageAnalyzer and LanguageExplorer, up to three different classes have to be supplied. If the element requires a customized loading procedure, a new load and probably also a new save class should be implemented. Doing this is simply a matter of implementing the two public interfaces LoadAction and SaveAction which are defined inside the XMLDocument class (see section 3.2.2). The standard save and load classes, which are implemented as inner classes of XMLDocument, can serve as templates for new classes. Some elements may also require special handling in the way they are displayed on the screen. In such a case, customized view classes can be implemented for the corresponding elements. The only convention new view classes have to adhere to is that they must be derived from the abstract class javax.swing.text.View or one of its numerous child classes. Again, the available view classes may serve as a starting point for new experiments. Notice that the mapping of the new elements to the corresponding load, save and view classes can be established either in the textual configuration files described in section 3.2.2 or directly in the XTE files by using the loadClass, saveClass and viewClass attributes defined in the base XTE DTD (see listing 2.13 in section 2.4).
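As an illustration of the view side of such an extension, the following is a minimal sketch of a custom view class which respects the only convention stated above, namely being derived from the abstract Swing class javax.swing.text.View. The painting logic (a plain box drawn around the element's allocation) and the fixed spans are invented for the example; real view classes in the framework are considerably more elaborate.

import java.awt.Graphics;
import java.awt.Rectangle;
import java.awt.Shape;
import javax.swing.text.BadLocationException;
import javax.swing.text.Element;
import javax.swing.text.Position;
import javax.swing.text.View;

public class BoxedElementView extends View {
    public BoxedElementView(Element elem) { super(elem); }

    // Fixed spans, purely for the sketch; a real view would measure content.
    public float getPreferredSpan(int axis) {
        return (axis == View.X_AXIS) ? 100f : 20f;
    }

    // Draw a plain box around the area allocated to this view.
    public void paint(Graphics g, Shape allocation) {
        Rectangle r = allocation.getBounds();
        g.drawRect(r.x, r.y, r.width - 1, r.height - 1);
    }

    // Minimal model/view coordinate mappings required by the View contract.
    public Shape modelToView(int pos, Shape a, Position.Bias b)
            throws BadLocationException {
        Rectangle r = a.getBounds();
        return new Rectangle(r.x, r.y, 0, r.height);
    }

    public int viewToModel(float x, float y, Shape a, Position.Bias[] bias) {
        return getStartOffset();
    }
}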


3.6.2 Support for new media types

Supporting new media types, for example sound files, requires an additional effort compared to the handling of new elements which was described in the previous section. Of course, new media types will almost surely require new element types, but that is not enough. Because they are not initially available in an XML format, they have to be converted in a kind of bootstrapping process into an XTE format. This is exactly the task performed by a media reader object. Media readers have to extend the abstract MediaReader class, an inner class of XMLDocument (see figure 3.8 on page 53), which declares two methods:

public abstract String getContentType();
public abstract void read(XMLDocument doc, File[] files);

The read method will be called by the editor kit to load the files specified in the files argument into the document doc, if the media type of the files corresponds to the mime-type returned by the getContentType() method of the media reader class. ImageReader is a default media reader supplied with LanguageAnalyzer which reads bitmap files and creates an XTE document from them. It can serve as an example for the support of other media types like sound files.
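A media reader for sound files could therefore look roughly like the following self-contained sketch. The two overridden methods are exactly the ones declared by MediaReader above; everything else — the chosen mime-type and the appendMediaElement method of the stubbed XMLDocument — is invented for the example, since the internals of the real XMLDocument are not shown in this text.

import java.io.File;

// XMLDocument is stubbed here to the single hypothetical method this
// example needs; in the framework, MediaReader is an abstract inner
// class of the real XMLDocument (see figure 3.8).
interface XMLDocument {
    void appendMediaElement(String elementName, String source); // hypothetical
}

abstract class MediaReader { // stand-in for XMLDocument.MediaReader
    public abstract String getContentType();
    public abstract void read(XMLDocument doc, File[] files);
}

// A media reader for sound files, analogous to the supplied ImageReader.
class SoundReader extends MediaReader {
    public String getContentType() {
        return "audio/x-wav"; // assumed mime-type for .wav files
    }

    public void read(XMLDocument doc, File[] files) {
        // Bootstrapping step: wrap each sound file into an XTE element
        // so that it can be annotated and linked like textual content.
        for (File f : files) {
            doc.appendMediaElement("sound", f.getAbsolutePath());
        }
    }
}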

3.6.3 Adding new tools

Finally, it is possible to extend LanguageAnalyzer and LanguageExplorer with new functionality by adding new tools to the applications. Tools operate on the content and the currently available encodings; they may alter these encodings, create new encodings or simply present the results of their computations in one of the LanguageExplorer extension windows. These tools, which are referred to as plugins in section 6.3.5 of the LanguageAnalyzer tutorial, are usually accessible from the toolbars and menus of the corresponding application. In order to make this possible, they implement the Swing Action interface. They get access to the different documents and extension windows through the MainWindow interface, which is implemented by LanguageAnalyzer as well as by LanguageExplorer. A reference to the corresponding MainWindow object is passed to every plugin object when it is installed in the application at program start-up. Usually, the tools or plugins will show an options dialog when they are invoked. This dialog can be used, for example, to specify on which logical document they should operate, how the created output should be named and, of course, to set parameters needed for the internal operation of the plugin. The implementation of the numerous plugins described in section 6.3.5, which are located in the com.languageExplorer.text.actions package, can serve as a good starting point for new tools.
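The skeleton of such a tool could look as follows. The Swing Action interface and its AbstractAction convenience base class are standard Swing; the MainWindow interface is the one named above, stubbed here to the two hypothetical methods the sketch needs, since its real definition is part of the framework.

import java.awt.Component;
import java.awt.event.ActionEvent;
import javax.swing.AbstractAction;
import javax.swing.JLabel;

// Stub of the framework's MainWindow interface: the two methods shown are
// hypothetical stand-ins for the real document and extension window API.
interface MainWindow {
    String getSelectedDocumentText();
    void openLowerExtensionTab(String title, Component content);
}

// A tool in the sense of this section: installed into a menu or toolbar by
// the application, which passes the MainWindow reference at start-up.
class WordCountAction extends AbstractAction {
    private final MainWindow mainWindow;

    WordCountAction(MainWindow mainWindow) {
        super("Word count");
        this.mainWindow = mainWindow;
    }

    public void actionPerformed(ActionEvent e) {
        // A real plugin would typically show an options dialog first.
        String text = mainWindow.getSelectedDocumentText().trim();
        int words = text.isEmpty() ? 0 : text.split("\\s+").length;
        mainWindow.openLowerExtensionTab("Word count",
                                         new JLabel("Words: " + words));
    }
}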


Chapter 4

Implementation techniques and libraries

During the planning and creation of the framework presented in this work, much thought has been given to how to properly describe and document the evolving system in a way that makes it useful and usable for others. Besides the application of established methods of object-oriented design [Meyer] and the use of well-known software patterns [GHJV], the author felt the need for a more precise description of the lower level implementation details. This is particularly important because one of the main features of the described system is adaptability and extensibility, both of which are impossible without good documentation. In order to solve this problem and to fill the gap left by high level Unified Modeling Language (UML) diagrams [BRJ1, BRJ2] and automatically created API documentation, a new software documentation system has been developed, which will be introduced in the first two sections of this chapter. An application of the described system can be seen, for example, in section 2.4. The third and fourth sections of this chapter describe some parts of the developed framework which are of general use and can be incorporated into arbitrary other applications as well. The resulting libraries are also documented with the new software documentation system.

4.1 Program documentation with ProgDOC

Though programming languages and programming styles evolve with remarkable speed today, there is no such evolution in the field of program documentation. And although there exist some popular approaches like Knuth's literate programming system WEB [Kn92], and nowadays JavaDoc [GoJoSt] or Doxygen [Hee], tools for managing software development and documentation are not as widespread as desirable. This section analyses a wide range of literate programming tools available during the past two decades and introduces ProgDOC, a new software documentation system. It is simple, language independent, and it keeps documentation and the documented software consistent. It uses LaTeX for typesetting purposes, supports syntax highlighting for various languages, and produces output in PostScript, PDF or HTML format. ProgDOC has been used to document the software packages described in this chapter and the XTE encoding presented in section 2.4. A part of this section has been published in [Sim03].


4.1.1 Introduction

The philosophy of ProgDOC is to be as simple as possible and to impose as few requirements as possible on the programmer. Essentially, it works with any programming language and any development environment as long as the source code is accessible from files and the programming language offers a possibility for comments. It is non-intrusive in the sense that it leaves the source code untouched, with the only exception of introducing some comment lines at specific places. The ProgDOC system consists of two parts: a so-called weaver, which weaves the desired parts of the source code into the documentation, and a highlighter, which performs the syntax highlighting for that code. Source code and documentation are mutually independent (in particular, they may be processed independently). They are linked together through special handles which are contained in the comment lines of the source code and may be referenced in the documentation. ProgDOC is a good choice for writing articles, textbooks or technical white papers which contain source code examples, and it has proved especially useful for mixed language projects and for documenting already existing programs and libraries. Some examples of output produced by ProgDOC are available at [Sim].

4.1.2 Some words on Literate Programming

With an article published in 1984 in the Computer Journal [Kn84], Donald Knuth coined the term "Literate Programming". Since those days, for many people literate programming has been inseparably interwoven with Knuth's WEB [Kn92] and TeX [Kn91] systems. Knuth justifies the term "literate programming" in [Kn84] with his belief that "... the time is ripe for significantly better documentation of programs, and that we can best achieve this by considering programs to be works of literature." To support this programming style, he introduced the WEB system, which is in fact both a language and a suite of utilities. In WEB, the program source code and the documentation are written together into one source file, delimited by special control sequences. The program source can be split into parts which can be presented in arbitrary order. The tangle program extracts these code parts from the WEB file and assembles them in the right order into a valid source file. Another program called weave combines the documentation parts of the WEB files with pretty printed versions of the code parts into a file which thereupon can be processed by TeX. This system has many advantages. First of all, it fulfills the "one source" property: because source code and documentation reside in one file, they are always consistent with each other. Second, the programmer is free to present the code he writes in arbitrary order, thus making it easier for a human reader to understand the program. This can be done by rearranging code parts, but also by using macros inside the code parts, which can be defined later on in the WEB file. This way, a top-down development approach is supported, in which the structure of a program as a whole is presented in the beginning and then subsequently refined, as well as a bottom-up design, in which a program is assembled out of low level code fragments defined before. tangle will always expand these macros at the right place when constructing the source file out of the WEB file. Another feature of the WEB system is the automatic construction of exhaustive indexes and cross references by weave. Every code part is accompanied by references which link it to all other parts which reference or use it. Also, an index of keywords with respect to code parts is created, and the source code is pretty printed for the documentation part. The best way to convince yourself of WEB's capabilities is to have a look at Knuth's TeX implementation [Kn91a]. It was entirely written in WEB and is undoubtedly a masterpiece of publishing and literate programming.


WEB and its descendants

Besides its many advantages, the WEB system also has a couple of drawbacks. Many of them apply only to the original WEB implementation of Knuth and have been corrected or worked around in the numerous WEB clones implemented thereafter. In this section we will present some of them1 and discuss their enhancements. One of the biggest disadvantages of WEB was the fact that it was closely tied to TeX as typesetting system and to Pascal as implementation language. So one of the first flavors of WEB was CWEB [KnLe], which extended WEB to C/C++ as implementation languages. It was implemented by Knuth himself together with Silvio Levy. CWEBx [Leeu] is an alternative CWEB implementation with some extensions by Marc van Leeuwen. Both suffer from the same problems as WEB, since they are closely coupled to TeX and the C programming language. To overcome these language dependencies, noweb [Ram] (which evolved from spiderWEB) and nuweb [Brig] have been developed by Norman Ramsey and Preston Briggs, respectively. Both are independent of the programming language, while they still use LaTeX for typesetting. Nuweb is a rather minimalistic but fast WEB approach with only four control sequences. Both noweb and nuweb offer no pretty printing by default, but noweb is based on a system of tools called filters which are connected through pipes; the current version comes with pretty printing filters for C and Java (see the actual documentation). Another descendant of an early version of CWEB is FWEB [Krom]. FWEB initially was an abbreviation for "Fortran WEB", but meanwhile FWEB supports not only Fortran, but C, C++, Ratfor and TeX as well. These languages can be intermixed in one project, and FWEB still supports pretty printing for the different languages. On the other hand, FWEB is a rather complex piece of software with a 140 page user's manual. Ross Williams' funnelWEB [Wil] is not only independent of the programming language, but of the typesetting language as well. It defines its own format macros, which can be bound to arbitrary typesetting commands (currently for HTML and LaTeX).

General drawbacks of WEB based literate programming tools

Though many of the initial problems of the WEB system have been solved in some of the clones, their sheer number indicates that none of them is perfect. One of the most controversial topics in the field of literate programming is pretty printing, where pretty printing stands for syntax highlighting2 together with code layout and indentation. There are two questions to consider here: is pretty printing desirable at all, and if so, what should the pretty printed code look like? The answer is often a matter of personal taste; however, there also exist some research results in this area, for example [BaeMa]. From a practical point of view, it must be stated that pretty printing is possible for Pascal, although a look at the WEB sources will tell you that it is not an easy task. Doing it for C is even harder3. Taking into account the fact that weave usually processes only a small piece of code, which itself does not even have to be syntactically correct, it should be clear that pretty printing such code in a complex language like C++ is impossible. To overcome these problems, special tags have been introduced by the various systems to support the pretty printing routines. But this clutters the program code in the WEB file and even increases the problem of the documentation looking completely different from the source.

1 Only systems known to the author will be mentioned here. A more complete overview may be found at the Comprehensive TeX Archive Network (CTAN) under http://www.ctan.org/tex-archive/web or at http://www.literateprogramming.org.
2 Syntax highlighting denotes the process of graphically highlighting the tokens of a programming language.
3 The biggest part of CWEB consists of the pretty printing module. Recognition of keywords, identifiers, comments, etc. is done by a hard coded shift/reduce bottom-up parser.


This can be annoying in a develop/run/debug cycle. As a consequence, the use of pretty printing is discouraged. The only feasible solution is simple syntax highlighting instead of pretty printing, as it is done by many editors nowadays. Even without pretty printing and additional tags inserted into the program source, the fact that the source code usually appears rearranged in the WEB file with respect to the generated source file makes it very hard to extend or debug such a program. A few lines of code lying close together in the source file may be split up to completely different places in the WEB file. Once, this could be called a feature, because it gave the programmer new means of structuring his program code for languages like Pascal which offered no module system or object hierarchy. As analysed in [ChSa], it could be used to achieve a certain amount of code and documentation reuse. However, the WEB macro system could also be misused by defining and using macros instead of defining and using functions in the underlying programming language. Another problem common to WEB systems is their "one source" policy. While this may help to keep source code and documentation consistent, it breaks many other development tools like debuggers, revision control systems and make utilities. Moreover, it is nearly impossible for a programmer not familiar with a particular WEB system to debug, maintain or extend code developed with that WEB. Even the possibility of giving away only the tangled output of a WEB is not attractive. First of all, it is usually unreadable for humans4, and second, this would break the "one source" philosophy. It seems that most of the literate programming projects realized until now have been one man projects. There is only one paper, by Ramsey and Marceau [RamMar], which documents the use of literate programming tools in a team project. Additionally, some references can be found about the use of literate programming for educational purposes (see [Child] and [ShuCo]). The general impression confirms Van Wyk's observation in [VanWyk] "... that one must write one's own system before one can write a literate program, and that makes [him] wonder how widespread literate programming is or will ever become." The question he leaves to the reader is whether programmers are in general too individual to use somebody else's tools or if only individual programmers develop and use (their own) literate programming systems. The answer seems to lie somewhere in between. Programmers are usually very individual and conservative concerning their programming environment; there must be superior tools available to make them switch to a new one. On the other hand, integrated development environments (IDEs) have evolved strongly during the last years, and they now offer sophisticated navigation, syntax highlighting and online help capabilities for free, thus making many of the features of a WEB system, like indexing, cross referencing and pretty printing, obsolete (see section 4.1.3). Finally, the will to write documentation in a formatting language like TeX using a simple text editor is constantly decreasing in the presence of WYSIWYG word processors.

Other program documentation systems

With the widespread use of Java, a new program documentation system called JavaDoc was introduced. JavaDoc [GoJoSt] comes with the Java development kit and is thus available for free to every Java programmer. The idea behind JavaDoc is quite different from that of WEB, though it is based on the "one source" paradigm as well. JavaDoc is a tool which extracts documentation from Java source files and produces formatted HTML output.

4 NuWEB is an exception here, since it forwards source code into the tangled output without changing its format.

Dissertation der Fak. f. Informations- u. Kognitionswissenschaften, Univ. Tubingen¨ - 2004 4.1 · Program documentation with ProgDOC 69 sequently, JavaDoc is tied to Java as programming and HTML as typesetting language5. By default JavaDoc parses Java source files and generates a document which contains the sig- natures of all public and protected classes, interfaces, methods, and fields. This documen- tation can be further extended by specially formatted comments which may even contain HTML tags. Because JavaDoc is available only for Java, Roland Wunderling and Malte Zockler¨ cre- ated DOC++ [WunZoe], a tool similar to JavaDoc but for C++ as programming language. Additionally to HTML, DOC++ can create LATEX formatted documentation as well. Doxy- gen [Hee] by Dimitri van Heesch, which was initially inspired by DOC++, is currently the most ambitious tool of this type which can also produce output in RTF, PDF and Unix man-page format. Both DOC++ and Doxygen can create a variety of dependency-, call-, inclusion- and inheritance graphs, which may be included into the documentation. Notice that customized versions of tools like DOC++ and Doxygen may be used as preprocessors for the documentation extensions which will be proposed in section 4.2. C# [CSharp], Microsoft’s answer to Java, comes with its own documentation system as well. In principle it works in the same way as JavaDoc. The only difference is the resulting output format which is XML. This is a big advantage compared to JavaDoc, because the output is not tied to a special typesetting language. Instead the produced XML format is specified in the Appendix E of the C# language definition [CSharp]. Additional tools like NDoc [Diam] must be used to produce printable or displayable versions from the XML output of the C# documentation generator. Synopsis [DaSe] by Stephen Davies and Stefan Seefeld is another similar tool. Written mainly in Python [Lutz] it supports an architecture of pluggable parsers and formatters for various source languages and output formats. Currently it supports Python, IDL and C++ as programming languages and among others HTML, DocBook and as output formats. The interesting thing about Synopsis is the fact that it really parses the whole source code and builds an internal abstract syntax tree (AST) of the code. With the help of this AST exhaustive cross references can be build like for example linking every variable to the place where it was declared or to the place where its type is defined. Moreover, Synopsis can produce highlighted listings of the source files which are linked to the generated API documentation. The new documentation tools presented so far are mainly useful for creating hierarchi- cal, browesable HTML documentations of class libraries and APIs. They are intended for interface descriptions rather than the description of algorithms or implementation details. Although some of them support LATEX, RTF or PDF output, they are not particularly well suited for generating printed documentation. Another approach which must be mentioned in this chapter is Martin Knasmuller’s¨ “Re- verse Literate Programming” system [Knasm]. In fact it is an editor which supports folding and so called active text elements [MoeKo]. Active text elements may contain arbitrary doc- umentation, but also figures, links or popup buttons. All the active text is ignored by the compiler, so no tangle step is needed before compilation. Reverse Literate programming has been implemented for the Oberon system [WirGu]. 
The GRASP [Hend] system relies on source code diagramming and source code folding techniques in order to present a more comprehensible picture of the source code, however without special support for program documentation or literate programming. In GRASP, code folding may be done according to the control structure boundaries of the programming language as well as for arbitrary, user-selected code parts.

5 Starting with Java 1.2, JavaDoc may be extended with so-called "Doclets", which allow JavaDoc to produce output in different formats. Currently there are Doclets available for the MIF, RTF and LaTeX formats (see [Docl]).


4.1.3 Software documentation in the age of IDEs

Nowadays, most software development is done with the help of sophisticated IDEs (Integrated Development Environments) like Microsoft Visual Studio [VisSt], IBM Visual Age [VisAge], Borland JBuilder [JBuil], NetBeans [BGGSW] or Source Navigator [SouNav], to name just a few. These development environments organize the programming tasks in so-called projects, which contain all the source files, resources and libraries necessary to build such a project. One of the main features of these IDEs is their ability to parse all the files which belong to a project and build a database out of that information. Because the files of the project can usually be modified only through the built-in editor, the IDEs can always keep track of changes in the source files and update the project database on the fly. With the help of the project database, the IDEs can offer a lot of services to the user, like fast, qualified searching or dependency, call, and inheritance graphs. They allow fast browsing of methods and classes and direct access from variables, method calls or class instantiations to their respective definitions. Notice that all these features are available online during the work on a project, in contrast to tools like JavaDoc or Doxygen mentioned in the previous section, which provide this information only off-line. The new IDEs now deliver, under fancy names like "Code Completion" or "Code Insight", features like syntax directed programming [KhUr] or template based programming which had already been proposed in the late seventies [TeRe, MoSch]. In the past, these systems could not succeed for two main reasons: they were too restrictive in the burden they put on the programmer, and the display technology and computing power were not good enough6. However, the enhancements in the area of user interfaces and the computational power available today allow even more: context sensitive prompting of the user with the names of available methods or with the formal arguments of a method, syntax highlighting and fast recompilation of affected source code parts. All this reduces the benefits of a printed, highly linked and indexed documentation of a whole project. What is needed instead, in addition to the interface description provided by the IDE, is a description of the algorithms and of certain complex code parts. One step in this direction was Sametinger's DOgMA [Samet, SamPom] tool, an IDE that also allows writing documentation. DOgMA, like modern IDEs today, maintains an internal database of the whole parsed project. It allows the programmer to reference arbitrary parts of the source code in the documentation, while DOgMA automatically creates the relevant links between the source code parts and the documentation and keeps them up to date. These links allow a hypertext-like navigation between source code and documentation. While it seems that modern IDEs have adopted a lot of DOgMA's browsing capabilities, they have not adopted its literate programming features. However, systems like NetBeans [BGGSW], SourceNavigator [SouNav] or VisualAge [Sor] offer an API for accessing the internal program database. This would at least allow one to create extensions of these systems in order to support program documentation in a more comfortable way. The most ambitious project in this context in the last few years was certainly the "Intentional Programming" project led by Charles Simonyi [Simo96, Simo99] at Microsoft.
It revitalized the idea of structured programming and propagated the idea of programs being just instantiations of intentions. The intentions could be written with a fully fledged WYSIWYG editor which allowed arbitrary content to be associated with the source code. Of course, this makes it easy to combine and maintain software together with the appropriate documentation. Some screen-shots of this impressive system can be found in chapter 11 of [CzEi], which is dedicated solely to Intentional Programming. Unfortunately, the system was never made publicly available.

6 A good survey about the editor technology available at the beginning of the eighties can be found in [MeyDa].


4.1.4 Software documentation and XML

With the widespread use of XML [XML] in the last few years, it is not surprising that various XML formats have been proposed to break out of the "ASCII Straitjacket" [Abr] in which programming languages have been caught until now. While earlier approaches to widen the character set out of which programs are composed, like [Abr], failed mainly because of the lack of standards in this area, the standardization of UNICODE [U30] and XML may change the situation now. There exist two competing approaches. While for example JavaML [Bad] tries to define an abstract syntax tree representation of the Java language in XML (which, by the way, is not dissimilar to the internal representation proposed by the early syntax directed editors), the CSF [San] approach tries to define an abstract XML format usable by most of the current programming languages. Both have advantages as well as disadvantages: while the first one suffers from its dependency on a certain programming language, the second one will always fail to represent every exotic feature of every given programming language. A third, minimalistic approach could ignore the syntax of the programming language and just store program lines and comments in as few as two different XML elements. Such an encoding has been proposed by E. Armstrong [Arm]. However, independent of the encoding's actual representation, once such an encoding is available, literate programming and program documentation systems could greatly benefit from it. They could reference distinct parts of a source file in a standard way, or they could insert special attributes or even elements into the XML document which could be ignored by other tools like compilers or build systems. Standard tools could be used to process, edit and display the source files, and internal as well as external links could be added to the source code. Peter Pierrou presented an XML literate programming system in [Pier]. In fact, it consists of an XML editor which allows one to store source code, documentation and links between them in an XML file. A tangle script is used to extract the source code from the XML file. The system is very similar to the reverse literate programming tool proposed by Knasmüller, with the only difference that it is independent of the source language and stores its data in XML format. An earlier, but very similar effort described in [Ger] used SGML as markup language for storing documentation and source code. Anthony Coates introduced xmLP [CoRe], a literate programming system which uses some simple XML elements as markup. The idea is to use these elements together with other markup elements, for example those defined in XHTML [XHTML], MathML [MathML] or DocBook [DocB]. XSLT [XSLT] stylesheets are then used to produce the woven documentation and the tangled output files. A similar system has also been presented by Norman Walsh [Walsh], the author of DocBook. He introduces a few elements for source fragments which are located in their own namespace. Thus, every XML vocabulary which allows the inclusion of new elements from a different namespace may be used to write the literate program. Finally, XSLT stylesheets are used to weave and to tangle the literate program. Oleg Kiselyov suggested a representation of XML as s-expressions in Scheme, called SXML [Kisel]. SXML can be used to write literate XML programs. Different Scheme programs (also called stylesheets in this case) are available to convert from SXML to LaTeX, HTML or pure XML files.
Recently, the Boost initiative [Boost], an effort to provide free, peer-reviewed and portable C++ source libraries, has started a new project called BoostDoc [Greg]. The goal of the project is to document all the Boost libraries in a consistent way and to keep the documentation synchronised with the constantly evolving libraries. BoostDoc uses various tools like Doxygen [Hee] or Synopsis [DaSe] to create an API documentation in XML format out of the library source files. This API documentation is later merged with the BoostBook documentation written by the programmer, where BoostBook is an extension of the DocBook [WaMu] format specially tailored for C++ library documentation.


Some of the approaches presented in this section are quite new, but the wide acceptance of XML also in the area of source code representation of programming languages could give new impulses to the literate programming community. A good starting point for more information on literate programming and XML is the web site of the OASIS consortium, which hosts a page specifically dedicated to this topic [OASLit].

4.1.5 Overview of the ProgDOC system

With this historical background in mind, ProgDOC takes a completely different approach. It abandons the "one source" policy which was so crucial for all WEB systems, thus giving the programmer maximum freedom to arrange his source files in any desirable way. On the other hand, the consistency between source code and documentation is preserved by special handles, which are present in the source files as ordinary comments7 and which can be referenced in the documentation. pdweave, ProgDOC's weave utility, incorporates the desired code parts into the documentation. But let us start with an example. Suppose we have a C++ header file called ClassDefs.h which contains some class declarations. Below you can see a verbatim copy of the file:

class Example1 {
private :
    int x;
public :
    explicit Example1(int i) : x(i) {}
};

class Example2 {
private :
    double y;
public :
    explicit Example2(double d) : y(d) {}
    explicit Example2(int i) : y(i) {}
    explicit Example2(long l) : y(l) {}
    explicit Example2(char c) : y((unsigned int)c) {}
};

It is common practice until now, especially among programmers not familiar with any literate programming tools, that system documentation contains such verbatim parts of the source code it wants to explain. The problem with this approach is the code duplication which results from copying the code from the source files and pasting it into the text processing system. From then on, every change in the source files has to be repeated in the documentation. This would be reasonable, of course, but practice tells us that the discipline among programmers to keep their documentation and their source code up to date is not very high. At this point, the ProgDOC system enters the scene. It allows us to write ClassDefs.h as follows:

// BEGIN Example1
class Example1 {
private :
    int x;                               // Integer variable
public :
    explicit Example1(int i) : x(i) {}   // The constructor
};
// END Example1

// BEGIN Example2
class Example2 {
// ...
private :
    double y;
// ...
public :
    explicit Example2(double d) : y(d) {}
    explicit Example2(int i) : y(i) {}
    explicit Example2(long l) : y(l) {}
    explicit Example2(char c) : y((unsigned int)c) {}
};
// END Example2

7 As far as I know, every computer language offers comments, so this seems to be no real limitation.

The only changes introduced so far are the comments at the beginning and at the end of each class declaration. These comments, which of course have no effect on the source code, enable us to use the new \sourceinput[options]{filename}{tagname} command in the LaTeX documentation. It results in the inclusion and syntax highlighting of the source code lines which are enclosed by the "// BEGIN tagname" and "// END tagname" lines, respectively. Consequently, the following LaTeX code:

".. next we present the declaration of the class {\mytt Example1}:
\sourceinput[fontname=blg, fontsize=8, listing, linenr, label=Example1]{ClassDefs.h}{Example1}
as you can see, there is no magic at all using the {\mytt \symbol{92}sourceinput} command .."

will result in the following output:

“.. next we present the declaration of the class Example1:

Listing 4.1: ClassDefs.h [Line 2 to 7]

class Example1 {
private :
    int x;                               // Integer variable
public :
    explicit Example1(int i) : x(i) {}   // The constructor
};

as you can see, there is no magic at all using the \sourceinput command ..”

First of all, we observe that the source code appears nicely highlighted, while its indentation is preserved. Second, the source code is preceded by a caption line similar to the one known from figures and tables. In addition to a running number, the caption also contains the file name and the line numbers of the included code. Furthermore, this code sequence can be referenced anywhere in the text through a usual \ref command (like for example here: see Listing 4.1).

Dissertation der Fak. f. Informations- u. Kognitionswissenschaften, Univ. Tubingen¨ - 2004 74 Chapter 4 · Implementation techniques and libraries

Notice however that the boxes shown here are used for demonstration purposes only and are not produced by the ProgDOC system.

Now that we have an impression of what ProgDOC's output looks like, it is time to explain how it is produced. First of all, the style file 'progdoc.sty' has to be included into the source file. Among some definitions and default settings (see section 4.1.12), 'progdoc.sty' contains an empty definition of \sourceinput. If LaTeX processes a file containing this command, it will only print out the following warning:

WARNING !!! Run pdweave on this file before processing it with LaTeX. Then you will see the sourcecode example labeled Example1 from the file ClassDefs.h instead of this message.

The reason for this behavior is shown in Figure 4.1: ProgDOC is not implemented in pure LaTeX. Instead, the weaver component pdweave is an AWK [AKW] script, while the syntax highlighter pdhighlight is a program generated with flex [Flex]. It was originally based on a version of Norbert Kiesel's c++2latex filter. It not only marks up the source code parts for LaTeX, but also inserts special HTML markup into the LaTeX code it produces, such that an HTML version of the documentation may be created with the help of Nikos Drakos' and Ross Moore's latex2html [DrMo] utility. However, pdweave is not restricted to pdhighlight as highlighter. It may use arbitrary highlighters which conform to the interface expected by the weaver. And indeed, ProgDOC provides a second highlighter, called pdlsthighlight, which is in fact just a wrapper for the LaTeX listings package [Heinz].


Figure 4.1: Overview of the ProgDOC system.

The main idea behind ProgDOC is to write the documentation into so-called '.pd' files which contain pure LaTeX code and, as an extension to ordinary LaTeX, some additional commands like the above mentioned \sourceinput. These '.pd' files are processed by pdweave, which extracts the desired parts out of the source files, highlights them and finally merges them with the ordinary parts of the documentation. The file generated this way is a usual LaTeX source file which in turn can be passed to the LaTeX text processor. Usually, all these steps are simplified by the use of a special Makefile which also keeps track of dependencies between the source files and the documentation itself (see section 4.1.13 for an example). In the next sections, a brief description of the different commands available in '.pd' files will be given, the format of the handles required in the source files will be explained, and finally an example Makefile which automates the generation of the program documentation will be presented.


4.1.6 The \sourceinput command

Now that we have an idea of the general mechanism of the ProgDOC system, let us have a closer look at the \sourceinput command. Its syntax is similar to that of other LaTeX commands though, as we know by now, it will normally be processed by pdweave and not by LaTeX. The general form of the command is:

\sourceinput[options]{filename}{tagname}

As in LaTeX, arguments in {}-brackets are required whereas the ones in []-brackets are optional.

\sourceinput arguments:

filename — Absolute or relative pathname of the source file. This may be internally preceded by a base path if the command \sourceinputbase{filename} (see section 4.1.10) has been used.

tagname — An arbitrary string which uniquely identifies a part of the source code in the file specified by filename. A special tagname "ALL" is available, which includes a whole file. (See section 4.1.11 for a detailed description of the comment format in the source files.)

\sourceinput options:

label=name — An ordinary LaTeX label name which will be declared inside the produced source listing and which can subsequently be used as a parameter for the \ref command.

fontname=name — The name of the base font used for highlighting the source listing. It is desirable to specify a mono-spaced font of which italic, bold and bold italic versions exist, since these are used to emphasize keywords, comments, string constants and so on8. (The default is the initial setting of \ttdefault, usually cmtt.)

fontenc=encoding — The encoding of the font chosen with the fontname option above. (The default is OT1.)

fontsize=pt — The font size in points used for highlighting the listings. Since mono-spaced fonts are usually somewhat wider than their proportional counterparts, a somewhat smaller size is recommended here. (The default is 8pt.)

linesep=length — The line separation used for the source listings. (The default is 2.5ex.)

type=language — This option controls the type of language assumed for the source file. The language argument will be handed over to the actual highlighter (see the option highlighter). Currently the default highlighter pdhighlight supports the values c, cpp, java, xml, scm, el or text. If not set, the default language is cpp. If type is set to text, no syntax highlighting will be done at all. Notice that this option also affects the way in which comments are recognized in the source files (see also the option comment and section 4.1.11 about the source file format on page 79).

tab=value — The value of tab indicates the number of space characters used to replace a tab character ('\t'). The default is 8.

8 For more information on choosing the right base font, see the ProgDOC manual [Sim].


comment='string' — If you use one of the supported languages listed in the table on page 79, the tag names will be recognized automatically. If, however, you include parts of a file in an unsupported language, it may be necessary to use this option to set the string which denotes the beginning of a comment in that language.

listing[=noUnderline] — If the listing option is present, a heading will be printed above the listing which contains at least the running number of the listing and the name of the file it was extracted from. By default, this heading will be underlined. You can change this behavior by using the optional noUnderline argument.

linenr — If the linenr option is set, the heading will additionally contain the line numbers of the code fragment in its source file. The special tagname "ALL" always turns line numbers off.

center — With this option set, the listing will appear centered; without it, it will be left justified.

underline — If this option is set, pdhighlight will underline keywords instead of setting them in bold face. This is useful for fonts for which no bold version exists (e.g. cmtt).

caption='captiontext' — If this option is set, the caption produced by the listing option will contain captiontext instead of the file name and possibly the line numbers. Notice that captiontext must be enclosed between apostrophe signs "'".

wrap=column — With this option, you can instruct pdweave to wrap the lines of the included source code at the specified column. pdweave uses a heuristic in order to find a "good" break position, so the argument supplied with column is just a maximum value which will not be exceeded. Lines broken by pdweave will be marked by an arrow ("←-") at the breaking point. This option is especially useful in two-column mode. For an example see Listing 4.4.

highlighter=program — This option controls which program is used to highlight the source code. The default highlighter is pdhighlight. Currently the only additional highlighter is pdlsthighlight. Refer to section 4.1.8 for further information.

useLongtable (DEPRECATED) — This is a compatibility option which forces the default highlighter pdhighlight to arrange the source listings in a longtable environment. Because of layout problems which resulted from the interaction of longtables with other float objects, the use of the longtable environment has been abandoned. This option is only for people who want to typeset a document in exactly the same way it was done with older versions of ProgDOC.

Apart from whitespace, the \sourceinput command must be the first to appear in a line, and it must end in a line of its own. However, the command itself can be split over up to five different lines. (This number can be adjusted by setting the variable DELTA in the script pdweave.awk.) It may also be necessary to quote some option arguments between apostrophe signs "'" if they contain whitespace or special characters like angle or curly brackets. Some of these options, like fontname or fontsize, can be redefined globally in the '.pd' file. See section 4.1.12 on page 81 for more information.


4.1.7 Using ProgDOC in two-column mode

Starting with version 1.3, ProgDOC can be used in the LaTeX two-column or multicolumn mode. However, some restrictions apply in these modes, which will be discussed here. We will now switch to two-column mode by using the multicols environment with the command \begin{multicols}{2}:

First of all, there is no two-column support when using the deprecated useLongtable option, because the longtable environment doesn't work in the two-column mode. Otherwise, the two-column mode set with the twocolumn option of the documentclass command or inside the document with the \twocolumn command is supported, as well as the two- or multicolumn mode of the multicols environment (see [Mitt]), however with some minor differences. Because of incompatibilities between the multicols environment and the afterpage package, the caption "Listing x: ... (continued)" on subsequent columns or pages is not supported for listings inside the multicols environment (as can be seen in Listings 4.2 to 4.4, which are printed inside a multicols environment). If in twocolumn mode, columns are treated like pages for the caption mechanism of ProgDOC (see section C in the ProgDOC manual [Sim] for an example printed in twocolumn mode). Therefore, the "Listing x: ... (continued)" captions are repeated at the top of each new column the listing spans, just as if it was a new page.

Listing 4.2: A short Python example

#
# QuickSort and Greatest Common Divisor
# Author: Michael Neumann
#
print "Hello World"
print quicksort([5,99,2,45,12,234,29,0])

4.1.8 Using the alternative highlighter pdlsthighlight

In addition to the default highlighter pdhighlight, ProgDOC now comes with an additional highlighter called pdlsthighlight, which is in fact a wrapper for the listings environment of Carsten Heinz (see [Heinz]). The benefit of the new highlighter is the large number of supported languages for which the listings package performs syntax highlighting. One of the main drawbacks is the fact that you cannot produce an HTML version of the document, because LaTeX2HTML doesn't support the package. Notice furthermore that you have to set the type option of the \sourceinput command to a value recognized by the listings environment if you use pdlsthighlight as highlighter (e.g. type=C++ instead of type=cpp). Refer to [Heinz] for a complete list of supported languages. To use this highlighter, the listings.sty package has to be installed and manually loaded into the document with \usepackage{listings}. pdlsthighlight also works in both single and two-column mode; however, it doesn't support the "Listing x: ... (continued)" captions at all. Listings 4.2 to 4.4 are typeset using pdlsthighlight with the following options: [linenr, listing, wrap=40, fontname=blg, highlighter='pdlsthighlight', type=Python].

Listing 4.3: test.py [Line 8 to 12] (Referenced in Listing 4.2 on page 77)

def ggd(a, b):
    if a < b: a,b = b,a
    while a%b != 0:
        a,b = b,a%b
    return b

Listing 4.4: test.py [Line 16 to 21] (Referenced in Listing 4.2 on page 77)

def quicksort(arr):
    if len(arr) <= 1:
        return arr
    m = arr[0]
    return quicksort(filter(lambda i,j=m: i<j, arr)) + [m] + \
           quicksort(filter(lambda i,j=m: i>j, arr))

4.1.9 The \sourcebegin and \sourceend commands

Besides the \sourceinput command, there exists another pair of commands which can be used to highlight source code written directly into the '.pd' file. Of course, they are pseudo LaTeX commands as well and are processed by the pdweave utility rather than by LaTeX. Their syntax is as follows:

\sourcebegin[options]{header}
source code
\sourceend

The \sourcebegin command has the same options as the \sourceinput command, but no filename and tagname arguments, since the source code begins in the line that follows the command. For compatibility with older ProgDOC versions there is an optional header argument. It will be printed instead of the filename in the header of the listing if the option listing is set. The recommendation for new users, however, is to use the caption option instead. Notice that, in contrast to the usual LaTeX conventions, this is an optional argument. The source code is terminated by a line which solely contains the \sourceend command. These commands are useful if some code must be presented in the documentation which is not intended to appear in the real source code. Consider for example the following code:

.. we don’t use void pointers and ellipsis for our function {\mytt func}

\sourcebegin[fontname=blg, fontsize=8, listing, center]{Just an example ..}
void func(void *p, ...)
{
    cout << "A function with an arbitrary number of arguments\n";
    ..
}
\sourceend

since they are bad programming style and can lead to unpredictable errors ..

which will result in the following output: “.. we don’t use void pointers and ellipsis for our function func

Listing 4.5: Just an example ..

void func(void *p, ...)
{
    cout << "A function with an arbitrary number of arguments\n";
    ..
}

since they are bad programming style and can lead to unpredictable errors ..”

The same restrictions that apply to the \sourceinput command hold for \sourcebegin and \sourceend as well. Additionally, if present, the opening brace of the optional header argument must start in the same line as the closing bracket of the options argument.


4.1.10 The \sourceinputbase command

If you want to present a certain view of the source code to the reader, relative and absolute path names may not be enough for the \sourceinput command. In this case you can use the command:

\sourceinputbase{pathname}

It defines a global path prefix for all \sourceinput commands which follow in the same file. You can reset this path prefix by calling \sourceinputbase{} with a zero length argument. Like the \sourceinput command, the \sourceinputbase command must be in a line of its own and may be preceded only by whitespace. This command has file scope. Notice that automatic references between nested code sequences (see section 4.1.11) will work only if the code sequences have been included with the same path prefix. This is because the algorithm which automatically generates the labels for nested code sequences uses the pathname of the file from which a code sequence has been included as part of the generated label name.

4.1.11 The source file format

As shown in the first section, arbitrary parts of a source file can be made available to ProgDOC by enclosing them with comment lines of the form '// BEGIN tagname' and '// END tagname' respectively, where in this and the following examples we will use the C++ comment syntax. However, ProgDOC also supports a number of other languages. When speaking about supported languages, one has to distinguish between highlighting support for a language, which comes from pdhighlight, and the support to extract code snippets out of files of a given language, which is provided by pdweave. The following table lists the supported languages with respect to both these tools. In general, any file may be used as input source, even if not listed here, by specifying "text" as type argument and the corresponding comment character(s) as comment argument to the \sourceinput command (see the table on page 75).

type   Language   Comment character(s)   pdweave   pdhighlight
c      C          //, /*                 √         √
cpp    C++        //, /*                 √         √
java   Java       //, /*                 √         √

Hiding code parts

An arbitrary even number of '// ... [text]' comments may appear inside a 'BEGIN/END' code block. All the code between two of these comment lines will be skipped in the output and replaced by a single "dotted line" (...) or a line of the form "... text ..." if the optional text argument was present in the first comment line. text may be an arbitrary LaTeX string (not containing double quotes) enclosed between double quotes. This feature is useful, for example, if you want to show the source code of a class but don't want to bother the reader with all the private class stuff.


Recall the header file from section 4.1.5, which is reprinted here for convenience, by using the following command: "\sourceinput[fontname=blg, fontsize=8, listing]{ClassDefs.h}{ALL}". Notice the use of the special tag name "ALL", which includes the source file as a whole.

Listing 4.6: ClassDefs.h

// BEGIN Example1
class Example1 {
private :
    int x;                               // Integer variable
public :
    explicit Example1(int i) : x(i) {}   // The constructor
};
// END Example1

// BEGIN Example2
class Example2 {
// ... some private stuff
private :
    double y;
// ...
public :
    // BEGIN Constructors
    explicit Example2(double d) : y(d) {}
    explicit Example2(int i) : y(i) {}
    explicit Example2(long l) : y(l) {}
    explicit Example2(char c) : y((unsigned int)c) {}
    // END Constructors
    void doSomething();                  // do something
};
// END Example2

In the way described until now, we can include the class definition of the class "Example2" by issuing the command: "\sourceinput[fontname=ul9, fontenc=T1, fontsize=7, listing, linenr, label=Example2]{ClassDefs.h}{Example2}".

Listing 4.7: ClassDefs.h [Line 11 to 24]

class Example2 {
...
public :
    void doSomething();                  // do something
};

As you can see, however, the private part of the class definition is replaced by the mentioned "dotted line", which stands for as much as "there is some hidden code at this position in the file, but this code is not important in the actual context".


Displaying nested code sequences

Another possibility of hiding code at a specific level is to nest several "BEGIN/END" blocks, where nested BEGIN lines may also have an optional text argument as described in the previous section. If a "BEGIN/END" block appears inside another block, it will be replaced by a single line of the form "<[text] see Listing xxx on page yyy>", where xxx denotes the listing number in which the code of the nested block actually appears and yyy the page number on which that listing begins. Of course, this is only possible if the mentioned nested block will be or already has been included by a \sourceinput command. In turn, if a nested block is included through a \sourceinput command, its heading line will additionally contain the listing and page number of its enclosing block. You can see this behavior in the following example, where we show the constructors of the class Example2 by issuing the following command: "\sourceinput[fontname=ul9, fontenc=T1, fontsize=7, listing, linenr, label=Constructors]{ClassDefs.h}{Constructors}".

Listing 4.8: ClassDefs.h [Line 18 to 21] (Referenced in Listing 4.7 on page 80)

explicit Example2(double d) : y(d) {}
explicit Example2(int i)    : y(i) {}
explicit Example2(long l)   : y(l) {}
explicit Example2(char c)   : y((unsigned int)c) {}

This hiding of nested code parts can be thought of as a kind of code folding as it is available in many programmers' editors today [Knasm, Hend].

So let's finally state more precisely the difference between hiding code through '// ...' comment lines and the nesting of code blocks. While a '// ...' comment always matches the following '// ...' line, a nested 'BEGIN tagname' always matches its corresponding 'END tagname' and can potentially contain many '// ...' lines or even other nested chunks. Another difference is the fact that nested chunks can be presented later on in the documentation and will in that case be linked together by references, while parts masked out by '// ...' lines will simply be ignored. Nevertheless, '// ...' lines can be useful, for example, if a part of a source file contains many lines of comments which aren't intended to be shown in the ProgDOC documentation. If you want to use nested "BEGIN/END" chunks together with the \sourceinputbase command, be sure to read the comments on this topic in section 4.1.10.

One last word on the format of the comments processed by the ProgDOC system: they must be on a line of their own. The comment token, BEGIN/END and the tagname must be separated by whitespace and only by whitespace. The comment token need not begin in the first column of the line, as long as it is preceded only by whitespace. The tagname should consist only of characters which are valid in a LaTeX \label statement.
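To illustrate these formatting rules, a few hedged examples follow (the tag names are made up):

   // BEGIN parserLoop           <- valid: token, keyword and tagname separated by whitespace
       // END parserLoop         <- valid: the comment token may be preceded by whitespace
   int i; // BEGIN parserLoop    <- invalid: the comment must be on a line of its own
   // BEGIN parser%loop          <- problematic: '%' is not valid in a LaTeX \label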

4.1.12 LaTeX customization of ProgDOC

Some of the options available for the '\sourcebegin' and the '\sourceinput' command (see section 4.1.6 on page 75) can be set globally by redefining LaTeX commands. Additional commands can be used to adjust the appearance of the generated output even further. The following is a list of the available commands:

\pdFontSize   The font size used for printing source listings. The default is 8pt. This command is the global counterpart of the fontsize option of '\sourcebegin' and '\sourceinput'.

\pdLineSep   The line separation used for printing source listings. The default is 2.5ex. This command is the global counterpart of the linesep option of '\sourcebegin' and '\sourceinput'.


\pdBaseFont   The font family which is used to print source listings. The default is '\ttdefault'. This command is the global counterpart of the fontname option of '\sourcebegin' and '\sourceinput'.

\pdFontEnc   The encoding of the font family chosen with \pdBaseFont or with the fontname option of the '\sourcebegin' or '\sourceinput' commands. The default is OT1. This command is the global counterpart of the fontenc option of '\sourcebegin' and '\sourceinput'.

\pdCommentFont   The font shape used for highlighting comments in the source listing. The default setting is '\itshape'.

\pdKeywordFont   The font shape used to highlight the keywords of a programming language. The default is '\bfseries'.

\pdPreprFont   The font shape used to highlight preprocessor commands in C or C++. The default is '\bfseries\itshape'.

\pdStringFont   The font used to highlight string constants in source listings. The default setting is '\slshape'.

\ProgDoc   Command to print the ProgDOC logo.

\pdULdepth   A length command which controls the depth of the line under a listing caption. ProgDOC uses the ulem.sty package for underlining, which does a pretty good job in guessing a reasonable value for this purpose. However, it may sometimes be necessary to fine-tune it manually, depending on the font used. The length may be set with the \setlength command. Resetting \pdULdepth to 0pt reactivates the initial ulem.sty algorithm. (This tutorial for example uses \setlength{\pdULdepth}{2.5pt}.)

\pdPre (DEPRECATED; see footnote 9)   This and the following three length commands correspond to the longtable commands \LTpre, \LTpost, \LTleft and \LTright respectively. For more information see the documentation of the longtable package [Car]. \pdPre sets the amount of space before a listing. The default is \bigskipamount.

\pdPost (DEPRECATED; see footnote 9)   \pdPost sets the amount of space after a listing. The default is 0cm.

\pdRight (DEPRECATED; see footnote 9)   The margin at the right side of the listing. The default is \fill.

\pdLeft (DEPRECATED; see footnote 9)   \pdLeft sets the amount of space at the left side of a listing. Usually the listing is left justified or centered (see also section 4.1.6, The \sourceinput command). But because listings are typeset inside a longtable environment, they aren't indented, for example inside list environments. In that case it can be useful to set \pdLeft to \leftmargin. If the listing will be inside a nested list environment, you can use \renewcommand{\pdLeft}{x\leftmargin} where x is the nesting level. The default is 0cm.

All these commands can be redefined. If you want to typeset string constants in italic, you could insert the following line in the preamble of your '.pd' file: '\renewcommand{\pdStringFont}{\itshape}'. The words used to build up the header of each listing can also be set by the user according to his preferences (though this is intended mainly to permit a certain kind of localization). They are defined in 'progdoc.sty' as follows:

9 Because ProgDOC internally used the longtable environment in older versions to render the program listings, some of the longtable options have been made available to ProgDOC users. As new versions of ProgDOC don't use longtable anymore, these options have no effect. (See the useLongtable option of the \sourceinput command in section 4.1.6 for a compatibility option which enables the old style mode that uses the longtable environment.)


\ListingName   The name used to name listings. The default is "Listing".

\LineName   The name of a line. The default setting is "Line".

\toName   The word for "to" in "Line xxx to yyy". Defaults to "to".

\ReferenceName   The sentence "Referenced in".

\PageName   The words "on page".

\ListingContinue   A word to indicate that the current listing is a continuation from a previous page. Defaults to "continued".

\NextPage (DEPRECATED; see footnote 9)   This should be a small symbol to indicate that a listing is not finished, but will be continued on the next page. The default setting is '\ding{229}' which is the '➥' symbol.

You could customize these entries for the German language by inserting the following lines into the preamble of your '.pd' file:

\def\LineName{Zeile}
\def\toName{bis}
\def\ReferenceName{Referenziert in}
\def\PageName{auf Seite}
\def\ListingContinue{Fortsetzung}

4.1.13 An example Makefile

In this section a makefile will be presented which simplifies the task of calling all the scripts in the right order and keeps track of the dependencies between source and documentation files. For the sake of simplicity, the makefile used to build this documentation will be shown:

Listing 4.9: Makefile

dvi :  tutorial.dvi
ps :   tutorial.ps
pdf :  tutorial.pdf
html : tutorial/tutorial.html
out :  example

clean :
	rm -rf *.dvi *.ps *.pdf *.log *.aux *.idx *~ part1.tex tutorial.tex \
	       *pk *.out pdweave.tmp pd_html.html tutorial

tutorial.dvi : tutorial.tex part1.tex
tutorial.pdf : tutorial.tex part1.tex progdoc.pdf

progdoc.pdf : progdoc.eps
	epstopdf progdoc.eps

part1.tex : ClassDefs.h test.xml test.py version.el

example : example.cpp ClassDefs.h
	g++ -o example example.cpp

tutorial/tutorial.html : tutorial.dvi
	latex2html -html_version 4.0 -show_section_numbers -image_type gif \


Listing 4.9: Makefile (continued)

-up title "ProgDoc Home Page" -up url "../progdoc.htm" \ -no footnode -local icons -numbered footnotes tutorial.tex

# We generate ps from pdf now in order to depend only on pdfLaTeX!
# %.ps : %.dvi
#	dvips -D 600 -o $@ $<

%.ps : %.pdf
	acroread -toPostScript -binary $<

%.dvi : %.tex
	latex $< && latex $<

%.pdf : %.tex
	rm -f $*.aux && pdflatex $< && pdflatex $<

%.tex : %.pd
	pdweave $<

Of course this file can be included with the \sourceinput command as well. Because syntax highlighting for makefiles is not supported yet, the file was included with the type option set to text. But even in this case there are still benefits in using the \sourceinput command. First of all, the documentation will always contain the current makefile. Second, this makefile can be referenced throughout the documentation like every other source file (see Listing 4.9). And last but not least, ProgDOC may be extended in the future to highlight various other file formats, so you may improve your documentation by simply rebuilding it with a newer version of ProgDOC.

Now let's have a closer look at the makefile. The first five lines define shortcuts for the different targets, namely the dvi, ps, pdf and html versions of the documentation and the example executable. clean, the last target, removes all files created during a build process. Notice that 'pdweave.tmp' and 'pd_html.html' are temporary files created by pdweave.

In the next lines the dependencies are defined. The dvi output depends on the tex files of the documentation, which in turn depend on the source code of the files they document. Therefore the documentation will be rebuilt not only if the documentation source files change, but also if the source code files change. The next two rules tell the make utility how to build the example executable and the html version of the documentation. The latter will be created by LaTeX2HTML in its own subdirectory.

The last four parts of the makefile contain generic rules which tell the make utility how to generate '.ps' files out of '.pdf' files, '.dvi' files out of '.tex' files, '.pdf' files out of '.tex' files and finally '.tex' files out of '.pd' files. As you can see, for the last step the pdweave utility will be used. Using this example as a skeleton, it should be straightforward to write makefiles for your own projects.


4.2 Program documentation with XDoc

Traditionally, program documentation has never been treated as a first-class citizen of computer programs and as such has not received wide support from language designers. Comment lines which are ignored by the compiler have been the lowest common denominator in virtually all programming languages. In this section a universal documentation extension will be proposed which may be applied non-intrusively to any programming language. It may be used for the automatic generation of interface documentation as well as for linking external documentation with parts of the actual source code. The benefits of this new documentation scheme are: synchronized code and documentation, different levels of compiler support for program documentation, and wider tool support due to the independence from the actual programming language and a standardized output format. A prototype implementation of the new approach is presented for the Java programming language and the DocBook system.

4.2.1 Introduction

From the very beginning, programming languages knew the concept of comments. Because comment lines were completely ignored by the compiler, they could contain arbitrary content, so it became good programming practice to use comments in order to document the most important and the most intricate parts of a program in prose. However, documenting a program in such a way has a number of serious drawbacks. First of all, the intended reader needs full access to the source code. Sometimes a subject may be most easily explained by a picture or a formula, which is extremely hard to do using merely ASCII characters. Finally, excessive documentation with comment lines can make the program code itself hard to read and edit.

These problems led to the development of the concept of Literate Programming by D. Knuth [Kn84], where the source code and the documentation are written into a single file using TeX [Kn91]. This way the full power of the TeX typesetting system can be used for the documentation. However, before compiling the program, the source code has to be extracted from the documentation first (see also section 4.1.2). More recently, Java [GoJoSt] introduced a new documentation system called JavaDoc. It is based on API documentation which is automatically generated by the compiler and which can be augmented by the programmer with the help of special comments inserted into the source code. However, only high-level interface documentation can be achieved this way.

None of the three approaches mentioned handles program documentation in its entirety. Therefore a new, universal and language-independent documentation scheme which can be applied non-intrusively to any programming language will be proposed here. The language extension is non-intrusive because it is completely transparent to any compiler which is unaware of the extension. Therefore, as a first step before compiler support becomes available, the extension may also be handled by an external preprocessor. The documentation scheme is language independent because it may be used with any programming language which offers simple comments. It is universal because it offers a uniform interface and output format no matter which programming language it is actually used with. Finally, it is new in the sense that it combines well-known and proven techniques in a new and innovative way.

4.2.2 The new XDoc approach

The usefulness and necessity of good software documentation is generally accepted by every programmer. However, there is no such unanimity when it comes to the question of what


good documentation is, and there is even less agreement upon how to produce such documentation. Nevertheless, the following features seem to be crucial for every documentation system:

1. Documentation and source code should always be consistent and synchronized.

2. The system should be easy to use in order to be accepted by the programmer. (I.e., programming should not be constrained, and documenting should be as easy as just writing with an ordinary word processor.)

3. Different levels of documentation, like interface or implementation documentation for different audiences, should be possible.

4. The documentation should be legible, appealing and equally well suited for various output formats like printed manuals, books or online browsing.

5. Interoperability, team and tool support are crucial today, because projects tend to use more than one programming language, support more than one platform and are being worked on by many people simultaneously.

The previous sections about the ProgDOC program documentation system already analyzed and categorized the majority of the program documentation systems available today with respect to these criteria (see pages 66 to 72). The next sections will introduce the new XDoc system, which is based on two simple properties fulfilled by virtually every programming language:

• Every programming language is based on a formal grammar, and every compiler or interpreter internally builds a parse tree of a program when parsing it. Therefore it would be easy for each such tool to dump the parse tree in an XML format standardized on a per-language basis.

• Every programming language offers line comments. Defining some of these comments to have special semantics would enable the compiler to produce additional markup in the XML version of the parsed file. This comment format should also be standardized on a per-language basis.

Once the two requirements postulated above are fulfilled, it becomes easy to produce interface as well as implementation documentation from the resulting XML source code representation by using standard tools like XInclude [XInc] or XSLT [XSLT] processors. Taking into account the XML elements introduced by the programmer with the special comments presupposed above, it is possible to address arbitrary code parts and include them into the documentation. Given the standardized XML format, it becomes trivial to include source code into the documentation based on syntactic information (e.g. including a class or method definition by name). And finally, API documentation could be generated automatically by extracting the interface part together with possible documentation comments (e.g. JavaDoc, C# or Doxygen style comments) from the XML representation.

The key point is in fact the per-language standardization of the proposed special comment scheme and of the XML representation of the source code, because it will permit the development of documentation tools against a standardized interface. What such a tool may look like will be demonstrated in section 4.2.3, while the following two subsections will discuss the special comment format and the representation of the source code in XML.


The comment format

For the semantics of the special line comments we propose the following simple extension to usual line comments:

   line-comment-token ('<' | '>' | '<>') element-name {attribute=value}*

where line-comment-token is the token which introduces a line comment in the specific programming language (e.g. '//' in C++/Java or '#' in AWK) and element-name denotes the name of the resulting XML element. If the character following the comment token is '<', the result will be an opening tag for the corresponding element; if the character is '>', the compiler will generate a closing tag for the corresponding element; and finally, a '<>' after the comment token will introduce an empty element. All the additional text after element-name will be copied verbatim into the resulting element tag and should contain valid XML attributes in order to produce a well-formed XML document.

Notice that introducing opening and closing tags for an element has to be done in such a way that they do not intersect with the opening and closing tags produced by the compiler for the programming language constructs. For example, placing a comment which produces an opening tag just before a while loop and the comment for the closing tag inside that loop will in general produce an XML document which is not well-formed, because the introduced tags will overlap with the opening and closing tags of the while loop. Such errors, however, can be detected easily by the compiler.

The advantage of the fact that comments for opening and closing tags have to align with the structure of the program is that they can also be used as anchors for user-defined code folding [Hend, Knasm]. For example jEdit [Pest], a cross-platform programmer's text editor written by Slava Pestov, uses '//{{{ text' and '//}}}' line comments to specify the beginning and the end of a text fold. Unifying these notations would enable code folding for source code marked up with documentation comments, as well as the inclusion of arbitrary predefined code folds into the documentation. Notice that although the opening and closing comments may not overlap, they may be nested.
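To illustrate the proposed syntax, consider the following hedged Java fragment (the element name Include follows the prototype in section 4.2.3; the Milestone element and all attribute values are made up for this example):

   //< Include ID="init" label="Initialization of the parser"
   int pos = 0;                    // code to be marked up
   //<> Milestone name="phase-1"
   pos++;                          // more marked-up code
   //> Include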

The XML representation

Because most programming languages are defined by a grammar anyway, the simplest approach would be to define an XML DTD or an XML Schema [XMLSch0] based on that grammar. There also already exist a number of XML mappings for various programming languages, for example JavaML [Bad] for Java or the generic approaches of Armstrong [Arm] and Sandø [San]. As for the syntax of the documentation comment, the crucial point here is that it is highly desirable for the XML representation to be standardized together with the corresponding programming language in order to enable compatibility of code and interoperability of tools.

It has to be stressed, however, that we do not necessarily need a full compiler in order to create the XML representation. Tools like DOC++ [WunZoe] or Doxygen [Hee], which only partially parse the source file, may be fully adequate. The advantage of using a full-fledged compiler for this purpose would be the additional information which could be gathered, like overload resolution or exact type information for every identifier in the source code. This information could be used, for example, for cross-linking in the generated documentation. Also, it is not strictly necessary to store the XML representation in files corresponding to the underlying source files. It can be useful to store this information in a database, which may save time in the face of recompilation or may simplify querying the information in big projects.


Advantages and drawbacks of the new approach

The proposed documentation system fulfills the first three properties postulated in section 4.2.2. The code and the documentation can be kept synchronized although they are mutually independent. Only the syntax of the new documentation comment has to be learned by the programmer. And finally, as stated before, the generated XML representation can be used to produce interface as well as implementation documentation.

Legibility, appealing look and suitability for different output formats, which was the fourth property from section 4.2.2, are mainly dependent on the typesetting system actually used. However, XML-based documentation systems are widely used, and the prototype presented in section 4.2.3, which is based on DocBook [WaMu], demonstrates the strength of this approach. Finally, interoperability and team and tool support are granted through the wide acceptance and support of XML and XML-related technologies as industry standards.

One last benefit of the proposed documentation style is its applicability to multilingual documentation, because once the relevant code parts have been identified and marked, they can be included in the same way into arbitrary documents. With the Literate Programming approach described in section 4.1.2, several versions of the same documentation in different languages are not possible without duplication of the source code which is embedded inside the documentation. Also, even if possible, embedding all the documentation into the source code, as for example with the JavaDoc style, would become confusing already with the second language, because the source code would contain more comments than actual program code. These arguments apply not only to multilingual documentation, but also to the case where different kinds of documentation (e.g. user documentation, developer documentation) have to be created for the same code.

There are two major drawbacks of the new documentation system. First of all, standardizing a computer language is a complicated and intricate task. Therefore, adding the proposed extensions to the definition of already existing languages will not be easy. However, there may be a good chance for the user community of each programming language to establish a de-facto standard for these extensions.

For some programming languages like C/C++ which use a preprocessor, it may be difficult to reconstruct the source representation from the abstract syntax tree available to the compiler, because the preprocessor step can potentially replace and change the source code. In particular, the C/C++ preprocessor simply strips all comments from the source code before feeding it to the compiler. Therefore tools like GCC-XML [King], an extension of the GNU C++ compiler [GCC] by Brad King which generates an XML description of a C++ program from GCC's internal representation, do not handle comments at all. However, other tools like Synopsis [DaSe], or techniques similar to the ones described in [BaNo], may be used to overcome this problem.

4.2.3 A prototype implementation

This section will present a prototype implementation of the ideas presented in the last section. The prototype works for the Java programming language and uses DocBook for writing the documentation, along with the DocBook XSL-FO stylesheets and an FO [XSL] processor to produce PDF documentation. Two pages of a resulting document are shown in Figures 4.7 and 4.8, respectively. Notice that the two pages were in A4 format initially and have been shrunk by a factor of 0.6 in order to fit the layout of this document.

For the prototype, the Java compiler which is available as part of the Java Specification Request 14 [JSR14], dedicated to adding generics to the Java programming language, has been used and extended. As XSLT processor, version 6.5.2 of Michael Kay's Saxon [Kay] has been chosen. Furthermore, version 4.1.2 of the DocBook DTD and version 1.60.1 of the DocBook XSL-FO stylesheets [Walsh2] have been used and extended. As a last step, the


 1  /**
 2   * A quick sort demonstration algorithm
 3   *
 4   * @author James Gosling
 5   * @author Kevin A. Smith
 6   * @version 1.3, 29 Feb 1996
 7   */
 8  public class QSortAlgorithm {
 9
10     /** A generic version of C.A.R Hoare's Quick Sort algorithm.
11      * It handles sorted arrays, and arrays with duplicate keys.
12      *
13      * If you think of a one dimensional array as going from
14      * the lowest index on the left to the highest index on the right
15      * then the parameters to this function are lowest index or
16      * left and highest index or right. The first time you call
17      * this function it will be with the parameters 0, a.length - 1.
18      *
19      * @param a       an integer array
20      * @param lo0     left boundary of array partition
21      * @param hi0     right boundary of array partition
22      * @return        returns nothing, just for demonstration purpose
23      */
24     //< Include ID="QSMethod" label='The whole "QuickSort" method.'
25     public static void QuickSort(int a[], int lo0, int hi0) {
26        int lo = lo0;
27        int hi = hi0;
28        int mid;
29
30        if ( hi0 > lo0) {
31           // Arbitrarily establishing partition element as the midpoint of
32           // the array.
33           //
34           mid = a[ ( lo0 + hi0 ) / 2 ];
35
36           //< Include ID="whileLoop" label="Loop through the array until indices cross"
37           while( lo <= hi ) {
38              // find the first element that is greater than or equal to
39              // the partition element starting from the left Index.
40              //
41              while( ( lo < hi0 ) && ( a[lo] < mid ) ) {
42                 ++lo;
43              }
44
45              // find an element that is smaller than or equal to
46              // the partition element starting from the right Index.
47              //
48              while( ( hi > lo0 ) && ( a[hi] > mid ) ) --hi;
49

Figure 4.2: The first part of the example program QSortAlgorithm.java.

resulting XML document was run through the RenderX Formatting Objects engine [XEP] to produce the final PDF version. The next three subsections will describe the extensions in more detail and explain how the mentioned systems work together to produce the final documentation. In order to demonstrate the possibilities of the system, a slightly modified version of a Quicksort class written by James Gosling and Kevin Smith, which is presented verbatim in Figures 4.2 and 4.3, will be used. Along with the implementation of the prototype, it is available for download from http://www.progdoc.org/xprogdoc.

Extending the Java compiler

The decision to use the JSR 14 prototype compiler was made for two main reasons. First of all, it offers the chance to immediately support the generic Java constructs which will be added to the Java programming language in version 1.5 of the Java Development Kit. The second reason was the fact that the compiler is implemented in a clear and well structured


50              // if the indexes have not crossed, swap
51              if( lo <= hi ) {
52                 swap(a, lo, hi);
53                 ++lo;
54                 --hi;
55              }
56           }
57           //> Include
58
59           // If the right index has not reached the left side of array
60           // must now sort the left partition.
61           //
62           if( lo0 < hi ) QuickSort( a, lo0, hi );
63
64           // If the left index has not reached the right side of array
65           // must now sort the right partition.
66           //
67           if( lo < hi0 ) QuickSort( a, lo, hi0 );
68
69        }
70     }
71     //> Include
72
73     public static void sort(int a[]) {
74        QuickSort(a, 0, a.length - 1);
75     }
76
77     private static void swap(int a[], int i, int j) {
78        int T;
79        T = a[i];
80        a[i] = a[j];
81        a[j] = T;
82     }
83
84     private static void print(int a[]) {
85        for(int i = 0; i < a.length; i++) {
86           if (i > 0) System.out.print(", ");
87           System.out.print(a[i]);
88        }
89        System.out.println();
90     }
91
92     public static void main(String argv[]) {
93        int test[] = new int[] { 9, 5, 2, 6, 2, 7, 5, 1, 0, 4};
94        print(test);
95        sort(test);
96        print(test);
97     }
98  }

Figure 4.3: The second part of the example program QSortAlgorithm.java.

way and contains a nice, easy to understand recursive descent parser. The compiler source code is available for free download at [JSR14].

The compiler was extended to support the new command line option '-x', which instructs the compiler to dump the Java files given on the command line in XML format. An example of what this output looks like is presented in Figure 4.4. Notice the Include element at line 101 in Figure 4.4, which was introduced by the special comment at line 36 in QSortAlgorithm.java. The label attribute of this element is used in Listing 1 in Figure 4.7 in order to denote the content omitted from the listing. Also notice the fact that empty lines of the Java source file are represented by special XML comments, like the one at line 100 in Figure 4.4. Though not strictly necessary, this information is preserved in order to simplify the production of the formatted Java output in a later step.

The mapping of the Java language to XML elements is straightforward. General language constructs are mapped to corresponding XML elements. Sometimes additional attributes are used to further describe the construct (e.g. the operator attribute for the



Figure 4.4: Some lines of the compiler generated file QSortAlgorithm.xml. These lines correspond to the Java source from line 35 to the first opening brace at line 41 in Figure 4.2.

binary-expression element (see Figure 4.4, line 104) or the visibility attribute for method, var-def and class elements). Additionally, every element has a line and a column attribute which denote the exact position of the corresponding construct in the Java source file.

The first step towards these results was the introduction of two new tokens into the scanner part of the Java compiler: one token for line comments and one for empty lines. Notice that the scanner originally skipped all comments except the special JavaDoc comments. These were just stored in a symbol table along with the class definition or variable declaration they belong to, and were not reported directly to the parser.

The parser was changed to accept the new tokens. Therefore the production rule for BlockStatement [GoJoSt, §14.2] was changed to accept line comments and empty lines as alternatives to usual language statements. The productions for ClassBodyDeclaration [GoJoSt, §8.1] and InterfaceMemberDeclaration [GoJoSt, §9.1] were changed to additionally accept line comments and empty lines.

The parser builds an abstract syntax tree of the source code which is processed and augmented in turn by various transformers which perform tasks like resolving names, doing flow analysis, optimization and code generation. All these transformers were changed to simply ignore the subtrees representing line comments and empty lines. Finally, a new transformer was written which dumps the abstract syntax tree in XML format. This transformer is prepended to the chain of transformers right after parsing has finished if the compiler is given the new '-x' command line option.
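To give an impression of this format, the following fragment sketches how lines 35 to 37 of Figure 4.2 might be dumped. The Include element with its ID and label attributes, the line and column attributes and the empty-line comment follow the description above; all other element names and all attribute values are illustrative assumptions:

   <!-- 35 -->
   <Include ID="whileLoop" label="Loop through the array until indices cross"
            line="36" column="7">
     <while-loop line="37" column="7">
       <binary-expression operator="&lt;=" line="37" column="14">
         <identifier name="lo" line="37" column="14"/>
         <identifier name="hi" line="37" column="20"/>
       </binary-expression>
       ...
     </while-loop>
   </Include>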

Extending DocBook

The DocBook DTD was extended by two new elements, as shown in Figure 4.5. The first one, SourceBase, has the single required attribute xml:base. It can be used to specify a base path under which source files considered for inclusion will be searched. Listing, the second element, can be used to include parts of a source file into the documentation. It has several attributes, which will be described briefly now. The href attribute, which is required, denotes the file from which the code will be included. It will be interpreted relative to the path which was set by the last SourceBase element, if there was one at all. The type attribute, which is also required, is used to specify the kind of listing to produce. Setting the value of this attribute to include will tell the stylesheet by which the DocBook document is processed to include all the code contained in the Include element whose ID attribute is equal to the anchor attribute of the actual Listing element. Remember that such elements can be introduced by the programmer with the special documentation comments described in section 4.2.2 (e.g. lines 24 and 36 in Figure 4.2).



Figure 4.5: The extension of the DocBook DTD.
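A minimal sketch of the two declarations, reconstructed from the attribute descriptions in this section (content models and default values are assumptions):

   <!ELEMENT SourceBase EMPTY>
   <!ATTLIST SourceBase xml:base CDATA #REQUIRED>

   <!ELEMENT Listing EMPTY>
   <!ATTLIST Listing
       href              CDATA                #REQUIRED
       type              (include|select|api) #REQUIRED
       anchor            CDATA                #IMPLIED
       kind              CDATA                #IMPLIED
       name              CDATA                #IMPLIED
       recursive-include (true|false)         "true"
       java-doc          (true|false)         "false">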

The first listing of the example document shown in Figure 4.7 was included by the following command:
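Based on the attribute descriptions below, this command was presumably similar to the following (the href value is an assumption):

   <Listing href="QSortAlgorithm.xml" type="include" anchor="QSMethod"
            recursive-include="false" java-doc="false"/>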

Because the recursive-include attribute is set to false, the nested Include element which spans lines 36 to 57 in Figures 4.2 and 4.3 is not included into the documentation. Instead, it is replaced by a link to the listing which contains these lines if the author decides to also include them, as has been done in Listing 2 of Figure 4.7. Otherwise, a notice that the lines are not shown in the actual documentation will be printed.

Notice the use of the label attribute, which can be declared in the documentation comment of the programming language (line 36 in Figure 4.2). It is used internally by the stylesheet during the transformation as a short description of nested code parts and does not have to be specified in the extended DocBook DTD. java-doc, the last attribute in the example given above, instructs the stylesheet not to show JavaDoc comments which appear in the included source code.

As a second example, consider the following line, which has been used to include Listing 4 into the document shown in Figure 4.8:



Figure 4.6: The main transformation rule of the XSL-FO stylesheet in ProgDocBookFO.xsl.

recursive-include="false" java-doc="true"/>

It sets the type attribute to select, thus including not a range of code specified by the programmer, but a syntactic entity of the programming language which is specified by the additional attributes kind and name. Consequently, this example includes the source code of the whole QSortAlgorithm class. Notice that this time the JavaDoc comments which belong to the class are shown, because the java-doc attribute is set to true. The recursive-include attribute is still set to false, which prevents the inclusion of the QuickSort method, because it is embedded into special documentation comments by the programmer. Instead, it is replaced by a line with a link to the actual listing and the short description given with the label attribute in the source code (line 24 in Figure 4.2). The last example, shown below, demonstrates how API documentation in JavaDoc format can be included into the documentation by setting the type attribute to api:
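Based on the attribute descriptions above, this command was presumably similar to the following (the href value and the exact attribute set are assumptions):

   <Listing href="QSortAlgorithm.xml" type="api" kind="method" name="QuickSort"/>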

The result of this example can be seen in Listing 5 in Figure 4.8. It contains the JavaDoc API documentation (lines 10 to 23 in Figure 4.2) for the QuickSort method in a nicely formatted way. Notice furthermore that the anchor attribute can additionally be used as a target for cross-referencing, regardless of the value of the type attribute. In the PDF version of the example document, references like for example "see Listing 2" are true hyperlinks which can be navigated.

Extending the DocBook XSL-FO stylesheets

While the extension of the DocBook DTD required only a few lines of code, extending the DocBook XSL-FO stylesheets, which produce formatting objects output from an input file conforming to the newly defined DTD, proved much harder.

All the functionality of the new DocBook elements and attributes described in the previous section is effectively implemented in the extended stylesheet. It uses an XSLT 1.1 feature which treats result tree fragments as real node sets, and different modes, to implement a three-step policy during the XSL transformation. As shown in Figure 4.6, the first step is used to include the source code parts identified by corresponding Listing elements. In this step all the original DocBook elements are just copied recursively to a temporary tree, and the document() function is used to replace the Listing elements with the actual source code from the source files in XML format.


The second step uses the newly constructed tree and transforms the XML version of the Java source code into valid DocBook elements. As in the first step, the original DocBook elements are just copied to the new tree. The second step is also used to establish the automatic links between nested code parts. It is the most elaborate step, with more than 500 lines of code, because a transformation rule is needed for every single element which can appear in the XML version of the source file. In the third and last step, the root rule of the original XSL-FO stylesheet is called with the second intermediate tree as argument. At this stage the tree contains only valid DocBook elements and can be transformed into a formatting object file.

Finally, the XSL-FO stylesheet also contains some local customizations and some rules for a new DocBook element called listing. This element is effectively handled in the same way as the DocBook example element. It was only necessary to introduce it because listings have their own label, referencing style and numbering. Because the listing elements are created only in the second temporary tree during the transformation, they do not have to be declared in the extended DocBook DTD.
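As a rough sketch, the three steps can be pictured as follows (the mode and variable names are invented; the actual stylesheet in ProgDocBookFO.xsl differs in detail):

   <!-- Step 1: replace Listing elements by the XML source code they refer to. -->
   <xsl:variable name="withSources">
     <xsl:apply-templates select="/" mode="include-sources"/>
   </xsl:variable>

   <!-- Step 2: turn the embedded XML source representation into DocBook
        elements and establish the links between nested code parts. -->
   <xsl:variable name="pureDocBook">
     <xsl:apply-templates select="$withSources" mode="source-to-docbook"/>
   </xsl:variable>

   <!-- Step 3: hand the pure DocBook tree to the root rule of the original
        stylesheet. XSLT 1.1 treats the result tree fragments above as real
        node sets, which is what makes steps 2 and 3 possible. -->
   <xsl:apply-templates select="$pureDocBook"/>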

4.2.4 Conclusion

The new documentation style complies with the five demands postulated in section 4.2.2. It combines and uses well known and established techniques for documentation purposes in a new and effective way, and proposes the standardization of the comment style together with the format of the XML representation as an integral part of every programming language. Therewith, documenting becomes vendor and implementation independent in the same way as programming became vendor and implementation independent through the standardization of programming languages.

Software Documentation Test

Volker Simonis

This is just a test article in order to demonstrate the software documentation style proposed in the article "A Universal Documentation Extension for Arbitrary Programming Languages". It was written in XML using the DocBook DTD, transformed to Formatting Objects and finally translated into PDF by a FO engine. More information can be found in the enclosing article.

An implementation of the Quicksort algorithm

In this section we will present an implementation of the Quicksort algorithm in the Java programming language. Listing 1 gives an overview of the sort method.

Listing 1. QSortAlgorithm.java [Lines 25 to 70]

public static void QuickSort(int[] a, int lo0, int hi0) {
   int lo = lo0;
   int hi = hi0;
   int mid;

   if(hi0 > lo0) {
      // Arbitrarily establishing partition element as the midpoint of
      // the array.
      //
      mid = a[(lo0 + hi0) / 2];

      //If the right index has not reached the left side of array
      // must now sort the left partition.
      //
      if(lo0 < hi) QuickSort(a, lo0, hi);

      // If the left index has not reached the right side of array
      // must now sort the right partition.
      //
      if(lo < hi0) QuickSort(a, lo, hi0);

   }
}

For brevity, some details of the algorithm have been omitted in Listing 1. They will be presented in the next program listing:

Listing 2. QSortAlgorithm.java [Lines 37 to 56] (Referenced in Listing 1)

while(lo <= hi) {
   // find the first element that is greater than or equal to
   // the partition element starting from the left Index.
   //
   while((lo < hi0) && (a[lo] < mid)) {
      ++lo;
   }

   // find an element that is smaller than or equal to
   // the partition element starting from the right Index.
   //
   while((hi > lo0) && (a[hi] > mid)) --hi;

   // if the indexes have not crossed, swap
   if(lo <= hi) {
      swap(a, lo, hi);
      ++lo;
      --hi;
   }
}

The Quicksort class also contains a small test program to verify the algorithm:

Figure 4.7: The first page of the example document.


Listing 3. QSortAlgorithm.java [Lines 92 to 97]

public static void main(String[] argv) {
   int[] test = new int[] { 9, 5, 2, 6, 2, 7, 5, 1, 0, 4 };
   print(test);
   sort(test);
   print(test);
}

Listing 4 shows the whole source file one more time:

Listing 4. QSortAlgorithm.java [Lines 8 to 98]

/**
 * A quick sort demonstration algorithm
 *
 * @author James Gosling
 * @author Kevin A. Smith
 * @version 1.3, 29 Feb 1996
 */
public class QSortAlgorithm {

   public static void sort(int[] a) {
      QuickSort(a, 0, a.length - 1);
   }

   private static void swap(int[] a, int i, int j) {
      int T;
      T = a[i];
      a[i] = a[j];
      a[j] = T;
   }

   private static void print(int[] a) {
      for(int i = 0;i < a.length;i++) {
         if(i > 0) System.out.print(", ");
         System.out.print(a[i]);
      }
      System.out.println();
   }

   public static void main(String[] argv) {
      int[] test = new int[] { 9, 5, 2, 6, 2, 7, 5, 1, 0, 4 };
      print(test);
      sort(test);
      print(test);
   }
}

Listing 5 finally presents the JavaDoc documentation of the QuickSort method shown already in Listing 1.

Listing 5. Method QuickSort: A generic version of C.A.R Hoare's Quick Sort algorithm

A generic version of C.A.R Hoare's Quick Sort algorithm. It handles sorted arrays, and arrays with duplicate keys.

If you think of a one dimensional array as going from the lowest index on the left to the highest index on the right then the parameters to this function are lowest index or left and highest index or right. The first time you call this function it will be with the parameters 0, a.length - 1.

public static void QuickSort(int[] a, int lo0, int hi0);

Parameters:
   a     an integer array
   lo0   left boundary of array partition
   hi0   right boundary of array partition

Return Value:
   returns nothing, just for demonstration purpose

Figure 4.8: The second page of the example document.


4.3 A Locale-Sensitive User Interface

For the two applications LanguageExplorer (see chapter 5) and LanguageAnalyzer (see chapter 6) developed for this thesis, a special graphical user interface has been created which allows the user to switch the language of the user interface at run time, without the need to restart the application.

LanguageExplorer and LanguageAnalyzer have been implemented in Java and its GUI library Swing, which provides software developers with a highly customizable framework for creating truly "international" applications. Nevertheless, the Swing library is not locale-sensitive10 with respect to locale switches at run time.

Taking into account Swing's elaborate Model-View-Controller architecture, this section describes how to create GUI applications which are sensitive to locale changes at runtime, thus increasing their usability and user friendliness considerably. The content of this section has been published in [Sim02].

4.3.1 Introduction

Sometimes GUI applications are created with internationalization11 in mind, but are not immediately fully localized12 for all target languages. In such a case, a user native to an unsupported language would choose the language he is most familiar with from the set of supported languages. But the ability to easily switch the language at run time could still be desirable for him if he knows more than one of the supported languages similarly well. Other applications like dictionaries or translation programs are inherently multilingual and are used by polyglot users. Such applications would greatly benefit if the user interface language were customizable at runtime.

Unfortunately, this is not a built-in feature of the Java Swing GUI library. However, this section will sketch how Swing can easily be customized such that it supports locale switching at runtime. To this end, a new Look and Feel called MLMetalLookandFeel will be created, where ML is an abbreviation for "multi lingual". This new Look and Feel will extend the standard Metal Look and Feel with the ability to be locale-sensitive at runtime.

As an example we will take the Notepad application which is present in every JDK distribution in the demo/jfc/Notepad/ directory. It is localized for French, Swedish and Chinese, as can be seen from the different resource files located in the resources/ subdirectory. Depending on the locale of the host the JVM is running on, the application will get all the text resources visible in the GUI from the corresponding resource file. The loading of the resource file is achieved by the following code:

Listing 4.10: Notepad.java [Line 59 to 65]

try {
    resources = ResourceBundle.getBundle("resources.Notepad",
                                         Locale.getDefault());
} catch (MissingResourceException mre) {
    System.err.println("resources/Notepad.properties not found");
    System.exit(1);
}

10 locale-sensitive: A class or method that modifies its behavior based on the locale's specific requirements. (All definitions taken from [DeiCza].)
11 internationalization: The concept of developing software in a generic manner so it can later be localized for different markets without having to modify or recompile source code.
12 localization: The process of adapting an internationalized piece of software for a specific locale.


The ResourceBundle class will try to load the file resources/Notepad_XX_YY.properties, where XX is the two-letter ISO-639 [ISO639] language code of the current default locale and YY the two-letter ISO-3166 [ISO3166] country code, respectively. For more detailed information about locales, have a look at the JavaDoc documentation of java.util.Locale. The exact resolution mechanism for locales, if there is no exact match for the requested one, is described at java.util.ResourceBundle. In any case, the file resources/Notepad.properties is the last fallback if no better match is found. You can try out all the available resources by setting the default locale at program startup with the two properties user.language and user.country13. To run the Notepad application with a Swedish user interface you would therefore type:

java -Duser.language=sv Notepad
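The lookup order can also be demonstrated programmatically. The following self-contained sketch is hypothetical and assumes the Notepad resource files are on the class path:

import java.util.Locale;
import java.util.ResourceBundle;

public class BundleDemo {
    public static void main(String[] args) {
        // With a Swedish default locale, getBundle() searches
        // resources/Notepad_sv_SE.properties, then resources/Notepad_sv.properties,
        // and finally falls back to resources/Notepad.properties.
        Locale.setDefault(new Locale("sv", "SE"));
        ResourceBundle resources = ResourceBundle.getBundle("resources.Notepad");
        System.out.println(resources.getString("Title"));
    }
}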

However, a user interface internationalized in this way is only customizable once, at program startup. After the resources for the default locale are loaded, there is no way to switch the locale until the next start of the program. We will call this type of internationalization static internationalization. In the following, we will change Notepad.java to make it dynamically internationalized, i.e. locale-sensitive at run time. We will call this new application IntNotepad.

4.3.2 The Java Swing architecture

A GUI application is composed of many UI components like labels, buttons, menus, tool tips and so on. Each of these components has to display some text in order to be useful. For simple components like labels or buttons, this text is usually set in the constructor of the component. Additionally, and for more complex components like file choosers, the text can be set or queried with set and get methods. Internationalized applications like the Notepad application do not hard-code these text strings into the program file, but read them from resource files. So instead of:

JFrame frame = new JFrame(); frame.setTitle("Notepad");

they use the following code:

JFrame frame = new JFrame();
frame.setTitle(resources.getString("Title"));

where resources denotes the resource bundle opened in Listing 4.10.

Basically, we could just reset all these strings at run time every time the user chooses a different locale. But for an application which uses tens to hundreds of different components it would not be practicable to do this manually. Even worse, some components like JFileChooser do not even offer accessor methods for all the strings they display. So we have to come up with another solution, which requires a closer look at the architecture of the Swing GUI library.

The design of the Swing library is based on a simplified Model-View-Controller [GHJV] pattern called Model-Delegate [ZuStan]. Compared to the classical MVC pattern, the Model-Delegate pattern combines the View and the Controller into a single object called the Delegate (see Figure 4.9). In Swing these delegates, which are also called the user interface (UI) of a component, are Look and Feel specific. They are derived from the abstract class ComponentUI. By convention, they have the name of the component they are the delegate for, with

13Be aware that setting the default locale on the command line with help of the mentioned properties does not work with all JDK versions on all platforms. Refer to the bugs 4152725, 4179660 and 4127375 in the Java Bug Database [JDB].

the J in the component class name replaced by the name of the specific Look and Feel, and UI appended to the class name. So, for example, the UI delegate for JLabel in the Metal Look and Feel has the name MetalLabelUI.


Figure 4.9: The left side shows the common Model-View-Controller pattern, whereas the right side shows the Model-Delegate pattern used in Swing along with the class realizations for JButton.

One of the tasks the UI delegate is responsible for is to paint the component it is tied to. In contrast to the AWT library, in Swing it is not the paint() method of every component which does the work of painting itself. Instead, the component's paint() method just calls the paint() method of its delegate, passing a reference to itself.

4.3.3 The solution - idea and implementation

Knowing the internals of the Swing architecture, we are now ready to make the Swing components aware of locale switches at runtime. To achieve such behavior, we will introduce one more level of indirection. Instead of setting a text field of a component to the real string which should be displayed, we set the field to contain a key string instead. Then we override the UI delegate in such a way that, instead of just painting the string obtained from its associated component, it will look up the real value of the string to paint depending on the current locale.

Let us substantiate this in a small example. Listing 4.11 shows how a JLabel is usually created and initialized, followed by a code snippet taken from the BasicLabelUI.paint() method which is responsible for rendering the label's text:

Listing 4.11: Creating a usual JLabel and a part of the BasicLabelUI.paint() method.

// Create a label.
JLabel label = new JLabel();
label.setText("Hello");

// Taken from javax.swing.plaf.basic.BasicLabelUI.java
public void paint(Graphics g, JComponent c) {
    JLabel label = (JLabel)c;
    String text = label.getText();

    // Now do the real painting with text.
    ...
}


We will now create a new UI delegate for JLabel called MLBasicLabelUI which overrides the paint() method such that it does not simply query the text from the JLabel and render it. Instead, it interprets the string received from its associated JLabel as a key into a resource file which is, of course, parameterized by the current Locale. Only if it doesn't find an entry in the resource file for the corresponding key will it take the key text itself as the string to render. Thus, the changes in the UI are fully transparent to the component itself.

Getting the localized resource strings

Because this procedure of querying the localized text of a component from a given resource file will be common to all the UI delegates which we will create for our Multi Lingual Look and Feel, we put the code into a special static method called getResourceString():

Listing 4.12: ml/MLUtils.java [Line 35 to 44]

public static String getResourceString(String key) {
    if (key == null || key.equals("")) return key;
    else {
        String mainClass = System.getProperty("MainClassName");
        if (mainClass != null) {
            return getResourceString(key, "resources/" + mainClass);
        }
        return getResourceString(key, "resources/ML");
    }
}

This method builds up the name of the resource file which is searched for the localized strings. It first queries the system properties for an entry called MainClassName. If it succeeds, the resource file will be the file with the same name in the resources/ subdirectory. If not, it will assume ML as the default resource file name. This file name, along with the original key argument, is passed to the second, two-parameter version of getResourceString(), shown in Listing 4.13.

Listing 4.13: ml/MLUtils.java [Line 50 to 76]

private static Hashtable resourceBundles = new Hashtable();

public static String getResourceString(String key, String baseName) {
    if (key == null || key.equals("")) return key;
    Locale locale = Locale.getDefault();
    ResourceBundle resource =
        (ResourceBundle)resourceBundles.get(baseName + "_" + locale.toString());
    if (resource == null) {
        try {
            resource = ResourceBundle.getBundle(baseName, locale);
            if (resource != null) {
                resourceBundles.put(baseName + "_" + locale.toString(), resource);
            }
        } catch (Exception e) {
            System.out.println(e);
        }
    }


Listing 4.13: ml/MLUtils.java [Line 50 to 76] (continued)

    if (resource != null) {
        try {
            String value = resource.getString(key);
            if (value != null) return value;
        } catch (java.util.MissingResourceException mre) {}
    }
    return key;
}

This method finally does the job of translating the key text into the appropriate localized value. If it cannot find the corresponding value for a certain key, it just returns the key itself, consequently not altering the behavior of a component which isn't aware of the multi-lingual UI it is rendered with. Notice that, for performance reasons, getResourceString() stores resource bundles in a static map after using them for the first time. Thus any further access will use the cached version, without the need to reload the resource file again.
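A hypothetical call illustrates the contract of this method:

// Returns the localized text for the key under the current default locale,
// or the key string itself if no resource entry exists.
String text = MLUtils.getResourceString("MyApplication.SaveString");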

Overriding the paint() method of the UI delegates

Having understood how localized strings can be queried with the functions introduced in Listings 4.12 and 4.13, the overridden version of the paint() method in MLBasicLabelUI (Listing 4.14) should be no surprise. Additionally, the label is now initialized to "MyApplication.HelloString", which is a key into the possibly localized resource file resources/MainClassName_XX_YY.properties.

Listing 4.14: A locale-sensitive JLabel and the paint() method of MLBasicLabelUI.

// Create a locale-sensitive label which has a MLBasicLabelUI delegate.
JLabel label = new JLabel();
label.setText("MyApplication.HelloString");

// Taken from MLBasicLabelUI.java which inherits from BasicLabelUI.
public void paint(Graphics g, JComponent c) {
    JLabel label = (JLabel)c;
    String text = MLUtils.getResourceString(label.getText());

    // Now do the real painting with text.
    ...
}

Notice that a string which is not found in the resource file will be displayed "as is" in the label. So our example would work perfectly fine even with the usual component UI; it simply would not respond to locale changes at run time. If we want to make the GUI of a whole application locale-sensitive at runtime, we have to create new UI classes for each Swing component we use in our GUI. This sounds like a lot of work, but in fact we just have to redefine the methods which query text data from the component they are associated with. One problem we may encounter is the fact that in Swing, concrete Look and Feels like the Metal Look and Feel or the Windows Look and Feel use their own UI classes which are not directly derived from ComponentUI (see figure 4.10). Instead, all the different UI classes


for a single component inherit from a class called BasicXXXUI, where XXX stands for an arbitrary component name. This is done to factor out all the functionality which is common to all the different Look and Feels into one base class.

[Figure 4.10 shows the following class hierarchy: ComponentUI (package javax.swing.plaf) is extended by LabelUI, which is extended by BasicLabelUI (package javax.swing.plaf.basic); from BasicLabelUI derive MetalLabelUI (package javax.swing.plaf.metal), WindowsLabelUI (package javax.swing.plaf.windows) and the new MLBasicLabelUI (package ml.basic), from which in turn MLMetalLabelUI (package ml.metal) derives.]

Figure 4.10: The class hierarchy of the component UI classes of Swing for JLabel. In this diagram, Label may be substituted by any other Swing component like Button, Tooltip and so on. The two classes in the upper part of the diagram from the package ml are the locale-sensitive UI classes developed in this section.

This makes our job more difficult, because usually we would like to override the UIs of a specific Look and Feel, but often the task of querying and painting the actual text is done only, or at least in part, in the BasicXXXUI base classes. Therefore we need to specialize two classes. First we have to specialize the BasicXXXUI class for our component and redefine the methods which query the text fields of our component. We will call this class MLBasicXXXUI. Then we have to copy and rename the actual component UI belonging to our desired Look and Feel from MetalXXXUI to MLMetalXXXUI and change the base class from which it inherits from BasicXXXUI to MLBasicXXXUI, our specialized version of BasicXXXUI. Again, Metal is just an example here; it could just as well be Windows, Motif or any other Look and Feel. Additionally, if necessary, we have to redefine the methods in MLMetalXXXUI which display text attributes from our associated component. After having implemented all the needed UI delegates, we have to tell our application to use the new delegates instead of the old, default ones. This can be done in two ways. The first, perhaps simpler one is to just register our delegates with the component names at program startup, as shown in Listing 4.15.

Listing 4.15: Associating Swing components with their UI delegates.

UIManager.put("ToolTipUI", "ml.mllf.mlmetal.MLMetalToolTipUI");
UIManager.put("LabelUI", "ml.mllf.mlmetal.MLMetalLabelUI");
UIManager.put("MenuUI", "ml.mllf.mlbasic.MLBasicMenuUI");
UIManager.put("MenuItemUI", "ml.mllf.mlbasic.MLBasicMenuItemUI");
UIManager.put("ButtonUI", "ml.mllf.mlmetal.MLMetalButtonUI");
UIManager.put("RadioButtonUI", "ml.mllf.mlmetal.MLMetalRadioButtonUI");
UIManager.put("CheckBoxUI", "ml.mllf.mlmetal.MLMetalCheckBoxUI");
UIManager.put("FileChooserUI", "ml.mllf.mlmetal.MLMetalFileChooserUI");
UIManager.put("ToolBarUI", "ml.mllf.mlmetal.MLMetalToolBarUI");

The second, perhaps more elegant way is to define a new Look and Feel in which our newly created UI delegates are the default ones. This approach is shown in Listing 4.16.

Listing 4.16: ml/mllf/mlmetal/MLMetalLookAndFeel.java [Line 22 to 44]

public class MLMetalLookAndFeel extends MetalLookAndFeel {

    public String getDescription() {
        return super.getDescription() + " (ML Version)";
    }

    protected void initClassDefaults(UIDefaults table) {
        super.initClassDefaults(table);
        // Install the metal delegates.
        Object[] classes = {
            "MenuUI", "ml.mllf.mlbasic.MLBasicMenuUI",
            "MenuItemUI", "ml.mllf.mlbasic.MLBasicMenuItemUI",
            "ToolTipUI", "ml.mllf.mlmetal.MLMetalToolTipUI",
            "LabelUI", "ml.mllf.mlmetal.MLMetalLabelUI",
            "ButtonUI", "ml.mllf.mlmetal.MLMetalButtonUI",
            "RadioButtonUI", "ml.mllf.mlmetal.MLMetalRadioButtonUI",
            "CheckBoxUI", "ml.mllf.mlmetal.MLMetalCheckBoxUI",
            "FileChooserUI", "ml.mllf.mlmetal.MLMetalFileChooserUI",
            "ToolBarUI", "ml.mllf.mlmetal.MLMetalToolBarUI",
        };
        table.putDefaults(classes);
    }
}

Finally, after each locale switch we just have to trigger a repaint of the dynamically internationalized components. This can be achieved by a little helper function, presented in Listing 4.17, which takes a root window as argument and simply invalidates all the necessary child components.

Listing 4.17: ml/MLUtils.java [Line 106 to 112]

public static void repaintMLJComponents(Container root) {
    Vector validate = recursiveFindMLJComponents(root);
    for (Enumeration e = validate.elements(); e.hasMoreElements();) {
        JComponent jcomp = (JComponent)e.nextElement();
        jcomp.revalidate();
    }
}

It uses another method named recursiveFindMLJComponents which recursively finds all the child components of a given container. In the form presented in Listing 4.18, the method returns all components which are instances of JComponent, but a more sophisticated version could be implemented which returns only the dynamically internationalized components.

Listing 4.18: ml/MLUtils.java [Line 154 to 173]

private static Vector recursiveFindMLJComponents(Container root) {
    // java.awt.Container.getComponents() doesn't return null!


Listing 4.18: ml/MLUtils.java [Line 154 to 173] (continued)

    Component[] tmp = root.getComponents();
    Vector v = new Vector();
    for (int i = 0; i < tmp.length; i++) {
        if (tmp[i] instanceof JComponent) {
            JComponent jcomp = (JComponent)tmp[i];
            if (jcomp.getComponentCount() == 0) {
                v.add(jcomp);
            } else {
                v.addAll(recursiveFindMLJComponents(jcomp));
            }
        } else if (tmp[i] instanceof Container) {
            v.addAll(recursiveFindMLJComponents((Container)tmp[i]));
        }
    }
    return v;
}

Notice that the version of repaintMLJComponents shown in Listing 4.17 only works for applications with a single root window. If an application consists of more than one root window, or if it uses non-modal dialogs, these also have to be repainted. This can be done by defining a static method registerForRepaint (Listing 4.19) for registering the additional windows and dialogs, and by extending repaintMLJComponents to take these registered components into account.

Listing 4.19: ml/MLUtils.java [Line 142 to 146]

private static Vector repaintWindows = new Vector();

public static void registerForRepaint(Container dialog) {
    repaintWindows.add(dialog);
}

The new version of repaintMLJComponents() is shown in Listing 4.20:

Listing 4.20: ml/MLUtils.java [Line 116 to 138]

public static void repaintMLJComponents(Container root) {
    Vector validate = recursiveFindMLJComponents(root);
    Iterator it = repaintWindows.iterator();
    while (it.hasNext()) {
        Container cont = (Container)it.next();
        validate.addAll(recursiveFindMLJComponents(cont));
        // Also add the Dialog or top level window itself.
        validate.add(cont);
    }
    for (Enumeration e = validate.elements(); e.hasMoreElements(); ) {
        Object obj = e.nextElement();
        if (obj instanceof JComponent) {
            JComponent jcomp = (JComponent)obj;


Listing 4.20: ml/MLUtils.java [Line 116 to 138] (continued)

            jcomp.revalidate();
        } else if (obj instanceof Window) {
            // This part is for the Dialogs and top level windows added with the
            // 'registerForRepaint()' method.
            Window cont = (Window)obj;
            cont.pack();
        }
    }
}

The Locale Chooser

Having discussed in detail the techniques necessary to make Swing components aware of locale switches at runtime, the last remaining step is the presentation of a widget which displays all the available locales to the user and allows him to choose a new default locale from this list. Figures 4.11 and 4.12 show the new IntNotepad application with the built-in locale chooser. Additionally, the original Notepad has been extended by a permanent status bar to demonstrate locale switches for labels. The first figure shows the application with the English default locale while the user is just switching it to Russian.

Figure 4.11: A screen shot of the IntNotepad application. The user just selects Russian as the default locale with the new locale chooser, which is located on the right side of the tool bar.

Figure 4.12 shows the application after the switch to Russian. Menus, labels, buttons and even tool tips are now displayed in Russian with Cyrillic letters. Notice that the menus have been resized automatically in order to hold the longer Russian menu names. The class LocaleChooser is a small extension of a JComboBox with a custom renderer which displays each available Locale with a flag and the name of the corresponding language. The language name is displayed in its own language if available, and in English otherwise. Please notice that there is no one-to-one mapping between languages and country flags, as many languages are spoken in more than one country and there are countries in which more than one language is spoken. Therefore one must be careful when choosing a flag as the representation for a language, so as not to hurt the feelings of people who speak that language in


Figure 4.12: This screen shot shows the IntNotepad application after the default locale has been switched to Russian. Labels, menus and even tool tips appear in Russian now.

a different country. After all, the flags should be just visual hints to simplify the selection of a particular language. The LocaleChooser constructor expects as parameters a String which denotes the resource directory of the application and a Container which will be the root component passed to the repaintMLJComponents() method presented in Listing 4.17 when it comes to a repaint of the application caused by a locale switch. For every language or language/country combination, the resource directory passed to the LocaleChooser constructor should contain a subdirectory named by the two-letter language code, or by the two-letter language code plus an underscore plus the two-letter country code, respectively. Each of these subdirectories should contain a file flag.gif which will be the image icon displayed by the LocaleChooser for the corresponding language. Thus, adding more locales to the list of locales displayed by LocaleChooser is merely a matter of adding the corresponding directories and files to the resource directory and does not require a recompilation of LocaleChooser. Remember, however, that for a locale switch to show any effect, a resource file with the localized component strings has to be available as well.
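The following usage sketch illustrates this; the directory names are hypothetical examples following the naming convention just described, and the constructor arguments are the ones given above:

// Hypothetical resource layout for three locales:
//   resources/en/flag.gif      (English)
//   resources/ru/flag.gif      (Russian)
//   resources/pt_BR/flag.gif   (Brazilian Portuguese)
LocaleChooser chooser = new LocaleChooser("resources", mainWindow);
toolBar.add(chooser); // e.g. at the right end of the tool bar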

Putting it all together

Finally, after the discussion of all the details involved in making Swing components aware of locale switches at runtime, we will summarize the important steps and show how they fit into the big picture of a real application. First of all, new component UI delegates have to be created for all the components which should be dynamically internationalizable. These UI delegates should be packed together into a new Look and Feel which is derived from an already existing Look and Feel. This way we don't have to create UI delegates for the full set of Swing components at the very beginning, but can extend our new Look and Feel step by step with new components. Creating the UI delegates has been extensively described in section 4.3.3. Once our new Look and Feel is available, we can start to modify our application to make it locale-sensitive at run time. The first step is to set the system property MainClassName to the name of our application. This information will be needed by the getResourceString() method (see Listing 4.12) presented in section 4.3.3. Then we have to set our new Look and Feel as the standard Look and Feel for our application. These two steps can be achieved by the following two lines of code:


System.setProperty("MainClassName", "IntNotepad");
UIManager.setLookAndFeel(new MLMetalLookAndFeel());

As a third step, we have to install an instance of the LocaleChooser presented in section 4.3.3 somewhere in our application. Usually this will be the tool bar, but it can also be installed in a menu or in a special options window along with other configuration options. The LocaleChooser has to be instantiated with a reference to the main application window in order for the repaint method shown in Listing 4.17 to work properly. That's all. From now on, whenever we create a new Swing component, we have the choice of setting its string attributes either to a concrete string or just to a key value. If the string attribute is available in the application's resource file as a key, its value will be displayed instead, according to the current default locale. Otherwise, the string attribute itself will be displayed.
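Put together, the startup of a dynamically internationalized application might look as follows; this is only a sketch, and the frame layout and the resource key are invented for the example:

// Minimal sketch of a locale-sensitive application startup.
public static void main(String[] args) throws Exception {
    System.setProperty("MainClassName", "IntNotepad");
    UIManager.setLookAndFeel(new MLMetalLookAndFeel());

    JFrame frame = new JFrame("IntNotepad");
    JToolBar toolBar = new JToolBar();
    // The chooser repaints 'frame' after every locale switch.
    toolBar.add(new LocaleChooser("resources", frame));
    frame.getContentPane().add(toolBar, BorderLayout.NORTH);
    // A key, not a concrete string: it is translated if present in the bundle.
    frame.getContentPane().add(new JLabel("IntNotepad.StatusReady"),
                               BorderLayout.SOUTH);
    frame.setSize(400, 300);
    frame.setVisible(true);
}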

4.3.4 Conclusion

This section presented a technique to make Swing components locale-sensitive at run time. It works by simply creating a new Look and Feel, without changing any code in the components themselves. As an example, the IntNotepad application was derived from the Notepad example application available in every JDK distribution. IntNotepad is aware of locale changes and rebuilds the whole user interface every time such a change occurs at run time. Together with all the other source code presented in this section, it is available for download at [Sim02]. Notice that by using the techniques presented here, it would be possible to lift the entire Swing library and make it locale-sensitive for run time locale switches without any compatibility problems with older library versions. Finally I want to thank Roland Weiss and Dieter Bühler for their assistance and for reviewing this work.


4.4 Scrolling on demand - A scrollable toolbar component

Modern GUI programs offer easy access to status information and functionality by means of various menus, toolbars and information panels. However, as a program becomes more complex, or when users have the possibility to configure and extend these components, they often tend to get overfilled. This leads to scrambled or even truncated components. This section introduces a new container component called ScrollableBar, which can be used as a wrapper for any Swing component. As long as there is enough space to lay out the contained component, ScrollableBar is completely transparent. As soon as the available space gets too small, however, ScrollableBar will fade in two small arrow buttons on the left and the right side (or on the top and the bottom side if in vertical mode), which can be used to scroll the underlying component, thus avoiding the above-mentioned problems. ScrollableBar is a lightweight container derived from JComponent which uses the standard Swing classes JViewport and JButton to achieve its functionality. It fills a gap in the set of standard Swing components and offers the possibility to create more robust and intuitive user interfaces. The content of this section has been published in [Sim04].

4.4.1 Introduction

Every professional application comes with a fancy graphical user interface today, and with Swing, the standard widget set of Java, it is quite easy to create such applications. However, the design and implementation of a robust and user-friendly GUI is not a trivial task. One common problem is the fact that the programmer has no knowledge about the client's desktop size. This may vary today from the standard notebook and flat panel resolution of 1024x768 to 1900x1200 for high-end displays. Even worse, Java applications can run on many other devices, for example mobile phones, which have an even more restricted resolution. Another challenge arises from the extensibility of applications. While the possibility to extend an application with various plugins may be a nice feature for the user, the fact that these plugins will populate the menus and toolbars in an unpredictable way imposes new problems on the programmer. One possibility to solve these problems is to limit the size of the GUI components to a certain minimal size. However, this may impose unnecessary restrictions on the user. (Think for example of somebody who usually works with such an application at a resolution of at least 1024x768, but who occasionally gives demo talks with a beamer which only supports an 800x600 resolution.) Furthermore, if an application with a graphical user interface pretends to be resizable by displaying a resizable frame, then the user expects to be able to resize it based on his needs, not the programmer's. The second possibility is to do nothing and see what happens. This is how most GUI applications are written today. Just compare the right picture of figure 4.13 with figure 4.15 and see how parts of the status and tool bars are cut off if the window is shrunk below its optimal size. In the best case, the user can just re-enlarge the application if this happens. In the worst case, if she is working on a device with a restricted resolution, it may be impossible to access the desired functionality. In any case such an application looks highly unprofessional!


Figure 4.13: The left picture shows the Stylepad application from figure 4.15 with scrollable menu, tool and status bars while the right picture shows the same application with truncated tool and status bars.

4.4.2 Scrollable menus and toolbars!

The solution to all the above-mentioned problems would be scrollable menus and toolbars. However, Swing, like many other widget sets, does not offer such components. Using the standard JScrollPane component as a container for menus and toolbars is not an option here, because JScrollPane is too heavyweight: its scrollbars are simply too big. But there is another Swing component which can serve as a template: since version 1.4, the JTabbedPane class offers the possibility to scroll its tabs instead of wrapping them on several lines if they do not fit on a single line. As can be seen in figure 4.14, arrow buttons for moving the tabs have been added at the upper right part (for more information see [Zuk]).

Figure 4.14: Example of a JTabbedPane with the tab layout policy set to SCROLL_TAB_LAYOUT.
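For reference, this built-in scrolling behavior of JTabbedPane is enabled with a single standard Swing call (available since version 1.4):

JTabbedPane tabs = new JTabbedPane();
tabs.setTabLayoutPolicy(JTabbedPane.SCROLL_TAB_LAYOUT);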

We now want to achieve the same behavior for menus, toolbars and other status bars and information panels. To get a visual impression of how the modified components will look, compare the two pictures in figure 4.13. They both show a screen-shot of the Stylepad demo application shipping with every JDK, which has been extended by a vertical toolbar and a useful status bar (see figure 4.15). While the menu, status bar and the toolbars are truncated and partially inaccessible in the right picture, they can be scrolled and are fully functional in the left picture by using the arrow buttons which have been faded in.


Figure 4.15: The Stylepad application at preferred size.

4.4.3 The implementation

We will now describe how to implement a class called ScrollableBar, which can serve as a container for a java.awt.Component object or any object derived from it. Most of the time, ScrollableBar objects are completely transparent. Only if the space required by the wrapped component for layout becomes too small does the ScrollableBar object fade in two arrow buttons at the left and right side of the component (or at the top and the bottom side if in vertical mode), which can be used to scroll the wrapped component. As soon as there is again enough space for the layout of the enclosed component, these arrow buttons disappear immediately.

The Swing architecture

For a better understanding of the ScrollableBar implementation, it is helpful to revisit the architecture of Swing, which has already been explained in section 4.3.2. The Swing library is a modern widget set based on the Model-View-Controller (MVC) pattern [GHJV]. But while the classical MVC pattern consists of three independent parts, namely the model, the view and the controller, Swing uses a simplified version of this pattern where the view and the controller part are combined in a so-called delegate [ZuStan, ELW] (see figure 4.9). One of the main responsibilities of the UI delegate is to paint the component it is tied to. In contrast to the AWT library, in Swing it is not the paint() method of every component which does the work of painting itself. Instead, the component's paint() method just calls the paint() method of its delegate with a reference to itself.

The ScrollableBar class

Figure 4.16 shows the class diagram of the ScrollableBar class. As already mentioned, it is derived from JComponent. It also implements the SwingConstants interface in order to easily access the constants HORIZONTAL and VERTICAL which are defined there. ScrollableBar has four properties. The two boolean properties horizontal and small store the orientation of the component and the size of the arrows on the scroll buttons. The integer property inc stores the number of pixels by which the enclosed component will be scrolled if one of the arrow buttons is pressed; smaller values lead to a smoother but slower scrolling. Finally, the wrapped component is stored in the comp property. While horizontal is a read-only property which can only be set in the constructor, the other three properties are read/write bound properties in the sense described in the JavaBeans specification [JaBean]. The following listing shows the two-argument constructor of the ScrollableBar class:

Listing 4.21: ScrollableBar.java [Line 30 to 41]

public ScrollableBar(Component comp, int orientation) {
    this.comp = comp;
    if (orientation == HORIZONTAL) {
        horizontal = true;
    } else {
        horizontal = false;
    }
    small = true; // Arrow size on scroll button.
    inc = 4;      // Scroll width in pixels.
    updateUI();
}

Notice the call to updateUI() in the last line of the constructor. As can be seen in Listing 4.22, updateUI() calls the static method getUI() of the class UIManager to query the right UI delegate and associates it with the current ScrollableBar object.

Listing 4.22: ScrollableBar.java [Line 45 to 52]

public String getUIClassID() {
    return "ScrollableBarUI";
}

public void updateUI() {
    setUI(UIManager.getUI(this));
    invalidate();
}


[Figure 4.16 shows the following UML diagram: ScrollableBar (with the properties horizontal, small, inc and comp, and constructors taking a Component and optionally an orientation) extends javax.swing.JComponent and implements javax.swing.SwingConstants; ScrollableBarUI extends javax.swing.plaf.ComponentUI and implements the ChangeListener, PropertyChangeListener and MouseListener interfaces, holding references to the ScrollableBar, a JViewport and the two scroll JButtons.]

Figure 4.16: The UML class diagram of ScrollableBar and ScrollableBarUI.

UIManager.getUI() calls getUIClassID() (see Listing 4.22) to get the key which is used to query the actual UI delegate from a Look-and-Feel-dependent internal table. Usually, the association of the standard Swing components with the appropriate UI classes is done by the different Look and Feels while they are initialized. However, as we are writing a new component, we have to establish this link manually, as shown in the following listing:

Listing 4.23: ScrollableBar.java [Line 19 to 22]

static {
    UIManager.put("ScrollableBarUI",
                  "com.languageExplorer.widgets.ScrollableBarUI");
}

Notice that linking a component to its UI delegate in this way results in one and the same UI class being used independently of the actual Look and Feel. Besides the getter and setter methods for the corresponding properties, there is no further functionality in the ScrollableBar class. All the painting and user interaction is handled by the UI delegate ScrollableBarUI.

The ScrollableBarUI class

One of the most important methods of the UI classes is installUI(), which is called every time a component is associated with its UI delegate. This gives the UI delegate a chance to properly initialize itself and the component it is responsible for.

Listing 4.24: ScrollableBarUI.java [Line 51 to 106]

public void installUI(JComponent c) {
    sb = (ScrollableBar)c;

    inc = sb.getIncrement();
    boolean small = sb.isSmallArrows();

    // Create the Buttons


Listing 4.24: ScrollableBarUI.java [Line 51 to 106] (continued)

    int sbSize = ((Integer)(UIManager.get("ScrollBar.width"))).intValue();
    scrollB = createButton(sb.isHorizontal()?WEST:NORTH, sbSize, small);
    scrollB.setVisible(false);
    scrollB.addMouseListener(this);

    scrollF = createButton(sb.isHorizontal()?EAST:SOUTH, sbSize, small);
    scrollF.setVisible(false);
    scrollF.addMouseListener(this);

    int axis = sb.isHorizontal()?BoxLayout.X_AXIS:BoxLayout.Y_AXIS;
    sb.setLayout(new BoxLayout(sb, axis));

    scroll = new JViewport() { ... see source code ... };

    Component box = sb.getComponent();
    scroll.setView(box);

    sb.add(scrollB);
    sb.add(scroll);
    sb.add(scrollF);

    // Install the change listeners
    scroll.addChangeListener(this);
    sb.addPropertyChangeListener(this);
}

In our case, the UI delegate queries and stores the component's properties, along with a reference to the component itself, as private instance variables. Furthermore, it creates two arrow buttons and an object of type JViewport which is used to wrap the scrollable component. Based on the orientation of the associated ScrollableBar object, the newly created elements are then added to it using a vertical or horizontal box layout. Notice that the scroll buttons are initially set to be invisible. Finally, the UI object registers itself as a property change listener on the associated component, as a change listener on the viewport and as a mouse listener on the arrow buttons. The UI delegate is informed about every size change of the ScrollableBar object and the wrapped component by receiving a ChangeEvent from the viewport object. Depending on the new sizes, it can change the visibility state of the arrow buttons and lay out the component again. Property changes in the ScrollableBar object are signaled to the UI delegate by a PropertyChangeEvent. Based on these events, it can update the internally cached values of these properties. Finally, the events resulting from the user interactions on the scroll buttons are handled by the different mouse listener methods. The UI delegate keeps a private boolean instance variable pressed which is set to true if a button is pressed and which is reset to false as soon as the button is released or the mouse pointer leaves the button. As can be seen in Listing 4.25, pressing one of the buttons also starts a new thread which scrolls the underlying component by inc pixels in the corresponding direction and then sleeps for a short amount of time. These two actions are repeated in the thread as long as the value of the instance variable pressed is true, while the amount of sleeping time is reduced in every iteration step. This results in a continuously accelerating scrolling speed the longer the user keeps pressing the arrow button.

Listing 4.25: ScrollableBarUI.java [Line 174 to 238]

public void mousePressed(MouseEvent e) {
    pressed = true;
    final Object o = e.getSource();
    Thread scroller = new Thread(new Runnable() {
        public void run() {
            int accl = 500;
            while (pressed) {
                Point p = scroll.getViewPosition();
                ... Compute new view position ...
                scroll.setViewPosition(p);
                try {
                    Thread.sleep(accl);
                    if (accl <= 10) accl = 10;
                    else accl /= 2;
                } catch (InterruptedException ie) {}
            }
        }
    });
    scroller.start();
}

It should be noticed that we need no special paint method for the ScrollableBarUI class, because painting occurs naturally through the standard Swing button and viewport components which we used. After having discussed the main parts of the implementation, it should be evident why the advantages of dividing the functionality of the ScrollableBar class into two classes outweigh the coding overhead. First of all, we cleanly separated the properties of the component from the way it is displayed and how it interacts with the user. Secondly, it is now very easy to define a new UI delegate which renders the component in a different way, or to derive a new UI delegate from the existing one which slightly adapts the appearance or user-interaction properties to a specific Look and Feel.
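As an illustration of this second point, a Look-and-Feel-specific delegate might look as follows. This is only a hypothetical sketch: the three-argument createButton() signature is taken from the call in Listing 4.24, everything else is invented for the example:

import javax.swing.*;
import javax.swing.plaf.ComponentUI;

// A sketch only: a delegate that merely flattens the scroll buttons.
public class FlatScrollableBarUI extends ScrollableBarUI {
    public static ComponentUI createUI(JComponent c) {
        return new FlatScrollableBarUI();
    }

    protected JButton createButton(int direction, int size, boolean small) {
        JButton b = super.createButton(direction, size, small);
        b.setBorderPainted(false); // adapt the appearance, keep the behavior
        return b;
    }
}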

4.4.4 Using the ScrollableBar class

Using the ScrollableBar class is very easy and straightforward. In fact, we can wrap any arbitrary Swing component inside a ScrollableBar object by passing it as an argument to the constructor when creating the object. For the example application shown on the left side of figure 4.13 it was only necessary to change a single line:

JToolBar toolbar = new JToolBar();
...
panel.add("North", toolbar);

from the original Stylepad application into:

JToolBar toolbar = new JToolBar();
...
panel.add("North", new ScrollableBar(toolbar));

in order to make the horizontal toolbar scrollable if the space becomes too small to render it as a whole. In general, the ScrollableBar class is best suited for wide and not very high components in horizontal mode, and for narrow and high components in vertical mode. If used with other components, the scroll buttons would get too big and take up too much space to be really useful.

Menu bars in JFrame objects

As shown in the last section, it is very easy to use the ScrollableBar class in your own applications. Even upgrading existing applications is not very hard. The only problem which may arise is the case where a ScrollableBar should be used as a wrapper for a menu bar which will be added directly to a JFrame object. (Notice that in our example application, the menu bar has been added to a JPanel object before the whole panel has been added to the JFrame object.) The problem arises because JFrame provides a specialized setJMenuBar() method for adding menu bars, and this method expects an argument of type JMenuBar. At first glance, we could just use one of the generic add() methods defined in JFrame's ancestor classes instead. However, if we take a closer look, we will see that the problem is a little bit more complex. First of all, in the case of JFrame, children are not added to the component directly, but to the so-called "root pane", which is a special child component of every JFrame. However, we also cannot add the menu bar directly to the root pane, because the root pane itself also has a special method called setJMenuBar() which expects a JMenuBar object as argument. Using this method for adding menu bars is essential, because only if it is used will the RootLayout layout manager used by the JRootPane class honor the presence of the menu bar. RootLayout, which is a protected inner class of JRootPane, uses the protected JRootPane property menuBar, which has been set by JRootPane.setJMenuBar(), for layout calculations. To cut a long story short, we have to create a new SMJFrame class (which stands for Scrollable Menu JFrame) which overrides the createRootPane() method to return a new, customized root pane class. For this purpose we just derive an anonymous class from JRootPane which overrides the two methods setJMenuBar() and createRootLayout(). setJMenuBar(), the first of these two methods, wraps the menu bar into our ScrollableBar class before storing it as a protected instance variable and adding it to the layered pane, which is a part of the root pane. The second method, createRootLayout(), returns an anonymous class which inherits from JRootPane's protected inner class RootLayout. It overrides the layout methods in that class in such a way that they use the ScrollableBar instance variable for layout calculations instead of the bare menu bar, as was done by the original version of the methods. These modifications finally give the desired result. A call to setJMenuBar() on a SMJFrame object will be forwarded to the customized root pane. There, the menu bar will be wrapped into a ScrollableBar object before it is actually added to the frame. Because the customized root pane uses a customized layout manager, it will handle the scrollable menu bar in the same way in which a JFrame object handles an ordinary menu bar. With respect to all other concerns, SMJFrame behaves exactly like its ancestor JFrame.
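A highly simplified sketch of this construction follows. Apart from SMJFrame, all names are invented, and the layout method is reduced to the essentials; a complete implementation would have to reproduce all the calculations of JRootPane's own RootLayout:

import java.awt.*;
import javax.swing.*;

public class SMJFrame extends JFrame {
    protected JRootPane createRootPane() {
        return new JRootPane() {
            private ScrollableBar wrappedMenu; // our own storage for the wrapped bar

            public void setJMenuBar(JMenuBar menu) {
                wrappedMenu = new ScrollableBar(menu);
                getLayeredPane().add(wrappedMenu, JLayeredPane.FRAME_CONTENT_LAYER);
            }

            protected LayoutManager createRootLayout() {
                return new RootLayout() {
                    public void layoutContainer(Container parent) {
                        Rectangle b = parent.getBounds();
                        Insets in = getInsets();
                        int w = b.width - in.left - in.right;
                        int h = b.height - in.top - in.bottom;
                        // Glass pane and layered pane always cover the whole frame.
                        getLayeredPane().setBounds(in.left, in.top, w, h);
                        getGlassPane().setBounds(in.left, in.top, w, h);
                        int mb = 0;
                        if (wrappedMenu != null && wrappedMenu.isVisible()) {
                            mb = wrappedMenu.getPreferredSize().height;
                            // Lay out the wrapped bar instead of the bare menu bar.
                            wrappedMenu.setBounds(0, 0, w, mb);
                        }
                        if (getContentPane() != null) {
                            getContentPane().setBounds(0, mb, w, h - mb);
                        }
                    }
                };
            }
        };
    }
}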

Limitations

The only limitation on the use of the ScrollableBar class so far is that it cannot handle floating tool bars. This is because JToolBar objects have to be laid out in a container whose layout manager is of type BorderLayout if they are to be floatable, and no other children may be added to any of the other four "sides". This is obviously not the case if the toolbar is wrapped inside a ScrollableBar object. Fixing this problem would require extensive changes in BasicToolBarUI, the UI delegate of JToolBar. Unfortunately, because not all of the methods which would need to be customized are declared public or protected, in fact a complete rewrite of the delegate would be necessary.

4.4.5 Conclusion

This section presented a quite small and simple, yet very powerful container class which fills a gap in the set of standard Swing components. Using it involves no overhead, neither at development time nor at run time, but yields a lot of benefits. The most important ones are better usability and user friendliness, and more robust and intuitive GUI applications.

Chapter 5

LanguageExplorer

5.1 Introduction

LanguageExplorer is a new program for reading texts in electronic form. However, in contrast to other, similar book readers, LanguageExplorer is specialized in displaying several versions of a text in parallel. This may be, for example, an original text along with its translation, or several different translations of a certain text. Therefore LanguageExplorer may be characterized as an electronic synopsis1 which offers comfortable navigation capabilities. Additionally, given a certain position in one text, it allows accessing the corresponding locations in the parallel versions of the text. Furthermore, LanguageExplorer serves as a platform for the integration of arbitrary tools for text reception and analysis. Currently these are dictionaries, powerful search and indexing capabilities, and tools for statistical text analysis. New features like bookmarks, user annotations and text apparatuses are currently being implemented. Another highlight of LanguageExplorer is its ability to cope with texts in virtually any language. Besides the common Western and Eastern European languages, it supports languages like Greek and Russian, but also languages written from right to left like Hebrew and languages written with ideographic scripts like Chinese and Japanese. In fact, even facsimile reproductions and sound can be handled by LanguageExplorer, thus allowing uniform access to texts available in any arbitrary form. LanguageExplorer stores its texts in a modern XML-based file format (see section 2.4 on page 22). Optionally, it supports strong encryption of the content it displays, thus effectively preventing illegal duplication of protected materials. LanguageExplorer has been designed and implemented using cutting-edge software technology. It offers a high degree of functionality and user-friendliness. System independence was one of the main goals during development, so today LanguageExplorer is available for the Linux, Windows and Mac OS X operating systems. Together with LanguageExplorer, which is intended for reading and analysing texts, a second system called LanguageAnalyzer has been developed. LanguageAnalyzer allows the user to create sophisticated, linked editions suitable as input for LanguageExplorer out of simple text-based sources. More information about LanguageAnalyzer is available in chapter 6. While LanguageExplorer may be seen as the viewer part of the project, LanguageAnalyzer is in fact the editor part, which allows the composition of editions for LanguageExplorer.

1synopsis: from Greek, literally "comprehensive view, to be going to see together". A comparative juxtaposition of similar text versions. Traditionally used for the juxtaposition of the four gospels.


5.2 Overview

In this section the basic functionality of LanguageExplorer will be demonstrated by means of some screen-shots. Figure 5.1 shows LanguageExplorer after loading a book. The main part of the program consists of the text areas which display the different versions of the text: in this example, the original German version of the novel "Die Verwandlung" by Franz Kafka together with an English and a Russian translation.

Figure 5.1: LanguageExplorer after loading a book. By clicking the left mouse button on a sentence in the left text area, this sentence as well as the corresponding sentences in the other text areas are highlighted.

But LanguageExplorer consists of more than the menu and the text areas. In the region marked with A in figure 5.1, LanguageExplorer has a tool bar. It can be used to execute most of the commands offered by LanguageExplorer in a fast and comfortable way. Additionally, every text area has its own navigation bar (marked with B in figure 5.1), with the aid of which the books may be navigated section- and chapter-wise. While navigating, all the other text areas may be synchronized reciprocally with the current one. More information about navigation can be found in section 5.4.2 on page 122. Figure 5.2 shows LanguageExplorer with opened dictionary (region C) and KWIC-Index2 window (region D). The size of both of these windows may be adjusted by the user according to his preferences, and they may be opened or closed individually. If a dictionary query is triggered or if a KWIC-Index is generated by the user, the corresponding window will open automatically to the size previously adjusted by the user.

2KWIC-Index is an abbreviation for "KeyWord In Context"-Index. It denotes an index which not only contains every occurrence of the key word, but also a certain amount of the text before and after the key word. KWIC-Index generation is described in-depth in chapter 5.4.3 on page 126.


Figure 5.2: LanguageExplorer with opened dictionary and KWIC-Index window. The KWIC-Index visible in the region marked with D in the figure was produced by simultaneously pressing the Shift key and the left mouse button on the word "back". The dictionary (visible in part C of the window) was opened by simultaneously pressing the Ctrl key along with the left mouse button on the same word.

After the basic functionality of LanguageExplorer has been demonstrated in this section, the next sections will present and explain every single feature in more detail.

5.3 Installation

This section covers the installation of LanguageExplorer. Because graphical installers are available for all the platforms supported by LanguageExplorer, the installation is usually a matter of a few minutes. Therefore the next sections will mainly focus on the peculiarities of the different platforms.

5.3.1 Installation under Windows

Insert the LanguageExplorer CD-ROM into the CD-ROM drive and choose Run... from the Start menu. Type the command D:\windows\setup.exe into the appearing text field. Notice that it may be necessary to replace D:\ with the real name of your CD-ROM drive. Thereafter, follow the instructions given by the installation program. By default LanguageExplorer will be installed into the folder C:\Program Files\LanguageExplorer; however, the target folder may be changed by the user. Please be aware that under Windows NT, Windows 2000 or Windows XP Professional you may need Administrator privileges in order to install LanguageExplorer into the default C:\Program Files folder. After successful installation there will be a new LanguageExplorer menu entry in the Programs submenu of the Start menu. Under this new menu, the entry LanguageExplorer can be used to start LanguageExplorer and the entry Uninstall to remove LanguageExplorer from the system.

5.3.2 Installation under Linux

Insert the LanguageExplorer CD-ROM into the CD-ROM drive and mount it. The following instructions assume that your CD-ROM drive is available under /mnt/cdrom. Start the program /mnt/cdrom/linux/setup.bin and follow the instructions given by the installation program. Depending on which target directory you choose for the installation, you may be required to have root privileges. After the installation has completed successfully, LanguageExplorer can be started with the command /opt/LanguageExplorer/LanguageExplorer, where /opt/LanguageExplorer may have to be replaced with the actual installation path chosen during installation. With the Uninstall program, which is located in the same directory, LanguageExplorer can be removed from the system.

Changing the hotkey for the input method activation

As described in chapter 29 on page 137, LanguageExplorer supports input methods for the input of characters not available on the keyboard. Such an input method may be selected from the input method menu, which can be activated by pressing a certain hotkey combination. By default this is the F4 key. However, this hotkey may be changed by setting the environment variables INPUTMETHOD_SELECTKEY and INPUTMETHOD_SELECTKEY_MODIFIERS. By appending the line export INPUTMETHOD_SELECTKEY=VK_F8 to the end of the .bashrc configuration file, the hotkey can be changed to F8. The file .bashrc is located in the user's home directory. The environment variable INPUTMETHOD_SELECTKEY can be set to the values VK_F1 to VK_F12 and VK_A to VK_Z, corresponding to the keys available on the keyboard. Additionally, the second environment variable INPUTMETHOD_SELECTKEY_MODIFIERS may be set to the value of a modifier key which has to be pressed together with the key defined before in order to activate the input method selection menu. The possible values for the three modifier keys are SHIFT_MASK, CTRL_MASK and ALT_MASK. This variable can also be left unset, in which case pressing the hotkey defined before will be enough to activate the input method selection menu.
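For example, the following two lines at the end of .bashrc would move the activation of the input method selection menu to Ctrl-F8, using the values described above:

export INPUTMETHOD_SELECTKEY=VK_F8
export INPUTMETHOD_SELECTKEY_MODIFIERS=CTRL_MASK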

5.3.3 Installation under Mac OS X

Insert the LanguageExplorer CD-ROM into the CD-ROM drive. In the folder macosx of the CD-ROM, click on the archive setup.sit. This will expand the installer program and create the application setup in the folder you chose. By executing setup, the actual installation process will be started. By default LanguageExplorer will be installed into the applications folder, which may require administrator privileges. However, an arbitrary installation folder can be selected during the installation process. After completing the installation, LanguageExplorer can be started by clicking the LanguageExplorer icon on the desktop.


5.4 Handling

This section gives a brief description of every single function available in LanguageExplorer. Functions are grouped into sections based on their subject, where every section starts with the description of the most important functions for a given area. Subsections contain the description of special auxiliary functions.

5.4.1 Loading books

After starting LanguageExplorer, the first thing to do before any meaningful work is possible is to open a book. This can be achieved by choosing Open Book from the File menu or by clicking the Open Book button (see left margin) on the tool bar.

Figure 5.3: The open book dialog. After the file verw_prev_meta.leb has been chosen in the region marked with B, the accessory component visible in region C displays the bibliographic data of the selected book.

Like every menu entry available in LanguageExplorer, the Open Book menu entry may be reached by using a keyboard shortcut. For the Open Book menu entry this so-called accelerator is the combination of the Control key together with the O key on the keyboard. It will bring up the open book dialog shown in figure 5.3. The open book dialog is a default file dialog extended by a custom accessory component tailored specially for LanguageExplorer (see region C in figure 5.3). While region A of the dialog shows the current folder, region B displays all the available files in that folder. If a file is chosen which is in the LanguageExplorer book format, the accessory component displays the bibliographic data of the corresponding book. It consists of the author's name, the languages of the different book versions, and the titles of each version in the corresponding language. An additional piece of information displayed in the accessory is the availability of dictionaries for the selected book. LanguageExplorer supports two kinds of dictionaries: global ones and builtin dictionaries. While global dictionaries are available to all books, builtin dictionaries are packed together with the books into the LanguageExplorer book files. They can be used only by the corresponding book and they usually contain only the words


occurring in that book. If both dictionaries are present for a certain book, LanguageExplorer uses a two-step algorithm when looking up a word, in which the builtin dictionary is always favored over the global one. More information on the dictionary function may be found in section 5.4.4 on page 128. Finally, a book file may be opened by double clicking on the corresponding book file or by pressing the Open button for an already selected book file. If the text areas contain strange character glyphs or don't display any characters at all after loading a new book, this indicates that the current font is not capable of displaying that text. It may be necessary to select a new font using the LanguageExplorer font selection dialog, which is described in section 5.4.8 on page 132.

Encrypted books

As already mentioned in chapter 5.1, LanguageExplorer also supports encrypted books. They have the same file suffix as usual, unencrypted books (namely .leb) and they are displayed in the same fashion as usual books in the open file dialog. However, when opened for the first time, a so-called authentication dialog, as shown in figure 5.4, is presented to the user.

Figure 5.4: The authentication dialog with a key entered by the user.

It prompts the user for a key and a password for the selected book. This key/password combination is usually user- and book-dependent and was created by the publisher of the book for every user who bought that book. If you didn't receive your personal key and password combination for an encrypted book when buying it, please contact your dealer or the publisher of the book. If the "Remember Password" check box is selected when entering the password, LanguageExplorer will store an encrypted version of the password in the personal preference file of the current user, in order to avoid the password dialog the next time the same book is loaded. Because the key for every encrypted book is stored by LanguageExplorer automatically, it has to be entered only when loading an encrypted book for the very first time.

5.4.2 Navigation

After loading a book as described in the previous section, LanguageExplorer looks as shown in figure 5.1 on page 118. By dragging the drag bar which is located between the different text areas, the size available to each of them can be customized. This makes sense, for example, if one text area contains a more condensed version of a text than the other ones. By adjusting their widths, the text areas can usually be customized in such a way that they hold approximately the same amount of information per window. Pressing the left mouse button on an arbitrary sentence in one of the text areas will highlight that sentence and all the corresponding sentences in the other text areas as well. It must be noticed that in the other text areas more than one sentence may correspond to the sentence selected first. Under certain circumstances it may also be possible that there is no corresponding sentence in a particular text version in one of the other text windows. Pressing the right mouse button in one of the text areas will remove the highlighting in each of them again. The cursor keys (see left margin) as well as the PageUp and PageDown keys can be used to navigate the text inside the text areas. While the cursor keys scroll the text line by line, the PageUp and PageDown keys (see left margin) may be used to scroll the text page-wise, where a page always corresponds to the currently visible text in the corresponding text area. Page-wise scrolling is done in such a way that there will always be at least one line of overlap between the page which was displayed last and the new one. The actions just described can be initiated with the mouse as well. For this, the mouse has to be pressed on the scrollbar (see left margin) located on the right side of every text area. Clicking the small arrows of the scroll bar corresponds to the line-by-line scrolling done with the cursor keys, while just clicking inside the scrollbar area is equivalent to the page-wise scrolling done with the PageUp and PageDown keys. By dragging the scrollbar with the mouse to a fixed position, it is possible to directly navigate to the text position which corresponds to the relative location of the scroll bar. Independently of the navigation method used, the scrollbar position always signals the relative position of the displayed page in relation to the whole text.

Figure 5.5: A picture of the navigation bar. The text area belonging to this navigation bar just displays the first section in the second chapter in the first part of its book.

As a last possibility, the navigation bar (see figure 5.5) located at the bottom of every text area (see region B in figure 5.1) may be used for structural navigation of the text. By clicking the corresponding arrow buttons with the mouse, the text may be navigated section-, chapter- or part-wise back and forward. It is also possible to jump to a specific one of these structures by entering its number into the appropriate text field. Additionally, it is possible to jump to the very first and the very last element of the aforementioned structures (e.g. the first or the last section of a chapter) with the help of the Begin and End buttons (see left margin). Similarly to the scrollbars, the navigation bars are always synchronized with their corresponding text area. They always show the element which is displayed in the upper left corner of the text area, no matter which means of navigation is used.

Synchronizing the text areas

One of the main features of LanguageExplorer is its ability to show different versions of a text in parallel, such that the corresponding parts of each version are always visible. Usually the synchronization is done automatically. Even when navigating in one of the text areas as described in the previous section, the other text areas are always updated to show the corresponding parts. However, sometimes this synchronization may be unnecessary or even hindering. For example, when searching in one of the text areas (see section 5.4.5 on page 128) it may be helpful to temporarily disable the synchronization. And indeed this is possible in LanguageExplorer: every text area may be individually synchronized or unsynchronized with the other ones.

Dissertation der Fak. f. Informations- u. Kognitionswissenschaften, Univ. Tubingen¨ - 2004 124 Chapter 5 · LanguageExplorer

Synchronization for two text areas

By clicking the left synchronization button on the tool bar (region A in figure 5.1 on page 118), the left text window will be unsynchronized from the right one. This means that the right window will not follow any navigation in the left window. Notice that the synchronization buttons are so-called toggle buttons. Clicking the left button once again will reconnect the left text area to the right one, such that all movements done in the first one will be followed by the second one. The state of the button is indicated by the small check mark in the lower right corner of the button. If the check mark is present, the corresponding window is connected to its sibling window. If the check mark is absent, as shown in the right icon on the left margin, then the navigation in the corresponding text area is independent of the second one. The hot key Ctrl-L or the menu entry Options→SyncLeft may be used instead of the synchronization button located on the tool bar to configure the synchronization behavior of the left text area. The right window may be synchronized with the left one in the same manner as the left window with the right one. The user may choose between the right synchronization button on the tool bar (see left margin), the menu entry Options→SyncRight and the hot key Ctrl-R.

Synchronization for several text areas The synchronization buttons on the tool bar automatically switch their appearance in the way shown on the left margin if a book with more than two versions of a text is loaded. Since it is no longer possible to represent the synchronization status of every single window by its own button, a different approach was taken. The left synchronization button serves to synchronize a single window with all the other windows, while the right button may decouple a window from the other ones. Clicking on the left synchronization button changes the cursor to the shape shown on the left margin. After the cursor has changed, it is possible to synchronize an arbitrary text area with all the other ones by simply clicking with the mouse into that text area. After clicking, the mouse cursor changes back to its default shape. If the mouse is clicked outside of a text area, the cursor also resumes its default shape and no action is taken at all. Notice that after a book has been loaded, all the text areas are synchronized by default. Clicking on the right synchronization button changes the cursor to the shape shown on the left margin. Subsequent clicking with this mouse cursor into a text area decouples the movements in that window from all the other windows. As with the left synchronization button, clicking into any other part of the application than a text area leads to no action at all. After the first click, the mouse cursor changes back to its initial shape. For books with several text versions, the same hot keys and menu entries for text synchronization are available as for two-version books. The menu entry Options→Synchronize Window and the hot key Ctrl-L have the same effect as pressing the left synchronization button, whereas the functionality of the right synchronization button is also covered by the menu entry Options→Unsynchronize Window and the hot key Ctrl-R.
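At its core, this toggleable synchronization can be modeled with mutually mirrored scrollbars. The following minimal Swing sketch was written for this description and is not code from LanguageExplorer; the class and field names are invented. It mirrors the position of one scrollbar to another as long as a flag is set, which corresponds to toggling a synchronization button:

    import java.awt.event.AdjustmentEvent;
    import java.awt.event.AdjustmentListener;
    import javax.swing.JScrollBar;

    // Mirrors one scrollbar's position to another while 'active' is true.
    public class ScrollSync implements AdjustmentListener {
        private final JScrollBar other;
        private boolean active = true; // toggled by a synchronization button

        public ScrollSync(JScrollBar other) { this.other = other; }

        public void setActive(boolean active) { this.active = active; }

        public void adjustmentValueChanged(AdjustmentEvent e) {
            // The value check keeps two mutually registered listeners
            // from triggering each other endlessly.
            if (active && other.getValue() != e.getValue())
                other.setValue(e.getValue());
        }
    }

Registering one such listener on each of two scrollbars, each pointing at the other, keeps them in lock step. The actual synchronization in LanguageExplorer maps corresponding text structures rather than raw scrollbar values, but the toggling logic is the same.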

Interchanging the text areas

Right after a book has been loaded into LanguageExplorer, the different versions of the book are displayed from left to right in the text areas, in the same order in which they are stored in the book file. This is also the order in which the dictionaries appear in the dictionary view (region C in figure 5.2 on page 119) of LanguageExplorer. The user may change the order of the different text versions, however; the order of the dictionaries is then automatically updated to always reflect the text area order.


Figure 5.6: A picture of LanguageExplorer displaying a document marked up with multiple encodings. The encoding chooser in the toolbar in the upper part of the text areas can be used to select the active encoding.

Interchanging with two text areas If a book consists of only two versions, interchanging their text areas can be done simply by clicking the swap button on the tool bar (see left margin). Alternatively, the menu entry Options→Swap Windows or the hot key Ctrl-S may be used.

Interchanging with several text areas If a book contains more than two different versions of a text, pressing the swap button on the tool bar does not automatically interchange two text areas, but instead changes the mouse cursor to the shape shown on the left margin. Interchanging two arbitrary text areas is now simply a matter of subsequently clicking with the mouse into the two windows. Notice that after successfully clicking into the first window, the mouse cursor will slightly change again into the form shown on the left margin. Clicking with this changed mouse cursor on any other region than a text area will abort the interchanging operation and reset the mouse cursor to its original form. The menu entry Options→Swap Windows as well as the hot key Ctrl-S may also be used to start the interchanging operation for several text areas.

Aligning the text areas

Usually, text layout is done in every text area independently of the other text areas. However, LanguageExplorer offers the possibility to align the text in all text areas section-wise. This gives all corresponding sections in all text areas the same vertical extent. It may be useful, for example, to get a quick overview of parallel text versions. Especially for synopses where some structures of a given text have no analogous parts in the parallel versions, it may help to identify the gaps faster. The default setting after starting LanguageExplorer is the normal, non-aligned text layout. By pressing the align text button on the tool bar, this may be changed by the user at any time. The align text button is a toggle button. Its state is displayed by a small check mark in its lower right corner. If this check mark is present, the sections of the different text versions are aligned; otherwise they are laid out normally.


The align text action may also be reached from the menu entry Options→Align View or by pressing the hot key combination Ctrl-A.

Choosing the text encoding

Starting with version 2.0, LanguageExplorer can handle documents which are marked up with different encodings. If a document comes with different encodings, the corresponding text area will have an additional toolbar with an encoding chooser element as shown in figure 5.6. The user can select an active encoding by using this encoding chooser. The layout and the part of the content visible in the text window may change depending on the currently active encoding. Notice that the navigation bar at the bottom of every text window, which can be used to easily navigate within the document, always adapts to and shows the structures of the active encoding.

5.4.3 The KWIC-Index

One of the most helpful features provided by LanguageExplorer is its ability to create arbitrary KWIC-Indices on the fly. As explained in chapter 5.1, KWIC-Index is an abbreviation for “KeyWord In Context”-Index. It denotes an index which not only contains every occurrence of the given key word, but also a certain amount of text before and after that key word. Usually the index is sorted alphabetically based on the suffix of the key word. The advantage of such an index is the ability to see at once the different contexts in which the key word appears in the text.

With LanguageExplorer, the KWIC-Index for a word can be created by holding down the Shift key and pressing the left mouse button on the desired word in the text. Thereafter the KWIC-Index window as shown in part D of figure 5.2 on page 119 will open and display the generated index. For systems which already define the mentioned key combination, an alternative way of generating KWIC-Indices is available. Simultaneously pressing the Alt and the K key on the keyboard will augment the mouse cursor with a small K in its lower right corner (see left margin). Clicking a word with this mouse cursor will then generate a KWIC-Index of the corresponding word as well. After the KWIC-Index has been generated, or after the mouse cursor leaves the original text window, the cursor will be restored to its default shape.

The generation of a KWIC-Index automatically opens the KWIC-Index window. However, this window may be closed and reopened at any time by using the KWIC-Index button on the tool bar. The content of the KWIC-Index window is preserved until a new index for another word is created. Similarly to the synchronization buttons described in chapter 5.4.2 on page 123, the KWIC-Index button has a small check mark on its lower right corner which indicates whether the KWIC-Index window is opened or closed. Opening and closing this window may also be performed with the hot key Ctrl-K or by executing the Options→KWIC menu entry.

Another characteristic of the KWIC-Index button compared with the other tool bar buttons described so far is the small arrow on the lower left side of the button. It indicates that a context menu is reachable from this button by pressing (not clicking) it for a while. As shown in figure 5.7, the context menu pops down right under the button and allows further customization of the KWIC-Index creation process. In the upper part of the context menu the user may choose how the KWIC-Index will be created from the selected key word. The default is to use just the plain word as key word. It is however possible to create a KWIC-Index not only for the simple word which has been selected, but for all the words which begin with, end with or contain the selected word. This can be achieved by selecting the options “With Right Context”, “With Left Context” and “With Left and Right Context” respectively.


Figure 5.7: Opening the KWIC-Index context menu.

For example, a KWIC-Index for the word “in” with the option “Without Context” would contain just the word “in”. Together with the option “With Right Context” it could also contain the word “inside”, with the option “With Left Context” it could additionally contain the word “within”, and finally, if the option “With Left and Right Context” had been chosen, all the words which contain “in”, like for example “running” or “window”, would appear in the index.

In the lower part of the context menu it is possible to choose how the entries of the index will be sorted. Alphabetic sorting means that the entries of the index will be sorted alphabetically with respect to the trailing context of the key word. It must be taken into account that key word suffixes, which can occur with the option “With Right Context”, are counted as trailing context when sorting. So, for example, a sorted KWIC-Index with right context for the word “in” would contain the sentence “inadequate clothing...” sorted before the sentence “in both cases...”.

In LanguageExplorer, KWIC-Indices can also be created from the Find-Dialog. It offers more sophisticated possibilities, like for example ignoring the case of a word or creating KWIC-Indices for arbitrary patterns described by regular expressions. More information on this can be found in section 5.4.5 on page 128, where the Find-Dialog is described.

Once the KWIC-Index has been generated, it contains a single line for every occurrence of the key word. In this line, the key word is highlighted and centered, so all the key words are displayed one beneath the other. Notice that highlighting is done only for the original key word and not for possible suffixes or prefixes of the key word which may be present because of the various context options. Navigation in the KWIC-Index window is the same as in the usual text windows (see section 5.4.2 on page 122), with the only difference that clicking with the left mouse button on a sentence in the KWIC-Index window will highlight that sentence in the text window from which the KWIC-Index has been created. Additionally, the corresponding sentences in all the other windows will be highlighted as well, and all the sentences will be made visible in their windows. All this happens independently of the synchronization settings for the different windows. In addition to the usual means of navigation, the KWIC-Index window supports the left and right cursor keys to move the whole content of the window to the left or to the right.

The size of the KWIC-Index window is customizable in the same way as the size of the different text areas: by dragging the corresponding drag bar (see left margin) with the mouse to the desired position. Clicking the small arrows on the left side of the drag bar is another possibility of opening and closing the whole window.
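The index generation itself can be pictured in a few lines of Java. The sketch below was written for this description and is not the LanguageExplorer implementation; the helper name and the width parameter are invented. It collects all occurrences of a key word together with some surrounding text, extending the match to whole words according to the context options:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class KwicSketch {
        // Returns one context line per occurrence of the key. The leftCtx and
        // rightCtx flags correspond to the "With Left/Right Context" options;
        // width is the number of context characters kept on each side.
        static List<String> index(String text, String key,
                                  boolean leftCtx, boolean rightCtx, int width) {
            String pattern = (leftCtx ? "\\p{IsL}*" : "\\b")
                           + Pattern.quote(key)
                           + (rightCtx ? "\\p{IsL}*" : "\\b");
            List<String> lines = new ArrayList<String>();
            Matcher m = Pattern.compile(pattern).matcher(text);
            while (m.find()) {
                int from = Math.max(0, m.start() - width);
                int to = Math.min(text.length(), m.end() + width);
                lines.add(text.substring(from, to).replace('\n', ' '));
            }
            return lines;
        }
    }

Sorting by the trailing context and centering the key words, as the KWIC-Index window does, is omitted here for brevity.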


5.4.4 The dictionary

As already mentioned in the introduction, one feature of LanguageExplorer is its ability to integrate and use different dictionaries. It is possible to use general dictionaries, which are available to all the books in the corresponding languages, or special dictionaries which are integrated into the books and usually contain only the vocabulary used in them. Whether dictionaries exist for a book at all, and whether they are global or built in, may be determined at load time by using the accessory component of the File-Dialog (see figure 5.3 on page 121).

If at least one dictionary is present, it is possible to query it for a certain word by simply clicking with the left mouse button on that word while simultaneously pressing the Ctrl key. As with the KWIC-Index generation, there is a second way to query the dictionary. Pressing the Alt-D key combination will change the cursor by adding a small D to its lower right corner, as shown in the picture on the left margin. Querying the dictionary is then a simple matter of clicking the desired word with this mouse cursor. If at least one entry is found in the dictionary for the selected word, the dictionary window in the upper part of LanguageExplorer will open automatically and display the matching results. By using the dictionary button from the tool bar it is possible to open and close the dictionary window as desired. The button's functionality, which conforms to that of the KWIC button described in the last section, may also be reached from the menu entry Options→Dictionary or by using the keyboard shortcut Ctrl-D.

Figure 5.8: Opening the dictionary context menu.

The size of the dictionary window may be adjusted in the same way as the size of the KWIC-Index window: by moving the corresponding drag bar. The arrows on the left side of the drag bar can be used as an alternative for opening and closing the window. The only difference during navigation in the dictionary window compared to the other LanguageExplorer windows is the fact that clicking with the mouse has no effect in this window.

Pressing the dictionary button for a while will open a context menu which allows some customization of the dictionary look-up process. As can be seen from figure 5.8, it not only resembles the KWIC-Index context menu, it also has the same options concerning the context of the word to query. The only difference compared with the KWIC-Index generation is the fact that the dictionary look-up is always case insensitive.

5.4.5 Searching

The find dialog (see figure 5.9) is currently the most complex dialog offered by LanguageExplorer. It can be used to search the text of the loaded book for arbitrary strings or regular expressions3. Instead of scanning for individual occurrences of the search item, it is also possible to generate a KWIC-Index (see section 5.4.3 on page 126) which contains all the appearances of that item.

3Regular expressions are search patterns which may contain control characters with a special meaning during searching. More information about regular expressions can be found in section 5.4.6 on page 130.

Figure 5.9: The Find-Dialog.

The Find part of the find dialog contains a text field for entering the desired word or expression to search for, and two buttons. The arrow button may be used to open a pull down menu with the history of the last few search terms, while the clear button can be used to clear the text field. A search item is entered into the history list of the pull down menu only after it has been searched for at least once.

In the Options part of the dialog it is possible to choose how to search for the search item. The “Case Sensitive” check box selects whether the search will be case sensitive, the “Whole words only” check box selects whether the search will only find the search item as a single word, and finally the “Regular Expression” check box selects whether the search item should be interpreted as a regular expression. Finally, the window which will be searched for the search term may be specified in the Window part of the dialog by simply selecting it from the corresponding pull down menu.

After all the search options have been specified, the search process may be started with the buttons located in the lower part of the dialog. It is possible to search forward in the corresponding text area as well as backward. The search process always starts in the upper left corner of the visible part of the text area for the forward search, and in the lower right corner of the visible part of the text window for the backward search. Thereafter searching continues relative to the last occurrence of the search item. However, several peculiarities have to be taken into account. The find dialog is a so-called “non-modal dialog”, with the consequence that it is possible to navigate in any of the LanguageExplorer windows while the dialog is displayed, create a KWIC-Index or even look up a word in the dictionary. If no sentence has been marked in the corresponding search window before the search is resumed, the search will continue as described above. If however a sentence has been selected in between, forward searching will continue at the beginning and backward searching at the end of the last selected sentence respectively.

The Reset button can be used to reposition the visible part of the actual text area to the position valid before the find dialog was called, or before the target window in the find dialog was changed for the last time. The Cancel button quits the find dialog, however without repositioning the current view position. Finally, the KWIC button can be used to create a KWIC-Index of the search item. Because the search item can be interpreted as a regular expression, the KWIC-Indices generated this way can be much more complex than the ones created in section 5.4.3. If the KWIC-Index is generated for a regular expression, the whole text string that matches the expression will be taken as key word. And because of the properties of regular expressions, these key words may well be different text strings for the same regular expression.


Sorting is done based on the suffix which follows the text string that was matched by the regular expression, and based on the settings made in the KWIC-Index context menu (see section 5.4.3 on page 127). Notice that it is possible to generate a KWIC-Index which is case insensitive with respect to the key word by simply unselecting the “Case Sensitive” check box.

5.4.6 Regular expressions

Regular expressions are search patterns which can contain special control characters. These special characters are called meta characters. They must be quoted with a preceding \ character to be treated as ordinary characters. There are many different dialects of regular expressions, which usually differ in their meta characters and in the extensions they add to the classical regular expressions. LanguageExplorer uses a syntax similar to the one known from Perl regular expressions [PeReEx], with some extensions for Unicode processing [UnReEx]. The meta characters available in the LanguageExplorer flavor of regular expressions are the following: “()[]{}\ˆ$.|?*+”.

The following table lists the most important meta characters and explains their semantics; the section closes with some examples. More information about regular expressions can be found for example in J. Friedl's book “Regular Expressions” [Friedl].

Pattern          Matches the following text:

Single letters and characters
x                the character “x”. “x” may be any character except a meta character.
\x               the special character “x”, where “x” has to be a meta character (e.g. “\.” for the dot sign “.”).
\uhhhh           the Unicode letter with the hexadecimal value hhhh (e.g. “\u0416” for the Russian letter “Ж”).

Character classes
[abc]            one of the characters “a”, “b” or “c”. A simple character class.
[ˆabc]           any character except “a”, “b” or “c”. A negated character class.
[a-z]            all the characters between “a” and “z”. A simple character range.
[a-m[v-z]]       all the characters between “a” and “m” or between “v” and “z”. The union of two character classes.
[a-o&&[l-z]]     all the characters between “l” and “o”. The intersection of two character classes.
[a-z&&[ˆl-o]]    all the characters between “a” and “k” and between “p” and “z”. The subtraction of two character classes.

Predefined character classes
.                any single character.
\p{InBlock}      a character in the Unicode block “Block”. “Block” can be for example “Greek”, “Cyrillic” or “Arabic”4.
\P{InBlock}      any character except the ones defined to be in the Unicode block with the name “Block”.
\p{IsCat}        any character with the Unicode category “Cat”, for example \p{IsLu} for uppercase letters5.
\P{IsCat}        any character except the ones with the Unicode category “Cat”.

Logical operators and quantifiers
XY               the regular expression X followed by the regular expression Y. The simple concatenation.

4Block may be any Unicode block name with the white space characters removed from the name. Table A.1 in appendix A lists all the valid Unicode block names.
5The Unicode character categories are listed in table A.2 in appendix A.


X|Y              the regular expression X or the regular expression Y. The simple alternation. The regular expression he|she for example matches “he” as well as “she”.
(X)              the regular expression X. The parentheses are used to delimit a capturing group (see next operator). They also override normal operator precedence. While the expression r(unn)|(ead)er for example matches all the words containing either “runn” or “eader”, the pattern r(unn|ead)er only matches the words “runner” and “reader”.
\n               the text corresponding to the n-th capturing group. Every text that matches the part of a regular expression enclosed by parentheses is called a capturing group. Capturing groups are stored during pattern matching from left to right and numbered from 1 to 9. The expression \1 for example matches exactly the same text that was previously matched by the first capturing group.
X?               the regular expression X once or not at all. The expression s(ing)? for example would match “s” and “sing” but not “singing”.
X*               the regular expression X zero or more times. The expression s(ing)* for example would match “s”, “sing” and “singing”.
X+               the regular expression X one or more times. The expression s(ing)+ for example would match “sing” and “singing” but not “s”.
X{n}             the regular expression X exactly n times. The expression s(ing){2} for example would match only “singing” but not “s” or “sing”.
X{n,}            the regular expression X at least n times. The expression s(ing){1,} for example would match “sing” and “singing” but not “s”.
X{n,m}           the regular expression X at least n times but not more than m times.

Even if regular expressions seem quite complicated at first glance, it may nevertheless be useful to learn how to use them. As a motivation, the following paragraphs contain some interesting examples.

The regular expression [\p{InCyrillic}&&[\p{IsLl}]] matches all the Cyrillic lower case characters. It is the intersection of the set of the Cyrillic characters with the set of the lower case characters.

The regular expression6 (␣\p{IsL}+)(␣\p{IsL}+){2,3}\1 matches every repetition of an arbitrary word which is separated by at least two but no more than three other words (e.g. “..to pay attention to..” or “...he felt as if he...”). In the example, the first parenthesized part (␣\p{IsL}+) matches a space character followed by at least one letter; this corresponds to a word. Notice that because this part of the expression is parenthesized, it is stored as the first capturing group. The second part of the regular expression (␣\p{IsL}+){2,3} therefore matches at least two but not more than three single words. Finally, the last part \1 matches the first capturing group, that is the first word which has been matched, together with its leading space character.

6In this example the ␣ character is used instead of the usual space character in order to increase readability.
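The notations used above (character class intersection, \p{In...} block names and \p{Is...} categories) are also supported by Java's java.util.regex package, so the two examples can be tried out with a small program. The following sketch was written for this description and is not part of LanguageExplorer:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class RegexExamples {
        public static void main(String[] args) {
            // Lower case Cyrillic letters: the Cyrillic block intersected
            // with the category of lower case letters.
            Pattern cyrLower = Pattern.compile("[\\p{InCyrillic}&&[\\p{IsLl}]]");
            System.out.println(cyrLower.matcher("ж").matches()); // true
            System.out.println(cyrLower.matcher("Ж").matches()); // false (upper case)

            // A word repeated after two or three intervening words.
            Pattern repeated = Pattern.compile("( \\p{IsL}+)( \\p{IsL}+){2,3}\\1");
            Matcher m = repeated.matcher("we want to pay attention to details");
            if (m.find())
                System.out.println("\"" + m.group() + "\""); // " to pay attention to"
        }
    }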

5.4.7 Audio output

Starting with version 2.0, LanguageExplorer supports audio output of the document content. Two different forms of audio output are supported. Some LanguageExplorer books may be bundled and linked with audio files which contain a spoken version of the whole book. If this is not the case, there is still a chance that the language of the book is supported by the speech synthesizer built into LanguageExplorer.



While the quality of this synthesizer is not comparable with that of a professional speaker, it nevertheless gives the reader an idea of how a sentence sounds in the corresponding language. If either of the two conditions just described is true, the speaker button in the local toolbar of every text window (see figure 5.6) will be active. Pressing this active speaker button will read the currently selected text aloud.

5.4.8 Configuration

This chapter discusses the various configuration properties offered by LanguageExplorer which do not apply to special functions but to the program and its user interface as a whole. For convenience, most of these configuration options are stored persistently between subsequent executions of the program, so they have to be adjusted only once.

Look and Feel

LanguageExplorer offers the possibility of changing the Look and Feel of the application at run time. Different Look and Feels are provided, and every Look and Feel may be used with different color themes. The Look and Feel, as well as the current color theme, can be changed by invoking the Look and Feel sub menu of the Options menu as shown in figure 5.10.

Figure 5.10: Setting the Look and Feel and the color scheme.

The user is advised to try the available Look and Feels and color themes and to choose the combination which is most convenient. Like other settings, the Look and Feel settings are preserved in the personal preferences files between different LanguageExplorer sessions.
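Run-time Look and Feel switching of this kind is a standard capability of Java's Swing toolkit. The following minimal sketch was written for this description, with an invented class name, and shows the essential steps: installing a Look and Feel by its class name and updating the widget trees of all open frames:

    import java.awt.Frame;
    import javax.swing.SwingUtilities;
    import javax.swing.UIManager;

    public class LafSwitcher {
        // Installs the given Look and Feel and repaints all open frames so
        // that the change takes effect immediately, without a restart.
        static void switchLookAndFeel(String className) {
            try {
                UIManager.setLookAndFeel(className);
                for (Frame f : Frame.getFrames())
                    SwingUtilities.updateComponentTreeUI(f);
            } catch (Exception e) {
                System.err.println("Cannot install " + className + ": " + e);
            }
        }

        public static void main(String[] args) {
            switchLookAndFeel(UIManager.getCrossPlatformLookAndFeelClassName());
        }
    }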

Font selection

The font dialog shown in figure 5.11 offers the possibility to select the fonts used to display the LanguageExplorer books on the screen. Font selection is usually based on several criteria, the first of which is the personal taste of the user. However, usability should be taken into account as well, and fonts which are well readable on the screen should be preferred. The most important aspect of font selection is the question which character glyphs are supported by a given font, and whether a font is capable of displaying all the characters available in a book. This is not a trivial task considering that LanguageExplorer books may contain arbitrary UNICODE7 characters [UNI].

The UNICODE standard defines about 60,000 characters today. Starting with the well known Latin characters defined in ASCII, it also defines, among others, the letters for the Arabic, Hebrew, Cyrillic, Indic, Thai or Ethiopian scripts, but also Chinese, Korean and Japanese ideographs. Unfortunately, only few fonts exist which contain all the characters defined in UNICODE. Therefore LanguageExplorer offers the possibility to select a different font for every single text window. That way it is possible to read different versions of a book in parallel even if no single font is available which contains all the needed characters. Different fonts which contain only the characters needed for a single version will suffice.
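Whether a font covers all the characters of a given text can also be checked programmatically; in Java this is what Font.canDisplayUpTo is for. The following sketch is illustrative only (the Cyrillic sample string is invented) and lists all installed fonts able to render the sample:

    import java.awt.Font;
    import java.awt.GraphicsEnvironment;

    public class FontCoverage {
        public static void main(String[] args) {
            // A Cyrillic sample; a real check would use e.g. a book's title line.
            String sample = "Превращение";
            GraphicsEnvironment ge =
                GraphicsEnvironment.getLocalGraphicsEnvironment();
            for (Font f : ge.getAllFonts()) {
                // canDisplayUpTo returns -1 if every character is displayable.
                if (f.canDisplayUpTo(sample) == -1)
                    System.out.println(f.getFontName());
            }
        }
    }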

Figure 5.11: The font selection dialog.

In the Window part in the upper left corner of the font selection dialog, the window for which the font selection is to be done can be chosen. It is possible to select a single window here, or to select all the windows in order to set the same font for all windows simultaneously. Basically, it is preferable to use the same font for all windows because this leads to a more balanced presentation. However, in the above mentioned case where a font doesn't cover all the required characters, different fonts have to be used.

The Preview part of the font dialog displays the title of every text version of the current book, each in its own text field. These text lines may be used to check whether the desired font supports the characters needed by the corresponding text version. It is also possible to choose the window for which the font should be changed by simply clicking into the text field with the corresponding title. Clicking into the Preview area outside of any of the text fields will select all the windows for the font change.

7UNICODE is a consortium which developed a character encoding system for most of the languages used in the world today. This coding system has been approved as an international standard under the number ISO/IEC-10646.


Finally, a new font can be selected in the Font part of the font selection dialog. Clicking on one of the displayed font names will select the font and update the Preview panel in order to reflect the font change. Depending on whether a single window or all the windows have been selected for update, only one or all of the text fields will change. The same holds true if a new font size is selected in the Font panel.

There are small editable text fields above the font name and font size selection lists. They can be used to manually enter the desired font name or font size. For the font name, it is sufficient to enter the first unique letters of a name in order to select it. While the new input for the font name has to be present in the name list in order to be acceptable, it is possible to enter size values not offered in the size list. Such new values will be inserted into the list.

The Color part of the font dialog offers pull down menus for the selection of the foreground, background and underline color respectively. Any changes made in this panel are reflected immediately in the Preview panel as well.

The Reset button can be used to undo the changes made so far in the font dialog box. Pressing it will only reset the settings changed since the dialog was opened. If a single window has been selected in the Window part of the dialog, only changes for that particular window will be undone; otherwise, all the font attributes for all the windows will be reset to their initial values.

It is possible to make the actual changes persistent between different LanguageExplorer executions by selecting one of the options in the “Save Options” part of the font dialog. If neither of the two check boxes is selected, the changes will be effective only for the current LanguageExplorer session and will be lost when LanguageExplorer is started the next time. With the Book option, the actual settings will be saved for the current book. If the book is reloaded at any later time, the current font settings will be immediately applied to the corresponding text windows. Using the Global option when leaving the dialog will save the current settings as the default LanguageExplorer settings, which will be loaded at every program start up and for books for which no font settings exist yet.

The save options just mentioned apply only if the dialog is left by pressing the OK button. This will store the font settings in the desired way and update the text windows to reflect the changes as well. All the windows will be updated simultaneously in the way displayed by the Preview panel of the font dialog, no matter which window was selected in the dialog when the OK button was pressed. Leaving the font dialog with the Cancel button discards all the changes made so far and leaves the text areas of LanguageExplorer unchanged.

The user interface language

One of the nice features of LanguageExplorer is its ability to switch the language of the user interface elements at run time, without the need to restart the whole program.

Figure 5.12: The LanguageExplorer locale chooser.

Dissertation der Fak. f. Informations- u. Kognitionswissenschaften, Univ. Tubingen¨ - 2004 5.4 · Handling 135

Switching the user interface language at run time can easily be done with the locale chooser shown in figure 5.12. The locale chooser is a pull down menu which can be opened by clicking the small arrow on its right side. In the closed state it displays the current language, while it offers a list of available languages in the open state. LanguageExplorer is fully localized8 for German, English, Russian and French. When switching to a language not yet fully supported by LanguageExplorer, all string resources which are not localized are displayed using their English default values.
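This English fallback corresponds to the standard resource bundle mechanism of Java, where a missing translation is resolved along a chain of bundles down to a base bundle. The sketch below illustrates the idea; the bundle name Labels and the class are invented for this description:

    import java.util.Locale;
    import java.util.MissingResourceException;
    import java.util.ResourceBundle;

    public class UiStrings {
        private static ResourceBundle bundle =
            ResourceBundle.getBundle("Labels", Locale.ENGLISH);

        // Called by the locale chooser; the UI then re-queries all its labels.
        static void setLocale(Locale locale) {
            bundle = ResourceBundle.getBundle("Labels", locale);
        }

        // Keys missing in e.g. Labels_ru.properties are resolved from the
        // base bundle Labels.properties, which holds the English defaults.
        static String get(String key) {
            try {
                return bundle.getString(key);
            } catch (MissingResourceException e) {
                return key; // last resort: show the key itself
            }
        }
    }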

The online help system

LanguageExplorer comes with a fully fledged and comfortable online help system with a searchable index (see figure 5.13). The whole user's manual is available in electronic form during program execution. It can be accessed at any time through the menu bar (Help→Tutorial) or by pressing the F1 key. Additionally, most of the LanguageExplorer dialogs have an auxiliary Help button which has not been mentioned so far. Pressing such a Help button will automatically open the help system and jump to the place in the manual where the description of the dialog is located.

Figure 5.13: The online help system of LanguageExplorer.

8Localization is the process of adapting a program to conform to the language, formatting rules, and cultural nuances of a specific region of the world.


5.4.9 System dependencies

LanguageExplorer has been developed from the very beginning to be platform independent. Due to the significant differences between the target platforms supported by LanguageExplorer it is unavoidable, however, that minor differences in the handling occur. This chapter is devoted to explaining and working around these system dependencies.

Copying and pasting text

While copying and pasting text between other applications and LanguageExplorer works as expected under Windows and Mac OS X, there are some peculiarities to be considered under Linux. While the aforementioned systems have just one clipboard, Linux's X Windows system9 has two of them: a primary clipboard and a secondary clipboard. Selecting text with the left mouse button under the X Windows system automatically copies the selected text into the primary clipboard. Thereafter it can be pasted by pressing the middle mouse button. The problem with this kind of clipboard is that every text selection automatically replaces the old content of the clipboard with the new selection. That is why X Windows additionally supports the secondary clipboard. Like under Windows, text is not implicitly inserted into this clipboard by simply selecting it. Instead this has to be done explicitly; however, how this is achieved varies between applications. Nowadays most X Windows applications support the Ctrl-C and Ctrl-V hot keys for copying and pasting text respectively.

LanguageExplorer supports only the secondary clipboard together with the Ctrl-C and Ctrl-V hot keys under the X Windows system. Therefore it is not enough to simply select text in another application with the left mouse button in order to paste it into LanguageExplorer. Instead the desired text has to be moved into the secondary clipboard. If this is not supported by the source application, the standard X Windows tool xclipboard (see figure 5.14) may be used to help.

Figure 5.14: The X Windows helper application xclipboard.

Using xclipboard is quite simple. Executing xclipboard on the command line opens the window shown in figure 5.14. Text may now be selected in an arbitrary application with the left mouse button and pasted into the xclipboard window with the middle button. Pasting the text into the xclipboard program automatically enters this text into the secondary clipboard. Now it can be pasted into LanguageExplorer by simply pressing the Ctrl-V hot key.

Pasting text from LanguageExplorer into a Linux application which does not support the secondary clipboard also works well with the xclipboard application. Copying text to the clipboard in LanguageExplorer by using the Ctrl-C hot key automatically inserts that text into the xclipboard window. Thereafter it can be selected with the left mouse button, thus implicitly inserting it into the primary clipboard, and subsequently pasted into arbitrary other applications by pressing the middle mouse button.

9X Windows is the graphical windowing system of Linux and virtually any Unix based operating system. For more information see http://www.x.org.


xclipboard is also useful because it keeps a history of the last few entries of the clipboard. More information about xclipboard can be obtained at the command line by typing the command man xclipboard.

Input methods

Especially when working with texts in different languages, the problem arises that not all letters can be typed with the keyboard attached to the computer, because it usually offers keys for only one language. Therefore several different systems have been developed in recent years which allow not only the input of letters not present on the keyboard, but also the input of ideographs for languages like Chinese or Japanese. These systems are commonly called input methods. They range from simple systems which implement a new keyboard mapping for the input of Cyrillic or Greek characters on a Latin keyboard, to highly complex programs which allow the comfortable and fast construction of thousands of different ideographs with a usual computer keyboard.

LanguageExplorer not only supports the generic input methods offered by the native operating system, but also custom input methods specific to LanguageExplorer. Because the invocation of these input methods is system dependent, they will be discussed in the following subsections. Basically, every input method belongs to a top level window and all the widgets inside that window. However, different top level windows may well have different input methods associated with them. So it would be possible, for example, for the open book dialog to use the default system input method while the search dialog uses a Cyrillic input method.

Figure 5.15: The input method selection menu under Linux.

Input method invocation under Linux To activate a different input method for a top level window under the Linux operating system, it is necessary to first click into that window in order to give it the input focus. Thereafter the F410 function key can be used to bring the input method selection menu onto the screen (see figure 5.15). While the first line denotes the default system input method, the last line of the menu, which reads “LanguageExplorer Input methods”, opens a sub menu with the input methods specific to LanguageExplorer.

10F4 is just the predefined default key for calling the input method selection menu. This key may be configured as described in section 5.3.2 on page 120.

Input method invocation under Windows The Windows operating system offers a standard way to open an input method for an application.



If the application supports input methods, its context menu (as shown in figure 5.16) offers an additional menu entry for the input method selection menu.

Figure 5.16: The default context menu of LanguageExplorer under Windows gives access to the input method selection menu.

The input method selection menu itself looks exactly the same as the one shown for the Linux operating system in figure 5.15.

Input methods under Mac OS X Under Mac OS X, LanguageExplorer currently only supports the system input methods provided by the operating system. They are invoked through the keyboard menu of the application. Notice that the keyboard menu will be visible only if there is more than one input method available. It is possible to install additional system input methods by choosing the “Keyboard Menu” tab from the “International” section of the “System Preferences” window.

Using the LanguageExplorer input methods After a LanguageExplorer input method has been selected for a top level window, a small helper window as shown in figure 5.17 is displayed on the lower right side of the screen while the top level window has the keyboard focus. This helper window displays the language of the associated input method in its title bar and a picture of the new keyboard bindings. The bindings may change if certain modifier keys (e.g. the Shift key) are pressed on the keyboard, but they will always display the characters currently available.

Figure 5.17: The helper windows displayed by the LanguageExplorer input method for Russian. On the left side the new default keyboard configuration, on the right side the keyboard layout valid when holding down the Shift key.

As long as an input method is active for a window, any keyboard action results in the input of the corresponding characters shown in the helper window instead of the characters printed on the real keyboard. Switching back to the original keyboard layout is just a matter of selecting the system input method for the corresponding top level window. In LanguageExplorer, input methods are especially useful in the find dialog when searching a text version written in a language that contains letters which are not directly accessible from the keyboard.

Chapter 6

LanguageAnalyzer

6.1 Introduction

LanguageAnalyzer is the integrated development environment (IDE) for the LanguageExplorer text reader presented in the previous chapter. It is a comfortable tool for editing text documents, with the focus laid on analysis, segmentation and mark-up of already existing texts. Like LanguageExplorer, LanguageAnalyzer can handle texts in any language supported by the Unicode [U30] standard. Furthermore, facsimile reproductions and sound files can be processed and tagged in a uniform way. Finally, the single documents can be linked together and saved in the XTE XML format, which has been described in section 2.4 and which is the native input format of LanguageExplorer.

LanguageAnalyzer and LanguageExplorer have been developed in parallel, and a big part of the architectural characteristics and classes described in chapter 3, mainly the text related classes, are shared by both projects. Many general features extensively described in the previous chapter, like the input method framework, the help system or the configurable look and feel, are also available and supported in LanguageAnalyzer and will not be described in full detail once again. Like LanguageExplorer, LanguageAnalyzer is currently available for the Linux, Windows and Mac OS X operating systems.

6.2 Overview

In this section the basic functionality of LanguageAnalyzer will be outlined based on a screen-shot of the application. Figure 6.1 shows LanguageAnalyzer after loading the Russian and the English version of Franz Kafka's novel “The Metamorphosis”. Below the menu and tool bar, the two equally sized main windows which contain the two text versions are arranged one above the other. Each of these two main windows is further subdivided vertically into a tree view on the left side, which represents the different mark-up structures of the text, and a text area on the right side, which contains the text content. Notice that the text area is fully synchronized with the associated tree view. Clicking on a tree node underlines the content in the text area which is described by the selected element (as can be seen in the lower window in figure 6.1), and clicking into the text area selects the corresponding element node in the associated tree view.

Each node in the element tree has several attributes. Some of them, like for example the linking information and the start and end positions of the text content described by each element, are displayed by default. All the attributes can be viewed and edited by clicking with the right mouse button on the corresponding node (as shown in the upper left window in figure 6.1).


Figure 6.1: LanguageAnalyzer after loading two versions of a text. The upper and the lower part of the application contain a vertically split window which contains the text content on the right side and tree controls representing the mark-up structures of the text on the left side.

Each of the two main windows may be loaded and saved independently of the other window. However, the usual procedure is to load a single, plain text version into each of the two windows respectively, to edit, segment and link them together, and finally to save them as one file in the XTE XML format (see section 2.4 on page 22).

6.3 Handling

This chapter gives a brief description of the functions available in LanguageAnalyzer. Notice that general, user interface related functions, like for example the resizing of the internal windows, are described in section 5.4.

6.3.1 Loading content

Currently, the source files may be in an untagged character format (e.g. ASCII, UTF-8, ...), in the LanguageExplorer XTE format, or in a bitmap format like JPG, GIF or PNG. However, as already noted in section 3.6, loading documents in other formats, like for example sound files or texts encoded in other XML formats, is just a question of writing the corresponding load and save actions. The open file dialog of LanguageAnalyzer shown in figure 6.2 is a standard open file dialog with a customized accessory component on the right side. In this accessory component it is possible to choose the character encoding of the file to be loaded and to choose the window(s) into which the file(s) should be loaded.


Figure 6.2: The customized open file dialog of LanguageAnalyzer.

Choosing the correct character encoding is especially important for text files because it is not possible to determine this encoding from the files automatically. LanguageAnalyzer supports a huge number of encodings, beginning with the standard UTF-8, UTF-16 and ISO-8859 encodings, including the various Windows, Macintosh and IBM code-pages, up to the more exotic encodings for Japanese, Korean or Thai, to name just a few. For XML files, LanguageAnalyzer tries to determine the character encoding from the encoding attribute of the XML declaration, if this is present. In case of success, and if the encoding specified in the XML file differs from the encoding chosen by the user, the file is reopened with the proper encoding.

Text and graphic files can only be loaded into one single text window at a time, while XTE files, which can contain two or more documents, can be loaded such that each of the documents is loaded into one of the two text windows. However, it is also possible to load just one document out of an XTE file with more than one document. This way it is possible, for example, to combine single documents from different XTE files into new XTE files. Notice that it is also possible to select more than one file in the open file dialog. This is especially useful if a set of bitmap files which contain the facsimile pages of a text edition should be assembled into a new XTE document, or if the text content of a document is split over several files.
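Decoding a file with an explicitly chosen character encoding is essentially one call to Java's InputStreamReader. The sketch below is a simplified stand-in for the actual load action, with invented names:

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.io.Reader;

    public class TextLoader {
        // Reads a whole file, decoding its bytes with the encoding chosen by
        // the user in the accessory component of the open file dialog.
        static String load(String path, String encoding) throws IOException {
            StringBuilder sb = new StringBuilder();
            Reader r = new InputStreamReader(new FileInputStream(path), encoding);
            try {
                char[] buf = new char[8192];
                for (int n; (n = r.read(buf)) != -1; )
                    sb.append(buf, 0, n);
            } finally {
                r.close();
            }
            return sb.toString();
        }
    }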

6.3.2 Saving XTE files

The dialog for saving the current documents, which is shown in figure 6.3, supports similar options as the open file dialog described above. It is possible to choose the character encoding of the output file, and the user can decide whether to store the content of a single text window or to store both documents from the two text windows in a single XTE file. Currently, only the XTE format is supported as output format, but new formats may be added in the future.

6.3.3 Working with multiple documents and encodings

As described in section 2.4, one of the features of the XTE encoding is its ability to support an arbitrary number of independent encodings. In LanguageAnalyzer, each of these encodings is represented by its own tab in the encoding window on the left side of every main window (see figure 6.4).


Figure 6.3: The customized save file dialog of LanguageAnalyzer.

Every tab contains a label with the name of the encoding and a tree view which represents the encoding. All the different encodings of a document refer to the same content; however, every encoding may encode just a part of the complete character content, or may encode the same content in a different way1. Clicking on a tab selects the corresponding encoding as the active encoding. The text displayed in the text area on the right side of the encoding window is always a view of the currently active encoding. Because every element may be visually represented by its own view class, the same text may be displayed quite differently depending on the currently active encoding, even if the different encodings encode the same part of the content. The consequences of this feature can be seen by comparing figure 6.4 with figure 6.5, which both display the same content, however with a different active encoding.

While figure 6.4 displays the text based on the default sentence- and paragraph-wise encoding selected in that figure, figure 6.5 shows the same text based on a line- and page-wise encoding which corresponds to the layout of the original edition of the text. Notice once again, however, that these are two different views of the same underlying text content.

1One nice example of an unusual encoding is the KWIC index produced by LanguageExplorer (see figure 5.2). It presents the same content as the associated text component, however in a completely different order. If the key word appears more than once in a sentence, this sentence may even appear multiple times in the encoding.

Figure 6.4: One of the LanguageAnalyzer main windows with the encoding window on the left side.


Figure 6.5: The same main window as the one shown in figure 6.4, with a different active encoding.

This can be seen, for example, by selecting a line of text in one view (as done in figure 6.5) and then switching to another encoding as done in figure 6.4. The same part of the text is still selected, although the selection does not correspond to an element in the new encoding anymore. Nevertheless, the corresponding element (or elements, if necessary) which contains the selected text in the new encoding is highlighted in the encoding window. The same argumentation applies if the content is edited in one view: the changes are automatically propagated to all the other views. Notice that this may remove some elements of an encoding if the text contained in these elements is deleted completely.

Many of the plugins and tools which will be described in the next sections operate on the text content as well as on one or more of the currently available encodings. Some of them even create new encodings. If a document is saved as an XTE file as described in section 6.3.2, all the encodings are saved in the file. However, the user has the possibility to remove any of the available encodings from a document before saving it, by clicking on the small cross which is located on the right side of every tab.

Clicking the right mouse button on an element in the encoding window opens a dialog which may be used to edit the attributes of the corresponding element. Depending on the DTD, only certain values may be possible for some attributes, as shown on the left side of figure 6.6. If the right mouse button is pressed on the text area, a context menu appears which allows the insertion of new elements at the current cursor position, based on the actual DTD. These tools are intended for the fine-tuning of encodings.
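The underlying model, several encodings as independent span structures over one shared content, can be pictured in a few lines of code. The sketch below is purely illustrative (the class names, sample text and offsets are invented) and shows two encodings segmenting the very same string differently without touching it:

    import java.util.Arrays;
    import java.util.List;

    public class StandoffDemo {
        // A standoff element: a name plus start/end offsets into the content.
        static class Span {
            final String name; final int start, end;
            Span(String name, int start, int end) {
                this.name = name; this.start = start; this.end = end;
            }
        }

        public static void main(String[] args) {
            String content = "One morning he awoke. It was late.";
            // A sentence-wise encoding over the shared content...
            List<Span> sentences = Arrays.asList(
                new Span("sentence", 0, 21), new Span("sentence", 22, 34));
            // ...and a parallel line-wise encoding over the same content.
            List<Span> lines = Arrays.asList(
                new Span("line", 0, 11), new Span("line", 12, 34));
            for (Span s : sentences)
                System.out.println(s.name + ": " + content.substring(s.start, s.end));
            for (Span l : lines)
                System.out.println(l.name + ": " + content.substring(l.start, l.end));
        }
    }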

Figure 6.6: A main window with an open context menu on the text area and an open attribute window for an element of the encoding window.


It should be mentioned, however, that LanguageAnalyzer is not a general, fully-fledged XML editor. The intention is to generate new encodings and linking structures automatically by plugins, but still to give the user the possibility to fine-tune the results if necessary.

6.3.4 Tools

Right after an XTE file has been loaded, the different documents which are present in the file are displayed in the two main windows corresponding to their position in the file. This order can be changed by pressing the swap button (see left margin) on the tool bar. Notice that the tools and plugins which need a window argument always operate with the logical window positions currently visible in the application.

Searching and font selection work in the same way as described in the corresponding sections (5.4.5 and 5.4.8) of the LanguageExplorer manual. The only difference is the fact that the creation of a KWIC index from the find dialog will not open a new extension window, but instead creates a new encoding for the corresponding document. If this encoding is selected as the active encoding, the KWIC index will be displayed in the text area.

6.3.5 Plugins

LanguageAnalyzer already comes with several default plugins which can be used to segment and link two documents, create word lists or copy existing encodings. These standard plugins are presented and explained in some more detail in this section.

Segmenting text

The “Segment text” plugin, which is accessible from the tool bar or from the Plugins menu, is a simple text segmentation tool which uses common heuristics to divide a plain text into different components. It can work in two modes. By default it takes a text and segments it into words, sentences and paragraphs. The plugin is based on the BreakIterator class from the java.text package, which defines locale-dependent character-, word-, line- and sentence-iterators. The plugin is configurable, for example with respect to the handling of newlines and how they are mapped to paragraph, section or chapter breaks. These settings are of course dependent on the format of the input files. Usually, one line-break character is ignored during the detection of sentence boundaries, two line-breaks are interpreted as a paragraph boundary, three line-breaks as a section boundary, and so on. The “Segment text” plugin may also be used to detect line and page breaks. This is especially useful if the text sources have been created by an OCR (optical character recognition) program, because in such a case the source contains the pagination information of the initial edition. One important point to consider here is the correct handling of hyphen characters at the end of lines. The pagination and hyphenation information may be used later on by the view classes to improve the visual appearance of the text. Every invocation of the “Segment text” plugin operates solely on the text content of the document and generates a new encoding structure which is represented by a new tab in the encodings window of the document.
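The mechanism the plugin builds on can be demonstrated in isolation. The following self-contained example was written for this description (the sample text is invented) and splits a string into sentences using a locale-dependent BreakIterator:

    import java.text.BreakIterator;
    import java.util.Locale;

    public class SentenceDemo {
        public static void main(String[] args) {
            String text = "One morning he awoke. He found himself transformed. "
                        + "What had happened?";
            BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
            it.setText(text);
            // Walk the boundary positions; each pair delimits one sentence.
            int start = it.first();
            for (int end = it.next(); end != BreakIterator.DONE;
                 start = end, end = it.next()) {
                System.out.println("[" + text.substring(start, end).trim() + "]");
            }
        }
    }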

Segmenting facsimile documents

The second plugin available from the tool bar is the “Segment facsimile” plugin. It can be used to divide a facsimile picture of a page into character-, word- and line-boxes. Currently the “Segment facsimile” plugin is based on GOCR [GOCR], an open source OCR program which emits positional information for the recognized character boxes.


At the time of writing, the “Segment facsimile” plugin is mainly used to automatically obtain geometrical information about word positions in old facsimile pages printed in a Gothic typeface. Because no practical OCR solutions exist yet for the recognition of such texts, we simply ignore the recognized characters. This procedure may also be viable for the segmentation of other facsimile editions, for example old, hand-written manuscripts which cannot be recognized by OCR programs at all. Although the real text information still has to be extracted by transcription in this case, it is nevertheless helpful to obtain the positional information automatically.

Figure 6.7: Segmentation of a facsimile document into words and lines. For clarity, the character boxes have not been created in this case.

Notice that it is possible to manually resize and move the generated boxes (see figure 6.7) with the mouse. It is also possible to remove boxes or add new boxes this way. Once a facsimile document is completely segmented, the elements that represent the positional boxes can subsequently be linked with the corresponding elements in the textual version of the document on a word and sentence level. This feature may be an interesting option, especially for historical and critical editions.

As with the “Segment text” plugin, the invocation of the “Segment facsimile” plugin creates a new encoding structure which is represented by a new tab in the encodings window of the document.

Linking two documents together

One of the most powerful and potentially most complex plugins is the “Link documents” plugin. It takes two encodings and links their elements together. Currently, the linking is performed based on the structural properties of the involved encodings. In the simplest case this means that elements with the same name are linked together. But this approach can also be parameterized such that, for example, a facsimile-fragment element with a type attribute set to line from a facsimile document is linked with a line element of a page- and line-wise encoded text document.

Because of restrictions in the text synchronization mechanism of LanguageExplorer, the linking information is currently stored in the link attribute of every element. This somewhat restricts the ability to link one encoding with more than one other encoding. It is still possible by mangling the different links into one attribute, but this approach unnecessarily complicates the parsing of the link attributes. In the future, linking should be done based on the link base mechanism provided by the XLink specification as


described in sections 2.1.3 and 2.4.1, and the linking information should be stored independently of the encoding elements.

Another challenge for the future development of the system would be the implementation of more advanced alignment techniques which also take into account semantic information about the content referenced by the two involved encodings, such as dictionary lookup or the alignment methods described in [HoJo].
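To make the current mechanism more concrete, the following hypothetical fragment shows a facsimile line box and a text line that reference each other through their link attributes. The identifiers and the exact attribute syntax are invented for illustration and do not reproduce the literal XTE vocabulary:

    <!-- Hypothetical sketch, not the literal XTE schema: -->
    <facsimile-fragment id="fac.line.12" type="line" link="txt.line.12"/>
    ...
    <line id="txt.line.12" link="fac.line.12"/>

Storing several links in such an attribute requires mangling them into one string, which is exactly the parsing problem mentioned above.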

Duplicating encodings

Sometimes it may be useful to make a copy of an existing encoding. This can be achieved with the “Duplicate encoding” plugin. Duplicating an encoding makes sense, for example, before an encoding is edited or adjusted manually, in order to keep a copy of the original encoding. Duplicating encodings may also be appropriate with respect to the linking problems described above, if one encoding should be linked to just one single other encoding.

Creating word lists

The last plugin presented in this section does not operate on encodings. Instead, it creates a word list from the underlying text content of a document. The word list can be stored in a file in a simple, customizable text format. Besides the character encoding of the file, the user can choose whether the word list should be sorted alphabetically or by word frequency. Finally, the words may be preceded by their frequency count. In the absence of linguistic and morphological libraries, these word lists can be used together with other tools, for example automatic text translation programs, to create dictionaries for LanguageExplorer which cover all the words in a text.
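The following minimal sketch shows how such a frequency-sorted word list can be computed with standard Java collections and the BreakIterator class introduced above; the real plugin's file output and configuration options are omitted:

    import java.text.BreakIterator;
    import java.util.HashMap;
    import java.util.Locale;
    import java.util.Map;

    public class WordListDemo {
        public static void main(String[] args) {
            String text = "a rose is a rose is a rose";
            Map<String, Integer> freq = new HashMap<>();
            BreakIterator words = BreakIterator.getWordInstance(Locale.ENGLISH);
            words.setText(text);
            int start = words.first();
            for (int end = words.next(); end != BreakIterator.DONE;
                 start = end, end = words.next()) {
                String w = text.substring(start, end).trim();
                // Skip whitespace and punctuation segments.
                if (!w.isEmpty() && Character.isLetter(w.charAt(0)))
                    freq.merge(w, 1, Integer::sum);
            }
            // Print one word per line, preceded by its frequency count,
            // sorted by descending frequency.
            freq.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .forEach(e -> System.out.println(e.getValue() + " " + e.getKey()));
        }
    }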

6.4 Command line tools

Some useful tools for the creation of LanguageExplorer books have not yet been built into LanguageAnalyzer but exist only as command line tools. This section describes these tools, which will hopefully be integrated into LanguageAnalyzer soon.

6.4.1 Merging XTE files

As already noted in the design section 3.4, LanguageAnalyzer can only handle two documents at a time. LanguageExplorer, however, can handle books with an arbitrary number of parallel documents. How is it possible to create such books?

This task is currently accomplished by the command line tool MergeBooks, which can operate in two different modes. In the first mode, MergeBooks takes two XTE files, each with two properly interlinked documents, of which one document is present in both files, say the documents A and B in the first XTE file and the documents A and C in the second XTE file. From these it creates a new XTE file which contains the two properly interlinked documents B and C. The following line shows the formal calling syntax of the program:

MergeBooks [-v] -s Book1.xte Book2.xte NewBook.xte

The optional -v argument can be used to get more verbose output, while the three file arguments denote the two two-document XTE input files and the name of the output file, respectively. Given, for example, a properly linked XTE file with the German and English versions of a novel and a second, properly linked XTE file with the German and Russian versions of the same novel, it is possible to automatically create a linked XTE file which contains the English and Russian versions of that novel.
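For the example just given, the invocation might look as follows (the file names are, of course, hypothetical):

    MergeBooks -s Novel_DE_EN.xte Novel_DE_RU.xte Novel_EN_RU.xte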


Notice that the automatically generated linking information in the created file is always correct if the linking in the two base files was correct. It may be possible, however, that the linking in the created file is not as exact as it might be. This case may occur if one element in the common document (i.e. document A in the context of the previous example) is mapped to several different elements in the corresponding sibling documents. The solution to this problem is to load the created XTE file into LanguageAnalyzer and refine the linking manually.

In the second operation mode, whose command line syntax is given below, MergeBooks can be used to create an XTE file which contains n documents out of (n² − n)/2 two-document XTE files. The creation of a 4-document XTE file, for example, requires (4² − 4)/2 = 6 two-document files to be given on the command line, where for every two documents A and B there must exist exactly one properly interlinked two-document file which contains these two documents.

MergeBooks [-v] -m Book1.xte .. Bookn.xte NewBook.xte

In the second operation mode, MergeBooks does not create any links at all. It just collects the n documents and their linking information from the different input files and assembles them in the output file. Afterwards, every element in every document contains the information about how it is linked to each of the other n − 1 documents.
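For a 4-document book, an invocation might therefore look as follows, again with hypothetical file names, one file per document pair A/B, A/C, A/D, B/C, B/D and C/D:

    MergeBooks -m BookAB.xte BookAC.xte BookAD.xte BookBC.xte BookBD.xte BookCD.xte NewBook.xte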

6.4.2 Encrypting XTE files

As already described in section 3.3.2, LanguageExplorer supports the encryption of its content. Several command line tools are available which can be used to create keys and passwords and to finally encrypt the files. For a better understanding of this section it may be helpful to refer to figure 3.10 on page 58, which graphically summarizes the LanguageExplorer encryption schema.

The first program, named GenerateDESKey, can be used to generate a DES key. The first, mandatory argument specifies the file in which the key should be stored:

GenerateDESKey Key-File [ Algorithm = DES [ Provider = SUN ]]

The optional second and third arguments may be used to specify the name of the algorithm which is used to create the key and the provider name of the employed cryptographic engine.

The second utility, called GeneratePBEDESKey, can be used to encrypt the key file created in the first step with the help of a user-supplied password:

GeneratePBEDESKey Key-File Enc-File Password [ Algorithm = PBEandDES [ Provider = SUN ]]

The first argument specifies the name of a file which contains a previously generated key. The second argument specifies the name of the file which should hold the encrypted key, and the last mandatory argument gives the password which should be used for the encryption. By using the two optional arguments it is also possible to change the employed algorithm or cryptographic engine.

Finally, the EncryptFile command can be used to encrypt an XTE file with a given key. The first argument specifies the source XTE file, while the second argument denotes the name of the encrypted file which will be created. The third argument specifies the name of the file which contains a key previously generated with GenerateDESKey:

EncryptFile Input-File Output-File Key-File [ Algorithm = DESede [ Provider = SUN ]]

Again, the optional arguments can be used to specify an alternative encryption algorithm or cryptographic engine provider. Notice that these values have to be recorded in the Manifest of the final leb file, as described in section 3.3.1, if they are changed.
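The key generation step can be sketched with the standard Java Cryptography Extension. The following minimal program illustrates the underlying javax.crypto API; it is not the actual source of GenerateDESKey, which may store the key in a different form:

    import java.io.FileOutputStream;
    import javax.crypto.KeyGenerator;
    import javax.crypto.SecretKey;

    public class GenerateKeyDemo {
        public static void main(String[] args) throws Exception {
            // Create a DES key with the default provider's key generator.
            KeyGenerator gen = KeyGenerator.getInstance("DES");
            SecretKey key = gen.generateKey();
            // Store the raw key material in the file given as the first argument.
            try (FileOutputStream out = new FileOutputStream(args[0])) {
                out.write(key.getEncoded());
            }
        }
    }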


Chapter 7

Summary and outlook

This work has presented an extensible framework for the processing and presentation of multi-modal, parallel text corpora. XTE, a simple but powerful XML standoff annotation scheme, has been developed and realized as a DTD and as an XML Schema. XTE is especially suited for the encoding of multiple, overlapping hierarchies in multi-modal documents and for the cross-linking of the elements of these encodings across several documents. As such, it is especially well suited for the creation of electronic synopses.

Together with XTE, sophisticated editor and browser applications have been developed which allow the comfortable creation and presentation of XTE encoded documents. However, LanguageExplorer, the browser component, and LanguageAnalyzer, the editor component, are not monolithic, completed applications. Because they are both built around a flexible software framework, they can be easily customized and extended. In the same way in which XTE can be extended by new encodings, the two applications are extensible by new components that can handle new encoding elements in an optimal way. Additionally, it is also possible to plug arbitrary other tools into the system which operate on the textual content as well as on the different encodings.

This combination of a classical textual synopsis with the supplementary options of dictionaries, encyclopedias, multi-media extensions and powerful tools opens a wide area of applicability ranging from text analysis and language learning to the creation of critical editions and electronic publishing.

As a proof of concept, several multilingual editions of fiction (e.g. “The Metamorphosis” by F. Kafka in German, English and Russian), non-fiction (e.g. “The Universal Declaration of Human Rights” in English, Japanese, Yiddish and Greek) and historical texts (“Der Sprachkampf in Siebenbürgen” by St. L. Roth as facsimile and transcription) have been prepared and combined with dictionaries and other tools. LanguageExplorer, which is available as a ready-to-run application for Windows, Linux and MacOS X, qualified for the finals of the famous “Multimedia Transfer” contest 2004 in Karlsruhe/Germany [MM04] and placed in the top ten there.

7.1 Outlook

Although the system presented in this work is quite mature, there always remains room for improvement. One of the main areas of extension is of course the creation of new tools and plugins. Currently we are developing, for example, a tool that displays the semantic neighborhood of a given word in a text corpus, that is, the words which most often occur within a fixed distance around the given word [Her]. The results are displayed as a graph-like structure in the lower extension area, as shown in figure 7.1. The graph is navigable, such


Figure 7.1: A prototype of a plug-in that displays the semantic neighborhood of a word. The left part of the graph has been created by clicking on the word “Mutter” in a text window of LanguageExplorer. The right part has been created by clicking on the word “reinsten” in the graph. It displays the neighborhood of “reinsten” in the same text document.

that clicking on a word in the graph will recursively reveal its semantic context.

For the improvement of such tools, but also for more accurate dictionary and encyclopedia lookups, it would be highly desirable to incorporate linguistic and morphological libraries, for example the WMTrans libraries from Canoo [Canoo], into the system. They could also help to improve the automatic alignment of parallel texts, which is currently based on structural and statistical information only.

Besides the many tools and plugins that may be desirable, it could also be interesting to create bigger corpora of aligned, multilingual texts. In order to avoid copyright problems it would be possible, for example, to use works of fiction whose authors have been dead for more than 70 years, because these texts are usually free of copyright in most countries. Numerous such works already exist in electronic form, for example from Project Gutenberg [Gutb]. With our tools they could easily be aligned, augmented with supplementary information and published electronically.

Another project that could be rewarding is the reimplementation of the whole system on top of an existing application platform like NetBeans [BGGSW] or Eclipse [SAFKKC]. These platforms offer a lot of common functionality like user interface management, configuration management, wizard frameworks, abstract storage management which unifies data access to local and remote files, version control systems and unified database access. Because such a big refactoring and reimplementation would require quite a lot of resources, it seems feasible only within the scope of a new, big project.

7.2 Related work

Because the system presented in this thesis potentially covers such a wide range of application areas, it is hard to compare it with other projects. In this section we discuss other systems that can be used to achieve results which are at least in part comparable to those provided by our system.

7.2.1 Synopses and e-books

Synopses have been in use for a very long time. The oldest known synopsis is the famous Rosetta stone shown in figure 1.1 on page 2. There exist printed synopses of the gospels which are as old as the first printed books. Today, synopses which show parallel versions of the gospels in Hebrew, Greek, Latin and translations into contemporary languages are a

common tool for every theologian (see [Aland] and [PeWiKr] for two examples of modern, printed synopses). But synopses are not only used in theology; jurists also use synopses to highlight the changes between different versions of laws. In the European Community, for example, all laws and regulations have to be translated into up to 20 different languages, and the United Nations has to make its resolutions available in even more languages. These are all potential application areas that could be successfully covered by synopses.

Globalization and the opportunity of higher education have also led to a growing interest in language learning, which in turn has resulted in a growing market for bilingual editions, that is, books which show both the original and the translated version of a text in parallel. A query for “bilingual editions” at the online book store Amazon, for example, returned more than 10,000 hits. Despite this apparent interest in synopses and multilingual editions, there seems to be no general tool support for the creation and publication of such works.

One system known to the author that directly supports the creation of synopses is TUSTEP, the “Tübingen System of Text-processing Programs”. One of the highlights of this system, besides the production of high-quality PostScript output, is the fact that it supports many ancient languages, which is essential for many historical and text-critical editions and not widely supported by other systems. However, TUSTEP is more or less an authoring tool comparable to LanguageAnalyzer; it has no browser or viewer component which could be used by an end user to work with the created editions.

Another ambitious system for the creation of critical editions which also supports synopses is CTE, the Classical Text Editor [CTE]. CTE is a Windows-only application. It supports the Unicode standard and can produce HTML, PostScript and TEI output. One of the specialties of CTE is its ability to handle an arbitrary number of apparatus. CTE is a specialized word processor, however, which does not support the integration of tools and is not extendable by the user.

Today more and more e-books (electronic books), that is, digital versions of printed books, especially digital versions of ancient books, appear on the market and on the web. A prominent example of this process is the digital version of the “Arden Shakespeare” edition [Arden], which not only contains the complete works of Shakespeare in a searchable text database combined with a lot of additional materials, but also links the text to the facsimile pictures of the first Quarto and Folio editions which were published around 1600. Another example is a complete version of the fourth edition of the German encyclopedia “Meyers Konversationslexikon”, which was published in Leipzig in the years 1888 and 1889. The more than 16000 pages of the 16 volumes have been scanned and processed by an optical character recognition (OCR) software. The extracted text, which is linked to the corresponding facsimile pages, can be searched and browsed on-line [Meyers].

The problem with these editions is that they either use proprietary software and data formats, as in the first of the two examples above, or they use simple web interfaces based on HTML, as in the second case, which unnecessarily reduces their usefulness.
The system presented in this thesis tries to fill this gap.

Another interesting system from this category is the NOVeLLA e-book reader described in [HSJDNB]. It is implemented in Java, supports the Open eBook document structure [OeB] and provides an aural user interface, text-to-speech output and audio annotations. This system, as well as the well-known e-book readers from Adobe and Microsoft, is a pure software solution which runs on every computer and does not need specialized hardware. Although reading a book on the computer is not very comfortable today, we believe that the advances in computer technology, especially in the areas of miniaturization, display resolution and battery power, will finally boost the e-book market.


7.2.2 Natural language processing systems

During the last decade, one of the fastest growing fields in information technology has been natural language processing (NLP). NLP is a subfield of artificial intelligence and linguistics and studies areas such as speech recognition, machine translation, question answering and information retrieval and extraction. Many commercial and free tools have been developed to support the work and research in this area, and some of them are comparable to LanguageAnalyzer, the editor component of our application framework.

One of the most prominent and most mature tools in this category is certainly GATE, the General Architecture for Text Engineering [GATE] from the NLP group of the University of Sheffield. It is a multi-platform framework for natural language engineering (NLE) written completely in Java, with many built-in NLE components and tools for tagging, information extraction and retrieval, summarization and ontology editing, to name just a few. It supports arbitrary, multilingual text resources and processes and exports data in many standard XML formats.

Another tool which has architectural similarities with our system is the MATE workbench [KIMMGK], an annotation tool for XML encoded speech corpora. Also written entirely in Java, it is primarily designed to annotate and align parallel speech and text corpora. It can handle arbitrary XML annotation schemes (even non-hierarchical ones, by using the concept of standoff annotation described in [ThMcK]) through configurable editors and display formats, and it offers an extensible architecture for third-party annotation tools. As noted in [MueStr], especially the concept of customizable display objects for the different annotation elements, which is realized by a stylesheet mechanism, may cause serious performance problems. MMAX [MueStr], another tool for the annotation of multi-modal corpora which uses an annotation scheme similar to the one used in MATE, is a system which claims to address these problems.

Translation corpora

In this section we present some tools which can be used to create and process translation corpora, that is, multilingual, parallel text corpora. Such corpora can be used for a wide variety of applications, ranging from the study of linguistic phenomena and the extraction of data for machine translation and lexicography to applications in foreign language learning and translator training.

In [HoJo] a so-called “Translation Corpus Aligner” is described, that is, a program which automatically aligns a text that is available in two different languages. Besides the well-known statistical and structural approaches [Che93, Chu93, Mel97, SiPl96], the paper describes how anchor words, i.e. words that are reasonably frequent in the two languages in question and have straightforward equivalents in both languages, can be used to improve the alignment.

While the aforementioned translation corpus aligner only produces an XML output of the two aligned texts, Ebeling [Ebel] describes an interactive browser for parallel texts called TCE (Translation Corpus Explorer). It takes an already aligned text corpus in a TEI format and stores it in an internal database, which can subsequently be used to search and browse the texts.

Olsson and Borin describe a web-based system for exploring translation equivalents on word and sentence level in multilingual, parallel corpora in [OlBo]. They developed a query and visualisation tool for corresponding entries in a corpus with two aligned text versions which has an HTML- and a Java-Applet-based front end.

More information and references on parallel, multilingual text corpora research and processing can be found in [JoOk, Ver].


7.2.3 Related standards

Our system and its goals are also related to some existing standards and ongoing projects. There is, for example, the ambitious HyTime standard [HyTime], which claims to be able to “link everything with everything”, i.e. to interconnect any kind of media and specify its intended placement in space and time. In [DeRoDu] the authors state that, among other things, HyTime could be used for:

“Managing documents that are studied and discussed in fine detail, such as Biblical, Classical, legal or medical texts. Such documents may exist in many editions or translations, as well as variant manuscript or print versions, which can be viewed in parallel, compared, and searched as needed”.

This is exactly what we want to achieve with our system. The problem with the HyTime standard is that it is very complex and, to an even greater extent than is the case with SGML, there are no tools or applications available which support the standard. This is, however, crucial for a standard like HyTime, which is a so-called “enabling standard”, that is, an abstract standard which defines how to address, link, align and synchronize hyper-media documents, but no concrete encoding schemes or element structures for such documents. Nevertheless, it is interesting and highly instructive to see how these problems are solved in HyTime. After all, HyTime strongly influenced the XLink standard, which extends the linking functionality of XML and which is partially used in XTE (see 2.4.1).

One application of HyTime is the so-called Topic Maps [TopMa], as specified in the ISO standard 13250. Topic Maps are an effort to establish a standard way for the specification of semantic relations between information fragments, where these smallest parts of information are called topics in this context. Topic Maps are built on SGML and HyTime. They use SGML as a data exchange format and HyTime as a means of creating links and associations between the different elements of the standard. XTM, which stands for XML Topic Maps [XTM], is an attempt to port the Topic Maps standard to XML.

A similar standard defined by the W3C consortium is the Resource Description Framework (RDF) [RDF]. RDF defines an XML vocabulary for the representation of information about resources on the World Wide Web. Every resource may be described by several statements, where each statement is a triple consisting of a subject (the resource), a predicate and an object. As described in [WiMue], Topic Maps are a more general approach to building semantic networks; RDF, however, is the key technology behind the Semantic Web propagated by Tim Berners-Lee and the W3C consortium (see [BeHeLa]) and as such will probably receive a great deal of attention in the coming years.

Topic Maps and RDF can both be used to build so-called ontologies, that is, hierarchical data structures containing all the relevant entities and their relationships and rules within a domain. The W3C consortium has also specified its own ontology language, the OWL Web Ontology Language [OWL], which is based on RDF. Well-known ontologies are provided, for example, by the Cyc [Cyc] and WordNet [WordNet] projects.

Although the relation of LanguageExplorer and LanguageAnalyzer to the standards mentioned in this section may not be obvious at first glance, some interesting parallels can be found: on the one hand, our tools could be used to export the processed data in one of the above mentioned formats; on the other hand, data in these formats and tools based on such data could be used to considerably extend the functionality of our system.
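As an illustration of the triple model described above, the following minimal RDF/XML fragment encodes one statement: the subject is an invented resource URI, the predicate is the Dublin Core creator property, and the object is a literal:

    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:dc="http://purl.org/dc/elements/1.1/">
      <rdf:Description rdf:about="http://example.org/texts/metamorphosis">
        <dc:creator>F. Kafka</dc:creator>
      </rdf:Description>
    </rdf:RDF>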


Appendix A

Constants

Predefined character blocks in Unicode 3.0

BasicLatin, Latin-1Supplement, LatinExtended-A, LatinExtended-B, IPAExtensions, SpacingModifierLetters, CombiningDiacriticalMarks, Greek, Cyrillic, Armenian, Hebrew, Arabic, Syriac, Thaana, Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Sinhala, Thai, Lao, Tibetan, Myanmar, Georgian, HangulJamo, Ethiopic, Cherokee, UnifiedCanadianAboriginalSyllabics, Ogham, Runic, Khmer, Mongolian, LatinExtendedAdditional, GreekExtended, GeneralPunctuation, SuperscriptsandSubscripts, CurrencySymbols, CombiningMarksforSymbols, LetterlikeSymbols, NumberForms, Arrows, MathematicalOperators, MiscellaneousTechnical, ControlPictures, OpticalCharacterRecognition, EnclosedAlphanumerics, BoxDrawing, BlockElements, GeometricShapes, MiscellaneousSymbols, Dingbats, BraillePatterns, CJKRadicalsSupplement, KangxiRadicals, IdeographicDescriptionCharacters, CJKSymbolsandPunctuation, Hiragana, Katakana, Bopomofo, HangulCompatibilityJamo, Kanbun, BopomofoExtended, EnclosedCJKLettersandMonths, CJKCompatibility, CJKUnifiedIdeographsExtensionA, CJKUnifiedIdeographs, YiSyllables, YiRadicals, HangulSyllables, HighSurrogates, HighPrivateUseSurrogates, LowSurrogates, PrivateUse, CJKCompatibilityIdeographs, AlphabeticPresentationForms, ArabicPresentationForms-A, CombiningHalfMarks, CJKCompatibilityForms, SmallFormVariants, ArabicPresentationForms-B, Specials, HalfwidthandFullwidthForms, Specials

Table A.1: LanguageExplorer supports the character block names defined in Unicode 3.0 when constructing certain regular expressions (see section 5.4.6 on page 130). Notice that these names omit the space characters which are used in the Unicode standard as word separators (e.g. “BasicLatin” is defined as “Basic Latin”).

The character categories defined in Unicode 3.0

Category  Explanation

Letters
L   Letter.
Lu  Uppercase letter.
Ll  Lowercase letter.
Lt  Title case letter.
Lm  Modifier letter.
Lo  Any other letter.

Numbers
N   Number.
Nd  Decimal digit.
Nl  Letter number.
No  Any other number.

Symbols
S   A symbol.
Sm  A mathematical symbol.
Sc  A currency symbol.
Sk  A modifier symbol.
So  Any other symbol.

Punctuation marks
P   A punctuation mark.
Pc  A connector.
Pd  A dash.
Ps  An opening punctuation mark.
Pe  A closing punctuation mark.
Pi  An initial quote.
Pf  A final quote.
Po  Any other punctuation mark.

Separators
Z   A separator.
Zs  A space separator.
Zl  A line separator.
Zp  A paragraph separator.

Combining marks
M   A combining mark.
Mn  A nonspacing mark.
Mc  A spacing combining mark.
Me  An enclosing mark.

Other characters
C   Any other characters.
Cc  Control character.
Cf  Format character.
Cs  Surrogate character.
Co  Private use character.
Cn  Not assigned character.

Table A.2: The character categories defined in Unicode 3.0. In Unicode every character is assigned a general one-letter category value. Each category may be subdivided into several non-overlapping sub-categories which are identified by a second letter in the category name. For more information consult the Unicode standard [UNI].
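As an illustration, both the block and the category names can be used in regular expressions through the \p{...} construct, for example in Java's java.util.regex package. The following small sketch is a hypothetical usage example, not code from LanguageExplorer:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class CategoryDemo {
        public static void main(String[] args) {
            // \p{Lu} matches any Unicode uppercase letter (category Lu) and
            // \p{Ll} any lowercase letter; in Java, block names additionally
            // take an "In" prefix, e.g. \p{InBasicLatin}.
            Pattern capitalized = Pattern.compile("\\p{Lu}\\p{Ll}+");
            Matcher m = capitalized.matcher("Der Sprachkampf in Siebenbürgen");
            while (m.find())
                System.out.println(m.group()); // Der, Sprachkampf, Siebenbürgen
        }
    }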


Bibliography

[Abr] P. W. Abrahams. Typographical Extensions for Programming Languages: Breaking out of the ASCII Straitjacket. ACM SIGPLAN Notices, Vol. 28, No. 2, Feb. 1993

[AKW] A. V. Aho, B. W. Kernighan and P. J. Weinberger. The AWK Programming Language. Addison-Wesley, 1988

[Aland] Kurt Aland (ed.). Synopsis Quattuor Evangeliorum. Württembergische Bibelanstalt, Stuttgart, 1964

[Arden] William Shakespeare; Jonathan Bate (ed.). Arden Shakespeare CD-ROM: Texts and sources for Shakespeare studies. Thomas Nelson and Sons Ltd., 1997

[Arm] E. Armstrong. Encoding Source in XML - A Strategic Analysis. http://www.treelight.com/software/encodingSource.html

[Bad] G. J. Badros. JavaML: A Markup Language for Java Source Code. 9th Int. WWW-Conference, Amsterdam, May 2000

[BaNo] G. J. Badros and D. Notkin. A Framework for Preprocessor-Aware C Source Code Analyses. Software - Practice & Experience, Vol. 30, No. 8, July 2000

[Ba95] Winfried Bader. Lehrbuch TUSTEP. Max Niemeyer Verlag, Tübingen, 1995

[BaeMa] Ronald M. Baecker and Aaron Marcus. Human Factors and Typography for More Readable Programs. Addison-Wesley, 1990

[BE60] R. W. Bemer. Survey of coded character representation. Commun. ACM 3, No. 12, 639-641, Dec. 1960

[BE63] R. W. Bemer. The American standard code for information interchange. Datamation 9, No. 8, 32-36, Aug. 1963, and ibid 9, No. 9, 39-44, Sep. 1963

[BeHeLa] Tim Berners-Lee, James Hendler and Ora Lassila. The Semantic Web. Scientific American, May 2001

[BGGSW] T. Boudreau, J. Glick, S. Greene, V. Spurlin and J. Woehr. NetBeans: The Definitive Guide. O'Reilly & Associates, 2002. http://www.netbeans.org/download/books/definitive-guide/

[Boost] The Boost Library. http://www.boost.org

[BRJ1] Grady Booch, James Rumbaugh and Ivar Jacobson. The Unified Modeling Language User Guide. Addison-Wesley, 1999

[BRJ2] Grady Booch, James Rumbaugh and Ivar Jacobson. The Unified Modeling Language Reference Manual. Addison-Wesley, 1998


[Brig] Preston Briggs. nuWeb. http://ctan.tug.org/tex-archive/web/nuweb

[Bryan] Martin Bryan. SGML - An author's guide to the standard generalized markup language. Addison-Wesley, 1988

[Broe] David Brownell. SAX2. O'Reilly, 2002

[BSW] R. W. Bemer, H. J. Smith, Jr. and F. A. Williams. Design of an improved transmission/data processing code. Commun. ACM 4, No. 5, 212-217, 225, May 1961

[Canoo] Canoo Technology AG, Basel, Switzerland. WMTrans - Multilingual Morphology Software. available at: http://www.canoo.com/wmtrans

[Car] David Carlisle. The longtable package. available at: ftp://ftp.dante.de/tex-archive/help/Catalogue/entries/longtable.html

[CarSt] Robert Cartwright and Guy Steele. Compatible Genericity with Run-time Types for the Java(tm) Programming Language. Proc. of the 13th ACM Conf. on Object Oriented Programming, Systems and Applications, Vancouver, B.C., October 1998. http://www.cs.rice.edu/~javaplt/papers/oopsla1998.pdf

[CaWaHu] Mary Campione, Kathy Walrath, Alison Huml et al. The Java Tutorial Continued: The Rest of the JDK. Addison-Wesley, 1998

[Che93] S. F. Chen. Aligning sentences in bilingual corpora using lexical information. Proc. of the 31st Annual Meeting of the Association for Computational Linguistics, Columbus, Ohio, 1993. available at: http://acl.ldc.upenn.edu/P/P93/P93-1002.pdf

[Chu93] K. Church. Char align: A Program for Aligning Parallel Texts at the Character Level. Proc. 31st Ann. Conf. of the Association for Computational Linguistics (ACL), Columbus, Ohio, 1993. available at: http://acl.ldc.upenn.edu/P/P93/P93-1001.pdf

[Child] Bart Childs. Literate Programming, A Practitioner's View. TUGboat, Volume 13, No. 2, 1992. http://www.literateprogramming.com/farticles.html

[ChSa] B. Childs and J. Sametinger. Analysis of Literate Programs from the Viewpoint of Reuse. Software - Concepts and Tools, Vol. 18, No. 2, 1997. http://www.literateprogramming.com/farticles.html

[CoRe] A. B. Coates and Z. Rendon. xmLP - a Literate Programming Tool for XML & Text. Extreme Markup Languages, Montreal, Quebec, Canada, August 2002. http://xmlp.sourceforge.net/2002/extreme/

[CSharp] ECMA 334, ISO/IEC 23270. C# Language Specification. http://www.ecma-international.org/publications/standards/ecma-334.htm

[CSS] H. Lie and B. Bos. Cascading Style Sheets. W3C Recommendation, Dec. 1996. available at: http://www.w3.org/Style/CSS

[CTE] Stefan Hagel. CTE: The Classical Text Editor. available at: http://www.oeaw.ac.at/kvk/cte/

[Cyc] Cycorp, Inc. OpenCyc: The Project. available at: http://opencyc.org/

[CzEi] K. Czarnecki and U. W. Eisenecker. Generative Programming. Addison-Wesley, 2000


[DaSe] Stephen Davies and Stefan Seefeld. Synopsis. http://synopsis.sourceforge.net

[DeiCza] Andrew Deitsch and David Czarnecki. Java internationalization. O'Reilly & Associates, 2001

[DeRoDu] Steven DeRose and David G. Durand. Making Hypermedia Work - A user's guide to HyTime. Kluwer Academic Publishers, 1994

[DES] National Institute of Standards and Technology (NIST). Data Encryption Standard. FIPS Publication 46-2, December 1993

[Diam] Jason Diamond. NDoc. http://ndoc.sourceforge.net/

[DuOD01] Patrick Durusau and Matthew B. O'Donnell. Implementing Concurrent Markup in XML. Extreme Markup Languages 2001, Montreal, Canada, Aug. 2001. online at: http://www.sbl-site2.org/Extreme2001/Concur.html

[DuOD02] Patrick Durusau and Matthew B. O'Donnell. Just-In-Time-Trees (JITTs): Next Step in the Evolution of Markup? Extreme Markup Languages 2002, Montreal, Canada, Aug. 2002. online at: http://www.sbl-site2.org/Extreme2002/JITTs.html

[DocB] Norman Walsh (ed.). The DocBook Document Type. online at: http://www.oasis-open.org/committe/docbook

[Docl] Sun Microsystems, Inc. The Doclets API. http://java.sun.com/j2se/javadoc/

[DOM] A. Le Hors, P. Le Hégaret, L. Wood et al. (eds.). Document Object Model - Level 1, 2 and 3. W3C Recommendation, 1998, 2000 and 2004. available at: http://www.w3.org/DOM/DOMTR

[DrMo] Nikos Drakos and Ross Moore. Latex2HTML. http://saftsack.fs.uni-bayreuth.de/~latex2ht/ or: http://ctan.tug.org/ctan/tex-archive/support/latex2html

[DSSSL] ISO/IEC 10179:1996. DSSSL - Document Style Semantics and Specification Language. online at: http://www.oasis-open.org/cover/dsssl.html

[dtd2xsA] Syntext, Inc. Syntext dtd2xs, Ver. 1.4. available at: http://www.syntext.com

[dtd2xsB] Joerg Rieger and Ralf Schweiger. dtd2xs, Ver. 1.6. available at: http://www.lumrix.de/dtd2xs/

[DuCo] Diane I. Hillmann. Using Dublin Core. Dublin Core Metadata Initiative, Apr. 2002. online at: http://dublincore.org/documents/

[Ebel] Jarle Ebeling. The Translation Corpus Explorer: A browser for parallel texts. In Johansson, S. and Oksefjell, S. (eds.), Corpora and Cross-linguistic Research: Theory, Method, and Case Studies. Amsterdam: Rodopi, 1998

[ELW] R. Eckstein, M. Loy and D. Wood. Java Swing. O'Reilly, 1998

[ECMA] European Computer Manufacturer's Association. online at: http://www.ecma.ch

[EP87] Sandra L. Emerson and Karen Paulsell. TROFF Typesetting for UNIX Systems. Prentice-Hall, 1987

[Flex] Free Software Foundation. The Fast Lexical Analyzer. http://www.gnu.org/software/flex/


[Friedl] Jeffrey E. F. Friedl. Mastering Regular Expressions. O'Reilly and Associates, 1997

[GaSo] Jess Garms and Daniel Sommerfield. Java Security. Wrox Press Ltd., 2001

[GATE] The Sheffield NLP group. GATE - General Architecture for Text Engineering. The University of Sheffield, Computer Science Department. available from: http://gate.ac.uk

[GCC] Free Software Foundation. The GNU Compiler Collection. http://gcc.gnu.org

[GHJV] E. Gamma, R. Helm, R. Johnson and J. Vlissides. Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley, Reading, MA, 1995

[Ger] D. M. German, D. D. Cowan and A. Ryman. SGML-Lite - An SGML-based Programming Environment for Literate Programming. ISACC, Oct. 1996. http://www.oasis-open.org/cover/germanisacc96-ps.gz

[Go81] C. F. Goldfarb. A generalized approach to document markup. Proceedings of the ACM SIGPLAN SIGOA symposium on Text manipulation, SIGPLAN Notices, June 1981

[Go90] C. F. Goldfarb. The SGML Handbook. Oxford University Press, 1990

[GoJoSt] J. Gosling, B. Joy and G. Steele. The Java Language Specification. Addison-Wesley, 1996

[GJ] Johannes Gutenberg. Die 42-zeilige lateinische Bibel. Niedersächsische Staats- und Universitätsbibliothek Göttingen. available at: http://www.gutenbergdigital.de

[GOCR] Joerg Schulenburg. GOCR. available from: http://jocr.sourceforge.net/

[Greg] Douglas Gregor. The BoostBook Documentation Format. http://www.boost.org/doc/html/boostbook.html

[Gutb] Project Gutenberg. Literary Archive Foundation, Oxford, MS, USA. online at: http://www.promo.net/pg/

[Hee] Dimitri van Heesch. Doxygen. http://www.doxygen.org

[Heinz] Carsten Heinz. The Listings package. ftp://ftp.dante.de/tex-archive/help/Catalogue/entries/listings.html

[Hend] T. D. Hendrix, J. H. Cross II, L. A. Barowski and K. S. Mathias. Visual Support for Incremental Abstraction and Refinement in Ada95. SIGAda Ada Letters, Vol. 18, No. 6, 1998

[Her] Hans Jörg Heringer. Das höchste der Gefühle - Empirische Studien zur distributiven Semantik. Stauffenberg Verlag, Tübingen, 1999

[HoJo] K. Hofland and S. Johansson. The Translation Corpus Aligner: A program for automatic alignment of parallel texts. In Johansson, S. and Oksefjell, S. (eds.), Corpora and Cross-linguistic Research: Theory, Method, and Case Studies. Amsterdam: Rodopi, 1998

[HSJDNB] J. S. Hodas, N. Sundaresan, J. Jackson, B. L. Duncan, W. I. Nissen and J. Battista. NOVeLLA: A Multi-Modal Electronic-Book Reader With Visual and Auditory Interfaces. International Journal of Speech Technology, Vol. 4, Issue 3/4, July-October 2001, pp. 269-284. online at: http://citeseer.ist.psu.edu/416147.html


[HTML] Dave Raggett, Arnaud Le Hors and Ian Jacobs (eds.). The HyperText Markup Language. W3C Recommendation, Dec. 1999. available at: http://www.w3.org/MarkUp

[Huff] D. A. Huffman. A Method for the Construction of Minimum Redundancy Codes. Proc. of the Inst. of Radio Engineers, Volume 40, Number 9, 1952

[HyTime] Charles F. Goldfarb, Steven R. Newcomb, W. Eliot Kimber and Peter J. Newcomb (eds.). Hypermedia/Time-based Structuring Language (HyTime) - 2nd edition. ISO/IEC 10744:1997. available at: http://www.y12.doe.gov/sgml/wg8/document/1920.htm

[ICU] IBM Corporation. ICU - International Components for Unicode. available at: http://oss.software.ibm.com/icu

[ISO] International Standards Organisation. online at: http://www.iso.ch

[ISO639] ISO. The ISO-639 two letter language codes. available at: http://www.unicode.org/unicode/onlinedata/languages.html

[ISO3166] ISO. The ISO-3166 two letter country codes. available at: http://www.unicode.org/unicode/onlinedata/countries.html

[Jade] James Clark. Jade - James' DSSSL Engine. available at: http://www.jclark.com or: http://openjade.sourceforge.net

[JaBean] Graham Hamilton (ed.). JavaBeans. Sun Microsystems, Version 1.01-A, August 1997. available at: http://java.sun.com/beans

[JBuil] Borland Software Corporation. Borland JBuilder. http://www.borland.com/jbuilder

[JDB] Sun Microsystems, Inc. The Java Bug Database. available at: http://developer.java.sun.com/developer/bugParade

[JILT] Sun Microsystems, Inc. Java Internationalization and Localization Toolkit 2.0. available at: http://java.sun.com/products/jilkit

[JoOk] S. Johansson and S. Oksefjell (eds.). Corpora and Cross-linguistic Research: Theory, Method, and Case Studies. Amsterdam: Rodopi, 1998

[JSR14] Java Community Process - Java Specification Request 14. Adding Generics to the Java Programming Language. http://jcp.org/aboutJava/communityprocess/review/jsr014/index.html and http://developer.java.sun.com/developer/earlyAccess/adding_generics

[Kay] Michael Kay. XSLT Programmer's Reference. Wrox Press Ltd., Birmingham, UK, 2002. http://saxon.sourceforge.net

[Ke78] Brian Kernighan. A TROFF Tutorial. Bell Laboratories, Murray Hill, New Jersey, 1978. available at: http://citeseer.nj.nec.com/78143.html

[KhUr] A. A. Khwaja and J. E. Urban. Syntax-Directed Editing Environments: Issues and Features. ACM SIGAPP Symposium on Applied Computing, Indianapolis, Indiana, 1993


[KIMMGK] D. McKelvie, A. Isard, A. Mengel, M. B. Møller, M. Grosse and M. Klein. The MATE Workbench - an annotation tool for XML coded speech corpora. Speech Communication 33 (1-2) (2001), pp. 97-112. available at: http://www.iccs.informatics.ed.ac.uk/~dmck/Papers/speechcomm00.ps

[King] Brad King. GCC-XML, the XML output extension to GCC. http://www.gccxml.org/HTML/Index.html

[Kisel] O. Kiselyov. SXML Specification. ACM SIGPLAN Notices, Volume 37, Issue 6, June 2002. http://pobox.com/~oleg/ftp/Scheme/xml.html

[Knasm] M. Knasmüller. Reverse Literate Programming. Proc. of the 5th Software Quality Conference, Dundee, July 1996

[Kn84] Donald E. Knuth. Literate Programming. The Computer Journal, Vol. 27, No. 2, 1984

[Kn91] Donald E. Knuth. The TeXbook. Addison-Wesley, Reading, Mass., 11th ed., 1991

[Kn91a] Donald E. Knuth. TeX: The Program. Addison-Wesley, Reading, Mass., 4th ed., 1991

[Kn92] Donald E. Knuth. Literate Programming. CSLI Lecture Notes, no. 27, 1992, or Cambridge University Press

[KnLe] Donald E. Knuth and Silvio Levy. The CWEB System of Structured Documentation. Addison-Wesley, Reading, Mass., 1993

[Krep] Uwe Kreppel. WebWeb. http://www.progdoc.de/webweb/webweb.html

[Krom] John Krommes. fWeb. http://w3.pppl.gov/~krommes/fweb.html

[Leeu] Marc van Leeuwen. CWebx. http://wwwmathlabo.univ-poitiers.fr/~maavl/CWEBx/

[La86] Leslie Lamport. LaTeX: A Document Preparation System. Addison-Wesley, Reading, Mass., 1986

[LDP] The Linux Documentation Project. online at: http://www.tldp.org

[LeZi] A. Lempel and J. Ziv. A Universal Algorithm for Sequential Data Compression. IEEE Transactions on Information Theory, Vol. 23, No. 3

[JVM] Tim Lindholm and Frank Yellin. The Java Virtual Machine Specification. Addison-Wesley, 1999

[MacLa] Brett MacLaughlin. Java & XML. O'Reilly & Associates, 2nd ed., 2001

[Lia] Sheng Liang. The Java Native Interface. Addison-Wesley, 1999

[Lutz] Mark Lutz. Programming Python. O'Reilly & Associates, 2nd ed., 2001

[MathML] D. Carlisle, P. Ion, R. Miner and N. Poppelier (eds.). Mathematical Markup Language (MathML). W3C Recommendation, Oct. 2004. available at: http://www.w3.org/TR/MathML2

[Mel97] I. Dan Melamed. A Portable Algorithm for Mapping Bitext Correspondence. Proc. 35th Ann. Conf. of the Association for Computational Linguistics (ACL), Somerset, New Jersey, 1997. available at: http://acl.ldc.upenn.edu/P/P97/P97-1039.pdf


[MeyDa] N. Meyrowitz and A. van Dam. Interactive Editing Systems: Part I and II. Computing Surveys, Vol. 14, No. 3, Sept. 1982

[Meyer] Bertrand Meyer. Object-oriented software construction. Prentice Hall, 2nd ed., 1997

[Meyers] Meyers Konversationslexikon. Bibliographisches Institut, 4th ed., Leipzig, 1888-1889. available at: http://susi.e-technik.uni-ulm.de:8080/meyers/servlet/index

[MIF] Adobe Systems Incorporated. FrameMaker 7.0 - MIF Reference Online Manual. available at: http://partners.adobe.com/asn/framemaker/onlinemanuals.jsp

[Mitt] Frank Mittelbach. An environment for multicolumn output. available at: ftp://ftp.dante.de/tex-archive/help/Catalogue/entries/multicol.html

[MM04] Universität Karlsruhe, Rechenzentrum. Multimedia Transfer 2004. online at: http://www.mmt.uni-karlsruhe.de/transfer2004

[MoeKo] H. Mössenböck and K. Koskimies. Active Text for Structuring and Understanding Source Code. Software - Practice and Experience, Vol. 27, No. 7, July 1996

[MoSch] J. Morris and M. Schwartz. The Design of a Language-Directed Editor for Block-Structured Languages. SIGPLAN/SIGOA Symp. on Text Manipulation, Portland, 1981

[MueStr] Christoph Müller and Michael Strube. MMAX: A tool for the annotation of multi-modal corpora. Proc. of the 2nd IJCAI Workshop on Knowledge and Reasoning in Practical Dialogue Systems, Seattle, Wash., USA, August 5, pp. 45-50. available at: http://www.eml-research.de/english/homes/strube/downloads/ijcai01-ws.ps.gz

[MusicXML] Michael Good. MusicXML: An Internet-Friendly Format for Sheet Music. XML Conference & Exposition 2001, Orlando, Florida. available at: http://www.idealliance.org/papers/xml2001/papers/html/03-04-05.html

[MW] Merriam-Webster's Collegiate Dictionary. Merriam-Webster, 10th ed., 1998. available at: http://www.m-w.com

[MyBaLi] Andrew C. Myers, Joseph A. Bank and Barbara Liskov. Parameterized Types for Java. POPL 1997, Paris, France. http://www.cs.cornell.edu/andru/slides/popl97.ps.gz

[OASIS] The Organization for the Advancement of Structured Information Standards (OASIS). online at: http://www.oasis-open.org

[OASLit] The Oasis Consortium. SGML/XML and Literate Programming. http://www.oasis-open.org/cover/xmlLitProg.html

[OeB] The Open eBook Forum. Open eBook Publication Structure. available at: http://www.openebook.org/oebps/index.htm

[OdWa] M. Odersky and P. Wadler. Pizza into Java: Translating Theory into Practice. Proc. of the 24th ACM Symposium on Principles of Programming Languages, 1997, Paris, France. http://homepages.inf.ed.ac.uk/wadler/papers/pizza/pizza.ps

[OeBF] The Open eBook Forum. online at: http://www.openebook.org


[OlBo] Leif-Jöran Olsson and Lars Borin. A web-based tool for exploring translation equivalents on word and sentence level in multilingual parallel corpora. Erikoiskielet ja käännösteoria - Fackspråk och översättningsteori - LSP and Theory of Translation. 20th VAKKI Symposium, Vaasa, 11.-13.2.2000. Publications of the Research Group for LSP and Theory of Translation at the University of Vaasa, No. 27, 2000. available at: http://svenska.gu.se/~svelb/pblctns/VAKKI00.pdf

[Os76] J. F. Ossanna. NROFF/TROFF User's Manual. Bell Laboratories Computing Science Technical Report 54, 1976

[OWL] D. McGuinness and F. van Harmelen (eds.). OWL Web Ontology Language. W3C Recommendation, 10 February 2004. available at: http://www.w3.org/TR/owl-features/

[Park] Richard Parkinson. Cracking Codes - The Rosetta Stone and Decipherment. British Museum Press, London, 1999

[PDF] Adobe Systems Incorporated. PDF Reference, Version 1.4, 3rd ed. Addison-Wesley, 2001. available at: http://partners.adobe.com/asn/developer/technotes/acrobatpdf.html

[PeReEx] Perl 5 Regular Expressions. available at: http://www.perldoc.com/perl5.6/pod/perlre.html

[Pest] Slava Pestov. jEdit - Open Source programmer's text editor. http://www.jedit.org

[PeWiKr] R. Pesch, U. Wilckens and R. Kratz. Synoptisches Arbeitsbuch zu den Evangelien. Benziger Verlag/Gütersloher Verlagshaus, 1980

[Pier] P. Pierrou. Literate Programming in XML. Markup Technologies, Philadelphia, Pennsylvania, US, Dec. 1999. http://www.literateprogramming.com/farticles.html

[PKCS5] RSA Laboratories. PKCS #5 v2.0: Password-Based Cryptography Standard. available at: http://www.rsasecurity.com/rsalabs/pkcs/

[PS] Adobe Systems Incorporated. PostScript Language Reference Manual. Addison-Wesley, 1985. available at: http://partners.adobe.com/asn/developer/technotes/postscript.html

[Ram] Norman Ramsey. Literate Programming Simplified. IEEE Software, Sep. 1994, p. 97. http://www.eecs.harvard.edu/~nr/noweb/intro.html

[RamMar] N. Ramsey and C. Marceau. Literate Programming on a Team Project. Software - Practice & Experience, 21(7), Jul. 1991. http://www.literateprogramming.com/farticles.html

[RDF] Beckett, Brickley, Manola, Klyne, Hayes et al. (eds.). Resource Description Framework (RDF). W3C Consortium. available at: http://www.w3.org/RDF/

[Relax] ISO/IEC FDIS 19757-2, James Clark and MURATA Makoto (eds.). RELAX NG Specification. online at: http://www.relaxng.org/spec-20011203.html

[ReMyDu] Allen Renear, Elli Mylonas and David Durand. Refining our Notion of What Text Really Is: The Problem of Overlapping Hierarchies. Research in Humanities Computing, Oxford University Press, 1996. available at: http://www.stg.brown.edu/resources/stg/monographs/ohco.html


[RFC2413] S. Weibel, J. Kunze, C. Lagoze and M. Wolf. Dublin Core Metadata for Resource Discovery. RFC 2413, Sep. 1998. http://www.ietf.org/rfc/rfc2413.txt

[RFC2046] N. Freed and N. Borenstein. Multipurpose Internet Mail Extensions (MIME) Part Two: Media Types. RFC 2046, Nov. 1996. http://www.ietf.org/rfc/rfc2046.txt

[RTF] Microsoft Corporation. Rich Text Format (RTF) Specification. available at: msdn.microsoft.com/library/en-us/dnrtfspec/html/rtfspec.asp

[Samet] J. Sametinger. DOgMA: A Tool for the Documentation & Maintenance of Software Systems. Tech. Report, 1991, Institut für Wirtschaftsinformatik, J. Kepler Univ., Linz, Austria

[SamPom] J. Sametinger and G. Pomberger. A Hypertext System for Literate C++ Programming. JOOP, Vol. 4, No. 8, SIGS Publications, New York, 1992

[San] S. E. Sandø, The Software Development Foundation. CSF Specification. http://sds.sourceforge.net

[SAFKKC] S. Shavor, J. D'Anjou, S. Fairbrother, D. Kehn, J. Kellerman and P. McCarthy. The Java Developer's Guide to Eclipse. http://www.eclipse.org

[SHA] National Institute of Standards and Technology (NIST). Secure Hash Standard. Federal Information Processing Standards Publication 180-2, Aug. 2002. available at: http://csrc.nist.gov/publications/fips/fips180-2/fips180-2.pdf

[ShuCo] Stephan Shum and Curtis Cook. Using Literate Programming to Teach Good Programming Practices. 25th SIGCSE Symp. on Computer Science Education, 1994, pp. 66-70

[Sim] Volker Simonis. The ProgDOC Program Documentation System. http://www.progdoc.org

[Sim02] Volker Simonis. International Swinging: Making Swing Components Locale-Sensitive. Java Solutions, C/C++ Users Journal, Vol. 20, No. 8, August 2002. available at: http://www.cuj.com/documents/s=7961/cujjsup2008simonis/, source code at: ftp://ftp.cuj.com/pub/2002/2008_java/simonis.zip

[Sim04] Volker Simonis. Scrolling on demand - A scrollable toolbar component. Java Developer's Journal, Volume 9, Issue 7, July 2004. http://sys-con.com/java

[Sim03] Volker Simonis and Roland Weiss. ProgDOC - A New Program Documentation System. LNCS 2890, Andrei Ershov 5th Intern. Conf. “Perspectives of System Informatics”, July 9-12, 2003, Novosibirsk, Russia

[SiPl96] M. Simard and P. Plamondon. Bilingual Sentence Alignment: Balancing Robustness and Accuracy. In Proceedings of AMTA-96, Montréal, Canada, 1996. available at: http://www-rali.iro.umontreal.ca/Publications/spAMTA96.ps

[Szy] Clemens Szyperski. Component Software, 2nd ed. Addison-Wesley, 2002

[Simo96] C. Simonyi. Intentional Programming - Innovation in the Legacy Age. IFIP WG 2.1 meeting, June 4, 1996

[Simo99] C. Simonyi. The future is intentional. IEEE Computer Magazine, Vol. 32, No. 5, May 1999


[Sor] D. Soroker, M. Karasick, J. Barton and D. Streeter. Extension Mechanisms in Montana. Proc. of the 8th Israeli Conf. on Computer Based Systems and Soft- ware Engineering, 1997

[SouNav] Red Hat, Inc. Source Navigator. http://sourcenav.sourceforge.net

[SpHu99] C. M. Sperberg-McQueen and Claus Huitfeldt Concurrent Document Hier- arachies in MECS and SGML Litarary and Linguistic Computing, Vol. 14, Issue 1, 1999 available at: http://lingua.arts.klte.hu/allcach98/abst/abs47.htm

[SpHu00] C. M. Sperberg-McQueen and Claus Huitfeldt GODDAG: A Data Structure for Overlapping Hierarchies Principles of Digital Document Processing, Munchen,¨ Sep. 2000 available at: http://www.hit.uib.no/claus/goddag.html

[SperBu] C. M. Sperberg-McQueen and Lou Burnard (eds) Guidelines for Text Encoding and Interchange. TEI Consortium and Humanities Computing Unit, University of Oxford, 2002, ISBN 0-952-33013-X available at: http://www.tei-c.org/

[Str] Bjarne Stroustrup, The C++ Programming Language. Addison-Wesley, Special Edition, 2000

[TEISO] David Durand (chair) TEI Stand-Off Markup Workgroup. TEI Consortium, available at: http://www.tei-c.org/Activities/SO/

[TeRe] T. Teitelbaum and T. Reps. The Cornell Program Synthesizer: A Syntax-Directed Programming Environment. Communications of the ACM, Vol. 24, No. 9, Sept. 1981

[ThMcK] Henry S. Thompson and David McKelvie Hyperlink semantics for standoff markup of read-only documents Proceedings of SGML Europe ’97, Barcelona, Spain, 1997 available at: http://www.ltg.ed.ac.uk/∼ht/sgmleu97.html

[TopMa] Michel Biezunski, Martin Bryan, Steve Newcomb Topic Maps - 2nd edi- tion ISO/IEC 13250:1999, available at: http://www.y12.doe.gov/sgml/sc34/ document/0058.htm

[Trex] James Clark TREX - Tree Regular Expressions for XML online at: http://www. thaiopensource.com/trex/

[TU01] Universitat¨ Tubingen,¨ Zentrum fur¨ Datenverarbeitung, TUSTEP - Das Hand- buch, 2001 online at: http://www.uni-tuebingen.de/zdv/tustep

[U30] The Unicode Consortium. The Unicode Standard 3.0. Addison-Wesley, 2000, available at: http://www.unicode.org

[UNI] The Unicode Consortium, online at: http://www.unicode.org

[URI] T. Berners-Lee, R. Fielding, L. Masinter. RFC 2396: Uniform Resource Identifiers (URI): Generic Syntax. IETF (Internet Engineering Task Force), 1998, available at: http://www.ietf.org/rfc/rfc2396.txt

[UnReEx] The Unicode Consortium. Unicode Regular Expression Guidelines. Unicode Technical Report #18, http://www.unicode.org/unicode/reports/tr18

[VanWyk] Christopher J. Van Wyk. Literate Programming Column. Communications of the ACM, Vol. 33, No. 3, March 1990, pp. 361-362

[Ver] Jean Véronis (ed.). Parallel Text Processing. Kluwer Academic Publishers, Dordrecht, 2000

[VisAge] IBM Corporation. Visual Age C++. http://www-3.ibm.com/software/ad/vacpp

[VisSt] Microsoft Corporation. Visual Studio. http://msdn.microsoft.com/vstudio

[Walsh] Norman Walsh. Literate Programming in XML. XML 2002, Dec. 8-13, 2002, Baltimore, USA, http://www.nwalsh.com/docs/articles/xml2002/lp/

[Walsh2] Norman Walsh. DocBook XSL Stylesheets. http://docbook.sourceforge.net/projects/xsl

[WaMu] Norman Walsh and Leonard Muellner. DocBook: The Definitive Guide. O'Reilly & Associates, 1999, available at: http://www.docbook.org

[Wil] Ross N. Williams. funnelWeb. http://www.ross.net/funnelweb/

[WiMue] Richard Widhalm and Thomas Mück. Topic Maps. Springer-Verlag, Berlin Heidelberg, 2002

[Wir77] Niklaus Wirth. What can we do about the unnecessary diversity of notation for syntactic definitions? Communications of the ACM, Vol. 20, Issue 11, November 1977

[WirGu] N. Wirth and J. Gutknecht. The Oberon System. Software - Practice & Experience, 19(9), 1989, pp. 857-893

[WunZoe] R. Wunderling and M. Zöckler. DOC++. http://www.zib.de/Visual/software/doc++/

[WordNet] Piek Vossen and Christiane Fellbaum. The Global WordNet Association, available at: http://www.globalwordnet.org/

[WWW] The World Wide Web Consortium, online at: http://www.w3.org

[XEP] RenderX, Inc. XEP Rendering Engine. http://www.renderx.com/FO2PDF.html

[XHTML] The Extensible HyperText Markup Language. W3C Recommendation, Jan. 2000, available at: http://www.w3.org/MarkUp

[XInc] Jonathan Marsh, David Orchard (Editors). XML Inclusions (XInclude) Version 1.0. W3C Working Draft, Nov. 2003, available at: http://www.w3.org/TR/xinclude

[XLink] Steve DeRose, Eve Maler and David Orchard (Editors). XML Linking Language (XLink). W3C Recommendation, June 2001, available at: http://www.w3.org/TR/xlink

[XML] T. Bray, J. Paoli, C. M. Sperberg-McQueen, E. Maler (Editors). Extensible Markup Language. W3C Recommendation, Oct. 2000, available at: http://www.w3.org/XML

[XML-Na] T. Bray, D. Hollander, A. Layman (Editors). Namespaces in XML. W3C Recommendation, Jan. 1999, available at: http://www.w3.org/TR/REC-xml-names/

[XMLSch0] David C. Fallside (Editor). XML Schema Part 0: Primer. W3C Recommendation, May 2001, available at: http://www.w3.org/TR/xmlschema-0/

[XMLSch1] Thompson, Beech, Maloney, Mendelsohn (Editors). XML Schema Part 1: Structures. W3C Recommendation, May 2001, available at: http://www.w3.org/TR/xmlschema-1/

[XMLSch2] Biron, Malhotra (Editors). XML Schema Part 2: Datatypes. W3C Recommendation, May 2001, available at: http://www.w3.org/TR/xmlschema-2/

[XPath] James Clark and Steve DeRose (Editors). XML Path Language (XPath). W3C Recommendation, Nov. 1999, available at: http://www.w3.org/TR/xpath

[XPoint] Grosso, Maler, Marsh, Walsh (Editors). XPointer Framework. W3C Recommendation, Mar. 2003, available at: http://www.w3.org/TR/xptr-framework/

[XSL] S. Adler, A. Berglund, J. Caruso, et al. Extensible Stylesheet Language (XSL). W3C Recommendation, Oct. 2001, available at: http://www.w3.org/TR/xsl

[XSLT] James Clark (Ed.). XSL Transformations (XSLT) Version 1.0. W3C Recommendation, Nov. 1999, available at: http://www.w3.org/TR/xslt

[XTM] Steve Pepper and Graham Moore (eds.). XML Topic Maps (XTM) 1.0, available from: http://www.topicmaps.org/xtm/index.html

[Zuk97] John Zukowski. Java AWT Reference. Addison-Wesley, 1997

[Zuk] John Zukowski. "Magic with Merlin: Scrolling tabbed panes", available at: http://www-106.ibm.com/developerworks/java/library/j-mer0905/

[ZuStan] John Zukowski and Scott Stanchfield. Fundamentals of JFC/Swing, Part II. MageLang Institute, available at: http://developer.java.sun.com/developer/onlineTraining/GUI/Swing2

Acknowledgement

Last but not least I want to thank my supervisor and referee Prof. Dr. Loos, my second referee Prof. Dr. Luther and my colleagues for the fruitful cooperation and the many interesting discussions, not only on the topics of this thesis. I am especially grateful to A. Haug, who pointed me to the HyTime standard and the area of Topic Maps; to S. Morent, a musicologist with whom I had an interesting exchange of ideas about historical music editions that also had some impact on my current work; and to Ch. Schwarzweller and H. Gast, both experts in generic programming. Finally, I want to express my deep gratitude to my friends and colleagues R. Weiss and A. U. Kreppel for the discussions about computer science in general and program documentation in particular.

Colophon

This document has been typeset with pdfLaTeX at 10pt, using a modified report style which typesets the chapter and section headings with a sans serif font. Headers and footers have been created with the help of the fancyhdr package, and URLs have been typeset with the url package using the standard sans serif font. Palatino has been used as the base font, together with AvantGarde as the sans serif and LetterGothic as the mono-spaced font.

The figures have been created with xfig, the screen-shots with xv and the UML class diagrams with PoseidonCE. The small icon graphics used in the marginalia have been created with xpaint and gimp. The resulting PostScript and encapsulated PostScript files have been converted to PDF with epstopdf and ghostscript.

Except for section 4.2, where the fancyvrb package has been used for source code inclusion, code examples have been pretty printed and included directly from the source files using the ProgDOC program documentation system.
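For readers who want to reproduce a comparable layout, the following preamble fragment sketches what such a setup could look like in pdfLaTeX. It is a minimal sketch, not the actual preamble of this thesis: the package choices (mathpazo and avant from the PSNFSS bundle, sectsty for the sans serif headings) and the concrete header and footer layout are assumptions made for illustration, and LetterGothic, being a commercial font, has no standard free LaTeX package and is therefore omitted.

    \documentclass[10pt]{report}        % modified report style at 10pt
    \usepackage{mathpazo}               % Palatino as base font (assumed package)
    \usepackage{avant}                  % AvantGarde as sans serif family (assumed)
    \usepackage{sectsty}                % assumed route to sans serif headings
    \allsectionsfont{\sffamily}         % chapter and section headings in sans serif
    \usepackage{fancyhdr}               % custom headers and footers
    \pagestyle{fancy}
    \fancyhead[L]{\sffamily\leftmark}   % illustrative header layout
    \fancyfoot[C]{\sffamily\thepage}    % illustrative footer layout
    \usepackage{url}                    % line-breakable URLs
    \urlstyle{sf}                       % URLs in the standard sans serif font
    % LetterGothic as mono-spaced font would require a commercial font setup
    % and is not shown here.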