Text and Braille Computer Translation
A dissertation submitted to the University of Manchester Institute of Science and Technology for the degree of Master of Science, 2001.

Alasdair King
Department of Computation
28 September 2001

Declaration

No portion of the work referred to in the dissertation has been submitted in support of an application for another degree or qualification of this or any other university or other institution of learning.

Acknowledgements

I gratefully acknowledge the great support of my supervisor, Dr Gareth Evans, without whom this dissertation would not have been possible.

Abstract

This project is concerned with the translation of text to and from Braille code by a number of Java programs. It builds on an existing translation system that combines a finite state machine with left and right context matching and a set of translation rules. This allows the translation of different languages and different grades of Braille contraction, and both text-to-Braille and Braille-to-text. An existing implementation in C, allowing the translation of languages based on 256-character extended-ANSI sets, has been successfully integrated into a Microsoft Word-based translation system. In this project, three Java implementations of this translation system were developed. LanguageInteger is a port of the existing C code. Language256 uses the same translation language files but is coded using Java programming idioms. LanguageUnicode is based upon the use of Unicode to encode characters. Each implements a Language Java interface, which defines common public methods for the classes in accordance with object-oriented software development principles of encapsulation and reuse. All the implementations performed translation correctly on a range of different operating systems and machines, demonstrating that they are platform-independent.
LanguageUnicode was able to use language data files obtained over HTTP from a webserver. The implementations performed well relative to the native C program on a high-specification machine, but their performance was strongly dependent on available system resources. LanguageInteger performed fastest, and more consistently, across the range of platforms tested and is suitable to serve as a component in future development as a part of a platform-independent translation application. Language256 did not perform as fast, so its development should be discontinued. LanguageUnicode performed least well, suffering from using Java Strings to represent language information. It should be recoded using arrays of ints. This can be based on the Language256 program, which uses this representation internally. It is recommended that Java Beans be developed from the classes to facilitate future development with them as applets, GUI components and network applications. The three classes are supplemented by two more Java programs for creating the language rules tables used by the programs, a test language and test input that will allow the validation of future implementations of Language, and full documentation of the classes in the standard Sun API format for future development.

Contents

Declaration
Acknowledgements
Abstract
Contents
1 Introduction
2 Current state of Braille and computer use
    2.1 The Braille code
    2.2 Use of Braille with computer technology
    2.3 Approaches to performing Braille translation with computers
    2.4 The UMIST translation system
    2.5 Implementing the UMIST translation system: the BrailleTrans program
    2.6 Using BrailleTrans: the Word translation system
    2.7 Limitations of current implementation that can be addressed in this project
3 Solutions to current implementation limitations and development requirements
    3.1 Implementation platform and implications
    3.2 Addressing language universality
    3.3 Advancing the UMIST translation system: planned development
4 Implementation of solutions
    4.1 Language interface
    4.2 The two 256-character implementations, LanguageInteger and Language256
    4.3 LanguageUnicode
    4.4 The Make utilities
    4.5 Testing and translating utilities
    4.6 Documentation
    4.7 Packaging
    4.2 Performance criteria
5 Results of implementations
    5.1 Validation: meeting specification
    5.2 Verification: performance
    5.3 Language implementations
6 Conclusions and further work suggested
Appendix 1 - Computer Braille Code
Appendix 2 - Existing language rules table
    Character rules
    Wildcard specification
    Decision table
    Translation rules
Appendix 3 - Test results
Bibliography

1 Introduction

Braille is a system of writing that uses patterns of raised dots to inscribe characters on paper. It therefore allows visually-impaired people to read and write using touch instead of vision [rni2001]. It is a way for blind people to participate in a literate culture. First developed in the nineteenth century, Braille has become the pre-eminent tactile alphabet.
Its characters are six-dot cells, two wide by three tall. Any of the dots may be raised, giving 2^6 = 64 possible characters. Although Braille cells are used world-wide, the meaning of each of the 64 cells depends on the language that they are being used to depict. Different languages have their own Braille codes, mapping the alphabets, numbers and punctuation symbols to Braille cells according to need. Braille characters can also be used to represent whole words or groups of letters. These contractions allow fewer Braille cells to encode more text (dur1996), reducing the high cost of printing Braille text and making Braille faster to use for some experienced users (wes2001, lor1996b).

Modern computer translation of Braille is of benefit to Braille users. Scanners allow printed documents to be transformed into accessible computer documents. This text can then be translated into Braille for output to a Braille printer or directed to special Braille output devices. Braille input devices like the Perkins Brailler are designed to allow Braille code to be entered directly, and Braille users can successfully use a standard QWERTY keyboard. Braille code is stored on computers as North American Computer Braille Code, a mapping of the six-dot Braille code to ASCII English text character values (kei2000).

Braille translation is not a trivial task, however, because the contractions must be applied correctly. They complicate translation logic and introduce many idiomatic rules and exceptions. For example, the contraction for “OF” in American Braille cannot be used unless it is applied to letters pronounced the same way as the word “of”. This means that “OFten” can be contracted but not “prOFessor” [bra1999]. Despite these complications, translation programs have been developed, based on dictionaries of correct translations or complex rule systems. Few remain in the public domain for examination, however.
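
To make the difficulty concrete, the following minimal Java sketch applies the “of” contraction by plain substring replacement, with no regard for context. The class name NaiveContraction, the method contractEverywhere and the marker "{of}" standing in for the contraction cell are all assumptions made for illustration; none of them is taken from an existing translation program.

```java
class NaiveContraction {
    // Hypothetical placeholder for the single Braille cell that encodes "of".
    static final String OF_SIGN = "{of}";

    // Replaces every occurrence of the letters "of", ignoring context.
    static String contractEverywhere(String text) {
        return text.toLowerCase().replace("of", OF_SIGN);
    }

    public static void main(String[] args) {
        for (String word : new String[] {"of", "often", "professor"}) {
            System.out.println(word + " -> " + contractEverywhere(word));
        }
    }
}
```

By the rule quoted above, the result for “professor” is wrong: its letters “of” should be left uncontracted, which a context-free replacement has no way of knowing.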
One system that remains public was developed at UMIST [ble1995, ble1997]. It combines a large set of rules, relating input and output text, with a finite state machine that controls the application of rules by comparing left and right contexts. The system translates input text using three components: a set of character rules, which determine what characters are valid for the language and their attributes; a finite state machine decision table; and a set of translation rules containing wildcards for matching input text. These parts together constitute a complete language rules table. Contraction and different types of translation can all be supported within the same language rules table because of the state table. Idiomatic translations can be supported by the translation rules and the context matching. This system is very flexible and can be used to translate to or from Braille code any language for which a language rules table can be created.

An implementation of this system has been produced at UMIST, the C program BrailleTrans [ble1995, ble1997]. It works with ANSI 256-character sets, used to represent characters on most computer systems (fow1997). BrailleTrans does not contain any language information internally. The language rules table containing all the information for the language being used is loaded from a machine-format file when BrailleTrans is first executed. BrailleTrans can use any language rules table interchangeably. The language rules tables can be created in simple plain text files by non-technical users. They are then converted into the machine format by a second program, Mk. BrailleTrans can therefore translate any language for which a language rules table file has been created. So far, tables have been developed for Standard British English, Welsh and a prototype Hungarian, covering both contracted and uncontracted Braille and both text-to-Braille and Braille-to-text translation.
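
The idea of rules guarded by a state and by left and right contexts can be sketched in Java as follows. This is a deliberately simplified illustration and not the UMIST rule format or the BrailleTrans algorithm: the class ContextRuleSketch, the Rule fields, the "|" word-boundary wildcard, the "{of}" output marker and the single toy rule are all assumptions, and one state value stands in for the full decision table.

```java
import java.util.ArrayList;
import java.util.List;

class ContextRuleSketch {
    static final int UNCONTRACTED = 1;
    static final int CONTRACTED = 2;
    // Hypothetical wildcard: matches a word boundary (text edge or non-letter).
    static final String BOUNDARY = "|";

    // One toy rule: rewrite focus to output when the state and contexts match.
    static final class Rule {
        final int state;
        final String left, focus, right, output;
        Rule(int state, String left, String focus, String right, String output) {
            this.state = state; this.left = left; this.focus = focus;
            this.right = right; this.output = output;
        }
    }

    static boolean leftMatches(String text, int start, String left) {
        if (left.isEmpty()) return true; // empty context means no constraint
        if (left.equals(BOUNDARY)) {
            return start == 0 || !Character.isLetter(text.charAt(start - 1));
        }
        // startsWith returns false when the offset is negative.
        return text.startsWith(left, start - left.length());
    }

    static boolean rightMatches(String text, int end, String right) {
        if (right.isEmpty()) return true;
        if (right.equals(BOUNDARY)) {
            return end == text.length() || !Character.isLetter(text.charAt(end));
        }
        return text.startsWith(right, end);
    }

    // Scans left to right; the first applicable rule wins, else one character is copied.
    static String translate(String text, List<Rule> rules, int state) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        while (pos < text.length()) {
            Rule hit = null;
            for (Rule r : rules) {
                if (r.state == state
                        && text.startsWith(r.focus, pos)
                        && leftMatches(text, pos, r.left)
                        && rightMatches(text, pos + r.focus.length(), r.right)) {
                    hit = r;
                    break;
                }
            }
            if (hit != null) {
                out.append(hit.output);
                pos += hit.focus.length();
            } else {
                out.append(text.charAt(pos));
                pos++;
            }
        }
        return out.toString();
    }

    // Toy rule, not a real Braille rule: contract "of" only at the start of a word.
    static List<Rule> toyRules() {
        List<Rule> rules = new ArrayList<>();
        rules.add(new Rule(CONTRACTED, BOUNDARY, "of", "", "{of}"));
        return rules;
    }

    public static void main(String[] args) {
        String text = "often the professor of music";
        System.out.println(translate(text, toyRules(), CONTRACTED));
        System.out.println(translate(text, toyRules(), UNCONTRACTED));
    }
}
```

In this sketch the same rule list serves both contracted and uncontracted translation, because the state decides which rules may fire, and the left context keeps the toy rule from firing inside “professor”. A real language rules table would hold many such rules and a richer set of wildcards.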

BrailleTrans has also been used to add Braille translation functionality to the popular Microsoft Word word processor [ble2001]. This integration provides a friendly and familiar interface to the translation system for users.

BrailleTrans is fast and efficient, but it has limitations. It runs only on 32-bit Microsoft operating systems. It handles only 256-character sets, which do not allow the encoding of non-Western characters and do not supply a unique value for every different character. This makes a