23 from Image to Translation: Processing the Endangered Nyushu Script
Total Page:16
File Type:pdf, Size:1020Kb
From Image to Translation: Processing the Endangered Nyushu Script TONGTAO ZHANG, ARITRA CHOWDHURY, and NIMIT DHULEKAR, Rensselaer Polytechnic Institute JINJING XIA, Tsinghua University KEVIN KNIGHT, Information Sciences Institute, University of Southern California HENG JI and BULENT¨ YENER, Rensselaer Polytechnic Institute LIMING ZHAO, Tsinghua University 23 The lack of computational support has significantly slowed down automatic understanding of endangered lan- guages. In this paper, we take Nyushu (simplified Chinese: ; literally: “women’s writing”) as a case study to present the first computational approach that combines Computer Vision and Natural Language Process- ing techniques to deeply understand an endangered language. We developed an end-to-end system to read a scanned hand-written Nyushu article, segment it into characters, link them to standard characters, and then translate the article into Mandarin Chinese. We propose several novel methods to address the new chal- lenges introduced by noisy input and low resources, including Nyushu-specific feature selection for character segmentation and linking, and character linking lattice based Machine Translation. The end-to-end system performance indicates that the system is a promising approach and can serve as a standard benchmark. Categories and Subject Descriptors: I.2.7 [Artificial Intelligence]: Natural Language Processing— Language resources General Terms: Documentation, Languages Additional Key Words and Phrases: Endangered languages, nyushu, recognition, translation ACM Reference Format: Tongtao Zhang, Aritra Chowdhury, Nimit Dhulekar, Jinjing Xia, Kevin Knight, Heng Ji, Bulent¨ Yener, and Liming Zhao. 2016. From image to translation: Processing the endangered nyushu script. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 15, 4, Article 23 (May 2016), 16 pages. DOI: http://dx.doi.org/10.1145/2857052 1. INTRODUCTION During the past 20 years, civilizations all over the world have witnessed cultural as- similation, which has led to language endangerment. It was predicted that 90 percent of mankind’s languages would meet their doom by the end of this century [Krauss 1992]. China, one of the countries that can boast of an ancient civilization, has over 5,000 years of long and mysterious history. Currently 292 living languages are spo- ken by 56 recognized ethnic groups. There are more than 85 endangered languages spoken by non-Han minorities in southwestern China [Bradley 2005; Zhao and Song Authors’ addresses: T. Zhang, A. Chowdhury, N. Dhulekar, H. Ji, and B. Yener, 110 8th St, Troy, NY 12180 USA; emails: [email protected], [email protected], [email protected], [email protected], yener@ cs.rpi.edu; K. Knight, 4676 Admiralty Way #1001, Marina del Rey, CA 90292 USA; email: [email protected]; J. Xia and L. Zhao, 30 Shuangqing Rd, Haidian, Beijing 100084 China; emails: [email protected], [email protected]. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or [email protected]. c 2016 ACM 2375-4699/2016/05-ART23 $15.00 DOI: http://dx.doi.org/10.1145/2857052 ACM Trans. Asian Low-Resour. Lang. Inf. Process., Vol. 15, No. 4, Article 23, Publication date: May 2016. 23:2 T. Zhang et al. 2011], including Nyushu [Huang 1993], [Zhao and Zhang 2014], Muya [Huang 1985], Namuyi [Zhao and Zhang 2014], Zhuang [Li 2005], Ersu [Sun 1983], Pimu [Sun et al. 2007], Naxi [He and Jiang 1985] and Shuishu [Zhang 1980]. As living “fossils”, these languages are precious and valuable to multiple research disciplines such as studying the origin and evolution of writing systems, preserving unique cultures (e.g., the af- fliction literature and the patio salon culture reflected in the gender-specific Nyushu, and the matrilineal system reflected in Mosuo), religions (e.g., Dongba is mainly used to write scriptures), music (e.g., folk songs in Zhuang), ethnicity and civilization. As Mandarin Chinese (Putonghua) is widely taught in the whole country as a national standard and used by young people belonging to minority ethnicities to work in urban Han areas or to communicate with tourists, the population that masters and uses these languages regularly keeps decreasing and aging. Most masters are over 60 years old, and even their own expertise in these languages is declining [Zhao and Zhang 2014]. Apart from the decline in the usage of these languages, the other major crisis is with regards to the veracity of documents written in these languages. Some local business- men have been forging characters and articles (e.g., randomly take components and characters to create fake documents) in these languages for the sake of commercial profit. These languages are thus suffering from the lack of standards from documen- tary orthography and the inadequate electronic support. As a consequence it is getting more convoluted to understand the works of the masters and to properly convey the past efforts of linguistic experts on these documents. Fortunately, in the past decade, extensive linguistic research efforts have been made toward manually documenting and translating traditional articles, and constructing dictionaries for some of these languages. However, such efforts are very expensive without computational support. Moreover, in a long-term endangered language-related project, consolidated standards and rules are required and objective evidence should be provided to ensure the quality and accuracy of recognition, translation and standardization. One such language under threat of extinction is Nyushu (Simplified Chinese: ; literally “women’s writing”) [Zhao 1995, 2004a, 2004b, 2005, 2008]. Nyushu was spe- cially created for women in the ancient Hunan province of China. From ancient time to the mid 20th century, Chinese women were not allowed to receive education [Zhao 2004b], and a woman who was publicly literate would bring shame to her family. Thus, many knowledgeable women, though behind the scenes, created and used this coded language to communicate with each other. As the only gender-specific language, Nyushu was kept alive by being passed from one generation of women to the next gen- eration of women. They wrote this script on books, paper, paper fans and handkerchiefs by embroidering, engraving, stamping and calligraphy using bamboo or brush pens. The secrecy of this language lays in the fact that it appeared as normal decoration to men who did not understand or know of its existence. Figure 1 presents Nyushu script on a paper fan and a handkerchief. Compared to their Chinese counterparts, Nyushu characters are thin, elegant, and generally presented in italicized diamond shapes. Chinese characters are converted into uniform dots, vertical bars, italicized bars and arcs in Nyushu. For example, the character “heart” is written as “ ” in Chinese and “ ” in Nyushu. Each Nyushu character represents a phonogram in the Jiangyong dialect, and can be mapped to a cluster of Chinese characters which share similar pronunciations but different mean- ings (which is called Rebus in Chinese Linguistics). Therefore, there are many fewer Nyushu characters (about 800) compared to Chinese which has approximately 3,500 common characters. In fact, half of the Nyushu language experts can only chant the language, but do not have the same level of expertise in writing the language. More- over, neither was Nyushu promulgated publicly, nor was it taught formally in schools. The masters of Nyushu lacked the opportunity to receive higher education, so there are ACM Trans. Asian Low-Resour. Lang. Inf. Process., Vol. 15, No. 4, Article 23, Publication date: May 2016. From Image to Translation: Processing the Endangered Nyushu Script 23:3 Fig. 1. Figures of Nyushu on paper fan and handkerchief. usually many variants and misspellings in Nyushu handwriting. Another new chal- lenge comes from the low quality of the scanned images, as we can see from the example documents in Figure 1. The old articles were vulnerable to the bad conditions (pres- sure, temperature, etc.) and thus not well preserved. Most documents were captured with old-style roll film cameras and then the photos were scanned to computers several years later. Therefore, recognition and translation methods must be robust enough to handle noisy data. In comparison to other dominant languages such as Chinese with large labeled corpora available for training, Nyushu’s resources are extremely insufficient for state- of-the-art Optical Character Recognition (OCR) and Statistical Machine Translation (SMT). Bai and Huo [2005] utilizes more than 1 million annotated characters for OCR training in order to achieve 98.24% accuracy for Chinese handwritten charac- ters, but there is no annotation with equivalent size for Nyushu characters. Moreover, stains on the articles, low resolution of the photos and distortion from scanners all increase the difficulty for OCR. More importantly, stroke recognition algorithms [Liu et al. 2004, 2013] that have been employed for character recognition cannot be utilized in this context because such recognition systems require users to literally “write” a character (e.g., writing capitalized “A” by adding a “-” after writing “”). This allows the algorithm to track the strokes including lengths, directions, positions and order. However, all Nyushu documents are scanned images and, thus, very little knowl- edge about the order of the strokes in each character can be extracted. In place of the traditional stroke-recognition algorithms, we propose a more refined character- segmentation-based recognition algorithm specific to Nyushu.