
A Rule-Based Approach for Building an Artificial English-ASL Corpus

Zouhour Tmar, Achraf Othman and Mohamed Jemni
Research Lab. LaTICE, University of Tunis, Tunisia

Abstract—A serious problem facing researchers in the field of sign language processing is the absence of a large parallel corpus for sign languages. The ASLG-PC12 project proposes a rule-based approach for building a big parallel corpus between English written texts and American Sign Language Gloss. We present a novel algorithm which transforms an English sentence into ASL gloss. This project was started at the beginning of 2010 as a part of the WebSign project, and it offers today a corpus containing more than one hundred million pairs of sentences between English and ASL gloss. It is available online for free in order to develop and design new algorithms and theories for American Sign Language processing, for example statistical machine translation and related fields. In this paper, we present the tasks for generating ASL sentences from the Gutenberg Project corpus, which contains only English written texts.

Index Terms—Natural Language Processing, Sign Language, Parallel Corpora.

I. INTRODUCTION

To develop an automatic translator, or any tool that requires a learning task for sign languages, the major problem is the collection of a parallel corpus between text and Sign Language. A parallel corpus is a large, structured set of texts aligned between source and target languages. Such corpora are used for statistical analysis [1] and hypothesis testing, checking occurrences or validating linguistic rules on a specific universe. There is no standard and sufficient corpus [2], yet developing an automatic translation system based on statistics, without pre-treatment prior to the learning process, requires an important volume of data. In many ways, progress in sign language research is driven by the availability of data. For these reasons, we started to collect pairs of sentences between English and American Sign Language Gloss [3]. Because of the absence of data in ASL, and because on the other side there exists a huge amount of English written text, we have developed a corpus based on a collaborative approach where experts contribute to the collection or correction of the bilingual corpus or validate the automatic translation. This project [4] was started in 2010 as a part of the WebSign project [5], [6], which develops tools able to make information over the web accessible to deaf people [7], [8]. The main goal of our project is to develop a Web-based interpreter of Sign Language (SL). This tool would enable people who do not know Sign Language to communicate with deaf individuals, and would therefore contribute to reducing the language barrier between deaf and hearing people. Our secondary objective is to distribute this tool on a non-profit basis to educators, students, users, and researchers, and to disseminate a call for contribution to support this project, mainly in its exploitation step, and to encourage its wide use by different communities. In this paper, we review our experiences with constructing one such large annotated parallel corpus between English written texts and American Sign Language Gloss: the ASLG-PC12, a corpus consisting of over one hundred million pairs of sentences.

The paper is organized as follows. Section 2 presents several projects concerning sign language. In Section 3, we describe the gloss notation system. We then define methods and pre-processing tasks for collecting data from the Gutenberg Project [9], presenting two stages of pre-processing in which each sentence is extracted and tokenized. Section 4 presents our method and algorithms for constructing the second part of the corpus in American Sign Language Gloss; the constructed texts were generated automatically by transformation rules and then corrected by human experts in ASL. We also describe the composition and the size of the corpus. Discussions and conclusions are drawn in Sections 5 and 6.

II. BACKGROUND

Several projects concerned with Sign Language have recorded or annotated their own corpora, but only a few of them are suitable for automatic Sign Language translation, due to the amount of data available for learning and processing. The European Cultural Heritage Online organization (ECHO) published corpora for British Sign Language [10], Swedish Sign Language [11] and the Sign Language of the Netherlands [12]. All of these corpora include several stories signed by a single signer. The American Sign Language Linguistic Research group at Boston University published a corpus in American Sign Language [13]. TV broadcast news for the hearing impaired is another source of sign language recordings: Aachen University published a German Sign Language corpus of the domain weather report [14], and [15] published a multimedia corpus in Sign Language for machine translation.

In the literature, we found many related projects aiming to build corpora for Sign Language. Most of them are based on video recordings, and we cannot find textual data for building translation memories. Textual data for Sign Language is not a simple written form, because signs can carry other information, like eye gaze or facial expressions. So, for our corpus, we will use glosses to represent Sign Language. In the next section, we present a brief description of glosses.

III. GLOSSING SIGNS

Stokoe [16] proposed the first annotation system for describing Sign Language. Before, signs were thought of as

unanalyzed wholes, with no internal structure. The Stokoe notation system is used for writing American Sign Language using graphical symbols. Later, other notation systems appeared, like HamNoSys [17] and SignWriting. Furthermore, glosses are used to write signs in textual form. Glossing means choosing an appropriate English word for a sign in order to write it down. It is not translating, but it is similar to translating. A gloss of a signed story can be a series of English words, written in small capital letters, that correspond to the signs in the ASL story. Some basic conventions used for glossing are as follows:
• Signs are represented with small capital letters in English.
• Lexicalized finger-spelled words are written in small capital letters and preceded by the '#' symbol.
• Full finger-spelling is represented by dashes between small capital letters (for example, A-H-M-E-D).
• Non-manual signals and eye gaze are represented on a line above the sign glosses.
In this work, we use glosses to represent Sign Language. In the next section, we describe the steps for building our corpus.

Fig. 1. Occurrence of Zipf's Law in the Gutenberg corpora of English texts; the top ten words are (the, I, and, to, of, a, in, that, was, it).

U.S. Government work not protected by U.S. copyright
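The glossing conventions above can be mirrored in a few small helper functions. This is only a sketch: the function names are ours, small capitals are approximated by uppercase, and the '#' marker for lexicalized fingerspelling follows common glossing practice.

```python
def gloss(word):
    """A sign is written as an English word in small capitals,
    approximated here by uppercase."""
    return word.upper()

def fingerspell(word):
    """Full fingerspelling: dashes between capital letters,
    e.g. the name Ahmed -> A-H-M-E-D."""
    return "-".join(word.upper())

def lexicalized(word):
    """A lexicalized fingerspelled sign, marked with a leading '#'."""
    return "#" + word.upper()

print(gloss("learn"))        # LEARN
print(fingerspell("Ahmed"))  # A-H-M-E-D
print(lexicalized("job"))    # #JOB
```

Non-manual signals (eye gaze, facial expressions) would need a second annotation line above the glosses and are not modeled here.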

IV. PARALLEL CORPUS COLLECTION

A. Collecting data from Gutenberg

Acquisition of a parallel corpus for use in statistical analysis typically takes several pre-processing steps. In our case, there is not enough parallel data between English texts and American Sign Language, so we start by collecting only English data from the Gutenberg Project, with the aim of transforming it into ASL gloss. The Gutenberg Project [9] offers over 38K free ebooks, and more than 100K ebooks through its partners. The collecting task is made in the following steps:
• Obtain the raw data (by crawling all files in the FTP directory).
• Extract only the English texts, because there exist ebooks in languages other than English, such as German and Spanish. We also found files containing DNA sequences.
• Break the text into sentences (sentence splitting).
• Prepare the corpora (normalization, tokenization).
In the following, we describe in detail the acquisition of the Gutenberg corpus from the FTP directory. Figure 1 shows the occurrence of Zipf's Law in the Gutenberg corpora of English texts: (the, I, and, to, of, a, in, that, was, it) are the ten most frequently used words in the corpora. This metric also indicates which words are frequently used in English. In addition, a large amount of work went into removing non-English texts.

B. Sentence splitting, tokenization, chunking and parsing

Sentence splitting and tokenization require specialized tools for English texts. One problem of sentence splitting is the ambiguity of the period "." as either an end-of-sentence marker or a marker for an abbreviation. For English, we semi-automatically created a list of known abbreviations that are typically followed by a period. Issues with tokenization include English contractions such as "can't" (which we transform to "can not"), and the separation of possessive markers ("the man's" becomes "the man 's"). We also use an available sentence-splitting tool called Splitta. Its models are trained on Wall Street Journal news combined with the Brown Corpus, which is intended to be widely representative of written English; error rates on test news data are near 0.25%. We also use the CoreNLP tool, a set of natural language analysis tools which can take raw English text as input and give the base forms of words and their parts of speech.

Fig. 2. An example of transformation: English input "What did Bobby buy yesterday?"
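The two pre-processing steps above can be sketched as follows. This is a minimal illustration, not the production pipeline: the paper uses Splitta and CoreNLP in practice, and the abbreviation list here is a toy stand-in for the semi-automatically built one.

```python
import re

# Toy abbreviation list; the real one was built semi-automatically
# and is much larger.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "etc.", "e.g.", "i.e."}

def split_sentences(text):
    """Treat a period as an end-of-sentence marker unless it
    closes a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith(".") and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

def tokenize(sentence):
    """Expand contractions and separate possessive markers."""
    sentence = re.sub(r"\bcan't\b", "can not", sentence)   # can't -> can not
    sentence = re.sub(r"\bwon't\b", "will not", sentence)  # irregular case
    sentence = re.sub(r"(\w+)n't\b", r"\1 not", sentence)  # don't -> do not
    # Note: a real tokenizer would distinguish possessive 's from "is".
    sentence = re.sub(r"(\w+)'s\b", r"\1 's", sentence)    # man's -> man 's
    return sentence

print(split_sentences("Dr. Smith arrived. He sat down."))
print(tokenize("The man's dog can't bark."))
```

The splitter deliberately errs on the side of simplicity; statistically trained tools such as Splitta handle ambiguous periods far more robustly.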

V. ENGLISH-ASL PARALLEL CORPUS

A. Problematic

As stated in the beginning, the main problem in processing American Sign Language for statistical analysis, such as statistical machine translation, is the absence of data (corpora), especially in gloss format. By convention, the meaning of a sign is written in correspondence with the spoken language, to avoid the complexity of understanding. For example, the phrase "Do you like learning sign language?" is glossed as "LEARN SIGN YOU LIKE?". Here, the word "you" is rendered by the gloss "YOU" and the word "learning" is rendered as "LEARN". Our machine translation system must generate, after the learning step, the gloss sentence for a given English input.
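The word-level part of this correspondence can be expressed as a simple lemma lookup. The dictionary below is a toy stand-in for the project's 880-word lexicon, and sign order ("LEARN SIGN YOU LIKE?") is handled separately by the transformation rules described in the next subsection.

```python
# Toy lemma-to-gloss entries; illustrative only.
LEMMA_TO_GLOSS = {
    "you": "YOU",
    "like": "LIKE",
    "learn": "LEARN",   # covers inflected forms such as "learning"
    "sign": "SIGN",
    "language": "LANGUAGE",
}

def gloss_lemma(lemma):
    """Map an English lemma to its ASL gloss, falling back to the
    uppercased lemma when no lexicon entry exists."""
    return LEMMA_TO_GLOSS.get(lemma, lemma.upper())

print(gloss_lemma("learn"))  # LEARN
print(gloss_lemma("bobby"))  # BOBBY (fallback)
```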

B. Ascertainment and approach

Generally, in research on statistical analysis of sign language, the corpus consists of annotated video sequences. In our case, we only need a bilingual corpus: the source language is English and the target language is transcribed American Sign Language glosses. In this study, we started from 880 words (English and ASL glosses) coupled with transformation rules. From these rules, we generated a bilingual corpus containing 800 million words. This corpus is not concerned with semantics or with the types of verbs used in sign language, such as "agreement" or "non-agreement" verbs. Figure 2 shows an example of transformation between a written English text and its generated sentence in ASL. The input is "What did Bobby buy yesterday ?" and the target sentence is "BOBBY BUY WHAT YESTERDAY ?". In this example, we keep the word "YESTERDAY"; in some references one finds instead "PAST", which indicates the past tense, i.e. that the action was made in the past. Also, the symbol "?" can be replaced by a facial animation accompanying "WHAT". Our approach is based on the lemmatization of words: we keep the maximum of information in the sentence so that further approaches can be developed on these corpora. Statistics of the corpora are shown in Figure 4. The number of sentences and tokens is huge, and building the ASL corpus takes more than one week. All parts are available to download online for free [4]. Using the transformation rules in Figure 5, we build the ASL corpora following the steps shown in Figure 3. The input of the system is English sentences and the output is the ASL transcription in gloss. In Figure 5, only simple rules are shown; we can define complex rules starting from these simple rules, and we can define a part-of-speech sentence for the two languages.

Fig. 3. Steps for building the ASL corpora.

Fig. 4. Size of the American Sign Language Gloss Parallel Corpus 2012 (ASLG-PC12).
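The rule-based transformation just described can be sketched as follows: look up the sentence's part-of-speech pattern in a rule database; on a hit, reorder the lemmas as the rule dictates; on a miss, gloss lemma by lemma while dropping words with no sign. The rule table, tag sets and index encoding below are our own illustration, not the project's actual database.

```python
# Hypothetical rule: the POS pattern of the English sentence maps
# to an output order given as word indices.
RULES = {
    ("WP", "VBD", "NNP", "VB", "NN", "."): [2, 3, 0, 4, 5],
}

DROP_TAGS = {"DT", "IN"}  # e.g. "the", "a", "in" have no gloss

def transform(lemmas, pos_tags):
    """Apply a full-sentence rule if the POS pattern is known;
    otherwise fall back to lemma-by-lemma glossing."""
    key = tuple(pos_tags)
    if key in RULES:                      # full-sentence rule exists
        chosen = [lemmas[i] for i in RULES[key]]
    else:                                 # fallback: gloss each lemma
        chosen = [l for l, t in zip(lemmas, pos_tags) if t not in DROP_TAGS]
    return " ".join(l.upper() for l in chosen)  # final uppercase step

# "What did Bobby buy yesterday ?" with lemmas and Penn Treebank
# tags as a tagger such as CoreNLP would produce them.
print(transform(["what", "do", "bobby", "buy", "yesterday", "?"],
                ["WP", "VBD", "NNP", "VB", "NN", "."]))
# -> BOBBY BUY WHAT YESTERDAY ?
```

On the example sentence the rule fires and reorders the lemmas; on an unknown pattern the fallback path produces a word-by-word gloss minus determiners and prepositions, matching the two branches of the flow in Figure 3.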
According to Figure 3, when we check whether a rule for the sentence S exists in the database and the algorithm returns true, we apply the transformation directly. Of course, all complex rules must be written by experts in ASL. Figure 5 shows some transformations from English sentences to American Sign Language; these transformation rules were made by an expert in ASL. In Figure 3, we describe the steps to transform an English sentence into American Sign Language gloss. The input of the system is the English sentence. Using the CoreNLP tool, we generate an XML file containing morphological information about the sentence after the tokenization task. Then, we build the part-of-speech sentence and, thanks to the transformation rules database, we try to transform the input. In some cases, the part-of-speech sentence does not exist in the database; in that case, we transform each lemma individually. The transformation rule for a lemma is presented in Figure 5. In the last step, we apply an uppercase script to transform the output. The transformation rule is not a direct transformation of each lemma: it can be an alignment of words and can ignore some English words like (the, in, a, an, ...).

Fig. 5. Example of full-sentence transformation rules.

C. Transformation Rules

Not all the transformation rules used to transform the English data were verified by experts in linguistics. We validated only 800 rules, plus the transformation rules for lemmas. We cannot validate all rules because there exists an infinite number of them. For this reason, we developed an application that lets experts enter their rules from an English sentence, without coding. The application is a simple user interface containing the lemma transformation rule; the expert composes the lemmas, saves the result and rebuilds the corpora. The built corpus is thus made by a collaborative approach and validated by experts.

D. Releases of the English-ASL Corpus

The initial release of this corpus consisted of data up to September 2011. The second release added data up to January 2012, increasing the size from just over 800 sentences to up to 800 million words in English. A forthcoming third release will include data up to early 2013 and will have better tokenization and more words in American Sign Language. For more details, please check the website [4].

VI. DISCUSSION AND CONCLUSION

We described the construction of the English-American Sign Language corpus and illustrated a novel method for transforming an English written text into American Sign Language gloss. This corpus will be useful for statistical analysis of ASL and any related fields [18], [19]. We present the first corpus of ASL gloss that exceeds one hundred million sentences, available to all researchers and linguists. During the next phase of the ASLG-PC12 project, we expect to provide both a richer analysis of the existing corpus and other parallel corpora (such as French Sign Language, Arabic Sign Language, etc.). This will be done by first enriching the rules through experts. Enrichment will be achieved by automatically transforming the current transformation rules database, and then validating the results by hand.

REFERENCES

[1] A. Othman and M. Jemni, "Statistical sign language machine translation: from English written text to American Sign Language gloss," IJCSI International Journal of Computer Science Issues, vol. 8, issue 5, no. 3, 2011.
[2] M. Sara and W. Andy, "Joining hands: Developing a sign language machine translation system with and for the deaf community," in Proceedings of the Conference and Workshop on Assistive Technologies for People with Vision and Hearing Impairments (CVHI-2007), Granada, Spain, 28-31 August 2007, vol. 415.
[3] A. Othman and M. Jemni, "English-ASL gloss parallel corpus 2012: ASLG-PC12," in 5th Workshop on the Representation and Processing of Sign Languages: Interactions between Corpus and Lexicon, Istanbul, Turkey, 2012.
[4] A. Othman, "American Sign Language gloss parallel corpus 2012 (ASLG-PC12)," 2012. [Online]. Available: http://www.achrafothman.net/aslsmt/
[5] M. Jemni and O. E. Ghoul, "Towards Web-based automatic interpretation of written text to sign language," in First International Conference on ICT and Accessibility (ICTA-2007), Hammamet, Tunisia, April 2007.
[6] M. Jemni and O. E. Ghoul, "An avatar based approach for automatic interpretation of text to sign language," in 9th European Conference for the Advancement of Assistive Technologies in Europe, San Sebastian, Spain, 3-5 October 2007.
[7] M. Jemni and O. E. Ghoul, "A system to make signs using collaborative approach," in ICCHP, Lecture Notes in Computer Science, vol. 5105, pp. 670-677, Springer Berlin/Heidelberg, Linz, Austria, 2008.
[8] M. Jemni, O. E. Ghoul, and N. B. Yahya, "Sign language MMS to make cell phones accessible to the deaf and hard-of-hearing community," in CVHI, Euro-Assist Conference and Workshop on Assistive Technology for People with Vision and Hearing Impairments, Granada, Spain, 2007.
[9] Project Gutenberg, 2012. [Online]. Available: http://www.gutenberg.org/
[10] B. Woll, R. Sutton-Spence, and D. Waters, "ECHO data set for British Sign Language (BSL)," Department of Language and Communication Science, City University, London, 2004.
[11] B. Bergman and J. Mesch, "ECHO data set for Swedish Sign Language (SSL)," Department of Linguistics, University of Stockholm, 2004.
[12] O. Crasborn, E. Kooij, A. Nonhebel, and W. Emmerik, "ECHO data set for Sign Language of the Netherlands (NGT)," Department of Linguistics, Radboud University Nijmegen, 2004.
[13] V. Athitsos, C. Neidle, S. Sclaroff, J. Nash, R. Stefan, A. Thangali, H. Wang, and Q. Yuan, "Large lexicon project: American Sign Language video corpus and sign language indexing/retrieval algorithms," in Proceedings of the 4th Workshop on the Representation and Processing of Sign Languages: Corpora and Sign Language Technologies, LREC, 2010.
[14] J. Bungeroth, D. Stein, M. Zahedi, and H. Ney, "A German Sign Language corpus of the domain weather report," in 5th International, 2006, p. 29.
[15] S. Morrissey, H. Somers, R. Smith, S. Gilchrist, and I. D, "Building a sign language corpus for use in machine translation," in 4th Workshop on the Representation and Processing of Sign Languages: Corpora and Sign Language Technologies, 2010.
[16] W. C. Stokoe, "Sign language structure: An outline of the visual communication systems of the American deaf," 1960.
[17] S. Prillwitz and H. Zienert, "Hamburg Notation System for sign language: Development of a sign writing with computer application," in Current Trends in European Sign Language Research: Proceedings of the 3rd European Congress on Sign Language Research, 1990.
[18] M. Boulares and M. Jemni, "Mobile sign language translation system for deaf community," in W4A ACM, 9th International Cross-Disciplinary Conference on Web Accessibility, Lyon, France, 2012.
[19] K. Jaballah and M. Jemni, "Toward automatic sign language recognition from Web3D based scenes," in ICCHP, Lecture Notes in Computer Science, vol. 6190, pp. 205-212, Springer Berlin/Heidelberg, Vienna, Austria, 2010.