Simultaneous Word-Morpheme Alignment for Statistical Machine Translation

Total Page:16

File Type:pdf, Size:1020Kb

Simultaneous Word-Morpheme Alignment for Statistical Machine Translation Simultaneous Word-Morpheme Alignment for Statistical Machine Translation Elif Eyigoz¨ Daniel Gildea Kemal Oflazer Computer Science Computer Science Computer Science University of Rochester University of Rochester Carnegie Mellon University Rochester, NY 14627 Rochester, NY 14627 PO Box 24866, Doha, Qatar Abstract word segmentations that create a one-to-one map- ping of words in both languages. Al-Onaizan et al. Current word alignment models for statisti- (1999), Cmejrekˇ et al. (2003), and Goldwater and cal machine translation do not address mor- McClosky (2005) manipulate morphologically rich phology beyond merely splitting words. We languages by selective lemmatization. Lee (2004) present a two-level alignment model that dis- attempts to learn the probability of deleting or merg- tinguishes between words and morphemes, in ing Arabic morphemes for Arabic to English trans- which we embed an IBM Model 1 inside an HMM based word alignment model. The lation. Niessen and Ney (2000) split German com- model jointly induces word and morpheme pound nouns, and merge German phrases that cor- alignments using an EM algorithm. We eval- respond to a single English word. Alternatively, uated our model on Turkish-English parallel Yeniterzi and Oflazer (2010) manipulate words of data. We obtained significant improvement of the morphologically poor side of a language pair BLEU scores over IBM Model 4. Our results to mimic having a morphological structure similar indicate that utilizing information from mor- to the richer side via exploiting syntactic structure, phology improves the quality of word align- in order to improve the similarity of words on both ments. sides of the translation. We present an alignment model that assumes in- 1 Introduction ternal structure for words, and we can legitimately talk about words and their morphemes in line with All current state-of-the-art approaches to SMT rely the linguistic conception of these terms. Our model on an automatically word-aligned corpus. However, avoids the problem of collapsing words and mor- current alignment models do not take into account phemes into one single category. We adopt a two- the morpheme, the smallest unit of syntax, beyond level representation of alignment: the first level in- merely splitting words. Since morphology has not volves word alignment, the second level involves been addressed explicitly in word alignment models, morpheme alignment in the scope of a given word researchers have resorted to tweaking SMT systems alignment. The model jointly induces word and by manipulating the content and the form of what morpheme alignments using an EM algorithm. should be the so-called “word”. We develop our model in two stages. Our initial Since the word is the smallest unit of translation model is analogous to IBM Model 1: the first level from the standpoint of word alignment models, the is a bag of words in a pair of sentences, and the sec- central focus of research on translating morphologi- ond level is a bag of morphemes. In this manner, cally rich languages has been decomposition of mor- we embed one IBM Model 1 in the scope of another phologically complex words into tokens of the right IBM Model 1. At the second stage, by introducing granularity and representation for machine transla- distortion probabilities at the word level, we develop tion. Chung and Gildea (2009) and Naradowsky and an HMM extension of the initial model. Toutanova (2011) use unsupervised methods to find We evaluated the performance of our model on the 32 Proceedings of NAACL-HLT 2013, pages 32–40, Atlanta, Georgia, 9–14 June 2013. c 2013 Association for Computational Linguistics Turkish-English pair both on hand-aligned data and are compatible if and only if: by running end-to-end machine translation experi- + + ments. To evaluate our results, we created gold word ∀ j, k, m, n ∈ N , ∃ s, t ∈ N alignments for 75 Turkish-English sentences. We [am(j, k) = (m, n) ⇒ aw(j) = m] ∧ obtain significant improvement of AER and BLEU [aw(j) = m ⇒ am(j, s) = (m, t)] ∧ scores over IBM Model 4. Section 2.1 introduces [a (j) = null ⇒ a (j, k) = null] (1) the concept of morpheme alignment in terms of its w m relation to word alignment. Section 2.2 presents Please note that, according to this definition of com- the derivation of the EM algorithm and Section 3 patibility, ‘am(j, k) = null’ does not necessarily im- presents the results of our experiments. ply ‘aw(j) = null’. A word alignment induces a set of compati- 2 Two-level Alignment Model (TAM) ble morpheme alignments. However, a morpheme alignment induces a unique word alignment. There- fore, if a morpheme alignment a and a word align- 2.1 Morpheme Alignment m ment aw are compatible, then the word alignment is Following the standard alignment models of Brown aw is recoverable from the morpheme alignment am. et al. (1993), we assume one-to-many alignment for The two-level alignment model (TAM), like IBM Model 1, defines an alignment between words both words and morphemes. A word alignment aw (or only a) is a function mapping a set of word po- of a sentence pair. In addition, it defines a mor- sitions in a source language sentence to a set of pheme alignment between the morphemes of a sen- word positions in a target language sentence. A mor- tence pair. The problem domain of IBM Model 1 is defined pheme alignment am is a function mapping a set of morpheme positions in a source language sentence over alignments between words, which is depicted to a set of morpheme positions in a target language as the gray box in Figure 1. In Figure 2, the smaller sentence. A morpheme position is a pair of integers boxes embedded inside the main box depict the new (j, k), which defines a word position j and a relative problem domain of TAM. Given the word align- morpheme position k in the word at position j. The ments in Figure 1, we are presented with a new alignments below are depicted in Figures 1 and 2. alignment problem defined over their morphemes. The new alignment problem is constrained by the given word alignment. We, like IBM Model 1, adopt aw(1) = 1 am(2, 1) = (1, 1) aw(2) = 1 a bag-of-morphemes approach to this new problem. We thus embed one IBM Model 1 into the scope of another IBM Model 1, and formulate a second-order Figure 1 shows a word alignment between two sen- interpretation of IBM Model 1. tences. Figure 2 shows the morpheme alignment be- TAM, like IBM Model 1, assumes that words and tween same sentences. We assume that all unaligned morphemes are translated independently of their morphemes in a sentence map to a special null mor- context. The units of translation are both words and pheme. morphemes. Both the word alignment aw and the A morpheme alignment a and a word alignment m morpheme alignment am are hidden variables that aw are compatible if and only if they satisfy the fol- need to be learned from the data using the EM algo- lowing conditions: If the morpheme alignment am rithm. maps a morpheme of e to a morpheme of f, then the In IBM Model 1, p(e|f), the probability of trans- word alignment aw maps e to f. If the word align- lating the sentence f into e with any alignment is ment aw maps e to f, then the morpheme alignment computed by summing over all possible word align- am maps at least one morpheme of e to a morpheme ments: of f. If the word alignment a maps e to null, then w X all of its morphemes are mapped to null. In sum, a p(e|f) = p(a, e|f) morpheme alignment am and a word alignment aw a 33 Figure 1: Word alignment Figure 2: Morpheme alignment In TAM, the probability of translating the sentence in the scope of the left part. In the right part, we f into e with any alignment is computed by sum- compute the probability of translating the word fi ming over all possible word alignments and all pos- into the word ej by summing over all possible mor- sible morpheme alignments that are compatible with pheme alignments between the morphemes of ej and a given word alignment aw: fi. R(ej, fi) is equivalent to R(e, f) except for the X X fact that its domain is not the set of sentences but p(e|f) = p(aw, e|f) p(am, e|aw, f) (2) the set of words. The length of words ej and fi in aw am R(ej, fi) are the number of morphemes of ej and fi. where am stands for a morpheme alignment. Since The left part, the contribution of word transla- the morpheme alignment am is in the scope of a tion probabilities alone, equals Eqn. 3. Therefore, given word alignment aw, am is constrained by aw. canceling the contribution of morpheme translation In IBM Model 1, we compute the probability of probabilities reduces TAM to IBM Model 1. In translating the sentence f into e by summing over our experiments, we call this reduced version of all possible word alignments between the words of f TAM ‘word-only’ (IBM). TAM with the contribu- and e: tion of both word and morpheme translation proba- |e| |f| bilities, as the equation above, is called ‘word-and- Y X p(e|f) = R(e, f) t(ej|fi) (3) morpheme’. Finally, we also cancel out the con- j=1 i=0 tribution of word translation probabilities, which is called ‘morpheme-only’.
Recommended publications
  • Text Complexity by Dr
    Author Monographs Text Complexity By Dr. Timothy Shanahan University of Illinois at Chicago Member, National Reading Panel; President, International Reading Association, 2006 Chair, National Literacy Panel and National Early Literacy Panel It would be too costly and All texts are not equal. Some texts inconvenient to perform the above- are harder to read and comprehend. explained process for every text, of Just as there are individual differences course, so a second fundamental in children, there are individual premise of readability measurement differences in the texts that we ask is that the measure must specifically children to read. Some children summarize only text features or read better than others, and there characteristics, and not how easily are a variety of reasons for these read the text has been in the past. On differences in reading abilities. The the basis of this enumeration of text same is true of texts. Some texts are features, the formula has successfully easier or harder, and there are several predicted reader comprehension for reasons for these differences. texts. The purpose of this essay is to describe what factors Readability measures have evolved since they first cause texts to differ and to explore the relationship of appeared. Now, most readability formulas only measure text difficulty and children’s learning. two factors: word complexity and sentence complexity (Klare, 1984). Thus, measuring the readability or What makes text complex? complexity of a text involves an evaluation of the words: counting the average number of letters or syllables, Since the 1920s, there has been interest in measuring checking the frequency of the words (common words text difficulty or readability (Lively & Pressey, 1923).
    [Show full text]
  • The Lexile Framework: Unnecessary and Potentially Harmful
    The Lexile Framework: Unnecessary and Potentially Harmful Stephen Krashen CSLA (California School Library Association) Journal 24(2): 25-26, 2001 The Lexile Framework attempts to solve a problem that doesn't exist. It is a readability formula that "stands firmly in the tradition of classic readability formulas" (Stenner, 1996, p. 23) that assigns reading levels to texts based on word frequency and sentence length. The Lexile Framework is intended to help teachers and librarians recommend supplementary reading that is at the right reading level: "For example, an eighth-grade girl who is interested in sports but is not reading at grade level might be able to handle a biography of a famous athlete. The teacher may not know, however, whether that biography is too difficult or too easy for the student. " All the teacher has to do is use the Lexile Framework on the text and the student and select a book at the right level. Then, "as the reader improves, new titles with higher text measures can be chosen to match the growing person (sic) measure, thus keeping the comprehension rate at the chosen level." (Stenner, 1996, p. 22). Not Necessary None of this is necessary, and it is probably harmful. There is a much easier way for readers to select texts: Are they comprehensible and interesting? It doesn't take long for a reader to determine this: All it takes is sampling a little of the text (reading it). Our eighth grader simply needs to have a look at a few biographies. Teachers and librarians can certainly help in text selection and they do this all the time, with great success.
    [Show full text]
  • How to Check a Readability Score with Microsoft Word
    How to Check a Readability Score with Microsoft Word Step 1: Open your Informed Consent document in Word. Step 2: Look at the top tool bar and click “Review.” It should the ninth option at the top in Word 365. Step 3: Click “Check Document” or “Spelling Check.” It should be on the right at the very top with the image of a green check mark and the letters ‘abc’. Step 4: A small window should pop up that reads “Readability Statistics,” like the one to the right. *If this window does not automatically pop up for you, try these steps: (1) Go to “File,” then “Options.” (2) Select “Proofing.” (3) Under “When correcting spelling and grammar in Word,” make sure the “Check grammar with spelling” check-box is selected. (4) Select “Show readability statistics.” (5) Try starting over with Step 2. If you still can’t get the readability statistics to show, please click the help button in the top tool bar for step-by-step instructions on how to enable this feature. Step 5: “Flesh-Kincaid Grade Level” in this window is the number the IRB is concerned with regarding readability for the population you are studying. Guidelines - Any population that requires an advanced degree (2-year degree or more) must have a reading level of 12.0 or below. - Children or other special populations (the elderly, prison inmates, etc.) must have a reading level of 3.0 or below. - The general public or any other population that does not fit into the other two categories must have a reading level of 8.0 or below.
    [Show full text]
  • Smog Readability Formula
    The Smog Readability Formula Adapted from McLaughlin, G. (1969). SMOG grading: A new readability formula. Journal of Reading, 12 (8). 639-646. The SMOG conversion tables were developed by Harold C. McGraw, Office of Educational Research, Baltimore Co. Public Schools, Towson, MD. The SMOG Readability Formula is a simple method you can use to determine the reading level of your written materials. If a person reads at or above a grade level, they will understand 90-100% of the information. Generally, you need to aim for a reading level of sixth grade or less. In addition, to ensure that the text is clear and readable, read your draft aloud. How to use the SMOG formula: 1. Count 10 sentences in a row near the beginning of your material. Count 10 sentences in the middle. Count 10 sentences near the end. (30 total sentences) 2. Count every word with three or more syllables in each group of sentences, even if the same word appears more than once. 3. Add the total number of words counted. Use the SMOG Conversion Table I to find the grade level. If your material has fewer than 30 sentences, follow the instructions for "SMOG on Shorter Passages" and use SMOG Conversion Table II. Word Counting Rules: • A sentence is any group of words ending with a period, exclamation point, or question mark. • Words with hyphens count-as-one-word. • Proper nouns are counted. • Read numbers out loud to decide the number of syllables. • In long sentences with colons or semicolons followed by a list, count each part of the list with the beginning phrase of the sentence as an individual sentence.
    [Show full text]
  • Generating and Rendering Readability Scores for Project Gutenberg Texts
    Generating and Rendering Readability Scores for Project Gutenberg Texts Ronald P. Reck RRecktek LLC. [email protected] and Ruth A. Reck University of California Davis, CA 95616 U.S.A. [email protected] Abstract Here the frequency distribution functions have been calculated for seven different types of readability measurements for over fourteen thousand texts from Project Gutenberg1 (PG). Other supporting measurements were also obtained: the average characters per word, the words per sentence, and the syllables per word. Three types of distributions have been demonstrated from the analysis of the metadata. While there are similarities among some of the scores, there is considerable interpretation yet to be made. The most complex and unique distribution function is found for the Flesch Reading Ease scores. Because of the computing intensity necessary to obtain these distributions it is only in the present age of information science that such a broad brush of characterization of a billion word data source can be made. It is essential that these be sorted by language to better interpret the meaning of the distributions. 1. Introduction Various readability measurements can serve as indicators to quantify the relative accessibility of written information. However, domain specific attributes such as complex terminology or language can direct readability scores towards higher values than the actual complexity of the text warrants. For instance, scientific writing is likely to contain long words that may not significantly increase the complexity of the writing to those familiar with the terms but make the readability value appear greater. Despite this and other limitations, readability measurements remain useful attributes for describing text, especially when the values are regarded as relative measurements from within a specific type of writing or language.
    [Show full text]
  • SUFE and the Internet: Are Healthcare Information Websites Accessible to Parents?
    Open access Original research bmjpo: first published as 10.1136/bmjpo-2020-000782 on 20 October 2020. Downloaded from SUFE and the internet: are healthcare information websites accessible to parents? Andrea Mc Carthy , Colm Taylor To cite: Mc Carthy A, Taylor C. ABSTRACT What is known about the subject? SUFE and the internet: are Background Slipped upper femoral epiphysis is healthcare information an adolescent hip disorder requiring rapid surgical ► Slipped upper femoral epiphysis (SUFE) is a complex websites accessible to intervention. Faced with the prospect of their child parents? BMJ Paediatrics Open adolescent hip disorder. Caregivers may turn to the undergoing surgery, many fearful parents will turn to the 2020;4:e000782. doi:10.1136/ internet for information but find healthcare websites internet to provide information and reassurance. Previous bmjpo-2020-000782 offer conflicting advice, resulting in increased anx- studies have shown the orthopaedic information can be iety and reduced postoperative rehab compliance. ► Additional material is difficult to comprehend. published online only. To view Objective Assess the readability of healthcare websites please visit the journal online regarding slipped upper femoral epiphysis. (http:// dx. doi. org/ 10. 1136/ Methods The term Slipped Upper Femoral Epiphysis was What this study adds? bmjpo- 2020- 000782). searched in Google, Bing and Yahoo. The websites were evaluated using readability software with seven specialised ► The study shows that for websites regarding SUFE, the average Flesich Read Ease Score score is 52 Received 30 June 2020 readability tests including the Flesch- Kincaid Reading and the average reading grade level is 8.67. Both of Revised 28 August 2020 Grade Level, the Flesch Reading Ease Score, the Simple Accepted 6 September 2020 Measure of Gobbledygook, Coleman- Liau Index, Automated these figures do not comply with acceptable stan- Readability Index and the Gunning Fog Index.
    [Show full text]
  • Readability Level Reading Formulas
    Tool 13. Assessing Readability Level Reading Formulas Readability measures are primarily based on factors such as the complexity of the printed materials by measuring the number of words in the sentence and the number of letters or syllables per word (i.e., as a reflection of word frequency.) A “score” reflects the grade level of the printed material. • The Flesch Reading Ease The Flesch Reading Ease is one of the oldest and most accurate formulas. It is a simple approach to assess the grade-level of the reader. • The Fry Graph Readability Formula Edward Fry developed readability tests based on graph. This graph-based test determined readability through high school. In 1977, Fry extended the graph to test through the college years. • The SMOG Readability Formula This is a simple method used to determine the reading level of written materials. The SMOG Readability Formula was developed by Harold C. McGraw. These formulas can be helpful but should not be your only evaluation tool because reading level is only one aspect of readability, and readability formulas are not always accurate with forms that have short sentences or phrases. The Flesch Reading Ease Readability Formula The specific mathematical formula is: RE = 206.835 – (1.015 x ASL) – (84.6 x ASW) RE = Readability Ease ASL = Average Sentence Length (i.e., the number of words divided by the number of sentences) ASW = Average number of syllables per word (i.e., the number of syllables divided by the number of words.) The output, i.e., RE is a number ranging from 0 to 100.
    [Show full text]
  • A Text Readability Continuum for Postsecondary Readiness
    Volume 19 ✤ Number 4 ✤ Summer 2008 ✤ pp. 602–632 A Text Readability Continuum for Postsecondary Readiness Gary L. Williamson North Carolina Department of Public Instruction Over the last 3 decades, a number of authors claimed that too many students were unprepared for the postsecondary world (e.g., Bock & Moore, 1986; Greene & Forster, 2003; Sum, 1999; oU.S. Department of Labor, 1991). To support this contention, some reports focused on student behaviors and attainments, such as high school dropout rates and/or graduation rates (e.g., Greene & Forster, 2003; Horn, 1998). Some focused on anal- ogous later events, such as the rates at which college students require remediation or fail to complete their programs of study (Parsad & Lewis, 2003; Rosenbaum, 2004). Others focused on the skill demands of the workplace in contrast to what stu- dents have been taught in school (Parker, 2004; Pennsylvania Department of Education, 2004). Still others focused on the literacy demands of citizenship (Kirsch, Jungeblut, Jenkins, & Kolstad, 1993). Traditional indicators of college readiness have included college transcripts, student course completions, and student performance (McCormick, 1999; Sanderson, Dugoni, Rasinski, Taylor, & Carroll, 1996). However, no one has examined the text demands of high school in contrast to those of college, the workplace, the mili- 602 Copyright © 2008 Prufrock Press, P.O. Box 8813, Waco, TX 76714 Although high school graduates may be able to successfully navigate high school course content, they may not be prepared to handle the readability of texts they encounter in various postsecondary endeavors. The average readability of high school texts is lower than the average readability of citizenship, workplace, community college, university, and graduate admissions text collections.
    [Show full text]
  • ARTICLES Readability, Logodiversity, and the Effectiveness of College
    ARTICLES Readability, Logodiversity, and the Effectiveness of College Science Textbooks Rebecca S. Burton Department of Biology, Alverno College, PO Box 343922, Milwaukee, WI 53234-3922 [email protected] Abstract: Textbooks are required in most introductory college science courses, but students may not be benefitting from the textbooks as much as their instructors might hope. Word use in the textbooks may influence textbook effectiveness. I tested whether either the amount of technical vocabulary or the readability had a significant effect on students’ ability to learn general biology concepts. I provided different versions of the same reading, then tested students on the content. On the topic with the lowest overall post-reading quiz scores, students who received readings with less technical vocabulary outperformed their peers (P = 0.03). Textbooks did not appear to be an important source of learning for students in this study; fewer than half the students reported that they were reading the assigned chapters near the start of the semester, and this number declined sharply. Students had difficulty correctly answering questions immediately after reading brief selections, indicating a low level of comprehension. Changes in textbooks and teaching strategies may improve student learning and reading compliance. Key words: readability, logodiversity, textbooks INTRODUCTION these grade level scores are based on length of Unless students can learn from their textbooks, sentences and length of words. These indices are there is little purpose in requiring them. generally intended for use in evaluating materials for Overwhelmingly, introductory college science K-12 schools or for the general public. Making courses use textbooks, but questions remain about sentences shorter and simpler will result in a lower what aspects of the actual prose result in greater grade level score, but may not make college biology student learning of scientific concepts.
    [Show full text]
  • Simply Put: a Guide for Creating Easy-To-Understand Materials
    Simply Put A guide for creating easy-to-understand materials What’s in this guide? Communication that is Clear and Understandable………………................................3 Where this Guide Fits into an Overall Communication Plan…………………….......4 Make Your Message Clear ........................................... ..............................................5 Text Appearance Matters ……….............. .................................................................9 Visuals Help Tell Your Story ....................................................................................10 Layout and Design………………………………….................................................17 Consider Culture…………………………...………………………………….…….23 Translations Take Your Message Further ..................................................................25 Testing for Readability ...............................................................................................27 Appendix A - Checklist for Easy-to-Understand Print Materials...............................29 Appendix B - Resources for Communication Planning .............................................30 Appendix C - Formulas for Calculating Readability .................................................31 Appendix D - Resources ............................................................................................38 Acknowledgements………………………………………………………………….43 Strategic and Proactive Communication Branch Division of Communication Services Office of the Associate Director for Communication Centers for Disease Control
    [Show full text]
  • The Principles of Readability
    The Principles of Readability By William H. DuBay Impact Information, 126 E. 18th Street, #C204, Costa Mesa, CA 92627, (949) 631-3309 Copyright The Principles of Readability 25 August 2004 2004 William H. DuBay. All Rights Reserved. Abstract The Principles of Readability gives a short history of literacy studies in the U.S. and a short history of research in readability and the readability formulas. Readers' Comments Please send all comments and suggestions regarding this document to: William DuBay Impact Information 126 E. 18th Street, #C204 Costa Mesa, CA 92627 Phone: (949) 631-3309 Email: [email protected] Website: http://www.impact-information.com Copyright © 2004 William H. DuBay Page ii Contents Introduction ..........................................................................................................1 Guidelines For Readability..........................................................................2 The readability formulas..............................................................................2 Are the readability formulas a problem? .....................................................2 What is readability?.....................................................................................3 Content ........................................................................................................3 The Adult Literacy Studies...................................................................................4 Grading the reading skills of students..........................................................4
    [Show full text]
  • Smart Language: Readers, Readability, and the Grading of Text 25 January 2007 © 2007 William H
    Smart Language Readers, Readability, and the Grading of Text William H. DuBay Impact Information Costa Mesa, California Copyright Smart Language: Readers, Readability, and the Grading of Text 25 January 2007 © 2007 William H. DuBay. All Rights Reserved. Abstract Smart Language gives a brief introduction to the adult literacy sur‐ veys and the research on readability and the readability formulas. Readersʹ Comments Please send all comments and suggestions regarding this document to: William DuBay Email: bdubay@impact‐information.com Website: http://www.impact‐information.com ISBN: 1‐4196‐5439‐X To George R. Klare 1922—2006 Teacher, writer, scientist, and friend Contents Introduction...............................................................................................................1 Writing for the Right Audience................................................................1 Writing for the Wrong Audience .............................................................2 What Is a Reading Grade Level? ..............................................................4 What is Readability? ..................................................................................4 The Readability Formulas .........................................................................6 How This Book is Organized....................................................................7 Part 1 How People Read ........................................................................................10 Chapter 1 The Adult Literacy Surveys .........................................................12
    [Show full text]