Joseph Olive · Caitlin Christianson · John McCary Editors

Handbook of Natural Language Processing and Machine Translation

DARPA Global Autonomous Language Exploitation

Editors

Joseph Olive
Defense Advanced Research Projects Agency, IPTO
3701 N Fairfax Drive
Arlington, VA 22203, USA
[email protected]

Caitlin Christianson
Defense Advanced Research Projects Agency
Reston, Virginia, USA
caitlin.christianson.ctr@.mil

John McCary
Defense Advanced Research Projects Agency
Bethesda, Maryland, USA
[email protected]

ISBN 978-1-4419-7712-0
e-ISBN 978-1-4419-7713-7
DOI 10.1007/978-1-4419-7713-7
Springer New York Dordrecht Heidelberg London

Library of Congress Control Number: 2011920954

© Springer Science+Business Media, LLC 2011
All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)

Acknowledgements

First, I would like to thank the community for all of its hard work in making GALE a success. I would like to thank the technical assistants Caitlin Christianson and John McCary for all of their help with the program and this book. Special thanks to Paul Dietrich for making GALE run smoothly. We would like to thank the current DARPA management for its continued support throughout the program, especially DARPA Director Regina Dugan, Deputy DARPA Director Ken Gabriel, Information Processing Techniques Office (IPTO) Director Daniel Kaufman, and IPTO Deputy Director Mark Luettgen. We would also like to thank previous IPTO Directors and Deputy Directors Ron Brachman, Charles Holland, Barbara Yoon, and Charles Morefield for their help in launching the program, their continued support, and their encouragement to write this book. Special thanks to former DARPA Director Anthony Tether for having the vision and the faith in us to fund the program so that it had a chance to succeed. Finally, I would like to thank my wife Virginia Kayhart Olive. Without her great personal sacrifice this project would not have been possible.


Introduction

Authors: Joseph Olive, Caitlin Christianson, and John McCary

“When I use a word, it means just what I choose it to mean – neither more nor less.”
– Humpty Dumpty, in Lewis Carroll’s Through the Looking-Glass

The meaning of what Lewis Carroll’s Humpty Dumpty says is abundantly clear to him, but to others, it is virtually incomprehensible. Because the rules of his language are entirely of his own devising, he is the only one who knows what he means. Translation of one language into another poses a similar problem: knowing the usual meaning of a word is not enough. To translate, it is necessary to convey the meaning of the entire message, not just transfer words from one language to another. Because people can perform this task so adeptly, it is easy to underestimate the challenge it poses to computers. Although the computational capabilities of machines exceed those of humans in many ways, even the most advanced of today’s computers cannot match the language ability that humans acquire naturally. To translate and extract information conveyed through language, humans take advantage of a variety of cognitive abilities that no computer can currently emulate.

The Defense Advanced Research Projects Agency (DARPA), however, specializes in tackling just such challenging problems. DARPA researchers have attacked the problem of machine translation as part of the Global Autonomous Language Exploitation (GALE) Program. Like other programs at DARPA, GALE was initiated to fill a need for the Defense Department. In the case of GALE, this need is to close the language gap – to make relevant information, regardless of source language, accessible to English-speaking personnel. The program’s goal is to create technology to automatically translate, analyze, and distill the information that forms the overwhelming tidal wave of foreign language material facing Defense Department personnel. The value of GALE technology lies in creating the ability not only to translate, but also to identify relevant information, separate it from what is not relevant, and present a final product in a form that is both understandable and usable. While there is no existing parallel for such a capability, there have certainly been fictional precedents for the idea behind GALE – the universal translator, capable of translating between English and thousands of other languages, in the form of a compact silver device worn on the chest of every Star Trek crew member; the HAL 9000 computer, capable of reasoning and defending itself in 2001, Stanley Kubrick’s classic futuristic film; and Star Wars’ robot C-3PO, who speaks nearly every known language in the universe.

For the purpose of research and development of automated translation and human language processing capability in the GALE Program, language has been classified into two input modes – speech and electronic text. A third important language input mode, hardcopy document text, is the subject of another DARPA program. While research related to each of GALE’s two input modes focuses on producing a correct translation and extracting information, each mode presents distinct problems that require different research paths. Unlike text, which generally has orthographic separation, speech signals are continuous, lacking word and phrase boundary markers. In speech, even sentence boundaries are difficult to determine, and the confusability of many phonemes adds to the uncertainty. Because these difficulties are alleviated when speech is transcribed into a more explicit orthographic form than those human writers use, machine translation researchers previously attempted to have computers first transcribe speech into text and then translate that text. In the GALE Program, researchers have begun to combine the processes of transcription and translation, enabling information about possible translations to assist transcription, and information about possible transcriptions and transcription ambiguities to assist translation. An important benefit of this method is that the interaction between transcription and translation reduces the propagation of errors made early in the process by providing opportunities for correction. This interactive technique has yielded significant improvements in accuracy for both transcription and translation of speech. In this way, GALE researchers have achieved revolutionary progress by consistently and effectively blending previously distinct speech and text technologies.

Text input can also be problematic. In many languages, word boundaries are clear in writing, but because other language elements, such as prosody, have no orthographic representation, it can still be difficult to know a writer’s intent. Scripts without word boundaries introduce additional uncertainty in reading. Chinese writing, for example, does not indicate word boundaries orthographically, a characteristic that can create ambiguity. Semitic scripts do indicate word boundaries, but often omit explicit vowel marking, creating ambiguity because it can be uncertain which vowels were intended. GALE researchers have undertaken extraordinary efforts to address these and other obstacles in machine translation of text.

One of the greatest challenges in planning an approach for GALE was defining precisely what tasks GALE’s natural language processing machines would be expected to achieve. Was it the ability to translate any language into English? Or was there an even higher goal of retrieving what was relevant from the input? Would achieving such a goal mean that GALE researchers would have to create technology that could extract relevant information from translated material and operate on foreign language material directly? Would GALE machines be able to perform all of these tasks well enough to enable assessment and analysis of the volume of information now available to anyone connected to the Internet or satellite television? These questions have resulted in many challenges for GALE and in refinements of the program’s fundamental aspects.


A Partial History of Human Language Technology Research at DARPA

Author: Allen Sears

During the past four decades, DARPA has sponsored a wide variety of research on human language technology – efforts that turned out to be stepping stones to GALE. DARPA entered the speech recognition field in 1971 with the launch of the five-year Speech Understanding Research (SUR) Program. Although its immediate impact was limited, SUR included pioneering work with hidden Markov models, which lie at the heart of all modern speech-to-text systems.

DARPA speech and text processing research proceeded at a relatively low level from the late 1970s through the early 1980s, then accelerated in the second half of the 1980s. On the speech side, the Spoken Language Program worked on automatic transcription of grammatically constrained read speech with a 1,000-word vocabulary, advancing from speaker-dependent to speaker-independent transcription. In the early 1990s, the program moved on to include read speech from Wall Street Journal sentences, progressing from a 5,000-word vocabulary, through a 20,000-word vocabulary, to an unlimited vocabulary. A companion program, WHISPER, made an initial foray into automatic transcription of conversational telephone speech. On the text side, the Written Language Program began working on technology to pull facts out of short, semantically constrained military reports. In the early and mid 1990s, the TIPSTER program aggressively tackled the twin challenges of detecting relevant documents and extracting information needed to fill templates, using naturally occurring English and Japanese documents as source data. And in the early 1990s, a modest DARPA machine translation initiative explored competing approaches for translating unconstrained foreign language text, laying important groundwork for future advances.

In the mid and late 1990s, DARPA’s Text, Radio, Video, Speech (TRVS) Program worked on transcribing and analyzing broadcast news, emphasizing English but including some preliminary work on Arabic and Chinese. In the late 1990s, Topic Detection and Tracking (TDT) attacked the problem of finding and following events discussed in news reports. In the early 2000s, Automatic Content Extraction (ACE) made a fresh assault on the challenge of discovering and characterizing entities, relations, and events described in newswire and automatically transcribed broadcast news.

Two major programs launched in the early 2000s were particularly significant. Effective, Affordable, Reusable Speech-to-Text (EARS) attacked challenges posed by broadcast news and telephone conversations in English, Chinese, and Arabic. In addition to improving the speed and accuracy of transcription, EARS worked on automatic metadata extraction to make transcripts more readable by adding structure and removing disfluencies. Translingual Information Detection, Extraction, and Summarization (TIDES) worked toward enabling English speakers to find and interpret needed information quickly and effectively regardless of language or medium. TIDES dealt with input from a variety of sources, including newswire and automatically transcribed broadcast news in English, Arabic, and Chinese. It also included “surprise language experiments” that showed how well and how quickly the technology could be ported to other languages.

To meet the challenges posed by real-world data, DARPA researchers developed increasingly sophisticated algorithms, moving away from symbolic approaches that relied on hand-coded rules and toward statistical approaches that learned from large quantities of sample data and were substantially language-independent. The shift from symbolic to statistical approaches occurred over a number of years. It happened first in the speech community, where a 1987 evaluation of automatic transcription algorithms put to rest the notion that good speech-to-text systems could be built from hand-coded rules. The text processing community followed suit, learning a great deal from the speech community.

Building on the advances in automatic transcription and translation achieved by EARS and TIDES, DARPA produced two multilingual news monitoring systems (eTAP and TALES) able to convert Arabic and Chinese broadcasts into English well enough for English speakers to find relevant material. Deployed to military customers (CENTCOM and PACOM), these systems were productively employed from 2004 onwards. Three and a half decades of progress had begun to produce useful technology and provided a strong foundation. It was time for a grander and more ambitious program: GALE.

The GALE Program

Authors: Joseph Olive and Caitlin Christianson

Planning GALE

The most fundamental difference between GALE and its predecessor programs has been its holistic integration of previously separate or sequential processes. In earlier language programs, each component process – speech recognition, transcription, translation, information retrieval, content extraction, and content presentation – was carried out individually. GALE takes a distinctly new approach, one by which researchers have sought to create systems able to execute these processes simultaneously. Under this rubric, speech transcription algorithms aid translation and vice versa. In addition, the processes of information retrieval, content extraction, and content presentation have been joined into an activity referred to under GALE as distillation, which has also been included in the interactive assistance framework of transcription and translation. As the chapters that follow detail, this combination of previously distinct processes has resulted in substantial technological breakthroughs.

The GALE Program focuses primarily on transcription, translation, and distillation of information in two languages: Mandarin Chinese and Arabic. These languages were chosen because of the high degree of difficulty of translation between each of them and English, their relative linguistic distance from each other and from English, the relatively high availability of data in both, and their immediate relevance to current national security applications.


The Origin of the GALE Program

Around 2004, DARPA Director Anthony Tether asked two important questions regarding human language technology programs at DARPA: who the end users were, and at what level of accuracy the technologies would become useful. In part, these questions arose from the reduction in word error rate (WER) achieved by the EARS Program, for it was not clear toward what goal this reduction was aimed. For dictation, a WER above 10 percent is not acceptable, but for a dialogue system, a much higher rate can be tolerated as long as the system enables a user to complete a specific task successfully. Initial studies were conducted to answer these questions, but the results were not satisfactory, mainly because the stimuli used were not fine-grained enough to determine what level of accuracy was sufficient. For example, translation quality testing was performed on only two intermediate levels of quality between machine-generated and human-generated translation: stimuli in which either one or two of every three sentences were machine generated and the remaining sentences were human generated.

To get better answers about the applications of language technology and the output quality those applications require, it was necessary to design a new study. With the help of retired Air Force Colonel Jose Negron, DARPA contacted Colonel Rafael Sanchez-Carrasquillo, head of a group of language analysts at the Defense Intelligence Agency, to ask whether, based on his experience, he could answer DARPA’s question about the level of accuracy at which a translation becomes useful for various analysis tasks. Colonel Sanchez-Carrasquillo agreed to assist in carrying out a study to determine the answer, so the next task was to create a set of translations at accuracy levels between baseline machine translation and human translation, with gradations fine enough to determine the level of accuracy necessary for various tasks. With assistance from Kevin Knight and Salim Roukos, two representatives of the machine translation community, the following process was created. First, machine translation output was corrected by a human so that the edited translation reflected the meaning of the original document. Dividing the number of edits by the number of words in the translated passage gave an error rate of 45 percent, and hence an accuracy value of 55 percent, for both Arabic and Spanish. Randomly selected errors were then removed 5 percent at a time to create ten translations, ranging from 55 percent to 100 percent accurate relative to the human-translated standard.

These translations of varying accuracy were then presented to analysts from the intelligence community and Defense Department, who were asked to determine what quality of translation would be appropriate for gisting, triage, editing, or use without any alteration. While there was no overwhelming consensus among the analysts as to the level of accuracy at which a translation becomes useful, there was a sense that an accuracy level between 75 percent and 80 percent was required to gain a basic understanding of the meaning of a passage. For a translation to be truly useful, however, the intelligence analysts chose the 90 percent mark. They stated that with translation at or above a 90 percent accuracy level, they would choose to work with the existing translation, making edits to improve its quality, rather than starting from scratch. The 90 percent level was therefore determined to be the standard for a translation to be deemed edit-worthy, and it became an important programmatic target for GALE. It is important to note, however, that translations created by human professionals often do not meet this mark without multiple stages of revision.
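In code-sketch terms, the construction of the graded stimuli might look as follows. The sketch is purely illustrative and makes simplifications the study itself did not necessarily make: every edit is treated as a one-word substitution, and the function name and data layout are hypothetical.

    import random

    # Hypothetical sketch of the graded-stimuli procedure described above.
    # `edits` maps token positions in the machine output to their human
    # corrections; applying all of them yields the fully corrected text.
    def graded_versions(mt_tokens, edits, step=0.05, seed=0):
        rng = random.Random(seed)
        pending = list(edits.items())      # (position, corrected word) pairs
        rng.shuffle(pending)               # errors are removed at random
        per_step = max(1, round(step * len(mt_tokens)))
        tokens = list(mt_tokens)
        versions = [list(tokens)]          # baseline machine output (e.g., 55%)
        while pending:
            for pos, fix in pending[:per_step]:
                tokens[pos] = fix          # apply one human correction
            pending = pending[per_step:]
            versions.append(list(tokens))  # one 5 percent notch more accurate
        return versions                    # ends at the fully corrected text

Nine such steps starting from 55 percent accuracy yield the ten versions used in the study.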

Evaluation under GALE

Prior to GALE, many translation programs relied on the BLEU metric (see Section 5.2.2.2), an automatic system for evaluating translation quality by counting word or word group matches between a machine-generated translation and multiple human-created translations. Because this means of evaluation was automatic rather than manual, algorithm developers could conduct numerous experiments in a relatively short period of time, which enabled great progress in machine translation. Despite comparing each machine translation to multiple human translations, however, there was no guarantee that scores generated by BLEU would correlate with preservation of the meaning of the source-language document in the translation.

Instead of using an automatic metric, it was determined that GALE machine translation systems would be evaluated on the basis of whether their output accurately conveyed the correct meaning of the source language in English, allowing for different but equivalent word choice and word order. The evaluation standard used in GALE has become known as human translation error rate (HTER) and is based on edit distance. For GALE purposes, edit distance is the number of edits an editor must make to a machine-generated translation for it to accurately reflect the meaning of a corresponding highly perfected human translation, one created through multiple translators and multiple levels of revision, i.e., a “gold standard” translation. The process for creating gold standard translations was developed in cooperation with National Virtual Translation Center Technology Director Kathleen Egan and Stephanie Strassel of the Linguistic Data Consortium.

The ultimate goal of the GALE Program has been set as achievement of 95 percent accuracy in translation of Arabic and Chinese newswire text and broadcast news speech into English text. In response to requirements gathered from potential users of GALE technologies, additional genres have also been added, such as talk shows, newsgroups, and weblogs. For these less formal genres, which pose a higher degree of difficulty to both human translators and machine translation systems than do the more formal genres, target accuracy has been set at 85 percent. At the end of GALE’s first year, the goals for that year had been achieved, but DARPA’s director revised the targets to make the task more difficult by specifying that future target accuracy levels would not be averages, but minimums that had to be met for a certain percentage of documents in each test set. The resulting goals follow a gradually increasing scale, specifying that translation accuracy of 95 percent must be achieved for 95 percent of documents in the relevant test set for newswire, with slightly lower targets for the other, more difficult genres (see Section 5.4.4.8.1 for all GALE targets).

To accomplish these ambitious targets for speech input, it was proposed that translation and transcription technologies, which had previously been developed separately, be combined so that errors in transcription would not necessarily be irrecoverable in translation. For all translation, it was also proposed that a variety of algorithms be incorporated, including algorithms relating to morphology, syntax, semantics, and topic-dependent language models. GALE researchers have also stopped relying on the unsustainable process of developing extremely large parallel corpora, for which matching sets of heavily annotated transcripts in both English and a corresponding foreign language must be obtained or created. This step has been taken because, although machine translation accuracy does increase when systems are trained with increasingly large amounts of parallel data, each degree of accuracy improvement requires an exponentially larger amount of data.
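To make the contrast between the two evaluation styles concrete, the following minimal sketch (illustrative only, not the official scoring code) shows the clipped n-gram matching at the core of BLEU – full BLEU combines n-gram orders 1 through 4 and applies a brevity penalty – and the word-level edit distance at the core of HTER, with the caveat that actual HTER also permits TER-style block shifts and relies on a human editor to choose the edits against the gold standard.

    from collections import Counter

    def ngram_precision(hyp, refs, n=1):
        # Each hypothesis n-gram is credited at most as many times as it
        # appears in any single reference translation ("clipping").
        hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        max_ref = Counter()
        for ref in refs:
            ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
            for g, c in ref_ngrams.items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in hyp_ngrams.items())
        return clipped / max(1, sum(hyp_ngrams.values()))

    def hter(hyp, gold):
        # Word-level Levenshtein distance divided by gold-reference length.
        d = [[0] * (len(gold) + 1) for _ in range(len(hyp) + 1)]
        for i in range(len(hyp) + 1):
            d[i][0] = i
        for j in range(len(gold) + 1):
            d[0][j] = j
        for i in range(1, len(hyp) + 1):
            for j in range(1, len(gold) + 1):
                cost = 0 if hyp[i - 1] == gold[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution
        return d[len(hyp)][len(gold)] / max(1, len(gold))

    # hter("the cat sat".split(), "the cat sat down".split()) -> 0.25,
    # i.e., one edit against a four-word gold standard (75 percent accuracy).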

GALE Data

The approach adopted in GALE represents a shift from the use of increasingly large amounts of data to the use of smaller amounts of richer data. Because a huge amount of newswire and broadcast news parallel data already existed at the beginning of the program, some new data has been added in these genres, but the greatest portion of the funds for data collection has been invested in targeted collection of data to address particular areas of difficulty, as well as in annotation, such as treebanks, propbanks, and careful text alignment. In addition, corpora used in GALE have been augmented with collections from other language programs, such as the Text REtrieval Conference (TREC).

GALE Distillation

In view of the ever-increasing amount of information confronting those responsible for maintaining the security of the United States, the decision was made that GALE would not just address the most immediate challenges in human language processing, transcription and translation, but would also take human language technology research a step further, toward determining how to use all the new information made accessible by automatic transcription and translation. To answer this challenge, GALE has included, as a second and integral step in the language processing paradigm, the assessment, analysis, and presentation of translation results in an easily readable and coherent format. This process consists of a combination of information retrieval, content extraction, and content presentation, collectively termed distillation.

Distillation is a concept entirely new in GALE, in which relevant information is extracted from foreign language and English input and concisely presented to the user in English. GALE distillation is not just a keyword search, and it does not involve summarization. Instead, it uses language analysis to identify information relevant to a user’s query, with the aim of extracting all available relevant information without redundancy and presenting it to the user in a functional form. Because warfighters and analysts often face time pressure, they do not have the luxury of wading through many documents that present similar information. GALE distillation therefore places particular importance on targeted searches and the elimination of redundant results; it includes a goal of combining redundant results and presenting users with a single, distilled version of what is important, accompanied by multiple citations. Depending on the intended task of a user – enabling military operations, conducting intelligence analysis, assisting policy formulation, monitoring foreign perceptions – GALE systems are required to provide a customized version of a given set of data, saving users hours, if not days or weeks, and allowing a user’s valuable energy and insight to be focused on only the most important information. Through translation in combination with distillation, GALE systems are intended to increase both the number of sources of information available, by translating previously inaccessible foreign language data, and the efficiency of system users in employing this newly available data to conduct whatever task is required.

Like the GALE translation goals, GALE distillation performance targets have been set very high, with a final target of 95 percent recall and 90 percent precision (these two measures are sketched below). To give system performance a meaningful standard of measurement, computer distillation performance is compared to that of a human given the same requirements.

The research detailed in this book shows a snapshot of the results of the first three years of groundbreaking progress under the GALE Program. As of the writing of this book, two years of GALE research remain.
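For reference, the recall and precision figures above can be read as in this minimal sketch; the item sets stand in for whatever unit of relevant information is being counted, and Section 4.6 describes how distillation was actually evaluated.

    def precision_recall(returned, relevant):
        # Precision: fraction of returned items that are relevant.
        # Recall: fraction of relevant items that were returned.
        returned, relevant = set(returned), set(relevant)
        hits = returned & relevant
        precision = len(hits) / len(returned) if returned else 0.0
        recall = len(hits) / len(relevant) if relevant else 0.0
        return precision, recall

    # A system that returns 9 relevant items among 10 returned, out of 12
    # relevant overall, scores 90 percent precision and 75 percent recall.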

Contents

1 Data Acquisition and Linguistic Resources ...... 1
   1.1 Introduction ...... 1
   1.2 Data Collection, Distribution, and Management ...... 2
   1.3 Human Annotation ...... 14
   1.4 Automatic Annotation ...... 64

2 Machine Translation from Text ...... 133
   2.1 Introduction ...... 133
   2.2 Segmentation, Tokenization and Preprocessing ...... 135
   2.3 Word Alignment ...... 164
   2.4 Translation Models ...... 183
   2.5 Language Modeling for SMT ...... 252
   2.6 Search and Complexity ...... 271
   2.7 Adaptation and Data Selection ...... 297
   2.8 System Combination ...... 324

3 Machine Translation from Speech ...... 399
   3.1 Introduction ...... 399
   3.2 Front End Features ...... 401
   3.3 Improved Speech Acoustic Models ...... 428
   3.4 Language Models ...... 460
   3.5 Language-Specific Models and Systems: Mandarin ...... 485
   3.6 Language-Specific Models and Systems: Arabic ...... 520
   3.7 Integration of Speech Recognition and Translation ...... 569

4 Distillation ...... 617
   4.1 Introduction ...... 617
   4.2 Template-Based Query Development ...... 618
   4.3 Architecture and Implementation of a Distillation System ...... 623
   4.4 Enabling Technology Breakthroughs to Improve Distillation Capabilities ...... 636
   4.5 Distillation in an Integrated GALE System ...... 690
   4.6 Evaluating Distillation Technology ...... 716

5 Machine Translation Evaluation and Optimization ...... 745
   5.1 Introduction ...... 745
   5.2 Automatic and Semi-Automatic Measures ...... 758
   5.3 Tasks and Human-in-the-Loop Measures ...... 768
   5.4 GALE Machine Translation Metrology: Definition, Implementation, and Calculation ...... 783
   5.5 Use of Evaluation for Optimization ...... 812
   5.6 Searching for Better Automatic MT Metrics ...... 818

6 Operational Engines ...... 845
   6.1 Introduction ...... 845
   6.2 Implementation of Operational Engines ...... 846
   6.3 Evaluation of Operational Engines ...... 905

Concluding Remarks ...... 933

Contributors

Abhaya Agarwal Carnegie Mellon University, Pittsburgh, PA, USA
Jaewook Ahn University of Pittsburgh, Pittsburgh, PA, USA
James Allan University of Massachusetts Amherst, Amherst, MA, USA
Abhishek Arun University of Edinburgh, Edinburgh, UK
Sabine Atwell Defense Language Institute, Monterey, CA, USA
Necip Fazil Ayan SRI International, Menlo Park, CA, USA
Olga Babko-Malaya BAE Systems, Burlington, MA, USA
Robert Belvin HRL Laboratories, Malibu, CA, USA
Oliver Bender RWTH Aachen University, Aachen, Germany
Ann Bies Linguistic Data Consortium, University of Pennsylvania, Philadelphia, PA, USA
Daniel M. Bikel IBM T. J. Watson Research Center, Yorktown Heights, NY, USA
Maximilian Bisani RWTH Aachen University, Aachen, Germany
Matthias Blume Fair Isaac Corporation, San Diego, CA, USA
Roger Bock BBN Technologies, Cambridge, MA, USA
Elizabeth Boschee BBN Technologies, Cambridge, MA, USA
Sébastien Bronsart National Institute of Standards and Technology, Gaithersburg, MD, USA
Peter Brusilovsky University of Pittsburgh, Pittsburgh, PA, USA
William Byrne Cambridge University, Cambridge, UK
Marine Carpuat Columbia University, New York, NY, USA
Christopher Caruso Linguistic Data Consortium, University of Pennsylvania, Philadelphia, PA, USA
Vittorio Castelli IBM T. J. Watson Research Center, Yorktown Heights, NY, USA
Ozgur Cetin International Computer Science Institute, Berkeley, CA, USA
Achraf Chalabi Sakhr Software, Vienna, VA, USA
Pi-Chuan Chang Stanford University, Stanford, CA, USA
Upendra V. Chaudhari IBM T. J. Watson Research Center, Yorktown Heights, NY, USA
Caitlin Christianson Defense Advanced Research Projects Agency, Arlington, VA, USA


Stephen M. Chu IBM T. J. Watson Research Center, Yorktown Heights, NY, USA
Christopher Cieri Linguistic Data Consortium, University of Pennsylvania, Philadelphia, PA, USA
Sean Colbath BBN Technologies, Cambridge, MA, USA
Steve DeNeefe Information Sciences Institute, University of Southern California, Los Angeles, CA, USA
Michael Denkowski Carnegie Mellon University, Pittsburgh, PA, USA
Thomas Deselaers RWTH Aachen University, Aachen, Germany
Mona T. Diab Columbia University, New York, NY, USA
Frank Diehl Cambridge University, Cambridge, UK
Dan Ding Defense Language Institute, Monterey, CA, USA
Denise DiPersio Linguistic Data Consortium, University of Pennsylvania, Philadelphia, PA, USA
Bonnie Dorr University of Maryland, College Park, MD, USA
Loïc Dugast SYSTRAN Software, Inc., San Diego, CA, USA
Chris Dyer University of Maryland, College Park, MD, USA
Abdessamad Echihabi Language Weaver, Los Angeles, CA, USA
Kathleen Egan Department of Defense, USA
Jason Eisner Johns Hopkins University, Baltimore, MD, USA
Ahmad Emami Johns Hopkins University, Baltimore, MD, USA
Edward A. Epstein IBM T. J. Watson Research Center, Yorktown Heights, NY, USA
Reem Faraj Columbia University, New York, NY, USA
Arlo Faria International Computer Science Institute, Berkeley, CA, USA
Benoit Favre International Computer Science Institute, Berkeley, CA, USA
David Ferrucci IBM T. J. Watson Research Center, Yorktown Heights, NY, USA
Radu Hans Florian IBM T. J. Watson Research Center, Yorktown Heights, NY, USA
George Foster National Research Institute of Canada, Saskatoon, SK, Canada
Connie Fournelle BAE Systems, Burlington, MA, USA
Petr Fousek IBM T. J. Watson Research Center, Yorktown Heights, NY, USA
Marjorie Freedman BBN Technologies, Cambridge, MA, USA
Dayne Freitag Fair Isaac Corporation, San Diego, CA, USA


Lauren Friedman Linguistic Data Consortium, University of Pennsylvania, Philadelphia, PA, USA
Mark Fuhs Carnegie Mellon University, Pittsburgh, PA, USA
Pascale Fung The Hong Kong University of Science and Technology, Hong Kong, China
Mark J.F. Gales Cambridge University, Cambridge, UK
Michel Galley Stanford University, Stanford, CA, USA
Jianfeng Gao Microsoft Corporation, Redmond, WA, USA
Qin Gao Carnegie Mellon University, Pittsburgh, PA, USA
Jean-Luc Gauvain The Computer Sciences Laboratory for Mechanics and Engineering Sciences (LIMSI), Orsay Cedex, France
Niyu Ge IBM T. J. Watson Research Center, Yorktown Heights, NY, USA
Daniel Gillick International Computer Science Institute, Berkeley, CA, USA
Adrià de Gispert Cambridge University, Cambridge, UK
Meghan Lammie Glenn Linguistic Data Consortium, University of Pennsylvania, Philadelphia, PA, USA
Christian Gollan RWTH Aachen University, Aachen, Germany
Martin Graciarena SRI International, Menlo Park, CA, USA
Jonathan Grady University of Pittsburgh, Pittsburgh, PA, USA
John Graettinger BBN Technologies, Cambridge, MA, USA
Stephen Grimes Linguistic Data Consortium, University of Pennsylvania, Philadelphia, PA, USA
Ralph Grishman New York University, New York, NY, USA
Francisco Guzman Carnegie Mellon University, Pittsburgh, PA, USA
Nizar Habash Columbia University, New York, NY, USA
Dilek Hakkani-Tür International Computer Science Institute, Berkeley, CA, USA
Greg Hanneman Carnegie Mellon University, Pittsburgh, PA, USA
Mary Harper University of Maryland, College Park, MD, USA
Saša Hasan RWTH Aachen University, Aachen, Germany
Daqing He University of Pittsburgh, Pittsburgh, PA, USA
Kenneth Heafield Carnegie Mellon University, Pittsburgh, PA, USA
Georg Heigold RWTH Aachen University, Aachen, Germany


Hynek Hermansky International Computer Science Institute, Berkeley, CA, USA
Martha Herzog Lincoln Laboratory, Massachusetts Institute of Technology, Lexington, MA, USA
Almut Silja Hildebrand Carnegie Mellon University, Pittsburgh, PA, USA
Dustin Hillard University of Washington, Seattle, WA, USA
Julia Hirschberg Columbia University, New York, NY, USA
Hieu Hoang University of Edinburgh, Edinburgh, UK
Bjorn Hoffmeister RWTH Aachen University, Aachen, Germany
Jon Holbrook Aptima, Inc., Woburn, MA, USA
Eduard Hovy Information Sciences Institute, University of Southern California, Los Angeles, CA, USA
Roger Hsiao Carnegie Mellon University, Pittsburgh, PA, USA
Fei Huang IBM T. J. Watson Research Center, Yorktown Heights, NY, USA
Zhongqiang Huang University of Maryland, College Park, MD, USA
Dan Hunter BAE Systems, Burlington, MA, USA
Mei-Yuh Hwang University of Washington, Seattle, WA, USA
Hussny Ibrahim Defense Language Institute, Monterey, CA, USA
Abraham Ittycheriah IBM T. J. Watson Research Center, Yorktown Heights, NY, USA
Heng Ji New York University, New York, NY, USA
Qin Jin Carnegie Mellon University, Pittsburgh, PA, USA
Doug Jones Lincoln Laboratory, Massachusetts Institute of Technology, Lexington, MA, USA
Jeremy Kahn University of Washington, Seattle, WA, USA
Damianos Karakos Johns Hopkins University, Baltimore, MD, USA
Shahram Khadivi RWTH Aachen University, Aachen, Germany
Sanjeev Khudanpur Johns Hopkins University, Baltimore, MD, USA
Daniel Kiecza BBN Technologies, Cambridge, MA, USA
Brian Kingsbury IBM T. J. Watson Research Center, Yorktown Heights, NY, USA
Katrin Kirchhoff University of Washington, Seattle, WA, USA
Judith L. Klavans University of Maryland, College Park, MD, USA
Kevin Knight Information Sciences Institute, University of Southern California, Los Angeles, CA, USA


Philipp Koehn University of Edinburgh, Edinburgh, UK
Gary Krug Linguistic Data Consortium, University of Pennsylvania, Philadelphia, PA, USA
Roland Kuhn National Research Institute of Canada, Saskatoon, SK, Canada
Seth Kulick Linguistic Data Consortium, University of Pennsylvania, Philadelphia, PA, USA
Hong-Kwang Kuo IBM T. J. Watson Research Center, Yorktown Heights, NY, USA
Lori Lamel The Computer Sciences Laboratory for Mechanics and Engineering Sciences (LIMSI), Orsay Cedex, France
Ian Lane Carnegie Mellon University, Pittsburgh, PA, USA
Alon Lavie Carnegie Mellon University, Pittsburgh, PA, USA
Audrey Le National Institute of Standards and Technology, Gaithersburg, MD, USA
Haejoong Lee Linguistic Data Consortium, University of Pennsylvania, Philadelphia, PA, USA
Xin Lei SRI International, Menlo Park, CA, USA
Gregor Leusch RWTH Aachen University, Aachen, Germany
Michael Levit International Computer Science Institute, Berkeley, CA, USA
Burn L. Lewis IBM T. J. Watson Research Center, Yorktown Heights, NY, USA
Xuansong Li Linguistic Data Consortium, University of Pennsylvania, Philadelphia, PA, USA
Zhifei Li Johns Hopkins University, Baltimore, MD, USA
Martha Lillie BBN Technologies, Cambridge, MA, USA
Xunying Andrew Liu Cambridge University, Cambridge, UK
Chi-kiu Lo The Hong Kong University of Science and Technology, Hong Kong, China
Tomasz Loboda University of Pittsburgh, Pittsburgh, PA, USA
Jun Luo University of Maryland, College Park, MD, USA
Xiaoqiang Luo IBM T. J. Watson Research Center, Yorktown Heights, NY, USA
Jeff Ma BBN Technologies, Cambridge, MA, USA
Weiyun Ma Columbia University, New York, NY, USA
Xiaoyi Ma Linguistic Data Consortium, University of Pennsylvania, Philadelphia, PA, USA
Mohamed Maamouri Linguistic Data Consortium, University of Pennsylvania, Philadelphia, PA, USA
Jessica MacBride BBN Technologies, Cambridge, MA, USA


Nitin Madnani University of Maryland, College Park, MD, USA
Carl Madson SRI International, Menlo Park, CA, USA
Kazuaki Maeda Linguistic Data Consortium, University of Pennsylvania, Philadelphia, PA, USA
John Makhoul BBN Technologies, Cambridge, MA, USA
Arindam Mandal SRI International, Menlo Park, CA, USA
Lidia Mangu IBM T. J. Watson Research Center, Yorktown Heights, NY, USA
Christopher D. Manning Stanford University, Stanford, CA, USA
Daniel Marcu Language Weaver, Los Angeles, CA, USA
Mitchell Marcus University of Pennsylvania, Philadelphia, PA, USA
Marie-Catherine de Marneffe Stanford University, Stanford, CA, USA
Spyros Matsoukas BBN Technologies, Cambridge, MA, USA
Evgeny Matusov RWTH Aachen University, Aachen, Germany
Arne Mauser RWTH Aachen University, Aachen, Germany
Andrea Mazzucchi Linguistic Data Consortium, University of Pennsylvania, Philadelphia, PA, USA
Abdelkhalek Messaoudi The Computer Sciences Laboratory for Mechanics and Engineering Sciences (LIMSI), Orsay Cedex, France
Scott McCarley IBM T. J. Watson Research Center, Yorktown Heights, NY, USA
John McCary Defense Advanced Research Projects Agency, Arlington, VA, USA
Kathleen McKeown Columbia University, New York, NY, USA
Calandra Tate Moore University of Maryland, College Park, MD, USA
Nelson Morgan International Computer Science Institute, Berkeley, CA, USA
Smaranda Muresan Rutgers University, New Brunswick, NJ, USA
Hazem Nader Sakhr Software, Vienna, VA, USA
Udhyakumar Nallasamy Carnegie Mellon University, Pittsburgh, PA, USA
Prem Natarajan BBN Technologies, Cambridge, MA, USA
Hermann Ney RWTH Aachen University, Aachen, Germany
Tim Ng BBN Technologies, Cambridge, MA, USA
Kham Nguyen BBN Technologies, Cambridge, MA, USA
Long Nguyen BBN Technologies, Cambridge, MA, USA
Jan Niehues Institute for Theoretical Computer Science, Zürich, Switzerland


Mohamed Noamany Carnegie Mellon University, Pittsburgh, PA, USA
Eric Nyberg Carnegie Mellon University, Pittsburgh, PA, USA
Douglas W. Oard University of Maryland, College Park, MD, USA
Joseph Olive Defense Advanced Research Projects Agency, Arlington, VA, USA
Mari Ostendorf University of Washington, Seattle, WA, USA
Sebastian Pado Stanford University, Stanford, CA, USA
Martha Palmer University of Colorado at Boulder, Boulder, CO, USA
Kishore Papineni IBM T. J. Watson Research Center, Yorktown Heights, NY, USA
Junho Park Cambridge University, Cambridge, UK
Robert Parker Linguistic Data Consortium, University of Pennsylvania, Philadelphia, PA, USA
Alok Parlikar Carnegie Mellon University, Pittsburgh, PA, USA
Kristen Parton Columbia University, New York, NY, USA
Matthias Paulik Carnegie Mellon University, Pittsburgh, PA, USA
Jason Pelecanos IBM T. J. Watson Research Center, Yorktown Heights, NY, USA
John F. Pitrelli IBM T. J. Watson Research Center, Yorktown Heights, NY, USA
Christian Plahl RWTH Aachen University, Aachen, Germany
Daniel Povey Microsoft Corporation, Redmond, WA, USA
Sameer Pradhan BBN Technologies, Cambridge, MA, USA
Mark Przybocki National Institute of Standards and Technology, Gaithersburg, MD, USA
Leiming Qian IBM T. J. Watson Research Center, Yorktown Heights, NY, USA
Yong Qin IBM China Research Lab, Beijing, China
Jerry Quinn IBM T. J. Watson Research Center, Yorktown Heights, NY, USA
Anna N. Rafferty University of California, Berkeley, CA, USA
Owen Rambow Columbia University, New York, NY, USA
Lance Ramshaw BBN Technologies, Cambridge, MA, USA
Suman Ravuri Columbia University, New York, NY, USA
Philip Resnik University of Maryland, College Park, MD, USA
Eric Riebling Carnegie Mellon University, Pittsburgh, PA, USA
Brian Roark Oregon Health & Sciences University, Portland, OR, USA


Monica Rogati Carnegie Mellon University, Pittsburgh, PA, USA
Antti-Veikko I. Rosti BBN Technologies, Cambridge, MA, USA
Ryan M. Roth Columbia University, New York, NY, USA
David Rybach RWTH Aachen University, Aachen, Germany
Salim Roukos IBM T. J. Watson Research Center, Yorktown Heights, NY, USA
Fatiha Sadat University of Quebec Montreal, Montreal, QC, Canada
Rami Safadi Sakhr Software, Vienna, VA, USA
Guruprasad Saikumar BBN Technologies, Cambridge, MA, USA
William Salter Aptima, Inc., Woburn, MA, USA
Gregory Sanders National Institute of Standards and Technology, Gaithersburg, MD, USA
George Saon IBM T. J. Watson Research Center, Yorktown Heights, NY, USA
Ralf Schlüter RWTH Aachen University, Aachen, Germany
Tanja Schultz Carnegie Mellon University, Pittsburgh, PA, USA
Richard Schwartz BBN Technologies, Cambridge, MA, USA
Holger Schwenk University of Le Mans, Le Mans, France
Allen Sears Corporation for National Research Initiatives, Reston, VA, USA
Jean Senellart SYSTRAN Software, Inc., San Diego, CA, USA
Libin Shen BBN Technologies, Cambridge, MA, USA
Wade Shen Lincoln Laboratory, Massachusetts Institute of Technology, Lexington, MA, USA
Qin Shi IBM China Research Lab, Beijing, China
Heather Simpson Linguistic Data Consortium, University of Pennsylvania, Philadelphia, PA, USA
Adish Singla International Computer Science Institute, Berkeley, CA, USA
Jason Smith Johns Hopkins University, Baltimore, MD, USA
Matthew Snover University of Maryland, College Park, MD, USA
Dagobert Soergel University of Maryland, College Park, MD, USA
Hagen Soltau IBM T. J. Watson Research Center, Yorktown Heights, NY, USA
Zhiyi Song Linguistic Data Consortium, University of Pennsylvania, Philadelphia, PA, USA
Jeffrey Sorensen IBM T. J. Watson Research Center, Yorktown Heights, NY, USA


Amit Srivastava BBN Technologies, Cambridge, MA, USA
William Staderman Defense Advanced Research Projects Agency, Arlington, VA, USA
Daniel Stein RWTH Aachen University, Aachen, Germany
Jens Stephan SYSTRAN Software, Inc., San Diego, CA, USA
Andreas Stolcke SRI International, Menlo Park, CA, USA
Stephanie Strassel Linguistic Data Consortium, University of Pennsylvania, Philadelphia, PA, USA
David Svoboda Carnegie Mellon University, Pittsburgh, PA, USA
Yik-Cheung Tam Carnegie Mellon University, Pittsburgh, PA, USA
Christoph Tillmann IBM T. J. Watson Research Center, Yorktown Heights, NY, USA
Marcus Tomalin Cambridge University, Cambridge, UK
Kristina Toutanova Microsoft Corporation, Redmond, WA, USA
Gokhan Tür SRI International, Menlo Park, CA, USA
Nicola Ueffing National Research Institute of Canada, Saskatoon, SK, Canada
Fabio Valente IDIAP Research Institute, Martigny, Switzerland
Dimitra Vergyri SRI International, Menlo Park, CA, USA
David Vilar RWTH Aachen University, Aachen, Germany
Paola Virga IBM T. J. Watson Research Center, Yorktown Heights, NY, USA
Stephan Vogel Carnegie Mellon University, Pittsburgh, PA, USA
Clare Voss University of Maryland, College Park, MD, USA
Kevin Walker Linguistic Data Consortium, University of Pennsylvania, Philadelphia, PA, USA
Lan Wang Cambridge University, Cambridge, UK
Wei Wang Language Weaver, Los Angeles, CA, USA
Wen Wang SRI International, Menlo Park, CA, USA
Zhiqiang (John) Wang Fair Isaac Corporation, San Diego, CA, USA
Todd Ward IBM T. J. Watson Research Center, Yorktown Heights, NY, USA
Ralph Weischedel BBN Technologies, Cambridge, MA, USA
James V. White BAE Systems, Burlington, MA, USA
William Wong Language Weaver, Los Angeles, CA, USA
Phillip C. Woodland Cambridge University, Cambridge, UK


Dekai Wu The Hong Kong University of Science and Technology, Hong Kong, China
Wei Wu University of Washington, Seattle, WA, USA
Zhaojun Wu The Hong Kong University of Science and Technology, Hong Kong, China
Eric P. Xing Carnegie Mellon University, Pittsburgh, PA, USA
Jia Xu RWTH Aachen University, Aachen, Germany
Jinxi Xu BBN Technologies, Cambridge, MA, USA
Nianwen Xue Brandeis University, Waltham, MA, USA
Sibel Yaman International Computer Science Institute, Berkeley, CA, USA
Jin Yang SYSTRAN Software, Inc., San Diego, CA, USA
Yiming Yang Carnegie Mellon University, Pittsburgh, PA, USA
Yongsheng Yang The Hong Kong University of Science and Technology, Hong Kong, China
Kai Yu Cambridge University, Cambridge, UK
Dalal Zakhary Linguistic Data Consortium, University of Pennsylvania, Philadelphia, PA, USA
Alex Zamanian BBN Technologies, Cambridge, MA, USA
Rabih Zbib BBN Technologies, Cambridge, MA, USA
Richard Zens RWTH Aachen University, Aachen, Germany
Bing Zhang BBN Technologies, Cambridge, MA, USA
Pengyi Zhang University of Maryland, College Park, MD, USA
Shilei Zhang IBM China Research Lab, Beijing, China
Ying Zhang Carnegie Mellon University, Pittsburgh, PA, USA
Bing Zhao IBM T. J. Watson Research Center, Yorktown Heights, NY, USA
Sherry Zhao International Computer Science Institute, Berkeley, CA, USA
Jing Zheng SRI International, Menlo Park, CA, USA
Imed Zitouni IBM T. J. Watson Research Center, Yorktown Heights, NY, USA

Contributor affiliations are as of when the work described in this book was performed.