Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: Short Papers


HLT-NAACL 2006: Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, Short Papers

Robert C. Moore, General Chair
Jeff Bilmes, Jennifer Chu-Carroll and Mark Sanderson, Program Committee Chairs

June 4-9, 2006, New York, New York, USA

Published by the Association for Computational Linguistics, http://www.aclweb.org
Production and manufacturing by Omnipress Inc., 2600 Anderson Street, Madison, WI 53704
© 2006 The Association for Computational Linguistics

Order copies of this and other ACL proceedings from:
Association for Computational Linguistics (ACL), 209 N. Eighth Street, Stroudsburg, PA 18360 USA
Tel: +1-570-476-8006, Fax: +1-570-476-0860, [email protected]

Table of Contents

• Factored Neural Language Models (Andrei Alexandrescu and Katrin Kirchhoff)
• The MILE Corpus for Less Commonly Taught Languages (Alison Alvarez, Lori Levin, Robert Frederking, Simon Fung, Donna Gates and Jeff Good)
• Museli: A Multi-Source Evidence Integration Approach to Topic Segmentation of Spontaneous Dialogue (Jaime Arguello and Carolyn Rosé)
• Measuring Semantic Relatedness Using People and WordNet (Beata Beigman Klebanov)
• Thai Grapheme-Based Speech Recognition (Paisarn Charoenpornsawat, Sanjika Hewavitharana and Tanja Schultz)
• Class Model Adaptation for Speech Summarisation (Pierre Chatain, Edward Whittaker, Joanna Mrozinski and Sadaoki Furui)
• Semi-supervised Relation Extraction with Label Propagation (Jinxiu Chen, Donghong Ji, Chew Lim Tan and Zhengyu Niu)
• Temporal Classification of Text and Automatic Document Dating (Angelo Dalli)
• Answering the question you wish they had asked: The impact of paraphrasing for Question Answering (Pablo Duboue and Jennifer Chu-Carroll)
• Gesture Improves Coreference Resolution (Jacob Eisenstein and Randall Davis)
• Spectral Clustering for Example Based Machine Translation (Rashmi Gangadharaiah, Ralf Brown and Jaime Carbonell)
• A Finite-State Model of Georgian Verbal Morphology (Olga Gurevich)
• Arabic Preprocessing Schemes for Statistical Machine Translation (Nizar Habash and Fatiha Sadat)
• Agreement/Disagreement Classification: Exploiting Unlabeled Data using Contrast Classifiers (Sangyun Hahn, Richard Ladner and Mari Ostendorf)
• OntoNotes: The 90% Solution (Eduard Hovy, Mitchell Marcus, Martha Palmer, Lance Ramshaw and Ralph Weischedel)
• Investigating Cross-Language Speech Retrieval for a Spontaneous Conversational Speech Collection (Diana Inkpen, Muath Alzghool, Gareth Jones and Douglas Oard)
• Evaluating Centering for Sentence Ordering in Two New Domains (Nikiforos Karamanis)
• MMR-based Active Machine Learning for Bio Named Entity Recognition (Seokhwan Kim, Yu Song, Kyungduk Kim, Jeong-Won Cha and Gary Geunbae Lee)
• Early Deletion of Fillers In Processing Conversational Speech (Matthew Lease and Mark Johnson)
• Evaluation of Utility of LSA for Word Sense Discrimination (Esther Levin, Mehrbod Sharifi and Jerry Ball)
• Initial Study on Automatic Identification of Speaker Role in Broadcast News Speech (Yang Liu)
• Automatic Recognition of Personality in Conversation (François Mairesse and Marilyn Walker)
• Summarizing Speech Without Text Using Hidden Markov Models (Sameer Maskey and Julia Hirschberg)
• NER Systems that Suit User's Preferences: Adjusting the Recall-Precision Trade-off for Entity Extraction (Einat Minkov, Richard Wang, Anthony Tomasic and William Cohen)
• Syntactic Kernels for Natural Language Learning: the Semantic Role Labeling Case (Alessandro Moschitti)
• Accurate Parsing of the Proposition Bank (Gabriele Musillo and Paola Merlo)
• Using Semantic Authoring for Blissymbols Communication Boards (Yael Netzer and Michael Elhadad)
• Extracting Salient Keywords from Instructional Videos Using Joint Text, Audio and Visual Cues (Youngja Park and Ying Li)
• Exploiting Variant Corpora for Machine Translation (Michael Paul and Eiichiro Sumita)
• Quantitative Methods for Classifying Writing Systems (Gerald Penn and Travis Choma)
• Computational Modelling of Structural Priming in Dialogue (David Reitter, Frank Keller and Johanna D. Moore)
• Story Segmentation of Broadcast News in English, Mandarin and Arabic (Andrew Rosenberg and Julia Hirschberg)
• Parser Combination by Reparsing (Kenji Sagae and Alon Lavie)
• Using Phrasal Patterns to Identify Discourse Relations (Manami Saito, Kazuhide Yamamoto and Satoshi Sekine)
• Weblog Classification for Fast Splog Filtering: A URL Language Model Segmentation Approach (Franco Salvetti and Nicolas Nicolov)
• Word Domain Disambiguation via Word Sense Disambiguation (Antonio Sanfilippo, Stephen Tratz and Michelle Gregory)
• Selecting relevant text subsets from web-data for building topic specific language models (Abhinav Sethy, Panayiotis Georgiou and Shrikanth Narayanan)
• A Comparison of Tagging Strategies for Statistical Information Extraction (Christian Siefkes)
• Unsupervised Induction of Modern Standard Arabic Verb Classes (Neal Snider and Mona Diab)
• Sentence Planning for Realtime Navigational Instruction (Laura Stoia, Donna Byron, Darla Shockley and Eric Fosler-Lussier)
• Using the Web to Disambiguate Acronyms (Eiichiro Sumita and Fumiaki Sugaya)
• Word Pronunciation Disambiguation using the Web (Eiichiro Sumita and Fumiaki Sugaya)
• Illuminating Trouble Tickets with Sublanguage Theory (Svetlana Symonenko, Steven Rowe and Elizabeth D. Liddy)
• Evolving optimal inspectable strategies for spoken dialogue systems (Dave Toney, Johanna Moore and Oliver Lemon)
• Lycos Retriever: An Information Fusion Engine (Brian Ulicny)
• Improved Affinity Graph Based Multi-Document Summarization (Xiaojun Wan and Jianwu Yang)
• A Maximum Entropy Framework that Integrates Word Dependencies and Grammatical Relations for Reading Comprehension (Kui Xu, Helen Meng and Fuliang Weng)
• BioEx: A Novel User-Interface that Accesses Images from Abstract Sentences (Hong Yu and Minsuk Lee)
• Subword-based Tagging by Conditional Random Fields for Chinese Word Segmentation (Ruiqiang Zhang, Genichiro Kikui and Eiichiro Sumita)
• Comparing the roles of textual, acoustic and spoken-language features on spontaneous-conversation summarization (Xiaodan Zhu and Gerald Penn)
• Bridging the Inflection Morphology Gap for Arabic Statistical Machine Translation (Andreas Zollmann, Ashish Venugopal and Stephan Vogel)

Factored Neural Language Models
Andrei Alexandrescu (Department of Comp. Sci. Eng., University of Washington, [email protected]) and Katrin Kirchhoff (Department of Electrical Engineering, University of Washington, [email protected])

Abstract: We present a new type of neural probabilistic language model that learns a mapping from both words and explicit word features into a continuous space that is then used for word prediction. Additionally, we investigate several ways of deriving continuous word representations for unknown words from those of known words. The resulting model significantly reduces perplexity on sparse-data tasks when compared to standard backoff models, standard neural language models, and factored language models.

1 Introduction

Neural language models (NLMs) (Bengio et al., 2000) map words into a continuous representation space and then predict the probability of a word. Though this may be sufficient for applications that use a closed vocabulary, the current trend of porting systems to a wider range of languages (esp. highly-inflected languages such as Arabic) calls for dynamic dictionary expansion and the capability of assigning probabilities to newly added words without having seen them in the training data. Here, we introduce a novel type of NLM that improves generalization by using vectors of word features (stems, affixes, etc.) as input, and we investigate deriving continuous representations for unknown words from those of known words.

2 Neural Language Models

[Figure 1: NLM architecture. Input words w_{n-2} and w_{n-1} are mapped by the embedding matrix M (|V| rows, d columns; V = vocabulary, d = continuous space size), connected through weights W_ih to the hidden layer h, and through W_ho to the output P(w_t | w_{t-1}, w_{t-2}). Caption truncated: "Figure 1: NLM architecture. Each word ..."]
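The excerpt stops at the architecture figure, but the model it describes is easy to sketch: each context word (and, in the factored variant, each of its features such as a stem or affix) is looked up in an embedding table, the concatenated embeddings feed a hidden layer, and a softmax over the vocabulary gives the next-word distribution. Below is a minimal NumPy sketch; all sizes, weights, and names are invented for illustration, and the paper's actual architecture and training procedure are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

V, F, d, h = 100, 20, 16, 32  # vocab size, feature-vocab size, embedding dim, hidden dim

# Embedding tables: one for words, one for word features (stems, affixes, ...).
M_word = rng.normal(scale=0.1, size=(V, d))
M_feat = rng.normal(scale=0.1, size=(F, d))

# Hidden and output layers (bias terms omitted for brevity).
W_ih = rng.normal(scale=0.1, size=(2 * 2 * d, h))  # 2 context words x (word + feature) embedding
W_ho = rng.normal(scale=0.1, size=(h, V))

def softmax(z):
    z = z - z.max()          # numerical stability
    e = np.exp(z)
    return e / e.sum()

def next_word_probs(context):
    """context: list of (word_id, feature_id) pairs for w_{t-2}, w_{t-1}."""
    x = np.concatenate([np.concatenate([M_word[w], M_feat[f]]) for w, f in context])
    hidden = np.tanh(x @ W_ih)
    return softmax(hidden @ W_ho)

p = next_word_probs([(3, 5), (17, 2)])
print(p.shape, p.sum())  # a proper distribution over the vocabulary
```

Because unknown words can still have known stems or affixes, the feature table is what lets such a model assign a non-degenerate representation to words never seen in training.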
Recommended publications
  • Proactive Learning for Building Machine Translation Systems for Minority Languages
    Proactive Learning for Building Machine Translation Systems for Minority Languages
    Vamshi Ambati ([email protected]) and Jaime Carbonell ([email protected])
    Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA

    Abstract: Building machine translation (MT) for many minority languages in the world is a serious challenge. For many minor languages there is little machine readable text, few knowledgeable linguists, and little money available for MT development. For these reasons, it becomes very important for an MT system to make best use of its resources, both labeled and unlabeled, in building a quality system. In this paper we argue that the traditional active learning setup may not be the right fit for seeking the annotations required for building a Syntax Based MT system for minority languages. We posit that a relatively new variant of active learning, Proactive Learning, is more suitable for this task.

    Syntax-based MT systems require lexicons that provide coverage for the target translations, and synchronous grammar rules that define the divergences in word order across the language pair. In the case of minority languages one can only expect to find a meagre amount of such data, if any. Building such resources effectively, within a constrained budget, and deploying an MT system is the need of the day. We first consider Active Learning (AL) as a framework for building annotated data for the task of MT. However, AL relies on unrealistic assumptions related to the annotation tasks. For instance, AL assumes there is a unique omniscient oracle. In MT, it is possible and more general to have multiple sources of information with differing reliabilities or ...
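The core active-learning idea the abstract builds on, prioritizing the most informative unlabeled data for annotation, can be illustrated with a generic pool-based selection loop. The sketch below scores candidates by plain prediction entropy over a toy model; the model, sentences, and scores are invented, and this is not the paper's proposed proactive-learning strategy.

```python
import math

def entropy(probs):
    """Shannon entropy of a discrete distribution (higher = more uncertain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_batch(pool, predict, k=2):
    """Pick the k unlabeled items the current model is least certain about."""
    scored = [(entropy(predict(x)), x) for x in pool]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [x for _, x in scored[:k]]

# Toy pool: sentence -> hypothetical class distribution from the current model.
fake_model = {
    "sentence A": [0.95, 0.05],
    "sentence B": [0.50, 0.50],
    "sentence C": [0.70, 0.30],
}
batch = select_batch(fake_model, lambda s: fake_model[s], k=2)
print(batch)  # ['sentence B', 'sentence C'] -- the two most uncertain sentences
```

The paper's point is that this single-oracle loop is too simple for minority-language MT, where annotators differ in cost and reliability; proactive learning generalizes the selection criterion accordingly.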
  • Access to Blissymbolics in ICT
    Access to Blissymbolics in ICT: state-of-the-art and visions
    Mats Lundälv, DART, Sahlgrenska University Hospital, and BCI, Göteborg, Sweden
    Presented by Stephen van Tetzchner, ISAAC Research Symposium, Pittsburgh, USA, 2012

    Introduction: Blissymbolics is, unlike most other graphical AAC systems, well suited for use on all technology levels, from no-tech, over low-tech, to current hi-tech ICT platforms. Blissymbols may be drawn by hand, in the sand, with paper and pen, or on a blackboard or whiteboard. In current practice, however, as for all other symbol resources, we are heavily dependent on ICT support for practically managing and using Blissymbolics, whether for setting up and printing symbol charts for low-tech use, or for writing documents or communicating remotely via email etc. on computers and mobile devices. This presentation is an attempt to give an overview of the availability of Blissymbolics on current ICT platforms in 2012, and some hints about what we can expect in the near future.

    Background: Blissymbolics was probably the first graphical AAC system to be supported on the emerging computer platforms in the early 1980s; Talking BlissApple, developed for Apple II computers by Gregg Vanderheiden et al. at the Trace R&D Center, Madison, Wisconsin, USA, was one of the first pieces of AAC software. It was a breakthrough and was followed by a large number of Blissymbol AAC programs for the different computing platforms over the next decades. However, when the real expansion of AAC technology came, with a number of commercially promoted pictorial systems such as PCS, Blissymbolics largely dropped out of support on the dominant AAC software and hardware platforms.
  • Jaime Carbonell: Massively Multilingual Language Technologies
    Jaime Carbonell, Carnegie Mellon University
    Massively Multilingual Language Technologies

    The so-called "digital divide" addressed the challenge of poorer people not having access to the internet and all the myriad resources that come from being digitally connected. With the advent and spread of affordable smart phones, the present challenge is the "linguistic divide", where speakers of rare languages, often in poorer nations or regions, lack access to the vast content of information contained only in major languages. We, CMU and InterAct, have long sought solutions, via machine translation and via methods that extend to rare languages with minimal on-line data, well beyond the major languages with economic clout that the major search engines address. The presentation outlines some of the challenges and methods to bridge the linguistic divide.

    Dr. Jaime Carbonell is the Director of the Language Technologies Institute and Allen Newell Professor of Computer Science at Carnegie Mellon University. He received SB degrees in Physics and Mathematics from MIT, and MS and PhD degrees in Computer Science from Yale University. His current research includes machine learning, data/text mining, machine translation, rare-language analysis and computational proteomics. He invented Proactive Machine Learning, including its underlying decision-theoretic framework. He is also known for the Maximal Marginal Relevance principle in information retrieval, for derivational analogy in problem solving, for example-based machine translation, and for machine learning in structural biology and protein interaction networks. Overall, he has published some 350 papers and books and supervised some 60 PhD dissertations.
  • Application of Blissymbolics to the Non-Vocal Communicatively Handicapped
    University of Montana, ScholarWorks at University of Montana
    Graduate Student Theses, Dissertations, & Professional Papers, Graduate School, 1977

    Application of Blissymbolics to the non-vocal communicatively handicapped
    Karen Lynn Jones, The University of Montana

    Follow this and additional works at: https://scholarworks.umt.edu/etd
    Let us know how access to this document benefits you.

    Recommended Citation: Jones, Karen Lynn, "Application of Blissymbolics to the non-vocal communicatively handicapped" (1977). Graduate Student Theses, Dissertations, & Professional Papers. 1576. https://scholarworks.umt.edu/etd/1576

    This Thesis is brought to you for free and open access by the Graduate School at ScholarWorks at University of Montana. It has been accepted for inclusion in Graduate Student Theses, Dissertations, & Professional Papers by an authorized administrator of ScholarWorks at University of Montana. For more information, please contact [email protected].

    THE APPLICATION OF BLISSYMBOLICS TO THE NON-VOCAL COMMUNICATIVELY HANDICAPPED
    by Karen L. Jones, B.A., University of Montana, 1974
    Presented in partial fulfillment of the requirements for the degree of Master of Communication Sciences and Disorders, University of Montana, 1977
    Approved by: Chairman, Board of Examiners; Dean, Graduate School; Date

    UMI Number: EP34649. All rights reserved. INFORMATION TO ALL USERS: The quality of this reproduction is dependent on the quality of the copy submitted. In the unlikely event that the author did not send a complete manuscript and there are missing pages, these will be noted. Also, if material had to be removed, a note will indicate the deletion. UMI Dissertation Publishing, UMI EP34649, Copyright 2012 by ProQuest LLC. All rights reserved.
  • The Fundamental Rules of Blissymbolics: Creating New Blissymbolics Characters and Vocabulary
    The fundamental rules of Blissymbolics: creating new Blissymbolics characters and vocabulary
    Blissymbolics Communication International (BCI), 2009-01-26

    Contents: 1 Introduction; 2 Blissymbolics; 3 Definitions; 4 Graphic aspects of the language; 5 Bliss-characters; 6 Bliss-words; 7 Indicators; 8 Wordbuilding strategies for vocabulary extension; 9 The Blissymbolics Development Process; 10 Bibliography; 11 The history of Blissymbol standardization; 12 Figure 1: Summary of the Blissymbolics development process; 13 Figure 2: Flow chart of the Blissymbolics development process

    1.0 Introduction. This document describes the basic structure of the Blissymbolics language, and outlines both the rules that must be followed when creating new BCI Authorized Vocabulary and the procedures used for adopting that vocabulary. This reference document will guide anyone wishing to use the Blissymbolics language. Its purpose is to ensure consistency and maintain the integrity of Blissymbolics as an international language. The formal process for the development of Blissymbolics is outlined in clause 9.

    NOTE: A number of technical notes appear throughout the document in smaller type. These notes refer to elements which are technical in nature, such as providing specific advice for font implementations (clause 4.3.6) or the need to keep the creation of new Bliss-characters to a minimum (clause 8.9). Many users of this document will not need to take these notes into account for purposes of teaching, but they are nonetheless important for vocabulary development work and do form a part of the official guidelines.

    1.1 Target users.
  • Information Extraction Lecture 5 – Named Entity Recognition III
    Information Extraction, Lecture 5: Named Entity Recognition III
    CIS, LMU München, Winter Semester 2016-2017
    Dr. Alexander Fraser, CIS

    Administrivia (Seminar): There is now a LaTeX template for the Hausarbeit on the Seminar web page. Please don't forget to send me your presentation (as a PDF) after giving it. And, as you know, the Hausarbeit is due 3 weeks after your presentation!

    Outline: IE end-to-end; introduction: named entity detection as a classification problem.

    CMU Seminars task: given an email about a seminar, annotate the speaker, start time, end time, and location.

    CMU Seminars example:
    <[email protected] (Jaime Carbonell).0>
    Type: cmu.cs.proj.mt
    Topic: <speaker>Nagao</speaker> Talk
    Dates: 26-Apr-93
    Time: <stime>10:00</stime> - <etime>11:00 AM</etime>
    PostedBy: jgc+ on 24-Apr-93 at 20:59 from NL.CS.CMU.EDU (Jaime Carbonell)
    Abstract: <paragraph><sentence>This Monday, 4/26, <speaker>Prof. Makoto Nagao</speaker> will give a seminar in the <location>CMT red conference room</location> <stime>10</stime>-<etime>11am</etime> on recent MT research results</sentence>.</paragraph>

    IE template:
    Speaker: Prof. Makoto Nagao
    Start time: 1993-04-26 10:00
    End time: 1993-04-26 11:00
    Location: CMT red conference room
    Message Identifier (Filename): [email protected]. EDU (Jaime Carbonell).0

    The template contains the *canonical* version of the information. There are several "mentions" of speaker, start time and end time (see previous slide), but only one value for each slot. Location could probably also ...
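The inline SGML-style annotations in the example above can be read back into per-slot mention lists with a few lines of regex. Note that this only collects mentions; producing the template still requires canonicalization (e.g. "10" to "1993-04-26 10:00"), which is a separate step. A small sketch:

```python
import re

def extract_mentions(text, slots=("speaker", "stime", "etime", "location")):
    """Collect every tagged mention per slot from inline SGML-style annotation."""
    mentions = {}
    for slot in slots:
        # Non-greedy match so adjacent tags of the same slot don't merge.
        mentions[slot] = re.findall(rf"<{slot}>(.*?)</{slot}>", text)
    return mentions

annotated = ("<paragraph><sentence>This Monday, 4/26, <speaker>Prof. Makoto Nagao"
             "</speaker> will give a seminar in the <location>CMT red conference room"
             "</location> <stime>10</stime>-<etime>11am</etime> on recent MT research "
             "results</sentence>.</paragraph>")

print(extract_mentions(annotated))
```

Running this on the abstract sentence yields one mention per slot; on the full message it would also pick up the second stime/etime mentions from the header lines, which is exactly the many-mentions-one-value situation the slide describes.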
  • ONIX for Books Codelists Issue 40
    ONIX for Books Codelists Issue 40
    23 January 2018
    DOI: 10.4400/akjh

    All ONIX standards and documentation, including this document, are copyright materials, made available free of charge for general use. A full license agreement (DOI: 10.4400/nwgj) that governs their use is available on the EDItEUR website.

    All ONIX users should note that this is the fourth issue of the ONIX codelists that does not include support for codelists used only with ONIX version 2.1. Of course, ONIX 2.1 remains fully usable, using Issue 36 of the codelists or earlier. Issue 36 continues to be available via the archive section of the EDItEUR website (http://www.editeur.org/15/Archived-Previous-Releases). These codelists are also available within a multilingual online browser at https://ns.editeur.org/onix. Codelists are revised quarterly.

    Layout of codelists
    This document contains ONIX for Books codelists Issue 40, intended primarily for use with ONIX 3.0. The codelists are arranged in a single table for reference and printing. They may also be used as controlled vocabularies, independent of ONIX. This document does not differentiate explicitly between codelists for ONIX 3.0 and those that are used with earlier releases, but lists used only with earlier releases have been removed. For details of which code list to use with which data element in each version of ONIX, please consult the main Specification for the appropriate release. Occasionally, a handful of codes within a particular list are defined as either deprecated, or not valid for use in a particular version of ONIX or with a particular data element.
  • Of ISO/IEC 10646 and Unicode
    ISO/IEC JTC1/SC2/WG2 N2114
    Title: Graphic representation of the Roadmap to the SMP, Plane 1 of the UCS
    Source: Ad hoc group on Roadmap
    Status: Expert contribution
    Date: 1999-09-15
    Action: For confirmation by ISO/IEC JTC1/SC2/WG2
    Replaces: N2046

    The following tables comprise a real-size map of Plane 1, the SMP (Secondary Multilingual Plane) of the UCS (Universal Character Set). To print the HTML document it may be necessary to set the print percentage to 90%, as the tables are wider than A4 or US Letter paper. The tables are formatted to use the Times font. The following conventions are used in the table to help the user identify the status of each block (colours can be seen in the online version of this document, http://www.dkuug.dk/jtc1/sc2/wg2/docs/n2114.pdf):

    Bold text indicates an allocated (i.e. published) character collection (none as yet in Plane 1).
    (Bold text between parentheses) indicates scripts which have been accepted for processing toward inclusion in the standard.
    (Text between parentheses) indicates scripts for which proposals have been submitted to WG2 or the UTC.
    ¿Text between question marks? indicates scripts for which detailed proposals have not yet been written.
    ??? in a block indicates that no suggestion has been made regarding the block allocation.

    NOTE: With regard to the revision practice employed in this document, when scripts are actually proposed to WG2 or to the UTC, the practice is to "front" them in the zones to which they are tentatively allocated, and to adjust the block size with regard to the allocation proposed.
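Since UCS planes are fixed ranges of 0x10000 code points each, checking whether a character falls in Plane 1 (the SMP this roadmap maps out) is simple integer arithmetic. A small sketch; the example characters are later allocations used purely for illustration, not part of the 1999 roadmap itself:

```python
def ucs_plane(codepoint):
    """Return the UCS plane number; each plane spans 0x10000 code points.
    Plane 0 is the BMP, Plane 1 the SMP."""
    if not 0 <= codepoint <= 0x10FFFF:
        raise ValueError("outside the UCS code space")
    return codepoint // 0x10000

# U+1D11E MUSICAL SYMBOL G CLEF lives in the SMP (Plane 1);
# U+0041 LATIN CAPITAL LETTER A is in the BMP (Plane 0).
print(ucs_plane(0x1D11E), ucs_plane(0x0041))  # 1 0
```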
  • Probabilistic Approaches for Answer Selection in Multilingual Question Answering
    Probabilistic Approaches for Answer Selection in Multilingual Question Answering
    Jeongwoo Ko, Aug 27 2007
    Language Technologies Institute, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213
    Thesis Committee: Eric Nyberg (Chair), Teruko Mitamura, Jaime Carbonell, Luo Si (Purdue University)
    Copyright © 2007 Jeongwoo Ko. This work was supported in part by ARDA/DTO Advanced Question Answering for Intelligence (AQUAINT) program award number NBCHC040164 and used the NTCIR 5-6 corpus.
    Keywords: answer ranking, answer selection, probabilistic framework, graphical model, multilingual question answering
    To my family for love and support.

    Abstract: Question answering (QA) aims at finding exact answers to a user's natural language question from a large collection of documents. Most QA systems combine information retrieval with extraction techniques to identify a set of likely candidates and then utilize some selection strategy to generate the final answers. This selection process can be very challenging, as it often entails ranking the relevant answers to the top positions. To address this challenge, many QA systems have incorporated semantic resources for answer ranking in a single language. However, there has been little research on a generalized probabilistic framework that models the correctness and correlation of answer candidates for multiple languages. In this thesis, we propose two probabilistic models for answer ranking: independent prediction and joint prediction. The independent prediction model directly estimates the probability of an individual answer candidate given the degree of answer relevance and the amount of supporting evidence provided in a set of answer candidates. The joint prediction model uses an undirected graph to estimate the joint probability of all answer candidates, from which the probability of an individual candidate is inferred.
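The independent prediction model described above estimates, for each candidate on its own, a probability of correctness from its relevance and supporting evidence. The sketch below illustrates that style of scoring with a logistic function over two invented features and hand-set weights; the thesis's actual features, training procedure, and joint-prediction graph are not reproduced here.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(candidate, weights, bias):
    """Independent-prediction-style score: P(correct) from per-candidate features."""
    z = bias + sum(weights[k] * v for k, v in candidate["features"].items())
    return sigmoid(z)

# Hypothetical features: retrieval relevance, and support (evidence from the
# rest of the candidate set, e.g. redundancy). Weights are invented, not learned.
weights = {"relevance": 2.0, "support": 1.5}
candidates = [
    {"answer": "Paris", "features": {"relevance": 0.9, "support": 0.8}},
    {"answer": "Lyon",  "features": {"relevance": 0.4, "support": 0.1}},
]
ranked = sorted(candidates, key=lambda c: score(c, weights, bias=-1.0), reverse=True)
print([c["answer"] for c in ranked])  # ['Paris', 'Lyon']
```

Joint prediction differs in that candidate scores are not computed independently: correlations between candidates (e.g. two surface forms of the same answer) enter through the undirected graphical model.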
  • Active Learning
    AI, Machine Learning and Language Technologies: What's Next?
    Army Mad Scientists Program, March 7, 2017
    Jaime Carbonell and colleagues, School of Computer Science, Carnegie Mellon University, www.cs.cmu.edu/~jgc

    AI touches virtually all areas of computer science. [Slide: diagram placing Artificial Intelligence among Systems, Theory, Language Technologies, Entertainment Technology, Fine Arts, Biological Sciences, Machine Learning, Human-Computer Interaction, Robotics, Humanities, Science and Engineering.]

    AI is becoming central to the world economy (Davos 2016): "The fourth Industrial Revolution" is characterized by "ubiquitous and mobile internet", "smaller and more powerful sensors", "Artificial Intelligence", and "Machine Learning". (Prof. Klaus Schwab, founder of the Davos World Economic Forum, 2016.)

    Key components of AI:
    • Automated perception: vision, sonar, lidar, haptics, ...
    • Robotic action: locomotion, manipulation, ...
    • Deep reasoning: planning, goal-oriented behavior, projection, ...
    • Language technologies: language, speech, dialog, social nets, ...
    • Machine learning: adaptation, reflection, knowledge acquisition, ... (today's main focus; my research)
    • Big data

    How big is big? Dimensions of big data analytics:
    • Large-scale: terabytes, petabytes, exabytes; billions++ of entries
    • High-complexity: trillions of potential relations among entries (graphs)
    • High-dimensional: millions of attributes per entry (but typically sparse encoding)
  • ONIX for Books Codelists Issue 47
    ONIX for Books Codelists Issue 47
    31 October 2019
    DOI: 10.4400/akjh

    All ONIX standards and documentation, including this document, are copyright materials, made available free of charge for general use. A full license agreement (DOI: 10.4400/nwgj) that governs their use is available on the EDItEUR website.

    All ONIX users should note that this issue of the ONIX codelists does not include support for codelists used only with ONIX version 2.1. ONIX 2.1 remains fully usable, using Issue 36 of the codelists or earlier, and Issue 36 continues to be available via the archive section of the EDItEUR website (https://www.editeur.org/15/Archived-Previous-Releases). These codelists are also available within a multilingual online browser at https://ns.editeur.org/onix. Codelists are revised quarterly.

    Layout of codelists
    This document contains ONIX for Books codelists Issue 47, intended primarily for use with ONIX 3.0. The codelists are arranged in a single table for reference and printing. They may also be used as controlled vocabularies, independent of ONIX. This document does not differentiate explicitly between codelists for ONIX 3.0 and those that are used with earlier releases, but lists used only with earlier releases have been removed. For details of which code list to use with which data element in each version of ONIX, please consult the main Specification for the appropriate release. Occasionally, a handful of codes within a particular list are defined as either deprecated, or not valid for use in a particular version of ONIX or with a particular data element.
  • Active Learning and Crowd-Sourcing for Machine Translation
    Active Learning and Crowd-Sourcing for Machine Translation
    Vamshi Ambati, Stephan Vogel, Jaime Carbonell
    Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA
    {vamshi,vogel,jgc}@cs.cmu.edu

    Abstract: In recent years, corpus based approaches to machine translation have become predominant, with Statistical Machine Translation (SMT) being the most actively progressing area. The success of these approaches depends on the availability of parallel corpora. In this paper we propose Active Crowd Translation (ACT), a new paradigm where active learning and crowd-sourcing come together to enable automatic translation for low-resource language pairs. Active learning aims at reducing the cost of label acquisition by prioritizing the most informative data for annotation, while crowd-sourcing reduces cost by using the power of the crowds to make up for the lack of expensive language experts. We experiment and compare our active learning strategies with strong baselines and see significant improvements in translation quality. Similarly, our experiments with crowd-sourcing on Mechanical Turk have shown that it is possible to create parallel corpora using non-experts, and that with sufficient quality assurance, a translation system trained on such a corpus approaches expert quality.

    1. Introduction
    Corpus based approaches to automatic translation like Example Based and Statistical Machine Translation systems use large amounts of parallel data created by humans to train mathematical models for automatic language translation. Active learning approaches help us identify sentences which, if translated, have the potential to provide maximal improvement to an existing system. Crowd-sourcing techniques, on the other hand, help us reach a significant number of translators at very low costs.
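One simple form of the quality assurance the abstract mentions is to collect several crowd translations per source sentence and keep the one most workers agree on. The majority-vote sketch below is illustrative only, with invented worker outputs, and is not necessarily the ACT system's actual aggregation scheme:

```python
from collections import Counter

def aggregate_translations(crowd_outputs):
    """Pick, per source sentence, the translation most crowd workers agree on."""
    chosen = {}
    for src, translations in crowd_outputs.items():
        counts = Counter(translations)
        chosen[src] = counts.most_common(1)[0][0]  # most frequent translation
    return chosen

# Hypothetical worker outputs for two source sentences.
crowd = {
    "src-1": ["hello world", "hello world", "hi world"],
    "src-2": ["good morning", "good morning"],
}
print(aggregate_translations(crowd))
```

Exact-string voting is crude for translation, where valid outputs vary; fuzzier agreement measures or worker-reliability weighting are natural refinements, which is precisely where the multiple-oracle view of active learning becomes relevant.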