Open Architecture for Multilingual Parallel Texts M.T
Recommended publications
-
Workflow Manual
Published on Multilingual Online Translation (http://www.molto-project.eu)
D3.3 Translation Tools – Workflow Manual
Contract No.: FP7-ICT-247914
Project full title: MOLTO - Multilingual Online Translation
Deliverable: D3.3 MOLTO translation tools – workflow manual
Security (distribution level): Public
Contractual date of delivery: M31
Actual date of delivery: April 2013
Type: Manual
Status & version: Final
Author(s): Inari Listenmaa, Jussi Rautio
Task responsible: UHEL
Other contributors: Lauri Alanko, John Camilleri, Thomas Hallgren

Abstract
Deliverable D3.3 consists of a manual and a description of the workflow of the MOLTO translation tools. The document introduces the components: the open-source translation management system Pootle; the Simple Translation Tool, which supports many different translation methods; and the Syntax Editor, which allows the user to modify text by manipulating abstract syntax trees. The document presents two translation workflows. The first scenario integrates the MOLTO tools into professional translation of fixed source text, using Pootle; MOLTO translations with GF grammars are added to the machine translation options. Another direction taken is the population of a translation memory with GF-generated data. In the second workflow, the translator is authorized to make changes to the source. The tools used in this scenario are the Simple Translation Tool and the Syntax Editor.

1. Introduction
This Deliverable D3.3 is a manual and a description of the workflow for the translation tools produced within WP3. As stated in the previous Deliverables 3.1 and 3.2, the user of the translation tools is not required to know how to write GF grammars. The users are either translators, whose job is to translate from a fixed source, or they are authorized to modify the source text so that it fits the structures covered by the domain-specific grammar(s).
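The Syntax Editor's core idea, editing text by manipulating an abstract syntax tree and re-linearizing it for each language, can be illustrated with a minimal sketch. The Node type, linearize function and toy lexicons below are invented simplifications, not the GF or MOLTO API.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    fun: str                                   # abstract function name
    args: list = field(default_factory=list)   # child subtrees

def linearize(node: Node, lexicon: dict) -> str:
    """Render an abstract tree as a concrete sentence in one language."""
    if node.fun == "Pred":                     # Pred(item, quality)
        item, quality = node.args
        det, noun = lexicon[item.fun]
        return f"{det} {noun} {lexicon['is']} {lexicon[quality.fun]}"
    raise ValueError(f"unknown function: {node.fun}")

# Editing means swapping a subtree; the surface strings are never touched.
tree = Node("Pred", [Node("wine"), Node("good")])
tree.args[1] = Node("expensive")

english = {"wine": ("the", "wine"), "is": "is", "good": "good", "expensive": "expensive"}
spanish = {"wine": ("el", "vino"), "is": "es", "good": "bueno", "expensive": "caro"}
print(linearize(tree, english))  # the wine is expensive
print(linearize(tree, spanish))  # el vino es caro
```

One edit in the abstract tree updates every language's output consistently, which is why tree manipulation suits the workflow where the translator may change the source.
-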
Champollion: a Robust Parallel Text Sentence Aligner
Xiaoyi Ma, Linguistic Data Consortium, 3600 Market St. Suite 810, Philadelphia, PA 19104, [email protected]

Abstract
This paper describes Champollion, a lexicon-based sentence aligner designed for robust alignment of potentially noisy parallel text. Champollion increases the robustness of the alignment by assigning greater weights to less frequent translated words. Experiments on a manually aligned Chinese-English parallel corpus show that Champollion achieves high precision and recall on noisy data. Champollion can be easily ported to new language pairs, and it is freely available to the public.

1. Introduction
Parallel text is a very valuable resource for a number of natural language processing tasks, including machine translation (Brown et al. 1993; Vogel and Tribble 2002; Yamada and Knight 2001), cross-language information retrieval, and word disambiguation. Parallel text provides the maximum utility when it is sentence aligned. The sentence alignment process maps sentences in the source text to their translations. The labor-intensive and time-consuming nature of manual sentence alignment makes large parallel text corpus development difficult.
Length-based approaches use a probabilistic framework to find the maximum likelihood alignment of sentences. The length-based approach works remarkably well on language pairs with high length correlation, such as French and English. Its performance degrades quickly, however, when the length correlation breaks down, as in the case of Chinese and English. Even for language pairs with high length correlation, the Gale-Church algorithm may fail in regions that contain many sentences of similar length. A number of algorithms, such as (Wu 1994), try to overcome the weaknesses of length-based approaches by utilizing lexical information.
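Champollion's central heuristic, giving greater weight to less frequent translated words, is essentially idf-style weighting over lexicon matches. The sketch below illustrates only that scoring step, with an invented toy lexicon and corpus counts; the real aligner adds sentence-length penalties and dynamic programming over alignment types.

```python
import math

# Toy bilingual lexicon: source word -> known translations (invented).
LEXICON = {"treaty": {"traité"}, "signed": {"signé"}, "the": {"le", "la"}}

# Assumed corpus statistics for idf-style weighting.
DOC_FREQ = {"treaty": 3, "signed": 40, "the": 5000}
N_SENTS = 10_000

def weight(word: str) -> float:
    # Rare words carry more evidence, mirroring Champollion's heuristic.
    return math.log(N_SENTS / DOC_FREQ.get(word, 1))

def pair_score(src_sent: list, tgt_sent: list) -> float:
    """Score a candidate sentence pair by its weighted lexical matches."""
    tgt = set(tgt_sent)
    return sum(weight(w) for w in src_sent if LEXICON.get(w, set()) & tgt)

# The rare "treaty"/"traité" match counts far more than the frequent "the"/"le".
print(pair_score(["the", "treaty", "signed"], ["le", "traité", "signé"]))
```
-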
Webinar: Translation Memory and Machine Translation
Legal Services National Technology Assistance Project (www.lsntap.org)
Presenters: Jillian Theil, Claudia Johnson, Diana Glick, Leland Sampson, Maria Mindlin and Sart Rowe

Machine Translation: The Perils of Google Translate
Typically, the Google Translate tool will give you broken translations. You might be able to tell what the translation is saying, but it will be grammatically incorrect and use wrong words. The key when creating something like a sign is to use more visuals and fewer words: arrow signs, people icons, dollar signs, etc.
- Check for existing resources: look for signs that already exist to see what they are doing.
- Consider using visuals for wayfinding (arrows, etc.).
- Conduct a plain-language review and editing pass.
- Ensure your signage is readable (font, etc.).
- Back translation means running a translation back into English to ensure it is correct (a sketch of this check follows this entry).
Case study: using Chinese as an example, the word for a computer mouse is different from the word for the animal mouse, so Google Translate can render a sentence about this incorrectly. If possible, obtain a legal review of the translation from an attorney/translator.

Is it Ever OK to Use Google Translate?
It is OK for informal communications, for general understanding, or when you are in a complete bind and have no other options.

Translation Workflow for Lingotek and People's Law Library
1. A volunteer contacts them, and they qualify that volunteer.
2. The volunteer selects an article to translate, and that article is uploaded to Lingotek.
3. The volunteer performs the actual translation, and then the article is assigned to a volunteer reviewer who is a licensed attorney.
4.
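The back-translation check described above can be automated as a rough screen before human review. The sketch below assumes a hypothetical translate(text, src, tgt) function standing in for whatever MT service is actually used; only the round-trip comparison logic is shown.

```python
from difflib import SequenceMatcher

def translate(text: str, src: str, tgt: str) -> str:
    """Hypothetical stand-in for a real machine translation service."""
    raise NotImplementedError

def back_translation_check(text: str, tgt: str, threshold: float = 0.7) -> bool:
    """Round-trip the text through the target language and compare."""
    forward = translate(text, "en", tgt)
    back = translate(forward, tgt, "en")
    score = SequenceMatcher(None, text.lower(), back.lower()).ratio()
    return score >= threshold  # below threshold: flag for human review

# Usage: back_translation_check("The computer mouse is broken.", "zh")
# may surface mistranslations like the computer-mouse/animal-mouse confusion.
```
-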
Simplifying the Scientific Writing and Review Process with Sciflow
Future Internet 2010, 2, 645-661; doi:10.3390/fi2040645
Future Internet, ISSN 1999-5903, www.mdpi.com/journal/futureinternet (Open Access)

Article
Frederik Eichler and Wolfgang Reinhardt *
Computer Science Education Group, University of Paderborn, Fuerstenallee 11, 33102 Paderborn, Germany; E-Mail: [email protected]
* Author to whom correspondence should be addressed; E-Mail: [email protected]; Tel.: +49-5251-60-6603; Fax: +49-5251-60-6336.
Received: 13 September 2010; in revised form: 30 November 2010 / Accepted: 2 December 2010 / Published: 6 December 2010

Abstract: Scientific writing is an essential part of a student's and researcher's everyday life. In this paper we investigate the particularities of scientific writing and explore the features and limitations of existing tools for scientific writing. Drawing on this analysis and an online survey of the scientific writing processes of students and researchers at the University of Paderborn, we identify key principles to simplify scientific writing and reviewing. Finally, we introduce a novel approach to support scientific writing with a tool called SciFlow that builds on these principles and state-of-the-art technologies like cloud computing.

Keywords: scientific writing; survey; word processors; cloud computing

1. Introduction
Scientific writing is an essential part of a student's and researcher's life. Depending on the particular field of study, papers have to be written and written assignments have to be handed in. As the end of studies approaches, most students are asked to prove their ability to work in a scientific manner by writing a thesis.
-
The Web As a Parallel Corpus
Philip Resnik (University of Maryland) and Noah A. Smith (Johns Hopkins University)

Parallel corpora have become an essential resource for work in multilingual natural language processing. In this article, we report on our work using the STRAND system for mining parallel text on the World Wide Web, first reviewing the original algorithm and results and then presenting a set of significant enhancements. These enhancements include the use of supervised learning based on structural features of documents to improve classification performance, a new content-based measure of translational equivalence, and adaptation of the system to take advantage of the Internet Archive for mining parallel text from the Web on a large scale. Finally, the value of these techniques is demonstrated in the construction of a significant parallel corpus for a low-density language pair.

1. Introduction
Parallel corpora—bodies of text in parallel translation, also known as bitexts—have taken on an important role in machine translation and multilingual natural language processing. They represent resources for automatic lexical acquisition (e.g., Gale and Church 1991; Melamed 1997), they provide indispensable training data for statistical translation models (e.g., Brown et al. 1990; Melamed 2000; Och and Ney 2002), and they can provide the connection between vocabularies in cross-language information retrieval (e.g., Davis and Dunning 1995; Landauer and Littman 1990; see also Oard 1997). More recently, researchers at Johns Hopkins University and the
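The content-based measure of translational equivalence the abstract mentions can be approximated by dictionary-mediated word overlap. The sketch below is a crude proxy with an invented toy dictionary, not the article's actual measure, but it shows the idea of scoring candidate page pairs by content rather than by HTML structure.

```python
# Toy bilingual dictionary: English word -> possible Spanish translations (invented).
DICT = {"the": {"la", "el"}, "white": {"blanca", "blanco"}, "house": {"casa"}}

def equivalence_score(doc_en: list, doc_es: list) -> float:
    """Fraction of English tokens with a dictionary translation present in
    the candidate Spanish document: a crude content-based similarity."""
    es_tokens = set(doc_es)
    if not doc_en:
        return 0.0
    hits = sum(1 for w in doc_en if DICT.get(w, set()) & es_tokens)
    return hits / len(doc_en)

print(equivalence_score(["the", "white", "house"], ["la", "casa", "blanca"]))  # 1.0
```
-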
Introducing a Cat Tool to Translate: Wordfast
Indonesian Journal of English Language Studies, Vol. 2, Number 1, February 2016
Fika Apriliana, Ardiyarso Kurniawan, Sandy Ferianda, and Fidelis Chosa Kastuhandani

ABSTRACT
This article aims at introducing CAT tools to prospective translators who are encountering them for the first time. Some CAT tools must be paid for, while others are free. This article informs readers about free and paid CAT tools and the advantages and disadvantages of each. No special training is needed to use a free CAT tool, while the paid CAT tools require some special preparation. This article focuses on Wordfast Pro, the second most widely used CAT tool after SDL Trados. Wordfast Pro is paid software whose functioning is based on the creation of a translation memory, which facilitates and speeds up the translator's work. This article briefly explains the advantages of Wordfast Pro and the steps for using it. A translation example is presented to show the different translation results of Wordfast Pro, a paid CAT tool, and OmegaT, a free CAT tool. The article will therefore help those who intend to know more about Wordfast Pro and start using it.
Keywords: CAT tools, Wordfast Pro, OmegaT

INTRODUCTION
Computerized translation has attracted the attention of a large number of people who work directly or indirectly on translational issues, such as professional translators, teachers, linguists, researchers and future translators. When using CAT tools, it is immediately clear which parts of the text must be translated; the unchanging portions are transferred accurately and directly; the time savings due to repeated expressions is huge; and expressions are
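A translation memory of the kind Wordfast Pro builds can be sketched as fuzzy lookup over previously translated segments. The example below uses Python's difflib for the match score and invented TM entries; commercial CAT tools use more elaborate similarity measures.

```python
from difflib import SequenceMatcher

# A tiny translation memory: source segment -> stored translation (invented).
TM = {
    "Click the Save button.": "Klik tombol Simpan.",
    "Open the File menu.": "Buka menu File.",
}

def tm_lookup(segment: str, min_score: float = 0.75):
    """Return (translation, match %) for the best fuzzy TM hit, or None."""
    best = max(TM, key=lambda s: SequenceMatcher(None, segment, s).ratio())
    score = SequenceMatcher(None, segment, best).ratio()
    if score >= min_score:
        return TM[best], round(score * 100)
    return None

# A near-match still retrieves the stored translation, which is the
# time-saving behaviour the introduction describes.
print(tm_lookup("Click the Save button"))  # ('Klik tombol Simpan.', 98)
```
-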
An Empirical Study on the Influence of Translation Suggestions’ Provenance Metadata
A Thesis Submitted for the Degree of Doctor of Philosophy by Lucía Morado Vázquez
Department of Computer Science and Information Systems, University of Limerick
Supervisor: Dr Chris Exton
Co-Supervisor: Reinhard Schäler
Submitted to the University of Limerick, August 2012

Abstract
In the area of localisation there is constant pressure to automate processes in order to reduce the cost and time associated with an ever-growing workload. One of the main approaches to achieving this objective is to reuse previously localised data and metadata using standardised translation memory formats, such as the LISA Translation Memory eXchange (TMX) format or the OASIS XML Localisation Interchange File Format (XLIFF). This research aims to study the effectiveness and importance of the localisation metadata associated with the translation suggestions provided by Computer-Assisted Translation (CAT) tools. Firstly, we analysed the way in which localisation data and metadata can be represented in the current specification of XLIFF (1.2). Secondly, we designed a new format, called the Localisation Memory Container (LMC), to organise previously localised XLIFF files in a single container. Finally, we developed a prototype (XLIFF Phoenix) to leverage the data and metadata from the LMC into untranslated XLIFF files in order to improve the task of the translator by helping CAT tools not only to produce more translation suggestions easily, but also to enrich those suggestions with relevant metadata. To test whether this "CAT-oriented" enriched metadata has any influence on the behaviour of the translator involved in the localisation process, we designed an experimental translation task with translators using a modified CAT tool (Swordfish II).
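In XLIFF 1.2, each translatable segment lives in a trans-unit, and translation suggestions with their provenance can travel in alt-trans elements carrying origin and match-quality attributes. A minimal sketch of reading that metadata follows; the example file name is hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {"x": "urn:oasis:names:tc:xliff:document:1.2"}

def read_suggestions(path: str):
    """Yield (source, suggestion, origin, match-quality) per trans-unit."""
    root = ET.parse(path).getroot()
    for unit in root.iterfind(".//x:trans-unit", NS):
        source = unit.findtext("x:source", default="", namespaces=NS)
        for alt in unit.iterfind("x:alt-trans", NS):
            target = alt.findtext("x:target", default="", namespaces=NS)
            # The provenance metadata whose influence the thesis studies.
            yield source, target, alt.get("origin", "unknown"), alt.get("match-quality", "n/a")

# for row in read_suggestions("project.xlf"):  # hypothetical file
#     print(row)
```
-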
Resourcing Machine Translation with Parallel Treebanks John Tinsley
A dissertation submitted in fulfilment of the requirements for the award of Doctor of Philosophy (Ph.D.) to the Dublin City University School of Computing
Supervisor: Prof. Andy Way
December 2009

I hereby certify that this material, which I now submit for assessment on the programme of study leading to the award of Ph.D., is entirely my own work, that I have exercised reasonable care to ensure that the work is original, and does not to the best of my knowledge breach any law of copyright, and has not been taken from the work of others save and to the extent that such work has been cited and acknowledged within the text of my work.
Signed: (Candidate) ID No.: Date:

Contents
Abstract
Acknowledgements
List of Figures
List of Tables
1 Introduction
2 Background and the Current State-of-the-Art
  2.1 Parallel Treebanks
    2.1.1 Sub-sentential Alignment
    2.1.2 Automatic Approaches to Tree Alignment
  2.2 Phrase-Based Statistical Machine Translation
    2.2.1 Word Alignment
    2.2.2 Phrase Extraction and Translation Models
    2.2.3 Scoring and the Log-Linear Model
    2.2.4 Language Modelling
    2.2.5 Decoding
  2.3 Syntax-Based Machine Translation
    2.3.1 Statistical Transfer-Based MT
    2.3.2 Data-Oriented Translation
    2.3.3 Other Approaches
  2.4 MT Evaluation
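Phrase extraction (Section 2.2.2 above) harvests every phrase pair consistent with a word alignment: no word inside the pair may be aligned to a word outside it. A minimal sketch of that consistency check, with an invented two-word example:

```python
def extract_phrases(src_len: int, alignment: set, max_len: int = 4):
    """Enumerate (src span, tgt span) pairs consistent with the alignment.

    alignment holds 0-based (src_idx, tgt_idx) links. This is a minimal
    version; standard extraction also grows spans over unaligned words.
    """
    pairs = []
    for s1 in range(src_len):
        for s2 in range(s1, min(s1 + max_len, src_len)):
            linked = {t for (s, t) in alignment if s1 <= s <= s2}
            if not linked:
                continue
            t1, t2 = min(linked), max(linked)
            # Reject if the target span links back outside the source span.
            if any(not s1 <= s <= s2 for (s, t) in alignment if t1 <= t <= t2):
                continue
            pairs.append(((s1, s2), (t1, t2)))
    return pairs

# "das Haus" / "the house" with monotone links (invented example).
print(extract_phrases(2, {(0, 0), (1, 1)}))
# [((0, 0), (0, 0)), ((0, 1), (0, 1)), ((1, 1), (1, 1))]
```
-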
Design Decisions for a Structured Front End to LATEX Documents
Barry MacKichan, MacKichan Software, Inc. (barry dot mackichan at mackichan dot com)

1 Logical design
Scientific WorkPlace and Scientific Word are word processors that have been designed from the start to handle mathematics gracefully. Their design philosophy is descended from Brian Reid's Scribe, which emphasized the separation of content from form and was also an inspiration for LaTeX. This logical design philosophy holds that the author of a document should concern him- or herself with the content of the document, and with identifying the role that each bit of text plays, such as a header, a footnote, or a quote. The details of formatting should be ignored by the author, and handled instead by a predefined (or custom) style specification.
[Figure: TeX, PostScript, LaTeX, and PDF placed along two axes, procedural versus declarative and structured versus unstructured.]
There are several very compelling reasons for the separation of content from form.
- The expertise of the author is in the content; the expertise of the publisher is in the presentation.
- Worrying and fussing about the presentation is wasted effort when done by the author, since the publisher will impose its own formatting on the paper.
- Applying formatting algorithmically is the easiest way to assure consistency of presentation.
- When a document is re-purposed, it can be reformatted automatically for its new purpose.
Thus, PostScript is a powerful programming language, but it was later supplemented by PDF, which is not a programming language, but instead contains declarations of where individual characters are placed. PDF is not structured, but Adobe has been adding a structural overlay. LaTeX is quite structured, but it still contains visible signs of the
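The logical-design philosophy is easiest to see in a scrap of LaTeX markup: the author tags the role each piece of text plays, and the class or style file decides how every role is formatted. A minimal illustration, not taken from the paper:

```latex
\documentclass{article} % the class, not the author, controls formatting
\begin{document}
\section{Separation of Content and Form} % role: header
The author marks \emph{what} the text is, not how it
looks.\footnote{Footnote placement and layout are the style's business.}
\begin{quote} % role: quote; the indentation comes from the style
  Role-tagged text can be reformatted automatically when the
  document is re-purposed.
\end{quote}
\end{document}
```
-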
Towards Controlled Counterfactual Generation for Text
The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21)
Nishtha Madaan, Inkit Padhi, Naveen Panwar, Diptikalyan Saha
IBM Research AI
{nishthamadaan, naveen.panwar, [email protected]}, [email protected]

Abstract
Machine Learning has seen tremendous growth recently, which has led to larger adoption of ML systems for educational assessments, credit risk, healthcare, employment, criminal justice, to name a few. Trustworthiness of ML and NLP systems is a crucial aspect and requires a guarantee that the decisions they make are fair and robust. Aligned with this, we propose a framework, GYC, to generate a set of counterfactual text samples, which are crucial for testing these ML systems. Our main contributions include: a) we introduce GYC, a framework to generate counterfactual samples such that the generation is plausible, diverse, goal-oriented and effective; b) we generate counterfactual samples that can direct

From the introduction: ...diversity will ensure high coverage of the input space defined by the goal. In this paper, we aim to generate such counterfactual text samples which are also effective in finding test failures (i.e. label flips for NLP classifiers). Recent years have seen a tremendous increase in work on fairness testing (Ma, Wang, and Liu 2020; Holstein et al. 2019), which is capable of generating a large number of test cases that capture a model's ability to misclassify or remove unwanted bias around specific protected attributes (Huang et al. 2019; Garg et al. 2019). This is not limited to fairness; the community has also seen great interest in building models robust to adversarial changes (Goodfellow, Shlens, and Szegedy 2014; Michel et al.
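The testing idea, probing a classifier with minimally edited inputs and flagging label flips, can be sketched without GYC's controlled-generation machinery. Below, the classifier and the substitution pairs are invented stand-ins; GYC itself generates counterfactuals with a far richer, model-driven procedure.

```python
SWAPS = [("he", "she"), ("his", "her"), ("John", "Maria")]  # invented pairs

def classify(text: str) -> int:
    """Stand-in for the NLP classifier under test."""
    raise NotImplementedError

def counterfactual_flips(text: str):
    """Yield minimal edits whose predicted label differs from the original."""
    original = classify(text)
    for a, b in SWAPS:
        for src, tgt in ((a, b), (b, a)):
            edited = " ".join(tgt if w == src else w for w in text.split())
            if edited != text and classify(edited) != original:
                yield edited  # a test failure: the label flipped under a tiny edit
```
-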
Natural Language Processing Security- and Defense-Related Lessons Learned
July 2021 | Perspective: Expert Insights on a Timely Policy Issue
Peter Schirmer, Amber Jaycocks, Sean Mann, William Marcellino, Luke J. Matthews, John David Parsons, David Schulker

This Perspective offers a collection of lessons learned from RAND Corporation projects that employed natural language processing (NLP) tools and methods. It is written as a reference document for the practitioner and is not intended to be a primer on concepts, algorithms, or applications, nor does it purport to be a systematic inventory of all lessons relevant to NLP or data analytics. It is based on a convenience sample of NLP practitioners who spend or spent a majority of their time at RAND working on projects related to national defense, national intelligence, international security, or homeland security; thus, the lessons learned are drawn largely from projects in these areas. Although few of the lessons are applicable exclusively to the U.S. Department of Defense (DoD) and its NLP tasks, many may prove particularly salient for DoD, because its terminology is very domain-specific and full of jargon, much of its data are classified or sensitive, its computing environment is more restricted, and its information systems are generally not designed to support large-scale analysis. This Perspective addresses each of these issues and many more. The presentation prioritizes readability over literary grace. We use NLP as an umbrella term for the range of tools
- identifying studies conducting health service research and primary care research that were supported by federal agencies.
-
A User Interface-Level Integration Method for Multiple Automatic Speech Translation Systems
Seiya Osada, Kiyoshi Yamabana, Ken Hanazawa, Akitoshi Okumura
Media and Information Research Laboratories, NEC Corporation
1753 Shimonumabe, Nakahara-ku, Kawasaki, Kanagawa 211-8666, Japan
{s-osada@cd, k-yamabana@ct, k-hanazawa@cq, a-okumura@bx}.jp.nec.com

Abstract. We propose a new method of integrating multiple speech translation systems based on user interface-level integration. Users can select the result of free-sentence speech translation or that of registered-sentence translation without being conscious of the configuration of the automatic speech translation system. We implemented this method on a portable device.

Keywords: speech translation, user interface, machine translation, registered sentence retrieval, speech recognition

1 Introduction
There has been much research on speech-to-speech translation systems, such as the NEC speech translation system [1], ATR-MATRIX [2] and Verbmobil [3]. These speech-to-speech translation systems include at least three components: speech recognition, machine translation, and speech synthesis. In practice, however, each component does not always output the correct result for every input. In actual use of a speech-to-speech translation system with a display, the speaker using the system can examine the result of speech recognition on the display. Accordingly, when the recognition result is inappropriate, the speaker can correct errors by speaking to the system again. Similarly, when the result of speech synthesis is not correct, the listener using the system can examine the source sentence of the speech synthesis on the display. Machine translation, on the other hand, differs from speech recognition and speech synthesis in that neither the speaker nor the listener using the system can confirm the result of machine translation.
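The user interface-level integration the abstract describes can be sketched as a dispatcher that collects results from both engines and lets the user choose. The engine functions and the registered sentences below are invented placeholders, not NEC's implementation.

```python
from difflib import SequenceMatcher

# Registered sentences with vetted translations (invented example).
REGISTERED = {"Where is the station?": "Eki wa doko desu ka?"}

def free_mt(text: str) -> str:
    """Placeholder for a free-sentence machine translation engine."""
    raise NotImplementedError

def candidates(recognized: str) -> list:
    """Collect results from both translation methods for the UI to display."""
    results = []
    best = max(REGISTERED, key=lambda s: SequenceMatcher(None, recognized, s).ratio())
    if SequenceMatcher(None, recognized, best).ratio() > 0.8:
        # Registered-sentence retrieval: reliable when a close match exists.
        results.append(("registered", REGISTERED[best]))
    # Free-sentence MT: always available, but less predictable.
    results.append(("free", free_mt(recognized)))
    return results
```

The user picks between the candidates without needing to know which subsystem produced which result, which is the integration the paper implements on a portable device.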