Rapid Generation of Pronunciation Dictionaries for New Domains and Languages

Dissertation approved by the Faculty of Informatics of the Karlsruhe Institute of Technology (KIT) for the academic degree of Doktor der Ingenieurwissenschaften, submitted by Tim Schlippe from Ettlingen.

Date of oral examination: 10.12.2014
First reviewer: Prof. Dr.-Ing. Tanja Schultz
Second reviewer: Prof. Marelie Davel

Acknowledgements

The road to a Doctorate is long and winding, and would be impossible were it not for the people met along the way. I gratefully acknowledge my primary supervisor, Prof. Tanja Schultz. She believed in my research and supported me with many useful discussions. Her great personality and excellent research skills had a very strong effect on my scientific career. The travels to conferences, workshops and project meetings, which are among the best experiences of my life, would not have been possible without her support. Tanja and my co-advisor Prof. Marelie Davel also deserve thanks for reading through my dissertation and providing constructive criticism. It was very kind of Marelie to take the long trip from South Africa to Karlsruhe to participate in the dissertation committee. Special thanks to Dr. Stephan Vogel, Prof. Pascale Fung, Prof. Haizhou Li, Prof. Florian Metze, and Prof. Alan Black for their support regarding research visits for me and my students and for very close and beneficial collaboration. Furthermore, I would like to thank all my friends and colleagues at the Cognitive Systems Lab for a great time. Their support is magnificent. Thanks to Ngoc Thang Vu, Michael Wand, Matthias Janke, Dominic Telaar, Dominic Heger, Christoph Amma, Christian Herff, Felix Putze, Jochen Weiner, Marcus Georgi, Udhyakumar Nallasamy and Dirk Gehrig for many great travel experiences and lovely activities after work.
Moreover, thanks to Helga Scherer and Alfred Schmidt for their support. Even though he did not work at CSL and left the Karlsruhe Institute of Technology in 2009, Prof. Matthias Wölfel was always an inspiring example. Thanks to all the students I supervised in their master's, diploma and bachelor's theses and student research projects for their collaboration and encouragement. Finally, I am grateful for the patience and support of my family and friends. In particular, without my father Peter Schlippe, who always supported me, my career would never have been possible. Sarah Kissel, Johannes Paulke and Norman Enghusen – thank you for always being there for me.

Summary

Automatic speech recognition allows people to communicate intuitively with machines. Compared to keyboard and mouse as input modalities, automatic speech recognition enables a more natural and efficient interaction in which hands and eyes remain free for other activities. Applications such as dictation systems, voice control, voice-driven telephone dialogue systems and the translation of spoken language are already present in daily life. However, automatic speech recognition systems exist only for a small fraction of the 7,100 languages in the world [LSF14], since the development of such systems is usually expensive and time-consuming. Therefore, porting speech technology rapidly to new languages with little effort and cost is an important part of research and development.

Pronunciation dictionaries are a central component of both automatic speech recognition and speech synthesis. They provide the mapping from the orthographic form of a word to its pronunciation, typically expressed as a sequence of phonemes.

This dissertation presents innovative strategies and methods for the rapid generation of pronunciation dictionaries for new domains and languages. Depending on various conditions, solutions are proposed and developed.
These solutions range from the straightforward scenario, in which the target language is present in written form on the Internet and the mapping between speech and written language is close, up to the difficult scenario in which no written form for the target language exists. We embedded many of the tools implemented in this work in the Rapid Language Adaptation Toolkit. Its web interface is publicly accessible and allows people with little technical background to build initial speech recognition systems. Furthermore, we demonstrate the potential of the developed methods with many automatic speech recognition results.

The main achievements of this thesis providing solutions for languages with a writing system can be summarized as follows:

Rapid Vocabulary Selection: Strategies for the collection of Web texts (crawling) help to define the vocabulary of the dictionary. It is important to collect words that are time- and topic-relevant for the domain of the automatic speech recognition system. This includes the use of information from RSS feeds and Twitter, which is particularly helpful for the automatic transcription of broadcast news.

Rapid Generation of Text Normalization Systems: Text normalization transfers writing variations of the same word into a canonical representation. For the rapid development of text normalization systems at low cost, we present methods where Internet users generate training data for such systems by simple text editing (crowdsourcing).

Analyses of Grapheme-to-Phoneme Model Quality: Except for languages with logographic writing systems, data-driven grapheme-to-phoneme (G2P) models trained from existing word-pronunciation pairs can be used to directly provide pronunciations for words that do not appear in the dictionary.
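As a drastically simplified illustration of the data-driven idea – not the actual models used in this thesis, which learn m-to-n grapheme-phoneme alignments and context-dependent models – the sketch below learns a one-to-one grapheme-to-phoneme mapping from word-pronunciation pairs and applies it to an unseen word. The training pairs and phoneme symbols are invented for illustration:

```python
from collections import Counter, defaultdict

# Toy training data: each grapheme is assumed to align to exactly one
# phoneme (real G2P training first infers these alignments automatically).
train = [
    ("ba",  ["B", "A"]),
    ("da",  ["D", "A"]),
    ("bad", ["B", "A", "D"]),
]

# Count how often each grapheme maps to each phoneme.
counts = defaultdict(Counter)
for word, phones in train:
    for g, p in zip(word, phones):
        counts[g][p] += 1

def g2p(word):
    """Predict a pronunciation grapheme by grapheme (most frequent mapping)."""
    return [counts[g].most_common(1)[0][0] for g in word if g in counts]

print(g2p("dab"))  # ['D', 'A', 'B'] for a word unseen in training
```

Even this trivial model generalizes to unseen words; the thesis measures how much training data real G2P models need before their predictions become consistent.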
In analyses with European languages, we demonstrate that despite varying grapheme-to-phoneme relationships, word-pronunciation pairs whose pronunciations contain 15k phoneme tokens are sufficient to obtain a high consistency in the resulting pronunciations for most languages.

Automatic Error Recovery for Pronunciation Dictionaries: The manual production of pronunciations, e.g. collecting training data for grapheme-to-phoneme models, can lead to flawed or inadequate dictionary entries due to subjective judgements, typographical errors, and 'convention drift' by multiple annotators. We propose completely automatic methods to detect, remove, and substitute inconsistent or flawed entries (filter methods).

Web-based Tools and Methods for Rapid Pronunciation Creation: Crowdsourcing-based approaches are particularly helpful for languages with a complex grapheme-to-phoneme correspondence, where data-driven approaches decline in performance, or when no initial word-pronunciation pairs exist. Analyses of word-pronunciation pairs provided by Internet users on Wiktionary (Web-derived pronunciations) demonstrate that many pronunciations are available for languages of various language families – even for flexions and proper names. Additionally, Wiktionary's paragraphs about a word's origin and language help us to automatically detect foreign words and produce their pronunciations with separate grapheme-to-phoneme models. In the second crowdsourcing approach, a tool called Keynounce was implemented and analyzed. Keynounce is an online game in which users can proactively produce pronunciations. Finally, we present efficient methods for rapid and low-cost semi-automatic pronunciation dictionary development. These are especially helpful to minimize potential errors and the editing effort in a human- or crowdsourcing-based pronunciation generation process.
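One simple way to automate the detection of flawed entries is to flag pronunciations that deviate strongly from what a G2P model (trained on the remaining data) predicts. The sketch below illustrates that idea only; the phoneme edit distance, the threshold, and the stub predictor are illustrative assumptions, not the thesis's actual filter methods:

```python
def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (x != y))
    return dp[-1]

def flag_inconsistent(entries, predict, threshold=2):
    """Flag words whose listed pronunciation is far from the predicted one."""
    return [word for word, pron in entries.items()
            if edit_distance(pron, predict(word)) >= threshold]

# Illustrative check with a stub predictor mapping each letter to itself:
entries = {"aba": ["A", "B", "A"], "abba": ["X", "Y", "Z", "Q"]}
print(flag_inconsistent(entries, predict=lambda w: [c.upper() for c in w]))
# flags only the implausible entry "abba"
```

Flagged entries can then be removed, or substituted by the G2P prediction, before retraining.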
Cross-lingual G2P Model based Pronunciation Generation: For the production of pronunciations, we investigate strategies using grapheme-to-phoneme models derived from existing dictionaries of other languages, thereby substantially reducing the necessary manual effort.

The main achievements of this thesis for languages without a writing system

The most difficult challenge is the construction of pronunciation dictionaries for languages and dialects which have no writing system. If only a recording of spoken phrases is present, the corresponding phonetic transcription can be obtained using a phoneme recognizer. However, the resulting phoneme sequence does not contain any information about the word boundaries. This thesis presents procedures in which exploiting the written translation of the spoken phrases helps the word discovery process (human translations guided language discovery). The assumption is that a human translator produces utterances in the (non-written) target language from prompts in a resource-rich source language. The main achievements of this thesis aimed at languages without a writing system can be summarized as follows:

Word Segmentation from Phoneme Sequences through Cross-lingual Word-to-Phoneme Alignment: We implicitly define word boundaries from the alignment of words in the translation to phonemes. Furthermore, our procedures contain an automatic error compensation.

Pronunciation Extraction from Phoneme Sequences through Cross-lingual Word-to-Phoneme Alignment: We gain a pronunciation dictionary whose words are represented by phoneme clusters and the corresponding pronunciations by their phoneme sequences. This enables an automatic "invention" of a writing system for the target language.
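The word discovery step can be sketched as follows, assuming the cross-lingual alignment is already given (hard-coded here; in the thesis it is produced by a statistical word-to-phoneme aligner): contiguous phonemes aligned to the same source word form one hypothesized target-language word, whose phoneme sequence serves both as its invented written form and as its pronunciation. The utterance and alignment below are invented for illustration:

```python
def segment(phonemes, alignment):
    """Group phonemes into word candidates: a new word starts whenever the
    aligned source-word index changes."""
    words, current, last = [], [], None
    for phone, src in zip(phonemes, alignment):
        if src != last and current:
            words.append(current)
            current = []
        current.append(phone)
        last = src
    if current:
        words.append(current)
    return words

# Invented example: a target utterance aligned to the source words
# "hello" (index 0) and "world" (index 1).
phones    = ["o", "l", "a", "m", "u", "n", "d", "o"]
alignment = [0, 0, 0, 1, 1, 1, 1, 1]
clusters = segment(phones, alignment)
print(clusters)  # [['o', 'l', 'a'], ['m', 'u', 'n', 'd', 'o']]

# Each cluster yields a dictionary entry: the joined phoneme labels act as
# the "invented" written form, the phoneme sequence as the pronunciation.
dictionary = {"".join(c): c for c in clusters}
```

Real phoneme recognizer output is noisy, which is why the thesis's procedures additionally include automatic error compensation.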
Zero-resource Automatic Speech Recognition: Finally, we tackle the task of bootstrapping automatic speech recognition systems without an a priori given language model, pronunciation dictionary, or transcribed speech data for the target language – available are only untranscribed speech and translations of what was said into resource-rich source languages.