Text and Speech Alignment Methods for Speech Translation Corpora Creation
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
Adapting to Trends in Language Resource Development: a Progress
Adapting to Trends in Language Resource Development: A Progress Report on LDC Activities Christopher Cieri, Mark Liberman University of Pennsylvania, Linguistic Data Consortium 3600 Market Street, Suite 810, Philadelphia PA. 19104, USA E-mail: {ccieri,myl} AT ldc.upenn.edu Abstract This paper describes changing needs among the communities that exploit language resources and recent LDC activities and publications that support those needs by providing greater volumes of data and associated resources in a growing inventory of languages with ever more sophisticated annotation. Specifically, it covers the evolving role of data centers with specific emphasis on the LDC, the publications released by the LDC in the two years since our last report and the sponsored research programs that provide LRs initially to participants in those programs but eventually to the larger HLT research communities and beyond. creators of tools, specifications and best practices as well 1. Introduction as databases. The language resource landscape over the past two years At least in the case of LDC, the role of the data may be characterized by what has changed and what has center has grown organically, generally in response to remained constant. Constant are the growing demands stated need but occasionally in anticipation of it. LDC’s for an ever-increasing body of language resources in a role was initially that of a specialized publisher of LRs growing number of human languages with increasingly with the additional responsibility to archive and protect sophisticated annotation to satisfy an ever-expanding list published resources to guarantee their availability of the of user communities. Changing are the relative long term. -
The Role of Higher-Level Linguistic Features in HMM-Based Speech Synthesis
INTERSPEECH 2010 The role of higher-level linguistic features in HMM-based speech synthesis Oliver Watts, Junichi Yamagishi, Simon King Centre for Speech Technology Research, University of Edinburgh, UK [email protected] [email protected] [email protected] Abstract the annotation of a synthesiser’s training data on listeners’ re- action to the speech produced by that synthesiser is still unclear We analyse the contribution of higher-level elements of the lin- to us. In particular, we suspect that features such as pitch ac- guistic specification of a data-driven speech synthesiser to the cent type which might be useful in voice-building if labelled naturalness of the synthetic speech which it generates. The reliably, are under-used in conventional systems because of la- system is trained using various subsets of the full feature-set, belling/prediction errors. For building the systems to be evalu- in which features relating to syntactic category, intonational ated we therefore used data for which hand labelled annotation phrase boundary, pitch accent and boundary tones are selec- of ToBI events is available. This allows us to compare ideal sys- tively removed. Utterances synthesised by the different config- tems built from a corpus where these higher-level features are urations of the system are then compared in a subjective evalu- accurately annotated with more conventional systems that rely ation of their naturalness. exclusively on prediction of these features from text for annota- The work presented forms background analysis for an on- tion of their training data. going set of experiments in performing text-to-speech (TTS) conversion based on shallow features: features that can be triv- 2. -
Characteristics of Text-To-Speech and Other Corpora
Characteristics of Text-to-Speech and Other Corpora Erica Cooper, Emily Li, Julia Hirschberg Columbia University, USA [email protected], [email protected], [email protected] Abstract “standard” TTS speaking style, whether other forms of profes- sional and non-professional speech differ substantially from the Extensive TTS corpora exist for commercial systems cre- TTS style, and which features are most salient in differentiating ated for high-resource languages such as Mandarin, English, the speech genres. and Japanese. Speakers recorded for these corpora are typically instructed to maintain constant f0, energy, and speaking rate and 2. Related Work are recorded in ideal acoustic environments, producing clean, consistent audio. We have been developing TTS systems from TTS speakers are typically instructed to speak as consistently “found” data collected for other purposes (e.g. training ASR as possible, without varying their voice quality, speaking style, systems) or available on the web (e.g. news broadcasts, au- pitch, volume, or tempo significantly [1]. This is different from diobooks) to produce TTS systems for low-resource languages other forms of professional speech in that even with the rela- (LRLs) which do not currently have expensive, commercial sys- tively neutral content of broadcast news, anchors will still have tems. This study investigates whether traditional TTS speakers some variance in their speech. Audiobooks present an even do exhibit significantly less variation and better speaking char- greater challenge, with a more expressive reading style and dif- acteristics than speakers in found genres. By examining char- ferent character voices. Nevertheless, [2, 3, 4] have had suc- acteristics of f0, energy, speaking rate, articulation, NHR, jit- cess in building voices from audiobook data by identifying and ter, and shimmer in found genres and comparing these to tra- using the most neutral and highest-quality utterances. -
The RACAI Text-To-Speech Synthesis System
The RACAI Text-to-Speech Synthesis System Tiberiu Boroș, Radu Ion, Ștefan Daniel Dumitrescu Research Institute for Artificial Intelligence “Mihai Drăgănescu”, Romanian Academy (RACAI) [email protected], [email protected], [email protected] of standalone Natural Language Processing (NLP) Tools Abstract aimed at enabling text-to-speech (TTS) synthesis for less- This paper describes the RACAI Text-to-Speech (TTS) entry resourced languages. A good example is the case of for the Blizzard Challenge 2013. The development of the Romanian, a language which poses a lot of challenges for TTS RACAI TTS started during the Metanet4U project and the synthesis mainly because of its rich morphology and its system is currently part of the METASHARE platform. This reduced support in terms of freely available resources. Initially paper describes the work carried out for preparing the RACAI all the tools were standalone, but their design allowed their entry during the Blizzard Challenge 2013 and provides a integration into a single package and by adding a unit selection detailed description of our system and future development speech synthesis module based on the Pitch Synchronous directions. Overlap-Add (PSOLA) algorithm we were able to create a fully independent text-to-speech synthesis system that we refer Index Terms: speech synthesis, unit selection, concatenative to as RACAI TTS. 1. Introduction A considerable progress has been made since the start of the project, but our system still requires development and Text-to-speech (TTS) synthesis is a complex process that although considered premature, our participation in the addresses the task of converting arbitrary text into voice. -
100000 Podcasts
100,000 Podcasts: A Spoken English Document Corpus Ann Clifton Sravana Reddy Yongze Yu Spotify Spotify Spotify [email protected] [email protected] [email protected] Aasish Pappu Rezvaneh Rezapour∗ Hamed Bonab∗ Spotify University of Illinois University of Massachusetts [email protected] at Urbana-Champaign Amherst [email protected] [email protected] Maria Eskevich Gareth J. F. Jones Jussi Karlgren CLARIN ERIC Dublin City University Spotify [email protected] [email protected] [email protected] Ben Carterette Rosie Jones Spotify Spotify [email protected] [email protected] Abstract Podcasts are a large and growing repository of spoken audio. As an audio format, podcasts are more varied in style and production type than broadcast news, contain more genres than typi- cally studied in video data, and are more varied in style and format than previous corpora of conversations. When transcribed with automatic speech recognition they represent a noisy but fascinating collection of documents which can be studied through the lens of natural language processing, information retrieval, and linguistics. Paired with the audio files, they are also a re- source for speech processing and the study of paralinguistic, sociolinguistic, and acoustic aspects of the domain. We introduce the Spotify Podcast Dataset, a new corpus of 100,000 podcasts. We demonstrate the complexity of the domain with a case study of two tasks: (1) passage search and (2) summarization. This is orders of magnitude larger than previous speech corpora used for search and summarization. Our results show that the size and variability of this corpus opens up new avenues for research. -
Syntactic Annotation of Spoken Utterances: a Case Study on the Czech Academic Corpus
Syntactic annotation of spoken utterances: A case study on the Czech Academic Corpus Barbora Hladká and Zde ňka Urešová Charles University in Prague Institute of Formal and Applied Linguistics {hladka, uresova}@ufal.mff.cuni.cz text corpora, e.g. the Penn Treebank, Abstract the family of the Prague Dependency Tree- banks, the Tiger corpus for German, etc. Some Corpus annotation plays an important corpora contain a semantic annotation, such as role in linguistic analysis and computa- the Penn Treebank enriched by PropBank and tional processing of both written and Nombank, the Prague Dependency Treebank in spoken language. Syntactic annotation its highest layer, the Penn Chinese or the of spoken texts becomes clearly a topic the Korean Treebanks. The Penn Discourse of considerable interest nowadays, Treebank contains discourse annotation. driven by the desire to improve auto- It is desirable that syntactic (and higher) an- matic speech recognition systems by notation of spoken texts respects the written- incorporating syntax in the language text style as much as possible, for obvious rea- models, or to build language under- sons: data “compatibility”, reuse of tools etc. standing applications. Syntactic anno- A number of questions arise immediately: tation of both written and spoken texts How much experience and knowledge ac- in the Czech Academic Corpus was quired during the written text annotation can created thirty years ago when no other we apply to the spoken texts? Are the annota- (even annotated) corpus of spoken texts tion instructions applicable to transcriptions in has existed. We will discuss how much a straightforward way or some modifications relevant and inspiring this annotation is of them must be done? Can transcriptions be to the current frameworks of spoken annotated “as they are” or some transformation text annotation. -
Synthesis and Recognition of Speech Creating and Listening to Speech
ISSN 1883-1974 (Print) ISSN 1884-0787 (Online) National Institute of Informatics News NII Interview 51 A Combination of Speech Synthesis and Speech Oct. 2014 Recognition Creates an Affluent Society NII Special 1 “Statistical Speech Synthesis” Technology with a Rapidly Growing Application Area NII Special 2 Finding Practical Application for Speech Recognition Feature Synthesis and Recognition of Speech Creating and Listening to Speech A digital book version of “NII Today” is now available. http://www.nii.ac.jp/about/publication/today/ This English language edition NII Today corresponds to No. 65 of the Japanese edition [Advance Notice] Great news! NII Interview Yamagishi-sensei will create my voice! Yamagishi One result is a speech translation sys- sound. Bit (NII Character) A Combination of Speech Synthesis tem. This system recognizes speech and translates it Ohkawara I would like your comments on the fu- using machine translation to synthesize speech, also ture challenges. and Speech Recognition automatically translating it into every language to Ono The challenge for speech recognition is how speak. Moreover, the speech is created with a human close it will come to humans in the distant speech A Word from the Interviewer voice. In second language learning, you can under- case. If this study is advanced, it will be possible to Creates an Affluent Society stand how you should pronounce it with your own summarize the contents of a meeting and to automati- voice. If this is further advanced, the system could cally take the minutes. If a robot understands the con- have an actor in a movie speak in a different language tents of conversations by multiple people in a natural More and more people have begun to use smart- a smartphone, I use speech input more often. -
The Role of Speech Processing in Human-Computer Intelligent Communication
THE ROLE OF SPEECH PROCESSING IN HUMAN-COMPUTER INTELLIGENT COMMUNICATION Candace Kamm, Marilyn Walker and Lawrence Rabiner Speech and Image Processing Services Research Laboratory AT&T Labs-Research, Florham Park, NJ 07932 Abstract: We are currently in the midst of a revolution in communications that promises to provide ubiquitous access to multimedia communication services. In order to succeed, this revolution demands seamless, easy-to-use, high quality interfaces to support broadband communication between people and machines. In this paper we argue that spoken language interfaces (SLIs) are essential to making this vision a reality. We discuss potential applications of SLIs, the technologies underlying them, the principles we have developed for designing them, and key areas for future research in both spoken language processing and human computer interfaces. 1. Introduction Around the turn of the twentieth century, it became clear to key people in the Bell System that the concept of Universal Service was rapidly becoming technologically feasible, i.e., the dream of automatically connecting any telephone user to any other telephone user, without the need for operator assistance, became the vision for the future of telecommunications. Of course a number of very hard technical problems had to be solved before the vision could become reality, but by the end of 1915 the first automatic transcontinental telephone call was successfully completed, and within a very few years the dream of Universal Service became a reality in the United States. We are now in the midst of another revolution in communications, one which holds the promise of providing ubiquitous service in multimedia communications. -
Named Entity Recognition in Question Answering of Speech Data
Proceedings of the Australasian Language Technology Workshop 2007, pages 57-65 Named Entity Recognition in Question Answering of Speech Data Diego Molla´ Menno van Zaanen Steve Cassidy Division of ICS Department of Computing Macquarie University North Ryde Australia {diego,menno,cassidy}@ics.mq.edu.au Abstract wordings of the same information). The system uses symbolic algorithms to find exact answers to ques- Question answering on speech transcripts tions in large document collections. (QAst) is a pilot track of the CLEF com- The design and implementation of the An- petition. In this paper we present our con- swerFinder system has been driven by requirements tribution to QAst, which is centred on a that the system should be easy to configure, extend, study of Named Entity (NE) recognition on and, therefore, port to new domains. To measure speech transcripts, and how it impacts on the success of the implementation of AnswerFinder the accuracy of the final question answering in these respects, we decided to participate in the system. We have ported AFNER, the NE CLEF 2007 pilot task of question answering on recogniser of the AnswerFinder question- speech transcripts (QAst). The task in this compe- answering project, to the set of answer types tition is different from that for which AnswerFinder expected in the QAst track. AFNER uses a was originally designed and provides a good test of combination of regular expressions, lists of portability to new domains. names (gazetteers) and machine learning to The current CLEF pilot track QAst presents an find NeWS in the data. The machine learn- interesting and challenging new application of ques- ing component was trained on a develop- tion answering. -
Reducing Out-Of-Vocabulary in Morphology to Improve the Accuracy in Arabic Dialects Speech Recognition
View metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by University of Birmingham Research Archive, E-theses Repository Reducing Out-of-Vocabulary in Morphology to Improve the Accuracy in Arabic Dialects Speech Recognition by Khalid Abdulrahman Almeman A thesis submitted to The University of Birmingham for the degree of Doctor of Philosophy School of Computer Science The University of Birmingham March 2015 University of Birmingham Research Archive e-theses repository This unpublished thesis/dissertation is copyright of the author and/or third parties. The intellectual property rights of the author or third parties in respect of this work are as defined by The Copyright Designs and Patents Act 1988 or as modified by any successor legislation. Any use made of information contained in this thesis/dissertation must be in accordance with that legislation and must be properly acknowledged. Further distribution or reproduction in any format is prohibited without the permission of the copyright holder. Abstract This thesis has two aims: developing resources for Arabic dialects and improving the speech recognition of Arabic dialects. Two important components are considered: Pro- nunciation Dictionary (PD) and Language Model (LM). Six parts are involved, which relate to finding and evaluating dialects resources and improving the performance of sys- tems for the speech recognition of dialects. Three resources are built and evaluated: one tool and two corpora. The methodology that was used for building the multi-dialect morphology analyser involves the proposal and evaluation of linguistic and statistic bases. We obtained an overall accuracy of 94%. The dialect text corpora have four sub-dialects, with more than 50 million tokens. -
Detection of Imperative and Declarative Question-Answer Pairs in Email Conversations
Detection of Imperative and Declarative Question-Answer Pairs in Email Conversations Helen Kwong Neil Yorke-Smith Stanford University, USA SRI International, USA [email protected] [email protected] Abstract is part of an effort to endow an email client with a range of usable smart features by means of AI technology. Question-answer pairs extracted from email threads Prior work that studies the problems of question identifi- can help construct summaries of the thread, as well cation and question-answer pairing in email threads has fo- as inform semantic-based assistance with email. cused on interrogative questions, such as “Which movies do Previous work dedicated to email threads extracts you like?” However, questions are often expressed in non- only questions in interrogative form. We extend the interrogative forms: declarative questions such as “I was scope of question and answer detection and pair- wondering if you are free today”, and imperative questions ing to encompass also questions in imperative and such as “Tell me the cost of the tickets”. Our sampling from declarative forms, and to operate at sentence-level the well-known Enron email corpus finds about one in five fidelity. Building on prior work, our methods are questions are non-interrogative in form. Further, previous based on learned models over a set of features that work for email threads has operated at the paragraph level, include the content, context, and structure of email thus potentially missing more fine-grained conversation. threads. For two large email corpora, we show We propose an approach to extract question-answer pairs that our methods balance precision and recall in ex- that includes imperative and declarative questions. -
Voice User Interface on the Web Human Computer Interaction Fulvio Corno, Luigi De Russis Academic Year 2019/2020 How to Create a VUI on the Web?
Voice User Interface On The Web Human Computer Interaction Fulvio Corno, Luigi De Russis Academic Year 2019/2020 How to create a VUI on the Web? § Three (main) steps, typically: o Speech Recognition o Text manipulation (e.g., Natural Language Processing) o Speech Synthesis § We are going to start from a simple application to reach a quite complex scenario o by using HTML5, JS, and PHP § Reminder: we are interested in creating an interactive prototype, at the end 2 Human Computer Interaction Weather Web App A VUI for "chatting" about the weather Base implementation at https://github.com/polito-hci-2019/vui-example 3 Human Computer Interaction Speech Recognition and Synthesis § Web Speech API o currently a draft, experimental, unofficial HTML5 API (!) o https://wicg.github.io/speech-api/ § Covers both speech recognition and synthesis o different degrees of support by browsers 4 Human Computer Interaction Web Speech API: Speech Recognition § Accessed via the SpeechRecognition interface o provides the ability to recogniZe voice from an audio input o normally via the device's default speech recognition service § Generally, the interface's constructor is used to create a new SpeechRecognition object § The SpeechGrammar interface can be used to represent a particular set of grammar that your app should recogniZe o Grammar is defined using JSpeech Grammar Format (JSGF) 5 Human Computer Interaction Speech Recognition: A Minimal Example const recognition = new window.SpeechRecognition(); recognition.onresult = (event) => { const speechToText = event.results[0][0].transcript;