Segmental Audio Word2vec: Representing Utterances As Sequences of Vectors with Applications in Spoken Term Detection


Yu-Hsuan Wang, Hung-yi Lee, Lin-shan Lee
College of Electrical Engineering and Computer Science, National Taiwan University
{r04922167, [email protected]}, [email protected]

ABSTRACT

While Word2Vec represents words (in text) as vectors carrying semantic information, audio Word2Vec was shown to be able to represent signal segments of spoken words as vectors carrying phonetic structure information. Audio Word2Vec can be trained in an unsupervised way from an unlabeled corpus, except that the word boundaries are needed. In this paper, we extend audio Word2Vec from word-level to utterance-level by proposing a new segmental audio Word2Vec, in which unsupervised spoken word boundary segmentation and audio Word2Vec are jointly learned and mutually enhanced, so an utterance can be directly represented as a sequence of vectors carrying phonetic structure information. This is achieved by a segmental sequence-to-sequence autoencoder (SSAE), in which a segmentation gate trained with reinforcement learning is inserted in the encoder. Experiments on English, Czech, French and German show very good performance in both unsupervised spoken word segmentation and spoken term detection applications (significantly better than frame-based DTW).

Index Terms— recurrent neural network, autoencoder, reinforcement learning, policy gradient

1. INTRODUCTION

In natural language processing, it is well known that Word2Vec, which transforms words (in text) into vectors of fixed dimensionality, is very useful in various applications, because those vectors carry semantic information [1][2]. In speech signal processing, it has been shown that audio Word2Vec, which transforms spoken words into vectors of fixed dimensionality [3][4], is likewise useful, for example in spoken term detection or data augmentation [5][6], because those vectors carry the phonetic structure of the spoken words. This audio Word2Vec can be trained in a completely unsupervised way from an unlabeled dataset, except that the spoken word boundaries are needed. The need for spoken word boundaries is a major limitation for audio Word2Vec, because word boundaries are usually not available for given speech utterances or corpora [7][8]. Although it is possible to use automatic processes to estimate word boundaries before applying audio Word2Vec [9][10][11][12][13], it is highly desirable that the signal segmentation and audio Word2Vec be integrated and jointly learned, because in that way they may enhance each other. This means the machine learns to segment the utterances into a sequence of spoken words and to transform these spoken words into a sequence of vectors at the same time. This is the segmental audio Word2Vec proposed here: representing each utterance as a sequence of fixed-dimensional vectors, each of which hopefully carries the phonetic structure information for a spoken word. This actually extends audio Word2Vec from word level up to utterance level. Such a segmental audio Word2Vec can have plenty of potential applications in the future, for example speech information summarization, speech-to-speech translation, or voice conversion [14]. Here we show a very attractive first application in spoken term detection.

The segmental audio Word2Vec proposed in this paper is based on a segmental sequence-to-sequence autoencoder (SSAE) that jointly learns a segmentation gate and a sequence-to-sequence autoencoder. The former determines the word boundaries in the utterance, and the latter represents each audio segment with an embedding vector. These two processes can be jointly learned from an unlabeled corpus in a completely unsupervised way. During training, the model learns to convert the utterances into sequences of embeddings, and then reconstructs the utterances from these sequences of embeddings. A guideline for the proper number of vectors (or words) within an utterance of a given length is needed, in order to prevent the machine from segmenting the utterances into more segments (or words) than needed. Since the number of embeddings is a discrete variable and not differentiable, standard back-propagation is not applicable [15][16]; the policy gradient method for reinforcement learning [17] is therefore used. How well these generated word vector sequences carry the phonetic structure information of the original utterances was evaluated with the real application task of query-by-example spoken term detection on four languages: English (on TIMIT) and Czech, French, and German (on GlobalPhone corpora) [18].
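To make the spoken term detection use case concrete, the sketch below (an illustration, not the paper's detection method; the function names and the average-cosine scoring rule are assumptions) shows one simple way a query-by-example system could match a spoken query against an utterance once both are represented as sequences of embedding vectors: slide the query's vector sequence over the utterance's and keep the best average cosine similarity. Matching a handful of word-level vectors this way is far cheaper than frame-based DTW over hundreds of acoustic frames, which is the baseline compared against in the experiments.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def std_score(query_emb, utt_emb):
    """Slide the query's embedding sequence over the utterance's and
    return the best average vector-wise cosine similarity."""
    q, n = len(query_emb), len(utt_emb)
    if n < q:
        return -1.0
    best = -1.0
    for start in range(n - q + 1):
        window = utt_emb[start:start + q]
        best = max(best, np.mean([cosine(a, b)
                                  for a, b in zip(query_emb, window)]))
    return best

# Toy usage: a 2-vector query matched against a 5-vector utterance.
rng = np.random.default_rng(0)
query = rng.normal(size=(2, 128))       # query as a sequence of embeddings
utterance = rng.normal(size=(5, 128))   # utterance as a sequence of embeddings
print(std_score(query, utterance))
```

Utterances in a corpus would then be ranked by this score for a given query.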
2. PROPOSED APPROACH

2.1. Segmental Sequence-to-Sequence Autoencoder (SSAE)

The proposed structure for the SSAE is depicted in Fig. 1, in which a segmentation gate is inserted into the recurrent autoencoder. For an input utterance $X = \{x_1, x_2, ..., x_T\}$, where $x_t$ represents the $t$-th acoustic feature (e.g., MFCC) and $T$ is the length of the utterance, the model learns to determine the word boundaries and produce the embeddings for the $N$ generated audio segments, $Y = \{e_1, e_2, ..., e_N\}$, where $e_n$ is the $n$-th embedding and $N \le T$.

The proposed SSAE consists of an encoder RNN (ER) and a decoder RNN (DR), just like the conventional autoencoder, but the encoder includes an extra segmentation gate, controlled by another RNN (shown as the sequence of blocks S in Fig. 1). The segmentation problem is formulated as a reinforcement learning problem: at each time $t$, the segmentation gate agent performs an action $a_t$, "segment" or "pass", according to a given state $s_t$, and $x_t$ is taken as a word boundary if $a_t$ is "segment".

Fig. 1. The segmental sequence-to-sequence autoencoder (SSAE). In addition to the encoder RNN (ER) and decoder RNN (DR), a segmentation gate (blocks S) is included in the encoder for estimating the word boundaries. During transitions across the segment boundaries, the encoder RNN and decoder RNN are reset (illustrated with a slash in front of an arrow) so there is no information flow across segment boundaries. Each segment (shown in different colors) can be viewed as performing sequence-to-sequence training individually.

For the segmentation gate, the state at time $t$, $s_t$, is defined as the concatenation of the input $x_t$, the gate activation signal (GAS) $g_t$ extracted from the gates of the GRU in another pre-trained RNN autoencoder [10], and the previous action $a_{t-1}$ taken [19]:

$$s_t = x_t \,\|\, g_t \,\|\, a_{t-1}. \quad (1)$$

The output $h_t$ of the layers of the segmentation gate RNN (blocks S in Fig. 1), followed by a linear transform $(W^{\pi}, b^{\pi})$ and a softmax nonlinearity, models the policy $\pi_t$ at time $t$:

$$h_t = \mathrm{RNN}(s_1, s_2, ..., s_t), \quad (2)$$
$$\pi_t = \mathrm{softmax}(W^{\pi} h_t + b^{\pi}). \quad (3)$$

This $\pi_t$ gives two probabilities, respectively for "segment" and "pass". An action $a_t$ is sampled from this distribution during training to encourage exploration; during testing, $a_t$ is "segment" whenever its probability is higher.

When $a_t$ is "segment", the time $t$ is viewed as a word boundary, and the segmentation gate passes the output of the encoder RNN as an embedding. The state of the encoder RNN is also reset to its initial value, so the embedding $e_n$ is generated from the acoustic features of the audio segment only, independent of the previous input in spite of the recurrent structure:

$$e_n = \mathrm{Encoder}(x_{t_1}, x_{t_1+1}, ..., x_{t_2}), \quad (4)$$

where $t_1$ and $t_2$ refer to the beginning and ending times of the $n$-th audio segment.
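As an illustration of Eqs. (1)-(3), here is a minimal PyTorch sketch of the segmentation gate (a reading under stated assumptions, not the authors' code; the class name, dimensions, and single-utterance interface are hypothetical). It concatenates the state, runs a GRU over it, and samples "segment"/"pass" actions from the softmax policy during training, while taking the argmax at test time:

```python
import torch
import torch.nn as nn

class SegmentationGate(nn.Module):
    """Gate policy of Eqs. (1)-(3): s_t = x_t || g_t || a_{t-1},
    an RNN over the states, then a linear layer (W_pi, b_pi) + softmax
    over the two actions {0: "pass", 1: "segment"}."""
    def __init__(self, feat_dim, gas_dim, hidden_dim=256):
        super().__init__()
        self.rnn = nn.GRU(feat_dim + gas_dim + 1, hidden_dim, batch_first=True)
        self.policy = nn.Linear(hidden_dim, 2)

    def forward(self, x, gas):
        # x: (T, feat_dim) acoustic features; gas: (T, gas_dim) GAS signals.
        h, a_prev = None, torch.zeros(1)
        actions, log_probs = [], []
        for t in range(x.size(0)):
            s_t = torch.cat([x[t], gas[t], a_prev], dim=-1)      # Eq. (1)
            out, h = self.rnn(s_t.view(1, 1, -1), h)             # Eq. (2)
            pi_t = torch.softmax(self.policy(out.view(-1)), -1)  # Eq. (3)
            dist = torch.distributions.Categorical(pi_t)
            # Sample during training (exploration); argmax at test time.
            a_t = dist.sample() if self.training else pi_t.argmax()
            actions.append(int(a_t))
            log_probs.append(dist.log_prob(a_t))
            a_prev = a_t.float().view(1)
        return actions, torch.stack(log_probs)

gate = SegmentationGate(feat_dim=39, gas_dim=39)
acts, logps = gate(torch.randn(50, 39), torch.randn(50, 39))
```

In the full SSAE, whenever the sampled action is "segment", the encoder RNN would emit its current output as the segment embedding $e_n$ and reset its hidden state, implementing Eq. (4) and the resets shown in Fig. 1.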
The input utterance $X$ should be reconstructed from the embedding sequence $Y = \{e_1, e_2, ..., e_N\}$. Because the decoder RNN (DR) decodes in backward order, as shown in Fig. 1 [20], for the embedding $e_n$ of the input segment from $t_1$ to $t_2$ in Eq. (4) above, the reconstructed feature vector is

$$\hat{x}_t = \mathrm{Decoder}(\hat{x}_{t_2}, \hat{x}_{t_2-1}, ..., \hat{x}_{t+1}, e_n). \quad (5)$$

The decoder RNN is also reset when it begins decoding each audio segment, to remove the information flow from the following segment.

2.2. Encoder and Decoder Training

The loss function $L$ for training the encoder and decoder is simply the averaged squared $\ell_2$ norm of the reconstruction error over all inputs $x_t$:

$$L = \sum_{l=1}^{L} \sum_{t=1}^{T_l} \frac{1}{d} \left\| \hat{x}_t^{(l)} - x_t^{(l)} \right\|^2, \quad (6)$$

where the superscript $(l)$ indicates the $l$-th training utterance, with length $T_l$, $L$ is the number of utterances used in training, and $d$ is the dimensionality of $x_t^{(l)}$.

2.3. Segmentation Gate Training

2.3.1. Policy Gradient

The segmentation gate is trained with policy gradient reinforcement learning, maximizing the expected reward $r$ under the policy $\pi$ as $J(\theta) = E_{\pi}[r]$, where $\theta$ is the parameter set. The updates of the segmentation gate are simply given by

$$\nabla_{\theta} J(\theta) = E_{a \sim \pi}\left[ \sum_{t=1}^{T} \nabla_{\theta} \log \pi_t^{(\theta)}(a_t)\,(r - r_b) \right], \quad (7)$$

where $\pi_t^{(\theta)}(a_t)$ is the probability of the action $a_t$ taken, as in Eq. (3), and $r_b$ is a reward baseline.

2.3.2. Rewards

The reconstruction error is certainly a good indicator of whether the segmentation boundaries are good, since the embeddings are generated based on the segmentation. We hypothesize that good boundaries, for example those close to word boundaries, would result in smaller reconstruction errors, because the audio segments for words would appear more frequently in the corpus, and thus their embeddings would be trained better, giving lower reconstruction errors. So the smaller the reconstruction error, the higher the reward:

$$r_{\mathrm{MSE}} = -\sum_{t=1}^{T} \frac{1}{d} \left\| \hat{x}_t - x_t \right\|^2. \quad (8)$$

This is very similar to Eq. (6), except computed for a specific utterance here.
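The pieces above combine into a simple per-utterance training step. The sketch below is a minimal reading under stated assumptions, not the authors' implementation: `model` is assumed to return the reconstruction and the summed action log-probabilities, and the baseline $r_b$ is approximated by a running mean of past rewards, which may differ from the paper's choice. It applies the reconstruction loss of Eq. (6) to the encoder/decoder and the REINFORCE update of Eq. (7), with the reward of Eq. (8), to the gate:

```python
import torch

def training_step(model, optimizer, utt, baseline, momentum=0.9):
    """One SSAE update on a single utterance `utt` of shape (T, d).
    `model(utt)` is assumed to return the reconstruction (T, d) and the
    summed log-probabilities of the sampled segment/pass actions."""
    recon, log_prob_sum = model(utt)
    d = utt.size(1)

    # Eqs. (6)/(8): per-utterance squared reconstruction error, scaled by 1/d.
    rec_loss = ((recon - utt) ** 2).sum() / d
    reward = -rec_loss.detach()          # r_MSE; no gradient flows to the gate

    # Eq. (7): REINFORCE with a running-mean baseline r_b (an assumption).
    gate_loss = -(reward - baseline) * log_prob_sum

    optimizer.zero_grad()
    (rec_loss + gate_loss).backward()    # trains encoder/decoder and gate jointly
    optimizer.step()

    # Update the running baseline for the next step.
    return momentum * baseline + (1 - momentum) * reward.item()
```

Detaching the reward keeps the reconstruction gradient (for the encoder/decoder) separate from the policy gradient (for the gate), mirroring the two training objectives of Sections 2.2 and 2.3.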