A Deep Reinforcement Learning Framework Where Agents Learn a Basic Form of Social Movement


UPTEC F 18008
Examensarbete 30 hp
April 2018
Erik Ekstedt
Teknisk-naturvetenskaplig fakultet, UTH-enheten, Uppsala universitet

Abstract

For social robots to move and behave appropriately in dynamic and complex social contexts, they need to be flexible in their movement behaviors. The natural complexity of social interaction makes this a difficult property to encode programmatically. Instead of programming these behaviors by hand, it could be preferable to have the system learn them. In this project, a framework is created in which an agent, through deep reinforcement learning, can learn how to mimic poses, here defined as the most basic case of social movement. The framework aimed to be as agent-agnostic as possible and suitable for both real-life robots and virtual agents, through an approach called "dancer in the mirror". The framework utilized a learning algorithm called PPO and trained agents, as a proof of concept, both in a virtual environment for the humanoid robot Pepper and for virtual agents in a physics simulation environment. The framework was meant to be a simple starting point that could be extended to incorporate increasingly complex tasks. This project shows that the framework was functional for agents learning to mimic poses in a simplified environment.
Handledare (supervisor): Alex Yuan Gao. Ämnesgranskare (subject reader): Ginevra Castellano. Examinator: Tomas Nyberg. ISSN: 1401-5757, UPTEC F 18008

Populärvetenskaplig sammanfattning (Popular Science Summary, translated from Swedish)

Humanity is becoming increasingly dependent on technology, and development is moving faster than ever before. A decade ago the first smartphone was introduced, and platforms such as Facebook and YouTube appeared and changed society forever. Given how quickly technology evolves, it is entirely possible that we will soon live in a society where social robots feel as natural as our smartphones do today: robots that can help us with everything from companionship to healthcare, emergency services and education.

When people interact and communicate in everyday life, that is, when we meet in the real world, we make extensive use of gestures and movement. We move in different ways depending on which social circle we are part of or what kind of social situation we are in. Watching someone greet their best friends at a party is different from watching someone leave a funeral. We use our body language to clarify what we mean, and we can judge other people's moods by analyzing their posture and the way they carry themselves. If social robots are to be a natural part of society and to interact and communicate with us humans, it would be advantageous if they had similar abilities. Social robots should be able to move in a natural way that adds something to the social interaction and makes people feel calm and safe. Their behavior should change depending on how others in the social context behave. Social situations are dynamic by nature, which makes it difficult to program in advance the exact knowledge required to move in a way that humans find convincing. Instead of deciding how a robot should behave and programming in different types of movements, it would be better if the robot learned this itself.
In recent years, deep learning, an area of machine learning that uses neural networks, has shown great progress in many different fields. AI is a popular cultural concept and receives a lot of attention in the media, with reports on everything from self-driving cars and personal assistants to cancer-diagnosing systems, and in most cases deep learning and neural networks are the underlying technology. Neural networks have existed since the 1940s, but it is only in recent years that they have become mainstream. Only today is enough computing power available to enough people for these neural networks to deliver the results we now see are possible. Programs of this kind are now standard in everything from audio and image recognition to translating text between languages. This is also the technology behind the programs that are now better than humans at games such as Go, Atari games and chess. These programs learned to play through a technique called reinforcement learning, which is about learning behavior in a way similar to how animals and humans learn.

Reinforcement learning uses terms such as agent, environment and reward. An agent interacts with its environment, where different actions yield different rewards depending on how good the action was. The agent tries a large number of actions and, after a certain amount of training, learns what is best to do and what should be avoided. This is general: the behaviors the agent learns depend on the environment, the reward and the learning algorithm, and different environments with different reward systems give rise to agents that are good at different things.
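The agent-environment loop described above can be sketched in a few lines of Python. The tiny `LineWorld` environment and the random policy below are purely illustrative stand-ins, not anything from the thesis; they only show the interaction pattern (act, observe, receive reward) that algorithms such as PPO build on.

```python
import random

class LineWorld:
    """A toy illustrative environment (not from the thesis): the agent
    sits on an integer line and is rewarded for staying near position 0."""
    def __init__(self, start=5):
        self.start = start
        self.pos = start

    def reset(self):
        self.pos = self.start
        return self.pos

    def step(self, action):  # action: -1 (move left) or +1 (move right)
        self.pos += action
        reward = 1.0 if abs(self.pos) < 5 else -1.0
        done = self.pos == 0
        return self.pos, reward, done

# The agent-environment loop: the agent acts, the environment returns an
# observation and a reward; a learning algorithm would use these rewards
# to improve the policy over time.
env = LineWorld()
obs = env.reset()
done = False
steps = 0
while not done and steps < 100:
    action = random.choice([-1, 1])  # random policy as a stand-in for a learned one
    obs, reward, done = env.step(action)
    steps += 1
```

A real training setup would replace the random choice with a policy network and update it from the collected rewards.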
In this project, an environment with an associated reward system is created in which an agent is to learn to mimic another agent's pose. Mimicking another agent's pose is assumed in this project to be the most elementary form of social movement; the plan is then to build on this and introduce more and more complex tasks. In addition to the environment, a recent optimization algorithm, abbreviated PPO, was used to optimize the neural networks created to solve the task. In this implementation it is important that the environment is general, so that it can train both entirely fictional virtual characters and real robots such as the humanoid robot Pepper from SoftBank Robotics. The project implemented one environment based on the program Choregraphe, through which Pepper can be controlled, and one environment based on the non-profit company OpenAI's Roboschool, built on the physics simulator Bullet. What the environments have in common is the way agents in them are to learn the act of mimicking another agent's pose. Once the environments were functional, some smaller experiments were carried out to see whether the algorithm, the environment, the reward system and the neural networks could be shown to handle the task of mimicking another agent's pose. The results from these smaller experiments show that it is possible to mimic poses in this way, in a simplified environment, but that more work is needed to make the environments more complex and relevant for realistic situations.

TABLE OF CONTENTS

1 Introduction
  1.1 Setup
  1.2 Dancer in the Mirror Approach
  1.3 Research Questions
2 Background
  2.1 Machine Learning
  2.2 Artificial Neural Networks
  2.3 Activation Functions
    2.3.1 Sigmoidal Activation Function
    2.3.2 ReLU
  2.4 Backpropagation
    2.4.1 Stochastic Gradient Descent
    2.4.2 Adam
  2.5 Architectures
    2.5.1 Convolutional Neural Network
    2.5.2 Recurrent Neural Networks
    2.5.3 Hyperparameters
  2.6 Reinforcement Learning
    2.6.1 Value Iteration
    2.6.2 Policy Optimization
    2.6.3 Actor-Critic Methods
    2.6.4 Proximal Policy Optimization
    2.6.5 Exploration vs Exploitation
  2.7 Pepper
    2.7.1 Choregraphe
  2.8 OpenAI's Gym
    2.8.1 Roboschool
3 Method
  3.1 Learning Algorithm
  3.2 Pepper Environment
  3.3 Custom Roboschool Environment
  3.4 Reward Function
  3.5 Networks
    3.5.1 Modular Approach
    3.5.2 Semi Modular Approach
    3.5.3 Combined Approach
  3.6 Experiment
  3.7 Custom Reacher Experiments
  3.8 Custom Humanoid Experiments
  3.9 Pepper Experiments
4 Results
  4.1 Reward Function
  4.2 Experiment
    4.2.1 Reacher Environment
    4.2.2 Humanoid Environment
  4.3 Pose Evaluation
  4.4 Pepper
  4.5 Code
5 Discussion and Future Work
  5.1 Custom Roboschool Environment
  5.2 Pepper
  5.3 Project
  5.4 Future Work
6 Conclusion

1 Introduction

This project aims to construct a framework for training agents to learn a basic form of social movement through end-to-end deep reinforcement learning. In human social interactions, individuals convey a lot of information through the movements of different body parts. We make many detailed movements in the facial area and with our arms, hands and overall posture. There is a wide variety of movements that humans use when engaging in social interaction, ranging from fully conscious and explicit in meaning all the way to movements we make unconsciously and are not aware of. We use the movement information of others to infer the type of social interaction we are in, as well as the emotional state and intentions of the people we socialize with. Social movements are highly context dependent, and the context changes over time.
In other words, the contexts are dynamic and require that an agent is able to adapt its behavior based on cues in the social environment.
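The pose-mimicking objective at the heart of the framework can be illustrated with a small reward sketch. The thesis defines its actual reward function later (Section 3.4); the version below is only a hypothetical stand-in that rewards an agent for minimizing the distance between its own joint angles and a target pose.

```python
import math

def mimic_reward(agent_joints, target_joints):
    """Hypothetical pose-mimicking reward (not the thesis's actual function):
    the closer the agent's joint angles are to the target pose, the closer
    the reward is to 1."""
    assert len(agent_joints) == len(target_joints)
    # Euclidean distance between the two joint-angle vectors.
    dist = math.sqrt(sum((a - t) ** 2
                         for a, t in zip(agent_joints, target_joints)))
    # Map distance into (0, 1]: identical poses give reward 1.
    return math.exp(-dist)

# A perfect imitation yields the maximum reward; a mismatched pose yields less.
perfect = mimic_reward([0.1, -0.5, 1.2], [0.1, -0.5, 1.2])
worse = mimic_reward([0.0, 0.0, 0.0], [0.1, -0.5, 1.2])
```

A dense, bounded reward of this shape gives the learning algorithm a gradient toward the target pose at every step, rather than a sparse signal only when the pose is matched exactly.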