Implementing WaveNet Using Intel® Stratix® 10 NX FPGA for Real-Time Speech Synthesis


White Paper | FPGA | Artificial Intelligence

Authors:
Jonny Shipton, Software Lead, Myrtle.ai
Jon Fowler, FPGA Lead, Myrtle.ai
Chris Chalmers, Senior Developer, Myrtle.ai
Sam Davis, Head of Machine Learning, Myrtle.ai
Sam Gooch, Machine Learning Engineer, Myrtle.ai
Giuseppe Coccia, Machine Learning Engineer, Myrtle.ai

Abstract

Real-time text-to-speech models are widely deployed in interactive voice services such as voice assistants and smart speakers. However, deploying these models is often challenging due to the strict latency requirements that they face. We present an implementation of WaveNet, a state-of-the-art vocoder, that can generate 256 16 kHz audio streams at near-human-level quality in real time: 8 times higher throughput than a hand-optimized GPU solution. Our implementation is enabled by the Intel® Stratix® 10 NX FPGA, demonstrating the capability of FPGAs to deliver high-throughput, low-latency inference for real-world applications powered by neural networks.

Table of Contents
Abstract
I. Introduction
II. Model Description
III. Model Compression
    BFP16 Quantization
IV. Hardware Implementation
    FPGA Processing Architecture
    Using Intel Stratix 10 NX FPGA AI Tensor Blocks
    Implementing Dilated Convolutions on the FPGA
    Programming Model
V. Results
    Methodology
    Model Quality
    Model Performance
VI. Conclusion
References

I. Introduction

The increasing use of voice assistants has been powered, in large part, by advances in the field of Text-to-Speech (TTS). State-of-the-art speech synthesis systems consist of two neural networks that are run sequentially to generate audio. The first model takes text as input and generates acoustic features, such as spectrograms, while the second model, referred to as a vocoder, takes these intermediate features and produces speech. Tacotron 2 is often used as the first model. In this paper, we focus on the second model in the speech synthesis system. WaveNet [1] is a state-of-the-art vocoder that is capable of producing speech with near-human-level naturalness [2]. The key to the model's quality is its autoregressive loop, but this property also makes the network exceptionally challenging to implement for real-time applications.

As a result, TTS research has focused on finding alternative vocoder architectures such as Parallel-WaveNet [3], WaveRNN [4], ClariNet [5] and WaveGlow [6] that are better suited to efficient inference on existing hardware. There is a degree of ambiguity as to the highest quality vocoder, as audio quality evaluation is subjective, but all authors agree the original WaveNet architecture produces at least as good, if not higher, quality audio than more recent approaches [3], [6]–[9].

There have been some efforts to accelerate the WaveNet model directly rather than using an alternative architecture, including some other FPGA solutions [10]. However, these are not known to achieve real-time audio synthesis. The highest-performance implementation that we have been able to identify is the NVIDIA* nv-wavenet¹, which achieves real-time audio synthesis using highly optimized hand-written CUDA kernels.

In this paper, we implement a WaveNet model with approximately the same number of parameters in the repeated part of the network as the largest configuration of nv-wavenet available in the example repository. The repeated part of the network is the most latency-sensitive section. By using Block Floating Point (BFP16) quantization and the Intel Stratix 10 NX FPGA, we are able to deploy this model and produce 256 16 kHz streams in real time. We show that the FPGA implementation remains efficient at higher frequencies, demonstrating 160 24 kHz streams and 128 32 kHz streams in real time.

The rest of this paper is structured as follows: the WaveNet model architecture is presented in more detail in Section II. The concepts of model compression by the use of quantization and the BFP16 data format are discussed in Section III. Section IV describes the implementation of the WaveNet model on the Intel Stratix 10 NX FPGA, including the use of AI Tensor Blocks, BFP16 computation and high-bandwidth memory (HBM). In Section V we present our results, including the performance of the implementation, as well as the quality of the generated audio. Finally, in Section VI, we summarize our results and interpret their importance for future deployments of neural networks in the data center with FPGAs.

¹ https://github.com/NVIDIA/nv-wavenet
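As a quick illustration (not part of the original white paper), the aggregate sample throughput and per-step latency budget implied by these three real-time configurations follow directly from the stream counts and sample rates quoted above; Section II returns to the 62.5 µs per-step budget for the 16 kHz case.

    # Illustrative arithmetic only: the stream counts and sample rates are taken from
    # the configurations quoted in the introduction; nothing here is the authors' code.
    configurations = [
        (256, 16_000),  # 256 streams at 16 kHz
        (160, 24_000),  # 160 streams at 24 kHz
        (128, 32_000),  # 128 streams at 32 kHz
    ]

    for streams, rate_hz in configurations:
        aggregate_samples_per_s = streams * rate_hz  # samples the accelerator must emit per second
        step_budget_us = 1e6 / rate_hz               # time budget for one autoregressive step
        print(f"{streams:>3} streams x {rate_hz // 1000} kHz -> "
              f"{aggregate_samples_per_s / 1e6:.3f} M samples/s, "
              f"{step_budget_us:.1f} us per step per stream")

Each configuration works out to roughly 4 million samples per second in aggregate, while the per-stream step budget shrinks from 62.5 µs at 16 kHz to 31.25 µs at 32 kHz.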
Audio Layer         Type               Parameters               GOP/second
                                       Per-layer    Total       Per-layer   Total
Pre-processing layers
  Embedding         Embedding          -            30,720      -           -
  Feature Upsample  ConvTranspose1d    -            5,120,080   -           0.82
Repeated layers
  Dilation          Dilated Conv1d     57,840       925,440     1.84        29.49
  Conditional       Conv1d             19,440       311,040     0.61        9.83
  Residual          Conv1d             14,520       217,800     0.46        6.91
  Skip              Conv1d             29,040       464,640     0.92        14.75
Post-processing layers
  Out               Conv1d             -            61,440      -           1.96
  End               Conv1d             -            65,536      -           2.09
Total                                  -            7,196,696   -           65.85

Table 1. A breakdown of the model's parameters and giga-operations required per second of synthesized audio output.

II. Model Description

The task of audio generation is that of producing a sequence of values [x_1, ..., x_n]. Just like in a regular audio file, each value x_i is an integer representing a discrete sample from a waveform. There are two critical parameters of any audio stream: the sample rate, which is the number of these samples that make up a single second of audio, measured in Hertz; and the bit depth, which is the range of discrete values that each x_i can take. Increasing these parameters improves the quality of the audio stream but requires the production of more samples per second or an increase in the number of computations required for each sample. In this paper, we use a sample rate of 16 kHz and a bit depth of 16, giving each sample a range of 65,536 possible values, as we find these to achieve both near-human-level quality and high performance.

WaveNet is a convolutional neural network that models the conditional probability of producing a sample given all previous samples: p(x_t | x_1, ..., x_{t-1}). The input to the first layer of the model at a given step is an embedding of the output from the previous time step. The core of the network consists of a series of repeated layers that each contain four convolutions; see Figure 1. The first convolution in each repeated layer is a dilated convolution whose input is the concatenation of the output of the previous layer at the current step, t, and a past step. The past step for the kth repeated layer depends on the dilation cycle parameter D and is set to step t - 2^(k mod D).

Each repeated layer also produces a skip output. The skip outputs are summed and used to produce the final model output after post-processing with two more convolutional layers. In addition to the outputs from the previous layer, each repeated layer also receives a conditional input that allows control over what audio is generated. The conditional inputs are generated by applying an upsampling transposed convolution to conditional features. In early versions of WaveNet these conditional features were simple linguistic features; however, current approaches typically use spectrograms generated by a separate model such as Tacotron 2 [2] or Deep Voice 3 [11].

[Figure 1. Model layers, activations and operations of the implementation. The 80 mel-spectrogram input features are upsampled on the CPU by an 800x1, 200-strided transposed convolution. The embedding, the L repeated layers (a 2x1 dilated convolution producing 2r outputs, a gated tanh/σ activation, and 1x1 conditional, residual and skip convolutions) and the post-processing 1x1 convolutions with ReLU activations run on the FPGA; a sample is drawn from the a audio-channel output and the 8-bit µ-law value is decoded to produce the 16 kHz, 16-bit audio output stream.]
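To make the structure of a repeated layer concrete, the sketch below expresses its four convolutions for a single autoregressive step in PyTorch. It is a reconstruction from the description above and from Figure 1, not the authors' implementation: at a single time step each 1x1 convolution reduces to a matrix-vector product (written here with nn.Linear), and the class, argument and variable names are assumptions.

    import torch
    import torch.nn as nn

    class RepeatedLayer(nn.Module):
        """One WaveNet repeated layer, written for single-step autoregressive inference.

        Sketch based on Section II and Figure 1. r, s and c are the residual, skip and
        conditional-feature channel counts; the paper uses r = 120, s = 240 and 80
        upsampled mel-spectrogram features.
        """

        def __init__(self, r: int = 120, s: int = 240, c: int = 80):
            super().__init__()
            # 2x1 dilated convolution over [past step, current step] -> 2r channels
            self.dilated = nn.Linear(2 * r, 2 * r)
            # 1x1 convolution mixing the conditional features into the gate -> 2r channels
            self.conditional = nn.Linear(c, 2 * r)
            # 1x1 convolutions producing the residual and skip outputs
            self.residual = nn.Linear(r, r)
            self.skip = nn.Linear(r, s)

        def forward(self, x_t, x_past, cond_t):
            # x_t, x_past: (batch, r) activations at the current step and at the dilated
            #              past step t - 2^(k mod D); cond_t: (batch, c) conditional features
            z = self.dilated(torch.cat([x_past, x_t], dim=-1)) + self.conditional(cond_t)
            filt, gate = z.chunk(2, dim=-1)              # split the 2r channels
            a = torch.tanh(filt) * torch.sigmoid(gate)   # gated activation from Figure 1
            res_out = self.residual(a) + x_t             # residual input to the next layer
            skip_out = self.skip(a)                      # summed over layers for post-processing
            return res_out, skip_out

With these channel counts, the per-layer parameter totals of the sketch, including biases (57,840 dilated, 19,440 conditional, 14,520 residual and 29,040 skip), match the per-layer figures in Table 1, which suggests the reconstruction is close to the model the authors describe; the exact placement of the residual addition remains an assumption.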
After the two post-processing convolutional layers are applied, the model samples from the probability distribution p(x_t | x_1, ..., x_{t-1}); we find sampling to produce higher fidelity speech than other methods such as selecting the value with the highest probability. The model produces audio output that has a bit depth of 16. However, to reduce the computational overhead of producing 2^16 = 65,536 values, the model instead generates 8-bit µ-law encoded values that are then decoded to produce the final 16-bit audio sample.

At inference time, calculating the output for one step requires the value of the previous step, so it is necessary to calculate each step individually and sequentially. For 16 kHz audio it is necessary to run the model 16,000 times per second to achieve real-time audio generation. This means that a single forward pass through the model must take at most 62.5 µs.

WaveNet is parameterized by the number of channels in its convolutions: the number of skip, residual and audio channels, as well as by the number of repeated layers and the dilation cycle parameter D. In this paper we use a model with s = 240 skip channels, r = 120 residual channels, a = 256 audio channels, L = 16 repeated layers and D = 8. This model has 7.2 million parameters, as shown in Table 1, and requires approximately 65.85 GFLOPs of computation to generate one second of audio. We map the ConvTranspose1d layer to the CPU and all other layers to the FPGA. This significantly reduces the number of model parameters that must be held on the FPGA, requiring approximately 2.1 million parameters to be stored on the FPGA overall. The layers running on the FPGA still account for the majority of the computation, requiring 98.8% of the total operations of the model.

III. Model Compression

Quantization is one technique for model compression. Quantizing a model consists of using a different numerical format for the parameters and/or activations. This lowers the power consumption and latency, as both the memory requirements are reduced and the operations can be more efficiently executed in hardware.
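The BFP16 format itself is described in a later section of the white paper that is not reproduced in this excerpt. As a rough illustration of the general idea behind block floating point, the sketch below quantizes a block of weights so that they share a single exponent while each element keeps a short signed mantissa; the block size, mantissa width and rounding scheme here are illustrative assumptions, not the paper's exact specification.

    import numpy as np

    def bfp_quantize(block: np.ndarray, mantissa_bits: int = 8):
        """Quantize a 1-D block of floats to a shared-exponent (block floating point) form.

        Illustrative only: one exponent is chosen for the whole block from its largest
        magnitude, and each element keeps a signed integer mantissa of `mantissa_bits` bits.
        """
        max_mag = float(np.max(np.abs(block)))
        if max_mag == 0.0:
            return np.zeros(block.shape, dtype=np.int32), 0
        # Shared exponent chosen so the largest value fits in the mantissa range.
        shared_exp = int(np.ceil(np.log2(max_mag))) - (mantissa_bits - 1)
        limit = 2 ** (mantissa_bits - 1) - 1
        mantissas = np.clip(np.round(block / 2.0 ** shared_exp), -limit, limit).astype(np.int32)
        return mantissas, shared_exp

    def bfp_dequantize(mantissas: np.ndarray, shared_exp: int) -> np.ndarray:
        """Reconstruct approximate floating-point values from the block representation."""
        return mantissas.astype(np.float64) * 2.0 ** shared_exp

    # Example: quantize a block of 16 weights and inspect the reconstruction error.
    weights = np.random.randn(16) * 0.1
    m, e = bfp_quantize(weights)
    print(np.max(np.abs(weights - bfp_dequantize(m, e))))

Because every element in a block shares the exponent, dot products reduce to integer multiply-accumulates on the mantissas followed by a single exponent adjustment, which is the property that makes block floating point formats attractive for dense matrix hardware such as the AI Tensor Blocks mentioned above.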