A Cross-Channel Attention-Based Wave-U-Net for Multi-Channel Speech Enhancement

Total Page:16

File Type:pdf, Size:1020Kb

A Cross-Channel Attention-Based Wave-U-Net for Multi-Channel Speech Enhancement A Cross-channel Attention-based Wave-U-Net for Multi-channel Speech Enhancement Minh Tri Ho1, Jinyoung Lee1, Bong-Ki Lee2, Dong Hoon Yi2, Hong-Goo Kang1 1Department of Electrical and Electronic Engineering, Yonsei University, Seoul, South Korea 2Artificial Intelligence Lab, LG Electronics Co., Seoul, South Korea [email protected] Abstract performance and robustness are not reliable in harsh environ- ments. Recently, deep-learning based approaches have been In this paper, we present a novel architecture for multi- introduced to the first stage of enhancement [11, 12]. In ad- channel speech enhancement using a cross-channel attention- dition, end-to-end models have been designed to extract both based Wave-U-Net structure. Despite the advantages of utiliz- spectral and spatial features. Examples of this approach include ing spatial information as well as spectral information, it is chal- the Wave-U-Net model [7] and time-domain convolutional au- lenging to effectively train a multi-channel deep learning sys- toencoder model [13]. Since end-to-end structures only mimic tem in an end-to-end framework. With a channel-independent the behavior of beamforming at the first layer of the encoder, encoding architecture for spectral estimation and a strategy to their capabilities for spatial filtering should be limited by the extract spatial information through an inter-channel attention signal-to-noise ratio (SNR) of the input signal and the number mechanism, we implement a multi-channel speech enhance- of microphones. In our preliminary experiments, we found out ment system that has high performance even in reverberant and that this simple approach would not work in extremely low SNR extremely noisy environments. Experimental results show that conditions. the proposed architecture has superior performance in terms of To overcome the problem, we modify the Wave-U-Net signal-to-distortion ratio improvement (SDRi), short-time ob- structure to process each channel separately, but utilize inter- jective intelligence (STOI), and phoneme error rate (PER) for channel information using a nonlinear spatial filter designed by speech recognition. a lightweight attention block. Motivated by the attention block Index Terms: Multi-channel Speech Enhancement, Wave-U- proposed in [14], we exploit inter-channel information in a more Net, Cross-Channel Attention straightforward manner which directly compares information in one channel with another. Moreover, since encoding each input 1. Introduction channel independently helps to preserve spatial information, the Speech enhancement aiming to enhance the quality of the target output of each encoding layer can be repeatedly utilized to esti- speech signal is crucial to improving the robustness to noise for mate inter-channel information speech recognition [1, 2]. With the development of deep learn- Our contributions in this paper are three-fold. First, we pro- ing, data-driven speech enhancement approaches have shown pose a novel modification of the well-known Wave-U-Net ar- breakthroughs when using a single microphone. In most of sin- chitecture for multi-channel speech enhancement. It separately gle channel approaches [3, 4, 5] the speech signal is first trans- encodes each channel, then interchanges information between formed into the frequency domain, after which time-frequency encoded channels after each downsampling block is efficiently (TF) masks are estimated to determine the amount of noise re- combined by a convolution size of one before passing to the duction in each TF bin. However, their performance improve- skip connection. Secondly, we introduce a cross-channel atten- ments are not significant in low signal-to-noise environments tion block to boost network performance by effectively exploit- because of their limitations on estimating the phase spectrum. ing spatial information of multi-channel data. To the best of our Williamson et al. [6] estimated the TF masks in the complex knowledge, our paper is the first work proving a model’s per- domain, but it was not easy to train the network. To over- formance in an artificial multi-channel acoustic scenario with come this limitation, Wave-U-Net [7] was proposed as a time all of the following four intricate challenges: minimum num- domain approach. Since the Wave-U-Net model directly es- ber of microphones (only two microphones with a small dis- timates the clean target signal waveform, it does not need to tance between them), varying positions of both speech and noise consider the phase problem and spectral efficiency caused by sources, reverberation and extremely low SNR conditions (-10 the time-frequency transformation with a fixed-length analysis dB, -5 dB cases). frame. The rest of this paper is organized as follows. Section When multiple microphones are available, the performance 2 gives a review of related works while section 3 describes of speech enhancement algorithms can be further improved our baseline model. Section 4 represents our proposed cross- because of spatially related information between the micro- channel attention Wave-U-Net architecture. Section 5 describes phones [8]. Statistical approaches such as beamforming [9] our experiment and analyses the experiment results, followed and multi-channel Wiener filtering [10] first estimate the di- by the conclusion in the final section. rection of arrival (DOA) between microphones, then enhance the incoming signal from the estimated source direction but 2. Related Works attenuate interference from other directions using a linear fil- ter [8]. Although the methods are fast and lightweight, their With the success of deep-learning based single-channel speech enhancement approaches, much research has been done on This work is supported and funded by LG Electronics Co., Ltd. combining models with traditional statistic-based beamforming Concat Mic 1 Enhanced speech Mic 2 Encoder , Mic 1 , 24 , 2, 2, 48 ∙ ∙ ∙ 240 Decoder /512, /512, 240 /4, 48 /4, /2, 24 /2, 24 /2, 48 /4, / 푇 /1024 푇 푇 ( , 1) 푇 푇 푇 푇 푇 푇 Conv1d kernel size 1 Figure 1: Wave-U-Net structure for multi-channel data. , Conv1d DS stands for a downsampling block and US does for an Mic 2 kernel size 15 , 24 , ∙ ∙ ∙ 240 /512, /512, 240 /4, 48 /4, /2, 24 /2, 24 /2, 48 /2, /4, 48 /4, /1024 푇 푇 푇 ( , 1) 푇 푇 푇 푇 푇 upsampling block. L denotes the number of downsam- 푇 pling/upsampling blocks. Downsampling Figure 2: Proposed Cross-channel attention-based Wave-U-Net algorithms. Typical works that manifest this idea include [11], structure for multi-channel data. Feature maps obtained at the [12] and [15]. The common strategy of these works is that the encoding layer are interchanged between channels after pro- neural network tries to reduce single channel noise using TF cessing each downsampling block. masks at the first stage, after which beamforming is applied to linearly integrate the multi-channel signals. This approach 4. Proposed Model not only gives better results than the pure statistical methods In this section, we propose a cross-channel attention-based but also shows robustness in terms of various noisy types and multi-channel Wave-U-Net model by modifying the baseline ar- SNR ranges. However, in this method, the neural network only chitecture described in Section 3. Fig. 2 illustrates the architec- learns the spectral characteristics of a single channel, not spa- ture of our proposed model. Although the proposed algorithm tial information. To address this limitation, recent deep learning can be generalized to an arbitrary number of channels, we fix approaches have used inter-channel features as additional inputs the number of channels to two for simplicity in this paper. to the network, such as generalized cross-correlation (GCC), in- teraural phase or level difference (IDP, ILD) [16, 17]. Although 4.1. Encoder Structure learning spectral information together with spatial features re- sults in performance improvements, adding spatial features as To provide flexibility for the processing of each channel and to the separated input makes it difficult to learn the mutual rela- explicitly utilize cross-channel relationships, the encoder pro- tionship between spatial and spectral information. cesses each channel independently. Feature maps from the en- coder of each channel are used as inputs to the cross-channel To handle multi-channel data, the author of Wave-U-Net attention block, then are interchanged between channels. The proposed that the first layer of the network takes into account main objective of the cross-channel attention block is to derive all input channels. A similar solution could be found in [13], the relationship between two channels. Details of this block are in which the authors used a dilated convolution autoencoder in- described in subsection 4.3. At the bottleneck of the network, stead of the U-Net structure. However, when handling all the feature maps obtained by the encoder of each channel are con- input channels together, the first convolution layer only played catenated, after which they are projected into one feature map the role of performing nonlinear channel fusion. using a 1-D convolution layer. When the number of channels is greater than two, one channel is chosen as a reference channel 3. Wave-U-Net Baseline for Speech and feature maps are interchanged between the reference and Enhancement other channels. 4.2. Decoder Structure Wave-U-Net was originally developed for a singing voice source separation task by reconfiguring the U-Net structure [5] In each decoding layer, feature maps extracted from the encod- in the time domain. Based on an encoder-decoder structure, it ing layer are fused by a 1-D convolution layer with size 1. The introduces skip connections between the same layer in down- processing is different from the baseline structure that uses a sampling and upsampling blocks so that the high-level features direct skip connection between the same level layers of the en- of the decoding layer also include local features from the en- coder and decoder.
Recommended publications
  • Fast, High Quality Noise
    The Importance of Being Noisy: Fast, High Quality Noise Natalya Tatarchuk 3D Application Research Group AMD Graphics Products Group Outline Introduction: procedural techniques and noise Properties of ideal noise primitive Lattice Noise Types Noise Summation Techniques Reducing artifacts General strategies Antialiasing Snow accumulation and terrain generation Conclusion Outline Introduction: procedural techniques and noise Properties of ideal noise primitive Noise in real-time using Direct3D API Lattice Noise Types Noise Summation Techniques Reducing artifacts General strategies Antialiasing Snow accumulation and terrain generation Conclusion The Importance of Being Noisy Almost all procedural generation uses some form of noise If image is food, then noise is salt – adds distinct “flavor” Break the monotony of patterns!! Natural scenes and textures Terrain / Clouds / fire / marble / wood / fluids Noise is often used for not-so-obvious textures to vary the resulting image Even for such structured textures as bricks, we often add noise to make the patterns less distinguishable Ех: ToyShop brick walls and cobblestones Why Do We Care About Procedural Generation? Recent and upcoming games display giant, rich, complex worlds Varied art assets (images and geometry) are difficult and time-consuming to generate Procedural generation allows creation of many such assets with subtle tweaks of parameters Memory-limited systems can benefit greatly from procedural texturing Smaller distribution size Lots of variation
    [Show full text]
  • Synthetic Data Generation for Deep Learning Models
    32. DfX-Symposium 2021 Synthetic Data Generation for Deep Learning Models Christoph Petroll 1 , 2 , Martin Denk 2 , Jens Holtmannspötter 1 ,2, Kristin Paetzold 3 , Philipp Höfer 2 1 The Bundeswehr Research Institute for Materials, Fuels and Lubricants (WIWeB) 2 Universität der Bundeswehr München (UniBwM) 3 Technische Universität Dresden * Korrespondierender Autor: Christoph Petroll Institutsweg 1 85435 Erding Germany Telephone: 08122/9590 3313 Mail: [email protected] Abstract The design freedom and functional integration of additive manufacturing is increasingly being implemented in existing products. One of the biggest challenges are competing optimization goals and functions. This leads to multidisciplinary optimization problems which needs to be solved in parallel. To solve this problem, the authors require a synthetic data set to train a deep learning metamodel. The research presented shows how to create a data set with the right quality and quantity. It is discussed what are the requirements for solving an MDO problem with a metamodel taking into account functional and production-specific boundary conditions. A data set of generic designs is then generated and validated. The generation of the generic design proposals is accompanied by a specific product development example of a drone combustion engine. Keywords Multidisciplinary Optimization Problem, Synthetic Data, Deep Learning © 2021 die Autoren | DOI: https://doi.org/10.35199/dfx2021.11 1. Introduction and Idea of This Research Due to its great design freedom, additive manufacturing (AM) shows a high potential of functional integration and part consolidation [1]-[3]. For this purpose, functions and optimization goals that are usually fulfilled by individual components must be considered in parallel.
    [Show full text]
  • Digital to Analog, PWM, Stepper Motors
    Lecture #19 Digital To Analog, PWM, Stepper Motors 18-348 Embedded System Engineering Philip Koopman Monday, 28-March-2016 Electrical& Computer ENGINEERING © Copyright 2006-2016, Philip Koopman, All Rights Reserved Example System: HVAC Compressor Expansion Valve [http://www.solarpowerwindenergy.org] 2 HVAC Embedded Control Compressors (reciprocating & scroll) • Smart loading and unloading of compressor – Want to minimize motor turn on/turn off cycles – May involve bypassing liquid so compressor keeps running but doesn’t compress • Variable speed for better output temperature control • Diagnostics and prognostics – Prevent equipment damage (e.g., liquid entering compressor, compressor stall) – Predict equipment failures (e.g., low refrigerant, motor bearings wearing out) Expansion Valve • Smart control of amount of refrigerant evaporated – Often a stepper motor • Diagnostics and prognostics – Low refrigerant, icing on cold coils, overheating of hot coils System coordination • Coordinate expansion valve and compressor operation • Coordinate multiple compressors • Next lecture – talk about building-level system level diagnostics & coordination 3 Where Are We Now? Where we’ve been: • Interrupts, concurrency, scheduling, RTOS Where we’re going today: • Analog Output Where we’re going next: • Analog Input •Human I/O • Very gentle introduction to control •… • Test #2 and last project demo 4 Preview Digital To Analog Conversion • Example implementation • Understanding performance • Low pass filters Waveform encoding PWM • Digital way
    [Show full text]
  • Networking, Noise
    Final TAs • Jeff – Ben, Carmen, Audrey • Tyler – Ruiqi – Mason • Ben – Loudon – Jordan, Nick • David – Trent – Vijay Final Design Checks • Optional this week • Required for Final III Playtesting Feedback • Need three people to playtest your game – One will be in class – Two will be out of class (friends, SunLab, etc.) • Should watch your victim volunteer play your game – Don’t comment unless necessary – Take notes – Pay special attention to when the player is lost and when the player is having the most fun Networking STRATEGIES The Illusion • All players are playing in real-time on the same machine • This isn’t possible • We need to emulate this as much as possible The Illusion • What the player should see: – Consistent game state – Responsive controls – Difficult to cheat • Things working against us: – Game state > bandwidth – Variable or high latency – Antagonistic users Send the Entire World! • Players take turns modifying the game world and pass it back and forth • Works alright for turn-based games • …but usually it’s bad – RTS: there are a million units – FPS: there are a million bullets – Fighter: timing is crucial Modeling the World • If we’re sending everything, we’re modeling the world as if everything is equally important Player 1 Player 2 – But it really isn’t! My turn Not my turn Read-only – Composed of entities, only World some of which react to the StateWorld 1 player • We need a better model to solve these problems Client-Server Model • One process is the authoritative server Player 1 (server) Player 2 (client) – Now we
    [Show full text]
  • Procedural Generation of Infinite Terrain from Real-World Elevation
    Journal of Computer Graphics Techniques Vol. 3, No. 1, 2014 http://jcgt.org Designer Worlds: Procedural Generation of Infinite Terrain from Real-World Elevation Data Ian Parberry University of North Texas Figure 1.A 180◦ panorama of terrain generated using elevation data from Utah. Abstract The standard way to procedurally generate random terrain for video games and other applica- tions is to post-process the output of a fast noise generator such as Perlin noise. Tuning the post-processing to achieve particular types of terrain requires game designers to be reason- ably well-trained in mathematics. A well-known variant of Perlin noise called value noise is used in a process accessible to designers trained in geography to generate geotypical terrain based on elevation statistics drawn from widely available sources such as the United States Geographical Service. A step-by-step process for downloading and creating terrain from real- world USGS elevation data is described, and an implementation in C++ is given. 1. Introduction Terrain generation is an example of what is known in the game industry as procedu- ral content generation. Procedural content generation needs to have three important properties. First, it needs to be fast, meaning that it has to use only a fraction of the computing power on a current-generation computer. Second, it needs to be both ran- dom and structured so that it creates content that is varied and interesting. Third, it needs to be controllable in a natural and intuitive way. As noted in the excellent survey paper by Smelik et al. [2009], procedural terrain generation is often based on fractal noise generators such as Perlin noise [Perlin 1985; Perlin 2002] (for more details see, for example, the book by Ebert et al.
    [Show full text]
  • Coherent Noise
    Coherent Noise User manual Table of Contents Introduction...........................................................................................5 Chapter I. .............................................................................................6 Noise Basics........................................................................................6 Combining Noise. ................................................................................8 CoherentNoise library...........................................................................9 Chapter II............................................................................................12 Noise generators................................................................................12 Generator base class.......................................................................12 ValueNoise.....................................................................................13 GradientNoise................................................................................14 ValueNoise2D and GradientNoise2D...................................................16 Constant.......................................................................................16 Function........................................................................................17 Cache...........................................................................................17 Patterns........................................................................................18 Cylinders....................................................................................18
    [Show full text]
  • Interfacing to Analog World Sensor Interfacing
    Interfacing to Analog World Sensor Interfacing • Introduction to Analog to digital Conversion • Why Analog to Digital? • Basics of A/D Conversion. • A/D converter inside PIC16F887 • Related Problems Prepared By- Mohammed Abdul Kader Assistant Professor, EEE, IIUC Introduction to Analog to digital Conversion Signals in the real world are analog: light, sound, temperature, pressure, acceleration or other phenomenon. So, real-world signals must be converted into digital, using a circuit called ADC (Analog-to-Digital Converter), before they can be manipulated by digital equipment. When you scan a picture with a scanner what the scanner is doing is an analog-to-digital conversion: it is taking the analog information provided by the picture (light) and converting into digital. When you record your voice on your computer, you are using an analog-to-digital converter to convert your voice, which is analog, into digital information. When an audio CD is recorded at a studio, once again analog-to-digital is taking place, converting sounds into digital numbers that will be stored on the disc. Whenever we need the analog signal back, the opposite conversion – digital-to-analog, which is done by a circuit called DAC, Digital-to-Analog Converter – is needed. When you play an audio CD, what the CD player is doing is reading digital information stored on the disc and converting it back to analog so you can hear the audio. 2 Lecture Materials on "Interfacing to Analog World", By- Mohammed Abdul Kader, Assistant Professor, EEE, IIUC Why Analog to Digital? 1. Reducing Noise: Since analog signals can assume any value, noise is interpreted as being part of the original signal.
    [Show full text]
  • Fast Paralleled Removal Method of Impulsive Noise Using Edge Strength
    Proceedings of the 6th IIAE International Conference on Intelligent Systems and Image Processing 2018 Fast Paralleled Removal Method of Impulsive Noise Using Edge Strength Yusuke Koshimuraa*, Takashi Miyazakia, Yasuki Yokoyamaa, Yasushi Amaria, Hiroaki Yamamotob aNational Institute of Technology, Nagano College, 716 Tokuma, Nagano city, Nagano Prefecture and 381-8550, Japan bShinshu University, 4-17-1 Wakasato, Nagano city, Nagano Prefecture and 380-0928, Japan *Yusuke KOSHIMURA: [email protected] Takashi MIYAZAKI: [email protected] Abstract of the variation in sensitivity among light receiving cells, impulsive noise is superimposed on the images and The Median Filter is known as a method for effectively deterioration is likely to occur. removing to impulsive noise. However, this method occurs The Median Filter method (MF) is effective to remove deteriorating image quality since the filtering process is also such as these noises (1,2), but this method performs processing applied to non-noise pixels. Therefore, a method called the on all pixels, so degrades non-noise pixel. As a method Switching Median Filter, improved capable of distinguishing solving this problem, the Switching Median Filter method between noise pixels and non-noise pixels has been proposed. (SMF) using switching type noise detection filter has been Our Multi-directional Switching Median Filter is one of proposed. Therefore, degradation of non-noise pixels caused which has been improved such as same. As the first step in by MF is able to be suppressed and both denoising and edge this method, the detection and removal processing of pixels preservation are able to be suppressed and both denoising considered as noise are performed by using a 2x2 detection and edge preservation are able to be achieved (3-12).
    [Show full text]
  • BACHELOR's THESIS Implicit Procedural Textures
    2007:12 HIP BACHELOR'S THESIS Implicit Procedural Textures as a means of saving texture memory Joakim Lindqvist Luleå University of Technology BSc Programmes in Engineering BSc programme in Computer Engineering Department of Skellefteå Campus Division of Leisure and Entertainment 2007:12 HIP - ISSN: 1404-5494 - ISRN: LTU-HIP-EX--07/12--SE Implicit Procedural Textures as a means of saving texture memory Page | I Joakim Lindqvist LTU Skellefteå 6/4/07 Abstract This thesis explores the possibility of exchanging regular textures with procedural textures. It focuses on classic procedural textures like different types of rock and sand. Most time is spent on implicit textures generation on a GPU, this means a limit to the types of algorithms that can be used. As such, much of the work focuses on noise implementations. The goal is to lower video memory usage by changing regular sampled textures with procedural ones. In the end a few ways of creating textures with a satisfactory level of detail is suggested using a blend of regular textures and procedural ones. Sammanfattning Detta examensarbete undersöker möjligheterna att byta ut vanliga texturer mot procedurella texturer. Den fokuserar på klassiska procedurella texturer som exempelvis sten och sand. Jag jobbar mest med implicita texturer körda på GPUn vilket medför en del begränsningar i vilka algoritmer som kan köras. På grund av detta ligger väldigt mycket fokus på olika noise implementationer. Målet är att minska texturminnet som används igenom att ändra vanliga texturer med procedurella motsvarigheter. I slutändan presenterar jag några sätt att skapa texturer med tillräckliga detaljer där vi använder en blandning av vanliga texturer och procedurella.
    [Show full text]
  • Scene Classification Based on a Deep Random-Scale Stretched
    remote sensing Article Scene Classification Based on a Deep Random-Scale Stretched Convolutional Neural Network Yanfei Liu, Yanfei Zhong * ID , Feng Fei, Qiqi Zhu and Qianqing Qin State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan 430079, China; [email protected] (Y.L.); [email protected] (F.F.); [email protected] (Q.Z.); [email protected] (Q.Q.) * Correspondence: [email protected]; Tel./Fax: +86-27-6877-9969 Received: 10 January 2018; Accepted: 19 February 2018; Published: 12 March 2018 Abstract: With the large number of high-resolution images now being acquired, high spatial resolution (HSR) remote sensing imagery scene classification has drawn great attention but is still a challenging task due to the complex arrangements of the ground objects in HSR imagery, which leads to the semantic gap between low-level features and high-level semantic concepts. As a feature representation method for automatically learning essential features from image data, convolutional neural networks (CNNs) have been introduced for HSR remote sensing image scene classification due to their excellent performance in natural image classification. However, some scene classes of remote sensing images are object-centered, i.e., the scene class of an image is decided by the objects it contains. Although previous methods based on CNNs have achieved comparatively high classification accuracies compared with the traditional methods with handcrafted features, they do not consider the scale variation of the objects in the scenes. This makes it difficult to directly utilize CNNs on those remote sensing images belonging to object-centered classes to extract features that are robust to scale variation, leading to wrongly classified scene images.
    [Show full text]
  • Package 'Ambient'
    Package ‘ambient’ March 21, 2020 Type Package Title A Generator of Multidimensional Noise Version 1.0.0 Maintainer Thomas Lin Pedersen <[email protected]> Description Generation of natural looking noise has many application within simulation, procedural generation, and art, to name a few. The 'ambient' package provides an interface to the 'FastNoise' C++ library and allows for efficient generation of perlin, simplex, worley, cubic, value, and white noise with optional pertubation in either 2, 3, or 4 (in case of simplex and white noise) dimensions. License MIT + file LICENSE Encoding UTF-8 SystemRequirements C++11 LazyData true Imports Rcpp (>= 0.12.18), rlang, grDevices, graphics LinkingTo Rcpp RoxygenNote 7.1.0 URL https://ambient.data-imaginist.com, https://github.com/thomasp85/ambient BugReports https://github.com/thomasp85/ambient/issues NeedsCompilation yes Author Thomas Lin Pedersen [cre, aut] (<https://orcid.org/0000-0002-5147-4711>), Jordan Peck [aut] (Developer of FastNoise) Repository CRAN Date/Publication 2020-03-21 17:50:12 UTC 1 2 ambient-package R topics documented: ambient-package . .2 billow . .3 clamped . .4 curl_noise . .4 fbm .............................................6 fracture . .7 gen_checkerboard . .8 gen_spheres . .9 gen_waves . 10 gradient_noise . 11 long_grid . 12 modifications . 13 noise_cubic . 14 noise_perlin . 15 noise_simplex . 17 noise_value . 19 noise_white . 20 noise_worley . 21 ridged . 24 trans_affine . 25 Index 27 ambient-package ambient: A Generator of Multidimensional Noise Description Generation of natural looking noise has many application within simulation, procedural generation, and art, to name a few. The ’ambient’ package provides an interface to the ’FastNoise’ C++ library and allows for efficient generation of perlin, simplex, worley, cubic, value, and white noise with optional pertubation in either 2, 3, or 4 (in case of simplex and white noise) dimensions.
    [Show full text]
  • Training a Quantum Pointnet with Nesterov Accelerated Gradient Estimation by Projection
    Training a Quantum PointNet with Nesterov Accelerated Gradient Estimation by Projection Ruoxi Shi Hao Tang Xian-Min Jin Center for Integrated Quantum Information Technologies (IQIT) Shanghai Jiao Tong University Shanghai, 200240, China {eliphat, htang2015, xianmin.jin}@sjtu.edu.cn Abstract PointNet is a bedrock for deep learning methods on point clouds for 3D machine vision. However, the pointwise operations involved in PointNet is resource in- tensive, and the expressiveness of PointNet is limited by the size of its feature space. In this paper, we propose Quantum PointNet with a rectifed max pooling operation to achieve an exponential speedup performing the pointwise operations and meanwhile obtaining a quantum-enhanced feature space. We provide an im- plementation with quantum tensor networks and specify a circuit model that runs on near-term quantum computers. Meanwhile, we develop the NA-GEP (Nesterov Accelerated Gradient Estimation by Projection) optimization framework, together with a periodic batching scheme, to help train large-scale quantum networks more efficiently. We demonstrate that Quantum PointNet reaches competitive perfor- mance to its classical counterpart on a subset of the ModelNet40 dataset with 48x fewer operations required to process a point cloud. It is also shown that NA-GEP is robust under different kinds of noises. A mini Quantum PointNet is able to run on real quantum computers, achieving ∼100% accuracy classifying three kinds of shapes with a small number of shots. 1 Introduction We live in a 3-dimensional world. Machines, especially robots, need to interpret 3-dimensional sur- roundings precisely to interact rationally with their environment. Thus, understanding 3D information is of great importance.
    [Show full text]