Signal Analysis: Amplitude Envelope Follower and Peak Detection

Signal Analysis Amplitude Envelope Follower Peak Detection Signal Analysis Music 270a: Signal Analysis Tamara Smyth, [email protected] • Some tools we may want to use to automate analysis Department of Music, are: University of California, San Diego (UCSD) 1. Amplitude Envelope Follower December 2, 2019 2. Peak Detection (attacks or harmonics), surfboard method 3. Pitch Detection, harmonics vs. chaotic signal 4. Frequency/Spectral Envelope (formant tracking, mccs or lpc) 5. Constant overlap-add (COLA) 1 Music 270a: Signal Analysis 2 Amplitude Envelope Follower Peak Detection • An envelope follower will essentially determine the • It is often desirable to automatically detect peaks, amplitude envelope without dipping down into the particularly in a spectral envelope. valleys/zero crossings. • The peak is the location where the slope changes • It is a reduction of the information, representing the directions. overal shape of the signal’s amplitude (i.e. amplidtude spectrum magnitude (linear) envelpe) without the higher frequency information. 14 12 • The amplitude envelope y(n) is given by 10 8 6 y(n)=(1 − ν)|x(n)| + νy(n − 1), 4 magnitude (linear) 2 0 where ν determines how quickly changes in x(n) are 0 500 1000 1500 2000 2500 3000 3500 4000 tracked: magnitude zoomed on first peak 14 – 12 if ν is close to one, changes are tracked slowly 10 – 8 if ν is close to zero, x(n) has an immediate 6 influence on y(n). 4 magnitude (linear) 2 0 600 620 640 660 680 700 720 740 760 780 800 • In order to capture attacks in the signal, the value for frequency (Hz) ν is usually smaller for an increasing signal and large Figure 1: Peak Detection for one that is decreasing. • See envfollower.m % threshhold Music 270a: Signal Analysis 3 Music 270a: Signal Analysis 4 th = 0; Quadratic interpolation % filter descending values uslope = Ymag > [Ymag(1); Ymag(1:end-1)]; % filter ascending values dslope = Ymag >= [Ymag(2:end); 1+Ymag(end)]; • The position of the peak is limited by the resolution % only indeces at maxima retain non-zero value Ymax = Ymag .* (Ymag > th) .* uslope .* dslope; of the DFT/FFT and its estimation can be improved using quadratic interpolation. % peak indeces maxixs = find(Ymax); • The general equation for a parabola is given by y(x) , a(x − p)2 + b, where p is the peak location and b = y(p). • Considering the parabola at points x =0, 1, −1 yields 3 equations, y(0) = ap2 + b = β y(1) = a(1 + p2 − 2p)+ b = γ y(−1) = a(1 + p2 +2p)+ b = α, and 3 unknowns a,p,b. • Solve for a: α − γ α − γ =4ap −→ a = . 4p Music 270a: Signal Analysis 5 Music 270a: Signal Analysis 6 • Solve for p using a: Pitch Detection γ − β = a − 2ap α − γ α − γ = − 2 p 4p 4p • Considerations for Computer Music applications: 4p(γ − β) = α − γ − 2(α − γ)p 1. Signals are often noisy, eg: poor soundcards, other 4p(γ − β)+2p(α − γ) = α − γ instruments/voices, 2p(γ − 2β + α) = α − γ α − γ 2. How much frequency resolution is needed? Correct p = . octave a must, but will a semitone suffice? 2(γ − 2β + α) 3. What latency can be tolerated (what framesize should be used for analysis?) • Finally, the height at peak p is given by 4. Does the instrument have well-defined/behaved y(p) = b harmonics? = β − ap2 α − γ • In the paper by de la Cuadra et al., “Efficient Pitch = β − p2. 4p Detection Techniques for Interactive Music”, four (4) pitch detection algorithms are summarized: • Another way (Dan Ellis) 1. Harmonic Product Spectrum 2. Maximum Likelihood 3. Cepstrum-Biased HPS 4. 
Pitch Detection

• Considerations for Computer Music applications:

1. Signals are often noisy, e.g., poor soundcards, other instruments/voices.
2. How much frequency resolution is needed? The correct octave is a must, but will a semitone suffice?
3. What latency can be tolerated (what frame size should be used for analysis)?
4. Does the instrument have well-defined/behaved harmonics?

• In the paper by de la Cuadra et al., "Efficient Pitch Detection Techniques for Interactive Music", four pitch detection algorithms are summarized:

1. Harmonic Product Spectrum
2. Maximum Likelihood
3. Cepstrum-Biased HPS
4. Weighted Autocorrelation Function

Harmonic Product Spectrum (HPS)

• HPS (Noll 1969) measures the maximum coincidence of harmonics for each spectral frame according to

    Y(ω) = ∏_{r=1}^{R} |X(ωr)|,    (1)

  where R is the number of harmonics being considered.

• The resulting periodic correlation array Y(ω) is then searched for a maximum over a range of possible fundamental frequencies ωi,

    Ŷ = max_{ωi} Y(ωi),    (2)

  to obtain the fundamental frequency estimate.

• Octave errors are common (detection is sometimes an octave too high).

• To correct, apply this rule: if the second peak amplitude below the initially chosen pitch is approximately 1/2 of the chosen pitch, AND the ratio of amplitudes is above a threshold (e.g., 0.2 for 5 harmonics), THEN select the lower octave peak as the pitch for the current frame.

• Due to noise, frequencies below about 50 Hz should not be searched for a pitch.

• Pros: HPS is simple to implement, does well under a wide range of conditions, and runs in real-time.

• Cons: low frequency resolution must be enhanced by zero-padding, so that the spectrum can be interpolated to the nearest semitone. This means that high frequencies are also being unnecessarily interpolated.

• See hps.m

[Figure 2: Harmonic Product Spectrum — the spectrum X(ω) and its downsampled copies are aligned and multiplied.]
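• A minimal sketch of the HPS search of Eqs. (1) and (2), under assumed variable names (xw: a windowed analysis frame; fs: the sample rate in Hz; R = 5 following the slide's example); this is an illustration, not necessarily the course's hps.m:

    % Harmonic Product Spectrum, sketched for one frame
    Nfft = 4 * 2^nextpow2(length(xw));  % zero-pad to refine frequency resolution
    Xmag = abs(fft(xw, Nfft));
    Xmag = Xmag(1:Nfft/2);              % keep positive-frequency bins
    R = 5;                              % number of harmonics considered
    L = floor((Nfft/2 - 1)/R) + 1;      % candidates whose Rth harmonic is in range
    Y = Xmag(1:L);
    for r = 2:R
        Y = Y .* Xmag(1:r:1+r*(L-1));   % Eq. (1): multiply in the rth harmonic
    end
    kmin = 1 + ceil(50*Nfft/fs);        % do not search below about 50 Hz
    [~, imax] = max(Y(kmin:end));       % Eq. (2): pick the strongest candidate
    f0 = (kmin + imax - 2)*fs/Nfft;     % fundamental frequency estimate (Hz)

• The octave-error rule described above could then compare Y at the bin nearest f0/2 against Y at the chosen bin before accepting the estimate.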
Uses of Linear Predictive Coding (LPC)

• LPC, a statistical method for predicting future values of a waveform on the basis of its past values¹, is often used to obtain a spectral envelope.

• LPC differs from formant tracking in that:

  – the waveform remains in the time domain; resonances are described by the coefficients of an all-pole filter.
  – altering resonances is difficult, since editing IIR filter coefficients can result in an unstable filter.
  – analysis may be applied to a wide range of sounds.

• LPC is often used to determine the filter in a source-filter model of speech², which:

  – characterizes the response of the vocal tract.
  – reconstitutes the speech waveform when driven by the correct source.

¹ Markel, J. D., and Gray, A. H., Jr., Linear Prediction of Speech. New York: Springer-Verlag, 1976.
² In a source-filter model of speech, there is assumed to be no feedback dependency between the vibrating vocal folds and the vocal tract.

Concept of LPC

• Given a digital system, can the value of any sample be predicted by taking a linear combination of the previous N samples?

• Stated mathematically, can a set of coefficients ak be determined such that

    y(n) = a1 y(n − 1) + a2 y(n − 2) + ... + aN y(n − N)?

• That is, can a signal be represented as the coefficients of an all-pole (only feedback terms) IIR filter?

• If yes, then the coefficients and the first N samples would completely determine the remainder of the signal, because the rest of the samples could be calculated by the above equation.

LPC residual

• The answer is actually "not precisely" for a finite N.

• Rather, we determine the coefficients that give the best prediction by minimizing the difference, or error e(n), between the actual sample values of the input waveform y(n) and the waveform re-created using the derived predictors, ŷ(n):

    min_n {e(n)} = min_n {ŷ(n) − y(n)}.

• The smaller the average value of the error, also called the residual, the better the set of predictors.

• The residual may be used to exactly reconstruct the original signal y(n) by using it as the input to our all-pole filter, that is

    y(n) = b0 e(n) + a1 y(n − 1) + a2 y(n − 2) + ... + aN y(n − N),

  where b0 is a scaling factor that gives the correct amplitude.

• When the speech is voiced, the residual is essentially a periodic pulse waveform with the same fundamental frequency as the speech. When unvoiced, the residual is similar to white noise.

LPC Order

• The accuracy of the predictor improves with an increase of the order N.

• The smallest value of N that will yield sufficient quality in the speech representation is related to:

1. the highest frequency in the speech,
2. the number of formant peaks expected, and
3. the sampling rate.

• There is no exact relationship for determining what value of N should be used, but in general it is between 10 and 20, or the sampling rate in kHz plus 4 (for fs = 15 kHz, you might use N = 19).

• Schemes for determining predictors by minimizing the residual work either explicitly or implicitly.

• A more rigorous treatment of the subject can be found in J. Makhoul's tutorial³.

³ Makhoul, J., "Linear Prediction: A Tutorial Review," Proceedings of the IEEE, 63, 1975, 561–580.

LPC Analysis in Matlab

• Matlab for generating the original, LPC, and residual spectra:

    % lpc
    order = fs/1000 + 5;        % order
    xlpc = lpc(xw, order)';     % prediction-error filter coefficients

    % windowed speech frequency response
    zpf = 3;                    % zero-padding factor (N is the frame length)
    Nfft = 2^nextpow2(N*zpf);
    XW = fft(xw, Nfft);

    % lpc spectrum
    [H, w] = freqz(1, xlpc, Nfft, 'whole');

    % inverse filter to obtain residual
    E = XW./H;
    e = real(ifft(E));

[Figure 3: LPC Analysis Results for the vowel "aah" — original, LPC, and residual spectra over 0–2500 Hz.]

Constant Overlap-Add (COLA)

• The mathematical definition of the Short-Time Fourier Transform (STFT) is given by

    X_m(ω) = Σ_{n=−∞}^{∞} x(n) w(n − mR) e^{−jωn},

  where R is the hop size and m is the frame index.

• Summed over all frames m, the STFT returns the spectrum of the original signal X(ω), that is:

    Σ_{m=−∞}^{∞} X_m(ω) = Σ_m Σ_n x(n) w(n − mR) e^{−jωn}
                        = Σ_n x(n) e^{−jωn} Σ_m w(n − mR)
                        = Σ_n x(n) e^{−jωn} = X(ω), if COLA.

• The final step assumes the window satisfies the COLA condition, Σ_m w(n − mR) = 1 for all n.
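• As an illustrative check (not from the course materials), one can overlap-add shifted copies of a window numerically and verify that the sum is constant; a periodic Hann window with hop R = M/2 satisfies COLA:

    % Sketch: numerically verify COLA for a periodic Hann window, hop M/2
    M = 512;                               % window length
    R = M/2;                               % hop size
    w = 0.5 - 0.5*cos(2*pi*(0:M-1)'/M);    % periodic Hann window
    nframes = 10;
    ola = zeros(M + nframes*R, 1);
    for m = 0:nframes                      % overlap-add shifted window copies
        ola(1 + m*R : m*R + M) = ola(1 + m*R : m*R + M) + w;
    end
    middle = ola(M : end - M + 1);         % ignore partially covered edges
    fprintf('max |sum - 1| = %g\n', max(abs(middle - 1)));  % ~0 if COLA

• A Hann window with 50% overlap is a common STFT analysis choice precisely because it satisfies this condition.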