Separation of Vocal and Non-Vocal Components from Audio Clip Using Correlated Repeated Mask (CRM)

University of New Orleans
ScholarWorks@UNO: University of New Orleans Theses and Dissertations
Summer 8-9-2017

Mohan Kumar Kanuri
[email protected]

Follow this and additional works at: https://scholarworks.uno.edu/td
Part of the Signal Processing Commons

Recommended Citation
Kanuri, Mohan Kumar, "Separation of Vocal and Non-Vocal Components from Audio Clip Using Correlated Repeated Mask (CRM)" (2017). University of New Orleans Theses and Dissertations. 2381. https://scholarworks.uno.edu/td/2381

This Thesis is protected by copyright and/or related rights. It has been brought to you by ScholarWorks@UNO with permission from the rights-holder(s). You are free to use this Thesis in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the rights-holder(s) directly, unless additional rights are indicated by a Creative Commons license in the record and/or on the work itself. This Thesis has been accepted for inclusion in University of New Orleans Theses and Dissertations by an authorized administrator of ScholarWorks@UNO. For more information, please contact [email protected].

A Thesis Submitted to the Graduate Faculty of the University of New Orleans in partial fulfillment of the requirements for the degree of Master of Science in Engineering – Electrical

By Mohan Kumar Kanuri
B.Tech., Jawaharlal Nehru Technological University, 2014
August 2017

Dedication

This thesis is dedicated to my parents, Mr. Ganesh Babu Kanuri and Mrs. Lalitha Kumari Kanuri, for their constant support, encouragement, and motivation. I also dedicate this thesis to my brother, Mr. Hima Kumar Kanuri, for all his support.

Acknowledgement

I would like to express my sincere gratitude to my advisor, Dr. Dimitrios Charalampidis, for his constant support, encouragement, patient guidance, and instruction in the completion of my thesis and degree requirements. His innovative ideas, encouragement, and positive attitude have been an asset to me throughout my Master's studies and in achieving my long-term career goals. I would also like to thank Dr. Vesselin Jilkov and Dr. Kim D. Jovanovich for serving on my committee, and for their support and motivation throughout my graduate research, which enabled me to complete my thesis successfully.

Table of Contents

List of Figures
Abstract
1. Introduction
   1.1 Sound
   1.2 Characteristics of sound
   1.3 Music and speech
2. Scope and Objectives
3. Literature Review
   3.1 Repetition used as a criterion to extract different features in audio
       3.1.1 Similarity matrix
       3.1.2 Cepstrum
   3.2 Previous work
       3.2.1 Mel Frequency Cepstral Coefficients (MFCC)
       3.2.2 Perceptual Linear Prediction (PLP)
4. REPET and Proposed Methodologies
   4.1 REPET methodology
       4.1.1 Overall idea of REPET
       4.1.2 Identification of repeating period
       4.1.3 Repeating segment modeling
       4.1.4 Repeating patterns extraction
   4.2 Proposed methodology
       4.2.1 Lag evaluation
       4.2.2 Alignment of segments based on the lag t
       4.2.3 Stitching the segments
       4.2.4 Unwrapping and extraction of repeating background
5. Results and Data Analysis
6. Limitations and Future Recommendations
7. Bibliography
Vita

List of Figures

Figure 1. Intensity of sound varies with the distance
Figure 2. Acoustic processing for similarity measure
Figure 3. Visualization of drum pattern highlighting the similar region on the diagonal
Figure 4. Cepstrum coefficients calculation
Figure 5. Matlab graph representing X[k], X̂[k], and c[n] of a signal x[n]
Figure 6. Building blocks of the Vembu separation system
Figure 7. Process of building MFCCs
Figure 8. Process of building PLP cepstral coefficients
Figure 9. Depiction of musical work production using different instruments and voices
Figure 10. REPET methodology summarized into three stages
Figure 11. Spectral content of drums using different window lengths for the STFT
Figure 12. Segmentation of the magnitude spectrogram V into 'r' segments
Figure 13. Estimation of background and unwrapping of the signal using the ISTFT
Figure 14. Alignment of segment for positive lag
Figure 15. Alignment of segment for negative lag
Figure 16. Stitching of CRM segments
Figure 17. Unwrapping of repeating patterns in the audio signal
Figure 18. SNR of REPET and CPRM for different audio clips
Figure 19. Foreground extracted by REPET and CPRM for Matlab-generated sound
Figure 20. Foreground extracted by REPET and CPRM for Priyathama
Figure 21. Foreground extracted by REPET and CPRM for the Desiigner "Panda" song

Abstract

Extraction of singing voice from music is one of the ongoing research topics in the fields of speech recognition and audio analysis. In particular, this topic finds many applications in the music field, such as determining music structure, lyrics recognition, and singer recognition. Although many studies have been conducted for the separation of
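To make the repetition-based separation idea behind REPET and CRM concrete, the following is a minimal Matlab sketch (Matlab being the environment used for the thesis experiments). It is an illustrative approximation only, not the thesis's actual CRM implementation: it assumes the stft and istft functions from the Signal Processing Toolbox, a hypothetical input file mix.wav, and a repeating period p that is simply given. REPET estimates p from a beat spectrum, while the proposed CRM aligns segments by an estimated lag instead.

% Minimal REPET-style background/foreground separation sketch.
% Illustrative only: file name, parameter values, and the fixed period p are assumptions.
[x, fs] = audioread('mix.wav');            % hypothetical mixture file
x = mean(x, 2);                            % fold stereo to mono

win  = hann(1024, 'periodic');             % analysis window
olap = 512;                                % overlap between adjacent frames
nfft = 1024;
S = stft(x, fs, 'Window', win, 'OverlapLength', olap, 'FFTLength', nfft);
V = abs(S);                                % magnitude spectrogram

p = 100;                                   % repeating period in frames (assumed known here)
r = floor(size(V, 2) / p);                 % number of complete repeating segments
segs = reshape(V(:, 1:r*p), size(V, 1), p, r);
W = median(segs, 3);                       % repeating-segment model: median over the repeats

M = ones(size(V));                         % soft time-frequency mask for the background
for k = 1:r
    idx = (k-1)*p + (1:p);
    % The repeating background cannot exceed the mixture magnitude in any bin.
    M(:, idx) = min(W, V(:, idx)) ./ max(V(:, idx), eps);
end

background = istft(M .* S, fs, 'Window', win, 'OverlapLength', olap, 'FFTLength', nfft);
background = real(background);             % discard numerical imaginary residue
foreground = x(1:numel(background)) - background;   % vocal (non-repeating) estimate

The median across repeats keeps only what is common to every segment, which is why the non-repeating vocal survives in the residual. Judging from the headings of Chapter 4, the CRM method proposed in the thesis refines the segmentation step by evaluating a lag, aligning and stitching the segments before the repeating model is extracted.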