Low Bit Rate Speech Coding

Carl Kritzinger

Thesis presented in partial fulfilment of the requirements for the degree of Master of Science in Engineering Sciences at the University of Stellenbosch

Promoter: Dr. T.R. Niesler

April 2006

Declaration

I, the undersigned, hereby declare that the work contained in this thesis is my own original work and that I have not previously in its entirety or in part submitted it at any university for a degree.

Signature    Date

Abstract

Despite enormous advances in digital communication, the voice is still the primary tool with which people exchange ideas. However, uncompressed digital speech tends to require prohibitively high data rates (upward of 64 kbps), making it impractical for many applications.

Speech coding is the process of reducing the data rate of digital voice to manageable levels. Parametric speech coders, or vocoders, utilise a priori information about the mechanism by which speech is produced in order to achieve extremely efficient compression of speech signals (as low as 1 kbps).

The greater part of this thesis comprises an investigation into parametric speech coding. This consisted of a review of the mathematical and heuristic tools used in parametric speech coding, as well as the implementation of an accepted standard algorithm for parametric voice coding.

In order to examine avenues of improvement for the existing vocoders, we examined some of the mathematical structure underlying parametric speech coding. Following on from this, we developed a novel approach to parametric speech coding which obtained promising results under both objective and subjective evaluation.

An additional contribution of this thesis was the comparative subjective evaluation of the effect of parametric speech coding on English and Xhosa speech. We investigated the performance of two different encoding algorithms on the two languages.
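The data rates quoted in the abstract can be put in perspective with a little arithmetic. The sketch below assumes that the 64 kbps figure corresponds to telephony-style PCM at 8 kHz sampling with 8 bits per sample; the abstract gives only the rate itself, so these two parameters are an assumption. The coded rates are those named in the abstract and the contents (1 kbps, 1.7 kbps and 600 bps). The function names are illustrative only.

```python
# Assumed telephony-style PCM parameters (not stated in the abstract).
SAMPLE_RATE_HZ = 8000   # samples per second
BITS_PER_SAMPLE = 8     # bits per sample


def pcm_bit_rate(sample_rate_hz: int, bits_per_sample: int) -> int:
    """Bit rate of uncompressed single-channel PCM speech, in bits per second."""
    return sample_rate_hz * bits_per_sample


def compression_ratio(source_bps: float, coded_bps: float) -> float:
    """How many times smaller the coded stream is than the source stream."""
    return source_bps / coded_bps


pcm_bps = pcm_bit_rate(SAMPLE_RATE_HZ, BITS_PER_SAMPLE)

# Coded rates mentioned in the abstract and the table of contents.
coded_rates_bps = {
    "vocoder at 1 kbps": 1000,
    "MELP at 1.7 kbps": 1700,
    "MELP at 600 bps": 600,
}

for label, coded_bps in coded_rates_bps.items():
    ratio = compression_ratio(pcm_bps, coded_bps)
    print(f"{label}: {ratio:.1f} times smaller than {pcm_bps} bps PCM")
```

Under these assumptions a 1 kbps vocoder carries speech in one sixty-fourth of the PCM data rate, which is the scale of saving the thesis sets out to investigate.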
Opsomming (Summary, translated from Afrikaans)

Despite enormous progress in digital communication, the voice is still the primary means by which people exchange ideas. Unfortunately, digital speech signals require very high data rates, which makes them impractical for many purposes.

Speech coding is the process by which the data rate of digital speech signals is reduced to usable levels. Parametric speech coders, or vocoders, use prior knowledge of the mechanism by which speech is produced to achieve particularly efficient coding of speech signals (as low as 1 kbps).

The greater part of this thesis is a study of parametric speech coding. The study consists of an overview of the mathematical and heuristic techniques used in parametric speech coding, as well as an implementation of an accepted standard algorithm for speech coding.

With a view to possible ways of improving the existing coders, we investigated the mathematical structure underlying parametric speech coding. From this arose a new algorithm for parametric speech coding which delivered promising results under both objective and subjective evaluation.

A further contribution of the thesis is the comparative subjective evaluation of the effect of parametric coding on English and Xhosa speech. We studied the effectiveness of two different encoding algorithms for the two languages.

To my father, for his quiet greatness.

Contents

Acknowledgements
1 Introduction
  1.1 History of Vocoders
  1.2 Objectives
  1.3 Overview
2 An Overview of Voice Coding Techniques
  2.1 Ideal Voice Coding
    2.1.1 Quantifying the information of the speech signal
  2.2 Pulse Code Modulation (PCM)
  2.3 Waveform Coders
  2.4 Parametric Coders
    2.4.1 Spectrum Descriptions
    2.4.2 Excitation Models
  2.5 Segmental Coders
3 Fundamentals of Speech Processing for Speech Coding
  3.1 The Mechanics of Speech Production
    3.1.1 Physiology
  3.2 Modelling Human Speech Production
    3.2.1 Excitation
  3.3 Psycho-Acoustic Phenomena
    3.3.1 Masking
    3.3.2 Non-Linearity
  3.4 Characteristics of the Speech Waveform
    3.4.1 Quasi-Stationarity
    3.4.2 Energy Bias
  3.5 Linear Prediction and the All-Pole Model of Speech Production
    3.5.1 Derivation of the LP System
    3.5.2 Representations of the Linear Predictor
    3.5.3 Optimisation of the Linear Prediction System
    3.5.4 The Levinson-Durbin Algorithm
    3.5.5 The Le Roux-Gueguen Algorithm
  3.6 Pitch Tracking and Voicing Detection
    3.6.1 Pitch Tracking
    3.6.2 Pitch Estimation Errors
  3.7 Speech Quality Assessment
    3.7.1 Categorising Speech Quality
    3.7.2 Subjective Metrics
    3.7.3 Objective Metrics
    3.7.4 Purpose of Objective Metrics
4 Standard Voice Coding Techniques
  4.1 FS1015 - LPC10e
    4.1.1 Pre-Emphasis of Speech
    4.1.2 LP Analysis
    4.1.3 Pitch Estimate
    4.1.4 Voicing Detection
    4.1.5 Quantisation of LP Parameters
  4.2 FS1016 - CELP
    4.2.1 Analysis by Synthesis
    4.2.2 Perceptual Weighting
    4.2.3 Post-filtering
    4.2.4 Pitch Prediction Filter
    4.2.5 FS-1016 Bit Allocation
  4.3 MELP
    4.3.1 The MELP Speech Production Model
    4.3.2 An Improved MELP at 1.7 kbps
    4.3.3 MELP at 600 bps
  4.4 Conclusion
5 MELP Implementation
  5.1 Analysis
    5.1.1 Pre-Processing
    5.1.2 Pitch Estimation Pre-Processing
    5.1.3 Integer Pitch
    5.1.4 Fractional Pitch Estimate
    5.1.5 Band-Pass Voicing Analysis
    5.1.6 Linear Predictor Analysis
    5.1.7 LP Residual Calculation
    5.1.8 Peakiness
    5.1.9 Jitter (Aperiodic) Flag
    5.1.10 Final Pitch Calculation
    5.1.11 Gain
    5.1.12 Fourier Magnitudes
    5.1.13 Average Pitch Calculation
  5.2 Encoding
    5.2.1 Band-Pass Voicing Quantisation
    5.2.2 Linear Predictor Quantisation
    5.2.3 Gain Quantisation
    5.2.4 Pitch Quantisation
    5.2.5 Quantisation of Fourier Magnitudes
    5.2.6 Redundancy Coding
    5.2.7 Transmission Order
  5.3 Decoder
    5.3.1 Error Correction
    5.3.2 LP and Fourier Magnitude Reconstruction
  5.4 Synthesis
    5.4.1 Pitch Synchronous Synthesis
    5.4.2 Parameter Interpolation
    5.4.3 Pitch Period
    5.4.4 Impulse Generation
    5.4.5 Mixed Excitation Generation
    5.4.6 Adaptive Spectral Enhancement
    5.4.7 Linear Prediction Synthesis
    5.4.8 Gain Adjustment
  5.5 Results
    5.5.1 MELP Analysis Results
    5.5.2 Quantisation Effects
  5.6 Conclusion
6 The Temporal Decomposition Approach to Voice Coding
  6.1 History of Temporal Decomposition
  6.2 A Mathematical Framework for Parametric Voice Coding
    6.2.1 Representation of the Speech Signal
    6.2.2 The Parameter Vector Trajectory
  6.3 Parameter Space and Encoding
    6.3.1 Irregular Sampling of the Parameter Vector
  6.4 Conclusion
7 Implementation of an Irregular Frame Rate Vocoder
  7.1 Analysis
  7.2 Filtering of the Speech Feature Trajectory
    7.2.1 Pitch and Voicing
    7.2.2 Linear Predictor
  7.3 Reconstruction of the Parameter Vector Trajectory
  7.4 Regular MELP as a special case of IS-MELP
  7.5 Key Frame Determination
    7.5.1 Key Frame Determination from Curvature of the Trajectory
    7.5.2 Key Frame Selection by Direct Estimation of Reconstruction Error
  7.6 Threshold Optimisation
