Low Bit Rate Speech Coding
Total Page:16
File Type:pdf, Size:1020Kb
Low Bit Rate Speech Coding Carl Kritzinger Thesis presented in partial fulfilment of the requirements for the degree of Master of Science in Engineering Sciences at the University of Stellenbosch Promoter: Dr. T.R. Niesler April 2006 Declaration I, the undersigned, hereby declare that the work contained in this thesis is my own original work and that I have not previously in its entirety or in part submitted it at any university for a degree. Signature Date Abstract Despite enormous advances in digital communication, the voice is still the primary tool with which people exchange ideas. However, uncompressed digital speech tends to require prohibitively high data rates (upward of 64kbps), making it impractical for many appli- cations. Speech coding is the process of reducing the data rate of digital voice to manageable levels. Parametric speech coders or vocoders utilise a-priori information about the mech- anism by which speech is produced in order to achieve extremely efficient compression of speech signals (as low as 1 kbps). The greater part of this thesis comprises an investigation into parametric speech cod- ing. This consisted of a review of the mathematical and heuristic tools used in parametric speech coding, as well as the implementation of an accepted standard algorithm for para- metric voice coding. In order to examine avenues of improvement for the existing vocoders, we examined some of the mathematical structure underlying parametric speech coding. Following on from this, we developed a novel approach to parametric speech coding which obtained promising results under both objective and subjective evaluation. An additional contribution by this thesis was the comparative subjective evaluation of the effect of parametric speech coding on English and Xhosa speech. We investigated the performance of two different encoding algorithms on the two languages. Opsomming Ten spyte van enorme vordering in digitale kommunikasie is die stem steeds die primˆere manier waarmee mense idees wissel. Ongelukkig benodig digitale spraakseine baie ho¨e datatempos, wat dit onprakties maak vir menigte doeleindes. Spraak kodering is die proses waarmee die datatempo van digitale spraakseine vermin- der word tot bruikbare vlakke. Parametriese spraakkodeerders oftewel vocoders,gebruik voorafbekende informasie oor die meganisme waarmee spraak produseer word om beson- der doeltreffende kodering van spraak seine te verrig (so laag soos 1kbps). Die meerderheid van hierdie tesis bevat ’n studie oor parametriese spraak kodering. Die studie bestaan uit ’n oorsig van die wiskundige en heuristieke tegnieke wat in parame- triese spraak kodering gebruik word sowel as ’n implementasie van ’n aanvaarde standaard algoritme vir spraak kodering. Met die oog op moontlike maniere om die bestaande kodeerders te verbeter, het ons die wiskundige struktuur onderliggend aan parametriese spraak kodering ondersoek. Hieruit spruit ’n nuwe algoritme vir parametriese spraak kodering wat onder beide objektiewe en subjektiewe evaluering belowende resultate gelewer het. ’n Verdere bydrae van die tesis is die vergelykende subjektiewe evaluering van die ef- fek van parametriese kodering van Engelse en Xhosa spraak. Ons het die doeltreffendheid van twee verskillende enkoderings algoritmes vir die twee tale bestudeer. To my father, for his quiet greatness. Contents Acknowledgements xv 1 Introduction 1 1.1HistoryofVocoders............................... 2 1.2Objectives.................................... 3 1.3Overview..................................... 4 2 An Overview of Voice Coding Techniques 5 2.1IdealVoiceCoding............................... 6 2.1.1 Quantifyingtheinformationofthespeechsignal........... 7 2.2PulseCodeModulation(PCM)........................ 8 2.3WaveformCoders................................ 10 2.4ParametricCoders............................... 10 2.4.1 SpectrumDescriptions......................... 11 2.4.2 ExcitationModels........................... 12 2.5SegmentalCoders................................ 12 3 Fundamentals of Speech Processing for Speech Coding 14 3.1TheMechanicsofSpeechProduction..................... 14 3.1.1 Physiology................................ 14 3.2 Modelling Human Speech Production . ................... 15 3.2.1 Excitation................................ 17 3.3Psycho-AcousticPhenomena.......................... 17 3.3.1 Masking................................. 18 3.3.2 Non-Linearity.............................. 18 3.4CharacteristicsoftheSpeechWaveform.................... 20 3.4.1 Quasi-Stationarity........................... 20 3.4.2 EnergyBias............................... 21 3.5LinearPredictionandtheAll-PoleModelofSpeechProduction...... 21 3.5.1 DerivationoftheLPSystem...................... 22 3.5.2 RepresentationsoftheLinearPredictor................ 23 ii CONTENTS iii 3.5.3 OptimisationoftheLinearPredictionSystem............ 25 3.5.4 TheLevinson-DurbinAlgorithm.................... 25 3.5.5 TheLeRoux-GueguenAlgorithm................... 26 3.6PitchTrackingandVoicingDetection..................... 27 3.6.1 PitchTracking............................. 27 3.6.2 PitchEstimationErrors........................ 29 3.7SpeechQualityAssessment........................... 30 3.7.1 CategorisingSpeechQuality...................... 30 3.7.2 SubjectiveMetrics........................... 31 3.7.3 ObjectiveMetrics............................ 32 3.7.4 PurposeofObjectiveMetrics..................... 36 4 Standard Voice Coding Techniques 37 4.1 FS1015 - LPC10e . .............................. 37 4.1.1 Pre-EmphasisofSpeech........................ 37 4.1.2 LPAnalysis............................... 37 4.1.3 PitchEstimate............................. 38 4.1.4 VoicingDetection............................ 39 4.1.5 QuantisationofLPParameters.................... 39 4.2 FS1016 - CELP . .............................. 40 4.2.1 AnalysisbySynthesis.......................... 40 4.2.2 PerceptualWeighting.......................... 41 4.2.3 Post-filtering.............................. 41 4.2.4 PitchPredictionFilter......................... 42 4.2.5 FS-1016 Bit Allocation ......................... 42 4.3MELP...................................... 42 4.3.1 TheMELPSpeechProductionModel................. 43 4.3.2 AnImprovedMELPat1.7kbps.................... 48 4.3.3 MELPat600bps............................ 49 4.4Conclusion.................................... 51 5 MELP Implementation 52 5.1Analysis..................................... 53 5.1.1 Pre-Processing............................. 54 5.1.2 PitchEstimationPre-Processing.................... 55 5.1.3 IntegerPitch.............................. 56 5.1.4 FractionalPitchEstimate....................... 56 5.1.5 Band-PassVoicingAnalysis...................... 57 5.1.6 LinearPredictorAnalysis....................... 57 CONTENTS iv 5.1.7 LPResidualCalculation........................ 57 5.1.8 Peakiness................................ 58 5.1.9 Jitter(Aperiodic)Flag......................... 58 5.1.10FinalPitchCalculation......................... 58 5.1.11Gain................................... 58 5.1.12FourierMagnitudes........................... 59 5.1.13AveragePitchCalculation....................... 59 5.2Encoding.................................... 59 5.2.1 Band-PassVoicingQuantisation.................... 59 5.2.2 LinearPredictorQuantisation..................... 59 5.2.3 GainQuantisation........................... 60 5.2.4 PitchQuantisation........................... 60 5.2.5 QuantisationofFourierMagnitudes.................. 60 5.2.6 Redundancy Coding .......................... 61 5.2.7 TransmissionOrder........................... 62 5.3Decoder..................................... 62 5.3.1 ErrorCorrection............................ 62 5.3.2 LPandFourierMagnitudeReconstruction.............. 62 5.4Synthesis..................................... 62 5.4.1 PitchSynchronousSynthesis...................... 62 5.4.2 ParameterInterpolation........................ 63 5.4.3 PitchPeriod............................... 64 5.4.4 ImpulseGeneration........................... 64 5.4.5 MixedExcitationGeneration..................... 64 5.4.6 AdaptiveSpectralEnhancement.................... 65 5.4.7 LinearPredictionSynthesis...................... 66 5.4.8 GainAdjustment............................ 66 5.5Results...................................... 66 5.5.1 MELPAnalysisResults........................ 67 5.5.2 QuantisationEffects.......................... 69 5.6Conclusion.................................... 71 6 The Temporal Decomposition Approach to Voice Coding 73 6.1HistoryofTemporalDecomposition...................... 73 6.2AMathematicalFrameworkforParametricVoiceCoding.......... 74 6.2.1 RepresentationoftheSpeechSignal.................. 74 6.2.2 TheParameterVectorTrajectory................... 75 6.3ParameterSpaceandEncoding........................ 75 6.3.1 IrregularSamplingoftheParameterVector............. 76 CONTENTS v 6.4Conclusion.................................... 79 7 Implementation of an Irregular Frame Rate Vocoder 80 7.1Analysis..................................... 82 7.2FilteringoftheSpeechFeatureTrajectory.................. 82 7.2.1 PitchandVoicing............................ 82 7.2.2 LinearPredictor............................ 83 7.3ReconstructionoftheParameterVectorTrajectory............. 84 7.4RegularMELPasaspecialcaseofIS-MELP................. 84 7.5KeyFrameDetermination........................... 87 7.5.1 KeyFrameDeterminationfromCurvatureoftheTrajectory.... 87 7.5.2 Key Frame Selection by Direct Estimation of Reconstruction Error 88 7.6ThresholdOptimisation...........................