BEN-GURION UNIVERSITY OF THE NEGEV FACULTY OF ENGINEERING SCIENCE

DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING

Sinusoidal Model Based Packet Loss Concealment for Wideband VoIP Applications

THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE M.Sc. DEGREE

By: Dmitry Lihovetski

January 2011

BEN-GURION UNIVERSITY OF THE NEGEV FACULTY OF ENGINEERING SCIENCE

DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING

Sinusoidal Model Based Packet Loss Concealment for Wideband VoIP Applications

THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE M.Sc. DEGREE

By: Dmitry Lihovetski

Supervised by: Prof. Ilan D. Shallom and Prof. Dov Wulich

Author: ………………..  Date: ………………..

Supervisor: ………………..  Date: ………………..

Supervisor: ………………..  Date: ………………..

Chairman of Graduate Studies Committee: ………………..  Date: ………………..

January 2011

SINUSOIDAL MODEL BASED PACKET LOSS CONCEALMENT FOR

WIDEBAND VOIP APPLICATIONS

“Essentially, all models are wrong, but some are useful”1 - George E.P. Box (1919-present)

Dmitry Lihovetski Israel, January 2011

1 Box, George E. P.; Norman R. Draper (1987). Empirical Model-Building and Response Surfaces, p. 424, Wiley.

Sinusoidal Model Based Packet Loss Concealment for Wideband VoIP Applications

Abstract

Voice over Internet Protocol (VoIP) has become very popular in recent years. However, since internet delivery does not guarantee quality of service, data packets are often lost due to network congestion or are significantly delayed. Packet loss is a fundamental problem in VoIP networks; unless it is concealed in some way, it produces annoying disturbances in the resulting gaps. Therefore, there is considerable interest in developing Packet Loss Concealment (PLC) algorithms to compensate for missing voice packets. In this thesis, a new method for concealing missing packets in wideband VoIP applications is presented. The proposed approach is based on sinusoidal modeling of speech, in which the signal is reconstructed using sinusoidal generators controlled by amplitudes, frequencies and phases. The major idea is to exploit the continuity of the sinusoidal representation and its simple interpolation and extrapolation capabilities for isolated and consecutive packet loss. Sinusoidal modeling is used to produce synthetic speech analyzed from past packets and, when available, subsequent packets, allowing packet extrapolation and interpolation in the model's parametric domain. Several model configurations are investigated on a fully implemented concealment process, tested with an objective voice quality method, namely Perceptual Evaluation of Speech Quality (PESQ). Using the large ITU-T coded speech database (P-Series Supplement 23) for statistical evaluation of the results, it is shown that the proposed algorithm outperforms the standardized PLCs, ITU-T G.722 Appendixes III and IV, on average under all tested packet loss rates, with its advantage increasing at higher rates.


Keywords

Voice over Internet Protocol, Packet Loss Concealment, Sinusoidal Model Based Packet Loss Concealment, Sinusoidal Modeling of Speech, Speech Normalization, Sinusoidal Model Extrapolation, Sinusoidal Model Interpolation, Sinusoidal Model Matching, Sender-Based Techniques, Receiver-Based Techniques, Voice Quality Evaluation, Perceptual Evaluation of Speech Quality, ITU-T P-Series Supplement 23, NIST Net Emulation Tool.


Acknowledgments

I would like to express my sincere gratitude to Prof. Ilan Shallom and Prof. Dov Wulich for their support and encouragement throughout my graduate studies at Ben-Gurion University, especially for their devoted supervision and professional guidance in this research. It was their patient tutoring and invaluable advice that helped me to accomplish this thesis successfully.

I am also thankful to all my fellow graduate students in the Signal Processing Laboratory for their companionship and fruitful discussions. My special thanks to all friends who took part in subjective tests during the development and final evaluation of the algorithm proposed in this research, and to Eitan Talianker and Vadim Mishalov, who accompanied me throughout my entire studies.

Finally, I wish to thank my family and friends for their infinite care, love, support, patience and encouragement during my undergraduate and graduate studies.

Dmitry Lihovetski Israel, January 2011


Contents

Abstract
Keywords
Acknowledgments
Contents
List of Abbreviations and Acronyms
List of Notations and Symbols
List of Figures
List of Tables

1 Introduction
  1.1 Internet Telephony
  1.2 Voice Quality Evolution, Wideband Improvements
  1.3 Research Goals and Results
  1.4 Document Outline

2 Overview of Voice over IP Technologies
  2.1 VoIP Standards and Organizations
  2.2 VoIP Architecture
  2.3 Media Transport Protocols
  2.4 Media Encoding Protocols
  2.5 Media Packetizing
    2.5.1 Real-Time Transport Protocol (RTP)
  2.6 Voice Quality in Internet Telephony
  2.7 Characterization of Packet Loss
  2.8 Voice Quality Evaluation
    2.8.1 Mean Opinion Score (MOS)
    2.8.2 E-Model
    2.8.3 Perceptual Evaluation of Speech Quality (PESQ)
    2.8.4 Non-Intrusive Voice Quality
    2.8.5 Voice Quality Measure Selection
  2.9 Packet Loss Recovery Techniques
    2.9.1 Sender-Based Techniques
    2.9.2 Receiver-Based Techniques
    2.9.3 Concealment Algorithms

3 Sinusoidal Modeling of Speech
  3.1 Sinusoidal Modeling Evolution
    3.1.1 Related Research
  3.2 Speech Production Model
  3.3 Speech Analysis/Synthesis Based on a Sinusoidal Representation
    3.3.1 Sinusoidal Speech Model
    3.3.2 Sinusoidal Model Parameters Estimation
    3.3.3 Sinusoidal Model Synthesis

4 Sinusoidal Model Based Packet Loss Concealment
  4.1 Generalized Concealment Method
  4.2 Description of the Algorithm
    4.2.1 Building Blocks
    4.2.2 Control Block
    4.2.3 PLC-In Block
    4.2.4 PLC-Out Block
    4.2.5 Sub-Blocks

5 Algorithm Evaluation
  5.1 Evaluation Framework
  5.2 Speech Quality Assessment
  5.3 Test Vectors
  5.4 Network Model
  5.5 Evaluation Results
    5.5.1 Algorithm Configuration
    5.5.2 Objective Voice Quality Performance Test
    5.5.3 Language Dependence Test
    5.5.4 Subjective Preference Test

6 Summary and Future Directions
  6.1 Summary
  6.2 Future Directions
    6.2.1 Using Adaptive Jitter Buffer in Concealment Process
    6.2.2 Using More Consecutive Frames in Concealment Process
    6.2.3 Using Stationary Speech Segmentation for Analysis Window Adaptation Logic
    6.2.4 Complexity Reduction for Real-Time Application

Appendix A. Parabolic Interpolation
Appendix B. Sinusoidal Model Matching
Appendix C. Perceptual Evaluation of Speech Quality
Appendix D. NIST Net Emulation Tool
Appendix E. Segmentation Based on Sinusoidal Modeling of Speech

References


List of Abbreviations and Acronyms

A/D Analog to Digital conversion

AbS Analysis-by-Synthesis

ACELP Algebraic Code Excited Linear Prediction

ACR Absolute Category Rating

ADPCM Adaptive Differential Pulse Code Modulation

AM Amplitude Modulation

ANIQUE+ Auditory Non-Intrusive Quality Estimation Plus

ANSI American National Standards Institute

AR Auto Regressive

ARPANET Advanced Research Projects Agency Network

CELP Code Excited Linear Prediction

CODEC COder / DECoder

CS-ACELP Conjugate Structure Algebraic Code Excited Linear Prediction

D/A Digital to Analog conversion

DFT Discrete Fourier Transform

DPCM Differential Pulse Code Modulation

DSTFT Discrete Short-Time Fourier Transform

DUT Device Under Test

ETSI European Telecommunications Standards Institute

FEC Forward Error Correction


FIFO First-In First-Out

GSM Global System for Mobile Communications

IETF Internet Engineering Task Force

iLBC Internet Low Bit Rate Codec

IMTC International Multimedia Teleconferencing Consortium

IP Internet Protocol

ISDN Integrated Services Digital Network

ISOC Internet SOCiety

ISP Internet Service Provider

ITSP Internet Telephony Service Provider

ITU International Telecommunications Union

ITU-D ITU - Telecommunications Development Sector

ITU-R ITU - Radiocommunication Sector

ITU-T ITU – Telecommunication Standardization Sector

LAN Local Area Network

LP Linear Prediction

LSP Line Spectral Pair

MOS Mean Opinion Score

MOSLE MOS Listening Effort

MOSLP MOS Loudness Preference

MOS-LQO MOS- Listening Quality Objective

MPE Multi-Pulse Excited

MSE Mean Square Error

MSM Multiresolution Sinusoidal Modeling


OLA Overlap-Add

PAMS Perceptual Analysis Measurement System

PARSHL Analysis/Synthesis Program for Non-Harmonic Sounds Based on a Sinusoidal Representation

PC Personal Computer

PCM Pulse Code Modulation

PESQ Perceptual Evaluation of Speech Quality

PLC Packet Loss Concealment

POLQA Perceptual Objective Listening Quality Analysis

POTS Plain Old Telephone Service

PSQM Perceptual Speech Quality Measure

PSTN Public Switched Telephone Network

PT Payload Type

QMF Quadrature Mirror Filter

QoS Quality of Service

RFC Request For Comments

RPE Regular-Pulse Excited

RTCP Real-time Transport Control Protocol

RTP Real-time Transport Protocol

SID Silence Insertion Descriptor / Silence Descriptor

SIP Session Initiation Protocol

SMS Spectral Modeling Synthesis

SNR Signal-to-Noise Ratio

SRTP Secured Real Time Protocol

SSNR Segmental Signal-to-Noise Ratio


SSRC Synchronization Source

STFT Short-Time Fourier Transform

SVOPC Sinusoidal Voice Over Packet Coder

TCP Transmission Control Protocol

TELCO TELephone Companies

UDP User Datagram Protocol

VAD Voice Activity Detection

VoIP Voice over Internet Protocol

VoPN Voice over Packet Network

W3C World Wide Web Consortium

WSOLA Waveform-Similarity-Based Synchronized Overlap-Add


List of Notations and Symbols

Notations

X Estimated signal/model/parameter

X Normalized/Un-scaled signal  X OLA smoothed signal/model/parameter

X Matched signal/model/parameter

X p “Previous” signal/model/parameter

Xc “Current” signal/model/parameter

Xn “Next” signal/model/parameter  X Extrapolated forward signal/model/parameter  X Extrapolated backward signal/model/parameter  X Interpolated signal/model/parameter

Symbols fs Sampling frequency in Hz i Discrete frame index n Discrete time index

Ss,{ 1 ,.., si ,..} Input speech stream of packets

Cc,{ 1 ,.., ci ,..} Encoded speech stream of packets

SsSM,{ SM,1 ,.., s SM ,i ,..} SM-PLC output stream of estimated speech packets

SsIII,{ III,1 ,.., s III ,i ,..} G.722 App.III output stream of estimated speech packets

SsIV,{ IV,1 ,.., s IV ,i ,..} G.722 App.IV output stream of estimated speech packets

XI Sinusoidal Model Based Packet Loss Concealment for Wideband VoIP Applications

BFIi Discrete Bad Frame Indication

BFIi Bad Frame Indication vector

NFFT FFT length

N f Length of input frame

Nh Number of history samples

Nla Number of look-ahead samples (available in jitter buffer)

Niss Length of initial speech segment

Nsss Length of stationary speech segment

Na Length of analysis signal

Nc Length of concealing signal

N Scaling gain length g

NSM Length of SM-PLC output signal

NCLC Consecutive loss counter

NOLA OLA length

NNjf snii, s ( ) T Speech signal of frame i n=0,1,..,N f − 1

NNhh Snhi,,, S hi ( ) T History samples respectively to frame i n=[0,1,..,Nh − 1]

NNla la Snla,, i, S la i ( ) T Look-Ahead samples respectively to current i frame n=[0,1,..,Nla − 1]

NNiss iss sniss, s iss ( ) = − T Initial speech segment n [0,1,..,Niss 1]

NNsss sss snsss, s sss ( ) = − T Stationary speech segment n [0,1,..,Nsss 1]

NNaa snaa, s ( ) = − T Analysis signal n [0,1,..,Na 1]

NNcc snci,,, s ci ( ) = − T Concealing signal of frame i n [0,1,..,Nc 1]

NNff T sntr,, i, s tr i ( ) = − Transition signal of frame i n 0,1,..,N f 1

NNgg T gnii, g ( ) = − Scaling gain for frame i n 0,1,..,Ng 1

NNSM SM snSM,, i, s SM i ( ) = − T SM-PLC output signal of frame i n [0,1,..,NSM 1]

XII Sinusoidal Model Based Packet Loss Concealment for Wideband VoIP Applications

i M Sinusoidal model of i th frame  Mi “Forward” extrapolation of i th sinusoidal model  Mi “Backward” extrapolation of i th sinusoidal model

 −+ Interpolation between two sinusoidal models ( i −1 and Mii1, 1 i +1) Li Number of sine waves components of the i th frame

i th th Al Amplitude of the l sine wave of the i frame

i th th ωl Frequency of the l sine wave of the i frame

i th th θl Phase of the l sine wave of the i frame

ε i DC of the i th frame


List of Figures

Figure 1-1: Spectrum of a narrowband and a wideband speech signal
Figure 1-2: Voice quality technologies in different networks
Figure 2-1: VoIP topology
Figure 2-2: Media packetizing
Figure 2-3: Typical RTP header
Figure 2-4: Typical VoIP system
Figure 2-5: Main delay sources in VoIP
Figure 2-6: Number of consecutively lost packets (after Bolot et al. [30])
Figure 2-7: Consecutively lost packets distribution (after Bolot et al. [30])
Figure 2-8: Classification of subjective quality methods and related ITU standards and recommendations
Figure 2-9: Intrusive and non-intrusive types of quality assessments
Figure 2-10: Non-intrusive monitoring of listening and conversational quality over the network
Figure 2-11: E-model mapping of R-factor to MOS
Figure 2-12: PESQ mapping functions
Figure 2-13: Sender-based concealment types (after Perkins et al. [4])
Figure 2-14: Receiver-based concealment types (after Perkins et al. [4])
Figure 2-15: LP-based packet loss concealment (after Gündüzhan and Momtahan [59])
Figure 2-16: Interpolation based PLC (after Lindblom and Hedelin [64])
Figure 3-1: Generalized source-filter model of speech
Figure 3-2: Source-filter model of speech
Figure 3-3: Block diagram of the sinusoidal analysis
Figure 3-4: Frequency tracking (after McAulay and Quatieri [25])
Figure 3-5: Block diagram of the sinusoidal synthesis
Figure 4-1: VoIP receiver-based concealment block diagram
Figure 4-2: Proposed PLC top-level block diagram
Figure 4-3: Control block
Figure 4-4: PLC-In block
Figure 4-5: PLC-Out block
Figure 4-6: Sinusoidal analysis
Figure 4-7: Sinusoidal synthesis
Figure 4-8: OLA backward matching illustration
Figure 4-9: OLA forward matching illustration
Figure 4-10: Scaling
Figure 4-11: Proposed PLC detailed-level block diagram
Figure 5-1: Proposed testing scheme for SM-PLC compared to G.722 PLC appendixes
Figure 5-2: Evaluation framework
Figure 5-3: Network emulation recording
Figure 5-4: Various packet loss rates simulated by NIST Net emulation tool
Figure 5-5: Sensitivity to sinusoidal model order
Figure 5-6: Sensitivity to frequency resolution
Figure 5-7: Parabolic interpolation effect on objective performance
Figure 5-8: Objective voice quality (P-Series Sup. 23)
Figure 5-9: Standard deviation of objective voice quality
Figure 5-10: Objective voice quality on Italian (P-Series Sup. 23)
Figure 5-11: Objective voice quality on French (P-Series Sup. 23)
Figure 5-12: Objective voice quality on Japanese (P-Series Sup. 23)
Figure 5-13: Objective voice quality on North-American English (P-Series Sup. 23)


List of Tables

Table 2-1: Popular wideband voice codecs in VoIP
Table 2-2: Listening-quality scale
Table 2-3: Standard coding MOS
Table 2-4: Comparison of subjective and objective methods for quality assessment
Table 4-1: Summary of major PLC-In control operations
Table 4-2: Summary of major PLC-Out control operations
Table 4-3: Description of PLC-In operations
Table 4-4: Description of PLC-Out operations
Table 5-1: Structure of the objective evaluation dataset (a subset of ITU-T Sup. P.23)
Table 5-2: Structure of the informal evaluation dataset (a subset of ITU-T Sup. P.23)
Table 5-3: Summary of model order selection test results
Table 5-4: Summary of frequency resolution test results
Table 5-5: Summary of frequency parabolic interpolation test results
Table 5-6: Summary of objective voice quality test results
Table 5-7: Summary of standard deviation test results
Table 5-8: Summary of objective voice quality results on Italian
Table 5-9: Summary of objective voice quality results on French
Table 5-10: Summary of objective voice quality results on Japanese
Table 5-11: Summary of objective voice quality results on North-American English
Table 5-12: Summary of subjective preference test results


1 Introduction

1.1 Internet Telephony

The Voice over Internet Protocol (VoIP) technology has entered its fifth decade since its early seeds in the ARPANET of the late 1960s [1]. Interest in VoIP has grown continuously in recent years among Internet Service Providers (ISPs), TELephone COmpanies (TELCOs) and Internet Telephony Service Providers (ITSPs) as a replacement for the Plain Old Telephone Service (POTS) infrastructure. The new VoIP technology benefits from toll bypass, network consolidation and service convergence: it allows long-distance calls at local call charges, unification of the network infrastructure to support both audio and video services, cost reduction, etc. However, designing a VoIP network requires special attention to ensure that voice quality is not degraded [2, 3].

The voice is carried from point to point by first encoding short segments of the digitized speaker signal, which are then gathered into packets with headers. The voice packets are sent from one end over the IP network to the other end, where they are decoded and played out to the listener. Asynchronous arrival of voice packets at the receiving end is caused by queues and prioritization policies in network elements, changing routing paths and processing delays. This phenomenon, i.e., the varying interval of packet arrival (jitter), is usually eliminated by using a jitter buffer. The principle is to add a delay, so that packets arriving within this delay are handled rather than dropped. Hence, the process of de-jittering provides seamless play-out at the expense of additional delay. Although the listener experiences the additional delay introduced by the jitter buffer, and this delay degrades voice quality, it is considerably low (except in cases where high end-to-end delay pre-exists) [3]. Higher voice quality is often achieved by incorporating an adaptive jitter buffer, in which the delay window size is dynamically adapted to the actual measured jitter.
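To make the de-jittering principle above concrete, the following minimal Python sketch buffers arriving packets and releases them in sequence after a fixed playout delay. The class name, delay values and timing model are illustrative assumptions for this sketch, not part of any standard or of the algorithm proposed later.

```python
import heapq

class FixedJitterBuffer:
    """Illustrative fixed playout-delay jitter buffer (sketch only)."""

    def __init__(self, delay_ms=60):
        self.delay_ms = delay_ms   # assumed fixed de-jittering delay
        self.heap = []             # min-heap ordered by sequence number
        self.next_seq = None       # next sequence number to play out

    def push(self, seq, payload, arrival_ms, send_ms):
        # Packets arriving later than their scheduled playout time are
        # dropped, i.e., treated as lost by the receiver.
        if arrival_ms > send_ms + self.delay_ms:
            return
        heapq.heappush(self.heap, (seq, payload))

    def pop(self):
        # Called once per frame interval by the playout process.
        if self.next_seq is None and self.heap:
            self.next_seq = self.heap[0][0]
        if self.heap and self.heap[0][0] == self.next_seq:
            _, payload = heapq.heappop(self.heap)
            self.next_seq += 1
            return payload
        if self.next_seq is not None:
            self.next_seq += 1     # advance past the gap
        return None                # gap: the decoder must conceal this frame
```

A returned `None` is precisely the situation addressed in this thesis: the decoder has no packet to decode, and a concealment algorithm must fill the frame.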


In a network under congestion conditions, voice packets are often discarded or delayed significantly beyond the window depth of the jitter buffer, leaving the decoder with no packet to decode. In cases where no special processing is performed, voice quality degrades. Usually in such cases, a Packet Loss Concealment (PLC) algorithm is activated in order to conceal the missing voice packets. The main purpose is that the listener will not notice the annoying sounds caused by the packet loss. PLC algorithms are often classified into receiver-based and transmitter-based methods. In receiver-based methods, the concealment is performed using only the voice packets available at the receiver side. In transmitter-based methods, the sender adds redundant information to the voice packets, which the receiver can exploit in packet loss scenarios [4]. There is considerable interest in developing purely receiver-based PLC algorithms, which are independent of the coding method and can be applied without special interoperability implications or standardization. This work focuses on a receiver-based approach.

1.2 Voice Quality Evolution

In traditional Public Switched Telephone Network (PSTN) telephony, all communications are based on the traditional PSTN limited bandwidth (200-3400 Hz, with a sample rate of 8 kHz). Although most of the raw voice data is contained in this bandwidth (referred to as narrowband), the quality of sound is low, producing phonemes such as "P" and "T", "F" and "S", and "M" and "N" that sound very similar (as their critical energy is carried predominantly in the higher frequencies). Voice over Packet Networks (VoPN) allow the use of a wider speech bandwidth of 50-7000 Hz with a sample rate of 16 kHz (referred to as wideband), offering a higher voice quality than the PSTN and creating a face-to-face live experience. Figure 1-1 describes the wideband extension on the spectrum scale. The addition of the 50-200 Hz band improves naturalness and presence, and the addition of the 3400-7000 Hz band improves intelligibility, especially in unvoiced regions, which spread to the higher frequencies.


Figure 1-1: Spectrum of a narrowband and a wideband speech signal

Because of the dramatic quality improvement attainable, the next generation of speech CODECs (COders/DECoders) for VoIP is wideband. The first wideband vocoder1 (G.722) was defined by the ITU-T in 1988, when several organizations realized the benefit of using a wider bandwidth for telephony coding. Figure 1-2 shows the history of traditional PSTN and cellular networks and the development of VoIP technology over the years. The first wideband voice deployments are only just beginning to occur [5].

Figure 1-2: Voice quality technologies in different networks

1 Vocoder - a voice coder device that usually consists of a speech analyzer, which converts analogue speech waveforms into narrowband/wideband digital signals, and a speech synthesizer, which converts the digital signals into artificial speech sounds.


1.3 Research Goals and Results

The goal of this research is to develop a new method for performing concealment in scenarios where voice packets are lost. The proposed PLC (called SM-PLC) is independent of the speech CODEC and is designed for use in wideband VoIP and voice streaming applications. The research results include a new algorithm for PLC, which compensates for the missing packets using a sinusoidal modeling approach and phase matching. Moreover, the algorithm can use a subsequent packet for concealment (in case it is present in the jitter buffer). From the point of view of the decoding stage, our algorithm is non-causal, but in applications where the jitter buffer and decoder are coupled, subsequent packets are often available to the concealment algorithm. Objective voice quality experiments were performed, showing that the proposed algorithm outperforms standardized algorithms and yields good quality at higher loss rates. Our subjective tests also showed that the algorithm does not concede in terms of voice quality to the standardized wideband recommendations.

1.4 Document Outline

The work is organized as follows: Chapter 2 – introduces Voice over IP, including the underlying transport and encoding protocols; it then overviews commonly used voice quality testing methods, including intrusive and non-intrusive approaches, and covers the problem of packet loss and the classification of recovery methods. Chapter 3 – provides an overview of the sinusoidal modeling of speech in terms of its evolution and the classical mathematical model; it covers estimation and synthesis using the sinusoidal model, on which the proposed concealment algorithm is based. Chapter 4 – describes the proposed concealment algorithm in detail, from top level to low level; the concealment algorithm integrated into the receiver is overviewed for both the logic and the signal processing parts. Chapter 5 – describes the evaluation framework and provides the results of the tests performed using the proposed algorithm.


The algorithm is verified with several configurations and packet loss rates, and it is compared with the standardized concealment algorithms. Chapter 6 – concludes the thesis with a summary of the work and future directions that we believe may improve performance and integration into real-time Voice over IP devices.


2 Overview of Voice over IP Technologies

In recent years, many VoIP services have been introduced over broadband internet access services, in which subscribers initiate and receive calls as they would on the traditional PSTN. Providers offer a full telephony service using VoIP, allowing in-bound calls with unlimited time for a flat monthly fee and out-bound calls at local call charges for both long and short distances. VoIP can provide services that were more difficult or expensive to implement over the traditional PSTN, e.g., the ability to carry more than one telephone call at a time on the same broadband connection, conference calls, secure calls, location-independent connection and more.

2.1 VoIP Standards and Organizations

Two key contributors are in charge of maintaining the standards that influence VoIP technologies. The first is the International Telecommunications Union (ITU). The ITU's work dates back to the 1860s, when agreements were developed to support connections between individual countries' telegraph facilities. The ITU has expanded and grown, and at present its work is divided into three sectors: the Radiocommunication Sector (ITU-R), which manages the available wireless spectrum; the Telecommunication Standardization Sector (ITU-T), which develops internationally agreed networking standards; and the Telecommunications Development Sector (ITU-D), which endeavors to make modern telecommunications services available to people in developing countries. The second contributor is the worldwide Internet SOCiety (ISOC). The Internet Society has served as the global clearinghouse for Internet-related technologies since 1992 [6], and as such is substantially younger than the ITU. This age difference causes a difference in focus as well: where the ITU has a rich history in circuit-switched communications, such as voice, the more youthful ISOC concentrates on packet switching and data transmission.


Other organizations also have influence on VoIP standards, but with a more regional or technology-specific focus. These include the American National Standards Institute (ANSI), the European Telecommunications Standards Institute (ETSI), the World Wide Web Consortium (W3C) and the International Multimedia Teleconferencing Consortium (IMTC). VoIP standards can be subdivided into two major types: signaling protocols and media protocols. Popular signaling protocols include Request For Comments 2543 (RFC 2543) - the Session Initiation Protocol (SIP) [7], ITU-T H.323 [8] and others, which set the infrastructure for controlling VoIP sessions carrying various media streams. Media protocols are divided further into two categories: media transport and media encoding protocols.

2.2 VoIP Architecture

Figure 2-1 shows a simplified map of the major components (a partial list) constituting a VoIP network that are related to the target devices implementing concealment algorithms [9]. The VoIP gateway connects traditional telephone interfaces (telephones, fax machines, PSTN interfaces such as E1/T1, ISDN) to the IP network. The IP phone is similar to the traditional telephone, but it has VoIP inside it and is able to connect to the IP network directly. The PC is similar to the IP phone, but its VoIP is implemented in software alongside other PC applications. Additional VoIP components that connect to the IP network include servers for management and administration, call control, etc. The IP network itself may be of different sizes and locations; e.g., it can be an internal organization network or a public domain such as the internet.

Figure 2-1: VoIP topology


2.3 Media Transport Protocols

The media transport protocols include the Real-time Transport Protocol (RTP), Secured RTP (SRTP) and the Real-time Transport Control Protocol (RTCP) [6, 10]. These protocols are implemented on top of the User Datagram Protocol (UDP). UDP is a connectionless protocol, meaning that no setup takes place before transmission. Unlike the Transmission Control Protocol (TCP), UDP provides no flow control or error recovery mechanism and is therefore called an unreliable protocol. The simplicity of UDP and its low protocol handling overhead in comparison to TCP make it more suitable for real-time voice call transmission. UDP has no mechanism to ensure that data packets are delivered in sequential order, nor does it provide Quality of Service (QoS) guarantees. The lack of such mechanisms in VoIP implementations imposes problems of latency, jitter and packet loss. The receiving end has to reconstruct packets that may be out of order, delayed or missing, while ensuring that the audio stream maintains proper time consistency. The principal cause of packet loss is congestion in the IP network, which can be controlled by congestion management and avoidance. Many approaches have been suggested [4, 11, 12, 13] to handle the packet loss problem in an end-to-end framework with open-loop error control for voice transmission over the internet. The variation in packet delivery interval is called jitter. Its effects can be mitigated by storing voice packets in a buffer (memory) upon arrival, before playing them out. This avoids a condition known as buffer under-run, in which the play-out process runs out of voice data to play because the next voice packet has not yet arrived. However, delay increases with buffer length.

2.4 Media Encoding Protocols

The media encoding protocols include several coding schemes, which compress the digitized voice or video to reduce bandwidth consumption (increasing throughput by at least a factor of 10). Some of the popular wideband voice coding standards commonly found in VoIP systems are standardized by the ITU-T and are given in Table 2-1.


Coding Standard | Algorithm Principle | Bit Rate (kbps) | Voice Quality | Application
G.711.1 | PCM | 64, 80, 96 | Full wideband quality (at 96 kbps) | High-quality speech services over broadband networks, especially IP telephony and multi-point speech conferencing
G.722 | SB-ADPCM | 48, 56, 64 | Commentary (at 64 kbps) | ISDN; video conferencing
G.722.2 | ACELP | 6.6-24 | Good (at 12.65 kbps and higher) | ISDN; video conferencing; VoPN; 3G wireless
G.729.1 | CELP | 8-32 | Rich wideband quality (at 32 kbps) | VoIP (IP telephony) including IP phones, other VoIP handsets, soft phones, IP PBXs; media servers/gateways; call center equipment; voice recording equipment; test equipment; audio/video conferencing for enterprise corporate networks or for the mass market (like PSTN emulation over xDSL or wireless access); voice messaging servers

Table 2-1: Popular wideband voice codecs in VoIP

2.5 Media Packetizing

Voice packets are built from one or more concatenated coded speech frames, where each frame represents a duration of 5 ms to 30 ms. The coded speech frames are the output of a speech coder predefined by the application (usually various coding schemes are available). In VoIP, the coding scheme is usually determined during negotiation within the voice call establishment procedure. Voice packet size is set according to the following:

• Coder type - determines the rate at which the speech is compressed.
• Packetization time - indicates how many speech frames are concatenated together. Although a longer packetization time consumes less bandwidth, it is more vulnerable to packet loss, since a single packet loss can introduce the loss of the several speech frames encapsulated in that packet.


The coded speech payload of an RTP packet is limited to a maximum of 320 bytes. A typical Local Area Network (LAN) type packet encapsulation for RTP is demonstrated in Figure 2-2 [9].

Field  | ETH | IP | UDP | RTP | Coded Speech               | CS
Octets | 14  | 20 | 8   | 12  | according to coder, <= 320 | 4

Figure 2-2: Media packetizing
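As a worked example of the trade-off noted above, the short Python sketch below computes the IP-level bandwidth of a voice stream from the Figure 2-2 overhead figures; the function name and the chosen packetization times are illustrative assumptions.

```python
# Per-packet overhead from Figure 2-2 (LAN encapsulation, octets).
ETH, IP, UDP, RTP, CS = 14, 20, 8, 12, 4
OVERHEAD = ETH + IP + UDP + RTP + CS   # 58 octets per packet

def voip_bandwidth_kbps(coder_kbps, packet_time_ms):
    """IP-level bandwidth for a coder rate and packetization time."""
    payload_octets = coder_kbps * packet_time_ms / 8        # coded speech
    packets_per_s = 1000 / packet_time_ms
    return (payload_octets + OVERHEAD) * 8 * packets_per_s / 1000

# G.722 at 64 kbps: longer packets amortize the 58-octet header cost,
# but each lost packet then discards more speech frames.
for pt_ms in (10, 20, 40):
    print(pt_ms, "ms ->", voip_bandwidth_kbps(64, pt_ms), "kbps")
```

For 64 kbps coded speech, the header overhead alone drops from about 46.4 kbps at 10 ms packetization to about 11.6 kbps at 40 ms.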

2.5.1 Real-Time Transport Protocol (RTP)

Real-time applications require mechanisms to ensure that a stream of data can be reconstructed accurately: datagram packets must be reconstructed in the correct order, and tools for detecting network delays must be provided. The data is buffered at the receiving end of the link so that it can be played out at a constant rate, reducing the effects of variation in delay (jitter). To support these requirements, two protocols have been developed: the Real-time Transport Protocol (RTP) and the RTP Control Protocol (RTCP). RTP transports the digitized samples of real-time information; RTCP provides feedback on the quality of the transmission link. RTP and RTCP do not reduce the overall delay of the real-time information, nor do they make any guarantees concerning quality of service. A typical 12-byte RTP header used in VoIP applications, which precedes the data payload, is shown in Figure 2-3 below [10]:

bit 0                                        bit 31
+---+---+---+----+---+----+------------------+
| V | P | X | CC | M | PT | sequence number  |
+---+---+---+----+---+----+------------------+
|                  time stamp                |
+--------------------------------------------+
| synchronization source (SSRC) identifier   |
+--------------------------------------------+

Figure 2-3: Typical RTP header

Where,

• Version (V) - identifies the version of RTP (currently 2).
• Padding (P) - a flag indicating the existence of padding octets after the payload data, used, e.g., by encryption algorithms with fixed block sizes.
• Header extension (X) - a flag indicating that an optional extension follows the fixed RTP header.


• CSRC count (CC) - although not shown in this (typical) header, the header can optionally be expanded to include a list of up to 15 contributing sources. In point-to-point cases, CSRCs are not required.
• Marker (M) - allows significant events, such as frame boundaries, to be marked in the packet stream.
• Payload Type (PT) - identifies the format of the RTP payload and determines its interpretation by the application (e.g., which codec type corresponds to the packet data).
• Sequence number - a reference number that increments by one for each RTP packet sent. It allows the receiver to reconstruct the sender's packet sequence, i.e., handle out-of-order packets and verify in-order arrival.
• Timestamp - the time at which this packet was transmitted. This field is used in the jitter calculation for adapting the jitter buffer in order to produce continuous play-out.
• Synchronization Source (SSRC) number - a randomly chosen number that identifies the source of the data stream.
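The fixed part of this header can be unpacked directly from the byte layout of Figure 2-3 (as specified in RFC 3550). The following Python sketch is illustrative; the function name and the returned field names are our own.

```python
import struct

def parse_rtp_header(packet: bytes) -> dict:
    """Parse the fixed 12-byte RTP header of Figure 2-3."""
    if len(packet) < 12:
        raise ValueError("packet shorter than the fixed RTP header")
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    return {
        "version": b0 >> 6,           # V: currently 2
        "padding": (b0 >> 5) & 1,     # P
        "extension": (b0 >> 4) & 1,   # X
        "csrc_count": b0 & 0x0F,      # CC: up to 15 contributing sources
        "marker": b1 >> 7,            # M: e.g., frame boundary
        "payload_type": b1 & 0x7F,    # PT: identifies the codec
        "sequence_number": seq,       # detects loss and reordering
        "timestamp": ts,              # used for jitter estimation
        "ssrc": ssrc,                 # identifies the stream source
    }
```

A receiver typically compares consecutive `sequence_number` values to detect lost or reordered packets and uses `timestamp` differences to estimate jitter.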

2.6 Voice Quality in Internet Telephony

Several identified factors affect voice quality in internet telephony. A simplified block diagram of the speech processing blocks of a typical VoIP system is depicted in Figure 2-4 below [14]:

Figure 2-4: Typical VoIP system

At the transmitting end, the speech signal is first digitized in an Analog-to-Digital (A/D) converter. Next, a preprocessing block performs actions such as echo cancellation, noise suppression, automatic gain control, and voice activity detection, depending on the needs of the system and the end user's environment.


Thereafter, the speech signal is encoded by a speech codec and the resulting bit stream is transmitted in IP packets. After the packets have passed through the IP network, the receiving end converts the packet stream to a voice signal using the following basic processing blocks: a jitter buffer receiving the packets, speech decoding and post-processing, e.g., packet loss concealment. It is therefore of utmost importance that the characteristics of the IP network be taken into account in the design and implementation of VoIP products, as well as in the choice of components such as the speech codec. Next, we look closer at the major factors that affect voice quality:

• Delay
The delay experienced in a call occurs on the transmitting side, in the network and on the receiving side. Most delay on the transmitting side is due to codec delay (packetizing and look-ahead) and processing delay. In the network, most delay stems from transmission time (serialization and propagation) and router queuing time. Finally, the jitter buffer depth, processing and, in some implementations, polling intervals add to the delay on the receiving side [15]. See Figure 2-5 [14] for the main delay sources.

Figure 2-5: Main delay sources in VoIP

Delay variation (jitter) is usually handled at the receiver side with a jitter buffer, which removes the effect of delay variation and reorders packets in cases of out-of-order arrival [16]. The delay introduced by the jitter buffer adds to the total end-to-end delay, which must be within applicable limits in order to maintain an interactive and intelligible conversation. It is not possible to give a single number for how much latency is acceptable, because the perceived quality is affected by additional factors, e.g., the presence of echo. Hence, only some guidelines are applicable.


The varying network conditions require better delay adjustment; thus, adaptive delay is often used in jitter buffers. In order to limit the effect of delay on voice quality, the ITU recommends that the end-to-end delay be below 150 ms [3].
• Electrical (Network) and Acoustical Echoes
Echo is a severe distraction if the round-trip delay is longer than 40 ms. Since the delays in IP telephony systems are significantly higher, the echo is clearly audible to the speaker. Cancelling echo is therefore essential to maintaining high quality. Two types of echo can deteriorate speech quality: network echo and acoustic echo (according to [15]). Network echo originates from impedance mismatch in PSTN hybrids. The echo path is stationary, except when the call is transferred to another handset or when a third party is connected to the phone call; this results in an abrupt change in the echo path. Acoustic echo occurs commonly in hands-free equipment and small devices. It occurs when the loudspeaker's sound reflects back to the microphone in an enclosed environment, and also because of the proximity of the loudspeaker and the microphone. The acoustic echo path is non-stationary.
• Coding scheme
The basic algorithmic building block in a VoIP system, which is always needed, is the speech codec. When initiating a voice call, a session protocol is used for the call setup, where both sides agree on which codec to use. The quality of speech produced by the speech codec defines the upper limit for the achievable end-to-end quality. It determines the sound quality under perfect network conditions, in which there are no packet losses, delays, jitter, echoes or other quality-degrading factors. The initial advantage of internet telephony was its flexible and efficient utilization of bandwidth. However, with decreased bandwidth, voice compression algorithms introduce degradation in voice quality due to their lossy nature. Many compression algorithms [17] have been proposed for narrowband (8 kHz) and wideband (16 kHz) sampling rates, waveform or model based.


Among the waveform codecs are uncompressed Pulse Code Modulation (PCM) [18, 19], Differential PCM (DPCM) and Adaptive Differential PCM (ADPCM) [2, 20, 21]. Among model-based codecs, the most successful (and most commonly used) are time-domain Analysis-by-Synthesis (AbS) codecs. Some of them (e.g., the Multi-Pulse Excited (MPE), the Regular-Pulse Excited (RPE) and the Code-Excited Linear Predictive (CELP) codecs [22, 23, 24]) use the same linear prediction filter model of the vocal tract as Linear Predictive Coding (LPC) vocoders. Others perform sinusoidal coding, where the speech is represented by a sum of sinusoids [25, 26, 27].
• Noise
Additive background noise is often introduced into the voice conversation by one or more environmental noise sources. Speech enhancement algorithms, which aim to remove background noise, rely mostly on the assumption that noise is more stationary than speech [4].
• Packet loss
Packets are lost mainly due to congestion conditions in network elements and under-run conditions in jitter buffer play-out [16]. A number of coded speech frames are often concatenated into a single packet in order to reduce the bandwidth overhead of lower communication layer headers; therefore, a single packet loss may imply the loss of several consecutive frames. Without proper handling, packet loss introduces major degradation in the perceived voice. Many concealment methods have been proposed, and the major approaches are presented in Section 2.9.
• Hardware
Microphones, speakers and other audio devices taking part in the audio circuit affect the user's perceived quality. Several acoustic tests are carried out to test the audio circuit quality, including frequency response, loudness rating, volume control and idle channel noise [28].


2.7 Characterization of Packet Loss

Basically, the internet does not provide guaranteed quality of service; therefore, bandwidth, delay and packet loss rate are not guaranteed. Packet loss can arise from congestion in routers, where overflowing routing queues or other network elements discard packets, causing packet loss at the receiver side. Moreover, high delay in packet arrival can also yield packet loss, since the receiver has nothing to play out. As the internet grows and evolves, measurements for end-to-end packet characterization are becoming harder to perform. Paxson [29] cites the increasing network heterogeneity and the great logistical difficulties involved in larger-scale measurement. Bolot et al. [30, 31] have shown characteristics of packet loss in audio streams sent over the internet using the RTP protocol. The authors measured the audio loss process and reported the distributions of consecutively lost audio packets under different internet traffic load conditions. It was shown that the consecutive loss of packets is relatively small, especially in the non-loaded scenario. Figure 2-6 depicts a measurement taken between INRIA Sophia Antipolis in France and University College London in the UK at 3:00 pm in order to track audio stream transmission losses (the duration of the measurement is 20 minutes). The measurements were made using the G.711 coder [18] with 320 bytes per packet (or 40 ms of speech).

Figure 2-6: Number of consecutively lost packets (after Bolot et al. [30])

Figure 2-7 depicts the distribution function of the number of consecutive losses in the measurements.


Figure 2-7: Consecutively lost packets distribution (after Bolot et al. [30])
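The gap-length statistic shown in Figures 2-6 and 2-7 can be computed from a binary loss trace with a few lines of code. The sketch below, with a synthetic trace for illustration, counts runs of consecutively lost packets and normalizes them into a distribution; it is not the measurement code used by Bolot et al.

```python
from collections import Counter

def consecutive_loss_distribution(loss_trace):
    """Distribution of run lengths of lost packets in a binary trace
    (1 = lost, 0 = received)."""
    runs, run = Counter(), 0
    for lost in loss_trace:
        if lost:
            run += 1
        elif run:
            runs[run] += 1
            run = 0
    if run:
        runs[run] += 1
    total = sum(runs.values())
    return {length: count / total for length, count in sorted(runs.items())}

# Synthetic trace: mostly isolated losses plus one burst of three.
print(consecutive_loss_distribution([0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0]))
```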

2.8 Voice Quality Evaluation

Voice quality is essentially a subjective measure of the clarity, inflection and tone of the conversation between a caller and the recipient. Many characteristics influence the perception of quality, including the environment, premises equipment, etc. In general, the human ear has come to define "acceptable" voice quality within a limited range of values. With the evolving speech coding technologies embedded in today's internet telephony, there is a clear need for voice quality assessment; moreover, research works in the area of speech coding need to be evaluated under a variety of test conditions. Voice quality testing methods can be divided into two groups: subjective and objective. Since the ultimate goal of voice quality testing is to obtain a measure of the quality perceived by humans, subjective testing is the more relevant technique. The results of subjective tests are often presented as Mean Opinion Scores (MOS) [32]. However, subjective tests are costly, and quality ratings vary from user to user. As a result, automatic, objective measuring techniques were developed to overcome these shortcomings. The objective methods provide a lower-cost and simpler alternative; however, they are highly sensitive to processing effects and other impairments [14]. Voice quality is a complex psychoacoustic outcome of the human perception process. As such, it is necessarily subjective, and can be assessed through listening tests involving human subjects who listen to a speech sample and assign a rating to it. A classification of the most popular subjective tests standardized by the ITU is shown in Figure 2-8 below [14]:


Figure 2-8: Classification of subjective quality methods and related ITU standards and recommendations

Alternatively, the classification of objective quality measures can be based on the type of input information they require: intrusive quality measures require access to both the original and the distorted speech signal, while non-intrusive measures base their estimate only on the distorted signal, as illustrated in Figure 2-9 below [14]:

Figure 2-9: Intrusive and non-intrusive types of quality assessments. Non-intrusive algorithms do not have access to the reference signal.

Objective measures are continuously evolving to improve the agreement of their rankings with the perceived user experience. They started from intrusive methods (i.e., methods to which both the reference and the degraded signal are provided), which are divided into two sub-categories [14]:

• Simple Time and Frequency Domain Measures
The simplest class of intrusive objective quality measures consists of waveform-comparison algorithms, such as those based on the Signal-to-Noise Ratio (SNR) and the Segmental SNR (SSNR); a sketch of the SSNR appears after this list. More advanced comparison algorithms include frequency-domain measures, which are known to be significantly better correlated with human perception yet are still relatively simple to implement. One of their critical advantages is that they are less sensitive to signal misalignment. Some of the most popular frequency-domain techniques are the Itakura-Saito (IS), the Cepstral Distance (CD), the Log-Likelihood (LL), and the Log-Area-Ratio (LAR) measures.


• Psycho-acoustically Motivated Measures
The earliest and simplest internal psycho-acoustic measure is the Bark Spectral Distortion (BSD). The BSD is the averaged Euclidean distance between the original and distorted speech signals in the Bark domain. A similar measure is the Information Index (II), in which the auditory system is modeled by dividing the spectrum into 16 critical bands and applying empirical frequency weights and a hearing threshold for each band. The Coherence Function (CF) is a measure of the Signal-to-Distortion Ratio (SDR); its objective is to suppress uncorrelated signals and pass correlated signals. Over time, however, more sophisticated models were developed, such as the Perceptual Analysis Measurement System (PAMS) [33], the Perceptual Speech Quality Measure (PSQM) [34], and the Perceptual Evaluation of Speech Quality (PESQ) [35] with its wideband extension [36]. PESQ is still the most widely accepted objective measure of voice quality to date. In recent years, a new measurement standard that overcomes limitations of PESQ (such as support for "super-wideband" telephony and the most recent technologies) was proposed. This standard is named Perceptual Objective Listening Quality Analysis (POLQA) [37], or ITU-T P.863, and it will be adopted in the coming years.
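As referenced in the first bullet above, the segmental SNR averages per-frame SNR values in dB. A minimal sketch, assuming time-aligned reference and degraded signals and a conventional clamping range, is:

```python
import numpy as np

def segmental_snr_db(reference, degraded, frame_len=256, eps=1e-10):
    """Segmental SNR (dB): mean of per-frame SNRs between a reference
    signal and a time-aligned degraded signal."""
    n = min(len(reference), len(degraded)) // frame_len * frame_len
    ref = np.asarray(reference[:n], dtype=float).reshape(-1, frame_len)
    deg = np.asarray(degraded[:n], dtype=float).reshape(-1, frame_len)
    noise = ref - deg
    snr = 10.0 * np.log10((np.sum(ref ** 2, axis=1) + eps)
                          / (np.sum(noise ** 2, axis=1) + eps))
    # Per-frame SNRs are commonly clamped to a perceptually useful range.
    return float(np.mean(np.clip(snr, -10.0, 35.0)))
```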

The next group of objective quality measures is the non-intrusive methods, which have attracted growing attention from developers in recent years. The best-known algorithms have been standardized; among them are ITU-T P.563 [38] and ANSI ANIQUE/ANIQUE+ [39, 40, 41]. Moreover, there are many non-standardized methods and techniques [42, 43, 44]. The E-model, a conversational quality measure standardized as ITU-T Rec. G.107 [45], also belongs to the non-intrusive methods. The broader scope of conversational quality assessment, as compared to listening quality assessment, is illustrated in Figure 2-10 [14].


Figure 2-10: Non-intrusive monitoring of listening and conversational quality over the network

In the following sections, the major known objective and subjective measures are presented, along with the method selected to evaluate the proposed algorithm in this work.

2.8.1 Mean Opinion Score (MOS)

MOS is the best-known measure of voice quality. It is a subjective method of quality assessment and is described in ITU-T P.800 [32]. There are two main test methods:

• Conversation opinion test
• Listening opinion test

Test subjects judge the quality of the voice transmission system either by carrying on a conversation or by listening to speech samples prepared in advance. The listeners rank the voice quality using the following Absolute Category Rating (ACR) scale.


Quality of Speech | Score
Excellent | 5
Good | 4
Fair | 3
Poor | 2
Bad | 1

Table 2-2: Listening-quality scale

The MOS is then calculated by averaging the test participants' rankings. Additional opinion scale tests are also defined in ITU-T P.800 [32], including

Listening Effort (MOSLE) and Loudness Preference (MOSLP). Using the MOS scale, an average score of 4 and above is considered toll quality. MOS was originally designed to assess the quality of different speech coding standards. The following is a summary of the MOS for the most popular wideband coding standards [5].

Coding Standard | Max Achievable MOS
G.722 | 3.9-4.3
G.722.2 | 4.2-4.5
G.729.1 | 3.9-4.3
G.711.1 | up to 4.7
Microsoft RTA | 3.9-4.3

Table 2-3: Standard coding MOS

It is worth noting that, in order to provide consistency when measuring the MOS, all of the MOS values are reported on the wideband MOS scale instead of the traditional narrowband MOS scale that other systems provide.

2.8.2 E-Model

The E-model is a tool for predicting how an average user would rate the voice quality of a phone call with known characterizing transmission parameters (currently 21 input parameters). The E-model is standardized in ITU-T Rec. G.107 [45] (in August 2008, a prepublished recommendation was issued with Appendix II, describing a provisional impairment factor framework for wideband speech transmission [46]). The objective of the E-model is to determine a transmission quality rating, i.e., the R factor, with a typical range between 0 and 100. The R factor can be converted to estimated listening and conversational quality MOS scores. The E-model does not compare the original and received signals directly. Instead, it uses the sum of equipment impairment factors, each one quantifying the distortion due to a particular factor.


Impairment factors include the type of speech codec, echo, averaged packet delay, packet delay variation, and the fraction of packets dropped. A fundamental assumption is that the impairments on the psychological scale are additive [14]. The R-rating is a linear combination of the individual impairments and is given by the following formula:

$$R = R_0 - I_s - I_d - I_{e\text{-}eff} + A \qquad (2.1)$$

Where,

• $R_0$ - represents in principle the basic signal-to-noise ratio, including noise sources such as circuit noise and room noise
• $I_s$ - a combination of all impairments which occur more or less simultaneously with the voice signal
• $I_d$ - represents the impairments caused by delay
• $I_{e\text{-}eff}$ - the effective equipment impairment; represents impairments caused by low bit-rate codecs and impairments due to packet losses of random distribution
• $A$ - a factor used to allow compensation when there are other advantages of access for the user

The estimated Mean Opinion Score for conversational conditions is denoted $\mathrm{MOS}_{CQE}$ [47]. The estimate is on the scale 1 to 5 (as MOS) and is obtained from the R factor using the following formula:

$$\mathrm{MOS}_{CQE} = \begin{cases} 1, & R < 0 \\ 1 + 0.035R + R(R-60)(100-R) \cdot 7 \cdot 10^{-6}, & 0 \le R \le 100 \\ 4.5, & R > 100 \end{cases} \qquad (2.2)$$

The mapping function is shown in Figure 2-11. The E-model is not a true psychophysical model and cannot be used to accurately predict the absolute opinion of an individual user. However, for a large number of users, the results are sufficiently accurate for transmission planning purposes.
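Equation (2.2) is straightforward to implement; a minimal sketch of the R-factor to $\mathrm{MOS}_{CQE}$ mapping follows (the function name is ours).

```python
def r_to_mos_cqe(r):
    """E-model mapping of the R factor to MOS_CQE, Eq. (2.2) (ITU-T G.107)."""
    if r < 0:
        return 1.0
    if r > 100:
        return 4.5
    return 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7e-6

# The G.107 default parameter set yields R = 93.2, i.e., about 4.4 MOS.
print(round(r_to_mos_cqe(93.2), 2))
```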


Figure 2-11: E-model mapping of R-factor to MOS

2.8.3 Perceptual Evaluation of Speech Quality (PESQ)

PESQ is based on a psychoacoustic model of human hearing. It measures the effects of one-way speech distortion; the effects of loudness loss, delay, sidetone, echo, and other impairments related to two-way interaction are not reflected in PESQ scores. Factors for which PESQ has demonstrated acceptable accuracy are: speech input levels to a codec, transmission channel errors, packet loss and packet loss concealment with CELP codecs, bit rates (if a codec has more than one bit-rate mode), transcodings, environmental noise at the sending side, the effects of varying delay in listening-only tests, and short-term and long-term time warping of the audio signal. It is an objective method of quality assessment. PESQ replaced the older PSQM (ITU-T P.861 [34]), which was used for measuring mobile transmission voice quality. In contrast to VoIP, mobile transmission had a fixed delay between the reference and the test signal; in VoIP this is not the case, and a new algorithm was needed. The new algorithm mainly added time alignment improvements. In addition, an intermediate solution, PSQM+, was developed, but it failed to handle the varying delay between the test and reference signals. ITU-T Rec. P.862 (PESQ) [35] was designed to evaluate narrowband (3.4 kHz) speech quality and cannot deal with wideband (7 kHz) speech quality. Recent research focused on the development of a wideband extension for PESQ, ITU-T Rec. P.862.2 [36].


In order to compare PESQ with MOS, two adjustments were made:

• Scale fit - the PESQ scale was fit into the MOS scale (ACR), i.e., 1 to 5
• Statistical fit - an adjustment was made to increase the correlation between PESQ and MOS scores

These two mappings evolved into the polynomial-fitting-based PESQ-LQ and the standardized exponential-fitting-based MOS-LQO [47] (MOS Listening Quality Objective); the mapping functions and experimental results are given in ITU-T P.862.1 [48]. The PESQ-LQ mapping is given by:

$$y = \begin{cases} 1.0, & x \le 1.7 \\ -0.157268x^3 + 1.386609x^2 - 2.504699x + 2.023345, & x > 1.7 \end{cases} \qquad (2.3)$$

where the argument $x$ denotes the raw PESQ score and the function $y$ denotes the PESQ-LQ score.

The MOS-LQO$_N$ (PESQ-NB, ITU-T P.862.1) mapping is given by:

$$y = 0.999 + \frac{4.999 - 0.999}{1 + e^{-1.4945x + 4.6607}} \qquad (2.4)$$

where the argument $x$ denotes the raw PESQ score and the function $y$ denotes the MOS-LQO$_N$. The aim of the separate recommendation ITU-T P.862.2 is to provide a single mapping from a raw P.862 score for a wideband measurement to the Listening Quality Objective Mean Opinion Score (PESQ-WB). The mapping from the PESQ score to MOS-LQO$_W$ (PESQ-WB, ITU-T P.862.2) [36] can be computed as follows:

$$y = 0.999 + \frac{4.999 - 0.999}{1 + e^{-1.3669x + 3.8224}} \qquad (2.5)$$

where the argument $x$ denotes the raw PESQ score and the function $y$ denotes the MOS-LQO$_W$. Figure 2-12 shows all these PESQ mapping functions:


Figure 2-12: PESQ mapping functions (PESQ-LQ, MOS-LQO$_N$ (P.862.1), and MOS-LQO$_W$ (P.862.2) mapped values versus the raw PESQ (P.862) score)
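For reference, the three mappings (2.3)-(2.5) can be written directly in Python; a minimal sketch follows (function names are illustrative):

import math

def pesq_lq(x):
    # PESQ-LQ polynomial fit of the raw PESQ score, eq. (2.3)
    if x <= 1.7:
        return 1.0
    return -0.157268 * x**3 + 1.386609 * x**2 - 2.504699 * x + 2.023345

def mos_lqo_nb(x):
    # Narrowband MOS-LQO mapping (ITU-T P.862.1), eq. (2.4)
    return 0.999 + (4.999 - 0.999) / (1.0 + math.exp(-1.4945 * x + 4.6607))

def mos_lqo_wb(x):
    # Wideband MOS-LQO mapping (ITU-T P.862.2), eq. (2.5)
    return 0.999 + (4.999 - 0.999) / (1.0 + math.exp(-1.3669 * x + 3.8224))

for raw in (1.0, 2.5, 4.0):
    print(raw, pesq_lq(raw), mos_lqo_nb(raw), mos_lqo_wb(raw))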

2.8.4 Non-Intrusive Voice Quality

A non-intrusive voice quality test method is a no-reference, or single-ended, voice quality measurement. Intrusive methods (e.g., PESQ) inject a reference signal into the Device Under Test (DUT); objective listening quality is then analyzed on the degraded system output signal by comparison with the reference signal. Non-intrusive methods, like subjective tests, "listen" to the DUT output and estimate voice quality objectively and directly. The introduction of non-intrusive methods provides tremendous options for monitoring line voice quality remotely, on deployed equipment. It removes the need for an intrusive method that requires physical access to both sides of the system; such access is usually available during equipment development and not on deployed systems. Two predominant non-intrusive methods were standardized:

• Single Sided Speech Quality Measure, given by ITU-T recommendation P.563 [38]. In this method, a wide range of speech parameters is estimated from the signal (e.g., pitch, cepstrum, etc.), weighted on the basis of their perceptual relevancy, and then added linearly to obtain a speech degradation model.


• Auditory Non-Intrusive Quality Estimation Plus (ANIQUE+), a perceptual model for non-intrusive estimation of narrowband speech quality, given by ANSI [40, 41]. In this method, a significant part is dedicated to articulation analysis and to signal envelope characteristics. The signal is viewed as Amplitude Modulation (AM), where the envelope represents the modulation. The model attempts to identify how likely it is that the envelope modulation has physical sense. The degradation is calculated together with additional analyzed degradation factors. ANIQUE+ is considered superior to P.563 in terms of MOS consistency.

The main disadvantage of non-intrusive methods is their lower accuracy and correlation with subjective listening quality tests (MOS) compared to intrusive methods (e.g., PESQ).

2.8.5 Voice Quality Measure Selection

The best method of determining speech quality is to conduct traditional MOS listening quality tests with panels of human listeners. Extensive guidelines are given in ITU-T recommendations P.800 and P.830 [32, 49]. The results of these tests are averaged in order to give the MOS, but such tests are expensive and time consuming, and remain impractical for testing in the field and in academic research. Moreover, subjective tests are not suitable for monitoring the QoS of a network on a daily basis, whereas objective measures can be used for this purpose at a very low cost. The main aspects that affect the applicability of objective and subjective measures are summarized in Table 2-4 below [14]:

Criterion | Subjective Measures | Objective Measures
Cost | - | +
Reproducibility | - | +
Automation | - | +
Unforeseen impairments | + | -

Table 2-4: Comparison of subjective and objective methods for quality assessment

The symbol "+" is used to denote that the method is advantageous over the other method, denoted by "-". The challenging part of objective voice quality testing is to model this very human expectation of "quality" with mathematical equations that correlate with the

user's experience. The results are testing methodologies that produce a number corresponding to how the vast majority of users will perceive the conversation. New approaches for non-intrusive voice quality estimation are promising, but to date they remain inferior to intrusive methods in terms of accuracy relative to true subjective listening quality tests, and they are intended only for narrowband VoIP applications. The most favored standardized objective method, with the highest correlation with MOS, is the PESQ method. This method is also applicable to wideband speech signals (ITU-T P.862.2 [36]). PESQ was chosen in this work for estimating the packet loss concealment algorithm performance. Its free availability, the accessibility of the reference signal, and the ability to test a large-scale speech database covering several spoken languages make it attractive in this academic research. In order to verify the objective results, informal subjective preference tests were also conducted. Many works on concealment algorithms make use of informal testing, but in fact there is no practical way to reproduce such an experiment accurately in order to evaluate different approaches.

2.9 Packet Loss Recovery Techniques

The human hearing system is sensitive to discontinuities in the waveform that may be caused by packet loss unless concealed by some method. The unconcealed signal may produce an annoying disturbance to the listener, which we would like to eliminate. Packet loss recovery techniques are traditionally divided into two major approaches: sender-based and receiver-based [4]. While a receiver-based technique places no requirements on the sender side, in a sender-based technique the receiver is also involved in the reconstruction process, which is why it is often called a sender-and-receiver-based technique. The sender-based technique is more related to the problem of error correction in communication, while the receiver-based technique is more about the error concealment problem, using digital speech signal processing. Sender-based techniques are usually more effective than receiver-based techniques, since they have additional information from the encoder to rely on, while


a receiver-based technique has only the corrupted bit stream. Since this research proposes a receiver-based solution, the receiver-based techniques are explored further.

2.9.1 Sender-based Techniques

Under the sender-based techniques, two sub-approaches are distinguished: active and passive. Active techniques involve retransmission of the lost packet, but due to the delay constraints of internet telephony, retransmission is usually avoided. Another active approach is packet redundancy, where each packet is transmitted several times, statistically reducing the packet loss in proportion to the redundancy depth [50]. Passive techniques, on the other hand, may use interleaving of samples to avoid the loss of a contiguous chunk of samples, at the expense of a known delay that is a function of the interleaving depth. Another passive approach is Forward Error Correction (FEC), where redundant data is sent along with the data packets and may be used to correct errors. A generic FEC scheme for internet protocols was suggested in [51]. Apart from interleaving, these techniques increase the required bandwidth, and many of them belong to error correction rather than concealment. A good example where two such methods are combined is the Global System for Mobile Communication (GSM), where interleaving and FEC are both used.
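To illustrate the passive interleaving idea, consider the following minimal sketch of a block interleaver; it is a generic illustration, not the scheme of any particular standard:

import numpy as np

def interleave(frames, depth):
    # Block interleaver: reorder frames so that a burst of `depth`
    # consecutive packet losses maps to isolated single-frame gaps.
    n = len(frames)
    assert n % depth == 0, "pad the stream to a multiple of the depth"
    idx = np.arange(n).reshape(depth, n // depth).T.flatten()
    return [frames[i] for i in idx], idx

def deinterleave(frames, idx):
    out = [None] * len(frames)
    for pos, i in enumerate(idx):
        out[i] = frames[pos]
    return out

frames = list(range(16))
tx, idx = interleave(frames, depth=4)
# losing tx[0:4] in a burst now erases frames 0, 4, 8, 12 - spread out
rx = deinterleave(tx, idx)
assert rx == frames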

Classification of the sender-based techniques is shown in Figure 2-13.

Figure 2-13: Sender-based concealment types (after Perkins et al. [4]): active repair (retransmission) and passive repair (interleaving and forward error correction, the latter either media-independent or media-specific)

2.9.2 Receiver-based Techniques

The receiver-based techniques generate replacement packets in place of the lost packets, using only the information available at the receiver side.


In speech applications, the analysis may comprise previous packets already played out and possibly subsequent packets that are available at the receiver but have not been played out yet. The incentive is that speech signals are characterized by strong correlation between consecutive frames. Clearly, if the period of packet loss exceeds the duration of the voice occurrence, it is impossible to reconstruct it, simply because there is no information one can use to reconstruct the missing packets. A proposed classification of the receiver-based concealment techniques is given in Figure 2-14.

Figure 2-14: Receiver-based concealment types (after Perkins et al. [4]): insertion (splicing, silence substitution, packet repetition), interpolation (waveform substitution, pitch waveform replication, time-scale modification), and regeneration (interpolation of transmitted state, model-based recovery)

Three categories of techniques can be observed:

• Insertion-based - repair by filling the loss using packet repetition, silence (zero stuffing), or splicing. These techniques are easy to implement but have poor performance.
• Interpolation-based - pattern matching and interpolation to derive a replacement packet that is expected to be similar to the lost packet. These techniques are more difficult to implement but perform better than the insertion methods.
• Regeneration-based - model-based techniques that use the decoder state and the packets surrounding the loss to generate a replacement for the lost packet. These techniques are the most difficult to implement but outperform the other techniques.

2.9.3 Concealment Algorithms

Packet loss concealment focuses on reducing the perceptual impact of a packet not being available in the jitter buffer at the time it is needed for play-out. The options available for performing concealment operations depend on the data available in correctly received packets. The purpose of this section is to review packet loss concealment algorithms (receiver-based only) that follow some of the techniques classified in section 2.9.2. Usually, concealment algorithms are receiver-based, so they do not require additional bandwidth, signal processing, or data-handling overhead at the transmitter side. This means that different packet loss concealment techniques may be applied at the receiver side according to specific quality requirements. Moreover, a degree of freedom is added to the receiver, which, with the help of jitter buffers, can exploit subsequent packets. Additionally, backwards compatibility provides simple integration into existing systems. The classic speech production model divides the signal into voiced and unvoiced segments. Voiced segments maintain high periodicity, and concealment algorithms try to exploit it during packet loss, using signal segments around the gap in order to fill it. However, the speech signal is highly non-stationary; it is assumed to be stationary only for short time periods of up to 40 ms. Therefore, the applicability of this phenomenon is limited to small speech segments. Moreover, in the case of an entire phoneme loss, the signal cannot be recovered.

2.9.3.1 Splicing

The simplest technique is splicing, where a zero-length fill-in is used. The concealment is performed by splicing together the audio on either side of the loss, so no gap is left; however, the timing of the stream is disrupted. This technique was evaluated by Gruber and Strawczynski [52] and provided poor results: it was tolerable for low frame loss rates (below 3 percent) and short gaps, but otherwise the results were intolerable. Although this method is easy to implement, it generally provides poor performance.

2.9.3.2 Silence Substitution

This is the simplest technique: the missing speech segment is filled with zero-valued samples, often called the "zero stuffing" method [53, 54]. However, even for a low packet loss rate (1%) the speech quality degrades significantly. It is important to note that many methods fall back to zero stuffing when a packet loss is sustained for a long time.

2.9.3.3 Noise Insertion

With only a slight increase in complexity, noise insertion replaces the missing speech period with noise. Noise insertion exploits the "phonemic restoration" phenomenon, where the interpolation ability of the human auditory system increases if noise, rather than silence, is perceived in place of the missing speech segment. In addition, a significant portion of voice conversation is half-duplex, and in order to save bandwidth, codecs often make use of silence compression schemes. They use a Voice Activity Detector (VAD) and low bit-rate Silence Insertion Descriptor (SID) frames that are sent instead of regular packets in order to generate comfort noise on the other end. It is possible to use this information, transmitted by the sender, for appropriate noise generation. When compared to silence substitution, the use of white noise has been shown to give both subjectively better quality and improved intelligibility. It is therefore recommended as a replacement for silence substitution [4].

2.9.3.4 Waveform Substitution

This technique fills the gap using correctly received speech, from which a suitable signal segment is found to cover the loss. Goodman et al. [53, 54, 55] studied the use of waveform substitution in packet voice systems. They examined both one- and two-sided techniques that use templates to locate suitable pitch patterns on either side of the loss. The technique includes:

• Identification of gaps in the signal, either by using packet sequence numbers or timestamps.
• Storing recently received signal segments in a history buffer.
• Signal processing to replace the missing segment.

This method is suitable for replacing short lost speech segments, but not during phoneme transitions.


2.9.3.5 Packet Repetition

Another simple technique is repetition of the most recently received packet in order to fill the missing speech segment. It is only necessary to buffer the last packet and repeat it for as long as needed. Naturally, the packetizing duration is not related to speech features such as the pitch period; therefore, discontinuities are introduced into the signal, with an annoying disturbance. This technique is considered to provide only slightly improved speech quality over silence substitution.

2.9.3.6 Pitch Waveform Replication

This technique measures the pitch period from the signal preceding the gap and copies it until the gap is filled. In order to smooth the transitions at the gap boundaries, phase matching [56] was also suggested. One of the earliest applications of the pitch waveform replication technique is recommended by ITU-T G.711 Appendix I [57] for packet loss concealment. Its approach is to maintain a history buffer that is used to calculate the pitch period and to extract waveforms during an erasure. The algorithm introduces an output delay to allow an overlap-add at the start of an erasure, providing a smooth transition between the real and synthesized signals. When the packet loss period is short, only the last pitch period is used to fill the gap, while for a long erasure a couple of the last pitch periods are used from the accumulated history, aiming for a more natural voice. On long erasures, the signal is attenuated as the erasure progresses. The synthetic signal is attenuated because, as the erasure gets longer, it is more likely to diverge from the real signal; avoiding the attenuation may introduce strange artifacts created by holding certain types of sounds too long. When the first good frame arrives after an erasure, a smooth transition using overlap-add is performed over a period relative to the pitch period and no longer than one frame. A more sophisticated application of waveform extrapolation based on pitch detection is introduced by ITU-T G.722 Appendix III [58] (one of the two recommended appendixes for packet loss concealment for G.722). The algorithm performs the packet loss concealment in the 16-kHz output domain of the G.722 decoder. Periodic waveform extrapolation is used to fill in the waveform of lost packets, mixed with LP-filtered noise according to the signal characteristics prior to the loss. The extrapolated 16-kHz signal is passed through the Quadrature Mirror Filter (QMF) analysis filter bank, and the higher and lower sub-band signals are passed to partial sub-band ADPCM encoders to update the states of the sub-band ADPCM decoders. Additional processing takes place for each packet loss in order to provide a smooth transition from the extrapolated waveform to the waveform decoded from the received packets. Among other things, the states of the sub-band ADPCM decoders are phase-aligned with the first received packet after a packet loss, and the decoded waveform is time-warped in order to align with the extrapolated waveform before the two are overlap-added to smooth the transition. For protracted packet loss, the algorithm gradually mutes the output. The algorithm operates on an intrinsic 10-ms frame size. Even more advanced techniques exist, such as using time-scale modification to stretch the waveform from adjacent packets in order to bridge the gap (created by a packet loss) smoothly. Although these techniques tend to be more computationally expensive, they achieve better performance than other methods of waveform extrapolation.
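The flavor of pitch waveform replication can be conveyed with a short sketch; the following is a simplified illustration in the spirit of, but far simpler than, G.711 Appendix I (the autocorrelation pitch estimator and the per-repetition attenuation factor are assumptions of the sketch):

import numpy as np

def estimate_pitch_period(history, fs=16000, fmin=60.0, fmax=400.0):
    # Crude autocorrelation pitch estimate over the history buffer;
    # the buffer is assumed to span at least one maximal pitch lag.
    lag_min, lag_max = int(fs / fmax), int(fs / fmin)
    x = history - np.mean(history)
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]
    return lag_min + int(np.argmax(ac[lag_min:lag_max + 1]))

def conceal_by_pitch_replication(history, n_missing, fs=16000, decay=0.9):
    # Fill a gap by repeating the last pitch period, attenuating each
    # repetition so that a long erasure fades out instead of buzzing.
    period = estimate_pitch_period(history, fs)
    template = history[-period:]
    out = np.empty(n_missing)
    gain = 1.0
    for start in range(0, n_missing, period):
        chunk = template * gain
        out[start:start + period] = chunk[: n_missing - start]
        gain *= decay
    return out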

2.9.3.7 Time-scale Modification

Time-scale modification "stretches" the signal before the missing gap in order to fill it. The pitch period is maintained, and overlap-add between the filled and received speech is used to avoid discontinuities. This technique is based on the Waveform Similarity Based Overlap-Add (WSOLA) algorithm [59]. While high voice quality is preserved, time-scale modification introduces significant delay.

2.9.3.8 Linear Prediction based Waveform Substitution

Most of the previously described techniques are waveform-based concealment, accompanied by applicable speech processing (e.g., pitch detection, overlap-add, etc.). The Linear Prediction (LP) model [60] was also suggested as a technique for addressing the packet loss problem. Gündüzhan and Momtahan proposed a high-performance packet loss concealment algorithm for PCM coded speech [61]. This implementation uses the linear predictive model of speech production to estimate the vocal tract and excitation information from the previously received packets in order to reconstruct the signal contained in the missing packet. The algorithm estimates the spectral characteristics of a missing speech segment, and then synthesizes a high-quality approximation of the missing segment using the LP speech production model [62, 63]. A block diagram of the PLC is shown in Figure 2-15 below:

Figure 2-15: LP-based packet loss concealment (after Gündüzhan and Momtahan [61])

This LP-based PLC algorithm is implemented entirely at the receiver side of the transmission channel. Speech is generated by passing an excitation signal through an inverse LP filter (the synthesis filter). In this model, a speech signal is composed of two components:

• LP analysis parameters that model the vocal tract information • a residual signal that contains the excitation information

The basic operation of the algorithm is to estimate these two components for the missing speech segment, based on the LP analysis of the previously received speech frames. The two components provide two degrees of freedom for concealment, i.e., spectral shape and excitation. ITU-T G.722 Appendix IV [64] for packet loss concealment uses a similar approach: the extrapolation of the lower sub-band signal is performed by extraction of the residual signal using linear prediction, followed by its periodic repetition and synthesis. The modification strategy for the residual signal depends on the characteristics of the preceding frame, which are determined by classification. The higher sub-band extrapolation is much simpler than that of the lower sub-band: it consists of pitch-synchronous repetition (if the signal is not classified as voiced, the entire frame is repeated).
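The residual-extrapolation idea can be sketched compactly; the following is a simplified illustration of the approach (autocorrelation LP analysis and periodic residual repetition), not the exact algorithm of [61] or of G.722 Appendix IV, and the pitch period is assumed to come from an external detector:

import numpy as np

def lpc(x, order=16):
    # LP coefficients via the autocorrelation method (normal equations)
    r = np.correlate(x, x, mode="full")[len(x) - 1: len(x) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1: order + 1])
    return np.concatenate(([1.0], -a))        # A(z) = 1 - sum a_k z^-k

def conceal_lp(history, n_missing, pitch_period, order=16):
    # Extrapolate the LP residual by periodic repetition, then
    # synthesize the missing segment through 1/A(z).
    a = lpc(history, order)
    # residual e(n) = A(z) * s(n)  (FIR analysis filtering)
    resid = np.convolve(history, a)[: len(history)]
    # extrapolate the residual by repeating its last pitch period
    reps = int(np.ceil(n_missing / pitch_period))
    e_ext = np.tile(resid[-pitch_period:], reps)[:n_missing]
    # synthesis filter 1/A(z), seeded with the last `order` history samples
    state = list(history[-order:])
    out = []
    for n in range(n_missing):
        s = e_ext[n] + sum(-a[k] * state[-k] for k in range(1, order + 1))
        out.append(s)
        state.append(s)
    return np.array(out)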


2.9.3.9 Concealment Based on Sinusoidal Model

Jonas Lindblom and Per Hedelin proposed packet loss concealment techniques based on sinusoidal modeling [65, 66]. These algorithms operate on the source-filter components of the speech signal (obtained through linear prediction analysis) and are implemented in the receivers of a potential system. A sinusoidal model of successfully received speech is utilized to recover lost frames using both extrapolation and interpolation of the LP residual (the filter coefficients are extrapolated/interpolated separately). Informal subjective testing shows promising results: the recovered packet losses are often inaudible to a listener. An illustration of concealment based on the interpolation technique is shown in Figure 2-16 below:

Figure 2-16: Interpolation based PLC (after Lindblom and Hedelin [66])

Speech signals are denoted by $s_n$, LP filter coefficients by $A$, and LP residuals by $r_n$. Superscript plus and minus signs denote time ("after" and "before" the frame erasure, respectively). The signals in this example are authentic, and the smooth result renders the underlying frame erasure inaudible to a listener.


3 Sinusoidal Modeling of Speech

Many representations of speech signals have been proposed for speech coding, synthesis, and recognition. The speech signal is commonly decomposed into a source passed through a linear time-varying filter. This filter can be derived from models of speech production based on the theory of acoustics. The source represents the airflow at the vocal cords, and the filter represents the resonances of the vocal tract, which change over time. This approach assumes that the glottal excitation is an impulse-like train or noise, for voiced or unvoiced speech respectively [62]; some models also assume a mixture of both excitation types using weighted contributions. Another approach is sinusoidal modeling, in which the speech signal is represented by a linear combination of sinusoidal waves. In the analysis phase, the speech signal is divided into overlapping segments, a Discrete Fourier Transform (DFT) is computed for each segment, and a set of desired spectral peaks is identified. Each peak is represented by a frequency, an amplitude, and a phase, while the remaining spectral information is discarded. In the synthesis stage, the frequency, amplitude, and phase of each selected peak are used to generate a sinusoid, and all contributions are summed together. The proposed algorithm for packet loss concealment is based on this approach. The following sections present the evolution of sinusoidal modeling, the speech production model, and classical sinusoidal modeling.

3.1 Sinusoidal Modeling Evolution

The sinusoidal model, as developed by McAulay and Quatieri, provides a way of representing a speech signal as a sum of discrete sinusoids [25]. The sinusoids in the model continually vary in amplitude, frequency, and phase, in contrast to the fixed, unconnected sinusoidal basis functions used in the Short-Time Fourier Transform (STFT). Sinusoidal modeling was originally designed for use in speech coding [25, 26, 67]; however, it was soon discovered that this parametric method of representing speech in terms of discrete sinusoids was useful in other applications of speech signal processing. The sinusoidal model is used for time-scale modification of speech [68, 69, 70, 71, 72], music modeling [73, 74, 75, 76], pitch modification of speech [68, 69, 77], co-channel interference suppression [78], background noise suppression [79], speech synthesis [77, 80, 81], singing voice synthesis [77], music synthesis and modification [76, 82], and hearing compensation [83]. Moreover, the sinusoidal model is widely used in packet loss concealment algorithms [65, 66, 84].

3.1.1 Related Research

The most notable studies in the area of spectral modeling of speech and audio are listed in this section.

3.1.1.1 Harmonic Modeling

Almeida and Silva [85] suggested a sine-wave-based speech compression system, which uses a pitch estimate to establish a harmonic set of sine waves. The sine-wave phases are computed at the harmonic frequencies. To compensate for any errors that might be introduced by the harmonic sine-wave representation, a residual waveform is coded along with the underlying sine-wave parameters.

3.1.1.2 MQ Technique

Robert McAulay and Thomas Quatieri [25] laid the foundation for a complete sinusoidal representation of speech signals (mentioned above). It is often called the MQ technique.

3.1.1.3 PARSHL

Julius O. Smith III and Xavier Serra [86] developed the Analysis/Synthesis Program for Non-Harmonic Sounds Based on a Sinusoidal Representation (PARSHL). The main characteristic that differentiates this model from the traditional MQ technique is the selectivity of spectral information: a parabolic interpolator is used in order to increase the frequency resolution of the peak-picking procedure.


3.1.1.4 MQAN

Robert Maher and James Beauchamp introduced psychoacoustic masking effects into the MQ model [87]. It was achieved by defining a spectral peak magnitude threshold in terms of the difference in magnitude between the largest peak in the frame and the peak under consideration.

3.1.1.5 SMS

Xavier Serra suggested Spectral Modeling Synthesis (SMS) [88], which handles the non-sinusoidal aspects of the signal, i.e., the residual noise signal. First, the input signal is modeled with the sinusoidal model; then the synthesized signal is subtracted (in the time domain) from the input signal in order to produce the residual signal, whose spectrum is modeled further.

3.1.1.6 MSM

Scott Levine et al. suggested Multiresolution Sinusoidal Modeling (MSM) [75] in order to handle polyphonic input signals, as opposed to the MQ method, which is limited to speech.

3.1.1.7 Sinusoidal AbS

George and Smith suggested a combined Analysis-by-Synthesis/Overlap-Add technique [71] in order to determine the sinusoidal model parameters, instead of the MQ peak-picking method.

3.1.1.8 SVOPC

Lindblom suggested the Sinusoidal Voice Over Packet Coder (SVOPC) [89], based on quasi-harmonic modeling and especially suited for VoIP applications with frame erasure conditions.

3.2 Speech Production Model

A widely used method in speech modeling is based on the source-filter model of speech production. Figure 3-1 below describes the generalized source-filter model.


Figure 3-1: Generalized source-filter model of speech (voiced and unvoiced sources, gain, and filter producing the speech signal)

The glottal excitation signal can be either a voiced source with a fundamental frequency or an unvoiced, noise-like source. It usually suffices to assume that the glottal excitation is in one of two possible states, corresponding to voiced or unvoiced speech. A mixed excitation of the two may also be used for voiced consonant sounds. The filter represents only the resonances of the vocal tract, so additional provision is needed for the effects of the shape of the glottal waveform and the radiation characteristics of the mouth. Figure 3-2 describes the additional models adapted to the generalized source-filter model. However, nasal sounds are represented inaccurately by this model.

Figure 3-2: Source-filter model of speech (an impulse-train generator driven by the pitch period, a glottal pulse model, and a gain for the voiced branch; a random noise generator and a gain for the unvoiced branch; a vocal tract model and a radiation model producing the speech signal)

3.3 Speech Analysis/Synthesis Based on a Sinusoidal Representation

The representation provided here follows McAulay and Quatieri [25, 90].

3.3.1 Sinusoidal Speech Model

In the linear speech production model, the speech waveform $s(t)$ is assumed to be the result of passing the glottal excitation wave $e(t)$ through a linear time-varying filter, which represents the vocal tract. If the time-varying impulse response of the vocal tract filter is $h(t,\tau)$, then the speech waveform $s(t)$ can be written as a time-varying convolution:

$$s(t) = \int_0^t h(t-\tau, t)\, e(\tau)\, d\tau \qquad (3.1)$$

In the sinusoidal model, the glottal excitation is represented by a sum of sine waves of various amplitudes, frequencies, and phases. Thus, the excitation can be represented as

Lt() t = ωσ σ+ φ ( 3.2) et() Re∑ al ()exp t j∫ ll ( ) d l=1 0 where al (t) and ωl ()t represent time varying amplitude and frequency, and φl is the fixed phase offset, Lt() denotes the number of sine waves components at time t . The time-varying vocal tract transfer function is represented as

$$H(\omega, t) = M(\omega, t) \exp\!\left[ j\Phi(\omega, t) \right] \qquad (3.3)$$

where $M(\omega, t)$ and $\Phi(\omega, t)$ denote the time-varying magnitude and phase, respectively.

If the excitation parameters $a_l(t)$ and $\omega_l(t)$ remain constant over the duration of the impulse response of the vocal tract filter (a common assumption), the representation can be written as

$$s(t) = \operatorname{Re} \sum_{l=1}^{L(t)} a_l(t)\, M\!\left(\omega_l(t), t\right) \exp\!\left\{ j\left[ \int_0^t \omega_l(\sigma)\, d\sigma + \Phi\!\left(\omega_l(t), t\right) + \phi_l \right] \right\} \qquad (3.4)$$

By combining the excitation and vocal tract terms, the representation can be written as

$$s(t) = \sum_{l=1}^{L(t)} A_l(t) \exp\!\left[ j\psi_l(t) \right] \qquad (3.5)$$

where

$$A_l(t) = a_l(t)\, M\!\left(\omega_l(t), t\right), \qquad \psi_l(t) = \int_0^t \omega_l(\sigma)\, d\sigma + \Phi\!\left(\omega_l(t), t\right) + \phi_l \qquad (3.6)$$

represent the amplitude and phase of the $l$-th sine wave along the frequency track $\omega_l(t)$ (referred to as the instantaneous frequency).


3.3.2 Sinusoidal Model Parameters Estimation

The problem encountered when modeling the speech waveform is to extract parameters that represent a quasi-stationary portion of the waveform. Those parameters are later required to synthesize an approximation that is perceived as close as possible to the original speech. The model parameters (amplitudes, frequencies, and phases) are estimated by applying a weighted Discrete Short-Time Fourier Transform (DSTFT) to a short frame, which is a part of the speech waveform. A weighted approach is taken to minimize the effects of the rectangular window that is implicit in the definition of the DSTFT, i.e., side lobes, spectral leakage, etc. [91]. In the case of a purely voiced frame, the weighted DSTFT of the speech will have peaks occurring at all pitch harmonics. The amplitudes and phases of the component waves are estimated at the peaks of the high-resolution DSTFT using a simple peak-picking algorithm [25]. The basic analysis system for the sine model is shown in Figure 3-3 below:

Figure 3-3: Block diagram of the sinusoidal analysis (speech input → windowing → DFT; magnitude → peak picking → amplitudes and frequencies; arctangent of the DFT at the peaks → phases)

First, the speech is processed as a piecewise sequence of frames of duration $T$. The center of the analysis window for the $i$-th frame is marked as $t_i$. It is assumed that the vocal tract and glottal parameters are constant over an interval of at least the analysis window. The $l$-th sine wave of the $i$-th frame is described using $A_l(t_i) = A_l^i$ and $\omega_l(t) = \omega_l^i$, while the phase can be written as

$$\psi_l^i(t) = \omega_l^i (t - t_i) + \theta_l^i \qquad (3.7)$$

where $\theta_l^i$ denotes the phase of the $l$-th sine wave at the center of the $i$-th frame, at time $t = t_i$. The waveform $s(t)$ can then be written as


$$s(t) = \sum_{l=1}^{L^i} A_l^i \exp\!\left( j\theta_l^i \right) \exp\!\left[ j\omega_l^i (t - t_i) \right] \qquad (3.8)$$

where $L^i = L(t_i)$. By shifting the time interval to the origin with $t' \triangleq t - t_i$ and converting to discrete time, the waveform can be written as

$$s(n) = \sum_{l=1}^{L^i} \gamma_l^i \exp\!\left( j\omega_l^i n \right) \qquad (3.9)$$

where $\gamma_l^i = A_l^i \exp(j\theta_l^i)$ represents the complex amplitude of the $l$-th sine wave. The minimum mean-squared error criterion is used to verify the fit of the synthetic signal to the original waveform:

$$\varepsilon^i = \sum_n \left| y(n) - s(n) \right|^2 \qquad (3.10)$$

In [25], it is shown that the error is minimized by selecting all the harmonic frequencies in the speech bandwidth for purely voiced speech, in which case the synthetic speech can be written as

$$s(n) = \sum_{l=1}^{L^i} \gamma_l^i \exp\!\left( j l \omega_0^i n \right) \qquad (3.11)$$

where $\omega_0^i = 2\pi / \tau_0^i$ and $\tau_0^i$ denotes the pitch period, assumed constant during the $i$-th frame. The result for the purely voiced scenario, obtained through the Discrete Fourier Transform (DFT), is equivalent to the Fourier series representation of a periodic waveform. The same estimator structure can be applied even when the ideal voiced-speech assumption is no longer valid, provided the analysis window is "wide enough" to satisfy:

$$\omega_l^i - \omega_{l-1}^i \geq \frac{4\pi}{N+1} \qquad (3.12)$$

where $N+1$ is the duration of the analysis window. This condition is driven by a spectral property of the rectangular window embedded implicitly in the STFT: its Fourier transform (a sinc) convolves with the genuine signal spectrum. For the theoretical discussion, it was assumed that outside


the range described above, the sinc is "zero" (which is not accurate due to its side lobes). In order to minimize the spectral leakage, a weighted STFT is used. The same estimator is applied in the case of unvoiced speech, but the straightforward error-minimization proof for the purely voiced case is not valid in the unvoiced scenario. The Karhunen-Loève expansion for noise-like signals [92] was used instead. This analysis concluded that a sinusoidal representation is valid for unvoiced signals, provided the selected frequencies are "close enough", such that the power spectral density shape changes slowly over consecutive frequencies. Using a window width of at least 20 ms leads "on the average" to a set of periodogram peaks approximately 100 Hz apart. This provides sufficiently dense sampling to satisfy the constraints of the sinusoidal Karhunen-Loève representation for an unvoiced speech frame, with the amplitudes and frequencies estimated using the above procedure.

3.3.3 Sinusoidal Model Synthesis

The sinusoidal analysis described in the previous section assumed that the vocal tract and glottal parameters are constant over the analysis window. Thus, a constant set of parameters (amplitudes, frequencies, and phases for a set of sine waves) was extracted, and the synthetic speech is given by:

$$\hat{s}(n) = \sum_{l=1}^{L^i} \hat{A}_l^i \cos\!\left[ \hat{\omega}_l^i n + \hat{\theta}_l^i \right] \qquad (3.13)$$

where $n = 0, 1, 2, \ldots, N-1$ and $N$ is the length of the synthesis frame. However, the mentioned parameters are time-varying in nature, and the straightforward method of summing sine waves with different parameters results in discontinuities at the frame boundaries. These discontinuities significantly degrade the synthetic voice quality with annoying clicking sounds. Therefore, the synthetic speech must be smoothed during frame transitions, and a method must be found for interpolating the two sets of parameters from consecutive frames. The immediate approach is to apply overlap-add and time-weighting techniques to the segments of the sinusoidal components. Overlap-add and time weighting perform well but require a high frame rate, which is not applicable in speech coding, where a lower bit rate is preferable. McAulay and Quatieri suggested a

frame-to-frame peak matching algorithm, in which each spectral component in a given frame matches one of the following conditions:

• "Matched" condition - a spectral peak was found in the consecutive frame within the allowed frequency deviation. In this case, linear interpolation of the amplitudes and cubic interpolation of the phase are performed between the parameter sets of the two frames in order to smooth the frame transition for the given spectral component.
• "Death" condition - a spectral peak found in the first analyzed frame was not found in the consecutive frame, and it needs to disappear. In this case, the spectral component is interpolated with itself, except that its ending amplitude is set to zero.
• "Birth" condition - a spectral peak was found in the consecutive frame but not in the previous frame; it needs to emerge. In this case, similarly to the "death" case, the spectral component is interpolated with itself, except that its starting amplitude is set to zero.

Figure 3-4: Frequency tracking (after McAulay and Quatieri [25]): (a) illustration of a frequency track, (b) typical frequency tracks in real speech

The frame-to-frame spectral peak tracking implies that an additional frame of delay is required in order to encode the speech. The amplitude interpolation is performed using a simple linear method, where $N$ denotes the frame size:

$$A_l^i(n) = \hat{A}_l^i + \frac{\hat{A}_l^{i+1} - \hat{A}_l^i}{N}\, n, \qquad n = 0, 1, \ldots, N-1 \qquad (3.14)$$

A cubic polynomial was suggested for modeling the frequency and phase. However, special treatment is required, since the measured phase $\hat{\theta}_l^i$ is obtained wrapped modulo $2\pi$. Hence, phase unwrapping must be performed to ensure transition smoothness:

$$\tilde{\theta}(t) = \zeta + \gamma t + \alpha t^2 + \beta t^3 \qquad (3.15)$$

The interpolation parameters $\zeta$, $\gamma$, $\alpha$, and $\beta$ are found using boundary conditions. The instantaneous frequency is the derivative of the phase, and maximal-smoothness conditions are used to solve the phase unwrapping problem. A full description of the cubic phase interpolation approach is given in [25]. Naturally, some or all boundary conditions are not available in the packet loss case, and a conditional overlap-add is used in order to smooth the transition between synthetic and real speech signals. Figure 3-5 illustrates the basic block diagram for sinusoidal speech synthesis proposed by McAulay and Quatieri [25].

Figure 3-5: Block diagram of the sinusoidal synthesis (frequencies and phases → frame-to-frame phase interpolation; amplitudes → frame-to-frame amplitude interpolation; sine-wave generators → sum of all sine waves → synthesized speech)
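A minimal sketch of interpolating one matched track across a frame, using the linear amplitude rule (3.14) and a cubic phase (3.15) with the maximally smooth unwrapping factor of [25], is given below; frequencies are in radians per sample, and the function name is illustrative:

import numpy as np

def interpolate_track(a1, w1, th1, a2, w2, th2, N):
    # Interpolate one matched sinusoidal track across a frame of N samples:
    # linear amplitude (eq. 3.14) and cubic phase (eq. 3.15).
    n = np.arange(N)
    amp = a1 + (a2 - a1) * n / N                       # eq. (3.14)
    # choose M* so the cubic stays as close to a linear phase as possible
    m = np.round(((th1 + w1 * N - th2) + (w2 - w1) * N / 2) / (2 * np.pi))
    d = th2 - th1 - w1 * N + 2 * np.pi * m
    alpha = 3.0 * d / N**2 - (w2 - w1) / N
    beta = -2.0 * d / N**3 + (w2 - w1) / N**2
    phase = th1 + w1 * n + alpha * n**2 + beta * n**3  # eq. (3.15)
    return amp * np.cos(phase)

# one matched track: 200 Hz drifting to 210 Hz at fs = 16 kHz, 10 ms frame
fs, N = 16000, 160
seg = interpolate_track(1.0, 2 * np.pi * 200 / fs, 0.3,
                        0.8, 2 * np.pi * 210 / fs, 1.1, N)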


4 Sinusoidal Model Based Packet Loss Concealment

This chapter describes the proposed algorithm for packet loss concealment (called SM-PLC), which uses the sinusoidal modeling of speech introduced in Chapter 3. The chapter is organized top-down as follows: in the first section, the generalized solution is presented; in the second section, the developed procedures and their building blocks are presented.

4.1 Generalized Concealment Method

As already mentioned, this work focuses on a receiver-based technique. Hence, the concealment algorithm is applied during the decoding stage and uses only data available at the receiving end. Figure 4-1 describes the VoIP receiving flow to which the proposed algorithm is applied. RTP packets are received through the network port and de-packetized from their RTP headers. The coded frames are then pushed (asynchronously) into the jitter buffer. The hardware audio codec periodically triggers (synchronously) the decode process, where frames are pulled from the jitter buffer. Each frame is marked as "good" or "bad"; this marking is often called the Bad Frame Indication (BFI).

Figure 4-1: VoIP receiver-based concealment block diagram


4.2 Description of the Algorithm

Generally, packet loss concealment methods are heuristically motivated. The first objective is to avoid discontinuities and abrupt transitions at the boundaries of a lost packet, which cause annoying sounds and degrade voice quality significantly. The second objective is to reconstruct a signal as similar as possible to the real one, in terms of spectral content, pitch, and other properties, using information determined from neighboring frames. Figure 4-2 provides the top-level description of the proposed PLC algorithm (for a detailed block diagram, see Figure 4-11).

Figure 4-2: Proposed PLC top-level block diagram

As the diagram shows, the PLC-In block reconstructs the missing frame using samples from the jitter and inner history buffers, while the PLC-Out block updates the history and performs the transition between concealed and "good" subsequent frames, coming from the PLC-In block and the Decoder, respectively. All these operations are controlled by the Control block according to the Bad Frame Indications. These main modules and signals are explained in the following sub-chapters. The PLC technique proposed in this work uses the sinusoidal modeling of speech to estimate information from previous and subsequent frames in order to reconstruct the signal contained in the missing packet. Without loss of generality, it was developed to work with a frame size of 10 ms and a sampling rate of 16 kHz.


4.2.1 Building Blocks

Following Figure 4-2, the concealment process can be divided into the following operational blocks (see the detailed descriptions in the following sections):

• Control
This block controls the concealment algorithm process. It includes a simple state machine driven by the Bad Frame Indications, denoted as $\mathbf{BFI}_i = [BFI_{i-1}, BFI_i, BFI_{i+1}]^T$, of the previous, current, and subsequent frames stored in the jitter and inner (history) buffers. The state machine is updated with each update of the jitter buffer, when the current frame is output to the hardware codec via the Decoder or the PLC. Each state is assigned a concealment method, and the algorithm applies the method of the current state.

• PLC-In
This block reconstructs the speech signal, starting with a "bad" frame indication and ending with the arrival of the first "good" frame indication. The concealed output for frame $i$ is denoted $s_{c,i}^{N_c}$, where $N_c$ is the length of the concealed signal, i.e., of the PLC-In output. The block also outputs the current sinusoidal model $\tilde{M}^{p(i)}$ to the PLC-Out block. The subsequent coded frames available in the jitter buffer (denoted as $C_{la,i} = \{ c_{i+1}^{N_p}, c_{i+2}^{N_p}, \ldots \}$, where $c_j^{N_p} = [c_j(1), c_j(2), \ldots, c_j(N_p)]$ and $N_p$ is the constant payload length of frame $j$) are used as input for "bad" frame $i$.

• PLC-Out
This block is an intermediate stage that handles inputs of variable sizes from the PLC-In ($s_{c,i}^{N_c}$) and Decoder ($s_i^{N_f}$) blocks and outputs audio samples of constant size to the hardware interface, with an additional algorithmic delay of 2.5 ms. It also smoothes the transition between the outputs of the concealment process (PLC-In) and of normal decoding. The block outputs audio samples (denoted as $s_{SM,i}^{N_f}$) according to the Control unit state.

• Decoder
This block is responsible for the decoding of standard coded frames. It is activated for the normal decoding process and for early decoding in cases where concealment is based on subsequent frames. The Decoder input is denoted $c_i^{N_p} = [c_i(1), c_i(2), \ldots, c_i(N_p)]$ and the output is denoted $s_i^{N_f}$.

4.2.2 Control Block

Figure 4-3 describes the Control block, which accepts the BFIs of the current and subsequent frames from the jitter buffer, i.e., $\{BFI_i, BFI_{i+1}\}$. The state machine of the Control block maintains these BFIs, as well as the BFI of the previous frame ($\mathbf{BFI}_i = [BFI_{i-1}, BFI_i, BFI_{i+1}]^T$). A BFI value can be one of the following: 1 - "bad", 0 - "good", n/a - not available (when the information about the next frame is not yet in the jitter buffer, neither "good" nor "bad"). The Control block sets the method of operation to be carried out by the PLC-In and PLC-Out blocks.

Figure 4-3: Control block

4.2.2.1 PLC-In Control

The PLC algorithm aims to conceal the missing frames as similarly as possible to the real frames before and after the missing gap, in terms of speech properties. Therefore, in addition to the usual sinusoidal AbS, model parameter matching to the previous and next (if available) real frames is applied. Moreover, if possible, sinusoidal model interpolation is applied; otherwise, only the history buffer is used to estimate the sinusoidal model of the missing gap.


In scenarios where the gap is longer than one frame, sinusoidal extrapolation from the previous model is used. Table 4-1 summarizes the methods performed by the PLC-In block per state.

No. | $\mathbf{BFI}_i$ state | Suggested method of operation
1 | $[0,1,\text{n/a}]^T$, $[0,1,1]^T$ | Perform analysis on previous frames, then extrapolate and match the sinusoidal model.
2 | $[1,1,\text{n/a}]^T$, $[1,1,1]^T$ | Use the previous sinusoidal model with extrapolation.
3 | $[0,1,0]^T$ | Perform analysis on previous and future frames, extrapolate the sinusoidal models (forward and backward, respectively), and then interpolate the sinusoidal model, if possible.
4 | $[1,1,0]^T$ | Perform analysis on the future frame and extrapolate the previous and future sinusoidal models (forward and backward, respectively), then interpolate the sinusoidal model, if possible.

Table 4-1: Summary of major PLC-In Control operations

4.2.2.2 PLC-Out Control

The PLC algorithm aims to conceal the missing frames as smoothly as possible with respect to the real signal before and after the missing gap. Therefore, the transition smoothing between synthetic and real speech, and vice versa, is done using OLA matching. In the case where the concealed or "good" frames are consecutive, they are output to the history buffer unchanged. Table 4-2 summarizes the methods performed by the PLC-Out block per state.

No. | $\mathbf{BFI}_i$ state | Suggested method of operation
1 | $[1,0,\sim]^T$ | Perform transition smoothing between the extrapolated previous signal and the normally decoded frame.
2 | $[0,0,\sim]^T$ | Output the normally decoded frame.
3 | $[0,1,\sim]^T$ | Perform smoothing between the concealed signal and the end of the previous frame. Output the transition followed by the concealed signal directly to the hardware audio codec.
4 | $[1,1,\sim]^T$ | Output the concealed frame.

Table 4-2: Summary of major PLC-Out Control operations

The "don't care" symbol "~" is used to denote that the information about this BFI is irrelevant to the method of operation.
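A compact sketch of this control logic, with Tables 4-1 and 4-2 collapsed into two dispatch functions (the returned strings stand in for the actual processing routines), could look as follows:

# BFI values: 0 ("good"), 1 ("bad"), None (next frame not yet in jitter buffer)
def plc_in_method(prev, cur, nxt):
    if cur == 0:
        return "no concealment"
    if prev == 0 and nxt in (1, None):
        return "analyze history, extrapolate forward, match"        # state 1
    if prev == 1 and nxt in (1, None):
        return "reuse previous model, extrapolate forward"          # state 2
    if prev == 0 and nxt == 0:
        return "analyze both sides, extrapolate both, interpolate"  # state 3
    return "analyze future frame, extrapolate both, interpolate"    # state 4

def plc_out_method(prev, cur):
    if (prev, cur) == (1, 0):
        return "smooth transition from extrapolation to decoded frame"
    if (prev, cur) == (0, 0):
        return "output decoded frame"
    if (prev, cur) == (0, 1):
        return "smooth previous frame end into concealed signal"
    return "output concealed frame"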

4.2.3 PLC-In Block

Figure 4-4 describes the PLC-In block, which acts according to the method set by the Control block.

Figure 4-4: PLC-In block

Table 4-3 provides a detailed description of the methods performed by the block (as set by the Control block), according to the states in Table 4-1.


$\mathbf{BFI}_i$ state: $[0,1,\text{n/a}]^T$, $[0,1,1]^T$
• Select the analysis signal $s_{h,i}^{N_{a,h}}$ from the history buffer using the analysis window adaptation logic.
• Perform normalization of the analysis signal $s_{h,i}^{N_{a,h}}$.
• Perform sinusoidal analysis on the normalized analysis signal $\tilde{s}_{h,i}^{N_{a,h}}$.
• Extrapolate the sinusoidal model $M^{p(i)}$ forward and update it in terms of phase integration; the resulting model is denoted $\tilde{M}^{p(i)}$.
• Match $\tilde{M}^{p(i)}$ with the previous frame in terms of phase offset and update the model parameters; the resulting model is denoted $\hat{M}^{p(i)}$.
• Generate synthetic speech $\tilde{s}_{c,i}^{N_c}$ using the sinusoidal model $\hat{M}^{p(i)}$.
• Scale $\tilde{s}_{c,i}^{N_c}$ and output the resulting signal $s_{c,i}^{N_c}$ to the PLC-Out block.

$\mathbf{BFI}_i$ state: $[1,1,\text{n/a}]^T$, $[1,1,1]^T$
• Extrapolate the previous sinusoidal model $M^{p(i)}$ forward and update it in terms of phase integration; the resulting model is denoted $\tilde{M}^{p(i)}$.
• Generate synthetic speech $\tilde{s}_{c,i}^{N_c}$ using the sinusoidal model $\tilde{M}^{p(i)}$.
• Scale $\tilde{s}_{c,i}^{N_c}$ and output the resulting signal $s_{c,i}^{N_c}$ to the PLC-Out block.

$\mathbf{BFI}_i$ state: $[0,1,0]^T$
• Select the analysis signal $s_{h,i}^{N_{a,h}}$ from the history buffer using the analysis window adaptation logic.
• Perform normalization of the analysis signals $s_{h,i}^{N_{a,h}}$ and $s_{la,i}^{N_{a,la}}$.
• Perform sinusoidal analysis on the normalized analysis signals $\tilde{s}_{h,i}^{N_{a,h}}$ and $\tilde{s}_{la,i}^{N_{a,la}}$.
• Extrapolate the sinusoidal model $M^{p(i)}$ forward and update it in terms of phase integration; the resulting model is denoted $\tilde{M}^{p(i)}$.
• Extrapolate the sinusoidal model $M^{n(i)}$ backward and update it in terms of phase integration; the resulting model is denoted $\tilde{M}^{n(i)}$.
• Match $\tilde{M}^{p(i)}$ with the previous frame in terms of phase offset and update the model parameters; the resulting model is denoted $\hat{M}^{p(i)}$.
• If the analysis window from the history buffer is longer than the analysis signal available in the look-ahead buffer, match the sinusoidal model $\hat{M}^{p(i)}$ forward, in terms of frequency modification.
• If the analysis signal taken from the history buffer is the same size as the analysis signal taken from the look-ahead buffer (160 samples), interpolate between the previous and next sinusoidal models ($\hat{M}^{p(i)}$ and $\tilde{M}^{n(i)}$, respectively); the interpolated model is denoted $\tilde{M}^{n(i),p(i)}$.
• Generate synthetic speech $\tilde{s}_{c,i}^{N_c}$ using the appropriate model (either $\hat{M}^{p(i)}$ or $\tilde{M}^{n(i),p(i)}$).
• Scale $\tilde{s}_{c,i}^{N_c}$ and output the resulting signal $s_{c,i}^{N_c}$ to the PLC-Out block.

$\mathbf{BFI}_i$ state: $[1,1,0]^T$
• Extrapolate the sinusoidal model $M^{p(i)}$ forward and update it in terms of phase integration; the resulting model is denoted $\tilde{M}^{p(i)}$.
• Perform normalization of the analysis signal $s_{la,i}^{N_{a,la}}$.
• Perform sinusoidal analysis on the normalized analysis signal $\tilde{s}_{la,i}^{N_{a,la}}$.
• Extrapolate the sinusoidal model $M^{n(i)}$ backward and update it in terms of phase integration; the resulting model is denoted $\tilde{M}^{n(i)}$.
• If the analysis window from the history buffer is longer than the analysis signal available in the look-ahead buffer, match the sinusoidal model $\tilde{M}^{p(i)}$ forward, in terms of frequency modification.
• If the analysis signal taken from the history buffer is the same size as the analysis signal taken from the look-ahead buffer (160 samples), interpolate between the previous and next sinusoidal models ($\tilde{M}^{p(i)}$ and $\tilde{M}^{n(i)}$, respectively); the interpolated model is denoted $\tilde{M}^{n(i),p(i)}$.
• Generate synthetic speech $\tilde{s}_{c,i}^{N_c}$ using the appropriate model (either $\tilde{M}^{p(i)}$ or $\tilde{M}^{n(i),p(i)}$).
• Scale $\tilde{s}_{c,i}^{N_c}$ and output the resulting signal $s_{c,i}^{N_c}$ to the PLC-Out block.

Table 4-3: Description of PLC-In operations

4.2.4 PLC-Out Block

Figure 4-5 describes the PLC-Out block. Its major role is to output to the hardware audio codec the samples (denoted by $s_{SM,i}^{N_f}$) that result from regular decoding. Moreover, when concealment is applied, the block performs transition smoothing between the decoded and concealed signals. The Control block selects the signal that is output to the hardware: either the PLC-In input $s_{c,i}^{N_c}$, the backward transition (based on the concealed signal) $s_{tr,i}^{N_c}$, the forward transition $s_{tr,i}^{N_f}$, or the Decoder input $s_i^{N_f}$. The selected signal passes through the history buffer, where it is delayed by the constant algorithmic delay of 2.5 ms.


Figure 4-5: PLC-Out block

The output $s_{o,i}^{N_o}$ denotes the signal that results from the method set by the Control block. Table 4-4 describes these methods according to the states in Table 4-2.

$\mathbf{BFI}_i$ state: $[1,0,\sim]^T$
• Extrapolate the previous sinusoidal model $M^{p(i)}$ and update it in terms of phase integration; the resulting model is denoted $\tilde{M}^{p(i)}$.
• Generate synthetic speech using $\tilde{M}^{p(i)}$, resulting in a normalized estimation of the current frame $\tilde{s}_i^{N_{OLA}}$.
• Scale the signal, resulting in $\hat{s}_i^{N_{OLA}}$.
• Perform OLA matching of 2.5 ms between the real and synthetic speech, i.e., between the first 2.5 ms of the "good" frame and the 2.5 ms of the extrapolated frame $\hat{s}_i^{N_{OLA}}$. The transition output signal is denoted $s_{tr,i}^{N_f}$.

$\mathbf{BFI}_i$ state: $[0,0,\sim]^T$
• Output the decoded frame $s_i^{N_f}$ directly.

$\mathbf{BFI}_i$ state: $[0,1,\sim]^T$
• Perform OLA matching between the current synthetic signal and the buffered quarter of the previous "good" frame. Output the transition signal of length $N_c$, denoted $s_{tr,i}^{N_c}$.

$\mathbf{BFI}_i$ state: $[1,1,\sim]^T$
• Output the concealed signal $s_{c,i}^{N_c}$ directly.

Table 4-4: Description of PLC-Out operations

The "don't care" symbol "~" is used to denote that the information about this BFI is irrelevant to the method of operation.

In addition, a pitch detector is used for estimating the pitch value $p_{h,i}$ of the history samples, for adapting the analysis window length.
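The 2.5 ms OLA transitions of Table 4-4 amount to a short crossfade between the synthetic and real signals; a minimal sketch, assuming a plain linear fade as the OLA window, is:

import numpy as np

def ola_transition(synthetic, received, fs=16000, t_ola=0.0025):
    # Crossfade (overlap-add) the extrapolated synthetic tail into the
    # first received "good" samples over a 2.5 ms window.
    n = int(round(fs * t_ola))                 # 40 samples at 16 kHz
    fade = np.linspace(0.0, 1.0, n, endpoint=False)
    head = (1.0 - fade) * synthetic[:n] + fade * received[:n]
    return np.concatenate([head, received[n:]])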

4.2.5 Sub-Blocks

This section describes each sub-block in the concealment process.

4.2.5.1 Analysis Window Adaptation, Pitch Estimation

In [25], the sinusoidal analysis frame is shown to need to be at least 2.5 times the average pitch duration. A longer frame, although accommodating any human pitch duration, can mask changes in the signal. Therefore, in the proposed block, a pitch detector is applied for the calculation of the analysis window length. The detector monitors the history buffer and updates the pitch value $p_{h,i}$ each time a new frame arrives. When a concealment operation is required, a calculation based on $p_{h,i}$ is performed. The calculated analysis window length is limited from above by the history buffer size. The lower limit is initially the length of a single frame (160 samples) but can be increased when the pitch estimation fails (e.g., due to unvoiced or transient speech); in this case, $p_{h,i}$ is kept constant and the lower limit is enlarged by 160 samples. We chose the SIFT algorithm as the pitch detector in the evaluation system [93].
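The window adaptation logic can be summarized in a short sketch (the parameter names and the exact clamping policy are illustrative assumptions based on the description above):

def analysis_window_length(pitch_lag, history_len, frame_len=160, factor=2.5):
    # Window of at least `factor` pitch periods (after [25]), clamped to the
    # history buffer size; pitch_lag is in samples, None if detection failed.
    lower = frame_len
    if pitch_lag is None:          # unvoiced/transient: enlarge the lower limit
        lower += frame_len
        pitch_lag = frame_len      # illustrative stand-in for the held value
    length = max(lower, int(round(factor * pitch_lag)))
    return min(length, history_len)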


4.2.5.2 Normalizing Stage

The normalizing stage of the proposed algorithm is based on least squares estimation. An analyzed signal of length $N_a$ is divided into sub-frames of length equal to a quarter of the frame length, i.e., $N_{sub} = N_f / 4$, with 50% overlap. Each sub-frame is denoted as

$$s_{a,i}^{N_{sub}} = \left[ s_a(n) \right]^T, \qquad n = 1 + (i-1)\cdot\frac{N_{sub}}{2},\ \ldots,\ (i-1)\cdot\frac{N_{sub}}{2} + N_{sub} \qquad (4.1)$$

where $i$ is the sub-frame index.

For each sub-frame $s_{a,i}^{N_{sub}}$, an energy value is calculated by

$$\hat{E}_{sub}(i) = \frac{\left( s_{a,i}^{N_{sub}} \right)^T s_{a,i}^{N_{sub}}}{N_{sub}}, \qquad i = 1, \ldots, K \qquad (4.2)$$

where $K$ denotes the total number of overlapped sub-frames in the analyzed signal and satisfies:

$$K = 2\,\frac{N_a - N_{sub}}{N_{sub}} + 1 \qquad (4.3)$$

After the estimation of the energies of all $K$ sub-frames, the least squares estimation is performed:

$$\begin{bmatrix} \hat{b} \\ \hat{a} \end{bmatrix} = \left( H^T H \right)^{-1} H^T \hat{E}_{sub}^K \qquad (4.4)$$

where

$$H = \begin{bmatrix} 1 & 0\cdot\frac{N_{sub}}{2} \\ \vdots & \vdots \\ 1 & (K-1)\cdot\frac{N_{sub}}{2} \end{bmatrix} \qquad (4.5)$$

Therefore, the energy of the analyzed signal is estimated using a linear equation in the estimated parameters $\hat{a}$, $\hat{b}$:

$$\hat{E}_a^{N_a} = [0, 1, \ldots, N_a - 1]^T \cdot \hat{a} + \hat{b} \qquad (4.6)$$


In order to avoid anomalies in the energy estimation, due to a silence signal and/or a linear estimate that crosses zero, the following correction is made:

$$\hat{E}_a^{N_a}\!\left( \hat{E}_a^{N_a} \le \min_{i=1,..,K}\{\hat{E}_{sub}(i)\} \right) = \max\!\left\{ 10^{-10},\ \min_{i=1,..,K}\{\hat{E}_{sub}(i)\} \right\} \qquad (4.7)$$

The output of the normalization procedure is

$$\tilde{s}_a^{N_a} = \left[ \tilde{s}_a(n) \right] = \left[ s_a(n) \cdot \frac{1}{\sqrt{\hat{E}_a^{N_a}(n)}} \right], \qquad n = 0, \ldots, N_a - 1 \qquad (4.8)$$
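A NumPy sketch of this normalizing stage, under the assumption that the envelope division in (4.8) is by the square root of the estimated energy, follows:

import numpy as np

def normalize(s_a, frame_len=160):
    # Sketch of the normalizing stage (eqs. 4.1-4.8): fit a line to
    # overlapped sub-frame energies and divide by the estimated envelope.
    n_a = len(s_a)
    n_sub = frame_len // 4                           # 40-sample sub-frames
    hop = n_sub // 2                                 # 50% overlap
    K = (n_a - n_sub) // hop + 1                     # eq. (4.3)
    e_sub = np.array([np.dot(s_a[k*hop:k*hop+n_sub],
                             s_a[k*hop:k*hop+n_sub]) / n_sub
                      for k in range(K)])            # eq. (4.2)
    H = np.column_stack([np.ones(K), np.arange(K) * hop])   # eq. (4.5)
    b, a = np.linalg.lstsq(H, e_sub, rcond=None)[0]         # eq. (4.4)
    e_a = np.arange(n_a) * a + b                            # eq. (4.6)
    floor = max(1e-10, e_sub.min())                         # eq. (4.7)
    e_a = np.maximum(e_a, floor)
    return s_a / np.sqrt(e_a)                               # eq. (4.8)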

4.2.5.3 Sinusoidal Analysis

The sinusoidal modeling is based on selecting the highest spectral peaks from the Discrete Short-Time Fourier Transform (DSTFT). In [25], this approach is proved for the purely voiced case. However, the straightforward approach of sinusoidal analysis is also "good enough" for modeling unvoiced speech, since it produces uniformly distributed peaks with random phases. Figure 4-6 describes the procedure used for estimating the model parameters of the normalized analyzed signal.

Figure 4-6: Sinusoidal analysis


The estimated sinusoidal model output is denoted as

$$M^a = \left\{ \left( \hat{A}_l^a, \hat{\omega}_l^a, \hat{\theta}_l^a \right)_{l=1,..,L^a},\ \varepsilon^a \right\} \qquad (4.9)$$

The procedure for the sinusoidal analysis is:

• DC Removal
The mean value of the analyzed signal is subtracted from the signal. It is added back later (at the sinusoidal synthesis stage).
• Windowing
The analyzed signal frame $\tilde{s}_a^{N_a}$ of length $N_a$ is windowed using a Hamming window in order to reduce the effects of spectral leakage, side lobes, etc. [91].
• FFT
The windowed signal $\tilde{s}_a^{N_a}(n) \cdot w^{N_a}(n)$, $n = 0, \ldots, N_a - 1$, is zero-padded to an FFT length of $N_{FFT} = 1024$ (for achieving "moderate" resolution in the frequency domain) and is transformed using the FFT algorithm. See the frequency resolution test for more details (section 5.5.1.2).
• ABS
The absolute value $|X(\omega)|$ of the transformed signal is taken, in order to find the spectrum peaks with the peak-picking algorithm.
• Peak Picking Algorithm
The absolute spectrum $|X(\omega)|$ is searched for peaks by differencing and searching for sign changes. A found peak is valid if it meets two conditions:
1. its height is at least the maximum minus 40 dB, $|X(\omega_l)| \ge 0.01 \cdot |X|_{max}$;
2. it is a local maximum over its four neighbors, $|X(\omega_l)| \ge |X(\omega_{l+j})|$, $j = \pm 1, \pm 2$.
All valid peaks are sorted according to their magnitudes. Then, at most $L = 30$ of the highest peaks (for achieving optimal performance at moderate computational cost) are taken to reproduce the analyzed speech signal. See the model order test for more details (section 5.5.1.1). A combined sketch of this step and the two following steps is given after this list.


• Parabolic Interpolation
In order to increase the frequency resolution, a parabolic interpolation technique is applied to each selected peak [86] (see Appendix A for a detailed description of the parabolic interpolation algorithm). The output of this procedure is

$$\hat{\omega}_l^a = 2\pi \frac{f_s}{N_{FFT}}\, k^* = \omega_l + 2\pi \frac{f_s}{N_{FFT}}\, p \tag{4.10}$$

See the increased resolution test for details on the parabolic interpolation effect (section 5.5.1.3).
• ABS & Tan⁻¹
Once the spectral peaks are selected from $X(\omega)$, their magnitude and phase properties can be found from the real and imaginary parts of the complex interpolated transform:

$$\hat{X}(\hat{\omega}_l^a) = \hat{X}_R(\hat{\omega}_l^a) + j \cdot \hat{X}_I(\hat{\omega}_l^a) \tag{4.11}$$

Magnitude and phase are computed, respectively, as:

$$\hat{A}_l^a = \sqrt{\hat{X}_R^2(\hat{\omega}_l^a) + \hat{X}_I^2(\hat{\omega}_l^a)}, \qquad \hat{\theta}_l^a = \tan^{-1}\frac{\hat{X}_I(\hat{\omega}_l^a)}{\hat{X}_R(\hat{\omega}_l^a)} \tag{4.12}$$
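As a concrete illustration of the peak-picking step above, the following Python sketch applies the two validity conditions and the magnitude-based selection to a magnitude spectrum. The function name and the explicit loop are our own choices; the thesis describes an equivalent differencing/sign-change search.

```python
import numpy as np

def pick_peaks(x_mag, max_peaks=30, floor_db=40.0):
    """Return the indices of at most `max_peaks` valid spectral peaks:
    local maxima over their four neighbours that lie within `floor_db`
    of the global maximum (40 dB corresponds to the 0.01 factor above)."""
    threshold = x_mag.max() * 10.0 ** (-floor_db / 20.0)
    peaks = []
    for k in range(2, len(x_mag) - 2):
        neighbours = (x_mag[k - 2], x_mag[k - 1], x_mag[k + 1], x_mag[k + 2])
        if x_mag[k] >= max(neighbours) and x_mag[k] >= threshold:
            peaks.append(k)
    peaks.sort(key=lambda k: x_mag[k], reverse=True)  # strongest first
    return peaks[:max_peaks]
```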

4.2.5.4 Sinusoidal Synthesis

The synthesis of speech based on the sinusoidal model $M^a$ applies multiple independent sinusoidal generators whose outputs are summed, as described in Figure 4-7. However, since the sinusoidal parameters can vary over the time domain (e.g., when sinusoidal model interpolation is done), the sinusoidal model is a function of $n$ and is given as

$$M^a(n) = \left\{ \left( \hat{A}_l^a(n), \hat{\omega}_l^a(n), \hat{\theta}_l^a \right)_{l=1,\ldots,L^a}, \hat{\varepsilon}^a(n) \right\} \tag{4.13}$$

The number of generators depends on the sinusoidal model order and the actual number of picked spectral peaks. Figure 4-7 describes the procedure used for sinusoidal synthesis.


Figure 4-7: Sinusoidal synthesis

Each sinusoidal generator uses three parameters: the discrete time index, the sampling frequency (which can vary over the time domain) and the phase offset (notation will be given later, in the matching stage). The estimated sinusoidal synthesized signal output is denoted as

$$\hat{s}_{as}^{N_a}\left(n, f_s(n), \varphi\right) = \sum_{l=1}^{L^a} \hat{A}_l^a(n) \cos\left( \hat{\omega}_l^a(n)\,\frac{n}{f_s(n)} + \hat{\theta}_l^a + \varphi \right) + \hat{\varepsilon}^a(n) \tag{4.14}$$

4.2.5.5 Sinusoidal Model Extrapolation

Sinusoidal model extrapolation is used to update the model parameters in order to obtain continuity between the analyzed and synthetic signals and to handle consecutive losses, when there is no additional information regarding the spectrum of the missed signal. In the extrapolation process the same spectral peaks are reused, while the phases $\hat{\theta}_l$, $l = 1,\ldots,L$, of all sines are updated using simple instantaneous-frequency integration for "past" and "future" analysis-based modeling:


$$\theta_l^i = \begin{cases} \theta_l^{p(i)} = \hat{\theta}_l^{p(i)} + \hat{\omega}_l^{p(i)} \cdot \dfrac{N_a^{p(i)}}{f_s}, & l = 1,\ldots,L^{p(i)} \\[2mm] \theta_l^{n(i)} = \hat{\theta}_l^{n(i)} - \hat{\omega}_l^{n(i)} \cdot \dfrac{N_c}{f_s}, & l = 1,\ldots,L^{n(i)} \end{cases} \tag{4.15}$$

where $N_a^{p(i)}$ and $N_c$ denote the length of the previous analysis frame (for forward extrapolation) and the length of the concealing frame (for backward extrapolation), respectively. The amplitudes, frequencies and DC values, however, are left unchanged during extrapolation:

$$A_l^i = \begin{cases} A_l^{p(i)} = \hat{A}_l^{p(i)}, & l = 1,\ldots,L^{p(i)} \\ A_l^{n(i)} = \hat{A}_l^{n(i)}, & l = 1,\ldots,L^{n(i)} \end{cases} \tag{4.16}$$

$$\omega_l^i = \begin{cases} \omega_l^{p(i)} = \hat{\omega}_l^{p(i)}, & l = 1,\ldots,L^{p(i)} \\ \omega_l^{n(i)} = \hat{\omega}_l^{n(i)}, & l = 1,\ldots,L^{n(i)} \end{cases} \tag{4.17}$$

$$\varepsilon^i = \begin{cases} \varepsilon^{p(i)} = \hat{\varepsilon}^{p(i)} \\ \varepsilon^{n(i)} = \hat{\varepsilon}^{n(i)} \end{cases} \tag{4.18}$$

The sinusoidal model of frame $i$ is denoted as

$$M^i = \left\{ \left( A_l^i, \omega_l^i, \theta_l^i \right)_{l=1,\ldots,L^i}, \varepsilon^i \right\} = \begin{cases} M^{p(i)} = \left\{ \left( A_l^{p(i)}, \omega_l^{p(i)}, \theta_l^{p(i)} \right)_{l=1,\ldots,L^{p(i)}}, \varepsilon^{p(i)} \right\} \\[2mm] M^{n(i)} = \left\{ \left( A_l^{n(i)}, \omega_l^{n(i)}, \theta_l^{n(i)} \right)_{l=1,\ldots,L^{n(i)}}, \varepsilon^{n(i)} \right\} \end{cases} \tag{4.19}$$
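The phase update of Eq. (4.15) reduces to a single line per direction, as in the hedged sketch below; it assumes angular frequencies as produced by Eq. (4.10), and the wrap to $[0, 2\pi)$ is added for numerical convenience only.

```python
import numpy as np

def extrapolate_phases(thetas, omegas, n_shift, fs=16000.0, forward=True):
    """Instantaneous-frequency integration (Eq. 4.15): advance the
    previous-frame phases by the analysis frame length (forward) or
    rewind the next-frame phases by the concealing frame length
    (backward); amplitudes, frequencies and DC stay unchanged."""
    sign = 1.0 if forward else -1.0
    return np.mod(thetas + sign * omegas * n_shift / fs, 2.0 * np.pi)
```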

4.2.5.6 Sinusoidal Model Interpolation

Sinusoidal model interpolation is performed when the future analysis signal meets the constraints. All sinusoidal parameters are interpolated, i.e., the amplitudes, frequencies, phases and DC may differ between frames. A modified "birth" and "death" procedure, following [25], is performed so that interpolation occurs only between spectral peaks that are close enough on the frequency axis. Unmatched frequencies are interpolated with themselves, but with the amplitude "zeroed" at the start or at the end in the case of "birth" or "death", respectively.


The sinusoidal model interpolation is given as a combination of the sinusoidal models of the previous and next frames relative to the $i$-th frame, $M^{p(i)}$ and $M^{n(i)}$ respectively, and is denoted as

$$M^i(n) = M^{p(i),n(i)}(n) = \left\{ \left( A_l^i(n), \omega_l^i(n), \theta_l^i \right)_{l=1,\ldots,L^i}, \varepsilon^i(n) \right\} \tag{4.20}$$

where:

$$A_l^i(n) = \begin{cases} \hat{A}_l^{p(i)} + \dfrac{\hat{A}_l^{n(i)} - \hat{A}_l^{p(i)}}{N_c - 1}\, n, & \text{for "matched"} \\[2mm] \hat{A}_l^{p(i)} - \dfrac{\hat{A}_l^{p(i)}}{N_c - 1}\, n, & \text{for "death"} \\[2mm] \dfrac{\hat{A}_l^{n(i)}}{N_c - 1}\, n, & \text{for "birth"} \end{cases} \qquad n = 0,\ldots,N_c - 1 \tag{4.21}$$

$$\omega_l^i(n) = \begin{cases} \hat{\omega}_l^{p(i)} + \dfrac{\hat{\omega}_l^{n(i)} - \hat{\omega}_l^{p(i)}}{N_c - 1}\, n, & \text{for "matched"} \\[2mm] \hat{\omega}_l^{p(i)}, & \text{for "death"} \\[2mm] \hat{\omega}_l^{n(i)}, & \text{for "birth"} \end{cases} \qquad n = 0,\ldots,N_c - 1 \tag{4.22}$$

$$\varepsilon^i(n) = \hat{\varepsilon}^{p(i)} + \frac{\hat{\varepsilon}^{n(i)} - \hat{\varepsilon}^{p(i)}}{N_c - 1}\, n, \qquad n = 0,\ldots,N_c - 1 \tag{4.23}$$

$\theta_l^i$ is kept as an initial condition and is not changed, i.e.,

$$\theta_l^i = \begin{cases} \hat{\theta}_l^{p(i)}, & \text{for "matched"} \\ \hat{\theta}_l^{p(i)}, & \text{for "death"} \\ \hat{\theta}_l^{n(i)}, & \text{for "birth"} \end{cases} \tag{4.24}$$
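The birth/death bookkeeping and the linear interpolation of Eqs. (4.20)-(4.24) can be sketched as follows. The greedy nearest-frequency pairing and the 25 Hz matching threshold (the value used in the configuration of section 5.5.1) are our illustrative choices, not a verbatim reproduction of [25].

```python
import numpy as np

def interpolate_tracks(prev, nxt, n_c, max_df=25.0):
    """`prev`/`nxt` are lists of (A, omega, theta) triples (omega in rad/s).
    Peaks closer than `max_df` Hz are "matched"; the rest "die" (fade out)
    or are "born" (fade in).  Returns per-sample (A(n), omega(n), theta)."""
    ramp = np.arange(n_c) / (n_c - 1)              # 0 .. 1 over the frame
    tracks, used = [], set()
    for a_p, w_p, th_p in prev:
        cand = [j for j in range(len(nxt)) if j not in used
                and abs(nxt[j][1] - w_p) < 2.0 * np.pi * max_df]
        if cand:                                   # "matched"
            j = min(cand, key=lambda j: abs(nxt[j][1] - w_p))
            used.add(j)
            a_n, w_n, _ = nxt[j]
            tracks.append((a_p + (a_n - a_p) * ramp,
                           w_p + (w_n - w_p) * ramp, th_p))
        else:                                      # "death": amplitude -> 0
            tracks.append((a_p * (1.0 - ramp), np.full(n_c, w_p), th_p))
    for j, (a_n, w_n, th_n) in enumerate(nxt):     # "birth": amplitude 0 ->
        if j not in used:
            tracks.append((a_n * ramp, np.full(n_c, w_n), th_n))
    return tracks
```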

4.2.5.7 Matching Stage

The proposed algorithm uses two matching methods: a model-based method and an OLA method. Model matching is used to find the optimal model parameters that make the reconstructed signal as similar as possible to the analysis signals, whereas


OLA matching smooths the transition between the synthetic and real signals.

4.2.5.7.1 Sinusoidal Model Matching

For matching with the previous frame in terms of phase offset, "phase" matching is performed on the sinusoidal model, as described in the following equation:

$$\varphi^p = \arg\max_{\varphi^p \in [0,2\pi)} \left\{ \hat{\mathbf{s}}^c\left(n, \varphi^p\right) \cdot \mathbf{s}^p \Big|_{n=-N_f:-1} \right\}$$
$$\hat{\mathbf{s}}^c\left(n, \varphi^p\right)^T \Big|_{n=0:N_f-1} = \sum_{l=1}^{L^c} \hat{A}_l^c \cos\left( \hat{\omega}_l^c\,\frac{n}{f_s} + \hat{\theta}_l^c + \varphi^p \right) + \hat{\varepsilon}^c \tag{4.25}$$

For matching with the next frame in terms of frequency modification, "stretch" matching is performed on the sinusoidal model, as described in the following equation:

$$f_s^n = \arg\max_{f_s^n \in [0.95 f_s,\, 1.05 f_s]} \left\{ \hat{\mathbf{s}}^c\left(n, f_s^n\right) \cdot \mathbf{s}^n \Big|_{n=N_f:2N_f-1} \right\}$$
$$\hat{\mathbf{s}}^c\left(n, f_s(n, f_s^n)\right)^T \Big|_{n=0:N_f-1} = \sum_{l=1}^{L^c} \hat{A}_l^c \cos\left( \hat{\omega}_l^c\,\frac{n}{f_s(n)} + \hat{\theta}_l^c \right) + \hat{\varepsilon}^c \tag{4.26}$$
$$f_s(n, f_s^n) = f_s + \frac{f_s^n - f_s}{N_f - 1}\, n$$

Additional sinusoidal model matching techniques are described in Appendix B.
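Both criteria are one-dimensional searches over a scalar parameter. The sketch below illustrates the "phase" search of Eq. (4.25) on a discrete grid; the helper `model_synth(n, phi)`, assumed to evaluate Eq. (4.14) at the given indices, and the grid size are hypothetical, and the "stretch" search of Eq. (4.26) follows the same pattern over $[0.95 f_s, 1.05 f_s]$.

```python
import numpy as np

def match_phase(model_synth, s_prev, n_f=160, grid=64):
    """Pick the phase offset in [0, 2*pi) whose backward-extended synthetic
    signal has the largest inner product with the previous real frame."""
    n_back = np.arange(-n_f, 0)                  # n = -N_f .. -1
    candidates = np.linspace(0.0, 2.0 * np.pi, grid, endpoint=False)
    scores = [np.dot(model_synth(n_back, phi), s_prev) for phi in candidates]
    return candidates[int(np.argmax(scores))]
```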

4.2.5.7.2 Overlap-Add Matching

When a "bad" frame follows a "good" one, the concealed signal is extrapolated back for 2.5 ms (OLA length $N_{OLA} = 40$ samples) and OLA-matched with the last 2.5 ms of the previous "good" frame that has not been played yet (due to the delay in the history buffer), as described in Figure 4-8.


Figure 4-8: OLA backward matching illustration

When a "good" frame follows a "bad" one, the estimate of the current frame (based on extrapolation of the previous frame) is OLA-matched with the first 2.5 ms of the current frame, as described in Figure 4-9.

Figure 4-9: OLA forward matching illustration
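Both transitions reduce to a short crossfade over the 40-sample overlap region. The shape of the OLA window is not specified in the text, so the linear ramp in the sketch below is an assumption.

```python
import numpy as np

def ola_crossfade(fade_out, fade_in, n_ola=40):
    """Linear crossfade over the 2.5 ms (40-sample) overlap between the
    signal being replaced (`fade_out`) and the one taking over (`fade_in`)."""
    w = np.linspace(0.0, 1.0, n_ola)
    return fade_out[:n_ola] * (1.0 - w) + fade_in[:n_ola] * w
```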

4.2.5.8 Scaling Stage

Unfortunately, no concealment process, however good, can predict the missing packet information perfectly. In order to avoid an artificially long continuation that annoys the listener, a scaling envelope framework is required. Scaling down is often found in concealment algorithms, performed either directly on the signal or on an energy parametric representation. Since the algorithm works on normalized speech segments, a combined approach is implemented in the proposed algorithm: the gain is calculated from the energy estimation characteristics, but directly affects the signal envelope (Figure 4-10):

$$\mathbf{s}_{c,i}^{N_c} = \mathbf{s}_{c,i}^{N_c} \cdot \mathbf{g}_i^{N_c} \tag{4.27}$$


The gain, denoted $\mathbf{g}_i^{N_c}$, is a function of the previously calculated gain $\mathbf{g}_{i-1}^{N_c}$ and of the energy characteristics of the previous and next analysis signals, $\hat{E}_a^{p(i),N_a}$ and $\hat{E}_a^{n(i),N_a}$, if they exist.

• First Loss Scaling
In this case, the envelope of the concealed signal should remain the same as that of the previous frame. For this reason, the gain is constant and equals the square root of the end point of the estimated energy of the previous analyzed signal:

$$g_i^{N_c}(n) = \left[ \hat{E}_a^{p(i),N_a}(N_a - 1) \right]^{\frac{1}{2}}, \qquad n = 0,\ldots,N_c - 1 \tag{4.28}$$

• Consecutive Loss Scaling
In this case, each subsequent concealed frame is linearly muted to three quarters of its initial envelope. The initial value equals the end point of the previous concealed envelope:

$$g_i^{N_c}(n) = g_{i-1}^{N_c}(N_c - 1) - \frac{\frac{1}{4}\, g_{i-1}^{N_c}(N_c - 1)}{N_c - 1}\, n, \qquad n = 0,\ldots,N_c - 1 \tag{4.29}$$

• Interpolation Scaling
In this case, the scaling gain is interpolated between the previous gain end point and the square root of the first point of the estimated energy of the future analyzed signal:

$$g_i^{N_c}(n) = g_{i-1}^{N_c}(N_c - 1) + \frac{\left[ \hat{E}_a^{n(i),N_a}(0) \right]^{\frac{1}{2}} - g_{i-1}^{N_c}(N_c - 1)}{N_c - 1}\, n, \qquad n = 0,\ldots,N_c - 1 \tag{4.30}$$

When $BFI_{i-1} = 0$, $g_{i-1}^{N_c}(N_c - 1) \triangleq \left[ \hat{E}_a^{p(i),N_a}(N_a - 1) \right]^{\frac{1}{2}}$.

• OLA Matching Scaling
For a transition frame, the scaling gain is constant and equals the gain at the end of the previous frame:

$$g_i^{N_{OLA}}(n) = g_{i-1}^{N_c}(N_c - 1), \qquad n = 0,\ldots,N_{OLA} - 1 \tag{4.31}$$

Figure 4-10: Scaling
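For illustration, the four gain rules of Eqs. (4.28)-(4.31) can be collected in a single helper, as sketched below; the function name and the case labels are our own.

```python
import numpy as np

def scaling_gain(case, n_c, g_prev_end=None, e_prev_end=None, e_next_start=None):
    """Per-sample scaling gain.  `g_prev_end` is the last gain value of the
    previous frame; `e_prev_end`/`e_next_start` are the boundary values of
    the estimated energies of the previous/next analysis signals."""
    n = np.arange(n_c)
    if case == "first_loss":                       # Eq. (4.28): constant
        return np.full(n_c, np.sqrt(e_prev_end))
    if case == "consecutive_loss":                 # Eq. (4.29): mute to 3/4
        return g_prev_end - (0.25 * g_prev_end / (n_c - 1)) * n
    if case == "interpolation":                    # Eq. (4.30): ramp to next
        return g_prev_end + (np.sqrt(e_next_start) - g_prev_end) / (n_c - 1) * n
    if case == "ola":                              # Eq. (4.31): constant
        return np.full(n_c, g_prev_end)
    raise ValueError(case)
```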


4.2.5.9 History Buffer

The block generally buffers the decoded or concealed signal (in the case of packet loss) with an additional algorithmic delay of 2.5 ms. The delayed samples are used for smoothing the transition between real and synthetic signals in a packet loss scenario. Additionally, this block stores the last 30 ms of samples for sinusoidal analysis in the concealment scheme and for pitch estimation (needed for the calculation of the analysis window length). The block operations consist of the following steps:

• A new frame $i$, denoted $\mathbf{s}_{o,i}^{N_o}$, is pushed into the buffer, while the memory accommodates the previous frames.
• The memory is shifted by one frame and the oldest frame, $i-4$, is flushed out.
• The new frame $i$ now occupies the slot previously hosted by frame $i-1$.
• Finally, the frame denoted $\mathbf{s}_{SM,i}^{N_f}$ is output with a delay of 2.5 ms.
• When a backward smooth transition is performed, the signal entering the history buffer, $\mathbf{s}_{o,i}^{N_o}$, consists of the current concealed frame and a backward transition of 2.5 ms length. The transition replaces the current un-played samples and is output directly to the hardware audio codec without being delayed.
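A minimal sketch of this bookkeeping, assuming four 10 ms slots and a 40-sample (2.5 ms) output delay, is given below; the class name and the return convention are illustrative only.

```python
import numpy as np
from collections import deque

class HistoryBuffer:
    """Four-frame history with a 2.5 ms algorithmic delay: pushing a new
    frame shifts the memory, drops the oldest frame, and releases the
    delayed output (tail of the previous frame + head of the new one)."""
    def __init__(self, frame_len=160, n_frames=4, delay=40):
        self.delay = delay
        self.slots = deque([np.zeros(frame_len) for _ in range(n_frames)],
                           maxlen=n_frames)        # oldest frame drops out

    def push(self, frame):
        prev_tail = self.slots[-1][-self.delay:]   # un-played 2.5 ms
        self.slots.append(frame.copy())
        return np.concatenate([prev_tail, frame[:-self.delay]])
```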


Figure 4-11: Proposed PLC detailed-level block diagram


5 Algorithm Evaluation

In order to evaluate the quality of the proposed algorithm, we compare it with known standardized packet loss concealment algorithms designed for wideband coding. We chose the ITU-T G.722 Appendixes III and IV [58, 64] concealment algorithms as the baseline for comparison, since they perform well and are very popular in wideband VoIP applications (their concepts are described above). To achieve statistical significance of the results, we use a large-scale multi-language database (ITU-T P-Series Supplement 23). For the objective evaluation of the proposed algorithm, we use PESQ [35, 36], the leading objective quality measure for wideband speech signals. Moreover, for subjective evaluation we perform informal tests on a small subset. To model the IP network and its impairments we use the popular free emulation tool NIST Net [94]. In the following sections, we describe the evaluation setup in more detail and list the performed tests and their results.

5.1 Evaluation Framework

This section is devoted to the objective evaluation framework we use to compare the G.722 Appendixes III/IV and the SM-PLC in a real-world application. Unlike the G.722 Recommendations, the SM-PLC algorithm is external to the decoder. Hence, to obtain an accurate comparison on equal terms, we should take into account the effects of voice coding. For this reason, we include G.722 encoding and decoding in the testing scheme for the proposed algorithm, as shown in Figure 5-1.


Figure 5-1: Proposed testing scheme for SM-PLC compared to G.722 PLC Appendixes

Moreover, we should take into account that knowledge of the decoder state enables more accurate reconstruction of lost data, giving a built-in advantage to the G.722 Recommendations. Since the evaluation method we use (PESQ) is intrusive, we also need to feed it with the original un-coded signal. Combining all of the above, we propose the evaluation framework described in Figure 5-2.

Figure 5-2: Evaluation framework

5.2 Speech Quality Assessment

As mentioned before (sections 2.8.3, 2.8.5), we use the Perceptual Evaluation of Speech Quality (PESQ) [35, 36] to evaluate voice quality and to test the proposed algorithm. The raw scores provided by this method (ITU-T Rec. P.862.2 [36]) vary in the range of -0.5 to 4.5. However, to allow a linear comparison with MOS, we need to transform these scores into MOS-LQO_W [47] scores (which vary in the range of 1 to 5). The function for this transformation (presented in P.862.2) has been optimized on a large corpus of subjective data representing different applications and languages. See the detailed description of PESQ in Appendix C.

5.3 Test Vectors

In order to evaluate the performance of the algorithm correctly, whether by subjective or automated tests, we have to use a suitable database of speech signals. The database should be as diverse as possible, to ensure that the algorithm is tested under many different conditions (including the speech sentence patterns and silent periods, minimum and maximum signal duration, etc.). Such a database of coded and source speech material is provided by Supplement 23 to the P series of ITU-T Recommendations [95]. The purpose of this database is to provide source, pre-processed and processed speech material, and related subjective test plans and scores, for the development of new and revised ITU Recommendations relating to objective voice quality measures. The speech data are organized in three experiments, with material contributed by five organizations: AT&T (USA), CNET (France), CSELT (Italy), Nortel (formerly BNR, Canada), and NTT (Japan). All speech samples in the database last 8 s and include two short sentences separated by a silent period of at least 1 s. In our experiment, we use the full pre-processed speech material (1828 files), recorded in 16-bit linear PCM at a sampling rate of 16 kHz, from the three ITU experiments (Interworking With Other Wireless and Transmission Standards, Effect of Environmental Noise, and Effect of Channel Degradations). We selected this large-scale multi-language dataset (described in Table 5-1) to achieve statistical significance of the results.

Language                  Exp.1   Exp.2   Exp.3   Total
English (AT&T, Nortel)      188     144     208     540
French (CNET)               188     144     208     540
Italian (CSELT)             n/a     n/a     208     208
Japanese (NTT)              188     144     208     540
Total per experiment        564     432     832    1828

Table 5-1: Structure of the objective evaluation dataset (a subset of ITU-T Sup. P.23)

When performing informal tests and during the research, we used a small subset of the dataset mentioned above. This subset consists of 5 male and 5 female sentences from each experiment and all supported languages, as shown in Table 5-2. For the final subjective evaluation, presented in the results section, we use only the English part of this subset.

Language                  Exp.1   Exp.2   Exp.3   Total
English (AT&T, Nortel)       10      10      10      30
French (CNET)                10      10      10      30
Italian (CSELT)             n/a     n/a      10      10
Japanese (NTT)               10      10      10      30
Total per experiment         30      30      40     100

Table 5-2: Structure of the informal evaluation dataset (a subset of ITU-T Sup. P.23)

5.4 Network Model

There are three major methods for performing network-related experiments. The first is simulation, where network conditions are recreated and the simulator uses the models associated with the simulated network elements to compute the estimated results of the experiment. The second method involves using a real network to perform the experiment. This technique enables more accurate measurements, but enforces a limited range of test scenarios, since a real network is difficult to control. The third method combines the previous two by using a network emulator: a tool that reproduces the effect of real networks on traffic by introducing artificial impairments. These impairments are controllable, so a wide range of network conditions can be examined. Moreover, using real traffic enables analysis of the effects that network conditions have on real applications. We selected the third method due to its significant advantages and use the NIST Net Emulation Tool [94, 96] to perform the actual emulation (see the description of the NIST Net Emulation Tool in Appendix D). To model a real network with this emulator, we use a pre-configured set of parameters that simulates the impairments on RTP traffic passing through two VoIP gateways, as described in Figure 5-3 [9].

[Figure 5-3 block diagram: RTP traffic from a VoIP gateway passes through the NIST Net emulator (impairments set via config control) to a second VoIP gateway, then through RTP de-packetizing, an adaptive jitter buffer and the decoder; a BFI record is produced under capturing control.]

Figure 5-3: Network emulation recording

Examples of the traffic impairments induced by the emulator during the network modeling are presented in Figure 5-4.

[Figure 5-4 panels: bad frame indicator streams over 800 frames for packet loss rates of 1%, 3%, 5%, 10% and 15%.]

Figure 5-4: Various packet loss rates simulated by the NIST Net Emulation Tool

The simulated packet streams shown for each loss rate were used in the evaluation stage.


5.5 Evaluation Results

In this section, we present the testing scenario and the results of the objective and subjective evaluation of the proposed algorithm, compared to ITU-T G.722 Appendixes III and IV.

5.5.1 Algorithm Configuration

The algorithm we propose in our research is highly configurable, and we need to fix the set of parameters providing the best performance. After fine-tuning on the evaluation dataset (Table 5-2), using the packet loss streams simulated by NIST Net (Figure 5-4), we selected the following configuration:

• Sampling frequency $f_s = 16$ kHz
• Frame length $N_f = 160$ samples
• FFT length $N_{FFT} = 1024$ points (frequency resolution $\Delta f \approx 16$ Hz)
• Sinusoidal model order (maximal number of sine waves) $L = 30$
• Analysis window type: Hamming window (first side-lobe attenuation of -43 dB)
• History buffer size $N_h = 480$ samples (3 frames / 30 ms)
• Look-ahead buffer size $N_{la} = 160$ samples (1 frame / 10 ms)
• Interpolation peak-matching frequency resolution $\Delta f = 25$ Hz
• Overlap-add matching length $N_{OLA} = 40$ samples (2.5 ms)

Parameters such as the FFT length and the model order affect two aspects: the performance and the complexity of the algorithm. In the following subsections, we describe the impact of the trade-off between these two aspects on parameter selection.

5.5.1.1 Model Order Selection Test

To fix the value of the model order parameter we performed a set of tests, with the value varying between 5 and 50. We achieved the optimal results for the value of 30. However, in real-world applications, where complexity could be an issue, the value of 20 can be applied with no significant quality degradation, as shown in Figure 5-5.


Figure 5-5: Sensitivity to sinusoidal model order

The tests involved in fixing the model order parameter had the following properties:

• FFT length $N_{FFT} = 2048$ points ($\Delta f \approx 8$ Hz)
• Parabolic interpolation is not used
• Look-ahead buffer is unavailable

The results of the performed tests are listed in Table 5-3.

MOS-LQO_W vs. Packet Loss Rate [%]

Model Order       1      3      5      10     15
5                 3.83   3.35   3.16   2.58   2.18
10                3.86   3.42   3.21   2.66   2.26
20                3.87   3.45   3.23   2.70   2.31
30                3.87   3.45   3.23   2.71   2.32
40                3.87   3.45   3.23   2.71   2.32
50                3.87   3.45   3.23   2.71   2.32

Table 5-3: Summary of model order selection test results


5.5.1.2 Frequency Resolution Test

To fix the FFT length we performed a set of tests, with the length varying between 256 and 8128 (frequency resolutions from ~64 Hz to ~2 Hz). We achieved the optimal results for the value of 2048. Later we show that this value can be reduced by introducing parabolic interpolation.


Figure 5-6: Sensitivity to frequency resolution

The tests involved in fixing the FFT length had the following properties:

• Sinusoidal model order (maximal number of sine waves) $L = 30$
• Parabolic interpolation is not used
• Look-ahead buffer is unavailable

The results of the performed tests are listed in Table 5-4.

MOS-LQO_W vs. Packet Loss Rate [%]

FFT Length        1      3      5      10     15
256               3.62   2.82   2.55   1.98   1.71
512               3.79   3.21   2.99   2.42   2.08
1024              3.85   3.42   3.18   2.66   2.28
2048              3.87   3.45   3.23   2.71   2.32
4096              3.88   3.47   3.25   2.72   2.33
8128              3.88   3.47   3.25   2.73   2.34

Table 5-4: Summary of frequency resolution test results


5.5.1.3 Increased Frequency Resolution Test

Figure 5-7 shows that the use of parabolic interpolation leads to a moderate improvement in objective test results. This improvement enables us to reduce the FFT length and, consequently, the complexity of the algorithm with no remarkable performance degradation compared to the longer FFT length. Hence, we apply the parabolic interpolation and fix the FFT length to the value of 1024.


Figure 5-7: Parabolic interpolation effect on objective performance

The tests involved in fixing the FFT length had the following properties:

• FFT length $N_{FFT} \in \{1024, 2048\}$ points ($\Delta f \approx 16, 8$ Hz, respectively)
• Sinusoidal model order (maximal number of sine waves) $L = 30$
• Parabolic interpolation is used
• Look-ahead buffer is unavailable

The results of the performed tests are listed in Table 5-5.

MOS-LQO_W vs. Packet Loss Rate [%]

Interpolation            1      3      5      10     15
No  (N_FFT = 1024)       3.85   3.42   3.18   2.66   2.28
Yes (N_FFT = 1024)       3.87   3.46   3.24   2.72   2.33
No  (N_FFT = 2048)       3.87   3.45   3.23   2.71   2.32
Yes (N_FFT = 2048)       3.87   3.47   3.25   2.73   2.34

Table 5-5: Summary of the parabolic interpolation test results

5.5.2 Objective Voice Quality Performance Test

In Figure 5-8 we compare the mean MOS-LQO_W scores of the evaluated concealment methods for various packet loss rates. The figure makes it obvious that, in terms of the objective evaluation, the SM-PLC is superior to the G.722 Appendix III and IV PLCs for all loss rates. In addition, it demonstrates that even better performance is achieved when the look-ahead is available to the SM-PLC.


Figure 5-8: Objective voice quality (P-Series Sup. 23)

The summary of the performed evaluations is collected in Table 5-6.

MOS-LQO_W vs. Packet Loss Rate [%]

PLC Method               1      3      5      10     15
G.722 Appendix IV        3.81   3.21   2.93   2.33   1.99
G.722 Appendix III       3.86   3.34   3.08   2.53   2.20
SM-PLC LA-OFF            3.91   3.47   3.26   2.72   2.38
SM-PLC LA-ON             3.93   3.53   3.34   2.85   2.52

Table 5-6: Summary of objective voice quality test results


In Figure 5-9, we compare the standard deviation of the MOS-LQO_W scores of the evaluated methods for various packet loss rates. From it, we conclude that, in terms of the standard deviation of scores, the proposed SM-PLC is comparable to the G.722 Recommendations and even slightly better (lower STD). Hence, we claim that the high performance obtained by the proposed algorithm is consistent.


Figure 5-9: Standard deviation of objective voice quality

The summary of the performed evaluations is collected in Table 5-7.

STD(MOS-LQO_W) vs. Packet Loss Rate [%]

PLC Method               1      3      5      10     15
G.722 Appendix IV        0.22   0.28   0.31   0.27   0.27
G.722 Appendix III       0.21   0.25   0.28   0.26   0.28
SM-PLC LA-OFF            0.19   0.23   0.24   0.23   0.27
SM-PLC LA-ON             0.20   0.22   0.25   0.24   0.28

Table 5-7: Summary of standard deviation test results

5.5.3 Language Dependence Test

Moreover, we show that the performance of the SM-PLC remains consistent when the algorithm is applied to various languages.


5.5.3.1 Italian

The test was carried out with the pre-processed part of the Italian speakers from the ITU P-Series Sup. 23 database (Table 5-1).


Figure 5-10: Objective voice quality on Italian (P-Series Sup.23)

MOS-LQO_W vs. Packet Loss Rate [%]

PLC Method               1      3      5      10     15
G.722 Appendix IV        3.83   3.27   2.84   2.32   1.95
G.722 Appendix III       3.85   3.38   2.96   2.45   2.12
SM-PLC LA-OFF            3.93   3.50   3.16   2.65   2.31
SM-PLC LA-ON             3.96   3.56   3.24   2.79   2.45

Table 5-8: Summary of objective voice quality results on Italian


5.5.3.2 French

The test was carried out with the pre-processed part of the French speakers from the ITU P-Series Sup. 23 database (Table 5-1).


Figure 5-11: Objective voice quality on French (P-Series Sup.23)

MOS-LQO_W vs. Packet Loss Rate [%]

PLC Method               1      3      5      10     15
G.722 Appendix IV        3.86   3.27   3.05   2.42   2.17
G.722 Appendix III       3.92   3.40   3.19   2.61   2.39
SM-PLC LA-OFF            3.97   3.52   3.35   2.77   2.54
SM-PLC LA-ON             3.98   3.59   3.41   2.91   2.68

Table 5-9: Summary of objective voice quality results on French


5.5.3.3 Japanese

The test was carried out with the pre-processed part of the Japanese speakers from the ITU P-Series Sup. 23 database (Table 5-1).


Figure 5-12: Objective voice quality on Japanese (P-Series Sup.23)

MOS-LQO_W vs. Packet Loss Rate [%]

PLC Method               1      3      5      10     15
G.722 Appendix IV        3.73   3.20   2.95   2.34   2.00
G.722 Appendix III       3.77   3.33   3.10   2.57   2.24
SM-PLC LA-OFF            3.83   3.48   3.29   2.77   2.45
SM-PLC LA-ON             3.85   3.53   3.36   2.90   2.60

Table 5-10: Summary of objective voice quality results on Japanese


5.5.3.4 English (North-American)

The test was carried out with the pre-processed part of the English speakers from the ITU P-Series Sup. 23 database (Table 5-1).


Figure 5-13: Objective voice quality on North-American English (P-Series Sup.23)

MOS-LQO_W vs. Packet Loss Rate [%]

PLC Method               1      3      5      10     15
G.722 Appendix IV        3.83   3.13   2.84   2.23   1.81
G.722 Appendix III       3.88   3.27   3.00   2.44   2.02
SM-PLC LA-OFF            3.92   3.39   3.19   2.64   2.18
SM-PLC LA-ON             3.96   3.47   3.27   2.77   2.31

Table 5-11: Summary of objective voice quality results on North-American English


5.5.4 Subjective Preference Test

In addition to the objective voice quality tests, a subjective test is also required in order to avoid possible evaluation errors in tests that do not involve human perception. During the subjective testing, all languages were used in order to validate the correctness of the concealment process. In the final test, 10 listeners were presented with English sentences from the testing database of Table 5-2 (half spoken by males and half by females). Each sentence was deteriorated by packet loss and then played back to each listener: once after being passed through the proposed PLC algorithm, and once after being passed through the algorithm recommended in ITU-T G.722 Appendix III. For each pair of sentences, the listeners were asked to vote for the one that sounds better. The test was repeated for the three different experiments at a packet loss rate of 15%. No special testing environment was set up, and regular headphones were used. The listeners were allowed to replay the sentences a couple of times before voting. The results of this preference test are summarized in Table 5-12.

Preference [%] at 15% loss rate    Male    Female   Avg.
G.722 Appendix III                  55%     42%     48.5%
SM-PLC LA-OFF                       45%     58%     51.5%

Table 5-12: Summary of subjective preference test results

Despite the inherent advantage of the standardized algorithm (knowledge of the decoder states guarantees smoother transitions), at least half of the voters favored the speech produced by the proposed algorithm.


6 Summary and Future Directions

Voice over IP has a real-time delay constraint, needed to enable interactive voice calls over the Internet. Therefore, the UDP protocol is used, which unfortunately has no mechanism for guaranteeing a certain Quality of Service. VoIP applications thus have to overcome impairments introduced during voice packet transmission, mainly packet loss, delay, jitter and reordering. These impairments, unless handled, significantly degrade the voice quality perceived by the user. In this chapter, the work is summarized and future research directions are suggested.

6.1 Summary

This work introduced a PLC method aiming to improve the perceived voice quality for Voice over IP users. The proposed method was implemented as a receiver-based algorithm designed for use with wideband speech signals sampled at a rate of 16 kHz, regardless of the type of coder. The Sinusoidal Representation of Speech model was used to reconstruct the missing packets. The model parameters were estimated from history and consecutive packets, and synthetic speech was then produced to fill the missing gap. Transitions from synthetic to real signal were smoothed using OLA matching. An algorithmic delay of 2.5 ms was introduced in order to handle the smooth real-to-synthetic speech transition. The concealed output signal was scaled to avoid an unnatural continuation of speech. To improve performance, future frames were used under the assumption that they are usually available in VoIP applications, where the jitter buffer and the voice decoder are coupled. The proposed scheme was tested under various configurations and packet loss probabilities. Informal subjective testing and the objective method showed promising results, mostly superior to the G.722 Recommendations. These results were achieved


even without the need to integrate the mechanism into the decoder itself, making the proposed algorithm easily applicable to systems with other types of codecs.

6.2 Future Directions

6.2.1 Using Adaptive Jitter Buffer in Concealment Process

The proposed packet loss concealment process exploits consecutive frames, if such are available during the packet loss period. Therefore, the jitter buffer already present in VoIP applications may be used, instead of introducing an additional algorithmic delay that degrades voice quality. This buffer should be implemented as an adaptive one, allowing better adaptation to the real delay variance.

6.2.2 Using More Consecutive Frames in Concealment Process

The evaluation showed that the use of a single consecutive frame from the jitter buffer can significantly improve the quality of the resulting speech. Involving additional frames available in the jitter buffer in the concealment process can provide even more significant quality gains.

6.2.3 Using Stationary Speech Segmentation for Analysis Window Adaptation Logic

Successful selection of the analysis signal can be crucial for sinusoidal analysis performance. The applied pitch detector does not perform well in all scenarios. An alternative method for analysis window segmentation is described in Appendix E. This method is based on sinusoidal modeling of speech and shows promising results. Unfortunately, it can perform well only for male or for female speech, but not for both, due to different pitch characteristics (different selection criteria). Introducing a classification of speech would allow the application of this algorithm, leading to more precise window adaptation.


6.2.4 Complexity Reduction for Real-Time Application

Since the proposed algorithm is computationally demanding, certain optimizations will be required in order to implement it in a real-time application. Such optimizations can include numerical shortcuts, especially with regard to the analysis stage.


Appendix A. Parabolic Interpolation

The interpolation concept is based on a parabola assumed to pass through the three highest absolute-spectrum bins (including the found peak). The parabola maximum represents the interpolated peak (magnitude and frequency). The interpolation process is performed for each selected peak. Figure A-1 describes the bins and the notation used for describing the interpolation process (the bins are indexed relative to the peak).

[Figure A-1 sketch: a parabola through the three bins at relative indices -1, 0, 1 with values α, β, γ.]
Figure A-1: Parabolic interpolation

For each found peak, a general parabola form is suggested:

$$y(x) \triangleq a(x - p)^2 + b \tag{A.1}$$

where $x$ denotes the relative bin index, with origin at the found peak. The parabola satisfies $y(-1) = \alpha$, $y(0) = \beta$, $y(1) = \gamma$. The bin logarithmic magnitudes are calculated using:

$$\alpha \triangleq 20\log_{10} X(\omega_{l-1}), \qquad \beta \triangleq 20\log_{10} X(\omega_l), \qquad \gamma \triangleq 20\log_{10} X(\omega_{l+1}) \tag{A.2}$$

where $X(\omega_l)$ denotes the spectrum value at $\omega_l$ (the found peak). Solving for the parabola peak location $p$ gives:

$$p = \frac{1}{2}\,\frac{\alpha - \gamma}{\alpha - 2\beta + \gamma} \tag{A.3}$$

The new peak location estimate is:

$$k^* = k + p \tag{A.4}$$


where $k$ denotes the originally found peak spectrum index and $k^*$ the interpolated (non-integer) peak index. The peak height is estimated using:

$$y(k^*) = \beta - \frac{1}{4}(\alpha - \gamma)\, p \tag{A.5}$$

$y(k^*)$ should be computed separately for the real and imaginary parts of the complex spectrum to yield a complex-valued peak estimate (magnitude and phase).
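The whole appendix condenses to a short routine. The sketch below implements Eqs. (A.1)-(A.5), applying the peak-height formula separately to the real and imaginary parts as prescribed above; the function name is an assumption.

```python
import numpy as np

def parabolic_peak(x, k):
    """Refine integer peak bin `k` of the complex spectrum `x`: fit a
    parabola through the log magnitudes of bins k-1, k, k+1 (Eqs. A.1-A.3)
    and return the non-integer index k* with interpolated magnitude/phase."""
    alpha, beta, gamma = 20.0 * np.log10(np.abs(x[k - 1:k + 2]))   # Eq. (A.2)
    p = 0.5 * (alpha - gamma) / (alpha - 2.0 * beta + gamma)       # Eq. (A.3)
    k_star = k + p                                                 # Eq. (A.4)

    def height(v):                       # Eq. (A.5), per component
        return v[k] - 0.25 * (v[k - 1] - v[k + 1]) * p

    peak = height(np.real(x)) + 1j * height(np.imag(x))
    return k_star, np.abs(peak), np.angle(peak)
```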


Appendix B. Sinusoidal Model Matching

Sinusoidal model matching is used to find the optimal model parameters that make the reconstructed signal as similar as possible to the analysis signals. This appendix describes three types of matching methods that can be implemented with sinusoidal modeling of speech:

• “Phase matching” – sinusoidal model matched in terms of phase offset • “Stretch matching” – sinusoidal model matched in terms of frequency modification • “Breathe matching” – sinusoidal model matched in terms of time domain modification

The common representation of speech by sinusoidal modeling using these three matching types is described by:

$$\hat{s}\left(n, f_s(n), \varphi, n^p, n^n\right) = \sum_{l=1}^{L} \hat{A}_l \cos\left( \hat{\omega}_l\,\frac{n}{f_s(n)} + \hat{\theta}_l + \varphi \right) + \hat{\varepsilon}$$
$$n = n^p + 0 : N_f - 1 + n^n \quad \text{("breathe" matching)} \tag{B.1}$$
$$f_s(n) = f_s^p + \frac{f_s^n - f_s^p}{N_f - 1 + n^n - n^p}\left(n - n^p\right) \quad \text{("stretch" matching)}$$
$$\varphi = \left\{ \varphi^p \text{ or } \varphi^n \right\} \quad \text{("phase" matching)}$$

Phase Matching

Matching with the previous frame is described by:

$$\varphi^p = \arg\max_{\varphi^p \in [0,2\pi)} \left\{ \hat{\mathbf{s}}\left(n, \varphi^p\right) \cdot \mathbf{s}^p \Big|_{n=-N_f:-1} \right\}$$
$$\hat{\mathbf{s}}\left(n, \varphi^p\right)^T \Big|_{n=0:N_f-1} = \sum_{l=1}^{L} \hat{A}_l \cos\left( \hat{\omega}_l\,\frac{n}{f_s} + \hat{\theta}_l + \varphi^p \right) + \hat{\varepsilon} \tag{B.2}$$

Matching with the next frame is described by:


$$\varphi^n = \arg\max_{\varphi^n \in [0,2\pi)} \left\{ \hat{\mathbf{s}}\left(n, \varphi^n\right) \cdot \mathbf{s}^n \Big|_{n=N_f:2N_f-1} \right\}$$
$$\hat{\mathbf{s}}\left(n, \varphi^n\right)^T \Big|_{n=0:N_f-1} = \sum_{l=1}^{L} \hat{A}_l \cos\left( \hat{\omega}_l\,\frac{n}{f_s} + \hat{\theta}_l + \varphi^n \right) + \hat{\varepsilon} \tag{B.3}$$

Stretch Matching

Matching with the previous frame is described by:

$$f_s^p = \arg\max_{f_s^p \in [0.95 f_s,\, 1.05 f_s]} \left\{ \hat{\mathbf{s}}\left(n, f_s^p\right) \cdot \mathbf{s}^p \Big|_{n=-N_f:-1} \right\}$$
$$\hat{\mathbf{s}}\left(n, f_s(n, f_s^p)\right)^T \Big|_{n=0:N_f-1} = \sum_{l=1}^{L} \hat{A}_l \cos\left( \hat{\omega}_l\,\frac{n}{f_s(n)} + \hat{\theta}_l \right) + \hat{\varepsilon} \tag{B.4}$$
$$f_s(n, f_s^p) = f_s^p + \frac{f_s - f_s^p}{N_f - 1}\, n$$

Matching with the next frame is described by:

$$f_s^n = \arg\max_{f_s^n \in [0.95 f_s,\, 1.05 f_s]} \left\{ \hat{\mathbf{s}}\left(n, f_s^n\right) \cdot \mathbf{s}^n \Big|_{n=N_f:2N_f-1} \right\}$$
$$\hat{\mathbf{s}}\left(n, f_s(n, f_s^n)\right)^T \Big|_{n=0:N_f-1} = \sum_{l=1}^{L} \hat{A}_l \cos\left( \hat{\omega}_l\,\frac{n}{f_s(n)} + \hat{\theta}_l \right) + \hat{\varepsilon} \tag{B.5}$$
$$f_s(n, f_s^n) = f_s + \frac{f_s^n - f_s}{N_f - 1}\, n$$

Breathe Matching

Matching with the previous frame is described by:

$$n^p = \arg\max_{n^p \in \left[ n^p_{init},\, n^p_{end} \right]} \left\{ \hat{\mathbf{s}}\left(n, n^p\right) \cdot \mathbf{s}^p \Big|_{n=-N_f : n^p - 1} \right\}$$
$$\hat{\mathbf{s}}\left(n, n^p\right)^T \Big|_{n=n^p : N_f - 1} = \sum_{l=1}^{L} \hat{A}_l \cos\left( \hat{\omega}_l\,\frac{n}{f_s} + \hat{\theta}_l \right) + \hat{\varepsilon} \tag{B.8}$$

where $n^p_{init}, n^p_{end} \in [-2.5\,\text{ms},\, 2.5\,\text{ms}] \cdot f_s$ for the first concealing match.

Matching with the next frame is described by:

$$n^n = \arg\max_{n^n \in \left[ n^n_{init},\, n^n_{end} \right]} \left\{ \hat{\mathbf{s}}\left(n, n^n\right) \cdot \mathbf{s}^n \Big|_{n=N_f : 2N_f - 1 + n^n} \right\}$$
$$\hat{\mathbf{s}}\left(n, n^n\right)^T \Big|_{n=0 : N_f - 1 + n^n} = \sum_{l=1}^{L} \hat{A}_l \cos\left( \hat{\omega}_l\,\frac{n}{f_s} + \hat{\theta}_l \right) + \hat{\varepsilon} \tag{B.9}$$

where $n^n_{init}, n^n_{end} \in [-2.5\,\text{ms},\, 2.5\,\text{ms}] \cdot f_s$ for the first concealing match.


However, the overall expansion/compression for frame $i$ is limited by the frame length, as described by the rule below:

$$\left| \sum_{j=1}^{i} \left( n^n_j - n^p_j \right) \right| \le N_f \tag{B.10}$$

where the calculation of $n^p_{init}, n^p_{end}$ and of $n^n_{init}, n^n_{end}$ is described by:

$$n^p_{init} = \sum_{j=1}^{i}\left(n^n_j - n^p_j\right) - N_f, \qquad n^p_{end} = N_f - \sum_{j=1}^{i}\left(n^p_j - n^n_j\right)$$
$$\text{if } n^p_{init} < -2.5\,\text{ms} \cdot f_s:\ n^p_{init} = -2.5\,\text{ms} \cdot f_s; \qquad \text{if } n^p_{end} > 2.5\,\text{ms} \cdot f_s:\ n^p_{end} = 2.5\,\text{ms} \cdot f_s \tag{B.11}$$
$$\text{if } n^p_{init} > 0:\ n^p_{init} = 0; \qquad \text{if } n^p_{end} < 0:\ n^p_{end} = 0$$

$$n^n_{init} = \sum_{j=1}^{i}\left(n^p_j - n^n_j\right) - N_f, \qquad n^n_{end} = N_f - \sum_{j=1}^{i}\left(n^n_j - n^p_j\right)$$
$$\text{if } n^n_{init} < -2.5\,\text{ms} \cdot f_s:\ n^n_{init} = -2.5\,\text{ms} \cdot f_s; \qquad \text{if } n^n_{end} > 2.5\,\text{ms} \cdot f_s:\ n^n_{end} = 2.5\,\text{ms} \cdot f_s \tag{B.12}$$
$$\text{if } n^n_{init} > 0:\ n^n_{init} = 0; \qquad \text{if } n^n_{end} < 0:\ n^n_{end} = 0$$

On the one hand, when the "breathe matching" method is used, the matching of speech segments during the concealment algorithm produces concealed frames of variable size. On the other hand, the hardware interface requires fixed-size frames. For this reason, a Play-Out Buffer should be used in this case, as described in [9]. The role of the Play-Out Buffer is to accept variable-size input frames and produce fixed-size output frames.


Appendix C. Perceptual Evaluation of Speech Quality

PESQ provides an objective measure that predicts the results of subjective listening tests on telephony systems. To measure speech quality, PESQ uses a perceptual model to compare the original signal with the degraded signal at the output of the network system. The result of comparing the reference and degraded signals is a quality score. This score is analogous to the subjective Mean Opinion Score (MOS) measured using panel tests according to ITU-T P.800 [32].


Figure C-1: Usage of PESQ

PESQ Input Signals
PESQ requires two inputs: the original test signal and the degraded version that has been passed through the distorting system. In addition, the model needs to know the sampling rate of these files, which may be either 8 or 16 kHz, and the test signal should be speech-like and at the correct level.

PESQ Operations
Figure C-2 describes the PESQ processing flow:


Figure C-2: PESQ processing elements


The model includes the following stages:
• Level alignment
The reference speech signal and the degraded signal are aligned to the same constant power level in order to compare the signals. This is necessary because the reference signal does not have to be at a defined level, and because the amplification of the system under test is unknown before testing.
• Input filtering
The send path of a telephone handset usually filters speech with a characteristic similar to the standard modified Intermediate Reference System (IRS) transmit characteristic [49]. It is generally accepted that this has less effect on quality than coding distortions do.
• Time alignment
The system may include a delay, which may be variable. In order to compare the reference and degraded signals, they need to be lined up with each other. PESQ applies voice activity detection to identify those parts of the signal that are speech, ignoring noise. Time alignment is then done in three stages: alignment of the overall speech signals (utterances), alignment of overlapping sections of speech (frames), and re-alignment of "bad intervals" indicated by the subsequent auditory transform block.
• Auditory transform
Both signals are passed through an auditory transform that mimics certain key properties of human hearing, so that the reference and degraded signals are compared the way a listener would have heard them. This gives a representation in time and frequency of the perceived loudness of the signal, known as the sensation surface.
• Disturbance processing
The difference between the sensation surfaces for the reference and degraded files is known as the error surface. It represents any audible differences introduced by the system under test. The error surface is analyzed by a process that takes account of the amplitude masking effect, whereby small distortions in a signal are inaudible in the presence of loud signals. Two disturbance parameters are calculated: the absolute (symmetric) disturbance, representing absolute audible error, and the additive (asymmetric) disturbance, representing audible errors that are significantly louder than the reference.

Finally, the error parameters are converted to a quality score, which is a linear combination of the average symmetric disturbance value and the average asymmetric disturbance value.


Appendix D. NIST Net Emulation Tool

NIST Net [94, 96] is considered one of the most popular network emulators and is widely used for the evaluation of VoIP applications. NIST Net is a specialized router that emulates (statistically) the behavior of an entire network in a single hop. It selectively applies network effects (such as delay, loss and jitter) to traffic passing through it, based on user-supplied settings. NIST Net works through a table of emulator entries. Each entry consists of three parts:

• Specification of packets matched by this entry: it covers most fields in IP and higher protocol headers: source and destination addresses, higher-level protocol identifier, class/type of service field, source and destination ports, etc • Effects to be applied to matching packets: packet delay, jitter, packet reordering and loss (either random or congestion-dependent), packet duplication and bandwidth limitations • Statistics to keep about the packets matching this entry

Thousands of these emulator entries can be loaded at a time, each with individual network effects specified.


Appendix E. Segmentation Based on Sinusoidal Modeling of Speech

The sinusoidal modeling of speech can be used for estimating a stationary speech segment. However, due to its complexity, this method is feasible only in research settings rather than in real-time applications. Figure E-1 describes the procedure used for segmentation based on sinusoidal modeling of speech.

Figure E-1: Segmentation block diagram

The procedure for segmentation consists of:
• History Normalizing
First, the signal in the history buffer is normalized. Due to its length (3 frames), the signal is divided into frames of the normal frame size $N_f$ with 50% overlap, and each frame is then normalized separately, based on least squares estimation. The output of the normalization algorithm is denoted as

$$\tilde{\mathbf{S}}_{h,i}^{N_h} = \left[ \tilde{S}_{h,i}(n) \right]^T = \left[ S_{h,i}(n) \cdot \frac{1}{\sqrt{\hat{E}_{h,i}^{N_h}(n)}} \right]^T, \qquad n = [0, \ldots, N_h - 1] \tag{E.1}$$

where $\hat{\mathbf{E}}_{h,i}^{N_h}$ is the history energy estimation of $\mathbf{S}_{h,i}^{N_h}$. The notation for estimating the energy vector $\hat{\mathbf{E}}_{h,i}^{N_h}$ based on OLA smoothing is given later in this appendix.
• Initializing Speech Segment
This procedure can be based on a pitch detector or any other parameters,

e.g., ZCR, $R_1$, etc. The Initial Speech Segment (ISS) is denoted $\mathbf{s}_{iss}^{N_{iss}}$, where $N_{iss}$ is the length of the ISS.
• SM Analysis
Sinusoidal model analysis is performed on the initial segment, producing the sinusoidal model $M_{iss}$.
• SM Extrapolation
The sinusoidal model is then forward-extrapolated and updated in terms of phase integration, producing $\tilde{M}_{iss}$.
• SM Matching
Sinusoidal model matching is done in terms of phase offset to match the previous frame, producing $\bar{M}_{iss}$.
• SM Synthesis
Sinusoidal generation of the synthetic speech is then performed, producing the history estimate denoted $\hat{\mathbf{S}}_{h,i}^{N_h}$.
• Speech Segment Selection
The output of the segmentation procedure, the stationary speech segment, is denoted as

$$\mathbf{s}_{sss}^{N_{sss}} = \left[ \tilde{S}_{h,i}(n) \right]^T, \qquad n = [N_{Init}, \ldots, N_h - 1] \tag{E.2}$$

where $N_{Init}$ is calculated using the error vector, denoted as

$$\mathbf{e}_{h,i}^{N_h} = \tilde{\mathbf{S}}_{h,i}^{N_h} - \hat{\mathbf{S}}_{h,i}^{N_h} \tag{E.3}$$

Examples of the proposed segmentation procedure are given in Figure E-2 below:

[Figure E-2 panels: four examples of the normalized history vs. its sinusoidal (normalized history) estimate, with selected stationary segments of 261, 291, 301 and 191 samples.]
Figure E-2: Examples of the proposed segmentation method

Normalizing based on OLA smoothing

After calculating the energy estimations for all frames, they are overlap-added to provide a smooth transition between energies, as described in Figure E-3.

The estimated energies are denoted as $\hat{E}_{sub,j}^{N_f}\left(n_j\right)$, $j = 1,\ldots,5$, where

$$n_{j=1,\ldots,5} = \left[ 1 + (j-1)\frac{N_f}{2}, \ldots, (j-1)\frac{N_f}{2} + N_f \right]^T \tag{E.4}$$

Figure E-3: Normalization based on OLA smoothing (illustration)

The symbols $\left(n_j^1, n_j^2, n_j^*\right)$ that appear in the figure denote the first half of the frame, the second half of the frame, and the OLA indexes, respectively:

$$n_j^1 = \left[ 1 + (j-1)\frac{N_f}{2}, \ldots, (j-1)\frac{N_f}{2} + \frac{N_f}{2} \right]^T$$
$$n_j^2 = \left[ 1 + (j-1)\frac{N_f}{2} + \frac{N_f}{2}, \ldots, (j-1)\frac{N_f}{2} + N_f \right]^T \tag{E.5}$$
$$n_j^* = \left[ 1 + (j-1)\frac{N_f}{2} + \frac{N_f}{2}, \ldots, (j-1)\frac{N_f}{2} + N_f \right]^T$$

The resulting history energy estimation is:

$$\hat{\mathbf{E}}_{h,i}^{N_h} = \begin{bmatrix} \hat{E}_{sub,1}^{N_f}\left(n_1^1\right) \\ \hat{E}_{sub,j}^{N_f}\left(n_j^*\right) \\ \hat{E}_{sub,5}^{N_f}\left(n_5^2\right) \end{bmatrix} \tag{E.6}$$
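For illustration, this OLA smoothing can be sketched as a chain of linear crossfades between neighbouring half-frames; the linear fade shape and the function name are our assumptions.

```python
import numpy as np

def ola_energy(e_frames, n_f=160):
    """Merge per-frame linear energy estimates (50% overlap) into one smooth
    history-energy vector (Eqs. E.4-E.6): the first and last half-frames are
    copied as-is, and each overlap region is linearly cross-faded."""
    hop = n_f // 2
    fade = np.arange(hop) / hop                    # 0 .. 1 ramp
    out = [e_frames[0][:hop]]                      # first half-frame
    for e_cur, e_next in zip(e_frames[:-1], e_frames[1:]):
        out.append(e_cur[hop:] * (1.0 - fade) + e_next[:hop] * fade)
    out.append(e_frames[-1][hop:])                 # last half-frame
    return np.concatenate(out)
```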


References

[1] B. M. Leiner, V. G. Cerf, D. D. Clark, R. E. Kahn, L. Kleinrock, D. C. Lynch, J. Postel, L. G. Roberts, and S. Wolff, A Brief History of the Internet, Version 3.32, Internet Society, 2003. http://www.isoc.org/internet/history/brief.shtml [2] D. Collins, Carrier Grade Voice over IP, McGraw-Hill Networking, New York, 2001. [3] J. Davidson, J. Peters, and B. Gracely, Voice over IP Fundamentals, Cisco Systems, March 2000. [4] C. Perkins, O. Hodson, and V. Hardman, “A Survey of Packet Loss Recovery Techniques for Streaming Audio”, In Proceedings of the IEEE Network Magazine, vol. 12, no. 5, pp. 40-48, September-October 1998. [5] _____, HD VoIP Sounds Better, Brief Introduction, AudioCodes Ltd., March 2009. [6] The Internet Engineering Task Force, Internet Society. http://www.ietf.org [7] M. Handley, H. Schulzrinne, E. Schooler, and J. Rosenberg, “SIP: Session Initiation Protocol”, RFC 2543, IETF, March 1999. ftp://ftp.ietf.org/rfc/rfc2543.txt [8] ITU-T Recommendation H.323: Packet-based multimedia communications systems, June 2006. [9] Y. Gil, Packet Loss Concealment For Voice Applications, M.Sc. Thesis, Ben- Gurion University of the Negev, Faculty of Engineering Science, February 2009. [10] H. Schulzrinne, S. Casner, R. Frederick, and V. Jacobson, “RTP: A Transport Protocol for Real-Time Applications”, RFC 1889, IETF, January 1996. ftp://ftp.ietf.org/rfc/rfc1889.txt [11] N. S. Jayant, “High Quality Networking of Audio-Visual Information”, IEEE Communications Magazine, pp. 84-95, September 1993. [12] C. Perkins and O. Hodson, “Options for the Repair of Streaming Media”, RFC 2354, IETF, June 1998.


ftp://ftp.ietf.org/rfc/rfc2354.txt [13] G. Carle and E. W. Biersack, “Survey of Error Recovery Techniques for IP- based Audio-Visual Multicast Applications”, IEEE Network Magazine, vol. 11, no. 6, pp. 24-36, 1997. [14] Jacob Benesty, M. Mohan Sondhi, Yiteng Huang, Springer Handbook of Speech Processing, Springer-Verlag Berlin Heidelberg 2008. [15] _____, Speech Coding and Speech Quality in IP Telephony, Global IP Sound, Inc., 2001. [16] A. Kos, B. Klepec, and S. Tomazic, “Techniques for Performance Improvement of VoIP Applications”, In IEEE Melecon 2002, Cairo Egypt, May 2002. [17] A. Spanias, “Speech Coding: A Tutorial Review”, In Proceedings of the IEEE, vol. 82, pp. 1541-1582, October 1994. [18] ITU-T Recommendation G.711: Pulse Code Modulation (PCM) of voice frequencies, November 1988. [19] ITU-T Recommendation G.711.1: Wideband embedded extension for G.711 pulse code modulation, March 2008. [20] ITU-T Recommendation G.726: 40, 32, 24, 16 Kbit/s Adaptive Differential Pulse Code Modulation (ADPCM), December 1990. [21] ITU-T Recommendation G.722: 7KHz audio-coding within 64Kbit/s, November 1988. [22] ITU-T Recommendation G.729: Coding of speech at 8Kbit/s using Conjugate- Structure Algebraic-Code-Excited Linear Prediction (CS-ACELP), March 1996. [23] ITU-T Recommendation G.729.1: G.729-based embedded variable bit-rate coder: An 8-32 Kbit/s scalable wideband coder bitstream interoperable with G.729, May 2006. [24] ITU-T Recommendation G.722.2: Wideband coding of speech at around 16 Kbit/s using Adaptive Multi-rate Wideband (AMR-WB), January 2002. [25] R. J. McAulay and T. F. Quatieri, “Speech Analysis/Synthesis Based on a Sinusoidal Representation”, In Proceedings of the IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 34, pp. 744-754, August 1986. [26] R. J. McAulay and T. F. Quatieri, “Mid-rate Coding based on a Sinusoidal Representation of Speech”, In Proceeding of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 945, Tampa, FL, 1985.


[27] C. O. Etemoglu, V. Cuperman and A. Gersho, “Speech Coding with an Analysis-by-Synthesis Sinusoidal Model, In Proceedings of IEEE Intenational Conference on Acoustic, Speech and Signal Processing, vol. 3, pp 1371-1374, Istanbul, August 2002. [28] _____, Audio Quality Terminology, AVAYA Inc., 2005. [29] V. Paxson, “End-to-End Internet Packet Dynamics”, In IEEE/ACM Transaction on Networking, vol. 7, pp. 277-292, June 1999. [30] J. C. Bolot and A. V. Garcia, “Control Mechanisms for Packet Audio in the Internet”, In Proceedings IEEE INFOCOM, pp. 232-239, San Francisco, CA, April 1996. [31] J. C. Bolot, “Characterizing End-to-End Packet Delay and Loss in the Internet”, Journal of High-Speed Networks, vol. 2, pp. 305-323, 1993. [32] ITU-T Recommendation P.800: Methods for subjective determination of transmission quality, August 1996. [33] A. W. Rix and M. P. Hollier, “The Perceptual Analysis Measurement System for Robust End-to-End Speech Quality Assessment”, In Proceeding of the IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 3, pp. 1515-1518, Istanbul, June 2000. [34] ITU-T Recommendation P.861: Objective quality measurement of telephone- band (300-3400 Hz) speech codecs, February 1998. [35] ITU-T Recommendation P.862: Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs, February 2001. [36] ITU-T Recommendation: P.862.2: Wideband extension to Recommendation P.862 for the assessment of wideband telephone networks and speech codecs, November 2005. [37] The Next-Generation Mobile Voice Quality Testing Standard. http://www.polqa.info/ [38] ITU-T Recommendation P.563: Single-ended method for objective speech quality assessment in narrow-band telephony applications, May 2004. [39] D. Kim, “ANIQUE: An Auditory Model for Single-ended Speech Quality Estimation”, In Proceedings of the IEEE Transaction on Speech and Audio Processing, vol. 13, pp. 821- 831, September 2005. [40] D. Kim and M. Tarraf, “Enhanced Perceptual Model for Non-intrusive Speech Quality Assessment”, In Proceeding of the IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 1, pp. 829-832, 2006.


[41] D. Kim and A. Tarraf, “ANIQUE+: A New American National Standard for Non-Intrusive Estimation of Narrowband Speech Quality”, Bell Labs Technical Journal Volume 12, May 2007. [42] T. H. Falk and W.Y. Chan, “Enhanced Non-Intrusive Speech Quality Measurement Using Degradation Models”, In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, May 2006. [43] V. Grancharov, D. Y. Zhao, J. Lindblom and W. B. Kleijn, “Low Complexity, Non-Intrusive Speech Quality Assessment”, In Proceedings of the IEEE Audio, Speech and Language Processing, vol. 14, pp. 1948-1956, November 2006. [44] T. H. Falk, H. Yuan and W.Y. Chan, “Single-Ended Quality Measurement of Noise Suppressed Speech Based on Kullback-Leibler Distances”, Journal of Multimedia, September 2007. [45] ITU-T Recommendation G.107: The E-model, a computational model for use in transmission planning, March 2005. [46] ITU-T Recommendation G.107: The E-model, a computational model for use in transmission planning, PREPUBLISHED RECOMMENDATION, August 2008. [47] ITU-T Recommendation P.800.1: Mean Opinion Score (MOS) terminology, 2006. [48] ITU-T Recommendation P.862.1: Mapping function for transforming P.862 raw result scores to MOS-LQO, November 2003. [49] ITU-T Recommendation P.830: Subjective performance assessment of telephone-band and wideband digital codecs, February 1996. [50] C. Perkins, I. Kouvelas, O. Hodson, V. Hardman, M. Handley, J. Bolot, A. Vega-Garcia and S. Fosse-Parisis, “RTP Payload for Redundant Audio Data”, RFC 2198, IETF, September 1997. ftp://ftp.ietf.org/rfc/rfc2198.txt [51] J. Rosenberg and H. Schulzrinne, “An RTP Payload Format for Generic Forward Error Correction”, RFC 2733, IETF, December 1999. ftp://ftp.ietf.org/rfc/rfc2733.txt [52] J. Gruber, L. Strawczynski, “Subjective Effects of Variable Delay in Speech Clipping in Dynamically Managed Voice Systems”, In Proceedings of the IEEE Transaction on COM-33, 1985.


[53] D. J. Goodman, G. B. Lockhart, O. J. Wasem and W. Wong, “Waveform Substitution Techniques for Recovering Missing Speech Segments in Packet Voice Communications”, In IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 34, pp. 1440-1448, December 1986.
[54] D. J. Goodman, G. Jaffe, G. B. Lockhart and W. C. Wong, “Waveform Substitution Techniques for Recovering Missing Speech Segments in Packet Voice Communications”, In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 11, pp. 105-108, April 1986.
[55] D. J. Goodman, O. J. Wasem, C. A. Dvorak and H. G. Page, “The Effect of Waveform Substitution on the Quality of PCM Packet Communications”, In IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 36, pp. 342-348, March 1988.
[56] R. Valenzuela and C. Animalu, “A New Voice Packet Reconstruction Technique”, In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 1334-1336, May 1989.
[57] ITU-T Recommendation G.711 Appendix I: A high quality low-complexity algorithm for packet loss concealment with G.711, September 1999.
[58] ITU-T Recommendation G.722 Appendix III: A high-quality packet loss concealment algorithm for G.722, November 2006.
[59] W. Verhelst and M. Roelands, “An Overlap-Add Technique Based on Waveform Similarity (WSOLA) for High Quality Time-Scale Modification of Speech”, In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 554-557, April 1993.
[60] J. Makhoul, “Linear Prediction: A Tutorial Review”, In Proceedings of the IEEE, vol. 63, pp. 561-580, April 1975.
[61] E. Gündüzhan and K. Momtahan, “A Linear Prediction Based Packet Loss Concealment Algorithm for PCM Coded Speech”, In IEEE Transactions on Speech and Audio Processing, vol. 9, pp. 778-785, November 2001.
[62] J. R. Deller, Discrete-Time Processing of Speech Signals, Prentice Hall, Englewood Cliffs, NJ, 1993.
[63] L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals, Prentice-Hall Signal Processing Series, Englewood Cliffs, NJ, 1978.
[64] ITU-T Recommendation G.722 Appendix IV: A low-complexity algorithm for packet loss concealment with G.722, July 2007.


[65] J. Lindblom and P. Hedelin, “Packet Loss Concealment Based on Sinusoidal Extrapolation”, In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 1, pp. 173-176, May 2002.
[66] J. Lindblom and P. Hedelin, “Packet Loss Concealment Based on Sinusoidal Modeling”, In Proceedings of the IEEE Workshop on Speech Coding, pp. 65-67, October 2002.
[67] C. Li, A. Gersho and V. Cuperman, “Analysis-by-Synthesis Low-Rate Multimode Harmonic Speech Coding”, In Proceedings of the Sixth European Conference on Speech Communication and Technology, September 1999.
[68] T. Quatieri and R. McAulay, “Speech Transformations Based on a Sinusoidal Representation”, In IEEE Transactions on Acoustics, Speech and Signal Processing, vol. ASSP-34, no. 6, pp. 1449-1464, December 1986.
[69] T. F. Quatieri and R. J. McAulay, “Shape Invariant Time-Scale and Pitch Modification of Speech”, In IEEE Transactions on Signal Processing, vol. 40, no. 3, pp. 497-510, March 1992.
[70] T. F. Quatieri, R. B. Dunn and T. E. Hanna, “Time-Scale Modification of Complex Acoustic Signals”, In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 1993.
[71] E. B. George and M. J. T. Smith, “Speech Analysis/Synthesis and Modification Using an Analysis-by-Synthesis/Overlap-Add Sinusoidal Model”, In IEEE Transactions on Speech and Audio Processing, vol. 5, pp. 389-406, September 1997.
[72] M. W. Macon and M. A. Clements, “Sinusoidal Modeling and Modification of Unvoiced Speech”, In IEEE Transactions on Speech and Audio Processing, vol. 5, pp. 557-560, November 1997.
[73] E. B. George, An Analysis-by-Synthesis Approach to Sinusoidal Modeling Applied to Speech and Music Signal Processing, Ph.D. Thesis, Georgia Institute of Technology, November 1991.
[74] X. Serra, “Musical Sound Modeling with Sinusoids Plus Noise”, In Musical Signal Processing, Swets and Zeitlinger Publishers, 1997.
[75] S. N. Levine, T. S. Verma and J. O. Smith III, “Multiresolution Sinusoidal Modeling for Wideband Audio with Modifications”, In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 6, pp. 3585-3588, May 1998.
[76] S. N. Levine and J. O. Smith III, “A Switched Parametric & Transform Audio Coder”, In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 985-988, May 1999.


[77] M. W. Macon, Speech and Voice Synthesis Based on Sinusoidal Modeling, Ph.D. Thesis, Georgia Institute of Technology, October 1996.
[78] T. F. Quatieri and R. G. Danisewicz, “An Approach to Co-channel Talker Interference Suppression Using a Sinusoidal Model for Speech”, In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 565-568, April 1988.
[79] J. M. Kates, “Speech Enhancement Based on a Sinusoidal Model”, In Journal of Speech and Hearing Research, vol. 37, pp. 449-464, April 1994.
[80] M. W. Macon and M. A. Clements, “Speech Synthesis Based on an Overlap-Add Sinusoidal Model”, In Journal of the Acoustical Society of America, vol. 97, no. 5, p. 3246, May 1995.
[81] D. Chazan, R. Hoory, A. Sagi, S. Shechtman, A. Sorin, Z. W. Shuang and R. Bakis, “High Quality Sinusoidal Modeling of Wideband Speech for the Purposes of Speech Synthesis and Modification”, In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 1, May 2006.
[82] M. Goodwin and M. Vetterli, “Time-Frequency Signal Models for Music Analysis, Transformation, and Synthesis”, In Proceedings of the IEEE-SP International Symposium on Time-Frequency and Time-Scale Analysis, pp. 133-136, June 1996.
[83] J. C. Rutledge and M. A. Clements, “Compensation for Recruitment of Loudness in Sensorineural Hearing Impairments Using a Sinusoidal Model of Speech”, In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 3641-3644, April 1991.
[84] K. Nakamura and K. Funaki, “An Improvement of G.711 PLC Using Sinusoidal Model”, In Proceedings of the International Conference on Computer as a Tool (EUROCON), vol. 3, pp. 1670-1673, November 2005.
[85] L. B. Almeida and F. M. Silva, “Variable-Frequency Synthesis: An Improved Harmonic Coding Scheme”, In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 437-440, March 1984.
[86] J. O. Smith III and X. Serra, “PARSHL: An Analysis/Synthesis Program for Non-harmonic Sounds Based on a Sinusoidal Representation”, In Proceedings of the International Computer Music Conference, Stanford, California, 1987.
[87] R. C. Maher, An Approach for the Separation of Voices in Composite Musical Signals, Ph.D. Thesis, Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, 1989.


[88] X. Serra, A System for Sound Analysis/Transformation/Synthesis Based on a Deterministic plus Stochastic Decomposition, Ph.D. Thesis, Stanford University, October 1989.
[89] J. Lindblom, “A Sinusoidal Voice over Packet Coder Tailored for the Frame-Erasure Channel”, In IEEE Transactions on Speech and Audio Processing, vol. 13, pp. 787-798, September 2005.
[90] T. F. Quatieri, Discrete-Time Speech Signal Processing, Prentice-Hall, New Jersey, 2002.
[91] F. J. Harris, “On the Use of Windows for Harmonic Analysis with the Discrete Fourier Transform”, In Proceedings of the IEEE, vol. 66, pp. 51-83, 1978.
[92] H. Van Trees, Detection, Estimation and Modulation Theory, Part I: Detection, Estimation, and Linear Modulation Theory, John Wiley & Sons, Inc., 2001.
[93] J. D. Markel, “The SIFT Algorithm for Fundamental Frequency Estimation”, In IEEE Transactions on Audio and Electroacoustics, vol. 20, pp. 367-377, December 1972.
[94] National Institute of Standards and Technology, The NIST Net Network Emulator. http://www-x.antd.nist.gov/nistnet/
[95] ITU-T Recommendation Series P Supplement 23: ITU-T coded-speech database, February 1998.
[96] M. Carson and D. Santay, “NIST Net: A Linux-Based Network Emulation Tool”, In ACM SIGCOMM Computer Communication Review, vol. 33, pp. 111-126, July 2003.


Abstract (Hebrew original)

Telephone calls over the Internet (VoIP) have become popular in recent years. Since the Internet provides no quality-of-service mechanism, some packets are lost or delayed so significantly that they cannot be played out in interactive applications such as telephone calls.

Packet loss is a fundamental problem in Internet-based telephony networks; as a result, voice quality and intelligibility deteriorate as the packet loss rate rises. Audible disturbances can also be noticed at the transitions from the normal (loss-free) state to a state in which packets are lost, and back. There is therefore considerable interest in developing algorithms that compensate for lost speech packets so as to reduce these effects.

This work presents a new method for concealing lost speech packets in wideband (50-7000 Hz) applications that is independent of the speech codec, i.e. receiver (decoder) based. The proposed method rests on representing the speech signal with a sinusoidal model, in which the signal is represented by a number of sinusoidal generators with different amplitudes, frequencies and phases.
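For concreteness, such a sinusoidal representation can be written in its standard McAulay-Quatieri form as a finite sum of sinusoids. This is a minimal sketch; the number of components K and the symbols below are illustrative rather than the exact notation of the thesis:

    s[n] = \sum_{k=1}^{K} A_k \cos(\omega_k n + \varphi_k)

where A_k, \omega_k and \varphi_k are the amplitude, frequency and phase controlling the k-th sinusoidal generator.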

The main idea is to exploit the inherent continuity of the sinusoidal model and the possibility of performing relatively simple interpolation or extrapolation on the model parameters.

In the event of a loss, the synthetic speech signal is generated by estimating a sinusoidal model from past samples, saved from packets received before the loss of the packet or packets, and/or from future samples taken from packets that are scheduled for playback and have not yet been played out, but have already arrived at the receiver.
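As a simplified illustration of this parametric reconstruction (the linear rule below is a sketch under the assumption that sinusoidal components have already been matched across the gap; it is not the exact interpolation procedure of the thesis): if component k is matched between the last received frame m-1 and an already-received future frame m+1, its parameters for the lost frame m may be estimated as

    \hat{A}_k^{(m)} = \frac{1}{2}\left(A_k^{(m-1)} + A_k^{(m+1)}\right), \qquad \hat{\omega}_k^{(m)} = \frac{1}{2}\left(\omega_k^{(m-1)} + \omega_k^{(m+1)}\right)

with the phase advanced from \varphi_k^{(m-1)} at the interpolated frequency so that the waveform remains continuous at the frame boundaries. When no future packet is available, the same parameters are extrapolated from past frames alone.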

The proposed model was evaluated against two standardized speech packet loss concealment algorithms, the ITU-T G.722 Appendix III and Appendix IV PLCs, in several configurations, within a test environment comprising an objective analytical measure based on a perceptual model of voice quality (PESQ), and using a large standardized database, ITU-T P-Series Sup. 23, to obtain statistically meaningful results. The results show that the proposed method outperforms the standardized algorithms, with the performance gain maintained across all packet loss rates.

BEN-GURION UNIVERSITY OF THE NEGEV FACULTY OF ENGINEERING SCIENCE

DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING

Sinusoidal Model Based Packet Loss Concealment for Wideband Speech Applications

THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE M.Sc DEGREE IN ENGINEERING

By: Dmitry Lihovetski

Supervised by: Prof. Ilan D. Shallom, Prof. Dov Wulich

Author's signature: ………………..……………….. Date: ……………….

Supervisor: ………………..……………….. Date: ……………….

Supervisor: ………………..……………….. Date: ……………….

Approval of the Chairman of the Departmental Graduate Studies Committee: ………………..………………..

Date: ……………….

January 2011 (Shevat 5771)

BEN-GURION UNIVERSITY OF THE NEGEV FACULTY OF ENGINEERING SCIENCE

DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING

Sinusoidal Model Based Packet Loss Concealment for Wideband Speech Applications

THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE M.Sc DEGREE IN ENGINEERING

By: Dmitry Lihovetski

January 2011 (Shevat 5771)