Lapped Transforms in Perceptual Coding of Wideband Audio

Sien Ruan
Department of Electrical & Computer Engineering
McGill University
Montreal, Canada
December 2004

A thesis submitted to McGill University in partial fulfillment of the requirements for the degree of Master of Engineering.

© 2004 Sien Ruan

To my beloved parents

Abstract

Audio coding paradigms depend on time-frequency transformations to remove statistical redundancy in audio signals and reduce the data bit rate, while maintaining high fidelity of the reconstructed signal. Sophisticated perceptual audio coding further exploits perceptual redundancy in audio signals by incorporating perceptual masking phenomena. This thesis focuses on the investigation of different coding transformations that can be used to compute perceptual distortion measures effectively; among them the lapped transform, which is the most widely used in today's audio coders. Moreover, an innovative lapped transform is developed that can vary the overlap percentage to an arbitrary degree. The new lapped transform is applicable to transient audio, capturing the time-varying characteristics of the signal.

Sommaire

Audio coding paradigms depend on time-frequency transformations to remove the statistical redundancy in audio signals and to reduce the data transmission rate, while maintaining high fidelity of the reconstructed signal. Sophisticated perceptual audio coding further exploits the perceptual redundancy in audio signals by incorporating perceptual masking phenomena. This thesis focuses on the study of the different coding transformations that can be used to compute perceptual distortion measures effectively, among them the lapped transform, which is the most widely used in today's audio coders. Furthermore, an innovative lapped transform is developed that can vary the overlap percentage to arbitrary degrees. The new lapped transform is applicable to transient audio, capturing the time-varying characteristics of the signal.

Acknowledgments

I would like to acknowledge my supervisor, Prof. Peter Kabal, for his support and guidance throughout my graduate studies at McGill University. Prof. Kabal's kind treatment of his students is highly appreciated. I would also like to thank Ricky Der for working with me and advising me throughout the work. My thanks go to my fellow TSP graduate students for their close friendship, especially Alexander M. Wyglinski for his various technical assistance. I am sincerely indebted to my parents for all the encouragement they have given me. They are the reason for who I am today. To my mother, Mrs. Dejun Zhao, and my father, Mr. Liwu Ruan, thank you.

Contents

1 Introduction
  1.1 Audio Coding Techniques
    1.1.1 Parametric Coders
    1.1.2 Waveform Coders
  1.2 Time-to-Frequency Transformations
  1.3 Thesis Contributions
  1.4 Thesis Synopsis
2 Perceptual Audio Coding: Psychoacoustic Audio Compression
  2.1 Human Auditory Masking
    2.1.1 Hearing System
    2.1.2 Perception of Loudness
    2.1.3 Critical Bands
    2.1.4 Masking Phenomena
  2.2 Example Perceptual Model: Johnston's Model
    2.2.1 Loudness Normalization
    2.2.2 Masking Threshold Calculation
    2.2.3 Perceptual Entropy
  2.3 Perceptual Audio Coder Structure
    2.3.1 Time-to-Frequency Transformation
    2.3.2 Psychoacoustic Analysis
    2.3.3 Adaptive Bit Allocation
    2.3.4 Quantization
    2.3.5 Bitstream Formatting
3 Signal Decomposition with Lapped Transforms
  3.1 Block Transforms
  3.2 Lapped Transforms
    3.2.1 LT Orthogonal Constraints
  3.3 Filter Banks: Subband Signal Processing
    3.3.1 Perfect Reconstruction Conditions
    3.3.2 Filter Bank Representation of the LT
  3.4 Modulated Lapped Transforms
    3.4.1 Perfect Reconstruction Conditions
  3.5 Adaptive Filter Banks
    3.5.1 Window Switching with Perfect Reconstruction
4 MP3 and AAC Filter Banks
  4.1 Time-to-Frequency Transformations of MP3 and AAC
    4.1.1 MP3 Transformation: Hybrid Filter Bank
    4.1.2 AAC Transformation: Pure MDCT Filter Bank
  4.2 Performance Evaluation
    4.2.1 Full Coder Description
    4.2.2 Audio Quality Measurements
    4.2.3 Experiment Results
  4.3 Psychoacoustic Transforms of DFT and MDCT
    4.3.1 Inherent Mismatch Problem
    4.3.2 Experiment Results
5 Partially Overlapped Lapped Transforms
  5.1 Motivation of Partially Overlapped LT: NMR Distortion
  5.2 Construction of Partially Overlapped LT
    5.2.1 MLT as DST via Pre- and Post-Filtering
    5.2.2 Smaller Overlap Solution
  5.3 Performance Evaluation
    5.3.1 Pre-echo Mitigation
    5.3.2 Optimal Overlapping Point for Transient Audio
6 Conclusion
  6.1 Thesis Summary
  6.2 Future Research Directions
A Greedy Algorithm and Entropy Computation
  A.1 Greedy Algorithm
  A.2 Entropy Computation

List of Figures

2.1 Absolute threshold of hearing for normal listeners.
2.2 Generic perceptual audio encoder.
2.3 Sine MDCT window (576 points).
3.1 General signal processing system using the lapped transform.
3.2 Signal processing with a lapped transform with L = 2M.
3.3 Typical subband processing system, using the filter bank.
3.4 Magnitude frequency response of a MLT (M = 10).
4.1 MPEG-1 Layer III decomposition structure.
4.2 Layer III prototype filter (b) and the original window (a).
4.3 Magnitude response of the lowpass filter.
4.4 Magnitude response of the polyphase filter bank (M = 32).
4.5 Switching from a long sine window to a short one via a start window.
4.6 Layer III aliasing-butterfly, encoder/decoder.
4.7 Layer III aliasing reduction encoder/decoder diagram.
4.8 Block diagram of the encoder of the full audio coder.
4.9 Frequency response of the MDCT basis function h_k(n), M = 4.
5.1 Flowgraph of the Modified Discrete Cosine Transform.
5.2 Flowgraph of MDCT as block DST via butterfly pre-filtering.
5.3 Global viewpoint of MDCT as pre-filtering at DST block boundaries.
5.4 Pre-DST lapped transforms at arbitrary overlaps (L < 2M).
5.5 Post-DST lapped transforms at arbitrary overlaps (L < 2M).
5.6 Partially overlapped Pre-DST example showing pre-echo mitigation for sound files of castanets.

List of Tables

2.1 Critical bands measured by Scharf.
4.1 MOS is a number mapping to the above subjective quality.
4.2 Subjective listening tests: Hybrid filter bank (Hybrid) vs. Pure MDCT filter bank (Pure).
4.3 PESQ MOS values: Hybrid filter bank (Hybrid) vs. Pure MDCT filter bank (Pure).
4.4 PESQ MOS values: DFT spectrum (DFT) vs. MDCT spectrum (MDCT).
5.1 Subjective listening tests of Pre-DST coded test files of castanets.

List of Terms

AAC       MPEG-2 Advanced Audio Coding
ADPCM     Adaptive Differential Pulse Code Modulation
CELP      Code Excited Linear Prediction
DCT       Discrete Cosine Transform
DFT       Discrete Fourier Transform
DPCM      Differential Pulse Code Modulation
DST       Discrete Sine Transform
EBU-SQAM  European Broadcasting Union - Sound Quality Assessment Material
ERB       Equivalent Rectangular Bandwidth
FIR       Finite Impulse Response
IMDCT     Inverse Modified Discrete Cosine Transform
ITU       International Telecommunication Union
MDCT      Modified Discrete Cosine Transform
MDST      Modified Discrete Sine Transform
MLT       Modulated Lapped Transform
MOS       Mean Opinion Score
MPEG      Moving Picture Experts Group
MP3       MPEG-1 Layer III
PCM       Pulse Code Modulation
NMN       Noise-Masking-Noise
NMR       Noise-to-Masking Ratio
NMT       Noise-Masking-Tone
LOT       Lapped Orthogonal Transform
LT        Lapped Transform
QMF       Quadrature Mirror Filter
PE        Perceptual Entropy
PEAQ      Perceptual Evaluation of Audio Quality
PESQ      Perceptual Evaluation of Speech Quality
PR        Perfect Reconstruction
Pre-DST   Pre-filtered Discrete Sine Transform
SFM       Spectral Flatness Measure
SMR       Signal-to-Masking Ratio
SNR       Signal-to-Noise Ratio
SPL       Sound Pressure Level
TDAC      Time-Domain Aliasing Cancellation
TMN       Tone-Masking-Noise
TNS       Temporal Noise Shaping
VQ        Vector Quantization

Chapter 1
Introduction

1.1 Audio Coding Techniques

Audio coding algorithms are concerned with the digital representation of sound using information bits. A number of paradigms have been proposed for the digital compression of audio signals. Roughly, audio coders can be grouped as either parametric coders or waveform coders. The concept of perceptual audio coding is relevant in the latter case, where auditory perception characteristics are applicable [1].

1.1.1 Parametric Coders

Parametric coders represent the source of the signal with a few parameters. Such coders are suitable for speech signals
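As a rough illustration of representing a signal with a few parameters, the sketch below is not part of the thesis; the frame length of 400 samples, the predictor order of 10, the toy sinusoid-plus-noise "voiced" frame, and the helper name lpc_coefficients are assumptions chosen for illustration. It fits linear-prediction coefficients to one frame by the autocorrelation (Yule-Walker) method and measures how much short-term redundancy the small parameter set removes.

```python
import numpy as np

def lpc_coefficients(frame, order=10):
    """Fit predictor coefficients a_1..a_p by the autocorrelation (Yule-Walker) method."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1 : len(frame) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])  # Toeplitz matrix of lags 0..p-1
    return np.linalg.solve(R, r[1 : order + 1])

# A tonal frame plus a little noise stands in for a short segment of voiced speech.
rng = np.random.default_rng(1)
frame = np.sin(0.3 * np.arange(400)) + 0.05 * rng.standard_normal(400)

a = lpc_coefficients(frame, order=10)
predicted = np.zeros_like(frame)
for lag, coeff in enumerate(a, start=1):
    predicted[lag:] += coeff * frame[:-lag]      # x_hat(n) = sum_r a_r x(n - r)
residual = frame - predicted

# Ten coefficients capture most of the frame: the residual carries far less energy.
print(f"prediction gain: {np.var(frame) / np.var(residual):.1f}x")
```

In an LPC-style parametric coder, only such predictor coefficients, a gain, and a coarse description of the excitation are transmitted, which is why these coders suit speech at very low bit rates.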
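The lapped transform highlighted in the abstract and in Chapters 3-5 is, in its most common form, the sine-windowed MDCT with 50% overlap. The following sketch is not taken from the thesis; the block size M = 256, the test signal, and the function names are illustrative assumptions. It implements a direct matrix form of the windowed MDCT/IMDCT pair with overlap-add and checks that time-domain aliasing cancellation (TDAC) gives perfect reconstruction away from the signal boundaries.

```python
import numpy as np

def mdct_basis(M):
    # Cosine basis: M coefficients from a block of L = 2M windowed samples.
    n = np.arange(2 * M)
    k = np.arange(M)
    return np.cos(np.pi / M * (n[None, :] + 0.5 + M / 2) * (k[:, None] + 0.5))

def mdct_round_trip(x, M=256):
    """Sine-windowed MDCT analysis/synthesis with 50% overlap-add (hop = M)."""
    w = np.sin(np.pi / (2 * M) * (np.arange(2 * M) + 0.5))   # satisfies w[n]^2 + w[n+M]^2 = 1
    B = mdct_basis(M)
    y = np.zeros_like(x)
    for t in range(len(x) // M - 1):
        block = x[t * M : t * M + 2 * M]
        X = B @ (w * block)                                   # forward MDCT: M coefficients per block
        y[t * M : t * M + 2 * M] += w * (2.0 / M) * (B.T @ X)  # IMDCT, synthesis window, overlap-add
    return y

# The aliasing introduced by each block cancels between adjacent blocks (TDAC),
# so interior samples are reconstructed to machine precision.
rng = np.random.default_rng(0)
x = rng.standard_normal(4096)
y = mdct_round_trip(x, M=256)
print(np.max(np.abs(x[256:-256] - y[256:-256])))   # at machine-precision level
```

The partially overlapped transforms of Chapter 5 shrink this overlap region (L < 2M), which is what allows them to localize transients such as castanets and mitigate pre-echo.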