SHORTEN

Simple lo s sle s s andne arlo s sle s s waveform compre s s ion

Tony Robinson

Technical rep ort CUEDFINFENGTR

Cambr idge Univers ityEngineer ing Department

Trumpington Street Cambr idge CB PZ UK

Decemb er

Ab stract

Thi s rep ort de scr ib e s a program that p erforms compre ss ion of waveform le s such

as audio data A s imple pre dictivemodel of thewaveform i s us e d followed by Human

co dingofthe pre diction re s iduals Thi s i s b oth f ast andnear optimal for many com

monly o ccur ingwaveform s ignals Thi s f rameworkisthen extended to lossy co ding

under the conditions of maximisingthe s egmental s ignal to noi s e ratio on a p er f rame

bas i s and co dingto a xe d acceptable s ignal to noi s e ratio

Intro duction

It i s common tostore digitised waveforms on computers andtheresulting le s can often

consume s ignicantamountsofstorage space General compre s s ion algor ithms do not

p erform very well on the s e le s as they f ail totakeinto accountthestructure of thedataand

thenature of the s ignal contained there in Typically a waveform le will cons i st of s igned

bit numb ers andthere will b e s ignicantsample to sample correlation A compre s s ion

utility for the s e le must b e re asonably f ast p ortable accept datainamostpopular formats

and give s ignicant compre s s ion Thi s rep ort de scr ib e s shorten a program for the UNIX

and DOS environmentswhich aims tomeet the s e requirements

A s ignicantapplication of thi s program i s tothe problem of compre s s ion of sp eech

le s for di str ibution on CDROM Thi s rep ort starts withade scr iption of thi s domain then

di scus s e s thetwomain problems as so ciate d with general waveform compre s s ion namely

pre dictivemodellingandresidual co ding Thi s f rameworkisthen extended to lossy coding

Finallytheshorten implementation i s de scr ib e d andanappendix details the command line

options

Compre s s ion for sp eech corp ora

One imp ortant use for lossless waveform compression is to compre s s sp eech corp ora for

di str ibution on CDROM Stateofthe art sp eech recognition systems require gigabytes of

acoustic data for mo del e stimation whichtakes many CDROMs tostore Us e of compre s s ion

software b othreduce s the di str ibution co st andthenumber of CDROM change s require d

toreadthecompletedata s et

Thekey f actors in thede s ign of compre s s ion software for sp eech corp ora are thatthere

must b e no p erceptual degradation in the sp eech s ignal andthatthedecompre s s ion routine

must b e f ast and p ortable

There has b een much researchinto ecient sp eechcodingtechnique s andmanystand

ards have b een e stabli shed However mo st of thi s workhas b een for telephonyapplications

where de dicated hardware can used to p erform the co dingandwhere it i s imp ortantthat

the re sulting system operates atawell dene d bit rate In suchapplications lo s sy co ding

i s acceptable andindee d nece s sary order to guarantee thatthesystem op erates atthe xe d

bit rate

Similarly there has b een muchworkinde s ign of general purpose lossless compre s sors

for workstation us e Suchsystems do not guarantee any compre s s ion for an arbitrary

le but in general achieveworthwhile compre s s ion in re asonable timeongeneral purpose

computers

Sp eech corp ora compre s s ion nee ds somefeature s of b oth systems Lo s sle s s compre s

s ion i s an advantage as it guarantee s there i s no p erceptual degradation in the sp eech

s ignal However theestabli she d compre s s ion utilitie s do not exploit theknown structure

of the sp eech s ignal Hence shorten was wr itten to ll thi s gap andisnow in us e in the

di str ibution of CDROMs containing sp eechdatabas e s

The recordings us e d as example s in s ection and s ection are f rom the TIMIT corpus

which i s di str ibute d as bit kHz line ar PCM sample s Thi s format i s in common us e d

for continuous sp eech recognition research corp ora The recordings were collecte d us inga

Sennheiser HMD noisecancellinghe admounte d microphoneinlow noise conditions

All ten utterance s f rom sp e aker fcjf are used which amounttoatotal of s econds or

about sample s

Waveform Mo deling

Compression is achieved by building a pre dictivemodel of thewaveform a go o d intro duc

tion for sp eech i s JayantandNoll An e stabli shed model for a widevar ietyofwaveforms

is thatofanautoregre s s ivemodel also known as line ar pre dictive co ding LPC Here the

pre dicted waveform i s a line ar combination of past sample s

p

X

st a st i

i

i

Thecode d s ignal et i s the dierence b etween theestimateoftheline ar pre dictors t

andthe sp eech s ignal st

et st st

However manywaveforms of intere st are not stationarythatisthebestvalue s for the

co ecientsofthe pre dictor a vary f rom one s ection of thewaveform toanother It i s

i

often re asonable toassumethatthe s ignal i s p s eudostationary ie there exi stsa timespan

over which re asonable value s for the line ar pre dictor can b e found Thus thethree main

stage s in the co ding pro ce s s are blo cking pre dictivemodelling andresidual co ding

Blo cking

Thetime f rameover whichsample s are blo cked dep ends to some extentonthenature of

the s ignal It i s inecienttoblo ckontoo short a time scale as thi s incurs an overhead in

the computation and transmi s s ion of the pre diction parameters It i s also inecienttouse

atime scale over whichthesignal characteristics change appreciably as thi s will re sultin

a p o orer mo del of the s ignal However in the implementation described belowthe linear

pre dictor parameters typically takemuch le s s information to transmit than the residual

s ignal so thechoice of window length i s not cr itical Thedef aultvalue in theshorten

implementation i s which re sults in ms f rame s for a s ignal sample d at kHz

Sample interle aved signals are handelle d by tre atingeachdata stre am as indep endent

Even in cas e s where thereisaknown correlation b etween the stre ams suchasinstereo

audio the withinchannel correlations are often s ignicantly gre ater than the cro s schannel

correlations so for lo s sle s s or ne arlo s sle s s co dingthe exploitation of thi s additional correl

ation only re sultsinsmall additional gains

A rectangular window i s us e d in preference toanytap er ingwindowasthe aim i s to

mo del just those sample s within theblo ck not the sp ectral characteristics of the s egment

surroundingtheblo ck The windowlength i s longer than theblo cksizebythe pre diction

order whichistypically three sample s

Line ar Pre diction

Shorten supp ortstwo forms of line ar pre diction thestandard pth order LPC analys i s of

equation andarestricte d form wherebythe co ecients are s elected from one of four

xe d p olynomial pre dictors

In thecaseofthegeneral LPC algor ithm the pre diction co ecients a are quantised in

i

accordance withthe same Laplacian di str ibution us e d for theresidual s ignal andde scr ib e d

in s ection The exp ected numb er of bits p er co ecient i s as thi s was foundtobe

a go o d tradeo b etween mo delling accuracy andmodel storage Thestandard Durbins

algor ithm for computingthe LPC co ecients f rom theauto correlation co ecients is used

in a incremental wayOneachiteration theme an square d value of the pre diction re s idual

i s calculated andthi s i s us e d to computethe exp ected number of bitsnee ded tocode

the residual s ignal Thi s i s added tothenumb er of bitsnee ded tocodethe pre diction

co ecientsandthe LPC order i s s elected to minimis e thetotal As the computation of

theauto correlation co ecientsisthe mo st exp ens ivestep in this process the s e arch for

theoptimal mo del order i s terminated when the last twomodels have re sulte d in a higher

bit rate Whilst it i s possible to construct s ignals thatdefe atthis search pro ce dure in

practice for sp eechsignals it has b een foundthatthe o ccas ional us e of a lower pre diction

order re sults in an ins ignicant incre as e in the bit rateandhas the additional s ide eect of

requir ing le s s computetodeco de

A re str ictiveformofthe line ar pre dictor has b een foundto b e us eful In thi s cas e the

pre diction co ecients are tho s e sp ecie d byttinga p order p olynomial tothe last p data

p oints eg a linetothe last two p oints

s t



s t st



s t st st



s t st st st



Writing e tasthe error s ignal f rom the ithpolynomial pre dictor

i

e t st



e t e t e t

  

e t e t e t

  

e t e t e t

  

As can b e s een f rom equations there i s an ecient recurs ivealgorithm for comput

ingthe s et of p olynomial pre diction re s iduals Eachresidual term i s forme d f rom the

dierence of the previous order pre dictors As e achterm involve s only a few integer addi

tionssubtractions it i s p o s s ible to compute all pre dictors and s elect the b e st Moreover

as thesumofabsolutevalue s i s line arly related tothevar iance thi s may be used as the

bas i s of pre dictor s election andsothewhole process is cheap to computeasitinvolves no

multiplications

Figure shows b oth forms of pre diction for a range of maximum pre dictor orders

The gure shows that rst and s econdorder pre diction provides a substantial incre as e in

compre s s ion andthathigher order pre dictors provide relatively little improvement The

gure also shows that for thi s example mo st of thetotal compre s s ion can b e obtaine d us ing

no pre diction that i s a zerothorder co der achieved about compre s s ion andthebest

pre dictor Hence for lo s sle s s compre s s ion it i s imp ortant not towastetoo much

computeonthe pre dictor andtoto p erform the residual co ding eciently

Re s idual Co ding

The sample s in the pre diction residual are nowassumed tobeuncorrelated andtherefore

may b e co ded indep endentlyThe problem of re s idual co dingistherefore tondanappro

pr iate form for the probabilitydens ityfunction pdf of the di str ibution of re s idual value s 54

52 polnomial 50

48 lpc

46

44

compression (% original size) 42

40 0 2 4 6 8

maximum predictor order

Figure compre s s ion against maximum pre diction order

so thatthey can b e eciently mo delle d Figure s and showthe pdf for the s egmentally

normalize d re s idual of thepolynomial pre dictor the full line ar pre dictor shows a s imilar

pdf The ob s erved value s are shown as op en circle s theGaus s ian pdf i s shown as dot

dash lineandthe Laplacian or double s ide d exp onential di str ibution i s shown as a dashed

line The s e gure s demonstratethatthe Laplacian pdf tstheobserve d di str ibution

very well Thi s i s convenientasthere i s a s imple Human co de for thi s di str ibution

To form thi s co de a numb er i s divided into a s ign bit the nthlow order bitsandthethe

remaining high order bits The high order bits are tre ate d as an integer andthi s number

of s are transmitted followed bya terminatingThe n loworder bitsthen follow as in

the example in table

s ign lower number full

Number bit bits of s co de

Table Example s of Human co de s for n

As with all Human co desawhole number of bits are used per sample re sultingin 0.08 Laplace

Gaussian 0.06

0.04 probability

0.02

0 -30 -20 -10 0 10 20

prediction residual

Figure Ob s erved Gaus s ian and quantize d Laplacian pdf

-2

Laplace -4 Gaussian

-6

-8

-10 log2 probability/bits

-12 -40 -20 0 20 40

prediction residual

Figure Ob s erved Gaus s ian Laplacian andquantize d Laplacian pdf andlog pdf



instantaneous deco dingatthe exp ens e of intro ducingquantization error in the pdf Thi s i s

illustrated withthepointsmarke d in gure In the example n giving a minimum

co de lengthofThe error intro duce d by co ding accordingtothe Laplacian pdf instead

of the true pdf i s only bits p er sample andthe error intro duce d byusing Human

co de s i s only bits p er sample The s e are small compare d toatypical co de lengthof

for kHz sp eech corp ora

Thi s Human co de i s also s imple in thatitmay b e enco ded anddeco de d with a few

logical op erations Thus the implementation nee d not employ a tree s e archfordeco ding

so re ducingthe computational andstorage overhe ads as so ciate d with transmitting a more

general pdf

Theoptimal number of loworder bitsto b e transmitte d directly i s line arly related to

thevar iance of the s ignal The Laplacian i s dene d as

p

jxj

p

px e



where jxj is theabsolutevalue of x and is thevar iance of the di str ibution Takingthe

exp ectation of jxj gives

Z

E jxj jxjpxdx

p

Z

p

x

x e dx



Z

p p

x x

e dx xe





p

For optimal Human co dingwenee d tondthenumber of loworder bits nsuchthatsuch

n

thathalf the sample s lie in therange Thi s ensure s thatthe Human co deisn bits

nk 

long with probabilityand n k longwith probability whichisoptimal

Z

n



pxdx

n



Z

n

p



jxj

p

e dx

n



p

n



e

Solvingforn gives

p

n log log



log log E jxj



When p olynomial lters are us e d n i s obtaine d f rom E jxjusing equation In the LPC

implementation n is der ived from whichisobtaine d directly f rom the calculations for

pre dictor co ecientstheusingtheauto correlation method

Lo s sy co ding

The previous s ections haveoutlined the completewaveform compression algorithm for

lo s sle s s co ding There are a wide range of applications whereby somelossinwaveform

accuracy i s an acceptable tradeo in retur n for b etter compre s s ion A re asonably cle an

way to implementthi s i s todynamically change thequantisation level on a s egmentby

s egment bas i s Not only do e s thi s pre s ervethewaveform shap e buttheresultingdistortion

can b e e as ily understood As sumingthatthe sample s are uniformally di str ibute d within

thenew quantisation interval of nthen the probabilityofanyonevalue in thi s range i s



n andthe noise power intro duce d i s i for thelower value s that are rounded down and



for those value s that are rounde d up Hence thetotal noi s e p ower intro duce d by n i

the incre as e d quantisation i s

n

n

X X

  

A

i n i n

n

i

in

It may also b e as sumed thatthe s ignal was uniformally di str ibute d in the or iginal quant

R





isation interval b efore digitisation ie a quantisation error of x dx



Shorten supportstwomain types of lossy coding the cas e where every s egmentis

co ded atthesamerate andthe cas e where the bit rateisdynamically adapted tomaintain

a sp ecie d s egmental s ignal to noise ratio In therstmode thevar iance of the pre dic

tion re s idual of theoriginal waveform i s e stimated andthen theappropr iatequantisation

p erformed to limit thebitrate In are as of thewaveform where there are strong sample to

sample correlations thi s re sultsinarelatively high s ignal to noi s e ratio and in are as with

little correlation thesignal to noise ratio approaches thatofthesignal p ower divided bythe

quantisation noi s e of equation In the s econdmode thi s equation is used toestimate

the gre ate st additional quantisation that can b e p erforme d whilst maintaining a sp ecie d

s egmental s ignal to noise ratio In b oth cases thenew quantisation interval n i s re str icted

to beapower of two for computational eciency

Compre s s ion Performance

The previous s ections havedemonstrated thatlow order line ar pre diction followed by Hu

man co dingtotheLaplace di str ibution re sults in an ecient lossless waveform co der

Table compare s thi s technique tothepopular general purp o s e compre s s ion utilitie s that

are available Thetable shows thatthe sp eech sp ecic compression utilitycanachieve con

siderably b etter compre s s ion than more general tools The compression anddecompression

sp ee ds are the f actors f aster than re al timewhen execute d on a standard SparcStation I

except the re sult for the g ADPCM compre s s ion whichwas implemente d on a SGI In

digo R workstation us ingthe supplie d aifccompre s saifcdecompre s s utilitie s TheSGI

timings were scale d bya factor of whichwas determined bythe relative execution times

of shorten decompre s s ion on thetwoplatforms

program s ize compre s s decompre s s

sp ee d sp ee d

UNIX compre s s

UNIX

GNU

shorten def aultfast

shorten LPC slow

aifcdecompre s s lossy

Table Compre s s ion rates and speeds

Toinvestigatethe eects of lossy codingonspeech recognition p erformance thetest

portion of the TIMIT databas e was co ded at four bits p er sample andthe re sulting sp eech

was recognised witha stateofthe art phone recognition system Bothshorten andthe

g ADPCM standard gavenegligible additional errors about more errors over the

bas eline of errors butitwas nece s sary toapply a f actor of four scalingtothe

waveform for us e withthe g ADPCM algor ithm g ADPCM without scalingand

thetelephony quality g ADPCM algor ithm designe d for kHz samplingandop erated

at kHz b oth pro duce d s ignicantly more errors approximately in errors

Co dingthi s databas e at four bit p er sample re sultsinapproximately another f actor of two

compre s s ion over lo s sle s s co ding

Decompre s s ion andplayback of bit kHz stereo audio takes approximately

of theavailable pro ce s s ingpower of a DX bas e d machineandofaMHz

Pentium Di sk acce s s accounte d for of thetimeonthe slower machine Performing

compre s s ion tothree bits p er sample give s another f actor of three compression reducing

the di sk acce s s timeprop ortionally and providing f aster execution with no p erceptual

degradation totheauthors e ars Thus re al timedecompre s s ion of high qualityaudio i s

possible for a widerange of p ersonal computers

Conclus ion

Thi s rep ort has described a simple waveform co der de s igned for use withstore d waveform

le s Theuseofasimple line ar pre dictor followed by Human co ding accordingtothe

Laplacian di str ibution has b een foundtobeappropr iatefortheexample s studie d Var ious

technique s havebeenadopted to improvethe eciency re sultinginrealtimeop eration on

manyplatforms Lo s sy compre s s ion i s supp orted bothto a sp ecie d bit rateandtoa

sp ecie d s ignal tonoiseratio Mo st s imple sample le formats are accepte d re sultingina

exible tool for theworkstation environment

Reference s

John Garofolo Tony Robinson and Jonathan Fi scus Thedevelopment of le formats

for very large sp eech corp ora Sphere andshorten In Proc ICASSPvolume I page s

N S JayantandPNoll Digital Coding of Waveforms Prentice Hall Englewood

Clis NJ ISBN

R F Rice and JRPlaunt Adaptivevar iablelengthcoding for ecient compre s

s ion of spacecraft televi s ion data IEEE Transactions on Communication Technology

PenShuYeh Rob ert Rice andWar ner Miller On theoptimalityofcodeoptions for a

universal noi sle s s co der JPL Publication Jet Propuls ion Lab orator ie s February

Rob ert F Rice Some practical noi s ele s s co dingtechnique s Part I I Mo dule PSIK

JPL Publication Jet Propuls ion Lab orator ie s Novemb er

Appendix Theshorten man page vers ion

SHORTEN USER COMMANDS SHORTEN

NAME

shorten fast compression for waveform files

SYNOPSIS

shorten hl a bytes b samples c channels d

bytes m blocks n dB p order q bits r

bits t filetype v version waveformfile

shortenedfile

shorten x hl a bytes d bytes shortenedfile

waveformfile

DESCRIPTION

shorten reduces the size of waveform files such as audio

using of prediction residuals and optional

additional quantisation In lossless mode the amount of

compression obtained depends on the nature of the waveform

Those composing of low frequencies and low amplitudes give

the best compression which may be or better Lossy

compression operates by specifying a minimum acceptableseg

mental signal to noise ratio or a maximum bit rate Lossy

compression operates by zeroing the lower order bits of the

waveform so retaining waveform shape

If both file names are specified then these are used as the

input and output files The first file name can be replaced

by to read from standard input and likewise the second

filename can be replaced by to write to standard output

Under UNIX if only one file name is specified then that

name is used for input and the output file name is generated

by adding the suffix shn on compression and removing the

shn suffix on decompression In these cases the input

file is removed on completion The use of automatic file

name generation is not currently supported under DOS If no

file names are specified shorten reads from standard input

and writes to standard output Whenever possible the out

put file inherits the permissions owner group access and

modification times of the input file

OPTIONS

a align bytes

Specify the number of bytes to be copied verbatim

before compression begins This option can be used to

preserve fixed length ASCII headers on waveform files

and may be necessary if the header length is an odd

number of bytes

b block size

Specify the number of samples to be grouped into a

block for processing Within a block the signal ele

ments are expected to have the same spectral charac

teristics The default option works well for a large

range of audio files

c channels

Specify the number of independent interwoven channels

For two signals at and bt the original data format

is assumed to be abab

d discard bytes

Specify the number of bytes to be discarded before

compression or decompression This may be used to

delete header information from a file Refer to the a

option for storing the header information in the

compressed file

h Give a short message specifying usage options

l Prints the software license specifying the conditions

for the distribution and usage of this software

m blocks

Specify the number of past blocks to be used to esti

mate the mean and power of the signal The value of

zero disables this prediction and the mean is assumed

to lie in the middle of the range of the relevant data

type ie at zero for signed quantities The

default value is nonzero for format versions and

above

n noise level

Specify the minimum acceptable segmental signal to

noise ratio in dB The signal power is taken as the

variance of the samples in the current block The

noise power is the quantisation noise incurred by cod

ing the current block assuming that samples are unifor

mally distributed over the quantisation interval The

bit rate is dynamically changed to maintain the desired

signal to noise ratio The default value represents

lossless coding

p prediction order

Specify the maximum order of the linear predictive

filter The default value of zero disables the use of

linear prediction and a polynomial interpolation method

is used instead The use of the linear predictive

filter generally results in a small improvement in

compression ratio at the expense of execution time

This is the only option to use a significant amount of

floating point processing during compression

Decompression still uses a minimal number of floating

point operations

Decompression time is normally about twice that of the

default polynomial interpolation For version and

compression time is linear in the specified maximum

order as all lower values are searched for the greatest

expected compression the number of bits required to

transmit the prediction residual is monotonically

decreasing with prediction order but transmittingeach

filter coefficient requires about bits For ver

sion and above the search is started at zero order

and terminated when the last two prediction orders give

a larger expected bit rate than the minimum found to

date This is a reasonable strategy for many real

world signals you may revert back to the exhaustive

algorithm by setting v to check that this works for

your signal type

q quantisation level

Specify the number of low order bits in each sample

which can be discarded set to zero This is useful

if these bits carry no information for example when

the signal is corrupted by noise

r bit rate

Specify the expected maximum number of bits per sample

The upper bound on the bit rate is achieved by setting

the low order bits of the sample to zero hence max

imising the segmental signal to noise ratio

t file type

Gives the type of the sound sample file as one of

ulawsususxuxshluhlslhulh

ulaw is the natural file type of ulaw encoded files

such as the default sun au files All the other

types have initial s or u for signed or unsigned data

followed by or as the number of bits per sample

No further extension means the data is in the natural

byte order a trailing x specifies byte swapped data

hl explicitly states the byte order as high byte fol

lowed by low byte and lh the converse The default is

s meaning signed bit integers in the natural byte

order

Specific optimisations are applied to ulaw files If

is specified then a check is made

that the whole dynamic range is used useful for files

recorded on a SparcStation with the volume set too

high If is specified then the

data is internally converted to linear The lossy

option r has been observed to give little degrada

tion

v version

Specify the binary format version number of compressed

files Legal values are and higher numbers

generally giving better compression The current

release can write all format versions although con

tinuation of this support is not guaranteed Support

for decompression of all earlier format versions is

guaranteed

x extract

Reconstruct the original file All other command line

options except a and d are ignored

METHODOLOGY

shorten works by blocking the signal making a model of each

block in order to remove temporal redundancy then Huffman

coding the quantised prediction residual

Blocking

The signal is read in a block of about or samples

and converted to integers with expected mean of zero

Samplewiseinterleaved data is converted to separate chan

nels which are assumed independent

Decorrelation

Four functions are computed corresponding to the signal

difference signal second and third order differences The

one with the lowest variance is coded The variance is

measured by summing absolute values for speed and to avoid

overflow

Compression

It is assumed the signal has the Laplacian probability den

sity function of expabsx There is a computationally

efficient way of mapping this density to Huffman codes The

code is in two parts a run of zeros a bounding one and a

fixed number of bits mantissa The number of leading zeros

gives the offset from zero Signed numbers are stored by

calling the function for unsigned numbers with the sign in

the lowest bit Some examples for a bit mantissa

This Huffman code was first used by Robert Rice for more

details see the technical report CUEDFINFENGTR

included with the shorten distribution as files trtex

and trps

SEE ALSO

pack

DIAGNOSTICS

Exit status is normally A warning is issued if the file

is not properly aligned ie a whole number of records

could not be read at the end of the file

BUGS

There are no known bugs An easy way to test shorten for

your system is to use make test if this fails for what

ever reason please report it

No check is made for increasing file size but valid

waveform files generally achieve some compression Even

compressing a file of random bytes which represents the

worst case waveform file only results in a small increase

in the file length about for bit data and for

bit data

There is no provision for different channels containingdif

ferent data types Normally this is not a restriction but

it does mean that if lossy coding is selected for the ulaw

type then all channels use lossy coding

It would be possible for all options to be channel specific

as in the r option I could do this if anyone has a

really good need for it

See also the file Changelog and READMEdos for what might

also be called bugs past and present

Please mail me immediately at the address below if you do

find a bug

AVAILABILITY

The latest version can be obtained by anonymous FTP from

svrftpengcamacuk in directory compspeechsources

The UNIX version is called shortenZ and the DOS

version is called shortzip where represents a digit

AUTHOR

Copyright C by Tony Robinson ajrcamacuk

Shorten is available for noncommercial use without fee

See the LICENSE file for the formal copying and usage res

trictions