New image compression artifact measure
using wavelets
Yung-Kai Lai and C.-C. Jay Kuo
Integrated Media Systems Center
Department of Electrical Engineering-Systems
University of Southern California
Los Angeles, CA 90089-256 4
Jin Li
Sharp Labs. of America
5750 NW Paci c Rim Blvd., Camas, WA 98607
ABSTRACT
Traditional ob jective metrics for the quality measure of co ded images such as the mean squared error (MSE)
and the p eak signal-to-noise ratio (PSNR) do not correlate with the sub jectivehuman visual exp erience well, since
they do not takehuman p erception into account. Quanti cation of artifacts resulted from lossy image compression
techniques is studied based on a human visual system (HVS) mo del and the time-space lo calization prop ertyofthe
wavelet transform is exploited to simulate HVS in this research. As a result of our research, a new image quality
measure by using the wavelet basis function is prop osed. This new metric works for a wide variety of compression
artifacts. Exp erimental results are given to demonstrate that it is more consistent with human sub jective ranking.
KEYWORDS: image quality assessment, compression artifact measure, visual delity criterion, human visual
system (HVS), wavelets.
1 INTRODUCTION
The ob jective of lossy image compression is to store image data eciently by reducing the redundancy of image
content and discarding unimp ortant information while keeping the quality of the image acceptable. Thus, the tradeo
of lossy image compression is the numb er of bits required to represent an image and the quality of the compressed
image. This is usually known as the rate-distortion tradeo . The numberofbitsusedtorecordthecompressed
image can b e measured easily.However, the \closeness" b etween the compressed and the original images is not
a pure ob jective measure, since human p erception also plays an imp ortant role in determining the qualityofthe
compressed image. Atpresent, the most widely used ob jective distortion measure is the mean squared error (MSE)
or the related p eak signal-noise ratio (PSNR). It is well known that MSE do es not correlate very well with the visual
quality p erceived byhuman b eing. This can b e easily explained by the fact that MSE is computed by adding the
squared di erence of individual pixels without considering the visual interaction b etween adjacent pixels.
In comparison with the amount of research on new compression techniques, work on visual quality measure of
compressed images is amazingly little. Most visual quality assessments use rating scales ranging from \excellent"
to \unsatisfactory" in measuring the p erceived image quality. Human assistance is needed in the establishmentof
quality scales and the comparison b etween compressed images and the set of training images. It is dicult to obtain
ob jective and quantitative measures by using this approach [15]. Results achieved are sub jective and qualitative.
Some work tries to mo dify existing quantitative measures to accommo date the factor of human visual p erception.
One is to improveMSEby putting di erentweights to neighb oring regions with di erent distances to the fo cal pixel
[17]. Most of them can b e viewed as a curve- tting metho d to comply with the rating scale metho d. Since late
1970's, researchers have started to pay attention to the imp ortance of the human visual system (HVS) and tried
to include the HVS mo del in the image quality metric [10], [11]. The development of the HVS mo del at that time
was not mature enough and the prop osed mo del could not interpret human visual p erceptual phenomena very well.
Recently, Karunasekera prop osed an ob jective distortion measure to evaluate the blo cking artifact of the blo ck-based
compression technique [12]. Watson [23] and van den Branden Lambrecht [1], [22] prop osed more complete mo dels
and extended their use to compressed videos.
It has b een shown by psychophysic exp eriments that the human visual system is comprised bymany units, each
of which is sensitive to the contrast at some space-frequency lo calized region and functions indep endently. The
overall visual p erception of the luminance or contrast of an ob ject is the aggregate p erformance of the bandpass
frequency resp onse of all units [4]. Based up on this space-frequency lo calized mo del, the visual p erception of co ding
artifacts can b e interpreted as the p erception of lo cal frequency inconsistency. When an image is compressed and
decompressed, the deco ded image has di erent lo cal frequency resp onses than the original image, thus resulting in
compression artifacts.
Since each cell can sense only a nite region with resp ect to the fo cal pixel in an image, Fourier analysis is
not suitable for the mo deling of HVS due to its global analytical prop erty. This issue has b een addressed and
inconsistency b etween theory and observed results by using the pure frequency analysis has b een found. Another
observation is that the common bandwidth of cortical cells is 1 o ctave[4].Itisknown from exp eriments that the
frequency resolution of the neighb orho o d is reduced by an o ctave when the distance from the fo cal p oint doubles [6].
This observation suggests the use of dyadic wavelets for the analysis and mo deling of HVS.
In this work, we consider a wavelet approach for the mo deling of HVS and then build an ob jective quality metric
based on this mo del. To b e more precise, we rst prop ose a new de nition of contrast with resp ect to complex
images by taking the wavelet transform of the image, and using wavelet co ecients to estimate the contrast at
each resolution in the image. By using the multiresolution and space-frequency lo calization prop erties, manyknown
inconsistencies in the psychophysical literature can b e understo o d naturally.Known human visual system functions
such as the contrast sensitivity threshold [20] and the frequency masking e ect [21] can also b e incorp orated in this
mo del. We then prop ose an ob jective metric to measure the extent of the p erceived contrast in every resolution,
and derive an error measure by examining the weighted di erences of wavelet comp onents b etween the original and
compressed images at each resolution.
This pap er is organized as follows. In Section 2, we brie y review existing HVS theory and mo dels. Then, the
wavelet approachisintro duced to simulate the lo calization phenomenon of HVS in Section 3. A new HVS mo del
based on wavelets is prop osed and an image quality assessment system is describ ed in Section 4. Exp erimental results
are given in Section 5 to demonstrate that this new metric works for a wide range of compression artifacts and that
the resulting measure is consistent with human sub jective ranking.
2 MODEL FOR HUMAN VISUAL SYSTEM (HVS)
2.1 Contrast thresholds
Human visual systems (HVS) resp onse to light stimuli of surroundings. One imp ortant observation is that visual
p erception is sensitive to luminance di erence rather than the absolute luminance. Let L and L b e the
max min
maximum and minimum luminances of the waveform around the p ointofinterest. Michelson's contrast, de ned as
L L
max min
; (1) C =
L + L
max min
is found to b e nearly constant when used to represent the just noticeable luminance di erence.
Psychophysic exp eriments showed that HVS is comprised bymany units, eachofwhich is fo cused on a certain
p oint in the vision eld and only sensitive to the contrast in a certain frequency band. The overall visual p erception
of ob ject luminance or contrast is the aggregate p erformance of each cell's frequency resp onse [4]. Since HVS cannot
provide in nite luminance resolution, a contrast threshold exists in each frequency band. The contrast threshold
value is a function of the spatial frequency which can b e determined exp erimentally.Atypical contrast sensitivity
curve, de ned as the recipro cal of the contrast curve, is shown in Fig. 1 [4]. As shown in the gure, HVS has the
highest luminance resolution around 3-10 cycles p er degree, and the sensitivityattenuates at b oth high and low
frequency ends. An image artifact can b e sensed if its contrast is ab ove the threshold at sp eci c spatial frequencies.
3 10
2 10
Sensitivity Threshold 1 10
0 10 −1 0 1 2 10 10 10 10
Spatial Frequency (cycles/deg)
Figure 1: A typical contrast sensitivitycurveofhuman b eings.
2.2 Suprathreshold contrast
The subthreshold stimuli, i.e. stimuli with a contrast lower than the threshold, cannot b e sensed byhuman.
The suprathreshold contrast is of concern in this subsection. Visual resp onse to the suprathreshold contrast involves
human sub jective rating. Even though it is dicult to nd an precise formula for its mo deling, it is generally agreed
that the estimated resp onse R is a function of the spatial frequency and follows a p ower law [14]:
p
R = k (C C ) ; (2)
T
where C is the suprathreshold contrast, C is the contrast threshold, and the exp onent p varies b etween 0.38 and some
T
value more than 1.0 [7]. Moreover, the contrast constancy phenomenon [9] states that the suprathreshold resp onse
is almost constant for a given contrast at a wide range of spatial frequencies. These two seemingly contradicting
observations were mediated rst by Cannon [26] and then by Georgeson's 3-stage mo del [8].
2.3 Channel interactions
Although cells narrowly tune to di erent frequency bands, they are not strictly band-limited, and interactions
among adjacent frequency channels were well observed in the literature. Two primary e ects were often discussed in
the literature. They are the summation e ect and the masking e ect.
The summation e ect is an inter-channel e ect saying that the neighb oring frequency channels contribute to the
total contrast. Therefore, a subthreshold contrast may still pro duce a small resp onse if there exist other excitory
stimuli in nearby frequency channels. This e ect can b e mo deled as a contrast threshold C decrease [13]:
T
1
!
s
N
i
X
s
i
; C = jC j
T Ti
i=0
th
where C is the contrast sensitivity threshold of the i closest channel, N is the total neighb oring channels a ecting Ti
the p erception of the target channel, C represents the original contrast without summation and s 's are the
T 0 i
exp onential comp onents to b e determined.
The masking e ect is another inter-channel e ect which states that the visibilityofa stimuli at some frequency
could b e impaired by the presence of other stimuli in nearby frequency channels. One well-known example in
compressed image artifact analysis is that the blo cking artifact of blo ck-based compression schemes, which consists
of high-frequency edge comp onents, is less visible in the texture regions. This e ect can b e viewed as a mo di cation
of contrast threshold and mo deled as an exp onential function of b oth their frequency and contrast di erences [1],
[21]:
!
N
m m
1;i 2;i
X
f C
i i
; (3) C = C
T T 0
C f
T 0 0
i=0
where C , C , N have the same de nition as in the subthreshold summation case, C and f are the actual
Ti T 0 i i
th
contrast and the spatial frequency of the i closest channel, resp ectively, and m 's and m 's are the co ecients
1;i 2;i
to b e determined. The validity of Eqn. (3) will b e examined in Section 5.2.
2.4 Spatial inhomogeneity
Most research results whichledtothecontrast threshold curveasshown in Fig. 1 were obtained with foveal vision,
i.e. measured very close the xation p oint. However, p eripheral vision researches showed that the p erceived image
gradually blurs with increasing distance from the xation p oint. In other words, the spatial frequency sensitivity
of the vision system decreases as the eccentricity from the fo cal p oint increases [6], [13], and it was shown that
the equal-sensitivity neighb orho o d radius is inversely prop ortional to the spatial frequency, i.e. the pro duct of the
spatial frequency and the resolvable radius from the fo cal p oint is almost constant [6]. Direct consequences of this
inhomogeneity phenomenon are that the resolvable region in the vision eld varies with the spatial frequency and
that contrasts at di erent resolutions play di erent roles in the visual system regarding their resp ective ranges.
3 SIGNAL LOCALIZATION AND WAVELET REPRESENTATION
3.1 Space-frequency lo calization
Limited by their spatial lo cation on the retina, the receptors can only fo cus on their own certain regions of the
visual eld. The frequency resp onse of the xation p ointischaracterized by the typical contrast threshold function
as shown in Fig. 1, but the high frequency resp onse will further attenuate as the eccentricity from the fo cal p oint
increases. Therefore, the frequency resp onse of visual stimuli is not only band-limited in the frequency domain but
also space-lo calized in the spatial domain.
The commonlyusedFourier frequency analysis is, however, a global pro cess whichgives all spatial comp onents
the same weighting.Itiswell known as Heisenb erg's uncertainty principle that the lo calization in b oth spatial and
frequency domains cannot b e achieved simultaneously. Gab or transform, which is a Gaussian-windowed Fourier
transform, was proved to achieve the limit of the Heisenb erg inequality. Gab or gratings were thus widely used in
mo dern psychophysical exp eriments.
The parameters of the Gaussian windowwere chosen at researchers' preferences and various degrees of lo calization
were achieved [19]. That is, byvarying the Gaussian envelop e parameters, the passbands of Gab or gratings were
overlapp ed to a di erent extent. There are some limitations in the Gab or representation. First, it is dicult to
analyze the stimuli whose frequency resp onses fall in the overlapp ed band. Second, since the Gab or lter is an I IR
(in nite impulse resp onse) lter, truncation is still needed for practical implementation. Thus, the lo calization is not
fully ensured after truncation. Toovercome these diculties, we adopt a wavelet approach for signal analysis in the next section.
3.2 Wavelet approach
The wavelet transform provides a go o d space-frequency lo calization prop erty [2] and can b e implemented by using
the multichannel lter banks. Compactly supp orted wavelets such as the Daub echeis lters [2] can b e implemented
with FIR ( nite impulse resp onse) lters. The space-frequency lo calization is optimized among all p ossible FIR lters
with the given length for the Daub echies lters. Another useful prop ertyofthewavelet lter is that the frequency
resp onse of the contrast threshold can b e easily obtained by prop erly cho osing the wavelet basis. Vision researches
with Gab or ltering usually used the resp onse amplitude, represented by dBs, to approximate the contrast threshold
function de ned by (1) [1]. However, the e ect of this approximation has not yet b een thoroughly investigated.
The Haar wavelet is the simplest basis function in the compactly supp orted wavelet family. It provides a capability
to compute the contrast directly from the resp onses of the low and high frequency subbands. For the Haar wavelet,
the lter co ecients for the low and high frequency lter banks are given by
1
p
n = 1; 0;
2
h [n] = (4)
0
0 otherwise
8
1
p
n =0;
<
2
1
p
n = 1; (5) h [n] =
1
2
:
0 otherwise,
resp ectively. Assume that a discrete-time input signal x[n] is the staircase contrast pattern
L n<0
max
x[n]= (6)
L n 0;
min
The resp onses y [n]andy [n] at the 0th resolution after ltering with h [n]and h [n], resp ectively,are
0;0 1;0 0 1
p
8
2L n<