Design of a No-Reference Perceptual Artifact Metric

Master Thesis in Media & Knowledge Engineering

Thesis Committee: Prof. Dr. I. Heynderickx

Prof. Dr. R.L. Lagendijk

Dr. W.P. Brinkman

Dr. K.V. Hindriks

MSc. H. Liu

Author: N.C.R. Klomp
Email: [email protected]
Student number: 1099957

Supervisor: Prof. Dr. I. Heynderickx
Group: MMI

Keywords: objective metric, ringing artifact, perceptual edge detection, luminance masking, texture masking

This report was made as part of a Master thesis project at the faculty of Computer Science at the Delft University of Technology. All rights reserved. Nothing from this publication may be reproduced without written permission by the copyright holder. No rights whatsoever can be claimed on grounds of this report.

Printed in The Netherlands

Index

1 Introduction
1.1 Research Question
1.2 Ringing Artifact
1.3 Objective Metrics and the HVS
1.4 Ringing Metrics
2 Literature
2.1 PCA Ringing Metric [31]
2.2 Ratio Ringing Metric [28]
2.3 Horizontal Ringing Metric [27]
2.4 No-Reference Ringing Metric [31]
2.5 Morphological Filtering Ringing Metric [29]
2.6 Region Clustering Ringing Metric [30]
2.7 Conclusion
3 Approach
3.1 Overview
3.2 Ringing Region Detection
3.2.1 Color Conversion
3.2.2 Edge Preserving Smoothing
3.2.3 Edge Detection
3.2.4 Perceptual Grouping
3.2.5 Local Region Classification
3.2.6 Human Vision Model
3.2.7 Region Regrouping
3.2.8 Spurious Ringing Object Removal
3.3 Ringing Quantification
4 Experiments
4.1 Ringing Region Experiment
4.1.1 Experimental procedure
4.1.2 Subjective Data Processing
4.1.3 Performance Evaluation
4.1.4 Performance Evaluation Metric Components
4.1.5 Performance Evaluation against Metrics from Literature
4.2 Ringing Annoyance Experiment
4.2.1 Experimental procedure
4.2.2 Subjective Data Processing
4.2.3 Evaluation Criteria
4.2.4 Performance Evaluation
5 Conclusion
6 Recommendations
7 References
8 Appendix

1 Introduction

Due to the quick advancement of new technologies and the increasing resolution of digital media, the past decade has witnessed a rapidly growing need for storage and transmission. Uncompressed images and video are still demanding in terms of transmission bandwidth and storage space. Despite rapid progress in mass-storage density, processor speeds, and digital communication systems, demand for data storage capacity and data-transmission bandwidth continues to outstrip the capabilities of available technologies. A challenge in the field of information communication is to fit a large amount of visual information into the narrow bandwidth of transmission channels or a limited storage space, while maintaining the best possible perceived quality for the viewer [1][2]. Therefore, compression techniques for images and video have been developed. High Definition Television, satellite communication and digital photography would not be feasible without a high degree of compression.

A common characteristic of most images is that neighboring pixels are correlated, and therefore, contain redundant information. Two fundamental components of compression are redundancy and irrelevancy reduction [3]. Redundancy reduction aims at removing duplication from the signal source, while irrelevancy reduction omits parts of the signal that will not be noticed by humans. Image compression techniques can be classified into lossless and lossy compression. In lossless compression the compressed image is numerically identical to the original image. Therefore, the original image can be reconstructed from the compressed image to its full extent. The drawback is that lossless compression can only achieve a modest amount of compression. An image reconstructed following lossy compression contains degradation relative to the original, often because the compression scheme completely discards redundant information. Although lossy compression is capable of achieving much higher compression ratios, some of the image information is lost during the compression process [4]. Lossy compression is generally used for images and videos, where a certain amount of information loss will not be detected by most viewers.

The discrete cosine transform (DCT) is used in most of the international standards for still image (e.g. JPEG) and video compression (e.g. MPEG). The DCT is similar to the discrete Fourier transform (DFT): it transforms a signal or image from the spatial domain to the frequency domain. Unlike the DFT, the DCT is real-valued and provides a better approximation of a signal with fewer coefficients [3][5]. Its optimality for highly correlated signals, its consequent good performance at high to medium bit rates, and the availability of fast implementation algorithms and VLSI chips all contributed to the DCT being the transform of choice in many image and video coders [6]. A DCT expresses a sequence of data points in terms of a sum of cosine functions oscillating at different frequencies. The DCT is applied to blocks of 8x8 or 16x16 pixels, after which the block coefficients are quantized and coded separately, where the quantization varies with frequency. The reconstructed image can easily be obtained at the decoder by performing the de-quantization and inverse DCT. Because most adjacent image pixels are highly correlated, most of the signal energy is concentrated in the low frequencies, and the high frequencies are quantized more coarsely than the lower frequencies to reduce image size. In practice, at higher compression an image will not be perfectly recovered, especially near high frequencies and block boundaries, which leads to visible compression artifacts [1].
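As a minimal numerical illustration of this coding scheme, the Python sketch below (using NumPy and SciPy) transforms a single 8x8 block containing a sharp edge, quantizes the DCT coefficients and reconstructs the block. The single uniform quantization step is a simplifying assumption; real coders use a frequency-dependent quantization table. The oscillating reconstruction error near the edge already shows where ringing originates.

```python
import numpy as np
from scipy.fft import dctn, idctn

def compress_block(block, q_step=40.0):
    """Toy DCT compression of one 8x8 block: transform, quantize, reconstruct."""
    coeffs = dctn(block.astype(float), norm='ortho')  # spatial -> frequency domain
    quantized = np.round(coeffs / q_step)             # lossy step: rounding of coefficients
    return idctn(quantized * q_step, norm='ortho')    # de-quantize and invert

# A block with a sharp vertical edge, the worst case for ringing.
block = np.zeros((8, 8))
block[:, 4:] = 255.0
error = compress_block(block) - block
print(np.round(error, 1))  # the error oscillates on both sides of the edge
```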


Compression artifacts are a particular class of distortions that can become visible in compressed images or videos. The strength of these artifacts depends on the data source, compression bit-rate and underlying compression scheme, and their visibility can range from imperceptible to very annoying [7]. The annoyance of these artifacts is an important measure in the viewer's judgment of visual quality [7]. Therefore, a lot of research effort is devoted to removing or reducing coding artifacts in order to improve the overall perceived quality of artifact-impaired images and videos. Knowledge about the location and strength of individual artifacts is of importance in the field of compression codec design and image quality enhancement. In practice it is very unrealistic to have humans manually examine each image for potential artifacts. In current visual communication systems, the receiving end, for example a TV-set, typically adopts various video enhancement algorithms to reduce compression artifacts. However, complete removal of all targeted artifacts is hardly ever reached in practice, partly due to the limited processing capacity of a digital receiver. Furthermore, removing one type of artifact often generates a new type of artifact. It is suggested in [7] that artifact removal, i.e. which artifact should be removed and to what extent, can be decided based on a ranking scheme of the annoyance generated by individual artifacts, and is dependent on the bit-rate of the stream.

In such a scenario objective metrics are highly needed. Objective metrics can determine the quality degradation caused by artifacts and adapt the processing chain for artifact prioritization and reduction accordingly. Several objective metrics for individual artifacts together can help to decide which artifact should have the highest removal priority, and can give a good prediction of the overall image quality [7]. Since the structural information of artifacts is well known, objective metrics designed for individual artifacts rather than for overall image quality are simpler, and therefore, more realistic [1]. Objective metrics are used because they are repeatable and fast, small differences can be detected, and a quantitative score can be generated per image. In order to successfully quantify individual compression artifacts, one must be able to identify where they appear in a given image and how annoyed viewers are by seeing them. Therefore, objective metrics identify artifacts and characterize their strength of degradation. Since viewers are the ultimate assessors of visual information, objective metrics should be strongly correlated with subjective data. Two categories of metrics can be distinguished: full-reference and no-reference metrics [8]. For full-reference metrics it is assumed that the original, undistorted image is available, such that the difference between the distorted and undistorted image can be calculated. However, in a lot of applications the original image is not available. Also in case image enhancement is applied, it makes no sense to use a full-reference metric, i.e. to use the difference between the enhanced and original image. Here, an objective metric should be deduced from the enhanced image only. These objective metrics are referred to as no-reference metrics. Human observers can easily detect artifacts without comparing with a reference. Automatically quantifying artifacts without the original is very difficult, because the distinction between image features and artifacts is often very complex. No-reference metrics seek to assign a score that is consistent with human perception using knowledge about the artifacts.


The three most annoying compression artifacts of DCT coding are blocking, blurring and ringing [7]. Blocking is defined as discontinuities found at the boundaries of adjacent blocks in reconstructed images. Blurring manifests itself as a loss of spatial detail and a reduction in sharpness of edges in regions of images with moderate to high spatial activity. Ringing is visible near high contrast edges and appears as a rippling outwards from the edge [1]. At low compression bit-rates, blocking is the most annoying artifact, followed by blurring and ringing. At higher compression bit-rates, ringing is the most annoying artifact, followed by blurring and blocking [7]. Because the overall annoyance is thus dependent on the interaction between compression bit-rate and the type of artifact, good objective metrics for all three artifacts are necessary. Most research on artifact quantification and removal was devoted to blockiness and blurriness, which are relatively easy to detect due to their regular structure [8]. Several no-reference metrics have already been developed for these artifacts, and some of them show a very good performance [9][11]. So far, there is no well performing no-reference ringing metric yet. Despite the relevance of having a good ringing metric, only a limited amount of research work was done to quantify perceived ringing. In the future, video enhancement algorithms would highly benefit from a no-reference ringing metric that can predict the location and strength of ringing artifacts in a way strongly correlated with human perception.

1.1 Research Question

The research question in this thesis is how to develop a no-reference ringing metric that can be applied in a video chain in the near future. This no-reference ringing metric should correlate with the human perception of ringing, and therefore, the results of the metric should be evaluated with subjective data from psychovisual experiments. The approach is to first simplify the problem towards quantifying ringing for still images, and to evaluate the ringing metrics available in literature, likewise limited to still images. The main advantages and disadvantages of these metrics can then be used to define potential improvements and simplifications towards the development of a new no-reference ringing metric.

This thesis is structured as follows. The remainder of this chapter introduces the ringing artifact and related objective metrics in more detail. In chapter 2 a detailed description is given of the ringing metrics available in literature and of their main advantages and disadvantages. Chapter 3 describes the approach to the development of a new no-reference ringing metric, and explains which improvements and simplifications are made with respect to existing metrics. The procedure, data analyses and evaluation of two psychovisual experiments are given in chapter 4. Finally, conclusions are drawn in chapter 5 and recommendations are given in chapter 6.


1.2 Ringing Artifact

As mentioned before, the ringing artifact is an oscillation in the spatial domain near sharp intensity transitions, such as edges [1]. In DCT coded images it manifests itself outwards from the edge up to the encompassing block's boundary. The oscillation becomes weaker farther away from the edge. Ringing is most visible perpendicular to high contrast edges. An example of ringing artifacts is shown in Figure 1b and in large size in Appendix A, where ringing is most evident along the edge of the house in the sky. Although the ringing artifact is a well-known problem, only very few remedies to remove it have been reported in the literature. The difficulty with the ringing artifact is that it is relatively hard to detect and to distinguish from image features.

Figure 1 – Example of ringing artifacts: (a) original image, (b) compressed image

The ringing artifact is fundamentally associated with the Gibbs phenomenon. The latter is an overshoot of Fourier series or other eigenfunction series occurring at simple discontinuities [12]. The overshoot is a consequence of trying to approximate a discontinuous function with a finite sum of continuous functions, such as the cosines in the DCT. A finite sum of continuous functions is necessarily continuous, and therefore, cannot approximate the area around the discontinuity within any arbitrarily chosen accuracy limit, as shown in Figure 2. As the comparison between Figure 2a and Figure 2b shows, the error of the overshoot reduces in width and energy, but nonetheless converges to a fixed height as the number of terms rises. The larger the discontinuity, the larger this overshoot will be. Therefore, ringing artifacts are most visible around high-contrast edges in images.
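The overshoot is easy to reproduce numerically. The sketch below approximates a square wave by a finite sum of cosines, using the analytic Fourier cosine coefficients, and prints the overshoot above the step level; it merely illustrates Figure 2 and is not part of any metric.

```python
import numpy as np

x = np.linspace(0.0, np.pi, 4096)

def cosine_partial_sum(n_terms):
    """Partial cosine-series sum of a +/-1 square wave on [0, pi]."""
    approx = np.zeros_like(x)
    for k in range(1, n_terms + 1):
        c_k = 4.0 * np.sin(k * np.pi / 2.0) / (np.pi * k)  # analytic coefficient
        approx += c_k * np.cos(k * x)
    return approx

for n in (15, 30):
    overshoot = cosine_partial_sum(n).max() - 1.0
    # The ripple narrows as n grows, but its height converges to roughly
    # 9% of the jump (the jump is 2, so the overshoot stays near 0.18).
    print(f"{n} cosine terms: overshoot above the step = {overshoot:.3f}")
```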


With JPEG and MPEG compression, individual blocks in an image are represented as a sum of cosine functions oscillating at different frequencies. The weights of these cosine functions are called the DCT coefficients. The subsequent quantization of these DCT coefficients introduces an error into the reconstructed image block. Under high compression the high-frequency DCT coefficients are truncated more severely than under low compression, where they are largely retained; this causes more visible ringing artifacts.

Figure 2 - Gibbs phenomenon in a square wave approximation with cosines: (a) 15 cosines, (b) 30 cosines


1.3 Objective Metrics and the HVS

The human visual system (HVS) is the ultimate assessor of most visual information, taking into account the way humans perceive quality aspects while removing perceptual redundancies. This knowledge can be greatly beneficial for matching objective metrics to human visual perception. Although humans can easily detect image distortions, designing objective metrics is still a very difficult challenge, mainly due to the limited understanding of the HVS. How the HVS extracts an image quality judgment from the information in an image is not completely understood yet [13]. The limitation of metrics based on a purely pixel-based approach can be seen in the mean square error (MSE). The MSE takes the average of the square of the difference in image content between the original and compressed image. This metric insufficiently reflects quality distortions visible to the human eye, and as a result its correlation with subjective image quality is very poor [13]. The performance of this metric has been shown to improve when certain properties of the HVS are incorporated [14]. Therefore, in general, incorporating HVS characteristics in an objective metric may potentially be a useful approach towards a more reliable metric.

Advances in human vision research have provided important insight into the working of the HVS. This knowledge has been primarily adopted to design a variety of computational vision models for image quality assessment [15]. The essential task of incorporating properties of the HVS into a metric lies in simulating its operations, involving some lower level processing (e.g. contrast sensitivity and masking) and some higher level processing (e.g. cognitive interaction) [16]. However, as the HVS is extremely complex, objective metrics based on a full model of the HVS often require a lot of computational power, which conflicts with their implementation in real-time video applications. Therefore, investigations should be carried out to reduce the complexity without compromising the overall performance.

Figure 3 – Illustration of luminance masking on ringing artifacts: (a)-(c), with ringing indicated along the edge in the cloud and in the sky


The HVS can be modeled in objective metrics based on a top-down approach, a bottom-up approach or a combination of both. The top-down or goal-directed approach tries to simulate the HVS based on a specific task, and the bottom-up or stimulus-driven approach tries to simulate the HVS based on features in an image. In general, objective metrics which model the HVS based on a bottom-up approach are less complex, and therefore, better applicable in real-time processing applications. Several objective metrics which model the HVS based on a bottom-up approach already reduced the complexity of the HVS by modeling only a few important properties [21][22]. It is concluded that by modeling only the most important properties of the HVS instead of a full model, the accuracy of the metric is not significantly affected [17]. Two fundamental characteristics of the HVS which affect the visibility of an artifact in the spatial domain are the average luminance surrounding the artifact and the spatial non-uniformity in the background luminance [18]. These phenomena are well known as luminance and texture masking. Masking designates the reduction in detectability of an image component (target) by the simultaneous presence of another (masker), and is strongest when both components have the same or similar frequency, orientation and location [19]. In the case of luminance masking, the visibility of ringing is largely reduced in an extremely dark or light surrounding, while it is observed more easily on a background with an average luminance value [19]. This is described by the Weber-Fechner law, which states that the sensitivity of the human eye to small luminance changes depends on the luminance value itself. Figure 3 shows an illustration of luminance masking. While ringing artifacts appear on both sides of the edge, as can be seen in Figure 3c, the artifacts are only visible in the blue sky and are masked in the white cloud, as shown in Figure 3b, due to the difference in background luminance. Texture masking means that the visibility of ringing is significantly affected by the spatial activity in the local background: ringing is masked when located in a textured region, while it is most visible against a smooth background [20]. The reason for this is that the human eye has a low-pass characteristic: fast changing transitions that differ in chrominance or luminance are not distinguished by the human eye. Figure 4 shows an illustration of texture masking. While ringing artifacts appear on both sides of the edge, as can be seen in Figure 4c, the artifacts are only visible in the smooth sky, while they are masked in the textured wall in Figure 4b.
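To make the two masking effects concrete, the sketch below classifies a small image patch as masking or non-masking from its local mean luminance and its local activity. The threshold values are illustrative assumptions, not the tuned parameters of any of the cited metrics.

```python
import numpy as np

def is_masked(patch, lum_low=30.0, lum_high=220.0, act_max=12.0):
    """Return True if a small artifact inside `patch` is likely visually masked."""
    mean_lum = patch.mean()
    if mean_lum < lum_low or mean_lum > lum_high:
        return True   # luminance masking: too dark or too bright to see ripples
    activity = np.abs(np.diff(patch, axis=0)).mean() + \
               np.abs(np.diff(patch, axis=1)).mean()
    return activity > act_max   # texture masking: a busy background hides ripples

smooth_sky = np.full((8, 8), 128.0)                     # mid-gray, flat
textured_wall = 128.0 + 40.0 * np.random.rand(8, 8)     # busy texture
print(is_masked(smooth_sky), is_masked(textured_wall))  # False, (almost surely) True
```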


Figure 4 - Illustration of texture masking on ringing artifacts: (a)-(c), with ringing indicated along the edge in the sky and in the wall

Because more research has been devoted to perceived blockiness, it is interesting to see how blockiness metrics are built and how properties of the HVS are included in them. This can help in the development of a new ringing metric. Existing no-reference blockiness metrics such as those in [21] and [22] show a good performance and model properties of the HVS. They generally include the following steps:

(1) Detecting the location of the blocking artifacts
(2) Quantifying blockiness purely pixel based
(3) Modeling the HVS, i.e. calculating visibility due to masking
(4) Integrating steps (2) and (3) to calculate a blockiness score

Due to the characteristics of the DCT coding scheme, blocking artifacts are only visible at the borders of the coding blocks and have a regular structure [8]. A no-reference blockiness metric is proposed in [21], where the blocking artifacts are calculated as inter-pixel differences between coding blocks. The HVS model used calculates a weighting coefficient that quantifies the perceptual significance of pixels on the border of the coding blocks due to spatial luminance and texture masking. It is formulated such that luminance masking is assumed to be the most important masking factor. Hence, applying this vision model in a blockiness metric may fail in assessing images where texture masking is dominant. Another drawback of this HVS model is that it does not account for the overlapping effect of luminance and texture masking [24].
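A minimal sketch of steps (1) and (2) of such a blockiness metric is given below: it measures the average absolute inter-pixel difference across the 8x8 block boundaries. The HVS weighting of [21] is deliberately left out here.

```python
import numpy as np

def boundary_blockiness(img, block=8):
    """Average absolute inter-pixel difference across block boundaries."""
    img = img.astype(float)
    # np.diff(img, axis=1)[:, k] equals img[:, k+1] - img[:, k], so index
    # block-1 sits exactly on a vertical block boundary.
    v = np.abs(np.diff(img, axis=1))[:, block - 1::block].mean()
    h = np.abs(np.diff(img, axis=0))[block - 1::block, :].mean()
    return (v + h) / 2.0

img = np.random.randint(0, 256, (64, 64))
print(boundary_blockiness(img))  # higher scores indicate stronger block edges
```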


An alternative no-reference blockiness metric is proposed in [22], where the amount of blockiness is calculated for individual coding blocks, and simultaneously its visibility to the human eye is estimated by means of a just-noticeable-distortion (JND) profile based on the work of [23]. This JND profile provides an image in which pixels with a high score are clearly more visible to humans than pixels with a low score. The JND profile is estimated by analyzing the local intensity and local spatial frequency around pixels, and therefore incorporates both the HVS properties of luminance masking and texture masking. The blockiness quantification for an individual coding block is the product of the purely pixel-based blockiness quantification in the coding block and the JND score averaged over all pixels in the coding block. The blockiness quantification for the whole image is then the average of the blockiness quantification over all individual coding blocks. This metric shows a promising performance, but a main drawback is that it needs to estimate the JND profile for every pixel in the image [22]. This largely increases the computational cost and complexity of the HVS model.

Modeling the HVS only in regions contaminated with artifacts, instead of modeling it for every pixel, might be a useful approach towards the development of a new ringing metric with a reduced computational cost. However, due to the regular structure of blocking artifacts it is far easier to determine the exact position of blocking artifacts than of ringing artifacts. Therefore, a reliable method to determine the exact regions contaminated with ringing artifacts will be greatly beneficial. Calculating the HVS model only for pixels within these regions leads to a minimum amount of calculations for the HVS model. An additional difficulty arises when representing the overall masking effect of a pixel as one single coefficient. Firstly, the weight of the different masking effects has to be determined; by giving more importance to one specific masking effect, the method can fail in assessing images where another masking effect is dominant. Secondly, the overlap of masking effects has to be determined in the calculation of the overall masking, which increases the complexity of the model. A useful approach for a HVS model in an objective metric can be one that removes regions that are visually masked based on each individual masking effect separately. The regions that remain can then be further evaluated for artifacts. This is also done in this approach; the weightings for each individual masking effect and for their overlap do not have to be determined, which makes the model less complex and more robust for a wide variety of image content.


1.4 Ringing Metrics

Until recently, only a limited amount of research has been devoted to quantifying perceived ringing, and as a result only a small number of ringing metrics exist in literature. The performance of these proposed metrics is often unknown due to missing evaluations with subjective data. They can roughly be classified into two categories: statistical-based metrics and structure-specific metrics.

Statistical-based metrics determine the amount of ringing from a statistical analysis of image features. One statistical-based metric works by first training a classifier that distinguishes "distorted" from "undistorted" edge points in compressed images based on the local features of the edge points [25]. To enhance the performance, a principal component analysis is used to reduce the dimensionality of the feature space. These local features are used to quantify the amount of ringing around the edge points. An alternative no-reference ringing metric uses the percentage of high frequency energy as an indicator of the amount of ringing artifacts in an image [26]. A third metric [28] first determines the ringing regions, which are defined as the areas surrounding strong horizontal and vertical edges. Ringing is quantified as the ratio of pixel activity at middle-low and middle-high frequencies, localized within these ringing regions. These three statistical-based metrics take into account neither the specific nature of the ringing artifact itself nor its visibility. Generally, ringing artifacts become weaker farther away from the edge and are most visible perpendicular to strong edges. Because the shape and strength of ringing artifacts depend on the distance from and direction of the local edge, it is very doubtful that statistical-based metrics will ever give a good indication of perceived ringing.

An alternative class of metrics is formed by the structure-specific metrics, which take into account typical characteristics of the ringing artifact. Since blocking artifacts are only visible on the borders of coding blocks, which share the same size and structure, they can easily be detected [8]. Ringing artifacts, however, are relatively hard to detect and distinguish from image detail. Therefore, a real challenge in the development of a structure-specific ringing metric lies in the difficulty of modeling the difference between ringing artifacts and image detail. Because of this, structure-specific ringing metrics are usually split into two separate parts:

(1) Ringing region detection, i.e. detecting the regions where ringing artifacts can potentially occur
(2) Ringing quantification, i.e. quantifying the visibility of ringing artifacts within these ringing regions

A full-reference structure-specific ringing metric, proposed in [27], defines ringing regions as the areas near strong vertical edges. Ringing is quantified as the difference in intensity between the original and compressed image for pixels within these regions. A no-reference approach of this metric [31] compares the mean and standard deviation of pixel intensities within the ringing regions. The amount of pixels with a similar mean intensity as neighboring pixels, but with a different standard deviation, defines the amount of ringing. It is clear that these metrics strongly rely on the performance of the edge detector used. This edge detector captures strong edges based on the gradient magnitude only, which means that pixels with a gradient magnitude larger than a certain threshold are considered part of an edge. The ringing metrics then simply assume that ringing artifacts occur unconditionally in those detected regions. The authors of [31] conclude that the reliability of the metric can be improved by using a more advanced edge detection technique. Especially at higher compression, due to increasing blur, the number of edges found by the edge detector for a given threshold value decreases, such that some of the important ringing regions are missed. This problem may be circumvented by lowering the threshold, but then one runs the risk of increasing the computational cost by modeling the HVS near irrelevant edges.

In all the structure-specific metrics discussed so far, the ringing artifacts within the ringing regions are quantified without considering their visibility. It is not determined whether these ringing regions are highly textured or contain extremely low or high pixel intensities. Therefore, ringing artifacts potentially visually masked by the background are also taken into account. This issue is circumvented by incorporating properties of the HVS into the detection methods, as in [29] and [30]. The no-reference ringing metric in [29] is built upon a global edge map of an image, and uses binary morphological operators to generate a mask exposing regions that are likely to be contaminated with visible ringing artifacts. Ringing regions in this mask are then further evaluated based on perceptual criteria to decide whether ringing in a particular region is actually visually masked. Finally, ringing in the whole image is only quantified in the regions which are not masked. A similar approach to generate a mask for regions that are likely contaminated with visible ringing artifacts is used in [30]. However, a different way of including HVS masking properties is employed. Here, a clustering analysis is performed over the smooth regions of an image to classify them into different objects. Then, masking is incorporated by comparing the mean luminance and activity in a potential ringing region to those of its corresponding object. The ringing metrics of [29] and [30] look promising, since a HVS model is applied to determine only those ringing regions with visible ringing artifacts. However, the models of the HVS used are not optimal in their current form and some parts are computationally very expensive. The HVS model in [29] involves a decision mechanism based on parameters (e.g. variance and threshold), and the procedure towards the selection of the optimal parameters requires a number of calculations. The major cost of the HVS model in [30] is introduced by its clustering method, which contains color clustering and texture clustering. This is an intensive process, involving an analysis of color and texture characteristics over the entire image. As a consequence, it is not realistic that it can be applied in a video chain in its current form.

In the next chapter several interesting metrics available in literature are discussed in more detail. The main advantages and disadvantages of each will be described to define potential improvements and simplifications towards the development of a new no-reference ringing metric.

2 Literature

In this chapter some of the more interesting ringing metrics discussed in the introduction are described in more detail. A closer look at the main advantages and disadvantages of each ringing metric will be used later to define potential improvements and simplifications towards the development of a new no-reference ringing metric. This chapter is structured as follows. Firstly, two statistical-based ringing metrics are discussed. Secondly, two structure-specific ringing metrics that do not include a HVS model are described. Finally, two structure-specific ringing metrics that do include a HVS model are described.

2.1 PCA Ringing Metric [31]

Based on ideas published earlier [25] a no-reference statistical-based ringing metric is defined that scales with the compression level of an image. Distortions around edges in compressed images are mainly caused by ringing artifacts. The PCA ringing metric extracts image features around edges and uses a general pattern recognition approach to calculate how many of the edge pixels are distorted when an image is compressed.

Edges can have many different orientations, and the features should be measured in the same way for all edge pixels regardless of the edge orientation. Therefore, the feature extraction template shown in Figure 5 is rotated such that it is always perpendicular to the edge. To find the edges, a Sobel operator is applied to the luminance component of the image. Edge centerlines are found through the application of a morphological skeletonization operation. Once the orientation and location of the edge centerline pixels are known, the feature extraction template is placed on each edge centerline pixel and all features are extracted. Assuming that the width of the edge is 1 pixel, the template is 9 by 3 pixels in size. The mean and standard deviation over the pixel values in the 2 by 3 pixel areas A, B, C and D are used as features, together with the pixel values in the 9 by 1 pixel profile. Thus, the total number of extracted feature values is 8 + 9 = 17.

The dimensionality of the feature space is reduced using a regular Principal Component Analysis (PCA). PCA is a technique that is often applied in pattern recognition, because it can also boost performance even when dimensionality reduction is not strictly necessary. In the PCA ringing metric eleven principal components are used, because then the results are better than when all seventeen original features are used. The classifier is then trained with a set of fifty images, twenty-five of which are labeled as compressed and twenty-five as uncompressed. The ringing metric of an unseen image is defined as the number of edge pixels classified as distorted divided by the total number of edge pixels.
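The pattern-recognition stage can be sketched as follows. The 17-dimensional feature vectors are assumed to have been extracted per edge pixel as described above (random data stands in for them here), and logistic regression is an illustrative stand-in, since the type of classifier is not specified here.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X_train = rng.normal(size=(5000, 17))    # stand-in features from the 50 training images
y_train = rng.integers(0, 2, size=5000)  # 1 = edge pixel from a compressed image

# Reduce the 17 features to 11 principal components, then classify.
clf = make_pipeline(PCA(n_components=11), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

X_test = rng.normal(size=(800, 17))         # edge-pixel features of an unseen image
ringing_score = clf.predict(X_test).mean()  # fraction of edge pixels classified as distorted
print(ringing_score)
```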


Figure 5 - Feature extraction template

To test the performance of this metric, fourteen viewers were asked to evaluate perceived ringing in a set of fourteen stimuli at two different compression levels [31]. The resulting scores of this experiment were compared to this ringing metric. The Pearson linear correlation and the Spearman rank order correlation were 0.14 and 0.24 respectively. Even though this metric showed promising results in preliminary experiments, its correlation with subjective data is almost non-existent in this experiment.

The metric presented in [25] was defined to predict the compression level of an image, and not the amount of visible ringing. The influence of image content on the visibility of ringing was also recognized in [25], but not taken into account by the ringing metric. Apart from these two possible reasons for the failure of this metric in the experiment, the lack of sufficient training material might also have played a role. A possible way to improve the performance of this metric might be to only include features near edges with visible ringing instead of features near all edges.

2.2 Ratio Ringing Metric [28]

The ratio ringing metric is a no-reference statistical-based ringing metric, which simply assumes that ringing manifests itself as a ratio of middle-low and middle-high intensities in regions surrounding strong edges. It first creates a binary ringing mask by detecting and dilating the strong edges in the image. The edge detector and structuring element used for this purpose are not mentioned. The operations are performed on the luminance component of the image only. Ringing regions are detected by applying the binary ringing mask to this luminance component. Then, ringing is quantified by summing over all ringing regions the ratio of region activity at middle-low intensity over region activity at middle-high intensity. The overall ringing quantification is then defined as:

$$\mathrm{RM} = \frac{N(\text{Ringing Mask})}{m \cdot n} \cdot \frac{N(A_{RmEdge})}{N(A'_{RmEdge})} \qquad (1)$$

where m is the total number of pixel rows and n the total number of pixel columns in the image, and N(A_RmEdge), N(A'_RmEdge) and N(Ringing Mask) are the numbers of non-null pixel values in the respective regions. With I(i,j) the pixel intensity in row i and column j, the middle-low and middle-high activity regions A_RmEdge and A'_RmEdge are defined as:

$$A_{RmEdge} = \{(i,j) \in \text{Ringing Mask} \mid T_{low} < I(i,j) \le T_{mid}\} \qquad (2)$$

$$A'_{RmEdge} = \{(i,j) \in \text{Ringing Mask} \mid T_{mid} < I(i,j) \le T_{high}\} \qquad (3)$$

where T_low, T_mid and T_high are thresholds for the low, middle and high intensities.
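A sketch implementing this quantification, following the reconstruction of equations (1)-(3) above, is given below. The threshold values are illustrative assumptions, since they are not specified in [28].

```python
import numpy as np

def ratio_ringing_metric(lum, ringing_mask, t_low=64, t_mid=128, t_high=192):
    """Ratio of middle-low to middle-high intensity activity inside the mask."""
    m, n = lum.shape
    in_mask = ringing_mask.astype(bool)
    mid_low = in_mask & (lum > t_low) & (lum <= t_mid)    # A_RmEdge, eq. (2)
    mid_high = in_mask & (lum > t_mid) & (lum <= t_high)  # A'_RmEdge, eq. (3)
    if mid_high.sum() == 0:
        return 0.0
    return in_mask.sum() / (m * n) * mid_low.sum() / mid_high.sum()  # eq. (1)

# Usage on a toy luminance image with a dilated-edge mask assumed precomputed.
lum = np.random.randint(0, 256, (64, 64))
mask = np.zeros((64, 64), dtype=bool)
mask[:, 28:36] = True
print(ratio_ringing_metric(lum, mask))
```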

This metric is very simple, but has the main drawback that not only ringing, but also texture can have a middle-high and middle-low activity. Therefore, it is of great importance to split ringing pixels from texture pixels and to take only the visible ringing pixels into account in the quantification of ringing. Because there is no model defined to separate texture pixels from ringing pixels, the ringing quantification is affected by the amount of texture in an image. In most natural images, even in highly compressed ones, there exists more texture than ringing. The quantification of ringing in this metric is therefore probably more dependent on the amount of texture than on the amount of ringing artifacts in an image. Hence, it is very doubtful that this metric will give a good indication of perceived ringing.


2.3 Horizontal Ringing Metric [27]

In this full-reference structure-specific metric, ringing is quantified only in the horizontal direction along strong vertical edges. A Sobel operator is applied to the luminance component of the image to detect vertical edges. Noise and insignificant edges are then removed by applying a threshold to the gradient image. The widths at the left- and right-hand side of the edge are defined as |P3 – P1| and |P2 – P1| respectively, and the left- and right-hand side ringing regions are defined as |P3' – P3| and |P2' – P2| respectively. These regions are shown in Figure 6. The image is then scanned row by row from top-left to bottom-right. When a ringing region is detected in a row, the difference in luminance between the reference and distorted image is calculated for each individual pixel in the left and right parts of the ringing region separately. The scores for the left and right ringing regions are then defined as:

$$R_{left} = \sum_{(i,j) \in \text{left region}} \left| I_{ref}(i,j) - I_{dist}(i,j) \right|, \qquad R_{right} = \sum_{(i,j) \in \text{right region}} \left| I_{ref}(i,j) - I_{dist}(i,j) \right| \qquad (4)$$

where I_ref and I_dist denote the luminance of the reference and distorted image. The overall ringing quantification is then defined as the accumulated left and right scores over all detected ringing regions, normalized by the number of ringing-region pixels:

$$\mathrm{HRM} = \frac{1}{N_{ringing}} \sum_{\text{regions}} \left( R_{left} + R_{right} \right) \qquad (5)$$

where N_ringing is the total number of pixels in the detected ringing regions.

To test the performance of the Horizontal ringing metric (HRM), ten expert viewers were asked in an experiment to give their judgment of perceived ringing in a set of fifty-four stimuli at six different compression levels [27]. The results of this experiment were compared to the HRM. The Pearson linear correlation and Spearman rank order correlation were 0.85 and 0.86 respectively. It is concluded in [27] that the lower correlation of the HRM compared to the full-reference blur metric is due to the fact that subjects found it more difficult to evaluate ringing than blur. This metric quantifies ringing only along the horizontal direction instead of in both directions. Therefore, it is possibly not robust for the assessment of images with specific content, i.e. images with a lot of edges in the horizontal direction. A simple way to improve this metric would be to quantify ringing also in the vertical direction along strong horizontal edges. Apart from this, the Pearson linear correlation and Spearman rank order correlation show a good correlation with subjective data. In the next section a no-reference ringing metric [31] based on the ideas of the HRM is discussed. For the development of a new no-reference ringing metric it is interesting to see the difference in performance between the no-reference metric in [31] and this full-reference metric.

2.4 No-Reference Ringing Metric [31]

Based on the ideas of the HRM, a no-reference structure-specific ringing metric (NRRM) is proposed. Figure 6, obtained from [27], shows the intensity values of the pixels in one row. The dotted line L1 is one row of the original image and the solid line L2 is the same row of the compressed image. Ringing can be observed around the edge at P1, between P3 and P3' as well as between P2 and P2'. Ringing is detected by analyzing the statistics of neighboring pixels to the left and right of each pixel in the ringing regions P3 to P3' and P2 to P2'. A pixel is defined as a ringing pixel if the intensities of its left mean and right mean are similar, but the left standard deviation and right standard deviation are dissimilar (specifically, one is close to zero and the other is not).

Figure 6 – Intensities of pixels in one image row

A detected ringing pixel at location (i,j) can be described as follows:

$$\mathrm{Ring}(i,j) = 1 \iff \left| \mu_L(i,j) - \mu_R(i,j) \right| < \alpha \;\wedge\; \bigl(\sigma_L(i,j) < \beta\bigr) \oplus \bigl(\sigma_R(i,j) < \beta\bigr) \qquad (6)$$

where μ_L(i,j) and μ_R(i,j) denote the left and right mean, σ_L(i,j) and σ_R(i,j) denote the left and right standard deviation for a pixel in row i and column j, and ⊕ denotes the exclusive or. The α and β are threshold values, which can be tuned to optimize the performance of this metric (α = 3 and β = 5 are used).


The left and right mean of a pixel are defined as:

$$\mu_L(i,j) = \frac{1}{r}\sum_{k=1}^{r} I(i,j-k), \qquad \mu_R(i,j) = \frac{1}{r}\sum_{k=1}^{r} I(i,j+k) \qquad (7)$$

The left and right standard deviation of a pixel can be described as follows:

$$\sigma_L(i,j) = \sqrt{\frac{1}{r}\sum_{k=1}^{r}\bigl(I(i,j-k)-\mu_L(i,j)\bigr)^2}, \qquad \sigma_R(i,j) = \sqrt{\frac{1}{r}\sum_{k=1}^{r}\bigl(I(i,j+k)-\mu_R(i,j)\bigr)^2} \qquad (8)$$

where I(i,j) is the pixel intensity in row i and column j, and r is a parameter defining the range over which the mean and standard deviation are calculated (r = 5 is used).

Ringing can then be simply quantified as the sum of detected ringing pixels normalized by the total number of pixels in a given image:

$$\mathrm{NRRM} = \frac{1}{m \cdot n}\sum_{i=1}^{m}\sum_{j=1}^{n} \mathrm{Ring}(i,j) \qquad (9)$$
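The complete NRRM detection and quantification of equations (6)-(9) can be sketched as follows, with the stated parameter values r = 5, α = 3 and β = 5. The double loop favors clarity over speed, and the ringing-region mask is assumed to have been computed beforehand.

```python
import numpy as np

def nrrm(lum, ringing_mask, r=5, alpha=3.0, beta=5.0):
    """Fraction of pixels with similar left/right means but dissimilar
    left/right standard deviations (one near zero, one not)."""
    h, w = lum.shape
    lum = lum.astype(float)
    count = 0
    for i in range(h):
        for j in range(r, w - r):
            if not ringing_mask[i, j]:
                continue
            left = lum[i, j - r:j]           # the r neighbors to the left
            right = lum[i, j + 1:j + r + 1]  # the r neighbors to the right
            means_similar = abs(left.mean() - right.mean()) < alpha        # eqs. (6),(7)
            stds_dissimilar = (left.std() < beta) != (right.std() < beta)  # eqs. (6),(8)
            if means_similar and stds_dissimilar:
                count += 1
    return count / (h * w)                   # eq. (9)

lum = np.random.randint(0, 256, (64, 64))
mask = np.zeros((64, 64), dtype=bool)
mask[:, 20:44] = True
print(nrrm(lum, mask))
```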

To test the performance of the NRRM, fourteen viewers were asked to give their judgment of perceived ringing in a set of fourteen stimuli at two different compression levels [31]. The results of this experiment were compared to the NRRM. The Pearson linear correlation and Spearman rank order correlation were lower than for the full-reference metric, at 0.65 and 0.52 respectively. Although the HRM showed promising results, the correlation with subjective data of the NRRM is somewhat disappointing. It is mentioned in [31] that there were quite some outliers. Most of these outliers combined a high score for the metric with a low subjective score. These outliers were probably due to the absence of luminance and texture masking in the NRRM. Some ringing artifacts were surrounded by a high/low luminance background or a lot of texture, which affected their visibility. The absence of a HVS model clearly limited the reliability of the NRRM. Moreover, at higher compression the number of edges found by the edge detection decreased because of increased blur. As a result, some of the important ringing regions were not found anymore. Apart from these two possible reasons for the mediocre results of the NRRM in the experiment, the lack of quantifying ringing in the vertical direction perpendicular to horizontal edges might also have played a role. A possible way to improve the performance of this metric might be to quantify ringing also in the vertical direction along strong horizontal edges, to include a more robust edge detector, and to incorporate certain properties of the HVS, such as spatial masking.


2.5 Morphological Filtering Ringing Metric [29]

The morphological filtering based ringing metric (MFRM) is one of the most advanced ringing metrics found in literature. It takes ringing morphology as well as certain properties of the HVS into account. As such, it belongs to the structure-specific metrics. An overview of the MFRM is shown in Figure 7.

Figure 7 – Overview of the MFRM: (1) edge detection, (2) filtering against noise, (3) edge linking, (4) binary closing, (5) binary dilation, (6) ex-or, (7) HVS-based modification, (8) ringing quantification; intermediate outputs labeled (a)-(h)

Initially, edge detection (1) is applied to the luminance component of the input image to detect the strong edges. To remove texture and noise, the edge map is filtered using isotropic pruning (2). However, the isotropic pruning operation not only removes texture but also damages some edges. Therefore, in the next step edge linking (3) is applied to repair the damaged edges. Next, binary closing (4) and dilation (5) are applied. The ex-or (6) of both results forms the raw ringing mask (g). The parts of this raw ringing mask that are visually masked by texture or luminance are removed by the HVS model (7), with the final ringing mask (h) as result. The overall ringing quantification (8) is calculated over all pixels within the final ringing mask. Examples of the outputs are shown in Figure 8.

Figure 8 – Outputs of the MFRM, steps (a) through (h)


Each step is described in more detail below. In step 1 the Sobel edge operator is applied to the luminance component of the input image. This results in a raw edge map (b). This raw edge map is filtered against noise in step 2 using Hit-or-Miss transformations and isotropic pruning, resulting in a cleaned edge map (c). To be more specific, first a few iterations of isotropic pruning are employed to trim short line segments down to single pixels. The number of iterations depends on the amount of noise in the raw edge map.

Isotropic pruning is a modification of the thinning operation and can be defined as:

$$A \otimes S = A \setminus \left( A \circledast S \right)$$

where A ⊛ S denotes the Hit-or-Miss transform of A by the structuring element S, and \ refers to the set difference operator.

Then, a single iteration of the morphological Hit-or-Miss transform is applied to remove isolated pixels. A Hit-or-Miss transformation is a template matching process. If A and S are subsets of Z², and S is partitioned into a foreground template S₁ and a background template S₂, the Hit-or-Miss transform of A by S with respect to the given partition of S is denoted as A ⊛ S and is defined as:

$$A \circledast S = \left\{\, z \mid (S_1)_z \subseteq A \;\text{and}\; (S_2)_z \subseteq A^{c} \,\right\}$$

In practice, a hit is scored if and only if, upon translation of the structuring element S by the vector z, S₁ and S₂ fit into the foreground and background regions of the given binary image respectively. The Hit-or-Miss transformation marks the locations of the hit-scoring translations, as shown in Figure 9.

Figure 9 - Example of a Hit-or-Miss operation
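In practice the transform is available off the shelf. The sketch below uses SciPy's binary_hit_or_miss to find and remove isolated pixels from a binary edge map, with a single-pixel foreground template and its empty 8-neighborhood as the two halves of the structuring element.

```python
import numpy as np
from scipy.ndimage import binary_hit_or_miss

edge_map = np.zeros((7, 7), dtype=bool)
edge_map[1:6, 3] = True   # a genuine five-pixel edge segment
edge_map[5, 1] = True     # an isolated noise pixel

hit = np.zeros((3, 3), dtype=bool)
hit[1, 1] = True          # S1: the pixel itself must be foreground
miss = ~hit               # S2: all eight neighbors must be background

isolated = binary_hit_or_miss(edge_map, structure1=hit, structure2=miss)
cleaned = edge_map & ~isolated   # the set difference A \ (A (*) S)
print(cleaned.astype(int))       # the segment survives, the noise pixel is gone
```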


The author of [29] states that the Hit-or-Miss transformation and isotropic pruning reduce noise, but also cause spurious breaks and gaps in the raw edge map. The Sobel edge operator is also not able to detect all edges where ringing occurs. Therefore, an edge linking operation is applied to the cleaned edge map (c) to obtain a final edge map (d). A high-level description of the edge linking operation is given below.

1 Perform one Hit-or-Miss transformation to detect and mark the open ends of the edges.

2 Visit the next open edge end of the cleaned edge map in raster scan order and apply the following recursive processing at each of these pixels.

a. Mark the current pixel as an edge pixel. Go to step 2b.
b. If the current pixel is on the image border then go to step 2, otherwise go to step 2c.
c. Find the direction of the local gradient vector and classify its direction into one of the classes shown in Figure 10. Go to step 2d.
d. According to the classification, determine the set of candidate edge tracking directions as shown in Figure 10. Go to step 2e.
e. If the current pixel has at least one neighboring edge pixel in group 1's support and also has at least one neighboring edge pixel in group 2's support, then go to step 2. Otherwise go to step 2f.
f. If the current pixel has at least one neighboring edge pixel in group 1's support, go to step 2h. Otherwise go to step 2g.
g. Within group 1's support, find the pixel with the greatest scaled squared gradient magnitude. If this scaled squared gradient magnitude is greater than or equal to 0.2, then mark this pixel to be the current pixel and go to step 2a. Otherwise go to step 2.
h. Apply step 2g for group 2's support instead of group 1's support.

Figure 10 - Candidate edge tracking directions and candidate edge pixels


Next, a binary closing operation is applied to the final edge map (d) to obtain the edge mask (e). The logical ex-or of the edge mask and the dilated edge mask (f) forms the raw ringing mask (g) around the edges. After this procedure some missed edges or texture still remain in this raw ringing mask. Therefore, it is further reduced to only those pixels where ringing artifacts are visible. The solution in this approach is based on a decision process taking human visual perception characteristics into account to distinguish ringing from other image features. The regions which contain visible ringing artifacts are preserved in the final ringing mask (h), and image detail actively masking ringing artifacts is excluded. The decision process is based on two aspects: texture masking and luminance masking. For texture masking, the local variance in pixel intensities is evaluated against the distance from, and the gradient strength of, the nearest edge. A study on the structure of synthetic ringing artifacts revealed that the ringing oscillations are governed by an envelope of peak strength decreasing away from the edge, and that the variance of ringing pixels behaves as a quadratic function of the nearest edge gradient value. This is implemented by comparing the local variance, LV(i,j), of a pixel at location (i,j) within a 4x4 window in the raw ringing mask to a threshold.

The local variance can be defined as:

$$LV(i,j) = \frac{1}{16}\sum_{(k,l) \in W_{4\times4}(i,j)} \bigl( I(k,l) - LM(i,j) \bigr)^2$$

where I(k,l) defines the intensity value of a pixel at location (k,l) and W_{4×4}(i,j) is the 4x4 window around location (i,j).

The threshold is calculated as the product of a distance factor and the squared edge gradient:

$$T_{LV}(i,j) = d(i,j) \cdot G_{edge}^2$$

where the "distance to the edge" factor d(i,j) can have a value of 1.0, 0.5 or 0.125 depending on the distance to the edge, and the edge gradient G_edge is the mean gradient value of the edge nearest to this window.

Luminance masking is based on the observation that ringing artifacts in a region with a very high or very low intensity are not visible to humans. This is implemented by comparing the local mean intensity, LM(i,j), of a pixel at location (i,j) within the same 4x4 window to a lower and upper intensity threshold.


The local mean intensity can be defined as:

$$LM(i,j) = \frac{1}{16}\sum_{(k,l) \in W_{4\times4}(i,j)} I(k,l)$$

The global lower and upper intensity thresholds, T_low and T_high, for pixels within this window are two constants whose values are not given in [29].

The final ringing mask (h) is then calculated by keeping only those parts of the raw ringing mask that are neither texture nor luminance masked:

$$R_{final} = \left\{ (i,j) \in R_{raw} \mid LV(i,j) < T_{LV}(i,j) \;\text{and}\; T_{low} < LM(i,j) < T_{high} \right\}$$

The final ringing mask is the area of the image within which the visible ringing is detected. The ringing quantification in step 8 is calculated as the mean local variance over this mask:

$$\mathrm{MFRM} = \frac{1}{M}\sum_{(i,j) \in R_{final}} LV(i,j)$$

where M is the total number of pixels in the final ringing mask.
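The masking decision and quantification can be sketched as follows. For brevity the sketch evaluates non-overlapping 4x4 windows rather than a window around every pixel, the per-pixel maps of the nearest-edge gradient and the distance factor are assumed precomputed, and the luminance thresholds are assumptions, since the constants of [29] are not given.

```python
import numpy as np

def mfrm_quantify(lum, raw_mask, edge_grad, dist_factor, t_low=16.0, t_high=235.0):
    """Mean local variance over the final ringing mask (steps 7-8 of the MFRM).

    `edge_grad` and `dist_factor` are assumed precomputed per-pixel maps: the
    mean gradient of the nearest edge and the distance factor in {1.0, 0.5, 0.125}.
    """
    h, w = lum.shape
    lum = lum.astype(float)
    scores = []
    for i in range(0, h - 3, 4):             # non-overlapping 4x4 windows for brevity
        for j in range(0, w - 3, 4):
            if not raw_mask[i:i + 4, j:j + 4].any():
                continue                      # window lies outside the raw ringing mask
            win = lum[i:i + 4, j:j + 4]
            lm, lv = win.mean(), win.var()    # LM(i,j) and LV(i,j)
            t_texture = dist_factor[i, j] * edge_grad[i, j] ** 2
            if t_low < lm < t_high and lv < t_texture:  # neither luminance nor texture masked
                scores.append(lv)
    return float(np.mean(scores)) if scores else 0.0

# Usage with toy inputs: a faint ripple on a mid-gray field stays below the
# texture threshold, so its windows survive the masking decision.
ripple = 2.0 * np.sin(np.arange(64.0) / 2.0)
lum = 128.0 + np.tile(ripple, (64, 1))
raw_mask = np.zeros((64, 64), dtype=bool); raw_mask[:, 24:40] = True
edge_grad = np.full((64, 64), 3.0)
dist_factor = np.full((64, 64), 0.5)
print(mfrm_quantify(lum, raw_mask, edge_grad, dist_factor))
```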


Ringing regions are detected in the MFRM by applying Sobel edge detection, which captures strong edges based on the gradient magnitude. This means that pixels with a gradient magnitude larger than a certain threshold are considered part of an edge. However, finding an optimal threshold for a specific image is very difficult. This leads to missing obvious ringing regions near perceptually relevant edges in case of a high threshold, or to too many spurious ringing regions near texture in case of a low threshold. Therefore, it is very doubtful whether this edge detection method will provide reliable results.

An interesting aspect of the MFRM over the metrics previously discussed is that texture and noise are removed from the edge map by using Hit-or-Miss transformations and isotropic pruning. However, these operations not only remove noise, but also damage edges. To restore these damaged edges, an edge linking operation is applied afterwards. However, this edge linking operation is only able to make small edge restorations and requires considerable computational power and complexity. This makes the MFRM not very attractive for application in a video chain.

The HVS model of the MFRM looks promising since it is only applied to the ringing regions instead of to the whole image. The HVS model scans each ringing region, and a simple comparison is made to determine whether pixels within a small window in the ringing region are visually masked by texture or luminance. Due to the reduced complexity of this approach it is interesting to incorporate a similar HVS model in the new ringing metric. However, ringing artifacts are sometimes hard to distinguish from image detail, even within ringing regions. Therefore, the measured amount of texture and luminance masking is also affected by the amount of ringing artifacts within these ringing regions. Parts of ringing regions heavily distorted with ringing artifacts can be classified as highly masked regions, while they actually are not, and will therefore wrongly be removed by the HVS model. This is definitely a problem for the HVS model of the MFRM.

The MFRM uses the mean level of variance within the ringing regions as quantification of ringing visibility. The calculation of the variance for individual pixels within the ringing regions is relatively simple, and therefore, interesting for the new ringing metric. However, a main drawback is that the ringing regions are not always perfectly smooth. The amount of variance within ringing regions is not only dependent on ringing artifacts, but also on image content. Therefore, it is doubtful whether the MFRM gives reliable results for a variety of image content.


2.6 Region Clustering Ringing Metric [30]

The region clustering based ringing metric (RCRM) is a structure-specific ringing metric which uses a region detection model similar to that of the MFRM [29]. An overview of the metric is shown in Figure 11.

Figure 11 – Overview of the RCRM: (1) image conversion, (2) edge detection on the Y, Cb and Cr components, (3) gradient combining, (4) raw edge generation, (5) edge thinning, (6) noise removal, (7) edge linking, (8) ringing region detection, (9) smooth region detection, (10) color clustering, (11) texture clustering, (12) false region removal, (13) ringing quantification

Initially, the input image is converted (1) to YCbCr and the Sobel operator is applied to each of the individual Y, Cb and Cr components (2), resulting in three maps of gradient magnitudes. The overall gradient magnitudes are then a combination (3) of the individual Y, Cb and Cr gradient magnitudes. An edge map is generated by applying a threshold to these overall gradient magnitudes (4) and using a morphological skeletonization operation to thin (5) the edges. Similarly as in [29], a noise removal (6) and edge linking (7) operation are applied to the edge map. From this map the ringing regions are detected (8) using a binary dilation operation. A clustering method (9) is then applied to find all smooth regions in the image. Then, a combination of color (10) and texture (11) clustering is performed over all these smooth regions to detect the different objects in the image. False regions are removed (12) by comparing the variance of pixels within the ringing regions with the strength of nearby edges. The overall ringing quantification (13) is calculated by comparing the features of the ringing regions with the objects to which they belong, including texture and luminance masking properties. Each step is described in more detail below.

In step 1 the image is converted to YCbCr, and in step 2 the Sobel operator is applied to the individual Y, Cb and Cr components of the input image. This results in three maps of gradient magnitudes, one for each individual component. In step 3 the gradient magnitudes G_Y, G_Cb and G_Cr of the Y, Cb and Cr components are then combined into an overall gradient magnitude per pixel.


Pixels with an overall gradient magnitude larger than a certain threshold are initialized as edge pixels in step 4. These raw edge pixels are thinned using a morphological skeletonization operation in step 5, resulting in an edge map. The authors of [30] state that in steps 6-8, similarly to the MFRM [29], a noise removal and edge linking operation are applied on the edge map, and the ringing regions are detected using binary dilation on this map. The key idea in step 9 is that ringing artifacts are only perceivable in smooth regions, in which they have a higher activity than the object to which they belong. An example is shown in Figure 12a and in large size in Appendix B. Ringing artifacts along the bars are well visible against the smooth background of the sky, while they are hardly visible in the water, where they are masked by the texture of the water. Therefore, it is of interest to identify different objects in an image and to study their texture. In general, an image is segmented into three categories: edge, potential ringing region and potential smooth region. Potential ringing regions are regions surrounding edges, and potential smooth regions are all regions other than edges or potential ringing regions. Usually, textured regions in an image are covered with edges, and therefore, are not recognized as smooth regions. The noise filter applied in step 6 might also remove some edges, and therefore, lightly textured regions can be misclassified as smooth regions. To cover this, the name "potential" smooth region is used. An example of the identification of the edges, potential ringing regions and potential smooth regions from Figure 12a is shown in Figure 12b. Black represents the edges, gray the potential ringing regions and white the potential smooth regions.

In step 10 different objects within the potential smooth regions are identified using a clustering process. Similar to [32] the objects are clustered on a combination of color and texture. In the color clustering process objects are clustered into different categories according to their color similarity. The Y, Cb and Cr components are modelled independently of each other. The color distribution of a region s is represented with a Gaussian model $N(\mu_s, \sigma_s^2)$, with mean $\mu_s$ and variance $\sigma_s^2$. A hierarchical clustering scheme is then applied sequentially in the components. The objects to be clustered are represented by $\{O_1, \ldots, O_n\}$, with n defining the number of clusters (n = 10 is used). Each of the objects is initialized to be a single cluster. The similarity between two clusters $O_i$ and $O_j$ is measured with the method of [33].

The two most similar clusters are merged into one cluster. This merging process is performed iteratively until no two clusters are left that are sufficiently similar to each other according to this similarity measure.

Objects clustered for color similarity can still have different texture characteristics. In step 11 these objects are therefore further clustered into smooth and lightly textured objects by applying a K-means algorithm (k = 2) on the object gradient magnitudes of the Y component.


Figure 12c shows an example resulting from the clustering process. Each color represents a particular object; black areas represent the edges or parts of the image that are purely white or black.

In step 12 the false ringing regions are removed. Some edges and textures are missed in the edge detection process because their gradient magnitudes are below the threshold of the edge detector. However, these missed edges and textures can lie in the close neighborhood of the detected edges, and therefore, they can be captured within a potential ringing region. To remove these areas and determine the actual ringing regions, the gradient magnitudes of the Y component of pixels within the potential ringing regions are compared with a threshold. Edges with a certain mean gradient magnitude typically only generate ringing artifacts with gradient magnitudes below a certain level. This ringing threshold $T_{ring}(i,j)$ for a pixel is calculated from the quantization table and the local edge strength, where $Q(i,j)$ is the (i,j)th element of the quantization table Q, and $\bar{G}$ is the average edge gradient magnitude value in the corresponding 8x8 block. Pixels in the potential ringing regions with a gradient magnitude of the Y component larger than $T_{ring}$ are then removed.

In step 13 the overall ringing quantification is calculated by comparing the features of the ringing regions with the objects to which they belong, including texture and luminance masking properties. The amount of texture masking of a ringing region is determined by calculating the activity typical for the object to which it belongs. The intensity changes between neighboring pixels describe the activity of a pixel. A 4-point neighborhood, i.e. two horizontal neighbors and two vertical neighbors, is used. The activity of a pixel at location (i,j) can be described as follows:

$$a(i,j) = |Y(i,j)-Y(i-1,j)| + |Y(i,j)-Y(i+1,j)| + |Y(i,j)-Y(i,j-1)| + |Y(i,j)-Y(i,j+1)|$$

where Y(i,j) is the Y component value of a pixel in row i and column j. The amount of texture masking for a ringing region is determined by the mean of all activity values over all pixels of the object to which it belongs. In Figure 12d the activity values for the different objects are visually represented; a dark grey level represents objects with a low activity, a light grey level objects with a high activity, and black the edges. This image clearly illustrates that the smooth sky has a lower activity than the more textured water, and therefore, will mask the ringing artifacts less.


Figure 12 – Clustering Example: (a) example image, (b) segmentation into edges, potential ringing regions and potential smooth regions, (c) clustered objects, (d) activity per object
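As an aside, the activity measure above is easy to prototype. The following sketch computes a per-pixel 4-neighborhood activity map with NumPy; the absolute-difference form of the sum and the border handling are our own assumptions, since [30] does not spell them out.

```python
import numpy as np

def activity_map(Y):
    """Sum of absolute intensity differences to the two horizontal and
    two vertical neighbours, one plausible reading of a(i,j) above.
    Y is a 2-D float array holding the Y (luminance) component."""
    Yp = np.pad(Y, 1, mode='edge')          # replicate image borders
    c = Yp[1:-1, 1:-1]
    return (np.abs(c - Yp[:-2, 1:-1]) + np.abs(c - Yp[2:, 1:-1]) +
            np.abs(c - Yp[1:-1, :-2]) + np.abs(c - Yp[1:-1, 2:]))

# texture masking of a ringing region = mean activity over its object:
# masking = activity_map(Y)[object_mask].mean()
```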

The amount of luminance masking, $M_L(r)$, for a ringing region r is determined as a function of the mean luminance over that ringing region.


The overall masking characteristic of a ringing region is the maximum of its texture masking, $M_T(r)$, and luminance masking, $M_L(r)$, properties, defined as:

$$M(r) = \max\bigl(M_T(r),\, M_L(r)\bigr)$$

The visibility of ringing in a given ringing region depends on the masking properties and on the size of the ringing region. The visibility of a ringing region r is defined as a function of its size S(r), its overall masking characteristic M(r) and a(r), the mean of all activity values over all pixels within the ringing region. An example of the results of the RCRM for a JPEG compressed image with quality levels 90, 70 and 50 is shown in Figure 13. The overall ringing quantification is not described in much detail, but is probably calculated by combining the visibility values of all ringing regions, weighted by their sizes.

Figure 13 - Visibility of ringing regions of an image compressed at three different ratios: (a) CR = 90, (b) CR = 70, (c) CR = 50


An interesting aspect of the RCRM over the metrics previously discussed is that the different objects to which ringing regions belong are used as reference to determine the amount of texture masking. However, a complex clustering method is needed to find these objects in an image. This clustering method is computationally expensive and can probably not be applied in a video chain. There are also no performance results given, so it is uncertain whether this clustering method will find the correct objects in images with other content than used in [30]. Similar to the MFRM [29], a drawback in the detection of ringing regions in this metric is that edges are found based on gradient magnitudes only and a complex edge linking method is used. Apart from these drawbacks the visual assessment, carried out on a small group of stimuli, appears promising. Since no more extensive assessment was done, nothing is known about the correlation of this metric to subjective data.


2.7 Conclusion

Ringing artifacts are most visible around high contrast edges in the spatial domain. The accuracy of objective metrics is in general improved by quantifying artifacts only in the regions where they occur. In most existing ringing metrics [27]-[31] ringing regions are detected by applying Sobel edge detection, which captures strong edges based on the gradient magnitude only. This means that pixels with a gradient magnitude larger than a certain threshold are considered as part of an edge. However, finding an optimal threshold for a specific image is very difficult, as illustrated in Figure 14, where once a high threshold of 0.2 is applied (Figure 14c) and once a low threshold of 0.1 is applied (Figure 14d). The edge map in Figure 14c largely removes texture, but eliminates a number of edges important for ringing too, as may be seen from the comparison to Figure 14b. This may heavily degrade the detection accuracy of perceived ringing. By lowering the threshold, all strong edges are maintained in the edge map (see e.g. Figure 14d), but it contains more texture that is irrelevant to ringing detection, and consequently results in a large number of unnecessary computations. So, depending on the choice of the threshold, one runs the risk of missing obvious ringing regions near non-detected edges (in case of a higher threshold value) or finding regions without ringing artifacts near irrelevant edges (in case of a lower threshold value). The ringing metrics of [29] and [30] use an edge linking method after the edge detection to repair the broken edges. However, this edge linking method has a considerable computational cost and complexity, and it is doubtful whether its results are reliable. Therefore, in the development of a new ringing metric, a new edge detection model, which is able to detect the more relevant edges in an image, can be highly beneficial.

The ringing metrics of [29] and [30] are essentially promising, since a HVS model is incorporated to obtain only the ringing regions where ringing artifacts are visible. An interesting aspect of the metric in [30] over other metrics is that different objects, to which ringing regions belong, are detected and used as reference. However, a complex clustering model is needed, and therefore, this HVS model will probably be too computationally expensive for application in a video chain. In the development of a new ringing metric it is highly desirable to keep the HVS model relatively simple, and minimize the amount of calculations without compromising the detection accuracy. In [29] the HVS model is applied only to the ringing regions instead of to the whole image. A simple comparison is made to determine if pixels within a small window within the ringing regions are visually masked by texture or luminance. Due to the reduced complexity of this approach [29] it is interesting to incorporate a similar HVS model in the new ringing metric.


Figure 14 – Illustration of Sobel edge detection thresholds: (a) original image, (b) compressed image with ringing, (c) Sobel edge map with threshold 0.2, (d) Sobel edge map with threshold 0.1

The level of variance within the ringing regions is defined as the quantification of ringing in [29]. This approach is relatively simple, and therefore, interesting for the new ringing metric. However, it has the main drawback that the ringing regions are not always perfectly smooth. Hence, the amount of variance within a ringing region is not only dependent on ringing artifacts, but also on texture in the image content. A two-dimensional illustration is shown in Figure 15, where the red dots indicate the intensity values of individual pixels. On the left side of the edge ringing artifacts are visible in a smooth ringing region. The fluctuation becomes weaker farther from the edge, and finally extinguishes. On the right side of the edge ringing artifacts are visible in a less smooth ringing region; there the fluctuation is not only influenced by ringing artifacts but also by texture in the image content, which can be seen at the far right end of the figure. In general, texture in the image content will reduce the visibility of the ringing artifacts, as shown also in the image in Appendix B, where ringing artifacts are well visible against the smooth background of the sky while they are hardly visible in the more textured water. Therefore, in the development of the new ringing metric, the influence of texture in the image content on the level of variance should be taken into account in the quantification of ringing.


Figure 15 – Ringing artifacts in smooth and unsmooth ringing regions

3 Approach

As shown in chapter 2, most ringing metrics generally consist of two parts: (1) ringing region detection, i.e. the regions where ringing artifacts potentially occur are detected, and (2) ringing quantification, i.e. the visibility of ringing artifacts within these ringing regions is quantified.

Improvements towards the development of a more reliable ringing metric can therefore be made in more accurately detecting ringing regions and in a better quantification of ringing within the ringing regions. Additionally, it is desirable to reduce the complexity and amount of calculations in both the detection and quantification of the ringing artifacts. The main improvements of the new ringing metric over existing metrics from literature are summarized below. A more detailed description is given in the next sections of this chapter.

To detect ringing regions more accurately our newly proposed ringing metric incorporates an improved edge detection model, which is able to detect only the more perceptually relevant edges in an image. This implies that only the edges most closely related to the occurrence of ringing artifacts are extracted for the subsequent detection phase. Finding the perceptually meaningful edges for ringing detection is based on the removal of irrelevant details from an image. Removing details, such as irrelevant texture and noise, from an image is achieved by smoothing the image. However, regular smoothing filters not only erase details, but also affect the location and precision of edges. Therefore, an edge preserving smoothing filter is used. Then the Canny edge detector is applied to the filtered image to obtain the perceptually more meaningful edges in the image. This edge detection model is more robust against other artifacts distorting edge detection and works without a complex repair mechanism [29][30].

A model that removes parts of ringing regions which are visually masked by the background is also beneficial for improving the detection accuracy of ringing regions. In objective metrics from the literature, a HVS model usually removes parts of the image in which distortions are masked by the background. The HVS model in [29] is applied only to specific regions instead of to the whole image [30], which is also the approach for the HVS model in our metric. However, ringing artifacts are sometimes hard to distinguish from image detail. Therefore, the estimated amount of texture and luminance masking is also affected by the amount of ringing artifacts in these ringing regions. Parts of ringing regions heavily distorted with ringing artifacts can be classified as highly masked regions, while they actually are not, and therefore, will be removed by the HVS model. This is definitely a problem for the HVS model of [29]. Therefore, our newly proposed ringing metric incorporates an improved version of the HVS model of [29], which calculates the masking properties in reference objects near the ringing regions, but far enough from the edges not to be contaminated with ringing artifacts. The idea to use reference objects to determine the amount of masking for ringing regions is based on the HVS model in [30], but is implemented in our metric without a complex clustering model. The regions around edges are locally classified, using simple morphological operators, into three different regions: edge regions, ringing regions and feature regions. The feature regions are used as reference objects to determine which parts of the ringing regions are masked by texture or luminance. As explained in § 1.3, texture and luminance masking are implemented separately in our HVS model. In this way our HVS model remains simpler than an advanced clustering model [30], while it probably gives a better approximation of the masking of ringing artifacts than [29].

In [29] the level of variance in the ringing regions is defined as the quantification of ringing. This approach is relatively simple, and therefore, interesting for our new ringing metric. However, it has the main drawback that the ringing quantification is affected by texture in the image content in regions which are not perfectly smooth. The metric from [30] also defines the level of variance as the quantification of ringing, but afterwards subtracts the level of variance of the reference objects detected by the clustering model. This probably gives a more reliable quantification of ringing, because the contribution of the image content to the level of variance is taken into account. However, a complex clustering model is used and the variance has to be calculated over all pixels in the image. In our approach, we chose to use the feature regions as reference objects. The variance within the feature regions is subtracted from the variance within their corresponding ringing regions to quantify ringing apart from the contribution of image content. In this way the metric remains simpler than an advanced clustering model [30], while it probably gives a more reliable ringing quantification than [29].

Similar to the approaches reported in literature, we chose to split the ringing region detection method from the ringing quantification method in the implementation of our newly proposed ringing metric. The details of the implementation of both are described in the next sections of this chapter.

3.1 Overview

A schematic overview of our proposed ringing metric is given in Figure 16. Initially the color image is converted to a luminance only image to reduce the computational load. Next, an edge preserving smoothing filter is applied on the luminance component in order to remove texture and noise, while preserving the more perceptually relevant edges. From this filtered image an edge map is generated by applying Canny edge detection. Edges in this edge map are grouped and a perceptual edge map is formed. From this perceptual edge map various regions, i.e. edge, ringing and feature regions, are defined. The HVS model removes parts of the ringing regions which are visually masked. The remaining ringing and feature regions are further regrouped into ringing objects and feature objects. Spurious ringing objects are detected and where necessary removed. This results in a computational ringing region map. The ringing quantification is calculated over this computational ringing region map, in which the influence of the image content on the variance is corrected for. This results in a score which defines the visibility of ringing in the image. In the next sections the implementation of these steps is described in more detail.

Figure 16 - Overview of our new Ringing Metric: Color Conversion (RGB to BW) → Edge Preserving Smoothing → Canny Edge Detection → Perceptual Edge Grouping (edge pixel linking, edge segment labelling, edge segment thresholding) → Perceptual Edge Map → Local Region Classification (edge, ringing and feature regions) → Human Vision Model (texture and luminance masking) → Region Regrouping (edge segment classification, ringing region labelling, thresholding and classification) → Spurious Ringing Object Removal (edge strength detection, ringing strength detection, ringing object removal) → Computational Ringing Region Map → Ringing Quantification


3.2 Ringing Region Detection

3.2.1 Color Conversion

A lot of research has been devoted to understanding the way color information is coded in the human visual system [34][35]. It is generally accepted that the human eye has three types of cone receptors, each sensitive to a different part of the light spectrum. They are referred to as S-cones (for the short wavelengths), M-cones (for the middle wavelengths) and L-cones (for the long wavelengths). In the retina the spectral light distribution is convolved with the spectral sensitivity curves of the S-, M- and L-cones, which generates three neural signals of different strength. Depending on the ratio of these signals, the HVS will interpret the visual input as a certain color. The generally accepted opponent theory [38] states that there are actually three types of color receptive fields, which are called opponent channels. These are (1) the black-white channel or luminance component from the addition of L- and M-cones, (2) the red-green channel from the subtraction of M- from L-cones, and (3) the yellow-blue channel from the addition of the L- and M-cones minus the S-cones. Since there are substantially more L- and M-cones than S-cones, the density of these two on the retina is much higher, implying that the black-white channel has the highest spatial resolution. This feature is used in current color TV systems, in which a larger bandwidth is allocated to the luminance component than to the two chromaticity components [36]. Furthermore, it has been shown that the results from an image quality metric based on the luminance component only differ only slightly from the results of the corresponding full-color extension [37]. Therefore, in order to reduce the computational load, it is possible for an objective metric to work on the luminance component only without a significant degradation in its prediction accuracy. Digital images are usually coded in RGB color space, which is not perceptually uniform. For the conversion to the luminance component a weighted sum of the R, G, and B components is calculated as follows:

$$Y = 0.299\,R + 0.587\,G + 0.114\,B$$
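A minimal sketch of this conversion is given below, assuming the standard ITU-R BT.601 weights of the equation above; the function name and array layout are illustrative only.

```python
import numpy as np

def rgb_to_luminance(rgb):
    """Weighted sum of the R, G and B components (BT.601 weights).
    rgb is an (H, W, 3) array; the result is the 2-D Y component."""
    return rgb[..., 0] * 0.299 + rgb[..., 1] * 0.587 + rgb[..., 2] * 0.114
```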

3.2.2 Edge Preserving Smoothing

Natural images usually contain lots of details, such as texture and noise. Figure 14 illustrates that ordinary edge detectors, which are based on gradient magnitudes only, neglect spatial information. Therefore, they usually fail in extracting the perceptually strong edges and their edge map contains a lot of image detail such as texture and noise. It is known that the HVS tends to respond to differences between homogeneous regions rather than to structure within these regions [39]. So for finding the perceptually meaningful edges for ringing detection, our approach is based on the following observation: details existing in homogeneous regions are neglected, as if the image is viewed from a long distance.


Removing details, such as texture and noise, from an image can be achieved by smoothing the image until textural details are significantly reduced. Then, an edge detector can extract the perceptually more meaningful edges in the image. The best known smoothing technique is linear low-pass filtering. The most widely used filter deploys a Gaussian kernel, since it has been proved to be very close to optimal for noise reduction [40]. The function of this Gaussian kernel is shown in Figure 17a. The effect of this low-pass filter is to smooth out high spatial frequency components from an image. The degree of smoothing is determined by the standard deviation of the Gaussian. The Gaussian outputs for each pixel a 'weighted average' of that pixel's neighborhood, with the average weighted more towards the value of the central pixel, as can be seen in Figure 17b. Since linear low-pass filtering strongly attenuates high frequency components, not only texture, but also the perceptually meaningful edges are smoothed. Because of this characteristic, Gaussian smoothing changes the spatial location of the edges in the resulting edge map. Figure 18 illustrates the effect of Gaussian smoothing (Figure 18b) on an original image (Figure 18a). The edge map from the smoothed image (Figure 18c) indeed leaves out a lot of irrelevant detail, but the spatial location of the edges is also changed, as shown in Figure 18d. Since ringing artifacts appear close to these edges, their detection requires precise spatial localization of the edges. In such a scenario an edge preserving smoothing filter is needed.

Figure 17 – Properties of a linear Gaussian smoothing filter: (a) Gaussian function, (b) Gaussian spatial kernel


Figure 18 – Gaussian smoothing and edge detection: (a) original image, (b) Gaussian smoothed image, (c) edge map, (d) edge map in image

An edge preserving smoothing filter (EPSF) is a filter which preserves the location of the high contrast edges in the smoothing process. It uses a non-linear operator which is able to remove texture and noise, while keeping edges. Several EPSFs have been proposed in the literature. The best known ones are based on Kuwahara filtering [40], median filtering [42], bilateral filtering [43] and anisotropic diffusion [44]. The latter is probably the most popular EPSF, to which much research has been devoted in the last fifteen years. However, it is not computationally efficient, since it requires many iterations to achieve the desired output. The bilateral filter has recently been proposed in [43] as a simpler and faster alternative to anisotropic diffusion and has already been implemented in real time for high-definition video [45]. Therefore, anisotropic diffusion is not taken into account in our approach. The details of Kuwahara filtering, median filtering and bilateral filtering are described below. In chapter 4, based on an evaluation with subjective data, the filter which performs best is chosen to be incorporated in our new ringing metric.


3.2.2.1 Kuwahara Filtering

The principle of Kuwahara filtering is that the kernel is divided into four regions (a, b, c, d), as can be seen in (29). Although the Kuwahara filter can be implemented for a variety of different window shapes, the filter is described here for a square window of size 5. This window slides over all pixels in an image from the top left to the bottom right. Then, in each of the four regions (a, b, c, d), the mean brightness and the variance are calculated. The output value of the centre pixel (abcd) in the window is the mean value of the region that has the smallest variance.
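A compact sketch of this filter for the 5x5 window is given below; it is our own vectorized NumPy reading, in which each 3x3 quadrant mean and variance is obtained by shifting precomputed 3x3 local statistics. Note that np.roll wraps around at the image borders, which a production implementation would handle explicitly.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def kuwahara5(img):
    """Kuwahara filter for a 5x5 window: the window is split into four
    overlapping 3x3 quadrants; each output pixel becomes the mean of
    the quadrant with the smallest variance."""
    img = img.astype(np.float64)
    mean3 = uniform_filter(img, 3)                     # 3x3 local means
    var3 = uniform_filter(img * img, 3) - mean3 ** 2   # 3x3 local variances
    # each quadrant centre sits one pixel diagonally from the window centre
    shifts = [(1, 1), (1, -1), (-1, 1), (-1, -1)]
    means = np.stack([np.roll(mean3, s, axis=(0, 1)) for s in shifts])
    varis = np.stack([np.roll(var3, s, axis=(0, 1)) for s in shifts])
    best = varis.argmin(axis=0)                        # least-variance quadrant
    return np.take_along_axis(means, best[None], axis=0)[0]
```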

3.2.2.2 Median Filtering

Similar to the Kuwahara filter, the median filter can be implemented for a variety of different window shapes. The filter is described here for a square window of size 3, as can be seen in (30). This window slides over all pixels in an image from the top left to the bottom right. First, the median filter sorts the sample values in the window. The output value of the centre pixel (b) in the window is the median of the sorted values.
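Median filtering is readily available in SciPy; a minimal usage sketch (with a stand-in random image) looks as follows.

```python
import numpy as np
from scipy.ndimage import median_filter

img = np.random.rand(64, 64)            # stand-in for the luminance image
smoothed = median_filter(img, size=3)   # 3x3 window: output = median of samples
```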


3.2.2.3 Bilateral Filtering

Bilateral filtering refers to a combination of domain and range filtering. It extends the concept of Gaussian smoothing by weighting the filter coefficients with their corresponding relative pixel intensities. Therefore, pixels with a very different intensity compared to the central pixel are weighted less even though they may be at a close distance to the central pixel. This is applied as two Gaussian filters at a localized pixel neighborhood, one in the spatial domain, named the domain filter, and one in the intensity domain, named the range filter.

Let I(a) be the intensity of a pixel a. Then for any given pixel a within a neighborhood of size n, which has a0 at its centre, the coefficient assigned by the range filter, r(a), is determined by the following function:

$$r(a) = \exp\left(-\frac{\bigl(I(a)-I(a_0)\bigr)^2}{2\sigma_r^2}\right)$$

Similarly, the coefficient assigned by the domain filter, g(a), is determined by the closeness function below:

$$g(a) = \exp\left(-\frac{\lVert a-a_0\rVert^2}{2\sigma_d^2}\right)$$

For the central pixel of the neighborhood a0, its new value, denoted by h(a0), is:

$$h(a_0) = \frac{1}{k}\sum_{a\in N(a_0)} I(a)\,g(a)\,r(a)$$

where k is a normalization constant to maintain zero-gain and is defined as:

$$k = \sum_{a\in N(a_0)} g(a)\,r(a)$$


Both $\sigma_d$ and $\sigma_r$ determine the level of smoothing in the spatial and intensity domain respectively. Letting $\sigma_r$ grow very large reduces the bilateral filter to a simple Gaussian smoothing filter. Pixels close to the central pixel a0 in both space and intensity contribute more than those further away in space and intensity, as can be seen in Figure 17b.
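A direct (brute-force) sketch of these equations is shown below; the neighborhood size and the two sigma values are illustrative, and the double loop is written for clarity rather than speed.

```python
import numpy as np

def bilateral(img, n=5, sigma_d=2.0, sigma_r=20.0):
    """Bilateral filter: a Gaussian domain filter g weighted by a
    Gaussian range filter r over an n x n neighborhood, normalized by
    the constant k, as in the equations above."""
    img = img.astype(np.float64)
    h = n // 2
    pad = np.pad(img, h, mode='edge')
    out = np.zeros_like(img)
    y, x = np.mgrid[-h:h + 1, -h:h + 1]
    g = np.exp(-(x * x + y * y) / (2 * sigma_d ** 2))   # domain weights
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            patch = pad[i:i + n, j:j + n]
            r = np.exp(-(patch - img[i, j]) ** 2 / (2 * sigma_r ** 2))
            w = g * r
            out[i, j] = (w * patch).sum() / w.sum()     # k normalization
    return out
```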

This is represented visually in Figure 19, obtained from [46]. When input (a) is filtered with a Gaussian smoothing filter with spatial kernel (b), the result is that in the output (c) both the noise and the edge are smoothed out. However, with a spatial kernel of weight f x g (d) for the central pixel, the result is that in output (e) the noise is smoothed out while the steepness of the edge remains similar to its original steepness.

Figure 19 – Bilateral filtering: (a) input, (b) Gaussian spatial kernel, (c) Gaussian smoothed output, (d) bilateral kernel of weight f x g, (e) bilaterally filtered output

The advantage of bilateral filtering over traditional Gaussian filtering can be seen in Figure 20, where applying the edge detector on a bilaterally filtered image results in an edge map with a high precision in the localization of the edges. Since the locality of edges is ensured, this information can then be readily used in subsequent analysis for ringing detection.


Figure 20 – Bilateral filtering and edge detection: (a) original image, (b) bilaterally filtered image, (c) edge map, (d) edge map in image

3.2.3 Edge Detection

Since an EPSF is able to successfully preserve edge information, while removing image details, the perceptually more meaningful edges can now more easily be extracted, when applying an edge detector to the filtered image. Canny edge detection [40] has been widely used in a variety of image processing applications. It has been developed to enhance the performance of edge detectors existing at that time.

An optimal edge detection algorithm aims at the following objectives:

(1) good detection, i.e. edges occurring in images should not be missed and there should be no responses to non-edges.

(2) good localization, i.e. the distance between the edge pixels as found by the detector and the actual edge has to be minimal.

(3) minimal response, i.e. for a given edge the possibility of multiple responses should be minimal.


Regular Canny edge detection performs six successive steps. Canny edge detection in our approach differs in the first and last step from the regular approach. The first step of regular Canny edge detection is filtering out any noise in the original image, before trying to locate and detect any edges. In regular Canny edge detection this is done by using a Gaussian smoothing filter. However, in our approach the image is already smoothed by an EPSF. Therefore, the first step of Canny edge detection is skipped. Steps two to five in our approach follow the regular steps of Canny edge detection. In step two the edge strength is determined. A Sobel operator determines the two dimensional spatial gradient over the whole image. Then, the approximate absolute gradient magnitude at each point defines the edge strength. The Sobel operator uses a pair of 3x3 convolution masks, one estimating the gradient magnitude in the x-direction ($G_x$), and the other estimating the gradient magnitude in the y-direction ($G_y$). The edge strength, resulting in a gradient image as can be seen in Figure 21a, is approximated using the formula:

$$|G| = |G_x| + |G_y|$$

In the third step the edge direction is calculated. The formula for finding the edge direction is:

$$\theta = \arctan\left(\frac{G_y}{G_x}\right)$$

Step four quantizes all the edge directions into four categories: edge directions within 0 to 22.5 and 157.5 to 180 degrees are set to 0 degrees, 22.5 to 67.5 degrees are set to 45 degrees, 67.5 to 112.5 degrees are set to 90 degrees and 112.5 to 157.5 degrees are set to 135 degrees. In step five non-maximum suppression is applied. The gradient magnitudes in the gradient image are scanned along the local corresponding edge directions, by sliding a small window over all pixels along their edge directions. The maximum gradient magnitude within the window is determined, and all pixels within this window with a gradient magnitude lower than this local maximum are set to zero. This has the effect of suppressing all gradient pixels that are not part of a local maximum. The gradient pixels that are left are, therefore, on the peaks of ridges in the gradient image. This gives a thin line, mostly one pixel thick, in the output image, as can be seen in Figure 21b.


Figure 21 – Non-maximum suppression: (a) gradient image, (b) after non-maximum suppression

Finally, in the sixth step hysteresis thresholding is applied on the gradient image to eliminate streaking. Streaking is the breaking up of an edge contour caused by the operator output fluctuating above and below the threshold. If a single threshold is applied to an image, and an edge has an average strength equal to this threshold, there will be instances where the edge dips below the threshold. Equally, it will also extend above the threshold, making an edge look like a dashed line. To avoid this, hysteresis is applied by means of two thresholds, a high threshold $T_{high}$ and a low threshold $T_{low}$. Any pixel in the gradient image that has a value larger than $T_{high}$ is presumed to be an edge pixel, and is marked as such immediately. Then, any pixel that is connected to this edge pixel and that has a value larger than $T_{low}$ is also selected as edge pixel. When following an edge, a gradient above $T_{high}$ is thus needed to start, and the edge is not terminated until the gradient drops below $T_{low}$. For regular Canny edge detection the low and high threshold have to be set manually. In our approach the low and high threshold are calculated automatically and are dependent on the image content. The high threshold is set such that 85% of the total amount of pixels is cumulated in the magnitude histogram of the gradient image, and the low threshold is set at 75%.
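These content-dependent thresholds amount to taking percentiles of the gradient magnitude histogram, as in the sketch below (the function name is ours).

```python
import numpy as np

def hysteresis_thresholds(grad_mag):
    """Automatic Canny thresholds as described above: T_high is the
    gradient magnitude below which 85% of all pixels fall, T_low the
    magnitude below which 75% fall."""
    t_high = np.percentile(grad_mag, 85)
    t_low = np.percentile(grad_mag, 75)
    return t_low, t_high
```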

3.2.4 Perceptual Grouping

Now that the edge map is obtained, the next step is the detection of the regions surrounding the edges that are likely to be contaminated with visible ringing artifacts. The HVS has the ability to group lower-level image features into a meaningful higher-level structure, which was investigated by Gestalt psychologists [47]. The HVS is able to perceive individual edge pixels in the edge map as a whole. These perceptual elements, which are constructed from a group of connected individual edge pixels, will further be called edge segments, and will be used as the basis for ringing detection.


Considering the physical structure of ringing artifacts, complicated grouping processes, such as object recognition, are not needed, and the edge segments can be easily extracted as follows.

(1) Edge Pixel Linking: The model links all edge pixels into a set of edge segments. Each edge pixel is scanned in its eight-connected neighborhood. When only one other edge pixel is found, the current edge pixel is marked as an end pixel. When two other edge pixels are found, the current pixel is marked as a line pixel, and when three other edge pixels are found it is marked as a junction pixel. An edge segment starts or ends at an end pixel or junction pixel, and continues to grow with connected line pixels until another end or junction pixel is found.

(2) Edge Segment Labeling: The disconnected edge segments are labeled as individual objects.

(3) Edge Segment Thresholding: The edge segments containing fewer than a certain number of edge pixels are discarded. This is done with the ringing detection speed and accuracy in mind. In our approach edge segments with fewer than 15 edge pixels are discarded.

Once the above process is complete, a new edge map is obtained, which is called the perceptual edge map. Figure 22 illustrates this grouping procedure, and the edge segments are labeled with random colors in the resulting edge map.

Figure 22 – Illustration of perceptual grouping: (a) Canny edge map, (b) perceptual edge map
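A simplified sketch of the grouping is given below: it labels 8-connected edge segments and discards the small ones, but does not split segments at junction pixels as the full model does.

```python
import numpy as np
from scipy.ndimage import label

def perceptual_edge_map(edge_map, min_pixels=15):
    """Label 8-connected edge segments and discard segments with fewer
    than min_pixels edge pixels; returns a labeled segment map."""
    structure = np.ones((3, 3), dtype=int)      # 8-connectivity
    labels, _ = label(edge_map, structure=structure)
    counts = np.bincount(labels.ravel())
    keep = counts >= min_pixels
    keep[0] = False                             # label 0 is the background
    return np.where(keep[labels], labels, 0)
```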


3.2.5 Local Region Classification

From the perceptual edge map all edge segments are examined to determine the ringing regions and their visibility. Ringing appears as an oscillation around an edge, as illustrated in Figure 15, where the red dots indicate the intensity values of individual pixels. However, as explained in § 1.3, ringing is not actually perceived in all such areas. In order to sufficiently characterize the visibility of ringing, the regions around an edge are locally classified into three different regions:

(1) Edge Region (EdReg): the original edge, taking the compression-induced blur into account

(2) Ringing Region (RiReg): the direct neighborhood of the EdReg, which potentially contains perceived ringing artifacts

(3) Feature Region (FeReg): the region indicating the original local background, which is located outwards from the corresponding RiReg, where ringing artifacts are extinguished.

A two-dimensional overview of these three different regions around an edge segment can be seen in Figure 23. This local region classification is implemented on the perceptual edge map, using morphological operators. The three different regions are obtained in the spatial domain by thickening the perceptual edge map with different sizes of the structuring element of a dilation operation. In Figure 24a an image of size 480 x 640 (height x width) is used. The local area around a detected line segment is classified into the EdReg, RiReg and FeReg with a square structuring element. The sizes of the structuring elements vary with the image resolution in our approach. The widths of the structuring elements of the EdReg, RiReg and FeReg are 0.25%, 1% and 2% of the image diameter respectively. For the image in Figure 24a this results in widths of 2, 8 and 16 pixels for the structuring elements, as can be seen in Figure 24b.

Figure 23 - Local region classification around an edge


Figure 24 – Illustration of local region classification for an edge segment: (a) edge map of an image, (b) local region classification into EdReg, RiReg and FeReg
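The classification can be sketched with three binary dilations, as below; we read "image diameter" as the image diagonal, which reproduces the widths of 2, 8 and 16 pixels for a 480 x 640 image.

```python
import numpy as np
from scipy.ndimage import binary_dilation

def classify_regions(perc_edge_map):
    """Dilate the perceptual edge map with square structuring elements
    of 0.25%, 1% and 2% of the image diagonal; the nested differences
    give the EdReg, RiReg and FeReg bands around the edges."""
    h, w = perc_edge_map.shape
    diag = np.hypot(h, w)
    def dilate(frac):
        size = max(1, int(round(diag * frac)))
        return binary_dilation(perc_edge_map > 0, np.ones((size, size), bool))
    ed = dilate(0.0025)                       # edge region
    ri = dilate(0.01) & ~dilate(0.0025)       # ringing band around the edge
    fe = dilate(0.02) & ~dilate(0.01)         # feature band outside RiReg
    return ed, ri, fe
```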

3.2.6 Human Vision Model

As explained in § 1.3, the visibility of ringing is significantly affected by the spatial activity in its local background, i.e. ringing is visually masked when located in a textured region, while it is perceptually prominent against a smooth background. In the yellow block in Figure 24a, it can be seen that ringing artifacts are only visible in the smooth sky (above the edge), while they are masked in the textured skin (below the edge). Additionally, as also illustrated in § 1.3, the visibility of ringing is affected by luminance masking. Hence, applying a HVS model that includes texture and luminance masking is expected to be beneficial in detecting perceived ringing. The problem with implementing masking is that ringing artifacts are sometimes hard to distinguish from image detail within ringing regions. By applying the HVS model over the ringing regions (RiReg) as in [29], the calculation of the amount of texture and luminance masking is affected by the amount of ringing artifacts potentially present. Therefore, parts of ringing regions heavily distorted with ringing artifacts can be misclassified as highly masked regions, while they actually are not. As such, parts of the ringing regions contaminated with ringing artifacts can be wrongly removed by a HVS model applied over the ringing regions only, such as in [29]. As an alternative, we calculate the masking properties in the feature regions (FeReg) instead of the ringing regions (RiReg). These feature regions are too far away from the edges to be affected by ringing artifacts, as can be seen in Figure 23. The feature regions are used as reference objects to determine which parts of the ringing regions are masked. The masked parts of the ringing regions are then removed by the HVS model. As also explained in § 1.3, texture and luminance masking are implemented separately in our HVS model.


3.2.6.1 Texture Masking

The proposed texture masking model is illustrated in Figure 25. It generally involves the following steps:

(1) Calculating the local variance within 3x3 windows in the FeReg following:

$$LV(i,j) = \frac{1}{9}\sum_{k=i-1}^{i+1}\sum_{l=j-1}^{j+1}\bigl(I(k,l)-\mu(i,j)\bigr)^2, \qquad \mu(i,j) = \frac{1}{9}\sum_{k=i-1}^{i+1}\sum_{l=j-1}^{j+1} I(k,l)$$

where LV(i,j) indicates the local variance of a pixel at location (i,j) within the 3x3 window, and I(k,l) defines the intensity value of a pixel at location (k,l), as in (27).

(2) The texture map $M_T$ (Figure 25b) is created by applying a threshold on the local variance and is calculated as:

$$M_T(i,j) = \begin{cases} 1 & \text{if } LV(i,j) > t_{var} \\ 0 & \text{otherwise} \end{cases} \qquad (38)$$

where the threshold $t_{var}$ is set to 300 in our model. Note that LV(i,j) is calculated on intensity values, as in (27).

(3) Dilation of the texture map with a structuring element with a square form of width 4 (Figure 25c)

(4) Partitioning the FeReg by assigning its regions covered by the dilated texture map to "texture objects" and the remaining areas to "smooth objects" (Figure 25d).

(5) Removing the parts of the RiReg that surround the "texture objects" of the FeReg, and discarding the resulting regions of the RiReg with a size under 10 pixels (Figure 25g). The remaining parts of the RiReg are considered the "smooth RiReg" (Figure 25h).


Figure 25 - Implementation of the texture masking: (a) FeReg, (b) texture map, (c) dilated texture map, (d) texture objects and smooth objects, (e) RiReg, (f) RiReg & objects, (g) region removal, (h) smooth RiReg
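Steps (1)-(4) can be sketched with a handful of array operations; the helper below computes the 3x3 local variance with uniform filters and is our own reading of the procedure (function names and argument layout are illustrative).

```python
import numpy as np
from scipy.ndimage import uniform_filter, binary_dilation

def texture_objects(Y, fe_reg, t_var=300, dil=4):
    """3x3 local variance inside the FeReg, thresholded at t_var and
    dilated with a square of width dil; the covered parts of the FeReg
    are the 'texture objects', the rest the 'smooth objects'."""
    Y = Y.astype(np.float64)
    mean = uniform_filter(Y, 3)
    local_var = uniform_filter(Y * Y, 3) - mean * mean   # LV(i, j)
    texture_map = (local_var > t_var) & fe_reg           # eq. (38)
    dilated = binary_dilation(texture_map, np.ones((dil, dil), bool))
    return dilated & fe_reg
```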

3.2.6.2 Luminance Masking

The visibility of ringing is also largely reduced in an extremely dark or light surrounding. Luminance masking is implemented similarly to texture masking. The proposed luminance masking model is illustrated in Figure 26. To reduce the computational load, luminance masking is only calculated over the smooth objects in the FeReg.


It generally involves the following steps:

(1) Calculating the local mean intensity within 3x3 windows in the smooth objects of the FeReg following:

$$LI(i,j) = \frac{1}{9}\sum_{k=i-1}^{i+1}\sum_{l=j-1}^{j+1} I(k,l)$$

where LI(i,j) indicates the local mean intensity of a pixel at location (i,j) within the 3x3 window, and I(k,l) defines the intensity value of a pixel at location (k,l), as in (27).

(2) The intensity map $M_I$ (Figure 26b) is created by applying two thresholds on the local intensities and is calculated as:

$$M_I(i,j) = \begin{cases} 1 & \text{if } LI(i,j) < t_{low} \text{ or } LI(i,j) > t_{high} \\ 0 & \text{otherwise} \end{cases} \qquad (40)$$

where the low and high thresholds $t_{low}$ and $t_{high}$ are set to 50 and 230 in our model.

(3) Dilation of the intensity map with a structuring element with a square form of width 3 (Figure 26c)

(4) Partitioning the FeReg by assigning its regions covered by the dilated intensity map to "invisible objects" and the remaining areas to "visible objects" (Figure 26d).

(5) Removing the parts of the RiReg that surround the "invisible objects" of the FeReg, and discarding the resulting parts of the RiReg with a size under 10 pixels (Figure 26g). The remaining parts are considered the "visible RiReg", because they are masked neither by texture nor by luminance (Figure 26h).


Figure 26 - Implementation of the luminance masking: (a) FeReg, (b) intensity map, (c) dilated intensity map, (d) invisible objects and visible objects, (e) RiReg, (f) RiReg & objects, (g) region removal, (h) visible RiReg
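Mirroring the texture masking sketch, the luminance steps (1)-(4) reduce to a local mean, two thresholds and a dilation; again the helper name and layout are illustrative.

```python
import numpy as np
from scipy.ndimage import uniform_filter, binary_dilation

def invisible_objects(Y, smooth_fe, t_low=50, t_high=230, dil=3):
    """3x3 local mean intensity inside the smooth feature objects;
    extreme intensities form the intensity map of eq. (40), which is
    dilated and intersected with the feature regions to give the
    'invisible objects'."""
    LI = uniform_filter(Y.astype(np.float64), 3)
    intensity_map = ((LI < t_low) | (LI > t_high)) & smooth_fe
    dilated = binary_dilation(intensity_map, np.ones((dil, dil), bool))
    return dilated & smooth_fe
```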

3.2.7 Region Regrouping

Because the idea in our approach is to quantify ringing locally, the next step is to regroup the ringing and feature regions into ringing objects and feature objects. A ringing object, RiReg(R), belongs to one specific edge segment and is disconnected from other ringing objects. Every feature object, FeReg(R), is classified to a nearby ringing object, and therefore, has only one ringing object to which it belongs. However, more ringing objects can belong to the same edge segment; i.e. the HVS model removed masked areas, and therefore, a specific ringing region can be broken into parts, resulting in several ringing objects. Results of this model are shown in Figure 27.


It generally involves the following steps:

1. Edge Segment Classification: Ringing regions are classified to their nearby edge segments. Ringing regions belonging to different edge segments have random colors as can be seen in Figure 27d.

2. Ringing Region Labeling: The classified ringing regions are further labeled as ringing objects, RiReg(R), if they are disconnected. They are given random colors, as can be seen in Figure 27e.

3. Ringing Region Thresholding: ringing objects smaller than 50 pixels are removed. This is done with the ringing detection speed and accuracy in mind.

4. Ringing Region Classification: Feature regions are classified to their nearby ringing objects. These classified regions are called feature objects, FeReg(R). They are given random colors as can be seen in Figure 27f.

Figure 27 – Illustration of Region Regrouping: (a) perceptual edge map, (b) visible ringing regions, (c) feature regions, (d) edge segment classification, (e) ringing objects, (f) feature objects


3.2.8 Spurious Ringing Object Removal

So far, only the ringing objects that are not masked are detected. This detection is based on image content only, i.e. ringing objects near sharp edges with low masking properties are exposed. However, the visibility and amount of ringing artifacts is also related to the underlying compression ratio, e.g. the same detected ringing object may contain a different amount of visible ringing artifacts depending on the extent of compression. Also, due to the limited accuracy of the edge detector or the HVS model, some edges or texture might be missed and can appear within the ringing objects. This implies that there might exist some spurious ringing objects.

Two types of spurious ringing objects can be distinguished:

(1) Unimpaired ringing objects, i.e. ringing objects without any visible ringing artifacts due to low compression

(2) Noisy ringing objects, i.e. ringing objects covered with misclassified edge or texture pixels

In our approach spurious ringing objects are removed by calculating the amount of potential ringing pixels within each ringing object, and by removing ringing objects with their number of potential ringing pixels below a certain threshold. Areas contaminated with ringing artifacts are visible because the pixels within them have a higher variance than their neighborhood. However, due to misclassification, there may also exist some textured ringing objects that have pixels with a higher variance than their neighborhood. Therefore, they can be classified as ringing objects while they are actually not. It is found in [30] that an edge with a specific strength can only generate ringing artifacts with a variance below a certain level. In our approach the variance of a pixel within a ringing object is compared to a low and a high threshold, which are dependent on the strength of the nearby edge. Pixels with a variance below the low threshold are assumed not to be contaminated with ringing artifacts, and pixels with a variance above the high threshold are assumed to be misclassified edge or texture pixels. Pixels between the two thresholds are defined as potential ringing pixels.

The proposed spurious ringing object removal model generally involves the following steps:

(1) Calculating the local variance within a 3x3 window of pixels in the RiReg following:

$$LV_{RiReg}(i,j) = \frac{1}{9}\sum_{k=i-1}^{i+1}\sum_{l=j-1}^{j+1}\bigl(I(k,l)-\mu(i,j)\bigr)^2$$

where I(k,l) defines the intensity value of a pixel at location (k,l), as in (27), and $\mu(i,j)$ is the mean intensity over the 3x3 window.


(2) Calculating the local variance within a 3x3 window of pixels in the EdReg following:

$$LV_{EdReg}(i,j) = \frac{1}{9}\sum_{k=i-1}^{i+1}\sum_{l=j-1}^{j+1}\bigl(I(k,l)-\mu(i,j)\bigr)^2$$

where I(k,l) and $\mu(i,j)$ are defined as above.

(3) Calculating the low and high threshold for all pixels in the RiReg following:

$$T_{low}(i,j) = \alpha \max_{(k,l)\in W_s(i,j)} LV_{EdReg}(k,l), \qquad T_{high}(i,j) = \beta \max_{(k,l)\in W_s(i,j)} LV_{EdReg}(k,l)$$

where $W_s(i,j)$ indicates the pixels in a window of width s around location (i,j) (s = 10, α = 0.05 and β = 0.50 in our model).

(4) A potential ringing pixel is defined as:

$$PR(i,j) = \begin{cases} 1 & \text{if } T_{low}(i,j) < LV_{RiReg}(i,j) < T_{high}(i,j) \\ 0 & \text{otherwise} \end{cases}$$

(5) The amount of potential ringing pixels in one ringing object is defined as:

$$P(R) = \sum_{m=1}^{M} PR_m(R)$$

where M indicates the total amount of pixels in RiReg(R) and $PR_m(R)$ the value of PR for the m-th pixel of RiReg(R).


(6) For each ringing object the amount of potential ringing pixels is calculated. A ringing object is removed if:

$$\frac{P(R)}{M} < T$$

where M is defined as above and T indicates the pre-defined threshold for the ringing object (0.75 in our model).
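The per-object test can be sketched as below; note that the aggregation of the edge variance over the window in step (3), and thus the argument lv_edge_window, is an assumption of this sketch.

```python
import numpy as np

def is_spurious(lv_ring, lv_edge_window, alpha=0.05, beta=0.50, T=0.75):
    """Per-object spurious test: a pixel is a potential ringing pixel
    when its local variance lies between alpha and beta times the edge
    variance aggregated over the nearby window; the object is removed
    when the fraction of potential ringing pixels stays below T.
    lv_ring:        local variances of the pixels of one ringing object
    lv_edge_window: matching aggregated edge-region variances."""
    t_low = alpha * lv_edge_window
    t_high = beta * lv_edge_window
    potential = (lv_ring > t_low) & (lv_ring < t_high)   # PR(i, j)
    return potential.mean() < T          # True -> remove this object
```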

Finally, this whole process results in a binary computational ringing region (CRR) map (MCRR), where all ringing objects are shown which are not removed in the process. Due to this spurious ringing object removal model the CRR map captures more visible ringing regions for a high compression level than for a low compression level of the same image, which is in agreement with human perception of ringing. The impact of compression ratio is not taken into account by any other metric existing in literature. The CRR map of an image compressed at two different compression ratios is shown in Figure 28.

Figure 28 – CRR maps of an image compressed at two different compression ratios: (a) CR = 20, (b) CR = 40


3.3 Ringing Quantification

In [28] ringing is quantified as the mean variance over all pixels in all ringing regions. This variance depends on the amount of texture in the image content as well as on the amount of ringing artifacts. In general, at low quality the variance over pixels within ringing regions becomes higher than at high quality, because there are more visible ringing artifacts.

In our approach all ringing objects surrounding strong perceptual edges and not visually masked by the background are detected. Highly textured ringing objects are already removed, based on a pre-defined masking threshold. However, there may still exist ringing objects which are lightly textured, at a level not exceeding this pre-defined threshold. The effect of these lightly textured areas should be compensated for, so that it does not affect the quantification of ringing.

Full-reference metrics, such as in [27], compare the regions near edges in compressed images to the same regions in the corresponding original images. The regions near edges in the original image are used as reference and the difference with the regions in the compressed image is used as the basis for the ringing quantification. The difference score gives a good indication of the visibility of ringing artifacts only, because the contribution of texture in the image content can be extracted using the original image. No-reference metrics, however, cannot make a comparison to an original image. In [30] the objects most closely related to the ringing regions in the compressed image are used as reference. These objects are compared to the corresponding ringing regions and the difference between them is used as the basis for the ringing quantification. However, as already described in § 2.6, a complex clustering model is used in [30], which makes this approach less attractive for application in a video chain.

Figure 29 - Mean variance over all pixels within ringing objects versus feature objects, for two images at quality levels Q = 25, 40, 55, 70 and 100


In our approach, the feature objects are used as reference for the ringing quantification. This idea is based on the following observation: feature objects are not contaminated with ringing artifacts, because they are too far away from the edges, as can be seen in Figure 23. Therefore, the variance in the feature objects is only dependent on the amount of texture in the image content. This is confirmed by Figure 29, where the mean variance over all pixels within all ringing objects versus all feature objects is shown for two images at five different compression levels (i.e. quality Q = 25, 40, 55, 70, 100). The blue bars represent the mean variance over all pixels in all ringing objects, the red bars the mean variance over all pixels in all feature objects and the green bars their difference. As the quality Q of an image decreases, the mean variance over all pixels in all ringing objects increases (higher blue bars), due to the presence of more visible ringing artifacts. At the same time the mean variance over all pixels in all feature objects remains more or less constant (red bars are of similar height). In general, this is because of the absence of ringing artifacts in the feature objects, while the amount of texture in the image content is not much affected by increasing the compression level. Figure 29 also illustrates that the mean variance over all pixels within all ringing objects is higher for image 2 at quality 100 (uncompressed) than for image 1 at any quality level. This higher mean variance is probably due to the fact that the content of image 2 is somewhat more textured than the content of image 1, and not to the presence of more ringing artifacts. Therefore, it is very doubtful that quantifying ringing based on the mean variance over all pixels in all ringing objects only is a reliable approach. When quantifying ringing as the difference between the mean variance over all pixels in all ringing objects and the mean variance over all pixels in all feature objects, the contribution of texture in the image content is not included, and therefore, the quantification is based on ringing artifacts only.

In our newly proposed metric, ringing is quantified based on the calculation of the mean variance over all pixels per ringing and feature object individually. Because the variance over all pixels in the ringing and feature objects is already calculated in the HVS and spurious ringing object removal parts of the model, only the mean over the pixels for each individual ringing and feature object has to be calculated. Then, the difference between the mean variance over all pixels in a ringing object and the mean variance over all pixels in its corresponding feature object is defined as the ringing visibility of that specific ringing object. This is visually represented in Figure 30, where the ringing objects of an image compressed at three different levels are shown. The color represents the ringing visibility of a ringing object: yellow indicates highly visible ringing artifacts and black the absence of ringing in the specific area. The overall ringing quantification depends on the ringing visibilities of all ringing objects and their corresponding sizes. The mathematical description is given below.

Figure 30 – Ringing visibility of the individual objects in an image compressed at different ratios: (a) Q = 60, (b) Q = 50, (c) Q = 40 (color scale from 0 to 1)


The ringing quantification is calculated by the following steps:

(1) Calculating the mean variance over all pixels in a ringing object. This is defined as:

$$\overline{LV}_{Ri}(R) = \frac{1}{N}\sum_{(i,j)\in RiReg(R)} LV(i,j)$$

where R is the number of a specific ringing object and its corresponding feature object, N the total number of pixels within RiReg(R) and LV(i,j) the local variance of a pixel at location (i,j) within RiReg(R).

(2) Calculating the mean variance over all pixels in a feature object. This is defined as:

$$\overline{LV}_{Fe}(R) = \frac{1}{N}\sum_{(i,j)\in FeReg(R)} LV(i,j)$$

where R is as defined above, N the total number of pixels within FeReg(R) and LV(i,j) the local variance of a pixel at location (i,j) within FeReg(R).

(3) Calculating for all ringing objects the difference between the mean variance over all pixels in a ringing object and the mean variance over all pixels in its corresponding feature object. The ringing visibility of a ringing object is calculated as:

$$VR(R) = \overline{LV}_{Ri}(R) - \overline{LV}_{Fe}(R)$$

where R is as defined above.

(4) The overall ringing quantification for the whole image is calculated as:

$$RQ = \frac{\sum_{R=1}^{T} VR(R)\, N(R)}{\sum_{R=1}^{T} N(R)}$$

where R is as defined above, T the total number of ringing regions and N(R) indicates the amount of pixels of RiReg(R).
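Given per-object masks and the local-variance image, the quantification reduces to a few array reductions, as in the sketch below; the size-weighted average follows the reconstruction above and is therefore an assumption.

```python
import numpy as np

def ringing_score(lv_map, ring_objects, feat_objects):
    """Overall ringing quantification sketch: per object R, visibility
    VR(R) = mean LV inside RiReg(R) minus mean LV inside FeReg(R); the
    score is the size-weighted average over all ringing objects.
    lv_map:       local-variance image LV(i, j)
    ring_objects: list of boolean masks, one per ringing object
    feat_objects: list of matching feature-object masks."""
    num = den = 0.0
    for ri, fe in zip(ring_objects, feat_objects):
        vr = lv_map[ri].mean() - lv_map[fe].mean()   # VR(R)
        num += vr * ri.sum()                         # weight by object size
        den += ri.sum()
    return num / den if den else 0.0
```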

4 Experiments

In this chapter two psychovisual experiments are described: the ringing region experiment and the ringing annoyance experiment. The ringing detection method of our newly proposed metric is evaluated with the data from the ringing region experiment, and the ringing quantification method is evaluated with the data from the ringing annoyance experiment.

4.1 Ringing Region Experiment

To validate our ringing region detection method, a psychovisual experiment was carried out in which subjects were requested to indicate regions of visible ringing artifacts in compressed natural images. These indicated regions were transformed into a subjective ringing region (SRR) map, indicating where in an image on average people perceive ringing.

4.1.1 Experimental procedure

A set of eight source images, reflecting adequate diversity in image content, was taken from the Kodak Lossless True Color Image Suite [49]. Figure 31 shows the source images. All images were high quality PNG color images of size 768x512 (width x height) pixels. These images were JPEG compressed using MATLAB's imwrite function at two different compression levels (i.e. quality Q = 25 and 50). This yielded a test database of sixteen stimuli, shown in Appendix D, with a variation in the visibility of ringing. The experiment was conducted in a laboratory environment with normal indoor illumination levels. The stimuli were displayed on a 17-inch Iiyama ProLite E431s LCD with a screen resolution of 1024x768 pixels, and the viewing distance was 40 cm.

The subjects participating in the study were recruited from the MSc program of the Department of Mediamatics at the Delft University of Technology. The twelve students, eight males and four females, were experienced with image quality assessment and coding artifacts. Before the start of the experiment, an instruction about the goal and procedure of the experiment was given to each individual subject. A training session was conducted showing three examples of synthetic ringing, blocking and blur artifacts, followed by three real-life images in which ringing, blocking and blur artifacts were the most annoying artifacts respectively. When the subject reported that they understood ringing and were able to distinguish it from other types of compression artifacts, a set of images with the same range of ringing visibility as used in the rest of the experiment was presented to the subject. With these examples the participant could practice detecting ringing regions. Images in the training session were different from those used in the real experiment. After training, the stimuli were shown in a random order to each subject in a separate session. The subjects were asked to mark the regions where they perceived ringing artifacts on a piece of paper.


Figure 31 - Source Images of the Ringing Region Experiment

4.1.2 Subjective Data Processing

The spatial location of the marked regions in each image for each subject is transformed into a binary image, in which a white area (i.e. with pixel values of 1) indicates the marked region and a black area (i.e. with pixel values of 0) the unmarked regions. This results for each stimulus in an individual ringing region (IRR) map (i.e. $M_{IRR}$) per subject. Then, a mean ringing region (MRR) map (i.e. $M_{MRR}$) is computed as:

$$ M_{MRR}(i,j) = \frac{1}{m} \sum_{s=1}^{m} M_{IRR}^{s}(i,j) \qquad (51) $$

where $M_{IRR}^{s}$ denotes the IRR map generated for subject s and m denotes the total number of subjects (i.e. m = 12 in this experiment). The subjective ringing region (SRR) map (i.e. $M_{SRR}$) is then derived by removing the outlier regions. This is simply implemented by applying a threshold ($T_r$) to the MRR map as:

$$ M_{SRR}(i,j) = \begin{cases} 1 & \text{if } M_{MRR}(i,j) \geq T_r \\ 0 & \text{otherwise} \end{cases} \qquad (52) $$

where $M_{SRR}(i,j)$ indicates the value of a pixel at location (i,j) in the binary SRR map. $T_r$ is set in this experiment to 0.5, so that the SRR map only contains the ringing regions in the MRR map where at least 50% of the subjects (i.e. six subjects) perceived ringing. This was done to be sure that the regions contained visible ringing artifacts. The ringing regions in the SRR map are adapted so that they all have a similar thickness (9 pixels). Figure 32 illustrates the subjective data processing to obtain the SRR map of a stimulus. All MRR and SRR maps can be found in Appendix C.
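A minimal sketch of this processing step in Python (the function name is hypothetical, and the thickness normalization of the ringing regions is omitted):

```python
import numpy as np

def srr_map(irr_maps, threshold=0.5):
    # irr_maps: list of m binary (H x W) maps, one per subject
    mrr = np.stack(irr_maps).astype(np.float64).mean(axis=0)  # Eq. (51)
    return (mrr >= threshold).astype(np.uint8)                # Eq. (52)
```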


Figure 32 - Subjective data processing for the stimulus Airplane (Q=25)

4.1.3 Performance Evaluation

Our proposed ringing region detection method is validated with respect to the ground truth resulting from the psychovisual experiment. First, an evaluation is performed to check the performance of the individual components of our detection method; next, an evaluation is performed to compare the ringing region detection accuracy of our method with existing alternatives from literature.

The performance evaluation is achieved by comparing the CRR map calculated by a metric to the SRR map derived from the psychovisual experiment. Both are binary images indicating the location of ringing regions in an image, once objectively (i.e. the CRR map) and once subjectively (i.e. the SRR map). Based on the work of [50], two different ways of comparison are distinguished: a visual assessment and a quantitative correlation, both of which are described in more detail below.

A visual assessment provides a subjective evaluation of the correlation between the CRR map and the SRR map. This means that the correlation of both maps is visually represented only. It shows a comparison map ($M_C$), which is an RGB color image generated by:

$$ M_C(i,j) = \big( M_R(i,j),\; M_G(i,j),\; M_B(i,j) \big) \qquad (53) $$

where the red channel $M_R = M_{CRR} \cdot (1 - M_{SRR})$ indicates where the method captures ringing regions that are not subjectively perceived (false positive), the green channel $M_G = M_{CRR} \cdot M_{SRR}$ indicates where the method captures subjectively perceived ringing regions correctly (positive), and the blue channel $M_B = (1 - M_{CRR}) \cdot M_{SRR}$ indicates subjectively perceived ringing regions not captured by the method (negative). Black regions represent the absence of visible ringing on both maps. Thus, assessors can easily see how the two maps are correlated by the colors in the image. An example is shown in Figure 33 and all results can be found in Appendix D. It becomes clear from Figure 33 that additional ringing regions exist in the CRR map that do not occur in the SRR map. This is not surprising, since the SRR maps are derived such that they only maintain ringing regions detected by at least half of the subjects.
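A minimal sketch of such a comparison map in Python (the function name is hypothetical; the maps are assumed to be binary arrays of equal size):

```python
import numpy as np

def comparison_map(crr, srr):
    # Eq. (53): red = false positive, green = correct, blue = missed
    crr, srr = crr.astype(bool), srr.astype(bool)
    mc = np.zeros(crr.shape + (3,), dtype=np.uint8)
    mc[crr & ~srr, 0] = 255   # red: detected but not perceived
    mc[crr & srr, 1] = 255    # green: detected and perceived
    mc[~crr & srr, 2] = 255   # blue: perceived but not detected
    return mc                 # black: no ringing in either map
```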

Figure 33 – Visual assessment

The objective comparison between the CRR and SRR map is quantitatively measured by two correlation coefficients, ρ1 and ρ2. The correlation coefficient ρ1 quantifies the amount of subjectively perceived ringing regions captured correctly by the method (i.e. the green regions in the comparison map) and is calculated following:

$$ \rho_1 = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} M_{CRR}(i,j) \, M_{SRR}(i,j)}{\sum_{i=1}^{m} \sum_{j=1}^{n} M_{SRR}(i,j)} \qquad (54) $$

where m is the total number of pixel rows and n the total number of pixel columns in the image. The numerator indicates the total number of correlated pixels between the CRR and SRR map, and the denominator indicates the number of pixels within the ringing regions in the SRR map. However, this coefficient only quantifies the amount of subjectively perceived ringing regions captured in the CRR map; it is not able to reflect the amount of false ringing regions. A false ringing region is defined as a region detected by the method, and therefore located in the CRR map, which is not located in the SRR map (i.e. the red regions in the comparison map). Quantification of ringing in these false ringing regions leads to extra computational cost and to a degradation of the accuracy of a ringing metric. Therefore, a good ringing detection method will not only capture a high amount of subjectively perceived ringing regions, but also a low amount of false ringing regions. The amount of falsely detected ringing regions is quantified by the coefficient ρ2, which is calculated following:

$$ \rho_2 = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} M_{CRR}(i,j) \, \big(1 - M_{SRR}(i,j)\big)}{\sum_{i=1}^{m} \sum_{j=1}^{n} \big(1 - M_{CRR}(i,j)\big)} \qquad (55) $$

where m and n are as defined above.


The numerator indicates the total number of pixels belonging to false ringing regions, and the denominator indicates the total number of pixels belonging to the black or blue regions of the comparison map. The ringing region detection accuracy is determined by the two correlation coefficients ρ1 and ρ2. Evidently, a higher value of ρ1 combined with a lower value of ρ2 implies a good detection.
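The two coefficients can be computed directly from the binary maps; a minimal sketch in Python (hypothetical function name):

```python
import numpy as np

def region_scores(crr, srr):
    # rho1 (Eq. 54) and rho2 (Eq. 55) from binary CRR and SRR maps
    crr, srr = crr.astype(bool), srr.astype(bool)
    rho1 = (crr & srr).sum() / srr.sum()       # captured perceived regions
    rho2 = (crr & ~srr).sum() / (~crr).sum()   # falsely detected regions
    return float(rho1), float(rho2)
```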

4.1.4 Performance Evaluation Metric Components

First, different alternatives for the individual components in our newly proposed ringing region detection method are evaluated. When alternatives for one component are evaluated, a default configuration is used for the other components: the bilateral filter, the Canny edge detector, and the HVS model with spurious ringing object removal. The following alternatives for the three individual components are distinguished:

(1) Edge preserving smoothing: the Kuwahara filter [41], the median filter [42] and the bilateral filter [43] are compared.

(2) Edge detection: our proposed Canny edge detector with its corresponding grouping model is compared against regular Sobel edge detection with the edge link model from [29] and [30].

(3) HVS model: our proposed HVS model and spurious ringing object removal model are compared against using no HVS model and no spurious ringing object removal at all.

In the first step we compare the EPSFs. First the parameters of each EPSF were tuned to yield the highest performance possible for the set of stimuli used in the experiment. This resulted for the Kuwahara filter in a kernel size of 5x5, for the median filter in a kernel size of 5x5, and for the bilateral filter in a kernel size of 5x5 with a σd of 3 and a σr of 5. Using these parameter settings the ρ1 and ρ2 scores were calculated for each stimulus and each EPSF, the results of which are shown in Figure 34. The ρ1 and ρ2 scores for each EPSF averaged over all stimuli are shown in Table 1. From these data it can be concluded that the bilateral filter gives overall the best results. When using the Kuwahara filter, the number of false regions is low (i.e. a low ρ2 score), but the ρ1 score is also lower than for the other filters, which means that a lot of subjectively perceived ringing regions are not captured. When using the median filter these subjectively perceived ringing regions are captured better (i.e. a high ρ1 score), but a lot of false ringing regions are detected as well (i.e. a high ρ2 score). Because of its good performance in accurately detecting ringing regions, the bilateral filter is used as EPSF in our new ringing metric. Moreover, the bilateral filter has already been shown to run in real time for high-definition video [45].

Average score   Bilateral filter   Kuwahara filter   Median filter
ρ1              0.88               0.78              0.87
ρ2              0.13               0.15              0.21
σ(ρ1)           0.07               0.10              0.07
σ(ρ2)           0.04               0.09              0.09

Table 1 – EPSF evaluation with ρ1 and ρ2 scores averaged over all stimuli


Figure 34 – EPSF ρ1 (top) and ρ2 (bottom) scores for the sixteen stimuli, plotted for the bilateral, Kuwahara and median filters

In the second step we compare the edge detection alternatives. To make a fair evaluation, both edge detectors are evaluated with the same threshold value for the gradient magnitude. In contrast to the Sobel edge detector used in existing methods in literature, the Canny edge detector in this method uses hysteresis thresholding, implying that it requires a high and a low threshold. The threshold for the Sobel edge detector and the low threshold for the hysteresis thresholding in the Canny edge detector are set such that they cumulate 75% of the total amount of pixels in the magnitude histogram of the gradient image. The high threshold for the hysteresis thresholding is set to 85%. Both edge detectors are applied on bilaterally filtered images. The ρ1 and ρ2 scores for the two edge detectors averaged over all stimuli are shown in Table 2. As can be seen, the Canny edge detector in our approach clearly outperforms the Sobel edge detector used in existing methods in literature. Not only are more subjectively perceived ringing regions captured (i.e. a higher ρ1 score), but fewer false regions are detected as well (i.e. a lower ρ2 score). This trend is also found for most individual stimuli, as can be seen in Figure 35.
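A brief sketch of how such percentile-based thresholds can be derived from the gradient-magnitude histogram (the function name and the use of np.percentile are illustrative, not the thesis implementation):

```python
import numpy as np

def hysteresis_thresholds(grad_mag, low_pct=75.0, high_pct=85.0):
    # Thresholds that cumulate low_pct% and high_pct% of all pixels
    # in the gradient-magnitude histogram
    low = np.percentile(grad_mag, low_pct)
    high = np.percentile(grad_mag, high_pct)
    return low, high
```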

Average score   Canny edge detector   Sobel edge detector
ρ1              0.88                  0.82
ρ2              0.13                  0.17
σ(ρ1)           0.07                  0.10
σ(ρ2)           0.04                  0.09

Table 2 – Edge detector evaluation with ρ1 and ρ2 scores averaged over all stimuli


Figure 35 – Edge detector ρ1 (top) and ρ2 (bottom) scores for the sixteen stimuli (Canny vs. Sobel)

In the third step we evaluate the use of our HVS and spurious ringing object removal models. When including these, we do not expect more subjectively perceived ringing regions to be captured; the goal of these models is only to remove as many false ringing regions as possible. Therefore, an important aspect is how well these models are able to distinguish real ringing regions from false ringing regions. When these models do not work well, some of the real ringing regions may be classified as false ringing regions and be removed in the process. Table 3 shows the ρ1 and ρ2 scores when including and excluding both models, averaged over all stimuli. It illustrates that when using our models the ρ1 score is not lowered, while a much lower ρ2 score is obtained. Figure 36, showing the corresponding ρ1 and ρ2 scores per stimulus, demonstrates that only for stimulus six some of the real ringing regions are removed as false ringing regions by our models, resulting in a lower ρ1 score for that specific stimulus. For almost all stimuli, the ρ2 score is halved by our models, implying that about half of the false regions are removed by our HVS and spurious ringing object removal models. Because these models hardly cause any degradation in the detection of real ringing regions and show an excellent performance in the removal of false ringing regions, this is expected to lead to a more accurate quantification of ringing.

Average score   HVS & spurious region removal   No HVS & spurious region removal
ρ1              0.88                            0.88
ρ2              0.13                            0.34
σ(ρ1)           0.07                            0.08
σ(ρ2)           0.04                            0.07

Table 3 – HVS evaluation with ρ1 and ρ2 scores averaged over all stimuli


Figure 36 – ρ1 (top) and ρ2 (bottom) scores for the sixteen stimuli, with and without the HVS & spurious ringing object removal models

4.1.5 Performance Evaluation against Metrics from Literature

In this performance evaluation the ringing region detection method of our metric, the NPRM, is compared against three metrics existing in literature: the RCRM [30], the MFRM [29] and the NRRM [31]. Four examples of experimental results for the visual assessment are given in Figure 37, while all results can be found in Appendix D. The first column shows the stimuli, the second column presents the SRR maps, and the remaining four columns give the comparison maps for the various ringing metrics. The parameters for each metric were tuned to yield the highest performance possible for the set of stimuli used in the experiment. As described earlier, the green color indicates the correlated ringing regions between the CRR and SRR map, the red color indicates the ringing regions in the CRR map but not in the SRR map, and the blue color indicates the ringing regions in the SRR map but not in the CRR map. Black regions represent the absence of visible ringing in both maps. From this comparison map it can easily be seen how the CRR and SRR maps are correlated. Figure 37 illustrates that all ringing metrics can detect most of the subjectively perceived ringing regions. The NPRM, however, detects these ringing regions while limiting by far the detection of false ringing regions, as compared to the other three metrics. This major improvement is most probably caused by the edge extraction model of the NPRM, which preserves only perceptually relevant edges for subsequent ringing region detection. The use of ordinary edge detection (as in the RCRM, MFRM and NRRM) makes the ringing region detection very sensitive to the threshold used: for a high threshold some visually salient edges may not be detected, such that obvious ringing regions are consequently missed, while for a low threshold many irrelevant edges may be retained, which results in the detection of many false ringing regions. Comparing the results of the NRRM to those of the other metrics demonstrates the advantage of using HVS properties in ringing region detection, where the ringing regions visually masked by local image characteristics (i.e. texture and luminance masking) are removed. The absence of a HVS model in the NRRM most probably explains why the CRR maps of the NPRM, RCRM and MFRM exhibit clearly fewer false ringing regions than that of the NRRM. From this visual assessment the NPRM tends to outperform the other three metrics in terms of detecting subjectively perceived ringing regions while limiting the detection of false ringing regions. To further validate this visual evaluation, an objective comparison is carried out.

Figure 37 - Visual assessment of four stimuli with the comparison maps of four metrics (columns: stimuli, SRR maps, NPRM, RCRM, MFRM, NRRM)


For the objective comparison the ρ1 and ρ2 scores are calculated per stimulus for the four metrics. The values averaged over all stimuli can be found in Table 4, while all individual values are given in Figure 38. In terms of detecting subjectively perceived ringing regions (i.e. ρ1), the NPRM outperforms the other three metrics. The averaged ρ1 score of the NPRM is the highest with a score of 0.88; it has the largest difference of 0.16 to the RCRM and the smallest difference of 0.10 to the MFRM. Especially for stimuli 9 and 10 (i.e. "Stream"), the NPRM yields a higher ρ1 score than any of the other metrics. These two stimuli include a lot of texture, which intrinsically makes the detection of ringing regions more difficult. The other metrics fail under this demanding condition mainly due to the disadvantage of the edge detector used. In terms of detecting false ringing regions (i.e. ρ2), the NPRM consistently gives a very low score for all stimuli. The averaged ρ2 score of the NPRM is the lowest with a score of 0.13; it has the largest difference of 0.71 to the NRRM and the smallest difference of 0.08 to the RCRM. It is noted that the RCRM and the MFRM, despite their relatively high averaged ρ2 scores, clearly eliminate a number of false ringing regions with respect to the NRRM. This improvement further confirms the need for a HVS model in the ringing region detection method of a metric.

Average score   NPRM   RCRM   MFRM   NRRM
ρ1              0.88   0.72   0.78   0.76
ρ2              0.13   0.21   0.43   0.84
σ(ρ1)           0.07   0.17   0.19   0.12
σ(ρ2)           0.04   0.06   0.22   0.08

Table 4 – Region detection method evaluation with ρ1 and ρ2 scores averaged over all stimuli

Figure 38 - Region detection method ρ1 (top) and ρ2 (bottom) scores for the sixteen stimuli (NPRM, RCRM, MFRM, NRRM)

4.2 Ringing Annoyance Experiment

To validate our ringing quantification method, a psychovisual experiment was carried out to investigate how annoyed humans are by ringing artifacts in natural images. For this experiment, subjects had to score the perceived ringing annoyance of each compressed image.

4.2.1 Experimental procedure

A set of eleven source images, reflecting adequate diversity in image content, was taken from the Kodak Lossless True Color Image Suite [49]. Figure 40 shows the source images used in this experiment. All images were high quality PNG color images of size 768x512 (width x height) pixels. These images were JPEG compressed using MATLAB's imwrite function at four different compression levels (i.e. quality Q = 25, 40, 55, 70). The original, non-compressed images were also included in the experiment. This yielded a test database of fifty-five stimuli with a large variation in the visibility of ringing. The experiment was conducted in a laboratory environment with normal indoor illumination levels. The display was a Philips Cineos 37" LCD screen (native resolution of 1920x1080 pixels and a screen refresh rate of 60 Hz). The panel had a maximal luminance of 500 cd/m2 and a contrast ratio of 7500:1. The display was driven by an NVIDIA GeForce 8800 Ultra in a desktop computer; the panel was connected via a DVI-to-HDMI cable. The viewing distance from the display was four image heights, which resulted in 60cm.

The subjects participating in the study were recruited from the MSc program of the Department of Mediamatics at the Delft University of Technology. The twenty students, fourteen males and six females, were experienced with compression artifacts. Before the start of the experiment, an instruction about the goal and procedure of the experiment was given to each individual subject. A training session was conducted showing three examples of synthetic ringing, blocking and blur artifacts, followed by three real-life images in which ringing, blocking and blur artifacts were the most annoying artifacts, respectively. Once a subject reported understanding ringing and being able to distinguish it from other types of compression artifacts, a sample set of ten images with different content and compression levels was shown to illustrate the approximate range of ringing artifacts. Then three test images were shown which the subject had to judge. Images in the training session were different from those used in the real experiment. After training, the stimuli were shown in a random order to each subject in a separate session. Each subject reported his or her ringing annoyance judgment on a scale from 0 to 100, where 0 means no ringing annoyance and 100 maximum annoyance, as shown in Figure 39.

Figure 39 – Quality Scale


Figure 40 – Source Images of the Ringing Annoyance Experiment

4.2.2 Subjective Data Processing

After the experiment was finished, the results of all subjects were collected in one single 55 x 20 data matrix. A row contained the scores of all twenty subjects for the same stimulus; a column contained the scores of one subject for all stimuli. A simple outlier detection and subject rejection model was first applied to the data matrix. It involved the following steps:

First the mean and standard deviation for each stimulus were calculated. The individual score of a subject for a stimulus was considered to be an outlier if it was outside an interval of two standard deviations around the mean score for that stimulus. All scores of a subject were rejected if more than five of his or her scores were outliers. Overall, one subject out of twenty was rejected and about 3% of the scores were rejected as being outliers.

The first step of the analysis of the results is the calculation of the mean score for a stimulus, $MS_j$, after applying the outlier detection and subject rejection model, following:

$$ MS_j = \frac{1}{N} \sum_{i=1}^{N} s_{ij} \qquad (56) $$

where $s_{ij}$ is the score of subject i for stimulus j and N is the number of subjects.


When presenting the results of an experiment, all mean scores should have an associated confidence interval, which is derived from the standard deviation and the sample size. It is proposed to use the 95% confidence interval, which is given by:

$$ CI_j = \left[ MS_j - 1.96\,\frac{\sigma_j}{\sqrt{N}},\; MS_j + 1.96\,\frac{\sigma_j}{\sqrt{N}} \right] \qquad (57) $$

where

$$ \sigma_j = \sqrt{\frac{\sum_{i=1}^{N} \big(s_{ij} - MS_j\big)^2}{N - 1}} \qquad (58) $$
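A minimal sketch of this data processing in Python (the function name is hypothetical; the per-stimulus mean and standard deviation for the outlier test are assumed to be computed once on the raw scores):

```python
import numpy as np

def process_scores(scores, max_outliers=5):
    # scores: (n_stimuli, n_subjects); NaN marks rejected entries
    scores = scores.astype(np.float64)
    mean = np.nanmean(scores, axis=1, keepdims=True)
    std = np.nanstd(scores, axis=1, ddof=1, keepdims=True)
    outlier = np.abs(scores - mean) > 2 * std        # outside mean +/- 2 sigma
    scores[outlier] = np.nan
    # Reject subjects with more than max_outliers outlying scores
    scores[:, outlier.sum(axis=0) > max_outliers] = np.nan
    n = np.sum(~np.isnan(scores), axis=1)
    ms = np.nanmean(scores, axis=1)                  # Eq. (56)
    ci = 1.96 * np.nanstd(scores, axis=1, ddof=1) / np.sqrt(n)  # Eqs. (57)-(58)
    return ms, ci
```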

Figure 41 shows the mean score and 95% confidence interval for the fifty-five stimuli of this experiment. The stimuli are ranked from low to high quality, with the images in alphabetical order. From Figure 41 it can be seen that the ringing annoyance scores increase with lower quality within each group of stimuli (i.e. five stimuli with the same image content). As expected, stimuli with a lot of smooth regions and strong edges, such as beach, caps and lighthouse, have high overall annoyance scores, whereas highly textured stimuli, such as landscape and sea, have low overall annoyance scores. This is also shown in Table 5. The confidence interval for the original images is very small, due to the fact that most subjects recognized these as the artifact-free images and gave them a score of zero.

Figure 41 - Mean scores of the fifty-five stimuli and 95% confidence intervals


Quality    Mean Score   95% Confidence Interval
25         64           18.1
40         51           16.9
55         41           18.4
70         29           15.7
Original    3            5.1
Overall    38           14.8

Stimulus    Mean Score      Stimulus     Mean Score
Beach       54              Island       37
Boat        30              Landscape    30
Buildings   31              Lighthouse   47
Caps        47              Motorcross   38
Door        24              Sea          35
Flower      42

Table 5 - Mean scores and 95% confidence intervals

Not all subjects in this experiment used the same range and distribution for scoring. For this reason Z-scores were calculated. The Z-score of subject i for stimulus j, $z_{ij}$, indicates how many standard deviations a score is above or below the mean and is calculated as:

$$ z_{ij} = \frac{s_{ij} - \bar{s}_i}{\sigma_i} \qquad (59) $$

where $\bar{s}_i$ is the mean score of subject i and $\sigma_i$ is the standard deviation of the scores of subject i.

The mean Z-score for stimulus j, $MZ_j$, is calculated as:

$$ MZ_j = \frac{1}{N} \sum_{i=1}^{N} z_{ij} \qquad (60) $$

where N is the total number of subjects.
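A minimal sketch of this Z-score normalization in Python (the function name is hypothetical; the final rescaling to a 0-100 scale anticipates the normalization used for Table 6):

```python
import numpy as np

def mean_z_scores(scores):
    # scores: (n_stimuli, n_subjects); NaN marks rejected entries
    mean_s = np.nanmean(scores, axis=0)         # mean score per subject
    std_s = np.nanstd(scores, axis=0, ddof=1)   # std per subject
    z = (scores - mean_s) / std_s               # Eq. (59)
    mz = np.nanmean(z, axis=1)                  # Eq. (60)
    return 100 * (mz - mz.min()) / (mz.max() - mz.min())  # rescale to 0..100
```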

The mean Z-scores are shown in Table 6, normalized to a scale between 0 and 100. It can be seen that the mean Z-scores correlate with quality: low mean Z-scores correspond to a high quality value. The results of an ANOVA of the experimental data are shown in Table 7. The ANOVA includes "Content", "Quality" and "Subject" as well as their interactions. The last column of the ANOVA illustrates that "Content" and "Quality" are significant factors, as are all interaction effects.


Stimuli      Quality    Z-Score   Stimuli      Quality   Z-Score   Stimuli      Quality   Z-Score
Flower       Original   0         Island       70        33        Caps         55        58
Beach        Original   1         Landscape    55        35        Lighthouse   55        59
Caps         Original   2         Buildings    55        36        Landscape    25        59
Buildings    Original   2         Door         40        37        Island       40        59
Motorcross   Original   2         Flower       70        38        Boat         25        59
Island       Original   2         Boat         55        41        Flower       40        65
Boat         Original   3         Caps         70        41        Motorcross   40        68
Lighthouse   Original   3         Boat         40        43        Sea          25        72
Door         Original   4         Buildings    40        44        Island       25        73
Sea          Original   5         Sea          55        44        Lighthouse   40        74
Landscape    Original   6         Lighthouse   70        45        Motorcross   25        75
Door         70         16        Island       55        45        Caps         40        76
Buildings    70         22        Motorcross   55        48        Beach        55        77
Motorcross   70         23        Landscape    40        50        Buildings    25        78
Landscape    70         24        Beach        70        51        Flower       25        84
Door         55         25        Flower       55        51        Beach        40        86
Boat         70         28        Sea          40        53        Lighthouse   25        89
Sea          70         31        Door         25        54        Caps         25        90
                                                                   Beach        25        100

Table 6 – Mean Z-scores of the fifty-five stimuli

Source              Type III Sum of Squares   df     Mean Square   F        Sig.
Corrected Model     817.21(a)                 324    2.52          9.61     0.00
Intercept           1.27E-008                 1      1.27E-008     0.00     1.00
Content             88.32                     10     8.83          33.67    0.00
Quality             561.83                    4      140.45        535.42   0.00
Subject             1.39                      18     0.07          0.29     0.99
Content * Quality   38.17                     40     0.95          3.63     0.00
Content * Subject   71.81                     180    0.39          1.52     0.00
Quality * Subject   37.93                     72     0.52          2.00     0.00
Error               181.79                    693    0.26
Total               999.00                    1018
Corrected Total     999.00                    1017

Table 7 – ANOVA of the experimental data


Figure 42 shows the normalized mean Z-scores and 95% confidence intervals for the fifty-five stimuli of this experiment. The stimuli are ranked from low to high quality, with the images in alphabetical order. Comparing Figure 42 to Figure 41 illustrates that the confidence intervals are somewhat smaller for the mean Z-scores (Figure 42) than for the raw data (Figure 41). This is because in Figure 42 the confidence intervals are calculated over the Z-scores, which have a less anomalous distribution than the raw annoyance scores of the subjects. Because of the normalization, the confidence intervals of the mean Z-scores also overlap less than the confidence intervals of the mean scores.

Figure 42 – Mean Z-scores of the fifty-five stimuli and 95% confidence intervals


4.2.3 Evaluation Criteria

Currently, the Video Quality Experts Group (VQEG) considers the standardization of quality assessment methods as one of its working directions. In order to quantify the performance of an objective metric, some statistical tools are proposed [51]. The performance of an objective metric can be quantitatively evaluated with respect to its ability to predict subjective quality ratings, based on prediction accuracy, prediction monotonicity and prediction consistency [51].

(1) Prediction accuracy indicates the ability of a metric to predict the subjective ratings with low error, and can be determined by means of the Pearson linear correlation coefficient. The Pearson linear correlation coefficient is calculated as:

$$ r_p = \frac{\sum_{j=1}^{T} \big(MS_{m,j} - \overline{MS}_m\big)\big(MZ_j - \overline{MZ}\big)}{\sqrt{\sum_{j=1}^{T} \big(MS_{m,j} - \overline{MS}_m\big)^2} \sqrt{\sum_{j=1}^{T} \big(MZ_j - \overline{MZ}\big)^2}} \qquad (61) $$

where T is the total number of stimuli, $MS_{m,j}$ is the score of metric m for stimulus j, $\overline{MS}_m$ is the mean score of metric m over all stimuli, $MZ_j$ is the mean Z-score for stimulus j, as calculated in (60), and $\overline{MZ}$ is calculated as:

$$ \overline{MZ} = \frac{1}{T} \sum_{j=1}^{T} MZ_j \qquad (62) $$

where T is as above.

(2) Prediction monotonicity indicates the degree to which the rank order in the metric's predictions agrees with the rank order in the subjective ratings, which can be quantified by the Spearman rank order correlation coefficient. The Spearman rank order correlation coefficient is calculated as:

$$ r_s = 1 - \frac{6 \sum_{j=1}^{T} d_j^2}{T \left( T^2 - 1 \right)} \qquad (63) $$

where T is as above and $d_j$ is the difference in rank between $MS_{m,j}$ and $MZ_j$.


(3) Prediction consistency characterizes the degree to which the metric maintains prediction accuracy over the range of stimuli. It can be measured with the outlier ratio, which is defined as:

$$ OR = \frac{\text{total number of outliers}}{T} \qquad (64) $$

where T is as above and an outlier is defined as a stimulus for which:

$$ \left| MZ_j - MS_{m,j} \right| > 2\,\sigma_{MZ_j} \qquad (65) $$

where $\sigma_{MZ_j}$ represents the standard deviation of the Z-scores for stimulus j. The mean Z-scores are approximately normally distributed, and therefore, twice the $\sigma_{MZ_j}$ value represents the 95% confidence interval. Thus, $2\sigma_{MZ_j}$ represents a good threshold for defining an outlier point.
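A minimal sketch of these three criteria in Python (the function name is hypothetical; SciPy's pearsonr and spearmanr are used in place of hand-written Eqs. (61) and (63)):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def vqeg_criteria(metric_scores, mean_z, z_std):
    # metric_scores, mean_z, z_std: 1-D arrays over the T stimuli;
    # z_std is the per-stimulus standard deviation of the subjects' z-scores
    r_p = pearsonr(metric_scores, mean_z)[0]     # Eq. (61): accuracy
    r_s = spearmanr(metric_scores, mean_z)[0]    # Eq. (63): monotonicity
    outliers = np.abs(mean_z - metric_scores) > 2 * z_std  # Eq. (65)
    return r_p, r_s, outliers.mean()             # Eq. (64): outlier ratio
```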

As suggested in [51], a metric’s performance can also be evaluated with nonlinear correlations using a non-linear mapping function for the objective predictions before computing the correlation, e.g. a logistic or quadratic function may be applied to metric results. Nonlinear correlations, however, have the disadvantage of minimizing performance differences between metrics. Hence, to make a more critical comparison, only linear correlations are calculated here.

4.2.4 Performance Evaluation

In this section the performance of five metrics is evaluated: our proposed metric, the NPRM; one full-reference metric, the HRM [27]; and three no-reference metrics, the RCRM [30], the MFRM [29] and the NRRM [31]. For these metrics the ringing quantification scores, further referred to as metric scores, are compared to the mean Z-scores to see how well they correlate with the human perception of ringing. The parameters for each metric were tuned to yield the highest performance possible for the set of stimuli used in the experiment. The metric scores of all metrics and their correlation with the mean Z-scores can be found in Appendix E. The plot of the metric scores of the NPRM is also shown in Figure 43. The stimuli are ranked from low to high quality with the images in alphabetical order, and the metric scores are normalized to a scale between 0 and 100. From Figure 43 it can be seen that the metric scores increase with lower quality within each group of stimuli, which is similar to the subjective data. The metric scores of the lowest quality stimuli (Q = 25) all lie in the range between 50 and 100, which was also the case for the mean Z-scores, as can be seen in Table 6. Subjects in the experiment were well able and very consistent in detecting the absence of visible ringing in the original stimuli; therefore, the mean Z-scores of the original stimuli all lie in the small range between 0 and 6, with a very small confidence interval, as can be seen in Table 6 and Figure 42. For our metric, however, it is more difficult to detect the absence of visible ringing in the original stimuli: the metric scores of the original stimuli of the NPRM all lie in the range between 0 and 30, as can be seen in Figure 43. This is also the case for the RCRM and MFRM, as shown in Appendix E. To see how well the metric scores of the NPRM correlate with the mean Z-scores, a scatter plot is shown in Figure 44. The x-axis represents the metric scores and the y-axis the mean Z-scores; the colors represent the quality levels as indicated in the legend. In an ideal situation all scores lie on the green line and the correlation between the metric and the subjective data is one. The distance between the mean Z-scores and the metric scores represents their dissimilarity: the farther the points are from the green line, the lower the correlation. Compared to the scatter plots of the other metrics in Appendix E, the points in the scatter plot in Figure 44 lie relatively close to the green line. A trend indicating a better correlation for a specific quality level is not visible.

Figure 43 – Plot of the NPRM scores for the fifty-five stimuli

The Pearson linear correlation coefficient, the Spearman rank order correlation coefficient and the outliers ratio are shown in Table 8. The Pearson linear correlation coefficient of the NPRM is the highest with a score of 0.85; it has the largest difference of 0.35 to the MFRM and the smallest difference of 0.05 to the HRM. The Spearman rank order correlation coefficient of the NPRM is also the highest with a score of 0.85; it has the largest difference of 0.33 to the MFRM and the smallest difference of 0.07 to the RCRM. However, the outliers ratio of the NPRM is 0.15, which means that 8 of the 55 stimuli were outliers. This score seems quite high in comparison to the HRM, which has a score of 0.04. The reason for this is that the subjects in the experiment were well able and very consistent in detecting the absence of visible ringing in the original stimuli. Therefore, the confidence interval for the mean Z-scores is very small for all original stimuli, as can be seen in Figure 42. Because the outlier ratio depends on this confidence interval, 8 of the 11 original stimuli were detected as outliers for the NPRM. For a full-reference metric, such as the HRM, this is not an issue, because the HRM compares each stimulus with its reference. In case of an original this results in a metric score of zero, which is within the confidence interval of each original stimulus. When calculating the outlier ratio over the compressed stimuli only, there are no outliers detected for the NPRM, as can be seen in Table 8.

Figure 44 – Scatter plot of the mean Z-scores versus the NPRM scores (colors indicate the JPEG quality level)

Table 8 shows that the NPRM outperforms the existing no-reference metrics RCRM, MFRM and NRRM. On the Pearson and Spearman correlation it is even better than the full-reference metric, the HRM. This is probably due to the fact that the HRM does not include a HVS model and that its ringing quantification is only based on the difference between pixel intensities. The MFRM and NRRM show a low correlation to perceived ringing probably because their ringing quantification method is too much influenced by texture. The advantage of comparing the variance within ringing regions to reference objects is shown in the high correlation for both the RCRM and NPRM.

                                     NPRM   HRM    RCRM   MFRM   NRRM
Pearson linear correlation           0.85   0.80   0.76   0.50   0.65
Spearman rank order correlation      0.85   0.74   0.78   0.52   0.56
Outliers ratio                       0.15   0.04   0.18   0.40   0.15
Outliers ratio (without originals)   0.00   0.05   0.04   0.30   0.14

Table 8 – Ringing quantification evaluation

5 Conclusion

In this thesis, the approach towards the development of a new ringing metric, which can quantify perceived ringing in compressed still images, is presented. Our ringing metric relies on the luminance component of a compressed image only, which is promising for application in a video chain. The metric is split into two separate parts: a ringing region detection method, i.e. the regions where ringing artifacts can potentially occur are detected, and a ringing quantification method, i.e. the visibility of ringing artifacts within these ringing regions is quantified. It is based on the following innovative features.

Firstly, the ringing region detection method adopts a perceptually more meaningful edge detection model, using edge-preserving smoothing and Canny edge detection, for the purpose of accurate ringing region detection. This intrinsically avoids the drawbacks of applying an ordinary edge detector, which risks missing relevant ringing regions near non-detected edges, or increasing the computational cost by making extra calculations in regions near irrelevant edges.

Secondly, a HVS model is proposed, which contains texture and luminance masking as typical for the HVS. To reduce computational cost, this HVS model is only applied to a small area around the detected ringing regions, instead of to the whole image. It is based on local image characteristics in nearby feature regions, which avoids ringing artifacts being misclassified as texture and their corresponding ringing regions consequently being removed.

Thirdly, a spurious ringing object removal model is applied. This model calculates for all pixels within the ringing objects whether they are affected by ringing, based on a comparison of their local variance with the strength of the nearby edge. Due to this comparison, it can detect and remove unimpaired ringing objects, i.e. ringing objects without any visible ringing artifacts due to low compression, and noisy ringing objects, i.e. ringing objects covered with misclassified edge or texture pixels.

Finally, the ringing quantification method quantifies ringing by taking into account the effect of texture in the image content on the visibility of ringing. This method is based on comparing ringing objects with their corresponding feature objects, which act as a reference. By using these references, the strength of ringing artifacts in lightly textured ringing objects can be better estimated. Therefore, the method is more robust for assessment on a diversity of image content.

Our proposed ringing region detection method and ringing quantification method are validated with respect to the subjectively perceived ringing regions and subjective ringing annoyance scores resulting from two psychovisual experiments. The results of both the ringing region detection method and the ringing quantification method of our proposed metric are highly correlated with the subjective data. Our metric is also compared to alternatives existing in literature, and proves to be promising in terms of both reliability and computational efficiency.

6 Recommendations

Evaluation with subjective data has shown that our newly proposed ringing metric is promising. Our ringing metric was mainly developed with a real-time application in mind. So far, our metric is able to work off-line for still images only. For actual implementation in a video chain, more future work has to be carried out.

The complexity and the amount of calculations are minimized in our approach, both in the detection and in the quantification method. Nonetheless, before applying our metric in a video chain, more research is needed to determine which components can be implemented in hardware and to determine the computational load of our metric. It is known that some important components of our metric, such as bilateral filtering and Canny edge detection, are already implemented in hardware for real-time video [45][55]. Although this looks promising, additional evaluations are needed for the remaining components. When evaluating the computational load, we face the problem that our metric consists of many individual components, which makes it difficult to estimate its total computational cost.

For the extension of our metric from stills to video, a sequence of frames has to be evaluated on perceived ringing. From a perception point of view, however, ringing is quantified over a whole sequence instead of as a series of single images. One way of modeling the perception of ringing in video is to quantify ringing for each individual frame and then average these values over the whole sequence. However, it is unknown whether every frame should be weighted equally; e.g. ringing might be more easily perceived in frames with relatively "still" content than in frames with highly "moving" content, or the other way around. Also, averaging ringing over all frames might not be very attractive, keeping the computational cost in mind. MPEG specifies that the raw frames are compressed into three kinds of frames: intra-coded frames (I-frames), predictive-coded frames (P-frames), and bidirectionally-predictive-coded frames (B-frames). An I-frame is a compressed version of a single uncompressed (raw) frame. P-frames allow the video encoder to store only the changes made with respect to the previous I-frame (full frame). B-frames help to save space by allowing the video encoder to store data with reference to both preceding and following frames. Unlike P-frames and B-frames, I-frames do not depend on data in the preceding or the following frames. It may be possible to reduce computational cost by discarding specific frames from the ringing quantification. Future work should investigate the quantification of ringing over a sequence of frames and look at the possibility of reducing the number of frames, without compromising the overall performance.

The quantification of ringing in our metric is defined by the mean ringing visibility over all ringing objects and their size. So far, our quantification method does not weight the specific location of the ringing artifacts; it does not include the amount of attention a given ringing region in the image gets. In practice, however, a viewer's annoyance by artifacts does not only depend on the artifact visibility, but also on how much attention the corresponding location gets [52]. Therefore, objective metrics are expected to become more reliable when they are weighted with attention maps. Various models to calculate attention maps have been developed, both for efficient object detection and for improved video processing [53][54]. Recent models are based on a bottom-up approach, which integrates complex features of the HVS and simulates a hierarchical perceptual representation of the visual input. This makes most of these models rather complex, and not applicable in real-time processing applications. Hence, for using attention maps in an objective metric applicable in a video chain, simplifications to these models are needed. Recently, a quality metric was proposed that uses a simple region-based attention model [52]. This model takes into account several bottom-up features such as contrast, size, shape, location and background. It is concluded that by including the region-based model the prediction accuracy of the quality metric increases. More research is needed to include in our metric a model of visual attention that can be calculated in real time. Existing models should be simplified, but in such a way that they remain reliable enough. The quantification of ringing should be weighted with attention maps derived from this model, and it should be evaluated whether introducing visual attention really improves the metric's reliability.

7 References

[1] M. Yuen and H. R. Wu, "A survey of hybrid MC/DPCM/DCT video coding distortions", Signal Processing, vol. 70, no. 3, pp. 247-278, November 1998.
[2] M. Shen and C. J. Kuo, "Review of Postprocessing Techniques for Compression Artifact Removal", Journal of Visual Communication and Image Representation, vol. 9, no. 1, pp. 2-14, March 1998.
[3] S. Saha, "Image Compression - from DCT to Wavelets: A Review", ACM Crossroads, vol. 6, no. 3, Spring 2000.
[4] A. K. Jain, "Fundamentals of Digital Image Processing", New Jersey: Prentice Hall Inc., 1989.
[5] N. Ahmed, T. Natarajan and K. R. Rao, "Discrete Cosine Transform", IEEE Transactions on Computers, vol. 23, no. 1, pp. 90-93, January 1974.
[6] G. Strang, "The Discrete Cosine Transform", SIAM Review, vol. 41, no. 1, pp. 135-147, 1999.
[7] C. C. Koh, S. K. Mitra, J. M. Foley and I. Heynderickx, "Annoyance of Individual Artifacts in MPEG-2 Compressed Video and Their Relation to Overall Annoyance", in SPIE Proceedings, Human Vision and Electronic Imaging X, vol. 5666, pp. 595-606, March 2005.
[8] P. Marziliano, F. Dufaux, S. Winkler and T. Ebrahimi, "Perceptual blur and ringing metrics: Application to JPEG2000", Signal Processing: Image Communication, vol. 19, pp. 163-172, 2004.
[9] P. Marziliano, F. Dufaux, S. Winkler and T. Ebrahimi, "A no-reference perceptual blur metric", in Proc. IEEE International Conference on Image Processing, vol. 3, pp. 57-60, 2002.
[10] H. R. Wu and M. Yuen, "A Generalized Block-edge Impairment Metric for Video Coding", IEEE Signal Processing Letters, vol. 4, no. 11, pp. 317-320, November 1997.
[11] F. Pan, X. Lin, S. Rahardja, W. Lin, E. Ong, S. Yao, Z. Lu and X. Yang, "A locally adaptive algorithm for measuring blocking artifacts in images and videos", Signal Processing: Image Communication, vol. 19, pp. 499-506, 2004.
[12] G. Arfken, "Gibbs Phenomenon", §14.5 in Mathematical Methods for Physicists, 3rd ed., Orlando, FL: Academic Press, pp. 783-787, 1985.
[13] Z. Wang and A. C. Bovik, "Modern Image Quality Assessment", Synthesis Lectures on Image, Video & Multimedia Processing, Morgan & Claypool Publishers, 2006.
[14] S. A. Karunasekera and N. G. Kingsbury, "A Distortion Measure for Blocking Artifacts in Images Based on Human Visual Sensitivity", IEEE Trans. Image Processing, June 1995.
[15] W. Osberger, A. J. Maeder and D. McLean, "A Computational Model of the Human Visual System for Image Quality Assessment", in Proc. DICTA-97, pp. 337-342, December 1997.
[16] H. Liu and I. Heynderickx, "A Simplified Human Vision Model Applied to a Blocking Artifact Metric", 12th International Conference on Computer Analysis of Images and Patterns, August 2007.
[17] G. Zhai, W. Zhang, X. Yang and Y. Xu, "Image quality metric with an integrated bottom-up and top-down HVS approach", in Proc. IEEE Vision, Image and Signal Processing, vol. 154, no. 4, pp. 456-460, August 2006.
[18] C. H. Chou and Y. C. Li, "A Perceptually Tuned Subband Image Coder Based on the Measure of Just-Noticeable-Distortion Profile", IEEE Transactions on Circuits and Systems for Video Technology, December 1995.
[19] T. N. Pappas and R. J. Safranek, "Perceptual criteria for image quality evaluation", in Handbook of Image and Video Processing, A. C. Bovik, ed., Academic Press, May 2000.
[20] R. Barland and A. Saadane, "Reference Free Quality Metric for JPEG-2000 Compressed Images", in Proc. IEEE ISSPA, vol. 1, pp. 351-354, August 2005.
[21] H. R. Wu and M. Yuen, "A Generalized Block-edge Impairment Metric for Video Coding", IEEE Signal Processing Letters, vol. 4, no. 11, pp. 317-320, November 1997.
[22] F. Pan, X. Lin, S. Rahardja, W. Lin, E. Ong, S. Yao, Z. Lu and X. Yang, "A locally adaptive algorithm for measuring blocking artifacts in images and videos", Signal Processing: Image Communication, vol. 19, no. 6, pp. 499-506, 2004.
[23] C. H. Chou and Y. C. Li, "A Perceptually Tuned Subband Image Coder Based on the Measure of Just-Noticeable-Distortion Profile", IEEE Transactions on Circuits and Systems for Video Technology, December 1995.
[24] X. Yang, W. Lin, Z. Lu, E. Ong and S. Yao, "Motion-compensated residue preprocessing in video coding based on just-noticeable-distortion profile", IEEE Transactions on Circuits and Systems for Video Technology, vol. 15, no. 6, pp. 742-752, June 2005.
[25] H. Tong, M. Li, H. J. Zhang and C. Zhang, "No-Reference Quality Assessment for JPEG2000 Compressed Images", International Conference on Image Processing, vol. 5, pp. 3539-3542, October 2004.
[26] X. Li, "Blind image quality assessment", in Proc. ICIP 2002, vol. 1, pp. 449-452, September 2002.
[27] P. Marziliano, F. Dufaux, S. Winkler and T. Ebrahimi, "Perceptual blur and ringing metrics: Application to JPEG2000", Signal Processing: Image Communication, vol. 19, pp. 163-172, 2004.
[28] R. Barland and A. Saadane, "Reference Free Quality Metric for JPEG-2000 Compressed Images", in Proc. IEEE ISSPA, vol. 1, pp. 351-354, August 2005.


[29] S. H. Oguz, Y. H. Hu and T. Q. Nguyen, "Image Coding Ringing Artifact Reduction Using Morphological Post-filtering", in Proc. IEEE MMSP, pp. 628-633, 1998.
[30] X. Feng and J. P. Allebach, "Measurement of Ringing Artifacts in JPEG Images", in Proc. SPIE, vol. 6076, pp. 74-83, February 2006.
[31] H. Liu and M. Niemeijer, "Internal Project Report", TU Delft, Image Quality, 2006-2007.
[32] J. Chen, T. N. Pappas, A. Mojsilovic and B. E. Rogowitz, "Adaptive perceptual color-texture image segmentation", IEEE Trans. Image Processing, vol. 14, pp. 1524-1536, October 2005.
[33] D. L. Davies and D. W. Bouldin, "A cluster separation measure", IEEE Trans. Pattern Anal. Mach. Intell., vol. 1, pp. 224-227, April 1979.
[34] E. Hering, "Outlines of a Theory of the Light Sense", Harvard University Press, Cambridge, MA, 1964.
[35] M. D. Fairchild, "Color Appearance Models", Second Edition, John Wiley & Sons, West Sussex, England, 2005.
[36] R. W. G. Hunt, "The Reproduction of Colour", Fourth Edition, Fountain Press, England, p. 380, 1987.
[37] S. Winkler, "Vision Models and Quality Metrics for Image Processing Applications", Ph.D. thesis, 2002.
[38] G. Buchsbaum and A. Gottschalk, "Trichromacy, Opponent Colours Coding and Optimum Colour Information Transmission in the Retina", Proceedings of the Royal Society of London, Series B, Biological Sciences, vol. 220, no. 1218, pp. 89-113, November 1983.
[39] R. Lachman, J. Lachman and E. C. Butterfield, "Cognitive Psychology and Information Processing: An Introduction", The American Journal of Psychology, vol. 92, no. 4, December 1979.
[40] J. Canny, "A Computational Approach to Edge Detection", IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 8, no. 6, pp. 679-698, November 1986.
[41] M. Kuwahara, K. Hachimura, S. Eiho and M. Kinoshita, "Processing of RI-angiocardiographic images", in Digital Processing of Biomedical Images, pp. 187-203, New York: Plenum, 1976.
[42] W. K. Pratt, "Digital Image Processing", New York: Wiley, 1978.
[43] C. Tomasi and R. Manduchi, "Bilateral filtering for gray and color images", in Proc. IEEE ICCV, pp. 836-846, January 1998.
[44] P. Perona and J. Malik, "Scale-space and edge detection using anisotropic diffusion", IEEE Transactions on Pattern Anal. Mach. Intell., vol. 12, no. 7, pp. 629-639, July 1990.
[45] J. Chen, S. Paris and F. Durand, "Real-time edge-aware image processing with the bilateral grid", ACM Transactions on Graphics, vol. 26, no. 3, July 2007.
[46] F. Durand and J. Dorsey, "Fast bilateral filtering for the display of high-dynamic-range images", ACM Transactions on Graphics, vol. 21, no. 3, July 2002.
[47] R. J. Sternberg, "Cognitive Psychology", Fourth Edition, Thomson Wadsworth, 2006.
[48] H. Liu and I. Heynderickx, "A Simplified Human Vision Model Applied to a Blocking Artifact Metric", Lecture Notes in Computer Science, vol. 4673, pp. 334-341, August 2007.
[49] R. Franzen, "Kodak Lossless True Color Image Suite". Available: http://www.r0k.us/graphics/kodak/
[50] N. Ouerhani, R. von Wartburg, H. Hugli and R. Muri, "Empirical validation of the saliency-based model of visual attention", Electronic Letters on Computer Vision and Image Analysis, 2004.
[51] VQEG, "Final report from the Video Quality Experts Group on the validation of objective models of video quality assessment", http://www.vqeg.org/, August 2003.
[52] R. Barland and A. Saadane, "Reference free quality metric using a region based attention model for JPEG-2000 compressed images", in Proc. of SPIE-IS&T Electronic Imaging, vol. 6059, January 2006.
[53] C. Koch and S. Ullman, "Shifts in Selective Visual Attention: Towards the Underlying Neural Circuitry", Human Neurobiology, vol. 4, no. 4, pp. 219-227, 1985.
[54] O. le Meur, P. le Callet, D. Barba and D. Thoreau, "A coherent computational approach to model bottom-up visual attention", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 5, May 2006.
[55] C. Kim and J. Hwang, "Fast and automatic video object segmentation and tracking for content-based applications", IEEE Trans. Circuits and Systems for Video Technology, vol. 12, no. 2, pp. 122-129, February 2002.

8 Appendix

Appendix A

Figure 45 – Example of ringing artifacts in the smooth sky


Appendix B

Figure 46 - Ringing artifacts masked in the textured water while visible in the smooth sky


Appendix C

Figure 47 - Subjective data processing of the Ringing Region Experiment: the stimuli with their MRR maps (Q=25, Q=50) and SRR maps (Q=25, Q=50)


Appendix D

Figure 48 – Visual assessment of stimuli 1-8 (alternating Q=25 and Q=50), with the comparison maps of the NPRM, RCRM, MFRM and NRRM


Figure 49 - Visual assessment of stimuli 9-16 (alternating Q=25 and Q=50), with the comparison maps of the NPRM, RCRM, MFRM and NRRM


Appendix E

Figure 50 - Plot of the NPRM scores for the fifty-five stimuli

Figure 51 - Scatter plot of the mean Z-scores versus the NPRM scores


Figure 52 - Plot of the HRM scores for the fifty-five stimuli

Figure 53 - Scatter plot of the mean Z-scores versus the HRM scores


Figure 54 - Plot of the RCRM scores for the fifty-five stimuli

Figure 55 - Scatter plot of the mean Z-scores versus the RCRM scores


Figure 56 - Plot of the MFRM scores for the fifty-five stimuli

Figure 57 - Scatter plot of the mean Z-scores versus the MFRM scores


Figure 58 - Plot of the NRRM scores for the fifty-five stimuli

Figure 59 - Scatter plot of the mean Z-scores versus the NRRM scores
