ISO/IEC JTC 1/SC 29/WG 1 N 6242
Date: 2012-11-15

ISO/IEC JTC 1/SC 29/WG 1 (ITU-T SG16) Coding of Still Pictures

JBIG: Joint Bi-level Image Experts Group
JPEG: Joint Photographic Experts Group

TITLE: Text of ISO/IEC PDTR 29170-1
SOURCE: AIC ad-hoc group

Editors: Dr. Chaker Larabi and Dr. Thomas Richter
PROJECT: AIC
STATUS: Proposal
REQUESTED ACTIONS:
DISTRIBUTION: WG1 Distribution List

ISO/IEC JTC 1/SC 29/WG 1 N 5598
October 2010

Coding of Still Pictures

JBIG: Joint Bi-level Image Experts Group
JPEG: Joint Photographic Experts Group

TITLE: Guidelines for Codec Evaluation

SOURCE: AIC Ad-Hoc Group
Dr. Chaker Larabi, University of Poitiers, France
Dr. Thomas Richter, University of Stuttgart, Germany
Alexander Schmitt, Fraunhofer IIS, Germany
Dr. Sven Simon, University of Stuttgart, Germany
Simeon Wahl, University of Stuttgart, Germany

PROJECT: AIC

STATUS:

REQUESTED ACTION: For Review

DISTRIBUTION: For WG1 Review

Contact:
ISO/IEC JTC 1/SC 29/WG 1 Convener: Dr. Daniel T. Lee
eBay Inc., 2145 Hamilton Avenue, San Jose, California 95125
eBay China Development Center, Shanghai German Center, 88 Keyuan Road, Zhangjiang Hi-Tech Park, Pudong, Shanghai 201203, China
Tel: +1 408 914 2846, Fax: +1 253 830 0372, E-mail: [email protected]

Contents

2. Evaluation criteria
B.1 Image Dimensions (Normative)
B.2 Mean Square Error and PSNR
B.3 Compression Rate and Bits Per Pixel
C.1.1 Definition (Informative)
C.1.2 Measurement Procedure (Normative)
C.2.1 Definition (Informative)
C.2.2 Measurement Procedure (Normative)
C.3.1 Definition (Informative)
C.3.2 Measurement Procedure
C.4.1 Definition (Informative)
C.4.2 Measurement Procedure (Normative)
C.5.1 Definition (Informative)
C.5.2 Measurement Procedure (Normative)
C.6.1 Definition (Informative)
C.6.2 Measurement Procedure (Normative)
C.6.1 Definition (Informative)
C.6.2 Measurement Procedure (Normative)
C.7.1 Definition (Informative)
C.7.2 Measurement Procedure (Normative)
C.8.1 Definition (Informative)
C.8.2 Measurement Procedure (Normative)
Check for software/MPEG methodology?
C.9.1 Definition (Informative)
C.9.2 Measurement (Normative)
C.10.1 Definition (Informative)
C.10.2 Measurement (Normative)
C.11.1 Definition (Informative)
C.11.2 Measurement (Normative)
D.1 Dimension, Size, Depth and Precision Limitations
D.2 Extensibility
D.3 Backwards Compatibility
D.4 Parsing flexibility and scalability (Peter)
E.1 Photography
E.2 Video Surveillance
E.3 Colour Filter Array (CFA) Compression
E.3.1 Evaluation of Image Quality (for Lossy Compression)
E.3.2 CFA Mean Squared Error
E.3.3 CFA Peak Absolute Error
E.3.4 Reconstructed Image Quality Evaluations
E.3.5 Codec Features
E.3.6 Colour Filter Array Geometry
E.4 Guideline for subjective quality evaluation methods for compressed medical images
E.4.1 Test dataset
E.4.2 Index of compression level: compression ratio
E.4.3 Definition of compression ratio
E.4.4 Range of compression ratios
E.4.5 Definition of optimal compression: visually lossless threshold
E.4.6 Image Discrimination Task
E.5 Archival of Moving Pictures
E.6 Document Imaging
E.7 Biometric and Security Related Applications
E.8 Satellite Imaging and Earth Observation
E.9 Pre-Press Imaging
E.10 Space Physics
E.11 Computer Vision
E.12 Forensic Analysis

Foreword

ISO (the International Organization for Standardization) and IEC (the International Electrotechnical Commission) form the specialized system for worldwide standardization. National bodies that are members of ISO or IEC participate in the development of International Standards through technical committees established by the respective organization to deal with particular fields of technical activity. ISO and IEC technical committees collaborate in fields of mutual interest. Other international organizations, governmental and non-governmental, in liaison with ISO and IEC, also take part in the work. In the field of information technology, ISO and IEC have established a joint technical committee, ISO/IEC JTC 1.

International Standards are drafted in accordance with the rules given in the ISO/IEC Directives, Part 2.

The main task of the joint technical committee is to prepare International Standards. Draft International Standards adopted by the joint technical committee are circulated to national bodies for voting. Publication as an International Standard requires approval by at least 75 % of the national bodies casting a vote.

In exceptional circumstances, the joint technical committee may propose the publication of a Technical Report of one of the following types:

— type 1, when the required support cannot be obtained for the publication of an International Standard, despite repeated efforts;

— type 2, when the subject is still under technical development or where for any other reason there is the future but not immediate possibility of an agreement on an International Standard;

— type 3, when the joint technical committee has collected data of a different kind from that which is normally published as an International Standard (“state of the art”, for example).

Technical Reports of types 1 and 2 are subject to review within three years of publication, to decide whether they can be transformed into International Standards. Technical Reports of type 3 do not necessarily have to be reviewed until the data they provide are considered to be no longer valid or useful.

Attention is drawn to the possibility that some of the elements of this document may be the subject of patent rights. ISO and IEC shall not be held responsible for identifying any or all such patent rights.

ISO/IEC TR 29170-1, which is a Technical Report of type ?, was prepared by Joint Technical Committee ISO/IEC JTC 1, Information technology, Subcommittee SC 29, Coding of audio, picture, multimedia and hypermedia information.


1 Introduction

This Recommendation | International Standard provides a framework to evaluate image compression algorithms for their suitability for various application domains. To this end, this document provides a set of evaluation tools that allow testing multiple features of such codecs. The definition of which features shall be tested is beyond the scope of this document, but Annex C provides suggestions as to which tests are suitable for which application domain.

A1 Test Methodology

How to run an image test, a high-level description. A general procedure: [Look into the VQEG test procedures]

- Take a sufficiently large test image set that is relevant for the application area and that covers a sufficiently wide range of image natures. For that, prepare a matrix of image features that are typical for your application area and your requirements, and ensure that for each feature at least one image is in the test dataset. Such features could include spatial complexity, image size, resolution, bit depth and many others.
- Identify a list of test tools, as specified in this document, that evaluate codec properties relevant for your application area. Such tools might include rate-distortion performance, defined by the relation of PSNR to the average number of bits per pixel of the (compressed) representation of the image for lossy coding, or the average number of bits per pixel for lossless coding. Other tools, as well as precise definitions of the above terms, are given throughout this document; it is, however, recommended to at least apply the above-mentioned tools for lossy and lossless coding, respectively.
- Codecs under evaluation typically offer various configuration options to tune them to a specific application or optimize them for a specific performance index. If so, the coding technology under evaluation should be configured to perform as well as possible for the selected performance index. For example, image compression algorithms may offer options to optimize images for best visual performance or for best PSNR performance. The codecs shall be configured consistently in a way that allows optimal performance for the selected test tool. That is, if PSNR performance is to be evaluated for two codecs, the parameters of both codecs shall be selected such that the PSNR is as high as possible; if SSIM performance is to be evaluated, codec parameters shall be selected such that optimal SSIM performance is to be expected.
- Applying a benchmark or a test tool in the above sense to a codec will generate one curve (e.g. PSNR over bitrate) or one index (compression ratio for lossless coding) per codec and per image. Such evaluation results should be made available completely, without omissions.

[NOTE: Move the following to an appendix, as what has to happen if several tests should be pooled: Furthermore, testing the overall performance of a codec will require pooling individual measurements into one overall score or index. Various pooling strategies may exist, including pooling towards the average score, the worst-case score or the best-case score. Which pooling strategy applies is application dependent, and the pooling strategy deployed should be explicitly stated in the test conditions.]


A.1 Test Sets

Selection of Test Data

Selecting Test Tools

Running Test Tools

2. Test material

The test material to be used for evaluation of an image compression system would comprise the following types and sizes, which seem to reflect the type and size of images in potential applications for which an Advanced Image Compression system will be used. Test material with ground truth will be preferable.

- 2D (spatial)
  - High resolution (from 4K x 2K)
- Spectral (narrow-band and broad-band) / Multispectral / Hyperspectral
  - High Dynamic Range
- 3D (volumetric)
  - Size (from 512 x 512 x 100)
- Image sequence (2D + t)
  - Size (from 2K x 1K x 10 sec @ 24 fr/sec)
- 4D (3D + t)
  - Size (from 128 x 128 x 100 x 10 sec @ 5 fr/sec)
- Stereoscopic (2D (+t) x 2)
  - Size (from 2K x 1K x 10 sec @ 24 fr/sec x 2)
- Multiview (2D (+t) x N)
  - Size (from 2K x 1K x (10 sec @ 30 fr/sec) x 64 views)

Likewise, the type of content for test material should reflect the potential applications of AIC. The following provides some content examples:

- Natural scenes
- Compound (multi-layer)
- Photo-realistic synthetic
- Graphics and animations
- Scanned content


- Content with application-specific noise/texture patterns
- Sensor array images
- Medical imagery
- Remote sensing
- …

This section has to be discussed and refined at the next meeting in Shanghai (China). To be defined.

2. Evaluation criteria

The traditional approach to evaluate the performance of lossless image compression systems has been to verify the size of the compressed test material when compared to an anchor. Other issues, such as complexity, have also been considered in an ad hoc manner. In lossy image compression systems, the method of choice is often the comparison of rate-distortion performance, where rate is the size of the compressed test material and distortion refers to the amount and mechanisms of loss introduced by compression when compared to the originals. The most popular distortion metrics are the Mean Squared Error (MSE) and the Peak Signal to Noise Ratio (PSNR). In some cases, such as the competitive phase of the JPEG 2000 standardization, subjective tests were performed to obtain the Mean Opinion Score (MOS) of the quality of the compressed test material, by a panel of human observers. In addition to the above traditional approaches, others can be envisaged in order to better grasp the performance of a given image compression system. The following summarizes such evaluation criteria:

2.1 Quality measurement

Two types of quality measurement are possible: subjective assessment, which takes the end-observer into the assessment loop, and objective assessment, based on the use of mathematical tools that can use properties of the Human Visual System (HVS).

A.2. Subjective Assessment

Subjective assessment of compressed images and image sequences is a mandatory step for the quality evaluation of current technologies and their improvement. PSNR (Peak Signal to Noise Ratio) and MSE (Mean-Square Error) have been used for a long time for the assessment of compression improvements. However, several studies have demonstrated the lack of correlation between these metrics and human judgment, and one cannot find a universal metric that mimics human behavior. Hence, subjective assessment remains the most reliable way to assess the quality of color images. However, these tests are tedious and time consuming and cannot be run without taking current standards into account. In this section, we describe the principal specifications for the management of a successful subjective campaign.

A.2.1. Specifications of the experimental conditions

Observer's Characteristics:

Observers shall be free from any personal involvement with the design of the psychophysical experiment or the generation of, or subject matter depicted by, the test stimuli. Observers shall be checked for normal vision characteristics as they affect their ability to carry out the assessment task. In most cases, observers should be confirmed to have normal color vision by means of tests such as the Ishihara plates, and should be tested for visual acuity using Snellen or Landolt charts at approximately the viewing distance employed in the psychophysical experiment. The number of observers participating in an experiment shall be significant (at least 15 observers are recommended after outlier rejection).

Stimulus Properties:

The number of distinct scenes represented in the test stimuli shall be reported and shall be equal to or exceed 3 scenes (and preferably should equal or exceed 6 scenes). If fewer than 6 scenes are used, each shall preferably be depicted or alternatively briefly described, particularly with regard to properties that might influence the importance or obviousness of the stimulus differences.

The nature of the variation (other than scene contents) among the test stimuli shall be described in both subjective terms (image quality attributes) and objective terms (stimulus treatment or generation).

Instructions to the Observer:

The instructions shall state what is to be evaluated by the observer and shall describe the mechanics of the experimental procedure. If the test stimuli vary only in the degree of a single artifactual attribute, and there are no calibrated reference stimuli presented to the observer, then the instructions shall direct the observer to evaluate the attribute varied, rather than to evaluate overall quality. A small set of preview images showing the range of stimulus variations should be shown to observers before they begin their evaluations, and the differences between the preview images should be explained.

Viewing Conditions: For monitor viewing, if the white point u',v' chromaticities are closer to D50 than to D65, the white point luminance shall exceed 60 cd/m²; otherwise, it shall exceed 75 cd/m².

The viewing conditions at the physical locations assumed by multiple stimuli that are compared simultaneously shall be matched to such a degree that critical observers see no consistent differences in quality between identical stimuli presented simultaneously at each of the physical locations. The observer should be able to view each stimulus merely by changing gaze, without having to move the head.

Experimental Duration:

To avoid fatigue, the median duration (over observers) of an experimental session, including review of the instructions, should not exceed 45 minutes. For practical reasons, the observer can run a second evaluation session after a rest of at least 30 minutes.

A.2.2. MOS calculation and statistical analysis

The Mean Opinion Score (MOS) provides a numerical indication of the perceived quality of an image or an image sequence after a process such as compression, quantization, transmission and so on. The MOS is expressed as a single number in the range 1 to 5 in the case of a discrete scale (respectively 1 to 100 in the case of a continuous scale), where 1 is the lowest perceived quality and 5 (respectively 100) is the highest perceived quality. Its computation allows studying the general behavior of the observers with regard to a given impairment.


MOS calculation

The interpretation of the obtained judgments is completely dependent on the nature of the constructed test. The MOS $\bar{m}_{jkr}$ is computed for each presentation:

$$\bar{m}_{jkr} = \frac{1}{N} \sum_{i=1}^{N} m_{ijkr}$$   (Eq. 1)

where $m_{ijkr}$ is the score of observer $i$ for the degradation $j$ of the image $k$ and the $r$-th iteration, and $N$ represents the number of observers. In a similar way, the global average scores $\bar{m}_j$ and $\bar{m}_k$ can be calculated, respectively, for each test condition (impairment) and each test image.

Calculation of confidence interval

In order to evaluate the reliability of the results as well as possible, a confidence interval is associated with the MOS. It is commonly adopted that the 95% confidence interval is sufficient. This interval is defined as:

$$\left[\bar{m}_{jkr} - \delta_{jkr},\ \bar{m}_{jkr} + \delta_{jkr}\right]$$   (Eq. 2)

where:

$$\delta_{jkr} = 1.95\,\frac{s_{jkr}}{\sqrt{N}}$$   (Eq. 3)

and $s_{jkr}$ represents the standard deviation defined as:

$$s_{jkr} = \sqrt{\frac{\sum_{i=1}^{N} (\bar{m}_{jkr} - m_{ijkr})^2}{N-1}}$$   (Eq. 4)

Outliers rejection

One of the objectives of the results analysis is to be able to eliminate from the final calculation either a particular score or an observer. This rejection makes it possible to correct influences induced by the observer's behavior or by a bad choice of test images. The most obstructive effect is incoherence of the answers provided by an observer, which characterizes the non-reproducibility of a measurement. In the ITU-R 500-10 standard [ITU00], a method to eliminate incoherent results is recommended. To that aim, it is necessary to calculate the MOS and the standard deviations associated with each presentation. These average values are functions of two variables: the presentations and the observers.

Then, check whether this distribution is normal by using the β2 test, where β2 is the kurtosis coefficient (i.e. the ratio between the fourth-order moment and the square of the second-order moment).


Therefore, the $\beta_{2,jkr}$ to be tested is given by:

$$\beta_{2,jkr} = \frac{\frac{1}{N}\sum_{i}(\bar{m}_{jkr} - m_{ijkr})^4}{\left(\frac{1}{N}\sum_{i}(\bar{m}_{jkr} - m_{ijkr})^2\right)^2}$$   (Eq. 5)

If $\beta_{2,jkr}$ is between 2 and 4, the distribution can be considered normal. In order to compute the $P_i$ and $Q_i$ values allowing the final decision regarding the outliers to be taken, the observations $m_{ijkr}$ for each observer $i$, each degradation $j$, each image $k$ and each iteration $r$ are compared against a combination of the MOS and the associated standard deviation. The different steps are summarized in the algorithm sketched below.
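EXAMPLE (Informative): a minimal C++ sketch of this screening, along the lines of the procedure of ITU-R BT.500-10 as recalled here: thresholds of 2·s for normal and √20·s for non-normal distributions, and rejection of an observer when (Pi + Qi) exceeds 5 % of the judgments while |Pi − Qi|/(Pi + Qi) < 0.3. All names are illustrative, and the exact constants should be taken from the referenced Recommendation.

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // One presentation = one (degradation j, image k, iteration r) triple.
    // scores[p][i] is the score of observer i for presentation p.
    // normal[p] is true if 2 <= beta2 <= 4 for presentation p (Eq. 5).
    // Returns, per observer, true if the observer is rejected as an outlier.
    std::vector<bool> screenObservers(const std::vector<std::vector<double>>& scores,
                                      const std::vector<bool>& normal)
    {
        const std::size_t P = scores.size();         // number of presentations
        const std::size_t N = scores.front().size(); // number of observers
        std::vector<std::size_t> Pi(N, 0), Qi(N, 0);

        for (std::size_t p = 0; p < P; ++p) {
            // MOS (Eq. 1) and standard deviation (Eq. 4) of this presentation.
            double mean = 0.0;
            for (double m : scores[p]) mean += m;
            mean /= N;
            double var = 0.0;
            for (double m : scores[p]) var += (mean - m) * (mean - m);
            double s = std::sqrt(var / (N - 1));

            // Threshold: 2*s for normal distributions, sqrt(20)*s otherwise.
            double t = (normal[p] ? 2.0 : std::sqrt(20.0)) * s;
            for (std::size_t i = 0; i < N; ++i) {
                if (scores[p][i] >= mean + t) ++Pi[i];
                if (scores[p][i] <= mean - t) ++Qi[i];
            }
        }

        std::vector<bool> rejected(N, false);
        for (std::size_t i = 0; i < N; ++i) {
            double total = double(Pi[i] + Qi[i]);
            if (total / P > 0.05 &&
                std::abs((double(Pi[i]) - double(Qi[i])) / total) < 0.3)
                rejected[i] = true;
        }
        return rejected;
    }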

- Single stimulus (low quality content)
- Multiple stimuli (medium and high quality content)

Objective measurement

In the recent literature, hundreds of papers have proposed objective quality metrics dedicated to several image and video applications. Basically, objective evaluation metrics can be categorized into three groups: full-reference (FR), no-reference (NR), and reduced-reference (RR) metrics. FR metrics need full information about the original images and demand ideal images as references, which can hardly be achieved in practice. The traditional FR methods (such as the peak signal-to-noise ratio, PSNR) are based on pixel-wise error and have not always been in agreement with perceived quality measurement. Recently, some FR metrics based on the human visual system (HVS) have been proposed. For instance, Wang et al. introduced in [1] an alternative, complementary framework for quality assessment based on the degradation of structural information. They developed a structural similarity index (SSIM) and demonstrated its promise through a set of intuitive examples. On the other hand, NR metrics aim to evaluate distorted images without any cue from their original ones. However, most of the proposed NR quality metrics are designed for one or a few predefined specific distortion types and are unlikely to generalize to images degraded by other types of distortions. RR metrics lie between FR and NR; they make use of a part of the information from the original images in order to evaluate the visual perception quality of the distorted ones. As the transmitted and stored data are reduced, RR metrics have great potential in some specific applications of quality assessment.

Full reference

Metrics based on MSE

a- Mean square error

The mean square error (MSE) is the cumulative squared error between the compressed and the original image:

$$MSE = \frac{1}{MN} \sum_{y=1}^{M} \sum_{x=1}^{N} \left(R(x,y) - S(x,y)\right)^2$$

where $R(x,y)$ is the original image, $S(x,y)$ is the reproduced version of the original, and $M$ and $N$ are the dimensions of the images. The MSE has been used due to its easy calculation and analytical tractability [Teo and Heeger, 1994].

b- RMSE

The root mean squared error (RMSE) is the square root of the MSE:

$$RMSE = \sqrt{MSE}$$

It is well known that MSE and RMSE are inaccurate in predicting perceived distortion [Teo and Heeger, 1994], and therefore many extensions of these measures have been proposed.

c- SNR

The signal to noise ratio (SNR) of an image is usually defined as the ratio of the mean pixel value to the standard deviation of the pixel values. Due to its simple calculation, the SNR has been widely used in objective quality measurement:

$$SNR = 10 \log_{10}\left(\frac{\frac{1}{MN}\sum_{y=1}^{M}\sum_{x=1}^{N} S(x,y)^2}{MSE}\right)$$

d- PSNR

The Peak Signal to Noise Ratio is a measure of the peak error between the compressed and the original image:

$$PSNR = 20 \log_{10}\left(\frac{255}{RMSE}\right)$$

The higher the PSNR, the better the quality of the reproduction. SNR and PSNR have usually been used to measure the quality of a compressed or distorted image.
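EXAMPLE (Informative): a minimal C++ sketch of the MSE and PSNR computations above for 8-bit grayscale images stored as flat row-major arrays; names and the data layout are illustrative assumptions.

    #include <cmath>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // MSE between a reference image r and a distorted image s,
    // both width*height 8-bit samples in row-major order.
    double mse(const std::vector<std::uint8_t>& r,
               const std::vector<std::uint8_t>& s,
               std::size_t width, std::size_t height)
    {
        double sum = 0.0;
        for (std::size_t i = 0; i < width * height; ++i) {
            double d = double(r[i]) - double(s[i]);
            sum += d * d;
        }
        return sum / double(width * height);
    }

    // PSNR in dB for 8-bit data: 20*log10(255/RMSE).
    // Yields +infinity for identical images (MSE = 0).
    double psnr(double mseValue)
    {
        return 20.0 * std::log10(255.0 / std::sqrt(mseValue));
    }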


Metrics based on structural similarity

e- SSIM

The SSIM (structural similarity) index proposed by Wang et al. [2004] attempts to quantify the visible difference between a distorted image and a reference image. This index is based on the UIQ [Wang and Bovik, 2002]. The algorithm defines the structural information in an image as those attributes that represent the structure of the objects in the scene, independent of the average luminance and contrast. The index is based on a combination of luminance, contrast and structure comparisons. The comparisons are done for local windows in the image; the overall image quality is the mean over all these local windows:

$$SSIM(X,Y) = \frac{1}{N} \sum_{i=1}^{N} ssim(x_i, y_i)$$

where $X$ and $Y$ are the reference and distorted images, $x_i$ and $y_i$ are the image contents in a local window $i$, and $N$ indicates the total number of local windows. Figure X shows the SSIM flowchart, where signal x or signal y has perfect quality and the other is the distorted image.

Figure x: Flowchart of the SSIM metric

f- Multiscale SSIM

A multiscale version of SSIM was proposed by Wang et al. [2003b]. The original and the reproduction are run through the SSIM, where contrast and structure are computed for each subsampled level. The images are low-pass filtered and downsampled by 2. The lightness (l) is only computed in the final step; contrast (c) and structure (s) are computed for each step. The overall values are obtained by multiplying the lightness value with the sum of contrast and structure over all subsampled levels. Weighting parameters for l, c and s are suggested based on experimental results.

g- Complex Wavelet SSIM

Wang and Simoncelli [2005] address the problems SSIM has with translation, scaling and rotation. The solution is to extend SSIM to the complex wavelet domain. In order to apply this metric (CWSSIM) for comparing images, the images are decomposed using a complex version of the steerable pyramid transform. The CWSSIM is computed with a sliding window, and the overall similarity is estimated as the average of all local CWSSIM values. In the objective tests done, CWSSIM outperforms SSIM and MSE. The authors also tested the metric as a similarity measure on 2430 images.
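EXAMPLE (Informative): a compact C++ sketch of the per-window term ssim(xi, yi) of the single-scale SSIM index defined under e- above. It uses the commonly published constants C1 = (0.01·255)² and C2 = (0.03·255)² for 8-bit data and plain (unweighted) windows, a simplification of the original algorithm, which uses an 11x11 Gaussian-weighted sliding window.

    #include <cstddef>
    #include <vector>

    // ssim value of one local window (after Wang et al. 2004, simplified):
    // x and y hold the window samples of the reference and distorted image.
    double ssimWindow(const std::vector<double>& x, const std::vector<double>& y)
    {
        const double C1 = 6.5025;   // (0.01 * 255)^2
        const double C2 = 58.5225;  // (0.03 * 255)^2
        const std::size_t n = x.size();

        double mx = 0.0, my = 0.0;  // window means
        for (std::size_t i = 0; i < n; ++i) { mx += x[i]; my += y[i]; }
        mx /= n; my /= n;

        double vx = 0.0, vy = 0.0, cov = 0.0;  // variances and covariance
        for (std::size_t i = 0; i < n; ++i) {
            vx  += (x[i] - mx) * (x[i] - mx);
            vy  += (y[i] - my) * (y[i] - my);
            cov += (x[i] - mx) * (y[i] - my);
        }
        vx /= n - 1; vy /= n - 1; cov /= n - 1;

        return ((2.0 * mx * my + C1) * (2.0 * cov + C2)) /
               ((mx * mx + my * my + C1) * (vx + vy + C2));
    }
    // The global SSIM index is the mean of ssimWindow over all windows.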

Metrics based on color difference

a- S-CIELAB

Zhang and Wandell [1996] proposed a spatial extension to the CIELAB color metric (Figure x). This metric should fulfill two goals: a spatial filtering to simulate the blurring of the HVS, and consistency with the basic CIELAB calculation for large uniform areas. The image is separated into an opponent-color space, and each opponent color image is convolved with a kernel determined by the visual spatial sensitivity of that color dimension. Finally, the filtered image is transformed into CIE-XYZ, and this representation is transformed using the CIELAB formulae.

Figure x: Flowchart of the S-CIELAB metric

b- iCAM

Fairchild and Johnson [2002] proposed the iCAM model. This model incorporates an image difference metric and is influenced by S-CIELAB and research in many different fields. iCAM adds several preprocessing steps in addition to a spatial filtering. The spatial filtering used serves to modulate spatial frequencies that are not perceptible and to enhance frequencies that are most perceptible. In addition, a module of spatial localization is incorporated to account for the HVS sensitivity to edges. A contrast detection is also found within iCAM to account for local and global contrast. Finally, a color difference formula is used to calculate a color difference map, which can be further analyzed using different statistical methods [Fairchild and Johnson, 2004], such as mean, maximum and minimum.

Metrics based on difference prediction

a- VDP

This algorithm was proposed by Daly [1993]. The goal of the Visible Differences Predictor (VDP) is to determine the degree to which physical differences become visible differences. The author states that this is not an image quality metric; rather, it addresses the problem of describing the differences between two images. The output from this algorithm is an image containing the visible differences between the images. Two different visualization techniques are proposed for the VDP output: the free-field difference map, optimized for compression, and the in-context difference map, showing the output probabilities in color on the reference image. The VDP can be used for all image distortions including blur, noise, algorithm artifacts, banding, blocking, pixellation and tone-scale changes. [NOTE: Provide a pooling strategy: how to pool local error detection probabilities into a global index]

b- Mantiuk, HDR-VDP

Mantiuk et al. [2004] proposed an extension of VDP for HDR images. The extension improves the model's prediction of perceivable differences over the full visible range of luminance and under different adaptation conditions. HDR-VDP takes into account aspects of high-contrast vision in order to predict perceived differences. This model does not take chromatic changes into account, only luminance.

Metrics based on discrimination

a- VDM

Lubin [1995] proposed the visual discrimination model (VDM). This model is based on the just-noticeable-differences (JND) model by Carlson and Cohen [1980]. The model design was motivated by speed and accuracy. Input to the model is a reference image and a distorted version, both grayscale. A set of parameters must be defined based on the viewing conditions. The first step includes a simulation of the optics of the eye, before sampling of the retinal cone mosaic. The sampling is done by a Gaussian convolution and point sampling. The next stage converts the raw luminance signal into units of local contrast based on a method similar to Peli [1990]. After this, each pyramid level is convolved with 4 pairs of spatially oriented filters, and an energy response is computed for each of the 4 filter pairs. Each energy measure is normalized, and each of these values is used as input to a non-linear sigmoid function. The distance between the vectors can be calculated and results in a JND map as output; the values across this map can also be used to calculate the mean, maximum or another statistical measure of the similarity between the images. This single value can further be converted into probability values.

b- Sarnoff JND Vision Model

The Sarnoff JND Vision Model [Lubin, 1997] is a method of predicting the perceptual ratings that observers will assign to a degraded color-image sequence relative to its non-degraded counterpart. The model takes two images, an original and a degraded image, and produces an estimate of the perceptual difference between them. The model performs front-end processing to transform the input signals to light outputs (Y CbCr to Yuv), and the light output is then transformed into psychophysically defined quantities that separately characterize luma and chroma. A luma JND map and a chroma JND map are created. The JND maps are then used for a correlation summary, resulting in a measure of difference between the original and the degraded image. It should be noted that the metric was developed as a video quality metric, showing a high correlation between predicted quality and perceived quality. The model has also been tested on JPEG data, where a high correlation was also found. Lubin concludes that the model has wide applicability as an objective picture quality measurement tool.

Reduced reference

To be completed. [Nothing really widely accepted – research area, probably not to be included here. What we should state is that, at this time, AIC cannot provide any good recommendations in this area.]

No reference

To be completed. X-ness (blurriness, blockiness, graininess, colorfulness, ringing, flickering, …) [The same applies here: nothing generally accepted, we cannot make recommendations.]

A.B. Metrics

This Annex defines metrics evaluating various aspects of image compression algorithms. Their application, the detailed measurement procedures and their combination into benchmarks are defined in Annex B.C.

A.B.1 Image Dimensions (Normative)

The image dimensions describe the size (extent) of the image and the type and precision of the image samples:

- Image depth d: The number of colour channels (or components) encoded in an image. A grey scale image consists of one channel, a colour image of three channels. Additional channels may carry opacity information.
- Image width w: Horizontal extent of the image in image pixels. The horizontal extent may depend on the channel in case image components are sub-sampled; in this case, the width for each image channel shall be recorded.
- Image height h: Vertical extent of the image in pixels. The vertical extent may depend on the channel for subsampled images; in this case, the image height for each image channel shall be recorded.
- Sample type t: Image samples can be encoded in signed or unsigned integer, floating point or fixed point encoding. The following list enumerates some sample types that may be relevant to describe image characteristics:
  - Unsigned integer samples
  - Signed integer samples
  - Floating point samples
  - Fixed point samples
  The sample type t may depend on the channel; if it is channel dependent, the sample type for each channel shall be listed.
- Sample precision b: The sample precision b describes the bit depth of a given data type encoding the image sample values and further qualifies the sample type t. The following additional information shall be recorded for the listed sample types:

Table BA-1

Unsigned integer samples: Total number of bits minimally required to represent the full range of sample values.
Signed integer samples: Total number of bits, including the sign bit, minimally required to represent the full range of sample values.
Floating point samples: The total number of bits to represent the samples, comprised of: the sign bit, the number of exponent bits, and the number of mantissa bits.
Fixed point samples: The total number of bits required to represent the full range of sample values, comprised of: the sign bit if present, the number of integer bits and the number of fractional bits.

NOTE: For example, integer samples in the range [0..1023] are here described as ten bit data, regardless of whether the samples are stored in 16 bit values or packed into ten bits each. Integer values in the range [-128..127] are here classified as 8 bit signed data because the data representation consists of one sign bit and seven magnitude bits.

The Image Dimension Data shall consist of the full set of data defined above, that is, the number of channels, the width and height of each image channel, the sample type of each channel and the sample precision of each channel.
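EXAMPLE (Informative): the Image Dimension Data defined above could be represented as follows in C++; all names are illustrative, not normative.

    #include <cstdint>
    #include <vector>

    enum class SampleType { UnsignedInt, SignedInt, FloatingPoint, FixedPoint };

    // Per-channel description as required by A.B.1.
    struct ChannelDimensions {
        std::uint32_t width;      // w(c), in pixels
        std::uint32_t height;     // h(c), in pixels
        SampleType    type;       // t(c)
        std::uint32_t precision;  // b(c), in bits (see Table BA-1)
    };

    // The image depth d is the number of entries in this vector.
    using ImageDimensionData = std::vector<ChannelDimensions>;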

A.B.2 Mean Square Error and PSNR

Mean Square Error and PSNR approximate image quality in a full reference quality assessment framework.

NOTE: PSNR and MSE correlate only badly to perceived quality.

For a combined compressor/decompressor benchmark, the mean square error between the original and the reconstructed image shall be measured as well. Denote the sample value of the reference image at position x,y in channel c by p(x,y,c), and the sample value of the reconstructed image in channel c at position x,y by q(x,y,c). Denote by d the number of image channels, the width of channel c by w(c) and its height by h(c). Then the Mean Square Error between the reference and the reconstructed image is defined as follows:

$$MSE = \frac{1}{d} \sum_{c=0}^{d-1} \frac{1}{w(c) \cdot h(c)} \sum_{x=0}^{w(c)-1} \sum_{y=0}^{h(c)-1} \left(p(x,y,c) - q(x,y,c)\right)^2$$

PSNR (Peak Signal to Noise Ratio) is a quantity related to the Mean Square Error and defined as follows: Let c denote the image channel, t(c) the sample type of this channel and b(c) the sample precision of this channel (see A.B.1). Then define the quantity m(c) as follows:

Table BA-2

t(c) = signed or unsigned integers: m(c) = 2^b(c) - 1
t(c) = floating point or fixed point: m(c) = 1

The PSNR is then:

$$PSNR = -10 \log_{10}\left(\frac{1}{d} \sum_{c=0}^{d-1} \frac{\sum_{x=0}^{w(c)-1} \sum_{y=0}^{h(c)-1} \left(p(x,y,c) - q(x,y,c)\right)^2}{w(c) \cdot h(c) \cdot m(c)^2}\right)$$   (Eq. 6)

NOTE: The purpose of this measurement is not to define an image quality. A separate benchmark exists for this test. It is rather designed to identify pathological cases where incorrect or unreasonable compressed streams are generated.

A.B.3 Compression Rate and Bits Per Pixel

Bits Per Pixel and the Compression Rate are two metrics that describe the compression performance of image compression codecs. Bits Per Pixel (bpp) is defined from the size L of the compressed image stream in bytes and the Image Dimensions (see A.B.1):

$$bpp = \frac{8 \cdot L}{\sum_{c=0}^{d-1} w(c) \cdot h(c)}$$   (Eq. 7)

The compression rate is defined as follows:

$$C = \frac{\sum_{c=0}^{d-1} b(c) \cdot w(c) \cdot h(c)}{8 \cdot L}$$   (Eq. 8)

The quantity d is the number of channels of the image, w(c) is the horizontal extent of channel c and h(c) the vertical extent of the channel. The number b(c) is the number of bits required to describe the samples of channel c, see Clause A.B.1.
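EXAMPLE (Informative): Eq. 7 and Eq. 8 expressed in C++, reusing the hypothetical ImageDimensionData structure sketched under A.B.1 above (repeated here in minimal form so the example is self-contained).

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct ChannelDimensions { std::uint32_t width, height, precision; };
    using ImageDimensionData = std::vector<ChannelDimensions>;  // as in A.B.1

    // Eq. 7: bits per pixel from compressed size L (bytes) and channel dimensions.
    double bitsPerPixel(std::size_t L, const ImageDimensionData& dims)
    {
        double pixels = 0.0;
        for (const auto& ch : dims) pixels += double(ch.width) * double(ch.height);
        return 8.0 * double(L) / pixels;
    }

    // Eq. 8: compression rate from the same inputs.
    double compressionRate(std::size_t L, const ImageDimensionData& dims)
    {
        double rawBits = 0.0;
        for (const auto& ch : dims)
            rawBits += double(ch.precision) * double(ch.width) * double(ch.height);
        return rawBits / (8.0 * double(L));
    }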

EDITOR'S NOTE: Define hardware specifications.

B.C Benchmarks

This Annex defines several benchmark procedures to measure the performance of image compression algorithms; in addition to the actual measurement itself, each benchmark requires additional deliverables from the metrics defined in Annex A.B. They are listed in Table BC-1 and defined in the corresponding subclauses.

Table BC-1: Deliverables for each Benchmark

Benchmark | Purpose of the test | Primary Deliverable | Secondary Deliverables
Execution Time Benchmark | Estimation of algorithm time complexity under ideal conditions | Execution time per pixel | Hardware specifications, Image Dimensions, PSNR and compression rate
Memory Benchmark | Estimation of space complexity under ideal conditions | Required memory at encoding and decoding time | Hardware specifications, Image Dimensions, PSNR and compression rate
Execution Time vs. Memory Requirement | | Execution time per memory unit | Hardware specifications, Image Dimensions, PSNR and compression rate
Cache Hit Rate Benchmark | Performance estimation of data locality | Cache hit rate | Hardware specifications, Image Dimensions, PSNR and compression rate
Parallel Speedup Benchmark | Degree of data parallelism | Speedup, Serial Speedup, Efficiency, Throughput | Hardware specifications, Image Dimensions, PSNR and compression rate
Error Resilience | Bitrate Variation | Bitrate Variation | PSNR and compression rate
Iterative Compression | | Average Drift and Average PSNR Loss |
Instruction Benchmark | Hardware Complexity, Software Implementation Complexity | |
Power consumption | Refer to JTC1 Green ICT | |

B.C.1 Execution-Time Benchmark

B.C.1.1 Definition (Informative)

Execution time is here defined in terms of a benchmark process that allows the fair comparison of several codec implementations with respect to a given architecture and given source data. It is defined as the average ratio of time per pixel when a codec encodes or decodes a given source. While other definitions of complexity stress either the asymptotic number of operations for source sizes going to infinity, or the number of arithmetic operations per pixel, it was deemed that such definitions ignore the overhead of memory transfers and cache locality, as well as the ability to utilize features such as SIMD found on many modern computer architectures. Readers should understand, however, that the guidelines defined here are only suitable to compare two software implementations on the same hardware running under the same conditions, including codec settings; other definitions of complexity are required for hardware implementations. Such measures are beyond the scope of this clause.

B.C.1.2 Measurement Procedure (Normative)

This subclause defines procedures to measure the execution times required by several implementations, measured as a ratio of time per pixel. To ensure the reliability and reproducibility of the data, the following conditions shall be met:

- The implementations to be compared shall be compiled with full optimization enabled; support for profiling or debugging, if applicable, shall be disabled.
- For benchmarking image compression, the implementations shall be run on the same source data set; a standard set of images is provided by WG1 that could be utilized, but otherwise source data considered typical for the target application shall be used.
  NOTE: The relation between execution time and image size should be expected to be nonlinear in nature due to caching and bandwidth effects; an image test dataset suitable for the desired target application should be agreed upon at the time of the definition of the test.
- Options of the implementations under investigation shall be selected such that the execution speed is maximized, ignoring memory requirements and other constraints. However, execution on multiple cores and/or additional hardware such as GPUs shall be disabled.
  NOTE: Annex XXX defines additional procedures to measure the performance of software implementations on such hardware and the parallelizability of algorithms.
- For benchmarking decompression, the data source depends on whether benchmarking within standards or across standards is intended:
  - For benchmarking within the same standard, decompressor performance shall be measured on the same set of bitstreams/file formats, generated preferably by a reference implementation of the standard;
  - For benchmarking across standards, each decompressor shall be tested on the output of its corresponding compressor.
- Compressors and decompressors shall be measured on identical hardware.
- Software shall be run at maximum available CPU speed on the hardware.


  NOTE: Many modern computer architectures implement the possibility to adjust the CPU speed dynamically depending on the workload. For the purpose of this test, such speed adjustments limit the reproducibility of the test and hence should be disabled. Failing that, a different strategy to ensure maximal CPU speed is to run compression or decompression over several cycles, monitoring the CPU speed and starting the measurement as soon as the operating system has increased the CPU clock speed to its maximum. Often, five to ten cycles on the same data are enough to reach maximum performance.
- Execution time of the software shall be measured over N cycles, ignoring results for the first M < N cycles. M shall be selected large enough to ensure that the CPU is clocked at maximal speed and source data is loaded into memory and partially cached. N shall be selected large enough to ensure stable results within the measurement precision of the system.
  NOTE: Typical values for M and N are 5 and 25, respectively, but such values may depend on the nature of the source data and of the algorithm; initial tests carefully observing the measurement results must be performed to select reasonable values.
- Starting with the (M+1)-st cycle, the following data shall be collected:

  - The total running time tr of the compressor or decompressor for a cycle. This is the end-to-end execution time of the software, not including the time required to load the software into memory, but including the time to load the source data and the time to write the output back.

  - The total I/O time ti required to load source data into the algorithm and to write output back.

  NOTE: Measuring tr and ti typically requires a modification of the software under test. These times can be gathered by using high-precision timers of the operating system or the host CPU. POSIX.1-2001 defines, for example, a function named gettimeofday() that provides the necessary functionality to implement such time measurements. It should furthermore be noted that N, the total number of cycles, shall be selected large enough to ensure suitable precision.
- Measurements shall be repeated for various target bitrates, to be agreed on within the framework of a core experiment using this Recommendation | International Standard. Suggested values are 0.25, 0.5, 0.75, 1.0, 1.5, 2.0 bits per pixel and lossless compression, if applicable.

- For compressor performance, the overall file size So shall be measured for each target bitrate selected.

The result of the benchmark is the average number of milliseconds per megapixel spent for compressing or decompressing an image. It is defined as follows:

$$T := \frac{t_r - t_i}{(N - M) \cdot \sum_{c=0}^{d-1} w(c) \cdot h(c)}$$   (Eq. 9)

where tr and ti are the overall execution times of the program and of the I/O operations, respectively, measured in milliseconds; N is the total number of cycles, M is the number of initial cycles, w(c) is the width of channel c in pixels, h(c) the height of channel c in pixels and d the number of components (channels) of the image.


The values T, together with the compression rate and the PSNR (see Annex A) shall be reported for each implementation benchmarked and for each target bitrate, along with the information on the CPU model, its clock speed and its cache size.
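EXAMPLE (Informative): one way to collect tr and ti with the POSIX gettimeofday() function mentioned in the NOTE above, sketched in C++; loadInput(), compress() and writeOutput() are hypothetical hooks into the implementation under test.

    #include <sys/time.h>

    // Hypothetical hooks into the implementation under test.
    void loadInput();     // I/O: read the source image
    void compress();      // codec work being measured
    void writeOutput();   // I/O: write the compressed stream

    // Wall-clock time in milliseconds, via POSIX gettimeofday().
    static double nowMs()
    {
        struct timeval tv;
        gettimeofday(&tv, 0);
        return tv.tv_sec * 1000.0 + tv.tv_usec / 1000.0;
    }

    // Collect tr (total) and ti (I/O only) over N cycles, discarding
    // the first M warm-up cycles as required by B.C.1.2.
    void benchmark(int N, int M, double& tr, double& ti)
    {
        tr = ti = 0.0;
        for (int cycle = 0; cycle < N; ++cycle) {
            double t0 = nowMs();
            loadInput();
            double t1 = nowMs();
            compress();
            double t2 = nowMs();
            writeOutput();
            double t3 = nowMs();
            if (cycle >= M) {          // discard warm-up cycles
                tr += t3 - t0;
                ti += (t1 - t0) + (t3 - t2);
            }
        }
        // Eq. 9: T = (tr - ti) / ((N - M) * total number of pixels).
    }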

B.C.2 Memory Benchmark

B.C.2.1 Definition (Informative)

This subclause describes the measurement procedures for benchmarking the memory requirements of several implementations. As such, the memory requirements are always specific to implementations and not to algorithms, and the purpose of this benchmark is only to compare two or more implementations side by side. Implementations may offer various modes for compression and decompression of images; this benchmark is designed to measure the peak memory requirement for a compressor or decompressor mode minimizing the required memory. It thus identifies the minimal environment under which an implementation is able to operate. For example, it is beneficial for this benchmark if an implementation is able to provide a sliding-window mode by which only segments of the source image need to be loaded in memory and a continuous stream of output data is generated. Similarly, it is beneficial for a decompressor if it does not need to hold the complete image or compressed stream in memory at once and can decompress the image incrementally. It depends, however, also on the codestream design whether such modes are possible, and it is the aim of this benchmark to identify such shortcomings. It should be understood that algorithms may not perform optimally under memory-constrained conditions, and compression performance in terms of execution time or quality may suffer.

B.C.2.2 Measurement Procedure (Normative)

This subclause defines procedures to measure the memory requirements of two implementations, measured in bytes per pixel.

- For the purpose of this test, the two implementations shall be run on the same source data set; a standard set of images is provided by WG1 that could be utilized.
- Options of the two implementations under investigation shall be selected such that the memory requirements are minimized, ignoring the execution speed and other constraints.
- For benchmarking decompression, the data source depends on whether benchmarking within standards or across standards is intended:
  - For benchmarking within the same standard, decompression performance shall be measured on the same set of bitstreams/file formats, generated preferably by a reference implementation of the standard;
  - For benchmarking across standards, each decompressor shall be tested on the output of its corresponding compressor.
- Compression (resp. decompression) shall be measured on identical hardware architectures.
- During compression/decompression, the memory allocated by the codec under test shall be continuously monitored. The data to be measured is the maximum amount of memory, measured in bytes, allocated by the codec at a time.
  NOTE: Measuring the amount of allocated memory may require installing a patch into the software under inspection. A possible strategy for collecting this data is to replace malloc()/free() and/or operator new/operator delete by a custom implementation performing the following steps (see the sketch at the end of this subclause):


Two global variables B and Bm are initialized to zero at program start. For each allocation of N bytes, N is stored along with the allocated memory block and B is incremented by N. If B becomes greater than Bm, Bm is set to B. Whenever a memory block is released, N is extracted from the memory block and B is decremented by N. Other mechanisms for memory allocation (such as allocating memory from the stack or pre-allocating static memory) may be in use and should be included. On program termination, Bm holds the peak memory requirement to be reported.

- Measurements shall be repeated for increasing image sizes. It is suggested to approximately double the image size until compression or decompression fails.
  NOTE: While WG1 provides a test image set, larger images can be obtained by duplicating the image content vertically and/or horizontally.
- Measurements shall be repeated for various target bitrates, to be agreed on within the framework of a core experiment using this Recommendation | International Standard. Suggested values are 0.25, 0.5, 0.75, 1.0, 1.5, 2.0 bits per pixel and lossless compression, if applicable.

- For compressor performance, the overall file size So shall be measured for each target bitrate selected.
  NOTE: Especially in memory-constrained compression, output rate and target rate might differ significantly. The purpose of this measurement is to estimate the precision up to which a compressor can operate in memory-constrained mode.
- For a combined compressor/decompressor benchmark, the mean square error as defined in Annex A shall be measured.
  NOTE: The purpose of this measurement is not to define an image quality. A separate benchmark exists for this purpose. It is rather designed to identify pathological cases where incorrect or unreasonable compressed streams are generated.

The result of the benchmark is the peak number of bytes per pixel spent for compressing or decompressing the test images. It is defined as follows:

$$A_m = \frac{B_m}{\sum_{c=0}^{d-1} w(c) \cdot h(c)}$$   (Eq. 10)

where Bm is the peak memory requirement of the codec measured in bytes, w(c) is the width of channel c in pixels, h(c) the height of channel c and d the number of components (channels) of the image.

The values Am, bpp and PSNR shall be reported for each implementation benchmarked, for each target bitrate and for each image size, along with the compression rate or bpp and the image dimensions.
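EXAMPLE (Informative): a C++ sketch of the allocation-tracking strategy described above, here via replacement of the global operator new/operator delete; a malloc()/free() wrapper would follow the same pattern. This is one possible instrumentation, not a normative requirement; array forms and alignment handling are omitted for brevity.

    #include <atomic>
    #include <cstdlib>
    #include <new>

    static std::atomic<std::size_t> B{0};   // currently allocated bytes
    static std::atomic<std::size_t> Bm{0};  // peak allocated bytes

    void* operator new(std::size_t n)
    {
        // Store the size in front of the block so operator delete can recover it.
        std::size_t* p = static_cast<std::size_t*>(std::malloc(n + sizeof(std::size_t)));
        if (!p) throw std::bad_alloc();
        *p = n;
        std::size_t cur = B += n;           // B is incremented by N
        std::size_t peak = Bm.load();
        while (cur > peak && !Bm.compare_exchange_weak(peak, cur)) {}
        return p + 1;
    }

    void operator delete(void* q) noexcept
    {
        if (!q) return;
        std::size_t* p = static_cast<std::size_t*>(q) - 1;
        B -= *p;                            // N is extracted from the block
        std::free(p);
    }
    // On program termination, Bm holds the peak memory requirement.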


B.C.3 Cache Hit Rate

B.C.3.1 Definition (Informative)

The cache hit rate is defined as the average number of cache hits compared to the total number of memory accesses. An access of the CPU to memory is said to be a cache hit if a cache can provide the requested data. If the CPU has to fetch the data from memory or from a higher-level cache, the access is called a cache miss. A codec having a high cache hit rate performs accesses in patterns well predicted by the CPU cache. It will typically also perform faster than a comparable codec having a smaller cache hit rate. Cache locality is architecture and implementation specific; both need to be reported in the test results. The purpose of this test is, hence, not an absolute measure, but the fair comparison between implementations. CPUs typically have more than one cache: a relatively small first-level cache, and a second, potentially even a third, level of cache that buffers accesses on first-level cache misses. Cache locality is mostly interesting for the first-level cache, but results are requested for all available caches of a given CPU architecture. Measuring cache locality requires a tool that either has direct access to the CPU cache statistics, or implements a simulation of the CPU in software and measures the cache hit rate within this simulation. While WG1 does not provide such a tool, open source implementations exist that provide the required functionality; e.g. valgrind with its cachegrind front-end implements a software simulation that is suitable for the tests outlined here.

B.C.3.2 Measurement Procedure

This subclause defines procedures to measure the cache locality of two implementations, measured as the percentage of cache hits relative to the total number of cache accesses.

- For the purpose of this test, the two implementations shall be run on the same source data set; a standard set of images is provided by WG1 that could be utilized.
- Options of the two implementations under investigation shall be selected to maximize the cache locality; comparable options shall be used for the test.
  NOTE: Selection of proper coding modes is at the discretion of the operator of the test, though it should be done on a best-effort basis. A couple of pre-tests are suggested to identify coding modes that maximize the cache hit rate. Typically, these modes are similar to the modes that minimize the memory requirements, see B.C.2.2.
- For benchmarking decompression, the data source depends on whether benchmarking within standards or across standards is intended:
  - For benchmarking within the same standard, decompressor performance shall be measured on the same set of bitstreams/file formats, generated preferably by a reference implementation of the standard;
  - For benchmarking across standards, each decompressor shall be tested on the output of its corresponding compressor.
- Compressors and decompressors shall be measured on identical hardware architectures.
- Code shall be executed on a single CPU core. Additional CPU cores and/or separate hardware such as GPUs shall not be utilized by the codec under test, and options shall be selected accordingly.
- During execution, the number of cache accesses Ca and cache hits Ch shall be continuously monitored for all caches available for the CPU architecture the code runs on.
- Measurements shall be repeated for various target bitrates, to be agreed on within the framework of a core experiment using this Recommendation | International Standard. Suggested values are 0.25, 0.5, 0.75, 1.0, 1.5, 2.0 bits per pixel and lossless compression, if applicable.

- For compressor performance, the overall file size So shall be measured for each target bitrate selected.
  NOTE: Especially in memory-constrained compression, output rate and target rate might differ significantly. The purpose of this measurement is to estimate the precision up to which a compressor can operate in memory-constrained mode.
- For a combined compressor/decompressor benchmark, the mean square error as defined in A.B.2 shall be measured.
  NOTE: The purpose of this measurement is not to define an image quality. A separate benchmark exists for this purpose. It is rather designed to identify pathological cases where incorrect or unreasonable compressed streams are generated.

The result of the benchmark is the cache hit ratio Cr in percent, defined as follows:

$$C_r = 100 \cdot \frac{C_h}{C_a}$$

where Ch is the total number of cache hits and Ca is the number of cache accesses. Distinguishing between read and write accesses is not necessary for this test, but if the CPU architecture implements more than one cache, cache hit ratios shall be measured and listed individually. The cache hit ratio for the first-level cache is then the most interesting result. Secondary results of compressor benchmarks are the output bitrate bpp and the PSNR as defined in Annex A.B.

The values Cr, bpp and PSNR shall be reported for each implementation benchmarked, for each target bitrate and for each image size, along with the CPU architecture and the image dimensions.
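EXAMPLE (Informative): with the open-source valgrind tool mentioned in B.C.3.1, a cache simulation of a hypothetical encoder binary can be run as

    valgrind --tool=cachegrind ./encoder input.ppm output.bin

cachegrind reports accesses and misses for the simulated first-level instruction and data caches and for the last-level cache, from which Cr can be computed for each cache; the simulated cache geometry should be reported together with the results. The binary name and file arguments above are illustrative only.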

B.C.4 Degree of Data Parallelism

B.C.4.1 Definition (Informative)

Images consist of a set of data on which operations are executed identically on each element of the data set. If the data dependencies enable the parallel execution on subsets of the data independently, the codec can be implemented in parallel on multi-core CPUs, GPUs or ICs, even if the algorithms of the codec are purely sequential. Therefore, the use of data parallelism for the parallelization of the codec is the most convenient and effective way to achieve a high-performing parallel implementation. The degree of parallelism is defined as the number of independent subsets of data in the above-mentioned sense.

B.C.4.2 Measurement Procedure (Normative)

The degree of data parallelism ddp is measured as the number of data units that can be encoded or decoded independently of each other. In order to relate the degree of data parallelism to the size of the image, the ratio rdp is calculated as rdp = N / ddp, with N being the total number of pixels of the whole image.


NOTE: N can be computed from the image dimensions (see Annex A.B) as $N = \sum_{c=0}^{d-1} w(c) \cdot h(c)$.

NOTE: Determining the degree of data parallelism requires deep knowledge of the algorithm in question and cannot, in general, be measured by an automated procedure.
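EXAMPLE (Informative, hypothetical): if a codec partitions a 2048 x 1024 single-channel image into independently decodable 256 x 256 tiles, then ddp = 8 · 4 = 32 and rdp = N / ddp = (2048 · 1024) / 32 = 65536 pixels per independent data unit.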

Should be applied to software modules; data units should be identified in order to locate bottlenecks.

C.5 Parallel Speedup Benchmark for PC Systems

C.5.1 Definition (Informative) The parallel speedup is defined as the gain in performance of a specific parallel implementation of an algorithm on a multi-core processor or parallel hardware platform (e.g. for embedded applications) versus a sequential implementation on the same hardware platform or processor. The hardware platform depends on the target application. The primary measure to determine the speedup is the wall-clock time. In general, the measured time depends on the image size, which should be reported with each measurement. In addition to the image size, the compression ratio should also be reported with each measurement. The parallel speedup, efficiency and throughput values are derived from the primary time measurement to support the interpretation of the results of this parallel speedup benchmark. The measurement procedure is similar to the execution time benchmark presented in C.1.

C.5.2 Measurement Procedure (Normative) This subclause defines measurement procedures to measure the execution times required by several implementations, measured in milliseconds per megapixel. To ensure the reliability and reproducibility of the data, the following conditions shall be met:

 The implementations to be compared shall be compiled with full optimization enabled; support for profiling or debugging, if applicable, shall be disabled.

 For benchmarking image compression, the implementations shall be run on the same source data set; a standard set of images is provided by WG1 that could be utilized, but otherwise source data considered typical for the target application shall be used.

NOTE: The relation between execution time and image size should be expected to be nonlinear in nature due to caching and bandwidth effects; an image test dataset suitable for the desired target application should be agreed upon at the time of the definition of the test.

 Options of the implementations under investigation shall be selected such that the execution speed is maximized.

 The number of execution units allowed to be used shall be defined in the call for applying this benchmark. Execution units beyond the tested number shall be disabled if possible.

NOTE: For each measurement, the hardware platform or multi-core processor used has to be reported. This includes the number of execution units and their type, the amount of memory and cache available to these execution units, and the interconnection type.

 The amount of memory needed shall be benchmarked as introduced in C.2 (Memory Benchmark).

 For benchmarking decompression, the data source depends on whether benchmarking within


standards or across standards is intended:

o For benchmarking within the same standard, decompressor performance shall be measured on the same set of bitstreams/file formats, generated preferably by a reference implementation of a standard;

o For benchmarking across standards, each decompressor shall be tested on the output of its corresponding compressor.

 Software shall be run at the maximum available CPU speed on the hardware.

NOTE: Many modern computer architectures implement the possibility to adjust the CPU speed dynamically depending on the workload. For the purpose of this test, such speed adjustments limit the reproducibility of the test and hence should be disabled. Failing that, a different strategy to ensure maximal CPU speed is to run compression or decompression over several cycles, monitoring the CPU speed and starting the measurement as soon as the operating system has increased the CPU clock speed to its maximum. Often, five to ten cycles on the same data are enough to reach maximum performance.

 Execution time of the software shall be measured over N cycles, ignoring results for the first M < N cycles. M shall be selected large enough to ensure that the CPU is clocked at maximal speed and the source data is loaded into memory and partially cached. N shall be selected large enough to ensure stable results within the measurement precision of the system. This measurement has to be repeated at least 3 times, reporting the average and the variance of the execution time.

NOTE: Typical values for M and N are 5 and 25, respectively, but such values may depend on the nature of the source data and of the algorithm; initial tests carefully observing the measurement results must be performed to select reasonable values.

 Starting with the M+1st cycle, the following data shall be collected:

o The total running wall-clock time tr of the compressor or decompressor for a cycle. This is the end-to-end execution time of the software, not including the time required to load the software into memory, but including the time to load the source data and the time to write the output back. Time spent waiting for another unit to complete or for a resource to become available shall also be included here.

o The total I/O wall-clock time ti required to load source data into the algorithm and to write output back. This shall not include the time needed for synchronization and communication of the parallel execution units.

o The total wall-clock time tc for communication and synchronization of the execution units.

NOTE: Measuring tr and ti typically requires a modification of the software under test. These times can be gathered by using high-precision timers of the operating system or the host CPU. POSIX.1-2001 defines, for example, a function named gettimeofday() that provides the necessary functionality to implement such time measurements (see the sketch below). It should furthermore be noted that N, the total number of cycles, shall be selected large enough to ensure suitable precision.

 Measurements shall be repeated for various target bitrates to be agreed on within the framework of a core experiment using this Recommendation | International Standard. Suggested values are 0.25, 0.5, 0.75, 1.0, 1.5, 2.0 bits per pixel and lossless compression, if applicable.
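A minimal, non-normative sketch of such a timing loop using gettimeofday() follows; codec_cycle() is a hypothetical stand-in for one compression or decompression cycle, and the values of M and N are examples only:

    #include <sys/time.h>
    #include <stdio.h>

    #define M_WARMUP 5    /* warm-up cycles to ignore (M) */
    #define N_CYCLES 25   /* total cycles (N), with M < N */

    static double now_ms(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);               /* POSIX.1-2001 */
        return tv.tv_sec * 1000.0 + tv.tv_usec / 1000.0;
    }

    static void codec_cycle(void)
    {
        /* hypothetical stand-in for one compression/decompression cycle */
        static volatile double x = 1.0;
        for (int i = 0; i < 1000000; i++) x *= 1.000001;
    }

    int main(void)
    {
        double sum = 0.0, sum2 = 0.0;
        for (int n = 0; n < N_CYCLES; n++) {
            double t0 = now_ms();
            codec_cycle();
            double dt = now_ms() - t0;
            if (n >= M_WARMUP) { sum += dt; sum2 += dt * dt; }
        }
        int k = N_CYCLES - M_WARMUP;
        double mean = sum / k;
        printf("mean %.3f ms per cycle, variance %.3f\n",
               mean, sum2 / k - mean * mean);
        return 0;
    }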

 For compressor performance, the overall file size So shall be measured for each target bitrate selected.

The result of the benchmark is the average number of milliseconds per megapixel spent for compressing or decompressing an image on the chosen number of execution units. It is defined as follows:


T(P) := \frac{t_r - t_i}{(N - M) \cdot \sum_{c=0}^{d-1} w(c) \cdot h(c)}    (Eq. 11)

where tr and ti are the overall execution times of the program and of the I/O operations, respectively, measured in milliseconds, N is the total number of cycles, M is the number of initial cycles and P is the number of parallel units.

NOTE: The scheduling overhead time tc is by definition already included in the overall time tr.

The values T, the compression rate R and the PSNR as defined in Annex B shall be reported for each implementation benchmarked and for each target bitrate, along with the information on the CPU model, its clock speed, the number of cores and their cache sizes. Further deliverables are the relative speedup S(P), the efficiency E(P) and the throughput. They require performing the measurement steps above with a variable number of computing cores, P, deployed for compression or decompression.

 The speedup is defined as S(P) := T(1)/T(P), where T(P) is the time needed for the parallel version of the algorithm to complete on P CPUs/nodes.

 For T(1), the time needed for the parallel version to complete on one CPU shall be used (giving the relative speedup) unless specified otherwise. If T(1) cannot be obtained due to algorithmic constraints and an estimate for this number has been computed instead (see NOTE), the results must state how this estimate has been obtained.

NOTE: Unlike benchmark C.1, the running time T(1) includes unavoidable overhead to allow parallel execution, even though it is not used in this test.

NOTE: On some platforms it might not be possible to obtain T(1). In this case, the speedup needs to be given using a different reference value than the sequential one. Be aware that this ratio does not scale linearly in most cases.

 If a sequential version of the same algorithm is available for the same platform, the real speedup shall also be given, that is, the ratio

Sr := T/T(P), where T is measured as defined in C.1.

NOTE: The real speedup is the factor by which a parallel version on P computation units runs faster than the best sequential implementation of this algorithm on the same platform. For applications where the sequential version is an option, the C.1 measurement might be run; in this case, the real speedup can easily be computed from the measurements already taken. If only a parallel version is of interest, there is no need to provide an optimized sequential version of the algorithm in addition.

 Efficiency is defined as E(P) := S(P)/P (see the sketch below).

The values shall be reported for multiple combinations of parallel computing cores P and multiple image sizes. At least four different values for P, including P = 1, should be used.
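The derivation of S(P) and E(P) from the measured times is a simple post-processing step; in the following sketch the T(P) values are illustrative placeholders, not measurements:

    #include <stdio.h>

    int main(void)
    {
        const int    P[]   = {1, 2, 4, 8};
        const double Tms[] = {100.0, 52.0, 28.0, 16.0};  /* hypothetical T(P) */
        const int    n     = (int)(sizeof(P) / sizeof(P[0]));

        for (int i = 0; i < n; i++) {
            double S = Tms[0] / Tms[i];   /* relative speedup S(P) = T(1)/T(P) */
            double E = S / P[i];          /* efficiency E(P) = S(P)/P */
            printf("P=%d  T=%6.1f ms  S=%.2f  E=%.2f\n", P[i], Tms[i], S, E);
        }
        return 0;
    }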


 Throughput is defined as the number of pixels compressed or decompressed per unit of time:

C(P) := \frac{\sum_{c=0}^{d-1} w(c) \cdot h(c)}{T(P)}    (Eq. 12)

The output of this benchmark consists of the following information:

 The time T(P),

 The compression rate (see Annex B),
 The amount of memory required, as defined by benchmark C.2,
 The variance of the wall-clock time T(P) over several measurement cycles,
 The throughput C(P),
 The relative speedup S(P),
 The efficiency E(P).

C.6 Implementation Benchmark for Parallel PC System Utilization (Balancing)

C.6.1 Definition (Informative) Parallel system utilization is defined as the degree of utilization of all processors of a multi-core processor system during the parallel execution of a process. This value is an indicator of how well smaller processes are distributed and executed on parallel execution units. A well-balanced workload does not necessarily yield a higher execution speed, but parallel system utilization can be used to describe potential free processing resources.

C.6.2 Measurement Procedure (Normative) This subclause defines measurement procedures to measure the execution times required by several implementations, measured in milliseconds per megapixel. To ensure the reliability and reproducibility of the data, the following conditions shall be met:

 All CPUs/cores/nodes of the evaluation system shall have identical computing power, i.e. this benchmark is only suitable for symmetric multiprocessing hardware.

 The implementations to be compared shall be compiled with full optimization enabled; support for profiling or debugging, if applicable, shall be disabled.

 For benchmarking image compression, the implementations shall be run on the same source data set; a standard set of images is provided by WG1 that could be utilized, but otherwise source data considered typical for the target application shall be used.

NOTE: The relation between execution time and image size should be expected to be nonlinear in nature due to caching and bandwidth effects; an image test dataset suitable for the desired target application should be agreed upon at the time of the definition of the test.


 Options of the implementations under investigation shall be selected such that the execution speed is maximized.

 The number of execution units allowed to be used shall be defined in the call for applying this benchmark. Execution units beyond the tested number shall be disabled if possible.

NOTE: For each measurement, the hardware platform or multi-core processor used has to be reported. This includes the number of execution units and their type, the amount of memory and cache available to these execution units, and the interconnection type.

 The amount of memory needed shall be benchmarked as defined in C.2 (Memory Benchmark).

 For benchmarking decompression, the data source depends on whether benchmarking within standards or across standards is intended:

o For benchmarking within the same standard, decompressor performance shall be measured on the same set of bitstreams/file formats, generated preferably by a reference implementation of a standard;

o For benchmarking across standards, each decompressor shall be tested on the output of its corresponding compressor.

 Software shall be run at the maximum available CPU speed on the hardware.

NOTE: Many modern computer architectures implement the possibility to adjust the CPU speed dynamically depending on the workload. For the purpose of this test, such speed adjustments limit the reproducibility of the test and hence should be disabled. Failing that, a different strategy to ensure maximal CPU speed is to run compression or decompression over several cycles, monitoring the CPU speed and starting the measurement as soon as the operating system has increased the CPU clock speed to its maximum. Often, five to ten cycles on the same data are enough to reach maximum performance.

 Execution time of the software shall be measured over N cycles, ignoring results for the first M < N cycles. M shall be selected large enough to ensure that the CPU is clocked at maximal speed and the source data is loaded into memory and partially cached. N shall be selected large enough to ensure stable results within the measurement precision of the system. This measurement has to be repeated at least 3 times, reporting the average and the variance of the execution time.

NOTE: Typical values for M and N are 5 and 25, respectively, but such values may depend on the nature of the source data and of the algorithm; initial tests carefully observing the measurement results should be performed to select reasonable values.

 Starting with the M+1st cycle, the following data shall be collected:

o The total running wall-clock time tr of the compressor or decompressor for a cycle. This is the end-to-end execution time of the software, not including the time required to load the software into memory, but including the time to load the source data and the time to write the output back. Time spent waiting for another unit to complete or for a resource to become available shall also be included here.

o The total I/O wall-clock time ti required to load source data into the algorithm and to write output back. This shall not include the time needed for synchronization and communication of the parallel execution units.

o The total wall-clock time tc for communication and synchronization of the execution units.

NOTE: Measuring tr and ti typically requires a modification of the software under test. These times can be gathered by using high-precision timers of the operating system or the host CPU.


POSIX.1-2001 defines, for example, a function named gettimeofday() that provides the necessary functionality to implement such time measurements. It should furthermore be noted that N, the total number of cycles, shall be selected large enough to ensure suitable precision.

 Measurements shall be repeated for various target bitrates to be agreed on within the framework of a core experiment using this Recommendation | International Standard. Suggested values are 0.25, 0.5, 0.75, 1.0, 1.5, 2.0 bits per pixel and lossless compression, if applicable.

 For compressor performance, the overall file size So shall be measured for each target bitrate selected.

The result of the benchmark is the average parallel system utilization for compressing or decompressing an image on the chosen number of execution units. It is defined as follows:

t_p(A) := \sum_{a=0}^{A-1} t_{process}(a)    (Eq. 13)

A is the number of execution unit state changes, indicating the switches from the idle state to the processing state. t_{process}(a) denotes the uninterrupted period during which the execution unit stays in the processing state after its a-th switch. t_p(A) therefore denotes the overall processing time assigned to the p-th execution unit.

NOTE: The scheduling overhead time tc is by definition already included in t_p(A).

t_{active}(P) := \sum_{p=0}^{P-1} t_p(A)    (Eq. 14)

P is the total number of parallel execution units considered. t_{active}(P) is therefore the overall time all execution units spend in the processing state. The parallel system utilization is defined as:

U(P) := \frac{t_{active}(P)}{(t_r - t_i) \cdot P} \le U(A) \le 1    (Eq. 15)


If U(A) equals 1, the job could have been, or actually has been, done in sequential order. The above indicators require performing the measurement steps above with a variable number of computing cores, P, deployed for compression or decompression. The following values shall be reported:

 The system utilization U(P),
 The compression rate (see Annex B),
 The amount of memory required, as defined by benchmark C.2,
 The throughput C(P).
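A minimal sketch of the utilization computation, following the reconstruction of Eq. 15 given above (the accumulated busy time is normalized by the wall-clock processing time per unit multiplied by P); all numbers are illustrative placeholders:

    #include <stdio.h>

    int main(void)
    {
        const int    Punits   = 4;                      /* execution units */
        const double t_busy[] = {9.0, 8.5, 7.0, 4.5};   /* t_p per unit, ms */
        const double tr = 10.5, ti = 0.5;               /* total and I/O time, ms */

        double t_active = 0.0;                          /* Eq. 14 */
        for (int p = 0; p < Punits; p++)
            t_active += t_busy[p];

        /* U(P) <= 1; U(P) = 1 means every unit was busy for the whole run */
        double U = t_active / ((tr - ti) * Punits);
        printf("t_active = %.1f ms, U(P) = %.2f\n", t_active, U);
        return 0;
    }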

C.6 Error Resilience (Peter and co.) Relevant for archiving and transmission of images: the degree of damage to an image in the case of bit errors in the coded stream, e.g. the maximum number of damaged pixels for a 1, 2, 5 or 10 MPixel image in the case of a single bit error.

C.6.1 Definition (Informative) Error resilience measures the robustness of an image representation scheme when the image information is transmitted over a noisy (lossy) channel that introduces errors into the transmission.

C.6.2 Measurement Procedure (Normative) Burst errors and randomly distributed errors: TBC.

C.7 Bitrate Variation

C.7.1 Definition (Informative) For some applications, it is important that a compression codec is able to generate a continuous stream of symbols, ensuring that some output is generated at least once in every given time span, i.e. that the output bitrate does not vary too much over time. For example, carry-over resolution in arithmetic coding might cause arbitrarily long delays in the output until the carry can be resolved. For the purpose of this test, the output bitrate is defined as the number of output symbols generated for each input symbol, measured as a function of the percentage of the input stream fed into the codec. If a compressor can generate parts of its output from incomplete input, it is said to be online-capable, see also D.4. This test is defined only for such codecs, and if a codec offers several modes of operation, such a mode must be selected for the purpose of this test.

C.7.2 Measurement Procedure (Normative) This subclause defines the measurement procedure to empirically determine the maximum bitrate variation of an image compression codec. To enable such a test, the compression codec under evaluation must provide a mode of operation under which an input image can be fed iteratively, in smaller units here called minimal coding units, into the codec. Depending on the compressor, minimal coding units can be individual pixels, stripes or blocks of image data. Such an operation mode is also called online-capable. If the codec is not online-capable, this benchmark cannot be implemented.


If possible, this benchmark should be complemented by a theoretical analysis of the worst-case bitrate variance.

 If the compressor under inspection offers several compression modes, then online-capable modes shall be used and the mode minimizing the bitrate variation shall be selected for the purpose of the test.

 The minimal coding unit of the compressor shall be reported.

 The benchmark shall be performed on a suitable test image set agreed upon before running the benchmark. Such a suite of test image data can be obtained from WG1.

 For each coding unit U fed into the compressor, the number of image bits Bi contained in this unit shall be computed. It is defined as follows:

B_i = \sum_{c \in U} \sum_{(x,y) \in U} b(x,y,c)    (Eq. 16)

where U is the coding unit, c runs over the image channels and (x,y) are the pixel positions within the image; the sum runs over all pixels within the coding unit. The quantity b(x,y,c) is the bit precision of the pixel, as defined in Annex D.

 For each coding unit fed into the compressor, the output bitrate shall be measured. It is defined by

Bo  8 b Where b is the number of bytes (octets) leaving the codec. NOTE: It might be required to install a patch into software under evaluation to measure the quantity Bo.

 The bitrate for unit U is defined as R(U) := Bo/Bi, the number of output bits generated for each input bit.

 R(U) shall be measured for each coding unit going into the compression codec until all of the image is compressed, giving the bitrate as a function of the percentage of completion of the image.

The output of this benchmark is the variance of the bitrate. Additionally, the compression rate and the PSNR shall be reported; see Annex B for a definition of these terms.
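A minimal sketch of the variance computation over the per-unit bitrates R(U); the Bi/Bo values are illustrative placeholders that would, in practice, be produced by the (possibly patched) codec under evaluation:

    #include <stdio.h>

    int main(void)
    {
        const double Bi[] = {4096, 4096, 4096, 4096, 4096};  /* input bits per unit */
        const double Bo[] = {1024,  900, 1400,  800, 1100};  /* output bits per unit */
        const int    n    = (int)(sizeof(Bi) / sizeof(Bi[0]));

        double sum = 0.0, sum2 = 0.0;
        for (int u = 0; u < n; u++) {
            double R = Bo[u] / Bi[u];     /* R(U) = Bo/Bi for coding unit u */
            sum  += R;
            sum2 += R * R;
        }
        double mean = sum / n;
        printf("mean bitrate %.3f, variance %.6f\n", mean, sum2 / n - mean * mean);
        return 0;
    }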

C.8 Iterative Compression Loss

C.8.1 Definition (Informative) Generation loss is a loss in image quality that occurs if the output of a lossy image compression/reconstruction cycle is recompressed again under the same conditions by the same codec. If this recompression is repeated over several cycles, severe degradation of the image quality can result. A codec that operates losslessly on its own decompression output is called idempotent.


Generation loss limits the number of repeated compressions/decompressions in an image processing chain if repeated recompression generates significantly more distortion than a single compression/decompression cycle. This subclause distinguishes between drift and quality loss. While the former is due to a systematic DC error, often caused by a miscalibration in the colour transformation or the quantizer, the latter covers all other error sources as well, for example errors due to limited precision in the implementation of the image transformation.

C.8.2 Measurement Procedure (Normative) This subclause defines measurement procedures for the generation quality loss and the DC drift of a single codec. To this end, a compressor and a decompressor must be provided. If the compressor/decompressor specification allows certain freedoms, generation loss also depends on the implementation, and generation loss shall then also be measured against a reference implementation, if available.

 Select a compressor/decompressor pair whose generation loss is to be measured. At least the vendor compressor shall be measured against the vendor decompressor. If a reference implementation is available, the vendor compressor shall also be measured against the reference decompressor, and the reference compressor shall be measured against the vendor decompressor.

 Select coding/decoding parameters that minimize the generation loss, if applicable; execution speed shall be neglected for the purpose of this test.

 Repeat the following steps for a selection of test images typical for the application domain of the codec. For natural images, WG1 provides such a test image set.

 Repeat the steps for several target bit rates to be agreed on within the framework of a core experiment using this Recommendation | International Standard. Suggested values are 0.25, 0.5, 0.75, 1.0, 1.5 and 2.0 bits per pixel.

 Set n to zero and the image I0 to the source image.

 Compress the image In to the target bit rate, decompress the resulting stream giving image In+1, and measure the following data for n > 0:

o The achieved bit rate of the compressor, defined as the average number of bits per pixel,

o The PSNR PSNRn between the first generation image I1 and In+1.

o The drift error Dc for each channel/component between the images, defined as:

D_c = \frac{1}{w(c) \cdot h(c)} \sum_{x=0}^{w(c)-1} \sum_{y=0}^{h(c)-1} \left( p(x,y,c) - q(x,y,c) \right)    (Eq. 17)

where w is the width, h the height and d the number of components/channels of the source image, p(x,y,c) is the channel value of the pixel at position (x,y) in channel c of the original image, and q(x,y,c) is the channel value at the same location in the distorted image.

 Increment n and repeat the step above, continuously measuring PSNRn and Dc,n for each iteration. These steps shall be carried out N times, where N should be at least five.

The result of the test is the average drift Dc per component c and the average PSNR loss per generation, PSNR, defined as follows:


D_c = \frac{1}{N-1} \sum_{n=1}^{N-1} D_{c,n} \qquad PSNR = \frac{1}{N-1} \sum_{n=1}^{N-1} PSNR_n    (Eq. 18)

where N is the number of iterations (generations) over which the measurement has been performed.

The value Dc is called the average drift; the value PSNR is the average quality loss.

NOTE: Ideally, Dc and PSNR should tend to zero for growing N. A nonzero value of Dc typically indicates a problem in the implementation under investigation.
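A minimal, non-normative sketch of the iteration on a single-channel toy image; roundtrip() is a deliberately non-idempotent stand-in (neighbour averaging plus coarse quantization) for a real compression/decompression cycle, so that a per-generation loss becomes visible:

    #include <math.h>
    #include <stdio.h>
    #include <string.h>

    #define W 8
    #define H 8
    #define NGEN 5

    /* toy "codec": smoothing followed by coarse quantization */
    static void roundtrip(const unsigned char *in, unsigned char *out, int n)
    {
        for (int i = 0; i < n; i++) {
            int left = (i > 0) ? in[i - 1] : in[i];
            int v = (left + in[i] + 1) / 2;
            out[i] = (unsigned char)((v / 4) * 4);
        }
    }

    int main(void)
    {
        unsigned char I0[W * H], In[W * H], I1[W * H], next[W * H];
        for (int i = 0; i < W * H; i++) I0[i] = (unsigned char)(i * 3);

        memcpy(In, I0, sizeof(In));
        for (int n = 0; n < NGEN; n++) {
            roundtrip(In, next, W * H);            /* produces I_{n+1} */
            if (n == 0) memcpy(I1, next, sizeof(I1));
            if (n > 0) {
                double mse = 0.0, drift = 0.0;     /* vs. first generation I1 */
                for (int i = 0; i < W * H; i++) {
                    double e = (double)I1[i] - (double)next[i];
                    mse   += e * e;
                    drift += e;                    /* Dc, Eq. 17, one channel */
                }
                mse /= W * H; drift /= W * H;
                double psnr = (mse > 0.0) ? 10.0 * log10(255.0 * 255.0 / mse)
                                          : INFINITY;
                printf("generation %d: PSNR_n = %.2f dB, drift = %.3f\n",
                       n, psnr, drift);
            }
            memcpy(In, next, sizeof(In));
        }
        return 0;
    }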

C.9 Instruction Benchmark

Check for software/MPEG methodology?

C.9.1 Definition (Informative) This benchmark counts the number of load/store/arithmetic instructions used in a codec. For time-critical or power-critical applications, the number of operations of a sequential software version of the codec is determined for each class of operations: it is measured as the number of assembler instructions, per instruction class, generated by a reference compiler. (Classes of instructions: arithmetic instructions, load/store instructions, branch instructions, remaining instructions.)

C.9.1 Measurement (Normative) There seems to be a tool for valgrind to do that; to be checked. TBC.

C.10 Hardware Complexity (Alexey)

C.10.1 Definition (Informative) The hardware complexity of a codec is a term that encompasses numerous properties of the hardware resources needed to implement the codec. Measures of hardware complexity can be of two types:

- technologically dependent
- technologically independent

Technologically dependent measures (such as die size and transistor count) have different values for different logic families (e.g., CMOS, TTL, BiCMOS, ECL, IIL). In contrast, technologically independent measures (such as the gate equivalent) are invariant with respect to the logic family. In the remainder of C.10, only technologically independent measures are considered, as they are more suitable for the tasks of codec evaluation. The most widespread technologically independent measure is the gate equivalent (GE), which is calculated as the number of two-input drive-strength-one NAND gates used for implementing a codec. For an on-chip or FPGA implementation of the codec, the following shall be reported: the number of gates (two-input NAND gate equivalents) and the memory requirement of the hardware implementation for a 1, 2, 5 and 10 MPixel image, and also the maximum achievable speed of the hardware implementation in MHz using pipelining in the current process technology.


C.10.1 Measurement (Normative) The measurement procedure differs between the hardware platforms used for the codec's implementation. This difference stems from the fact that for VLSI there is no other way1 to count the number of gates used except to obtain the value from a report file generated by design software (such as Quartus II for Altera FPGAs, Foundation for Xilinx FPGAs, and Mentor Graphics tools for ASICs).

The main difficulty for FPGAs and other programmable logic devices is that hardware complexity is specified in terms of the number of different structural elements used as measures of hardware complexity:

- logic cells
- embedded memory blocks
- embedded hardware modules such as multipliers, IP blocks, transceivers, etc.

In this case, it is necessary to convert these measures to GE. This is possible using the information contained in the documentation provided by the manufacturer, via the technology-dependent unit area commonly referred to as gate equivalent: a specification in gate equivalents for a certain circuit reflects a complexity measure from which a corresponding silicon area can be deduced for a dedicated manufacturing technology.

No good way to measure this other than implementing it in hardware. Transistors / gates / surface / …

C.11 Software Implementation Complexity

C.11.1 Definition (Informative) A relative measure that addresses the amount of work necessary to implement and debug a part of a codec compared to another, equivalent part of a codec. (Example: an arithmetic coder is more complex with respect to a software implementation than the Golomb coder of JPEG-LS.)

C.11.1 Measurement (Normative) There is no good way to measure this; only relative comparisons are possible. Do tools exist for that? Does Wikipedia refer to this? TBC.

D. Codec Features The following annex defines a list of codec features a vendor shall report for a codec specification; rather than being determined by measurements, these features follow from architectural decisions made for the codec under inspection.

D.1 Dimension, Size, Depth and Precision Limitations For the evaluation of codecs, the following list of parameter ranges of the codec in question shall be provided:

1 - In principle, there is an additional way to count the number of gates: a direct (destructive) procedure, by which it is possible to count the number of electronic components that are physically contained in an integrated circuit. Such methods are, however, inapplicable to FPGAs, CPLDs and other architectures of programmable logic devices.


 Maximum file sizes that can be represented by the codestream and/or the file format.

NOTE: The TIFF file format, for example, cannot reasonably handle files larger than 2GB. It is expected that codec providers list such boundaries.

 Maximum image dimensions representable by the codestream/file format.

NOTE: This item lists the theoretical boundary for the image size that is representable by the codestream/file format syntax. It is not the maximal image size compressible/decodable under a memory or time constraint.

 Maximum number of channels/components.

 Pixel number formats and precisions representable by the file format/codestream. For this item, codec providers shall list the number formats and the precisions/bit depths representing the samples of the image, and the limitations of such formats.

NOTE: While image samples are typically understood as unsigned integers, this need not be the case. Additional formats may include signed integers, floating-point representations or fixed-point representations. For example, lossy JPEG supports unsigned integers with 8 or 12 bits per pixel, JPEG 2000 signed or unsigned integers from one to 37 bits per pixel, and JPEG XR allows floating-point representations with 16 bits per pixel, but 32-bit floating-point numbers are not representable at full precision.

 Image segmentation units, their domain and their limitations.

NOTE: Codestreams typically separate images into smaller units that are compressed independently or are independently addressable. Such units could either be spatial units, then often called tiles or minimum coded units, or could also be defined in the transformation domain, for example precincts or codeblocks. This item requests implementers to list the domain within which these units are located (spatial, transform domain, etc.), their size constraints, whether they are independently decodable, etc.

D.2 Extensibility Extensibility is a feature of the bitstream or file format syntax used to encode the images. A bitstream or file format syntax is extensible if syntax elements can be added to it after its definition such that the bitstream remains fully or partially decodable by implementations following the earlier specifications. To specify this feature, providers of a codec shall report how and by which means the syntax is extensible.

D.3 Backwards Compatibility A proposed compression codec is said to have a codestream or file format syntax that is backwards compatible if implementations of existing standards such as JPEG or JPEG 2000 are able to decode such streams either partially or completely. A proponent shall report to which standards and to which degree a proposed compression algorithm is backwards compatible.

D.4 Online-Capability and Embedded Codecs Online capability is a property of the syntax of the codestream or file format of a proposed compression scheme. A syntax is said to be online-capable if its syntax elements can be parsed incrementally, without buffering parts of the stream and without forwards or backwards seeking in the stream. An online-capable bitstream syntax allows decompressors to start decoding file formats or codestreams with only parts of the stream available. An implementation allowing such a mode of


operation is said to be online-capable. A codestream or file format syntax is said to be embedded if a properly truncated version of the codestream is again a syntactically correct codestream or file format and represents an image of lower quality or resolution than the image represented by the full stream. Providers of codestream syntax definitions shall describe to which degree the proposed syntax is online-capable, or even embedded.


E. Application Oriented Evaluation This informative annex lists, for a number of application domains, a selection of the benchmarks of Annex C that might be relevant for the corresponding domain. This list should be understood as a suggestion for how to compile the benchmarks of Annex C into experiments to evaluate a codec targeted at one of the listed domains; test conditions in a core experiment should be stated explicitly and might change or override the suggestions listed in this document.

E.1 Photography

E.2 Video Surveillance

E.3 Colour Filter Array (CFA) Compression Acquisition of natural scenes requires capturing the colour information for each image pixel (picture element). In order to deliver an accurate representation for the human visual system, at least three colour components have to be acquired. Many available systems, however, are only able to record one colour component per image location at a time, leading to a colour filter array representation. A reconstruction phase is then required to transform these data into the final representation offering all colour component information for each pixel.

In order to reduce the amount of data to store, CFA data can be compressed similarly to any other image data. However, since the CFA data represent an intermediate form that has to be translated to the final representation using a reconstruction phase, special evaluation methodologies are required. The following subsections discuss modifications and extensions of the techniques discussed in Annexes B and C.

E.3.1 Evaluation of Image Quality (for Lossy Compression) The metrics defined in Annex B have to be put into relation to a mean squared error value, since many codec properties can be traded against the achievable quality. Note that this simple quality metric does not take human visual perception into account; for this purpose, separate benchmarks exist. For CFA compression tests, the MSE defined in Annex B has to be replaced by the CFA MSE defined in subclause E.3.2. Furthermore, it might be beneficial to add an MSE-based evaluation of the demosaiced image as defined in subclause E.3.4.

E.3.2 CFA Mean Squared Error The CFA Mean Squared Error directly measures the mathematical distortion without splitting the colour components:

MSE_{CFA} = \frac{1}{N} \sum_{\text{all pixels } i} (p_i - q_i)^2    (Eq. 19)

where N is the number of pixels, pi is the i-th pixel in the source image, qi the pixel value at the same position in the distorted image obtained after successive compression and decompression. Note that


each pixel has only one associated value.

E.3.3 CFA Peak Absolute Error The CFA Peak Absolute Error directly measures the mathematical distortion without splitting the colour components:

PAE_{CFA} = \max_{\text{all pixels } i} |p_i - q_i|    (Eq. 20)

where pi is the i-th pixel in the source image, qi the pixel value at the same position in the distorted image obtained after successive compression and decompression. Note that each pixel has only one associated value.
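Both CFA metrics operate on the single-valued CFA samples directly; a minimal sketch follows, in which the sample arrays are illustrative placeholders:

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const unsigned char p[] = {12, 200, 45, 90, 130, 7, 255, 64};  /* original */
        const unsigned char q[] = {10, 201, 44, 95, 128, 7, 250, 66};  /* distorted */
        const int N = (int)(sizeof(p) / sizeof(p[0]));

        double mse = 0.0;
        int pae = 0;
        for (int i = 0; i < N; i++) {
            int d = (int)p[i] - (int)q[i];
            mse += (double)d * (double)d;       /* Eq. 19 accumulation */
            if (abs(d) > pae) pae = abs(d);     /* Eq. 20 maximum */
        }
        printf("MSE_CFA = %.3f, PAE_CFA = %d\n", mse / N, pae);
        return 0;
    }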

E.3.4 Reconstructed Image Quality Evaluations Since CFA data cannot be directly visualized, image quality evaluation needs to be done based on the reconstructed image data. Two different methods can be distinguished. The first one starts with CFA data directly acquired from a digital camera and is depicted in Figure 1.

[Figure 1: Evaluation of CFA Image Quality. Block diagram: the acquired CFA data passes through compression and decompression (or, for comparison, through a reference algorithm for non-CFA data), followed by sensor correction and the processing pipeline; quality evaluation/comparison is then performed on images of the same size.]

It includes all processing steps necessary to transform CFA data into a form that can be visualized. The sensor correction encompasses all low-level operations such as gain or defective pixel correction. The box labeled processing pipeline includes all steps undertaken to convert the actual CFA data into a three-component image. It includes in particular:

 demosaicing


 white balance, brightness correction, colour correction2

 colour rendering (colour space transform) including gamma.

Depending on the application scenario, different demosaicing algorithms can be chosen as defined in the corresponding call. Low-complexity applications might for instance define a simple demosaicing algorithm, while high-quality use cases need a more computationally intensive selection. It might even be beneficial to compare different demosaicing algorithms in case a CFA compression scheme is tightly coupled with a corresponding algorithm. Since one of the major benefits of CFA-based workflows is the possibility of late image corrections, this should be preserved even in the case of lossy compression. For this purpose, different parameter sets for the above-mentioned algorithms should be applied. The actual setup depends on the application and might for instance include different white balancing settings (tungsten (3200 K) versus daylight (6500 K)). Similarly, different brightness correction operations, such as an increment by a factor of two (one stop) versus very small or no correction, can be taken into account. The gray-shaded path allows comparing a possible CFA compression candidate against a solution for standard multi-component compression such as JPEG, JPEG 2000 and JPEG XR. The quality evaluation methodologies can be objective measures or subjective measurements as defined in Annex C. While this setup has the advantage of starting with real CFA data, it is slightly biased in that the proper objective is not to preserve the demosaiced image, but the real captured scene. Note that this is not equivalent, since demosaicing possibly introduces artefacts that, when reduced by the compression algorithm, for instance because of smoothing, might lead to superior results. This can be taken into account using the following setup:

[Figure (unlabelled in source): an RGB reference image is converted to CFA data by a mosaicing step, then compressed and decompressed, and a quality evaluation/comparison is performed against the reference.]

The mosaicing box includes all steps necessary to transform an RGB image into a “reasonable” CFA image:

 inverting the white balance

2 - Adapts the spectral sensitivity of the camera sensor.


 inverting the brightness and colour correction as well as the gamma transform

 inverting the colour rendering

 forming the CFA pattern, including possible low-pass filtering

Note that the results of these steps need to be converted into integers of the desired precision, possibly causing rounding errors that will directly impact the achievable PSNR.

E.3.5 Codec Features Besides the codec features listed in Annex D, the following additional codec features are useful for the evaluation of CFA compression techniques.

E.3.6 Colour Filter Array Geometry CFA data can have different geometrical layouts. The most common one is a checkerboard pattern populated with a classical Bayer mask, having twice as many green pixels as red or blue ones [1]. Alternatively, the checkerboard pattern used can contain sensors for the red, green, blue and “white” colours [2]. Besides the checkerboard pattern, alternative geometrical layouts are also available.

Consequently, CFA compression schemes can be evaluated by means of the following table:

Colour Filter Array Geometry                      Supported (yes/no)
Bayer checkerboard pattern [1]
Red/green/blue/white checkerboard pattern [2]

TODO: Further references

E.4 Guideline for subjective quality evaluation methods for compressed medical images

E.4.1. Test dataset: The test dataset should include diverse body parts, pathologies, modalities, and scanning parameters [3].

E.4.2. Index of compression level: compression ratio Description: According to guidelines, including the American College of Radiology Technical Standard for Teleradiology [4] and the FDA Guidance for the Submission of Premarket Notifications for Medical Image Management Devices [5], and many scientific reports in the radiology field [3, 6-16], the de facto standard index of the compression level has been the compression ratio, that is, the ratio of the number of bits in an original image to that in its compressed version.


E.4.3. Definition of compression ratio: The ratio of the original image file size to the compressed size. Description: Some medical images have a bit depth of more than 8 bits/pixel, so the compression ratio can be defined in different ways. For example, CT images have a bit depth of 12 bits/pixel, and each pixel is packed on a two-byte boundary with four padding bits. The compression ratio has then largely been defined in one of two ways: a) the ratio of the size of the actually used data (12 bits/pixel) to the compressed size; and b) the ratio of the size of the original data including padding bits (16 bits/pixel) to the compressed size. The latter definition is more advantageous in that it directly yields the reduction in file size and thereby directly measures the beneficial effect of compression on storage and transmission speed [17, 18].
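As a short worked example, assume a 512 x 512 CT slice compressed to a hypothetical 1.6 bits/pixel: definition a) yields 12/1.6 = 7.5:1, while definition b) yields 16/1.6 = 10:1 for the identical compressed file. A minimal sketch of the same arithmetic:

    #include <stdio.h>

    int main(void)
    {
        const double pixels      = 512.0 * 512.0;
        const double used_bits   = pixels * 12.0;  /* definition a): used data */
        const double stored_bits = pixels * 16.0;  /* definition b): stored data */
        const double compressed  = pixels * 1.6;   /* hypothetical compressed size */

        printf("CR (12-bit definition) = %.1f:1\n", used_bits / compressed);
        printf("CR (16-bit definition) = %.1f:1\n", stored_bits / compressed);
        return 0;
    }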

E.4.4. Range of compression ratios: compression ratios near the previously reported optimal compression thresholds Description: These compression ratios would be chosen to lie below, around, and above the previously reported optimal compression thresholds. Very high compression is inappropriate for medical images [19].

E.4.5. Definition of optimal compression: visually lossless threshold Description: There have been largely two approaches to determining optimal compression thresholds up to which image quality is preserved: the diagnostically lossless approach and the visually lossless approach [8, 10, 12, 13, 18, 20, 21]. The diagnostically lossless approach aims to preserve diagnostic accuracy in compressed images. Although this approach addresses the diagnostic performance directly, its practicability is limited, as the compression threshold intrinsically varies with the diagnostic task itself (e.g., the size of the target lesion). The visually lossless approach is based on the idea that if compression artefacts are imperceptible, they should not affect the diagnosis. The latter approach has been advocated to be more robust and conservative and, therefore, more suitable for medical image compression than the former.

E.4.6. Image Discrimination Task 6-a) Type of readers: Radiologists with different years of clinical experience

6-b) Visual comparison method: alternate display without an intervening blank image

Description: In visual comparisons of images in medical or non-medical fields, three image comparison methods have mainly been used previously: alternate display without an intervening blank image, alternate display with an intervening blank image, and side-by-side display. It is known that these image comparison methods have genuinely different perceptual sensitivities, thereby affecting the results of medical compression research. Considering that a conservative standpoint is required in assessing the quality of compressed medical images, the “alternate display without an intervening blank image” method would be most appropriate, because this method is known to be the most sensitive to image differences. 6-c) Modified Two-Alternative Forced Choice (2AFC)

Each image pair is alternately displayed on a single monitor, while the order of the original and compressed images is randomized. Each radiologist, blinded to the compression ratios, independently determines whether the two images are distinguishable or indistinguishable. Although the test dataset contains pathologies, the radiologists would be asked not to confine their analysis to the pathologic findings. Instead, the radiologists would be asked to examine the entire


image to find any image difference, paying attention to structural details, particularly small vessels and organ edges, and to the texture of solid organs and soft tissues. 6-d) Reading session

The total set of image pairs (original and compressed images) would be randomly assigned to several reading sessions. The order of the reading sessions would change for each radiologist. Sessions would be separated by a minimum of 24 hours to minimize radiologists' fatigue and memory bias. 6-e) Reading distance

The reading distance would be limited to the range of the readers' habitual clinical reading distances. Limiting the reading distance is meant to reproduce clinical practice, as reading from a distance too close or too far would artificially enhance or degrade the sensitivity of the readers to compression artefacts. 6-f) The display system and viewing conditions

Details of the display system and viewing conditions should be provided.

The table below gives an example.

TABLE I. Display system and viewing conditions in human visual analysis.

Display system
  Display resolution:   1536 × 2048 pixels
  Display size:         31.8 × 42.3 cm
  Image resolution:     1483 × 1483 pixels (stretched using bilinear interpolation)
  Viewing software:     PiViewSTAR (version 5.0.7, SmartPACS, Phillipsburg, NJ)
  Luminance:            1.3-395.5 cd/m2
Viewing conditions
  Ambient room light:   30 lux
  Reading distance:     35-75 cm
  Window setting:       width, 400 HU; level, 20 HU
  Magnification:        not allowed
  Reading time:         not constrained

E.5 Archival of Moving Pictures TBC


E.6 Document Imaging TBC

E.7 Biometric and Security Related Applications TBC

E.8 Satellite Imaging and Earth Observation TBC

E.9 Pre-Press Imaging TBC

E.10 Space Physics TBC

E.11 Computer Vision TBC

E.12 Forensic Analysis TBC


References
[1] Bryce E. Bayer, "Color imaging array," United States Patent, March 5, 1975.
[2] John F. Hamilton Jr. and John T. Compton, "Processing color and panchromatic pixels," United States Patent Application, July 28, 2005.
[3] D. Koff, P. Bak, P. Brownrigg, D. Hosseinzadeh, A. Khademi, A. Kiss, L. Lepanto, T. Michalak, H. Shulman, and A. Volkening, "Pan-Canadian evaluation of irreversible compression ratios ("Lossy" compression) for development of national guidelines," J Digit Imag, vol. 22, pp. 569-578, Dec. 2009.
[4] American College of Radiology, "ACR technical standard for teleradiology," http://www.acr.org. Accessed May 1, 2010.
[5] U.S. Food and Drug Administration, Center for Devices and Radiological Health, "Guidance for the submission of premarket notifications for medical image management devices," http://www.fda.gov/cdrh/ode/guidance/416.html. Accessed May 1, 2010.
[6] V. Bajpai, K. H. Lee, B. Kim, K. J. Kim, T. J. Kim, Y. H. Kim, and H. S. Kang, "The difference of compression artifacts between thin- and thick-section lung CT images," Am J Roentgenol, vol. 191, pp. 38-43, Aug. 2008.
[7] T. J. Kim, K. W. Lee, B. Kim, K. J. Kim, E. J. Chun, V. Bajpai, Y. H. Kim, S. Hahn, and K. H. Lee, "Regional variance of visually lossless threshold in compressed chest CT images: lung versus mediastinum and chest wall," Eur J Radiol, vol. 69, pp. 483-488, Mar. 2008.
[8] J. P. Ko, J. Chang, E. Bomsztyk, J. S. Babb, D. P. Naidich, and H. Rusinek, "Effect of CT image compression on computer-assisted lung nodule volume measurement," Radiology, vol. 237, pp. 83-88, Oct. 2005.
[9] J. P. Ko, H. Rusinek, D. P. Naidich, G. McGuinness, A. N. Rubinowitz, B. S. Leitman, and J. M. Martino, "Wavelet compression of low-dose chest CT data: effect on lung nodule detection," Radiology, vol. 228, pp. 70-75, Jul. 2003.
[10] K. H. Lee, Y. H. Kim, B. H. Kim, K. J. Kim, T. J. Kim, H. J. Kim, and S. Hahn, "Irreversible JPEG 2000 compression of abdominal CT for primary interpretation: assessment of visually lossless threshold," Eur Radiol, vol. 17, pp. 1529-1534, Jun. 2007.
[11] H. Ringl, R. Schernthaner, E. Sala, K. El-Rabadi, M. Weber, W. Schima, C. J. Herold, and A. K. Dixon, "Lossy 3D JPEG2000 compression of abdominal CT images in patients with acute abdominal complaints: effect of compression ratio on diagnostic confidence and accuracy," Radiology, vol. 248, pp. 476-484, Aug. 2008.
[12] H. Ringl, R. E. Schernthaner, A. A. Bankier, M. Weber, M. Prokop, C. J. Herold, and C. Schaefer-Prokop, "JPEG2000 compression of thin-section CT images of the lung: effect of compression ratio on image quality," Radiology, vol. 240, pp. 869-877, Sep. 2006.
[13] H. Ringl, R. E. Schernthaner, C. Kulinna-Cosentini, M. Weber, C. Schaefer-Prokop, C. J. Herold, and W. Schima, "Lossy three-dimensional JPEG2000 compression of abdominal CT images: assessment of the visually lossless threshold and effect of compression ratio on image quality," Radiology, vol. 245, pp. 467-474, Nov. 2007.
[14] R. M. Slone, D. H. Foos, B. R. Whiting, E. Muka, D. A. Rubin, T. K. Pilgram, K. S. Kohm, S. S. Young, P. Ho, and D. D. Hendrickson, "Assessment of visually lossless irreversible image compression: comparison of three methods by using an image-comparison workstation," Radiology, vol. 215, pp. 543-553, May 2000.
[15] R. M. Slone, E. Muka, and T. K. Pilgram, "Irreversible JPEG compression of digital chest radiographs for primary interpretation: assessment of visually lossless threshold," Radiology, vol. 228, pp. 425-429, Aug. 2003.
[16] H. S. Woo, K. J. Kim, T. J. Kim, S. Hahn, B. H. Kim, Y. H. Kim, C. J. Yoon, and K. H. Lee, "JPEG 2000 compression of abdominal CT: difference in compression tolerance between thin- and thick-section images," Am J Roentgenol, vol. 189, pp. 535-541, Sep. 2007.
[17] K. J. Kim, B. Kim, S. W. Choi, Y. H. Kim, S. Hahn, T. J. Kim, S. J. Cha, V. Bajpai, and K. H. Lee, "Definition of compression ratio: difference between two commercial JPEG2000 program libraries," Telemed J E Health, vol. 14, pp. 350-354, May 2008.
[18] K. J. Kim, B. Kim, K. H. Lee, R. Mantiuk, H. S. Kang, J. Seo, S. Y. Kim, and Y. H. Kim, "Objective index of image fidelity for JPEG2000 compressed body CT images," Med Phys, vol. 36, pp. 3218-3226, Jul. 2009.


[19] K. J. Kim, B. Kim, R. Mantiuk, T. Richter, H. Lee, H. S. Kang, J. Seo, and K. H. Lee, "A comparison of three image fidelity metrics of different computational principles for JPEG2000 compressed abdomen CT images," IEEE Trans Med Imaging, vol. 29, pp. 1496-1503, Aug. 2010.
[20] B. Kim, K. H. Lee, K. J. Kim, R. Mantiuk, V. Bajpai, T. J. Kim, Y. H. Kim, C. J. Yoon, and S. Hahn, "Prediction of perceptible artifacts in JPEG2000 compressed abdomen CT images using a perceptual image quality metric," Acad Radiol, vol. 15, pp. 314-325, Mar. 2008.
[21] B. Kim, K. H. Lee, K. J. Kim, R. Mantiuk, S. Hahn, T. J. Kim, and Y. H. Kim, "Prediction of perceptible artifacts in JPEG2000 compressed chest CT images using mathematical and perceptual quality metrics," Am J Roentgenol, vol. 190, pp. 328-334, Feb. 2008.
