Received January 8, 2021, accepted January 26, 2021. Date of publication xxxx 00, 0000, date of current version xxxx 00, 0000.

Digital Object Identifier 10.1109/ACCESS.2021.3056504

A Hitchhiker’s Guide to Structural Similarity

ABHINAU K. VENKATARAMANAN 1, CHENGYANG WU 1, ALAN C. BOVIK 1, (Fellow, IEEE), IOANNIS KATSAVOUNIDIS 2, AND ZAFAR SHAHID 2
1Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX 78705, USA
2Facebook, Menlo Park, CA 94025, USA
Corresponding author: Abhinau K. Venkataramanan ([email protected])
This work was supported by Facebook Video Infrastructure.

ABSTRACT The Structural Similarity (SSIM) Index is a very widely used image/video quality model that continues to play an important role in the perceptual evaluation of compression algorithms, encoding recipes and numerous other image/video processing algorithms. Several public implementations of the SSIM and Multiscale-SSIM (MS-SSIM) algorithms have been developed, which differ in efficiency and performance. This ‘‘bendable ruler’’ makes the process of quality assessment of encoding algorithms unreliable. To address this situation, we studied and compared the functions and performances of popular and widely used implementations of SSIM, and we also considered a variety of design choices. Based on our studies and experiments, we have arrived at a collection of recommendations on how to use SSIM most effectively, including ways to reduce its computational burden.

INDEX TERMS Image/video quality assessment, structural similarity index, Pareto optimality, color SSIM, spatio-temporal aggregation, enhanced SSIM.

I. INTRODUCTION
With the explosion of social media platforms and online streaming services, video has become the most widely consumed form of content on the Internet, accounting for 60% of global internet traffic in 2019 [1]. Social media platforms have also led to an explosion in the amount of image data being shared and stored online. Handling such large volumes of image and video data is inconceivable without the use of compression algorithms such as JPEG [2], [3], AVIF (AV1 Intra) [4], [5], HEIF (HEVC Intra) [6], H.264 [7], [8], HEVC [9], EVC [10], VP9 [11], AV1 [12], SVT-AV1 [13], and the upcoming VVC and AV2 standards.

The goal of these algorithms is to perform lossy compression of images and videos to significantly reduce file sizes and bandwidth consumption, while incurring little or acceptable reduction of visual quality. In addition to compression-distorted streaming videos, a large fraction of the images and videos that are shared on social media are User Generated Content (UGC) [14], [15], i.e., not professionally created. As a result, even without any additional processing, these images and videos can have impaired quality because they were captured by uncertain hands. In all these circumstances, it is imperative to have available automatic perceptual quality models and algorithms which can accurately, reliably, and consistently predict the subjective quality of images/videos over this wide range of applications.

One way that perceptual quality models can provide significant gains in compression is by conducting perceptual Rate-Distortion Optimization (RDO) [16], where quantization parameters, encoding ''recipes,'' and mode decisions are evaluated by balancing the resulting bitrates against the perceptual quality of the decoded videos. Typically, a set of viable encodes is arrived at by constructing a perceptually-guided, Pareto-optimal bitrate ladder.

To understand Pareto-optimality, consider two encodes of a video, E1 = (R1, D1) and E2 = (R2, D2), where R and D denote the bitrate and the (perceptual) distortion associated with each encode. If R1 ≤ R2 and D1 ≤ D2, we say that E1 ''dominates'' E2, since better performance (lower distortion) is obtained at a lower cost (bitrate). So, we can prune any set of encodes S = {Ei} by removing all those dominated by any other encode. The pruned set, say S′, has the property that for any two of its encodes E1 and E2 with R1 < R2, we have D1 > D2. That is, we obtain a set of encodes in which an increase in bitrate corresponds to a decrease in distortion. Such a set is said to be Pareto-optimal. In general, we can define Pareto-optimality for any setting where a cost is incurred (bitrate, running time, etc.) to achieve better performance (accuracy, distortion, etc.), as illustrated by the sketch below.
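To make the pruning rule concrete, the following minimal Python sketch (ours, not part of any encoder) extracts the Pareto-optimal front from a set of (bitrate, distortion) pairs; the function name and data layout are illustrative.

```python
def pareto_front(encodes):
    """Return the Pareto-optimal subset of (bitrate, distortion) encodes.

    An encode is kept only if no other encode achieves both a lower (or
    equal) bitrate and lower (or equal) distortion, i.e., if it is not
    dominated by any other encode.
    """
    front = []
    for r1, d1 in encodes:
        dominated = any(
            r2 <= r1 and d2 <= d1 and (r2, d2) != (r1, d1)
            for r2, d2 in encodes
        )
        if not dominated:
            front.append((r1, d1))
    # Sorting by bitrate yields a ladder in which every increase in
    # bitrate corresponds to a decrease in distortion.
    return sorted(front)

# Example: two of these four encodes are dominated and get pruned.
print(pareto_front([(500, 0.20), (800, 0.15), (800, 0.12), (1200, 0.12)]))
# [(500, 0.2), (800, 0.12)]
```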


The first distortion/quality metric used to measure image/video quality was the humble Peak Signal to Noise Ratio (PSNR), or log-reciprocal Mean Squared Error (MSE), between a reference video and possibly distorted versions of it (e.g., by compression). However, while the PSNR metric is simple and easy to calculate, it does not correlate very well with subjective opinion scores of picture or video quality [17]. This is because PSNR is a pixel-wise fidelity metric which does not account for spatial or temporal perceptual processes.

An important breakthrough on the perceptual quality problem emerged in the form of the Universal Quality Index (UQI) [18], the first form of the Structural Similarity Index (SSIM). Given a pair of images (reference and distorted), UQI creates a local quality map by measuring luminance, contrast and structural similarity over local neighborhoods, then pooling (averaging) values of this spatial quality map, yielding a single quality prediction (per picture or video frame). SSIM was later refined to better account for the interplay between adaptive gain control of the visual signal (the basis for masking effects) and saturation at low signal levels [19].

The SSIM concept reached higher performance in the form of Multi-Scale SSIM (MS-SSIM) [20], which applies SSIM at five spatial resolutions obtained by successive dyadic down-sampling. Contrast and structure similarities are computed at each scale, while luminance similarity is only calculated at the coarsest scale. These scores are then combined using exponential weighting. SSIM and MS-SSIM are widely used by the streaming and social media industries to perceptually control the encodes of many billions of picture and video contents annually.

While SSIM and MS-SSIM are most commonly deployed on a frame-by-frame basis, temporal extensions of SSIM have also been developed. In [21], the authors compute frame-wise quality scores, weighted by the amount of motion in each frame. In [22], explicit motion fields are used to compute SSIM along motion trajectories, an idea that was elaborated on in the successful MOVIE index [23].

Following the success of SSIM, a great variety of picture and video quality models have been proposed. Among these, the most successful have relied on perceptually relevant ''natural scene statistics'' (NSS) models, which accurately and reliably describe bandpass and nonlinearly normalized visual signals [24], [25]. Distortions predictably alter these statistics, making it possible to create highly competitive picture and video quality predictors like the Full-Reference (FR) Visual Information Fidelity (VIF) index [26], the Spatio-Temporal Reduced Reference Entropic Differences (ST-RRED) model [27], the efficient SpEED-QA [28], and the Video Multi-method Assessment Fusion (VMAF) [29] model, which uses a simple learning model to fuse quality features derived from NSS models to obtain high performance and widespread industry adoption. Despite these advances, SSIM remains the most widely used perceptual quality algorithm because of its high performance, natural definition, and compute simplicity. Moreover, the success of SSIM can also be explained by NSS, at least in part [30].

In many situations, reference information is not available as a ''gold standard'' against which the quality of a test picture or video can be evaluated. No-Reference (NR) quality models have been developed that can accurately predict picture or video quality without a reference, by measuring NSS deviations. Notable NR quality models include BLIINDS [31], DIIVINE [32], BRISQUE [33], and NIQE [34]. The latter two, which have attained significant industry penetration, are similar to SSIM since they are defined by simple bandpass operations over multiple scales, normalized by local spatial energy. For encoding applications where the source video to be encoded is already impaired by some distortion(s), e.g., UGC, as is often found on sites like YouTube, Facebook, and Instagram, SSIM and NIQE can be combined via a 2-step assessment process to produce significantly improved encode quality predictions [35], [36].

Evaluating picture and video encodes at scale remains the most high-volume application of quality assessment, and SSIM continues to play a dominant role in this space. Nevertheless, many widely used versions of SSIM exist, having different characteristics. Understanding and unifying these various implementations would be greatly useful to industry. Moreover, there remain questions regarding the use of SSIM across different display sizes, devices and viewing distances, as well as how to handle color, and how to combine (pool) SSIM scores. Our objective here is to attempt to answer these questions, at least in part.

II. BACKGROUND
The basic SSIM index is an FR picture quality model defined between two luminance images of size M × N, I1(i, j) and I2(i, j), as a multiplicative combination of three terms: luminance similarity l(i, j), contrast similarity c(i, j), and structure similarity s(i, j). Color may be considered, but we will do so later.

These three terms are defined in terms of the local means µ1(i, j), µ2(i, j), standard deviations σ1(i, j), σ2(i, j), and correlation σ12(i, j) of luminance, as follows. Let W_ij denote a windowed region of size k × k spanning the indices {i, . . . , i + k − 1} × {j, . . . , j + k − 1}, and let w(m, n) denote weights assigned to each index (m, n) of this window. In practice, these weighting functions sum to unity, and have a finite-extent Gaussian or rectangular shape.

The local statistics are then calculated on (and between) the two images as

$$\mu_1(i,j) = \sum_{(m,n) \in W_{ij}} w(m,n)\, I_1(m,n), \tag{1}$$

$$\mu_2(i,j) = \sum_{(m,n) \in W_{ij}} w(m,n)\, I_2(m,n), \tag{2}$$

$$\sigma_1^2(i,j) = \sum_{(m,n) \in W_{ij}} w(m,n)\, I_1^2(m,n) - \mu_1^2(i,j), \tag{3}$$

$$\sigma_2^2(i,j) = \sum_{(m,n) \in W_{ij}} w(m,n)\, I_2^2(m,n) - \mu_2^2(i,j), \tag{4}$$

and


$$\sigma_{12}(i,j) = \sum_{(m,n) \in W_{ij}} w(m,n)\, I_1(m,n)\, I_2(m,n) - \mu_1(i,j)\,\mu_2(i,j). \tag{5}$$

Using these local statistics, the luminance, contrast and structural similarity terms are respectively defined as

$$l(i,j) = \frac{2\mu_1(i,j)\mu_2(i,j) + C_1}{\mu_1^2(i,j) + \mu_2^2(i,j) + C_1}, \tag{6}$$

$$c(i,j) = \frac{2\sigma_1(i,j)\sigma_2(i,j) + C_2}{\sigma_1^2(i,j) + \sigma_2^2(i,j) + C_2}, \tag{7}$$

$$s(i,j) = \frac{\sigma_{12}(i,j) + C_3}{\sigma_1(i,j)\sigma_2(i,j) + C_3}, \tag{8}$$

where C1, C2 and C3 are saturation constants that contribute to numerical stability. Local quality scores are then defined as

$$Q(i,j) = l(i,j) \cdot c(i,j) \cdot s(i,j). \tag{9}$$

Adopting the common choice of C3 = C2/2, the contrast and structure terms combine:

$$cs(i,j) = c(i,j)\,s(i,j) = \frac{2\sigma_{12}(i,j) + C_2}{\sigma_1^2(i,j) + \sigma_2^2(i,j) + C_2}. \tag{10}$$

In this way, a SSIM quality map Q(i, j) is defined, which can be used to visually localize distortions. Since a single picture quality score is usually desired, the average value of the quality map can be reported as the Mean SSIM (MSSIM) score between the two images:

$$\text{SSIM}(I_1, I_2) = \frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} Q(i,j). \tag{11}$$

SSIM obeys the following desirable properties:
1) Symmetry: SSIM(I1, I2) = SSIM(I2, I1)
2) Boundedness: |SSIM(I1, I2)| ≤ 1. (Very rarely, distortion can cause a negative correlation between reference and test image patches.)
3) Unique Maximum: SSIM(I1, I2) = 1 ⟺ I1 = I2

An important property of SSIM is that it accounts for the perceptual phenomenon of Weber's Law, whereby a Just Noticeable Difference (JND) is proportional to the local neighborhood property Q. This is the basis for perceptual masking of distortions, whereby the visibility of a distortion ΔQ is mediated by the relative perturbation ΔQ/Q.

To illustrate the connection between SSIM and Weber masking, consider an error Δµ of the local luminance µ1 in a test image:

$$\mu_2 = \mu_1 + \Delta\mu = \mu_1(1 + \Lambda), \tag{12}$$

where Λ = Δµ/µ1 is the relative change in luminance. Then, the luminance similarity term (6) becomes (dropping spatial indices)

$$l = \frac{2\mu_1\mu_2 + C_1}{\mu_1^2 + \mu_2^2 + C_1} \tag{13}$$

$$= \frac{2\mu_1^2(1 + \Lambda) + C_1}{\mu_1^2\left(1 + (1 + \Lambda)^2\right) + C_1} \tag{14}$$

$$= \frac{2(1 + \Lambda) + C_1/\mu_1^2}{1 + (1 + \Lambda)^2 + C_1/\mu_1^2}. \tag{15}$$

Since it is usually true that $C_1 \ll \mu_1^2$, the luminance term l is approximately only a function of the relative luminance change, reflecting luminance masking.

Similarly, a locally perturbed contrast σ2 in a test image may be expressed as σ2 = (1 + Σ)σ1, where Σ = Δσ/σ1 is the relative change in contrast from distortion. Similar to the above, we can express the contrast term (7) as

$$c = \frac{2(1 + \Sigma) + C_2/\sigma_1^2}{1 + (1 + \Sigma)^2 + C_2/\sigma_1^2}. \tag{16}$$

Since generally $C_2 \ll \sigma_1^2$, the contrast term c is approximately a function of the relative, rather than absolute, change in contrast, thereby accounting for perceptual contrast masking.

Given an 8-bit luminance image, assume the nominal dynamic range [0, L], where L = 255. Most commonly, the saturation constants are chosen relative to the dynamic range as C1 = (K1 L)² and C2 = (K2 L)², where K1 and K2 are small constants.

SSIM is quite flexible and allows room for design choices. The recommended implementation of SSIM in [19] is:
• If min(M, N) > 256, resize the images such that min(M, N) = 256.
• Use a Gaussian weighting window in (1)-(5) having k = 11 and σ = 1.5.
• Choose regularization constants K1 = 0.01 and K2 = 0.03.

One of our goals here is to compare and test the different, commonly used implementations of SSIM and MS-SSIM, which make different design choices. We conduct performance evaluations on existing, well-regarded image and video quality databases. We study the effects of several design choices and make recommendations on best practices when utilizing SSIM.
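For concreteness, the definitions (1)-(11) can be realized in a few lines. The following is a minimal sketch using a unit-sum rectangular window and the constants recommended in [19]; it illustrates the formulas above and is not any of the reference implementations compared later.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def ssim_map(img1, img2, k=11, K1=0.01, K2=0.03, L=255.0):
    """SSIM quality map of two grayscale images, per (1)-(10)."""
    img1 = img1.astype(np.float64)
    img2 = img2.astype(np.float64)
    C1, C2 = (K1 * L) ** 2, (K2 * L) ** 2

    # Local statistics (1)-(5): uniform_filter applies the unit-sum
    # k x k rectangular window w(m, n) = 1/k^2.
    mu1 = uniform_filter(img1, k)
    mu2 = uniform_filter(img2, k)
    var1 = uniform_filter(img1 ** 2, k) - mu1 ** 2
    var2 = uniform_filter(img2 ** 2, k) - mu2 ** 2
    cov = uniform_filter(img1 * img2, k) - mu1 * mu2

    # Luminance term (6) and combined contrast-structure term (10).
    l = (2 * mu1 * mu2 + C1) / (mu1 ** 2 + mu2 ** 2 + C1)
    cs = (2 * cov + C2) / (var1 + var2 + C2)
    return l * cs  # local quality map Q(i, j) of (9)

# Mean SSIM (11): mssim = ssim_map(ref, dis).mean()
```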
III. DATABASES
One of our main goals is to help ''standardize'' the way SSIM is defined and used. Since many versions of SSIM exist and implementing SSIM involves design choices, reliable and accurate subjective test beds that capture the breadth of theoretical and practical distortions are indispensable tools for our analysis. Among these, we selected two picture quality databases and two video quality databases that are widely used.

A. LIVE IMAGE QUALITY ASSESSMENT DATABASE
The LIVE IQA database [37], [38] contains 29 reference pictures, each subjected to the following five distortions (each at four levels of severity).

• JPEG compression
• JPEG2000 compression


• Gaussian blur
• White noise
• Bit errors in JPEG2000 bit streams

LIVE IQA contains 982 pictures with nearly 30,000 corresponding Difference Mean Opinion Score (DMOS) human subject scores.

B. TAMPERE IMAGE DATABASE 2013
The Tampere Image Database 2013 (TID2013) [39] contains 3000 distorted pictures, subjected to 24 impairments at 5 distortion levels synthetically applied to 25 pristine images. The 24 distortions are
• Additive Gaussian noise
• Additive noise in color components, which is more intensive than additive noise in the luminance component
• Spatially correlated noise
• Masked noise
• High frequency noise
• Impulse noise
• Quantization noise
• Gaussian blur
• Image denoising
• JPEG compression
• JPEG2000 compression
• JPEG transmission errors
• JPEG2000 transmission errors
• Non-eccentricity pattern noise
• Local block-wise distortions of different intensity
• Mean shift (intensity shift)
• Contrast change
• Change of color saturation
• Multiplicative Gaussian noise
• Comfort noise
• Lossy compression of noisy images
• Image color quantization with dither
• Chromatic aberrations
• Sparse sampling and reconstruction
The 3000 pictures in TID2013 are accompanied by human subjective quality scores in the form of over 500,000 MOS.

C. LIVE VIDEO QUALITY ASSESSMENT DATABASE
The LIVE VQA database [40] contains 10 reference videos, each subjected to the following four distortions (each applied at four levels of severity).
• MPEG-2 compression
• H.264 compression
• Error-prone IP networks
• Error-prone wireless networks
A total of 150 distorted videos are obtained, on which 4350 subjective DMOS were collected.

D. NETFLIX PUBLIC DATABASE
The Netflix Public Database, obtained from the VMAF [29] Github repository, contains 9 reference videos, each distorted by spatial scaling and compression, yielding 70 distorted videos. We selected this database because of its high relevance and commonly observed distortions characteristic of streaming video deployments at the largest scales.

IV. VERSIONS OF SSIM
Next, we take a deep dive into publicly available and commonly used implementations of SSIM and MS-SSIM. We compare various aspects of their performance, explain the differences between them, and provide recommendations for best practices when using SSIM. This is especially important because, as we will see, subtle differences in design choices can lead to significant changes in performance and efficiency.

A. PUBLIC SSIM AND MS-SSIM IMPLEMENTATIONS
We considered the following twelve SSIM implementations when carrying out our experiments:
1) FFMPEG [41]
2) LIBVMAF [29]
3) VideoClarity ClearView Player (ClearView)
4) HDRTools [42]
5) Daala [43]
6) Scikit-Image in Python (Rectangular) [44]
7) Scikit-Image in Python (Gaussian) [44]
8) Scikit-Video in Python (Rectangular) [45]
9) Scikit-Video in Python (Gaussian) [45]
10) Tensorflow in Python [46]
11) MATLAB
12) MATLAB (Fast)
''Rectangular'' refers to using a constant-weight window function to calculate local statistics, while ''Gaussian'' refers to using a Gaussian-shaped window function to calculate local statistics, as in [19]. Only the Python and MATLAB implementations allow the user to set parameters such as the SSIM window size. Hence, we tested the other implementations using the default parameters. ''Fast'' in item 12 refers to an accelerated implementation of SSIM in MATLAB.
In addition, the following eight MS-SSIM implementations were tested:
1) LIBVMAF
2) ClearView
3) HDRTools
4) Daala
5) Daala (Fast)
6) Scikit-Video in Python (Sum)
7) Scikit-Video in Python (Product)
8) Tensorflow in Python
''Sum'' and ''Product'' in items 6 and 7 refer to different ways of aggregating SSIM scores across scales. ''Product'' refers to the method proposed in [20], where MS-SSIM is computed as an exponentially-weighted product of SSIM scores from each scale. In ''Sum'', the MS-SSIM score is instead a weighted average of SSIM scores across scales. ''Fast'' refers to an accelerated implementation of MS-SSIM in Daala.


TABLE 1. Salient features of SSIM implementations.

B. SALIENT FEATURES OF SSIM AND MS-SSIM IMPLEMENTATIONS
The salient features of the various SSIM implementations are listed in Table 1, and the salient features of the various MS-SSIM implementations are listed below. To avoid repetition, we only discuss the aspects in which each MS-SSIM implementation deviates from the corresponding SSIM implementation.
1) LIBVMAF
a) Dyadic down-sampling is performed using a 9/7 biorthogonal filter.
2) Daala
a) Uses σ = 1.5, leading to a Gaussian window of size 11.
b) Dyadic down-sampling is performed by 2 × 2 average pooling.
3) Daala (Fast)
a) Multiscale processing is performed at 4 levels. The first four exponents used in the standard MS-SSIM formulation are renormalized to sum to 1.
b) An integer approximation to the Gaussian window is used.
c) Dyadic down-sampling is performed by 2 × 2 average pooling.
d) When image dimensions were not multiples of 16, we found that this implementation suffered from memory leaks, which led to a considerable decrease in accuracy. A simple fix restored performance to expected levels.
4) Scikit-Video
a) Dyadic down-sampling is performed by low-pass filtering using a 2 × 2 average filter, followed by down-sampling.
b) Allows aggregating across scales by summation instead of product. For summation, the exponents βi are normalized to sum to 1, leading to a convex combination.
c) At the coarsest scale, the algorithm uses αM = βM = γM = 1.
d) When image dimensions were large, we found that this implementation suffered from incompatible memory allocation, leading to crashes at run-time. We fixed this error, making the implementation usable.
5) Tensorflow
a) Dyadic down-sampling is carried out by average pooling over 2 × 2 neighborhoods.
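To make the ''Sum'' and ''Product'' aggregation rules concrete, the following minimal sketch (ours) performs five-scale aggregation on top of the ssim_map() sketch from Section II, using the standard exponents of [20] and 2 × 2 average pooling for dyadic down-sampling, as in several of the implementations above.

```python
import numpy as np

# Standard five-scale exponents from [20].
WEIGHTS = np.array([0.0448, 0.2856, 0.3001, 0.2363, 0.1333])

def avg_pool2(img):
    """Dyadic down-sampling by 2 x 2 average pooling (odd edges cropped)."""
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    img = img[:h, :w]
    return (img[0::2, 0::2] + img[0::2, 1::2] +
            img[1::2, 0::2] + img[1::2, 1::2]) / 4.0

def ms_ssim(ref, dis, aggregate="product"):
    scores = []
    for _ in WEIGHTS:
        # A faithful implementation would use only the contrast-structure
        # term (10) at the four finer scales and include the luminance
        # term (6) only at the coarsest scale; the full map is used here
        # for brevity.
        scores.append(ssim_map(ref, dis).mean())
        ref, dis = avg_pool2(ref), avg_pool2(dis)
    scores = np.array(scores)
    if aggregate == "product":         # exponentially-weighted product
        return float(np.prod(scores ** WEIGHTS))
    w = WEIGHTS / WEIGHTS.sum()        # "Sum": convex combination
    return float(np.sum(w * scores))
```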


TABLE 2. Off-the-shelf performance of SSIM implementations.

C. OFF-THE-SHELF PERFORMANCE USING DEFAULT PARAMETERS
In this section, we evaluate the off-the-shelf performance of the implementations discussed above. We first normalized the subjective scores of the pictures/videos in each database to the range [0, 1] by scaling and shifting. In all of the experiments in this section, unless mentioned otherwise, we computed SSIM scores only on the luminance channel.
It is well known that the relationship between SSIM (or any other quality metric) and subjective scores is non-linear. To account for this, we fit the five-parameter logistic (5PL) function [47], shown in (17), from SSIM values to subjective scores:

$$Q(x) = \beta_1\left(\frac{1}{2} - \frac{1}{1 + \exp\left(\beta_2(x - \beta_3)\right)}\right) + \beta_4 x + \beta_5, \tag{17}$$

where x is the SSIM score, βi are the parameters of the logistic function, and Q(x) is the predicted subjective quality.
After linearizing the SSIM values in this manner, we report the Pearson Correlation Coefficient (PCC), which is a measure of the linear correlation between the predicted and true quality; the Spearman Rank Order Correlation Coefficient (SROCC), which is a measure of the rank correlation (monotonicity); and the Root Mean Square Error (RMSE), which is a measure of the error in predicting subjective quality.
Table 2 shows the results of the experiments; the best three results in each column are boldfaced. Among SSIM implementations, LIBVMAF and the Scikit-Video implementations generally outperformed all other algorithms. We attribute this superior performance to their use of scaling, which we will expound upon in Section VI.
Among the MS-SSIM implementations, there were no consistent ''winners.'' Python implementations like Scikit-Video and the FastQA implementation were often among the top three. Tensorflow's MS-SSIM implementation also performs well, lending strong empirical support to the use of MS-SSIM as a training objective for deep networks implemented in Tensorflow.
Since compression forms an important class of distortions encountered in practice, we also report the off-the-shelf performance of SSIM and MS-SSIM implementations on compression-distorted data in Table 3. When restricting the comparisons to compression, the LIBVMAF, Scikit-Image and FastQA implementations still generally outperform other SSIM implementations, while HDRTools and ClearView generally outperform other MS-SSIM implementations.

D. PERFORMANCE-EFFICIENCY TRADEOFFS
In addition to performance (i.e., correlation with subjective scores), it is important to consider the compute efficiency of these implementations. Algorithms that employ sophisticated techniques for down-sampling, calculation of local statistics, and multi-scale processing may provide improvements in performance, but often incur the cost of additional computational complexity. When deployed at scale, these additional costs can be significant.
To evaluate the compared algorithms in the context of this performance-efficiency tradeoff, we plotted the SROCC achieved by each algorithm against its execution time. Since some methods leverage multithreading/multiprocessing, we report the user time instead of the wall time of the processes.
As with any run-time experiments, we expected to observe slight variations in execution times between runs due to varying system conditions. To account for this, we ran every SSIM implementation on each database five times and recorded the total execution time of each run.


TABLE 3. Off-the-shelf performance of SSIM and MS-SSIM implementations on compression.

We then reported the median run-time over the five runs.
We omitted the Tensorflow implementations from these experiments because they run prohibitively slowly on CPUs, and we cannot compare their GPU run-times while all other implementations are run on the CPU. We also omit the ClearView implementations, because we had to run them on custom hardware. The results on each database are shown in Fig. 1, where the Pareto-optimal implementations have been circled.
From these plots, it may be observed that among the implementations that we tested, Daala's Fast MS-SSIM and Scikit-Video's SSIM (using rectangular windows) implementations are Pareto-optimal most often, followed by the FastQA MS-SSIM implementation. In addition, among the SSIM implementations, Daala, LIBVMAF and the FastQA implementations were often Pareto-optimal across databases. Note that while the concept of Pareto-optimality is often used in the context of ''optimizing'' an encode in a rate-distortion sense by varying a parameter, no parameters were optimized during our experiments. In our setting, an implementation is considered to be Pareto-optimal among the set of considered implementations if there is no implementation that both achieves a higher SROCC and runs in less time.
The nominal computational complexity of SSIM is O(MNk²). We propose a method to improve the efficiency of SSIM when the weighting function is rectangular, i.e., w(i, j) = 1/k², by using integral images, also known as summed-area tables [48], [49]. This can be done by forming five integral images as follows:

$$I_1^{(1)}(i,j) = \begin{cases} \sum_{m \le i} \sum_{n \le j} I_1(m,n) & i, j > 0 \\ 0 & \text{otherwise,} \end{cases} \tag{18}$$

$$I_2^{(1)}(i,j) = \begin{cases} \sum_{m \le i} \sum_{n \le j} I_2(m,n) & i, j > 0 \\ 0 & \text{otherwise,} \end{cases} \tag{19}$$

$$I_1^{(2)}(i,j) = \begin{cases} \sum_{m \le i} \sum_{n \le j} I_1^2(m,n) & i, j > 0 \\ 0 & \text{otherwise,} \end{cases} \tag{20}$$

$$I_2^{(2)}(i,j) = \begin{cases} \sum_{m \le i} \sum_{n \le j} I_2^2(m,n) & i, j > 0 \\ 0 & \text{otherwise,} \end{cases} \tag{21}$$

$$I_{12}(i,j) = \begin{cases} \sum_{m \le i} \sum_{n \le j} I_1(m,n)\, I_2(m,n) & i, j > 0 \\ 0 & \text{otherwise.} \end{cases} \tag{22}$$

Given the integral image $I_1^{(1)}$, the sum over any k × k window W_ij can be calculated in constant time via

$$S_1^{(1)}(i,j) = I_1^{(1)}(i+k-1, j+k-1) + I_1^{(1)}(i-1, j-1) - I_1^{(1)}(i+k-1, j-1) - I_1^{(1)}(i-1, j+k-1). \tag{23}$$

This operation would require O(k²) time without the use of an integral image. Similarly, local sums are calculated using the other integral images, and denoted $S_2^{(1)}(i,j)$, $S_1^{(2)}(i,j)$, $S_2^{(2)}(i,j)$, and $S_{12}(i,j)$, respectively. The necessary local statistics are then calculated as

$$\mu_1(i,j) = S_1^{(1)}(i,j)/k^2, \tag{24}$$

$$\mu_2(i,j) = S_2^{(1)}(i,j)/k^2, \tag{25}$$

$$\sigma_1^2(i,j) = S_1^{(2)}(i,j)/k^2 - \mu_1^2(i,j), \tag{26}$$

$$\sigma_2^2(i,j) = S_2^{(2)}(i,j)/k^2 - \mu_2^2(i,j), \tag{27}$$

$$\sigma_{12}(i,j) = S_{12}(i,j)/k^2 - \mu_1(i,j)\,\mu_2(i,j). \tag{28}$$

In this way of rapidly computing the SSIM index, which is applicable when using a rectangular SSIM window, the compute complexity of SSIM is reduced to O(MN).
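A minimal sketch of this acceleration follows, with numpy cumulative sums building the summed-area tables of (18)-(22), the four-corner rule of (23), and the statistics of (24)-(28); it is an illustration of the method, not the FastQA implementation itself.

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a zero top row and left column, per (18)-(22)."""
    S = np.cumsum(np.cumsum(img.astype(np.float64), axis=0), axis=1)
    return np.pad(S, ((1, 0), (1, 0)))

def window_sums(img, k):
    """Sums of img over all k x k windows in O(MN) total time, per (23)."""
    I = integral_image(img)
    return I[k:, k:] - I[:-k, k:] - I[k:, :-k] + I[:-k, :-k]

def fast_local_stats(img1, img2, k):
    """Local means, variances and covariance, per (24)-(28)."""
    n = k * k
    mu1 = window_sums(img1, k) / n
    mu2 = window_sums(img2, k) / n
    var1 = window_sums(img1 ** 2, k) / n - mu1 ** 2
    var2 = window_sums(img2 ** 2, k) / n - mu2 ** 2
    cov = window_sums(img1 * img2, k) / n - mu1 * mu2
    return mu1, mu2, var1, var2, cov
```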


FIGURE 1. Correlation vs. execution time.

V. SCALED SSIM
Arguably the most widespread use of SSIM (and other picture/video quality models) is in evaluating the quality of compression encodes. On streaming and social media platforms, pictures and videos are commonly encoded at lower resolutions for transmission. This is done either because the source has low-complexity content and can be down-sampled with relatively little additional loss (or if the available bandwidth requires it), or to decrease the decoding load at the user's end.
Perceptual distortion models have become common tools for determining the quality of encodes for Rate-Distortion Optimization (RDO) [16]. Advances in video hardware have enabled the accelerated encoding and decoding of videos, making the distortion estimation step the bottleneck when optimizing encoding ''recipes.'' From Section IV-D, we know that the computational complexity of SSIM in terms of image dimensions is O(MN). Including a scale factor α by which we resize the image, the computational complexity is O(α²MN). Due to this quadratic growth, the computational load of distortion estimation is an increasingly relevant issue given the prevalence of high-resolution videos.
Therefore, it is of great interest to be able to accurately predict the quality of high-resolution videos that are distorted in two steps - scaling followed by compression. For example, consider High Definition (HD) videos that are first resized to a lower resolution, which we call the compression resolution, then encoded and decoded using, for example, H.264 at this compression resolution. The videos are then up-sampled to the original resolution before they are rendered for display. We will refer to this higher resolution as the rendering resolution.
We aim to reduce the computational burden of perceptually-driven RDO by circumventing the computation of SSIM at the rendering resolution, i.e., between the HD source and rendered videos. We propose a suite of algorithms, called Scaled SSIM in [50], which predict SSIM using only SSIM values computed at the lower compression resolution during runtime.


FIGURE 2. Video compression pipeline.

FIGURE 3. Histogram matching solution.

The video compression pipeline in which we solve the Scaled SSIM problem is illustrated in Fig. 2. We achieve this using two classes of models that efficiently predict Scaled SSIM, which we refer to as
• Histogram Matching
• Feature-based models
All of the proposed models operate on a per-frame basis. We first trained and tested the performance of these models on an in-house video corpus of 60 pristine videos. We compressed these videos at 6 compression scales (144p, 240p, 360p, 480p, 540p, 720p) using FFMPEG's H.264 (libx264) encoder at 11 choices of the Quantization Parameter (QP): 1, 6, 11, . . . , 51. In this manner, we obtained a total of 3960 videos having almost 1.75M frames.
On this corpus, we can evaluate the accuracy of predicting SSIM scores, i.e., the correlation between predicted and true SSIM, which was computed using the ''ssim'' filter in FFMPEG. However, the end goal is to predict subjective scores, which are not available for this corpus. So, we instead evaluated the performance of our models against subjective scores on the Netflix Public Database.

A. HISTOGRAM MATCHING
We observe a non-linear relationship between SSIM values across encoding resolutions. Because framewise SSIM scores are calculated by averaging the local quality map obtained from SSIM, we can estimate SSIM scores by matching the histograms of these quality maps. However, to match histograms, we require the true histogram at the rendering resolution, which is what we wish to avoid estimating.
So, we instead calculate the ''true'' quality map just once every k frames, and assume that the shape of the true histogram does not change significantly over a short period of k − 1 frames. This allows us to reuse this ''reference map'' for the next k − 1 frames as a heuristic model against which we match the shapes of the next k − 1 histograms at the compression scale. This histogram matching algorithm is illustrated in Fig. 3.
Let α ∈ (0, 1) be the factor by which we down-sampled the source video. Then, the ratio of required computation using our proposed approach to SSIM computation directly at the rendered scale is approximately

$$\left(1 - \frac{1}{k}\right)\alpha^2(1 + \beta + \gamma) + \frac{1}{k}(1 + \beta). \tag{29}$$

The factors β and γ account for computing and matching the histograms respectively, which are both O(MN) operations. This ratio is a decreasing function of k, and approaches α²(1 + β + γ) as k → ∞. By comparison, if the rendered SSIM map were not sampled, the ratio would be approximately α². In practice, we have observed that the time taken to compute and match histograms is comparable to the time taken to compute the SSIM map at the compression scale. So, the computational burden of the matching step is small, albeit not negligible.
This reduction in computational complexity as k increases is accompanied by a reduction in performance (accurate prediction of true SSIM), as shown in Fig. 4. We chose k = 5 in all experiments unless otherwise mentioned.

FIGURE 4. Correlation vs. sampling interval for histogram matching.

One drawback of this method is that it requires ''guiding information'' in the form of periodically updated reference quality maps. However, this issue is not a factor in the second class of models.
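Before moving on, the matching step itself can be sketched as follows. This is a minimal illustration rather than the exact procedure of [50]: it assumes that the mapping between the compression-scale and rendered-scale histograms is learned from the most recent ''anchor'' frame, for which both quality maps were computed, and is then applied to the following k − 1 compression-scale maps. The function and variable names are ours.

```python
import numpy as np

def estimate_rendered_mssim(comp_map, anchor_comp_map, anchor_rend_map):
    """Estimate rendered-scale MSSIM for a non-anchor frame.

    Each value of the current compression-scale quality map is passed
    through the anchor frame's compression-scale empirical CDF, and then
    through the inverse CDF (quantile function) of the anchor frame's
    rendered-scale quality map.
    """
    comp_sorted = np.sort(anchor_comp_map.ravel())
    rend_sorted = np.sort(anchor_rend_map.ravel())
    # Empirical quantile of each current value under the anchor's
    # compression-scale distribution.
    q = np.searchsorted(comp_sorted, comp_map.ravel()) / comp_sorted.size
    # Map those quantiles through the anchor's rendered-scale histogram.
    est_map = np.interp(q, np.linspace(0.0, 1.0, rend_sorted.size), rend_sorted)
    return float(est_map.mean())
```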


TABLE 4. Correlation with true SSIM on corpus test data.

TABLE 5. Correlation with DMOS on the Netflix Public Database.

B. FEATURE-BASED MODELS
As we observed earlier, the net quality degradation occurs in two steps - scaling and compression. So, we calculate the contribution of each operation and use these as features to calculate the net distortion.
Let X be a source video and Sα(X) denote the video obtained by scaling X by a factor α. Then, the result of up-sampling the down-sampled video back to the original resolution may be denoted by S₁/α(Sα(X)). The SSIM value between X and S₁/α(Sα(X)) is a measure of the loss in quality from down-sampling the video. Since this SSIM is independent of the choice of codec and compression parameters, it can be pre-computed.
The second source of quality degradation is compression. Let C(X; q) be the decoded video obtained by encoding the source video X using a Quantization Parameter (QP) q. Then, the SSIM value between Sα(X) and C(Sα(X); q) measures the loss of quality resulting from compression of the video.
We use these two SSIM scores as features to predict the true SSIM, and refer to these models as Two-Feature Models. In addition, we can also use the scaling factor α and the QP q as features. We call such models Four-Feature Models.
In both cases, we train three regressors to predict the SSIM value at the rendering scale on each frame. The three regressors considered are
• Linear Support Vector Regressor (Linear SVR)
• Gaussian Radial Basis Function SVR (Gaussian SVR)
• Fully Connected Neural Network (NN)
The Neural Network is a small fully connected network having a single hidden layer with twice the number of neurons as input features. We compared these models to a simple learning-free model, which is used as a baseline. The output of the baseline model is the simple product of the two SSIM features. This is similar to the 2stepQA picture quality model proposed in [35], [36] for two-stage distorted pictures. We call this the Product model.
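The following minimal sketch (ours) illustrates the two feature definitions and the learning-free Product baseline; ssim(), downscale() and upscale() are placeholders for any SSIM implementation and resampling routines, not functions defined in the paper.

```python
# A sketch of the two SSIM features and the learning-free Product
# baseline; the helper callables passed in are hypothetical.

def scaled_ssim_features(src_frame, decoded_frame, alpha,
                         ssim, downscale, upscale):
    # Feature 1: quality lost to scaling alone. This is independent of
    # the codec and QP, so it can be pre-computed once per source and
    # scale factor.
    down = downscale(src_frame, alpha)
    f_scale = ssim(src_frame, upscale(down, 1.0 / alpha))
    # Feature 2: quality lost to compression, measured at the
    # compression resolution.
    f_comp = ssim(down, decoded_frame)
    return f_scale, f_comp

def product_model(f_scale, f_comp):
    # Learning-free baseline: predicted rendered-scale SSIM is the
    # product of the two degradation features (cf. 2stepQA [35], [36]).
    return f_scale * f_comp
```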
C. RESULTS
The correlations between predicted SSIM and true SSIM achieved by the various models on our in-house corpus are shown in Table 4, where ''2'' and ''4'' denote the number of features input to each learning-based model. Among the feature-based models, the four-feature NN performed best. This is to be expected, given the great learning capacity of NNs. It is interesting to note, however, that the learning-free Product model yielded comparable or better performance at a negligible computational cost. Finally, the Histogram Matching model provided near-perfect predictions, outperforming all other models. The cost of this performance is the additional periodic calculation of reference quality maps/histograms.
Because our corpus contains videos generated at various compression scales and QPs, we were able to evaluate the sensitivity of our best models' performance to these choices of encoding parameters. We illustrate this in Fig. 5.
We observe that histogram matching performed consistently well across all encoding parameters, with only a slight decrease in performance in high-quality regions, i.e., high compression scale and low QP. We attribute this to the fact that most quality values at low QPs are very close to 1. As a result, the histogram of quality scores at the compression scale is concentrated close to 1, making histogram matching difficult.
On the other hand, the feature-based models performed poorly in low-quality regions, i.e., low compression scale and high QP. However, videos are seldom compressed at such low qualities, so this does not affect performance in most practical use cases.
Table 5 compares the models based on the correlation they achieved against subjective opinion scores on the Netflix Public Database. Because our goal is to predict SSIM efficiently, we hold the performance of ''true'' SSIM as the gold standard against which we evaluated the performance of the Scaled SSIM models. Because videos in this database were generated by setting bitrates instead of QPs, we only tested our two-feature and histogram matching models. It is important to note that these models were not retrained on the Netflix database.
From the table, it may be observed that SSIM estimated by Histogram Matching matches the performance of true SSIM. We also observe that the feature-based models approach true SSIM's performance, with the Product Model offering an effective low-complexity alternative to the learning-based models.

VI. DEVICE DEPENDENCE
The Quality of Experience of an individual user varies to a great extent, depending not only on the visual quality of the content, but also on various other factors. These include, but may not be limited to [51]:


FIGURE 5. Variation in performance with choice of encoding scale and QP.

1. Context of media contents.
2. Users' viewing preferences.
3. Condition of the terminal and application used for displaying contents.
4. Network bandwidth and latency.
5. The environment where users view content (background lighting conditions, viewing distances, audio devices, etc.).

From the perspective of compression and quality assessment algorithms, the factors to be considered are network fluctuations and terminal screens. The rest of this section restricts its scope to the size of screens (vertical and pixel resolutions), excluding the influence of other variables.

A. IMPACT OF SCREEN SIZE
The issue of screen size on viewing quality originally arose in the context of television [52]. Large wall-sized, theatre, and IMAX displays provide better experiences, with subjects reporting increased feelings of ''reality,'' i.e., the illusion that viewers are present at the scene. Studies have also shown a possible relation between screen size and viewers' intensity of response to contents.

B. DISPLAYED CONTENT AND PERCEIVED PROJECTION
The authors of [51] compared the experiences of users of devices of different screen sizes, both on web browsing and video viewing. Mean Opinion Scores were found to vary significantly, with an approximate difference of Δ = 1 MOS between high-end devices and low-end devices of that time. It was also discovered that viewing of videos on tablets (iPad, etc.) benefited more from displaying contents of higher resolutions (a more significant impact on MOS) as compared to mobile phones. User experiences are a combined function of screen size, content resolution, and viewing distance. This is quantified by the contrast sensitivity function (CSF) of the HVS [53], which is broadly band pass, so that increases in resolution beyond a certain limit are not perceivable, hence pose little impact on user experience. The pass band of the human spatial CSF peaks between 1-10 cycles/degree (cpd) (depending on illumination and temporal factors), falling off rapidly beyond. Naturally, it is desirable that picture and video quality assessment algorithms be able to adapt to screen size against assumed viewing distance, i.e., by characterizing the projection of the screen on the retina [54]. Given a screen height H and a viewing distance D, the full-screen viewing angle is

$$\alpha = 2 \arctan\left(\frac{H}{2D}\right); \tag{30}$$

then, if the viewing screen contains L pixels on each (vertical or horizontal) line, the spatial frequency of the pixel spacing is defined in cpd as

$$f_{max} = \frac{L}{2\alpha}. \tag{31}$$

C. TRANSFORMING ACROSS VARIOUS SCALES
While primitive specifications of viewing distances and pixel resolutions have been provided by the ITU [55], existing subjective picture and video quality databases are mainly defined by their content and their testing environments. Likewise, nearly all quality assessment models that operate over scales use the down-sampling transform

$$Z_\alpha = \max\left(1, \left\lfloor \frac{H_I}{256} \right\rfloor\right), \tag{32}$$

which creates discontinuities as the height is varied (e.g., H_I = 510 vs. H_I = 520) and does not account for the viewing distance D. However, the Self-Adaptive Scale Transform (SAST) [56] seeks to remediate these weaknesses. SAST uses both the viewing distance and the user's angle of view:

$$Z_s = \sqrt{\frac{H_I \cdot W_I}{H \cdot W}} \tag{33}$$

$$= \sqrt{\frac{1}{4} \cdot \frac{H_I}{\tan\left(\frac{\theta_H}{2}\right) D} \cdot \frac{W_I}{\tan\left(\frac{\theta_W}{2}\right) D}}, \tag{34}$$

where D denotes the viewing distance, H_I and W_I are the height and width of the display, and the corresponding projection on the retina is H by W. Commonly assumed horizontal and vertical viewing angles are θ_H = 40° and θ_W = 50°.
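As a quick numerical illustration of (30) and (31) (a sketch of ours, with the angle expressed in degrees so that f_max comes out in cycles/degree):

```python
import math

def pixel_nyquist_cpd(screen_height_m, viewing_distance_m, n_lines):
    """Full-screen viewing angle (30) and pixel-spacing frequency (31).

    Returns the viewing angle in degrees and f_max in cycles/degree.
    """
    alpha_deg = 2 * math.degrees(
        math.atan(screen_height_m / (2 * viewing_distance_m)))
    f_max = n_lines / (2 * alpha_deg)
    return alpha_deg, f_max

# Example: a 0.3 m tall, 1080-line display viewed from 1 m gives
# alpha ~ 17 degrees and f_max ~ 31.6 cpd, well above the 1-10 cpd
# peak of the spatial CSF.
print(pixel_nyquist_cpd(0.3, 1.0, 1080))
```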


Because of the band pass nature of the visual system, a picture or video may be preprocessed using Adaptive High-frequency Clipping (AHC) [57]. In this method, instead of losing high-frequency information during scaling, high frequencies are selectively assigned smaller weights in a wavelet domain.
A dedicated database called VDID2014 [58] was created to record human visual responses under varying viewing distances and pixel resolutions. The authors derived an optimal scale selection (OSS) model based on the SAST and AHC models. Instead of directly comparing reference and distorted contents, all frames of the picture or video are first preprocessed according to an assumed or derived viewing distance and pixel resolution, before applying quality assessment algorithms such as SSIM. The OSS model significantly boosted the performance of both legacy and modern IQA/VQA models while not significantly increasing computational complexity.

D. TRADEOFFS
While modern viewers maintain high expectations of visual quality, regardless of the device and picture or video applications being used, efforts to achieve consistent quality have been inconsistent across screens of various sizes and resolutions. A study of H.264 streams without packet loss [59] demonstrated that the required bit rate grows linearly with the horizontal screen size, while the expected level of subjective quality was kept fixed. However, the required bit rates grew much faster with increased expectations of perceived quality, with saturation of MOS at high bit rates, and little quality improvement with increased bit rate. Notwithstanding future improvements in picture/video representation and compression technologies, the results of [59] supply approximate upper bounds of MOS against content resolution and bit rate.

E. MOBILE DEVICES
Mobile devices have advanced rapidly in recent years, featuring larger screens and higher resolutions. Most legacy databases that focused on explaining the impact of screen sizes and pixel resolutions were constructed using Personal Digital Assistants (PDAs) or older cell phones with displays smaller than 4.5 inches. Popular resolutions at the time of these studies were 320p or 480p, which rarely appear on contemporary mobile devices. In an effort to investigate the issue on screens larger than 5.7 inches and at resolutions of 1440p (2K) or 2160p (4K), a recent subjective study [60] focused on more recent mobile devices having larger resolutions and screen sizes. The contents viewed by the subjects were re-scaled to 4K.
The results from a one-way analysis of variance (ANOVA) suggested no significant relevance between screen size and perceived quality on screens ranging from 4 inches to 5.7 inches. However, a considerable MOS improvement of 0.15 was achieved by 1080p content over 720p content, but no further improvement was observed by increasing the content resolutions to 1440p, suggesting a saturation of perceived quality with resolution. Of course, MOS tends to remain constant across content resolutions higher than that of the display, suggesting that service providers restrict spatial resolutions whenever device specifications are available.

VII. EFFECT OF WINDOW FUNCTIONS ON SSIM
At the heart of SSIM lies the computation of the local statistics - means, variances, and covariances - comprising the luminance, contrast and structure terms. As described in Section II, the computational complexity of SSIM is O(MNk²).

A. EFFECT OF WINDOW SIZE
While computational complexity increases quadratically with window size, using larger windows does not guarantee better performance. Indeed, since picture and video frames are non-stationary, computing local statistics is highly advantageous for local quality prediction. While small values of k can lead to noisier estimates due to a lack of samples, choosing large values of k risks losing the relevance of local luminance, contrast, structure, and distortion.
As mentioned earlier, the two most common choices of SSIM windows are Gaussian-shaped and rectangular-shaped. Traditionally, the use of rectangular windows is not recommended in image processing, due to frequency side-lobes and resulting ''noise-leakage.'' Because the frequency response of a rectangular window is a sinc function, undesired high frequencies can leak into the output. To mitigate this effect, Gaussian filtering is usually preferred, especially for denoising applications or if noise may be present. For these reasons, Gaussian-shaped windows were recommended by the authors of SSIM when calculating local statistics.
To investigate the effect of the choice of window and window size, we considered a set of scaling constant values σ = 1.0, 1.5, . . . , 6.0 of the Gaussian window, while truncating them at a width of about 7σ and forcing the width to be odd. Since only the Python implementations allowed setting σ, we restricted our experiments to these implementations. All these experiments were conducted using a stride of 1.
Given a Gaussian window of standard deviation σ, one can construct analogous rectangular windows in three ways: having the same physical size (i.e., width and height), having the same variance (considering the rectangular window as a sampled uniform distribution), or having the same (3dB) bandwidth. To specify a rectangular window of size 2K + 1, we only need to specify K. Equating the variance of the Gaussian to that of a uniform distribution yields K = ⌈σ√3⌉, where ⌈·⌉ denotes the ceiling operation. Similarly, equating the 3dB bandwidths of the two filters requires K = ⌈1.602σ⌉.
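In code, the two equivalences read as follows (a small sketch; the function and its name are ours):

```python
import math

def equivalent_rect_half_width(sigma, criterion="variance"):
    """Half-width K of a (2K+1)-tap rectangular window matched to a
    Gaussian window of standard deviation sigma."""
    if criterion == "variance":   # equate second moments
        return math.ceil(sigma * math.sqrt(3))
    if criterion == "bandwidth":  # equate 3dB bandwidths
        return math.ceil(1.602 * sigma)
    raise ValueError(criterion)

# For sigma = 1.5, both criteria give K = 3, i.e., a 7 x 7 window.
```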


FIGURE 6. Effect of size and choice of window function on performance.

FIGURE 7. Effect of size and choice of window function on performance on compression data.

The variation of SSIM performance against window choice is shown in Fig. 6. For simplicity, we only show the performance of rectangular windows having the same physical size. We observed similar results for rectangular windows having the same variance and 3dB bandwidth. Surprisingly, the figure indicates that rectangular windows are an objectively better design choice than equivalent Gaussian windows! In particular, rectangular windows outperform Gaussian windows for smaller window sizes, achieving slightly higher peak performance. Our experiments suggest that using windows of linear dimension in the range 15 to 20 offers a good tradeoff between performance and computation on both picture databases. Using rectangular windows also offers the possibility of a significant computational advantage, because they can be implemented efficiently using integral images, as discussed in Section IV.
We report the variation of performance against window size on compression-distorted data in Fig. 7. From these plots, it may be seen that when tested on compression distortions, SROCC is maximized for smaller window sizes, in the range 7 to 15. In both figures, it may be seen that the Scikit-Video implementations peak at smaller window sizes compared to other implementations. This is explained by the fact that the Scikit-Video implementations downsample images, which increases the ''effective'' size of a window function with respect to the original image.

B. EFFECT OF STRIDE
It is also possible to compute SSIM in a subsampled way, by including a stride s, where s is the distance between adjacent windows on which SSIM is computed. Then, the computational complexity of SSIM is O(MNk²/s²).
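A strided evaluation can be sketched by simply stepping the window origin by s pixels; the fragment below (ours) shows the idea for local means, and the same striding applies to the variances and covariance:

```python
import numpy as np

def strided_window_means(img, k, s):
    """Local k x k means evaluated only every s pixels.

    Only about 1/s^2 of the windows are visited, reducing the cost of
    SSIM from O(MNk^2) to O(MNk^2 / s^2).
    """
    M, N = img.shape
    out = np.empty(((M - k) // s + 1, (N - k) // s + 1))
    for a, i in enumerate(range(0, M - k + 1, s)):
        for b, j in enumerate(range(0, N - k + 1, s)):
            out[a, b] = img[i:i + k, j:j + k].mean()
    return out
```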


FIGURE 9. Example of fitting 5PL curve to a scatter plot of MOS vs. MS-SSIM.

FIGURE 8. Variation of performance with stride.

We tested the effect of stride on performance using our FastQA Python implementation, because none of the existing implementations allow varying the stride. In Fig. 8, we report the variation of SROCC with stride, where lines labelled ''(Comp)'' denote the performance on compression-distorted data.
From the figure, it may be seen that the SROCC is largely unaffected by stride for s ≤ 5. This means that by choosing a stride of s = 5, we can obtain a significant improvement in efficiency (a 25x speedup), with little change in prediction performance.

VIII. MAPPING SCORES TO SUBJECTIVE QUALITY
Because of the different parameter configurations and approximations they use, the many available implementations of SSIM tend to disagree with each other, producing (usually slightly) different scores on the same contents. Of course, any inconsistency between deployed SSIM models is undesirable, since in a given application (such as controlling an encoder), changing the SSIM implementation may lead to unpredictable results.
One way to address this issue is by applying a pre-determined function to map the obtained SSIM results to subjective scores. Among a collection of both nonlinear and piecewise linear mappings, the 5PL function in (17) is particularly useful.

A. IMPROVEMENT DUE TO MAPPING
Fig. 9 shows a typical example of fitting raw results (a scatter plot of MOS vs. MS-SSIM) to a 5PL function. Both the MOS and objective scores cover fairly wide ranges, while the mapping function lies approximately in the middle of these. It can easily be observed that utilizing the fitted curve yields a considerable improvement in PCC and RMSE, while the SROCC remains the same due to the monotonic nature of the function. Table 6 shows the improvement obtained in PCC and RMSE after utilizing the fitted curve. For most of the evaluated models there is a considerable performance enhancement.

B. GENERALIZING TO OTHER DATABASES
While better performance against subjective scores is obtained after mapping the raw data using logistic functions, this only works on databases where subjective scores are available to help optimize the model parameters. In real life, however, social media and streaming service providers lack subjective opinions of their shared or streamed content. This means that it is uncertain whether a set of fitted parameters will apply well to unknown data. To study the generalizability provided by logistic mapping, we optimized the function parameters on each individual database, then mapped the raw data in the other three databases using the obtained model, as a way of assessing performance on unknown content. If the correlation metrics were to remain similar, it would demonstrate that the logistic function can be used to provide steady performance on unseen data.
The results of the cross-database generalization experiments are shown in Table 7. The result in each cell of the table is the RMSE obtained by training a 5PL function on a ''source'' database (SD), then testing it on each ''target'' database (TD). From these experiments, we observed that when using MATLAB SSIM, the 5PL function generalized well between the TID2013 and LIVE IQA databases. However, we observed poor generalization when fitting the 5PL on the LIVE VQA database, then testing on the LIVE IQA database. While it may be too much to expect strong performance on video distortions after training on pictures and picture distortions, the lesson learned is still that a user or service provider should either select the most relevant database to train on, or conduct a user study directed to their use case, on which a SSIM mapping may be optimized.
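A 5PL mapping of the form (17) can be fitted with standard non-linear least squares. The sketch below (ours) uses scipy; the initialization heuristic is an assumption, not a prescription from the text.

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic_5pl(x, b1, b2, b3, b4, b5):
    """Five-parameter logistic mapping of equation (17)."""
    return b1 * (0.5 - 1.0 / (1.0 + np.exp(b2 * (x - b3)))) + b4 * x + b5

def fit_5pl(ssim_scores, mos):
    """Fit (17) to (SSIM, MOS) pairs."""
    p0 = [np.max(mos) - np.min(mos), 10.0,
          np.mean(ssim_scores), 0.0, np.mean(mos)]
    params, _ = curve_fit(logistic_5pl, ssim_scores, mos,
                          p0=p0, maxfev=20000)
    return params

# Map raw SSIM scores from a new database through a previously fitted
# curve: predicted_mos = logistic_5pl(new_ssim, *params)
```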


TABLE 6. Improvement in performance due to linearization.

TABLE 7. Generalizability of 5PL SSIM mappings across databases.

IX. COLOR SSIM
Of course, the vast majority of shared and streamed pictures and videos are in color. Hence it is naturally of interest to understand whether SSIM can be optimized to also account for color distortions. However, most available SSIM implementations only operate on the luminance channel. Distortions of the color components may certainly exert considerable influence on subjective quality. The most common approach to incorporating color information into SSIM is to calculate it on each color channel, whether RGB, YUV, or another color space, then combine the channel results.
More sophisticated approaches have been taken to incorporate color channel information into quality models. For example, CMSSIM [61] utilized CIELAB color space [62] distances to better distinguish color distortions and noise. This approach evolved, based on a later subjective study, into CSSIM [63], which generalizes the calculations of SSIM. Another approach, called SHSIM [64], defines hue similarity (HSIM) much like structural similarity, then combines SSIM scores together with the HSIM scores. The combination of the two was found to better predict subjective quality than when only using luminance or color. The method called Quaternion SSIM (QSSIM) [65] combines multi-valued RGB (or any other tristimulus) color space vectors from picture or video pixels into a quaternion representation, providing a formal way to assess luminance and chrominance signals and their degradations together.
Although different in their formulations, these algorithms express individual frames in a tristimulus color space, whether RGB, YUV, or CIELAB, depending on the application. In the following, we will define and assess each of these approaches.

A. QUATERNION SSIM
The quaternion SSIM algorithm uses quaternions [65] to represent color pixels with a vector of complex-like components (quaternions are often described as extensions of complex or phasor representations):

$$q(m,n) = r(m,n) \cdot i + g(m,n) \cdot j + b(m,n) \cdot k. \tag{35}$$

The quaternion picture or video frame can then be decomposed into constituent ''DC'' and ''AC'' components via

$$\mu_q \triangleq \frac{1}{MN} \sum_{m=1}^{M} \sum_{n=1}^{N} q(m,n) \tag{36}$$

and

$$ac_q \triangleq q(m,n) - \mu_q. \tag{37}$$

The quaternion contrast is then defined as

$$\sigma_q = \sqrt{\frac{1}{(M-1)(N-1)} \sum_{m=1}^{M} \sum_{n=1}^{N} \left\| ac_q \right\|_2^2}, \tag{38}$$

which, when computed on both reference and test signals, is used to form a correlation factor

$$\sigma_{q_{ref},dis} = \frac{1}{(M-1)(N-1)} \sum_{m=1}^{M} \sum_{n=1}^{N} ac_{q_{ref}} \cdot ac_{q_{dis}}, \tag{39}$$

yielding a quaternion formulation similar to the legacy grayscale SSIM:

$$\text{QSSIM} = \left(\frac{2\,\mu_{q_{ref}} \cdot \mu_{q_{dis}}}{\mu_{q_{ref}}^2 + \mu_{q_{dis}}^2}\right)\left(\frac{\sigma_{q_{ref},dis}}{\sigma_{q_{ref}}^2 + \sigma_{q_{dis}}^2}\right). \tag{40}$$

B. CMSSIM
The CMSSIM algorithm first transforms the input picture or video signal into the CIE XYZ tristimulus color space. These XYZ pixels are then transformed into luminance, red-green, and blue-yellow planes [63] as

$$\begin{bmatrix} Q_1 \\ Q_2 \\ Q_3 \end{bmatrix} = \begin{bmatrix} 0.279 & 0.72 & -0.107 \\ -0.449 & 0.29 & -0.077 \\ 0.086 & -0.59 & 0.501 \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \end{bmatrix}. \tag{41}$$


The resulting chromatic channels are then smoothed using Gaussian kernels, then transformed back into the XYZ tristimulus color space via

$$\begin{bmatrix} X \\ Y \\ Z \end{bmatrix} = \begin{bmatrix} 0.6204 & -1.8704 & -0.1553 \\ 1.3661 & 0.9316 & 0.4339 \\ 1.5013 & 1.4176 & 2.5331 \end{bmatrix} \begin{bmatrix} Q_1 \\ Q_2 \\ Q_3 \end{bmatrix} \tag{42}$$

and then into CIELAB L*, a*, and b*. The dissimilarity between the reference and test chromatic components is then found as

$$\Delta E(x,y) = \left[\left(L_1^*(x,y) - L_2^*(x,y)\right)^2 + \left(a_1^*(x,y) - a_2^*(x,y)\right)^2 + \left(b_1^*(x,y) - b_2^*(x,y)\right)^2\right]^{\frac{1}{2}}, \tag{43}$$

which is then used to weight the values of the final SSIM map:

$$\text{CSSIM}(x,y) = l(x,y) \cdot c(x,y) \cdot s(x,y) \cdot \left(1 - \frac{\Delta E(x,y)}{45}\right). \tag{44}$$

C. HSSIM
The HSSIM index is calculated by first transforming pictures or frames into an HSV color space. The color quality is then predicted using a weighted average of SSIM and hue similarity:

$$\text{HSSIM}(x,y) = \frac{\text{SSIM}(x,y) + 0.2\,H(x,y)}{1.2}, \tag{45}$$

where H(x, y) is of the same form as SSIM, but operates on the hue channel instead of grayscale. Next we discuss straightforward channel-wise SSIM as applied in YUV space. This model is also tested on the four databases.

D. CHANNEL-WISE SSIM
While image sensors normally capture RGB data in accordance with photopic (daylight) retinal sampling, most of the structural information is present in the luminance signal. This implies the existence of a lower-information (reduced bandwidth) color representation. In fact, both the retinal representation and modern opponent (chromatic) color spaces exploit this property of visual signals. Modern social media and streaming platforms ordinarily process RGB into a chromatic space such as YUV or YCrCb prior to compression. Likewise, IQA/VQA may be defined on color frames represented by luminance and chrominance.
Since chromatic representations reduce the correlation between color planes, the chromatic components with reduced entropies may be down-sampled prior to compression. YCbCr values can be obtained directly from RGB via a linear transformation, typically

$$Y = 0.213 \times R + 0.715 \times G + 0.072 \times B, \tag{46}$$

$$Cb = 0.539 \times (B - Y), \tag{47}$$

$$Cr = 0.635 \times (R - Y), \tag{48}$$

assuming ITU-R BT.709 conversion. However the YCbCr components are defined, a chromatic SSIM model may be defined based on a weighted average of the objective qualities of the individual YCbCr channels:

$$F(ref, dis) = \left[f(Y_{ref}, Y_{dis}) + \alpha f(Cb_{ref}, Cb_{dis}) + \beta f(Cr_{ref}, Cr_{dis})\right] / (1 + \alpha + \beta), \tag{49}$$

where f(·, ·) denotes the similarity between reference and distorted frames. In our case, the baseline SSIM is used as the base QA measure, i.e., f(·, ·) = SSIM(ref, dis). This method is used in popular image and video processing tools like FFMPEG and Daala, where a Color SSIM is calculated as

$$\text{SSIM} = 0.8 \cdot \text{SSIM}^Y + 0.1 \cdot \text{SSIM}^{Cb} + 0.1 \cdot \text{SSIM}^{Cr}. \tag{50}$$

Instead of testing all possible combinations of the hyperparameters during our experiments, we fixed α = β. We found α = β = −0.3 to yield optimal results in our experiments, with the obtained performances of this optimized Color SSIM (CSSIM) on the four databases included in Table 8.

E. RESULTS
We compared the performances of the above four Color SSIM models on the same databases, with the results tabulated in Table 8. As may be observed, using color information can noticeably boost SSIM's quality prediction power, with QSSIM RGB and YUV yielding the largest gains.

X. SPATIO-TEMPORAL AGGREGATION OF QUALITY SCORES
In its native form, SSIM is defined on a pair of image regions and returns a local quality score. When applied to a pair of images, a quality map is obtained of (approximately) the same size as the image. The most common method of aggregating these local quality values is to calculate the Mean SSIM (MSSIM) to obtain a SSIM score on the entire image. This method of aggregating quality scores is also usually applied when applying SSIM to videos. Frame-wise MSSIM scores are calculated between pairs of corresponding frames, and the average value of MSSIM (over time) is reported as the single SSIM score of the entire video.
In the context of HTTP streaming, the authors of [66] evaluated various ways of temporally pooling SSIM scores, and found that over longer durations, the simple temporal mean performed about as well as other more sophisticated pooling strategies. Here, we summarize and expand this work by simultaneously testing various spatial and temporal aggregation methods on the two video databases.
We begin by discussing various spatial and temporal pooling strategies that can be used to pool SSIM. Some of these methods require the tuning of hyperparameters. To optimize these hyperparameters, we compare the SROCC achieved by each choice of hyperparameters while using the baseline sample mean as the other pooling method. That is, when optimizing a spatial pooling method, we use temporal mean pooling, and vice versa.
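The simpler pooling strategies compared below can be sketched compactly. The following illustration (ours) implements the Pythagorean means, CoV pooling, and their windowed variants, as defined in (51) and (52) of the next subsection:

```python
import numpy as np

def pool_frame_scores(scores, method="AM", window=None):
    """Temporal pooling of framewise SSIM scores.

    AM/GM/HM are the Pythagorean means; "CoV" is sigma/mu, as in (51).
    Passing a window size applies the statistic over sliding windows and
    then averages the results, as in the windowed variants (52).
    """
    s = np.asarray(scores, dtype=np.float64)
    if window is not None:
        window_scores = [pool_frame_scores(s[t:t + window], method)
                         for t in range(len(s) - window + 1)]
        return float(np.mean(window_scores))
    if method == "AM":
        return float(np.mean(s))
    if method == "GM":   # geometric mean; framewise SSIM is almost always > 0
        return float(np.exp(np.mean(np.log(s))))
    if method == "HM":   # harmonic mean
        return float(len(s) / np.sum(1.0 / s))
    if method == "CoV":  # coefficient of variation (51)
        return float(np.std(s) / np.mean(s))
    raise ValueError(method)
```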

E. RESULTS
We compared the performances of the above four Color SSIM models on the same databases, with the results tabulated in Table 8. As may be observed, using color information can noticeably boost SSIM's quality prediction power, with QSSIM in RGB and YUV yielding the largest gains.

X. SPATIO-TEMPORAL AGGREGATION OF QUALITY SCORES
In its native form, SSIM is defined on a pair of image regions and returns a local quality score. When applied to a pair of images, a quality map is obtained of (approximately) the same size as the image. The most common method of aggregating these local quality values is to calculate the Mean SSIM (MSSIM), to obtain a single SSIM score for the entire image. This method of aggregating quality scores is also usually applied when applying SSIM to videos: frame-wise MSSIM scores are calculated between pairs of corresponding frames, and the average value of MSSIM (over time) is reported as the single SSIM score of the entire video.

In the context of HTTP streaming, the authors of [66] evaluated various ways of temporally pooling SSIM scores, and found that over longer durations, the simple temporal mean performed about as well as other more sophisticated pooling strategies. Here, we summarize and expand this work by simultaneously testing various spatial and temporal aggregation methods on the two video databases.

We begin by discussing various spatial and temporal pooling strategies that can be used to pool SSIM. Some of these methods require the tuning of hyperparameters. To optimize these hyperparameters, we use the baseline sample mean as the other pooling method, and compare the SROCC achieved by each choice of hyperparameters. That is, when optimizing a spatial pooling method, we use temporal mean pooling, and vice versa.
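This tuning protocol can be sketched as follows. The harness below is purely illustrative (the names pool_fn, quality_maps, and mos are hypothetical): each candidate hyperparameter is scored by the SROCC between the pooled scores and the subjective scores, holding the other pooling stage fixed at the sample mean, as described above.

    import numpy as np
    from scipy.stats import spearmanr

    def tune_spatial_pooling(pool_fn, quality_maps, mos, candidates):
        # pool_fn(Q, h): pools one quality map Q with hyperparameter h.
        # quality_maps: per-video lists of SSIM maps; mos: subjective scores.
        best = (None, -np.inf)
        for h in candidates:
            # Spatial pooling with candidate h, then temporal mean pooling.
            scores = [np.mean([pool_fn(Q, h) for Q in video])
                      for video in quality_maps]
            srocc = spearmanr(scores, mos).correlation
            if srocc > best[1]:
                best = (h, srocc)
        return best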


TABLE 8. Comparison of color SSIM models.

As we will discuss below, many methods have been proposed which leverage either side information, such as visual attention maps, or computationally intensive procedures like optical flow estimation. While these methods offer principled ways to improve the SSIM model, we omit them from our comparisons because they have a high cost, which is often unsuitable for practical deployments of SSIM at large scales.

In subsequent sections, all of the SSIM quality maps were generated using the Scikit-Image implementation of SSIM, with rectangular windows and the default parameters. While the exact results of the experiments may vary slightly with the choice of ''base implementation,'' we expect these trends to hold across implementations.

A. MOMENT-BASED POOLING
A straightforward extension of the averaging operation used in SSIM is to replace it by one of the other two Pythagorean means: the Geometric Mean (GM) and the Harmonic Mean (HM). Since local SSIM scores can be negative, we can only use GM and HM pooling on the structural dissimilarity (DSSIM) scores, i.e., 1 − SSIM. However, this means that if the SSIM at any location is close to 1, the pooled score is dramatically decreased. So, we do not recommend using GM or HM for spatial pooling. However, framewise SSIM scores are nearly always positive, so we investigate the use of the Pythagorean means for temporal pooling. We also consider the sample median, since it is a robust measure of central tendency (MCT), unlike the mean.

Another method of pooling quality scores can be found in the MOVIE index [23], where it was found that the coefficient of variation (CoV) of quality values correlated well with subjective scores. Let x = (i, j) denote spatial indices. Given a quality map Q(x, t) having mean value µQ(t) and standard deviation σQ(t), the CoV-pooled score is defined as

S_{CoV}(t) = \sigma_Q(t) / \mu_Q(t). \quad (51)

The same method can be used to pool frame-wise quality scores. One can also adapt the CoV method to temporal pooling using windowing, where the CoV is computed over short temporal windows before being averaged. That is, given a sequence of ''local'' temporal CoV values ρQ(t), define the windowed-CoV-pooled score

S_{W\text{-}CoV} = \frac{1}{T} \sum_t \rho_Q(t). \quad (52)
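Both (51) and (52) are direct to implement. A minimal sketch follows, with our own naming; the use of non-overlapping windows in the windowed variant is an assumption of the sketch, since the precise windowing convention admits several choices.

    import numpy as np

    def cov_pool(q):
        # (51): coefficient of variation of a quality map or score series.
        q = np.asarray(q, dtype=np.float64)
        return q.std() / q.mean()

    def windowed_cov_pool(frame_scores, w):
        # (52): average the 'local' CoV over temporal windows of length w.
        q = np.asarray(frame_scores, dtype=np.float64)
        rho = [cov_pool(q[i:i + w]) for i in range(0, len(q) - w + 1, w)]
        return np.mean(rho)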


In the same vein, windowed versions of the three Pythagorean means can be used for temporal pooling. Using framewise SSIM scores obtained with the Scikit-Image implementation and a rectangular window of size 11, we tested the performance of the three windowed means (W-AM, W-GM, W-HM) and windowed-CoV (W-CoV) pooling. We analyzed the variation of the performance of the windowed means against window size, for window sizes w = 1, 2, ..., 100. The results of these experiments are shown in Fig. 10. We omitted the W-CoV method from this plot because it gave significantly inferior performance, as shown in Table 9, which lists the best performance of each windowed method.

FIGURE 10. Windowed-Moment-Pooling SROCC vs Window Size.

TABLE 9. Performance of Windowed-Moment-Pooling.

These plots reveal similar trends. There is an initial decrease in performance with increased window size, but for large enough windows, there is an improvement in performance over the baseline. While windowed-CoV performed very poorly, the difference between the three Pythagorean means is small, with windowed-GM being a good choice. However, to observe a reliable improvement in performance, a large window size is needed, of k ≈ 80 on the LIVE VQA database and k ≈ 50 on the Netflix Public database. Such large values of k could lead to significant delays in real-time applications, which may not be a reasonable cost considering the small increase in performance.

B. FIVE-NUMBER SUMMARY POOLING
The five-number summary (FNS) [67] method was proposed as a better way of summarizing the histogram of a spatial quality map, as compared to the simple mean. Given a spatial quality map Q(x, t) at time t, let Q_min(t) denote the minimum value, Q_1(t) the 25th percentile (lower quartile), Q_med(t) the median value, Q_3(t) the 75th percentile (upper quartile), and Q_max(t) the maximum value. The five-number summary score is then defined as

S_{FNS}(t) = \frac{Q_{min}(t) + Q_1(t) + Q_{med}(t) + Q_3(t) + Q_{max}(t)}{5}. \quad (53)

Of course, FNS may likewise be applied to the framewise quality scores as a way of temporal pooling.

C. MEAN-DEVIATION POOLING
In [68], the authors proposed an SSIM-like quality index, which is then pooled using a ''mean-deviation'' operation. The deviation is defined as the power o of the Minkowski distance of order p between the quality values and their mean. More concretely, given a spatial quality map Q(x, t) at time t having mean value µQ(t), the pooled mean-deviation quality score is given by

S_{MD}^{(p,o)}(t) = \left( \frac{1}{MN} \sum_x \left( Q(x,t) - \mu_Q(t) \right)^p \right)^{1/po}. \quad (54)

In our experiments, the most common optimal choice was p = 2, corresponding to the standard deviation. When applying MD pooling to temporal scores, the final exponent o does not affect the SROCC, since exponentiation is a monotonic function. So, while we select p using the SROCC as discussed above, we choose o for temporal pooling by comparing PCC values.

D. LUMINANCE-WEIGHTED POOLING
In [21], the authors proposed a method of spatial pooling which assigns weights to regions of an image based on the local luminance (brightness), which we call Luminance-Weighted (LW) pooling. These weights are used to account for the fact that the HVS is less sensitive to distortions in dark regions. Following our earlier convention, the local mean µ1(x) is a measure of the local luminance in the reference image. Given a lower limit a_s and an interval length b_s, the weighting function is defined as

w_{LW}(x) = \begin{cases} 0, & \mu_1(x) < a_s \\ (\mu_1(x) - a_s)/b_s, & a_s \le \mu_1(x) < a_s + b_s \\ 1, & \mu_1(x) \ge a_s + b_s \end{cases} \quad (55)

Then, the spatially-weighted SSIM score is given by

S_{LW}(t) = \frac{1}{MN} \sum_x w_{LW}(x) Q(x, t). \quad (56)

We tested the performance of LW-pooling on all four databases for values of a_s = 0, 10, ..., 100 and b_s = 0, 10, ..., 50. Note that choosing a_s = b_s = 0 corresponds to the standard baseline SSIM. The experimental variation of performance (SROCC) against choices of a_s and b_s is shown in Fig. 11.

On the LIVE IQA and Netflix Public databases, we observed that the best performance was achieved by the baseline a_s = b_s = 0. On the other two databases, the improvement in performance was insignificant, with an elevation of SROCC of less than 0.002. So, we do not recommend using LW-pooling.

FIGURE 11. LW-pooling SROCC vs parameters a_s and b_s.

E. DISTORTION-WEIGHTED POOLING
Distortion-Weighted (DW) pooling is a method that assigns different weights to low- and high-quality regions. We consider the common method of distortion weighting, where the weight assigned to a quality score is proportional to a power of the quality score. Concretely, given an exponent p, the spatial DW-pooled score of a spatial quality map Q(x, t) at time t is given by

S_{DW}^{(p)}(t) = \frac{\sum_x (1 - Q(x,t))^p \, Q(x,t)}{\sum_x (1 - Q(x,t))^p}. \quad (57)
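For reference, minimal NumPy sketches of the pooling rules (53)-(57) are given below. Q denotes a 2D quality map and mu1 the co-located local luminance map of the reference frame; the absolute value in md_pool, which guards fractional powers of negative deviations, and the requirement b_s > 0 are assumptions of the sketch.

    import numpy as np

    def fns_pool(Q):
        # (53): five-number summary of the quality map.
        return np.mean([Q.min(), np.percentile(Q, 25), np.median(Q),
                        np.percentile(Q, 75), Q.max()])

    def md_pool(Q, p=2, o=1):
        # (54): mean-deviation pooling with exponents p and o.
        return np.mean(np.abs(Q - Q.mean()) ** p) ** (1.0 / (p * o))

    def lw_pool(Q, mu1, a_s, b_s):
        # (55)-(56): luminance-weighted pooling; requires b_s > 0.
        w = np.clip((mu1 - a_s) / b_s, 0.0, 1.0)
        return np.mean(w * Q)

    def dw_pool(Q, p):
        # (57): distortion-weighted pooling.
        w = (1.0 - Q) ** p
        return np.sum(w * Q) / np.sum(w)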


Likewise, DW-pooling may be applied to the time series of framewise quality scores to perform temporal DW-pooling. We tested DW-pooling using values of p = 1/8, 1/4, ..., 8 on all four databases, for both spatial and temporal pooling. While DW-pooling can lead to a considerable increase in performance, we also found that the optimal value of p varied significantly between databases. So, in the absence of a dataset that the user can use to select p reliably, we do not recommend using DW-pooling off the shelf. If a user does have a relevant dataset, then DW-pooling may be profitably applied. We refer the reader to Table 13 for detailed results.

F. MINKOWSKI POOLING
The Minkowski Pooling (Mink) method is a generalization of the arithmetic mean, which provides another way to assign additional weight to low quality scores. Because local quality scores can be negative, we pool the DSSIM scores. Given an exponent p, define the spatial Minkowski-pooled score as

S_{Mink}^{(p)}(t) = \frac{1}{MN} \sum_x (1 - Q(x,t))^p. \quad (58)

Once again, we tested values of p = 1/8, 1/4, ..., 8, omitting p = 1 since it is identical to the baseline mean pooling. Spatial Minkowski pooling provided an improvement in performance on the video databases, with p = 4 being a good choice of p across databases. However, as with DW-pooling, the optimal choice of p for temporal pooling varied significantly between databases, and any improvement in performance was modest. So, we do not recommend using temporal Minkowski pooling, unless a specific application-relevant dataset is available.

G. PERCENTILE POOLING
More sophisticated techniques have been proposed to spatially pool SSIM scores. In [69], the authors propose pooling SSIM scores by visual importance. The visual importance of distortions was measured in two ways: visual attention maps using the Gaze-Attentive Fixation Finding Engine (GAFFE) [70], and percentile pooling (PP) of quality scores. Because this guide is tailored towards the practical application of SSIM, we omit the additional computation of running GAFFE and focus only on PP.

The spatial PP method is specified by two parameters: p_s, the percentile of lower values to be modified, and r_s, the factor by which the lower p_s percentile is weighted. The idea of PP is to heavily weight the worst quality regions, which are likely to heavily bias the perception of quality. Define the lowest p_s percentile of values of the quality map Q(x, t) by perc(Q, p_s). The quality values are then re-weighted as

\tilde{Q}^{(r_s,p_s)}(x,t) = \begin{cases} Q(x,t)/r_s, & Q(x,t) \in perc(Q, p_s) \\ Q(x,t), & \text{otherwise} \end{cases} \quad (59)

The PP quality score is then defined as the average of the re-weighted quality values:

S_{PP}^{(r_s,p_s)}(t) = \frac{1}{MN} \sum_x \tilde{Q}^{(r_s,p_s)}(x,t). \quad (60)

Larger values of p_s penalize more low-quality values, which may dilute the distortion severity, while larger values of r_s weight the low quality regions more heavily.
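Sketches of (58)-(60) follow; the tie-handling at the percentile threshold is an implementation detail that we fix here by simple thresholding, which is an assumption of the sketch.

    import numpy as np

    def minkowski_pool(Q, p):
        # (58): Minkowski pooling of the DSSIM map.
        return np.mean((1.0 - Q) ** p)

    def percentile_pool(Q, p_s=6.0, r_s=4000.0):
        # (59)-(60): divide the lowest p_s percent of quality values by r_s,
        # then average. p_s = 6 and r_s = 4000 are the settings of [69].
        Qt = Q.astype(np.float64).copy()
        low = Qt <= np.percentile(Qt, p_s)
        Qt[low] /= r_s
        return Qt.mean()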


While the authors of [69] were circumspect regarding the value of percentile pooling, they recommended choosing p_s = 6 and r_s = 4000. However, on all four databases, we found that all choices of the parameters p_s and r_s performed worse than the baseline. This behavior is illustrated in Fig. 12 for values of p_s in the range [0, 25] (%) and r_s in the range [1, 5]. While the discrepancy between our results and those of [69] can be attributed to variations in the implementations, we found that the performance of percentile pooling was inferior across all the tested databases. So, we do not recommend spatial percentile pooling.

FIGURE 12. Spatial PP SROCC vs parameters p_s and r_s.

We also studied temporal PP of aggregated framewise quality scores by penalizing low-quality frames. Once again, to find the optimal choice of the temporal PP parameters p_t and r_t, we computed framewise SSIM scores, on which we tested values of p_t in the range [0, 25] (%) and r_t in the range [1, 5]. Fig. 13a and 13b plot the variation of performance (SROCC) with choices of p_t and r_t on the LIVE VQA and Netflix Public databases, respectively. The performances of the optimal PP algorithm and baseline SSIM are compared in Table 10.

FIGURE 13. Temporal PP SROCC vs parameters p_t and r_t.

TABLE 10. Performance of temporal PP SSIM.

From Table 10, it may be observed that temporal PP gave only a minor improvement in performance over baseline SSIM. From the accompanying plots, while there were general trends in performance with variations of each parameter, the prediction performance of the pooled models was sensitive to small perturbations of the parameters. Coupled with the fact that the observed increases in performance were small, it is difficult to reliably identify good choices of the parameters p_t and r_t. So, we do not recommend using percentile pooling for temporal aggregation either.

H. SPATIO-TEMPORAL SSIM
Efforts have also been made to create spatio-temporal versions of SSIM. In [71], the authors proposed a 3D spatio-temporal SSIM, and its motion-tuned extension. To avoid additional computation related to motion estimation, we consider only the SSIM-3D model, and replace the motion-tuned weighting function by a rectangular window. Thus, SSIM-3D is defined identically to SSIM, except that the local statistics - mean, standard deviation, and correlation - are computed on 3D spatio-temporal neighborhoods instead of 2D spatial neighborhoods. Similarly, define MS-SSIM-3D as 3D SSIM computed over multiple spatial scales. Because we choose rectangular windows, we used integral images to efficiently compute the local statistics.

We tested these 3-D variants of SSIM and MS-SSIM on the LIVE VQA and Netflix Public video databases. We used rectangular filters of size 11 × 11 × K_t, and investigated the variation of the algorithms' performance with K_t. The performances of the baseline frame-wise SSIM/MS-SSIM (K_t = 1) models and the best SSIM/MS-SSIM 3D models are listed in Table 11. The variation in performance of SSIM-3D and MS-SSIM-3D with K_t is plotted in Fig. 14.

FIGURE 14. Variation of SSIM and MS-SSIM 3D performance with Temporal Window Size K_t.

TABLE 11. Performance of 3D SSIM/MS-SSIM.

From these figures, performance increases for both SSIM-3D and MS-SSIM-3D, relative to the 2D frame-based versions, may be observed on the LIVE VQA database, with the improvement being much more pronounced in the case of single-scale SSIM. When tested on the Netflix Public database, the improvement was much lower. Again, the improvement was lower for MS-SSIM-3D than for SSIM-3D. From the plots, choosing K_t from the interval [3, 10] offers a solid improvement in performance. Another advantage of this approach is that, using small rectangular temporal windows, these performance increases can be obtained without any increase in computational complexity, by maintaining rolling sums over the last K_t frames. This can be achieved as below, using a buffer of K_t frames, leading to an O(MNK_t) memory complexity.

As in equations (24)-(28), we can calculate the local statistics from the analogously defined sums over 3D neighborhoods S_1^{(1)}(i,j,k), S_1^{(2)}(i,j,k), S_2^{(1)}(i,j,k), S_2^{(2)}(i,j,k), and S_{12}(i,j,k). As an illustrative example, consider

S_1^{(1)}(i,j,k) = \sum_{m=i-l+1}^{i} \sum_{n=j-l+1}^{j} \sum_{o=k-K_t+1}^{k} I_1(m,n,o). \quad (61)


TABLE 12. Comparing the performance of spatial pooling methods on image databases.

Defining the temporal sum

T_1^{(1)}(i,j,k) = \sum_{o=k-K_t+1}^{k} I_1(i,j,o), \quad (62)

we can rewrite (61) as

S_1^{(1)}(i,j,k) = \sum_{m=i-l+1}^{i} \sum_{n=j-l+1}^{j} T_1^{(1)}(m,n,k). \quad (63)

Knowing T_1^{(1)}(m,n,k), this sum can be computed efficiently using integral images, using equations (18)-(23). The temporal sum T_1^{(1)}(m,n,k) itself can be updated efficiently with each new frame, by observing that

T_1^{(1)}(i,j,k) = T_1^{(1)}(i,j,k-1) - I_1(i,j,k-K_t) + I_1(i,j,k). \quad (64)

In the same manner, we can also compute S_1^{(2)}(i,j,k), S_2^{(1)}(i,j,k), S_2^{(2)}(i,j,k), and S_{12}(i,j,k) efficiently. Combining these two methods, we can compute SSIM-3D in O(MN) time at each frame, irrespective of the temporal size of the window K_t.
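A minimal sketch of this bookkeeping is shown below; the class and function names are ours. Each incoming frame updates the temporal sum in O(MN) via (64), and (63) is then evaluated with a 2D integral image.

    import numpy as np
    from collections import deque

    class RollingTemporalSum:
        # Maintains T1(i, j, k) over the last Kt frames, as in (62) and (64).
        def __init__(self, Kt):
            self.Kt = Kt
            self.frames = deque()   # buffer of the last Kt frames
            self.T = None

        def update(self, frame):
            frame = frame.astype(np.float64)
            self.T = frame.copy() if self.T is None else self.T + frame
            self.frames.append(frame)
            if len(self.frames) > self.Kt:
                self.T -= self.frames.popleft()   # subtract I1(i, j, k - Kt)
            return self.T

    def window_sums(T, l):
        # (63): box sums of T over l x l spatial windows via an integral image.
        ii = np.pad(T, ((1, 0), (1, 0))).cumsum(axis=0).cumsum(axis=1)
        return ii[l:, l:] - ii[:-l, l:] - ii[l:, :-l] + ii[:-l, :-l]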
Motion vectors were also used to incorporate temporal information in [72], which proposed a Motion Compensated SSIM (MC-SSIM) algorithm. Motion vectors were used to find matching blocks in the reference and test video sequences at each temporal index. The SSIM scores between these matched blocks were then used to calculate a temporal quality score. The clear drawback of this method is the computation of motion vectors, which are expensive and may not be readily available.

This issue was addressed in [73], which proposed a spatio-temporal SSIM that does not use motion information. Instead, the authors visualize the video as a 3D volume in x, y, and t, where the frames lie in the x − y planes. The x − t and y − t planes contain both spatial and temporal information, and can also be compared using SSIM. The spatio-temporal SSIM (ST-SSIM) model is then defined as the average of the three SSIM values. Note that neighborhoods in the x − y, x − t, and y − t directions are special cases of the 3-D neighborhoods used in SSIM-3D (obtained by setting the size along one dimension to 1). So, while we discuss ST-SSIM for completeness, we did not include it in our experiments.

I. FINAL RESULTS
Here, we provide comprehensive results of our experiments with the various spatial and temporal pooling algorithms described above. In all cases, we refer to each pooling method by the abbreviations listed above. To include information about the choice of optimal hyperparameters, we added superscripts to the abbreviated algorithm names. So, we denote Mean-Deviation Pooling by MD^(p,o), Distortion-Weighted Pooling by DW^(p), Minkowski Pooling by Mink^(p), and the Windowed AM, GM, and HM algorithms by W-AM^(k), W-GM^(k), and W-HM^(k), respectively, where k denotes the window size. Once again, we linearized the pooled SSIM scores by fitting them with the 5PL function in (17), and report performance in terms of the PCC, SROCC, and RMSE values.

The performances of the various spatial pooling methods on the two image databases are tabulated in Table 12. On the video databases, we tested all pairs of spatial and temporal pooling methods, and these results are given in Table 13. In these tables, the columns represent the choice of spatial pooling (SP) method, while the rows represent the choice of temporal pooling (TP) method.

In Table 12, the best performing spatial pooling method is boldfaced. We found that CoV pooling performed best on the challenging TID 2013 database, while the baseline Mean SSIM method performed best on the LIVE IQA database, with CoV pooling a close second.

In Table 13, we boldfaced the five best results in each sub-table. It may be observed that MD pooling and CoV pooling performed best among the spatial pooling methods, while large windowed means performed well among the temporal pooling methods, significantly outperforming the spatio-temporal SSIM-3D and MS-SSIM-3D algorithms. However, as discussed earlier, using windowed means requires large windows (k ≈ 50, 80) while providing only a minor performance improvement over the baseline. In addition, because CoV pooling performed consistently well across all databases and does not have any hyperparameters to tune, we recommend using spatial CoV pooling on picture or video frame quality maps, together with standard arithmetic mean pooling of the temporal frame SSIM scores. This quality aggregation method is identical to the one used in MOVIE.


TABLE 13. Comparing the performance of pairs of spatial and temporal pooling methods on video databases.

TABLE 14. Comparing the performance of spatial pooling methods on compressed images.

As in previous sections, we repeated the experiments on compression-distorted data, and report the results in Tables 14 and 15. Even when we restricted the distortion types, we did not obtain concordant values of the hyperparameters across the video databases.

Based on our recommendations, we propose a variant of SSIM, called ''Enhanced SSIM,'' which we are making publicly available to the community as a command line tool. The specifications of Enhanced SSIM are as follows.
1) It operates only on the luminance channel.
2) It uses rectangular windows to calculate local statistics, with a default size of 11 × 11. These rectangular windows are implemented using integral images.
3) Local quality scores are computed with a stride of 5.
4) The input image is down-sampled by a factor inferred from the value of D/H, using a default ratio of 3.0, corresponding to a typical D/H ratio for TV viewing.
5) Coefficient of Variation pooling is used to spatially aggregate the local quality scores.


TABLE 15. Comparing the SROCC achieved by pooling methods on compressed LIVE VQA videos.

TABLE 16. Performance of Enhanced SSIM.

We compare the performance of our implementation with LIBVMAF in Table 16, since LIBVMAF was the best performing implementation in Section IV. This table highlights the computational and performance benefits of using our recommendations. Once again, ''(Comp)'' refers to experiments conducted on the compression data from each database.
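The released tool is the reference implementation; the following single-frame sketch merely illustrates how items 1)-5) fit together, using integral-image box sums, a stride of 5, the usual SSIM stabilizing constants (an assumption of the sketch), and CoV spatial pooling. The viewing-distance down-sampling of item 4) is omitted here, and frame scores would be averaged over time.

    import numpy as np

    def integral(x):
        return np.pad(x, ((1, 0), (1, 0))).cumsum(axis=0).cumsum(axis=1)

    def box(ii, w):
        return ii[w:, w:] - ii[:-w, w:] - ii[w:, :-w] + ii[:-w, :-w]

    def enhanced_ssim_frame(ref, dis, win=11, stride=5):
        # ref, dis: 2D luma arrays (item 1); 11 x 11 rectangular windows
        # computed with integral images (item 2).
        C1, C2 = (0.01 * 255) ** 2, (0.03 * 255) ** 2
        ref, dis = ref.astype(np.float64), dis.astype(np.float64)
        n = win * win
        mu1, mu2 = box(integral(ref), win) / n, box(integral(dis), win) / n
        s11 = box(integral(ref * ref), win) / n - mu1 ** 2
        s22 = box(integral(dis * dis), win) / n - mu2 ** 2
        s12 = box(integral(ref * dis), win) / n - mu1 * mu2
        ssim_map = ((2 * mu1 * mu2 + C1) * (2 * s12 + C2)
                    / ((mu1 ** 2 + mu2 ** 2 + C1) * (s11 + s22 + C2)))
        ssim_map = ssim_map[::stride, ::stride]       # item 3: stride of 5
        return ssim_map.std() / ssim_map.mean()       # item 5: CoV pooling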

XI. CONCLUSION
In this guide, we detailed the results of a series of experiments we conducted towards determining optimal design choices when deploying SSIM. We first evaluated the off-the-shelf performance and efficiency of several public implementations of SSIM and MS-SSIM, from which we identified a set of Pareto-optimal implementations. Using these results, we also described a method to improve the computational efficiency of SSIM using integral images. We then reviewed a method, called Scaled SSIM, to improve the efficiency of computing SSIM across resolutions when conducting RDO in encoding pipelines. Following this, we reviewed the dependence of SSIM performance on the viewing device, where we discussed improvements to SSIM which account for viewing distance and screen size. We then investigated the dependence of SSIM on the choice of window, where we conducted extensive experiments to identify good choices for the size and type of window function (rectangular windows of size 15–20), thereby validating some observations we made when testing public SSIM implementations.

Due to the non-linear nature of SSIM, it is crucial to develop a good mapping function from SSIM scores to subjective scores. We tested a popular choice of such a mapping, the five-parameter logistic function, and demonstrated its generalizability. Further, while the baseline SSIM model is defined on two luminance images, most practical applications involve media having color. To account for this, we reviewed several Color SSIM models and compared their performance, finding that Quaternion SSIM was a consistently good choice. Finally, we performed a comprehensive evaluation of the spatial and temporal aggregation methods used to deploy SSIM on videos. Based on these results, we recommended using spatial CoV pooling and temporal arithmetic mean pooling of framewise SSIM scores.

In all, we have conducted a comprehensive study of the many design choices involved when implementing SSIM, and have made recommendations on best practices. In addition, we have incorporated these recommendations into a variant of SSIM which we call Enhanced SSIM, for which we provide an openware command line tool for use by video quality engineers in academia and industry.

REFERENCES
[1] Global Internet Phenomena Report. Accessed: 2019. [Online]. Available: https://www.sandvine.com/global-internet-phenomena-report-2019
[2] I. DIS, ''10918-1. Digital compression and coding of continuous-tone still images,'' CCITT Recommendation T, vol. 81, p. 6, Feb. 1991.
[3] W. B. Pennebaker and J. L. Mitchell, JPEG: Still Image Data Compression Standard. New York, NY, USA: Springer, 1992.
[4] Netflix. AVIF for Next-Generation Image Coding. Accessed: Dec. 15, 2020. [Online]. Available: https://netflixtechblog.com/avif-for-next-generation-image-coding-b1d75675fe4
[5] N. Barman and M. G. Martini, ''An evaluation of the next-generation image coding standard AVIF,'' in Proc. 12th Int. Conf. Qual. Multimedia Exper. (QoMEX), May 2020, pp. 1–4.
[6] J. Lainema, M. M. Hannuksela, V. K. M. Vadakital, and E. B. Aksu, ''HEVC still image coding and high efficiency image file format,'' in Proc. IEEE Int. Conf. Image Process. (ICIP), Sep. 2016, pp. 71–75.
[7] T. Wiegand, Draft ITU-T Recommendation and Final Draft International Standard of Joint Video Specification, document ITU-T Rec. H.264, ISO/IEC 14496-10 AVC, JVT-G050, 2003.
[8] I. E. Richardson, The H.264 Advanced Video Compression Standard. Hoboken, NJ, USA: Wiley, 2011.
[9] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, ''Overview of the high efficiency video coding (HEVC) standard,'' IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1649–1668, Dec. 2012.
[10] K. Choi, J. Chen, D. Rusanovskyy, K.-P. Choi, and E. S. Jang, ''An overview of the MPEG-5 essential video coding standard [standards in a nutshell],'' IEEE Signal Process. Mag., vol. 37, no. 3, pp. 160–167, May 2020.
[11] D. Mukherjee, J. Bankoski, A. Grange, J. Han, J. Koleszar, P. Wilkins, Y. Xu, and R. Bultje, ''The latest open-source video codec VP9—An overview and preliminary results,'' in Proc. Picture Coding Symp. (PCS), Dec. 2013, pp. 390–393.
[12] Y. Chen et al., ''An overview of core coding tools in the AV1 video codec,'' in Proc. Picture Coding Symp. (PCS), Jun. 2018, pp. 41–45.
[13] F. Kossentini, H. Guermazi, N. Mahdi, C. Nouira, A. Naghdinezhad, H. Tmar, O. Khlif, P. Worth, and F. Ben Amara, ''The SVT-AV1 encoder: Overview, features and speed-quality tradeoffs,'' in Proc. 43rd Appl. Digit. Image Process., Int. Soc. Opt. Photon., A. G. Tescher and T. Ebrahimi, Eds., vol. 11510. Bellingham, WA, USA: SPIE, Aug. 2020, pp. 469–490, doi: 10.1117/12.2569270.
[14] D. Ghadiyaram and A. C. Bovik, ''Massive online crowdsourced study of subjective and objective picture quality,'' IEEE Trans. Image Process., vol. 25, no. 1, pp. 372–387, Jan. 2016.
[15] Z. Tu, Y. Wang, N. Birkbeck, B. Adsumilli, and A. C. Bovik, ''UGC-VQA: Benchmarking blind video quality assessment for user generated content,'' 2020, arXiv:2005.14354. [Online]. Available: http://arxiv.org/abs/2005.14354


[16] I. Katsavounidis, ''Dynamic optimizer—A perceptual video encoding optimization framework,'' Netflix Tech Blog, 2018.
[17] Z. Wang and A. C. Bovik, ''Mean squared error: Love it or leave it? A new look at signal fidelity measures,'' IEEE Signal Process. Mag., vol. 26, no. 1, pp. 98–117, Jan. 2009.
[18] Z. Wang and A. C. Bovik, ''A universal image quality index,'' IEEE Signal Process. Lett., vol. 9, no. 3, pp. 81–84, Mar. 2002.
[19] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, ''Image quality assessment: From error visibility to structural similarity,'' IEEE Trans. Image Process., vol. 13, no. 4, pp. 600–612, Apr. 2004.
[20] Z. Wang, E. P. Simoncelli, and A. C. Bovik, ''Multiscale structural similarity for image quality assessment,'' in Proc. 37th Asilomar Conf. Signals, Syst. Comput., vol. 2, 2003, pp. 1398–1402.
[21] Z. Wang, L. Lu, and A. C. Bovik, ''Video quality assessment based on structural distortion measurement,'' Signal Process., Image Commun., vol. 19, no. 2, pp. 121–132, Feb. 2004. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0923596503000766
[22] K. Seshadrinathan and A. C. Bovik, ''A structural similarity metric for video based on motion models,'' in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), vol. 1, Apr. 2007, pp. I-869–I-872.
[23] K. Seshadrinathan and A. C. Bovik, ''Motion tuned spatio-temporal quality assessment of natural videos,'' IEEE Trans. Image Process., vol. 19, no. 2, pp. 335–350, Feb. 2010.
[24] M. J. Wainwright and E. P. Simoncelli, ''Scale mixtures of Gaussians and the statistics of natural images,'' in Proc. 12th Int. Conf. Neural Inf. Process. Syst. (NIPS). Cambridge, MA, USA: MIT Press, 1999, pp. 855–861.
[25] Z. Wang and A. Bovik, ''Reduced- and no-reference image quality assessment,'' IEEE Signal Process. Mag., vol. 28, no. 6, pp. 29–40, Nov. 2011.
[26] H. R. Sheikh and A. C. Bovik, ''Image information and visual quality,'' IEEE Trans. Image Process., vol. 15, no. 2, pp. 430–444, Feb. 2006.
[27] R. Soundararajan and A. C. Bovik, ''Video quality assessment by reduced reference spatio-temporal entropic differencing,'' IEEE Trans. Circuits Syst. Video Technol., vol. 23, no. 4, pp. 684–694, Apr. 2013.
[28] C. G. Bampis, P. Gupta, R. Soundararajan, and A. C. Bovik, ''SpEED-QA: Spatial efficient entropic differencing for image and video quality,'' IEEE Signal Process. Lett., vol. 24, no. 9, pp. 1333–1337, Sep. 2017.
[29] Z. Li, A. Aaron, I. Katsavounidis, A. Moorthy, and M. Manohara, ''Toward a practical perceptual video quality metric,'' Netflix Tech Blog, vol. 6, p. 2, Jun. 2016.
[30] K. Seshadrinathan and A. C. Bovik, ''Unifying analysis of full reference image quality assessment,'' in Proc. 15th IEEE Int. Conf. Image Process., Oct. 2008, pp. 1200–1203.
[31] M. A. Saad, A. C. Bovik, and C. Charrier, ''Blind image quality assessment: A natural scene statistics approach in the DCT domain,'' IEEE Trans. Image Process., vol. 21, no. 8, pp. 3339–3352, Aug. 2012.
[32] A. K. Moorthy and A. C. Bovik, ''Blind image quality assessment: From natural scene statistics to perceptual quality,'' IEEE Trans. Image Process., vol. 20, no. 12, pp. 3350–3364, Dec. 2011.
[33] A. Mittal, A. K. Moorthy, and A. C. Bovik, ''No-reference image quality assessment in the spatial domain,'' IEEE Trans. Image Process., vol. 21, no. 12, pp. 4695–4708, Dec. 2012.
[34] A. Mittal, R. Soundararajan, and A. C. Bovik, ''Making a 'completely blind' image quality analyzer,'' IEEE Signal Process. Lett., vol. 20, no. 3, pp. 209–212, Nov. 2013.
[35] A. Bovik, ''Assessing quality of images or videos using a two-stage quality assessment,'' U.S. Patent 10,529,066, Jan. 7, 2020.
[36] X. Yu, C. G. Bampis, P. Gupta, and A. C. Bovik, ''Predicting the quality of images compressed after distortion in two steps,'' IEEE Trans. Image Process., vol. 28, no. 12, pp. 5757–5770, Dec. 2019.
[37] H. R. Sheikh. Image and Video Quality Assessment Research at LIVE. Accessed: 2003. [Online]. Available: http://live.ece.utexas.edu/research/quality
[38] H. R. Sheikh, M. F. Sabir, and A. C. Bovik, ''A statistical evaluation of recent full reference image quality assessment algorithms,'' IEEE Trans. Image Process., vol. 15, no. 11, pp. 3440–3451, Nov. 2006.
[39] N. Ponomarenko, L. Jin, O. Ieremeiev, V. Lukin, K. Egiazarian, J. Astola, B. Vozel, K. Chehdi, M. Carli, F. Battisti, and C.-C. J. Kuo, ''Image database TID2013: Peculiarities, results and perspectives,'' Signal Process., Image Commun., vol. 30, pp. 57–77, Jan. 2015. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0923596514001490
[40] K. Seshadrinathan, R. Soundararajan, A. C. Bovik, and L. K. Cormack, ''Study of subjective and objective quality assessment of video,'' IEEE Trans. Image Process., vol. 19, no. 6, pp. 1427–1441, Jun. 2010.
[41] FFmpeg. Accessed: Dec. 15, 2020. [Online]. Available: https://ffmpeg.org/
[42] A. Tourapis and D. Singer, HDRTools: A Software Package for Video Processing and Analysis, Standard ISO/IEC JTC1/SC29/WG11 MPEG2014/m35156, 2014.
[43] T. J. Daede, N. E. Egge, J. Valin, G. Martres, and T. B. Terriberry, ''Daala: A perceptually-driven next generation video codec,'' CoRR, vol. abs/1603.03129, pp. 1–10, Mar. 2016. [Online]. Available: http://arxiv.org/abs/1603.03129
[44] S. van der Walt, J. L. Schönberger, J. Nunez-Iglesias, F. Boulogne, J. D. Warner, N. Yager, E. Gouillart, and T. Yu, ''Scikit-image: Image processing in Python,'' PeerJ, vol. 2, p. e453, Jun. 2014.
[45] Scikit-Video. [Online]. Available: http://www.scikit-video.org/
[46] M. Abadi et al. (2015). TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. [Online]. Available: https://www.tensorflow.org/
[47] P. G. Gottschalk and J. R. Dunn, ''The five-parameter logistic: A characterization and comparison with the four-parameter logistic,'' Anal. Biochemistry, vol. 343, no. 1, pp. 54–65, Aug. 2005.
[48] F. C. Crow, ''Summed-area tables for texture mapping,'' SIGGRAPH Comput. Graph., vol. 18, no. 3, pp. 207–212, Jan. 1984. [Online]. Available: https://doi.org/10.1145/964965.808600
[49] P. Viola and M. J. Jones, ''Robust real-time face detection,'' Int. J. Comput. Vis., vol. 57, no. 2, pp. 137–154, May 2004.
[50] A. K. Venkataramanan, C. Wu, and A. C. Bovik, ''Optimizing video quality estimation across resolutions,'' in Proc. IEEE 22nd Int. Workshop Multimedia Signal Process. (MMSP), Sep. 2020, pp. 1–5.
[51] R. Schatz and S. Egger, ''On the impact of terminal performance and screen size on QoE,'' in Proc. ETSI Workshop Sel. Items Telecommun. Qual. Matters, 2012, pp. 1–26.
[52] M. E. Grabe, M. Lombard, R. D. Reich, C. C. Bracken, and T. B. Ditton, ''The role of screen size in viewer experiences of media content,'' Vis. Commun. Quart., vol. 6, no. 2, pp. 4–9, Mar. 1999.
[53] F. W. Campbell and J. Robson, ''Application of Fourier analysis to the visibility of gratings,'' J. Physiol., vol. 197, no. 3, p. 551, 1968.
[54] S. Winkler, ''Issues in vision modeling for perceptual video quality assessment,'' Signal Process., vol. 78, no. 2, pp. 231–252, Oct. 1999.
[55] Methodologies for the Subjective Assessment of the Quality of Television Images, document Recommendation ITU-R BT.500-14 (10/2019), ITU, Geneva, Switzerland, 2020.
[56] K. Gu, G. Zhai, X. Yang, and W. Zhang, ''Self-adaptive scale transform for IQA metric,'' in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), May 2013, pp. 2365–2368.
[57] K. Gu, G. Zhai, M. Liu, Q. Xu, X. Yang, J. Zhou, and W. Zhang, ''Adaptive high-frequency clipping for improved image quality assessment,'' in Proc. Vis. Commun. Image Process. (VCIP), Nov. 2013, pp. 1–5.
[58] K. Gu, M. Liu, G. Zhai, X. Yang, and W. Zhang, ''Quality assessment considering viewing distance and image resolution,'' IEEE Trans. Broadcast., vol. 61, no. 3, pp. 520–531, Sep. 2015.
[59] G. Cermak, M. Pinson, and S. Wolf, ''The relationship among video quality, screen resolution, and bit rate,'' IEEE Trans. Broadcast., vol. 57, no. 2, pp. 258–262, Jun. 2011.
[60] W. Zou, J. Song, and F. Yang, ''Perceived image quality on mobile phones with different screen resolution,'' Mobile Inf. Syst., vol. 2016, pp. 1–17, Mar. 2016.
[61] M. Hassan and C. Bhagvati, ''Structural similarity measure for color images,'' Int. J. Comput. Appl., vol. 43, no. 14, pp. 7–12, Apr. 2012.
[62] C. I. De L'Eclairage, Recommendations on Uniform Color Spaces, Color-Difference Equations, Psychometric Color Terms. Paris, France: CIE, 1978.
[63] M. A. Hassan and M. S. Bashraheel, ''Color-based structural similarity image quality assessment,'' in Proc. 8th Int. Conf. Inf. Technol. (ICIT), May 2017, pp. 691–696.
[64] Y. Shi, Y. Ding, R. Zhang, and J. Li, ''Structure and hue similarity for color image quality assessment,'' in Proc. Int. Conf. Electron. Comput. Technol., Feb. 2009, pp. 329–333.
[65] A. Kolaman and O. Yadid-Pecht, ''Quaternion structural similarity: A new quality index for color images,'' IEEE Trans. Image Process., vol. 21, no. 4, pp. 1526–1536, Apr. 2012.
[66] M. Seufert, M. Slanina, S. Egger, and M. Kottkamp, '''To pool or not to pool': A comparison of temporal pooling methods for HTTP adaptive video streaming,'' in Proc. 5th Int. Workshop Qual. Multimedia Exper. (QoMEX), 2013, pp. 52–57.
[67] C. G. Zewdie, M. Pedersen, and Z. Wang, ''A new pooling strategy for image quality metrics: Five number summary,'' in Proc. 5th Eur. Workshop Vis. Inf. Process. (EUVIP), Dec. 2014, pp. 1–6.


[68] H. Ziaei Nafchi, A. Shahkolaei, R. Hedjam, and M. Cheriet, ''Mean deviation similarity index: Efficient and reliable full-reference image quality evaluator,'' IEEE Access, vol. 4, pp. 5579–5590, 2016.
[69] A. K. Moorthy and A. C. Bovik, ''Visual importance pooling for image quality assessment,'' IEEE J. Sel. Topics Signal Process., vol. 3, no. 2, pp. 193–201, Apr. 2009.
[70] U. Rajashekar, I. van der Linde, A. C. Bovik, and L. K. Cormack, ''GAFFE: A gaze-attentive fixation finding engine,'' IEEE Trans. Image Process., vol. 17, no. 4, pp. 564–573, Apr. 2008.
[71] A. K. Moorthy and A. C. Bovik, ''Efficient motion weighted spatio-temporal video SSIM index,'' in Proc. 15th Human Vis. Electron. Imag., Int. Soc. Opt. Photon., vol. 7527, B. E. Rogowitz and T. N. Pappas, Eds. Bellingham, WA, USA: SPIE, 2010, pp. 440–448, doi: 10.1117/12.844198.
[72] A. K. Moorthy and A. C. Bovik, ''Efficient video quality assessment along temporal trajectories,'' IEEE Trans. Circuits Syst. Video Technol., vol. 20, no. 11, pp. 1653–1658, Nov. 2010.
[73] Y. Wang, T. Jiang, S. Ma, and W. Gao, ''Spatio-temporal SSIM index for video quality assessment,'' in Proc. Vis. Commun. Image Process., Nov. 2012, pp. 1–6.

ABHINAU K. VENKATARAMANAN received the B.Tech. degree in electrical engineering from the Indian Institute of Technology Hyderabad, Hyderabad, India, in 2019. He is currently pursuing the M.S. and Ph.D. degrees in electrical and computer engineering with The University of Texas at Austin, Austin, TX, USA. In the summer of 2018, he was a Summer Research Intern with the Robotics Institute, Carnegie Mellon University, as an SN Bose Scholar, where he worked on biologically-inspired reinforcement learning. He is currently a Graduate Research Assistant with the Laboratory for Image and Video Engineering, The University of Texas at Austin. His research interests include image and video quality assessment, perceptual optimization, deep learning, and reinforcement learning. He was a recipient of the SN Bose Scholarship (Indo-US Science and Technology Foundation) and the KVPY Fellowship (Department of Science and Technology, Government of India). He was also awarded the Institute Silver Medal by the Indian Institute of Technology Hyderabad, for securing the first rank in his department during his undergraduate studies.

CHENGYANG WU received the B.Eng. degree in electrical engineering from Shanghai Jiao Tong University, in 2019. He is currently pursuing the Ph.D. degree in electrical and computer engineering with The University of Texas at Austin, Austin, TX, USA. He is also a Graduate Research Assistant with the Laboratory for Image and Video Engineering, The University of Texas at Austin. His research interests include image and video quality assessment, no-reference methods, machine learning, and computer vision.

ALAN C. BOVIK (Fellow, IEEE) is currently the Cockrell Family Regents Endowed Chair Professor with The University of Texas at Austin. His research interests include image processing, digital photography, digital television, digital streaming video, and visual perception. For his work in these areas, he was a recipient of the 2019 Progress Medal from The Royal Photographic Society, the 2019 IEEE Fourier Award, the 2017 Edwin H. Land Medal from The Optical Society, a 2015 Primetime Emmy Award for Outstanding Achievement in Engineering Development from the Television Academy, and the Norbert Wiener Society Award and the Karl Friedrich Gauss Education Award from the IEEE Signal Processing Society. He received about ten best journal paper awards, including the 2016 IEEE Signal Processing Society Sustained Impact Award. His books include The Essential Guides to Image and Video Processing. He co-founded and was the longest-serving Editor-in-Chief of the IEEE TRANSACTIONS ON IMAGE PROCESSING, and also created/chaired the IEEE International Conference on Image Processing, which was first held in Austin, TX, USA, in 1994.

IOANNIS KATSAVOUNIDIS received the B.S./M.S. degree from the Aristotle University of Thessaloniki, Greece, in 1991, and the Ph.D. degree from the University of Southern California, in 1998, all in electrical engineering. He was an Engineer with Caltech's Physics Department, where he worked on the MACRO high-energy astrophysics experiment in Italy, from 1996 to 2000. He was a Director of Software with InterVideo, Fremont, CA, USA, where he worked on advanced video codecs, developing technologies around error resilience, and optimizing video encoding and decoding, from 2000 to 2006. In 2007, he cofounded Cidana, a mobile multimedia software company, and served as its CTO in Shanghai, China. From 2008 to 2015, he was an Associate Professor with the Electrical and Computer Engineering Department, University of Thessaly, Greece, where he taught and did research in signal, image, and video processing, as well as information theory. From 2015 to 2018, he was a Senior Research Scientist with Netflix, part of the Encoding Technologies team, where he worked on video quality metrics and optimization problems, contributing to the development and popularization of VMAF, and inventing the Dynamic Optimizer perceptual quality optimization framework. Since 2018, he has been a Research Scientist with Facebook's Video Infrastructure team, working on large scale video quality and video encoding optimization problems that involve HW and SW components. He is actively involved with the Alliance for Open Media (AOM) on next-generation royalty-free codec development, and with the Video Quality Experts Group (VQEG) in multiple projects around video quality. He has over 100 publications in multiple journals and conferences, and over 40 patents.

ZAFAR SHAHID received the B.S. degree from the University of Engineering and Technology Lahore, Pakistan, in 2001, the M.S. degree from the Institut National des Sciences Appliquées (INSA) de Lyon, France, in 2007, and the Ph.D. degree from the University of Montpellier, France, in 2010. His Ph.D. was funded by the Centre National de la Recherche Scientifique (CNRS), France. From 2001 to 2006, he was a Video Software Engineer with Streaming Networks, where he worked on real-time multimedia products. He was involved in video codec optimization for the Philips VLIW processor TriMedia, and developed OEM/ODM products for this processor. In 2011, he worked with IRCCyN Labs, Nantes, France, on QoE for scalable video codecs, and designed drift-free bit stream watermarking of H.264/AVC. From 2011 to 2015, he worked as a Media Encoding Architect for multiple start-ups, designing both IPTV and 50 ms end-to-end latency video pipelines for cloud gaming and tele-health. His video pipeline was used to stream Super Bowl 2014 to more than 500K viewers simultaneously on desktop, iOS, and Android. From 2015 to 2016, he worked on cloud products for tone-mapping and metadata processing of HDR content. After this, he worked in the GameStream team of Nvidia from 2016 to 2019. During this period, he worked on streaming of 4K video games from the cloud, and from GPU-enabled gaming desktops at home to Tegra-powered Shield devices and Win/Mac clients. Since 2019, he has been working with Facebook on video codec optimization and quality problems at scale, involving both software and ASIC. He has more than 40 publications and three patents.
