On Evaluating Perceptual Quality of Online User-Generated Videos

Soobeom Jang, and Jong-Seok Lee, Senior Member, IEEE

Abstract—This paper deals with the issue of the perceptual quality evaluation of user-generated videos shared online, which is an important step toward designing video-sharing services that maximize users' satisfaction in terms of quality. We first analyze viewers' quality perception patterns by applying graph analysis techniques to subjective rating data. We then examine the performance of existing state-of-the-art objective metrics for the quality estimation of user-generated videos. In addition, we investigate the feasibility of metadata accompanied with videos in online video-sharing services for quality estimation. Finally, various issues in the quality assessment of online user-generated videos are discussed, including difficulties and opportunities.

Index Terms—User-generated video, paired comparison, quality assessment, metadata.

Manuscript received April 19, 2015; revised February 17, 2016. This research was supported by the MSIP (Ministry of Science, ICT and Future Planning), Korea, under the "IT Consilience Creative Program" (IITP-2015-R0346-15-1008) supervised by the IITP (Institute for Information & Communications Technology Promotion). The authors are with the School of Integrated Technology and Yonsei Convergence Institute of Technology, Yonsei University, Incheon, 21983, Korea. Phone: +82-32-749-5846. FAX: +82-32-818-5801. E-mail: [email protected]; [email protected]. © 2016 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. This is the author's version of an article that has been published in this journal (DOI: 10.1109/TMM.2016.2581582). Changes were made to this version by the publisher prior to publication. The final version of record is available at https://doi.org/10.1109/TMM.2016.2581582.

I. INTRODUCTION

WITH the advances of imaging, communications, and internet technologies, public online video-sharing services (e.g., YouTube, Vimeo) have become popular. In such services, a wide range of video content from user-generated amateur videos to professionally produced videos, such as movie trailers and music videos, is uploaded and shared. Today, online video sharing has become one of the most important media for producing and consuming multimedia content for various purposes, such as fun, information exchange, and promotion.

For the success of video-sharing services, it is important to consider users' quality of experience (QoE) regarding shared content, as in many other multimedia services. As the first step of maximizing QoE, it is necessary to measure the perceptual quality of the online videos. The quality information of videos can be used for valuable service components such as automatic quality adjustment, streaming quality enhancement, and quality-based video recommendation. The most accurate way to measure perceptual video quality is to conduct subjective quality assessment by employing multiple human subjects. However, subjective quality assessment is not feasible for online videos because of the tremendous number of videos in online video-sharing services. An alternative is objective quality assessment, which uses a model that mimics the human perceptual mechanism.

The traditional objective quality metrics to estimate video quality have two limitations. First, the existence of the reference video, i.e., the pristine video of the given video, is important. In general, objective quality assessment frameworks are classified into three groups: full-reference (FR), reduced-reference (RR), and no-reference (NR). In the cases of FR and RR frameworks, full or partial information about the reference is provided. On the other hand, NR quality assessment does not use any prior information about the reference video, which makes the problem more complicated. In fact, the accuracy of NR objective metrics is usually lower than that of FR and RR metrics [1]. Second, the types of degradation that are dealt with are rather limited. Video quality is affected by a large number of factors, for which the human perceptual mechanism varies significantly. Because of this variability, it is too complicated to consider all different video quality factors in a single objective quality metric. Hence, existing objective quality metrics have considered only a single or a small number of major quality factors involved in production and distribution, such as compression artifacts, packet loss artifacts, and random noise, assuming that the original video has perfect quality. This approach has been successful for professionally produced videos.

However, it is doubtful whether the current state-of-the-art approaches for estimating video quality are also suitable for online videos. First, for a given online video, the corresponding reference video is not available in most cases, so NR assessment is the only option for objective quality evaluation. The performance of existing NR metrics is still unsatisfactory, which makes the quality assessment of online videos very challenging. Second, online video-sharing services cover an extremely wide range of videos. There are two types of videos in online video-sharing services: professional and amateur. Professional videos, which are typically created by professional video makers, and amateur videos, which are created by general users, are significantly different in various aspects such as content and production and editing styles. In particular, user-generated videos have large variations in these characteristics, so they have wide ranges of popularity, user preference, and quality. Moreover, diverse quality factors are involved in online user-generated videos (see Section II for further details). However, existing NR metrics have been developed to work only for certain types of distortion due to compression, transmission error, random noise, etc. Therefore, it is not guaranteed that the existing NR metrics will perform well on those videos.

Online videos are usually accompanied with additional information, called metadata, including the title, description, viewcount, rating (e.g., like and dislike), and comments. Some of the metadata of an online video (e.g., the spatial resolution and title) reflect the characteristics of the video signal itself, while other metadata, including the viewcount or comments, provide information about the popularity of and users' preference for the video. These types of information have the potential to be used as hints about the quality of the video because the quality is one of the factors that affects the perceptual preference of viewers. Therefore, they can be useful for the quality assessment of online user-generated videos by replacing or being used in combination with objective quality metrics.

This paper deals with the issue of evaluating the perceptual quality of online user-generated videos. The research questions considered are:
• Are there any noteworthy patterns regarding viewers' judgment of the relative quality of user-generated videos?
• How well do existing state-of-the-art NR objective quality metrics perform for user-generated videos?
• To what extent are metadata-based metrics useful for the perceptual quality estimation of user-generated videos?
• What makes the signal-based or metadata-based quality estimation of user-generated videos difficult?

To the best of our knowledge, our work is the first attempt to investigate the issue of the perceptual quality assessment of online user-generated videos comprehensively in various aspects. Our contributions can be summarized as follows. First, by examining subjective ratings gathered by crowdsourcing for online user-generated videos, we investigate the viewers' patterns of perceptual quality evaluation. Second, we analyze the performance of state-of-the-art NR quality assessment algorithms, metadata-driven features, and their combination in perceptual quality estimation. The study examines the efficacy and limitations of the signal-based and metadata-based methods. Finally, based on the experimental results, various issues in the quality assessment of online user-generated videos are discussed in detail. We comment on the difficulties and limitations of the quality assessment of user-generated videos in general and provide particular examples demonstrating such difficulties and limitations, helping us understand better the nature of the quality assessment of online videos.

The rest of the paper is organized as follows. Section II describes the background of this study, i.e., visual quality assessment, characteristics of online videos, and previous approaches to the quality assessment of online videos. Section III introduces the dataset used in our study, including video data and subjective data. In Section IV, patterns of user perception of online videos are examined via graph analysis. Section V presents the results of quality estimation using NR quality assessment algorithms and metadata. Section VI discusses issues of the quality assessment of online user-generated videos. Finally, Section VII concludes the paper.

II. BACKGROUNDS

A. Visual Quality Assessment

The overall QoE of a video service highly depends on the perceptual visual quality of the videos provided by the service. One way to score the quality of videos is to have the videos evaluated by human subjects, which is called subjective quality assessment. For many practical multimedia applications, quality assessment with human subjects is not applicable due to the cost and real-time operation constraints. To deal with this, research has been conducted to develop automatic algorithms that mimic the human perceptual mechanism, which is called objective quality assessment.

Objective quality assessment metrics are classified into three categories: FR, RR, and NR metrics. FR quality assessment uses the entire reference video, which is the original signal without any distortion or quality degradation. Structural similarity (SSIM) [2], multi-scale SSIM (MS-SSIM) [3], most apparent distortion (MAD) [4], and visual information fidelity (VIF) [5] are well-known FR quality metrics for images, and motion-based video integrity evaluation (MOVIE) [6] and spatiotemporal MAD (ST-MAD) [7] are FR metrics for videos. RR metrics do not need the whole reference signal, but use its partial information. Reduced-reference entropic differencing (RRED) [8] for images and the video quality metric (VQM) [9] for videos are examples of RR quality metrics.

A challenging situation of objective quality assessment is when there is no reference for the given signal being assessed. Estimating quality from only the given image or video itself is hard, since no prior knowledge of the reference can be utilized. Currently available NR metrics include the blind image integrity notator using discrete cosine transform statistics (BLIINDS-II) [10], the blind/referenceless image spatial quality evaluator (BRISQUE) [11], and the Video BLIINDS (V-BLIINDS) [1]. These metrics typically use natural scene statistics (NSS) as prior knowledge of images and videos, and the main difference among them lies in how to obtain information about NSS. BLIINDS-II constructs NSS models from the probability distribution of discrete cosine transform (DCT) coefficients extracted from macroblocks. BRISQUE uses mean-subtracted contrast-normalized (MSCN) coefficients rather than transform domain coefficients to speed up the quality assessment process. V-BLIINDS extracts NSS features based on a DCT-based NSS model as in BLIINDS-II, but uses the frame difference to obtain the spatiotemporal information of the video. Additionally, V-BLIINDS uses motion estimation techniques to examine motion consistency in the video.
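To make the MSCN idea concrete, the following is a minimal sketch (our illustration, not the authors' or the BRISQUE reference implementation) of how MSCN coefficients can be computed for a grayscale frame with NumPy/SciPy; the Gaussian window scale and the stabilizing constant are illustrative choices.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def mscn_coefficients(frame, sigma=7.0 / 6.0, c=1.0):
    """Mean-subtracted contrast-normalized (MSCN) coefficients of a frame.

    frame: 2-D float array (grayscale luminance, e.g., in [0, 255]).
    sigma: scale of the Gaussian weighting window (illustrative value).
    c:     stabilizing constant that avoids division by zero in flat regions.
    """
    frame = frame.astype(np.float64)
    mu = gaussian_filter(frame, sigma)                       # local mean
    var = gaussian_filter(frame * frame, sigma) - mu * mu    # local variance
    sigma_map = np.sqrt(np.maximum(var, 0.0))                # local standard deviation
    return (frame - mu) / (sigma_map + c)                    # MSCN coefficients

# NSS-style features are then derived from the empirical distribution of the
# coefficients (e.g., its spread and tail behavior over the frame).
```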

TABLE I: Types of observable quality degradation factors in online user-generated videos.

Video
  Acquisition: Limited spatial/temporal resolution, misfocusing, blur, jerkiness, camera shaking, noise, occlusion, insufficient/excessive lighting, poor composition, poor color reproduction
  Processing/editing: Bad transition effect (e.g., fade in/out, overlap), harming caption (e.g., title screen, subtitle), frame-in-frame, inappropriate image processing
  Uploading: Video compression artifacts, temporal resolution reduction, spatial resolution loss

Audio
  Acquisition: Device noise, environmental noise, too-loud or too-low volume, incomprehensible language
  Processing/editing: Inappropriate background music, audio-video desynchronization, unsuitable sound effects, audio track loss
  Uploading: Audio compression artifacts

Video & Audio
  Content: Boredom, violence, sexuality, harmful content
  Delivery: Buffering, packet loss, quality fluctuation

B. Characteristics of Online Videos

In online video-sharing services, both user-generated videos and professional videos are shared. In terms of quality, they have significantly different characteristics, especially in filming and editing. Many of the makers of user-generated videos do not have professional knowledge of photography and editing, so quality degradation factors can easily be involved in every step, from the acquisition to the distribution of videos.

Table I presents a list of observable quality degradation factors in online user-generated videos. They are grouped with respect to the channels affected by degradation (i.e., video, audio, and audio-video) and the steps involving degradation.

Visual factors in the acquisition step consist of problems with equipment, camera skills, and environments. In particular, according to the work in [12], typical artifacts in user-generated videos include camera shake, harmful occlusions, and camera misalignment. Visual quality degradation also occurs during video processing and editing due to the editor's lack of knowledge of photography and video making or their intent. For example, scene transition effects, captions, and frame-in-frame effects, where a small image frame is inserted in the main video frame, can be used, which may degrade visual quality. Image processing (e.g., color modification, sharpening) can be applied during editing, which may not be pleasant to viewers. In the uploading step, the system or the uploader may compress the video or modify the spatial and temporal resolutions of the video, which may introduce compression artifacts, visual information loss, or motion jerkiness.

Audio quality degradation can also occur at each step of acquisition, processing and editing, and uploading. Some of the audio quality factors involved in the acquisition and uploading steps are similar to the visual quality factors (equipment noise and compression, etc.). Moreover, the language used in the recorded speech may have a negative effect on perception when a viewer does not understand the language. In the processing and editing step, inserting inappropriate sound sources, background music, or sound effects may decrease user satisfaction. Loss of the audio track may be a critical issue when the content significantly depends on the sound.

Some quality factors related to the characteristics of the content or communication environment apply to both audio and video channels. First, the content of a video can be a problem. Boring, violent, sexual, and harmful content can spoil the overall experience of watching the video. Second, the communication environment from the server to a viewer is not always guaranteed, so buffering, packet loss, and quality fluctuation may occur, which are critical in streaming multimedia content [13].

Content in online video-sharing services is usually accompanied with the information reflecting uploading and consumption patterns, provided by uploaders and viewers, which is called metadata. The metadata of a video clip, either assigned by the uploader or automatically extracted from the video, include the title, information of the uploader, upload date, duration, video format, and category. Metadata determined by viewers include the viewcount, comments, and ratings. One can analyze metadata to discover the production and consumption patterns of online videos and to improve the quality of service. Moreover, information from metadata (e.g., video links and subscribers) can be used to construct a social network consisting of online videos. Analysis of the social network can be used for content recommendations [14] and for investigating the network topology of online video sharing, especially the evolution of online video communities [15], [16].

C. Quality Assessment of Online Content

There are few studies that consider particular characteristics of online images and videos. The method proposed in [17] estimates the quality of online videos using motion estimation, temporal factors to evaluate jerkiness and unstableness, spatial factors (including blockiness, blurriness, dynamic range, and intensity contrast), and video editing styles (including shot length distribution, width, height, and black side ratio). Since these features depend on the genre of video content, robustness is not guaranteed, as pointed out in [17]. The work in [18] predicted the quality of user-generated images using their social link distribution, the tone of viewer comments, and access logs from other websites. It was discovered that social functionality is more important in determining the quality of user-generated images in online environments than the distortion of the images themselves. Our work deals with videos, which are more challenging for quality assessment than images. In comparison to the aforementioned prior work, we conduct a more comprehensive and thorough analysis of the issue of the quality assessment of online user-generated videos based on state-of-the-art video quality metrics and metadata-driven metrics.

III. VIDEO AND SUBJECTIVE DATASET

In this section, we introduce the video and subjective dataset used in this work. We use the dataset presented in [19]. In [19], 50 user-generated amateur videos and their metadata were collected from YouTube via keyword search, and subjective quality ratings for the videos were obtained from a crowdsourcing-based subjective quality assessment experiment. The description and observed quality degradation factors of the videos are presented in Table II.

TABLE II: Description and quality degradation factors of the videos in the dataset. The video ID is defined based on the ranking in terms of the mean opinion score (MOS) (see Section IV-A).

ID | Description | Quality degradation factors
1 | A hand drawing portraits on paper with pencil and charcoal from scratch | Fast motion
2 | A man drawing on the floor with chalk | Time lapse
3 | A series of animal couples showing friendship | Blur, compression artifacts
4 | Procedure to cook cheese sandwich (close-up of the food) | Misfocusing
5 | Two men imitating animals eating food | Compression artifacts, jerkiness
6 | A baby swimming in a pool | Camera shaking
7 | Escaping from a chase (first-person perspective) | Fisheye lens effect
8 | Nature scenes including mountain, cloud, and sea | Blur, compression artifacts
9 | Cheering university students (shot by a camera moving around a campus) | Camera shaking, captions
10 | A red fox in a cage | Blur, camera shaking
11 | Cats and kittens in a house | Blur, misfocusing
12 | A crowd dancing at a station | Poor color reproduction
13 | Seven people creating rhythmic sounds with a car | Camera shaking, misfocusing
14 | People dancing at a square | Camera shaking
15 | A group of children learning to cook | Jerkiness, misfocusing
16 | A slide show of nature landscape pictures | Blur, compression artifacts
17 | Soldiers patrolling streets | Camera shaking, compression artifacts
18 | A sleeping baby and a cat (close-up shot) | Camera shaking, compression artifacts
19 | A baby smiling at the camera | Poor color reproduction, compression artifacts
20 | A man playing a video game | Frame-in-frame, jerkiness
21 | Cats having fun with water | Blur, camera shaking, captions
22 | A man playing a video game | Frame-in-frame, jerkiness
23 | Twin babies talk to each other with gestures | Camera shaking, compression artifacts
24 | Walking motion from the walker's viewpoint | Compression artifacts, camera noise
25 | A man sitting in a car and singing a song | Compression artifacts, misfocusing
26 | A man playing with a pet bear and a dog on the grass | Misfocusing, shaking image frame
27 | Pillow fight on a street | Insufficient lighting, varying illumination
28 | Kittens playing with each other | Blur, weak compression artifacts
29 | People dancing and cheering outside | Compression artifacts, packet loss, excessive lighting
30 | A baby laughing loudly | Camera noise, compression artifacts, captions
31 | A man breakdancing in a fitness center | Packet loss, compression artifacts, camera noise
32 | Cheerleading in a basketball court | Compression artifacts, misfocusing
33 | Microlight flying (first-person perspective) | Blur, compression artifacts, misfocusing, packet loss
34 | A man participating in parkour and freerunning | Compression artifacts, misfocusing, camera shaking, varying illumination
35 | Three people singing in a car | Camera shaking, compression artifacts, blur
36 | Street posing performance | Camera shaking, occlusion, blur
37 | A puppy and a kitten playing roughly | Low , compression artifacts, blur, jerkiness
38 | Exploring a dormitory building (first-person perspective) | Camera shaking, compression artifacts, misfocusing
39 | Shopping people at a supermarket | Compression artifacts, captions, blur
40 | A man working out in a park | Low frame rate, insufficient lighting, blur, varying illumination, poor color reproduction
41 | People in a hotel lobby | Camera shaking, misfocusing, occlusion
42 | Bike trick performance | Compression artifacts, blur, misfocusing, jerkiness
43 | A man cooking and eating chicken | Camera shaking
44 | A baby walking around with his toy cart at home | Vertical black sides, camera shaking, varying illumination
45 | Men eating food | Camera shaking, misfocusing, compression artifacts
46 | An old singing performance clip | Camera noise, poor color reproduction, compression artifacts, blur
47 | A man doing sandsack training | Compression artifacts, jerkiness, camera noise
48 | A series of short clips | Poor color reproduction, varying illumination, compression artifacts, camera noise
49 | Two men practicing martial arts | Black line running, poor color reproduction, jerkiness, blur, packet loss

Since the metadata collected in [19] were rather dated and limited, we collected more detailed and recent metadata for the videos by using the YouTube Data API. For each video, the following metadata were gathered: the maximum spatial resolution, upload date, video length, video viewcount, video likes, video dislikes, video favorites, video comments, video description length, channel viewcount, channel dislikes, channel favorites, channel comments, channel description length, and the number of uploaded videos of the uploader. The collection process for metadata was conducted in March 2014. While collecting recent metadata, we found that one video in the original dataset was deleted, which was excluded from our experiment. Table III presents the metadata statistics of the videos. It can be seen that the dataset covers a wide range of online user-generated videos in various viewpoints of production and consumption characteristics.

TABLE III: Range and position information of the metadata of the videos in the dataset.

Metadata | Min | Max | Average | Median
Max resolution (height) | 144 | 1080 | - | 480
Date^a | 168 | 3086 | 1970 | 2304
Duration^b | 27 | 559 | 184.3 | 163
#viewcount | 29 | 96372651 | 6949894 | 117220
#like | 0 | 364131 | 28750 | 1458
#dislike | 0 | 22426 | 1231 | 17
#comment | 0 | 49015 | 5102 | 260
Description length | 0 | 3541 | 279.9 | 100
#subscribe | 0 | 1552053 | 105682 | 567
#channel viewcount | 486 | 289704644 | 31341621 | 892815
#channel comment | 0 | 19499 | 1691 | 10
#channel video | 1 | 5390 | 225.6 | 27
Channel description length | 0 | 911 | 91 | 0
^a Days until the upload date since YouTube was activated (14 Feb. 2005).
^b In seconds.
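As an illustration only (the paper does not give its collection code, and the API has evolved since 2014), per-video metadata like those in Table III can be retrieved with the YouTube Data API v3 through the google-api-python-client package; the API key and video ID below are placeholders.

```python
from googleapiclient.discovery import build

API_KEY = "YOUR_API_KEY"      # hypothetical credential
VIDEO_ID = "VIDEO_ID_HERE"    # placeholder video ID

youtube = build("youtube", "v3", developerKey=API_KEY)

# Request the parts that cover the video-level metadata used in Table III.
response = youtube.videos().list(
    part="snippet,contentDetails,statistics",
    id=VIDEO_ID,
).execute()

item = response["items"][0]
metadata = {
    "title": item["snippet"]["title"],
    "description_length": len(item["snippet"].get("description", "")),
    "upload_date": item["snippet"]["publishedAt"],
    "duration": item["contentDetails"]["duration"],           # ISO 8601, e.g. "PT3M5S"
    "viewcount": int(item["statistics"].get("viewCount", 0)),
    "likes": int(item["statistics"].get("likeCount", 0)),
    "comments": int(item["statistics"].get("commentCount", 0)),
}
print(metadata)
```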

The subjective ratings were collected based on the paired comparison methodology [20] in [19]. Subjects were recruited from Amazon Mechanical Turk. A web page showing two randomly selected videos from the dataset in a side-by-side manner was used for quality comparison. Subjects were asked to choose the video with better visual quality than the other. Subjects had to play both videos before entering their ratings to prevent cheating and false ratings. In total, 8,471 paired-comparison results were obtained. Each video was shown 332 times, and each pair was matched 6.78 times on average.

IV. GRAPH-BASED SUBJECTIVE DATA ANALYSIS

The subjective paired comparison data form an adjacency matrix representing a set of edges of a graph G = (V, E). Here, V is the set of nodes corresponding to the videos, and E is the set of weighted directed edges, where each weight is the winning count where a video is preferred to another one. Therefore, it is possible to apply graph theory to analyze the subjective data, which aims at obtaining further insight into viewers' patterns of quality perception. In this section, two techniques are adopted, HodgeRank analysis and graph clustering.

A. HodgeRank Analysis

The HodgeRank framework introduced in [21] decomposes imbalanced and incomplete paired-comparison data into the quality scores of video stimuli and the inconsistency of subjects' judgments. In HodgeRank analysis, the statistical rank aggregation problem is posed, which finds the global score s = [s_1, s_2, s_3, ..., s_n], where n is the number of video stimuli, such that

\min_{s \in \mathbb{R}^n} \sum_{i,j} w_{ij} \left( s_i - s_j - \hat{Y}_{ij} \right)^2    (1)

where w_{ij} is the number of comparisons between stimuli i and j, and s_i and s_j are the quality scores of stimuli i and j, respectively, which are considered as mean opinion scores (MOS). \hat{Y}_{ij} is the (i, j)th element of \hat{Y}, which is the subjective data matrix derived from the original graph of paired comparison G by

\hat{Y}_{ij} = 2\hat{\pi}_{ij} - 1.    (2)

Here, \hat{\pi}_{ij} is the observed winning rate of stimulus i against stimulus j, which is defined as

\hat{\pi}_{ij} = \frac{M_{ij}}{M_{ij} + M_{ji}}    (3)

where M_{ij} is the number of counts where stimulus i is preferred to stimulus j. Note that \hat{Y} is skew-symmetric.

The converted subjective data matrix \hat{Y} can be uniquely decomposed into three components as follows, which is called the HodgeRank decomposition:

\hat{Y} = \hat{Y}^{g} + \hat{Y}^{c} + \hat{Y}^{h}    (4)

where \hat{Y}^{g}, \hat{Y}^{c}, and \hat{Y}^{h} satisfy the following conditions:

\hat{Y}^{g}_{ij} = \hat{s}_i - \hat{s}_j,    (5)

Fig. 1: Results of HodgeRank analysis. (a) Subjective data matrix \hat{Y}, (b) decomposed global part \hat{Y}^{g}, (c) decomposed curl part \hat{Y}^{c}, and (d) decomposed harmonic part \hat{Y}^{h}. Each axis represents the videos sorted by the global score of HodgeRank. In each figure, only the lower triangular part below the diagonal axis is shown because the matrices are skew-symmetric.

\hat{Y}^{h}_{ij} + \hat{Y}^{h}_{jk} + \hat{Y}^{h}_{ki} = 0, \quad \text{for } (i,j), (j,k), (k,i) \in E    (6)

\sum_{j \neq i} w_{ij} \hat{Y}^{h}_{ij} = 0, \quad \text{for each } i \in V.    (7)

where \hat{s}_i and \hat{s}_j are the estimated scores for stimuli i and j, respectively.

The global part \hat{Y}^{g} determines the overall flow of the graph, which is formed by score differences. The curl part \hat{Y}^{c} indicates the local (triangle) inconsistency (i.e., the situation where stimulus i is preferred to stimulus j, stimulus j is preferred to stimulus k, and stimulus k is preferred to stimulus i for different i, j, and k). The harmonic part \hat{Y}^{h} represents the inconsistency caused by cyclic ties involving more than three nodes, which corresponds to the global inconsistency.

Fig. 1 shows the results of the HodgeRank decomposition applied to the subjective data. The overall trend shown in Fig. 1(a) is that the total scores decrease (darker color) for elements closer to the lower left corner, showing that the perceived superiority of one video against another becomes clear as their quality difference increases. This is reflected in the global part in Fig. 1(b), i.e., the absolute values of the matrix elements increase (darker color).

However, Fig. 1(a) is also noisy, which corresponds to the inconsistency parts in Fig. 1(c) and 1(d). To quantify this, we define the ratio of total inconsistency as

\text{Total Inconsistency} = \frac{\|\hat{Y}^{h}\|_F^2}{\|\hat{Y}\|_F^2} + \frac{\|\hat{Y}^{c}\|_F^2}{\|\hat{Y}\|_F^2}    (8)

where \|\cdot\|_F is the Frobenius norm of a matrix. The obtained ratio of total inconsistency for the subjective data is 67%. That is, the amount of inconsistency, including local inconsistency and global inconsistency, is larger than that of the global flow of the graph. Between the two sources of inconsistency, the amount of the harmonic component is far smaller than that of the curl component, as can be seen from the scale of the color bar in Fig. 1(d). This implies that it is easy for human subjects to determine quality superiority between

videos with significantly different ranks (i.e., quality scores), while determining preference for videos where quality is ranked similarly is relatively difficult. This will be discussed further in Section VI.
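For concreteness, the global scores of Eq. (1) can be obtained by solving the normal equations of the weighted least-squares problem, which form a graph-Laplacian system. The sketch below is our illustration (not the code used in the paper); it assumes a count matrix M with M[i, j] the number of times video i was preferred over video j.

```python
import numpy as np

def hodgerank_scores(M):
    """Global quality scores from a paired-comparison count matrix.

    M[i, j] = number of times video i was preferred over video j.
    Returns a score vector s (defined up to an additive constant)
    minimizing the weighted least-squares objective of Eq. (1).
    """
    M = np.asarray(M, dtype=float)
    w = M + M.T                                               # comparisons per pair
    pi = np.divide(M, w, out=np.full_like(M, 0.5), where=w > 0)  # winning rate, Eq. (3)
    Y = 2.0 * pi - 1.0                                        # skew-symmetric data matrix, Eq. (2)

    # Normal equations of Eq. (1) form a graph-Laplacian system  L s = b.
    L = np.diag(w.sum(axis=1)) - w
    b = (w * Y).sum(axis=1)
    return np.linalg.pinv(L) @ b                              # minimum-norm solution

# toy usage: 3 videos, video 0 clearly preferred over the others
M = np.array([[0, 5, 6],
              [1, 0, 4],
              [0, 2, 0]])
print(hodgerank_scores(M))
```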

Fig. 2: Graph clustering results with respect to restart probability δ. (The plot shows modularity Q versus restart probability, with the number of final clusters, 1 or 2, indicated.)

B. Graph Clustering

The HodgeRank analysis showed that videos with similar ranks in the MOS are subjectively ambiguous in terms of quality. Therefore, one may hypothesize that the videos can be grouped in such a way that different groups have distinguishable quality differences, while videos in each group have a similar quality. We attempt to examine if this is the case and, if so, how many groups can be found via graph clustering.

We use the algorithm presented in [22], whose objective is to divide the whole graph represented by the adjacency matrix M into l groups, U_1, U_2, ..., U_l, by maximizing the modularity measure Q:

Q = \frac{1}{A} \sum_{p=1}^{l} \sum_{i,j \in U_p} \left( M_{ij} - \frac{d_i^{\mathrm{in}} d_j^{\mathrm{out}}}{A} \right)    (9)

where A = \sum_{i,j} M_{ij} is the sum of all edge weights in the graph, and d_i^{\mathrm{in}} = \sum_j M_{ij} and d_i^{\mathrm{out}} = \sum_j M_{ji} represent the in-degree and out-degree of the i-th node, respectively.

This algorithm is based on the random walk with restart (RWR) model. It first computes the relevance matrix R, which is the estimated result matrix of the RWR model, from the transition matrix, which equals M:

R = (1 - \delta)\left( I - \delta \tilde{M} \right)^{-1}    (10)

where δ is the restart probability (an adjustable parameter in the RWR model), I is an identity matrix, and \tilde{M} is the column-normalized version of M. The algorithm then consists of two steps. First, starting with an arbitrary node, it repeatedly adds the node with the largest single compactness measure, until its single compactness measure does not increase. The single compactness of node i with respect to local cluster G' is represented by:

C(i, G') = \frac{1}{B} \left[ R_{ii} + \sum_{j \in G'} \left( R_{ij} + R_{ji} \right) - \frac{R_{i*} R_{*i}}{B} \right]    (11)

where B = \sum_{i,j} R_{ij} is the total sum of the elements of R, and R_{i*} = \sum_k R_{ik} and R_{*i} = \sum_k R_{ki} are the row and column sums of R, respectively. If the construction of a local cluster from a node is finished, the algorithm starts from another node that has not been assigned to any local clusters yet to make another local cluster. This process is repeated until all nodes are assigned to certain local clusters. After constructing local clusters, the algorithm merges the compact clusters by maximizing the increase of the total modularity Q of clusters in a greedy manner until there is no increase of Q, which results in final clusters.

We apply the aforementioned algorithm to the subjective data graph for various restart probability values. Fig. 2 shows the final modularity value with respect to the restart probability, ranging from 0.01 to 0.99. The number of final clusters differs when the restart probability differs. It can be seen that the graph is clustered into one or two groups in all cases. In particular, the results with Q = 1 correspond to the case in which all nodes in the graph are assigned to one cluster. That is, it is difficult to divide the nodes into groups with a clearly distinguished subjective preference. Fig. 3 shows examples of final clustering results that have high modularity. Two clusters are formed in these examples, and the cluster containing high-quality videos (marked with blue) is much bigger than that containing low-quality videos (marked with red). It seems that in the used video dataset, discriminating quality for videos with high and medium quality was difficult, whereas videos with medium and low quality were more easily distinguished.
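The modularity measure of Eq. (9) itself is straightforward to evaluate for a candidate partition. The sketch below is only an illustration of that measure (it does not implement the RWR-based cluster construction of [22]); it computes Q for a weighted directed adjacency matrix and a label vector, using the degree definitions given in the text.

```python
import numpy as np

def modularity(M, labels):
    """Modularity Q of Eq. (9) for a weighted directed graph.

    M:      adjacency matrix (M[i, j] = weight of the edge i -> j).
    labels: cluster index of each node.
    """
    M = np.asarray(M, dtype=float)
    A = M.sum()                      # total edge weight
    d_in = M.sum(axis=1)             # d_i^in  = sum_j M_ij, as defined in the text
    d_out = M.sum(axis=0)            # d_i^out = sum_j M_ji, as defined in the text
    labels = np.asarray(labels)

    Q = 0.0
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        Q += M[np.ix_(idx, idx)].sum() - np.outer(d_in[idx], d_out[idx]).sum() / A
    return Q / A

# toy usage: two 3-node groups whose members are mostly preferred within the group
M = np.zeros((6, 6))
M[:3, :3] = 3.0
M[3:, 3:] = 3.0
np.fill_diagonal(M, 0.0)
print(modularity(M, [0, 0, 0, 1, 1, 1]))
```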

Fig. 3: Examples of graph clustering results. Each video is represented by its ID, and the high (low)-quality group is marked with blue (red). (a) δ = 0.81 (Q = 0.9275), (b) δ = 0.15 (Q = 0.8503), (c) δ = 0.99 (Q = 0.8225).

V. QUALITY ESTIMATION

In this section, we investigate the problem of the objective quality assessment of online videos. First, the performance of the state-of-the-art objective quality metrics is evaluated. Second, quality estimation using metadata-driven metrics is investigated.

A. No-Reference Objective Metrics

The performance of three state-of-the-art NR quality metrics, namely, V-BLIINDS, BRISQUE, and BLIINDS-II, is examined using the video and subjective data described in the previous section. Some videos have title scenes at the beginning (e.g., title text shown on a black background for a few seconds), which are excluded for quality evaluation because they are usually irrelevant for judging video quality. The performance of the metrics is shown in Table IV. We adopt the Spearman rank-order correlation coefficient (SROCC) as the performance index because the relationship between the metrics' outputs and MOS is nonlinear. Statistical test results reveal that only the SROCC of BRISQUE is statistically significant at a significance level of 0.05 (F = 6.00). Interestingly, V-BLIINDS, which is a video quality metric, appears to be inferior to BRISQUE and BLIINDS-II, which are image quality metrics, meaning that the way V-BLIINDS incorporates temporal information is not very effective for the user-generated videos. Overall, the performance of the metrics is far inferior to that shown in existing studies using other databases. The SROCC values of BLIINDS-II and BRISQUE on the LIVE IQA database containing images corrupted by blurring, compression, random noise, etc. [23] were as high as 0.9250 and 0.9395, respectively [24]. The SROCC of V-BLIINDS for the LIVE VQA database containing videos degraded by compression, packet loss, etc. [25] was 0.759 in [1]. This implies that the problems of online video quality assessment are very different from those of traditional quality assessment. The reasons why the NR metrics fail are discussed further in Section VI with examples.

TABLE IV: Spearman rank-order correlation coefficients (SROCC) between MOS and V-BLIINDS, BRISQUE, and BLIINDS-II, respectively.

NR Metric | SROCC
V-BLIINDS | 0.1406
BRISQUE | 0.3364
BLIINDS-II | 0.2383

B. Metadata-driven Metrics

Metadata-driven metrics are defined as either the original values of the metadata listed in Table III or the values obtained by combining them (e.g., #like divided by #view for normalization). Table V shows the performance of the metadata-driven metrics for quality prediction. It is observed that the performance of the metrics significantly varies, from fairly high to almost no correlation with the MOS. It is worth noting that several metadata-driven metrics show better performance than the NR quality metrics. Generally, the video-specific metrics (e.g., video viewcount, the number of video comments) show better performance than the channel-specific metrics (e.g., channel viewcount, number of channel comments).

TABLE V: SROCC between MOS and metadata-driven metrics. Note that the metrics are sorted in descending order of SROCC.

Metric | SROCC
Description length | 0.5402
#like / #view | 0.5347
Max resolution | 0.5331
#subscribe | 0.4694
#subscribe / #channel video | 0.4408
#like | 0.4307
#dislike | 0.3910
#channel viewcount | 0.3855
#viewcount | 0.3728
#comment | 0.3699
#like / date | 0.3558
Channel description length | 0.3482
#channel viewcount / #channel video | 0.3061
Date | 0.3042
#view / date | 0.2727
#channel comment | 0.2099
#comment / #view | 0.1861
#channel video | 0.1465
#channel comment / #channel video | 0.1414
Duration | -0.0533

The metadata-driven metric showing the highest SROCC is the description length. A possible explanation for this is that a video with good quality would have significant visual and semantic information, and the uploader may want to provide a detailed description about the video faithfully. The second-best metadata-driven metric is the ratio of the like count to the viewcount. This metric inherently contains information about the satisfaction of the viewers, which is closely related to the video quality. The third-ranked metadata-driven metric is the maximum spatial resolution. The availability of high resolution for a video means that it was captured with a high-performance camera or that it did not undergo video processing, which would reduce the spatial resolution and possibly degrade video quality. The number of subscribers of an uploader is ranked fourth, which indicates the popularity of the uploader. A popular uploader's videos are also popular, and their visual quality plays an important role.

Other metrics related to video popularity, including the numbers of likes and dislikes, viewcount, and number of comments, show significant correlations with the MOS, but they remain moderate. This indicates that popularity is a meaningful but not perfect predictor of quality. The age of a video ("date" in the table) has a moderate correlation with perceptual quality, which means that newer videos tend to have better quality. This can be understood by considering that recent recording devices usually produce better quality videos than old devices and users are more experienced in video production than before.
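SROCC values such as those in Tables IV and V can be computed directly with scipy.stats.spearmanr; the arrays in the following sketch are placeholders rather than the study's data.

```python
import numpy as np
from scipy.stats import spearmanr

# placeholder values -- not the actual study data
mos = np.array([4.1, 3.6, 3.3, 2.9, 2.2, 1.8])                   # subjective scores
description_length = np.array([800, 950, 300, 120, 60, 10])      # a metadata-driven metric
nr_metric_output = np.array([0.62, 0.55, 0.58, 0.40, 0.44, 0.31])  # an NR metric output

rho_meta, _ = spearmanr(mos, description_length)
rho_nr, _ = spearmanr(mos, nr_metric_output)
print(f"SROCC(metadata) = {rho_meta:.3f}, SROCC(NR metric) = {rho_nr:.3f}")
```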

Fig. 4: SROCC of (a) linear regression and (b) SVR using metadata and objective quality assessment algorithms. (Both panels plot SROCC against the number of metadata-driven metrics, for metadata only and for metadata combined with V-BLIINDS, BRISQUE, or BLIINDS-II.)

Apparently, each of the metadata-driven metrics represents a distinct partial characteristic of the videos. Therefore, it can be expected that combining multiple metrics will yield improved results due to their complementarity. For the integration of the metrics, we employ two techniques, namely, linear regression and nonlinear support vector regression (SVR) using Gaussian kernels [26]. The former is written as:

f_{\mathrm{Linear}}(\mathbf{x}) = \mathbf{a}^{T} \mathbf{x} + b_{\mathrm{Linear}}    (12)

where \mathbf{x} is a vector composed of metadata-driven metrics, and \mathbf{a} and b_{\mathrm{Linear}} are tunable parameters of the linear regression model. The latter is given as:

f_{\mathrm{SVR}}(\mathbf{x}) = \sum_{i=1}^{v} (\alpha_i - \alpha_i^{*}) \phi(\mathbf{x}_i, \mathbf{x}) + b_{\mathrm{SVR}}    (13)

where \alpha_i, \alpha_i^{*}, \mathbf{x}_i (i = 1, 2, ..., v), and b_{\mathrm{SVR}} are parameters of the SVR model, and \phi(\mathbf{x}_i, \mathbf{x}) is a Gaussian kernel expressed by

\phi(\mathbf{x}_i, \mathbf{x}) = \exp\left( -\frac{\|\mathbf{x}_i - \mathbf{x}\|^2}{2\sigma^2} \right)    (14)

where \sigma^2 is the variance of the Gaussian kernel [27].

We train the models by changing the number of metadata-driven metrics, i.e., the length of the input vector \mathbf{x} in (12) and (13), by adding metrics one by one in descending order of SROCC in Table V. The models are trained and evaluated using all the data without separating the training and test data to obtain insight into the performance potential of the integrated models to estimate quality. In addition, to further incorporate the information of visual signals within the regression models, we also test models combining metadata-driven metrics and V-BLIINDS, BRISQUE, and BLIINDS-II, respectively, to examine the synergy between the two modalities (i.e., the visual signal and metadata).

Fig. 4 shows the results of the linear regression and SVR model evaluation. In both cases, the performance in terms of SROCC is improved by increasing the number of metadata-driven metrics. When 20 metadata-driven metrics are used, the SROCC becomes nearly 0.8 for linear regression and 0.85 for SVR, which is a significant improvement compared with the highest SROCC achieved by a single metric (0.54 by the description length). The effect of incorporating an NR objective quality metric is negligible (for linear regression) or marginal (for SVR). Such a limited contribution of the video signal-based metrics seems to be due to their limited quality predictability as shown in Table IV.

To compare the quality prediction performance of metadata and NR quality metrics in more depth, we analyze differences between the predicted ranks and MOS ranks. Fig. 5 shows histograms of rank differences between the MOS and estimated quality scores by the NR objective quality metrics or SVR combining 20 metadata-driven metrics. It is observed that the SVR model of metadata shows smaller rank differences than the objective quality metrics. From an ANOVA test, it is confirmed that the mean locations of the rank differences for the four cases are significantly different at a significance level of 0.01 (F = 10.84). Additionally, Duncan's multiple range tests reveal that the mean location of the rank differences of the SVR-based metadata model is significantly different from those of the signal-based metrics at a significance level of 0.01. These results demonstrate that metadata-based quality prediction is a powerful way of dealing with the online video quality assessment problem by overcoming the limitations of the conventional objective metrics based on the visual signal.
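A model of the form (13)-(14) corresponds to an RBF-kernel support vector regressor; a minimal sketch using scikit-learn is given below, with placeholder features and targets and illustrative hyperparameters (the paper does not report the exact settings used).

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# X: one row per video, columns = metadata-driven metrics (added in descending
#    order of their individual SROCC, as in Table V); y: MOS.
# Random placeholders are used here instead of the actual dataset.
rng = np.random.default_rng(0)
X = rng.random((49, 20))
y = rng.random(49)

# Standardizing the heterogeneous metadata scales before the RBF kernel;
# gamma plays the role of 1 / (2 * sigma^2) in Eq. (14).
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0, gamma="scale"))
model.fit(X, y)                      # trained on all data, mirroring the paper's setup
pred = model.predict(X)

rho, _ = spearmanr(y, pred)
print(f"SROCC = {rho:.3f}")
```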

Fig. 5: Rank difference between estimated quality scores and MOS. (a) V-BLIINDS (mean: 15.02), (b) BRISQUE (mean: 12.98), (c) BLIINDS-II (mean: 14.00), and (d) the SVR model with 20 metadata-driven metrics (mean: 5.10).
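The quantity histogrammed in Fig. 5 is the per-video difference between MOS ranks and predicted ranks; it can be computed as in the following sketch, again with placeholder scores.

```python
import numpy as np
from scipy.stats import rankdata

def rank_differences(mos, predicted):
    """Absolute per-video difference between MOS ranks and predicted-score ranks."""
    return np.abs(rankdata(-np.asarray(mos)) - rankdata(-np.asarray(predicted)))

# placeholder scores for 6 videos
mos = [4.1, 3.6, 3.3, 2.9, 2.2, 1.8]
pred = [3.0, 3.9, 2.5, 3.1, 1.0, 2.0]
d = rank_differences(mos, pred)
print(d, d.mean())   # Fig. 5 reports the histogram and mean of such differences
```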

VI. DISCUSSION

In the previous sections, it was shown that the quality evaluation of online user-generated videos is not easy both subjectively and objectively. In this section, we provide further discussion on these issues in more detail with representative examples.

In Section IV, it was observed that viewers have difficulty in determining quality superiority among certain videos. As shown in Table I, there are many different factors of quality degradation in user-generated videos. In many cases, therefore, users are required to compare quality across different factors, which is not only difficult, but also very subjective depending on personal taste. For example, video #15 has jerkiness and misfocusing, video #16 has blur, and video #17 has camera shaking. In the subjective test results, video #16 is preferred to video #15, while video #17 is less preferred than video #15, and video #17 is preferred to video #16. As a result, the match result among these three videos forms a triangular triad, which contributes to the local inconsistency shown in Fig. 1(c).

In Section V, we showed that the state-of-the-art NR objective metrics fail to predict the perceptual quality of user-generated videos. This is largely due to the fact that the quality degradation conditions targeted during the development of the metrics are significantly different from those involved in user-generated videos. When the NR metrics were developed, it was normally assumed that original video sequences have perfect quality. However, as shown in Table I, the user-generated videos are already subject to various types of quality degradation in the production stage (e.g., insufficient lighting, hand shaking). Furthermore, NR metrics are usually only optimized for a limited number of typical quality factors, such as compression artifacts, packet loss, blurring, and random noise, while there are many more quality factors that can be involved during editing, processing, and distribution, some of which are even related to aesthetic aspects.

Editing effects are particularly difficult to assess for the NR metrics, not only because many of them are not considered by the metrics, but also because some of them may be wrongly treated as artifacts. Videos #1 and #2 are examples containing unique editing effects in the temporal domain. They are a fast-playing video and a time lapse video, respectively, which are interesting to viewers and thus ranked 1st and 2nd in MOS, respectively. However, as V-BLIINDS regards them as having poor motion consistency, it gives them undesirable low ranks (i.e., 40th and 44th, respectively).

In Section V-B, metadata were shown to be useful to extract quality information, showing better quality evaluation performance than the NR objective metrics. A limitation of metadata is that some are sensitive to popularity, users' preference, or other video-unrelated factors, which may not perfectly coincide with the perceived quality of videos. Video #25 is such an example. It is a music video made by a fan of a musician. It has moderate visual quality and thus is ranked 25th in MOS. However, since the main content of this video is music, its popularity (the viewcount, the number of likes and comments, etc.) is mainly determined by the audio content rather than the visual quality. Moreover, it has the longest video description (listing the former works of the musician) in the dataset, according to which it would be ranked 14th (note that the description length is the best-performing metric in Table V).

A way to alleviate the limitation of each metadata-driven metric is to combine several metadata-driven metrics and expect them to compensate for the limited information of each of them, which was shown to be effective in our results. For instance, the available maximum resolution was shown to be highly correlated with MOS in Table V, so video #29 would be ranked highly since a high-resolution (1080p) version of this video is available. However, it is ranked only 29th in MOS due to compression artifacts and packet loss artifacts. When multiple metadata-driven metrics are combined using SVR, the rank of the video becomes 24th, which is much closer to the MOS rank.

VII. CONCLUSION

We have presented our work on investigating the issue of the subjective and objective visual quality assessment of online user-generated video content. First, we examined users' patterns of quality evaluation of online user-generated videos via the HodgeRank decomposition and graph clustering techniques. A large amount of local inconsistency in the paired-comparison results was found by the HodgeRank analysis, which implies that it is difficult for human viewers to determine quality superiority between videos ranked similarly in MOS, mainly due to the difficulty of comparing quality across different factors. Consequently, subjective distinction between different quality levels is only clear at a large cluster level, which was shown by the graph clustering results. We then benchmarked the performance of existing state-of-the-art NR objective metrics, and explored the potential of metadata-driven metrics for the quality estimation of user-generated video content. It was shown that the existing NR metrics do not yield satisfactory performance, whereas metadata-driven metrics perform significantly better than the NR metrics. In particular, as each of the metadata-driven metrics covers only limited information on visual quality, combining them significantly improved the performance. Finally, based on the results and examples, we provided a detailed discussion on why the subjective and objective quality assessment of user-generated videos is difficult.

Our results demonstrated that the problem of quality assessment of user-generated videos is very different from the conventional video quality assessment problem dealt with in the prior work. At the same time, our results have significant implications for future research: The existence of diverse quality factors involved in user-generated videos and the failure of existing metrics may suggest that the problem of user-generated video quality assessment is too large to be conquered by a single metric; thus, metrics specialized to different factors should be applied separately (and then be combined later, if needed). Since many factors, such as editing effects, are not covered by existing metrics, developing reliable metrics for them would be necessary. Moreover, the problem is highly subjective and depends on personal taste, so personalized quality assessment may be an effective method in the future. Therefore, proper ways to collect personalized data as ground truths would be required, where big data analysis techniques may be helpful.

While our results are for a dataset with a limited number of video sequences, it is still reasonable to consider that most of them, particularly those related to the different nature of user-generated videos from professional ones, can be applied generally, although the matter of degree exists. Nevertheless, larger scale experiments with larger numbers of videos with more diverse characteristics will be desirable in the future.

REFERENCES

[1] M. A. Saad, A. C. Bovik, and C. Charrier, "Blind prediction of natural video quality," IEEE Transactions on Image Processing, vol. 23, no. 3, pp. 1352–1365, 2014.
[2] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: from error visibility to structural similarity," IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.
[3] Z. Wang, E. P. Simoncelli, and A. C. Bovik, "Multiscale structural similarity for image quality assessment," in Proceedings of the 37th Asilomar Conference on Signals, Systems and Computers, vol. 2, 2003, pp. 1398–1402.
[4] E. C. Larson and D. M. Chandler, "Most apparent distortion: Full-reference image quality assessment and the role of strategy," Journal of Electronic Imaging, vol. 19, no. 1, pp. 011006-1–011006-21, 2010.
[5] H. R. Sheikh and A. C. Bovik, "A visual information fidelity approach to video quality assessment," in Proceedings of the 1st International Workshop on Video Processing and Quality Metrics for Consumer Electronics, 2005, pp. 23–25.
[6] K. Seshadrinathan and A. C. Bovik, "Motion tuned spatio-temporal quality assessment of natural videos," IEEE Transactions on Image Processing, vol. 19, no. 2, pp. 335–350, 2010.
[7] P. V. Vu, C. T. Vu, and D. M. Chandler, "A spatiotemporal most-apparent-distortion model for video quality assessment," in Proceedings of the 18th IEEE International Conference on Image Processing, 2011, pp. 2505–2508.
[8] R. Soundararajan and A. C. Bovik, "Video quality assessment by reduced reference spatio-temporal entropic differencing," IEEE Transactions on Circuits and Systems for Video Technology, vol. 23, no. 4, pp. 684–694, 2013.
[9] M. H. Pinson and S. Wolf, "A new standardized method for objectively measuring video quality," IEEE Transactions on Broadcasting, vol. 50, no. 3, pp. 312–322, 2004.

[10] M. A. Saad, A. C. Bovik, and C. Charrier, "Blind image quality assessment: A natural scene statistics approach in the DCT domain," IEEE Transactions on Image Processing, vol. 21, no. 8, pp. 3339–3352, 2012.
[11] A. Mittal, A. K. Moorthy, and A. C. Bovik, "No-reference image quality assessment in the spatial domain," IEEE Transactions on Image Processing, vol. 21, no. 12, pp. 4695–4708, 2012.
[12] S. Wilk and W. Effelsberg, "The influence of camera shakes, harmful occlusions and camera misalignment on the perceived quality in user generated video," in Proceedings of IEEE International Conference on Multimedia and Expo, 2014, pp. 1–6.
[13] T. Hoßfeld, R. Schatz, and U. R. Krieger, "QoE of YouTube video streaming for current internet transport protocols," in Measurement, Modelling, and Evaluation of Computing Systems and Dependability and Fault Tolerance. Springer, 2014, pp. 136–150.
[14] J. Davidson, B. Liebald, J. Liu, P. Nandy, T. Van Vleet, U. Gargi, S. Gupta, Y. He, M. Lambert, B. Livingston et al., "The YouTube video recommendation system," in Proceedings of the 4th ACM Conference on Recommender Systems, 2010, pp. 293–296.
[15] M. Cha, H. Kwak, P. Rodriguez, Y.-Y. Ahn, and S. Moon, "Analyzing the video popularity characteristics of large-scale user generated content systems," IEEE/ACM Transactions on Networking, vol. 17, no. 5, pp. 1357–1370, 2009.
[16] F. Figueiredo, J. M. Almeida, M. A. Gonçalves, and F. Benevenuto, "On the dynamics of social media popularity: A YouTube case study," ACM Transactions on Internet Technology, vol. 14, no. 4, pp. 24:1–24:23, 2014.
[17] T. Xia, T. Mei, G. Hua, Y.-D. Zhang, and X.-S. Hua, "Visual quality assessment for web videos," Journal of Visual Communication and Image Representation, vol. 21, no. 8, pp. 826–837, 2010.
[18] Y. Yang, X. Wang, T. Guan, J. Shen, and L. Yu, "A multi-dimensional image quality prediction model for user-generated images in social networks," Information Sciences, vol. 281, pp. 601–610, 2014.
[19] C.-H. Han and J.-S. Lee, "Quality assessment of on-line videos using metadata," in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2014, pp. 1385–1388.
[20] J.-S. Lee, "On designing paired comparison experiments for subjective multimedia quality assessment," IEEE Transactions on Multimedia, vol. 16, no. 2, pp. 564–571, 2014.
[21] Q. Xu, Q. Huang, T. Jiang, B. Yan, W. Lin, and Y. Yao, "HodgeRank on random graphs for subjective video quality assessment," IEEE Transactions on Multimedia, vol. 14, no. 3, pp. 844–857, 2012.
[22] D. Duan, Y. Li, Y. Jin, and Z. Lu, "Community mining on dynamic weighted directed graphs," in Proceedings of the 1st ACM International Workshop on Complex Networks Meet Information & Knowledge Management, 2009, pp. 11–18.
[23] H. R. Sheikh, M. F. Sabir, and A. C. Bovik, "A statistical evaluation of recent full reference image quality assessment algorithms," IEEE Transactions on Image Processing, vol. 15, no. 11, pp. 3440–3451, 2006.
[24] K. Gu, G. Zhai, X. Yang, and W. Zhang, "Using free energy principle for blind image quality assessment," IEEE Transactions on Multimedia, vol. 17, no. 1, pp. 50–63, 2015.
[25] K. Seshadrinathan, R. Soundararajan, A. C. Bovik, and L. K. Cormack, "Study of subjective and objective quality assessment of video," IEEE Transactions on Image Processing, vol. 19, no. 6, pp. 1427–1441, 2010.
[26] A. Smola and V. Vapnik, "Support vector regression machines," Advances in Neural Information Processing Systems, vol. 9, pp. 155–161, 1997.
[27] A. J. Smola and B. Schölkopf, "A tutorial on support vector regression," Statistics and Computing, vol. 14, no. 3, pp. 199–222, 2004.

Jong-Seok Lee (M'06-SM'14) received his Ph.D. degree in electrical engineering and computer science in 2006 from KAIST, Korea, where he also worked as a postdoctoral researcher and an adjunct professor. From 2008 to 2011, he worked as a research scientist at Swiss Federal Institute of Technology in Lausanne (EPFL), Switzerland. Currently, he is an associate professor in the School of Integrated Technology at Yonsei University, Korea. His research interests include multimedia signal processing and machine learning. He is an author or co-author of about 100 publications. He serves as an Area Editor for Signal Processing: Image Communication.

Soobeom Jang received the B.S. degree from the School of Integrated Technology, Yonsei University, Seoul, Korea, in 2014. He is currently pursuing the Ph.D. degree at the School of Integrated Technology, Yonsei University. His research interests include social multimedia analysis and multimedia applications.