DEGREE OF LOOP ASSESSMENT IN MICROVIDEO

Shumpei Sano, Toshihiko Yamasaki, and Kiyoharu Aizawa

Department of Information and Communication Engineering, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan. {sano, yamasaki, aizawa}@hal.t.u-tokyo.ac.jp

ABSTRACT

This paper presents a degree-of-loop assessment method for microvideo clips. Loop video is one of the most popular features of microvideo, but many non-loop videos are tagged "loop" on microvideo services, because uploaders and spammers also know that loop video is popular and want to draw attention from viewers. In this paper, we statistically analyze the scene dynamics of the video by using color, optical flow, and saliency maps, and evaluate the degree-of-loop. We have collected more than 1,000 video clips from Vine and subjectively evaluated their degree-of-loop. Experimental results show that our proposed algorithm can classify loop/non-loop video with 85.7% accuracy and categorize videos into five degree-of-loop classes with 61.5% accuracy.

Index Terms — microvideo, short video, loop, degree-of-loop

1. INTRODUCTION

Microvideo sharing services launched in 2013, such as Vine (https://vine.co/), MixBit (https://mixbit.com/), and video on Instagram (http://instagram.com/), are rapidly growing as a new kind of social networking service (SNS). As with conventional video sharing services such as YouTube (http://www.youtube.com/), one of the users' main interests is how to create or retrieve interesting video. If clips are sorted by the number of views or favorites, only those that have been available on the Internet for a long time can be retrieved. To solve this problem, many image-processing-based video interestingness analyses have been proposed.

Although such approaches can also be applied to microvideo, new techniques designed specifically for microvideo are also required. Indeed, some users try to create sophisticated and interesting video by taking advantage of the shortness. In Vine, for example, loop video, which seems endless thanks to the automatic repeat play function, is popular. Vine is one of the most popular microvideo sharing services; it launched in January 2013 and gained 40 million users within only seven months. As announced on its official web site, where Vine is introduced as "a mobile service that lets you capture and share short looping video" (https://blog.twitter.com/2013/vine-a-new-way-to-share-video), Vine has two characteristics. The first is shortness: the maximum video length in Vine is 6.5 seconds, shorter than in other similar services. The second is looping: videos shared on Vine are automatically played repeatedly, and many loop videos that seamlessly connect the last and the first frames are uploaded with a hash tag such as "#loop". However, when users try to retrieve loop video, the search results include many spam videos that do not loop. According to our preliminary experiment (see Table 1), 89 percent of the video clips that included the "#loop" tag were non-loop videos, because uploaders want to attract views and also know that people want to watch loop video.

Table 1. Ratio of loop/non-loop video in 1,022 microvideos with the tag "#loop".

                   non-loop video   loop video
Number of videos   906              116
Ratio              88.6%            11.4%

In this paper, therefore, we propose a degree-of-loop (DoL) assessment method that can classify loop/non-loop video by analyzing the spatial and temporal statistics of various kinds of visual features. To the best of our knowledge, this paper is the first attempt at loop/non-loop detection. In addition, such technology can be applied to assisting users in creating better loop video. We have collected 1,022 videos from Vine and manually evaluated their DoL scores from 1 (strongly non-loop) to 5 (perfect loop). The list of the video URLs and the subjective degree-of-loop scores is available on our project page (https://www.hal.t.u-tokyo.ac.jp/~sano/loopvideo/). Experiments with our DoL assessment model demonstrated 85.7% accuracy in loop/non-loop classification and 61.5% accuracy in five-class DoL classification.

This paper is organized as follows. In Section 2, related works are summarized. Section 3 explains our DoL model using frame distances and saliency trajectories. Experimental results are demonstrated in Section 4, followed by concluding remarks in Section 5.

[Fig. 1. Overview of degree-of-loop assessment. From the input video, three types of features are extracted (1. adjacent frame distances: RGB, brightness, magnitude of optical flow; 2. loopness probability by adjacent frame distances: RGB, brightness, magnitude of optical flow; 3. continuity of the region of interest: saliency centroid trajectory); after feature selection, an SVM performs loop/non-loop classification and degree-of-loop classification.]

[Fig. 2. Example of RGB distances of loop and non-loop video over the frames (y-axis: magnitude of distances, log scale from 1 to 100,000; x-axis: frames 0-150). The last value is the last-frame/first-frame distance.]

2. RELATED WORKS

Content-based video ranking [1, 2, 3] has been proposed with the aim of retrieving interesting video. Irie et al. [1] assumed that interesting video is more heavily edited than low-interestingness video and proposed a "degree-of-edit" measure based on editing clues such as the number of cuts and the detection of sound and text captions. Wei et al. [2] proposed a cross-reference video reranking method with fused multimodal features. Tian et al. [3] used users' labeling efforts in video reranking to bridge the semantic gap. Redi et al. [4] focused on microvideo characterization and introduced a "creativity" measure; they analyzed correlations between creativity and audio-visual features such as filmmaking-technique features and aesthetics-modeling features. In [4], it is demonstrated that the degree of loop is an important factor in analyzing creativity. However, only a one-dimensional feature was used for loopness analysis.

Looping video generation techniques have also been presented. Various techniques have been proposed [5, 6, 7, 8, 9] to form looping content, though they are not focused on microvideo. Schodl et al. [5] proposed video textures, which construct a graph between similar images in the video and stochastically transition from one clip to another, thus achieving a random but continuous sequence. Kwatra et al. [6] generated looping video by synthesizing video textures spatially and temporally using a graph-cut technique. Agarwala et al. [7] and Rav-Acha et al. [8] created panoramic video. Liao et al. [9] proposed a method to create looping video with varying dynamism by segmenting scene regions based on their degree of motion.

3. DEGREE OF LOOP ASSESSMENT

3.1. Method Overview

Fig. 1 shows an overview of the proposed method. Given the input video, three types of features are extracted: adjacent frame distances, loopness probabilities by adjacent frame distances, and continuity of the region of interest. Then, after feature selection, loop/non-loop classification and five-class DoL classification are conducted by using a support vector machine (SVM).

3.2. Adjacent Frame Distances

In loop video, the first and last frames need to be visually similar in order to smoothly connect the last and the first frames. We measure the distance between the first and the last frames in terms of RGB, brightness, and optical flow as feature values. Let d^I_{i,j}(x, y) be the per-pixel distance of type I between frames f_i and f_j at pixel coordinate (x, y):

    d^{RGB}_{i,j}(x, y) = \| RGB_i(x, y) - RGB_j(x, y) \|^2    (1)
    d^{Br}_{i,j}(x, y)  = \| Y_i(x, y) - Y_j(x, y) \|^2        (2)
    d^{Opt}_{i,j}(x, y) = \| OptFlow_{i,j}(x, y) \|^2          (3)

Here, RGB_i(x, y) is the vector of RGB values at each pixel of frame i, Y_i(x, y) is the brightness in the YUV color space, and \| OptFlow_{i,j}(x, y) \|^2 is the magnitude of the optical flow at each pixel between frames i and j, calculated using [10]. Let L be the number of frames; the last-frame/first-frame distance, defined as F^I = \sum_{x,y} d^I_{L,1}(x, y), is used as a feature.

3.3. Loopness Probability by Adjacent Frame Distance

The features in Section 3.2 measure only the distance between the last and the first frames. However, there are also loop videos whose background changes dynamically, and non-loop videos that are dark or whose background is static. Fig. 2 shows an example of the RGB distances over the frames of a loop video and a non-loop video. As shown in Fig. 2, a small last/first-frame distance does not always indicate loop video, and vice versa. Therefore, we also consider a loopness probability obtained by statistical analysis of the distances over the frames. Let us define the set of adjacent-frame distances from the first frame up to the last frame:

    G^I = \{ g^I(i, i+1) \mid i = 1, \ldots, L-1 \}              (4)
    g^I(i, j) = \log ( 1 + \sum_{x,y} d^I_{i,j}(x, y) )          (5)

and g^I_L \equiv g^I(L, 1) is calculated for each video.

For the optical flow, a B-bin orientation histogram (Opt_b, b = 1, ..., B) is also calculated for feature extraction:

    d^{Opt_b}_{i,j}(x, y) = \begin{cases} d^{Opt}_{i,j}(x, y) & (\pi \frac{2b-1}{B} \le \theta < \pi \frac{2b+1}{B}) \\ 0 & (\text{otherwise}) \end{cases}    (6)

where \theta is the orientation of OptFlow_{i,j}(x, y) and B is set to eight by an empirical study. Gaussian mixture models (GMM) [11] with K components (K = 1, ..., 5) are estimated from G^I using the EM algorithm [12]:

    p(g^I_L \mid G^I) = \sum_{k=1}^{K} \alpha_k N(g^I_L; \mu_k, \sigma_k^2)    (7)
    N(g^I_L; \mu_k, \sigma_k^2) = \frac{1}{\sqrt{2 \pi \sigma_k^2}} \exp ( - \frac{(g^I_L - \mu_k)^2}{2 \sigma_k^2} )    (8)

Here, \sum_{k=1}^{K} \alpha_k = 1, and N(g^I_L; \mu, \sigma^2) is the probability density function of the normal distribution with mean \mu and variance \sigma^2. The probability feature is defined as

    F^I_K = p(g^I_L \mid G^I)                                    (9)

Four additional features with stat \in \{max, min, mean, median\} are also considered for the optical flow:

    F^{Opt\_stat}_K = \mathrm{stat}_{b \in \{1, \ldots, B\}} (F^{Opt_b}_K)    (10)
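A compact sketch of the probability feature of Eqs. (4)-(9) follows, assuming the summed per-pixel distances are precomputed. scikit-learn's GaussianMixture (which runs EM [12]) replaces a hand-rolled estimator, and the function name is ours; one feature F^I_K is produced per component count K.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def loopness_probability_features(adjacent_dists, last_first_dist, max_k=5):
    """F^I_K = p(g^I_L | G^I) for K = 1..max_k (Eqs. (4)-(9)).

    adjacent_dists: sum_{x,y} d^I_{i,i+1}(x, y) for i = 1..L-1
    last_first_dist: sum_{x,y} d^I_{L,1}(x, y)
    """
    g = np.log1p(np.asarray(adjacent_dists, dtype=float)).reshape(-1, 1)  # Eq. (5)
    g_last = np.log1p(float(last_first_dist))
    features = []
    for k in range(1, max_k + 1):
        gmm = GaussianMixture(n_components=k).fit(g)  # EM estimation [12]
        # score_samples returns the log density; exponentiate for Eq. (9).
        features.append(float(np.exp(gmm.score_samples([[g_last]])[0])))
    return features
```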

3.4. Continuity of the Region of Interest

In loop video, the object of interest also needs to be continuous between the last and the first frames. Therefore, we define the distance between the centers of gravity of the saliency maps of the last and first frames:

    F^{Sal} = \| C^s_L - C^s_1 \|^2                              (11)

Here, saliency is extracted using [13]. In addition, the saliency centroid of the (L+1)-th frame (i.e., the first frame), C^s_{est}, is predicted from the centroids of the last n frames using the Kalman filter [14], and its distance to C^s_1 (F^{Sal\_est}) is calculated. The normalized distance (F^{Sal\_nest}) is also defined as a feature:

    F^{Sal\_est} = \| C^s_{est} - C^s_1 \|^2                     (12)
    F^{Sal\_nest} = \frac{F^{Sal\_est}}{Z}, \quad Z = \frac{\sum_{i=L-n}^{L} \| C^s_i - C^s_{i-1} \|^2}{n+1}    (13)

Z is the average distance between adjacent centroids over the last n frames; n is set to five in this paper.
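The following is a minimal sketch of Eqs. (11)-(13), assuming the saliency centroids (computed from saliency maps as in [13]) are already available. OpenCV's KalmanFilter with a constant-velocity model stands in for the estimator of [14]; the function name and noise settings are ours.

```python
import numpy as np
import cv2

def roi_continuity_features(centroids, first_centroid):
    """F^Sal, F^Sal_est, F^Sal_nest of Eqs. (11)-(13).

    centroids: (n+1, 2) saliency centroids of the last frames (ending at L)
    first_centroid: saliency centroid C^s_1 of the first frame
    """
    kf = cv2.KalmanFilter(4, 2)  # state (x, y, vx, vy); measurement (x, y)
    kf.transitionMatrix = np.array([[1, 0, 1, 0], [0, 1, 0, 1],
                                    [0, 0, 1, 0], [0, 0, 0, 1]], np.float32)
    kf.measurementMatrix = np.array([[1, 0, 0, 0], [0, 1, 0, 0]], np.float32)
    kf.processNoiseCov = 1e-2 * np.eye(4, dtype=np.float32)
    kf.measurementNoiseCov = 1e-1 * np.eye(2, dtype=np.float32)
    kf.statePost = np.array([[centroids[0][0]], [centroids[0][1]],
                             [0], [0]], np.float32)
    for c in centroids[1:]:                    # track the last centroids
        kf.predict()
        kf.correct(np.asarray(c, np.float32).reshape(2, 1))
    c_est = kf.predict()[:2, 0]                # predicted centroid of frame L+1
    c_l = np.asarray(centroids[-1], float)
    c_1 = np.asarray(first_centroid, float)
    f_sal = float(np.sum((c_l - c_1) ** 2))          # Eq. (11)
    f_sal_est = float(np.sum((c_est - c_1) ** 2))    # Eq. (12)
    steps = np.diff(np.asarray(centroids, float), axis=0)
    z = float(np.sum(steps ** 2)) / len(steps)       # Z: mean adjacent step
    return f_sal, f_sal_est, f_sal_est / max(z, 1e-9)  # Eq. (13)
```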

3.5. Feature Selection

We select features by best-first search: starting from the single best feature, the feature that improves the performance the most is added, as long as the performance keeps improving.

4. EXPERIMENTS

We conducted two experiments: loop/non-loop classification and five-class DoL classification. We used an SVM [16] with an RBF kernel and evaluated the performance by 10-fold cross-validation. The kernel parameters (C, γ) were optimized by a grid search. We also optimized the learning weight of each class as the reciprocal ratio of its number of samples, applying cost-sensitive learning [17] to address the imbalanced-data problem.
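A sketch of this classifier setup is shown below, assuming the selected features are stacked in X with labels in y. scikit-learn's SVC wraps LIBSVM [16], class_weight="balanced" implements the reciprocal class weighting, and the grid ranges are typical LIBSVM-style defaults rather than values stated in the paper.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# class_weight='balanced' weights each class inversely to its frequency,
# i.e., the reciprocal-ratio cost-sensitive weighting described above.
param_grid = {"C": [2.0 ** e for e in range(-5, 16, 2)],
              "gamma": [2.0 ** e for e in range(-15, 4, 2)]}
clf = GridSearchCV(SVC(kernel="rbf", class_weight="balanced"),
                   param_grid, cv=10)
# X: selected feature vectors (one row per video); y: loop/non-loop or
# five-class DoL labels.  clf.fit(X, y), then clf.predict(...) to evaluate.
```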

4.1. Dataset

We collected 1,022 videos (MPEG-4, 480 × 480, 30 fps) from Vine. All videos were retrieved with the query "vine.co #loop" on Twitter (https://twitter.com/), and all were resized to 240 × 240 to reduce the computational cost.

One of the authors subjectively evaluated the DoL score on a five-point scale from 1 (strongly non-loop) to 5 (perfect loop). Table 2 shows the distribution of the DoL scores. The DoL score is high when the last and the first frames share the same background or objects and are connected seamlessly. Additionally, in order to distinguish loop video from static but non-loop video, motion continuity in the video was considered: DoL scores were high if changes such as background, camera motion, or object motion exist and the changes are smooth.

Table 2. DoL score variation in the 1,022 videos.

DoL score          1     2     3     4    5
Number of videos   506   229   171   77   39

In the loop/non-loop classification, video clips with DoL scores of 4 and 5 were labeled as loop, and those with scores of 1 and 2 were labeled as non-loop. All 1,022 video clips were used for the five-class DoL assessment.

4.2. Feature Evaluation

Fig. 3 shows the best 20 features in terms of the Pearson's correlation coefficient between the subjectively evaluated DoL scores and each feature's values. F^{RGB} and F^{Br} outperform the other features, and the statistical features show high correlations. Because there is no previous DoL assessment method to compare against, we use all features (F^{RGB}, F^{Br}, F^{Opt}, F^{Sal}) that can be extracted from the first and the last frames alone as a baseline (OnlyAFD).

[Fig. 3. Best 20 features by Pearson's correlation coefficient with the gold DoL score (y-axis: absolute value of the coefficient, about 0.2-0.6; x-axis: feature name).]

Table 3 shows the result of loop/non-loop classification, and the features selected for the proposed method are listed in Table 4. The classification accuracy of OnlyAFD is 83.0%, and it improves by 2.7 points when the other features are also considered. An accuracy of 85.7% is good enough for filtering out non-loop video and retrieving only loop video.

Table 3. Result of loop/non-loop classification: (a) OnlyAFD (83.0% accuracy), (b) proposed (85.7% accuracy). The bottom-right cell shows macro-averaged recall \ precision.

(a)
Predicted\Gold   loop   non-loop   Precision
loop             82     111        0.43
non-loop         34     624        0.95
Recall           0.70   0.85       0.78 \ 0.69

(b)
Predicted\Gold   loop   non-loop   Precision
loop             83     89         0.48
non-loop         33     646        0.95
Recall           0.72   0.88       0.80 \ 0.72

Table 4. Added features and Matthews correlation coefficients (MCC) [15] of the proposed method (loop/non-loop classification).

Added feature   F^{RGB}   +F^{Sal_nest}   +F^{Opt_min}_5   +F^{Br}_3   +F^{Opt_8}_5   +F^{Opt_mean}_2
MCC             0.45      0.48            0.49             0.50        0.51           0.51

Next, we performed five-class DoL classification. The selected features are shown in Table 5, and the final result is given in Table 6. The diagonal values are the highest in each column and row, and the proposed method surpasses OnlyAFD by 3.9 points in overall accuracy. F^{Opt} was selected first because this feature is good at detecting videos whose loopness score is 1, which account for about half of the clips in the dataset.

Table 5. Added features and accuracies of the proposed method (five-class classification).

Added feature   F^{Opt}   +F^{RGB}   +F^{Br}_3   +F^{Opt_sum}_5   +F^{Br}_5   +F^{Opt_med}_4   +F^{Opt_7}_3   +F^{Opt_sum}_3   +F^{Opt_5}_4
Accuracy (%)    55.1      57.2       58.6        59.8             60.7        60.9             61.2           61.4             61.5

Table 6. Result of five-class DoL classification: (a) OnlyAFD (57.6% accuracy), (b) proposed (61.5% accuracy). P. = predicted, G. = gold; the bottom-right cell shows macro-averaged recall \ precision.

(a)
P.\G.    1      2      3      4      5      Precision
1        403    75     28     16     4      0.77
2        81     93     48     15     6      0.38
3        7      40     64     17     5      0.48
4        0      2      12     10     5      0.34
5        15     19     19     19     19     0.21
Recall   0.80   0.41   0.37   0.13   0.49   0.44 \ 0.44

(b)
P.\G.    1      2      3      4      5      Precision
1        397    65     26     15     4      0.78
2        74     112    50     13     5      0.44
3        9      31     75     19     2      0.55
4        10     8      10     21     4      0.40
5        16     13     10     9      24     0.33
Recall   0.78   0.49   0.44   0.27   0.62   0.52 \ 0.50

Fig. 4 shows the histogram of the DoL estimation error and the corresponding overall accuracy. In Fig. 4, an error of zero means that the estimated DoL coincided with the subjective evaluation. 61.5% of the videos are classified with no error and 25.7% with an error of one; that is, 87.2% of the videos are correctly classified within at most a one-class difference. This result is very promising for classifying loop/non-loop video from the video content alone.

[Fig. 4. Evaluation of the DoL estimation error for the proposed method and OnlyAFD: (a) histogram of estimated DoL class differences (number of videos vs. DoL error 0-5); (b) overall accuracy (%) vs. allowed DoL error 0-4.]
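The statistics of Fig. 4 can be reproduced from predictions as in the short sketch below (the function name is ours; integer class labels are assumed): a histogram of |predicted - gold| class differences and the overall accuracy when up to a given error is tolerated.

```python
import numpy as np

def dol_error_stats(pred, gold, n_classes=5):
    """Histogram of DoL errors (Fig. 4(a)) and cumulative accuracy (Fig. 4(b))."""
    errors = np.abs(np.asarray(pred, int) - np.asarray(gold, int))
    hist = np.bincount(errors, minlength=n_classes)    # counts for error 0..4
    cumulative = np.cumsum(hist) / float(len(errors))  # accuracy within error e
    return hist, cumulative

# cumulative[0] is the exact five-class accuracy (61.5% for the proposed
# method) and cumulative[1] the accuracy within one class (87.2%).
```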

5. CONCLUSION

We proposed a DoL assessment method for microvideo based on adjacent frame distances and their statistical loopness probability, computed from RGB, brightness, and optical flow, together with the saliency centroid trajectory. Experimental results on 1,022 videos collected from Vine showed 85.7% accuracy in loop/non-loop classification and 61.5% accuracy in five-class DoL classification. The proposed method is the first effort to detect loop video independently of spam hash tags. The technology can also be applied to guiding users to generate more sophisticated loop video.

6. REFERENCES

[1] Go Irie, Kota Hidaka, Takashi Satou, Toshihiko Yamasaki, and Kiyoharu Aizawa, "A degree-of-edit ranking for consumer generated video retrieval," in Proceedings of the International Conference on Multimedia and Expo. IEEE, 2009, pp. 1242-1245.

[2] Shikui Wei, Yao Zhao, Zhenfeng Zhu, and Nan Liu, "Multimodal fusion for video search reranking," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 8, pp. 1191-1199, 2010.

[3] Xinmei Tian, Dacheng Tao, and Yong Rui, "Sparse transfer learning for interactive video search reranking," ACM Transactions on Multimedia Computing, Communications, and Applications, vol. 8, no. 3, p. 26, 2012.

[4] Miriam Redi, Neil O'Hare, Rossano Schifanella, Michele Trevisiol, and Alejandro Jaimes, "6 seconds of sound and vision: Creativity in micro-videos," in Proceedings of Computer Vision and Pattern Recognition. IEEE, 2014.

[5] Arno Schodl, Richard Szeliski, David H. Salesin, and Irfan Essa, "Video textures," in Proceedings of Computer Graphics and Interactive Techniques. ACM Press/Addison-Wesley, 2000, pp. 489-498.

[6] Vivek Kwatra, Arno Schodl, Irfan Essa, Greg Turk, and Aaron Bobick, "Graphcut textures: image and video synthesis using graph cuts," in ACM Transactions on Graphics. ACM, 2003, vol. 22, pp. 277-286.

[7] Aseem Agarwala, Ke Colin Zheng, Chris Pal, Maneesh Agrawala, Michael Cohen, Brian Curless, David Salesin, and Richard Szeliski, "Panoramic video textures," in ACM Transactions on Graphics. ACM, 2005, vol. 24, pp. 821-827.

[8] Alex Rav-Acha, Yael Pritch, Dani Lischinski, and Shmuel Peleg, "Dynamosaicing: Mosaicing of dynamic scenes," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 10, pp. 1789-1801, 2007.

[9] Zicheng Liao, Neel Joshi, and Hugues Hoppe, "Automated video looping with progressive dynamism," ACM Transactions on Graphics, vol. 32, no. 4, p. 77, 2013.

[10] Ce Liu, Beyond pixels: exploring new representations and applications for motion analysis, Ph.D. thesis, Massachusetts Institute of Technology, 2009.

[11] Douglas Reynolds, "Gaussian mixture models," Encyclopedia of Biometrics, pp. 659-663, 2009.

[12] Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, vol. 39, no. 1, pp. 1-38, 1977.

[13] Erkut Erdem and Aykut Erdem, "Visual saliency estimation by nonlinearly integrating features using region covariances," Journal of Vision, vol. 13, no. 4, p. 11, 2013.

[14] Rudolph Emil Kalman, "A new approach to linear filtering and prediction problems," Journal of Basic Engineering, vol. 82, no. 1, pp. 35-45, 1960.

[15] Brian W. Matthews, "Comparison of the predicted and observed secondary structure of T4 phage lysozyme," Biochimica et Biophysica Acta - Protein Structure, vol. 405, no. 2, pp. 442-451, 1975.

[16] Chih-Chung Chang and Chih-Jen Lin, "LIBSVM: a library for support vector machines," ACM Transactions on Intelligent Systems and Technology, vol. 2, no. 3, p. 27, 2011.

[17] Charles Elkan, "The foundations of cost-sensitive learning," in Proceedings of the International Joint Conference on Artificial Intelligence, 2001, vol. 17, pp. 973-978.