
ICWMMN2015 Proceedings

Characterizing User Popularity Preference in a Large-Scale Online Video Streaming System

Xiaoying Tan1, Yuchun Guo1, Yishuai Chen1, Wei Zhu2
1 School of Electrical and Information Engineering, Beijing Jiaotong University, Beijing
2 PPTV, Shanghai, China
1 {10111006, ychguo, yschen}@bjtu.edu.cn 2 [email protected]

Abstract—Most services present a uniform ranking list according to video popularity in order to satisfy users’ potential demands. However, recent work claims that users of video systems show distinct preferences for video popularity. Understanding these preferences has important implications for designing content distribution mechanisms and improving recommendation accuracy. Users’ Popularity Preferences (PPs) in online video streaming services, however, have not been well characterized. In this paper, we characterize users’ PPs and analyze their differences based on a large-scale trace of user watching behaviors in one of the largest online video streaming providers in China. We evaluate the central tendency, dispersion tendency and asymmetry of users’ PP distributions. Through hypothesis testing, we prove that users’ PP distributions are not homogeneous. Specifically, users prefer videos with different popularities, and the widths of their favorite popularity ranges are different. Within a user’s favorite popularity range, she prefers more popular videos.

Keywords—online video; streaming system; video popularity; user preference

I. INTRODUCTION

Recent years have witnessed a huge increase in the traffic of online videos, driven by their huge commercial interest [1]. Correspondingly, many researchers have studied multiple aspects of these systems. Among these aspects, video popularity [2, 3, 4], an important metric for evaluating the commercial value of videos, has attracted widespread attention.

In fact, users may have different preferences for video popularity. Understanding these preferences has important implications for designing content distribution mechanisms and improving recommendation accuracy. Conversely, ignoring the differences may mislead the understanding of user interest and, in turn, the design of content distribution. For example, Goel et al. claimed that not all users prefer hot items; instead, they have different preferences for the popularity of videos, movies and web pages [5, 6]. This result implies that tilting resources too far toward hot videos may irritate the fans of niche items.

Users’ Popularity Preferences (PPs), however, have not been well characterized yet. Oh et al. [6] considered users’ PPs in recommendation algorithms and improved the accuracy as well as the novelty of their recommendation results. However, they only gave several examples to illustrate users’ PPs and did not characterize users’ PP distributions systematically. Goel et al. proved that users have diverse PPs [5], but did not extensively examine the general characteristics of their PP distributions.

Moreover, to the best of our knowledge, there is no relevant study of online video streaming systems specifically. Compared with the systems studied in existing works [5][6], including the video rating systems IMDB and MovieLens, the movie rental system Netflix, the online music system Yahoo!Music, and a web search system (surveyed by Nielsen), online streaming systems have different characteristics with respect to customers, user access methods and user behaviors.

In this paper, we focus on statistically characterizing users’ PP distributions and examining the homogeneity of such distributions based on a large-scale, accurate user watching behavior trace from PPTV, one of the largest online video streaming providers in China. To this end, we evaluate three important statistical properties of users’ PP distributions, i.e., central tendency, dispersion tendency and asymmetry, and then compare them with those of homogeneous PP distributions obtained via a homogeneous null model. We prove that users’ PP distributions are not homogeneous. Specifically, users prefer videos with different popularities, and the widths of their favorite popularity ranges are different. Within a user’s favorite popularity range, however, she still prefers more popular videos. The characteristics of users’ PP distributions in PPTV are different from those in the movie rental system Netflix and in a web search system surveyed by Nielsen.

The remainder of this paper is organized as follows. In Section 2, we describe the dataset on which our study is based. In Section 3, we measure the video popularity distribution in our dataset and compare it with those in other systems. Section 4 characterizes users’ empirical popularity preferences. Section 5 reviews related work. Finally, we conclude and discuss our future work in Section 6.

II. DATASET

We collected a large dataset consisting of accurate user watching behavior records from PPTV [7], one of the largest online video streaming systems in China. Our dataset was collected from March 23rd to 28th, 2011, and includes more than 10 million user watching sessions, 20 thousand unique videos and 1 million users.

We focus our analysis on movies, as they are an important video type. We filter out sessions in which the watching time is shorter than 30 seconds, which often result from surfing or an unintentional click. We also filter out users who have fewer than 20 watching sessions during the trace period, to ensure that each user has enough watching samples. The threshold of 20 is chosen by experience, similar to [6]. The resulting dataset includes more than 20 thousand movies, 90 thousand users and more than 2 million sessions.

III. POPULARITY DISTRIBUTION

We define the popularity of a video as the total number of sessions referring to the video during the whole measurement period. In Figure 1, we plot the popularity distribution in PPTV and redraw the distributions in other systems according to the figures in related works [5, 8]. The compared systems include Netflix, Yahoo!Music, a Web search system, and YouTube.

As shown in Figure 1, the popularity distribution in PPTV is less curved than that in Netflix, but more curved than those in YouTube and the web search system. A more curved distribution means a flatter head and a steeper tail. In other words, it means the popularity differences between the hot videos are more significant, and those between the cold videos are less significant.

Such differences between the distributions are caused by complex reasons, including the differences between systems, contents or users. For example, compared with renting and rating a video in Netflix, clicking on a video in PPTV is free and much more effortless, which supports the view that online video streaming systems need a specific study.

Fig. 1. The measured movie popularity distribution of PPTV and the redrawn popularity distributions in other systems measured in the literature [5, 8]. The systems compared include a video rating system Netflix, a web browsing system surveyed by Nielsen and a UGC video system YouTube. (Axes: normalized rank vs. normalized popularity, both logarithmic; curves: PPTV, Netflix, Web Search, YouTube.)

IV. ANALYSIS OF USER POPULARITY PREFERENCES

In this section, we evaluate statistical characteristics of users’ PP distributions in PPTV. To answer the question of whether users’ PPs follow a homogeneous distribution, we compare them with those obtained under a homogeneous null model.

A. Definition of User Popularity Preference

Before analyzing user PPs, we first assign each user a PP sequence, defined as follows. After ranking videos according to their popularities in non-increasing order, a click on a video at a given rank is regarded as a click on the ranking position of the video. (For clarity, the popularity ranking position is hereinafter referred to as popularity.) For a user, the popularities she clicked during the whole measurement period are collected to constitute her PP sequence. Note that if a user clicks multiple videos at the same ranking position, this popularity is repeated that many times in her sequence.

We present three toy examples of user PP sequences in Table I. With these sequences, we can analyze users’ PPs in the following sections.

TABLE I. TOY EXAMPLES OF USER PP SEQUENCES AND THEIR CHARACTERISTICS

ID     PP sequence                  Median  CV     Skewness
User1  14, 15, 16, 17, 18           16      0.099  0
User2  4, 8, 16, 40, 112            16      1.242  0.447
User3  302, 567, 1490, 4052, 11013  1490    1.280  0.447

B. Characteristic Metrics

We evaluate three characteristics of a user’s PP sequence, i.e., central tendency, dispersion tendency and skewness. These characteristics are commonly used to describe distributions in statistics.

• Central Tendency: We use the median, the middle point of an ordered history sequence, to measure the central tendency of the user’s PP. Compared with other candidate measures, such as the mean or the mode, the median is more suitable for a discrete sequence. The smaller the median of a user’s PP distribution, the more she prefers popular videos on average.

• Dispersion Tendency: Dispersion tendency is also an important characteristic of a distribution. It describes the width of a user’s PP range. We use a dimensionless measure, i.e., the coefficient of variation (CV), to compare PP dispersion tendencies among users regardless of their different central tendencies. It is defined as the ratio of the standard deviation to the mean of a PP sequence. A larger CV means a user has a wider PP range.

• Skewness: To compare the asymmetry levels of users’ PP distributions, we evaluate their PP skewnesses by a nonparametric metric, i.e., $\text{Skewness} = \frac{\text{Mean} - \text{Median}}{\text{Standard Deviation}}$. The larger the absolute value, the more asymmetric the PP distribution. A positive skewness value means the user prefers more popular videos within her favorite popularity range, regardless of the central tendency of her PP distribution.

Table I shows the three characteristics of the three toy example users. The results demonstrate that all three characteristics are necessary to clearly present a PP distribution; they complement each other. For example, User 1 and User 2 have the same median, but their CVs are different, meaning that their PP distributions are quite different from each other. User 2 and User 3 have a similar skewness, but the difference between their PP medians indicates that their PPs are totally different, that is, User 2 is a fan of popular videos whereas User 3 is a fan of niche ones.
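As a rough check of the three metrics, the following Python sketch recomputes them for the toy sequences of Table I. It assumes the sample standard deviation (n - 1 denominator), which the text does not specify; the helper name pp_metrics is hypothetical and used only for illustration.

```python
import statistics


def pp_metrics(pp_sequence):
    """Return (median, CV, nonparametric skewness) of a PP sequence.

    Assumption: the standard deviation is the sample standard
    deviation (n - 1 denominator); the paper does not state which
    variant it uses.
    """
    mean = statistics.mean(pp_sequence)
    median = statistics.median(pp_sequence)
    std = statistics.stdev(pp_sequence)  # sample standard deviation
    cv = std / mean                      # coefficient of variation
    skewness = (mean - median) / std     # nonparametric skew
    return median, cv, skewness


# Toy PP sequences copied from Table I (popularity ranks clicked by each user).
toy_users = {
    "User1": [14, 15, 16, 17, 18],
    "User2": [4, 8, 16, 40, 112],
    "User3": [302, 567, 1490, 4052, 11013],
}

for name, sequence in toy_users.items():
    median, cv, skewness = pp_metrics(sequence)
    print(f"{name}: median={median}, CV={cv:.3f}, skewness={skewness:.3f}")
```

Under this assumption, the computed values round to the entries of Table I, which suggests that the table was produced with the sample standard deviation.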

C. Homogeneous Null Model

To conduct a hypothesis test, we build a null model. The null model assumes that users are homogeneous and watch videos according to the same PP distribution. Similar to [5], it is simulated as follows. Each user randomly selects a video with a probability proportional to the video’s popularity, and repeats the selection without replacement. The number of selections equals the empirically observed number of movies she has watched. In this way, the distributions of video popularity and user activity in the simulation approximately conform to those in the measured trace.

D. Characterization of PP Distributions

We compare the characteristics of users’ PP distributions in PPTV with those obtained in the null model.

Figure 2 plots the distributions of users’ PP medians in PPTV and in the null model. As shown in Figure 2, the peak of the distribution in PPTV is located at around the top 1% of the ranking, meaning that most users in PPTV prefer popular videos on average, as expected. As shown, the distribution in PPTV is more evenly dispersed than that in the null model. The difference demonstrates a larger diversity among users’ PP distributions in PPTV than that assumed in the null model.

However, this difference is not the same as those in Netflix, Yahoo!Music or the web browsing system shown in the literature [5]. For comparison, we redraw the median distributions in these systems in Figure 2(b) and (c). As shown, the peak of the distribution in Netflix is located to the left of that in the null model, meaning that users in Netflix on average prefer more popular videos than assumed in the null model. On the contrary, the peak in PPTV is located to the right of that in the null model. Besides, the gap between the distribution in the Web search system and that in the null model is much more significant than that between PPTV and the corresponding null model. Such differences demonstrate that users’ PP characteristics in an online video streaming system are different from those in these systems.

Fig. 2. Distributions of users’ PP medians in reality and in the null model. We compare the distributions among systems: (a) PPTV, (b) Netflix (as shown in Figure 5(c) in literature [5]), (c) Nielsen (for Web browsing, as shown in Figure 5(c) in literature [5]). (Axes: rank in (a), eccentricity in (b) and (c), vs. probability; curves: Reality, Null Model.)

Figure 3 plots the cumulative distribution of users’ PP CVs in PPTV. As shown in Figure 3, more than 60% of the CVs are larger than 1. Since a distribution with CV > 1 is considered to be high-variance compared with the exponential distribution [9], the results in Figure 3 indicate that most users’ PP distributions scatter over a wide range. Furthermore, Figure 4 plots the distributions of users’ PP CVs in PPTV and in the null model. The interquartile range of the distribution in PPTV ([0.89, 1.38]) is more than twice as wide as that in the null model ([0.97, 1.19]). This means that some users have wide PP distributions while others do not. In other words, the widths of PP ranges are different among users.

Fig. 3. Cumulative distribution of users’ PP CVs in PPTV. (Axes: coefficient of variation vs. cumulative probability.)

Fig. 4. Distributions of users’ PP CVs in PPTV and in the null model. (Axes: coefficient of variation vs. probability; curves: Reality, Null Model.)

Finally, Figure 5 plots the distributions of users’ PP skewnesses in PPTV and in the null model. As shown in Figure 5, the majority of users in PPTV have positive skewnesses, meaning that most users’ PP distributions are asymmetric and users prefer more popular videos within their own PP ranges. In addition, the dominant asymmetric distributions confirm that the median is a better metric than the mean for measuring the central tendency of PP sequences.

Fig. 5. Distributions of users’ PP skewnesses in PPTV and in the null model. (Axes: relative skewness (%) vs. probability; curves: Reality, Null Model.)

Through the above analysis, we conclude that users’ PP distributions in PPTV are not homogeneous. Moreover, users prefer videos with different popularities, and the widths of their favorite popularity ranges are different. Within a user’s favorite popularity range, however, she still prefers more popular videos.

V. RELATED WORK

A. Popularity Distribution

Many works have measured and analyzed video popularity distributions. For example, one of the first measurement studies of a large VoD system [2] showed that video popularity matched the Zipf distribution. Zhou et al. [3] measured the dynamics of video popularity in an online VoD system in China and studied the factors that affect such dynamics, i.e., popularity’s age-sensitivity, constrained eyeballs and probability of replay.

It is widely accepted that understanding video popularity is important for system design, operation and recommendation. For example, Jayasundara et al. [4] estimated popularity ranks using the most recent k arrival times and subsequently updated the cache with the most popular movies at any given time. Based on an analysis of the dynamics of video popularity, Zhou et al. [3] proposed a mixed strategy to determine the videos cached on CDN servers for an overall high hit rate.

B. User Preference on Popularity

Discussion of users’ preferences or attitudes towards video popularity may be traced to discussions about the long tail phenomenon in the economic domain [5].

In the video domain, so far only Goel et al. [5] have focused on the analysis of users’ PPs. They analyzed the diversity of PPs between users in the Netflix rental system, Yahoo!Music and a Web search system. Their empirical and theoretical analysis results both supported that users have diverse preferences and everyone is a bit eccentric, consuming both popular and specialty products.

Such a standpoint is widely accepted in many works in various domains [10, 11, 12], but user PP has not been specifically studied in these works. Only Oh et al. [6] emphasized that users have personal popularity tendencies and considered their statistical PPs in recommendation algorithms. The results of their experiments on IMDB and MovieLens show that differentiating users’ PPs helps improve the novelty as well as the accuracy of recommendation. However, they did not statistically analyze the diversity among users but only plotted some users’ PP distributions as examples.

VI. CONCLUSION

Based on a large-scale user watching behavior trace from PPTV, we characterize users’ Popularity Preference distributions in online video streaming systems in this paper. We statistically prove that users are not homogeneous in terms of their PP distributions and their favorite popularity ranges.

In the future, there are many other vital research directions besides the characterization of users’ PP distributions. For instance, we are now working on the prediction of users’ PPs and their application in recommendation algorithms. We believe our findings and methods are not restricted to PPTV but also work in other online streaming systems.

ACKNOWLEDGMENT

This work was supported in part by the National Science Foundation of China under Grant No. 61572071, 61301082 and the Fundamental Research Funds for the Central Universities under Grant No. W14JB00500.

REFERENCES

[1] "Cisco Visual Networking Index: Forecast and Methodology, 2011–2016," May 30, 2012.
[2] Yu H, Zheng D, Zhao B Y and Zheng W. "Understanding user behavior in large-scale video-on-demand systems," in proceedings of ACM EuroSys, 2006, pp. 333-344.
[3] Zhou Y, Chen L, Yang C, and Chiu D M. "Video popularity dynamics and its implication for replication," IEEE transactions on multimedia, vol. 17, issue. 8, 2013, pp. 1273-1285.
[4] Jayasundara C, Nirmalathas A, Wong E and Nadarajah N. "Popularity-aware caching algorithm for Video-on-Demand delivery over broadband access networks," in proceedings of the global telecommunications conference (GLOBECOM 2010), 2010, pp. 1-5.
[5] Goel S, Broder A, Gabrilovich E and Pang B. "Anatomy of the long tail: ordinary people with extraordinary tastes," in proceedings of the third ACM international conference on Web search and data mining, 2010, pp. 201-210.
[6] Oh J, Park S, Yu H, Song M. "Novel recommendation based on personal popularity tendency," in proceedings of the 11th international conference on data mining (ICDM), 2011, pp. 507-516.
[7] http://www.pptv.com
[8] Guillemin F, Kauffmann B, Moteau S and Simonian A. "Experimental analysis of caching efficiency for YouTube traffic in an ISP network," in proceedings of 25th international conference on teletraffic congress (ITC), 2013, pp. 1-9.
[9] https://en.wikipedia.org/wiki/Coefficient_of_variation
[10] Carrascal J P, Riederer C, Erramilli V, Cherubini M, and de Oliveira R. "Your browsing behavior for a big mac: Economics of personal information online," in proceedings of the 22nd international conference on World Wide Web, 2013.
[11] Noble, J A. "Minority voices of crowdsourcing: why we should pay attention to every member of the crowd," in proceedings of the ACM 2012 conference on computer supported cooperative work companion, 2012, pp. 179-182.
[12] Peña-Ortiz R, Gil JA, Sahuquillo J, Pont A. "Analyzing web server performance under dynamic user workloads," Computer communications, vol.36, issue.4, 2013, pp. 386-395.
