A User-selectable Obscuration Framework to Censor Digital Videos for Children and Adolescents

Jiayan GUO, David LEONG, Jonathan SIANG, and Vikram BAHL
School of Infocomm, Republic Polytechnic, Singapore

ABSTRACT

There is increasing concern among parents, educators and policy-makers about the negative influence that digital media exerts on children and adolescents. Such concerns have fueled a growing need to effectively filter potentially harmful content. However, existing technologies have limited ability to let users adjust the filtering levels or to generate seamless cutting results. To tackle this limitation, we propose a framework which empowers parents and teachers to censor movies and TV shows according to their level of acumen and discretion. The framework helps parents and teachers protect children and adolescents against obscene content. In particular, it enables parents and teachers to sanitize movies and TV shows by skipping over specific objectionable scenes. Moreover, the framework can blur out the unsavory objects in the scenes, so that the integrity of the storyline is preserved. Technically, it utilizes non-rigid object tracking and video masking techniques to blur out the unwanted object. Instead of physically altering the original videos or making replicated copies, our framework keeps the original video unscathed by applying the censorship to the video during playback. We conducted evaluations on challenging real-world video sequences, and the experimental results demonstrated the effectiveness of the proposed framework.

Keywords: Video Censorship, Non-Rigid Object Tracking, Mean Shift, Scale and Orientation Adaptive, Video Masking.

1. INTRODUCTION

With the rapid development of information technology, children and adolescents have unprecedented access to digital media. Given the double-edged sword digital media has become, parents, teachers and policymakers have concerns about the negative impact that digital media exerts on children and adolescents. Many parents believe that digital media is a major contributor to young people's violent or sexual behaviors [1]. This leads to a growing need for digital media content regulation and censorship. However, current censorship is mild and insufficient: it is applied only to a small number of films or TV shows that have explicitly sensitive or offensive content. Over the last three years, out of 2,351 films classified, only nine films (0.4%) were censored. Many violent and sexual scenes are still allowed if they are relevant to the theme and storyline of the film. Such harmful content thus remains accessible to viewers, and children are exposed to a vast amount of inappropriate media content.

According to the Singapore Censorship Review Committee 2010 report [2], parents are encouraged to take responsibility for protecting their children against the negative aspects of media proliferation. To guide their children's digital media consumption, parents must be empowered with effective tools to filter sex, violence and profanity out of digital media. However, existing tools have limited capabilities to effectively block potentially harmful content. Presently, ClearPlay and MovieMask are the two forefront computer programs that cleanse movies containing offensive scenes. Their technologies are built into stand-alone DVD players and other video devices. While both can mask objectionable content, their technical operations and capabilities are considerably different. Specifically, for ClearPlay, users have to download two components: the software and a filter associated with a particular DVD movie. The filter can guide the DVD player to mute dialogue or skip scenes during playback of the corresponding movie. Such a filter is pre-defined, and users cannot customize it to their needs. In contrast, MovieMask allows users to personalize the blocking of harmful content. Technically, it first lets users select the scenes to be edited, together with some graphics/animations. It then censors the movie by overlaying the graphics/animations onto the selected scenes. However, this technology affects the integrity of the original movie: the resultant censored movie is jagged, and the scene-to-scene cuts are noticeable.

Motivated by the above observations, we propose a novel framework to help parents and teachers censor movies and TV shows. The framework enables parents and teachers to select the obscuration according to their level of acumen and discretion. With our framework, parents and teachers can sanitize movies and TV shows by skipping over specific scenes that contain nudity, sexual situations or excessive violence. Moreover, the framework can also be used to blur out the unsavory objects in the scenes, which effectively preserves the continuity of the storyline. Take the painting scene in "Titanic" with Kate Winslet posing nude as an example: our framework allows users to pause the video at the current frame and select her body as the target area to be blurred out. It then automatically tracks the target area throughout the video with a non-rigid object tracking technique. After the target area is located, our framework applies a video masking technique to blur out the target area in every frame in which it appears. Finally, the censored video is presented to the user during playback. Unlike existing sanitizing tools such as CleanFlicks¹ and Family Flix, our framework does not physically alter the original videos or make altered copies. The censorship is applied to the video only during playback; that is, the original video remains untouched and thus no copyright issues arise.

In this paper, we present the following three contributions. Firstly, our framework empowers parents and teachers to filter harmful content out of movies or TV shows based on their own standards and preferences. Secondly, the non-rigid object tracking and video masking techniques in our framework enable users to blur out unsavory objects. In particular, the non-rigid object tracking technique outperforms current state-of-the-art approaches: it has low computational complexity and is easy to implement, and it can handle a large variety of objects with different color/texture patterns while remaining robust to partial occlusions, significant clutter, target scale variations, rotation in depth, and changes in camera position. The video masking technique is employed to blur out the objectionable target area; it is not only effective in blocking unwanted gore and salacious content from young viewers, but it also preserves the veracity of the entire film, leaving its artistic vision intact. Lastly, since our framework does not physically alter the original videos or make modified copies, it does not infringe on copyright law.

The rest of the paper is organized as follows. Section II provides a brief overview of our framework. The details of the non-rigid object tracking algorithm used in our framework are elaborated in Section III. In Section IV, the video masking is described in detail. Section V provides a snapshot of our framework and presents the experimental results; in addition, our non-rigid object tracking algorithm is compared against two popular tracking algorithms and the experimental results are evaluated. Finally, we conclude the paper in Section VI.

This work was supported by grant MOE2011-TIF-1-G-019 from the Ministry of Education, Singapore.
¹ Found to be illegal by a 2006 District of Colorado court ruling.

2. OUR FRAMEWORK

Our framework aims to filter potentially harmful content out of videos according to the preferences of parents and teachers. The filtering can either skip over the objectionable scenes, or blur out the unsavory objects in the scenes.

To skip over objectionable scenes, we first segment the videos into frames and ask the users to mark some frames as potentially objectionable scenes. The videos are then censored by removing the objectionable scenes and shown to users during playback. Such skipping may affect the continuity of the video storyline. To enhance the censoring, we resort to the filtering strategy of blurring unsavory objects in the scenes. We illustrate the blurring strategy in Fig. 1. The strategy consists of four steps. We first invite the users to watch the movie or TV show and ask them to seek out the objectionable scenes. The users can pause the video and select the unsavory object in the current frame as a target area (also known as a Region of Interest). We then employ a non-rigid object tracking algorithm to find the target area throughout the video. Afterwards, a video masking technique is utilized to blur out that target area throughout the frames it covers. Finally, we play back the censored video to the users.

Fig. 1. Procedure of blurring unsavory objects in the scenes.

In the next two sections, we discuss the non-rigid object tracking algorithm and the video masking technique.

3. NON-RIGID OBJECT TRACKING ALGORITHM

Object tracking is a fundamental but challenging task in the field of computer vision. Several sources of uncertainty render tracking in real-world videos a highly non-trivial task, such as complex scene clutter, partial/full occlusions, non-rigid object deformation, and illumination change. A number of algorithms have been proposed to overcome these difficulties. Among the various tracking algorithms, the mean shift algorithm is well known for its simplicity and efficacy. The mean shift algorithm was first developed by Fukunaga and Hostetler [3] for data analysis, and was later introduced into the field of image processing by Cheng [4]. Comaniciu et al. [5] successfully applied the mean shift algorithm to object tracking. However, the classical mean shift tracking algorithm [5] does not solve the estimation of target scale and orientation changes. Bradski [6] further modified the mean shift tracking algorithm and developed the Continuously Adaptive Mean Shift (CAMSHIFT) algorithm, in which the moment of the weight image determined by the target model is used to estimate the object scale and orientation. Although not robust, it can handle various types of object movement in real time. Many tracking methods have been proposed to tackle the problem of target scale and orientation estimation. By exploring the relativity of the weight image and the Bhattacharyya coefficient between the target model and candidate model, Ning et al. [7] proposed a method to determine the scale and orientation of the target object. Zivkovic and Krose [8] employed the EM algorithm to estimate the position and the covariance matrix that describes the shape. Collins [9] adopted scale space theory [10] to estimate the scale of the target; unfortunately, it cannot handle rotation changes of the target, and its computational cost is very high.

In this paper, we present a scale and orientation adaptive mean shift based non-rigid object tracking algorithm. We employ moment features to estimate the scale and orientation of the target object.

Mean shift tracking algorithm

In object tracking, a target model is typically defined by an ellipsoidal region or a rectangle surrounding a region of interest in the image. A color histogram is widely employed to represent the target model because of its robustness to partial occlusions, invariance to scaling and rotation, and low computational cost. The pixel locations of the target model are denoted by $\{x_i^*\}_{i=1\ldots n}$, centered at the origin, with $n$ pixels. The probability of feature $u$ ($u = 1, 2, \ldots, m$) in the target model is computed as

    \hat{q}_u = C \sum_{i=1}^{n} k(\|x_i^*\|^2)\, \delta[b(x_i^*) - u],    (1)

where $k(x)$ is a convex, monotonically decreasing isotropic kernel profile, $\delta$ is the Kronecker delta function, and $b(x_i^*)$ is the histogram bin associated with the pixel at location $x_i^*$. The normalization constant is $C = 1 / \sum_{i=1}^{n} k(\|x_i^*\|^2)$.

Let $\{x_i\}_{i=1\ldots n_h}$ be the normalized pixel locations in the candidate region, centered around $y$ in the current frame. Similarly, the probability of feature $u$ in the target candidate model is given by

    \hat{p}_u(y) = C_h \sum_{i=1}^{n_h} k\!\left(\left\|\frac{y - x_i}{h}\right\|^2\right) \delta[b(x_i) - u],    (2)

where $h$ is the bandwidth and $C_h$ is the normalization constant

    C_h = 1 \Big/ \sum_{i=1}^{n_h} k\!\left(\left\|\frac{y - x_i}{h}\right\|^2\right).    (3)

To determine the similarity between the target model and the candidate model, the similarity function is defined as the Bhattacharyya coefficient

    \hat{\rho}(y) \equiv \rho[\hat{p}(y), \hat{q}] = \sum_{u=1}^{m} \sqrt{\hat{p}_u(y)\, \hat{q}_u}.    (4)

The mean shift algorithm finds the local maximum of the similarity function $\hat{\rho}(y)$ by iteratively sampling the candidate locations. The new target position $\hat{y}_1$ is calculated as a weighted sum of the pixels contributing to the model:

    \hat{y}_1 = \frac{\sum_{i=1}^{n_h} x_i w_i\, g\!\left(\left\|\frac{\hat{y}_0 - x_i}{h}\right\|^2\right)}{\sum_{i=1}^{n_h} w_i\, g\!\left(\left\|\frac{\hat{y}_0 - x_i}{h}\right\|^2\right)},    (5)

    w_i = \sum_{u=1}^{m} \sqrt{\frac{\hat{q}_u}{\hat{p}_u(\hat{y}_0)}}\, \delta[b(x_i) - u],    (6)

where $g(x) = -k'(x)$ is the negative derivative of the kernel profile.

Scale and orientation estimation

For target scale and orientation estimation, the moments of the weight image corresponding to $w_i$ are used. The mean location, scale and orientation are calculated as follows. First, find the zeroth moment, first order moments, and second order moments over $x$ and $y$:

    M_{00} = \sum_x \sum_y I(x, y),    (7)

    M_{10} = \sum_x \sum_y x\, I(x, y); \quad M_{01} = \sum_x \sum_y y\, I(x, y),    (8)

    M_{20} = \sum_x \sum_y x^2 I(x, y); \quad M_{02} = \sum_x \sum_y y^2 I(x, y); \quad M_{11} = \sum_x \sum_y x y\, I(x, y).    (9)

The mean location of the target candidate region is computed as

    (\bar{x}, \bar{y}) = (M_{10}/M_{00},\; M_{01}/M_{00}).    (10)

The second order central moments are determined by

    \mu_{20} = M_{20}/M_{00} - \bar{x}^2; \quad \mu_{11} = M_{11}/M_{00} - \bar{x}\bar{y}; \quad \mu_{02} = M_{02}/M_{00} - \bar{y}^2.    (11)

Then the target scale and orientation can be obtained by decomposing the covariance matrix as follows:

    \begin{pmatrix} \mu_{20} & \mu_{11} \\ \mu_{11} & \mu_{02} \end{pmatrix} = U S U^T,
    \quad \text{where } U = \begin{pmatrix} u_{11} & u_{12} \\ u_{21} & u_{22} \end{pmatrix}
    \text{ and } S = \begin{pmatrix} \lambda_1^2 & 0 \\ 0 & \lambda_2^2 \end{pmatrix}.    (12)

The eigenvectors $(u_{11}, u_{21})^T$ and $(u_{12}, u_{22})^T$ represent the orientation of the two major axes of the ellipse target, and $\lambda_1$ and $\lambda_2$ denote the estimated length and width of the ellipse target. In practice, these values are smaller than the length and width of the real target. The zeroth moment can be regarded as the real area of the target, $A_0 = M_{00}$. Therefore, the length $l$ and width $w$ can be computed as

    l = \sqrt{A_0 \lambda_1 / \lambda_2} = \sqrt{M_{00}\, \lambda_1 / \lambda_2},    (13)

    w = \sqrt{A_0 \lambda_2 / \lambda_1} = \sqrt{M_{00}\, \lambda_2 / \lambda_1}.    (14)

4. VIDEO MASKING

There are various types of visual masking techniques used in censorship to obscure objectionable images or videos from viewing. The most common ones are pixelization, censor bars, and fogging (blurring out). Pixelization obscures an image by displaying part or all of it at a remarkably low resolution. A censor bar is a black rectangle or square box used to occlude a small area in the image; for example, censor bars are used to cover the eyes of suspects at crime scenes. Fogging blurs out an area of a picture or movie. One drawback of pixelization is that the original image can be reconstructed by exploiting the additional moving images in the video. Another disadvantage of pixelization is that it does not blend well with the surroundings in an image. In contrast, the fogging technique is irreversible, and it blends smoothly with the surroundings in an image. Therefore, fogging is preferable over most other forms of masking.

In this paper, we use the fogging (blurring out) technique to block the unsavory object. Fig. 2 demonstrates our video masking technique. To generate the blurry effect, we convolve the tracking results obtained from Section III with a circular averaging filter.
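As a concrete illustration of this fogging step, the sketch below builds a circular averaging ("pillbox") kernel, convolves the frame with it, and composites the blurred pixels back only inside the tracked elliptical region, leaving the rest of the frame sharp. This is a minimal NumPy-only sketch; the helper names (`disk_kernel`, `fog_region`) are ours, and a production implementation would use an optimized image-processing library.

```python
import numpy as np

def disk_kernel(radius):
    """Circular averaging ('pillbox') filter: uniform weights inside a disk, normalized."""
    y, x = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    k = (x ** 2 + y ** 2 <= radius ** 2).astype(float)
    return k / k.sum()

def convolve2d(img, kernel):
    """Small direct 2-D convolution with edge replication (sufficient for a sketch)."""
    r = kernel.shape[0] // 2
    padded = np.pad(img, r, mode='edge')
    out = np.zeros_like(img, dtype=float)
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            w = kernel[dy + r, dx + r]
            if w:
                out += w * padded[r + dy: r + dy + img.shape[0],
                                  r + dx: r + dx + img.shape[1]]
    return out

def fog_region(img, center, axes, angle, radius=5):
    """Blur only the tracked elliptical target region (the Section III output):
    convolve the whole frame with a disk kernel, then copy the blurred pixels
    back inside the (rotated) ellipse so the rest of the frame stays sharp."""
    blurred = convolve2d(img, disk_kernel(radius))
    yy, xx = np.mgrid[0:img.shape[0], 0:img.shape[1]]
    c, s = np.cos(angle), np.sin(angle)
    xr = (xx - center[0]) * c + (yy - center[1]) * s
    yr = -(xx - center[0]) * s + (yy - center[1]) * c
    mask = (xr / axes[0]) ** 2 + (yr / axes[1]) ** 2 <= 1.0
    out = img.astype(float).copy()
    out[mask] = blurred[mask]
    return out
```

The ellipse center, axes and angle would come directly from the tracker's scale and orientation estimate, so the fogged area follows the target from frame to frame.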

Fig. 2. Video masking technique.

5. EXPERIMENTAL RESULTS AND DISCUSSIONS

In this section, we first present a snapshot of our framework. Then several real-world video sequences are used to evaluate the proposed framework. In addition, we compared the proposed non-rigid object tracking algorithm with the classical mean shift algorithm with adaptive scale [5] and the EM-shift algorithm [8].

Our proposed framework was implemented in the MATLAB R2012a programming environment. Fig. 3 shows a snapshot of the framework's graphical user interface.

Fig. 3. Snapshot of the proposed framework.

TABLE I below shows the details of the video sequences. Three of them are sitcoms broadcast on TV, while one is from a series. The first experiment is on a model runway show scene (see Fig. 4), in which the bikini model's bottom, undergoing scale change, is to be tracked and blocked. The first row shows frames from the original video sequence. The second row shows the tracking results; the initial target region is selected by the user, and the red ellipse represents the estimated target region. The third row shows the blurred-out censored results after video masking is carried out. The proposed algorithm was also tested on Kissing scenes 1 to 3, where the kissing faces, undergoing deformation, changes in scale and orientation, changes in illumination, partial and full occlusions, and sudden camera zoom-out, were tracked and blocked (see Fig. 5, Fig. 6 and Fig. 7).

TABLE I. VIDEO SEQUENCES USED FOR THE PROPOSED FRAMEWORK EVALUATION.

Video sequence       Size       Frame No.   fps
Model runway show    624x352    318         29
Kissing scene 1      1024x576   553         23
Kissing scene 2      1024x576   985         23
Kissing scene 3      1024x576   218         23

Fig. 4. Censoring model runway show sequence. First row: frames from original video sequence. Second row: tracking results. Last row: censored results. From left to right, frames 167, 188 and 209 are shown.

Fig. 7. Censoring kissing scene 3 sequence. First row: frames from original video sequence. Second row: tracking results. Last row: censored results. From left to right, frames 55, 61 and 72 are shown.

We compared the performance of our proposed algorithm with the adaptive-scale mean shift tracking algorithm in [5] and the EM-shift algorithm in [8]. The tracking performance of the three algorithms can be compared visually in Fig. 8 to Fig. 11.
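The quantitative comparison reported later in this section uses the mean localization error (MLE) and the true area ratio (TAR). As a reference for how such metrics can be computed from per-frame tracker output, here is a small Python/NumPy sketch; the function names and the mask-based formulation are our own illustration, not taken from the paper.

```python
import numpy as np

def ellipse_mask(shape, center, axes, angle=0.0):
    """Boolean mask of an (optionally rotated) ellipse, e.g. a tracker's output region."""
    yy, xx = np.mgrid[0:shape[0], 0:shape[1]]
    c, s = np.cos(angle), np.sin(angle)
    xr = (xx - center[0]) * c + (yy - center[1]) * s
    yr = -(xx - center[0]) * s + (yy - center[1]) * c
    return (xr / axes[0]) ** 2 + (yr / axes[1]) ** 2 <= 1.0

def mle(pred_centers, gt_centers):
    """Mean localization error: average Euclidean distance between predicted
    and ground-truth target centers over all frames."""
    d = np.linalg.norm(np.asarray(pred_centers) - np.asarray(gt_centers), axis=1)
    return d.mean()

def tar(pred_mask, gt_mask):
    """True area ratio: overlap between the tracking result and the
    human-annotated ground truth, divided by the ground-truth area."""
    return (pred_mask & gt_mask).sum() / gt_mask.sum()
```

A perfectly tracked frame gives TAR = 1, while a drifting tracker lowers both metrics, which is why they are sensitive to errors in scale and orientation estimation.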

Fig. 5. Censoring kissing scene 1 sequence. First row: frames from original video sequence. Second row: tracking results. Last row: censored results. From left to right, frames 260, 266 and 293 are shown.

Fig. 8. Tracking model runway show sequence. First row: Scale Adaptive Mean Shift. Second row: EM-Shift. Last row: our proposed algorithm. From left to right, frames 171, 182 and 193 are shown.

Fig. 6. Censoring kissing scene 2 sequence. First row: frames from original video sequence. Second row: tracking results. Last row: censored results. From left to right, frames 1, 82 and 96 are shown.

Fig. 9. Tracking kissing scene 1 sequence. First row: Scale Adaptive Mean Shift. Second row: EM-Shift. Last row: our proposed algorithm. From left to right, frames 261, 266 and 296 are shown.

Fig. 10. Tracking kissing scene 2 sequence. First row: Scale Adaptive Mean Shift. Second row: EM-Shift. Last row: our proposed algorithm. From left to right, frames 33, 75 and 161 are shown.

Fig. 11. Tracking kissing scene 3 sequence. First row: Scale Adaptive Mean Shift. Second row: EM-Shift. Last row: our proposed algorithm. From left to right, frames 61, 73 and 81 are shown.

To evaluate the competing methods, TABLE II lists the mean localization errors (MLE) and the true area ratios (TAR) of the three methods on the four real video sequences. The TAR is defined as the ratio of the overlapped area between the tracking result and the human-annotated ground truth to the area of the human-annotated ground truth. The MLE and TAR are closely related to the scale and orientation estimation of the target being tracked. TABLE II shows that our proposed method achieves the best performance among the three tracking methods. From the comparison results, we can see that the scale-adaptive mean shift tracking algorithm performs well in tracking the target object; however, it cannot estimate the orientation of the target. The EM-shift algorithm fails to localize the object center accurately and wrongly estimates the scale and orientation of the target; in addition, its tracking area tends to keep shrinking or enlarging. The experimental results show that our proposed algorithm tracks the target object well throughout the video sequences. It is robust to partial/full occlusions, complex cluttered backgrounds, object scale variations, rotation in depth, changes in camera position, and illumination variations.

TABLE II. THE MLE AND TAR VALUES BY THE COMPETING TRACKING METHODS.

                     Scale Adaptive MS    EM-Shift          Our Proposed Method
Video sequence       MLE     TAR          MLE     TAR       MLE     TAR
Model runway show    4.17    87.52%       13.43   48.29%    3.26    98.43%
Kissing scene 1      8.38    72.64%       7.03    78.35%    2.42    92.16%
Kissing scene 2      15.08   28.41%       8.20    53.36%    4.77    85.52%
Kissing scene 3      2.94    89.13%       7.72    63.82%    2.61    94.11%

6. CONCLUSIONS

In this paper, we propose a framework that empowers parents and teachers to censor movies and TV shows according to their level of acumen and discretion. Within the framework, we developed an adaptive mean shift based non-rigid object tracking algorithm and a video masking technique to blur out unsavory objects in the scenes. The proposed framework was tested on challenging real-world video sequences, and the experimental results demonstrated its effectiveness and robustness. Our algorithm showed superior tracking performance compared with the adaptive-scale mean shift tracking algorithm and the well-known EM-shift algorithm.

7. REFERENCES

[1] V. Rideout, "Parents, Children, and Media," Menlo Park, CA: The Henry J. Kaiser Family Foundation, 2007.
[2] Singapore Censorship Review Committee 2010 report, http://www.mda.gov.sg/Public/Consultation/Documents/CRC_2010_Report.pdf.
[3] K. Fukunaga and L. Hostetler, "The Estimation of the Gradient of a Density Function, with Applications in Pattern Recognition," IEEE Transactions on Information Theory, vol. 21, no. 1, pp. 32-40, 1975.
[4] Y. Cheng, "Mean Shift, Mode Seeking, and Clustering," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 17, pp. 790-799, 1995.
[5] D. Comaniciu, V. Ramesh, and P. Meer, "Kernel-Based Object Tracking," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 5, pp. 564-577, May 2003.
[6] G. Bradski, "Computer Vision Face Tracking for Use in a Perceptual User Interface," Intel Technology Journal, vol. 2, pp. 1-15, 1998.
[7] J. Ning, L. Zhang, D. Zhang and C. Wu, "Scale and Orientation Adaptive Mean Shift Tracking," IET Computer Vision, vol. 6, iss. 1, pp. 52-61, 2012.
[8] Z. Zivkovic and B. Krose, "An EM-like Algorithm for Color-Histogram-Based Object Tracking," in IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 798-803, 2004.
[9] R. T. Collins, "Mean-Shift Blob Tracking through Scale Space," in IEEE Conference on Computer Vision and Pattern Recognition, pp. 234-240, 2003.
[10] T. Lindeberg, "Feature Detection with Automatic Scale Selection," International Journal of Computer Vision, vol. 30, iss. 2, pp. 77-116, 1998.