
Intelligent Cinematography and Editing: Papers from the AAAI-14 Workshop

EduCam: Cinematic Vocabulary for Educational Videos

Soja-Marie Morgens and Arnav Jhala
[email protected] and [email protected]
Computer Science Department, University of California, Santa Cruz, CA

Abstract

Cinematography relies on a rich vocabulary of shots and framing parameters to communicate event structures. This cinematic vocabulary stems from the physical capabilities of camera movement and uses aesthetic aspects to engage the audience and influence audience comprehension. At present, automatic physical camera control is limited in cinematic vocabulary. Virtual cameras, on the other hand, are capable of executing a variety of cinematic idioms. Specifically, shot composition, match cuts, and the establishing shot sequence are three common idioms often lacking in automated physical camera systems. This paper proposes an architecture called EduCam as a platform for employing virtual camera control techniques in the limited stochastic physical environment of lecture capture. EduCam uses reactive algorithms for real time camera coverage together with hierarchical constraint satisfaction techniques for offline editing.

Amateur video has become an important medium for disseminating educational content. The DIY community routinely creates instructional videos and tutorials, while professional educators publish entire courses on Gooru and Coursera (Coursera 2014; Gooru 2014). Video gives educators the ability to flip the classroom, pushing passive viewing outside of the classroom and bringing interactivity and discussion back in. With video, educators have access to cinematic idioms to shape the narrative of their material.

However, these idioms demand a high threshold of videography knowledge, equipment, and time. So educators fall back on static cameras with time-consuming editing, or on automatic lecture capture systems. Current automatic lecture capture systems are passive capture systems, created to record the oratory style of the educator addressing a live audience rather than to employ cinematic vocabulary. Their coverage typically consists of just a few cameras in the back of the classroom, which reduces their shot vocabulary to long and medium shots with limited shot composition capability.

Current virtual camera control algorithms are able to employ the full variety of shots and some cinematic transitions. However, they do not handle stochastic physical environments and typically assume unlimited camera coverage.

This paper focuses on three cinematic idioms that are common in professionally produced educational videos and documentaries: shot composition, match cuts, and the establishing shot sequence. It proposes EduCam, an architecture that allows virtual camera control algorithms to be employed in the physical lecture capture environment. EduCam uses real time reactive camera coverage to allow for offline editing with hierarchical constraint satisfaction algorithms. The paper evaluates current physical automatic camera control systems for their cinematic capacity, discusses current virtual cinematography, and finally outlines how EduCam allows for hierarchical constraint satisfaction editing in the physical world.

Copyright © 2014, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Figure 1: An example of good shot composition from The Maltese Falcon. The subject is placed on the upper-left third lines and his gaze is into the open space. (Bordwell 2014)

Cinematic Vocabulary

The three most common cinematic idioms used to engage audiences in content-driven film are shot composition, match cuts, and the establishing shot sequence.

Shot Composition ensures the shot is visually engaging and cues the audience to the subject of the shot. Generally, a good engaging composition employs the rule of thirds and is taken from an angle that emphasizes the subject. Subjects are typically placed on the side of the frame opposite of where they are moving or gazing (see Figure 1).
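To make the rule of thirds concrete, a minimal sketch of a composition scorer is given below. This is an editor's illustration rather than EduCam's implementation: the bounding-box format, the gaze input, and the fall-off constant are all assumptions.

    # Sketch: score a framing by the rule of thirds.
    def thirds_score(frame_w, frame_h, subject_box):
        """Return a 0..1 score: 1.0 when the subject center sits on a
        rule-of-thirds intersection, falling off with distance."""
        x, y, w, h = subject_box
        cx, cy = x + w / 2.0, y + h / 2.0
        # The four rule-of-thirds "power points".
        points = [(frame_w * i / 3.0, frame_h * j / 3.0)
                  for i in (1, 2) for j in (1, 2)]
        diag = (frame_w ** 2 + frame_h ** 2) ** 0.5
        d = min(((cx - px) ** 2 + (cy - py) ** 2) ** 0.5 for px, py in points)
        return max(0.0, 1.0 - d / (diag / 3.0))

    def leads_into_space(frame_w, subject_box, gaze_dx):
        """True when the subject sits on the side opposite its gaze
        (gaze_dx > 0 means looking right), leaving open space to look into."""
        x, y, w, h = subject_box
        return (gaze_dx > 0) == (x + w / 2.0 < frame_w / 2.0)

An offline editor can combine the two checks, preferring crops that score well on both.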

Match Cuts are used to smoothly transition an audience's eyes from one shot to another. The area of interest in the second shot is matched with the area of interest in the first shot, as shown in Figure 2. This keeps the audience's eyes on the subject and prevents "jitteriness" or loss of focus as the audience re-engages their eyes with the subject.

Figure 2: A match cut on the region of interest from Aliens. (Thompson 2011)

The Establishing Shot Sequence is used to signal a change of scene or subject. It consists of sequential shots that start with a long shot, transition through medium shots, and finally end on close shots of the subject. Used at the beginning of a film, it focuses the audience's attention on the subject and eases the audience into the story. Used as a chapter break, it signals the change to the audience and allows them time to refocus.
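Stated operationally, the match cut criterion above reduces to minimizing the screen distance between regions of interest across the cut. The sketch below is an editor's illustration under that reading; the ROI representation as normalized center coordinates is an assumption, not the paper's code.

    # Sketch: rank candidate next shots by how well their region of
    # interest (ROI) lines up with the ROI of the outgoing shot.
    def cut_mismatch(roi_a, roi_b):
        """Distance between two ROI centers given as (cx, cy) in 0..1
        frame coordinates; 0.0 is a perfect match cut."""
        return ((roi_a[0] - roi_b[0]) ** 2 + (roi_a[1] - roi_b[1]) ** 2) ** 0.5

    def best_match_cut(current_roi, candidate_shots):
        """candidate_shots: list of (shot_id, roi). Returns the shot whose
        ROI is closest to the current one, i.e. the smoothest cut."""
        return min(candidate_shots, key=lambda s: cut_mismatch(current_roi, s[1]))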
These three methods rely on the director's ability to manipulate multiple cameras along seven degrees of freedom according to the subject's position and future intentions. In computational cinematography, the virtual world allows perfect knowledge of a subject's location and near perfect knowledge of the subject's intentions. Coverage is not an issue, and most systems rely on multiple cameras. In lecture recording systems, however, knowledge of future intent is imperfect, editing algorithms are limited to real time operations, and cameras are limited in number and mobility. By using real time capture from multiple angles and zooms but non real time editing, the EduCam system allows a larger variety of cinematic idioms to be used without loss of quality.

Lecture Capture Domain

The lecture capture domain is an excellent testbed for bringing virtual cinematographic techniques into the real world. While educational video exists within a physically stochastic environment, it is one filled with discernible cues and recognizable subjects. There are just two subjects: a single instructor and the material, which is either written on the board or shown as a live demo. Movement is along a bounded plane and at a limited velocity, allowing tracking cameras to be set in pre-determined positions without significant loss of coverage. Facial recognition and motion extraction algorithms allow cameras to roughly track the subjects in real time, with finessed shot composition done offline. Cues to transition between the subjects can be taken from the narrator's gaze, gesture, and rhythm.

Physical Camera Control Systems

Automatic physical camera control systems can be divided into two categories: passive and active systems. Passive systems are unobtrusive and record the narrator's oration to a live audience, whether in a meeting or a classroom. Active systems are set up as the primary medium through which the narrator communicates.

Current lecture capture systems are passive and use real time editing. These systems typically consist of one or more static PTZ (Pan, Tilt, Zoom) cameras. Kushalnagar, Cavender, and Pâris use a single camera in the back of the classroom to capture the narrator and the material in a long shot. More complex setups, such as Liu et al. (2001) and Lampi, Kopf, and Effelsberg, involve a camera on the audience as well as a camera that can actively track the narrator. Lampi, Kopf, and Effelsberg do use some cinematic vocabulary: their system composes its shots according to the rule of thirds in its real time coverage.

These passive capture systems, limited by their coverage and real time editing constraints, employ simple auto directors without cinematic editing. Liu et al. (2001) and Lampi, Kopf, and Effelsberg transition based on shot length heuristics and physical buttons. Liu et al. (2002) and Kushalnagar, Cavender, and Pâris do not transition but show all streams simultaneously.

Active capture systems demonstrate both the capabilities of current technology for reactive real time automated control and the need for cinematic idioms to improve audience engagement (Chi et al. 2013; Birnholtz, Ranjan, and Balakrishnan 2010; Ranjan et al. 2010; Ursu et al. 2013). Ranjan et al. track the current speaker in a meeting through visual facial recognition and audio cues. Both methods prove to be noisy; however, the system's output was reported as comparable to a real time editing video crew, and it is capable of close, medium, and long shots. Birnholtz, Ranjan, and Balakrishnan use motion tracking of a hand to determine the material subject and when to transition between shots. Their system can employ long shots and close shots, and they found that the use of a variety of shots shapes audience engagement. Chi et al. create video using a single camera that follows the narrator's position via Kinect recognition algorithms. Their work demonstrated the ease with which a Kinect can track an object adequately during live recording, and the reluctance of a narrator to participate in changes of camera coverage.

Overall, these studies show that physical systems have the capability to do real time reactive camera coverage. However, while active systems do use a variety of shots, they do not often use cinematic vocabulary for shot transitions. They either have no transitions or use simple state machines that transition based on shot length heuristics and audio cues.

Virtual Camera Control

Automatic virtual camera control can generally be divided into three approaches: reactive, optimization-based, and constraint satisfaction (Christie, Olivier, and Normand 2008).

Reactive approaches are adapted from robotics and are best suited to real time control of camera coverage. They look for local optima near the camera's current position and generally focus on minimizing occlusion of the subject and optimizing shot composition. For example, Ozaki et al. use A* with a viewpoint entropy evaluation based on occlusion and camera collision with the environment.
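As a toy illustration of the reactive family (not Ozaki et al.'s algorithm; the goodness function here is a hypothetical stand-in for measures such as viewpoint entropy), one control step only examines viewpoints near the camera's current pan angle and greedily takes the local optimum:

    # Sketch: one step of a reactive pan controller.
    def reactive_step(pan, goodness, step=2.0):
        """goodness(angle) -> float, higher is better (e.g. low occlusion,
        good composition). Returns the locally best pan angle in degrees."""
        candidates = [pan - step, pan, pan + step]
        return max(candidates, key=goodness)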

Optimization approaches tend to focus on shot composition, while interactive editing systems provide experts with a range of temporal cinematic idioms of movement. Optimization approaches have used frame coherence and shot composition with static cameras (Halper and Olivier 2000). Interactive systems take directorial input to create local optima for use with cinematic vocabulary, such as the Semantic Volumes presented in Lino et al. (2010). However, both methods rely heavily on full knowledge of the environment and animation paths.

Constraint satisfaction approaches focus on a set of constraints on camera parameters that each shot must satisfy. While many of these approaches focus on shot composition parameters, Jardillier and Languénou also include temporal constraints, camera movement constraints, and framing to differentiate between different types of shots. However, pure constraint satisfaction approaches result in a complex constraint set and an unwieldy language for specifying these constraints, making it difficult for users to avoid over- and under-constraining their specifications. Hierarchical constraint satisfaction, as used in ConstraintCam, allows relationships between constraints to be specified, which enables better specification of relaxation priorities as well as the specification of possibly conflicting constraints for the constraint solver (Bares, Grégoire, and Lester 1998).
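The benefit of the hierarchy can be shown in a few lines: constraints carry priorities, and when the full set is unsatisfiable the solver relaxes the weakest constraints first instead of failing outright. The sketch below is a minimal editor's illustration of that idea, not ConstraintCam's solver.

    # Sketch: hierarchical constraint relaxation. Each constraint is a
    # (priority, predicate) pair; higher priority is relaxed last.
    def solve_with_relaxation(shots, constraints):
        """Drop the lowest-priority constraints until some candidate
        shot satisfies all constraints that remain."""
        active = sorted(constraints, key=lambda c: c[0], reverse=True)
        while active:
            ok = [s for s in shots if all(test(s) for _, test in active)]
            if ok:
                return ok, active
            active.pop()          # relax the lowest-priority constraint
        return shots, []          # everything relaxed: any shot passes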
Figure 3: The EduCam cameras. A Raspberry Pi controls the Kinect, which is used to run real time facial recognition and region of interest algorithms, while an HD camera passively records the scene. The Kinect depth field is synchronized offline with the HD coverage, and the constraint satisfaction algorithm digitally crops the footage for smoothness, match cuts, and composition.

Figure 4: An educator filmed by the four pan-tilt camera setup of EduCam. Camera #4 is on top of Camera #2. Each camera has panned so that the subject (board material or narrator) is on the right third of the shot in order to allow for a match cut. Camera #4 is filming an establishing shot in case there is a subject change.

EduCam

EduCam is a platform for employing computational cinematic control algorithms in a physical setting. This section outlines the physical elements of EduCam, its real time reactive algorithm for camera control, and its hierarchical constraint satisfaction rules for editing.

Physical Camera Setup

EduCam's setup allows for extended camera coverage and is able to track the two subjects appropriately in real time. EduCam uses a traditional four camera setup about the action line, as shown in Figure 4. The two outside cameras capture visually engaging medium and close shots. One middle camera captures long and medium shots, and the other middle camera takes close material shots. Camera placement is flexible and can be adjusted based on the needs of the video and the patterns of the narrator. By placing the cameras about the action line rather than in deference to a live audience, the system has the physical capability to engage in shot composition, match cuts, and the establishing shot sequence.

Like Ranjan et al., the EduCam system employs a dual camera system, as seen in Figure 3. The Kinect and Raspberry Pi use a reactive coverage algorithm with facial detection and motion extraction algorithms to actively maintain the subject in the current Region of Interest (ROI). The camera on top (currently a GoPro) passively records the scene in HD for offline editing.

Real Time Reactive Camera Coverage

Reactive algorithms are used for the initial real time camera coverage. At all points, all three cameras are engaged in subject tracking through the facial detection and motion extraction algorithms standard in the OpenCV and OpenNI libraries. At each time point, however, they maintain the subject at the same ROI. This allows a match cut at any time. The three cameras also keep the shot as a medium shot, which allows not only for a match cut at any point but also for shot composition to be refined further as necessary after shooting. Many amateur videographers attempt to capture their material cropped correctly the first time, leaving them little room to edit if they see a mistake. Using 1920x1080 footage instead of the standard 640x480 provided by the Kinect, a medium shot can be converted to a close shot using digital zoom.
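A digital zoom of this kind is just a crop around the tracked subject in the higher-resolution stream. The sketch below assumes the Kinect's subject center has already been mapped into HD pixel coordinates and that the frame is an OpenCV-style image array; it is illustrative, not the EduCam pipeline.

    # Sketch: cut a close shot out of 1920x1080 footage by cropping
    # around the subject center reported by the tracker.
    def digital_zoom(frame, center, out_w=640, out_h=360):
        """frame: HxWx3 image array; center: (cx, cy) in frame pixels.
        Returns a close-shot crop, clamped to the frame borders."""
        h, w = frame.shape[:2]
        x0 = min(max(int(center[0] - out_w // 2), 0), w - out_w)
        y0 = min(max(int(center[1] - out_h // 2), 0), h - out_h)
        return frame[y0:y0 + out_h, x0:x0 + out_w]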

Hierarchical Constraint Satisfaction

EduCam proposes to use hierarchical constraint satisfaction with weighted rules in the post-editing process to transition between shots. The hierarchical nature of content in educational videos naturally fits a hierarchical specification language for constraints. Each shot is optimized for its shot composition and given a score on composition, on how well it realizes the three shot types, and on how well it matches the current shot. Cues to transition include the current shot length, audio pauses that signal a scene change, or simply a minimized matching error between two shots. Rule weights change temporally depending on the current cues, and the next shot chosen is the one that best meets the constraints. The current constraint set for EduCam includes rules for match cuts, establishing shots, screen time for narrator and content, and shot composition. EduCam will be evaluated on its ability to engage a viewing audience with the material by looking for correlations between engagement-related observations and constraints.
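Read as pseudocode, the proposed editor scores every candidate shot with a weighted sum of rule scores, shifts the weights according to the current cues, and cuts to the best scorer. The sketch below follows that reading; the rule names and the specific weights are illustrative assumptions, not EduCam's actual constraint set.

    # Sketch: weighted-rule shot selection for offline editing.
    def pick_next_shot(candidates, rules, weights):
        """candidates: shot descriptions; rules: {name: score_fn} with
        scores in 0..1; weights: {name: float}, adjusted for current cues."""
        def total(shot):
            return sum(weights[name] * score(shot)
                       for name, score in rules.items())
        return max(candidates, key=total)

    # Example cue handling: an audio pause (a scene-change cue) raises
    # the weight of the establishing-shot rule over the match-cut rule.
    def weights_for(cues):
        w = {"match_cut": 1.0, "establishing": 0.3, "composition": 0.5}
        if cues.get("audio_pause"):
            w["match_cut"], w["establishing"] = 0.2, 1.0
        return w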
Conclusion

The technology to run physical automated camera control with cinematic vocabulary is available. Both active and passive lecture capture systems have demonstrated that narrator tracking can be done in real time and at low cost. Computational cinematic algorithms have demonstrated that autonomous cinematography works in limited stochastic environments. Moreover, the educational video domain is an ideal transition point from the virtual to the physical domain. Once EduCam is fully implemented, there are several next steps. The first and foremost is to evaluate whether these cinematic idioms, once employed in education, not only engage students in the material better but also engage educators in creating better material. Secondly, the EduCam setup can be applied to a more complex physical environment, such as a play or even a meeting, where multiple subjects must be tracked. Dividing the automatic director control into rough reactive coverage and constraint satisfaction editing allows for such developments. EduCam is the bridge towards automatic cinematic control in a real stochastic environment.

References

Bares, W. H.; Grégoire, J. P.; and Lester, J. C. 1998. Real-time constraint-based cinematography for complex interactive 3D worlds. In AAAI/IAAI, 1101–1106.

Birnholtz, J.; Ranjan, A.; and Balakrishnan, R. 2010. Providing dynamic visual information for collaborative tasks: Experiments with automatic camera control. Human-Computer Interaction 25(3):261–287.

Bordwell, D. 2014. Manny Farber 2: Space man. http://www.davidbordwell.net/blog/2011/05/25/manny-farber-2-space-man/. Accessed: 2014-04-10.

Chi, P.-Y.; Liu, J.; Linder, J.; Dontcheva, M.; Li, W.; and Hartmann, B. 2013. DemoCut: Generating concise instructional videos for physical demonstrations. In Proceedings of the 26th Annual ACM Symposium on User Interface Software and Technology, 141–150. ACM.

Christie, M.; Olivier, P.; and Normand, J.-M. 2008. Camera control in computer graphics. Computer Graphics Forum 27(8):2197–2218.

Coursera. 2014. https://www.coursera.org/. Accessed: 2014-04-10.

Gooru. 2014. http://www.goorulearning.org/. Accessed: 2014-04-10.

Halper, N., and Olivier, P. 2000. CamPlan: A camera planning agent. In Smart Graphics 2000 AAAI Spring Symposium, 92–100.

Jardillier, F., and Languénou, E. 1998. Screen-space constraints for camera movements: The virtual cameraman. In Computer Graphics Forum, volume 17, 175–186. Wiley Online Library.

Kushalnagar, R. S.; Cavender, A. C.; and Pâris, J.-F. 2010. Multiple view perspectives: Improving inclusiveness and video compression in mainstream classroom recordings. In Proceedings of the 12th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS '10), 123–130.

Lampi, F.; Kopf, S.; and Effelsberg, W. 2008. Automatic lecture recording. In Proceedings of the 16th ACM International Conference on Multimedia, 1103–1104. ACM.

Lino, C.; Christie, M.; Lamarche, F.; Schofield, G.; and Olivier, P. 2010. A real-time cinematography system for interactive 3D environments. In 2010 ACM SIGGRAPH/Eurographics Symposium on Computer Animation.

Liu, Q.; Rui, Y.; Gupta, A.; and Cadiz, J. J. 2001. Automating camera management for lecture room environments. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '01), 442–449.

Liu, Q.; Kimber, D.; Foote, J.; Wilcox, L.; and Boreczky, J. 2002. FlySPEC: A multi-user video camera system with hybrid human and automatic control. In Proceedings of the Tenth ACM International Conference on Multimedia, 484–492. ACM.

Ozaki, M.; Gobeawan, L.; Kitaoka, S.; Hamazaki, H.; Kitamura, Y.; and Lindeman, R. W. 2010. Camera movement for chasing a subject with unknown behavior based on real-time viewpoint goodness evaluation. The Visual Computer 26(6-8):629–638.

Ranjan, A.; Henrikson, R.; Birnholtz, J.; Balakrishnan, R.; and Lee, D. 2010. Automatic camera control using unobtrusive vision and audio tracking. In Proceedings of Graphics Interface 2010, 47–54. Canadian Information Processing Society.

Thompson, K. 2011. Graphic content ahead. http://www.davidbordwell.net/blog/2011/05/25/graphic-content-ahead/. Accessed: 2014-04-10.

Ursu, M. F.; Groen, M.; Falelakis, M.; Frantzis, M.; Zsombori, V.; and Kaiser, R. 2013. Orchestration: TV-like mixing grammars applied to video-communication for social groups. In Proceedings of the 21st ACM International Conference on Multimedia, 333–342. ACM.
