Intelligent Cinematography and Editing: Papers from the AAAI-14 Workshop

EduCam: Cinematic Vocabulary for Educational Videos

Soja-Marie Morgens and Arnav Jhala
[email protected] and [email protected]
Computer Science Department, University of California, Santa Cruz, CA

Abstract

Cinematography relies on a rich vocabulary of shots and framing parameters to communicate narrative event structures. This cinematic vocabulary stems from the physical capabilities of camera movement and uses aesthetic aspects to engage the audience and influence audience comprehension. At present, automatic physical camera control is limited in cinematic vocabulary. Virtual cameras, on the other hand, are capable of executing a variety of cinematic idioms. Specifically, shot composition, match cut, and establishing shot sequence are three common cinematic techniques often lacking in automated physical camera systems. This paper proposes an architecture called EduCam as a platform for employing virtual camera control techniques in the limited stochastic physical environment of lecture capture. EduCam uses reactive algorithms for real-time camera coverage with hierarchical constraint satisfaction techniques for offline editing.

This paper focuses on three cinematic idioms that are common in professionally-shot educational videos and documentaries: shot composition, match cut, and the establishing shot sequence. It proposes EduCam, an architecture that allows for the employment of virtual camera control algorithms in the physical lecture capture environment. EduCam uses real-time reactive camera coverage to allow for offline editing with hierarchical constraint satisfaction algorithms. This paper evaluates current physical automatic camera control systems for their cinematic capacity, discusses current virtual cinematography, and finally outlines how EduCam allows for hierarchical constraint satisfaction editing in a physical world.
Amateur video has become an important medium for disseminating educational content. The DIY community routinely creates instructional videos and tutorials, while professional educators publish entire courses on Gooru and Coursera (Coursera 2014; Gooru 2014). Video gives educators the ability to flip the classroom, pushing passive viewing outside of the classroom and bringing interactivity and discussion back in. With video, educators have access to cinematic idioms to shape the narrative of their material.

However, these idioms rely on a high threshold of videography knowledge, equipment, and time. So educators use static cameras with time-consuming editing, or automatic lecture capture systems. Current automatic lecture capture systems are passive capture systems, created to record the oratory style of the educator for a live audience rather than to employ cinematic vocabulary. Moreover, their coverage typically consists of just a few cameras in the back of the classroom. This reduces their shot vocabulary to long and medium shots with limited shot composition capability. Current virtual camera control algorithms, by contrast, are able to employ the full variety of shots and some cinematic transitions.

Figure 1: An example of good shot composition from The Maltese Falcon. The subject is on the upper-left third lines and his gaze is into the open space. (Bordwell 2014)

Cinematic Vocabulary

The three most common cinematic idioms used to engage audiences in content-driven film are shot composition, match cuts, and establishing shot sequence.

Shot Composition ensures the shot is visually engaging and cues the audience to the subject of the shot. Generally, a good, engaging composition employs the rule of thirds and is framed from an angle that emphasizes perspective. Subjects are typically placed on the side of the frame opposite of where they are moving or gazing (see Figure 1).
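The rule-of-thirds placement just described lends itself to a simple automated check. The sketch below is illustrative only; the normalized frame coordinates, the 0.08 tolerance, and the gaze heuristic are assumptions for this example, not part of the EduCam system.

```python
# Illustrative sketch (assumed parameters, not the paper's method): checking a
# framing against the rule of thirds. Coordinates are normalized to [0, 1],
# with (0, 0) at the frame's top-left corner.

def thirds_score(subject_x, subject_y, tolerance=0.08):
    """Return True if the subject's center lies near one of the four
    'power points' where the horizontal and vertical third lines cross."""
    thirds = (1 / 3, 2 / 3)
    return any(abs(subject_x - tx) <= tolerance and abs(subject_y - ty) <= tolerance
               for tx in thirds for ty in thirds)

def gaze_has_open_space(subject_x, gaze_direction):
    """Check that the subject faces into the open side of the frame:
    a subject on the left third should gaze right, and vice versa."""
    if gaze_direction == "right":
        return subject_x <= 0.5
    if gaze_direction == "left":
        return subject_x >= 0.5
    return True  # a frontal gaze imposes no constraint in this sketch

# A subject on the upper-left power point, gazing right (as in Figure 1):
print(thirds_score(1 / 3, 1 / 3))           # True
print(gaze_has_open_space(1 / 3, "right"))  # True
```

An offline composition pass could run a check like this per shot and re-crop frames whose tracked subject fails both tests.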
Match Cuts are used to smoothly transition the audience's eyes from one shot to another. The area of interest in the second shot is often matched with the area of interest in the first shot, as shown in Figure 2. This keeps the audience's eyes on the subject and prevents "jitteriness" or loss of focus as the audience re-engages their eyes with the subject.

Figure 2: A match cut on region of interest from Aliens (Thompson 2011)

Copyright © 2014, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Establishing Shot Sequence is used to signal a change of scene or subject. It consists of sequential shots that start with a long shot, transition to a medium shot, and finally end on close shots of the subject. Used at the beginning of a film, it focuses the audience's attention on the subject and eases the audience into the story. Used as a chapter break, it signals a change to the audience and allows them time to refocus.

These three methods rely on the director's ability to manipulate multiple cameras along seven degrees of freedom according to the subject's position and future intentions. In computational cinematography, the virtual world allows perfect knowledge of a subject's location and near-perfect knowledge of the subject's intentions. Coverage is not an issue, and most systems rely on multiple cameras; virtual camera control algorithms do not contend with a stochastic physical environment and typically assume unlimited camera coverage. In lecture recording systems, however, knowledge of future intent is imperfect, editing algorithms are limited to real-time operations, and cameras are limited in number and mobility. By using real-time capture from multiple angles and zooms but non-real-time editing, the EduCam system will allow for the use of a larger variety of cinematic idioms without loss of quality.

Lecture Capture Domain

The lecture capture domain is an excellent testbed for bringing virtual cinematographic techniques into the real world. While the educational video exists within a physically stochastic environment, it is one filled with discernible cues and recognizable subjects. It has just two subjects: a single instructor, and the material, which is either written on the board or shown as a live demo. Movement is along a bounded plane and at a limited velocity, allowing tracking cameras to be set in pre-determined positions without significant loss of coverage. Facial recognition and motion extraction algorithms allow cameras to roughly track the subjects in real time, with finessed shot composition done offline. Cues to transition between the subjects can be taken from the narrator's gaze, gesture, and rhythm.

Physical Camera Control Systems

Automatic physical camera control systems can be divided into two categories: passive and active systems. Passive systems are unobtrusive and record the narrator's oration to a live audience, whether in a meeting or a classroom. Active systems are set up as the primary medium through which the narrator communicates.

Current lecture capture systems are passive and use real-time editing. These systems typically consist of one or more static PTZ (pan, tilt, zoom) cameras. Kushalnagar, Cavender, and Pâris use a single camera in the back of the classroom to capture the narrator and the material in a long shot. More complex setups, such as Liu et al. (2001) and Lampi, Kopf, and Effelsberg, involve a camera on the audience as well as a camera that can actively track the narrator. Lampi, Kopf, and Effelsberg does use some cinematic vocabulary: it composes its shots according to the rule of thirds in its real-time coverage. These passive capture systems, limited by their coverage and real-time editing constraints, employ simple auto-directors without cinematic editing. Liu et al. (2001) and Lampi, Kopf, and Effelsberg transition based on shot-length heuristics and physical buttons. Liu et al. (2002) and Kushalnagar, Cavender, and Pâris do not transition but show all streams simultaneously.

Active capture systems demonstrate the capabilities of current technology for reactive real-time automated control, and the need for cinematic idioms to improve audience engagement (Chi et al. 2013; Birnholtz, Ranjan, and Balakrishnan 2010; Ranjan et al. 2010; Ursu et al. 2013). Ranjan et al. tracks the current speaker in a meeting through visual facial recognition and audio cues. Both methods prove to be noisy; however, its output was reported as comparable to a real-time editing video crew, and it is capable of close, medium, and long shots. Birnholtz, Ranjan, and Balakrishnan uses motion tracking of a hand to determine the material subject and when to transition between shots. Their system can employ long shots and close shots, and they found that the use of a variety of shots shapes audience engagement. Chi et al. creates video using a single camera that follows the narrator's position via Kinect recognition algorithms. It demonstrated the ease with which a Kinect can adequately track an object during live recording, and the reluctance of a narrator to participate in changes of camera coverage.

Overall, these studies show that physical systems have the capability to do real-time reactive camera coverage. However, while active systems do use a variety of shots, they do not often use cinematic vocabulary for shot transitions. They either have no transitions or use simple state machines that transition based on shot-length heuristics and audio cues.

Virtual Camera Control

Automatic virtual camera control can generally be divided into three approaches: reactive, optimization-based, and constraint satisfaction (Christie, Olivier, and Normand 2008).

Reactive approaches are adapted from robotics and are best suited to real-time control of camera coverage. They look for local optima near the camera's current position and generally focus on minimizing occlusion of the subject and optimizing shot composition. For example, Ozaki et al. uses A* with a viewpoint entropy evaluation based on occlusion and camera collision with the environment.

Optimization approaches tend to focus on shot composition, but interactive editing systems provide experts with a range of temporal cinematic idioms of movement. These optimization approaches have used frame coherence and shot composition with static cameras (Halper and Olivier 2000).
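The reactive, local-optima search described above can be sketched with a simple greedy hill-climbing step. This is a stand-in illustration, not the evaluation used by Ozaki et al. or by EduCam: the occlusion and composition cost terms, the step size, and the zoom bounds are all invented for the example.

```python
# Illustrative sketch (assumed cost terms): a reactive camera step that tries
# small pan/zoom adjustments near the current state and greedily keeps the
# candidate that best reduces occlusion and improves rule-of-thirds framing.

def shot_cost(pan, zoom, subject_x, occluder_x):
    # Assumed cost: penalize the subject being near an occluder (worse when
    # zoomed in) plus the subject's distance from the nearest third line.
    projected = subject_x - pan  # subject position in frame coordinates
    occlusion = max(0.0, 0.3 - abs(subject_x - occluder_x)) * zoom
    composition = min(abs(projected - 1 / 3), abs(projected - 2 / 3))
    return occlusion + composition

def reactive_step(pan, zoom, subject_x, occluder_x, step=0.05):
    # Evaluate the current state and its local neighbours (one hill-climbing
    # move); zoom is clamped to an assumed mechanical range [0.5, 2.0].
    candidates = [(pan + dp, min(2.0, max(0.5, zoom + dz)))
                  for dp in (-step, 0.0, step)
                  for dz in (-step, 0.0, step)]
    return min(candidates,
               key=lambda c: shot_cost(c[0], c[1], subject_x, occluder_x))

# Track a stationary subject over 20 frames: the camera pans toward a
# rule-of-thirds framing and zooms out to reduce the occlusion penalty.
pan, zoom = 0.0, 1.0
for _ in range(20):
    pan, zoom = reactive_step(pan, zoom, subject_x=0.6, occluder_x=0.55)
```

Because the no-move candidate is always among the neighbours, the cost is non-increasing from frame to frame, which keeps the coverage stable when the subject is still.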