Visual Servoing of Presenters in Augmented Virtual Reality TV Studios
The 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems, October 18-22, 2010, Taipei, Taiwan

Suraj Nair, Thorsten Röder, Giorgio Panin, Alois Knoll, Member, IEEE

Authors' affiliation: Technische Universität München, Fakultät für Informatik. Address: Boltzmannstrasse 3, 85748 Garching bei München (Germany). Email: { nair, roeder, panin, knoll }@in.tum.de

978-1-4244-6676-4/10/$25.00 ©2010 IEEE

Abstract— This paper presents recent developments in the area of visual tracking methodologies for an applied real-time person localization system, whose primary aim is robust and failure-safe robotic camera control. We applied the described methods to virtual-reality TV broadcasting studio environments in Germany in order to close a gap in TV studio automation. The presented approach uses robot camera systems based on industrial robots in order to allow high-precision camera manipulation for virtual TV studios, without limiting the degrees of freedom that a robot manipulator can provide. To take robotic automation in TV studios to a completely new dimension, we have imparted intelligence to the system by tracking the TV presenter in real time, allowing him or her to move naturally and freely within the TV studio while maintaining the required scene parameters, such as position in the scene, zoom and focus, according to previously defined scene behaviors. The tracking system itself is distributed and has proven to scale to multiple robotic camera systems operating synchronously in real-world studios.

Fig. 1. A virtual set for daily news broadcasting events. The upper and lower left pictures show the virtual studio. The right shows our robot camera system for broadcast automation. (image courtesy: RTL Television Studio Köln, Germany, and Robotics Technology Leaders GmbH)

I. INTRODUCTION

Virtual TV studios have gained immense importance in the broadcasting area over the past years and are becoming the mainstream way of broadcasting. This evolutionary step is based on developments in computer graphics and rendering hardware that make it possible to achieve high-quality images with reasonable effort. For these reasons, the complexity and quality of HDTV content in fully virtual sets have reached a new high, providing an impressive experience for educational and documentary productions, as well as for e.g. weather or financial forecast transmissions. Present-day technology also allows broadcasters to place virtual objects inside the virtual scene. Although these advances have increased the acceptance of the technology, the system complexity has also inherently increased. It is therefore a crucial goal to keep systems maintainable for human operators, and thus in general to hide their complexity.

Besides the powerful rendering engines and astonishing 3D graphics available today, the fundamental quality of a virtual scene depends on the real-time robustness and accuracy of three major system components, namely: 1. the camera tracker, which recovers the absolute 3D position and orientation of the camera (e.g. by using external infrared sensors or odometry); 2. the rendering software itself; and 3. the precision with which the camera is moved w.r.t. translation, rotation, zoom and focus.

Virtual TV studios around the world use a typical camera configuration for motion control, consisting of a pedestal housing a pan-tilt unit. These systems have limitations in terms of degrees of freedom, motion smoothness, and the high cost of the external sensor-based tracker used to recover the 3D pose of the camera.

Industrial robot arms can perform precise manipulation of TV cameras with high repeatability in large workspaces, using many degrees of freedom. Moreover, the main advantage of a robotic system is that the 3D camera pose is obtained free of cost through the robot kinematics, thereby eliminating the need for external trackers.

Robotic automation in TV studios can be pushed to a new high by imparting intelligence to the robot system. To this aim, in this paper we propose a vision-based person tracker for visual servoing, integrating multiple visual modalities. The system is able to localize the moderator and keep her/him within the screen while sitting or freely walking inside the studio.

Another important feature is the automatic positioning of the moderator during different run-down scenes according to a pre-determined region of interest. For example, when switching from a scene with the moderator in the center to one where visual graphics need to be rendered, it is necessary to hold the moderator on the left (or right) part of the scene. This can be achieved using the tracking results with almost no need for human intervention. In comparison to our previously published work [1] and [2], the system has improved in scalability and distribution, and contains new modules for three-dimensional tracking. It has been completely integrated into a commercially distributed system called RoboKam®.

The present paper is organized as follows: Sec. II briefly reviews the related state of the art; Sec. III describes the visual tracker, providing a system overview, the user interface and the tracking methodology; Sec. III-F explains the robot controller. Experimental results are given in Sec. IV, and conclusions, including future system developments, are finally given in Sec. V.

II. PRIOR ART

To our knowledge, no fully integrated and self-contained vision-driven robot cameraman has been developed for virtual-reality (VR) TV studio applications. However, the literature concerning single-person or multiple-people tracking in video surveillance, mobile robotics and related fields already counts several well-known examples, which we briefly review here. Although implementations of similar vision systems do exist in the research and scientific domain, successful application in real-world scenarios remains very limited.

Multiple-people trackers [3], [4], [5] have the common requirement of using very little and generic offline information concerning the person's shape and appearance, while building and refining more precise models (color, edges, background) during the online tracking task; this unavoidable limitation is due to the more general context with respect to single-target tracking, for which specific models can instead be built offline.

Many popular systems for single-target tracking are based on color histogram statistics [6], [7], [8], [9] and employ a pre-defined shape and appearance model throughout the whole task. In particular, [8] uses a standard particle filter with a color histogram likelihood with respect to a reference image of the target, while [7] improves this method by adapting the model online to light variations, which however may introduce drift problems in the presence of partial occlusions; the same color likelihood is used by the well-known mean-shift kernel tracker [9].

The person tracking system [6] employs a complex model of shape and appearance, where color and shape blobs are modeled by multiple Gaussian distributions with articulated degrees of freedom, thus requiring a complex modeling phase as well as the specification of several parameters.

By comparison, in our system the offline model is kept to a minimum complexity, while at the same time retaining the relevant information concerning the spatial layout of colour statistics. The main advantage of our tracker is therefore its usability, flexibility and successful integration within the RoboKam® system for broadcast automation. In its complete form, it bridges the gap between scientific research and industrial application.

A. System overview

The system consists of two parts: 1. a single-person tracker, operating on each robot, localizes the moderator in 2D position (x, y), scale h and rotation θ within the field of view of the TV camera, allowing the robot to hold the moderator in a desired region of the scene, with the desired zoom and focus. The target is modeled by representing the head and shoulder silhouette, in order to provide a stable scale output, which is not possible using only color statistics that lack direct spatial information; 2. a supplementary overhead stereo tracker localizes the target over the entire studio floor w.r.t. 3D translation (x, y, z). Although the 3D tracker makes the overall architecture more complex, it provides very important functions, such as:
• initializing each robot camera, so that each can bring the target into its field of view independently of the target's initial position;
• re-initializing the local trackers in case of a target loss, to recover their 2D locations;
• initializing the zoom and focus control of each robot camera;
• handling interactions of the person with virtually rendered objects, such as occlusion of an object when the moderator moves around it.

The target is modeled as a fixed omega shape for the person tracker, while for the overhead tracker it is represented by a 3D box, along with a frontal picture of the person's appearance.

Figure 2 gives an overview of the complete system. Each tracker is integrated into the robotic camera system through a modular middleware called COSROBE, developed for the communication and configuration of studio devices.

It is possible to have more than one robot camera in the same studio, although there exists only a single overhead tracker. This tracker uses ceiling-mounted FireWire cameras in a stereo configuration to compute the 3D pose of the moderator. These cameras are calibrated with respect to a common world frame, with individual intrinsic and extrinsic parameters.

For the 2D person tracker we choose a Kalman filter [10], running on the output of a contour tracker known as the contracting curve density (CCD) algorithm [11], [12], which is based on the separation of local color statistics. In an object-tracking context, this separation takes place between the object and background regions, across the screen contour projected from the shape model onto a given camera view under a predicted pose hypothesis.

The overhead tracker uses a sampling-importance-resampling particle filter [13] working on color histograms, providing a joint likelihood of the 3D target pose.
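The 2D pipeline described above, a Kalman filter smoothing the CCD contour tracker's output, can be sketched as follows. This is a minimal illustrative constant-velocity filter applied independently to each state component (x, y, h, θ), not the authors' implementation; the noise values q and r are assumptions.

```python
# Illustrative constant-velocity Kalman filter for one component of the
# 2D person-tracker state (x, y, scale h, or rotation theta).
# Noise parameters are assumed for illustration, not taken from the paper.

class ScalarKalman:
    def __init__(self, q=1e-3, r=1e-2):
        self.x = [0.0, 0.0]                    # state: [position, velocity]
        self.P = [[1.0, 0.0], [0.0, 1.0]]      # state covariance
        self.q, self.r = q, r                  # process / measurement noise

    def step(self, z, dt=1.0):
        # Predict with the constant-velocity model F = [[1, dt], [0, 1]].
        xp = [self.x[0] + dt * self.x[1], self.x[1]]
        P = self.P
        Pp = [[P[0][0] + dt * (P[1][0] + P[0][1]) + dt * dt * P[1][1] + self.q,
               P[0][1] + dt * P[1][1]],
              [P[1][0] + dt * P[1][1],
               P[1][1] + self.q]]
        # Update with the contour tracker's position measurement z (H = [1, 0]).
        S = Pp[0][0] + self.r                  # innovation covariance
        K = [Pp[0][0] / S, Pp[1][0] / S]       # Kalman gain
        y = z - xp[0]                          # innovation
        self.x = [xp[0] + K[0] * y, xp[1] + K[1] * y]
        self.P = [[(1 - K[0]) * Pp[0][0], (1 - K[0]) * Pp[0][1]],
                  [Pp[1][0] - K[1] * Pp[0][0], Pp[1][1] - K[1] * Pp[0][1]]]
        return self.x[0]
```

In a full tracker, four such filters would run side by side, one per component of the (x, y, h, θ) output, smoothing the raw contour fits before they are passed to the robot controller.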
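The overhead tracker's sampling-importance-resampling scheme with a color-histogram likelihood can be sketched in simplified form. The sketch tracks a single 1D floor coordinate instead of the full 3D pose, and `observe_histogram` is a hypothetical stand-in for sampling pixels from the stereo cameras; particle counts, iteration counts and the likelihood sharpening exponent are assumptions.

```python
import math
import random

def bhattacharyya(p, q):
    """Similarity of two normalized color histograms (1.0 = identical)."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))

def observe_histogram(pos, target_pos, bins=8):
    """Toy stand-in for extracting a color histogram at a floor position:
    the further from the target, the more the histogram drifts away from
    the reference. A real system would sample camera pixels instead."""
    d = min(abs(pos - target_pos), 4.0)
    h = [math.exp(-((b - 3.5 - d) ** 2) / 2.0) for b in range(bins)]
    s = sum(h)
    return [v / s for v in h]

def sir_filter(reference, measure, steps=30, n=200):
    """SIR particle filter over a 1D floor coordinate, weighted by the
    color-histogram similarity to the reference appearance model."""
    particles = [random.uniform(0.0, 10.0) for _ in range(n)]
    for _ in range(steps):
        # importance weighting: color-histogram likelihood (sharpened)
        weights = [bhattacharyya(measure(p), reference) ** 8 for p in particles]
        # resampling step: draw particles proportionally to their weights
        particles = random.choices(particles, weights=weights, k=n)
        # motion model: random diffusion of the resampled particles
        particles = [p + random.gauss(0.0, 0.1) for p in particles]
    return sum(particles) / n                  # posterior mean estimate
```

The same weight-resample-diffuse cycle extends to the full (x, y, z) state by making each particle a 3D vector and evaluating the histogram in both stereo views jointly, as the paper's joint likelihood suggests.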
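Holding the moderator in a pre-determined region of interest, e.g. on the left of the frame when graphics occupy the right, amounts to driving the image-plane error to zero. Since the robot controller itself is described in Sec. III-F (outside this excerpt), the following is only a generic proportional visual-servoing sketch; the gain, deadband and command convention are invented for illustration.

```python
def servo_command(target_px, desired_px, image_size=(1920, 1080),
                  gain=0.8, deadband=0.02):
    """Map the tracked presenter position (pixels) to normalized pan/tilt
    velocity commands that pull the presenter toward the desired screen
    region. Gain and deadband values are illustrative, not from the paper."""
    w, h = image_size
    # normalized image-plane error, roughly in [-1, 1]
    ex = (target_px[0] - desired_px[0]) / (w / 2)
    ey = (target_px[1] - desired_px[1]) / (h / 2)
    # deadband suppresses jitter once the presenter is close enough
    pan = gain * ex if abs(ex) > deadband else 0.0
    tilt = gain * ey if abs(ey) > deadband else 0.0
    # clamp to the normalized command range [-1, 1]
    clamp = lambda v: max(-1.0, min(1.0, v))
    return clamp(pan), clamp(tilt)
```

For a "graphics on the right" scene, `desired_px` would simply be set to a point in the left third of the frame, so the controller pans until the tracked silhouette settles there.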