The 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems, October 18-22, 2010, Taipei, Taiwan

Visual Servoing of Presenters in Augmented Virtual Reality TV Studios

Suraj Nair, Thorsten Röder, Giorgio Panin, Alois Knoll, Member, IEEE

Abstract— This paper presents recent developments in the area of visual tracking methodologies for an applied real-time person localization system, whose primary aim is robust and failure-safe robotic camera control. We applied the described methods to virtual-reality TV studio environments in Germany, in order to close a gap in TV studio automation. The presented approach uses camera systems based on industrial robots, in order to allow high-precision camera manipulation for virtual TV studios without limiting the degrees of freedom that a robot manipulator can provide. To take robotic automation in TV studios to a completely new dimension, we have imparted intelligence to the system by tracking the TV presenter in real-time, allowing him or her to move naturally and freely within the TV studio, while maintaining the required scene parameters, such as position in the scene, zoom, focus, etc., according to previously defined scene behaviors. The tracking system itself is distributed and has proven to be scalable to multiple robotic camera systems operating synchronously in real-world studios.

Authors' affiliation: Technische Universität München, Fakultät für Informatik. Address: Boltzmannstrasse 3, 85748 Garching bei München, Germany. Email: {nair, roeder, panin, knoll}@in.tum.de

Fig. 1. A virtual set for daily news broadcasting events. The upper and lower left pictures show the virtual studio. The right shows our robot camera system for broadcast automation. (Image courtesy: RTL Köln, Germany, and Technology Leaders GmbH.)

I. INTRODUCTION

Virtual TV studios have gained immense importance in the broadcasting area over the past years, and are set to become a mainstream way of broadcasting in the future. This evolutionary step is based on developments in computer graphics and rendering hardware that make it possible to achieve high-quality images with reasonable effort. For these reasons, the complexity and quality of HDTV content in fully virtual sets have reached a new high, providing an impressive experience for educational and documentary movies, as well as for e.g. weather or financial forecast transmissions. Present-day technology also allows broadcasters to place virtual objects inside the virtual scene. Although with these advances the acceptance of this technology has increased, the system complexity has also inherently increased. Therefore it is a crucial goal to keep systems maintainable for human operators, and thus in general to hide their complexity.

Besides the powerful rendering engines and astonishing 3D graphics available nowadays, the fundamental quality of a virtual scene depends on the real-time robustness and accuracy of three major system components, namely: 1. the camera tracker, which recovers the absolute 3D position and orientation of the camera (e.g. by using external infrared sensors or odometry); 2. the rendering software itself; and 3. the precision with which the camera is moved w.r.t. translation, rotation, zoom and focus.

Virtual TV studios around the world use a typical camera configuration for motion control, consisting of a pedestal housing a pan-tilt unit. These systems have limitations in terms of degrees of freedom, motion smoothness, and the high costs of the external sensor-based tracker used to recover the 3D pose of the camera.

Robot arms can perform precise manipulation of e.g. TV cameras with high repeatability in large workspaces, using many degrees of freedom. Moreover, the main advantage of a robotic system is that the 3D camera pose is obtained free of cost through the robot kinematics, thereby eliminating the need for external trackers.

Robotic automation in TV studios can be pushed to a new high by imparting intelligence to the robot system. To this aim, in this paper we propose a vision-based person tracker for visual servoing, integrating multiple visual modalities. The system is able to localize the moderator and keep her/him within the screen while sitting or freely walking inside the studio.

Another important feature is the automatic positioning of the moderator during different run-down scenes according to a pre-determined region of interest. For example, when switching from a scene with the moderator in the center to one where visual graphics need to be rendered, it is necessary to hold the moderator on the left (or right) part of the scene. This can be achieved using the tracking results, with almost no need for human intervention. In comparison to our previously published work [1] and [2], the system has improved in scalability and distribution, and contains new modules for three-dimensional tracking. It has been completely integrated into a commercially distributed system called RoboKam®.

The present paper is organized as follows: Sec. II briefly reviews the related state of the art; Sec. III describes the visual tracker, providing a system overview, the user interface and the tracking methodology; Sec. III-F explains the robot controller. Experimental results are given in Sec. IV, and conclusions, including future system developments, are finally given in Sec. V.

II. PRIOR ART

To our knowledge, no fully integrated and self-contained vision-driven robot cameraman has been developed for virtual reality (VR) TV studio applications. However, the literature concerning single person or multiple people tracking in video surveillance, mobile robotics and related fields already counts several well-known examples, which we briefly review here. Although implementations of similar vision systems do exist in the research and scientific domain, successful application in real-world scenarios remains very limited.

Multiple people trackers [3], [4], [5] have the common requirement of using very little and generic offline information concerning the person shape and appearance, while building and refining more precise models (color, edges, background) during the on-line tracking task; this unavoidable limitation is due to the more general context with respect to single-target tracking, for which specific models can instead be built off-line.

Many popular systems for single-target tracking are based on color histogram statistics [6], [7], [8], [9] and employ a pre-defined shape and appearance model throughout the whole task. In particular, [8] uses a standard particle filter with a color histogram likelihood with respect to a reference image of the target, while [7] improves this method by adapting the model on-line to light variations, which however may introduce drift problems in presence of partial occlusions; the same color likelihood is used by the well-known mean-shift kernel tracker [9].

The person tracking system [6] employs a complex model of shape and appearance, where color and shape blobs are modeled by multiple Gaussian distributions with articulated degrees of freedom, thus requiring a complex modeling phase as well as the specification of several parameters.

By comparison, in our system the off-line model is kept to a minimum complexity, while at the same time retaining the relevant information concerning the spatial layout of color statistics. The main advantage of our tracker is therefore its usability, flexibility and successful integration within the RoboKam® system for broadcast automation. In its complete form, it bridges the gap between scientific research and industrial applications with a minimum required throughput.

III. THE VISION-BASED PERSON TRACKER

In this Section we describe our vision-based system for moderator localization and continuous tracking, providing details regarding its design and implementation.

A. System overview

The system consists of two parts: 1. a single person tracker, operating on each robot, localizes the moderator in 2D position (x, y), scale h and rotation θ within the field of view of the TV camera, allowing the robot to hold the moderator in a desired region of the scene, with the desired zoom and focus. The target is modeled by representing the head and shoulder silhouette, in order to provide a stable scale output, which is not possible by using only color statistics, which lack direct spatial information; 2. a supplementary overhead stereo tracker localizes the target over the entire studio floor w.r.t. 3D translation (x, y, z). Although the 3D tracker makes the overall architecture more complex, it provides very important features, such as:
• initialize each robot camera, so that they can bring the target into the respective fields of view independently of their initial positions;
• re-initialize the local trackers in case of a target loss, to recover their 2D locations;
• initialize zoom and focus control of each robot camera;
• handle interactions of the person with virtually rendered objects, such as occlusion of the object when the moderator moves around it.

The target is modeled as a fixed omega shape for the person tracker, while for the overhead tracker it is represented by a 3D box, along with a frontal picture of the person appearance.

Figure 2 gives an overview of the complete system. Each tracker is integrated into the robotic camera system through a modular middleware called COSROBE, developed for the communication and configuration of studio devices. It is possible to have more than one robot camera in the same studio, although there exists only a single overhead tracker. This tracker uses ceiling-mounted FireWire cameras in a stereo configuration, to compute the 3D pose of the moderator. These cameras are calibrated with respect to a common world frame, with individual intrinsic and extrinsic parameters.

For the 2D person tracker we choose a Kalman filter [10], running on the output of a contour tracker known as the contracting curve density (CCD) algorithm [11], [12], based on the separation of local color statistics. In an object tracking context, separation takes place between the object and the background regions, across the screen contour projected from the shape model onto a given camera view, under a predicted pose hypothesis.

The overhead tracker uses a sampling-importance-resampling particle filter [13] working on color histograms, providing a joint likelihood of the 3D target pose. We choose a particle filter for the overhead tracker, over more conventional Kalman filter techniques, because the overhead tracker has to be highly robust when dealing with multi-modal likelihoods, due to a high probability of a cluttered background. This way it can support the 2D person tracker in cases of loss detection and re-initialization.
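As an illustration of the two tracker outputs exchanged over the middleware, consider the following minimal sketch; the class and field names are our own, since the paper does not specify the COSROBE message format:

```python
# Minimal sketch of the two tracker outputs; names are illustrative,
# the actual COSROBE message format is not specified in the paper.
from dataclasses import dataclass

@dataclass
class PersonState2D:
    """Per-robot state from a 2D person tracker (Sec. III-C)."""
    x: float      # horizontal position in the image plane [px]
    y: float      # vertical position in the image plane [px]
    h: float      # scale of the head-shoulder (omega) silhouette
    theta: float  # in-plane rotation [rad]

@dataclass
class OverheadState3D:
    """Studio-wide state from the overhead stereo tracker (Sec. III-D)."""
    x: float      # translation w.r.t. the common world frame [m]
    y: float
    z: float
```

On a target loss, the overhead 3D estimate can be projected into the view of a robot camera to re-initialize its 2D tracker, which is the handover described in Sec. III-E.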

[Fig. 2 (block diagram): overhead FireWire camera signals feed the overhead tracking system (target 3D pose (x, y, z)); each camera system 1..N feeds its own tracking system with a PAL TV camera signal (target (x, y, scale)); the COSROBE middleware connects these to the VR engine, GUI, robot motion planner, studio console and other devices (lens, joystick, pan-tilt units), including collision avoidance.]

Fig. 2. Block diagram of the system architecture. The middleware COSROBE integrates multiple robotic camera devices and tracking systems. An overhead stereo camera is used to initialize the system at startup and in cases of target loss. Since each camera device is calibrated to a given world point, tracking information can be fused. The middleware also connects to the virtual reality engine and system configuration modules, e.g. the GUI or the motion planning module.

Concerning computational resources, the software for each camera system runs on a separate PC and obtains the TV camera picture through a frame grabber. The overhead tracker uses stereo FireWire cameras with wide-angle lenses for covering the complete studio floor. Currently the overhead tracker runs on a single PC, to which two FireWire cameras are connected.

B. Graphical user interface

The tracking system can be easily controlled by the user through an intuitive graphical user interface (Fig. 3). The GUI provides functions such as start/stop tracking and automatic target re-detection. The latter is done by frontal face detection [14].
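The re-detection step in the spirit of [14] (Viola-Jones) can be sketched with a standard OpenCV cascade; the paper does not state which implementation the GUI actually uses:

```python
# Sketch of frontal-face re-detection in the spirit of [14], here via
# OpenCV's stock Haar cascade; the actual implementation used by the
# GUI is not specified in the paper.
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def redetect_faces(bgr_image):
    """Return candidate (x, y, w, h) boxes to re-initialize the 2D tracker."""
    gray = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2GRAY)
    return cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
```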

Fig. 3. Graphical user interface.

C. The 2D person tracker

Our trackers are designed and implemented following a recently developed general-purpose framework [15], following a tracking pipeline concept. Fig. 4 describes the pipelines for the two trackers of the previous Section.

Each 2D tracker holds a state-space representation of the model pose, given by a planar roto-translation and scale in the image plane. The Kalman filter provides the sequential prediction and update of the respective 2D state $s = (x, y, h, \theta)$.

1) Pre-processing: Sensor data are obtained from the TV camera in raw PAL format, through a frame grabber. In CCD, no pre-processing is done, and color pixels are directly used by the feature matching module, in order to collect local statistics and optimize the pose. Therefore, the pre-processing function merely copies the input image to a local storage for the other modules.

2) Tracker prediction: The Kalman filter generates a prior state hypothesis $s_t^-$ from the previous state $s_{t-1}$ by applying a Brownian motion model

$s_t^- = s_{t-1} + w_t$   (1)

with $w$ a white Gaussian noise sequence. Although very simple, this model suits our needs very well, providing a Gaussian distribution around the current state which helps keeping track of the target when it is not moving, as well as when it exhibits motion at a reasonable speed. If the target moved very fast, we would need more adequate models such as constant velocity, but this situation normally does not arise in TV studio environments.

3) CCD likelihood for feature-level matching: In the original CCD algorithm [11], local areas for collecting color statistics are given by regions around each contour sample position, on each side of the contour. In order to simplify the computation, as also suggested in [12], we first sample points along the respective normals, separately collect the statistics, and afterwards blur each statistic with the neighboring ones (Fig. 5). This is fully equivalent to the initial process, but computationally more convenient.

Fig. 5. The CCD algorithm tries to maximize the separation of color statistics between two image regions. The algorithm first samples pixels along the normals for collecting local color statistics.
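To make the prediction and correction around Eq. (1) concrete, here is a minimal numpy sketch of a Kalman filter with an identity (random-walk) motion model over the state $s = (x, y, h, \theta)$; the noise covariances are illustrative placeholders, not values from the paper, and the measurement is the CCD contour fit introduced below (Eq. (9)):

```python
import numpy as np

class RandomWalkKalman:
    """Kalman filter with the Brownian (random-walk) model of Eq. (1):
    the predicted mean equals the previous state, and only the
    covariance grows by the process noise Q."""

    def __init__(self, s0, P0, Q, R):
        self.s = np.asarray(s0, dtype=float)  # state (x, y, h, theta)
        self.P = np.asarray(P0, dtype=float)  # state covariance
        self.Q = np.asarray(Q, dtype=float)   # process noise (white Gaussian w_t)
        self.R = np.asarray(R, dtype=float)   # measurement noise

    def predict(self):
        # s_t^- = s_{t-1} + w_t  =>  mean unchanged, covariance inflated
        self.P = self.P + self.Q
        return self.s

    def update(self, z):
        # Measurement model H = I: the CCD output Z_ccd = s* (Eq. (9)
        # below) observes the full state directly.
        S = self.P + self.R                   # innovation covariance
        K = self.P @ np.linalg.inv(S)         # Kalman gain
        self.s = self.s + K @ (np.asarray(z, dtype=float) - self.s)
        self.P = (np.eye(len(self.s)) - K) @ self.P
        return self.s

# Example: track a nearly static presenter state.
kf = RandomWalkKalman(s0=[320, 240, 1.0, 0.0],
                      P0=np.eye(4) * 10.0,
                      Q=np.eye(4) * 4.0,
                      R=np.eye(4) * 2.0)
kf.predict()
kf.update([322, 241, 1.02, 0.01])  # e.g. a CCD contour fit
```

Because the motion model is the identity, prediction only inflates the covariance, which matches the behavior described above for a presenter who is static or moving at moderate speed.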

[Fig. 4 (pipeline diagram): TV camera pipeline: grabber → pre-processing (direct pass) → feature matching (CCD with omega shape model, 2D pose: roto-translation and one scale) → Kalman filter → output filter, yielding the 2D pose (x, y, scale, θ); overhead pipeline: FireWire camera grabbers (cam 1..N, each warped with intrinsics T_int and extrinsics T_ext) → RGB-to-HSV conversion → color histogram matching against a rectangular cube model, averaged over cameras → particle filter → output filter, yielding the 3D pose translation (x, y, z).]

Fig. 4. Person tracking pipelines. The top pipeline estimates a planar pose with 4 degrees of freedom (roto-translation and scale). A Kalman filter incorporates a feature-based contour matching modality, using an omega shape model. The pipeline on the bottom describes the overhead tracker: here, a stereo setup is used in order to estimate the 3D position of the target. After capturing the image frames, they are converted into the HSV color space. A SIR particle filter is used for Bayesian prediction and correction. As an objective function we use 2D color statistics within a 3D box model, which gets simultaneously projected and evaluated on each camera.
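Read as software, Fig. 4 suggests a chain of exchangeable stages per tracker. The following schematic sketch mirrors that structure; it is our own reading of the figure, not the actual interface of the framework [15]:

```python
class TrackingPipeline:
    """One tracker = grabber -> pre-processing -> feature matching ->
    Bayesian filter -> output filter, as in Fig. 4. The stages are
    injected callables/objects, so the same skeleton serves both the
    Kalman/CCD pipeline and the particle-filter/histogram pipeline."""

    def __init__(self, grab, preprocess, match, bayes_filter, postprocess):
        self.grab = grab                  # frame grabber
        self.preprocess = preprocess      # e.g. direct pass or RGB->HSV
        self.match = match                # e.g. CCD contour or color histogram
        self.bayes_filter = bayes_filter  # Kalman or SIR particle filter
        self.postprocess = postprocess    # output filtering / formatting

    def step(self):
        image = self.preprocess(self.grab())
        prior = self.bayes_filter.predict()   # Eq. (1) / Eq. (10)
        z = self.match(image, prior)          # feature-level matching
        posterior = self.bayes_filter.update(z)
        return self.postprocess(posterior)    # e.g. (x, y, scale, theta)
```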

From each contour position $h_i$, foreground and background color pixels are collected along the normals $n_i$ up to a distance $L$, and local statistics up to the 2nd order are estimated

$\nu_i^{0,B/F} = \sum_{d=1}^{D} w_{id}$
$\nu_i^{1,B/F} = \sum_{d=1}^{D} w_{id}\, I(h_i \pm L \bar{d}\, n_i)$   (2)
$\nu_i^{2,B/F} = \sum_{d=1}^{D} w_{id}\, I(h_i \pm L \bar{d}\, n_i)\, I(h_i \pm L \bar{d}\, n_i)^T$

with $\bar{d} \equiv d/D$ the normalized contour distance, where the $\pm$ sign refers to the respective side, and image values $I$ are 3-channel RGB. The local weights $w_{id}$ decay exponentially with the normalized distance, thus giving a higher confidence to observed colors near the contour. For a more detailed explanation please refer to [12].

Single-line statistics are afterwards blurred along the contour, providing statistics distributed over local areas

$\tilde{\nu}_i^{o,B/F} = \sum_j \exp(-\lambda_1 |i-j|)\, \nu_j^{o,B/F}; \quad o = 0, 1, 2$   (3)

and finally normalized

$\bar{I}_i^{B/F} = \tilde{\nu}_i^{1,B/F} / \tilde{\nu}_i^{0,B/F}, \qquad \bar{R}_i^{B/F} = \tilde{\nu}_i^{2,B/F} / \tilde{\nu}_i^{0,B/F}$

in order to provide the two-sided, local RGB means $\bar{I}$ and $(3 \times 3)$ covariance matrices $\bar{R}$.

The second step involves computing the residuals and Jacobian matrices for the Gauss-Newton pose update. For this purpose, observed pixel colors $I(h_i + L \bar{d}\, n_i)$ with $\bar{d} = -1, \dots, 1$, are classified according to the collected statistics, under a fuzzy membership rule $a(\bar{d})$ to the foreground region

$a(\bar{d}) = \frac{1}{2}\left[\mathrm{erf}\!\left(\frac{\bar{d}}{\sqrt{2}\,\sigma}\right) + 1\right]$   (4)

which becomes a sharp $\{0,1\}$ assignment for $\sigma \to 0$; pixel classification is then accomplished by mixing the two statistics accordingly

$\hat{I}_{id} = a(\bar{d})\, \bar{I}_i^F + (1 - a(\bar{d}))\, \bar{I}_i^B, \qquad \hat{R}_{id} = a(\bar{d})\, \bar{R}_i^F + (1 - a(\bar{d}))\, \bar{R}_i^B$   (5)

and color residuals are given by

$E_{id} = I(h_i + L \bar{d}\, n_i) - \hat{I}_{id}$   (6)

with covariances $\hat{R}_{id}$.

Finally, the $(3 \times n)$ derivatives of $E_{id}$ can be computed by differentiating (5) and (4) with respect to the pose parameters¹

$J_{id} = \frac{\partial \hat{I}_{id}}{\partial s} = (\bar{I}_i^F - \bar{I}_i^B)\, \frac{1}{L}\, \frac{\partial a}{\partial \bar{d}}\, n_i^T\, \frac{\partial h_i}{\partial s}$   (7)

which are stacked together in a global Jacobian matrix $J_{ccd}$. The state is then updated using a Gauss-Newton step

$s = s + \Delta s, \qquad \Delta s = J_{ccd}^{+} E_{ccd}$   (8)

¹ As in [12], we neglect the dependence of $R_{id}$ on $s$ while computing the Jacobian matrices.
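A compact numpy rendering of Eqs. (4)-(8) for a single sample point may help. This is a sketch: the blurred statistics $\bar{I}^{F/B}$ (Eq. (3)) and the contour Jacobian $\partial h_i / \partial s$ are assumed given by the shape model, and the covariance weighting by $\hat{R}_{id}$ is omitted for brevity:

```python
import numpy as np
from math import erf, exp, sqrt, pi

def fuzzy_membership(d_bar, sigma):
    # Eq. (4): a(d) = 0.5 * (erf(d / (sqrt(2)*sigma)) + 1)
    return 0.5 * (erf(d_bar / (sqrt(2.0) * sigma)) + 1.0)

def ccd_point_terms(I_obs, d_bar, I_fg, I_bg, dh_ds, n_i, L, sigma):
    """Residual (Eq. 6) and Jacobian block (Eq. 7) for one sample at
    normalized distance d_bar along the normal n_i of contour point i.
    I_fg, I_bg: blurred local RGB means (Eq. 3), shape (3,);
    dh_ds: (2 x n) contour Jacobian, assumed given by the shape model."""
    a = fuzzy_membership(d_bar, sigma)
    I_hat = a * I_fg + (1.0 - a) * I_bg                      # Eq. (5)
    E = I_obs - I_hat                                        # Eq. (6)
    # derivative of the erf-based membership w.r.t. d_bar
    da_dd = exp(-d_bar**2 / (2.0 * sigma**2)) / (sqrt(2.0 * pi) * sigma)
    J = np.outer(I_fg - I_bg, (da_dd / L) * (n_i @ dh_ds))   # Eq. (7), (3 x n)
    return E, J

def gauss_newton_step(s, J_stack, E_stack):
    # Eq. (8): s <- s + J^+ E, with the stacked Jacobian and residuals
    return s + np.linalg.pinv(J_stack) @ E_stack
```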

using the stacked Jacobian and residual vector. After each iteration, the measurement covariance is reduced with exponential decay, providing a robust multi-resolution convergence to the locally optimal pose

$Z_{ccd} = s^*$   (9)

which is used as a measurement $Z$ for the respective Kalman filter.

D. 3D overhead tracker pipeline

The overhead tracker holds a state-space representation of the 3D model pose, given by a translation $(x, y, z)$ of the box model with respect to the reference system of the stereo setup. The particle filter provides the sequential prediction and update of the respective state $s = (x, y, z)$.

1) Pre-processing: The sensor data for the person tracker are obtained from each FireWire camera in the raw RGB-444 format. The image from each camera is pre-processed by performing RGB to HSV color conversion $z_c^{col}$; $c = 1, \dots, C$ for the color-based likelihood, where the index $c$ corresponds to the FireWire camera and $C$ is the total number of cameras. An optional background subtraction step is possible before color conversion, to further increase robustness in suitable studio setups.

2) Tracker prediction: The particle filter generates several prior state hypotheses $s_t^i$ from the previous distribution $(s^i, w^i)_{t-1}$ through a Brownian motion model

$s_t^i = s_{t-1}^i + w_t^i$   (10)

with $w$ a zero-mean Gaussian noise of pre-defined covariance in the $(x, y, z)$ state variables. Deterministic resampling over the previous weights $w_{t-1}^i$ is employed at each frame.
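A possible numpy sketch of the prediction step of Eq. (10), together with the deterministic (systematic) resampling applied at each frame; the particle count and noise levels are illustrative, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def predict(particles, sigma_xyz):
    """Eq. (10): Brownian motion, s_t^i = s_{t-1}^i + w_t^i with
    zero-mean Gaussian noise on (x, y, z)."""
    noise = rng.normal(0.0, sigma_xyz, size=particles.shape)
    return particles + noise

def systematic_resample(particles, weights):
    """Deterministic (systematic) resampling over the previous weights,
    performed at each frame as described above."""
    n = len(weights)
    positions = (rng.random() + np.arange(n)) / n
    cumsum = np.cumsum(weights)
    cumsum[-1] = 1.0  # guard against numerical round-off
    idx = np.searchsorted(cumsum, positions)
    return particles[idx], np.full(n, 1.0 / n)

# Example: 200 particles over (x, y, z) in meters.
particles = rng.normal([2.0, 3.0, 1.7], 0.1, size=(200, 3))
weights = np.full(200, 1.0 / 200)
particles, weights = systematic_resample(particles, weights)
particles = predict(particles, sigma_xyz=[0.05, 0.05, 0.02])
```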

3) Color likelihood: For each generated hypothesis, the tracker asks for a computation of the likelihood values $P(z_c^{col} \mid s^i)$ after projecting the hypothesis onto each camera view.

The object model defining the person shape is projected onto the HSV image of each camera at the predicted hypothesis $s_t^i$, using the respective intrinsic and extrinsic parameters. The underlying H and S color values are collected in the respective 2D histogram $q_c(s_t^i)$, which is compared with the reference one $q_c^*$ through the Bhattacharyya coefficient [8]

$B(q_c(s), q_c^*) = \left[ 1 - \sum_n \sqrt{q_c^*(n)\, q_c(s, n)} \right]^{1/2}$   (11)

where the sum is performed over the $(bin \times bin)$ histogram bins (in the current implementation, $bin = 10$). The computation is done separately for each camera $c$.

The overall likelihood is then evaluated under independent Gaussian models in the overall residual

$P(z^{col} \mid s_t^i) \propto \exp\!\left( -\sum_c \log(B_c^2)/\lambda \right)$   (12)

with a given covariance $\lambda$.

4) Computing the estimated state: The average state $\bar{s}_t$

$\bar{s}_t = \frac{1}{N} \sum_i w_t^i\, s_t^i$   (13)

is computed, and the three components $(\bar{x}, \bar{y}, \bar{z})$ are returned to the robot controller.

Fig. 6. Experimental results. Upper row: 2D contour tracking, both in a TV news studio with a green-wall background and constant lighting, and on a cluttered background. Bottom row: overhead stereo tracker, localizing the moderator in 3D translation.

E. Loss detection and handover between trackers

One of the most important features of our system is the possibility to automatically detect a track loss when the person leaves the scene or gets occluded, and to re-initialize the system in such situations using the overhead tracker.

In principle there are two main techniques to determine a possible target loss: 1) a covariance test, 2) a likelihood test. A covariance test would be independent of the actual likelihood values, but it may fail to detect a loss when the hypotheses concentrate on a small peak (false positive), which has a low covariance as well. This is very undesirable in a TV studio application, where the only target that should be detected is the moderator, and never other people or objects to which the robot could drift.

On the other hand, the likelihood test depends on the likelihood values, and may declare a loss too often (false negatives), for example in presence of light variations. However, in a TV studio light conditions are strongly controlled, and an occasional false negative is acceptable, as long as the re-initialization is successful.

Therefore, we employ the likelihood test on the estimated state $\bar{s}_t$ from both trackers, and declare a loss whenever $P(z \mid \bar{s}_t)$ decreases below a minimum value $P_{min}$. This threshold is set as a percentage (e.g. $\leq 10\%$) of a reference value $P_{ref}$, initially given by the observed maximum likelihood. In order to provide adaptivity to variable postures, e.g. when the moderator turns to one side, as well as to light or shading variations, $P_{ref}$ itself is slowly adapted if the last average likelihood $P(z \mid \bar{s}_{t-1})$ is comparable to $P_{ref}$ (e.g. $\geq 60\%$). When a target loss occurs, the tracker is automatically re-initialized.
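Eqs. (11)-(13) and the likelihood-based loss test could be sketched as follows; the 10% and 60% thresholds come from the text above, while the adaptation rate for $P_{ref}$ is our own placeholder:

```python
import numpy as np

def bhattacharyya_distance(q_ref, q_hyp):
    # Eq. (11): B = sqrt(1 - sum_n sqrt(q_ref(n) * q_hyp(n))), over the
    # (bin x bin) normalized H-S histograms of one camera; the inner sum
    # is the Bhattacharyya coefficient.
    bc = np.sum(np.sqrt(q_ref * q_hyp))
    return np.sqrt(max(1.0 - bc, 0.0))

def color_likelihood(q_refs, q_hyps, lam):
    # Eq. (12) as reconstructed above: P ∝ exp(-sum_c log(B_c^2) / lam),
    # combining cameras under independent models.
    eps = 1e-12  # avoid log(0) for a perfect histogram match
    s = sum(np.log(max(bhattacharyya_distance(qr, qh), eps) ** 2)
            for qr, qh in zip(q_refs, q_hyps))
    return np.exp(-s / lam)

def mean_state(particles, weights):
    # Eq. (13): weighted average state (normalized form) returned to
    # the robot controller.
    return np.average(particles, axis=0, weights=weights)

class LossDetector:
    """Likelihood test of Sec. III-E: declare a loss when P(z|s_bar)
    drops below a fraction of an adaptive reference P_ref."""
    def __init__(self, p_ref, min_frac=0.10, adapt_frac=0.60, rate=0.05):
        self.p_ref = p_ref        # initially the observed maximum likelihood
        self.min_frac = min_frac  # loss threshold (e.g. <= 10% of P_ref)
        self.adapt_frac = adapt_frac
        self.rate = rate          # adaptation speed: our own placeholder

    def check(self, p):
        if p >= self.adapt_frac * self.p_ref:
            # slowly adapt P_ref to posture and lighting changes
            self.p_ref = (1.0 - self.rate) * self.p_ref + self.rate * p
        return p < self.min_frac * self.p_ref  # True => target lost
```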

F. Robot controller

Fig. 7. Robot control methodology. [Diagram: the tracking systems (overhead/person tracking) provide the estimated state; a strategy (linear and/or hold-angle) maps it, together with ROI adaptation in the image, to joint movements, including collision avoidance.]

In order to keep the target in a predefined ROI, it is necessary to generate the relative motion parameters for the corresponding robot system. This is achieved by taking the 2D pose information from the local tracker and converting it into 3D motion commands in the manipulator space. For this conversion, additional parameters have to be considered, namely:
• Region of interest (ROI): the desired area in the image space of each camera system where the target should be held; e.g. in weather broadcasts the target usually appears on the right side of the scene.
• Balance speed: the speed of the robot is specified for X, Y, pan and tilt, as a percentage ranging from −100% to +100% of the maximum values. This speed is used by the camera system in order to get the target back into the ROI. The actual speed also depends on the distance between the target and the ROI: the effective speed is proportional to the calculated distance, providing smooth motion properties.

As shown in Fig. 7, 3D Cartesian control is computed out of the X, Y, pan, tilt speeds. We propose two different operation modes for the joint control:
• Normal mode: the movement of the robot is limited to linear motion in the X and Y directions, and angular motion for pan and tilt, in order to get the target back into the desired ROI.
• Hold-angle mode: the movement of the robot is done in two phases: in the first phase, the robot uses only pan and tilt to bring the target back into the ROI; in the second phase it uses linear X and Y motion to compensate and hold a predefined camera viewing angle.

The robot used in the studio is a Stäubli RX-160L with modified kinematics; in laboratory experimental setups we use the Stäubli RX-90. The RX-160L has 6 joints, but we replaced joints 4, 5 and 6 with a specialized tilt-pan-tilt configuration, to improve the capabilities of camera motion control, as illustrated in Fig. 1. This robot comes with the CS-8 Stäubli controller; we bypass the kinematics computation provided by the controller, instead computing the kinematics and motion trajectories externally and sending this information directly to the low-level controller through the respective interface. The controller runs a real-time extension of Linux. For the PID control, we rely on the CS-8 controller shipped by the manufacturer.

In a TV studio setup, production is done on a scene-by-scene basis, with respect to the run-down. The robot system should react intuitively to a switch from one scene to the other. This is achieved automatically, with almost no need for human intervention. Different scenes require the moderator to appear in different regions, and with different zoom and focus settings. This information can be combined with the run-down information, in order to enable a completely automatic switch of the camera position and zoom/focus, holding the moderator within the current region of interest.

IV. EXPERIMENTS AND RESULTS

The system has been evaluated in real TV studios, for virtual reality productions. For this purpose, standard desktop PCs with 2.4 GHz Intel Pentium IV processors and standard graphics hardware have been used to realize each camera tracker and the overhead stereo tracker, all running on the Linux operating system.

Both trackers run in real-time, at approximately 15-20 fps, which we found more than sufficient for the robot controller, which requests visual feedback every 200 ms. Image resolution is 640 × 480 pixels. Fig. 8 illustrates a lab robot with a pan-tilt unit and a TV camera: the robot camera follows the person while moving and interacting with other people in the scene.

Fig. 8. In-house testbed: example sequence with the robot controller in action. Here the control of the robot arm and of the pan-tilt unit has to be performed jointly, to provide smooth, jitter-free trajectories.

Fig. 6 shows some experimental results of the 2D person tracker, and also illustrates the 3D overhead tracker. Again, the system keeps good track of the person during the whole sequence. The accompanying video file demonstrates performances in different scenarios, including automatic scene switching during run-down.

Although the trackers perform in real-time, they communicate with the robot controller through a common middleware, with TCP/IP sockets that introduce some delay in the motion control. The robot controller sends commands to the robot with a cycle time of 4 ms, at the same time requesting data from the tracker every 100 ms; the tracker generates data at 15-25 fps, thereby fulfilling the requests from the robot controller. There is indeed some delay (approximately 200 ms) when the robot reacts to fast movements of the moderator, but in TV studio environments, and especially in virtual sets for news production, this situation is very rare. Most of the motion control takes place during scene switching, where a small delay can be tolerated during production.

V. CONCLUSIONS AND FUTURE WORK

We presented a distributed and scalable person tracking system for visual servoing in virtual-reality TV applications. In particular, we improved the robustness and usability of the current system with respect to our previous work [1], [2] in many respects.

The use of the CCD algorithm for 2D tracking makes possible a very robust localization of the person with a stable scale estimate, needed for zoom control, as illustrated in the top-right of Fig. 6. The new stereo overhead tracker allows 3D pose estimation, thus considerably improving the performance of loss detection. Moreover, the system has been successfully integrated into a real-world robot controller, being used live in TV studios.

Our future developments will focus on scaling the overhead system to cover larger floor areas using camera grids. New visual modalities are also planned to be integrated, along with a fuzzy fusion module, to select the most reliable modality according to the scene conditions.

VI. ACKNOWLEDGMENTS

The authors wish to express their gratitude to the RTL Television Studio Köln, Germany, for providing the environment, the pictures and the live video sequences for our experiments.

REFERENCES

[1] S. Nair, G. Panin, M. Wojtczyk, C. Lenz, T. Friedelhuber, and A. Knoll, "A multi-camera person tracking system for robotic applications in virtual reality TV studio," in Proceedings of the 17th IEEE/RSJ International Conference on Intelligent Robots and Systems 2008. IEEE, Sep. 2008.
[2] S. Nair, G. Panin, T. Röder, T. Friedelhuber, and A. Knoll, "A distributed and scalable person tracking system for robotic visual servoing with 8 DOF in virtual reality TV studio automation," in Proceedings of the 6th International Symposium on Mechatronics and its Applications (ISMA09). IEEE, Mar. 2009.
[3] I. Haritaoglu, D. Harwood, and L. S. Davis, "W4: A real time system for detecting and tracking people," in CVPR '98: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Washington, DC, USA: IEEE Computer Society, 1998, p. 962.
[4] N. T. Siebel and S. J. Maybank, "Fusion of multiple tracking algorithms for robust people tracking," in ECCV '02: Proceedings of the 7th European Conference on Computer Vision, Part IV. London, UK: Springer-Verlag, 2002, pp. 373-387.
[5] M. Isard and J. MacCormick, "BraMBLe: A Bayesian multiple-blob tracker," in ICCV, 2001, pp. 34-41.
[6] C. R. Wren, A. Azarbayejani, T. Darrell, and A. Pentland, "Pfinder: Real-time tracking of the human body," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 7, pp. 780-785, 1997.
[7] K. Nummiaro, E. Koller-Meier, and L. J. V. Gool, "An adaptive color-based particle filter," Image Vision Comput., vol. 21, no. 1, pp. 99-110, 2003.
[8] P. Pérez, C. Hue, J. Vermaak, and M. Gangnet, "Color-based probabilistic tracking," in ECCV '02: Proceedings of the 7th European Conference on Computer Vision, Part I. London, UK: Springer-Verlag, 2002, pp. 661-675.
[9] D. Comaniciu, V. Ramesh, and P. Meer, "Kernel-based object tracking," IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 5, pp. 564-575, 2003.
[10] G. Welch and G. Bishop, "An introduction to the Kalman filter," Tech. Rep., 2004.
[11] R. Hanek and M. Beetz, "The contracting curve density algorithm: Fitting parametric curve models to images using local self-adapting separation criteria," Int. J. Comput. Vision, vol. 59, no. 3, pp. 233-258, 2004.
[12] G. Panin, A. Ladikos, and A. Knoll, "An efficient and robust real-time contour tracking system," in ICVS, 2006, p. 44.
[13] M. Isard and A. Blake, "Condensation - conditional density propagation for visual tracking," International Journal of Computer Vision (IJCV), vol. 29, no. 1, pp. 5-28, 1998.
[14] P. A. Viola and M. J. Jones, "Robust real-time face detection," in ICCV, 2001, p. 747.
[15] G. Panin, C. Lenz, S. Nair, E. Roth, M. Wojtczyk, T. Friedlhuber, and A. Knoll, "A unifying software architecture for model-based visual tracking," in IS&T/SPIE 20th Annual Symposium of Electronic Imaging, San Jose, CA, January 2008.
