Multicamera Real-Time 3D Modeling for Telepresence and Remote Collaboration
Benjamin Petit, Jean-Denis Lesage, Clément Menier, Jérémie Allard, Jean-Sébastien Franco, Bruno Raffin, Edmond Boyer, François Faure

To cite this version:

Benjamin Petit, Jean-Denis Lesage, Clément Menier, Jérémie Allard, Jean-Sébastien Franco, et al.. Multicamera Real-Time 3D Modeling for Telepresence and Remote Collaboration. International Journal of Digital Multimedia Broadcasting, Hindawi, 2010, Advances in 3DTV: Theory and Practice, Article ID 247108, 12 p. doi:10.1155/2010/247108. inria-00436467v2

HAL Id: inria-00436467 https://hal.inria.fr/inria-00436467v2 Submitted on 6 Sep 2010 (v2), last revised 18 Apr 2012 (v3)


Research Article Multicamera Real-Time 3D Modeling for Telepresence and Remote Collaboration

Benjamin Petit,1 Jean-Denis Lesage,2 Clément Menier,3 Jérémie Allard,4 Jean-Sébastien Franco,5 Bruno Raffin,6 Edmond Boyer,7 and François Faure7

1 INRIA Grenoble, 655 avenue de l'Europe, 38330 Montbonnot Saint Martin, France
2 Université de Grenoble, LIG, 51 avenue Jean Kuntzmann, 38330 Montbonnot Saint Martin, France
3 4D View Solutions, 655 avenue de l'Europe, 38330 Montbonnot Saint Martin, France
4 INRIA Lille-Nord Europe, LIFL, Parc Scientifique de la Haute Borne, 59650 Villeneuve d'Ascq, France
5 Université Bordeaux, LaBRI, INRIA Sud-Ouest, 351 cours de la Libération, 33405 Talence, France
6 INRIA Grenoble, LIG, 51 avenue Jean Kuntzmann, 38330 Montbonnot Saint Martin, France
7 Université de Grenoble, LJK, INRIA Grenoble, 655 avenue de l'Europe, 38330 Montbonnot Saint Martin, France

Correspondence should be addressed to Benjamin Petit, [email protected]

Received 1 May 2009; Accepted 28 August 2009

Academic Editor: Xenophon Zabulis

Copyright © 2010 Benjamin Petit et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

We present a multicamera real-time 3D modeling system that aims at enabling new immersive and interactive environments. This system, called Grimage, retrieves in real-time a 3D mesh of the observed scene as well as the associated textures. This information enables a strong visual presence of the user in virtual worlds. The 3D shape information is also used to compute collisions and reaction forces with virtual objects, enforcing the mechanical presence of the user in the virtual world. The innovation is a fully integrated system with both immersive and interactive capabilities. It embeds a parallel version of the EPVH modeling algorithm inside a distributed vision pipeline. It also adopts the hierarchical component approach of the FlowVR middleware to enforce software modularity and enable distributed executions. Results show high refresh rates and low latencies obtained by taking advantage of the I/O and computing resources of PC clusters. The applications we have developed demonstrate the quality of the visual and mechanical presence with a single platform and with a dual platform that allows telecollaboration.

1. Introduction

Teleimmersion is of central importance for the next generation of live and interactive 3DTV applications. It refers to the ability to embed persons at different locations into a shared virtual environment. In such environments, it is essential to provide users with a credible sense of 3D telepresence and interaction capabilities. Several technologies already offer 3D experiences of real scenes with 3D and sometimes free-viewpoint visualizations, for example, [1–4]. However, live 3D teleimmersion and interaction across remote sites is still a challenging goal. The main reason is found in the difficulty to build and transmit models that carry enough information for such applications. This covers not only visual or transmission aspects but also the fact that such models need to feed 3D physical simulations as required for interaction purposes. In this paper, we address these issues and propose a complete framework allowing the full body presence of distant people in a single collaborative and interactive environment.

The interest of virtual immersive and collaborative environments arises in a large and diverse set of application domains, including interactive 3DTV broadcasting, video gaming, social networking, 3D teleconferencing, collaborative manipulation of CAD models for architectural and industrial processes, remote learning, training, and other collaborative tasks such as civil infrastructure or crisis management. Such environments strongly depend on their ability to build a virtualized representation of the scene of interest, for example, 3D models of users. Most existing systems use 2D representations obtained with mono-camera systems [5–7]. While giving a partially faithful representation of the user, they do not allow for natural interactions, including consistent visualization with occlusions, which require 3D descriptions. Other systems more suitable for 3D virtual worlds use avatars, as, for instance, massively multiplayer games analogous to Second Life. However, avatars only carry partial information about users and, although real-time environments can improve such models and allow for animation, avatars do not yet provide sufficiently realistic representations for teleimmersive purposes.

To improve the sense of presence and realism, models with both photometric and geometric information should be considered. They yield more realistic representations that include user appearances, motions, and even sometimes facial expressions. To obtain such 3D human models, multicamera systems are often considered. In addition to appearance, through photometric information, they can provide a hierarchy of geometric representations from 2D to 3D, including 2D and depth representations, multiple views, and full 3D geometry. 2D and depth representations are viewpoint dependent and, though they enable 3D visualization [8] and, to some extent, free-viewpoint visualization, they are still limited in that respect. Moreover they are not designed for interactions, which usually require full shape information instead of partial and discrete representations. Multiple view representations, that is, views from several viewpoints, overcome some of the limitations of 2D and depth representations. In particular, they increase the free-viewpoint capability when used with view interpolation techniques, for example, [3, 9, 10]. However, interpolated view quality rapidly decreases when new viewpoints distant from the original viewpoints are considered. And similarly to 2D and depth representations, only limited interactions can be expected. In contrast, full 3D geometry descriptions allow unconstrained free viewpoints and interactions as they carry more information. They are already used for teleimmersion [2, 4, 11, 12]. Nevertheless, existing 3D human representations in real-time systems often have limitations such as imperfect, incomplete, or coarse geometric models, low resolution textures, or slow frame rates. This typically results from the complexity of the method applied for 3D reconstruction, for example, stereovision [13] or visual hull [14] methods, and the number of cameras used.

This article presents a full real-time 3D-modeling system called Grimage (http://grimage.inrialpes.fr/) (Figure 1). It is an extended version of previous conference publications [15–19]. The system relies on the EPVH-modeling algorithm [14, 20] that computes a 3D mesh of the observed scene from segmented silhouettes and robustly yields an accurate shape model [14, 20] at real-time frame rates. Visual presence in the 3D world is ensured by texturing this mesh with the photometric data lying inside the extracted silhouettes. The 3D mesh also enables a mechanical presence, that is, the ability for the user to apply mechanical actions on virtual objects. The 3D mesh is plugged into a physics engine to compute the collisions and the reaction forces to be applied to virtual objects. Another aspect of our contribution is the implementation of the pipeline in a flexible parallel framework. For that purpose, we rely on FlowVR, a middleware dedicated to parallel interactive applications [21–23]. The application is structured through a hierarchy of components, the leaves being computation tasks. The component hierarchy offers a high level of modularity, simplifying the maintenance and upgrade of the system. The actual degree of parallelism and the mapping of tasks on the nodes of the target architecture are inferred during a preprocessing phase from simple data like the list of cameras available. The runtime environment transparently takes care of all data transfers between tasks, whether they are on the same node or not. Embedding the EPVH algorithm in a parallel framework makes it possible to reach interactive execution times without sacrificing accuracy. Based on this system we developed several experiments involving one or two modeling platforms.

In the following, we detail the full pipeline, starting with the acquisition steps in Section 2, the parallel EPVH algorithm in Section 3, and the textured-model rendering and the mechanical interactions in Section 4. A collaborative set up between two 3D-modeling platforms is detailed in Section 6. Section 7 presents a few experiments and the associated performance results, before concluding in Section 8.
2. A Multicamera Acquisition System

To generate real-time 3D content, we first need to acquire 2D information. For that purpose we have built an acquisition space surrounded by a multicamera vision system. This section focuses on the technical characteristics needed to obtain an image stream from multiple cameras and to transform it into suitable information for the 3D-modeling step, that is, calibrated silhouettes.

2.1. Image Acquisition. As described previously, the 3D-modeling method we use is based on images. We thus need to acquire video streams from digital cameras. Today digital cameras are commodity components, available from low cost webcams to high-end 3-CCD cameras. Images provided by current webcams proved to be of insufficient quality (low resolution and refresh rates, important optical distortion), which made them unsuitable for our purpose. Consequently we use mid-range firewire cameras acquiring up to 30 fps and 2 megapixel color images. Our acquisition platform is equipped with up to 16 cameras. Each camera is connected to a cluster node dedicated to image processing. A software library is used to control our cameras, managing camera configurations and frame grabbing under Linux. Cameras are set to surround the scene.
The number of cameras required depends on the size of the scene and the complexity of the objects to model, as well as on the quality required for texturing and 3D-modeling. Beyond a certain number of cameras, the accuracy of the model obtained does not significantly improve while the network load increases, resulting in higher latencies. Experiments have shown that 8 cameras is generally a good compromise between the model accuracy and the CPU and network load.

The camera locations in the acquisition space usually depend on the application: one can choose to emphasize a side of the set up to have better texture quality in a particular direction, the user's face for example, or to place more cameras in a certain area to get better models of the arms and hands for interaction purposes.


Figure 1: An illustration of the telepresence application pipeline.

2.2. Synchronization. Dealing with multiple input devices raises the problem of data synchronization. In fact all our applications rely on the assumption that the input images captured from the different cameras are coherent, that is, that they relate to the same scene event. Synchronization information could be recovered directly from silhouettes using their inconsistency over several viewpoints, as suggested in [24]; however, a hardware solution appears to be more practical and efficient in a dedicated environment such as ours. The image acquisition is triggered by an external signal sent directly through the cameras' genlock connector. This mechanism leads to delays between images below 100 microseconds.

2.3. Calibration. Another issue when dealing with multiple cameras is to determine their spatial organization in order to perform geometric computations. In practice we need to determine the position and orientation of each camera in the scene as well as its intrinsic characteristics such as the focal length. This is done through a calibration process that computes the function giving the relationship between real 3D points and 2D-image points for each camera. As with the synchronization step, the silhouette information could also be used to recover calibration information, using for instance [24, 25]. However, practical considerations on accuracy again favor a solution which is specific to our dedicated environment. We perform this step using a software we developed, based on standard off-the-shelf calibration procedures, see for instance [26, 27]. The calibration process consists in sweeping around the scene a wand carrying four lights with known relative positions. Once the lights are tracked through time in each image, a bundle adjustment iteratively lowers the reprojection error of the computed 3D light positions into the original images by adjusting the extrinsic and intrinsic parameters of each camera.

2.4. Background Subtraction. Regions of interest in the images, that is, the foreground or silhouette, are extracted using a background subtraction process. We assume that the scene is composed of a static background, the appearance of which can be learned in advance. As most of the existing techniques [28, 29], we rely on a per-pixel color model of the background. For our purpose, we use a combination of a Gaussian model for the chromatic information (UV) and an interval model for the intensity information (Y), with a variant of the method by Horprasert et al. [28] for shadow detection (Figure 2(b)). A crucial remark here is that the accuracy of the produced 3D model highly depends on this process since the modeling approach is exact with respect to the silhouettes. Notice that a high-quality background subtraction can easily be achieved by using a dedicated environment (blue screen). However, for prospective purposes, we do not limit our approach to such specific environments in our set up.
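As a rough illustration of the per-pixel model described above (a sketch with thresholds and names of ours, not the authors' exact implementation, which follows Horprasert et al. [28]), each YUV pixel can be classified as background, shadow, or foreground:

```cpp
#include <cmath>

// Per-pixel background statistics, learned on images of the empty scene.
struct PixelModel {
    float meanU, meanV;   // mean chroma of the background
    float stdU,  stdV;    // chroma standard deviations
    float minY,  maxY;    // intensity interval of the background
};

enum class Label { Background, Shadow, Foreground };

// Classify one YUV pixel against its background model.
// 'kChroma' is the threshold on normalized chroma deviation,
// 'shadowScale' tolerates darker intensities with unchanged chroma.
Label classify(const PixelModel& m, float y, float u, float v,
               float kChroma = 3.0f, float shadowScale = 0.5f) {
    const float du = (u - m.meanU) / (m.stdU + 1e-6f);
    const float dv = (v - m.meanV) / (m.stdV + 1e-6f);
    const bool chromaMatches = std::sqrt(du * du + dv * dv) < kChroma;

    if (chromaMatches && y >= m.minY && y <= m.maxY)
        return Label::Background;              // same color, same brightness
    if (chromaMatches && y >= shadowScale * m.minY && y < m.minY)
        return Label::Shadow;                  // same color, darker: cast shadow
    return Label::Foreground;                  // silhouette pixel
}
```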


Figure 2: The different steps of the image processing: (a) image acquisition, (b) background subtraction (binary image), and (c) exact silhouette polygon (250 vertices).

2.5. Silhouette Polygonalization. Since our modeling algorithm computes a surface and not a volume, it does not use image regions as defined by silhouettes, but instead their delimiting polygonal contours. We extract such silhouette contours and vectorize them using the method of Debled-Rennesson et al. [30]. Each contour is decomposed into an oriented polygon, which approximates the contour to a given approximation bound. With a single-pixel bound, the obtained polygons are strictly equivalent to the silhouettes in the discrete sense (Figure 2(c)). However, in case of noisy silhouettes this leads to numerous small segments. A higher approximation bound results in significantly fewer segments. This makes it possible to control the model complexity, and therefore the computation time of the 3D-modeling process, in an efficient way.
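For illustration only, the sketch below shows a generic contour simplification in the Douglas-Peucker style; the system actually uses the method of Debled-Rennesson et al. [30], which is not reproduced here. It merely illustrates how an approximation bound (in pixels) trades silhouette fidelity for segment count.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Pt { float x, y; };

// Perpendicular distance from p to the chord (a, b).
static float chordDistance(const Pt& p, const Pt& a, const Pt& b) {
    const float dx = b.x - a.x, dy = b.y - a.y;
    const float len = std::sqrt(dx * dx + dy * dy);
    if (len < 1e-6f) return std::hypot(p.x - a.x, p.y - a.y);
    return std::fabs(dy * (p.x - a.x) - dx * (p.y - a.y)) / len;
}

// Keep a vertex only if it deviates from the current chord by more
// than 'bound'; otherwise the chord replaces the whole sub-contour.
static void simplify(const std::vector<Pt>& in, std::size_t i0, std::size_t i1,
                     float bound, std::vector<Pt>& out) {
    std::size_t worst = i0;
    float worstDist = 0.f;
    for (std::size_t i = i0 + 1; i < i1; ++i) {
        const float d = chordDistance(in[i], in[i0], in[i1]);
        if (d > worstDist) { worstDist = d; worst = i; }
    }
    if (worstDist > bound) {
        simplify(in, i0, worst, bound, out);   // left part, keeps vertex 'worst'
        simplify(in, worst, i1, bound, out);   // right part
    } else {
        out.push_back(in[i1]);                 // chord is a good enough fit
    }
}

// Approximate a pixel contour by an oriented polygon with the given bound.
std::vector<Pt> approximateContour(const std::vector<Pt>& contour, float bound) {
    std::vector<Pt> poly{contour.front()};
    simplify(contour, 0, contour.size() - 1, bound, poly);
    return poly;
}
```

With a sub-pixel bound the polygon follows every pixel step; raising the bound removes the small segments produced by noisy silhouettes, which is the behavior exploited above to bound the 3D-modeling cost.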

Figure 3: Visual hull of a person with 4 views.

3. 3D Modeling

To obtain a 3D geometric model of objects and persons located in the acquisition space, we use a shape-from-silhouette method, which builds a shape model called the visual hull. Shape-from-silhouette methods are well adapted to our context for several reasons. First, they yield shape models as required later in the process, for example, for texture mapping or interaction. Second, they provide such models in real-time. Even though more precise approaches exist, for example, [31–33], most will fail at providing 3D models in real-time over long periods of time and in a robust and efficient way as shape-from-silhouette approaches do. Below, we describe our shape-from-silhouette method.

3.1. Visual Hull. The visual hull is a well-studied geometric shape [34, 35] which is obtained from a scene object's silhouettes observed in n views. Geometrically, the visual hull is the intersection of the viewing cones, the generalized cones whose apices are the cameras' projective centers and whose cross-sections coincide with the scene silhouettes (Figure 3). When considering piecewise-linear image contours for silhouettes, the visual hull becomes a regular polyhedron. A visual hull cannot model concavities but can be efficiently computed and yields a very good human shape approximation.

Our work is based on the exact polyhedral visual hull (EPVH) algorithm [14, 20]. The EPVH algorithm has the particularity of retrieving an exact 3D model, whose projection back into the images coincides with the observed silhouettes. This is an important feature when the models need to be textured as it makes textures, extracted from silhouettes, directly mappable on the 3D model. The method we present here recovers the visual hull of a scene object in the form of a polyhedron. As previously explained, silhouette contours of the scene object are retrieved for each view as a 2D polygon. Such a discrete polygonal description of silhouettes induces a unique polyhedron representation of the visual hull, the structure of which is recovered by EPVH. To achieve this, three steps are performed. First, a particular subset of the polyhedron edges is computed: the viewing edges, which we describe below. Second, all other edges of the polyhedron mesh are recovered by a recursive series of geometric deductions. The positions of vertices not yet computed are gradually inferred from those already obtained, using the viewing edges as an initial set. Third, the mesh is consistently traversed to identify the faces of the polyhedron.
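In set notation (ours, not the article's), with $\pi_i$ the projection of camera $i$ and $S_i$ the silhouette observed in image $i$, the visual hull defined in Section 3.1 is the intersection of the $n$ viewing cones:

$$\mathrm{VH}(S_1,\dots,S_n) \;=\; \bigcap_{i=1}^{n} \mathcal{C}_i, \qquad \mathcal{C}_i \;=\; \{\, X \in \mathbb{R}^3 \;:\; \pi_i(X) \in S_i \,\}.$$

When each $S_i$ is bounded by a polygon, every cone $\mathcal{C}_i$ is polyhedral and so is their intersection, which is the polyhedron recovered by EPVH.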

Figure 4: Viewing edges (in bold) along the viewing line.

Figure 5: The three main steps of the EPVH algorithm: (a) viewing edge computation, (b) mesh connectivity (horizontal slices depict the partitioning used for the parallel version of the algorithm), and (c) face generation.

We first give an overview of the sequential algorithm; for more details refer to [14]. Then we explain how we distribute this algorithm in order to reach real-time performance.

3.1.1. Computing the Viewing Edges. Viewing edges are the edges of the visual hull induced by the viewing lines of contour vertices, see Figure 4. There is one viewing line per silhouette 2D vertex. On each viewing line, EPVH identifies the segments that project inside the silhouettes in all other images. Each segment, called a viewing edge, is an edge of the visual hull and each segment extremity a 3D vertex. Each 3D vertex is trivalent, that is, the intersection point of 3 edges. Higher valences are neglected because they are highly unlikely in practice.

3.1.2. Computing the Visual Hull Mesh. After the first step, the visual hull is not yet complete. Some edges are missing to fulfill the mesh connectivity. Some vertices, called triple points, are also missing. A triple point is a vertex of the visual hull generated from the intersection of three planes defined by silhouette segments from three different images. EPVH completes the visual hull mesh by traveling along 3D edges, as defined by two silhouette edges, as long as these 3D edges project inside all silhouettes. At the limit, that is, when an edge projects onto a silhouette contour, it identifies new triple points or recognizes already computed visual hull vertices. A last step consists in traversing the mesh to identify the polyhedron faces. 3D face contours are extracted by walking through the complete oriented mesh while always taking left turns at each vertex. Orientation data is inferred from silhouette orientations (counterclockwise oriented outer contours and clockwise oriented inner contours).

3.2. Distributed Algorithm

3.2.1. Algorithm. For real-time execution we developed a parallel version of the EPVH algorithm organized as a three-stage pipeline.

Stage 1: Viewing Edges. Let V be the number of threads—each thread being distributed on a different CPU across the cluster's hosts—in charge of computing the viewing edges. The silhouettes extracted by all image processing hosts are broadcast to the V threads. Each thread computes locally the viewing edges for n/V viewing lines, where n is the total number of viewing lines (Figure 5(a)).

Stage 2: Mesh Connection. Let M be the number of threads in charge of computing the mesh. The V threads from the previous step broadcast the viewing edges to the M threads. Each thread is assigned a slice of the space (along the vertical axis, as we are usually working with standing humans) where it computes the mesh. Slices are defined to have the same number of vertices. Each thread completes the connectivity of its submesh, creating triple points when required. The submeshes are then gathered on one host that merges the results, taking care of the connectivity on slice boundaries by removing duplicate edges or adding missing triple points (Figure 5(b)).

Stage 3: Face Identification. The mesh is broadcast to K threads in charge of face identification. The workload is balanced by evenly distributing the set of generator planes among processors (Figure 5(c)).
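The following sketch illustrates how the workload of Stages 1 and 2 could be partitioned (our illustration with simplified data, not the authors' implementation): viewing lines are split into contiguous blocks of about n/V per thread, and the vertical slice boundaries are chosen so that each of the M slices receives roughly the same number of vertices.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Contiguous [begin, end) ranges assigning n viewing lines to V workers.
std::vector<std::pair<std::size_t, std::size_t>>
splitViewingLines(std::size_t n, std::size_t V) {
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    for (std::size_t t = 0; t < V; ++t)
        ranges.push_back({ n * t / V, n * (t + 1) / V });   // ~n/V lines each
    return ranges;
}

// Vertical cut heights (along the up axis) such that each of the M slices
// contains roughly the same number of visual-hull vertices.
std::vector<float> sliceBoundaries(std::vector<float> vertexHeights, std::size_t M) {
    std::sort(vertexHeights.begin(), vertexHeights.end());
    std::vector<float> bounds;                               // M-1 cut heights
    for (std::size_t s = 1; s < M; ++s)
        bounds.push_back(vertexHeights[vertexHeights.size() * s / M]);
    return bounds;
}
```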
3.2.2. Evaluation. Consider an acquisition space surrounded by up to 8 cameras and with 1 or 2 persons. In that case, the algorithm reaches the cameras' refresh rates (tuned between 20 and 30 frames per second) and ensures a latency below 100 milliseconds (including video acquisition and 2D-image processing) with between 4 and 8 processors. While the algorithm is flexible and allows for more processors, this did not prove to significantly increase the performance in this experimental context. More information and results about the parallelization of this algorithm can be found in [15].

Using a large number of cameras raises several issues. The algorithm complexity is quadratic in the number of cameras, quickly leading to unacceptable latencies. Today our efforts focus on using higher resolution cameras rather than significantly more cameras. The algorithm complexity is in n log(n), where n is the maximum number of segments per silhouette, making it more scalable on this parameter. Having more cameras makes sense for large acquisition spaces, where 3D models are computed per subsets of cameras.

3.3. Texture Extraction. We also extract from each silhouette the photometric data that will be used later during the rendering process for photorealistic rendering.

4. Interaction and Visualization

The 3D mesh and the associated textures acquired by the multicamera system are sent over the network to the visualization node and to the physical simulation node that handles interactions (Figure 6). These tasks are detailed below.

Figure 6: Application architecture coupling a multicamera acquisition space and a virtual physical environment.

4.1. Simulation

4.1.1. Collision Based Interactions. Coupling real-time 3D-modeling with a physical simulation enables interaction possibilities that are not symbolic and therefore feel natural to the user. Using the SOFA (http://www.sofa-framework.org/) framework, we developed a distributed simulator that handles collisions between soft or rigid virtual objects and the user's body. Unlike most traditional interactive applications, it allows the use of any part of the body or any accessories seen inside the acquisition space without being invasive. Some interactions are intricate; the prehension of objects, for example, is very difficult as there is no force information linked to the model.
4.1.2. SOFA. SOFA (simulation open framework application) is an open source framework primarily targeted at medical simulation research [36]. Its architecture relies on several innovative concepts, in particular the notion of multimodel representation. In SOFA, most simulation components, for instance, deformable models, collision models or instruments, can have several representations, connected together through a mechanism called mapping. Each representation is optimized for a particular task such as mechanical computations, collision detection or visualization.

Integrating a SOFA simulation in our applications required adding a new component, receiving the stream of 3D meshes modeling the user and packaging it as an additional collision model (Figure 7). The triangulated polyhedron as computed by the 3D-modeling step can directly be used for collision detection. From the physics simulation point of view it is seen as a rigid mesh insensitive to external forces (infinite mass), similar to predefined obstacles such as the floor, with the difference that it is replaced each time a new mesh is received. In order to obtain accurate interactions, the collision response additionally requires the speed and direction of motion at collision points, so that the user can push virtual objects, for example, kicking a ball, instead of only blocking them. We currently provide this information by querying the minimum distance of the current surface point to the previously modeled mesh. This is efficiently implemented by reusing the proximity-based collision detection components in SOFA. The computed distance is an estimation of the user's motion perpendicular to the surface, which is enough to give the colliding objects the right impulsion. However, tangential frictions cannot be captured. The different parameters of the simulated scene, like mass and spring stiffness, are empirically tuned based on a trade-off between real-time constraints and a visibly plausible behavior.

Another key aspect of SOFA is the use of a scene-graph to organize and process the components while clearly separating the computation tasks for their possibly parallel scheduling. This data structure, inspired by classical rendering scene-graphs like OpenSG, is new in physically-based animation. Physical actions such as force accumulation or state vector operations are implemented as traversal actions. This creates a powerful framework for differential equation solvers suitable for single objects as well as complex systems made of different kinds of interacting physical bodies: rigid bodies, deformable solids or fluids.

We use an iterative implicit time integration solver. The maximum number of iterations is tuned to limit the computation time. This creates a trade-off between accuracy and computation time that allows us to reach the real-time constraint without sacrificing stability. Parallel versions of SOFA on multicore processors [37] and on GPU have been developed, allowing rich environments to be simulated interactively.
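The proximity-based motion estimate of Section 4.1.2 can be sketched as follows, assuming for brevity a naive closest-vertex query (the real system reuses SOFA's proximity-based collision detection components rather than this search): the distance from a point of the current mesh to the mesh of the previous frame, divided by the frame period, approximates the user's speed along the surface normal.

```cpp
#include <algorithm>
#include <array>
#include <cmath>
#include <limits>
#include <vector>

using Vec3 = std::array<float, 3>;

static float distance(const Vec3& a, const Vec3& b) {
    const float dx = a[0] - b[0], dy = a[1] - b[1], dz = a[2] - b[2];
    return std::sqrt(dx * dx + dy * dy + dz * dz);
}

// Estimate the normal speed of a point of the current mesh by measuring
// how far it lies from the previous frame's mesh (closest-vertex
// approximation) and dividing by the elapsed time 'dt' (seconds).
float normalSpeed(const Vec3& currentPoint,
                  const std::vector<Vec3>& previousMeshVertices, float dt) {
    float best = std::numeric_limits<float>::max();
    for (const Vec3& v : previousMeshVertices)   // naive O(n) search
        best = std::min(best, distance(currentPoint, v));
    return best / dt;   // used to give colliding objects an impulse
}
```

As noted above, this only captures motion perpendicular to the surface; tangential motion, and hence friction, is not recovered.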


Figure 7: An interactive deformable object, (a) collides with the 3D-reconstructed mesh, (b) allowing interactions with the user in (c).

4.2. Rendering. Data to be rendered, whether provided by the simulation software, a static scene loader, or the 3D-modeling algorithm, are distributed to dedicated rendering nodes. The rendering can be performed on heterogeneous display devices such as standard monitors, multiprojector walls, head-mounted displays or stereoscopic displays.

4.2.1. FlowVR Render. To distribute the rendering part efficiently, we use FlowVR Render [38]. Existing parallel or remote rendering solutions rely on communicating pixels, OpenGL commands, scene-graph changes or application-specific data. We rely on an intermediate solution based on a set of independent graphics primitives that use hardware shaders to specify their visual appearance. Compared to an OpenGL based approach, it reduces the complexity of the model by eliminating most fixed function parameters while giving access to the latest functionalities of graphics cards. It also suppresses the OpenGL state machine that creates data dependencies making primitive reordering and multistream combining difficult.

Using a retained-mode communication protocol transmitting changes between each frame, combined with the possibility to use shaders to implement interactive data processing operations instead of sending final colors and geometry, we are able to optimize the network load. High-level information such as bounding volumes is used to set up advanced schemes where primitives are issued in parallel, routed according to their visibility, merged and reordered when received for rendering. Different optimization algorithms can be efficiently implemented, saving network bandwidth or reducing texture switches for instance.

4.2.2. 3D Model and Texture Mapping. Rendering the 3D model is quite simple as it is already a polygonal surface. To apply the textures extracted from the silhouettes, we use a shader that projects the mesh vertices into the source images consistently with the camera calibration parameters. The exact polyhedral visual hull algorithm guarantees that the 3D model can be projected back onto the original silhouettes with minimal error, a property that leads to a better quality texture mapping. Taking into account the surface normal, the viewing direction, as well as self-occlusions, the pixel shader smoothly combines the contributions from the different cameras. Having access to the full 3D surface enables interactive and unconstrained selection of rendering viewpoints, and yields realistic views of the reconstructed person.
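Using our own notation (the article gives no formula), the texture-mapping shader described above can be summarized as follows, with $P_i$ the projection matrix of camera $i$ obtained from calibration, $I_i$ the corresponding source image, $\mathbf{n}$ the surface normal and $\mathbf{d}_i$ the viewing direction of camera $i$:

$$u_i \simeq P_i\,\tilde{X}, \qquad c(X) \;=\; \frac{\sum_i w_i\, I_i(u_i)}{\sum_i w_i}, \qquad w_i \;=\; v_i(X)\,\max\!\bigl(0,\ \mathbf{n}\cdot\mathbf{d}_i\bigr),$$

where $\tilde{X}$ is the surface point $X$ in homogeneous coordinates and $v_i(X) \in \{0,1\}$ encodes its visibility from camera $i$ (self-occlusions). One plausible weighting is shown; any smooth view-dependent weight achieves the blending behavior described above.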
5. Platform Integration

Coupling the different software components involved in this project and distributing them on the nodes of a PC cluster to reach real-time executions is performed through the FlowVR (http://flowvr.sourceforge.net/) middleware [21, 23], a middleware we developed conjointly with the Grimage project.

FlowVR enforces a modular programming style that leverages software engineering issues while enabling high performance executions on distributed and parallel architectures. FlowVR relies on a dataflow and component-oriented programming approach that has been successfully used for other scientific visualization tools. Developing a FlowVR application is a two-step process. First, modules are developed. Modules are endless loops consuming data on input ports at each iteration and producing new data on output ports. They encapsulate a piece of code, imported from an existing application or developed from scratch. The code can be multithreaded or parallel, as FlowVR supports parallel code coupling. In a second step, modules are mapped on the target architecture and assembled into a network to define how data are exchanged. This network can make use of advanced features, from bounding-box-based routing operations to complex message filtering or synchronization operations.

The FlowVR runtime engine runs a daemon on each node of the cluster. This daemon is in charge of synchronization and data exchange between modules. It hides all networking aspects from modules, making module development easier. Each daemon manages a shared memory segment. Messages handled by modules are directly written and read from this memory segment. If a data exchange is local to a node, it only consists in a pointer exchange, while the daemon takes care of transferring data through the network for internode communications.
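The module structure described above can be pictured with a schematic loop. The types and method names below are placeholders of ours, not the actual FlowVR API; the stub port simply stands in for the daemon's shared-memory channel.

```cpp
#include <queue>
#include <vector>

// Hypothetical placeholders (not FlowVR types): a message is a raw buffer,
// a port is a minimal FIFO standing in for the daemon's shared memory.
struct Message { std::vector<unsigned char> data; };

struct Port {
    std::queue<Message> fifo;
    bool empty() const { return fifo.empty(); }
    Message get() { Message m = fifo.front(); fifo.pop(); return m; }
    void put(const Message& m) { fifo.push(m); }
};

// A module is an endless loop: consume one message per input port and
// produce one message per output port at every iteration.
void silhouetteModule(Port& imageIn, Port& silhouetteOut) {
    while (!imageIn.empty()) {                // loops forever in the real system
        Message frame = imageIn.get();        // next synchronized camera image
        Message silhouette;                   // background subtraction and
                                              // polygonalization would go here
        silhouetteOut.put(silhouette);        // hand over to the 3D-modeling stage
    }
}
```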
The largest FlowVR applications are composed of thousands of modules and connections. To be able to deal with the network design of such applications, FlowVR is based on a hierarchical component model [22]. This model introduces a new kind of component called composite. A composite is designed by assembling other FlowVR components. This hierarchy enables the creation of a set of efficient and reusable patterns or skeletons; for example, one-to-many broadcast or scatter collective communications are encapsulated into communication tree patterns made generic and parametric to be used in various contexts. A compilation step instantiates skeleton parameters to fit the target architecture; for example, in the case of communication trees, the parameters to be set are the tree arity and the mapping of each node. Using description files or parameters, the compilation step unfolds the hierarchical description and produces a flat FlowVR network optimized for the target architecture.

The Grimage network (Figure 8) is an application of a thousand modules and connections developed by assembling this set of skeletons. The network relies on several communication patterns; for example, in the acquisition component, a pipeline is associated to each camera. The 3D reconstruction algorithm needs a strong coherency between these pipelines. To reach real-time execution, the application sometimes needs to discard a metaframe because it will not be able to compute it under real-time constraints. Therefore a pattern is in charge of doing this sampling while keeping the coherency. This pattern synchronizes all pipelines and discards the metaframe in the distributed context.

The FlowVR compilation process enforces the modularity of the application. The hierarchical description of Grimage is totally independent from the acquisition set up and the target architecture. A file describing the architecture and the acquisition set up is used to compile the network. The compilation process creates the appropriate number of pipelines or reconstruction parallel processes based on the architecture description file. Therefore, in case of a set up modification (adding a stereoscopic display, integrating a new SMP node in the PC cluster or modifying the number of cameras), the only change needed is to update the architecture description file. This modularity is critical to the Grimage application, which has been developed over several years by various persons. FlowVR is a key component that made it possible to aggregate and efficiently execute on a PC cluster the various pieces of code involved. The level of modularity achieved significantly eases the maintenance and enhancements of the application.

6. Collaborative Environment

Virtual environments, such as multiplayer games or social network worlds, need a representation of each user, which is often an avatar controlled with a keyboard and a mouse. In contrast, our system virtualizes the user into a model that has the user's geometry and appearance at any instant, hence relaxing the need for control devices and enabling new types of interactions with virtual worlds, as discussed below.

6.1. Visual Presence. From the user's point of view, the sense of presence is drastically improved with an avatar that has the user's shape and aspect instead of those of a purely synthetic avatar taken from a 3D model database. In addition, the avatar moves with respect to the user's body gestures and not according to a preprogrammed set of actions. Different users can therefore recognize themselves and have life-like conversations. Also, emotions can be communicated through facial expressions and body gestures.

6.2. Mechanical Presence. Sharing our appearance is not the only advantage of our environment. 3D meshes can also be used to interact with shared virtual objects. The server managing the virtual environment receives user information (geometric 3D models, semantic actions, etc.), runs the simulation and sends back the transformation of the virtual scene to the users (Figure 9). The dynamic deformable objects are handled by this server while heavy static scenes can be loaded at initialization on each user's rendering node. Such an environment can be used by multiple users to interact together from different locations with the same virtual objects. For each iterative update the physical simulation detects collisions and computes the effect of each user interaction on the virtual world. It is of course impossible to change the state of the input models themselves as there are no force-feedback devices on our platforms. Physically simulated interactions between participants are also impossible for the same reason.

6.3. Remote Presence. Remote site visualization of models requires the transfer of 3D model streams and their associated textures under the constraints of limited bandwidth and minimal latency. The mesh itself is not bandwidth intensive and can easily be broadcast over the network. The textures, one per camera in our current implementation, induce much larger transfers and represent the bulk of the data load. We provide data bandwidth measurements in Section 7 for a particular set up. We do not consider any specific transfer protocol, which is beyond the scope of this work.

The FlowVR middleware handles the synchronization of both texture and mesh streams to deliver consistent geometric and photometric data; that is, the texture stream gathered from the acquisition nodes must be rendered at the same time as the 3D model reconstructed from this same image stream, otherwise visual artifacts would appear. It also prevents network congestion by resampling the streams (discarding some 3D metaframes) in order to send only up-to-date data to the end-user nodes. As the physical simulation only needs the mesh, each site sends it only the meshes, as soon as they are available. We did not experience incoherency issues requiring to enforce a strong time synchronization between meshes.
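As an illustration of the consistency and resampling policy just described (a sketch with types of ours, not the actual FlowVR filters), a receiver can pair mesh and texture messages by the frame number of the source images, keep only the most recent complete pair, and discard everything older:

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <utility>

struct Mesh     { /* vertices, triangles ...   */ };
struct Textures { /* one image per camera ...  */ };

// Pair up mesh and texture streams by source-frame id; stale entries are
// dropped, which implements the resampling that avoids network congestion.
class FrameMatcher {
    std::map<std::uint64_t, Mesh>     meshes_;
    std::map<std::uint64_t, Textures> textures_;
public:
    void pushMesh(std::uint64_t frame, Mesh m)         { meshes_[frame]   = std::move(m); }
    void pushTextures(std::uint64_t frame, Textures t) { textures_[frame] = std::move(t); }

    // Newest frame id present in both streams, if any.
    std::optional<std::pair<Mesh, Textures>> latestConsistentFrame() {
        std::optional<std::uint64_t> bestFrame;
        for (const auto& entry : meshes_)            // maps are ordered by frame id
            if (textures_.count(entry.first)) bestFrame = entry.first;
        if (!bestFrame) return std::nullopt;
        std::pair<Mesh, Textures> out{meshes_[*bestFrame], textures_[*bestFrame]};
        meshes_.erase(meshes_.begin(), meshes_.upper_bound(*bestFrame));       // drop stale meshes
        textures_.erase(textures_.begin(), textures_.upper_bound(*bestFrame)); // drop stale textures
        return out;
    }
};
```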
7. Experiments

We report below on preliminary experiments that were conducted with two platforms located in the same room.

7.1. Practical Set up. The first acquisition platform is built with 8 firewire cameras with 1 MP resolution, allowing an acquisition space of 2 by 2 meters, suitable for a full person. The PC cluster used is composed of 10 dual xeon PCs connected through a gigabit Ethernet network.

Figure 8: The FlowVR flat network of the Grimage application. Nodes are modules and edges communication channels. This network is compiled for 8 cameras and EPVH parallelized on 6 CPUs.

Figure 9: Application architecture for two multicamera acquisition spaces and a virtual physical environment.

The second acquisition platform is a portable version of the first one, with an acquisition space of 1 square meter at table height, used for demonstration purposes. It uses 6 firewire cameras and is suitable for hand/arm interactions. The cluster is built with 6 mini-PCs used for camera acquisition, 1 dual xeon server for computation, and a laptop for supervision. This platform was presented at Siggraph in 2007 [36].

The two platforms are connected by a gigabit Ethernet network using one PC as a gateway between the two platforms. This PC gathers the data from the two platforms and handles the physical simulation.

7.2. Data Estimation. Our 8-camera platform produces 1 MP images, yielding 3 MB images and thus a theoretical 24 MB multiimage frame throughput. In practice the only image data needed for texturing lies inside the silhouettes, which we use to reduce transfer sizes. When one user is inside the acquisition space the silhouettes usually occupy less than 20% of the overall image in a full-body set up. Thus an average multitexture frame takes 4.8 MB. We also need to send the silhouette mask to decode the texture. A multisilhouette mask frame takes about 1 MB. The overall estimated stream is about 5.8 MB. To decrease the needed bandwidth we decided to use the full resolution of the camera for 3D model computation but only half the resolution for texture mapping, reducing the full multitexture frame to a maximum of 1.45 MB to transfer at each iteration. The mesh itself represents less than 80 KB (about 10000 triangles).

Running at 20 frames per second, which is reasonable for good interactions, the dual platform requires a 29 MB/second bandwidth for 3D frame streaming, which is easily scalable to a Gigabit Ethernet network (120 MB/s).

7.3. Results. We are able to acquire images and to generate the 3D meshes at 20 fps on each platform. The simulation and the rendering processes run at 50–60 fps and 50–100 fps respectively, depending on the load of the system. As they run asynchronously from the 3D model and texture generation, we need to resample the mesh and the texture streams independently. In practice the mesh and texture transfer between sites oscillates between 15 fps and 20 fps, depending on the size of the silhouette inside the images. Meanwhile the transfer between the 3D-modeling and the rendering node inside a platform and the transfer going to the simulation node always run at 20 fps. We do not experience any extra connection latency between the two platforms. During execution, the application does not overload the gigabit link.

The accuracy of the model obtained using EPVH is satisfactory both for the visual experience and for the physical simulation precision. The level of detail of the model is good enough to distinguish the user's fingers. Our application is robust to input noise: the obtained 3D model is watertight (no holes) and manifold (no self intersections). It is also robust to network load changes as the transmitted data can be resampled to avoid latency.

We did not conduct any user study about the sense of presence achieved through this system (Figure 10). However, the numerous users that experienced the system, in particular during the Emerging Technology show at Siggraph 2007 [36], were generally impressed by the quality of the visual and mechanical presence achieved without requiring handheld devices, markers or per-user calibration steps. Interaction was intuitive, often requiring no explanation, as it was direct, full-body and relying on physical paradigms that somehow mimicked a common real world experience.


Figure 10: (a) the 3D virtual environment with a "full-body" user and a "hand" user, interacting together with a virtual puppet, and (b) the "hand-size" acquisition platform.

This positive feedback was achieved despite the non immersive display used (a 2D display located 75 cm in front of the user) and the third-person visualization. Future work will focus on associating Grimage with an immersive visualization environment such as an HMD to enable first-person visualization and allow for better depth perception.

A similar experiment was showcased at VRST 2008 in Bordeaux [39], involving two "hand-size" platforms in the same room. Visitors could see each other's hands immersed in the same virtual room. A virtual puppet was animated by the physics simulation. Each user could push or grab the puppet. They could also try to collaborate, for instance to grab the puppet using two hands, one from each user. No direct user-to-user physical interaction was possible: the meshes would simply intersect each other when both hand positions superpose in the virtual world. The videos (http://grimage.inrialpes.fr/telepresence/) of our experiments give a good overview of the sense of presence achieved.

8. Conclusion

We presented in this article the full Grimage 3D-modeling system. It adopts a software component-oriented approach offering a high level of flexibility to upgrade parts of the application or reuse existing components in different contexts. The execution environment supports the distribution of these components on the different nodes of a PC cluster or grid. We can thus harness distributed I/O and computing resources to reach interactive execution times but also to build multiplatform applications. 3D-modeling relies on the EPVH algorithm that computes from 2D images a 3D mesh corresponding to the visual hull of the observed scene. We also retrieve photometric data further used for texturing the 3D mesh. Experiments show that Grimage is suitable for enforcing the visual and mechanical presence of the modeled users.

The actual state of development shows some limitations. For instance we do not extract a complete velocity field on the mesh surface; our algorithm only provides an estimation of the normal velocity and does not provide any tangential velocity. This lack of data limits the range of possible mechanical interactions. As a consequence, the user can modulate the force applied to a given virtual object but has difficulties keeping an object on his hand or grabbing anything.

The first steps of the vision pipeline are crucial for the accuracy of the final 3D model. We experienced that a higher quality background subtraction could significantly improve the 3D mesh. We are working on advanced background subtraction algorithms using fine-grain parallel processing to keep the computation time low.

This paper includes a preliminary experiment with a dual platform. We are today conducting telepresence and collaboration experiments between distant sites, each one having its own multicamera environment. This context will require further optimizations to control the amount of data exchanged between sites to keep an acceptable latency.

Acknowledgment

This work was partly funded by Agence Nationale de la Recherche, contract ANR-06-MDCA-003.
References

[1] P. J. Narayanan, P. W. Rander, and T. Kanade, "Constructing virtual worlds using dense stereo," in Proceedings of the 6th International Conference on Computer Vision, pp. 3–10, Bombay, India, 1998.
[2] M. Gross, S. Würmlin, M. Naef, et al., "Blue-c: a spatially immersive display and 3D video portal for telepresence," ACM Transactions on Graphics, vol. 22, no. 3, pp. 819–827, 2003.
[3] W. Matusik and H. Pfister, "3D TV: a scalable system for real-time acquisition, transmission, and autostereoscopic display of dynamic scenes," ACM Transactions on Graphics, vol. 23, no. 3, pp. 814–824, 2004.
[4] G. Kurillo, R. Bajcsy, K. Nahrsted, and O. Kreylos, "Immersive 3D environment for remote collaboration and training of physical activities," in Proceedings of the IEEE Virtual Reality Conference, pp. 269–270, 2008.
[5] L. Gharai, C. S. Perkins, R. Riley, and A. Mankin, "Large scale video conferencing: a digital amphitheater," in Proceedings of the 8th International Conference on Distributed Multimedia Systems, San Francisco, Calif, USA, September 2002.

[6] H. Baker, D. Tanguay, I. Sobel, et al., "The coliseum immersive teleconferencing system," in Proceedings of the International Workshop on Immersive Telepresence, Juan Les Pins, France, December 2002.
[7] D. Nguyen and J. Canny, "MultiView: spatially faithful group video conferencing," in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 799–808, 2005.
[8] Philips 3D Solutions, "WoWvx technology".
[9] E. Chen and L. Williams, "View interpolation for image synthesis," in Proceedings of the 20th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH '93), pp. 279–288, Anaheim, Calif, USA, 1993.
[10] C. L. Zitnick, S. B. Kang, M. Uyttendaele, S. Winder, and R. Szeliski, "High-quality video view interpolation using a layered representation," in Proceedings of the International Conference on Computer Graphics and Interactive Techniques (SIGGRAPH '04), pp. 600–608, 2004.
[11] J. Mulligan and K. Daniilidis, "Real time trinocular stereo for tele-immersion," in Proceedings of the IEEE International Conference on Image Processing, vol. 3, pp. 959–962, 2001.
[12] P. Kauff and O. Schreer, "An immersive 3D videoconferencing system using shared virtual team user environments," in Proceedings of the International Conference on Collaborative Virtual Environments, pp. 105–112, 2002.
[13] D. Scharstein and R. Szeliski, "A taxonomy and evaluation of dense two-frame stereo correspondence algorithms," International Journal of Computer Vision, vol. 47, no. 1–3, pp. 7–42, 2002.
[14] J.-S. Franco and E. Boyer, "Efficient polyhedral modeling from silhouettes," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 3, pp. 414–427, 2009.
[15] J.-S. Franco, C. Ménier, E. Boyer, and B. Raffin, "A distributed approach for real time 3D modeling," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshop (CVPRW '04), p. 31, June 2004.
[16] J. Allard, E. Boyer, J.-S. Franco, C. Ménier, and B. Raffin, "Marker-less real time 3D modeling for virtual reality," in Immersive Projection Technology, 2004.
[17] J. Allard, J.-S. Franco, C. Ménier, E. Boyer, and B. Raffin, "The GrImage platform: a mixed reality environment for interactions," in Proceedings of the 4th IEEE International Conference on Computer Vision Systems (ICVS '06), pp. 46–52, 2006.
[18] J. Allard, C. Ménier, B. Raffin, E. Boyer, and F. Faure, "GrImage: markerless 3D interactions," in Proceedings of the International Conference on Computer Graphics and Interactive Techniques (SIGGRAPH '07), San Diego, Calif, USA, 2007.
[19] J. Allard, C. Ménier, E. Boyer, and B. Raffin, "Running large VR applications on a PC cluster: the FlowVR experience," in Immersive Projection Technology, 2005.
[20] J.-S. Franco and E. Boyer, "Exact polyhedral visual hulls," in Proceedings of the British Machine Vision Conference, vol. 1, pp. 329–338, 2003.
[21] J. Allard, V. Gouranton, L. Lecointre, et al., "FlowVR: a middleware for large scale virtual reality applications," in Euro-Par Parallel Processing, vol. 3149 of Lecture Notes in Computer Science, pp. 497–505, Springer, Berlin, Germany, 2004.
[22] J.-D. Lesage and B. Raffin, "A hierarchical component model for large parallel interactive applications," The Journal of Supercomputing, vol. 7, no. 1, pp. 67–80, 2008.
[23] J.-D. Lesage and B. Raffin, "High performance interactive computing with FlowVR," in Proceedings of the IEEE Virtual Reality SEARIS Workshop, pp. 13–16, 2008.
[24] S. N. Sinha and M. Pollefeys, "Synchronization and calibration of camera networks from silhouettes," in Proceedings of the 17th International Conference on Pattern Recognition (ICPR '04), vol. 1, pp. 116–119, Cambridge, UK, August 2004.
[25] E. Boyer, "On using silhouettes for camera calibration," in Proceedings of the 7th Asian Conference on Computer Vision (ACCV '06), vol. 3851 of Lecture Notes in Computer Science, pp. 1–10, Hyderabad, India, January 2006.
[26] A. Hilton and J. Mitchelson, "Wand-based multiple camera studio calibration," Tech. Rep. VSSP-TR-2, CVSSP, 2003.
[27] Z. Zhang, "Camera calibration with one-dimensional objects," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 7, pp. 892–899, 2004.
[28] T. Horprasert, D. Harwood, and L. S. Davis, "A statistical approach for real-time robust background subtraction and shadow detection," in Proceedings of the 7th IEEE International Conference on Computer Vision (ICCV '99), vol. 99, pp. 1–19, Kerkyra, Greece, September 1999.
[29] G. K. M. Cheung, T. Kanade, J.-Y. Bouguet, and M. Holler, "Real time system for robust 3D voxel reconstruction of human motions," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 714–720, 2000.
[30] I. Debled-Rennesson, S. Tabbone, and L. Wendling, "Fast polygonal approximation of digital curves," in Proceedings of the International Conference on Pattern Recognition, vol. 1, pp. 465–468, 2004.
[31] S. M. Seitz, B. Curless, J. Diebel, D. Scharstein, and R. Szeliski, "A comparison and evaluation of multi-view stereo reconstruction algorithms," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), vol. 1, pp. 519–526, New York, NY, USA, June 2006.
[32] D. Vlasic, I. Baran, W. Matusik, and J. Popović, "Articulated mesh animation from multi-view silhouettes," ACM Transactions on Graphics, vol. 27, no. 3, article 97, 2008.
[33] J. Gall, C. Stoll, E. de Aguiar, C. Theobalt, B. Rosenhahn, and H.-P. Seidel, "Motion capture using joint skeleton tracking and surface estimation," in Proceedings of the International Conference on Computer Vision and Pattern Recognition (CVPR '09), Miami, Fla, USA, June 2009.
[34] A. Laurentini, "The visual hull concept for silhouette-based image understanding," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 16, no. 2, pp. 150–162, 1994.
[35] S. Lazebnik, E. Boyer, and J. Ponce, "On how to compute exact visual hulls of object bounded by smooth surfaces," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 156–161, 2001.
[36] J. Allard, S. Cotin, F. Faure, et al., "SOFA—an open source framework for medical simulation," in Medicine Meets Virtual Reality, pp. 1–6, 2007.
[37] E. Hermann, B. Raffin, and F. Faure, "Interactive physical simulation on multicore architectures," in Proceedings of the Symposium on Parallel Graphics and Visualization (EGPGV '09), pp. 1–8, Munich, Germany, March 2009.

[38] J. Allard and B. Raffin, "A shader-based parallel rendering framework," in Proceedings of the IEEE Visualization Conference, pp. 127–134, 2005.
[39] B. Petit, J.-D. Lesage, J.-S. Franco, E. Boyer, and B. Raffin, "Grimage: 3D modeling for remote collaboration and telepresence," in Proceedings of the ACM Symposium on Virtual Reality Software and Technology (VRST '08), pp. 299–300, 2008.