
MoveBox: Democratizing MoCap for the Rocketbox Library

Mar Gonzalez-Franco1, Zelia Egan1, Matthew Peachey2, Angus Antley1, Tanmay Randhavane1, Payod Panda1, Yaying Zhang1, Cheng Yao Wang4, Derek F. Reilly2, Tabitha C Peck3, Andrea Stevenson Won4, Anthony Steed5 and Eyal Ofek1

1Microsoft Research, Redmond, WA, USA; 2Dalhousie University, Halifax, Canada; 3Davidson College, Davidson, NC, USA; 4Cornell University, Ithaca, NY, USA; 5University College London, London, UK. [email protected]

Fig. 3. A selection of avatars animated with the MoveBox system. a) A person performing live MoCap with an Azure Kinect onto a Microsoft Rocketbox avatar. b) Playback of two avatar animations created with MoveBox to represent a social interaction. c) A crowd of avatars dancing, powered by MoCap.

Abstract—This paper presents MoveBox, an open sourced toolbox for animating motion captured (MoCap) movements onto the Microsoft Rocketbox library of avatars. Motion capture is performed in real-time using a single depth sensor, such as Azure Kinect or Windows Kinect V2, or extracted offline from existing RGB videos leveraging deep-learning computer vision techniques. Our toolbox enables real-time animation of the user's avatar by converting the transformations between systems that have different joints and hierarchies. Additional features of the toolbox include recording, playback and looping of animations, as well as basic audio lip sync, blinking and resizing of avatars, as well as finger and hand animations. Our main contribution is both the creation of this open source tool and the validation on different devices and discussion of MoveBox's capabilities by end users.

Index Terms—Avatars, Animation, Motion Capture, MoCap, Rigging, Depth Cameras, Lip Sync

I. INTRODUCTION

Avatars are of increasing importance for social research and social Virtual Reality (VR) applications [1]. Avatars are needed to interact with others inside social virtual environments (VEs), and are the basis for creating realistic stories inside VEs [2]. While some avatars may represent intelligent agents that synthesize their motion from prerecorded motion capture (MoCap) [3], others may be a real-time virtual representation of real people [4]. These self-avatars are a representation of a user within head mounted display (HMD) worn VR and are fundamental to social-science embodiment research [5]–[8].

One of the main challenges in creating realistic avatars and self-avatars is the need to animate them so that their motions replicate the user's motions [9]–[11]. Previous research and development has developed a variety of sensors, tools and algorithms for animating avatars. A hindrance to this work has been the lack of a common set of open source data and tools. This is partly addressed by the Microsoft Rocketbox library [12].

Animation of avatars is often performed through motion capture (MoCap) systems that capture user motions in VR. MoCap is the process of sensing a person's pose and movement. Real-time motion capture of users is usually limited to measuring the position and orientation of the user's head, measured by the HMD, and the position and orientation of the left and right palms, measured by hand-held trackable controllers. Controllers may include a set of input buttons and triggers, as well as additional capacitive sensors for recovery of all of the finger motions in the vicinity of the controllers. The tracking may be done using grounded sensors or beacons such as light emitters, cameras, or magnetic sensors [9].

A common technique to extrapolate limited sensing to a full body animation is through Inverse Kinematics (IK) [13], [14]. IK works best for joints between known locations and thus poses a challenge for joints that are far from the head and the hands, such as the hips, legs or feet. To enable animation of the feet, IK usually applies heuristics such as generating walking steps based on the global movement of the user's head, or using pre-recorded animations.

High quality, non real-time animations are captured using professional motion capture systems. In systems such as Optitrack and Vicon, multiple high frame rate cameras are used to capture a professional actor's performance. The resulting motions recorded by such systems are of high quality and can be used in projects such as motion picture productions and AAA games. However, these systems are expensive and require complex setup, which makes them inaccessible for many researchers, artists, small content creators, and users. Furthermore, once a motion is captured, there are still difficulties in applying the motion to the desired avatar. Different motion capture and avatar systems may represent the skeletal structure in different ways, with variations in the number of joints and the topology of bones [11]. Large productions employ artists and technicians familiar with each system to ensure the pipeline from capturing to rendering will perform as needed. However, there is a need for a simpler pipeline for a range of other use cases, for example by researchers in fields such as psychology or sociology wishing to use avatars as tools to test new theories (e.g. [3], [15]).

In this paper, we present an open source toolbox, MoveBox, that uses a single sensor—specifically a depth camera such as the Microsoft Azure Kinect or Kinect V2—to record the full body motion of a user. The captured motion can be applied in real-time to an avatar in Unity 3D. Envisioning the needs of researchers, artists, and others for such a toolbox, we have added capabilities beyond the main skeletal animation using a Kinect depth camera, to help those that want to create convincing animated avatars without a sophisticated MoCap production studio. This includes using audio input to generate approximate lip motion on the avatars, synchronized voice recording, seamless recording and playback of animations, and scaling avatars to fit the body proportions of the user. We support the Microsoft RocketBox avatar library [16], which is freely available to the research community. It contains humans of different genders, races, and everyday occupations.
II. RELATED WORK

A. Motion Capture

1) Body Capture: Most existing motion capture (MoCap) systems are based on capturing the actor's performance using multiple sensors, to enable robust capture of all the skeleton joints' positions regardless of joint occlusion from some of the sensors [9]. The accuracy of the system is dependent on precise calibration of the relative position and orientation of the sensors. An alternative is suits with Inertial Measurement Unit (IMU) arrays, but these suffer from translational drift [17] and require more setup time, i.e. putting the suit on.

Different companies and researchers have attempted to use low cost trackers, such as a multitude of Vive Trackers, as a base for motion capture [18]. However, these systems are targeted at the professional market with larger budgets. Liu et al. [19] developed a system that uses six wearable sensors; however, even this minimal number is still cumbersome for smaller productions that need to keep the sensors charged and wear them in precise locations on each actor's body. Additionally, their system uses proprietary hardware and the software used is not available to the public. Ahuja et al. [20] suggested a set of spherical mirrors and cameras mounted on the user's HMD to capture the user's motion. While limited only to people that wear HMDs, the first person view of the body reduces the quality of the tracking.

Compared to other lightweight body tracking solutions, we have found Azure Kinect to be more robust to occlusion, partially due to the underlying Deep Neural Network (DNN) body model that the data is being fit to. Although it is possible to integrate multiple Azure Kinects in a Unity application, for this toolbox we aim for a simple system that can be easily set up. We therefore simplified the system to use one single depth camera: either the Azure Kinect or the older model Microsoft Kinect V2. While limited to a single point of view, the camera does not need to be calibrated, be rigged to other cameras, or maintain temporal synchronization with other cameras, thus enabling a quick and easy setup.

2) Facial Capture: There are several approaches to the construction of facial performance capture for virtual models. A common approach is camera-based solutions that employ computer vision algorithms to calculate the three-dimensional mesh of the face. These might require a camera system with a depth sensor [21], multi-view cameras [22], or binocular stereo [23]. Some solutions also employ deep learning techniques to estimate the mesh [24].

In addition to camera-based solutions, some audio-based facial animation solutions also exist. Lip-sync libraries typically estimate the visemes from the phonemes detected in the audio stream [25], [26]. Some solutions like the SALSA suite [25] can also analyze the speech to predict emotions, and apply appropriate facial animations from this prediction.

It is important to note that if the user of the MoveBox system wears a VR headset, then their facial features might be obscured.
B. Character Engines

Motion capture systems that interface with avatars for end users are critical to the adoption of more human-like avatars for research and social VR [27]. Previous work in the area of character engines that support MoCap, like HALCA [28] and HUMAN [11], [29], has aimed at converting MoCap from different sources into avatars and robot skeletons. Work by Gillies and Spanlang comparing real-time character engines for virtual environments proposed four metrics by which to evaluate character engines: 1) photo-realism, 2) movement realism, 3) interaction, and 4) control [30]. Currently, the most widely used game engines, Unity and Unreal, both provide industry quality for these four features if interaction is limited to the triggering of transitions in a finite state machine. Each has plugins that allow streaming live motion capture data into an application. Such data can serve multiple purposes. First, it could be used to drive a character in a social mixed reality application. Second, it could be captured and used to implement motion synthesis algorithms such as motion matching [31]. Third, it could be used in a data set for Deep Learning [32], such as learning motions through reinforcement learning.

However, for social mixed-reality applications, live body-tracking data can be used to drive IK. IK targets can overlay a base animation generated by an animation graph that is implemented with a finite state machine such as the Mecanim Humanoid animation system in Unity. However, this setup can also be complicated to configure, and it is a drawback for recording from real-time motion capture systems that plug in to Unity, especially if they do not readily integrate with the Mecanim Humanoid animation system.

A common implementation for the situation where there is only head and hand tracking is to blend the user motions with a base animation for the full body. One problem with this approach is that when the movement of the head triggers an animation, for example the walk animation, the result can be an unnatural scuttling of the avatar's feet. Additional fixes are needed to have both leaning and walking, for example having additional feet trackers [33].
However, each of them have to Unity, specially if they do not readily integrate with the different systems of coordinates and bone structures, which Mecanim Humanoid animation system. can make the conversion and calibration between the tools A common implementation for the situation where there and the avatars somewhat complex. is only head and hand tracking, is to blend the user motions Therefore, once the body-tracked data is read into our with a base animation for the full body. One problem with this system, we process that data in order for it to be compatible approach is that when the movement of the head triggers an with the skeletal structure of the Rocketbox avatars (Figure 5). animation, for example the walk animation, the result can be a unnatural scuttling of the avatar’s feet. And additional fixes are needed to have both leaning and walking, for example having additional feet trackers [33].

A. Implementation

1) Real-time Motion Capture: MoCap data is input from either the Azure Kinect or Kinect V2, both of which are capable of tracking, at 30Hz, three-dimensional body data via their respective depth cameras. These cameras were selected mainly because their software development kits (SDKs) provide body tracking features. However, each of them has different systems of coordinates and bone structures, which can make the conversion and calibration between the tools and the avatars somewhat complex. Therefore, once the body-tracked data is read into our system, we process that data in order for it to be compatible with the skeletal structure of the Rocketbox avatars (Figure 5).

Fig. 5. LEFT: Real-time body animation. In red, the input (either from Azure Kinect or Kinect V2). On the side, the reconstruction of the motion on the Microsoft Rocketbox avatar. RIGHT: improvement of the Kinect V2 animation based on the original roll data of the avatar. a) Pitch and yaw rotations are correctly recovered by Kinect V2. b) However, by default there is no consistent roll information. c) MoveBox recovers the original roll of the avatar to at least ensure a stable and valid reconstruction. See supplementary video.

To do so, we extract each body frame of data from Azure Kinect or Kinect V2 at a rate of 30 frames per second. Each instance of a body frame contains data about several joints in the human body. Poses of avatars are represented by their skeletal structure, where each joint has its 3 or 6 degrees of freedom (DOF) transformation (rotations and sometimes position) from its parent in the skeletal structure. In total, the Microsoft Rocketbox avatars have 53 body bones and 28 facial bones. However, neither Kinect V2 nor Azure Kinect provides that much information (see more details in the apparatus section). Hence the need to convert between the different skeletal structures and poses.

The data measurements that we are most interested in are position and orientation, which are represented as a three dimensional (x, y, z) coordinate and a four dimensional quaternion (w, x, y, z) respectively. Position simply measures where in 3D space each joint is located during a particular frame, whereas the orientation measurement returns the relative rotation of each joint.

On system awake, the position and orientation of every tracked joint in the avatar (25 in Kinect V2, and 32 in Azure Kinect) is stored for later reference by the animation rendering script. The skeletal structures of avatars and tracking systems are rarely the same. This is true for the RocketBox avatars and the Kinect trackers. Each frame, the Kinect joint data is captured and then mapped to the corresponding joint in the RocketBox avatar skeleton in real-time.

After mapping the data to a Microsoft Rocketbox compatible format, the animation is rendered by looping through every joint and updating its position and orientation based on the current frame's values. In order to ensure that proper orientation of avatar limbs is maintained, a reference is kept for each joint's parent joint so that absolute rotation values (q_{absolute}) can be calculated via quaternion arithmetic:

q_{absolute} = \prod_{i=root}^{k} q_i

Finally, the rotations between frames are interpolated using exponential smoothing. This smoothing is necessary to account for the difference in frame rate between the input cameras, often at only 30Hz, and the display, possibly at the 90Hz rendering frame rate of most VR HMDs.
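The following Unity C# sketch illustrates the per-frame retargeting and smoothing step described above, under simplifying assumptions: a jointMap dictionary from tracker joint ids to Rocketbox bone Transforms, and a hypothetical ITrackerFrame interface standing in for the Kinect body frame. Names are illustrative, not MoveBox's actual API.

```csharp
using System.Collections.Generic;
using UnityEngine;

// Hypothetical abstraction over an Azure Kinect / Kinect V2 body frame.
public interface ITrackerFrame
{
    // Returns the tracked global orientation of a joint, by tracker joint id.
    Quaternion GetJointOrientation(int trackerJointId);
}

public class AvatarRetargeter : MonoBehaviour
{
    // Maps tracker joint ids to the corresponding Rocketbox bone Transforms.
    public Dictionary<int, Transform> jointMap = new Dictionary<int, Transform>();

    // Exponential smoothing factor bridging 30Hz input and ~90Hz rendering.
    [Range(0f, 1f)] public float smoothing = 0.25f;

    public void ApplyFrame(ITrackerFrame frame)
    {
        foreach (var pair in jointMap)
        {
            Quaternion tracked = frame.GetJointOrientation(pair.Key);
            // Interpolate toward the new orientation instead of snapping to it,
            // which hides the frame-rate mismatch between camera and display.
            pair.Value.rotation = Quaternion.Slerp(pair.Value.rotation, tracked, smoothing);
        }
    }

    // Absolute rotation of a bone accumulated from the root down the hierarchy,
    // mirroring q_absolute = prod_{i=root}^{k} q_i.
    public static Quaternion AbsoluteRotation(Transform bone, Transform root)
    {
        var chain = new List<Transform>();
        // Walk up from the bone to the root collecting the joints...
        for (Transform t = bone; t != null; t = t.parent)
        {
            chain.Add(t);
            if (t == root) break;
        }
        // ...then multiply their local rotations in root-to-bone order.
        Quaternion q = Quaternion.identity;
        for (int i = chain.Count - 1; i >= 0; i--)
            q = q * chain[i].localRotation;
        return q;
    }
}
```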
2) Motion Recovery from Existing Videos: MoveBox includes an external tool for 3D multi-person human pose estimation from RGB videos. We utilized a deep-learning based approach open sourced as VIBE [35], which trains a temporal model to predict the parameters of the SMPL body model [36] for each frame while a motion discriminator tries to distinguish between real and regressed sequences. The output of the model is a sequence of pose and shape parameters in the SMPL body model format. To animate RocketBox avatars with the predicted 3D poses, the toolbox first extracts the joint data from the pose parameters, computes the transformation between the SMPL and RocketBox skeleton structures, and then maps it to the corresponding joint in the RocketBox avatar skeleton, as shown in Figure 6.

Fig. 6. 3D multi-person pose estimation from RGB videos. (a) Predicted SMPL models overlaid on the video. (b) Animating Microsoft Rocketbox avatars with the predicted 3D pose. (c) Non-optimal results in videos where one third of the body, or half of the body, is occluded.

Although this video extraction does not work in real-time, it provides a good alternative and can prove to be a critical feature to allow the recreation of scenes from archival footage or other content that has been recorded in real life, helping to transfer them to VR. This type of footage usage can be of great advantage to researchers in Immersive Journalism [37], who want to recreate scenes, but also for anthropological and social studies, in which, for example, the bystander effect during a bar fight needs to be recreated [15].
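A minimal sketch of the offline playback step is shown below, assuming the extracted per-frame joint rotations have already been converted to Unity's coordinate convention and exported as dictionaries keyed by SMPL joint name. The SMPL-to-Rocketbox name mapping and bone names are illustrative assumptions, not the exact format used by VIBE or MoveBox.

```csharp
using System.Collections.Generic;
using UnityEngine;

// Plays back a sequence of per-joint rotations recovered offline (e.g. from an
// RGB video) on a Rocketbox avatar.
public class OfflinePosePlayback : MonoBehaviour
{
    public Transform avatarRoot;            // root of the Rocketbox avatar hierarchy
    public float framesPerSecond = 30f;

    // Example (partial, assumed) name mapping between SMPL joints and avatar bones.
    public Dictionary<string, string> smplToAvatarBone = new Dictionary<string, string>
    {
        { "Pelvis",     "Pelvis" },
        { "Spine1",     "Spine1" },
        { "L_Shoulder", "L UpperArm" },
        { "R_Shoulder", "R UpperArm" },
        // ... remaining joints
    };

    // frames[f][smplJointName] = local rotation for that joint in frame f.
    public List<Dictionary<string, Quaternion>> frames;

    float clock;

    void Update()
    {
        if (frames == null || frames.Count == 0) return;
        clock += Time.deltaTime * framesPerSecond;
        var frame = frames[Mathf.FloorToInt(clock) % frames.Count];

        foreach (var pair in smplToAvatarBone)
        {
            if (!frame.TryGetValue(pair.Key, out Quaternion rot)) continue;
            Transform bone = FindBone(avatarRoot, pair.Value);
            if (bone != null) bone.localRotation = rot;
        }
    }

    // Linear search by name; fine for a sketch, a real tool would cache this map.
    static Transform FindBone(Transform root, string name)
    {
        foreach (Transform t in root.GetComponentsInChildren<Transform>())
            if (t.name.EndsWith(name)) return t;
        return null;
    }
}
```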
3) Lip Sync: One common challenge with avatar animations is the incorporation of facial animation. Even in professional settings, facial MoCap is often done separately from body capture.

Some solutions for general users leverage RGB data from regular cameras and computer vision tools that already have facial expression recognition built in, such as OpenPose [38]. This could be translated to the bone structure of RocketBox, or used to drive additional blendshapes.

However, the use of cameras for facial animation is challenged when users are wearing an HMD in VR. There have been solutions to instrument HMDs [39], but these are not universal for end users of the library. In that line, a much simpler solution is to use audio input to generate less accurate but meaningful lip synchronization [34], which will nonetheless work even if users are wearing an HMD.

The Lip Synchronization feature in MoveBox retrieves audio information from the microphone, or an audio file, and animates the mouth of the avatar to open and close in real-time. It uses the spectral frequency of the microphone input to proportionally move two joints on the rigged face: the jaw and the upper lip. The result, although less impressive than using blendshapes with phonemes, gives a compelling experience.
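A minimal Unity C# sketch of this kind of audio-driven mouth animation follows: the louder the signal, the wider the jaw opens, with the upper lip nudged proportionally. Bone names, angle ranges and gains are illustrative assumptions rather than MoveBox's actual parameters.

```csharp
using UnityEngine;

// Audio-amplitude-driven mouth animation: opens the jaw and moves the upper
// lip proportionally to the loudness of the attached AudioSource.
[RequireComponent(typeof(AudioSource))]
public class SimpleLipSync : MonoBehaviour
{
    public Transform jawBone;          // the avatar's jaw joint (assumed name)
    public Transform upperLipBone;     // the avatar's upper-lip joint (assumed name)
    public float maxJawAngle = 12f;    // degrees of jaw opening at full volume
    public float gain = 8f;            // scales raw RMS amplitude into [0, 1]
    public float smoothing = 0.3f;

    AudioSource source;
    readonly float[] samples = new float[256];
    float openAmount;
    Quaternion jawRest, lipRest;

    void Start()
    {
        source = GetComponent<AudioSource>();
        jawRest = jawBone.localRotation;
        lipRest = upperLipBone.localRotation;
        // For live capture, the AudioSource clip can be fed from Microphone.Start(...).
    }

    void Update()
    {
        // Root-mean-square amplitude of the most recent output buffer.
        source.GetOutputData(samples, 0);
        float sum = 0f;
        foreach (float s in samples) sum += s * s;
        float rms = Mathf.Sqrt(sum / samples.Length);

        float target = Mathf.Clamp01(rms * gain);
        openAmount = Mathf.Lerp(openAmount, target, smoothing);

        // Rotate the jaw open and nudge the upper lip proportionally to loudness.
        jawBone.localRotation = jawRest * Quaternion.Euler(openAmount * maxJawAngle, 0f, 0f);
        upperLipBone.localRotation = lipRest * Quaternion.Euler(-openAmount * maxJawAngle * 0.3f, 0f, 0f);
    }
}
```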
Fig. 7. (a) Real-time lip sync can be enabled using MoveBox. (b) MoveBox add-on to use real-time hand tracking with Oculus Quest. (c) Avatars can be resized to match the real size of the user. See supplementary video.

4) Hand and finger motions: Enabling the avatar's hand movement can increase fidelity of expression. New devices such as the Oculus Quest and HoloLens 2 are providing finger tracking information in real-time. The expectation is that more devices will eventually include such finger tracking, enabling better hand interactions. Our toolbox has an add-on feature to apply the device's hand tracking information to the Rocketbox avatar's hand bone and 15 finger bones with a few easy drag and drop actions in the Unity editor, and hence animate the hands while in VR (Figure 7). This is an important feature as fingers and hands are very important for social interaction.
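Hand trackers typically report a per-hand confidence value, and (as discussed later in the validation) MoveBox reduces the influence of the tracked data when that confidence is low. The sketch below shows one way to do that by blending toward the rest pose; the confidence source and bone arrays are assumptions for illustration.

```csharp
using UnityEngine;

// Blends tracked finger rotations with the avatar's rest pose according to the
// tracking confidence reported by the device, so low-confidence frames pull
// the hand toward a neutral pose instead of producing tangled fingers.
public class ConfidenceWeightedHand : MonoBehaviour
{
    public Transform[] fingerBones;          // hand bone plus 15 finger bones of one hand
    public Quaternion[] trackedRotations;    // filled each frame from the HMD hand tracker
    [Range(0f, 1f)] public float confidence; // 0 = not tracked, 1 = fully confident

    Quaternion[] restPose;

    void Start()
    {
        restPose = new Quaternion[fingerBones.Length];
        for (int i = 0; i < fingerBones.Length; i++)
            restPose[i] = fingerBones[i].localRotation;
    }

    void LateUpdate()
    {
        for (int i = 0; i < fingerBones.Length; i++)
        {
            // At low confidence the tracked rotation has little influence.
            fingerBones[i].localRotation =
                Quaternion.Slerp(restPose[i], trackedRotations[i], confidence);
        }
    }
}
```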
5) Joint Comparison: For many extended MoveBox features, such as enhanced animation looping or masking, we implemented a correlation algorithm between body parts of the same skeleton in two different frames or rotation states. A similarity index is calculated between two body states, based on the array of quaternion rotations for every joint in the body.

The product of a quaternion q and its inverse q^{-1} is the identity (1, 0, 0, 0), which has a vector part with a magnitude of zero:

q q^{-1} = (1, 0, 0, 0)

The magnitude of the vector part of the quaternion product of q_1 and q_2^{-1} is zero when q_1 and q_2 are identical, and increases as the similarity between the two quaternions decreases.

Based on this equation, a similarity index (S_i) for a pair of body positions (with a body position defined by its set of joint rotations) can be defined as the sum of the magnitudes of the quaternion product of one joint rotation quaternion and the inverse of the other, for each joint:

S_i = \sum_{j=0}^{n} \sqrt{x_j^2 + y_j^2 + z_j^2}

where j is the joint, x_j, y_j, and z_j are the second, third and fourth components of the quaternion q_{P_j} = (w_j, x_j, y_j, z_j), q_{P_j} = q_{1j} * q_{2j}^{-1}, and the quaternions q_{1j} and q_{2j} are the pair of rotations for joint j.

This ability lowers the barrier for a range of applications: an autocorrelation of positions among a set of frames in a recorded animation can enhance looped playback so that a loop appears more continuous; comparison of prerecorded body positions to user body positions captured on camera could enable the smooth integration of real-time movements with prerecorded animations, effectively creating a mask of motions on certain bones. For example, a creator might want to use a recorded body motion but implement a brand new lip sync. This implementation of comparisons enables a set of key features in the toolbox, including masking animations and the improved looping tool, as described below.
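A direct transcription of the similarity index into Unity C# might look as follows; poseA and poseB are assumed to be arrays of per-joint rotations in matching order.

```csharp
using UnityEngine;

// Similarity index between two body states: for each joint, take the product
// of one rotation and the inverse of the other, and sum the magnitudes of the
// vector parts. Identical poses give S_i = 0; larger values mean less similar.
public static class JointComparison
{
    public static float SimilarityIndex(Quaternion[] poseA, Quaternion[] poseB)
    {
        float sum = 0f;
        int n = Mathf.Min(poseA.Length, poseB.Length);
        for (int j = 0; j < n; j++)
        {
            Quaternion p = poseA[j] * Quaternion.Inverse(poseB[j]);
            // Magnitude of the vector part (x, y, z) of the product quaternion.
            sum += Mathf.Sqrt(p.x * p.x + p.y * p.y + p.z * p.z);
        }
        return sum;
    }
}
```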
B. Features

1) Recording and Playback of Animations: An important feature of the toolbox is that it enables users to record the animations they create via motion capture and replay them across any of the Microsoft Rocketbox avatars. This makes creating many different characters that rely on similar motions much easier, as a user can simply create one animation and have it replayed on multiple avatars. It also helps speed up the creation of scenes with avatars, as it is very easy to visualize the final output and/or record an alternative animation.

2) Animation Looping: The recording feature enables control, cropping and looping of animation clips upon playback. The applications of animation recording include the creation of multi-avatar scenes by a single creator, which are then improved by the ability to loop recordings, allowing continued playback for an undetermined duration of time. To enhance playback quality, a similarity-based looped animation is made available for more continuous loops. This is done by trimming the clip at a pair of frames, one near the beginning and one near the end, with similar body positions. An optimal pair of frames can be found by applying the Joint Comparison technique discussed in Section III-A5 along the series of recorded frames.
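One way to pick such trim points, reusing the SimilarityIndex sketch above, is to compare frames near the start of the recording against frames near the end and keep the most similar pair. The search-window size is an illustrative parameter, not a value taken from MoveBox.

```csharp
using System.Collections.Generic;
using UnityEngine;

// Searches a recorded animation for the pair of frames (one near the start,
// one near the end) with the most similar body pose, which makes a good pair
// of trim points for a seamless loop.
public static class LoopPointFinder
{
    public static (int start, int end) FindLoopPoints(
        List<Quaternion[]> recordedFrames, int searchWindow = 30)
    {
        int count = recordedFrames.Count;
        int bestStart = 0, bestEnd = count - 1;
        float bestScore = float.MaxValue;

        // Compare frames near the beginning against frames near the end.
        for (int s = 0; s < Mathf.Min(searchWindow, count); s++)
        {
            for (int e = Mathf.Max(count - searchWindow, s + 1); e < count; e++)
            {
                float score = JointComparison.SimilarityIndex(
                    recordedFrames[s], recordedFrames[e]);
                if (score < bestScore)
                {
                    bestScore = score;
                    bestStart = s;
                    bestEnd = e;
                }
            }
        }
        return (bestStart, bestEnd);
    }
}
```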
3) Masking: The use of motion masking already available in Unity is also streamlined in MoveBox, so that users can use multiple animations for the same avatar. For example, a body animation and an independent lip sync animation can both be applied to the same avatar. Automatic masking is also discussed in Section III-A5.

4) Resizing Avatars: The Rocketbox avatars can be scaled to a custom size to match the user's size. The ability to use an appropriately sized avatar improves the accuracy of body positions and decreases the risk of avatar limb crossover. When triggered, the resizing feature takes the three dimensional positions of the user's foot and head bones, which are captured by the Azure Kinect or Kinect V2 camera, to measure the user's height. The Rocketbox avatar is then resized, by its scale component, to match the user's height (Figure 7).
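A minimal sketch of this resizing step is given below, assuming the tracked head and foot positions are already available in world space; the default avatar height is an illustrative placeholder.

```csharp
using UnityEngine;

// Uniformly rescales an avatar so its height matches the user's height
// measured from the tracked head and foot joint positions.
public class AvatarResizer : MonoBehaviour
{
    public Transform avatarRoot;              // root of the Rocketbox avatar
    public float avatarDefaultHeight = 1.83f; // height of the unscaled avatar, meters (assumed)

    public void ResizeToUser(Vector3 trackedHeadPosition, Vector3 trackedFootPosition)
    {
        // Vertical distance between head and foot joints approximates user height.
        float userHeight = Mathf.Abs(trackedHeadPosition.y - trackedFootPosition.y);
        if (userHeight < 0.5f) return; // ignore implausible measurements

        float scale = userHeight / avatarDefaultHeight;
        avatarRoot.localScale = Vector3.one * scale;
    }
}
```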
IV. VALIDATION

To demonstrate the result of utilizing MoveBox for the Microsoft Rocketbox avatars, we performed a series of MoCaps with Kinect V2 and Azure Kinect inside MoveBox, as well as compared the system to other tools such as Cinema4D.

A. Apparatus

1) Kinect V2: Kinect V2 is an RGB-D camera based on Time-of-Flight (ToF) that provides dense depth estimations (up to 2 million points per frame) together with color images at a high frame rate (30Hz). This sensor considerably pushed forward several research fields such as 3D reconstruction [40], [41] and gesture recognition [42]. The natural frame rate of Kinect is only 30Hz, which may blur very fast motions but is quite good for most natural motions of users.

The body tracking SDK calculates the position of 25 joints and can track up to 6 people simultaneously. To compensate for the single point of view and resulting occlusion, Kinect implemented a data-driven approach based on a large data-set of motions, and fits the most plausible full body motion to the visible part of the body [43].

However, Kinect V2 recovers the skeleton only with 2 DOF per joint, without roll rotations. This device was originally aimed at consumer use with the Xbox console, in which the users are always facing frontally to the TV, either sitting on the sofa or standing in front of it. Avoiding roll rotations allowed the system to deal with a large variation of clothes, but it has undesirable effects when trying to animate 3D avatars, as we will see in this section.

2) Azure Kinect: The current Azure Kinect is also a ToF camera, designed to deal with a wider range of motions than Kinect V2. With a 12MP CMOS rolling shutter sensor it can generate 12 million points every 66 milliseconds. The body tracking SDK implements a more advanced Deep Neural Network (DNN) algorithm that uses the raw color and depth data to estimate the pose of the person from segmentation of body parts and skeletal tracking. It also includes temporal consistency checks. All in all, this post processing provides better joint orientation information as well as reconstruction of body parts that are off-view. In total it provides information on 32 joints.

B. MoCap Validation

We ran our toolbox with Kinect V2 and Azure Kinect and tested how well it performed with upper and lower body motions as well as 360 degree rotations of participants, occlusions and stand downs.

1) Upper Body and Lower Body: Both devices recover the joint and bone positions well. However, Kinect V2 did not have stable recovery of the roll angle of bones around their own axis. While this is not visible when animating avatars with bones of revolution (such as a stick figure), it generates erroneous mapping of the mesh and textures along the roll axis for anthropomorphic avatars (see Figure 5). By contrast, the Azure Kinect had better roll reconstruction of the joints around their own axis.

We implemented a model based solution to the roll rotations in Kinect V2 that always defaults the orientation of the joints around themselves to that of the original avatar, since in the original avatars the relations of rolls between joints are correct. This at least ensures more stability and valid mesh deformations at the joints (Figure 5). This improvement is important because this type of error on the avatar can reduce the experience dramatically [44].
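One possible way to realize this roll defaulting is to rebuild each bone's rotation from the tracked bone direction only (the swing), layered on top of the avatar's rest-pose rotation, so the roll around the bone axis stays as authored in the rig. The sketch below illustrates the idea; it is not MoveBox's exact implementation.

```csharp
using UnityEngine;

// Rebuilds a bone's world rotation from the tracked bone *direction* only.
// Pitch and yaw follow the tracker, while the roll around the bone axis stays
// as authored in the original Rocketbox rig.
public static class RollStabilizer
{
    // restRotation / restDirection: the bone's world rotation and bone-to-child
    // direction captured once from the avatar's rest pose (e.g. on Awake).
    public static Quaternion StabilizedRotation(
        Quaternion restRotation, Vector3 restDirection, Vector3 trackedDirection)
    {
        // Swing that carries the rest-pose bone direction onto the tracked one.
        Quaternion swing = Quaternion.FromToRotation(
            restDirection.normalized, trackedDirection.normalized);
        // Applying only the swing preserves the original roll component.
        return swing * restRotation;
    }
}
```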
Lower Body capture performed well on both devices, and after resizing the avatars to match the participant, the feet were nicely grounded as long as they were in the camera's sight. However, further collision systems could enhance the grounding even more and make it more robust.

2) Facial and Hand animation: Both Kinect V2 and Azure Kinect suffered from unwanted revolutions of the wrists. To give stability, our solution was to fix the wrist rotations and set them to match that of the lower arm. However, MoveBox also provides hand and finger animation for HMDs that support hand tracking, such as Oculus Quest or HoloLens 2. This works well when the user's hands face toward or away from the HMD. When the hands face sideways or the fingers are tangled, hand tracking may lose accuracy. As a fix, we apply less impact on the avatar's hands when the hand-tracking confidence level is low.

For facial animation, MoveBox currently implements only basic blinking idling and lip sync, but the bone structure of the faces could support a much larger set of facial expressions for animations.

3) Occlusions and Full body rotations: The two models of Kinect differ in their resistance to occlusions. While Kinect V2 may react to a joint occlusion by generating an impossible pose, Azure Kinect tends to return a feasible position. This difference in robustness is due to the design considerations of both devices. Kinect V2 was designed to sense the pose of a user facing the sensor, which limits the possible motions that could be captured by the user, whereas Azure Kinect was designed with a general pose recovery based on a 3D human model that handles such poses with no problem (see supplementary video).

4) Evaluating the reliability of 3D Pose from videos: As stated above, estimating the 3D pose from videos is currently designed for offline usage, but it can achieve 30 fps on an RTX2080Ti using [35]. In addition, since the current trained model used videos with a visible full body as training data, the predicted 3D pose is reliable only when the full body is visible, as shown in Figure 6(a,b). However, if only the lower legs are occluded, then results for the upper body may still be usable (see Figure 6(c)). But if half or less of the body is visible, the results will be unreliable, as also seen in Figure 6(c).

C. End User Validation

A collaborating university lab is conducting research in storytelling authoring tools for AR/VR. As part of their research, several authors are developing stories involving numerous characters that interact with each other and with the viewer. They have been using the Microsoft Rocketbox avatars since their release, first with a Unity asset called Cinema Mocap 2 and now with MoveBox. Working during the COVID-19 pandemic, a single lab member with both a Kinect 2 and an Azure Kinect at home records action sequences for all characters, while collaborators contribute dialogue and creative direction. In this section we detail their experiences using MoveBox.

1) Animation Quality: The lab does not have sophisticated 3D content creation or animation expertise. As relative novices, their experiences using Cinema Mocap 2 vs. MoveBox are markedly different. The animation smoothness using MoveBox was much higher than they could achieve using Cinema Mocap 2. The quality difference was noticeable when creating animations using Kinect 2 for each tool, but the Azure Kinect with MoveBox produced the highest quality content.

An additional limitation the lab faced was their inability to record motions that were not directly facing the camera, something that is available with Azure Kinect and MoveBox.

2) Lip Sync: Similarly, the researchers did not have support for mapping avatar mouth movement to audio files before switching to MoveBox. Being able to add MoCap animations as well as lip-sync animations from a single toolbox provided a simplified workflow and removed the need for custom development.

3) All in one Tool Box: A major benefit noted was the all in one nature of MoveBox. With the other tool that they had previously used, the user would have to record their motion, rig the exported avatar to a compatible skeletal structure, create an animation controller, and attach the animation clip to the Rocketbox avatar just to view the output. This workflow was very inefficient, especially when creating groupings of multiple avatars, each with a unique animation. As shown in Figure 9, MoveBox enables the user to record and playback animations in very few steps and without the need for any complicated operations.

4) Creating Stories: Using MoveBox, the creator was able to record an animation for one character, and view that animation while recording additional motion for a second character. This process was repeated numerous times in order to create larger and more complicated segments. The option to view previous animations while recording new segments removes guesswork when synchronizing a multi-avatar sequence.

This mode of recording significantly reduced the amount of time it took to create multi-avatar scenes. An example of a ten second animation with three different avatars waving at one another was measured. It took the lab 7 minutes to record the three avatar motions with Cinema Mocap 2, and it required several recording attempts to achieve pseudo-synchronization due to their inability to view previous animations. With MoveBox, the lab was able to get similar quality motions in 2 minutes, with a single attempt needed per avatar, achieving a higher quality of synchronization across the animation group.

V. DISCUSSION AND LIMITATIONS

The presented MoveBox system represents a first step in putting real-time MoCap technology in the hands of many users. There are many future additions that we hope to add, starting with support for more sensors, from stereo depth cameras to even just a simple color camera, using computer vision to recover a plausible pose.

The current system was tested recording one person at a time. While both Kinect cameras are designed to handle more than one actor, doing so may increase the complexity and the occlusions in the scene. The current system allows a person to record a performance in sync with watching a pre-recorded performance, so many scenes involving multiple users can be captured this way, but we hope to extend the system to multiple actors, or maybe multiple actors online, which can help multi-user production in these days of social distancing.

By leveraging existing RGB video content, this toolbox can also be used for motion banking. This also opens the possibility of using millions of existing videos as a source for animations and bringing archival footage into virtual environments.

While the MoveBox system provides good body tracking, the facial performance capture is limited. Currently MoveBox has a simplistic lip sync feature that maps the audio spectral amplitude to the mouth movement. As an immediate next step, we would like to implement a lip sync capability that applies phoneme information from audio to viseme blendshapes. In addition to lip sync, we would also like to add expressiveness through facial muscles.

With a similar approach the toolbox could also incorporate facial expressions. This should be relatively similar to the rest of the animation pipeline, as the Microsoft Rocketbox avatars are equipped with facial rigging. Since the user's face is obscured by the VR headset, we cannot use solutions that rely on facial video capture [24], and must instead rely on idling animations [34] or facial animation through speech sentiment analysis [45]. Using RGB information for facial expressions would limit the use of this toolbox for users of VR. Hence a more general solution is to have a good idling system for the face combined with lip sync. This approach has been shown to enhance the experience and create enfacement in users [34]. Idling animations add randomly generated animations to the model in order to counter the "robot stare" that puts many models in the uncanny valley.

The idling animation could be further improved by performing sentiment analysis on speech and applying facial animations based on sentiment.

VI. CONCLUSIONS

Up until now, motion capture was one of the hallmarks of high budget productions. To generate motions that look realistic while being easy to capture, productions invested in high-end motion capture equipment and spaces, or rented expensive hours at motion capture studios. Low budget productions, hobbyists, artists, and researchers usually had to generate motions using their own animation capability, using code to generate procedural animation, or hoping to find useful MoCap recordings that fit their needs within the few data sets of animations containing past recorded motions.

In this paper we present a publicly available open toolset, MoveBox, that aims to enable every user, even with a humble budget, to capture as many motions as she needs, acted according to the project's needs. Although the generated quality is not equal to the motions generated by high-end equipment, it already enables many more users to access motion recordings and to start using avatars in their experiments.

We envision a future where researchers from many fields, ranging from computer science to psychology, sociology and others, may use MoveBox to power interactive avatars during VR experiments.
REFERENCES

[1] J. N. Bailenson and A. C. Beall, "Transformed social interaction: Exploring the digital plasticity of avatars," in Avatars at Work and Play. Springer, 2006, pp. 1–16.
[2] M. Slater, "Place illusion and plausibility can lead to realistic behaviour in immersive virtual environments," Phil. Trans. of the Royal Society B: Biological Sciences, vol. 364, no. 1535, pp. 3549–3557, 2009.
[3] M. Gonzalez-Franco, M. Slater, M. E. Birney, D. Swapp, S. A. Haslam, and S. D. Reicher, "Participant concerns for the learner in a virtual reality replication of the Milgram obedience study," PloS One, vol. 13, no. 12, p. e0209704, 2018.
[4] N. Yee, J. N. Bailenson, and N. Ducheneaut, "The proteus effect: Implications of transformed digital self-representation on online and offline behavior," Communication Research, vol. 36, pp. 285–312, 2009.
[5] M. Gonzalez-Franco and T. C. Peck, "Avatar embodiment. Towards a standardized questionnaire," Frontiers in Robotics and AI, p. 74, 2018.
[6] T. C. Peck and M. Gonzalez-Franco, "Avatar embodiment. A standardized questionnaire," Frontiers in Virtual Reality, 2020.
[7] K. Kilteni, R. Groten, and M. Slater, "The sense of embodiment in virtual reality," Presence: Teleoperators and Virtual Environments, vol. 21, no. 4, pp. 373–387, 2012.
[8] M. Gonzalez-Franco, B. Cohn, E. Ofek, D. Burin, and A. Maselli, "The self-avatar follower effect in virtual reality," in IEEE Conference on Virtual Reality and 3D User Interfaces (VR). IEEE, 2020, pp. 18–25.
[9] B. Spanlang, J.-M. Normand, D. Borland, K. Kilteni, E. Giannopoulos, A. Pomés, M. González-Franco, D. Perez-Marcos, J. Arroyo-Palacios, X. N. Muncunill et al., "How to build an embodiment lab: achieving body representation illusions in virtual reality," Frontiers in Robotics and AI, vol. 1, p. 9, 2014.
[10] M. Gonzalez-Franco, D. Perez-Marcos, B. Spanlang, and M. Slater, "The contribution of real-time mirror reflections of motor actions on virtual body ownership in an immersive virtual environment," in IEEE Virtual Reality Conference (VR). IEEE, 2010, pp. 111–114.
[11] B. Spanlang, X. Navarro, J.-M. Normand, S. Kishore, R. Pizarro, and M. Slater, "Real time whole body motion mapping for avatars and robots," in Proceedings of the 19th ACM Symposium on Virtual Reality Software and Technology, 2013, pp. 175–178.
[12] M. Gonzalez-Franco, E. Ofek, Y. Pan, A. Antley, A. Steed, B. Spanlang, A. Maselli, D. Banakou, N. Pelechano, S. Orts-Escolano, V. Orvalho, L. Trutoiu, M. Wojcik, M. Sanchez-Vives, J. Bailenson, M. Slater, and J. Lanier, "The Rocketbox library and the utility of freely available rigged avatars for procedural virtual humans and embodiment," Frontiers in Virtual Reality, 2020.
[13] Z. Tan, Y. Hu, and K. Xu, "Reality based immersive telepresence system for remote conversation and collaboration," in Int. Workshop on Next Generation Techniques, 2017, pp. 234–247.
[14] F. Jiang, X. Yang, and L. Feng, "Real-time full-body motion reconstruction and recognition for off-the-shelf VR devices," in Proc. 15th ACM SIGGRAPH Conference on Virtual-Reality Continuum and Its Applications in Industry, vol. 1, 2016, pp. 309–318.
[15] M. Slater, A. Rovira, R. Southern, D. Swapp, J. J. Zhang, C. Campbell, and M. Levine, "Bystander responses to a violent incident in an immersive virtual environment," PloS One, vol. 8, p. e52766, 2013.
[16] "Microsoft Rocketbox," https://github.com/microsoft/Microsoft-Rocketbox, 2020.
[17] C. Xu, J. He, X. Zhang, X. Zhou, and S. Duan, "Towards human motion tracking: Multi-sensory IMU/TOA fusion method and fundamental limits," Electronics, vol. 8, p. 142, 2019.
[18] "Motion Shadow," https://www.motionshadow.com/, 2020.
[19] H. Liu, X. Wei, J. Chai, I. Ha, and T. Rhee, "Realtime human motion control with a small number of inertial sensors," in Symposium on Interactive 3D Graphics and Games, ser. I3D '11, 2011, pp. 133–140.
[20] K. Ahuja, C. Harrison, M. Goel, and R. Xiao, "MeCap: Whole-body digitization for low-cost VR/AR headsets," in Proceedings of the 32nd Annual ACM Symposium on User Interface Software and Technology, ser. UIST '19, 2019, pp. 453–462.
[21] J. Thies, M. Zollhöfer, C. Theobalt, M. Stamminger, and M. Niessner, "HeadOn: Real-time reenactment of human portrait videos," ACM Trans. Graph., vol. 37, no. 4, Jul. 2018.
[22] T. Beeler, F. Hahn, D. Bradley, B. Bickel, P. Beardsley, C. Gotsman, R. W. Sumner, and M. Gross, "High-quality passive facial performance capture using anchor frames," ACM Trans. Graph., vol. 30, no. 4, 2011.
[23] L. Valgaerts, C. Wu, A. Bruhn, H.-P. Seidel, and C. Theobalt, "Lightweight binocular facial performance capture under uncontrolled lighting," ACM Trans. Graph., vol. 31, no. 6, Nov. 2012.
[24] A. Tewari, M. Zollhöfer, H. Kim, P. Garrido, F. Bernard, P. Perez, and C. Theobalt, "MoFA: Model-based deep convolutional face autoencoder for unsupervised monocular reconstruction," in The IEEE International Conference on Computer Vision (ICCV) Workshops, Oct. 2017.
[25] "SALSA lip-sync, emoter, and eyes - for Unity," 2020. [Online]. Available: https://crazyminnowstudio.com/unity-3d/lip-sync-SALSA/
[26] "Oculus Lipsync for Unity Development," 2020. [Online]. Available: https://developer.oculus.com/documentation/unity/audio-ovrlipsync-unity/
[27] J. Lee, J. Chai, P. S. Reitsma, J. K. Hodgins, and N. S. Pollard, "Interactive control of avatars animated with human motion data," in Proceedings of the 29th Annual Conference on Computer Graphics and Interactive Techniques, 2002, pp. 491–500.
[28] B. Spanlang, "HALCA: Hardware accelerated library for character animation," Barcelona: Universitat de Barcelona, EVENT Lab, 2009.
[29] B. Spanlang, D. Corominas, and M. Slater, "A virtual whole body system," in CHI Whole Body Interaction Workshop, 2010.
[30] M. Gillies and B. Spanlang, "Comparing and evaluating real time character engines for virtual environments," Presence: Teleoperators and Virtual Environments, vol. 19, no. 2, pp. 95–117, 2010.
[31] D. Holden, T. Komura, and J. Saito, "Phase-functioned neural networks for character control," ACM Trans. Graph., vol. 36, no. 4, Jul. 2017.
[32] X. B. Peng, P. Abbeel, S. Levine, and M. van de Panne, "DeepMimic," ACM Transactions on Graphics, vol. 37, no. 4, pp. 1–14, Aug. 2018.
[33] P. Abtahi, M. Gonzalez-Franco, E. Ofek, and A. Steed, "I'm a giant: Walking in large virtual environments at high speed gains," in Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, 2019, pp. 1–13.
[34] M. Gonzalez-Franco, A. Steed, S. Hoogendyk, and E. Ofek, "Using facial animation to increase the enfacement illusion and avatar self-identification," IEEE Transactions on Visualization and Computer Graphics, vol. 26, no. 5, pp. 2023–2029, 2020.
[35] M. Kocabas, N. Athanasiou, and M. J. Black, "VIBE: Video inference for human body pose and shape estimation," in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 5252–5262.
[36] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. Black, "SMPL: A skinned multi-person linear model," ACM Trans. Graph., vol. 34, no. 6, Oct. 2015.
[37] N. De la Peña, P. Weil, J. Llobera, E. Giannopoulos, A. Pomés, B. Spanlang, D. Friedman, M. V. Sanchez-Vives, and M. Slater, "Immersive journalism: immersive virtual reality for the first-person experience of news," Presence: Teleoperators and Virtual Environments, vol. 19, no. 4, pp. 291–301, 2010.
[38] Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei, and Y. Sheikh, "OpenPose: realtime multi-person 2D pose estimation using part affinity fields," arXiv preprint arXiv:1812.08008, 2018.
[39] H. Li, L. Trutoiu, K. Olszewski, L. Wei, T. Trutna, P.-L. Hsieh, A. Nicholls, and C. Ma, "Facial performance sensing head-mounted display," ACM Transactions on Graphics (ToG), vol. 34, no. 4, 2015.
[40] J. Smisek, M. Jancosek, and T. Pajdla, "3D with Kinect," in Consumer Depth Cameras for Computer Vision. Springer, 2013, pp. 3–25.
[41] S. Izadi, D. Kim, O. Hilliges, D. Molyneaux, R. Newcombe, P. Kohli, J. Shotton, S. Hodges, D. Freeman, A. Davison et al., "KinectFusion: real-time 3D reconstruction and interaction using a moving depth camera," in Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology, 2011, pp. 559–568.
[42] Z. Ren, J. Yuan, J. Meng, and Z. Zhang, "Robust part-based hand gesture recognition using Kinect sensor," IEEE Transactions on Multimedia, vol. 15, no. 5, pp. 1110–1120, 2013.
[43] J. Shotton, A. Fitzgibbon, A. Blake, A. Kipman, M. Finocchio, B. Moore, and T. Sharp, "Real-time human pose recognition in parts from a single depth image," in CVPR. IEEE, June 2011.
[44] G. Padrao, M. Gonzalez-Franco, M. V. Sanchez-Vives, M. Slater, and A. Rodriguez-Fornells, "Violating body movement semantics: Neural signatures of self-generated and external-generated errors," Neuroimage, vol. 124, pp. 147–156, 2016.
[45] R. Hoegen, D. Aneja, D. McDuff, and M. Czerwinski, "An end-to-end conversational style matching agent," in Proceedings of the 19th ACM International Conference on Intelligent Virtual Agents, 2019, pp. 111–118.