Camera-based occlusion for AR in Metal via depth extraction from the iPhone dual camera

Cristobal Sciutto, Stanford University (cristobal.space)
Breno Dal Bianco, Stanford University ([email protected])

Abstract

In this project, we explore the development of a proof-of-concept of occlusion with camera-based AR on iOS's graphics platform Metal. Per Cutting and Vishton's survey of the relative importance of visual cues, it is clear that occlusion is the most important cue for generating a realistic virtual experience [1]. We hope that our developments pave the way for future compelling consumer-grade applications in the mobile space. Our implementation involves tapping into the phone's capture system, extracting disparity information from the iPhone X's dual camera, and implementing appropriate shaders for occlusion. The difficulties tackled in this project are two-fold: (i) iOS API deficiencies, and (ii) processing of inaccurate depth information. Source code and GIF examples of the app in use are available at https://github.com/tobyshooters/teapottoss.

1. Introduction

In this section we explore the motivation behind the development of a solution to occlusion, particularly in the context of camera-based AR on phones.

1.1. Motivation

Per GSMA's 2019 Mobile Economy report, of the 5.3 billion humans over the age of 15, 5 billion own a smartphone [2]. Of those 5 billion users, iOS and Android constitute the largest user base. Consequently, most consumer-grade AR experiences are being developed for Android and iOS devices. They are made using ARCore or ARKit, Google's and Apple's proprietary AR frameworks, which specifically target their own hardware.

The problem with mobile development, however, is that without specialized hardware, it is hard to get the sensor input needed for the generation of a quality AR or VR experience. Namely, there is the lack of a depth map, i.e. a notion of how far objects in the real world are from the phone. This information is essential to be able to occlude virtual objects that are behind real objects. For example, looking at the Ikea Place app in Figure 1, which allows users to place virtual furniture into the physical world, we see that it does not support occlusion. While the sofa is behind the man, it is pasted on top of him in the image plane. Rather, the region of the sofa that the man is covering should not be rendered.

This is severely detrimental to the user's experience, as occlusion is the most important visual cue, per Cutting and Vishton [1]. In other words, it is the most important real-world effect through which a human is capable of distinguishing where objects are located in the world. Without occlusion, a user experience cannot be compelling.

Figure 1. Ikea Place app. Note the lack of occlusion.

Figure 2. Relative importance of visual cues per [1].

Devices such as Microsoft's HoloLens have time-of-flight sensors that query the real world and measure the time it takes for light to reflect back into the device. With this information, along with the intended location of a virtual object, it is possible to conditionally render an object only in the case that it is the closest to the screen. While a time-of-flight sensor is the best possible solution to the problem, a low-resolution sensor is sufficient for most existing experiences. Since most objects are of considerable scale, e.g. a couch, the resolution of depth needed is closer to meters than millimeters.

Fortunately, a solution to this problem is possible within the form factor of a phone. Namely, two cameras that generate a stereo pair are sufficient for calculating disparity, which can then be converted to depth. All iPhones since the 7, and the Android competitor, Google's Pixel, have two cameras. Currently, the dual cameras are primarily used for computational photography: a disparity map is sufficient for simulating effects such as depth-of-field or face-from-background isolation, which has led to many applications such as the iPhone's portrait mode. Nevertheless, this information has not yet been tapped for AR experiences, beyond face filters.
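For context on the geometry (a standard rectified stereo relation, stated here for intuition rather than taken from the project itself): with two horizontally offset cameras, the depth z of a point is inversely proportional to its disparity d between the two images,

    z = f * B / d,

where f is the focal length in pixels and B is the baseline between the two lenses. Since the baseline on a dual-camera phone is only on the order of a centimeter, disparity falls off quickly with distance and the recovered depth becomes coarse and noisy at range; as noted above, however, meter-level resolution is sufficient for occluding objects at the scale of furniture.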
Using this information is the first frontier for bettering current consumer-grade AR experiences, and the area in which the greatest number of users can be impacted the quickest.

2. Related Work

In the mobile space, existing solutions to AR occlusion use one of two methods: (i) 3D scene reconstruction, or (ii) preprocessing of the scene.

In the first case, a series of images captured from the phone are processed together to create an estimate of the real-world scene. With this 3D scene, along with the phone's position in the scene, one can estimate the depth in a given camera direction. The reconstruction is often done with the help of a machine learning algorithm. The Selerio SDK for cross-platform AR development implements this approach [5]. The obvious drawbacks are the inaccuracies of reconstruction and the computational intensity of the approach. Furthermore, most existing algorithms require an initial calibration of the scene at hand. ARKit requires similar initialization, so it could be argued that this is not a further cost to the user experience; however, it is clear that the ability to have occlusion out-of-the-box would be ideal.

Figure 3. Example of 3D scene mesh reconstruction with Selerio.

In the second case, the user is prompted to create bounding boxes for existing objects in the scene before beginning their experience. With these bounding boxes oriented in world space, it is similarly possible to estimate depth from them and thus occlude virtual objects. This initialization phase is even more costly to the user experience than the reconstruction, as users are required to manually generate 3D geometry for the rendering system. This approach is taken by an existing open-source project on GitHub by Bjarne Lundgren [3], and can be seen in Figure 4.

Figure 4. Bjarne Lundgren's app, with the user establishing occluding boxes.

Overall, it is clear that any preprocessing step is non-ideal. Using depth sensors directly sidesteps this problem, and this further motivates our approach of using mobile phone dual cameras. Nevertheless, a combination of depth sensors with scene reconstruction, as used in current AR hardware, will most likely be the way to go. This combined approach allows both immediate usage and increased accuracy over time.

At Apple's Worldwide Developers Conference in 2019 (after the development of this project), Apple announced "people occlusion", i.e. occlusion for the specific case of people. These features will be released in the form of their new RealityKit SDK. This does not yet solve the challenge addressed by our application, but it makes clear that it is Apple's intention to eventually make this information available. We speculate that since the depth information likely passes through a machine learning pipeline, objects for which we have more data (e.g. faces and bodies) will be the first to have occlusion enabled.

3. Our Work

In this section we elaborate on the final product developed and the process used to reach that solution.

3.1. Overview

Our final product consists of an iOS game called Teapot Toss, as observable in Figure 5, which allows users to throw virtual teapots into real-world cups. This is done by accessing the camera stream for depth information and conditionally rendering the teapot depending on whether it is occluded by a real-world object. The camera information is extracted from the phone's camera stream using the iOS AVFoundation SDK, as sketched below. We then built a renderer in Metal (Apple's lower-level graphics framework) that describes the 3D scene and renders it with a shader.

Figure 5. Successful occlusion with our final product.
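As a rough illustration of the capture side (a minimal sketch under stated assumptions, not the project's actual source; everything except the AVFoundation API names, such as the DepthCaptureSource class and its handler property, is hypothetical), a dual-camera session can be configured to deliver synchronized video and depth frames roughly as follows:

import AVFoundation

// Minimal sketch: stream synchronized video + depth from the dual camera.
final class DepthCaptureSource: NSObject, AVCaptureDataOutputSynchronizerDelegate {
    let session = AVCaptureSession()
    private let videoOutput = AVCaptureVideoDataOutput()
    private let depthOutput = AVCaptureDepthDataOutput()
    private var synchronizer: AVCaptureDataOutputSynchronizer?
    private let queue = DispatchQueue(label: "capture.queue")

    // Called with each synchronized (camera image, depth/disparity) pair.
    var handler: ((CVPixelBuffer, AVDepthData) -> Void)?

    func configure() throws {
        guard let device = AVCaptureDevice.default(.builtInDualCamera,
                                                   for: .video,
                                                   position: .back) else { return }
        session.beginConfiguration()
        session.addInput(try AVCaptureDeviceInput(device: device))
        session.addOutput(videoOutput)
        session.addOutput(depthOutput)
        // Note: the active capture format must support depth data delivery.
        depthOutput.isFilteringEnabled = true  // interpolate holes in the depth map

        // Deliver video and depth frames together.
        synchronizer = AVCaptureDataOutputSynchronizer(dataOutputs: [videoOutput, depthOutput])
        synchronizer?.setDelegate(self, queue: queue)
        session.commitConfiguration()
        session.startRunning()
    }

    func dataOutputSynchronizer(_ synchronizer: AVCaptureDataOutputSynchronizer,
                                didOutput collection: AVCaptureSynchronizedDataCollection) {
        guard
            let videoData = collection.synchronizedData(for: videoOutput) as? AVCaptureSynchronizedSampleBufferData,
            let depthData = collection.synchronizedData(for: depthOutput) as? AVCaptureSynchronizedDepthData,
            let pixelBuffer = CMSampleBufferGetImageBuffer(videoData.sampleBuffer)
        else { return }
        // Disparity can be converted to depth here before handing both buffers
        // to the renderer as Metal textures.
        handler?(pixelBuffer, depthData.depthData)
    }
}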
For Teapot Toss we have a 3D scene constituted of two objects: the teapot and an AR image plane. The former has the ability of being thrown upon tapping. The latter hosts an image of the real world captured from the camera. The renderer receives both the captured image from the phone and the depth map from the iPhone's camera stream and passes them as textures into a shader.

For the AR image plane, we place its edges along the border of window space. Consequently, the AR image plane occupies the entirety of the user's screen. The window space is then mapped to UV coordinates which are used to sample the captured image. We thus effectively have a live stream of the phone's camera being displayed on this plane. This can best be visualized by using the color from the image, but still performing specular highlighting on the teapot, per Figure 6. Notably, the pot is occluded by the finger, yet we can still see the contour of the object because the normals of the pot are still being used for highlights.

Figure 6. Occlusion shader uses image as base color, but adds Phong shading.

The teapot is then rendered in front of this plane. We get the depth from the depth map corresponding to a fragment of the teapot by projecting the fragment onto window space and sampling the depth map texture. If the world-space depth of the teapot is greater than the value in the depth map, we use the same UV coordinates to sample the captured image texture, thus painting the teapot with the video stream. This process is best summarized by an example Metal shader, available in the appendix.
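The appendix shader is the authoritative version; the following is only an illustrative sketch of the fragment-stage logic just described, written in Metal Shading Language. The struct name, texture and buffer indices, and the stand-in lighting constants are hypothetical, not taken from the project:

#include <metal_stdlib>
using namespace metal;

// Hypothetical vertex-stage output: window-space position plus the fragment's
// world-space depth (distance from the camera) and its normal.
struct TeapotFragmentIn {
    float4 position [[position]];
    float  worldDepth;
    float3 normal;
};

fragment float4 teapotFragment(TeapotFragmentIn in            [[stage_in]],
                               texture2d<float> cameraImage   [[texture(0)]],
                               texture2d<float> depthMap      [[texture(1)]],
                               constant float2 &viewportSize  [[buffer(0)]])
{
    constexpr sampler s(address::clamp_to_edge, filter::linear);

    // Project the fragment's window-space position to UV coordinates,
    // the same mapping used to stretch the camera image over the AR image plane.
    float2 uv = in.position.xy / viewportSize;

    // Real-world depth captured by the dual camera at this pixel.
    float realDepth = depthMap.sample(s, uv).r;

    if (in.worldDepth > realDepth) {
        // The teapot fragment lies behind the real object: paint it with the
        // video stream, which visually occludes it.
        return cameraImage.sample(s, uv);
    }

    // Otherwise shade the teapot normally (simple diffuse term stands in for
    // the project's Phong shading here).
    float3 lightDir = normalize(float3(0.0, 1.0, 1.0));
    float  diffuse  = max(dot(normalize(in.normal), lightDir), 0.0f);
    return float4(float3(0.2, 0.4, 0.8) * diffuse, 1.0);
}

The variant shown in Figure 6 can be obtained by sampling the camera image as the base color in the non-occluded branch as well, while keeping the specular term, which is why the teapot's contour remains visible even where it is occluded.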