
A Case Study of Security and Privacy Threats from Augmented Reality (AR)

Song Chen∗, Zupei Li†, Fabrizio D'Angelo†, Chao Gao†, Xinwen Fu‡
∗Binghamton University, NY, USA; Email: [email protected]
†Department of Computer Science, University of Massachusetts Lowell, MA, USA; Email: {zli1, cgao, fdangelo}@cs.uml.edu
‡Department of Computer Science, University of Central Florida, FL, USA; Email: [email protected]

Abstract—In this paper, we present a case study of potential security and privacy threats from virtual reality (VR) and augmented reality (AR) devices, which have been gaining popularity. We introduce a computer vision based attack using an AR system based on the Gear VR and the ZED stereo camera against authentication approaches on touch-enabled devices. A ZED stereo camera is attached to a Gear VR headset and works as the video feed for the Gear VR. While pretending to be playing, an attacker can capture videos of a victim inputting a password on a touch screen. We use the depth and distance information provided by the stereo camera to reconstruct the attack scene and recover the victim's password. We use numerical passwords as an example and perform experiments to demonstrate the performance of our attack. The attack's success rate is 90% when the distance from the victim is 1.5 meters, and a reasonably good success rate is achieved within 2.5 meters. The goal of the paper is to raise awareness of potential security and privacy breaches from seemingly innocuous VR and AR technologies.

Fig. 1. AR Based on Gear VR and ZED

I. INTRODUCTION

In the past 3 years, advancements in hardware and software have led to an increase in popularity for virtual reality (VR) and augmented reality (AR) devices [2]. VR devices have shown their value in many applications and may be used in entertainment, artistry, design, gaming, education, simulation, tourism, exploration, psychology, meditation, real estate, shopping and social networking. In late 2015, Samsung released its mobile VR solution, the Samsung Gear VR. In 2016, HTC and Oculus launched their PC-based VR devices, the HTC Vive and the Oculus Rift. In the meantime, Sony released its game-console-based VR approach, the Playstation VR. Within 2 years, virtual reality technology has covered almost all consumer electronic platforms.

AR technology has been around for decades, such as the Head Up Display (HUD) in aviation. In the consumer market, there is a more than 20-year history of augmented reality games on smart phones. However, wearable AR devices are not popular due to their size and weight limitations. There have been several attempts in this area, such as Google Glass and Microsoft Hololens recently. Google Glass is a relatively simple, lightweight, head-mounted display intended for daily use. Microsoft Hololens is a pair of augmented reality smart glasses with powerful computational capabilities and multiple sensors. It is equipped with several kinds of cameras, including a set of stereo cameras.

However, VR and AR devices may be abused. In this paper, we discuss potential security and privacy threats from seemingly innocuous VR and AR devices. We built an AR system based on the Samsung Gear VR and the ZED stereo camera. We use a ZED as the video feed for the 2 displays of a Gear VR. An attacker takes stereo videos of a victim inputting a password on a smart device. The attacker can monitor the video stream from the Gear VR, as shown in Figure 1, while pretending to play games. A recorded stereo video is then used to reconstruct the finger position and the touch screen in 3D space so that the password can be obtained. In this paper, we use numerical passwords as an example. We are the first to use an AR device to attack touch-enabled smart devices. The 3D vision attack in this paper is against alphanumeric passwords instead of the graphical passwords in [6], which is the most related work.

The rest of this paper is organized as follows. In Section II, we introduce and compare several popular VR and AR devices. In Section III, we present the general idea and system implementation of our AR system based on Gear VR and ZED. In Section IV, we discuss the threat model, the basic idea of the attack and the workflow. In Section V, we present the experiment design and results to demonstrate the feasibility of the attack. We conclude this paper in Section VI.

Fig. 2. Samsung Gear VR. Fig. 3. Google Cardboard. Fig. 4. Google Daydream View. Fig. 5. Playstation VR. Fig. 6. Oculus Rift

world. is a low-cost VR solution that only provides the basic display function [4]. Google Daydream [3] and Gear VR are more complex and advanced, equipped with a motion controller. Gear VR even has a separate IMU and control interface (trackpad Fig. 7. HTC Vive Fig. 8. Microsoft Hololens and buttons) on the headset. Unlike the mobile VR system, a VR system such as HTC Vive, and workflow. In Section V, we present the experiment Oculus Rift [9] and Playstation VR [12] is more capa- design and results to demonstrate the feasibility of the ble. They all have extra tracking systems for positioning attack. We conclude this paper in Section VI. with excellent accuracy and latency. The refresh rate is usually higher than what mobile VR systems offer. II.BACKGROUND However, they need to be driven by full scale computers In this section, we review the Samsung Gear VR and or game consoles. then compare popular VR and AR devices. Unlike the virtual reality device we mentioned above, Microsoft Hololens [8] is an AR device. It is a self- A. Samsung Gear VR sustaining device just like a mobile VR system. The Samsung Gear VR is a VR system based on smart device itself has enough computational power to run phones. A device acts as the display Windows 10. Hololens has several camera systems so and processor for the VR system. The Gear VR headset that it does not need a controller, and users can use is the system’s basic controller, eyepiece and the IMU gestural commands to perform input via camera sensors. (Inertial measurement unit), as shown in Fig. 2 [11]. The IMU connects to the Galaxy device by USB. Gear III.AUGMENTED REALITYBASEDON GEAR VR VR is a self-supporting, mobile VR system, which does AND ZED not require connection to a high-performance computer In this section, we introduce our AR system based or positioning base station, though the headset has only on Gear VR and ZED. the 3-axis rotational tracking capability. The resolution of Gear VR is determined by the res- A. Overview of the System olution of the attached smart phone and the maximum In order to implement AR on the Gear VR, the video resolution is 2560×1440 (1280×1440 per eye). The feed from the ZED camera must be sent to a Gear VR FOV (field of view) of the lens is 101◦. app installed on a . Since the ZED camera There are several ways to control Gear VR. A user captures two side by side 1920×1080 pixels per frame, can use the built-in button and trackpad on the headset, the video could then be converted to a 3D SBS (side by or use the hand held motion controller, which has its side) video [13]. A smartphone ( in own , and magnetic sensor. The our case) is plugged into the headset and displays the hand held controller connects to the Galaxy device by ZED live stream to the user. There are two possible . There are some 3rd party controllers, such approaches to do this: either by plugging the ZED as Xbox Wireless Controller, which also support Gear camera directly to the Gear VR’s additional USB Type- VR. C port or by sending the video over the internet to the phone while the ZED camera is connected to a computer B. Comparison host. Apparently the former has the potential of much Figures 3, 4, 5, 6, 7 and 8 show Google Cardboard, less video display delay. We leave the former as future Google Daydream, Oculus Rift, HTC Vive, Playsta- work, and introduce our research and development of tion VR and Microsoft Hololens respectively. For the the latter in this paper. 
III. AUGMENTED REALITY BASED ON GEAR VR AND ZED

In this section, we introduce our AR system based on Gear VR and ZED.

A. Overview of the System

In order to implement AR on the Gear VR, the video feed from the ZED camera must be sent to a Gear VR app installed on a smartphone. Since the ZED camera captures two side-by-side 1920×1080 images per frame, the video can be converted to a 3D SBS (side by side) video [13]. A smartphone (a Samsung Galaxy S8 in our case) is plugged into the headset and displays the ZED live stream to the user. There are two possible approaches to do this: either by plugging the ZED camera directly into the Gear VR's additional USB Type-C port, or by sending the video over the internet to the phone while the ZED camera is connected to a computer host. Apparently the former has the potential of much less video display delay. We leave the former as future work, and introduce our research and development of the latter in this paper. The video stream from the ZED is sent to a Gear VR app over RTSP (Real Time Streaming Protocol).

During the attack introduced in this paper, the attacker uses a Samsung Gear VR that receives the video stream from the ZED camera. The computer host is

stored in a backpack while the ZED is pointed at a victim inputting her password on a smartphone.

B. Implementation

RTSP is a control protocol for the delivery of multimedia data between a client and a server in real time. RTSP can tell the requester the video format, play a video and stop the playback. The delivery of the actual video data is handled by another protocol such as RTP (Real-time Transport Protocol).

We set up an RTSP server on a Linux laptop. The RTSP server is implemented as follows. First, the video stream is captured using V4l2, the media source that Gstreamer manipulates. Gstreamer is a C library for dynamically creating pipelines between media sources and destinations [5]. V4l2 (Video for Linux) is a Linux API that can access the ZED camera or any other UVC-compliant camera. Raw video frames are captured from the ZED camera in the YUV format, and can be compressed by codecs such as H.264. The pipeline is attached to the RTSP server. When the server is sent the PLAY protocol command, the video feed travels along the pipeline to the requester through RTP. Figure 9 shows the complete pipeline.

Fig. 9. Video Stream Pipeline
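The paper describes this pipeline only at the block level. As a concrete, hedged illustration, the following Python sketch builds an equivalent RTSP server with GStreamer's gst-rtsp-server bindings; the device path /dev/video0, the 3840×1080 side-by-side raw caps and the encoder settings are our assumptions rather than values stated in the paper.

```python
#!/usr/bin/env python3
# Hedged sketch of the RTSP server described above: a V4l2 source feeding
# an H.264 encoder and an RTP payloader, served at rtsp://<host>:8554/test.
import gi
gi.require_version('Gst', '1.0')
gi.require_version('GstRtspServer', '1.0')
from gi.repository import Gst, GstRtspServer, GLib

Gst.init(None)

server = GstRtspServer.RTSPServer()
server.set_service('8554')  # port from the paper's rtsp://localhost:8554/test URL

factory = GstRtspServer.RTSPMediaFactory()
# v4l2src reads the UVC-compliant ZED. The /dev/video0 path and the
# side-by-side 3840x1080@30 raw caps are assumptions for illustration.
factory.set_launch(
    '( v4l2src device=/dev/video0 '
    '! video/x-raw,width=3840,height=1080,framerate=30/1 '
    '! videoconvert ! x264enc tune=zerolatency '
    '! rtph264pay name=pay0 pt=96 )')
factory.set_shared(True)  # serve all clients from one pipeline

server.get_mount_points().add_factory('/test', factory)
server.attach(None)

print('Stream ready at rtsp://<host-ip>:8554/test')
GLib.MainLoop().run()
```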
The video feed can be accessed locally and remotely. It can be accessed on the local Linux host using the RTSP URL rtsp://localhost:8554/test. For remote access of the video feed, localhost is replaced with the IP address of the server. A Gear VR app using the Gear VR Framework is then created to play the video using the RTSP URL.

The Gear VR Framework is a high-level Java interface with access to the low-level Oculus Mobile SDK. The Gear VR Framework uses a scene graph to render 3D objects onto the headset lenses [10]. The scene graph is implemented using a tree data structure, where child node objects inherit properties, such as position in space, from their parent. The 3D object representing the video screen is created using an obj file, and instantiated as a child node of the main camera rig object node.

An object file contains information about the 3D textures used by 3D rendering engines. An object file can be used in the Gear VR Framework to instantiate a GVRScene object before any transformations are applied to it. An object file lists each component of the 3D object, such as a vertex or a face defined by vertices. Associated with each vertex is its position in the XYZ[W] coordinate system. Each vertex's position may be defined relative to the first or last vertex in the list, or it may be defined absolutely.

The main camera rig object holds a reference to all 3D objects that are to be rendered in front of the lenses. Thus, any child of the main camera rig object is also positioned in front of the lenses. This way, when the headset moves in space, the video screen always stays positioned in front of the lenses. Figure 10 shows a fusion 3D screenshot from the Samsung Gear VR.

Fig. 10. Video Screenshot

Therefore, to record videos, an attacker could attach the ZED camera to the top of the Gear VR headset, while storing the Linux host computer, which the ZED is connected to, inside a backpack. Both the Linux host and the Gear VR could connect to the same Wi-Fi network via a wireless AP powered by the laptop or a nearby wireless router.

IV. 3D VISION ATTACK AGAINST NUMERICAL PASSWORD

In this section, we introduce the threat model and our attack workflow, and discuss each step.

A. Threat Model and Basic Idea

In our attack, the attacker uses a stereo camera mounted on a VR headset to record videos of victims inputting their numerical passwords. We use numerical passwords as an example; our attack approach is generic. We assume the victim's fingertip and the device's screen are in the view of the stereo camera, but the content of the screen is not visible.

We analyze the 3D videos and reconstruct the 3D trajectory of the fingertip movements, mapping the touched positions to a reference keyboard to recover the passwords.

B. Workflow and Attack Process

Figure 11 shows the workflow. We discuss these 5 steps in detail below. Because of the space limit, our description will be brief.

Fig. 11. Workflow of 3D Vision Attack

1) Step 1: Calibrating Stereo Camera System: In order to get the parameters of the stereo camera system, we perform stereo calibration on our stereo camera. We use the stereo camera to take multiple photos of a chessboard with a square side length of 23.5mm from different distances and positions. In order to achieve satisfactory calibration accuracy, we take at least 10 photos. Then we run the calibration algorithm, which locates the chessboard corner points in every photo to perform the calibration. The calibration algorithm outputs the following parameters: the camera intrinsic matrices, the camera distortion coefficients and the geometrical relationship between the left and right cameras [1].
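The paper does not name its calibration code. Since it cites the OpenCV book [1], the following Python sketch shows what Step 1 typically looks like with OpenCV; the image file names and the 9×6 inner-corner pattern are illustrative assumptions, while the 23.5mm square size is from the paper.

```python
# Hedged sketch of Step 1: stereo calibration from chessboard photo pairs.
import glob
import cv2
import numpy as np

SQUARE = 23.5        # chessboard square side in mm (from the paper)
PATTERN = (9, 6)     # inner-corner grid; an assumption for illustration

# 3D chessboard corner coordinates in the board's own frame (Z = 0 plane).
objp = np.zeros((PATTERN[0] * PATTERN[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:PATTERN[0], 0:PATTERN[1]].T.reshape(-1, 2) * SQUARE

obj_pts, left_pts, right_pts = [], [], []
for lf, rf in zip(sorted(glob.glob('left*.png')), sorted(glob.glob('right*.png'))):
    gl = cv2.imread(lf, cv2.IMREAD_GRAYSCALE)
    gr = cv2.imread(rf, cv2.IMREAD_GRAYSCALE)
    ok_l, corners_l = cv2.findChessboardCorners(gl, PATTERN)
    ok_r, corners_r = cv2.findChessboardCorners(gr, PATTERN)
    if ok_l and ok_r:            # keep pairs where both cameras see the board
        obj_pts.append(objp)
        left_pts.append(corners_l)
        right_pts.append(corners_r)

size = gl.shape[::-1]
# Calibrate each camera, then solve for the geometry between the two:
# intrinsic matrices M1/M2, distortion d1/d2, rotation R and translation T.
_, M1, d1, _, _ = cv2.calibrateCamera(obj_pts, left_pts, size, None, None)
_, M2, d2, _, _ = cv2.calibrateCamera(obj_pts, right_pts, size, None, None)
_, M1, d1, M2, d2, R, T, E, F = cv2.stereoCalibrate(
    obj_pts, left_pts, right_pts, M1, d1, M2, d2, size,
    flags=cv2.CALIB_FIX_INTRINSIC)
```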

2) Step 2: Taking Stereo Videos: The attacker uses our AR system to take videos of a victim inputting passwords. The attacker aims the camera in the correct direction and at the correct angle by watching the video stream from the VR headset. During video recording, the attacker needs to make sure the victim's inputting fingertip and the screen surface of the device always appear in the view of both the left and right cameras.

Distance, resolution, and frame rate are the three major factors that affect the outcome of the attack. The ZED camera is equipped with 2 short-focal-length wide-angle lenses that lack an optical zooming function. In order to acquire a clear image with enough information (pixels) of the victim's hand and the target device, the adversary needs to be close enough to the victim. The camera needs a high resolution to achieve high attack success rates, for the same reason. The average human can input a 4-digit numerical password within 2 seconds, so the frame rate of the camera system must be high enough to capture the touching moments of the victim's fingertip [7] [14]. Because of the above limitations, in our attack approach, we use the ZED camera to record videos of 1920×1080 at 30 FPS (frames per second).

3) Step 3: Preprocessing: To improve processing speed in the later steps, we only select video frames which contain the victim's fingertip touching the device's screen. In our case, for each input process, we have 4 touching frames representing 4 digit key inputs. On some devices, a user needs to input an extra confirmation key after inputting the digit keys. Next we perform stereo rectification on every frame pair with the camera system parameters we obtained from Step 1. Stereo rectification is used to mathematically eliminate the rotation between corresponding left and right frames [1]. In this step, we also calculate the reprojection matrix Q, which projects a 2D point in the left image to the 3D world from the perspective of the left camera [1].
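Continuing the OpenCV sketch above (again an assumed toolchain, not the paper's stated implementation), rectification and the reprojection matrix Q come from a single call to stereoRectify:

```python
# Hedged sketch of Step 3: rectify frame pairs and obtain the reprojection
# matrix Q, reusing M1, d1, M2, d2, R, T and size from the Step 1 sketch.
import cv2

R1, R2, P1, P2, Q, roi_l, roi_r = cv2.stereoRectify(M1, d1, M2, d2, size, R, T)

# Per-camera maps that undistort and row-align the left and right frames.
map_lx, map_ly = cv2.initUndistortRectifyMap(M1, d1, R1, P1, size, cv2.CV_32FC1)
map_rx, map_ry = cv2.initUndistortRectifyMap(M2, d2, R2, P2, size, cv2.CV_32FC1)

# left_frame/right_frame: a touching-frame pair selected in preprocessing.
left_rect = cv2.remap(left_frame, map_lx, map_ly, cv2.INTER_LINEAR)
right_rect = cv2.remap(right_frame, map_rx, map_ry, cv2.INTER_LINEAR)
```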
4) Step 4: Modeling Target Device: We derive the device's corner points in the left and right images by calculating the intersections of the device's 4 edges. After getting 4 pairs of corresponding points, we calculate the disparity d of every point pair as the difference of their horizontal x coordinates in the left and right images. Then we calculate the 3D coordinate of each point by the following equation [1]:

\[
Q \begin{bmatrix} x \\ y \\ d \\ 1 \end{bmatrix}
= \begin{bmatrix} X \\ Y \\ Z \\ W \end{bmatrix}, \qquad (1)
\]

where (x, y) is the point's 2D coordinate in the left image, and d is the disparity. The 3D coordinate of the point is (X/W, Y/W, Z/W), where W is a scale factor.

We build a reference 3D model of the target device by measuring the device's geometric parameters and its keyboard layout, as shown in Figure 12. We perform a point warping algorithm between the 4 device corner points derived from the video frame and the 4 corresponding device corner points in the reference 3D model. By performing the warping, we obtain the transform between the target device and its model. We use this transform to calculate the keyboard's 3D position.

Fig. 12. Reference 3D Model. Fig. 13. 2 Types of Keyboard
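As a small worked illustration of Equation (1), the following sketch (assuming NumPy; the reproject helper and the sample pixel coordinates are hypothetical) reprojects one matched point pair to 3D:

```python
# Hedged sketch of Step 4's triangulation: apply Q to [x, y, d, 1]^T and
# divide by the scale factor W to get a 3D point, as in Equation (1).
import numpy as np

def reproject(Q, x_left, y_left, x_right):
    d = x_left - x_right                   # disparity: horizontal x difference
    X, Y, Z, W = Q @ np.array([x_left, y_left, d, 1.0])
    return np.array([X, Y, Z]) / W         # (X/W, Y/W, Z/W)

# e.g. a device corner seen at (812.0, 330.0) in the rectified left frame
# and at x = 780.5 in the right frame (hypothetical values):
corner_3d = reproject(Q, 812.0, 330.0, 780.5)
```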

5) Step 5: Estimating Touching Fingertip Position and Deriving Password Candidates: A touching frame is one in which the finger touches the touch screen. We track the finger using an optical flow algorithm and identify a touching frame as the one where the finger changes direction (from touching down to lifting up). For each touching frame, we derive the fingertip position through the following steps. We first select a point on the fingertip edge in the left frame, and perform a sub-pixel template matching algorithm to find the corresponding point of the fingertip in the right image. Next, we calculate the fingertip point's 3D coordinate using the same algorithm as in Step 4. Then we project the touching fingertip point onto the device screen plane. However, the projected fingertip point does not always accurately reflect the actual touching point.

We now generate password candidates based on the spatial relationship of the projected points. We consider two scenarios: the device only requires 4 digit key inputs, or the device requires a confirmation key input besides the digit key inputs. Examples of the 2 keyboards are shown in Figure 13.

In the first scenario, an input pattern is constructed from the projected points while the location of the pattern is still undefined. We enumerate all the possible pattern positions. To do so, we find the upper left point in the pattern. Then we generate candidates by moving the pattern on the keyboard: the upper left point of the pattern is aligned with the center point of each of the ten digit keys. We only keep the candidates that land on the keyboard. We rank the candidates by the distance of the pattern's upper left point to its corresponding projected point. We output the top three candidates as our final result.

In the second scenario, some keyboards require a user to input a confirmation key after the digit key inputs. The confirmation key is always at the same position. We can use this fact to correct the position of the estimated pattern. We calculate the vector between the confirmation key center and the projected fingertip position corresponding to the confirmation key input. Then we use this vector to correct the fingertip positions of the other 4 digit inputs. Finally, we check where the corrected touching fingertip positions land on the keyboard and derive the password.
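The paper gives no pseudocode for the candidate generation, so the sketch below is our simplified Python reading of both scenarios. The 3-column PIN pad geometry, the assumption that projected points are already scaled to key-pitch units, the half-key tolerance for "lands on the keyboard" and the helper names are all illustrative.

```python
# Hedged sketch of Step 5's candidate generation on an assumed 3x4 PIN pad.
import numpy as np

# Digit-key centers in key-pitch units: 1-9 in a 3x3 grid, 0 below the 8 key.
KEYS = {str(k): np.array([(k - 1) % 3, (k - 1) // 3], float) for k in range(1, 10)}
KEYS['0'] = np.array([1.0, 3.0])

def nearest_key(p):
    return min(KEYS, key=lambda k: np.linalg.norm(KEYS[k] - p))

def candidates_scenario1(pts, top=3):
    """Scenario 1: align the pattern's upper-left point with every digit-key
    center, keep alignments that land on the keyboard, rank by shift distance."""
    pts = np.asarray(pts, float)
    anchor = pts[np.lexsort((pts[:, 0], pts[:, 1]))[0]]   # upper-left point
    ranked = []
    for center in KEYS.values():
        shifted = pts - anchor + center
        guess = [nearest_key(p) for p in shifted]
        # Half a key pitch approximates the "lands on the keyboard" test.
        if all(np.linalg.norm(KEYS[g] - p) < 0.5 for g, p in zip(guess, shifted)):
            ranked.append((np.linalg.norm(center - anchor), ''.join(guess)))
    return [g for _, g in sorted(ranked)[:top]]

def password_scenario2(pts, confirm_pt, confirm_center):
    """Scenario 2: shift every projected point by the confirmation key's
    known offset, then read off the corrected key for each digit input."""
    offset = np.asarray(confirm_center, float) - np.asarray(confirm_pt, float)
    return ''.join(nearest_key(p + offset) for p in np.asarray(pts, float))
```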

V. EVALUATION

This section introduces the experiment setup and results.

A. Evaluation Setup

We use our AR system to perform real-world attacks against a Samsung tablet. We expect the results to be similar for other victim mobile devices. We perform experiments at different distances, ranging from 1.5 meters to 3 meters. The distance is defined as the horizontal distance between the camera and the target device. For each data point, the participant performs 10 four-digit numerical password inputs. The passwords are randomly generated.

B. Results

Fig. 14. Success Rate at Different Distances

The attack success rates in terms of distance are shown in Figure 14. We list success rates for both scenarios, with and without a confirmation key. In the scenario where the device doesn't require a confirmation key, we list the success rate for both single and three input attempts. In the case of three attack attempts, if any of our top 3 candidates matches the actual input password, we consider the attack a success. This assumption is reasonable since most devices allow three password/passcode attempts before locking the device. In the scenario where the device needs a confirmation key to unlock, we also list the success rate for both single and three attack attempts. The three-attempt success rate is derived from our single-attempt success rate based on the Bernoulli distribution.
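One reading of that derivation (assuming each of the three attempts is an independent Bernoulli trial with single-attempt success probability p_1):

```latex
% Probability that at least one of three independent attempts succeeds:
p_3 = 1 - (1 - p_1)^3 .
% For example, a single-attempt rate of p_1 = 0.5 gives
% p_3 = 1 - 0.5^3 = 0.875.
```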
From the experiments, we can observe that as the distance increases, the success rate decreases. Our attack approach is relatively effective when the distance between the attacker and the device is within 2.5 meters.

VI. CONCLUSION

In this paper, we first present a comparison of 7 popular virtual reality and augmented reality systems. It can be observed that some of the systems are equipped with stereo cameras. We then present an augmented reality system based on the Samsung Gear VR and the ZED stereo camera. We attach the ZED stereo camera onto the Samsung Gear VR headset, using a video stream server to transfer the live video from the camera to the headset. We present a computer vision based side channel attack using the stereo camera to obtain inputted numerical passwords on touch-enabled mobile devices. For 4-digit numerical passwords, our attack reaches a success rate of 90% at a distance of 1.5 meters and achieves a reasonably good success rate within 2.5 meters. Through this case study, we would like to raise awareness of potential security and privacy breaches from seemingly innocuous VR and AR devices (like the Samsung Gear VR, Microsoft HoloLens, and the iPhone 7 Plus and 8 Plus with dual cameras) that have been gaining popularity.

ACKNOWLEDGMENTS

This work was supported in part by National Science Foundation grants 1461060 and 1642124. Any opinions, findings, conclusions, and recommendations in this paper are those of the authors and do not necessarily reflect the views of the funding agencies.

REFERENCES

[1] G. R. Bradski and A. Kaehler. Learning OpenCV. O'Reilly Media, Inc., first edition, 2008.
[2] CNET. Virtual reality 101. https://www.cnet.com/special-reports/vr101/, 2017.
[3] Google. Daydream View. https://vr.google.com/daydream/smartphonevr/, 2017.
[4] Google. Google Cardboard IO 2015 technical specifications. https://vr.google.com/cardboard/manufacturers/, 2017.
[5] GStreamer. GStreamer concepts. https://gstreamer.freedesktop.org/documentation/tutorials/basic/concepts.html, 2017.
[6] Z. Li, Q. Yue, C. Sano, W. Yu, and X. Fu. 3D vision attack against authentication. In Proceedings of the IEEE International Conference on Communications (ICC), 2017.
[7] Z. Ling, J. Luo, Q. Chen, Q. Yue, M. Yang, W. Yu, and X. Fu. Secure fingertip mouse for mobile devices. In Proceedings of the Annual IEEE International Conference on Computer Communications (INFOCOM), pages 1-9, 2016.
[8] Microsoft. The leader in mixed reality - HoloLens. https://www.microsoft.com/en-us/hololens, 2017.
[9] Oculus. Oculus Rift. https://www.oculus.com/rift/, 2017.
[10] Samsung. GearVR Framework project. https://resources.samsungdevelopers.com/Gear_VR_and_Gear_360/GearVR_Framework_Project, 2017.
[11] Samsung. Samsung Gear VR specifications. http://www.samsung.com/global/galaxy/gear-vr/specs/, 2017.
[12] Sony. Playstation VR tech specs. https://www.playstation.com/en-us/explore/playstation-vr/tech-specs/, 2017.
[13] Stereolabs. ZED - depth sensing and camera tracking. https://www.stereolabs.com/zed/specs/, 2017.
[14] Q. Yue, Z. Ling, X. Fu, B. Liu, K. Ren, and W. Zhao. Blind recognition of touched keys on mobile devices. In Proceedings of the ACM SIGSAC Conference on Computer and Communications Security (CCS), pages 1403-1414, 2014.
