No Escape From Reality: Security and Privacy of Augmented Reality Browsers

Richard McPherson, Suman Jana, Vitaly Shmatikov
University of Texas at Austin

ABSTRACT

Augmented reality (AR) browsers are an emerging category of mobile applications that add interactive virtual objects to the user's view of the physical world. This paper gives the first system-level evaluation of their security and privacy properties.

We start by analyzing the functional requirements that AR browsers must support in order to present AR content. We then investigate the security architecture of Junaio, Layar, and Wikitude browsers, which are running today on over 30 million mobile devices, and identify new categories of security and privacy vulnerabilities unique to AR browsers. Finally, we provide the first engineering guidelines for securely implementing AR functionality.

Copyright is held by the International World Wide Web Conference Committee (IW3C2). IW3C2 reserves the right to provide a hyperlink to the author's site if the Material is used in electronic media. WWW 2015, May 18–22, 2015, Florence, Italy. ACM 978-1-4503-3469-3/15/05. http://dx.doi.org/10.1145/2736277.2741657.

1 Introduction

Augmented reality (AR) technologies enhance users' perception of the world by blending interactive virtual objects with the visual representation of actual objects in real time [2, 3]. Traditional AR applications range from medical visualization to aircraft navigation, but only recently have consumer mobile devices become sufficiently powerful to run AR software.

AR applications have three stages: sensing input, transforming sensed objects (e.g., adding virtual objects), and rendering the transformed objects to the user. Modern AR platforms ease the burden of implementing these tasks. By far the most popular platforms are AR browsers like Junaio, Layar, and Wikitude, available as SDKs or standalone mobile apps. Junaio has more than 20 million users and over 20,000 content developers who have created more than 210,000 AR "channels" [14]. Layar has 1.5 million users and 9,000 content developers [18]. Wikitude has 13 million users [29] and over 30,000 content developers.

All existing AR browsers are based on Web browsers and are similar to them in the sense that they, too, fetch and display interactive content from websites ("channels," in AR parlance). In addition to rendering HTML and executing JavaScript, AR browsers provide support for the three key tasks necessary for AR functionality: sensing, transforming, and displaying transformed objects. They enable AR channels to (1) access sensors on the mobile device, including the onboard camera and GPS location, (2) create and manipulate a variety of 2D and 3D interactive virtual objects, and (3) display virtual objects on top of the camera feed, realistically blending them with real objects. The resulting AR content combines image recognition, geolocation, interactive virtual objects, conventional Web content, and control code written in JavaScript (see an example in Fig. 1).

Figure 1: A Layar-based mobile app [7].

The basic architecture of AR services is shown in Fig. 2. From the security and privacy perspective, its key aspect is that the AR browsers provide augmentation mechanisms, but the actual AR content comes from channels created by independent developers. Just as a conventional Web browser is an interface between the user and Web content from independent websites, an AR browser is an interface between the user and independent AR content. An AR browser is thus responsible for ensuring that malicious AR content cannot access content from other sources, nor damage or abuse the user's system outside the browser.

A major difference between Web browsers and AR browsers is the business model. Web browsers are typically part of the standard software distribution, and their developers are paid by the licensing fees from OEMs and OS owners and by the search engines. This model works because there is already a wealth of Web content. AR browsers, however, need a different model because there is not much AR content available today. Their sources of revenue include advertising injected into AR content, registration fees from content developers, and revenue sharing for paid content. This business model has an impact on the architecture of AR services: unlike Web content, which is accessed directly from the Web browser, requests to load third-party AR content must go through the AR service provider, as shown in Fig. 2.

Our contributions. We perform the first systematic analysis of the security and privacy properties of AR browsers and how they differ from Web browsers. Untrusted AR content presents new, unique types of threats, yet, in contrast to Web-browser specifications, the latest Augmented Reality Markup Language (ARML) specification [19] barely mentions security or privacy, and they are often overlooked in the design of the existing AR browsers.

We start by analyzing the functional requirements needed to support the sensing, transforming, and rendering of AR content. These include new ways of combining AR objects and conventional HTML content from multiple origins, new APIs for accessing objects outside the browser, new mechanisms for controlling the display of AR and HTML objects, and new ways of launching content.

Then, for each functional requirement, we investigate how it is implemented by the existing AR browsers.
All AR browsers are based on Web browsers, which do not support AR functionality, forcing AR browsers to resort to ad-hoc cross-origin mechanisms, APIs that open holes in the browser sandbox, custom techniques for composing visual content from different origins, and non-standard delegation schemes for authentication credentials.

Architectural flaws in these mechanisms result in security and privacy vulnerabilities. We explore the threat model of AR browsers and demonstrate several new categories of threats caused by the AR browsers' unique combination of high-volume visual data gathering, image-triggered code execution, outsourced image processing, and merging images from the onboard camera with third-party content. For example, individual-specific items such as license plates can automatically launch malicious AR content, enabling fully automated stalking and tracking; malicious AR channels can abuse image-triggered code execution; and a conventional webpage can hijack the AR browser installed on the user's mobile device and use it to gain unauthorized access to the device's camera and GPS without the user's permission. We also show that AR browsers amplify existing threats such as cross-site scripting, clickjacking, cookie stealing, and leakage of private information.

For each design flaw, we present our recommendations. Some are easy to fix, others require a substantial re-design, but none are mere "bugs." They all stem from the fact that standard system components used in today's mobile and Web applications are insufficient to securely support AR functionality. For each functional requirement of the AR browsers, we explain which features and system abstractions are needed to implement it properly.

2 AR Services

AR services are deployed by AR service providers such as Junaio and Layar. These companies supply AR client software (we use the term AR browser) to users and maintain dedicated AR servers through which users access third-party AR content (see Fig. 2). AR content providers are independent developers who create AR content, host it on their own servers, and register this content with AR service providers. We use the term channel generically for any AR content, but the actual terminology differs from service to service (e.g., channels are called layers in Layar).

Figure 2: Architecture of a typical AR service.

By analogy with the conventional Web, AR service providers are similar to Web-browser developers, while AR channels are similar to Web applications. There are important differences, however. AR providers make money by charging for SDK licenses, features such as cloud storage for AR channels, and per-user fees from third-party apps that connect to their services. All providers analyzed in this paper allow a limited use of free channels, but some charge for commercial channels and/or may insert banner ads into free channels. Therefore, they typically require that browsers initiate access to third-party channels via providers' own servers.

2.1 Functional requirements

Access to native resources on the user's device. The AR browser must have access to the onboard camera and GPS location to recognize images and locations that launch AR content, and to correctly add AR objects to the camera feed.

Support for interactive, non-HTML AR content. In addition to HTML content such as images and text, AR content may include 2D and 3D models and animations that cannot be described in HTML. AR channels thus include service-specific XML or JSON defining how to place and render these objects.

Image-triggered code execution. AR browsers access content in non-standard ways: they send images from the device's camera to their servers, which attempt to recognize certain pictures and QR codes and automatically launch the associated AR channels.

Outsourced image processing. Image recognition is a computationally heavy task that may not be feasible on low-powered mobile devices and often involves proprietary algorithms. Furthermore, image-based code execution requires the server to extract the trigger image from the camera feed and match it against a proprietary database of registered images. Therefore, AR browsers send images from the phone's camera to the AR provider for processing.

Visual composition of AR content. The AR browser is responsible for constructing a visual stack that combines non-HTML AR content, such as interactive 3D models, with HTML content from multiple origins (e.g., online ads) on top of the camera feed.

Indirect retrieval of AR content. Instead of directly fetching AR content from its developers, AR browsers typically submit requests via the AR provider's server. This enables providers to charge fees for registration and usage, inject advertising, etc.

2.2 Components of AR services

AR browsers. Fig. 3 shows the generic architecture of an AR browser, including (1) one or more instances of an embedded Web browser such as WebView, (2) a "native" component with direct access to OS-managed resources such as the camera and GPS location, and (3) ad-hoc mechanisms for gluing these pieces together.

Figure 3: Architecture of an AR browser.

AR channels. An AR channel is roughly similar to a website. It defines an augmented reality experience by specifying AR content to display and how to display it. This content may include AR objects linked to a geolocation ("points of interest" or POI), HTML pages, audio, video, etc., as well as JavaScript to control these objects. The channel may also specify the actions to take when a certain object comes into view or is clicked by the user.

For example, an AR channel may overlay historical pictures when viewing landmarks (http://layar.it/YuDzik), show reviews for nearby restaurants (Wikitude Restaurants), or control an avatar running around the scene (junaio://channels/?id=127275). A channel may directly incorporate third-party content (for instance, include online ads in its HTML) or instruct the AR browser to load a third-party webpage when the user performs a certain action.

A user launches a channel by selecting it from a list provided by the AR browser (based on the geolocation or most popular channels) or by scanning an image.

AR servers. As explained in Section 2.1, requests to load a channel are sent by the AR browser to the AR provider's server, not directly to the channel server (see Fig. 2). Each request includes some combination of the channel's id, the location of the device, and other data. The AR server forwards the request to a server that the channel owner registered with the AR provider. The AR server may also handle the authentication of users to channels (Section 9). The response from the channel with the XML or JSON definitions of AR objects is forwarded via the AR server, too. Subsequent requests may be sent by the browser directly to the channel server.

2.3 Specific AR browsers

We focus on the most popular AR browsers. Junaio is an AR browser developed by Metaio to augment both print media and geolocation-based environments (Fig. 4). Layar focuses primarily on adding AR features to print media such as magazines and newspapers (Fig. 5), but also supports geolocation-based AR. AR content for Layar is served by layers, but we will refer to them as channels for terminological consistency. Wikitude is another AR browser, but some of its features did not execute correctly in our testing, thus we discuss only the features we were able to evaluate.

Figure 4: (a) A Junaio channel showing nearby places of interest. (b) A channel showing a 3D model placed over the Junaio logo.

Figure 5: A Layar channel running on a scanned magazine page. The AR objects are circled. Clicking any color below the 3D watch model changes its color. The user can also add the watch to his or her shopping cart.

Unlike HTML, AR content is browser-specific, i.e., a Junaio browser can only display Junaio channels. Augmented Reality Markup Language (ARML) is a proposed standard that unifies the XML format of AR objects [19].

3 Threat Model

We are concerned with five classes of attackers.

AR attackers. Just like a standard Web attacker, an AR attacker controls the malicious content of his AR channel and may trick or entice users into visiting it. He cannot observe users' network communications with other destinations, nor execute any code on their machines other than JavaScript served by his own channel.

Unlike conventional Web browsers, AR browsers automatically launch a channel whenever they scan a picture or QR code associated with it. This introduces a new attack vector: since the attacker can choose any image for his channel, he can trick users' browsers into automatically launching malicious content by placing this image in a public place (e.g., as a sticker on a wall).

Ad attackers. AR channels can include third-party content such as syndicated ads. An "ad attacker" tricks a trusted website or AR channel into incorporating his malicious content, e.g., via ad brokers. We assume that ads can run arbitrary JavaScript, but are confined into iframes when rendered by the AR browser.

Web attackers. The focus of this paper is on malicious AR content, but we also investigate how the mere presence of AR browsers on the device can be exploited by conventional Web attackers (Section 4.2). A Web attacker controls his own website (but not the network) and may lure users to it via enticing content, ads, etc.

Curious AR services. We assume that AR browsers are benign (the issues raised by malicious mobile apps are well beyond our scope), but we do investigate privacy risks caused by user-specific visual data collected by AR services.

Network attackers. We briefly analyze privacy threats posed by network attackers. Either through man-in-the-middle attacks or by being on the same network as the victim, a network attacker can listen in on the communications between the AR browser and the AR provider, AR channel owners, and third-party servers.

4 Out-of-sandbox Native Access

AR browsers cannot function without access to the camera and GPS location. Both are required to launch AR channels and to correctly add AR objects to the camera feed. Consequently, all AR browsers equip JavaScript with some form of access to native device resources outside the browser. Script access to AR objects is also required by the ARML 2.0 specification [19, Section 9.1].

These custom APIs effectively open holes in the Web-browser sandbox, intended to support native access by the channel's own JavaScript. Unfortunately, the WebView embedded browser where this JavaScript is executed does not provide any way to restrict access to these APIs. Consequently, they can be accessed by any Web content regardless of its origin.

Another common functionality is launching AR browsers via custom URLs. This is needed for interoperability [21, Section 5]: for example, one AR browser may launch another AR browser to render proprietary content that is not supported by the first browser.

4.1 Doing it wrong

The control code of Junaio channels is written in JavaScript and executed in an embedded WebView. This WebView is extended with custom APIs for accessing the camera, reading and changing the Junaio-reported geolocation, controlling the device's light, making requests to the channel server, opening conventional Web-browser windows, or loading a different channel.

These APIs are accessed via AREL, a JavaScript library that encodes commands in pseudo-URIs. For example, arel://media/website/?action=open&external=true&url=http%3a%2f%2fwww.google.com instructs the Junaio app to launch Google.com in a conventional browser. To pass this pseudo-URI from WebView to the Junaio app, the channel's Web code pushes it to the global "commandQueue" and sets window.location to "arel://requestsPending". The Junaio app intercepts the URL load event, reads the command off the queue, and performs the requested action. In Junaio on Android, however, any content, regardless of its origin and even if running inside an iframe, can bypass AREL and execute native commands directly, without user permission, by setting window.location to the corresponding pseudo-URI.

4.2 Risks

In this section, we are concerned with (1) conventional Web attackers, whose malicious pages are viewed by mobile users in conventional browsers, (2) "ad attackers," whose untrusted HTML is incorporated into trusted AR channels but confined into iframes, and (3) AR attackers who directly control malicious AR channels.

Conventional Web content breaking out of the sandbox. Conventional webpages cannot access the camera or other native resources outside the browser unless explicitly authorized by the user. Unfortunately, the AR browser's access rights can be hijacked by malicious webpages to gain this access without involving the user.

Suppose the user has the Junaio app installed on his Android phone. The user accidentally visits a malicious webpage in a regular Web browser (e.g., Android's default system browser) by clicking on an ad, a link in a spam message, etc. The malicious page contains a URL of the form junaio://channel=... and a script in the page forces the browser to open this URL. This generates an Android intent, which automatically starts the Junaio app and launches any channel chosen by the attacker, e.g., the attacker's own channel. Like all Junaio channels, the attacker's channel automatically has access to the device's camera, can take pictures of the user and his surroundings, etc. Layar, too, can be automatically launched from a conventional webpage via a pseudo-URI.

This attack completely bypasses OS access control. Even though the user granted camera access only to Junaio or Layar, this access has now been hijacked by a conventional webpage. The attack can even be stealthy. Having read images from the camera, the attacker's channel can relaunch the regular Web browser and immediately redirect the user to the page he was initially browsing.

This vulnerability is generic because the ability to automatically launch an AR browser is required for interoperability [21]. The presence of a single AR browser on the user's device can thus be exploited by any conventional webpage to bypass user permissions.

Malicious ads breaking out of the sandbox. Because native-access rights are not restricted to the channel's own origin, any iframe can hijack them. In Section 5.2, we describe how native-access capabilities can be used by a malicious ad to perform a cross-site scripting attack against any origin of its choosing.

Furthermore, malicious third-party iframes included into a trusted channel can redirect the entire AR browser to a malicious channel. For example, in Layar, a script in an iframe can use a layar:// command to switch the browser to a different channel. In Junaio, the switchChannel command in AREL (also accessible from an iframe) has the same effect. This can be exploited for undetectable phishing: a malicious iframe can automatically switch the browser to a visually indistinguishable malicious channel.

Malicious AR content abusing native access. The ability of AR channels to access resources outside the browser sandbox presents privacy risks to their users. In Junaio, as long as the channel's transparent overlay (Section 5.1) continues to run in the background, it can surreptitiously grab images from the camera and send them to the channel server even after the user moved away from the place where he launched the channel. The user's location can be tracked in a similar fashion in Junaio, Layar, and Wikitude.

4.3 How to do it right

Interfaces to native resources must be protected by origin-based access control, lest they be hijacked by untrusted iframes. Recent solutions to the problem of unauthorized native access by third-party origins in mobile apps, e.g., NoFrak by Georgiev et al. [9], may be applicable to AR browsers. Furthermore, AR browsers should be re-designed to support fine-grained native-access permissions. For example, instead of unfettered access to a camera, the channel would be restricted to accessing it only via pre-defined system abstractions such as "recognizers" [12] for specific objects.

To prevent conventional webpages from gaining unauthorized access to the camera and other resources by launching the AR browser and directing it to the attacker's channel, the user should be asked for confirmation whenever the AR browser is invoked automatically (this presents interface-design and usability challenges).

5 Support for Non-HTML AR Content

In addition to HTML content such as images and text, interactive AR content includes videos, animations, and 2D and 3D models with unique visual presentation requirements. These AR objects cannot be described in HTML alone, thus AR services rely on XML or JSON definitions to specify how to place and render these objects, and on JavaScript to control these objects at runtime.

Just like conventional websites, AR channels may combine content from different origins. AR browsers must therefore confine untrusted content. In conventional Web browsers, the same-origin policy (SOP) ensures that content from a given origin, defined by the protocol (HTTP or HTTPS), domain name, and port number, cannot access the non-trivial attributes of any content from a different origin [27]. Web browsers also provide origin-based isolation mechanisms such as iframes and structured cross-origin communication mechanisms such as postMessage.

In AR browsers, interactive, non-HTML AR objects make the confinement problem much harder because these objects must be described in XML or JSON, which are not governed by the SOP. Therefore, the AR browser cannot rely on the underlying Web browser to provide isolation between origins.

5.1 Doing it wrong

Junaio. In Junaio, AR objects are defined in an XML page returned by the channel server. Junaio supports floating clickable objects ("points of interest"), 3D models, floating pictures, movies, and 360-degree panoramas (Figs. 4a and 4b). The Junaio browser renders these objects in the visual stack shown in Fig. 6.

Figure 6: Junaio's visual stack. AR objects are on top of the camera feed, the transparent overlay on top of the objects. If an object is clicked, a popup appears at the very top.

On top of the AR objects, Junaio places a transparent window implemented using WebView (Android) or UIWebView (iOS). We call it the transparent HTML overlay. This overlay provides GUI functionality to channels and enables them to control AR objects outside WebView via special browser interfaces and a custom JavaScript library called AREL (Section 4.1). These interfaces can be used to create, destroy, animate, move, or resize AR objects, to read and modify their parameters such as id, name, geolocation, and the associated popup, and to handle events based on channel state, object state, or user's interaction with the object (e.g., channel ready, object loaded, sound finished playing, object rotated).

The URL of the transparent overlay is specified in the XML page and may belong to a different origin than the AR channel itself. This URL cannot be viewed by the user. The channel, and any third-party content included in the channel, can also supply JavaScript that will be executed inside the transparent page.

Clicking a link in the transparent overlay loads the destination in the same window, replacing the old page. JavaScript in the transparent overlay can also open an opaque window with a conventional embedded Web browser. Another way to open an opaque window is via a popup (Section 5.2). JavaScript continues to run in the background after opening the window.

Layar. The Layar browser displays AR objects on top of the visual feed from the device's camera (Fig. 7). The objects can be HTML webpages, 2D images, 3D models, or videos, and can have actions associated with them, such as placing a phone call, sending an SMS or email, launching a website, loading or refreshing channels, sharing the channel on Facebook or Twitter, and loading movies and music. Actions are specified in the object definition via pseudo-URIs such as 'tel:', 'sms:', 'mailto:', 'layar://', 'layarshare://'.

Figure 7: Layar's visual stack. AR objects, which can include HTML pages, are overlaid on the camera feed.

Wikitude. Wikitude is architecturally similar to Junaio. AR content includes a transparent webpage that shows a GUI and controls AR objects via a custom JavaScript library called ARchitect. Object types include HtmlDrawable, intended to display HTML content. HtmlDrawables have an evalJavaScript function that can be used to execute JavaScript inside a drawable (it worked only sporadically in our testing on Android 4.4.2).

5.2 Risks

In this section, we are concerned with AR attackers, who may incorporate trusted content into their malicious AR channels, and "ad attackers," whose malicious content (e.g., online ads) is incorporated into trusted AR channels but confined into iframes.

Cross-site scripting. The XML definition of an AR object in Junaio can have a popup field with a textual description and an array of buttons. When such an object is clicked, a partially transparent window with the popup's description and buttons is opened on top of the transparent overlay (Fig. 6). Each button contains either a URL or JavaScript code. When a button is clicked, the associated URL is loaded in an opaque window. If the button contains JavaScript, it is executed in the transparent overlay, even if the origin of the content in the overlay is different from the origin of the channel that provided the script.

This setup opens a hole in the same-origin policy. A malicious channel can specify any origin for the transparent page and associate an arbitrary script with a button. When the button is clicked, this script is injected into the transparent page and gains unrestricted access to all content from this page's origin (see Fig. 8a). This cross-site scripting (XSS) vulnerability can be exploited, for example, to modify the victim's DOM (see Fig. 8b) or steal cookies.

Figure 8: Cross-site scripting (XSS) in Junaio. (a) XSS vector. (b) Exploiting XSS.

HtmlDrawable objects in Wikitude contain an even simpler XSS vulnerability. A malicious channel can specify any URL for an HtmlDrawable object and use evalJavaScript to inject an arbitrary script into this object.

Universal cross-site scripting. The above XSS attacks assume that the channel is malicious. Unfortunately, even if (1) the channel itself is benign, (2) all untrusted, third-party content, such as online ads, is correctly confined to iframes, and (3) the embedded Web browser running the channel's HTML correctly enforces the SOP, confined third-party content in Junaio can perform XSS attacks against any origin of its choosing.

Consider a benign Junaio channel that includes an AR object with a popup button and suppose that the channel's transparent HTML page contains an ad in an iframe (Fig. 9a). Malicious JavaScript hidden in such an ad can (1) use AREL commands (Section 4.1) to change the script associated with the popup, and (2) change the URL of the transparent overlay to the victim page (Fig. 9b). When the button is clicked, the attack script is executed in the victim page (Fig. 9c). This is a universal XSS vulnerability: a malicious ad can inject an arbitrary script into any origin whatsoever.

Figure 9: Universal XSS vulnerability in Junaio. (a) Step 1. (b) Step 2. (c) Step 3.

As a proof of concept, we have implemented this attack against Twitter. Our channel includes an HTML page and one geolocation object associated with a popup. At first, this popup simply launches google.com. The channel's HTML page contains an iframe with a button. When clicked, this button executes JavaScript which issues an AREL command to Junaio to associate the popup with an attack script, then changes the URL of the transparent page to Twitter with an attacker-chosen tweet text. When the user opens the popup and clicks the button, Junaio unwittingly injects the script into the Twitter page, where it submits the attacker's tweet.

Other capabilities available to malicious code in an iframe include launching an opaque browser window, turning on and off the camera and the light, removing all AR objects, switching channels, and launching audio and video files.

5.3 How to do it right

Quick patches. The cross-site scripting vulnerabilities described above are caused in part by the fact that the origin of HTML incorporated into an AR channel may be different from the channel's own origin. One plausible defense is for the AR browser to ensure that the two origins are the same; another is to sanitize XML so that it does not contain scripts, which is a notoriously difficult problem [4]. Both defenses require the AR browser to carefully reason about the origins of content specified in custom XML definitions, thus replicating a complex piece of Web-browser functionality. Furthermore, both defenses disable important functional features of AR browsers (such as controlling the appearance of AR objects from another origin) and may break existing applications.

In Wikitude, where evalJavaScript allows channels to inject scripts into an HtmlDrawable regardless of its origin, restricting the origin is not feasible because HtmlDrawable is intended to display content from origins other than the channel itself.

Principled solutions. The root cause of many security holes described in this section is that AR objects cannot be described in HTML, thus AR browsers must use custom mechanisms to enable HTML content to control these objects. Standardizing AR object description languages, including them in HTML5 via either tags or a special document type (e.g., channel), and adding support for these new HTML5 features into browsers would allow AR content to execute entirely within WebView, eliminating the need for XML and some of the ad-hoc browser interfaces.

Unfortunately, assigning origins to these tags is not trivial. In the existing AR browsers, all AR objects are treated as if their origin is the domain where the main AR channel is hosted. Since these objects may contain JavaScript, this is extremely dangerous.

The alternative is to extend the same-origin policy to AR tags. These tags are intended to support 3D models, animations, UI elements, etc., which may come from different domains but are intended to work smoothly together to produce a unified AR experience. A naive extension of the SOP would isolate the AR HTML tags based on their domains, but this would prevent them from communicating. The developers would then have to implement cross-origin communication mechanisms, which is fraught with peril [26]. Enforcement of the SOP is complicated further by the fact that several of these new tags may need plugins to be rendered (similar to Flash).

Lacking HTML5 support for AR, WebView can at least provide origin-restricted APIs that render arbitrary objects on top of camera images and let JavaScript inside WebView control these objects.

6 Image-Triggered Code Execution

The ability to scan their surroundings and to recognize and track images is fundamental in AR browsers [19, Section 7.5.1.2]. This enables new methods for invoking AR content: for example, a Junaio or Layar channel can be launched simply by scanning a picture or QR code associated with the channel.

6.1 Doing it wrong

When the user is viewing his surroundings through the Junaio or Layar browser, the AR service is continuously analyzing the camera feed. As soon as it recognizes an image associated with some channel, it automatically launches and executes the channel's content, without any confirmation prompts. The user cannot preview the URL or any other information about the content, with one exception: for QR codes (but not pictures), Layar previews the URL by showing it as a button before launching the channel. Unfortunately, its URL parser is broken (Fig. 10). For example, if the URL in the QR code is http:////attacker.com, it will not be displayed in the preview, but the browser will launch AR content hosted at http://attacker.com. In Junaio, after a channel is fully loaded, the user can see its description and the developer's name.

Figure 10: Both codes launch the same channel, but Layar fails to parse the code on the right and does not show the URL.

6.2 Risks

In this section, we are primarily concerned with AR attackers who can choose any picture or QR code as the automatic trigger for their malicious channels. For some (but not all) attacks, the attacker needs to physically place these images in public places.

Fully automated, stealthy, large-scale tracking. Because not all AR services vet pictures associated with AR channels, they can be used for automated stalking and tracking. For example, Layar's image recognition algorithm is sufficiently precise to distinguish between license plate numbers. An AR attacker can register a Layar channel associated with the photo of a specific license plate. Whenever any of the millions of Layar users scan their surroundings and the license plate is prominent in the camera's view, the channel is launched automatically and the plate's location, along with its entire visual environment, is sent to the channel's owner, enabling him to track the plate's movements. Other sensitive items can be tracked in a similar fashion.

Automatically launching malicious content. As mentioned above, when an image is recognized by the AR service's (black-box) recognition algorithm, the code of the associated channel executes without user confirmation or channel identification. If a scanned image contains sub-images associated with different channels or a famil- [...]

Figure 11: Depending on the angle, each poster nondeterministically launches its own channel or the channel associated with the other poster.

Furthermore, a malicious channel can suppress the channel selection menu using the native-access capabilities described in Section 4.2. A layar://[channelname] pseudo-URI instructs the browser to launch a channel. In this case, the browser does not show other channels associated with the image. Consider a malicious channel that associates itself with the same image as a benign channel. If the user previews the malicious channel before flipping to the benign channel, the first object loaded from the malicious channel can reload the entire channel using layar://[channelname] and the other, benign channel will no longer be visible to the user.

The other risk is composite images that include a trusted image in an unexpected visual environment. When faced with a composite image, Junaio's choice of the channel to launch depends on the camera angle and distance. For example, the images in Figs. 12a and 12b launch different channels depending on whether the mascot or the QR code is more prominent. Sometimes, changing the angle of the camera by a few degrees changes which channel is launched. The image in Fig. 12c automatically launches the channel associated with the mascot when scanned from a close distance, and the one associated with the QR code when scanned from further away. Fig. 12d launches the channel associated with the QR code, even though the mascot is visible. This means that even after scanning a familiar image, a user cannot be sure that the automatically launched channel is the one he expects.

Figure 12: Different combinations of the Junaio mascot and QR code launch different channels. Panels (a) through (d).
iar image in an unusual environment or an image that is similar yet 6.3 How to do it right subtly different from a familiar image, the user cannot know ahead of time what channel will be launched. The risk of an AR attacker registering an image trigger that is spe- cific to an individual (e.g., a license plate) is inherent in AR ser- Image recognition algorithms suffer from false positives and are vices. A service may attempt to filter out such images during chan- inevitably nondeterministic from the user’s point of view [30]. Un- nel registration, but this requires deep semantic analysis of the sub- fortunately, user interfaces of the AR browsers are derived from the mitted images and will be inevitably bypassable. This inherent risk underlying Web browsers and do not inform the user about spurious is exacerbated by the fact that AR content is executed immediately matches and other potential problems with visual identification. after the image is scanned. This can be exploited by an AR attacker in two ways: (1) register First, AR browsers should inform the user about the origin of an image trigger that is very similar to an image already associated AR content before launching it (at the very least, display the de- with a trusted channel, or (2) combine a malicious channel’s trig- veloper’s name and basic information about the channel). Second, ger with a trusted channel’s trigger into a single composite image. automatic, image-triggered code execution is fraught with danger In either scenario, the AR browser may be tricked into automati- and should be used sparingly—for example, only with trusted chan- cally launching the malicious channel when scanning the attacker’s nels—and not with every image that happens to fall into the cam- image on a building wall, bus shelter, etc. era’s field of vision. 
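To make the first two recommendations of Section 6.3 concrete, the following Python sketch shows one way a browser could decide what to display, and whether to launch, for a scanned URL. It is illustrative only and does not describe any existing AR browser: the names preview_origin and launch_decision, the trusted_origins set, and the three outcomes ("refuse", "confirm", "auto") are assumptions made for this example. The idea is to parse the URL once, show the user exactly the origin that was parsed, and refuse to launch when no host can be shown, so that a malformed URL such as the http:////attacker.com example from Section 6.1 cannot be launched without a preview.

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def preview_origin(url):
    """Return the origin string to show the user, or None if it cannot be parsed."""
    try:
        parts = urlsplit(url)
        port = parts.port            # raises ValueError on a malformed port
    except ValueError:
        return None
    if parts.scheme not in DEFAULT_PORTS:
        return None                  # only plain Web schemes are previewable here
    host = parts.hostname            # None when the authority component is empty
    if not host:
        return None
    if port is None or port == DEFAULT_PORTS[parts.scheme]:
        return f"{parts.scheme}://{host}"
    return f"{parts.scheme}://{host}:{port}"


def launch_decision(url, trusted_origins):
    """Hypothetical policy: never launch content whose origin cannot be displayed."""
    origin = preview_origin(url)
    if origin is None:
        return ("refuse", None)      # nothing to show, so nothing is launched
    if origin in trusted_origins:
        return ("auto", origin)      # automatic launch reserved for trusted channels
    return ("confirm", origin)       # show the origin and ask the user first
```

In this sketch the string returned for display is derived from the same parse that gates the launch, so the preview and the launched content cannot disagree about the host.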
Third, AR browsers should develop better user In Layar, the same picture may be associated with multiple chan- interfaces that inform users about the possibility of spurious image nels. For example, we have been able to register our channel with matches and nondeterministic launches of unexpected content. the same movie-poster image as one of Layar’s demo channels. If there are multiple channels associated with a picture, the user can 7 Outsourced Image Processing open a menu in the corner to see channel names and switch between AR browsers must continuously analyze the device’s camera feed them. It is possible, however, to create visually similar images that in order to recognize automatic content triggers and to anchor or automatically and nondeterministically launch different channels position AR objects on the screen. without the browser presenting the channel selection menu to the 7.1 Doing it wrong user. Each poster in Fig. 11 nondeterministically launches the chan- nel associated with the poster or the (completely different) channel AR browsers such as Junaio and Layar do not process the captured associated with the other poster. At many viewing angles and light- images on the device; instead, they send them to central AR servers. ing, the channel selection menu is not offered. There are several reasons to outsource image processing. First, for Figure 13: An image sent by the Layar browser over HTTP so that the Layar server can recognize content triggers. Note the accidentally captured credit card. business reasons—injecting ads, charging content providers, keep- (a) Overlapped HTML widgets in Layar. The widget with “hi” is ing usage statistics, etc.—all AR content retrieval is mediated by cut off before its Tweet button. the AR service. Second, to facilitate image-based channel launch- ing, recognition of trigger images is done at the server. 
Because this involves matching against proprietary databases using proprietary algorithms, centralized image processing helps protect intellectual property and removes the need to replicate and update the service’s image database on millions of devices. Third, many image recog- nition algorithms are computationally intensive and would heavily task low-powered mobile devices. 7.2 Risks In this section, we are concerned with (1) network attackers who observe network traffic between the device and the provider’s AR server, and (2) the AR service itself. (b) The two HTML widgets expanded. In the attack, the bottom Accidental overcollection of sensitive data is a big risk in this widget is not fully shown (its tweet text is covered and not visible to setting. For example, the Layar browser sends raw camera images the user). to the server over unencrypted HTTP and includes the phone’s lo- Figure 14: Clickjacking in Layar. cation into the GET request for the channel’s JSON. Combining In Layar, a channel can use a webpage as an AR object, called images with location data is a serious privacy concern for many an HTML widget. Each widget opens in its own WebView and does users [10]. All sensitive items in the image (see Fig. 13) and re- not display URLs or Web-browser buttons. HTML widgets may quest are leaked to any Wi-Fi eavesdropper. not be covered by other types of AR objects, but can be overlaid on Even if network communications are secure, the AR service in- each other to create a visual AR experience. evitably collects a tremendous amount of raw visual data about its users’ physical environment. This is an inherent design flaw of all 8.1 Doing it wrong existing AR services. 
The users must trust them to safeguard cap- tured images, which contain a lot of sensitive information that is Conventional Web browsers provide the iframe abstraction that al- completely irrelevant to the AR functionality: screens, credit cards, lows composition of HTML content from different origins. To license plates, etc. Furthermore, a user has no way to learn which defend against clickjacking, a webpage can ensure that it is not data is sent to AR servers. For example, in addition to the unen- framed by a page from a different origin, via either framebust- crypted camera images sent at the start of each scan and the geolo- ing [25] (moving itself to the top frame), or X-Frame-Option [31]. cation, the Layar browser occasionally sends a log message to the AR browsers must deal with both HTML and non-HTML con- server with the phone’s make, model, and OS version number. tent, and thus resort to custom mechanisms to implement the func- tional equivalent of iframe. Consequently, standard defenses based 7.3 How to do it right on framebusting or X-Frame-Option no longer work. For exam- When image recognition is outsourced to the server, a secure pro- ple, as described above, Layar puts each instance of HTML content tocol should be used to prevent accidental leakage of irrelevant in- into its own WebView instance. Each instance acts like an iframe formation in the images. If the server is attempting to recognize a and can be overlaid on other instances. Therefore, a malicious AR channel trigger on a magazine page, there is no need for it to “see” channel can overlay content from another origin (B) on top of its the physical objects surrounding the page. own content (A) without B being technically “framed” by A. This is a difficult problem, but there has been some recent progress. 8.2 Risks Osadchy et al. described a prototype system for secure outsourced face matching [20]. 
This system cannot be directly applied in AR In this section, we are concerned with an AR attacker whose chan- browsers, however, because images matched by AR browsers may nel combines his own malicious HTML content with trusted HTML appear in different lighting, under different angles, etc. Another content from other origins. approach is taken by Darkly [13], which can perform simple com- By cleverly overlaying HTML widgets from different origins, a puter vision tasks without access to raw image details. malicious channel can “hijack” the user’s clicks. The user sees a button that appears to belong to some window, but the click is actu- 8 Visual Composition ally captured by a different window. For example, Fig. 14a shows To render images, text, 2D and 3D models, and HTML content a malicious Layar channel that overlays two Twitter windows. The from multiple origins on top of the camera feed, AR browsers main- user may think that the visible “Tweet” button submits the “hi” tain complex visual stacks. Junaio and Layar’s visual stacks are tweet, but it actually belongs to the bottom window and thus sub- described in Section 5.1. mits the invisible, malicious tweet. Because the victim page is in Figure 15: Overlapped HTML widgets in Layar. The first option is in a different widget and, surprisingly, not part of the actual Slashdot poll. the top/main frame of its own WebView instance, it cannot prevent Figure 16: User authentication in Layar. this attack or even detect when it is being framed in this way. to the channel that the channel’s origin has changed. The browsers 8.3 How to do it right continue to attach the cookies from the old origin to their requests, Defenses against clickjacking in AR browsers would benefit from a and the Layar server obliviously forwards them to the new origin. whole-browser equivalent of X-Frame-Option. 
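One way to picture a whole-browser equivalent of X-Frame-Options for the widget stacks discussed in Section 8 is a per-widget opt-out that the compositor enforces before drawing. The Python sketch below is a hypothetical illustration, not Layar's API: the Widget type, its deny_overlap flag, and the overlap_violations function are assumed names, and widgets are simplified to axis-aligned screen rectangles. The check deliberately ignores origins, because in the attack of Fig. 14 both overlapping widgets display Twitter pages.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Widget:
    name: str
    x: float
    y: float
    width: float
    height: float
    deny_overlap: bool = False   # hypothetical opt-out requested by the page


def rects_intersect(a, b):
    """True if the two rectangles share any interior area."""
    return (a.x < b.x + b.width and b.x < a.x + a.width and
            a.y < b.y + b.height and b.y < a.y + a.height)


def overlap_violations(widgets):
    """Return name pairs of overlapping widgets where at least one opted out."""
    return [(a.name, b.name)
            for a, b in combinations(widgets, 2)
            if (a.deny_overlap or b.deny_overlap) and rects_intersect(a, b)]
```

A browser following this sketch could refuse to render a channel, or move the offending widget, whenever overlap_violations returns a non-empty list.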
Layar already pre- vents non-widget objects from covering widgets, but there is no 9.2 Risks way for HTML content to specify that its widget—or any WebView In this section, we are concerned with AR attackers who lie about in which it is displayed—should not overlap with other widgets. their channels’ URLs. By “desynchronizing” the Layar browser’s In general, using conventional browsers such as WebView to and the Layar server’s understanding of the channel’s origin, a ma- render AR content is dangerous because it forces AR browsers to licious channel can steal cookies from any origin (Fig. 17). use ad-hoc mechanisms for visually combining content from differ- ent origins. A principled solution to clickjacking in AR browsers For example, the attacker initially tells Layar that the URL of should involve a clean-slate redesign of their user interfaces. his channel is https://www.twitter.com. When a user launches the channel, the Layar browser attaches Twitter cookies to every chan- 9 Indirect Retrieval of AR Content nel update request. Next, the attacker changes his channel’s URL to https://attacker.com. The Layar server registers the change, but AR content comes from independent third-party developers. While the browsers connected to the channel continue to attach Twitter in theory the AR browser could fetch this content directly from the cookies to every channel update request. The Layar server for- developers’ servers, in practice the business models of AR services wards these requests, cookies attached, to https://attacker.com, and require them to track usage, charge fees for channel registration, the attacker steals all of its users’ Twitter cookies. inject advertising, and, in general, tightly monitor the interaction between their browsers and third-party content. Consequently, con- This attack works for any domain of the attacker’s choosing (we tent requests must pass through the AR provider’s own servers. tested it for Twitter and Facebook). 
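The desynchronization behind this attack, and the second defense of Section 9.3 (the server and the browser must agree on the channel server's URL), can be sketched in a few lines of Python. This is a simplified model rather than Layar's actual protocol: origin_of, cookies_to_forward, and the premise that each update request carries the channel URL the browser believes it is talking to are assumptions made for illustration.

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def origin_of(url):
    """Return the (scheme, host, port) triple that identifies a Web origin."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    return (parts.scheme, parts.hostname, port)


def cookies_to_forward(browser_channel_url, registered_channel_url, cookies):
    """Forward cookies only if browser and server agree on the channel origin."""
    if origin_of(browser_channel_url) != origin_of(registered_channel_url):
        return {}                    # origins diverged: forward no cookies
    return dict(cookies)
```

With the URLs from the example above, a browser that still believes the channel lives at https://www.twitter.com while the server has https://attacker.com registered would have its cookies dropped instead of forwarded.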
Note that many AR channels are integrated with online social networks, thus the user is likely to 9.1 Doing it wrong be logged into Facebook and Twitter through his AR browser. Some AR browsers enable third-party channels to authenticate users or keep track of users’ preferences between their visits. For exam- 9.3 How to do it right ple, Layar supports a cookie-based user authentication scheme that The first defense is to avoid replicating the state of the browser on can be deployed by geolocation channels (Fig. 16). When request- the Layar server. The browser may request the URL of the chan- ing a channel, the Layar browser sends a POST request to the Layar nel server from the Layar server, but subsequent communication server and attaches the cookies associated with the channel’s origin. should be conducted directly between the browser and the channel The Layar server then attaches these cookies to the GET request it server. The same-origin policy within the browser will then en- forwards to the channel server. sure that cookies are disclosed only to their origins. This defense, Cookie security depends on the binding between the cookie and however, may break Layar’s business model. its origin. A conventional Web browser keeps this binding and au- tomatically attaches the cookie to every request sent to its origin. The second defense is for the Layar server to ensure that it agrees In Layar, channel launches are mediated by the Layar server, which with the browser about the channel server’s URL. This defense must maintain the same cookie-origin binding as the Layar browser. requires re-engineering the protocol between the browser, Layar The Layar browser learns the origin of the channel from the La- server, and channel server. yar server. When the browser first loads the channel, the cookies are The final defense is to use an authentication protocol that sup- set by the channel’s authentication page and thus correctly bound ports delegation, e.g., OAuth. 
In current Layar, channels may use to the channel’s origin at that time. If this origin changes later (e.g., OAuth 1.0 to authenticate the Layar server. This protects benign the channel moves to a different domain), the Layar server notes AR developers from spoofed Layar servers, but not legitimate La- the change and forwards all browser requests to the new location. yar servers from malicious developers, and thus does not help against Critically, the Layar server does not notify the browsers connected the cookie-stealing attack. Glass to connect to an attacker-controlled Wi-Fi access point.4 Both attacks employ malicious QR codes, but the similarities end there. The attacks described in Section 6.2 exploit the deficiencies of user interfaces in AR browsers, not software vulnerabilities. Other re- lated work includes security threats involving QR codes [15] and the use of QR codes for malware distribution and phishing [17]. Dabrowski et al. [5] recently demonstrated numerous attacks in- volving hiding a QR code inside of another QR code, similar to our our attacks in Section 6.2. Clickjacking attacks against conventional Web content were an- alyzed in [11, 24, 25]. In Section 8.2, we explained that our click- jacking attacks and defenses are somewhat different because of the architectural differences between Web browsers and AR browsers. (a) 11 Conclusions Augmented reality (AR) browsers are a new technology with excit- ing potential. We presented the first in-depth analysis of their secu- rity and privacy properties, identified multiple architectural flaws, and proposed short-term fixes for specific vulnerabilities as well as directions for future research on building secure AR browsers. We have reported our findings to Junaio, Layar, and Wikitude. Junaio informed us that they will incorporate our results into their latest internal build. Wikitude was aware of the security flaw in HtmlDrawable (Section 5.2) and is looking into adding security mechanisms. 
Layar never responded to us. Acknowledgments. This work was partially supported by the (b) NSF grants CNS-0746888 and CNS-1223396, a Google research award, NIH grant R01 LM011028-01 from the National Library of Figure 17: Layar cookie stealing attack. Medicine, and Google PhD Fellowship to Suman Jana. 10 Related Work 12 References Azuma et al. [2, 3] identified three major properties of AR systems, exhibited by all AR browsers in our study: combining real and [1] G. Abowd, C. Atkeson, J. Hong, S. Long, R. Kooper, and virtual objects, real-time interactivity, and support for 3D blend- M. Pinkerton. Cyberguide: A mobile context-aware tour guide. ing of virtual and real objects. Several papers suggested adding Wireless Networks, 3(5), 1997. augmented reality to mobile applications such as tour guides [1, [2] R. T. Azuma. A survey of augmented reality. Presence: 8]. Spohrer et al. [28] explored the idea of associating information Teleoperators and Virtual Environments, 6(4):355–385, 1997. with real-world objects using “WorldBoard channels.” Just like AR [3] R. T. Azuma, Y. Baillot, R. Behringer, S. Feiner, S. Julier, and channels analyzed in this paper, “WorldBoard channels” can dis- B. MacIntyre. Recent advances in augmented reality. Computer Graphics and Applications, 21(6):34–47, 2001. play HTML-encoded information overlaid on real-world objects. [4] D. Bates, A. Barth, and C. Jackson. Regular expressions considered Kooper et al. [16] used the term “real world-wide web” for the harmful in client-side XSS filters. In WWW, 2010. combined information space created by merging real-world objects [5] A. Dabrowski, K. Krombholz, J. Ullrich, and E. Weippl. QR with the HTML content of the WWW. ARML [19] is a proposed inception: Barcode-in-barcode attacks. In SPSM, 2014. standard for defining geospatial AR objects through XML. [6] L. D’Antoni, A. Dunn, S. Jana, T. Kohno, B. Livshits, D. Molnar, Roesner et al. [22, 23] have surveyed various security, privacy, A. Moshchuk, E. 
Ofek, F. Roesner, S. Saponas, M. Veanes, and H. J. and legal concerns arising from the widespread use of AR technolo- Wang. support for augmented reality applications. In HotOS, 2013. gies. By contrast, we analyze the technical architecture of pop- [7] Layar launches “world’s first augmented reality store”. ular, deployed AR platforms. With the exception of clickjacking http://eurodroid.com/2010/04/28/layar- and general privacy concerns, none of the issues we discovered are launches-worlds-first-augmented-reality-store, mentioned in these papers. 2010. Several recent papers focused on privacy concerns arising from [8] S. Feiner, B. MacIntyre, T. Höllerer, and A. Webster. A touring the unrestricted access to sensor data by untrusted third-party ap- machine: Prototyping 3D mobile augmented reality systems for Personal Technologies plications. Darkly [13] prevents certain privacy violations by appli- exploring the urban environment. , 1(4), 1997. [9] M. Georgiev, S. Jana, and V. Shmatikov. Breaking and fixing cations based on the OpenCV computer vision library; D’Antoni et origin-based access control in hybrid Web/mobile application al. [6] and Jana et al. [12] show how to confine AR applications by frameworks. In NDSS, 2014. adding fine-grained permissions to the OS. These systems are con- [10] B. Henne, M. Harbach, and M. Smith. Location privacy revisited: cerned with protecting users from untrusted applications, whereas Factors of privacy decisions. In CHI, 2013. we investigate whether and how trusted AR applications protect [11] L.-S. Huang, A. Moshchuk, H. J. Wang, S. Schechter, and users from untrusted content (i.e., our threat model is similar to the C. Jackson. Clickjacking: Attacks and defenses. In USENIX Security, standard threat model of Web browsers). 2012. 
Some of our attacks involve pictures or QR codes placed in a 4http://www.techweekeurope.co.uk/news/ public area to trick AR browsers into launching a malicious AR google-glass-security-vulnerability- channel. Lookout Mobile Security used a QR code to force Google internet-of-things-122073 [12] S. Jana, D. Molnar, A. Moshchuk, A. Dunn, B. Livshits, H. J. Wang, Interoperability_Architecture_Jan_21_2014_v1_ and E. Ofek. Enabling fine-grained permissions for augmented 2.pdf, 2014. reality applications with recognizers. In USENIX Security, 2013. [22] F. Roesner, T. Kohno, T. Denning, R. Calo, and B. C. Newell. [13] S. Jana, A. Narayanan, and V. Shmatikov. A scanner Darkly: Augmented reality: Hard problems of law and policy. In UPSIDE, Protecting user privacy from perceptual applications. In S&P, 2013. 2014. [14] Become a Junaio developer. [23] F. Roesner, T. Kohno, and D. Molnar. Security and privacy for http://www.slideshare.net/metaio_AR/why-to- augmented reality systems. In Communications of the ACM, become-a-junaio-developer, 2013. volume 57, pages 88–96, 2014. [15] A. Kharraz, E. Kirda, W. Robertson, D. Balzarotti, and A. Francillon. [24] G. Rydstedt, E. Bursztein, and D. Boneh. Framing attacks on smart Optical delusions: A study of malicious QR codes in the wild. In phones and dumb routers: Tap-jacking and geo-localization. In DSN, 2014. WOOT, 2010. [16] R. Kooper and B. B. MacIntyre. Browsing the real-world wide web: [25] G. Rydstedt, E. Bursztein, D. Boneh, and C. Jackson. Busting frame Maintaining awareness of virtual information in an AR information busting: A study of clickjacking vulnerabilities at popular sites. In space. International Journal of Human-Computer Interaction, 16(3), W2SP, 2010. 2003. [26] S. Son and V. Shmatikov. The postman always rings twice: Attacking [17] K. Krombholz, P. Frühwirt, P. Kieseberg, I. Kapsalis, M. Huber, and and defending postMessage in HTML5 websites. In NDSS, 2013. E. Weippl. 
QR code security: A survey of attacks and challenges for [27] Same origin policy. http: usable security. In HCI, 2014. //www.w3.org/Security/wiki/Same_Origin_Policy. [18] Layar introduction for developers. [28] J. Spohrer. Information in places. IBM Systems Journal, http://www.slideshare.net/layarmobile/layar- 38(4):602–628, 1999. introduction-for-developers, 2011. [29] Wikitude for agencies. [19] Open Geospatial Consortium. OGC augmented reality markup http://www.slideshare.net/wikitude/wikitude- language 2.0 (ARML 2.0) [candidate standard]. http://www. media-portfolio-presentation, 2012. opengeospatial.org/projects/groups/arml2.0swg, [30] Z. Wu, Q. Ke, M. Isard, and J. Sun. Bundling features for large scale 2013. partial-duplicate web image search. In CVPR, 2009. [20] M. Osadchy, B. Pinkas, A. Jarrous, and B. Moskovich. SCiFI - A [31] The X-Frame-Options response header. system for secure face identification. In S&P, 2010. https://developer.mozilla.org/en- [21] C. Perey. A proposal for AR browser interoperability. US/docs/HTTP/X-Frame-Options. http://www.perey.com/ARStandards/AR_Browser_