
A Measurement Study of Google Play

Nicolas Viennot, Computer Science Department, Columbia University, New York, NY, USA, [email protected]
Edward Garcia, Computer Science Department, Columbia University, New York, NY, USA, [email protected]
Jason Nieh, Computer Science Department, Columbia University, New York, NY, USA, [email protected]

ABSTRACT

Although millions of users download and use third-party Android applications from the Google Play store, little information is known on an aggregated level about these applications. We have built PlayDrone, the first scalable Google Play store crawler, and used it to index and analyze over 1,100,000 applications in the Google Play store on a daily basis, the largest such index of Android applications. PlayDrone leverages various hacking techniques to circumvent Google's roadblocks for indexing Google Play store content, and makes proprietary application sources available, including source code for over 880,000 free applications. We demonstrate the usefulness of PlayDrone in decompiling and analyzing application content by exploring four previously unaddressed issues: the characterization of Google Play application content at large scale and its evolution over time, library usage in applications and its impact on application portability, duplicative application content in Google Play, and the ineffectiveness of OAuth and related service authentication mechanisms resulting in malicious users being able to easily gain unauthorized access to user data and resources on Amazon Web Services and Facebook.

Categories and Subject Descriptors

C.2.4 [Computer-Communication Networks]: Distributed Systems; C.4 [Performance of Systems]: Measurement techniques; C.5.3 [Computer System Implementation]: Microcomputers - Portable devices; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval - Information filtering; J.7 [Computers in Other Systems]: Consumer products; K.6.2 [Management of Computing and Information Systems]: Installation Management - Performance and usage measurement; K.6.5 [Management of Computing and Information Systems]: Security and Protection - Authentication

Keywords

Android; Authentication; Clone Detection; Decompilation; Google Play; Mobile Computing; OAuth; Security

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
SIGMETRICS'14, June 16–20, 2014, Austin, Texas, USA.
Copyright is held by the owner/author(s). Publication rights licensed to ACM.
ACM 978-1-4503-2789-3/14/06 ...$15.00.
http://dx.doi.org/10.1145/2591971.2592003.

1. INTRODUCTION

The Google Play store allows users to download and use a vast number of third-party applications. Millions of users register personal information both with Google and third-party services to download and use these applications on their personal Android phones and tablets. Hundreds of thousands of developers upload content to the Google Play store and millions of users download the content despite the fact that the content is largely unchecked.

However, little is known at an aggregate level about the hundreds of thousands of applications available in the Google Play store. This is due in large part to the lack of scalable tools available for discovering and analyzing Android applications in the Google Play store. Application source code is also only available to the respective third-party developers. Not even Google has access to the source code, as applications are submitted directly as compressed binary packages by application developers to Google Play. Furthermore, Google imposes various mechanisms to prevent others from crawling and indexing Google Play store content. For example, discovery of applications in the Google Play store is limited as only the first 500 applications belonging to any category or matching any search term can be found by browsing the store's web interface. Some applications also require specific hardware features or other existing applications and libraries to be available on the end-user device. Such applications are only available if the Google Play interface is accessed with an account registered on a device with the prerequisites available.

To explore Google Play content, we have created PlayDrone, the first scalable Google Play store crawler and application analysis framework. PlayDrone uses four key techniques. First, PlayDrone leverages common hacking techniques to easily circumvent security measures that Google uses to prevent indexing Google Play store content. These techniques include simple dictionary-based attacks for discovering applications, and decompiling and rebuilding the Google Play Android client to use insecure communication protocols to communicate with the Google Play servers to capture, understand, and reproduce the necessary protocols. Second, PlayDrone leverages higher-level languages and frameworks to provide highly concurrent, distributed processing with modest implementation effort. PlayDrone is written in Ruby and uses the Sidekiq [31] asynchronous processing framework and the Redis [33] key-value store. Its performance scales easily by simply adding servers to the cluster, enabling PlayDrone to efficiently crawl the Google Play store on a daily basis even as its content continues to grow. Third, PlayDrone stores each application's metadata and decompiled sources in a Git repository. This provides a simple versioning system for PlayDrone to track and manage multiple versions of each application and analyze how Google Play store content evolves over time. Finally, PlayDrone leverages the Elasticsearch [19] distributed real-time search and analytics engine using an indexing schema based on the Google Play store API to make it easy to analyze and explore the Google Play store metadata and content.

We have used PlayDrone to crawl the Google Play store and analyze over 1,100,000 Android applications, including decompiling the source code for over 880,000 free Android applications and analyzing over 100 billion lines of decompiled code. We demonstrate the usefulness of PlayDrone for analyzing application content by exploring four previously unaddressed issues in understanding Android applications. First, we provide a characterization of Google Play application content at scale. We discuss the relationship between application ratings and download frequency, and how applications are categorized in Google Play and how the choice of self-categorization can affect application visibility. We show how Google Play store content evolves over time, providing a measure of how often applications are released, updated, and removed. We also show that a small percentage of free applications account for almost all downloads.

Second, we perform the first large-scale source code analysis of library usage in Android applications. We show how library usage differs between popular and unpopular applications, including that native libraries are heavily used among the most popular applications. As a result, Android systems which only support Java-based applications are inadequate

[...]

ning the applications. Our results demonstrate that developer confusion may subvert the effectiveness of the widely used OAuth open source standard for authentication. We notified and worked with service providers to prevent these attacks, including providing Google with code to help them scan for secret keys in applications as part of the Google Play application publication process to protect users and developers.

The rest of this paper is organized as follows. Section 2 describes how PlayDrone interfaces with the Google Play API. Section 3 describes the PlayDrone crawler architecture, and Section 4 measures its scalable performance. Section 5 characterizes Android applications in Google Play. Section 6 discusses library usage in Android applications. Section 7 describes our approach for efficiently detecting similar Android applications and our measurements of similarity among applications in Google Play. Section 8 presents a study of secret authentication key usage and its problems in Android applications. Section 9 discusses related work. Finally, we present some concluding remarks.

2. INTERFACING WITH GOOGLE PLAY

To crawl the Google Play store, PlayDrone needs to communicate with the Google Play store, which requires use of a Google account for all the necessary functionality. Using only a few Google accounts to crawl the entire store might risk having the accounts disabled by Google, so we decided to harvest a large number of Google accounts. To do this quickly and efficiently, we had to address two problems. First, registering for a Google account requires solving CAPTCHAs. Second, registering for a Google account requires phone verification when the same IP attempts to register more than five accounts on a given day.

We addressed both issues by using a crowdsourcing Internet marketplace service to cheaply use other human users to register for Google accounts from a diverse set of IPs. Any such service could be used, including dedicated