Analysis of Location Data Leakage in the Internet Traffic of Android-based Mobile Devices

Nir Sivan, Ron Bitton, Asaf Shabtai
Department of Software and Information Systems Engineering
Ben Gurion University of the Negev

USENIX Association, 22nd International Symposium on Research in Attacks, Intrusions and Defenses

Abstract

In recent years we have witnessed a shift towards personalized, context-based services for mobile devices. A key component of many of these services is the ability to infer the current location and predict the future location of users based on location sensors embedded in the devices. Such knowledge enables service providers to present relevant and timely offers to their users and better manage traffic congestion control, thus increasing customer satisfaction and engagement. However, such services suffer from location data leakage which has become one of today’s most concerning privacy issues for smartphone users. In this paper we focus specifically on location data that is exposed by Android applications via Internet network traffic in plaintext without the user’s awareness. We present an empirical evaluation involving the network traffic of real mobile device users, aimed at: (1) measuring the extent of relevant location data leakage in the Internet traffic of Android-based smartphone devices; (2) understanding the value of this data and the ability to infer users’ points of interest (POIs); and (3) deriving a step-by-step attack aimed at inferring the user’s POIs under realistic, real-world assumptions. This was achieved by analyzing the Internet traffic recorded from the smartphones of a group of 71 participants for an average period of 37 days. We also propose a procedure for mining and filtering location data from raw network traffic and utilize geolocation clustering methods to infer users’ POIs. The key findings of this research center on the extent of this phenomenon in terms of both ubiquity and severity; we found that over 85% of the users’ devices leaked location data, and the exposure rate of users’ POIs, derived from the relatively sparse leakage indicators, is around 61%.

1 Introduction

In recent years, there has been a trend towards the personalization of services in many areas. This is particularly true for services provided on mobile devices, where a plethora of context-based applications (e.g., Yelp, Google Maps) are used daily by millions of people. These devices possess a tremendous amount of private information, ranging from users’ personal and financial data to their location data, making such devices the target of personalized advertisements (by commercial entities) and intelligence gathering. A key property of many of these services is the ability to understand the users’ current location, infer points of interest, and predict the future location of users based on location sensors embedded in the devices. Such knowledge enables service providers to present relevant and timely services (such as navigation recommendations, weather forecasts, advertisements, and social networks) to their users, thus increasing customer satisfaction and engagement. Methods for deriving the location of a mobile device can be categorized into the following two approaches.

In the host-based approach, an installed application can infer the location of the device by probing built-in sensors or evaluating data provided when a user checks in to a place on social media. Local sensors that can provide location data include hotspot (Wi-Fi) information such as the SSID and BSSID [1], connected cell tower, as well as GPS [2]. The location can also be inferred by using various side-channel attacks such as power supply variance analysis [3].

In the network-based approach, the location of the mobile device can be derived by using cell tower triangulation [4] (i.e., using radio location by analyzing signals received by the cell towers the device is connected to) or by analyzing the call detail records (CDRs) [5]. This requires highly privileged access to the data, which is usually available only to service providers and law enforcement agencies.

In order to utilize location traces as a meaningful information source, it is imperative to analyze the data and aggregate it to location clusters that are important to the user, such as home, shopping, or work [2]. These locations are also known as the users’ points of interest (POIs). The most common approach for inferring a user’s POIs is by clustering the location traces by distance and time thresholds; eventually, a cluster will be produced if the user stays in the same place for a sufficient amount of time. POIs are identified by understanding which clusters are important to the user and omitting less important data such as transit data [6,7].

Location data collected on the mobile device may be provided to third party services by applications or leaked by a malicious application [8]. Recent research has reported a high rate of personal data leakage by popular applications over insecure communication channels without users’ awareness. These studies also showed that location data is one of the most "popular" leaked types of personally identifiable information (PII), as 10% of the most popular applications leak location data in plain text [8]. In fact, according to Trend Micro, the location permission was identified as the most abused Android application permission (footnote 1). This privacy breach was also acknowledged during the 2018 DEFCON workshops, when applications of both iOS and Android-based devices were detected sending accurate location data in unencrypted formats (footnote 2).

Footnote 1: http://about-threats.trendmicro.com/us/library/image-gallery/12-most-abused-android-app-permissions
Footnote 2: https://arstechnica.com/information-technology/2018/09/dozens-of- -apps-surreptitiously-share-user-location-data-with-tracking-firms/

Recent research investigated the privacy risks to mobile device users emanating from legitimate or malicious applications granted permission to access the user’s location. Most of these works, however, focused on analyzing and quantifying the exposure of private data to the specific application (or location-based service provider) granted access to the location data [9–12]. In practice, multiple applications with access to location data are installed on each mobile device. Hence, there is a need to explore the implications of location data leakage by multiple applications on the user’s privacy.

One main challenge in conducting research that focuses on understanding the privacy risks to mobile device users is the collection of location data accessed and/or sent by applications, which might require root access. Recent studies collected data either by running applications in a controlled environment (e.g., sandboxes) [11,12] or by providing subjects with alternative (rooted) devices that were not necessarily used as their primary devices [9]. Previous studies also analyzed the privacy risk associated with an adversary that can eavesdrop on the communication of the mobile device. These works focused mainly on automatically identifying PII.

In this research, we investigate the phenomenon of location data leakage in the Internet traffic of Android-based smartphones, by multiple applications. The main goals of this research are as follows. First, to understand the amount and quality (relevancy) of location leakage detected in plain text in the device’s network traffic. Second, to analyze the location leaks in order to infer the user’s POIs. This task is not trivial due to the fact that the vast majority of existing POI detection methods assume a consistent and high rate of location sampling (e.g., via GPS); therefore, they cannot be directly applied on noisy and sparse location data, like the data we focus on in this study (i.e., location data leaked over mobile device network traffic). Third, to understand the privacy exposure level of users in terms of the number of identified POIs, amount of data required for identifying the POIs, accuracy of the detected POIs, and time spent in the POIs.

In order to achieve these goals, we collected and analyzed the Internet traffic of 71 smartphone users for an average of 37 days, while the devices were being used routinely. In addition, we collected the location of the mobile devices by using a dedicated Android agent (application) that was installed on the devices and sampled the location sensor. The data collected by the agent was used as the ground truth for the actual location of the mobile device. The results of our experiment showed that over 85% of the users’ devices leaked location data. Furthermore, the exposure rate of users’ POIs, derived from the relatively sparse leakage indicators, is around 61%. Even cases of low location leakage rates (once in six hours) and coverage of only 20% (i.e., location data was leaked during 20% of the time in which it was accessed by the applications) can expose approximately 70% of the weighted POIs.

In summary, the contributions of this paper are as follows. First, we explore and discuss the scope, volume and quality of location-based data leaked via insecure, unencrypted network traffic of smart mobile devices. We are specifically interested in exploring the implication of location data leakage by multiple applications on the user’s privacy, a case which has not been investigated before. Second, we conduct an empirical evaluation based on real data from mobile devices. The evaluation involves a unique dataset that was collected simultaneously from the device itself and the network traffic sent from the device; such a dataset from real users’ devices is very difficult to obtain. Third, we present a methodological process for collecting, processing, and filtering location-based data from the network traffic of mobile devices in order to infer the users’ POIs. We use POI clustering on a sparse, inconsistent data stream by modifying available clustering algorithms, and discuss the experimental results and the effectiveness of the applied process. Finally, to the best of our knowledge, we are the first to present a step-by-step attack aimed at leaking location data and inferring users’ POIs from a device’s network traffic under realistic assumptions.

2 Related Work

2.1 Privacy in location based services

According to the GDPR definition [13], personal data or personally identifiable information (PII) is any data that can be used to identify a person, including name, ID, social media identity, and location. PII leaks are a major privacy concern for mobile device users. Along with device and user identifiers, location leakage is among the private data most commonly and extensively leaked from mobile devices [8]. As a case in point, recent studies showed that it is possible to de-anonymize users using location traces [14]. Furthermore,

location permission is commonly requested by most mobile apps (25% of apps use precise location, and an even greater number of apps use coarse location) [15]. Without accurate knowledge about how each application uses and handles the location data, access of applications to the location API poses a real threat to users’ privacy.

A summary of related works in this domain is provided in Table 8. The research can be broadly categorized by the threat actor that abuses network traffic access in order to disclose users’ personal information.

The first type of threat actor considered in related studies includes applications that are installed by the user on the mobile device [1, 9–12, 16–20]. These studies focus on analyzing the private data that is available to the installed application or the location-based service provider with which the application is communicating, and consequently understanding the privacy risks to which the user is exposed. This can include attempting to de-anonymize the user by using location traces [9], inferring the user’s POIs [9, 10, 19], or identifying other PII leaks [11, 12, 16–18, 20]. In most cases, the assumption is that the user has granted permission to access the private data [1,9,10,16,17,20]. Razaghpanah et al. [18] and Song et al. [19] analyzed the potential exposure of private data to third party services, such as advertisement and tracking services, usually in the form of libraries that are added to the applications. In these cases, the user grants the permission to an installed application without being aware of the third party library that is included within the application and without knowing that the data is available to and used by these third-party services.

In order to perform the analysis, the abovementioned studies collected data from mobile devices provided to users for the purpose of the experiment [9, 10] or executed the monitored applications on a dedicated mobile device or emulator [11,12,19,20].

Another threat actor considered in previous research is an adversary that can analyze publicly available location data in online social networks. Such data includes tweet metadata in Twitter, check-in information in Facebook, and proximity services. The adversary in these cases analyzes the data in order to identify users’ POIs [21–24], mobility patterns [25–27], and location [28]. As in our case, the spatial-temporal location data analyzed in these studies is also sparse and inconsistent. Nevertheless, in our research we consider a different adversary model, namely an eavesdropper that can monitor and analyze the network traffic sent from the mobile device (to the LBSs).

In our research we focus on a different adversary (threat actor), such as an ISP, VPN service provider, or proxy (i.e., an eavesdropper on network traffic) that is able to intercept all of the mobile device’s network communication. We assume that this threat actor is exposed to all data leaks by all of the applications and services installed on the mobile device (as opposed to leaks of data that is available to individual applications or a location-based service provider). In addition, unlike the works mentioned above in which the user actively installed the applications and granted access to the private data, in our case, the user is unaware that the adversary may be observing the network traffic.

Several previous studies also focused on a network eavesdropper adversary [8, 20, 29]. Taylor et al. [29], for example, showed that by sniffing the network traffic of well-known and heavily downloaded apps, an attacker can obtain a broad spectrum of personal and device identifying information without the user’s awareness. They searched for predefined strings of PIIs (e.g., phone identifiers such as the IMEI and MAC address) and conducted a controlled experiment by running applications on a dedicated mobile phone and collecting the network traffic using a dedicated setup. They did not, however, analyze location data leakage. In contrast, in this research, our focus is on analyzing the extent of location data leakage from devices that are owned and used by real users and by all applications that are installed on the users’ phones. Ren et al. [8] presented Recon, a system based on machine learning techniques that is used to automatically identify PIIs sent in plain text network traffic; the authors tested the proposed approach in a controlled lab environment (i.e., running applications in a sandbox), using devices owned by real users who were required to manually tag potential PII leaks, including location data. The authors, however, did not focus on estimating the extent and value of the location data leaked by all of the applications running on the device, which is a focus of our study. In addition, we present a step-by-step process that can be applied by the adversary in order to derive valuable information about the user from the raw network traffic.

The protocol used to transfer PIIs may be vulnerable to cyber attacks, thus creating an additional privacy breach regardless of user awareness. As a case in point, Ren et al. [20] presented a review of data leakage in mobile apps over the unsecured HTTP protocol, which can be used to identify the user. As part of this process, the adversary can use the PII detector presented by the authors in order to improve the detection of location data within network traffic.

To summarize, in this study we focus specifically on an adversary that can monitor network traffic. While in previous studies the adversary did not consider any noise in the observed data, in this study we discuss a realistic case where the attacker collects and analyzes raw network traffic in a completely unsupervised approach.

2.2 POI Identification Methods

Inferring meaningful locations from aggregated location traces is a field of research that has rapidly evolved since the appearance of cheap mobile GPS devices for civilian use. These devices, as well as smartphones (in the proper navigation and sampling mode) which have become ubiquitous, have a high and constant sampling rate. The availability of a

constant data stream is a common assumption for most studies in location analytics.

Inferring meaningful locations relies upon several major algorithm families: Ester et al. introduced DBSCAN [30], a density-based algorithm for spatial data, which provides the ability to determine clusters with undefined shapes, is not bound to a specific number of clusters, and does not use temporal data as a parameter. Birant et al. extended DBSCAN into ST-DBSCAN [31], which uses not only the spatial data of the database points but also their temporal data, making it more suitable for spatio-temporal data sets. Another approach, introduced by Kang et al. [6], clusters places based on time and distance thresholds to differentiate stay points from transit and thereby improve the analysis of trajectories. In the current research we refer to this method as the "incremental method." Kang et al. [6] used predefined constraints in order to prevent incorrect clustering due to missing information between traces. Montoliu et al. [32] also deal with inconsistency and missing data by adding a maximum time constraint between traces. Alvares et al. [33] proposed the use of semantic data to better understand the meaning of the collected data, and their method can be used to determine whether a place could be important to the user.

3 Threat Model

The threat model serving as the focus of this research is an adversary that is able to eavesdrop on the mobile device’s network traffic and is exposed to personal, sensitive information that is transmitted in plaintext.

Previous research has discussed personal information disclosure and inference from network traffic leakage. However, those studies mainly dealt with inferring static information, such as demographic attributes or other PII which can be observed when the user is connected to a single malicious hotspot. In our case, since we are analyzing location data over time in order to obtain contextual information, capturing network traffic from a single hotspot is insufficient. Thus, in this paper it is assumed that a threat actor can continuously capture the network traffic of the user’s mobile device. This special capability is granted to the following threat actors:

Internet service providers and mobile network operators (MNOs), who are exposed to a vast amount of the users’ network traffic. The ISP threat model is strong; however, we believe that location data leakage is a significant problem in and of itself and therefore should be explored. In addition, although ISPs can derive location information from the cell ID, the derived location is very coarse. On the other hand, the location that can be derived from the leaked data is much more accurate.

VPN and proxy servers, which relay mobile device network traffic to a third party server, can also misuse the unencrypted data sent in the network traffic; these solutions are increasingly being used by mobile users to protect their privacy or consume restricted entertainment content [34] (footnotes 3 and 4).

Footnote 3: https://www.forbes.com/sites/forbestechcouncil/2018/07/10/the-future-of-the-vpn-market/
Footnote 4: https://blog.globalwebindex.com/chart-of-the-day/vpn-usage/

Tor-like solutions are commonly used to protect privacy [35]; in this case, the location leakage data can be used by the exit node to expose the real user location (and possibly the user’s identity as well).

The main goal of this study is to estimate the potential privacy exposure of a user by such threat actors if they choose to misuse this data, or by an attacker. Note that previous research (e.g., [14, 36, 37]) proved that it is possible to de-anonymize users based on location traces. Therefore, an adversary that is able to monitor a device’s network traffic and infer the user’s POIs will be able to apply similar methods (together with information publicly available in phone books and social networks) in order to breach the user’s anonymity.

4 Data Collection

In the interest of exploring the extent of location data leakage in the Internet traffic of Android-based smartphones, we developed a dedicated data collection framework. Using this framework, we collected data from 71 participants for an average period of 37 days.

4.1 Data collection framework

The framework consists of three main components: a VPN client that collects all of the network traffic transmitted by the device to the Internet, a dedicated Android agent application that obtains location readings from the device’s location API, and a light-weight server.

VPN client. The high volume of Internet traffic on smartphones makes it prohibitive to store this information locally on the device for a long period of time. Alternatively, caching the Internet traffic locally and transmitting it daily to a remote server adds overhead to the device’s battery, CPU, and network. In addition, due to security concerns, such operations require super-user privileges, which necessitate rooting the user’s device. For these reasons, we opted to use VPN tunneling to redirect the traffic through a dedicated VPN server, where we could record and store the traffic. It should be mentioned that a VPN is the only user space API provided by the Android operating system that can be used to intercept network traffic without requiring super-user privileges.

Android agent (client) application. To understand the quality of location leakage detected in plaintext within the device’s network traffic and to estimate the privacy exposure level of users, we developed a dedicated Android application that obtains the actual location of users. The application uses the network location provider API, which utilizes three sources of information to assess the location: GPS, Wi-Fi, and Cell ID (obtained from the cellular network). The data collected

by the agent application was used as the ground truth for the actual location of the users. Applying clustering algorithms on data collected from the mobile device was shown to be accurate and effective for deriving users’ POIs [33]. Therefore, we opt to use this approach as our baseline and not to rely on the (subjective) collaboration of the participants in providing the actual/labeled POIs.

Light-weight server. This server has two primary objectives. First, it operates as an application server which communicates with the Android agent application and stores all location data in a database. Second, it acts as a VPN server which communicates with the VPN clients, records all of their network traffic, and redirects it to the Internet. To provide VPN connectivity and record the traffic, we created a dedicated LAN (local area network) on that server where every VPN client was assigned a different IP address in the LAN. The Internet traffic was recorded using tshark, a network analysis tool that is capable of capturing packet data from real-time network traffic.

We opted to use this data collection approach for the following reasons. First, because we were performing exploratory research where we did not know in advance exactly what data we could expect and what data we would want to collect and analyze, we wanted to be able to collect all of the network traffic transmitted by the device. Second, it was important for us to be able to collect data from devices owned and regularly used by the participants. Therefore, we could not use a data collection approach that requires root access. This also introduced another challenge to our research, namely the ability to link network traffic (specifically, location data) with the sending application. Finally, since we wanted to collect data for a long period of time, we followed prior research and opted to use the data collected by the agent as our baseline for the users’ POIs and not to rely on the (subjective) collaboration of the participants in providing the actual/labeled POIs, which would impose significant overhead on the participants (and consequently cannot scale). As noted, this approach was shown to be accurate and effective in prior research.

4.2 Experiment

We conducted an experiment involving 71 participants. The participants were current and former students at two universities located in two different cities in Israel. Additional information about the participants is as follows: 60% are male, and 40% female; 44% are in the 18 to 24 age range, and 56% in the 25 to 30 age range; 51% are undergraduate students, and 49% are graduate students.

In Figure 2 we present the distribution of the number of distinct application installations, focusing only on applications that are granted location permission. As can be seen, more than 40% of the applications are installed on one device. Furthermore, most of the applications are installed on less than 20% of the devices. Only a small number of applications are installed on more than 50% of the devices.

We also analyzed the popularity of applications installed on the participants’ devices. As can be seen, most of the heavily adopted applications used by the participants in the experiment (shown in Figures 1a and 1b) are extremely popular, with more than one billion installations worldwide (see Table 1). These applications (e.g., WhatsApp, Google Maps, Chrome, YouTube, Facebook, Skype, and Instagram) are not typical to a specific user profile. In addition, because of the diversity of the applications installed on the participants’ devices (as shown in Figures 1a, 1b, and 2), we believe that the insights derived from our study are valid and can be generalized to other user profiles.

Table 1: Exploring the popularity of applications installed on the mobile devices of the participants.

Application name        Users that installed   Type                         Installations          Applications in this
                        the application                                     (Google Play store)    category (Wikipedia)
WhatsApp                100%                   User                         1-5B                   31
Google Maps             100%                   Pre-installed                5B+                    6
Google Play Services    100%                   Depends on Android version   5B+                    6
YouTube                 100%                   Pre-installed                5B+                    6
Chrome                  90%                    Pre-installed                5B+                    6
[name missing]          85%                    User                         100-500M               334
Facebook                80%                    Depends on Android version   1-5B                   31
Shazam                  70%                    User                         100-500M               334
Facebook Messenger      67%                    Depends on Android version   1-5B                   31
Skype                   65%                    User                         1-5B                   31
Ynet                    52%                    User                         1-5M                   25K
Duolingo                47%                    User                         100-500M               334
Moovit                  47%                    User                         50-100M                484
IsraelTrain             43%                    User                         1-5M                   25K
GetTaxi                 42%                    User                         10-50M                 3925
Instagram               38%                    User                         1-5B                   31
Istudent                32%                    User                         100-500K               334
Viber                   32%                    User                         500M-1B                43

The participants were required to install the two client applications (VPN and Android monitoring agent) on their personal mobile device throughout the experiment, which lasted for an average period of 37 days, depending on the participants’ actual engagement. To ensure that the location sampling had minimal impact on the battery, the Android client sampled the location provider for 60 seconds every 20 minutes for most of the experiment. In addition, the server was deployed on an AWS (Amazon Web Services) EC2 instance to ensure high availability.

In Figure 3 we present the amount of time the users participated in the experiment. The distribution of the agent application’s actual location sample rate (as observed in the experiment) is presented in Figure 4.

Note that the framework we developed, and specifically the VPN and monitoring agent applications installed on the participants’ smartphones, were used for data collection, research, and validation only and are not assumed to be part of the threat model.

Figure 1: Distinct installations of applications that request location permissions among the participants; for each application group (user and system) we sorted the distinct installations and selected the top 40. Panels: (a) user applications; (b) system applications. Axes: package name vs. user installations (percentage).

4.3 Privacy and ethical considerations

The experiment involved the collection of sensitive information from real subjects for a long period of time. To preserve the subjects’ privacy, we took the following steps.

The subjects participated in the experiments at their own will and provided their formal consent to participate in the research. In addition, they were fully aware of the type of data that would be collected and were allowed to withdraw from the study at any time. The subjects received a one-time payment as compensation for their participation.

Anonymization was applied to the data. At the beginning of the experiment, a random user ID was assigned to each subject; this UID served as the identifier of the subject, rather than his or her actual identifying information. The mapping between the UID and the real identity of the subjects was stored in a hard copy document kept in a safe box; we destroyed this document at the end of the experiment.

During the experiment, the communication between the agents and servers was fully encrypted. In addition, the collected data was stored in an encrypted database. At the end of the experiment, the data was transferred to a local server (i.e., within the institutional network) that was not connected to the Internet. Only anonymized information of subjects was kept for further analysis.

Based on these steps, the research was approved by the institutional review board (IRB).

Figure 2: Distribution of distinct installations among applications that request location permission. Axes: users percentage vs. apps percentage.

Figure 3: The length of time (in days) users participated in the experiment. Axes: days vs. user percentage.

Figure 4: The actual (averaged) location sample rate of the agent installed on the users’ devices. Axes: sample rate (sec) vs. user percentage.

Figure 5: The geo-fencing of identified coordinates; filtering coordinates that are outside of a predefined geographical area.

5 Extracting Location Traces from Raw Network Traffic

Location data can be transmitted over network traffic in many formats including: explicit geolocation coordinates, names of cities or POIs, Wi-Fi networks (BSSID), and cellular network data (Cell ID). In this research we focus on explicit geolocation data that is transmitted in plaintext; such structured data can (potentially) provide more accurate location and is easier to extract and analyze; thus, it introduces a greater risk to the user’s privacy if leaked.

5.1 Process description

In order to automatically detect location traces within the network traffic of a mobile device, we capture the data at the IP network layer. Geographic coordinates can be represented in different formats [38]. We perform a regex search for the standard Android API [39] representation of geographic coordinates, which is decimal degrees in the following format: XX.YYYYYYY. We specifically used this regular expression for two main reasons. First, this is the standard format of the Android location API, and thus such expressions are very likely to be location data. Second, our manual exploration of other location data formats (e.g., names of cities, POIs, and Wi-Fi networks) indicated that they can dramatically increase the number of false positives. For example, city names sent by a weather forecast application do not indicate the true location of the user. Note that as proposed by Ren et al. [8], machine learning techniques can be used to automatically identify PII (including location data) within the network traffic; this, however, still does not ensure that the identified data indicates the true location of the user.

Each result is assigned a timestamp based on the packet capture time. This regular expression may also retrieve irrelevant results: simple float numbers which have no geographic meaning and can appear within the network traffic (e.g., the location of an object on the screen). Therefore, in the next step we apply the following heuristics in order to filter out irrelevant results.

Outgoing traffic filter. Extracting geographic coordinates only from outgoing traffic (incoming traffic may contain geographic data that is not relevant to the real location of the user, such as recommendations of POIs and weather forecasts).

Latitude/longitude pair filter. Extracting only pairs of valid geographic coordinates that were assigned the same timestamp; this is because a coordinate is represented by two values indicating the latitude and longitude of the location.

Geo-fencing filter. Filtering out geographic coordinates that are outside a predefined geo-fence (e.g., the geographical boundaries of a given country or city). In our case, all users were located within the geographical boundaries of Israel during the data collection period, and therefore we filtered out all geographic coordinates that are not within this area (an illustration of the geographical boundaries is presented in Figure 5). Note that to apply the geo-fencing filter in a general and practical case, the attacker can perform reverse geo-coding on the leaked location data and identify the geographical areas that are most likely to be relevant to a user or a group of users. We demonstrate this approach in Figure 6, in which we performed reverse geo-coding on randomly selected samples from the leaked location data (1%). As can be seen, most of the samples are located within the state of Israel.

Figure 6: Worldwide distribution of randomly selected geolocations found in the network traffic for all of the participants.
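The extraction step and the three filters described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Packet` type, the exact regular expression (two integer digits and four to seven decimals, ignoring any sign), the assumption that latitude and longitude appear adjacently in that order within one packet, and the rough bounding box standing in for the geo-fence are all hypothetical choices made for the example.

```python
import re
from typing import List, NamedTuple, Tuple


class Packet(NamedTuple):
    """Hypothetical record for one captured plaintext packet."""
    timestamp: float  # packet capture time (seconds)
    outgoing: bool    # True if sent from the device
    payload: str      # decoded plaintext payload


# Assumed pattern for decimal degrees of the form XX.YYYYYYY.
COORD_RE = re.compile(r"(?<![\d.])\d{2}\.\d{4,7}(?![\d.])")

# Assumed, approximate bounding box: (min_lat, max_lat, min_lon, max_lon).
GEOFENCE = (29.4, 33.4, 34.2, 35.9)


def extract_pairs(packet: Packet) -> List[Tuple[float, float]]:
    """Latitude/longitude pair filter: adjacent values from the same packet."""
    values = [float(v) for v in COORD_RE.findall(packet.payload)]
    pairs = []
    i = 0
    while i + 1 < len(values):
        lat, lon = values[i], values[i + 1]
        if -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0:
            pairs.append((lat, lon))
            i += 2
        else:
            i += 1
    return pairs


def leaked_locations(packets: List[Packet],
                     fence: Tuple[float, float, float, float] = GEOFENCE
                     ) -> List[Tuple[float, float, float]]:
    """Apply the outgoing traffic, pair, and geo-fencing filters in turn."""
    min_lat, max_lat, min_lon, max_lon = fence
    leaks = []
    for p in packets:
        if not p.outgoing:  # outgoing traffic filter
            continue
        for lat, lon in extract_pairs(p):
            if min_lat <= lat <= max_lat and min_lon <= lon <= max_lon:
                leaks.append((p.timestamp, lat, lon))
    return leaks
```

Each returned tuple carries the packet capture time, so the output can be passed directly to a later validation or clustering step.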

USENIX Association 22nd International Symposium on Research in Attacks, Intrusions and Defenses 249

5.2 Analysis and results

In order to evaluate the amount of location data leakage, we have to determine the accuracy and correctness of the geographic coordinates detected within the network traffic. In our experiment we could compare the geographical coordinates detected within the network traffic to the location data that was sampled by the agent application (installed on the participants' mobile phones). By analyzing the data collected by the agent application, we observed that the location was sampled only 70% of the time (see Figure 4 for the distribution of the location sampling rate). Possible reasons for this are that the device was off, the agent was shut down, or the location service was disabled. Following the above observation, we define the active time of a given user as the number of hours at which the agent observed at least one location sample.

Validating the correctness of leaked samples. Using the location samples observed by the agent, we validated the geographic coordinates that were detected within the network traffic. Specifically, a location that was observed in network traffic was classified as a 'true' location only if (1) the timestamp of the detected coordinate was within a predefined time threshold to a location sampled by the agent application, and (2) the measured distance between the two coordinates was below a predefined distance threshold. If just the timestamp of the detected coordinate was close enough (within the time threshold) to a location sampled by the agent application, we labeled the detected coordinate as 'false'; otherwise, it was labeled as 'unknown.'

We tested the labeling of the leaked location data when setting the distance threshold to 250, 500, and 1000 meters and the time threshold to 10 and 30 minutes (see Table 2). As can be expected, when increasing the distance and/or time threshold more leaked location coordinates are labeled as 'true.' On the other hand, doing so may result in incorrect labeling. Therefore, we opted to use the most strict labeling rules in which the distance threshold was set to 250 meters and the time threshold to 10 minutes.

Volume of leaked location data. A total of approximately 474K geolocations (complying with the standard API location regex) were identified within the network traffic of all of the monitored mobile devices. After applying the geo-fencing filter, approximately 347K geolocations remained.

                          Distance threshold (meters)
Time threshold (minutes)  250               500               1000
10                        89K/203K (0.44)   121K/203K (0.60)  157K/203K (0.77)
30                        129K/210K (0.61)  129K/210K (0.61)  165K/210K (0.79)

Table 2: The results of labeling leaked location data when setting the distance threshold to 250, 500, and 1000 meters and the time threshold to 10 and 30 minutes.

[Figure 7: bar chart of the sum of waypoints labeled FALSE, TRUE, and Unknown; panels: Latitude/longitude pairs, Outgoing traffic]
Figure 7: The classification of geolocations detected within the network traffic after applying the latitude/longitude pair filter and after applying both the latitude/longitude pair and outgoing traffic filters.

Figure 7 presents the classification of geolocations detected within the network traffic after applying the latitude/longitude pair filter (left column) and after applying both the latitude/longitude pair filter and the outgoing traffic filter (right column). Each column presents the distribution of labels ('true,' 'false,' and 'unknown') of the remaining geolocations according to the labeling process described above. In total, after applying the latitude/longitude pair filter, 257K geolocations remained; 36% of them were labeled as 'true,' 47% as 'false,' and the rest could not be labeled. After also applying the outgoing traffic filter, 100K geolocations remained; 58% of them were labeled as 'true,' 11% as 'false,' and the rest could not be labeled.

These results support our hypothesis that incoming traffic is unlikely to contain relevant geolocations of the mobile device. We can also see that after applying all three filters, 85% of the geolocations that could be labeled (either as 'true' or 'false') indicated the true location of the mobile device. We can assume that the same rate also exists for the geolocations that could not be labeled (i.e., 'unknown').

Rate of data leakage. By analyzing the validated geolocations (i.e., labeled as 'true') of the 71 users, we could see that the mobile devices of about 90% of them were leaking location traces. The rate of data leakage of a given user (device) is calculated by dividing the user's active time by the number of validated leaked locations. We partitioned the calculated leakage rate into the following groups: 'high,' 'medium,' 'low' and 'no leakage' as presented in Table 3. As can be seen in Table 3, 55% of the devices were leaking location data at a medium rate (once every one to six hours) or high (once every one hour) rate.

5.2.1 Leakage coverage

We define an exposed hour as an hour within the collected data (network or agent) in which at least two valid location leaks were detected.

In order to analyze the coverage over time of relevant (validated) leaked location data, we define the coverage rate measure as the total number of exposed hours of network traffic divided by the total number of hours of agent data (i.e., excluding hours in which the agent was not active):

\[ \mathrm{CoverageRate} = \frac{\#\,\mathrm{of\ ExposedTrafficHours}}{\#\,\mathrm{of\ AgentDataHours}} \]

We assume that a high coverage rate will result in high exposure and discovery rates of users' important places. Figure 8 presents the distribution of the coverage rates of the mobile devices in the collected dataset. As can be seen, for almost 70% of the users the coverage rate is below 0.2.

[Figure 8: plot of percentage against coverage rate]
Figure 8: The distribution of the coverage rate measure. The coverage rate measure is defined as the total number of exposed hours of network traffic (i.e., hours in which location leakage was observed) divided by the total number of hours of agent application data.
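The labeling rule from Section 5.2 and the coverage rate measure can be illustrated with a short sketch. It assumes samples are `(timestamp_sec, lat, lon)` tuples, uses the haversine formula for distance (the paper does not state which distance function was used), treats a leak as 'true' if any agent sample inside the time window is close enough, and buckets hours by integer division of the timestamp; all function names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Tuple

Sample = Tuple[float, float, float]  # (timestamp_sec, lat, lon)


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters (assumed distance function)."""
    r = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def label_leak(leak: Sample, agent: List[Sample],
               max_dist_m: float = 250.0, max_dt_s: float = 600.0) -> str:
    """Return 'true', 'false', or 'unknown' for one leaked coordinate."""
    t, lat, lon = leak
    # Agent samples whose timestamp is within the time threshold of the leak.
    near = [s for s in agent if abs(s[0] - t) <= max_dt_s]
    if not near:
        return "unknown"
    if any(haversine_m(lat, lon, s[1], s[2]) < max_dist_m for s in near):
        return "true"
    return "false"


def coverage_rate(valid_leaks: List[Sample], agent: List[Sample]) -> float:
    """Exposed traffic hours divided by hours with agent data."""
    per_hour = Counter(int(t // 3600) for t, _, _ in valid_leaks)
    exposed_hours = {h for h, n in per_hour.items() if n >= 2}
    agent_hours = {int(t // 3600) for t, _, _ in agent}
    return len(exposed_hours) / len(agent_hours) if agent_hours else 0.0
```

The defaults correspond to the strictest setting reported in the text (250 meters, 10 minutes); passing 500 or 1000 meters and 1800 seconds reproduces the other configurations of Table 2 in structure, though not its numbers, which depend on the dataset.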

5.2.2 Leakage inconsistency

While Table 3 and Figure 8 present the overall average leakage rate of location data, our manual exploration of the data showed that the leaked data exhibits inconsistent, non-uniform, and bursty behavior. As an example, Figure 9 depicts the number of location samples per hour of a single user observed by the agent application (blue line) and within the network traffic (black line). It can be seen that while the agent's sample rate is relatively stable (around 12 samples per hour) excluding minor changes (such as phone shutdown or agent crash), the leaked location data within the network traffic is unstable, ranging from only a few or no leaks to a high rate of leakage.

[Figure 9: plot of location samples against timeline (hours); source: Agent, Traffic]
Figure 9: An example of a single user location leakage rate observed by the agent installed on the device (blue line) and within the network traffic (black line).

Thus, in order to analyze and understand the inconsistency in the amount of leaked location data, we computed the relative standard deviation measure for each mobile device by dividing the standard deviation of 'leaks per hour' by the average number of 'leaks per hour.' Figure 10 depicts the distribution of the values of the relative standard deviation measure of the mobile devices; a value of zero (0) indicates a constant leakage rate.

We were also interested in understanding the distribution of location leakage data during the day. As can be seen in Figure 11, the leakage rate during different hours of the day is correlated with the participants' normal activity during the day (e.g., sleeping at night, attending classes, and taking a lunch break).

Group        Leakage rate   Number of devices   Percentage
High         under 1hr      20                  28%
Medium       1-6hrs         19                  27%
Low          6+ hrs         23                  32%
No leakage   ∞              9                   13%

Table 3: The leakage rate of different mobile devices. As can be seen, 55% of the devices were leaking location data at a medium rate (once every one to six hours) or high (once every one hour) rate.

6 Inferring POIs from Leaked Location Traces

We are also interested in understanding how an attacker can infer meaningful insights from the geolocations (coordinates) that were detected as leaks within the mobile device's network traffic. Specifically, we are interested in identifying a user's POIs and differentiating them from transit or noise data [40]. The most common approach for identifying stay points (or POIs) is to apply clustering algorithms that are not bound to a predetermined number of clusters (unlike, e.g., k-means) and that cluster stay points by spatial or spatio-temporal parameters. In this research we opted to use three different algorithms: incremental clustering [6], DBSCAN [7], and ST-DBSCAN [31]. These algorithms usually make some assumptions about the data. Specifically, it is assumed that the data arrives at a constant rate, which is not correct in our case. Therefore, we made several modifications to the algorithms.

First, for the incremental algorithm, we added the notion of time by calculating the time between samples and defined a bound on that time interval. The pseudo-code of the modified incremental algorithm is presented in Algorithm 1 (see Cluster procedure). As can be seen, the procedure receives three inputs: the distance threshold (denoted by D), the time threshold (denoted by T), and a list of location samples sorted by their timestamps (denoted by WP). The distance threshold specifies the maximal distance (in meters) between the center

of a cluster to a given location sample. If this distance is less than the distance threshold, the location sample is added to the cluster (lines 13-17); otherwise it is considered an instance of a different cluster, or a transition state. The time threshold specifies the minimal time for a list of way points to be considered a cluster. If this time is greater than the time threshold, the list of way points is considered a cluster (lines 18-21); otherwise, it is not considered a cluster.

Algorithm 1 Incremental clustering with backtracking and time constraint
 1: Inputs:
 2:   D ← Distance threshold
 3:   T ← Time threshold
 4:   WP ← List of location samples sorted by their timestamps
 5: procedure CLUSTER(WP, D, T)
 6:   cluster ← ∅
 7:   clusterList ← ∅
 8:   for wp ∈ WP do
 9:     if cluster = ∅ then
10:       cluster.add(wp)
11:       cluster.startTime ← wp.timestamp
12:     else
13:       if distance(cluster.center, wp.location) < D then
14:         cluster.add(wp)
15:         cluster.updateCenter()
16:         cluster.endTime ← wp.timestamp
17:       else
18:         if (cluster.endTime − cluster.startTime) > T then
19:           clusterList.add(cluster)
20:           cluster ← ∅
21:           cluster.add(wp)
22:         end if
23:       end if
24:     end if
25:   end for
26:   return(clusterList)
27: end procedure
28: procedure BACKTRACK(ClusterList, D)
29:   for cluster1, cluster2 ∈ ClusterList do
30:     if distance(cluster1.center, cluster2.center) < D then
31:       cluster1.merge(cluster2)
32:       clusterList.remove(cluster2)
33:     end if
34:   end for
35:   return(clusterList)
36: end procedure

Second, for both the incremental and ST-DBSCAN algorithms we applied a backtracking procedure, which iterates over the created clusters and merges clusters that are within the distance threshold (lines 28-36). The main benefit of the backtracking procedure is to increase the confidence for repeated clusters over time.

Another approach uses semantic data to determine when a user is at an important place; for example, a location trace at a famous landmark will identify it as an important place for that user [33]. This approach is not relevant in our case, because the POIs are not known in advance; however, we use semantic data from reverse geocoding to eliminate transit geolocations (e.g., highways).

One of the main challenges when applying clustering algorithms to fully unsupervised data is selecting the parameters correctly (namely, the distance and time thresholds), since their values directly affect the number and size of clusters detected by the algorithm. In order to address this challenge we tested several time threshold values (15, 30, and 60 minutes) and distance threshold values (100, 250, and 500 meters).

[Figure 10: plot of percentage against location leak (RSTD)]
Figure 10: The leakage rate variability as indicated by the leak relative standard deviation measure.

[Figure 11: plot of location leakage against time of day]
Figure 11: Location leakage rate (in percentage) within different time (hour) of day.

Note that we did not have any information about the users' real (confirmed) POIs in order to understand the nature and validity of the POIs identified in the network traffic's leaked locations. Therefore, as a benchmark (and ground truth) we used the POIs identified by applying the incremental clustering algorithm on the Android agent location data (denoted as Incremental-agent). Because the location traces collected by the mobile agent application indicate the true location of the user with a high degree of accuracy, and POI clustering methods have been shown to be effective in previous work, we found this benchmark sufficient for our purposes.

In Table 4, we present the number of clusters (POIs) detected by the incremental algorithm for different distance and time threshold values. As can be seen, reducing the distance and time thresholds resulted in a large number of clusters, and increasing them resulted in fewer clusters. Nevertheless, it can be seen that the detection rates are not affected by the different parameters. For the rest of the evaluation we set the thresholds for the algorithms at 500 meters and 30 minutes.

The POIs identified by applying the different clustering algorithms on the network traffic's leaked locations (denoted as Incremental-traffic, DBSCAN-traffic, and STDBSCAN-traffic) were compared with the agent-based POIs, namely the Incremental-agent. We calculated the total amount of time spent at each user POI and assigned a weight representing the

POI's significance based on its part of the user's total amount of time spent in all POIs.

                          Distance threshold (meters)
Time threshold (minutes)  100               250               500
15                        371/1402 (0.26)   238/996 (0.24)    184/724 (0.25)
30                        317/1162 (0.27)   218/847 (0.26)    171/627 (0.27)
60                        228/906 (0.25)    183/718 (0.25)    150/532 (0.28)

Table 4: The number of clusters (i.e., POIs) detected by the incremental algorithm when selecting different distance and time thresholds.

POI detection rate. A total of 1,053 POIs (across all users) were identified using the Incremental-agent method. For each traffic-based method we calculated: (1) the total number of POIs identified; (2) the true positive measure (number of POIs that were also detected by the Incremental-agent method); (3) the precision (the true positive value divided by the total number of POIs identified); and (4) the recall (the true positive value divided by the number of POIs identified by the benchmark method, i.e., Incremental-agent).

In addition, previous work on location data analysis showed that previously obtained semantic information (e.g., landmarks, shopping centers, roads, etc.) can be used in order to determine if a location trace is a user's POI or a transit location [33]. Thus, in order to further improve the POI inference process, we used previously obtained semantic information in order to better determine the real POIs. Specifically, we used Google's reverse geo-coding API in order to remove geolocation clusters (i.e., POIs) that are located on highways.

The results are presented in Table 5. As can be seen, the recall, which represents the POI discovery rate, is approximately 20% for all methods; the Incremental-traffic method yields the best results, compared to the other methods, with slightly lower recall but much higher precision. The values within the parentheses represent the results of detected POIs (true positive, precision, and recall) when using this semantic information. As can be seen, using semantic information to eliminate irrelevant location clusters can improve the precision with no effect on the recall. This can be explained by the fact that due to the lower and inconsistent location leakage within the network traffic, the irrelevant location clusters (e.g., highways) are poorly reflected within the network traffic but better captured by the agent.

                Incremental    DBSCAN         STDBSCAN
Total           282 (263)      470 (374)      339 (264)
True positive   205 (193)      213 (201)      148 (141)
Precision       0.73 (0.73)    0.45 (0.54)    0.43 (0.53)
Recall          0.20 (0.2)     0.20 (0.2)     0.14 (0.14)

Table 5: Recall and precision measures of the three methods (Incremental-traffic, DBSCAN-traffic, STDBSCAN-traffic), when considering the Incremental-agent as the ground truth of the users' POIs. The values within the parentheses represent the results of detected POIs when using semantic information.

In a real-life scenario, the adversary will not be able to label the extracted geolocations as 'true,' 'false,' or 'unknown' (as described in Section 5). Therefore, in Table 5 we applied the clustering algorithm on all of the geolocations. Nevertheless, in order to evaluate the best results that an adversary could achieve, we applied the POI identification process only on the users' 'true' geolocations. In this case, the Incremental-traffic clustering method achieved a similar recall; however, the precision improved dramatically to 95%.

The importance of the 25% identified POIs. The number of identified POIs alone does not necessarily provide a good estimation of the exposure rate of users' whereabouts. For example, let's assume a user with ten different significant locations (POIs). If a user spends 50% of his/her time at home and is at work 35% of the time, by determining the user's home and work locations, we are able to identify the locations at which the user spends 85% of his/her time (although we identified only 20% of the user's POIs).

Thus, in order to estimate the significance of the identified locations (POIs), we computed a weighted measure for the POI detection rate as follows. For each POI detected we computed the relative time spent by the user at that location (i.e., the total time that the user was at the POI divided by the total time the user spent at all POIs). The weighted measure of the POIs was computed from the baseline Incremental-agent method. Then, the POI discovery rate measure was computed by using the weights computed for each POI identified. The results presented in Table 6 show a high weighted POI discovery ratio for the medium and high leakage rates, and a total of 61% weighted POI's exposure rate.

The attack. To conclude, an adversary that can eavesdrop on the user's network traffic can apply the following step-by-step attack in order to infer the user's POIs (and consequently reveal his or her identity). First, the attacker identifies geographic coordinates within the outgoing network traffic (using regex or pre-trained machine learning models). Next, the attacker applies the latitude/longitude pair and geo-fencing filters on the identified coordinates. Finally, the attacker applies our proposed incremental-traffic clustering algorithm on the remaining geolocations (after applying the filters) in order to identify the user's POIs.

In a real-life scenario an attacker would not have a benchmark to relate to, and inferring a user's POI exposure rate would be based on captured data alone. By deriving a regression model (Table 7), we can see that the user weighted POI exposure ratio parameter has a high correlation with the leakage and coverage rate measures and no significant correlation with the relative standard deviation. Therefore, the attacker can compute the leakage rate and coverage measures from the analyzed traffic and estimate the potential exposure rate of the user's POIs.
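The final clustering step of the attack can be sketched in Python along the lines of Algorithm 1. This is an illustrative variant rather than a transcription of the authors' code: it assumes `(timestamp_sec, lat, lon)` samples, haversine distance, and the 500 meter / 30 minute thresholds chosen in the text. It also makes two assumptions that the pseudo-code does not spell out: a candidate cluster that is too short is discarded when the user moves away, and the last open cluster is emitted at the end of the input. The names `Cluster`, `cluster_waypoints`, and `backtrack` are hypothetical.

```python
import math
from dataclasses import dataclass, field
from typing import List, Tuple

Sample = Tuple[float, float, float]  # (timestamp_sec, lat, lon)


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters (assumed distance function)."""
    r = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    a = (math.sin((p2 - p1) / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(math.radians(lon2 - lon1) / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))


@dataclass
class Cluster:
    points: List[Sample] = field(default_factory=list)

    @property
    def center(self) -> Tuple[float, float]:
        n = len(self.points)
        return (sum(p[1] for p in self.points) / n,
                sum(p[2] for p in self.points) / n)

    @property
    def duration(self) -> float:
        return self.points[-1][0] - self.points[0][0]


def cluster_waypoints(wp: List[Sample], d_m: float = 500.0,
                      t_s: float = 1800.0) -> List[Cluster]:
    """Incremental clustering with a minimal stay time (CLUSTER procedure)."""
    clusters: List[Cluster] = []
    current = Cluster()
    for s in wp:
        if current.points and haversine_m(*current.center, s[1], s[2]) >= d_m:
            if current.duration > t_s:
                clusters.append(current)
            current = Cluster()  # assumption: drop stays shorter than t_s
        current.points.append(s)
    if current.points and current.duration > t_s:  # assumption: flush last cluster
        clusters.append(current)
    return clusters


def backtrack(clusters: List[Cluster], d_m: float = 500.0) -> List[Cluster]:
    """Merge clusters whose centers are within d_m (BACKTRACK procedure)."""
    merged: List[Cluster] = []
    for c in clusters:
        for m in merged:
            if haversine_m(*m.center, *c.center) < d_m:
                m.points.extend(c.points)
                break
        else:
            merged.append(c)
    return merged
```

Calling `backtrack(cluster_waypoints(samples))` on one user's filtered leaks yields candidate POIs; repeated visits to the same place collapse into a single cluster, which is the stated purpose of the backtracking step.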

Leakage rate   POI discovery ratio   Weighted POI discovery ratio
High           48%                   81%
Medium         26%                   67%
Low            8%                    37%

Table 6: The POI and weighted POI discovery ratios of the Incremental-traffic method. The POI discovery rate is the number of places found in the network data divided by the number of places found by the Incremental-agent method. The weighted POI discovery rate is the amount of time spent at the identified POIs out of the total user time.

                              Dependent variable:
                              Weighted POI's exposure rate
Coverage                      0.559*** (0.172)
Leak rate                     −0.0000013** (0.00000)
Relative standard deviation   −0.048 (0.078)
Constant                      0.597*** (0.106)
R2                            0.315
Adjusted R2                   0.283
Residual Std. Error           0.331 (df = 65)
F Statistic                   9.955*** (df = 3; 65)
Note: *p<0.1; **p<0.05; ***p<0.01

Table 7: The linear regression model for estimating POIs' exposure rate from network traffic measures.

7 Identifying leaking applications

The goal of the analysis presented in this section is to determine which applications are responsible for leaking the location data and whether the leakage occurs as a result of intentional misuse or a benign application sending location data in plaintext.

One approach for obtaining such information is real-time on-device monitoring of the installed applications' outgoing traffic [17, 41]. However, starting from Android version 8.0, such operations require super-user privileges (which necessitate rooting the user's device). In our experiment, we analyzed the network traffic of the user's personal device; therefore, such an approach was not an option.

Deriving information about the leaking applications from the network traffic captured is also challenging for three main reasons: (1) the network traffic captured does not provide an explicit indication of the sending application/service, (2) because of the growing use of cloud services and content delivery networks, many destination IPs are hosted by services such as AWS, Akamai or Google, and (3) location data can be sent to advertisement and intelligence service domains by components embedded in many Android applications.

Given this, we opted to analyze the destination host names observed within the HTTP traffic. We focused specifically on outgoing traffic containing location leaks. By extracting the host names from the HTTP requests that contained the leaked location data, we identified 112 different services. Then, we analyzed the host names using a public security service (such as VirusTotal), search engines (Google and Whois), and security reports. Based on the results of this analysis, we were able to classify each host name by its reported usage (e.g., weather forecast, navigation, location analytics) and whether it appears to be a legitimate or unwanted/suspicious service. Services with clear/reasonable location usage and no reported security issues were classified as 'benign'; the rest of the services were classified as 'suspicious.'

Figure 12 presents the top 12 host names classified by their category (color) and level of suspiciousness (size of circle). Each host name is placed on the graph according to the average number of detected leakage events (x-axis) and the number of participants sending location data to that host name (y-axis).

Some of the suspicious domain names include samsungbuiasr.vlingo.com, which was previously published as belonging to a Samsung pre-installed speech recognition application called Vlingo. This application was found to be leaking sensitive information. Other examples include the domains n129.epom.com and mediation.adnxs.com, which are reported to provide personalized advertisements. An additional significant suspicious domain is app.woorlds.com, which was reported to be a location analytics service.

Interestingly, our analysis concludes that the Google Maps JavaScript API (maps.googleapis.com), which lets Android application developers customize maps with user locations, is also responsible for sending location data in plaintext. This is particularly noteworthy since Google recommends that application developers use the secured Maps JavaScript API (which operates over HTTPS) whenever possible.5 Nonetheless, our analysis shows that in practice, developers also use the unsecured Maps JavaScript API. Overall, based on the analyzed data, we found that the set of unwanted services is responsible for more than 60% of location leakage events.

5 https://developers.google.com/maps/documentation/javascript/tutorial

Another observation is that although the number of location data leakage events (x-axis in Figure 12) for each individual host name is not high, based on the analysis presented in Section 6 we were still able to identify the users' significant POIs from the data. We attribute that finding to the fact that there are multiple applications installed on each individual smartphone, which together leak a sufficient amount of information that can be analyzed in order to infer the POIs.

Although the identity of the leaking application is not explicitly indicated in the network traffic, we performed further analysis in an attempt to link the identified host names with the applications installed on the users' mobile devices. In order to do so, we first extracted (using our Android agent) the set of all applications installed on the mobile devices of the users which require both location and network permissions. Next, we computed a modified tf-idf measure for each pair

consisting of an application and host name. A well-known measure in the field of text categorization, tf-idf is often used as a weighting factor in information retrieval and text mining [42]. The tf-idf is a numerical statistic intended to reflect how important a term (i.e., word) is to a document in a collection or corpus. The tf-idf value increases proportionally to the number of times a term appears in the document, but it is offset by the frequency of the term in the corpus, which helps to adjust for the fact that some terms appear more frequently in general. In our case, a document is a host name, and a term is an application.

[Figure 12: scatter plot; x-axis: average leakage volume (packets); y-axis: leakage volume (users); color: category (Advertisement, Analytics, Speech Recognition, Other, Location Service, Weather); circle size: label (Suspicious, Benign); host names shown: maps.googleapis.com, mediation.adnxs.com, n129.epom.com, app.woorlds.com, samsungbuiasr.vlingo.com, venues.waze.co.il, lgemobilewidget.accu-weather.com, upload.wikimedia.org, easy.co.il, location.gtforge.com, htc2.accu-weather.com, .openweathermap.org]
Figure 12: Presenting the top 12 host names classified by their category (color) and level of suspiciousness (size of circle). The x-axis represents the average number of detected leakage events, and the y-axis represents the number of participants sending location data to that host name.

The term frequency (denoted by TF) of an application (denoted by a) with respect to a given host name (denoted by h) is calculated as follows:

\[ TF_h(a) = \frac{|U_h^a|}{|U_h|} \]

where $U_h$ represents the set of users for which we identify a location leakage (from their devices to h), and $U_h^a$ represents the subset of users from $U_h$ that have application a installed on their devices.

The inverse document frequency (denoted by IDF) of an application is calculated as follows:

\[ IDF(a) = -\log_{10} \frac{|U_a|}{|U|} \]

where $U_a$ represents the set of users for which application a was installed on the devices; and $U$ represents the set of all users.

Given the above, the tf-idf of an application with respect to a given host name is calculated as follows:

\[ TFIDF_h(a) = TF_h(a) \cdot \max\{1, IDF(a)\} \]

The reason for limiting the inverse document frequency (IDF) value to one is to prevent rare applications from achieving a very high tf-idf score and consequently being erroneously linked with the host name. In addition, we applied min-max normalization to the tf-idf scores of applications in order to keep them within the range of zero to one. A high tf value for an application a with respect to a host h indicates that a was frequently observed in devices that transmit location data in plaintext to h. On the other hand, a high idf value for a indicates that a was not observed frequently on the devices in general. Thus, a high tf-idf score for application a with respect to host h may indicate that a is related to the location leakage to h.

The raw results of this analysis are presented in Figure 13, where the tf-idf values are shown for each host name (x-axis) and application (y-axis). Based on these results, we classified the applications into two categories. The first category includes applications that send the location data to their own hosting service. In this category we can find multiple HTC, LG, and Samsung pre-installed applications found to be related to their own hosting services (htc2.accu-weather.com, lgemobilewidget.accu-weather.com, and samsungbuiasr.vlingo.com), as well as the GetTaxi (com.gettaxi.android) and Easy (easy.co.il.easy3) applications which were found to be related to their hosting services (location.gtforge.com and easy.co.il respectively).

The second category includes applications that send (via integrated "software plug-ins") location data to third party services such as advertisement (n129.epom.com) or analytical services (app.woorlds.com). In this category we identified a popular student application named com.mobixon.istudent. We also identified applications that send location data to Google Maps services, some of which potentially use the Google Maps JavaScript API in an unsecured manner (the HTTP protocol instead of the HTTPS protocol).

8 Mitigation strategies

In order to reduce the privacy risk associated with location leakage and POI inference presented in our paper, the following countermeasures are suggested and could be further investigated and developed in future work.

Awareness. The most basic approach is increasing the awareness of mobile device users to such risks and providing them with the tools and means to reduce the risk [43]. For example, reducing the risk can be achieved by installing only trusted applications (from trusted sources), monitoring and limiting sensitive permissions such as location and Internet access, and disabling location service on the device while not in use and turning it on only on demand. Tools such as Recon [8] and LP-Doctor [10] can also be used to increase

the awareness of users to the privacy risks by presenting alerts and providing the ability to override sensitive transmitted information.

Figure 13: The tf-idf values for each host name and app.
[Heat map; x-axis: host name, y-axis: application; color scale from 0.0 to 1.0.]

OS policy and tools. The Android OS provides built-in standard security solutions, with features such as isolation, encryption, memory management, and user-granted permissions for phone resources and sensors. Once permission has been granted by the user, PII handling is governed solely by best-practice recommendations [44]. This may include enforcing encryption on applications that send sensitive information such as location over the Internet, or applying a tool such as PrivacyGuard [19], which is a VPN-based solution that can detect data leakage over the network traffic, modify the leaked information, and replace it with crafted data for privacy protection. Other tools such as MockDroid [45] and LP-Guardian [46] can be used to block an application from accessing the location sensor at runtime. In addition, future versions of the Android OS could consider improving privacy by enabling user decisions regarding background operations involving sensitive data.

Monitoring. Monitoring network traffic by third-party security providers in order to detect PII leaks is a useful approach that was proposed by [8] for identifying applications that misuse users' private information. The proposed application, Recon [8], also allows the user to replace the transmitted data with another value selected by the user.

Anonymization and PII obfuscation. Various techniques and algorithms were presented [47–53] to ensure k-anonymity in LBSs. Most of the methods rely on a proxy that filters, manipulates, or generalizes the user's location data before sending it to the LBS. Such an approach is difficult to apply when the LBS requires accurate or frequent location samples in order to provide the service. In addition, these techniques do not consider the unique threat model of an adversary that has access to the location data of multiple LBSs. Puttaswamy et al. [54] suggested that LBSs should move the application functionality to the client devices in order to preserve their privacy. This is, however, impractical because in most cases (particularly for free applications or third-party SDKs) collecting personal data (e.g., location) is the LBS's main business model.

An alternative approach can be in the form of an application installed on the mobile device that monitors the location sampling or location leakage over the network traffic and intelligently injects spoofed locations that can make it difficult for a threat actor to infer the true POIs of the user.

Due to the inherent trade-off between providing the required location data to location-based applications (or services) on the one hand, and protecting user privacy on the other hand, we believe that a privacy preserving solution should be: (1) implemented at the OS level and thus have visibility to all running applications and be able to apply access control policies as well as obfuscation techniques; (2) automated with minimal user involvement and decision making; (3) as proposed in our study, it should analyze the privacy exposure level not only by a single application but also by looking at the network traffic as a whole in order to mitigate the risk of a network eavesdropper adversary; and (4) it should apply intelligent machine learning techniques to automatically identify location data (and other PII) leaks in order to understand the level of granularity of location data required by each application, and thereby intelligently obfuscate the POIs that can be inferred from the transmitted data. Since the experimental setup of this research was complex and resource and time consuming, we opted to leave the design, development and evaluation of the defense approaches, especially applying PII obfuscation as an adversarial learning approach on the clustering algorithms [55], for future research.

9 Conclusions and Future Work

In this paper, we showed the extent of location leakage from mobile devices and presented a systematic process of extracting location traces from raw network traffic. We analyzed the results of three geo-clustering methods applied over inconsistent network traffic data and showed that existing algorithms yield good results, even with inconsistent location data; this makes any user's location privacy vulnerable, even with a low rate of location leakage. Our work enables location exposure assessment by monitoring network traffic and reveals that even relatively low location leakage rates (once in six hours) and coverage of only 20% can expose approximately 70% of the weighted POIs. In future work we plan to further automate the POI identification process by automatically setting up the clustering algorithm parameters, evaluate the extent to which the leaked data can be used to predict the users' future location, and develop a mitigation approach based on adversarial learning techniques, which will be based on injecting a minimal number of false locations in clear text to the network traffic, in order to deceive the clustering algorithms (i.e., PII obfuscation).

References

[1] L. Nguyen, Y. Tian, S. Cho, W. Kwak, S. Parab, Y. Kim, P. Tague, and J. Zhang, “Unlocin: Unauthorized location inference on smartphones without being caught,” in 2013 International Conference on Privacy and Security in Mobile Systems (PRISMS). IEEE, 2013, pp. 1–8.

[2] I. Hazan and A. Shabtai, “Dynamic radius and confidence prediction in grid-based location prediction algorithms,” Pervasive and Mobile Computing, vol. 42, pp. 265–284, 2017.

[3] Y. Michalevsky, A. Schulman, G. A. Veerapandian, D. Boneh, and G. Nakibly, “Powerspy: Location tracking using mobile device power analysis,” in 24th USENIX Security Symposium (USENIX Security 15), 2015, pp. 785–800.

[4] L. Arigela, P. Veerendra, S. Anvesh, and K. Satya, “Mobile phone tracking & positioning techniques,” International Journal of Innovative Research in Science, Engineering and Technology, vol. 2, no. 4, 2013.

[5] S. Isaacman, R. Becker, R. Cáceres, S. Kobourov, M. Martonosi, J. Rowland, and A. Varshavsky, “Identifying important places in people’s lives from cellular network data,” in International Conference on Pervasive Computing. Springer, 2011, pp. 133–151.

[6] J. H. Kang, W. Welbourne, B. Stewart, and G. Borriello, “Extracting places from traces of locations,” ACM SIGMOBILE Mobile Computing and Communications Review, vol. 9, no. 3, pp. 58–68, 2005.

[7] M. Umair, W. S. Kim, B. C. Choi, and S. Y. Jung, “Discovering personal places from location traces,” in 16th International Conference on Advanced Communication Technology. IEEE, 2014, pp. 709–713.

[8] J. Ren, A. Rao, M. Lindorfer, A. Legout, and D. Choffnes, “Recon: Revealing and controlling pii leaks in mobile network traffic,” in Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services. ACM, 2016, pp. 361–374.

[9] J. Freudiger, R. Shokri, and J.-P. Hubaux, “Evaluating the privacy risk of location-based services,” in International conference on financial cryptography and data security. Springer, 2011, pp. 31–46.

[10] K. Fawaz, H. Feng, and K. G. Shin, “Anatomization and protection of mobile apps’ location privacy threats,” in 24th USENIX Security Symposium (USENIX Security 15), 2015, pp. 753–768.

[11] C. Leung, J. Ren, D. Choffnes, and C. Wilson, “Should you use the app for that?: Comparing the privacy implications of app-and web-based online services,” in Proceedings of the 2016 Internet Measurement Conference. ACM, 2016, pp. 365–372.

[12] E. P. Papadopoulos, M. Diamantaris, P. Papadopoulos, T. Petsas, S. Ioannidis, and E. P. Markatos, “The long-standing privacy debate: Mobile websites vs mobile apps,” in Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2017, pp. 153–162.

[13] P. Regulation, “General data protection regulation,” Official Journal of the European Union, vol. 59, pp. 1–88, 2016.

[14] S. Gambs, M.-O. Killijian, and M. N. del Prado Cortez, “De-anonymization attack on geolocated data,” Journal of Computer and System Sciences, vol. 80, no. 8, pp. 1597–1614, 2014.

[15] V. F. Taylor and I. Martinovic, “A longitudinal study of app permission usage across the google play store,” CoRR, abs/1606.01708, 2016.

[16] A. Shuba, A. Le, M. Gjoka, J. Varmarken, S. Langhoff, and A. Markopoulou, “Antmonitor: Network traffic monitoring and real-time prevention of privacy leaks in mobile devices,” in Proceedings of the 2015 Workshop on Wireless of the Students, by the Students, & for the Students. ACM, 2015, pp. 25–27.

[17] A. Shuba, A. Le, E. Alimpertis, M. Gjoka, and A. Markopoulou, “Antmonitor: A system for on-device mobile network monitoring and its applications,” arXiv preprint arXiv:1611.04268, 2016.

[18] A. Razaghpanah, R. Nithyanand, N. Vallina-Rodriguez, S. Sundaresan, M. Allman, C. Kreibich, and P. Gill, “Apps, trackers, privacy, and regulators: A global study of the mobile tracking ecosystem,” 2018.

[19] Y. Song and U. Hengartner, “Privacyguard: A vpn-based platform to detect information leakage on android devices,” in Proceedings of the 5th Annual ACM CCS Workshop on Security and Privacy in Smartphones and Mobile Devices. ACM, 2015, pp. 15–26.

[20] J. Ren, M. Lindorfer, D. J. Dubois, A. Rao, D. Choffnes, and N. Vallina-Rodriguez, “Bug fixes, improvements,... and privacy leaks,” 2018.

[21] K. Drakonakis, P. Ilia, S. Ioannidis, and J. Polakis, “Please forget where i was last summer: The privacy risks of public location (meta) data,” arXiv preprint arXiv:1901.00897, 2019.

[22] R. Li, S. Wang, H. Deng, R. Wang, and K. C.-C. Chang, “Towards social user profiling: unified and discriminative influence model for inferring home locations,” in Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2012, pp. 1023–1031.

[23] J. Lin and R. G. Cromley, “Inferring the home locations of twitter users based on the spatiotemporal clustering of twitter data,” Transactions in GIS, vol. 22, no. 1, pp. 82–97, 2018.

[24] T.-r. Hu, J.-b. Luo, H. Kautz, and A. Sadilek, “Home location inference from sparse and noisy data: models and applications,” Frontiers of Information Technology & Electronic Engineering, vol. 17, no. 5, pp. 389–402, 2016.

[25] Z. Cheng, J. Caverlee, K. Lee, and D. Z. Sui, “Exploring millions of footprints in location sharing services,” in Fifth International AAAI Conference on Weblogs and Social Media, 2011.

[26] F. Luo, G. Cao, K. Mulligan, and X. Li, “Explore spatiotemporal and demographic characteristics of human mobility via twitter: A case study of chicago,” Applied Geography, vol. 70, pp. 11–25, 2016.

[27] E. Cho, S. A. Myers, and J. Leskovec, “Friendship and mobility: user movement in location-based social networks,” in Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2011, pp. 1082–1090.

[28] I. Polakis, G. Argyros, T. Petsios, S. Sivakorn, and A. D. Keromytis, “Where’s wally?: Precise user discovery attacks in location proximity services,” in Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security. ACM, 2015, pp. 817–828.

[29] V. Taylor, J. R. Nurse, and D. Hodges, “Android apps and privacy risks: what attackers can learn by sniffing mobile device traffic,” 2014.

[30] M. Ester, H.-P. Kriegel, J. Sander, X. Xu et al., “A density-based algorithm for discovering clusters in large spatial databases with noise.” in Kdd, vol. 96, no. 34, 1996, pp. 226–231.

[31] D. Birant and A. Kut, “St-dbscan: An algorithm for clustering spatial–temporal data,” Data & Knowledge Engineering, vol. 60, no. 1, pp. 208–221, 2007.

[32] R. Montoliu and D. Gatica-Perez, “Discovering human places of interest from multimodal mobile phone data,” in Proceedings of the 9th International Conference on Mobile and Ubiquitous Multimedia. ACM, 2010, p. 12.

[33] L. O. Alvares, V. Bogorny, B. Kuijpers, J. A. F. de Macedo, B. Moelans, and A. Vaisman, “A model for enriching trajectories with semantic geographical information,” in Proceedings of the 15th annual ACM international symposium on Advances in geographic information systems. ACM, 2007, p. 22.

[34] M. T. Khan, J. DeBlasio, G. M. Voelker, A. C. Snoeren, C. Kanich, and N. Vallina-Rodriguez, “An empirical analysis of the commercial vpn ecosystem,” in Proceedings of the Internet Measurement Conference 2018. ACM, 2018, pp. 443–456.

[35] A. Mani, T. Wilson-Brown, R. Jansen, A. Johnson, and M. Sherr, “Understanding tor usage with privacy-preserving measurement,” in Proceedings of the Internet Measurement Conference 2018. ACM, 2018, pp. 175–187.

[36] Y.-A. De Montjoye, C. A. Hidalgo, M. Verleysen, and V. D. Blondel, “Unique in the crowd: The privacy bounds of human mobility,” Scientific reports, vol. 3, p. 1376, 2013.

[37] P. Golle and K. Partridge, “On the anonymity of home/work location pairs,” in International Conference on Pervasive Computing. Springer, 2009, pp. 390–397.

[38] “Standard representation of geographic point location by coordinates,” International Organization for Standardization, Geneva, CH, Standard, Jul. 2008.

[39] “Location-Android developer documentation,” https://developer.android.com/reference/android/location/Location.html/, 2018, [Online; accessed 12-April-2018].

[40] D. Ashbrook and T. Starner, “Using gps to learn significant locations and predict movement across multiple users,” Personal and Ubiquitous computing, vol. 7, no. 5, pp. 275–286, 2003.

[41] A. Razaghpanah, N. Vallina-Rodriguez, S. Sundaresan, C. Kreibich, P. Gill, M. Allman, and V. Paxson, “Haystack: A multi-purpose mobile vantage point in user space,” arXiv preprint arXiv:1510.01419, 2015.

[42] F. Sebastiani, “Machine learning in automated text categorization,” ACM computing surveys (CSUR), vol. 34, no. 1, pp. 1–47, 2002.

[43] L. Kraus, T. Fiebig, V. Miruchna, S. Möller, and A. Shabtai, “Analyzing end-users’ knowledge and feelings surrounding smartphone security and privacy,” S&P. IEEE, 2015.

[44] “Security Tips-Android developer documentation,” https://developer.android.com/training/articles/security-tips.html/, 2018, [Online; accessed 01-April-2018].

[45] A. R. Beresford, A. Rice, N. Skehin, and R. Sohan, “Mockdroid: trading privacy for application functionality on smartphones,” in Proceedings of the 12th workshop on mobile computing systems and applications. ACM, 2011, pp. 49–54.

[46] K. Fawaz and K. G. Shin, “Location privacy protection for smartphone users,” in Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security. ACM, 2014, pp. 239–250.

[47] X. Zhao, L. Li, and G. Xue, “Checking in without worries: Location privacy in location based social networks,” in 2013 Proceedings IEEE INFOCOM. IEEE, 2013, pp. 3003–3011.

[48] S. Mascetti, C. Bettini, D. Freni, X. S. Wang, and S. Jajodia, “Privacy-aware proximity based services,” in 2009 Tenth International Conference on Mobile Data Management: Systems, Services and Middleware. IEEE, 2009, pp. 31–40.

[49] M. Gruteser and D. Grunwald, “Anonymous usage of location-based services through spatial and temporal cloaking,” in Proceedings of the 1st international conference on Mobile systems, applications and services. ACM, 2003, pp. 31–42.

[50] A. Narayanan, N. Thiagarajan, M. Lakhani, M. Hamburg, D. Boneh et al., “Location privacy via private proximity testing.” in NDSS, vol. 11, 2011.

[51] T. Xu and Y. Cai, “Feeling-based location privacy protection for location-based services,” in Proceedings of the 16th ACM conference on Computer and communications security. ACM, 2009, pp. 348–357.

[52] R. Shokri, G. Theodorakopoulos, C. Troncoso, J.-P. Hubaux, and J.-Y. Le Boudec, “Protecting location privacy: optimal strategy against localization attacks,” in Proceedings of the 2012 ACM conference on Computer and communications security. ACM, 2012, pp. 617–627.

[53] M. E. Andrés, N. E. Bordenabe, K. Chatzikokolakis, and C. Palamidessi, “Geo-indistinguishability: Differential privacy for location-based systems,” arXiv preprint arXiv:1212.1984, 2012.

[54] K. P. Puttaswamy and B. Y. Zhao, “Preserving privacy in location-based mobile social applications,” in Proceedings of the Eleventh Workshop on Mobile Computing Systems & Applications. ACM, 2010, pp. 1–6.

[55] B. Biggio, K. Rieck, D. Ariu, C. Wressnegger, I. Corona, G. Giacinto, and F. Roli, “Poisoning behavioral malware clustering,” in Proceedings of the 2014 workshop on artificial intelligent and security workshop. ACM, 2014, pp. 27–36.

[56] J. Krumm, “Inference attacks on location tracks,” in International Conference on Pervasive Computing. Springer, 2007, pp. 127–143.

| Ref. | Platform | Threat actor | Permissioned access? | Data | Goal | Evaluation environment | Evaluation method | Comments |
|---|---|---|---|---|---|---|---|---|
| [9] | cars (24), Nokia phones (40) | location-based service (LBS) | yes - data sent by the device to the service | continuous active sampling of GPS traces | user de-anonymization and POI inference | data collected from cars and phones provided to users for data collection | simulating location streams to the LBS by undersampling the location traces | simulating location exposure to LBSs |
| [10] | Android (119), iPhone (34) phones | applications | yes - data sampled by the application | continuous active sampling of GPS traces | user POI inference | data collected from devices provided/owned by users | analyzing location traces sampled by the applications | – |
| [8] | iPhone (63), Android (33), Windows phones | eavesdropper | no | network traffic | identify PIIs leaked by specific applications | network traffic collected using a VPN connection while running applications in sandbox (no human activity); running Recon on rooted user devices | using machine learning to automatically identify PIIs within unencrypted network flows | – |
| [11] | iPhone (1), Android (2) | mobile and Web applications | yes - data sampled by the application | network traffic | identify PIIs leaked by specific applications | network traffic collected using a VPN/Proxy connection while running applications on the devices (including human activity) | using the Recon tool [8] to automatically identify PIIs within unencrypted network flows | controlled experiment; PIIs are known |
| [12] | Android (1) | mobile and Web applications | yes - data sampled by the application | network traffic | identify PIIs leaked by specific applications | network traffic collected using a man-in-the-middle device while running applications on the device (no human activity) | analyze network traffic and search for predefined strings of PIIs | controlled experiment; PIIs are known |
| [16] | Android phones (9) | applications | yes - data sampled by the application | network traffic of applications | block PIIs sent by applications | network traffic captured and inspected (using virtual VPN) on the device | analyze network traffic and search for predefined strings of PIIs | no evaluation or performance analysis |
| [17] | Android phones (12) | applications | yes - data sent by the application | network traffic | proposing an on-device framework for analyzing apps' network traffic for performance monitoring, data leakage detection, and user profiling | network traffic collected using a VPN connection | search for predefined strings representing PIIs | focus on the efficient data analysis framework |
| [29] | Android phones | eavesdropper | no | network traffic | identify PIIs sent by individual applications | controlled experiment - run applications on a dedicated mobile phone and collect the network traffic using a dedicated network setup | identify PIIs within network traffic by searching for predefined strings of PIIs | do not focus on location data |
| [18] | Android phones (11,384) | third-party advertisement and tracking services | yes - without user awareness | network traffic | analyze the extent of PII leaks to third-party advertisement and tracking services | network traffic collected using a virtual VPN connection on the device | analyze network traffic and search for predefined strings of PIIs | do not analyze location leakage |
| [1] | Android phone (1) | applications | yes - app installed by the user | WiFi readings | transparently leak location information by using WiFi state permission | data collected from one device owned by a user | comparing the locations inferred from the WiFi state readings to GPS readings | propose a transparent approach for leaking location data |
| [19] | Android phone and emulator | applications | yes - without user awareness | network traffic | detect location leaks within network traffic and identify user's POIs | network traffic collected using a VPN connection | using predefined filters (rules) to identify PIIs and location data | – |
| [20] | Android phones (5) | applications | yes - data sent by the application | network traffic | detect PIIs and location leaks within network traffic sent by an application; analyze leakage trends with application versions | controlled experiment; run applications on dedicated mobile phones and collect the network traffic by forwarding the traffic to a proxy; mimic user interaction using software | using string matching and machine learning models to label traffic containing leaked PIIs | – |
| [56] | cars (172) | location-based services | yes - data sampled by the service provider | network traffic | user POI inference | network traffic collected from the cars | apply simple heuristics for inferring home location | – |
| [40] | Garmin model 35-LVS wearable (7) | location-based services | yes - data sampled by the service provider | the research is based on frequent and fresh GPS samples | user POI inference and location modeling | data extracted from the devices | apply simple heuristics for inferring home location | cluster GPS data to extract locations that are then incorporated into a Markov model |
| Our paper | Android phones (71) | eavesdropper | no | network traffic | detect relevant location leaks within network traffic and identify user's POIs | network traffic collected using a VPN connection (to a remote server) as well as location data collected simultaneously from the device | measure the extent of location data leakage in the Internet traffic of Android devices and analyze the value of the leaked data (i.e., infer POIs) | – |

Table 8: Summary of related works.
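
As a supplementary illustration of the tf-idf attribution used in the leakage analysis, where TF_h(a) = |U_h^a| / |U_h| and IDF(a) = -log10(|U_a| / |U|), the following minimal Python sketch computes both quantities from sets of user identifiers. It is a sketch under stated assumptions, not the authors' implementation: the function names, package names, and user identifiers are hypothetical, the score is taken as the product TF x IDF following the standard tf-idf definition, and the min-max scaling used to bring scores into the range of zero to one is only one possible normalization.

```python
import math


def term_frequency(leaking_users, app_users):
    """TF_h(a): share of the users leaking to host h that have app a installed."""
    if not leaking_users:
        return 0.0
    return len(leaking_users & app_users) / len(leaking_users)


def inverse_document_frequency(app_users, all_users):
    """IDF(a) = -log10(|U_a| / |U|); requires at least one install of the app."""
    if not app_users:
        raise ValueError("IDF is undefined for an app with no installations")
    return -math.log10(len(app_users) / len(all_users))


def tf_idf(leaking_users, app_users, all_users):
    """Standard tf-idf product (assumed form of the combined score)."""
    return term_frequency(leaking_users, app_users) * inverse_document_frequency(
        app_users, all_users
    )


def min_max_scale(scores):
    """Scale scores into [0, 1]; this particular normalization is an assumption."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {name: 0.0 for name in scores}
    return {name: (value - lo) / (hi - lo) for name, value in scores.items()}


# Hypothetical toy data: 10 users, 4 of whom leak location data to one host.
all_users = {f"u{i}" for i in range(1, 11)}
installs = {
    "com.example.weather": {"u1", "u2"},  # rare app, present on leaking devices
    "com.example.mail": set(all_users),   # ubiquitous app, installed everywhere
}
leaking_to_host = {"u1", "u2", "u3", "u4"}

raw = {app: tf_idf(leaking_to_host, users, all_users) for app, users in installs.items()}
scaled = min_max_scale(raw)
print(raw)
print(scaled)
```

In this toy setting the ubiquitous application receives a score of zero because its IDF vanishes, while the rare application that appears on half of the leaking devices receives the highest score, which mirrors the interpretation of high tf and idf values given in the analysis.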
