arXiv:1808.01718v1 [cs.CR] 6 Aug 2018 e rwig hsifraini sdb eeoest dete to developers by used is information This browsing. web uoai rs eotn ytm(CS a oet h user the to pose may (ACRS) System Reporting Crash Automatic r financial, legal, on focusing conducted, were studies few A i rwes rs eotn ytm.A aesuy emi we study, case a As systems. reporting crash browsers’ vis about information report to applications by used are (ACRS) ftecahcletdfo h sr’ssesadteamount the and systems users’ the from collected crash the of etlifrain(.. Sadbosrvrin,adprob and crash. the version), on browser feedback and user’s system OS user’s (e.g., thread, designe failing information is the mental ACRS of browsers, trace collect web stack to In the ACRS 27]. collect the 25, use 16, that Sa [4, names Firefox, crashes common Chrome, field few bug. a the are rectify Opera and and cause root crash the d experience users’ improve and bugs application address to from reports crash asso collect companies and ACRS, hardware c Using underlying software. factors as such of crash, ACR the variety to users. a tributing the amongst from smooth reports troubleshooting i crash makes ACRS collect of to use approach The common unexpectedly. process execution normal its privacy. brow of implementation current the that risks the cond examine was study to previous no However, 35]. [1 31, 19, priva incident [8, and browsers leakage security the data analyzed specific studies several a Similarly, analyzed studies these 10, [1, of disclosure information of aspects other and tional, INTRODUCTION 1 repor the maintaining while bugs fixing readability. of process the n eas on has be and pact ACRS can of hotfix implementation current proposed the Our into server. integrated the prio report to crash report the the from submit data sensitive removing sec by and ACRS privacy in users’ enhance to hotfix a propose privac we reportin Further, and crash browser security state-of-the-art current important the an in sue on light sheds analysis sensitiv other Our addresse 20,00 and email information, than contact 9,000 more of passwords, amount of enormous 600 presence IDs, token the per and shows the sessions analysis over Our collected reports years. crash six of consisting dataset v leakage a privacy a such study sen we users’ paper, contain this may In they information. errors, re diagnose crash to Although valuable failure. are software a during happening o rors Syst context Reporting the Crash un- in Automatic an studied system. not been reporting is not crash leakage has the data it through however, users issue, of known privacy the to Harm ABSTRACT rsigPiay nAtpyo e rwe’ ekdCrash Leaked Browser’s Web a of Autopsy An Privacy: Crashing oee,afc hthsbe ieydseaddi h conte the is disregarded widely been has that fact a However, rs sa nvtbefc cuswe otaeterminat software a when occurs fact inevitable an as crash A nvriyo lbm tBirmingham at Alabama of University ivs Satvat Kiavash [email protected] systems. g 0.Some 30]. yo web of cy environ- clients data. e er- the eputa- ciated o of iod ucted uring sitive ports Reports urity ,an s, im- o 10]. , sers’ able ems is- y is-a- to d fari, to r on- a s ily ne nt ts’ es of ct s’ S 0 f otebs forkolde hsi h rtcs td which study case first the is this knowledge, our of best the To iie Rsa h letsd ro odseiaigt the to disseminating to prior side client the at URLs visited 1 sdsusdi eto 3.1 Section in discussed as oeatra oteues privacy. users’ which the reports to crash threat in a found pose be can that information private utm nomto,a hycnb eurdfrtedebuggi the for required be can they as information, cr the runtime during URL browsed including data similar relatively aae fprilyaoyie rs eot eesdb o by released a browsers reports study major crash the we privacy, anonymized partially users’ to of pose dataset may ACRS of implementation ainfo h niomna nomto,ues feedbac users’ information, i add environmental sensitive the the to remove from Furthermore, to mation hotfix attackers. a proposed for we target deficiency, this compelling A a for potential be the shows to and approach current the of signify ficiency results obtained The numbers. phone na and including addresses, information, contact of a amount email enormous 9,000 an 20 password, and text approximately clear 600 including, IDs, token data and session private of amount extrac nificant we informatio inspection, our runtime Conducting descriptions. system users’ URLs, visited containing anonymized partially ports, million 2.5 of database a scrutinizes rsn h eut fordt nlss eto describe 5 Section Section analysis. In data our reports. of the methodol results presents the the present and defines data 3 the Section analyze to study. used this in expla used and ACRS dataset about the background the provides 2 Section lows. Organization. Paper • • follow. Contribution. Our h aeo h rwe,ya,adtefilso aae r rem are dataset of fields the and year, browser, the of name The e rwes hog u npcin eetatadrepres and extract we inspection, our Through browsers. web esi h ae evrgt opoie rdt e leaked. get data or compromised gets server cases the in unauthori cess against data private users’ readab the protects reports’ it on the However, maintaining impact while no bugs has fixing of and process in ACRS integrated of easily implementation be current can the hotfix Our URL. users’ and the from field data scription sensitive removing by privacy ACRS, users’ in enhance security to hotfix, a propose we analysis, data 5). (Section Hotfix A Deploying c browser state-of-the-art current systems. securi reporting the important in an issue on privacy light and securit sheds users’ analysis to Our harmful privacy. be and can ACRS th of extent implementation what to rent confidentia delineating and reports, the information in presented personal data of amount significant a ma the of one by released reports crash anonymized partially 4). (Section Analysis Data nti ae,t eosrt osberssta h curren the that risks possible demonstrate to paper, this In nvriyo lbm tBirmingham at Alabama of University 1 h urn tt fteatbosr collect browsers art the of state current The . h ealcnrbtoso u ytmaeas are system our of contributions detail The iehSaxena Nitesh [email protected] h eto hsppri raie sfol- as organized is paper this of rest The ecnuta miia td on study empirical an conduct We codn otersl four of result the to According vdo anonymized or oved insights s e sig- a ted rs re- crash s and ash ddress, e ac- zed h de- the server. ,and k, and n cur- e eof ne ,we 4, nfor- rash mes, may CRS ility. ,000 ress and ogy ent ins jor de- ng. to ty y t l and countermeasures. Finally, Section 6 concludes our study and • Deleted Fields: Fields like email and login were deleted since suggests future research directions. they explicitly carry sensitive information. • Masked Fields: In this category, some efforts have been taken for data de-identification. In ip, some records were de-identified 2 STUDY PRELIMINARIES AND DATASET by replacing the last two digits of IP address with zeros. • Untouched Fields: All fields except email, login and address 2.1 ACRS Background apparently remained untouched, however, not necessarily all The use of ACRS is a common approach for collecting and handling of them convey any meaningful information. In this paper, we crashes. ACRS primarily consists of two main components. First, mostly focused on these attributes to demonstrate the breach of what we refer to as Collector, a set of client side libraries which the information. collects the data from the clients and transfers it to the server for further processing. Second, the Processor which is a set of servers 3 METHODOLOGY and services responsible for analyzing, categorizing, and reporting In this section, we explain our approach for inspecting and ana- the collected crashes. On the client side, Collector, as an interacting lyzing the dataset. We also describe the methodology we use to interface between the Processor and clients, runs when the appli- present our results in Section 4. cation terminates its normal execution process. Collector typically asks for the users’ permission to send the crash report to Proces- 3.1 Ethical Consideration sor or to quit without sending the report. While choosing to send the report to the Processor, the user has an option to include the We conducted our research based on a dataset which was initially visited website at the time crash occurred or not. The user also can published and later removed following a report regarding the pres- provide additional information in the description field to explain ence of sensitive information. During our scrutiny, however, we no- the issue that caused the crash. Additionally, there might be an op- ticed the presence of similar data which is widely available for the tion for the user to provide an email address for future support public use. Therefore our study aims to highlight a potential pitfall and contributions. After these steps, Collector transfers a dump of associated with the current implementation of the ACRS to the re- crash report and collected details to Processor that is responsible searchers, software developers, and the society. Our study neither for further processing and presenting for the view of developers propagates the data nor does it draw attention to any specific com- and the general public. pany or individual. Hence, we believe our study causes no further The collected data from clients are referred to as Crash reports. harm to victims. To preserve the privacy of associated individuals, Each crash report is composed of two components. The first com- companies and third parties, and avoid working directly with this ponent is the stack trace of the failed thread which is generally sensitive information, we used complex queries and avoided per- referred to the minidump and carries information about the sys- forming a manual check. Therefore, all results provided in Section tem memory. The second component is the runtime information 4 represents a close estimation based on the output of our queries. of the users’ system and the possible feedback provided by the For the presentation, we removed identifiable information from user. While these details are essential for the developers to detect our results, including but not limited to the names of individuals, where the problem lies, they may contain users’ sensitive informa- websites, applications, companies and their contact information, tion. For instance, the minidump as a sensitive chunk of data may such as phone numbers and addresses. The de-identification pro- contain information such as username, password, and encryption cess was performed using a python script replacing this sensitive key [7]. Also, it is not hard to expect that the runtime information information with asterisks. and users’ comment may carry private information such as contact Similarly, we removed all dataset related details which may details and other sensitive data. imply a vendor or result in identifying an individual including names, years, and other traceable information. We also changed the dataset’s fields’ names to avoid defaming company by possible 2.2 Dataset cross referencing of presented data in this paper. In this paper, we study a dataset of 2,493,278 partially anonymized crash reports consists of details such as visited URLs, time of the 3.2 Our Approach crash, the client operating system, and user description of the Main Fields of the Database. In this study, we mainly focus on crash. A dataset aggregated by a major web browser during the two fields of url and description which carry more sensitive course of 6 years. The current approach employed by the software information. companies, including web browsers (e.g., [24, 28]), stores similar • description. To understand what might be laid in this field we data in crashes. Our close inspection revealed that while some of considered the following questions: “What are the possible users’ the fields were fully anonymized (e.g., field email), unsuccessful feedbacks on crashes?” ,“What might be the response of an expert de-identifications of the other fields left a notable amount of sen- user to the crash?”, “Is it possible that a user shares his private sitive and personal information in the database. Private data like information over the description field?”. According to these ques- the usernames, passwords, and emails embedded in the URL field, tions, we mined the fields, using keywords such as “username”, or some confidential information in the description field which is “password”, “my phone number”, and “please contact me” to in- shared by the users. We noticed the presence of three types of data spect the presence of gripping information and the data that may in this dataset; impact users’ privacy. • url. Similarly, for the URL field, we considered the following • Medium Risk: This category presents those records which do questions: “ what sort of information the URL can carry?”, “Is not compromise users’ privacy or security, but a cross referenc- it possible to find any sensitive information embedded in URL?”, ing with an auxiliary information can turn them into a threat to “What if the browser crashes at the time that user presses the lo- the users’ privacy. gin button to access his account?”, “What if the browser crashes • High Risk: This category presents records which explicitly vi- when a user presses the submitbutton to register to a website or up- olate users’ privacy and contains users’ sensitive information load a form?”, “What if the browser crashes when an attacker was such as username, password, and contact details. Similar to the launching an attack against a website?”. According to these ques- data presented in the previous categories data appeared in this tions, and using keywords such as “username=”, “password=”, category can be shared by users through description or URL we mined this field to extract those records that may be harmful field. to users’ security and privacy. 4 DATA ANALYSIS Cross Referencing. We performed no cross referencing between In this section, we demonstrate the results of our data analysis and the different fields of the database or with other auxiliary resources present them based on the severity level introduced in Section 3. such as social networks or previously published datasets. We be- lieve that doing so can result in exposure of more private infor- 4.1 Low Risk mation. For instance, an attacker can locate an unpopular/distinct platform (platforms that are being used in a particular industry As discussed in Section 3.3, under the low risk category, we present e.g., banks) by cross referencing between the database fields, such information which does not violate privacy, but at the same time is as ip, platform, language. Using these fields, the attacker can ob- highly descriptive in terms of providing statistical details, and can tain the approximate location (e.g. city), and then spot the exact be a representative of the whole Internet users population. In con- location using the date and platform fields. trast to reports offered by measurement websites (e.g., [22, 38]), our dataset encompasses a wide range of users, while they were visit- ing various websites. Moreover, the data has been collected over 3.3 Categorizing Data Breach Severity the course of six years and has no correlation with any specific ap- A variety of risk analysis methodologies are proposed to appraise plication, company, or individual, which signifies the impartiality and calculate the breach severity and the impact of the associated of our dataset. risks [13, 34]. These methods consider several factors, including Platforms and Protocols. Market share websites (e.g., [32, 33]) context or type of the data, source of the leakage, and potential are the main source, to obtain statistic on the popularity of differ- impact to rank the severity of each incident. These methods are ent platforms on the Internet. These websites build their database applicable to all previously reported incidents, as the majority of using the obtained data from their visitors’ browsers. Unlike these these leakages consist of one type of entity. For instance, in the case reports, the random distribution of users in the current dataset of the JPMorgan [29], as one of the most severe data breaches of all provides a more accurate statistic on the popularity of different time [15], users’ contact information breached. In the case of Mys- platforms. In this study, such information were extracted from the pace data breach [39], subscribers’ account details were subjected platform field. Similarly, same approach can be taken to extract to the breach. This similarity makes the assessment and evalua- the of various web protocols amongst websites. Unlike other re- tion of the associated risk relatively easy. However, in the case of ports, which target the specific domains (e.g. most popular web- browsers data leakage, we deal with a set of dissimilar and indepen- sites [18] ) , the current dataset formed from a wide range of users dent records, each being generated by a different user and in a dis- and can be counted as a small sample of the whole web population. parate circumstance. For instance, while a record may only carry Identified Attacks. Considering the fact that a significant number simple data about user’s browsing, the other may convey user’s of attacks against websites are launched through URL, we queried medical condition or financial details. This diversity adds complica- the url to detect the presence of common attacks such as SQL In- tions and gives each of these records its own unique characteristic jection, Local File Inclusion, and Directory Traversal. Below are and makes the prior methodologies defective towards this dataset. some of the known attacks we found during our inspection. Therefore, to present the result of our inspection, we define three • SQL Injection: We used a combination of keywords such as categories of risk severity that are designed to address our specific “select”, “from”, “where” and “union” to detect the presence of needs and are compatible with the current dataset. We present the SQL injection. Below URL illustrates a situation in which the at- data based on their severity into three categories: tacker’s browser crashed while he was launching a SQL attack against a website. • Low Risk: This category presents those records which does not impact individuals’ privacy, but accessing them may provide ad- http://www.******.com/photo/gallery/search.php? ditional information to an attacker, helping to launch targeted search_user=x%2527%20union%20select%20user_password attacks against specific population (e.g., users who are using a %20froum%204images_user%20where%20Andrew%20Whyte= particular operating system in a small region or a specific range %2572 of IP address). This information may also be interesting to the re- • Directory Traversal: In this form of attack, the attacker can get searchers and developers as they statistically represent the web access to restricted directories of the web servers (e.g., root direc- population. tory). The attacker can use expression “../” to instruct the system to move between directories. To look up this type of attack, we The next series of passwords were extracted from the searched for “../” as a keyword. The following is an instance of description, where the user shared his password in the descrip- this attack found in our database. tion field of Crash Reporter aiming to receive future support. http://www.*****.com/de_old/index.html?prm_popup=../../ I can login but the ****** don’t remember password for this site. I tell you the password to login: • Phishing Attack: Since it was not possible to search for all the username: ****** password: ****** you can login websites, except by querying the url, we tried to inspect the south west of page. if you need translation go to users feedback on the crash. The below example is a paradigm http://www.*****.net/dic/. where the user was trying to report a phishing website. Token & SessionID. Stolen SessionID within its life cycle, means http://******/www.wellsfargo.com/ that the account is stolen [42]. Our inspection revealed 21298 ses- sion and token IDs. Though at the time of writing this paper we Similarly, we used a same approach to extract other suspicious cannot verify the persistence of vulnerability, considering the fact records. For instance, we searched for “/etc/passwd/” as a keyword that sessions authenticate users, we can understand the risk posed for detecting the Local File Inclusion or “document.cookie” for the to individuals’ security and privacy at the time of the crash. More- Cross Site Scripting (XSS). over, in our analysis, we noticed the presence of 7000 Tokens (as more severe authentication method compared to session IDs). The 4.2 Medium Risk below URL depicts one of the samples with token Id. IP Address. We observed three types of IPs in ip. The first type https://******.com/*****/Main/Login_WS.aspx? contains IPs which were fully deleted, the second type consists of tokenid=880217f4-94c3-496d-8628-b2388b4ef299 records which were masked by turning into “10.2.0.0” and finally Contact Information. During our data analysis, we dealt with a 261,388 records that were partially anonymized by replacing the significant number of contact information some laid in URL while last two blocks of IP with zeros. Moreover, a close inspection of the others were provided by users intentionally in the description the url revealed a significant number of URLs with embedded IP field to receive further support. Among these records, the latter addresses. The below sample shows an IP address inside a URL. is specifically more threatening as can be used to launch a tar- http://******/?getpostdata=get&name=******&site= geted attack against the inexpert user. The attacker with access submit&ip=**.**.**.**&ref=000/ to such records may easily undermine users’ security and privacy by contacting the user as a member of browser’s support team to Email. We extracted 9139 emails including personal and business deceive the user to install a Malware or request additional infor- email addresses. Out of which 7731 email obtained from the de- mation from the target scription field, while the remaining 1462 extracted from the URLs. Please call me at ****** to discuss details about The below sample shows an embedded email inside the URL. missing info on my ebay bid pages. Never a problem in https://*****.net/activate.php?user=*****137199& the past, now an everyday problem. email=*****@gmail.com/ Similarly, we noticed a considerable number of phone numbers Location. A fraction ofusers couldhave been locatedbased on the laid in url. Below demonstrates a number embedded inside URL. data in url and description fields. locating users can occur either http://******/jcp/OrderDetail.aspx?context=OrderHist through the URL (e.g. when the user was trying to submit a regis- &iia=ya3z*****ZTtG&pid=1&OrderNum=******&Phone=******/ tration form with an embedded address in URL) or by querying the description for the cases which user shared his address. 5 INSIGHT AND COUNTERMEASURES 4.3 High Risk A variety of direct and indirect factors contributed to the privacy Username and Password. We were able to pinpoint over 621 harm caused by this data leakage incident. Factors such as disre- passwords. This number of users’ credentials further escalate the garding the importance of data retention and its impacts on users’ severity of this leakage when we consider the fact of the perva- privacy and security [5, 6, 11], improper use of GET method and siveness of password reuse (43-51%) [12]. Some of these passwords the human error [21, 40, 41] in disclosing users’ information. While were extracted from the URL, and the rest were embedded in the de- these kinds of factors are widely discussed, yet they are inevitable scription field. In the case of URL, the credentials were exposed due and as the result, we witness an unprecedented wave of data leak- to the weaknesses in design and development of the application ages. Therefore, in this section, we propose a hotfix that can be where both the HTTP and GET method have been used to trans- deployed into the current ACRS, helping to enhance the security fer the sensitive data between client and server while this method and privacy of users against such a data disclosure, by removing of transferring has been refused by the standards [37]. Below URL sensitive information from the crash report. demonstrates a case where the browser crashed when a user was about to log into a website. 5.1 Client Side Data Sanitization In the case of ACRS, both server and client side approaches can http://******/logon.do;jsessionid=abbTXFV3-1BSw7? be employed to enhance users’ privacy and security by removing username=******&password=******/ sensitive information from the reports. While removing the users’ sensitive data at the server side may provide more flexibility for the Algorithm 1: Sanitizing Crash Report developers to get the best readability of a report, the user should Data: input (url, description) not be obligated to trust the developer and the system to send their Result: anonymized input sensitive data to the server. Therefore, in this paper, we propose a 1 initialization; client side approach to remove sensitive data from the crash re- 2 lst = [x,y,z]; port’s environmental information, aiming to safeguard users’ sen- 3 open input; sitive data right at the client side. This approach, as a hotfix to the 4
<<