
JUSTICE RESEARCH AND STATISTICS ASSOCIATION FACT SHEET, DECEMBER 2017

Web Scraping

An Emerging Collection Method for Criminal Justice Researchers

Erin J. Farley, Ph.D. & Lisa Pierotte, B.S.

Introduction

With the continual advancement of technology and the proliferation of the Internet, the amount of criminal justice-related data being placed on-line has dramatically increased over the last decade. As a result, public access to certain types of criminal justice data and statistical information on the Internet has rapidly expanded, presenting new and fundamentally different data access opportunities for criminal justice researchers. One method researchers are using to harness these new data access opportunities is web scraping.

Web scraping is essentially an automated tool for searching and extracting data from websites and other on-line sources. Pioneered in the fields of data science and e-commerce, web scraping provides researchers with an automated way to find and collect data of interest from on-line sources that is more efficient and economical than techniques traditionally used in the past, and it arguably holds great promise for researchers working in the criminal justice community (Levy, 2017).

This brief is intended to: introduce criminal justice researchers to web scraping and explain what web scraping is and how it works; provide examples of how web scraping has been used in criminal justice research; and describe several issues one should be aware of if thinking about using this type of data collection method for criminal justice research purposes.

What is Web Scraping?

Web scraping is an automated tool for finding and extracting data from on-line sources. It utilizes computer programming and customized software code to mine data or other information from on-line sources in order to extract a copy of the data and store it in an external database for analysis. Typically, the data harvested through web scraping is analyzed to answer questions that could not be answered, or answered efficiently, using the data as it was originally presented on-line. Essentially, web scraping is a way to pull information from particular web pages and re-purpose it for customized analysis (Marres & Weltevrede, 2013).

Web scraping is also referred to as automated data collection, web extracting, web crawling, or web content mining. Web scraping has arguably been around since the inception of the World Wide Web, but it has primarily been utilized in the field of data science and is commonly associated with e-commerce (Marres & Weltevrede, 2013). Indeed, a form of web scraping is often used by travel-related websites readers may be familiar with, specifically those that allow consumers to compare prices for airline tickets or hotel rooms offered by different companies. In the past decade, however, the use of web scraping has emerged in several other fields including journalism, marketing, policy analysis, and psychology research (Baker & Yacef, 2009; Marres & Weltevrede, 2013; Youyou, Kosinski, & Stillwell, 2015).

How Does Web Scraping Work?

Web scraping involves the development and use of two customized software programs: a crawler and a scraper. The crawler systematically downloads data from the Internet; then the scraper systematically pulls the relevant information (unstructured, semi-structured, or structured) from the downloaded data, codes it, and relocates it in a database or file based on a pre-determined structure and format defined by the user. This new external database or file, populated with data originally presented on-line, is subsequently analyzed in ways the original on-line presentation of data did not support.

Common software programming languages like R and Python are typically used to write the software code for both the crawler and the scraper. Hence, software programming skills are essential for building and deploying a web scraper. The software code, however, is constructed based on specific search and extraction criteria established by the researcher based on his/her understanding of the on-line data source(s) of interest and the research questions the analysis will attempt to answer. In practice, a data source theory, developed by the researcher, guides the programmer's development of the crawler and scraper. This theory describes the researcher's and programmer's assumptions about the information source and its content, as well as their understanding of how the available data is maintained and how key measures are operationalized.

Web Scraping as a Criminal Justice Research Tool

The use of web scraping by criminal justice researchers is a relatively new phenomenon. In a search of the literature for criminal justice-related research employing web scraping as a data collection tool, only a handful of studies were found in which web scraping was utilized.

One of these studies was conducted by the Urban Institute (2017) as part of a larger exploration of how criminal background checks by employers may create barriers to employment among residents of the District of Columbia (D.C.). Background checks are utilized by potential employers, in D.C. and around the nation, to screen job applicants and to identify those with a criminal record. Having a criminal history record, however, does not necessarily mean an individual has been convicted of a crime. While a criminal history record is generated when someone is arrested, an arrest does not always result in a criminal charge, and a charge does not always result in a criminal conviction. Hence, it is possible for someone who has not been adjudicated to have engaged in criminal behavior to still have a criminal record, and this information can be, and sometimes is, used by employers to screen out job applicants, arguably unfairly limiting employment opportunities for D.C. residents with such records.

One of the key information needs in understanding the extent of this problem in D.C. is determining what percentage of individuals with criminal records were and were not charged with or convicted of a criminal offense. Researchers have attempted to answer this question in the past, but due to data fragmentation across law enforcement agencies and the courts, accurately answering this question for D.C. has been a challenge (Council for Court Excellence, 2011; Duane, Reimal, & Lynch, 2017).

According to the Urban Institute researchers, web scraping provided a viable way to overcome some of the existing data access and analysis issues that resulted from this data fragmentation. Specifically, Urban Institute researchers used a web scraper to collect publicly available criminal history record data for Washington, D.C. residents over a 10-year period. These data were then used to estimate how many D.C. residents had a criminal record yet had not been convicted of a crime. The researchers determined that of the 68,000 D.C. residents who were flagged as having an arrest during the 10-year period examined, about half had not been convicted of a crime during that time span. This use of web scraping allowed Urban researchers to pull information off the web to produce more accurate estimates of the number of residents with criminal records who had not been convicted of a crime. This, in turn, better informed policy discussions regarding employment barriers for D.C. residents.

Another recent example of how web scraping has been used for criminal justice-related research involves the work being done by journalists from ProPublica Illinois, a non-profit news agency. In an article published in July 2017, David Eads describes ProPublica's efforts and ultimate failure to obtain certain information on the Cook County jail population from the Cook County Sheriff's Department through a Freedom of Information Act (FOIA) request. To overcome the data access obstacles encountered, Eads worked with computer programmers proficient in writing software code to create and deploy a web scraper for extracting publicly available data from the Cook County jail website (maintained by the Sheriff's department), including inmate names, their date of birth, and the location of the jail in which an inmate was held. The information extracted from the website using web scraping will be utilized as one part of a larger project aimed at tracking the flow of inmates through the entire criminal justice system in Illinois.

A third example comes from a National Institute of Justice-funded study currently in progress at JRSA. The study is exploring how the characteristics of various on-line advertisements for escorts, such as those posted on Craigslist and other on-line sources, can potentially be used to identify human trafficking cases. The objective of this project is to utilize the information pulled from websites (as well as from other sources like interviews) to create a profile of escort ads highly correlated with human trafficking, thereby providing law enforcement officers and prosecutors with practical guidance to more efficiently and effectively target escort ads, leading to the successful prosecution of human traffickers.

As part of this project, researchers are relying upon a pre-existing, large-scale web scraping tool known as Memex. Launched by the U.S. Department of Defense in 2015, Memex searches on-line escort ads and extracts information of interest on a daily basis. Since its inception, the Memex Program [1] has pulled billions of ads off the Internet to keep law enforcement informed about trends in online sex exploitation as well as to assist with anti-trafficking investigations (Sneed, 2015).

A final example involves research on bullying being conducted at Simon Fraser University. As part of a larger effort to explore environmental factors associated with bullying events, researchers utilized a web scraper to pull messages from a range of stakeholders (e.g., victims, parents, teachers, and bullies) from different countries [2] off of an international bullying website. This data collection contributed to an analysis that found bullying behavior most often took place in public settings where capable guardianship should be present, but even when potential guardians were present, their impact on bullying behavior was limited (Lam, Towle, & Cartwright, 2017).

[1] The Memex Program was created as part of the Defense Advanced Research Projects Agency.
[2] Including but not limited to: Canada, the United States, England, and Australia.

Web Scraping Issues to Consider

While the use of web scraping for criminal justice research is indeed in its infancy, the technology arguably has the potential to provide criminal justice researchers with an important new data collection tool. Given the proliferation in the amount of data being placed on-line, web scraping may serve as a viable alternative to traditional methods for accessing data through the Internet and conducting analyses that help answer important research questions.

Time Saver

Efficiency is another potential benefit of web scraping. Manually collecting data from on-line sources typically is time-consuming and labor intensive. As an automated process, web scraping can save time and reduce labor costs. Rather than reviewing websites and then manually copying and pasting relevant information from a website into another document or file for cleaning and analysis, a web scraper essentially automates these tasks, reducing the time and labor necessary to collect the information and prepare it for analysis.
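To make the automation described under "Time Saver" concrete, the following is a minimal sketch in Python (one of the two languages this fact sheet names) of the crawler and scraper steps, using only the standard library. Everything in it is an illustrative assumption rather than code from any study cited here: the function and class names (crawl, TableScraper, scrape_table, rows_to_csv) are hypothetical, and SAMPLE_HTML is invented, aggregate data standing in for a downloaded page.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.request import urlopen


def crawl(url, timeout=10):
    """Crawler step: download the raw HTML of one page (not called below)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


class TableScraper(HTMLParser):
    """Scraper step: collect the text of every table cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (e.g., whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape_table(html):
    """Return the table rows found in an HTML string."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    return parser.rows


def rows_to_csv(rows):
    """Relocate the scraped rows into a pre-determined format (CSV text)."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Invented sample page; a real project would pass the output of crawl().
SAMPLE_HTML = """
<table>
  <tr><th>Year</th><th>Bookings</th></tr>
  <tr><td>2015</td><td>120</td></tr>
  <tr><td>2016</td><td>135</td></tr>
</table>
"""

if __name__ == "__main__":
    print(rows_to_csv(scrape_table(SAMPLE_HTML)), end="")
```

The sketch keeps downloading and parsing separate so that the parsing rules, which encode the researcher's assumptions about how the source presents its data, can be checked against a saved copy of a page without contacting the website again.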

However, as a novel and relatively untested approach to data collection for criminal justice research purposes, web scraping raises several issues that anyone contemplating its use should be aware of as they consider the pros, cons, and feasibility of using this emerging technology.

Software Programming and Coding Skills Are Needed

While web scraping is typically used as a data collection tool to support research and analysis, developing and deploying a web scraper requires technical skills that social science researchers typically do not possess. A high level of proficiency in writing software code in computer programming languages such as Python or R is a prerequisite for developing a web scraper. Hence, a criminal justice researcher will often need to collaborate with a competent programmer or outsource the development of the web scraper, thereby incurring financial costs.

Data Quality and Interpretation

Researchers employing web scraping technology as a data collection tool also need to be concerned with the quality of the information pulled from the website and, in turn, its accurate interpretation. Web scraping was not originally created for social science research; as a result, researchers who utilize this technique may be introducing 'alien' assumptions into their research process (Marres & Weltevrede, 2013). Since web scraping does not typically involve direct communication between the researcher and those who originally collected the data and placed it on-line, data interpretation problems can easily emerge, and it can be difficult for a researcher to properly understand or verify the validity and reliability of the data. If researchers are not cognizant of these issues, they risk making inaccurate assumptions and reaching invalid conclusions.

Legal Constraints

There are also potential legal constraints for those who undertake data collection through web scraping. Indeed, an organization placing data on-line may expressly prohibit web scraping of their site, or deny access to "robots," web scrapers, and other types of automatic data harvesters. These prohibitions may be, but are not necessarily, stated in the site's 'Terms of Service,' 'Terms and Conditions,' or 'Terms of Use.' Websites may also employ a system such as CAPTCHA to distinguish between human and automated website users and to prevent automated data extraction (Studdenberg, 2017). Hence, researchers should contact the organization or agency from which the website information is being drawn to ensure that web scraping is permitted. This will support transparency in the research project, and it may provide the source agency with an opportunity to provide the data of interest to the researcher through more traditional and transparent means.

Another legal issue that can emerge in the context of web scraping relates to personal privacy. The collection of personal information from a website may potentially result in a violation of personal privacy rights, even if that information is publicly available (Levy, 2017). While the standards or risks as they relate to web scraping have yet to be clearly outlined by the courts, experts recommend avoiding the collection of personal information through web scraping when possible (Hussain, 2017). While there are some legal decisions that can be referenced for guidance (e.g., eBay v. Bidder's Edge, 2000; Facebook Inc. v. Power Ventures Inc., 2012; Ticketmaster Corp., et al. v. Tickets.com Inc., 2000), in general the standards (and legal consequences) regarding what websites or types of information can and cannot be scraped by researchers have not yet been clearly established.

Website Overload

A final issue researchers should consider is the impact web scraping may have on the functionality of a website, as some web scraping attempts have inadvertently overloaded and shut down a website. This happened with an early version of a web scraper developed by Eads and colleagues for ProPublica Illinois in 2014. That web scraper continuously ran on the Cook County jail website, overwhelming and eventually crashing it. As a result, public defenders and family members of jail inmates could not access the site to find information about their clients or family members (Eads, 2017). Consequently, there are ethical and potential legal ramifications to crashing a website that anyone utilizing web scraping should be concerned with.

Summary

Considering the amount of criminal justice-related data and other information available online, web scraping arguably presents researchers with a valuable new tool for collecting data and answering research questions. Properly designed and implemented, a web scraper can help researchers overcome data access barriers, collect on-line data more efficiently, and ultimately answer research questions that could not be answered through traditional data collection and analysis means. We hope this fact sheet will increase awareness of this novel methodology and prompt and advance discussions on the appropriateness of its use within criminal justice research.
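As one illustration of what "properly designed" might mean in light of the "Legal Constraints" and "Website Overload" issues discussed above, the sketch below shows two safeguards a scraper could apply before each request: checking a site's robots.txt rules and pausing between requests. It is a hedged example, not a method drawn from the cited sources. The names load_rules and Throttle, the "research-bot" user agent, the example.org addresses, and the SAMPLE_ROBOTS_TXT contents are all hypothetical; RobotFileParser is part of the Python standard library. Passing a robots.txt check does not by itself establish that scraping is permitted, since prohibitions may also appear in a site's terms of use.

```python
import time
from urllib.robotparser import RobotFileParser

# Invented robots.txt contents; a real scraper would download the site's own file.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def load_rules(robots_txt):
    """Parse robots.txt text into a rule set that can be queried per URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class Throttle:
    """Enforce a minimum number of seconds between successive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock   # injectable so the pause logic can be tested
        self._sleep = sleep
        self._last = None     # time at which the previous request was released

    def wait(self):
        """Pause if the previous request was too recent; return seconds waited."""
        now = self._clock()
        waited = 0.0
        if self._last is not None:
            waited = max(0.0, self.min_interval - (now - self._last))
            if waited:
                self._sleep(waited)
        self._last = now + waited
        return waited


if __name__ == "__main__":
    rules = load_rules(SAMPLE_ROBOTS_TXT)
    delay = rules.crawl_delay("research-bot") or 1
    throttle = Throttle(delay)
    for url in ("https://example.org/public/a.html",
                "https://example.org/private/b.html"):
        if rules.can_fetch("research-bot", url):
            throttle.wait()
            print("would fetch", url)
        else:
            print("skipping (disallowed by robots.txt)", url)
```

A fixed pause between requests is the simplest way to keep a scraper from running continuously against a site; a project with a larger crawl might instead schedule requests during off-peak hours or coordinate directly with the source agency.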

References

Baker, R.S.J.D. & Yacef, K. (2009). The State of Educational Data Mining in 2009: A Review and Future Visions. Journal of Educational Data Mining, 1, pp. 3-16.

Council for Court Excellence. (2011). Unlocking Employment Opportunity for Previously Incarcerated Persons in the District of Columbia. Washington, DC: CCE.

Duane, M., Reimal, E., & Lynch, M. (2017). Criminal Background Checks and Access to Jobs: A Case Study of Washington, DC. Washington, DC: Urban Institute.

Eads, D. (2017, July 24). How (and Why) We're Collecting Cook County Jail Data. ProPublica. Retrieved from: https://www.propublica.org/nerds/how-and-why-collecting-cook-county-jail-data

eBay, Inc. v. Bidder’s Edge Inc., 100 F. Supp. 2d 1058 (N.D. Cal. 2000).

Facebook Inc. v. Power Ventures Inc., 844 F. Supp. 2d 1025 (N.D. Cal. 2012).

hiQ Labs Inc. v. LinkedIn Corporation, No. 3:17-CV-03301 (N.D. Cal. 2017).

Hussein, Z. (2017, January 10). Web Scraping: Pitfalls and Proactive Best Practices. [Blog post on The Foundry Law Group Blog]. Retrieved from: http://foundrylawgroup.com/web-scraping-pitfalls-best-practices/

Lam, V.C., Towle, K., & Cartwright, B. (2017). Bullying and Audience Behaviour: Capable Guardianship and the Environmental Backdrop of School Bullying (PowerPoint presentation). Presented at the 2017 Annual American Society of Criminology Conference.

Landers, R.N., Brusso, R.C., Cavanaugh, K.J., & Collmus, A.B. (2016). A Primer on Theory-Driven Web Scraping: Automatic Extraction of Big Data From the Internet for Use in Psychological Research. Psychological Methods, 21(4), pp. 475-492.

Levy, J. (2017, August 23). If Scraping Public Data can be Considered Criminal, Innovative Research will Suffer. [Blog post on the Urban Wire]. Retrieved from: https://www.urban.org/urban-wire/if-scraping-public-data-can-be-considered-criminal-innovative-research-will-suffer

Marres, N. & Weltevrede, E. (2013). Scraping the Social? Issues in Live Social Research. Journal of Cultural Economy, 6(3), pp. 313-335.

Ticketmaster Corp., et al. v. Tickets.com, Inc., CV 99-7654 HLH (BQRx) (C.D. Cal. 2000).

Sneed, T. (2015, January 14). How Big Data Battles Human Trafficking. US News & World Report. Retrieved from: https://www.usnews.com/news/articles/2015/01/14/how-big-data-is-being-used-in-the-fight-against-human-trafficking.


Studdenberg, M. (2017, July 13). Web Scraping [webinar]. In Justice Research and Statistics Webinar Series. Retrieved from: http://www.jrsa.org/webinars/index.html#scraping.

Youyou, W., Kosinski, M., & Stillwell, D. (2015). Computer-Based Personality Judgments are More Accurate Than Those Made by Humans. Proceedings of the National Academy of Sciences of the United States of America, 112: 1036-1040. Retrieved from: http://dx.doi.org/10.1073/pnas.1418680112.