Web Scraping December 2017 Fact Sheet

Total Pages: 16

File Type: PDF, Size: 1020 KB

JUSTICE RESEARCH AND STATISTICS ASSOCIATION
WEB SCRAPING FACT SHEET, DECEMBER 2017

Web Scraping: An Emerging Data Collection Method for Criminal Justice Researchers
Erin J. Farley, Ph.D. & Lisa Pierotte, B.S.

Introduction

With the continual advancement of computer technology and the proliferation of the Internet, the amount of criminal justice-related information being placed on-line has dramatically increased over the last decade. As a result, public access to certain types of criminal justice data and statistical information on the Internet has rapidly expanded, presenting new and fundamentally different data access opportunities for criminal justice researchers. One method researchers are using to harness these new data access opportunities is web scraping. Web scraping is essentially an automated tool for searching and extracting data from websites and other on-line sources. Pioneered in the fields of data science and e-commerce, web scraping provides a user with an automated way to find and collect data of interest from on-line sources that is more efficient and economical than techniques traditionally used in the past, and it arguably holds great promise for researchers working in the criminal justice community (Levy, 2017).

This brief is intended to: introduce criminal justice researchers to web scraping and explain what web scraping is and how it works; provide examples of how web scraping has been used in criminal justice research; and describe several issues one should be aware of if thinking about using this type of data collection method for criminal justice research purposes.

What is Web Scraping?

Web scraping is an automated tool for finding and extracting data from on-line sources. It utilizes computer programming software and customized software code to mine data or other information from on-line sources in order to remove a copy of the data and store it in an external database for analysis. Typically, the data harvested through web scraping is analyzed to answer questions that could not be answered, or answered efficiently, using the data as it was originally presented on-line. Essentially, web scraping is a way to pull information from particular web pages and re-purpose it for customized analysis (Marres & Weltevrede, 2013).

Web scraping is also referred to as automated data collection, web extracting, web crawling, or web content mining. Web scraping has arguably been around since the inception of the World Wide Web, but it has primarily been utilized in the field of data science and is commonly associated with e-commerce (Marres & Weltevrede, 2013). Indeed, a form of web scraping is often used by travel-related websites readers may be familiar with, specifically those that allow consumers to compare prices for airline tickets or hotel rooms offered by different companies. In the past decade, however, the use of web scraping has emerged in several other fields, including journalism, marketing, policy analysis, and psychology research (Baker & Yacef, 2009; Marres & Weltevrede, 2013; Youyou, Kosinski, & Stillwell, 2015).

How Does Web Scraping Work?

Web scraping involves the development and use of two customized software programs – a crawler and a scraper. The crawler systematically downloads data from the Internet; then the scraper systematically pulls the relevant information (unstructured, semi-structured, or structured) from the downloaded data, codes it, and relocates it in a database or file based on a pre-determined structure and format defined by the user. This new external database or file – populated with data originally presented on-line – is subsequently analyzed in ways the original on-line presentation of data did not support.

Common software programming languages like R and Python are typically used to write the software code for both the crawler and the scraper. Hence, software programming skills are essential for building and deploying a web scraper. The software code, however, is constructed based on specific search and data extraction criteria established by the researcher based on his/her understanding of the on-line data source(s) of interest and the research questions the analysis will attempt to answer. In practice, a data source theory, developed by the researcher, guides the programmer's development of the crawler and scraper. This theory describes the researcher's and programmer's assumptions about the information source and its content, as well as their understanding of how the available data is maintained and how key measures are operationalized.
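To make the crawler and scraper components concrete, the short Python sketch below illustrates the pattern described above: the requests library downloads a page (the crawler step), Beautiful Soup pulls out the elements of interest (the scraper step), and the results are written to a CSV file in a pre-determined structure. The sketch is purely illustrative and is not from the fact sheet; the URL, the CSS selector, and the output file name are placeholder assumptions that a real project would replace with the researcher's own data source and extraction criteria.

    # Minimal sketch of the crawler/scraper pattern described above.
    # Assumptions: the URL, the "h2 a" selector, and the output file
    # name are hypothetical placeholders, not taken from the fact sheet.
    import csv
    import requests
    from bs4 import BeautifulSoup

    url = "https://www.example.com/press-releases"  # hypothetical data source

    # "Crawler" step: systematically download the page.
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    # "Scraper" step: pull the relevant elements out of the downloaded HTML
    # and code them into a pre-determined structure (here, title and link).
    soup = BeautifulSoup(response.text, "html.parser")
    records = []
    for link in soup.select("h2 a"):  # hypothetical selector
        records.append({"title": link.get_text(strip=True),
                        "url": link.get("href")})

    # Relocate the structured records to an external file for later analysis.
    with open("scraped_records.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["title", "url"])
        writer.writeheader()
        writer.writerows(records)

    print(f"Saved {len(records)} records to scraped_records.csv")

In practice, the same pattern scales up by looping the crawler step over many pages or sites and by writing the coded records to a database rather than a flat file.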
Web Scraping as a Criminal Justice Research Tool

The use of web scraping by criminal justice researchers is a relatively new phenomenon. In a search of the literature for criminal justice-related research employing web scraping as a data collection tool, only a handful of studies were found in which web scraping was utilized.

One of these studies was conducted by the Urban Institute (2017) as part of a larger exploration of how criminal background checks by employers may create barriers to employment among residents of the District of Columbia (D.C.). Background checks are utilized by potential employers, in D.C. and around the nation, to screen job applicants and to identify those with a criminal record. Having a criminal history record, however, does not necessarily mean an individual has been convicted of a crime. While a criminal history record is generated when someone is arrested, an arrest does not always result in a criminal charge, and a charge does not always result in a criminal conviction. Hence, it is possible for someone who has not been adjudicated to have engaged in criminal behavior to still have a criminal record, and this information can be, and sometimes is, used by employers to screen out job applicants, arguably unfairly limiting employment opportunities for D.C. residents with such records.

One of the key information needs in understanding the extent of this problem in D.C. is determining what percentage of individuals with criminal records were and were not charged or convicted of a criminal offense. Researchers have attempted to answer this question in the past, but due to data fragmentation across law enforcement agencies and the courts, the ability to accurately answer this question for D.C. has been a challenge (Council for Court Excellence, 2011; Duane, Reimel, & Lynch, 2017).

According to the Urban Institute researchers, web scraping provided a viable way to overcome some of the existing data access and analysis issues that resulted from this data fragmentation. Specifically, Urban Institute researchers used a web scraper to collect publicly available criminal history record data for Washington, D.C. residents over a 10-year period. These data were then used to estimate how many D.C. residents had a criminal record yet had not been convicted of a crime. The researchers determined that of the 68,000 D.C. residents who were flagged as having an arrest during the 10-year period examined, about half had not been convicted of a crime during that time span. This use of web scraping allowed Urban researchers to pull information off the web to produce more accurate estimates of the number of residents with criminal records who had not been convicted of a crime. This, in turn, better informed policy discussions regarding employment barriers for D.C. residents.

Another recent example of how web scraping has been used for criminal justice-related research involves the work being done by journalists from ProPublica Illinois, a non-profit news agency. In an article published in July 2017, David Eads describes ProPublica's efforts, and ultimate failure, to obtain certain information on the Cook County jail population from the Cook County Sheriff's Department through a Freedom of Information Act (FOIA) request. To overcome the data access obstacles encountered, Eads worked with computer programmers proficient in writing software code to create and deploy a web scraper for extracting publicly available data from the Cook County jail website (maintained by the Sheriff's department), including inmate names, their dates of birth, and the location of the jail in which an inmate was held. The information extracted from the website using web scraping will be utilized as one part of a larger project aimed at tracking the flow of inmates through the entire criminal justice system in Illinois.
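As a purely illustrative aside, the sketch below shows what that kind of roster scrape can look like in Python. The fact sheet does not describe the structure of the actual Cook County website, so the base URL, the page parameter, and the HTML class names here are hypothetical placeholders; the point is simply how a crawler can walk through a paginated listing while the scraper codes each row into a structured record.

    # Illustrative sketch of scraping a paginated roster into structured records.
    # Assumptions: the URL pattern, "page" parameter, and the "inmate-row",
    # ".name", ".dob", ".facility" class names are hypothetical; they do not
    # describe the real Cook County jail site.
    import csv
    import time
    import requests
    from bs4 import BeautifulSoup

    BASE_URL = "https://jail.example.gov/roster"  # hypothetical roster URL

    def scrape_page(page_number):
        """Download one roster page (crawler step) and code its rows (scraper step)."""
        response = requests.get(BASE_URL, params={"page": page_number}, timeout=30)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")
        rows = []
        for row in soup.select("div.inmate-row"):  # hypothetical markup
            rows.append({
                "name": row.select_one(".name").get_text(strip=True),
                "date_of_birth": row.select_one(".dob").get_text(strip=True),
                "facility": row.select_one(".facility").get_text(strip=True),
            })
        return rows

    all_records = []
    for page in range(1, 6):  # crawl the first five pages as a demonstration
        all_records.extend(scrape_page(page))
        time.sleep(1)  # pause between requests to limit load on the server

    # Relocate the coded records to an external file for later analysis.
    with open("roster.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "date_of_birth", "facility"])
        writer.writeheader()
        writer.writerows(all_records)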
A third example comes from a National Institute of Justice-funded study currently in progress at JRSA. The study is exploring how the characteristics of various on-line advertisements for escorts, such as those posted on Craigslist and other on-line sources, can potentially be used to identify human trafficking cases. The objective of this project is to utilize the information pulled from websites (as well as from other sources like interviews) to create a profile of escort ads highly correlated with human trafficking, thereby providing law enforcement officers and prosecutors with practical guidance to more efficiently and effectively target escort ads, leading to the successful prosecution of human traffickers.

As part of this project, researchers are relying upon a pre-existing, large-scale web scraping tool known as Memex. Launched by the U.S. Department of Defense in 2015, Memex searches on-line escort ads and extracts information of interest on a daily basis. Since its inception, the Memex Program has pulled billions of ads off the internet to keep law enforcement informed about trends in online sex exploitation as well as to assist with anti-trafficking investigations (Sneed, 2015).

Web Scraping Issues to Consider

While the use of web scraping for criminal justice research is indeed in its infancy, the technology arguably has the potential to provide criminal justice researchers with an important new data collection tool. Given the proliferation in the amount of data being placed on-line, web