
Advice, technology and tools

Your Work
Send your careers story to: naturecareerseditor@nature.com

TOOLS THAT EASE COLLECTION FROM THE WEB

Custom web scrapers are driving research — and collaborations. By Nicholas J. DeVito, Georgia C. Richards and Peter Inglesby

In research, time and resources are precious. Automating common tasks, such as data collection, can make a project efficient and repeatable, leading in turn to increased productivity and output. You will end up with a shareable and reproducible method for data collection that can be verified, used and expanded on by others — in other words, a computationally reproducible data-collection workflow.

In a current project, we are analysing coroners' reports to help to prevent future deaths. It has required downloading more than 3,000 PDFs to search for opioid-related deaths, a huge data-collection task. In discussion with the larger team, we decided that this task was a good candidate for automation. With a few days of work, we were able to write a program that could quickly, efficiently and reproducibly collect all the PDFs and create a spreadsheet that documented each case.

Such a tool is called a 'web scraper', and our group employs them regularly. We use them to collect data from clinical-trial registries, and to enrich our OpenPrescribing.net data set, which tracks primary-care prescribing in England — tasks that would range from annoying to impossible without the help of some relatively simple code.

In the case of our coroner-reports project, we could manually screen and save about 25 case reports every hour. Now, our program can save more than 1,000 cases per hour while we work on other things, a 40-fold time saving. It also opens up opportunities for collaboration, because we can share the resulting database. And we can keep that database up to date by re-running our program as new PDFs are posted.

How does scraping work?
Web scrapers are programs that extract information from — that is, 'scrape' — websites. The structure and content of a web page are encoded in Hypertext Markup Language (HTML), which you can see using your browser's 'view source' or 'inspect element' function. A scraper understands HTML, and is able to parse and extract information from it. For example, you can program your scraper to extract specific fields of information from an online table, or to download documents linked on the page.

A common scraping task involves iterating over every possible URL from www.example.com/data/1 to www.example.com/data/100 (sometimes called 'crawling') and storing what you need from each page, without the risk of human error during extraction. Once your program is written, you can recapture these data whenever you need to, assuming the structure of the website stays mostly the same.
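To make that concrete, here is a minimal sketch of such a crawl in Python, the language we use (the Requests and BeautifulSoup packages are introduced below). The site, the URL range and the 'h1' element extracted here are placeholders: adapt them to whatever your target pages actually contain.

```python
# A minimal crawl-and-scrape sketch. The example.com URLs and the
# <h1> element extracted here are illustrative placeholders.
import csv
import time

import requests
from bs4 import BeautifulSoup

with open("records.csv", "w", newline="") as outfile:
    writer = csv.writer(outfile)
    writer.writerow(["url", "title"])
    for i in range(1, 101):  # www.example.com/data/1 ... /data/100
        url = f"https://www.example.com/data/{i}"
        response = requests.get(url, timeout=30)
        if response.status_code != 200:
            print(f"Could not fetch {url}: {response.status_code}")
            continue
        soup = BeautifulSoup(response.text, "html.parser")
        title = soup.find("h1")  # swap in the field your pages use
        writer.writerow([url, title.get_text(strip=True) if title else ""])
        time.sleep(2)  # pause between requests; see 'courteous scraper' below
```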


How do I get started?
Not all scraping tasks require programming. When you visit a web page in your browser, off-the-shelf browser extensions such as webscraper.io let you click on the elements of the page that contain the data that you're interested in. They can then automatically parse the relevant parts of the HTML and export the data as a spreadsheet.

The alternative is to build your own scraper — a more difficult process, but one that offers greater control. We use Python, but any modern programming language should work. (For specific packages, Requests and BeautifulSoup work well together in Python; for R, try rvest.) It's worth checking whether anyone else has already written a scraper for your data source. If not, there's no shortage of resources and free tutorials to help you to get started, no matter your preferred language.

As with most programming projects, there will be some trial and error, and different websites might use different data structures, or variations in how their HTML is implemented, that will require tweaks to your approach. Yet this problem-solving aspect of development can be quite rewarding. As you get more comfortable with the process, overcoming these barriers will start to seem like second nature.

But be advised: depending on the number of pages, your Internet connection and the website's server, a scraping job could still take days. If you have access and know-how, running your code on a private server can help. On a personal computer, make sure to prevent your computer from sleeping, which will disrupt the Internet connection. Also, think carefully about how your scraper can fail. Ideally, you should have a way to log failures so that you know what worked, what didn't and where to investigate further.
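For example, one simple failure-logging pattern (a sketch using Python's built-in logging module, not our exact code) wraps each request so that one bad page doesn't stop the whole run:

```python
# Log failures to a file so you can later see what worked, what
# didn't and where to investigate further. The function is a sketch.
import logging

import requests

logging.basicConfig(
    filename="scraper.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)

def fetch(url):
    """Fetch one page, logging any failure instead of crashing the run."""
    try:
        response = requests.get(url, timeout=30)
        response.raise_for_status()  # raises on 4xx/5xx status codes
    except requests.RequestException as exc:
        logging.error("Failed to fetch %s: %s", url, exc)
        return None
    logging.info("Fetched %s", url)
    return response.text
```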
Things to consider
Can you get the data an easier way? Scraping all 300,000+ records from ClinicalTrials.gov every day would be a massive job for our FDAAA TrialsTracker project. Luckily, ClinicalTrials.gov makes its full data set available for download; our software simply grabs that file once per day. We weren't so lucky with the data for our EU TrialsTracker, so we scrape the EU registry monthly.

If there's no bulk download available, check to see whether the website has an application programming interface (API). An API lets software interact with a website's data directly, rather than requesting the HTML. This can be much less burdensome than scraping individual web pages, but there might be a fee associated with API access (see, for example, Google's Maps API). In our work, the PubMed API is often useful. Alternatively, check whether the website operators can provide the data to you directly.
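To show what an API call looks like in practice, here is a minimal query against the PubMed API, through the NCBI E-utilities 'esearch' endpoint; the search term is invented for this example, and the endpoint returns matching article IDs as JSON rather than HTML to parse:

```python
# Querying an API instead of scraping: PubMed's E-utilities esearch
# endpoint returns matching article IDs as JSON. The query is made up.
import requests

url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
params = {
    "db": "pubmed",
    "term": "opioids AND coroner",  # an illustrative example search
    "retmode": "json",
    "retmax": 20,
}
response = requests.get(url, params=params, timeout=30)
response.raise_for_status()
ids = response.json()["esearchresult"]["idlist"]
print(f"First {len(ids)} matching PubMed IDs: {ids}")
```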
Can this website be scraped? Some websites don't make their data available directly in the HTML, and might require some more advanced techniques (check resources such as Stack Overflow for help with specific questions). Other websites include protections such as captchas and anti-denial-of-service (DoS) measures that can make scraping difficult. A few websites simply don't want to be scraped and are built to discourage it. It's also common to allow scraping, but only if you follow certain rules, usually codified in a robots.txt file.

Are you being a courteous scraper? Every time your program requests data from a website, the underlying information needs to be 'served' to you. You can only move so quickly in a browser, but a scraper could potentially send hundreds to thousands of requests per minute. Hammering a web server like that can slow, or entirely bring down, the website (essentially performing an unintentional DoS attack). This could get you temporarily, or even permanently, blocked from the website — and you should take care to minimize the chances of harm. For instance, you can pause your program for a few seconds between each request. (Check the site's robots.txt file to see whether it specifies a desired pause length.)
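In Python, being courteous can be as simple as the following sketch, which uses the standard library's robotparser to honour robots.txt and pauses between requests; the example.com URLs are placeholders:

```python
# A courteous scraper: check robots.txt before each fetch, and pause
# between requests. The example.com URLs are placeholders.
import time
from urllib import robotparser

import requests

robots = robotparser.RobotFileParser("https://www.example.com/robots.txt")
robots.read()

for i in range(1, 101):
    url = f"https://www.example.com/data/{i}"
    if not robots.can_fetch("*", url):
        print(f"robots.txt disallows {url}; skipping")
        continue
    response = requests.get(url, timeout=30)
    # ... parse and store the page here ...
    time.sleep(robots.crawl_delay("*") or 5)  # use the site's delay if given
```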
Are the data restricted? Be sure to check for licensing or copyright restrictions on the extracted data. You might be able to use what you scrape, but it's worth checking that you can also legally share it. Ideally, the website's content licence will be readily available. Whether or not you can share the data, you should share your code using services such as GitHub — this is good open-science practice and ensures that others can discover, repeat and build on what you've done.

We strongly feel that more researchers should be developing code to help conduct their research, and then sharing it with the community. If manual data collection has been an issue for your project, a web scraper could be the perfect solution and a great beginner coding project. Scrapers are complex enough to teach important lessons about software development, but common and well-documented enough that beginners can feel confident experimenting. Writing some relatively simple code on your computer and having it interact with the outside world can feel like a research superpower. What are you waiting for?

Nicholas J. DeVito and Georgia C. Richards are doctoral candidates and researchers, and Peter Inglesby is a software engineer, at the EBM DataLab at the University of Oxford, UK.
