Scrapy Documentation
Release 2.5.0

Scrapy developers

Oct 01, 2021

FIRST STEPS

1 Getting help
2 First steps
  2.1 Scrapy at a glance
  2.2 Installation guide
  2.3 Scrapy Tutorial
  2.4 Examples
3 Basic concepts
  3.1 Command line tool
  3.2 Spiders
  3.3 Selectors
  3.4 Items
  3.5 Item Loaders
  3.6 Scrapy shell
  3.7 Item Pipeline
  3.8 Feed exports
  3.9 Requests and Responses
  3.10 Link Extractors
  3.11 Settings
  3.12 Exceptions
4 Built-in services
  4.1 Logging
  4.2 Stats Collection
  4.3 Sending e-mail
  4.4 Telnet Console
  4.5 Web Service
5 Solving specific problems
  5.1 Frequently Asked Questions
  5.2 Debugging Spiders
  5.3 Spiders Contracts
  5.4 Common Practices
  5.5 Broad Crawls
  5.6 Using your browser’s Developer Tools for scraping
  5.7 Selecting dynamically-loaded content
  5.8 Debugging memory leaks
  5.9 Downloading and processing files and images
  5.10 Deploying Spiders
  5.11 AutoThrottle extension
  5.12 Benchmarking
  5.13 Jobs: pausing and resuming crawls
  5.14 Coroutines
  5.15 asyncio
6 Extending Scrapy
  6.1 Architecture overview
  6.2 Downloader Middleware
  6.3 Spider Middleware
  6.4 Extensions
  6.5 Core API
  6.6 Signals
  6.7 Scheduler
  6.8 Item Exporters
7 All the rest
  7.1 Release notes
  7.2 Contributing to Scrapy
  7.3 Versioning and API stability
Python Module Index
Index

Scrapy is a fast high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

CHAPTER ONE

GETTING HELP

Having trouble? We’d like to help!

• Try the FAQ – it’s got answers to some common questions.
• Looking for specific information? Try the genindex or modindex.
• Ask or search questions in StackOverflow using the scrapy tag.
• Ask or search questions in the Scrapy subreddit.
• Search for questions on the archives of the scrapy-users mailing list.
• Ask a question in the #scrapy IRC channel.
• Report bugs with Scrapy in our issue tracker.

CHAPTER TWO

FIRST STEPS

2.1 Scrapy at a glance

Scrapy is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival.

Even though Scrapy was originally designed for web scraping, it can also be used to extract data using APIs (such as Amazon Associates Web Services) or as a general purpose web crawler.

2.1.1 Walk-through of an example spider

In order to show you what Scrapy brings to the table, we’ll walk you through an example of a Scrapy Spider using the simplest way to run a spider.

Here’s the code for a spider that scrapes famous quotes from the website http://quotes.toscrape.com, following the pagination:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'author': quote.xpath('span/small/text()').get(),
                'text': quote.css('span.text::text').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

Put this in a text file, name it something like quotes_spider.py and run the spider using the runspider command:

scrapy runspider quotes_spider.py -o quotes.jl

When this finishes you will have in the quotes.jl file a list of the quotes in JSON Lines format, containing text and author, looking like this:

{"author": "Jane Austen", "text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d"}
{"author": "Steve Martin", "text": "\u201cA day without sunshine is like, you know, night.\u201d"}
{"author": "Garrison Keillor", "text": "\u201cAnyone who thinks sitting in church can make you a Christian must also think that sitting in a garage can make you a car.\u201d"}
...

What just happened?

When you ran the command scrapy runspider quotes_spider.py, Scrapy looked for a Spider definition inside it and ran it through its crawler engine.

The crawl started by making requests to the URLs defined in the start_urls attribute (in this case, only the URL for quotes in the humor category) and called the default callback method parse, passing the response object as an argument. In the parse callback, we loop through the quote elements using a CSS Selector, yield a Python dict with the extracted quote text and author, look for a link to the next page and schedule another request using the same parse method as callback.
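Before committing selectors to a spider, it can help to try them out interactively. The session below is a minimal, illustrative sketch using the Scrapy shell (covered in detail later in this documentation); the exact values returned depend on the current content of quotes.toscrape.com:

scrapy shell 'http://quotes.toscrape.com/tag/humor/'

>>> quote = response.css('div.quote')[0]             # first quote block on the page
>>> quote.css('span.text::text').get()               # the quote text
'“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”'
>>> quote.xpath('span/small/text()').get()           # the author name
'Jane Austen'
>>> response.css('li.next a::attr("href")').get()    # relative URL of the next page, or None on the last page
'/tag/humor/page/2/'

This mirrors what the parse callback does, one expression at a time.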
Here you notice one of the main advantages of Scrapy: requests are scheduled and processed asynchronously. This means that Scrapy doesn’t need to wait for a request to be finished and processed; it can send another request or do other things in the meantime. This also means that other requests can keep going even if some request fails or an error happens while handling it.

While this enables you to do very fast crawls (sending multiple concurrent requests at the same time, in a fault-tolerant way), Scrapy also gives you control over the politeness of the crawl through a few settings. You can do things like setting a download delay between each request, limiting the amount of concurrent requests per domain or per IP, and even using an auto-throttling extension that tries to figure these out automatically.

Note: This is using feed exports to generate the JSON file; you can easily change the export format (XML or CSV, for example) or the storage backend (FTP or Amazon S3, for example). You can also write an item pipeline to store the items in a database; a rough sketch of such a pipeline follows the feature list below.

2.1.2 What else?

You’ve seen how to extract and store items from a website using Scrapy, but this is just the surface. Scrapy provides a lot of powerful features for making scraping easy and efficient, such as:

• Built-in support for selecting and extracting data from HTML/XML sources using extended CSS selectors and XPath expressions, with helper methods to extract using regular expressions.
• An interactive shell console (IPython aware) for trying out the CSS and XPath expressions to scrape data, very useful when writing or debugging your spiders.
• Built-in support for generating feed exports in multiple formats (JSON, CSV, XML) and storing them in multiple backends (FTP, S3, local filesystem).
• Robust encoding support and auto-detection, for dealing with foreign, non-standard and broken encoding declarations.
• Strong extensibility support, allowing you to plug in your own functionality using signals and a well-defined API (middlewares, extensions, and pipelines).
• Wide range of built-in extensions and middlewares for handling:
  – cookies and session handling
  – HTTP features like compression, authentication, caching
  – user-agent spoofing
  – robots.txt
  – crawl depth restriction
  – and more
• A Telnet console for hooking into a Python console running inside your Scrapy process, to introspect and debug your crawler
• Plus other goodies like reusable spiders to crawl sites from Sitemaps and XML/CSV feeds, and a media pipeline for automatically downloading images (or any other media) associated with the scraped items.
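As a rough illustration of the item pipeline mentioned in the note above, here is a minimal sketch of a pipeline that stores each scraped quote in a SQLite database. The class name, database file and table are made up for this example, and a pipeline also has to be enabled through the ITEM_PIPELINES setting (see the Item Pipeline and Settings chapters):

import sqlite3


class SQLitePipeline:
    """Illustrative pipeline: write every scraped item to a local SQLite file."""

    def open_spider(self, spider):
        # Called when the spider opens: set up the database connection and table.
        self.connection = sqlite3.connect('quotes.db')
        self.connection.execute(
            'CREATE TABLE IF NOT EXISTS quotes (author TEXT, text TEXT)'
        )

    def close_spider(self, spider):
        # Called when the spider closes: persist everything and release the connection.
        self.connection.commit()
        self.connection.close()

    def process_item(self, item, spider):
        # Called once per item yielded by the spider; must return the item
        # (or raise DropItem) so that later pipelines can keep processing it.
        self.connection.execute(
            'INSERT INTO quotes (author, text) VALUES (?, ?)',
            (item.get('author'), item.get('text')),
        )
        return item

A pipeline like this would normally live in a project’s pipelines.py and be activated with something along the lines of ITEM_PIPELINES = {'myproject.pipelines.SQLitePipeline': 300} in the project settings; for a standalone spider run with runspider, the same setting can be supplied via the spider’s custom_settings attribute.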