Scrapy-Cookbook Documentation År´ Såÿ´ Cˇ 0.2.2
Total Page:16
File Type:pdf, Size:1020Kb
WenQuanYi Micro Hei [Scale=0.9]WenQuanYi Micro Hei Mono songWen- QuanYi Micro Hei sfWenQuanYi Micro Hei "zh" = 0pt plus 1pt scrapy-cookbook Documentation åR´ Såÿ´ Cˇ 0.2.2 Xiong Neng 11æIJ´L 10, 2017 Contents 1 Scrapyæ ¸TZç´ ´lN01-´ åEˇ eéˇ U˚ ´lçr´G˘ 1 1.1 åoL’è˝ cˇEscrapyˇ ...................................1 1.2 ço˝Aå˘ ¸Tçd’žä¿N´ ..................................2 1.3 ScrapyçL’´zæA˘ gäÿ˘ Aè˘ g˘´L..............................4 2 Scrapyæ ¸TZç´ ´lN02-´ åo˝Næˇ ¸Tt’çd’žä¿N´ 4 2.1 å´LZå˙ zžScrapyå˚u˙ eçˇ ´lN´ ...............................4 2.2 åoŽä´zL’æ˝ ´LSä´ z˙n玡 DItemˇ .............................5 2.3 çnˇnäÿˇ AäÿłSpider˘ .................................5 2.4 è£Rèˇ ˛aNçˇ ´Lnèˇ Z´ n´..................................6 2.5 åd’Dçˇ Rˇ ˛Eé¸S¿æO˝ eˇ.................................6 2.6 årijå´ GžæŁ¸Så˘ R´Uæ˝ ¸Træˇ o˝.............................7 2.7 ä£Iå˙ Ÿæ ¸Træˇ oå˝ ´Lræˇ ¸Træˇ o垸S˝ ..........................7 2.8 äÿNäÿ´ Aæ˘ eˇ....................................9 3 Scrapyæ ¸TZç´ ´lN03-´ Spiderèr˛eè´ g˘cˇ9 3.1 CrawlSpider....................................9 3.2 XMLFeedSpider................................. 10 3.3 CSVFeedSpider.................................. 11 3.4 SitemapSpider................................... 12 4 Scrapyæ ¸TZç´ ´lN04-´ Selectorèr˛eè´ g˘cˇ 12 4.1 åE¸s䞡 Oé˝ AL’æ˘ Nl’å´ Z´´l................................ 12 4.2 ä¡£çTˇ´léAL’æ˘ Nl’å´ Z´´l................................ 12 4.3 å¸tNåˇ eˇUé˚ AL’æ˘ Nl’å´ Z´´l................................ 14 4.4 ä¡£çTˇ´læ cåˇ ´LZè´ ˛a´lè¿¿åijR´ ............................. 15 4.5 XPathçZÿå˙ r´zè˚u´ rå¿´ Dˇ ................................ 16 4.6 XPathåzžè˙ o˝o˝................................... 16 5 Scrapyæ ¸TZç´ ´lN05-´ Itemèr˛eè´ g˘cˇ 16 5.1 åoŽä´zL’Item˝ .................................... 17 5.2 Item Fields.................................... 17 5.3 Itemä¡£çTˇ´lçd’žä¿N´ ................................ 17 5.4 Item Loader.................................... 18 5.5 迸SåEˇ e/迸Såˇ Gžåd’˘ Dçˇ Rˇ ˛EåZ´´l........................... 19 5.6 èGłå˘ oŽä´zL’ItemLoader˝ .............................. 19 5.7 åIJ´lFieldåoŽä´zL’äÿ˝ åcˇræŸˇ O迸Så˝ Eˇ e/迸Såˇ Gžåd’˘ Dçˇ Rˇ ˛EåZ´´l............ 20 5.8 Item LoaderäÿŁäÿNæ´ U˝ G˘ ............................. 20 5.9 å ˛EEç¡ˇ o玽 Dåd’ˇ Dçˇ Rˇ ˛EåZ´´l.............................. 21 6 Scrapyæ ¸TZç´ ´lN06-´ Item Pipeline 21 6.1 çijUå˝ ˛EZè´ Głå˚u˘ s玴 DPipelineˇ ........................... 22 6.2 Item Pipelineçd’žä¿N´ ............................... 22 6.3 æ£Aæt’˘ zäÿ˙ AäÿłItem˘ Pipelineçz˙Däˇ z˝u˙ ....................... 24 6.4 Feed exports.................................... 24 6.5 èr˚uæ´ s´C労 Nå¸Sˇ åžTˇ ................................ 24 7 Scrapyæ ¸TZç´ ´lN07-´ å ˛EEç¡ˇ oæIJ˝ åŁ ˛a 26 7.1 åR´ Sé´ A˘ ˛Aemail................................... 26 7.2 åRˇ Näÿˇ Aäÿłè£˘ Zç˙ ´lN裴 Rèˇ ˛aNåd’ŽäÿłSpiderˇ .................... 27 7.3 å´L ˛EåÿCåijˇ Rç´ ´Lnèˇ Z´ n´................................ 27 7.4 韚æ cè´ c´nå´ rˇ ˛AçŽDçˇ Uç˝ ¸Teˇ............................ 28 8 Scrapyæ ¸TZç´ ´lN08-´ æU˝ Gä˘ z˝uäÿ˙ Oå˝ Z¿çL’˙ G˘ 28 8.1 ä¡£çTˇ´lFiles Pipeline................................ 28 8.2 ä¡£çTˇ´lImages Pipeline.............................. 29 8.3 ä¡£çTˇ´lä¿Nå´ Rˇ ................................... 29 8.4 èGłå˘ oŽä´zL’åłŠä¡¸Sç˝ o˝ ˛aé˛A¸S............................. 30 9 Scrapyæ ¸TZç´ ´lN09-´ éCˇ ´lç¡š 30 9.1 éCˇ ´lç¡šå´LrScrapydˇ ................................. 30 9.2 éCˇ ´lç¡šå´LrScrapyˇ Cloud.............................. 33 10 Scrapyæ ¸TZç´ ´lN10-´ åŁ´læA˘ ˛AéEˇ ç¡oç˝ L´ nèˇ Z´ n´ 33 10.1 èDŽæIJˇ nè£ˇ Rèˇ ˛aNScrapyˇ .............................. 33 10.2 åRˇ Näÿˇ Aè£˘ Zç˙ ´lN裴 Rèˇ ˛aNåd’Žäÿłspiderˇ ...................... 34 10.3 åoŽä´zL’è˝ g˘Dåˇ ´LZè´ ˛a´l................................ 35 10.4 åoŽä´zL’æ˝ U˝ Gç˘ n´aItem˘ ............................... 36 10.5 åoŽä´zL’ArticleSpider˝ ............................... 36 10.6 çijUå˝ ˛EZpipelineå´ ŸåC´ ´lå´Lræˇ ¸Træˇ o垸Säÿ˝ ................... 37 10.7 ä£oæ˝ T´zrun.pyåˇ Rˇ råŁ´ ´lèDŽæIJˇ nˇ.......................... 38 11 Scrapyæ ¸TZç´ ´lN11-´ æ´l ˛aæN§ç´ Z´ zå¡˙ ¸T 39 11.1 éG˘ å ˛EZstart_requestsæ´ U´zæ¸s¸T˝ .......................... 41 11.2 ä¡£çTˇ´lFormRequest................................ 41 11.3 éG˘ å ˛EZ_requests_to_follow´ ........................... 42 11.4 é ˛a¸téI˙cåd’´ Dçˇ Rˇ ˛EæU´zæ¸s¸T˝ .............................. 43 11.5 åo˝Næˇ ¸Tt’æžRçˇ a˘ ˛A................................. 43 12 Scrapyæ ¸TZç´ ´lN12-´ æŁ¸SåR´ UåŁ˝ ´læA˘ ˛Aç¡Sç´ n´Z´ 46 12.1 scrapy-splashço˝Aä˘ z˙N´ ............................... 46 12.2 åoL’è˝ cˇEdockerˇ ................................... 46 12.3 åoL’è˝ cˇESplashˇ ................................... 47 12.4 åoL’è˝ cˇEscrapy-splashˇ ............................... 47 12.5 éEˇ ç¡oscrapy-splash˝ ............................... 48 12.6 ä¡£çTˇ´lscrapy-splash................................ 48 12.7 ä¡£çTˇ´låo˝d俯 N´ ................................... 50 13 è ˛ATç¸sˇ zæ˙ L´ S´ 52 Contents: 1 Scrapyæ ¸TZç´ ´lN01-´ åEˇ eéˇ U˚ ´lçr´G˘ ScrapyæŸräÿ´ Aäÿłäÿžäž˘ ˛Eç´Lnåˇ R´Uç¡˝ Sç´ n´Zæ´ ¸Træˇ oïij˝ Næˇ R´ Råˇ R´Uç˝ z¸Sæ˙ d¯Dæˇ A˘ gæ˘ ¸Træˇ oè˝ A˘ Nçijˇ Uå˝ ˛EZ玴 Dåžˇ Tçˇ Tˇ´læ ˛a˛Eæd˝u㯠A˘ Cå´ R´ rä´ z˙eåžˇ Tçˇ Tˇ´låIJ´låNˇ Eæˇ N´ næˇ ¸Træˇ oæ˝ Nˇ Uæ˝ OŸïij˝ Nˇ ä£ ˛aæ˛Aråd’´ Dçˇ Rˇ ˛Eæ´LUå˝ ŸåC´ ´låO˝ ˛EåRšæ´ ¸Træˇ oç˝ L’äÿAç¸s˘ zå˙ ´LUçŽ˚ Dçˇ ´lNåž´ Räÿ´ ãA˘ Cå´ E˝uæIJˇ Aå˘ ´LIæŸ˙ räÿžäž´ ˛Eé˛a¸téI˙cæŁ¸Så´ R´U(æ˝ Zt’ç˙ ˛aoå˝ ´LGæ˘ I˙eèˇ rt’,ç¡´ Sç´ zIJæŁ¸Så˙ R´U)æL’˝ Aè˘ o¿è˝ o˝ ˛açŽDïijˇ Nˇ ä´z§åR´ rä´ z˙eåžˇ Tçˇ Tˇ´låIJ´lèO˚uå˝ R´UAPIæL’˝ Aè£˘ Tåˇ Z˙ d环 Dæˇ ¸Træˇ o(æ˝ r´Tå˛eˇ CWeb´ Ser- vices)æ´LUè˝ A˘ Eéˇ AŽç˘ Tˇ´lçŽDç¡ˇ Sç´ zIJç˙ ´Lnèˇ Z´ nã´ A˘ C´ Scrapyä´z§èC¡åÿˇ oä¡˝ aå˘ o˝dç¯ O˝ réˇ nŸéŸ˝u玴 Dçˇ ´Lnèˇ Z´ næ´ ˛a˛Eæd˝uïij¯ Næˇ r´Tå˛eˇ Cç´ ´Lnåˇ R´Uæ˝ U˝uçŽ˚ Dç¡ˇ Sç´ n´Zè´ od’è˝ r´ ˛AãA˘ ˛Aå˛EEåˇ o´z玽 Dåˇ ´L ˛Eæd¯Råd’ˇ Dçˇ Rˇ ˛EãA˘ ˛AéG˘ åd’ æŁ¸SåR´Uã˝ A˘ ˛Aå´L ˛EåÿCåijˇ Rç´ ´Lnåˇ R´Uç˝ L’ç L’å¿´Låd’ æI˙C玴 D䞡 Nã´ A˘ C´ 1.1å oL’è˝ cˇEscrapyˇ æ´LS玴 Dæ¸tˇ Nè´ r´ ¸TçO˝ rå´ c´CæŸˇ rcentos6.5´ å G瞢 gpythonå˘ ´LræIJˇ Aæ˘ U˝ rçL’ˇ ´LçŽD2.7ïijˇ Näÿˇ Né´ I˙c玴 DæL’ˇ AæIJL’æ˘ eéłd’éˇ C¡åˇ ´LGæ˘ cå´ ´Lrrootçˇ Tˇ´læ´L˚u çTˇ säž´ Oscrapyç˝ Z˙ oåL’˝ åRłè´ C¡è£ˇ Rèˇ ˛aNåIJˇ ´lpython2äÿŁïijNæL’ˇ Aä˘ z˙eåˇ Eˇ´LæZt’æ˙ U˝ rcentosäÿŁéˇ I˙c玴 Dpythonåˇ ´LræIJˇ Aæ˘ U˝ r玡 Dˇ Python 2.7.11ïijNˇ åE˚u䡸Sæˇ U´zæ¸s¸Tè˝ r˚ugoogleäÿ´ Nå¿´ ´Låd’Žè£Zæ´ a˚u玢 Dæˇ ¸TZç´ ´lNã´ A˘ C´ åEˇ´LåoL’è˝ cˇEäÿˇ A䞢 Zä¿˙ Iè¸t˙ Uè¡˝ rä´ z˝u˙ yum install python-devel yum install libffi-devel yum install openssl-devel çD˝uåˇ RˇOå˝ oL’è˝ cˇEpyopenssl垸Sˇ pip install pyopenssl åoL’è˝ cˇExlmlˇ yum install python-lxml yum install libxml2-devel yum install libxslt-devel åoL’è˝ cˇEservice-identityˇ pip install service-identity åoL’è˝ cˇEtwistedˇ pip install scrapy åoL’è˝ cˇEscrapyˇ pip install scrapy-U æ¸tNè´ r´ ¸Tscrapy scrapy bench æIJAç˘ z˙´Læ´LRåŁ§ïijˇ Nåd’łäÿˇ åo´z柸S䞲Eïij˛A˝ 1.2ç o˝ Aå˘ ¸Tçd’žä¿N´ å´LZå˙ zžäÿ˙ Aäÿłpythonæž˘ Ræˇ U˝ Gä˘ z˝uïij˙ Nåˇ Rˇ äÿžstackoverflow.pyïijNåˇ ˛EEåˇ o´zå˛e˝ Cäÿ´ NïijŽ´ import scrapy class StackOverflowSpider(scrapy.Spider): name='stackoverflow' start_urls=['http://stackoverflow.com/questions?sort=votes'] def parse(self, response): for href in response.css('.question-summary h3 a::attr(href) ,!'): full_url= response.urljoin(href.extract()) yield scrapy.Request(full_url, callback=self.parse_ ,!question) def parse_question(self, response): yield { 'title': response.css('h1 a::text').extract()[0], 'votes': response.css('.question .vote-count-post::text ,!').extract()[0], 'body': response.css('.question .post-text'). ,!extract()[0], 'tags': response.css('.question .post-tag::text'). ,!extract(), 'link': response.url, } è£Rèˇ ˛aNïijŽˇ scrapy runspider stackoverflow_spider.py-o top-stackoverflow- ,!questions.json çz¸Sæ˙ dIJç¯ s´zäijijäÿ˙ Né´ I˙cïijŽ´ [{ "body":"... LONG HTML HERE ...", "link":"http://stackoverflow.com/questions/11227809/why-is- ,!processing-a-sorted-array-faster-than-an-unsorted-array", "tags":["java","c++","performance","optimization"], "title":"Why is processing a sorted array faster than an ,!unsorted array?", "votes":"9924" }, { "body":"... LONG HTML HERE ...", "link":"http://stackoverflow.com/questions/1260748/how-do-i- ,!remove-a-git-submodule", "tags":["git","git-submodules"], "title":"How do I remove a Git submodule?", "votes":"1764" }, ...] 塸Sä¡aè£˘ Rèˇ ˛aNˇ scrapy runspider somefile.pyè£Zæ´ I˙ ˛aèr´ åR´ e玡 Dæˇ U˝uå˚ A˘ Zïij´ NScrapyäijŽåˇ O˝ zå˙ r´zæL’¿æž˙ Ræˇ U˝ Gä˘ z˝uäÿ˙ åoŽä´zL’玽 Däÿˇ Aäÿłspiderå´z˝uäÿ˘ Täžd’çˇ z˙Zç´ ´Lnèˇ Z´ nåij´ ¸Tæ¸SOæ˝ I˙eæL’ˇ gè˘ ˛aNåˇ o˝Cãˇ A˘ C´ start_urlsås´dæ¯ A˘ gå˘ oŽä´zL’äž˝ ˛EåijAå˘ g˘N玴 DURLïijˇ Nçˇ ´Lnèˇ Z´ näijŽé´ AŽè£˘ Gå˘ o˝Cæˇ I˙eæˇ d¯Dåˇ zžå˙ ´LIå˙ g˘N玴 Dèˇ r˚uæ´ s´Cïij´ Nè£ˇ Tåˇ Z˙ dresponseå¯ RˇOå˝ ˛E èrˇCçˇ Tˇ´lézŸè˙ od’玽 Dåˇ Z˙ dè¯ rˇCæˇ U´zæ¸s¸T˝ parseå´z˝uäijaå˘ Eˇ eè£ˇ Zäÿłresponseã´ A˘ C´ æ´LSä´ z˙nåIJˇ ´lparseåZ˙ dè¯ rˇCæˇ U´zæ¸s¸Täÿ˝ éAŽè£˘ Gä¡£ç˘ Tˇ´lcsséAL’æ˘ Nl’å´ Z´´læR´ Råˇ R´Uæ˝ r´Räÿłæ´ R´ Réˇ U˚ oé˝ ˛a¸téI˙cé¸S¿æ´ O˝ e玡 Dhrefåˇ s´dæ¯ A˘ gå˘ Aijïij˘ Nçˇ D˝uåˇ RˇO˝ yieldåR˛eåd’´ Uäÿ˝ Aäÿłè˘ r˚uæ´ s´Cïij´ Nˇ å´z˝uæ¸s´lå ˛ENˇ parse_questionåZ˙ dè¯ rˇCæˇ U´zæ¸s¸Tïij˝ NåIJˇ ´lè£Zäÿłè´ r˚uæ´ s´Cå´ o˝Næˇ ´LRåˇ RˇOè˝ c´næL’´ gè˘ ˛aNãˇ A˘ C´ åd’Dçˇ Rˇ ˛Eæ¸t ˛Aç´lNå´ Z¿ïijŽ˙ ScrapyçŽDäÿˇ Aäÿłå˘ e¡åd’ˇ DæŸˇ ræL’´ AæIJL’è˘ r˚uæ´ s´Cé´ C¡æŸˇ rè´ c´nè´ rˇCåž˛eå´z˝uåijˇ Cæ´ eåd’ˇ Dçˇ Rˇ ˛EïijNåˇ rˇsç´ o˝Uæ§˚ Räÿłèˇ r˚uæ´ s´Cå´ Gžé˘ Tˇ Zä´z§äÿ´ å¡så¸S´ åE˝uäˇ z˙Uè˝ r˚uæ´ s´Cç´ z˙gç˘ z˙ èc´nåd’´ Dçˇ Rˇ ˛EãA˘ C´ æ´LSä´ z˙n玡 Dçd’žä¿ˇ Näÿ´ årˇ ˛Eèg˘cæˇ d¯Rçˇ z¸Sæ˙ dIJç¯ T§æˇ ´LRjsonæˇ aijåij˘ Rïij´ Nä¡ˇ aè£Ÿå˘ R´ rä´ z˙eåˇ rijå´ Gžäÿžå˘ E˝uäˇ z˙Uæ˝ aijåij˘ Rïij´ ´Lær´Tå˛eˇ CXMLã´ A˘ ˛ACSVïijL’ïijNæˇ ´LUè˝ A˘ EæŸˇ rå´ rˇ ˛EåE˝uåˇ ŸåC´ ´lå´LrFTPãˇ A˘ ˛AAmazon S3äÿŁãA˘ C´ ä¡aè£Ÿå˘ R´ rä´ z˙eéˇ AŽè£˘ G˘ pipelineå rˇ ˛Eåo˝Cäˇ z˙nåˇ ŸåC´ ´lå´Lræˇ ¸Træˇ o垸Säÿ˝ åO˝ zïij˙ Nè£ˇ Zäž´ Zæ˙ ¸Træˇ oä£˝ Iå˙ ŸçŽDæˇ U´zåij˝ Rå´ Rˇ Dçˇ g˘ åRˇ Dæˇ a˚uã˘ A˘ C´ 1.3 ScrapyçL’´zæA˘ gäÿ˘ Aè˘ g˘´L ä¡aå˚ušç˘ z˙Rå´ R´ rä´ z˙eéˇ AŽè£˘ GScrapyä˘ z˙Oäÿ˝ Aäÿłç¡˘ Sç´ n´ZäÿŁé´ I˙cç´ ´Lnåˇ R´Uæ˝ ¸Træˇ oå´z˝uå˝ rˇ ˛EåE˝uèˇ g˘cæˇ d¯Rä£ˇ Iå˙ ŸäÿNæ´ I˙e䞡 ˛EïijNä¡ˇ ˛EæŸr裴 Zå´ RłæŸ´ rScrapy玴 D玡 oæ˝ r´Zã˙ A˘ C´ ScrapyæR´ Rä¿ˇ Zäž˙ ˛EæZt’åd’ŽçŽ˙ DçL’´zæˇ A˘ gæ˘ I˙eèˇ ol’ä¡˝ aç˘ ´Lnåˇ R´Uæ˝ Zt’åŁ˙ aå˘ o´z柸SåŠ˝ Néˇ nŸæ´ ¸T´LãA˘ Cæ´ r´Tå˛eˇ CïijŽ´ 1.