Scrapy-Cookbook Documentation, Release 0.2.2

Xiong Neng
November 10, 2017

Contents

1 Scrapy Tutorial 01 - Getting Started
    1.1 Installing Scrapy
    1.2 A Simple Example
    1.3 Scrapy Features at a Glance
2 Scrapy Tutorial 02 - A Complete Example
    2.1 Creating a Scrapy Project
    2.2 Defining Our Item
    2.3 The First Spider
    2.4 Running the Spider
    2.5 Handling Links
    2.6 Exporting Scraped Data
    2.7 Saving Data to a Database
    2.8 Next Steps
3 Scrapy Tutorial 03 - Spiders in Detail
    3.1 CrawlSpider
    3.2 XMLFeedSpider
    3.3 CSVFeedSpider
    3.4 SitemapSpider
4 Scrapy Tutorial 04 - Selectors in Detail
    4.1 About Selectors
    4.2 Using Selectors
    4.3 Nested Selectors
    4.4 Using Regular Expressions
    4.5 Relative XPaths
    4.6 XPath Tips
5 Scrapy Tutorial 05 - Items in Detail
    5.1 Defining an Item
    5.2 Item Fields
    5.3 Item Usage Examples
    5.4 Item Loader
    5.5 Input/Output Processors
    5.6 Custom ItemLoader
    5.7 Declaring Input/Output Processors in Field Definitions
    5.8 Item Loader Context
    5.9 Built-in Processors
6 Scrapy Tutorial 06 - Item Pipeline
    6.1 Writing Your Own Pipeline
    6.2 Item Pipeline Examples
    6.3 Activating an Item Pipeline Component
    6.4 Feed exports
    6.5 Requests and Responses
7 Scrapy Tutorial 07 - Built-in Services
    7.1 Sending Email
    7.2 Running Multiple Spiders in One Process
    7.3 Distributed Crawling
    7.4 Strategies to Avoid Getting Banned
8 Scrapy Tutorial 08 - Files and Images
    8.1 Using the Files Pipeline
    8.2 Using the Images Pipeline
    8.3 Usage Example
    8.4 Custom Media Pipeline
9 Scrapy Tutorial 09 - Deployment
    9.1 Deploying to Scrapyd
    9.2 Deploying to Scrapy Cloud
10 Scrapy Tutorial 10 - Dynamically Configured Spiders
    10.1 Running Scrapy from a Script
    10.2 Running Multiple Spiders in One Process
    10.3 Defining the Rules Table
    10.4 Defining the Article Item
    10.5 Defining ArticleSpider
    10.6 Writing a Pipeline to Store Data in the Database
    10.7 Modifying the run.py Launch Script
11 Scrapy Tutorial 11 - Simulating Login
    11.1 Overriding the start_requests Method
    11.2 Using FormRequest
    11.3 Overriding _requests_to_follow
    11.4 Page Processing Methods
    11.5 Complete Source Code
12 Scrapy Tutorial 12 - Scraping Dynamic Websites
    12.1 Introduction to scrapy-splash
    12.2 Installing Docker
    12.3 Installing Splash
    12.4 Installing scrapy-splash
    12.5 Configuring scrapy-splash
    12.6 Using scrapy-splash
    12.7 Usage Example
13 Contact Me

1 Scrapy Tutorial 01 - Getting Started

Scrapy is an application framework written for crawling websites and extracting structured data. It can be used in a range of programs, including data mining, information processing, and archiving historical data. It was originally designed for page scraping (more precisely, web scraping), but it can also be used to retrieve data returned by APIs (such as Web Services) or as a general-purpose web crawler.

Scrapy can also help you build more advanced crawlers that handle complicated tasks such as site authentication while crawling, content analysis and processing, repeated crawling, and distributed crawling.

1.1 Installing Scrapy

My test environment is CentOS 6.5.

Upgrade Python to the latest 2.7 release. All of the steps below are run as the root user.

At the time this was written, Scrapy ran only on Python 2, so first update the Python on CentOS to the latest Python 2.7.11. For the exact steps, search online; there are many tutorials covering this.

First install some dependencies:

    yum install python-devel
    yum install libffi-devel
    yum install openssl-devel

Then install the pyopenssl library:

    pip install pyopenssl

Install lxml:

    yum install python-lxml
    yum install libxml2-devel
    yum install libxslt-devel

Install service-identity:

    pip install service-identity

Install twisted:

    pip install twisted

Install scrapy:

    pip install scrapy -U

Test scrapy:

    scrapy bench

It finally works. That was not easy!

1.2 A Simple Example

Create a Python source file named stackoverflow_spider.py with the following content:

    import scrapy


    class StackOverflowSpider(scrapy.Spider):
        name = 'stackoverflow'
        # The crawl starts from the most-voted questions listing.
        start_urls = ['http://stackoverflow.com/questions?sort=votes']

        def parse(self, response):
            # Follow the link of every question on the listing page.
            for href in response.css('.question-summary h3 a::attr(href)'):
                full_url = response.urljoin(href.extract())
                yield scrapy.Request(full_url, callback=self.parse_question)

        def parse_question(self, response):
            # Extract the fields of a single question page.
            yield {
                'title': response.css('h1 a::text').extract()[0],
                'votes': response.css('.question .vote-count-post::text').extract()[0],
                'body': response.css('.question .post-text').extract()[0],
                'tags': response.css('.question .post-tag::text').extract(),
                'link': response.url,
            }

Run it:

    scrapy runspider stackoverflow_spider.py -o top-stackoverflow-questions.json

The result looks similar to this:

    [{
        "body": "... LONG HTML HERE ...",
        "link": "http://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-an-unsorted-array",
        "tags": ["java", "c++", "performance", "optimization"],
        "title": "Why is processing a sorted array faster than an unsorted array?",
        "votes": "9924"
    },
    {
        "body": "... LONG HTML HERE ...",
        "link": "http://stackoverflow.com/questions/1260748/how-do-i-remove-a-git-submodule",
        "tags": ["git", "git-submodules"],
        "title": "How do I remove a Git submodule?",
        "votes": "1764"
    },
    ...]

When you run scrapy runspider somefile.py, Scrapy looks for a spider defined in the source file and hands it to the crawler engine to execute. The start_urls attribute defines the starting URLs. The spider uses them to build the initial requests, and once a response comes back it calls the default callback method, parse, passing in that response. In the parse callback we use a CSS selector to extract the href attribute of each question page link, then yield another request and register parse_question as its callback, which is executed once that request completes.

[Figure: processing flow diagram]

One advantage of Scrapy is that all requests are scheduled and processed asynchronously, so even if one request fails, the other requests continue to be processed.

Our example writes the parsed results as JSON. You can also export to other formats (such as XML or CSV), or store the output on FTP or Amazon S3. You can also use a pipeline to store the items in a database; there are many ways to save the data.

1.3 Scrapy Features at a Glance

You can already use Scrapy to crawl data from a website, parse it, and save it, but this only scratches the surface. Scrapy provides many more features that make crawling easier and more efficient. For example:

1.
