
scrapy-cookbook Documentation, Release 0.2.2

Xiong Neng

November 10, 2017

Contents

1 Scrapy Tutorial 01 - Getting Started
  1.1 Installing Scrapy
  1.2 A Simple Example
  1.3 Scrapy Features at a Glance

2 Scrapy Tutorial 02 - A Complete Example
  2.1 Creating a Scrapy Project
  2.2 Defining Our Item
  2.3 The First Spider
  2.4 Running the Crawler
  2.5 Following Links
  2.6 Exporting the Scraped Data
  2.7 Saving Data to a Database
  2.8 Next Steps

3 Scrapy Tutorial 03 - Spiders in Detail
  3.1 CrawlSpider
  3.2 XMLFeedSpider
  3.3 CSVFeedSpider
  3.4 SitemapSpider

4 Scrapy Tutorial 04 - Selectors in Detail
  4.1 About Selectors
  4.2 Using Selectors
  4.3 Nested Selectors
  4.4 Using Regular Expressions
  4.5 Relative XPath
  4.6 XPath Tips

5 Scrapy Tutorial 05 - Items in Detail
  5.1 Defining Items
  5.2 Item Fields
  5.3 Item Usage Example
  5.4 Item Loader
  5.5 Input/Output Processors
  5.6 Customizing the ItemLoader
  5.7 Declaring Input/Output Processors in the Field Definition
  5.8 Item Loader Context
  5.9 Built-in Processors

6 Scrapy Tutorial 06 - Item Pipeline
  6.1 Writing Your Own Pipeline
  6.2 Item Pipeline Examples
  6.3 Activating an Item Pipeline Component
  6.4 Feed exports
  6.5 Requests and Responses

7 Scrapy Tutorial 07 - Built-in Services
  7.1 Sending email
  7.2 Running Multiple Spiders in One Process
  7.3 Distributed Crawling
  7.4 Strategies to Avoid Getting Banned

8 Scrapy Tutorial 08 - Files and Images
  8.1 Using the Files Pipeline
  8.2 Using the Images Pipeline
  8.3 Usage Example
  8.4 Customizing the Media Pipeline

9 Scrapy Tutorial 09 - Deployment
  9.1 Deploying to Scrapyd
  9.2 Deploying to Scrapy Cloud

10 Scrapy Tutorial 10 - Dynamically Configurable Crawlers
  10.1 Running Scrapy from a Script
  10.2 Running Multiple Spiders in One Process
  10.3 Defining the Rule Table
  10.4 Defining the Article Item
  10.5 Defining the ArticleSpider
  10.6 Writing a Pipeline to Store Data in the Database
  10.7 Modifying run.py to Launch the Script

11 Scrapy Tutorial 11 - Simulating Login
  11.1 Overriding the start_requests Method
  11.2 Using FormRequest
  11.3 Overriding _requests_to_follow
  11.4 Page Handling Methods
  11.5 Complete Source Code

12 Scrapy Tutorial 12 - Scraping Dynamic Websites
  12.1 Introduction to scrapy-splash
  12.2 Installing docker
  12.3 Installing Splash
  12.4 Installing scrapy-splash
  12.5 Configuring scrapy-splash
  12.6 Using scrapy-splash
  12.7 A Complete Example

13 Contact Me

1 Scrapy Tutorial 01 - Getting Started

Scrapy is an application framework written for crawling website data and extracting structured data. It can be used in a wide range of programs covering data mining, information processing, and storing historical data. It was originally designed for page scraping (more precisely, web scraping), but it can also be used to fetch data returned by APIs (such as Web Services) or as a general-purpose web crawler. Scrapy can also help you build advanced crawlers that handle complicated things such as site authentication during the crawl, content analysis and processing, deduplicated fetching, distributed crawling, and so on.

1.1 Installing Scrapy

My test environment is CentOS 6.5, with Python upgraded to the latest 2.7 release; all of the steps below are run as the root user. Since Scrapy currently only runs on Python 2, first update the Python on CentOS to the latest Python 2.7.11; for the exact steps, Google it - there are plenty of tutorials. First install some dependency packages:

yum install python-devel
yum install libffi-devel
yum install openssl-devel

Then install the pyopenssl library:

pip install pyopenssl

Install lxml:

yum install python-lxml
yum install libxml2-devel
yum install libxslt-devel

Install service-identity:

pip install service-identity

Install twisted:

pip install twisted

Install scrapy:

pip install scrapy -U

Test scrapy:

scrapy bench

It finally worked - that wasn't easy!

1.2 A Simple Example

Create a Python source file named stackoverflow.py with the following content:

import scrapy

class StackOverflowSpider(scrapy.Spider):
    name = 'stackoverflow'
    start_urls = ['http://stackoverflow.com/questions?sort=votes']

    def parse(self, response):
        for href in response.css('.question-summary h3 a::attr(href)'):
            full_url = response.urljoin(href.extract())
            yield scrapy.Request(full_url, callback=self.parse_question)

    def parse_question(self, response):
        yield {
            'title': response.css('h1 a::text').extract()[0],
            'votes': response.css('.question .vote-count-post::text').extract()[0],
            'body': response.css('.question .post-text').extract()[0],
            'tags': response.css('.question .post-tag::text').extract(),
            'link': response.url,
        }

Run it:

scrapy runspider stackoverflow_spider.py -o top-stackoverflow-questions.json

The result looks something like this:

[{
    "body": "... LONG HTML HERE ...",
    "link": "http://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-an-unsorted-array",
    "tags": ["java", "c++", "performance", "optimization"],
    "title": "Why is processing a sorted array faster than an unsorted array?",
    "votes": "9924"
},
{
    "body": "... LONG HTML HERE ...",
    "link": "http://stackoverflow.com/questions/1260748/how-do-i-remove-a-git-submodule",
    "tags": ["git", "git-submodules"],
    "title": "How do I remove a Git submodule?",
    "votes": "1764"
},
...]

When you run scrapy runspider somefile.py, Scrapy looks for a spider defined in that source file and hands it over to the crawler engine for execution. The start_urls attribute defines the starting URLs; the crawler builds the initial requests from it, and once a response comes back it calls the default callback method parse, passing in that response. In the parse callback we use a CSS selector to extract the href attribute of each question-page link, then yield another request and register the parse_question callback, which runs after that request completes.

[Processing flow diagram]

One nice thing about Scrapy is that all requests are scheduled and processed asynchronously; even if a request fails, the other requests continue to be processed. Our example produces the parsed results in JSON format, but you can also export them in other formats (such as XML or CSV), or store them on FTP or Amazon S3. You can also store them in a database through a pipeline; there are all sorts of ways to persist the data.
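For instance, the same spider run can export other formats just by changing the output file extension; the file names below are only illustrative:

scrapy runspider stackoverflow_spider.py -o top-stackoverflow-questions.csv
scrapy runspider stackoverflow_spider.py -o top-stackoverflow-questions.xml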

1.3 Scrapy Features at a Glance

You can already use Scrapy to crawl data from a website, parse it and save it, but that only scratches the surface. Scrapy provides many more features to make crawling easier and more efficient. For example:

1. Built-in support for extended CSS selectors and XPath expressions to select and extract data from HTML/XML sources, plus support for regular expressions.
2. An interactive shell console for trying out CSS and XPath expressions, which is very useful when debugging your spiders.
3. Built-in support for generating feed exports in multiple formats (JSON, CSV, XML) and storing them in several locations (FTP, S3, local filesystem).
4. Robust encoding support and auto-detection, for dealing with foreign, non-standard and broken encodings.
5. Extensible: you can use signals and a friendly API (middlewares, extensions, and pipelines) to write custom plugin functionality.
6. A large number of built-in extensions and middlewares to use:
• cookies and session handling
• HTTP features like compression, authentication, caching
• user-agent spoofing
• robots.txt
• crawl depth restriction
• and more
7. Plenty of other goodies, such as reusable spiders for crawling Sitemaps and XML/CSV feeds, a media pipeline tied to the scraped items for automatically downloading images, a caching DNS resolver, and so on.

2 Scrapy Tutorial 02 - A Complete Example

In this article we use a fairly complete example to teach you how to use Scrapy; I chose to crawl the news list on the Huxiu (huxiu.com) home page. We will go through the following steps:

• Create a new Scrapy project
• Define the Item objects we want to extract
• Write a spider to crawl a website and extract all the Item objects
• Write an Item Pipeline to store the extracted Item objects

Scrapy is written in Python. If you are not familiar with the language yet, please learn the basics first.

2.1 Creating a Scrapy Project

Run the following command in any directory you like:

scrapy startproject coolscrapy

This creates a coolscrapy folder with the following directory structure:

coolscrapy/
    scrapy.cfg            # deployment configuration file
    coolscrapy/           # the project's Python module; all your code goes here
        __init__.py
        items.py          # Item definition file
        pipelines.py      # pipelines definition file
        settings.py       # configuration file
        spiders/          # all crawler spiders go inside this folder
            __init__.py
            ...

2.2 Defining Our Item

We create a scrapy.Item subclass and define its attributes as scrapy.Field objects. We are going to extract the title, link address and summary of each entry in the Huxiu news list.

import scrapy

class HuxiuItem(scrapy.Item):
    title = scrapy.Field()     # title
    link = scrapy.Field()      # link
    desc = scrapy.Field()      # summary
    posttime = scrapy.Field()  # publish time

You may find defining this Item a bit tedious, but once it is defined you get many benefits: you can then use Scrapy's other useful components and helper classes.

2.3 The First Spider

Spiders are classes you define; Scrapy uses them to crawl information from a domain (or a group of domains). A spider class defines an initial list of URLs to download, how to follow links, and how to parse page content to extract Items. To define a Spider, just subclass scrapy.Spider and define a few attributes:

• name: the spider name, which must be unique
• start_urls: the initial list of URLs to download
• parse(): used to parse the downloaded Response object, which is also the method's only argument. It is responsible for parsing the returned page data and extracting the corresponding Items (returning Item objects), as well as more URLs to follow (returning Request objects).

We create huxiu_spider.py inside the coolscrapy/spiders folder, with the following content:

#!/usr/bin/env python
# -*- encoding: utf-8 -*-
"""
Topic: sample
Desc :
"""
from coolscrapy.items import HuxiuItem
import scrapy

class HuxiuSpider(scrapy.Spider):
    name = "huxiu"
    allowed_domains = ["huxiu.com"]
    start_urls = [
        "http://www.huxiu.com/index.php"
    ]

    def parse(self, response):
        for sel in response.xpath('//div[@class="mod-info-flow"]/div/div[@class="mob-ctt"]'):
            item = HuxiuItem()
            item['title'] = sel.xpath('h3/a/text()')[0].extract()
            item['link'] = sel.xpath('h3/a/@href')[0].extract()
            url = response.urljoin(item['link'])
            item['desc'] = sel.xpath('div[@class="mob-sub"]/text()')[0].extract()
            print(item['title'], item['link'], item['desc'])

2.4 Running the Crawler

In the project root directory run the following command, where huxiu is the spider name you defined:

scrapy crawl huxiu

If everything is working, it should print out each news entry.

2.5 Following Links

If you want to keep following each news link and look at its detail page, you can return a Request object from the parse() method and register a callback function to parse the news details.

from coolscrapy.items import HuxiuItem
import scrapy

class HuxiuSpider(scrapy.Spider):
    name = "huxiu"
    allowed_domains = ["huxiu.com"]
    start_urls = [
        "http://www.huxiu.com/index.php"
    ]

    def parse(self, response):
        for sel in response.xpath('//div[@class="mod-info-flow"]/div/div[@class="mob-ctt"]'):
            item = HuxiuItem()
            item['title'] = sel.xpath('h3/a/text()')[0].extract()
            item['link'] = sel.xpath('h3/a/@href')[0].extract()
            url = response.urljoin(item['link'])
            item['desc'] = sel.xpath('div[@class="mob-sub"]/text()')[0].extract()
            # print(item['title'], item['link'], item['desc'])
            yield scrapy.Request(url, callback=self.parse_article)

    def parse_article(self, response):
        detail = response.xpath('//div[@class="article-wrap"]')
        item = HuxiuItem()
        item['title'] = detail.xpath('h1/text()')[0].extract()
        item['link'] = response.url
        item['posttime'] = detail.xpath(
            'div[@class="article-author"]/span[@class="article-time"]/text()')[0].extract()
        print(item['title'], item['link'], item['posttime'])
        yield item

Now parse only extracts the links we are interested in and hands the parsing of the linked content over to another method. You can build much more complex crawlers on top of this pattern.

2.6 Exporting the Scraped Data

The simplest way to save the scraped data is to write it to a local JSON file, by running the crawler like this:

scrapy crawl huxiu -o items.json

For a small demo system this approach is good enough. If you are building a more complex crawler system, however, it is better to write your own Item Pipeline.

2.7 Saving Data to a Database

Above we showed how to export the scraped Items to a JSON file, but the most common approach is to write a Pipeline that stores them in a database. We define one in coolscrapy/pipelines.py:

# -*- coding: utf-8 -*-
import datetime
import redis
import json
import logging
from contextlib import contextmanager

from scrapy import signals
from scrapy.exporters import JsonItemExporter
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem
from sqlalchemy.orm import sessionmaker
from coolscrapy.models import News, db_connect, create_news_table, Article

class ArticleDataBasePipeline(object):
    """Save articles to the database."""

    def __init__(self):
        engine = db_connect()
        create_news_table(engine)
        self.Session = sessionmaker(bind=engine)

    def open_spider(self, spider):
        """This method is called when the spider is opened."""
        pass

    def process_item(self, item, spider):
        a = Article(url=item["url"],
                    title=item["title"].encode("utf-8"),
                    publish_time=item["publish_time"].encode("utf-8"),
                    body=item["body"].encode("utf-8"),
                    source_site=item["source_site"].encode("utf-8"))
        with session_scope(self.Session) as session:
            session.add(a)

    def close_spider(self, spider):
        pass

Above I used SQLAlchemy, an excellent Python ORM library, to write to the database; I wrote an introductory tutorial about it that you can refer to. Then configure this Pipeline in settings.py, along with the database connection details:

ITEM_PIPELINES = {
    'coolscrapy.pipelines.ArticleDataBasePipeline': 5,
}

# pip install MySQL-python
DATABASE = {'drivername': 'mysql',
            'host': '192.168.203.95',
            'port': '3306',
            'username': 'root',
            'password': 'mysql',
            'database': 'spider',
            'query': {'charset': 'utf8'}}

Run the crawler again:

scrapy crawl huxiu

Now all of the news articles are stored in the database.

2.8 Next Steps

This chapter only gave you a taste of Scrapy's most basic features; many advanced features have not been covered yet. Next, several examples will show you Scrapy's other features, and then each feature will be covered in depth.

3 Scrapy Tutorial 03 - Spiders in Detail

Spiders are the core of the crawling framework. The crawl flow is as follows:

1. First initialize the list of request URLs and specify the callback function that handles the downloaded response. The initial request URLs are given by start_urls; start_requests() is called to produce the Request objects, and the parse method is registered as their callback.
2. In the parse callback, parse the response and return dicts, Item objects, Request objects, or an iterable of any of those. A Request object also carries a callback; after Scrapy finishes downloading it, the response is handled by that registered callback.
3. Inside the callback functions you parse the page content using selectors (you can equally use BeautifulSoup, lxml or other tools) and produce the parsed result Items.
4. Finally, the returned Items are usually persisted to a database (using an Item Pipeline) or written to a file using Feed exports.

Although this flow applies to every spider, Scrapy implements several common Spider types for different purposes. We list them below; a small sketch of step 1 follows this list.
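As a minimal sketch of step 1 (the URL and spider names below are placeholders, not from the original text), the two spiders here are equivalent: the first relies on start_urls, the second builds the same initial request by hand in start_requests():

import scrapy

class UrlsDemoSpider(scrapy.Spider):
    name = 'urls_demo'
    start_urls = ['http://example.com/']  # Scrapy builds the initial Request for us

    def parse(self, response):
        self.logger.info('Downloaded %s', response.url)

class RequestsDemoSpider(scrapy.Spider):
    name = 'requests_demo'

    def start_requests(self):
        # equivalent to the start_urls version, but the Request is built explicitly
        yield scrapy.Request('http://example.com/', callback=self.parse)

    def parse(self, response):
        self.logger.info('Downloaded %s', response.url)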

3.1 CrawlSpider

A link-crawling spider, designed for crawling sites where the link structure follows particular rules. If you feel it does not quite fit your needs, you can subclass it and override the relevant methods, or write your own Spider. Besides the attributes inherited from scrapy.Spider, it has a new attribute rules, a list of Rule objects. Each Rule object defines one rule; if several Rules match the same link, the first one is used, according to the order in which they are defined. A detailed example:

from coolscrapy.items import HuxiuItem
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class LinkSpider(CrawlSpider):
    name = "link"
    allowed_domains = ["huxiu.com"]
    start_urls = [
        "http://www.huxiu.com/index.php"
    ]

    rules = (
        # Extract links matching '/group?f=index_group' (but not 'deny.php')
        # and follow them (follow=True is the default when no callback is given).
        Rule(LinkExtractor(allow=('/group?f=index_group', ), deny=('deny\.php', ))),
        # Extract links matching '/article/\d+/\d+.html' and parse the downloaded
        # content with parse_item; these links are not followed further.
        Rule(LinkExtractor(allow=('/article/\d+/\d+\.html', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        detail = response.xpath('//div[@class="article-wrap"]')
        item = HuxiuItem()
        item['title'] = detail.xpath('h1/text()')[0].extract()
        item['link'] = response.url
        item['posttime'] = detail.xpath(
            'div[@class="article-author"]/span[@class="article-time"]/text()')[0].extract()
        print(item['title'], item['link'], item['posttime'])
        yield item

3.2 XMLFeedSpider

An XML feed spider, used to crawl XML feeds by iterating over a given node. You can use the iternodes, xml, and html iterators; when the content is large, iternodes is recommended (and is the default), because it saves memory and improves performance by not loading the whole DOM into memory before parsing, while the html iterator can handle XML with formatting errors. When processing XML it is best to remove the namespaces first (see "Removing namespaces"). Next I demonstrate its usage by crawling my own blog's Atom feed:

from coolscrapy.items import BlogItem
import scrapy
from scrapy.spiders import XMLFeedSpider

class XMLSpider(XMLFeedSpider):
    name = "xml"
    namespaces = [('atom', 'http://www.w3.org/2005/Atom')]
    allowed_domains = ["github.io"]
    start_urls = [
        "http://www.pycoding.com/atom.xml"
    ]
    # the default iternodes iterator does not seem to work for XML that uses namespaces
    iterator = 'xml'
    itertag = 'atom:entry'

    def parse_node(self, response, node):
        # self.logger.info('Hi, this is a <%s> node!', self.itertag)
        item = BlogItem()
        item['title'] = node.xpath('atom:title/text()')[0].extract()
        item['link'] = node.xpath('atom:link/@href')[0].extract()
        item['id'] = node.xpath('atom:id/text()')[0].extract()
        item['published'] = node.xpath('atom:published/text()')[0].extract()
        item['updated'] = node.xpath('atom:updated/text()')[0].extract()
        self.logger.info('|'.join([item['title'], item['link'], item['id'], item['published']]))
        return item

3.3 CSVFeedSpider

This one is very similar to the XMLFeedSpider above; the difference is that it iterates row by row rather than node by node. The parse_row() method is called for each row.

from coolscrapy.items import BlogItem
from scrapy.spiders import CSVFeedSpider

class CSVSpider(CSVFeedSpider):
    name = "csv"
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/feed.csv']
    delimiter = ';'
    quotechar = "'"
    headers = ['id', 'name', 'description']

    def parse_row(self, response, row):
        self.logger.info('Hi, this is a row!: %r', row)
        item = BlogItem()
        item['id'] = row['id']
        item['name'] = row['name']
        return item

3.4 SitemapSpider

A sitemap spider: it lets you discover URLs from Sitemaps and then crawl the whole site. It also supports nested sitemaps and discovering sitemap URLs from robots.txt. A small sketch follows below.
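The original text gives no code for this spider type, so here is a minimal sketch; the sitemap URL, rule and callback are placeholders:

from scrapy.spiders import SitemapSpider

class MySitemapSpider(SitemapSpider):
    name = 'sitemap_demo'
    # start from a sitemap URL (a robots.txt URL that lists sitemaps also works)
    sitemap_urls = ['http://www.example.com/sitemap.xml']
    # route matching URLs to callbacks; URLs that match no rule use 'parse' by default
    sitemap_rules = [('/article/', 'parse_article')]

    def parse_article(self, response):
        self.logger.info('Article page: %s', response.url)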

4 Scrapy Tutorial 04 - Selectors in Detail

When you crawl web pages, the most common task is to extract the data you need from the page source. Several libraries can help with this:

1. BeautifulSoup is a very popular Python scraping library. It handles badly formatted markup gracefully, but it has one big drawback: it is slow.
2. lxml is an XML parsing library built on ElementTree (it can parse HTML as well), although lxml is not part of the Python standard library.

Scrapy implements its own data-extraction mechanism. The extractors are called selectors, and they pick out specific parts of an HTML document via XPath or CSS expressions. XPath is a language for selecting nodes in XML documents that also works on HTML; CSS is a styling language for HTML documents. Scrapy selectors are built on top of lxml, so speed and accuracy are guaranteed. In this chapter we explain in detail how the selectors work, along with their extremely simple and familiar API - much smaller than lxml's, since lxml is also used for many other things. For the full API see the Selector reference.

4.1 About Selectors

After Scrapy has downloaded a page for us, how do we find the elements we need in content full of HTML tags? This is where selectors come in: they locate elements and extract their values. A few examples first:

• /html/head/title: selects the <title> node, which sits inside the <head> node of the HTML document
• /html/head/title/text(): selects the text content of the <title> node above
• //td: selects all <td> elements in the page
• //div[@class="mine"]: selects all div elements that have the attribute class="mine"

Scrapy uses CSS and XPath selectors to locate elements, and they have four basic methods:

• xpath(): returns a list of selectors, each representing a node selected by the XPath expression
• css(): returns a list of selectors, each representing a node selected by the CSS expression
• extract(): returns the unicode string of the selected element
• re(): returns a list of unicode strings extracted with a regular expression

4.2 Using Selectors

Below we demonstrate selectors with the Scrapy shell. Suppose we have the following page, http://doc.scrapy.org/en/latest/_static/selectors-sample1.html, with this content:

<html>
 <head>
  <base href='http://example.com/' />
  <title>Example website</title>
 </head>
 <body>
  <div id='images'>
   <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
   <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
  </div>
 </body>
</html>

First we open the shell:

scrapy shell http://doc.scrapy.org/en/latest/_static/selectors-sample1.html

Then run:

>>> response.xpath('//title/text()')
[<Selector (text) xpath=//title/text()>]
>>> response.css('title::text')
[<Selector (text) xpath=//title/text()>]

From the results you can see that the xpath() and css() methods return SelectorList instances, which are lists of selectors, so you can chain them to select nested data:

>>> response.css('img').xpath('@src').extract()
[u'image1_thumb.jpg',
 u'image2_thumb.jpg',
 u'image3_thumb.jpg',
 u'image4_thumb.jpg',
 u'image5_thumb.jpg']

You must call .extract() to get the actual data. If you only want the first match, you can use .extract_first():

>>> response.xpath('//div[@id="images"]/a/text()').extract_first()
u'Name: My image 1 '

If nothing is found, it returns None; you can also supply a default value:

>>> response.xpath('//div[@id="not-exists"]/text()').extract_first(default='not-found')
'not-found'

CSS selectors can also use CSS3 syntax:

>>> response.css('title::text').extract()
[u'Example website']

Below are a few more complete examples:

>>> response.xpath('//base/@href').extract()
[u'http://example.com/']

>>> response.css('base::attr(href)').extract()
[u'http://example.com/']

>>> response.xpath('//a[contains(@href, "image")]/@href').extract()
[u'image1.html',
 u'image2.html',
 u'image3.html',
 u'image4.html',
 u'image5.html']

>>> response.css('a[href*=image]::attr(href)').extract()
[u'image1.html',
 u'image2.html',
 u'image3.html',
 u'image4.html',
 u'image5.html']

>>> response.xpath('//a[contains(@href, "image")]/img/@src').extract()
[u'image1_thumb.jpg',
 u'image2_thumb.jpg',
 u'image3_thumb.jpg',
 u'image4_thumb.jpg',
 u'image5_thumb.jpg']

>>> response.css('a[href*=image] img::attr(src)').extract()
[u'image1_thumb.jpg',
 u'image2_thumb.jpg',
 u'image3_thumb.jpg',
 u'image4_thumb.jpg',
 u'image5_thumb.jpg']

4.3 Nested Selectors

xpath() and css() return lists of selectors, so you can keep calling their methods on the result. For example:

>>> links = response.xpath('//a[contains(@href, "image")]')
>>> links.extract()
[u'<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg"></a>',
 u'<a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg"></a>',
 u'<a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg"></a>',
 u'<a href="image4.html">Name: My image 4 <br><img src="image4_thumb.jpg"></a>',
 u'<a href="image5.html">Name: My image 5 <br><img src="image5_thumb.jpg"></a>']

>>> for index, link in enumerate(links):
...     args = (index, link.xpath('@href').extract(), link.xpath('img/@src').extract())
...     print 'Link number %d points to url %s and image %s' % args

Link number 0 points to url [u'image1.html'] and image [u'image1_thumb.jpg']
Link number 1 points to url [u'image2.html'] and image [u'image2_thumb.jpg']
Link number 2 points to url [u'image3.html'] and image [u'image3_thumb.jpg']
Link number 3 points to url [u'image4.html'] and image [u'image4_thumb.jpg']
Link number 4 points to url [u'image5.html'] and image [u'image5_thumb.jpg']

4.4 Using Regular Expressions

Selector has a re() method for extracting data with regular expressions. It returns a list of unicode strings, so you cannot nest further selector calls on the result:

>>> response.xpath('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')
[u'My image 1',
 u'My image 2',
 u'My image 3',
 u'My image 4',
 u'My image 5']

>>> response.xpath('//a[contains(@href, "image")]/text()').re_first(r'Name:\s*(.*)')
u'My image 1'

4.5 Relative XPath

When you nest XPath selectors, do not start the expression with /, because that is evaluated from the document root rather than from the current node; use a relative path instead:

>>> divs = response.xpath('//div')
>>> for p in divs.xpath('.//p'):  # extracts all <p> inside
...     print p.extract()

# or this, using p directly, also works
>>> for p in divs.xpath('p'):
...     print p.extract()

4.6 XPath Tips

Using text nodes in a condition

Avoid using .//text() in a condition; use . directly:

>>> sel.xpath("//a[contains(., 'Next Page')]").extract()
[u'<a href="#">Click here to go to the <strong>Next Page</strong></a>']

The difference between //node[1] and (//node)[1]

• //node[1]: selects every node element that is the first child of its parent
• (//node)[1]: selects all node elements, then returns the first node element in that result

Prefer CSS when locating elements by class

>>> from scrapy import Selector
>>> sel = Selector(text='<div class="hero shout"><time datetime="2014-07-23 19:00">Special date</time></div>')
>>> sel.css('.shout').xpath('./time/@datetime').extract()
[u'2014-07-23 19:00']

5 Scrapy Tutorial 05 - Items in Detail

Items are where the structured data is stored. Scrapy can return the parsed results as plain dicts, but Python dicts lack structure, which is inconvenient in larger crawler systems. Item provides a dict-like API plus a very convenient way of declaring fields, and many Scrapy components can make use of the extra information carried by Items.

5.1 Defining Items

Defining an Item is very simple: just subclass scrapy.Item and declare every field as a scrapy.Field:

import scrapy

class Product(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    stock = scrapy.Field()
    last_updated = scrapy.Field(serializer=str)

5.2 Item Fields

Field objects can be used to attach metadata to each field. For example, above the serialization function of last_updated is declared as str. You can attach any metadata you like; each kind of metadata simply means different things to different components.
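As a small illustration (not from the original text), a component can read that metadata back through the item class's fields dictionary:

# Product.fields maps field names to their Field objects (which are dicts)
print(Product.fields['last_updated'])                          # {'serializer': <type 'str'>}
serializer = Product.fields['last_updated'].get('serializer')  # -> str
print(serializer('today'))                                     # 'today'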

5.3 Item Usage Example

You will see that using an Item is very similar to using the dict API in Python.

Creating an Item

>>> product = Product(name='Desktop PC', price=1000)
>>> print product
Product(name='Desktop PC', price=1000)

Getting field values

>>> product['name']
Desktop PC
>>> product.get('name')
Desktop PC

>>> product['price']
1000

>>> product['last_updated']
Traceback (most recent call last):
    ...
KeyError: 'last_updated'

>>> product.get('last_updated', 'not set')
not set

>>> product['lala'] # getting unknown field
Traceback (most recent call last):
    ...
KeyError: 'lala'

>>> product.get('lala', 'unknown field')
'unknown field'

>>> 'name' in product  # is name field populated?
True

>>> 'last_updated' in product  # is last_updated populated?
False

>>> 'last_updated' in product.fields  # is last_updated a declared field?
True

>>> 'lala' in product.fields  # is lala a declared field?
False

Setting field values

>>> product['last_updated'] = 'today'
>>> product['last_updated']
today

>>> product['lala'] = 'test' # setting unknown field
Traceback (most recent call last):
    ...
KeyError: 'Product does not support field: lala'

Accessing all populated values

>>> product.keys()
['price', 'name']

>>> product.items()
[('price', 1000), ('name', 'Desktop PC')]

5.4 Item Loader

Item Loaders give us a very convenient way of populating Items. Items are containers for the scraped data, and an Item Loader lets us fill that container very conveniently. Below we show the typical usage through an example:

from scrapy.loader import ItemLoader
from myproject.items import Product

def parse(self, response):
    l = ItemLoader(item=Product(), response=response)
    l.add_xpath('name', '//div[@class="product_name"]')
    l.add_xpath('name', '//div[@class="product_title"]')
    l.add_xpath('price', '//p[@id="price"]')
    l.add_css('stock', 'p#stock')
    l.add_value('last_updated', 'today')  # you can also use literal values
    return l.load_item()

Note that the name field above is accumulated from two different XPath paths.

5.5 Input/Output Processors

Each Item Loader has one input processor and one output processor per Field. The input processor runs as soon as the data is received; the output processor runs when the data collection is finished and ItemLoader.load_item() is called, and it returns the final result.

l = ItemLoader(Product(), some_selector)
l.add_xpath('name', xpath1)  # (1)
l.add_xpath('name', xpath2)  # (2)
l.add_css('name', css)       # (3)
l.add_value('name', 'test')  # (4)
return l.load_item()         # (5)

The execution flow is as follows:

1. The data from xpath1 is extracted and passed to the input processor of the name field; the result produced by the input processor is kept inside the Item Loader (it is not yet assigned to the item).
2. The data from xpath2 is extracted and passed to the same input processor used in (1), since both are processors of the name field; its result is appended to the result of (1).
3. Same as 2.
4. Same as 3, except that this time the value is a literal string; it is first wrapped into a single-element iterable before being handed to the input processor.
5. The data collected in steps 1-4 is passed to the output processor of name, and its return value is assigned to the name field.

5.6 Customizing the ItemLoader

Use the class definition syntax; here is an example:

from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst, MapCompose, Join

class ProductLoader(ItemLoader):

    default_output_processor = TakeFirst()

    name_in = MapCompose(unicode.title)
    name_out = Join()

    price_in = MapCompose(unicode.strip)

    # ...

Input and output processors are declared with the _in and _out suffixes, and you can also set the defaults ItemLoader.default_input_processor and ItemLoader.default_output_processor.

5.7 Declaring Input/Output Processors in the Field Definition

There is one more very convenient place to add input/output processors: directly in the Field definition.

import scrapy
from scrapy.loader.processors import Join, MapCompose, TakeFirst
from w3lib.html import remove_tags

def filter_price(value):
    if value.isdigit():
        return value

class Product(scrapy.Item):
    name = scrapy.Field(
        input_processor=MapCompose(remove_tags),
        output_processor=Join(),
    )
    price = scrapy.Field(
        input_processor=MapCompose(remove_tags, filter_price),
        output_processor=TakeFirst(),
    )

The order of precedence is:

1. field_in and field_out declared in the Item Loader
2. Field metadata (the input_processor and output_processor keywords)
3. The Item Loader defaults

Tip: in general, declare input processors as field_in in the Item Loader definition, and declare output processors in the Field metadata.

5.8 Item Loader Context

The Item Loader context is shared by all input/output processors. Suppose you have a function that parses a length value:

def parse_length(text, loader_context):
    unit = loader_context.get('unit', 'm')
    # ... length parsing code goes here ...
    return parsed_length

Initializing and modifying the context values:

loader = ItemLoader(product)
loader.context['unit'] = 'cm'

loader = ItemLoader(product, unit='cm')

class ProductLoader(ItemLoader):
    length_out = MapCompose(parse_length, unit='cm')

5.9 Built-in Processors

1. Identity: does nothing
2. TakeFirst: returns the first non-empty value; usually used as an output processor
3. Join: joins the results together, using a space ' ' by default
4. Compose: chains functions into a pipeline and produces the final output
5. MapCompose: similar to Compose above; the difference is how the intermediate results are passed between the functions. Its input is an iterable; the first function is applied to every value in turn, producing a new iterable that becomes the input of the second function, and so on; the results produced at the end are concatenated and returned as the final value. It is generally used as an input processor.
6. SelectJmes: queries a value using a JSON path and returns the result
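A few quick sketches (not from the original text) of how these processors behave when called directly; the input values are illustrative:

from scrapy.loader.processors import Identity, TakeFirst, Join, Compose, MapCompose

Identity()(['one', 'two'])                         # -> ['one', 'two']
TakeFirst()(['', 'one', 'two'])                    # -> 'one' (first non-empty value)
Join()(['one', 'two'])                             # -> 'one two'
Compose(lambda v: v[0], str.upper)(['hello'])      # -> 'HELLO'
MapCompose(str.strip, str.upper)([' a ', ' b '])   # -> ['A', 'B']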

6 Scrapy Tutorial 06 - Item Pipeline

After an item has been scraped by a spider, it is sent to the Item Pipeline, where several components process it one after another in order. Each Item Pipeline component is simply a Python class that implements one simple method. It receives an item, runs its logic on it, and decides whether the item keeps travelling down the pipeline or gets dropped.

Common uses of Item Pipelines:

• Cleansing HTML data
• Validating the scraped data (checking that an item contains certain fields)
• Checking for duplicates (and dropping them)
• Storing the scraped data in a database

6.1 Writing Your Own Pipeline

Define a Python class and implement the method process_item(self, item, spider); it should return a dict or an Item, or raise a DropItem exception to discard the item.

You can also implement these methods:

• open_spider(self, spider): executed when the spider is opened
• close_spider(self, spider): executed when the spider is closed
• from_crawler(cls, crawler): gives access to core components such as the settings and signals, and lets you register hook functions into Scrapy

6.2 Item Pipeline Examples

Price validation

Let's look at a price-validation example to see how a pipeline is used:

from scrapy.exceptions import DropItem

class PricePipeline(object):

    vat_factor = 1.15

    def process_item(self, item, spider):
        if item['price']:
            if item['price_excludes_vat']:
                item['price'] = item['price'] * self.vat_factor
            return item
        else:
            raise DropItem("Missing price in %s" % item)

Writing items to a JSON file

The following Pipeline writes all items to a single JSON file, one item per line:

import json

class JsonWriterPipeline(object):

    def __init__(self):
        self.file = open('items.jl', 'wb')

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item

Storing items in MongoDB

This example uses pymongo to show how to save items into MongoDB. The MongoDB address and database name are given in the settings; the example mainly shows you how to use the from_crawler() method and how to clean up resources:

import pymongo

class MongoPipeline(object):

    collection_name = 'scrapy_items'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db[self.collection_name].insert(dict(item))
        return item

Duplicates filter

Suppose the id field in our items is unique, but our spider returns multiple items with the same id:

from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):

    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['id'])
            return item

6.3 Activating an Item Pipeline Component

You must add the Pipeline components you want to activate to ITEM_PIPELINES in the settings file:

ITEM_PIPELINES = {
    'myproject.pipelines.PricePipeline': 300,
    'myproject.pipelines.JsonWriterPipeline': 800,
}

The numbers determine the execution order, from low to high, and must be in the range 0-1000.
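If you only want a pipeline enabled for one particular spider, one option (an illustrative sketch, not from the original text) is the spider's custom_settings attribute, which overrides the project-wide setting for that spider only:

import scrapy

class HuxiuSpider(scrapy.Spider):
    name = "huxiu"
    # only this spider runs the database pipeline
    custom_settings = {
        'ITEM_PIPELINES': {'coolscrapy.pipelines.ArticleDataBasePipeline': 300},
    }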

6.4 Feed exports

Feed exports deserve a quick mention here: some crawlers simply serialize the crawl results to a file and save it on some storage backend. You only need to set a few options in the settings:

* FEED_FORMAT = json   # json|jsonlines|csv|xml|pickle|marshal
* FEED_URI = file:///tmp/export.csv | ftp://user:[email protected]/path/to/export.csv | s3://aws_key:aws_secret@mybucket/path/to/export.csv | stdout:
* FEED_EXPORT_FIELDS = ["foo", "bar", "baz"]   # this one is useful when exporting to CSV

6.5 Requests and Responses

Scrapy uses Request and Response objects to crawl websites. Request objects are generated by the spider and passed to the downloader; the downloader handles the Request and returns a Response object, which is then handed back to the spider that generated the Request.

Passing additional data to callback functions

When a Request object is created, the callback function is given through the callback keyword argument, and the Response object is passed to it as the first argument. Sometimes we want to pass extra data along: for example, building some Item may take two steps, the first collecting the link attributes and the second the detail attributes. For that you can use Request.meta:

def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    request = scrapy.Request("http://www.example.com/some_page.html",
                             callback=self.parse_page2)
    request.meta['item'] = item
    return request

def parse_page2(self, response):
    item = response.meta['item']
    item['other_url'] = response.url
    return item

Request subclasses

Scrapy ships with many built-in Request subclasses for different scenarios, and you can also subclass Request to create your own request classes. FormRequest is designed specifically for HTML forms; here is an example that simulates a form submission:

return [FormRequest(url="http://www.example.com/post/action",
                    formdata={'name': 'John Doe', 'age': '27'},
                    callback=self.after_post)]

Here is another example that simulates a user login, using FormRequest.from_response():

import scrapy

class LoginSpider(scrapy.Spider):
    name = 'example.com'
    start_urls = ['http://www.example.com/users/login.php']

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'john', 'password': 'secret'},
            callback=self.after_login
        )

    def after_login(self, response):
        # check login succeed before going on
        if "authentication failed" in response.body:
            self.logger.error("Login failed")
            return

        # continue scraping with authenticated session...

Response subclasses

A scrapy.http.Response object represents an HTTP response; it is normally produced by the downloader and handed to the Spider for further processing. Response also has several default subclasses that represent different kinds of responses:

• TextResponse: adds encoding handling on top of the base Response class (the base class itself is meant for binary data such as images, audio, or other media files)
• HtmlResponse: a subclass of TextResponse that auto-discovers the encoding by looking at the HTML meta http-equiv attribute
• XmlResponse: a subclass of TextResponse that auto-discovers the encoding by looking at the XML declaration
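A rough sketch (not from the original text) of how this shows up in a spider callback: the downloader normally hands you the right subclass, and the text-based subclasses expose the detected encoding:

from scrapy.http import HtmlResponse, XmlResponse

def parse(self, response):
    if isinstance(response, HtmlResponse):
        self.logger.info('HTML page %s, encoding=%s', response.url, response.encoding)
    elif isinstance(response, XmlResponse):
        self.logger.info('XML feed %s, encoding=%s', response.url, response.encoding)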

7 Scrapy Tutorial 07 - Built-in Services

Scrapy uses Python's built-in logging system to record event logs. Logging configuration:

LOG_ENABLED = True
LOG_ENCODING = "utf-8"
LOG_LEVEL = logging.INFO
LOG_FILE = "log/spider.log"
LOG_STDOUT = True
LOG_FORMAT = "%(asctime)s [%(name)s] %(levelname)s: %(message)s"
LOG_DATEFORMAT = "%Y-%m-%d %H:%M:%S"

Using it is also very simple:

import logging
logger = logging.getLogger(__name__)
logger.warning("This is a warning")

If you use it inside a Spider it is even simpler, because logger is an instance attribute of the spider:

import scrapy

class MySpider(scrapy.Spider):

    name = 'myspider'
    start_urls = ['http://scrapinghub.com']

    def parse(self, response):
        self.logger.info('Parse function called on %s', response.url)

7.1 Sending email

Scrapy sends email on top of non-blocking IO; only a few simple settings are needed.

Initialization (the MailSender class lives in scrapy.mail):

from scrapy.mail import MailSender
mailer = MailSender.from_settings(settings)

Sending a mail without attachments:

mailer.send(to=["[email protected]"], subject="Some subject", body="Some body", cc=["[email protected]"])

Configuration:

MAIL_FROM = 'scrapy@localhost'
MAIL_HOST = 'localhost'
MAIL_PORT = 25
MAIL_USER = ""
MAIL_PASS = ""
MAIL_TLS = False
MAIL_SSL = False

7.2 Running Multiple Spiders in One Process

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start()  # the script will block here until all crawling jobs are finished

7.3 Distributed Crawling

Scrapy does not provide a built-in distributed crawling facility, but there are several ways to implement one. If you have many spiders, the simplest way is to start several Scrapyd instances and distribute the spiders over the machines. If you want to run one spider on multiple machines, you can split the URLs into shards and hand one shard to the spider on each machine. For example, you split the URL list into 3 parts:

http://somedomain.com/urls-to-crawl/spider1/part1.list
http://somedomain.com/urls-to-crawl/spider1/part2.list
http://somedomain.com/urls-to-crawl/spider1/part3.list

Then run 3 Scrapyd instances, start the spider on each of them, and pass the part parameter:

curl http://scrapy1.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=1
curl http://scrapy2.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=2
curl http://scrapy3.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=3
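A sketch of how spider1 might consume the part argument that Scrapyd passes in; the URL pattern is the one from the example above, everything else is illustrative:

import scrapy

class PartitionedSpider(scrapy.Spider):
    name = 'spider1'

    def __init__(self, part=None, *args, **kwargs):
        super(PartitionedSpider, self).__init__(*args, **kwargs)
        self.part = part

    def start_requests(self):
        # first download this machine's slice of the URL list
        url = 'http://somedomain.com/urls-to-crawl/spider1/part%s.list' % self.part
        yield scrapy.Request(url, callback=self.parse_url_list)

    def parse_url_list(self, response):
        for line in response.body.splitlines():
            if line.strip():
                yield scrapy.Request(line.strip(), callback=self.parse)

    def parse(self, response):
        pass  # the actual scraping logic goes here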

7.4 Strategies to Avoid Getting Banned

Some websites implement measures to keep crawlers away from their pages; some are simple, others quite sophisticated. If you need to know the details, you can consult commercial support.

Here are a few useful tips for such sites:

• Use a pool of user agents: pick a different browser User-Agent header at random for every request, so the crawler's identity is not exposed.
• Disable cookies: some sites use cookies to identify users; disabling them prevents the server from recognizing the crawler's trail.
• Set download_delay: a value of 5 seconds, for instance; the larger, the safer.
• If possible, fetch pages through the Google cache instead of visiting the site directly.
• Use a rotating IP pool, for example the free Tor project or the paid ProxyMesh.
• Use a large distributed downloader, so that getting banned is avoided altogether and you only have to care about parsing the pages. One example is Crawlera.

If all of this still does not prevent you from getting banned, consider commercial support. A sketch of the relevant settings follows below.
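A sketch of the corresponding settings.py entries for a few of these suggestions (the values and the User-Agent string are illustrative):

DOWNLOAD_DELAY = 5        # download delay in seconds; the larger, the safer
COOKIES_ENABLED = False   # disable cookies
# A single browser-like User-Agent; rotating a whole pool of them requires a
# custom downloader middleware.
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36'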

8 Scrapy Tutorial 08 - Files and Images

Scrapy provides reusable item pipelines for downloading the files attached to a particular Item. Normally you will choose either the Files Pipeline or the Images Pipeline.

Both pipelines implement:

• Avoiding re-downloading files that were downloaded before
• Specifying where the downloaded files are stored (a filesystem directory, Amazon S3)

The Images Pipeline provides extra features for handling images:

• Converting all downloaded images to the common JPG format with RGB color mode
• Generating thumbnails
• Checking image width and height to make sure they meet a minimum size constraint

The pipelines also keep an internal list of the URLs currently scheduled for download, and associate responses containing the same media with that queue, so that a media file shared by several items is not downloaded more than once.

8.1 Using the Files Pipeline

The usual workflow for the files pipeline goes like this (a small sketch follows the list):

1. In some Spider, after scraping an item, you put the URLs of the relevant files into its file_urls field.
2. The item is returned and then travels into the item pipeline.
3. When the item reaches the FilesPipeline, the URLs in the file_urls field are scheduled for download through the standard Scrapy scheduler and downloader, with high priority, so they are handled before other pages are scraped. Meanwhile the item stays locked in this pipeline until all the files are downloaded.
4. When the files have been downloaded, the result is assigned to another field, files. This field contains a list of dicts with information about the downloaded files, such as the download path, the source URL and the file checksum. The order of the entries in files matches the order of file_urls. If downloading some file failed, that file simply does not appear in files.
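A small sketch of what step 1 looks like in practice; the item class, the spider and the PDF selector are illustrative only:

import scrapy

class ReportItem(scrapy.Item):
    file_urls = scrapy.Field()   # filled in by the spider
    files = scrapy.Field()       # filled in by FilesPipeline after download

class ReportSpider(scrapy.Spider):
    name = 'reports'
    start_urls = ['http://www.example.com/reports.html']

    def parse(self, response):
        item = ReportItem()
        # collect the URLs of the files we want FilesPipeline to download
        item['file_urls'] = [response.urljoin(href) for href in
                             response.xpath('//a[contains(@href, ".pdf")]/@href').extract()]
        yield item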

8.2 Using the Images Pipeline

The ImagesPipeline is used much like the FilesPipeline; the only difference is the default field names: image_urls holds the image URLs of an item and images holds the information about the downloaded images. The advantage of the ImagesPipeline is the extra features you can configure, such as generating thumbnails and filtering out images that are too small. The ImagesPipeline uses Pillow to create thumbnails and normalise images to JPEG/RGB format, so you need to install that package; we recommend Pillow rather than the old PIL.

8.3 Usage example

To use the media pipelines, first enable them in the configuration file:

# use both the image and the file pipeline
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
    'scrapy.pipelines.files.FilesPipeline': 2,
}
FILES_STORE = '/path/to/valid/dir'    # where downloaded files are stored
IMAGES_STORE = '/path/to/valid/dir'   # where downloaded images are stored
# 90 days of delay for files expiration
FILES_EXPIRES = 90
# 30 days of delay for images expiration
IMAGES_EXPIRES = 30
# image thumbnails
IMAGES_THUMBS = {
    'small': (50, 50),
    'big': (270, 270),
}
# image filter: minimum height and width
IMAGES_MIN_HEIGHT = 110
IMAGES_MIN_WIDTH = 110

A project that downloads thumbnails as configured above produces images like these:

/full/63bbfea82b8880ed33cdb762aa11fab722a90a24.jpg
/thumbs/small/63bbfea82b8880ed33cdb762aa11fab722a90a24.jpg
/thumbs/big/63bbfea82b8880ed33cdb762aa11fab722a90a24.jpg

Then, every Item that is returned with file_urls or image_urls set will also get the corresponding files or images field populated:

import scrapy

class MyItem(scrapy.Item):
    # ... other item fields ...
    image_urls = scrapy.Field()
    images = scrapy.Field()

8.4 Customising the media pipeline

If you need more than this and want to customise the download behaviour, see the documentation on extending the media pipelines. Whether you extend FilesPipeline or ImagesPipeline, you only have to override two methods:
• get_media_requests(self, item, info), which yields a Request object for each media URL
• item_completed(self, results, item, info), which is called once the requests issued above have finished downloading; this is where you fill in the files or images field
Below is an example of extending ImagesPipeline. We only keep the path information and assign it to an image_paths field instead of the default images field:

import scrapy
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem

class MyImagesPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item

9 Scrapy Tutorial 09 - Deployment

This chapter introduces two ways of deploying crawlers. Running a crawler locally while developing and debugging is convenient, but in a production environment crawlers are heavy, long-running tasks, so a dedicated deployment method is recommended. The two main options are:
• Scrapyd (open source)
• Scrapy Cloud (hosted cloud service)

9.1 Deploying to Scrapyd

Scrapyd is an open source tool for running Scrapy spiders. It exposes an HTTP API server through which you can run and monitor your Scrapy spiders.
To deploy a crawler to Scrapyd you use the scrapyd-client deployment tool; the steps are demonstrated below.
Scrapyd normally runs as a daemon. It listens for spider run requests and spawns a process for each spider, effectively executing scrapy crawl myspider; it can also run several such processes in parallel, controlled by the max_proc and max_proc_per_cpu options.

Installation

Install with pip:

pip install scrapyd

On Ubuntu systems you can also install it from the package repository:

apt-get install scrapyd

Configuration

The configuration file is looked up in these locations, in order of increasing priority:
• /etc/scrapyd/scrapyd.conf (Unix)
• /etc/scrapyd/conf.d/* (in alphabetical order, Unix)
• scrapyd.conf
• ~/.scrapyd.conf (users home directory)
For the meaning of each parameter, see the scrapyd configuration documentation. A simple example:

[scrapyd]
eggs_dir    = eggs
logs_dir    = logs
items_dir   =
jobs_to_keep = 5
dbs_dir     = dbs
max_proc    = 0
max_proc_per_cpu = 4
finished_to_keep = 100
poll_interval = 5
bind_address = 0.0.0.0
http_port   = 6800
debug       = off
runner      = scrapyd.runner
application = scrapyd.app.application
launcher    = scrapyd.launcher.Launcher
webroot     = scrapyd.website.Root

[services]
schedule.json     = scrapyd.webservice.Schedule
cancel.json       = scrapyd.webservice.Cancel
addversion.json   = scrapyd.webservice.AddVersion
listprojects.json = scrapyd.webservice.ListProjects
listversions.json = scrapyd.webservice.ListVersions
listspiders.json  = scrapyd.webservice.ListSpiders
delproject.json   = scrapyd.webservice.DeleteProject
delversion.json   = scrapyd.webservice.DeleteVersion
listjobs.json     = scrapyd.webservice.ListJobs
daemonstatus.json = scrapyd.webservice.DaemonStatus

Deployment

The most convenient way to deploy is with scrapyd-client. Scrapyd-client is a client for scrapyd; it provides the scrapyd-deploy tool, which deploys your project to a Scrapyd server.
Deploying a project to Scrapyd normally involves two steps:
1. packaging the project into a python egg (you need setuptools installed for this), and
2. uploading that egg to the Scrapyd server through the addversion.json endpoint.
You can define the Scrapyd target in your project's configuration file scrapy.cfg:

[deploy:example]
url = http://scrapyd.example.com/api/scrapyd
username = scrapy
password = secret

To list all available targets, run:

scrapyd-deploy -l

To list all projects available on a given target, run:

scrapyd-deploy -L example

First cd into the project root directory, then deploy with a command of the form:

scrapyd-deploy <target> -p <project>

You can also define a default target and project so that you do not have to type them every time:

[deploy]
url = http://scrapyd.example.com/api/scrapyd
username = scrapy
password = secret
project = yourproject

With that in place you can simply run:

scrapyd-deploy

If you have several targets, you can deploy the project to all of them at once with:

scrapyd-deploy -a -p <project>
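Once the project is deployed, a crawl can also be started over Scrapyd's HTTP API; the schedule.json service listed in the configuration above takes the project and spider names. The host, project, and spider below are placeholders for your own values:

curl http://localhost:6800/schedule.json -d project=yourproject -d spider=somespider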

9.2 Deploying to Scrapy Cloud

Scrapy Cloud is a hosted cloud service maintained by Scrapinghub, the company behind Scrapy.
It removes the need to install and monitor your own servers and provides a very nice UI for managing your spiders; you can inspect the scraped Items, the logs, and the statistics there.
You can deploy spiders to Scrapy Cloud with the shub command line tool; see the official documentation for the details.
Scrapy Cloud is compatible with Scrapyd, so you can switch between the two as needed; the configuration file is the same scrapy.cfg that scrapyd-deploy reads.
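A rough sketch of the shub workflow, assuming you have a Scrapinghub account and API key; the project ID below is a placeholder:

$ pip install shub
$ shub login          # paste your API key when prompted
$ shub deploy 12345   # deploy the project in the current directory to project 12345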

10 Scrapy Tutorial 10 - Dynamically configured crawlers

Very often we need to crawl data from several websites, for example the latest news from a number of news sites, and store everything in the same database table. Do we really have to define a Spider class for every single site? We do not: we can keep the crawling rules in a configuration table or a configuration file, add or modify rules there dynamically, and crawl any number of sites without touching the program code.
To make this work we can no longer rely on the scrapy crawl test command used so far; we need to run Scrapy spiders programmatically. See the official documentation for the background.

10.1 Running Scrapy from a script

You can use the core API that Scrapy provides to start a crawl from your own script instead of the traditional scrapy crawl command.
Scrapy is built on top of the Twisted asynchronous networking framework, so it has to run inside the Twisted reactor.
The first option is scrapy.crawler.CrawlerProcess. This class starts a Twisted reactor for you, configures logging, and installs shutdown handlers; it is the class that all Scrapy commands use internally.

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# MySpider is whatever spider class your project defines
process = CrawlerProcess(get_project_settings())

process.crawl(MySpider)
process.start()  # the script will block here until the crawling is finished

Then you can run this script directly:

python run.py

Another, more powerful class is scrapy.crawler.CrawlerRunner, and it is the one we recommend:

from twisted.internet import reactor
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner()

d = runner.crawl(MySpider)
d.addBoth(lambda _: reactor.stop())
reactor.run()  # the script will block here until the crawling is finished

10.2 Running several spiders in one process

By default, Scrapy starts a new process every time you execute scrapy crawl. With the core API, however, we can run several spiders inside a single process:

import scrapy
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

configure_logging()
runner = CrawlerRunner()
runner.crawl(MySpider1)
runner.crawl(MySpider2)
d = runner.join()
d.addBoth(lambda _: reactor.stop())
reactor.run()  # the script will block here until all crawling jobs are finished

10.3 Defining the rules table

With script-based startup in place, we can start designing the dynamically configured crawler. The requirement is this: crawl the news articles we are interested in from two different websites and store them in the same article table.
First we define the rules table and the article table. I use the SQLAlchemy framework to map them to the database; later, only the rules table needs to be maintained.

#!/usr/bin/env python
# -*- encoding: utf-8 -*-
"""
Topic: database model entities
Desc :
"""
import datetime
from sqlalchemy.engine.url import URL
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import create_engine, Column, Integer, String, Text, DateTime
from coolscrapy.settings import DATABASE

Base = declarative_base()

class ArticleRule(Base):
    """a user-defined article crawling rule"""
    __tablename__ = 'article_rule'

    id = Column(Integer, primary_key=True)
    # rule name
    name = Column(String(30))
    # allowed domains, comma separated
    allow_domains = Column(String(100))
    # start URLs, comma separated
    start_urls = Column(String(100))
    # xpath of the "next page" link
    next_page = Column(String(100))
    # regular expression (substring) an article link must match
    allow_url = Column(String(200))
    # xpath of the region the article links are extracted from
    extract_from = Column(String(200))
    # xpath of the article title
    title_xpath = Column(String(100))
    # xpath of the article body
    body_xpath = Column(Text)
    # xpath of the publish time
    publish_time_xpath = Column(String(30))
    # source site of the article
    source_site = Column(String(30))
    # whether this rule is enabled
    enable = Column(Integer)

class Article(Base):
    """an article"""
    __tablename__ = 'articles'

    id = Column(Integer, primary_key=True)
    url = Column(String(100))
    title = Column(String(100))
    body = Column(Text)
    publish_time = Column(String(30))
    source_site = Column(String(30))
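The pipeline and startup script later in this chapter also call db_connect() and create_news_table() from coolscrapy.models. They are not shown in the snippet above; the real implementation is in the project source on GitHub, but a plausible sketch, assuming DATABASE in settings.py is a dict accepted by SQLAlchemy's URL helper, could look like this:

def db_connect():
    """create the SQLAlchemy engine from the DATABASE dict in settings.py (sketch)"""
    return create_engine(URL(**DATABASE))

def create_news_table(engine):
    """create the article_rule and articles tables if they do not exist yet (sketch)"""
    Base.metadata.create_all(engine)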

10.4 Defining the article Item

This one is trivial and needs no explanation:

import scrapy

class Article(scrapy.Item):
    title = scrapy.Field()
    url = scrapy.Field()
    body = scrapy.Field()
    publish_time = scrapy.Field()
    source_site = scrapy.Field()

10.5 Defining the ArticleSpider

Next we define the spider that actually crawls the articles. It is initialised with a rule instance and then uses the xpath expressions stored in that rule to extract the data:

from coolscrapy.utils import parse_text
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from coolscrapy.items import Article

class ArticleSpider(CrawlSpider):
    name = "article"

    def __init__(self, rule):
        self.rule = rule
        self.name = rule.name
        self.allowed_domains = rule.allow_domains.split(",")
        self.start_urls = rule.start_urls.split(",")
        rule_list = []
        # add the "next page" rule
        if rule.next_page:
            rule_list.append(Rule(LinkExtractor(restrict_xpaths=rule.next_page)))
        # add the rule that extracts the article links
        rule_list.append(Rule(LinkExtractor(
            allow=[rule.allow_url],
            restrict_xpaths=[rule.extract_from]),
            callback='parse_item'))
        self.rules = tuple(rule_list)
        super(ArticleSpider, self).__init__()

    def parse_item(self, response):
        self.log('Hi, this is an article page! %s' % response.url)

        article = Article()
        article["url"] = response.url

        title = response.xpath(self.rule.title_xpath).extract()
        article["title"] = parse_text(title, self.rule.name, 'title')

        body = response.xpath(self.rule.body_xpath).extract()
        article["body"] = parse_text(body, self.rule.name, 'body')

        publish_time = response.xpath(self.rule.publish_time_xpath).extract()
        article["publish_time"] = parse_text(publish_time, self.rule.name, 'publish_time')

        article["source_site"] = self.rule.source_site

        return article

Note that start_urls, rules, and the other attributes are all initialised from the rule object passed into the constructor, and the extraction expressions used in parse_item are also taken from that rule object.

10.6 Writing a pipeline that stores into the database

We again use SQLAlchemy to store the article Items in the database:

from contextlib import contextmanager
from sqlalchemy.orm import sessionmaker
# helpers and the Article model from the models module shown in 10.3
from coolscrapy.models import db_connect, create_news_table, Article

@contextmanager
def session_scope(Session):
    """Provide a transactional scope around a series of operations."""
    session = Session()
    try:
        yield session
        session.commit()
    except:
        session.rollback()
        raise
    finally:
        session.close()

class ArticleDataBasePipeline(object):
    """store articles into the database"""

    def __init__(self):
        engine = db_connect()
        create_news_table(engine)
        self.Session = sessionmaker(bind=engine)

    def open_spider(self, spider):
        """This method is called when the spider is opened."""
        pass

    def process_item(self, item, spider):
        a = Article(url=item["url"],
                    title=item["title"].encode("utf-8"),
                    publish_time=item["publish_time"].encode("utf-8"),
                    body=item["body"].encode("utf-8"),
                    source_site=item["source_site"].encode("utf-8"))
        with session_scope(self.Session) as session:
            session.add(a)

    def close_spider(self, spider):
        pass

10.7 Adapting the run.py startup script

We now adapt the earlier run.py slightly to get a startup script tailored to the article crawler:

import logging
from spiders.article_spider import ArticleSpider
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from scrapy.utils.log import configure_logging
from coolscrapy.models import db_connect
from coolscrapy.models import ArticleRule
from sqlalchemy.orm import sessionmaker

if __name__ == '__main__':
    settings = get_project_settings()
    configure_logging(settings)
    db = db_connect()
    Session = sessionmaker(bind=db)
    session = Session()
    rules = session.query(ArticleRule).filter(ArticleRule.enable == 1).all()
    session.close()
    runner = CrawlerRunner(settings)

    for rule in rules:
        # stop reactor when spider closes
        # runner.signals.connect(spider_closing, signal=signals.spider_closed)
        runner.crawl(ArticleSpider, rule=rule)

    d = runner.join()
    d.addBoth(lambda _: reactor.stop())

    # blocks process so always keep as the last statement
    reactor.run()
    logging.info('all finished.')

OK, everything is in place. We can now insert the rules for a few target websites into the ArticleRule table and, without adding a single line of code, crawl those sites. Of course, you could also build a small web front end to maintain the ArticleRule table, and the rules do not have to live in a database at all; a configuration file would work just as well.
You can find the complete project source for this chapter on my GitHub page.
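For example, a rule row for a hypothetical news site could be inserted like this; every URL and xpath below is made up purely for illustration:

from sqlalchemy.orm import sessionmaker
from coolscrapy.models import db_connect, ArticleRule

session = sessionmaker(bind=db_connect())()
session.add(ArticleRule(
    name='examplenews',
    allow_domains='news.example.com',
    start_urls='http://news.example.com/latest',
    next_page='//a[@class="next"]',
    allow_url=r'/article/\d+',
    extract_from='//div[@class="article-list"]',
    title_xpath='//h1/text()',
    body_xpath='//div[@class="content"]',
    publish_time_xpath='//span[@class="time"]/text()',
    source_site='Example News',
    enable=1))
session.commit()
session.close()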

11 Scrapy Tutorial 11 - Simulating a login

Sometimes a site requires you to be logged in before you can crawl it. With Scrapy you can simulate the login, keep the resulting cookies, and then crawl the pages that need authentication. Here I demonstrate the whole idea by logging into GitHub and crawling my own issue list.
Logging in means submitting a form, so first open GitHub's login page https://github.com/login in a browser and use the browser's debugging tools to see what has to be submitted at login time.
I use Chrome's developer tools: press F12, select the Network tab, and tick Preserve log. Then deliberately enter a wrong username and password to capture the form parameters being submitted and the URL the form is POSTed to.
Looking at the HTML source of the page, the form contains a hidden authenticity_token value; this has to be fetched first and then submitted together with the username and password.

11.1 Overriding the start_requests method

To work with cookies, the first step is to make sure they are enabled. By default Scrapy uses the CookiesMiddleware, which is already turned on; if you disabled it earlier, turn it back on:

COOKIES_ENABLED = True

We first need to open the login page and grab the authenticity_token value, so we override the start_requests method:

# override the spider's start_requests to issue a custom request;
# when it succeeds the callback is invoked
def start_requests(self):
    return [Request("https://github.com/login",
                    meta={'cookiejar': 1}, callback=self.post_login)]

# FormRequest
def post_login(self, response):
    # first grab the hidden form field authenticity_token
    authenticity_token = response.xpath(
        '//input[@name="authenticity_token"]/@value').extract_first()
    logging.info('authenticity_token=' + authenticity_token)
    pass

The start_requests method specifies the callback that will read the hidden form value authenticity_token. We also pass a cookiejar key in the Request meta, which hands a cookie jar identifier on to the callback.

11.2 Using FormRequest

Scrapy provides the FormRequest class precisely for submitting forms:

# to look like a browser, we define the http headers ourselves
post_headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "zh-CN,zh;q=0.8,en;q=0.6",
    "Cache-Control": "no-cache",
    "Connection": "keep-alive",
    "Content-Type": "application/x-www-form-urlencoded",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.75 Safari/537.36",
    "Referer": "https://github.com/",
}

# use FormRequest to simulate the form submission
def post_login(self, response):
    # first grab the hidden form field authenticity_token
    authenticity_token = response.xpath(
        '//input[@name="authenticity_token"]/@value').extract_first()
    logging.info('authenticity_token=' + authenticity_token)
    # FormRequest.from_response is a helper provided by Scrapy for POSTing forms.
    # After a successful login the after_login callback is invoked; if the url is
    # the same as the page of the original Request it can be omitted.
    return [FormRequest.from_response(response,
                                      url='https://github.com/session',
                                      meta={'cookiejar': response.meta['cookiejar']},
                                      headers=self.post_headers,  # note the headers here
                                      formdata={
                                          'utf8': '✓',
                                          'login': 'yidao620c',
                                          'password': '******',
                                          'authenticity_token': authenticity_token
                                      },
                                      callback=self.after_login,
                                      dont_filter=True
                                      )]

def after_login(self, response):
    pass

FormRequest.from_response() lets you specify the URL the form is submitted to, the extra request headers, and the form values; note that we again pass the cookie jar identifier through meta. It also takes a callback that is invoked after the login succeeds. Here is our implementation:

def after_login(self, response):
    # once logged in, start visiting the pages we actually want to crawl
    for url in self.start_urls:
        # because we defined Rules above, generating the initial Requests is enough
        yield Request(url, meta={'cookiejar': response.meta['cookiejar']})

Here start_urls defines the initial pages and we simply generate Requests for them; the actual crawling rules and the "next page" rule were defined earlier in the Rule objects. Note that we keep passing cookiejar along so that the cookie information is carried when those initial pages are visited.

11.3 Overriding _requests_to_follow

There is one pitfall that cost me quite a while: the spider we define extends CrawlSpider, which automatically follows matching links internally, but when it visits those links it does not carry the cookie along. The fix is to override its _requests_to_follow() method:

def _requests_to_follow(self, response):
    """overridden to keep propagating the cookiejar"""
    if not isinstance(response, HtmlResponse):
        return
    seen = set()
    for n, rule in enumerate(self._rules):
        links = [l for l in rule.link_extractor.extract_links(response)
                 if l not in seen]
        if links and rule.process_links:
            links = rule.process_links(links)
        for link in links:
            seen.add(link)
            r = Request(url=link.url, callback=self._response_downloaded)
            # the next statement is the part I changed
            r.meta.update(rule=n, link_text=link.text,
                          cookiejar=response.meta['cookiejar'])
            yield rule.process_request(r)

11.4 The page processing method

In the crawling Rule we registered parse_page as the callback for every followed link; this is where each issue page is finally processed and its information extracted:

def parse_page(self, response):
    """links (including the "next page" link) are handled automatically by the LinkExtractor"""
    logging.info(u'--------------message separator--------------')
    logging.info(response.url)
    issue_title = response.xpath(
        '//span[@class="js-issue-title"]/text()').extract_first()
    logging.info(u'issue_title: ' + issue_title.encode('utf-8'))

11.5 The complete source

#!/usr/bin/env python
# -*- encoding: utf-8 -*-
"""
Topic: login crawler
Desc : log into https://github.com and crawl all of my own issues
tips: when debugging the POST form in Chrome, tick Preserve log and Disable cache
"""
import logging
import re
import sys
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.http import Request, FormRequest, HtmlResponse

logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s %(filename)s[line:%(lineno)d] %(levelname)s %(message)s',
                    datefmt='%Y-%m-%d %H:%M:%S',
                    handlers=[logging.StreamHandler(sys.stdout)])

class GithubSpider(CrawlSpider):
    name = "github"
    allowed_domains = ["github.com"]
    start_urls = [
        'https://github.com/issues',
    ]
    rules = (
        # the issue list
        Rule(LinkExtractor(allow=('/issues/\d+',),
                           restrict_xpaths='//ul[starts-with(@class, "table-list")]/li/div[2]/a[2]'),
             callback='parse_page'),
        # next page; If callback is None follow defaults to True, otherwise it defaults to False
        Rule(LinkExtractor(restrict_xpaths='//a[@class="next_page"]')),
    )
    post_headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate",
        "Accept-Language": "zh-CN,zh;q=0.8,en;q=0.6",
        "Cache-Control": "no-cache",
        "Connection": "keep-alive",
        "Content-Type": "application/x-www-form-urlencoded",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.75 Safari/537.36",
        "Referer": "https://github.com/",
    }

    # override start_requests to issue a custom request;
    # when it succeeds the callback is invoked
    def start_requests(self):
        return [Request("https://github.com/login",
                        meta={'cookiejar': 1}, callback=self.post_login)]

    # FormRequest
    def post_login(self, response):
        # first grab the hidden form field authenticity_token
        authenticity_token = response.xpath(
            '//input[@name="authenticity_token"]/@value').extract_first()
        logging.info('authenticity_token=' + authenticity_token)
        # FormRequest.from_response is a helper provided by Scrapy for POSTing forms.
        # After a successful login the after_login callback is invoked.
        return [FormRequest.from_response(response,
                                          url='https://github.com/session',
                                          meta={'cookiejar': response.meta['cookiejar']},
                                          headers=self.post_headers,  # note the headers here
                                          formdata={
                                              'utf8': '✓',
                                              'login': 'yidao620c',
                                              'password': '******',
                                              'authenticity_token': authenticity_token
                                          },
                                          callback=self.after_login,
                                          dont_filter=True
                                          )]

    def after_login(self, response):
        for url in self.start_urls:
            # because we defined Rules above, generating the initial Requests is enough
            yield Request(url, meta={'cookiejar': response.meta['cookiejar']})

    def parse_page(self, response):
        """links (including the "next page" link) are handled automatically by the LinkExtractor"""
        logging.info(u'--------------message separator--------------')
        logging.info(response.url)
        issue_title = response.xpath(
            '//span[@class="js-issue-title"]/text()').extract_first()
        logging.info(u'issue_title: ' + issue_title.encode('utf-8'))

    def _requests_to_follow(self, response):
        """overridden to keep propagating the cookiejar"""
        if not isinstance(response, HtmlResponse):
            return
        seen = set()
        for n, rule in enumerate(self._rules):
            links = [l for l in rule.link_extractor.extract_links(response)
                     if l not in seen]
            if links and rule.process_links:
                links = rule.process_links(links)
            for link in links:
                seen.add(link)
                r = Request(url=link.url, callback=self._response_downloaded)
                # the next statement is the part I changed
                r.meta.update(rule=n, link_text=link.text,
                              cookiejar=response.meta['cookiejar'])
                yield rule.process_request(r)

You can find the complete example project on my GitHub page; there is also another example there that logs into the iteye website automatically.

12 Scrapy Tutorial 12 - Scraping dynamic websites

Everything we crawled so far were static pages: you open a link and all of its content is already rendered. On today's big portal sites, however, most pages are dynamic. Typical examples are JD and Taobao, where the product and category listings are rendered by js and loaded through Ajax; the page you download for a link still has content being loaded asynchronously, so the approach used before cannot see that asynchronously loaded content.
Rendering pages with Javascript is extremely common, and handling a page that relies heavily on Javascript is one of the most frequent problems Scrapy developers run into. This chapter explains how to use scrapy-splash inside a Scrapy crawler to deal with the Javascript in such pages.

12.1 Introducing scrapy-splash

scrapy-splash combines Splash's javascript rendering with Scrapy, so that Scrapy can scrape dynamic pages.
Splash is a javascript rendering service: a lightweight browser exposed through an HTTP API, built on the Twisted and QT frameworks and written in Python. So the first step is to get a Splash instance installed.

12.2 Installing docker

The official recommendation is to install Splash as a docker container, so first install docker itself; see the official installation documentation. Here I install it on Ubuntu 12.04 LTS.
First upgrade the kernel; docker requires a 3.13 kernel:

$ sudo apt-get update
$ sudo apt-get install linux-image-generic-lts-trusty
$ sudo reboot

Install the CA certificates:

$ sudo apt-get install apt-transport-https ca-certificates

Add the new GPG key:

$ sudo apt-key adv --keyserver hkp://p80.pool.sks-keyservers.net:80 --recv-keys 58118E89F3A912897C070ADBF76221572C52609D

Open /etc/apt/sources.list.d/docker.list (create it if it does not exist), delete whatever it already contains, and add the following line:

deb https://apt.dockerproject.org/repo ubuntu-precise main

Update APT:

$ sudo apt-get update
$ sudo apt-get purge lxc-docker
$ apt-cache policy docker-engine

Install docker:

$ sudo apt-get install docker-engine

Start the docker service:

$ sudo service docker start

Verify that it is working:

$ sudo docker run hello-world

This command downloads a small test image and runs it in a container; the container prints a message and then exits.

12.3 Installing Splash

Pull the image:

$ sudo docker pull scrapinghub/splash

Start the container:

$ sudo docker run -p 5023:5023 -p 8050:8050 -p 8051:8051 scrapinghub/splash

Splash is now reachable on 0.0.0.0:8050 (http), 8051 (https), and 5023 (telnet).
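Before wiring Splash into Scrapy you can check the service with a plain HTTP call to its render.html endpoint; the host and target URL here are just examples:

$ curl 'http://localhost:8050/render.html?url=http://www.jd.com/&wait=0.5'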

12.4 Installing scrapy-splash

Install with pip:

$ pip install scrapy-splash

12.5 Configuring scrapy-splash

Add the Splash server address to your Scrapy project's settings.py:

SPLASH_URL='http://192.168.203.92:8050'

Enable the Splash middlewares, also in settings.py, via DOWNLOADER_MIDDLEWARES, and adjust the order of HttpCompressionMiddleware:

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

By default HttpProxyMiddleware has order 750, so if you use it, place it after the Splash middlewares.
Set Splash's own duplicate filter:

DUPEFILTER_CLASS='scrapy_splash.SplashAwareDupeFilter'

If you use Scrapy's HTTP cache together with Splash, you also need a Splash-aware cache storage backend; scrapy-splash ships a subclass of scrapy.contrib.httpcache.FilesystemCacheStorage:

HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

If you want to use a different cache storage, you need to subclass it yourself and replace every call to scrapy.util.request.request_fingerprint with scrapy_splash.splash_request_fingerprint.

12.6 Using scrapy-splash

SplashRequest

The simplest way to send a render request is scrapy_splash.SplashRequest, and this is normally the one you should choose:

yield SplashRequest(url, self.parse_result,
    args={
        # optional; parameters passed to Splash HTTP API
        'wait': 0.5,
        # 'url' is prefilled from request url
        # 'http_method' is set to 'POST' for POST requests
        # 'body' is set to request body for POST requests
    },
    endpoint='render.json',  # optional; default is render.html
    splash_url='',           # optional; overrides SPLASH_URL
    slot_policy=scrapy_splash.SlotPolicy.PER_DOMAIN,  # optional
)

Alternatively, you can get the same effect by passing the splash key in the meta of an ordinary scrapy request:

yield scrapy.Request(url, self.parse_result, meta={
    'splash': {
        'args': {
            # set rendering arguments here
            'html': 1,
            'png': 1,
            # 'url' is prefilled from request url
            # 'http_method' is set to 'POST' for POST requests
            # 'body' is set to request body for POST requests
        },
        # optional parameters
        'endpoint': 'render.json',  # optional; default is render.json
        'splash_url': '',           # optional; overrides SPLASH_URL
        'slot_policy': scrapy_splash.SlotPolicy.PER_DOMAIN,
        'splash_headers': {},       # optional; a dict with headers sent to Splash
        'dont_process_response': True,  # optional, default is False
        'dont_send_headers': True,  # optional, default is False
        'magic_response': False,    # optional, default is True
    }
})

In terms of the Splash API, SplashRequest is simply a convenient helper that fills in request.meta['splash'] for you:
• meta['splash']['args'] holds the arguments sent to Splash.
• meta['splash']['endpoint'] selects the Splash endpoint to use; the default is render.html.
• meta['splash']['splash_url'] overrides the Splash URL configured in settings.py.
• meta['splash']['splash_headers'] lets you add or change the HTTP headers sent to the Splash server; note that this does not change the HTTP headers sent to the remote web site.
• meta['splash']['dont_send_headers'] set this to True if you do not want to forward headers to Splash.
• meta['splash']['slot_policy'] lets you customise how requests to Splash are synchronised (scheduled into download slots).
• meta['splash']['dont_process_response'] when set to True, SplashMiddleware will not replace the plain scrapy.Response; by default a SplashResponse subclass such as SplashTextResponse is returned.
• meta['splash']['magic_response'] defaults to True, in which case Splash automatically sets several attributes of the Response, such as response.headers and response.body.
If you want to submit a form through Splash, use scrapy_splash.SplashFormRequest; it works the same way as SplashRequest.

Responses

scrapy-splash returns different Response subclasses depending on the request:
• SplashResponse for binary responses, for example the response of /render.png
• SplashTextResponse for text responses, for example the response of /render.html
• SplashJsonResponse for JSON responses, for example the response of /render.json or of /execute with a Lua script
If you only want standard Response objects, set meta['splash']['dont_process_response'] = True.
All of these Responses set response.url to the original request URL (that is, the page you asked to render) rather than the Splash endpoint URL; the actual address can be read from response.real_url.
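A small illustration of the url/real_url distinction inside a callback; the method name is arbitrary:

def parse_result(self, response):
    # response.url is the page that was rendered; response.real_url points at the Splash endpoint
    self.logger.info('rendered %s (fetched through %s)' % (response.url, response.real_url))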

Handling sessions

Splash itself is stateless, so to keep a session with scrapy-splash you have to write a Lua script and use the /execute endpoint:

function main(splash)
    splash:init_cookies(splash.args.cookies)

    -- ... your script

    return {
        cookies = splash:get_cookies(),
        -- ... other results, e.g. html
    }
end

The standard Scrapy session arguments can be used with SplashRequest, which adds the cookies to the current Splash cookiejar.
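Putting the two pieces together, a spider that keeps its session through the /execute endpoint might look like the sketch below. This is an assumed shape rather than code from this tutorial: the Lua script is the one shown above plus a go/wait/html sequence, and the cookies argument is expected to be filled in by SplashCookiesMiddleware.

import scrapy
from scrapy_splash import SplashRequest

LUA_SCRIPT = """
function main(splash)
    splash:init_cookies(splash.args.cookies)
    assert(splash:go(splash.args.url))
    assert(splash:wait(0.5))
    return {cookies = splash:get_cookies(), html = splash:html()}
end
"""

class SessionSpider(scrapy.Spider):
    name = 'session_demo'
    start_urls = ['http://www.jd.com/']

    def start_requests(self):
        for url in self.start_urls:
            # run the Lua script through the /execute endpoint
            yield SplashRequest(url, self.parse_result,
                                endpoint='execute',
                                args={'lua_source': LUA_SCRIPT})

    def parse_result(self, response):
        self.logger.info('rendered %s' % response.url)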

12.7 A concrete example

Let us walk through a real example to show how this is used: I crawl the asynchronously loaded content of JD's home page.
When JD's home page opens, only the navigation menu is delivered up front; the rest of the home page content is loaded asynchronously, including a section titled "猜你喜欢" ("guess what you like"). I will crawl exactly those four characters to show the difference between a plain Scrapy fetch and a fetch that uses Splash to load the asynchronous content.
First a simple test Spider that does not use splash:

class TestSpider(scrapy.Spider):
    name = "test"
    allowed_domains = ["jd.com"]
    start_urls = [
        "http://www.jd.com/"
    ]

    def parse(self, response):
        logging.info(u'---------plain fetch of the JD home page---------')
        guessyou = response.xpath('//div[@id="guessyou"]/div[1]/h2/text()').extract_first()
        logging.info(u"find: %s" % guessyou)
        logging.info(u'---------success----------')

Running it gives:

2016-04-18 14:42:44 test_spider.py[line:20] INFO ---------plain fetch of the JD home page---------
2016-04-18 14:42:44 test_spider.py[line:22] INFO find: None
2016-04-18 14:42:44 test_spider.py[line:23] INFO ---------success----------

The "猜你喜欢" text is not found. Now the same crawl using splash:

import scrapy
from scrapy_splash import SplashRequest

class JsSpider(scrapy.Spider):
    name = "jd"
    allowed_domains = ["jd.com"]
    start_urls = [
        "http://www.jd.com/"
    ]

    def start_requests(self):
        splash_args = {
            'wait': 0.5,
        }
        for url in self.start_urls:
            yield SplashRequest(url, self.parse_result, endpoint='render.html',
                                args=splash_args)

    def parse_result(self, response):
        logging.info(u'---------fetching JD home page async content through splash---------')
        guessyou = response.xpath('//div[@id="guessyou"]/div[1]/h2/text()').extract_first()
        logging.info(u"find: %s" % guessyou)
        logging.info(u'---------success----------')

Running it gives:

2016-04-18 14:42:51 js_spider.py[line:36] INFO ---------fetching JD home page async content through splash---------
2016-04-18 14:42:51 js_spider.py[line:38] INFO find: 猜你喜欢
2016-04-18 14:42:51 js_spider.py[line:39] INFO ---------success----------

As you can see, this time the output does contain "猜你喜欢": the asynchronously loaded content has been crawled successfully.

13 Contact me

• Email: [email protected]
• Blog: https://www.xncoding.com/
• GitHub: https://github.com/yidao620c

If this tutorial helped you, scan the WeChat QR code to buy the author a cup of coffee.