Spidering Hacks
Total Page:16
File Type:pdf, Size:1020Kb
SPIDERING HACKSTM 100 Industrial-Strength Tips & Tools Kevin Hemenway & Tara Calishain Spidering Hacks Spidering Hacks By Tara Calishain, Kevin Hemenway Publisher: O'Reilly Pub Date: October 2003 ISBN: 0−596−00577−6 Pages: 424 Copyright Credits About the Authors Contributors Preface Why Spidering Hacks? How This Book Is Organized How to Use This Book Conventions Used in This Book How to Contact Us Got a Hack? Chapter 1. Walking Softly Hacks #1−7 Hack 1. A Crash Course in Spidering and Scraping Hack 2. Best Practices for You and Your Spider Hack 3. Anatomy of an HTML Page Hack 4. Registering Your Spider Hack 5. Preempting Discovery Hack 6. Keeping Your Spider Out of Sticky Situations Hack 7. Finding the Patterns of Identifiers Chapter 2. Assembling a Toolbox Hacks #8−32 Perl Modules Resources You May Find Helpful Hack 8. Installing Perl Modules Hack 9. Simply Fetching with LWP::Simple Hack 10. More Involved Requests with LWP::UserAgent Hack 11. Adding HTTP Headers to Your Request Hack 12. Posting Form Data with LWP Hack 13. Authentication, Cookies, and Proxies Hack 14. Handling Relative and Absolute URLs Hack 15. Secured Access and Browser Attributes Hack 16. Respecting Your Scrapee's Bandwidth Hack 17. Respecting robots.txt Hack 18. Adding Progress Bars to Your Scripts Hack 19. Scraping with HTML::TreeBuilder Hack 20. Parsing with HTML::TokeParser Hack 21. WWW::Mechanize 101 Hack 22. Scraping with WWW::Mechanize Hack 23. In Praise of Regular Expressions Hack 24. Painless RSS with Template::Extract Hack 25. A Quick Introduction to XPath Hack 26. Downloading with curl and wget Hack 27. More Advanced wget Techniques 1 Spidering Hacks Hack 28. Using Pipes to Chain Commands Hack 29. Running Multiple Utilities at Once Hack 30. Utilizing the Web Scraping Proxy Hack 31. Being Warned When Things Go Wrong Hack 32. Being Adaptive to Site Redesigns Chapter 3. Collecting Media Files Hacks #33−42 Hack 33. Detective Case Study: Newgrounds Hack 34. Detective Case Study: iFilm Hack 35. Downloading Movies from the Library of Congress Hack 36. Downloading Images from Webshots Hack 37. Downloading Comics with dailystrips Hack 38. Archiving Your Favorite Webcams Hack 39. News Wallpaper for Your Site Hack 40. Saving Only POP3 Email Attachments Hack 41. Downloading MP3s from a Playlist Hack 42. Downloading from Usenet with nget Chapter 4. Gleaning Data from Databases Hacks #43−89 Hack 43. Archiving Yahoo! Groups Messages with yahoo2mbox Hack 44. Archiving Yahoo! Groups Messages with WWW::Yahoo::Groups Hack 45. Gleaning Buzz from Yahoo! Hack 46. Spidering the Yahoo! Catalog Hack 47. Tracking Additions to Yahoo! Hack 48. Scattersearch with Yahoo! and Google Hack 49. Yahoo! Directory Mindshare in Google Hack 50. Weblog−Free Google Results Hack 51. Spidering, Google, and Multiple Domains Hack 52. Scraping Amazon.com Product Reviews Hack 53. Receive an Email Alert for Newly Added Amazon.com Reviews Hack 54. Scraping Amazon.com Customer Advice Hack 55. Publishing Amazon.com Associates Statistics Hack 56. Sorting Amazon.com Recommendations by Rating Hack 57. Related Amazon.com Products with Alexa Hack 58. Scraping Alexa's Competitive Data with Java Hack 59. Finding Album Information with FreeDB and Amazon.com Hack 60. Expanding Your Musical Tastes Hack 61. Saving Daily Horoscopes to Your iPod Hack 62. Graphing Data with RRDTOOL Hack 63. Stocking Up on Financial Quotes Hack 64. Super Author Searching Hack 65. Mapping O'Reilly Best Sellers to Library Popularity Hack 66. Using All Consuming to Get Book Lists Hack 67. Tracking Packages with FedEx Hack 68. Checking Blogs for New Comments Hack 69. Aggregating RSS and Posting Changes Hack 70. Using the Link Cosmos of Technorati Hack 71. Finding Related RSS Feeds Hack 72. Automatically Finding Blogs of Interest Hack 73. Scraping TV Listings Hack 74. What's Your Visitor's Weather Like? Hack 75. Trendspotting with Geotargeting Hack 76. Getting the Best Travel Route by Train Hack 77. Geographic Distance and Back Again 2 Spidering Hacks Hack 78. Super Word Lookup Hack 79. Word Associations with Lexical Freenet Hack 80. Reformatting Bugtraq Reports Hack 81. Keeping Tabs on the Web via Email Hack 82. Publish IE's Favorites to Your Web Site Hack 83. Spidering GameStop.com Game Prices Hack 84. Bargain Hunting with PHP Hack 85. Aggregating Multiple Search Engine Results Hack 86. Robot Karaoke Hack 87. Searching the Better Business Bureau Hack 88. Searching for Health Inspections Hack 89. Filtering for the Naughties Chapter 5. Maintaining Your Collections Hacks #90−93 Hack 90. Using cron to Automate Tasks Hack 91. Scheduling Tasks Without cron Hack 92. Mirroring Web Sites with wget and rsync Hack 93. Accumulating Search Results Over Time Chapter 6. Giving Back to the World Hacks #94−100 Hack 94. Using XML::RSS to Repurpose Data Hack 95. Placing RSS Headlines on Your Site Hack 96. Making Your Resources Scrapable with Regular Expressions Hack 97. Making Your Resources Scrapable with a REST Interface Hack 98. Making Your Resources Scrapable with XML−RPC Hack 99. Creating an IM Interface Hack 100. Going Beyond the Book Colophon Index Book: Spidering Hacks Copyright © 2004 O'Reilly & Associates, Inc. Printed in the United States of America. Published by O'Reilly & Associates, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472. O'Reilly & Associates books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safari.oreilly.com). For more information, contact our corporate/institutional sales department: (800) 998−9938 or [email protected]. Nutshell Handbook, the Nutshell Handbook logo, and the O'Reilly logo are registered trademarks of O'Reilly & Associates, Inc. Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O'Reilly & Associates, Inc. was aware of a trademark claim, the designations have been printed in caps or initial caps. The association between the image of a flex scraper and the topic of spidering is a trademark of O'Reilly & Associates, Inc. While every precaution has been taken in the preparation of this book, the publisher and authors assume no responsibility for errors or omissions, or for damages resulting from the use of the information 3 Spidering Hacks contained herein. The trademarks "Hacks Books" and "The Hacks Series," and related trade dress, are owned by O'Reilly & Associates, Inc., in the United States and other countries, and may not be used without written permission. All other trademarks are property of their respective owners. URL spiderhks−PREFACE−1 4 Spidering Hacks Book: Spidering Hacks Section: Credits About the Authors Kevin Hemenway, coauthor of Mac OS X Hacks (O'Reilly), is better known as Morbus Iff, the creator of disobey.com, which bills itself as "content for the discontented." Publisher and developer of more home cooking than you could ever imagine (like the popular open source aggregator AmphetaDesk, the best−kept gaming secret Gamegrene.com, the popular Ghost Sites and Nonsense Network, giggle−inducing articles at the O'Reilly Network, etc.), he has a fondness for expensive bags of beef jerky. Living in Concord, New Hampshire, he can be reached at [email protected]. Tara Calishain is the editor of the online newsletter ResearchBuzz (http://www.researchbuzz.com) and author or coauthor of several books, including the O'Reilly best seller Google Hacks. A search engine enthusiast for many years, she began her foray into the world of Perl when Google released its API in April 2002. URL /spiderhks−PREFACE−2−SECT−1 About the Authors 5 Spidering Hacks Book: Spidering Hacks Section: Credits Contributors The following people contributed their hacks, writing, and inspiration to this book: • Jacek Artymiak (http://www.artymiak.com) is a freelance consultant, developer and writer. He's been programming computers since 1986, starting with Sinclair ZX Spectrum. His interests include network security, computer graphics and animation, and multimedia. Jacek lives in Lublin, Poland, with his wife Gosia and can be reached at [email protected]. • Chris Ball (http://printf.net) holds a BSc in Computation from UMIST, England, and works on machine learning problems as a researcher in Cambridge University's Inference Group. In his spare time, he can be found not answering email and hanging rooks at chess tournaments. You can email him at [email protected]. • Paul Bausch (http://www.onfocus.com) is a web application developer and the cocreator of Blogger (http://www.blogger.com), the popular weblog software. His favorite site to spider is Amazon.com, the subject of his book Amazon Hacks (O'Reilly). He's also the coauthor of We Blog: Publishing Online with Weblogs (John Wiley & Sons), and he posts thoughts and photos almost daily to his personal weblog: onfocus. • Erik Benson (http://www.erikbenson.com) is the Technical Program Manager of Personalized Merchandizing at Amazon.com (http://www.amazon.com). He runs All Consuming (http://www.allconsuming.net) and has a weblog at http://www.erikbenson.com. • Adam Bregenzer is a programmer in Atlanta, GA. He has been programming for most of his life and is an advocate of Unix−like operating systems. In his free time, he plots to take over the world . and eat Mexican food. • Daniel Biddle spends far too much time reading abstruse technical documents and solving people's programming problems on IRC at irc.freenode.net, where he goes by the nickname deltab. He can be contacted at [email protected]. • Sean M. Burke is the author of O'Reilly's Perl & LWP, RTF Pocket Guide, and many of the articles in the Best of the Perl Journal volumes. An active member in the Perl open source community, he is one of CPAN's most prolific module authors and an authority on markup languages.