27_597582_bindex.qxd 8/5/05 10:15 PM Page 585

Index

NUMERICS AmazonSearchScraper class, 405–406 0.3 specification, Atom, 39, 211 AmazonWishlistScraper class, 409–411 1.0 specification, RSS, 39 America Online Instant Messenger. See AIM (AOL 2.0 specification, RSS, 39 Instant Messenger) 4Suite package, 20–21, 433 AmphetaDesk feed aggregator, 8 AmphetaRate service, 481 A AOL Instant Messenger. See AIM “A Plan for Spam” (Graham), 451 Apache Web server access logs, Apache, 304–312 compression, enabling, 231–232 ACCESS_LOG constant, 305 conditional GET, enabling, 235–236 ACCESS_RE constant, 310 configuring MIME types for feeds, 25 Active Channels, Internet Explorer, 4 monitoring access logs, 304–312 addConnection() method, 117–118 monitoring error logs, 301–304 addTrack() method, 192 API_BLOGID constant, 525 AddType configuration, Apache, 25 API_PASSWD constant, 525 AdiumX IM client, 105 API_URL constant, 525 Advanced Packaging Tool, Debian , 18 API_USER constant, 525 agg01_pollsubs.py program, 52–59 append_entry() method, 297–298 agg01_subscribe.py program, 50–51 AppleScript agg02_pollsubs.py program, 68–77 accessing from Python, 160 agg03_emailsubs.py program, 80–86 importing MP3s into iTunes, 189–194 agg04_emailsubs.py program, 86–91 application ID, Yahoo! Web services, 384–385, 387 agg05_im_subs.py program, 105–111 ARCHIVE_FN constant, 517 agg06_aggbot.py program, 114–124 ASSOCIATE_TAG constant, 508–509 AggBot class, 116–124 “ATOM 0.3” links, 23 agglib.py module, 77–79, 112–114, 145–146 Atom feeds. See also feeds aggregator. See feed aggregator conversion to RSS feeds AIM (AOL Instant Messenger). See also IM (Instant commercial programs for, 444 Messenger) definition of, 418 definition of, 93–94 with feedparser, 437–443 handling connections from, 97–101 with XSLT, building normalizer for, 420–432 protocols for, 94 with XSLT, data model for, 418–419 AIMConnection class, 97–101 with XSLT, processor for, 433 Amazon Web Services (AWS) with XSLT, testing normalizer for, 434–436 definition of, 394 definition of, 3, 13–18 inserting related items intoCOPYRIGHTED feeds, 506–512 generating MATERIAL from HTML content, 209–217 product searches, building feeds from, 403–407 0.3 specification, 39, 211 queries, building, 395–398 “Atom” links, 23 registering as AWS developer, 395, 506 ATOM_DATE_FMT template, 248 REST-style interface for, 395 ATOM_ENTRY_TMPL template searches, building feeds from, 398–403 amazon.lib.py module, 399–400 wish list items, building feeds from, 407–412 ch07_feedmaker.py program, 557 wish lists, searching for, 396–398 ch18_mod_event_feed_filter.py program, XSLT, building feeds using, 413 560 AmazonAdFeed class, 508–511 mailfeedlib.py module, 363–364 AMAZON_KEY constant, 508–509 scraperlib.py module, 248 amazon.lib.py module, 398–403 atomfeed feed generator, 223 AmazonScraper class, 398–402 ATOM_FEED_TMPL template, 211, 248, 560 amazon_search() method, 510–511 AtomTemplateDocWrapper class, 215 27_597582_bindex.qxd 8/5/05 10:15 PM Page 586

586 Index ■ B–C

audio feeds, creating, 158–167 inserting bookmarks into feeds, 498–505 tag, 17 merging feeds from, 485–486 autodiscovery of feeds, 26–32 bookmark_tailgrep() function, 292–294, 302–303 AvantGo program, 167 Bradbury, Nick (FeedDemon feed aggregator), 64–65 AWS. See Amazon Web Services buddy list. See IM (Instant Messenger) AWS_ID constant, 396, 404 build() method, 532–533 AWS_INDEX constant, 404 build_guid_for_message() method, 366 AWS_KEYWORDS constant, 404 buildNotesForEntries() function, 152–153 AWS_URL constant, 396, 399 buildNotesForFeeds() function, 150–152 buildPluckerDocument() function, 138–139 B Babu, Vattekkat Satheesh (spycyroll feed aggregator), C 61 Cache-Control header, 236–237 “Bake, Don’t Fry” (Swartz), 227 cached_feed_reader.py module, 583–584 baked feeds, 226–229 CACHE_DIR constant, 577 bandwidth, hosting feeds and caching caching feeds, 229–230 dynamically generated feeds, 229–230 compressing feeds, 230–233 shared feeds, 575–584 delivering partial feeds, 240 calendar events how often to regenerate feeds, 226–227 finding, 541–543, 570 issues involving, 225–226 harvesting from feeds, 564–569 metadata with update schedule hints, 237–239 parsing from HTML into feeds, 548–557 redundant downloads, reducing, 233–237 rendering as HTML pages, 543–548 uploading feeds with FTP, 227–229, 240 Carlo’s Bootleg RSS Fedpalooza Web site, 287 BASE_HREF constant, 210, 250, 260–261 CDF (Channel Definition Format), 4 Bayes class, 452, 453 cgi_buffer package (Nottingham), 232–233, 236 BayesFilter class, 465–467 ch05_ipod_notes_aggregator.py program, Bayesian classifier 146–153 filtering feeds using, 452–459, 467–469 ch05_plucker_aggregator.py program, 137–140 suggesting feed entries using, 463–467 ch05_speech_aggregator.py program, 161–166 training with feedback mechanism, 459–463 ch06_enclosure_finder.py program, 171 Bayesian statistics, 450–451 ch06_podcast_tuner.py program, 176–180, Beautiful Soup module, 288 188–189, 192–194 BitTorrent, 180–189 ch07_feedmaker.py program, 209–215, 217–219 BitTorrentDownloader class, 182–188 ch08_feed_fetch.py program, 229–230 blagg feed aggregator (Dornfest), 61 ch08_feedfilter.py program, 232–233 blending feeds ch08_ftpupload.py program, 228–229 adding related links to feeds, 488–497 ch09_fcc_scraper.py program, 270–273 commercial programs for, 513 ch09_loc_scraper.py program, 259–263 inserting related items into feeds, 506–512 ch09_whitehouse_scraper.py program, 284–286 merging feeds, 483–487 ch10_apache_error_feed.py program, 301–304 mixing daily links into feeds, 497–505 ch10_apache_referrer_feed.py program, 305–312 Blogdex diffusion index, merging feeds from, 485–486 ch10_bookmark_tailgrep.py program, 292–294 BLOG_FN constant, 517 ch10_logins_feed.py program, 312–317 blog-like aggregators, 7–8 ch10_pytailgrep.py program, 290–291 blogs. See Web logs ch11_cvs_history_scraper.py program, 333–339 Blosxom blogging package, 223 ch11_svn_log_scraper.py program, 343–348 blosxom blogging package (Dornfest), 61 ch12_imaplib_example.py program, 356–357 Bond, Julian (Gnews2RSS, Google News scraper), 412 ch12_mailing_list_feed.py program, 369–372 bookmarks ch12_poplib_email_example.py program, 358–359 in database, 292–294 ch12_poplib_example.py program, 354–355 del.icio.us bookmarks manager ch13_amazon_find_wishlist.py program, 396–398 API for, 497–498 ch13_amazon_search_scraper.py program, definition of, 497 403–407 27_597582_bindex.qxd 8/5/05 10:15 PM Page 587

Index ■ C–D 587

ch13_amazon_wishlist_scraper.py program, commit operation, CVS, 322 408–411 compression, for feed transfers, 230–233 ch13_google_search_scraper.py program, Concurrent Version System. See CVS 378–381 conditional GET, HTTP, 234–236 ch13_yahoo_news_scraper.py program, 390–393 connect() method, 98–99, 101–102, 118–119 ch13_yahoo_search_scraper.py program, 386–389 content management system (CMS), 13 ch14_feed_normalizer.py program, 437–441 CONTENT_TMPL template, 471–472 ch14_xslt_feed_normalizer.py program, 433 Content-Type header, 233 ch14_xslt_feed_normalizer.xsl transformation, conversion of feeds 420–432 commercial programs for, 443–444 ch15_bayes_agg.py program, 452–459 definition of, 417, 418 ch15_bayes_filter.py program, 463–467 with feedparser, 437–443 ch15_bayes_mark_entry.cgi program, 459–463 with XSLT ch15_feed_filter.py program, 445–449 building normalizer, 420–432 ch15_popular_links.py program, 470–478 commercial programs for, 444 ch16_feed_amazon_ads.py program, 507–511 conversion enabled by, 420 ch16_feed_delicious_recaps.py program, data model for, 418–419 498–504 testing normalizer, 434–436 ch16_feed_merger.py program, 483–486 XPath expressions for, 419 ch16_feed_related.py program, 491–495 XSLT processor for, 433 ch16_technorati_search.py program, 490–491 cPickle module, 297 ch17_feed_blog.py program, 515–523 CrispAds service, 513 ch17_feed_reposter.py program, 525–528 cron scheduler, 60 ch17_feed_to_javascript.py program, 530–533 cURL tool, 489 ch18_feed_to_ical.py program, 564–567 CVS commit triggers, 351 ch18_ical_to_hcal.py program, 543–547 CVS (Concurrent Version System) ch18_mod_event_feed_filter.py program, commercial programs for, 335, 351 559–563 definition of, 321–322 Channel Definition Format (CDF), 4 history events tag, 15, 237–238 feed scraper for, 333–339 chatbots parsing, 327–333 commercial chatbots, 126 querying, 324, 325 managing feed subscriptions using, 114–125 installing, 324 check-out operation, CVS, 322 log entries _choose() method, 186 parsing, 327–333 CIA open source notification system, 351 querying, 325–327 clean_entries() method, 298–299 scraping, 333–338 clickable feed buttons, 24–26 repositories, finding, 322–324 closeDBs() function, 70 _cvs() method, 329 cmd_list() method, 122 CVSClient class, 328–330 cmd_poll() method, 123–124 CVSHistoryEvent class, 331–332 cmd_signoff() method, 121 cvslib.py module, 327–333 cmd_subscribe() method, 123 CVSLogEntry class, 331 cmd_unsubscribe() method, 122 CVSScraper class, 335–337 __cmp__() method Cygwin, Windows, 19 CVSHistoryEvent class, 332 EntryWrapper class, 55–56 D FeedEntryDict class, 246 data model for normalization of feeds, 418–419 HTMLMetaDoc class, 204 date and time formats TemplateDocWrapper class, 213 converting between, in normalization, 424, 426, CMS (content management system), 13 428–432 Colloquy IRC client, 340–341 handling when parsing feeds, 47–48 COMMAND constant, 313 RFC 822 format, 218 Command Prompt, Windows, 19 W3CDTF format, 213 27_597582_bindex.qxd 8/5/05 10:15 PM Page 588

588 Index ■ D–E

date ranges containing report of new or modified feeds, 80–86 Google searches using, 382–383 filtering to produce feeds from, 369–372 Yahoo! searches using, 389 format of, 357–359 DATE_HDR_TMPL template, 54, 57, 457, 520–521 producing feeds from tag, 238 building mail protocol wrappers for, 360–363 Daypop search engine, 481, 485–486 filtering email for, 369–372 Debian Linux, 18. See also Linux generating feed entries for, 363–369 decode_entities() method, 46, 207–208 email package, Python, 358 __del__() method, 116–117 tag, 17 DEL_API_URL constant, 501 emailAggregatorPage() function, 82 DEL_ENTRY_TMPL template, 501 emailEntries() function, 87–89 del.icio.us bookmarks manager email.MIMEMultipart module, Python, 81, 82 API for, 497–498 email.MIMEText module, Python, 81, 82 definition of, 497 tag, 169–170 inserting bookmarks into feeds, 498–505 enclosures, RSS merging feeds from, 485–486 commercial programs for, 194–197 DeliciousFeed class, 500–504 downloading content for, 171–175 DEL_LINK_TMPL template, 501 downloading content with feed aggregator, 176–180 DEL_PASSWD constant, 501 finding multimedia content using, 169–171 DEL_TAG_TMPL template, 501 end_entry() method, 43 DEL_USER constant, 501 end_feed() method, 258 tag, 15, 17 end_feed_entry() method, 258 Desktop component, Plucker, 132 end_item() method, 43 difflib module, Python, 312, 315 entries_from_messages() method, 365–366 digg news feed, merging feeds from, 485–486 ENTRIES_XPATH constant, 282–283, 284 directories of feeds, 32–37 ENTRY_DB_FN constant, 69, 516, 525 _display() method, 186–188 ENTRY_LINK_TMPL template, 148 Distiller component, Plucker ENTRY_RE constant, 268, 271 building feed aggregator using, 135–140 ENTRY_TMPL template definition of, 132 agg01_pollsubs.py program, 54, 57 installing, 132 agg05_im_subs.py program, 110 using, 132–135 ch05_ipod_notes_aggregator.py program, Divmod (Reverend), 451–452 148 Documentation component, Plucker, 132 ch15_bayes_agg.py program, 457–458 dodgeit service, 373 ch17_feed_blog.py program, 521 Doppler podcast receiver, 196–197 ch17_feed_to_javascript.py program, Dornfest, Rael 531–532 blagg feed aggregator, 61 EntryWrapper class, 55–57, 73 blosxom blogging package, 61 ENTRY_XPATHS constant, 282–283, 284 downloaders.py module, 172–175, 181–188 error logs, Apache, 301–304 DOWNLOAD_PATH constant, 176 _error() method, 186 downloadURL() method, 173, 183–184 ERROR_LOG constant, 301 downloadURLs() function, 178–179 ETag header, 234, 236 Drupal portal, 6 tag, 538–539 EVDB (Events and Venues Database), 570 E Event Log, monitoring, 290, 318 email event_filter() function, 309 accessing programmatically, 353–357 EventMeister package, 318 commercial programs Events and Venues Database (EVDB), 570 for converting email to feeds, 373 EXCLUDE_EXACT constant, 306, 309 for routing feeds to email, 91–92 EXCLUDE_PARTIAL constant, 306, 309 comparing RSS feeds to, 15–16 _executeAppleScript() method, 192 containing individual new or modified feeds, 86–92 Expires header, 236–237 27_597582_bindex.qxd 8/5/05 10:15 PM Page 589

Index ■ E–F 589

extending feeds feed handler adding metadata to feeds, 538–539, 541, 557–564 building, 37–48 with calendar events, 541–557 Universal Feed Parser (Pilgrim), 48–49 commercial programs for, 569–570 Feed on Feeds feed aggregator (Minutillo), 61–62 definition of, 537 feed playlist format, 25 harvesting calendar events from feeds, 564–569 feed: protocol scheme, 26 structuring feeds with microformats, 539–541, feed reader, 4–5. See also feed aggregator 543–548 tag, 17, 42 Extensible Markup Language (XML) Feed Validator, 220–222 definition of, 13–14 FeedAutodiscoveryParser class, 28–29 tools for, 20–21 FeedBurner service, 443, 513 extract_summary_from_message() method, FeedCache class, 577–582 367–368 feedcache.py module, 575–584 Extras component, Plucker, 132 FeedCacheRecord class, 576–577 FEED_DB_FN constant, 69, 516, 525 F FeedDemon feed aggregator (Bradbury), 64–65 FCCScraper class, 270–272 FEED_DIR constant, 301, 305, 313, 470 Federal Communications Commission, example using, FeedEntryDict class, 245–247 264–274 feed_fetch.py module, 38 feed aggregator FeedFilter class, 447–449 building FEED_HDR_TMPL template using iPod Notes, 144–153 agg01_pollsubs.py program, 54, 57 using Plucker Distiller, 135–140 agg05_im_subs.py program, 110 using Python, 52–59 ch15_bayes_agg.py program, 457 commercial programs for, 61–65, 167 ch17_feed_blog.py program, 520–521 creating group Web log from, 515–524 FEED_IN_URL constant, 559 definition of, 4–13 FEED_LINK_TMPL template, 148, 149 downloading multimedia content using, 176–180 FeedMerger class, 485–486 emailing individual new or modified feeds, 86–92 FEED_META constant emailing report of new or modified feeds, 80–86 AmazonWishlistScraper class, 409 finding feeds for ch07_feedmaker.py program, 209–210, 212 clickable feed buttons, 24–26 FCCScraper class, 271 feed autodiscovery, 26–32 LOCScraper class, 260–261 feed directories, 32–37 Scraper class, 249–250, 253 links on Web sites and blogs for, 23–24 YahooSearchScraper class, 387–388 Web Services, 32–37 FEED_NAME_FN constant future of, 10–13 ch10_apache_error_feed.py program, 301 generating audio files from, 160–167 ch10_apache_referrer_feed.py program, 305 mixed reverse-chronological aggregators, 7–8 ch10_logins_feed.py program, 313 module for reusable parts of, 77–79 ch15_feed_filter.py program, 446 parsing feeds for ch15_popular_links.py program, 470 building feed parser, 38–46 FeedNormalizer class, 438–441 extending feed parser, 47–48 FEED_OUT_FN constant, 559 testing feed parser, 46–47 Feedpalooza Web site, 287 Universal Feed Parser (Pilgrim), 48–49 feed_reader.py program, 46–47 personalized portals, 5–7 FeedRelator class, 492–495 retrieving only new or modified feeds, 67–77 feeds scheduling, 60–61 aggregator for subscribing to feeds for, 49–51 building, 52–59, 135–140, 144–153 three-pane aggregators, 8–13 commercial programs for, 61–65, 167 Web-based aggregators, 10 creating group Web log from, 515–524 feed cache, for shared feeds, 575–584 definition of, 4–13 Continued 27_597582_bindex.qxd 8/5/05 10:15 PM Page 590

590 Index ■ F

feeds links on Web sites and blogs for, 23–24 aggregator for (continued) MIME types for, 25 downloading multimedia content using, 176–180 URI scheme for, 26 emailing individual new or modified feeds, 86–92 Web Services for, 32–37 emailing report of new or modified feeds, 80–86 fried, 226 finding feeds for, 24–37 hosting future of, 10–13 caching feeds, 229–230 generating audio files from, 160–167 compression used for, 230–233 mixed reverse-chronological aggregators, 7–8 delivering partial feeds, 240 module for reusable parts of, 77–79 how often to regenerate feeds, 226–227 parsing feeds for, 38–49 issues involving, 225–226 personalized portals, 5–7 metadata with update schedule hints, 237–239 retrieving only new or modified feeds, 67–77 redundant downloads, reducing, 233–237 scheduling, 60–61 scheduling feed generation, 227 subscribing to feeds for, 49–51 testing, 240 three-pane aggregators, 8–13 uploading with FTP, 227–229, 240 Web-based aggregators, 10 inserting items into audio, creating, 158–167 Amazon items, 506–512 baking, 226–229 commercial programs for, 513 blending daily links, 497–505 adding related links to feeds, 488–497 related links, 488–497 commercial programs for, 513 links inserting related items into feeds, 506–512 daily, mixing into feeds, 497–505 merging feeds, 483–487 related, adding to feeds, 488–497 mixing daily links into feeds, 497–505 sifting from feeds, 469–480 caching, 229–230, 575–584 merging, 483–487 compressing, 230–233 metadata with update schedule hints, 237–239 conversion of monitoring login activity using, 312–317 commercial programs for, 443–444 monitoring logs using, 290–294, 301–312 definition of, 418 monitoring projects in CVS repositories with feedparser, 437–443 history events, 324, 325, 333–339 with XSLT, building normalizer for, 420–432 log entries, 325–338 with XSLT, data model for, 418–419 monitoring projects in Subversion repositories, with XSLT, processor for, 433 340–350 with XSLT, testing normalizer for, 434–436 multimedia content in definition of, 3–4, 13–18 commercial programs for, 194–197 extending downloading from URLs, 171–175 adding metadata to feeds, 538–539, 541, 557–564 downloading using BitTorrent, 180–189 with calendar events, 541–557 downloading with feed aggregator, 176–180 commercial programs for, 569–570 finding, using RSS enclosures, 169–171 definition of, 537 importing MP3s into iTunes, 189–194 harvesting calendar events from feeds, 564–569 normalization of structuring feeds with microformats, 539–541, commercial programs for, 443–444 543–548 definition of, 417, 483 extracting calendar events from, 564–569 with feedparser, 437–443 filtering with XSLT, building normalizer for, 420–432 commercial programs for, 481 with XSLT, data model for, 418–419 definition of, 13 with XSLT, processor for, 433 by keywords and metadata, 445–450 with XSLT, testing normalizer for, 434–436 using Bayesian classifier, 452–469 parsing finding building feed parser, 38–46 clickable feed buttons, 24–26 conversion of feeds using, 437–443 feed autodiscovery, 26–32 date and time formats, handling, 47–48 feed directories, 32–37 definition of, 483 27_597582_bindex.qxd 8/5/05 10:15 PM Page 591

Index ■ F–G 591

extending feed parser, 47–48 sending as Instant Messages, 105–111 for feed aggregator, 38–49 subscriptions history events in CVS, 327–333 managing with IM chatbot, 114–125 HTML files into feeds, 548–557 subscribing to, 49–51 HTMLParser class, Python, 27–30, 253–263 update schedule hints in, 237–239 log entries in CVS, 327–333 validating, 220–222 testing feed parser, 46–47 FEEDS_FN constant, 52, 471, 516, 525 Universal Feed Parser (Pilgrim), 48–49, 171, 419, Feedsplitter feed conversion, 444 437–443 FEED_TAGLINE constant, 470 producing FEED_TITLE constant, 470 with Amazon Web Services, 394–412 FEED_TYPES constant, 29, 30, 33 from calendar events in HTML, 548–557 FEED_URI constant, 446 commercial programs for, 223–224 FEED_URL constant, 492, 499–500, 531, 564 from email, filtering feeds for, 369–373 fetch_items() method from email, generating feeds for, 363–369 AmazonSearchScraper class, 406 from email, mail protocol wrappers for, 360–363 AmazonWishlistScraper class, 410–411 from filtered email, 369–372 fetch_messages() method with Google Web services, 375–383 IMAP4Client class, 362–363 from HTML files, 209–215, 217–219 POP3Client class, 360–361 incrementally generating from logs, 294–300 filter_aggregator_entries() method, 466–467 with Yahoo! Web services, 383–394 filtering email to produce feeds, 369–372 publishing, 13 filtering feeds reading from PalmOS devices commercial programs for, 481 commercial programs for, 167 definition of, 13 feed aggregator for, 135–140 by keywords and metadata, 445–450 loading documents to, 141 using Bayesian classifier Plucker package for, 130–135 classifier for, 463–467 receiving in Usenet newsgroups, 92 feed aggregator for, 452–459 remixing, 417 feedback mechanism for, 459–463 republishing performing, 467–469 commercial programs for, 535–536 filter_messages() method, 364–365, 370–372 as group web log, 515–524 FILTER_RE constant, 446 in JavaScript includes, 529–535 _fin() method, 185–186 as web log with MetaWeblog, 524–529 findEntry() function, 456 retrieving, 37–38 findHTML() function, 202–203, 212 retrieving from IM chatbot, 114–125 finish() method, 552 retrieving only new or modified feeds FooListScraper class, 370 aggregator memory for, 67–77 4Suite package, 20–21, 433 emailing reports of, 80–91 fried feeds, 226 retrieving using iPod Notes FTP, baking feeds using, 227–229, 240 building feed aggregator for, 144–153 ftplib module, Python, 227–228 using feed aggregator for, 153–157 fuzzy logic, used in scraping, 244 scraping from CVS entries, 333–339 scraping from Subversion entries, 343–350 G scraping Web sites for GET, conditional, 234–236 building feed scraper, 244–253 _getCachedURIs() method, 581 commercial programs for, 287–288 getEnclosuresFromEntries() function, 171, definition of, 13, 243–244 177–178 with HTML Tidy, 274–278 get_entry_paths() method, 299 with HTMLParser class, 253–263 getFeedEntries() function, 53 problems with, 244 GetFeedFields() method, Syndic8, 34 with regular expressions, 264–274 GetFeedInfo() method, Syndic8, 34 tools for, 244 getFeedRecord() method, 578 with XPath, 274–275, 278–286 getFeeds() function, 30 27_597582_bindex.qxd 8/5/05 10:15 PM Page 592

592 Index ■ G–H

getFeedsDetail() function, 30–31, 34 handle_data() method, 207–208 GetFeedStates() method, Syndic8, 34 handle_endtag() method, 206–207, 262–263, __getitem__() method 553–555 CVSHistoryEvent class, 332 handle_entityref() method, 207–208 CVSLogEntry class, 331 handle_starttag() method, 206, 261–262, 552 EntryWrapper class, 56 hash() method, 73, 74–76 FeedEntryDict class, 246–247 hash_entry() method, 299 HTMLMetaDoc class, 203–204 hCalendar microformat, 539–541, 548–557. See also ICalTmplWrapper class, 546 calendar events ScoredEntryWrapper class, 454 HCalendarParser class, 549–556 TemplateDocWrapper class, 214–215 hcalendar.py module, 548–556 TrampTmplWrapper class, 402–403 tag, feeds as metadata in, 26–27 getNewFeedEntries() function HEVENT_TMPL template, 544, 546–547 agg02_pollsubs.py program, 70–74 history() method, 330 agglib.py module, 145–146, 177, 453 history operation, CVS getOwner() method, 117–118 definition of, 324, 325 getSubscriptionURI() method, 119 feed scraper for, 333–339 _getTorrentMeta() method, 184–185 parsing results of, 327–333 Gmail service, 373 HISTORY_FN constant, 516 Gnews2RSS, Google News scraper (Bond), 412 hosting feeds go() method, 118–119 caching feeds, 229–230 Google News scrapers, 412 compression used for, 230–233 Google Web services delivering partial feeds, 240 definition of, 375–376 how often to regenerate feeds, 226–227 license key for, 376–377, 379 issues involving, 225–226 searches, building feeds from, 378–383 metadata with update schedule hints, 237–239 SOAP interface for, accessing, 378 redundant downloads, reducing, 233–237 GOOGLE_LICENSE_KEY constant, 379 scheduling feed generation, 227 GOOGLE_SEARCH_QUERY constant, 379 testing, 240 GoogleSearchScraper class, 379–381 uploading with FTP, 227–229, 240 Graham, Paul (“A Plan for Spam”), 451 HotLinks link aggregator, merging feeds from, 485–486 Gregorio, Joe (httpcache module), 37–38 tag, 237–238 grep command, log monitoring and, 290–294 .htaccess file, Apache, 25 Growl global notification system, 350 HTML files guessEntry() function, 455–456 examining to produce feed scraper for, 255–256, tag, 15, 75 264–267, 274–275 gzip module, Python, 316 extracting metadata from, 201–209 parsing into feeds, 548–557 H producing feeds from, 209–215, 217–219 Hammersley, Ben (tracking installed modules), 317 rendering calendar events as, 543–548 handheld devices HTML Tidy, scraping Web sites using, 274–278 iPod devices HTML_DIR constant, 544 commercial programs for, 167 HTML_FN constant, 52, 134 creating and managing Notes on, 142–145 HTMLMetaDoc class, 203–204 feed aggregator for, building, 144–153 htmlmetalib.py module, 201–209 feed aggregator for, using, 153–157 HTMLMetaParser class, 204–207 Note Reader for, 141–142 _HTMLMetaparserFinished class, 204–205 PalmOS devices HTMLParser class, Python, 27–30, 253–263 commercial programs for, 167 HTMLScraper class, 257–259 definition of, 129 HTMLStripper class, 163, 165–166 feed aggregator for, 135–140 HTTP 1.1 loading documents to, 141 conditional GET facilities, 234–236 Plucker package for, 130–135 content coding allowing transformation, 230 handle_comment() method, 261 HTTP status codes, 72 27_597582_bindex.qxd 8/5/05 10:15 PM Page 593

Index ■ H–I 593

HTTPCache class, 395–396, 398, 490–491 DeliciousFeed class, 501 httpcache module (Gregorio), 37–38, 229–230 FeedCache class, 577–578 HTTPDownloader class, 172–175 FeedEntryDict class, 246 FeedFilter class, 447 I FeedNormalizer class, 438 iCalendar standard, 539, 541. See also calendar events FeedRelator class, 493 ICalTmplWrapper class, 546–547 GoogleSearchScraper class, 380 ICAL_URL constant, 544 HCalendarParser class, 549–550 ICS_FN constant, 564 HTMLMetaDoc class, 203–204 ID for Yahoo! Web services, 384–385 ICalTmplWrapper class, 546 tag, 17, 75 IMAP4Client class, 362 If-Modified-Since header, 234, 236 JavaScriptFeed class, 531–532 If-None-Match header, 234, 236 LogBufferFeed class, 296–297 IFrame-style includes, republishing feeds using, 536 MailScraper class, 363–364 IM (Instant Messenger) ModEventFeed class, 562 AIM (AOL Instant Messenger) POP3Client class, 360–361 definition of, 93–94 ScoredEntryWrapper class, 454 handling connections from, 97–101 SVNScraper class, 345 protocols for, 94 TemplateDocWrapper class, 213 commercial programs for, 126 TrampTmplWrapper class, 402 Jabber YahooNewsScraper class, 392 commercial chatbots for, 126 YahooSearchScraper class, 387–388 definition of, 94–95 INSERT_ITEM_TMPL template, 493, 508–509 handling connections from, 101–104 INSERT_TMPL template, 493, 508–509 managing feed subscriptions using, 114–125 Instant Messenger (IM). See IM sending new feeds as Instant Messages, 105–111 Internet Engineering Task Force, 94 test function for, 95–97 Internet Mail Access Protocol version 4 mailbox, IMAP4 mailbox, accessing programmatically, 355–357 accessing programmatically, 355–357 IMAP4Client class, 362–363 Internet Message Format (RFC 2822), 357–359 imaplib module, Python, 356 iPod Agent program, 167 IM_CHUNK constant, 106 iPod devices IM_CLASS constant, 106 commercial programs for, 167 imconn.py module, 95–105 creating and managing Notes on, 142–145 IM_PASSWD constant, 106 feed aggregator for impedence mismatch, 225 building, 144–153 importAudioFiles() function, 193–194 using, 153–157 importSoundFile() function, 163–165 Note Reader for, 141–142 IM_TO constant, 106 iPodder podcast receiver, 194–195, 322–324 IM_USER constant, 106 iPodderX podcast receiver, 196 inbox, accessing programmatically, 353–357 IPOD_FEEDS_DIR constant, 146–147 includes, JavaScript, 529–535, 536 IPOD_FEEDS_IDX constant, 146–147 INCLUDE_TMPL template, 531–532 IPOD_FEEDS_IDX_TITLE constant, 146–147 “Industrial Strength Web Publishing” (Kallen), 227 IPOD_NOTES_PATH constant, 146–147 __init__() method tag, 17 AggBot class, 116–117 isValidAudioFile() function, 193–194 AmazonAdFeed class, 508–509 tag, 15 AmazonSearchScraper class, 405 items() method, 551 AmazonWishlistScraper class, 409–410 ITEM_TRACK constant, 399 BayesFilter class, 465–466 iTunes CVSClient class, 329 accessing from Python, 160 CVSHistoryEvent class, 332 importing MP3s into, 189–194 CVSLogEntry class, 331 iTunesMac class, 190–192 CVSScraper class, 335–336 27_597582_bindex.qxd 8/5/05 10:15 PM Page 594

594 Index ■ J–M

J LOG_CONF constant, 471 Jabber instant messaging. See also IM (Instant Messenger) login activity, monitoring, 312–317 commercial chatbots for, 126 LogMeister package, 318 definition of, 94–95 LOG_PERIOD constant, 345 handling connections from, 101–104 logs JabberConnection class, 101–104 CVS JabRSS chatbot, 126 parsing, 327–333 JavaScript includes, republishing feeds using, 529–535, querying, 325–327 536 scraping, 333–338 JavaScriptFeed class, 531–533 monitoring JS_FN constant, 531 Apache access logs, 304–312 js_format() method, 533 Apache error logs, 301–304 Julian date ranges, Google searches using, 382–383 commercial programs for, 317–318 julian.py module, 382–383 incremental feeds for, 294–300 login activity, 312–317 K system logs, 290–294 Kallen, Ian (“Industrial Strength Web Publishing”), 227 Windows Event Log, 290 keywords, filtering feeds by, 445–450 Subversion Kibo, 310–312 querying, 341–343 KNewsTicker (K Desktop Environment), 5 scraping, 343–350 L M last command, 313 M3U format, 25 Last-Modified header, 234, 235 Mac OS X learning machines, 4 creating audio feeds with Text-to-Speech, 158–167 license key, Google Web services, 376–377, 379 importing MP3s into iTunes, 189–194 tag, 15, 17, 26–27 installing CVS on, 324 LINKER_TMPL, 471–472 installing 4Suite on, 21 links installing HTML Tidy, 276 daily, mixing into feeds, 497–505 installing Python on, 20 indicating feeds, 23–24 installing Subversion on, 340 related, adding to feeds, 488–497 installing XPath on, 279 sifting from feeds loading Plucker documents to PalmOS, 141 building feed generator for, 469–478 NetNewsWire feed aggregator, 63–64 using feed generator for, 478–480 Radio UserLand feed aggregator, 62–63 LinkSkimmer class, 478 UNIX-based tools for, 19 LINK_TMPL template, 306–307, 335, 471–472 machine learning, 450 Linux MagpieRSS feed parser, 61 installing CVS on, 324 mailbox, accessing programmatically, 353–357 installing 4Suite on, 21 MailBucket service, 373 installing HTML Tidy, 276 MAIL_CLIENT constant, 369 installing Python on, 19 mailfeedlib.py module, 359–369 installing Subversion on, 340 MailScraper class, 363–368 installing XPath on, 279 makeEntryID() function, 456–457 loading Plucker documents to PalmOS, 141 Manalang, Rich (XSLT feed conversion), 444 login activity, monitoring, 312–317 MAX_ENTRIES constant UNIX-based tools for, 18–19 ch07_feedmaker.py program, 209–210, 212 _loadRecord() method, 581–582 ch17_feed_blog.py program, 516–517 loadSubs() function, 112 scraperlib.py module, 250, 253 localhost, SMTP server set to, 81 MAX_ENTRY_AGE constant, 471 LOCScraper class, 260–263 MAX_ITEMS constant, 508–509 log operation, Subversion, 342–343 md5 module, Python, 68, 75–76 LogBufferFeed class, 296–299, 302 md5_hash() function, 150 27_597582_bindex.qxd 8/5/05 10:16 PM Page 595

Index ■ M–O 595

merging feeds, 483–487 Multipurpose Internet Mail Extensions (MIME), 25, _messageCB() method, 103–104 357–359 tag, 208 My portal, 6 metadata adding to feeds, 538–539, 541, 557–564 N extracting from HTML files, 201–209 tag, 17 filtering feeds by, 445–450 named groups, 267 update schedule hints in, 237–239 ndiff() function, 315 MetaWeblog API, reposting feeds using, 524–529 NetNewsWire feed aggregator (Ranchero Software), microcontent, 4 63–64 microformats NEW_FEED_LIMIT constant, 176, 177–178 specifications for, 570 news searches, Yahoo!, 390–394 structuring feeds with, 539–541, 543–548 newsmaster, 4 MIME (Multipurpose Internet Mail Extensions), 25, Newspipe program, 91–92 357–359 nntp//rss aggregator, 92 MIMEMultipart module, Python, 81, 82 normalization of feeds MIMEText module, Python, 81, 82 commercial programs for, 443–444 minifeedfinder.py module, 27–32 definition of, 417, 483 MiniFeedParser class, 40 with feedparser, 437–443 minifeedparser.py module, 39–48 with XSLT MIN_LINKS constant, 471 building normalizer, 420–432 Minutillo, Steve (Feed on Feeds feed aggregator), 61–62 conversion enabled by, 420 mixed reverse-chronological aggregators, 7–8 data model for, 418–419 mod_deflate module, Apache, 232 testing normalizer, 434–436 mod_event module, 538–539 XPath expressions for, 419 ModEventFeed class, 560–563 XSLT processor for, 433 mod_expires module, Apache, 235–236 normalize_entries() method, 440–441, 485–486 mod_gzip module, Apache, 231 normalize_feed_meta() method, 439–440 tag, 17 “Normalizing Syndicated Feed Content” (Pilgrim), 419 monitorfeedlib.py module, 294–300 Note Reader for iPod monitoring creating and managing Notes, 142–145 Apache access logs, 304–312 definition of, 141–142 Apache error logs, 301–304 feed aggregator for commercial programs for, 317–318 building, 144–153 incremental feeds for, 294–300 using, 153–157 login activity, 312–317 NOTE_TMPL template, 148 projects in CVS repositories Nottingham, Mark (cgi_buffer package), 232–233, history events, 324, 325, 333–339 236 log entries, 325–338 NUM_DAYS constant, 501 projects in Subversion repositories, 340–350 UNIX system logs, 290–294 O Windows Event Log, 290, 318 on_IM_IN() method, 100–101 Movable Type content management package, 226, 524 Open Source projects, tracking MP3 files, importing into iTunes, 189–194 CVS (Concurrent Version System) MP3 links, 25 commercial programs for, 335, 351 MP3PLAYER constant, 192 definition of, 321–322 mp3players.py module, 189–192 history events, feed scraper for, 333–339 multimedia content history events, parsing, 327–333 commercial programs for, 194–197 history events, querying, 324, 325 downloading from URLs, 171–175 installing, 324 downloading using BitTorrent, 180–189 log entries, feed scraper for, 333–338 downloading with feed aggregator, 176–180 log entries, parsing, 327–333 finding, using RSS enclosures, 169–171 log entries, querying, 325–327 importing MP3s into iTunes, 189–194 repositories, finding, 322–324 27_597582_bindex.qxd 8/5/05 10:16 PM Page 596

596 Index ■ O–P

Open Source projects, tracking (continued) personalized portals, 5–7 Subversion repositories PHP iCalendar, 541 commercial programs for, 335, 351 PHP includes, republishing feeds using, 536 definition of, 340 Pickle module, 297 finding, 340–341 Pilgrim, Mark installing, 340 “Normalizing Syndicated Feed Content”, 419 logs, querying, 341–343 Ultra-Liberal Feed Finder module, 36–37 logs, scraping, 343–350 Universal Feed Parser, 48–49, 419 Open Source Web community, 6 Planet feed aggregator, 535 openDBs() function, 70 playlist files, 25 osascript command line tool, 160, 165 Plone portal, 6 OSCAR (Open System for Communication in Real PLS format, 25 Time), 94 Plucker package for PalmOS os.popen function, Python, 160 components in os.walk() function, Python, 203 downloading, 131–132 list of, 131–132 P definition of, 130–131 PAGE_TMPL template, 54, 520–521 Desktop component, 132 PalmOS devices Distiller component commercial programs for, 167 building feed aggregator using, 135–140 definition of, 129 installing, 132 feed aggregator for, 135–140 using, 132–135 loading documents to, 141 Documentation component, 132 Plucker package for, 130–135 Extras component, 132 parse() function, 39–40, 550 PLUCKER_BPP constant, 134–135 parse() method, 578, 582–583 PLUCKER_DEPTH constant, 134–135 parse_file() method, 205, 259 PLUCKER_DIR constant, 134 parse_results() method, 389 PLUCKER_FN constant, 134–135 parse_uri() method, 550 PLUCKER_TITLE constant, 134 parsing podcast feeds, listings of, 179 conversion of feeds using, 437–443 podcasting, 169. See also multimedia content date and time formats, handling, 47–48 PointCast, Inc., 5 definition of, 483 pollFeed() method, 119–120 for feed aggregator polling, 225 building feed parser, 38–46 POP3 mailbox, accessing programmatically, 353–355 extending feed parser, 47–48 POP3 standard, 354 testing feed parser, 46–47 POP3Client class, 360–361 Universal Feed Parser (Pilgrim), 48–49 popen2 module, Python, 312 history events in CVS, 327–333 popen4() method, 314 HTML files into feeds, 548–557 poplib module, Python, 354 HTMLParser class, Python, 27–30, 253–263 portals, personalized, 5–7 log entries in CVS, 327–333 Post Office Protocol version 3 mailbox, accessing Universal Feed Parser (Pilgrim) programmatically, 353–355 definition of, 48 produce_entries() method downloading, 48–49 AmazonAdFeed class, 509–510 enclosures and, 171 AmazonScraper class, 400–402 normalizing and converting feeds with, 419, BayesFilter class, 466 437–443 CVSScraper class, 336 using, 49 DeliciousFeed class, 501–504 pattern recognition, used in scraping, 244 FCCScraper class, 272 PERC_STEP constant, 182 FeedFilter class, 448–449 performance. See bandwidth, hosting feeds and FeedMerger class, 485–486 Perl modules, installed, tracking (Hammersley), 317 FeedNormalizer class, 438–439 personal aggregators, 7–8 FeedRelator class, 493–495 27_597582_bindex.qxd 8/5/05 10:16 PM Page 597

Index ■ P–R 597

GoogleSearchScraper class, 380–381 Reinacker, Greg (Windows Event Log, monitoring), 318 HTMLScraper class, 257 remixing feeds, 417. See also conversion of feeds; LogBufferFeed class, 296–297 normalization of feeds MailScraper class, 364–365 repositories, CVS, finding, 322–324 ModEventFeed class, 562 reset() method, 41, 204–205, 552 RegexScraper class, 269 reset_doc() method, 205 SVNScraper class, 346–347 resources. See Web site resources XPathScraper class, 283–284 REST style of Web services YahooNewsScraper class, 392–393 for Amazon, 395 YahooSearchScraper class, 388 for Yahoo!, 384 PROG_OUT constant, 182 Reverend (Divmod), 451–452 tag, 15 revision control systems. See CVS (Concurrent Version pull technology, 225 System) push technology, 225 RFC 822 format, 218, 424, 426, 428–432 PyBlosxom blogging package, 223 RFC 2822 format, 357–359 pygoogle package, 378 RFC3229 format, 240 PyPlucker.Spider module, 137 rlog() method, 329 PyRSS2Gen feed generator, 223 rlog operation, CVS Python definition of, 325–327 feed aggregator using, 52–59 feed scraper for, 333–338 feed autodiscovery using, 27–32 parsing results of, 327–333 fetching feeds using, 37 “RSS 2.0” link, 23 installing on Linux, 19 RSS Digest service, 536 installing on Mac OS X, 20 RSS enclosures installing on Windows, 20 commercial programs for, 194–197 spycyroll feed aggregator (Babu), 61 downloading content for, 171–175 Web Services access using, 33–37 downloading content with feed aggregator, 176–180 Py-TOC module, Python (Turner), 94 finding multimedia content using, 169–171 RSS feeds. See also feeds Q conversion to Atom feeds QueryFeeds() method, Syndic8, 34 commercial programs for, 444 QuickNews feed aggregator, 167 definition of, 418 with feedparser, 437–443 R with XSLT, building normalizer for, 420–432 Radio UserLand feed aggregator, 7–8, 62–63 with XSLT, data model for, 418–419 Ranchero Software, NetNewsWire feed aggregator, 63–64 with XSLT, processor for, 433 tag, 42 with XSLT, testing normalizer for, 434–436 Really Simple Syndication. See RSS feeds definition of, 3, 13–18 reBlog package, 536 generating from HTML content, 217–220 receiveIM() method, 120–121 mod_event extension to, 538–539 _recordFN() method, 581 1.0 specification, 39 referrer spam, 306 2.0 specification, 39 referrers, monitoring, 304–312 update schedule hints in, 237–239 REFER_SEEN constant, 305 “RSS” links, 3, 23 refreshFeed() method, 579–581 tag, 15, 42 refreshFeeds() method, 578–579 rss2email program (Swartz), 91 REFRESH_PERIOD constant, 577 rss2jabber chatbot, 126 RegexScraper class, 267–269, 271 RSSCalendar calendar events, 570 regular expressions RSS_DATE_FMT template, 248–249 for Apache access log, 310 RSS_ENTRY_TMPL template, 249, 558, 561 definition of, 266 RSS_FEED_TMPL template, 248–249, 561 feed scraper using, 264–274 RSS-IM Gateway IM chatbot, 126 for feeds with category of “python”, 446 RSSTemplateDocWrapper class, 218–219 resources for learning about, 266, 267, 269 runOnce() method, 99–100, 103–104, 118–119 27_597582_bindex.qxd 8/5/05 10:16 PM Page 598

598 Index ■ S

S sifting feeds _saveRecord() method, 582 building feed generator for, 469–478 saveSubs() function, 112 using feed generator for, 478–480 say command, Mac OS X, 158–160 SITE_NAME constant, 301, 305, 313 scheduled tasks, Windows XP, 60–61 SITE_ROOT constant, 305 ScoredEntryWrapper class, 453–454 tag, 238 scoreEntries() function, 454 tag, 237–238 scoreEntry() function, 454–455 SMTP server, set to localhost,81 scrape() method, 251–253 smtplib module, Python, 81 scrape_atom() method, 250–251 SOAP modules, Python, 378 Scraper class, 249–253 SOAP Web services, 376 _ScraperFinishedException() method, 247 socket module, Python, 99 scraperlib.py module, 245–253, 257–259, 268–269, Source Code component, Plucker, 132 282–284 SourceForge CVS repositories, 322–324 scrape_rss() method, 250–251 spam, referrer, 306 SCRAPE_URL constant, 250, 260–261 speakTextIntoSoundFile() function, 163 scraping CVS entries, 333–339 speech synthesis, creating audio feeds using, 158–167 scraping Subversion entries, 343–350 spycyroll feed aggregator (Babu), 61 scraping Web sites start() method, 98–99 building feed scraper, 244–253 start_entry() method, 43 commercial programs for, 287–288 start_feed() method, 258 definition of, 13, 243–244 start_feed_entry() method, 258 with HTML Tidy, 274–278 start_item() method, 43 with HTMLParser class, 253–263 START_TIMEOUT constant, 182 problems with, 244 state machine, 41 with regular expressions, 264–274 STATE_FN constant tools for, 244 AmazonScraper class, 399 with XPath, 274–275, 278–286 CVSScraper class, 335 ScrappyGoo, Google News scraper (Yang), 412 LOCScraper class, 260–261