27_597582_bindex.qxd 8/5/05 10:15 PM Page 585
Index
NUMERICS AmazonSearchScraper class, 405–406 0.3 specification, Atom, 39, 211 AmazonWishlistScraper class, 409–411 1.0 specification, RSS, 39 America Online Instant Messenger. See AIM (AOL 2.0 specification, RSS, 39 Instant Messenger) 4Suite package, 20–21, 433 AmphetaDesk feed aggregator, 8 AmphetaRate service, 481 A AOL Instant Messenger. See AIM “A Plan for Spam” (Graham), 451 Apache Web server access logs, Apache, 304–312 compression, enabling, 231–232 ACCESS_LOG constant, 305 conditional GET, enabling, 235–236 ACCESS_RE constant, 310 configuring MIME types for feeds, 25 Active Channels, Internet Explorer, 4 monitoring access logs, 304–312 addConnection() method, 117–118 monitoring error logs, 301–304 addTrack() method, 192 API_BLOGID constant, 525 AddType configuration, Apache, 25 API_PASSWD constant, 525 AdiumX IM client, 105 API_URL constant, 525 Advanced Packaging Tool, Debian Linux, 18 API_USER constant, 525 agg01_pollsubs.py program, 52–59 append_entry() method, 297–298 agg01_subscribe.py program, 50–51 AppleScript agg02_pollsubs.py program, 68–77 accessing from Python, 160 agg03_emailsubs.py program, 80–86 importing MP3s into iTunes, 189–194 agg04_emailsubs.py program, 86–91 application ID, Yahoo! Web services, 384–385, 387 agg05_im_subs.py program, 105–111 ARCHIVE_FN constant, 517 agg06_aggbot.py program, 114–124 ASSOCIATE_TAG constant, 508–509 AggBot class, 116–124 “ATOM 0.3” links, 23 agglib.py module, 77–79, 112–114, 145–146 Atom feeds. See also feeds aggregator. See feed aggregator conversion to RSS feeds AIM (AOL Instant Messenger). See also IM (Instant commercial programs for, 444 Messenger) definition of, 418 definition of, 93–94 with feedparser, 437–443 handling connections from, 97–101 with XSLT, building normalizer for, 420–432 protocols for, 94 with XSLT, data model for, 418–419 AIMConnection class, 97–101 with XSLT, processor for, 433 Amazon Web Services (AWS) with XSLT, testing normalizer for, 434–436 definition of, 394 definition of, 3, 13–18 inserting related items intoCOPYRIGHTED feeds, 506–512 generating MATERIAL from HTML content, 209–217 product searches, building feeds from, 403–407 0.3 specification, 39, 211 queries, building, 395–398 “Atom” links, 23 registering as AWS developer, 395, 506 ATOM_DATE_FMT template, 248 REST-style interface for, 395 ATOM_ENTRY_TMPL template searches, building feeds from, 398–403 amazon.lib.py module, 399–400 wish list items, building feeds from, 407–412 ch07_feedmaker.py program, 557 wish lists, searching for, 396–398 ch18_mod_event_feed_filter.py program, XSLT, building feeds using, 413 560 AmazonAdFeed class, 508–511 mailfeedlib.py module, 363–364 AMAZON_KEY constant, 508–509 scraperlib.py module, 248 amazon.lib.py module, 398–403 atomfeed feed generator, 223 AmazonScraper class, 398–402 ATOM_FEED_TMPL template, 211, 248, 560 amazon_search() method, 510–511 AtomTemplateDocWrapper class, 215 27_597582_bindex.qxd 8/5/05 10:15 PM Page 586
586 Index ■ B–C
audio feeds, creating, 158–167 inserting bookmarks into feeds, 498–505
Index ■ C–D 587
ch13_amazon_wishlist_scraper.py program, commit operation, CVS, 322 408–411 compression, for feed transfers, 230–233 ch13_google_search_scraper.py program, Concurrent Version System. See CVS 378–381 conditional GET, HTTP, 234–236 ch13_yahoo_news_scraper.py program, 390–393 connect() method, 98–99, 101–102, 118–119 ch13_yahoo_search_scraper.py program, 386–389 content management system (CMS), 13 ch14_feed_normalizer.py program, 437–441 CONTENT_TMPL template, 471–472 ch14_xslt_feed_normalizer.py program, 433 Content-Type header, 233 ch14_xslt_feed_normalizer.xsl transformation, conversion of feeds 420–432 commercial programs for, 443–444 ch15_bayes_agg.py program, 452–459 definition of, 417, 418 ch15_bayes_filter.py program, 463–467 with feedparser, 437–443 ch15_bayes_mark_entry.cgi program, 459–463 with XSLT ch15_feed_filter.py program, 445–449 building normalizer, 420–432 ch15_popular_links.py program, 470–478 commercial programs for, 444 ch16_feed_amazon_ads.py program, 507–511 conversion enabled by, 420 ch16_feed_delicious_recaps.py program, data model for, 418–419 498–504 testing normalizer, 434–436 ch16_feed_merger.py program, 483–486 XPath expressions for, 419 ch16_feed_related.py program, 491–495 XSLT processor for, 433 ch16_technorati_search.py program, 490–491 cPickle module, 297 ch17_feed_blog.py program, 515–523 CrispAds service, 513 ch17_feed_reposter.py program, 525–528 cron scheduler, 60 ch17_feed_to_javascript.py program, 530–533 cURL tool, 489 ch18_feed_to_ical.py program, 564–567 CVS commit triggers, 351 ch18_ical_to_hcal.py program, 543–547 CVS (Concurrent Version System) ch18_mod_event_feed_filter.py program, commercial programs for, 335, 351 559–563 definition of, 321–322 Channel Definition Format (CDF), 4 history events
588 Index ■ D–E
date ranges containing report of new or modified feeds, 80–86 Google searches using, 382–383 filtering to produce feeds from, 369–372 Yahoo! searches using, 389 format of, 357–359 DATE_HDR_TMPL template, 54, 57, 457, 520–521 producing feeds from
Index ■ E–F 589
extending feeds feed handler adding metadata to feeds, 538–539, 541, 557–564 building, 37–48 with calendar events, 541–557 Universal Feed Parser (Pilgrim), 48–49 commercial programs for, 569–570 Feed on Feeds feed aggregator (Minutillo), 61–62 definition of, 537 feed playlist format, 25 harvesting calendar events from feeds, 564–569 feed: protocol scheme, 26 structuring feeds with microformats, 539–541, feed reader, 4–5. See also feed aggregator 543–548
590 Index ■ F
feeds links on Web sites and blogs for, 23–24 aggregator for (continued) MIME types for, 25 downloading multimedia content using, 176–180 URI scheme for, 26 emailing individual new or modified feeds, 86–92 Web Services for, 32–37 emailing report of new or modified feeds, 80–86 fried, 226 finding feeds for, 24–37 hosting future of, 10–13 caching feeds, 229–230 generating audio files from, 160–167 compression used for, 230–233 mixed reverse-chronological aggregators, 7–8 delivering partial feeds, 240 module for reusable parts of, 77–79 how often to regenerate feeds, 226–227 parsing feeds for, 38–49 issues involving, 225–226 personalized portals, 5–7 metadata with update schedule hints, 237–239 retrieving only new or modified feeds, 67–77 redundant downloads, reducing, 233–237 scheduling, 60–61 scheduling feed generation, 227 subscribing to feeds for, 49–51 testing, 240 three-pane aggregators, 8–13 uploading with FTP, 227–229, 240 Web-based aggregators, 10 inserting items into audio, creating, 158–167 Amazon items, 506–512 baking, 226–229 commercial programs for, 513 blending daily links, 497–505 adding related links to feeds, 488–497 related links, 488–497 commercial programs for, 513 links inserting related items into feeds, 506–512 daily, mixing into feeds, 497–505 merging feeds, 483–487 related, adding to feeds, 488–497 mixing daily links into feeds, 497–505 sifting from feeds, 469–480 caching, 229–230, 575–584 merging, 483–487 compressing, 230–233 metadata with update schedule hints, 237–239 conversion of monitoring login activity using, 312–317 commercial programs for, 443–444 monitoring logs using, 290–294, 301–312 definition of, 418 monitoring projects in CVS repositories with feedparser, 437–443 history events, 324, 325, 333–339 with XSLT, building normalizer for, 420–432 log entries, 325–338 with XSLT, data model for, 418–419 monitoring projects in Subversion repositories, with XSLT, processor for, 433 340–350 with XSLT, testing normalizer for, 434–436 multimedia content in definition of, 3–4, 13–18 commercial programs for, 194–197 extending downloading from URLs, 171–175 adding metadata to feeds, 538–539, 541, 557–564 downloading using BitTorrent, 180–189 with calendar events, 541–557 downloading with feed aggregator, 176–180 commercial programs for, 569–570 finding, using RSS enclosures, 169–171 definition of, 537 importing MP3s into iTunes, 189–194 harvesting calendar events from feeds, 564–569 normalization of structuring feeds with microformats, 539–541, commercial programs for, 443–444 543–548 definition of, 417, 483 extracting calendar events from, 564–569 with feedparser, 437–443 filtering with XSLT, building normalizer for, 420–432 commercial programs for, 481 with XSLT, data model for, 418–419 definition of, 13 with XSLT, processor for, 433 by keywords and metadata, 445–450 with XSLT, testing normalizer for, 434–436 using Bayesian classifier, 452–469 parsing finding building feed parser, 38–46 clickable feed buttons, 24–26 conversion of feeds using, 437–443 feed autodiscovery, 26–32 date and time formats, handling, 47–48 feed directories, 32–37 definition of, 483 27_597582_bindex.qxd 8/5/05 10:15 PM Page 591
Index ■ F–G 591
extending feed parser, 47–48 sending as Instant Messages, 105–111 for feed aggregator, 38–49 subscriptions history events in CVS, 327–333 managing with IM chatbot, 114–125 HTML files into feeds, 548–557 subscribing to, 49–51 HTMLParser class, Python, 27–30, 253–263 update schedule hints in, 237–239 log entries in CVS, 327–333 validating, 220–222 testing feed parser, 46–47 FEEDS_FN constant, 52, 471, 516, 525 Universal Feed Parser (Pilgrim), 48–49, 171, 419, Feedsplitter feed conversion, 444 437–443 FEED_TAGLINE constant, 470 producing FEED_TITLE constant, 470 with Amazon Web Services, 394–412 FEED_TYPES constant, 29, 30, 33 from calendar events in HTML, 548–557 FEED_URI constant, 446 commercial programs for, 223–224 FEED_URL constant, 492, 499–500, 531, 564 from email, filtering feeds for, 369–373 fetch_items() method from email, generating feeds for, 363–369 AmazonSearchScraper class, 406 from email, mail protocol wrappers for, 360–363 AmazonWishlistScraper class, 410–411 from filtered email, 369–372 fetch_messages() method with Google Web services, 375–383 IMAP4Client class, 362–363 from HTML files, 209–215, 217–219 POP3Client class, 360–361 incrementally generating from logs, 294–300 filter_aggregator_entries() method, 466–467 with Yahoo! Web services, 383–394 filtering email to produce feeds, 369–372 publishing, 13 filtering feeds reading from PalmOS devices commercial programs for, 481 commercial programs for, 167 definition of, 13 feed aggregator for, 135–140 by keywords and metadata, 445–450 loading documents to, 141 using Bayesian classifier Plucker package for, 130–135 classifier for, 463–467 receiving in Usenet newsgroups, 92 feed aggregator for, 452–459 remixing, 417 feedback mechanism for, 459–463 republishing performing, 467–469 commercial programs for, 535–536 filter_messages() method, 364–365, 370–372 as group web log, 515–524 FILTER_RE constant, 446 in JavaScript includes, 529–535 _fin() method, 185–186 as web log with MetaWeblog, 524–529 findEntry() function, 456 retrieving, 37–38 findHTML() function, 202–203, 212 retrieving from IM chatbot, 114–125 finish() method, 552 retrieving only new or modified feeds FooListScraper class, 370 aggregator memory for, 67–77 4Suite package, 20–21, 433 emailing reports of, 80–91 fried feeds, 226 retrieving using iPod Notes FTP, baking feeds using, 227–229, 240 building feed aggregator for, 144–153 ftplib module, Python, 227–228 using feed aggregator for, 153–157 fuzzy logic, used in scraping, 244 scraping from CVS entries, 333–339 scraping from Subversion entries, 343–350 G scraping Web sites for GET, conditional, 234–236 building feed scraper, 244–253 _getCachedURIs() method, 581 commercial programs for, 287–288 getEnclosuresFromEntries() function, 171, definition of, 13, 243–244 177–178 with HTML Tidy, 274–278 get_entry_paths() method, 299 with HTMLParser class, 253–263 getFeedEntries() function, 53 problems with, 244 GetFeedFields() method, Syndic8, 34 with regular expressions, 264–274 GetFeedInfo() method, Syndic8, 34 tools for, 244 getFeedRecord() method, 578 with XPath, 274–275, 278–286 getFeeds() function, 30 27_597582_bindex.qxd 8/5/05 10:15 PM Page 592
592 Index ■ G–H
getFeedsDetail() function, 30–31, 34 handle_data() method, 207–208 GetFeedStates() method, Syndic8, 34 handle_endtag() method, 206–207, 262–263, __getitem__() method 553–555 CVSHistoryEvent class, 332 handle_entityref() method, 207–208 CVSLogEntry class, 331 handle_starttag() method, 206, 261–262, 552 EntryWrapper class, 56 hash() method, 73, 74–76 FeedEntryDict class, 246–247 hash_entry() method, 299 HTMLMetaDoc class, 203–204 hCalendar microformat, 539–541, 548–557. See also ICalTmplWrapper class, 546 calendar events ScoredEntryWrapper class, 454 HCalendarParser class, 549–556 TemplateDocWrapper class, 214–215 hcalendar.py module, 548–556 TrampTmplWrapper class, 402–403
tag, feeds as metadata in, 26–27 getNewFeedEntries() function HEVENT_TMPL template, 544, 546–547 agg02_pollsubs.py program, 70–74 history() method, 330 agglib.py module, 145–146, 177, 453 history operation, CVS getOwner() method, 117–118 definition of, 324, 325 getSubscriptionURI() method, 119 feed scraper for, 333–339 _getTorrentMeta() method, 184–185 parsing results of, 327–333 Gmail service, 373 HISTORY_FN constant, 516 Gnews2RSS, Google News scraper (Bond), 412 hosting feeds go() method, 118–119 caching feeds, 229–230 Google News scrapers, 412 compression used for, 230–233 Google Web services delivering partial feeds, 240 definition of, 375–376 how often to regenerate feeds, 226–227 license key for, 376–377, 379 issues involving, 225–226 searches, building feeds from, 378–383 metadata with update schedule hints, 237–239 SOAP interface for, accessing, 378 redundant downloads, reducing, 233–237 GOOGLE_LICENSE_KEY constant, 379 scheduling feed generation, 227 GOOGLE_SEARCH_QUERY constant, 379 testing, 240 GoogleSearchScraper class, 379–381 uploading with FTP, 227–229, 240 Graham, Paul (“A Plan for Spam”), 451 HotLinks link aggregator, merging feeds from, 485–486 Gregorio, Joe (httpcache module), 37–38Index ■ H–I 593
HTTPCache class, 395–396, 398, 490–491 DeliciousFeed class, 501 httpcache module (Gregorio), 37–38, 229–230 FeedCache class, 577–578 HTTPDownloader class, 172–175 FeedEntryDict class, 246 FeedFilter class, 447 I FeedNormalizer class, 438 iCalendar standard, 539, 541. See also calendar events FeedRelator class, 493 ICalTmplWrapper class, 546–547 GoogleSearchScraper class, 380 ICAL_URL constant, 544 HCalendarParser class, 549–550 ICS_FN constant, 564 HTMLMetaDoc class, 203–204 ID for Yahoo! Web services, 384–385 ICalTmplWrapper class, 546
594 Index ■ J–M
J LOG_CONF constant, 471 Jabber instant messaging. See also IM (Instant Messenger) login activity, monitoring, 312–317 commercial chatbots for, 126 LogMeister package, 318 definition of, 94–95 LOG_PERIOD constant, 345 handling connections from, 101–104 logs JabberConnection class, 101–104 CVS JabRSS chatbot, 126 parsing, 327–333 JavaScript includes, republishing feeds using, 529–535, querying, 325–327 536 scraping, 333–338 JavaScriptFeed class, 531–533 monitoring JS_FN constant, 531 Apache access logs, 304–312 js_format() method, 533 Apache error logs, 301–304 Julian date ranges, Google searches using, 382–383 commercial programs for, 317–318 julian.py module, 382–383 incremental feeds for, 294–300 login activity, 312–317 K UNIX system logs, 290–294 Kallen, Ian (“Industrial Strength Web Publishing”), 227 Windows Event Log, 290 keywords, filtering feeds by, 445–450 Subversion Kibo, 310–312 querying, 341–343 KNewsTicker (K Desktop Environment), 5 scraping, 343–350 L M last command, 313 M3U format, 25 Last-Modified header, 234, 235 Mac OS X learning machines, 4 creating audio feeds with Text-to-Speech, 158–167 license key, Google Web services, 376–377, 379 importing MP3s into iTunes, 189–194 tag, 15, 17, 26–27 installing CVS on, 324 LINKER_TMPL, 471–472 installing 4Suite on, 21 links installing HTML Tidy, 276 daily, mixing into feeds, 497–505 installing Python on, 20 indicating feeds, 23–24 installing Subversion on, 340 related, adding to feeds, 488–497 installing XPath on, 279 sifting from feeds loading Plucker documents to PalmOS, 141 building feed generator for, 469–478 NetNewsWire feed aggregator, 63–64 using feed generator for, 478–480 Radio UserLand feed aggregator, 62–63 LinkSkimmer class, 478 UNIX-based tools for, 19 LINK_TMPL template, 306–307, 335, 471–472 machine learning, 450 Linux MagpieRSS feed parser, 61 installing CVS on, 324 mailbox, accessing programmatically, 353–357 installing 4Suite on, 21 MailBucket service, 373 installing HTML Tidy, 276 MAIL_CLIENT constant, 369 installing Python on, 19 mailfeedlib.py module, 359–369 installing Subversion on, 340 MailScraper class, 363–368 installing XPath on, 279 makeEntryID() function, 456–457 loading Plucker documents to PalmOS, 141 Manalang, Rich (XSLT feed conversion), 444 login activity, monitoring, 312–317 MAX_ENTRIES constant UNIX-based tools for, 18–19 ch07_feedmaker.py program, 209–210, 212 _loadRecord() method, 581–582 ch17_feed_blog.py program, 516–517 loadSubs() function, 112 scraperlib.py module, 250, 253 localhost, SMTP server set to, 81 MAX_ENTRY_AGE constant, 471 LOCScraper class, 260–263 MAX_ITEMS constant, 508–509 log operation, Subversion, 342–343 md5 module, Python, 68, 75–76 LogBufferFeed class, 296–299, 302 md5_hash() function, 150 27_597582_bindex.qxd 8/5/05 10:16 PM Page 595
Index ■ M–O 595
merging feeds, 483–487 Multipurpose Internet Mail Extensions (MIME), 25, _messageCB() method, 103–104 357–359 tag, 208 My Netscape portal, 6 metadata adding to feeds, 538–539, 541, 557–564 N extracting from HTML files, 201–209
596 Index ■ O–P
Open Source projects, tracking (continued) personalized portals, 5–7 Subversion repositories PHP iCalendar, 541 commercial programs for, 335, 351 PHP includes, republishing feeds using, 536 definition of, 340 Pickle module, 297 finding, 340–341 Pilgrim, Mark installing, 340 “Normalizing Syndicated Feed Content”, 419 logs, querying, 341–343 Ultra-Liberal Feed Finder module, 36–37 logs, scraping, 343–350 Universal Feed Parser, 48–49, 419 Open Source Web community, 6 Planet feed aggregator, 535 openDBs() function, 70 playlist files, 25 osascript command line tool, 160, 165 Plone portal, 6 OSCAR (Open System for Communication in Real PLS format, 25 Time), 94 Plucker package for PalmOS os.popen function, Python, 160 components in os.walk() function, Python, 203 downloading, 131–132 list of, 131–132 P definition of, 130–131 PAGE_TMPL template, 54, 520–521 Desktop component, 132 PalmOS devices Distiller component commercial programs for, 167 building feed aggregator using, 135–140 definition of, 129 installing, 132 feed aggregator for, 135–140 using, 132–135 loading documents to, 141 Documentation component, 132 Plucker package for, 130–135 Extras component, 132 parse() function, 39–40, 550 PLUCKER_BPP constant, 134–135 parse() method, 578, 582–583 PLUCKER_DEPTH constant, 134–135 parse_file() method, 205, 259 PLUCKER_DIR constant, 134 parse_results() method, 389 PLUCKER_FN constant, 134–135 parse_uri() method, 550 PLUCKER_TITLE constant, 134 parsing podcast feeds, listings of, 179 conversion of feeds using, 437–443 podcasting, 169. See also multimedia content date and time formats, handling, 47–48 PointCast, Inc., 5 definition of, 483 pollFeed() method, 119–120 for feed aggregator polling, 225 building feed parser, 38–46 POP3 mailbox, accessing programmatically, 353–355 extending feed parser, 47–48 POP3 standard, 354 testing feed parser, 46–47 POP3Client class, 360–361 Universal Feed Parser (Pilgrim), 48–49 popen2 module, Python, 312 history events in CVS, 327–333 popen4() method, 314 HTML files into feeds, 548–557 poplib module, Python, 354 HTMLParser class, Python, 27–30, 253–263 portals, personalized, 5–7 log entries in CVS, 327–333 Post Office Protocol version 3 mailbox, accessing Universal Feed Parser (Pilgrim) programmatically, 353–355 definition of, 48 produce_entries() method downloading, 48–49 AmazonAdFeed class, 509–510 enclosures and, 171 AmazonScraper class, 400–402 normalizing and converting feeds with, 419, BayesFilter class, 466 437–443 CVSScraper class, 336 using, 49 DeliciousFeed class, 501–504 pattern recognition, used in scraping, 244 FCCScraper class, 272 PERC_STEP constant, 182 FeedFilter class, 448–449 performance. See bandwidth, hosting feeds and FeedMerger class, 485–486 Perl modules, installed, tracking (Hammersley), 317 FeedNormalizer class, 438–439 personal aggregators, 7–8 FeedRelator class, 493–495 27_597582_bindex.qxd 8/5/05 10:16 PM Page 597
Index ■ P–R 597
GoogleSearchScraper class, 380–381 Reinacker, Greg (Windows Event Log, monitoring), 318 HTMLScraper class, 257 remixing feeds, 417. See also conversion of feeds; LogBufferFeed class, 296–297 normalization of feeds MailScraper class, 364–365 repositories, CVS, finding, 322–324 ModEventFeed class, 562 reset() method, 41, 204–205, 552 RegexScraper class, 269 reset_doc() method, 205 SVNScraper class, 346–347 resources. See Web site resources XPathScraper class, 283–284 REST style of Web services YahooNewsScraper class, 392–393 for Amazon, 395 YahooSearchScraper class, 388 for Yahoo!, 384 PROG_OUT constant, 182 Reverend (Divmod), 451–452
598 Index ■ S
S sifting feeds _saveRecord() method, 582 building feed generator for, 469–478 saveSubs() function, 112 using feed generator for, 478–480 say command, Mac OS X, 158–160 SITE_NAME constant, 301, 305, 313 scheduled tasks, Windows XP, 60–61 SITE_ROOT constant, 305 ScoredEntryWrapper class, 453–454