webbots2e.book Page 345 Thursday, February 9, 2012 10:08 AM

INDEX

Symbols & Numbers log record of, 268 spoofing, 29, 280, 311, 316, 329 & (ampersand), in GET method, 67 aggregating information by $address array, 156–157 relevance, 16 $content_type variable, 158 aggregation webbots, 92, 129–137 $data_array, 168 CDATA, 135 for LIB_http library functions, 35 $FETCH_DELAY, 175, 180 choosing data sources, 130 $filter_array, 130, 135–136 downloading and parsing $_GET array, 76 script, 134 $link_array elements, 111 and filtering, 135–137 $page_base variable, 106 RSS feeds, 131–133 $_POST array, 76 writing, 133–135 $result array, FILE element, 51–52 Alexa web-monitoring service, 305 $status_code_array, 114–115 “all rights reserved” notice, 320 . (period), as POP3 end-of-message Amazon Web Services, SOAP indicator, 147 interfaces, 305 ? (question mark), in GET ampersand (&), in GET method, 67 method, 67 anchor tags. See links 404 Not Found error, 287, 338 Andreessen, Marc, 1 anonymity A as a process, 283 in commercial email, 155 abstractions, of program anti-pokerbot software, 18 interface, 81 Anti-Spam Law, Virginia, 324 access log file, 29 Apache error logging in, 266 cookies, 202 and webbot detection, 266–267 headers, 33, 190 access rights, verifying, 20 installing PHP on, 30 action attribute log files, 266–267 of form, 65, web server, 6 for form analyzer, 71 Application Program Interfaces action of person, simulating, 123 () $address array, 156–157 Amazon, 131 agent name eBay, 131 default for, LIB_http, 31 Google, 127 defining for PHP/CURL Google Maps, 131 session, 330

Webbots, Spiders, and Screen Scrapers, 2nd Edition © 2012 Michael Schrenk webbots2e.book Page 346 Thursday, February 9, 2012 10:08 AM

archive_links() function, 178 batch file, for webbot, 216 ARPANET, 139 Bcc: address field, 156–157 array Beck & Tysver legal website, 319 assigning parsed data to, 98 Bidder’s Edge spiders, 323 elements, form data as, 68 bids, timing placement of, 188, 191 of tags, src attribute Bina, Eric, 1 from, 61 binary-safe download routine, parsing 103–104 data set into, 41–42 biometrics, 198–199 table into, 96 Bing, spiders used by, 173 attributes, parsing values, 42–43 blobs, storing images as, 83–84 audience, for Internet, 2 blogs authentication, 190–208 aggregation of, 131 basic, 199–202 laws concerning, 324, 325 curl_setopt() function searching for spelling errors, 137 options for, 332 botnet management, 255–260 by PHP/CURL, 28 assigning tasks, 258–260 test pages, 201 communication methods, 255 of buyer by procurement determining tasks, 257 webbot, 186–187 performing tasks, 260 default response to request, 28 polling the botnet server, for deterring webbots, 314 256–257 digest, 201–202 task checkout, 258 and encryption, 202 uploading botnet data, 260 example scripts and practice broken links, webbot detecting, pages, 199 109–116 FTP, 140 browser buffering, 27 with query string sessions, browser-like webbots, 230 205–207 browser macros, 227–247 session, 202–205 adding functionality to, 240–247 of snipers, 189, 213 browser-like webbots, 230 strengthening by combining commands, 235 techniques, 198–199 creating your first, 231–233 types, 198 initialization, 233 automating tasks, 19 recording, 232 defined, 230 B dynamic macros, 241–245 integrating data with, bandwidth 242–245 consumption, 187, 225 scripts that create, 241 hijacking, 104, 291 hacking, 239–247 stealing, 30 installing, 230–231 base64-encoding, 84 iMacros Scripting Engine, rea- basic authentication, 199–202 sons not to use, 240–241 curl_setopt() function options launching automatically, for, 332 245–246 by PHP/CURL, 28 in Linux, 246 test pages, 199 in Windows, 245

346 INDEX Webbots, Spiders, and Screen Scrapers, 2nd Edition © 2012 Michael Schrenk webbots2e.book Page 347 Thursday, February 9, 2012 10:08 AM

necessity for, 237 code overcoming barriers with, 230 in book, 4–5 reasons to use, 227–229 libraries available online, 5 running, 237 collusion webbots, 17 suggested standard initialization command shell of, 235–237 executing webbots in, 26 browsers, 1–2 leveraging operating emulating, 75. See also browser system with, 254 macros and spider scripts, 181 executing webbots in, 26–27 comma-separated value (CSV) files inspiration from limitations of, file() function for down- 15–18 loading, 27 problem with, 2 iMacros file format, 236 search engine treatment vs. Common Object Request treatment of webbot, 126 Broker Architecture tabbed browsing, 129 (CORBA), 305 business leaders, webbot benefits communication, on incompatible for, 11–12 systems, 21–22 buy-it-now auction purchases, 189 competitive advantage, 9–13, 64, 74, 191, 265, 286, 310 C Completely Automated Public Turing test to tell Com- CamelCase, 78 puters and Humans CAPTCHA (Completely Automated Apart (CAPTCHA), Public Turing test to tell 314–315, 325 Computers and Humans compressing data, 86–88 Apart), 314–315, 325 computers. See also server Cascading Style Sheets (CSS), distributing tasks across 17, 230 multiple, 254 impact of removing constructive hacking, 11 HTML tags, 89 Content-Type line case for email message, 148 for naming, 111 in HTTP header, 34 sensitivity, stristr() function vs. $content_type variable, 142 strstr() function, 137 converting website into function, Cc: address field, 156–157 163–170 CDATA tags, 134–135 COOKIE_FILE, 212–214 certificates, 195 cookies Children’s Online Privacy Protec- about, 209–211 tion Act (COPPA), 154 adapting to management ciphers, 193, 195 changes, 294 client-server technology, 2 for authentication, 202–205 client URL Request Library defaults for, 31 (cURL), 6, 21, 23 deleting, 212, 294 clipping service, online, 20, 137 for deterring webbots, 313 clocks, synchronization for expiration dates for, 212–213 sniper, 189 and forms, 70

Webbots, Spiders, and Screen Scrapers, 2nd Edition INDEX 347 © 2012 Michael Schrenk webbots2e.book Page 348 Thursday, February 9, 2012 10:08 AM

cookies, continued CURLOPT_HTTPHEADER option, 331 managing multiple users’, CURLOPT_MAXREDIRS option, 288, 329 213–214 CURLOPT_NOBODY option, 330 persistence with, 212 CURLOPT_PORT option, 333 PHP/CURL to read and write, 29 CURLOPT_POSTFIELDS option, 332 purging temporary, 212–213 CURLOPT_POST option, 332 restrictions, with proxies, 206 CURLOPT_REFERER option, 329 viewing, 210–211 CURLOPT_RETURNTRANSFER and webbot design, 211 option, 329 COPPA (Children’s Online Privacy CURLOPT_SSL_VERIFYHOST option, 195 Protection Act), 154 CURLOPT_SSL_VERIFYPEER option, copyright issues, 85, 319–322 195, 332 “all rights reserved” notice, 320 CURLOPT_TIMEOUT option, 294–295 and facts, 321 CURLOPT_UNRESTRICTED_AUTH fair use laws, 322 option, 332 registration, 320 CURLOPT_URL option, 329 CORBA (Common Object Request CURLOPT_USERAGENT option, 330 Broker Architecture), 305 CURLOPT_USERPWD option, 332 crawlers. See spiders CURLOPT_VERBOSE option, 333 cron command, 215 executing, 333–334 cryptography, 193 custom logs, and webbot CSS (Cascading Style Sheets), detection, 268 17, 230 impact of removing D HTML tags, 89 daily scheduling of webbots, 217 CSV (comma-separated value) files, data file() function for down- fields in forms, 65, 66 loading, 27 networks, access and abuse, 323 iMacros file format, 236 set, parsing into array, 41 cURL (client URL Request sources, choosing for aggrega- Library), 6, 21, 23 tion webbot, 130 curl_error() function, 334–335 $data_array, 168 curl_exec() function, 334 for LIB_http library functions, 35 curl_getInfo() function, 334 database curl_init() function, 328 for saving links, 181–182 curl_setopt() function, 69, 104, storing images in, 83–84 328–333, 334 storing text in, 80–83 case sensitivity, 333 data management, 77–90 CURLOPT_COOKIEFILE option, 205, organizing data, 77–85 214, 331 naming conventions, 77–78 CURLOPT_COOKIEJAR option, 205, storing images in database, 214, 331 83–84 CURLOPT_FOLLOWLOCATION storing text in database, option, 329 80–83 CURLOPT_HEADER option, 330 structured files, 79–80

348 INDEX Webbots, Spiders, and Screen Scrapers, 2nd Edition © 2012 Michael Schrenk webbots2e.book Page 349 Thursday, February 9, 2012 10:08 AM

reducing size, 85–90 Digital Encryption Standard data compression, 86–88 (DES), 195 removing formatting, 88–89 directories, 79 storing references to image script for creating, 104 files, 85 disclaimer, 6 thumbnailing images, 89–90 disk swapping, 181 data-only interfaces, 301–307 Distributed Component Object lightweight data exchange, Model (DCOM), 305 302–305

tags, parsing data into array, 95 REST (Representational State DOS (denial-of-service) attacks, pre- Transfer) 306–307 venting, 180, 252–253, 330 SOAP (Simple Object Access download_binary_file() function, Protocol), 170, 305–306 103–104 XML (eXtensible Markup download_images_for_page() Language), 131, 301–302 function, 105 tags, for insertion parse, downloading 123–124 with FTP, 139–143 dates, in filenames, 79–80 with LIB_http, 23–35 DCOM (Distributed Component with link-verification webbot, Object Model), 305 109–110 decode_zipcode() function, 165 linked page, 113 deep linking, 291 with PHP built-in functions, default file, for web page, 24 25–27 delays, inserting between page with PHP/CURL, 27–35 fetches, 270 web pages, 23–35 DELE command (POP3), 149 download_parse_rss() function, deleting 133, 134 cookies, 212 HTML formatting, 88–89 E unwanted text, 43–44 white space, 89 eBay, 19, 130, 151, 188, 237, 306, delimiters 310, 323 parsing text between, 40 snipers and, 187 splitting string at, 39 Electronic Frontier Foundation deployment of webbots. See scaling (EFF), 325 denial-of-service (DoS) attacks, pre- email venting, 180, 252–253, 330 guidelines, 154 DES (Digital Encryption headers, 148 Standard), 195 keeping legitimate out of spam describe_zipcode() function, 167–169 filter, 158–159 developers, webbot benefits for, for notification 9–11 of FTP transmission difficult websites, scraping, 227–247 failure, 140–141 digest authentication, 201–202 of webbot action, 161 digital certificate, 194–196 placing account information in script, 150

Webbots, Spiders, and Screen Scrapers, 2nd Edition INDEX 349 © 2012 Michael Schrenk webbots2e.book Page 350 Thursday, February 9, 2012 10:08 AM

email, continued F reading with webbots, 145–152 facts, and copyright, 321 sending, 153–161 fair use laws, 322 HTML-formatted, 159–160 fault-tolerant webbots, 285–296 with mail() function, 154–155 cookie management notifications with webbots, changes, 294 157–158 form changes, 292–293 with PHP, 154–155 network outages and conges- undeliverable as alert to invalid tion, 294–295 address, 160–161 page content changes, 291–295 as webbot trigger, 223 URL changes, 286–291 email-controlled webbots, 151–152 page redirection, 288–290 encryption, 193–196 and referer values accuracy, authentication and, 208 290–291 certificate, 195–196 requests for nonexistent for deterring webbots, 312 pages, 292 webbots using, 194 $FETCH_DELAY, 175, 180 end-of-message indicator fgets() function, 25, 27 (POP3), 147 file() function, downloading environments, 250–252 files with, 27 many-to-many, 251 file handle, 25, 27 many-to-one, 252 filesystem, geographically one-to-many, 250 structured, 80 one-to-one, 251 File Transfer Protocol (FTP) error server, connecting to, 141 handlers, 295–296 webbots, 139–143 information $filter_array, 130, 135–136 from http_get() function, 32 filtering from http_get_withheader() by aggregation webbot, 135–137 function, 32 information by relevance, 16 logs, and webbot detection, Flash 267–268 barrier to effective eval() function, 303 webscraping, 229 event triggers, 70 for deterring webbots, 314 exclude_link() function, 179–180 for website navigation, prob- exclusion list, for spiders, 180 lems caused by, 229 executing webbots fopen() function, 25 in browsers, 26–27 format of names, 78–79 in command shell, 26 formatted_mail() function, 156 exe_sql() function, 82–83 form data variables, 66 expiration dates, for cookies, 203, forms 209–210 adapting to changes in, 292–293 eXtensible Markup Language analyzing, 71–74, 165 (XML), 301–302 avoiding errors, 75–76 assigning tasks, 258, 260 and cookies, 70 overhead, 302 emulation, 64 for RSS feeds, 131 legal issues and, 64

350 INDEX Webbots, Spiders, and Screen Scrapers, 2nd Edition © 2012 Michael Schrenk webbots2e.book Page 351 Thursday, February 9, 2012 10:08 AM

handlers, 65–66 get_base_page_address() function, 106 input tags, 66 get_domain() function, 178–179 interfaces, reverse engineering, get_http() function, 95 64–65 GET method, 61 main parts, 65 and errors, 267 source code http_get() function for down- displaying, 166 loading with, 31–32 saving, 166 vs. POST method, 68 submission, 63–76, 167 Google data fields in forms, 66 bombing, 298 event triggers, 70 developer API, 127 form handlers, 65–66 spiders used by, 173 GET method, 67–68 GoogleRankings.com, 118 PHP/CURL for, 28 graphics. See images POST method, 68 unpredictability, 70 H

tag, action attribute, 66 hacking fputs() function, 89, 107, 149 constructive, 11 From: address field, 156 iMacros, 239–247 FTP (File Transfer Protocol) webbot activity misinter- server, connecting to, 141 preted as, 266 webbots, 139–143 handle for file, 25 ftp_cdup() function, 142 handshake process, 195 ftp_chdir() function, 142 hard drives, compressing files on, ftp_delete() function, 142 87–88 ftp_get() function, 142 hardware requirements, 5–6 ftp_mkdir() function, 142 harvest, separating from ftp_put() function, 142 payload, 181 ftp_rawlist() function, 142 harvest_links() function, 177–178 ftp_rename() function, 142 hash, 157–158 ftp_rmdir() function, 142 haystack, 44 fully resolved URLs, 212 header tags, and search engine functions. See also individual optimization, 299 function names headers converting website into, 163–170 in email, 147–148 describe_zipcode() function, redirection, 113, 288 167–169 tag, detecting redirection, interface definition, 168 288–290, 312 submitting form, 168 Hello World! web page, 25 target page analysis, 165–167 hijacking bandwidth, 104, 291 holidays, scheduling G webbots on, 270 garbage collection, by PHP, 335 Hormel Foods Corporation, 153n geographically structured hotel room prices, aggregating and filesystem, 80 filtering data, 16 $_GET array, 76 href attribute get_attribute() function, 42–43, extracting value, 112 61, 107 of link tag, parsing, 42–43

Webbots, Spiders, and Screen Scrapers, 2nd Edition INDEX 351 © 2012 Michael Schrenk webbots2e.book Page 352 Thursday, February 9, 2012 10:08 AM

HTML (Hypertext Markup images Language) borrowing from other sites, 104 for formatting email, 159–160 storing in database, 83–84 parsing thumbnailing, 89–90 content of reoccurring tags, tags 41–42 alt attribute, 300 poorly written web pages, 38 parsing from downloaded web text between tags, 40 page, 106–107 removing formatting, 88–89 src attribute from array, htmlspecialchars() function, 313 parsing, 43 HTMLTidy (Tidy), 38, 46 incompatible systems, communica- HTTP tion on, 21–22, 151 header, 31–32 index file, for web page, 24 exchanging cookies in, 202 indexing web pages, by search and security, 68 engine spider, 300 protocol, 25 infinite loops, preventing, 330 port for, 256 information, aggregating and filter- status codes, 133–134, 337–338 ing by relevance, 16 HTTP codes, 337–338 initialization from http_get_withheader() func- download_images_for page() tion, 33 function, 102–103 http_get_form() function, 35 link-verification webbot, http_get_form_withheader() 109–110 function, 35 search-ranking script, 121–123 http_get() function, 31–32, 35 input tags in forms, 66 http_get_withheader() function, insert() function, 81–82 32–33, 35 insertion parse, 123–125 http_header() function, 35 installing http_post_form() function, 35, 168 HTMLTidy, 28 http_post_withheader() function, 35 iMacros, 230–231 http() routine, 31 PHP/CURL, 30 HTTPS protocol, 194 intellectual property human patterns, webbot simulation law, 318–319 of, 269–272 protecting, 19–20 Hypertext Markup Language. interfaces, data-only, 301–307 See HTML (Hypertext Internet Markup Language) access to, 6 audience for, 2 I customizing for business, 12 law, 324–325 iMacros. See browser macros , setting webbot image-capturing webbots, 101–108 name to, 75 binary-safe download routine, Internet Protocol (IP) addresses, 103–104 275–276 directory structure, 104–105 intranet, 6 execution, 103 IP (Internet Protocol) addresses, main script, 105–107 275–276 image-processing loop, 107

352 INDEX Webbots, Spiders, and Screen Scrapers, 2nd Edition © 2012 Michael Schrenk webbots2e.book Page 353 Thursday, February 9, 2012 10:08 AM

J LIB_mail library, 156–157 LIB_mysql library, 80, 81 applets, for deterring exe_sql() function, 81–82 webbots, 314 insert() function, 81–82 JavaScript update() function, 82 for data manipulation, 70 LIB_parse library, 37–46 deleting, 43–44 LIB_pop3 library, 149 for deterring webbots, 313 LIB_resolve_addresses library, 109 as event trigger, 70 LIB_rss library, 133 impact of removing LIB_simple_spider library, 176–180 HTML tags, 89 LIB_thumbnail library, 89–90 impact on spider indexing, 180 lightweight data exchange, 302–307 redirection with, 290 $link_array elements, 111 link-verification webbots, 109–115 K advanced options, 115 Kelly v. Arriba Soft, 322, 324 displaying page status, 114 keywords downloading linked page, 113 in meta tags, 299 flowchart, 110 spamming, 299 generating fully resolved URLs, 112–113 L initialization and downloading target, 109–110 landmark, 96 parsing links, 111 for end of data, 96 running, 114–115 to identify table, 241 setting page base, 111–112 for table heading row, 96 verification loop, 111–112 using least likely to change, 291 links legal issues. See also copyright issues broken, using webbot to detect, for email, 154 109–115 in form emulation, 64 href attribute of tag, parsing, Internet, 324–325 42–43 website policies and, 316 impact of removing HTML tags, legitimate mail, keeping out of 88–89 spam filters, 154 parsing, 111 LIB_download_images library, 102 relative, page base for, 106 LIB_http_codes library, 114, 337–338 saving in database, 181–182 LIB_http library well-defined, and search engine default conditions for, 31 ranking, 289 downloading with, 28–35 Linux, scheduling in, 215 file for storing cookies, 205 LIST command (POP3), 147 for form analysis emulation, Location: line, in HTTP header, 288 71–74 log files for form emulation, 67–69 software for monitoring, 268–269 source code, 34 webbot detection with, 266–269 defaults, 30, 35 logging in, to POP3 mail server, 146 functions, 31–34, 35 login criteria, 198

Webbots, Spiders, and Screen Scrapers, 2nd Edition INDEX 353 © 2012 Michael Schrenk webbots2e.book Page 354 Thursday, February 9, 2012 10:08 AM

M O Mac OS X, scheduling in, 215 obfuscation, 193, 313 macros, browser. See browser macros obsolete web pages, risk of mail() function, 154–155 targeting, 287 maximum penetration level for online spider, 174 auctions, automating bidding in, Message Digest Algorithm (MD5), 19, 63 195, 202 clipping service, 20–21 meta tags, 41–42, 127, 299 purchases, automating, 185–192 MIME type, 34, 148, 159, 307, 331 opening tags, for function mkdir() function, 104–105 parameter, 41 mkpath() function, 102, 105 open proxies, 279–280 monthly scheduling of webbots, opensocket() function, 149 217, 219, 220 optimizing website performance, 17 Mosaic, 1 organic placements in search MSN, spidering Google, 127 results, 118–119 MySQL, 6, 80, 81–84 organizing data, 77–85 naming conventions, 77–78 N storing images in database, 83–85 naming storing text in database, 80–83 conventions, 78–79 structured files, 79–80 data fields, 66 outgoing header message, from webbots, 75 PHP/CURL session, 331 National Oceanic and Atmospheric overhead, in XML file, 302 Association (NOAA), 163 needle, 44 network P socket, 25 package-tracking information, 145 adapting to outages and packet sniffer, 208n, 237 congestion, 286 page base Next button, simulating person defining, 106 clicking, 123 setting, 110–111 NOAA (National Oceanic and Atmo- $page_base variable, 106 spheric Association), 163 page redirection, 288–290 nofollow option, for robots CURLOPT_FOLLOWLOCATION meta tag, 312 option for, 329 noindex option, for robots for deterring webbots, 313 meta tag, 312 page signature, 157 non-ASCII content, and search paid placements in search results, engine spiders, 301 118–119 nonexistent web pages parse_array() function, 41–42, 52, avoiding requests for, 286–288 61–62, 95, 106, 111, 125 containing forms, 292 parse tolerance, 291 timeouts to deal with, 331 parsing, 37–62 null string, replacing text with, 45 attribute values, 42–43 data set into array, 41–42

354 INDEX Webbots, Spiders, and Screen Scrapers, 2nd Edition © 2012 Michael Schrenk webbots2e.book Page 355 Thursday, February 9, 2012 10:08 AM

image tags from downloaded version 5 support for SOAP, 305 web page, 106 website, 6 with LIB_parse, 39–44 PHP/CURL, 28 links, 111 and certificates, 195 poorly written HTML, 46 and cookies, 202 position vs. relative, 291–292 downloading with, 28–30 with regular expressions, 49–62 encryption and, 194 src attribute, from array of for following header redirec- tags, 41 tions, 288, 329 standard routines for, 38 installing, 30 text between delimiters, 40 sessions unformatted text, 45 closing, 335 passwords, 198 creating minimal, 327–328 for deterring webbots, 314 initiating, 328 pattern matching, with regular retrieving information expressions, 50 about, 334 alpha, 53 setting options, 328–333 alternate matches, 54 viewing errors, 334–335 grouping, 55 PHP Extension and Application numbers, 53 Repository (PEAR), 305 character sets, 53 php.ini file, editing to show mail ranges, 55 server location, 154–155 wildcards, 54 plotting Wi-Fi networks, 21 payload for spider, 175, 181 pokerbots, 17–18 separating from harvest, 182 POP3 protocol (Post Office Proto- pay-per-click advertising, 325 col 3), 146–152 PEAR (PHP Extension and Applica- authentication failure, 146 tion Repository), 305 executing commands with penetration level for spider, 174 webbots, 149–151 period (.), as POP3 end-of-message port indicator, 147 for HTTP and HTTPS proto- periodicity of webbots, 217, 225 cols, 194 permanent cookies, 209–210 for POP3 server, 146 persistence with cookies, 209 position parsing, avoiding, 291 phishing attack, 154 $_POST array, 76 phone numbers, parsing with regu- POST method, 68–69 lar expressions, 55–59 and errors, 267 PHP, 4–5, 6 Post Office Protocol 3 (POP3), configuring to send email, 146–152 154–155 authentication failure, 146 downloading executing commands with with built-in functions, 25–27 webbots, 149–151 with scripts, 23 preg_match_all() function, 51 and FTP, 142 preg_match() function, 51 functions, 44–46 preg_replace() function, 50 for compressing data, 86–87 preg_split() function, 52 and SSL, 194

Webbots, Spiders, and Screen Scrapers, 2nd Edition INDEX 355 © 2012 Michael Schrenk webbots2e.book Page 356 Thursday, February 9, 2012 10:08 AM

price-monitoring webbots, 93–100 spoofing, 280 parsing script, 96–99 transparent, 280 target, 94 reasons developers use, 274–277 procurement bot, 185–192 anonymity, 274–276 purchase criteria, 188 relocation, 277 purchase triggers, 187 Tor, 281–282 theory, 186–191 configuration for project ideas, 15–22 PHP/CURL, 282 automating tasks, 19 disadvantages of, 282 communicating on incompati- using, 277 ble systems, 21–22 in a browser, 278 consolidating industry news with PHP/CURL, 278 articles, 18–19 public, capitalizing on inexperi- intellectual property protection, ence with webbots, 12 19–20 purchase online clipping service, 20 criteria, for procurement plotting Wi-Fi networks, 21 bot, 186 pokerbots, 17–18 triggers, for procurement tracking web technologies, 21 bot, 187 verifying access rights, 20 WebSiteOptimization.com, 17 Q projects query string sessions, authentica- aggregation webbots, 129–137 tion with, 205–207 converting website into func- question mark (?), in GET tion, 163–170 method, 67 FTP webbots, 139–143 QUIT command (POP3), 149 image-capturing webbots, 101–108 link-verification webbots, R 109–115 random delay, 123 price-monitoring webbots, ranking web pages, by search 93–100 engine spider, 298 search-ranking webbots, reading mail from POP3 server, 117–127 145–152 sending email with webbots, realm, 200 152–161 Real Simple Syndication (RSS) reading email with webbots, feed, 130, 131–132 145–152 redirection, 288–290 proxies, 273–284 CURLOPT_FOLLOWLOCATION commercial, 282 option for, 329 cookie restrictions with, 206 for deterring webbots, 313 creating a service, 283–284 with PHP/CURL, 29 defined, 273–274 references to image files, storing, 85 listing services, 280 referer open, 277–280 management, with anonymous, 280 PHP/CURL, 30 dark side of, 280 variable, 32, 73, 104, 329

356 INDEX Webbots, Spiders, and Screen Scrapers, 2nd Edition © 2012 Michael Schrenk webbots2e.book Page 357 Thursday, February 9, 2012 10:08 AM

regular expressions, 39, 49–62 respect, 318–319 advanced parsing with, 49–62 REST (Representational State avoiding, 39, 47 Transfer), 306–307 disadvantages of, 60–61 $result array, FILE element, 51–52 complicating code, 61 RETR command (POP3), 147–148 confusing choices, 61 return_between() function, 40, 61, 134 difficulty debugging, 61 Return-path: address field, 157 lack of context, 60 reverse engineering form inter- functions, 50–52 faces, 64–65 preg_match(), 51 robot exclusion file, 311 preg_match_all(), 51 robots meta tag, 312 preg_replace(), 50 robots.txt file, 311–312 preg_split(), 52 root resemblance to PHP directory, creating for imported built-in, 52 file structure, 106 pattern matching with, 50 domain, parsing from target alpha, 53 URL, 178–179 alternate matches, 54 RPC (Remote Procedure Call), 305 character sets, 53 RSET command (POP3), 149 grouping, 55 RSS (Real Simple Syndication) numbers, 53 feed, 130, 131–132 ranges, 55 wildcards, 54 S parsing phone numbers with, 55–59 sale item, verifying availability, 187 speed of, vs. PHP built-in saving functions, 62 links in database, 181–182 types of, 50 source code for form, 165 PCRE, 50 scaling, 249–262. See also botnet POSIX, 50 management when to use, 60 causing DoS attacks, 252–253 relational database, 77 environments 250–252 relative links, page base for, 106, many-to-many, 251 110, 111, 115 many-to-one, 252 relay host, 155 one-to-many, 250 relevance, aggregating and filtering one-to-one, 251 information by, 15 multiple instances, creating, Remote Procedure Call (RPC), 305 253–254 remote server, using PHP/CURL to distributing tasks, 254 execute webbot on, 216 forking, 253–254 remove() function, 43–44 leveraging the operating replacing portion of string, 45 system, 254 Reply-to: address field, 157 scheduling, 215–225 Representational State Transfer adding variety to, 225 (REST), 306–307 complex, 218–219 resolve_address() function, 113 disabling, 296 resources, distributing, 169–170 for distributed spider, 183

Webbots, Spiders, and Screen Scrapers, 2nd Edition INDEX 357 © 2012 Michael Schrenk webbots2e.book Page 358 Thursday, February 9, 2012 10:08 AM

scheduling, continued session and stealth, 270 authentication, 202–207 webbots to run daily, 217–218 ID, forms with, 66 webbots to run monthly, 219 with proxies, 278 Task Scheduler, value, dynamically assigned, 220, 223 167–168 Windows XP Task Scheduler, set_time_limit() function, 175, 295 216–219 Short Message Service (SMS), 161, scraping, difficult websites, 227–237 341–344 scripts, 3, 4–5 Simple Object Access Protocol writing in small steps, 46 (SOAP), 170, 305–306 search engine simulating action of person, optimization, 118, 252, 297–300 269–270 spiders, 173 single points of failure, design techniques hinder- avoiding, 225 ing, 315–316 size reduction, 85–90 indexing web pages with, 298 data compression, 86–88 Terms of Service agreement, removing formatting, 88–89 126, 310 storing references to image search-ranking webbots, 117–127 files, 85 fetching search results, 123 SMS (Short Message Service), 161, how they work, 120–121 341–344 initializing variables, 121–122 snipers, 185–192 parsing search results, 123–126 authentication, 189 running, 120 clock synchronization, 189–190 search results page description, testing, 191 118–119 SOAP (Simple Object Access starting loop, 122 Protocol), 170, 305–306 what they do, 120 socket management, with search results page, parts of, 118–119 PHP/CURL, 30 search term, in URL, 122 software Secure Sockets Layer (SSL), for monitoring logs, 268–269 193–194 requirements for, 6 CURLOPT_SSL_VERIFYHOST source code option for, 195 configuration area of CURLOPT_SSL_VERIFYPEER LIB_mysql,81 option for, 195 for form sites, downloading images displaying, 165 from, 103–104 saving, 166 seed URL, 174 spam, 153–154, 255, 298 sending email, 153–161 filters, 154 server keeping legitimate mail out of, avoiding undue load on 158–159 target, 324 keywords, 299 error log, form errors in, 75–76 law, 324 obtaining clock value, 189–190 spam indexing, 298 remote, using PHP/CURL to special characters, 122, 313 execute webbot on, 216

358 INDEX Webbots, Spiders, and Screen Scrapers, 2nd Edition © 2012 Michael Schrenk webbots2e.book Page 359 Thursday, February 9, 2012 10:08 AM

spiders, 173–183 status messages, quantity created in adding payload, 181 file transfer, 333 distributing tasks across multiple stealth, 265–272 computers, 182 reasons for, 265–266 examples, 175–176 and scheduling, 270 experimenting with, 180–181 simulating human patterns in how they work, 174 order to achieve, 269–270 LIB_simple_spider library, Stenberg, Daniel, 28 176–180 strings archive_links() function, 178 detecting within strings, 44–45 exclude_link() function, measuring similarity of, 46 179–180 replacing portion of, 45 get_domain() function, splitting at delimiter, 37 178–179 strip_cdata_tags() function, 39 harvest_links() function, strip_tags() function, 61, 88 177–178 stristr() function, 44–45 maximum penetration strops() function, 124 level for, 174 str_replace() function, 45 options for treating strstr() function, 45 unwanted, 316 structured files, 79–80 potential ideas for, 173–174 Structured Query Language regulating page requests of, 183 (SQL), 80 saving links in database, 181–182 submit button, 66 of search engines, 126–127 substr() function, 52, 124 setting traps for, 315–316 synchronization, 21–22, what to do with unwanted, 316 of clocks for snipers, 189 split_string() function, 39 splitting string, at delimiter, 39 T SQL (Structured Query Language), 80 tables src attribute, from array of parsing data in, 96 tags, parsing, 43 using landmarks to identify, SSL (Secure Sockets Layer), 193–194 291–292 CURLOPT_SSL_VERIFYHOST tags. See individual tag names option for, 195 targets, 3–4 CURLOPT_SSL_VERIFYPEER validation in option for, 195 download_images_for_page() sites, downloading images function, 102–103, 105 from, 103–104 target URL, defining for $status_code_array, 114–115 PHP/CURL session, 329 status codes, 337–339 tasks, automating, 19 HTTP, 337–338 Task Scheduler (Windows 7), NNTP, 339 220–223 status of request, from Task Scheduler (Windows XP), http_get_withheader() 216–219 function, 32 complex scheduling, 218–219

Webbots, Spiders, and Screen Scrapers, 2nd Edition INDEX 359 © 2012 Michael Schrenk webbots2e.book Page 360 Thursday, February 9, 2012 10:08 AM

Telnet, 2, 28 U for executing POP3 commands, undeliverable mail, using to prune 146–148 access lists, 152 temporary cookies, 209–210 unformatted text, parsing, 45 purging, 212–213 unique keywords, 299 Terms of Service agreements, Unix 126, 310 scheduling in, 215 for search engines, 118 timestamp, 190 text unsubscribe options, for email 154 embedding in other media, unwanted text, deleting, 43–44 314–315 update() function, of LIB_mysql, 82 messaging, 161, 341–344 updating website, frequency for parsing unformatted, 45 deterring webbots, 314 removing unwanted, 43–44 uploading files, with FTP, 141 storing in database, 80–83 urlencode() function, 112 thumbnailing images, 89–90 URLs Tidy (HTMLTidy), 38, 46 adapting to changes, 286–291 time page redirection, 288–290 required for downloading linked referer values’ accuracy, pages, 114 290–291 running webbot during busy, 270 requests for nonexistent timeout pages, 286–291 curl_setopt() function for, 294, defining target for PHP/CURL 331 session, 329 default for, 31 fully resolved, 112–113 and spiders, 175 US Copyright Office, 320, 321 in PHP, changing, 295 usernames, 199 for PHP/CURL, 294, 331 timestamp, Unix, 167 tag, and spiders, 298 V TLS (Transport Layer Security), 194 validation point, for downloaded Tor, 281–282 web page, 287–288 configuration for variables, passing to webbots, PHP/CURL, 282 304–306 disadvantages of, 282 verification loop, 110–111 tracking web technologies, 21–22 Virginia, Anti-Spam Law, 324 TrackRates.com, 16 virtual private networks (VPNs), 198 transactional websites, 192 virtual property, laws governing, 324 transfer protocols, PHP/CURL sup- VPNs (virtual private networks), 198 port for, 28 Transport Layer Security (TLS), 194 W trespass-to-chattels law, 241, 253, 271, 322–324 W3C (World Wide Web Consor- triggers, non-calendar-based, tium), HTTP codes, 113n 223–224 weather forecasts, 163 trim() function, 61 web agents, selectively allowing Tysver, Daniel A., 319 access to specific, 311–312</p><p>360 INDEX Webbots, Spiders, and Screen Scrapers, 2nd Edition © 2012 Michael Schrenk webbots2e.book Page 361 Thursday, February 9, 2012 10:08 AM</p><p> webbot_error_handler() function, 295 testing, 191 WEBBOT_NAME constant, 75 and trespass-to-chattels law, 241, webbots (web robots), 2 253, 271, 322–324 benefits of, 9–10 weekend scheduling of, 270 for business leaders, 11–12 web pages for developers, 10–11 accessibility to webbots, 297–300 cookies and design of, 212–214 adapting to content changes, countermeasures for, 309–316 291–292 with cookies, encryption, avoiding requests for nonexist- JavaScript, and redirec- ent, 286–288 tion, 313 displaying status of, 113–114 embedding text in other notification of change in, media, 314–315 157–160 obfuscation, 313 parsing image tags from down- reasons for, 309 loaded, 106 robots meta tag, 312 poorly written HTML within, 38 robots.txt file, 311–312 ranking by search engine allowing selective access to spider, 298 specific agents, 312–313 status of request for, 337–338 Terms of Service agree- validation point for, 287 ments, 126, 310 web services, 305 creating first script, 25 designing custom lightweight, daily scheduling of, 217–218 302–305 executing websites in browsers, 26–27 for book, 4 in command shell, 26 converting into functions, fault-tolerant, 286–296 163–170 growth in use, 10 limiting access to, 197–208 monthly scheduling of, 219 optimizing performance of, 17 periodicity of, 217, 225 transactional, 192 preparing to run as scheduled web spiders. See spiders tasks, 216 web technologies, tracking, 21 preventing negative conse- web walkers. See spiders quences of, 317–325. See WebSiteOptimization.com, 17 also copyright issues weekends, scheduling webbots not project ideas, 15–22 to run on, 270 for reading email, 145–152 well-defined links, for search engine and executing POP3 com- optimization, 298 mands, 149–151 white space, deleting, 45, 89 and POP3 protocol, 146–149 Wi-Fi networks, plotting, 21 reasons for stealth, 265–272 Windows Task Scheduler script, creating first, 25 Windows 7, 220–223 for sending email, 153–161 Windows XP, 216–219 setting traps, 315–316 complex scheduling, simulating human patterns, 218–219 269–272 wireless subscriber, mail server, 342 spreading burden of running World Wide Web, 1 complex, 169–170</p><p>Webbots, Spiders, and Screen Scrapers, 2nd Edition INDEX 361 © 2012 Michael Schrenk webbots2e.book Page 362 Thursday, February 9, 2012 10:08 AM</p><p>World Wide Web Consortium Y (W3C), HTTP codes, 113n Yahoo!, spiders used by, 173 wrapper function, using PHP/CURL within, 30 Z X ZIP codes database for, 170 XML (eXtensible Markup Lan- web page for decoding, 164–166 guage), 301–302 assigning tasks, 258, 260 overhead, 302 for RSS feeds, 131 <xmp> and </xmp> tags, 26 displaying parses within, 47</p><p>362 INDEX Webbots, Spiders, and Screen Scrapers, 2nd Edition © 2012 Michael Schrenk</p> </div> </article> </div> </div> </div> <script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.6.1/jquery.min.js" crossorigin="anonymous" referrerpolicy="no-referrer"></script> <script> var docId = 'bf846fbcf22693112d44a35543509e7a'; var endPage = 1; var totalPage = 19; var pfLoading = false; window.addEventListener('scroll', function () { if (pfLoading) return; var $now = $('.article-imgview .pf').eq(endPage - 1); if (document.documentElement.scrollTop + $(window).height() > $now.offset().top) { pfLoading = true; endPage++; if (endPage > totalPage) return; var imgEle = new Image(); var imgsrc = "//data.docslib.org/img/bf846fbcf22693112d44a35543509e7a-" + endPage + (endPage > 3 ? ".jpg" : ".webp"); imgEle.src = imgsrc; var $imgLoad = $('<div class="pf" id="pf' + endPage + '"><img src="/loading.gif"></div>'); $('.article-imgview').append($imgLoad); imgEle.addEventListener('load', function () { $imgLoad.find('img').attr('src', imgsrc); pfLoading = false }); if (endPage < 5) { adcall('pf' + endPage); } } }, { passive: true }); if (totalPage > 0) adcall('pf1'); </script> <script> var sc_project = 11552861; var sc_invisible = 1; var sc_security = "b956b151"; </script> <script src="https://www.statcounter.com/counter/counter.js" async></script> </html>