webbots2e.book Page 345 Thursday, February 9, 2012 10:08 AM
INDEX
Symbols & Numbers

& (ampersand), in GET method, 67
$address array, 156–157
$content_type variable, 158
$data_array, 168
  for LIB_http library functions, 35
$FETCH_DELAY, 175, 180
$filter_array, 130, 135–136
$_GET array, 76
$link_array elements, 111
$page_base variable, 106
$_POST array, 76
$result array, FILE element, 51–52
$status_code_array, 114–115
. (period), as POP3 end-of-message indicator, 147
? (question mark), in GET method, 67
404 Not Found error, 287, 338

A

abstractions, of program interface, 81
access log file, 29
  error logging in, 266
  and webbot detection, 266–267
access rights, verifying, 20
action attribute
  of form, 65,
  for form analyzer, 71
action of person, simulating, 123
$address array, 156–157
agent name
  default for, LIB_http, 31
  defining for PHP/CURL session, 330
  log record of, 268
  spoofing, 29, 280, 311, 316, 329
aggregating information by relevance, 16
aggregation webbots, 92, 129–137
  CDATA, 135
  choosing data sources, 130
  downloading and parsing script, 134
  and filtering, 135–137
  RSS feeds, 131–133
  writing, 133–135
Alexa web-monitoring service, 305
“all rights reserved” notice, 320
Amazon Web Services, SOAP interfaces, 305
ampersand (&), in GET method, 67
anchor tags. See links
Andreessen, Marc, 1
anonymity
  as a process, 283
  in commercial email, 155
anti-pokerbot software, 18
Anti-Spam Law, Virginia, 324
Apache
  cookies, 202
  headers, 33, 190
  installing PHP on, 30
  log files, 266–267
  web server, 6
Application Program Interfaces (APIs)
  Amazon, 131
  eBay, 131
  Google, 127
  Google Maps, 131
Webbots, Spiders, and Screen Scrapers, 2nd Edition © 2012 Michael Schrenk
archive_links() function, 178
ARPANET, 139
array
  assigning parsed data to, 98
  elements, form data as, 68
  of tags, src attribute from, 61
  parsing
    data set into, 41–42
    table into, 96
attributes, parsing values, 42–43
audience, for Internet, 2
authentication, 190–208
  basic, 199–202
    curl_setopt() function options for, 332
    by PHP/CURL, 28
    test pages, 201
  of buyer by procurement webbot, 186–187
  default response to request, 28
  for deterring webbots, 314
  digest, 201–202
  and encryption, 202
  example scripts and practice pages, 199
  FTP, 140
  with query string sessions, 205–207
  session, 202–205
  of snipers, 189, 213
  strengthening by combining techniques, 198–199
  types, 198
automating tasks, 19

B

bandwidth
  consumption, 187, 225
  hijacking, 104, 291
  stealing, 30
base64-encoding, 84
basic authentication, 199–202
  curl_setopt() function options for, 332
  by PHP/CURL, 28
  test pages, 199
batch file, for webbot, 216
Bcc: address field, 156–157
Beck & Tysver legal website, 319
Bidder’s Edge spiders, 323
bids, timing placement of, 188, 191
Bina, Eric, 1
binary-safe download routine, 103–104
biometrics, 198–199
Bing, spiders used by, 173
blobs, storing images as, 83–84
blogs
  aggregation of, 131
  laws concerning, 324, 325
  searching for spelling errors, 137
botnet management, 255–260
  assigning tasks, 258–260
  communication methods, 255
  determining tasks, 257
  performing tasks, 260
  polling the botnet server, 256–257
  task checkout, 258
  uploading botnet data, 260
broken links, webbot detecting, 109–116
browser buffering, 27
browser-like webbots, 230
browser macros, 227–247
  adding functionality to, 240–247
  browser-like webbots, 230
  commands, 235
  creating your first, 231–233
    initialization, 233
    recording, 232
  defined, 230
  dynamic macros, 241–245
    integrating data with, 242–245
    scripts that create, 241
  hacking, 239–247
  installing, 230–231
  iMacros Scripting Engine, reasons not to use, 240–241
  launching automatically, 245–246
    in Linux, 246
    in Windows, 245
  necessity for, 237
  overcoming barriers with, 230
  reasons to use, 227–229
  running, 237
  suggested standard initialization of, 235–237
browsers, 1–2
  emulating, 75. See also browser macros
  executing webbots in, 26–27
  inspiration from limitations of, 15–18
  problem with, 2
  search engine treatment vs. treatment of webbot, 126
  tabbed browsing, 129
business leaders, webbot benefits for, 11–12
buy-it-now auction purchases, 189

C

CamelCase, 78
CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart), 314–315, 325
Cascading Style Sheets (CSS), 17, 230
  impact of removing HTML tags, 89
case
  for naming, 111
  sensitivity, stristr() function vs. strstr() function, 137
Cc: address field, 156–157
CDATA tags, 134–135
certificates, 195
Children’s Online Privacy Protection Act (COPPA), 154
ciphers, 193, 195
client-server technology, 2
client URL Request Library (cURL), 6, 21, 23
clipping service, online, 20, 137
clocks, synchronization for sniper, 189
code
  in book, 4–5
  libraries available online, 5
collusion webbots, 17
command shell
  executing webbots in, 26
  leveraging operating system with, 254
  and spider scripts, 181
comma-separated value (CSV) files
  file() function for downloading, 27
  iMacros file format, 236
Common Object Request Broker Architecture (CORBA), 305
communication, on incompatible systems, 21–22
competitive advantage, 9–13, 64, 74, 191, 265, 286, 310
Completely Automated Public Turing test to tell Computers and Humans Apart (CAPTCHA), 314–315, 325
compressing data, 86–88
computers. See also server
  distributing tasks across multiple, 254
constructive hacking, 11
Content-Type line
  for email message, 148
  in HTTP header, 34
$content_type variable, 142
converting website into function, 163–170
COOKIE_FILE, 212–214
cookies
  about, 209–211
  adapting to management changes, 294
  for authentication, 202–205
  defaults for, 31
  deleting, 212, 294
  for deterring webbots, 313
  expiration dates for, 212–213
  and forms, 70
cookies, continued
  managing multiple users’, 213–214
  persistence with, 212
  PHP/CURL to read and write, 29
  purging temporary, 212–213
  restrictions, with proxies, 206
  viewing, 210–211
  and webbot design, 211
COPPA (Children’s Online Privacy Protection Act), 154
copyright issues, 85, 319–322
  “all rights reserved” notice, 320
  and facts, 321
  fair use laws, 322
  registration, 320
CORBA (Common Object Request Broker Architecture), 305
crawlers. See spiders
cron command, 215
cryptography, 193
CSS (Cascading Style Sheets), 17, 230
  impact of removing HTML tags, 89
CSV (comma-separated value) files,
  file() function for downloading, 27
  iMacros file format, 236
cURL (client URL Request Library), 6, 21, 23
curl_error() function, 334–335
curl_exec() function, 334
curl_getInfo() function, 334
curl_init() function, 328
curl_setopt() function, 69, 104, 328–333, 334
  case sensitivity, 333
  CURLOPT_COOKIEFILE option, 205, 214, 331
  CURLOPT_COOKIEJAR option, 205, 214, 331
  CURLOPT_FOLLOWLOCATION option, 329
  CURLOPT_HEADER option, 330
  CURLOPT_HTTPHEADER option, 331
  CURLOPT_MAXREDIRS option, 288, 329
  CURLOPT_NOBODY option, 330
  CURLOPT_PORT option, 333
  CURLOPT_POSTFIELDS option, 332
  CURLOPT_POST option, 332
  CURLOPT_REFERER option, 329
  CURLOPT_RETURNTRANSFER option, 329
  CURLOPT_SSL_VERIFYHOST option, 195
  CURLOPT_SSL_VERIFYPEER option, 195, 332
  CURLOPT_TIMEOUT option, 294–295
  CURLOPT_UNRESTRICTED_AUTH option, 332
  CURLOPT_URL option, 329
  CURLOPT_USERAGENT option, 330
  CURLOPT_USERPWD option, 332
  CURLOPT_VERBOSE option, 333
  executing, 333–334
custom logs, and webbot detection, 268

D

daily scheduling of webbots, 217
data
  fields in forms, 65, 66
  networks, access and abuse, 323
  set, parsing into array, 41
  sources, choosing for aggregation webbot, 130
$data_array, 168
  for LIB_http library functions, 35
database
  for saving links, 181–182
  storing images in, 83–84
  storing text in, 80–83
data management, 77–90
  organizing data, 77–85
    naming conventions, 77–78
    storing images in database, 83–84
    storing text in database, 80–83
    structured files, 79–80
  reducing size, 85–90
    data compression, 86–88
    removing formatting, 88–89
    storing references to image files, 85
    thumbnailing images, 89–90
data-only interfaces, 301–307
  lightweight data exchange, 302–305
Digital Encryption Standard (DES), 195
directories, 79
  script for creating, 104
disclaimer, 6
disk swapping, 181
Distributed Component Object Model (DCOM), 305
email, continued
  reading with webbots, 145–152
  sending, 153–161
    HTML-formatted, 159–160
    with mail() function, 154–155
    notifications with webbots, 157–158
    with PHP, 154–155
  undeliverable as alert to invalid address, 160–161
  as webbot trigger, 223
email-controlled webbots, 151–152
encryption, 193–196
  authentication and, 208
  certificate, 195–196
  for deterring webbots, 312
  webbots using, 194
end-of-message indicator (POP3), 147
environments, 250–252
  many-to-many, 251
  many-to-one, 252
  one-to-many, 250
  one-to-one, 251
error
  handlers, 295–296
  information
    from http_get() function, 32
    from http_get_withheader() function, 32
  logs, and webbot detection, 267–268
eval() function, 303
event triggers, 70
exclude_link() function, 179–180
exclusion list, for spiders, 180
executing webbots
  in browsers, 26–27
  in command shell, 26
exe_sql() function, 82–83
expiration dates, for cookies, 203, 209–210
eXtensible Markup Language (XML), 301–302
  assigning tasks, 258, 260
  overhead, 302
  for RSS feeds, 131

F

facts, and copyright, 321
fair use laws, 322
fault-tolerant webbots, 285–296
  cookie management changes, 294
  form changes, 292–293
  network outages and congestion, 294–295
  page content changes, 291–295
  URL changes, 286–291
    page redirection, 288–290
    and referer values accuracy, 290–291
    requests for nonexistent pages, 292
$FETCH_DELAY, 175, 180
fgets() function, 25, 27
file() function, downloading files with, 27
file handle, 25, 27
filesystem, geographically structured, 80
File Transfer Protocol (FTP)
  server, connecting to, 141
  webbots, 139–143
$filter_array, 130, 135–136
filtering
  by aggregation webbot, 135–137
  information by relevance, 16
Flash
  barrier to effective webscraping, 229
  for deterring webbots, 314
  for website navigation, problems caused by, 229
fopen() function, 25
format of names, 78–79
formatted_mail() function, 156
form data variables, 66
forms
  adapting to changes in, 292–293
  analyzing, 71–74, 165
  avoiding errors, 75–76
  and cookies, 70
  emulation, 64
    legal issues and, 64
  handlers, 65–66
  input tags, 66
  interfaces, reverse engineering, 64–65
  main parts, 65
  source code
    displaying, 166
    saving, 166
  submission, 63–76, 167
    data fields in forms, 66
    event triggers, 70
    form handlers, 65–66
    GET method, 67–68
    PHP/CURL for, 28
    POST method, 68
    unpredictability, 70
get_base_page_address() function, 106
get_domain() function, 178–179
get_http() function, 95
GET method, 61
  and errors, 267
  http_get() function for downloading with, 31–32
  vs. POST method, 68
Google
  bombing, 298
  developer API, 127
  spiders used by, 173
GoogleRankings.com, 118
graphics. See images

H