Automatic Identification and Removal of Low Quality Online Information

Total Page:16

File Type:pdf, Size:1020Kb

Automatic Identification and Removal of Low Quality Online Information AUTOMATIC IDENTIFICATION AND REMOVAL OF LOW QUALITY ONLINE INFORMATION A Thesis Presented to The Academic Faculty by Steve Webb In Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the College of Computing Georgia Institute of Technology December 2008 Copyright c 2008 by Steve Webb AUTOMATIC IDENTIFICATION AND REMOVAL OF LOW QUALITY ONLINE INFORMATION Approved by: Calton Pu, Advisor Ling Liu College of Computing College of Computing Georgia Institute of Technology Georgia Institute of Technology Mustaque Ahamad Shyhtsun Felix Wu College of Computing Department of Computer Science Georgia Institute of Technology University of California, Davis Nick Feamster Date Approved: 13 August 2008 College of Computing Georgia Institute of Technology For My Mother iii ACKNOWLEDGEMENTS The quest to obtain a Ph.D. is often characterized as a rollercoaster with numerous peaks and valleys, and as any Ph.D. candidate will tell you, I have experienced my fair share of those peaks and valleys. Fortunately, throughout the highest of highs and the lowest of lows, I have been surrounded by people that have offered constant encouragement and guidance. First and foremost, I am eternally grateful for the undying support that I have received from my family. I am particularly thankful for my mother, Anne, and my grandmother, Anna Ross. These two women have always been the most important people in my life, and without their unconditional love and support, I would not be the man I am today. Thank you both for everything you have done for me, and know that I will always love you. I am also thankful for my father, John. He taught me the importance of hard work and dedication, and I am truly grateful for those lessons. I love you, Dad. The rest of my family is equally amazing, and I truly appreciate their love and support. Beyond my immediate family, I am also blessed to be an honorary member of the Hammond and Little families. We might not be connected by blood, but I still love you with all of my heart. At Georgia Tech, my advisor, Calton Pu, was a constant source of inspiration. Regard- less of what graduate school or life threw at me, Calton was always there to put things in their proper perspective. In my very first meeting with Calton, he congratulated me for choosing personal and intellectual growth over the shortsighted allure of immediate financial gains. At the time, I have to admit that I thought he was crazy. But now, I understand exactly what he meant, and I cherish the journey we completed together. I am also grateful for Ling Liu’s guidance. At every one of my DISL or DoI presentations, I could always rely on her to make interesting observations about my work, and ultimately, her insights greatly enhanced my dissertation research. Leo Mark and H. Venkateswaran were also very positive influences at Tech. I could always expect a smile and a wave when I saw them on campus. iv I am indebted to Calton, Ling, Mustaque Ahamad, Nick Feamster, and Shyhtsun Felix Wu for reading and commenting on my dissertation. Their feedback was incredibly valuable, and I truly appreciate it. In addition to my committee members, I was fortunate enough to be exposed to a number of amazing people throughout my graduate career. Galen Swint and Kang Li were essentially my second and third advisors during my first year at Tech, and I appreciate every piece of advice they gave me. Subramanyam Chitti was an amazing collaborator, and I genuinely enjoyed every conversation, all-nighter, and Tin Drum trip we shared together. Jinpeng Wei and Younggyun Koh will always have a special place in my heart because we all endured the Systems Qualifier together. Li Xiong was an amazing officemate and an even better Master Shake. James Caverlee was one of the primary reasons I came to campus every day, and the three years we shared an office together were among the highlights of my life. Tim Garcia, Danesh Irani, and Qinyi Wu were also amazing officemates. I will miss tossing Frisbees with them all over the Klaus Building at all hours of the night. Finally, I am very thankful for the assistance I received from Deborah Mitchell, Mary Claire Thompson, and Susie McClain. I cherish my friendships with all of you, and I hope we will always stay in touch. v TABLE OF CONTENTS DEDICATION . iii ACKNOWLEDGEMENTS . iv LIST OF TABLES . ix LIST OF FIGURES . xi SUMMARY . xiv I INTRODUCTION . 1 II PRELIMINARIES . 9 2.1 Statistical Classification of Email Spam . 9 2.2 Dimensionality Reduction . 12 III THE IMPORTANCE OF LARGE CORPORA FOR EMAIL SPAM CLASSIFI- CATION . 14 3.1 Corpora Descriptions . 16 3.2 Evaluation of Small Corpora Training Sets . 22 3.3 Evaluation of Large Corpora Training Sets . 36 3.4 Related Work . 41 3.5 Summary . 43 IV LARGE-SCALE EVALUATION OF EMAIL SPAM CLASSIFIERS . 46 4.1 Experimental Setup . 47 4.2 Baseline Evaluation of Spam Filter Effectiveness . 50 4.3 Evaluation of Spam Filter Effectiveness Against Camouflaged Messages . 53 4.4 Baseline Evaluation of Spam Filter Effectiveness Revisited . 57 4.5 Related Work . 59 4.6 Summary . 60 V EVOLUTIONARY CHARACTERISTICS OF EMAIL SPAM . 61 5.1 Experimental Evaluation Method . 64 5.2 Evidence of Extinction . 68 5.3 Survival and Co-Existence . 72 5.4 Related Work . 79 vi 5.5 Summary . 80 VI DEFENDING CLASSIFIERS AGAINST CAMOUFLAGED EMAIL SPAM . 82 6.1 Background on the Spam Arms Race . 84 6.2 Related Work . 86 6.3 Baseline Evaluation of Learning Filters . 87 6.4 Evaluation of Camouflage Attacks . 90 6.5 Evaluation of Retrained Filter Defense . 94 6.6 Learning Filters Resisting Camouflage Attacks . 99 6.7 Summary . 106 VII INTEGRATING DIVERSE EMAIL SPAM FILTERING TECHNIQUES . 108 7.1 Related Work . 109 7.2 Description . 110 7.3 Implementation . 112 7.4 Summary . 116 VIII USING EMAIL SPAM TO IDENTIFY WEB SPAM AUTOMATICALLY . 117 8.1 The Webb Spam Corpus . 119 8.2 Sample Applications . 128 8.3 Summary . 136 IX CHARACTERIZING WEB SPAM USING CONTENT AND HTTP SESSION ANALYSIS . 138 9.1 Corpus Summary . 140 9.2 Content Analysis . 140 9.3 HTTP Session Analysis . 153 9.4 Related Work . 156 9.5 Summary . 157 X PREDICTING WEB SPAM WITH HTTP SESSION INFORMATION . 158 10.1 Related Work . 159 10.2 Web Page Retrieval Using HTTP . 161 10.3 Experimental Setup . 164 10.4 Webb Spam and WebBase Results . 171 vii 10.5 WEBSPAM-UK2006 Results . 176 10.6 Summary . 181 XI EXPLORING THE DARK SIDE OF SOCIAL ENVIRONMENTS . 182 11.1 Background on Social Networking Communities . 183 11.2 Traditional Attacks Targeting Social Networking Communities . 185 11.3 New Attacks Against Social Networking Communities . 194 11.4 Summary . 200 XII USING SOCIAL HONEYPOTS TO IDENTIFY SPAMMERS . 202 12.1 Related Work . 204 12.2 Social Spam . 205 12.3 Social Honeypots . 207 12.4 Social Honeypot Data Analysis . 209 12.5 Summary . 220 XIII CONCLUSIONS AND FUTURE RESEARCH DIRECTIONS . 222 13.1 Countering DoI Attacks in Email Systems . 222 13.2 Countering DoI Attacks in the World Wide Web . 224 13.3 Countering DoI Attacks in Social Environments . 225 REFERENCES . 227 VITA ............................................ 238 viii LIST OF TABLES 1 Corpus summaries. 17 2 Four combinations of training sets. 24 3 Stable false positive results for the Ling-spam spam and small legitimate training sets. 27 4 Stable false negative results for the Ling-spam spam and small legitimate training sets. 28 5 Stable false positive results for the SpamAssassin spam and small legitimate training sets. 29 6 Stable false negative results for the SpamAssassin spam and small legitimate training sets. 30 7 Stable false positive results for the SpamArchive spam and small legitimate training sets. 32 8 Stable false negative results for the SpamArchive spam and small legitimate training sets. 34 9 Stable false positive results for the small spam and Enron legitimate training sets. 35 10 Stable false negative results for the small spam and Enron legitimate training sets. 36 11 Stable false positive results for the SpamArchive spam and Enron legitimate training sets. 37 12 Stable false negative results for the SpamArchive spam and Enron legitimate training sets. 38 13 False positive results for various training and test set sizes. ..
Recommended publications
  • In-Depth Evaluation of Redirect Tracking and Link Usage
    Proceedings on Privacy Enhancing Technologies ; 2020 (4):394–413 Martin Koop*, Erik Tews, and Stefan Katzenbeisser In-Depth Evaluation of Redirect Tracking and Link Usage Abstract: In today’s web, information gathering on 1 Introduction users’ online behavior takes a major role. Advertisers use different tracking techniques that invade users’ privacy It is common practice to use different tracking tech- by collecting data on their browsing activities and inter- niques on websites. This covers the web advertisement ests. To preventing this threat, various privacy tools are infrastructure like banners, so-called web beacons1 or available that try to block third-party elements. How- social media buttons to gather data on the users’ on- ever, there exist various tracking techniques that are line behavior as well as privacy sensible information not covered by those tools, such as redirect link track- [52, 69, 73]. Among others, those include information on ing. Here, tracking is hidden in ordinary website links the user’s real name, address, gender, shopping-behavior pointing to further content. By clicking those links, or or location [4, 19]. Connecting this data with informa- by automatic URL redirects, the user is being redirected tion gathered from search queries, mobile devices [17] through a chain of potential tracking servers not visible or content published in online social networks [5, 79] al- to the user. In this scenario, the tracker collects valuable lows revealing further privacy sensitive information [62]. data about the content, topic, or user interests of the This includes personal interests, problems or desires of website. Additionally, the tracker sets not only third- users, political or religious views, as well as the finan- party but also first-party tracking cookies which are far cial status.
    [Show full text]
  • IBM Cognos Analytics - Reporting Version 11.1
    IBM Cognos Analytics - Reporting Version 11.1 User Guide IBM © Product Information This document applies to IBM Cognos Analytics version 11.1.0 and may also apply to subsequent releases. Copyright Licensed Materials - Property of IBM © Copyright IBM Corp. 2005, 2021. US Government Users Restricted Rights – Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp. IBM, the IBM logo and ibm.com are trademarks or registered trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at " Copyright and trademark information " at www.ibm.com/legal/copytrade.shtml. The following terms are trademarks or registered trademarks of other companies: • Adobe, the Adobe logo, PostScript, and the PostScript logo are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States, and/or other countries. • Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both. • Intel, Intel logo, Intel Inside, Intel Inside logo, Intel Centrino, Intel Centrino logo, Celeron, Intel Xeon, Intel SpeedStep, Itanium, and Pentium are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. • Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both. • UNIX is a registered trademark of The Open Group in the United States and other countries. • Java and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its affiliates.
    [Show full text]
  • Download Forecheck Guide
    Forecheck Content 1 Welcome & Introduction 5 2 Overview: What Forecheck Can Do 6 3 First Steps 7 4 Projects and Analyses 9 5 Scheduler and Queue 12 6 Important Details 13 6.1 La..n..g..u..a..g..e..s.,. .C...h..a..r.a..c..t.e..r. .S..e..t.s.. .a..n..d.. .U..n..i.c..o..d..e....................................................................... 13 6.2 Ch..o..o..s..i.n..g.. .t.h..e.. .c.o..r..r.e..c..t. .F..o..n..t............................................................................................. 14 6.3 St.o..r.a..g..e.. .L..o..c..a..t.i.o..n.. .o..f. .D..a..t.a................................................................................................ 15 6.4 Fo..r.e..c..h..e..c..k. .U...s.e..r.-.A...g..e..n..t. .a..n..d.. .W...e..b.. .A..n..a..l.y..s.i.s.. .T..o..o..l.s............................................................. 15 6.5 Er.r.o..r.. .H..a..n..d..l.i.n..g................................................................................................................ 17 6.6 Ro..b..o..t.s...t.x..t.,. .n..o..i.n..d..e..x..,. .n..o..f.o..l.l.o..w......................................................................................... 18 6.7 Co..m...p..l.e..t.e.. .A..n..a..l.y..s..i.s. .o..f. .l.a..r.g..e.. .W....e..b..s.i.t.e..s............................................................................. 20 6.8 Fi.n..d..i.n..g.. .a..l.l. .p..a..g..e..s. .o..f. .a.. .W....e..b..s.i.t.e....................................................................................... 21 6.9 Go..o..g..l.e.. .A...n..a..l.y.t.i.c..s.
    [Show full text]
  • Lxmldoc-4.5.0.Pdf
    lxml 2020-01-29 Contents Contents 2 I lxml 14 1 lxml 15 Introduction................................................. 15 Documentation............................................... 15 Download.................................................. 16 Mailing list................................................. 17 Bug tracker................................................. 17 License................................................... 17 Old Versions................................................. 17 2 Why lxml? 18 Motto.................................................... 18 Aims..................................................... 18 3 Installing lxml 20 Where to get it................................................ 20 Requirements................................................ 20 Installation................................................. 21 MS Windows............................................. 21 Linux................................................. 21 MacOS-X............................................... 21 Building lxml from dev sources....................................... 22 Using lxml with python-libxml2...................................... 22 Source builds on MS Windows....................................... 22 Source builds on MacOS-X......................................... 22 4 Benchmarks and Speed 23 General notes................................................ 23 How to read the timings........................................... 24 Parsing and Serialising........................................... 24 The ElementTree
    [Show full text]
  • CSE 190 M (Web Programming), Spring 2008 University of Washington
    Extra Slides, week 2 CSE 190 M (Web Programming), Spring 2008 University of Washington Reading: Chapter 2, sections 2.4 - 2.6 Except where otherwise noted, the contents of this presentation are © Copyright 2008 Marty Stepp and Jessica Miller and are licensed under the Creative Commons Attribution 2.5 License. Additional XHTML Tags for adding metadata and icons to a page Web page metadata: <meta> information about your page (for a browser, search engine, etc.) <meta name="description" content="Authors' web site for Building Java Programs." /> <meta name="keywords" content="java, textbook" /> <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" /> placed in the head of your XHTML page meta tags often have both the name and content attributes some meta tags use the http-equiv attribute instead of name meta element to aid browser / web server <meta http-equiv="Content-Type" content=" type of document (character encoding)" /> <meta http-equiv="refresh" content=" how often to refresh the page (seconds)" /> </head> using the Content-Type gets rid of the W3C "tentatively valid" warning <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" /> the meta refresh tag can also redirect from one page to another: <meta http-equiv="refresh" content="5;url=http://www.bjp.com " /> why would we want to do this? (example ) meta element to describe the page <head> <meta name="author" content=" web page's author " /> <meta name="revised" content=" web page version and/or last modification date " /> <meta name="generator"
    [Show full text]
  • Essbase Logs
    Oracle® Enterprise Performance Management System Installation and Configuration Troubleshooting Guide Release 11.1.2.1 Updated: August 2012 EPM System Installation and Configuration Troubleshooting Guide, 11.1.2.1 Copyright © 2007, 2012, Oracle and/or its affiliates. All rights reserved. Authors: EPM Information Development Team Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. This software and related documentation are provided under a license agreement containing restrictions on use and disclosure and are protected by intellectual property laws. Except as expressly permitted in your license agreement or allowed by law, you may not use, copy, reproduce, translate, broadcast, modify, license, transmit, distribute, exhibit, perform, publish, or display any part, in any form, or by any means. Reverse engineering, disassembly, or decompilation of this software, unless required by law for interoperability, is prohibited. The information contained herein is subject to change without notice and is not warranted to be error-free. If you find any errors, please report them to us in writing. If this is software or related documentation that is delivered to the U.S. Government or anyone licensing it on behalf of the U.S. Government, the following notice is applicable: U.S. GOVERNMENT RIGHTS: Programs, software, databases, and related documentation and technical data delivered to U.S. Government customers are "commercial computer software" or "commercial technical data" pursuant to the applicable Federal Acquisition Regulation and agency-specific supplemental regulations. As such, the use, duplication, disclosure, modification, and adaptation shall be subject to the restrictions and license terms set forth in the applicable Government contract, and, to the extent applicable by the terms of the Government contract, the additional rights set forth in FAR 52.227-19, Commercial Computer Software License (December 2007).
    [Show full text]
  • CSE 190 M: HTML Page 1
    CSE 190 M: HTML Page 1 Lecture Notes 2: Hypertext Markup Language (HTML) CSE 190 M (Web Programming), Spring 2007 University of Washington Reading: Sebesta Ch. 2 sections 2.1 - 2.4.5, 2.4.7 - 2.4.9, 2.5 -2.6.1,2.7 -2.7.2 Hypertext Markup Language (HTML ) describes the content and structure of information on a web page not the same as the presentation (appearance on screen) surrounds text content with opening and closing tags each tag's name is called an element syntax: <element > content </element > example: <p>This is a paragraph</p> most whitespace is insignificant in HTML (it gets ignored or collapsed into a single space) More about HTML tags some tags can contain additional information called attributes syntax: <element attribute =" value " attribute =" value "> content </element > example: <a href="page2.html">Next page</a> some tags don't contain content; can be opened and closed in one tag syntax: <element attribute =" value " attribute =" value " /> example: <img src="bunny.jpg" alt="A bunny" /> example: <hr /> file://localhost/C:/Documents%20and%20Settings/stepp/My%20Documents/cse190m/07sp/... 05/03/2007 12:57:29 PM CSE 190 M: HTML Page 2 Structure of an HTML page <html> <head> information about the page </head> <body> page contents </body> </html> a header describes the page and a body contains the page's contents an HTML page is saved into a file ending with extension .html Example page XHTML a newer version of HTML, standardized in 2000 uses a markup format called XML (XML + HTML = XHTML) though the browser will accept some malformed HTML, we'll write "strict" XHTML that complies to the official web standards why use XHTML and web standards? more rigid and structured language more interoperable across different web browsers more likely that our pages will display correctly in the future can be interchanged with other XML data: SVG (graphics), MathML , MusicML , etc.
    [Show full text]
  • (ACM) Style Guide, WCAG 2.0 Level A
    Association of Computing Machinery (ACM) Style Guide, WCAG 2.0 Level A ACM 12/19/16 TABLE OF CONTENTS Overview ............................................................................................................................................................... 4 Media Types .......................................................................................................................................................... 4 Images ............................................................................................................................................................... 4 Best Practice: Provide Alternative Text for Images ..................................................................................... 4 Best Practice: Ensure complex images provide sufficient descriptions ....................................................... 5 Best Practice: Ensure CSS background images that convey meaning have textual and visible equivalents ...................................................................................................................................................................... 5 Color and Contrast ............................................................................................................................................ 6 Best Practice: Ensure color is not the sole means of communicating information or indicating error messages .....................................................................................................................................................
    [Show full text]
  • SC Tech Summary
    SC tech summary Failure of Success Criterion F65 1.1.1 due to omitting the alt attribute on img elements, area elements, and input elements of type image Using label elements to associate text labels with form H44 controls." category Using the title attribute to identify form controls when H65 the label element cannot be used Using alt attributes on images used as submit buttons" H36 category ARIA6 Using aria-label to provide labels for objects H37 Using alt attributes on img elements 1.1.1 Failure of Success Criterion F30 1.1.1 and 1.2.1 due to using text alternatives that are not alternatives Combining adjacent image and text links for the same H2 resource H35 Providing text alternatives on applet elements H53 Using the body of the object element" category Providing text alternatives for the area elements of image H24 maps" category Providing link text that describes the purpose of a link for H30 anchor elements H45 Using longdesc H46 Using noembed with embed Using null alt text and no title attribute on img elements H67 for images that AT should ignore H96 Using the track element to provide audio descriptions Providing an alternative for time-based media for audio- G158 only content Providing an alternative for time-based media for video- G195 1.2.1 only content Providing audio that describes the important video G166 content and describing it as such Failure of Success Criterion F30 1.1.1 and 1.2.1 due to using text alternatives that are not alternatives 1.2.2 H95 Using the track element to provide captions 1.2.3 H53 Using the body
    [Show full text]
  • Exploring the Ecosystem of Referrer-Anonymizing Services
    Exploring the Ecosystem of Referrer-Anonymizing Services Nick Nikiforakis, Steven Van Acker, Frank Piessens, and Wouter Joosen IBBT-DistriNet, KU Leuven, 3001 Leuven, Belgium [email protected] Abstract. The constant expansion of the World Wide Web allows users to enjoy a wide range of products and services delivered directly to their browsers. At the same time however, this expansion of functionality is usually coupled with more ways of attacking a user's security and privacy. In this arms race, certain web-services present themselves as privacy- preserving or privacy-enhancing. One type of such services is a Referrer- Anonymizing Service (RAS), a service which relays users from a source site to a destination site while scrubbing the contents of the referrer header from user requests. In this paper, we investigate the ecosystem of RASs and how they in- teract with web-site administrators and visiting users. We discuss their workings, what happens behind the scenes and how top Internet sites react to traffic relayed through such services. In addition, we present user statistics from our own Referrer-Anonymizing Service and show the leakage of private information by others towards advertising agencies as well as towards `curious' RAS owners. Keywords: referrer, anonymization, online ecosystem 1 Introduction In the infant stages of the Internet, privacy and anonymity were mostly unneces- sary due to the small size of the online community and the public nature of the available data. Today however, this has changed. People have online identities, are connected to the Internet almost permanently and they increasingly store their sensitive documents, photos and other data online in the cloud.
    [Show full text]
  • Package 'Htmlutils'
    Package ‘HTMLUtils’ February 19, 2015 Type Package Title Facilitates Automated HTML Report Creation Version 0.1.7 Date 2015-01-17 Depends R2HTML Suggests Author ``Markus Loecher, Berlin School of Eco- nomics and Law (BSEL)'' <[email protected]> Maintainer ``Markus Loecher, Berlin School of Eco- nomics and Law (BSEL)'' <[email protected]> Description Facilitates automated HTML report creation, in particular framed HTML pages and dynamically sortable tables. License GPL LazyLoad yes Repository CRAN Date/Publication 2015-01-17 16:21:32 NeedsCompilation no R topics documented: HTMLUtils-package . .2 BasicHTML . .2 FramedHTML . .6 HTMLhref . 12 HTMLsortedTable . 13 InstallJSC . 14 makePathName . 14 myHTMLInitFile . 15 MyReportBegin . 16 MyReportEnd . 17 Index 18 1 2 BasicHTML HTMLUtils-package Facilitates Automated HTML Report Creation Description Facilitates automated HTML report creation, in particular framed HTML pages and dynamically sortable tables. Details Package: HTMLUtils Type: Package Title: Facilitates Automated HTML Report Creation Version: 0.1.7 Date: 2015-01-17 Depends: R2HTML Suggests: Author: "Markus Loecher, Berlin School of Economics and Law (BSEL)" <[email protected]> Maintainer: "Markus Loecher, Berlin School of Economics and Law (BSEL)" <[email protected]> License: GPL LazyLoad: yes Packaged: 2012-05-17 21:56:35 UTC; mloecher Repository: CRAN Date/Publication: 2012-05-18 05:59:13 Author(s) "Markus Loecher, Berlin School of Economics and Law (BSEL)" <[email protected]> BasicHTML creates a basic HTML page displaying plots and annota Description Creates a basic HTML page displaying plots and annotations that can easily be navigated. The plots can be created either ’on the fly’ by passing the appropriate commands or beforehand in which case just the filenames need to be passed.
    [Show full text]
  • CSC 443: Web Programming Basic HTML
    1 CSC 443: Web Programming Haidar Harmanani Department of Computer Science and Mathematics Lebanese American University Byblos, 1401 2010 Lebanon CSC443: Web Programming 2 Basic HTML Haidar Harmanani Department of Computer Science and Mathematics Lebanese American University Byblos, 1401 2010 Lebanon CSC443: Web Programming Hypertext Markup Language (HTML) 3 ¨ Describes the content and structure of information on a web page ¨ Not the same as the presentation (appearance on screen) ¨ Surrounds text content with opening and closing tags ¨ Each tag's name is called an element ¤ syntax: <element> content </element> ¤ example: <p>This is a paragraph</p> CSC443: Web Programming XHTML 4 ¨ Uses a markup format called XML ¨ XML + HTML = XHTML ¨ Standardized in 2000 ¨ A strict XHTML page uses some different syntax and tags than HTML CSC443: Web Programming Structure of XHTML page 5 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"> <html xmlns="http://www.w3.org/1999/xhtml"> <head> information about the page </head> <body> page contents </body> </html> HTML ¨ HTML is saved with extension .html ¨ Basic structure: tags that enclose content, i.e., elements ¨ Header describes the page ¨ Body contains the page’s contents CSC443: Web Programming Page Title <title> 6 … <head> <title> HARRY POTTER AND THE DEATHLY HALLOWS - PART 2 </title> </head> … HTML ¨ Placed within the head of the page ¨ Displayed in web browser’s title mark and when bookmarking the page CSC443: Web Programming Paragraph <p> 7 … <body> <p> Harry Potter and the Deathly Hallows, the last book in the series, begins directly after the events of the sixth book.
    [Show full text]