Lazy Preservation: Reconstructing Websites from the Web Infrastructure
Frank McCown, Old Dominion University


Old Dominion University
ODU Digital Commons
Computer Science Theses & Dissertations, Computer Science, Fall 2007

Lazy Preservation: Reconstructing Websites from the Web Infrastructure
Frank McCown, Old Dominion University

Follow this and additional works at: https://digitalcommons.odu.edu/computerscience_etds
Part of the Computer Sciences Commons, and the Digital Communications and Networking Commons

Recommended Citation
McCown, Frank. "Lazy Preservation: Reconstructing Websites from the Web Infrastructure" (2007). Doctor of Philosophy (PhD), dissertation, Computer Science, Old Dominion University, DOI: 10.25777/ys8r-nj25, https://digitalcommons.odu.edu/computerscience_etds/21

This dissertation is brought to you for free and open access by Computer Science at ODU Digital Commons. It has been accepted for inclusion in Computer Science Theses & Dissertations by an authorized administrator of ODU Digital Commons. For more information, please contact [email protected].

LAZY PRESERVATION: RECONSTRUCTING WEBSITES FROM THE WEB INFRASTRUCTURE

by
Frank McCown
B.S. 1996, Harding University
M.S. 2002, University of Arkansas at Little Rock

A Dissertation Submitted to the Faculty of Old Dominion University in Partial Fulfillment of the Requirements for the Degree of

DOCTOR OF PHILOSOPHY
COMPUTER SCIENCE

OLD DOMINION UNIVERSITY
December 2007

Approved by:
Michael L. Nelson (Director)
William Y. Arms (Member)
Johan Bollen (Member)
Kurt Maly (Member)
Ravi Mukkamala (Member)
Mohammad Zubair (Member)

ABSTRACT

LAZY PRESERVATION: RECONSTRUCTING WEBSITES FROM THE WEB INFRASTRUCTURE

Frank McCown
Old Dominion University, 2007
Director: Dr. Michael L. Nelson

Backup or preservation of websites is often not considered until after a catastrophic event has occurred. In the face of complete website loss, webmasters or concerned third parties have attempted to recover some of their websites from the Internet Archive. Still others have sought to retrieve missing resources from the caches of commercial search engines. Inspired by these post hoc reconstruction attempts, this dissertation introduces the concept of lazy preservation: digital preservation performed as a result of the normal operations of the Web Infrastructure (web archives, search engines and caches). First, the Web Infrastructure (WI) is characterized by its preservation capacity and behavior. Methods for reconstructing websites from the WI are then investigated, and a new type of crawler is introduced: the web-repository crawler. Several experiments are used to measure and evaluate the effectiveness of lazy preservation for a variety of websites, and various web-repository crawler strategies are introduced and evaluated. The implementation of the web-repository crawler Warrick is presented, and real usage data from the public is analyzed. Finally, a novel technique for recovering the generative functionality (i.e., CGI programs, databases, etc.) of websites is presented, and its effectiveness is demonstrated by recovering an entire EPrints digital library from the WI.

© Copyright, 2007, by Frank McCown, All Rights Reserved.

To my wife, Becky.

ACKNOWLEDGMENTS

There are a number of people whom I would like to acknowledge for their support during my doctoral work. I would especially like to thank my advisor, Michael L. Nelson, for the time and effort he put into mentoring me these past several years.
Our many discussions sparked a number of great ideas and helped turn several dead ends into possibilities. I am also grateful to my doctoral committee for the input they have provided.

Much of this dissertation is the product of collaboration with a number of excellent researchers. Michael Nelson and Johan Bollen (LANL) provided many of the initial ideas on lazy preservation and the Web Infrastructure. Other ideas about lazy preservation developed from collaboration with Cathy Marshall (Microsoft). Joan Smith worked with me on the decaying website experiment from Chapter IV and provided helpful ideas and advice throughout my time at ODU. Giridhar Nandigam helped with the search engine sampling experiment in Chapter IV, Amine Benjelloun developed most of the Brass system from Chapter VI, and Norou Diawara helped perform the statistical analysis in Chapter VII.

I would like to thank Janet Brunelle, Hussein Abdel-Wahab and other ODU faculty for their friendship and encouragement these past several years. I enjoyed attending a number of conferences with Michael, Johan, Joan, Martin Klein, Marko Rodriguez (LANL) and Terry Harrison (CACI) and learned a lot from our many discussions. Thanks also to members of the Systems Group who kept our infrastructure running smoothly. My friends and colleagues at Harding University were very supportive of me while I worked on my Ph.D., and I am thankful to them for providing me the time off to pursue my doctoral degree.

This dissertation is dedicated to my wife Becky, who cheered me on through the good times, encouraged me when I was down and wanted to quit, and made me laugh every day. I could not have had a more supportive spouse. I especially enjoyed becoming a father in the final year of my doctoral work; much of this dissertation was written only a few feet away from Ethan as he slept and played. I thank my parents Andy and Genia and my siblings John and Sara, who have encouraged me in all my pursuits, and I thank my family at the Bayside Church of Christ, who provided friendship and spiritual guidance while we lived in Virginia. Finally and primarily, I thank God for giving me the strength to finish what I started, and, in the spirit of 1 Corinthians 10:31, I hope that this dissertation glorifies Him.

"'My son,' the father said, 'you are always with me, and everything I have is yours. But we had to celebrate and be glad, because this brother of yours was dead and is alive again; he was lost and is found.'" - Luke 15:31-32

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

Chapter
I INTRODUCTION
  1 MOTIVATION
  2 OBJECTIVE
  3 APPROACH
  4 ORGANIZATION
II PRESERVING THE WEB
  1 LINK ROT
  2 PRESERVING THE WEB
  3 WEB CRAWLING
  4 PRESERVING WEBSITES
  5 CONCLUSIONS
III LAZY PRESERVATION AND THE WEB INFRASTRUCTURE
  1 LAZY PRESERVATION
  2 LIMITATIONS
  3 WEB REPOSITORIES
  4 CONCLUSIONS
IV CHARACTERIZING THE WEB INFRASTRUCTURE
  1 A MODEL FOR RESOURCE AVAILABILITY
  2 WEB INFRASTRUCTURE PRESERVATION CAPABILITY
  3 WEB INFRASTRUCTURE CONTENTS
  4 DISCUSSION
  5 CONCLUSIONS
V WEB-REPOSITORY CRAWLING
  1 CRAWLER ARCHITECTURE
  2 LISTER QUERIES AND CRAWLING POLICIES
  3 URL CANONICALIZATION
  4 CONCLUSIONS
VI WARRICK, A WEB-REPOSITORY CRAWLER
  1 BRIEF HISTORY
  2 IMPLEMENTATION
  3 OPERATION
  4 RUNNING
  5 BRASS
  6 USAGE STATISTICS
  7 CONCLUSIONS
VII EVALUATING LAZY PRESERVATION
  1 WEBSITE DEFINITIONS
  2 RECONSTRUCTION MEASUREMENTS
  3 INITIAL RECONSTRUCTION EXPERIMENT
  4 CRAWLING POLICIES EXPERIMENT
  5 FACTORS AFFECTING WEBSITE RECONSTRUCTION
  6 CONCLUSIONS
VIII RECOVERING A WEBSITE'S SERVER COMPONENTS
  1 GENERATING DYNAMIC WEB CONTENT
  2 WHAT TO PROTECT
  3 INJECTION MECHANICS
  4 EXPERIMENTS
  5 DISCUSSION
  6 CONCLUSIONS
IX CONCLUSIONS AND FUTURE WORK
  1 CONCLUSIONS
  2 CONTRIBUTIONS
  3 FUTURE WORK
BIBLIOGRAPHY
APPENDICES
A WARRICK COMMAND-LINE SWITCHES
B RECONSTRUCTED WEBSITES
VITA

LIST OF TABLES

Table
1 Systems for preserving and recovering web resources
2 Sample of reconstructed websites
3 Web repository-supported data types as of July 10, 2007
4 Implementation summary of web-repository interfaces
5 Resource availability states
6 Caching of HTML resources from four web collections
7 Web and cache accessibility
8 Indexed and cached content by type
9 Staleness of search engine caches (in days)
10 Search engine overlap with the Internet Archive
11 Repository request methods and limits
12 Brass usage statistics from 2007
13 Brass recovery summary
14 Repository use, contributions and requests
15 General levels of reconstruction success
16 Results of initial website reconstructions
17 Results of crawling-policy website reconstructions (Part 1)
18 Results of crawling-policy website reconstructions (Part 2)
19 Statistics for crawling-policy website reconstructions
20 Descriptive statistics for reconstruction success levels
21 Reconstruction performance of web repositories
22 Regression parameter estimates
23 Various r values (bold)