

UNIVERSITY OF CAMERINO
School of Science and Technology
Degree in Computer Science
Software and systems for industries
(Class LM-18)

A Framework for web presence sentiment analysis

Graduand: Andrea Marcelli
Supervisor: Prof. Fausto Marcantoni

Academic year 2016/2017

Abstract

(English)

The World Wide Web has come a long way since its early days of development at the CERN laboratories. Now holding immense relevance in everyday activities and intertwined with the lives of most people on this planet, its network of links and applications allows users to reach any kind of content they want. The huge quantity of data hosted on the Internet can make a difference in a number of diverse applications but, most importantly for the scope of this work, it can be used to determine, with a certain degree of precision, the polarity of the opinions held by communities, users and newspapers alike with respect to a given individual or business. Indeed, data can hold either a positive or a negative meaning when stating something about a topic or, in more general terms, when the name of an individual or business is cited alongside positive or negative content. This activity, called "Sentiment Analysis", has been actively researched in several influential papers. This work is based on the framework for Sentiment Analysis created by Bing Liu and cited in several subsequent studies. However, it aims to simplify that framework by tailoring it to a narrower, more specific context than the general one that a framework such as Liu's is meant to address. This work introduces the concept of the "influence score", that is, the strength with which a text is deemed able to impact the reputation of the search subject. Using this companion concept, the simplified framework proposed in this work tries to determine, in a naive way, the amount of impact that any piece of content can have on the individual or business at hand. The framework itself follows a strictly scientific process: the problem is first analyzed using Data Science principles, and additional framework components and concepts are then defined to be used alongside those principles. The companion tool provides a first implementation of these concepts: it examines the results pages (SERPs) obtained by searching the web by means of a search engine, analyses each SERP item separately to obtain its sentiment and influence score, and provides, over time, a monitoring functionality for the user. This document examines the different elements that interact within the framework implementation, namely Search Engines (Chapter 1), Data Science (Chapter 2), Explorative Algorithmic Techniques (Chapter 3), Web Reputation (Chapter 4) and The Tool (Chapter 5), finally dwelling on future possibilities and additions to the work in Chapter 6.
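The abstract describes the SERP-scoring pipeline only in general terms and does not fix a particular implementation here. The short Python sketch below is one possible, purely illustrative reading of it: a toy lexicon stands in for whatever sentiment method the tool actually uses, and the influence score is reduced to a simple rank-based weight. The `SerpItem` fields, the word lists and both scoring formulas are assumptions, not the author's definitions.

```python
from dataclasses import dataclass

# Toy opinion lexicon, purely illustrative; a real implementation would rely
# on a proper sentiment resource or a trained classifier.
POSITIVE = {"award", "excellent", "praised", "growth", "innovative"}
NEGATIVE = {"fraud", "scandal", "lawsuit", "complaint", "failure"}

@dataclass
class SerpItem:
    rank: int      # 1-based position in the results page (hypothetical field)
    title: str
    snippet: str

def sentiment(text: str) -> float:
    """Naive lexicon polarity in [-1, 1]: (#positive - #negative) / #opinion words."""
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return 0.0 if pos + neg == 0 else (pos - neg) / (pos + neg)

def influence(item: SerpItem, page_size: int = 10) -> float:
    """Assumed heuristic: results ranked higher on the page weigh more."""
    return max(0.0, 1.0 - (item.rank - 1) / page_size)

def impact(item: SerpItem) -> float:
    """Signed impact on the subject's reputation: polarity scaled by influence."""
    return sentiment(item.title + " " + item.snippet) * influence(item)

if __name__ == "__main__":
    results = [
        SerpItem(1, "Company X praised for innovative product", "Analysts call it excellent."),
        SerpItem(2, "Lawsuit filed against Company X", "The complaint alleges fraud."),
    ]
    for r in results:
        print(r.rank, round(impact(r), 2))
```

In the thesis itself, the sentiment step corresponds to the techniques of Chapter 3 and the influence score to the measures defined in Section 3.2; the heuristics above merely show how the two values could be combined per SERP item.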
(Italiano)

The World Wide Web has reached a stage of development and use far beyond what was foreseen at the time of its initial development at the CERN laboratories. Indeed, it can be said to have achieved considerable importance in everyday activities. With people in most countries in close contact with it, its network of links and its many native applications allow users to browse any kind of content they wish. The large amount of data hosted on the Internet can be employed in many fields and diverse applications but, in the context of this research, the greatest interest lies in the possibility of using such data to determine, with a certain degree of precision, the polarity of the opinions expressed by online communities, users or even web newspapers about a given topic, individual or business. Indeed, data can carry an orientation, whether positive or negative, both in the specific context of a research topic and, in more general terms, when that subject is cited in contexts characterized by positive or negative content. This kind of activity, called "Analisi del sentimento" (in English, Sentiment Analysis), has been widely investigated in several important studies on the subject. The present work is based on the work proposed by Bing Liu, which introduces a framework for sentiment analysis that has been cited in numerous later studies. Although based on Liu's work, this thesis aims to introduce a simplified framework adapted to a more specific context than the one for which the original framework was designed. In particular, it introduces the concept of the "influence score" (punteggio di influenza), defined as the measure with which a particular text is able to influence the reputation of the subject of the search. Using this concept, the framework aims to determine, in a simple way, the amount of impact that a given piece of content can have on the individual or business under examination. The framework is based on a scientific process that involves an initial phase of data analysis through the use of Data Science principles and the definition of further components and concepts internal to the framework itself. The application developed as a first implementation of the framework analyses the result pages (SERPs) returned by a web search performed through search engines, examining each element separately so as to obtain its sentiment and influence-score values and to provide the user, as the result of continuous analysis, with a useful functionality for monitoring their own reputation. This document also examines all the components that interact within the framework implementation: search engines (Search Engines, Chapter 1), data science (Data Science, Chapter 2), explorative content-analysis techniques (Explorative Algorithmic Techniques, Chapter 3), online reputation (Web Reputation, Chapter 4) and the application (The Tool, Chapter 5), finally considering possible additions and future developments in Chapter 6.
Contents

Abstract
(English)
(Italiano)
1. Search Engines: finding the needle in the haystack
1.1. At the heart of the web
1.2. Search Engines
1.2.1. Browsing the past
1.2.1.2. Directories vs. Search Engines (The Yahoo example)
1.2.1.3. Other types of search engines
1.2.2. The Algorithms and the theoretical framework
1.2.2.1. Elements common to all search engines
1.2.2.2. Elements search engines differ in
1.2.2.3. General modules of a search engine: a more in-depth look
1.2.2.4. Crawling and indexing
1.2.2.5. Relevance and popularity
1.2.2.5.1. Relevance
1.2.2.5.2. Popularity
1.2.2.7. Search Engine Ranking Factors
1.2.2.8. Page rank: at Google's foundation
1.2.2.9. HITS
Image: hubs and authorities
1.2.2.10. TF-IDF
1.2.2.11. Beyond the algorithms: influencing factors
1.3. With many data, come great possibilities
1.3.1. Meaningful results: web reputation
1.3.2. What does the "query" say
1.3.3. An interesting question: is it possible to obtain more information from the returned result set?
2. Data Science: a brief exploration of the principles behind data
2.1. A scientific approach to data analysis
2.1.1. An introduction to Data Science
2.1.2. Data types
2.1.2.1. Organized and unorganized data
2.1.2.2. Quantitative and qualitative data
2.1.2.3. The four levels of data
2.1.2.3.1. Nominal level
2.1.2.3.2. Ordinal level
2.1.2.3.3. Interval level
2.1.2.3.4. Ratio level
2.1.2.3.5. Summary
2.1.3. Data science: a five-step scientific approach
2.1.3.1. Posing an interesting question
2.1.3.2. Fetching data
2.1.3.3. Exploring data
2.1.3.4. Modelling data
2.1.3.5. Presenting findings
2.1.4. Applied data science: extracting info on the web
2.1.4.1. News Analytics
2.1.4.1.1. Algorithms
2.1.4.1.2. Models
2.1.4.1.3. Putting it all together
3. Explorative algorithmic techniques
3.1. Sentiment Analysis
3.1.1. A framework for Sentiment Analysis
3.1.1.1. Tasks of sentiment analysis
3.2. Influence Score
3.2.1. Defining the problem
3.2.1.1. Additional measures
3.2.1.2. Adjusting the framework
4.1. Meaningful results: web reputation
4.1.2. Web Reputation Management
4.1.2.1. Web Reputation Activities
4.1.2.1. Techniques
4.1.2.2. Page indexing altering techniques
4.1.2.2.1. SEO - Search Engine Optimization
4.1.2.2.2. When SEO is not enough
4.1.2.3. Ethics
5. The Companion Tool
5.1. An analysis of the work
5.1.1. A structural problem analysis
5.1.1.1. Getting the Data
5.1.1.2. Data Exploration
5.1.1.3. Modelling Data
5.1.1.4. Presenting Findings
5.1.2. Tool's architecture
5.1.2.1. Early development
5.1.2.2. Later stages of development
5.1.2.3.