An Ontology-Based Web Crawling Approach for the Retrieval of Materials in the Educational Domain

Total Page:16

File Type:pdf, Size:1020Kb

An Ontology-Based Web Crawling Approach for the Retrieval of Materials in the Educational Domain An Ontology-based Web Crawling Approach for the Retrieval of Materials in the Educational Domain Mohammed Ibrahim1 a and Yanyan Yang2 b 1School of Engineering, University of Portsmouth, Anglesea Road, PO1 3DJ, Portsmouth, United Kingdom 2School of Computing, University of Portsmouth, Anglesea Road, PO1 3DJ, Portsmouth, United Kingdom Keywords: Web Crawling, Ontology, Education Domain. Abstract: As the web continues to be a huge source of information for various domains, the information available is rapidly increasing. Most of this information is stored in unstructured databases and therefore searching for relevant information becomes a complex task and the search for pertinent information within a specific domain is time-consuming and, in all probability, results in irrelevant information being retrieved. Crawling and downloading pages that are related to the user’s enquiries alone is a tedious activity. In particular, crawlers focus on converting unstructured data and sorting this into a structured database. In this paper, among others kind of crawling, we focus on those techniques that extract the content of a web page based on the relations of ontology concepts. Ontology is a promising technique by which to access and crawl only related data within specific web pages or a domain. The methodology proposed is a Web Crawler approach based on Ontology (WCO) which defines several relevance computation strategies with increased efficiency thereby reducing the number of extracted items in addition to the crawling time. It seeks to select and search out web pages in the education domain that matches the user’s requirements. In WCO, data is structured based on the hierarchical relationship, the concepts which are adapted in the ontology domain. The approach is flexible for application to crawler items for different domains by adapting user requirements in defining several relevance computation strategies with promising results. 1 INTRODUCTION is needed (Pant et al., 2004). The educational domain is one of the domains that have been affected by this Vast amounts of information can be found on the web issue(Almohammadi et al. 2017). As the contents of (Vallet et al. 2007) onsequently, finding relevant the web grow, it will become increasingly information may not be an easy task. Therefore, an challenging especially for students seeking to find efficient and effective approach which seeks to and organize the collection of relevant and useful organize and retrieve relevant information is crucial educational content such as university information, (Yang 2010). With the rapid increase of documents subject information and career information(Chang et available from the complex WWW, more knowledge al., 2016). Until now, there has been no centralized regarding users’ needs is encompassed. However, an method of discovering, aggregating and utilizing enormous amount of information makes pinpointing educational content(Group 2009) by utilising a relevant information a tedious task. For instance, the crawler used by a search engine to retrieve standard tools for web search engines have low information from a massive number of web pages. precision as, typically, some relevant web pages are Moreover, this can also be useful as a way to find a returned but are combined with a large number of variety of information on the internet(Agre & Dongre irrelevant pages mainly due to topic-specific features 2015). Since we aim to find precise data on the web, which may occur in different contexts. Therefore, an this comprehensive method may not instantly retrieve appropriate framework which can organize the the required given the current size of the web. overwhelming number of documents on the internet Most existing approaches towards retrieval tech- a https://orcid.org/0000-0002-9976-0207 b https://orcid.org/0000-0003-1047-2274 900 Ibrahim, M. and Yang, Y. An Ontology-based Web Crawling Approach for the Retrieval of Materials in the Educational Domain. DOI: 10.5220/0007692009000906 In Proceedings of the 11th International Conference on Agents and Artificial Intelligence (ICAART 2019), pages 900-906 ISBN: 978-989-758-350-6 Copyright c 2019 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved An Ontology-based Web Crawling Approach for the Retrieval of Materials in the Educational Domain niques depend on keywords. There is no doubt that due to the recursive nature of its algorithm. The other the keywords or index terms fail to adequately approach for crawling higher relevant pages was by capture the contents, returning many irrelevant results the use of neural networks (Fahad et al. 2014) but causing poor retrieval performance(Agre & Mahajan even this approach has not been established as the 2015). In this paper, we propose a new approach to most efficient crawling technique to date. All these web crawler based on ontology called WCO, which is approaches, based on link analysis, only partially used to collect specific information within the solve the problem. Ganesh et al.(Ganesh et al. 2004) education domain. This approach focuses on a proposed a new metric solving the problem of finding crawler which can retrieve information by computing the relevance of pages before the process of crawling the similarity between the user’s query terms and the to an optimal level. Researchers in (Liu et al., 2011) concepts in the reference ontology for a specific present an intelligent, focused crawler algorithm in domain. For example, if a user seeks to retrieve all the which ontology is embedded to evaluate a page’s information about master’s courses in computer relevance to the topic. Ontology is “a formal, explicit science, the crawler will be able to collect all the specification of a shared conceptualization” (Gauch course information related to the specific ontology et al., 2003). Ontology provides a common designed for the computer science domain. vocabulary of an area and defines, with different The crawling system described in this paper levels of formality, the meaning of terms and the matches the ontology concepts given the desired relationships between them (Krisnadhi 2015). result. After crawling concept terms, a similarity Ontologies were developed in Artificial Intelligence ranking system ranks the crawled information. This to facilitate knowledge sharing and reuse. They have reveals highly relevant pages that may have been become an interesting research topic for researchers overlooked by focused standard web crawlers in Artificial Intelligence with specific reference to the crawling for educational contents while at the same study domain regarding knowledge engineering, time filtering redundant pages thereby avoiding natural language processing and knowledge additional paths. representation. Ontologies help in describing The paper is structured into sections. Section 2 semantic web-based knowledge management reviews related work and background; Section 3 architecture and a suite of innovative tools for introduces the proposed approach to architecture. In semantic information processing (Tarus et al., 2017). section 4 the experiment and the results are discussed. Ontology-based web crawlers use ontological Section 5 provides a conclusion and concepts to improve their performance. Hence, it may recommendations for future work. become effortless to obtain relevant data as per the user’s requirements. Ontology is also used for structuring and filtering the knowledge repository. 2 RELATED WORK The ontology concept is used in numerous studies (Gauch et al., 2003; Agre and Mahajan, 2015). Gunjan and Snehlata (Agre & Dongre 2015) proposed A web crawler is a software programme that browses an algorithm for an ontology-based internet crawler the World Wide Web in a systematic, automated which retrieved only relevant sites and made the best manner (Hammond et al., 2016). There has been estimation path for crawling that helped to improve considerable work done on the prioritizing of the URL queue to effect efficient crawling. However, the the crawler performance. The proposed approach performance of the existing prioritizing algorithms deals with information path and domain ontology, finding the most relevant web content and pages for crawling does not suit the requirements of either according to user requirements. Ontology was used the various kinds or the levels of the users. The HITS for filtering and structuring the repository algorithm proposed by Kleinberg (Pant & Srinivasan information. 2006) is based on query-time processing to deduce the hubs and authorities that exist in a sub graph of the web consisting of both the results to a query and the local neighbourhood of these results. The main 3 PROPOSED SYSTEM drawback of Kleinberg’s HITS algorithm is its query- time processing for crawling pages. The best-known Our proposed approach seeks to apply crawling to example of such link analysis is the Pagerank educational content, such as university course Algorithm which has been successfully employed by information, and sort it into a database through the the Google Search Engine (Gauch et al., 2003). calculation of the hierarchy similarity between the However, Pagerank suffers from slow computation user query and the course contents. The crawler 901 ICAART 2019 - 11th International Conference on Agents and Artificial Intelligence consists of several stages; it begins with construction architecture of the proposed approach is illustrated in domain ontology which it uses as a reference of Fig.1. The user interacts with the crawler using
Recommended publications
  • Building a Scalable Index and a Web Search Engine for Music on the Internet Using Open Source Software
    Department of Information Science and Technology Building a Scalable Index and a Web Search Engine for Music on the Internet using Open Source software André Parreira Ricardo Thesis submitted in partial fulfillment of the requirements for the degree of Master in Computer Science and Business Management Advisor: Professor Carlos Serrão, Assistant Professor, ISCTE-IUL September, 2010 Acknowledgments I should say that I feel grateful for doing a thesis linked to music, an art which I love and esteem so much. Therefore, I would like to take a moment to thank all the persons who made my accomplishment possible and hence this is also part of their deed too. To my family, first for having instigated in me the curiosity to read, to know, to think and go further. And secondly for allowing me to continue my studies, providing the environment and the financial means to make it possible. To my classmate André Guerreiro, I would like to thank the invaluable brainstorming, the patience and the help through our college years. To my friend Isabel Silva, who gave me a precious help in the final revision of this document. Everyone in ADETTI-IUL for the time and the attention they gave me. Especially the people over Caixa Mágica, because I truly value the expertise transmitted, which was useful to my thesis and I am sure will also help me during my professional course. To my teacher and MSc. advisor, Professor Carlos Serrão, for embracing my will to master in this area and for being always available to help me when I needed some advice.
    [Show full text]
  • Open Search Environments: the Free Alternative to Commercial Search Services
    Open Search Environments: The Free Alternative to Commercial Search Services. Adrian O’Riordan ABSTRACT Open search systems present a free and less restricted alternative to commercial search services. This paper explores the space of open search technology, looking in particular at lightweight search protocols and the issue of interoperability. A description of current protocols and formats for engineering open search applications is presented. The suitability of these technologies and issues around their adoption and operation are discussed. This open search approach is especially useful in applications involving the harvesting of resources and information integration. Principal among the technological solutions are OpenSearch, SRU, and OAI-PMH. OpenSearch and SRU realize a federated model to enable content providers and search clients communicate. Applications that use OpenSearch and SRU are presented. Connections are made with other pertinent technologies such as open-source search software and linking and syndication protocols. The deployment of these freely licensed open standards in web and digital library applications is now a genuine alternative to commercial and proprietary systems. INTRODUCTION Web search has become a prominent part of the Internet experience for millions of users. Companies such as Google and Microsoft offer comprehensive search services to users free with advertisements and sponsored links, the only reminder that these are commercial enterprises. Businesses and developers on the other hand are restricted in how they can use these search services to add search capabilities to their own websites or for developing applications with a search feature. The closed nature of the leading web search technology places barriers in the way of developers who want to incorporate search functionality into applications.
    [Show full text]
  • Efficient Focused Web Crawling Approach for Search Engine
    Ayar Pranav et al, International Journal of Computer Science and Mobile Computing, Vol.4 Issue.5, May- 2015, pg. 545-551 Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320–088X IJCSMC, Vol. 4, Issue. 5, May 2015, pg.545 – 551 RESEARCH ARTICLE Efficient Focused Web Crawling Approach for Search Engine 1 2 Ayar Pranav , Sandip Chauhan Computer & Science Engineering, Kalol Institute of Technology and Research Canter, Kalol, Gujarat, India 1 [email protected]; 2 [email protected] Abstract— a focused crawler traverses the web, selecting out relevant pages to a predefined topic and neglecting those out of concern. Collecting domain specific documents using focused crawlers has been considered one of most important strategies to find relevant information. While surfing the internet, it is difficult to deal with irrelevant pages and to predict which links lead to quality pages. However most focused crawler use local search algorithm to traverse the web space, but they could easily trapped within limited a sub graph of the web that surrounds the starting URLs also there is problem related to relevant pages that are miss when no links from the starting URLs. There is some relevant pages are miss. To address this problem we design a focused crawler where calculating the frequency of the topic keyword also calculate the synonyms and sub synonyms of the keyword. The weight table is constructed according to the user query. To check the similarity of web pages with respect to topic keywords and priority of extracted link is calculated.
    [Show full text]
  • Distributed Indexing/Searching Workshop Agenda, Attendee List, and Position Papers
    Distributed Indexing/Searching Workshop Agenda, Attendee List, and Position Papers Held May 28-19, 1996 in Cambridge, Massachusetts Sponsored by the World Wide Web Consortium Workshop co-chairs: Michael Schwartz, @Home Network Mic Bowman, Transarc Corp. This workshop brings together a cross-section of people concerned with distributed indexing and searching, to explore areas of common concern where standards might be defined. The Call For Participation1 suggested particular focus on repository interfaces that support efficient and powerful distributed indexing and searching. There was a great deal of interest in this workshop. Because of our desire to limit attendance to a workable size group while maximizing breadth of attendee backgrounds, we limited attendance to one person per position paper, and furthermore we limited attendance to one person per institution. In some cases, attendees submitted multiple position papers, with the intention of discussing each of their projects or ideas that were relevant to the workshop. We had not anticipated this interpretation of the Call For Participation; our intention was to use the position papers to select participants, not to provide a forum for enumerating ideas and projects. As a compromise, we decided to choose among the submitted papers and allow multiple position papers per person and per institution, but to restrict attendance as noted above. Hence, in the paper list below there are some cases where one author or institution has multiple position papers. 1 http://www.w3.org/pub/WWW/Search/960528/cfp.html 1 Agenda The Distributed Indexing/Searching Workshop will span two days. The first day's goal is to identify areas for potential standardization through several directed discussion sessions.
    [Show full text]
  • Natural Language Processing Technique for Information Extraction and Analysis
    International Journal of Research Studies in Computer Science and Engineering (IJRSCSE) Volume 2, Issue 8, August 2015, PP 32-40 ISSN 2349-4840 (Print) & ISSN 2349-4859 (Online) www.arcjournals.org Natural Language Processing Technique for Information Extraction and Analysis T. Sri Sravya1, T. Sudha2, M. Soumya Harika3 1 M.Tech (C.S.E) Sri Padmavati Mahila Visvavidyalayam (Women’s University), School of Engineering and Technology, Tirupati. [email protected] 2 Head (I/C) of C.S.E & IT Sri Padmavati Mahila Visvavidyalayam (Women’s University), School of Engineering and Technology, Tirupati. [email protected] 3 M. Tech C.S.E, Assistant Professor, Sri Padmavati Mahila Visvavidyalayam (Women’s University), School of Engineering and Technology, Tirupati. [email protected] Abstract: In the current internet era, there are a large number of systems and sensors which generate data continuously and inform users about their status and the status of devices and surroundings they monitor. Examples include web cameras at traffic intersections, key government installations etc., seismic activity measurement sensors, tsunami early warning systems and many others. A Natural Language Processing based activity, the current project is aimed at extracting entities from data collected from various sources such as social media, internet news articles and other websites and integrating this data into contextual information, providing visualization of this data on a map and further performing co-reference analysis to establish linkage amongst the entities. Keywords: Apache Nutch, Solr, crawling, indexing 1. INTRODUCTION In today’s harsh global business arena, the pace of events has increased rapidly, with technological innovations occurring at ever-increasing speed and considerably shorter life cycles.
    [Show full text]
  • A Hadoop Based Platform for Natural Language Processing of Web Pages and Documents
    Journal of Visual Languages and Computing 31 (2015) 130–138 Contents lists available at ScienceDirect Journal of Visual Languages and Computing journal homepage: www.elsevier.com/locate/jvlc A hadoop based platform for natural language processing of web pages and documents Paolo Nesi n,1, Gianni Pantaleo 1, Gianmarco Sanesi 1 Distributed Systems and Internet Technology Lab, DISIT Lab, Department of Information Engineering (DINFO), University of Florence, Firenze, Italy article info abstract Available online 30 October 2015 The rapid and extensive pervasion of information through the web has enhanced the Keywords: diffusion of a huge amount of unstructured natural language textual resources. A great Natural language processing interest has arisen in the last decade for discovering, accessing and sharing such a vast Hadoop source of knowledge. For this reason, processing very large data volumes in a reasonable Part-of-speech tagging time frame is becoming a major challenge and a crucial requirement for many commercial Text parsing and research fields. Distributed systems, computer clusters and parallel computing Web crawling paradigms have been increasingly applied in the recent years, since they introduced Big Data Mining significant improvements for computing performance in data-intensive contexts, such as Parallel computing Big Data mining and analysis. Natural Language Processing, and particularly the tasks of Distributed systems text annotation and key feature extraction, is an application area with high computational requirements; therefore, these tasks can significantly benefit of parallel architectures. This paper presents a distributed framework for crawling web documents and running Natural Language Processing tasks in a parallel fashion. The system is based on the Apache Hadoop ecosystem and its parallel programming paradigm, called MapReduce.
    [Show full text]
  • Curlcrawler:A Framework of Curlcrawler and Cached Database for Crawling the Web with Thumb
    Ela Kumaret al, / (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 2 (4) , 2011, 1700-1705 CurlCrawler:A framework of CurlCrawler and Cached Database for crawling the web with thumb. Dr Ela Kumar#,Ashok Kumar$ # School of Information and Communication Technology, Gautam Buddha University, Greater Noida $ AIM & ACT.Banasthali University,Banasthali(Rajasthan.) Abstract-WWW is a vast source of information and search engines about basic modules of search engine functioning and these are the meshwork to navigate the web for several purposes. It is essential modules are (see Fig.1)[9,10]. problematic to identify and ping with graphical frame of mind for the desired information amongst the large set of web pages resulted by the search engine. With further increase in the size of the Internet, the problem grows exponentially. This paper is an endeavor to develop and implement an efficient framework with multi search agents to make search engines inapt to chaffing using curl features of the programming, called CurlCrawler. This work is an implementation experience for use of graphical perception to upright search on the web. Main features of this framework are curl programming, personalization of information, caching and graphical perception. Keywords and Phrases-Indexing Agent, Filtering Agent, Interacting Agent, Seolinktool, Thumb, Whois, CachedDatabase, IECapture, Searchcon, Main_spider, apnasearchdb. 1. INTRODUCTION Information is a vital role playing versatile thing from availability at church level to web through trends of books. WWW is now the prominent and up-to-date huge repository of information available to everyone, everywhere and every time. Moreover information on web is navigated using search Fig.1: Components of Information Retrieval System engines like AltaVista, WebCrawler, Hot Boat etc[1].Amount of information on the web is growing at very fast and owing to this search engine’s optimized functioning is always a thrust Store: It stores the crawled pages.
    [Show full text]
  • Meta Search Engine with an Intelligent Interface for Information Retrieval on Multiple Domains
    International Journal of Computer Science, Engineering and Information Technology (IJCSEIT), Vol.1, No.4, October 2011 META SEARCH ENGINE WITH AN INTELLIGENT INTERFACE FOR INFORMATION RETRIEVAL ON MULTIPLE DOMAINS D.Minnie1, S.Srinivasan2 1Department of Computer Science, Madras Christian College, Chennai, India [email protected] 2Department of Computer Science and Engineering, Anna University of Technology Madurai, Madurai, India [email protected] ABSTRACT This paper analyses the features of Web Search Engines, Vertical Search Engines, Meta Search Engines, and proposes a Meta Search Engine for searching and retrieving documents on Multiple Domains in the World Wide Web (WWW). A web search engine searches for information in WWW. A Vertical Search provides the user with results for queries on that domain. Meta Search Engines send the user’s search queries to various search engines and combine the search results. This paper introduces intelligent user interfaces for selecting domain, category and search engines for the proposed Multi-Domain Meta Search Engine. An intelligent User Interface is also designed to get the user query and to send it to appropriate search engines. Few algorithms are designed to combine results from various search engines and also to display the results. KEYWORDS Information Retrieval, Web Search Engine, Vertical Search Engine, Meta Search Engine. 1. INTRODUCTION AND RELATED WORK WWW is a huge repository of information. The complexity of accessing the web data has increased tremendously over the years. There is a need for efficient searching techniques to extract appropriate information from the web, as the users require correct and complex information from the web. A Web Search Engine is a search engine designed to search WWW for information about given search query and returns links to various documents in which the search query’s key words are found.
    [Show full text]
  • Effective Focused Crawling Based on Content and Link Structure Analysis
    (IJCSIS) International Journal of Computer Science and Information Security, Vol. 2, No. 1, June 2009 Effective Focused Crawling Based on Content and Link Structure Analysis Anshika Pal, Deepak Singh Tomar, S.C. Shrivastava Department of Computer Science & Engineering Maulana Azad National Institute of Technology Bhopal, India Emails: [email protected] , [email protected] , [email protected] Abstract — A focused crawler traverses the web selecting out dependent ontology, which in turn strengthens the metric used relevant pages to a predefined topic and neglecting those out of for prioritizing the URL queue. The Link-Structure-Based concern. While surfing the internet it is difficult to deal with method is analyzing the reference-information among the pages irrelevant pages and to predict which links lead to quality pages. to evaluate the page value. This kind of famous algorithms like In this paper, a technique of effective focused crawling is the Page Rank algorithm [5] and the HITS algorithm [6]. There implemented to improve the quality of web navigation. To check are some other experiments which measure the similarity of the similarity of web pages w.r.t. topic keywords, a similarity page contents with a specific subject using special metrics and function is used and the priorities of extracted out links are also reorder the downloaded URLs for the next crawl [7]. calculated based on meta data and resultant pages generated from focused crawler. The proposed work also uses a method for A major problem faced by the above focused crawlers is traversing the irrelevant pages that met during crawling to that it is frequently difficult to learn that some sets of off-topic improve the coverage of a specific topic.
    [Show full text]
  • Metadata Statistics for a Large Web Corpus
    Metadata Statistics for a Large Web Corpus Peter Mika Tim Potter Yahoo! Research Yahoo! Research Diagonal 177 Diagonal 177 Barcelona, Spain Barcelona, Spain [email protected] [email protected] ABSTRACT are a number of factors that complicate the comparison of We provide an analysis of the adoption of metadata stan- results. First, different studies use different web corpora. dards on the Web based a large crawl of the Web. In par- Our earlier study used a corpus collected by Yahoo!'s web ticular, we look at what forms of syntax and vocabularies crawler, while the current study uses a dataset collected by publishers are using to mark up data inside HTML pages. the Bing crawler. Bizer et al. analyze the data collected by We also describe the process that we have followed and the http://www.commoncrawl.org, which has the obvious ad- difficulties involved in web data extraction. vantage that it is publicly available. Second, the extraction methods may differ. For example, there are a multitude of microformats (one for each object type) and although most 1. INTRODUCTION search engines and extraction libraries support the popular Embedding metadata inside HTML pages is one of the ones, different processors may recognize a different subset. ways to publish structured data on the Web, often pre- Unlike the specifications of microdata and RDFa published ferred by publishers and consumers over other methods of by the RDFa, the microformat specifications are also rather exposing structured data, such as publishing data feeds, informal and thus different processors may extract different SPARQL endpoints or RDF/XML documents.
    [Show full text]
  • Apophanies Or Epiphanies? How Crawlers Impact Our Understanding of the Web
    Apophanies or Epiphanies? How Crawlers Impact Our Understanding of the Web Syed Suleman Ahmad Muhammad Daniyal Dar Muhammad Fareed Zaffar University of Wisconsin - Madison University of Iowa LUMS Narseo Vallina-Rodriguez Rishab Nithyanand IMDEA Networks / ICSI University of Iowa ABSTRACT [22], International World Wide Web Conference (WWW) [25], In- Data generated by web crawlers has formed the basis for much of ternational AAAI Conference on Web and Social Media (ICWSM) our current understanding of the Internet. However, not all crawlers [24], and Privacy Enhancing Technology Symposium (PETS) [31] are created equal and crawlers generally find themselves trading off relied on Web data obtained through crawlers. However, not all between computational overhead, developer effort, data accuracy, crawlers are created equal or offer the same features, and therefore and completeness. Therefore, the choice of crawler has a critical im- the choice of crawler is critical. Crawlers generally find themselves pact on the data generated and knowledge inferred from it. In this trading-off between computational overhead, flexibility, data accuracy paper, we conduct a systematic study of the trade-offs presented by & completeness, and developer effort. On one extreme, command-line different crawlers and the impact that these can have on various tools such as Wget [35] and cURL [18] are lightweight and easy types of measurement studies. We make the following contributions: to use while providing data that is hardly accurate or complete – First, we conduct a survey of all research published since 2015 in the e.g., these tools have the inability to handle dynamically generated premier security and Internet measurement venues to identify and content or execute JavaScript which restricts the completeness of verify the repeatability of crawling methodologies deployed for dif- gathered data [21].
    [Show full text]
  • Web Crawling, Analysis and Archiving
    Web Crawling, Analysis and Archiving Vangelis Banos Aristotle University of Thessaloniki Faculty of Sciences School of Informatics Doctoral dissertation under the supervision of Professor Yannis Manolopoulos October 2015 Ανάκτηση, Ανάλυση και Αρχειοθέτηση του Παγκόσμιου Ιστού Ευάγγελος Μπάνος Αριστοτέλειο Πανεπιστήμιο Θεσσαλονίκης Σχολή Θετικών Επιστημών Τμήμα Πληροφορικής Διδακτορική Διατριβή υπό την επίβλεψη του Καθηγητή Ιωάννη Μανωλόπουλου Οκτώβριος 2015 i Web Crawling, Analysis and Archiving PhD Dissertation ©Copyright by Vangelis Banos, 2015. All rights reserved. The Doctoral Dissertation was submitted to the the School of Informatics, Faculty of Sci- ences, Aristotle University of Thessaloniki. Defence Date: 30/10/2015. Examination Committee Yannis Manolopoulos, Professor, Department of Informatics, Aristotle University of Thes- saloniki, Greece. Supervisor Apostolos Papadopoulos, Assistant Professor, Department of Informatics, Aristotle Univer- sity of Thessaloniki, Greece. Advisory Committee Member Dimitrios Katsaros, Assistant Professor, Department of Electrical & Computer Engineering, University of Thessaly, Volos, Greece. Advisory Committee Member Athena Vakali, Professor, Department of Informatics, Aristotle University of Thessaloniki, Greece. Anastasios Gounaris, Assistant Professor, Department of Informatics, Aristotle University of Thessaloniki, Greece. Georgios Evangelidis, Professor, Department of Applied Informatics, University of Mace- donia, Greece. Sarantos Kapidakis, Professor, Department of Archives, Library Science and Museology, Ionian University, Greece. Abstract The Web is increasingly important for all aspects of our society, culture and economy. Web archiving is the process of gathering digital materials from the Web, ingesting it, ensuring that these materials are preserved in an archive, and making the collected materials available for future use and research. Web archiving is a difficult problem due to organizational and technical reasons. We focus on the technical aspects of Web archiving.
    [Show full text]