An Ontology-based Web Crawling Approach for the Retrieval of Materials in the Educational Domain

Mohammed Ibrahim(1,a) and Yanyan Yang(2,b)
1 School of Engineering, University of Portsmouth, Anglesea Road, PO1 3DJ, Portsmouth, United Kingdom
2 School of Computing, University of Portsmouth, Anglesea Road, PO1 3DJ, Portsmouth, United Kingdom
a https://orcid.org/0000-0002-9976-0207
b https://orcid.org/0000-0003-1047-2274

DOI: 10.5220/0007692009000906. In Proceedings of the 11th International Conference on Agents and Artificial Intelligence (ICAART 2019), pages 900-906. ISBN: 978-989-758-350-6. Copyright 2019 by SCITEPRESS - Science and Technology Publications, Lda. All rights reserved.

Keywords: Web Crawling, Ontology, Education Domain.

Abstract: As the web continues to be a huge source of information for various domains, the information available is rapidly increasing. Most of this information is stored in unstructured databases; therefore, searching for relevant information becomes a complex task, and the search for pertinent information within a specific domain is time-consuming and, in all probability, results in irrelevant information being retrieved. Crawling and downloading only those pages that are related to the user's enquiries is a tedious activity. In particular, crawlers focus on converting unstructured data and sorting it into a structured database. In this paper, among other kinds of crawling, we focus on those techniques that extract the content of a web page based on the relations of ontology concepts. Ontology is a promising technique by which to access and crawl only related data within specific web pages or a domain. The proposed methodology is a Web Crawling approach based on Ontology (WCO), which defines several relevance computation strategies with increased efficiency, thereby reducing the number of extracted items as well as the crawling time. It seeks to select and search out web pages in the education domain that match the user's requirements. In WCO, data is structured based on the hierarchical relationships between the concepts adopted in the domain ontology. The approach is flexible and can be applied to crawl items for different domains by adapting the user requirements when defining the relevance computation strategies, with promising results.

1 INTRODUCTION

Vast amounts of information can be found on the web (Vallet et al., 2007); consequently, finding relevant information may not be an easy task. Therefore, an efficient and effective approach which seeks to organize and retrieve relevant information is crucial (Yang, 2010). With the rapid increase of documents available from the complex WWW, more knowledge regarding users' needs is encompassed. However, an enormous amount of information makes pinpointing relevant information a tedious task. For instance, the standard tools for web search engines have low precision as, typically, some relevant web pages are returned but are combined with a large number of irrelevant pages, mainly due to topic-specific features which may occur in different contexts. Therefore, an appropriate framework which can organize the overwhelming number of documents on the internet is needed (Pant et al., 2004).

The educational domain is one of the domains that have been affected by this issue (Almohammadi et al., 2017). As the contents of the web grow, it becomes increasingly challenging, especially for students, to find and organize relevant and useful educational content such as university information, subject information and career information (Chang et al., 2016). Until now, there has been no centralized method of discovering, aggregating and utilizing educational content (Group, 2009) by utilising a crawler, used by a search engine, to retrieve information from a massive number of web pages. Moreover, this can also be useful as a way to find a variety of information on the internet (Agre & Dongre, 2015). Since we aim to find precise data on the web, this comprehensive method may not instantly retrieve the required information, given the current size of the web. Most existing approaches towards retrieval techniques depend on keywords.





There is no doubt that keywords or index terms fail to adequately capture the contents, returning many irrelevant results and causing poor retrieval performance (Agre & Mahajan, 2015). In this paper, we propose a new ontology-based approach to web crawling, called WCO, which is used to collect specific information within the education domain. This approach focuses on a crawler which can retrieve information by computing the similarity between the user's query terms and the concepts in the reference ontology for a specific domain. For example, if a user seeks to retrieve all the information about master's courses in computer science, the crawler will be able to collect all the course information related to the specific ontology designed for the computer science domain.

The crawling system described in this paper matches the ontology concepts to give the desired result. After crawling concept terms, a similarity ranking system ranks the crawled information. This reveals highly relevant pages that may have been overlooked by standard focused web crawlers crawling for educational contents, while at the same time filtering redundant pages, thereby avoiding additional paths.

The paper is structured as follows. Section 2 reviews related work and background; Section 3 introduces the architecture of the proposed approach; in Section 4 the experiment and the results are discussed; Section 5 provides a conclusion and recommendations for future work.

2 RELATED WORK

A web crawler is a software programme that browses the World Wide Web in a systematic, automated manner (Hammond et al., 2016). There has been considerable work done on the prioritizing of the URL queue to effect efficient crawling. However, the performance of the existing prioritizing of pages for crawling does not suit the requirements of either the various kinds or the levels of the users. The HITS algorithm proposed by Kleinberg (Pant & Srinivasan, 2006) is based on query-time processing to deduce the hubs and authorities that exist in a subgraph of the web consisting of both the results to a query and the local neighbourhood of these results. The main drawback of Kleinberg's HITS algorithm is its query-time processing for crawling pages. The best-known example of such link analysis is the PageRank algorithm, which has been successfully employed by the Google search engine (Gauch et al., 2003). However, PageRank suffers from slow computation due to the recursive nature of its algorithm. Another approach for crawling highly relevant pages was the use of neural networks (Fahad et al., 2014), but even this approach has not been established as the most efficient crawling technique to date. All these approaches, based on link analysis, only partially solve the problem. Ganesh et al. (Ganesh et al., 2004) proposed a new metric solving the problem of finding the relevance of pages before the process of crawling to an optimal level. Researchers in (Liu et al., 2011) present an intelligent crawling algorithm in which ontology is embedded to evaluate a page's relevance to the topic.

Ontology is "a formal, explicit specification of a shared conceptualization" (Gauch et al., 2003). Ontology provides a common vocabulary of an area and defines, with different levels of formality, the meaning of terms and the relationships between them (Krisnadhi, 2015). Ontologies were developed in Artificial Intelligence to facilitate knowledge sharing and reuse. They have become an interesting research topic for researchers in Artificial Intelligence with specific reference to the study domain regarding knowledge engineering, natural language processing and knowledge representation. Ontologies help in describing ontology-based knowledge management architecture and a suite of innovative tools for semantic information processing (Tarus et al., 2017). Ontology-based web crawlers use ontological concepts to improve their performance. Hence, it may become effortless to obtain relevant data as per the user's requirements. Ontology is also used for structuring and filtering the knowledge repository. The ontology concept is used in numerous studies (Gauch et al., 2003; Agre and Mahajan, 2015). Gunjan and Snehlata (Agre & Dongre, 2015) proposed an algorithm for an ontology-based internet crawler which retrieved only relevant sites and made the best estimation path for crawling, which helped to improve the crawler performance. The proposed approach deals with information path and domain ontology, finding the most relevant web content and pages according to user requirements. Ontology was used for filtering and structuring the repository information.

3 PROPOSED SYSTEM

Our proposed approach seeks to apply crawling to educational content, such as university course information, and sort it into a database through the calculation of the hierarchy similarity between the user query and the course contents.


The crawler consists of several stages. It begins with the construction of a domain ontology, which it uses as a reference of similarity between the user query and the web contents. The user query is adjusted to generate query-based ontology concepts, and Term Frequency-Inverse Document Frequency (TF-IDF) is used for identifying terms for query expansion. The architecture of the proposed approach is illustrated in Fig. 1. The user interacts with the crawler using a simple interface designed to allow the user to insert a query.

Figure 1: WCO crawler main architecture.

3.1 Ontology Model

Firstly, we describe how information retrieval can be achieved in the ontology. For instance, if D is the set of documents annotated by concepts from an ontology O, a document is represented by a vector d of concept weights. For each concept x in O annotating d, d_x is the importance of x in document d. It can be computed by using the TF-IDF algorithm, as shown in Eq. 1:

    d_x = \frac{freq_{x,d}}{\max_y freq_{y,d}} \log \frac{|D|}{n_x}    (1)

where freq_{x,d} is the number of occurrences of x in d, \max_y freq_{y,d} is the frequency of the most repeated concept in d, and n_x is the number of documents annotated by x. The cosine similarity between the query and the document is then used as the relevance score for ranking the documents, as shown in Eq. 2:

    \text{cosine similarity}(d, q) = \frac{\sum_i d_i \, q_i}{\sqrt{\sum_i d_i^2} \, \sqrt{\sum_i q_i^2}}    (2)

where d_i is the ith component of the document vector and q_i is the ith component of the query vector. The ontology-based query is used as an input to the search engine module. The output of this phase is a set of documents which is used by the crawling system and which, furthermore, operates as a way to check all the web pages for validity (i.e. HTML, JSP, etc.). If a page is valid, it is parsed, and the parsed content is matched with the ontology; if the page matches, it is indexed, otherwise it is discarded.

This model consists of two main parts. Firstly, in building the reference ontology, a well-known open source platform, Protégé (Ameen et al., 2012), is used to build the reference ontology file. Protégé is a Java-based tool providing a graphical user interface (GUI) to define ontologies. The concepts in the user search queries will be matched with the concepts in the reference ontology file. The second part is to assign weights to the concepts and their levels in the ontology, as shown in Table 1. A case study of a computer sciences course hierarchy (Programs, 2013) is used. Computer sciences is the root class in our ontology and has seven subclasses (Information system, Computer science, Artificial intelligence, Health information, Computer generated visual and audio effects, Software engineering and Games), as shown in Fig. 2. Each of the subclasses has further subclasses; for instance, Artificial intelligence is further subdivided into cognitive modelling, neural computing, knowledge representation, automated reasoning, speech & natural language processing and computer vision, as shown in Fig. 3.

In order to clearly understand the relationship between the ontology concepts and their levels in the ontology, we apply the following definitions:

Definition 1: Let C be a concept, L the levels of the concepts of an ontology and LW the weight of the level. F is the frequency of the concept in the URL parsed contents. The concept weight in each level can be obtained using Eq. 3:

    \text{Concept Weight Level (CWL)} = LW \times F    (3)

Definition 2: Let SC be the set of concepts of an ontology. We define the levels of a concept c_i as levels(c_i) = \{ l \in SC \mid l \in hyponyms(c_i) \wedge l \text{ is a level} \}, where l is a level if hyponyms(l) = \emptyset.

The upper level in the ontology is given a higher weight than the lower level; for example, the term computer sciences in a URL is more specific to the field of study than the term information system, which is a field within the computer science domain, as shown in Table 1.
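To make Eqs. 1 and 2 concrete, the sketch below implements the concept weighting and the cosine-similarity ranking in Java, the language the system was implemented in (Section 4.1). The class and method names (OntologyRanker, tfIdf, cosineSimilarity) and the sample figures are illustrative assumptions, not the authors' actual code.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of the TF-IDF concept weighting (Eq. 1) and cosine ranking (Eq. 2). */
public class OntologyRanker {

    /**
     * Eq. 1: d_x = (freq_{x,d} / max_y freq_{y,d}) * log(|D| / n_x).
     *
     * @param freqInDoc    occurrences of concept x in document d
     * @param maxFreqInDoc frequency of the most repeated concept in d
     * @param totalDocs    |D|, number of documents in the collection
     * @param docsWithX    n_x, number of documents annotated by x
     */
    public static double tfIdf(int freqInDoc, int maxFreqInDoc,
                               int totalDocs, int docsWithX) {
        double tf = (double) freqInDoc / maxFreqInDoc;
        double idf = Math.log((double) totalDocs / docsWithX);
        return tf * idf;
    }

    /** Eq. 2: cosine similarity between document and query vectors keyed by concept. */
    public static double cosineSimilarity(Map<String, Double> doc,
                                          Map<String, Double> query) {
        double dot = 0.0, docNorm = 0.0, queryNorm = 0.0;
        for (Map.Entry<String, Double> e : doc.entrySet()) {
            Double q = query.get(e.getKey());
            if (q != null) dot += e.getValue() * q;          // shared concepts only
            docNorm += e.getValue() * e.getValue();
        }
        for (double q : query.values()) queryNorm += q * q;
        if (docNorm == 0 || queryNorm == 0) return 0.0;
        return dot / (Math.sqrt(docNorm) * Math.sqrt(queryNorm));
    }

    public static void main(String[] args) {
        // Hypothetical weights for one crawled page (collection size is illustrative).
        Map<String, Double> doc = new HashMap<>();
        doc.put("artificial intelligence", tfIdf(5, 5, 6000, 300));
        doc.put("computer science", tfIdf(2, 5, 6000, 1200));

        Map<String, Double> query = new HashMap<>();
        query.put("artificial intelligence", 1.0);

        System.out.printf("relevance score = %.4f%n", cosineSimilarity(doc, query));
    }
}
```

With concept weights from Eq. 1 on both sides, the cosine score of Eq. 2 is what the ranking step uses to order crawled pages.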


Table 1: Ontology weight table for Computer Sciences courses.

Level of concepts | Ontology terms in the course title | Weight (W)
1 | Computer sciences | 0.5
2 | Information systems | 0.3
2 | Software engineering | 0.3
2 | Artificial intelligence | 0.3
2 | Health informatics | 0.3
2 | Games | 0.3
3 | Computer architectures | 0.2
3 | Operating systems | 0.2
3 | Information modelling | 0.2
3 | Systems design methodologies | 0.2
3 | Systems analysis & design | 0.2
3 | Databases | 0.2
3 | Computational science foundations | 0.2
3 | Human-computer interaction | 0.2
3 | Multimedia computing science | 0.2
3 | Internet | 0.2
3 | e-business | 0.2

Figure 2: OWL Viz view of Computer Science classes generated by Protégé.
Figure 3: OWL Viz view of Artificial Intelligence classes generated by Protégé.

Crawling Algorithm Implementation

The main source of educational information regarding university courses in the United Kingdom is UCAS (ucas.com, 2018). Before we begin the crawler process, it is necessary to identify the information that needs to be crawled, because each URL contains much information that is not relevant. In this paper, course information includes all the important information that a student may require relating to the course, such as the course title, field of study, major subject, level, fee, university location, entry requirements and the language of the course. Fig. 4 shows the crawling algorithm; the steps of the crawler process are as follows (a simplified code sketch is given after this list):

1. The user gives query terms through the GUI of the crawler.
2. Obtain the seed URL.
3. Check whether the webpage is valid, that is, of the defined type.
4. Add the page to the queue.
5. Parse the contents.
6. Check the similarity of the parsed contents with the reference ontology concepts by using Eq. 1.
7. If the page is relevant, then
8. add it to the index;
9. otherwise the page does not need to be considered further.
10. Get the response from the server.
11. Add the webpage to the index.
12. Get the response from the server and, if it is OK, then
13. read the Protégé ontology file and match the content of the webpage with the terms of the ontology.
14. Compute the relevance score of the web page as defined by the algorithm, add the webpage to the index and the cached file to a folder.
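The steps above amount to a standard frontier-queue loop. The following minimal Java sketch shows steps 1-9 under simplifying assumptions: fetching uses plain java.net, parsing is a crude tag-stripping pass, and the relevance test is a placeholder for the ontology similarity of Section 3.1. The seed URL, the RELEVANCE_THRESHOLD constant and the class name WcoCrawler are hypothetical, not the authors' implementation.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.*;
import java.util.regex.*;

/** Simplified frontier-queue crawl loop following steps 1-9 of the algorithm. */
public class WcoCrawler {
    // Illustrative relevance cut-off; the paper does not publish its threshold.
    static final double RELEVANCE_THRESHOLD = 0.3;
    static final int MAX_PAGES = 100;

    public static void main(String[] args) throws Exception {
        Queue<String> frontier = new ArrayDeque<>();       // step 4: URL queue
        Set<String> seen = new HashSet<>();
        List<String> index = new ArrayList<>();            // step 8: relevant pages
        frontier.add("https://digital.ucas.com/courses");  // step 2: seed URL (example)

        while (!frontier.isEmpty() && index.size() < MAX_PAGES) {
            String url = frontier.poll();
            if (!seen.add(url)) continue;                  // skip already-visited URLs
            String html = fetch(url);                      // server response
            if (html == null) continue;                    // step 3: validity check
            String text = html.replaceAll("<[^>]+>", " "); // step 5: crude parse
            double score = relevance(text);                // step 6: ontology similarity
            if (score >= RELEVANCE_THRESHOLD) {            // step 7
                index.add(url);                            // step 8: add to index
            }                                              // step 9: else discard
            frontier.addAll(extractLinks(html));
        }
        System.out.println("Indexed " + index.size() + " relevant pages.");
    }

    static String fetch(String url) {
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new URL(url).openStream()))) {
            StringBuilder sb = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) sb.append(line).append('\n');
            return sb.toString();
        } catch (Exception e) { return null; }
    }

    /** Placeholder for the ontology-based score of Section 3.1 (Eqs. 1-3). */
    static double relevance(String text) {
        return text.toLowerCase().contains("computer science") ? 1.0 : 0.0;
    }

    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = Pattern.compile("href=\"(https?://[^\"]+)\"").matcher(html);
        while (m.find()) links.add(m.group(1));
        return links;
    }
}
```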


Figure 4: Crawling algorithm architecture.
Figure 5: Snapshot of crawler code in the NetBeans platform.

3.2 Filtering Module

Although results are retrieved through focused crawling, a mechanism is needed to verify the relevance of those documents to the required domain of interest. All pages stored within the repository are evaluated and their relevance scores calculated, thereby removing unrelated documents. A weight table containing weights for every term within the ontology is used. In the weight table, weights are assigned to every term in our ontology. Terms common to several domains are assigned a lower weight, whereas terms that are specific to certain domains take a higher weight. The task of assigning the weights is carried out by knowledge consultants. Using this approach, the linked pages found within the page repository are examined and crawled according to their relevance to the domain ontology.
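A minimal sketch of how this weight-table filtering might be realized, assuming the page score is the sum of Eq. 3 values (LW x F) over all matched ontology terms: the sample weights are taken from Table 1, while the class name FilteringModule and the scoring-by-summation choice are assumptions rather than the authors' published implementation.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of the filtering module: score a parsed page against the weight
 *  table (Table 1) using Eq. 3, CWL = LW * F, summed over matched terms. */
public class FilteringModule {
    private final Map<String, Double> weightTable = new HashMap<>();

    public FilteringModule() {
        // Level 1 (root) weight, from Table 1.
        weightTable.put("computer sciences", 0.5);
        // Level 2 subclass weights.
        weightTable.put("information systems", 0.3);
        weightTable.put("software engineering", 0.3);
        weightTable.put("artificial intelligence", 0.3);
        // Level 3 subclass weights (sample).
        weightTable.put("databases", 0.2);
        weightTable.put("operating systems", 0.2);
    }

    /** Sum of LW * F over every ontology term found in the page text. */
    public double relevanceScore(String pageText) {
        String text = pageText.toLowerCase();
        double score = 0.0;
        for (Map.Entry<String, Double> e : weightTable.entrySet()) {
            int freq = countOccurrences(text, e.getKey());
            score += e.getValue() * freq;  // Eq. 3 per term, accumulated
        }
        return score;
    }

    private static int countOccurrences(String text, String term) {
        int count = 0, idx = 0;
        while ((idx = text.indexOf(term, idx)) != -1) {
            count++;
            idx += term.length();
        }
        return count;
    }

    public static void main(String[] args) {
        FilteringModule fm = new FilteringModule();
        String page = "MSc Artificial Intelligence within Computer Sciences; "
                    + "modules include Databases and Operating Systems.";
        System.out.printf("relevance score = %.2f%n", fm.relevanceScore(page));
    }
}
```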

4 IMPLEMENTATION AND EVALUATION

4.1 Experimental Setup

The system has been developed using a Java framework (version JDK 1.8), with NetBeans (version 8.02) utilized as the development tool on a Windows platform, as shown in Fig. 5. The system runs on any standard machine and does not need any specific hardware in order to run effectively. We downloaded approximately 6000 webpages from UCAS and recorded their links in our database. Our experiment focuses on the crawling topic of a "computer sciences course". The reference ontology model is created using the Protégé tool. A web-based user interface is used to prompt the user to give a query, as shown in Fig. 6. Table 2 shows a sample of the information crawled from UCAS for a computer sciences master's course entitled Artificial Intelligence.

Table 2: Sample of results of the WCO crawler for course information.

Feature | Value
Course title | Artificial Intelligence
Course qualification | MSc
Course URL | https://digital.ucas.com/courses/details?coursePrimaryId=7fea172a-efdd-4c02-9950-edf34da09124&academicYearId=2018
Course description | Artificial Intelligence (AI) forms part of many digital systems. No longer is AI seen as a special feature within software, but as an important development expected in modern systems. From word-processing applications to gaming, and from robots to the Internet of Things, AI tends to be responsible for controlling the underlying behaviour of systems. Such trends are forecast to grow further.
University name | University of Aberdeen
Field of study | Information technology
Main subject | Computer science
Major subject | Artificial Intelligence
Course fee | UK £6,300
Course location | University of Aberdeen, King's College, Aberdeen, AB24 3FX
Entry requirement | Our minimum entry requirement for this programme is a 2:2 (lower second class) UK Honours level (or an Honours degree from a non-UK institution which is judged by the University to be of equivalent worth) in the area of Computing Science. Key subjects you must have covered: Java, C, C++, Algorithms, problem solving and Data Structures.


This information, stored in the MySQL database, will be available for use in any research work.

Figure 6: WCO crawler interface.

4.2 Evaluation Metrics

In order to measure the performance of the proposed approach, a harvest rate metric has been used to measure the crawler performance and a recall metric has been used to evaluate the performance of the webpage classifier.

The harvest rate, according to (Pant & Menczer, 2003), is defined as the fraction of the web pages crawled that are relevant to the given topic. This measures how well irrelevant web pages are rejected. The formula is given by

    \text{Harvest rate} = \frac{\sum_{i \in V} r_i}{|V|}    (4)

where V is the set of web pages crawled so far by the focused crawler and r_i is the relevance between web page i and the given topic; the value of r_i can only be 0 or 1: if the page is relevant, then r_i = 1, otherwise r_i = 0. Fig. 7 shows the performance comparison between the proposed ontology-based crawler and a traditional crawler. In Fig. 7, the x-axis denotes the number of crawled web pages and the y-axis denotes the average harvest rate when the number of crawled pages is N. According to Fig. 7, as the number of crawled web pages increases, the number of relevant pages crawled by WCO is higher than that of a simple crawler which does not use ontology in the crawling process. Moreover, the harvest rate of the simple crawler is 0.42 at the point corresponding to 100 crawled web pages, whereas the harvest rate of the WCO is 0.6, which is over 1.3 times larger than that of the simple crawler.

The recall metric is the fraction of relevant pages crawled and is used to measure how well the crawler performs in finding all the relevant webpages. To further understand how we can apply the recall metric, let RS be the relevant set on the web and S_t be the set of the first t pages crawled. The formula of the recall metric is as follows:

    \text{Recall} = \frac{|RS \cap S_t|}{|RS|}    (5)

Fig. 8 presents a performance comparison of the average recall metrics for a simple crawler and the proposed crawler over 5 different topics. The x-axis in Fig. 8 denotes the number of crawled web pages and the y-axis denotes the average recall when the number of crawled pages is N. We observed that the average recall of the WCO is higher than that of the simple crawler. Based on the performance evaluation metrics, the harvest rate and the recall, we can conclude that the WCO has a higher performance level than that of a simple crawler.
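As a concrete reading of Eqs. 4 and 5, the sketch below computes both metrics from labelled crawl results. It illustrates the formulas only and is not the authors' evaluation harness; all names and sample values are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustration of the evaluation metrics: harvest rate (Eq. 4) and recall (Eq. 5). */
public class CrawlMetrics {

    /** Eq. 4: fraction of the crawled pages V that are relevant (r_i in {0, 1}). */
    public static double harvestRate(List<Boolean> relevanceOfCrawled) {
        long relevant = relevanceOfCrawled.stream().filter(r -> r).count();
        return (double) relevant / relevanceOfCrawled.size();
    }

    /** Eq. 5: |RS intersect S_t| / |RS|, where RS is the relevant set on the
     *  web and S_t the set of the first t pages crawled. */
    public static double recall(Set<String> relevantSet, Set<String> crawledSet) {
        Set<String> intersection = new HashSet<>(relevantSet);
        intersection.retainAll(crawledSet);
        return (double) intersection.size() / relevantSet.size();
    }

    public static void main(String[] args) {
        // Hypothetical labels for 5 crawled pages: 3 relevant out of 5.
        System.out.println(harvestRate(Arrays.asList(true, true, false, true, false))); // 0.6

        Set<String> rs = new HashSet<>(Arrays.asList("p1", "p2", "p3", "p4"));
        Set<String> st = new HashSet<>(Arrays.asList("p1", "p3", "p9"));
        System.out.println(recall(rs, st)); // 0.5
    }
}
```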

Figure 7: Performance comparison of the average harvest rate between the proposed ontology-based crawler and a traditional crawler.
Figure 8: Performance comparison of the average recall metric between the proposed ontology-based crawler and a traditional crawler.

5 CONCLUSION


The main aim of our paper is to retrieve relevant web pages and discard those that are irrelevant from an educational web page. We have developed an ontology-based crawler, called WCO, which retrieves web pages according to relevance and which discards the irrelevant web pages using an algorithm. In this study, the concept of ontology provided a similarity calculation over the levels of the concepts in the ontology and the user query, and the relationships between them were used. It is therefore intended that this crawler will not only be useful in exploiting fewer web pages, such that only relevant pages are retrieved, but will also be an important component of the "Semantic Web", an emerging concept for future technology. The evaluation results show that the ontology-based crawler offers a higher performance than that of a traditional crawler. This improved crawler can also be applied to areas such as recruitment portals, online music libraries and so forth.

REFERENCES

Chang, P.C., Lin, C.H. and Chen, M.H., 2016. A hybrid course recommendation system by integrating collaborative filtering and artificial immune systems. Algorithms, vol. 9, no. 3.

Fahad, A., Alshatri, N., Tari, Z., Alamri, A., Khalil, I., Zomaya, A.Y., Foufou, S. and Bouras, A., 2014. A survey of clustering algorithms for big data: Taxonomy and empirical analysis. IEEE Transactions on Emerging Topics in Computing, vol. 2, no. 3, pp. 267–279.

Ganesh, S., Jayaraj, M., SrinivasaMurthy, V.K. and Aghila, G., 2004. Ontology-based web crawler. In International Conference on Information Technology: Coding and Computing (ITCC), vol. 2, pp. 337–341.

Gauch, S., Chaffee, J. and Pretschner, A., 2003. Ontology-based personalized search and browsing. Web Intelligence and Agent Systems: An International Journal, vol. 1, pp. 219–234.

Group, D.E., 2009. Leveraging Content from Open Corpus Sources for Technology Enhanced Learning. October, viewed 12 November 2018,
