Context Based Web Indexing for Storage of Relevant Web Pages
Total Page:16
File Type:pdf, Size:1020Kb
International Journal of Computer Applications (0975 – 8887) Volume 40– No.3, February 2012 Context based Web Indexing for Storage of Relevant Web Pages Nidhi Tyagi Rahul Rishi R.P. Agarwal Asst. Prof. Prof. & Head Prof. & Head Shobhit University, Technical Institute of Textile & Shobhit University, Meerut Sciences, Bhiwani Meerut ABSTRACT or 1. This strategy makes the searching task faster and A focused crawler downloads web pages that are relevant to a optimized. user specified topic. The downloaded documents are indexed with a view to optimize speed and performance in finding 2. RELATED WORK relevant documents for a search query at the search engine In this section, a review of previous work on index side. However, the information will be more relevant if the organization is given. In this field of index organization and context of the topic is also made available to the retrieval maintenance, many algorithms and techniques have already system. This paper proposes a technique for indexing the been proposed but they seem to be less efficient in efficiently keyword extracted from the web documents along with their accessing the index. contexts wherein it uses a height balanced binary search F. Silvestri, R.Perego and Orlando [4] proposed the (AVL) tree, for indexing purpose to enhance the performance reordering algorithm which partitions the set of documents of the retrieval system. into k ordered clusters on the basis of similarity measure where the biggest document is selected as centroid of the first General Terms cluster and n/k1 most similar documents are assigned to this Algorithm, retrieval, indexer. cluster. Then the biggest document is selected and the same process repeats. The process keeps on repeating until all the k Keywords clusters are formed and each cluster gets completed with n/k documents. This algorithm is not effective in clustering the AVL tree, contextual, repository, balance factor. most similar documents. Oren Zamir and Oren Etzioni., [5] proposed threshold 1. INTRODUCTION based clustering algorithm. Initially, the number of clusters is With the rapid growth of the Internet, the World Wide Web unknown, based on the specified threshold value the similarity (WWW) has become one of the most important resources for between the two documents is evaluated and they are obtaining information and one of the most important media of classified to the same cluster. If the threshold is small; all the communication. The basic aim is to select the best collection elements will get assigned to different clusters. If the of information according to users need. The existing focused threshold is large, the elements may get assigned to just one crawlers [1, 2] adopt different strategies for computing the cluster. Thus the algorithm is sensitive to specification of words’ frequency in the web documents. If higher frequency threshold. words match with the topic keyword, then the document is considered to be relevant. But they generally do not analyze C. Zhou, W. Ding and Na Yang [6], the paper introduces a the context of the keyword in the web page before they double indexing mechanism for search engines based on download it. The subject of context has received a great deal campus Net. The CNSE consists of crawl machine, Chinese of attention in the information recovery literature for seeking automatic segmentation, index and search machine. The relevant information [9]. Exploring the contents of the web proposed mechanism has document index as well as word pages for automatic indexing is of fundamental importance for index. The document index is based on, where the documents various web applications. The purpose of storing an index is do the clustering, and ordered by the position in each to optimize speed and performance in finding relevant document. During the retrieval, the search engine first gets the documents for a search query. Without an index, the search document id of the word in the word index, and then goes to engine would scan every document in the corpus, which the position of corresponding word in the document index. would require considerable time and computing power. For Because in the document index, the word in the same example, while an index of 10,000 documents can be queried document is adjacent, the search engine directly compares the within milliseconds, a sequential scan of every word in largest word matching assembly with the sentence that users documents is a time consuming task. The additional computer submit. The mechanism proposed, seems to be time storage required to store the index, as well as the considerable consuming as the index exists at two levels. increase in the time required for an update to take place, are N. Chauhan and A. K. Sharma [7] proposed, the context traded off for the time saved during information retrieval. driven focused crawler (CDFC) that searches and downloads In AVL (i.e. height balanced binary tree) tree [3], the height only highly relevant web pages, thus, reducing the network of a tree is defined as the length of the longest path from the traffic. A category tree has been used, which provides root node of the tree to one of its leaf node, and the balance flexibility to the user for interacting with the system showing factor (BF) is: (height of left subtree – height of right subtree). the broad categories of the topics on the web. The proposed To call the AVL as balanced the value of BF should be -1, 0 1 International Journal of Computer Applications (0975 – 8887) Volume 40– No.3, February 2012 design significantly reduces the storage space at the search engine side. P. Gupta and A. K. Sharma [8], worked on context based Crawled indexing in search engines using ontology. The index WWW web construction is done on the basis of the context using pages ontology. The context repository, thesaurus and ontology repository are used by the indexer to identify the context of the document. The critical look at the available literature reveals that there is a requirement for a technique to organize the keyword and Pre- their contexts in a better fashion as storing in a linear fashion makes searching of a document a bit time consuming. processor 3. PROPOSED WORK This paper proposes an algorithm for indexing the keyword extracted from the web documents along with their context. keywords The indexing technique uses a height balanced binary search Thesaurus (AVL) tree, in addition to improved performance in the Generate retrieval of information, this data structure is able to support dynamic indexing, which is especially important for context environments where documents are changed frequently. If the query planning about the arrangement of the keywords is done then user AVL tree can be achieved. The need to develop a technique interface that maintains a balance between the height of the left and context right sub-trees of binary tree arises for the faster search of documents stored in the database according the their keywords. AVL Query Architecture of context based indexing is represented in processor tree Figure1. The web pages are stored in the crawled web page repository. Each web document is identified by its document- id. The preprocessing steps are performed on the documents (i.e. stemming as well as removal of stop words). The frequently occurring keywords are extracted from the Keywords context Doc_id document, and their corresponding multiple contexts are identified from the thesaurus. Indexer maintains the index of the keyword using the AVL tree. The user submits the query through the search engine interface and the relevant information is made available to the user after searching the results in the index. Figure 1: Architecture of context based indexing. Figure 2 depicts the Context Based Retrieval Interface. The user enters the desired keyword (say Current), on the click of show context, the different contextual meanings of the keywords are displayed. The user selects the desired context of the keyword, through the show document button, the corresponding related web page URLs are listed (available in the repository) which can help the user to directly access more related and relevant information. This technique provides a fast access to document context and structure along with an optimized searching. 2 International Journal of Computer Applications (0975 – 8887) Volume 40– No.3, February 2012 Leftchild keyword right child link Document-Ids Context1 Context2 Context3 Figure 2: Context Based Retrieval Interface Figure 4: Node structure. 3.1 Proposed algorithm for the indexing scheme 3.2 Steps are required to construct AVL The algorithm depicted in figure 3 shows the various steps in tree the construction of the context based index and hence context (i) Creation of binary search tree: This module creates a based searching. node corresponding to the various keywords and adds the related information in the node and finds suitable location in the binary tree. Step1: Preprocess the crawled web documents and extract the keyword along with their frequency of occurrence. Create_BST() //initially the tree is empty. Step2: Input the keywords to the context generator { which extracts the multiple contextual sense of the word. create new node containing the fields ( left child, Context is being searched in the thesaurus (a dictionary keyword, rightchild, link). of words available on WWW from thesaurus.com, which Leftchild value = NULL contains the words as well their multiple meanings). Rightchild value = NULL Step3: The keywords along with the context are indexed Link = address of database where the context using the AVL tree. and the corresponding document_id is stored Step4: Compare the entered keyword with the node’s Insert_node(); keyword field of the AVL tree, until a similar word is } found. Insert_node() Step5: If search is not a success, create a node { containing the following fields (Leftchild, keyword, Check, whether value in current node and a new rightchild, link) as shown in figure4.The link is pointer keyword value are equal.