International Journal of Computer Applications (0975 – 8887) Volume 40– No.3, February 2012

Context based Web Indexing for Storage of Relevant Web Pages

Nidhi Tyagi, Asst. Prof., Shobhit University, Meerut
Rahul Rishi, Prof. & Head, Technical Institute of Textile & Sciences, Bhiwani
R.P. Agarwal, Prof. & Head, Shobhit University, Meerut

ABSTRACT
A focused crawler downloads web pages that are relevant to a user specified topic. The downloaded documents are indexed to optimize speed and performance in finding relevant documents for a search query at the search engine side. However, the information will be more relevant if the context of the topic is also made available to the retrieval system. This paper proposes a technique for indexing the keywords extracted from the web documents along with their contexts. It uses a height balanced binary search (AVL) tree for indexing, to enhance the performance of the retrieval system.

General Terms
Algorithm, retrieval, indexer.

Keywords
AVL tree, contextual, repository, balance factor.

1. INTRODUCTION
With the rapid growth of the Internet, the World Wide Web (WWW) has become one of the most important resources for obtaining information and one of the most important media of communication. The basic aim is to select the best collection of information according to the user's need. The existing focused crawlers [1, 2] adopt different strategies for computing the frequency of words in the web documents. If the higher frequency words match the topic keyword, then the document is considered to be relevant. But they generally do not analyze the context of the keyword in the web page before they download it. The subject of context has received a great deal of attention in the information retrieval literature for seeking relevant information [9]. Exploring the contents of the web pages for automatic indexing is of fundamental importance for various web applications. The purpose of storing an index is to optimize speed and performance in finding relevant documents for a search query.

Without an index, the search engine would scan every document in the corpus, which would require considerable time and computing power. For example, while an index of 10,000 documents can be queried within milliseconds, a sequential scan of every word in those documents is a time consuming task. The additional computer storage required to store the index, as well as the considerable increase in the time required for an update to take place, are traded off for the time saved during information retrieval.

In an AVL tree (i.e. a height balanced binary search tree) [3], the height of a tree is defined as the length of the longest path from the root node of the tree to one of its leaf nodes, and the balance factor (BF) of a node is: (height of left subtree – height of right subtree). For the tree to be called balanced, the value of BF at every node should be -1, 0 or 1. This strategy makes the searching task faster and optimized.

2. RELATED WORK
In this section, a review of previous work on index organization is given. In this field of index organization and maintenance, many algorithms and techniques have already been proposed, but they seem to be less efficient in accessing the index.

F. Silvestri, R. Perego and S. Orlando [4] proposed the reordering algorithm, which partitions the set of n documents into k ordered clusters on the basis of a similarity measure. The biggest document is selected as the centroid of the first cluster and the n/k − 1 most similar documents are assigned to this cluster. Then the biggest remaining document is selected and the same process repeats. The process keeps on repeating until all the k clusters are formed and each cluster is completed with n/k documents. This algorithm is not effective in clustering the most similar documents.

Oren Zamir and Oren Etzioni [5] proposed a threshold based clustering algorithm. Initially, the number of clusters is unknown. Based on the specified threshold value, the similarity between two documents is evaluated and they are classified to the same cluster. If the threshold is small, all the elements will get assigned to different clusters. If the threshold is large, the elements may get assigned to just one cluster. Thus the algorithm is sensitive to the specification of the threshold.

C. Zhou, W. Ding and Na Yang [6] introduce a double indexing mechanism for a search engine based on campus net (CNSE). The CNSE consists of a crawl machine, Chinese automatic segmentation, an index and a search machine. The proposed mechanism has a document index as well as a word index. The document index is based on clustering of the documents and is ordered by the position of the words in each document. During retrieval, the search engine first gets the document id of the word from the word index, and then goes to the position of the corresponding word in the document index. Because the words of the same document are adjacent in the document index, the search engine directly compares the largest word matching assembly with the sentence that the user submits. The proposed mechanism seems to be time consuming, as the index exists at two levels.

N. Chauhan and A. K. Sharma [7] proposed the context driven focused crawler (CDFC), which searches and downloads only highly relevant web pages, thus reducing the network traffic. A category tree has been used, which provides flexibility to the user for interacting with the system by showing the broad categories of the topics on the web. The proposed design significantly reduces the storage space at the search engine side.

P. Gupta and A. K. Sharma [8] worked on context based indexing in search engines using ontology. The index construction is done on the basis of the context using ontology. The context repository, thesaurus and ontology repository are used by the indexer to identify the context of the document.

A critical look at the available literature reveals that there is a requirement for a technique to organize the keywords and their contexts in a better fashion, as storing them in a linear fashion makes searching for a document time consuming.

3. PROPOSED WORK
This paper proposes an algorithm for indexing the keywords extracted from the web documents along with their context. The indexing technique uses a height balanced binary search (AVL) tree. In addition to improved performance in the retrieval of information, this data structure is able to support dynamic indexing, which is especially important for environments where documents are changed frequently. If the arrangement of the keywords is planned, an AVL tree can be achieved. A technique that maintains a balance between the heights of the left and right sub-trees of the binary tree is needed for faster search of the documents stored in the database according to their keywords.

The architecture of context based indexing is represented in Figure 1. The web pages are stored in the crawled web page repository. Each web document is identified by its document-id. The preprocessing steps (i.e. stemming as well as removal of stop words) are performed on the documents. The frequently occurring keywords are extracted from the document, and their corresponding multiple contexts are identified from the thesaurus. The indexer maintains the index of the keywords using the AVL tree. The user submits the query through the search engine interface, and the relevant information is made available to the user after searching the index.
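The preprocessing and context generation stages described above can be illustrated with a minimal Python sketch. The stop-word list, the crude suffix-stripping stemmer, the tiny in-memory thesaurus and the function names are all assumptions made for illustration; the paper does not specify them, and a real system would use a proper stemmer and a full thesaurus.

```python
import re
from collections import Counter

# Hypothetical stop-word list and thesaurus (illustrative only).
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "is", "to", "from"}
THESAURUS = {"current": ["electric", "present", "water-flow"]}


def stem(word):
    """Crude suffix stripping, standing in for a real stemming algorithm."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def extract_keywords(text, top_n=3):
    """Return the top_n most frequent stemmed, non-stop-word terms with counts."""
    words = re.findall(r"[a-z]+", text.lower())
    stems = [stem(w) for w in words if w not in STOP_WORDS]
    return Counter(stems).most_common(top_n)


def contexts_for(keyword):
    """Look up the contextual senses of a keyword in the thesaurus."""
    return THESAURUS.get(keyword, [])


doc = "The current flows in the wire. Currents from the river and the current news."
keywords = extract_keywords(doc)
print(keywords[0], contexts_for(keywords[0][0]))
```

Here the most frequent keyword of the sample text is looked up in the thesaurus, which is the information the indexer would then store for that keyword.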

Figure 1: Architecture of context based indexing.

Figure 2 depicts the Context Based Retrieval Interface. The user enters the desired keyword (say Current). On the click of the show context button, the different contextual meanings of the keyword are displayed. The user then selects the desired context of the keyword. Through the show document button, the URLs of the corresponding web pages available in the repository are listed, which helps the user to directly access more related and relevant information. This technique provides fast access to document context and structure along with optimized searching.
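The two interface actions can be sketched as two lookups. This is only an illustration: the function names and the dictionary layout are assumptions, and the contexts and document ids are the ones given for the keyword Current in the example of Section 4.

```python
# Keyword -> context -> document ids (values taken from the paper's example
# for the keyword "Current"; the dictionary layout itself is an assumption).
INDEX = {
    "current": {
        "electric": [2, 3],
        "present": [5, 7, 9],
        "water-flow": [1, 2, 6],
    }
}


def show_context(keyword):
    """Contexts offered to the user after clicking 'show context'."""
    return sorted(INDEX.get(keyword.lower(), {}))


def show_documents(keyword, context):
    """Document ids listed after the user picks a context."""
    return INDEX.get(keyword.lower(), {}).get(context.lower(), [])


print(show_context("Current"))
print(show_documents("Current", "Present"))
```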


Figure 2: Context Based Retrieval Interface

3.1 Proposed algorithm for the indexing scheme
The algorithm depicted in figure 3 shows the various steps in the construction of the context based index and hence context based searching.

Step1: Preprocess the crawled web documents and extract the keywords along with their frequency of occurrence.
Step2: Input the keywords to the context generator, which extracts the multiple contextual senses of each word. The context is searched in the thesaurus (a dictionary of words available on the WWW from thesaurus.com, which contains the words as well as their multiple meanings).
Step3: The keywords along with the context are indexed using the AVL tree.
Step4: Compare the entered keyword with the keyword field of the nodes of the AVL tree, until a similar word is found.
Step5: If the search is not a success, create a node containing the following fields (Leftchild, keyword, rightchild, link) as shown in figure 4. The link is a pointer variable which points to the database where the context of the keyword and the corresponding document_id are stored.
Step6: Arrange the node in the AVL tree according to the BF.
Step7: Repeat steps 4, 5 and 6 until all the extracted keywords are arranged.
Step8: When the user fires the query with the context explicitly specified, the index is searched. Because the tree is height balanced, a lookup takes O(log n) comparisons rather than the O(n) of a linear search.
Step9: Thus, the AVL indexing technique provides fast access to document context and structure.

Figure 3: Steps involved in the construction of the context based index using AVL.

[ Leftchild | keyword | rightchild | link ]
link points to: Document-Ids for Context1, Context2, Context3

Figure 4: Node structure.

3.2 Steps required to construct the AVL tree
(i) Creation of binary search tree: This module creates a node corresponding to each keyword, adds the related information to the node and finds a suitable location in the binary tree.

Create_BST() //initially the tree is empty.
{
    create new node containing the fields (leftchild, keyword, rightchild, link).
    Leftchild value = NULL
    Rightchild value = NULL
    Link = address of database where the context and the corresponding document_id are stored
    Insert_node();
}

Insert_node()
{
    Check whether the value in the current node and the new keyword value are equal.
    If so, a duplicate is found.
    Otherwise,
        if the new keyword value is less than the current node's value:
            if the current node has no left child, the place for insertion has been found;
            otherwise, handle the left child with the same algorithm.
            Compute_height();
        if the new keyword value is greater than the current node's value:
            if the current node has no right child, the place for insertion has been found;
            otherwise, handle the right child with the same algorithm.
            Compute_height();
}

(ii) Insertion of a node may cause an imbalance in a subtree, ultimately leading to an imbalance of the whole tree. For the AVL tree to be called balanced, the value of BF should be -1, 0 or 1.
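The node creation and insertion steps can be sketched in Python as follows. This is a minimal sketch of plain binary search tree insertion (no rebalancing yet); the class and function names are illustrative, and the link field is assumed to hold whatever reference the context and document-id store uses. The keywords are those of Table 1.

```python
class Node:
    """One index node: (leftchild, keyword, rightchild, link)."""

    def __init__(self, keyword, link=None):
        self.left = None
        self.keyword = keyword
        self.right = None
        self.link = link  # stands in for the pointer to the context/doc-id record


def insert_node(current, keyword, link=None):
    """Insert keyword below `current` and return the (possibly new) subtree root."""
    if current is None:
        return Node(keyword, link)       # place for insertion has been found
    if keyword == current.keyword:
        return current                   # duplicate: nothing to insert
    if keyword < current.keyword:
        current.left = insert_node(current.left, keyword, link)
    else:
        current.right = insert_node(current.right, keyword, link)
    return current


def inorder(node):
    """Keywords in sorted order, which confirms the search-tree property."""
    if node is None:
        return []
    return inorder(node.left) + [node.keyword] + inorder(node.right)


keywords = ["crawler", "apple", "repository", "current", "category",
            "ant", "account", "agent", "automatic"]
root = None
for kw in keywords:
    root = insert_node(root, kw)
print(inorder(root))
```

Without rebalancing, the shape of the tree depends entirely on the insertion order, which is why the height check and rotations of the following steps are needed.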


Compute_height()
{
    If child = NULL then height is 0.
    The height of tree = 1 + max(height of left subtree, height of right subtree)
}
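A small Python sketch of the height computation and the resulting balance factor is given below. The Node class and the two sample trees are assumptions for illustration; the height convention (an empty subtree has height 0) follows Compute_height.

```python
class Node:
    def __init__(self, keyword, left=None, right=None):
        self.keyword = keyword
        self.left = left
        self.right = right


def compute_height(node):
    """Height of the subtree rooted at node; an empty subtree has height 0."""
    if node is None:
        return 0
    return 1 + max(compute_height(node.left), compute_height(node.right))


def balance_factor(node):
    """BF = height of left subtree - height of right subtree."""
    return compute_height(node.left) - compute_height(node.right)


# A balanced subtree and a skewed one (ant -> account -> agent).
balanced = Node("apple", Node("ant"), Node("category"))
skewed = Node("ant", left=Node("account", right=Node("agent")))

print(balance_factor(balanced), balance_factor(skewed))
```

The skewed subtree has a balance factor outside the range -1 to 1, so it would have to be rearranged.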

(iii) The rearrangement of the nodes can eliminate the imbalance.

Arrange_tree()
{
    Case1: Insertion of a node in the left subtree of the left child of the pivot.
           Perform rotation toward right of pivot.
    Case2: Insertion of a node in the right subtree of the right child of the pivot.
           Perform rotation toward left of pivot.
    Case3: Insertion of a node in the right subtree of the left child of the pivot.
           Perform left-right rotation from pivot.
    Case4: Insertion of a node in the left subtree of the right child of the pivot.
           Perform right-left rotation from pivot.
}

4. EXAMPLE
Consider the set of keywords (Table 1) extracted from the various web documents with Document_ID (1 – 9).

Table 1: List of keywords extracted from the various web documents

Keywords      Document_ID
Crawler       1, 2, 3, 8
Apple         1, 4, 5
Repository    1, 2, 3, 9
Current       1, 2, 3, 5, 6, 7, 9
Category      2, 3
Ant           2, 3, 5
Account       1, 2, 3, 5
Agent         1, 2, 3, 5, 6
Automatic     1, 2, 5, 6

The data given in Table 1 would be represented by a BST as shown in figure 5.

Figure 5: Representation of keywords using binary search tree

However, the proposed AVL tree indexing will represent the given keywords as shown in figure 6. In figure 6, the node for the keyword Current links to the following contexts and document ids:

Context       Doc_ids
Electric      2, 3
Present       5, 7, 9
Water-flow    1, 2, 6

Figure 6: Representation of keywords using AVL tree along with context and doc-id.
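The four rearrangement cases and the example can be tried out with the following Python sketch, which inserts the keywords of Table 1 (lower-cased, in table order) into an AVL tree. It is a standard textbook AVL insertion written for illustration, not the authors' implementation; the names are assumptions, and the resulting tree shape depends on insertion order, so it need not match figure 6 exactly.

```python
class Node:
    def __init__(self, keyword, doc_ids):
        self.keyword, self.doc_ids = keyword, doc_ids
        self.left = self.right = None
        self.height = 1


def height(node):
    return node.height if node else 0


def update(node):
    node.height = 1 + max(height(node.left), height(node.right))


def balance_factor(node):
    return height(node.left) - height(node.right)


def rotate_right(pivot):
    new_root = pivot.left
    pivot.left, new_root.right = new_root.right, pivot
    update(pivot)
    update(new_root)
    return new_root


def rotate_left(pivot):
    new_root = pivot.right
    pivot.right, new_root.left = new_root.left, pivot
    update(pivot)
    update(new_root)
    return new_root


def insert(node, keyword, doc_ids):
    if node is None:
        return Node(keyword, doc_ids)
    if keyword < node.keyword:
        node.left = insert(node.left, keyword, doc_ids)
    elif keyword > node.keyword:
        node.right = insert(node.right, keyword, doc_ids)
    else:
        return node                              # duplicate keyword
    update(node)
    bf = balance_factor(node)
    if bf > 1:                                   # left subtree too tall
        if balance_factor(node.left) < 0:        # Case 3: left-right rotation
            node.left = rotate_left(node.left)
        return rotate_right(node)                # Case 1: rotate right
    if bf < -1:                                  # right subtree too tall
        if balance_factor(node.right) > 0:       # Case 4: right-left rotation
            node.right = rotate_right(node.right)
        return rotate_left(node)                 # Case 2: rotate left
    return node


def search(node, keyword):
    """Return the doc ids stored for keyword, or None if it is not indexed."""
    while node is not None and node.keyword != keyword:
        node = node.left if keyword < node.keyword else node.right
    return node.doc_ids if node else None


TABLE_1 = [("crawler", [1, 2, 3, 8]), ("apple", [1, 4, 5]),
           ("repository", [1, 2, 3, 9]), ("current", [1, 2, 3, 5, 6, 7, 9]),
           ("category", [2, 3]), ("ant", [2, 3, 5]), ("account", [1, 2, 3, 5]),
           ("agent", [1, 2, 3, 5, 6]), ("automatic", [1, 2, 5, 6])]
root = None
for kw, ids in TABLE_1:
    root = insert(root, kw, ids)
print(root.keyword, root.height, search(root, "current"))
```

With this insertion order, only the insertion of agent triggers a rearrangement (the left-right case at ant); the balanced tree has four levels, whereas a plain binary search tree built in the same order has five.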

A comparison of the representations in figure 5 and figure 6 shows that the proposed representation not only increases the search efficiency. Each node of the tree also stores the contextual words of its keyword, leading to a higher relevance of the corresponding documents.


5. CONCLUSION
This paper proposes a technique for indexing the keywords extracted from the web documents along with their context. The AVL tree based indexing technique is able to support dynamic indexing and improves the performance in terms of accuracy and efficiency for retrieving more relevant documents as per the user's requirements, since the context of the various keywords is also stored along with them. Thus, the indexing technique provides fast access to document context and structure along with optimized searching.

6. REFERENCES
[1] Diligenti M., Coetzee F.M., Lawrence S., Giles C.L. and Gori M., "Focused Crawling using context graphs", Proc. International Conference on Very Large Databases (VLDB '00), pp. 527-534, 2000.
[2] Yang Yongsheng and Wang Hui, "Implementation of Focused Crawler", COMP630D Course Project Report.
[3] A.K. Sharma, "Data Structures using C", Pearson publication, 2011.
[4] Fabrizio Silvestri, Raffaele Perego and Salvatore Orlando, "Assigning Document Identifiers to Enhance Compressibility of Web Search Engines Indexes", Proceedings of SAC, 2004.
[5] Oren Zamir and Oren Etzioni, "Web Document Clustering: A feasibility demonstration", Proceedings of SIGIR, 1998.
[6] Changshang Zhou, Wei Ding and Na Yang, "Double Indexing Mechanism of Search Engine based on Campus Net", Proceedings of the 2006 IEEE Asia-Pacific Conference on Services Computing (APSCC'06), 2006.
[7] Naresh Chauhan and A. K. Sharma, "Design of an Agent Based Context Driven Focused Crawler", BVICAM'S International Journal of Information Technology, pp 61-66, 2008.
[8] Parul Gupta and A.K. Sharma, "Context based Indexing in Search Engines using Ontology", International Journal of Computer Applications, Volume 1 No. 14, pp 49-52, 2010.
[9] Steve Lawrence, "Context in Web Search", IEEE Data Engineering Bulletin, 2000.
[10] Wang Jicheng, Huang Yuan, Wu Gangshan and Zhang Fuyan, "Web Mining: Knowledge Discovery on the Web", IEEE International Conference, Tokyo, 1999.
[11] O. Zamir, O. Etzioni, O. Madani and R.M. Karp, "Fast and Intuitive Clustering of Web Documents", Proceeding Third International Conference Knowledge Discovery and Data Mining, pp. 287-290, Aug. 1997.
