Advances in Computer Science and Information Technology (ACSIT)
p-ISSN: 2393-9907; e-ISSN: 2393-9915; Volume 4, Issue 3; April-June, 2017, pp. 138-141
© Krishi Sanskriti Publications
http://www.krishisanskriti.org/Publication.html

Deep Web Query Translation Based on Indexing using Ontology: Review

Sunaina1, Varsha Rathi2 and N N Das3
1M.Tech Scholar, Computer Science & Engineering, RIET, India
2,3Computer Science & Engineering, RIET, India
E-mail: [email protected], [email protected], [email protected]

Abstract—Nowadays, the web is a collection of a large amount of information, which is increasing day by day. In this paper the authors also describe the technologies in deep web data integration that are used in this system. The paper presents a unified platform in which users input and submit queries. The system then automatically processes the query, checks the query attributes across different websites, and returns detailed results that the user would be concerned with. The information on the Surface Web is supported by the Deep Web, which cannot be accessed directly by search engines or web crawlers. The only way to access the backend database is through a query interface. Automatically extracting query attributes from the query and automatically translating the source queries into target queries is an efficient way of addressing the current limitations in accessing Deep Web data sources. We use WordNet as a kind of ontology technique to access the attributes contained in the semantic form. Many deep Web sources are structured, providing structured query interfaces and results. Classifying such structured sources into domains is therefore one of the important steps toward the integration of heterogeneous Web sources. In this paper, we present an Ontology-based deep Web classification, which includes a classified ontology model.

1. INTRODUCTION

Deep web data integration provides a global interface, thus allowing users to access the data contained in the server side of websites that require users to log in or submit a query. It is time consuming for users to submit a query for a book time and time again in different websites and compare the results of each response page. The system reviewed here tries to simplify this procedure and make book information retrieval convenient. After the name of a book is entered and the query is submitted, a set of results is shown: library holdings from each local online library, prices from trusted online bookstores, and popular book reviews from online communities. The results are presented to the user clearly and comprehensively after further classification and merging.
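As a rough sketch of this unified query flow, the Python fragment below fans a single book query out to several sources and merges the responses into one list. Everything in it is illustrative: the source names, the wrapper functions, and the record fields are hypothetical stand-ins for the site-specific wrappers that a real integration system would maintain.

```python
# Minimal sketch of the unified book search; every source is hypothetical.
from dataclasses import dataclass
from typing import Optional

@dataclass
class BookResult:
    title: str
    source: str                      # which deep-web site answered
    price: Optional[float] = None    # bookstore offer
    holdings: Optional[str] = None   # library availability
    review: Optional[str] = None     # community review snippet

def search_library(title: str) -> list:
    # Stand-in for a wrapper that fills in and submits a library query form.
    return [BookResult(title, "CityLibraryExample", holdings="2 copies on shelf")]

def search_bookstore(title: str) -> list:
    # Stand-in for a wrapper around an online bookstore's search form.
    return [BookResult(title, "BookShopExample", price=12.99)]

def search_reviews(title: str) -> list:
    # Stand-in for a wrapper around a review community's search form.
    return [BookResult(title, "ReaderForumExample", review="4.5/5 stars")]

def unified_search(title: str) -> list:
    """Submit one query to every registered source and merge the answers."""
    results = []
    for wrapper in (search_library, search_bookstore, search_reviews):
        results.extend(wrapper(title))
    # Sort so holdings, prices and reviews for the same title sit together.
    return sorted(results, key=lambda r: (r.title, r.source))

if __name__ == "__main__":
    for result in unified_search("Introduction to Information Retrieval"):
        print(result)
```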
The deep web[3] databases require user query interfaces and dynamic programs to access their contents, thus preventing Web crawlers from automatically extracting and indexing their contents, and therefore keeping them out of search engine results. To integrate the resources of the deep web, we have to find the accessible query interfaces and integrate them.

In a domain-based integrated system, deep Web classification is very important. The number of deep Web sources in the integration changes dynamically every day. As new sources are added, an automated classification is required in the deep web. Furthermore, the classification also provides a facility to organize the large number of deep Web sites. In deep web classification, existing works mainly focus on classifying texts or Web documents, while there is little work on the deep Web itself. Bin He proposes[9] a clustering approach to organize structured web sources, based on hypothetical hidden models for homogeneous sources that may not exist in the real world. Ngu proposes[10] two different approaches for learning the data types of a class of Web sources. In this paper a deep Web category ontology model is proposed.

1.1 Introduction to Deep Web

Deep web data integration has become a main issue in recent years and an important field in the data integration domain. There are plenty of web databases in the deep web, which hold much more information than the surface web and are significant sources of all kinds of information. Deep web data integration offers a global interface, thus allowing users to access the data contained in the server side of websites that require users to log in or submit a query.

The Deep Web, also known as the "Deep net," the "Invisible Web," the "Undernet" or the "hidden Web," consists of the parts of the Internet that are not considered part of the "surface web," the portion of the World Wide Web that is indexed by conventional search engines. Many deep web sites are not indexed because they use dynamic databases that are devoid of hyperlinks and can only be reached by performing an internal search query.
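To make the access restriction concrete, the fragment below sketches how a program, rather than a crawler, reaches such a database: it submits the site's search form directly. The endpoint URL and form field names are purely hypothetical; a real deep-web wrapper has to discover the query interface of each individual site.

```python
# Sketch of programmatic access to a deep-web query interface. The URL and
# form field names are hypothetical placeholders, not a real service.
import requests

def query_book_database(book_title: str) -> str:
    # No static hyperlink leads to this data, so a surface-web crawler never
    # reaches it; the search form is the only entry point.
    response = requests.post(
        "https://example.org/books/search",             # hypothetical endpoint
        data={"title": book_title, "max_results": 10},  # hypothetical fields
        timeout=10,
    )
    response.raise_for_status()
    return response.text  # dynamically generated result page, never indexed

if __name__ == "__main__":
    page = query_book_database("The Deep Web")
    print(page[:200])
```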

1.2 Introduction to Ontology

Ontology[11] is the study of the kinds of things that exist; the word means "theory of existence". An ontology is a representation vocabulary, often specialized to some domain, typically a common-sense knowledge domain. It forms the knowledge representation for that domain. Different types of ontology components can be defined, such as concepts and instances. The concept is the main component of an ontology and can be defined in different manners:

Textual definition: the concept "parrot" is defined by a sentence such as "an individual animal" of the kind Bird.

Logical definition using a formula: "bird" is defined by a formula such as "Living entity ∪ Nonliving entity".

Set of properties: a concept "Bird" can have properties like "type", "color" and "food". Finally, a concept can also be explained by the set of its instances, e.g., the set of individual birds.

The concept of ontologies has contributed to the development of the Semantic Web, an extension of the current World Wide Web in which information is given a well-defined meaning that turns the given unstructured data into a knowledge representation. In other words, the Semantic Web is information that is machine understandable. It allows users to extract web pages according to context rather than by keyword matching, in order to retrieve the web documents relevant to the user's query.

2. RELATED WORK

2.1. Wang Xiaoyu, Cui Xiangyang, Chen Deyun, Jiang Feng[1]

In this paper the authors proposed a system for book information searching, with the purpose of providing clearly classified information including prices in online bookstores, holdings of local libraries, and book reviews. Users no longer need to submit queries for a book time and time again in many different websites. The paper also describes the technologies in deep web data integration that are used in this system.

2.2. Hao Liang, Wanli Zuo, Fei Ren, Chong Son[2]

In this paper the authors proposed an attribute search-driven mechanism in which the most important factors are the attributes and the semantic relations between them. The authors try to extract abundant attributes, which describe the concept, and the relationships between the sets of attributes of the same search form and even of different forms. The most efficient and effective technique for detecting the semantic relation between words is WordNet[3]. Each attribute is extended into a concept set, which is used for matching attributes. The framework takes a source query form and a target query form as inputs and outputs a query for the target form. During the translation, attributes are first extracted from the query forms and the semantic relations between them are found; the attributes are then composed according to the web semantic restrictions, and finally the query is rewritten for the target form.
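As an illustration of this WordNet step, the sketch below uses NLTK's WordNet interface to extend two attribute labels into concept sets and to score their semantic relatedness with Wu-Palmer similarity. The attribute pairs and the 0.8 threshold are illustrative choices, not values taken from the reviewed paper.

```python
# Sketch: matching query-form attributes through WordNet concept sets.
# Requires NLTK; run once: import nltk; nltk.download("wordnet")
from nltk.corpus import wordnet as wn

def concept_set(attribute: str) -> set:
    """Extend an attribute label into the set of its WordNet synonyms."""
    synonyms = {attribute.lower()}
    for synset in wn.synsets(attribute):
        synonyms.update(lemma.name().lower() for lemma in synset.lemmas())
    return synonyms

def attributes_match(a: str, b: str, threshold: float = 0.8) -> bool:
    """Match if the concept sets overlap, or if some pair of senses is
    close in the WordNet hierarchy (Wu-Palmer similarity)."""
    if concept_set(a) & concept_set(b):
        return True
    scores = [s1.wup_similarity(s2) or 0.0
              for s1 in wn.synsets(a) for s2 in wn.synsets(b)]
    return bool(scores) and max(scores) >= threshold

if __name__ == "__main__":
    for a, b in [("author", "writer"), ("price", "cost"), ("title", "color")]:
        print(a, b, attributes_match(a, b))
```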
2.3. He-Xiang Xu, Xiu-Lan Hao, Shu-Yun Wang, Yun-Fa Hu[3]

A deep Web source is by nature a dynamic web site, which includes a query form, a back-end database, and query results. Due to the relation between a deep Web source and its interface schema, the classification in this paper can be converted into interface schema classification. For this, three definitions are given: Deep Web Source, Interface Schema, and Deep Web Integration.

2.4. Ying Xie, Wanli Zuo, Fengling He, Ying Wang[4]

Automatic Deep Web query result extraction is a key step in Deep Web query result processing. In this paper, a simple method for extracting Deep Web query results automatically based on tag trees is proposed, according to the features of Deep Web query result pages. The method first builds a tag tree of the given result page, then finds minimal data regions in the tag tree from top to bottom and extracts the data records they contain. Experiments have shown that the method is effective.

2.5. Michael Chau, Reynold Cheng, Ben Kao, and Jackey Ng[5]

Data uncertainty is an inherent property in various applications due to reasons such as outdated sources or incorrect measurement. When data mining techniques are applied to such data, their uncertainty has to be considered to obtain high-quality results. The authors present UK-means clustering, an algorithm that enhances the K-means algorithm to handle data uncertainty.

3. ARCHITECTURE OF ATTRIBUTE BASED INDEXING

The proposed indexing architecture consists of the following functional components, each of which is described below.

3.1 Crawler

It is an important component of web search engines, where crawlers are used to collect the corpus of web pages indexed by the search engine.

3.2 Repository of Web Pages

This is the collection of web documents that have been collected by the crawler from the WWW. It is a database which stores the web pages gathered by the crawler in order to provide web documents for indexing.

3.3 Preprocessing of Documents

It involves the tokenization phase, which breaks sentences into individual tokens, typically words; it simply segregates all the words, numbers, and other characters in a given document. It also includes a stop word removal phase, which removes keywords that occur frequently in a web page but do not contribute to the context of the web document; expansion of abbreviations to full words; and a stemming process. The stemming phase extracts the root of a given word. For example, the words connected, connecting, and connection can all be rooted to the word connect. This reduces the size of the index file.
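A minimal sketch of this preprocessing pipeline, using NLTK's tokenizer, stop-word list, and the Porter stemmer (the suffix-stripping algorithm of [8]), is given below. The abbreviation table is a hypothetical stand-in for whatever expansion list a real system would maintain.

```python
# Sketch of the document preprocessing pipeline: tokenize, expand
# abbreviations, remove stop words, and stem (Porter's algorithm [8]).
# Run once: import nltk; nltk.download("punkt"); nltk.download("stopwords")
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

ABBREVIATIONS = {"dept": "department", "univ": "university"}  # hypothetical table
STOP_WORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def preprocess(text: str) -> list:
    tokens = word_tokenize(text.lower())                  # tokenization phase
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens]    # abbreviation expansion
    tokens = [t for t in tokens if t.isalnum()            # drop punctuation
              and t not in STOP_WORDS]                    # stop word removal
    return [STEMMER.stem(t) for t in tokens]              # stemming phase

if __name__ == "__main__":
    print(preprocess("The dept library is connecting new connections."))
    # roughly: ['depart', 'librari', 'connect', 'new', 'connect']
```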


3.4 WordNet

The most efficient and effective technique for detecting the semantic relation between words is WordNet[4].

3.5 Thesaurus

It is a dictionary of words available on the World Wide Web at thesaurus.com, which contains words as well as their multiple meanings. Using the thesaurus, the multiple meanings of terms and their various attributes can be derived.

3.6 Ontology Repository

After the keywords have been extracted from the documents and their multiple contexts extracted from the thesaurus, this task is further extended by forming a structural framework which represents the relationships, and thus the semantic meaning, of the document; such representations are referred to as 'ontologies'. The ontology repository is a database which contains the various concepts together with their relationships.

3.7 Attribute of the Document

The attributes of a document deduced from the ontology represent the semantics, or theme, of the document. At this level, the different documents retrieved for the same term are categorized according to their attributes. The document attributes are extracted using the thesaurus and the ontology repository.

3.8 Extracting and Mapping Attributes

Some pre-processing tasks must be done before valid attributes can be extracted; issues such as concatenated words, abbreviations, and acronyms have to be dealt with. There are three steps to these pre-processing tasks. First, a label item is broken into atomic words. Second, abbreviations and acronyms are expanded to their full words, e.g., from "dept" to "department". Third, preprocessing methods such as stopword removal and stemming are applied. After the pre-processing tasks are finished, we obtain general words which are related to the attributes of the query form. Only some of them will be labeled as attributes.

3.9 Target Query Based Indexing

The final index is constructed using target query based indexing on the basis of the attributes of the documents. In target query based indexing, documents are indexed on the basis of the attribute similarity and keyword similarity of the documents, which is calculated using frequency counts.

3.10 Searcher

It is the module of the search engine that receives user queries via the user interface and, after searching the index, returns the results to the user.

3.11 Query Interface

It is the user interface through which the user types the query.

4. ALGORITHM FOR CONSTRUCTING INDEX

The algorithm depicted in Fig. 2 shows the various steps in the construction of the attribute based index and hence in attribute based searching.
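Since Fig. 2 itself does not survive in this text, the sketch below shows one plausible reading of the construction under stated assumptions: documents are preprocessed, keyword frequencies are counted, each document is tagged with attributes (a stub stands in for the thesaurus/ontology lookup of Sections 3.5-3.7), and the searcher answers queries against the resulting attribute-grouped index. The `attributes_of` stub and the frequency-sum scoring are illustrative assumptions, not the paper's exact algorithm.

```python
# Sketch of attribute-based index construction and search. The attribute
# lookup is a stub for the thesaurus/ontology-repository step (3.5-3.7).
from collections import Counter, defaultdict

def attributes_of(doc_text: str) -> set:
    # Stub: a real system derives attributes from the thesaurus and the
    # ontology repository; here a tiny keyword table stands in for them.
    table = {"price": "commerce", "library": "holdings", "review": "opinion"}
    return {attr for word, attr in table.items() if word in doc_text.lower()}

def build_index(docs: dict):
    """Map attribute -> term -> {doc_id: frequency} (frequency-count index)."""
    index = defaultdict(lambda: defaultdict(dict))
    for doc_id, text in docs.items():
        counts = Counter(text.lower().split())   # preprocessing stands in here
        for attr in attributes_of(text) or {"general"}:
            for term, freq in counts.items():
                index[attr][term][doc_id] = freq
    return index

def search(index, query: str, attr: str = "general"):
    """Searcher: rank documents under one attribute by summed term frequency."""
    scores = Counter()
    for term in query.lower().split():
        for doc_id, freq in index[attr].get(term, {}).items():
            scores[doc_id] += freq
    return scores.most_common()

if __name__ == "__main__":
    docs = {"d1": "library holdings of the book",
            "d2": "book price and price drop"}
    idx = build_index(docs)
    print(search(idx, "book price", attr="commerce"))
    # -> [('d2', 3)]: only d2 carries the hypothetical 'commerce' attribute
```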


5. CONCLUSION

This paper presents an indexing structure that can be constructed on the basis of the attributes of a document. The attributes of the document can be extracted with the help of a thesaurus and an ontology repository that define the concepts and the relationships between terms. This paper therefore uses ontology for query-based index building, which offers retrieval from the index on the basis of attributes. This will help improve web search quality by providing the most relevant documents for the user's query.

REFERENCES

[1] Wang Xiaoyu, Cui Xiangyang, Chen Deyun, Jiang Feng, "Book Information Retrieval System Based On Deep-Web Data Integration," 2010 First International Conference on Pervasive Computing, Signal Processing and Applications, IEEE.
[2] Ying Xie, Wanli Zuo, Fengling He, Ying Wang, "Automatic Deep Web Query Results Extraction Based on Tag Trees," 2009 Second International Symposium on Computational Intelligence and Design, IEEE.
[3] Hao Liang, Wanli Zuo, Fei Ren, Chong Son, "Accessing Deep Web Using Automatic Query Translation Technique," Fifth International Conference on Fuzzy Systems and Knowledge Discovery, IEEE.
[4] He-Xiang Xu, Xiu-Lan Hao, Shu-Yun Wang, Yun-Fa Hu, "A Method of Deep Web Classification," Proceedings of the Sixth International Conference on Machine Learning and Cybernetics, Hong Kong, 19-22 August 2007, IEEE.
[5] Yiyao Lu, Hai He, Hongkun Zhao, Weiyi Meng, "Annotating Structured Data of the Deep Web," State University of New York at Binghamton, Binghamton, NY 13902, U.S.A., 2007, IEEE.
[6] C. Abi Chahine, N. Chaignaud, J.-Ph. Kotowicz and J.-P. Pécuchet, "Context and Keyword Extraction in Plain Text Using a Graph Representation," Learning Object Metadata draft standard document, 2008, IEEE.
[7] Travis D. Breaux, Joel W. Reed, Department of Computer Science, "Using Ontology in Hierarchical Information Clustering," Proceedings of the 38th Hawaii International Conference on System Sciences, 2005, IEEE.
[8] M.F. Porter, Computer Laboratory, Cambridge, UK, "An Algorithm for Suffix Stripping," first published in Program, Vol. 14, No. 3, July 1980, pp. 130-137, © Emerald Group Publishing Limited.
[9] Bin He, Tao Tao, Kevin Chen-Chuan Chang, "Organizing Structured Web Sources by Query Schemas: A Clustering Approach," CIKM'04, November 8-13, 2004, Washington, DC, USA.
[10] A.H.H. Ngu, D.J. Buttler, T.J. Critchlow, "Automatic Generation of Data Types for Classification of Deep Web Sources," Technical Report UCRL-CONF-209719, Lawrence Livermore National Laboratory, 2005.
[11] B. Chandrasekaran, John R. Josephson (Ohio State University) and V. Richard Benjamins (University of Amsterdam), "What Are Ontologies, and Why Do We Need Them?" IEEE Intelligent Systems (1094-7167), Volume 14, No. 1, pp. 20-26, 1999.
