Semantic Annotation Framework for Web Extracted Data

International Journal of Computer Engineering & Technology (IJCET)
Volume 7, Issue 5, Sep–Oct 2016, pp. 65–76, Article ID: IJCET_07_05_008
Available online at http://www.iaeme.com/ijcet/issues.asp?JType=IJCET&VType=7&IType=6
Journal Impact Factor (2016): 9.3590 (Calculated by GISI) www.jifactor.com
ISSN Print: 0976-6367 and ISSN Online: 0976–6375
© IAEME Publication

FIOBODA - SEMANTIC ANNOTATION FRAMEWORK FOR WEB EXTRACTED DATA

C. Gnana Chithra
Equity Research Consultant, Angeeras Securities, Chennai, Tamilnadu, India

Dr. E. Ramaraj
Professor, Department of Computer Science and Engineering, Alagappa University, Karaikudi, Tamilnadu, India

ABSTRACT

Semantic annotation of web pages is a state-of-the-art technology for realizing the Semantic Web, which enables document content to be shared and reused across boundaries and applications. The web is a treasury of knowledge, and efficient tools should be designed to explore its structured and unstructured data. Annotating millions of web pages manually is an impossible task, so automatic annotation of documents is mandatory for high information retrieval rates. Metadata is added to web pages to make them suitable for processing in content-based intelligent applications. This paper analyses the problems with current semantic annotation systems and proposes a new ontology-based automatic annotation framework. Ontology-based semantic annotation is one of the best methods for extracting data from a knowledge base. The contributions of this paper are the integration of a modified Manning's sentence boundary detection algorithm, a noun phrase collocation algorithm, and classification using machine learning techniques in the Information Extraction module, together with a new data model and ontology for the Structured Ontology engineering model.
The Annotation module annotates the output of the Information Extraction module with the aid of ontologies and dictionaries and stores the resulting annotated data as RDF triples in the Annotation database. The RDF repository interface performs reasoning over the annotated data. FIOBODA stands for Financial Instruments Ontology Based Open Document Annotation. Web pages extracted from the financial securities domain are mapped to the Finance ontology to extract the subject, predicate and object. An SVM classifier is used to separate correct annotations from incorrect ones. The correct annotation data is stored in the Annotation database and the RDF repository for later use. Owing to its reusability and interoperability, the proposed framework solves the knowledge bottleneck problem to an extent.

Key words: Dublin Core, FIOBODA, Financial Securities Ontology, Metadata, Semantic Annotation Framework.

Cite this Article: C. Gnana Chithra and Dr. E. Ramaraj, Fioboda - Semantic Annotation Framework For Web Extracted Data. International Journal of Computer Engineering and Technology, 7(5), 2016, pp. 65–76.
http://www.iaeme.com/ijcet/issues.asp?JType=IJCET&VType=7&IType=6

1. INTRODUCTION

Many researchers are working in the area of the semantic web to develop techniques and tools for searching, mining, accessing and reasoning over semantic data. Human annotation of web content is highly accurate but does not scale. When a large volume of web data needs to be annotated, manual annotation lacks both quality and speed. The alternative is to transform the human-readable web into machine-readable data by adding metadata to each document, which makes it an intelligent document.
In the recent past, researchers have proposed many methodologies and frameworks for semi-automatic and automatic annotation, with or without the use of ontologies and other lexicons. The semantic web includes technologies such as metadata, ontologies, and inference and logic modules for reasoning. The Merriam dictionary [1] defines annotation as “to add a short explanation or opinion to a text or drawing”. When a web document is enriched with metadata for machine processing, the process is called semantic annotation. Although billions of documents are present on the web and their number keeps growing, search engines such as Google, Yahoo or Bing do not support semantic analysis to a large extent. Annotation types can be classified by their functions, the features used and the prevailing technologies. Using metadata with the content would enable rich semantic applications for the web. The different kinds of annotation are textual annotation, image annotation, PDF annotation, multimedia annotation and web annotation.

A great deal of research has been carried out in the field of Information Extraction, for example on named entity recognition and relation extraction. When Dublin Core metadata elements such as Creator, Title, Subject, Description, Format and Date are incorporated into a web page, the spider or crawler builds a content index of the website for each page. When the user runs a query in a semantic search engine, the underlying information in the semantically marked-up web page helps rank the page using the content index, and the resulting pages are available for further processing. Semantic search is more efficient than the plain word-to-word search performed by other search engine algorithms. The crawler indexes only the text content of the website, whereas images, audio and video are ignored. In the current scenario, semi-automatic semantic annotation systems are the ones mostly used.
This is due to the limited scalability and accuracy of automatic semantic annotation in generating and representing annotation models.

2. RELATED STUDIES

Open annotations on the web can be classified into two types. The first is the creation of semi-automatically annotated documents using ontologies [2]. The focus of research is currently shifting to the second type, automatic annotation [3]. [4] designed a new strategy that incorporates information extraction and machine learning techniques for annotating documents. Baumgartner et al. [5] designed wrappers to extract data from the web using supervised learning techniques. Kiryakov et al. [6] designed “KIM for knowledge and information management infrastructure for automatic semantic annotation”. Dill et al. [7] created a tool for semantic tagging of texts in large corpora. The concept of Open Annotation developed by the Open Annotation Collaboration [8] has been taken up by the W3C Open Annotation Community Group.

3. DEFINITION OF SEMANTIC ANNOTATION

Handschuh [9] defines semantic annotation as “An annotation attaches some data to some other data: it establishes, within some context, a (typed) relation between the annotated data and the annotating data.” Kiryakov et al. [6] define semantic annotation as a schema whose generated metadata enables new information access methods and extends existing ones. Haase [10] explains that semantic metadata can be defined as linking related terms with each other.

4. FORMAL ANNOTATION

An annotation can be expressed as a tuple containing four elements, SAM = {C, S, O, P}, where SAM stands for semantic annotation method, “C” is the context in which the annotation is made, “S” is the subject of the annotation, that is, the data to be annotated, “P” is the predicate of the annotation, that is, the relationship between the annotated data and the annotating data, and “O” is the object of the annotation. In formal annotations, all the elements of SAM are expressed as Uniform Resource Identifiers (URIs). In the ontological representation of semantic annotation, the predicate and the object are ontological terms, and the object conforms to the ontological standards.

5. OPEN ANNOTATION

Open annotation [8] is a strategy for modeling annotations on web-based documents. The documents are linked to the World Wide Web in accordance with the principles of structured and unstructured data. The annotated documents are shared across different clients and servers and by the tools and applications of the semantic web. The URN is published and stored in the annotation servers with no particular protocol associated with it.

6. HUMAN ANNOTATION

Subject experts in the area of financial securities were asked to annotate the web pages. The annotators annotated the instances with the targets. The experts produced differing results, which semantically enriched the web pages to a large extent. Multiple identifiers were assigned to the same web page. Gold standard data was obtained from the annotators' results.

7. FIOBODA FRAMEWORK

The proposed automatic semantic annotation framework is depicted in Fig. 1. In this framework the crawler collects data from the web and stores the selected pages as web documents. These web documents are the input to the Information Extraction module.
After low-level information processing, the data is passed to the annotation module, where the extracted entities and relationships are compared with the ontological concepts and each entity is annotated with the root concept.

Figure 1 FIOBODA - Automatic Annotation Framework Diagram

Apart from ontologies, other lexicons such as WordNet, Wikipedia and Google are used as knowledge bases during the annotation process.
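
The annotation step described above, combined with the four-element tuple of Section 4, can be illustrated with a small Python sketch. It is not the authors' implementation: the `FIN` namespace, the `ONTOLOGY_LABELS` and `ROOT_OF` tables, the `#entity-n` subject naming and the `annotate` function are all hypothetical, chosen only to show how an extracted entity might be matched against ontology concepts, annotated with a root concept, and expressed as a subject, predicate, object triple whose elements are URIs.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# Hypothetical namespace and ontology fragment, invented for illustration only.
FIN = "http://example.org/finance#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Surface label -> ontology concept URI (assumed, not from the paper).
ONTOLOGY_LABELS = {
    "equity share": FIN + "EquityShare",
    "bond": FIN + "Bond",
    "mutual fund": FIN + "MutualFund",
}

# Concept URI -> root concept URI (assumed hierarchy).
ROOT_OF = {
    FIN + "EquityShare": FIN + "FinancialInstrument",
    FIN + "Bond": FIN + "FinancialInstrument",
    FIN + "MutualFund": FIN + "FinancialInstrument",
}


@dataclass(frozen=True)
class Annotation:
    """The tuple of Section 4: context, subject, predicate, object."""
    context: str
    subject: str
    predicate: str
    object: str

    def __post_init__(self):
        # Section 4 requires every element to be a URI; check for a scheme.
        for name in ("context", "subject", "predicate", "object"):
            value = getattr(self, name)
            if not urlparse(value).scheme:
                raise ValueError(f"{name} is not a URI: {value!r}")

    def as_triple(self):
        """Return the (subject, predicate, object) part for triple storage."""
        return (self.subject, self.predicate, self.object)


def annotate(page_uri, entities):
    """Annotate each entity that matches an ontology label with its root concept."""
    annotations = []
    for position, text in enumerate(entities):
        concept = ONTOLOGY_LABELS.get(text.strip().lower())
        if concept is None:
            continue  # no matching ontological concept, so no annotation
        root = ROOT_OF.get(concept, concept)
        annotations.append(
            Annotation(
                context=page_uri,
                subject=f"{page_uri}#entity-{position}",
                predicate=RDF_TYPE,
                object=root,
            )
        )
    return annotations


if __name__ == "__main__":
    for ann in annotate("http://example.org/page1", ["Bond", "weather", "Equity Share"]):
        print(ann.as_triple())
```

Exact-label lookup stands in here for the comparison with ontological concepts; a fuller system would also consult the additional lexicons mentioned above and pass candidate annotations through a classifier before storing them.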
