International Journal of Computer Engineering & Technology (IJCET)
Volume 7, Issue 5, Sep-Oct 2016, pp. 65-76, Article ID: IJCET_07_05_008
Available online at http://www.iaeme.com/ijcet/issues.asp?JType=IJCET&VType=7&IType=6
Journal Impact Factor (2016): 9.3590 (Calculated by GISI), www.jifactor.com
ISSN Print: 0976-6367 and ISSN Online: 0976-6375
© IAEME Publication

FIOBODA - SEMANTIC ANNOTATION FRAMEWORK FOR WEB EXTRACTED DATA

C. Gnana Chithra Equity Research Consultant, Angeeras Securities, Chennai, Tamilnadu, India

Dr. E. Ramaraj Professor, Department of Computer Science and Engineering, Alagappa University, Karaikudi, Tamilnadu, India

ABSTRACT

Semantic annotation of web pages is a state-of-the-art technology for achieving the unified objective of a universal web, which enables document content to be shared and reused beyond boundaries and applications. The web is a treasury of knowledge, and efficient tools should be designed to explore its structured and unstructured data. Annotating millions of web pages manually is an impossible task, so for high information retrieval rates automatic annotation of documents is mandatory. Metadata is added to web pages to make them intelligent for processing in content-based intelligent applications. This paper analyses the problems with current semantic annotation systems and proposes a new ontology-based automatic annotation framework. Ontology-based semantic annotation is one of the best methods for extracting data from a knowledge base. The contributions of this paper are the integration of the modified Manning's sentence boundary detection algorithm, the noun phrase collocation algorithm and classification using machine learning techniques in the information extraction module, and the development of a new data model and ontology for a structured ontology engineering model. The annotation module annotates the output of the information extraction module with the aid of ontologies and dictionaries and stores the resulting annotated data as RDF triples in the annotation database. Reasoning is performed on the annotated data through the RDF repository interface. FIOBODA stands for Financial Instruments Ontology Based Open Document Annotation. Web pages extracted from the financial securities domain are mapped to the finance ontology to extract the subject, predicate and object. An SVM classifier is used to classify correct and incorrect annotations; the correct annotation data is stored in the annotation database and RDF repository for later use. The proposed framework to a large extent solves the knowledge bottleneck problem owing to its reusability and interoperability features.

Key words: Dublin Core, FIOBODA, Financial Securities Ontology, Metadata, Semantic Annotation Framework.


Cite this Article: C. Gnana Chithra and Dr. E. Ramaraj, Fioboda - Semantic Annotation Framework For Web Extracted Data. International Journal of Computer Engineering and Technology, 7(5), 2016, pp. 65-76.
http://www.iaeme.com/ijcet/issues.asp?JType=IJCET&VType=7&IType=6

1. INTRODUCTION

Many researchers are working in the area of the semantic web to develop techniques and tools for searching, mining, accessing and reasoning over semantic data. Human annotation of web content is highly accurate but does not scale: when a large volume of web data needs to be annotated, manual annotation lacks quality and speed. The alternative is to transform the human-readable web into machine-readable data by adding metadata to the document, which makes it an intelligent document. In the recent past many methodologies and frameworks have been proposed for semi-automatic and automatic annotation, with or without ontologies and other lexicons. The semantic web includes technologies such as metadata, ontologies, and inference and logic modules for reasoning. The Merriam-Webster dictionary [1] defines annotation as "to add a short explanation or opinion to a text or drawing". When a web document is enriched with metadata for machine processing, the process is called semantic annotation. Though billions of documents are present on the web and their number keeps growing, search engines such as Google, Yahoo or Bing do not support semantic analysis to any large extent. Annotation types can be classified based on their functions, the features used and the prevailing technologies. Using metadata with the content would enable rich semantic applications for the web. The different kinds of annotation are textual annotation, image annotation, PDF annotation, multimedia annotation and web annotation. Considerable research has been carried out in the field of information extraction, such as named entity recognition and relation extraction. With the incorporation of Dublin Core metadata elements such as Creator, Title, Subject, Description, Format and Date into a web page, the spider or crawler builds a content index of the website for each page. When the user makes a semantic search in a semantic search engine, the underlying information in the semantically marked-up web page helps in ranking the page using the content index, and the resulting web search pages are available for further processing. Semantic search is more efficient than the word-to-word search made by other search engine algorithms. The crawler indexes only the text content of the website, whereas images, audio and video are ignored. In the current scenario semi-automatic semantic annotation systems dominate, owing to the limited scalability and accuracy of automatic semantic annotation in generating and representing annotation models.

2. RELATED STUDIES

Open annotations on the web can be classified into two types, the first being the creation of semi-automatically annotated documents using ontologies [2]. The focus of research has since shifted to automatic annotation [3]. Ciravegna et al. [4] designed a new strategy incorporating information extraction and machine learning techniques for annotating documents. Baumgartner et al. [5] designed wrappers to extract data from the web using supervised learning techniques.
Kiryakov et al. [6] designed KIM, a knowledge and information management infrastructure for automatic semantic annotation. Dill et al. [7] created a tool for semantic tagging of texts in large corpora. The concept of Open Annotation introduced by the Open Annotation Collaboration [8] was later adopted by the W3C Open Annotation Community Group.

3. DEFINITION OF SEMANTIC ANNOTATION

Handschuh [9] defines semantic annotation as follows: "An annotation attaches some data to some other data: it establishes, within some context, a (typed) relation between the annotated data and the annotating data." Kiryakov et al. [6] define semantic annotation as a schema together with the more specific metadata generated from it.

Such metadata enables the discovery of new information access methods and also extends the existing ones. Haase [10] explains that semantic metadata can be defined as the linking of related terms with each other.

4. FORMAL ANNOTATION

An annotation can be expressed as a tuple of four elements, SAM = {C, S, O, P}, where SAM stands for semantic annotation method, C is the context in which the annotation is made, S is the subject of the annotation, that is, the data to be annotated, P is the predicate of the annotation, the relationship to the annotating data, and O is the object of the annotation. In formal annotations all elements of SAM are expressed as Uniform Resource Identifiers (URIs). In the ontological representation of a semantic annotation, the predicate and object are ontological terms, and the object conforms to the ontological standards (a small data-structure sketch of this tuple follows the framework overview below).

5. OPEN ANNOTATION

Open annotation [8] is a strategy for modelling web-based documents for annotation. The documents are linked to the World Wide Web following the principles of structured and unstructured data. The annotated documents are shared across different clients and servers and by tools and applications of the semantic web. The URN is published and stored in the annotation servers with no particular protocol associated with it.

6. HUMAN ANNOTATION

Subject experts in the area of financial securities were requested to annotate the web pages. The annotators annotated the instances with the targets. The experts came up with different results, which semantically enriched the web pages to a large extent, and several identifiers were assigned to the same web page. Gold standard data was obtained from the results of the annotators.

7. FIOBODA FRAMEWORK

The proposed automatic semantic annotation framework is depicted in Fig. 1. In this framework the crawler collects data from the web and stores the selected pages as web documents. These web documents are the input to the information extraction module. After low-level information processing the data is passed to the annotation module, where the extracted entities and relationships are compared with the ontological concepts and each entity is annotated with a root concept.
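As promised above, a minimal data-structure sketch of the SAM tuple of Section 4; the field names and example URIs are purely illustrative and are not taken from the paper's ontology:

```python
from dataclasses import dataclass

@dataclass
class SemanticAnnotation:
    """SAM = {C, S, P, O}; in formal annotations every element is expressed as a URI."""
    context: str    # C - the context in which the annotation is made
    subject: str    # S - the data being annotated
    predicate: str  # P - the relation between the annotated and annotating data
    object: str     # O - the annotating data, an ontological term

# Hypothetical example: a term in a crawled page tied to a finance-ontology concept
ann = SemanticAnnotation(
    context="http://example.org/docs/market-report.html",
    subject="http://example.org/docs/market-report.html#equity",
    predicate="http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
    object="http://example.org/fin-ontology#Equity",
)
```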

Figure 1 FIOBODA automatic annotation framework diagram


Apart from ontologies, other lexicons such as WordNet, Wikipedia and Google are used as knowledge bases during the annotation process. The resulting annotations are verified for correctness; correct annotations are added to the annotation database, otherwise they are rejected. The query parser sends queries to the inference engine, and through reasoning techniques results are obtained from the knowledge base as well as the annotation database. Human annotation requires a large data set for training. Supervised algorithms also require very large data sets for testing and training, whereas semi-supervised learning requires much less data. Automatic semantic annotation also requires some data initially for learning, but very little compared with semi-automatic technologies.

8. INFORMATION EXTRACTION MODULE

The input to this module is the set of extracted web pages. HTML scraping is performed to remove the HTML tags and to filter out audio, video and images, converting the HTML document to plain text. The text is parsed with a robust lightweight parser. The modified sentence boundary detection and classification algorithm [11], designed by us for this research on semantic annotation, is used in this phase and is given in Fig. 2. Sentence boundaries are detected and classified correctly even in the presence of abbreviations, including those of geographical locations, university degrees and URLs.
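Before the sentence-boundary step of Fig. 2, the scraping step can be sketched as follows, assuming the BeautifulSoup library (the paper does not name the scraper it uses); the filtered tag list is illustrative:

```python
from bs4 import BeautifulSoup

def scrape_to_plain_text(html: str) -> str:
    """Strip HTML tags and drop audio, video and image elements, keeping plain text."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove non-textual and presentation-only elements before extracting text
    for tag in soup(["script", "style", "img", "audio", "video", "iframe"]):
        tag.decompose()
    return soup.get_text(separator=" ", strip=True)
```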

MODIFIED MANNING'S HEURISTIC ALGORITHM
• Place putative sentence boundaries after all occurrences of . ? ! (and possibly ; : -).
• Move the boundary after any following quotation marks.
• Disqualify a period boundary in the following circumstances:
• If it is preceded by a known abbreviation of a sort that does not normally occur word-finally but is commonly followed by a capitalized proper name, such as Prof. or vs.
• The period character '.' in the initials of a person's name should not start a separate sentence.
• The period character in the name of an educational degree should not be split into a separate sentence.
• Look up the ontology to recognize educational qualifications.
• If an abbreviation contains numbers, check it against the ontology.
• Abbreviations other than educational degrees and geographical data are checked against the WordNet ontology and an ontology containing honorary titles, family titles and professional titles.
• A URL should not be split even though it contains periods.
• A sentence should not be split after an ellipsis in English.
• Disqualify a boundary with a ? or ! if it is followed by a lowercase letter (or a known name).
• When there is an imbalance in the parentheses or brackets of a sentence, do not split the sentence; balance the parenthesis or bracket by inserting or replacing the mark.
• Regard all other putative sentence boundaries as sentence boundaries.

Figure 2 Modified Manning’s sentence detection algorithm
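A simplified sketch of a few of these heuristics follows; the abbreviation list and regular expressions are stand-ins, and the full algorithm in [11] covers many more cases, including the ontology lookups:

```python
import re

# Illustrative stand-ins for the lexicon and ontology lookups used in the paper
KNOWN_ABBREVIATIONS = {"Prof.", "vs.", "Dr.", "Mr.", "Ltd.", "Inc.", "M.Sc.", "Ph.D."}
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
BOUNDARY = re.compile(r"[.?!]+[\"')\]]*")   # boundary plus trailing quotes/brackets

def split_sentences(text: str) -> list[str]:
    """Place putative boundaries after . ? ! and then disqualify some of them."""
    sentences, start = [], 0
    for match in BOUNDARY.finditer(text):
        end = match.end()
        # A boundary must be followed by whitespace or end of text (keeps URLs intact)
        if end < len(text) and not text[end].isspace():
            continue
        words = text[:end].split()
        last = words[-1] if words else ""
        # Disqualify known abbreviations, single-letter initials, and URLs
        if last in KNOWN_ABBREVIATIONS or re.fullmatch(r"[A-Z]\.", last):
            continue
        if URL_PATTERN.search(last):
            continue
        sentences.append(text[start:end].strip())
        start = end
    if text[start:].strip():
        sentences.append(text[start:].strip())
    return sentences

# e.g. split_sentences("Prof. Ramaraj teaches at Alagappa University. See www.iaeme.com.")
```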


The correctly classified sentences are parsed using sentence segmentation techniques and then tokenized into smaller units called tokens. Stop words in the list are removed, morphological analysis is performed to find the root of each word, and Porter's stemming algorithm stems the words to their roots. Finally, the lexical process of part-of-speech tagging is applied to the tokens to identify named entities and associate them with POS tags. Using the collocation extraction and noun phrase filtering algorithm [12] (which is also part of this research), phrases are extracted from the corpus, and those conforming to the noun phrase filter rules are classified as noun phrase collocations. These noun phrases are passed on to the annotation phase.
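A minimal sketch of this pre-processing chain, assuming the NLTK toolkit (the paper does not name a specific library); the example sentence is illustrative:

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# One-time downloads: nltk.download("punkt"), nltk.download("stopwords"),
# nltk.download("averaged_perceptron_tagger")

def preprocess(sentence: str) -> list[tuple[str, str, str]]:
    """Tokenize, remove stop words, stem with Porter's algorithm, and POS-tag."""
    tokens = nltk.word_tokenize(sentence)
    stop_words = set(stopwords.words("english"))
    kept = [t for t in tokens if t.lower() not in stop_words and t.isalnum()]
    stemmer = PorterStemmer()
    tagged = nltk.pos_tag(kept)                 # (token, POS tag) pairs
    return [(tok, stemmer.stem(tok), tag) for tok, tag in tagged]

# e.g. preprocess("Equity is raised by the company from the investors")
```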

Algorithm for Collocation Phrase Extraction

Input: the list of phrases or n-grams extracted after pre-processing the web document.

Step 1: Take a phrase p1 from the list of phrases P = {p1, p2, p3, ..., pn} in the collection.
Step 2: Compare the phrase p1 with the WordNet super thesaurus. If the phrase exists, add it to the potential collocation candidate (PCC) set and go to Step 7; otherwise go to Step 3.
Step 3: Compare the phrase p1 with the Wikipedia proper noun ontology. The basic requirement is that p1 is capitalized; if, after the search, the phrase exists as the first element in the main body, add it to the PCC (a normal noun phrase needs to be capitalized first). If the phrase exists, add it to the PCC and go to Step 7; otherwise go to Step 4.
Step 4: Perform a Google search on p1. If the search engine result page (SERP) ranks p1 above the threshold, add it to the PCC and go to Step 7; otherwise go to Step 5.
Step 5: Search for p1 in the BNC dictionary. If the phrase is available, add it to the PCC and go to Step 7; otherwise go to Step 6.
Step 6: Search the geographic gazetteer for a proper noun phrase. If it matches, add it to the PCC.
Step 7: If the phrase cannot be classified as a PCC through Steps 2 to 6, mark it as a REJECTED CANDIDATE and add it to the rejected list.
Step 8: Move on to the next phrase, go to Step 2 and proceed until the entire set is exhausted.
Step 9: Finally, the PCC set contains the collocation phrases.

Figure 3 Collocation Phrase Algorithm
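A sketch of the filtering loop of Figure 3, with the WordNet, Wikipedia, Google SERP, BNC and gazetteer checks replaced by placeholder predicates, since the real lookups are external lexicons and services:

```python
def classify_collocations(phrases, lookups):
    """Route each phrase through the lookups of Figure 3 until one accepts it.

    `lookups` is an ordered list of predicate functions standing in for the
    WordNet, Wikipedia, Google SERP, BNC and gazetteer checks of Steps 2-6.
    """
    potential, rejected = [], []        # PCC set and rejected-candidate list
    for phrase in phrases:
        if any(accepts(phrase) for accepts in lookups):
            potential.append(phrase)
        else:
            rejected.append(phrase)
    return potential, rejected

# Illustrative use with dummy lookups standing in for the real resources
wordnet_has = lambda p: p in {"preference share", "dividend"}
gazetteer_has = lambda p: p in {"Chennai", "Karaikudi"}
pcc, dropped = classify_collocations(
    ["preference share", "Chennai", "blue widget"],
    [wordnet_has, gazetteer_has],
)
```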

9. ONTOLOGY DESIGN AND MANAGEMENT

An ontology is a model made up of concepts, attributes and relations. It defines the relationships between the elements in a machine-readable way and describes the things that exist in the real universe. A taxonomy can be defined as the hierarchical representation of things. Ontologies and taxonomies are business models that allow concepts to be defined at different levels of granularity. An ontology adds information to the taxonomy, aiding it in defining the concepts in a machine-readable manner. The first statement in the ontology is owl:Thing, which means that every class in the ontology is a sub-class of the main class owl:Thing and that the ontology is built around the things in the real universe.


An ontology is a business model which explains the relationships between entities, and additional logical information about the entities, in a machine-readable way. The ontology, like a taxonomy, contains definitions of things in the real world; the foundation pillar of an ontology is therefore the taxonomy, the hierarchical class structure of those real-world things. The classes in the ontology should have a formal explicit description, attributes or properties for each class, and constraints or restrictions on those properties. The financial securities domain is analyzed and discussed in this paper. Financial instruments are also called financial securities; the different securities include equities, debts, swaps, spots, futures, listed options and so on. The top-level classes of the financial instruments ontology are given in Figure 4.

Figure 4 Top level Classes of Financial Instruments Ontology


Here the financial instrument is a thing in the universe. The financial instruments are classified as per the CFI taxonomy standard. Equity capital is money that is raised by the company from investors as per contractual terms, and investors in turn gain money by trading those shares on the stock market. The class Equity and its sub-classes [13] are given in Fig. 5.

Figure 5 Equity Classes in Financial Instruments Ontology

The concept Equity has the relationships "is raised by", "is owned by", "has rights defined" and "is a" between the entities. The following facts illustrate the relationship between subject and object: the owner has rights over equity; equity is raised by owners; equities are owned by investors; equity is a financial instrument; equity securities have rights defined in contractual terms.

Equity          is raised by          owners
(s)             (p)                   (o)

Figure 6 Example sentence with subject, predicate and object classification

Here the word "Equity" refers to the subject, "owners" is the object, and "is raised by" is the relationship between the entities.
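As a sketch of how such a triple and the surrounding class hierarchy might be represented in RDF, assuming the rdflib library and an illustrative namespace (the actual Financial Instruments ontology URIs are not reproduced here):

```python
from rdflib import Graph, Namespace
from rdflib.namespace import RDF, RDFS, OWL

FIN = Namespace("http://example.org/fin-ontology#")   # illustrative namespace
g = Graph()
g.bind("fin", FIN)

# Class hierarchy: Equity is a financial instrument, a sub-class under owl:Thing
g.add((FIN.FinancialInstrument, RDF.type, OWL.Class))
g.add((FIN.Equity, RDF.type, OWL.Class))
g.add((FIN.Equity, RDFS.subClassOf, FIN.FinancialInstrument))

# The relation of Figure 6: Equity (s) is raised by (p) owners (o)
g.add((FIN.isRaisedBy, RDF.type, OWL.ObjectProperty))
g.add((FIN.Equity, FIN.isRaisedBy, FIN.Owner))

print(g.serialize(format="turtle"))
```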


An excerpt from the financial securities ontology representation [14] is given in Fig. 7.

Figure 7 Slice of the financial instruments ontological representation

Ontologies are well defined and represent up-to-date information, so the maintenance of ontologies is a grave concern in semantic annotation systems:
1. When a concept in the ontology is removed, a conflict arises between the annotated documents on the server and the web page.
2. When the classification of the ontology is modified, the annotated documents on the server should reflect the new changes; the identifiers associated with a web page also need to be updated.
3. The ontology needs to keep pace with the latest information carriers so as to identify the latest entities and their relationships.

10. ANNOTATION MODULE

The noun phrases extracted from the web document, which are instances, are matched against the ontology to find the higher-level concept. The conceptual representation of the word is matched with the instance, and the attribute values of that concept are filled with the values found in the document to be annotated. It is not mandatory that all the attributes of a concept be filled, but the more attributes are filled, the more clearly the concept is marked for the instance. The index ranges of all the instances are stored in a file. When concepts overlap, a relation exists between them, and the concepts carrying that relation are the possible candidates for annotation. The context of the higher-level concept, from the word to the sentence, is analyzed to find the lower-level concepts; it is assumed that there is spatial proximity between the concepts. The instances in the extracted web data are annotated with higher-level concepts. Annotations are represented in the system in RDF/XML format. A Uniform Resource Identifier (URI) may take the form of a Uniform Resource Name (URN), used for internal reference within the document, or of a Uniform Resource Locator (URL), used for external reference on the web. Each annotation is checked to see whether it is a URI or a URL. If it is a URL, it need not be converted to a URI, and if the annotation exists in a web document it can be stored on the server later for indexing. When the annotated document is not a web document, a corresponding URN is generated and later published and stored on the local server. The web document is integrated with the annotation data and stored, so that documents can subsequently be annotated automatically.
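A sketch of how one such annotation might be built as RDF triples and serialized to RDF/XML for the annotation database; the vocabulary, property names and URIs here are illustrative, not the paper's actual schema:

```python
import uuid

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

ANN = Namespace("http://example.org/annotation#")     # illustrative vocabulary
FIN = Namespace("http://example.org/fin-ontology#")

g = Graph()
annotation = URIRef("urn:uuid:" + str(uuid.uuid4()))   # generated URN for the annotation
g.add((annotation, RDF.type, ANN.Annotation))
g.add((annotation, ANN.target, URIRef("http://example.org/docs/report.html")))
g.add((annotation, ANN.body, FIN.Equity))              # the higher-level concept
g.add((annotation, ANN.exact, Literal("equity shares")))  # the annotated text span

rdf_xml = g.serialize(format="xml")                    # RDF/XML for storage
```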

Figure 8 Graphical representation of the class Equity in the ontology (adapted from [14]); it represents the classes that exist under owl:Thing.

11. CLASSIFICATION OF ANNOTATION BY MACHINE LEARNING

The resulting annotated pages are classified into correct and wrong annotations using the SVM classifier model. Features were studied for the classification; the correctly classified annotations are stored in the annotation database and the incorrect ones in the rejected list. The correctly classified annotations serve as training data for future classifications.

11.1. SVM Classifier

SVM is a machine learning algorithm for binary classification. The idea behind the SVM classifier is that vectors are mapped non-linearly into a high-dimensional feature space, where a linear separation of the training data is sought with the maximum margin between the two classes. Test data, together with the feature set and the training data, is assigned to the class to which it corresponds. The features are mapped to the feature space for performing the optimization. If the training set examples cannot be separated, the regularization parameter can be used to balance a larger margin against a larger training error.
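A minimal sketch of this binary classification step, assuming scikit-learn; the feature matrix here is synthetic placeholder data, since the paper's actual feature set is not reproduced:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Placeholder features: in FIOBODA these would be the studied annotation features,
# with labels 1 = correct annotation and 0 = incorrect annotation
X, y = make_classification(n_samples=400, n_features=12, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

clf = SVC(kernel="linear", C=1.0)   # C balances margin width against training error
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print("precision:", precision_score(y_test, y_pred))
print("recall:", recall_score(y_test, y_pred))
```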

12. EVALUATION

Evaluating the FIOBODA framework is a difficult procedure; hence the performance metrics proposed by Yang [15] are used to evaluate FIOBODA. To evaluate the performance, a confusion matrix, or error matrix, as described by Kohavi and Provost [16], is first constructed, which permits visualization of the performance. This error matrix has two dimensions, the actual and the predicted classification. The confusion matrix is given in Table 1.

Table 1 Confusion Matrix

                     Predicted Positive      Predicted Negative
Actual Positive      True Positive (TP)      False Negative (FN)
Actual Negative      False Positive (FP)     True Negative (TN)

Here TP is the number of correct predictions for positive instances (true positives), FP is the number of negative instances incorrectly predicted as positive (false positives), FN is the number of positive instances incorrectly predicted as negative (false negatives), and TN is the number of correct predictions for negative instances (true negatives). The metrics preferred by Yang [15] are used to evaluate the FIOBODA framework. Three datasets, Dataset 1, Dataset 2 and Dataset 3, were extracted from large corpora covering three domains, two from stock markets and one from corporate websites. The top-ranked named entities with their precision and recall values are given in Table 2.

Table 2 Top Ranked Entities with Precision and Recall

Named Entity         Precision    Recall
equity               98.34%       99.12%
preference share     98.12%       99.00%
dividend             99.00%       98.23%
bonus share          97.32%       98.67%
investment           70.23%       65.34%

The entity "dividend" has a high precision of 99.00% and a recall of 98.23%, whereas the entity "investment" records much lower precision and recall values owing to its lack of specificity.

Table 3 Evaluating the proposed annotation framework on different datasets

Domain       Precision    Recall    F-score    Fallout    Accuracy    Error
Dataset-1    97.54%       96.95%    96.97%     0.11%      95.5%       4.5%
Dataset-2    98%          81.25%    88.84%     0.11%      96%         6%
Dataset-3    95.55%       98.47%    96.89%     0.31%      94.66%      5.34%
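The reported metrics follow the standard confusion-matrix definitions; a small helper computing them from the counts of Table 1 is sketched below (assuming the usual formulas):

```python
def evaluation_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard metrics derived from the confusion-matrix counts of Table 1."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    fallout = fp / (fp + tn)                    # false positive rate
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
        "fallout": fallout,
        "accuracy": accuracy,
        "error": 1 - accuracy,
    }
```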


After pre-processing with the modified Manning's sentence boundary detection algorithm and applying the noun phrase collocation detection algorithm to the datasets, the extracted entities are of high quality. Dataset 1 contains 20,000 instances to be annotated, Dataset 2 contains 25,000 entities and Dataset 3 contains 15,000 entities for annotation and classification. Dataset 1 and Dataset 3 emerge with high recall rates, as shown in Table 3. Since the fallout in Dataset 1 and Dataset 2 is very low and the error is correspondingly low, the FIOBODA framework proves successful. Although Dataset 3 has high precision and recall, its fallout of irrelevant data is 0.31%. The results in Table 3 are represented graphically in Fig. 9. The accuracy levels are also above 94%, which is within the acceptable range for the newly designed FIOBODA framework.

Figure 9 Graphical representation of performance measures on the datasets using the FIOBODA framework

Table 4 Evaluation of SVM Classifier on Datasets

Dataset      Precision    Recall
Dataset 1    98.1%        98.76%
Dataset 2    97.67%       98.54%
Dataset 3    98.34%       99.23%

The mean precision of the SVM classifier on the datasets is 98.03% and the mean recall is 98.84%. The SVM classifier with its parameters performs the optimization, and the training set is linearly separable.

13. CONCLUSION

This semantic annotation framework annotates documents with Dublin Core metadata elements and higher-level concepts. Owing to the frequent changes in web page content, there is no tight coupling between the annotation in the web page and the ontology. The correctly classified annotated documents, which are stored for future use, are potential candidates for machine learning. The semantic relationships between the concepts need to be drilled down further, and the association between the ontology and the document has to be made still tighter.


REFERENCES
[1] http://www.merriam-webster.com/dictionary/annotation
[2] Handschuh, S., Staab, S., Ciravegna, F.: S-CREAM - Semi-automatic CREAtion of Metadata. The 13th Int. Conf. on Knowledge Engineering and Management (EKAW 2002), ed. Gomez-Perez, A., Springer Verlag (2002)
[3] Chakrabarti, S.: Mining the Web: Discovering Knowledge from Hypertext Data. Morgan Kaufmann Publishers (2003)
[4] Ciravegna, F., Chapman, S., Dingli, A., Wilks, Y.: Learning to Harvest Information for the Semantic Web. ESWS 2004, LNCS 3053, Springer-Verlag Berlin Heidelberg (2004) 312-326
[5] Baumgartner, R., Flesca, S., Gottlob, G.: Visual Web Information Extraction with Lixto. In: Proc. of the 27th Int. Conference on Very Large Data Bases (2001) 119-128
[6] Kiryakov, A., Popov, B., Terziev, I., Manov, D., Ognyanoff, D.: Semantic Annotation, Indexing and Retrieval. In: Proceedings of the Second International Semantic Web Conference (ISWC 2003), Florida, USA (2003) 484-499
[7] Dill, S., Eiron, N., Gibson, D., Gruhl, D., Guha, R., Jhingran, A., Kanungo, T., McCurley, K. S., Rajagopalan, S., Tomkins, A., Tomlin, J. A., Zien, J. Y.: A Case for Automated Large-Scale Semantic Annotation. Journal of Web Semantics, 1(1) (2003) 115-132
[8] Sanderson, R., Van De Sompel, H.: Open Annotation: Beta Data Model Guide (2011). http:/www/openannotation.org/spec
[9] Oren, E., Delbru, R., Möller, K., Völkel, M., Handschuh, S.: Annotation and Navigation in Semantic Wikis. In: Proceedings of the Workshop on Semantic Wikis (SemWiki), in conjunction with the 3rd European Semantic Web Conference (2006)
[10] Haase, K.: Context for Semantic Metadata. In: Proceedings of the 12th ACM International Conference on Multimedia, New York, USA, ACM Press (2004)
[11] Gnana Chithra, C., Ramaraj, E.: Heuristic Sentence Boundary Detection and Classification. Paper selected for presentation at the First International Conference on Recent Innovations in Engineering and Technology 2016, to be published in the International Journal of Emerging Technologies - IJET (online ISSN: 2249-3255)
[12] Gnana Chithra, C., Ramaraj, E.: A Novel Automatic Approach for Extraction and Classification of Noun Phrase Collocates. International Journal of Computational Intelligence Research (IJCIR)
[13] CFI: Classification of Financial Instruments. http://www.anna-web.org
[14] Bennett, M.: Financial Securities and Ontologies: An Exploration (2007). www.hypercube.co.uk/docs/ontologyexploration.doc
[15] Yang, Y.: An Evaluation of Statistical Approaches to Text Categorization. Journal of Information Retrieval, 1(1-2) (1999) 67-88
[16] Kohavi, R., Provost, F.: On Applied Research in Machine Learning. In: Editorial for the Special Issue on Applications of Machine Learning and Knowledge Discovery Process, Columbia University, New York, volume 30 (1998)
[17] Houda El Bouhissi, Mimoun Malki and Djamila Berramdane, Applying Semantic Web Services. International Journal of Computer Engineering and Technology (IJCET), 4(2), 2013, pp. 108-113
[18] Mangai, P.: Enhanced Web Image Re-Ranking Using Semantic Signatures. International Journal of Computer Engineering and Technology (IJCET), 7(2), 2016, pp. 24-29
