Visualization and Validation of (Q)Sar Models

Total Page:16

File Type:pdf, Size:1020Kb

Visualization and Validation of (Q)Sar Models VISUALIZATION AND VALIDATION OF (Q)SARMODELS Dissertation to obtain the academic degree Dissertation of Science submitted to the Faculty Physics, Mathematics and Computer Science of the Johannes Gutenberg-University Mainz on 6. July 2015 by Martin Gütlein born in Bamberg Submission date: 6. July 2015 PhD defense: 22. July 2015 First reviewer: Second reviewer: ABSTRACT Analyzing and modeling relationships between the structure of chemical com- pounds, their physico-chemical properties, and biological or toxic effects in chem- ical datasets is a challenging task for scientific researchers in the field of chemin- formatics. Therefore, (Q)SAR model validation is essential to ensure future model predictivity on unseen compounds. Proper validation is also one of the require- ments of regulatory authorities in order to approve its use in real-world scenarios as an alternative testing method. However, at the same time, the question of how to validate a (Q)SAR model is still under discussion. In this work, we empirically compare a k-fold cross-validation with external test set validation. The introduced workflow allows to apply the built and validated models to large amounts of un- seen data, and to compare the performance of the different validation approaches. Our experimental results indicate that cross-validation produces (Q)SAR models with higher predictivity than external test set validation and reduces the variance of the results. Statistical validation is important to evaluate the performance of (Q)SAR mod- els, but does not support the user in better understanding the properties of the model or the underlying correlations. We present the 3D molecular viewer CheS- Mapper (Chemical Space Mapper) that arranges compounds in 3D space, such that their spatial proximity reflects their similarity. The user can indirectly determine similarity, by selecting which features to employ in the process. The tool can use and calculate different kinds of features, like structural fragments as well as quan- titative chemical descriptors. Comprehensive functionalities including clustering, alignment of compounds according to their 3D structure, and feature highlighting aid the chemist to better understand patterns and regularities and relate the obser- vations to established scientific knowledge. Even though visualization tools for analyzing (Q)SAR information in small molecule datasets exist, integrated visualization methods that allows for the investi- gation of model validation results are still lacking. We propose visual validation, as an approach for the graphical inspection of (Q)SAR model validation results. New functionalities in CheS-Mapper 2.0 facilitate the analysis of (Q)SAR information and allow the visual validation of (Q)SAR models. The tool enables the comparison of model predictions to the actual activity in feature space. Our approach reveals if the endpoint is modeled too specific or too generic and highlights common proper- ties of misclassified compounds. Moreover, the researcher can use CheS-Mapper to III inspect how the (Q)SAR model predicts activity cliffs. The CheS-Mapper software is freely available at http://ches-mapper.org. IV ZUSAMMENFASSUNG Zusammenhänge zwischen der Struktur von chemischen Verbindungen und biol- ogischen oder toxischen Effekten zu analysieren und zu modellieren ist eine wis- senschaftliche Herausforderung im Bereich der Chemieinformatik. Deshalb ist die sorgfältige Validierung von (Q)SAR Modellen entscheidend um die Vorhersage- Genauigkeit eines Modells bei ungesehenen Verbindungen zu gewährleisten. Ord- nungsgemäße Validierung ist auch eine der Voraussetzungen der Regulierungs- behörden, um den Einsatz von (Q)SAR Modellen als alternative Test-Methode von Chemikalien zu genehmigen. Allerdings wird immer noch aktiv diskutiert, welches die korrekte Validierungsmethode von (Q)SAR Modellen ist. Diese Ar- beit vergleicht empirisch k-fache Kreuzvalidierung mit einer externen Validierung anhand eines Test-Datensatzes. Mit der vorgestellten Methodik werden die vali- dierten Modelle auf große Mengen ungesehener Verbindungen angewendet, und die Genauigkeit der verschiedenen Validierungsmethoden verglichen. Unsere ex- perimentellen Ergebnisse legen nahe, dass kreuzvalidierte (Q)SAR Modelle eine höhere Vorhersage-Genauigkeit aufweisen, als solche, die mit einem externen Test- datensatz validiert worden sind. Des weiteren ist die Varianz der Kreuzvalidierung geringer. Statistische Validierung ist zwingend notwendig, um die Vorhersage- Genauigkeit von (Q)SAR Modellen zu ermitteln. Diese Validierung ist aber nur eingeschränkt hilfreich um die Eigenschaften des Modells oder der zugrunde liegenden Beziehungen zu verstehen. In diesem Zusammenhang stellen wir den molekularen 3D-Viewer CheS-Mapper (Chemical Space Mapper) vor. Diese Computer-Anwendung ordnet chemische Verbindungen im 3D-Raum an, so dass die räumliche Distanz die Ähnlichkeit der Verbindungen widerspiegelt. Durch die Wahl der chemischen Deskriptoren kann der Benutzer die Ähnlichkeit festlegen. CheS-Mapper kann diverse Deskriptoren-Typen, zum Beispiel struk- turelle Fragmente oder numerische Kennzahlen, berechnen. Des weiteren erlaubt CheS-Mapper das Clustern von Verbindungen, das Ausrichten und Übereinan- derlegen der Strukturen im 3D-Raum, wie auch die farbliche Hervorhebung von Verbindungen anhand ihrer Deskriptor-Werte. Das Programm erleichtert es daher, Chemikern Muster und Zusammenhänge in den Daten zu erkennen und bekanntes wissenschaftliches Wissen zu veranschaulichen. Zwar existieren bereits einige Visualisierungs-Werkzeuge für (Q)SAR Informatio- nen in chemischen Datensätzen, allerdings fehlt eine ganzheitliche Visualisierungs- V Methode für Validierungs-Ergebnisse. Wir präsentieren visuelle Validierung, eine graphische Analyse-Methode für die Validierung eines (Q)SAR Modells mit Hilfe neuer Funktionen in CheS-Mapper 2.0. Vorhergesagte Werte für die Aktivität chemischer Verbindungen können mit tatsächlichen Aktivitäten durch die Visual- isierung im 3D-Raum verglichen werden. Unser Ansatz zeigt, ob der Endpunkt zu generisch oder zu spezifisch modelliert wurde, und hebt gemeinsame Eigen- schaften von falsch vorhergesagten Verbindungen hervor. Darüber hinaus können Forscher untersuchen, wie Activity Cliffs von einem Modell vorhergesagt werden. Die CheS-Mapper Software ist frei verfügbar unter http://ches-mapper.org. VI ACKNOWLEDGEMENTS Acknowledgements have been removed from the online version. VII CONTENTS List of Figures XIII List of Tables XIX List of Abbreviations XXI List of Papers XXIII 1 INTRODUCTION 1 1.1 Contributions and Findings . 1 1.2 Structure . 3 2 BACKGROUND 5 2.1 Cheminformatics . 5 2.1.1 Representing Chemical Structures . 5 2.1.2 Chemical Descriptors and Structural Fragments . 11 2.1.3 Chemical Databases . 13 2.1.4 Searching in Datasets and Databases . 14 2.2 (Q)SAR Modeling . 15 2.2.1 Milestones in (Q)SAR History . 15 2.2.2 Types of (Q)SAR Modeling Approaches . 16 2.2.3 Data Availability and Quality . 19 2.2.4 Applicability Domain of (Q)SAR Models . 20 2.2.5 An Application Example: lazar . 21 2.2.6 Acceptance of (Q)SARs as Alternative Testing Method . 22 2.3 Validation of (Q)SAR Models . 23 2.3.1 Cross-Validation and its Variants . 24 2.3.2 Internal and External Validation . 25 2.3.3 Current Concerns about Cross-Validation . 26 2.4 Visualization of Chemical Datasets . 27 2.4.1 Chemical Spreadsheets . 28 2.4.2 Dimensionality Reduction Techniques . 28 2.4.3 Visualization Tools for Small Molecule Datasets . 30 2.4.4 Visualization Approaches for Model Validation . 35 IX 3 CROSS - VALIDATION OF ( Q ) SARMODELS 37 3.1 Experimental Workflow Design . 37 3.1.1 Extended Workflow Parameters . 40 3.2 Experimental Results . 42 3.2.1 Performance on Reference Dataset . 42 3.2.2 Deviation of Predictivity Estimate from Reference Dataset . 45 3.2.3 Additional Experimental Results . 45 3.3 Discussion on Cross-Validation and External Validation . 52 3.4 Conclusions . 53 4 CHEMICALSPACEMAPPINGANDVISUALIZATION 55 4.1 Methods . 55 4.1.1 CheS-Mapper Wizard . 56 4.1.2 CheS-Mapper Viewer . 67 4.2 Implementation . 74 4.3 Use Cases . 74 4.3.1 Mapping a Dataset using Integrated Features . 74 4.3.2 Structural Clustering using Open Babel Fingerprints . 75 4.4 Empirical Evaluation . 77 4.5 Conclusions . 81 5 VISUAL VALIDATION 83 5.1 Motivation . 83 5.2 Visual Validation of (Q)SAR Models . 85 5.2.1 New Features for Visual Validation . 85 5.2.2 Visually Validating (Q)SAR Models in CheS-Mapper 2.0 . 91 5.3 Use Cases . 94 5.3.1 Comparing (Q)SAR Models for Caco-2 Permeation . 94 5.3.2 Structural Clustering of COX-2 Data . 97 5.3.3 Investigating Input Features for Carcinogenicity Models . 100 5.3.4 Analyzing Applicability Domains for Fish Toxicity Prediction 103 5.4 Conclusions . 105 6 CONCLUSIONS 107 6.1 Summary . 107 6.2 Application to (Q)SAR Modeling . 108 6.3 Future Work . 108 Bibliography 111 X ASUPPLEMENTARYMATERIAL 127 A.1 A KNIME Workflow for Visually Validating LOO-CV . 127 XI LISTOFFIGURES Figure 2.1 2D and 3D structure of the compound diazepam, created with its SMILES (simplified molecular-input line-entry system) string. 6 Figure 2.2 3D structure of diazepam in MDL molfile format . 10 Figure 2.3 3D structure of diazepam in chemical markup language (CML), shortened . 10 Figure 2.4 The workflow of the lazar framework, with regard to the configurable algorithms for descriptor calculation, chemical similarity calculation, and local (Q)SAR models. 21 Figure 2.5 10-fold cross-validation . 24 Figure 2.6 External validation using a single test set . 25 Figure 2.7 Scatterplots
Recommended publications
  • Different Aspects of Web Log Mining
    Different Aspects of Web Log Mining Renáta Iváncsy, István Vajk Department of Automation and Applied Informatics, and HAS-BUTE Control Research Group Budapest University of Technology and Economics Goldmann Gy. tér 3, H-1111 Budapest, Hungary e-mail: {renata.ivancsy, vajk}@aut.bme.hu Abstract: The expansion of the World Wide Web has resulted in a large amount of data that is now freely available for user access. The data have to be managed and organized in such a way that the user can access them efficiently. For this reason the application of data mining techniques on the Web is now the focus of an increasing number of researchers. One key issue is the investigation of user navigational behavior from different aspects. For this reason different types of data mining techniques can be applied on the log file collected on the servers. In this paper three of the most important approaches are introduced for web log mining. All the three methods are based on the frequent pattern mining approach. The three types of patterns that can be used for obtain useful information about the navigational behavior of the users are page set, page sequence and page graph mining. Keywords: Pattern mining, Sequence mining, Graph Mining, Web log mining 1 Introduction The expansion of the World Wide Web (Web for short) has resulted in a large amount of data that is now in general freely available for user access. The different types of data have to be managed and organized such that they can be accessed by different users efficiently. Therefore, the application of data mining techniques on the Web is now the focus of an increasing number of researchers.
    [Show full text]
  • An XML-Based Approach for Warehousing and Analyzing
    Key words: XML data warehousing, Complex data, Multidimensional modelling, complex data analysis, OLAP. X-WACoDa: An XML-based approach for Warehousing and Analyzing Complex Data Hadj Mahboubi Jean-Christian Ralaivao Sabine Loudcher Omar Boussaïd Fadila Bentayeb Jérôme Darmont Université de Lyon (ERIC Lyon 2) 5 avenue Pierre Mendès-France 69676 Bron Cedex France E-mail: first_name.last_name@univ-lyon2.fr ABSTRACT Data warehousing and OLAP applications must nowadays handle complex data that are not only numerical or symbolic. The XML language is well-suited to logically and physically represent complex data. However, its usage induces new theoretical and practical challenges at the modeling, storage and analysis levels; and a new trend toward XML warehousing has been emerging for a couple of years. Unfortunately, no standard XML data warehouse architecture emerges. In this paper, we propose a unified XML warehouse reference model that synthesizes and enhances related work, and fits into a global XML warehousing and analysis approach we have developed. We also present a software platform that is based on this model, as well as a case study that illustrates its usage. INTRODUCTION Data warehouses form the basis of decision-support systems (DSSs). They help integrate production data and support On-Line Analytical Processing (OLAP) or data mining. These technologies are nowadays mature. However, in most cases, the studied activity is materialized by numeric and symbolic data, whereas data exploited in decision processes are more and more diverse
    [Show full text]
  • A New Approach for Web Usage Mining Using Artificial Neural Network
    International Journal of Engineering Science Invention ISSN (Online): 2319 – 6734, ISSN (Print): 2319 – 6726 www.ijesi.org || PP—12-16 A New Approach For Web usage Mining using Artificial Neural network Smaranika Mohapatra1, Kusumlata Jain2, Neelamani Samal3 1(Department of Computer Science & Engineering, MAIET, Mansarovar, Jaipur, India) 2(Department of Computer Science & Engineering, MAIET, Mansarovar, Jaipur, India) 3(Department of Computer Science & Engineering, GIET, Bhubaneswar, Odisha ,India ) Abstract: Web mining includes usage of data mining methods to discover patterns of use from the web data. Studies on Web usage mining using a Concurrent Clustering has shown that the usage trend analysis very much depends on the performance of the clustering of the number of requests. Despite the simplicity and fast convergence, k-means has some remarkable drawbacks. K-means does not only concern with this but with all other algorithms, using clusters with centers. K-means define the center of cluster formed by the algorithms to make clusters. In k-means, the free parameter is k and the results depend on the value of k. there is no general theoretical solution for finding an optimal value of k for any given data set. In this paper, Multilayer Perceptron Algorithm is introduced; it uses some extension neural network, in the process of Web Usage Mining to detect user's patterns. The process details the transformations necessaries to modify the data storage in the Web Servers Log files to an input of Multilayer Perceptron Keywords: Web Usage Mining, Clustering, Web Server Log File. Neural networks, Perceptron, MLP I. Introduction Web mining has been advanced into an autonomous research field.
    [Show full text]
  • Probabilistic Semantic Web Mining Using Artificial Neural Analysis 1.1 Data, Information, and Knowledge
    (IJCSIS) International Journal of Computer Science and Information Security, Vol. 7, No. 3, March 2010 Probabilistic Semantic Web Mining Using Artificial Neural Analysis 1.1 Data, Information, and Knowledge 1. T. KRISHNA KISHORE, 2. T.SASI VARDHAN, AND 3. N. LAKSHMI NARAYANA Abstract Most of the web user’s requirements are 1.1.1 Data search or navigation time and getting correctly matched result. These constrains can be satisfied Data are any facts, numbers, or text that with some additional modules attached to the can be processed by a computer. Today, existing search engines and web servers. This paper organizations are accumulating vast and growing proposes that powerful architecture for search amounts of data in different formats and different engines with the title of Probabilistic Semantic databases. This includes: Web Mining named from the methods used. Operational or transactional data such as, With the increase of larger and larger sales, cost, inventory, payroll, and accounting collection of various data resources on the World No operational data, such as industry Wide Web (WWW), Web Mining has become one sales, forecast data, and macro economic data of the most important requirements for the web Meta data - data about the data itself, such users. Web servers will store various formats of as logical database design or data dictionary data including text, image, audio, video etc., but definitions servers can not identify the contents of the data. 1.1.2 Information These search techniques can be improved by The patterns, associations, or relationships adding some special techniques including semantic among all this data can provide information.
    [Show full text]
  • Warehousing Complex Data from The
    Int. J. Web Engineering and Technology, Vol. x, No. x, xxxx 1 Warehousing complex data from the Web O. Boussa¨ıd∗, J. Darmont∗, F. Bentayeb and S. Loudcher ERIC, University of Lyon 2 5 avenue Pierre Mend`es-France 69676 Bron Cedex France Fax: +33 478 772 375 E-mail: firstname.lastname@univ-lyon2.fr ∗Corresponding authors Abstract: The data warehousing and OLAP technologies are now moving onto handling complex data that mostly originate from the Web. However, intagrating such data into a decision-support process requires their representation under a form processable by OLAP and/or data mining techniques. We present in this paper a complex data warehousing methodology that exploits XML as a pivot language. Our approach includes the inte- gration of complex data in an ODS, under the form of XML documents; their dimensional modeling and storage in an XML data warehouse; and their analysis with combined OLAP and data mining techniques. We also address the crucial issue of performance in XML warehouses. Keywords: Data warehousing, Web data, Complex data, ETL pro- cess, Dimensional modeling, XML warehousing, OLAP, Data mining, Performance. Reference to this paper should be made as follows: Boussa¨ıd, O., Darmont, J., Bentayeb, F. and Loudcher, S. (xxxx) ‘Warehousing com- plex data from the Web’, Int. J. Web Engineering and Technology, Vol. x, No. x, pp.xxx–xxx. Biographical notes: Omar Boussa¨ıd is an associate professor in com- puter science at the School of Economics and Management of the Uni- versity of Lyon 2, France. He received his PhD degree in computer science from the University of Lyon 1, France in 1988.
    [Show full text]
  • Structural Graph Clustering: Scalable Methods and Applications for Graph Classification and Regression
    TECHNISCHE UNIVERSITÄT MÜNCHEN Fakultät für Informatik Lehrstuhl für Bioinformatik Structural Graph Clustering: Scalable Methods and Applications for Graph Classification and Regression Dipl. Wirtsch.-Inf. Madeleine Seeland Vollständiger Abdruck der von der Fakultät für Informatik der Technischen Univer- sität München zur Erlangung des akademischen Grades eines Doktors der Naturwissenschaften genehmigten Dissertation. Vorsitzender: Univ.-Prof. Dr. Ernst W. Mayr Prüfer der Dissertation: 1. Univ.-Prof. Dr. Burkhard Rost 2. Univ.-Prof. Dr. Stefan Kramer Johannes Gutenberg Universität Mainz Die Dissertation wurde am 19.05.2014 bei der Technischen Universität München eingereicht und durch die Fakultät für Informatik am 04.08.2014 angenom- men. In loving memory of my father, Alfred Seeland. Abstract In recent years, the collection of complex data from various domains, such as bio- and cheminformatics, has been rapidly increasing due to advances in technology for generating and collecting data. The explosive growth in data has generated an urgent need for new technologies and automated tools capable of efficiently extracting useful hidden informa- tion from these data. Data mining, therefore, has become a research area with increasing importance. As much of the data collected is structural in nature, graph-based represen- tations have been intensively employed for modeling these data, making graph mining an important topic of research. This thesis focuses on the graph mining task of clustering objects in a graph database based on graph structure and studies several applications for classification and regression. The first part of the thesis introduces scalable methods for clustering large databases of small graphs by common scaffolds, i.e., the existence of one sufficiently large subgraph shared by all cluster elements.
    [Show full text]
  • Implementing Web Mining Into Cloud Computing C Mayanka
    International Journal of Scientific & Engineering Research, Volume 5, Issue 3, March-2014 62 ISSN 2229-5518 Implementing Web Mining into Cloud Computing C Mayanka Abstract— In this paper different types of web mining and their usage in cloud computing (CC) have been explained. Web mining can be broadly defined as discovery and analysis of useful information from the World Wide Web [1]. It is divided into three parts that is Web Content Mining, Web Structure Mining and Web Usage Mining. Each Mining technique is used in different ways in cloud i.e. Cloud Mining. Index Terms— Cloud Computing, Data-Centric Mining, Web Mining, Web Content Mining, Web Structure Mining, Web Usage Mining. —————————— —————————— 1 INTRODUCTION eb Mining is type of data mining which is used for web. session period. W Web Mining has become an emerging and an important c. User/Session Identification: This is most difficult trend because of the main reason that today storage of data task in which user and session are identified from has become enormous that retrieving and processing them has log file. This is difficult because of presence of become an overhead. Two different approaches were taken in any users on the same computer, proxy servers, initially for defining Web mining they were ‘process-centric dynamic addresses etc. view’ and ‘data-centric view’ [2] [1]. Process centric accounted the sequence of tasks whereas data centric accounted for the 2. Pattern Discovery: This data is analysed to find out types of web data that was being used in the mining process. the patterns in them. The second approach is taken into consideration widely in 3.
    [Show full text]
  • A Data Warehouse Is a Repository of Multiple Heterogeneous Data Sources Organized Under a Unified Schema at a Single Site to Facilitate Management Decision Making
    Question Bank Sub Code & Name: IT6702 & Data Warehousing & Data Mining Name of the faculty : Mr.K.Rajkumar Designation & Department : Asst Prof & CSE Regulation : 2013 Year & Semester : III / 06 Branch : CSE Section : A & B UNIT-1 DATA WAREHOUSING Part – A 1. What is data warehouse? A data warehouse is a repository of multiple heterogeneous data sources organized under a unified schema at a single site to facilitate management decision making. A data warehouse is a subject-oriented, time-variant and nonvolatile collection of data in support of management’s decision-making process. 2. What are the uses of multifeature cubes? Multifeature cubes, which compute complex queries involving multiple dependent aggregates at multiple granularity. These cubes are very useful in practice. Many complex data mining queries can be answered by multifeature cubes without any significant increase in computational cost, in comparison to cube computation for simple queries with standard data cubes. 3. What is Data mart? Data mart is a data store that is subsidiary to a data ware house of integrated data. The data mart is directed at a partition of data that is created for the use of a dedicated group of users. 4. What is data warehouse metadata? Metadata are data about data. When used in a data warehouse, metadata are the data that define warehouse objects. Metadata are created for the data names and definitions of the given warehouse. Additional metadata are created and captured for time stamping any extracted data, the source of the extracted data, and missing fields that have been added by data cleaning or integration processes.
    [Show full text]
  • Comparative Mining of B2C Web Sites by Discovering Web Database Schemas
    University of Windsor Scholarship at UWindsor Electronic Theses and Dissertations Theses, Dissertations, and Major Papers 2016 Comparative Mining of B2C Web Sites by Discovering Web Database Schemas Bindu Peravali University of Windsor Follow this and additional works at: https://scholar.uwindsor.ca/etd Recommended Citation Peravali, Bindu, "Comparative Mining of B2C Web Sites by Discovering Web Database Schemas" (2016). Electronic Theses and Dissertations. 5861. https://scholar.uwindsor.ca/etd/5861 This online database contains the full-text of PhD dissertations and Masters’ theses of University of Windsor students from 1954 forward. These documents are made available for personal study and research purposes only, in accordance with the Canadian Copyright Act and the Creative Commons license—CC BY-NC-ND (Attribution, Non-Commercial, No Derivative Works). Under this license, works must always be attributed to the copyright holder (original author), cannot be used for any commercial purposes, and may not be altered. Any other use would require the permission of the copyright holder. Students may inquire about withdrawing their dissertation and/or thesis from this database. For additional inquiries, please contact the repository administrator via email (scholarship@uwindsor.ca) or by telephone at 519-253-3000ext. 3208. Comparative Mining of B2C Web Sites by Discovering Web Database Schemas By Bindu Peravali A Thesis Submitted to the Faculty of Graduate Studies through the School of Computer Science in Partial Fulfillment of the Requirements for the Degree of Master of Science at the University of Windsor Windsor, Ontario, Canada 2016 © 2016 Bindu Peravali Comparative Mining of B2C Web Sites by Discovering Web Database Schemas By Bindu Peravali APPROVED BY: Dr.
    [Show full text]
  • Data Mining and Warehousing
    Data Mining and Warehousing MCA Third Year Paper No. XI School of Distance Education Bharathiar University, Coimbatore - 641 046 Author: Jahir Uddin Copyright © 2009, Bharathiar University All Rights Reserved Produced and Printed by EXCEL BOOKS PRIVATE LIMITED A-45, Naraina, Phase-I, New Delhi-110 028 for SCHOOL OF DISTANCE EDUCATION Bharathiar University Coimbatore-641 046 CONTENTS Page No. UNIT I Lesson 1 Data Mining Concept 7 Lesson 2 Data Mining with Database 27 Lesson 3 Data Mining Techniques 40 UNIT II Lesson 4 Data Mining Classification 53 UNIT III Lesson 5 Cluster Analysis 83 Lesson 6 Association Rules 108 UNIT IV Lesson 7 Data Warehouse 119 Lesson 8 Data Warehouse and OLAP Technology 140 UNIT V Lesson 9 Developing Data Warehouse 167 Lesson 10 Application of Data Warehousing and Mining in Government 175 Model Question Paper 181 DATA MINING AND WAREHOUSING SYLLABUS UNIT I Basic data mining tasks: Data mining versus knowledge discovery in databases - Data mining issues - data mining metrices - social implications of data mining - data mining from a database perspective. Data mining techniques: Introduction - a statistical perspective on data mining - similarity measures - decision trees - neural networks - genetic algorithms. UNIT II Classification: Introduction - statistical - based algorithms - distance - based algorithms - decision tree- based algorithms- neural network - based algorithms - rule-based algorithms - combining techniques. UNIT III Clustering: Introduction - Similarity and distance Measures - Outliers - Hierarchical Algorithms - Partitional Algorithms. Association rules: Introduction-large item sets - basic algorithms - parallel & distributed algorithms - comparing approaches - incremental rules - advanced association rules techniques - measuring the quality of rules. UNIT IV Data warehousing: An introduction - characteristic of a data warehouse - data mats - other aspects of data mart.
    [Show full text]
  • Web Mining for Information Retrieval
    ISSN XXXX XXXX © 2018 IJESC Research Article Volume 8 Issue No.4 Web Mining for Information Retrieval Shadab Irfan1, Subhajit Ghosh2 Research Scholar (SCSE)1, Professor (SCSE)2 Galgotias University Greater Noida, Uttar Pradesh, India Abstract: The present era is engulfed in data and it is quite difficult to churn the emergence of unstructured data in order to mine relevant information. In this paper we present a basic outline of web mining techniques which changed the scenario of today world and help in retrieval of information. Web Mining actually referred as mining of interesting pattern by using set of tools and techniques from the vast pool of web. It focuses on different aspects of web mining referred as Web Content Mining, Web Structure Mining and Web Usage Mining. The overview of these mining techniques which help in retrieving desired information are covered along with the approaches used and algorithm employed. Keywords: Web Mining; Web Content Mining, Web Structure Mining, Web Usage Mining I. INTRODUCTION A. TEXT MINING The present era has changed a lot during the current years and Text mining focuses on unstructured texts. It is a set of processes transformed the outlook of the peoples. They view the which employs statistical techniques to monitor unstructured text information from different perspective which helps them to find and thereby convert those documents into valuable structured the valuable information by aggregating various techniques and information. With rise of social networks it has been realized that tools. The World Wide Web has emerged as vast and popular information is a strategic asset made of text and sentiment repository of hyperlinked documents containing enormous analysis is a new focus point in the field of text mining.
    [Show full text]
  • Chapter I: Introduction to Data Mining
    Osmar R. Zaïane, 1999 CMPUT690 Principles of Knowledge Discovery in Databases Chapter I: Introduction to Data Mining We are in an age often referred to as the information age. In this information age, because we believe that information leads to power and success, and thanks to sophisticated technologies such as computers, satellites, etc., we have been collecting tremendous amounts of information. Initially, with the advent of computers and means for mass digital storage, we started collecting and storing all sorts of data, counting on the power of computers to help sort through this amalgam of information. Unfortunately, these massive collections of data stored on disparate structures very rapidly became overwhelming. This initial chaos has led to the creation of structured databases and database management systems (DBMS). The efficient database management systems have been very important assets for management of a large corpus of data and especially for effective and efficient retrieval of particular information from a large collection whenever needed. The proliferation of database management systems has also contributed to recent massive gathering of all sorts of information. Today, we have far more information than we can handle: from business transactions and scientific data, to satellite pictures, text reports and military intelligence. Information retrieval is simply not enough anymore for decision-making. Confronted with huge collections of data, we have now created new needs to help us make better managerial choices. These needs are automatic summarization of data, extraction of the “essence” of information stored, and the discovery of patterns in raw data. What kind of information are we collecting? We have been collecting a myriad of data, from simple numerical measurements and text documents, to more complex information such as spatial data, multimedia channels, and hypertext documents.
    [Show full text]