WHAT IS... Data Mining?
Mauro Maggioni

Mauro Maggioni is assistant professor of mathematics and computer science at Duke University. His email address is [email protected]. DOI: http://dx.doi.org/10.1090/noti831


Data collected from a variety of sources has been accumulating rapidly. Many fields of science have gone from being data-starved to being data-rich and needing to learn how to cope with large data sets. The rising tide of data also directly affects our daily lives, in which the computers surrounding us use data-crunching algorithms to help us in tasks ranging from finding the quickest route to our destination under current traffic conditions to automatically tagging our faces in pictures; from updating the prices of sale items in near real time to suggesting the next movie we might want to watch.

The general aim of data mining is to find useful and interpretable patterns in data. The term encompasses many diverse methods and therefore means different things to different people. Here we discuss some aspects of data mining potentially of interest to a broad audience of mathematicians.

Assume a sample data point $x_i$ (e.g., a picture) may be cast in the form of a long vector of numbers (e.g., the pixel intensities of an image): we represent it as a point in $\mathbb{R}^D$. Two types of related goals exist. One is to detect patterns in this set of points; the other is to predict a function on the data: given a training set $(x_i, f(x_i))_i$, we want to predict $f$ at points outside the training set. In the case of text documents or webpages, we might want to automatically label each document as belonging to an area of research; in the case of pictures, we might want to recognize faces; when suggesting the next movie to watch given a viewer's past ratings, $f$ consists of the ratings of unseen movies. Typically, $x_i$ is noisy (e.g., noisy pixel values), and so is $f(x_i)$ (e.g., mislabeled samples in the training set).

Of course mathematicians have long concerned themselves with high-dimensional problems. One example is studying solutions of PDEs as functions in infinite-dimensional function spaces and performing efficient computations by projecting the problem onto low-dimensional subspaces (via discretizations, finite elements, or operator compression) so that the reduced problem may be solved numerically on a computer. In the case of solutions of a PDE, the model for the data is specified: a lot of information about the PDE is known, and that information is exploited to predict the properties of the data and to construct low-dimensional projections. For the digital data discussed above, however, we typically have little information and poor models. We may start with crude models, measure their fitness to the data and their predictive ability, and, if those prove unsatisfactory, improve the models. This is one of the key processes in statistical modeling and data mining. It is not unlike what an applied mathematician does when modeling a complex physical system: one may start with simplifying assumptions to construct a "tractable" model, derive consequences of such a model (e.g., properties of the solutions) analytically and/or with simulations, and compare the results to the properties exhibited by the real-world physical system. New measurements and real-world simulations may be performed, and the fitness of the model reassessed and improved as needed for the next round of validation. While physics drives the modeling in applied mathematics, a new type of intuition, built on experience in the world of high-dimensional data sets rather than in the world of physics, drives the mathematician set to analyze such data, for whom "tractable" models are geometric or statistical models with a small number of parameters.
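The simplest instance of the prediction task described above is a nearest-neighbor rule: predict $f$ at a new point by its value at the closest training point. The following is a minimal sketch of that rule (illustrative code, not from the article; the names and toy data are invented):

```python
import numpy as np

def predict_nearest(X_train, f_train, x_new):
    """Predict f(x_new) by the value of f at the nearest training point.

    X_train: (n, D) array of training points in R^D
    f_train: (n,) array of possibly noisy values f(x_i)
    x_new:   (D,) query point outside the training set
    """
    dists = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distances in R^D
    return f_train[np.argmin(dists)]

# Toy usage: noisy samples of f(x) = ||x|| on the unit cube in D = 5 dimensions.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(100, 5))
f = np.linalg.norm(X, axis=1) + 0.01 * rng.normal(size=100)
print(predict_nearest(X, f, rng.uniform(0.0, 1.0, size=5)))
```

How many samples such a rule needs for a prescribed accuracy is exactly the question that the curse of dimensionality, discussed next, makes precise.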
One of the reasons for focusing on reducing the dimension is to enable computations, but a fundamental motivation is the so-called curse of dimensionality. One of its manifestations arises in the approximation of a 1-Lipschitz function on the unit cube, $f : [0,1]^D \to \mathbb{R}$ satisfying $|f(x) - f(y)| \le \|x - y\|$ for $x, y \in [0,1]^D$. To achieve uniform error $\epsilon$, given samples $(x_i, f(x_i))$, in general one needs at least one sample in each cube of side $\epsilon$, for a total of $\epsilon^{-D}$ samples, which is too large even for, say, $\epsilon = 10^{-1}$ and $D = 100$ (a rather small dimension in applications). A common assumption is that either the samples $x_i$ lie on a low-dimensional subset of $[0,1]^D$ and/or $f$ is not simply Lipschitz but has a smoothness that is suitably large, depending on $D$ (see references in [3]). Taking the former route, one assumes that the data lie on a low-dimensional subset of the high-dimensional ambient space, such as a low-dimensional hyperplane or unions thereof, or low-dimensional manifolds or rougher sets. Research problems require ideas from different areas of mathematics, including geometry, geometric measure theory, topology, and graph theory, with their tools for studying manifolds or rougher sets; probability and geometric functional analysis, for studying random samples and measures in high dimensions; harmonic analysis and approximation theory, with their ideas of multiscale analysis and function approximation; and numerical analysis, because we need efficient algorithms to analyze real-world data.

[Figure 1. Top: diffusion map embedding of the set of configurations of a small biomolecule (alanine dipeptide) from its 36-dimensional state space. The color encodes one of the dihedral angles $\varphi, \psi$ of the molecule, known to be essential to the dynamics [4]. This is a physical system where (approximate) equations of motion are known, but their structure is too complicated and the state space too high-dimensional to be amenable to analysis. Bottom: diffusion map of a data set consisting of 1161 Science News articles, each modeled by a 1153-dimensional vector of word frequencies, embedded in a low-dimensional space with diffusion maps, as described in the text and in [2]; points are labeled by topic (Anthropology, Astronomy, Social Sciences, Earth Sciences, Biology, Mathematics, Medicine, Physics).]

As a concrete example, consider the following construction. Given $n$ points $\{x_i\}_{i=1}^n \subset \mathbb{R}^D$ and $\epsilon > 0$, construct $W_{ij} = \exp(-\|x_i - x_j\|^2/\epsilon)$, $D_{ii} = \sum_j W_{ij}$, and the Laplacian matrix $L = I - D^{-1/2} W D^{-1/2}$ on the weighted graph $G$ with vertices $\{x_i\}$ and edges weighted by $W$. When the $x_i$ are sampled from a manifold $\mathcal{M}$ and $n$ tends to infinity, $L$ approximates (in a suitable sense) the Laplace-Beltrami operator on $\mathcal{M}$ [2], which is a completely intrinsic object. The random walk on $G$, with transition matrix $P = D^{-1}W$, approximates Brownian motion on $\mathcal{M}$. Consider, for a time $t > 0$, the so-called diffusion distance $d_t(x, y) := \|P^t(x, \cdot) - P^t(y, \cdot)\|_{L^2(G)}$ (see [2]).
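A minimal numerical sketch of this construction, with dense matrices and invented names (not code from the article); note that the $L^2(G)$ norm in the definition of $d_t$ weights coordinates by the stationary measure of the walk, which the plain Euclidean norm below ignores:

```python
import numpy as np

def diffusion_operators(X, eps):
    """Build the kernel W, normalized Laplacian L, and random-walk matrix P
    from an (n, D) array of points X, following the formulas in the text."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)  # ||x_i - x_j||^2
    W = np.exp(-sq / eps)                     # W_ij = exp(-||x_i - x_j||^2 / eps)
    deg = W.sum(axis=1)                       # D_ii = sum_j W_ij
    d_is = 1.0 / np.sqrt(deg)                 # diagonal of D^{-1/2}
    L = np.eye(len(X)) - d_is[:, None] * W * d_is[None, :]  # I - D^{-1/2} W D^{-1/2}
    P = W / deg[:, None]                      # P = D^{-1} W; each row sums to 1
    return W, L, P

def diffusion_distance(P, t, i, j):
    """Approximate d_t(x_i, x_j) as the l^2 distance between rows of P^t,
    for a positive integer time t."""
    Pt = np.linalg.matrix_power(P, t)
    return np.linalg.norm(Pt[i] - Pt[j])
```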
This distance is particularly useful for capturing clusters/groupings in the data, which are regions of fast diffusion connected by bottlenecks that slow diffusion. Let $1 = \lambda_0 \ge \lambda_1 \ge \cdots$ be the eigenvalues of $P$ and $\varphi_i$ the corresponding eigenvectors ($\varphi_0$, when $G$ is a web graph, is related to Google's PageRank). Consider a diffusion map $\Phi_d^t$ that embeds the graph in Euclidean space, where $\Phi_d^t(x) := (\sqrt{\lambda_1^t}\,\varphi_1(x), \ldots, \sqrt{\lambda_d^t}\,\varphi_d(x))$, for some $t > 0$ [2]. One can show that the Euclidean distance between $\Phi_d^t(x)$ and $\Phi_d^t(y)$ approximates $d_t(x, y)$, the diffusion distance at time scale $t$ between $x$ and $y$ on the graph $G$.

In Figure 1 we apply this technique to two completely different data sets. The first is a set of configurations of a small peptide, obtained by a molecular dynamics simulation: a point $x_i \in \mathbb{R}^{12 \times 3}$ contains the coordinates in $\mathbb{R}^3$ of the 12 atoms in the alanine dipeptide molecule (shown as an inset in Figure 1). The forces between the atoms in the molecule constrain the trajectories to lie close to low-dimensional sets in the 36-dimensional state space. In Figure 1 we apply the construction above and represent the diffusion map embedding of the configurations collected [4]. The second is a set of text documents (articles from Science News), each represented as a vector in $\mathbb{R}^{1153}$ whose $k$th coordinate is the frequency of the $k$th word in a 1153-word dictionary. The diffusion embedding in low dimensions reveals even lower-dimensional geometric structures, which turn out to be useful for understanding the dynamics of the peptide in the first data set and for automatically clustering documents by topic in the second. Ideas from probability (random samples), harmonic analysis (Laplacian), and geometry (manifolds) come together in these types of constructions.
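To compute the diffusion map in practice, one can diagonalize the symmetric conjugate $D^{-1/2} W D^{-1/2}$, which has the same eigenvalues as $P$, and convert its eigenvectors back to eigenvectors of $P$. A sketch along those lines, continuing the invented names from the previous snippet (illustrative, not the authors' code):

```python
import numpy as np

def diffusion_map(W, d=2, t=1):
    """Embed the graph with weight matrix W into R^d via the diffusion map
    Phi_d^t(x) = (sqrt(lam_1^t) phi_1(x), ..., sqrt(lam_d^t) phi_d(x))."""
    deg = W.sum(axis=1)
    d_is = 1.0 / np.sqrt(deg)
    S = d_is[:, None] * W * d_is[None, :]   # D^{-1/2} W D^{-1/2}, symmetric
    lam, V = np.linalg.eigh(S)              # eigenvalues in ascending order
    lam, V = lam[::-1], V[:, ::-1]          # reorder so 1 = lam_0 >= lam_1 >= ...
    lam = np.clip(lam, 0.0, None)           # guard against roundoff below zero
    phi = d_is[:, None] * V                 # columns are eigenvectors of P
    # Skip the trivial constant eigenvector phi_0 and keep coordinates 1..d.
    return np.sqrt(lam[1:d + 1] ** t) * phi[:, 1:d + 1]

# Toy usage: two noisy clusters in R^10, embedded in the plane.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (50, 10)), rng.normal(1, 0.1, (50, 10))])
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
emb = diffusion_map(np.exp(-sq / 0.5), d=2, t=2)
```

Clusters joined only through diffusion bottlenecks then appear as well-separated groups in the Euclidean embedding, which is what Figure 1 displays for the molecular-dynamics and Science News data sets.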