Mining Database Structure; Or, How to Build a Data Quality Browser


Tamraparni Dasu, Theodore Johnson, S. Muthukrishnan, Vladislav Shkapenyuk
AT&T Labs-Research
{tamr,johnsont,muthu,vshkap}@research.att.com

ACM SIGMOD 2002, June 4-6, Madison, Wisconsin, USA. Copyright 2002 ACM 1-58113-497-5/02/06.

ABSTRACT

Data mining research typically assumes that the data to be analyzed has been identified, gathered, cleaned, and processed into a convenient form. While data mining tools greatly enhance the ability of the analyst to make data-driven discoveries, most of the time spent in performing an analysis is spent in identifying, gathering, cleaning, and processing the data. Similarly, schema mapping tools have been developed to help automate the task of using legacy or federated data sources for a new purpose, but they assume that the structure of the data sources is well understood. However, the data sets to be federated may come from dozens of databases containing thousands of tables and tens of thousands of fields, with little reliable documentation about primary keys or foreign keys.

We are developing a system, Bellman, which performs data mining on the structure of the database. In this paper, we present techniques for quickly identifying which fields have similar values, identifying join paths, estimating join directions and sizes, and identifying structures in the database. The results of the database structure mining allow the analyst to make sense of the database content. This information can be used to, e.g., prepare data for data mining, find foreign key joins for schema mapping, or identify steps to be taken to prevent the database from collapsing under the weight of its complexity.

1. INTRODUCTION

A seeming invariant of large production databases is that they become disordered over time. The disorder arises from a variety of causes, including incorrectly entered data, incorrect use of the database (perhaps due to a lack of documentation), and use of the database to model unanticipated events and entities (e.g., new services or customer types). Administrators and users of these databases are under demanding time pressures and frequently do not have the time to carefully plan, monitor, and clean their database. For example, the sales force is more interested in making a sale than in correctly modeling a customer and entering all information related to the sale, and a provisioning group may promptly enter the services/circuits they provision but might not delete them as diligently.

Unfortunately, these disordered databases have a significant cost. Planning, analysis, and data mining are frustrated by incorrect or missing data. New projects which require access to multiple databases are difficult, expensive, and perhaps even impossible to implement.

A variety of tools have been developed for database cleaning [42] and for schema mapping [35], as we discuss in more detail below. In our experience, however, one is faced with the difficult problem of understanding the contents and structure of the database(s) at hand before they can be cleaned or have their schemas mapped. Large production databases often have hundreds to thousands of tables with thousands to tens of thousands of fields. Even in a clean database, discovering the database structure is difficult because of the scale of the problem.

Production databases often contain many additional problems which make understanding their structure much more difficult. Constructing an entity (e.g., a corporate customer or a data service offering) often requires many joins with long join paths, often across databases. The schema documentation is usually sparse and out-of-date. Foreign key dependencies are usually not maintained and may degrade over time. Conversely, tables may contain undocumented foreign keys. A table may contain heterogeneous entities, i.e., sets of rows in the table that have different join paths. The convention for recording information may differ from table to table (e.g., a customer name might be recorded in one field in one table, but in two or more fields in another).

As an aid to our data cleaning efforts, we have developed Bellman, a data quality browser. Bellman provides the usual query and schema navigation tools, as well as a collection of tools and services designed to help the user discover the structure in the database. Bellman uses database profiling [13] to collect summaries of the database tablespaces, tables, and fields. These summaries are displayed to the user in an interactive manner or are used for more complex queries. Bellman collects the conventional profiles (e.g., the number of rows in a table, the number of distinct values in a field, etc.), as well as more sophisticated profiles (which are one of the subjects of this paper).

In order to understand the structure of a database, it is necessary to understand how fields relate to one another. Bellman collects concise summaries of the values of the fields. These summaries allow Bellman to determine whether two fields can be joined, and if so, the direction of the join (e.g., one to many, many to many, etc.) and the size of the join result. Even when two fields cannot be joined, Bellman can use the field value summaries to determine whether they are textually similar, and whether the text of one field is likely to be contained in another. These questions can be posed as queries on the summarized information, with results returned in seconds. Using the summaries, Bellman can pose data mining queries such as:

• Find all of the joins (primary key, foreign key, or otherwise) between this table and any other table in the database.

• Given field F, find all sets of fields {F1, ..., Fk} such that the contents of F are likely to be a composite of the contents of F1 through Fk.

• Given a table T, does it have two (largely) disjoint subsets which join to tables T1 and T2 (i.e., is T heterogeneous)?

These data mining queries, and many others, can also be answered from the summaries, and therefore evaluated in seconds. This interactive database structure mining allows the user to discover the structure of the database, enabling the application of data cleaning, data mining, and schema mapping tools.

1.1 Related Work

The database research community has explored some aspects of the problem of data cleaning [42]. One aspect of this research addresses the problem of finding duplicate values in a table [22, 32, 37, 38]. More generally, one can perform approximate matching, in which join predicates can be based on string distance [37, 39, 20]. Our interest is in finding related fields among all fields in the database, rather than performing any particular join. In [8], the authors compare two methods for finding related fields; however, these are crude methods which depend heavily on schema information.

AJAX [14] and ARKTOS [46] are query systems designed to express and optimize data cleaning queries. However, the user must first determine the data cleaning process.

A related line of research is schema mapping, and especially the resolution of naming and structural conflicts [3, 27, 41, 36, 21]. While some work has been done to automatically detect database structure [21, 11, 33, 10, 43, 5], it is aimed at mapping particular pairs of fields, rather than summarizing …

In this paper, we make the following contributions:

• We develop several methods for finding related database fields using small summaries.

• We evaluate the size of the summaries required for accuracy.

• We present new algorithms for mining the structure of a database.

2. SUMMARIZING VALUES OF A FIELD

Our approach to database structure mining is to first collect summaries of the database. These summaries can be computed quickly, and they represent the relevant features of the database in a small amount of space. Our data mining algorithms operate from these summaries, and as a consequence are fast because the summaries are small.

Many of these summaries are quite simple, e.g., the number of tuples in each table and the number of distinct and null values of each field. Other summaries are more sophisticated and have significantly more power to reveal the structure of the database. In this section, we present these more sophisticated summaries and the algorithms which use them as input.

2.1 Set Resemblance

The resemblance of two sets A and B is ρ = |A ∩ B| / |A ∪ B|. The resemblance of two sets is a measure of how similar they are. These sets are computed from fields of a table by a query such as A = SELECT DISTINCT R.A. For our purposes, we are more interested in computing the size of the intersection of A and B, which can be computed from the resemblance by

    |A ∩ B| = ρ / (1 + ρ) · (|A| + |B|)    (1)

Our system profiles the number of distinct values of each field, so |A| and |B| are always available.

The real significance of the resemblance is that it can be easily estimated. Let H be the universe of values from which the elements of the sets are drawn, and let h : H → N map elements of H uniformly and "randomly" to the set of natural numbers N. Let s(A) = min_{a ∈ A} h(a). Then

    Pr[s(A) = s(B)] = ρ

That is, the indicator variable I[s(A) = s(B)] is a Bernoulli random variable with parameter ρ.
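To make the estimator concrete, here is a minimal sketch in Python (not Bellman's actual implementation): K independently seeded hash functions stand in for h, the minimum hash value per seed forms the signature of a field's distinct-value set, the fraction of agreeing signature positions estimates ρ, and equation (1) turns that estimate into an intersection size. The field values, signature length, and hash construction are illustrative assumptions.

```python
import hashlib

def minhash_signature(values, num_hashes=64):
    """Signature of a set: for each seeded hash function h,
    keep the minimum s(A) = min_{a in A} h(a) over the set."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{v}".encode()).digest()[:8], "big")
            for v in values
        ))
    return signature

def estimate_resemblance(sig_a, sig_b):
    """Each position matches with probability rho (a Bernoulli trial),
    so the fraction of matching positions estimates the resemblance."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)

def estimate_intersection_size(rho, size_a, size_b):
    """Equation (1): |A ∩ B| = rho / (1 + rho) * (|A| + |B|)."""
    return rho / (1 + rho) * (size_a + size_b)

# Illustrative distinct-value sets of two fields
# (e.g., the result of SELECT DISTINCT circuit_id FROM ...).
A = {"ckt-1001", "ckt-1002", "ckt-1003", "ckt-1004"}
B = {"ckt-1003", "ckt-1004", "ckt-1005"}

rho = estimate_resemblance(minhash_signature(A), minhash_signature(B))
print(f"estimated resemblance: {rho:.2f}")  # true value is 2/5 = 0.40
print(f"estimated |A ∩ B|: {estimate_intersection_size(rho, len(A), len(B)):.1f}")
```

Because each signature position is a Bernoulli(ρ) trial, a 64-position signature estimates ρ with standard error sqrt(ρ(1 − ρ)/64); longer signatures trade summary space for accuracy.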
Recommended publications
  • A Study on Social Network Analysis Through Data Mining Techniques – a Detailed Survey
    Annie Syrien (Research Scholar) and M. Hanumanthappa (Professor), Department of Computer Science and Applications, Bangalore University, India. IJARCCE, Vol. 5, Special Issue 2, October 2016.

    Abstract: The paper aims to provide a detailed study of data collection, data preprocessing and the various methods used in developing useful algorithms or methodologies for social network analysis in social media. Recent trends and advancements in big data have led many researchers to focus on social media. Web-enabled devices are another important reason for this advancement: electronic media such as tablets, mobile phones, desktops, laptops and notepads enable users to actively participate in different social networking systems. Much research has also shown the advantages and challenges that social media has posed to the research world. The principal objective of this paper is to provide an overview of social media research carried out in recent years.

    Keywords: data mining, social media, big data.

    I. INTRODUCTION: Data mining has been an important technique in social media in recent years, as it helps extract a great deal of useful information, and this information turns out to be an important asset to both academia and industry … in scrutiny and analysis. In any industry, reports based on data mining techniques become the baseline for making and formulating decisions. This report …
  • Different Aspects of Web Log Mining
    Renáta Iváncsy and István Vajk, Department of Automation and Applied Informatics, and HAS-BUTE Control Research Group, Budapest University of Technology and Economics, Goldmann Gy. tér 3, H-1111 Budapest, Hungary. E-mail: {renata.ivancsy, vajk}@aut.bme.hu

    Abstract: The expansion of the World Wide Web has resulted in a large amount of data that is now freely available for user access. The data have to be managed and organized in such a way that users can access them efficiently. For this reason, the application of data mining techniques on the Web is now the focus of an increasing number of researchers. One key issue is the investigation of user navigational behavior from different aspects. To this end, different types of data mining techniques can be applied to the log files collected on the servers. In this paper, three of the most important approaches to web log mining are introduced. All three methods are based on the frequent pattern mining approach. The three types of patterns that can be used to obtain useful information about the navigational behavior of users are page set, page sequence and page graph mining.

    Keywords: Pattern mining, Sequence mining, Graph mining, Web log mining

    1 Introduction: The expansion of the World Wide Web (Web for short) has resulted in a large amount of data that is now in general freely available for user access. The different types of data have to be managed and organized such that they can be accessed by different users efficiently. Therefore, the application of data mining techniques on the Web is now the focus of an increasing number of researchers.
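    As a small illustration of the page-set flavor of frequent pattern mining this abstract describes, the sketch below counts co-occurring page sets in sessionized server logs. The sessions, page names, and support threshold are invented for illustration; a real web log miner would use an optimized Apriori or FP-growth implementation rather than brute-force enumeration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sessionized server log: each session is the set of pages visited.
sessions = [
    {"/home", "/products", "/cart"},
    {"/home", "/products"},
    {"/home", "/blog"},
    {"/home", "/products", "/cart", "/checkout"},
]

min_support = 2  # a page set is "frequent" if it occurs in at least 2 sessions

counts = Counter()
for session in sessions:
    for size in (1, 2, 3):
        for page_set in combinations(sorted(session), size):
            counts[page_set] += 1

frequent = {s: c for s, c in counts.items() if c >= min_support}
for page_set, support in sorted(frequent.items(), key=lambda kv: -kv[1]):
    print(support, page_set)
```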
  • GeoSeq2Seq: Information Geometric Sequence-to-Sequence Networks
    Alessandro Bay (Cortexica Vision Systems Ltd., London, UK) and Biswa Sengupta (Noah's Ark Lab, Huawei Technologies UK; Imperial College London, London, UK). [email protected]

    ABSTRACT: The Fisher information metric is an important foundation of information geometry, wherein it allows us to approximate the local geometry of a probability distribution. Recurrent neural networks such as the Sequence-to-Sequence (Seq2Seq) networks that have lately been used to yield state-of-the-art performance on speech translation or image captioning have so far ignored the geometry of the latent embedding that they iteratively learn. We propose the information geometric Seq2Seq (GeoSeq2Seq) network, which bridges the gap between deep recurrent neural networks and information geometry. Specifically, the latent embedding offered by a recurrent network is encoded as a Fisher kernel of a parametric Gaussian Mixture Model, a formalism common in computer vision. We utilise such a network to predict the shortest routes between two nodes of a graph by learning the adjacency matrix using the GeoSeq2Seq formalism; our results show that for such a problem the probabilistic representation of the latent embedding supersedes the non-probabilistic embedding by 10-15%.

    1 INTRODUCTION: Information geometry situates itself at the intersection of probability theory and differential geometry, wherein it has found utility in understanding the geometry of a wide variety of probability models (Amari & Nagaoka, 2000). By virtue of Cencov's characterisation theorem, the metric on such probability manifolds is described by a unique kernel known as the Fisher information metric. In statistics, the Fisher information is simply the variance of the score function.
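    For reference, the two textbook formulas behind this abstract can be written out explicitly; this is standard material from information geometry, not anything specific to the GeoSeq2Seq paper. Since the score function has zero mean, the Fisher information is exactly the variance of the score, and a Fisher kernel compares two inputs through their metric-normalized score vectors:

```latex
% Fisher information: the variance of the score function
% \partial_\theta \log p(x;\theta), whose mean under p(x;\theta) is zero.
I(\theta) = \mathbb{E}_{x \sim p(x;\theta)}
    \left[ \left( \frac{\partial}{\partial \theta} \log p(x;\theta) \right)^{2} \right]

% Fisher kernel between inputs x and x', built from score vectors
% g_x = \nabla_\theta \log p(x;\theta) (here, of a Gaussian mixture model).
K(x, x') = g_x^{\top} \, I(\theta)^{-1} \, g_{x'}
```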
  • Data Mining and Soft Computing
    O. Ciftcioglu, Faculty of Architecture, Delft University of Technology, Delft, The Netherlands. © 2002 WIT Press, Ashurst Lodge, Southampton, SO40 7AA, UK. Paper from: Data Mining III, A. Zanasi, C. A. Brebbia, N. F. F. Ebecken & P. Melli (Editors). ISBN 1-85312-925-9.

    Abstract: The term data mining refers to information elicitation. Soft computing, on the other hand, deals with information processing. If these two key properties can be combined in a constructive way, the combination can be used effectively for knowledge discovery in large databases. With reference to this synergetic combination, the basic merits of the data mining and soft computing paradigms are pointed out, and a novel data mining implementation coupled to a soft computing approach for knowledge discovery is presented. Knowledge modeling by machine learning together with computer experiments is described, and the effectiveness of the machine learning approach employed is demonstrated.

    1 Introduction: Although the concept of data mining can be defined in several ways, it is clear enough that it is related to information extraction, especially in large databases. The methods of data mining are essentially borrowed from the exact sciences, among which statistics plays the essential role. By contrast, soft computing is essentially used for information processing, employing methods which are capable of dealing with the imprecision and uncertainty especially present in ill-defined problem areas. In the former case, the outcomes are exact within the estimated error bounds. In the latter, the outcomes are approximate, and in some cases they may be interpreted as the outcomes of intelligent behavior.
  • Data Mining with Cloud Computing: An Overview
    Miss Rohini A. Dhote (Department of Computer Science & Information Technology, HVPM COET, Amravati, Maharashtra, India) and Dr. S. P. Deshpande (HOD, Computer Science Technology Department, HVPM COET, Amravati, Maharashtra, India). IJARCET, Volume 5, Issue 1, January 2016.

    Abstract: Data mining is a process of extracting potentially useful information from data, so as to improve the quality of information service. The integration of data mining techniques with cloud computing allows users to extract useful information from a data warehouse while reducing the costs of infrastructure and storage. Data mining has attracted a great deal of attention in the information industry and in society as a whole in recent years, owing to the wide availability of huge amounts of data and the imminent need for turning such data into useful information and knowledge for applications such as market analysis, fraud detection, customer retention, production control and science exploration. Data mining can be viewed as a result of the natural evolution of information technology.

    … competitive environment for data analysis. The cloud makes it possible for you to access your information from anywhere at any time, and removes the need for you to be in the same physical location as the hardware that stores your data. The use of cloud computing is gaining popularity due to its mobility, huge availability and low cost.

    2. DATA MINING: Data mining involves the use of sophisticated data analysis tools to discover previously unknown, valid patterns and relationships in large data sets. These tools can include statistical models, mathematical …
  • Challenges in Bioinformatics for Statistical Data Miners
    This article appeared in the October 2003 issue of the "Bulletin of the Swiss Statistical Society" (46; 10-17), and is reprinted by permission of its editor.

    Dr. Diego Kuonen, Statoo Consulting, PSE-B, 1015 Lausanne 15, Switzerland. [email protected]

    Abstract: Starting with possible definitions of statistical data mining and bioinformatics, this article gives a general side-by-side view of both fields, emphasising the statistical data mining part and its techniques, illustrates possible synergies, and discusses how statistical data miners may collaborate on bioinformatics challenges in order to unlock the secrets of the cell.

    What is statistics and why is it needed? Statistics is the science of "learning from data". It includes everything from planning for the collection of data and subsequent data management to end-of-the-line activities, such as drawing inferences from numerical facts called data and presenting the results. Statistics is concerned with one of the most basic of human needs: the need to find out more about the world and how it operates in the face of variation and uncertainty. Because of the increasing use of statistics, it has become very important to understand and practise statistical thinking. Or, in the words of H. G. Wells: "Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write".

    But why is statistics needed? Knowledge is what we know. Information is the communication of knowledge. Data are known to be crude information, not knowledge by themselves. The way leading from data to knowledge is as follows: from data to information (data become information when they become relevant to the decision problem); from information to facts (information becomes facts when the data can support it); and finally, from facts to knowledge (facts become knowledge when they are used in the successful completion of the decision process).
  • The Theoretical Framework of Data Mining & Its
    THE THEORETICAL FRAMEWORK OF DATA MINING & ITS TECHNIQUES

    Sayyad Rasheeduddin, Associate Professor and HOD, CSE Department, Pulla Reddy Engineering College, Wargal, Medak (Dist), A.P. IJSSIR, Vol. 2(1), January 2013.

    ABSTRACT: Data mining is a process which finds useful patterns in large amounts of data. Research in databases and information technology has given rise to an approach to store and manipulate this precious data for further decision making. Data mining is a process of extraction of useful information and patterns from huge data. It is also called the knowledge discovery process, knowledge mining from data, knowledge extraction or data/pattern analysis. The current paper discusses the following objectives: 1. To discuss briefly the concept of data mining; 2. To explain the data mining algorithms and techniques; 3. To know about the data mining applications.

    Keywords: Concept of data mining, data mining techniques, data mining algorithms, data mining applications.

    Introduction to Data Mining: The development of information technology has generated a large number of databases and huge data in various areas. Research in databases and information technology has given rise to an approach to store and manipulate this precious data for further decision making. Data mining is a process of extraction of useful information and patterns from huge data. It is also called the knowledge discovery process, knowledge mining from data, knowledge extraction or data/pattern analysis. The current paper discusses the following objectives: 1. To discuss briefly the concept of data mining; 2. …
  • Reinforcement Learning for Data Mining Based Evolution Algorithm
    Reinforcement Learning for Data Mining Based Evolution Algorithm for TSK-type Neural Fuzzy Systems Design

    Sheng-Fuu Lin, Yi-Chang Cheng, Jyun-Wei Chang, and Yung-Chi Hsu, Department of Electrical and Control Engineering, National Chiao Tung University, Hsinchu, Taiwan, R.O.C.

    ABSTRACT: This paper proposes a TSK-type neural fuzzy system (TNFS) with reinforcement learning for a data-mining-based evolution algorithm (R-DMEA). In R-DMEA, group-based symbiotic evolution (GSE) is adopted, in which each group represents the set of a single fuzzy rule. R-DMEA contains both structure learning and parameter learning. In structure learning, the proposed R-DMEA uses a two-step self-adaptive algorithm to decide the suitable number of rules in a TNFS. In parameter learning, R-DMEA uses a mining-based selection strategy to decide the selected groups in GSE, using the data mining method called frequent pattern growth. An illustrative example is conducted to show the performance and applicability of R-DMEA.

    … membership functions of a fuzzy controller, with the fuzzy rule set assigned in advance. Juang [5] proposed the CQGAF to simultaneously design the number of fuzzy rules and the free parameters in a fuzzy system. In [6], group-based symbiotic evolution (GSE) was proposed to solve the problem that all the fuzzy rules were encoded into one chromosome in the traditional genetic algorithm. In the GSE, each chromosome represents a single rule of the fuzzy system. The GSE uses a group-based population to evaluate each fuzzy rule locally. Although GSE performs better than traditional symbiotic evolution, the combination of groups is fixed. Although the above evolution learning algorithms [4]-[6] …
  • Machine Learning and Data Mining
    Tom M. Mitchell. Communications of the ACM, November 1999, Vol. 42, No. 11.

    Machine learning algorithms enable discovery of important "regularities" in large data sets. Over the past decade, many organizations have begun to routinely capture huge volumes of historical data describing their operations, products, and customers. At the same time, scientists and engineers in many fields have been capturing increasingly complex experimental data sets, such as gigabytes of functional magnetic resonance imaging (MRI) data describing brain activity in humans. The field of data mining addresses the question of how best to use this historical data to discover general regularities and improve the process of making decisions.

    The increasing interest in data mining, or the use of historical data to discover regularities and improve future decisions, follows from the confluence of several recent trends: the falling cost of large data storage devices and the increasing ease of collecting data over networks; the …

    Figure 1. Data mining application. A historical set of 9,714 medical records describes pregnant women over time. The top portion is a typical patient record ("?" indicates the feature value is unknown). The task for the algorithm is to discover rules that predict which future patients will be at high risk of requiring an emergency C-section delivery. The bottom portion shows one of many rules discovered from this data. Whereas 7% of all pregnant women in the data set received emergency C-sections, the rule identifies a subclass at 60% risk of needing C-sections.
  • An XML-Based Approach for Warehousing and Analyzing
    X-WACoDa: An XML-based Approach for Warehousing and Analyzing Complex Data

    Hadj Mahboubi, Jean-Christian Ralaivao, Sabine Loudcher, Omar Boussaïd, Fadila Bentayeb, Jérôme Darmont. Université de Lyon (ERIC Lyon 2), 5 avenue Pierre Mendès-France, 69676 Bron Cedex, France. E-mail: [email protected]

    Key words: XML data warehousing, complex data, multidimensional modelling, complex data analysis, OLAP.

    ABSTRACT: Data warehousing and OLAP applications must nowadays handle complex data that are not only numerical or symbolic. The XML language is well suited to logically and physically representing complex data. However, its usage induces new theoretical and practical challenges at the modeling, storage and analysis levels, and a trend toward XML warehousing has been emerging for a couple of years. Unfortunately, no standard XML data warehouse architecture has emerged. In this paper, we propose a unified XML warehouse reference model that synthesizes and enhances related work, and fits into a global XML warehousing and analysis approach we have developed. We also present a software platform that is based on this model, as well as a case study that illustrates its usage.

    INTRODUCTION: Data warehouses form the basis of decision-support systems (DSSs). They help integrate production data and support On-Line Analytical Processing (OLAP) or data mining. These technologies are nowadays mature. However, in most cases the studied activity is materialized by numeric and symbolic data, whereas the data exploited in decision processes are more and more diverse …
  • Big Data, Data Mining and Machine Learning
    Excerpt from Big Data, Data Mining, and Machine Learning: Value Creation for Business Leaders and Practitioners by Jared Dean. Copyright © 2014 SAS Institute Inc., Cary, North Carolina, USA. All rights reserved.

    Contents

    Foreword xiii
    Preface xv
    Acknowledgments xix
    Introduction 1
        Big Data Timeline 5
        Why This Topic Is Relevant Now 8
        Is Big Data a Fad? 9
        Where Using Big Data Makes a Big Difference 12

    Part One: The Computing Environment 23

    Chapter 1: Hardware 27
        Storage (Disk) 27
        Central Processing Unit 29
        Memory 31
        Network 33

    Chapter 2: Distributed Systems 35
        Database Computing 36
        File System Computing 37
        Considerations 39

    Chapter 3: Analytical Tools 43
        Weka 43
        Java and JVM Languages 44
        R 47
        Python 49
        SAS 50

    Part Two: Turning Data into Business Value 53

    Chapter 4: Predictive Modeling 55
        A Methodology for Building Models 58
        sEMMA 61
        Binary Classification 64
        Multilevel Classification 66
        Interval Prediction 66
        Assessment of Predictive Models 67

    Chapter 5: Common Predictive Modeling Techniques 71
        RFM 72
        Regression 75
        Generalized Linear Models 84
        Neural Networks 90
        Decision and Regression Trees 101
        Support Vector Machines 107
        Bayesian Methods Network Classification 113
        Ensemble Methods 124

    Chapter 6: Segmentation 127
        Cluster Analysis 132
        Distance Measures (Metrics) 133
        Evaluating Clustering 134
        Number of Clusters 135
        K-means Algorithm 137
        Hierarchical Clustering 138
        Profiling Clusters 138

    Chapter 7: Incremental Response Modeling 141
        Building the Response …
  • A New Approach for Web Usage Mining Using Artificial Neural Network
    Smaranika Mohapatra (Department of Computer Science & Engineering, MAIET, Mansarovar, Jaipur, India), Kusumlata Jain (Department of Computer Science & Engineering, MAIET, Mansarovar, Jaipur, India) and Neelamani Samal (Department of Computer Science & Engineering, GIET, Bhubaneswar, Odisha, India). IJESI, pp. 12-16.

    Abstract: Web mining applies data mining methods to discover usage patterns from web data. Studies of web usage mining using concurrent clustering have shown that usage trend analysis depends heavily on how well the requests are clustered. Despite its simplicity and fast convergence, k-means has some notable drawbacks, and these concern not only k-means but all algorithms that build clusters around centers. In k-means, the free parameter is k, and the results depend on its value; there is no general theoretical solution for finding an optimal value of k for any given data set. In this paper, a multilayer perceptron algorithm is introduced; it uses an extension of neural networks in the web usage mining process to detect users' patterns. The process details the transformations necessary to convert the data stored in web server log files into input for the multilayer perceptron.

    Keywords: Web usage mining, clustering, web server log file, neural networks, perceptron, MLP.

    I. Introduction: Web mining has advanced into an autonomous research field.
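    A minimal sketch of the kind of pipeline this abstract suggests, using scikit-learn's MLPClassifier as the multilayer perceptron; the session features, labels, and network size are invented for illustration and are not taken from the paper.

```python
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Hypothetical features extracted from sessionized web server log files:
# [pages_visited, avg_seconds_per_page, fraction_of_product_pages]
X = [
    [3, 45.0, 0.33], [12, 20.0, 0.75], [2, 90.0, 0.00], [8, 30.0, 0.62],
    [5, 60.0, 0.20], [15, 15.0, 0.80], [4, 70.0, 0.25], [10, 25.0, 0.70],
]
# Hypothetical usage-pattern labels: 1 = "buyer" session, 0 = "browser" session.
y = [0, 1, 0, 1, 0, 1, 0, 1]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# A small multilayer perceptron. Unlike k-means there is no free parameter k,
# although the hidden-layer width is still a design choice.
mlp = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
mlp.fit(X_train, y_train)
print("held-out accuracy:", mlp.score(X_test, y_test))
```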