Introduction to Data Mining
© Jones & Bartlett Learning, LLC. NOT FOR SALE OR DISTRIBUTION.

CHAPTER 1 Introduction to Data Mining

■■ 1.1 TRADITIONAL DATABASE MANAGEMENT SYSTEMS

A traditional Database Management System (DBMS) provides generic software tools and environments that support the development of a database application. A DBMS environment supports the two main functions of database development and operations: data definition and data manipulation. During the data definition stage of database development, the structural components of a database (such as table structures, primary keys, and the number of tables for the database) are determined. The tasks of data manipulation include data storage, data modification, and data retrieval through the use of queries. This concept is depicted in Figure 1.1. Other tools of a DBMS include utilities, report generators, form generators, development tools, design aids, and transaction managers.

In Figure 1.1, the application developers are those who perform logical database design and database programming for physical database development.
They design Entity Relationship Diagrams (ERDs), define and construct relational tables, determine primary and foreign keys of each table, and specify data types. Database refinement and maintenance tasks are also performed by application developers. They select a particular software product from a particular vendor that can provide all the necessary tools for the design requirements.

Any DBMS software product should provide all functions of data definition and data manipulation. In addition, a DBMS software product should provide data security and integrity functions. Through these functions, the DBMS can monitor user queries, and any attempts to violate security and
integrity constraints will be denied. The data dictionary is another function that must be supported by the DBMS. The data dictionary contains metadata, which is data about data, and generally contains various data schemas and mappings, security and integrity constraints, and virtually anything about data that can be useful to a database developer for development and maintenance. Data recovery and concurrency functions must also be supported by the DBMS together with performance-tuning functions, since all DBMS functions should be performed as efficiently as possible.

■■ FIGURE 1.1 A simplified functional structure of a DBMS. [Figure not reproduced. Labels recovered from the figure: Tables (Attributes), Keys, Data Types, Relationships, Data Requirements, Data Definition, Database Design, Database, Queries, Data Manipulation, Database Application Developer, Query Requirements, Insert, Delete, Create, Retrieve, Modify, End User, Requests, Results.]

The task of end users is to simply submit requests.
Queries allow users to insert, delete, create, retrieve, and modify data in databases. Hence, user requests are submitted as database queries to a database through the DBMS, which in turn returns the result of the query execution. The efficiency of a database system is often measured by the convenience and the ease of query writing and data manipulation. Users feel more comfortable when it is easier for them to manipulate the system and write the desired queries. Therefore, the system should allow users to access all parts of the database and to form queries as easily as possible. Once the queries are formed and submitted, it is the task of the DBMS to interpret the query and execute it.

A DBMS software component is often available to allow queries to be specified in a flexible manner, such as “forms” or Query By Example (QBE). The presentation of query results is another area that affects usability. Analysis and interpretation of query results are needed if they are to be meaningful.
DBMS software products often provide “report” capabilities to help users with the interpretation of query results. However, there is still one potential issue, namely confirming their validity. If a query returns incorrect results, it must be decided whether the query was wrongly formed or the database contains invalid data. Although many commercially available software products support integrity constraints, the task of proving the correctness of query results has been a major challenge in database application system development.

In general, the effectiveness of database applications depends on the type and accuracy of the data in the database, the flexibility and convenience of user queries, clear and accurate interpretation of query execution results, and validation of query results. If the data in a database is incorrect, then regardless of the query, the result is also incorrect. Data cleansing, noise reduction, and noise removal methods may partially solve this problem, but any missing and/or invalid data must be replaced with correct data in order for the correct result to be obtained. Furthermore, convenient tools for query generation minimize potential problems and errors. The interpretation of query results is also a very important task because this information will be used for decision making. Finally, providing justification of query results is also very important because it gives confidence and assurance to the user regarding the accuracy of the resulting data.
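
The DBMS functions described in this section can be illustrated with a small sketch. The example below is not taken from the text: it assumes Python's standard `sqlite3` module as a stand-in DBMS, and the table names, column names, and rows (`department`, `employee`, "Ada", and so on) are hypothetical. It shows data definition (tables, keys, data types), data manipulation (insert, modify, delete, retrieve), an insert that is denied because it violates an integrity constraint, and a lookup in SQLite's catalog table, which plays a role similar to a data dictionary.

```python
import sqlite3


def build_demo_db():
    """Data definition: create two related tables in an in-memory database.

    Table and column names are hypothetical, chosen only for illustration.
    """
    conn = sqlite3.connect(":memory:")
    # SQLite enforces foreign keys only when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute(
        "CREATE TABLE department ("
        " dept_id INTEGER PRIMARY KEY,"
        " name TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE employee ("
        " emp_id INTEGER PRIMARY KEY,"
        " name TEXT NOT NULL,"
        " dept_id INTEGER NOT NULL REFERENCES department(dept_id))"
    )
    return conn


def run_demo():
    conn = build_demo_db()

    # Data manipulation: insert, modify, delete, and retrieve rows.
    conn.execute("INSERT INTO department VALUES (1, 'Research')")
    conn.execute("INSERT INTO employee VALUES (10, 'Ada', 1)")
    conn.execute("INSERT INTO employee VALUES (11, 'Grace', 1)")
    conn.execute("UPDATE employee SET name = 'Ada L.' WHERE emp_id = 10")
    conn.execute("DELETE FROM employee WHERE emp_id = 11")
    rows = conn.execute(
        "SELECT emp_id, name, dept_id FROM employee ORDER BY emp_id"
    ).fetchall()

    # Integrity: an employee row that points at a nonexistent department
    # violates the foreign key, so the DBMS rejects the statement.
    try:
        conn.execute("INSERT INTO employee VALUES (12, 'Edgar', 99)")
        denied = False
    except sqlite3.IntegrityError:
        denied = True

    # Metadata: sqlite_master is SQLite's catalog of schema objects.
    tables = [
        r[0]
        for r in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    conn.close()
    return rows, denied, tables


if __name__ == "__main__":
    print(run_demo())
```

Note that the constraint only guards against structurally invalid rows. A row that satisfies every constraint but holds a wrong value is still accepted, which is why validating query results remains a separate problem.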

■■ 1.2 KNOWLEDGE DISCOVERY IN DATABASES

The rapid and constant growth of databases in business, government, and science has far outpaced our ability to interpret and make sense of this data avalanche, creating a need for a new generation of tools and techniques for intelligent and automated analysis of databases. These tools and techniques are the subject of the rapidly emerging field of data mining and Knowledge Discovery in Databases (KDD).

KDD is a process that takes a large amount of unprocessed and raw data stored in a data warehouse, transforms it into meaningful patterns,