Oracle Data Mining Concepts, 10g Release 1 (10.1), Part No. B10698-01
Oracle® Data Mining Concepts
10g Release 1 (10.1)
Part No. B10698-01
December 2003

Copyright © 2003 Oracle. All rights reserved.

Primary Authors: Margaret Taft, Ramkumar Krishnan, Mark Hornick, Denis Mukhin, George Tang, Shiby Thomas.

Contributors: Charlie Berger, Marcos Campos, Boriana Milenova, Pablo Tamayo, Gina Abeles, Joseph Yarmus, Sunil Venkayala.

The Programs (which include both the software and documentation) contain proprietary information of Oracle Corporation; they are provided under a license agreement containing restrictions on use and disclosure and are also protected by copyright, patent and other intellectual and industrial property laws. Reverse engineering, disassembly or decompilation of the Programs, except to the extent required to obtain interoperability with other independently created software or as specified by law, is prohibited.

The information contained in this document is subject to change without notice. If you find any problems in the documentation, please report them to us in writing. Oracle Corporation does not warrant that this document is error-free. Except as may be expressly permitted in your license agreement for these Programs, no part of these Programs may be reproduced or transmitted in any form or by any means, electronic or mechanical, for any purpose, without the express written permission of Oracle Corporation.

If the Programs are delivered to the U.S. Government or anyone licensing or using the programs on behalf of the U.S. Government, the following notice is applicable:

Restricted Rights Notice
Programs delivered subject to the DOD FAR Supplement are "commercial computer software" and use, duplication, and disclosure of the Programs, including documentation, shall be subject to the licensing restrictions set forth in the applicable Oracle license agreement.
Otherwise, Programs delivered subject to the Federal Acquisition Regulations are "restricted computer software" and use, duplication, and disclosure of the Programs shall be subject to the restrictions in FAR 52.227-19, Commercial Computer Software - Restricted Rights (June, 1987). Oracle Corporation, 500 Oracle Parkway, Redwood City, CA 94065.

The Programs are not intended for use in any nuclear, aviation, mass transit, medical, or other inherently dangerous applications. It shall be the licensee's responsibility to take all appropriate fail-safe, backup, redundancy, and other measures to ensure the safe use of such applications if the Programs are used for such purposes, and Oracle Corporation disclaims liability for any damages caused by such use of the Programs.

Oracle is a registered trademark, and PL/SQL and SQL*Plus are trademarks or registered trademarks of Oracle Corporation. Other names may be trademarks of their respective owners.

Contents

Send Us Your Comments
Preface

1 Introduction to Oracle Data Mining
  1.1 What is Data Mining?
  1.2 What Is Oracle Data Mining?
    1.2.1 Oracle Data Mining Programming Interfaces
    1.2.2 ODM Data Mining Functions

2 Data for Oracle Data Mining
  2.1 ODM Data, Cases, and Attributes
  2.2 ODM Data Requirements
    2.2.1 ODM Data Table Format
      2.2.1.1 Single-Record Case Data
      2.2.1.2 Multi-Record Case Data in the Java Interface
      2.2.1.3 Wide Data in DBMS_DATA_MINING
    2.2.2 Column Data Types Supported by ODM
      2.2.2.1 Unstructured Data in ODM
      2.2.2.2 Dates in ODM
    2.2.3 Attribute Type for Oracle Data Mining
      2.2.3.1 Target Attribute
    2.2.4 Data Storage Issues
    2.2.5 Missing Values in ODM
      2.2.5.1 Missing Values and Null Values in ODM
      2.2.5.2 Missing Values Handling
    2.2.6 Sparse Data in Oracle Data Mining
    2.2.7 Outliers and Oracle Data Mining
  2.3 Prepared and Unprepared Data
    2.3.1 Data Preparation for the ODM Java Interface
    2.3.2 Data Preparation for DBMS_DATA_MINING
    2.3.3 Binning (Discretization) in Data Mining
      2.3.3.1 Methods for Computing Bin Boundaries
    2.3.4 Normalization in Oracle Data Mining

3 Predictive Data Mining Models
  3.1 Classification
    3.1.1 Costs
    3.1.2 Priors
    3.1.3 Naive Bayes Algorithm
    3.1.4 Adaptive Bayes Network Algorithm
      3.1.4.1 ABN Model Types
      3.1.4.2 ABN Rules
      3.1.4.3 ABN Build Parameters
      3.1.4.4 ABN Model States
    3.1.5 Comparison of NB and ABN Models
    3.1.6 Support Vector Machine
      3.1.6.1 Data Preparation and Settings Choice for Support Vector Machines
  3.2 Regression
    3.2.1 SVM Algorithm for Regression
  3.3 Attribute Importance
    3.3.1 Minimum Descriptor Length
  3.4 ODM Model Seeker (Java Interface Only)

4 Descriptive Data Mining Models
  4.1 Clustering in Oracle Data Mining
    4.1.1 Enhanced k-Means Algorithm
      4.1.1.1 Data for k-Means
      4.1.1.2 Scalability through Summarization
      4.1.1.3 Scoring (Applying Models)
    4.1.2 Orthogonal Partitioning Clustering (O-Cluster)
      4.1.2.1 O-Cluster Data Use