Mining Algorithms for Generic and Biological Data
MINING ALGORITHMS FOR GENERIC AND BIOLOGICAL DATA

By JUN LUO

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA
2002

Copyright 2002 by Jun Luo

To my parents and lovely wife

ACKNOWLEDGMENTS

I would like to express my gratitude to my advisor, Professor Sanguthevar Rajasekaran, for giving me the opportunity to work with him in the Computer and Information Science and Engineering Department. His excellent advice and helpful guidance were critical to my finishing this thesis smoothly and successfully.

My full appreciation goes to Dr. Sartaj K. Sahni. I admire Dr. Sahni’s excellent achievements and respected character. It is my honor to have Dr. Sahni on my committee.

I want to thank Dr. Douglas D. Dankel II, who never hesitated to give me advice and help whenever I was in difficulty. I also want to thank Dr. Meera Sitharam and Dr. Joseph N. Wilson for their reviews of my dissertation. Their insightful suggestions gave me a great deal of inspiration.

My thanks also go to Dr. Tim Davis, Dr. Loc Vu-Quoc, Mr. John Bowers, Mrs. Ardiniece Y. Caudle, and others. I also highly appreciate Dr. Reed Ellis at the Editorial Office in the Graduate School for taking the time to review my dissertation carefully and patiently.

Last but not least, I extend my utmost gratitude to my wife, Yanjin E, my mother, Yaqin Wang, and my father, Zhensheng Luo, for their enduring support.

TABLE OF CONTENTS

MINING ALGORITHMS FOR GENERIC AND BIOLOGICAL DATA
ACKNOWLEDGMENTS
ABSTRACT

1 INTRODUCTION
  1.1 What is Data Mining?
  1.2 Primary Tasks of Data Mining
    1.2.1 Mining Association Rules
    1.2.2 Classification
    1.2.3 Clustering
    1.2.4 Incremental Mining
    1.2.5 Sequential Patterns
    1.2.6 Regression
    1.2.7 Web Mining
    1.2.7 Text Mining
    1.2.8 Biology Mining
  1.3 Organization of the Dissertation

2 MINING ASSOCIATION RULES
  2.1 Problem Definitions
  2.2 Mining Association Rules
  2.3 Examples of Association Rule Mining
  2.4 A Survey of Algorithms for Association Rule Mining
    2.4.1 Apriori, Apriori-like Algorithms
    2.4.2 Partition and Sampling
    2.4.3 FP-Growth
    2.4.4 Mining Multi-level Association Rules
    2.4.5 Mining Quantitative Association Rules
    2.4.6 Parallel Algorithms

3 INTERSECTING ATTRIBUTE LISTS USING A HASH TABLE
  3.1 IT Algorithm
    3.1.1 Basic Idea
    3.1.2 INSERT Algorithm
    3.1.3 Performance Analysis
  3.2 Dynamic Rename Algorithm (DRA)
  3.3 Optimization Methods
    3.3.1 Reorder Frequent Itemset (RFI)
    3.3.2 Similarity Detection (SD)
    3.3.3 Early Stop Detection (ESD)

4 SUPER FAST ALGORITHMS FOR DISCOVERING FREQUENT ITEM-SETS
  4.1 FIT Algorithm
    4.1.1 Basic Idea
    4.1.2 FIT Algorithm
    4.1.3 Performance Analysis
    4.1.4 Implementation Issues
  4.2 SFIT Algorithm
    4.2.1 SFIT Algorithm
    4.2.2 Performance Analysis
    4.2.3 Implementation Issues

5 PARALLEL ALGORITHMS FOR MINING ASSOCIATION RULES
  5.1 Basic Idea
  5.2 Database Division
    5.2.1 Horizontal Division
    5.2.2 Randomized Horizontal Division
  5.3 Parallel Algorithms
    5.3.1 Multithread Algorithms in the SMP Architecture
    5.3.2 Parallel Algorithms in Distributed Computer System

6 (1+ε)-PASS ALGORITHM
  6.2 Problem Description and Related Work
  6.3 (1+ε)-Pass Algorithm
    6.3.1 Basic Idea
    6.3.2 Description of (1+ε)-Pass Algorithm
    6.3.3 Implementation Issues

7 EFFICIENT ALGORITHMS FOR SIMILARITY SEARCH
  7.1 Abstract
  7.2 Introduction
  7.3 Sorting Based Algorithm (SBA)
  7.4 GST Based Algorithm (GSTBA)
    7.4.1 Color Set Size (CSS) Problem
    7.4.2 The GSTBA Algorithm
  7.5 An Experimental Comparison of SBA and GSTBA
  7.6 Conclusions

8 EXPERIMENTAL RESULTS
  8.1. Comparison