Mining Algorithms for Generic and Biological Data

MINING ALGORITHMS FOR GENERIC AND BIOLOGICAL DATA By JUN LUO A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY UNIVERSITY OF FLORIDA 2002 Copyright 2002 by Jun Luo To my parents and lovely wife ACKNOWLEDGMENTS I would like to express my gratitude toward my advisor, Professor Sanguthevar Rajasekaran, for giving me the opportunity to work with him in the Computer and Information Science and Engineering Department. His excellent advice and greatly helpful guidance were critical for me to finish my thesis smoothly and successfully. My full appreciation goes to Dr. Sartaj K. Sahni. I admire Dr. Sahni’s excellent achievements and respected personalities. It is my honor to have Dr. Sahni in my committee. I want to thank Dr. Douglas D. Dankel II who never hesitated to give me the advice and help whenever I was in difficulties. I also want to thank Dr. Meera Sitharam and Dr. Joseph N. Wilson for their reviews of my dissertation. Their insightful suggestions gave me a lot of inspirations. My thanks also go to Dr. Tim Davis, Dr. Loc Vu-Quoc, Mr. John Bowers, and Mrs. Ardiniece Y. Caudle, etc. Also, I highly appreciate Dr. Reed Ellis at the Editorial Office in the Graduate School for sparing time reviewing my dissertation carefully and patiently. Last but not least, I extend my utmost gratitude to my wife, Yanjin E, my mother, Yaqin Wang, and my father, Zhensheng Luo, for their enduring support. iv TABLE OF CONTENTS page MINING ALGORITHMS FOR GENERIC AND BIOLOGICAL DATA......................... I ACKNOWLEDGMENTS ................................................................................................ IV ABSTRACT....................................................................................................................VIII 1 INTRODUCTION ............................................................................................................1 1.1 What is Data Mining?............................................................................................... 2 1.2 Primary Tasks of Data Mining.................................................................................. 3 1.2.1 Mining Association Rules............................................................................... 3 1.2.2 Classification................................................................................................... 4 1.2.3 Clustering........................................................................................................ 5 1.2.4 Incremental Mining......................................................................................... 6 1.2.5 Sequential Patterns.......................................................................................... 6 1.2.6 Regression....................................................................................................... 7 1.2.7 Web Mining .................................................................................................... 7 1.2.7 Text Mining..................................................................................................... 8 1.2.8 Biology Mining............................................................................................... 9 1.3 Organization of the Dissertation............................................................................... 9 2 MINING ASSOCIATION RULES ................................................................................11 2.1 Problem Definitions................................................................................................ 11 2.2 Mining Association Rules....................................................................................... 12 2.3 Examples of Association Rule Mining ................................................................... 13 2.4 A Survey of Algorithms for Association Rule Mining........................................... 15 2.4.1 Apriori, Apriori-like Algorithms................................................................... 16 2.4.2 Partition and Sampling.................................................................................. 19 2.4.3 FP-Growth..................................................................................................... 20 2.4.4 Mining Multi-level Associaiton Rules.......................................................... 21 2.4.5 Mining Quantitative Associaiton Rules ........................................................ 21 2.4.6 Parallel Algorithms ....................................................................................... 21 3 INTERSECING ATTRIBUTE LISTS USING A HASH TABLE.................................23 3.1 IT Algorithm........................................................................................................... 23 3.1.1 Basic Idea...................................................................................................... 23 3.1.2 INSERT Algorithm....................................................................................... 28 v 3.1.3 Performance Analysis ................................................................................... 30 3.2 Dynamic Rename Algorithm (DRA)...................................................................... 32 3.3 Optimization Methods ............................................................................................ 33 3.3.1 Reorder Frequent Itemset (RFI).................................................................... 33 3.3.2 Similarity Detection (SD) ............................................................................. 37 3.3.3 Early Stop Detection (ESD).......................................................................... 37 4 SUPER FAST ALGORITHMS FOR DISCOVERING FREQUENT ITEM-SETS......38 4.1 FIT Algorithm......................................................................................................... 39 4.1.1 Basic Idea...................................................................................................... 39 4.1.2 FIT Algorithm............................................................................................... 42 4.1.3 Performance Analysis ................................................................................... 44 4.1.4 Implementation Issues................................................................................... 54 4.2 SFIT Algorithm....................................................................................................... 57 4.2.1 SFIT Algorithm............................................................................................. 58 4.2.2 Performance Analysis ................................................................................... 59 4.2.3 Implementation Issues................................................................................... 64 5 PARALLEL ALGORITHMS FOR MINING ASSOCIATION RULES.......................65 5.1 Basic Idea................................................................................................................ 65 5.2. Database Division.................................................................................................. 68 5.2.1 Horizontal Division....................................................................................... 68 5.2.2 Randomized Horizontal Division.................................................................. 69 5.3 Parallel Algorithms................................................................................................. 70 5.3.1 Multithread Algorithms in the SMP Architecture......................................... 71 5.3.2 Parallel Algorithms in Distributed Computer System .................................. 75 6 (1+ε)-PASS ALGORITHM............................................................................................78 6.2 Problem Description and Related Work ................................................................. 79 6.3 (1+ε)-Pass Algorithm.............................................................................................. 82 6.3.1 Basic Idea...................................................................................................... 82 6.3.2 Description of (1+ε)-Pass Algorithm............................................................ 87 6.3.3 Implementation Issues................................................................................... 90 7 EFFICIENT ALGORITHMS FOR SIMILARITY SEARCH........................................92 7.1 Abstract................................................................................................................... 92 7.2 Introduction............................................................................................................. 92 7.3 Sorting Based Algorithm (SBA)............................................................................. 93 7.4 GST Based Algorithm (GSTBA)............................................................................ 94 7.4.1 Color Set Size (CSS) Problem ...................................................................... 96 7.4.2 The GSTBA Algorithm................................................................................. 98 7.5 An Experimental Comparison of SBA and GSTBA............................................... 99 7.6 Conclusions........................................................................................................... 100 vi 8 EXPERIMENTAL RESULTS......................................................................................103 8.1. Comparison

Mining Algorithms for Generic and Biological Data

A Case for High Performance Computing with Virtual Machines

A Light-Weight Virtual Machine Monitor for Blue Gene/P

Opportunities for Leveraging OS Virtualization in High-End Supercomputing

A Virtual Machine Environment for Real Time Systems Laboratories

GT-Virtual-Machines

Virtualizationoverview

Virtual Machine Benchmarking Kim-Thomas M¨Oller Diploma Thesis

Distributed Virtual Machines: a System Architecture for Network Computing

Enabling Technologies for Distributed and Cloud Computing Dr

Virtual Machines Aaron Sloman ([email protected]) School of Computer Science, University of Birmingham, Birmingham, B15 2TT, UK

Metaprogramming: Interpretaaon Interpreters

Middleware for a CLEVER Use of Virtual Resources in Federated Clouds