C 2007 by Xiaoxin Yin. All Rights Reserved. SCALABLE MINING and LINK ANALYSIS ACROSS MULTIPLE DATABASE RELATIONS
Total Page:16
File Type:pdf, Size:1020Kb
c 2007 by Xiaoxin Yin. All rights reserved. SCALABLE MINING AND LINK ANALYSIS ACROSS MULTIPLE DATABASE RELATIONS BY XIAOXIN YIN B.E., Tsinghua University, 2001 M.S., University of Illinois at Urbana-Champaign, 2003 DISSERTATION Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science in the Graduate College of the University of Illinois at Urbana-Champaign, 2007 Urbana, Illinois Abstract Relational databases are the most popular repository for structured data, and are thus one of the richest sources of knowledge in the world. In a relational database, multiple relations are linked together via entity- relationship links. Unfortunately, most existing data mining approaches can only handle data stored in single tables, and cannot be applied to relational databases. Therefore, it is an urgent task to design data mining approaches that can discover knowledge from multi-relational data. In this thesis we study three most important data mining tasks in multi-relational environments: clas- sification, clustering, and duplicate detection. Since information is widely spread across multiple relations, the most crucial and common challenge in multi-relational data mining is how to utilize the relational in- formation linked with each object. We rely on two types of information, — neighbor tuples and linkages between objects, to analyze the properties of objects and relationships among them. Because of the complexity of multi-relational data, efficiency and scalability are two major concerns in multi-relational data mining. In this thesis we propose scalable and accurate approaches for each data mining task studied. In order to achieve high efficiency and scalability, the approaches utilize novel techniques for virtually joining different relations, single-scan algorithms, and multi-resolutional data structures to dramatically reduce computational costs. Our experiments show that our approaches are highly efficient and scalable, and also achieve high accuracies in multi-relational data mining. iii To my dear wife Wen, for her love, encouragement, and support. iv Acknowledgments It is amazing to look back and see how things have changed in the past several years. A new graduate student with very little research experiences, as I was five years ago, has become a Ph.D. candidate capable of performing research independently on newly emerging data mining issues. I wish to express my deepest gratitude to my advisor Dr. Jiawei Han, who made these changes possible. As an advisor, he has given me endless encouragement and support by sharing his knowledge and experiences. More importantly, he taught me to choose worthwhile topics and think deep, from which I will benefit in the rest of my career. I am very thankful to Dr. Philip S. Yu, for the numerous insightful advices he gave me in our many discussions. He has provided great help to improve the quality of my research and this thesis. I am also grateful to Dr. Jiong Yang for his contribution to various projects I have participated. I greatly enjoyed his ideas and his enthusiasm. Special thanks to Dr. Kevin Chang, Dr. Marianne Winslett, Dr. Chengxiang Zhai, and Dr. Anhai Doan for their invaluable comments and suggestions to my research, which helped to keep my research work on the right track. I would also like to express my appreciation to my friends who have contributed suggestions and feedback to my work, even if they were not officially affiliated with it: Xifeng Yan, Hwanjo Yu, Xiaolei Li, Yifan Li, Dong Xin, Zheng Shao, Dr. Hongyan Liu, Dr. Jianlin Feng, Hong Cheng, Hector Gonzalez, Deng Cai, Chao Liu. I learned a great deal from discussing with them. v Table of Contents ListofTables.............................................. ix ListofFigures.............................................. x Chapter1 Introduction....................................... 1 Chapter 2 A Survey on Multi-relational Data Mining . .................. 7 2.1 Approaches for Multi-relational Data Mining ........................... 7 2.1.1 Inductive Logic Programming . ............................. 7 2.1.2 Association Rule Mining ................................... 8 2.1.3 Feature-based Approaches .................................. 8 2.1.4 Linkage-based Approaches .................................. 8 2.1.5 Probabilistic Approaches .................................. 9 2.2 Applications of Multi-relational Data Mining ........................... 9 2.2.1 Financial Applications .................................... 9 2.2.2 Computational Biology ................................... 10 2.2.3 Recommendation Systems .................................. 10 2.3 Summary ............................................... 10 Chapter3 EfficientMulti-relationalClassification....................... 11 3.1Overview............................................... 11 3.2 Problem Definitions ......................................... 13 3.2.1 Predicates ........................................... 13 3.2.2 Rules ............................................. 14 3.2.3 Decision Trees ........................................ 15 3.3 Tuple ID Propagation ........................................ 16 3.3.1 Searching for Predicates by Joins . ............................. 16 3.3.2 Tuple ID Propagation .................................... 18 3.3.3 Analysis and Constraints .................................. 19 3.4BuildingClassifiers..........................................20 3.4.1 The CrossMine-Tree Algorithm . ............................. 21 3.4.2 The CrossMine-Rule Algorithm . ............................. 21 3.4.3 Finding Best Predicate or Attribute ............................ 22 3.4.4 Predicting Class Labels ................................... 25 3.4.5 Tuple Sampling ........................................ 25 3.5ExperimentalResults......................................... 26 3.5.1 Synthetic Databases ..................................... 27 3.5.2 Real Databases ........................................ 30 3.6 Related Work ............................................. 31 3.7Discussions.............................................. 32 3.8 Summary ............................................... 33 vi Chapter4 Multi-relationalClusteringwithUserGuidance.................. 34 4.1Overview............................................... 34 4.2 Related Work ............................................. 38 4.3 Problem Definitions ......................................... 39 4.3.1 Multi-Relational Features .................................. 40 4.3.2 Tuple Similarity ....................................... 41 4.3.3 Similarity Vectors ...................................... 42 4.4FindingPertinentFeatures..................................... 42 4.4.1 Similarity between Features ................................. 42 4.4.2 Efficient Computation of Feature Similarity ........................ 44 4.4.3 Efficient Generation of Feature Values ........................... 48 4.4.4 Searching for Pertinent Features . ............................. 48 4.5ClusteringTuples........................................... 50 4.5.1 Similarity Measure ...................................... 51 4.5.2 Clustering Method 1: CLARANS . ............................. 51 4.5.3 Clustering Method 2: K-Means Clustering ......................... 51 4.5.4 Clustering Method 3: Agglomerative Hierarchical Clustering .............. 52 4.6ExperimentalResults......................................... 53 4.6.1 Measure of Clustering Accuracy . ............................. 53 4.6.2 Clustering Effectiveness ................................... 54 4.6.3 Scalability Tests ....................................... 59 4.7 Summary ............................................... 60 Chapter5 Multi-relationalDuplicateDetection........................ 62 5.1Overview............................................... 62 5.2 Problem Definitions ......................................... 65 5.3 Closed Loop Property ........................................ 66 5.4 Duplicate Detection Algorithm ................................... 67 5.4.1 Path-Based Random Walk Model . ............................. 68 5.4.2 Computing Random Walk Probabilities .......................... 69 5.4.3 Handling Shared Attribute Values ............................. 73 5.4.4 Building Model of Path-based Random Walk ....................... 73 5.5EmpiricalStudy........................................... 75 5.5.1 CS Dept Dataset ....................................... 76 5.5.2 DBLP Dataset ........................................ 78 5.5.3 Detecting Real Duplicates in DBLP ............................ 79 5.5.4 Experiments on Scalability ................................. 81 5.6 Related Work ............................................. 82 5.7 Summary ............................................... 84 Chapter6 ObjectDistinctionInRelationalDatabases.................... 85 6.1 Introduction .............................................. 85 6.2 Similarity Between References .................................... 88 6.2.1 Neighbor Tuples ....................................... 89 6.2.2 Connection Strength ..................................... 90 6.2.3 Set Resemblance of Neighbor Tuples ............................ 91 6.2.4 Random Walk Probability .................................. 92 6.3 Supervised Learning with Automatically Constructed Training Set ............... 93 6.4 Clustering References ........................................ 95 6.4.1 Clustering Strategy ...................................... 95 6.4.2 Efficient Computation ...................................