Bi-Clustering – Algorithms

Total Page:16

File Type:pdf, Size:1020Kb

Bi-Clustering – Algorithms Silesian University of Technology Faculty of Automatic Control, Electronics and Computer Science Institute of Informatics Doctor of Philosophy Dissertation Bi-clustering – algorithms and applications Paweł Foszner Supervisor: prof. dr hab. inż. Andrzej Polański Gliwice, 2014 1 2 To my lovely wife Aleksandra for her full support over those years. 3 4 Table of Contents Acknowledgements ................................................................................................................................... 9 1. Introduction .......................................................................................................................................11 2. Aims .......................................................................................................................................................13 3. Theses ...................................................................................................................................................15 4. Main contribution and original elements of the thesis .................................................16 5. Formulation of main problems .................................................................................................17 5.1. Definition of bi-clusters ......................................................................................................17 5.2. Index functions for evaluating quality of bi-clustering systems .....................22 5.2.1. Mean square residue (MSR) ....................................................................................22 5.2.2. Average Correlation Value (ACV) ..........................................................................22 5.2.3. Average Spearman's rho (ASR) ..............................................................................23 5.3. Stop criteria for bi-clustering algorithms...................................................................25 5.3.1. Mathematical convergence ......................................................................................25 5.3.2. Connectivity matrix .....................................................................................................26 5.3.3. Conditions defined by the user. .............................................................................28 6. An overview of bi-clustering methods ..................................................................................29 6.1. Algorithms based on matrix decomposition .............................................................29 6.1.1. Based on LSE. ..................................................................................................................29 6.1.2. Based on Kullback–Leibler divergence ..............................................................30 6.1.3. Based on non-smooth Kullback–Leibler divergence. ..................................30 6.1.5. FABIA ..................................................................................................................................32 6.2. Algorithms based on bipartite graphs .........................................................................34 6.2.1. QUBIC..................................................................................................................................34 6.3. Algorithms based on Iterative Row and Column search .....................................36 5 6.3.1. Coupled Two-Way Clustering (CTWC)............................................................... 36 6.4. Algorithms based on Divide and Conquer approach ............................................ 37 6.4.1. Block clustering ............................................................................................................. 37 6.5. Algorithms based on Greedy iterative search.......................................................... 38 6.5.1. δ-bi-clusters .................................................................................................................... 38 6.6. Algorithms based on Exhaustive bi-cluster enumeration ................................. 39 6.6.1. Statistical-Algorithmic Method for Bi-cluster Analysis (SAMBA) ......... 39 6.7. Algorithms based on Distribution parameter identification ............................ 40 6.7.1. Plaid Model ...................................................................................................................... 40 7. Comparing the results .................................................................................................................. 41 7.1. Similarity measures .............................................................................................................. 41 7.1.1. Jaccard Index .................................................................................................................. 41 7.1.2. Relevance and recovery ............................................................................................ 42 7.1.3. Consensus score ............................................................................................................ 43 7.2. Hungarian algorithm............................................................................................................ 45 7.3. Generalized Hungarian algorithm ................................................................................. 52 7.3.1. Problem formulation .................................................................................................. 52 7.3.2. Related work .................................................................................................................. 54 7.3.3. Hungarian algorithm .................................................................................................. 54 7.3.4. Two-dimensional approach .................................................................................... 56 7.3.5. Multidimensional approach .................................................................................... 61 7.4. Consensus algorithm ........................................................................................................... 64 8. Graphical presentation of results ........................................................................................... 67 8.1. Presenting bi-clusters ......................................................................................................... 67 8.1.1. BiVoC .................................................................................................................................. 67 8.1.2. BicOverlapper ................................................................................................................ 68 8.1.3. BiCluster Viewer ........................................................................................................... 68 6 8.2. Presenting the results of domain ...................................................................................70 8.2.1. Clusters containing genes .........................................................................................70 8.3. Presenting the results from different experiments. ..............................................71 9. Computational experiments ......................................................................................................72 9.1. Environment for data generation and evaluation ..................................................72 9.1.1. Data ......................................................................................................................................74 9.1.2. Distributed computing ...............................................................................................75 9.1.3. Defining own synthetic matrix ...............................................................................76 9.1.4. Browsing data and results ........................................................................................77 9.1.5. Update functionality ....................................................................................................79 9.1.6. Program availability ....................................................................................................79 9.2. Third-party software ............................................................................................................81 9.3. Data ...............................................................................................................................................82 9.3.1. Synthetic data .................................................................................................................82 9.3.2. Real data ............................................................................................................................83 9.4. Computational results .........................................................................................................85 9.4.1. Synthetic data .................................................................................................................85 9.4.2. Real data ............................................................................................................................86 10. Conclusions and summary .....................................................................................................93 Bibliography ...............................................................................................................................................95 List of Symbols and Abbreviations ............................................................................................... 101 Table
Recommended publications
  • Ordinary Differential Equations (ODE) Using Euler's Technique And
    Mathematical Models and Methods in Modern Science Ordinary Differential Equations (ODE) using Euler’s Technique and SCILAB Programming ZULZAMRI SALLEH Applied Sciences and Advance Technology Universiti Kuala Lumpur Malaysian Institute of Marine Engineering Technology Bandar Teknologi Maritim Jalan Pantai Remis 32200 Lumut, Perak MALAYSIA [email protected] http://www.mimet.edu.my Abstract: - Mathematics is very important for the engineering and scientist but to make understand the mathematics is very difficult if without proper tools and suitable measurement. A numerical method is one of the algorithms which involved with computer programming. In this paper, Scilab is used to carter the problems related the mathematical models such as Matrices, operation with ODE’s and solving the Integration. It remains true that solutions of the vast majority of first order initial value problems cannot be found by analytical means. Therefore, it is important to be able to approach the problem in other ways. Today there are numerous methods that produce numerical approximations to solutions of differential equations. Here, we introduce the oldest and simplest such method, originated by Euler about 1768. It is called the tangent line method or the Euler method. It uses a fixed step size h and generates the approximate solution. The purpose of this paper is to show the details of implementing of Euler’s method and made comparison between modify Euler’s and exact value by integration solution, as well as solve the ODE’s use built-in functions available in Scilab programming. Key-Words: - Numerical Methods, Scilab programming, Euler methods, Ordinary Differential Equation. 1 Introduction availability of digital computers has led to a Numerical methods are techniques by which veritable explosion in the use and development of mathematical problems are formulated so that they numerical methods [1].
    [Show full text]
  • Numerical Methods for Ordinary Differential Equations
    CHAPTER 1 Numerical Methods for Ordinary Differential Equations In this chapter we discuss numerical method for ODE . We will discuss the two basic methods, Euler’s Method and Runge-Kutta Method. 1. Numerical Algorithm and Programming in Mathcad 1.1. Numerical Algorithm. If you look at dictionary, you will the following definition for algorithm, 1. a set of rules for solving a problem in a finite number of steps; 2. a sequence of steps designed for programming a computer to solve a specific problem. A numerical algorithm is a set of rules for solving a problem in finite number of steps that can be easily implemented in computer using any programming language. The following is an algorithm for compute the root of f(x) = 0; Input f, a, N and tol . Output: the approximate solution to f(x) = 0 with initial guess a or failure message. ² Step One: Set x = a ² Step Two: For i=0 to N do Step Three - Four f(x) Step Three: Compute x = x ¡ f 0(x) Step Four: If f(x) · tol return x ² Step Five return ”failure”. In analogy, a numerical algorithm is like a cook recipe that specify the input — cooking material, the output—the cooking product, and steps of carrying computation — cooking steps. In an algorithm, you will see loops (for, while), decision making statements(if, then, else(otherwise)) and return statements. ² for loop: Typically used when specific number of steps need to be carried out. You can break a for loop with return or break statement. 1 2 1. NUMERICAL METHODS FOR ORDINARY DIFFERENTIAL EQUATIONS ² while loop: Typically used when unknown number of steps need to be carried out.
    [Show full text]
  • Getting Started with Euler
    Getting started with Euler Samuel Fux High Performance Computing Group, Scientific IT Services, ETH Zurich ETH Zürich | Scientific IT Services | HPC Group Samuel Fux | 19.02.2020 | 1 Outlook . Introduction . Accessing the cluster . Data management . Environment/LMOD modules . Using the batch system . Applications . Getting help ETH Zürich | Scientific IT Services | HPC Group Samuel Fux | 19.02.2020 | 2 Outlook . Introduction . Accessing the cluster . Data management . Environment/LMOD modules . Using the batch system . Applications . Getting help ETH Zürich | Scientific IT Services | HPC Group Samuel Fux | 19.02.2020 | 3 Intro > What is EULER? . EULER stands for . Erweiterbarer, Umweltfreundlicher, Leistungsfähiger ETH Rechner . It is the 5th central (shared) cluster of ETH . 1999–2007 Asgard ➔ decommissioned . 2004–2008 Hreidar ➔ integrated into Brutus . 2005–2008 Gonzales ➔ integrated into Brutus . 2007–2016 Brutus . 2014–2020+ Euler . It benefits from the 15 years of experience gained with those previous large clusters ETH Zürich | Scientific IT Services | HPC Group Samuel Fux | 19.02.2020 | 4 Intro > Shareholder model . Like its predecessors, Euler has been financed (for the most part) by its users . So far, over 100 (!) research groups from almost all departments of ETH have invested in Euler . These so-called “shareholders” receive a share of the cluster’s resources (processors, memory, storage) proportional to their investment . The small share of Euler financed by IT Services is open to all members of ETH . The only requirement
    [Show full text]
  • Freemat V3.6 Documentation
    FreeMat v3.6 Documentation Samit Basu November 16, 2008 2 Contents 1 Introduction and Getting Started 5 1.1 INSTALL Installing FreeMat . 5 1.1.1 General Instructions . 5 1.1.2 Linux . 5 1.1.3 Windows . 6 1.1.4 Mac OS X . 6 1.1.5 Source Code . 6 2 Variables and Arrays 7 2.1 CELL Cell Array Definitions . 7 2.1.1 Usage . 7 2.1.2 Examples . 7 2.2 Function Handles . 8 2.2.1 Usage . 8 2.3 GLOBAL Global Variables . 8 2.3.1 Usage . 8 2.3.2 Example . 9 2.4 INDEXING Indexing Expressions . 9 2.4.1 Usage . 9 2.4.2 Array Indexing . 9 2.4.3 Cell Indexing . 13 2.4.4 Structure Indexing . 14 2.4.5 Complex Indexing . 16 2.5 MATRIX Matrix Definitions . 17 2.5.1 Usage . 17 2.5.2 Examples . 17 2.6 PERSISTENT Persistent Variables . 19 2.6.1 Usage . 19 2.6.2 Example . 19 2.7 STRING String Arrays . 20 2.7.1 Usage . 20 2.8 STRUCT Structure Array Constructor . 22 2.8.1 Usage . 22 2.8.2 Example . 22 3 4 CONTENTS 3 Functions and Scripts 25 3.1 ANONYMOUS Anonymous Functions . 25 3.1.1 Usage . 25 3.1.2 Examples . 25 3.2 FUNCTION Function Declarations . 26 3.2.1 Usage . 26 3.2.2 Examples . 28 3.3 KEYWORDS Function Keywords . 30 3.3.1 Usage . 30 3.3.2 Example . 31 3.4 NARGIN Number of Input Arguments . 32 3.4.1 Usage .
    [Show full text]
  • Towards a Fully Automated Extraction and Interpretation of Tabular Data Using Machine Learning
    UPTEC F 19050 Examensarbete 30 hp August 2019 Towards a fully automated extraction and interpretation of tabular data using machine learning Per Hedbrant Per Hedbrant Master Thesis in Engineering Physics Department of Engineering Sciences Uppsala University Sweden Abstract Towards a fully automated extraction and interpretation of tabular data using machine learning Per Hedbrant Teknisk- naturvetenskaplig fakultet UTH-enheten Motivation A challenge for researchers at CBCS is the ability to efficiently manage the Besöksadress: different data formats that frequently are changed. Significant amount of time is Ångströmlaboratoriet Lägerhyddsvägen 1 spent on manual pre-processing, converting from one format to another. There are Hus 4, Plan 0 currently no solutions that uses pattern recognition to locate and automatically recognise data structures in a spreadsheet. Postadress: Box 536 751 21 Uppsala Problem Definition The desired solution is to build a self-learning Software as-a-Service (SaaS) for Telefon: automated recognition and loading of data stored in arbitrary formats. The aim of 018 – 471 30 03 this study is three-folded: A) Investigate if unsupervised machine learning Telefax: methods can be used to label different types of cells in spreadsheets. B) 018 – 471 30 00 Investigate if a hypothesis-generating algorithm can be used to label different types of cells in spreadsheets. C) Advise on choices of architecture and Hemsida: technologies for the SaaS solution. http://www.teknat.uu.se/student Method A pre-processing framework is built that can read and pre-process any type of spreadsheet into a feature matrix. Different datasets are read and clustered. An investigation on the usefulness of reducing the dimensionality is also done.
    [Show full text]
  • A Case Study in Fitting Area-Proportional Euler Diagrams with Ellipses Using Eulerr
    A Case Study in Fitting Area-Proportional Euler Diagrams with Ellipses using eulerr Johan Larsson and Peter Gustafsson Department of Statistics, School of Economics and Management, Lund University, Lund, Sweden [email protected] [email protected] Abstract. Euler diagrams are common and user-friendly visualizations for set relationships. Most Euler diagrams use circles, but circles do not always yield accurate diagrams. A promising alternative is ellipses, which, in theory, enable accurate diagrams for a wider range of input. Elliptical diagrams, however, have not yet been implemented for more than three sets or three-set diagrams where there are disjoint or subset relationships. The aim of this paper is to present eulerr: a software package for elliptical Euler diagrams for, in theory, any number of sets. It fits Euler diagrams using numerical optimization and exact-area algorithms through a two-step procedure, first generating an initial layout using pairwise relationships and then finalizing this layout using all set relationships. 1 Background The Euler diagram, first described by Leonard Euler in 1802 [1], is a generalization of the popular Venn diagram. Venn and Euler diagrams both visualize set relationships by mapping areas in the diagram to relationships in the data. They differ, however, in that Venn diagrams require all intersections to be present| even if they are empty|whilst Euler diagrams do not, which means that Euler diagrams lend themselves well to be area-proportional. Euler diagrams may be fashioned out of any closed shape, and have been implemented for triangles [2], rectangles [2], ellipses [3], smooth curves [4], poly- gons [2], and circles [5, 2].
    [Show full text]
  • The Python Interpreter
    The Python interpreter Remi Lehe Lawrence Berkeley National Laboratory (LBNL) US Particle Accelerator School (USPAS) Summer Session Self-Consistent Simulations of Beam and Plasma Systems S. M. Lund, J.-L. Vay, R. Lehe & D. Winklehner Colorado State U, Ft. Collins, CO, 13-17 June, 2016 Python interpreter: Outline 1 Overview of the Python language 2 Python, numpy and matplotlib 3 Reusing code: functions, modules, classes 4 Faster computation: Forthon Overview Scientific Python Reusing code Forthon Overview of the Python programming language Interpreted language (i.e. not compiled) ! Interactive, but not optimal for computational speed Readable and non-verbose No need to declare variables Indentation is enforced Free and open-source + Large community of open-souce packages Well adapted for scientific and data analysis applications Many excellent packages, esp. numerical computation (numpy), scientific applications (scipy), plotting (matplotlib), data analysis (pandas, scikit-learn) 3 Overview Scientific Python Reusing code Forthon Interfaces to the Python language Scripting Interactive shell Code written in a file, with a Obtained by typing python or text editor (gedit, vi, emacs) (better) ipython Execution via command line Commands are typed in and (python + filename) executed one by one Convenient for exploratory work, Convenient for long-term code debugging, rapid feedback, etc... 4 Overview Scientific Python Reusing code Forthon Interfaces to the Python language IPython (a.k.a Jupyter) notebook Notebook interface, similar to Mathematica. Intermediate between scripting and interactive shell, through reusable cells Obtained by typing jupyter notebook, opens in your web browser Convenient for exploratory work, scientific analysis and reports 5 Overview Scientific Python Reusing code Forthon Overview of the Python language This lecture Reminder of the main points of the Scipy lecture notes through an example problem.
    [Show full text]
  • Advanced Data Mining with Weka (Class 5)
    Advanced Data Mining with Weka Class 5 – Lesson 1 Invoking Python from Weka Peter Reutemann Department of Computer Science University of Waikato New Zealand weka.waikato.ac.nz Lesson 5.1: Invoking Python from Weka Class 1 Time series forecasting Lesson 5.1 Invoking Python from Weka Class 2 Data stream mining Lesson 5.2 Building models in Weka and MOA Lesson 5.3 Visualization Class 3 Interfacing to R and other data mining packages Lesson 5.4 Invoking Weka from Python Class 4 Distributed processing with Apache Spark Lesson 5.5 A challenge, and some Groovy Class 5 Scripting Weka in Python Lesson 5.6 Course summary Lesson 5.1: Invoking Python from Weka Scripting Pros script captures preprocessing, modeling, evaluation, etc. write script once, run multiple times easy to create variants to test theories no compilation involved like with Java Cons programming involved need to familiarize yourself with APIs of libraries writing code is slower than clicking in the GUI Invoking Python from Weka Scripting languages Jython - https://docs.python.org/2/tutorial/ - pure-Java implementation of Python 2.7 - runs in JVM - access to all Java libraries on CLASSPATH - only pure-Python libraries can be used Python - invoking Weka from Python 2.7 - access to full Python library ecosystem Groovy (briefly) - http://www.groovy-lang.org/documentation.html - Java-like syntax - runs in JVM - access to all Java libraries on CLASSPATH Invoking Python from Weka Java vs Python Java Output public class Blah { 1: Hello WekaMOOC! public static void main(String[]
    [Show full text]
  • Introduction to Weka
    Introduction to Weka Overview What is Weka? Where to find Weka? Command Line Vs GUI Datasets in Weka ARFF Files Classifiers in Weka Filters What is Weka? Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes. Where to find Weka Weka website (Latest version 3.6): – http://www.cs.waikato.ac.nz/ml/weka/ Weka Manual: − http://transact.dl.sourceforge.net/sourcefor ge/weka/WekaManual-3.6.0.pdf CLI Vs GUI Recommended for in-depth usage Explorer Offers some functionality not Experimenter available via the GUI Knowledge Flow Datasets in Weka Each entry in a dataset is an instance of the java class: − weka.core.Instance Each instance consists of a number of attributes Attributes Nominal: one of a predefined list of values − e.g. red, green, blue Numeric: A real or integer number String: Enclosed in “double quotes” Date Relational ARFF Files The external representation of an Instances class Consists of: − A header: Describes the attribute types − Data section: Comma separated list of data ARFF File Example Dataset name Comment Attributes Target / Class variable Data Values Assignment ARFF Files Credit-g Heart-c Hepatitis Vowel Zoo http://www.cs.auckland.ac.nz/~pat/weka/ ARFF Files Basic statistics and validation by running: − java weka.core.Instances data/soybean.arff Classifiers in Weka Learning algorithms in Weka are derived from the abstract class: − weka.classifiers.Classifier Simple classifier: ZeroR − Just determines the most common class − Or the median (in the case of numeric values) − Tests how well the class can be predicted without considering other attributes − Can be used as a Lower Bound on Performance.
    [Show full text]
  • Mathematica Document
    Mathematica Project: Exploratory Data Analysis on ‘Data Scientists’ A big picture view of the state of data scientists and machine learning engineers. ����� ���� ��������� ��� ������ ���� ������ ������ ���� ������/ ������ � ���������� ���� ��� ������ ��� ���������������� �������� ������/ ����� ��������� ��� ���� ���������������� ����� ��������������� ��������� � ������(�������� ���������� ���������) ������ ��������� ����� ������� �������� ����� ������� ��� ������ ����������(���� �������) ��������� ����� ���� ������ ����� (���������� �������) ����������(���������� ������� ���������� ��� ���� ���� �����/ ��� �������������� � ����� ���� �� �������� � ��� ����/���������� ��������������� ������� ������������� ��� ���������� ����� �����(���� �������) ����������� ����� / ����� ��� ������ ��������������� ���������� ����������/�++ ������/������������/����/������ ���� ������� ����� ������� ������� ����������������� ������� ������� ����/����/�������/��/��� ����������(�����/����-�������� ��������) ������������ In this Mathematica project, we will explore the capabilities of Mathematica to better understand the state of data science enthusiasts. The dataset consisting of more than 10,000 rows is obtained from Kaggle, which is a result of ‘Kaggle Survey 2017’. We will explore various capabilities of Mathematica in Data Analysis and Data Visualizations. Further, we will utilize Machine Learning techniques to train models and Classify features with several algorithms, such as Nearest Neighbors, Random Forest. Dataset : https : // www.kaggle.com/kaggle/kaggle
    [Show full text]
  • Freemat V4.0 Documentation
    FreeMat v4.0 Documentation Samit Basu October 4, 2009 2 Contents 1 Introduction and Getting Started 39 1.1 INSTALL Installing FreeMat . 39 1.1.1 General Instructions . 39 1.1.2 Linux . 39 1.1.3 Windows . 40 1.1.4 Mac OS X . 40 1.1.5 Source Code . 40 2 Variables and Arrays 41 2.1 CELL Cell Array Definitions . 41 2.1.1 Usage . 41 2.1.2 Examples . 41 2.2 Function Handles . 42 2.2.1 Usage . 42 2.3 GLOBAL Global Variables . 42 2.3.1 Usage . 42 2.3.2 Example . 42 2.4 INDEXING Indexing Expressions . 43 2.4.1 Usage . 43 2.4.2 Array Indexing . 43 2.4.3 Cell Indexing . 46 2.4.4 Structure Indexing . 47 2.4.5 Complex Indexing . 49 2.5 MATRIX Matrix Definitions . 49 2.5.1 Usage . 49 2.5.2 Examples . 49 2.6 PERSISTENT Persistent Variables . 51 2.6.1 Usage . 51 2.6.2 Example . 51 2.7 STRUCT Structure Array Constructor . 51 2.7.1 Usage . 51 2.7.2 Example . 52 3 4 CONTENTS 3 Functions and Scripts 55 3.1 ANONYMOUS Anonymous Functions . 55 3.1.1 Usage . 55 3.1.2 Examples . 55 3.2 FUNCTION Function Declarations . 56 3.2.1 Usage . 56 3.2.2 Examples . 57 3.3 KEYWORDS Function Keywords . 60 3.3.1 Usage . 60 3.3.2 Example . 60 3.4 NARGIN Number of Input Arguments . 61 3.4.1 Usage . 61 3.4.2 Example . 61 3.5 NARGOUT Number of Output Arguments .
    [Show full text]
  • WEKA: the Waikato Environment for Knowledge Analysis
    WEKA: The Waikato Environment for Knowledge Analysis Stephen R. Garner Department of Computer Science, University of Waikato, Hamilton. ABSTRACT WEKA is a workbench designed to aid in the application of machine learning technology to real world data sets, in particular, data sets from New Zealand’s agricultural sector. In order to do this a range of machine learning techniques are presented to the user in such a way as to hide the idiosyncrasies of input and output formats, as well as allow an exploratory approach in applying the technology. The system presented is a component based one that also has application in machine learning research and education. 1. Introduction The WEKA machine learning workbench has grown out of the need to be able to apply machine learning to real world data sets in a way that promotes a “what if?…” or exploratory approach. Each machine learning algorithm implementation requires the data to be present in its own format, and has its own way of specifying parameters and output. The WEKA system was designed to bring a range of machine learning techniques or schemes under a common interface so that they may be easily applied to this data in a consistent method. This interface should be flexible enough to encourage the addition of new schemes, and simple enough that users need only concern themselves with the selection of features in the data for analysis and what the output means, rather than how to use a machine learning scheme. The WEKA system has been used over the past year to work with a variety of agricultural data sets in order to try to answer a range of questions.
    [Show full text]